Best Practices For Building Talent In Analytics

Companies across all industries depend more and more on analytics and insights to run their businesses profitably. But, attracting, managing and retaining talented personnel to execute on those strategies remains a challenge. This is not the case for consumer products heavyweight The Procter & Gamble Company (P&G), which has been at the top of its analytics game for 50 years now.

During the 2014 Retail/Consumer Goods Analytics Summit, Glenn Wegryn, retired associate director of analytics for P&G, shared best practices for building the talent capabilities required to ensure success. A leadership council is in charge of sharing analytics best practices across P&G — breaking down silos to make sure the very best talent is being leveraged to solve the company’s most pressing business issues.

So, what are the characteristics of a great data analyst and where can you find them?

“I always look for people with solid quantitative backgrounds because that is the hardest thing to learn on the job,” said Wegryn.

Combine that with mature communication skills and strong business acumen and you’ve got the formula for a great data analyst.

When it comes to sourcing analytics, Wegryn says companies have an important strategic decision to make: Do you build it internally, leveraging resources like consultants and universities? Do you buy it from a growing community of technology solution providers? Or, do you adopt a hybrid model?

“Given the explosion of business analytics programs across the country, your organization should find ample opportunities to tap into those resources,” advised Wegryn.

To retain and nurture your organization’s business analysts, Wegryn recommended creating a career path that grows with them, developing talented personnel internally until they reach a trusted CEO-advisory role.

Wegryn also shared key questions an organization should ask to unleash the value of analytics, and suggested that analytics should always start and end with a decision.

“You make a decision in business that leads to action that gleans insights that leads to another decision,” he said. “While the business moves one way, the business analyst works backward in a focused, disciplined and controlled manner.”

Perhaps most importantly, the key to building the talent capability to ensure analytics success came from P&G’s retired chairman, president and CEO Bob McDonald: “… having motivation from the top helps.”

Wegryn agreed: “It really helps when the person at the top of the chain is driven on data.”

The inaugural Retail & Consumer Goods Analytics Summit event was held September 11-12, 2014 at the W Hotel in San Diego, California. The conference featured keynotes from retail and consumer goods leaders, peer-to-peer exchanges and relationship building.


Originally Posted at: Best Practices For Building Talent In Analytics by analyticsweekpick

Avoiding a Data Science Hype Bubble

In this post, Josh Poduska, Chief Data Scientist at Domino Data Lab, advocates for a common taxonomy of terms within the data science industry. The proposed definitions enable data science professionals to cut through the hype and increase the speed of data science innovation. 


The noise around AI, data science, machine learning, and deep learning is reaching a fever pitch. As this noise has grown, our industry has experienced a divergence in what people mean when they say “AI”, “machine learning”, or “data science”. It can be argued that our industry lacks a common taxonomy. If there is a taxonomy, then we, as data science professionals, have not done a very good job of adhering to it. This has consequences. Two consequences include the creation of a hype-bubble that leads to unrealistic expectations and an increasing inability to communicate, especially with non-data science colleagues. In this post, I’ll cover concise definitions and then argue how it is vital to our industry that we be consistent with how we define terms like “AI”.

Concise Definitions

  • Data Science: A discipline that uses code and data to build models that are put into production to generate predictions and explanations.
  • Machine Learning: A class of algorithms or techniques for automatically capturing complex data patterns in the form of a model.
  • Deep Learning: A class of machine learning algorithms that uses neural networks with more than one hidden layer.
  • AI: A category of systems that operate in a way that is comparable to humans in the degree of autonomy and scope.


Our terms have a lot of star power. They inspire people to dream and imagine a better world which leads to their overuse. More buzz around our industry raises the tide that lifts all boats, right? Sure, we all hope the tide will continue to rise. But, we should work for a sustainable rise and avoid a hype bubble that will create widespread disillusionment if it bursts.

I recently attended Domino’s rev conference, a summit for data science leaders and practitioners. I heard multiple leaders seeking advice on how to help executives, mid-level managers, and even new data scientists have proper expectations of data science projects without sacrificing enthusiasm for data science. Unrealistic expectations slow down progress by deflating the enthusiasm when projects yield less than utopian results. They also make it harder than it should be to agree on project success metrics and ROI goals.

The frequent overuse of “AI” when referring to any solution that makes any kind of prediction has been a major cause of this hype. As a result, people instinctively associate data science projects with near-perfect, human-like autonomous solutions. Or, at a minimum, people perceive that data science can easily solve their specific predictive need, without any regard to whether their organizational data will support such a model.


Incorrect use of terms also gums up conversations. This can be especially damaging in the early planning phases of a data science project, when a cross-functional team assembles to articulate goals and design the end solution. I know a data science manager who requires his team of data scientists to be literally locked in a room for an hour with business leaders before he will approve any new data science project. Okay, the door is not literally locked, but it is shut, and he does require them to discuss the project for a full hour. They’ve seen a reduction in project rework as they’ve focused on early alignment with business stakeholders. The challenge of explaining data science concepts is hard enough as it is. We only make it harder when we can’t define our own terms.

I’ve been practicing data science for a long time now. I’ve worked with hundreds of analytical leaders and practitioners from all over the world. Since AI and deep learning came on the scene, I’ve increasingly had to pause conversations and ask questions to discover what people really mean when they use certain terms. For example, how would you interpret these statements which are based on conversations I’ve had?

  • “Our goal is to make our solution AI-driven within 5 years.”
  • “We need to get better at machine learning before we invest in deep learning.”
  • “We use AI to predict fraud so our customers can spend with confidence.”
  • “Our study found that organizations investing in AI realize a 10% revenue boost.”

Confusing, right?

One has to ask a series of questions to be able to understand what is really going on.

The most common term-confusion I hear is when someone talks about AI solutions, or doing AI, when they really should be talking about building a deep learning or machine learning model. It seems that far too often the interchange of terms is on purpose, with the speaker hoping to get a hype-boost by saying “AI”. Let’s dive into each of the definitions and see if we can come to an agreement on a taxonomy.

Data Science

First of all, I view data science as a scientific discipline, like any other scientific discipline. Take biology, for example. Biology encompasses a set of ideas, theories, methods, and tools. Experimentation is common. The biological research community is continually adding to the discipline’s knowledge base. Data science is no different. Practitioners do data science. Researchers advance the field with new theory, concepts, and tools.

The practice of data science involves marrying code (usually some statistical programming language) with data to build models. This includes the important and dominant initial steps of data acquisition, cleansing, and preparation. Data science models usually make predictions (e.g., predict loan risk, predict disease diagnosis, predict how to respond to a chat, predict what objects are in an image). Data science models can also explain or describe the world for us (e.g., which combination of factors are most influential in making a disease diagnosis, which customers are most similar to each other and how). Finally, these models are put into production to make predictions and explanations when applied to new data. Data science is a discipline that uses code and data to build models that are put into production to generate predictions and explanations.

It can be difficult to craft a definition for data science while, at the same time, distinguishing it from statistical analysis. I came to the data science profession via educational training in math and statistics as well as professional experience as a statistician. Like many of you, I was doing data science before it was a thing.

Statistical analysis is based on samples, controlled experiments, probabilities, and distributions. It usually answers questions about likelihood of events or the validity of statements. It uses different algorithms like t-test, chi-square, ANOVA, DOE, response surface designs, etc. These algorithms sometimes build models too. For example, response surface designs are techniques to estimate the polynomial model of a physical system based on observed explanatory factors and how they relate to the response factor.

One key point in my definition is that data science models are applied to new data to make future predictions and descriptions, or “put into production”. While it is true that response surface models can be used on new data to predict a response, it is usually a hypothetical prediction about what might happen if the inputs were changed. The engineers then change the inputs and observe the responses that are generated by the physical system in its new state. The response surface model is not put into production. It does not take new input settings by the thousands, over time, in batches or streams, and predict responses.

My data science definition is by no means fool-proof, but I believe putting predictive and descriptive models into production starts to capture the essence of data science.

Machine Learning

Machine learning as a term goes back to the 1950s. Today, it is viewed by data scientists as a set of techniques that are used within data science. It is a toolset or a class of techniques for building the models mentioned above. Instead of a human explicitly articulating the logic for a model, machine learning enables computers to generate (or learn) models on their own. This is done by processing an initial set of data, discovering complex hidden patterns in that data, and capturing those patterns in a model so they can be applied later to new data in order to make predictions or explanations. The magic behind this process of automatically discovering patterns lies in the algorithms. Algorithms are the workhorses of machine learning. Common machine learning algorithms include the various neural network approaches, clustering techniques, gradient boosting machines, random forests, and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering. It is a class of tools and techniques with which the discipline is practiced.
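The learn-a-model-from-data loop described above can be sketched in a few lines. The snippet below is only a hedged illustration on synthetic data, with scikit-learn's random forest standing in for "the algorithms":

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data containing hidden patterns for the algorithm to discover
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No human writes the decision logic; the algorithm learns a model from data
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The captured patterns are then applied to data the model has never seen
accuracy = model.score(X_test, y_test)
```

Swapping in a gradient boosting machine or a neural network changes the algorithm, not the workflow.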

Deep Learning

Deep learning is the easiest of these terms to define. Deep learning is a class of machine learning algorithms that uses neural networks with more than one hidden layer. Neural networks themselves date back to the 1950s. Deep learning algorithms first became popular in the 1980s, went through a lull in the 1990s and 2000s, and were revived in our decade when relatively small tweaks in the way deep networks were constructed proved to have astonishing effects. Deep learning can be applied to a variety of use cases including image recognition, chat assistants, and recommender systems. For example, Google Speech, Google Photos, and Google Search are some of the original solutions built using deep learning.
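By that definition, "deep" simply means more than one hidden layer, which can be shown concretely. The sketch below uses scikit-learn's MLPClassifier as an illustrative stand-in; production deep learning typically uses specialized frameworks:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# A small nonlinear problem: two interleaved half-moons
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# Two hidden layers of 16 units each -- "deep" by the definition above
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
net.fit(X, y)

# n_layers_ counts the input, hidden, and output layers
print(net.n_layers_)  # 4
```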


AI

AI has been around for a long time, since long before the recent hype storm co-opted it as a buzzword. How do we, as data scientists, define it? When and how should we use it? What is AI to us? Honestly, I’m not sure anyone really knows. This might be our “emperor has no clothes” moment. We have the ambiguity and the resulting hype that comes from the promise of something new and unknown. The CEO of a well-known data science company was recently talking with our team at Domino when he mentioned “AI”. He immediately caught himself and said, “I know that doesn’t really mean anything. I just had to start using it because everyone is talking about it. I resisted for a long time but finally gave in.”

That said, I’ll take a stab at it: AI is a category of systems that people hope to create which have the defining characteristic that they will be comparable to humans in the degree of autonomy and scope of operation.

To extend our analogy, if data science is like biology and machine learning is like genetic engineering, then AI is like disease resistance. It’s the end result, a set of solutions or systems that we are striving to create through the application of machine learning (often deep learning) and other techniques.

Here’s the bottom line. I believe we need to draw a distinction between three separate things: techniques that serve as building blocks of AI solutions, solutions with AI-ish qualities, and true AI solutions that approach human autonomy and scope. People just say “AI” for all three far too often.

For example,

  • Deep learning is not AI. It is a technique that can be used as part of an AI solution.
  • Most data science projects are not AI solutions. A customer churn model is not an AI solution, regardless of whether it uses deep learning or logistic regression.
  • A self-driving car is an AI solution. It is a solution that operates with complexity and autonomy approaching what humans are capable of.

Remember those cryptic statements from above? In each case I asked questions to figure out exactly what was going on under the hood. Here is what I found.

  • An executive said: “Our goal is to make our solution AI-driven within 5 years.”
    The executive meant: “We want to have a couple machine learning models in production within 5 years.”
  • A manager said: “We need to get better at machine learning before we invest in deep learning.”
    The manager meant: “We need to train our analysts in basic data science principles before we are ready to try deep learning approaches.”
  • A marketer said: “We use AI to predict fraud so our customers can spend with confidence.”
    The marketer meant: “Our fraud score is based on a logistic regression model that has been working well for years.”
  • An industry analyst said: “Our study found that organizations investing in AI realize a 10% revenue boost.”
    The industry analyst meant: “Organizations that have any kind of predictive model in production realize a 10% revenue boost.”

The Ask

Whether you 100% agree with my definitions or not, I think we can all agree that there is too much hype in our industry today, especially around AI. Each of us has seen how this hype limits real progress. I argue that a lot of the hype is from misuse of the terms of data science. My ask is that, as data science professionals, we try harder to be conscious of how we use these key terms, and that we politely help others who work with us learn to use these terms in the right way. I believe that the quicker we can iterate to an agreed-upon taxonomy and insist on adherence to it, the quicker we can cut through hype and increase our speed of innovation as we build the solutions of today and tomorrow.

The post Avoiding a Data Science Hype Bubble appeared first on Data Science Blog by Domino.

Source: Avoiding a Data Science Hype Bubble by analyticsweek

Investment banks recruit for rise of big data analytics

Big data, or the large pools of data that can be captured, processed and then analysed, is now reaching into every sector and function of the global economy.

Financial services businesses, including the investment banks, generate and store more data than any other business in any other sector – broadly because it is such a transaction-heavy industry, often driven by models and algorithms.

Despite accumulating a wealth of information on capital market transactions, trades, financial markets, and other client and market data, the investment banks have been slower to embrace today’s definition of big data analytics than many consumer retail businesses, technology businesses, and even retail banking.

Organisations such as Amazon, Google, eBay and the UK’s big four supermarkets have been using big data analytics for many years, tracking consumer behaviour to suggest potential new products to consumers and develop customer loyalty schemes. Where investment banks have used big data, it has often been restricted to tracking individual sub-categories of asset classes.

The UK’s high-street banks have also been increasingly active in this area, using data analytics to study purchasing patterns, social media and location data, in order to tailor products and associated marketing material to individual customers’ needs.

Using big data analytics to increase profitability

The investment banks are now looking at how they can use big data to do what they do better, faster and more efficiently.

Senior executives at the banks want to enhance how they use data to raise profitability, map out markets and company-wide exposures, and ultimately win more deals.

While banks have, for many years, used data and value at risk modelling to measure and quantify the level of financial risk in a portfolio of assets, the fundamental difference with big data is that it has become an established standalone functional department rather than a series of small subsets of internal business units.

Big-data teams are now taking on the role of an influential internal consultancy, communicating to senior executives key insights on how to improve profitability.

Another key difference is that the banks are now not only analysing structured data, such as market or trading data, but also unstructured data, which can include sources such as tweets, blogs, Facebook posts and marketing material. This is now collected and recorded from a bank’s customers or clients – a significant shift from how data used to be captured.

Using large amounts of both structured and unstructured data and market data, the investment banks are now accurately modelling the outcome of investment decisions, and getting real-time insights into client demand.

Big data is also a fundamental element of risk-profiling for the banks, enabling data analysts to immediately assess the impact of the escalation in geopolitical risk on portfolios and their exposure to specific markets and asset classes. Specifically, banks have now built systems that will map out market-shaping past events in order to identify future patterns.

We are also seeing the banks using big data to analyse the effectiveness of their deals, looking for insights into which trades they did or did not win on a client-by-client basis.

But despite the recent growth in the use of big data by the banks, key challenges remain.

Unlike retail and technology giants such as Google, Facebook and Amazon, or any new startup or fintech company, the IT and data systems at most banks were not originally constructed to analyse structured and unstructured data. Updating and remodelling entire IT and data systems to accommodate the systems needed to generate a deep analysis of a bank’s data is time-consuming and costly.

Banks that have merged or acquired other banks or financial services businesses are likely to face even more complex issues when incorporating and updating legacy IT systems.

Surge in hiring big data analytics specialists

The competition between banks and fund managers to hire big data specialists is heating up.

The banks are actively recruiting big data and analytics specialists to fill two main, but significantly different, roles: big data engineers, and data scientists working on analytics and insights.

Big data engineers will typically come from a strong IT development or coding background and are responsible for designing data platforms and applications. A big data engineer can typically command £55,000 a year and may also be known as a “software engineer – big data”, “big data software architect” or “Hadoop developer”.

Data scientists, in contrast, are responsible for bridging the gap between data analytics and business decision-making, capable of translating complex data into key strategy insight.

Data scientists – also known as analytics and insights manager or director of data science – are expected to have sharp technical and quantitative skills. Data scientists are in highest demand and this is where the biggest skill shortage exists.

Data scientists are responsible for examining the data, identifying key trends, and writing the complex algorithms that will see the raw data transformed into a piece of analysis or insight that the business can use to gain a competitive advantage.

Such is the shortage of individuals with this skillset that a data scientist can command between £75,000 and £110,000 a year straight out of university.

Big data teams will often be competing to hire from the same pool of mathematics and physics PhDs from which other areas of the investment bank will be hiring.

Christopher Adeyeri is associate director – head of technology at recruitment firm Astbury Marsden

Originally posted via “Investment banks recruit for rise of big data analytics”

Source by analyticsweekpick

Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels

The datasets we encounter as scientists, analysts, and data nerds are increasingly complex. Much of machine learning is focused on extracting meaning from complex data. However, there is still a place for us lowly humans: the human visual system is phenomenal at detecting complex structure and discovering subtle patterns hidden in massive amounts of data. Every second that our eyes are open, countless data points (in the form of light patterns hitting our retinas) are pouring into visual areas of our brain. And yet, remarkably, we have no problem at all recognizing a neat looking shell on a beach, or our friend’s face in a large crowd. Our brains are “unsupervised pattern discovery aficionados.”

On the other hand, there is at least one major drawback to relying on our visual systems to extract meaning from the world around us: we are essentially capped at perceiving just 3 dimensions at a time, and many datasets we encounter today are higher dimensional.

So, the question of the hour is: how can we harness the incredible pattern-recognition superpowers of our brains to visualize complex and high-dimensional datasets?

Dimensionality Reduction

In comes dimensionality reduction, stage right. Dimensionality reduction is just what it sounds like: transforming a high-dimensional dataset into a lower-dimensional dataset. For example, take this UCI ML dataset on Kaggle comprising observations about mushrooms, organized as a big matrix. Each row comprises a bunch of features of the mushroom, like cap size, cap shape, cap color, odor, etc. The simplest way to do dimensionality reduction might be to simply ignore some of the features (e.g. pick your favorite three, say size, shape, and color, and ignore everything else). However, this is problematic if the features you drop contain valuable diagnostic information (e.g. whether the mushrooms are poisonous).

A more sophisticated approach is to reduce the dimensionality of the dataset by only considering its principal components, or the combinations of features that explain the most variance in the dataset. Using a technique called principal component analysis (PCA), we can reduce the dimensionality of a dataset while preserving as much of its precious variance as possible. The key intuition is that we can create a new set of (a smaller number of) features, where each of the new features is some combination of the old features. For example, one of these new features might reflect a mix of shape and color, and another might reflect a mix of size and poisonousness. In general, each new feature will be constructed from a weighted mix of the original features.

Below is a figure to help with the intuition. Imagine that you had a 3 dimensional dataset (left), and you wanted to reduce it to a 2 dimensional dataset (right). PCA finds the principal axes in the original 3D space where the variance between points is the highest. Once we identify the two axes that explain the most variance (the black lines in the left panel), we can re-plot the data along just those axes, as shown on the right. Our 3D dataset is now 2D. Here we have chosen a low-dimensional example so we could visualize what is happening. However, this technique can be applied in the same way to higher-dimensional datasets.
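That 3D-to-2D intuition is easy to reproduce with scikit-learn directly, which is the same machinery HyperTools builds on. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# A synthetic 3D point cloud that really only varies along two directions
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))
points += 0.05 * rng.normal(size=(200, 3))  # small noise in the third direction

# Keep the two axes of highest variance and re-express the data along them
pca = PCA(n_components=2)
reduced = pca.fit_transform(points)

print(reduced.shape)  # (200, 2)
```

Here pca.explained_variance_ratio_ reports how much of the original variance each retained axis preserves.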

We created the HyperTools package to facilitate these sorts of dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. HyperTools is designed with ease of use as a primary objective. We highlight two example use cases below.

Mushroom foraging with HyperTools: Visualizing static ‘point clouds’

First, let’s explore the mushrooms dataset we referenced above. We start by importing the relevant libraries:

import pandas as pd
import hypertools as hyp

and then we read our data into a pandas DataFrame:

data = pd.read_csv('../input/mushrooms.csv')
index  class  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment
0      p      x          s            n          t        p     f
1      e      x          s            y          t        a     f
2      e      b      

    s          w            t        l     f
3      p      x          y            w          t        p     f
4      e      x          s            g          f        n     f
5      e      x          y            y          t        a     f
Each row of the DataFrame corresponds to a mushroom observation, and each column reflects a descriptive feature of the mushroom (only some of the rows and columns are shown above). Now let’s plot the high-dimensional data in a low dimensional space by passing it to HyperTools. To handle text columns, HyperTools will first convert each text column into a series of binary ‘dummy’ variables before performing the dimensionality reduction. For example, if the ‘cap size’ column contained ‘big’ and ‘small’ labels, this single column would be turned into two binary columns: one for ‘big’ and one for ‘small’, where 1s represents the presence of that feature and 0s represents the absence (for more on this, see the documentation for the get_dummies function in pandas).
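That dummy-coding step looks roughly like the following, shown on a toy frame with illustrative column values (HyperTools performs this conversion for you):

```python
import pandas as pd

# Toy stand-in for two text columns of the mushrooms data
df = pd.DataFrame({'cap-size': ['big', 'small', 'big'],
                   'cap-color': ['n', 'y', 'w']})

# Each text column becomes one binary column per label
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['cap-size_big', 'cap-size_small', 'cap-color_n', 'cap-color_w', 'cap-color_y']
```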

hyp.plot(data, 'o')

In plotting the DataFrame, we are effectively creating a three-dimensional “mushroom space,” where mushrooms that exhibit similar features appear as nearby dots, and mushrooms that exhibit different features appear as more distant dots. By visualizing the DataFrame in this way, it becomes immediately clear that there are multiple clusters in the data. In other words, all combinations of mushroom features are not equally likely, but rather certain combinations of features tend to go together. To better understand this space, we can color each point according to some feature in the data that we are interested in knowing more about. For example, let’s color the points according to whether the mushrooms are (p)oisonous or (e)dible (the class_labels feature):

class_labels = data['class']  # the poisonous/edible labels
hyp.plot(data, 'o', group=class_labels, legend=list(set(class_labels)))

Visualizing the data in this way highlights that mushrooms’ poisonousness appears stable within each cluster (e.g. mushrooms that have similar features), but varies across clusters. In addition, it looks like there are a number of distinct clusters that are poisonous/edible. We can explore this further by using the ‘cluster’ feature of HyperTools, which colors the observations using k-means clustering. In the description of the dataset, it was noted that there were 23 different types of mushrooms represented in this dataset, so we’ll set the n_clusters parameter to 23:

hyp.plot(data, 'o', n_clusters=23)

To gain access to the cluster labels, the clustering tool may be called directly via hyp.tools.cluster, and the resulting labels may then be passed to hyp.plot:

cluster_labels = hyp.tools.cluster(data, n_clusters=23)
hyp.plot(data, group=cluster_labels)

By default, HyperTools uses PCA to do dimensionality reduction, but with a few additional lines of code we can use other dimensionality reduction methods by directly calling the relevant functions from sklearn. For example, we can use t-SNE to reduce the dimensionality of the data using:

from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=3)
reduced_data_TSNE = TSNE_model.fit_transform(hyp.tools.df2mat(data))
hyp.plot(reduced_data_TSNE,'o', group=class_labels, legend=list(set(class_labels)))

Different dimensionality reduction methods highlight or preserve different aspects of the data. A repository containing additional examples (including different dimensionality reduction methods) may be found here.

The data expedition above provides one example of how the geometric structure of data may be revealed through dimensionality reduction and visualization. The observations in the mushrooms dataset formed distinct clusters, which we identified using HyperTools. Explorations and visualizations like this could help guide analysis decisions (e.g. whether to use a particular type of classifier to discriminate poisonous vs. edible mushrooms). If you’d like to play around with HyperTools and the mushrooms dataset, check out and fork this Kaggle Kernel!

Climate science with HyperTools: Visualizing dynamic data

Whereas the mushrooms dataset comprises static observations, here we will take a look at some global temperature data, which will showcase how HyperTools may be used to visualize timeseries data using dynamic trajectories.

This next dataset is made up of monthly temperature recordings from a sample of 20 global cities over the 138 year interval ranging from 1875–2013. To prepare this dataset for analysis with HyperTools, we created a time by cities matrix, where each row is a temperature recording for subsequent months, and each column is the temperature value for a different city. You can replicate this demo by using the Berkeley Earth Climate Change dataset on Kaggle or by cloning this GitHub repo. To visualize temperature changes over time, we will use HyperTools to reduce the dimensionality of the data, and then plot the temperature changes over time as a line:
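The time-by-cities reshaping can be done with a pandas pivot. The snippet below is a sketch on toy records; the column names dt, City, and AverageTemperature follow the Berkeley Earth by-city file, but check them against your copy of the dataset:

```python
import pandas as pd

# Toy records shaped like the Berkeley Earth by-city file
records = pd.DataFrame({
    'dt': ['1875-01-01', '1875-02-01', '1875-01-01', '1875-02-01'],
    'City': ['Paris', 'Paris', 'Lima', 'Lima'],
    'AverageTemperature': [2.1, 3.4, 22.0, 22.5],
})

# One row per month, one column per city
temps = records.pivot(index='dt', columns='City', values='AverageTemperature')
print(temps.shape)  # (2, 2)
```

hyp.plot(temps) then draws each monthly row as a point along a connected trajectory.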

hyp.plot(temps)

Well that just looks like a hot mess, now doesn’t it? However, we promise there is structure in there, so let’s find it! Because each city is in a different location, the mean and variance of its temperature timeseries may be higher or lower than the other cities. This will in turn affect how much that city is weighted when dimensionality reduction is performed. To normalize the contribution of each city to the plot, we can set the normalize flag (default value: False). Setting normalize='across' will normalize (z-score) each column of the data. HyperTools incorporates a number of useful normalization options, which you can read more about here.

hyp.plot(temps, normalize='across')
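For intuition, 'across' normalization amounts to z-scoring each column of the matrix. Here is a minimal plain-Python sketch of that idea (the toy `temps` matrix stands in for the real dataset, and HyperTools’ internal implementation may differ):

```python
from statistics import mean, stdev

def zscore_across(matrix):
    """Z-score each column of a row-major matrix (list of rows),
    so every city contributes equally to dimensionality reduction."""
    cols = list(zip(*matrix))  # transpose: one tuple per column
    normed_cols = []
    for col in cols:
        m, s = mean(col), stdev(col)
        normed_cols.append([(x - m) / s for x in col])
    return [list(row) for row in zip(*normed_cols)]  # transpose back

# Toy "temperature" matrix: rows are months, columns are cities
temps = [[10.0, 25.0],
         [12.0, 27.0],
         [14.0, 29.0]]
normed = zscore_across(temps)
# Each column now has mean 0 and unit standard deviation
```

After this transformation, a city with large absolute temperatures no longer dominates the reduced-dimensionality embedding.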

Now we’re getting somewhere! Rotating the plot with the mouse reveals an interesting shape to this dataset. To help highlight the structure and understand how it changes over time, we can color the lines by year, where redder lines indicate earlier timepoints and bluer lines indicate later ones:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r')

Coloring the lines has now revealed two key structural aspects of the data. First, there is a systematic shift from blue to red, indicating a systematic change in the pattern of global temperatures over the years reflected in the dataset. Second, within each year (color), there is a cyclical pattern, reflecting seasonal changes in the temperature patterns. We can also visualize these two phenomena using a two-dimensional plot:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r', ndims=2)

Now, for the grand finale. In addition to creating static plots, HyperTools can also create animated plots, which can sometimes reveal additional patterns in the data. To create an animated plot, simply pass animate=True to hyp.plot when visualizing timeseries data. If you also pass chemtrails=True, a low-opacity trace of the data will remain in the plot:

hyp.plot(temps, normalize='across', animate=True, chemtrails=True)

That pleasant feeling you get from looking at the animation is called “global warming.”

This concludes our exploration of climate and mushroom data with HyperTools. For more, please visit the project’s GitHub repository, readthedocs site, a paper we wrote, or our demo notebooks.


Andrew is a Cognitive Neuroscientist in the Contextual Dynamics Laboratory. His postdoctoral work integrates ideas from basic learning and memory research with computational techniques used in data science to optimize learning in natural educational settings, like the classroom or online. Additionally, he develops open-source software for data visualization, research and education.

The Contextual Dynamics Lab at Dartmouth College uses computational models and brain recordings to understand how we extract information from the world around us. You can learn more about us at

Originally Posted at: Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels by analyticsweek

Rules-Based Versus Dynamic Algorithms: The Fate of Artificial Intelligence

With so many recent headlines pertaining to advanced machine learning and deep learning, it may be easy to forget that at one point in the lengthy history of Artificial Intelligence, the term largely denoted relatively simple, rules-based algorithms.

According to TopQuadrant CEO Irene Polikoff, “It’s interesting, AI used to be synonymous with rules and expert systems. These days, it seems to be, in people’s minds, synonymous with machine learning.”

In contemporary enterprise settings, the latter is applauded for its dynamic mutability while the former is derided for a static rigidity not, so the argument goes, emblematic of truly intelligent systems. If humans are devising the rules, is it truly AI?

Nonetheless, certain aspects of rules-based, ‘algorithmic AI’ persist, partly because of their applicability to different use cases and partly because of machine learning’s shortcomings. The most notable of those shortcomings is the ‘black box’ phenomenon (highly prevalent in facets of unsupervised learning and deep learning), in which the results of machine learning models are difficult to explain.

A closer examination of the utility and drawbacks of each approach indicates that in many cases pertaining to automation, the two balance each other for explainable, trustworthy intelligent systems and solutions.

Machine Learning Algorithms
Machine learning algorithms are widely acclaimed for their automation capabilities, which have produced palpable business value for data management and data engineering mainstays for some time now. They also deliver similar results for specific facets of data governance. When ensuring that captured data conforms to business glossary definitions for consistent, unambiguous reuse throughout the enterprise, it’s useful to automate the tagging of data in accordance with those normative terms and business concepts. Machine learning is an integral means of automating this process. For example, when using what Polikoff referenced as “controlled vocabularies” to tag documents stemming from content management systems for regulatory compliance or other governance needs, “machine learning is used to find the most right placed term that applies to documents,” Polikoff revealed.
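As a rough illustration of the kind of matching Polikoff describes (not TopQuadrant’s actual implementation), a controlled-vocabulary tagger can be sketched as a nearest-term search over word-count vectors; the vocabulary terms and keywords below are invented:

```python
from collections import Counter
from math import sqrt

# Hypothetical controlled vocabulary: term -> descriptive keywords
VOCAB = {
    "data-privacy":   "personal data consent privacy GDPR subject",
    "financial-risk": "credit exposure risk capital loss audit",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_tag(document: str) -> str:
    """Return the vocabulary term whose keyword profile best matches the document."""
    doc_vec = Counter(document.lower().split())
    scores = {term: cosine(doc_vec, Counter(kw.lower().split()))
              for term, kw in VOCAB.items()}
    return max(scores, key=scores.get)

tag = suggest_tag("Customers must give consent before personal data is stored")
```

Real systems would use richer features and learned models, but the core idea, scoring documents against governed terms and proposing the best match, is the same.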

Human in the Loop and Explainability
There are two critical considerations for this (and other) automation use cases of supervised machine learning. The first is that, despite the fact that certain machine learning algorithms will eventually be able to readily incorporate previous results to increase the accuracy of future ones, the learning is far from autonomous. “There is some training involved; even after you train there’s users-in-the-loop to view the tags and accept them or reject them,” Polikoff mentioned. “That could be an ongoing process or you could decide at some point to let it run by itself.” Those who choose the latter option may encounter the black box phenomenon, in which there’s limited explainability for the results of machine learning algorithms and the models that produced them. “With machine learning, what people are starting to talk about more and more today is how much can we rely on something that’s very black box?” Polikoff said. “Who is at fault if it goes wrong and there are some conclusions where it’s not correct and users don’t understand how this black box operates?”
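The users-in-the-loop workflow Polikoff describes (the model proposes tags, humans accept or reject them, and accepted pairs feed back as training data) can be sketched as follows; the function and data are purely illustrative:

```python
def review_tags(proposed, reviewer):
    """Route model-proposed tags through a human reviewer.

    proposed: list of (document_id, tag, confidence) from the model.
    reviewer: callable returning True (accept) or False (reject);
              in production this would be a review UI, not a function.
    Accepted pairs can be fed back as labeled training data.
    """
    accepted, rejected = [], []
    for doc_id, tag, confidence in proposed:
        if reviewer(doc_id, tag, confidence):
            accepted.append((doc_id, tag))
        else:
            rejected.append((doc_id, tag))
    return accepted, rejected

# Toy policy standing in for a human: trust only high-confidence tags
def auto_reviewer(doc_id, tag, confidence):
    return confidence >= 0.9

accepted, rejected = review_tags(
    [("doc-1", "invoice", 0.97), ("doc-2", "contract", 0.55)],
    auto_reviewer,
)
```

Letting a confidence threshold stand in for the reviewer, as in the toy policy above, is exactly the “let it run by itself” option, with the black-box risks that follow.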

Algorithmic AI
Conversely, there’s never a lack of explainability associated with rules-based AI, in which humans devise the rules upon which algorithms are based. Transparent understanding of the results of such algorithms is their strength; their immutability is often considered their weakness when compared with dynamic machine learning algorithms. However, when attempting to circumscribe the black box effect, “to some extent rules address them,” Polikoff maintained. “The rule is clearly defined; you can always examine it; you can seek it. Rules are very appropriate. They’re more powerful together [with machine learning].” The efficacy of the tandem of rules and machine learning is duly demonstrated in the data governance tagging use case, which is substantially enhanced by deploying a standards-based enterprise knowledge graph to represent the documents and their tags in conjunction with vocabularies. According to Polikoff, “you can have from one perspective a controlled vocabulary with some rules in it, and from another perspective you have machine learning. You can combine both.”

In this example machine learning would be deployed to “find [the] most likely tags in the document, look at the rules about the concepts those tags represent, and add more knowledge based on that,” Polikoff said. Implicit to this process are the business rules for the terms upon which the tags are created, which helps define them. Equally valuable is the knowledge graph environment which can link the knowledge gleaned from the tagging to other data, governance concepts, and policies. The aforementioned rules, in the form of vocabularies or a business glossary, aggrandize machine learning’s automation for more accurate results.

Synthesized Advantages
The mutable nature of machine learning algorithms doesn’t mean the end of rules or the value extracted from rules-based, algorithmic AI. Both can work simultaneously to enrich each other’s performance, particularly for automation use cases. The addition of rules can increase the explainability for machine learning, resulting in greater understanding of the results of predictive models. When leveraged in linked data settings, there’s the potential for “a combination of machine learning and inferencing working together and ultimately, since both of them are using a knowledge graph for the presentation of the knowledge and the presentation of the data, that makes for clarity,” Polikoff remarked. “It’s quite a smooth and integrated environment where you can combine those processes.”

Originally Posted at: Rules-Based Versus Dynamic Algorithms: The Fate of Artificial Intelligence by jelaniharper

Analytics for government: where big data and Big Brother collide


There is rightfully a lot of hype around e-government. The application of analytics in the private sector has had a significant impact on our lives.

And, at first blush, it seems like a great idea for our governments to be more like Google or Amazon, using data and analytics to deliver improved services more cost effectively, when and where people need them.

However, while many of the benefits found in the private sector can translate directly to application in the public sector, there are hurdles our governments will have to clear that the Googles of the world simply dodge.

A lot is already happening in e-government. The Glasgow Smart City initiative applies a combination of advanced technologies to benefit the people of Glasgow. Traffic management, more efficient policing, optimising green technologies, improving public transportation and many other initiatives are all driven by the application of technology and data analytics.

We also see examples such as Torbay Council and the London Borough of Islington using analytics to drive efficiency in the delivery of services and to increase transparency.

Torbay Council makes available expenditure data on its public website to increase transparency, while using analytics internally to help budget holders run their services more efficiently.

The London Borough of Islington was able to save £800,000 annually by combining CCTV data with operational data to create dashboards that helped them to deploy parking enforcement personnel more effectively, as well as reduce ticket processing time from six months to four days.

On both grand and more pedestrian scales, analytics is improving public services.

The benefits of applying analytics in government are real, but the public sector should be cautious about simply taking the experience of the private sector and trying to apply it directly.

The public sector will need to carefully rethink the often adversarial nature analytics can take in the private sector. Amazon’s recommendation algorithms may be “cool”, but the algorithms are not your friend. They are there to get you to spend more.

Transparency and privacy are the two key areas where the public sector will not be able to rely on the private sector for innovation.

Data ownership, as an example, is an area in which companies such as Google and Amazon are not good role models. Amazon owns my purchase history, but should the Government “own” my health data?

Amazon can use my purchase data for any purpose it sees fit without telling me who is accessing it or why. Should government CCTV data be treated the same?

This is not a good model for e-government to follow. In fact, the challenge was highlighted earlier this year by the Government’s surveillance camera commissioner, Tony Porter, who warned that algorithms can predict behaviour and automatically track individuals.

It is critical that the public understands how data is being used and participates in managing that process. This is where the public sector will need to drive new innovations, educating citizens and empowering them to participate in controlling their data and its usage.

In other words, delivering data-driven government while keeping Big Brother at bay.

Note: This article originally appeared in Telegraph. Click for link here.

Originally Posted at: Analytics for government: where big data and Big Brother collide

Is Big Data The Most Hyped Technology Ever?

I read an article today on the topic of Big Data. In the article, the author claims that the term Big Data is the most hyped technology ever, even compared to such things as cloud computing and Y2K. I thought this was a bold claim and one that is testable. Using Google Trends, I looked at the popularity of three IT terms to understand the relative hype of each (as measured by number of searches on the topic): Web 2.0, cloud computing and big data. The chart from Google Trends appears below.

We can learn a couple of things from this graph. First, interest in Big Data has continued to grow since its first measurable rise in early 2011. Still, the number of searches for the respective terms clearly shows that Web 2.0 and cloud computing received more searches than Big Data. While we don’t know if interest in Big Data will continue to grow, Google Trends, in fact, predicts a very slow growth rate for Big Data through the end of 2015.

Second, Web 2.0 and cloud computing grew faster than Big Data, showing that public interest rose more quickly for those terms. Interest in Web 2.0 reached its maximum a little over two years after its initial ascent. Interest in cloud computing peaked in about 3.5 years. Interest in Big Data has been growing steadily for over 3.7 years.

One other point of interest: for these three technology terms, the growth of each later term started at the peak of the previous one. As one technology becomes commonplace, another takes its place.

So, is Big Data the most hyped technology ever? No.

Originally Posted at: Is Big Data The Most Hyped Technology Ever?

How to Define KPIs for Successful Business Intelligence

Realizing that you can only improve what you measure is a good way to think about KPIs. Often companies want to improve different aspects of their business all at once, but can’t put a finger on what will measure their progress towards overarching company goals. Does it come down to comparing the growth of last year to this year? Or, is it just about the cost of acquiring new customers?

If you’re nervously wondering now, “wait, what is my cost per deal?”, don’t sweat it. Another growing pain of deciding on KPIs is discovering that there is a lot of missing information.

Defining Your KPIs

Choosing the right KPI is crucial to making effective, data-driven decisions. Choose the right KPI and it will help concentrate the efforts of employees on a meaningful goal; choose incorrectly, however, and you could waste significant resources chasing vanity metrics.

In order to rally the efforts of your team and achieve your long-term objectives, you have to measure the right things. For example, if the goal is to increase revenue at a SaaS company by 25% over the next two quarters, you couldn’t determine success by focusing on the number of likes your Facebook page got. Instead, you could ask questions like: Are we protecting our ARR by retaining our existing customers? Do we want to look at the outreach efforts of our sales development representatives, and whether that results in increased demos and signups? Should we look at the impact of increased training for the sales team on closed deals?

Similarly, if we wanted to evaluate the effectiveness of various marketing channels, we would need more than an end goal of increasing sales or brand awareness. Instead, we’d need a more precise definition of success. This might include ad impressions, click-through rates, conversion numbers, new email list subscribers, page visits, bounce rates, and much more.

Looking at all these factors will allow us to determine which channels are driving the most traffic and revenue. If we dig a bit deeper, there will be even more insights to discover. In addition to discovering which channels produce traffic most likely to translate into a conversion, we can also learn if other factors such as timing make a difference to reach our target audience.

Of course, every industry and business are different. To establish meaningful KPIs, you’ll need to determine what most clearly correlates with your company’s goals. Here are a few examples:

  • Finance – Working capital, Operating cash flow, Return on equity, Quick ratio, Debt to equity ratio, Inventory turnover, Accounts receivable turnover, Gross profit margin
  • Marketing – Customer acquisition cost, Conversion rate of a particular channel, Percentage of leads generated from a particular channel, Customer Churn, Dormant customers, Average spend per customer
  • Healthcare – Inpatient mortality rate, Bed turnover, Readmission rate, Average length of stay, Patient satisfaction, Total operating margin, Average cost per discharge, Cash receipt to bad debt, Claims denial rate
  • Retail – Gross margin (as a percentage of selling price), Inventory turnover, Sell-through percentage, Average sales per transaction, Percentage of total stock not displayed

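To make a few of the finance and retail metrics above concrete, here is a small illustrative sketch using standard textbook formulas and made-up figures:

```python
def gross_margin(revenue, cogs):
    """Gross margin as a percentage of selling price."""
    return (revenue - cogs) / revenue * 100

def inventory_turnover(cogs, avg_inventory):
    """How many times inventory is sold and replaced over a period."""
    return cogs / avg_inventory

def quick_ratio(cash, receivables, current_liabilities):
    """Liquidity measure that excludes inventory."""
    return (cash + receivables) / current_liabilities

# Made-up figures for illustration only
margin = gross_margin(revenue=500_000, cogs=350_000)             # ~30 (%)
turns = inventory_turnover(cogs=350_000, avg_inventory=70_000)   # 5 turns
qr = quick_ratio(cash=40_000, receivables=60_000,
                 current_liabilities=80_000)                     # 1.25
```

The point is not the arithmetic, which is trivial, but that each KPI has a precise, agreed-upon formula before anyone builds a dashboard around it.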
If your business is committed to data-driven decision making, establishing the right KPIs is crucial. Although the process of building a performance-driven culture is iterative, clearly defining the desired end result will go a long way towards helping you establish effective KPIs that focus the efforts of your team on that goal, whether it’s to move product off shelves faster, create better patient outcomes, or increase your revenue per customer.

The good news is that in the business intelligence world, measuring performance can be especially precise, quick and easy. Yet, the first hurdle every data analyst faces is the initial struggle to choose and agree on company KPIs & KPI tracking. If you are about to embark on a BI project, here’s a useful guide on how to decide what it is that you want to measure:

Step 1: Isolate Pain Points, Identify Core Business Goals

A lot of companies start by trying to quantify their current performance. But again, as a data analyst, the beauty of your job and the power of business intelligence is that you can drill into an endless amount of very detailed metrics. From clicks, site traffic, and conversion rates, to service call satisfaction and renewals, the list goes on. So ask yourself: What makes the company better at what they do?

You can approach this question by focusing on stage growth, where a startup would focus most on metrics that validate business models, whereas an enterprise company would focus on metrics like customer lifetime value analysis. Or, you can examine this question by industry: a services company (consultancies) would focus more on quality of services rendered, whereas a company that develops products would focus on product usage.

Ready to dive in? Start by going from top-down through each department to elicit requirements and isolate the pain points and health factors for every department. Here are some examples of KPI metrics you may want to look at:

Customer Success KPIs
  • Product related tickets
  • Customer satisfaction
  • Usage statistics (SaaS products)

Marketing KPIs

  • Brand awareness
  • Conversion rate
  • Site traffic
  • Social shares

Product Development KPIs
  • Number of bugs
  • Length of development cycle
  • App usage

Step 2: Break It Down to A Few KPIs

Once you choose a few important KPIs, try to break them down even further. Remember, while there’s no magic number, less is almost always more when it comes to KPIs. If you track too many, as a data analyst you may start to lose your audience and the focus of the common business user. Seven to ten KPIs is a great number to aim for, and you can get there by breaking down your core business goals into much more specific metrics.

Remember, the point of a KPI is to gain focus and align goals for measurable improvement. Spend more time choosing the KPIs than simply throwing too many into the mix, which will just push the question of focus further down the road (and require more work!).

Step 3: Carefully Assess Your Data


After you have your main 7-10 elements, you can start digging into the data and begin some data modeling. A good question to ask at this point is: how does the business currently make decisions? Counterintuitively, to answer that question you may want to look at where the company is currently not making its decisions based on data, or not collecting the right data.

This is where you get to flex your muscles as a “data hero” or a good analyst! Take every KPI and present it as a business question. Then break the business questions into facts, dimensions, filters, and order (example).
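As an illustration of that breakdown, one hypothetical KPI ("average deal size per sales region") might decompose into facts, dimensions, filters, and order like this (all table names, fields, and dates are invented for the example):

```python
# Hypothetical breakdown of one KPI into the elements described above.
# KPI: "Average deal size per sales region, last quarter, largest first"
kpi_spec = {
    "fact":       "avg(deal_amount)",        # what we measure
    "dimensions": ["sales_region"],          # how we slice it
    "filters":    ["close_date >= '2024-01-01'",
                   "close_date <  '2024-04-01'"],
    "order":      "avg(deal_amount) DESC",   # how we rank results
}

def to_sql(spec, table="deals"):
    """Translate the spec into a SQL query string (illustrative only)."""
    return (
        f"SELECT {', '.join(spec['dimensions'])}, {spec['fact']} "
        f"FROM {table} "
        f"WHERE {' AND '.join(spec['filters'])} "
        f"GROUP BY {', '.join(spec['dimensions'])} "
        f"ORDER BY {spec['order']}"
    )

query = to_sql(kpi_spec)
```

Writing each KPI down in this shape forces the fact to be explicit, which is exactly the element that, as noted below, every business question must contain.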

Not every business question contains all of these elements, but there will always be a fact, because you have to measure something. You’ll need to answer the following before moving on:

  • What are the data sources?
  • How complex will your data model be?
  • Which tools will you use to prepare, manage and analyze the data (your BI solution)?

Do this by breaking each KPI into its data components, asking questions like: What do I need to count? What do I need to aggregate? Which filters need to be applied? For each of these questions, you have to know which data sources are being used and where the tables are coming from.

Consider that data will often come from multiple, disparate data sources. For example, for information on a marketing or sales pipeline, you’ll probably need Google Analytics/AdWords data combined with your CRM data. As a data analyst, it’s important to recognize that the most powerful KPIs often come from a combination of multiple data sources. Make sure you are using the right tools, such as a BI tool with built-in data connectors, to prepare and join data accurately and easily.
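Conceptually, combining those sources is a keyed join. A minimal plain-Python sketch, with hypothetical field names standing in for real Google Analytics and CRM extracts:

```python
# Hypothetical extracts: web analytics sessions and CRM deals, keyed by campaign
analytics = [
    {"campaign": "spring-promo", "sessions": 1200, "signups": 45},
    {"campaign": "retargeting",  "sessions": 300,  "signups": 20},
]
crm = [
    {"campaign": "spring-promo", "closed_deals": 9, "revenue": 27_000},
]

def left_join(left, right, key):
    """Join two record lists on `key`, keeping every left row
    (roughly what a BI tool's data connectors do behind the scenes)."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key], {})
        joined.append({**row, **{k: v for k, v in match.items() if k != key}})
    return joined

pipeline = left_join(analytics, crm, key="campaign")
# "retargeting" has no CRM match, so it carries only the analytics fields
```

In practice a BI tool or a library like pandas handles this, but the sketch shows why the join key (here, the campaign name) must be consistent across sources.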

Step 4: Represent KPIs in an Accurate and Effective Fashion

Congrats! You’ve connected your KPI data to your business. Now you’ll need to find a way to represent the metrics in the most effective way. Check out some of these different BI dashboard examples for some inspiration.

One tip to keep in mind: the goal of your dashboard is to put everyone on the same page. Still, users will each have their own questions and areas they want to explore, which is why building interactive, highly visual BI dashboards is important. Your BI solution should offer interactive dashboards that allow users to perform basic analytical tasks, such as filtering the views, drilling down, and examining underlying data, all with little training.

See an example:

Profit & Loss - Financial Dashboard


As a data analyst, you should always look for what other insights you can achieve with the data that the business never thought of asking. People are often entrenched in their own processes, and as an analyst you offer an “outsider’s perspective” of sorts, since you only see the data, while others are clouded by their day-to-day business tasks. Don’t be afraid to ask the hard questions. Start with the most basic and you’ll be surprised how often big companies don’t know the answers, and you’ll be a data hero just for asking.

Source: How to Define KPIs for Successful Business Intelligence by analyticsweek

Follow the Money: The Demand for Deep Learning

Numbers don’t lie.

According to CB Insights, 100 of the most promising private startups focused on Artificial Intelligence raised $11.7 billion in equity funding in 367 deals during 2017. Several of those companies focus on deep learning technologies, including the most well-funded, ByteDance, which accounts for over a fourth of 2017’s private startup funding with 3.1 billion dollars raised.

In the first half of last year alone, corporate venture capitalists contributed nearly 2 billion dollars of disclosed equity funding in 88 deals to AI startups, which surpassed the total financing for AI startups for all of 2016. The single largest corporate venture capitalist deal in the early part of 2017 was the $600 million Series D funding awarded to NIO, an organization based in China that specializes in autonomous vehicles (among other types of craft), which relies on aspects of deep learning.

According to Forrester, venture capital funding activity in computer vision increased at a CAGR of 137% from 2015 to 2017. Most aspects of advanced pattern recognition, including speech, image, facial recognition and others, hinge on deep learning. A Forbes post noted, “Google, Baidu, Microsoft, Facebook, Salesforce, Amazon, and all other major players are talking about – and investing heavily in – what has become known as ‘deep learning’.” Indeed, both Microsoft and Google have created specific entities to fund companies specializing in AI.
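For reference, a CAGR like the 137% figure above comes from the standard compound-growth formula; here is a small sketch with purely illustrative numbers (not Forrester’s underlying data):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over `years` periods."""
    return (end_value / start_value) ** (1 / years) - 1

# Illustrative: a quantity growing 137% per year for two years
# multiplies roughly 5.6x overall ((1 + 1.37) ** 2 is about 5.62)
growth = cagr(start_value=100.0, end_value=100.0 * 2.37 ** 2, years=2)
```

A 137% CAGR thus means the funding level more than doubled each year over the two-year window.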

According to Razorthink CEO Gary Oliver, these developments are indicative of a larger trend in which, “If you look at where the investments are going from the venture community, if you look at some of the recent reports that have come out, the vast majority are focused on companies that are doing deep learning.”

Endless Learning
Deep learning is directly responsible for many of the valuable insights organizations can access via AI, since it can rapidly parse through data at scale to discern patterns that are otherwise too difficult to see or take too long to notice. In particular, deep learning actuates the unsupervised prowess of machine learning by detecting data-driven correlations to business objectives for variables on which it wasn’t specifically trained. “That’s what’s kind of remarkable about deep learning,” maintained Tom Wilde, CEO of indico, which recently announced $4 million in new equity seed funding. “That’s why when we see it in action we’re always like whoa, that’s pretty cool that the math can decipher that.” Deep learning’s capacity for unsupervised learning makes it extremely suitable for analyzing semi-structured and unstructured data. Moreover, when it’s leveraged on the enormous datasets required for speech, image, or even video analysis, it provides these benefits at scale at speeds equal to modern business timeframes.

Although this unsupervised aspect of deep learning is one of its more renowned, it’s important to realize that deep learning is actually an advanced form of classic machine learning. As such, it was spawned from the latter despite the fact that its learning capabilities vastly exceed those of traditional machine learning. Nonetheless, there are still enterprise tasks which are suitable for traditional machine learning, and others which require deep learning. “People are aware now that there’s a difference between machine learning and deep learning, and they’re excited about the use cases deep learning can help,” Razorthink VP of Marketing Barbara Reichert posited. “We understand the value of hybrid models and how to apply both deep learning and machine learning so you get the right model for whatever problem you’re trying to solve.”

Whereas deep learning is ideal for analyzing big data sets with vast amounts of variables, classic machine learning persists in simpler tasks. A good example of this fact is its utility in data management staples such as data discovery, in which it can determine relationships between data and use cases. “Once the data is sent through those [machine learning algorithms] the relationships are predicted,” commented Io-Tahoe Chief Technology Officer Rohit Mahajan. “That’s where we have to fine-tune a patented data set that will actually predict the right relationships with the right confidence.”

Data Science
An examination of the spending on AI companies and their technologies certainly illustrates a prioritization of deep learning’s worth to contemporary organizations. It directly impacts some of the more sophisticated elements of AI including robotics, computer vision, and user interfaces based on natural language and speech. However, it also provides unequivocally tangible business value in its analysis of unstructured data, sizable data sets, and the conflation of the two. Additionally, by applying these assets of deep learning to common data modeling needs, it can automate and accelerate certain facets of data science that had previously proved perplexing to organizations.

“Applications in the AI space are making it such that you don’t need to be a data science expert,” Wilde said. “It’s helpful if you kind of understand it at a high level, and that’s actually improved a lot. But today, you don’t need to be a data scientist to use these technologies.”

Source by jelaniharper