Jan 17, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)




More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> 10 Techniques to Boost Your Data Modeling by analyticsweek

>> BDAS Analytics Suite Blends Big Data With HR by analyticsweekpick

>> Video: R and Python in Azure HDInsight by analyticsweekpick

Wanna write? Click Here


 Choosing the Right Data Catalog for Your Business – insideBIGDATA Under  Business Analytics

 AMA Updates Population Health Tool to Improve Patient Care Access – Health IT Analytics Under  Health Analytics

 Global Streaming Analytics Market 2018 Revenue, Potential Growth, Analysis, Price, Market Share, Growth Rate … – The West Chronicle (press release) (blog) Under  Streaming Analytics

More NEWS? Click Here


Applied Data Science: An Introduction


As the world’s data grow exponentially, organizations across all sectors, including government and not-for-profit, need to understand, manage and use big, complex data sets—known as big data…. more


Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython


Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored f… more


Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric: the metric that matters most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.


Q: What does NLP stand for?
A: Natural Language Processing:
* The interaction between computers and human (natural) languages
* Involves natural language understanding

Major tasks:
– Machine translation
– Question answering: “what’s the capital of Canada?”
– Sentiment analysis: extract subjective information from a set of documents, e.g. to identify trends or public opinion on social media
– Information retrieval
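As a toy illustration of the sentiment-analysis task, here is a minimal lexicon-based scorer in Python. The word lists and scoring rule are invented for illustration; production systems use trained models, but the goal is the same: extracting subjective polarity from text.

```python
# Toy sentiment scorer: count positive vs. negative lexicon words.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great product"))   # -> positive
print(sentiment("terrible service, very sad"))  # -> negative
```

A real system would also handle negation ("not good"), sarcasm, and context, which is why the field leans on learned models rather than fixed word lists.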



Advanced #Analytics in #Hadoop


Subscribe to YouTube


Torture the data, and it will confess to anything. – Ronald Coase


#BigData #BigOpportunity in Big #HR by @MarcRind #JobsOfFuture #Podcast



iTunes  GooglePlay


As recently as 2009 there were only a handful of big data projects and total industry revenues were under $100 million. By the end of 2012 more than 90 percent of the Fortune 500 will likely have at least some big data initiatives under way.

Sourced from: Analytics.CLUB #WEB Newsletter

Best Practices For Building Talent In Analytics


Companies across all industries depend more and more on analytics and insights to run their businesses profitably. But, attracting, managing and retaining talented personnel to execute on those strategies remains a challenge. This is not the case for consumer products heavyweight The Procter & Gamble Company (P&G), which has been at the top of its analytics game for 50 years now.

During the 2014 Retail/Consumer Goods Analytics Summit, Glenn Wegryn, retired associate director of analytics for P&G, shared best practices for building the talent capabilities required to ensure success. A leadership council is in charge of sharing analytics best practices across P&G — breaking down silos to make sure the very best talent is being leveraged to solve the company’s most pressing business issues.

So, what are the characteristics of a great data analyst and where can you find them?

“I always look for people with solid quantitative backgrounds because that is the hardest thing to learn on the job,” said Wegryn.

Combine that with mature communication skills and a talent for business acumen and you’ve got the perfect formula for a great data analyst.

When it comes to sourcing analytics, Wegryn says companies have an important strategic decision to make: Do you build it internally, leveraging resources like consultants and universities? Do you buy it from a growing community of technology solution providers? Or, do you adopt a hybrid model?

“Given the explosion of business analytics programs across the country, your organization should find ample opportunities to tap into those resources,” advised Wegryn.

To retain and nurture your organization’s business analysts, Wegryn recommended creating a career path that grows with them, encouraging talented personnel to advance internally until they reach a trusted CEO-advisory role.

Wegryn also shared key questions an organization should ask to unleash the value of analytics, and suggested that analytics should always start and end with a decision.

“You make a decision in business that leads to action that gleans insights that leads to another decision,” he said. “While the business moves one way, the business analyst works backward in a focused, disciplined and controlled manner.”

Perhaps most importantly, the key to building the talent capability to ensure analytics success came from P&G’s retired chairman, president and CEO Bob McDonald: “… having motivation from the top helps.”

Wegryn agreed: “It really helps when the person at the top of the chain is driven on data.”

The inaugural Retail & Consumer Goods Analytics Summit event was held September 11-12, 2014 at the W Hotel in San Diego, California. The conference featured keynotes from retail and consumer goods leaders, peer-to-peer exchanges and relationship building.

Article originally appeared HERE.

Originally Posted at: Best Practices For Building Talent In Analytics by analyticsweekpick

Avoiding a Data Science Hype Bubble

In this post, Josh Poduska, Chief Data Scientist at Domino Data Lab, advocates for a common taxonomy of terms within the data science industry. The proposed definitions enable data science professionals to cut through the hype and increase the speed of data science innovation. 


The noise around AI, data science, machine learning, and deep learning is reaching a fever pitch. As this noise has grown, our industry has experienced a divergence in what people mean when they say “AI”, “machine learning”, or “data science”. It can be argued that our industry lacks a common taxonomy. If there is a taxonomy, then we, as data science professionals, have not done a very good job of adhering to it. This has consequences: a hype bubble that leads to unrealistic expectations, and an increasing inability to communicate, especially with non-data-science colleagues. In this post, I’ll cover concise definitions and then argue why it is vital to our industry that we be consistent in how we define terms like “AI”.

Concise Definitions

  • Data Science: A discipline that uses code and data to build models that are put into production to generate predictions and explanations.
  • Machine Learning: A class of algorithms or techniques for automatically capturing complex data patterns in the form of a model.
  • Deep Learning: A class of machine learning algorithms that uses neural networks with more than one hidden layer.
  • AI: A category of systems that operate in a way that is comparable to humans in the degree of autonomy and scope.


Our terms have a lot of star power. They inspire people to dream and imagine a better world, which leads to the terms being overused. More buzz around our industry raises the tide that lifts all boats, right? Sure, we all hope the tide will continue to rise. But we should work for a sustainable rise and avoid a hype bubble that will create widespread disillusionment if it bursts.

I recently attended Domino’s rev conference, a summit for data science leaders and practitioners. I heard multiple leaders seeking advice on how to help executives, mid-level managers, and even new data scientists have proper expectations of data science projects without sacrificing enthusiasm for data science. Unrealistic expectations slow down progress by deflating the enthusiasm when projects yield less than utopian results. They also make it harder than it should be to agree on project success metrics and ROI goals.

The frequent overuse of “AI” when referring to any solution that makes any kind of prediction has been a major cause of this hype. Because of frequent overuse, people instinctively associate data science projects with near perfect human-like autonomous solutions. Or, at a minimum, people perceive that data science can easily solve their specific predictive need, without any regard to whether their organizational data will support such a model.


Incorrect use of terms also gums up conversations. This can be especially damaging in the early planning phases of a data science project, when a cross-functional team assembles to articulate goals and design the end solution. I know a data science manager who requires his team of data scientists to be literally locked in a room for an hour with business leaders before he will approve any new data science project. Okay, the door is not literally locked, but it is shut, and he does require them to discuss the project for a full hour. They’ve seen a reduction in project rework as they’ve focused on early alignment with business stakeholders. The challenge of explaining data science concepts is hard enough as it is. We only make it harder when we can’t define our own terms.

I’ve been practicing data science for a long time now. I’ve worked with hundreds of analytical leaders and practitioners from all over the world. Since AI and deep learning came on the scene, I’ve increasingly had to pause conversations and ask questions to discover what people really mean when they use certain terms. For example, how would you interpret these statements which are based on conversations I’ve had?

  • “Our goal is to make our solution AI-driven within 5 years.”
  • “We need to get better at machine learning before we invest in deep learning.”
  • “We use AI to predict fraud so our customers can spend with confidence.”
  • “Our study found that organizations investing in AI realize a 10% revenue boost.”

Confusing, right?

One has to ask a series of questions to be able to understand what is really going on.

The most common term-confusion I hear is when someone talks about AI solutions, or doing AI, when they really should be talking about building a deep learning or machine learning model. It seems that far too often the interchange of terms is on purpose, with the speaker hoping to get a hype-boost by saying “AI”. Let’s dive into each of the definitions and see if we can come to an agreement on a taxonomy.

Data Science

First of all, I view data science as a scientific discipline, like any other scientific discipline. Take biology, for example. Biology encompasses a set of ideas, theories, methods, and tools. Experimentation is common. The biological research community is continually adding to the discipline’s knowledge base. Data science is no different. Practitioners do data science. Researchers advance the field with new theory, concepts, and tools.

The practice of data science involves marrying code (usually some statistical programming language) with data to build models. This includes the important and dominant initial steps of data acquisition, cleansing, and preparation. Data science models usually make predictions (e.g., predict loan risk, predict disease diagnosis, predict how to respond to a chat, predict what objects are in an image). Data science models can also explain or describe the world for us (e.g., which combination of factors are most influential in making a disease diagnosis, which customers are most similar to each other and how). Finally, these models are put into production to make predictions and explanations when applied to new data. Data science is a discipline that uses code and data to build models that are put into production to generate predictions and explanations.
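To make this definition concrete, here is a minimal sketch of that life cycle, assuming scikit-learn is available. The loan-risk features and numbers are invented for illustration; the point is the shape of the pipeline: code plus data produces a model, and the model is persisted and applied to new data.

```python
# Minimal sketch of "code + data -> model -> production predictions".
import pickle
from sklearn.linear_model import LogisticRegression

# Historical data: [income, debt_ratio] -> loan default (1) or not (0)
X_train = [[40, 0.9], [85, 0.2], [30, 0.8], [95, 0.1], [50, 0.7], [70, 0.3]]
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)

# "Put into production": persist the model, then apply it to new data.
blob = pickle.dumps(model)
deployed = pickle.loads(blob)
predictions = deployed.predict([[90, 0.15], [35, 0.85]])
print(predictions)
```

In a real deployment the persisted model would sit behind a batch job or a service endpoint, scoring thousands of new records over time, which is exactly the "applied to new data" property the definition hinges on.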

It can be difficult to craft a definition for data science while, at the same time, distinguishing it from statistical analysis. I came to the data science profession via educational training in math and statistics as well as professional experience as a statistician. Like many of you, I was doing data science before it was a thing.

Statistical analysis is based on samples, controlled experiments, probabilities, and distributions. It usually answers questions about likelihood of events or the validity of statements. It uses different algorithms like t-test, chi-square, ANOVA, DOE, response surface designs, etc. These algorithms sometimes build models too. For example, response surface designs are techniques to estimate the polynomial model of a physical system based on observed explanatory factors and how they relate to the response factor.
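For contrast, a classical statistical-analysis step might look like the following, a two-sample t-test with SciPy on synthetic data. It answers a question about the validity of a statement ("do these two groups differ?") rather than producing a model that gets put into production.

```python
# Classical statistical analysis: a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)   # e.g. control group
group_b = rng.normal(loc=11.5, scale=2.0, size=50)   # e.g. treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The output is a probability statement about the data at hand; nothing here is saved and re-applied to future observations.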

One key point in my definition is that data science models are applied to new data to make future predictions and descriptions, or “put into production”. While it is true that response surface models can be used on new data to predict a response, it is usually a hypothetical prediction about what might happen if the inputs were changed. The engineers then change the inputs and observe the responses that are generated by the physical system in its new state. The response surface model is not put into production. It does not take new input settings by the thousands, over time, in batches or streams, and predict responses.

My data science definition is by no means fool-proof, but I believe putting predictive and descriptive models into production starts to capture the essence of data science.

Machine Learning

Machine learning as a term goes back to the 1950s. Today, it is viewed by data scientists as a set of techniques that are used within data science. It is a toolset or a class of techniques for building the models mentioned above. Instead of a human explicitly articulating the logic for a model, machine learning enables computers to generate (or learn) models on their own. This is done by processing an initial set of data, discovering complex hidden patterns in that data, and capturing those patterns in a model so they can be applied later to new data in order to make predictions or explanations. The magic behind this process of automatically discovering patterns lies in the algorithms. Algorithms are the workhorses of machine learning. Common machine learning algorithms include the various neural network approaches, clustering techniques, gradient boosting machines, random forests, and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering. It is a class of tools and techniques with which the discipline is practiced.
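A small sketch of that idea, assuming scikit-learn: no human writes the classification logic; a random forest discovers the pattern (here, XOR, a relationship no single rule on one feature can express) from examples. The data are synthetic and repeated so the forest has something to resample.

```python
# Machine learning: the algorithm captures the pattern, not the programmer.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 25   # repeated XOR inputs
y = [0, 1, 1, 0] * 25                        # XOR labels

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([[0, 1], [1, 1]]))
```

The "learning" is entirely in `fit`: the trees find the interaction between the two features on their own, which is the automatic pattern capture the definition describes.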

Deep Learning

Deep learning is the easiest of these terms to define. Deep learning is a class of machine learning algorithms that uses neural networks with more than one hidden layer. Neural networks themselves date back to the 1950s. Deep learning algorithms first became popular in the 1980s, went through a lull in the 1990s and 2000s, and then saw a revival in our decade thanks to relatively small tweaks in the way deep networks are constructed that proved to have astonishing effects. Deep learning can be applied to a variety of use cases including image recognition, chat assistants, and recommender systems. For example, Google Speech, Google Photos, and Google Search are some of the original solutions built using deep learning.
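A minimal example of "more than one hidden layer", using scikit-learn's MLPClassifier on synthetic data. Real deep learning work uses far larger networks and dedicated frameworks; the toy sizes here just make the definition concrete.

```python
# "More than one hidden layer": a small multi-layer perceptron.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers
                    max_iter=2000, random_state=0).fit(X, y)
print(f"training accuracy: {net.score(X, y):.2f}")
```

With a single hidden layer this would still be a neural network; it is the second hidden layer that makes it "deep" under the definition above.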


AI

AI has been around for a long time, since well before the recent hype storm co-opted it as a buzzword. How do we, as data scientists, define it? When and how should we use it? What is AI to us? Honestly, I’m not sure anyone really knows. This might be our “emperor has no clothes” moment. We have the ambiguity and the resulting hype that comes from the promise of something new and unknown. The CEO of a well-known data science company was recently talking with our team at Domino when he mentioned “AI”. He immediately caught himself and said, “I know that doesn’t really mean anything. I just had to start using it because everyone is talking about it. I resisted for a long time but finally gave in.”

That said, I’ll take a stab at it: AI is a category of systems that people hope to create which have the defining characteristic that they will be comparable to humans in the degree of autonomy and scope of operation.

To extend our analogy, if data science is like biology and machine learning is like genetic engineering, then AI is like disease resistance. It’s the end result, a set of solutions or systems that we are striving to create through the application of machine learning (often deep learning) and other techniques.

Here’s the bottom line. I believe that we need to draw a distinction between techniques that are part of AI solutions, AI-like solutions, and true AI solutions. This includes AI building blocks, solutions with AI-ish qualities, and solutions that approach human autonomy and scope. These are three separate things. People just say “AI” for all three far too often.

For example,

  • Deep learning is not AI. It is a technique that can be used as part of an AI solution.
  • Most data science projects are not AI solutions. A customer churn model is not an AI solution, no matter if it used deep learning or logistic regression.
  • A self driving car is an AI solution. It is a solution that operates with complexity and autonomy that approaches what humans are capable of doing.

Remember those cryptic statements from above? In each case I asked questions to figure out exactly what was going on under the hood. Here is what I found.

  • An executive said: “Our goal is to make our solution AI-driven within 5 years.”
    The executive meant: “We want to have a couple machine learning models in production within 5 years.”
  • A manager said: “We need to get better at machine learning before we invest in deep learning.”
    The manager meant: “We need to train our analysts in basic data science principles before we are ready to try deep learning approaches.”
  • A marketer said: “We use AI to predict fraud so our customers can spend with confidence.”
    The marketer meant: “Our fraud score is based on a logistic regression model that has been working well for years.”
  • An industry analyst said: “Our study found that organizations investing in AI realize a 10% revenue boost.”
    The industry analyst meant: “Organizations that have any kind of predictive model in production realize a 10% revenue boost.”

The Ask

Whether you 100% agree with my definitions or not, I think we can all agree that there is too much hype in our industry today, especially around AI. Each of us has seen how this hype limits real progress. I argue that a lot of the hype is from misuse of the terms of data science. My ask is that, as data science professionals, we try harder to be conscious of how we use these key terms, and that we politely help others who work with us learn to use these terms in the right way. I believe that the quicker we can iterate to an agreed-upon taxonomy and insist on adherence to it, the quicker we can cut through hype and increase our speed of innovation as we build the solutions of today and tomorrow.

The post Avoiding a Data Science Hype Bubble appeared first on Data Science Blog by Domino.

Source: Avoiding a Data Science Hype Bubble by analyticsweek

Jan 10, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)




More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> 6 Best Practices for Maximizing Big Data Value by analyticsweekpick

>> To Trust A Bot or Not? Ethical Issues in AI by tony

>> Inside CXM: New Global Thought Leader Hub for Customer Experience Professionals by bobehayes

Wanna write? Click Here


 Sales Performance Management (SPM) Market 2018-2025: CAGR, Top Manufacturers, Drivers, Trends, Challenges … – The Dosdigitos (press release) (blog) Under  Sales Analytics

 Teradata Vantage brings multiple data sources on one platform, hope to become core of large enterprises – The Indian Express Under  Talent Analytics

 Duke Engineering Establishes Big Data, Precision Medicine Center – Health IT Analytics Under  Big Data Analytics

More NEWS? Click Here


Pattern Discovery in Data Mining


Learn the general concepts of data mining along with basic methodologies and applications. Then dive into one subfield in data mining: pattern discovery. Learn in-depth concepts, methods, and applications of pattern disc… more


On Intelligence


Jeff Hawkins, the man who created the PalmPilot, Treo smart phone, and other handheld devices, has reshaped our relationship to computers. Now he stands ready to revolutionize both neuroscience and computing in one strok… more


Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric: the metric that matters most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.


Q:What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset
A: Outliers:
– An observation point that is distant from other observations
– Can occur by chance in any distribution
– Often, they indicate measurement error or a heavy-tailed distribution
– Measurement error: discard them or use robust statistics
– Heavy-tailed distribution: high skewness, can’t use tools assuming a normal distribution
– Three-sigma rule (normally distributed data): about 1 in 22 observations will differ from the mean by more than twice the standard deviation
– Three-sigma rule: about 1 in 370 observations will differ from the mean by more than three times the standard deviation

Three-sigma rule example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number and hence within 1 standard deviation of the expected number (Poisson distribution).

If the nature of the distribution is known a priori, it is possible to see whether the number of outliers deviates significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers can be approximated with a Poisson distribution with lambda = pn. Example: if one takes a normal distribution with a cutoff 3 standard deviations from the mean, p ≈ 0.3%, and thus in 1000 observations we can approximate the number of samples whose deviation exceeds 3 sigmas by a Poisson with lambda ≈ 3.
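The arithmetic can be checked directly with SciPy (note that 0.3% is a rounding of the exact two-sided 3-sigma tail probability, which is about 0.27%):

```python
# Check the three-sigma Poisson approximation for n = 1000 samples.
from scipy import stats

p = 2 * (1 - stats.norm.cdf(3))       # two-sided 3-sigma tail, ~0.0027
lam = p * 1000                         # expected outlier count in 1000 samples
print(f"p = {p:.4f}, lambda = {lam:.2f}")
print(f"P(5 or fewer outliers) = {stats.poisson.cdf(5, lam):.3f}")
```

Seeing up to 5 such observations is thus unremarkable, which is the point made in the example above.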

Identifying outliers:
– No rigid mathematical method
– Subjective exercise: be careful
– Boxplots
– QQ plots (sample quantiles vs. theoretical quantiles)

Handling outliers:
– Depends on the cause
– Retention: when the underlying model is confidently known
– Regression problems: only exclude points which exhibit a large degree of influence on the estimated coefficients (Cook’s distance)

Inliers:
– An observation lying within the general distribution of other observed values
– Doesn’t perturb the results but is non-conforming and unusual
– Simple example: an observation recorded in the wrong unit (°F instead of °C)

Identifying inliers:
– Mahalanobis distance
– Used to calculate the distance between two random vectors
– Difference from Euclidean distance: it accounts for correlations
– Discard them
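A screening sketch combining both ideas, assuming NumPy is available; the data and the injected outlier are synthetic:

```python
# Screen with z-scores (univariate) and Mahalanobis distance (multivariate).
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(50, 5, size=(100, 2))
data[0] = [95, 95]                       # inject an obvious outlier

# Univariate screen: flag rows with any |z| > 3
z = (data - data.mean(axis=0)) / data.std(axis=0)
outliers = np.where((np.abs(z) > 3).any(axis=1))[0]
print("z-score outliers:", outliers)

# Multivariate screen: Mahalanobis distance from the mean,
# using the inverse covariance so correlations are accounted for.
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = data - data.mean(axis=0)
md = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
print("largest Mahalanobis distance at row:", md.argmax())
```

A wrong-unit inlier would not show up on the z-score screen but can stand out in Mahalanobis distance when it breaks the correlation structure of the data.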



#FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership


Subscribe to YouTube


Information is the oil of the 21st century, and analytics is the combustion engine. – Peter Sondergaard


#FutureOfData with Rob(@telerob) / @ConnellyAgency on running innovation in agency



iTunes  GooglePlay


And one of my favourite facts: At the moment less than 0.5% of all data is ever analysed and used, just imagine the potential here.

Sourced from: Analytics.CLUB #WEB Newsletter

Investment banks recruit for rise of big data analytics

Big data, or the large pools of data that can be captured, processed and then analysed, is now reaching into every sector and function of the global economy.

Financial services businesses, including the investment banks, generate and store more data than any other business in any other sector – broadly because it is such a transaction-heavy industry, often driven by models and algorithms.

Despite accumulating a wealth of information on capital market transactions, trades, financial markets, and other client and market data, the investment banks have been slower to embrace today’s definition of big data analytics than many consumer retail businesses, technology businesses, and even retail banking.

Organisations such as Amazon, Google, eBay and the UK’s big four supermarkets have been using big data analytics for many years, tracking consumer behaviour to suggest potential new products to consumers and develop customer loyalty schemes. Where investment banks have used big data, it has often been restricted to tracking individual sub-categories of asset classes.

The UK’s high-street banks have also been increasingly active in this area, using data analytics to study purchasing patterns, social media and location data, in order to tailor products and associated marketing material to individual customers’ needs.

Using big data analytics to increase profitability

The investment banks are now looking at how they can use big data to do what they do better, faster and more efficiently.

Senior executives at the banks want to enhance how they use data to raise profitability, map out markets and company-wide exposures, and ultimately win more deals.

While banks have, for many years, used data and value at risk modelling to measure and quantify the level of financial risk in a portfolio of assets, the fundamental difference with big data is that it has become an established standalone functional department rather than a series of small subsets of internal business units.

Big-data teams are now taking on the role of an influential internal consultancy, communicating to senior executives key insights on how to improve profitability.

Another key difference is that the banks are now not only analysing structured data, such as market or trading data, but also unstructured data, which can include sources such as tweets, blogs, Facebook posts and marketing material. This is now collected and recorded from a bank’s customers or clients – a significant shift from how data used to be captured.

Using large amounts of both structured and unstructured data and market data, the investment banks are now accurately modelling the outcome of investment decisions, and getting real-time insights into client demand.

Big data is also a fundamental element of risk-profiling for the banks, enabling data analysts to immediately assess the impact of the escalation in geopolitical risk on portfolios and their exposure to specific markets and asset classes. Specifically, banks have now built systems that will map out market-shaping past events in order to identify future patterns.

We are also seeing the banks using big data to analyse the effectiveness of their deals, looking for insights into which trades they did or did not win on a client-by-client basis.

But despite the recent growth in the use of big data by the banks, key challenges remain.

Unlike retail and technology giants such as Google, Facebook and Amazon, or any new startup or fintech company, the IT and data systems at most banks were not originally constructed to analyse structured and unstructured data. Updating and remodelling entire IT and data systems to accommodate the systems needed to generate a deep analysis of a bank’s data is time-consuming and costly.

Banks that have merged or acquired other banks or financial services businesses are likely to face even more complex issues when incorporating and updating legacy IT systems.

Surge in hiring big data analytics specialists

The competition between banks and fund managers to hire big data specialists is heating up.

The banks are actively recruiting big data and analytics specialists to fill two main but significantly different roles: big data engineers, and data scientists (analytics/insights).

Big data engineers will typically come from a strong IT development or coding background and are responsible for designing data platforms and applications. A big data engineer can typically command £55,000 a year and may also be known as a software engineer – big data, big data software architect or Hadoop developer.

Data scientists, in contrast, are responsible for bridging the gap between data analytics and business decision-making, capable of translating complex data into key strategy insight.

Data scientists – also known as analytics and insights manager or director of data science – are expected to have sharp technical and quantitative skills. Data scientists are in highest demand and this is where the biggest skill shortage exists.

Data scientists are responsible for examining the data, identifying key trends, and writing the complex algorithms that will see the raw data transformed into a piece of analysis or insight that the business can use to gain a competitive advantage.

Such is the shortage of individuals with this skillset that a data scientist can command between £75,000 and £110,000 a year straight out of university.

Big data teams will often be competing to hire from the same pool of mathematics and physics PhDs from which other areas of the investment bank will be hiring.

Christopher Adeyeri is associate director – head of technology at recruitment firm Astbury Marsden

Originally posted via “Investment banks recruit for rise of big data analytics”

Source by analyticsweekpick

Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels


The datasets we encounter as scientists, analysts, and data nerds are increasingly complex. Much of machine learning is focused on extracting meaning from complex data. However, there is still a place for us lowly humans: the human visual system is phenomenal at detecting complex structure and discovering subtle patterns hidden in massive amounts of data. Every second that our eyes are open, countless data points (in the form of light patterns hitting our retinas) are pouring into visual areas of our brain. And yet, remarkably, we have no problem at all recognizing a neat looking shell on a beach, or our friend’s face in a large crowd. Our brains are “unsupervised pattern discovery aficionados.”

On the other hand, there is at least one major drawback to relying on our visual systems to extract meaning from the world around us: we are essentially capped at perceiving just 3 dimensions at a time, and many datasets we encounter today are higher dimensional.

So, the question of the hour is: how can we harness the incredible pattern-recognition superpowers of our brains to visualize complex and high-dimensional datasets?

Dimensionality Reduction

In comes dimensionality reduction, stage right. Dimensionality reduction is just what it sounds like: transforming a high-dimensional dataset into a lower-dimensional dataset. For example, take this UCI ML dataset on Kaggle comprising observations about mushrooms, organized as a big matrix. Each row comprises a bunch of features of the mushroom, like cap size, cap shape, cap color, odor, etc. The simplest way to do dimensionality reduction might be to simply ignore some of the features (e.g. pick your favorite three—say size, shape, and color—and ignore everything else). However, this is problematic if the features you drop contain valuable diagnostic information (e.g. whether the mushrooms are poisonous).

A more sophisticated approach is to reduce the dimensionality of the dataset by considering only its principal components, i.e. the combinations of features that explain the most variance in the dataset. Using a technique called principal components analysis (PCA), we can reduce the dimensionality of a dataset while preserving as much of its precious variance as possible. The key intuition is that we can create a new, smaller set of features, where each new feature is some combination of the old features. For example, one of these new features might reflect a mix of shape and color, and another might reflect a mix of size and poisonousness. In general, each new feature will be constructed from a weighted mix of the original features.
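The same idea can be sketched outside of HyperTools with scikit-learn; the random array and component count below are purely illustrative:

```python
# Minimal PCA sketch (not the HyperTools API): reduce a toy
# high-dimensional dataset to 3 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 observations, 10 features

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # now 100 observations, 3 features

# explained_variance_ratio_ reports how much variance each new
# component preserves from the original 10 features
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

Each of the three new columns is a weighted mix of the original ten, exactly as described above.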

Below is a figure to help with the intuition. Imagine that you had a 3 dimensional dataset (left), and you wanted to reduce it to a 2 dimensional dataset (right). PCA finds the principal axes in the original 3D space where the variance between points is the highest. Once we identify the two axes that explain the most variance (the black lines in the left panel), we can re-plot the data along just those axes, as shown on the right. Our 3D dataset is now 2D. Here we have chosen a low-dimensional example so we could visualize what is happening. However, this technique can be applied in the same way to higher-dimensional datasets.

We created the HyperTools package to facilitate these sorts of dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. HyperTools is designed with ease of use as a primary objective. We highlight two example use cases below.

Mushroom foraging with HyperTools: Visualizing static ‘point clouds’

First, let’s explore the mushrooms dataset we referenced above. We start by importing the relevant libraries:

import pandas as pd
import hypertools as hyp

and then we read in our data into a pandas DataFrame:

data = pd.read_csv('../input/mushrooms.csv')
  class cap-shape cap-surface cap-color bruises odor gill-attachment
0     p         x           s         n       t    p               f
1     e         x           s         y       t    a               f
2     e         b           s         w       t    l               f
3     p         x           y         w       t    p               f
4     e         x           s         g       f    n               f
5     e         x           y         y       t    a               f

Each row of the DataFrame corresponds to a mushroom observation, and each column reflects a descriptive feature of the mushroom (only some of the rows and columns are shown above). Now let’s plot the high-dimensional data in a low dimensional space by passing it to HyperTools. To handle text columns, HyperTools will first convert each text column into a series of binary ‘dummy’ variables before performing the dimensionality reduction. For example, if the ‘cap size’ column contained ‘big’ and ‘small’ labels, this single column would be turned into two binary columns: one for ‘big’ and one for ‘small’, where 1s represents the presence of that feature and 0s represents the absence (for more on this, see the documentation for the get_dummies function in pandas).
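The dummy-coding step itself can be illustrated directly with pandas (the 'cap size' column below is a toy stand-in, mirroring the example in the text):

```python
import pandas as pd

# a toy text column, like the ones HyperTools dummy-codes internally
df = pd.DataFrame({'cap size': ['big', 'small', 'big']})

# one binary column per category; dtype=int gives 0/1 indicators
dummies = pd.get_dummies(df['cap size'], dtype=int)
# columns: 'big' and 'small', with 1 marking the presence of that label
```

HyperTools applies this transformation to every text column before reducing the dimensionality.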

hyp.plot(data, 'o')

In plotting the DataFrame, we are effectively creating a three-dimensional “mushroom space,” where mushrooms that exhibit similar features appear as nearby dots, and mushrooms that exhibit different features appear as more distant dots. By visualizing the DataFrame in this way, it becomes immediately clear that there are multiple clusters in the data. In other words, all combinations of mushroom features are not equally likely, but rather certain combinations of features tend to go together. To better understand this space, we can color each point according to some feature in the data that we are interested in knowing more about. For example, let’s color the points according to whether the mushrooms are (p)oisonous or (e)dible (the 'class' column):

class_labels = data['class']  # the (p)oisonous / (e)dible column
hyp.plot(data, 'o', group=class_labels, legend=list(set(class_labels)))

Visualizing the data in this way highlights that mushrooms’ poisonousness appears stable within each cluster (i.e. among mushrooms that have similar features), but varies across clusters. In addition, it looks like there are a number of distinct clusters, some poisonous and some edible. We can explore this further by using the ‘cluster’ feature of HyperTools, which colors the observations using k-means clustering. In the description of the dataset, it was noted that there were 23 different types of mushrooms represented in this dataset, so we’ll set the n_clusters parameter to 23:

hyp.plot(data, 'o', n_clusters=23)

To gain access to the cluster labels, the clustering tool may be called directly using hyp.tools.cluster, and the resulting labels may then be passed to hyp.plot:

cluster_labels = hyp.tools.cluster(data, n_clusters=23)
hyp.plot(data, group=cluster_labels)
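For readers without HyperTools installed, a rough equivalent of this dummy-code-then-cluster pipeline can be sketched with pandas and scikit-learn (the toy DataFrame and k=3 below are illustrative, not the actual mushrooms data):

```python
import pandas as pd
from sklearn.cluster import KMeans

# toy stand-in for the mushrooms DataFrame: 100 rows of text features
data = pd.DataFrame({'cap-shape': list('xxbxb' * 20),
                     'odor': list('paalp' * 20)})

# dummy-code the text columns, then cluster the binary feature matrix
X = pd.get_dummies(data).to_numpy()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_  # one integer cluster label per row
```

The resulting labels can be passed to a plotting routine the same way the hyp.tools.cluster output is passed to hyp.plot above.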

By default, HyperTools uses PCA to do dimensionality reduction, but with a few additional lines of code we can use other dimensionality reduction methods by directly calling the relevant functions from sklearn. For example, we can use t-SNE to reduce the dimensionality of the data using:

from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=3)
reduced_data_TSNE = TSNE_model.fit_transform(hyp.tools.df2mat(data))
hyp.plot(reduced_data_TSNE,'o', group=class_labels, legend=list(set(class_labels)))

Different dimensionality reduction methods highlight or preserve different aspects of the data. A repository containing additional examples (including different dimensionality reduction methods) may be found here.

The data expedition above provides one example of how the geometric structure of data may be revealed through dimensionality reduction and visualization. The observations in the mushrooms dataset formed distinct clusters, which we identified using HyperTools. Explorations and visualizations like this could help guide analysis decisions (e.g. whether to use a particular type of classifier to discriminate poisonous vs. edible mushrooms). If you’d like to play around with HyperTools and the mushrooms dataset, check out and fork this Kaggle Kernel!

Climate science with HyperTools: Visualizing dynamic data

Whereas the mushrooms dataset comprises static observations, here we will take a look at some global temperature data, which will showcase how HyperTools may be used to visualize timeseries data using dynamic trajectories.

This next dataset is made up of monthly temperature recordings from a sample of 20 global cities over the 138-year interval from 1875 to 2013. To prepare this dataset for analysis with HyperTools, we created a time by cities matrix, where each row holds the temperature recordings for one month, and each column holds the temperature values for one city. You can replicate this demo by using the Berkeley Earth Climate Change dataset on Kaggle or by cloning this GitHub repo. To visualize temperature changes over time, we will use HyperTools to reduce the dimensionality of the data, and then plot the temperature changes over time as a line:
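As a sketch of that reshaping step, a long table of per-city monthly records can be pivoted into a time-by-cities matrix with pandas (the column names and values below are illustrative, not necessarily those in the Kaggle CSV):

```python
import pandas as pd

# long format: one row per (month, city) temperature record
records = pd.DataFrame({
    'dt':   ['1875-01', '1875-01', '1875-02', '1875-02'],
    'City': ['Paris', 'Cairo', 'Paris', 'Cairo'],
    'AverageTemperature': [2.1, 14.3, 3.0, 15.1],
})

# wide format: rows are months, columns are cities
temps = records.pivot(index='dt', columns='City',
                      values='AverageTemperature')
```

The resulting `temps` matrix is what gets handed to hyp.plot below.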

hyp.plot(temps)
Well, that just looks like a hot mess, now doesn’t it? However, we promise there is structure in there, so let’s find it! Because each city is in a different location, the mean and variance of its temperature timeseries may be higher or lower than those of the other cities. This will in turn affect how much that city is weighted when dimensionality reduction is performed. To normalize the contribution of each city to the plot, we can set the normalize flag (default value: False). Setting normalize='across' will normalize (z-score) each column of the data. HyperTools incorporates a number of useful normalization options, which you can read more about here.
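A minimal NumPy sketch of what z-scoring each column amounts to (toy values; HyperTools handles this internally when normalize='across' is set):

```python
import numpy as np

# rows are months, columns are cities (toy values)
temps = np.array([[2.0, 14.0],
                  [4.0, 16.0],
                  [6.0, 18.0]])

# z-score each column: subtract its mean, divide by its std,
# so every city contributes on the same scale
normalized = (temps - temps.mean(axis=0)) / temps.std(axis=0)
```

After this step every column has mean 0 and unit variance, so no single city dominates the dimensionality reduction.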

hyp.plot(temps, normalize='across')

Now we’re getting somewhere! Rotating the plot with the mouse reveals an interesting shape to this dataset. To help highlight the structure and understand how it changes over time, we can color the lines by year, where redder lines indicate earlier timepoints and bluer lines indicate later ones:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r')

Coloring the lines has now revealed two key structural aspects of the data. First, there is a systematic shift from blue to red, indicating a systematic change in the pattern of global temperatures over the years reflected in the dataset. Second, within each year (color), there is a cyclical pattern, reflecting seasonal changes in the temperature patterns. We can also visualize these two phenomena using a two dimensional plot:

hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r', ndims=2)

Now, for the grand finale. In addition to creating static plots, HyperTools can also create animated plots, which can sometimes reveal additional patterns in the data. To create an animated plot, simply pass animate=True to hyp.plot when visualizing timeseries data. If you also pass chemtrails=True, a low-opacity trace of the data will remain in the plot:

hyp.plot(temps, normalize='across', animate=True, chemtrails=True)

That pleasant feeling you get from looking at the animation is called “global warming.”

This concludes our exploration of climate and mushroom data with HyperTools. For more, please visit the project’s GitHub repository, readthedocs site, a paper we wrote, or our demo notebooks.


Andrew is a Cognitive Neuroscientist in the Contextual Dynamics Laboratory. His postdoctoral work integrates ideas from basic learning and memory research with computational techniques used in data science to optimize learning in natural educational settings, like the classroom or online. Additionally, he develops open-source software for data visualization, research and education.

The Contextual Dynamics Lab at Dartmouth College uses computational models and brain recordings to understand how we extract information from the world around us. You can learn more about us at http://www.context-lab.com.

Originally Posted at: Exploring the Structure of High-Dimensional Data with HyperTools in Kaggle Kernels by analyticsweek

Jan 03, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)





[ AnalyticsWeek BYTES]

>> @BrianHaugli @The_Hanover on Building a #Leadership #Security #Mindset by v1shal

>> Big Data Winning Strategy For Enterprises: Start Small by d3eksha

>> The convoluted world of data scientist by v1shal



R Basics – R Programming Language Introduction


Learn the essentials of R Programming – R Beginner Level!… more


Hypothesis Testing: A Visual Introduction To Statistical Significance


Statistical significance is a way of determining if an outcome occurred by random chance, or did something cause that outcome to be different than the expected baseline. Statistical significance calculations find their … more


Winter is coming, warm your Analytics Club
Yes and yes! As we head into winter, what better time to talk about our increasing dependence on data analytics in decision making. Data- and analytics-driven decision making is rapidly working its way into our core corporate DNA, yet we are not building practice grounds to test those models fast enough. Snug-looking models can hide nails that cause uncharted pain if left unchecked. This is the right time to start thinking about putting an Analytics Club [Data Analytics CoE] in your workplace, to lab out best practices and provide a test environment for those models.


Q:Give examples of bad and good visualizations?
A: Bad visualization:
– Pie charts: difficult to compare items when area encodes the values, especially when there are many items
– Color choice for classes: heavy use of red, orange, and blue. Readers may assume the colors mean good (blue) versus bad (orange and red) when they are merely associated with particular segments
– 3D charts: can distort perception and therefore skew the data
– Dashed and dotted lines in a line chart: they can be distracting; a solid line is easier to follow

Good visualization:
– Heat map with a single color: some colors stand out more than others, giving more weight to that data; a single color with varying shades shows the intensity better
– Adding a trend line (regression line) to a scatter plot helps the reader spot trends
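As a sketch of the trend-line tip, a least-squares line can be added to a scatter plot with NumPy and matplotlib (synthetic data; the output filename is illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# synthetic data with a known upward trend plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + rng.normal(scale=2.0, size=x.size)

# degree-1 least-squares fit gives the trend (regression) line
slope, intercept = np.polyfit(x, y, deg=1)

plt.scatter(x, y, s=10)
plt.plot(x, slope * x + intercept, color='black')
plt.savefig('scatter_trend.png')
```

The fitted line makes the underlying trend visible at a glance, which is the point of the tip above.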



#BigData @AnalyticsWeek #FutureOfData #Podcast with @MPFlowersNYC, @enigma_data



The most valuable commodity I know of is information. – Gordon Gekko


@JohnTLangton from @Wolters_Kluwer discussed his #AI Lead Startup Journey #FutureOfData #Podcast





Three-quarters of decision-makers (76 per cent) surveyed anticipate significant impacts in the domain of storage systems as a result of the “Big Data” phenomenon.

Sourced from: Analytics.CLUB #WEB Newsletter