Aug 29, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Data interpretation  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Landscape of Big Data by v1shal

>> The Future of Big Data? Three Use Cases of Prescriptive Analytics by analyticsweekpick

>> Data Monetization Workshop 2018: Key Themes & Takeaways by analyticsweek

Wanna write? Click Here

[ FEATURED COURSE]

Artificial Intelligence

image

This course includes interactive demonstrations which are intended to stimulate interest and to help students gain intuition about how artificial intelligence methods work under a variety of circumstances…. more

[ FEATURED READ]

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking

image

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for e… more

[ TIPS & TRICKS OF THE WEEK]

Finding success in data science? Find a mentor
Yes, most of us don't feel the need, but most of us really could use one. Because most data science professionals work in isolation, getting an unbiased perspective is not easy. It is also often hard to see how a data science career will progress. A network of mentors addresses these issues: it gives data professionals an outside perspective and an unbiased ally. It's extremely important for successful data science professionals to build a mentor network and use it throughout their careers.

[ DATA SCIENCE Q&A]

Q:What is cross-validation? How to do it right?
A: It's a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to define a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.

Examples: leave-one-out cross validation, K-fold cross validation

How to do it right?

– The training and validation data sets have to be drawn from the same population
– Example: predicting stock prices. If a model is trained on a certain 5-year period, it is unrealistic to treat the subsequent 5-year period as a draw from the same population
– A common mistake: model-selection steps, for instance choosing the kernel parameters of an SVM, should be cross-validated as well
Bias-variance trade-off for k-fold cross validation:

Leave-one-out cross-validation: gives approximately unbiased estimates of the test error, since each training set contains almost the entire data set (n − 1 observations).

But: we average the outputs of n fitted models, each of which is trained on an almost identical set of observations, hence the outputs are highly correlated. Since the variance of a mean of quantities increases when the correlation between those quantities increases, the test error estimate from LOOCV has higher variance than the one obtained with k-fold cross-validation.

Typically, we choose k=5 or k=10, as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor high variance.
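For concreteness, here is a minimal k-fold cross-validation sketch in base R. The data and the simple linear model are hypothetical, chosen only to illustrate the mechanics:

set.seed(42)

# Hypothetical data: y depends linearly on x plus noise
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 3 * dat$x + rnorm(n, sd = 0.5)

k <- 5
# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = n))

cv.errors <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]   # k-1 folds for training
  test  <- dat[folds == i, ]   # held-out fold for validation
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)   # fold MSE
})

mean(cv.errors)   # cross-validated estimate of the test error

The same resampling loop extends naturally to model selection (e.g. tuning SVM kernel parameters): the tuning step simply has to live inside the loop, not outside it.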
Source

[ VIDEO OF THE WEEK]

@BrianHaugli @The_Hanover on Building a #Leadership #Security #Mindset #FutureOfData #Podcast


Subscribe to  Youtube

[ QUOTE OF THE WEEK]

Data matures like wine, applications like fish. – James Governor

[ PODCAST OF THE WEEK]

#GlobalBusiness at the speed of The #BigAnalytics


Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

140,000 to 190,000: the projected shortfall of people with deep analytical skills needed to fill the demand for big data jobs in the U.S. by 2018.

Sourced from: Analytics.CLUB #WEB Newsletter

How to pick the right sample for your analysis

Unless we are lucky enough to have access to an entire population and the capacity to analyse all of that data, we have to make do with samples from our population to make statistical inferences. Choosing a sample that is a good representation of your population is the heart of a quality analysis, as all of the fancy statistical tricks in the world can’t make accurate conclusions from bad or biased data. In this blog post, I will discuss some key concepts in selecting a representative sample.

First things first: What is your population of interest?

Obviously it is a bit tricky to get a representative sample if you’re not sure what population you’re trying to represent, so the first step is to carefully consider this question. Imagine a situation where your company wanted you to assess the mean daily number of page views their website receives. Well, what mean daily page views do they want to know about?

Let’s consider some complications to that question. The first is seasonality. Your website might receive more hits at certain times of year. Do we want to include or exclude these periods from our population? Another consideration is the demographic profile of the people visiting your website. Are we interested in all visitors? Or do we want visitors of a certain sex, age group or region? A final consideration is whether there has been some sort of change in condition that may have increased the visitors to the site. For example, was there an advertising campaign launched recently? Has the website added additional languages which mean a broader audience can access it? Does the website sell a product that now ships to additional places?

Let’s imagine our website is a retail platform that sells children’s toys. We see some seasonal spikes in page views every year at Easter, Christmas and two major sale periods every year (Black Friday and post-Christmas). No major advertising campaigns are planned outside these seasonal periods, nor any changes planned to the site. Our company want to know what the “typical” number of mean daily page views is outside these seasonal periods. They don’t care about individual demographic groups of visitors, they just want the visitors as a whole. Therefore, we need to find a sample that reflects this.

Choosing a representative sample

Sample size

Sample size is a key element to representative sampling as it increases your chances of gaining sufficient information about the population, rather than having your statistics influenced by anomalous observations. For example, imagine if by chance we sampled a much higher than average value. Let’s see how this influences a sample of 30 page views compared to a sample of 10. We’ll generate samples in R from a Poisson distribution with a mean of 220 views per day, and add an outlier of 260 views per day to each (by the way, we use a Poisson distribution as it is the most appropriate distribution to model count data like this):


set.seed(567)

# Sample of 30 (29 from the Poisson distribution and an outlier of 260)
sample1 <- c(rpois(29, lambda = 220), 260)

# Sample of 10 (9 from the Poisson distribution and an outlier of 260)
sample2 <- c(rpois(9, lambda = 220), 260)
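To quantify the difference, the two sample means can be computed directly:

mean(sample1)   # mean of the sample of 30
mean(sample2)   # mean of the sample of 10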

Compared to the population mean, the mean of the sample of 30 is 221.5 whereas the mean of the sample of 10 is 224.4. As you can see, the smaller sample is far more influenced by the extreme value than the larger one.

A sufficient sample size depends on a lot of things. For example, if the event we were trying to describe was rare (e.g., 1 event per 100 days), a sample of 30 would likely be too small to assess its mean occurrence. When conducting hypothesis testing, the correct sample size is generally calculated using power calculations, something I won’t get into here as it can get veeeeery complicated.
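For a quick taste only, base R's power.t.test solves for the per-group sample size given an effect size; the difference of 10 daily views and the standard deviation of 15 below are assumed purely for illustration:

# Sample size per group needed to detect a difference of 10 daily views
# (assumed sd = 15) with 80% power at a 5% significance level
power.t.test(delta = 10, sd = 15, sig.level = 0.05, power = 0.8)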

An additional consideration is that overestimating the required sample size can also be undesirable as there may be time, monetary or even ethical reasons to limit the number of observations collected. For example, the company that asked us to assess page views would likely be unhappy if we spent 100 days collecting information on mean daily page views when the same question could be reliably answered from 30 days of data collection.

Representativeness

However, a sufficient sample size won’t be enough if the data are not representative. Representativeness means that the data are sampled from all observations in the population and excludes anything that is outside the population. Representativeness is violated when the sample is biased to a subset of the population or when the sample includes observations from outside the population of interest.

Let’s simulate the number of page views our website received per day in 2014. As you can see in the R code below, I’ve included increased page views for our peak periods of Easter, Black Friday/Christmas, and the post-Christmas sales.


# Simulate some data for 2014, with mean page views of 220 per day.
days <- seq(as.Date("2014/1/1"), as.Date("2014/12/31"), "days")
page.views <- rpois(365, lambda = 220)
views <- data.frame(days = days, page.views = page.views)

# Bump page views during the peak periods (post-Christmas sales, Easter, Black Friday/Christmas)
views$page.views[views$days >= as.Date("2014/01/01") & views$days <= as.Date("2014/01/10")] <- rpois(10, lambda = 500)
views$page.views[views$days >= as.Date("2014/04/01") & views$days <= as.Date("2014/04/21")] <- rpois(21, lambda = 500)
views$page.views[views$days >= as.Date("2014/11/28") & views$days <= as.Date("2014/12/31")] <- rpois(34, lambda = 500)
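The original figure comes from a ggplot2 chunk; the author's full plotting code is in the gist linked at the end of the post, but a minimal sketch along these lines reproduces the idea:

library(ggplot2)

ggplot(views, aes(x = days, y = page.views)) +
  geom_line() +
  geom_hline(yintercept = 220, linetype = "dotted") +   # non-peak mean of 220 views/day
  labs(x = "Date", y = "Daily page views")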

[Figure: simulated daily page views for 2014, with spikes during the peak periods standing out against the ~220 views/day baseline]

If you look at the graph above, you can see that for most of the year the page views sit fairly consistently around a mean of 220 (as shown in the dotted black line). If we sampled any time outside of the peak periods, we would be pretty safe in assuming we have a representative sample. However, what if we sampled between March 15 and April 14? We would catch some of the Easter peak period in our sample and our sample would no longer represent typical (non-peak) daily page views – instead, we would overestimate our page views by including observations from the peak period population in our sample.

A final thing to consider: the method of measurement

While not part of representative sampling per se, an extremely important and related concept is how the thing you are measuring relates to your concept of interest. Why does our company want to know how many page views we get? Do they specifically want to know how many visitors they receive a day in order to plan things like server demand? Or do they want to extrapolate from number of visitors to comment on something like the popularity of the page? It is important to consider whether the measurement you take is a good reflection of what you are interested in before you make inferences based on your data. This falls under the branch of statistics known as validity, which is again beyond the scope of this post but an extremely interesting topic.

The take away message

I hope this has been a helpful introduction to picking a good sample, and a reminder that even when you have really big data, you can’t escape basic considerations such as what your population is, and whether the variables you have can really answer your question!

Original post here. The full code used to create the figures in this post is located in this gist on my Github page.

Source: How to pick the right sample for your analysis

Aug 22, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Data security  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Secret Sauce to Sustain Business through Technology Driven Era by d3eksha

>> Join View from the Top 500 (VFT-500) to Share with and Learn from your CEM Peers by bobehayes

>> “Putting Data Everywhere”: Leveraging Centralized Business Intelligence for Full-Blown Data Culture by jelaniharper

Wanna write? Click Here

[ FEATURED COURSE]

Applied Data Science: An Introduction


As the world’s data grow exponentially, organizations across all sectors, including government and not-for-profit, need to understand, manage and use big, complex data sets—known as big data…. more

[ FEATURED READ]

Big Data: A Revolution That Will Transform How We Live, Work, and Think


“Illuminating and very timely . . . a fascinating — and sometimes alarming — survey of big data’s growing effect on just about everything: business, government, science and medicine, privacy, and even on the way we think… more

[ TIPS & TRICKS OF THE WEEK]

Grow at the speed of collaboration
Research by Cornerstone OnDemand pointed out the need for better collaboration within the workforce, and the data analytics domain is no different. A rapidly changing and growing industry like data analytics is very difficult for an isolated workforce to keep up with. A good collaborative work environment facilitates a better flow of ideas, improved team dynamics, rapid learning, and an increased ability to cut through the noise. So, embrace collaborative team dynamics.

[ DATA SCIENCE Q&A]

Q:How do you control for biases?
A: * Choose a representative sample, preferably by a random method
* Choose an adequate size of sample
* Identify all confounding factors if possible
* Identify sources of bias and include them as additional predictors in statistical analyses
* Use randomization: by randomly recruiting or assigning subjects in a study, all our experimental groups have an equal chance of being influenced by the same bias

Notes:
– Randomization: in randomized control trials, research participants are assigned by chance, rather than by choice to either the experimental group or the control group.
– Random sampling: obtaining data that is representative of the population of interest
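A minimal sketch of both ideas in base R, using hypothetical subject IDs purely for illustration:

set.seed(123)

subjects <- 1:1000   # hypothetical population of subject IDs

# Random sampling: draw a sample of 100 subjects, each with equal probability
study.sample <- sample(subjects, size = 100)

# Randomization: assign each sampled subject to treatment or control by chance
assignment <- sample(c("treatment", "control"), size = 100, replace = TRUE)
table(assignment)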

Source

[ VIDEO OF THE WEEK]

Pascal Marmier (@pmarmier) @SwissRe discusses running data driven innovation catalyst


Subscribe to  Youtube

[ QUOTE OF THE WEEK]

In God we trust. All others must bring data. – W. Edwards Deming

[ PODCAST OF THE WEEK]

#DataScience Approach to Reducing #Employee #Attrition


Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

More than 200bn HD movies – which would take a person 47m years to watch.

Sourced from: Analytics.CLUB #WEB Newsletter

2018 Trends in Data Governance: Heightened Expectations

Organizations will contend with an abundance of trends impacting data governance in the coming year. The data landscape has effectively become decentralized, producing more data, quicker, than it ever has before. Ventures in the Internet of Things and Artificial Intelligence are reinforcing these trends, escalating the need for consistent data governance. Increasing regulatory mandates such as the General Data Protection Regulation (GDPR) compound this reality.

Other than regulations, the most dominant trend affecting data governance in the new year involves customer experience. The demand to reassure consumers that organizations have effective, secure protocols in place to safely govern their data has never been higher in the wake of numerous security breaches.

According to Stibo Systems Chief Marketing Officer Prashant Bhatia, “Our expectations, both as individuals as well as from a B2B standpoint, are only getting higher. In order for companies to keep up, they’ve got to have [governance] policies in place. And, consumers want to know that whatever data they share with a third party is trusted and secure.”

The distributed nature of consumer experience—and the heightened expectations predicated on it—is just one of the many drivers for homogeneous governance throughout a heterogeneous data environment. Governing that data in a centralized fashion may be the best way of satisfying the decentralized necessities of contemporary data processes because, according to Bhatia:

“Now you’re able to look at all of those different types of data and data attributes across domains and be able to centralize that, cleanse it, get it to the point where it’s usable for the rest of the enterprise, and then share that data out across the systems that need it regardless of where they are.”

Metadata Management Best Practices
The three preeminent aspects of a centralized approach to governing data are the deployment of a common data model, common taxonomies, and “how you communicate that data for…integration,” Bhatia added. Whether integrating (or aggregating) data between different sources within or outside of the enterprise, metadata management plays a crucial role in doing so effectively. The primary advantage metadata yields in this regard is in contextualizing the underlying data to clarify both their meaning and utility. “Metadata is a critical set of attributes that helps provide that overall context as to why a piece of data matters, and how it may or may not be used,” Bhatia acknowledged. Thus, in instances in which organizations need to map to a global taxonomy—such as for inter-organizational transmissions between supply chain partners or to receive data from global repositories established between companies—involving metadata is of considerable benefit.

According to Bhatia, metadata “has to be accounted for in the overall mapping because ultimately it needs to be used or associated with throughout any other business process that happens within the enterprise. It’s absolutely critical because metadata just gives you that much more information for contextualization.” When attempting to integrate or aggregate various decentralized sources, such an approach is also useful. Mapping between varying taxonomies and data models becomes essential when utilizing sources from decentralized environments into a centralized one, as does involving metadata in these efforts. Mapping metadata is so advantageous because “the more data you can have, the more context you can have, the more accurate it is, [and] the better you’re going to be able to use it within a… business process going forward,” Bhatia mentioned.

Regulatory Austerity
Forrester’s 2018 predictions identify the GDPR as one of the fundamental challenges organizations will contend with in the coming year. The GDPR issue is so prominent because it exists at the juncture between a number of data governance trends. It represents the greater need to satisfy consumer expectations as part of governance, alludes to the nexus between governance and security for privacy concerns, and illustrates the overarching importance of regulations in governance programs. The European Union’s GDPR creates stringent mandates about how consumer information is stored and what rights people have regarding data about them. Its penalties are some of the more convincing drivers for formalizing governance practices.

“Once the regulation is in place, you no longer have a choice,” Bhatia remarked about the GDPR. “Whether you are a European company or you have European interactions, the fact of the matter is you’ve got to put governance in place because the integrity of what you’re sending, what you’re receiving, when you’re doing it, and how you’re doing it…All those things no longer becomes a ‘do I need to’, but now ‘I have to’.” Furthermore, the spring 2018 implementation of GDPR highlights the ascending trend towards regulatory compliance—and stiff penalties—associated with numerous vertical industries. Centralized governance measures are a solution for providing greater utility for the data stewardship and data lineage required for compliance.

Data Stewardship
The focus on regulations and distributed computing environments only serves to swell the overall complexity of data stewardship in 2018. However, dealing with decentralized data sources in a centralized manner abets the role of a data steward in a number of ways. Stewards primarily exist to implement and maintain the policies begat from governance councils. Centralizing data management and its governance via the plethora of means available for doing so today (including Master Data Management, data lakes, enterprise data fabrics and others) enable the enterprise to “cultivate the data stewardship aspect into something that’s executable,” Bhatia said. “If you don’t have the tools to actually execute and formalize a governance process, then all you have is a process.” Conversely, the stewardship role is so pivotal because it supervises those processes at the point in which they converge with technological action. “If you don’t have the process and the rules of engagement to allow the tools to do what they need to do, all you have is the technology,” Bhatia reflected. “You don’t have a solution.”

Data Lineage
One of the foremost ways in which data stewards can positively impact centralized data governance—as opposed to parochial, business unit or use case-based governance—is by facilitating data provenance. Doing so may actually be the most valuable part of data stewardship, especially when one considers the impact of data provenance on regulatory compliance. According to Bhatia, provenance factors into “ensuring that what was expected to happen did happen” in accordance to governance mandates. Tracing how data was used, stored, transformed, and analyzed can deliver insight vital to regulatory reporting. Evaluating data lineage is a facet of stewardship that “measures the results and the accuracy [of governance measures] by which we can determine have we remained compliant and have we followed the letter of the law,” commented Bhatia. Without this information gleaned from data provenance capabilities, organizations “have a flawed process in place,” Bhatia observed.

As such, there is a triad between regulations, stewardship, and data provenance. Addressing one of these realms of governance will have significant effects on the other two, especially when leveraging centralized means of effecting the governance of distributed resources. “The ability to have a history of where data came from, where it might have been cleansed and how it might emerge, who it was shared with and when it was shared, all these different transactions and engagements are absolutely critical from a governance and compliance standpoint,” Bhatia revealed.

Governance Complexities
The complexities attending data governance in the next couple of years show few signs of decreasing. Organizations are encountering more data than ever before from a decentralized paradigm characterized by an array of on-premise and cloud architectures that complicate various facets of governance hallmarks such as data modeling, data quality, metadata management, and data lineage. Moreover, data is produced much more celeritously than before with an assortment of machine-generated streaming options. When one considers the expanding list of regulatory demands and soaring consumer expectations for governance accountability, the pressures on this element of data management become even more pronounced. Turning to a holistic, centralized means of mitigating the complexities of today’s data sphere may be the most viable means of effecting data governance.

“As more data gets created the need, which was already high, for having a centralized platform to share data and push it back out, only becomes more important,” Bhatia said.

And, with an assortment of consumers, regulators, and C-level executives evincing a vested interest in this process, organizations won’t have many chances to do so correctly.

Source

Aug 15, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Complex data  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> The real-time machine for every business: Big data-driven market analytics by thomassujain

>> Data Virtualization: A Spectrum of Approaches – A GigaOm Market Landscape Report by analyticsweekpick

>> An Agile Approach to Big Data Management: Intelligent Workload Routing Automation by jelaniharper

Wanna write? Click Here

[ FEATURED COURSE]

Introduction to Apache Spark


Learn the fundamentals and architecture of Apache Spark, the leading cluster-computing framework among professionals…. more

[ FEATURED READ]

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking


Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for e… more

[ TIPS & TRICKS OF THE WEEK]

Grow at the speed of collaboration
Research by Cornerstone OnDemand pointed out the need for better collaboration within the workforce, and the data analytics domain is no different. A rapidly changing and growing industry like data analytics is very difficult for an isolated workforce to keep up with. A good collaborative work environment facilitates a better flow of ideas, improved team dynamics, rapid learning, and an increased ability to cut through the noise. So, embrace collaborative team dynamics.

[ DATA SCIENCE Q&A]

Q:What is better: good data or good models? And how do you define ‘good”? Is there a universal good model? Are there any models that are definitely not so good?
A: * Good data is definitely more important than good models
* If data quality weren't important, organizations wouldn't spend so much time cleaning and preprocessing it!
* Even for scientific purposes, good data (reflected in the design of experiments) is very important

How do you define good?
– Good data: data relevant to the project/task at hand
– Good model: a model relevant to the project/task
– Good model: a model that generalizes well to external data sets

Is there a universal good model?
– No, otherwise there wouldn't be an overfitting problem!
– An algorithm can be universal, but not the model
– A model built on a specific data set in one organization could be ineffective on another data set from the same organization
– Models have to be updated on a somewhat regular basis

Are there any models that are definitely not so good?
– “All models are wrong, but some are useful” – George E. P. Box
– It depends on what you want: predictive power or explanatory power
– If a model offers neither, it is a bad model
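To make the overfitting point concrete, here is a small hedged sketch in R on simulated data (not from the source): a very flexible model that fits one sample almost perfectly can still generalize poorly to new data drawn from the same process.

set.seed(1)

# Simulated training and test sets from the same underlying process
train <- data.frame(x = runif(30))
train$y <- sin(2 * pi * train$x) + rnorm(30, sd = 0.3)
test <- data.frame(x = runif(30))
test$y <- sin(2 * pi * test$x) + rnorm(30, sd = 0.3)

# A very flexible polynomial "memorizes" the training sample;
# a simpler one captures the general trend
fit.flex   <- lm(y ~ poly(x, 15), data = train)
fit.simple <- lm(y ~ poly(x, 3),  data = train)

mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)
mse(fit.flex, train)    # very low training error
mse(fit.flex, test)     # much higher test error: overfitting
mse(fit.simple, test)   # the simpler model generalizes better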

Source

[ VIDEO OF THE WEEK]

@JohnNives on ways to demystify AI for enterprise #FutureOfData #Podcast


Subscribe to  Youtube

[ QUOTE OF THE WEEK]

Processed data is information. Processed information is knowledge. Processed knowledge is wisdom. – Ankala V. Subbarao

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Dr. Nipa Basu, @DnBUS


Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

The largest AT&T database boasts titles including the largest volume of data in one unique database (312 terabytes) and the second largest number of rows in a unique database (1.9 trillion), which comprises AT&T’s extensive calling records.

Sourced from: Analytics.CLUB #WEB Newsletter

Data Science Programming: Python vs R

“Data Scientist – The Sexiest Job of 21st Century.”- Harvard Business Review

If you are already in a big data related career, then you must already be familiar with the set of big data skills you need to master to grab the sexiest job of the 21st century. With every industry generating massive amounts of data, the need to crunch data calls for more powerful and sophisticated programming tools like Python and R. Python and R are among the popular programming languages that a data scientist must know to pursue a lucrative career in data science.

Python is popular as a general purpose web programming language, whereas R is popular for its great data visualization features, as it was developed specifically for statistical computing. At DeZyre, our career counsellors often get questions from prospective students about what they should learn first: Python programming or R programming. If you are unsure which programming language to learn first, then you are on the right page.

Python and R top the list of basic tools for statistical computing among the set of data scientist skills. Data scientists often debate which one is more valuable, R programming or Python programming; however, both languages have specialized key features that complement each other.

Data Science with Python Language

Data science consists of several interrelated but different activities, such as computing statistics, building predictive models, accessing and manipulating data, building explanatory models, creating data visualizations, and integrating models into production systems. Python provides data scientists with a set of libraries that helps them perform all of these operations on data.

Python is a general purpose, multi-paradigm programming language for data science that has gained wide popularity because of its simple syntax and its operability across different ecosystems. Python lets programmers do almost anything they need with data – data munging, data wrangling, website scraping, web application building, data engineering and more. Python makes it easy for programmers to write maintainable, large scale, robust code.

“Python programming has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python language, and we’re looking for more people with skills in this language.” – said Peter Norvig, Director at Google.

Unlike R, Python does not have built-in statistical packages, but it has support for libraries like scikit-learn, NumPy, pandas, SciPy and Seaborn that data scientists can use to perform useful statistical and machine learning tasks. Python code reads much like pseudocode and makes sense almost immediately, just like plain English. The expressions and characters used in the code can be mathematical; however, the logic can easily be followed from the code.

What makes Python language the King of Data Science Programming Languages?

“In Python programming, everything is an object. It’s possible to write applications in Python language using several programming paradigms, but it does make for writing very clear and understandable object-oriented code.”- said Brian Curtin, member of Python Software Foundation

1) Broadness

The public package index for Python, popularly known as PyPI, has approximately 40K add-ons listed under 300 different categories. So, if a developer or data scientist has to do something with Python, there is a high probability that someone has already built it, and they need not begin from scratch. Python is used extensively for tasks ranging from CGI and web development, system testing and automation, and ETL to gaming.

2) Efficient

Developers these days spend a lot of time defining and processing big data. With the increasing amount of data that needs to be processed, it becomes extremely important for programmers to manage in-memory usage efficiently. Python has generators, both as functions and as expressions, which help with iterative processing, i.e. handling one item at a time. When a large number of processing steps must be applied to a set of data, generators in Python prove to be a great advantage, as they grab the source data one item at a time and pass it through the entire processing chain.

The generator based migration tool collective.transmogrifier helps make complex and interdependent updates to the data as it is being processed from the old site, and then allows programmers to create and store objects in constant memory at the new site. The transmogrifier plays a vital role in Python programming when dealing with larger data sets.

3) Can be Easily Mastered Under Expert Guidance-Read It, Use it with Ease

Python has gained wide popularity because its syntax is clear and readable, making it easy to learn under expert guidance. Data scientists can gain expert knowledge and master programming with Python for scientific computing by taking industry oriented Python programming courses. The readability of the syntax makes it easier for peer programmers to update existing Python programs at a faster pace, and also helps them write new programs quickly.

Applications of Python language-

  • Python is used by Mozilla for exploring their broad code base. Mozilla releases several open source packages built with Python.
  • Dropbox, the popular file hosting service, was founded by Drew Houston because he kept forgetting his USB drive. The project was started to fulfill his personal needs, but it turned out to be so good that even others started using it. Dropbox is completely written in Python and now has close to 150 million registered users.
  • Walt Disney uses Python to enhance their creative processes.
  • Some other exceptional products written in Python are –

i. Cocos2d – a popular open source 2D gaming framework

ii. Mercurial – a popular cross-platform, distributed code revision control tool used by developers

iii. BitTorrent – file sharing software

iv. Reddit – an entertainment and social news website

Limitations of Python Programming-

  • Python is an interpreted language and is thus often slower than compiled languages.
  • “A possible disadvantage of Python is its slow speed of execution. But many Python packages have been optimized over the years and execute at C speed.” – said Pierre Carbonnelle, a Python programmer who runs the PyPL language index.
  • Python, being a dynamically typed language, poses certain design restrictions. It requires rigorous testing because errors show up only at runtime.
  • Python has gained popularity on desktop and server platforms but is still weak on mobile computing platforms, as very few mobile apps are developed with Python. Python is rarely found on the client side of web applications.

Click here to know more about our IBM Certified Hadoop Developer course

Data Science with R Language

Millions of data scientists and statisticians use R to tackle challenging problems in statistical computing and quantitative marketing. R has become an essential tool for finance and business analytics-driven organizations like LinkedIn, Twitter, Bank of America, Facebook and Google.

R is an open source programming language and environment for statistical computing and graphics, available on Linux, Windows and Mac. R has an innovative package system that allows developers to extend its functionality by providing cross-platform distribution and testing of data and code. With more than 5K publicly released packages available for download, it is a great programming language for exploratory data analysis. R can easily be integrated with other object oriented programming languages like C, C++ and Java. R has an array-oriented syntax that makes it easier for programmers to translate math to code, in particular for professionals with minimal programming background.
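As a small illustration of that array-oriented style, a statistical formula such as the z-score, (x − mean(x)) / sd(x), translates almost literally into R without an explicit loop (the values below are made up for the example):

x <- c(2.1, 3.5, 4.8, 5.0, 6.2)   # illustrative values
z <- (x - mean(x)) / sd(x)        # standardize the whole vector at once
z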

Why use R programming for data science?

1. R is one of the best tools for data scientists in the world of data visualization. It has virtually everything a data scientist needs: statistical models, data manipulation and visualization charts.

2. Data scientists can create unique and beautiful data visualizations with R that go far beyond outdated line plots and bar charts. With R, data scientists can draw meaningful insights from data in multiple dimensions using 3D surfaces and multi-panel charts. The Economist and The New York Times exploit the custom charting capabilities of R to create stunning infographics.

3. One great feature of R is reproducible research: the code and data can be given to an interested third party, which can trace them back to reproduce the same results. Data scientists can write code that extracts the data, analyses it and generates an HTML, PDF or PPT report. When a third party is interested, the original author can share the code and data so that similar results can be reproduced.

4. R is designed particularly for data analysis, with the flexibility to mix and match various statistical and predictive models for the best possible outcomes. R scripts can easily be automated to support production deployments and reproducible research.

5. R has a rich community of approximately 2 million users and thousands of developers, drawing on the talents of data scientists spread across the world. The community has packages spread across actuarial analysis, finance, machine learning, web technologies and pharmaceuticals that can help predict component failure times, analyse genomic sequences, and optimize portfolios. All these resources created by experts in various domains can be accessed easily, for free, online.

Applications of R Language

  • Ford uses open source tools like R and Hadoop for data driven decision support and statistical data analysis.
  • The insurance giant Lloyd's uses R to create motion charts that provide analysis reports to investors.
  • Google uses R to analyse the effectiveness of online advertising campaigns, predict economic activity and measure the ROI of advertising campaigns.
  • Facebook uses R to analyse status updates and build the social network graph.
  • Zillow makes use of R to estimate housing prices.

Limitations of R Language

  • R has a steep learning curve for professionals who do not come from a programming background (professionals hailing from a GUI world like that of Microsoft Excel).
  • Working with R can at times be slow if the code is written poorly; however, there are solutions to this, such as the FastR project, pqR and Renjin.

Data Science with Python or R Programming- What to learn first?

There are certain strategies that will help professionals decide their call of action on whether to begin learning data science with Python language or with R language –

  • If professionals know what kind of project they will be working on, they can decide which language to learn first. If the project requires working with jumbled or scraped data from files, websites or other sources, then professionals should start by learning Python. On the other hand, if the project requires working with clean data, then professionals should focus first on the data analysis part, which means learning R first.
  • It is always better to be on par with your team, so find out which data science programming language they are using, R or Python. Collaboration and learning become much easier if you and your team mates are on the same language paradigm.
  • Trends in data scientist job postings will help you make a better decision on which to learn first, R or Python.
  • Last but not least, consider your personal preferences: what interests you more and which is easier for you to grasp.

Having briefly looked at both Python and R, the bottom line is that it is difficult to choose just one language to learn first to crack data scientist jobs at top big data companies. Each has its own advantages and disadvantages depending on the scenarios and tasks to be performed. The best solution is therefore to make a smart move based on the strategies listed above, decide which language to learn first to fetch a job with a big data scientist salary, and later add to your skill set by learning the other language.

To read the original article on DeZyre, click here.

Source: Data Science Programming: Python vs R

@DarrWest / @BrookingsInst on the Future of Work: AI, Robots & Automation #JobsOfFuture

[youtube https://www.youtube.com/watch?v=aEfVIY09p3o]

In this podcast Darrell West (@DarrWest) from @BrookingsInst talks about the future of work, the worker and the workplace. He sheds light on his research into the changing work landscape and the importance of policy design to compensate for technology disruption. Darrell shares his thoughts on how businesses, professionals and governments could come together to minimize the impact of technology driven joblessness and help stimulate the economy by placing everyone in futuristic roles.

Darrell’s Recommended Read:
Einstein: His Life and Universe by Walter Isaacson https://amzn.to/2JA4hsK

Podcast Link:
iTunes: http://math.im/itunes
GooglePlay: http://math.im/gplay

Darrell’s BIO:
Darrell West is the Vice President of Governance Studies and Director of the Center for Technology Innovation at the Brookings Institution. Previously, he was the John Hazen White Professor of Political Science and Public Policy and Director of the Taubman Center for Public Policy at Brown University. His current research focuses on technology policy, the Internet, digital media, health care, education, and privacy and security.

The Center that he directs examines a wide range of topics related to technology innovation including technology policy, public sector innovation; legal and Constitutional aspects of technology; digital media and social networking; health information technology; virtual education, and green technology. Its mission is to identify key developments in technology innovation, undertake cutting-edge research, disseminate best practices broadly, inform policymakers at the local, state, and federal levels about actions needed to improve innovation, and enhance the public’s and media’s understanding of the importance of technology innovation.

West is the author of 23 books; his most recent, published two weeks ago, is The Future of Work: Robots, AI, and Automation (Brookings Institution Press). Some of his past work includes: How Technology Can Transform Education (Brookings, 2012); The Next Wave: Using Digital Technology to Further Social and Political Innovation (Brookings, 2011); Brain Gain: Rethinking U.S. Immigration Policy (Brookings, 2010); Digital Medicine: Health Care in the Internet Era (Brookings, 2009); Digital Government: Technology and Public Sector Performance (Princeton University Press, 2005); and Air Wars: Television Advertising in Election Campaigns (Congressional Quarterly Press, 2005), among others. He is the winner of the American Political Science Association's Don K. Price award for best book on technology (for Digital Government) and the American Political Science Association's Doris Graber award for best book on political communications (for Cross Talk).

About #Podcast:
#JobsOfFuture podcast is a conversation starter to bring leaders, influencers and lead practitioners to come on show and discuss their journey in creating the data driven future.

Wanna Join?
If you or any you know wants to join in,
Register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor?
Email us @ info@analyticsweek.com

Keywords:
#JobsOfFuture #Leadership #Podcast #Future of #Work #Worker & #Workplace

Originally Posted at: @DarrWest / @BrookingsInst on the Future of Work: AI, Robots & Automation #JobsOfFuture

Aug 08, 19: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Conditional Risk  Source

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> How to Ensure Your Next Analytics Tool is Your Last by analyticsweek

>> Our Experience at InsureTech Connect 2017, Las Vegas by analyticsweek

>> House of Cards by analyticsweek

Wanna write? Click Here

[ FEATURED COURSE]

CS109 Data Science


Learning from data in order to gain useful predictions and insights. This course introduces methods for five key facets of an investigation: data wrangling, cleaning, and sampling to get a suitable data set; data managem… more

[ FEATURED READ]

Antifragile: Things That Gain from Disorder


Antifragile is a standalone book in Nassim Nicholas Taleb’s landmark Incerto series, an investigation of opacity, luck, uncertainty, probability, human error, risk, and decision-making in a world we don’t understand. The… more

[ TIPS & TRICKS OF THE WEEK]

Finding success in data science? Find a mentor
Yes, most of us don't feel the need, but most of us really could use one. Because most data science professionals work in isolation, getting an unbiased perspective is not easy. It is also often hard to see how a data science career will progress. A network of mentors addresses these issues: it gives data professionals an outside perspective and an unbiased ally. It's extremely important for successful data science professionals to build a mentor network and use it throughout their careers.

[ DATA SCIENCE Q&A]

Q:Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
A: * In long-tailed distributions, a high frequency population is followed by a low frequency population which gradually tails off asymptotically
* Rule of thumb: the majority of occurrences (more than half, and when the Pareto principle applies, 80%) are accounted for by the first 20% of items in the distribution
* The least frequently occurring 80% of items still matter collectively, as a proportion of the total population
* Examples: Zipf's law, the Pareto distribution, power laws

Examples:
1) Natural language
– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table
– The most frequent word will occur roughly twice as often as the second most frequent, three times as often as the third most frequent, and so on
– “The” accounts for 7% of all word occurrences (70,000 per million)
– “of” accounts for 3.5%, followed by “and”, etc.
– Only 135 vocabulary items are needed to account for half the English corpus!

2. Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people

3. File size distribution of Internet Traffic

Additional: Hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites

Importance in classification and regression problems:
– Skewed distributions
– Which metrics to use? Beware the accuracy paradox (classification); consider the F-score or AUC instead
– Issue when using models that assume linearity (e.g. linear regression): you may need to apply a monotone transformation to the data (logarithm, square root, sigmoid function…)
– Issue when sampling: your data become even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla) or an anomaly detection approach
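A hedged R sketch of what such a long tail looks like in practice, simulating Zipf-like word frequencies purely for illustration:

set.seed(7)

# Zipf-like frequencies: the r-th ranked item occurs proportionally to 1/r
ranks <- 1:10000
prob  <- (1 / ranks) / sum(1 / ranks)

# Share of all occurrences accounted for by the top 20% of items
top20 <- ceiling(0.2 * length(ranks))
sum(prob[1:top20])   # most of the mass sits in the head of the distribution

# Draw 100,000 "word tokens" from this distribution
tokens <- sample(ranks, size = 1e5, replace = TRUE, prob = prob)
head(sort(table(tokens), decreasing = TRUE))   # a handful of items dominate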

Source

[ VIDEO OF THE WEEK]

Solving #FutureOfOrgs with #Detonate mindset (by @steven_goldbach & @geofftuff) #FutureOfData #Podcast


Subscribe to  Youtube

[ QUOTE OF THE WEEK]

The temptation to form premature theories upon insufficient data is the bane of our profession. – Sherlock Holmes

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Joe DeCosmo, @Enova


Subscribe 

iTunes  GooglePlay

[ FACT OF THE WEEK]

In 2015, a staggering 1 trillion photos will be taken and billions of them will be shared online. By 2017, nearly 80% of photos will be taken on smart phones.

Sourced from: Analytics.CLUB #WEB Newsletter

The Beginner’s Guide to Predictive Workforce Analytics

Greta Roberts, CEO
Talent Analytics, Corp.

Human Resources Feels Pressure to Begin Using Predictive Analytics
Today’s business executives are increasingly applying pressure to their Human Resources departments to “use predictive analytics”.  This pressure isn’t unique to Human Resources as these same business leaders are similarly pressuring Sales, Customer Service, IT, Finance and every other line of business (LOB) leader, to do something predictive or analytical.

Every line of business (LOB) is clear on their focus. They need to uncover predictive analytics projects that somehow affect their bottom line. (Increase sales, increase customer service, decrease mistakes, increase calls per day and the like).

Human Resources Departments have a Different, and Somewhat Unique, Challenge not Faced By Most Other Lines of Business
When Human Resources analysts begin a predictive analytics initiative, what we see mirrors what every other line of business does. Yet for HR, instead of producing a great outcome, it can be potentially devastating.

Unless the unique challenge HR faces is understood, it can trip up an HR organization for a long time, cause them to lose analytics project resources and funding, and continue to perplex HR as they have no idea how they missed the goal of the predictive initiative so badly.

Human Resources’ Traditional Approach to Predictive Projects
Talent Analytics’ experience has been that (like all other lines of business) when Human Resources focuses on predictive analytics projects, they look around for interesting HR problems to solve; that is, problems inside of the Human Resources departments. They’d like to know if employee engagement predicts anything, or if they can use predictive work somehow with their diversity challenges, or predict a flight risk score that is tied to how much training or promotions someone has, or see if the kind of onboarding someone has relates to how long they last in a role. Though these projects have tentative ties to other lines of business, these projects are driven from an HR need or curiosity.

HR (and everyone else) Needs to Avoid the “Wikipedia Approach” to Predictive Analytics
Our firm is often asked if we can “explore the data in the HR systems” to see if we can find anything useful. We recommend avoiding this approach as it is exactly the same as beginning to read Wikipedia from the beginning (like a book) hoping to find something useful.

When exploring HR data (or any data) without a question, what you’ll find are factoids that will be “interesting but not actionable”. They will make people say “really, I never knew that”, but nothing will result.  You’ll pay an external consultant a lot of money to do this, or have a precious internal resource do this – only to gain little value without any strategic impact.  Avoid using the Wikipedia Approach – at least at first.  Start with a question to solve.  Don’t start with a dataset.

Human Resources Predictive Project Results are Often Met with Little Enthusiasm
Like all other Lines of Business, HR is excited to show results of their HR focused predictive projects.

The important disconnect. HR shows results that are meaningful to HR only.

Perhaps there is a prediction that ties # of training classes to attrition, or correlates performance review ratings with how long someone would last in their role. This is interesting information to HR but not to the business.

Here’s what’s going on.

Business Outcomes Matter to the Business.  HR Outcomes Don’t.
Human Resources departments can learn from the Marketing Department who came before them on the predictive analytics journey. Today’s Marketing Departments, that are using predictive analytics successfully, are arguably one of the strongest and most strategic departments of the entire company.

Today’s Marketing leaders predict customers who will generate the most revenue (have high customer lifetime value). Marketing Departments did not gain any traction with predictive analytics when they were predicting how many prospects would “click”. They needed to predict how many customers would buy.

Early predictive efforts in the Marketing Department used predictive analytics to predict how many webinars they would need to conduct to get 1,000 new prospects into their prospect database, or how much they would need to spend on marketing campaigns to get prospects to click on a coupon. (Adding new prospect names to a prospect database is a marketing goal, not a business goal; clicking on a coupon is a marketing goal, not a business goal.) Or, they could predict that customer engagement would go up if they gave a discount on a Friday (again, a marketing goal, not a business goal). The business doesn't care about any of these “middle measures” unless they can be proved and tracked to the end business outcome.

Marketing Cracked the Code
Business wants to reliably predict how many people would buy (not click) using this coupon vs. that one.  When marketing predicted real business outcomes, resources, visibility and funding quickly became available.

When Marketing was able to show a predictive project that could identify what offer to make so that a customer bought and sales went up – business executives took notice. They took such close notice that they highlighted what Marketing was able to do, they gave Marketing more resources and funding and visibility. Important careers were made out of marketing folks who were / are part of strategic predictive analytics projects that delivered real revenue and / or real cost savings to the business’s bottom line.

Marketing stopped being “aligned” with the business, Marketing was the business.

Human Resources needs to do the same thing.

Best Approach for Successful and Noteworthy Predictive Workforce Projects
Many people get tangled up in definitions. Is it people analytics, workforce analytics, talent analytics or something else? It doesn’t matter what you call it – the point is that predictive workforce projects need to address and predict business outcomes not HR outcomes.

Like Marketing learned over time, when Human Resources begins predictive analytics projects, they need to approach the business units they support and ask them what kinds of challenges they are having that might be affected by the workforce.

There are 2 critical categories for strategic predictive workforce projects:

  • Measurably reducing employee turnover / attrition in a certain department or role

  • Measurably increasing specific employee performance (real performance, not performance review scores) in one role or department or another (e.g. more sales, fewer mistakes, higher customer service scores, fewer accidents).

I say “measurably” because, to be credible, the predictive workforce initiative needs to measure and show business results both before and after the predictive model is put to work.

For Greatest ROI: Businesses Must Predict Performance or Flight Risk Pre-Hire
Once an employee is hired, the business begins pouring significant cost into that employee, typically made up of a) their salary and benefits and b) training time while they ramp up to speed and deliver little to no value. Our analytics work measuring true replacement costs shows us that even for very entry-level roles (Call Center Rep, Bank Teller and the like), a conservative replacement estimate for a single employee will be over $6,000.
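The article doesn’t prescribe any particular modeling technique, but a small hypothetical sketch may help make “predicting flight risk pre-hire” concrete. The Python below (pandas plus scikit-learn) uses invented column names and toy data; logistic regression is simply one plausible first choice, and the only number taken from the text above is the conservative $6,000 replacement-cost estimate.

```python
# Hypothetical sketch only: pre-hire flight-risk scoring with a simple classifier.
# Column names and data are invented; the $6,000 replacement-cost figure is the
# conservative estimate cited in the article.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy historical data: pre-hire attributes plus whether the hire left within a year.
hires = pd.DataFrame({
    "commute_miles":    [5, 42, 12, 3, 55, 20, 8, 60, 15, 30, 2, 48],
    "prior_tenure_mos": [24, 6, 36, 18, 4, 12, 30, 3, 20, 9, 40, 5],
    "assessment_score": [78, 55, 82, 70, 50, 65, 88, 45, 72, 58, 90, 52],
    "left_within_year": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

X = hires.drop(columns="left_within_year")
y = hires["left_within_year"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Tie the score back to a business outcome: expected replacement cost at risk
# for new candidates, so the result is stated in dollars, not in HR metrics.
REPLACEMENT_COST = 6000  # conservative per-employee estimate from the article
candidates = pd.DataFrame({
    "commute_miles":    [10, 50],
    "prior_tenure_mos": [28, 4],
    "assessment_score": [80, 48],
})
risk = model.predict_proba(candidates)[:, 1]
print("Expected replacement cost at risk:", (risk * REPLACEMENT_COST).round(0))
```

The last two lines are the part that matters to the line of business: the risk score only becomes credible once it is translated into a before-and-after business number, such as replacement cost avoided.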

A great example is to consider the credit industry. Imagine them extending credit to someone for a mortgage – and then applying analytics after the mortgage has been extended to predict which mortgage holders are a good credit risk. It’s preposterous.

The only thing the creditor can do after the relationship has begun is to try to coach, train, encourage, change the payment plan and the like. By that point it’s too late to predict.

Predicting credit risk (who will pay their bills) is predicting human behavior. Predicting who will make their sales quota, who will make customers happy, who will make mistakes, who will drive their truck efficiently – that is also predicting human behavior.

HR needs to realize that predicting human behavior is a mature domain, with decades of experience spent honing approaches, algorithms and sensitivity around private data.

What is Human Resources’ Role in Predictive Analytics Projects?
The great news is that typically the Human Resources Department will already be aware of both of these business challenges. It just hadn’t considered that Human Resources could be part of helping to solve them using predictive analytics.

Many articles discuss how Human Resources needs to build an analytics culture, and how all Human Resources employees need to learn analytics. Though I appreciate the realization that analytics is here to stay, Human Resources of all people should know that some people have the natural mindset to “get” and love analytics, and some don’t and won’t.

As I speak around the world and talk to folks in HR, I can feel the fear felt today by people in HR who have little interest in this space. My recommendation would be to breathe, take a step back and realize that not everyone needs to know how to perform predictive analytics. There are many traditional HR functions that still need to be accomplished. We recommend a best-practice approach of identifying who does have the mindset and interest in the analytics space and letting them partner with someone who is a true predictive analyst.

For those who know they are not cut out to be the person doing the predictive analytics, there are still many roles where they can be incredibly useful in the predictive process. Perhaps they could identify problem areas that predictive analytics can solve, or perhaps they could be the person doing more of the traditional Human Resources work. I find this “analytics fear” paralyzes and demoralizes employees and people in general.

Loosely Identified, but Important Roles on a Predictive Workforce Analytics Project

  1. Someone to identify high turnover roles in the lines of business, or identify where there are a lot of employees not performing very well in their jobs

  2. A liaison: Someone to introduce the HR predictive analytics team to the lines of business with turnover or business performance challenges

  3. Someone to help find and access the data to support the predictive project

  4. Someone to actually “do” the predictive analytics work (the workforce analyst or data scientist)

  5. Someone who creates a final business report to show the results of the work (both positive and negative)

  6. Someone who presents the final business report

  7. A high level project manager to help keep the project moving along

  8. The business and HR experts who understand how things work and who need to be consulted all along the way

These roles can sometimes all be filled by the same person, and sometimes by many different people, depending on the complexity of the project, the size of the predictive workforce organization, the number of lines of business involved in the project, and/or the number of areas where data needs to be accessed.

The important thing to realize is that there are several non-analytics roles inside predictive projects. Not every role in a predictive project requires a predictive specialist or even an analytics-savvy person.

Should High Value Predictive Projects Deliver HR Answers?
We recommend not – at least not to begin with. We started by describing how business leaders are pressuring Human Resources to do predictive analytics projects, often with little or no guidance about which predictive projects to pursue.

Here is my prediction and you can take it to the bank. I’ve seen it happen over and over again.

When HR departments use predictive analytics to solve real Line of Business challenges that are driven by the workforce, HR becomes an instant hero. These Human Resources Departments are given more resources, their projects are funded, and they receive more headcount for their analytics projects – and, like Marketing, they turn into one of the most strategic departments in the entire company.

Feeling Pressure to Get Started with Predictive?
If you’re feeling pressure from your executives to start using predictive analytics strategically and have a high-volume role like sales or customer service you’d like to optimize, get in touch.

Want to see more examples of “real” predictive workforce business outcomes? Attend Predictive Analytics World for Workforce in San Francisco, April 3-6, 2016.

Greta Roberts is the CEO & Co-founder of Talent Analytics, Corp. She is the Program Chair of Predictive Analytics World for Workforce and a Faculty member of the International Institute for Analytics. Follow her on Twitter @gretaroberts.

Source: The Beginner’s Guide to Predictive Workforce Analytics

Analytic Exploration: Where Data Discovery Meets Self-Service Big Data Analytics

Traditionally, the data discovery process was a critical prerequisite to, yet a distinct aspect of, formal analytics. This fact was particularly true for big data analytics, which involved extremely diverse sets of data types, structures, and sources.

However, a number of crucial developments have recently occurred within the data management landscape that resulted in increasingly blurred lines between the analytics and data discovery processes. The prominence of semantic graph technologies, combined with the burgeoning self-service movement and increased capabilities of visualization and dashboard tools, has resulted in a new conception of analytics in which users can dynamically explore their data while simultaneously gleaning analytic insight.

Such analytic exploration denotes several things: decreased time to insight and action, a democratization of big data and analytics fit for the users who need these technologies most, and an increased reliance on data that makes data-centric culture more pervasive.

According to Ben Szekely, Vice President of Solutions and Pre-sales at Cambridge Semantics, it also means much more – a new understanding of the potential of analytics, which necessitates that users adopt:

“A willingness to explore their data and be a little bit daring. It is sort of a mind-bending thing to say, ‘let me just follow any relationship through my data as I’m just asking questions and doing analytics’. Most of our users, as they get in to it, they’re expanding their horizons a little bit in terms of realizing what this capability really is in front of them.”

Expanding Data Discovery to Include Analytics
In many ways, the data discovery process was widely viewed as part of the data preparation required to perform analytics. Data discovery was used to discern which data were relevant to a particular query and to a specific business problem. Discovery tools provided this information, which was then cleansed, transformed, and loaded into business intelligence or analytics tools to deliver insight – a process that was typically facilitated by IT departments and exceedingly time-consuming.

However, as the self-service movement has continued to gain credence throughout the data sphere, these tools have evolved to become more dynamic and celeritous. Today, any number of vendors offer tools that regularly publish the results of analytics in interactive dashboards and visualizations. These platforms enable users to manipulate those results, display them in ways that are most meaningful for their objectives, and actually utilize those results to answer additional questions. As Szekely observed, oftentimes users are simply: “Approaching a web browser asking questions, or even using a BI or analytics tool they’re already familiar with.”

The Impact of Semantic Graphs for Exploration
The true potential of analytic exploration is realized when combining data discovery tools and visualizations with the relationship-based, semantic graph technologies that are highly effective on large, varied sets of big data. By placing these data discovery platforms atop stacks predicated on an RDF graph, users are able to initiate analytics with the tools they previously used merely to refine the results of analytics.
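Cambridge Semantics’ own stack isn’t reproduced here, but a minimal, generic sketch of what “following any relationship through the data” looks like at the RDF layer may help. The example uses the open-source rdflib library in Python; the ex: vocabulary and the facts in it are invented for illustration and stand in for whatever ontology a real deployment would expose.

```python
# Minimal sketch of "following relationships" in an RDF graph with rdflib.
# The ex: vocabulary and the facts below are invented for illustration; this is
# generic RDF/SPARQL, not any particular vendor's discovery stack.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# A few facts that would normally live in separate silos.
g.add((EX.alice, RDF.type, EX.Analyst))
g.add((EX.alice, EX.worksOn, EX.churnStudy))
g.add((EX.churnStudy, EX.usesDataset, EX.callCenterLogs))
g.add((EX.callCenterLogs, EX.ownedBy, EX.opsTeam))
g.add((EX.callCenterLogs, EX.rowCount, Literal(1200000)))

# An exploratory question that hops across relationships:
# "Which datasets do analysts touch, and who owns them?"
query = """
PREFIX ex: <http://example.org/>
SELECT ?analyst ?dataset ?owner WHERE {
    ?analyst a ex:Analyst ;
             ex:worksOn ?project .
    ?project ex:usesDataset ?dataset .
    ?dataset ex:ownedBy ?owner .
}
"""
for row in g.query(query):
    print(row.analyst, row.dataset, row.owner)
```

In a discovery tool of the kind Szekely describes, the end user never writes SPARQL; conceptually, though, each relationship they click through amounts to a query shaped like this one.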

Szekely mentioned that: “It’s the responsibility of the toolset to make that exploration as easy as possible. It will allow them to navigate the ontology without them really knowing they’re using RDF or OWL at all…The system is just presenting it to them in a very natural and intuitive way. That’s the responsibility of the software; it’s not the responsibility of the user to try to come down to the level of RDF or OWL in any way.”

The underlying semantic components – RDF, OWL, and the vocabularies and taxonomies that can link disparate sets of big data – are able to contextualize that data and give it relevance for specific questions. Additionally, semantic graphs and semantic models are responsible for the upfront data integration that occurs prior to analyzing different data sets, structures and sources. By combining data discovery tools with semantic graph technologies, users are able to achieve a degree of profundity in their analytics that would previously have either taken too long to achieve or not been possible at all.
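As a companion to the sketch above, here is an equally minimal, hypothetical illustration of that upfront integration step: two differently shaped sources are mapped into one shared vocabulary so that a single question can span both. The field names and ex: terms are again invented, and rdflib stands in for whatever semantic layer a real deployment would use.

```python
# Hypothetical sketch: upfront integration by mapping two differently shaped
# sources into one shared vocabulary, so a single query can span both.
# Field names and ex: terms are invented.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Source 1: an HR export keyed by employee id.
hr_rows = [{"emp_id": "E1", "role": "Call Center Rep"},
           {"emp_id": "E2", "role": "Bank Teller"}]
# Source 2: a sales system that refers to the same people by the same id.
sales_rows = [{"rep": "E1", "quota_pct": 104},
              {"rep": "E2", "quota_pct": 81}]

for r in hr_rows:
    emp = EX[r["emp_id"]]
    g.add((emp, RDF.type, EX.Employee))
    g.add((emp, EX.role, Literal(r["role"])))

for r in sales_rows:
    g.add((EX[r["rep"]], EX.quotaAttainment, Literal(r["quota_pct"])))

# With both sources expressed against the same model, one question spans them.
query = """
PREFIX ex: <http://example.org/>
SELECT ?emp ?role ?quota WHERE {
    ?emp a ex:Employee ;
         ex:role ?role ;
         ex:quotaAttainment ?quota .
}
"""
for row in g.query(query):
    print(row.emp, row.role, row.quota)
```

The design point is that the linking happens once, up front, in the model; after that, exploration and analytics see one connected graph rather than separate silos.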

The Nature of Analytic Exploration
On the one hand, that degree of analytic profundity is best described as the ability of the layman business end user to ask many more questions of his or her data, in much quicker time frames than he or she is used to. On the other hand, the true utility of analytic exploration is realized in the types of questions that user can ask. These questions are frequently ad hoc, include time-sensitive and real-time data, and are often based on the results of previous questions and the conclusions one can draw from them.

As Szekely previously stated, the sheer freedom and depth of analytic exploration lends itself to so many possibilities on different sorts of data that it may require a period of adjustment to conceptualize and fully exploit. The possibilities enabled by analytic exploration are largely based on the visual nature of semantic graphs, particularly when combined with competitive visualization mechanisms that capitalize on the relationships they illustrate for users. According to Craig Norvell, Franz Vice President of Global Sales and Marketing, such visualizations are an integral “part of the exploration process that facilitates the meaning of the research” for which an end user might be conducting analytics.

Emphasizing the End User
Overall, analytic exploration relies upon the relationship-savvy, encompassing nature of semantic technologies. Additionally, it depends upon contemporary visualizations to fuse data discovery and analytics. Its trump card, however, lies in its self-service nature, which is tailored for end users to gain more comfort and familiarity with the analytics process. Ultimately, that familiarity can contribute to a significantly expanded usage of analytics, which in turn results in more meaningful data-driven processes from which greater amounts of value are derived.

Source