Grow at the speed of collaboration
A study by Cornerstone OnDemand pointed to the need for better collaboration within the workforce, and the data analytics domain is no different. A rapidly changing and growing industry like data analytics is very difficult for an isolated workforce to keep up with. A good collaborative work environment facilitates a better flow of ideas, improved team dynamics, rapid learning, and an increased ability to cut through the noise. So, embrace collaborative team dynamics.
[ DATA SCIENCE Q&A]
Q: How do you handle missing data? What imputation techniques do you recommend?
A: * If data are missing completely at random: deletion introduces no bias, but it reduces the power of the analysis by shrinking the effective sample size
* Recommended: KNN imputation, Gaussian mixture imputation
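To make the KNN recommendation concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function name `knn_impute`, the distance rule (root mean squared difference over the columns both rows have observed), and the toy data are all hypothetical choices made for this example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""

    def distance(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have column j observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy data (assumed): the last row is missing its second value.
data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, None],
]
filled = knn_impute(data, k=2)
print(filled[3])
```

The missing value is filled from the two rows closest to it in the observed column, so the distant third row does not influence the result.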
As a content creator, you might sometimes meet with companies and brands that are looking to market their products with content. It's one of the most efficient ways to get your message out, and people go to great lengths to create content that could make an impact through blogs and websites. But for the content creator, it can be a difficult task to put together content ideas that will satisfy the often detail-oriented client.
The ideas for good content usually come from a need-to-know moment. You might be browsing the Internet when you come across something you want to know more about. Usually there are places to find this information, but every once in a while you hit a roadblock: no single source yields the information you want. That's usually when you get the idea of putting together a piece of content that fills in exactly those missing pieces.
Whatever kind of content ideas you usually have, the same routine can be used when you need to create ideas for a client. Usually you need to ask the client a few questions before you have a clear picture of what the client is looking for and what they want to promote or market.
Is your refrigerator running a spam operation? Thanks to the Internet of Things, the answer to that question could be yes.
Despite some dystopian fears, like that spamming refrigerator, the Internet of Things isn't just an eerie term that sounds like it was plucked from Brave New World. It is a vague one though, so to clear up any uncertainty, here's the dictionary definition: "a proposed development of the Internet in which everyday objects have network connectivity, allowing them to send and receive data."
As Altimeter Group points out in its new report, "Customer Experience in the Internet of Things," brands are already using this sci-fi technology in amazing ways to build customer relationships and optimize their products. In reality, it's more evolution than revolution, as companies are already tracking smartphone and Internet usage to gather data that provides crucial feedback about consumer behavior. As the report states, the Internet of Things only "brings us closer than ever to the ultimate marketing objective: delivering the right content or experience in the right context."
Talk of trackers and sensors and refrigerators gone wild may sound intimidating for brands that are still getting their content operations up and running, but some major companies are already exploring the new frontier of the Internet of Things. Here are the five brands doing it best.
1. Walgreens
Have you ever found yourself searching for a specific item in a pharmacy, wishing you could press control-F to locate it, pay, and leave quickly? Aisle411, Google Tango, and Walgreens teamed up to create a new mobile app that can grant harried shoppers that wish. By using Google's virtual indoor 3D mapping technology, Aisle411 created a mobile shopping platform that lets consumers search and map products in the store, take advantage of personalized offers, and easily collect loyalty points.
"This changes the definition of in-store advertising in two key ways," Aisle411 CEO Nathan Pettyjohn told Mobile Commerce Daily. "Advertising becomes an experience (imagine children in a toy store having their favorite toy guide them through the store on a treasure hunt in the aisles of the store), and the end-cap is everywhere; every inch of the store is now a digital end cap."
According to a Forrester study, 19 percent of consumers are already using their mobile devices to browse in stores. Instead of forcing consumers to look away from their screens, Walgreens is meeting them there.
2. Taco Bell
Nowadays, practically everyone is reliant on their GPS to get them places. That's why Taco Bell is targeting consumers based on location by advertising and messaging them on mobile platforms like Pandora, Waze (a navigation app purchased by Google), and weather apps.
Digiday reports that in 2014, Taco Bell positioned ads on Waze for their 12 Pack product each Saturday morning to target drivers who might've been on their way to watch college football games. The Waze integration was so successful that Taco Bell decided to do the same thing on Sundays during the NFL season, this time advertising its Cool Ranch Doritos Locos Taco.
3. Home Depot
Home Depot has previously used augmented reality in its mobile app to allow users to see how certain products would look in their homes. IKEA is also known for enticing consumers with this mobile strategy. But now, Home Depot is making life even easier for shoppers by piloting a program that connects a customer's online shopping carts and wish lists with an in-store mobile app.
As explained in the Altimeter report, upon entering a Home Depot, customers who are part of the Pro Rewards program will be able to view the most efficient route through the store based on the products they shopped for online. And anyone who's been inside a Home Depot knows how massive and overwhelming those places can be without directions.
Creepy? Maybe. But helpful? Definitely. Michael Hibbison, VP of marketing and social media at Home Depot, defends the program to Altimeter Group: "Loyalty programs give brands more rope when it comes to balancing risks of creep. The way we think of it is we will be as personalized as you are loyal."
4. Tesla Motors
Getting your car fixed can be as easy as installing a software update on your phone, at least for Tesla customers. Tesla's cars are electric, powered by batteries similar to those that fuel your laptop and mobile device. So when Tesla had to recall almost 30,000 Model S cars because their wall chargers were overheating, the company was able to do the ultimate form of damage control. Instead of taking the products back or bothering customers to take them to a dealership, Tesla just updated the software of each car, effectively eliminating the problem in all of their products.
Tesla also used this connectedness to crowdsource improvements for their products. As reported by Altimeter, a customer recently submitted a request for a crawl feature that allows the driver to ease into a slow cruise control in heavy traffic. Tesla not only granted the customer's request, but they added the feature to their entire fleet of cars with just one software update.
5. McDonald's
McDonald's may be keeping it old school with their Monopoly contest, which, after 22 years, can still be won by peeling back stickers on your fries and McNuggets. But for their other marketing projects, McDonald's is getting pretty tech savvy.
McDonald's partnered with Piper, a Bluetooth low-energy beacon solution provider, to greet customers on their phones as they enter the restaurant. Through the app, consumers are offered coupons, surveys, Q&As, and even information about employment opportunities.
What does McDonald's get out of it? Data. Lots of data. When customers enter comments, their feedback is routed to the appropriate manager who can respond to the request before the person leaves the establishment.
Too close for comfort? Not compared to the company's controversial pay-with-a-hug stunt. And at least this initiative is working. According to Mobile Commerce Daily, in the first month of the app's launch McDonald's garnered more than 18,000 offer redemptions, and McChicken sales increased 8 percent.
By tapping into the Internet of Things, brands can closely monitor consumer behavior and, even though it may sound a bit too invasive, put the data they collect to good use. With sensors, a product can go from being a tool to an actual medium of communication between the marketer and the consumer. That sounds pretty cool. But, just to be safe, if you get a shady email from your fridge, maybe don't open it.
The original article appeared on The Constant Strategist.
Data aids, not replaces, judgement
Data is a tool and a means to help build consensus and facilitate human decision-making, not to replace it. Analysis converts data into information, and information in context leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So data is just the start; context and intuition also play a role.
[ DATA SCIENCE Q&A]
Q: Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
A: * Selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved
– Sampling bias: systematic error due to a non-random sample of a population, causing some members to be less likely to be included than others
– Time interval: a trial may be terminated early at an extreme value (for ethical reasons), but the extreme value is most likely to be reached by the variable with the largest variance, even if all the variables have similar means
– Data: cherry picking, when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence that airline flight is unsafe, while ignoring the far more common example of flights that complete safely)
– Studies: performing experiments and reporting only the most favorable results
– Can lead to inaccurate or even erroneous conclusions
– Statistical methods generally cannot overcome it
Why can data handling make it worse?
– Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
– Missing data handling will amplify this effect, because it is based mostly on HIV-negative respondents
– Prevalence estimates will be inaccurate
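The survey example above can be illustrated with a small simulation. This is a sketch under assumed numbers only: the true prevalence and the two response probabilities are invented for illustration, and mean imputation stands in for whatever missing-data procedure might be applied.

```python
import random

random.seed(0)

# All three rates below are assumptions chosen for illustration.
TRUE_PREVALENCE = 0.10       # share of the population that is positive
P_RESPOND_POSITIVE = 0.40    # positives participate less often
P_RESPOND_NEGATIVE = 0.80

population = [random.random() < TRUE_PREVALENCE for _ in range(100_000)]

# Only some people answer the survey, and answering depends on status.
respondents = [
    is_positive
    for is_positive in population
    if random.random() < (P_RESPOND_POSITIVE if is_positive else P_RESPOND_NEGATIVE)
]

population_rate = sum(population) / len(population)
respondent_rate = sum(respondents) / len(respondents)

# Mean imputation: every non-respondent is assigned the respondent rate,
# so the "completed" dataset simply inherits the respondents' bias.
n_missing = len(population) - len(respondents)
imputed_rate = (sum(respondents) + n_missing * respondent_rate) / len(population)

print(f"true rate:       {population_rate:.3f}")
print(f"respondent rate: {respondent_rate:.3f}")
print(f"after imputing:  {imputed_rate:.3f}")
```

Filling in the gaps makes the dataset look complete, but the estimate stays at the respondents' understated rate, which is the sense in which missing data handling can entrench selection bias rather than correct it.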
In this podcast, Venu Vasudevan (@ProcterGamble) talks about best practices for creating a research-led, data-driven data science team. He walks through his journey of creating a robust and sustained data science team, and discusses bias in data science and some practices that leaders and data science practitioners can adopt to create an impactful data science team. This podcast is great for future data science leaders and for practitioners who lead organizations and are tasked with putting together a data science practice.
Venu Vasudevan is Research Director, Data Science & AI at Procter & Gamble, where he directs the Data Science & AI organization within Procter & Gamble research. He is a technology leader with a track record of successful consumer and enterprise innovation at the intersection of AI, Machine Learning, Big Data, and IoT. Previously he was VP of Data Science at an IoT startup, a founding member of the Motorola team that created the Zigbee IoT standard, worked to create an industry-first zero-click interface for mobile with Dag Kittlaus (co-creator of Apple Siri), created an industry-first Google Glass experience for TV, an ARRIS video analytics and big data platform recently acquired by Comcast, and a social analytics platform leveraging Twitter that was featured in Wired Magazine and on the BBC. Venu holds a PhD (Databases & AI) from The Ohio State University, and was a member of Motorola's Science Advisory Board (top 2% of Motorola technologists). He is an Adjunct Professor at Rice University's Electrical and Computer Engineering department, and was a mentor at Chicago's 1871 startup incubator.
The #FutureOfData podcast is a conversation starter that brings leaders, influencers, and lead practitioners onto the show to discuss their journey in creating the data-driven future.
If you or anyone you know wants to join in,
Register your interest @ http://play.analyticsweek.com/guest/
Want to sponsor?
Email us @ email@example.com
When most people hear the word "bias" they think of gender or racial discrimination or other situations where preconceptions lead to bad decisions.
These kinds of biases and bad decisions are common. Tests have shown that blinding judges to gender makes a big difference in orchestra auditions [1], inserting perceptions of gender affects hiring for university jobs [2], and having a name that sounds African American costs an applicant the equivalent of 8 years of experience [3].
People are often not aware of the extent to which their decisions are affected by these biases. In the case of orchestra auditions, judges claim to be making decisions based purely on ability, but they make different decisions when they cannot see a musician. In the case of university hiring, female professors showed as much bias as male professors although they are presumably aware of and opposed to discrimination against women in the workplace. Clearly, bias can cause problems.
The bad news is that some level of bias appears to be unavoidable in human decisions.
But the surprising thing is that some form of bias has been shown to be absolutely necessary for machine learning to work at all [4]. Similarly, some forms of bias appear innate and in many cases very useful in human reasoning. How is this apparent contradiction possible?
Based on results like these, it seems like the right course of action is to work to expunge all kinds of bias. The real problem, however, is not bias per se. Instead, the problem is uncontrolled and unrecognized bias that does not match reality. A bias against hiring flatworms as violinists is probably just fine (though some cellists might disagree). So the real question is how we can make our bias (our assumptions about reality) more explicit and, importantly, how we can make it possible to change our assumptions.
Bias in Machine Learning
You have to have bias to train a model. If you start with no assumptions, then all possible universes are equally likely, including an infinite number of completely nonsensical ones. It's not feasible to start that way and successfully train a model. Instead, you give your model a smaller collection of possibilities from which to choose: possibilities that are reasonably likely to occur in reality. In other words, you give your model hints about what is possible and what is plausible. A simple example is that an internet ad, no matter how well designed or targeted, will not get a 90% click-through rate, as much as we might wish it could. It's technically not impossible, but realistically, it just isn't going to happen. Injecting a bias into the model that says that the click-through rate on an ad is likely to be in the range from 0.1% to 5% is a way to teach the model what your general experience is. Without that, the model wastes too much time and too much training data learning to eliminate model parameters that are far out of scope of what is reasonable.
So some assumptions (bias) are required to do anything at all. The question then becomes one of whether you should have your assumptions baked into the learning algorithm or have a way of expressing them mathematically and explicitly.
Make Bias Explicit and Controlled
There are many ways to build machine learning models, and one way in which these methods differ is by how they inject assumptions.
In some techniques, the assumptions are implicit and difficult to modify. The data scientist may not even realize what types of assumptions or bias are inherent in the technique nor be able to adjust the assumptions in light of new evidence. This type of bias is one to avoid.
In contrast, Bayesian inference is the branch of the theory of probability that deals with how to incorporate explicit assumptions into the process of learning from data. Using Bayesian techniques allows you to express what you know about the problem you are solving, that is, to be explicit and controlled in injecting bias. You know what bias you are dealing with, and if you realize your assumptions are incorrect, you can change them and start learning again.
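As a minimal sketch of what an explicit, adjustable assumption can look like, the example below encodes a belief about an ad's click-through rate as a Beta prior and updates it with observed clicks. The Beta-Binomial model, the prior parameters, and the click counts are illustrative assumptions, not something the article prescribes.

```python
def update_beta(prior_a, prior_b, clicks, impressions):
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior."""
    return prior_a + clicks, prior_b + (impressions - clicks)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Assumed prior: Beta(2, 98), i.e. we expect roughly a 2% click-through rate.
prior = (2.0, 98.0)

# Hypothetical observations: 3 clicks in 100 impressions.
posterior = update_beta(*prior, clicks=3, impressions=100)

# If the 2% assumption turns out to be wrong, swap the prior and learn again.
revised_prior = (1.0, 9.0)   # assumed alternative: roughly 10%, held loosely
revised_posterior = update_beta(*revised_prior, clicks=3, impressions=100)

print(beta_mean(*posterior))
print(beta_mean(*revised_posterior))
```

Because the assumption lives in two visible numbers rather than inside the algorithm, it can be inspected, debated, and replaced without changing anything else.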
When You Have Eliminated the Impossible, Never Say Never
One way to avoid misleading outcomes in machine learning due to mishandling of bias is to put soft limits on the assumptions you inject. All of the encoded assumptions in your model should express your experience but also allow for the possibility that a particular case is extremely rare yet still very slightly possible. Instead of using an absolute statement of impossibility, allow for surprises in the assumptions you make. Thus, if you want to say that a parameter should be between 1% and 5%, it is good practice to allow for the possibility that the parameter is more extreme than expected and could lie outside that expected range. We might instead say that the parameter will very probably be in the range from 1% to 5%, but that there is a small chance that it is outside that range. Making that small chance non-zero helps avoid a meltdown in the learning algorithm when your assumptions turn out to be wrong.
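The difference between a hard and a soft limit can be sketched with a grid of candidate rates. Everything here is an assumption for illustration: the grid, the 99%/1% split of prior mass, and the hypothetical data (100 clicks in 1,000 impressions, well outside the expected 1% to 5% range).

```python
import math


def posterior_mean(prior, grid, clicks, impressions):
    """Posterior mean of the rate on a discrete grid, computed in log space."""
    log_post = []
    for p, w in zip(grid, prior):
        if w == 0.0:
            log_post.append(-math.inf)   # a hard zero can never be revived by data
        else:
            log_post.append(
                math.log(w)
                + clicks * math.log(p)
                + (impressions - clicks) * math.log(1 - p)
            )
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total


grid = [i / 1000 for i in range(1, 300)]        # candidate rates 0.1% to 29.9%
in_range = [0.01 <= p <= 0.05 for p in grid]
n_in = sum(in_range)
n_out = len(grid) - n_in

# Hard limit: zero probability outside 1% to 5%.
hard_prior = [1.0 / n_in if inside else 0.0 for inside in in_range]
# Soft limit: 99% of the mass inside the range, 1% spread thinly outside it.
soft_prior = [0.99 / n_in if inside else 0.01 / n_out for inside in in_range]

clicks, impressions = 100, 1000                  # hypothetical surprising data
hard_estimate = posterior_mean(hard_prior, grid, clicks, impressions)
soft_estimate = posterior_mean(soft_prior, grid, clicks, impressions)

print(f"hard prior estimate: {hard_estimate:.3f}")
print(f"soft prior estimate: {soft_estimate:.3f}")
```

With the hard limit the estimate is stuck at the edge of the allowed range no matter how much evidence arrives, while the soft limit lets the data pull the estimate toward the observed rate.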
Reserve a touch of skepticism
A key best practice in dealing with bias is to admit to yourself that your model could be radically wrong or the world may have changed. Keep watching for that possibility and be ready to change your assumptions, and maybe your model, if you find evidence that your results are substantially inaccurate.
How would you recognize that situation? By monitoring how well reality (events as they occur) actually matches what your model predicts, you can continually check on the correctness of your assumptions and thus how well you've chosen and controlled bias. In other words, don't just give your model (and yourself) a passing grade when you deploy it and then never look back. When you see performance change from previous levels or relative to other similar models, you should start to suspect a systematic error in your assumptions and start experimenting with weaker or different assumptions.
Data scientists need to have a healthy dose of skepticism. This doesn't mean losing confidence in the value of their analyses or the outcomes of their models, but it means staying alert and being ready to adjust. Just because a model appeared to work once does not mean that it will continue working as the context changes. It must be monitored. Most of the time when a model degrades over time, small changes in the model assumptions can restore previous performance. Occasionally larger structural changes in the way the world works will require that a model be rethought at a deeper level.
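One simple way to keep watching is a rolling comparison of predictions against outcomes. The sketch below is only one possible approach: the class name `DriftMonitor`, the window size, and the "twice the baseline error" threshold are hypothetical choices, and the prediction/outcome pairs are made up.

```python
from collections import deque


class DriftMonitor:
    """Flags possible drift when recent error exceeds a multiple of the baseline."""

    def __init__(self, baseline_error, window=5, tolerance=2.0):
        self.baseline_error = baseline_error   # error level seen at deployment
        self.tolerance = tolerance             # allowed multiple of the baseline
        self.recent = deque(maxlen=window)     # most recent absolute errors

    def observe(self, predicted, actual):
        """Record one prediction/outcome pair; return True if drift is suspected."""
        self.recent.append(abs(predicted - actual))
        if len(self.recent) < self.recent.maxlen:
            return False                       # not enough evidence yet
        rolling_error = sum(self.recent) / len(self.recent)
        return rolling_error > self.tolerance * self.baseline_error


monitor = DriftMonitor(baseline_error=1.0, window=3, tolerance=2.0)

stable = [(10, 10.5), (12, 11), (9, 9.8)]      # model still matches reality
shifted = [(10, 14), (11, 15), (12, 17)]       # the world has changed

flags = [monitor.observe(p, a) for p, a in stable + shifted]
print(flags)
```

A flag here is a prompt to revisit the assumptions, not proof that the model is broken; the threshold itself is an assumption that deserves the same scrutiny.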
Knowing What You Don't Know
Even when your model does appear to be behaving well, you should still be aware of what you don't know and put accurate error bounds on your results. Good work is not perfect, so the truly capable data scientist is always aware of the limits of what is known. This practice of being careful about what you do not know is an important step in preventing yourself (and your models) from overreaching in a way that would undermine the value of your conclusions.
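As one hedged illustration of putting error bounds on a result, the sketch below computes a Wilson score interval for an observed rate. The choice of the Wilson interval and the example counts are assumptions made here; the article does not name a specific method.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
        / denom
    )
    return centre - half_width, centre + half_width


# The same observed 3% rate, with very different amounts of evidence behind it.
small = wilson_interval(3, 100)
large = wilson_interval(300, 10_000)

print(f"3 of 100:      {small[0]:.4f} to {small[1]:.4f}")
print(f"300 of 10,000: {large[0]:.4f} to {large[1]:.4f}")
```

Reporting the interval alongside the point estimate makes it visible that a 3% rate from 100 observations is far less certain than the same rate from 10,000.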
The Human Element
Don't expect data scientists to be superhumanly unbiased. Data scientists come with biases baked in, just like any human does. Data scientists will express the negative aspects of their biases by clinging to favorite models or techniques longer than is warranted or by inserting possibly erroneous assumptions into algorithms.
This means that data scientists need to be on the alert to detect cases where their bias migrates from wisdom born of experience into oblivious pig-headedness. There is surprisingly little distance between these two, so it is very important to keep a close eye on the possibility that things have come unglued. External reviews are helpful, as is continuous checking of predictions against actual observations.
The Value of Prediction
Reality should always have the last word. The proof of any model is how well it predicts events that are happening right now. This process, often called now-casting or retro-diction, can allow the performance of a model to be continuously assessed and re-assessed.
People and mathematical models all suffer from (and benefit from) bias that comes in many forms. This bias can be the cause of great problems or make it possible to achieve great things. Being unaware of bias is one of the main ways it can cause serious problems. This danger applies to the bias that we as humans carry into our own decision making and to the bias that may be hidden in some of the algorithms we choose. On the other hand, conscious and controlled bias in the form of intentional assumptions is a necessary and desirable part of effective machine learning. It's necessary to limit the options a machine learning model works with, and to do this well, you should inject assumptions that are carefully considered. Furthermore, in order to represent reality as accurately as possible, you should continue to monitor and evaluate the outcome of machine learning models over time. And in all cases, be aware of what you don't know as well as what you do.
[1] Claudia Goldin and Cecilia Rouse. 1997. "Orchestrating Impartiality: The Effect of 'Blind' Auditions on Female Musicians." NBER Working Paper #5903.
[2] Corinne A. Moss-Racusin, John F. Dovidio, Victoria L. Brescoll, Mark J. Graham and Jo Handelsman. 2012. "Science faculty's subtle gender biases favor male students." PNAS, 109(41): 16474-16479.
[3] Marianne Bertrand and Sendhil Mullainathan. 2004. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review, 94(4): 991-1013.
[4] D.H. Wolpert and W.G. Macready. 1997. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation, 1(1): 67-82.
Note: This article originally appeared in Datanami.
Originally Posted at: The Role of Bias In Big Data: A Slippery Slope by analyticsweekpick
Winter is coming, warm your Analytics Club
Yes and yes! As we head into winter, what better time to talk about our increasing dependence on data analytics to help with our decision making? Data- and analytics-driven decision making is rapidly working its way into our core corporate DNA, and we are not creating practice grounds to test those models fast enough. Such snug-looking models have hidden nails that could inflict uncharted pain if left unchecked. This is the right time to start thinking about setting up an Analytics Club [Data Analytics CoE] in your workplace to lab out best practices and provide a test environment for those models.
[ DATA SCIENCE Q&A]
Q: Examples of NoSQL architecture?
A: * Key-value: in a key-value NoSQL database, all of the data consists of an indexed key and a value. Example: DynamoDB
* Column-based: designed for storing data tables as sections of columns of data rather than as rows of data. Examples: HBase, Cassandra (usually classed as a wide-column store), SAP HANA
* Document database: maps a key to a document that contains structured information. The key is used to retrieve the document. Examples: MongoDB, CouchDB
* Graph database: designed for data whose relations are well represented as a graph, with interconnected elements and an undetermined number of relations between them. Example: Neo4j
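The four data models above can be sketched with plain Python structures. This is a toy illustration only: no real database client is used, and every key, field name, and value is hypothetical.

```python
# Key-value: an opaque value looked up by its key.
key_value = {"user:42": b"serialized-profile-bytes"}
profile_blob = key_value["user:42"]

# Column-based: each column is stored together, so scanning one column is cheap.
columns = {
    "user_id": [42, 43, 44],
    "country": ["US", "DE", "US"],
    "spend": [10.0, 25.5, 7.25],
}
us_spend = sum(
    spend
    for country, spend in zip(columns["country"], columns["spend"])
    if country == "US"
)

# Document: the key maps to a structured, possibly nested record.
documents = {"42": {"name": "Ada", "orders": [{"sku": "A1", "qty": 2}]}}
first_sku = documents["42"]["orders"][0]["sku"]

# Graph: nodes plus explicit relationships that can be traversed.
follows = {"ada": {"bob", "cy"}, "bob": {"cy"}, "cy": set()}


def reachable(graph, start):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


print(us_spend, first_sku, sorted(reachable(follows, "ada")))
```

The point of the sketch is the shape of each access pattern: lookup by key, scan by column, navigate a nested record, and traverse relationships.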
One of the most prominent benefits of Software-as-a-Service is its ability to use Big Data to provide the service a company needs. When that SaaS involves helping the B2B community gain more connections, Big Data can provide a reliable method for businesses to strengthen conversion rates and make their sales tactics more efficient.
Such is the mentality of customer management SaaS provider SalesPredict, which revealed late last week that it has secured a partnership with HG Data. Together, the two will combine SalesPredict's existing algorithm with the Big Data analytics capabilities of HG Data, allowing business users to more accurately target both inbound and outbound efforts to reach more companies.
According to SalesPredict, the collaboration combines predictive data analytics, an internal automated customer relationship management system, and external data sources from the Internet to gain the highest quality of information for B2B companies that want to gain more qualified leads and conversations.
"Our goal is to deliver the most accurate predictive lead and account scores, along with deep data insights, to help our customers find, convert and retain more customers," said Yaron Zakai-Or, cofounder and CEO of SalesPredict. "HG Data provides valuable information about the installed technology landscape, and we are excited to be able to share that data with our customers and help them market and sell more efficiently."
Reports said HG Data is already connected to more than 1,700 vendors and to more than 5,000 technology products. It accesses data through SaaS, the cloud, open source applications, mobile devices, CRM and more.
The partnership hints at the rise of Data-as-a-Service, a sector that reflects the rise in data analytics demand among businesses. Some of the world's largest technology companies have similarly recognized the importance for B2B marketers of implementing Big Data in their sales strategies.
Oracle Data Cloud, for example, recently launched its own DaaS platform. "By bringing Data-as-a-Service to marketers, we are providing a new world of external data that enables targeting accuracy and scale to every B2B interaction," said the firm's Senior VP and General Manager Omar Tawakol at the time.
The original article appeared on PYMNTS.com.
What occupational fields do you think of when you hear the words "data visualization" and "analytics"? Finance? Marketing? IT? Sales?
What about Distribution? HR? Operations? Fundraising? Education?
One spring when I was in elementary school, my teacher passed out sheets of colored construction paper and instructed us to cut them into strips. We then stapled the strips of paper together to make several chains that were each five links long, one chain that was only four links long, and two chains with three links. She then hung them in a row along the back wall corkboard and explained that each link represented a school day and each of the chains represented a school week. The four-link chain was for the week we had Monday off for Memorial Day and the two three-link chains represented our short weeks for teacher in-service and make up snow days. Each school day, one of us would get to rip off a link and watch as, one by one, the days and weeks tore away to summer vacation.
That, my friends, is data visualization: expressing information through imagery. My teacher took a predetermined set of data (days and weeks left in the school year) and presented it in a visual way that helped us better comprehend the amount of time until summer break. Sure, we could count the links if we wanted to know exactly how many days were left in the school year, but one glance at the corkboard gave us enough information to get the big picture.
Finance, IT, and marketing have done a great job at utilizing data visualization, but they don't own the technique. If a teacher can use data visualization with a group of elementary kids, so can you. Let's look at some less-than-conventional ways you can use data visualization throughout your company.
Production and Distribution
You're making and moving products, so what do you need pictures for, right? Well, for one, mapping out production and distribution processes helps identify pain points. Are you behind on orders? Don't know where the holdup is? Physically mapping out your process can reveal possible bottlenecks and opportunities for improved efficiency. Take a good look at:
Workload: Bottlenecks can signify uneven workloads. Does anyone have too much on their plate? Are valuable resources being underutilized?
Timing: Timing is everything, or so the saying goes, and time is one of our most valuable resources. Are you using your time wisely? Are all of the steps in sync throughout the process? Is there a way to capitalize on unavoidable lag time?
Physical layout of machinery and supplies: It's about more than feng shui. Visualizing movement within the factory or warehouse highlights the most frequently used routes and resources. Are there physical obstacles hindering process flow or highly trafficked areas? Can the most frequently used and requested items be made more accessible?
Distribution routes: Wandering and overlapping routes waste time and fuel. Map out and assign deliveries by location to provide drivers with the most efficient routes. Update your maps regularly to help your drivers steer clear of road closures and construction zones.
Human Resources
Project management and recruiting tools allow HR to collect so much information that it can be easy to lose the forest for the trees. Don't get lost in the minutiae. Human resources data visualization can help you step back to see the bigger picture. Identify trends in employee satisfaction, engagement, and career progression. New employee drop-off or confusion may signal a snag in the onboarding process. Training and certification data can uncover potential issues for succession planning or coverage. Organizational charts can expose over-burdened branches that might benefit from pruning or reorganization. Watch industry trends to ensure your company attracts and keeps quality employees with competitive salaries and benefits.
Sales and Marketing
Okay, okay, I know I said marketing is already accustomed to data visualization, but I also said we were going to look at unconventional methods. Sales and marketing teams have been using charts and graphs to measure progress and success basically since the dawn of sales and marketing. Pie charts show who's got the biggest market share, line graphs follow sales trends, bar graphs help us compare year over year, and many sales and marketing tools like CRMs and marketing automation software provide simple data visualizations.
What if, instead of using data visualization to analyze what has been done, we used it to flesh out what can be done? I'm not just talking about forecasting trends. Scatter maps are great for seeing geographical location and impact, and lack thereof. Everyone wants to focus on analyzing the data represented. Don't forget to examine the white space. Are there missed sales opportunities outside of or between spheres? Can you use this information to literally add in new sales while moving between markets? Scatter graph spheres also make excellent Venn diagrams, showing overlapping areas where marketers may be overspending or inundating their customers with messaging. The list goes on.
The truth is, data is information, and every field works with some sort of information. You don't have to be an accountant or scientist to work with data, and you certainly don't have to be a pivot-table whiz to benefit from data utilization. It's all around. That thermometer you colored in as your club raised money for the big trip? Data visualization. The fuel gauge on your car's dashboard that's always a little too close to "E"? Data visualization. Pro-con list? March Madness bracket? Paper chain-link countdown? I think you can see where I'm going, but the real question is: can you see where you're going?
About the Author
Melissa Reinke is a writer for TechnologyAdvice.com. She is a storyteller, editor, writer, and all-around word nerd extraordinaire. She spends her days managing web content and her nights unwinding in myriad creative ways, including writing for herself and others. From personal memoirs to professional solutions, when writing and editing for others Melissa's singular goal is to sculpt each piece into its best, most successful form while maintaining the integrity of the original voice and vision. Based in Music City, USA, Melissa can often be found enjoying great live tunes with even better friends. Then again, she's just as likely to be found curled up with a good book and a tasty beverage.
Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data could lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric. This is the metric that matters the most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.
[ DATA SCIENCE Q&A]
Q: How to clean data?
A: 1. First: detect anomalies and contradictions
* Tidy data issues (see Hadley Wickham's "Tidy Data" paper):
– column headers are values, not variable names, e.g. 26-45
– multiple variables are stored in one column, e.g. m1534 (male, 15-34 years old)
– variables are stored in both rows and columns, e.g. tmax and tmin in the same column
– multiple types of observational units are stored in the same table, e.g. a song dataset and a rank dataset in the same table
– a single observational unit is stored in multiple tables (these can be combined)
* Data-Type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
* Range constraints: number or dates fall within a certain range. They have minimum/maximum permissible values
* Mandatory constraints: certain columns can't be empty
* Unique constraints: a field must be unique across a dataset, e.g. each person must have a unique Social Security number
* Set-membership constraints: the values for a column must come from a set of discrete values or codes, e.g. a gender field must be one of: female, male
* Regular expression patterns: for example, phone number may be required to have the pattern: (999)999-9999
* Missing values
* Cross-field validation: certain conditions that involve multiple fields must hold. For instance, in laboratory medicine, the percentages of the different white blood cell types must sum to 100. In a hospital database, a patient's date of discharge can't be earlier than the admission date
2. Clean the data using:
* Regular expressions: misspellings, regular expression patterns
* KNN-impute and other missing values imputing methods
* Coercing: data-type constraints
* Melting: tidy data issues
* Date/time parsing
* Removing observations
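Several of the detection checks listed above can be sketched as a small validation routine. This is an illustration under assumptions, not a complete cleaning pipeline: the field names (`name`, `age`, `gender`, `phone`, `ssn`, `admitted`, `discharged`), the 0 to 120 age range, and the sample records are all hypothetical.

```python
import re
from datetime import date

PHONE_PATTERN = re.compile(r"^\(\d{3}\)\d{3}-\d{4}$")   # the (999)999-9999 pattern
ALLOWED_GENDERS = {"female", "male"}                    # codes from the list above


def validate(record):
    """Return a list of constraint violations for one record (empty if clean)."""
    problems = []
    if record.get("name") in (None, ""):
        problems.append("mandatory: name is empty")
    if not isinstance(record.get("age"), int):
        problems.append("type: age is not an integer")
    elif not 0 <= record["age"] <= 120:
        problems.append("range: age outside 0-120")
    if record.get("gender") not in ALLOWED_GENDERS:
        problems.append("set-membership: unknown gender code")
    if not PHONE_PATTERN.match(record.get("phone") or ""):
        problems.append("pattern: phone does not match (999)999-9999")
    admitted, discharged = record.get("admitted"), record.get("discharged")
    if admitted and discharged and discharged < admitted:
        problems.append("cross-field: discharge before admission")
    return problems


def duplicate_values(records, key):
    """Unique constraint: return the values of `key` that occur more than once."""
    seen, dupes = set(), set()
    for record in records:
        value = record.get(key)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


clean = {
    "name": "Ann", "age": 34, "gender": "female", "phone": "(555)123-4567",
    "ssn": "001", "admitted": date(2020, 1, 2), "discharged": date(2020, 1, 5),
}
dirty = {
    "name": "", "age": 150, "gender": "F", "phone": "555-1234",
    "ssn": "001", "admitted": date(2020, 1, 5), "discharged": date(2020, 1, 2),
}

print(validate(clean))
print(validate(dirty))
print(duplicate_values([clean, dirty], "ssn"))
```

Detection and repair are kept separate here on purpose: the routine only reports violations, leaving the choice of fix (coercion, imputation, or removal) to the cleaning step.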