The Last Layer of Cyber Security: Business Continuity and Disaster Recovery with Incremental Backups

Due to burgeoning regulatory penalties and a seemingly interminable amount of threats, cyber security is a foremost concern for the contemporary enterprise.

Oftentimes, the most dependable protection involves combining methods and technologies to preserve the integrity of IT systems and their data. In this regard, an organization’s security stack is often as important as its technology stack.

According to BackupAssist Chief Executive Officer Linus Chang, there is considerable advantage to topping the former with reliable, systematic backups to ensure business continuity.

“A few years ago, people were saying that backup is dead, people are moving to the cloud, it’s all about high availability and so on,” Chang mentioned. “But these new regulations, and especially the onset of crypto ransomware, has really brought back the spotlight for having multiple layers of protection for data. There’s a lot of interest in keeping historical versions of data, and having backups as the last layer of protection should perimeter security fail.”

Ransomware
The advent of cryptographic ransomware makes for a compelling use case to preserve backup copies of information assets, and serves to highlight the various areas in which business continuity impacts the enterprise. Business continuity involves elements of cyber security, network availability, organizational risk, and disaster recovery—meaning that timely backups are assistive in each of these areas as well. Chang referenced an occurrence in which users “were struck by ransomware and didn’t have a proper backup system. They never got the data back and had to re-key in six weeks of data. Having backups would have minimized that to two, three hours maximum.” Modern backup systems are able to address instances of ransomware (and other types of malware) in three ways:

  • Protection—Protective capabilities can help ward off ransomware or malware attacks.
  • Detection—Scanning mechanisms enable the detection of any sort of large scale corruption or file system modifications typical of ransomware. Such routine scanning is a component of the backup process.
  • Response—Once ransomware or infected files are detected, backup jobs are ceased and prevented from running on those files. Alerts are sent to users notifying them of infected files.

High Availability
Network availability is an integral aspect of business continuity, particularly in the event of security breaches or disaster recovery. Nonetheless, there are a few key differences between conventional high availability methods (which frequently involve redundancy and cloud failover capabilities) and those provided by timely backups. According to Chang, “High availability is about minimizing downtime. Commonly you would say three nines [99.9%], four nines [99.99%], five nines [99.999%] of availability. Over the course of a year you might be down for 20 or 30 minutes.” Backing up data can also decrease downtime in certain network failure events. Still, backups issue benefits in addition to availability. “Backup is about being able to restore whole systems to get back historical data,” Chang noted. “High availability only talks about the current version of data. It doesn’t talk about being able to pull something back from two years ago.” Thus, when prompted by regulators or legal discovery measures for data companies may have possessed years ago, data backups—not high availability—provide ideal solutions. As such, backups are necessary for business continuity. “Business continuity’s all about the business and do they have the data when they need it,” Chang explained.

Flexibility
Viewed from a business continuity perspective, backups require a degree of flexibility to serve an ever evolving ecosystem of enterprise data needs. Cloud backups are usually the most common variety of backups deployed. However, there are certain situations in which backing up data to local, physical storage (typically on disk) is much more preferable to cloud backups. “If you’ve got small data sets, then absolutely it’s more feasible to put that data in the cloud,” Chang said. “When you’ve got large datasets and complete servers that you need to get up after a disaster, you need to bring it back up for full metal disaster recovery, then it’s always faster to do that when it’s stored on a hard disk at your local office. Imagine downloading terabytes of data from the cloud. It’s just too slow.” Other backup options involve cold storage backups, in which copies of data are “disconnected from the computer and the network,” Chang commented. Best practices include storing backups in multiple locations, utilizing cold storage, and leveraging ubiquitous file formats as opposed to proprietary, vendor formats that are difficult to access once versions of software or files have progressed.

Secure Implementations
In addition to investing in measures to fortify perimeter security, it’s becoming more and more necessary to preserve data with backups for any variety of use cases. Doing so is instrumental to business continuity in all of its facets, which include high availability, risk, cyber security, and disaster recovery.

Source by jelaniharper

Learning to drive sustainable practices through smarter technology

As the demand for agricultural commodities rises around the world, the ag industry is recognizing the need for sustainable practices and processes. Sustainable agriculture practices are designed to protect both our natural environment and our resources, while promoting environmental stewardship and enhancing the quality of life for farmers and producers. To achieve this new approach to agribusiness, the industry must embrace the tools of success, including smart technology, data analysis algorithms and connectivity devices.

This demand for more sustainable practices has grown over the last decade, changing the industry’s focus from productivity to sustainability. From increasing concerns over weather changes, to the rapidly growing world population and the constantly changing business environment within the ag industry, there are strong drivers to achieve success.

Today’s agribusiness leaders are establishing their sincere motivation to achieve measurable and manageable practices across the value chain. From farmers and ranchers to agronomists, ag retailers, buyers and manufacturers, there are clear changes needed to achieve sustainable practices, including building trusting partnerships within the industry and creating work standards across the industry that make employees proud.

The emergence of smart farming tools and discussions around regenerative agriculture are driving industry leaders to look for solutions. And the technology available today could provide us the path toward those solutions. “Proagrica has stayed very close to the agronomists in developing our algorithms,” says Brandon Buie, Director of Business Development, Enterprise Solutions for Proagrica. “This helps us remain close to the science of improving production and the utilization of resources and reducing waste. While still helping producers to achieve the level of production and the quality to serve their customers.”

In order to truly improve sustainability efforts across the ag industry, technology, data analysis and connectivity must be enhanced to allow:

Improved Communications. Communication plays a vital role in strategy, execution and sustainability. Communication and connectivity allow team members to discuss challenges, share solutions and keep in contact regarding shared goals. For farmers who have experienced connectivity struggles for decades, the ability to trust their technology and rely on their ability to communication and transmit information changes their daily practices in significant, positive ways.

Stronger Partnerships. “Agriculture is such a huge industry, and in so many ways, the most important industry. Yet, it is fragmented; what we need is a true value network – not a supply chain – but a value network,” says Lindsay Suddon, Chief Strategy Officer for Proagrica. True partnerships and a culture of sharing information allows members of the industry to learn from the successes and failures of others, to more quickly identify efficiencies and to celebrate shared successes.

Algorithms and Answers. According to estimates by IBM, by the year 2020, there will be 300 times more data and information available than we had in 2005. In order to advance sustainable practices in agribusiness, we must not only utilize the available data, but understand how to analyze it to find answers and use it to create improved algorithms. “We need more data and we need it from multiple sources,” says Mr. Buie. “Proagrica understands the importance of integrating with our customer’s systems to provide actionable insights. That is the way forward.”

Proactive Planning. Utilizing data to manage and measure results to process changes can help to identify both problems and solutions. The more information that is available and shared, the more proactive farmers and producers can be. “Proactive, rather than reactive, is the key,” says Mr. Suddon. “We must be able to plan for the future or we will never move forward.”

Accountability Across the Industry. In order to meet the goals shared across the industry, there must also be shared accountability. “We are looking at collaboration between multiple stakeholders, all focused on regenerative practices that move the industry forward,” says Mr. Buie. “This is our opportunity to come together, from manufacturers through the value chain to growers and achieve not only the UN Sustainability Goals, but real sustainability in our daily practices and ensuring more effective and sustainable use of products.”

Looking forward, business leaders at every stage of the agriculture value cycle must take ownership of their responsibilities and the opportunities for innovation and improvement. Obtaining value from data is a complicated process; Proagrica is leading the way, depending on their team of data scientists and expertise in environmental sciences to compile data for their customers, track and trend information gained from the data and use that information to put real-world solutions in place to advance the cause of sustainability and the quest for improved production and quality in agribusiness.

To learn more about Proagrica’s data solutions tailored to support sustainable practices, click here.

The post Learning to drive sustainable practices through smarter technology appeared first on Proagrica.

Source: Learning to drive sustainable practices through smarter technology

What is in an API?

Adoption and use of an API are heavily dependent upon the consistency, reliability, and intuitiveness of a given API. It is a delicate balance. If the API is too difficult to use or the results are inconsistent across uses, its value will be limited and, hence, adoption will be low. Our research has found that many enterprises have focused on the technical requirements to deliver and scale and API, but have not focused on the corresponding non-functional requirements of a comprehensive API strategy that includes support, service level management, ecosystem development and usability in the face of legacy integration or stream large amounts of data in response to a request.

Key Findings:

  • Enterprises are creating APIs at an incredibly fast pace often without paying attention to the implications for operational overhead or a comprehensive strategy that goes beyond the technology for building and deployment.
  • A successful API strategy must include provisions for developer onboarding and support, clear policies related to service levels and use, the ability to operate in a hybrid cloud model, and a proactive monetization scheme. With regard to the latter primarily focused on support for various sizing approaches and scalability.
  • There is still a strong requirement to “roll your own” API infrastructure to support a comprehensive API strategy. While many of today’s API gateway and management products to address the basics, businesses will still need to invest in building a developer portal, providing support, growing a community and addressing many of the operational issues. Microservices architecture is helping to alleviate some of the overhead associated with deployment and scalability by packaging APIs into manageable entities.

Originally Posted at: What is in an API?

Recursive Feature Elimination (RFE) for Feature Selection in Python



Recursive Feature Elimination, or RFE for short, is a popular feature selection algorithm.

RFE is popular because it is easy to configure and use and because it is effective at selecting those features (columns) in a training dataset that are more or most relevant in predicting the target variable.

There are two important configuration options when using RFE: the choice in the number of features to select and the choice of the algorithm used to help choose features. Both of these hyperparameters can be explored, although the performance of the method is not strongly dependent on these hyperparameters being configured well.

In this tutorial, you will discover how to use Recursive Feature Elimination (RFE) for feature selection in Python.

After completing this tutorial, you will know:

  • RFE is an efficient approach for eliminating features from a training dataset for feature selection.
  • How to use RFE for feature selection for classification and regression predictive modeling problems.
  • How to explore the number of selected features and wrapped algorithm used by the RFE procedure.

Let’s get started.

Recursive Feature Elimination (RFE) for Feature Selection in Python

Recursive Feature Elimination (RFE) for Feature Selection in Python
Taken by djandywdotcom, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Recursive Feature Elimination
  2. RFE With scikit-learn
    1. RFE for Classification
    2. RFE for Regression
  3. RFE Hyperparameters
    1. Explore Number of Features
    2. Automatically Select the Number of Features
    3. Which Features Were Selected
    4. Explore Base Algorithm

Recursive Feature Elimination

Recursive Feature Elimination, or RFE for short, is a feature selection algorithm.

A machine learning dataset for classification or regression is comprised of rows and columns, like an excel spreadsheet. Rows are often referred to as samples and columns are referred to as features, e.g. features of an observation in a problem domain.

Feature selection refers to techniques that select a subset of the most relevant features (columns) for a dataset. Fewer features can allow machine learning algorithms to run more efficiently (less space or time complexity) and be more effective. Some machine learning algorithms can be misled by irrelevant input features, resulting in worse predictive performance.

For more on feature selection generally, see the tutorial:

RFE is a wrapper-type feature selection algorithm. This means that a different machine learning algorithm is given and used in the core of the method, is wrapped by RFE, and used to help select features. This is in contrast to filter-based feature selections that score each feature and select those features with the largest (or smallest) score.

Technically, RFE is a wrapper-style feature selection algorithm that also uses filter-based feature selection internally.

RFE works by searching for a subset of features by starting with all features in the training dataset and successfully removing features until the desired number remains.

This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.

When the full model is created, a measure of variable importance is computed that ranks the predictors from most important to least. […] At each stage of the search, the least important predictors are iteratively eliminated prior to rebuilding the model.

— Pages 494-495, Applied Predictive Modeling, 2013.

Features are scored either using the provided machine learning model (e.g. some algorithms like decision trees offer importance scores) or by using a statistical method.

The importance calculations can be model based (e.g., the random forest importance criterion) or using a more general approach that is independent of the full model.

— Page 494, Applied Predictive Modeling, 2013.

Now that we are familiar with the RFE procedure, let’s review how we can use it in our projects.

RFE With scikit-learn

RFE can be implemented from scratch, although it can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of RFE for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version
import sklearn
print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

0.22.1

The RFE method is available via the RFE class in scikit-learn.

RFE is a transform. To use it, first the class is configured with the chosen algorithm specified via the “estimator” argument and the number of features to select via the “n_features_to_select” argument.

The algorithm must provide a way to calculate important scores, such as a decision tree. The algorithm used in RFE does not have to be the algorithm that is fit on the selected features; different algorithms can be used.

Once configured, the class must be fit on a training dataset to select the features by calling the fit() function. After the class is fit, the choice of input variables can be seen via the “support_” attribute that provides a True or False for each input variable.

It can then be applied to the training and test datasets by calling the transform() function.

...
# define the method
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=3)
# fit the model
rfe.fit(X, y)
# transform the data
X, y = rfe.transform(X, y)

It is common to use k-fold cross-validation to evaluate a machine learning algorithm on a dataset. When using cross-validation, it is good practice to perform data transforms like RFE as part of a Pipeline to avoid data leakage.

Now that we are familiar with the RFE API, let’s take a look at how to develop a RFE for both classification and regression.

RFE for Classification

In this section, we will look at using RFE for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 10 input features, five of which are important and five of which are redundant.

The complete example is listed below.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 10) (1000,)

Next, we can evaluate an RFE feature selection algorithm on this dataset. We will use a DecisionTreeClassifier to choose features and set the number of features to five. We will then fit a new DecisionTreeClassifier model on the selected features.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

The complete example is listed below.

# evaluate RFE for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see the RFE that uses a decision tree and selects five features and then fits a decision tree on the selected features achieves a classification accuracy of about 88.6 percent.

Accuracy: 0.886 (0.030)

We can also use the RFE model pipeline as a final model and make predictions for classification.

First, the RFE and model are fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with an RFE pipeline
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# fit the model on all available data
pipeline.fit(X, y)
# make a prediction for one example
data = [[2.56999479,-0.13019997,3.16075093,-4.35936352,-1.61271951,-1.39352057,-2.48924933,-1.93094078,3.26130366,2.05692145]]
yhat = pipeline.predict(data)
print('Predicted Class: %d' % (yhat))

Running the example fits the RFE pipeline on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 1

Now that we are familiar with using RFE for classification, let’s look at the API for regression.

RFE for Regression

In this section, we will look at using RFE for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 10 input features, five of which are important and five of which are redundant.

The complete example is listed below.

# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 10) (1000,)

Next, we can evaluate an REFE algorithm on this dataset.

As we did with the last section, we will evaluate the pipeline with a decision tree using repeated k-fold cross-validation, with three repeats and 10 folds.

We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

# evaluate RFE for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see the RFE pipeline with a decision tree model achieves a MAE of about 26.

MAE: -26.853 (2.696)

We can also use the c as a final model and make predictions for regression.

First, the Pipeline is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# make a regression prediction with an RFE pipeline
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# fit the model on all available data
pipeline.fit(X, y)
# make a prediction for one example
data = [[-2.02220122,0.31563495,0.82797464,-0.30620401,0.16003707,-1.44411381,0.87616892,-0.50446586,0.23009474,0.76201118]]
yhat = pipeline.predict(data)
print('Predicted: %.3f' % (yhat))

Running the example fits the RFE pipeline on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted: -84.288

Now that we are familiar with using the scikit-learn API to evaluate and use RFE for feature selection, let’s look at configuring the model.

RFE Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the RFE method for feature selection and their effect on model performance.

Explore Number of Features

An important hyperparameter for the RFE algorithm is the number of features to select.

In the previous section, we used an arbitrary number of selected features, five, which matches the number of informative features in the synthetic dataset. In practice, we cannot know the best number of features to select with RFE; instead, it is good practice to test different values.

The example below demonstrates selecting different numbers of features from 2 to 10 on the synthetic binary classification dataset.

# explore the number of selected features for RFE
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	for i in range(2, 10):
		rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=i)
		model = DecisionTreeClassifier()
		models[str(i)] = Pipeline(steps=[('s',rfe),('m',model)])
	return models

# evaluate a give model using cross-validation
def evaluate_model(model):
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	scores = evaluate_model(model)
	results.append(scores)
	names.append(name)
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example first reports the mean accuracy for each configured number of input features.

In this case, we can see that performance improves as the number of features increase and perhaps peaks around 4-to-7 as we might expect, given that only five features are relevant to the target variable.

>2 0.715 (0.044)
>3 0.825 (0.031)
>4 0.876 (0.033)
>5 0.887 (0.030)
>6 0.890 (0.031)
>7 0.888 (0.025)
>8 0.885 (0.028)
>9 0.884 (0.025)

A box and whisker plot is created for the distribution of accuracy scores for each configured number of features.

Box Plot of RFE Number of Selected Features vs. Classification Accuracy

Box Plot of RFE Number of Selected Features vs. Classification Accuracy

Automatically Select the Number of Features

It is also possible to automatically select the number of features chosen by RFE.

This can be achieved by performing cross-validation evaluation of different numbers of features as we did in the previous section and automatically selecting the number of features that resulted in the best mean score.

The RFECV class implements this for us.

The RFECV is configured just like the RFE class regarding the choice of the algorithm that is wrapped. Additionally, the minimum number of features to be considered can be specified via the “min_features_to_select” argument (defaults to 1) and we can also specify the type of cross-validation and scoring to use via the “cv” (defaults to 5) and “scoring” arguments (uses accuracy for classification).

...
# automatically choose the number of features
rfe = RFECV(estimator=DecisionTreeClassifier())

We can demonstrate this on our synthetic binary classification problem and use RFECV in our pipeline instead of RFE to automatically choose the number of selected features.

The complete example is listed below.

# automatically select the number of features for RFE
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# create pipeline
rfe = RFECV(estimator=DecisionTreeClassifier())
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see the RFE that uses a decision tree and automatically selects a number of features and then fits a decision tree on the selected features achieves a classification accuracy of about 88.6 percent.

Accuracy: 0.886 (0.026)

Which Features Were Selected

When using RFE, we may be interested to know which features were selected and which were removed.

This can be achieved by reviewing the attributes of the fit RFE object (or fit RFECV object). The “support_” attribute reports true or false as to which features in order of column index were included and the “ranking_” attribute reports the relative ranking of features in the same order.

The example below fits an RFE model on the whole dataset and selects five features, then reports each feature column index (0 to 9), whether it was selected or not (True or False), and the relative feature ranking.

# report which features were selected by RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define RFE
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# fit RFE
rfe.fit(X, y)
# summarize all features
for i in range(X.shape[1]):
	print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))

Running the example lists of the 10 input features and whether or not they were selected as well as their relative ranking of importance.

Column: 0, Selected False, Rank: 5.000
Column: 1, Selected False, Rank: 4.000
Column: 2, Selected True, Rank: 1.000
Column: 3, Selected True, Rank: 1.000
Column: 4, Selected True, Rank: 1.000
Column: 5, Selected False, Rank: 6.000
Column: 6, Selected True, Rank: 1.000
Column: 7, Selected False, Rank: 3.000
Column: 8, Selected True, Rank: 1.000
Column: 9, Selected False, Rank: 2.000

Explore Base Algorithm

There are many algorithms that can be used in the core RFE, as long as they provide some indication of variable importance.

Most decision tree algorithms are likely to report the same general trends in feature importance, but this is not guaranteed. It might be helpful to explore the use of different algorithms wrapped by RFE.

The example below demonstrates how you might explore this configuration option.

# explore the algorithm wrapped by RFE
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	# lr
	rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5)
	model = DecisionTreeClassifier()
	models['lr'] = Pipeline(steps=[('s',rfe),('m',model)])
	# perceptron
	rfe = RFE(estimator=Perceptron(), n_features_to_select=5)
	model = DecisionTreeClassifier()
	models['per'] = Pipeline(steps=[('s',rfe),('m',model)])
	# cart
	rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
	model = DecisionTreeClassifier()
	models['cart'] = Pipeline(steps=[('s',rfe),('m',model)])
	# rf
	rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)
	model = DecisionTreeClassifier()
	models['rf'] = Pipeline(steps=[('s',rfe),('m',model)])
	# gbm
	rfe = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=5)
	model = DecisionTreeClassifier()
	models['gbm'] = Pipeline(steps=[('s',rfe),('m',model)])
	return models

# evaluate a give model using cross-validation
def evaluate_model(model):
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	scores = evaluate_model(model)
	results.append(scores)
	names.append(name)
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example first reports the mean accuracy for each wrapped algorithm.

In this case, the results suggest that linear algorithms like logistic regression might select better features more reliably than the chosen decision tree and ensemble of decision tree algorithms.

>lr 0.893 (0.030)
>per 0.843 (0.040)
>cart 0.887 (0.033)
>rf 0.858 (0.038)
>gbm 0.891 (0.030)

A box and whisker plot is created for the distribution of accuracy scores for each configured wrapped algorithm.

We can see the general trend of good performance with logistic regression, CART and perhaps GBM. This highlights that even thought the actual model used to fit the chosen features is the same in each case, the model used within RFE can make an important difference to which features are selected and in turn the performance on the prediction problem.

Box Plot of RFE Wrapped Algorithm vs. Classification Accuracy

Box Plot of RFE Wrapped Algorithm vs. Classification Accuracy

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Books

Papers

APIs

Articles

Summary

In this tutorial, you discovered how to use Recursive Feature Elimination (RFE) for feature selection in Python.

Specifically, you learned:

  • RFE is an efficient approach for eliminating features from a training dataset for feature selection.
  • How to use RFE for feature selection for classification and regression predictive modeling problems.
  • How to explore the number of selected features and wrapped algorithm used by the RFE procedure.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



The post Recursive Feature Elimination (RFE) for Feature Selection in Python appeared first on Machine Learning Mastery.

Originally Posted at: Recursive Feature Elimination (RFE) for Feature Selection in Python by administrator

Data APIs: Gateway to Data-Driven Operation and Digital Transformation

Enterprises everywhere are on a quest to use their data efficiently and innovatively, and to maximum advantage, both in terms of operations and competitiveness. The advantages of doing so are taken on authority and reasonably so. Analyzing your data helps you better understand how your business actually runs.

Such insights can help you see where things can improve, and can help you make instantaneous decisions when required by emergent situations. You can even use your data to build predictive models that help you forecast operations and revenue, and, when applied correctly, these models can be used to prescribe actions and strategy in advance.

That today’s technology allows business to do this is exciting and inspiring. Once such practice becomes widespread, we’ll have trouble believing that our planning and decision-making wasn’t data-driven in the first place.

Bumps in the Road

While the analytics software has become so powerful, enterprise data integration needed to exploit that power has become much harder.

But we need to be cautious here. Even though the technical breakthroughs we’ve had are impressive and truly transformative, there are some dependencies – prerequisites – that must be met in order for these analytics technologies to work properly. If we get too far ahead of those requirements, then we will not succeed in our initiatives to extract business insights from data.

The dependencies concern the collection, the cleanliness, and the thoughtful integration of the organization’s data within the analytics layer. And, in an unfortunate irony, while analytics software has become so powerful, the integration work that’s needed to exploit that power has become more difficult.

From Consolidated to Dispersed

The reason for this added difficulty is the fragmentation and distribution of an organization’s data. Enterprise software, for the most part, used to run on-premises and much of its functionality was consolidated into a relatively small stable of applications, many of which shared the same database platform. Integrating the databases was a manageable process if proper time and resources were allocated.

But with so much enterprise software functionality now available through Software as a Service (SaaS) offerings in the cloud, bits and pieces of an enterprise’s data are now dispersed through different cloud environments on a variety of platforms. Pulling all of this data together is a unique exercise for each of these cloud applications, multiplying the required integration work many times over.

Even on-premises, the world of data has become complex. The database world was once dominated by three major relational database management system (RDBMS) products, but that’s no longer the case. Now, in addition to the three commercial majors, two open source RDBMSs have joined them in Enterprise popularity and adoption. And beyond the RDBMS world, various NoSQL databases and Big Data systems, like MongoDB and Hadoop, have joined the on-premises data fray.

A Way Forward

A major question emerges. As this data fragmentation is not merely an exception or temporary inconvenience, but rather the new normal, is there a way to approach it holistically? Can enterprises that must solve the issue of data dispersal and fragmentation at least have a unified approach to connecting to, integrating, and querying that data? While an ad hoc approach to integrating data one source at a time can eventually work, it’s a very expensive and slow way to go, and yields solutions that are very brittle.

In this report, we will explore the role of application programming interfaces (APIs) in pursuing the comprehensive data integration that is required to bring about a data-driven organization and culture. We’ll discuss the history of conventional APIs, and the web standards that most APIs use today. We’ll then explore how APIs and the metaphor of a database with tables, rows, and columns can be combined to create a new kind of API. And we’ll see how this new type of API scales across an array of data sources and is more easily accessible than older API types, by developers and analysts alike.

Originally Posted at: Data APIs: Gateway to Data-Driven Operation and Digital Transformation by analyticsweek