Machine Learning on Windows 10 – Part 6: ML Walk-Through
Machine Learning Walk-Through
Below is the general outline of the machine learning pipeline:
- Data cleaning and formatting
- Exploratory data analysis
- Feature engineering and selection
- Compare several machine learning models on a performance metric
- Perform hyper-parameter tuning on the best model to optimize it for the problem
- Evaluate the best model on the testing set
- Interpret the model results to the extent possible
- Draw conclusions and write a well-documented report
As with the previous post, I do not want to reinvent the wheel. I found the following article and its associated code on GitHub very informative. You can download the code from GitHub and open it in a local Jupyter Notebook:
https://github.com/WillKoehrsen/machine-learning-project-walkthrough
I found that walking through the Jupyter Notebooks helped me understand each concept much better, since I could change values in real time and see the results.
Machine Learning Project Part 1.ipynb
This notebook will cover the following topics:
- Machine Learning Workflow
- Libraries
| Library | Description |
|---|---|
| numpy | A library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. |
| pandas | A library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. |
| scikit-learn | This library features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. |
| matplotlib | A plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. |
| seaborn | A data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. |
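As a quick orientation, here is the standard set of imports these libraries are usually brought in with (a minimal sketch; the notebooks use the same conventional aliases):

```python
# Standard imports used throughout the walkthrough notebooks
import numpy as np                 # numerical arrays and math functions
import pandas as pd                # DataFrames for tabular data

import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical graphics on top of matplotlib
```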
- Exploratory Data Analysis (EDA) – An open-ended process where we make plots and calculate statistics in order to explore our data. The purpose is to find anomalies, patterns, trends, or relationships.
- Feature Engineering – The process of taking raw data and extracting or creating new features that allow a machine learning model to learn a mapping between these features and the target. This might mean taking transformations of variables. Generally, feature engineering is the process of adding additional features derived from the raw data.
- Feature Selection – The process of choosing the most relevant features in your data. “Most relevant” can depend on many factors, but it might be something as simple as the highest correlation with the target, or the features with the most variance. In feature selection, we remove features that do not help our model learn the relationship between features and the target. This can help the model generalize better to new data and results in a more interpretable model. Generally, feature selection is the process of subtracting features so we are left with only those that are most important.
- Split Into Training and Testing Sets – In machine learning, we always need to separate our data into two sets: training and testing.
  - Training set – A set of data provided to a model during training, along with the answers, so the model can learn a mapping between the features and the target.
  - Testing set – A set of data used to evaluate the mapping learned by the model. The model has never seen the answers on the testing set and must make predictions using only the features. Since we know the true answers for the test set, we can compare the test predictions to the true test targets to get an estimate of how well our model will perform when deployed in the real world.
- Establish a Baseline – It’s important to establish a naive baseline before we begin making machine learning models. If the models we build cannot outperform a naive guess, then we might have to admit that machine learning is not suited for this problem. This could be because we are not using the right models, because we need more data, or because there is a simpler solution that does not require machine learning. Establishing a baseline is crucial so we do not end up building a machine learning model only to realize we can’t actually solve the problem. For a regression task, a good naive baseline is to predict the median value of the target on the training set for all examples in the test set. This is simple to implement and sets a relatively low bar for our models: if they cannot do better than guessing the median value, we will need to rethink our approach. A short sketch of the split and the baseline follows this list.
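To make those last two ideas concrete, here is a minimal sketch that splits a toy DataFrame and computes the median baseline’s mean absolute error. The column names and data below are made up for illustration; they are not the walkthrough’s actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a real dataset: two numeric features and a numeric target
rng = np.random.RandomState(42)
df = pd.DataFrame({
    'floor_area': rng.uniform(1_000, 100_000, 500),
    'year_built': rng.randint(1900, 2020, 500),
    'score': rng.uniform(1, 100, 500),
})

features = df.drop(columns='score')
targets = df['score']

# Hold out 30% of the data for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.3, random_state=42)

# Naive baseline: predict the median of the training targets for every test example
baseline_guess = np.median(y_train)
baseline_mae = np.mean(np.abs(y_test - baseline_guess))
print(f'Baseline MAE: {baseline_mae:0.2f}')
```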
Machine Learning Project Part 2.ipynb
Part two of the walkthrough makes use of scikit-learn, which needs to be installed first:
```
conda install scikit-learn
```
In addition, one of the classes the Machine Learning Project Part 2.ipynb walkthrough uses, Imputer, has been deprecated and removed from recent versions of scikit-learn. You will have to replace the following lines:
```python
# Imputing missing values and scaling values
from sklearn.preprocessing import Imputer, MinMaxScaler
```
with this:
```python
# Imputing missing values and scaling values
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
```
Under the “Imputing Missing Values” section you will have to replace the following:
```python
imputer = Imputer(strategy='median')
```
with:
```python
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
```
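As a sanity check that the replacement behaves the same way, here is a minimal sketch of the impute-then-scale pattern the notebook uses. The toy arrays are mine, not the notebook’s data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices with a missing value in each set
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, np.nan]])

# Fit the imputer on the training data only, then apply it to both sets,
# so no information from the test set leaks into training
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Scale each feature to the range [0, 1], again fitting only on training data
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

With those fixes in place, the notebook covers the following topics: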
- Evaluating and Comparing Machine Learning Models with supervised regression
- Imputing Missing Values
- Scaling Features
- Models To Evaluate (see the comparison sketch after this list)
- Linear Regression
- Support Vector Machine Regression
- Random Forest Regression
- Gradient Boosting Regression
- K-Nearest Neighbors Regression
- Model Optimization – In machine learning, optimizing a model means finding the best set of hyperparameters for a particular problem
- Model hyperparameters – are best thought of as settings for a machine learning algorithm that are tuned by the data scientist before training. Examples would be the number of trees in the random forest, or the number of neighbors used in K Nearest Neighbors Regression.
- Model parameters – are what the model learns during training, such as the weights in the linear regression
- Hyperparameter Tuning with Random Search and Cross Validation (also covered in the sketch after this list)
- Evaluate Final Model on the Test Set
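Here is a minimal sketch of that whole sequence on synthetic data: compare the five model families on mean absolute error, tune the best one with random search and cross validation, and evaluate the tuned model on the test set. The dataset, hyperparameter ranges, and search settings are illustrative stand-ins, not the notebook’s exact values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the walkthrough's dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Compare the five model families on a common metric (mean absolute error)
models = {
    'Linear Regression': LinearRegression(),
    'Support Vector Machine': SVR(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'K-Nearest Neighbors': KNeighborsRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f'{name}: MAE = {mae:0.2f}')

# Tune the gradient boosting model with random search and 4-fold cross validation
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [2, 3, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_samples_leaf': [1, 2, 4],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=20, cv=4,
    scoring='neg_mean_absolute_error',
    random_state=42)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out test set
best_mae = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))
print('Best hyperparameters:', search.best_params_)
print(f'Tuned Gradient Boosting: MAE = {best_mae:0.2f}')
```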
Machine Learning Project Part 3.ipynb
If you get an error with the lime module in Machine Learning Project Part 3.ipynb, you will have to run the following:
```
conda install -c conda-forge lime
```
- Recreate Final Model
- Interpret the Model
- Feature Importances
- Use Feature Importances for Feature Selection
- LIME – Local Interpretable Model-agnostic Explanations – LIME is used to explain individual predictions made by the model. LIME is a relatively new effort aimed at showing how a machine learning model thinks by approximating the region around a prediction with a linear model. (See the sketch after this list.)
- Examining a Single Decision Tree
- Make Conclusions and Document Findings
- Presenting your Work
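Here is a minimal sketch of the interpretation step on synthetic data: it prints the model’s global feature importances, then uses LIME to explain a single prediction. The data, model, and feature names are stand-ins for illustration, not the notebook’s.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

# Synthetic regression data standing in for the walkthrough's dataset
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Global view: the model's feature importances, sorted most to least important
for i in np.argsort(model.feature_importances_)[::-1]:
    print(f'{feature_names[i]}: {model.feature_importances_[i]:0.3f}')

# Local view: LIME explains one prediction by fitting a linear model
# to the model's behavior in the neighborhood of that single example
explainer = LimeTabularExplainer(
    training_data=X_train,
    mode='regression',
    feature_names=feature_names)
explanation = explainer.explain_instance(X_test[0], model.predict, num_features=4)
print(explanation.as_list())   # (feature condition, weight) pairs
```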