Most of machine learning can be broken down into two major categories: classification and prediction. Classification is where you try to decide which category something belongs in. A couple of examples include "Is this picture a cat or a dog?" or "Given these factors, does a person make over $50,000 per year?" Models like logistic regression, random forests, and XGBoost are among the most commonly used. Prediction is where you try to predict a value at a certain point in the future. Some examples are "What will the price of a stock be in 3 months?" or "If we make these changes, how much will our sales increase?" Linear regression and generalized linear models (GLMs) are a couple of the commonly used models. Today I am going to focus on classification with random forests. Check out this nice article by Tony Yiu (@tonester524) for more information on the details of how random forests work.

One of the challenges of building any model is choosing the best hyperparameters. Hyperparameters are values you set before training, whereas parameters are values that are learned during training. For example, you can set the number of decision trees to use in a random forest. It can be tricky, though. Too few trees and your model may be suboptimal and perform poorly. Too many trees and your model may overfit, while also increasing computational time significantly. How do you find this balance without manually changing each combination of hyperparameters and re-running training? Talk about labor intensive and, honestly, a huge waste of your valuable time.

Enter "grid search." Grid search is a method that takes lists of the hyperparameter values you want to try, trains a model on every combination (or a random subset) of those values, and finds the best-performing model according to some metric, such as accuracy. Rather than manually trying each value, you start the script and let it run in the background. When it is finished, it provides a Python dictionary of the best hyperparameter values it found. You take those values, use them in your training model, and see how much the model improves. If you don't see much improvement, it may be time to do some more feature engineering!

In this post, I show you how to use scikit-learn's GridSearchCV along with a RandomForestClassifier to perform an exhaustive grid search and find the optimal hyperparameters for a simple random forest model. My purpose is not to do an exhaustive analysis of the data set in order to get the absolute best classification results, but rather to demonstrate the process and provide a template script you can use with your own data. Feel free to take this data set, manipulate it, and see if you get better results. If you do, please be sure to share them.

The script, in the form of a Jupyter Notebook found on my Github, takes census data and tries to predict whether the person in each row makes less than or more than $50,000 per year. The data I used came from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Census+Income. In the script, roughly the following is done: the census data is loaded and preprocessed, lists of candidate hyperparameter values are defined, GridSearchCV runs the exhaustive search, and the best hyperparameters are reported.
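The full notebook is on my Github; what follows is only a minimal sketch of the core steps, assuming the usual scikit-learn workflow. The file name, column names, and grid values here are illustrative, not necessarily the exact ones used in the notebook:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the UCI census data (file and column names are illustrative)
df = pd.read_csv("adult.csv")
X = pd.get_dummies(df.drop(columns=["income"]))  # one-hot encode categorical columns
y = (df["income"] == ">50K").astype(int)         # 1 if the person makes over $50K

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Lists of hyperparameter values to try -- grid search trains a model
# for every combination and keeps the best one
param_grid = {
    "n_estimators": [100, 300, 500],  # number of trees in the forest
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",  # the metric used to pick the best model
    cv=5,                # 5-fold cross-validation
    n_jobs=-1,           # use all available CPU cores
)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the dictionary of best hyperparameter values
print(grid.best_score_)   # the cross-validated accuracy of the best model
```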
The script is straightforward and will hopefully allow you to be more productive in your work. Instead of a lot of manual labor, you can focus on the things you love about data science and on making your business more efficient and profitable. If you have questions or want to connect, message me on Twitter (@pacejohn) or LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/). #machinelearning #mldl #randomforest #hyperparameters #logisticregression #XGBoost #linearregression #generallinearmodels #GLM #GridSearchCV #RandomForestClassifier #overfitting #gridsearch #ai #artificialintelligence
Data science, at its core, is about time. Time to increased sales, time to finding cures for diseases, time to understanding. It is therefore critical to think about what hardware you are using for deep learning training, which is by far the most computationally intensive part of artificial intelligence and of data science in general. A 10% reduction in training time can make a huge difference in employee productivity and time to results. How about a 25% decrease in training time? Let me show you how to get it!

I recently read a Medium article by Thilina Rajapakse comparing 7 different BERT models. BERT is a current state-of-the-art model for natural language processing (NLP). In his article, he compared the models in terms of final accuracy and training time. His work was done on a custom-built, AMD processor-based workstation with an Nvidia Titan RTX GPU, which is built on the Turing architecture and is a very popular PC/workstation GPU. In my conversations with customers, I often hear that they do their training on a laptop or on a workstation with a configuration similar to Thilina's. His article, along with these customer comments, motivated me to determine how much of a difference an enterprise-level server with top-shelf CPUs and GPUs can make in training time for these same models. The work I am presenting here did not focus on how the models work or on comparing accuracy, but rather on training time. Since the models were so different, they allowed for a broad, unbiased comparison.

I set up two different enterprise servers in our lab, which I will call Server 1 (S1) and Server 2 (S2), to run the same training that Thilina did. My goal was to compare training times between S1, S2, and Thilina's workstation (WS). Whereas WS had a Titan RTX GPU, S1 and S2 had Nvidia V100 GPUs, Nvidia's most powerful GPU. I expected S1 and S2 to outperform WS, but I did not know how large the difference would be. Would it be significant? Would it justify the difference in price? Those were the two main questions I was trying to answer. In essence, by running deep learning training on an enterprise server, how much more productive, and thus profitable, could a company be?

To keep this post concise, only a summary of the results is presented (see Table 1 below). Full details of the training performed and the specs of S1, S2, and WS can be found on my Github page. In Table 1, each model is listed, followed by the training time for that model on S1, S2, and WS. Overall, S1 reduced training time compared to S2 and WS by 13.6% and 24.7%, respectively. S2 reduced training time compared to WS by 16.6%.

An example from a customer is even more dramatic. A data scientist took a random forest model he was training on his laptop and trained it on a server very similar to S1. The training time decreased from 152 minutes to 26 minutes, nearly a 6x speedup.

Do these results demonstrate that using an enterprise-level server rather than a workstation or laptop is significantly better? If decreasing your training time by 25% makes you more productive, then it is significantly better. In addition to decreasing training times, S1 had 8 GPUs, whereas S2 and WS each had 1 GPU (S2 can support up to 4, and only one GPU was utilized in this testing). This means that 8 different people can train concurrently on S1. Imagine 8 data scientists being able to train models at the same time and do it 25% faster than they can on their workstation or laptop.
So not only is the training time decreased, it is decreased for as many as 8 people at the same time. This is a big deal. To find out how you can implement an enterprise-level server and have it set up to perform your training, contact me. I love helping companies utilize deep learning to its full potential and make a fundamental shift in their operations. If you have questions or want to connect, message me on Twitter (@pacejohn) or LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/). #artificialintelligence #ai #deeplearning #DL #machinelearning #ML #BERT #servers #nvidia #gpu #v100

Customers often ask me how to operationalize the machine learning models they have created. It's common that they have the model but are struggling with how to apply it to new or existing business data sets. Another line of questioning is "How do we move the machine learning model forward into production?" I wrote this article to provide an overview of the basic steps of operationalizing machine learning models.

Creating a logistic regression model that can be operationalized

Believe it or not, you simply create the model and save it to a file. This is typically done in a script that is separate from the script that will be used in production. pickle and joblib are common ways to save random forest, logistic regression, or other classical machine learning models. In this post, I will show the process for pickle; joblib is very similar. For neural networks created with Keras, it is recommended to use keras.models. Once the file is created, you later load it and use it to score your data.

Create the model file and save it
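Here is a minimal sketch of that first step, assuming a simple scikit-learn logistic regression model. The synthetic training data and the file name are illustrative:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a simple logistic regression model (synthetic data for illustration)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Serialize the trained model to a file with pickle
with open("logistic_model.pkl", "wb") as f:
    pickle.dump(model, f)
```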
Import the model from the file and use it on new data
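And the corresponding production-side step, again as a sketch. The file name and the shape of the new data match the illustrative example above:

```python
import pickle

import numpy as np

# Load the serialized model back from disk
with open("logistic_model.pkl", "rb") as f:
    model = pickle.load(f)

# Score new data -- here a single illustrative row with 10 features
new_data = np.random.rand(1, 10)
print(model.predict(new_data))        # predicted class
print(model.predict_proba(new_data))  # class probabilities
```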
The two sketches above form the complete logistic regression example, and that's all there is to it! For more in-depth instructions on operationalizing models with joblib from scikit-learn, or on saving and loading models in Keras, check out the scikit-learn and Keras documentation.
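For completeness, here is a hedged sketch of the same save-and-load pattern for a neural network, using the keras.models API mentioned above. The tiny architecture and the file name are illustrative:

```python
from tensorflow import keras

# Build and compile a tiny illustrative model
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X, y, epochs=5)  # train on your own data before saving

# Save the full model (architecture + weights) to a single file
model.save("keras_model.h5")

# Later, in the production script, load it back and use it to score new data
restored = keras.models.load_model("keras_model.h5")
```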
Do you work with clients who need help operationalizing machine learning models? I'd love to hear what questions you commonly come across. Please share by adding a comment below. If you have questions or want to connect, message me on Twitter (@pacejohn) or LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/). #ai #machinelearning #MLDL #artificialintelligence #operationalize #production #python #keras #scikit-learn