I've been planning to write this post for a while, but I've debated a few things. First, many posts about text classification or sentiment analysis use simple binary classifiers and do something like classifying tweets or Yelp reviews as positive or negative. What about classifying into multiple classes? There are some examples out there, but not many. Second, those posts typically use nice, neat datasets that don't require much data or file manipulation, which is not how life is when working with actual production data. Third, I was not sure if I wanted to experiment with finding optimal hyperparameters for each classification algorithm on my dataset or just give examples of how to perform the overall process. I chose the latter. If you want a fun project, take these Jupyter Notebooks and the dataset and do grid search or another method of hyperparameter optimization. Maybe I'll do that in another post.

In this post, I'll discuss how I performed text classification on PubMed abstracts using logistic regression, random forests, and XGBoost. In each case, I stuck with pretty much default hyperparameters, so the accuracy for each is good but could definitely be improved.

You may not be familiar with PubMed. "PubMed® comprises more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites." It is a huge dataset! I decided to use abstracts from these documents as my dataset. I chose 5 categories of articles and downloaded 2,000 abstracts for each. The categories were:
I chose these categories for 2 reasons. First, they were similar (at least the first 4), and I wanted to see how well the classifiers would work on similar categories with many overlapping terms. Second, transposable elements were a complete outgroup, so the classifiers should be able to pick those abstracts out easily. The process I used for classification is very straightforward; you can check out the Jupyter Notebooks for the actual code, and a minimal sketch of the overall flow is just below.
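To give a feel for the pipeline before you open the notebooks, here is a minimal sketch of a multi-class setup using TF-IDF features and logistic regression. The abstracts, labels, and category names below are hypothetical placeholders; the notebooks load the real PubMed data files and may extract features differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data; the notebooks load the real abstracts
abstracts = ["transposable elements move within the maize genome",
             "root canal treatment outcomes in adult patients"] * 100
labels = ["transposable elements", "endodontics"] * 100

# Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, test_size=0.2, random_state=42)

# Turn the raw abstracts into TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Multi-class logistic regression with default hyperparameters
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test_tfidf)))
```

Reusing the same train/test split and TF-IDF features for the other classifiers keeps the accuracy comparison apples-to-apples.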
The notebooks, along with the data files, are found on my GitHub. Each notebook is the same for steps 1-5 but differs in the actual classifier created and trained. Even with default hyperparameters, the results were not bad and were consistent across all classifiers. Comparing the feature importances determined by random forest and XGBoost shows some similarities and some differences; again, this would change, and most likely become even more similar, with hyperparameter optimization. A sketch of how to pull those importances out is below.
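Continuing from the pipeline sketch above, something like the following can extract and compare the top terms by importance from both tree-based models. This is a sketch rather than the notebook code; note that recent xgboost versions want numeric class labels, and older scikit-learn versions call get_feature_names_out() get_feature_names().

```python
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Train both tree-based models on the TF-IDF features from the sketch above
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_tfidf, y_train)

# XGBoost expects numeric class labels, so encode the category names
encoder = LabelEncoder()
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_tfidf, encoder.fit_transform(y_train))

# Print the 10 most important terms according to each model
feature_names = vectorizer.get_feature_names_out()
for name, model in [("random forest", rf), ("XGBoost", xgb_clf)]:
    top = model.feature_importances_.argsort()[::-1][:10]
    print(name, [feature_names[i] for i in top])
```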
Here are the accuracy values on the test data:

Classification of the unseen abstracts was good as well. For a simple, quick-and-dirty analysis, this is the way to go: test each classifier out, then experiment with the hyperparameters. Training time is different for each classifier, with XGBoost taking by far the longest, so take that into consideration. If you have a nice server to run it on, that will definitely help! I hope you will take a few minutes and check out the Notebooks. If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.
This post is the last of my 4-part series on Keras Callbacks. A callback is an object that can perform actions at various stages of training and is specified when the model is trained using model.fit(). In today's post, I give a very brief overview of the LearningRateScheduler callback.

So, what is a learning rate anyway? In the context of neural networks, the learning rate is a hyperparameter used when calculating the new weights during backpropagation. For more details and the math involved, check out Matt Mazur's excellent discussion. The learning rate helps determine how much the weights change with each update. Adjusting the learning rate can play a huge role in helping the network converge and lowering the loss value. If the learning rate is too high, the weights can swing wildly between updates, and the model may converge too quickly to a suboptimal solution. If the learning rate is too low, the weights change so little that training can stall. In either case, your model will suffer. For a more detailed explanation of the impact of learning rate on neural network performance, check out Jason Brownlee's great post.

One powerful technique for training an optimal model is to adjust the learning rate as training progresses: start with a somewhat high learning rate, then reduce it. Think of sanding wood. You start with coarse-grit sandpaper to do some initial smoothing, then continue with finer and finer grits until you have a very smooth surface. If you only used coarse grit, you would get a sanded surface, but it would not be smooth. If you only used very fine grit, you might or might not get a smooth surface, but it would take forever!

Keras provides a nice callback called LearningRateScheduler that takes care of the learning rate adjustments for you. Simply define your schedule and Keras does the rest. At predetermined epochs of the training, the learning rate is adjusted by a factor that you decide. For example, at epoch 100, your learning rate is multiplied by a factor of 0.1; at epoch 200, it is multiplied by 0.1 again, and so on. That's all there is to it. You can easily adjust the schedule by updating the callback, and you could even do a grid search to test multiple schedules and find the best model. As with most features of Keras, it is easy and straightforward. The syntax for the callback, along with some example code, is below.
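This sketch follows the pattern of the example in the keras.io docs, adapted to the step schedule described above (drop by a factor of 0.1 at epochs 100 and 200); the tiny model and random data are placeholders for your own job.

```python
import numpy as np
import tensorflow as tf

# A tiny stand-in model and dataset
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")

x_train = np.random.rand(256, 4)
y_train = np.random.rand(256, 1)

def step_schedule(epoch, lr):
    # Multiply the learning rate by 0.1 at epochs 100 and 200;
    # otherwise leave it unchanged
    if epoch in (100, 200):
        return lr * 0.1
    return lr

# verbose=1 prints the learning rate at the start of each epoch
lr_callback = tf.keras.callbacks.LearningRateScheduler(step_schedule, verbose=1)

model.fit(x_train, y_train, epochs=300, callbacks=[lr_callback])
```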
I hope you have enjoyed this series on Keras Callbacks. Be sure to check out all 4 parts:
Part 1 - Keras EarlyStopping Callback
Part 2 - Keras ModelCheckpoint Callback
Part 3 - Keras TensorBoard Callback
If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/. #artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #learningrate #hyperparameters #callbacks #learningratescheduler

There is an enormous amount of information floating around about COVID and SARS-CoV-2, the virus that causes it. Unfortunately, much of the information is filtered through a lens of bias or reported by non-scientists. The definitive source of information is the peer-reviewed scientific literature being published. These papers are thoroughly reviewed by experts in the field to make sure the science is solid and the research was conducted appropriately. I can tell you from experience that the peer-review process is not easy. Your manuscripts are read with a critical eye to make sure you are not making unfounded conclusions, that you followed accepted protocols, and that, overall, your findings are valid.

There is a very good repository called LitCovid, maintained by the National Center for Biotechnology Information (NCBI), the part of the National Library of Medicine that manages PubMed. As of November 10, 2020, there are over 68,000 peer-reviewed, published articles in the database. The articles are searchable by subject, journal name, and other criteria. If you want to know what has really been discovered and the directions the research is headed, LitCovid is the source.

Direct links for LitCovid, NCBI, and PubMed:
LitCovid - https://www.ncbi.nlm.nih.gov/research/coronavirus/
NCBI - https://www.ncbi.nlm.nih.gov/
PubMed - https://pubmed.ncbi.nlm.nih.gov/

Link to the LitCovid publication and citation: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa952/5964074
Qingyu Chen, Alexis Allot, Zhiyong Lu, LitCovid: an open database of COVID-19 literature, Nucleic Acids Research, gkaa952, https://doi.org/10.1093/nar/gkaa952

If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.

A few months ago, I recorded a podcast titled "Managing Data in Research" with my coworker Holly Newman, US Healthcare Client Leader at Mark III Systems (https://soundcloud.com/user-438874054/heathcare-powerchat-106-managing-data-in-research-h-newman-j-pace-mark-iii-systems). The podcast is part of the Dell Technologies Healthcare Power Chat series. In the podcast, I start by defining data that is used in clinical settings versus data used for research. Holly and I then discuss the flow of data between the clinical and research sides of healthcare and the challenges of data access. I then explain how discoveries made on the research side feed the clinical side and how AI and ML are used to drive insights. Holly and I conclude by sharing Mark III Systems' approach to making clinical data available to the research side, how they help clients optimize the infrastructure used to analyze and process the data, how research groups can get started, where to find more information, and final thoughts. I hope you'll take a few minutes and give it a listen. I would love to hear your feedback! If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/. #ai #aiinhealthcare #research #machinelearning #datascience #structureddata #unstructureddata #dell #delltechnologies #healthcarepowerchat #clinicaldata

On September 3, 2020, Dr. Camille Ehre, Assistant Professor at the Marsico Lung Institute at the University of North Carolina School of Medicine, published some beautiful and fascinating scanning electron microscope images of the SARS-CoV-2 virus infecting human bronchial epithelial cells. Colored images can be found on Dr. Ehre's faculty page. The work was published in the prestigious New England Journal of Medicine. It's amazing that this seemingly innocuous little particle (shown in red in the images) could be responsible for so much sickness and death. Thanks for such great work! If you have questions and want to connect, you can message me on LinkedIn or Twitter.
Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/. #coronavirus #sarscov2 #unc #nejm #marsico #sem #microscopy #lungs #virus #infection #science #medicine #biology #disease

For the last few weeks, Mark III Systems, along with the Vanderbilt University Medical Center Department of Radiology, Dell EMC, and Nvidia, has been hosting a series of webinars about how AI can augment the great work that radiologists are already doing. AI is never going to replace radiologists, but it does have the potential to make their work a little easier so they can focus on the cases that need their full expertise and experience. The goal of the series is to highlight what is available, and what is coming, to help achieve this. I had the privilege of being the first presenter. My presentation was entitled "The Greatest Thing to Happen to Computing in 10 Years." I discussed the role of GPUs in AI and how they have made things possible that were not possible just 10 years ago. I then described specific use cases for AI in radiology. The link to the presentation is below. Enjoy! If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/. #ai #artificialintelligence #machinelearning #radiology #vumc #gpu #dell #emc #nvidia

This post is part 3 of my 4-part series on Keras Callbacks. A callback is an object that can perform actions at various stages of training and is specified when the model is trained using model.fit(). In today's post, I give a very brief overview of the TensorBoard callback.

If you are reading this post, chances are you have heard of TensorBoard. It is a great visualization tool that works with TensorFlow to let you see what is going on in the background while your job is running. You can see nice graphs of loss and accuracy, as well as visualizations of your neural network. It has other features as well, but I'll only mention these for now. Take a few minutes to explore it; it is worth the time!

While the TensorBoard callback can take multiple arguments, the one I want to discuss today is log_dir. log_dir lets you specify the directory where the TensorBoard log files are saved. The files in this directory are read by TensorBoard to create the visualizations. This comes in very handy when you train different models or use different hyperparameters: simply specify different directories and you are all set. If the directory does not exist, it is created when the callback is used. Here is a short snippet of code that creates and uses the callback. Since this particular job does linear regression with 100,000 datapoints, I add that to the directory name.
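This is a sketch; the model and data are stand-ins for the actual regression job.

```python
import datetime
import numpy as np
import tensorflow as tf

# Encode the job description plus a timestamp in the log directory name,
# so every run gets its own directory
log_dir = ("logs/linear_regression_100k_datapoints-"
           + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# A stand-in linear regression model and 100,000-point dataset
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="adam", loss="mse")

x_train = np.random.rand(100_000, 1)
y_train = 3 * x_train + 2 + 0.1 * np.random.randn(100_000, 1)

model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])
```

Point TensorBoard at the parent directory with tensorboard --logdir logs and it will pick up every run.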
The newly created directory looks something like this: logs/linear_regression_100k_datapoints-20200911-130022/. Now we have the TensorBoard log files being written to a very specific directory that clearly tracks the specific training job. This is a great way to keep up with what you have done! If you have questions and want to connect, you can message me on LinkedIn or Twitter. #artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #tensorboard #hyperparameters #callbacks #visualization

I signed up for Ironman Coeur d'Alene, Idaho, a couple of days ago. Ironman #7 - June 27, 2021. Please, COVID, let us get back to normal by then. A family vacation to the great Pacific Northwest will be fun! #ironman #Triathlon
This post is part 2 of my 4-part series on Keras Callbacks. As I stated in part 1, a callback is an object that can perform actions at various stages of training and is specified when the model is trained using model.fit(). Callbacks are super helpful and save a lot of lines of code; that's part of the beauty of Keras in general. Other callbacks include EarlyStopping, discussed in part 1 of this series, TensorBoard, and LearningRateScheduler, which I'll cover in the last 2 parts. For the full list of callbacks, see the Keras Callbacks API documentation.

In this post, I will discuss ModelCheckpoint. The purpose of the ModelCheckpoint callback is exactly what it sounds like: it saves the Keras model, or just the model weights, at some frequency that you determine. Want the model saved every epoch? No problem. Maybe after a certain number of batches? Again, no problem. Want to keep only the best model based on accuracy or loss values? You guessed it, no problem. As with other callbacks, you define how you want ModelCheckpoint to work and pass those values to model.fit() when you train the model. I'm sure you can read the full documentation on your own, so I'll just give an example from my own work and hit the high points. Here is some example code.
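This is a sketch rather than the exact notebook code; the filepath, model, and data are placeholders.

```python
import os

import numpy as np
import tensorflow as tf

os.makedirs("checkpoints", exist_ok=True)

# Save only the model weights, and only when the monitored loss improves
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/weights-epoch{epoch:02d}.h5",
    monitor="loss",          # watch the training loss
    save_best_only=True,     # skip epochs where the loss did not improve
    save_weights_only=True,  # save just the weights, not the full model
    verbose=1,               # print a message whenever weights are saved
)

# A stand-in model and dataset
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

x_train = np.random.rand(512, 8)
y_train = np.random.randint(0, 2, (512, 1))

model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint_callback])
```

Let's discuss the options above: monitor picks the metric to watch, save_best_only skips epochs where that metric did not improve, and save_weights_only saves just the weights rather than the whole model.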
Here is the output from some training I did using a ModelCheckpoint defined this way. In epochs 3 and 4, the loss decreased, so the model weights were saved. In epochs 5 and 6, the loss did not decrease, so the weights were not saved. Now that we have our weights saved, we can later go back and load them for inference. I won't cover that (although it is simple) for the sake of brevity. Jason Brownlee (@TeachTheMachine) of Machine Learning Mastery has a very good tutorial on how to do it. He always writes great stuff! If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/. #artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #modelcheckpoint #hyperparameters #callbacks

This post is part 1 of a 4-part series on Keras Callbacks. Training neural networks is an iterative process that can be very time consuming. Rarely do you get the optimal model on your first training run. You try a set of hyperparameters, then evaluate how well the model performs. Then you do it again with different hyperparameters. Rinse and repeat, often by using grid search.

One of the great features of TensorFlow 2.x is an API called Keras. It takes the potentially massive amounts of code needed to build neural networks and wraps it in a nice, clean interface. For example, to build an artificial neural network with 53 inputs, 2 hidden layers (one with 6 neurons and the other with 12), and 7 output classes, you need just 4 lines of model code, as in the sketch at the end of this post. It takes quite a bit more with standard TensorFlow.

Back to training. When training, you need to specify how many epochs to train for. An epoch is "one pass over the entire dataset, used to separate training into distinct phases." Sometimes you need to run 100 epochs to properly train your model, sometimes 1,000 or more. But what if you could have the training run automatically stop (not run any more epochs of training) when, according to a metric you set, the model is no longer improving? Keras has a nice class called "EarlyStopping" that does just that. It "stops training when a monitored metric has stopped improving." EarlyStopping falls into a group of objects known as callbacks that are specified when the model is trained using model.fit(). By setting the EarlyStopping values, you tell the model to quit training when a certain metric has not improved for a specified number of epochs. So maybe after 10 epochs of no improvement, stop the training. If you specified 100 epochs of training and it can stop at 50 due to no improvement, you have saved 50% of the time you would have needed for training. Saving time is always a benefit! Plus, you can help avoid overfitting.

Here's how it works. First, you import EarlyStopping from keras.callbacks. Then specify a value for the "patience" hyperparameter: how many epochs must pass without improvement for the training to stop. Continue by specifying which metric to evaluate after each epoch of training. Finally, tell your model to use the EarlyStopping values when training. It's that easy. Below is code for running up to 100 epochs but stopping once the training accuracy has not improved for 10 epochs.
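This sketch puts the pieces together: the 4-line model described above plus the EarlyStopping callback. The random data is a hypothetical stand-in for a real 53-feature, 7-class dataset.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping

# The network described above: 53 inputs, hidden layers of 6 and 12 neurons,
# and 7 output classes, in 4 lines of model code
model = keras.Sequential()
model.add(keras.layers.Dense(6, input_dim=53, activation="relu"))
model.add(keras.layers.Dense(12, activation="relu"))
model.add(keras.layers.Dense(7, activation="softmax"))

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop once training accuracy has not improved for 10 epochs
early_stopping = EarlyStopping(monitor="accuracy", patience=10)

# Hypothetical stand-in data: 1,000 samples, 53 features, 7 classes
x_train = np.random.rand(1000, 53)
y_train = np.random.randint(0, 7, 1000)

model.fit(x_train, y_train, epochs=100, callbacks=[early_stopping])
```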
Several other parameters can be specified for EarlyStopping, such as min_delta and verbose. Check out the documentation page for details. I'll discuss some of the other callbacks that are available in future posts. If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/. #artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #earlystopping #hyperparameters #callbacks |