John Pace
Data Scientist, husband, father of 3 great daughters, 5x Ironman triathlon finisher, just a normal guy who spent a lot of time in school.
Let’s explore data science, artificial intelligence, machine learning, and other topics together.

Text classification 3 ways - logistic regression, random forests, and XGBoost

12/4/2020

Images from https://fr.slideshare.net/Subhradeepsarkar/transposable-elements-in-maize-and-drosophila/16 and https://hospitallane.com/treatment/root-canal-treatment/
I've been planning to write this post for a while, but I've debated about a few things.  First, many posts about text classification or sentiment analysis use simple binary classifiers and do something like classifying tweets or Yelp reviews as positive or negative.  What about classifying into multiple classes?  There are some examples out there, but not many.  Second, the posts typically use nice, neat datasets that don't require much data/file manipulation.  This is not how life is when working with actual production data.  Third, I was not sure if I wanted to experiment with finding optimal hyperparameters for each classification algorithm on my dataset or just give examples of how to perform the overall process.  I chose the latter.  If you want a fun project, take these Jupyter Notebooks and the dataset and do grid search or other methods for hyperparameter optimization.  Maybe I'll do that in another post.

In this post, I'll discuss how I performed text classification on PubMed abstracts using logistic regression, random forests, and XGBoost.  In each case, I stuck with pretty much default hyperparameters, so the accuracy for each is good but could definitely be improved.

You may not be familiar with PubMed.  "PubMed® comprises more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites."  It is a huge dataset!  I decided to use abstracts from these documents as my dataset.  I chose 5 categories of articles and downloaded 2,000 abstracts for each, for 10,000 abstracts total.  The categories were:

  1. Dental caries.  These are cavities in teeth.
  2. Dental pulp.  The inner part of the tooth.
  3. Endodontic treatment.  Root canals are included in this.
  4. Periodontal disease.  Disease of the gums.
  5. Transposable elements.  Repeated sequences of DNA in chromosomes.  I studied these for my PhD dissertation (it's a fascinating read) so they are near and dear to me.

I chose these categories for 2 reasons.  First, they were similar (at least the first 4), and I wanted to see how well the classifiers would work on similar categories with many overlapping terms.  Second, transposable elements were a complete outgroup, so the classifiers should be able to identify them easily.

The process I used for classification is very straightforward.  You can check out the Jupyter Notebooks to see the actual code.
  1. Download the abstracts from Pubmed.  They were in what is called "Pubmed" format and had to be parsed.  It wasn't a nice clean dataset.
  2. Read the parsed datasets and show some basic summary stats.
  3. Preprocess each abstract - remove section headings and other unneeded characters, then lowercase, tokenize, lemmatize, and stem all text.  Describing all of these steps is beyond the scope of this post, but there are many good resources that cover them.
  4. Create TF-IDF word vectors.  Other options could be used, but I chose to use this one because it is straightforward.  
  5. Split train/test data.
  6. Create the classifier and train the model.
  7. For random forest and XGBoost, find the most important features in the model and display them.
  8. Print the metrics - accuracy, classification report, and confusion matrix.
  9. Classify previously unseen abstracts using the newly created model.
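As a rough sketch, steps 4 through 8 might look like the following with scikit-learn.  The corpus, labels, and parameter choices below are toy placeholders (the actual notebooks run on the 10,000 downloaded abstracts), so treat this as an outline of the process rather than the notebooks' code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy stand-ins for the preprocessed abstracts and their categories.
docs = [
    "caries enamel decay lesion", "caries fluoride enamel decay",
    "caries lesion tooth decay", "caries enamel fluoride lesion",
    "transposon element genome dna", "transposon repeat genome insertion",
    "transposon dna element insertion", "transposon repeat element genome",
]
labels = ["dental_caries"] * 4 + ["transposable_elements"] * 4

# Step 4: TF-IDF word vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Step 5: train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42)

# Step 6: create the classifier and train the model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Step 8: accuracy, classification report, and confusion matrix
preds = clf.predict(X_test)
print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
```

Only the classifier created at step 6 changes between the three versions of the pipeline.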

The notebooks, along with the data files, are found on my GitHub.  Each notebook is the same for steps 1-5, but differs in the actual classifier created and trained.
  
Even with default hyperparameters, the results were not bad and were consistent across all classifiers.  Below is a comparison of feature importance as determined by random forest and XGBoost.  There are some similarities and some differences.  Again, this would change, and most likely be even more similar, with hyperparameter optimization. 

Here are the accuracy values on the test data:
  • Random forest - 96.048%
  • Logistic regression - 96.698%
  • XGBoost - 95.398%

Classification of the unseen abstracts was good as well.  For a simple quick and dirty analysis, this is the way to go.  Test each out, then experiment with the hyperparameters.  Training time for each classifier is different, with XGBoost taking by far the longest.  Take this into consideration.  If you have a nice server to run it on, that will definitely help!  I hope you will take a few minutes and check out the Notebooks.


If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.

Keras LearningRateScheduler callback - Making changes on the fly

11/11/2020

This post is the last of my 4 part series on Keras Callbacks.  A callback is an object that can perform actions at various stages of training and is specified when the model is trained using model.fit(). In today's post, I give a very brief overview of the LearningRateScheduler callback.


So, what is a learning rate anyway?  In the context of neural networks, the learning rate is a hyperparameter that is used when calculating the new weights during back propagation.  For more details and the math involved, check out Matt Mazur's excellent discussion.  The learning rate helps determine how much the weights change.  Adjusting the learning rate can play a huge role in helping the network converge and lowering the loss value.  If the learning rate is too high, the weights can fluctuate significantly between updates, causing the model to converge too quickly on a suboptimal solution.  If the learning rate is too small, the weights will not change enough, causing the training process to get stuck.  In either case, your model will suffer.  For a more detailed explanation of the impact of learning rate on neural network performance, check out Jason Brownlee's great post.
Image from https://sawsonskates.com/sanding-wood-projects/

One powerful technique for training an optimal model is to adjust the learning rate as training progresses.  Start with a somewhat high learning rate, then reduce it as the training progresses.  Think of sanding wood.  You start with coarse grit sandpaper to do some initial smoothing.  You then continue with finer and finer grit sandpaper until you have a very smooth surface.  If you just used coarse grit sandpaper, you would get a sanded surface, but it would not be smooth.  If you only used very fine grit sandpaper, you might eventually get a smooth surface, but it would take forever!

Keras provides a nice callback called LearningRateScheduler that takes care of the learning rate adjustments for you.  Simply define your schedule and Keras does the rest.  At a predetermined epoch of the training, the learning rate is adjusted by a factor that you decide.  For example, at epoch 100, your learning rate is adjusted by a factor of 0.1.  At epoch 200, it is adjusted again by a factor of 0.1 and so on.  That's all there is to it.  You can easily make adjustments to this schedule by updating the callback.  You could even do a grid search to test multiple schedules and find the best model.  As with most of the features of Keras, it is easy and straightforward.  

The syntax for the callback is below as well as some example code from the keras.io website.
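A minimal sketch of such a schedule, assuming the factor-of-0.1 drops at epochs 100 and 200 described above (the epoch numbers and factor are just the example values, not tuned choices):

```python
import tensorflow as tf

def schedule(epoch, lr):
    # Keras calls this at the start of each epoch with the current rate.
    if epoch in (100, 200):
        return lr * 0.1  # drop the learning rate by a factor of 0.1
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)
# model.fit(X_train, y_train, epochs=300, callbacks=[lr_callback])
```

Changing the schedule is just a matter of editing the function, which is what makes grid searching over schedules straightforward.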

    

I hope you have enjoyed this series on Keras Callbacks.  Be sure to check out all 4 parts.
Part 1 - Keras EarlyStopping Callback
Part 2 - Keras ModelCheckpoint Callback
Part 3 - Keras Tensorboard Callback


#artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #learningrate #hyperparameters #callbacks #learningratescheduler

LitCovid - the source for COVID and SARS-CoV-2 research

11/10/2020

There is an enormous amount of information floating around about COVID and SARS-CoV-2, the virus that causes COVID.  Unfortunately, much of the information is filtered through a lens of bias or reported by non-scientists.  The definitive source of information is the peer-reviewed scientific literature that is being published.  These papers are thoroughly reviewed by experts in the field to make sure the science is solid and the research was conducted appropriately.  I can tell you from experience, the peer-review process is not easy.  Your manuscripts are read with a critical eye to make sure you are not making unfounded conclusions, that you followed accepted protocols, and that overall your findings are valid.

There is a very good repository called LitCovid, maintained by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine and the organization that manages PubMed.  As of November 10, 2020, there are over 68,000 peer-reviewed, published articles in the database.  The articles are searchable by subject, journal name, and other criteria.  If you want to know what has really been discovered and the directions the research is headed, LitCovid is the source.

Direct links for LitCovid, NCBI, and PubMed
LitCovid - https://www.ncbi.nlm.nih.gov/research/coronavirus/
NCBI - https://www.ncbi.nlm.nih.gov/
PubMed - https://pubmed.ncbi.nlm.nih.gov/

Link to the LitCovid publication and citation
https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa952/5964074

Qingyu Chen, Alexis Allot, Zhiyong Lu, LitCovid: an open database of COVID-19 literature, Nucleic Acids Research, gkaa952, https://doi.org/10.1093/nar/gkaa952



Managing Data in Research Podcast

11/10/2020

A few months ago, I recorded a podcast titled "Managing Data in Research" with my coworker Holly Newman, US Healthcare Client Leader at Mark III Systems (https://soundcloud.com/user-438874054/heathcare-powerchat-106-managing-data-in-research-h-newman-j-pace-mark-iii-systems).  The podcast is part of the Dell Technologies Healthcare Power Chat series.  In the podcast, I start by defining data used in clinical settings versus data used for research. Holly and I then discuss the flow of data between the clinical and research sides of healthcare and the challenges of data access. I then explain how discoveries made on the research side feed the clinical side and how AI and ML are used to drive insights. Holly and I conclude by sharing Mark III Systems' approach to making clinical data available to the research side, how we help clients optimize the infrastructure used to analyze and process the data, how research groups can get started, where to find more information, and final thoughts.  I hope you'll take a few minutes and give it a listen.  I would love to hear your feedback!



#ai #aiinhealthcare #research #machinelearning #datascience #structureddata #unstructureddata #dell #delltechnologies #healthcarepowerchat #clinicaldata

Beautiful and fascinating images of SARS-CoV-2 Infection of Airway Cells

9/16/2020

Images from https://www.med.unc.edu/marsicolunginstitute/directory/camille-ehre-phd/
Dr. Camille Ehre, PhD

On September 3, 2020, Dr. Camille Ehre, Assistant Professor at the Marsico Lung Institute at the University of North Carolina School of Medicine, published some beautiful and fascinating scanning electron microscope images of the SARS-CoV-2 virus infecting human bronchial epithelial cells.  Colored images can be found on Dr. Ehre's faculty page.  The work was published in the prestigious New England Journal of Medicine.  It's amazing that this seemingly innocuous little particle (in red in the images) could be responsible for so much sickness and death.  Thanks for such great work!

Original images from the publication. https://www.nejm.org/doi/full/10.1056/NEJMicm2023328
Links
  • https://www.nejm.org/doi/full/10.1056/NEJMicm2023328
  • https://www.med.unc.edu/marsicolunginstitute/
  • https://www.med.unc.edu/marsicolunginstitute/directory/camille-ehre-phd/
  • https://twitter.com/camilleehre
  • https://www.linkedin.com/in/camille-ehre-21781966/


#coronavirus #sarscov2 #unc #nejm #marsico #sem #microscopy #lungs #virus #infection #science #medicine #biology #disease

Using AI in Radiology - how GPUs and AI are augmenting the great work already being done

9/15/2020

Image from https://arxiv.org/pdf/1707.07734.pdf

For the last few weeks, Mark III Systems, along with Vanderbilt University Medical Center Department of Radiology, Dell/EMC, and Nvidia, has been hosting a series of webinars about how AI can augment the great work that radiologists are already doing.  AI is never going to replace radiologists, but it does have the potential to make their work a little easier so they can focus on the cases that need their full expertise and experience.  The goal of the series is to highlight what is available and what is coming to help make that happen.

I had the privilege of being the first presenter.  My presentation was entitled "The Greatest Thing to Happen to Computing in 10 Years."  I discussed the role of GPUs in AI and how they have made things possible that were not possible just 10 years ago.  I then described specific use cases for AI in Radiology.  The link to the presentation is below. Enjoy!

http://ow.ly/T9BQ50Bn9J7


#ai #artificialintelligence #machinelearning #radiology #vumc #gpu #dell #emc #nvidia

Keras Tensorboard callback - see what's happening as it's happening

9/11/2020

https://keras.io/

This post is part 3 of my 4 part series on Keras Callbacks.  A callback is an object that can perform actions at various stages of training and is specified when the model is trained using model.fit(). In today's post, I give a very brief overview of the Tensorboard callback.

Graphs showing training and validation accuracy and loss. Image from https://github.com/shekkizh/FCN.tensorflow/issues/50

If you are reading this post, then chances are that you have heard of Tensorboard.  It is a great visualization tool that works with Tensorflow to allow you to visualize what is going on in the background while your job is running.  You can see nice graphs of loss and accuracy, as well as visualizations of your neural network. It has other features as well, but I'll only mention these for now.  Take a few minutes to explore it.  It is worth the time!  

Visualization of neural network

While there are multiple arguments that the Tensorboard callback can take, the one I want to discuss today is log_dir.  log_dir allows you to specify the directory where the Tensorboard log files are saved.  The files in this directory are read by Tensorboard to create the visualizations.  This comes in very handy when you train different models or experiment with different hyperparameters.  Simply specify different directories and you are all set.  If the directory does not exist, it will be created when the callback is used.

Here is a short snippet of code that creates and uses the callback.  Since this particular job is doing linear regression with 100,000 datapoints, I add that to the directory name.  The newly created directory looks something like this.

logs/linear_regression_100k_datapoints-20200911-130022/
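A sketch of how a log_dir like the one above might be built and passed to the callback (the run name comes from the example directory above; the timestamp format is my assumption, chosen to match its shape):

```python
import datetime

import tensorflow as tf

run_name = "linear_regression_100k_datapoints"
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
# e.g. logs/linear_regression_100k_datapoints-20200911-130022
log_dir = f"logs/{run_name}-{timestamp}"

tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
# model.fit(X_train, y_train, epochs=10, callbacks=[tb_callback])
```

Point Tensorboard at the parent directory (tensorboard --logdir logs) and each run shows up as its own entry.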

Now we have the Tensorboard log files being written to a very specific directory that clearly tracks the specific training job.  This is a great way to keep up with what you have done!


#artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #tensorboard #hyperparameters #callbacks #visualization

Ironman Coeur d'Alene - June 27, 2021 - #7 for me!

9/4/2020

I signed up for Ironman Coeur d'Alene, Idaho, a couple of days ago. Ironman #7 - June 27, 2021. Please COVID.....let us get back to normal by then. A family vacation to the great Pacific Northwest will be fun! #ironman #Triathlon

Keras ModelCheckpoint callback - yet another great one!

9/2/2020

https://keras.io/
This post is part 2 of my 4 part series on Keras Callbacks.  As I stated in part 1, a callback is an object that can perform actions at various stages of training and is specified when the model is trained using model.fit().  They are super helpful and save a lot of lines of code.  That's part of the beauty of Keras in general - lots of lines of code saved!  Some other callbacks include EarlyStopping, discussed in Part 1 of this series, TensorBoard, and LearningRateScheduler.  I'll discuss TensorBoard and LearningRateScheduler in the last 2 parts.  For the full list of callbacks, see the Keras Callbacks API documentation.  In this post, I will discuss ModelCheckpoint.

The purpose of the ModelCheckpoint callback is exactly what it sounds like.  It saves the Keras model, or just the model weights, at some frequency that you determine.  Want the model saved every epoch?  No problem.  Maybe after a certain number of batches?  Again, no problem.  Want to only keep the best model based on accuracy or loss values?  You guessed it - no problem.

As with other callbacks, you need to define how you want ModelCheckpoint to work and pass those values to model.fit() when you train the model.  I'm sure you can read the full documentation on your own, so I'll just give an example from my own work and hit the high points.  Here is some example code.
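A sketch consistent with the options discussed below; the exact filepath pattern is my stand-in, but it uses the epoch/loss formatting described in the filepath bullet:

```python
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="weights-epoch{epoch:02d}-loss{loss:.4f}.h5",  # epoch and loss in the name
    monitor="loss",
    save_best_only=True,
    save_weights_only=True,
    save_freq="epoch",
    mode="auto",
    verbose=1,
)
# model.fit(X_train, y_train, epochs=100, callbacks=[checkpoint])

# Keras fills in the placeholders at save time; epoch 3 with loss 0.1234 gives:
print("weights-epoch{epoch:02d}-loss{loss:.4f}.h5".format(epoch=3, loss=0.1234))
# -> weights-epoch03-loss0.1234.h5
```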
Let's discuss the options above.
  • filepath - This can be any name you want.  You can literally save every model as a single file named checkpoint.h5 that is overwritten repeatedly if you want.  If you are only saving the best model (based on a metric), this might work, but it does not give you any information about which epoch the model came from or the value of the metric at that point.  To make this information actually valuable, formatting can be used in the filepath.  In the example above, it will save the epoch number and the loss in the filename.  You could use val_loss, accuracy, or val_accuracy if you choose.
  • monitor - This tells Keras which metric to monitor, such as loss, val_loss, accuracy, or val_accuracy.
  • save_best_only - Tells Keras whether to save every model or just the best one, again defined by your metric.  There are 2 options - True or False.  If the value is set to True and you specify it to monitor loss, it will check the loss after every epoch.  If the loss went down, it will save that model.  If it didn't go down, it won't save it.  If you choose False, it will save the model after every epoch regardless.  That may be something you want, but model files tend to be large, so watch your storage space if you are training for a large number of epochs.
  • save_weights_only - This tells Keras whether or not to save the full model or just the weights.  There are pluses and minuses to both.  If save_weights_only is set to True, only the weights are saved, not the model topology.  If set to False, it saves the weights as well as the model topology.  Again, pluses and minuses.  You have to decide which is best for you.
  • save_freq - How often to save the model.  In my case, I use "epoch" so it saves the model after every epoch, assuming the loss value decreased.
  • mode - You can set this to auto, min, or max.  Specifying min or max tells it to evaluate the current value of the metric and save the model depending on whether the metric is less than the minimum or greater than the maximum value previously produced.  So if you are using accuracy as your metric, you want mode to be max.  If loss, then min.  I'll share a secret.  You can use auto and Keras is smart enough to know that with loss it should use min and with accuracy it should use max.  Go ahead and set it to auto so you don't end up putting in the wrong value for mode and then have to go back and troubleshoot.
  • Finally, verbose - You know what this does.

Here is the output from some training I did using the ModelCheckpoint I defined above.  In epochs 3 and 4, the loss decreased, so the model weights were saved.  In epochs 5 and 6, the loss did not decrease, so the weights were not saved.

Now that we have our weights saved, we can later go back and load them for inference.  I won't cover that (although it is simple) for the sake of brevity.  Jason Brownlee (@TeachTheMachine) of Machine Learning Mastery has a very good tutorial on how to do it.  He always writes great stuff!


#artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #modelcheckpoint #hyperparameters #callbacks

I love Keras EarlyStopping

8/24/2020

This post is part 1 of a 4 part series on Keras Callbacks.  Training neural networks is an iterative process that can be very time consuming.  Rarely do you get the optimal model on your first training run.  You try a set of hyperparameters then evaluate how well the model performs.  Then you do it again with different hyperparameters.  Rinse and repeat, often by using grid search.


One of the great features of TensorFlow 2.x is an API called Keras.  It takes the potentially massive amount of code that is needed to build neural networks and wraps it in a nice, concise interface.  For example, to build an artificial neural network with 53 inputs, 2 hidden layers (one with 6 neurons and the other with 12), and 7 output classes, you can use something like the following code.  That's it.  4 lines of code!  It takes quite a bit more with standard TensorFlow.
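Something like this, with the layer sizes from the text (the relu and softmax activations are my assumption):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Dense(6, activation="relu", input_shape=(53,)))  # hidden layer 1: 6 neurons, 53 inputs
model.add(Dense(12, activation="relu"))                    # hidden layer 2: 12 neurons
model.add(Dense(7, activation="softmax"))                  # 7 output classes
```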


    

Back to training.  When training, you need to specify how many epochs you want to train for.  An epoch is "one pass over the entire dataset, used to separate training into distinct phases."  Sometimes you need to run 100 epochs to properly train your model, sometimes 1,000 or more.  But what if you could have the training run automatically stop (not run any more epochs of training) when, according to a metric you set, the model is no longer continuing to improve?  Keras has a nice class called "EarlyStopping" that does just that.  It "stops training when a monitored metric has stopped improving."

EarlyStopping falls into a group of objects known as callbacks that are specified when the model is trained using model.fit().  By setting the EarlyStopping values, you tell the model to quit training when a certain metric has not improved for a specified number of epochs.  So maybe after 10 epochs of no improvement, stop the training.  If you have specified the training to run for 100 epochs and it can stop at 50 epochs due to no improvement, you have saved 50% of the time you would have needed for training.  Saving time is always a benefit!  Plus, you can help avoid overfitting.

Here's how it works.  First, you import EarlyStopping from keras.callbacks.  Then specify a value for the "patience" hyperparameter.  This is how many epochs must pass without improvement before the training stops.  Continue by specifying which metric to evaluate after each epoch of training.  Finally, tell your model to use the EarlyStopping values when training.  It's that easy.  Below is code for running 100 epochs, but stopping when the training accuracy does not improve for 10 epochs.
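That description maps to just a couple of lines (the model and data are omitted; "accuracy" and patience=10 come from the example described above):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when training accuracy has not improved for 10 consecutive epochs.
early_stop = EarlyStopping(monitor="accuracy", patience=10)

# model.fit(X_train, y_train, epochs=100, callbacks=[early_stop])
```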

Several other parameters can be specified for EarlyStopping, such as min_delta and verbose.  Check out the documentation page for details.  I'll discuss some of the other callbacks that are available in future posts.


#artificialintelligence #ai #machinelearning #ml #tensorflow #keras #neuralnetworks #deeplearning #earlystopping #hyperparameters #callbacks