Images from https://fr.slideshare.net/Subhradeepsarkar/transposable-elements-in-maize-and-drosophila/16 and https://hospitallane.com/treatment/root-canal-treatment/
I've been planning to write this post for a while, but I went back and forth on a few things. First, many posts about text classification or sentiment analysis use simple binary classifiers and do something like classifying tweets or Yelp reviews as positive or negative. What about classifying into multiple classes? There are some examples out there, but not many. Second, the posts typically use nice, neat datasets that don't require much data or file manipulation. That is not what life is like when working with actual production data. Third, I was not sure whether I wanted to search for optimal hyperparameters for each classification algorithm on my dataset or just give examples of how to perform the overall process. I chose the latter. If you want a fun project, take these Jupyter Notebooks and the dataset and try grid search or other methods for hyperparameter optimization. Maybe I'll do that in another post.
In this post, I'll discuss how I performed text classification on PubMed abstracts using logistic regression, random forests, and XGBoost. In each case, I stuck with mostly default hyperparameters, so the accuracy for each is good but could definitely be improved.
You may not be familiar with PubMed. "PubMed® comprises more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites." It is a huge dataset! I decided to use abstracts from these documents as my dataset. I chose 5 categories of articles and downloaded 2,000 abstracts for each. The categories were:
I chose these categories for 2 reasons. First, they were similar (at least the first 4), and I wanted to see how well the classifiers would work on similar categories with many overlapping terms. Second, transposable elements was a complete outgroup, so the classifiers should be able to pick out those abstracts easily.
The process I used for classification is very straightforward. You can check out the Jupyter Notebooks to see the actual code.
The notebooks, along with the data files, are on my GitHub. The notebooks are identical for steps 1-5 and differ only in the classifier that is created and trained.
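The notebooks hold the real code, but the general shape of a multi-class text classifier fits in a few lines. What follows is a minimal sketch, not the notebook code. It assumes scikit-learn and TF-IDF features (this post does not say which vectorization the notebooks use), and it runs on a tiny made-up corpus with three hypothetical labels standing in for the PubMed abstracts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data, invented for illustration (NOT the PubMed dataset).
train_texts = [
    "root canal treatment removes infected dental pulp",
    "dental pulp infection treated with root canal therapy",
    "tooth root canal filling after pulp removal",
    "transposable elements move within the maize genome",
    "transposon insertion alters gene expression in drosophila",
    "mobile genetic elements and transposon activity in the genome",
    "heart failure patients treated with beta blockers",
    "cardiac arrhythmia and heart rhythm monitoring",
    "myocardial infarction damages heart muscle",
]
train_labels = ["dental"] * 3 + ["transposon"] * 3 + ["cardiology"] * 3

# TF-IDF turns each text into a weighted bag-of-words vector;
# logistic regression then handles more than two classes directly.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
classifier.fit(train_texts, train_labels)

# Classify texts the model has not seen before.
unseen = [
    "transposon insertion in the maize genome",
    "infected tooth needs a root canal",
    "heart rhythm after myocardial infarction",
]
predictions = list(classifier.predict(unseen))
print(predictions)
```

Swapping the `clf` step for `RandomForestClassifier` gives the random forest variant with no other changes. XGBoost's scikit-learn wrapper can be dropped in the same way, though recent versions expect integer-encoded labels, so a label encoder would be needed first.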
Even with default hyperparameters, the results were not bad and were consistent across all classifiers. Below is a comparison of feature importance as determined by random forest and XGBoost. There are some similarities and some differences. Again, these rankings would change with hyperparameter optimization, and would most likely become even more similar.
Here are the accuracy values on the test data:
Classification of the unseen abstracts was good as well. For a simple, quick-and-dirty analysis, this is the way to go. Test each classifier out, then experiment with the hyperparameters. Training time differs for each classifier, with XGBoost taking by far the longest, so take that into consideration. If you have a nice server to run it on, that will definitely help! I hope you will take a few minutes and check out the Notebooks.
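For the hyperparameter experiments suggested above, grid search is one place to start. The sketch below is an illustration only, assuming scikit-learn, a TF-IDF plus logistic regression pipeline, an arbitrary grid over the regularization strength `C`, and a made-up toy corpus with hypothetical labels. Real abstracts and a wider grid would take far longer to run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data, invented for illustration (NOT the PubMed dataset).
texts = [
    "root canal treatment removes infected dental pulp",
    "dental pulp infection treated with root canal therapy",
    "tooth root canal filling after pulp removal",
    "transposable elements move within the maize genome",
    "transposon insertion alters gene expression in drosophila",
    "mobile genetic elements and transposon activity in the genome",
    "heart failure patients treated with beta blockers",
    "cardiac arrhythmia and heart rhythm monitoring",
    "myocardial infarction damages heart muscle",
]
labels = ["dental"] * 3 + ["transposon"] * 3 + ["cardiology"] * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# 3-fold stratified cross-validation: each fold holds one text per class.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to a random forest or XGBoost step by changing the estimator and the keys in `param_grid`.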
If you have questions or want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and on LinkedIn at https://www.linkedin.com/in/john-pace-phd-20b87070/, and follow my company, Mark III Systems, on Twitter @markiiisystems.