John Pace
Data Scientist, husband, father of 3 great daughters, 5x Ironman triathlon finisher, just a normal guy who spent a lot of time in school.
Let’s explore data science, artificial intelligence, machine learning, and other topics together.

Question and Answer for Long Passages Using BERT

12/20/2019

Image from https://d827xgdhgqbnd.cloudfront.net/wp-content/uploads/2019/04/09110726/Bert-Head.png

BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.  It's safe to say it is taking the NLP world by storm.  BERT was developed by Google, and Nvidia has created an optimized version that uses TensorRT.

To run a Question & Answer query using BERT, you have to provide the passage to be queried and the question you are trying to answer from the passage.  One drawback of BERT is that only short passages can be queried when performing Question & Answer: BERT has a fixed maximum input sequence length, so once a passage grows past that limit, the part of the text that contains the answer may never be seen by the model and the correct answer cannot be found.

I have created a script that allows you to query longer passages and get the correct answer.  I take an input passage and break it into paragraphs delimited by \n.  Each paragraph is then queried to try to find the answer.  All answers that are returned are put into a list, and the list is analyzed to find the answer with the highest probability, which is returned as the final answer.  When you run the script, you will want to change the paths to correspond with your setup.  All code and files are on my GitHub HERE.
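
Here is a minimal sketch of the approach in Python.  The query_bert helper is hypothetical: it stands in for a call into Nvidia's TensorRT BERT Q&A engine and is assumed to return (answer, probability) pairs for one paragraph.  The actual script on my GitHub differs in its details.

    # Minimal sketch of the long-passage approach.  The helper
    # query_bert(paragraph, question) is hypothetical: it stands in for a
    # call into the TensorRT BERT Q&A engine and is assumed to return a
    # list of (answer, probability) tuples for one paragraph.

    def answer_long_passage(passage, question):
        # Break the input passage into paragraphs delimited by \n.
        paragraphs = [p.strip() for p in passage.split("\n") if p.strip()]

        # Query each paragraph and collect every answer that is returned.
        candidates = []
        for paragraph in paragraphs:
            candidates.extend(query_bert(paragraph, question))

        # Analyze the list and return the answer with the highest probability.
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[1])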

Setup

In order to run the script properly, you need to make sure that a Docker container is created.  Before running the query, be sure to start the TensorRT engine.  The setup steps Nvidia documents are the ones I follow as well.
One caveat is that the TensorRT engine will terminate after a period of time, so be sure it is still running before you perform your query.

Files I am using for queries

There are 3 files that I am using as input passages.  Feel free to try it with your own passages.

22532_Full_Document.txt - this is the full document I am using. If you ask a question about the first part, it will return the correct answer. If you ask a question about a later part, it will not find the answer.

22532_Short_Document_With_Answers.txt - this is a shortened passage that has answers to the query. If you use the same query as I did for the question, it will find 2 answers. The one with the higher probability is the correct answer.

22532_Short_Document_Without_Answers.txt - this is a shortened passage that does not have the answers to the query. If you use the same query as I did for the question, it will not find any answers.

The question that is asked is "How many patients experienced recurrence at 12 years of age?" Feel free to experiment.
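
As a hedged usage example, here is how the answer_long_passage sketch from earlier could be run against one of these files with that question.  The file name and question are the ones listed above; the function itself is the illustrative sketch, not my actual script.

    # Illustrative usage of the answer_long_passage sketch shown earlier,
    # with one of the sample files and the sample question.
    with open("22532_Short_Document_With_Answers.txt") as f:
        passage = f.read()

    question = "How many patients experienced recurrence at 12 years of age?"
    print(answer_long_passage(passage, question))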

If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.

This post is also on Medium at https://medium.com/analytics-vidhya/question-and-answer-for-long-passages-using-bert-dfc4fe08f17f.

#ai #artificialintelligence #machinelearning #deeplearning #neuralnetwork #BERT #nlp #naturallanguageprocessing #nvidia #tensorrt #docker #github #google

Benchmarking Nvidia RAPIDS cuDF versus Pandas

12/20/2019

Image by Fremte at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22469496
Introduction
If you have ever used Pandas, you know that it is a great tool for creating and working with data frames.  As with any software, optimizations can be done to make it perform better.  Nvidia has developed software called RAPIDS, a data science framework that includes a collection of libraries for executing end-to-end data science pipelines completely on the GPU (www.rapids.ai).  Included in RAPIDS is cuDF, which allows data frames to be loaded and manipulated on a GPU.  In this post, I am going to discuss some benchmarking I have done with RAPIDS, particularly cuDF.  I conducted multiple experiments in which I created data frames that ran on the CPU using Pandas and data frames that ran on the GPU using cuDF, then executed common methods on those data frames.  I will be using the term "processes" to describe the execution of the methods.  I will also be using Nvidia's convention, in which the Pandas data frames are named PDF, for Pandas data frame, and the cuDF data frames are named GDF, for GPU data frame.
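
To make the convention concrete, here is a minimal sketch (assuming a machine with a CUDA-capable GPU and RAPIDS installed) showing that loading the same CSV into a PDF and a GDF looks nearly identical:

    import pandas as pd
    import cudf  # part of Nvidia RAPIDS; requires a CUDA-capable GPU

    # PDF: a conventional Pandas data frame, processed on the CPU.
    pdf = pd.read_csv("MICROBIOLOGYEVENTS.csv")

    # GDF: a cuDF data frame, loaded and manipulated on the GPU.
    gdf = cudf.read_csv("MICROBIOLOGYEVENTS.csv")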

For my benchmarking data, I used a CSV file from MIMIC-III, a freely accessible critical care database created by the MIT Lab for Computational Physiology (https://mimic.physionet.org/).  The file, MICROBIOLOGYEVENTS.csv, consisted of 631,726 rows and 16 columns of data.  I duplicated the records in the file to create new files of 5 million, 10 million, 20 million, and 40 million rows, each with the same 16 columns.  Experiments were then conducted using each of these 5 files.  An individual experiment is defined as running a process with one of the 5 versions of the MICROBIOLOGYEVENTS file as input.  Each experiment was repeated 5 times, and the results were averaged to give the final numbers I list.
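
The duplication itself is only a few lines; here is one reasonable way to do it in Pandas (a sketch, with hypothetical output file names, not necessarily the exact code I used):

    import pandas as pd

    source = pd.read_csv("MICROBIOLOGYEVENTS.csv")  # 631,726 rows, 16 columns

    for target_rows in (5_000_000, 10_000_000, 20_000_000, 40_000_000):
        # Repeat the source rows enough times, then trim to the target size.
        copies = -(-target_rows // len(source))  # ceiling division
        bigger = pd.concat([source] * copies, ignore_index=True).head(target_rows)
        bigger.to_csv(f"MICROBIOLOGYEVENTS_{target_rows}.csv", index=False)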
Nvidia V100 GPU

The benchmarking was done on both an Nvidia DGX-1 and an IBM POWER Systems AC922, using a single GPU in each.  Both servers had Nvidia V100 GPUs: the DGX-1's model has 32GB of RAM, and the AC922's has 16GB.


For the benchmarking, I ran some common processes on both PDF and GDF data frames and measured how long each took to run.  The processes were done in the following order, using a Jupyter Notebook that can be found on my Github (https://github.com/pacejohn/RAPIDS-Benchmarks); a minimal sketch of the operations follows the list.

  • Loading the file from a csv into PDF and GDF data frames.
  • Finding the number of unique values in one column.
  • Finding the number of rows with unique values in one column.
  • Finding the 5 smallest and largest values in one column.
  • Selecting 5 specific rows in a column by index.
  • Sorting the data frame by values in one column.
  • Creating a new column with no data in it.
  • Creating a new column populated with a calculated value (the value of a preexisting column multiplied by 2).
  • Dropping a column.
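
Here is a minimal sketch of how these processes can be timed on both data frames.  The column name SUBJECT_ID is an assumption (any numeric column in the file would do), and the single timed run is a simplification of the 5-run averages described above:

    import time
    import pandas as pd
    import cudf

    def timed(label, fn):
        # Time a single run of fn; the actual experiments averaged 5 runs.
        start = time.perf_counter()
        result = fn()
        print(f"{label}: {time.perf_counter() - start:.4f} s")
        return result

    path = "MICROBIOLOGYEVENTS.csv"
    col = "SUBJECT_ID"  # assumed numeric column; substitute any column

    # Loading the file from a csv into PDF and GDF data frames.
    pdf = timed("PDF load", lambda: pd.read_csv(path))
    gdf = timed("GDF load", lambda: cudf.read_csv(path))

    def add_and_drop(frame):
        # Covers the last three processes: a new column with no data, a
        # calculated column (a preexisting column times 2), and a drop.
        frame = frame.copy()
        frame["empty_col"] = float("nan")
        frame["doubled"] = frame[col] * 2
        return frame.drop(columns=["empty_col", "doubled"])

    for name, df in (("PDF", pdf), ("GDF", gdf)):
        timed(f"{name} unique values", lambda: df[col].nunique())
        timed(f"{name} rows per unique value", lambda: df[col].value_counts())
        timed(f"{name} 5 smallest/largest", lambda: (df.nsmallest(5, col),
                                                     df.nlargest(5, col)))
        timed(f"{name} select 5 rows by index",
              lambda: df.loc[[0, 10, 100, 1000, 10000], col])
        timed(f"{name} sort by column", lambda: df.sort_values(by=col))
        timed(f"{name} add/drop columns", lambda: add_and_drop(df))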

In addition, I created a second Jupyter Notebook that was used to concatenate 2 data frames.  In this experiment, MICROBIOLOGYEVENTS.csv, which has 631,726 rows, was concatenated onto each of the 5 MICROBIOLOGYEVENTS input files.
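
Continuing the sketch above (and reusing its timed helper and the pdf and gdf frames), the concatenation experiment looks roughly like this:

    # Concatenate the 631,726-row file onto each loaded data frame.
    small_pdf = pd.read_csv("MICROBIOLOGYEVENTS.csv")
    small_gdf = cudf.read_csv("MICROBIOLOGYEVENTS.csv")

    timed("PDF concat", lambda: pd.concat([pdf, small_pdf], ignore_index=True))
    timed("GDF concat", lambda: cudf.concat([gdf, small_gdf], ignore_index=True))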

Results

In 4 of the 9 experiments, the GDF outperformed the PDF regardless of the input file that was used.  In 3 experiments, the PDF outperformed the GDF.  Interestingly, in 2 experiments the PDF outperformed the GDF on small data frames but not on the larger ones.  In the concatenation experiments, the GDF always outperformed the PDF.  The results for the processes that were run on the AC922 are below.  The results for the DGX-1 are similar.  For complete results, including the actual times for the processes to run and the DGX-1 results, see my Github (RAPIDS_Benchmarks.xlsx).

The most remarkable differences in performance were in the following processes.

GDF Outperforms PDF

  • For time to load the input file, the GDF outperformed the PDF by an average of 8.3x (range 4.3x-9.5x).  For the input file with 40 million records, the GDF was created and loaded in 5.87 seconds while the PDF took 56.03 seconds.
  • When sorting the data frame by values in one column, the GDF outperformed the PDF by an average of 15.5x (range 2.1x-23.4x).  Because the GPU in the AC922 has only 16GB of RAM, the 40 million row data frame could not be sorted on it, so these numbers include the DGX-1 sort results for the 40 million row data frame.
  • When creating a new column populated with a calculated value, the GDF outperformed the PDF by an average of 4.8x (range 2.0x-7.1x).
  • The most remarkable performance difference was seen when dropping a single column.  Amazingly, the GDF outperformed the PDF by an average of 3,979.5x (range 255.7x-9,736.9x).  The performance gap scaled linearly as the data frame grew larger.
  • When concatenating the 631,726 row data frame onto another data frame, the GDF outperformed the PDF by an average of 10.4x (range 1.2x-29.0x).  As with sorting, the 16GB GPU ran out of memory when appending onto the 40 million row data frame, so these numbers include the DGX-1 results for the 40 million row data frame.

PDF Outperforms GDF

  • When finding the number of unique values in one column, the PDF outperformed the GDF by an average of 74.1x (range 5.7x-286.0x).  However, as the size of the data frame increased, the performance difference shrank dramatically, from 285.9x to 5.7x.  This suggests that past some size the GDF would most likely perform better, but additional experiments would be needed to demonstrate that.  The same trend appears in the other processes where the PDF outperforms the GDF on smaller data frames.
  • The most remarkable performance difference was seen when selecting 5 specific rows in a column by index.  In this case, the PDF outperformed the GDF by an average of 427.4x (range 32.2x-735.0x).

PDF Outperforms GDF on Smaller Data Frames
  • When selecting the 5 smallest and largest values for a column, the PDF outperformed the GDF on the 0.631 million, 5 million, and 10 million row data frames, with its advantage shrinking as the data frame grew (range 1.1x-12.7x).  On the larger data frames, the GDF performed best, and its advantage grew with the size of the data frame (range 1.8x-3.3x).
  • When adding a blank column, the PDF outperformed the GDF only on the 0.631 million row data frame.

Summary

As shown above, data frames that run on the GPU can often speed up processes that manipulate the data frame by 10x to over 1,000x compared to data frames that run on the CPU, but this is not always the case.  There is also a tradeoff in which smaller data frames perform better on the CPU while larger data frames perform better on the GPU.  The syntax for using a GDF is slightly different from that of a PDF, but the learning curve is not steep, and the effort is worth the reward.  I'm going to try the same benchmarking on some other data sets and use some other methods to see how the results compare.  Stay tuned for the next installment.

If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.

This article is also published on Medium at https://medium.com/@johnpace_32927/benchmarking-nvidia-rapids-cudf-versus-pandas-4da07af8151c.

#ai #artificialintelligence #machinelearning #deeplearning #powersystems #ac922 #nvidia #dgx1 #gpu #gpus #pandas #cuDF #cuda #dataframe #python #rapids #jupyter #mimic #mit

"AI for Good" - AI Summit NYC

12/18/2019


On December 11 and 12, I attended the AI Summit New York City.  AI Summit conferences are always top notch and showcase a wide variety of companies that are innovating with their AI technologies.  This year was no different.  There were companies showing products that ranged from using AI to determine which makeup is best for your skin type, to augmenting datasets by creating synthetic data, to optimizing data science workflows.  The breadth of technologies was definitely interesting.

Lisa Stone, Account Executive, Mark III Systems

At my booth, my co-worker, Lisa Stone, and I showed off the latest version of my American Sign Language transcription software.  The response was even bigger than at other conferences.  People from major banks, hospital groups, and universities discussed possible use cases for their industries.  A total of 25 people allowed me to gather more training data by letting me film them signing the alphabet as well as the words "hello", "good", "bad", "I love you", "you", "me", and "your".  Adding words that have motions is the next step of the project.  (Spoiler alert: I have that version done for some words, but it needs a little more testing before it moves to prime time.)  One exciting experience was when a student from a university in New Jersey came to see the demo.  He was interested, and we had a good discussion.  Later that day, he brought his entire team of 10 developers, data scientists, and other AI practitioners to see the demo.  You can imagine how surprised I was to see 10 people show up at once because one man was so excited that he wanted them all to see it.  Finally, I was interviewed on camera by the organizers of the conference because they thought the project was highly innovative and can help make people's lives better.  I call it "AI for Good".


Just as exciting, AI practitioners discussed the technology with me from a very technical perspective.  For me, that is the beauty of these types of conferences.  I am able to plunge deeply into the technology and see how other people are doing AI.  This information can be used to improve my product.  The sharing of ideas is how industries advance.  As the saying goes, "All of us is better than one of us."  How often do we get to talk to people about the hyperparameters we are using when training our models and how we can optimize them?  Not everyone is interested in that, but it's the type of stuff that gets me excited.  We also talked about the hardware I am running the training on.  Let's just say the hardware is quite impressive!

One final note: while in NYC, I got to visit Pace University.  No, I'm not the founder, and I don't think I am directly related, but how many people have a university that bears their name?

If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.

#ai #artificialintelligence #aisummit #machinelearning #deeplearning #aiforgood #asl #americansignlanguage #hyperparameters

Building an ASL AI Model at Supercomputing 2019

12/2/2019


Recently, I attended the Supercomputing 2019 (SC19) conference in Denver.  SC is a very large conference that highlights some of the most cutting-edge work in high-performance computing (HPC), AI, computational fluid dynamics, and other fields where massive computing power is required.  Groups like NASA, the national laboratories (including Sandia, Oak Ridge, and Lawrence Livermore), and major research universities all highlighted their work.  For someone who loves to see how far computing can be pushed, this is the place to be.

At the conference, I presented my software that uses AI to transcribe American Sign Language (ASL) in real-time as a person signs in front of a camera.  As letters of the alphabet are signed, they are written across the screen so they can be read (see video below for a demo).  The ultimate goal of the project is to develop both phone and web apps that allow someone who does not speak ASL to be able to understand what someone who does speak ASL is signing.

Since the project started in August of this year, I have been able to make significant progress.  One of the main things I need, and that is needed in any computer vision project, is lots and lots of training data.  At several other conferences this year I have asked people to allow me to take a video of them performing the ASL alphabet.  I was able to get video of 31 new people performing the alphabet.

At SC19, I added a new twist to the video collection.  I had a screen that played a 10-minute loop of various backgrounds, including random colors, a person walking through a castle and a forest, concert video footage, and highlights from the 2019 World Series (sorry Astros fans).  People stood in front of the screen as they performed the alphabet.  The reason for the screen is very simple.  When someone signs, they will rarely be standing in front of a solid, non-moving background.  Instead, they will be in a classroom or restaurant or outside or somewhere else where the environment is not static.  For the AI software to be generalizable, the training must be done using myriad backgrounds.  By adding 10 minutes of different backgrounds, I was able to ensure that each letter signed would be in front of a different background.  As I did at previous conferences, I made every attempt to get people with varying hand shapes, colors, sizes, fingernails (both painted and unpainted), and jewelry to sign the alphabet.  This will also make the models generalizable to the maximum number of people and reduce bias as much as possible.  Below is an image of Michaela Buchanan, a former student at Oregon State University and now a Systems Data Analyst at Mark III Systems, signing the American Sign Language alphabet in front of the changing backgrounds.

Michaela Buchanan signing the American Sign Language alphabet in front of the changing backgrounds

As at other conferences, the response to the software was very good.  I had many, many people come by and try the demo.  In fact, I had my first deaf person try it.  Honestly, I was quite nervous when she tried it.  I have been careful to make it as useful as possible, but I couldn't know for sure how successful it was until a deaf person tried it.  She was very impressed and said it will be very helpful for the deaf community.  She even asked how she can help with the development.  I am very much looking forward to working with her.

Finally, I was able to give 2 oral presentations on the challenges I face in developing the software.  Several people asked me questions, and we had good discussions about ways to overcome some of the challenges.  I also gave 2 video interviews for companies that wanted to showcase my work.

Presenting the challenges involved in creating the ASL transcription software

Video interview where I discussed my ASL transcription software

I was fortunate to attend the conference with several of my co-workers.  From left to right: Chris Hacker, Senior Systems Architect and AIX, data protection/backup, and POWER Systems guru; me; Stan Wysocki, President of Mark III Systems; and Chris Bogan, Vice President of Sales.

If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/.

#ai #artificialintelligence #deeplearning #computervision #ASL #convolutionalneuralnetwork #cnn #tensorflow #sc19
