BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. It’s safe to say it is taking the NLP world by storm. BERT was developed by Google, and Nvidia has created an optimized version that uses TensorRT (https://github.com/google-research/bert and https://devblogs.nvidia.com/nlu-with-tensorrt-bert/).
One drawback of BERT is that only short passages can be queried when performing Question & Answer. After a passage reaches a certain length, the correct answer cannot be found. To run a Question & Answer query, you provide the passage to be queried and the question you are trying to answer from that passage.
I have created a script that allows you to query longer passages and still get the correct answer. It takes an input passage and breaks it into paragraphs delimited by \n. Each paragraph is then queried to try to find the answer, and every answer that is returned is added to a list. The list is then analyzed to find the answer with the highest probability, which is returned as the final answer. When you run the script, be sure to change the paths to correspond with your setup. All code and files are on my GitHub.
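The approach above can be sketched in a few lines of Python. This is a minimal illustration, not the actual script: `answer_question` is a hypothetical stand-in for the real TensorRT BERT inference call, stubbed here with a toy heuristic so the control flow can run end to end.

```python
def answer_question(paragraph: str, question: str):
    """Stand-in for the TensorRT BERT Q&A call.
    Returns (answer, probability), or None if no answer is found.
    The toy heuristic below only exists so this sketch is runnable."""
    if "recurrence" in paragraph:
        # Pretend BERT is more confident when the paragraph also
        # mentions the time frame from the question.
        return ("3 patients", 0.87 if "12 years" in paragraph else 0.42)
    return None


def answer_from_long_passage(passage: str, question: str):
    """Split a long passage on newlines, query each paragraph,
    and return the candidate answer with the highest probability."""
    candidates = []
    for paragraph in passage.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        result = answer_question(paragraph, question)
        if result is not None:
            candidates.append(result)
    if not candidates:
        return None
    # Pick the (answer, probability) pair with the highest probability.
    return max(candidates, key=lambda pair: pair[1])
```

In the real script, each paragraph is short enough for BERT to handle, so the per-paragraph queries succeed even when the full document would not.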
In order to run the script properly, you need to make sure that a Docker container is created, and before running the query, be sure to start the TensorRT engine. Here are the steps Nvidia recommends, which are the steps I followed.
From your home directory, run the following. It takes a while.
Files I am using for queries
There are 3 files that I am using as input passages. Feel free to try it with your own passages.
22532_Full_Document.txt - this is the full document I am using. If you ask a question about the first part, it will return the correct answer. If you ask a question about a later part, it will not find the answer.
22532_Short_Document_With_Answers.txt - this is a shortened passage that has answers to the query. If you use the same query as I did for the question, it will find 2 answers. The one with the higher probability is the correct answer.
22532_Short_Document_Without_Answers.txt - this is a shortened passage that does not have the answers to the query. If you use the same query as I did for the question, it will not find any answers.
The question that is asked is "How many patients experienced recurrence at 12 years of age?" Feel free to experiment.
I welcome your feedback and suggestions. Follow me on Twitter (@pacejohn) and on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/). Also, follow my company, Mark III Systems on Twitter (@markiiisystems).
This post is also on Medium at https://medium.com/analytics-vidhya/question-and-answer-for-long-passages-using-bert-dfc4fe08f17f.
If you have ever used Pandas, you know that it is a great tool for creating and working with data frames. As with any software, optimizations can be made to improve its performance. Nvidia has developed software called RAPIDS, a data science framework that includes a collection of libraries for executing end-to-end data science pipelines completely on the GPU (www.rapids.ai). Included in RAPIDS is cuDF, which allows data frames to be loaded and manipulated on a GPU. In this post, I am going to discuss some benchmarking I have done with RAPIDS, particularly cuDF. I conducted multiple experiments in which I created data frames that ran on the CPU using Pandas and data frames that ran on the GPU using cuDF, then executed common methods on those data frames. I will be using the term “processes” to describe the execution of the methods. I will also be using the convention that Nvidia uses, in which the Pandas data frames are named PDF, for Pandas data frame, and the cuDF data frames are named GDF, for GPU data frame.
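To give a feel for how closely the two APIs mirror each other, here is a minimal sketch. The column names and values are made up for illustration; the cuDF line is shown as a comment because it requires an Nvidia GPU and the RAPIDS stack to run.

```python
import pandas as pd
# import cudf  # GPU path; requires an Nvidia GPU with RAPIDS installed

# PDF: a Pandas data frame that lives in CPU memory.
pdf = pd.DataFrame({"patient": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# GDF: the cuDF equivalent lives in GPU memory, with near-identical syntax:
# gdf = cudf.DataFrame({"patient": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# The same method calls work on both.
mean_value = pdf["value"].mean()
```

Because the syntax is so similar, most of the benchmarking code below differs only in which library constructs the data frame.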
For my benchmarking data, I used a CSV file from MIMIC-III, a freely accessible critical care database that was created by the MIT Lab for Computational Physiology (https://mimic.physionet.org/). The file was named MICROBIOLOGYEVENTS.csv. It consisted of 631,726 rows and 16 columns of data. I duplicated the records in the file to create new files that consisted of 5 million, 10 million, 20 million, and 40 million rows and 16 columns of data, respectively. Experiments were then conducted using each of these 5 files. An individual experiment is defined as running a process on one of the 5 versions of the MICROBIOLOGYEVENTS files as input. Each experiment was repeated 5 times and the results averaged together to give the final results that I list.
The benchmarking was done on both an Nvidia DGX-1 and an IBM POWER Systems AC922 using a single GPU in each. The GPUs in both servers were Nvidia V100s, with the DGX-1 having the 32GB model and the AC922 the 16GB model.
For the benchmarking, I ran some common processes on both PDF and GDF data frames and calculated the amount of time each took to run. The processes were run in the following order using a Jupyter Notebook that can be found on my GitHub (https://github.com/pacejohn/RAPIDS-Benchmarks).
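A timing harness for this kind of experiment can be sketched as below. This is a simplified stand-in for the notebook, not the notebook itself: the small example frame and column names are invented, and for the GPU runs the same callable would operate on a cuDF data frame instead of the Pandas one.

```python
import time
import pandas as pd

def time_process(fn, repeats: int = 5) -> float:
    """Run fn `repeats` times and return the mean wall-clock time in
    seconds, matching the repeat-and-average scheme described above."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Example: time a groupby-sum on a small PDF (toy data for illustration).
pdf = pd.DataFrame({"org": ["a", "b", "a", "c"] * 1000,
                    "count": range(4000)})
mean_seconds = time_process(lambda: pdf.groupby("org")["count"].sum())
```

Swapping `pd.DataFrame` for `cudf.DataFrame` is, in essence, all it takes to produce the GDF timings.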
In addition, I created a second Jupyter Notebook that was used to concatenate 2 data frames. In this experiment, MICROBIOLOGYEVENTS.csv, which has 631,726 rows, was concatenated onto each of the 5 MICROBIOLOGYEVENTS input files.
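The concatenation step looks roughly like this. The tiny frames and column names here are invented placeholders: `base` stands in for MICROBIOLOGYEVENTS.csv and `big` for one of the enlarged input files; on the GPU side, `cudf.concat` takes the same shape.

```python
import pandas as pd

# Placeholder frames for illustration only.
base = pd.DataFrame({"row_id": [1, 2],
                     "org": ["E. coli", "S. aureus"]})
big = pd.DataFrame({"row_id": [3, 4, 5],
                    "org": ["E. coli", "K. pneumoniae", "E. coli"]})

# Concatenate the base frame onto the larger one, resetting the index
# so the combined frame is numbered 0..n-1.
combined = pd.concat([big, base], ignore_index=True)
```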
In 4 of the 9 experiments, the GDF outperformed the PDF regardless of the input file that was used. In 3 experiments, the PDF outperformed the GDF. Interestingly, in 2 experiments the PDF outperformed the GDF on small data frames but not on the larger ones. In the concatenation experiments, the GDF always outperformed the PDF. The results for the processes that were run on the AC922 are below. The results for the DGX-1 are similar. For complete results, including the actual times for the processes to run and the DGX-1 results, see my GitHub.
The most remarkable differences in performance were in the following processes.
GDF Outperforms PDF
PDF Outperforms GDF
When the PDF outperformed the GDF, the results were astonishing.
PDF Outperforms GDF on Smaller Data Frames
As shown above, data frames that run on the GPU can often speed up processes that manipulate the data frame by 10x to over 1,000x when compared to data frames that run on the CPU, but this is not always the case. There is also a tradeoff in which smaller data frames perform better on the CPU while larger data frames perform better on the GPU. The syntax for using a GDF is slightly different from using a PDF, but the learning curve is not steep, and the effort is worth the reward. I’m going to try the same benchmarking on some other data sets and use some other methods to see how the results compare. Stay tuned for the next installment.
Follow me on Twitter (@pacejohn) and on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/). Also, follow my company, Mark III Systems on Twitter (@markiiisystems).
On December 11 and 12, I attended the AI Summit New York City. AI Summit conferences are always top notch and showcase a wide variety of companies that are innovating with their AI technologies. This year was no different. There were companies showing products that ranged from using AI to determine which makeup is best for your skin type, to augmenting datasets by creating synthetic data, to optimizing data science workflows. The breadth of technologies was definitely interesting.
At my booth, my co-worker, Lisa Stone, and I showed off the latest version of my American Sign Language transcription software. The response was even bigger than at other conferences. I had people from major banks, hospital groups, and universities discuss possible use cases for their industries. A total of 25 people allowed me to gather more training data by letting me film them signing the alphabet as well as the words “hello”, “good”, “bad”, “I love you”, “you”, “me”, and “your”. Adding words that have motions is the next step of the project. (Spoiler alert: I have that version done for some words, but it needs a little more testing before it moves to prime time.) One exciting experience was when a student from a university in New Jersey came to see the demo. He was interested, and we had a good discussion. Later that day, he brought his entire team of 10 developers, data scientists, and other AI practitioners to see the demo. You can imagine how surprised I was to see 10 people all show up at once because one man was so excited and wanted them to see it. Finally, I was interviewed on camera by the organizers of the conference because they thought the project was highly innovative and can help make people’s lives better. I call it “AI for Good”.
Just as exciting, I had AI practitioners who discussed the technology with me from a very technical perspective. For me, that is the beauty of these types of conferences. I am able to plunge deeply into the technology and see how other people are doing AI. This information can be used to improve my product. The sharing of ideas is how industries advance. As the saying goes, “All of us is better than one of us.” How many times do we get to talk to people about the hyperparameters we are using when training our models and how we can optimize them? Not everyone is interested in that, but it’s the type of stuff that gets me excited. We also talked about the hardware I am running the training on. Let’s just say the hardware is quite impressive!
One final note: while in NYC, I got to visit Pace University. No, I’m not the founder, and I don’t think I am directly related, but how many people have a university that bears their name?
I am looking forward to the conference next year. If you would like to discuss more, let me know. Be sure to follow me on Twitter (@pacejohn), on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/), and follow my company, Mark III Systems, on Twitter as well (@markiiisystems).
I attended the Super Computing 2019 (SC19) conference in Denver. SC is a very large conference that highlights some of the most cutting-edge work in high-performance computing (HPC), AI, computational fluid dynamics, and other fields where massive computation is required. Groups like NASA, the National Laboratories, including Sandia, Oak Ridge, and Lawrence Livermore, and major research universities all highlighted their work. For someone who loves to see how far computing can be pushed, this is the place to be.
At the conference, I presented my software that uses AI to transcribe American Sign Language (ASL) in real-time as a person signs in front of a camera. As letters of the alphabet are signed, they are written across the screen so they can be read. The ultimate goal of the project is to develop both phone and web apps that allow someone who does not speak ASL to be able to understand what someone who does speak ASL is signing.
Since the project started in August of this year, I have been able to make significant progress. One of the main things I need, as in any computer vision project, is lots and lots of training data. At several other conferences this year, I have asked people to allow me to take a video of them performing the ASL alphabet. At this one, I was able to get video of 31 new people performing the alphabet.
At SC19, I added a new twist to the video collection. I had a screen that played a 10-minute loop of various backgrounds, including random colors, a person walking through a castle and a forest, concert video footage, and highlights from the 2019 World Series (sorry Astros fans). People stood in front of the screen as they performed the alphabet. The reason for the screen is very simple. When someone signs, they will rarely be standing in front of a solid, non-moving background. Instead, they will be in a classroom or restaurant or outside or somewhere else where the environment is not static. For the AI software to be generalizable, the training must be done using myriad backgrounds. By adding 10 minutes of different backgrounds, I was able to ensure that each letter that was signed would be on a different background. As I did at previous conferences, I made every attempt to get people with varying hand shapes, colors, sizes, fingernails (both painted and unpainted), and jewelry to sign the alphabet. This will also make the models generalizable to the maximum number of people and reduce bias as much as possible.
As at other conferences, the response to the software was very good. I had many, many people come by and try the demo. In fact, I had my first deaf person try it. Honestly, I was quite nervous when she did. I have been careful to make it as useful as possible, but I wouldn’t know for sure how successful it was until a deaf person tried it. She was very impressed and said it will be very helpful for the deaf community. She even asked how she can help with the development. I am very much looking forward to working with her.
Finally, I was able to give 2 oral presentations on the challenges I face in developing the software. Several people asked me questions, and we had good discussions about ways to overcome some of those challenges.
If you haven’t had a chance, look at my Twitter feed (@pacejohn) to see the posts I made during the conference and to stay up to date on my latest research. Follow my company on Twitter as well (@markiiisystems) and www.markiiisys.com.