John Pace
Data Scientist, husband, father of 3 great daughters, 5x Ironman triathlon finisher, just a normal guy who spent a lot of time in school.
Let’s explore data science, artificial intelligence, machine learning, and other topics together.

Benchmarking Nvidia RAPIDS cuDF versus Pandas

12/20/2019

Image by Fremte at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22469496
Introduction
If you have ever used Pandas, you know that it is a great tool for creating and working with data frames. As with any software, optimizations can be made to improve its performance. Nvidia has developed RAPIDS, a data science framework that includes a collection of libraries for executing end-to-end data science pipelines completely on the GPU (www.rapids.ai). Included in RAPIDS is cuDF, which allows data frames to be loaded and manipulated on a GPU. In this post, I discuss some benchmarking I have done with RAPIDS, particularly cuDF. I conducted multiple experiments in which I created data frames on the CPU using Pandas and on the GPU using cuDF, then executed common methods on those data frames. I use the term “processes” to describe the execution of these methods. I also follow Nvidia’s convention, in which Pandas data frames are named PDF, for Pandas data frame, and cuDF data frames are named GDF, for GPU data frame.

For my benchmarking data, I used a CSV file from MIMIC-III, a freely accessible critical care database created by the MIT Lab for Computational Physiology (https://mimic.physionet.org/). The file, named MICROBIOLOGYEVENTS.csv, consists of 631,726 rows and 16 columns of data. I duplicated the records in the file to create new files of 5 million, 10 million, 20 million, and 40 million rows, each with the same 16 columns. Experiments were then conducted using each of these 5 files (the original plus the 4 enlarged versions). An individual experiment is defined as running a process with one of the 5 versions of the MICROBIOLOGYEVENTS file as input. Each experiment was repeated 5 times and the results were averaged to give the final results that I list.
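The row duplication described above could be done in several ways. The following is a minimal sketch that uses only the Python standard library and is not necessarily how the files for this post were built. The function name `duplicate_rows` and its parameters are hypothetical, and it assumes the source file is small enough to hold in memory, which is reasonable for a 631,726 row file.

```python
import csv


def duplicate_rows(src_path, dst_path, target_rows):
    """Write a CSV with `target_rows` data rows by cycling over the source rows."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)  # assumes the source file fits in memory
    if not rows:
        raise ValueError("source file has no data rows")

    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        # Repeat the source rows in order until the target size is reached.
        for i in range(target_rows):
            writer.writerow(rows[i % len(rows)])
    return target_rows
```

For example, `duplicate_rows("MICROBIOLOGYEVENTS.csv", "MICROBIOLOGYEVENTS_5M.csv", 5_000_000)` would produce the 5 million row file; the output file name here is only an illustration.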
​

Nvidia V100 GPU

The benchmarking was done on both an Nvidia DGX-1 and an IBM POWER Systems AC922, using a single GPU in each. Both servers used Nvidia V100 GPUs: the DGX-1 had the 32GB model and the AC922 had the 16GB model.


For the benchmarking, I ran some common processes on both PDF and GDF data frames and measured how long each took to run. The processes were run in the following order using a Jupyter Notebook that can be found on my GitHub (https://github.com/pacejohn/RAPIDS-Benchmarks).
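As an illustration of the timing approach (run a process, record the elapsed time, repeat 5 times, and average), here is a minimal sketch. The helper name `average_runtime` is hypothetical, and the notebook on GitHub may measure time differently.

```python
import time
from statistics import mean


def average_runtime(process, repeats=5):
    """Run `process` `repeats` times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        process()  # any zero-argument callable, e.g. a lambda wrapping one method call
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

A process can then be timed by wrapping it in a callable, for instance `average_runtime(lambda: df.sort_values("some_column"))`, where `df` and the column name stand in for a real data frame.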

  • Loading the file from a CSV into PDF and GDF data frames.
  • Finding the number of unique values in one column.
  • Finding the number of rows with unique values in one column.
  • Finding the 5 smallest and largest values in one column.
  • Selecting 5 specific rows in a column by index.
  • Sorting the data frame by values in one column.
  • Creating a new column with no data in it.
  • Creating a new column populated with a calculated value: the value of a preexisting column multiplied by 2.
  • Dropping a column.
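To make the list above concrete, here is a minimal Pandas sketch of the nine processes on a tiny stand-in data set. The column names and values are hypothetical, the exact calls in the notebook may differ, and the third step is only one reading of "rows with unique values". cuDF offers similarly named methods for these operations, so a GDF version would look much the same with `cudf` in place of `pd`.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the input CSV; the column names are hypothetical.
CSV_TEXT = "id,value\n1,30\n2,10\n3,20\n4,10\n5,50\n6,40\n"

pdf = pd.read_csv(io.StringIO(CSV_TEXT))              # 1. load the CSV
n_unique = pdf["value"].nunique()                     # 2. number of unique values
unique_rows = pdf.drop_duplicates(subset=["value"])   # 3. rows with a unique value (one reading)
smallest = pdf["value"].nsmallest(5)                  # 4. five smallest values ...
largest = pdf["value"].nlargest(5)                    #    ... and five largest values
picked = pdf["value"].iloc[[0, 1, 2, 3, 4]]           # 5. five specific rows by position
sorted_pdf = pdf.sort_values("value")                 # 6. sort by one column
pdf["empty"] = np.nan                                 # 7. new column with no data
pdf["doubled"] = pdf["value"] * 2                     # 8. new calculated column
pdf = pdf.drop(columns=["empty"])                     # 9. drop a column
```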

In addition, I created a second Jupyter Notebook to concatenate 2 data frames. In this experiment, the original MICROBIOLOGYEVENTS.csv file, which has 631,726 rows, was concatenated onto each of the 5 MICROBIOLOGYEVENTS input files.
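A minimal sketch of the concatenation step in Pandas follows. The two small frames are hypothetical stand-ins for the large input file and the 631,726 row file. cuDF provides a `cudf.concat` function with a similar call shape, so the GDF version should look nearly identical.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the large and small inputs.
large = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10, 20, 30, 40]})
small = pd.DataFrame({"id": [5, 6], "value": [50, 60]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping both indexes.
combined = pd.concat([large, small], ignore_index=True)
```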

Results

In 4 of the 9 experiments, the GDF outperformed the PDF regardless of the input file that was used. In 3 experiments, the PDF outperformed the GDF. Interestingly, in 2 experiments the PDF outperformed the GDF on small data frames but not on the larger ones. In the concatenation experiments, the GDF always outperformed the PDF. The results for the processes that were run on the AC922 are below. The results for the DGX-1 are similar. For complete results, including the actual run times for each process and the DGX-1 results, see my GitHub (RAPIDS_Benchmarks.xlsx).

​The most remarkable differences in performance were in the following processes.

GDF Outperforms PDF

  • When loading the input file, the GDF was on average 8.3x faster than the PDF (range 4.3x-9.5x). For the input file with 40 million records, the GDF was created and loaded in 5.87 seconds while the PDF took 56.03 seconds.
  • When sorting the data frame by values in one column, the GDF was on average 15.5x faster than the PDF (range 2.1x-23.4x). Because the GPU in the AC922 has only 16GB of RAM, the 40 million row data frame could not be sorted on it, so these numbers include the results of the sort on the DGX-1 for the 40 million row data frame.
  • When creating a new column populated with a calculated value, the GDF was on average 4.8x faster than the PDF (range 2.0x-7.1x).
  • The most remarkable performance difference was seen when dropping a single column. Amazingly, the GDF was on average 3,979.5x faster than the PDF (range 255.7x-9,736.9x). Performance scaled linearly as the data frame became larger.
  • When concatenating the 631,726 row data frame onto another data frame, the GDF was on average 10.4x faster than the PDF (range 1.2x-29.0x). As with sorting, the 16GB GPU ran out of memory when trying to append the data frame onto the 40 million row data frame, so these numbers include the results of the concatenation on the DGX-1 for the 40 million row data frame.
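The "x faster" figures above can be read as the PDF time divided by the GDF time. That reading is an assumption, but it is consistent with the load times reported for the 40 million row file. The helper name `speedup` is hypothetical, and whether the reported averages are means of per-file ratios or ratios of mean times is not stated here.

```python
def speedup(pdf_seconds, gdf_seconds):
    """Return how many times faster the GDF run was than the PDF run."""
    if gdf_seconds <= 0:
        raise ValueError("gdf_seconds must be positive")
    return pdf_seconds / gdf_seconds


# Load times reported above for the 40 million row file.
load_speedup = speedup(56.03, 5.87)
print(f"{load_speedup:.1f}x")  # prints "9.5x"
```

A value above 1 means the GDF was faster. For the processes where the PDF was faster, the same ratio with the arguments swapped gives the PDF's advantage.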

PDF Outperforms GDF

  • When finding the number of unique values in one column, the PDF outperformed the GDF by an average of 74.1x (range 5.7x-286.0x). However, as the size of the data frame increased, the performance difference shrank dramatically, from 285.9x to 5.7x. This suggests that at some point the GDF would most likely perform better, but additional experiments would be needed to demonstrate that. The same trend is seen in the processes where the PDF outperforms the GDF only on smaller data frames.
  • The largest performance difference in the PDF’s favor was seen when selecting 5 specific rows in a column by index. In this case, the PDF was on average 427.4x faster than the GDF (range 32.2x-735.0x).

PDF Outperforms GDF on Smaller Data Frames
​
  • When selecting the 5 smallest and largest values for a column, the PDF outperformed the GDF on the 0.631 million, 5 million, and 10 million row data frames, with its advantage shrinking as the data frame became larger (range 1.1x to 12.7x). On the larger data frames, the GDF performed best and its advantage grew as the data frame became larger (range 1.8x-3.3x).
  • When adding a blank column, the PDF outperformed the GDF only on the 0.631 million row data frame.
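Since the faster library depends on the size of the data frame, one could choose between Pandas and cuDF at run time. This is only a sketch. The function name `choose_backend` and the 10 million row cutoff are hypothetical, and the real crossover point depends on the operation and the hardware.

```python
import importlib.util

# Hypothetical cutoff; the crossover point depends on the operation and hardware.
GPU_ROW_THRESHOLD = 10_000_000


def choose_backend(n_rows, threshold=GPU_ROW_THRESHOLD):
    """Return "cudf" for large frames when cuDF is installed, else "pandas"."""
    has_cudf = importlib.util.find_spec("cudf") is not None
    if has_cudf and n_rows >= threshold:
        return "cudf"
    return "pandas"
```

On a machine without cuDF installed, the function always returns "pandas", so the same script can run on CPU-only systems.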

Summary

As shown above, data frames that run on the GPU can often speed up processes that manipulate the data frame by 10x to over 1,000x when compared to data frames that run on the CPU, but this is not always the case. There is also a tradeoff in which smaller data frames perform better on the CPU while larger data frames perform better on the GPU. The syntax for using a GDF is slightly different from that of a PDF, but the learning curve is not steep, and the effort is worth the reward. I’m going to try the same benchmarking on some other data sets and use some other methods to see how the results compare. Stay tuned for the next installment.

If you have questions or want to connect, you can message me on LinkedIn or Twitter. You can also follow me on Twitter @pacejohn and on LinkedIn at https://www.linkedin.com/in/john-pace-phd-20b87070/, and follow my company, Mark III Systems, on Twitter @markiiisystems.

This article is also published on Medium at https://medium.com/@johnpace_32927/benchmarking-nvidia-rapids-cudf-versus-pandas-4da07af8151c.

​#ai #artificialintelligence #machinelearning #deeplearning #powersystems #ac922 #nvidia #dgx1 #gpu #gpus #pandas #cuDF #cuda #dataframe #python #rapids #jupyter #mimic #mit