The Nvidia DGX Station is a high-performance AI workstation that enables your Data Science team to get started quickly with the power of a data center in your office. With 4 Nvidia V100 32GB GPUs and Nvidia Docker preloaded, there is plenty of computing power to allow multiple users to run their most challenging AI training projects. Additionally, the DGX Station allows you to download preconfigured and optimized Docker containers from the Nvidia GPU Cloud (NGC) to further decrease the time it takes to get started. Our customers use their DGX Stations for projects involving Natural Language Processing (NLP), classification with algorithms such as XGBoost, image analysis of x-rays and MRIs, and predictive analytics.
The setup for the DGX Station (DGX-S) is quick, typically less than 30 minutes from power-on until your first nvidia-docker run command. As you do your installation, here are a few tips to ensure the process goes as smoothly as possible. These tips come from our experience installing and configuring the DGX-S.
1. The Setup Guide says the Ethernet ports are configured for DHCP by default. We have found that this is typically not the case; instead, they are set to “manual”. Be sure to edit the /etc/network/interfaces file (or configure the ports using the GUI) and set the ports to either DHCP or a static IP address. As with any server, a static IP is best practice. Time-saving tip: have a monitor, keyboard, and mouse handy when first booting so you can configure the interfaces. After that, everything can be done over ssh.
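For reference, a minimal static-IP stanza in /etc/network/interfaces might look like the following. The interface name, addresses, and gateway here are illustrative examples, not values from the DGX-S:

```
# /etc/network/interfaces -- example static configuration (illustrative values)
auto enp2s0f0
iface enp2s0f0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 8.8.8.8
```

After editing, bouncing the interface with sudo ifdown/ifup (or a reboot) applies the change.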
2. Sometimes the DGX-S will ship with nvidia-docker v1. If so, it needs to be upgraded to nvidia-docker v2. You can check your nvidia-docker version using:
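The check is a single command; on a v2 system the output includes a line like “NVIDIA Docker: 2.0.x” (the grep pattern below is shown for illustration):

```shell
# Print the nvidia-docker version; v2 reports a line like "NVIDIA Docker: 2.0.x"
nvidia-docker version

# A scripted check of the same output
nvidia-docker version 2>/dev/null | grep -q '^NVIDIA Docker: 2\.0' && echo "nvidia-docker v2 detected"
```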
If the command returns 2.0.x, then your system already contains the upgrade to the Nvidia Container Runtime for Docker and no further action is needed. If it does not return 2.0.x, then perform the steps listed here.
3. One of the most “creative” errors you can encounter happens when trying to pull new Docker containers from NGC:
“server misbehaving” – someone had fun with that error. It is typically caused by a DNS issue. Make sure the DNS servers are correctly configured and the problem should go away. Pointing the system at a public resolver such as Google’s 8.8.8.8 is the most common fix.
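One way to confirm the diagnosis before changing anything is to test name resolution against NGC’s registry hostname directly (nvcr.io; the interface name below is an example):

```shell
# Sanity-check DNS: this should resolve if the configured nameservers work
nslookup nvcr.io

# If it fails, point the system at a working resolver. With the DGX-S's
# ifupdown-style networking, add a dns-nameservers line to the interface
# stanza in /etc/network/interfaces, e.g.:
#   dns-nameservers 8.8.8.8
# then bounce the interface to pick up the change:
sudo ifdown enp2s0f0 && sudo ifup enp2s0f0
```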
4. Finally, make sure you remove the piece of foam that is inside the DGX-S that keeps the GPUs from moving during shipping. It’s easy to overlook this step!
If you have any questions, I'd love to answer them. Be sure to follow me on Twitter (@pacejohn) and on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/).
#nvidia #dgx #dgxworkstation #artificialintelligence #ai #machinelearning #ngc #v100 #gpu #docker #nvidia-docker
At the Re-Work Deep Learning Summit in San Francisco in January, I learned about some new, very large image datasets. I had heard of and used the CIFAR-10 and CIFAR-100 datasets, as well as ImageNet. However, the “80 Million Tiny Images” dataset (which has been removed since this post was originally written) and the “CelebA” dataset were new to me. All of these are very large collections of images that are labeled for classification tasks.
I often benchmark servers against each other to figure out which perform best in different situations. Since computer vision is such a big topic today, benchmarks on large image datasets are particularly useful and informative. Even a 10% increase in speed can have a large impact on business.
In the next few weeks, I will be comparing the different buses in the IBM POWER9 AC922 server against several Intel-based servers. All servers will have the same Nvidia V100 GPUs and similar CPUs. However, the servers have different buses that move data at different speeds between the CPU and GPU; the faster the bus, the quicker data can move from CPU to GPU and vice versa. The AC922 uses NVIDIA’s NVLink technology between the CPU and GPU, while Intel-based servers use PCIe Gen3. The NVLink bus in the AC922 moves data at 150 GB/s, versus 32 GB/s for the PCIe Gen3 bus in the Intel-based servers. The question I am trying to answer is whether image classification tasks will run faster on the AC922, since its bus is approximately 5x faster than that of the Intel-based servers. Benchmarks from other groups using different types of datasets have shown that this is the case, but I do not think it has been done with these large datasets of very small images. After the bus benchmarks, I will be testing disk I/O, particularly the IBM Elastic Storage Server (ESS) running Spectrum Scale against other external storage systems. The results should be published in the next few weeks. I am excited to see how they come out.
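To put those bandwidth numbers in perspective, here is a back-of-envelope calculation using the peak rates above and a hypothetical 100 GB of training data (peak bandwidth is an upper bound that real workloads rarely sustain, which is exactly what the benchmarks are meant to measure):

```shell
# Time to move 100 GB of data across each bus at its peak rate
awk 'BEGIN {
  gb = 100                                      # hypothetical dataset size, GB
  printf "NVLink 2.0 (150 GB/s): %.2f s\n", gb / 150
  printf "PCIe Gen3  (32 GB/s):  %.2f s\n", gb / 32
  printf "bus speedup:           %.1fx\n", 150 / 32
}'
```

The raw ratio works out to about 4.7x, which is where the “approximately 5x” figure comes from.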
The details on the different datasets are below.
I welcome your feedback and suggestions. Follow me on Twitter (@pacejohn) and on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/)
#rework #deeplearning #machinelearning #neuralnetworks #convolutionalneuralnetworks #cnn #computervision #nvidia #ibm #powersystems #dgx #ac922 #artificialintelligence #imagenet
Last week, I attended the Re-Work Deep Learning Summit in San Francisco. The speakers included many of the top AI researchers and practitioners in the world - Facebook AI Research (FAIR), Google Brain, Netflix, Uber, MIT, UC-Berkeley, Amazon, and Pandora, just to name a few. The topics ranged from academic subjects like pruning neural networks to putting AI into practice in retail. As with most conferences, there seemed to be an overarching theme in the sessions I attended. At this conference, the theme was “you have to use multiple AI methods to solve your problem.” Consumers rely on AI more each day, and they expect it to evolve. For that evolution to occur, new techniques must be invented, and existing techniques must be combined.
On Day 1, I saw a talk by Kamiya Motwani of Walmart AI Labs entitled “Personalizing the Online Grocery Substitution Experience.” My family uses Walmart’s online grocery ordering and pickup. At times, items are out of stock, so those items get substituted. Sometimes we wonder why in the world we got some of the substitutions, because they seem unrelated to what we ordered. Other times, we get items that are significantly better than what we ordered. For example, we may order a 50-count bottle of Advil; it is out of stock, so they substitute a 250-count bottle for the same price. It seems strange, but we definitely don’t complain. This talk fascinated me because they discussed how they determine what to substitute.
As you can see from the pictures below, determining what to substitute is not a trivial problem. Walmart combines multiple AI and mathematical techniques to solve it, including constraint-based optimizations (CBO) and graph convolutional networks (GCN). These two techniques are important because they address two different problems that must be solved. The GCN is used to determine which item to substitute and the probability of acceptance of the substitution. However, that must also be combined with a cost analysis, which is where CBO comes into play. A substitution that is typically accepted may cost Walmart a significant amount of money, while substitutions that are declined make no money for Walmart. A balance has to be found that makes the customer happy and produces profit for Walmart.
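As a toy illustration of that balance (not Walmart’s actual model), imagine scoring each candidate by expected margin: the acceptance probability a GCN might estimate, multiplied by the margin the cost analysis cares about. The item names and numbers below are made up:

```shell
# Rank substitution candidates by expected margin = P(accept) * margin
awk -F, '
BEGIN { first = 1 }
NR > 1 {
  ev = $2 * $3                       # expected margin of offering this item
  if (first || ev > best) { best = ev; pick = $1; first = 0 }
}
END { printf "best substitute: %s (expected margin %.2f)\n", pick, best }
' <<'EOF'
item,p_accept,margin
advil_250ct_at_50ct_price,0.90,-1.50
generic_ibuprofen_50ct,0.60,0.80
tylenol_50ct,0.35,0.70
EOF
```

Note that the substitution customers would accept most readily (the 250-count bottle at the 50-count price) is exactly the one that loses money, which is why the acceptance model alone is not enough.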
I’m looking forward to analyzing our substitutions on our grocery order this weekend to see if I can figure out the reasoning. Just another example of AI making life easier for consumers while still providing profits for sellers. I call this AI for Good.
I welcome your feedback and suggestions. Follow me on Twitter (@pacejohn) and on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/).
#rework #deeplearning #machinelearning #neuralnetworks #fair #facebook #facebookairesearch #google #googlebrain #netflix #uber #mit #ucberkeley #amazon #pandora #walmart #walmartailabs #aiforgood #GCN #graphconvolutionalnetwork #constraintbasedoptimization