The new era of GPU computing has arrived

NVIDIA's GPU Technology Conference (GTC) Taiwan attracted more than 2,200 technologists, developers, researchers, government officials and media last week in Taipei. GTC Taiwan is the second of seven AI conferences NVIDIA will be holding in key tech centers globally this year. GTC is the industry's premier AI and deep learning event, providing an opportunity for developers and research communities to share and learn about new GPU solutions and supercomputers and have direct access to experts from NVIDIA and other leading organizations. The first GTC of 2018, in Silicon Valley in March, hosted more than 8,000 visitors. GTC events are showcases for the latest breakthroughs in AI use cases, ranging from healthcare and big data to high performance computing and virtual reality, along with many more advanced solutions leveraging NVIDIA technologies.

GTC 2018 in San Jose debuted the NVIDIA DGX-2 AI supercomputing system, a piece of technology that AI geek dreams are made of. The powerful DGX-2 system is an enterprise-grade cloud server that meets high performance computing and artificial intelligence requirements in one server. It combines 16 fully interconnected NVIDIA Tesla V100 Tensor Core GPUs for 10X the deep learning performance of its predecessor, the DGX-1, released in 2017. With half a terabyte of HBM2 memory and 12 NVIDIA NVSwitch interconnects, the DGX-2 system became the first single server capable of delivering 2 petaFLOPS of computational capability for AI systems. It is powered by the NVIDIA DGX software stack and a scalable architecture built on NVSwitch technology.

In this interview, Marc Hamilton, NVIDIA's Vice President of Solutions Architecture and Engineering, talks about GTC and the development of Taiwan's technology ecosystem. He and his engineering team work with customers and partners to deliver solutions powered by NVIDIA artificial intelligence and deep learning, professional visualization, and high performance computing technologies. From many visits to ecosystem partners and developers, Hamilton is very familiar with the pace of AI development in Taiwan.

AI is dealing with HPC-class scaling problems

AI technologies elevate the enterprise by transforming the way we work, increasing collaboration and ushering in a new era of AI-powered innovation. AI solutions are rapidly moving beyond hype and into reality, and are primed to become one of the most consequential technological segments. Enterprises need to rapidly deploy AI solutions in response to business imperatives. The DGX-2 system delivers a ready-to-go server solution that offers the path to scaling up AI performance.

DGX-2 is designed for both AI and HPC workloads. It simplifies and speeds up the scaling of AI with flexible switching technology for building massive deep learning compute clusters, combined with virtualization features that improve user and workload isolation in shared infrastructure environments. With this accelerated deployment model and an open architecture for easy scaling, development teams and data scientists can spend more time driving insights and less time building infrastructure.

For example, running HPC applications for weather forecasting means dealing with the massive scale of computation nodes. Forecasts are created using a model of the Earth's systems by computing changes based on fluid flow, physics and other parameters. The precision and accuracy of a forecast depend on the fidelity of the model and the algorithms, and especially on how many data points are represented. Computing a weather forecast requires scheduling a complex ensemble of pre-processing jobs, solver jobs and post-processing jobs. Since there is no use in a forecast for yesterday, the prediction must be delivered on time, every time. The prediction application is executed on a server node and receives reports from the monitoring programs distributed over the compute nodes.

Typically, such forecasts run on large distributed memory clusters, made up of thousands of nodes and hundreds of thousands of cores. Many HPC applications work best when data fits in GPU memory. The computations are built on interactions between points on the grid that represents the space being simulated, and on stepping the calculated variables forward in time. It turns out that in today's HPC technology, moving data in and out of the GPU takes more time than the computations themselves. To be effective, systems working with weather forecasting and climate modeling require high memory bandwidth and a fast interconnect across the system.
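As a rough illustration of the grid-and-time-step pattern described above, the following Python sketch advances a one-dimensional, diffusion-style stencil. The function names, the grid size and the coefficient `alpha` are assumptions chosen for illustration; this is not taken from any real forecasting code.

```python
def step(values, alpha=0.1):
    """Advance a 1-D grid by one time step with a three-point stencil.

    Each interior point is updated from its two neighbours; the two
    boundary points are held fixed. `alpha` is a hypothetical coefficient.
    """
    new = list(values)
    for i in range(1, len(values) - 1):
        new[i] = values[i] + alpha * (values[i - 1] - 2 * values[i] + values[i + 1])
    return new


def simulate(values, steps, alpha=0.1):
    """Apply `step` repeatedly and return the grid after `steps` time steps."""
    for _ in range(steps):
        values = step(values, alpha)
    return values


if __name__ == "__main__":
    grid = [0.0] * 4 + [100.0] + [0.0] * 4  # a single hot point in the middle
    print(simulate(grid, steps=10))
```

Each update reads three values and performs only a handful of arithmetic operations, which hints at why codes of this shape tend to be limited by how fast data can be moved rather than by raw compute.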

NVSwitch maximizes data throughput between GPUs leveraging NVLink

Memory is one of the biggest challenges in deep neural networks (DNNs) today. Memory in DNNs is required to store input data, weight parameters and activations as an input propagates through the network. Developers are struggling with the limited memory bandwidth of the DRAM devices that have to be used by AI systems to store the huge amounts of weights and activations in DNNs.

NVIDIA had long relied on PCI Express. When the company launched its Pascal architecture with the Tesla P100 GPU in 2016, one consequence of its increased server focus was that interconnect bandwidth and latency became an issue. The data throughput requirements of NVIDIA's GPU platform began outpacing what PCIe could provide. As a result, NVIDIA introduced a new interconnect called NVLink for its compute-focused GPUs.

With six NVLinks per GPU, the links could be teamed together for greater bandwidth between individual GPUs, or spread out to give lower-bandwidth but still direct connections to a greater number of GPUs. In practice this limited the size of a single NVLink cluster to eight GPUs in what NVIDIA calls a Hybrid Mesh Cube configuration, and even then it is a NUMA setup where not every GPU can see every other GPU. Using more than eight GPUs required multiple systems connected via InfiniBand, losing some of the shared memory and latency benefits of NVLink and closely connected GPUs.

In a DGX-2 system, there are 16 Volta GPUs in one server. To make this possible, NVIDIA introduced NVSwitch, which is designed to enable much larger clusters of GPUs by routing GPU traffic through one or more switches. A single NVSwitch has 18 full-bandwidth ports, three times as many as a single Tesla V100 GPU, and all of the NVSwitch ports are fully connected through an internal crossbar.

The goal with NVSwitch is to increase the number of GPUs that can be in a cluster. The switch readily allows a 16-GPU configuration, with 12 NVSwitch interconnects (216 ports) in the system maximizing the bandwidth available between the GPUs. NVSwitch enables GPU-to-GPU communication at 300GB per second, double that of the DGX-1 (and the HGX reference architecture it's based on). This advancement will drive hyper-connection between GPUs to handle bigger, more demanding AI projects for data scientists.
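The port and bandwidth figures quoted above can be checked with a few lines of arithmetic. The Python sketch below uses the counts given in this article; the per-link figure of 50GB per second (both directions combined) is an assumption based on NVIDIA's published NVLink numbers for the Tesla V100 and does not come from the article itself.

```python
GPUS = 16               # Volta GPUs in one DGX-2 (from the article)
LINKS_PER_GPU = 6       # NVLinks per GPU (from the article)
SWITCHES = 12           # NVSwitch chips in the system (from the article)
PORTS_PER_SWITCH = 18   # ports per NVSwitch (from the article)
LINK_GB_PER_S = 50      # ASSUMPTION: GB/s per NVLink, both directions combined


def total_switch_ports():
    """Total NVSwitch ports available in the system."""
    return SWITCHES * PORTS_PER_SWITCH


def gpu_facing_ports():
    """Switch ports needed to attach every NVLink of every GPU."""
    return GPUS * LINKS_PER_GPU


def per_gpu_bandwidth():
    """Aggregate NVLink bandwidth of one GPU in GB/s, under the assumption above."""
    return LINKS_PER_GPU * LINK_GB_PER_S


if __name__ == "__main__":
    print("switch ports:", total_switch_ports())
    print("ports facing GPUs:", gpu_facing_ports())
    print("GB/s per GPU:", per_gpu_bandwidth())
```

Under that assumption, six links per GPU account for the 300GB per second quoted above, and 12 switches with 18 ports each account for the 216 ports.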

NVIDIA wants to take NVLink lane limits out of the equation entirely, as using multiple switches should make it possible to build almost any kind of GPU topology in theory.

Deep learning frameworks such as TensorFlow don't need to understand the underlying NVLink topology in a server thanks to NCCL (NVIDIA Collective Communications Library), which is used by TensorFlow and all leading DL frameworks. NVIDIA's AI software stack is fully optimized and updated to support developers using DGX-2 and other DGX systems. This includes new versions of NVIDIA CUDA, TensorRT, NCCL and cuDNN, and a new Isaac software developer kit for robotics. Hamilton highlighted the release of TensorRT 4.0, a new version of NVIDIA's inference optimizer and runtime. TensorRT 4.0 integrates with the TensorFlow 1.7 framework, and TensorFlow remains one of the more popular deep learning frameworks today. Drawing on their knowledge of the GPU, NVIDIA engineers built TensorRT 4.0 to accelerate deep learning inference across a broad range of applications through optimizations and high-performance runtimes for GPU-based platforms.
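To give a feel for what a collective operation does, the Python sketch below simulates an all-reduce, in which every GPU ends up with the element-wise sum of all GPUs' buffers. It uses the ring algorithm, one well-known way to implement this collective. It is a plain simulation on Python lists: it is not NCCL's API, and the function name `ring_all_reduce` is a hypothetical one chosen for illustration.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over one buffer per rank.

    `buffers` holds one list of numbers per simulated GPU (rank). All lists
    must have the same length, divisible by the number of ranks. Returns a
    new list of buffers in which every rank holds the element-wise sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    if any(len(b) != size for b in buffers) or size % n != 0:
        raise ValueError("buffers must have equal length divisible by rank count")
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the complete
    # sum for chunk (r + 1) % n.
    for s in range(n - 1):
        # Snapshot every outgoing message before any rank applies an update.
        sends = [((r - s) % n, [data[r][i] for i in span((r - s) % n)])
                 for r in range(n)]
        for r in range(n):
            c, payload = sends[(r - 1) % n]  # receive from the left neighbour
            for k, i in enumerate(span(c)):
                data[r][i] += payload[k]

    # Phase 2, all-gather: completed chunks travel around the ring until
    # every rank has all of them.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, [data[r][i] for i in span((r + 1 - s) % n)])
                 for r in range(n)]
        for r in range(n):
            c, payload = sends[(r - 1) % n]
            for k, i in enumerate(span(c)):
                data[r][i] = payload[k]
    return data


if __name__ == "__main__":
    gradients = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two simulated GPUs
    print(ring_all_reduce(gradients))
```

In this sketch each rank only ever exchanges data with its ring neighbours. A framework calls the collective and leaves it to the library to decide how the data actually travels over the links available in the server.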

Hamilton said many TensorFlow users will gain the highest inference performance possible, along with a nearly transparent workflow, by using TensorRT. The new integration provides a simple API that applies powerful FP16 and INT8 optimizations by compiling TensorFlow code with TensorRT, speeding up TensorFlow inference by 8x for low-latency runs of the ResNet-50 benchmark.
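For readers unfamiliar with INT8 optimization, the Python sketch below shows the basic idea of reduced-precision inference: floating-point values are mapped to 8-bit integers through a scale factor and mapped back with a small error. It uses the simplest max-abs symmetric scheme. TensorRT's own calibration procedure is not described in this article and is not reproduced here, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with one symmetric scale factor.

    The scale is chosen so that the largest magnitude maps to 127. Returns
    the quantized integers and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.3, 1.0]
    q, scale = quantize_int8(weights)
    print(q)
    print(dequantize(q, scale))
```

The rounding error per value is at most half the scale factor. The general appeal of the approach is that 8-bit values take a quarter of the storage of 32-bit floats.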

In edge computing, TensorRT can be deployed on NVIDIA DRIVE autonomous vehicles and NVIDIA Jetson embedded platforms. Deep neural networks on every framework can be trained on NVIDIA DGX systems in the data center, and then deployed into all types of edge devices. With TensorRT software, developers can focus on developing advanced deep learning-powered applications rather than spending time fine-tuning performance for inference deployment.

HGX-2 server platform as a reference design for cloud data centers

The DGX-2 server is expected to ship to customers in Q3 2018. Meanwhile, bringing together the solution expertise of Taiwan's ecosystem partners and global server manufacturers, NVIDIA announced the HGX-2 cloud-server platform with Taiwan's leading server makers at GTC Taipei. The NVIDIA DGX-2 server is the first system built using the HGX-2 reference design.

The server industry has been one of the few industries that have remained strong for Taiwan ODMs, and increased opportunities in the AI field will help Taiwan system makers. NVIDIA engineering teams work closely with Taiwan ODMs to help minimize the development time from design win to production deployment. The HGX-2 is designed to meet the needs of the growing number of applications that seek to leverage both HPC and AI use cases. Server brands and ODMs are designing HGX-2-based systems to build a wide range of qualified GPU-accelerated systems for hyperscale data centers.

The HGX-2 server reference design consists of two baseboards. Each comes equipped with eight NVIDIA Tesla V100 32GB GPUs. These 16 GPUs are fully connected through NVSwitch interconnect technology. With the HGX-2 serving as a building block, server manufacturers will be able to build full server platforms that can be customized to meet the requirements of different data centers.
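The headline DGX-2 figures follow directly from this layout, as the short Python calculation below shows. The board count, GPU count and memory size come from the article. The figure of 125 teraFLOPS of peak Tensor Core throughput per Tesla V100 is an assumption taken from NVIDIA's published specifications, not from the article.

```python
BASEBOARDS = 2            # from the article
GPUS_PER_BASEBOARD = 8    # from the article
MEMORY_GB_PER_GPU = 32    # Tesla V100 32GB, from the article
TENSOR_TFLOPS_PER_GPU = 125  # ASSUMPTION: peak Tensor Core teraFLOPS per V100


def total_gpus():
    """Number of GPUs across both baseboards."""
    return BASEBOARDS * GPUS_PER_BASEBOARD


def total_memory_gb():
    """Combined HBM2 capacity in GB."""
    return total_gpus() * MEMORY_GB_PER_GPU


def peak_petaflops():
    """Combined peak Tensor Core throughput in petaFLOPS, under the assumption above."""
    return total_gpus() * TENSOR_TFLOPS_PER_GPU / 1000


if __name__ == "__main__":
    print(total_gpus(), "GPUs")
    print(total_memory_gb(), "GB of HBM2")
    print(peak_petaflops(), "petaFLOPS peak")
```

Sixteen GPUs with 32GB each give 512GB, the half terabyte of HBM2 memory cited for the DGX-2, and under the stated assumption the combined peak is the 2 petaFLOPS quoted earlier.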

NVIDIA AI collaboration in Taiwan

Hamilton says the areas of AI collaboration in Taiwan include hands-on training of 3,000 developers on leading applications of deep learning and providing high-level internship opportunities for Taiwanese post-doctoral students to work with NVIDIA engineering teams. The first AI hospital in Taiwan, sponsored by the LEAP program, which is supported by the Ministry of Science and Technology (MOST), is making it possible for doctors to see disease earlier and better understand it through advanced breakthroughs in AI.

Another case Hamilton highlighted is AI helping semiconductor foundries identify wafer defects. The solution focuses on using AI to sharpen the domestic semiconductor market's competitive position. The wafer defect detection system uses physics-based instruments to examine wafer images, leveraging an NVIDIA GPU-based optical neural network. The same idea has been adapted for the printed circuit board (PCB) industry to make visual inspection of PCBs more accurate and give production line managers a significant edge in discovering and resolving product issues.

NVIDIA HGX-2 cloud server platform

DIGITIMES' editorial team was not involved in the creation or production of this content.