OpenCL development tools and resources; applications in network packet processing

In addition to introducing a full range of embedded products that support the OpenCL specification, US-based Advanced Micro Devices (AMD) has worked with its industry partners to develop applications, code samples, and software libraries that use AMD's Accelerated Parallel Processing (APP) technology to exploit the parallel computing performance of heterogeneous multicore APUs and GPUs.

AMD builds a heterogeneous multicore accelerated processor architecture

Kelly Gillilan, embedded solutions product marketing manager at AMD, stated that today's multicore processors are divided into homogeneous and heterogeneous designs, the latter being composed of two or more different types of cores. The accelerated processing unit (APU) architecture proposed by AMD is a heterogeneous multicore design that combines x86 CPU cores, which are known for their memory read/write capabilities, with GPU cores built for parallel floating-point and 3D graphics computation. In June 2012, AMD, ARM, Imagination, MediaTek, Texas Instruments and Qualcomm established the Heterogeneous System Architecture (HSA) Foundation to collaborate on and promote the standardization of heterogeneous multicore computing.

The industry's first open, royalty-free specification for a general-purpose acceleration application programming interface (API) was the Open Computing Language (OpenCL), developed by the Khronos Group. AMD was an early adopter and supporter of OpenCL and, together with its technical partners, has created numerous applications, code samples, and software libraries based on AMD's APP technology, which enables OpenCL to deliver the full acceleration of parallel computing on heterogeneous multicore systems.

AMD offers a full range of OpenCL-compatible heterogeneous parallel processing products, from the low-power G-Series APU, which is designed to consume only 4.5W to 18W and delivers up to 80 GFLOPS of parallel computing performance, to the high-performance R-Series, which is designed to consume 17W to 35W and delivers 500 GFLOPS. In the AMD Radeon GPU series, the Radeon E6460 GPU comes in a BGA package and in MXM module or PCIe add-in card formats; it is designed for a power consumption of 20W and, paired with 25GB/s GDDR5 memory, delivers floating-point performance of 192 GFLOPS. The E6760 GPU comes in a BGA package and in MXM module or PCIe add-in card formats; it is designed for a power consumption of 35W and, paired with 51GB/s GDDR5 memory, delivers floating-point performance of 576 GFLOPS. The E6970 GPU comes as an MXM module; it is designed for a power consumption of 95W and, paired with 115GB/s GDDR5 memory, delivers floating-point performance of 1.3 TFLOPS.
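As a rough way to compare the three Radeon parts listed above, the quoted peak performance can be divided by the quoted design power. The following is a minimal sketch in C; the struct and function names are illustrative assumptions, and the figures are only the peak numbers cited in this article, not measured results.

```c
#include <assert.h>

/* Peak figures quoted in the article for the embedded Radeon GPUs. */
struct gpu_spec {
    const char *name;
    double gflops;   /* peak floating-point performance, in GFLOPS */
    double watts;    /* design power consumption, in W */
};

static const struct gpu_spec embedded_gpus[] = {
    { "Radeon E6460",  192.0, 20.0 },
    { "Radeon E6760",  576.0, 35.0 },
    { "Radeon E6970", 1300.0, 95.0 },   /* 1.3 TFLOPS */
};

/* Peak GFLOPS per watt of design power (illustrative helper). */
double gflops_per_watt(const struct gpu_spec *g)
{
    assert(g->watts > 0.0);
    return g->gflops / g->watts;
}
```

With these inputs the ratios come to roughly 9.6, 16.5 and 13.7 GFLOPS per watt for the E6460, E6760 and E6970 respectively.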

The basic concepts and programming style of OpenCL

Gillilan pointed out that OpenCL differs from the Open Graphics Library (OpenGL), which has long been used for graphics rendering, in that OpenCL uses GPUs for general-purpose computing. In the platform model defined by OpenCL, the system's host (generally the CPU) runs the application, manages the shared resources, and dispatches a collection of kernels for execution on one or more compute devices. Each compute device contains one or more compute units, and each compute unit contains one or more processing elements.

Take an array loop in C as an example. The traditional code might be int i; for (i=0;i<n;i++) c[i]=a[i]*b[i]. When ported to OpenCL, the loop body becomes a function declared with the __kernel qualifier, and the code itself becomes int id = get_global_id(0); c[id]=a[id]*b[id]. That is, the loop is broken into n independent work-items, each of which uses its global ID to select the element it processes, and the OpenCL runtime maps these work-items onto the available cores for parallel execution. Before proceeding to the next step, the host waits until every work-item has finished and all of the results have been written.
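The difference between the two styles can be sketched in plain C. The example below is a CPU-only emulation written for illustration: the OpenCL C kernel appears only in a comment, the function names are assumptions, and the sequential dispatcher is a hypothetical stand-in for what an OpenCL runtime would do in parallel after a call such as clEnqueueNDRangeKernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * For reference, an OpenCL C kernel for this loop could look like:
 *
 *   __kernel void vec_mul(__global const float *a,
 *                         __global const float *b,
 *                         __global float *c)
 *   {
 *       int id = get_global_id(0);
 *       c[id] = a[id] * b[id];
 *   }
 *
 * The plain C code below only emulates that execution model on the CPU.
 */

/* Serial version: a single thread walks the whole array. */
void vec_mul_serial(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Kernel body: handles exactly one element, selected by its work-item id.
 * In OpenCL the id would come from get_global_id(0); here it is passed in. */
static void vec_mul_work_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Hypothetical stand-in for the runtime: launches one work-item per element.
 * A real device would run these in parallel; this loop runs them in order. */
void vec_mul_emulated(const float *a, const float *b, float *c, size_t global_size)
{
    assert(a && b && c);
    for (size_t id = 0; id < global_size; id++)
        vec_mul_work_item(id, a, b, c);
}
```

Because no work-item reads another work-item's output, the order in which they run does not affect the result, which is what allows the runtime to spread them across cores.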

Memory in OpenCL is divided into four regions: global memory can be read and written by every compute unit on a compute device; constant memory is a region of global memory that kernels can only read, and both are typically served through a global/constant memory data cache; each compute unit has its own local memory, shared by the work-items running on it; and each processing element inside a compute unit has its own private memory.
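To make the roles of these memory regions more concrete, here is a minimal CPU-only sketch in C that mimics how a work-group might combine them when summing scaled values. The group size, the names, and the sequential loops are all assumptions made for illustration; in real OpenCL C the buffers would carry the __global, __constant, __local and __private address space qualifiers and the work-items would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4   /* assumed work-group size, chosen for illustration */

/* Stand-in for constant memory: read-only data visible to every work-item. */
static const float SCALE_FACTOR = 2.0f;

/* Computes one sum of SCALE_FACTOR * in[i] per work-group.
 * `in` and `group_sums` play the role of global memory buffers.
 * n must be a multiple of GROUP_SIZE. */
void group_sum_emulated(const float *in, float *group_sums, size_t n)
{
    assert(in && group_sums);
    assert(n % GROUP_SIZE == 0);

    for (size_t group = 0; group < n / GROUP_SIZE; group++) {
        /* Stand-in for local memory: one scratch array per work-group. */
        float scratch[GROUP_SIZE];

        for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
            /* Stand-in for private memory: a value owned by one work-item. */
            float x = in[group * GROUP_SIZE + lid] * SCALE_FACTOR;
            scratch[lid] = x;
        }

        /* In OpenCL a barrier(CLK_LOCAL_MEM_FENCE) would be needed here so
         * that every work-item's write to local memory is visible. */
        float sum = 0.0f;
        for (size_t lid = 0; lid < GROUP_SIZE; lid++)
            sum += scratch[lid];

        group_sums[group] = sum;   /* result goes back to global memory */
    }
}
```

Staging intermediate values in the per-group scratch array keeps most of the traffic off the larger and slower global buffers, which is the usual motivation for using local memory.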

The OpenCL programming model is likewise divided into three parts: 1. Host code, which runs on the CPU; 2. Headers, since each code fragment allocated for parallel computing must be defined by a header and a declaration; and 3. Kernels, which the OpenCL runtime assigns to the cores of the device.

AMD recommends that developers: 1. Set up a development environment that can run under both Windows and Linux; 2. Download the latest AMD Catalyst driver; 3. Download version 2 of the AMD APP SDK (Microsoft Windows 7 and later versions are supported); 4. Run the sample programs included in the SDK; 5. Try building and executing a sample program; 6. Start writing code by modifying the sample programs shipped with the SDK; and 7. Develop their own code, deploy it to the hardware platform, and run it in a simulator or a debug environment.

AMD has also set up a dedicated OpenCL Zone webpage where developers can download getting-started training documents, the APP SDK, and other resources for free. Resources from industry partners include the Sage Probe/EDK hardware debug package from Sage Electronic Engineering and the multicore coding, algorithm porting, and debugging consulting services offered by Texas Multicore Technologies (TMT).

Viosoft uses OpenCL to accelerate network packet processing applications

Next, Charles Chiou, the Taiwan country manager of Viosoft, explained the OpenCL development tools and resources and presented examples of network packet processing on the AMD APU platform. Chiou said that a traditional CPU paired with a discrete GPU has to communicate over the slower PCI/PCIe bus, which creates a packet transmission bottleneck and raises power consumption. An APU has more than 80 built-in Radeon graphics cores for parallel computing, which can be applied to visual computing and to security encoding and decoding for Internet transactions. The APU outperforms a traditional CPU with a discrete GPU in both performance per watt and system size, and can meet the needs of next-generation network packet processing applications.

Chiou used Vodafone (Australia) as an example. The operator offers free traffic (such as Facebook), discounts for traffic carried over other ISPs' networks, traffic for viewing advertising content, and zero-rated online shops (where the advertisers bear the cost). Traditional networking equipment that only monitors packet traffic, such as routers, firewalls, network address translation (NAT) devices and virtual private network (VPN) devices, cannot handle billing mechanisms of this kind, which depend on the content type and source of the packets.

Viosoft is currently collaborating with AMD on the Teranium project, which uses OpenCL and GPUs to accelerate packet processing. The OpenCL code runs under the control of the x86 cores, which can monitor the basic data of network packets centrally. The system can examine the various packets on the Internet, distinguish their timing and user-generated content, and target advertising or promotional activities at specific audiences, while filtering spam and blocking attacks by viruses or malicious network packets.

Chiou indicated that deep content inspection is the mainstream in current network equipment. However, to support zero-rated and sender-pays traffic with their specific requirements, network equipment must evolve toward systematic content processing that uses software-defined methods to monitor and filter a wide range of packets. After Viosoft ran its OpenCL-optimized network packet program on the APUs of a 1U rack-mounted system with two AMD APUs and dual 10Gbit LAN interfaces, the APUs could act fully as offload engines and increased IP packet forwarding performance by up to 50% (5Gbps).

The Teranium project that Viosoft is implementing with AMD includes an optimized 4x10Gbit (4-port) network interface driver with an extensible deep packet inspection framework that can bypass the TCP/IP software stack of the Linux kernel. Their research found that simply reducing the load on the CPU is not enough; the right strategy is to move large numbers of packets from the CPU to the GPU for parallel processing. Of course, code such as the device driver, the packet scheduling code, and the control plane code must still run on the CPU. Meanwhile, researchers continue to look for the best way to address problems such as the number of memory copies, CPU/GPU communication bottlenecks, and load scheduling.
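One common way to limit memory copies when handing packets to a GPU is to gather them into a single contiguous staging buffer, so that a whole batch can be transferred in one operation and each packet can then be processed by its own work-item. The sketch below shows that idea in plain C. It is not Viosoft's implementation: the structure, the function names, and the batch limits are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BATCH_MAX_PACKETS 16            /* hypothetical batch limits */
#define BATCH_MAX_BYTES   (16 * 1514)   /* room for 16 full-size Ethernet frames */

/* A contiguous staging buffer plus an index of where each packet starts. */
struct packet_batch {
    unsigned char data[BATCH_MAX_BYTES];
    size_t offset[BATCH_MAX_PACKETS];   /* start of packet i within data */
    size_t length[BATCH_MAX_PACKETS];   /* length of packet i in bytes */
    size_t count;                       /* packets currently in the batch */
    size_t used;                        /* bytes currently in use */
};

void batch_init(struct packet_batch *b)
{
    b->count = 0;
    b->used = 0;
}

/* Copies one packet into the staging buffer.
 * Returns 1 on success, or 0 if the batch is full and should be handed
 * to the GPU (and reset with batch_init) before more packets are added. */
int batch_add(struct packet_batch *b, const unsigned char *pkt, size_t len)
{
    assert(b && pkt);
    if (b->count == BATCH_MAX_PACKETS || len > BATCH_MAX_BYTES - b->used)
        return 0;

    memcpy(b->data + b->used, pkt, len);
    b->offset[b->count] = b->used;
    b->length[b->count] = len;
    b->count++;
    b->used += len;
    return 1;
}
```

In such a design the CPU would keep running the driver and scheduling code and fill batches, while the offset and length arrays tell each GPU work-item which slice of the buffer to inspect.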

OpenCL parallel acceleration can improve performance by tens to hundreds of times

Chiou illustrated the acceleration actually achieved with OpenCL using two test platforms: a CPU with a discrete AMD Radeon HD 5450 GPU card, and a Persimmon board with an Embedded G-Series APU. Under OpenCL, APP acceleration lets the CPU and GPU perform DES encryption, DES decryption, AES encryption, and AES decryption in parallel. For DES encryption, the Radeon HD 5450 result rose from 0.25 to 8.0 and the G-Series APU result rose from 0.57 to 52.67. For DES decryption, the Radeon HD 5450 rose from 0.25 to 5.5 and the G-Series APU rose from 0.62 to 52.25. For AES encryption, the Radeon HD 5450 rose from 0.11 to 13.4 and the G-Series APU rose from 0.85 to 59. For AES decryption, the Radeon HD 5450 rose from 0.4 to 9 and the G-Series APU rose from 0.62 to 72.42. Overall, the acceleration improved performance by a factor of roughly 22 to 120.
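The overall range can be checked by dividing each accelerated figure by its baseline. The following C sketch does that with the numbers quoted above; the units of the figures are not stated in the source, and the struct and function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Before/after figures as quoted in the article (units not stated). */
struct crypto_result {
    const char *test;
    const char *device;
    double before;   /* without OpenCL acceleration */
    double after;    /* with OpenCL acceleration */
};

static const struct crypto_result crypto_results[] = {
    { "DES encryption", "Radeon HD 5450", 0.25,  8.0  },
    { "DES encryption", "G-Series APU",   0.57, 52.67 },
    { "DES decryption", "Radeon HD 5450", 0.25,  5.5  },
    { "DES decryption", "G-Series APU",   0.62, 52.25 },
    { "AES encryption", "Radeon HD 5450", 0.11, 13.4  },
    { "AES encryption", "G-Series APU",   0.85, 59.0  },
    { "AES decryption", "Radeon HD 5450", 0.4,   9.0  },
    { "AES decryption", "G-Series APU",   0.62, 72.42 },
};

#define NUM_RESULTS (sizeof crypto_results / sizeof crypto_results[0])

/* Speedup factor for one row: accelerated figure divided by baseline. */
double speedup(const struct crypto_result *r)
{
    assert(r->before > 0.0);
    return r->after / r->before;
}

/* Finds the smallest and largest speedup across all rows. */
void speedup_range(double *min_out, double *max_out)
{
    double lo = speedup(&crypto_results[0]);
    double hi = lo;
    for (size_t i = 1; i < NUM_RESULTS; i++) {
        double s = speedup(&crypto_results[i]);
        if (s < lo) lo = s;
        if (s > hi) hi = s;
    }
    *min_out = lo;
    *max_out = hi;
}
```

With these inputs the smallest ratio is 22 (DES decryption on the Radeon HD 5450) and the largest is about 122 (AES encryption on the Radeon HD 5450), in line with the range quoted in the article.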

Finally, Chiou introduced the integrated development environment (IDE) kit that Viosoft has developed for the AMD APU platform and downstream developers. The IDE kit comprises GNU and Embedded Linux editions. Viosoft has also developed the ARRIBA remote cross-platform virtual debugging technology. In the example given, the client's embedded Linux AMD APU platform was located in Atlanta but could be controlled remotely by engineers in Austin, Texas. After running the VMON virtual device, the engineers' computers, whether running Windows or Linux, can communicate with the AMD APU platform at the other end, list its internal modules and code snippets, and perform single-step execution and debugging.

Kelly Gillilan, embedded solutions product marketing manager of AMD

Charles Chiou, the Taiwan country manager of Viosoft