A Case for PCI Express as a High-Performance Cluster Interconnect

In high-speed computing, there are significant benefits to simplifying the processor interconnect in rack- and chassis-based servers by designing in PCI Express. The PCI-SIG, the group responsible for the conventional PCI and the much-higher-performance PCIe standards, has released three generations of PCIe specifications over the last eight years and is fully expected to continue this progression with newer generations, from which HPC systems will continue to gain new features, faster data throughput and improved reliability.

The latest PCIe specification

The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full-duplex peer-to-peer traffic. Due to the widespread usage of PCI and PCIe in computing, communications and industrial applications, this interconnect technology's ecosystem is widely deployed and its cost efficiencies as a fabric are enormous. The PCIe interconnect, in each of its generations, offers a clean, high-performance interconnect with low latency and substantial savings in terms of cost and power. The savings come from its ability to eliminate multiple layers of expensive switches and bridges that previously were needed to bridge between the various standards. This article explains the key features of a PCIe fabric that make clusters, expansion boxes and shared-I/O applications relatively easy to develop and deploy.
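The quoted figure is straightforward to check: 48 lanes at 8 Gbps each is roughly 48 GB/s in one direction, or 96 GB/s of aggregate full-duplex bandwidth before protocol overhead. The short sketch below is an illustration rather than part of the article; it also folds in Gen 3's 128b/130b line encoding, which brings the usable number slightly under the raw one.

```c
/* Back-of-the-envelope check of the Gen 3 bandwidth figure quoted above.
 * Assumptions (not from the article): 128b/130b line encoding and no
 * TLP/DLLP protocol overhead, which would reduce usable payload further. */
#include <stdio.h>

int main(void)
{
    const double line_rate_gbps = 8.0;          /* Gen 3: 8 GT/s per lane  */
    const double encoding_eff   = 128.0 / 130.0;/* 128b/130b encoding      */
    const int    lanes          = 48;           /* 48-lane switch          */

    double per_lane = line_rate_gbps * encoding_eff / 8.0; /* GB/s, one direction */
    double one_dir  = per_lane * lanes;
    double full_dup = one_dir * 2.0;            /* traffic flows both ways */

    printf("per lane, one direction : %.2f GB/s\n", per_lane);
    printf("48 lanes, one direction : %.1f GB/s\n", one_dir);
    printf("48 lanes, full duplex   : %.1f GB/s\n", full_dup);
    return 0;
}
```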

DMA

DMA engines are now available on PCIe switches, such as those from PLX Technology. These engines are quite versatile and follow the standard descriptor fetch -> move data -> descriptor fetch approach, with a number of programmable pre-fetch options that allow efficient data movement directly between the memory of one host and that of another. The DMA engine in a PCIe switch serves a function similar to that of a DMA resource on a network card or host-bus adapter, in that it moves data to and from main memory. Figure 3 illustrates how an integrated DMA engine is used to move data, thereby avoiding the need to consume CPU resources in large file transfers. For low-latency messaging, the CPU can instead use programmed I/O to write directly into the system memory of a destination host through an address window over a non-transparent (NT) port, as sketched below.
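A minimal host-side sketch of that programmed-I/O path follows, assuming a Linux system where the NT port's BAR is exposed through sysfs. The device path, window size and doorbell location are hypothetical illustrations, not part of any vendor API.

```c
/* Sketch of the low-latency path described above: the CPU writes a small
 * message through an NT (non-transparent) BAR window that the switch
 * translates into the destination host's memory.  Path and offsets are
 * illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NT_BAR_PATH "/sys/bus/pci/devices/0000:04:00.1/resource2" /* hypothetical */
#define WINDOW_SIZE (1 << 20)                                     /* 1 MB window  */

int main(void)
{
    int fd = open(NT_BAR_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open NT BAR"); return 1; }

    /* Map the NT address window; stores here become PCIe writes that the
     * switch retranslates into the peer host's system memory.             */
    volatile uint8_t *win = (volatile uint8_t *)mmap(NULL, WINDOW_SIZE,
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const char msg[] = "hello from host A";
    memcpy((void *)win, msg, sizeof msg);   /* programmed-I/O message write */
    __sync_synchronize();                   /* ensure the write is posted   */
    win[WINDOW_SIZE - 1] = 1;               /* doorbell-style flag (illustrative) */

    munmap((void *)win, WINDOW_SIZE);
    close(fd);
    return 0;
}
```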

Data integrity

The PCIe specification provides extensive logging and error-reporting mechanisms. The two key data-integrity features are the link cyclic redundancy check (LCRC) and the end-to-end cyclic redundancy check (ECRC): LCRC protects each packet link-to-link, while ECRC provides end-to-end protection across the fabric. For inter-processor communication, ECRC augmented with protocol-level support provides excellent data protection. These inherent features, coupled with implementation-specific robustness in the data path, provide excellent data integrity in a PCIe fabric. Data protection has been fundamental to the protocol, given the ever-increasing data-integrity requirements of computing, storage and chipset architectures.
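The sketch below illustrates the end-to-end idea behind ECRC: the sender computes a 32-bit CRC over the payload and the receiver recomputes and compares it. It uses the 0x04C11DB7 generator polynomial that PCIe also uses, but it deliberately omits the spec's exact bit ordering and the masking of mutable TLP header fields, so treat it as an illustration rather than a conformant implementation.

```c
/* Illustrative end-to-end check in the spirit of ECRC: sender appends a CRC
 * over the payload, receiver recomputes it and compares. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32_msb(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
    uint8_t tlp_payload[64];
    memset(tlp_payload, 0xA5, sizeof tlp_payload);

    uint32_t sent = crc32_msb(tlp_payload, sizeof tlp_payload); /* appended by sender   */
    uint32_t rcvd = crc32_msb(tlp_payload, sizeof tlp_payload); /* recomputed at target */

    printf("ECRC-style check: %s (0x%08X)\n",
           sent == rcvd ? "payload intact" : "corruption detected", rcvd);
    return 0;
}
```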

Congestion management and scaling

There is no inherent fabric-level congestion management mechanism built into the protocol, hence the topologies where PCIe shines are small- to medium-scale clusters of up to 200 nodes. To scale beyond this node count, a shared-I/O Ethernet controller or converged adapter could be used to connect the individual PCIe clusters over a converged Ethernet fabric, as shown in Figure 4. Within a PCIe cluster, congestion management can be achieved with simple flow-control schemes among the nodes, such as the credit-based scheme sketched below.

Figure 4: Scaling a PCIe Fabric
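One simple node-level flow-control scheme of the kind mentioned above is credit-based: each node advertises a number of receive buffers, the sender decrements a per-destination credit counter on every message and stalls when credits run out, and the receiver returns credits as it drains its buffers. The sketch is illustrative; the node count, credit limits and function names are assumptions, not part of any PCIe specification.

```c
/* Minimal credit-based flow control between cluster nodes (illustrative). */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES    8
#define INIT_CREDITS 16   /* receive buffers advertised by each peer */

static int credits[MAX_NODES];

static void fc_init(void)
{
    for (int n = 0; n < MAX_NODES; n++)
        credits[n] = INIT_CREDITS;
}

/* Post a message only if the destination still has buffer credits. */
static bool fc_try_send(int dst)
{
    if (credits[dst] == 0)
        return false;              /* back-pressure: caller retries later */
    credits[dst]--;
    /* ...DMA or PIO transfer to dst would happen here... */
    return true;
}

/* Called when the peer acknowledges that it has freed 'n' receive buffers. */
static void fc_credit_return(int src, int n)
{
    credits[src] += n;
}

int main(void)
{
    fc_init();
    int sent = 0;
    for (int i = 0; i < 20; i++)
        if (fc_try_send(3)) sent++;
    printf("sent %d of 20 messages before running out of credits\n", sent);
    fc_credit_return(3, 4);
    printf("after credit return: send %s\n", fc_try_send(3) ? "ok" : "blocked");
    return 0;
}
```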

Software

PCIe devices have been designed such that their software discovery is backward-compatible with the legacy PCI configuration model -- an approach that has been critical to the success of the interconnect. This software compatibility holds within the framework of a single host, but for multiple hosts a layer of software is needed to enable communication between the hosts. Given that many clustering applications are already written against the standard APIs available for Ethernet and InfiniBand, there is active, ongoing development to build similar software stacks for PCIe that leverage those existing APIs, thereby reusing existing infrastructure.
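The practical consequence is that application code need not change. If, for example, the fabric is exposed through a virtual Ethernet interface (an assumption for illustration; the address and port below are hypothetical), an ordinary sockets program runs over it unmodified:

```c
/* Nothing here is PCIe-specific: existing socket code is reused as-is when
 * the PCIe fabric presents itself as an Ethernet-style interface.  The peer
 * address 10.0.0.2 and port 9000 are illustrative assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(9000);                    /* hypothetical service    */
    inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr);   /* peer node on the fabric */

    if (connect(s, (struct sockaddr *)&peer, sizeof peer) == 0) {
        const char msg[] = "hello over the PCIe-backed interface";
        write(s, msg, sizeof msg);
    } else {
        perror("connect");
    }
    close(s);
    return 0;
}
```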

In summary, there are now over 800 vendors developing various kinds of solutions around the PCIe interconnect standard's three generations. This has led to broad, worldwide adoption of the protocol, and a large percentage of the CPUs available in the market now have PCIe ports native to the processor. Given the low-cost emphasis on PCIe since its initial launch, the total cost of building out a PCIe fabric is an order of magnitude lower than that of comparable high-performance interconnects. As this I/O standard has evolved and become ubiquitous, the natural progression has been to use the fabric for expansion boxes and inter-processor communication. Dual-host solutions using non-transparency have been widely used in storage boxes since 2005, and now the industry is pushing the fabric to expand to larger-scale clusters. Hardware and software enhancements make it possible to build reasonably sized clusters with off-the-shelf elements. With the evolution of converged network and storage adapters, there is increased emphasis on packing more functionality into them, which pushes against power and cost thresholds when every computing node needs a heavy adapter. A simpler solution is to extend the existing PCIe interconnect on the computing nodes to handle inter-processor communication.

Vijay Meduri is vice president of engineering for PCI Express switching at PLX Technology, Sunnyvale, Calif. He can be reached at vmeduri@plxtech.com.

The benefits of Appro's hybrid CPU/GPU servers

Discover the benefits of Appro's hybrid CPU/GPU servers and clusters based on NVIDIA Tesla M2050 computing. Learn how to maximize HPC system performance while achieving reliability, density and easy upgradeability for small, medium and large-sized installations.

This webcast will present the key value and benefits of the new Appro GreenBlade System based on the 8- to 12-core AMD Opteron™ 6100 series processors. You will learn how to maximize system performance, scalability, reliability and serviceability while reducing electricity costs in your datacenter. We will also go over GreenBlade System building-block recipes that enable multiple clusters to be built on the same cluster architecture for small, medium and large-sized HPC deployments across a broad range of vertical markets. Register for this Cloud Computing webinar!

