ParTec

Tech know-how to go:
How European is JUPITER, the first exaflop system in Europe?

The JUPITER supercomputer will be the first exaflop system in Europe – a computer capable of at least one quintillion (10^18) floating-point operations per second.

The system will be supplied by a ParTec-Eviden supercomputer consortium and installed on the campus of the Jülich Research Centre in Germany in 2024. The total budget of 500 million euros covers both acquisition and operating costs.

The JUPITER system will consist of two compute modules: a booster module and a cluster module. The booster module will deliver an FP64 performance of 1 ExaFLOP/s, measured with the HPL benchmark. It implements a highly scalable system architecture based on the latest generation of NVIDIA GPUs in the Grace Hopper superchip form factor. The universal cluster module targets workloads that do not benefit from accelerator-based computation, as well as large or complex workflows that mix CPU and GPU execution phases or require CPUs at the periphery for pre- and post-processing. It utilises the first European HPC processor, Rhea1 from SiPearl, to provide exceptionally high memory bandwidth in support of such mixed workloads.

In addition, a 21 petabyte flash module (ExaFLASH) based on IBM Storage Scale software and a corresponding storage appliance based on IBM ESS 3500 devices will be provided. All JUPITER compute nodes as well as the storage and service systems are connected to a large NVIDIA Mellanox InfiniBand NDR fabric implementing a DragonFly+ topology.

The system configuration described above is the result of a public tender in which earlier JSC systems such as JUWELS, the Jülich Wizard on European Leadership Science, served as a blueprint. The current JUWELS and JSC user base played a key role in defining the requirements, with many benchmarks and applications used to evaluate the bids.

To summarise, JUPITER will combine a GPU-accelerated booster module delivering 1 ExaFLOP/s (FP64, HPL), a general-purpose cluster module based on the European Rhea1 processor, 21 PB of flash-based scratch storage (ExaFLASH) and an NVIDIA Mellanox InfiniBand NDR interconnect in a DragonFly+ topology.

The technology in detail (source: JUPITER Technical Overview, fz-juelich.de)

JUPITER Booster

The JUPITER booster module (the booster for short) will have around 6000 compute nodes in order to achieve a computing power of 1 ExaFLOP/s (FP64, HPL) and considerably more at lower precision (e.g. more than 90 ExaFLOP/s of theoretical 8-bit computing power). The module will use the NVIDIA Hopper GPU, the latest generation of NVIDIA’s HPC-focused general-purpose GPUs. The GPUs are deployed in the Grace Hopper superchip form factor (GH200), a novel, tight combination of NVIDIA’s first CPU (Grace) and latest-generation GPU (Hopper).

Each booster node has four GH200 superchips, i.e. four GPUs, each closely linked to a partner CPU (via NVLink chip-to-chip). With 72 cores per Grace CPU, a node has a total of 288 CPU cores (ARM). In a node, all GPUs are connected via NVLink 4, all CPUs via CPU NVLink connections.
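
To put these figures in perspective, the short Python sketch below multiplies them out. The assumed per-GPU FP64 tensor-core peak (around 67 TFLOP/s for an H100-class GPU) and the assumed HPL efficiency are illustrative values only, not official JUPITER specifications:

    # Back-of-the-envelope sketch of the booster's scale, using the node count
    # and node composition quoted in this article. The per-GPU FP64 peak and
    # the HPL efficiency are illustrative assumptions, not JUPITER figures.
    nodes = 6000                     # "around 6000 compute nodes"
    superchips_per_node = 4          # four GH200 superchips per booster node
    cores_per_grace_cpu = 72         # Neoverse V2 cores per Grace CPU

    gpus = nodes * superchips_per_node                  # 24000 GPUs
    cpu_cores = gpus * cores_per_grace_cpu              # 1728000 ARM cores

    assumed_fp64_tflops_per_gpu = 67.0   # assumption: H100 FP64 tensor-core peak
    assumed_hpl_efficiency = 0.62        # assumption: fraction of peak reached by HPL

    peak_eflops = gpus * assumed_fp64_tflops_per_gpu / 1e6
    hpl_eflops = peak_eflops * assumed_hpl_efficiency

    print(f"{gpus} GPUs, {cpu_cores} ARM cores")
    print(f"~{peak_eflops:.2f} EFLOP/s FP64 peak, ~{hpl_eflops:.2f} EFLOP/s HPL estimate")

With these assumptions, roughly 24,000 GPUs and about 1.7 million ARM cores lead to an estimate consistent with the 1 ExaFLOP/s HPL target.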

The Hopper H100 GPU variant installed in the system offers 96 GB of HBM3 memory, which can be accessed by the GPU’s multiprocessors at a bandwidth of 4 TB/s. Compared to previous NVIDIA GPU generations, H100 offers more multiprocessors, larger caches, new core architectures and other enhancements. With NVLink 4, a GPU can transfer data to any other GPU in the node at 150 GB/s per direction.

Each GPU is connected to a Grace CPU, NVIDIA’s first HPC CPU to use the ARM instruction set. The Grace CPU has 72 Neoverse V2 cores, which are SVE2-capable and each have four 128-bit functional units. The CPU can access 120 GB of LPDDR5X memory with a bandwidth of 500 GB/s. The main feature of the superchip design is the tight integration between CPU and GPU, which not only enables high bandwidth (450 GB/s per direction), but also more homogeneous programming.
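
Combined with the four superchips per node described above, these per-device figures yield the per-node memory capacities and bandwidths sketched below (Python, using only the numbers quoted in this article):

    # Per-node memory totals for a booster node, derived from the per-superchip
    # figures quoted above: 96 GB HBM3 at 4 TB/s per GPU and 120 GB LPDDR5X at
    # 500 GB/s per Grace CPU, with four GH200 superchips per node.
    superchips_per_node = 4
    hbm3_gb, hbm3_tb_s = 96, 4.0
    lpddr5x_gb, lpddr5x_tb_s = 120, 0.5

    node_hbm_gb = superchips_per_node * hbm3_gb            # 384 GB HBM3
    node_lpddr_gb = superchips_per_node * lpddr5x_gb       # 480 GB LPDDR5X
    node_hbm_tb_s = superchips_per_node * hbm3_tb_s        # 16 TB/s aggregate
    node_lpddr_tb_s = superchips_per_node * lpddr5x_tb_s   # 2 TB/s aggregate

    print(f"per node: {node_hbm_gb} GB HBM3 @ {node_hbm_tb_s} TB/s, "
          f"{node_lpddr_gb} GB LPDDR5X @ {node_lpddr_tb_s} TB/s")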

One CPU is connected to the three neighbouring CPUs in a node via dedicated CPU NVLink connections (cNVLink), which offer 100 GB/s. A further PCIe Gen 5 connection exists per CPU to its associated InfiniBand adapter (HCA). Four InfiniBand NDR HCAs of the latest generation, each with 200 Gbit/s bandwidth, are available in one node.
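
Converted to bytes, these four adapters give each booster node the following injection bandwidth into the fabric (a simple Python conversion of the figures above):

    # Injection bandwidth of one booster node: four InfiniBand HCAs at
    # 200 Gbit/s each, as stated above (8 bits per byte).
    hcas_per_node = 4
    gbit_per_hca = 200
    node_injection_gb_s = hcas_per_node * gbit_per_hca / 8   # 100 GB/s per node
    print(f"{node_injection_gb_s} GB/s injection bandwidth per booster node")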

The system is hot-water cooled and utilises the BullSequana XH3000 blade and rack design.

JUPITER Cluster

The cluster module will comprise more than 1300 nodes and achieve a performance of more than 5 PetaFLOP/s (FP64, HPL). The silicon powering the cluster is Rhea1, a processor developed in Europe as part of the European Processor Initiative (EPI) and commercialised by SiPearl. Rhea1, like Grace, utilises the ARM instruction set architecture (ISA), with the unique feature of providing exceptionally high memory bandwidth through the use of HBM2e memory.

Each cluster node has two Rhea1 processors based on ARM Neoverse Zeus cores, which provide Scalable Vector Extension (SVE) units for improved performance. In addition to the HBM memory, each node offers 512 GB of DDR5 main memory. Dedicated nodes with 1 TB of main memory will also be available.
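
Dividing the module-level figures across the nodes gives a rough feel for per-node performance; the Python sketch below treats the "more than" values quoted above as if they were exact, so the results are approximate:

    # Rough per-node performance of the cluster module, derived from the
    # module-level figures above (>1300 nodes, >5 PetaFLOP/s HPL, two Rhea1
    # processors per node). Approximate, since only lower bounds are given.
    nodes = 1300
    hpl_pflops = 5.0
    per_node_tflops = hpl_pflops * 1000 / nodes      # ~3.8 TFLOP/s per node
    per_cpu_tflops = per_node_tflops / 2             # ~1.9 TFLOP/s per Rhea1
    print(f"~{per_node_tflops:.1f} TFLOP/s per node, ~{per_cpu_tflops:.1f} TFLOP/s per Rhea1")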

The nodes are again based on the BullSequana XH3000 architecture and will be integrated into the global NVIDIA Mellanox InfiniBand interconnect with one NDR200 link per node.

JUPITER high-speed interconnect network

At the core of the system, the InfiniBand NDR network connects 25 DragonFly+ groups in the booster module and two additional groups for the cluster module, storage and administrative infrastructure. The network is fully connected, with more than 11,000 global 400 Gb/s links connecting all groups.
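
As a rough illustration of how these global links are spread over the fully connected topology, the Python sketch below divides the quoted link count across all group pairs; since only "more than 11,000" links are stated, the result is an estimate:

    # DragonFly+ global-link arithmetic from the figures above: 25 booster
    # groups plus 2 further groups, fully connected pairwise with more than
    # 11,000 global 400 Gb/s links in total.
    groups = 25 + 2
    group_pairs = groups * (groups - 1) // 2      # 351 group pairs
    global_links = 11000                          # lower bound quoted above
    links_per_pair = global_links / group_pairs   # ~31 parallel links per pair
    print(f"{group_pairs} group pairs, ~{links_per_pair:.0f} global links per pair")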

Within each group, connectivity is maximised with a full fat-tree topology. Leaf and spine levels are built from dense 64-port Quantum-2 NDR switches with 400 Gb/s ports; leaf switches use split ports to connect to the four HCAs per node on the booster module (one HCA per node on the cluster module), each at 200 Gb/s.

In total, the network comprises almost 51,000 links and 102,000 logical ports, with 25,400 endpoints and 867 high-radix switches, and still has free ports for future expansion, for example for additional computing modules.

The network was developed with HPC and AI use cases in mind. Its adaptive routing and advanced in-network computing capabilities enable a highly balanced, scalable and cost-effective structure for ground-breaking scientific applications.

ExaFLASH and ExaSTORE

JUPITER will provide access to multiple storage systems. Under the JUPITER contract, a storage system with 21 PB of usable high-bandwidth, low-latency flash capacity will be provided as scratch storage.

The scratch storage is based on 40 IBM Elastic Storage Server (ESS) 3500 systems that use NVMe flash technology and run the IBM Storage Scale software. With a raw capacity of 29 PB and a usable capacity of 21 PB, it will deliver a write speed of more than 2 TB/s and a read speed of 3 TB/s.
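
Spreading these aggregate figures evenly over the 40 building blocks gives an approximate per-system profile (Python sketch, derived only from the numbers above and purely illustrative):

    # Approximate per-building-block figures for the ExaFLASH scratch tier,
    # dividing the aggregate numbers above across the 40 IBM ESS 3500 systems.
    systems = 40
    raw_pb, usable_pb = 29, 21
    write_tb_s, read_tb_s = 2.0, 3.0
    print(f"~{raw_pb / systems * 1000:.0f} TB raw, ~{usable_pb / systems * 1000:.0f} TB usable per system")
    print(f"~{write_tb_s / systems * 1000:.0f} GB/s write, ~{read_tb_s / systems * 1000:.0f} GB/s read per system")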

In addition, a high-performance storage module with a raw capacity of more than 300 PB and a tape infrastructure for backup and archiving purposes with a capacity of more than 700 PB will be provided. The systems are connected directly to JUPITER, but are part of independent procurements. Dedicated servers will be available for data exchange between the individual storage systems.

Service partitioning and system management

JUPITER will be installed and operated via the unique JUPITER Management Stack. This is a combination of Smart Management Center xScale (Atos/Eviden), ParaStation Modulo (ParTec) and software components from JSC.

Slurm is used for workload and resource management, extended by ParaStation components. The backbone of the management stack is a Kubernetes environment that relies on highly available Ceph storage. The management stack is used to install and manage all hardware and software components of the system.

More than 20 login nodes will provide SSH access to the system’s various modules. In addition, the system will be integrated into the Jupyter environment at the JSC and made available via UNICORE.