Each MEEP platform layer has its own responsibility, and all of them can be analyzed using profiling and performance monitoring tools.

 

They are described in detail below.

We have identified a suite of potential workloads to accelerate on the MEEP platform:

Traditional HPC workloads.

Emerging bioinformatic applications.

Emerging HPDA (High-Performance Data Analytics) frameworks.

As shown in Figure 2, there is a rich set of applications; we will focus on those that combine the strongest RISC-V ecosystem support with the largest HPC impact.

 

Figure 2. Detailed MEEP software stack

 

During the early phase of the project, MEEP will use microkernels (such as DGEMM, SpMV, and FFT) and representative code hot spots (functions, loops, etc.) to drive the software/hardware co-design of the MEEP accelerator. In this phase, our focus is on the performance and efficiency achieved by low-level optimizations that exploit the architecture effectively.
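To give a concrete sense of the kind of microkernel involved, the sketch below shows a naive DGEMM in plain C. It is illustrative only, not a MEEP benchmark; real kernels add blocking, vectorization, and other low-level optimizations for the target architecture.

```c
/*
 * Illustrative only: a naive DGEMM microkernel (C = alpha*A*B + beta*C) of the
 * kind used as a co-design driver. Real kernels add blocking, vectorization
 * and other low-level optimizations for the target architecture.
 */
#include <stddef.h>

void dgemm_naive(size_t m, size_t n, size_t k,
                 double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    /* Row-major matrices: A is m x k, B is k x n, C is m x n. */
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```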

 

MPI+X workloads

 

  • HPC benchmarks such as HPL (High Performance Linpack) and HPCG (High Performance Conjugate Gradient).
  • Four widely used HPC applications: Quantum Espresso, Alya, NEMO, and OpenIFS.
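"MPI+X" refers to the common pattern of MPI across nodes combined with a node-level programming model such as OpenMP. The skeleton below is a minimal, illustrative sketch of that pattern in C; it is not taken from HPL, HPCG, or any of the applications listed above.

```c
/*
 * Minimal MPI+OpenMP ("MPI+X") skeleton, for illustration only. Each MPI rank
 * works on its slice of a distributed vector and uses OpenMP threads within
 * the node; a reduction combines the partial results across ranks.
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define LOCAL_N 1000000  /* elements owned by each rank (illustrative size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double x[LOCAL_N];
    double local_sum = 0.0, global_sum = 0.0;

    /* Node-level parallelism: OpenMP threads fill and reduce the local slice. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < LOCAL_N; i++) {
        x[i] = (double)(rank + 1);
        local_sum += x[i];
    }

    /* Cluster-level parallelism: MPI combines the per-rank partial sums. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```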

Additional applications

 

MEEP will also include in its applications portfolio a number of applications and application workflows from different areas that are developed on top of PyCOMPSs/COMPSs (see the Software Stack tab). Some of these are:

 

  • Multi-Level Monte Carlo (MLMC) codes.
  • BioBB library (including GROMACS). 
  • Multiscale Online Nonhydrostatic Atmosphere Chemistry model (NMMB-Monarch).
  • The Dislib ML library and the SMUFIN code.

 

For HPDA, the idea is to mainly use benchmarks (such as TensorFlow's Official Models and Spark-Bench), and possibly internal workloads like AIS-Predict, a TensorFlow application for vessel trajectory prediction. Likewise, we will use the Bolt65 suite, just-in-time video processing software, as part of the HPDA collection.

The MEEP software stack is changing the way we think about accelerators and which applications can execute efficiently on this hardware. This is also an opportunity to extend these applications into the RISC-V ecosystem, creating a completely open ecosystem from hardware to application software.

 

Regarding the Operating System (OS), two complementary approaches will be considered to expose the MEEP hardware accelerators to the rest of the software stack:

 

1) Offloading accelerated kernels from a host device to an accelerator device, similar in spirit to how OpenCL works.

 

2) Natively executing a fully-fledged Linux distribution on the accelerator, allowing most common software components to run directly on it and pushing the accelerator model beyond the traditional offload model explored in the first option.
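To illustrate the first approach, the sketch below expresses a kernel offload with standard OpenMP target directives. It only shows the offload model from the programmer's side; how the MEEP stack actually exposes the accelerator device is not shown here.

```c
/*
 * Illustrative only: kernel offload expressed with standard OpenMP target
 * directives. This shows the offload model from the programmer's side, not
 * how the MEEP software stack exposes the accelerator device.
 */
#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Copy a and b to the device, run the loop there, copy c back. If no
     * device is available, OpenMP falls back to host execution. */
    #pragma omp target teams distribute parallel for \
            map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}
```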

 

MEEP's approach will use the following software components:

 

  • LLVM is an umbrella project hosted by the LLVM Foundation for the development of compilers and related tooling. We will develop extensions to the LLVM compiler and other essential toolchain components, contributing to a more robust environment.
  • OpenMP is an industrial standard that defines a parallel programming model based on compiler directives and runtime APIs. The MEEP software stack will use the LLVM OpenMP runtime.
  • MPI (Message Passing Interface) is a runtime library interface for parallelizing applications on distributed-memory systems, such as clusters, following an SPMD (Single Program Multiple Data) paradigm.
  • PyCOMPSs/COMPSs is a parallel task-based programming model for distributed computing platforms. 
  • TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks.
  • Apache Spark (Spark for short) has as its architectural foundation the Resilient Distributed Dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.

The MEEP project also envisions the usage of containers as the mechanism to package, deploy and offload the execution of applications. However, since the project is based on a new architecture, not all needed software components will be available. 

In general, MEEP can leverage the FPGA's built-in components and hard macros, and efficiently map other components (IPs or custom designs) onto the available FPGA structures. The components in the system are arranged as multiple accelerator cores operating in parallel, able to execute different kinds of HPC applications.

 

MEEP extends the accelerator system architecture beyond traditional CPU-GPU systems by presenting a self-hosted accelerator.

 

This self-hosted accelerator can increase the number of accelerators “managed” by a traditional host, reducing the host CPU energy tax and boosting system efficiency.

 

More specifically, the emulation platform architecture is structured into two well-differentiated parts (below):

  • The FPGA Shell: This element contains all the necessary mechanisms for the accelerator to communicate with the host, and also with other neighboring accelerators.
  • The Emulated Accelerator (EA): This element is where all the data processing is executed and involves computation and memory management. 

 

Figure 3. MEEP Targeted UBB board in Phase 2 (image on the left), and a detailed view of the Emulation Platform (image on the right)

 

Communication infrastructure

As shown above, the host server is connected by PCIe to each of the FPGAs (host interface) and all FPGAs are fully connected with dedicated SerDes (Serializer/Deserializer) links, enabling all-to-all communication between the FPGAs or emulated accelerators. 

 

This communication infrastructure can be exposed to the software layers in a variety of ways, such as OpenMP or MPI. This requires both software APIs and the RTL and IP blocks inside the FPGA that move the bits around. All of these communication options enable a wide variety of EA mappings and system compositions, using one or more FPGAs to define a node.

 

Emulated accelerator architecture

 

MEEP’s first emulated accelerator architecture is a disaggregated structure that allows us to embed more intelligence near the memory controller and, in doing so, select how to use the memory interface.

 

The key challenge is to combine knowledge of the larger data structures (vectors) with the application's program flow to optimize the use of the on-chip memory hierarchy. Vectors let the system see into the future of the memory access stream and orchestrate memory requests in both space and time, optimizing on-chip memory usage.
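To make this idea concrete, the toy C model below shows how a vector memory request described by base, stride, and length exposes its whole address stream up front, so a controller can reason about which memory rows it will touch before issuing a single element. The descriptor layout and the 4 KiB row size are assumptions for illustration, not MEEP's MCPU design.

```c
/*
 * Illustrative sketch only: a toy model of why vector memory requests give a
 * memory controller advance knowledge of the access stream. The descriptor
 * and row size below are assumptions, not MEEP's MCPU design.
 */
#include <stdio.h>
#include <stdint.h>

#define ROW_BYTES 4096  /* assumed DRAM row size for the toy model */

/* A vector load/store described as base + stride + length: the whole address
 * stream is known before any element is issued. */
typedef struct {
    uint64_t base;
    uint64_t stride;   /* bytes between consecutive elements */
    uint32_t length;   /* number of elements */
} vec_mem_req;

/* Count how many distinct memory rows the request touches: a controller that
 * sees the full descriptor can open each row once and batch its accesses,
 * instead of discovering the pattern one scalar address at a time. */
static unsigned rows_touched(const vec_mem_req *r)
{
    unsigned rows = 0;
    uint64_t last_row = UINT64_MAX;
    for (uint32_t i = 0; i < r->length; i++) {
        uint64_t row = (r->base + (uint64_t)i * r->stride) / ROW_BYTES;
        if (row != last_row) {
            rows++;
            last_row = row;
        }
    }
    return rows;
}

int main(void)
{
    vec_mem_req unit    = { .base = 0x10000, .stride = 8,    .length = 256 };
    vec_mem_req strided = { .base = 0x10000, .stride = 4096, .length = 256 };

    printf("unit-stride load touches %u rows\n", rows_touched(&unit));
    printf("page-stride load touches %u rows\n", rows_touched(&strided));
    return 0;
}
```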

 

The Shell

 

Each FPGA in the system will use a standard FPGA shell defined in MEEP and organized as three distinct layers (below). This FPGA shell is composed of IP blocks, common to most FPGA-based systems, for communication and access to memory:

  • PCIe: This IP allows communication between the host and the FPGA. It provides a universal interconnect to the chip and the system.
  • HBM: This IP provides fast, high-capacity memory to support a self-hosted accelerator.
  • F2F: These accelerator links support nearest-neighbor and remote FPGA-to-FPGA (F2F) communication, as well as inter- and intra-chiplet data movement.

The innermost concentric ring is the MEEP shell interface to the internal FPGA logic of the emulated accelerator or other FPGA functionality. This ring is MEEP-specific RTL that provides a common interface to the remaining FPGA logic. The goal is to keep it as minimal and flexible as possible.

 

Figure 4. Shell description overview

 

 

Current work: Emulation platform architecture – Phase 1

 

Taking advantage of the latest available FPGA hardware platform, MEEP uses the Alveo U280 [Xilinx-U280], an actively cooled FPGA card, as shown in Figure 5. The card is built around the VU37P FPGA, which provides various hard IP blocks that can be used in the system, including HBM and its memory controller, a PCIe controller and drivers, and QSFP and Ethernet functionality. These hard macros align well with some of the base functionality of the ACME EA.

 

Figure 5. Alveo U280 card

 

The self-hosted accelerator conceived in MEEP is envisioned as a collection of chiplets that are composed together in a module. The goal of MEEP is to capture the essential components of one of the chiplets in the FPGA and replicate that instance multiple times.

Although those components might be seen as a collection of different elements, they can be separated into two main categories according to their main functionality: memory or computation (below).

 

Figure 6. High-level description of the Emulated Accelerator

 

On one hand, all memory components will be distributed, creating a complex memory hierarchy (from HBM and the intelligent memory controllers down to the scratchpads and L1 caches).

 

On the other hand, the computational engine is a RISC-V accelerator (above), whose main computational unit is the VAS (Vector And Systolic) Accelerator Tile.

 

The figure below shows the high-level ACME architecture with its major blocks.

 

Figure 7. The ACME architecture, and a detailed view down to the core level

 


 

Accelerated computational engine

 

Each VAS Accelerator Tile is designed as a multi-core RISC-V processor, capable of processing both scalar and vector instructions. Going down a level, each core is a composition of a scalar core and its coupled co-processors.

 

  • Scalar core: A traditional RISC-V scalar processor capable of recognizing co-processor and special instructions (RISC-V vector and systolic-array extensions) and forwarding them to one of its co-processors.
  • Co-processors: These units are added to the accelerator design to accelerate computation and memory operations as much as possible, while also reducing power consumption and data movement.
    • Vector processing unit (VPU): A specialized unit that processes RISC-V vector extension instructions.
    • Systolic array unit: MEEP implements a systolic array shell or template that provides standard coprocessor and memory interfaces. 
    • Memory Controller CPU (MCPU): It is a specialized unit that can process both scalar and vector memory instructions as well as atomics and simple arithmetic operations. Using vectors, the MCPU can see into the future of large memory requests and manage memory resources to maximize performance and minimize energy.

 

Figure 8. Accelerated computational engine schema

 

As shown above, the scalar core also has an associated co-processor:

 

  • Memory Controller CPU (MCPU): It is placed as close as possible to the main memory controller unit to reduce latencies.

For all these components, MEEP will develop new IPs or reuse existing ones that fit the project specifications and needs. For existing IPs selected as candidates for use in MEEP, a deeper analysis is necessary to understand whether any modifications or improvements are required.

 

Performance modeling

 

As an evaluation platform, MEEP has to provide tools to test new ideas at early stages, as well as throughout the development cycle of the target accelerator.

 

For this reason, MEEP has developed Coyote, a performance modeling tool that provides an execution-driven simulation environment for multicore RISC-V systems with multi-level memory hierarchies.

 

In this context, Coyote was designed to provide the following capabilities:

  • Support for the RISC-V ISA, both scalar and vector instructions.
  • Modeling of deep and complex memory hierarchy structures.
  • Multicore support.
  • Simulation of high-throughput scenarios.
  • Integration with existing tools.
  • Scalability, flexibility, and extensibility, to adapt to different working contexts and system characteristics.
  • Easy enabling of architectural changes, providing a simple mechanism to compare new architectural ideas.

Based on these criteria, MEEP found no existing infrastructure to use and created Coyote, a combination of two well-known simulation tools, Sparta and SPIKE. It focuses on modeling data movement throughout the memory hierarchy. This provides sufficient detail to study:

 

  • First-order comparisons between different design points.
  • The behaviour of memory accesses.
  • The well-known memory wall.

The last two are barriers to efficient computing. Cache coherence and lower-level modelling of the cores are out of scope for Coyote.

 

Coyote can currently simulate architectures with a private L1 instruction and data cache per core and a shared, banked L2 cache. Their size and associativity can be configured. For the L2, the maximum number of in-flight misses and the hit/miss latencies can also be configured. Two well-known data mapping policies have been implemented using different bits of the address: page-to-bank and set interleaving.
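As an illustration of how such policies select a bank from address bits, the sketch below implements two generic variants in C. The page size, line size, and bank count are assumptions for the example, not Coyote's actual parameters.

```c
/*
 * Illustrative sketch only: two generic bank-selection policies of the kind
 * described above (page-to-bank vs. set interleaving). The page size, line
 * size and bank count are assumptions, not Coyote's actual parameters.
 */
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12   /* assumed 4 KiB pages */
#define LINE_BITS 6    /* assumed 64 B cache lines */
#define NUM_BANKS 8    /* assumed number of L2 banks (power of two) */

/* Page-to-bank: consecutive pages map to consecutive banks, so a whole page
 * lands in a single bank. */
static unsigned bank_page_to_bank(uint64_t addr)
{
    return (addr >> PAGE_BITS) % NUM_BANKS;
}

/* Set interleaving: consecutive cache lines map to consecutive banks, so
 * accesses within a page are spread across banks. */
static unsigned bank_set_interleaving(uint64_t addr)
{
    return (addr >> LINE_BITS) % NUM_BANKS;
}

int main(void)
{
    /* Walk a few consecutive cache lines and see how each policy spreads them. */
    for (uint64_t addr = 0x2000; addr < 0x2000 + 4 * 64; addr += 64)
        printf("addr 0x%llx -> page-to-bank %u, set-interleaving %u\n",
               (unsigned long long)addr,
               bank_page_to_bank(addr),
               bank_set_interleaving(addr));
    return 0;
}
```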

 

Current work has set a solid foundation for a fast and flexible tool for HPC architecture design space exploration. The next steps are to extend the modelling capabilities of Coyote to include main memory, the NoC, the MMU, and different data management policies. Data output and visualization capabilities will also be extended to enable finer-grained analysis of memory accesses in certain regions of interest.