GHNAVS gem5 simulator

GHNAVS stands for Gem5 - HBM - NoC - ARM - V1 - Simulator

This is a patch for gem5 that enhances support for modeling ARM64 CPUs and ARM CMN interconnects.

I. Features

A multi-layered NoC implementation, in which each traffic class (VNET) uses a separate NoC. This models ARM CMN interconnects [7] more accurately than the existing gem5 NoC model (garnet2.0 [1]).
- The number of NoC layers can be adjusted with a single command-line parameter (--noc_layers).
Changes in the Ruby CPU Sequencer [2, 3] that avoid Sequencer port blocking caused by aliased requests. In the default Ruby CPU Sequencer, a request (read or write) to a cache line that already has an outstanding request is called an aliased request. An aliased request blocks the CPU master port, so even later requests that target a different cache line cannot be issued. This unrealistic behavior keeps the number of outstanding requests low and results in very poor performance for workloads that access memory sequentially (e.g. STREAM [4]).
- This new feature (called sequencer_port_block_bypass) can be turned on or off with a single configuration parameter.
Processor models:
- ARM Cortex-A76 CPU
- R-CPU, which models an ARM-Neoverse-V1 CPU core with SVE support
NoC topology: 4x4 mesh with 16 cores, 16 SLC slices, and 8 HBM2 controllers
HBM2 memory model
Very detailed NoC statistics
MOESI_CMP_directory cache-coherence (CC) protocol
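
The effect of aliased requests on sequential workloads can be illustrated with a toy model. This is a standalone Python sketch, not code from the patch: the function names, the 64-byte line size, and the way the bypass is modeled (the aliased request simply waits while later requests proceed) are all assumptions made for illustration.

```python
LINE_BYTES = 64  # assumed cache-line size in bytes


def cache_line(addr):
    """Cache-line index that a byte address falls into."""
    return addr // LINE_BYTES


def issue_burst(addresses, max_outstanding, bypass):
    """Return the addresses accepted while no earlier request has completed."""
    outstanding = set()  # cache lines that already have an in-flight request
    issued = []
    for addr in addresses:
        if len(outstanding) >= max_outstanding:
            break  # request table is full
        line = cache_line(addr)
        if line in outstanding:  # aliased request
            if bypass:
                continue  # toy bypass: only this request waits
            break  # toy default: the whole port blocks
        outstanding.add(line)
        issued.append(addr)
    return issued


# Sequential 8-byte accesses covering four cache lines.
stream = list(range(0, 256, 8))
print(issue_burst(stream, 128, bypass=False))  # [0]
print(issue_burst(stream, 128, bypass=True))   # [0, 64, 128, 192]
```

In this toy model the blocking variant gets a single request in flight, while the bypass variant reaches one request per cache line, which is the gap in outstanding requests described above.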

II. Instructions

A. How to build

Software Requirements

g++ version 7
scons version 3.01
python version 2.7

Additionally, you are advised to install pydot and the graphviz-related libraries in order to visualize the generated systems and cross-check that they are as expected. When those libraries are installed, gem5 produces config.dot.pdf and config.system.ruby.dot.pdf files, which contain a high-level visual representation of the components of the simulated system.

One option is to build an LXC Debian buster container that provides all of these requirements.

The provided patch is based on the public gem5 source code [6], commit 904784fb1e15f0c090fb1f1e5c5057e74b0b4ea8. To apply the patch, use the following:

    git clone https://gem5.googlesource.com/public/gem5
    cd gem5
    git checkout 904784fb1e15f0c090fb1f1e5c5057e74b0b4ea8
    git apply *.patch

Example build command:

    CXX=g++-7 scons NUMBER_BITS_PER_SET=128 PROTOCOL=MOESI_CMP_directory build/ARM/gem5.opt -j7 --force-lto

B. Important configuration parameters

The following parameters can be adjusted to more accurate values, depending on the system you want to set up.

| Param | File | Value | Description |
|---|---|---|---|
| max_outstanding_requests | src/mem/ruby/system/Sequencer.py | 128 | The number of outstanding requests for the Ruby CPU Sequencer |
| sequencer_port_block_bypass | src/mem/ruby/system/Sequencer.py | True | Use True to enable the FORTH Sequencer (False for the default Ruby Sequencer) |
| --sys-clock | command-line | 2GHz | System clock |
| --ruby-clock | command-line | 2GHz | Ruby subsystem clock |
| --cpu-clock | command-line | 2GHz | CPU clock |
| --topology | command-line | Mesh_EPI_quadrant_p1 | Mesh_XY with support for multiple NoC layers |
| --mesh-rows | command-line | 4 | Number of mesh rows (set to 4 for a 4x4 mesh) |
| --link-width-bits | command-line | 576 | Covers 8 bytes of control and 64 bytes of data (64 bits + 512 bits) |
| --noc_layers | command-line | 3 | The number of layers depends on the number of VNETs of the CC protocol in use. Use 3 for the MOESI_CMP_directory protocol. Use 6 to enable the multi-VNETs NoC feature |
| sve_vl_se | src/arch/arm/ArmISA.py | 1/2/4/8/16 | SVE vector length in quadwords (quadword = 128 bits). Only applies in SE mode |
| sve_vl | arch/arm/ArmSystem.py | 1/2/4/8/16 | SVE vector length in quadwords (quadword = 128 bits). Only applies in FS mode |
| --num-dirs | command-line | 4,8,... | On Ruby systems, the number of memory channels is directly related to the number of cache directories. Note that the `mem-channels` parameter is ignored when modeling Ruby systems |
| --mem-channels | command-line | 1,2,4,8,... | On classic memory systems, the number of memory channels is adjusted by `mem-channels` |
| --ports | command-line | 4 | For Ruby systems: CC transitions per cycle |
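
The arithmetic behind two of these values can be checked with a few lines of standalone Python. This is only a sketch: the helper names `link_width_bits` and `sve_vector_bits` are hypothetical and do not exist in gem5 or in the patch.

```python
QUADWORD_BITS = 128  # one SVE quadword, as stated in the table


def link_width_bits(control_bytes=8, data_bytes=64):
    """Link width in bits for the given control and data payload sizes."""
    return (control_bytes + data_bytes) * 8


def sve_vector_bits(quadwords):
    """SVE vector length in bits for an sve_vl / sve_vl_se value."""
    if quadwords not in (1, 2, 4, 8, 16):
        raise ValueError("sve_vl must be one of 1, 2, 4, 8, 16")
    return quadwords * QUADWORD_BITS


print(link_width_bits())   # 576, the value passed to --link-width-bits
print(sve_vector_bits(4))  # 512
```

For instance, a setting of 2 for sve_vl corresponds to a 256-bit vector, and 16 corresponds to 2048 bits.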

C. How to run

1. Syscall emulation (SE) case (`se.py`)

An example command for SE mode is the following:

    ./build/ARM/gem5.opt \
    --outdir=m5out/outdir_1 \
    configs/example/se.py \
    --cpu-type=R_CPU \
    --arm-iset=aarch64 \
    --num-cpus=4 \
    --num-dirs=4 \
    --caches \
    --l2cache \
    --num-l2caches=4 \
    --l1i_size=64kB \
    --l1d_size=64kB \
    --l2_size=1MB \
    --l1i_assoc=4 \
    --l1d_assoc=4 \
    --l2_assoc=8 \
    --mem-type=DDR4_2400_8x8 \
    --mem-size=2GB \
    --sys-clock=2GHz \
    --cpu-clock=2GHz \
    --ruby-clock=2GHz \
    --ruby \
    --topology=Mesh_EPI \
    --mesh-rows=2 \
    --network=garnet2.0 \
    --link-width-bits=576 \
    --noc_layers=3 \
    --vcs-per-vnet=4 \
    -c sve_stream_copy_se.exe

Note: Since multi-thread support is limited in SE mode, one can use -c "$executable1;$executable2;$executable3;$executable4" to launch the same or different applications on each core. The quotes keep the shell from treating `;` as a command separator.
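
When the command line is generated by a script, the per-core workload list can be assembled as follows. This is a standalone Python sketch; `se_workload_flag` is a hypothetical helper, and the only assumption about gem5 is the `;`-separated form of `-c` described in the note above.

```python
import shlex


def se_workload_flag(executables):
    """Build the -c argument for se.py from one executable per core."""
    if not executables:
        raise ValueError("at least one executable is required")
    joined = ";".join(executables)
    # shlex.quote adds single quotes only when the shell would otherwise
    # misinterpret the string, e.g. because it contains ';'.
    return "-c " + shlex.quote(joined)


print(se_workload_flag(["sve_stream_copy_se.exe"] * 4))
```

With a single executable the argument is emitted unquoted; with several, the whole list is wrapped in single quotes so that it reaches gem5 as one argument.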

2. FullSystem (FS) mode (`fs.py`)

For Full System mode we use the fs.py simulation script, as it supports the gem5 Ruby system [2], and therefore NoC modeling.

Getting a checkpoint: We always take a checkpoint using a classic system setup. The number of CPU cores, the SVE length, and the memory size are important, as changing any of them requires a new checkpoint. Additionally, mounting the image file or adding files to it requires a new checkpoint to be taken.

    ./build/ARM/gem5.opt \
    --outdir=m5out/checkpoint_outdir \
    configs/example/fs.py \
    --cpu-type=AtomicSimpleCPU \
    --cpu-clock=2GHz \
    --num-cpus=16 \
    --kernel=vmlinux_bsc \
    --disk=aarch64-ubuntu-armcl-sve.img \
    --machine-type=VExpress_GEM5_V1 \
    --mem-type=DDR4_2400_8x8 \
    --script=configs/boot/hack_back_ckpt.rcS \
    --mem-size=2GB

Restoring from a checkpoint: To restore from a checkpoint, the checkpoint directory (named cpt.*) should be copied (or symlinked) into the outdir of the new simulation (outdir_2 in the following examples).
If you don't want to restore from a checkpoint, remove -r 1 from the command line.
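
The symlink step can be scripted. The following standalone Python sketch links every cpt.* directory from the checkpoint run into the outdir of the new run; `link_checkpoints` is a hypothetical helper, and the directory names in the usage line are the ones used in the examples of this README.

```python
import os
from pathlib import Path


def link_checkpoints(checkpoint_outdir, new_outdir):
    """Symlink each cpt.* directory into new_outdir; return the linked names."""
    src = Path(checkpoint_outdir)
    dst = Path(new_outdir)
    dst.mkdir(parents=True, exist_ok=True)
    linked = []
    for cpt in sorted(src.glob("cpt.*")):
        if not cpt.is_dir():
            continue  # checkpoints are directories; skip anything else
        target = dst / cpt.name
        if not (target.is_symlink() or target.exists()):
            os.symlink(cpt.resolve(), target, target_is_directory=True)
        linked.append(target.name)
    return linked


if __name__ == "__main__":
    print(link_checkpoints("m5out/checkpoint_outdir", "m5out/outdir_2"))
```

Running the helper twice is harmless, because existing links are left in place.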

2-a) Simple system with a 2x2 Mesh NoC

2-b) 4x4 Mesh: 16 cores - 16 SLCs - 8 HBM2 controllers (Topology: Mesh_EPI_quadrant_p1)

    CXX=g++-7 scons NUMBER_BITS_PER_SET=128 PROTOCOL=MOESI_CMP_directory build/ARM/gem5.opt -j7 --force-lto

    ./build/ARM/gem5.opt \
    --listener-mode=off \
    --outdir=m5out/outdir_2 \
    configs/example/fs.py \
    -r 1 \
    --cpu-type=R_CPU \
    --restore-with-cpu=R_CPU \
    --arm-iset=aarch64 \
    --kernel=vmlinux_bsc \
    --disk=aarch64-ubuntu-armcl-sve.img \
    --machine-type=VExpress_GEM5_V1 \
    --script=configs/boot/bootscript1.rcS \
    --num-cpus=16 \
    --num-dirs=8 \
    --num-l2caches=16 \
    --caches \
    --l2cache \
    --l1i_size=64kB \
    --l1d_size=1MB \
    --l2_size=2MB \
    --l1i_assoc=4 \
    --l1d_assoc=4 \
    --l2_assoc=8 \
    --mem-type=HBM2_2000_4H_1x128 \
    --mem-size=2GB \
    --sys-clock=2GHz \
    --cpu-clock=2GHz \
    --ruby-clock=2GHz \
    --ruby \
    --topology=Mesh_EPI_quadrant_p1 \
    --mesh-rows=4 \
    --network=garnet2.0 \
    --link-width-bits=576 \
    --noc_layers=3 \
    --vcs-per-vnet=4

III. Advanced features

A. Setup with SLC approximation

MOESI_CMP_directory only supports L1 and L2 caches. However, a 3-level cache hierarchy can be approximated by using the L2 cache controllers to model the SLC slices. The L1D size is increased to match the combined L1+L2 size of the modeled platform, and the L2 size is increased to match the size of the platform SLC. The cache latencies are adjusted accordingly.

The provided patch already adjusts the L1/L2 cache latencies of the model. You can then adjust the L1D size, the L2 size (which acts as the SLC), and the number of L2 controllers (--num-l2caches), which approximates the number of SLC slices.

B. Multi-VNETs feature (Double NoC bandwidth)

This release supports multiple VNETs per request/response class, which means that each VNET can have 2 physical links. More specifically, one can instantiate a gem5 system with a total of 6 NoC layers instead of 3 while using MOESI_CMP_directory. Use --noc_layers=6, and make sure that use_offset_vnets = True in configs/network/Network.py.
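
The relation between VNETs, physical links, and the --noc_layers value can be written down as a short standalone Python sketch. `noc_layers_flag` is a hypothetical helper; the default of 3 VNETs is the value this README gives for MOESI_CMP_directory.

```python
def noc_layers_flag(num_vnets=3, links_per_vnet=1):
    """Return the --noc_layers flag for a given VNET count and link count."""
    if num_vnets < 1 or links_per_vnet < 1:
        raise ValueError("both counts must be positive")
    return "--noc_layers=%d" % (num_vnets * links_per_vnet)


print(noc_layers_flag())                  # --noc_layers=3
print(noc_layers_flag(links_per_vnet=2))  # --noc_layers=6
```

The second call corresponds to the multi-VNETs setup, which additionally needs use_offset_vnets = True as described above.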

C. Detailed NoC Latencies (and CC message types)

One can print detailed NoC queueing and network latencies per VNET and per source-destination pair. To enable this feature, use --debug-flags=RubyNetConnections. With this debug flag, the output contains two types of messages.

The first type reports the NoC latencies. For example:

    7207675622500: system.ruby.network: vnet:2, [L1Cache_Controller 3]->[L2Cache_Controller 5],
    NI[3->21], Rtr[46->37], hops[3],
    queue_AvgMinMaxMed[8.06/8/53/8.00],
    net_AvgMinMaxMed[9.86/9/91/9.00],
    flits:29393

The message is split into the following fields:

gem5 simulation current tick
VNET number
Source Controller -> Destination Controller (Reminder: L1 represents L1+L2 and L2 represents SLC)
Source NI -> Destination NI
Source Router -> Destination Router
Number of hops
Queueing latencies (Avg, Min, Max, Median)
Network latencies (Avg, Min, Max, Median)
Number of packets (flits) sent from source to destination during the last ROI
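
For post-processing, a latency message can be split into these fields with a regular expression. This is a standalone Python sketch that is not part of the patch: `LATENCY_RE` and `parse_latency` are hypothetical names, and the pattern is derived only from the single example above, so it may need adjusting for other output.

```python
import re

LATENCY_RE = re.compile(
    r"(?P<tick>\d+): system\.ruby\.network: vnet:(?P<vnet>\d+),\s*"
    r"\[(?P<src>\w+ \d+)\]->\[(?P<dst>\w+ \d+)\],\s*"
    r"NI\[(?P<src_ni>\d+)->(?P<dst_ni>\d+)\],\s*"
    r"Rtr\[(?P<src_rtr>\d+)->(?P<dst_rtr>\d+)\],\s*"
    r"hops\[(?P<hops>\d+)\],\s*"
    r"queue_AvgMinMaxMed\[(?P<queue>[\d./]+)\],\s*"
    r"net_AvgMinMaxMed\[(?P<net>[\d./]+)\],\s*"
    r"flits:(?P<flits>\d+)"
)


def parse_latency(text):
    """Return the fields of one latency message as a dict, or None."""
    m = LATENCY_RE.search(text)
    if m is None:
        return None

    def stats(field):
        avg, low, high, med = (float(x) for x in m.group(field).split("/"))
        return {"avg": avg, "min": low, "max": high, "med": med}

    return {
        "tick": int(m.group("tick")),
        "vnet": int(m.group("vnet")),
        "src": m.group("src"),
        "dst": m.group("dst"),
        "ni": (int(m.group("src_ni")), int(m.group("dst_ni"))),
        "routers": (int(m.group("src_rtr")), int(m.group("dst_rtr"))),
        "hops": int(m.group("hops")),
        "queue": stats("queue"),
        "net": stats("net"),
        "flits": int(m.group("flits")),
    }


SAMPLE = (
    "7207675622500: system.ruby.network: vnet:2, "
    "[L1Cache_Controller 3]->[L2Cache_Controller 5],\n"
    "NI[3->21], Rtr[46->37], hops[3],\n"
    "queue_AvgMinMaxMed[8.06/8/53/8.00],\n"
    "net_AvgMinMaxMed[9.86/9/91/9.00],\n"
    "flits:29393"
)
print(parse_latency(SAMPLE)["net"])
```

The pattern tolerates line breaks between fields, so it works whether the message is printed on one line or wrapped as in the example.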

The second type of message describes which CC message types each source controller sends to each destination controller, per VNET and ROI. An example printout is the following:

    7207675622500: system.ruby.network: vnet:1, [L2Cache_Controller 2]->[Directory_Controller 0], REQ: [GETX : 19691][GETS : 39488][PUTX : 19723][WRITEBACK_DIRTY_DATA : 19723]
    7207675622500: system.ruby.network: vnet:2, [L1Cache_Controller 3]->[L2Cache_Controller 0], RESP: [UNBLOCK : 19713][UNBLOCK_EXCLUSIVE : 9781]

Messages of this type contain the following fields:

gem5 simulation current tick
VNET number
Source Controller -> Destination Controller (Reminder: L1 represents L1+L2 and L2 represents SLC)
REQ / RESP (Request, Response Type)
A sequence of: [CC Message type : Number of messages during last ROI]
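
These messages can be parsed in the same way to obtain per-type counts. As before, this is a standalone Python sketch with hypothetical names (`CC_RE`, `COUNT_RE`, `parse_cc_message`), and the pattern is derived only from the two example lines above.

```python
import re

CC_RE = re.compile(
    r"(?P<tick>\d+): system\.ruby\.network: vnet:(?P<vnet>\d+),\s*"
    r"\[(?P<src>\w+ \d+)\]->\[(?P<dst>\w+ \d+)\],\s*"
    r"(?P<kind>REQ|RESP):\s*(?P<counts>(?:\[\w+ : \d+\])+)"
)
COUNT_RE = re.compile(r"\[(\w+) : (\d+)\]")


def parse_cc_message(line):
    """Return the fields of one CC message line as a dict, or None."""
    m = CC_RE.search(line)
    if m is None:
        return None
    counts = {name: int(n) for name, n in COUNT_RE.findall(m.group("counts"))}
    return {
        "tick": int(m.group("tick")),
        "vnet": int(m.group("vnet")),
        "src": m.group("src"),
        "dst": m.group("dst"),
        "kind": m.group("kind"),
        "counts": counts,
    }


LINE = (
    "7207675622500: system.ruby.network: vnet:1, "
    "[L2Cache_Controller 2]->[Directory_Controller 0], "
    "REQ: [GETX : 19691][GETS : 39488][PUTX : 19723]"
    "[WRITEBACK_DIRTY_DATA : 19723]"
)
print(parse_cc_message(LINE)["counts"])
```

Summing the counts per source and destination pair gives the total number of CC messages exchanged during the last ROI.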

IV. References

[1] http://old.gem5.org/Garnet2.0.html

[2] http://old.gem5.org/Ruby.html

[3] http://old.gem5.org/Coherence-Protocol-Independent_Memory_Components.html

[4] https://www.cs.virginia.edu/stream/

[5] http://old.gem5.org/Running_gem5.html

[6] https://gem5.googlesource.com/public/gem5

[7] Arm Neoverse CMN‑650 Coherent Mesh Network TRM

V. Acknowledgements

We thankfully acknowledge support for this research from the European High Performance Computing Joint Undertaking (EuroHPC JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI-SGA2). The EuroHPC JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. National contributions from the involved state members (including the Greek General Secretariat for Research and Innovation) match the EuroHPC funding.
