Impact of Write-Allocate Elimination on Fujitsu A64FX

Yan Kang, Mahmut Kandemir, Sayan Ghosh, Andrés Marquez

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. The deployment of the 48-core Fujitsu A64FX ARM-based processor in the RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms for enhancing throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocating a cache line on a write miss) through a special “zero fill” instruction available on ARM CPU architectures can improve the overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary since the cache line will subsequently be overwritten. While the bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to also consider irregular memory-access-driven scenarios such as graph analytics and to analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, the GAP benchmark suite, and Louvain graph clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using the LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system at Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across compilers and input data, and note end-to-end improvements of 5–20% for the benchmarks and a diverse spectrum of application scenarios owing to “zero fill” related adaptations.
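As a concrete illustration of the mechanism the abstract describes (not code from the paper itself), the sketch below shows one way a buffer can be zeroed on AArch64 with the DC ZVA instruction, which allocates and zeroes cache-line-sized blocks without first reading their old contents from memory. The helper names zva_block_size and zero_fill are hypothetical; production code, or the compilers' and C library's own zero-fill lowering of memset-like operations, would also handle alignment and remainder bytes.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper (not from the paper): query the zero-fill block size
       in bytes from the AArch64 DCZID_EL0 register. Bits [3:0] encode log2 of
       the block size in 4-byte words; bit 4 set means DC ZVA is prohibited. */
    static inline size_t zva_block_size(void) {
        uint64_t dczid;
        __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
        if (dczid & (1u << 4))
            return 0;
        return (size_t)4 << (dczid & 0xF);
    }

    /* Hypothetical sketch: zero a buffer with DC ZVA so each block is
       allocated and zeroed in cache directly, avoiding the memory read that a
       plain store loop would trigger on a write miss. Assumes buf is aligned
       to the block size and len is a multiple of it. */
    static void zero_fill(char *buf, size_t len) {
        size_t blk = zva_block_size();
        if (blk == 0) {            /* DC ZVA unavailable: fall back to plain stores */
            for (size_t i = 0; i < len; ++i)
                buf[i] = 0;
            return;
        }
        for (size_t off = 0; off < len; off += blk)
            __asm__ volatile("dc zva, %0" : : "r"(buf + off) : "memory");
    }

In practice such zero filling is often applied implicitly, for example by optimized memset implementations that use DC ZVA for large buffers, rather than hand-written loops like the sketch above.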

Original language: English (US)
Title of host publication: Proceedings of International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2024 Workshops
Publisher: Association for Computing Machinery
Pages: 24-35
Number of pages: 12
ISBN (Electronic): 9798400716522
DOIs
State: Published - Jan 11 2024
Event: 2024 International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2024 Workshops - Nagoya, Japan
Duration: Jan 25 2024 → …

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2024 International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2024 Workshops
Country/Territory: Japan
City: Nagoya
Period: 1/25/24 → …

All Science Journal Classification (ASJC) codes

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software
