TACO: Vol 20, No 1

Volume 20, Issue 1, March 2023

Editor:

David Kaeli
Northeastern University, USA

Publisher:

Association for Computing Machinery
New York, NY, United States

ISSN: 1544-3566

EISSN: 1544-3973

research-article

Open Access

Symbolic Analysis for Data Plane Programs Specialization

Article No.: 1, pp 1–21, https://doi.org/10.1145/3557727

Programmable network data planes have extended the capabilities of packet processing in network devices by allowing custom processing pipelines and agnostic packet processing. While a variety of applications can be implemented on current programmable data ...

research-article

Open Access

BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation

Article No.: 2, pp 1–28, https://doi.org/10.1145/3558003

For Affine Control Programs or Static Control Programs (SCoP), symbolic counting of reuse distances could induce polynomials for each reuse pair. These polynomials along with cache capacity constraints lead to non-affine (semi-algebraic) sets; and ...

research-article

Open Access

As-Is Approximate Computing

Article No.: 3, pp 1–26, https://doi.org/10.1145/3559761

Although approximate computing promises better performance for applications allowing marginal errors, a dearth of hardware support and a lack of run-time accuracy guarantees make it difficult to adopt. We present As-Is, an Anytime Speculative Interruptible ...

research-article

Open Access

TokenSmart: Distributed, Scalable Power Management in the Many-core Era

Article No.: 4, pp 1–26, https://doi.org/10.1145/3559762

Centralized power management control systems are hitting a scalability limit. In particular, enforcing a power cap in a many-core system in a performance-friendly manner is quite challenging. Today’s on-chip controller reduces the clock speed of compute ...

research-article

Open Access

Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization

Article No.: 5, pp 1–26, https://doi.org/10.1145/3561651

Persistent memory (PM) provides large-scale non-volatile memory (NVM) with DRAM-comparable performance. The non-volatility and other unique characteristics of PM architecture bring new opportunities and challenges for the efficient storage system design. ...

research-article

Open Access

Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance

Article No.: 6, pp 1–23, https://doi.org/10.1145/3561652

GraphBLAS is a recent standard that allows the expression of graph algorithms in the language of linear algebra and enables automatic code parallelization and optimization. GraphBLAS operations are memory bound and may benefit from data locality ...

research-article

Open Access

SSD-SGD: Communication Sparsification for Distributed Deep Learning Training

Article No.: 7, pp 1–25, https://doi.org/10.1145/3563038

Intensive communication and synchronization cost for gradients and parameters is the well-known bottleneck of distributed deep learning training. Based on the observations that Synchronous SGD (SSGD) obtains good convergence accuracy while asynchronous ...

research-article

Open Access

PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

Article No.: 8, pp 1–31, https://doi.org/10.1145/3563697

Commodity DRAM-based processing-using-memory (PuM) techniques that are supported by off-the-shelf DRAM chips present an opportunity for alleviating the data movement bottleneck at low cost. However, system integration of these techniques imposes non-...

research-article

Open Access

Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks

Article No.: 9, pp 1–24, https://doi.org/10.1145/3563695

MicroScope and other similar microarchitectural replay attacks take advantage of the characteristics of speculative execution to trap the execution of the victim application in a loop, enabling the attacker to amplify a side-channel attack by executing it ...

research-article

Open Access

Quantifying Resource Contention of Co-located Workloads with the System-level Entropy

Article No.: 10, pp 1–25, https://doi.org/10.1145/3563696

The workload co-location, such as deploying offline analysis workloads with online service workloads on the same node, has become common for modern data centers. Workload co-location deployment improves data center resource utilization significantly. ...

research-article

Open Access

A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks

Article No.: 11, pp 1–24, https://doi.org/10.1145/3564606

Deep neural networks (DNNs) have become key solutions in the natural language processing (NLP) domain. However, the existing accelerators customized for their narrow target models cannot support diverse NLP models. Therefore, naively running complex NLP ...

research-article

Open Access

Occam: Optimal Data Reuse for Convolutional Neural Networks

Article No.: 12, pp 1–25, https://doi.org/10.1145/3566052

Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. While CNNs are highly amenable to prefetching ...

research-article

Open Access

FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations

Article No.: 13, pp 1–26, https://doi.org/10.1145/3565885

With the rapid development of cloud computing, numerous cloud services, containers, and virtual machines have been bringing tremendous demands on high-performance memory resources to modern data centers. Heterogeneous memory, especially the newly released ...

research-article

Open Access

RegCPython: A Register-based Python Interpreter for Better Performance

Article No.: 14, pp 1–25, https://doi.org/10.1145/3568973

Interpreters are widely used in the implementation of many programming languages, such as Python, Perl, and Java. Even though various JIT compilers emerge in an endless stream, interpretation efficiency still plays a critical role in program performance. ...

research-article

Open Access

SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V

Article No.: 15, pp 1–26, https://doi.org/10.1145/3566053

In modern processors, speculative execution has significantly improved the performance of processors, but it has also introduced speculative execution vulnerabilities. Recent defenses are based on the delayed execution to block various speculative side ...

research-article

Open Access

Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration

Article No.: 16, pp 1–26, https://doi.org/10.1145/3566054

This article presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine ...

research-article

Open Access

XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments

Article No.: 17, pp 1–25, https://doi.org/10.1145/3568956

Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in ...

research-article

Open Access

YaConv: Convolution with Low Cache Footprint

Article No.: 18, pp 1–18, https://doi.org/10.1145/3570305

This article introduces YaConv, a new algorithm to compute convolution using GEMM microkernels from a Basic Linear Algebra Subprograms library that is efficient for multiple CPU architectures. Previous approaches either create a copy of each image element ...

research-article

Open Access

Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy

Article No.: 19, pp 1–25, https://doi.org/10.1145/3570304

Over the years, processor throughput has steadily increased. However, the memory throughput has not increased at the same rate, which has led to the memory wall problem in turn increasing the gap between effective and theoretical peak processor ...

ACM Transactions on Architecture and Code Optimization
