Compiler-Assisted Data Streaming for Regular Code Structures

Nuno Neves, Member, IEEE, Pedro Tomás, Senior Member, IEEE, and Nuno Roma, Senior Member, IEEE

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Rua Alves Redol, 9, 1000-029 Lisboa - Portugal
Email: {nuno.neves, pedro.tomas, nuno.roma}@inesc-id.pt

Abstract—The performance of modern processors is often limited by execution stalls resulting from long memory access latencies. Compile-time optimizations, deep cache hierarchies and prefetching mechanisms already provide significant performance gains, by performing memory accesses in parallel with computation. However, they are reaching a throughput improvement limit. Hence, new solutions that effectively exploit the memory access patterns to improve processing throughput are required. To achieve this objective, a new compiler-assisted data streaming method is proposed. It leverages static analysis and code transformations with an on-chip data streaming support as a viable alternative to prefetching mechanisms for regular code structures. Static analysis is used to identify and encode memory accesses with a dedicated representation. Then, a code transformation algorithm detaches data indexation and address calculation from computation, allowing for a significant code reduction. An on-chip data stream controller, attached to the L1 data cache, is used to autonomously generate memory accesses from the pattern representation and reorganize the data transfers in streams, with the aid of stream buffers. When compared with state-of-the-art prefetchers, the proposed solution provides up to 26% of code reduction, an IPC improvement of 2.4x, and an average performance improvement of 40%.

Index Terms—Compiler Static Analysis, Data Streaming, Regular Code Structures, Indirect Memory Accesses.

1 INTRODUCTION

The performance of Central Processing Units (CPUs) is often limited by the adverse impact of stalls due to long memory access latencies. Although caches promote an attenuation of such impact, the memory access latency is bound by the characteristics of the data access pattern and can only be fully mitigated by hiding it behind computation. From the vast set of approaches that have been considered, code optimization [1], [2], [3], [4], [5] and data (pre-)fetching [6], [7], [8], [9], [10], [11], [12], [13] have shown significant performance improvements.

At the hardware level, prefetching mechanisms monitor the cache miss stream, detecting the application memory access pattern [10], [9], [12], and prefetching data ahead of request. At the software level, compiler tools rely on static code analysis [1], [2], [3] to infer the memory access pattern and/or critical memory instructions [14]. This information allows the application of code transformations (e.g., data access reorganization [14], [5], code optimizations [4], and software prefetching [15], [16]) with the goal of hiding the memory access latency behind computation. The inferred access pattern can also be used at runtime for assisted execution [14] and assisted data prefetching [17], [13].

Existing prefetching methods are particularly successful when dealing with specific memory access issues, such as reduced data-locality [6], [7], [18], complex memory access patterns [19], [20], [21] or large datasets that do not fit in cache [17], [22]. In fact, prefetching technology has evolved to a point where the main concern is no longer the memory access pattern detection and prediction, but the timeliness and effectiveness of the procedure itself. This led to the emergence of new prefetchers [9], [10], [11], [23] that combine multiple hardware modules, with different data fetch granularities and prediction heuristics, across different cache levels. However, despite the improved throughput, resulting from a high accuracy and coverage of data access prediction, the added gains provided by each new generation of prefetchers are becoming limited.

To tackle such limitations, several works [24], [25], [26], [8], [21] exploit the fact that complex memory access patterns are most often generated by regular code structures that are compile-time detectable. These include common data indexation schemes based on affine loop transformations and on irregular data accesses, such as indirect memory accesses in the format $A[B[i]]$. Due to their deterministic representation, such patterns can be detected at compile-time and described by affine relations, and then dynamically resolved at runtime by on-chip data fetching modules (data streaming). As a result, it is possible to explicitly detach the addressing of the memory accesses from the processor to accelerate data acquisition and increase the processing throughput.

While some works have exploited such approaches to detect indirect memory accesses for stream prefetching [21], actual data streaming approaches [24], [25], [26], [8] have...
been limited to deterministic data access sequences (i.e., they lack support for irregular accesses). Moreover, they have only been deployed in application-specific accelerators [24], [25], [8] and usually require the programmer to hand-model the access pattern and to program it with custom application code.

This paper presents a new data streaming approach for compile-time detectable memory access patterns as an alternative to complex prefetching schemes for regular data access code structures. Hence, contrasting to state-of-the-art prefetchers, the proposed approach features: (1) a compilation tool that leverages static analysis to detect and explicitly describe memory access patterns (addressing the limitations of inaccurate memory access predictions); and (2) a dedicated controller to stream data (upon request) directly from the memory to the processor (addressing the performance saturation caused by prefetching timeliness). It operates as follows:

- At compile-time, the application memory access pattern is identified, described and encoded in a multi-level affine model, which also allows encoding data dependencies between accesses. The tool is able to infer deterministic access patterns and indirect memory accesses from regular code structures. As a result of this explicit representation of memory access patterns, their corresponding indexation and address calculation (by the CPU) becomes redundant. Accordingly, the tool performs a code transformation pass that replaces the subscript and indexation of each encoded load instruction with a stream reference (represented by a pointer). This reduces the number of instructions per loop iteration, accelerating the execution of the code.
- At runtime, a Data Stream Controller (DSC), collocated with the L1 data cache, generates memory accesses from the encoded representation and reorganizes data in streams. Hence, the DSC becomes responsible for fetching and buffering streams and for serving them (upon request) to the processor. In such circumstances, the L1 data cache is bypassed for data stream acquisition and access, avoiding cache pollution and early evictions resulting from poor data fetch timeliness.

The proposed compilation tool was integrated in the LLVM compiler and deployed as a Clang front-end tool. The DSC was implemented on the Gem5 simulator [27]. The resulting data streaming solution outperforms a typical stride prefetcher, attaining speedups as high as 1.9x. Furthermore, the implicit set of code transformations reduces the code size as much as 26%, leading to a combined performance improvement as high as 2.6x. As a result, the combination of the proposed techniques outperforms state-of-the-art prefetchers by 40% (on average).

### 2 Background and Motivation

Just as caches, the simplest and most common prefetching solutions exploit the spatial locality of memory accesses. Sequential prefetchers usually anticipate the loading of data in the cache lines based on the most recently accessed address ($y_{current}$). Being this a hardly efficient approach, stride-based prefetchers analyze individual accesses and calculate the distance between consecutive addresses (stride), to predict the sequence of future accesses (up to a given degree):

$$ y(\text{degree}) = y_{current} + \text{degree} \times \text{stride} $$

#### Spatial Prefetching

Some of these approaches also adopt intelligent prediction tables, indexed by the Program Counter (PC) or by the memory address, to infer more complex patterns and increase prefetching coverage. As an example, Shevgoor et al. [19] introduced the Variable Length Delta Prefetcher (VLDP), which maintains multiple prediction tables to store delta histories between subsequent cache line misses within physical pages. Ishii et al. [9] proposed the AMPM stride prefetcher to identify hot zones in memory. It uses a dynamically constructed bitmap to infer strided patterns in the access stream. Somogyi et al. [22] proposed the SMS prefetcher to identify code-correlated spatial access patterns and stream predicted blocks to the cache. Despite their attained coverage, these approaches assume that memory accesses maintain a regular behaviour over time, struggling to deal with highly complex and non-sequential patterns.

#### Correlation Prefetching

Modern prefetchers overcome such limitations by exploiting structural and temporal correlation heuristics, attaining high prediction accuracies and data coverage. For example, [12] proposes a context-based memory prefetcher that approximates spatial and temporal locality using reinforced learning. Recently, Bakhsalipour et al. [7] proposed the Bingo prefetcher, which associates observed spatial patterns to both short and long events to improve prefetching coverage and accuracy.

Yu et al. [21] proposed IMP, targeting indirect memory accesses. The detection mechanism correlates the values obtained by an index stream with subsequent cache misses to infer indirection. The IMP provided a breakthrough in this domain (which previously required the use of software prefetching approaches [15], [16]).
Offset Prefetching: Current prefetching methods have attained such a precision that the focus has shifted from the detection and prediction of indexing patterns to the timeliness of the prefetching procedure [11]. Michaud [10] proposed the BO prefetcher, implementing a selection mechanism that dynamically sets the prefetching offset depending on application behavior.

Despite their successful approach, the performance gain for each new generation of prefetchers has saturated (see Fig. 1), particularly on highly regular code structures (see Fig. 2), which dominate the majority of applications. In fact, the regularity and deterministic nature of most of these patterns can be detected by simple prefetching schemes.

Structural Representation and Data Streaming: By exploiting such deterministic nature, recent studies have shown that it is possible to obtain accurate representations of complex and deterministic access patterns to substantially accelerate data acquisition in application-specific domains. Hussain [26] proposed an Advanced Pattern-based Memory Controller (APMC) that supports up to 3D regular data-fetcheding mechanisms, such as scatter-gather and strided accesses with programmable tiling. The Hotstream framework [24] adopts a programmable approach that eases the description of regular data-patterns. A dynamic descriptor graph specification was proposed in [8] to encode arbitrarily complex (but deterministic) data-patterns. However, such strategies have been limited to regular data patterns and, since no compiler support is given, they require a manual encoding of the pattern.

Compiler-Assisted Methods: Notwithstanding, alternatives have been considered based on compiler-aided approaches. Guo et al. [13] utilize compiler information to reduce prefetch-related energy consumption by filtering prefetching requests with very small strides and prefetching only selected memory accesses identified by the compiler, and by applying different prefetching schemes depending on the predicted memory access patterns. Ebrahimi et al. [17] target pointer-based applications with compiler-guided prefetch filtering to inform the hardware about which pointer addresses to prefetch.

Several other compiler tools have also emerged in the scope of code static analysis. In particular, memory access tracing [28], [29], complexity [30] and polyhedral [1], [2] analysis or memory access profiling [3] have been used to analyze and optimize data indexation [4], [5].

Opportunity: To overcome the ever smaller gains associated to each new prefetcher generation for regular code structures, and surpass the currently observed performance plateau, alternative solutions are necessary. Hence, by observing that the memory access pattern in such applications is often known at compile-time (as it can be observed in Fig. 2 by the percentage of affine data patterns), there is an opportunity to explicitly extract and generate runtime data streams (even when these depend on the runtime value of one or more variables).

Proposed Solution: Based on this key idea, the proposed approach exploits data streaming schemes as a viable alternative to predictive prefetchers for data patterns generated by regular code structures. Instead of solely providing hints to prefetching hardware such as other compiler-assisted methods, it takes a step further by explicitly exposing and extracting the memory access pattern (directly from the source code) to provide support for data streaming. To do so, the proposed approach performs static analysis to extract the memory access pattern and to encode it with a dedicated representation, which is only fully decoded at runtime, when the value of depending variables is known. This mechanism is combined with a code transformation step that leverages the explicit detachment of memory accesses from computation, providing code reductions and accelerating the execution.

3 Compiler-Assisted Streaming

The block diagram of the proposed compiler-assisted data streaming mechanism is depicted in Fig. 3. It comprises a compile-time tool (see Section 4) and a data-streaming module, colocated with the L1 data cache (see Section 4.5). The compile-time tool was integrated in the LLVM framework to perform static analysis over an annotated region of code. It starts by parsing the Clang Abstract Syntax Tree (AST) to identify the context of each data access in the region of interest (indicated in the annotation directive, as in Fig. 4.A). The context of each access is translated to a Context Representation Language (CRL) that gathers all the information regarding each access (see Fig. 4.C and Section 4.3). Such information includes all the dependencies for the address generation, including nested loop context (providing temporal information), indexing ranges and data dimensionality (for address calculation) and data-dependent access hierarchies (for indirect memory access representation).

A code transformation mechanism is also proposed to convert each extracted memory access into a data stream access (see Fig. 4.D and Section 4.5). This is possible because the memory addressing sequence is fully and exactly encoded, making the corresponding application code redundant. Such a transformation results in an explicit detachment of the memory access generation from the computational operations, and a consequent reduction in the number of instructions per loop.

The extracted data-pattern is encoded with a Context Descriptor (see Fig. 4.E and Section 4.4) and embedded in the application code. At runtime, the Context Descriptor is loaded into a dedicated DSC hardware module (see Fig. 3.
and Section 5), collocated with the L1 data cache. The DSC autonomously handles the stream generation and data fetching procedure. Each generated stream is stored in internal buffers and it is subsequently read by the core and/or used by the DSC to generate data-dependent streams.

Hence, the key advantage of the proposed scheme arises from not requiring any predictive schemes and access monitoring for stride detection. Instead, it relies on a static analysis performed at compile-time to encode memory access patterns in Context Descriptors, ensuring an exact coverage of the memory access sequence and avoiding redundant and over-prefetching scenarios that occur in predictive approaches. Moreover, its data stream buffering capabilities minimize premature cache evictions of prefetched data, resulting in an implicit increase of the timeliness of the acquired data. Hence, it results in a two-fold advantage over conventional prefetchers: (1) it accelerates the execution of acquired data. Hence, it results in a two-fold advantage over conventional prefetchers; and (2) it minimizes latency in the memory access stream.

4 ACCESS PATTERN DETECTION

The proposed compile-time method for memory access pattern extraction and code transformation was built on the Clang LibTooling library [32]. This C++ interface provides full control over the compiler’s front-end resources, as described in the following sections.

4.1 Compiler Module Overview

The compile-time method is divided in five steps, as depicted in Fig. 4: i) region extraction; ii) code translation to AST; iii) AST static analysis and translation to CRL; iv) AST to data streaming transformation; and v) Context Descriptor generation and code injection.

The region extraction phase was devised to be as simple as possible, requiring a programmer effort no higher than that of adopting other mainstream annotation schemes (such as the OpenMP library). It makes use of the pragma handling routines from the LibTooling library, therefore requiring any region-of-interest to be delimited by the directive (see Fig. 4.A):

```
#pragma stream-context (var:size[:size], ...)
```

It allows to configure and make the tool aware of the targeted array variables (and their size, per dimension) for data stream description. Moreover, such a scheme ensures that only relevant accesses are considered by the tool, avoiding the blind encoding of bad candidates for data streaming (e.g., an array that is often reused and is small enough to fit in the L1 cache is not a good candidate for streaming).

The application code is then transformed in a Clang translation unit, which is passed to the front-end tools to generate the corresponding AST (see Fig. 4.B bottom). At this point, the compilation flow slightly diverges from the typical compiler, which would proceed to generate the LLVM Intermediate Representation (IR) code.

Instead, the tool starts by generating its own internal memory access high-level representation, through a dedicated Context Representation Language (CRL), as detailed in Section 4.3 (see Fig. 4.C). This is achieved through a direct translation of the source code region-of-interest by parsing the Clang AST with a typical depth-first tree analysis. The translation mechanism is capable of inferring and representing deterministic and indirect array-based memory access sequences, by detecting affine relations in the regular access structures that result from typical for-loop coding schemes for array indexing.

After parsing the initial AST, a transformation pass modifies the AST subtree of each data access, to generate stream
accesses instead. This is done by creating an array with a stream reference per extracted access, and by transforming the n-dimensional array indexation \((\text{id}_x)\) with a stream access (see Fig. 4.D), i.e.:

\[
<\text{array}_\text{name}>[\text{id}_x0]..[\text{id}_xN] \rightarrow *\text{stream}_\text{name}
\]

The address sequence of each memory access in the generated CRL is then translated to a low-level Context Descriptor and tagged with a corresponding stream address. This translation is performed by extracting the access pattern from the CRL and by encoding the necessary parameters to calculate the address sequence of each memory access (see Section 4.4 and Fig. 4E).

Finally, the Context Descriptor is embedded in the original code by injecting a set of inline store instructions to send the descriptor data to the DSC, through a memory mapped interface. This allows sending information to the DSC that can only be obtained in runtime, such as the base address of the described array variables (see Section 5). From this point on, the control is passed to the LLVM compiler and the typical compilation flow is resumed.

### 4.2 Memory Access Modeling

The performed analysis is based on a formal mathematical model that captures the tagged deterministic memory access patterns. The key idea is to provide an n-dimensional \((n-D)\) affine representation of the address sequence.

While existing predictive prefetchers make use of runtime-detected strides to calculate the sequence of addresses (with a unidimensional model based on near-temporal and near-spatial locality - see Eq. 1), the considered model aims at describing the exact sequence of addresses. It works by capturing nested loop-based indexation and loop-data-dependent (indirect) index dynamic ranges, where current models fall short.

Accordingly, the following model to describe the address sequence for a given memory access is proposed:

\[
y(\text{access}) = y_{\text{base}} + \sum_{k=0}^{\text{dim}_y} x_k \times \text{stride}_k
\]

where each stream access \(y(\text{access})\) is described as the sum of a base address value \(y_{\text{base}}\) with \(\text{dim}_y\) pairs of increment variables (or indexes) \(x_k\) and \(\text{stride}_k\) multiplication factors. Each increment variable \(x_k\) is represented by an integer range, with limits \(a_k\) and \(b_k\) (see Fig. 5).

While the model in Eq. 2 already represents a broad range of patterns, further complexity can still be achieved by combining multiple functions. Such combinations can be performed by determining the base address and/or the upper and lower bounds of each increment variable with another affine function (compare index ranges in Fig. 5).

### 4.3 Context Representation Language

The Context Representation Language (CRL) was specifically designed to gather and represent the information required to calculate the address sequence of a given memory access (or its context). The language is built on a basic instruction set and three container structures: context, loop and access (see Fig. 6). Each container is composed of a unique identification, an header with configuration fields specific to each container type, and a body with a set of instructions. A CRL translation example is depicted in Fig. 4C.

The context container serves as the entry point for the encoding. It encapsulates all the information concerning the indexing variables, base addresses, dimensionality and size of the array variables within a nested loop. This allows strides defined in different dimensions to be inferred and multiplied by indexing variables, in order to calculate the sequence of memory addresses based on the model in Eq. 2.

All memory accesses to a given array variable (in the same context) are encapsulated in an access container. Each access is represented by a fetch instruction preceded by a set of arithmetic instructions that calculate the address (see Fig. 4), based on (see Fig. 6): i) a dimension vector (.dim), to store the size of each dimension of the array; ii) the size of the array data type (.type); and iii) the array base address (.base), indicated by an address reference.

Each loop in the target region of code is encoded by a loop container (see Fig. 6). The container body encapsulates: call control instructions to the lower level loop and access containers; any necessary arithmetic instructions; and a container header that stores the range of the loop iteration variable (typically used as an indexing variable for array.

<table>
<thead>
<tr>
<th>A. Source Code Snippet</th>
</tr>
</thead>
<tbody>
<tr>
<td>for (i = 0) (i &lt; \	ext{PB}_N) (i++)) {</td>
</tr>
<tr>
<td>(x_i \in [0, \text{PB}_N])</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>...</td>
</tr>
<tr>
<td>...</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>B. Affine Representation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Index Ranges</td>
</tr>
<tr>
<td>(x_i \in [0, \text{PB}_N])</td>
</tr>
<tr>
<td>(x_i \in [0, \text{PB}_N])</td>
</tr>
<tr>
<td>(x_i \in [0, \text{PB}_N])</td>
</tr>
<tr>
<td>(x_i \in [0, \text{PB}_N])</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Operand Types</th>
</tr>
</thead>
<tbody>
<tr>
<td>Constant ($\text{name})</td>
</tr>
<tr>
<td>Literal (&lt;\text{integer}&gt;)</td>
</tr>
<tr>
<td>Variable (%0, \ldots, %n)</td>
</tr>
<tr>
<td>Index (%\text{id}_x0, \ldots, %\text{id}_xN)</td>
</tr>
<tr>
<td>Stride (\cdot \text{stride}_1, \ldots, \cdot \text{stride}_n)</td>
</tr>
<tr>
<td>Loop ref. (\langle\text{loop}&gt;, \langle\text{num}\rangle)</td>
</tr>
<tr>
<td>Access ref. (\langle\text{dead}, \langle\text{index}&gt;\rangle, \cdot \text{a&lt;k}&gt;)</td>
</tr>
<tr>
<td>Base address &amp;\text{name}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Arithmetic Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>(&lt;\text{opCode}&gt; = \langle\text{opA}, \langle\text{opB}&gt;)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Assignment Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>.Field (&lt;\text{opCode}&gt; = \langle\text{opA}, \langle\text{opB}&gt;)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Control Instructions</th>
</tr>
</thead>
<tbody>
<tr>
<td>call \langle\text{loop/access}&gt;</td>
</tr>
<tr>
<td>fetch \langle\text{index}&gt;</td>
</tr>
</tbody>
</table>

Fig. 5. Affine representation of the memory accesses in a code snippet of the trisolv benchmark, from the Polybench [31] suite.

Fig. 6. CRL reference sheet.
structures. The latter includes: i) the variable initialization (.init); ii) the iteration limit value (.limit); and iii) the iteration step (.step). With this encoding, when the variable is used in a subsequent instruction (e.g., for address calculation), it is coded as a reference to the loop container \((c_{i,j}, l_{j})\).

The CRL provides a simple instruction set (see Fig. 6), with three instruction types (arithmetic, assignment and control) and up to three operands. Arithmetic instructions are encoded in the format \(oR = opCode oA,oB\), where \(opCode\) specifies an integer arithmetic operation (add, sub, mul, etc.) over operands \(oA\) and \(oB\), and destination \(oR\).

Assignment instructions allow a composed configuration of container fields. They are encoded with one or two source operands and an optional operation code (representing an integer arithmetic operation or an inequality binary operator), in the format \(.field \{opCode\}\). They are encoded in the format \(oR = opCode oA,oB\), where \(.field\) is the container field. This provides an encoding for the initialization, condition and step of each iteration variable, supporting the assignment of literals, variables, conditions and instructions, to each field.

The workflow between containers and memory access operations is encoded with control instructions, in the formats call <ref> and fetch %idx<n>, where <ref> is a reference to a loop container \((c_{i,j}, l_{j})\) or an access container \((c_{i,j}, l_{j}, a_{k})\), and %idx<n> is an index.

### 4.4 Context Descriptor Specification

After being generated, the CRL is parsed and encoded in a low-level memory access Context Descriptor, which allows resolving the addresses calculation based on the variables defined in Eq. 2.

Fig. 7 presents the designed Context Descriptor specification, with support for multi-dimensional data patterns. It is composed of a top-level Context Header, which indicates the number (acc) and memory locations (a_idacc) of a set of Access Descriptors, each describing one memory access pattern. It also contains a reference to a subsequent Context Header (see below), allowing multiple contexts to be described and solved in sequence (mirroring the original loop order).

The Access Descriptor defines a data access pattern by means of: i) an header tuple, containing a stream pointer (stream), the base address (base), the descriptor dimensionality (dimz), and the number of modifier chains (mod - see below); and ii) pairs of xsize and stride fields, representing the \(x_{k}\) range (in number of iterations) and stride\(_{k}\), respectively (see Eq. 2). These pairs have an implicit hierarchy (the rightmost pair has the higher position), where each pair is completely iterated (once) for each instance of the pair in the upper position of the hierarchy level.

**Context Header**

- [acc, next] \(\Rightarrow [a_{id}, \ldots, a_{id}]\)

**Access Descriptor**

- [stream, base, dim, mod] \(\Rightarrow [xsize_{0}, stride_{0}, \ldots, xsize_{n}, stride_{n}, \ldots]\)

- [target\(_{k}\), dim\(_{k}\)] \(\Rightarrow [x_{k}, stride_{k}, \ldots, x_{k}, stride_{k}, \ldots]\)

- [field modifier descriptor]

- [modifier chain]

- [data, dim] \(\Rightarrow [target\(_{k}\), a_{id}, \ldots, target\(_{k}\), a_{id}, \ldots]\)

- [indirection descriptor]

- [data/target\(_{k}\), dim\(_{k}\), \ldots]

Fig. 7. Context Descriptor specification.

Since each \(x_{k}\) range can be delimited by affine functions, the Access Descriptor provides an optional modifier chain, composed of multiple descriptors, in which the header indicates the target field (target\(_{mod}\)) to be modified (see Fig. 7). The modifier chain applies a field modifier descriptor to the target field every time the corresponding pair is fully iterated. As a result, the Access Descriptor is only fully iterated after each modifier descriptor completes its iteration.

Fig. 8 depicts the resolution of an Access Descriptor encoding of a \(N \times N\) triangular matrix access pattern. The base descriptor is composed of two \([xsize, \text{stride}]\) pairs, where the first \((1, 1)\) describes contiguous row accesses (initially with size 1) and the second \((N, N)\) applies a stride to skip to the next row (\(N\) times). The descriptor relies on a modifier chain (with a field modifier \([\text{target}0:\{N, N\}]\)) that adds its own stride value to the \(xsize\) field of the first pair (in the base descriptor) each time it is completely iterated. As a result, each time the base descriptor iterates to a new row, the number of contiguous accesses is increased by 1, hence producing a triangular scan pattern.

The modifier chain can also represent data-dependent patterns (in the format of indirect memory accesses - e.g., \(A[B[i]]\)) by encoding data-indexation dependencies between Access Descriptors, through an indirection descriptor. It is composed of pairs of target fields (target\(_{mod}\)) and descriptor identifications (a_id\(_{mod}\)) (see Fig. 7). The indirection descriptor applies the encoded data dependencies similarly to a field modifier descriptor. However, it is the actual data from the stream generated by the a_id\(_{mod}\) descriptor that is used to modify the target field in the Access Descriptor. An encoding example of indirect accesses is depicted in Fig. 9.

### 4.5 Automatic Code Transformation

By detaching memory addressing and data accesses from the original code, array indexing instructions become re-
5 Runtime Data Streaming

To ensure the stream generation and data fetching, the DSC (see Fig. 10) is implemented as a hardware module that works in parallel with the L1 data cache.

To operate, the DSC starts by capturing the set of injected stores that send the Context Descriptor through a memory mapped interface, saving it in a local Descriptor Memory. Afterwards, the DSC initiates the generation of the addresses for each data stream, by fetching and buffering the corresponding data. Such data is immediately served to the core, by answering to requests to the stream references encoded in the Context Descriptor.

5.1 Stream Communication and Interface

From the core perspective, a data stream retrieval is simply represented by a read operation from a specific pointer. However, the stream is explicitly represented in the DSC by a structural address space (supported by buffering structures) that references data in a temporal sequence, detached from the physical address of each data block.

Accordingly, to provide the cores with an effective interface to transparently perform data stream accesses, the DSC offers a programmable interface that is memory-mapped to the core’s memory access channel (see Fig. 10). When a request to a data stream reference is performed (corresponding to an Access Descriptor), it is redirected to the DSC (instead of the cache) and directly served with data from the corresponding stream.

5.2 Stream Address Generation

The generation state of each Access Descriptor is kept in a dedicated Descriptor Table (see Fig. 10). This table is managed by a Context Controller that generates the addresses of a given context by solving the descriptor’s Context Headers.

The memory addresses are calculated in a dedicated Address Generation Unit (AGU). It comprises two parallel functional blocks, each composed of an adder and a register set, responsible for iterating one \{xsize, stride\} pair. The stride control block successively adds the stride fields of the descriptor to its base address, while the xsize control block counts the iterations of each pair. When a base descriptor is fully iterated, the AGU applies its modifier chain (if available) and resets the descriptor with new field values.

The generated addresses are loaded into the Stream Buffer that was assigned to that stream. Once the corresponding data is fetched from memory, the generated address entries are replaced with the data, and later sent to the core (upon request). In the presence of data dependencies between streams (indirect memory accesses), the necessary data is read by the AGU, to iterate the dependent descriptor. The data dependence path is sent to the AGU (by the Context Controller), according to the Access Descriptor modifier chain.

5.3 Memory Access

In the CPU, each application data structure is usually allocated over a contiguous virtual memory address range. This contrasts with the operation of the DSC, which operates on the physical memory space. While typical prefetchers avoid this issue by stopping the address generation, waiting for the CPU to resynchronize the physical address offset, this is not possible with the introduced detachment of the address generation. Hence, the AGU is equipped with page crossing detection logic and, upon detection, it consults the CPU’s Translation Lookaside Buffer (TLB) to obtain the page offset for the new address.

Moreover, whenever a new address is generated, it is passed to the Request Filter (see Fig. 10). This module maintains the last data block that was previously fetched for each stream (in a set of cacheline-sized registers) and serves the generated address with the corresponding data, by filling the matching entry in the Stream Buffers. Then, whenever a new address crosses the available data block address range, the Request Filter autonomously issues a new memory access for the new line. The issued requests are inserted in a Request Queue (also implemented by a buffer), and subsequently sent to the memory hierarchy.
Polyhedral Loop Computation. Nested loop computations in the affine domain are all-around. Their deterministic nature is particularly suited for memory access pattern description and data streaming. A subset of kernels was selected from the C-Polybench [31] suite, comprising different combinations of pattern complexity, data reutilization and dataset dimensionality.

Sparse Linear Algebra. Sparse linear algebra algorithms usually represent data structures in Compressed Sparse Row (CSR) format, requiring operations between sparse and dense arrays to be implemented through indirection (i.e., $A[B[i]]$). Hence, the sparse matrix-vector multiplication kernel and the symmetric Gauss-Seidel method (two consecutive sparse triangular solvers) were adapted from the HPCG [33] benchmark. Since the CSR representation is also used in graph analytics, to store neighbouring vertices in sparse arrays, the PageRank [37] algorithm was also considered to evaluate this class of applications.

Application-Specific. A particular demanding subset of scientific applications were also used in this evaluation, namely: the SRAD diffusion method (ultrasonic and radar imaging), which exploits data access indirection; the PathFinder algorithm (search the shortest path of a 2D grid), which uses large data sets and high data reutilization; and the LBM algorithm (incompressible fluids simulation in 3D), characterized by a high dimensionality and computational intensity. The SRAD and PathFinder applications are part of the Rodinia [36] benchmark suite and the LBM algorithm was selected from SPEC CPU 2017. To support the proposed compilation tool, all benchmarks were modified by including the annotation scheme described in Section 4.1. Accordingly, the main computational functions and kernels of each benchmark were fully annotated for acceleration.

6.2 Reference Prefetching Setups
To validate the proposed data streaming mechanism, it was compared against four state-of-the-art runtime prefetchers (see Table 3 for the configuration parameters), namely:

Baseline (BASE): represents the most established prefetching scheme, i.e., a typical stride prefetcher, comprising a stride/confidence table indexed by the PC.

AMPM Prefetcher [9]: combines a stride prefetcher at the L1 cache (for fine-grained prefetching), with an L2 hardware module that uses a memory access map and pattern matching scheme to detect all possible strides in parallel.

Best-Offset (BO) Prefetcher [10]: relies on a stride prefetcher at the L1 cache, but introduces a different module at
the L2 cache (a generalization of next-line prefetching), that dynamically sets the prefetching offset depending on the application behavior, while accounting for prefetch timeliness.

**Indirect Memory Prefetcher (IMP)** [21]: combines a stream prefetcher at the L1 cache with an indirect pattern detector (IPD) that computes coefficients between memory accesses to detect indirection between address pairs.

### 7 Evaluation

The proposed compiler-assisted data streaming solution was evaluated by first studying the impact of code transformation, followed by a performance evaluation.

#### 7.1 Pattern Description and Code Reduction

Fig. 11 depicts the observed code transformation and memory access encoding results. Reductions as high as 26% in the code size, and up to 28% in the number of committed instructions, are observed for the considered benchmarks. Such a reduction is bound by both the number of converted memory loads and by the complexity of their address calculation. For instance, only 40% of the loads in the cov benchmark were converted to streams; however, since the majority of loads represent matrix accesses, the removal of address calculation operations results in more than 13% of code size reduction. On the other hand, the seidel benchmark is characterized by a reduced percentage of address calculation instructions, resulting in only a 5% reduction.

Benchmarks with indirect memory accesses (spmv, symgs, rank and srad) take the most advantage of the code transformations. The conversion of indirect memory accesses (in the A[B[i]] format) to single pointer references (*stream) eliminates both indexing and memory access instructions, resulting in a significant code reduction (up to 23%). Naturally, such a reduction imposes a larger descriptor size (see Fig. 11), due to the necessary inter-stream dependency encoding. For the srad, despite a conversion of 90% of loads to data streams, there is not a major reduction in the code size and committed instructions. This is mainly due to the computational complexity of srad, with a high percentage of code performing computing operations. Similarly, lbm is also characterized by a high computing intensity. However, it is still possible to attain a high reduction in code size (due to the elimination of indexation code for high dimensional accesses).

#### 7.2 Memory Access Optimization

Fig. 12 shows the impact of each considered prefetching method in the L1 data cache behaviour (evaluated using the hit-rate and the misses-per-instruction (MPKI) metrics). In the evaluation of the proposed approach (DSC), both the data cache and stream buffer hit-rates are considered. For most polyhedral applications, the BASE stride prefetcher already provides high hit-rates, due to the regularity of the memory accesses. Nonetheless, the AMPM and BO are still able to improve the cache performance, due to an L2 high prefetching coverage. However, this is not the case when the access pattern complexity increases, as it can be observed in cov and mvt, where the datasets are large and require data reutilization; and in spmv, symgs and rank, where the memory accesses are irregular due to indirection.

The proposed DSC takes advantage over the other prefetching methods thanks to its memory access generation procedure. Since the data stream acquisition initiates before the execution of the corresponding requests and since data is not eliminated until it is read by the processor (contrarily to eventual evictions from cache caused by prefetched data in the other setups), data is promptly available ahead of time until it is needed. Moreover, the ability to exactly describe the sequence of addresses mitigates data locality issues. This is highlighted by the observed average hit rate (and corresponding MPKI improvement - see Fig. 12.B) of 95% with the DSC, when compared to the average 79% and 80% hit-rates observed in the AMPM and BO setups, respectively.

---

**Table 3**

<table>
<thead>
<tr>
<th>Setup</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>L1 Stride Prefetcher; 16x4-entry PC table</td>
</tr>
<tr>
<td></td>
<td>Confidence threshold: 4; Prefetch degree: 16</td>
</tr>
<tr>
<td>AMPM</td>
<td>L1 stride prefetcher (Baseline)</td>
</tr>
<tr>
<td></td>
<td>L2 AMPM prefetcher [9]</td>
</tr>
<tr>
<td></td>
<td>256-entry access map; 5.2KB storage</td>
</tr>
<tr>
<td>Bo</td>
<td>L1 stride prefetcher (Baseline)</td>
</tr>
<tr>
<td></td>
<td>L2 Best-Offset prefetcher [10]</td>
</tr>
<tr>
<td></td>
<td>256-entry RR table; 4KB storage</td>
</tr>
<tr>
<td>Imp</td>
<td>L1 IMP prefetcher [21]; 16-entry PI table</td>
</tr>
<tr>
<td></td>
<td>4-entry IPD table; Max. prefetch degree: 16</td>
</tr>
<tr>
<td>DSC (proposed)</td>
<td>16-entry Descriptor Table</td>
</tr>
<tr>
<td></td>
<td>16x32-byte Stream Buffers</td>
</tr>
<tr>
<td></td>
<td>1KB Descriptor Memory</td>
</tr>
</tbody>
</table>

Fig. 11. Context descriptor size; percentage of streamed accesses and code reduction; and resulting impact in runtime-committed instructions.

Fig. 12. L1 cache behaviour comparison in (A) hit-rate and (B) misses per kilo-instruction (MPKI).

---

**LEGEND**

- **BO** DSC
- **AMP** Base
- **BASE** L1 IMP prefetcher [21]; 16-entry PT table
- **DSC** 16-entry Descriptor Table
- **BASE** L1 stride prefetcher (Baseline)
- **DSC** 16x32-byte Stream Buffers
- **BASE** 1KB Descriptor Memory

---

The final version of this article has been published in IEEE Transactions on Computers (doi: 10.1109/TC.2020.2990302)
The acceleration due to code reduction is particularly evident in the polyhedral benchmarks (see Fig. 14). Since the memory access patterns are easily detected by the BASE stride prefetcher, the improvements provided by AMPM and BO are limited to their ability to move data to the L2 cache in a more timely manner (as it occurs in mvt due to its poor data locality). However, due to the elimination of redundant array indexation (accounting for about 40% of the achieved speedup - see Fig. 14B), the DSC is capable of further boosting the performance up to 2.63x over BASE.

The advantages of data streaming are also reflected in the overall system performance (see Fig. 14.A). This is especially evident when the application is characterized by large data sets, such as in 2mm and cov, which operate over multiple and large matrices. While the AMPM and BO prefachers can easily detect the patterns and feed the L2 cache, the data set size inherently results in a large amount of L1 evictions. However, the DSC is capable of fetching and buffering data streams ahead of time, resulting in 1.5x and 2x speedups over the other setups, respectively in 2mm and cov.

Performance gains are also evident when reusing data structures larger than the L1 cache capacity (as in mvt, lbme and path, where a large dense matrix is read multiple times). It can also be observed the gains resulting from the coarse data movement into the L2 cache by AMPM and BO, making it available for reutilization. However, the data acquisition timeliness of the DSC still provides a performance boost of 10% for lbm and path, when compared to the other prefetchers. In the case of mvt, the matrix is also accessed in transposed order, resulting in L1 data-locality-related issues. However, the combination of the pattern description and code reduction provided by the proposed data streaming mechanism results in 1.9x and 1.5x speedups (in mvt) over the AMPM and BO setups, respectively.

The results obtained with spmv, symgs and rank show the capability of the proposed data streaming mechanism to deal with indirect memory accesses. While the BO correlation heuristics provide visible performance gains when compared to the base stride and AMPM prefetchers, it is still limited by the irregularity present in the data accesses. However, the DSC is capable of producing the exact sequence of addresses (after indirection) ahead of time and without polluting the L1 with unnecessary data. Such an advantage (accounting for about 80% of the achieved speedup - see Fig. 14B), when combined with the performed code reductions, results in 45%/37%/60% performance increases for spmv/symgs/rank, when compared to the BO setup.

7.4 Indirect Memory Access Streaming

Despite the ability of the BO prefetcher to deal with data access irregularity caused by indirection, the proposed DSC was also compared with the IMP [21] prefetcher, since it represents a state-of-the-art in memory access indirection (see Fig. 15). To achieve a fair comparison, only the spmv, symgs, rank and srad benchmarks were considered, since only these include indirect memory accesses.

According to the obtained results, the IMP and the DSC show similar performance when comparing the impact in the L1 data cache behaviour (see Fig. 15.A). This result is consistent with the fact that, upon detecting an indirection, the behaviour of IMP [21] becomes similar to that of the DSC when streaming an indirect pattern. As a result, when not
considering the impact of code reduction in the proposed data streaming, the throughput gains of both approaches is similar, when compared to the BASE setup. However, when considering the additional impact of the introduced code reduction (representing an average 17% performance increase for the four benchmarks), the DSC provides an overall IPC improvement over IMP of up to 1.7x, and a consequent average performance gain of 30% (see Fig. 15.C).

Finally, the performance gains of the DSC are still visible when compared with a setup combining the IMP (L1) and the BO (L2) prefetchers (IMP+BO). Despite a 5% performance improvement over IMP, resulting from an improved L2 cache behaviour, the proposed mechanism is still capable of achieving a 1.24x speedup (see Fig. 15.C).

7.5 Discussion

7.5.1 Performance Gains

The considered AMPM [9], BO [10] and IMP [21] setups show significant performance improvements when compared to BASE. This is a direct result of their greater prefetching coverage and accuracy and, in the particular case of BO [10], of the timeliness of the prefetching procedure. However, when comparing the proposed DSC with such highly effective approaches, it is observed that the DSC still provides an increased memory access throughput (resulting from the exact data acquisition and implicit timeliness of the data streaming), matching and improving their performance. This gain is stretched through the acceleration that is introduced by the code reduction, providing further 20% improvements (see Fig 14.B) over predictive prefetching.

Moreover, by comparing the obtained results, and considering the performance wall that is currently observed in the offered gains of the newest generations of prefetchers, it is possible to predict that for regular code structures the proposed method would still be able to match and improve the performance over other recent prefetching methods, such as the Bingo prefetcher [7]. In fact, even against a mechanism with ideal accuracy and timeliness, the code reduction of the proposed method would still allow improvements.

7.5.2 Resource Overhead

Despite the performance gains offered by the proposed approach, it requires simpler hardware structures when compared to the other reference prefetchers. While the amount of storage for data streams (4KB stream buffering) and pattern description (1KB descriptor memory) is similar to the storage required by AMPM (5.2KB) and BO (4KB), the required logic complexity is much smaller. In particular, the DSC only requires 2 adders for the AGU and compare logic to detect cacheline crossing in the address generation procedure. In contrast, the AMPM prefetcher [9] requires a significant amount of logic to match up to 256 stride patterns to find prefetch candidates on each access. On the other hand, the BO prefetcher [10] requires 3 adders to compute the position of a cacheline inside a page, while the recent request table is accessed through a hash function. When compared to IMP [21], the DSC requires a larger amount of resources (IMP requires less than 1KB of storage and lower logic complexity). However, its architecture was solely designed for detecting indirect in the memory access stream and is not suited for other types of data patterns.

7.5.3 L1 Cache Bypass Limitations

In the particular case of the srad benchmark, the proposed approach has a similar performance to all the other considered setups. This is a consequence of the used data set, which is small enough to fit in the L1 cache. Since the data accesses that are performed by the DSC are currently done directly to the L2, it increases the access latency. While this impact is mitigated by the provided code reduction and data stream pre-acquisition, there is still room for improvement. Future implementations of the DSC will consider a snoop-like access to the L1 cache tags (i.e. without causing demand misses) to directly copy data from the L1 and speedup the data access. Such an approach can further improve the gains, specially in the presence of high L1 data reutilization.

8 Conclusions

A new compiler-assisted data streaming mechanism for regular code structures was proposed in this paper as an alternative to current predictive prefetching mechanisms. It is based on the notion that particular application memory access patterns can be explicitly extracted at compilation time and used to generate the corresponding data streams for the processor. To attain this objective, it relies on a compilation tool that performs static analysis over an annotated region of code to identify, describe and encode the memory access patterns, supporting both affine complex patterns and indirect memory accesses. The encoded memory accesses are then automatically converted to data stream accesses, ultimately resulting in a reduced number of instructions and accelerating the execution of the code. At runtime, a dedicated DSC is used to generate data streams from the encoded representation.

The obtained results show that the implemented compilation tools achieve significant code reductions and consequent IPC improvements in the processor. As a consequence, the proposed data streaming method showed to improve performance by up to 2.6x, when compared to a typical stride prefetcher. Moreover, it showed to outperform state-of-the-art prefetchers by 40% (on average).

Acknowledgments

This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under projects UIDB/50021/2020 and PTDC/EEI-HAC/30485/2017, and by funds from the European Union Horizon 2020 research and innovation programme under grant agreement No. 826647.

References

Nuno Roma received the Ph.D. degree in electrical and computer engineering from Instituto Superior Técnico (IST), Technical University of Lisbon, Portugal, in 2008. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering of IST and an Integrated Researcher of INESC-ID, working on High-Performance Computing Architectures and Systems (HPCAS). His research interests include specialized computer architectures for digital signal processing, parallel processing, and high-performance computing. He has contributed to more than 100 papers to journals and international conferences. Dr. Roma is a Senior Member of both the IEEE Circuits and Systems Society and ACM.