Technical Report RT/XX/2015

Virtues and Limitations of Commodity Hardware Transactional Memory

Nuno Diegues
INESC-ID/IST
nmld@tecnico.ulisboa.pt

Paolo Romano
INESC-ID/IST
romano@inesc-id.pt

Luis Rodrigues
INESC-ID/IST
ler@tecnico.ulisboa.pt

March 2015
Abstract

Over the last years Transactional Memory (TM) gained growing popularity as a simpler, attractive alternative to classic lock-based synchronization schemes. Recently, the TM landscape has been profoundly changed by the integration of Hardware TM (HTM) in Intel commodity processors, raising a number of questions on the future of TM.

We seek answers to these questions by conducting the largest study on TM to date, comparing different locking techniques, hardware and software TMs, as well as different combinations of these mechanisms, from the dual perspective of performance and power consumption.

Our study sheds a mix of light and shadows on currently available commodity HTM: on one hand, we identify workloads in which HTM clearly outperforms any alternative synchronization mechanism; on the other hand, we show that current HTM implementations suffer of restrictions that narrow the scope in which these can be more effective than state of the art software solutions. Thanks to the results of our study, we identify a number of compelling research problems in the areas of TM design, compilers and self-tuning.

Keywords: Empirical Study; Synchronization Techniques; Transactional Memory; Performance; Energy Efficiency
1. Introduction

The advent of multi-core architectures has brought concurrent programming to the forefront of software development. For many years, locking has represented the de-facto standard approach to synchronization in concurrent applications. However, the inherent complexity and error-proneness of fine-grained locking [35] has motivated intense research on alternative methodologies aimed at making parallel programming accessible to the mass of software developers.

Transactional Memory (TM) [30] is one of the most prominent proposals in this sense. With the TM abstraction, programmers are required only to identify which code blocks should run atomically, and not how concurrent accesses to shared state should be synchronized (as with locks). The TM is then responsible for guaranteeing correctness by aborting transactions that would generate unsafe histories.

Over the last decade a large body of TM research focused on software-based implementations (STM) (e.g. [15, 19]). Unlike hardware implementations, however, STM requires software instrumentation of read and write memory accesses to trace conflicts between concurrent transactions. This instrumentation can, in certain scenarios, introduce large overheads and hinder performance with respect to conventional fine-grained locking [6]. HTM support is thus desirable, but its absence from commodity processors caused most research to be evaluated solely on simulators (e.g. [28]) — the only notorious exception being the Rock processor [14], which was never commercialized. Recently, the maturing of TM research led to a breakthrough that changed drastically this scenario: two major market players, IBM and Intel, introduced HTM support in their latest processors [40, 4, 39], targeting, respectively, HPC and commodity systems. This represents a significant milestone for TM, mainly due to the predictable widespread availability of Intel Haswell processors, which bring HTM support to millions of systems ranging from high-end servers to common laptops.

The advent of HTM in commodity processors raises a number of questions concerning the future of TM and concurrent programming: how competitive are available HTMs when compared with state of the art STMs? Will the performance of HTM be sufficiently alluring to turn TM into a mainstream programming paradigm? What role will STM play now that HTM is so widely available? How limiting are the architectural restrictions of existing HTM designs?

In this paper we seek an answer to these questions by conducting the largest study on TM-based synchronization to date. We compare, from the twofold perspective of performance and energy-efficiency, a range of synchronization mechanisms: 6 lock based approaches with different granularities; 4 state of the art STMs; Intel TSX’s implementation of HTM; and 2 Hybrid TMs (HyTM) that use STM and HTM mechanisms in synergy. We study highly heterogeneous applications, encompassing 1) STAMP, a de-facto standard suite of benchmarks for TM, 2) Memcached [38], a popular in-memory caching system that was recently ported to use TM, and 3) concurrent data structures that are widely
used as building blocks of parallel applications (yet, hard to parallelize efficiently). The results of our study allow us to draw two main conclusions:

**Lights and shadows for HTM:** Approaches based on TSX yielded outstanding performance in workloads characterized by small transactions, such as concurrent data structures and Memcached, but only with two of the STAMP benchmarks. TSX performance is strongly dependent on the access patterns to L1 cache, and long running transactions can lead to frequent cache capacity exceptions and spurious aborts. When transaction-intensity is medium, TSX is only the best choice for a limited degree of parallelism, and it is generally better on the energy side than on the performance side. The impact of its hardware limitations are highlighted by several STAMP benchmarks that generate long transactions, and in which TSX is outperformed by both locking and STM solutions. On the other hand, TSX shines as a synchronization primitive for concurrent data structures, for which it is by far the best choice in all considered workloads, with speed-ups up to 3.3× over the best alternative scheme.

**STM is still competitive:** Our study also shows that STM is quite competitive as an all-around solution across benchmarks, workloads, and parallelism degrees. Although STM was initially proposed as a prototyping alternative to actual hardware implementations of TM, its evolution throughout a decade of intense research has resulted in several highly-optimized mechanisms, which achieve performance comparable to that of fine-grained locking. This does not mean that STMs embody a perfect solution; instead, this result highlights the current limitations of HTM support, which make of STM still the most robust solution to date.

Further, the results of our study unveil a number of critical issues related with HTM performance and allow for identifying several research problems, whose timely solution could significantly enhance the chances for HTM to turn into a mainstream paradigm for concurrent programming:

**HyTMs: a missed opportunity?** The outcome of our study for what concerns the efficiency of HyTMs, when employed in conjunction with Intel’s TSX, is rather grim. The mechanisms currently adopted to support the simultaneous coexistence of HTM and STM induce high overheads in terms of additional spurious aborts. Our study highlights that these costs make HyTM generally less efficient than solutions based purely on STM or TSX+locking. This motivates further research in the design of architectural supports (e.g., non-transactional memory accesses from transactions) capable of exploiting the potential synergies of HyTMs.

**Complexity of HTM tuning.** HTM performance can be significantly affected by the settings of several parameters and mechanisms. Without proper tuning, TSX suffer average throughput losses of 72% and of 89% in power consumption. Also, the optimal configuration of these parameters can vary significantly, depending on the characteristics of the workload. These findings urge for novel approaches capable of removing from the shoulders of programmers the burden of manually tuning HTM, by delegating this task to run-time or compiler based solutions.

**Relevance of selective instrumentation.** Both TSX and GCC library for STM trace every memory access performed within a transaction. We show that this can cause significant increases of the transaction footprint’s size, amplifying the instrumentation overheads in STM, and the chances of incurring in capacity exceptions in HTM. These results motivate research on cross-layer mechanisms operating at the compiler and at the hardware level, aimed to achieve selective instrumentation in a way that is both convenient for the programmer and efficiently implementable in hardware.

The rest of the paper is structured as follows. Section 2. discusses related work. Section 3. overviews...
the synchronization mechanisms considered in the study. Section 4. describes our methodology, after
which Section 5. presents preliminary experiments aimed at tuning properly TSX. We then present our
study in Sections 6.-7.. In Section 8. we identify several research questions suggested by the findings
of our study. Finally, Section 9. concludes the paper.

2. Related Work

Transactional Memory was initially proposed as an extension to multi-processors’ cache coherence
protocols [30]. Due to the difficulty of rapid prototyping in hardware environments, researchers re-
sorted to STMs to advance the state of the art [20, 19, 16, 17]. Simultaneously, hardware-based
implementations have also been proposed, whose designs were validated using simulators [28].

The concern for both performance and power consumption metrics has been only marginally explored
in the scope of TM, and mostly relying on simulation studies that did not target Intel’s architecture
(whose internals are only partially disclosed). In both [23, 22] the authors assess the behaviour
of different HTM implementations via simulation (the latter focusing on embedded systems). The
approach was also taken by [2], where the power consumption of one STM was studied via simulation.
More recently, [24] studied both power consumption and performance in a non-simulated environment.
Yet, this work considered a restricted set of synchronization alternatives focusing mainly on one STM.

As already mentioned, both Intel [40] and IBM [4, 39] have integrated HTM in their processors. IBM
processors target high performance computing infra-structures that are not expected to be used in
commodity systems. In this work, we focus on the former (by Intel), for which our results show
aspects and insights that were not highlighted by Intel’s paper [40]. Before the recent release of Intel
Haswell processors, researchers had already proposed some theoretical improvements to best-effort
HTMs [1, 33]. We integrate these mechanisms in our HTM-based runtimes, and evaluate them for the
first time using an actual HTM implementation.

Traditional lock-based synchronization techniques have been thoroughly studied throughout decades.
In [21], the authors show that the power consumption of locking primitives can be improved by
exploring a trade-off between processor deep sleeping states, frequency downsizing and busy waiting.
We highlight a recent work [12], which studied the impact in performance of different lock designs and
hardware architectures (without however considering TM).

Our work is also related to the body of literature on performance modeling of TMs, which have relied
on methodologies such as analytical modeling [13, 29], as well as machine learning [37]. These proposals
were applied to self-tune various TM parameters, and our results clearly indicate the potentiality and
importance of this line of research also in the scope of HTM.

3. Synchronization Mechanisms Considered in the Study

In this comparative study we considered the several synchronization mechanisms listed in Table 1:

**Locks** — Decades of research on lock-based synchronization have resulted in a plethora of different
implementations, many times trading off subtle changes with great impact in performance. We consider
6 different lock implementations [12] and both coarse and fine-grained strategies. Contrarily to the
other approaches we used, fine-grained locking requires a *per-application* lock allocation strategy,
which is a non-generalizable and error-prone task [35].

**STM** — With STM, reads and writes to shared memory (inside atomic blocks) are instrumented
to detect conflicts between transactions. This instrumentation induces overheads that can have a
Table 1. Mechanisms used for synchronization in our study.

<table>
<thead>
<tr>
<th>Mechanism</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Locks</td>
<td>Coarse/fine-grained locking [12]: TTAS, Spin, RW (pthreads), MCS, CLH, Ticket</td>
</tr>
<tr>
<td>STMs</td>
<td>TL2 [15], TinySTM [20], SwissTM [19], NOrec [9]</td>
</tr>
<tr>
<td>HTMs</td>
<td>TSX-GL [40] (global lock), TSX-FL (fine locks)</td>
</tr>
<tr>
<td>HyTMs</td>
<td>TSX-TL2 [33], TSX-NOREC [10]</td>
</tr>
</tbody>
</table>

detrimental impact on the efficiency of STMs. Yet, much research has been devoted over the last years to reduce STM’s overheads. For our study we selected four state of the art STMs, which are representative of different choices in the design space of TM. These include an STM optimized for validations at commit-time (TL2 [15]); to maximize performance at low thread counts (NOrec [9]); in high contention scenarios (SwissTM [19]); and to minimize instrumentation costs (TinySTM [20]).

**HTM** — HTM implements a concurrency control scheme in hardware, avoiding the overheads of STM instrumentation. In our study we consider Intel TSX’s implementation of HTM, which is integrated in the family of Intel commodity processors (Haswell). One fundamental design principle of TSX (and in general of HTM) is its best-effort nature: one cannot depend exclusively on TSX to synchronize accesses to data, since a transaction is not guaranteed to commit, even in absence of contention. Briefly, TSX uses the L1 cache to buffer transactional writes, and relies on the cache coherence protocol to detect conflicts. A plausible reason for a transaction to fail in hardware is because its data footprint exceeds the L1 capacity. Hardware transactions are also subject to abort due to reasons like page faults and system calls.

As a result, a fallback software synchronization mechanism must be provided to ensure progress in case a transaction cannot be committed with HTM. The software fallback mechanism must co-operate with the hardware in order to ensure correctness. As we shall see, the mechanisms used to coordinate the execution of the fallback mechanism have a crucial impact on the performance of TSX. The simplest approach, as suggested by Intel’s optimization manual, is to use a single lock to protect atomic blocks (we call it TSX-GL). When a hardware transaction aborts, it has the alternative to acquire the global lock instead. To ensure a correct interplay with the fallback, hardware transactions must read the lock as free to guarantee correctness; the transactional semantics will guarantee that the transaction commits only when there is no ongoing fallback execution.

An obvious extension of this idea is to use fine-grained locks (TSX-FL). As the TM abstraction is motivated by the need of relieving programmers from the complexity of designing locking schemes, the usage of fine-grained locks as a fallback for HTM sounds somewhat contradictory. However, this choice allows us to assess to what extent a simplistic fallback (using a single lock) can hinder parallelism. Also, fine-grained locks may be automatically crafted, to some extent, by using recent techniques based on static analysis [32].

**HyTM** — Another mechanism proposed in the literature is to use an STM as fallback for HTM, also known as Hybrid TM (HyTM). Its main advantage is to allow concurrent execution of hardware transactions and software ones, used in the fallback. However, during their concurrent execution, both software and hardware transactions have to play along in order to preserve correctness. In our study we considered two state of the art Hybrid TM proposals [33, 10], which are evaluated for the first time on a commodity HTM. We exploit the idea of reduced hardware transactions: normally transactions execute mainly in hardware, with a pure software fallback; the idea of these HyTMs is to have an intermediary mode where the software fallback still relies partially on hardware speculations to boost performance, namely during the commit in the fallback.
4. Methodology and Testbed

We consider in our study several parallel applications using atomic blocks for synchronization. First we use the STAMP suite, a popular set of benchmarks for TM [5], encompassing 8 applications representative of various domains that generate highly heterogeneous workload domains. We excluded the Bayes application given its non-deterministic executions, and used standard parameters for each application. STAMP contains manually instrumented reads and writes inside atomic blocks to invoke the STM-based synchronization. Naturally, this is not relevant for the case of pure HTM approaches. We shall additionally present results for compiler based instrumentation in Section 8.

We also consider a red-black tree and a hashmap, as examples of concurrent data structures, which represent important building blocks of parallel applications and that have two interesting characteristics: they are very hard to parallelize efficiently using locking schemes, and are challenging for STMs given that they generate extremely short transactions that suffer from relatively large instrumentation overheads. Finally, we also used a recent TM-based porting of the popular Memcached [38], an in-memory cache, widely used for instance at Facebook [34].

Each experiment is the average of 20 executions. We use the geometric mean whenever we show an average of normalized results. We often show speedup results, which are relative to the performance of sequential, non-instrumented executions, unless stated otherwise. The reported measurements of power consumption were obtained via the Intel RAPL [11] facility and are restricted to the processor and memory subsystems. Recent studies ([26, 27]) show that the model used by Intel RAPL estimates quite accurately the power consumption, when compared to a power meter attached to the machine.

Our machine is equipped with an Intel Haswell Xeon E3-1275 3.5GHz processor and 32GB RAM. This choice is dictated by the requirement of using a processor equipped with TSX, which is limited for now to 4 cores (and 8 hyper-threads). We always pin threads to physical cores in a round-robin fashion; for instance, 4 threads will be allocated uniformly, one per core. As a result, hyper-threading is only used when 5 or more threads are used. We used GCC 4.8.1 with all compiler optimizations enabled and Ubuntu 12.04.

5. Tuning TSX Fallback Path

Before comparing the considered synchronization mechanisms, we conduct a set of preliminary experiments evaluating several alternative configurations of the coupling between TSX and its fallback. As we shall see, this can have significant impact on TSX’s efficiency. The settings identified thanks to this preliminary study will be adopted in the remainder of the paper to ensure that the comparison is performed using an appropriately tuned HTM.

We begin in Section 5.1. by comparing the performance and energy efficiency when using six locks implementations to implement the fallback mechanism of TSX. This shall allow us to narrow down the multitude of combinations of TSX and lock implementations assessed in our study. Next, in Section 5.2., we optimize TSX-GL with a recently proposed technique [1] aimed at reducing spurious hardware aborts. This shall provide some insights about its actual practical effectiveness, as it was never evaluated before. Lastly, we investigate when it is best to give up on hardware and trigger the fallback path, in Section 5.3.

5.1. The Impact of Locks on the Fallback

The simpler way to use TSX is by relying on very coarse-grained locks on the fallback, or simply a single one. We considered the six lock listed in Table 1, to be used in the fallback. These implementations are
representative of different design choices, and our goal is to understand if there is some implementation that consistently performs above the average across all parallelism degrees and benchmarks.

Table 2 shows the performance of TSX given the backing lock implementation used in the fallback path. We show the average overhead with respect to the best performing lock in each experiment, considering both time to complete the benchmark as well as power consumed. The reported overhead is the average across all STAMP benchmarks and thread counts (1 to 8). Using this metric, we can see that the Ticket, MCS and CLH locks perform best.

For each benchmark and thread count, we additionally sorted the considered lock implementations according to either their performance or power consumption, determining in this way their rank for that benchmark/configuration. This shows that no lock implementation is always the best or worse. However, we can see that the Ticket lock is consistently ranked higher, for which reason we shall rely on it from now on whenever we require locking (both standalone, or in the fallback of TSX).

5.2. Improving the Single-lock Fallback

The recommended fallback mechanism for TSX relies on a global lock (as explained in Section 3.). In this section we evaluate for the first time on TSX the efficiency of an alternative technique proposed to reduce the situations under which best-effort HTMs have to follow this pessimistic fallback path [1].

The idea is that it may not be safe for transactions to execute speculatively in hardware if, at the same time, some transaction is executing pessimistically after acquiring the global lock. A pessimistic execution cannot restart, and hence its accesses must be consistent when faced with concurrency. For this reason, TSX’s usage guide points out that the lock has to be read as being free during the course of a hardware transaction. Then any transaction that activates the fallback path has to first acquire the lock, and cause every hardware transaction to abort. This can cause a chain effect, also known as lemming effect [14], where the aborted hardware transactions also try to acquire the lock, preventing hardware speculation from ever resuming.

In [1], the authors use an auxiliary lock to prevent the lemming effect. The idea is to guard the global lock acquisition by another lock. Aborted hardware transactions have to acquire this auxiliary lock before restarting speculation, which effectively serializes them. However, this auxiliary lock is not added to the read-set of hardware transactions, which avoids aborting concurrent hardware transactions. If this procedure is attempted some times before actually giving up and acquiring the global lock, then the chain reaction effect can be avoided: the auxiliary lock serves as a manager preventing hardware aborts from continuously acquiring the fallback lock and preventing hardware speculations.

<table>
<thead>
<tr>
<th>Lock</th>
<th>Performance Overhead (%)</th>
<th>Performance Rank</th>
<th>Power Overhead (%)</th>
<th>Power Rank</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ticket</td>
<td>1.0</td>
<td>1.75</td>
<td>1.1</td>
<td>1.75</td>
</tr>
<tr>
<td>MCS</td>
<td>2.4</td>
<td>2.62</td>
<td>1.2</td>
<td>2.25</td>
</tr>
<tr>
<td>CLH</td>
<td>2.9</td>
<td>3.62</td>
<td>2.4</td>
<td>3.38</td>
</tr>
<tr>
<td>RW</td>
<td>14.2</td>
<td>4.89</td>
<td>17.4</td>
<td>3.88</td>
</tr>
<tr>
<td>TTAS</td>
<td>15.2</td>
<td>5.00</td>
<td>17.4</td>
<td>4.88</td>
</tr>
<tr>
<td>Spin</td>
<td>16.4</td>
<td>5.00</td>
<td>17.5</td>
<td>4.88</td>
</tr>
</tbody>
</table>
In Table 3 we compare TSX using the auxiliary lock against a single-lock (TSX-GL). For this, we report values for time, energy, and Energy Delay Product (EDP), normalized with respect to TSX-GL (analogously to traditional speedup metrics). We report the average across either benchmarks or threads. Naturally, we can see that there is no difference with 1 thread because there is no concurrency and hence no problem resuming speculative execution. But beyond that, and in particular at larger concurrency levels, this technique helps consistently to improve the EDP. Some benchmarks do not show any difference because there are very little aborts (SSCA2) or TSX is not able to execute speculatively most of the time (Labyrinth). Yada’s workload is conflict-intensive, for which reason the non-optimized approach is slightly better due to its inherent pessimism in following the fallback path — that pays off since the high conflict probability limits the effectiveness of optimistic transactions.

5.3. Retry Policy for the Fallback

Given that TSX must always have a fallback due to its best-effort nature, an important decision is when to trigger that path. Upon a transaction abort, TSX provides an error code that informs about the reason of the abort. An abort due to a capacity exception is typically a good reason to trigger the fallback path. However, hardware transactions may abort for various micro-architectural conditions that are less deterministically prone to happen upon transaction re-execution, and even capacity exceptions may not always be deterministic. Also, of course, transactions may abort due to data contention. In these situations one may aggressively trigger the fallback, or opt to insist on using HTM.

As we will shall discuss in more detail in Section 8., the optimal choice of the retry policy can vary significantly across workloads and degrees of parallelism. As it is impractical to assume that the retry policy is ad-hoc tuned by programmers for each and single workload/application, we set the number of retries to 5, which is the configuration reported to deliver best all-around performance with TSX [40, 31] (a result that we have confirmed with TSX-GL on our testbed). For the HyTMs, 4 times

Table 3. Normalized performance of auxiliary lock [1] over TSX-GL across benchmarks and threads (higher is better).

<table>
<thead>
<tr>
<th>Avg across Benchmarks</th>
<th>Avg across Threads</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>time</td>
</tr>
<tr>
<td>genome</td>
<td>1.58</td>
</tr>
<tr>
<td>intruder</td>
<td>1.80</td>
</tr>
<tr>
<td>kmeans</td>
<td>1.20</td>
</tr>
<tr>
<td>labyrinth</td>
<td>1.01</td>
</tr>
<tr>
<td>scca2</td>
<td>1.00</td>
</tr>
<tr>
<td>vacation</td>
<td>1.52</td>
</tr>
<tr>
<td>yada</td>
<td>0.96</td>
</tr>
</tbody>
</table>

Table 4. Summary of results according to the workload characterization of the STAMP suite (xt = number of threads).

<table>
<thead>
<tr>
<th>Time in Tx (%)</th>
<th>Contention</th>
<th>Best Performing</th>
<th>Least Power Consumption</th>
</tr>
</thead>
<tbody>
<tr>
<td>kmeans</td>
<td>low (7)</td>
<td>low</td>
<td>TSX-GL</td>
</tr>
<tr>
<td>scca2</td>
<td>low (17)</td>
<td>low</td>
<td>TSX-GL</td>
</tr>
<tr>
<td>intruder</td>
<td>medium (33)</td>
<td>high</td>
<td>TSX-GL ≤ 4t; TinySTM ≥ 5t</td>
</tr>
<tr>
<td>vacation</td>
<td>high (80)</td>
<td>low</td>
<td>TSX-GL ≤ 2t; TinySTM ≥ 3t</td>
</tr>
<tr>
<td>genome</td>
<td>high (97)</td>
<td>low</td>
<td>TinySTM</td>
</tr>
<tr>
<td>yada</td>
<td>high (99)</td>
<td>medium</td>
<td>SwissTM</td>
</tr>
<tr>
<td>labyrinth</td>
<td>high (100)</td>
<td>high</td>
<td>STMs (except TL2)</td>
</tr>
</tbody>
</table>
was found to be the best number of retries on average.

5.4. Improving the STM Fallback

In this section we use NOrec [9] hybridized with TSX as an example of an HybridTM that we use later on, and try to improve it for a fair assessment.

Hybridizing an HTM with an STM has been proposed across several algorithms in the literature. We consider the baseline algorithm proposed in HybridNOrec [8] — which at the time used both AMD’s and Sun’s HTMs — and we study its performance in the scope of Intel TSX (calling it TSX-NOrec). The base idea is strikingly simple: replace the software fall-back path relying on a single global lock for synchronization, by software transactions that rely on NOrec STM.

To ensure correctness between concurrent hardware and software transactions, it is necessary to have a mechanism similar to the subscription of the single global lock. The idea of HybridNOrec, which we summarize here, is to use NOrec’s single global lock to manage that interaction. There have been several proposals to optimize HybridNOrec in the literature. We now briefly describe each one of them, and assess their validity. Our objective is to obtain the best performing algorithm for this HybridTM.

NOrec’s single global lock has actually a dual responsibility: besides mutual exclusion for committing update software transactions, it is also a logical clock that allows software transactions to timestamp their snapshots over shared memory and ensure consistent reads (NOrec is an opaque STM [25], a criterion stronger than strict serializability [3]). As such, transactions acquire the lock by incrementing it to an odd number, and release it by incrementing it to an even number. In this way, the lock prevents concurrent software commits when an odd number, and at the same time the increments allow to signal concurrent software transactions that a commit has occurred and that their read-sets may be stale.

As a result of this design, hardware transactions in TSX-NOrec, must increment (twice) NOrec’s lock in the end of the atomic block. This creates contention between (possibly disjoint) hardware transactions, even when no software transactions are running.

One of the optimizations proposed in HybridNOrec [8] is for hardware transactions not to increment the lock. This is safe to do when hardware transactions contain no writes: read-only transactions need not signal software transactions through the lock. This can be implemented by adding instrumentation to writes in hardware transactions, so that they flag locally that the transaction is not read-only. The cost of this instrumentation is that every shared memory write actually performs two writes, one of which is to thread-local memory, thus adding one write location to hardware transactions’ footprint. We call this optimization is-read-only.

Another similar optimization is for hardware transactions not to increment the lock when there are no concurrent software transactions. We call this optimization exists-sw. For this, it suffices to have a flag that is incremented and decremented by software transactions during their execution, which can be done using a fetch and increment (or decrement) atomic primitives. These are weaker than compare-and-swap, but are beneficial because they cannot fail and are cheaper to implement in hardware.

Finally, the third optimization proposed in HybridNOrec is to distribute the lock, with up to as many global locks as the number of cores. We call this optimization dist-locks. The advantage of this is that hardware transactions increment only one such lock (possibly cache line padded), without conflicting with each other. However, this optimization has a considerable downside, because it requires software transactions to maintain a snapshot of a vector clock of the timestamps of the locks, instead of a single scalar value (as in traditional NOrec).
Table 5. Assessment of performance for TSX-NOrec with the different optimizations studied. The results are the average speedup with respect to sequential code across all our TM benchmarks. The best performing combination of optimizations is also shown on the top right.

<table>
<thead>
<tr>
<th>#threads</th>
<th>baseline</th>
<th>all-opts</th>
<th>two-paths + is-read-only + exists-sw</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>1.19</td>
<td>1.44</td>
<td>1.44</td>
</tr>
<tr>
<td>4</td>
<td>1.99</td>
<td>2.53</td>
<td>2.58</td>
</tr>
<tr>
<td>8</td>
<td>1.97</td>
<td>2.90</td>
<td>3.05</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>#threads</th>
<th>two-paths</th>
<th>is-read-only</th>
<th>exists-sw</th>
<th>dist-locks</th>
<th>reduced</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>1.35</td>
<td>1.28</td>
<td>1.32</td>
<td>1.30</td>
<td>1.24</td>
</tr>
<tr>
<td>4</td>
<td>2.19</td>
<td>2.31</td>
<td>2.28</td>
<td>2.27</td>
<td>2.03</td>
</tr>
<tr>
<td>8</td>
<td>2.23</td>
<td>2.59</td>
<td>2.68</td>
<td>2.45</td>
<td>2.10</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>#threads</th>
<th>two-paths</th>
<th>is-read-only</th>
<th>exists-sw</th>
<th>dist-locks</th>
<th>reduced</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>1.30</td>
<td>1.31</td>
<td>1.34</td>
<td>1.35</td>
<td>1.37</td>
</tr>
<tr>
<td>4</td>
<td>2.31</td>
<td>2.39</td>
<td>2.44</td>
<td>2.51</td>
<td>2.53</td>
</tr>
<tr>
<td>8</td>
<td>2.65</td>
<td>2.42</td>
<td>2.49</td>
<td>2.75</td>
<td>2.88</td>
</tr>
</tbody>
</table>

We also considered a recent proposal of Reduced Hardware Transactions [33], which we call reduced. The idea is to have several levels of dependence on the Best-Effort HTM (such as Intel RTM) between full hardware transactions and full software transactions. Namely, it is possible to devise a case in which a hardware transaction fails (possibly due to capacity overflow), so it is advisable to execute it as software transaction but whose commit procedure can be encapsulated in a hardware transaction. In the case of TSX-NOrec, we can apply this so that such hybrid transactions avoid taking NOrec’s lock. In the case that not even the commit procedure can be executed with RTM, then we fall-back completely to a full software mode.

We have taken all the four optimizations into consideration and studied their impact in TSX-NOrec. In addition to these, we added a fifth change, which is more specific to the implementation. Ideally, HybridTMs such as TSX-NOrec should be used in applications that are compiled with two execution paths: one that is used for software transactions and another one for hardware transactions (hopefully, with much less instrumentation in reads and writes). Due to the lack of support for this in current mainstream compilers, quite often HybridTMs are implemented with function pointers to handle the instrumentation, which are changed depending on the mode of the transaction being executed. We have modified all our benchmarks to provide the two execution paths in a fifth optimization that we call two-paths.

We report on the results of our study to optimize TSX-NOrec in Table 5. Our baseline is the HybridNOrec algorithm proposed in [8] without any of the five optimizations described above. In contrast, we also show all-opts which uses all the optimizations. To summarize our findings, we limit the results shown to two perspectives: 1) we show the performance of adding each of the five optimizations to the baseline; and 2) we show the effect of removing each of the five optimizations from all-opts.

The two perspectives allow to understand whether an optimization is beneficial to the performance of TSX-NOrec. Adding an optimization to baseline is beneficial if it provides better performance over baseline. We can see that is the case for all optimizations. However, we can also verify that not all optimizations are necessary when working together, namely because there is some overlapping
in their responsibilities and trade-offs in their usages (as described above). To understand this, we can see that removing some optimizations from \textit{all-opts} does not decrease much performance when comparison to that of \textit{all-opts}. That is the case for both \textit{dist-locks} and \textit{reduced}.

The former avoids hardware transactions from incrementing the NOrec lock at the cost of a more complex maintenance of the snapshots in software transactions. Because other optimizations already avoid writing to NOrec lock in an effective way most of the time, the \textit{dist-locks} optimization adds only burden to the software transactions. The \textit{reduced} optimization, instead, uses TSX in the fall-back (to execute the commit of the software transaction without acquiring locks). We verified, however, that most of the TM benchmarks did not benefit from this as they often triggered aborts even in such reduced hardware transactions; this leads to eventually using another fall-back that uses only software, but at the cost of retrying even more than usual in using hardware transactions.

As such, from all the combinations available, we elect the other three optimizations as the best combination for TSX-NOrec, whose performance we show also on Table 5. This is the version of TSX-NOrec that we shall use in the rest of this work, whose optimizations we also apply to TSX-TL2.

6. STAMP Benchmark Suite

In this section we rely on the STAMP benchmark suite to assess the efficiency of all the synchronization mechanisms listed in Section 3., namely HTM, STMs, HyTMs, and locking. In the following, we shall always include the TSX optimizations discussed in the previous section.

We start by summarizing our results in Table 4. There, we list the STAMP benchmarks sorted by two important characteristics of their workloads: the contention level between transactions, and the percentage of the workload that is transactional. We then identify the mechanism that takes the least time to complete and which one consumes the least power, given the averaged results across threads. This summarized perspective allows to highlight an interesting fact. It is possible to distinguish three categories in which TSX behaves differently, according to the transaction’s characteristics. Kmeans and SSCA2 represent workloads with small transactions, medium frequency and low contention; here, TSX-GL performs consistently better than the alternatives across all threads. Intruder and Vacation exhibit medium profiles for what concerns the time spent in transactions and contention; in these cases, TSX-GL results in the best performing solution using up to 4, resp. 2, threads, and the most energy efficient up to 5, resp. 4, threads. Finally, the other benchmarks spend almost all the time in transactions, encompassing both low and high contention scenarios. In these settings, TinySTM emerges as the most robust solution, both from the perspective of energy and performance.

This analysis allows to draw a set of guidelines to select which synchronization to use, at least when considering applications having analogous characteristics to those included in the STAMP suite. TSX-GL is desirable when transactions are small, generate low/medium contention, and the application does not spend all the time executing transactions. When contention increases, or the frequency of transactions is high, TSX-GL is competitive up to a medium degree of parallelism. In the remaining cases, STM is often the best choice, even when compared with fine-grained locking. The considered HyTMs perform poorly compared to the alternatives, never clearly outperforming the competing schemes in any benchmark. In Sections 6.1.-6.2. we present our experiments with STAMP. We will consider additional benchmarks and fine-grained locks in Section 7.
6.1. Performance Study

In Fig. 1 we show, for each benchmark and while varying the parallelism level, the speedup of all the considered synchronization schemes (with the exception of schemes based on fine-grained locking, which shall be presented in Section 7.) with respect to a sequential, non-instrumented execution, and the power consumption during the execution (in Kilo Joules). This allows us to discuss in detail the differences between the mechanisms in different workloads.

**Kmeans:** This benchmark yields the biggest gap in performance between a TSX variant and STMs. Namely, TSX-GL reaches $3.5 \times$ speedup over a sequential execution, beating every other alternative both performance-wise and in terms of energy-efficiency. An interesting trend concerning energy-efficiency is that the power consumption with TSX (and, to some extent, also for all other synchronization schemes but GL) tends to slightly decrease as the parallelism level grows, which is a symptom of efficient utilization of the available architecture resources achievable using TM-based solutions. If we consider TSX-TL2 and TSX-NOrec, they are still competitive and better than the corresponding STMs, but they are far from TSX-GL in both metrics. It is worth noticing that the small and rare atomic blocks of this benchmark allow the GL approach to scale up to 3 threads. This explains the considerable success of TSX-GL in this benchmark, as a transaction that resorts to the GL is still able to run concurrently with other threads that are not under an atomic block at that time.

**SSCA2:** This benchmark shows a similar trend between TSX variants, but with the significant difference that all STM approaches scale better as the degree of parallelism increases. Here, TSX-GL is only slightly better than the best STM, and this is consistent across all the thread counts. Also interestingly, TSX-TL2 improves little and fares rather bad on the energy side. This, however, is not the case for TL2 or TSX on their own, and as such is an artefact of the hybrid implementation integration. Finally, the reduced time within atomic blocks still allows the GL approach to scale up to 2 threads, which justifies the advantage of TSX-GL. However, this effect is smaller than in Kmeans, which also matches the fact that TSX-GL achieves less improvements over other approaches.

**Intruder:** Here TSX-NOrec (and TSX-GL to some extent) are competitive and even better (until 5 threads) than the best STMs (except for TL2). Since TL2 performs poorly in this benchmark, this also drags TSX-TL2 behind in both metrics. Interestingly, both TL2 and TSX-TL2 improve slightly performance with more threads, but TL2 consumes more power whereas TSX-TL2 slightly decreases it.

Table 6. Transactional abort rate (%). For STM we show the lowest and highest values obtained (across all considered STMs).

<table>
<thead>
<tr>
<th>benchmark</th>
<th>kmeans</th>
<th>ssc2</th>
<th>intruder</th>
<th>vacation</th>
<th>genome</th>
<th>yada</th>
<th>labyrinth</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 thread</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>STM</td>
<td>0 - 0</td>
<td>0 - 0</td>
<td>0 - 0</td>
<td>0 - 0</td>
<td>0 - 0</td>
<td>0 - 0</td>
<td>0 - 0</td>
</tr>
<tr>
<td>TSX-GL</td>
<td>0</td>
<td>0</td>
<td>7</td>
<td>49</td>
<td>11</td>
<td>46</td>
<td>95</td>
</tr>
<tr>
<td>TSX-TL2</td>
<td>0</td>
<td>0</td>
<td>36</td>
<td>94</td>
<td>35</td>
<td>19</td>
<td>53</td>
</tr>
<tr>
<td>TSX-NOrec</td>
<td>0</td>
<td>0</td>
<td>4</td>
<td>40</td>
<td>6</td>
<td>19</td>
<td>53</td>
</tr>
<tr>
<td>4 threads</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>STM</td>
<td>10 - 34</td>
<td>0</td>
<td>0 - 0</td>
<td>0 - 6</td>
<td>1 - 51</td>
<td>5 - 58</td>
<td>4 - 13</td>
</tr>
<tr>
<td>TSX-GL</td>
<td>26</td>
<td>0</td>
<td>22</td>
<td>69</td>
<td>31</td>
<td>48</td>
<td>100</td>
</tr>
<tr>
<td>TSX-TL2</td>
<td>50</td>
<td>74</td>
<td>74</td>
<td>100</td>
<td>45</td>
<td>84</td>
<td>60</td>
</tr>
<tr>
<td>TSX-NOrec</td>
<td>31</td>
<td>46</td>
<td>29</td>
<td>66</td>
<td>17</td>
<td>31</td>
<td>55</td>
</tr>
<tr>
<td>8 threads</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>STM</td>
<td>25 - 54</td>
<td>0</td>
<td>3 - 57</td>
<td>0 - 10</td>
<td>0 - 1</td>
<td>7 - 65</td>
<td>8 - 23</td>
</tr>
<tr>
<td>TSX-GL</td>
<td>42</td>
<td>1</td>
<td>33</td>
<td>72</td>
<td>48</td>
<td>47</td>
<td>100</td>
</tr>
<tr>
<td>TSX-TL2</td>
<td>60</td>
<td>99</td>
<td>92</td>
<td>100</td>
<td>53</td>
<td>92</td>
<td>69</td>
</tr>
<tr>
<td>TSX-NOrec</td>
<td>44</td>
<td>88</td>
<td>62</td>
<td>99</td>
<td>69</td>
<td>39</td>
<td>60</td>
</tr>
</tbody>
</table>
Figure 1. Speedup (relative to non-instrumented sequential execution) and Energy Consumption (in Kilo Joules) when varying the number of threads (horizontal axis) in all the STAMP benchmarks.

**Vacation:** Once again we see that the performance of TSX-TL2 is quite disappointing, as indeed TL2 itself performs poorly in this scenario. As we shall see throughout this study, TL2 is by far the worst
STM among those considered, which is a result of having a similar algorithmic and synchronization complexity to that of SwissTM and TinySTM, while detecting conflicts lazily at commit-time. This results in TL2 doing useless work more often, whereas SwissTM and TinySTM restart the speculation faster when reacting to conflicts. On the other hand, NOrec is simpler, both in algorithmic as well as synchronization terms, reducing its instrumentation overheads and maximizing its performance at low thread counts. With regard to the other approaches, TSX-GL and TSX-NOrec are competitive with STMs until 4 threads. At higher parallelism degrees, their performance degrades due to contention on L1 caches caused by hyper-threading. Analogous results are achieved for what regards power consumption. It is interesting to note that TSX-GL performs worse than TSX-NOrec at 8 threads, but the two consume approximately the same power. This is a result of the power savings that are achievable with the lock acquisition in TSX-GL.

**Genome:** In the three last benchmarks we have either transaction-heavy or high-contention workloads, characterized by large transaction foot-prints. These conditions are clearly a much more favourable playground for STMs. In this case, we see a clear (and consistent across benchmarks) distinction between TL2 and NOrec, as these two lag behind in both metrics particularly at higher thread counts. Interestingly, we can see that TSX-TL2 performs best among the TSX variants at a higher concurrency degree, which is a singularity among all benchmarks. This benchmark also shows a clear trend when the 5th thread is used: all approaches stabilize (or even decrease) performance at that point, due to hyper-threading. Interestingly, this effect is not so noticeable on the energy side, as STM approaches are still able to reduce the power consumed as parallelism increases. This highlights an interesting trade-off of hyper-threading: it allows sub-linear speed-ups only, but it also consumes little additional power. This fact is favourable to STMs, as TSX approaches generate more transactional aborts when hyper-threading is used, due the higher contention on the L1 cache.

**Yada:** This benchmark shows one scenario where TSX-GL performs poorly, with slowdowns above 3 threads. HyTMs follow closely their fallback STMs’ performance, as TSX is not able to succeed. This is also a case where TinySTM and SwissTM perform better than the other two STMs. This benchmark presents no surprises in the energy-efficiency, whose trends are highly correlated with the performance.

**Labyrinth:** Here we see STMs performing best and very alike each other. TSX-GL does not improve with thread count, simply because most transactions exceed the hardware cache capacity and, as such, eventually follow the fallback path which is a sequential bottleneck given the GL. For this reason, TSX-TL2 and TSX-NOrec obtain some improvements, exactly because the fallback allows for concurrency, contrarily to the global lock on TSX-GL. This scenario highlights, however, that HyTMs are capped by either TSX or the fallback STM — as such, it is dubious whether they are practical (at least when used with TSX), or if it would be preferable to adaptively employ the most promising technique (TSX-GL or an STM) based on the workload.

### 6.2. Insights on TM Efficiency

In this section we shed some additional light on the factors dictating the trends observed in the experiments. To this end, in Table 6 we report the average abort rate across benchmarks and threads for each speculative mechanism. This represents the percentage of speculations that do not complete. Since there are four STMs under evaluation, we show the minimum and maximum abort rates among them — typically the smallest abort rate belongs to TinySTM and SwissTM, whereas TL2 yields the maximum abort rate.

Once again, we structure the table considering the different categories of workloads. As we move right (more contended or transaction-intensive workloads) and down (higher degree of parallelism),
TSX approaches increase abort rates, which causes the loss of efficiency shown in the previous section. These results highlight that TSX has non-negligible aborts in many occasions where STMs abort very little.

In Fig. 2 we consider four different benchmarks, representative of scenarios that allow to derive insights on the efficiency of the considered TSX variants. In those plots we present a breakdown of the reasons motivating transactional aborts, for each TSX mechanism. We distinguish aborts caused by exceeding the capacity of L1 cache; micro-architectural instructions or states forbidden by TSX, such as some system calls; data contention resulting in conflicts; and interaction between TSX and the fallback paths, such as checking if the GL is free in TSX-GL or more complex logic in the case of HyTMs.

Kmeans’ breakdown shows that, as expected, as concurrency increases, also abort rates increase due mainly to data conflicts. It is worth mentioning that Kmeans is the benchmark with the least average aborts for TSX variants. Half of the aborts are due to conflicts, whereas the rest is motivated by a non-negligible percentage of aborts due to architectural instructions. This is something intrinsic to TSX, which is common throughout different benchmarks. The fact that these aborts occur less often in this benchmark allows TSX to obtain the most favourable results among all benchmarks.

In Yada and Labyrinth, instead, the workloads are much more transaction-intensive with non-negligible conflict rates. On top of this, the capacity of L1 caches is often exceeded by the hardware transactions (this is particularly visible in Labyrinth, where this phenomena dominates the aborts). This explains why the TSX variants followed up closely the performance of their fallbacks (with some constant overhead). HyTMs have a reduced abort rate because the fallback’s software transactions are also taken into account in these statistics, on top of the hardware transactions — since software transactions have little aborts due to the uncontended workload, they amortize the overall abort rate. In TSX-GL, instead, the fallback executes non-speculatively due to the global lock, so we only count statistics for the hardware transactions there.
Finally, SSCA2 shows a completely different scenario, in which TSX-GL generates almost no aborts (in line with STMs’ behaviour), whereas HyTMs have enormous abort rates, dominated by the interaction with the fallback path.

This motivates to better understand the usage of the fallback path in the HyTMs. Table 7 shows the percentage of transactions that were executed in the fallback (i.e., not purely in hardware). We also show the percentage of transactions in the fallback that are able to execute in a fast mode, i.e., a mode in which the transaction executes in software but the commit is boosted by using a reduced hardware transaction [33] (as explained in Section 3.). For every table cell we show the percentage corresponding to 1 and 8 threads. Overall, the percentages vary linearly from 1 to 8 threads, for which reason we omit the intermediate values.

We start by highlighting in SSCA2 how both HyTMs are able to execute purely in hardware with 1 thread (they trigger the fallback $< 1\%$ of the transactions). However, a higher thread count typically results in executing in the fallback mode almost all the time, which matches the idea conveyed by Fig. 2(d). In particular, for this benchmark, the ability to rely on hardware to speed up the software fallback path is reduced from above 90% to 14% or even less.

These results for HyTMs show that TSX-TL2 triggers the fallback more often, and is able to execute in the fast mode less frequently than TSX-NOrec. This justifies the advantage of TSX-NOrec, which fared better across all benchmarks in Section 6.: TSX-NOrec executes the fallback software transactions in fast mode for most of the time. The reason is that the much simpler design of NOrec allows for a much easier integration with TSX in a HyTM.

Ideally one may want to also rely on more scalable STMs, like TinySTM or SwissTM, in the fallback of TSX. However, due to the higher complexity of their algorithms, coupling them efficiently with HTM is a challenging task, and, in fact, we are not aware of any proposal in this sense in literature.

Finally, it has been pointed out in [10] that, in order to support efficient HyTMs, it is desirable to have hardware support for selective non-transactional memory accesses in the scope of transactions. Such a feature is not currently supported in TSX, whereas its inclusion was, e.g., planned in AMD’s HTM proposal [7] (which was never commercialized). Hence, an interesting research direction suggested by this study is to investigate the impact of supporting non-transactional accesses, not only in terms of performance and energy, but also in terms of architectural intrusiveness.

### Table 7. Rate (%) of triggering the fallback on HyTMs and of executing it in fast mode. Intervals of values are shown, ranging from 1 (lower) to 8 threads (upper bound).

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>TSX-TL2</th>
<th>TSX-NOrec</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>fallback</td>
<td>Fast</td>
</tr>
<tr>
<td>kmeans</td>
<td>$&lt; 1 - 77$</td>
<td>$92 - 32$</td>
</tr>
<tr>
<td>ssca2</td>
<td>$&lt; 1 - 99$</td>
<td>$91 - 2$</td>
</tr>
<tr>
<td>intruder</td>
<td>$33 - 88$</td>
<td>$98 - 39$</td>
</tr>
<tr>
<td>vacation</td>
<td>$94 - 100$</td>
<td>$43 - 3$</td>
</tr>
<tr>
<td>genome</td>
<td>$50 - 100$</td>
<td>$97 - 71$</td>
</tr>
<tr>
<td>yada</td>
<td>$18 - 78$</td>
<td>$50 - 34$</td>
</tr>
<tr>
<td>labyrinth</td>
<td>$58 - 100$</td>
<td>$14 - 2$</td>
</tr>
</tbody>
</table>
7. Benchmarks Using Fine-grained Locking

Most of the STAMP benchmarks have an irregular nature, which makes it very challenging to derive fine-grained locking schemes. In this section we focus on benchmarks for which it is possible to use (possibly very complex) fine-grained locking approaches. We start, in Section 7.1., by focusing on a subset of three STAMP benchmarks, for which we could craft an ad-hoc fine grained locking strategy. We then present results for Memcached in Section 7.2. and for two concurrent data structures in Section 7.3.

7.1. Fine-grained Locking in STAMP

As already mentioned, implementing a fine-grained locking strategy is a complex task for most of the STAMP benchmarks. We were, however, able to devise fine-grained locking strategies for three of the STAMP benchmarks, whose results we report in Fig. 3. Besides fine-grained locks (FL), we also show results for TSX-FL, which combines hardware transactions with a fallback path that relies on FL. Naturally, the combination of both schemes in TSX-FL requires hardware transactions to read all necessary locks as being free. We then compare these two approaches with TSX-GL and TinySTM, which were the best mechanisms in our previous experiments, and remove the others to improve the readability of the plots.

TSX-FL presents one advantage over TSX-GL, in that the fallback path allows for threads to proceed in parallel if they require different locks (which is highly likely if there is little data contention). However, this has the drawback that more locks have to be checked (during speculative executions) or acquired (during the fallback executions). Hence, there is a clear trade-off that is subtle and difficult to manage.

Recall that Kmeans and SSCA2 were the two benchmarks with workload characteristics more amenable to TSX-GL. This is justified by the low frequency of activation of the fallback path. As such, TSX-GL incurs minimal overhead thanks to the hardware speculation and to the avoidance of any software-based instrumentation. Therefore, it is not a surprise that fine-grained locking is of no advantage in this scenario: each lock acquisition represents a synchronization point, whereas for TSX-GL there exists only explicit synchronization at the hardware level when a transaction attempts to commit. Note, however, that FL is consistently better than the best STM (TinySTM). This fact is even more relevant from the energy perspective, where the gap between FL and TinySTM is larger. Since the TSX fallback is not triggered often, then TSX-FL goes through the additional verifications over more locks that are useless most of the time (to ensure a correct integration of the fallback with hardware transactions), which explains its lower performance in this kind of workload.

In SSCA2 we see a different behaviour as both TSX variants perform quite similarly. This is explained by the fact that the fine-grained scheme is not very efficient: its locks are relatively coarse, which induces unnecessary serialization. This has the side-effect of making TSX-FL competitive with TSX-GL, because both have a similar effort in checking the locks in the speculative executions to ensure correct integration with the fallback. Notice how the FL scheme still performs better than GL, which is a consequence of the higher degrees of parallelism achievable by reducing lock granularity. This confirms an expectable trade-off concerning lock granularity: the more fine-grained, the best the fallback performs; however this can have an impact on the performance of the speculative executions as we saw for Kmeans.

Finally, Intruder spends a large fraction of time within atomic blocks. As already discussed, this workload is more advantageous for STMs than for TSX. It is not surprising to see that TSX-GL is no longer the most competitive choice (although it still fares best until 3 threads). The interesting
Figure 3. Experiment similar to that in Fig. 1, but instead using fine-grained locking and Memcached.

fact is that this kind of workload is more beneficial for FL. With more threads, TinySTM degrades its scalability, and is surpassed by FL. From an energy perspective, it is even clearer that FL is the best choice comparing to TinySTM, as it is almost always consuming less energy. TSX-FL suffers from the overheads of checking additional locks, until 3 threads, for which reason it is not as good as TSX-GL. However, at that point TSX triggers the fallback more often, which justifies the use of fine locks and allows TSX-FL to perform substantially better than TSX-GL. On the energy side, TSX-FL is closer to TinySTM, following the trend of STMs that often fare worse on the energy side to obtain comparative levels of performance to the other approaches.

7.2. Memcached

Memcached is a popular distributed object caching system [34]. In this study, we rely on a recent TM-based porting [38], and use the original Memcached as the basis for FL. We used the memslap tool, configuring the workload with 95% gets and 5% puts, 8 threads and a concurrency of 256.

In Memcached it is not really possible to measure, for reference purposes, the performance of a sequential execution, because there is always concurrency due to the existence of a pool of maintenance
threads. Hence, we present the peak throughput obtained using the maximum number of available hardware threads (see Fig. 3(h)). The results show that FL has the best performance, but TSX-GL is only 7% behind. This is a significant achievement as the effort to devise such fine-grained locks is considerably higher than using TSX-GL. Also, since FL is quite optimized, it is expectable that TSX-FL is not able to extract any further parallelism. Interestingly, with this benchmark, TinySTM is not competitive because the instrumentation overheads are amplified by the short and uncontended transactions.

7.3. Concurrent Data Structures

We now consider two concurrent data structures, namely a red-black tree and a hashmap, which represent particularly relevant use cases for TM given the complexity of designing efficient fine-grained locking strategies for these scenarios.

Fig. 4 shows two different scenarios: we consider a small hashmap (512 buckets) with only 10% transactions performing writes (the rest are lookup operations), and a large red-black tree (1 million items) with 90% transactions performing updates. In the former case, TSX-GL achieves perfect linear scalability, which is a consequence of its negligible overheads and of the very reduced abort rate. With larger transactions, the gains achievable by TSX tend to diminish, although it still remains a very competitive solution.

Table 4(e) shows a spectrum of workloads in red-black tree, by considering the normalized EDP of TSX-GL against the best alternative in each experiment. For this, we vary the size of the tree and the percentage of write transactions. The trend is clear in this table: TSX behaves best with light workloads, and loses advantage when transactions become larger or write-intensive. This confirms the results of the analysis that we performed for STAMP, given that, also in this case, TSX shines most when atomic blocks have little duration and the workload is not fully transactional.

8. Research Directions Suggested by our Study

We now identify some relevant research directions that emerged from the analysis of our experimental study:

- The overall performance of the tested HyTM solutions is quite disappointing. These findings contradict the simulation results published in several previous works, e.g., [33]. Our analysis suggest that the root cause of the problem is related to the inefficiency of the mechanisms used to couple hardware and software transactions, which is generating a large number of spurious aborts. However, further research is due in order to understand what can be done to address such a problem. An interesting research question, in this sense, is whether the availability of support for enabling non-transactional memory accesses while executing hardware-assisted atomic blocks could indeed allow for more efficient interplay between HTM and STM (which has been assumed by other works in the area of HyTM, e.g. [36]). A related research question is how to support such a feature while minimizing the disruptiveness of the changes required at the hardware level — an aspect that cannot be overlooked given the complexity of modern processor architectures.

- As mentioned in Section 5.3., the performance of TSX is significantly affected by the retry policy (e.g., the settings of the number of retries upon abort, and the choice of how to react to capacity aborts). While in our study we used the configuration that performed best on average, as shown in Table 8, significant speedups (up to 80%) with regard to the configuration used in our study can be achieved by ad-hoc tuning the retry policy for the specific workload — even more could be achieved by considering
the specific concurrency degree as well. Unfortunately, this is a tedious and error prone task that is not desirable to delegate to programmers. Hence, these findings highlight the relevance of devising solutions for adaptively tuning these parameters in an automated manner. The key challenge is how to do it with minimal overhead, given that the cost imposed by self-tuning approaches targeting STMs (based on complex machine-learning [37] or analytical models [13]) is going to be strongly amplified in HTM settings because there exists no instrumentation as in STMs. In the light of these findings, we have concurrently obtained some initial results with regard to this research direction [18].

- Our study has used (selective) manual instrumentation when considering both STMs and HyTMs, i.e. only the relevant subset of memory locations accessed in atomic blocks have been traced. As an alternative, one could rely on the compiler to automatically instrument atomic blocks with calls to the TM runtime. The plots in Fig. 5, which were obtained using the C++ TM extension integrated in GCC 4.8.2, show that non-selective instrumentations can impact performance by approximately 20% when using TinySTM. This is a consequence of the increase of the transaction footprint (up to 3x larger with SSCA2) caused by the “blind” instrumentation performed by GCC.

Not only these results unveil the possibility of optimizations in existing compiler’s support for STM, but also provide an additional compelling motivation to incorporate support for selective instrumentation in HTM. Indeed, we have shown that capacity exceptions are one of the key sources of aborts with HTM. Hence, techniques capable of achieving noticeable reductions of the transactions’ footprint are expected to strongly benefit HTM’s performance. These considerations open interesting research avenues investigating cross-layer mechanisms operating at the compiler and architectural level, and aimed at supporting selective instrumentation in a way that is both convenient for the programmer (i.e., possibly fully transparent) and sufficiently non-intrusive to simplify integration in existing architectures.
Table 8. Improvement of configuring TSX-GL for each workload compared to the single configuration used in our study.

<table>
<thead>
<tr>
<th>Speedup %</th>
<th>kmeans</th>
<th>ssc2a2</th>
<th>intruder</th>
<th>vacation</th>
<th>genome</th>
<th>yoda</th>
<th>labyrinth</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 threads</td>
<td>12</td>
<td>7</td>
<td>20</td>
<td>36</td>
<td>12</td>
<td>13</td>
<td>2</td>
</tr>
<tr>
<td>8 threads</td>
<td>5</td>
<td>8</td>
<td>80</td>
<td>21</td>
<td>2</td>
<td>55</td>
<td>39</td>
</tr>
</tbody>
</table>

Figure 5. Impact of GCC instrumentation.

9. Conclusions

This paper analyzed extensively the performance and energy efficiency of several state of the art TM systems. We compared different TM solutions (software, hardware and combinations thereof) among each other and against lock based systems. Our study demonstrates that the recent HTM implementation by Intel can strongly outperform any other synchronization alternative in a set of relevant workloads. On the other hand, it also identified some critical limitations of Intel TSX, and highlighted the robustness of state of the art STMs. These software implementations achieve performance competitive with fine-grained locking, and outperform HTM in workloads encompassing long and contention-prone transactions.

Furthermore we have shown that the performance of Hybrid TM, when used in combination with TSX, is normally quite disappointing; we determined that the root cause of this surprising result lies in the inefficiency of the mechanisms used to couple software and hardware transactions. Finally, our study allowed to identify a set of compelling research questions, which, we believe, should be timely addressed to increase the chances of turning HTM into a mainstream paradigm for parallel programming.

References


