Monday, February 13, 2012

Transactional Memory in Intel Haswell: The Good, and a Possible Ugly

Note: Part of this post is being updated based on new information received after it was published. Please check back tomorrow, 2/18/12, for an update. 

Sorry... no update yet (2/21/12). Bad cold. Also seduction / travails of a new computer. Update soon.

I asked James Reinders at the Fall 2011 IDF whether the synchronization features he had from the X86 architecture were adequate for MIC. (Transcript; see the very end.) He said that they were good for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something different.

Now Intel has announced support for transactional memory in Haswell, the chip which follows their Ivy Bridge chip that is just starting shipment as I write this. So I think I’d now be willing to take bets that this is what James was vaguely hinting at, and will appear in Intel’s MIC HPC architecture as it ships in Knight’s Corner product. I prefer to take bets on sure things.

There have been some light discussion of Intel’s “Transactional Synchronization Extensions” (TSE), as this is formally called, and a good example of its use from James Reinders. But now that an architecture spec has been published for TSE, we can get a bit deeper into what, exactly, Intel is providing, and where there might just be a skeleton in the closet.

First, background: What is this “transactional memory” stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I believe is lurking.

Transactional Memory

The term “transaction” comes from contract law, was picked up by banking, and from there went into database systems. It refers to a collection of actions that all happen as a unit; they cannot be divided. If I give you money and you give me a property deed, for example, that happens as if it were one action – a transaction. The two parts can’t be (legally) separated; both happen, or neither. Or, in the standard database example: when I transfer money from my bank account to Aunt Sadie’s, the subtraction from my account and the addition to hers must either both happen, or neither; otherwise money is being either destroyed or created, which would be a bad thing. As you might imagine, databases have evolved a robust technology to do transactions where all the changes wind up on stable storage (disk, flash).

The notion of transactional memory is much the same: a collection of changes to memory is made all-or-nothing: Either all of them happen, as seen by every thread, process, processor, or whatever; or none of them happen. So, for example, when some program plays with the pointers in a linked list to insert or delete some list member, nobody can get in there when the update is partially done and follow some pointer to oblivion.

It applies as much to a collection of accesses – reads – as it does to changes – writes. The read side is necessary to ensure that a consistent collection of information is read and acted upon by entities that may be looking around while another is updating.

To do this, typically a program will issue something meaning “Transaction On!” to start the ball rolling. Once that’s done, everything it writes is withheld from view by all other entities in the system; and anything it reads is put on special monitoring in case someone else mucks with it. The cache coherence hardware is mostly re-used to make this monitoring work; cross-system memory monitoring is what cache coherence does.

This continues, accumulating things read and written, until the program issues something meaning “Transaction Off!”, typically called “Commit!.” Then, hraglblargarolfargahglug! All changes are vomited at once into memory, and the locations read are forgotten about.

What happens if some other entity does poke its nose into those saved and monitored locations, changing something the transactor was depending on or modifying? Well, “Transaction On!” was really “Transaction On! And, by the way, if anything screws up go there.” On reaching there, all the recording of data read and changes made has been thrown away; and there is a block of code that usually sets things up to start again, going back to the “Transaction On!” point. (The code could also decide “forget it” and not try over again.) Quitting like this in a controlled manner is called aborting a transaction. It is obviously better if aborts don’t happen a lot, just like it’s better if a lock is not subject to a lot of contention. However, note that nobody else has seen any of the changes made since “On!”, so half-mangled data structures are never seen by anyone.

Why Transactions Are a Good Thing

What makes transactional semantics potentially more efficient than simple locking is that only those memory locations read or referenced at run time are maintained consistently during the transaction. The consistency does not apply to memory locations that could be referenced, only the ones that actually are referenced.

There are situations where that’s a distinction without a difference, since everybody who gets into some particular transaction-ized section of code will bash on exactly the same data every time. Example: A global counter of how many times some operation has been done by all the processors in a system. Transactions aren’t any better than locks in those situations.

But there are cases where the dynamic nature of transactional semantics can be a huge benefit. The standard example, also used by James Reinders, is a multi-access hash table, with inserts, deletions, and lookups done by many processes, etc.

I won’t go through this is detail – you can read James’ version if you like; he has a nice diagram of a hash table, which I don’t – but consider:

With the usual lock semantics, you could simply have one coarse lock around the whole table: Only one person, read or write, gets in at any time. This works, and is simple, but all access to the table is now serialized, so will cause a problem as you scale to more processors.

Alternatively, you could have a lock per hash bucket, for fine-grained rather than coarse locking. That’s a lot of locks. They take up storage, and maintaining them all correctly gets more complex.

Or you could do either of those – one lock, or many – but also get out your old textbooks and try once again to understand those multiple reader / single writer algorithms and their implications, and, by the way, did you want reader or writer priority? Painful and error-prone.

On the other hand, suppose everybody – readers and writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”) before starting a read or a write; then does a “Commit!” when they exit. This is only as complicated as the single coarse lock (and sounds a lot like an “atomic” keyword on a class, hint hint).

Then what you can bank on is that the probability is tiny that two simultaneous accesses will look at the same hash bucket; if that probability is not tiny, you need a bigger hash table anyway. The most likely thing to happen is that nobody – readers or writers – ever accesses same hash bucket, so everybody just sails right through, “Commit!”s, and continues, all in parallel, with no serialization at all. (Not really. See the skeleton, later.)

In the unlikely event that a reader and a writer are working on the same bucket at the same time, whoever “Commit!”s first wins; the other aborts and tries again. Since this is highly unlikely, overall the transactional version of hashing is a big win: it’s both simple and very highly parallel.

Transactional memory is not, of course, the only way to skin this particular cat. Azul Systems has published a detailed presentation on a Lock-Free Wait-Free Hash Table algorithm that uses only compare-and-swap instructions. I got lost somewhere around the fourth state diagram. (Well, OK, actually I saw the first one and kind of gave up.) Azul has need of such things. Among other things, they sell massive Java compute appliances, going up to the Azul Vega 3 7380D, which has 864 processors sharing 768GB of RAM. Think investment banks: take that, you massively recomplicated proprietary version of a Black-Sholes option pricing model! In Java! (Those guys don’t just buy GPUs.)

However, Azul only needs that algorithm on their port of their software stack to X86-based products. Their Vega systems are based on their own proprietary 54-core Vega processors, which have shipped with transactional memory – which they call Speculative Multi-address Atomicity – since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly new news.

Anyway, if you want this wait-free super-parallel hash table (and other things, obviously) without exploding your head, transactional memory makes it possible rather simply.

What Intel Has: RTE and HLE

Intel’s Transactional Synchronization Extensions come in two flavors: Restricted Transactional Memory (RTE) and Hardware Lock Elision (HLE).

RTE is essentially what I described above: There’s XBEGIN for “Transaction On!”, XEND for “Commit!” and ABORT if you want to manually toss in the towel for some reason. XBEGIN must be given a there location to go to in case of an abort. When an abort occurs, the processor state is restored to what it was at XBEGIN, except that flags are set indicating the reason for the abort (in EAX).

HLE is a bit different. All the documentation I’ve seen so far always talks about it first, perhaps because it seems like it is more familiar, or they want to brag (no question, it’s clever). I obviously think that’s confusing, so didn’t do it in that order.

HLE lets you take your existing, lock-based, code and transactional-memory-ify it: Lock-based code now runs without blocking unless required, as in the hash table example, with minimal, miniscule change that can probably be done with a compiler and the right flag.

I feel like adding “And… at no time did their fingers leave their hands!” It sounds like a magic trick.

In addition to being magical, it’s also clearly strategic for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy bet on people just wanting to recompile and run without the rewrites needed for accelerators like GPGPUs. (See my post MIC and the Knights.)

HLE works by setting a new instruction prefix – XACQUIRE – on any instruction you use to try to acquire a lock. Doing so causes there to be no change to the lock data: the lock write is “elided.” Instead it (a) takes a checkpoint of the machine state; (b) saves the address of the instruction that did this; (c) puts the lock location in the set of data that is transactionally read; and (d) does a “Transaction On!”

So everybody goes charging right through the lock without stopping, but now every location read is continually monitored, and every write is saved, not appearing in memory.

If nobody steps on anybody else’s feet – writes someone else’s monitored location – then when the instruction to release the lock is done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug flush of all the saved writes into memory, forgets everything monitored, and turns off transaction mode.

If somebody does write a location someone else has read, then we get an ABORT with its wayback machine: back to the location that tried to acquire the lock, restoring the CPU state, so everything is like it was just before the lock acquisition instruction was done. This time, though, the write is not elided: The usual semantics apply, and the code goes through exactly what it did without TSE, the way it worked before.

So, as I understand it, if you have a hash table and read is under way, if a write to the same bucket happens then both the read and the write abort. One of those two gets the lock and does its thing, followed by the other according to the original code. But other reads or writes that don’t have conflicts go right through.

This seems like it will work, but I have to say I’d like to see the data on real code. My gut tells me that anything which changes the semantics of parallel locking, which HLE does, is going to have a weird effect somewhere. My guess would be some fun, subtle, intermittent performance bugs.

The Serial Skeleton in the Closet

This is all powerful stuff that will certainly aid parallel efficiency in both MIC, with it’s 30-plus cores; and the Xeon line, with fewer but faster cores. (Fewer faster cores need it too, since serialization inefficiency gets proportionally worse with faster cores.) But don’t think for a minute that it eliminates all serialization.

I see is no issue with the part of this that monitors locations read and written; I don’t know Intel’s exact implementation, but I feel sure it re-uses the cache coherence mechanisms already present, which operate without (too) much serialization.

However, there’s a reason I used a deliberately disgusting analogy when talking about pushing all the written data to memory on “Commit!” (XEND, XRELEASE). Recall that the required semantics are “all or nothing”: Every entity in the system sees all of the changes, or every entity sees none of them. (I’ve been saying “entity” because GPUs are now prone to directly access cache coherent memory, too.)

If the code has changed multiple locations during a transaction, probably on multiple cache lines, that means those changes have to be made all at once. If locations A and B both change, nobody can possibly see location A after it changed but location B before it changed. Nothing, anywhere, can get between the write of A and the write of B (or the making of both changes visible outside of cache).

As I said, I don’t know Intel’s exact implementation, so could conceivably be wrong, but that for me that implies that every “Commit!” requires a whole system serialization event: Every processor and thread in the whole system has to be not just halted, but pipelines drained. Everything must come to a dead stop. Once that stop is done, then all the changes can be made visible, and everything restarted.

Note that Intel’s TSE architecture spec says nothing about these semantics being limited to one particular chip or region. This is very good; software exploitation would be far harder otherwise. But it implies that in a multi-chip, multi-socket system, this halt and drain applies to every processor in every chip in every socket. It’s a dead stop of everything.

Well, OK, but lock acquire and release instructions always did this dead stop anyway, so likely the aggregate amount of serialization is reduced. (Wait a minute, they always did this anyway?! What the… Yeah. Dirty little secret of the hardware dudes.)

But lock acquire and release only involve one cache line at a time. “Commit!” may involve many. Writes involve letting everybody else know you’re writing a particular line, so they can invalidate it in their cache(s). Those notifications all have to be sent out, serially, and acknowledgements received. They can be pipelined, and probably are, but the process is still serial, and must be done while at a dead stop.

So, if your transactional segment of code modifies, say, 128KB spread over 512K cache lines, you can expect a noticeable bit of serialization time when you “Commit!”. Don’t forget this issue now includes all your old-style locking, thanks to HLE, where the original locking involved updating just one cache line. This is another reason I want to see some real running code with HLE. Who knows what evil lurks between the locks?

But as I said, I don’t know the implementation. Could Intel folks have found a way around this? Maybe; I’m writing this, as I’ve indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell ships.

Enjoy your bright, shiny, new, non-blocking transactional memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug. It bites.


Anonymous said...

Does this work the same way as the transactional memory on the IBM BG/Q processor ?

Greg Pfister said...

BG/Q - I've not read any architecture documents on that, but my understanding is that it does, except that in BG/Q it's limited to working within a single chip, not across multiple chips in a node. This makes software exploitation more difficult.

However, I'm not certain that BG/Q has cache coherence across the multiple chips in a node, so transactional memory across the chips wouldn't be as useful anyway.

jss said...

nice post, spectacular blog

Jeff Hammond said...

BG/Q was designed to have only one chip per node, so I don't see how it's a problem that TM only works in this context. Similarly, the question of cache coherence between multiple chips in a node is irrelevant because this configuration will never exist.

Anonymous said...

No sure why you assume that cache invalidates to take ownership have to happen at "commit" as opposed to when the write happens. That might cause excessive serialization at release but is not neccessary.

nick black said...

I might be overlooking something obvious, but it seems to me that you could solve this problem with another state in your coherence protocol. There's no need for a pipeline flush -- anything already brought in from DRAM is still good to compute with, until it points to a memory location not yet brought into the processor, at which point it sees an invalidation and rolls out to the abort block. It seems that what you *can't* have is actual instruction retirement from a transactional region when another has acquired the xaction? Regardless, the cache coherence protocol -- as you note at the beginning -- ought provide most of the mechanics for this.

Sassa said...

1. Reads outside locked region would show inconsistent state before. I think the code that reads values outside "Transaction On!" is not guaranteed to NOT see B inconsistent with A. Too many nots? OK, only transactional code is required to see A and B consistent.

2. If I wanted to modify 2 out of 20 fields in a object, the lock-based solution would need to create a copy of the object with the 2 fields modified by a single thread, then "swap-in" the mutated object. The big win of transactional memory is that now we don't need to create new objects to represent the whole state of the system anymore, we can mutate the state in place.

root@boy said...

2- why the lock-based solution need extra copy. in fact i suspect the transactional one to have extra copy. all you need is 20bit mask for marking what is locked and what not !?!?!?!?!?!?!

Post a Comment

Thanks for commenting!