I asked James Reinders at the Fall 2011 IDF whether the
synchronization features MIC inherits from the X86 architecture were adequate.
(Transcript;
see the very end.) He said that they were good for the 30 or so cores in
Knights Ferry, but when you got above 40, they would need to do something different.
Now Intel has announced
support for transactional memory in Haswell, the chip that follows their Ivy
Bridge chip, which is just starting shipment as I write this. So I think I’d now
be willing to take bets that this is what James was vaguely hinting at, and
will appear in Intel’s
MIC HPC architecture as it ships in the Knights Corner product. I prefer to
take bets on sure things.
There has been some light discussion of Intel’s
“Transactional Synchronization Extensions” (TSX), as this is formally called, and
a good example
of its use from James Reinders. But now that an architecture spec has been
published for TSX, we can get a bit deeper into what, exactly, Intel is
providing, and where there might just be a skeleton in the closet.
First, background: What is this “transactional memory”
stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I
believe is lurking.
Transactional Memory
The term “transaction” comes from contract law, was
picked up by banking, and from there went into database systems. It refers to a
collection of actions that all happen as a unit; they cannot be divided. If I give
you money and you give me a property deed, for example, that happens as if it
were one action – a transaction. The two parts can’t be (legally) separated;
both happen, or neither. Or, in the standard database example: when I transfer
money from my bank account to Aunt Sadie’s, the subtraction from my account and
the addition to hers must either both happen, or neither; otherwise money is
being either destroyed or created, which would be a bad thing. As you might
imagine, databases have evolved a robust technology to do transactions where
all the changes wind up on stable storage (disk, flash).
The notion of transactional memory is much the same: a
collection of changes to memory is made all-or-nothing: Either all of them
happen, as seen by every thread, process, processor, or whatever; or none of
them happen. So, for example, when some program plays with the pointers in a
linked list to insert or delete some list member, nobody can get in there when
the update is partially done and follow some pointer to oblivion.
It applies as much to a collection of accesses – reads –
as it does to changes – writes. The read side is necessary to ensure that a
consistent collection of information is read and acted upon by entities that
may be looking around while another is updating.
To do this, typically a program will issue something meaning
“Transaction On!” to start the ball rolling. Once that’s done, everything it
writes is withheld from view by all other entities in the system; and anything
it reads is put on special monitoring in case someone else mucks with it. The
cache coherence hardware is mostly re-used to make this monitoring work; cross-system
memory monitoring is what cache coherence does.
This continues, accumulating things read and written,
until the program issues something meaning “Transaction Off!”, typically called
“Commit!” Then, hraglblargarolfargahglug! All changes are vomited at once into
memory, and the locations read are forgotten about.
What happens if some other entity does poke its nose into
those saved and monitored locations, changing something the transactor was
depending on or modifying? Well, “Transaction On!” was really “Transaction On! And, by the way, if anything screws up go there.” On reaching there,
all the recording of data read and changes made has been thrown away; and there
is a block of code that usually sets things up to start again, going back to
the “Transaction On!” point. (The code could also decide “forget it” and not
try over again.) Quitting like this in a controlled manner is called aborting a transaction. It is obviously
better if aborts don’t happen a lot, just like it’s better if a lock is not
subject to a lot of contention. However, note that nobody else has seen any of
the changes made since “On!”, so half-mangled data structures are never seen by
anyone.
Why Transactions Are a Good Thing
What makes transactional semantics potentially more
efficient than simple locking is that only those memory locations actually referenced
(read or written) at run time are maintained
consistently during the transaction. The consistency does not apply to memory
locations that could be referenced,
only the ones that actually are
referenced.
There are situations where that’s a distinction without a
difference, since everybody who gets into some particular transaction-ized
section of code will bash on exactly the same data every time. Example: A global
counter of how many times some operation has been done by all the processors in
a system. Transactions aren’t any better than locks in those situations.
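In code, that counter case looks about like this. A minimal sketch in GCC's -fgnu-tm syntax (a language-level "Transaction On!" / "Commit!" wrapper; the variable and function names are mine):

```c
/* Shared counter bumped by every processor. Every thread's transaction reads
 * and writes the very same word, so any two that overlap must conflict; one
 * aborts and retries, and the increments serialize just as under a lock. */
static long op_count;

void count_one_op(void)
{
    __transaction_atomic {      /* compile with gcc -fgnu-tm */
        op_count++;
    }
}
```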
But there are cases where the dynamic nature of transactional
semantics can be a huge benefit. The standard example, also used by James
Reinders, is a multi-access hash table, with inserts, deletions, and lookups
done by many processes, etc.
I won’t go through this in detail – you can read James’
version if you like; he has a nice diagram of a hash table, which I don’t –
but consider:
With the usual lock semantics, you could simply have one
coarse lock around the whole table: Only one person, reader or writer, gets in at
any time. This works, and is simple, but all access to the table is now
serialized, so will cause a problem as you scale to more processors.
Alternatively, you could have a lock per hash bucket, for
fine-grained rather than coarse locking. That’s a lot of locks. They take up
storage, and maintaining them all correctly gets more complex.
Or you could do either of those – one lock, or many – but
also get out your old textbooks and try once again to understand those multiple
reader / single writer algorithms and their implications, and, by the way, did
you want reader or writer priority? Painful and error-prone.
On the other hand, suppose everybody – readers and
writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”)
before starting a read or a write; then does a “Commit!” when they exit. This
is only as complicated as the single coarse lock (and sounds a lot like an “atomic”
keyword on a class, hint hint).
Then what you can bank on is that the probability is tiny
that two simultaneous accesses will look at the same hash bucket; if that
probability is not tiny, you need a bigger hash table anyway. The most likely
thing to happen is that nobody – readers or writers – ever accesses the same hash
bucket, so everybody just sails right through, “Commit!”s, and continues, all
in parallel, with no serialization at all. (Not really. See the skeleton, later.)
In the unlikely event that a reader and a writer are
working on the same bucket at the same time, whoever “Commit!”s first wins; the
other aborts and tries again. Since this is highly unlikely, overall the
transactional version of hashing is a big win: it’s both simple and very highly
parallel.
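Here is roughly what that looks like, as a sketch of my own (not James’ or Intel’s code), again in GCC's -fgnu-tm form and with integer keys to keep it short. The software runtime behind it may or may not use the hardware support discussed below, but the shape of the code is the point:

```c
#include <stddef.h>

#define NBUCKETS 4096

struct entry { struct entry *next; long key; long value; };
static struct entry *table[NBUCKETS];

static unsigned bucket_of(long key) { return (unsigned)key % NBUCKETS; }

/* Readers and writers alike just wrap the bucket work in a transaction.
 * Touching different buckets never conflicts, so those all run in parallel;
 * only a genuine collision on one bucket causes an abort and a retry. */
long lookup(long key)
{
    unsigned b = bucket_of(key);
    long value = -1;                     /* -1 meaning "not found" */
    __transaction_atomic {
        for (struct entry *e = table[b]; e != NULL; e = e->next)
            if (e->key == key) { value = e->value; break; }
    }
    return value;
}

void insert(struct entry *e)             /* caller fills in e->key, e->value */
{
    unsigned b = bucket_of(e->key);
    __transaction_atomic {
        e->next = table[b];              /* the usual list splice, now all-or-nothing */
        table[b] = e;
    }
}
```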
Transactional memory is not, of course, the only way to skin
this particular cat. Azul Systems has
published a detailed presentation on a Lock-Free
Wait-Free Hash Table algorithm that uses only compare-and-swap
instructions. I got lost somewhere around the fourth state diagram. (Well, OK,
actually I saw the first one and kind of gave up.) Azul has need of such
things. Among other things, they sell massive Java compute appliances, going up
to the Azul Vega 3
7380D, which has 864 processors sharing 768GB of RAM. Think investment
banks: take that, you massively
recomplicated proprietary version of a Black-Scholes option pricing model! In
Java! (Those guys don’t just buy GPUs.)
However, Azul only needs that algorithm on their port of
their software stack to X86-based products. Their Vega systems are based on
their own proprietary 54-core Vega processors, which have shipped with
transactional memory – which they call Speculative Multi-address Atomicity –
since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly
new news.
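Just to give the flavor of the compare-and-swap style, and only the trivially easy part of it, nothing like Azul's real algorithm with its resizing and deletion machinery: a lock-free insert at the head of one bucket's list can look like this (C11 atomics, names mine).

```c
#include <stdatomic.h>

struct entry { struct entry *next; long key; long value; };

/* One bucket's head pointer, updated without any lock at all. */
static _Atomic(struct entry *) bucket_head;

void cas_insert(struct entry *e)
{
    struct entry *old = atomic_load(&bucket_head);
    do {
        e->next = old;       /* point the new entry at the current head...     */
    } while (!atomic_compare_exchange_weak(&bucket_head, &old, e));
                             /* ...and swing the head to it; if someone beat us
                              * to it, 'old' is refreshed and we simply retry. */
}
```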
Anyway, if you want this wait-free super-parallel hash
table (and other things, obviously) without exploding your head, transactional
memory makes it possible rather simply.
What Intel Has: RTM and HLE
Intel’s Transactional Synchronization Extensions come in
two flavors: Restricted Transactional Memory (RTM) and Hardware Lock Elision
(HLE).
RTM is essentially what I described above: There’s XBEGIN
for “Transaction On!”, XEND for “Commit!” and XABORT if you want to manually
toss in the towel for some reason. XBEGIN must be given a location to go
to in case of an abort. When an abort occurs, the processor state is restored
to what it was at XBEGIN, except that flags are set indicating the reason for
the abort (in EAX).
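In compiler terms, that maps onto the RTM intrinsics in immintrin.h roughly as follows. This is my sketch, not Intel's sample code; checking the fallback lock inside the transaction is the usual idiom for mixing transactions with a real lock, not something the hardware itself demands.

```c
#include <immintrin.h>    /* _xbegin / _xend / _xabort (GCC wants -mrtm, I believe) */

static volatile int fallback_lock = 0;          /* 0 = free, 1 = held */

static void lock_acquire(void) { while (__sync_lock_test_and_set(&fallback_lock, 1)) ; }
static void lock_release(void) { __sync_lock_release(&fallback_lock); }

void increment(long *counter)
{
    for (int tries = 0; tries < 3; tries++) {
        unsigned status = _xbegin();            /* "Transaction On!" */
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock)                  /* someone holds the real lock: */
                _xabort(0xff);                  /* bail out of the transaction  */
            (*counter)++;                       /* the transactional work       */
            _xend();                            /* "Commit!"                    */
            return;
        }
        /* We land here after an abort; 'status' carries the reason flags
         * (the EAX bits mentioned above). Just loop around and retry.   */
    }
    lock_acquire();                             /* too many aborts: plain old lock */
    (*counter)++;
    lock_release();
}
```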
HLE is a bit different. All the documentation I’ve seen
so far always talks about it first, perhaps because it seems like it is more
familiar, or they want to brag (no question, it’s clever). I obviously think
that’s confusing, so didn’t do it in that order.
HLE lets you take your existing, lock-based, code and
transactional-memory-ify it: Lock-based code now runs without blocking unless
required, as in the hash table example, with minimal, minuscule change that can
probably be done with a compiler and the right flag.
I feel like adding “And… at no time did their fingers
leave their hands!” It sounds like a magic trick.
In addition to being magical, it’s also clearly strategic
for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy
bet on people just wanting to recompile and run without the rewrites needed for
accelerators like GPGPUs. (See my post MIC
and the Knights.)
HLE works by setting a new instruction prefix – XACQUIRE –
on any instruction you use to try to acquire a lock. Doing so causes there to
be no change to the lock data: the
lock write is “elided.” Instead it (a) takes a checkpoint of the machine state;
(b) saves the address of the instruction that did this; (c) puts the lock
location in the set of data that is transactionally read; and (d) does a “Transaction
On!”
So everybody goes charging right through the lock without
stopping, but now every location read is continually monitored, and every write
is saved, not appearing in memory.
If nobody steps on anybody else’s feet – writes someone
else’s monitored location – then when the instruction to release the lock is
done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug
flush of all the saved writes into memory, forgets everything monitored, and
turns off transaction mode.
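In case you want to see what that looks like from the software side, GCC exposes HLE by letting you OR hint bits into its atomic builtins. A bare-bones elided spinlock sketch of mine (not Intel's code; I believe GCC also wants -mhle to actually emit the prefixes):

```c
static int hle_lock = 0;                         /* 0 = free, 1 = held */

void hle_acquire(void)
{
    /* The exchange gets an XACQUIRE prefix: the write of 1 is elided and a
     * transaction quietly begins. On a replayed, post-abort pass this is just
     * an ordinary test-and-set, so spin politely until the lock comes free. */
    while (__atomic_exchange_n(&hle_lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE) != 0)
        __builtin_ia32_pause();
}

void hle_release(void)
{
    /* The store gets an XRELEASE prefix: "Commit!" on the elided path,
     * a plain unlock on the ordinary one. */
    __atomic_store_n(&hle_lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```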
If somebody does write a location someone else has read,
then we get an ABORT with its wayback machine: back to the location that tried
to acquire the lock, restoring the CPU state, so everything is like it was just
before the lock acquisition instruction was done. This time, though, the write
is not elided: The usual semantics
apply, and the code goes through exactly what it did without TSX, the way it
worked before.
So, as I understand it, if a read of some hash bucket is under way
and a write to the same bucket happens, then both the read and the
write abort. One of those two gets the lock and does its thing, followed by the
other according to the original code. But other reads or writes that don’t have
conflicts go right through.
This seems like it will work, but I have to say I’d like
to see the data on real code. My gut tells me that anything which changes the
semantics of parallel locking, which HLE does, is going to have a weird effect
somewhere. My guess would be some fun, subtle, intermittent performance bugs.
The Serial Skeleton in the Closet
This is all powerful stuff that will certainly aid
parallel efficiency in both MIC, with its 30-plus cores, and the Xeon line,
with fewer but faster cores. (Fewer faster cores need it too, since serialization
inefficiency gets proportionally worse with faster cores.) But don’t think for
a minute that it eliminates all serialization.
I see no issue with the part of this that monitors locations
read and written; I don’t know Intel’s exact implementation, but I feel sure it
re-uses the cache coherence mechanisms already present, which operate without
(too) much serialization.
However, there’s a reason I used a deliberately disgusting
analogy when talking about pushing all the written data to memory on “Commit!”
(XEND, XRELEASE). Recall that the required semantics are “all or nothing”: Every entity in the system sees all of the changes, or every entity sees none of them. (I’ve been saying “entity” because GPUs are now prone
to directly access cache coherent memory, too.)
If the code has changed multiple locations during a
transaction, probably on multiple cache lines, that means those changes have to
be made all at once. If locations A and B both change, nobody can possibly see location
A after it changed but location B before it changed. Nothing, anywhere,
can get between the write of A and the write of B (or the making of both changes
visible outside of cache).
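A toy illustration of that guarantee, using the RTM intrinsics again and variable names I just made up: if A and B are only ever changed together inside a transaction, no transactional reader can ever see the new A next to the old B.

```c
#include <immintrin.h>

static long A;
static char pad[64];              /* keep B on a different cache line       */
static long B;                    /* invariant we maintain: A == B          */

void bump_both(void)
{
    unsigned s = _xbegin();
    if (s == _XBEGIN_STARTED) {
        A++;                      /* two separate lines change...           */
        B++;                      /* ...but become visible as one           */
        _xend();
    }
    /* abort/fallback path omitted for brevity; see the RTM sketch earlier */
}

int pair_is_consistent(void)      /* returns 1 always; never catches A != B */
{
    long a = 0, b = 0;
    unsigned s = _xbegin();
    if (s == _XBEGIN_STARTED) {
        a = A;
        b = B;
        _xend();
        return a == b;
    }
    return 1;                     /* aborted: saw nothing at all            */
}
```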
As I said, I don’t know Intel’s exact implementation, so
I could conceivably be wrong, but for me that implies that every “Commit!”
requires a whole system serialization
event: Every processor and thread in the whole system has to be not just
halted, but pipelines drained. Everything must come to a dead stop. Once
that stop is done, then all the changes can be made visible, and everything
restarted.
Note that Intel’s TSX architecture spec says nothing
about these semantics being limited to one particular chip or region. This is
very good; software exploitation would be far harder otherwise. But it implies
that in a multi-chip, multi-socket system, this halt and drain applies to every
processor in every chip in every socket. It’s a dead stop of everything.
Well, OK, but lock acquire and release instructions
always did this dead stop anyway, so likely the aggregate amount of
serialization is reduced. (Wait a minute, they always did this anyway?! What
the… Yeah. Dirty little secret of the hardware dudes.)
But lock acquire and release only involve one cache line
at a time. “Commit!” may involve many. Writes involve letting everybody else
know you’re writing a particular line, so they can invalidate it in their
cache(s). Those notifications all have to be sent out, serially, and
acknowledgements received. They can be pipelined, and probably are, but the
process is still serial, and must be done while at a dead stop.
So, if your transactional segment of code modifies, say,
128KB spread over 2K separate 64-byte cache lines, you can expect a noticeable bit of
serialization time when you “Commit!”. Don’t forget this issue now includes all
your old-style locking, thanks to HLE, where the original locking involved
updating just one cache line. This is another reason I want to see some real
running code with HLE. Who knows what evil lurks between the locks?
But as I said, I don’t know the implementation. Could
Intel folks have found a way around this? Maybe; I’m writing this, as I’ve
indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell
ships.
Enjoy your bright, shiny, new, non-blocking transactional
memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug.
It bites.
8 comments:
Does this work the same way as the transactional memory on the IBM BG/Q processor?
BG/Q - I've not read any architecture documents on that, but my understanding is that it does, except that in BG/Q it's limited to working within a single chip, not across multiple chips in a node. This makes software exploitation more difficult.
However, I'm not certain that BG/Q has cache coherence across the multiple chips in a node, so transactional memory across the chips wouldn't be as useful anyway.
BG/Q was designed to have only one chip per node, so I don't see how it's a problem that TM only works in this context. Similarly, the question of cache coherence between multiple chips in a node is irrelevant because this configuration will never exist.
Not sure why you assume that cache invalidates to take ownership have to happen at "commit" as opposed to when the write happens. That might cause excessive serialization at release but is not necessary.
I might be overlooking something obvious, but it seems to me that you could solve this problem with another state in your coherence protocol. There's no need for a pipeline flush -- anything already brought in from DRAM is still good to compute with, until it points to a memory location not yet brought into the processor, at which point it sees an invalidation and rolls out to the abort block. It seems that what you *can't* have is actual instruction retirement from a transactional region when another has acquired the xaction? Regardless, the cache coherence protocol -- as you note at the beginning -- ought provide most of the mechanics for this.
1. Reads outside locked region would show inconsistent state before. I think the code that reads values outside "Transaction On!" is not guaranteed to NOT see B inconsistent with A. Too many nots? OK, only transactional code is required to see A and B consistent.
2. If I wanted to modify 2 out of 20 fields in an object, the lock-based solution would need to create a copy of the object with the 2 fields modified by a single thread, then "swap-in" the mutated object. The big win of transactional memory is that now we don't need to create new objects to represent the whole state of the system anymore, we can mutate the state in place.
2 - Why would the lock-based solution need an extra copy? In fact I suspect the transactional one is the one with the extra copy. All you need is a 20-bit mask marking what is locked and what is not!?