
How hardware and software will screw you over

This is not a tutorial on threading. This is not a tutorial on SMP. This is not a tutorial on compilers. This is not a tutorial on synchronization primitives. It's assumed you have some inkling of all of these already. This IS a quick rundown of the details of how each of those can and does actually function, and how they will affect semantics.

Starting thoughts

Consider a locked region. Here is what a system MUST ensure for a mutex to achieve mutual exclusion.

  1. writes after a lock() do not cross to being before the lock()
  2. writes before an unlock() do not cross to being after the unlock()
  3. reads after a lock() do not cross to being before the lock()
  4. reads before an unlock() do not cross to being after the unlock()
That is ALL. Note in particular that this does not preclude things from migrating around the locked region, it does not preclude things from migrating into the locked region, and it does not enforce ordering within the locked region (the sketch below illustrates one such allowed migration).
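
To make that concrete, here is a minimal sketch (the pthread usage and all the names are my own illustration, not from any particular codebase) of a transformation these four rules still permit:

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int shared_count;
    int unrelated;

    void bump(void)
    {
        unrelated = 42;             /* appears before the lock() in the source */
        pthread_mutex_lock(&m);
        shared_count++;
        pthread_mutex_unlock(&m);
    }

    /* Rule 1 only forbids moving the writes that come AFTER lock() to before
     * it.  Nothing forbids the store to 'unrelated' from sinking INTO the
     * critical section, and the mutex still provides mutual exclusion. */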

It's important to remember that any mutation of your code not precluded by these rules *could* in the future be performed by some brilliant compiler, or some brilliant new architecture. This means that you should be suspicious of any code which requires more than these basic rules of synchronization to operate correctly. That being said, we're now going to launch into the details of how these things are preserved, and what other semantics are actually preserved in practice.

C Compilers

The basics

So: C semantics are defined in terms of an abstract machine, and only the observable behavior of that abstract machine is promised. This means that the C compiler can legally, completely change how a variable is represented. It can cache a variable in a register whenever it wants. It can use the stack arbitrarily. It can decide to compile your code to Lisp and run it on a Lisp interpreter. The physical representation of programs and data IS NOT GUARANTEED. Once you grok this it would appear to be *impossible* to write multithreaded code at all, but we have a couple of saving graces.

First, whenever a call crosses compilation units the compiler cannot know anything about what state the called function might expect to see, or in what form it might expect to see it; similarly it can't tell how that state might get written back out. As a result it is forced to represent things in a completely standard way whenever making such calls. This is especially true when you are linking to a different language such as assembly code. As a result the state of your program is defined when it crosses that call. Note - to see just *how* it's defined you will need to spend a long time grokking "aliasing", and you should look up "sequence points".
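
As an illustration (the function and variable names are invented for this example), consider a global the compiler would otherwise be happy to keep in a register across a loop:

    /* log_progress() lives in another compilation unit, so the compiler
     * cannot see its body and must assume it may read or write any
     * global - including 'progress'. */
    extern void log_progress(void);

    int progress;

    void work(void)
    {
        for (int i = 0; i < 1000; i++) {
            progress = i;      /* must really be stored to memory before the
                                  call, since log_progress() might read it */
            log_progress();    /* any later use of 'progress' must also reload
                                  it, since log_progress() might have changed it */
        }
    }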

Second, many compilers (like gcc) also have ways to force behavior similar to crossing compilation boundaries, but inline to the code. In gcc this can be done with a "memory" clobber inside of a volatile asm statement.
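
For example, this is the classic gcc-style compiler barrier; the macro name here is mine, though the Linux kernel defines essentially the same thing as barrier():

    /* An empty asm with a "memory" clobber: it emits no instructions, but it
     * tells gcc that memory may have been read or written, so values cached
     * in registers must be stored before it and reloaded after it. */
    #define compiler_barrier() __asm__ __volatile__("" ::: "memory")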

Note that volatile variables are NOT in our list of saving graces. It will become clear in the hardware section just why this is the case. Volatile does give us most of the properties we need to force the compiler to do things in the right order though. The exact definition of volatile is that it will:

  1. Force reads of the variable that appear after the previous sequence point to actually occur after that sequence point (they cannot be hoisted above it)
  2. Force writes and reads from the variable to go to/from memory directly and not be cached
  3. Force the physical representation of the variable to reflect the virtual representation of that variable at each sequence point.
The symmetric rule to #1 is already ensured by C's normal semantics: writes will occur before the next sequence point (though this isn't useful for parallel programming without rule #3). Another important implication is that operations on volatile objects (if well ordered to begin with) cannot be re-ordered with respect to each other.

Now, exactly how a compiler enforces these rules is an interesting and complicated question. A lot of code that uses volatile depends on semantics that aren't strictly guaranteed by what I just stated - for instance that a 32-bit variable is not written out as two 16-bit halves side-by-side. We will come back to volatile after talking about hardware.
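
As a quick sketch of what volatile does buy you from the compiler (and only from the compiler - the hardware side is a separate problem, covered below), consider polling a flag that some other thread sets; the names here are my own illustration:

    volatile int done;              /* set to 1 by some other thread */

    void wait_for_done(void)
    {
        /* Because 'done' is volatile the compiler must reload it from memory
         * on every iteration.  Without volatile it could legally load it
         * once, keep it in a register, and spin forever. */
        while (!done)
            ;
    }

Note that this says nothing about when another processor's write to done becomes visible, nor about its ordering relative to any other variable.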

So, going back to our original set of guarantees. The compiler itself will not in fact re-order things into a locked region or across a locked region. The reason is that a mutex is implemented with one of the aforementioned "saving graces", which block the compiler from having any knowledge about the machine state before and after the "lock()" and "unlock()" operations. In other words, the C language has a very brute-force and heavy-handed way of stopping re-ordering, not as neat and fine-grained as the theoretical rules spoken of at the beginning. It does mean though that it's quite simple: things will not reorder around lock operations. End of story.

Hardware

Basically, the story for hardware looks a lot like the story for compilers. The guarantees are defined with respect only to a single processor running on the system. In that world all of your state will always be consistent and proper. Your instructions may be executed out of order, your memory accesses may happen at random times, and things may be kept in cache and never written to memory - but everything will look just like it was running in a clean sequential order from your perspective, which is all that matters.

Once you get two processors in the mix though, this is no longer the case. All of this cute reordering magic that helps our systems run fast isn't synchronized across the processors. One processor doesn't know what the other is doing in its cache, and having it know would slow the machine down substantially. So, instead the designers of hardware have provided a few special instructions that let us ensure things are consistent when we need them to be.

Okay, so let's go back to our theoretical mutex case spoken of earlier. How does hardware get those 4 properties that we needed to make a mutex work? We need four operations mapping to our 4 constraints (I made up these names for the sake of explanation):

  1. supfence - stores after it may not move up to before it
  2. sdownfence - stores before it may not move down to after it
  3. lupfence - loads after it may not move up to before it
  4. ldownfence - loads before it may not move down to after it

On most architectures we do not actually have all of these as separate operations. Instead we have "sfence" which is like supfence and sdownfence in one, and we have "lfence" which is like lupfence and ldownfence in one, and lastly we have "mfence" which is just lfence and sfence in one. So mfence basically blocks any motion of memory accesses across it at all.
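
On x86, for example, these can be wrapped up as follows (gcc-style inline asm; the "memory" clobber also makes each one a compiler barrier, which matters for reasons discussed below):

    /* x86 fence instructions wrapped as inline functions.  The "memory"
     * clobber keeps the compiler from reordering memory accesses across
     * them as well. */
    static inline void sfence(void) { __asm__ __volatile__("sfence" ::: "memory"); }
    static inline void lfence(void) { __asm__ __volatile__("lfence" ::: "memory"); }
    static inline void mfence(void) { __asm__ __volatile__("mfence" ::: "memory"); }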

Outside of this you should consider it fair game for the processor, just as for the compiler, to reorder and cache anything it wants any time it wants.

Specific Archs

Now, a lot of people are only concerned with writing for x86. Here are several useful things to know about x86. As of the Pentium 3 the cache on the x86 was in fact coherent. This meant that the semantics for two processors were the same as for one. Any time you access memory, you can expect your state to be there as advertised. YAY!

Then came the P4. The P4 introduced what is called the "write pipe". This is simply a pipeline in which writes are buffered. See, what the designers noticed is that people often mutate the same variable many times in a row, such as a loop counter in a tight loop. So why write each of these changes out to memory? Instead we buffer changes over time; when we notice that we are overwriting a change that's still sitting in the write pipe, we can just drop the first change and only do the second one. What a great idea, right?

Well, this suddenly meant that writes could be dropped entirely, and never hit RAM! All our writes ARE still ordered though (in theory). It should be noted that there are already proofs floating around that x86 implementations (including Intel's) do not conform strictly to the ordering spec - so be wary of depending on this. I'd also hazard a guess that AMD has already intentionally broken this.
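
The classic way to see why a write pipe matters is the "store buffering" litmus test; in this sketch (all names invented, and the compiler issues from earlier are ignored) the two functions run simultaneously on two different processors:

    int x, y;                       /* both start at 0 */
    int r1, r2;

    void cpu0(void) { x = 1; r1 = y; }
    void cpu1(void) { y = 1; r2 = x; }

    /* Each CPU's store can still be sitting in its write pipe when it does
     * its load, so the outcome r1 == 0 && r2 == 0 is possible - an outcome
     * no interleaving of the source lines could ever produce.  An mfence
     * between the store and the load in each function rules it out. */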

Alpha is a very useful case to know about. This is not because anyone actually uses Alpha systems anymore, but because it's the loosest caching system yet implemented. Alpha has a write pipe - and prefetching logic. When a write goes out to RAM, it does not necessarily get mirrored into the other processors' caches at all, and they will not check main memory for a value they already have cached unless instructed to do so. Also, Alpha does not guarantee that writes happen in order to memory; they can happen in arbitrary order as long as uniprocessor semantics are maintained. Additionally, Alphas actually support a separation of the instructions discussed earlier - meaning that they understand the difference between an lupfence and an ldownfence. This allowed their memory bus to be one of the fastest in existence for a long time. It also means that if you design with the Alpha in mind, you are unlikely to be caught out by a looser architecture - so your code will be as portable as anyone's ever can be.
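
A sketch of the kind of surprise this allows (names invented; compiler issues ignored, this is purely about the hardware): publish some data behind a flag, and the reader can still see stale data even after it has seen the flag, because its cache was never told the data changed:

    int data;
    int ready;                      /* both start at 0 */
    int result;

    void writer(void)               /* runs on CPU 0 */
    {
        data = 42;
        /* even with a store fence here, so the writes leave CPU 0 in order... */
        ready = 1;
    }

    void reader(void)               /* runs on CPU 1 */
    {
        while (!ready)
            ;
        /* ...on Alpha, CPU 1 also needs a read fence here, or its cache may
         * still hand back the old value of 'data' even though it has already
         * seen ready == 1. */
        result = data;
    }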

Now, as one other useful datapoint, it should be noted that AMD is using a bus architecture derived from the last Alpha systems, that is, the EV7. This means it would not be a surprise at all to see some of the same loose caching semantics on AMD as you see on the EV7. Sadly, I do not actually know the answer to that question.

Compiler + Hardware

Okay, so now we know how to make our compiler emit code that won't reorder. And we know how to keep our hardware from reordering. How do we make our compiler emit code that makes sure our hardware does not reorder?

GOOD QUESTION! Turns out the answer is... you don't. You must write these instructions in straight-up assembly, either inline or not, and use the magic described earlier to keep the compiler from reordering your hardware fencing instructions. Note that this implies that volatile DOES NOT DO THIS! This means that volatile variables, by themselves, achieve nothing with respect to threading. You've ensured your compiler can't reorder the instructions, but your hardware will happily achieve the same broken result anyway! You can actually use a volatile variable, but it must be surrounded appropriately by these fencing instructions. Since you are doing that anyway, it turns out to be far simpler and less error prone to just handle the variable in assembly as well. This makes the semantics obvious and trivial, without the compiler getting the chance to trip you up.
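
Putting the pieces together, here is a minimal sketch of the "volatile variable surrounded by fencing instructions" approach (x86 and gcc-style inline asm assumed; all names are mine):

    volatile int flag;

    void publish(void)
    {
        /* ... fill in the data being published ... */
        __asm__ __volatile__("mfence" ::: "memory");   /* hardware fence + compiler barrier */
        flag = 1;                                      /* volatile: the store really happens here */
    }

    int consume(void)
    {
        int f = flag;                                  /* volatile: the load really happens here */
        __asm__ __volatile__("mfence" ::: "memory");
        return f;         /* if f == 1, reads done after this will see the published data */
    }

As noted above, writing the whole thing as an assembly routine is usually even simpler, since then the compiler is out of the loop entirely.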

Okay, so you think you've got it

I used to think that technically what I've just told you is everything you should need to know to write this kind of code. There are a LOT of implications though. Eventually I will add a page discussing some of these.

Sadly, there's a paper below that (I think) proves the statement above wrong (see x86-TSO). It describes how the x86 specs (AMD's and Intel's both) don't match the hardware, don't match each other, and are incomplete and imprecise. The authors also propose a model that we should be coding to, one that does match current hardware - so if you want to know what the hardware really does, read that one.

Where to learn more

If you search around the internet, much of what you'll find on this topic is wrong. No article on threading would be complete without a reference to Boehm. Just remember: he's right, and everyone else (including me) is probably wrong. Here are a few sources that are correct, or at least definitional.

- mbrewer