# Interesting modifications to the Lamport queue

While researching lock-free queue algorithms, I came across a few articles that made some interesting modifications to the Lamport queue. One made it more efficient by exploiting C11’s new memory model, while another made it more efficient by using cache locality. As I found the first one to be more interesting, and the refinements more useful for general multi-threaded programming, I thought I’d explain that one in a bit more detail.

The TL;DR: I

• provide a brief explanation of lock-free queue categories
• explain an article by Nhat Minh Le et al. in programmer-ese
• provide their improvement upon the Lamport queue, with code
• show why alternative approaches don’t work
• explain what a data race is, and how the C11 memory model fits in and addresses it

The article I will explain part is “Correct and Efficient Bounded FIFO Queues” by Nhat Minh Le et al., which you can find here. I have set up a Git repository on GitHub with the C11 code. In order to build it, you need at least GCC version 4.9. In order to test it properly, you need something like Thread Sanitizer, which is included in GCC and used by the provided Makefile — but the article itself contains ample proof that the code will work.

Let’s first take a look at the Lamport queue, as originally presented by Lamport in “Proving the Correctness of Multiprocess Programs” (available here) as an example of a lock-free queue. It wasn’t ostensibly designed to be particularly efficient but rather as a nice, simple, easy-to-analyse example of a multi-process program1.

The code for Lamport’s queue, translated to C11, looks like this:

```12 13 14 15 16 17 struct LamportQueue { atomic_size_t front_; atomic_size_t back_; T data_[SIZE]; };```

This defines the structure of the queue itself. The queue is a lock-free single-producer/single-consumer (SPSC) single-in/single-out (SISO) FIFO queue.
This is where you say “What does that mean?”.

Queues are classified along various categories, according to the guarantees they give you. Among various others (some of which I will discuss below), there is the question of “how many threads can push something into the queue at the same time?”, rephrased as single-producer, or multi-producer because generally, if you can push with two threads at the same time, you can push with three threads at the same time, etc.2.

Analogously, you can ask “with how many threads can I pop stuff from the queue at the same time?”, rephrased as single-consumer or multi-consumer. With these two questions answered, we now have four classes of queue algorithms: SPSC, MPSC, SPMC and MPMC. If you go out looking for queue algorithms, you’ll find the SPSC kind is the most ubiquitous.

A second set of questions you can ask is “how many values can I push into the queue at the same time?”, rephrased as single-in vs. multi-in — and conversely “how many values can I pop from the queue at the same time?”, rephrased as single-out or multi-out. Most queues (lock-free or not) are SISO, but there are also SIMO, MISO and MIMO queues.

A third question is about the order of the things that go in vs. the things that go out. Basically, there are three orders: first-in-first-out (FIFO), last-in-first-out (LIFO — also sometimes called first-in-last-out or FILO, this is basically a stack) and undetermined which basically means you don’t know but in which case there’s generally a note saying something like “FIFO in the general case” indicating that, while we can’t guarantee a specific order, it will generally look like this…

Now, I almost glossed over the “lock-free” part. Gotsman et al. provide a nice classification of non-blocking algorithms:

Wait-freedom:
Every running thread is guaranteed to complete its operation, regardless of the execution speeds of the other threads. Wait-freedom ensures the absence of livelock and starvation.
Lock-freedom:
From any point in a program’s execution, some thread is guaranteed to complete its operation. Lock-freedom ensures the absence of livelock, but not starvation.
Obstruction-freedom
Every thread is guaranteed to complete its operation provided it eventually executes in isolation. In other words, if at some point in a program’s execution we suspend all threads except one, then this thread’s operation will terminate.

Wait-freedom is the Holy Grail of non-blocking algorithms: if you can find a non-trivial wait-free algorithm that suits a general need, you will have earned the respect of many a programmer. Lamport’s algorithm is actually wait-free, but it has the caveat of failing when the queue is full/empty (which is OK in many cases, but in some cases, it means the producer has to loop back and wait for there to be space available, so the algorithm really becomes lock-free rather than wait-free)3.

Let’s get back to the code. Initializing the structure is straight-forward:

```19 20 21 22 23 void LamportQueue_init(struct LamportQueue *queue) { atomic_init(&queue->front_, 0); atomic_init(&queue->back_, 0); }```

Pushing into the queue is interesting: as Nhat Minh Le et al. describe it, each end (the pushing and pulling end) can consider one of the two indices as own and the other as foreign. A process has the right to read and modify its own index, but can only read the foreign one — no modification allowed there. Keeping that in mind, you just have to decide which end you push onto (we’ll take the tail) and which end you pop off of (the head). Hence, within `push` the tail, or `back_` is owned while the head, or `front_` is foreign, and vice-versa for pop.

So, pushing is a matter of getting the location to store the data to (line 28), checking whether it is available (lines 29 through 35), putting the data there (line 36), and publishing the fact that the data is there by incrementing the appropriate index (line 37).

```25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 bool LamportQueue_push(struct LamportQueue *queue, T elem) { size_t b, f; b = atomic_load_explicit(&queue->back_, memory_order_seq_cst); f = atomic_load_explicit(&queue->front_, memory_order_seq_cst); if ((b + 1) % SIZE == f) { return false; } else { /* not full */ } queue->data_[b] = elem; atomic_store_explicit(&queue->back_, (b + 1) % SIZE, memory_order_seq_cst); return true; }```

Popping is, of course, similar to pushing: read the place where the data should be (line 44), check whether the queue isn’t empty (lines 45 through 51) read the data (line 52) and publish the fact that it’s been read by incrementing the owned index (line 53).

```41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 bool LamportQueue_pop(struct LamportQueue *queue, T *elem) { size_t b, f; f = atomic_load_explicit(&queue->front_, memory_order_seq_cst); b = atomic_load_explicit(&queue->back_, memory_order_seq_cst); if (b == f) { return false; } else { /* not empty */ } *elem = queue->data_[f]; atomic_store_explicit(&queue->front_, (f + 1) % SIZE, memory_order_seq_cst); return true; }```

Lock-freedom is nice, but you want to avoid contention which is something lock-freedom alone will not do. On modern systems, contention can happen in all kinds of hidden places: loading a shared variable’s value, for example, may require the compiler or processor to emit memory barriers to ensure the value you see is the value you really want to see (or the value the compiler/processor thinks you really want to see). That might mean having to synchronize with other CPUs or other CPU cores, interrupting the normal workflow of some or all of them; going all the way through to memory, slowing you — and possibly others — down along the way.

So, how do you get rid of such contention? One way exemplified by Nhat Minh Le et al. is to get rid of memory barriers. The approach they take in the article is well-thought-out, but frankly a bit boring — so I thought I’d mix things up a little and just make everything “relaxed” (changing `memory_order_seq_cst` to `memory_order_relaxed` throughout the code) to show that that doesn’t work.

Running what I’ll call the “hippie” version, compiled with ThreadSanitizer, you get this warning:

```==================
#0 LamportQueue_pop /home/rlc/lamport/lamport.c:50 (lamport+0x000000000e33)
#1 consumer /home/rlc/lamport/lamport.c:73 (lamport+0x000000000f4b)
#2   (libtsan.so.0+0x000000023629)

Previous write of size 4 at 0x7ffc720b0b10 by thread T1:
[failed to restore the stack]

Location is stack of main thread.

#1 main /home/rlc/lamport/lamport.c:89 (lamport+0x000000001015)

#1 main /home/rlc/lamport/lamport.c:88 (lamport+0x000000000fec)

SUMMARY: ThreadSanitizer: data race /home/rlc/lamport/lamport.c:50 LamportQueue_pop
==================

This basically means that there’s no way of knowing which version of `back_` the consumer is reading, w.r.t. the associated data: because of the relaxed memory ordering, the the reads and writes in the producer thread aren’t necessarily visible in the same order in the consumer thread, and vice-versa.

This warrants a bit more explanation.

When you write your code, you might expect the compiler to translate your code into the equivalent instructions in assembly language, which are then translated into opcodes, one by one, which are then faithfully executed, in order, by the computer. However, if that is really what you believe is going on, many a computer scientist or software engineer will ask you what century you think we’re in. In this century, the hardware really only pretends to do something similar to what the software tells it to do, and at the moment it sees the software, it really is only a facsimile of what the programmer originally wrote. In order for the computer to do what you want it to do efficiently, we have to give it an enormous amount of latitude as to what exactly it does, and in what order.

This is where the C11 memory model comes in: while, in the spirit of the rest of the C language, most of it is undefined, it defines the order of things in terms of happens-before and related notions. Happens-before is the most important of these in that it addresses the notion of a race condition or data race: a data race occurs when a process tries to read a value $hat{x}$ ftom a variable $x$ but it is undefined whether the write that produces that value has occured yet, or is perhaps even still in progress; or when a process tries to write to $x$ while another process also tries to write to $x$. If $x$ is not shared, this cannot happen but if it is, reads and writes to shared variables may appear out-of-sequence to other processes/threads.

This gets us back to what I said earlier: between the code you write and what the computer executes, there may be a world of difference. The “hippie” version of the code above, with its relaxed reads and writes on atomic shared variables only guarantees that no thread/process will see any intermediate values — values that are neither the previous nor the new value — but it does not guarantee anything of the sort for non-atomic shared variables (such as `data_`) nor does it say anything about the ordering between writes to `data_` and writes to `back_`, as visible from the consumer, nor reads from `data_` and writes to `front_` as visible from the producer.

Of course, this does not mean that all reads and writes have to use `memory_order_seq_cst`: `memory_order_seq_cst` emits a memory barrier that makes anything that was sequenced-before it visible before it — which is usually overkill. To know what kind of `memory_order_*` you need, you need to ask yourself: what reads/writes may become visible after this point? and who else (what other thread/process) can see this shared state?

With this in mind, let’s take another look at `LamportQueue_pop`:

```41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 bool LamportQueue_pop(struct LamportQueue *queue, T *elem) { size_t b, f; f = atomic_load_explicit(&queue->front_, memory_order_seq_cst); b = atomic_load_explicit(&queue->back_, memory_order_seq_cst); if (b == f) { return false; } else { /* not empty */ } *elem = queue->data_[f]; atomic_store_explicit(&queue->front_, (f + 1) % SIZE, memory_order_seq_cst); return true; }```

On line 44, we load our own member variable, `front_`. We (the context of the thread the code is running in) are the only ones to ever write to this variable, so we know that the only sequencing that can happen — the only order in which we can see changes to this member — is the order we impose on it ourselves. This means we can breathe easily: there is no way for someone else (another thread) to mess up what we see when we look at this variable — we can relax.

More formally, reads from shared variables the reading thread only writes to itself can be relaxed because we only need sequenced-before ordering.

On line 45, we read from a foreign variable, so we will need some kind of barrier to make sure that any reads of our shared state — any reads of the data in our queue — cannot be ordered before this read. In the same vein, on line 53 we write to our own variable with the full knowledge that another thread will read it as a foreign variable, so we need to make sure no stores we do on the shared state are ordered after that write. I.e., we cannot be seen to read from the shared state before reading the foreign variable and thus acquiring the shared state, and we cannot be seen to write to the shared state after writing to our own variable to release the shared state.

The wording is important here: unless we tell it otherwise, the compiler/CPU is allowed to re-order anything we do in a single thread as long as from the thread itself, everything still seems to have occurred in the same order. The visible order from any other thread may well be different. Memory barriers and atomic operations affect the way things are seen from outside the thread. So when I say that the thread “cannot be seen to read from the shared state before reading the foreign variable” that means that the visible order of those operations, as seen from outside the thread, should be such that the read from the foreign atomic variable happens-before the read from the shared data.

Continued…

1. This is another reason why I prefer the article I’m explaining rather than the other candidate which caught my attention: the tone is much friendlier
2. Note that this is not always the case!
3. So close, yet so far away….

One of the most common sources of bugs is ambiguity: some too-subtle API change that’s missed in a library update and introduces a subtle bug, that finally only gets found out in the field. My answer to that problem is radical: make changes breaking changes — make sure the code just won’t compile unless fixed: the compiler is generally better at finding things you missed than you are.

Recently, I found a bug in a smart pointer implementation in Acari, Vlinder Software’s toolbox library used for, among other things, the Arachnida HTTP(s) server/client framework. The bug was subtle, not likely to cause problems in most current deployments of Arachnida, but limiting for one of our customers (so it had to be fixed).

When I started setting up the necessary testing framework, I came to the conclusion that the bug in question was a design flaw, and that not only the code would have to be changed, but the calling code at at least one of the call sites as well. I now had two things to make sure of:

1. the design had to be reviewed to make sure no other flaws were present
2. the calling sites that needed to be changed had to be spotted unambiguously, and changed in a way clearly specified

I decided to review the requirements that were at the base of the original design, clarify the requirement that was missed and led to the flaw, set up the necessary test cases for each of the functional requirements, design a new implementation to meet all the requirements, and implement that new design. This decision led to a delay in the release of version 2.3 of Arachnnida (which was planned for the end of 2014Q3 but will now come out early-to-mid 2014Q4) — which made it an executive decision.

Luckily, I’m the sole proprietor for Vlinder Software as well as its chief analyst — it says so on my business cards — so these type of decisions inevitably come down to me. I looked in the mirror and gave myself the go-ahead (without the mirror bit). I also informed the customer personally that, though his use-case wasn’t supported at the moment, it would be in version 2.3 of the server framework.

I then proceeded to coding the new smart pointer in a new library, according to the new design in a test setup designed specifically for it. This required, among other things, changes to the Relacy Race Detector, which are I made available on GitHub.

The new smart pointer no longer lives in the `Acari` namespace, but had basically the same API as the previous version did. That means that all the calling sites automatically fail to compile if they use the old version, because the fully-qualified name is no longer the same. The buggy use-case will fail to compile even if you change your `using namespace` directives to include the new namespace, because it will disallow an automatic conversion that was possible in the previous design.

Now, this forced me to revise and review all Vlinder Software code that used the old smart pointer from the Acari library, but as we have automated nightly builds that started failing as soon as I committed the “rip out the pointer class” changes in Acari’s master branch, those sites were easy to find and — because the two APIs are the same for the most part, and the only breaking change is abundantly clear — easy to fix.

Arachnida is now going through the hoops of the release process, with all the (hundreds of) automated test cases running, the security review starting for all modifications made in the months since the previous release, etc. When released, our customers who will upgrade from previous versions of Arachnida to version 2.3 will get a document explaining how to solve the compile-time errors they will probably (almost inevitably) face — a typical upgrade will take no more than 15 minutes to modify the usual call sites where a fully-qualified name for the smart pointer site would be used — and new use-cases will be supported that were not supported before, allowing for more efficient implementations on some devices.

# Radical Refactoring: Have the compiler to (some of) the reviewing

One of the most common sources of bugs is ambiguity: some too-subtle API change that’s missed in a library update and introduces a subtle bug, that finally only gets found out in the field. My answer to that problem is radical: make changes breaking changes — make sure the code just won’t compile unless fixed: the compiler is generally better at finding things you missed than you are.

I recently had to review a chunk of code that ported an application from one platform to a different flavor of that platform. The different flavor in question didn’t support a given library, but because all flavors were compiled from the same source tree, the headers of the unsupported library were still available. Regrettably the only way to distinguish between one flavor of the platform and another at compile-time was using an `#ifdef`.

The code was therefore littered with `#ifdef`s, but the `#include` directive that included the library’s header files was still there — so all the API calls that were no longer supported would still compile (and, in this case, link as well, but do the wrong thing at run-time in oh-so-subtle ways).

In stead of going through all the calls one by one, I asked the developer to surround the `#include` with an `#ifdef` and let the compiler check that none of them had been forgotten. In this case, none of them had.

The compiler didn’t find any sites that had been missed, but had there been any, it would have.

Of course, a better approach would have been to refactor the code so all those `#ifdef`s would no longer have been necessary. That is what had originally been planned, but sometimes the economic realities off our work catch up to the cleanliness of our code: sometimes refactoring and doing it right right now is simple too expensive. The question then becomes whether the investment into refactoring will return a real added value to the program — and the answer in this case was “no”.

# A different take on the “optimize by puzzle” problem

I explained the problem I presented in my previous post to my wife overt dinner yesterday. She’s a professor at law and a very intelligent person, but has no notion of set theory, graph theory, or algorithms. I’m sure many of my colleagues run into similar problems, so I thought I’d share the analogies I used to explain the problem, and the solution. I didn’t get to explaining how to arrive at computational complexity, though.

Say you have a class full of third-grade children. Their instructions are simple:

1. They cannot tell you their own names — if you ask, they have permission to kick you in the shins.
2. Each child has their hands on the shoulder of zero one or two other children.
3. All the children are facing in the same direction.
4. Only one child has no hands on their shoulder.
5. You can ask each child the names of the children whose shoulders they have their hands on, but they will only tell you once — ask again, they’ll kick you in the shins — and you have to address them by their names.

You are told the name of one child. How do you get the names of all the children without getting kicked in the shins and which child do you have to get the name of?

Obviously, the child whose name you have to know in advance is the one who doesn’t have any hands on their shoulders. From there on, you need to keep track of the kids whose names you know but haven’t asked yet (the `to_check` set) the kids whose names you know and have addresses (the `checked` set). At the end, you’ll have checked everyone, so you group of kids whose names you know but having asked yet is empty.

The third set (the `results` set) really only exists to make getting the “right” part of the set. As shown in the Venn chart below, the set of kids remaining to be checked is the difference between the result set and the (entirely overlapping) set of kids we checked with.

Venn chart of the sets

And that’s exactly what the algorithm does.

# Optimization by puzzle

Given a `query` routine that takes a name and may return several, write a routine that takes a single name and returns a set of names for which each of the following is true:

1. For each name in the set, `query` has been called exactly once.
2. All the results from the calls to `query` are included in the set
3. the parameter to the routine is not included in the set

You may assume the following:

1. Calls to `query` are idempotent1.
2. There is a finite number of values for names.
3. Names are less-than-comparable value-types (i.e. you can store them in an `std::set`) and are not expensive to copy
4. `query` results never contain their argument2

This is almost exactly the problem I had to solve recently: a tool was taking several minutes to perform a routine task that, in my opinion, should take milliseconds. Several other issues were involved as well, but this one has the bonus of being fun.

I should make this an interview question.

The way this ends up working is as follows:

1. We create three sets: one for the `results`, one for the things we’ve `checked` and one for the things that remain `to_check`.
2. We insert the value we got as a parameter in the `to_check` set.
3. As long as there are things left to check:
1. run `query` for each value in `to_check`
2. insert the results from the query in the `results` set
3. After iterating over each of the values, insert the values from `to_check` into the `checked` set,
4. clear the `to_check` set
5. fill `to_check` with the set difference between the `results` and the `checked` sets

Or, in C++:

```30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 template < typename T, typename F > set< T > foo(T t, F query) { set< T > results; set< T > checked; set< T > to_check; to_check.insert(t);   do { for (typename set< T >::const_iterator check(to_check.begin()); check != to_check.end(); ++check) { typename F::result_type query_results(query(*check)); results.insert(query_results.begin(), query_results.end()); } checked.insert(to_check.begin(), to_check.end()); to_check.clear(); set_difference(results.begin(), results.end(), checked.begin(), checked.end(), inserter(to_check, to_check.end())); } while (!to_check.empty()); return results; }```

Insertion into a set is $O(\lg{n})$ so lines 43 and 45 are both $O(n\lg{n})$. Line 46 should be $O(c)$ but is probably $O(n)$. Line 47 is $O(n)$ so the whole things boils down to $O(n\lg{n})$ complexity.

In order to play with the code a bit, I put it on GitHub as a Gist, with a test case (Query fails if you call it more than once with the same value):

1. so you really do need to call them only once
2. i.e. for the case at hand, we’re querying a directed acyclic graph, so our first argument will never be seen in any of the `query` results, although any given value may appear more than once in `query` results

# Run-time composed predicates and Code generation

While working on Arachnida, preparing version 2.2 due out this fall, one of the things we’ll be introducing is a hardened OpenSSL transport-layer-security plug-in, to replace the one we’ve had for the last seven or so years. One of the new features in this plug-in (which is part of Arachnida’s “Scorpion” module) is a much more flexible configuration scheme including the subject of today’s post: run-time composed predicates.

As the name indicates, run-time composed predicates are predicates that are composed at run-time. In this case, we use them for post-connection validations of the SSL/TLS connection: the user can plug their own post-connection validations in and combine them with the ones provided in the library using AND, OR, NOR, NAND, XOR and NOT primitives. Typically, such a composed predicate would look like this:

```configuration.post_connection_verification_predicate_ = and_( and_( peer_provided_certificate__, fqdn_matches_peer__) , userProvidedPredicate);```

in which `userProvidedPredicate` is a pointer to a user-provided predicate function whereas the other two predicates are included in the library.

The thing is that each of the following will also work:

```// if the peer provided a predicate, assume everything is fine configuration.post_connection_verification_predicate_ = peer_provided_certificate__; // we accept this only of the FQDN in the peer-provided certificate DOES NOT match the peer's FQDN // THIS IS STUPID - DO NOT DO THIS IN YOUR CODE! configuration.post_connection_verification_predicate_ = not_(fqdn_matches_peer__); // apply only the user's predicate configuration.post_connection_verification_predicate_ = userProvidedPredicate;```

The trick here is that the predicate type, `PostConnectionVerificationPredicate`, is a function-to-pointer type and the functions `and_`, `or_`, `xor_`, `nand_`, `nor_` and `not_` each return a function to a “newly created” function.

Of course, C++ does not allow the creation of functions at run-time and, as the call-back is passed to OpenSSL and OpenSSL is written in C, more to the point, neither does C.

As Arachnida is designed to run on industrial control systems and industrial embedded devices, we want to avoid run-time memory allocation if at all possible — and when that’s not possible, we want to control it. In this case, we avoid it by creating an array of pointers to functions, another array of “configurations” for those functions and a function for each position in the array. We do this using a Perl script (because we usually use Perl to generate code and it integrates nicely with our build system).

The following chunk of code is the generation script verbatim — annotated.

First, the usual pre-amble code: for the Perl part, this is invoking the interpreter; for the C++ code, this is including the neccessary headers.

```#! /usr/bin/env perl my \$name = \$0; my \$max_predicate_count = 20;   print <<EOF #line 7 "\${name}" #include "Scorpion/OpenSSL/Details/PostConnectionVerificationPredicate.h" #include <new> #include <stdexcept>```

The maximum predicate count is set above, and replicated in the generated C++ source code here. To change it, we currently need to change the script. At some point (probably before version 2.2 of Arachnida is released) this will become a command-line argument to the script.

```#define MAX_PREDICATE_COUNT \${max_predicate_count}   namespace Scorpion { namespace OpenSSL { namespace Details { namespace { static unsigned int next_predicate_id__ = 0;```

The following is how predicates are allocated: any call to any of the predicate construction functions (`and_`, `or_`, etc.) will call this once, and throw `bad_alloc` if it fails.

``` unsigned int allocatePredicateID() { if (MAX_PREDICATE_COUNT == next_predicate_id__) throw std::bad_alloc(); return next_predicate_id__++; }```

The following structure holds the configuration of the “generated” predicate. This is all we need to know for any operator: what the left-hand-side of the expression is, what the right-hand-side is and what operator it is. One operator is unary, all the others are binary. The unary one only uses the `lhs_` member of this structure.

``` struct PredicateInfo { enum Type { and__ , or__ , xor__ , nand__ , nor__ , not__ };   Type type_; PostConnectionVerificationPredicate lhs_; PostConnectionVerificationPredicate rhs_; };```

The following is an array of each of these configurations, followed by Perl code to generate each of the functions. I could have used a template to generate these rather than generated code but I find as long as I’m generating code anyway, it makes more sense to just keep generating — especially if there’s no compelling reason to do otherwise.

``` PredicateInfo predicate_infos__[MAX_PREDICATE_COUNT]; EOF ;   for (my \$i = 0; \$i < \$max_predicate_count; ++\$i) { print <<EOF #line 46 "\${name}" bool predicate\${i}(SSL *ssl, char *host) { switch (predicate_infos__[\${i}].type_) { case PredicateInfo::and__ : return (predicate_infos__[\${i}].lhs_(ssl, host) && predicate_infos__[\${i}].rhs_(ssl, host)); case PredicateInfo::or__ : return (predicate_infos__[\${i}].lhs_(ssl, host) || predicate_infos__[\${i}].rhs_(ssl, host)); case PredicateInfo::xor__ : { long lhs_result(predicate_infos__[\${i}].lhs_(ssl, host)); long rhs_result(predicate_infos__[\${i}].rhs_(ssl, host));   return ((lhs_result != 0) ^ (rhs_result != 0)); } case PredicateInfo::nand__ : return !(predicate_infos__[\${i}].lhs_(ssl, host) && predicate_infos__[\${i}].rhs_(ssl, host)); case PredicateInfo::nor__ : return !(predicate_infos__[\${i}].lhs_(ssl, host) && predicate_infos__[\${i}].rhs_(ssl, host)); case PredicateInfo::not__ : return !predicate_infos__[\${i}].lhs_(ssl, host); } throw std::logic_error("Should not reach this code"); } EOF ; }```

We can now generate the array of function pointers that the operator/generator code will pick from:

```print <<EOF #line 77 "\${name}" PostConnectionVerificationPredicate predicates__[] = { EOF ; my \$first = 1; for (my \$i = 0; \$i < \$max_predicate_count; ++\$i) { if (\$first) { print <<EOF #line 84 "\${name}" predicate\${i} EOF ; } else { print <<EOF #line 91 "\${name}" , predicate\${i} EOF ; } \$first = 0; } print <<EOF #line 99 "\${name}" }; EOF ;   print <<EOF #line 105 "\${name}" } EOF ;```

and create a function for each operator. Not that the binary operators are all the same for all intents and purposes, so might as well generate those too.

```my @keywords = qw/and or nor xor nand/;   foreach \$keyword (@keywords) { print <<EOF #line 113 "\${name}" PostConnectionVerificationPredicate \${keyword}_(PostConnectionVerificationPredicate lhs, PostConnectionVerificationPredicate rhs) { unsigned int predicate_id(allocatePredicateID()); predicate_infos__[predicate_id].type_ = PredicateInfo::\${keyword}__; predicate_infos__[predicate_id].lhs_ = lhs; predicate_infos__[predicate_id].rhs_ = rhs; return predicates__[predicate_id]; } EOF ; }   print <<EOF #line 127 "\${name}" PostConnectionVerificationPredicate not_(PostConnectionVerificationPredicate lhs) { unsigned int predicate_id(allocatePredicateID()); predicate_infos__[predicate_id].type_ = PredicateInfo::not__; predicate_infos__[predicate_id].lhs_ = lhs; predicate_infos__[predicate_id].rhs_ = 0; return predicates__[predicate_id]; }   }}} EOF ;```

A few fun tidbits: the `#line` directives tell the compiler where to look for the code for stepping etc., so if you step through this code you’ll be stepping into Perl!

This approach works for a whole slew of other repetitive code. Generated code, once debugged etc., usually scales pretty well: if I need a thousand of these operators for some reason, I have one constant to change and no other questions to ask (except perhaps why I could possibly need that many predicates!)

I used a very similar approach to translate a dump from the Unicode into C code to parse it: computers are very good at repeating themselves with minor variations in what they’re saying. This is an example of how you can reduce the amount of work you do by making the computer do more.

# Serializing floats

Serializing is the act of taking a chunk of data an converting it to something that can be communicated — i.e. some format or other that someone or something else can parse and understand. You do it all the time when you write, or even when you talk: you serialize your thoughts as words that you then serialize as either characters on paper (virtual or dead tree) or as sound.

Parsing is the opposite process of serializing — also called deserializing.

# What happens if structures aren’t well-designed

In my previous post, I explained how to design a structure for persisting and communicating. I didn’t say why I explained it — just that things get frustrating if these simple rules aren’t followed. In this post, I will tell you why I wrote the previous one.

Two or three years ago, I was working on a project that, among several other things, used existing software to communicate between the device I was programming for and another device. The device I was programming for used a binary configuration mechanism that persisted the configuration in binary form directly on disk, in a somewhat structured format. While the structures, as persisted, did include a header that told the reader how much to skip to get to the next section, and a magic number (or rather: a GUID) for each section. The structures in question were managed by a tool with a graphical interface and the generated configuration was included in the firmware with the software I was writing. My software was simply to open the file, get a chunk out of it and pass that chunk to the library doing the communicating, so it would know how to connect and what its parameters were to be.

This worked just fine for a very long time, but having moved on to other projects and the software in question not needing any maintenance until very recently, the library code being used for the communications had been allowed to evolve without “my” software being updated with it. I put “my” in quotes here because the software in question is proprietary software that I do not own — I just wrote it. The risk associated with not maintaining the software concurrently with the communications software was known, understood, and managed so there was no real objection to going down this path.

About three weeks ago, I was asked to help with a massive update of the project’s basic software: the OS, all of the libraries and several other chunks of software I didn’t write had all evolved while the software I had written had been left standing still. Now, a different device to communicate with had to be supported, some-one had been working on that for a few months already1 and a delivery date was nearing, but the update of the bulk of the firmware was in trouble: the system didn’t communicate.

There were two problems that had to be solved: the first was in the OS, the second was in the application-level software.

In the OS, the boot communicated the IP addresses and similar information to the main OS through a mailbox structure in memory. That mailbox structure had been changed independently in two branches. Both had added the same field, `timestamp` to the information to be communicated. In one branch, another field had been shortened from 16 to 12 bytes and the `timestamp` had been inserted. In the other branch, the `timestamp` had been added to the end of the structure.

This is a classic example of why

1. you should reserve fields for future use; and
2. you should consistently add new fields to the end of a structure if no reserved fields are available.

Not following these two simple rules meant I now had to detect which boot was running to know which format of the mailbox was being used, and translate from one format to the other — in a way that was transparent to the system — if I found an incompatible boot.

Once the OS was fixed and tested, we checked that this fixed the symptoms of the problem as well, which is when we found the second problem — the system still wasn’t communicating (though we could now ping and telnet into it, which was definite progress). The communications library failed to initialize.

Tracing through the initialization routine the problem was found easily enough: the chunks of data containing the configuration contained invalid values. We couldn’t verify whether the data being read was in any way misaligned or otherwise corrupted because almost none of the rules I set out in my previous post had been followed:

• there were no magic numbers
• the only version information included applied to the whole group — none for individual chunks

Due to this lack of following design principles, I only found out the next morning that another one of my design rules had also not been followed: a structure had been inserted in the sequence somewhere in the middle. Because some of the code I was running was “unaware” of this, the data being read was offset by several hundreds of bytes — something that would easily have been noticed if we had had magic numbers to look for. When I did finally find the problem, the fix took a few minutes. Several hours were lost searching for a cause in several wrong places, however.

Hence, two blog posts…

1. I knew about that: I’d helped him a few times already

# How to design a struct for storage or communicating

One of the most common ways of “persisting” or communicating data in an embedded device is to just dump it into persistent storage or onto the wire: rather than generating XML, JSON or some other format which would later have to be parsed and which takes a lot of resources both ways, both in terms of CPU time to generate and parse and in terms of storage overhead, dumping binary data into storage or onto the wire has only the — inevitable — overhead of accessing storage/the wire itself. There are, however, several caveats to this, some of which I run into on a more-or-less regular basis when trying to decipher some of that data, so in stead of just being frustrated with hard-to-decipher data, I choose to describe how it should be done in stead.

Note that I am by no means advocating anything more than a few simple rules to follow when dumping data. Particularly, I am not going to advocate using XML, JSON or any other intermediary form: each of those has their place, but they neither should be considered to solve the problems faced when trying to access binary data, nor can they replace binary data.

## Necessary parts

There are two things that any structure that is communicated1 in binary form should have:

1. a magic number, preferably one that is at exactly four bytes in length and one that is chosen to be human-readable, either when displayed as HEX or when displayed as “deciphered” ASCII
Good examples are `0xdeadbeef`; `0xNbadf00d` in which N is replaced by a hexadecimal value that might mean something — you have 16 options, and you can put the N at the end, so you really now have 32 options!!; `'CODE'` (or `0x434f4445` in this case) in which CODE is replaced by something descriptive for the structure’s content. For example, if it contains a config for a potato peeler, `'PCFG'` (or `0x50434647`) would do just fine. The idea is to have some magic number that’s easy to recognize when displayed by a memory debugger or when dumped by a run-of-the-mill binary editor/viewer.
2. the version of the structure. This can be a simple incremental counter — it can even be part of the magic number of you don’t want to “waste” bytes, but it really should be in there. Ideally, it should consist of at least two parts: “current” and “age”, the idea being that you increment both “current” and “age” if you add something, and that you increment “current” and set “age” to 0 if you remove something or change the meaning of some part in a way no longer compatible with previous versions. That way, any-one who reads the structure can very easily see if they can understand the structure:
```if (data.magic_ == POTATO_PEELER_CONFIG_MAGIC) { if ((data.version_.current_ - data.version_.age_) == (POTATO_PEELER_CONFIG_CURRENT - POTATO_PEELER_CONFIG_AGE)) { if (data.version_.current_ >= POTATO_PEELER_CONFIG_CURRENT) { // current or newer version - read it as if // it's current. // We should be able to ignore anything // added since (because the implementor // declared us to be forward-compatible, // after all) } else { // older version. Assume default values for // newer fields, or follow some kind of logic // to keep compatibility -- after all, the // implementor did declare us to be // backward-compatible } } else { /* incompatible version - maybe fall back on conversion code..? */ } } else { /* not something we understand - wrong magic number */ }```

With just these two in place on every persisted structure, I would have saved hours of futile staring at memory dumps and binary dumps of files from legacy (and current) systems that I was asked to debug. Basically, every persisted structure should begin like this:

```struct PotatoPeelerConfiguration_struct { uint32_t magic_; uint32_t version_;```

or, if we want the code above to compile2:

```struct Version_struct { uint16_t current_; uint16_t age_; }; struct PotatoPeelerConfiguration_struct { uint32_t magic_; struct Version_struct version_;```

## The structure’s structure

What’s wrong with this picture:

```struct Blah { uint32_t ulThingy; uint8_t ucThingy; uint16_t usThingy; };```

Hint: it’s not the Hungarian notation!

There’s an invisible hole in this structure3. Between `ucThingy` and `usThingy` there is a one-byte hole due to the structure’s members’ alignment.

The vast majority of compilers will insert a hole into the structure to make sure the `usThingy` member is aligned on a “natural” two-byte boundary. That is the right thing to do, because many hardware platforms will be very picky on mis-aligned data. ARM, for example, will throw a ‘data abort’ at you whereas x86 will simply slow down to a crawl.

Please don’t make the mistake of using `#pragma pack` for this: use `#pragma pack` only if you know it has no effect, and then only if you have a whole bunch of `assert`ions in your unit tests, which are run every night, checking that the `#pragma pack` has no effect on any of the platforms you target4. Using `#pragma pack` otherwise can cause mis-alignment of the contents of the structure which on some platforms (like ARM) can cause crashes.

Do use filler variables to fill the holes, like this:

```struct Blah { uint32_t magic; uint32_t version; uint32_t ulThingy; uint8_t ucThingy; uint8_t reserved; uint16_t usThingy; };```

Note the magic number and version as well, which should of course be at the start of the structure.

If you’re saving a whole bunch of data to a file, or sending it over a wire, please structure it so we can skip the parts we don’t care about – e.g. by including a header with a `size` field for a section of data objects that we might want to skip over. Every object in the section should still have its own magic number and version, of course, but at least we’ll know that the next 148 bytes are configuration for orange peelers (we’re interested in potato peelers, so we’ll skip those 148 bytes, thank you very much).

If you add something to a structure, unless you have reserved some space in the structure for that purpose add it to the end: you are least likely to create compatibility problems that way.

If you’re adding to a collection of objects (as described above) the same applies: make it a complete structure (magic, version and all) and add it to the end.

If you’re designing a structure that is going to be part of a collection of structures communicated somewhere, make sure its size is a multiple of the largest primitive normally used — e.g. a multiple of eight bytes, or four bytes if you don’t go larger than 32-bit integers. This allows the structures, once read into appropriately-aligned memory, to be automatically appropriately aligned when accessed in that appropriately-aligned memory. Because most functions that dump data to the medium don’t pad structures (as the compiler does when you create an array of objects that don’t follow this rule) you won’t be able to count on alignment otherwise.

## Other bits5

If the data you are dumping is somehow variable-sized (i.e. it’s the header of something), please include the size so we know how much data to skip. If need be the size can be used in lieu of a version (as one Redmond-based company often does).

You may want to include a few reserved fields for future use. Do expect them to be 0 by default, or include some kind of flag when set to 0 if the 0 means something, and please do `memset` structures to 0 before filling them in because you will forget some of the fields some of the time, and you don’t want random values to end up in there.

Don’t use already-common magic numbers, such as `0xfeeefeee` or `0xcccccccc` etc. While they’re easy to recognize and perfectly fine as magic numbers in their own right, encountering them usually means bug and really stands out to they hardened debugger’s eye.

If your structure contains strings, try to zero-terminate them if at all possible. Many, many programmers forget to check for zero-termination when they output something from the struct, which causes many, many crashes or other random behaviors.

If you have a structure in memory, writing it somewhere is simply a case of calling the appropriate `write()` function passing it a pointer to your structure and the size, like this:

`retval = write(&data, sizeof(data));`

You don’t have to worry too much about issues such as alignment, because the write function will ready the thing byte by byte if need be.

Reading it into memory is another matter: while in most cases you still don’t need to worry about alignment, you just might have to if the reading function hands you a pointer rather than the other way around: it may not be aligned properly, so you can’t just cast it to the type you want. In stead, use `memcpy` to copy the data into a temporary variable of the right type, so the compiler can align it properly for you.

Also use the version information you stored to know what the size of the data you received is – it may be different from what you were expecting as the structure may have grown since you wrote your code, or since the code that sent the data was written. Version information will also tell you something about the meaning of the contents of the structure, which may have changed or may need to be set to some default value if it wasn’t provided before.

So, reading data — even binary data — from any kind of storage or communications medium is really parsing: you should carefully design how it’s done and keep in mind that things change over time: some intern may change the code one day having neither read this post nor anything else useful to his job. While there is only so much you can do to protect yourself from that particular intern, you can at least try to be graceful about erroring out when you encounter one of his bugs.

HTH

rlc

1. and I include writing to persistent storage and reading it back later in “communication” because that’s what it is: the software reading the data may very well be different from the software writing it — be it different versions of the same software, or different software altogether
2. There is some religious debate over whether or not to do `typedef struct Version_struct Version;` in C headers, so I left that out, though I usually would have included it for convenience.
3. At least, there is on the vast majority of platforms
4. In other words: just don’t use it — it’s useless.
5. I was going to call this section “Optional bits” but it’s not really optional any more than the text itself implies — I don’t want you to skip over this section just because I said it was optional

# Hidden complexity

It really surprises me sometimes how much you can have to explain about simple things.

I’ve written quite a bit of code, most of which is in production today on systems ranging from huge servers tucked away in a bank’s data center somewhere to tiny embedded devices that might just be hanging from your keychain. Most of that code is written in either C or C++, or some combination of the two and some of it contains from-scratch rewrites of things I’d written, often slightly differently, elsewhere.

So today, I decided to do another one of those rewrites, but for educational purposes. I’m not done yet, but I am amazed at how much I’ll have to explain about the code to make it all clear. The code in question (which is done – the explaining takes a lot longer) uses various techniques that aren’t familiar to many programmers, which makes for even more explaining.

In all, I think I’ll have several hundreds of words for maybe a hundred lines of code…

A picture is worth a thousand words – but how much is a class template worth?