Infect to Protect

Bandwagons

I’m not one to jump on each and every bandwagon I see. Sometimes that’s a good decision, sometimes it’s better to just wait and see where they go before taking any action.

Containers are one of those ideas that, while promising and intriguing, were quite clumsy in the beginning, so I ignored them for a good while. They’re sufficiently mature now; so much so that it’s quite difficult to ignore them. Time to investigate them again.

Now, most of my work revolves around writing embedded software that runs on bare metal; containers don’t really solve any work-related problem I have. For personal usage, package management is more than sufficient to install programs. However, the sandboxing aspect of containers is quite interesting, and it’s something I’d like to know more about.

There are many articles around the web explaining how containers on Linux work. Some go out of their way to explain in depth all the machinery necessary to make them work, so there’s no need to repeat it here.

But, in sum: almost all of the kernel side of things was already present before containers were actually a thing: cgroups, system call filters, etc. Containers (and their runtimes) only make them so simple to use that they’re transparent to the user.

I usually have a hard time understanding things that I cannot build, so I decided to build a toy container runtime. It’s crude and it’s a far cry from what any industrial-strength container runtime is capable of, but it’s not only a start, it’s implemented in a way that makes things a lot easier for the user.

Virulent tutorials

Before I go into details on how my contraption works, a little bit of background. I’ve been using Linux for over 18 years, and began my forays in C about 14 years ago.

Around that time, a pretty interesting HOWTO explaining how to create viruses for ELF binaries came out. It explained not only various methods of infecting an ELF executable, but also methods to detect them. Suffice to say, I couldn’t understand a thing back then. A few months ago, though, a conversation in the local hackerspace brought up that tutorial; I could now finally not only understand the techniques but put them to use.

One of the techniques explained in the HOWTO involves finding some unused space in an ELF segment that’s also executable, writing shellcode to that area, rewiring the executable’s entry point to point to the shellcode, and modifying the shellcode so that it jumps back to the original entry point. It’s all quite Rube Goldberg-ey, but it’s actually quite simple.

This way, a chunk of code can be executed every time that program starts, without altering the size of the program. The perfect crime.

Dual use technology

By now, you’ve most likely connected the dots: the idea is to use the very same technique, originally designed for viruses, to create a program that transforms any program into a sandboxed version of itself.

The prototype I wrote is very elementary; the only thing it does is limit, just once, which system calls a program can execute.

Sort of a less powerful version of OpenBSD’s pledge(2) (née tame(2)), which can be called repeatedly to reduce the amount of privileges a process has. That’s useful, for instance, in cases where a configuration file has to be read before processing user-supplied work. The OpenBSD developers have been sprinkling calls to pledge() into almost all of the programs in the base install (which is easier for a BSD system, since everything is kept under the same roof.)
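
To illustrate, this is roughly how pledge(2) tends to be used on OpenBSD (a minimal sketch; the configuration and input-handling steps are just placeholder comments):

#include <unistd.h>
#include <err.h>

int main(void)
{
    /* Allow stdio plus read-only filesystem access while the
     * configuration file is read... */
    if (pledge("stdio rpath", NULL) == -1)
        err(1, "pledge");

    /* ... read the configuration file here ... */

    /* ...then drop filesystem access before touching
     * user-supplied work. */
    if (pledge("stdio", NULL) == -1)
        err(1, "pledge");

    /* ... process user-supplied work here ... */

    return 0;
}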

But, unlike pledge(2), this thing can be applied to binaries that have been already built. No source code modifications are necessary. If your distribution can withstand the stench, “infected” binaries could be a thing in the default installation.

Filtering the system calls

Any respectable container runtime will perform a lot of tasks to sandbox a process and its children. So, for a proof of concept, I decided to do just the bare minimum: limit system calls using seccomp-BPF.

Seccomp is a set of features, present in the Linux kernel since the 2.6.x days, that allows restricting what a program can do, system call-wise. The original intent was to permit no system calls other than the ones needed to end the program and to read from and write to already-opened file descriptors. In some scenarios, this is perfectly acceptable. For others, there’s the seccomp-BPF extension.
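
For reference, entering the original strict mode is a one-liner (a minimal sketch; in this mode, any system call other than read(2), write(2), _exit(2), and sigreturn(2) kills the process):

#include <sys/prctl.h>
#include <linux/seccomp.h>

static int enter_strict_seccomp(void)
{
    /* After this call, only read(), write(), _exit(), and
     * sigreturn() are allowed; anything else raises SIGKILL. */
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
}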

BPF stands for Berkeley Packet Filter. A famous use of BPFs is in the tcpdump program, where rules such as “only give me back TCP fragments with the RST flag set” can be passed to the kernel; packets that don’t match the filter are not copied back to the userland, reducing a lot of the chatter between the two lands.

Obviously, this must be extremely performant, since kernel time must be conserved at all costs (the kernel is there only to serve userland, after all). Linux has many ways to speed up BPF programs, including an in-kernel JIT compiler. Restrictions are also in place so that BPF programs can’t take an infinite amount of time to execute; this blog post is good introductory reading material on the subject.

Another, slightly less famous use of BPFs is with the seccomp-BPF extension. Instead of filtering network packets, processes can, for instance, pick which system calls they’re allowed to perform. And that’s precisely what’s necessary for my proof of concept.

Scripting like a kid

There are many ways to skin a cat. I decided to take a look at how other programs were doing their sandboxes, and eventually found one that seemed easy enough to copy the technique from.

Unfortunately, writing shellcode in C isn’t that easy, especially if you don’t know which C library a program was linked with (or if it was linked with a C library in the first place). Luckily, all the shellcode has to do is make two system calls, which is straightforward to do with a little bit of assembly.

The first call will forbid the process from getting more privileges. The second call will actually copy the BPF program to the kernel side.

The first call is painless: just set a few registers, invoke the syscall, done.

The other one takes a little bit more work. A few things helped: I’ve used nasm, which is a macro assembler, and wrote a few macros that let me write BPF programs as if they were standard x86-64 instructions.

The remaining issue is that a pointer to the BPF program must be passed to the call to prctl(), and the shellcode must be relocatable. A common trick to perform in these scenarios is to employ the fact that, on x86, when a call instruction is made, the return address (i.e. the address of the byte right after the call instruction) is pushed to the stack:

    ; …
    jmp push_bpf_addr
apply_filter:
    pop rdx     ; rdx points to the BPF program
    ; …
push_bpf_addr:
    call apply_filter
bpf:
    bpf_stmt ; …
    bpf_jump ; …
    sc_allow ; …
    ; …
bpf_end:

The bpf label doesn’t point to any x86 instruction: it contains only macros that expand to the definitions of struct sock_filter as defined in linux/filter.h. To copy the BPF program to the kernel, the prctl() call expects a struct sock_fprog, which contains the BPF program length (in number of struct sock_filter elements) and a pointer to the base of that array. Since there’s no way to know beforehand where this code is going to land in memory, this trick comes in handy: after the call apply_filter instruction, the top of the stack contains the base address of that array.
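
For comparison, here’s roughly what the shellcode ends up doing, expressed in C (a minimal sketch rather than the actual shellcode; this particular whitelist only allows write(2) and exit_group(2)):

#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static int install_filter(void)
{
    struct sock_filter filter[] = {
        /* Load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* Allow write() and exit_group(); kill everything else */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* First call: forbid the process from gaining new privileges */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
        return -1;
    /* Second call: hand the BPF program over to the kernel */
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}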

Now that I had a way to write the shellcode, it was just the matter of shoehorning it into the executable.

Scoring a goal

For the proof of concept, I was initially going to write the infection program in Python, as I usually do for throwaway code. However, I wasn’t successful in finding a working ELF library that would let me dump the modified executable.

I was too lazy to actually fix or write support for that, so I kept looking for alternatives and ended up finding the ELFkickers suite from the always excellent Muppet Labs. It includes an “infect” program that does exactly what it says on the tin: it takes in an executable file, and produces another executable file that creates a setuid shell before continuing to the original program. Exactly what one would expect from a program with nefarious purposes.

So I replaced the original shellcode with the one I’d just assembled, and now I had a proof of concept. Which, of course, didn’t work the first few tries. In fact, it took a long while to get it right.

Debugging the contraption with gdb

The GNU Debugger is indeed very powerful, but ease of use (compared to the Turbo Debugger I used to use in the DOS days) is not its strong suit. I’m not used to using it to debug programs without access to the source, and this was a good opportunity to learn a few things.

Since the infection program modifies the ELF entry point, setting a breakpoint on main() won’t actually work. But this is easily solvable: just use readelf(1) to find where the new entry point is, and set a breakpoint there:

$ gcc -o hello hello.c
$ readelf -h hello | grep Entry
  Entry point address: 0x400490
$ ./infect hello
$ readelf -h hello | grep Entry
  Entry point address: 0x4007bc
$ gdb ./hello

(gdb) break *0x4007bc
Breakpoint 1 at 0x4007bc

From this point on, it’s just the usual execute-inspect-modify-reassemble-reinfect loop until it works. Although it’s no td, I’m certainly glad GDB has layouts that display both the disassembly and the registers.

Step-by-step debugging

Watching the magic happen

The hello program is very short, and the call to socket(2) doesn’t make much sense there: it’s just a way to test what’s going to happen when the filter is in place, without having to modify the program to test this assumption. (Lots of things happen when executing even a simple program such as this.)

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
            printf("no socket created\n");
    } else {
            int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
            printf("created socket, fd=%d\n", fd);
    }
    return 0;
}

Executing the program before infecting it gives the following output, as expected:

$ ./hello
no socket created
$ ./hello 1
created socket, fd = 3

Indeed, if the program is executed under strace, it all goes exactly as it’s supposed to:

$ strace ./hello
execve("./hello", ["./hello"], [/* 58 vars */]) = 0
…
write(1, "no socket created\n", 18no socket created
)     = 18
exit_group(0)                           = ?
+++ exited with 0 +++

And, with a command-line argument, so the socket is created:

…
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
…
write(1, "created socket, fd = 3\n", 23created socket, fd = 3
) = 23
exit_group(0)                           = ?
+++ exited with 0 +++

However, the magic happens after the “infected” binary is executed. First, without creating a socket:

…
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len=30, filter=0x400824}) = 0
…
write(1, "no socket created\n", 18no socket created
)     = 18
exit_group(0)                           = ?
+++ exited with 0 +++

Notice the calls to prctl(), very similar to the ones found in the previously-mentioned commit. And then the program executes as usual. Now, if an argument is passed, the program will attempt to create a socket:

…
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len=30, filter=0x400824}) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 41
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x7f2d01aa19e7, si_syscall=__NR_socket, si_arch=AUDIT_ARCH_X86_64} ---
+++ killed by SIGSYS (core dumped) +++
[1]    27536 invalid system call (core dumped)  strace ./hello 1

And Seccomp kicks in and kills the program with a SIGSYS signal. As expected. It’s alive!

Next steps

The prototype works. But there are a few things that must be addressed before this idea can even be considered for anything serious.

System call whitelist

The list of system calls is still hardcoded within the shellcode. That’s not optimal: maintaining a list like this for each and every program is most likely so tedious that nobody is going to do it.

I can think of three possible ways of coming up with this list.

The first would be doing the same thing pledge(2) does: allowing a very restricted set of system calls at first, with some limitations, and then providing a few sets of calls per group of features a program might use: stdio, inet, tty, etc. The nice thing about this is that the filters are more fine-grained; it’s not just a whitelist of system calls. (The man page has more details.)

The second way would involve running the program under strace(1) and recording which system calls the program makes over a few runs. If the test coverage for each run is sufficiently high, this will work very reliably; that isn’t always the case, though, so your mileage may vary. Also, for certain large, complicated programs, stracing everything automatically could prove to be a challenge.

Another way would be the following: grab the list of undefined symbols a program uses, and find them in the shared libraries it links to. Then scan the executable and the libraries for sequences like mov eax, 57; syscall (for the old-school fork(2) syscall on x86-64) or mov rdi, 57; call syscall@plt. This is still not foolproof, since a system call number (loaded into eax) won’t necessarily be hardcoded within a program or shared library.

There’s a fourth idea, as well, which involves both doing the automated static analysis on the binary and running strace to catch “runaway” syscalls. This can get quite complicated, and it’s unlikely I’ll get it right in the first few tries (and, even then, the same shortcomings will apply in the end.)

For me, though, these experiments are all about the hunt, not about the treasure. So the tried and true approach that pledge(2) uses won’t be used at first.

Filter optimization

Another thing that might be a problem: on x86-64, Linux has hundreds of system calls (329 according to sys/syscall.h at the moment I write this).

Even if the JIT for BPF programs is quite efficient, doing a linear search before each and every system call will certainly be a bottleneck. Also, BPF programs are limited in size, and a large whitelist implemented the same way as the prototype will limit the possibility of more fine-grained filters. Things like “the socket(2) call is allowed only for UNIX-domain sockets”, rather than allowing any call to socket(2), would become impractical.

Since each syscall is identified by a number, a simple bitmap could be used to implement the whitelist. This will also free up some space in the BPF program for more detailed whitelisting for certain syscalls (for instance, only allowing certain family of sockets to be created).

After a quick read of networking/filter.txt, this seems doable by using an algorithm such as this, which will reduce the number of comparisons as the number of acceptable system calls increases:

        if syscall_number < 32:
                if bitmask_0 & 1<<syscall_number: goto accept
        elif syscall_number < 64:
                syscall_number -= 32
                if bitmask_1 & 1<<syscall_number: goto accept
        elif syscall_number < 96:
                syscall_number -= 64
                if bitmask_2 & 1<<syscall_number: goto accept
        …
        elif syscall_number < 352:
                syscall_number -= 320
                if bitmask_10 & 1<<syscall_number: goto accept
        return SECCOMP_RET_KILL
accept:
        return SECCOMP_RET_ALLOW

(Blocks whose respective bitmask is 0 could be dropped entirely, letting those system calls fall through to the kill.)

Or maybe just a bloom filter instead of a series of bitmaps. I’ll have to experiment.

Getting a larger vessel

Containers, of course, are not just about restricting which system calls a program is allowed to perform. There are many things that must be considered before calling this a container runtime, or even claiming that it sandboxes anything at all. Namespaces, cgroups, and virtual machines are certainly on the list of things to learn about.

Conclusion

While the prototype I built isn’t practical and is of very limited use, I find the idea of sandboxed programs without the need for specialized runtimes very enticing.

Programs can still be packaged the way they have been packaged in the past decades, without throwing away some of the sandboxing benefits that containers provide, all the while not introducing new concepts for users.

Of course, something like this – even if properly implemented – won’t be a replacement for containers. Especially if one considers their role as packages ready for deployment, which have a lot of value for devops personnel.

The code, as usual, is open source, and available from this Git repository.

Initializing a heap-allocated structure in C

A pretty common mistake that happens when programming things in C is to allocate less memory than necessary to hold a structure:

struct foobar *foobar = malloc(sizeof(struct foobaz));

Note that struct foobaz is passed instead of struct foobar. We might get lucky, and sizeof(struct foobaz) might be larger than or equal to sizeof(struct foobar), but we might not.

There are lots of tools out there that will catch these mistakes: static analyzers such as the one from Clang, and Memcheck from Valgrind are just two examples that should be in any C programmer’s toolbelt.

Even then, people often resort to a nicer idiom: sizeof(*foobar), which not only avoids these problems, but is also somewhat future-proof, should the type of foobar change:

struct foobar *foobar = malloc(sizeof(*foobar));

However, structures often have members that, if someone forgets to initialize them, will inflict some undefined behavior pains on the user. The things in the toolbelt might help here, as well as the calloc() function, which, in addition to allocating memory, also zeroes out the memory block:

struct foobar *foobar = calloc(1, sizeof(*foobar));

One doesn’t always want to zero out the whole memory chunk just to fill in the important fields afterwards, though.

Here’s a trick that’s being used in a yet-to-be-released project I’ve been working on and off for the past few months. It starts by defining the generic-chunk-of-memory equivalent of strdup(), memdup():

void *memdup(const void *src, size_t sz) {
        void *mem = malloc(sz);
        return mem ? memcpy(mem, src, sz) : NULL;
}

Then a macro is defined:

#define ALLOC_INIT(type, ...)   \
        (type *)memdup((type[]){ __VA_ARGS__ }, sizeof(type))

Then it is used like so:

struct foobar *foobar = ALLOC_INIT(struct foobar, {
        .field = value,
        .other_field = other_value,
        .yet_another_field = yet_another_value
});

The compiler will check that field, other_field, and yet_another_field are actually part of struct foobar, and will abort compilation if a field isn’t there or is of the wrong type.

The cast prevents the allocated memory block from being assigned to a pointer of the wrong type. (C will happily convert any void * to any other pointer type.)

The amount of memory allocated will be exactly what’s needed by the structure, and all fields that are not mentioned will be initialized to their default values, as per designated initializer rules.

If memdup() is inlined, a good compiler will generate pretty good code, that’s often byte-by-byte equivalent to allocating directly with malloc(), initializing all the fields by hand, etc.

If GCC is being used, the __auto_type extension can be used to avoid having to type struct foobar twice. This has been suggested by Thiago Macieira. I’d use it sparingly, though.

__auto_type foobar = ALLOC_INIT(struct foobar, {
        .field = value,
        .other_field = other_value,
        .yet_another_field = yet_another_value
});

It’s a pretty nice idiom that I haven’t seen anywhere else, and I’m blogging here as the project I’m working on might not ever see the light of day and it would be a shame if at least this didn’t become public.

Hybrid C/Pascal Strings

I’ve been thinking for a while about how to reduce the overhead in Lwan’s string buffers when the strings are small. There are a number of ways of accomplishing this.

A somewhat common way is what std::string does: it reuses the bits reserved for effective string length, allocated buffer size, and pointer to buffer to store the string contents inline.

A clever improvement is, when the string is small, to turn the effective string length counter into a bytes-remaining counter, and to put it after the buffer that’s storing the string; this way, when the string is at full capacity, that counter serves as a \0 terminator, which is very useful for compatibility with C. And, of course, as a result, one more byte can be stored in that string.
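
As a rough illustration of that trick (the layout and the names below are made up for this example, not std::string’s actual internals):

#include <stddef.h>

union sso_string {
    struct {
        char *ptr;       /* heap buffer */
        size_t size;     /* effective length */
        size_t capacity; /* allocated size */
    } big;
    struct {
        char data[23];           /* string stored inline (64-bit layout) */
        unsigned char remaining; /* bytes left; 0 when full, doubling
                                    as the '\0' terminator */
    } small;
};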

Another common approach is the one used by Pascal strings, where the first byte tells the length of the string. This has the advantage of allowing strings to contain \0, but the disadvantage of limiting the maximum size of the string. If someone were to implement this in C, the advantage would turn into a disadvantage, as most string-handling routines present in the standard library would then be rendered useless.

Or would it?

I’m sure I’m not the first person to come up with the idea of having a C/Pascal String hybrid. But at least the Wikipedia article on Strings doesn’t seem to mention this variant I just came up with:

  • Keep the \0 to terminate the string. This helps reuse the string-handling routines from the C standard library, which are usually very fast, hand-tuned functions;
  • The first byte tells the size, not in bytes, but in 8-byte blocks. To calculate the string length, one just jumps over that many 8-byte blocks and finds the position of the \0 terminator.
  • Larger blocks could be considered if SIMD instructions were available.

With 8-byte blocks, this can yield strings up to 2KiB of size (256 * 8), with an overhead of only two bytes, while retaining compatibility with C strings. With SIMD, the maximum string size could be easily doubled or quadrupled.
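
A minimal sketch of the idea, assuming the 8-byte-block layout described above (the names are mine; none of this is implemented anywhere yet):

#include <stddef.h>
#include <string.h>

struct hybrid_str {
    unsigned char blocks;   /* length in whole 8-byte blocks */
    char data[];            /* contents, still '\0'-terminated */
};

static size_t hybrid_str_length(const struct hybrid_str *s)
{
    /* Skip the full blocks, then scan only the last, partial
     * block for the terminator. */
    size_t base = (size_t)s->blocks * 8;
    return base + strlen(s->data + base);
}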

Of course, this isn’t actually an improvement on the kind of small string optimization performed by std::string, so I’m not yet convinced this is the way to go. This is one of the reasons I haven’t yet implemented this, but I might use the fact that I’m currently enjoying some vacation time and write a prototype.

Life of an HTTP request, as seen by my toy web server

When learning a new programming language, I tend to write two things with it: a language interpreter (usually a FORTH-like language, or Brainfuck if I’m feeling lazy), and an HTTP server. Sometimes, just as a challenge or a way to quench my boredom, I do this even with a language I’ve been working with for some time, as is the case with C.

None of these projects I’ve written over the years have been as complex as Lwan ended up being: most of them were nothing but weekend hacks and were never able to hold my attention for more than a few dozen hours.

It’s to be expected, then, that I might have a thing or two to say about it. In fact, I’ve been doing this in homeopathic doses over the almost two years since I started the project, but I never actually connected all the dots, always leaving out important details.

This article is an attempt to describe, from the perspective of Lwan, the life of an HTTP request — from the socket being accepted to the response being sent — and to explain the details and reasoning behind the implementation.

Creating the listening socket & accepting connections

There’s nothing really special here: sockets are either created using the standard POSIX stuff, or are passed down from systemd. In either case, TCP Fastopen and Quickack are enabled, in addition to socket lingering. The socket is left in its default, blocking mode. The listen() backlog is the maximum allowed by the system.

static int
_get_backlog_size(void)
{
#ifdef SOMAXCONN
    int backlog = SOMAXCONN;
#else
    int backlog = 128;
#endif
    FILE *somaxconn;

    somaxconn = fopen("/proc/sys/net/core/somaxconn", "r");
    if (somaxconn) {
        int tmp;
        if (fscanf(somaxconn, "%d", &tmp) == 1)
            backlog = tmp;
        fclose(somaxconn);
    }

    return backlog;
}

It’s a blocking file descriptor, since the main thread (responsible for accepting all the sockets and scheduling clients) blocks on a call to accept4() instead of something like Epoll. This accept() variant is Linux-only and, among other things, lets one specify flags for the accepted socket without requiring an additional round trip to the kernel; the only flag that interests Lwan is SOCK_NONBLOCK.

void
lwan_main_loop(lwan_t *l)
{
    if (setjmp(cleanup_jmp_buf))
        return;

    signal(SIGINT, _signal_handler);

    lwan_status_info("Ready to serve");

    for (;;) {
        int client_fd = accept4(l->main_socket, NULL, NULL,
                                SOCK_NONBLOCK);
        if (UNLIKELY(client_fd < 0)) {
            lwan_status_perror("accept");
            continue;
        }

        _schedule_client(l, client_fd);
    }
}

File descriptor limits are raised to the maximum allowed by system settings — at which time, Lwan pre-allocates an array of structures to hold connection state for all possible file descriptors.

Scheduling connection

In order to multiplex connections, Lwan spawns one thread per logical CPU, and uses Epoll to determine which socket is ready to be written to or read from. Once a connection is scheduled to one of these threads, it stays there until it is explicitly closed or a timeout occurs.

All threads share the preallocated connection array, and there are no explicit locks. The index to this array is the connection file descriptor, which makes lookup very quick. This exploits the notion that file descriptors are always allocated from the lowest possible number.

struct lwan_connection_t_ {
    /* This structure is exactly 32-bytes on x86-64. If it is
     * changed, make sure the scheduler (lwan.c) is updated as
     * well. */
    lwan_connection_flags_t flags;
    unsigned int time_to_die; /* In seconds since DQ epoch */
    coro_t *coro;
    lwan_thread_t *thread;
    int prev, next;           /* For death queue */
};

Since this structure is quite small, this can lead to a form of implicit contention known as false sharing; it is avoided with a scheduler that is aware of the problem and groups two connection structures per cache line. It’s simpler than it sounds:

int thread = ((fd - 1) / 2) % n_threads;

A round robin scheduler is used on other architectures.

An interesting curiosity about the connection structure is that it doesn’t store the file descriptor: pointer arithmetic is performed to obtain it, as the base address of the connection array is known.

ALWAYS_INLINE int
lwan_connection_get_fd(lwan_connection_t *conn)
{
    return (int)(ptrdiff_t)(conn - conn->thread->lwan->conns);
}

After a thread has been chosen by the scheduler, the file descriptor number is written to a Unix domain socket (created with socketpair()) that is watched by that particular thread’s Epoll. This part used to use epoll_ctl() directly — which, although thread-safe, had a problem: epoll_wait() will never time out on a socket if nothing was ever read from it. By writing to that socketpair, Epoll wakes up, the file descriptor is added to it, and that thread’s death queue can handle the timeout by itself.
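
A hypothetical sketch of that hand-off (the function names and the epoll flags below are illustrative, not Lwan’s actual code):

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

/* Main thread: hand the accepted socket over to the chosen thread. */
static void schedule_fd(int scheduler_fd, int client_fd)
{
    if (write(scheduler_fd, &client_fd, sizeof(client_fd)) < 0)
        perror("write");
}

/* Worker thread: its end of the socketpair became readable, so
 * epoll_wait() woke up; read the new file descriptor and register
 * it in this thread's own Epoll. */
static void add_scheduled_fd(int epoll_fd, int scheduler_fd)
{
    int client_fd;

    if (read(scheduler_fd, &client_fd, sizeof(client_fd)) != sizeof(client_fd))
        return;

    struct epoll_event event = {
        .events = EPOLLIN | EPOLLRDHUP,
        .data.fd = client_fd,
    };
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &event);
}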

The sole purpose of each thread is to react to Epoll events, such as:

  • Timeouts (in which case the death queue iterates, potentially terminating connections);
  • Epoll errors (in which case the thread finishes gracefully);
  • Readiness events (can read, can write);
  • Connection hung up.

Epoll events are used as signals to create, destroy, resume, and reset coroutines: there’s one for each connection, and they’re used both as lightweight threads and as resource management facilities.

Coroutines

Coroutines provide a reasonably simple model for asynchronous I/O handling that’s less convoluted than the dreaded callback idiom prevalent in C. They also require a lot less stack space than a thread, and their creation is pretty efficient: essentially just a call to malloc().

coro_t *
coro_new(coro_switcher_t *switcher,
         coro_function_t function,
         void *data)
{
    coro_t *coro = malloc(sizeof(*coro) + CORO_STACK_MIN);
    if (!coro)
        return NULL;

    coro->switcher = switcher;
    coro->defer = NULL;

    /* coro_reset() is just a few assignments on x86-64 */
    coro_reset(coro, function, data);

#if !defined(NDEBUG) && defined(USE_VALGRIND)
    char *stack = (char *)(coro + 1);
    coro->vg_stack_id = VALGRIND_STACK_REGISTER(stack,
                                   stack + CORO_STACK_MIN);
#endif

    return coro;
}

Request handlers can be written using an API that’s completely synchronous on the surface; behind the curtains, I/O happens in the background (client sockets are non-blocking) and control is given to the next coroutine as commanded by each thread’s loop.

Execution resumes where the coroutine left off. This saves a lot of code, not only making things easier to reason about, but also simplifying resource management by having a single cleanup point.

To provide a synchronous-looking API, Lwan has a few wrappers for common operations, such as writev() or sendfile(). Unlike the functions they wrap, these don’t report errors in the usual way:

  • On success, the same return code is returned;
  • Recoverable errors (such as EINTR) are handled by trying again a few times before giving up;
  • When giving up, or on unrecoverable errors, coroutines are aborted.

int
lwan_openat(lwan_request_t *request,
            int dirfd, const char *pathname, int flags)
{
    for (int tries = max_failed_tries; tries; tries--) {
        int fd = openat(dirfd, pathname, flags);
        if (LIKELY(fd >= 0)) {
            /*
             * close() will be called as soon as the
             * coroutine ends
             */
            coro_defer(request->conn->coro, CORO_DEFER(close),
                       (void *)(intptr_t)fd);
            return fd;
        }

        switch (errno) {
        case EINTR:
        case EMFILE:
        case ENFILE:
        case ENOMEM:
            coro_yield(request->conn->coro,
                       CONN_CORO_MAY_RESUME);
            break;
        default:
            return -errno;
        }
    }

    return -ENFILE;
}

When a coroutine is destroyed, user-defined callbacks are executed. These include callbacks set by the wrapper functions to close files, free memory, and perform many other cleanup tasks. This ensures resources are released regardless of whether the coroutine ended normally or an unrecoverable error was detected.

Diagram of main loop plus two coroutines

On supported architectures, coroutine context switching is almost as cheap as a function call. This is possible because hand-written assembly routines are used, which perform only the essential register exchange, as mandated by the ABI. There is still some work to do in order to speed this up; tricks used by libco, for instance, might be used in the future to reduce some of the overhead.

On every other architecture, swapcontext() is used, which usually incurs saving and restoring the signal mask, in addition to swapping every register (including those not required by the calling convention); this might change to setjmp() in the future to avoid at least the two system calls.

Another use for coroutines in Lwan is inside the Mustache templating engine, described in more depth below.

Reading requests

The loop within each I/O thread is quite crude.

Essentially, a coroutine will only be resumed for reading once per request: once it yields, Epoll will only be interested in write events. Because of this, reading a request uses a purpose-built read() wrapper that tricks the scheduler into remaining interested in read events until the request has been fully received (that is, until it ends with the “␍␊␍␊” separator).
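
Something along these lines (a rough sketch, not Lwan’s actual code):

#define _GNU_SOURCE     /* for memmem() */
#include <stdbool.h>
#include <string.h>

static bool request_is_complete(const char *buffer, size_t len)
{
    /* HTTP/1.1 request headers end with an empty line: "\r\n\r\n".
     * Keep asking for read events until it shows up. */
    return len >= 4 && memmem(buffer, len, "\r\n\r\n", 4) != NULL;
}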

As soon as the whole request has been received, it is then parsed and acted upon.

Parsing request

Request parsing in Lwan is quite efficient: there are no copies, no memory allocations from the heap. The buffer is modified in place by slicing and storing pointers to stuff the server might be interested in. Parsing of HTTP request headers is delayed until needed (and they might not be needed).

struct lwan_request_parse_t_ {
    lwan_value_t buffer;            /* The whole buffer */
    lwan_value_t query_string;      /* Stuff after URLs ? */
    lwan_value_t if_modified_since; /* If-Modified-Since: */
    lwan_value_t range;             /* Range: */
    lwan_value_t accept_encoding;   /* Accept-Encoding: */
    lwan_value_t fragment;          /* Stuff after URLs # */
    lwan_value_t content_length;    /* Content-Length: */
    lwan_value_t post_data;         /* POST data */
    lwan_value_t content_type;      /* Content-Type: */
    lwan_value_t authorization;     /* Authorization: */
    char connection;                /* k=keep-alive, c=close */
};

Among other things, one detail that often receives comments is how headers are parsed. Two tricks are involved: avoiding spilling/filling registers to compare strings with strncmp(), and applying a heuristic to avoid reading (and comparing) more than necessary. Both tricks are intertwined into a “string prefix switch”, sketched after the list below:

  • A pointer to the first four bytes of the header name is cast to a 32-bit integer pointer;
  • That pointer is then dereferenced;
  • A standard switch statement is used to perform cheap comparisons on a 32-bit integer;
  • When a header prefix is matched, a simple heuristic of finding the separating colon and space character where they’re supposed to be is used.
    • This might give false positives, although that’s very unlikely in practice.
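
A rough sketch of the idea (the MULTICHAR() macro and the header prefixes below are illustrative, not Lwan’s actual code):

#include <stdint.h>

/* Build the little-endian 32-bit value for a 4-character prefix,
 * so it can be used as a case label. */
#define MULTICHAR(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

static void parse_header(const char *line)
{
    /* Read four bytes at once; x86-64 copes fine with the unaligned
     * access (and the compiler is assumed to be lenient about the
     * aliasing here). */
    uint32_t prefix = *(const uint32_t *)line;

    switch (prefix) {
    case MULTICHAR('R', 'a', 'n', 'g'):
        /* Most likely "Range:"; check that the colon and space are
         * where they should be before slicing out the value. */
        break;
    case MULTICHAR('A', 'c', 'c', 'e'):
        /* Probably "Accept-Encoding:"; same heuristic applies. */
        break;
    default:
        /* Not a header the server cares about. */
        break;
    }
}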

Once the request has been parsed, it is time to look up what is going to handle it.

Looking up handler

A prefix tree is used to look up handlers. It is a modified trie data structure that has only eight pointers per node, so that on x86-64, each node fills one cache line exactly. This is achieved by hashing each character used to build up a node by taking the 3 least significant bits.

struct lwan_trie_node_t_ {
    lwan_trie_node_t *next[8];
    lwan_trie_leaf_t *leaf;
    int ref_count;
};
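
A hypothetical, simplified lookup over such nodes could look like the code below; a real implementation still has to compare the key stored in the leaf, since different characters can hash to the same child:

static lwan_trie_leaf_t *trie_lookup(lwan_trie_node_t *node, const char *key)
{
    /* Hash each character to one of the eight children by taking
     * its 3 least significant bits. */
    for (const char *p = key; *p && node; p++)
        node = node->next[(unsigned char)*p & 7];

    return node ? node->leaf : NULL;
}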

The canonical and naïve alternative to the hashed trie is having 256 pointers per node, which puts too much pressure on virtual memory: the approach used in Lwan is a good compromise between keeping that pressure low and keeping implementation complexity down.

Another alternative (which might be considered in the future) is to reduce the number of nodes by coalescing common prefixes; this significantly increases implementation complexity, but, combined with the string switch trick, it might yield a good performance boost.

Yet another technique investigated was to generate machine code to perform lookup: essentially turning a data structure into executable code. The idea works but the instruction cache pressure isn’t worth the trouble. I’m still partial to this solution, though, so I might revisit it later: Varnish does something remotely similar with VCL and it seems to work, so this deserves a little bit more research.

After a handler is found, a second round of parsing might happen. Each handler contains a set of flags that signal whether headers (which were sliced in the request parsing stage) should actually be parsed. These include headers such as Range, Accept-Encoding, If-Modified-Since, and authorization stuff. Handlers that do not require these headers will not trigger potentially expensive string crunching.

typedef enum {
    HANDLER_PARSE_QUERY_STRING = 1<<0,
    HANDLER_PARSE_IF_MODIFIED_SINCE = 1<<1,
    HANDLER_PARSE_RANGE = 1<<2,
    HANDLER_PARSE_ACCEPT_ENCODING = 1<<3,
    HANDLER_PARSE_POST_DATA = 1<<4,
    HANDLER_MUST_AUTHORIZE = 1<<5,
    HANDLER_REMOVE_LEADING_SLASH = 1<<6,

    HANDLER_PARSE_MASK = 1<<0 | 1<<1 | 1<<2 | 1<<3 | 1<<4
} lwan_handler_flags_t;

To reduce the amount of boilerplate necessary to declare a handler, there’s a shortcut that parses almost everything; these are the “request handlers”, such as the “Hello world handler” example shown below.

Modules, on the other hand, provide much more fine-grained control of how the request will be handled; an example is the static file serving feature, also discussed further down.

static const lwan_module_t serve_files = {
    .name = "serve_files",
    .init = serve_files_init,
    .init_from_hash = serve_files_init_from_hash,
    .shutdown = serve_files_shutdown,
    .handle = serve_files_handle_cb,
    .flags = HANDLER_REMOVE_LEADING_SLASH
        | HANDLER_PARSE_IF_MODIFIED_SINCE
        | HANDLER_PARSE_RANGE
        | HANDLER_PARSE_ACCEPT_ENCODING
};

Hello world handler

The simplest handler possible is a “Hello, World!“. This tests the raw read-parse-write capacity of Lwan, without requiring more system calls than absolutely necessary.

lwan_http_status_t
hello_world(lwan_request_t *request __attribute__((unused)),
            lwan_response_t *response,
            void *data __attribute__((unused)))
{
    static const char *hello_world = "Hello, world!";

    response->mime_type = "text/plain";
    strbuf_set_static(response->buffer, hello_world,
                      strlen(hello_world));

    return HTTP_OK;
}

These simple handlers will use whatever is inside their respective string buffers (which is an array that grows automatically when needed, with some bookkeeping attached). In the “Hello, World!” case, however, the string buffer acts merely as a pointer to some read-only string stored in the text section; this simplifies the interface a little bit, while avoiding string copies and unneeded heap allocations.

Chunked encoding and Server-sent events

Chunked encoding is also supported. Using it is very simple: just set the response MIME type, fill the string buffer, and call lwan_response_send_chunk(). The response headers will be sent alongside the first chunk; after each call, the string buffer is cleared and the coroutine yields. To send the next chunk, just fill the string buffer again and call it once more, until your handler is done.

lwan_http_status_t
test_chunked_encoding(lwan_request_t *request,
            lwan_response_t *response,
            void *data __attribute__((unused)))
{
    response->mime_type = "text/plain";

    strbuf_printf(response->buffer, "First chunk\n");
    lwan_response_send_chunk(request);

    for (int i = 0; i <= 10; i++) {
        strbuf_printf(response->buffer, "*Chunk #%d*\n", i);
        lwan_response_send_chunk(request);
    }

    strbuf_printf(response->buffer, "Last chunk\n");
    lwan_response_send_chunk(request);

    return HTTP_OK;
}

The same general idea is used by Server-sent events; however, one uses lwan_response_send_event(), and passes the event name as well.

lwan_http_status_t
test_server_sent_event(lwan_request_t *request,
            lwan_response_t *response,
            void *data __attribute__((unused)))
{
    for (int i = 0; i <= 10; i++) {
        strbuf_printf(response->buffer, "{n: %d}", i);
        lwan_response_send_event(request, "currval");
    }

    return HTTP_OK;
}

The implementation inside Lwan is as straightforward as it looks: coroutines saved the day.

File serving module

Since files can be served using the sendfile() system call, the kind of handler used by Hello World can’t be used: those send responses using writev(), with both response headers and contents in one kernel round trip. Because of this, there’s a different kind of handler that gives more control over how the response is sent: the (for lack of a better name) streaming handlers. Streaming handlers are expected to send the whole response themselves.

Converting a “normal” handler into a streaming handler is simple: just set a few pointers in the “normal” handler and return. Streaming handlers function exactly like a “normal” handler, except that response headers are not sent automatically; error responses, however, are still produced automatically.

static lwan_http_status_t
serve_files_handle_cb(lwan_request_t *request,
                      lwan_response_t *response, void *data)
{
    lwan_http_status_t return_status = HTTP_NOT_FOUND;
    serve_files_priv_t *priv = data;
    struct cache_entry_t *ce;

    if (UNLIKELY(!priv)) {
        return_status = HTTP_INTERNAL_ERROR;
        goto fail;
    }

    ce = cache_coro_get_and_ref_entry(priv->cache,
                request->conn->coro, request->url.value);
    if (LIKELY(ce)) {
        file_cache_entry_t *fce = (file_cache_entry_t *)ce;
        response->mime_type = fce->mime_type;
        response->stream.callback = fce->funcs->serve;
        response->stream.data = ce;
        response->stream.priv = priv;

        return HTTP_OK;
    }

fail:
    response->stream.callback = NULL;
    return return_status;
}

To avoid having to obtain information about a file for every request, this information is cached for a few seconds. The caching mechanism itself is discussed in detail further down.

While caching file information, the file size is considered when picking the way to serve it. Files larger than 16KiB are served with sendfile() to allow zero-copy (or at least fewer-copy) transfers, and smaller files are mapped in memory using mmap().

static const cache_funcs_t *
_get_funcs(serve_files_priv_t *priv, const char *key,
           char *full_path, struct stat *st)
{
    char index_html_path_buf[PATH_MAX];
    char *index_html_path = index_html_path_buf;

    if (S_ISDIR(st->st_mode)) {
        /* It is a directory. It might be the root directory
         * (empty key), or something else.  In either case,
         * tack priv->index_html to the path.  */
        if (*key == '\0') {
            index_html_path = (char *)priv->index_html;
        } else {
            /* Redirect /path to /path/. This is to help
             * cases where there's something like <img
             * src="../foo.png">, so that actually
             * /path/../foo.png is served instead of
             * /path../foo.png.  */
            const char *key_end = rawmemchr(key, '\0');
            if (*(key_end - 1) != '/')
                return &redir_funcs;

            if (UNLIKELY(snprintf(index_html_path, PATH_MAX,
                                  "%s%s", key,
                                  priv->index_html) < 0))
                return NULL;
        }

        /* See if it exists. */
        if (fstatat(priv->root.fd, index_html_path, st, 0) < 0) {
            if (UNLIKELY(errno != ENOENT))
                return NULL;

            /* If it doesn't, generate a directory list. */
            return &dirlist_funcs;
        }

        /* If it does, we want its full path. */

        if (UNLIKELY(priv->root.path_len + 1 /* slash */ +
                     strlen(index_html_path) + 1 >= PATH_MAX))
            return NULL;

        full_path[priv->root.path_len] = '/';
        strncpy(full_path + priv->root.path_len + 1,
                index_html_path,
                PATH_MAX - priv->root.path_len - 1);
    }

    /* It's not a directory: choose the fastest way to serve the
     * file judging by its size.  */
    if (st->st_size < 16384)
        return &mmap_funcs;

    return &sendfile_funcs;
}

Small files may also be compressed, unless the compressed data ends up being larger than the original data, especially once the extra response header is considered. Because of this, small files are only compressed if it’s worth the trouble. The 16KiB threshold has been chosen empirically: larger values did not yield substantial performance gains compared to using sendfile().

static void
_compress_cached_entry(mmap_cache_data_t *md)
{
    static const size_t deflated_header_size =
            sizeof("Content-Encoding: deflate");

    md->compressed.size = compressBound(md->uncompressed.size);

    md->compressed.contents = malloc(md->compressed.size);
    if (UNLIKELY(!md->compressed.contents))
        goto error_zero_out;

    int ret = compress(md->compressed.contents,
                       &md->compressed.size,
                       md->uncompressed.contents,
                       md->uncompressed.size);
    if (UNLIKELY(ret != Z_OK))
        goto error_free_compressed;

    size_t total_size = md->compressed.size
            + deflated_header_size;
    if (total_size < md->uncompressed.size)
        return;

error_free_compressed:
    free(md->compressed.contents);
    md->compressed.contents = NULL;
error_zero_out:
    md->compressed.size = 0;
}

For directories, the template engine is used to create the listing. The contents are cached using the same mechanism files are. Templating is discussed below.

An interesting optimization is that, to obtain the full path, a special version of realpath(), forked from the GNU libc implementation, is used. This version uses the lighter “-at()” variants of the system calls that operate on paths; these do not need to perform path-to-inode conversion for the whole path, only for the part relative to a directory file descriptor.

The file server is a module. It is a simple way to keep per instance state, such as the file descriptor for the root directory, the directory list template, and a few other things.

Mustache templating engine

Not all features from Mustache are implemented: some are pretty much only practical in a language that’s more expressive than C. However, without requiring (too much) boilerplate, a substantial amount of the specification is implemented, in a pretty efficient way, and it suits all of Lwan’s uses pretty well. (Being performant might not even matter here, but I’m here to have fun, not to solve problems.)

Not everything is implemented exactly as in the standard, though: that’s mostly for laziness reasons, but the non-dynamic nature of C would make certain things needlessly difficult to implement and use, anyway. The templating engine supports the basic stuff. In no particular order:

  • Variables of different types;
  • Checking the emptiness of variables;
  • Iteration on lists (and any kind of sequences);
  • Partials;
  • Comments;
  • Inverted sections.

Setting the delimiters, triple mustaches (for unescaped HTML output), the ampersand to unescape strings — and possibly other things — are not implemented, but could be added with relatively minimal effort. String escaping is supported by using a special string type and should conform to best practices.

Templates are pre-processed. This pre-processing step uses a state machine parser to break down its text representation into a series of actions that can be performed by the engine very efficiently. Actions include things like “append string”, “append variable”, “start iteration”, and so on.

typedef enum {
    TPL_ACTION_APPEND,
    TPL_ACTION_APPEND_CHAR,
    TPL_ACTION_VARIABLE,
    TPL_ACTION_LIST_START_ITER,
    TPL_ACTION_LIST_END_ITER,
    TPL_ACTION_IF_VARIABLE_NOT_EMPTY,
    TPL_ACTION_END_IF_VARIABLE_NOT_EMPTY,
    TPL_ACTION_APPLY_TPL,
    TPL_ACTION_LAST
} lwan_tpl_action_t;

For instance, a stack of hash tables is used during this pre-processing step to act as a symbol table; this table can be thrown away as soon as the pre-processing step is complete, as all variables have been resolved and a much more efficient value lookup mechanism can be used instead.

Obtaining variables

To use the templating mechanism, one should have a structure for each template. Structures are cheap and provide some welcome compile-time type checking that wouldn’t be possible otherwise.

typedef struct {
  char *name;
  int age;
} hello_t;

In addition to a structure, due to the lack of introspection in C, an array of variable descriptors should be declared. A variable descriptor contains a string representation of a variable name, the offset in bytes of that variable within the structure, and pointers to functions to test the emptiness of that kind of variable and to append the variable to the string buffer; macros help alleviate boilerplate headaches.

lwan_var_descriptor_t hello_descriptor[] = {
  TPL_VAR_STR(hello_t, name),
  TPL_VAR_INT(hello_t, age),
  TPL_VAR_SENTINEL
};

lwan_tpl_t *hello = lwan_tpl_compile("hello.tpl",
                                     hello_descriptor);

A structure containing all the variables can then be supplied by some sort of database layer, caching layer, or be declared on the spot: compound literals with designated initializers make this use case pretty straightforward.

strbuf_t *rendered = lwan_tpl_render(hello, (hello_t[]) {{
  .name = "World",
  .age = 42
}});

/* Do something with `rendered` */

strbuf_free(rendered);

Appending a variable is then just the matter of calling the appropriate callback function (conveniently in the descriptor), passing the base address of that structure plus the byte offset within it.

static void
append_var_to_strbuf(lwan_tpl_chunk_t *chunk, void *variables,
                     strbuf_t *buf)
{
    lwan_var_descriptor_t *descriptor = chunk->data;
    if (LIKELY(descriptor))
        descriptor->append_to_strbuf(buf,
                      (char *)variables + descriptor->offset);
}

Sequences

To avoid creating potentially lots of small, temporary objects, for lists and sequences a coroutine is created and used as a makeshift generator function. Another option was to implement iterators using a structure to hold state plus a few callbacks — I gave up while imagining the amount of boilerplate necessary. A function, on the other hand, is simple to write, and can include initialization, iteration, and cleanup.

How sequences are evaluated by the templating engine

The only user of sequences in templates within Lwan is the file listing feature in the file serving module. The generator function is pretty straightforward, and is responsible for opening the directory, obtaining information for each entry, and then closing the directory. A shorter version of it is described in the original blog post about sequences in the templating engine.

Caching

I’ve used and implemented a few caching infrastructures over the years, and I believe that the one in Lwan is, so far, the simplest one I’ve used. Most caches require items to be created — and then added manually to the cache. Not only is this clumsy, it can also lead to race conditions.

The one in Lwan knows how to create and destroy a cache entry: one just asks the cache to obtain a value for a given key. If it’s not there, the entry is created and returned. The lifetime of a cache entry is controlled automatically, and a low priority thread kicks in every now and then to prune old entries.
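
In code, using it boils down to something like this (a hypothetical sketch; the function names are from memory and may not match Lwan’s actual API exactly):

static void use_cached_item(struct cache_t *cache, const char *key)
{
    int error;
    struct cache_entry_t *entry;

    /* The cache creates (and caches) the entry if the key isn't
     * there yet; otherwise the existing entry is returned. */
    entry = cache_get_and_ref_entry(cache, key, &error);
    if (!entry)
        return;

    /* ... use the entry ... */

    /* Dropping the reference lets the cache reap it later on. */
    cache_entry_unref(cache, entry);
}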

struct cache_t {
    struct {
        struct hash *table;
        pthread_rwlock_t lock;
    } hash;

    struct {
        struct list_head list;
        pthread_rwlock_t lock;
    } queue;

    struct {
        CreateEntryCallback create_entry;
        DestroyEntryCallback destroy_entry;
        void *context;
    } cb;

    struct {
        time_t time_to_live;
        clockid_t clock_id;
    } settings;

    unsigned flags;

#ifndef NDEBUG
    struct {
        unsigned hits;
        unsigned misses;
        unsigned evicted;
    } stats;
#endif
};

Unlike most caches, the one in Lwan isn’t limited by size: items stay in the cache for a predetermined amount of time.

Cache entries are reference-counted, and they’re not automatically reaped if something is holding on to a reference: such items are marked as floating when this happens, and the last one to give up the reference will also destroy the entry.

struct cache_entry_t {
  struct list_node entries;
  char *key;
  unsigned refs;
  unsigned flags;
  struct timespec time_to_die;
};

struct file_cache_entry_t_ {
    struct cache_entry_t base;

    struct {
        char string[31];
        time_t integer;
    } last_modified;

    const char *mime_type;
    const cache_funcs_t *funcs;
};

When used within a coroutine, two things can happen: ➀ the coroutine might yield if the cache lock becomes contended, and ➁ the reference is automatically released when the coroutine is destroyed.

In addition to floating entries, there are also temporary entries. The cache uses read-write locks, but most of the time locks are only obtained with the “trylock” primitive: if a lock can’t be obtained for any reason, Lwan tries to move on to something else. This could be attending to another request (by yielding the coroutine), or merely returning an off-the-books entry that will be destroyed as soon as its sole user releases its reference. The difference from floating entries is merely an implementation detail: it avoids an atomic decrement (and its accompanying memory barrier).

The cache tries to avoid holding locks for long. As an example, while an item is being created, no locks are held. This can, of course, lead to multiple entries being created concurrently, but if caching is useful in the first place, having a few temporary entries lying around isn’t a problem, as at least one of them will be cached for future access.

As nice as the cache subsystem ended up being, there is a lot of room for improvement. Reducing the amount of concurrent reference counting is high on the list. Reducing latency is also under consideration, as is making HTTP responses cacheable without special code in the handler.

Keep-alive connections, death queue

Connection lifetime is managed by a per-thread queue.

Each time a connection is scheduled to a certain thread, it is pushed onto the queue and a time to die is set. When there are connections in this queue, Epoll times out every second so the thread can iterate through it and kill connections whose time has come. Timeouts are infinite when the queue is empty, to avoid waking the process unnecessarily. Every time a coroutine is resumed, the time to die is updated and the connection is pushed to the end of the queue.

Each death queue has its own epoch, which starts at zero and increments at every timeout. Whenever the last connection is removed from a queue, the epoch restarts. Keeping the epoch a small number will help shave a few bytes from each connection in the future.

struct death_queue_t {
    lwan_connection_t *conns;
    lwan_connection_t head;
    unsigned time;
    unsigned short keep_alive_timeout;
};

The same timeout value is used for keep-alive connections and coroutines. This ensures coroutines will not linger indefinitely when not performing any kind of work.

The death queue is so important that almost a third of the connection structure is dedicated to it. Three integers keep its state: the time to die (as an unsigned int), and two integers acting as links in a doubly linked list.

Integers were used instead of pointers in order to save memory. This is possible because they are, in reality, indices into the connection array. A doubly linked list was chosen because removing a connection from the middle of the queue must be efficient, as it is done very frequently to move an entry to the end. The list is also circular, to avoid branching for the empty-queue case. Maintaining the queue inline with the connection structures helps reduce cache pressure.

static inline int _death_queue_node_to_idx(
            struct death_queue_t *dq, lwan_connection_t *conn)
{
    return (conn == &dq->head) ?
            -1 : (int)(ptrdiff_t)(conn - dq->conns);
}

static inline lwan_connection_t *_death_queue_idx_to_node(
            struct death_queue_t *dq, int idx)
{
    return (idx < 0) ? &dq->head : &dq->conns[idx];
}

static void _death_queue_insert(struct death_queue_t *dq,
    lwan_connection_t *new_node)
{
    new_node->next = -1;
    new_node->prev = dq->head.prev;
    lwan_connection_t *prev = _death_queue_idx_to_node(dq,
                                           dq->head.prev);
    dq->head.prev = prev->next = _death_queue_node_to_idx(dq,
                                                   new_node);
}

static void _death_queue_remove(
            struct death_queue_t *dq, lwan_connection_t *node)
{
    lwan_connection_t *prev = _death_queue_idx_to_node(dq,
                                              node->prev);
    lwan_connection_t *next = _death_queue_idx_to_node(dq,
                                              node->next);
    next->prev = node->prev;
    prev->next = node->next;
}

Closing words

That’s pretty much it: when a response has been sent, the connection can either be closed, or a new request can be serviced on the same connection. Repeat ad infinitum and there’s the HTTP server.

If you’ve made it this far, I invite you to take a look at the full source code; there are things that were not mentioned in this article. It’s also a young Free Software project with no entry barrier: just fork it and issue a pull request.

Integer to string conversion

There are various ways to convert integers to their string representation. These conversions are rarely a bottleneck, but they often show up while profiling certain applications. For instance, they’re very common in Lwan while building the response headers.

To use Lwan as an example: initially, snprintf() was used to convert numbers. Although this works, it is quite boring, performance-wise.

The second approach was using the naïve algorithm, which basically divides the number by 10 in succession, writing backwards the result of modulus by 10 to a string, and then reversing the string when the number reaches 0.

// Code based on https://code.google.com/p/stringencoders/
// (decimal_digits[] and strreverse() come from that library.)
size_t naive_uint32_to_str(uint32_t value, char *str) {
    char *wstr = str;
    // Conversion. Number is reversed.
    do
       *wstr++ = (char) decimal_digits[value % 10];
    while (value /= 10);
    *wstr = '\0';
    // Reverse string
    strreverse(str, wstr - 1);
    return wstr - str;
}

This was fine for a while, but that string reversal step always bothered me. Why not just write the string backwards in the first place?

I’ve then changed the code in Lwan to the following snippet. Note the nice trick of multiplying the size of an integer in bytes by 3 to obtain an approximation of the number of digits for MAX_INT, including the zero terminator, regardless of what sizeof(int) is.

#define INT_TO_STR_BUFFER_SIZE (3 * sizeof(int))

char *lwan_uint32_to_str(uint32_t value,
            char buffer[static INT_TO_STR_BUFFER_SIZE],
            size_t *len) {
    char *p = buffer + INT_TO_STR_BUFFER_SIZE - 1;

    *p = '\0';
    do {
        *--p = "0123456789"[value % 10];
    } while (value /= 10);

    size_t difference = (size_t)(p - buffer);
    *len = (size_t)(INT_TO_STR_BUFFER_SIZE - difference - 1);

    return p;
}

Reducing writes to the array made this algorithm significantly faster. However, I eventually did what one should always avoid when tinkering with this kind of thing: I changed the array lookup to an addition, without measuring whether it would perform better, and committed the code anyway. The lookup table is ~9% faster. Ouch!

Last year, the Facebook Engineering team posted a function to convert integers to strings that manages to be even faster. They do use the same idea of avoiding having to reverse the string after they’re done converting each digit, and they use a lookup table as well.

But the nice trick is that, instead of having a lookup table for 10 digits, there’s a table for all pairs of digits, from 00 to 99. This cuts the number of divisions in half, yielding a significantly faster algorithm: around 31% faster than the above snippet:

size_t facebook_uint32_to_str(uint32_t value, char *dst)
{
    static const char digits[201] =
        "0001020304050607080910111213141516171819"
        "2021222324252627282930313233343536373839"
        "4041424344454647484950515253545556575859"
        "6061626364656667686970717273747576777879"
        "8081828384858687888990919293949596979899";
    size_t const length = digits10(value);
    size_t next = length - 1;
    while (value >= 100) {
        auto const i = (value % 100) * 2;
        value /= 100;
        dst[next] = digits[i + 1];
        dst[next - 1] = digits[i];
        next -= 2;
    }
    // Handle last 1-2 digits
    if (value < 10) {
        dst[next] = '0' + uint32_t(value);
    } else {
        auto i = uint32_t(value) * 2;
        dst[next] = digits[i + 1];
        dst[next - 1] = digits[i];
    }
    return length;
}

The digits10() function is another function that calculates the number of digits of a number in a very efficient manner. Even though it is efficient, the call can be avoided altogether: using a constant like numeric_limits<uint32_t>::digits10 will keep the same interface. This is possible because the dst buffer should be large enough to hold all the digits of the largest 32-bit unsigned integer anyway.
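
For reference, here’s an iterative, 32-bit take on the same idea (a sketch, not Facebook’s exact code):

#include <stddef.h>
#include <stdint.h>

static size_t digits10(uint32_t v)
{
    size_t result = 1;

    for (;;) {
        if (v < 10)    return result;
        if (v < 100)   return result + 1;
        if (v < 1000)  return result + 2;
        if (v < 10000) return result + 3;

        /* Skip four digits at a time. */
        v /= 10000;
        result += 4;
    }
}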

Because of implementation details – the function basically compares numbers to powers of 10 and recurses when the number of digits surpasses the maximum power it compares against – the speedup of using a constant length won’t be significant for small numbers (one and two digits); but if you’re optimizing at this level, using a constant won’t hurt. So much so that it is consistently faster on my machine (a Core i7 2640M laptop, with an up-to-date 64-bit Arch Linux):

Relative speedup of facebook_uint32_to_str() using digits10() and a constant value

That chart was obtained with a benchmark program I wrote that tests all these ways of converting an integer to its string representation. To compare with other methods, here’s the full chart:

Results for snprintf() omitted to not skew results. Spoiler: it’s slow.

Unfortunately, there’s a licensing issue that won’t let me use this code in Lwan: the blog post doesn’t mention a license. I’ve found this two-digit lookup table in places unrelated to Facebook as well, so I’m not sure who had the idea first. My go-to source for this kind of thing is usually Hacker’s Delight, but it’s not in there either.