Announcing Kanidm - A new IDM project

Today I’m starting to talk about my new project - Kanidm. Kanidm is an IDM project designed to be correct, simple and scalable. As an IDM project we should be able to store the identities and groups of people, authenticate them securely to various other infrastructure components and services, and much more.

You can find the source for kanidm on github.

For more details about what the project is planning to achieve, and what we have already implemented please see the github.

What about 389 Directory Server

I’m still part of the project, and working hard on making it the best LDAP server possible. Kanidm and 389-ds have different goals. 389 Directory Server is a globally scalable, distributed database, that can store huge amounts of data and process thousands of operations per second. 389-ds let’s you build a system ontop, in any way you want. If you want an authentication system today, use 389-ds. We are even working on a self-service web portal soon too (one of our most requested features!). Besides my self, no one on the (amazing) 389 DS team has any association with kanidm (yet?).

Kanidm is an opinionated IDM system, and has strong ideas about how authentication and users should be processed. We aim to be scalable, but that’s a long road ahead. We also want to have more web integrations, client tools and more. We’ll eventually write a kanidm to 389-ds sync tool.

Why not integrate something with 389? Why something new?

There are a lot of limitations with LDAP when it comes to modern web-focused auth processes such as webauthn. Because of this, I wanted to make something that didn’t have the same limitations, and had different ideas about data storage and apis. That’s why I wanted to make something new in parallel. It was a really hard decision to want to make something outside of 389 Directory Server (Because I really do love the project, and have great pride in the team), but I felt like it was going to be more productive to build in parallel, than ontop.

When will it be ready?

I think that a single-server deployment will be usable for small installations early 2020, and a fully fledged system with replication would be late 2020. It depends on how much time I have and what parts I implement in what order. Current rough work order (late 2019) is indexing, RADIUS integration, claims, and then self-service/web ui.

OpenSUSE leap as a virtualisation host

I’ve been rebuilding my network to use SUSE from CentOS, and the final server was my hypervisor. Most of the reason for this is the change in my employment, so I feel it’s right to dogfood for my workplace.

What you will need

  • Some computer parts (assembaly may be required)
  • OpenSUSE LEAP 15.1 media (dd if=opensuse.iso of=/dev/a_usb_i_hope)

What are we aiming for?

My new machine has dual NVME and dual 8TB spinning disks. The intent is to have the OS on the NVME and to have a large part of the NVME act as LVM cache for the spinning disk. The host won’t run any applications beside libvirt, and has to connect a number of vlans over a link aggregation.

Useful commands

Through out this document I’ll assume some details about your devices and partitions. To find your own, and to always check and confirm what you are doing, some command will help:

lsblk  # Shows all block storage devices, and how (if) they are mounted.
lvs  # shows all active logical volumes
vgs  # shows all active volume groups
pvs  # shows all active physical volumes
dmidecode  # show hardware information
ls -al /dev/disk/by-<ID TYPE>/  # how to resolve disk path to a uuid etc.

I’m going to assume you have devices like:

/dev/nvme0  # the first nvme of your system that you install to.
/dev/nvme1  # the second nvme, used later
/dev/sda    # Two larger block storage devices.
/dev/sdb

Install

Install and follow the prompts. Importantly when you install you install to a single NVME, and choose transactional server + lvm, then put btrfs on the /root in the lvm. You want to partition such that there is free space still in the NVME - I left about 400GB unpartitioned for this. In other words, the disk should be like:

[ /efi | pv + vg system               | pv (unused) ]
       | /root (btrfs), /boot, /home  |

Remember to leave about 1GB of freespace on the vg system to allow raid 1 metadata later!

Honestly, it may take you a try or two to get this right with YaST, and it was probably the trickiest part of the install.

You should also select that network management is via networkmanager, not wicked. You may want to enable ssh here. I disabled the firewall personally because there are no applications and it interfers with the bridging for the vms.

Packages

Because this is suse transactional we need to add packages and reboot each time. Here is what I used, but you may find you don’t need everything here:

transactional-update pkg install libvirt libvirt-daemon libvirt-daemon-qemu \
  sssd sssd-ad sssd-ldap sssd-tools docker zsh ipcalc python3-docker rdiff-backup \
  vim rsync iotop tmux fwupdate fwupdate-efi bridge-utils qemu-kvm apcupsd

Reboot, and you are ready to partition.

Partitioning - post install

First, copy your gpt from the first NVME to the second. You can do this by hand with:

gdisk /dev/nvme0
p
q

gdisk /dev/nvme1
c
<duplicate the parameters as required>

Now we’ll make your /efi at least a little redundant

mkfs.fat /dev/nvme1p0
ls -al /dev/disk/by-uuid/
# In the above, look for your new /efi fs, IE CE0A-2C1D -> ../../nvme1n1p1
# Now add a line to /etc/fstab like:
UUID=CE0A-2C1D    /boot/efi2              vfat   defaults                      0  0

Now to really make this work, because it’s transactional, you have to make a change to the /root, which is readonly! To do this run

transactional-update shell dup

This put’s you in a shell at the end. Run:

mkdir /boot/efi2

Now reboot. After the reboot your second efi should be mounted. rsync /boot/efi/* to /boot/efi2/. I leave it to the reader to decide how to sync this periodically.

Next you can setup the raid 1 mirror for /root and the system vg.

pvcreate /dev/nvme1p1
vgextend system /dev/nvme1p1

Now we have enough pvs to make a raid 1, so we convert all our volumes:

lvconvert --type raid1 --mirrors 1 system/home
lvconvert --type raid1 --mirrors 1 system/root
lvconvert --type raid1 --mirrors 1 system/boot

If this fails with “not enough space to alloc metadata”, it’s because you didn’t leave space on the vg during install. Honestly, I made this mistake twice due to confusion about things leading to two reinstalls …

Getting ready to cache

Now lets get ready to cache some data. We’ll make pvs and vgs for data:

pvcreate /dev/nvme0p2
pvcreate /dev/nvme1p2
pvcreate /dev/sda1
pvcreate /dev/sda2
vgcreate data /dev/nvme0p2 /dev/nvme1p2 /dev/sda1 /dev/sdb2

Create the larger volume

lvcreate --type raid1 --mirrors 1 -L7.5T -n libvirt_t2 data /dev/sda1 /dev/sdb1

Prepare the caches

lvcreate --type raid1 --mirrors 1 -L 4G -n libvirt_t2_meta data
lvcreate --type raid1 --mirrors 1 -L 400G -n libvirt_t2_cache data
lvconvert --type cache-pool --poolmetadata data/libvirt_t2_meta data/libvirt_t2_cache

Now put the caches in front of the disks. It’s important for you to check you have the correct cachemode at this point, because you can’t change it without removing and re-adding the cache. I choose writeback because my nvme devices are in a raid 1 mirror, and it helps to accelerate writes. You may err to use the default where the SSD’s are read cache only.

lvconvert --type cache --cachemode writeback --cachepool data/libvirt_t2_cache data/libvirt_t2
mkfs.xfs /dev/mapper/data-libvirt_t2

You can monitor the amount of “cached” data in the data column of lvs.

Now you can add this to /etc/fstab as any other xfs drive. I mounted it to /var/lib/libvirt/images.

Network Manager

Now, I have to assemble the network bridges. Network Manager has some specific steps to follow to achieve this. I have:

  • two intel gigabit ports
  • the ports are link aggregated by 802.3ad
  • there are multiple vlans ontop of the link agg
  • bridges must be built on top of the vlans

This requires a specific set of steps to layer this, because network manager sees the bridge and the lagg as seperate things that require the vlan to tie them together.

Configure the link agg, and add our two ethernet phys

nmcli conn add type bond con-name bond0 ifname bond0 mode 802.3ad ipv4.method disabled ipv6.method ignore
nmcli connection add type ethernet con-name bond0-eth1 ifname eth1 master bond0 slave-type bond
nmcli connection add type ethernet con-name bond0-eth2 ifname eth2 master bond0 slave-type bond

Add a bridge for a vlan:

nmcli connection add type bridge con-name net_18 ifname net_18 ipv4.method disabled ipv6.method ignore

Now tie together a vlan on the bond, to the bridge we created.

nmcli connection add type vlan con-name bond0.18 ifname bond0.18 dev bond0 id 18 master net_18 slave-type bridge

You will need to repeat these last two commands as required for the vlans you have.

House Keeping

Finally you need to do some house keeping. Transactional server will automatically reboot and update so you need to be ready for this. You may disable this with:

systemctl disable transactional-update.timer

You likely want to edit:

/etc/sysconfig/libvirt-guests

To be able to handle guest shutdown policy due to a UPS failure or a reboot.

Now you can enable and start libvirt:

systemctl enable libvirtd
systemctl start libvirtd

Finally you can build and import virtualmachines.

LDAP Filter Syntax Validation

Today I want to do a deep-dive into a change that will be released in 389 Directory Server 1.4.2. It’s a reasonably complicated change for our server, but it has a simple user interaction for admins and developers. I want to peel back some of the layers to explain what kind of experience, thought and team work goes into a change like this.

TL;DR - just keep upgrading your 389 Directory Server instance, and our ‘correct by default’ policy will apply, and you’ll keep having the best LDAP server we can make :)

LDAP Filters and How They Work

LDAP filters are one of the primary methods of expression in LDAP, and are used in almost every aspect of the system - from finding who you are when you login, to asserting you are member of a group or have other security attributes.

For the purposes of this discussion we’ll look at this filter:

'(|(cn=william)(cn=claire))'

In order to execute these queries quickly (LDAP is designed to handle thousands of operations per second) we heavily rely on indexing. Indexing is often a topic where people believe it to be some kind of “magic” but it’s reasonably simple: indexes are pre-computed partial result sets. So why do we need these?

We’ll imagine we have two entries (invalid, and truncated for brevity).

dn: cn=william,...
cn: william

dn: cn=claire,...
cn: claire

These entries both have entry-ids - these id’s are per-server in a replication group and are integers. You can show them by requesting entryid as an attribute in 389.

dn: cn=william,...
entryid: 1
cn: william

dn: cn=claire,...
entryid: 2
cn: claire

Our entries are stored in the main-entry database in /var/lib/dirsrv/slapd-standalone1/db/userRoot in the file “id2entry.db4”. This is a key-value database where the keys are the entryid, and the value is the serialised entry itself. Roughly, it’s:

[ ID ][ Entry             ]
  1     dn: cn=william,...
        cn: william

  2     dn: cn=claire,...
        cn: claire

Now, if we had NO indexes, to evaluate our filters we have to scan every entry of id2entry to determine if the filter matches. This algorithm is:

candidate_set = []
for id in id-min to id-max:
    entry = load_entry_by_id(id)
    if apply_filter(filter, entry):
        candidate_set.append(entry)

For two entries, this may be fast, but when you have 1000, 10.000, or even millions, this is extremely slow. We call these searches full table scans or in 389 DS, ALLIDS searches.

To make our searches faster we have indexes. An index is a mapping of a partial query term to an id list (In 389 we call these IDLs). An IDL is a set of integers. Our index for these examples would likely be something like:

cn
=william: [1, ]
=claire: [2, ]

These indexes are also stored in key-value databases in userRoot - you can see this as cn.db4.

So when we have an indexed term, to evaluate the query, we’ll load the indexes, then using mathematical set operations, we then produce a candidate_id_set, and we can then load the entries that only match.

For example in psuedo python code:

# Assume query is: (cn=william)

attr = filter.get_attr_name()
with open('%s.db' % attr) as index:
    idl = index.get('=william') # from the filter :)

for id in idl:
    ... # as before.

So we can see now that when we load the idl for cn index, this would give us the set [1, ]. Even if the database had 100 million entries, as our idl is a single value, we only need to load the one entry that matches. Neat!

When we have a more complex operation such as AND and OR, we can now manipulate the idl sets. For example:

(|(uid=claire)(uid=william))
   uid =claire -> idl [2, ]
   uid =william -> idl [1, ]

candidate_idl_set = union([2, ], [1, ])
# [1, 2]

This means again, even with millions of entries, we only need to load entry 1 and 2 to uphold the query provided to us.

So we finally know enough to understand how our example query is executed. PHEW!

Unindexed Attributes

However, it’s not always so easy. When we have an attribute that isn’t indexed, we have to handle this situation. In these cases, while we operate on the idl set, we may insert an idl with the value of ALLIDS (which as previously mentioned, is the “set of all entries”). This can have various effects.

If this was an AND query, we can annotate that the filter is partially resolved. This means that if we had:

(&(cn=william)(unindexed=foo))

Because an AND condition, both filter components must be satisfied, we have a partial candidate set from cn=william of [1, ]. We can load this partial candidate set, and then apply the filter test as in the full table scan case, but as we only apply it to a single entry this is really fast.

The real problem is OR queries. If we had:

(|(cn=william)(unindexed=foo))

Because OR means that both filter components could be satisfied, we have to turn unindexd into ALLIDS, and the result of the OR as a whole is ALLIDS. So even if we have 30 indexed values in the OR, a single ALLIDS (unindexed value) will always turn that OR into a full table scan. This is not good for performance!

Missing Attributes

So as a weirder case … what if the attribute doesn’t exist in schema at all? For example we could search for Microsoft AD attributes in 389 Directory Server, or we could submit bogus filters like “(whargarble=foo)”. What happens here?

Well, historically we treated these the same as unindexed queries. Which means that any term that is not in schema, would be treated as ALLIDS. This led to a “quitely known” denial of service attack again 389 Directory Server where you could emit a large number of queries for attributes that don’t exist, causing the server to attempt many ALLIDS scans. We have some defences like the allids limit (how many entries you can full table scan before giving up). But it can still cause entry cache churn and other performance issues.

I was first made aware of this issue in 2014 while working for University of Adelaide where our VMWare service would query LDAP for MS attributes, causing a large performance issue. We resolved this by adding the MS attributes to schema and indexing them so that they would create empty indexes - now we would call this in 389 Directory Server and “idl_alloc(0)” or “empty IDL”.

When initially hired by Red Hat in 2015 I knew this issue existed but I didn’t know enough about the server core to really fix it, so it went in the back of my mind … it was rare to have a customer raise this issue, but we had the work around and so I was able to advise support services on how to mitigate this.

In 2019 however, while investigating an issue related to filter optimisation, I was made aware of an issue with FreeIPA where they were doing certmap queries that requested MS Cert attributes. However it would cause large performance issues. We now had the issue again, and in a large widely installed product so it was time to tackle it.

How to handle this?

A major issue in this is “never breaking customers”. Because we had always supported this behaviour there is a risk that any solution would cause customer queries to “silently” begin to break if we issued a fix or change. More accurately, any change to how we execute the filters could cause results of the filters to change, which would disrupt customers.

Saying this, there is also precedent that 389 Directory Server was handling this incorrectly. From the RFC for LDAP it was noted:

Any assertion about the values of such an attribute is only defined if the AttributeType is known by the evaluating mechanism, the purported AttributeValue(s) conforms to the attribute syntax defined for that attribute type, the implied or indicated matching rule is applicable to that attribute type, and (when used) a presented matchValue conforms to the syntax defined for the indicated matching rules. When these conditions are not met, the FilterItem shall evaluate to the logical value UNDEFINED. An assertion which is defined by these conditions additionally evaluates to UNDEFINED if it relates to an attribute value and the attribute type is not present in an attribute against which the assertion is being tested. An assertion which is defined by these conditions and relates to the presence of an attribute type evaluates to FALSE.

Translation: If a filter component (IE nonexist=foo) is in a filter but NOT in the schema, the result of the filter’s evaluation is an empty-set aka undefined.

It was also clear that if an engaged and active consumer like FreeIPA is making this mistake, then it must be overlooked by many others without notice. So there is sometimes value in helping to raise the standard so that everyone benefits, and highlight mistakes quicker.

The Technical Solution

This is the easy part - we add a new configuration option with three states. “on”, “off”, “warn”. “On” would enable the strictest handling of filters, rejecting them an not continuing if any attribute requested was not in the schema. “Warn” would provide the rfc compliant behaviour, mapping to empty-set index, and notifying in the logs that this occured. Finally, “off” would be the previous “silently allow” behaviour.

This was easily achieved in filter parsing, by checking the attribute of each filter component against our schema hashmap. We then tag the filter element, and depending on the current setting level reject or continue.

In the query execution code, we now check the filter tag to understand if the attribute is schema present or not. If it’s flagged as “undefined”, then we immediately shortcut to return idl_alloc(0) instead of returning ALLIDS on the failure to find the relevant index db.

We can show the performance impact of this change:

Before with non-existant attribute

Average rate:    7.40/thr

After with “warn” enabled (map to empty set)

Average rate: 4808.70/thr

This is a huge improvement, and certainly shows the risk of DOS and how effective the solution was!

The Social Solution

Of course, this is the hard part - the 389 Directory Server team are all amazingly smart people, from many countries, and all live around the world. They all care that the server is the best possible, and that our standards as a team are high. That means when introducing a change that has a risk of affecting query result sets like this, they pay attention, and ask many difficult questions about how the feature will be implemented.

The first important justification - is a change like this worth while? We can see from the performance results that the risk of DOS is reasonable, so the answer there becomes Yes from a security view. But it’s also important to consider the cost on consumers - is this change going to benefit FreeIPA for example? As I am biased being the author I want to say “yes” - by notifying or rejecting invalid filters earlier, we can help FreeIPA developers improve their code quality, without expecting them to know LDAP inside and out.

The next major question is performance - before the feature was developed there is clearly a risk of DOS, but when we implement this we are performing additional locking on the schema. Is that a risk to our standalone performance or normal operating conditions. This had to also be discussed and assessed.

A really important point that was raised by Thierry is how we communicated these errors too. Previously we would use the “notes=” field of the access log. It looks like this:

conn=1 op=4 RESULT err=0 tag=101 nentries=13 etime=0.0003795424 notes=U

The challenge with the notes= field, is that it’s easy to overlook, and unless you are familar, hard to see what this is indicating. In this case, notes=U means partially unindexed query (one filter component but not all returned ALLIDS).

We can’t change the notes field due to the risk of breaking our own scripts like logconv.pl, support tools developed by RH or SUSE, or even integrations to platforms like splunk. But clearly we need a way to detail what is happening with your filter. So Thierry suggested an extension to have details about the provided notes. Now we get:

conn=1 op=4 RESULT err=0 tag=101 nentries=13 etime=0.0003795424 notes=U details="Partially Unindexed Filter"
conn=1 op=8 RESULT err=0 tag=101 nentries=0 etime=0.0001886208 notes=F details="Filter Element Missing From Schema"

So we have extended our log message, but without breaking existing integrations.

The final question is what our defaults should be. It’s one thing to have this feature, but what should we ship with? Do we strictly reject filters? Warn? Or disable, and expect people to turn this on.

This became a long discussion with Ludwig, Thierry and I - we discussed the risk of DOS in the first place, what the impact of the levels could be, how it could break legacy applications or sites using deprecated features or with weird data imports. Many different aspects were considered. We decided to default to “warn” (non-existant becomes empty-set), and we settled on communication with support to advise them of the upcoming change, but also we considered that our “back out” plan is to change the default and ship a patch if there is a large volume of negative feedback.

Conclusion

As of today, the PR is merged, and the code on it’s way to the next release. It’s a long process but the process exists to ensure we do what’s best for our users, while we try to balance many different aspects. We have a great team of people, with decades of experience from many backgrounds which means that these discussions can be long and detailed, but in the end, we hope to give what is the best product possible to our community.

It’s also valuable to share how much thought and effort goes into projects - in your life you may only interact with 1% of our work through our configuration and system, but we have an iceberg of decisions and design process that affects you every day, where we have to be responsible and considerate in our actions.

I hope you enjoyed this exploration of this change!

References

PR#50379

Using ramdisks with Cargo

I have a bit of a history of killing SSDs - probably because I do a bit too much compiling and management of thousands of tiny files. Plenty of developers have this problem! So while thinking one evening, I was curious if I could setup a ramdisk on my mac for my cargo work to output to.

Making the ramdisk

On Linux you’ll need to use tmpfs or some access to /dev/shm.

On OSX you need to run a script like the following:

diskutil partitionDisk $(hdiutil attach -nomount ram://4096000) 1 GPTFormat APFS 'ramdisk' '100%'

This creates and mounts a 4GB ramdisk to /Volumes/ramdisk. Make sure you have enough ram!

Asking cargo to use it

We probably don’t want to make our changes permant in Cargo.toml, so we’ll use the environment:

CARGO_TARGET_DIR=/Volumes/ramdisk/rs cargo ...

Does it work?

Yes!

Disk Build (SSD, 2018MBP)

Finished dev [unoptimized + debuginfo] target(s) in 2m 29s

4 GB APFS ramdisk

Finished dev [unoptimized + debuginfo] target(s) in 1m 53s

For me it’s more valuable to try and save those precious SSD write cycles, so I think I’ll try to stick with this setup. You can see how much rust writes by doing a clean + build. My project used the following:

Filesystem                             Size   Used  Avail Capacity    iused               ifree %iused  Mounted on
/dev/disk110s1                        2.0Gi  1.2Gi  751Mi    63%       3910 9223372036854771897    0%   /Volumes/ramdisk

Make it permanent

Put the following in /Library/LaunchDaemons/au.net.blackhats.fy.ramdisk.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
        <dict>
                <key>Label</key>
                <string>au.net.blackhats.fy.ramdisk</string>
                <key>Program</key>
                <string>/usr/local/libexec/ramdisk.sh</string>
                <key>RunAtLoad</key>
                <true/>
                <key>StandardOutPath</key>
                <string>/var/log/ramdisk.log</string>
        </dict>
</plist>

And the following into /usr/local/libexec/ramdisk.sh

#!/bin/bash
date
diskutil partitionDisk $(hdiutil attach -nomount ram://4096000) 1 GPTFormat APFS 'ramdisk' '100%'

Finally put this in your cargo file of choice

[build]
target = "/Volumes/ramdisk/rs"

Future william will need to work out if there are negative consequences to multiple cargo projects sharing the same target directory … hope not!

Launchctl tips

# Init the service
launchctl load /Library/LaunchDaemons/au.net.blackhats.fy.ramdisk.plist

References

lanuchd.info

CPU atomics and orderings explained

Sometimes the question comes up about how CPU memory orderings work, and what they do. I hope this post explains it in a really accessible way.

Short Version - I wanna code!

Summary - The memory model you commonly see is from C++ and it defines:

  • Relaxed
  • Acquire
  • Release
  • Acquire/Release (sometimes AcqRel)
  • SeqCst

There are memory orderings - every operation is “atomic”, so will work correctly, but there rules define how the memory and code around the atomic are influenced.

If in doubt - use SeqCst - it’s the strongest guarantee and prevents all re-ordering of operations and will do the right thing.

The summary is:

  • Relaxed - no ordering guarantees, just execute the atomic as is.
  • Acquire - all code after this atomic, will be executed after the atomic.
  • Release - all code before this atomic, will be executed before the atomic.
  • Acquire/Release - both Acquire and Release - ie code stays before and after.
  • SeqCst - Stronger consistency of Acquire/Release.

Long Version … let’s begin …

So why do we have memory and operation orderings at all? Let’s look at some code to explain:

let mut x = 0;
let mut y = 0;
x = x + 3;
y = y + 7;
x = x + 4;
x = y + x;

Really trivial example - now to us as a human, we read this and see a set of operations that are linear by time. That means, they execute from top to bottom, in order.

However, this is not how computers work. First, compilers will optimise your code, and optimisation means re-ordering of the operations to achieve better results. A compiler may optimise this to:

let mut x = 0;
let mut y = 0;
// Note removal of the x + 3 and x + 4, folded to a single operation.
x = x + 7
y = y + 7;
x = y + x;

Now there is a second element. Your CPU presents the illusion of running as a linear system, but it’s actually an asynchronous, out-of-order task execution engine. That means a CPU will reorder your instructions, and may even run them concurrently and asynchronously.

For example, your CPU will have both x + 7 and y + 7 in the pipeline, even though neither operation has completed - they are effectively running at the “same time” (concurrently).

When you write a single thread program, you generally won’t notice this behaviour. This is because a lot of smart people write compilers and CPU’s to give the illusion of linear ordering, even though both of them are operating very differently.

Now we want to write a multithreaded application. Suddenly this is the challenge:

We write a concurrent program, in a linear language, executed on a concurrent asynchronous machine.

This means there is a challenge is the translation between our mind (thinking about the concurrent problem), the program (which we have to express as a linear set of operations), which then runs on our CPU (an async concurrent device).

Phew. How do computers even work in this scenario?!

Why are CPU’s async?

CPU’s have to be async to be fast - remember spectre and meltdown? These are attacks based on measuring the side effects of CPU’s asynchronous behaviour. While computers are “fast” these attacks will always be possible, because to make a CPU synchronous is slow - and asynchronous behaviour will always have measurable side effects. Every modern CPU’s performance is an illusion of async black magic.

A large portion of the async behaviour comes from the interaction of the CPU, cache, and memory.

In order to provide the “illusion” of a coherent synchronous memory interface there is no seperation of your programs cache and memory. When the cpu wants to access “memory” the CPU cache is utilised transparently and will handle the request, and only on a cache miss, will we retrieve the values from RAM.

(Aside: in almost all cases more CPU cache, not frequency will make your system perform better, because a cache miss will mean your task stalls waiting on RAM. Ohh no!)

CPU -> Cache -> RAM

When you have multiple CPU’s, each CPU has it’s own L1 cache:

CPU1 -> L1 Cache -> |              |
CPU2 -> L1 Cache -> | Shared L2/L3 | -> RAM
CPU3 -> L1 Cache -> |              |
CPU4 -> L1 Cache -> |              |

Ahhh! Suddenly we can see where problems can occur - each CPU has an L1 cache, which is transparent to memory but unique to the CPU. This means that each CPU can make a change to the same piece of memory in their L1 cache without the other CPU knowing. To help explain, let’s show a demo.

CPU just trash my variables fam

We’ll assume we now have two threads - my code is in rust again, and there is a good reason for the unsafes - this code really is unsafe!

// assume global x: usize = 0; y: usize = 0;

THREAD 1                        THREAD 2

if unsafe { *x == 1 } {          unsafe {
    unsafe { *y += 1 }              *y = 10;
}                                   *x = 1;
                                }

At the end of execution, what state will X and Y be in? The answer is “it depends”:

  • What order did the threads run?
  • The state of the L1 cache of each CPU
  • The possible interleavings of the operations.
  • Compiler re-ordering

In the end the result of x will always be 1 - because x is only mutated in one thread, the caches will “eventually” (explained soon) become consistent.

The real question is y. y could be:

  • 10
  • 11
  • 1

10 - This can occur because in thread 2, x = 1 is re-ordered above y = 10, causing the thread 1 “y += 1” to execute, followed by thread 2 assign 10 directly to y. It can also occur because the check for x == 1 occurs first, so y += 1 is skipped, then thread 2 is run, causing y = 10. Two ways to achieve the same result!

11 - This occurs in the “normal” execution path - all things considered it’s a miracle :)

1 - This is the most complex one - The y = 10 in thread 2 is applied, but the result is never sent to THREAD 1’s cache, so x = 1 occurs and is made available to THREAD 1 (yes, this is possible to have different values made available to each cpu …). Then thread 1 executes y (0) += 1, which is then sent back trampling the value of y = 10 from thread 2.

If you want to know more about this and many other horrors of CPU execution, Paul McKenny is an expert in this field and has many talks at LCA and others on the topic. He can be found on twitter and is super helpful if you have questions.

So how does a CPU work at all?

Obviously your system (likely a multicore system) works today - so it must be possible to write correct concurrent software. Cache’s are kept in sync via a protocol called MESI. This is a state machine describing the states of memory and cache, and how they can be synchronised. The states are:

  • Modified
  • Exclusive
  • Shared
  • Invalid

What’s interesting about MESI is that each cache line is maintaining it’s own state machine of the memory addresses - it’s not a global state machine. To coordinate CPU’s asynchronously message each other.

A CPU can be messaged via IPC (Inter-Processor-Communication) to say that another CPU wants to “claim” exclusive ownership of a memory address, or to indicate that it has changed the content of a memory address and you should discard your version. It’s important to understand these messages are asynchronous. When a CPU modifies an address it does not immediately send the invalidation message to all other CPU’s - and when a CPU recieves the invalidation request it does not immediately act upon that message.

If CPU’s did “synchronously” act on all these messages, they would be spending so much time handling IPC traffic, they would never get anything done!

As a result, it must be possible to indicate to a CPU that it’s time to send or acknowledge these invalidations in the cache line. This is where barriers, or the memory orderings come in.

  • Relaxed - No messages are sent or acknowledged.
  • Release - flush all pending invalidations to be sent to other CPUS
  • Acquire - Acknowledge and process all invalidation requests in my queue
  • Acquire/Release - flush all outgoing invalidations, and process my incomming queue
  • SeqCst - as AcqRel, but with some other guarantees around ordering that are beyond this discussion.

Understand a Mutex

With this knowledge in place, we are finally in a position to understand the operations of a Mutex

// Assume mutex: Mutex<usize> = Mutex::new(0);

THREAD 1                            THREAD 2

{                                   {
    let guard = mutex.lock()            let guard = mutex.lock()
    *guard += 1;                        println!(*guard)
}                                   }

We know very clearly that this will print 1 or 0 - it’s safe, no weird behaviours. Let’s explain this case though:

THREAD 1

{
    let guard = mutex.lock()
    // Acquire here!
    // All invalidation handled, guard is 0.
    // Compiler is told "all following code must stay after .lock()".
    *guard += 1;
    // content of usize is changed, invalid req is queue
}
// Release here!
// Guard goes out of scope, invalidation reqs sent to all CPU's
// Compiler told all proceeding code must stay above this point.

            THREAD 2

            {
                let guard = mutex.lock()
                // Acquire here!
                // All invalidations handled - previous cache of usize discarded
                // and read from THREAD 1 cache into S state.
                // Compiler is told "all following code must stay after .lock()".
                println(*guard);
            }
            // Release here!
            // Guard goes out of scope, no invalidations sent due to
            // no modifications.
            // Compiler told all proceeding code must stay above this point.

And there we have it! How barriers allow us to define an ordering in code and a CPU, to ensure our caches and compiler outputs are correct and consistent.

Benefits of Rust

A nice benefit of Rust, and knowing these MESI states now, we can see that the best way to run a system is to minimise the number of invalidations being sent and acknowledged as this always causes a delay on CPU time. Rust variables are always mutable or immutable. These map almost directly to the E and S states of MESI. A mutable value is always exclusive to a single cache line, with no contention - and immutable values can be placed into the Shared state allowing each CPU to maintain a cache copy for higher performance.

This is one of the reasons for Rust’s amazing concurrency story is that the memory in your program map to cache states very clearly.

It’s also why it’s unsafe to mutate a pointer between two threads (a global) - because the cache of the two cpus’ won’t be coherent, and you may not cause a crash, but one threads work will absolutely be lost!

Finally, it’s important to see that this is why using the correct concurrency primitives matter - it can highly influence your cache behaviour in your program and how that affects cache line contention and performance.

For comments and more, please feel free to email me!

Shameless Plug

I’m the author and maintainer of Conc Read - a concurrently readable datastructure library for Rust. Check it out on crates.io!