One of my readers sent me this question:
Do you have any thoughts on this meltdown HPTI thing? How does a hardware issue/feature become a software vulnerability? Hasn't there always been an appropriate level of separation between kernel and user space?
There’s always been privilege-level separation between kernel and user space, but not address space separation. The kernel has been permanently mapped into the high-end addresses of user space (but not visible from user-space code on systems that had decent virtual memory management hardware) since the days of OS/360, CP/M and VAX/VMS. RSX-11M was an exception: it ran on a 16-bit CPU architecture and its designers wanted to support programs up to 64 KB in size.
Yeah, it helps that I wrote an operating system or two 35 years ago, and read the full source code for CP/M and RSX-11M.
However, most recent CPUs perform the numerous operations needed to execute an instruction in parallel, sometimes in a pipeline a dozen instructions deep… and that’s what the vulnerability is all about. Here’s how it works:
- Your program tests a bit somewhere in kernel space.
- Based on the result of that test, one or another memory location in user space is fetched.
The test of the kernel-space location will fail once access control is checked, but in the meantime other parts of the CPU have already gone ahead and executed one of the alternate instructions, including fetching the memory location.
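The sequence can be sketched with a toy model. This is not exploit code and it does not model a real CPU: `ToyCPU`, `speculative_probe`, and the addresses are hypothetical names made up for illustration, and the "cache" is just a Python set. The only point is that the access fails, yet the cache keeps a footprint that depends on the bit you were not allowed to read.

```python
class ToyCPU:
    """Toy model: a faulting access still leaves a footprint in the cache."""

    def __init__(self, kernel_memory):
        self.kernel_memory = kernel_memory  # address -> byte; user code may not read it
        self.cache = set()                  # user-space addresses currently cached

    def speculative_probe(self, kernel_addr, bit, probe_addrs):
        # Out-of-order part: runs before the permission check completes.
        secret_bit = (self.kernel_memory[kernel_addr] >> bit) & 1
        # The dependent fetch pulls one of two user-space locations into the cache.
        self.cache.add(probe_addrs[secret_bit])
        # Permission check finally fails; the architectural result is thrown away.
        raise PermissionError("access violation")


cpu = ToyCPU({0xFFFF0000: 0b00000100})  # hypothetical kernel address and content
probe = (0x1000, 0x2000)                # fetched when the bit is 0 / when it is 1

try:
    cpu.speculative_probe(0xFFFF0000, bit=2, probe_addrs=probe)
except PermissionError:
    pass  # the program never sees the value it tested...

print(cpu.cache == {0x2000})  # ...but the cache remembers which location was fetched
```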
It’s not just the CPUs
CPU vendors are not the only ones trying to get better performance with parallelized execution of seemingly-independent things. It seems at least one network hardware vendor decided to do hardware-assisted IPv6 Neighbor Discovery and started the ND process before the input ACL was checked.
End result: susceptibility to ND cache exhaustion attacks even when the proper infrastructure ACLs were applied at network edge.
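Why the order of the two checks matters can be shown with a few lines of Python. This is a sketch under obvious assumptions: `handle_packet`, `acl`, and `scan` are hypothetical names, the ND cache is a plain set, and no real forwarding hardware works this simply.

```python
def acl(dst):
    """Hypothetical infrastructure ACL: only one address on the subnet is reachable."""
    return dst == "2001:db8::1"


def handle_packet(dst, nd_cache, check_acl_first):
    """Return True if the packet is forwarded; nd_cache collects ND entries."""
    if check_acl_first and not acl(dst):
        return False          # dropped before any ND state is created
    nd_cache.add(dst)         # incomplete ND entry created, solicitation sent
    return acl(dst)           # the ACL verdict arrives too late to undo that


def scan(check_acl_first, count=1000):
    """Send packets to `count` nonexistent addresses, return ND cache size."""
    nd_cache = set()
    for i in range(2, count + 2):
        handle_packet(f"2001:db8::{i:x}", nd_cache, check_acl_first)
    return len(nd_cache)


print(scan(check_acl_first=True))   # ACL first: 0 entries
print(scan(check_acl_first=False))  # ND first: 1000 entries
```

Every packet is dropped in both cases; the only difference is how much state the attacker managed to create before the drop.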
RFC 6164 solves that problem by recommending the use of /127 prefixes on P2P links, but yet again increases the complexity of IPv6, and causes interesting problems with data center switches that use popular merchant silicon - that silicon supports only a small number of IPv6 prefixes longer than /64. More details in the Data Center Fabrics webinar.
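To see why a /127 helps, compare how many addresses an attacker can aim at on each kind of subnet. The snippet uses Python's standard `ipaddress` module; the prefixes come from the documentation range and are only examples.

```python
import ipaddress

lan = ipaddress.ip_network("2001:db8:0:1::/64")   # example /64 subnet
p2p = ipaddress.ip_network("2001:db8:0:2::/127")  # example /127 P2P link

print(lan.num_addresses)       # 18446744073709551616 potential ND cache entries
print(p2p.num_addresses)       # 2
print([str(a) for a in p2p])   # ['2001:db8:0:2::', '2001:db8:0:2::1']
```

With only two addresses on the link, both of them in use, there is nothing left to fill the ND cache with.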
Back to Instructions That Should Not Be Executed
You would say “well, it doesn’t matter if the CPU performs some prefetching… none of that stuff will be executed”, but that’s simply not true. The instructions after the test that will eventually result in an access violation are executed anyway; it’s just that their results are thrown away… but not the cached content of the memory location that was fetched from user space based on the test result.
After that, you simply fetch both locations that could have been fetched and measure how long each fetch takes. Fetching something from main memory takes longer than fetching it from the cache, and that tells you the value of the bit you couldn’t possibly access (but did, although the results were voided).
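The measurement step boils down to comparing two load times. Here is a minimal sketch, assuming made-up latencies (`CACHE_HIT_NS`, `CACHE_MISS_NS`) and a cache modeled as a set of addresses; `timed_load` and `infer_bit` are hypothetical helpers, and a real attack would have to time actual memory accesses and cope with noise.

```python
CACHE_HIT_NS = 5      # assumed latencies, for illustration only
CACHE_MISS_NS = 100


def timed_load(addr, cache):
    """Pretend to load addr and return how long the load took."""
    latency = CACHE_HIT_NS if addr in cache else CACHE_MISS_NS
    cache.add(addr)   # loading a location also brings it into the cache
    return latency


def infer_bit(probe_addrs, cache):
    """probe_addrs[b] is the location that gets fetched when the tested bit is b."""
    t0 = timed_load(probe_addrs[0], cache)
    t1 = timed_load(probe_addrs[1], cache)
    return 0 if t0 < t1 else 1   # the faster location is the one already cached


# Cache state left behind by the discarded instructions (assumed for this example).
cache_after_fault = {0x2000}
print(infer_bit((0x1000, 0x2000), cache_after_fault))  # 1
```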
Or as Robert Graham wrote:
The CPUs have no bug. The results are correct, it's just that the timing is different. CPU designers will never fix the general problem of undetermined timing.
The operating system patches that the vendors are rolling out completely remove the kernel from the user address space (or use PCID/ASID). That stops the exploit for good, but doesn’t give the kernel direct access to user space when needed. Without PCID/ASID the kernel has to change the virtual-to-physical page tables to map user space into kernel page tables and unmap it after it’s done on every single system call, which can be something as trivial as reading one byte from a file.