There's a particular shortcoming in the standard SELinux policy which becomes evident in systems like XenClient. When multiple instances of a program are run, each instance executed from the same binary will have the same label (i.e. all qemu-dm processes will be labeled qemu_t). This would allow a compromised qemu-dm process supporting a virtual machine (call it VM_A) to access the resources belonging to other VMs (pick a specific one and call it VM_B). This is not desirable, as it becomes an obvious "weakest link" in the separation we've worked so hard to achieve. This page documents an implementation of a solution to this issue known as sVirt.
Background
...
The need for sVirt became apparent as virtualization on Linux became popular in the late 2000s. James Morris announced the project in 2008 targeting integration into the libvirt project. Eventually this resulted in yet another pluggable driver in libvirt so that sVirt protections wouldn't be exclusive to SELinux systems (this allowed confinement using AppArmor as well).
Requirements
...
The sVirt requirements are documented quite thoroughly on the SELinux wiki. A quick description of these requirements is provided here for convenience but the requirements documented on the SELinux wiki should be considered authoritative.
sVirt exploits the category component of the label in the SELinux MCS policy. A property of a category is that for read operations to succeed, the category component of the subject label must be a superset of the object's. For write operations, the subject and object label category components must be equal. This is a classical implementation of the Bell-LaPadula (BLP) model. I've done a relatively thorough analysis of the implementation in SELinux. The relevant bits can be found here and here for the interested reader :)
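The read and write rules described above can be modeled with plain set operations. The following is a minimal sketch of that rule only, not the actual MCS policy constraints; the helper names (`parse_categories`, `may_read`, `may_write`) are made up for illustration, and the parser assumes the usual `c1,c5` list and `c0.c3` range notation for the category field.

```python
def parse_categories(field):
    """Parse an MCS category field such as 'c716', 'c1,c5' or 'c0.c3'.

    Returns the categories as a frozenset of integers. An empty field
    means "no categories".
    """
    cats = set()
    if not field:
        return frozenset()
    for part in field.split(","):
        if "." in part:
            # 'c0.c3' denotes the inclusive range c0 through c3.
            low, high = part.split(".")
            cats.update(range(int(low[1:]), int(high[1:]) + 1))
        else:
            cats.add(int(part[1:]))
    return frozenset(cats)


def may_read(subject, obj):
    """Read rule as described in the text: subject categories must be a superset."""
    return parse_categories(subject) >= parse_categories(obj)


def may_write(subject, obj):
    """Write rule as described in the text: categories must be equal."""
    return parse_categories(subject) == parse_categories(obj)
```

Under this model a subject labeled `c716` can read and write an object labeled `c716`, but can do neither to an object labeled `c425`, which is the property sVirt relies on.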
By assigning a unique category to each running instance of QEMU, we can effectively prevent their interaction even though they run with the same SELinux type. As an example, with sVirt implemented, two running QEMU processes may have these labels:
- system_u:system_r:qemu_t:s0:c716
- system_u:system_r:qemu_t:s0:c425
...
Thus the two instances of QEMU cannot interact with each other.
Further, the same labels should be applied to the resources owned by each QEMU instance. Currently we apply these labels to the device nodes created by blktap2. We do not assign c0 to any running QEMU instance. This category is reserved for disks belonging to VMs that have no QEMU instance or that are powered off, so these disks should never be accessed by a QEMU instance.
Current Implementation on OpenXT's predecessor
...
Currently deployed in OpenXT's predecessor is a small binary (selinux-interpose) that is interposed between the toolstack (xenmgr) and qemu-dm. Upon execution this program does four things:
- Generates a unique integer between 1 and 1023. This integer represents an SELinux MCS category that is assigned to each running VM.
- Enumerates each writable storage device assigned to the VM and relabels them with the generated category.
- Sets the execution context for subsequent exec calls such that qemu-dm when started is labeled with the appropriate category.
- Executes qemu-dm with the supplied command line.
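The steps above can be sketched as a pure "planning" function. This is only an illustration under stated assumptions, not the actual interposer: the real program is written in C and would apply the labels through libselinux and then exec qemu-dm, whereas this sketch only computes what would be done. The function names, the `in_use` set used to keep categories unique, the disk type `example_disk_t`, and the contexts and paths in the usage note are all hypothetical. It also assumes contexts of the form `user:role:type:s0` with a single sensitivity.

```python
import random

# c0 is reserved for disks with no running QEMU, so usable categories start at 1.
MCS_MIN, MCS_MAX = 1, 1023


def pick_category(in_use, rng=random):
    """Pick a category in [1, 1023] that no running VM is using (assumed bookkeeping)."""
    free = [c for c in range(MCS_MIN, MCS_MAX + 1) if c not in in_use]
    if not free:
        raise RuntimeError("no free MCS category")
    return rng.choice(free)


def with_category(context, category):
    """Return the context with its category component set to c<category>."""
    user, role, type_, sensitivity = context.split(":")[:4]
    return f"{user}:{role}:{type_}:{sensitivity}:c{category}"


def plan_launch(disks, qemu_context, disk_context, in_use, argv):
    """Compute the relabels and exec context for one qemu-dm launch."""
    # Step 1: generate the category for this VM.
    category = pick_category(in_use)
    # Step 2: only writable storage devices are relabeled.
    relabel = {
        disk["path"]: with_category(disk_context, category)
        for disk in disks
        if disk["writable"]
    }
    # Step 3: the context qemu-dm should carry after exec.
    exec_context = with_category(qemu_context, category)
    # Step 4: the command line is passed through unchanged.
    return {
        "category": category,
        "relabel": relabel,
        "exec_context": exec_context,
        "argv": list(argv),
    }
```

For example, calling `plan_launch` with one writable and one read-only disk, a QEMU context of `system_u:system_r:qemu_t:s0`, and `in_use={5}` yields a plan that relabels only the writable disk and never selects category 5 or c0.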
...
This code is derived from the SELinux Virtualization Prototype approved for public release by the US Air Force, case number 88ABW-2011-2106. The code can be found here: https://github.com/OpenXT/xenclient-oe/blob/master/recipes-security/selinux/svirt-interpose/svirt-interpose.c
...