This document collates information obtained from inspection of the Linux kernel and Windows KVM VirtIO device drivers for running as guest workloads.

Analysis of the Linux VirtIO implementation and system structure

This section covers the relevant device driver implementations within the Linux kernel and the QEMU device model.

Driver structure

There is a distinction between the general class of virtio device drivers, which provide function-specific logic implementing the front-end of virtual devices, and the transport virtio device drivers, which are responsible for device discovery and provision of virtqueues for data transport to the front-end drivers.

VirtIO transport drivers

There are several implementations of transport virtio drivers, designed to be interchangeable with respect to the virtio front-end drivers.

There are also other instances of virtio transport drivers elsewhere in the Linux kernel; these are not relevant to the remainder of this document and so are not discussed further.

These transport drivers communicate with backend implementations in the QEMU device model. Multiple transport drivers can operate concurrently in the same kernel without interference. The virtio-pci-modern transport driver is the most advanced implementation within the Linux kernel, so it is an appropriate reference when building a new virtio-argo transport driver.

Each virtio device has a handle to the virtio transport driver where it originated, so the front-end virtio drivers can operate on devices from different transports on the same system without needing any transport-specific handling within the front-end driver.

Virtio transport driver interface

The interface that a transport driver must implement is defined in struct virtio_config_ops in include/linux/virtio_config.h

The Virtio PCI driver populates the interface struct thus:

static const struct virtio_config_ops virtio_pci_config_ops = {
  .get        = vp_get,
  .set        = vp_set,
  .generation = vp_generation,
  .get_status = vp_get_status,
  .set_status = vp_set_status,
  .reset      = vp_reset,
  .find_vqs   = vp_modern_find_vqs,
  .del_vqs    = vp_del_vqs,
  .get_features   = vp_get_features,
  .finalize_features = vp_finalize_features,
  .bus_name   = vp_bus_name,
  .set_vq_affinity = vp_set_vq_affinity,
  .get_vq_affinity = vp_get_vq_affinity,
};
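
A new virtio-argo transport would populate its own instance of the same struct. The following is a minimal sketch only, assuming hypothetical virtio_argo_* implementation functions; the optional vq-affinity members are omitted:

static const struct virtio_config_ops virtio_argo_config_ops = {
    .get        = virtio_argo_get,        /* read device config over Argo */
    .set        = virtio_argo_set,
    .generation = virtio_argo_generation,
    .get_status = virtio_argo_get_status,
    .set_status = virtio_argo_set_status,
    .reset      = virtio_argo_reset,
    .find_vqs   = virtio_argo_find_vqs,   /* provision vrings backed by Argo rings */
    .del_vqs    = virtio_argo_del_vqs,
    .get_features      = virtio_argo_get_features,
    .finalize_features = virtio_argo_finalize_features,
    .bus_name   = virtio_argo_bus_name,
};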

Device discovery and driver registration with the Virtio PCI transport driver

Primarily implemented in: drivers/virtio/virtio_pci_common.c

On a system that is using the Virtio PCI transport driver, Virtio devices that are exposed to the guest are surfaced as PCI devices with device identifiers that match those registered by the Virtio PCI transport driver.

i.e. each virtual device first surfaces on the PCI bus, and driver configuration then propagates from that point.

The Virtio PCI transport driver registers as a PCI device driver, declaring the range of PCI IDs that it will match to claim devices. When such a device is detected and matched, virtio_pci_probe is called to initialize the device driver for it, which typically proceeds into virtio_pci_modern_probe, which implements a substantial amount of validation and feature negotiation logic. A key action in this function is to initialize the pointer to the struct that contains the transport ops function pointers: vp_dev->vdev.config = &virtio_pci_config_ops;

Back in virtio_pci_probe, the new virtual device is then registered via register_virtio_device(&vp_dev->vdev), which calls device_add. This causes the kernel bus infrastructure to locate a matching front-end driver for this new device on the virtio bus, and the front-end driver then has its probe function called to register and activate the device with its actual front-end function.

Each front-end virtio driver needs to initialize the virtqueues used for communication with the backend, and it does this via the methods in the device's transport ops function pointer struct. Once the virtqueues are initialized, they are operated by the front-end driver via the standard virtqueue interfaces.
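
A minimal sketch of that pattern from the front-end side (the example_* names are illustrative and not taken from an existing driver); virtio_find_single_vq is the generic helper that routes to the transport's find_vqs operation:

/* Sketch: a front-end probe obtaining a single virtqueue from whichever
 * transport the device arrived on; example_* names are illustrative. */
static void example_vq_callback(struct virtqueue *vq)
{
    /* process buffers returned as 'used' by the device */
}

static int example_probe(struct virtio_device *vdev)
{
    struct virtqueue *vq;

    /* Routes to vdev->config->find_vqs, provided by the transport driver */
    vq = virtio_find_single_vq(vdev, example_vq_callback, "requests");
    if (IS_ERR(vq))
        return PTR_ERR(vq);

    /* Tell the device that the driver is ready to operate it */
    virtio_device_ready(vdev);
    return 0;
}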

Argo: Device discovery and driver registration with Virtio Argo transport

A new Virtio transport driver, virtio-argo, should implement the virtio_config_ops interface, and function as an alternative or complementary driver to Virtio PCI. In the same fashion as Virtio PCI, it will have responsibility for device discovery and will invoke device_add for new virtual devices that it finds, but virtual devices will not be surfaced to the guest as PCI devices.

In modern Linux kernels, support for ACPI can be enabled without requiring support for PCI to be enabled.

Virtual devices to be discovered by the virtio-argo driver can be described in new ACPI tables that are provided to the guest. The tables can enumerate the devices and include the necessary configuration metadata for the front-ends to be matched, probed and configured for connection to the corresponding backends over the Argo transport.

ACPI has support for hotplug of devices, so precedent exists for handling dynamic arrival and removal of virtual devices via this interface.

With this implementation, there should be no requirement for PCI support to be enabled in the guest VM when using the virtio-argo transport. This stands in contrast to the existing Xen device drivers in PVH and HVM guests, where the Xen platform PCI device is needed, and to the Virtio-PCI transport, which depends on PCI support.
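
For illustration, the registration step in a hypothetical virtio-argo transport would mirror virtio_pci_probe. A sketch, assuming a virtio_argo_device structure and the virtio_argo_config_ops instance sketched earlier:

/* Sketch: registering one discovered virtual device with the virtio bus.
 * struct virtio_argo_device and virtio_argo_config_ops are hypothetical. */
static int virtio_argo_register_device(struct virtio_argo_device *va_dev,
                                       struct device *parent)
{
    va_dev->vdev.dev.parent = parent;           /* used later for DMA ops */
    va_dev->vdev.config = &virtio_argo_config_ops;
    va_dev->vdev.id.device = va_dev->device_id; /* from the device description */
    va_dev->vdev.id.vendor = va_dev->vendor_id;

    /* Calls device_add(); the virtio bus then matches a front-end driver */
    return register_virtio_device(&va_dev->vdev);
}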

Virtqueues

Virtqueues implement the mechanism for transport of data for virtio devices.

There are two supported formats for virtqueues, the original split virtqueue format and the newer packed virtqueue format; each driver and each device may support either one or both formats.

Negotiation between driver and device of which virtqueue format to operate occurs at device initialization.

Current Linux driver implementations

The packed virtqueue format is negotiated by the VIRTIO_F_RING_PACKED feature bit. There are currently very few references to this value in the kernel. The vring_create_virtqueue function does provide an implementation for creation of packed vring structures and the vdpa transport driver offers the feature, but otherwise it appears that the kernel implementation of virtio device drivers all use the original split ring format.
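
The dispatch between the two formats inside vring_create_virtqueue is keyed off that feature bit; approximately (arguments elided):

struct virtqueue *vring_create_virtqueue(/* index, num, vring_align, vdev, ... */)
{
    if (virtio_has_feature(vdev, VIRTIO_F_RING_PACKED))
        return vring_create_virtqueue_packed(/* ... */);

    return vring_create_virtqueue_split(/* ... */);
}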

Current QEMU device implementations

The general device-independent QEMU logic in hw/virtio/virtio.c has support for packed rings, but the device implementations do not appear to negotiate it in the current implementation.

In include/hw/virtio/virtio.h there is a centrally-defined macro DEFINE_VIRTIO_COMMON_FEATURES used in the common virtio_device_class_init function that sets VIRTIO_F_RING_PACKED to false.

Other virtio device implementations

DPDK implements virtio devices that can use either packed or split virtqueues.

Virtqueues: implemented by vrings

The virtio transport driver is asked by a virtio front-end driver, on behalf of a device, to find the virtual queues needed to connect the driver to the backend. The virtqueues are obtained via the find_vqs function in the transport ops.

Each of the existing Linux kernel virtio transport drivers uses the vring_create_virtqueue function to provision vrings, which implement virtqueues.
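
A minimal sketch of a transport-side helper that provisions one such vring-backed virtqueue via the existing vring_create_virtqueue interface; the virtio_argo_notify callback and the chosen ring size and alignment are assumptions:

/* Sketch: creating one split-format virtqueue from a transport driver's
 * find_vqs path; virtio_argo_notify is hypothetical. */
static struct virtqueue *virtio_argo_setup_vq(struct virtio_device *vdev,
                                              unsigned int index,
                                              void (*callback)(struct virtqueue *),
                                              const char *name)
{
    /* ring size and alignment values here are illustrative */
    return vring_create_virtqueue(index, 256, PAGE_SIZE, vdev,
                                  true,  /* weak_barriers */
                                  true,  /* may_reduce_num */
                                  false, /* context */
                                  virtio_argo_notify, callback, name);
}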

Virtqueue definition in the Virtio 1.1 standard:

https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-230005

Virtqueue interface

See the virtqueue structure and functions in <linux/virtio.h> and the virtqueue functions exported from virtio_ring.c, listed below. These functions are called directly (i.e. not via function pointers in an ops structure) from call sites in the front-end virtio drivers. It does not appear practical for a new transport driver to substitute an alternative implementation for the virtqueue implementation (i.e. the vring).

struct virtqueue {
  struct list_head list;
  void (*callback)(struct virtqueue *vq);
  const char *name;
  struct virtio_device *vdev;
  unsigned int index;
  unsigned int num_free;
  void *priv;
};

EXPORT_SYMBOL_GPL(virtio_max_dma_size);
EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
EXPORT_SYMBOL_GPL(virtqueue_notify);
EXPORT_SYMBOL_GPL(virtqueue_kick);
EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
EXPORT_SYMBOL_GPL(virtqueue_get_buf);
EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
EXPORT_SYMBOL_GPL(virtqueue_poll);
EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
EXPORT_SYMBOL_GPL(virtqueue_get_desc_addr);
EXPORT_SYMBOL_GPL(virtqueue_get_avail_addr);
EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
EXPORT_SYMBOL_GPL(virtqueue_get_vring);
EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
EXPORT_SYMBOL_GPL(virtqueue_is_broken);
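
For illustration, a sketch (not taken from an existing driver) of how a front-end uses these exported functions to send one buffer and reclaim it once the device has marked it used:

/* Sketch: sending one buffer to the device and reclaiming used buffers. */
static int example_send(struct virtqueue *vq, void *buf, unsigned int len)
{
    struct scatterlist sg;
    unsigned int used_len;
    int err;

    sg_init_one(&sg, buf, len);

    /* Expose the buffer to the device as device-readable */
    err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_KERNEL);
    if (err)
        return err;

    /* Notify the device, via the transport's notify hook */
    virtqueue_kick(vq);

    /* Later, typically from the vq callback: collect buffers that the
     * device has marked as used, then release or recycle them */
    while (virtqueue_get_buf(vq, &used_len))
        ;

    return 0;
}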

Vring interface

These vring functions are exported:

EXPORT_SYMBOL_GPL(vring_interrupt);
EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
EXPORT_SYMBOL_GPL(vring_create_virtqueue);
EXPORT_SYMBOL_GPL(vring_new_virtqueue);
EXPORT_SYMBOL_GPL(vring_del_virtqueue);
EXPORT_SYMBOL_GPL(vring_transport_features);
EXPORT_SYMBOL_GPL(virtio_break_device);

Of the above, only the virtio_break_device function is accessed by a non-transport virtio driver: it is invoked by the virtio_crypto_core driver to force-disable a device, and it assumes that the virtqueue implementation is a vring.

Virtqueues: enabling Argo in the transport control path

In the Virtio component architecture, the transport driver is responsible for mapping the virtqueue/vrings onto the transport medium. An Argo ring buffer can be treated as a DMA buffer for virtio.

Since both QEMU and the Linux kernel support use of split virtqueues, and the split virtqueue format separates the areas of the data structure written by the device from those written by the driver, this is the correct virtqueue format to select for the initial implementation of the Argo virtio transport. The split format is assumed from this point onwards in this document.

Vrings

A vring is a variable-sized structure, allocated by the transport driver, with no requirement for a specific location within a memory page, though there are alignment requirements for its data members. With other transports, some or all of the vring is typically located within a shared memory region; this is not the case for Argo.

The vring structure is defined in uapi/linux/virtio_ring.h:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
    __virtio64 addr;            /* Address (guest-physical). */
    __virtio32 len;
    __virtio16 flags;
    __virtio16 next;            /* We chain unused descriptors via this, too */
};

struct vring_avail {
    __virtio16 flags;
    __virtio16 idx;
    __virtio16 ring[];
};

struct vring_used_elem {
    __virtio32 id;  /* Index of start of used descriptor chain. */
    __virtio32 len; /* Total len of the descriptor chain used (written to) */
};

typedef struct vring_used_elem __attribute__((aligned(VRING_USED_ALIGN_SIZE)))
    vring_used_elem_t;

struct vring_used {
    __virtio16 flags;
    __virtio16 idx;
    vring_used_elem_t ring[];
};

#define VRING_AVAIL_ALIGN_SIZE 2
#define VRING_USED_ALIGN_SIZE 4
#define VRING_DESC_ALIGN_SIZE 16

typedef struct vring_desc __attribute__((aligned(VRING_DESC_ALIGN_SIZE)))
    vring_desc_t;
typedef struct vring_avail __attribute__((aligned(VRING_AVAIL_ALIGN_SIZE)))
    vring_avail_t;
typedef struct vring_used __attribute__((aligned(VRING_USED_ALIGN_SIZE)))
    vring_used_t;

/* size in bytes values below are taken from the spec document */
struct vring {
    unsigned int num;      /* required to be a power of 2 */
    vring_desc_t *desc;    /* size in bytes: 16 * num */
    vring_avail_t *avail;  /* size in bytes: 6 + (2 * num) */
    vring_used_t *used;    /* size in bytes: 6 + (8 * num) */
};

The standard layout for a vring contains an array of descriptors, each 16 bytes in size, and a separate available ring within the structure tracks which descriptors the driver has made available to the device.
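
As a worked example, with num = 256 and assuming a 4096-byte ring alignment: the descriptor table occupies 16 * 256 = 4096 bytes, the available ring occupies 6 + (2 * 256) = 518 bytes (plus two further bytes for the used_event field that follows it), and the used ring, aligned up to the next 4096-byte boundary, occupies 6 + (8 * 256) = 2054 bytes, giving a total allocation of roughly three pages.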

For the driver to send a buffer to a device: the driver writes the buffer's address and length into one or more free descriptors (chained via the next field if needed), places the index of the head descriptor into the next slot of the available ring, increments the available ring's idx field, and then notifies the device.

When a device has finished with a buffer that was supplied by the driver: the device writes the head descriptor index and the total number of bytes written into the next slot of the used ring, increments the used ring's idx field, and then sends a notification to the driver unless notifications are suppressed.

Write access within the vring structure: the descriptor table and the available ring are written only by the driver and read by the device; the used ring is written only by the device and read by the driver.

Vring descriptors

Each vring descriptor describes a buffer that is either write-only (VRING_DESC_F_WRITE set in flags) or read-only (VRING_DESC_F_WRITE clear in flags) for the device, with a guest-physical addr and a size in bytes indicated in len, optionally chained together via the next field.

Indirect descriptors enable a buffer to describe a list of buffer descriptors, to increase the maximum capacity of a ring and support efficient dispatch of large requests.

Argo rings

An Argo ring is also a variable-sized structure, allocated by the transport driver, but it must be aligned at the start of a page of memory; it is used to receive data transmitted from other domains. The ring header is 64 bytes.

typedef struct xen_argo_ring
{
    uint32_t rx_ptr;
    uint32_t tx_ptr;
    uint8_t reserved[56];             /* Reserved */
    uint8_t ring[XEN_FLEX_ARRAY_DIM];
} xen_argo_ring_t;

An Argo ring contains slots, each of 16 bytes. Each message that is sent is rounded up in size to the next slot boundary, and a 16 byte message header is inserted ahead of each message in the destination Argo ring.
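
For example, transmitting a 100-byte message consumes 16 bytes of message header plus 100 bytes of payload, rounded up to the next 16-byte slot boundary: 128 bytes, or eight slots, in the destination ring.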

Enabling remote writes for device backends into vrings using Argo

The used ring

The remote domain which implements the backend device that the virtio-argo transport communicates with needs to be able to write into the 'used' ring within the vring as it consumes the buffers supplied by the driver. It must also not overwrite the 'available' ring that typically immediately precedes it in the vring structure, so this is achieved by ensuring that the 'used' ring starts on a separate memory page.

The following data structure allocation and positioning will support remote writes into the virtio used ring without requiring copies out of the receiving Argo ring to populate the vring used ring:

This will enable the remote Argo sender domain to transmit directly into the used ring of the vring, while ensuring that the Argo ring header and message headers are safely outside the vring and do not interfere with its operation. The sender needs to ensure that the message header for any transmission is always written outside of the vring's used ring and into the first page of the Argo ring.

The current logic for initializing vrings when virtqueues are created by existing transport drivers, in vring_create_virtqueue, assumes that the vring contents are contiguous in virtual memory, as can be seen in vring_init:

static inline void vring_init(struct vring *vr, unsigned int num, void *p,
                  unsigned long align)
{
    vr->num = num;
    vr->desc = p;
    vr->avail = (struct vring_avail *)((char *)p + num * sizeof(struct vring_desc));
    vr->used = (void *)(((uintptr_t)&vr->avail->ring[num] + sizeof(__virtio16)
        + align-1) & ~(align - 1));
}

A new virtio-argo transport driver is not necessarily required to use this interface for initializing vrings, but enabling reuse of the existing functions will avoid duplicating common aspects of them.
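
One possible approach, sketched here under the assumptions of this design: allocate the descriptor table and available ring separately from a page-aligned used ring, describe that layout in a struct vring directly rather than via vring_init, and pass it to the existing __vring_new_virtqueue export. The allocation of the two regions, the registration of the covering Argo ring, and virtio_argo_notify are assumed:

/* Sketch: creating a split-format virtqueue whose used ring occupies its
 * own page-aligned region, separate from the descriptor table and the
 * available ring. Error handling is omitted; virtio_argo_notify is
 * hypothetical. */
static struct virtqueue *virtio_argo_new_vq(struct virtio_device *vdev,
                                            unsigned int index,
                                            unsigned int num,
                                            void *desc_avail_region,
                                            void *used_region,
                                            void (*callback)(struct virtqueue *),
                                            const char *name)
{
    struct vring vring;

    vring.num = num;
    vring.desc = desc_avail_region;
    vring.avail = (void *)((char *)desc_avail_region +
                           num * sizeof(struct vring_desc));
    vring.used = used_region;   /* starts at a page boundary */

    return __vring_new_virtqueue(index, vring, vdev,
                                 true,   /* weak_barriers */
                                 false,  /* context */
                                 virtio_argo_notify, callback, name);
}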

The available ring and descriptor ring

The memory pages occupied by the available ring and descriptor ring can be transmitted via Argo when their contents are updated.

To be verified: a virtio front-end driver invokes virtqueue_notify to notify the other end of changes; this calls a notify function pointer in the virtqueue that was supplied by the transport driver, which for virtio-argo can invoke an Argo sendv operation.
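
A sketch of such a notify hook, assuming a hypothetical per-virtqueue state structure reachable via vq->priv and an argo_sendv wrapper around the Argo sendv hypercall:

/* Sketch: transport notify hook invoked via virtqueue_kick /
 * virtqueue_notify when the driver has published new buffers. */
static bool virtio_argo_notify(struct virtqueue *vq)
{
    struct virtio_argo_vq *avq = vq->priv;  /* hypothetical per-vq state */

    /* Send the updated descriptor and available ring pages to the backend
     * domain over Argo (argo_sendv is a hypothetical wrapper). */
    return argo_sendv(avq->backend_addr, avq->ring_iovs, avq->nr_iovs) >= 0;
}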

Vring use of the DMA interface provided by the virtio transport driver

The logic in vring_use_dma_api ensures that Xen domains always use the Linux DMA API for allocating the memory for vrings. Each DMA operation is invoked with a handle to the virtual device's parent, vdev->dev.parent, which is set by the virtio transport driver (see virtio_pci_probe for an example) and so can be set to a device supplied by that transport driver.

The memory to be used for the vring structure is then allocated via the DMA API interface that has been provided. With a new implementation of the DMA API interface, the virtio-argo transport driver will be able to allocate vrings with the memory layout described in the section above, to support registration of a corresponding Argo ring for access to the vring's used ring.
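
A sketch of how the transport could install that DMA interface on the device used as vdev->dev.parent, assuming a virtio_argo_dma_ops structure (see the sketch in the DMA operations section below) and using the kernel's set_dma_ops helper:

/* Sketch: installing transport-provided DMA operations on the parent
 * device during virtio-argo device setup, so that vring allocations and
 * buffer mappings route through the Argo-aware implementations. */
static void virtio_argo_setup_dma(struct virtio_argo_device *va_dev,
                                  struct device *transport_dev)
{
    va_dev->vdev.dev.parent = transport_dev;
    set_dma_ops(transport_dev, &virtio_argo_dma_ops);  /* hypothetical ops struct */
}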

Virtqueues: enabling Argo in the data path

The transmit sequence for sending data via the virtqueue looks like: the front-end driver adds buffers to the descriptor table and the available ring with virtqueue_add_outbuf or virtqueue_add_sgs, then calls virtqueue_kick to notify the device;

and when the buffers have been processed by the remote end, they will be indicated as used: the device writes entries into the used ring and the driver reclaims the buffers with virtqueue_get_buf, typically from the virtqueue callback invoked via vring_interrupt.

The receive sequence for incoming data via the virtqueue looks like: the front-end driver posts empty buffers with virtqueue_add_inbuf and calls virtqueue_kick; when the device has filled a buffer it marks it used, and the driver collects it with virtqueue_get_buf from the virtqueue callback.

Each of the virtio ring operations for managing exposure of buffers from the front-end virtio driver passes through the DMA operations functions struct provided by the device’s transport driver.

DMA operations

struct dma_map_ops {
    ... alloc
    ... free
    ... mmap
    ... get_sgtable
    ... map_page
    ... unmap_page
    ... map_sg
    ... unmap_sg
    ... map_resource
    ... unmap_resource
    ... sync_single_for_cpu
    ... sync_single_for_device
    ... sync_sg_for_cpu
    ... sync_sg_for_device
    ... cache_sync
    ... dma_supported
    ... get_required_mask
    ... max_mapping_size
    ... get_merge_boundary
};
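
For illustration, a partial sketch of a virtio-argo implementation of this interface: map_page makes the buffer reachable by the backend over Argo and returns a transport-defined pseudo-DMA address. All virtio_argo_* helpers here are hypothetical.

/* Sketch: a subset of dma_map_ops for a hypothetical virtio-argo transport. */
static dma_addr_t virtio_argo_map_page(struct device *dev, struct page *page,
                                       unsigned long offset, size_t size,
                                       enum dma_data_direction dir,
                                       unsigned long attrs)
{
    /* For DMA_FROM_DEVICE: make the page reachable for backend writes;
     * for DMA_TO_DEVICE: record it for transmission over Argo.
     * Return a transport-defined address the backend can interpret. */
    return virtio_argo_register_buffer(dev, page, offset, size, dir);
}

static void virtio_argo_unmap_page(struct device *dev, dma_addr_t handle,
                                   size_t size, enum dma_data_direction dir,
                                   unsigned long attrs)
{
    virtio_argo_release_buffer(dev, handle, size, dir);
}

static const struct dma_map_ops virtio_argo_dma_ops = {
    /* .alloc / .free would back the vring allocations; omitted here */
    .map_page   = virtio_argo_map_page,
    .unmap_page = virtio_argo_unmap_page,
};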

Flow of front-end operations on the virtqueue

virtqueue_add_sgs
  → virtqueue_add
    → virtqueue_add_split
      → vring_map_one_sg( ... DMA_TO_DEVICE)
        → dma_map_page
      → vring_map_one_sg( ... DMA_FROM_DEVICE)
        → dma_map_page
      → vring_map_single( ... DMA_TO_DEVICE)
        → dma_map_single
      → virtqueue_kick


virtqueue_add_outbuf → virtqueue_add …
virtqueue_add_inbuf → virtqueue_add …
virtqueue_add_inbuf_ctx → virtqueue_add …

The DMA map operations return a dma_addr_t, which is then translated to a 64-bit value to insert into the virtio descriptor struct written into the vring's descriptor ring.

Inbound receive buffers

For dma_map_page with a caller requesting the DMA_FROM_DEVICE direction:

Outbound transmit buffers

Handling dma_map_page with a caller requesting the DMA_TO_DEVICE direction:

Analysis of the Windows VirtIO implementation

The Windows VirtIO device drivers have the same transport driver abstraction and separation of driver structure, with the same core virtqueue data structures, as the Linux VirtIO kernel implementation.

The current kvm-guest-drivers-windows repository includes virtio-pci-modern and virtio-pci-legacy transports. virtio-mmio appears to be absent.

In summary: it looks promising that leveraging DMA translation interfaces to allow insertion of the Argo transport, in the same fashion as the Linux VirtIO-Argo design, may be feasible with the Windows VirtIO driver implementations. However, it is not yet clear whether the client VirtIO drivers may need to be recompiled, since the DMA address translation is performed differently in each of the storage, networking and input drivers inspected.

References:


Notes from Virtio Argo planning meeting, July 2020

Call on Friday 17th July

Attendees:

Actions:

Topic: Argo as a transport for Virtio

Background: the Virtio implementation in the Linux kernel is stacked: function-specific front-end drivers sit above transport drivers, which provide device discovery and the virtqueues used for data transport.

KVM uses Qemu to provide backend implementations for devices.
To implement Argo as a Virtio transport, Qemu will need to be made Argo-aware.

Typically a PCI device is emulated by Qemu, on a bus provided to the guest.
The Bus:Device:Function (BDF) triple maps to Qemu; the guest loads the PCI
virtio transport when the PCI device with that BDF is detected.

For Xen, this is OK for HVM guests, but since the platform PCI device is not
exposed to PVH (or PV) guests, it is not a sufficient method for those cases.

Discussion notes

The existing main Xen PCI code is small and used for discovery.
vPCI is experimental full emulation of PCI within Xen; Daniel does not
favour this direction.

Paul Durrant has a draft VFIO implementation plan, previously circulated.
Paul is working on emulating PCI devices directly via an IOReq server (see: "demu").

Daniel is interested in the Linux kernel packet mmap, to avoid one world-switch
when processing data.

Christopher raised the Cambridge Argo port-connection design for external
connection of Argo ports between VMs.

Eric raised the vsock-argo implementation. The discussion referenced the Argo development wiki page, which describes the requirements and plan previously discussed and agreed for future Argo Linux driver development; this needs to include a non-vsock interface for administrative actions such as runtime firewall configuration.

Daniel is positively inclined towards using ACPI tables for discovery
and notes the executable ACPI table feature, via AML.

Naming: an Argo transport for virtio in the Linux kernel would, to match the
existing naming scheme, be: virtio-argo.c

Group inclined not to require XenStore.

Topic: connecting a frontend to a matching backend

An Argo (domain identifier + port) tuple is needed for each end.

Design topic: surfacing bootstrap/init connection info to guests

Options:

  1. PCI device, with emulation provided by Xen

  2. ACPI tables

  3. External connection of Argo ports, performed by a toolstack

  4. Use a wildcard ring on the frontend side for the driver to autoconnect

Note:
ACPI plan is good for compatibility with DomB work, where ACPI tables are
already being populated to enable PVH guest launch.

Plan:

For a first cut: just claim a predefined port and use that to avoid the need
for interacting with an ACPI table.

To build:

Wiki page to include:

License of this Document

Copyright (c) 2020 BAE Systems.
Document author: Christopher Clark.
This work is licensed under the Creative Commons Attribution Share-Alike 4.0 International License.
To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/.