Analysis of Argo as a transport medium for VirtIO

This document collates information obtained from inspection of the Linux kernel and Windows KVM VirtIO device drivers, which run as guest workloads.

 

Analysis of the Linux VirtIO implementation and system structure

This section covers the device driver implementations within the Linux kernel and the QEMU device model.

Driver structure

There is a distinction between the general class of virtio device drivers, which provide function-specific logic implementing the front-end of virtual devices, and the transport virtio device drivers, which are responsible for device discovery and provision of virtqueues for data transport to the front-end drivers.

VirtIO transport drivers

There are several implementations of transport virtio drivers, designed to be interchangeable with respect to the virtio front-end drivers.

  • virtio-pci-modern

    • also: virtio-pci-legacy, intertwined with virtio-pci-modern via virtio-pci-common

  • virtio-mmio

  • virtio-ccw : an s390-specific channel I/O (CCW) transport driver

  • virtio-vdpa : new for 2020: virtual Data Path Acceleration (vDPA)

There are also other instances of virtio transport drivers elsewhere in the Linux kernel:

  • vop : virtio over PCIe

  • remoteproc : remote processor messaging transport

  • Mellanox BlueField SoC driver

These are not relevant to the remainder of this document and so are not discussed further.

These transport drivers communicate with backend implementations in the QEMU device model. Multiple transport drivers can operate concurrently in the same kernel without interference. The virtio-pci-modern transport driver is the most advanced implementation within the Linux kernel, so it is an appropriate reference when building a new virtio-argo transport driver.

Each virtio device has a handle to the virtio transport driver that it originated from, so the front-end virtio drivers can operate on devices from different transports on the same system without needing any transport-specific handling within the front-end driver.

Virtio transport driver interface

The interface that a transport driver must implement is defined in struct virtio_config_ops in include/linux/virtio_config.h

The Virtio PCI driver populates the interface struct thus:

 

static const struct virtio_config_ops virtio_pci_config_ops = {
        .get                = vp_get,
        .set                = vp_set,
        .generation         = vp_generation,
        .get_status         = vp_get_status,
        .set_status         = vp_set_status,
        .reset              = vp_reset,
        .find_vqs           = vp_modern_find_vqs,
        .del_vqs            = vp_del_vqs,
        .get_features       = vp_get_features,
        .finalize_features  = vp_finalize_features,
        .bus_name           = vp_bus_name,
        .set_vq_affinity    = vp_set_vq_affinity,
        .get_vq_affinity    = vp_get_vq_affinity,
};

Device discovery and driver registration with the Virtio PCI transport driver

Primarily implemented in: drivers/virtio/virtio_pci_common.c

On a system that is using the Virtio PCI transport driver, Virtio devices that are exposed to the guest are surfaced as PCI devices with device identifiers that match those registered by the Virtio PCI transport driver.

ie. Each virtual device first surfaces on the PCI bus and then driver configuration propagates from that point.

The Virtio PCI transport driver registers as a PCI device driver, declaring the range of PCI IDs that it will match to claim devices. When such a device is detected and matched, virtio_pci_probe is called to initialize the device driver for it, which typically proceeds into virtio_pci_modern_probe, which implements quite a bit of validation and feature negotiation logic. A key action in this function is to initialize the pointer to the struct that contains the transport ops function pointers: vp_dev->vdev.config = &virtio_pci_config_ops;

Back in virtio_pci_probe, the new virtual device is then registered via register_virtio_device(&vp_dev->vdev), which calls device_add; this causes the kernel bus infrastructure to locate a matching front-end driver for the new device - the bus being the virtio bus - and the front-end driver then has its probe function called to register and activate the device with its actual front-end function.

Each front-end virtio driver needs to initialize the virtqueues that are for communication with the backend, and it does this via the methods in the device’s transport ops function pointer struct. Once the virtqueues are initialized, they are operated by the front-end driver via the standard virtqueue interfaces.

Argo: Device discovery and driver registration with Virtio Argo transport

A new Virtio transport driver, virtio-argo, should implement the virtio_pci_config_ops interface, and function as an alternative or complementary driver to Virtio PCI. In the same fashion as Virtio PCI, it will have responsibility for device discovery and invoke device_add for new virtual devices that it finds, but virtual devices will not be surfaced to the guest as PCI devices.
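As an illustration of the shape of such a driver, a minimal sketch follows. The virtio_argo_* names and the Argo addressing fields are assumptions for this document, not existing code; register_virtio_device and struct virtio_config_ops are the standard kernel interfaces described above.

/*
 * Sketch only: how a hypothetical virtio-argo transport might register a
 * discovered virtual device with the virtio bus.
 */
#include <linux/slab.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* Populated analogously to virtio_pci_config_ops above (definition omitted). */
extern const struct virtio_config_ops virtio_argo_config_ops;

struct virtio_argo_device {
        struct virtio_device vdev;
        u16 backend_domid;              /* Argo destination domain (assumed field) */
        u32 backend_port;               /* Argo destination port (assumed field) */
};

static int virtio_argo_register_device(struct device *parent, u32 device_id,
                                        u16 backend_domid, u32 backend_port)
{
        struct virtio_argo_device *va = kzalloc(sizeof(*va), GFP_KERNEL);

        if (!va)
                return -ENOMEM;

        va->vdev.dev.parent = parent;    /* DMA operations route via the parent device */
        va->vdev.id.device = device_id;  /* selects the matching front-end driver */
        va->vdev.id.vendor = 0;          /* front-end id tables typically accept any vendor */
        va->vdev.config = &virtio_argo_config_ops;
        va->backend_domid = backend_domid;
        va->backend_port = backend_port;

        /* Hand the device to the virtio bus; a matching front-end driver's
         * probe function will then be called, as with virtio-pci. */
        return register_virtio_device(&va->vdev);
}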

In modern Linux kernels, support for ACPI can be enabled without requiring support for PCI to be enabled.

Virtual devices to be discovered by the virtio-argo driver can be described in new ACPI tables that are provided to the guest. The tables can enumerate the devices and include the necessary configuration metadata for the front-ends to be matched, probed and configured for connection to the corresponding backends over the Argo transport.
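One possible shape for this is sketched below, under the assumption that the virtio-argo transport binds to an ACPI-enumerated platform device; the _HID value and all virtio_argo_* names are placeholders rather than allocated identifiers or existing code, while the platform driver and ACPI match-table mechanism are standard kernel interfaces.

#include <linux/acpi.h>
#include <linux/module.h>
#include <linux/platform_device.h>

static int virtio_argo_probe(struct platform_device *pdev);
static int virtio_argo_remove(struct platform_device *pdev);

/* Placeholder ACPI hardware ID: a real implementation would use an
 * allocated identifier for the virtio-argo description. */
static const struct acpi_device_id virtio_argo_acpi_ids[] = {
        { "XARG0001", 0 },
        { }
};
MODULE_DEVICE_TABLE(acpi, virtio_argo_acpi_ids);

static struct platform_driver virtio_argo_platform_driver = {
        .probe  = virtio_argo_probe,    /* parse the ACPI-provided device list, then
                                           call register_virtio_device() per entry */
        .remove = virtio_argo_remove,
        .driver = {
                .name             = "virtio-argo",
                .acpi_match_table = virtio_argo_acpi_ids,
        },
};
module_platform_driver(virtio_argo_platform_driver);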

ACPI has support for hotplug of devices, so precedent exists for being able to support handling dynamic arrival and removal of virtual devices via this interface.

With this implementation, there should be no requirement for PCI support to be enabled in the guest VM when using the virtio-argo transport. This stands in contrast to the existing Xen device drivers in PVH and HVM guests, where the Xen platform PCI device is needed, and to the Virtio-PCI transport, which depends on PCI support.

Virtqueues

Virtqueues implement the mechanism for transport of data for virtio devices.

There are two supported formats for virtqueues - the original split format and the newer packed format - and each driver and each device may support either one or both formats.

Negotiation between driver and device of which virtqueue format to use occurs at device initialization.

Current Linux driver implementations

The packed virtqueue format is negotiated by the VIRTIO_F_RING_PACKED feature bit. There are currently very few references to this value in the kernel. The vring_create_virtqueue function does provide an implementation for creation of packed vring structures and the vdpa transport driver offers the feature, but otherwise it appears that the kernel implementation of virtio device drivers all use the original split ring format.

Current QEMU device implementations

The general device-independent QEMU logic in hw/virtio/virtio.c has support for packed rings, but the device implementations do not appear to negotiate it in the current implementation.

In include/hw/virtio/virtio.h there is a centrally-defined macro DEFINE_VIRTIO_COMMON_FEATURES used in the common virtio_device_class_init function that sets VIRTIO_F_RING_PACKED to false.

Other virtio device implementations

DPDK implements virtio devices that can use either packed or split virtqueues.

Virtqueues : implemented by vrings

The virtio transport driver is asked by a virtio front-end driver, on behalf of a device, to find the virtual queues needed to connect the driver to the backend. The function to obtain the virtqueues is accessed via the transport ops find_vqs function.

Each of the existing Linux kernel virtio transport drivers uses the vring_create_virtqueue function to provision vrings, which implement virtqueues.
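For reference, the prototype, approximately as declared in include/linux/virtio_ring.h in kernels of this era:

struct virtqueue *vring_create_virtqueue(unsigned int index,
                                         unsigned int num,
                                         unsigned int vring_align,
                                         struct virtio_device *vdev,
                                         bool weak_barriers,
                                         bool may_reduce_num,
                                         bool context,
                                         bool (*notify)(struct virtqueue *vq),
                                         void (*callback)(struct virtqueue *vq),
                                         const char *name);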

Virtqueue definition in the Virtio 1.1 standard:

https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-230005

Virtqueue interface

See the virtqueue structure and functions in: <linux/virtio.h> and the virtqueue functions exported from virtio_ring.c, listed below. These functions are called directly - ie. not via function pointers in an ops structure - from call sites in the front-end virtio drivers. It therefore does not appear practical for a new transport driver to substitute an alternative virtqueue implementation for the existing vring implementation.

 

struct virtqueue {
        struct list_head list;
        void (*callback)(struct virtqueue *vq);
        const char *name;
        struct virtio_device *vdev;
        unsigned int index;
        unsigned int num_free;
        void *priv;
};

 

EXPORT_SYMBOL_GPL(virtio_max_dma_size);
EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
EXPORT_SYMBOL_GPL(virtqueue_notify);
EXPORT_SYMBOL_GPL(virtqueue_kick);
EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
EXPORT_SYMBOL_GPL(virtqueue_get_buf);
EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
EXPORT_SYMBOL_GPL(virtqueue_poll);
EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
EXPORT_SYMBOL_GPL(virtqueue_get_desc_addr);
EXPORT_SYMBOL_GPL(virtqueue_get_avail_addr);
EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
EXPORT_SYMBOL_GPL(virtqueue_get_vring);
EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
EXPORT_SYMBOL_GPL(virtqueue_is_broken);

Vring interface

virtio_ring.c also exports a set of vring-level functions.

Of these, only the virtio_break_device function is accessed by a non-transport virtio driver - it is invoked by the virtio_crypto_core driver to force-disable a device, and it assumes that the virtqueue implementation is a vring.

Virtqueues: enabling Argo in the transport control path

In the Virtio component architecture, the transport driver is responsible for mapping the virtqueue/vrings onto the transport medium. An Argo ring buffer can be treated as a DMA buffer for virtio.

Since both QEMU and the Linux kernel support use of split virtqueues, and the split virtqueue format separates the areas in the data structure for writes by the device and the driver, this is the correct virtqueue format to select for initial implementation of the Argo virtio transport. The split format will be assumed from this point onwards in this document.

Vrings

A vring is a variable-sized structure, allocated by the transport driver, with no requirement for a specific location within a memory page, although there are alignment requirements for its data members. For other transports, some or all of the vring is typically located within a shared memory region; this is not the case for Argo.

The vring structure is defined in uapi/linux/virtio_ring.h :
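(Abridged, with comments added here for reference.)

/* Split-ring layout types from include/uapi/linux/virtio_ring.h: */
struct vring_desc {
        __virtio64 addr;   /* guest-physical address of the buffer */
        __virtio32 len;    /* length of the buffer in bytes */
        __virtio16 flags;  /* VRING_DESC_F_NEXT / _WRITE / _INDIRECT */
        __virtio16 next;   /* index of the chained descriptor, if F_NEXT is set */
};

struct vring_avail {
        __virtio16 flags;
        __virtio16 idx;
        __virtio16 ring[];
};

struct vring_used_elem {
        __virtio32 id;     /* index of the head of the used descriptor chain */
        __virtio32 len;    /* total bytes written into the buffer chain */
};

struct vring_used {
        __virtio16 flags;
        __virtio16 idx;
        struct vring_used_elem ring[];
};

struct vring {
        unsigned int num;
        struct vring_desc *desc;
        struct vring_avail *avail;
        struct vring_used *used;
};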

The standard layout for a vring contains an array of descriptors, each being 16 bytes, and the availability of each descriptor for writing into is tracked in a separate array within the ring structure.

For the driver to send a buffer to a device:

  • one or more slots are filled in the vring->desc table, with multiple slots chained together using vring_desc->next

  • the index of the descriptor is written to the available ring

  • the device is notified

When a device has finished with a buffer that was supplied by the driver:

  • the index of the descriptor is written to the used ring

  • the driver is notified of the used buffer

Write access within the vring structure:

  • The descriptor table is only written to by the driver and read by the device.

  • The available ring is only written to by the driver and read by the device.

  • The used ring is only written to by the device and read by the driver.

Vring descriptors

Each vring descriptor describes a buffer that is either write-only (VRING_DESC_F_WRITE set in flags) or read-only (VRING_DESC_F_WRITE clear in flags) for the device, with a guest-physical addr and a size in bytes indicated in len, optionally chained together via the next field.

Indirect descriptors enable a buffer to describe a list of buffer descriptors, to increase the maximum capacity of a ring and support efficient dispatch of large requests.

Argo rings

An Argo ring is also a variable-sized structure, allocated by the transport driver; it must be aligned to the start of a memory page, and it is used to receive data transmitted from other domains. The ring header is 64 bytes.

An Argo ring contains slots, each being 16 bytes. Each message that is sent is rounded up in size to the next slot boundary, and a 16-byte message header is inserted ahead of it in the destination Argo ring.
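For reference, the corresponding structures, abridged from Xen's public Argo header (xen/include/public/argo.h); this layout should be checked against the header for the target Xen version:

typedef struct xen_argo_ring {
        uint32_t rx_ptr;        /* advanced by the ring's owner as it consumes messages */
        uint32_t tx_ptr;        /* advanced by the hypervisor as it delivers messages */
        uint8_t reserved[56];   /* pads the ring header to 64 bytes */
        uint8_t ring[];         /* the 16-byte message slots follow */
} xen_argo_ring_t;

typedef struct xen_argo_addr {
        xen_argo_port_t aport;
        domid_t domain_id;
        uint16_t pad;
} xen_argo_addr_t;

struct xen_argo_ring_message_header {
        uint32_t len;
        struct xen_argo_addr source;    /* sender's domain id and port */
        uint32_t message_type;
        uint8_t data[];                 /* payload, rounded up to the next slot */
};                                      /* this header totals 16 bytes */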

Enabling remote writes for device backends into vrings using Argo

The used ring

The remote domain which implements the backend device that the virtio-argo transport communicates with needs to be able to write into the ‘used’ ring within the vring, as it consumes the buffers supplied by the driver. It must also not overwrite the ‘available’ ring that typically immediately precedes it in the vring structure, so this is achieved by ensuring that the ‘used’ ring starts on a separate memory page.

The following data structure allocation and positioning will support remote writes into the virtio used ring without requiring copies out of the receiving Argo ring to populate the vring used ring:

  • The vring is allocated so that the ‘used’ ring, within the vring structure, starts on a page boundary.

    • This requirement is necessary regardless of page size - ie. true for 4k and 64k page sizes

  • A single page is allocated to contain the Argo ring header and the initial region of the Argo ring, which will include (ample) space for the Argo message header of 16 bytes

  • When the Argo ring is registered with the hypervisor, it is registered as a multi-page ring

    • The first page is the one allocated for containing the Argo ring header and space for an Argo message header

    • The subsequent pages of the Argo ring are all those which contain the vring ‘used’ ring

  • The Argo ring is sized so that the end of the Argo ring is at the end of the vring ‘used’ ring, which will prevent overwrites beyond the used ring.

This will enable the remote Argo sender domain to transmit directly into the used vring, while ensuring that the Argo ring header and message header are safely outside the vring and do not interfere with its operation. The sender needs to ensure that the message header for any transmits is always written outside of the vring used ring and into the first page of the Argo ring.
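To make the sizing concrete, the sketch below applies the split-vring layout formula from the UAPI header (vring_size / vring_init) with vring_align set to PAGE_SIZE, which is what places the used ring on a page boundary. virtio_argo_alloc_vring is a hypothetical helper, and the vzalloc call stands in for the real DMA-API allocation described later in this document.

#include <linux/mm.h>
#include <linux/virtio_ring.h>
#include <linux/vmalloc.h>

/*
 * Split-vring layout (per vring_init()/vring_size() in the UAPI header):
 *
 *   descriptor table : num * 16 bytes
 *   available ring   : 6 + num * 2 bytes (flags, idx, ring[num], used_event)
 *   -- padding up to the next vring_align boundary --
 *   used ring        : 6 + num * 8 bytes (flags, idx, ring[num], avail_event)
 *
 * Example with num = 256 and 4K pages, using vring_align == PAGE_SIZE:
 * the descriptors occupy bytes 0..4095, the available ring 4096..4613,
 * and the used ring starts exactly at offset 8192 - a page boundary - so
 * the Argo ring can be registered as one separate header page plus the
 * page(s) from offset 8192 onwards.
 */
static void *virtio_argo_alloc_vring(unsigned int num, struct vring *vr)
{
        /* Hypothetical helper; a real transport would allocate this through
         * its DMA API implementation rather than vzalloc(). */
        size_t size = PAGE_ALIGN(vring_size(num, PAGE_SIZE));
        void *queue = vzalloc(size);

        if (!queue)
                return NULL;
        vring_init(vr, num, queue, PAGE_SIZE);  /* vr->used lands on a page boundary */
        return queue;
}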

The current logic for initializing vrings when virtqueues are created by existing transport drivers, in vring_create_virtqueue, assumes that the vring contents are contiguous in virtual memory - see the implementation in drivers/virtio/virtio_ring.c.

A new virtio-argo transport driver is not necessarily required to use this interface for initializing vrings, but enabling reuse of the existing functions will avoid duplicating common aspects of them.

The available ring and descriptor ring

The memory pages occupied by the available ring and descriptor ring can be transmitted via Argo when their contents are updated.

To be verified: A virtio front-end driver will invoke virtqueue_notify to notify the other end of changes, which uses a function pointer in the virtqueue that is set by the transport driver, which can invoke Argo sendv.
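A sketch of what that transport-provided notify callback might look like follows. The virtio_argo_vq_info structure, the argo_sendv wrapper and the message type are assumptions for illustration; the xen_argo_* types are from the Xen public Argo header shown earlier.

#include <linux/virtio.h>

struct xen_argo_iov;   /* from Xen's public argo.h */

/* Hypothetical wrapper around the Argo sendv hypercall (no such in-kernel
 * API exists yet); returns bytes sent or a negative error. */
int argo_sendv(const struct xen_argo_addr *dst, const struct xen_argo_iov *iovs,
               unsigned int niovs, uint32_t message_type);

#define VIRTIO_ARGO_MSG_KICK 0x1   /* placeholder message type */

/* Hypothetical per-virtqueue state installed by the transport's find_vqs. */
struct virtio_argo_vq_info {
        struct xen_argo_addr backend_addr;   /* destination domain id + port */
        struct xen_argo_iov *kick_iovs;      /* available + descriptor ring pages */
        unsigned int kick_niovs;
};

/* Passed as the notify callback to vring_create_virtqueue(). */
static bool virtio_argo_notify(struct virtqueue *vq)
{
        struct virtio_argo_vq_info *info = vq->priv;

        /* Transmit the updated available/descriptor ring contents (or a
         * doorbell message referencing them) to the backend's Argo ring. */
        return argo_sendv(&info->backend_addr, info->kick_iovs,
                          info->kick_niovs, VIRTIO_ARGO_MSG_KICK) >= 0;
}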

Vring use of the DMA interface provided by the virtio transport driver

The logic in vring_use_dma_api ensures that Xen domains always use the Linux DMA API for allocating the memory for vrings. Each DMA operation is invoked with a handle to the virtual device’s parent, vdev->dev.parent, which is set by the virtio transport driver - see virtio_pci_probe for example - and so can be set to a device owned by the virtio transport driver.

The memory to be used for the vring structure is then allocated via the DMA API interface that has been provided. With a new implementation of the DMA API interface, the virtio-argo transport driver will be able to allocate the vrings with memory layout described in the section above, to support registration of a corresponding Argo ring for access to the vring’s used ring.
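A sketch of how such a DMA-ops hookup might look; the virtio_argo_* callbacks are assumptions (the map_page direction handling is sketched further below), while struct dma_map_ops and set_dma_ops() are the standard kernel interfaces.

#include <linux/dma-mapping.h>

/* Hypothetical callbacks implementing Argo-aware allocation and mapping. */
void *virtio_argo_dma_alloc(struct device *dev, size_t size,
                            dma_addr_t *dma_handle, gfp_t gfp,
                            unsigned long attrs);
void virtio_argo_dma_free(struct device *dev, size_t size, void *vaddr,
                          dma_addr_t dma_handle, unsigned long attrs);
dma_addr_t virtio_argo_dma_map_page(struct device *dev, struct page *page,
                                    unsigned long offset, size_t size,
                                    enum dma_data_direction dir,
                                    unsigned long attrs);
void virtio_argo_dma_unmap_page(struct device *dev, dma_addr_t handle,
                                size_t size, enum dma_data_direction dir,
                                unsigned long attrs);

static const struct dma_map_ops virtio_argo_dma_ops = {
        .alloc      = virtio_argo_dma_alloc,     /* vring allocation, laid out as above */
        .free       = virtio_argo_dma_free,
        .map_page   = virtio_argo_dma_map_page,  /* per-buffer handling, sketched below */
        .unmap_page = virtio_argo_dma_unmap_page,
};

static void virtio_argo_setup_dma(struct device *transport_dev)
{
        /* The vring code performs its DMA operations against the virtio
         * device's parent (vdev->dev.parent), so the Argo-aware DMA ops are
         * installed on the transport-owned parent device. */
        set_dma_ops(transport_dev, &virtio_argo_dma_ops);
}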

Virtqueues: enabling Argo in the data path

The transmit sequence for sending data via the virtqueue looks like this (a sketch of a typical front-end transmit follows the lists below):

  • virtqueue_add_outbuf

  • virtqueue_kick_prepare (or virtqueue_kick, which performs this and virtqueue_notify)

  • virtqueue_notify

and when the buffers have been processed by the remote end, they will be indicated as used:

  • virtqueue_poll - returns true if there are pending used buffers to process in the virtqueue

  • vring_interrupt → call to the registered virtqueue callback
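For orientation, here is a minimal sketch of this path as seen from a front-end driver; it uses only the standard virtqueue API listed earlier and is independent of the transport, which is the point: the Argo-specific work happens beneath the DMA mapping and the notify hook.

#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

static int example_send(struct virtqueue *vq, void *buf, unsigned int len)
{
        struct scatterlist sg;
        int err;

        sg_init_one(&sg, buf, len);

        /* Exposes the buffer via the descriptor and available rings; the DMA
         * mapping performed here is what a virtio-argo transport intercepts. */
        err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_ATOMIC);
        if (err)
                return err;

        /* virtqueue_kick() == virtqueue_kick_prepare() + virtqueue_notify();
         * the notify reaches the transport's callback, which for virtio-argo
         * would issue an Argo sendv. */
        if (!virtqueue_kick(vq))
                return -EIO;

        return 0;
}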

The receive sequence for incoming data via the virtqueue looks like:

  • virtqueue_add_inbuf

  • virtqueue_kick_prepare (or virtqueue_kick, which performs this and virtqueue_notify)

  • virtqueue_notify

  • virtqueue_poll - returns true if there are pending used buffers to process in the virtqueue

  • vring_interrupt → call to the registered virtqueue callback

Each of the virtio ring operations for managing exposure of buffers from the front-end virtio driver passes through the DMA operations functions struct provided by the device’s transport driver.

DMA operations

Flow of front-end operations on the virtqueue

The DMA map operations return a dma_addr_t which is then translated to a 64bit value to insert into the virtio descriptor struct written into the vring’s descriptor ring.

Inbound receive buffers

For dma_map_page with a caller requesting the DMA_FROM_DEVICE direction:

  • The transport driver’s DMA API implementation can register a new Argo ring, so that the remote device can invoke the hypervisor to write directly into the buffer

    • the first page of the new ring needs to be a separate page to hold the ring header and incoming message header

    • the subsequent pages of the ring should be the memory supplied by the caller for the buffer to receive into

    • the DMA address returned for insertion into the virtqueue’s descriptor ring should encode the Argo ring address

  • An Argo notification (currently VIRQ) will occur when data has been sent to the buffer

  • The Argo ring can be unregistered when the buffer has been reported as used via the virtqueue’s used ring. It is safe to do so at any point since the driver domain owns the ring memory; it will just prevent any further writes to it from the remote sender once unregistered.

  • At ring registration, the Argo ring’s rx_ptr can be initialized to point to the start of the receive buffer supplied by the caller, minus space for the message header, so that the data written by the hypervisor (performing the Argo sendv on behalf of the remote device) will go directly into the buffer where it is needed.

Outbound transmit buffers

Handling dma_map_page with a caller requesting the DMA_TO_DEVICE direction (a combined sketch covering both directions follows this list):

  • When the remote implementation of the Argo virtio device initializes state, it must register an Argo ring to contain a buffer for receiving incoming data on behalf of the device.

  • The address for the device’s incoming buffer ring must be communicated to the Argo virtio transport driver within the guest, as part of the device discovery protocol.

  • The transport driver’s DMA API implementation will invoke the Argo sendv operation to transmit the data provided in the dma_map_page call into the Argo ring provided by the remote device.

  • When the sendv is issued, the message needs to include an identifier that is then encoded into the DMA address that is returned by dma_map_page for passing into the virtqueue’s descriptor ring.
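Bringing the two directions together, a hypothetical map_page implementation might branch as follows. virtio_argo_register_rx_ring(), virtio_argo_sendv_tx() and the encoding of Argo identifiers into the returned dma_addr_t are assumptions of this design sketch, not existing interfaces; virtio_argo_device is the structure from the registration sketch earlier.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Hypothetical helpers implementing the behaviour in the bullet lists above. */
dma_addr_t virtio_argo_register_rx_ring(struct virtio_argo_device *va,
                                        void *buf, size_t size);
dma_addr_t virtio_argo_sendv_tx(struct virtio_argo_device *va,
                                void *buf, size_t size);

static dma_addr_t virtio_argo_dma_map_page(struct device *dev, struct page *page,
                                           unsigned long offset, size_t size,
                                           enum dma_data_direction dir,
                                           unsigned long attrs)
{
        struct virtio_argo_device *va = dev_get_drvdata(dev);
        void *buf = page_address(page) + offset;

        switch (dir) {
        case DMA_FROM_DEVICE:
                /* Receive buffer: register an Argo ring whose first page is a
                 * separate header page and whose subsequent pages are this
                 * buffer; the returned value encodes the Argo ring address. */
                return virtio_argo_register_rx_ring(va, buf, size);
        case DMA_TO_DEVICE:
                /* Transmit buffer: sendv the data into the Argo ring that the
                 * remote device registered; the returned value encodes the
                 * message identifier included in the sendv. */
                return virtio_argo_sendv_tx(va, buf, size);
        default:
                return DMA_MAPPING_ERROR;
        }
}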

Analysis of the Windows VirtIO implementation

The Windows VirtIO device drivers have the same transport driver abstraction and separated driver structure, with the same core virtqueue data structures, as the Linux VirtIO kernel implementation.

The current kvm-guest-drivers-windows repository includes virtio-pci-modern and virtio-pci-legacy transports. virtio-mmio appears to be absent.

In summary: it looks promising that the DMA translation interfaces could be leveraged to insert the Argo transport in the same fashion as the Linux VirtIO-Argo design within the Windows VirtIO driver implementations, but it is not yet clear whether the client VirtIO drivers would need to be recompiled, since the DMA address translation is performed differently in each of the storage, networking and input drivers inspected.



Notes from Virtio Argo planning meeting, July 2020

Call on Friday 17th July

Attendees:

  • Christopher Clark, Rich Persaud; MBO/BAE

  • Daniel Smith; Apertus Solutions

  • Eric Chanudet, Nick Krasnoff; AIS Boston

Actions:

  • Christopher to produce a draft plan, post to Wire, discuss and then
    put it onto a private page on the OpenXT confluence wiki

  • Christopher and Eric to commence building a prototype as funding permits,
    with Daniel providing design input and review

Topic: Argo as a transport for Virtio

Background: the Virtio implementation in the Linux kernel is stacked:

  • there are many virtio drivers for different virtual device functions

  • each virtio driver uses a common in-kernel virtio transport API,
    and there are multiple alternative transport implementations

  • transport negotiation proceeds, then driver-level negotiation
    to establish the frontend-backend communication for each driver

KVM uses Qemu to provide backend implementations for devices.
To implement Argo as a Virtio transport, will need to make Qemu Argo-aware.

Typically a PCI device is emulated by Qemu, on a bus provided to the guest.
The Bus:Device:Function (bdf) triple maps to Qemu; the guest loads the PCI
virtio transport when the PCI device with that bdf is detected.

For Xen, this is OK for HVM guests, but since the platform PCI device is not
exposed to PVH (or PV) guests, it is not a sufficient method for those cases.

Discussion notes

The existing main Xen PCI code is small, used for discovery.
vPCI is experimental full emulation of PCI within Xen -- Daniel does not
favour this direction.

Paul Durrant has a draft VFIO implementation plan, previously circulated.
Paul is working on emulating PCI devices directly via an IOReq server (see: "demu").

Daniel is interested in the Linux kernel packet mmap, to avoid one world-switch
when processing data.

Christopher raised the Cambridge Argo port-connection design for external
connection of Argo ports between VMs.

Eric raised the vsock-argo implementation; discussion mentioned the Argo development wiki page that describes the requirements and plan previously discussed and agreed for future Argo Linux driver development, which needs to include a non-vsock interface for administrative actions such as runtime firewall configuration.

Daniel is positively inclined towards using ACPI tables for discovery
and notes the executable ACPI table feature, via AML.

Naming: an Argo transport for virtio in the Linux kernel would, to match the
existing naming scheme, be: virtio-argo.c

Group inclined not to require XenStore.

  • Relevant to dom0less systems where XenStore is not necessarily present.

  • Don't want to introduce another dependency on it.

  • uXen demonstrates that it is not a requirement.

  • XenStore is a performance bottleneck (see NoXS work), bad interface, etc.

Topic: connecting a frontend to a matching backend

Argo (domain identifier + port) tuple is needed for each end

  • Note: the toolstack knows about both the front and backend and could
    connect the two together.

  • Note: "dom0less" case is important: no control domain or toolstack
    executing concurrently to perform any connection handshake.
    DomB should be able to provide configuration information to each domain
    and have connections succeed as the drivers in the different domains
    connect to each other.

    • domain reboot plan in that case: to be considered

  • Note: options: frontend can:

    • use Argo wildcards, or

    • know which domain+port to connect to the backend

    • or be preconfigured by some external entity to enable the connection
      (with Argo modified to enable this, per Cambridge plan)

  • Note: ability to restart a backend (potentially to a different domid)
    is valuable

  • Note: Qemu is already used as a single-purpose backend driver for storage.

    • ref: use of Qemu in Xen-on-ARM (PVH architecture, not HVM device emulator)

    • ref: Qemu presence in the xencommons init scripts

    • ref: qcow filesystem driver

    • ref: XenServer storage?

Design topic: surfacing bootstrap/init connection info to guests

Options:

  1. PCI device, with emulation provided by Xen

  2. ACPI tables

    • dynamic content ok: cite existing CPU hotplug behaviour

    • executable content ok: AML

  3. External connection of Argo ports, performed by a toolstack

    • ref: Cambridge Argo design discussion, Q4 2019

  4. Use a wildcard ring on the frontend side for driver to autoconnect

    • can scope reachability via the Argo firewall (XSM or nextgen impl)

Note:
ACPI plan is good for compatibility with DomB work, where ACPI tables are
already being populated to enable PVH guest launch.

Plan:

  • ACPI table -> indicates the Argo tuple(s) for a guest to register
    virtio-argo transport

    • note: if multiple driver domains, will need (at least) one transport
      per driver domain

  • toolstack passes the backend configuration info to the qemu (or demu) instance
    that is running the backend driver

  • qemu: has an added argo backend for virtio

  • the backend connects to the frontend to negotiate transport

For a first cut: just claim a predefined port and use that to avoid the need
for interacting with an ACPI table.

To build:

  • an early basic virtio transport over Argo as a proof of viability

Wiki page to include:

  • describe compatibility of the plan with respect to the Cambridge Argo design
    discussion which covered Argo handling communication across nested
    virtualization.

  • how could a new CPU VMFUNC assist Argo?

    • aim: to obviate or mitigate the need for VMEXITs in the data path.

  • look at v4v's careful use of interrupts (rather than events) for
    delivery of notifications: should be able to reduce the number of Argo VMEXITs.
    -> see the Bromium uxen v4v driver and the first-posted round of the
    Argo upstreaming series.

License of this Document

Copyright (c) 2020 BAE Systems.
Document author: Christopher Clark.
This work is licensed under the Creative Commons Attribution Share-Alike 4.0 International License.
To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/.