Introduction
What is a Virtual-Bus?
Virtual-Bus ("vbus" for short) is a Linux-kernel based virtual IO resource container technology. It allows you to declare virtualized device models directly within a host kernel that can be uniformly accessed from a variety of environments, such as KVM, userspace, lguest, Xen, etc. The goal is to reduce overhead while still preserving proper isolation, so that your IO subsystem achieves maximum throughput and minimum latency. It is installed on the host and, thanks to integration with configfs/sysfs, configured using simple filesystem operations: "mkdir" to create a new container, "ln -s" to associate a device with a container, and "echo" to configure the device. Managing a virtual-bus therefore does not require special userspace tools.
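The three filesystem operations named above can be sketched as follows. This is only an illustration under stated assumptions: the real configfs layout is not described here, so the sketch runs against a temporary directory, and the directory names ("buses", "devices"), the names "my-bus" and "my-device", and the attribute "enabled" are all hypothetical placeholders.

```python
import os
import tempfile


def configure_container(root, bus_name, device_name, attr, value):
    """Mimic the mkdir / ln -s / echo workflow under a stand-in directory.

    `root` stands in for the real configfs mount point. The layout and the
    attribute name are assumptions made purely for illustration.
    """
    bus_dir = os.path.join(root, "buses", bus_name)
    dev_dir = os.path.join(root, "devices", device_name)

    os.makedirs(bus_dir)  # "mkdir": create a new container
    os.makedirs(dev_dir)  # stand-in for an already instantiated device

    # "echo value > attr": configure the device
    with open(os.path.join(dev_dir, attr), "w") as f:
        f.write(value + "\n")

    # "ln -s": associate the device with the container
    link = os.path.join(bus_dir, device_name)
    os.symlink(dev_dir, link)
    return link


with tempfile.TemporaryDirectory() as root:
    link = configure_container(root, "my-bus", "my-device", "enabled", "1")
    # The device is now reachable through the container's directory.
    link_target_ok = os.path.realpath(link) == os.path.realpath(
        os.path.join(root, "devices", "my-device"))
    with open(os.path.join(link, "enabled")) as f:
        attr_value = f.read().strip()
```

The point of the sketch is that every step is an ordinary filesystem call, which is why no dedicated management tool is needed.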
The Problem
The Linux kernel is already well adapted to managing IO resources, in addition to its many other tasks in a modern computing environment. However, there are cases today where these facilities are used indirectly and inefficiently. Consider, for instance, that virtualization technologies like KVM route IO through userspace, as shown in the figure to the right. In addition, a single IO transaction often requires multiple passes, which compounds the effect of the longer path. This ultimately degrades performance by such tangible metrics as throughput and latency.
Hardware vendors have been addressing some of these concerns with technologies such as IOMMU and SR-IOV specifications, but these only address newly emerging hardware. Other strategies have involved using techniques such as PCI pass-through/device-assignment. While this works reasonably well, it requires the dedication of a particular resource to a guest which may or may not be acceptable in all environments.
Our Solution: The Virtual-Bus
Direct access to kernel resources
This project attempts to address the problem at the software architecture layer instead of at the hardware capability layer. Doing so allows us to offer enhanced performance to a wider range of applications and platforms. We introduce a new construct called a "virtual-bus" which allows contexts such as a KVM guest to have more direct access to the underlying hardware, while still providing all of the isolation and policy enforcement that is required in a typical virtualized environment.
The end result is that we can achieve near-native speeds on commodity hardware. This is currently several orders of magnitude faster and/or more responsive than state-of-the-art technologies like virtio-pci based virtio-net.
Details
The virtual-bus concept is modeled closely on the Linux Device-Model (LDM), where we have buses, devices, and drivers as the primary actors. However, VBUS has several distinctions when contrasted with LDM:
- "Buses" in LDM are relatively static and global to the kernel (e.g. "PCI", "USB", etc). VBUS buses are arbitrarily created and destroyed dynamically, and are not globally visible. Instead, they are defined as visible only to a specific subset of the system (the contained context).
- "Devices" in LDM are typically tangible physical (or sometimes logical) devices. VBUS devices are purely software abstractions (which may or may not have one or more physical devices behind them). Devices may also be arbitrarily created or destroyed by software/administrative action as opposed to by a hardware discovery mechanism.
- "Drivers" in LDM sit within the same kernel context as the buses and devices they interact with. VBUS drivers live in a foreign context (such as userspace, or a virtual-machine guest).
The idea is that a vbus is created to contain access to some IO services. Virtual devices are then instantiated and linked to a bus to grant access to drivers actively present on the bus. Drivers will only have visibility to devices present on their respective bus, and nothing else.
Virtual devices are defined by modules which register a deviceclass with the system. A deviceclass simply represents a type of device that _may_ be instantiated into a device, should an administrator wish to do so. Once this has happened, the device may be associated with one or more buses where it will become visible to all clients of those respective buses.
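The relationship between deviceclasses, devices, and buses described above can be modeled in a few lines. This is a toy sketch, not the kernel implementation: the class names, the `register_deviceclass` helper, and the names "toy-net" and "net0" are hypothetical and chosen only for illustration.

```python
class DeviceClass:
    """A type of device that may be instantiated on request."""

    def __init__(self, name):
        self.name = name

    def instantiate(self, instance_name):
        return Device(instance_name, self)


class Device:
    def __init__(self, name, deviceclass):
        self.name = name
        self.deviceclass = deviceclass


class Bus:
    """A container whose clients see only the devices linked to it."""

    def __init__(self, name):
        self.name = name
        self._devices = {}

    def link(self, device):
        self._devices[device.name] = device

    def visible_devices(self):
        return sorted(self._devices)


registry = {}  # deviceclass name -> DeviceClass


def register_deviceclass(name):
    """What a module would do when it is loaded (hypothetical helper)."""
    registry[name] = DeviceClass(name)
    return registry[name]


# A module registers a class; an administrator instantiates a device
# and links it to two of the three buses.
register_deviceclass("toy-net")
dev = registry["toy-net"].instantiate("net0")
bus_a, bus_b, bus_c = Bus("a"), Bus("b"), Bus("c")
bus_a.link(dev)
bus_b.link(dev)
```

Clients of `bus_a` and `bus_b` both see "net0", while clients of `bus_c` see nothing, which is the visibility rule the text describes.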
Key Concepts
- Direct kernel access - Traditional userspace applications have direct access to the kernel. Let's extend that notion to other contexts such as KVM guests.
- Simple API - Devices in vbus only support two verbs: call() and shm(). The former provides a synchronous verb interface, while the latter allows for asynchronous communication via shared-memory. The idea is that these two basic but powerful interfaces should be robust enough to describe an unlimited range of IO services.
- Lockless shared-memory + queuing - Devices and their drivers may communicate in a contention-less manner, much as would be the case between a physical device and the host cpu system.
- Bidirectional signal mitigation - The Linux networking stack introduced "NAPI" a few years ago, which included an interrupt mitigation strategy based on a hybrid polled/interrupt-driven design. For physical hardware, this is generally enough because access from the cpu to the device is "cheap". However, virtualization has introduced an additional cost where IO is expensive in both directions. Therefore, we build on the mitigation technique introduced by NAPI to reduce the amount of IO required in *both* directions.
- Better concurrent utilization of multi-core hardware - Use kthreads to parallelize IO processing (such as soft GSO/CSUM). Modern hardware continues to offer more and more processing cores. If we break up IO processing into a pipeline of threads, we can achieve higher utilization rates and more performance.
You can read a more thorough introduction to the usage of these new facilities here
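The lockless queuing and signal mitigation ideas from the list above can be illustrated with a toy single-producer/single-consumer ring. This is a sketch under simplifying assumptions, not the vbus ring itself: the class and method names are hypothetical, plain Python integers stand in for shared-memory indices that would need memory barriers, and a real implementation must re-check the ring after re-arming notifications to close the race with a concurrent producer.

```python
class ToyRing:
    """Single-producer/single-consumer ring with a notification flag."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0                # advanced only by the producer
        self.tail = 0                # advanced only by the consumer
        self.notify_enabled = True   # owned by the consumer
        self.signals = 0             # number of (expensive) signals sent

    def produce(self, item):
        if self.head - self.tail == self.size:
            return False             # ring is full
        self.slots[self.head % self.size] = item
        self.head += 1
        if self.notify_enabled:
            self._signal_consumer()
        return True

    def _signal_consumer(self):
        # The consumer's handler disables further signals and switches
        # to polling, as a NAPI-style driver would.
        self.signals += 1
        self.notify_enabled = False

    def drain(self):
        """Consumer side: poll until empty, then re-arm notifications."""
        items = []
        while self.tail != self.head:
            items.append(self.slots[self.tail % self.size])
            self.tail += 1
        self.notify_enabled = True
        return items


ring = ToyRing(8)
for i in range(5):
    ring.produce(i)          # only the first produce sends a signal
first_batch = ring.drain()   # consumer polls, then re-arms
ring.produce(5)              # ring was idle again, so this signals
```

Six items cross the ring at the cost of two signals. Applying the same flag in the opposite direction (for example, for "space available" notifications back to the producer) is what would make the mitigation bidirectional.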
Security Model
Exposing a kernel interface directly to something such as a guest requires careful security considerations. We do not want to provide a channel for a guest to violate protection and/or isolation domains. To address this, vbus has various components of its design to mitigate exposure risks.
- Host-only administration interface - The only way to create a bus, and/or create a device on a bus, is via the administrative interface on the host.
- Isolated namespace - The client's view of a vbus is a bus-specific namespace of device-ids which consists solely of the devices that have been explicitly placed on that bus by administrative action. Access to this namespace is managed through a narrow gate, and kernel items outside this namespace, including devices on other buses, are invisible and inaccessible.
- Per-task admittance policy - A Linux task can only associate with, at most, one vbus at a time. This means that a task can only see the device-id namespace of the devices on its associated bus and nothing else. This is enforced by the host kernel by placing a reference to the associated vbus on the task-struct itself. Again, the only way to modify this association is via a host-based administrative operation. Note that multiple tasks can associate to the same vbus, which would commonly be used by all threads in an app, or all vcpus in a guest, etc.
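The isolated namespace and per-task admittance rules above can be modeled as follows. This is a toy sketch: the `Vbus` and `Task` classes and the `host_associate` and `lookup` helpers are hypothetical names, and the single `vbus` attribute merely plays the role of the reference that the text says is kept on the task.

```python
class Vbus:
    def __init__(self, name):
        self.name = name
        self.devices = {}   # device-id -> device, private to this bus


class Task:
    def __init__(self, pid):
        self.pid = pid
        self.vbus = None    # a single reference: at most one bus at a time


def host_associate(task, bus):
    """Host-side administrative operation: bind a task to one bus."""
    task.vbus = bus


def lookup(task, device_id):
    """Resolve a device-id strictly within the task's own bus namespace."""
    if task.vbus is None:
        raise PermissionError("task is not associated with a vbus")
    try:
        return task.vbus.devices[device_id]
    except KeyError:
        raise LookupError("no such device on this bus") from None


bus_a, bus_b = Vbus("a"), Vbus("b")
bus_a.devices[0] = "net0"
bus_b.devices[0] = "disk0"
bus_b.devices[1] = "net1"

# Two vcpus of one guest share a bus; another task uses a different bus.
vcpu0, vcpu1, other = Task(100), Task(101), Task(200)
host_associate(vcpu0, bus_a)
host_associate(vcpu1, bus_a)
host_associate(other, bus_b)
```

The same device-id resolves to different devices depending on the caller's bus, and device-id 1 on `bus_b` simply does not exist from the point of view of `vcpu0`.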
Other Security Considerations
- Asynchronous error reporting - The asynchronous nature of the shm/ring interfaces implies we have the potential for asynchronous faults. E.g. "garbage" in the ring might not be discovered at the EIP of the guest vcpu when it actually inserts the error, but rather later when the host side tries to update the ring. A naive implementation would have the host do a BUG_ON() when it discovers the discrepancy. Instead, we utilize an asynchronous fault mechanism that ensures the guest is always the one punished (via something like a machine-check for guests, or SIGABRT for userspace, etc).
- Signal-path robustness - Because vbus supports a variety of different environments, we call guest/userspace "north" and the host/kernel "south". When the north wants to communicate with the south, it is perfectly OK to stall the north indefinitely if the south is not ready. However, it is not OK to stall the south when communicating with the north, because this is an attack vector. E.g. a malicious/broken guest could simply stop servicing its ring to cause threads in the host to jam up, which is undesirable. We therefore design all south-to-north signaling paths to be robust against stalling: rather than simply blocking, as they might in the guest, they manage backpressure more intelligently. For instance, in venet-tap, a "transmit" from netif that has to be injected into the south-to-north ring when it is full will result in a netif_stop_queue().
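The non-blocking south-to-north behavior described above can be sketched as a toy model. This is not the venet-tap code: the class and its methods are hypothetical, the `queue_stopped` flag only mimics the effect of netif_stop_queue(), and restarting the queue once the north consumes an entry is an assumption about how such a path would typically recover.

```python
from collections import deque


class SouthToNorthPath:
    """Host-to-guest path that never blocks the host side."""

    def __init__(self, ring_size):
        self.ring = deque()
        self.ring_size = ring_size
        self.queue_stopped = False

    def transmit(self, packet):
        """South side: returns immediately, whether or not there is room."""
        if len(self.ring) >= self.ring_size:
            # Push the backpressure upstream instead of waiting on the
            # guest (the effect of netif_stop_queue() in the text).
            self.queue_stopped = True
            return False
        self.ring.append(packet)
        return True

    def north_consume(self):
        """North side: take one entry; space is available again."""
        if not self.ring:
            return None
        packet = self.ring.popleft()
        self.queue_stopped = False   # assumed recovery step
        return packet


path = SouthToNorthPath(ring_size=2)
results = [path.transmit(p) for p in ("a", "b", "c")]
stopped_when_full = path.queue_stopped
consumed = path.north_consume()
retry_ok = path.transmit("c")
```

A guest that never calls `north_consume()` leaves the queue stopped, but no host thread ever waits on it, which is the property the text asks of every south-to-north path.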