Cross Reference: /src/lib/libnvmm

History log of /src/lib/libnvmm
Revision	Date	Author	Comments
1.7	18-Sep-2025	mrg	introduce a couple of new turn-off-gcc-warning variables and use them. GCC 14 has a new annoying calloc() checker that we turn off in a bunch of places, and there are a few more dangling-pointer issuse that come up, but seem bogus.
1.6	28-Apr-2019	maxv	branches: 1.6.2; Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
1.5	13-Nov-2018	martin	branches: 1.5.2; Too much magic involved - revert previous.
1.4	13-Nov-2018	martin	Need some minimalistic support for additional things that ../Makefile requires, even if we do nothing here
1.3	13-Nov-2018	martin	Move conditionals for libnvmm to subdir makefile, requested boy mrg.
1.2	12-Nov-2018	nakayama	No need to install shared libraries to /lib.
1.1	10-Nov-2018	maxv	Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM software to effortlessly create and manage virtual machines via NVMM. It is mostly complete, only nvmm_assist_mem needs to be filled -- I have a draft for that, but it needs some more care. This Mem Assist should not be needed when emulating a system in x2apic mode, so theoretically the current form of libnvmm is sufficient to emulate a whole class of systems. Generally speaking, there are so many modes in x86 that it is difficult to handle each corner case without introducing a ton of checks that just slow down the common-case execution. Currently we check a limited number of things; we may add more checks in the future if they turn out to be needed, but that's rather low priority. Libnvmm is compiled and installed only on amd64. A man page (reviewed by wiz@) is provided.
1.5.2.2	26-Nov-2018	pgoyette	Sync with HEAD, resolve a couple of conflicts
1.5.2.1	13-Nov-2018	pgoyette	file Makefile was added on branch pgoyette-compat on 2018-11-26 01:52:13 +0000
1.6.2.2	10-Jun-2019	christos	Sync with HEAD
1.6.2.1	28-Apr-2019	christos	file Makefile was added on branch phil-wifi on 2019-06-10 22:05:25 +0000
1.29	26-Jul-2025	skrll	Note example code updated to use the new API. Bump date.
1.28	10-Dec-2021	msaitoh	branches: 1.28.4; s/premissions/permissions/
1.27	05-Sep-2020	maxv	nvmm: update copyright headers
1.26	09-Feb-2020	maxv	Reference nvmmctl(8).
1.25	28-Oct-2019	maxv	should be fork(2), noticed by wiz
1.24	28-Oct-2019	wiz	Macro tidyness.
1.23	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
1.22	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
1.21	27-Oct-2019	maxv	Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago.
1.20	25-Oct-2019	maxv	Update the libnvmm man page: - Sync the naming with reality. - Replace "relevant" by "desired" and "virtualizer" by "emulator", closer to what I meant. - Add a "VCPU Configuration" section. - Add a "Machine Ownership" section.
1.19	08-Jun-2019	maxv	branches: 1.19.2; 1.19.4; Change the NVMM API to reduce data movements. Sent to tech-kern@.
1.18	11-May-2019	maxv	Replace "VMM" by "emulator", clearer.
1.17	11-May-2019	maxv	Sync with reality.
1.16	29-Apr-2019	maxv	sync with reality
1.15	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
1.14	07-Apr-2019	maxv	Sync, and fix grammar.
1.13	04-Apr-2019	maxv	Check the GPA permissions too in the Assists, because it is possible that the guest traps on a page the virtualizer marked as read-only (even if it appears as read-write in the HVA).
1.12	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
1.11	05-Feb-2019	wiz	Mark up NULL with Dv. Remove empty line.
1.10	05-Feb-2019	maxv	Sync with reality, and improve.
1.9	07-Jan-2019	wiz	Remove leading zero from date.
1.8	07-Jan-2019	maxv	Optimize: on single memory operand instructions, take the GPA directly from the exit structure provided by the kernel. This saves an MMU translation, and sometimes complex address computation (eg SIB). Drop the GVA field, it is not useful to virtualizers.
1.7	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
1.6	27-Dec-2018	maxv	Several improvements and fixes: * Change the Assist API. Rather than passing callbacks in each call, the callbacks are now registered beforehand. Then change the I/O Assist to fetch MMIO data via the Mem callback. This allows a guest to perform an I/O string operation on a memory that is itself an MMIO. * Introduce two new functions internal to libnvmm, read_guest_memory and write_guest_memory. They can handle mapped memory, MMIO memory and cross-page transactions. * Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned addresses. This simplifies a lot of things. * Support the MOVS instruction, and add a test for it. This instruction is special, in that it takes two implicit memory operands. In particular, it means that the two buffers can both be in MMIO memory, and we handle this case. * Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and there.
1.5	15-Dec-2018	maxv	Invert the mapping logic. Until now, the "owner" of the memory was the guest, and by calling nvmm_gpa_map(), the virtualizer was creating a view towards the guest memory. Qemu expects the contrary: it wants the owner to be the virtualizer, and nvmm_gpa_map should just create a view from the guest towards the virtualizer's address space. Under this scheme, it is legal to have two GPAs that point to the same HVA. Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just perform an enter into the desired UOBJ. With this change in place, all the mapping-related problems in Qemu+NVMM are fixed.
1.4	12-Dec-2018	wiz	Remove superfluous dot.
1.3	12-Dec-2018	maxv	Change the "FILES" section, in the end I don't want to commit toyvirt and smallkern, there is little interest installing them by default, rather they can be downloaded from www. It's better this way. While here add NVMM(4) in "SEE ALSO".
1.2	10-Nov-2018	maxv	branches: 1.2.2; Add copyright and RCSID, from wiz@.
1.1	10-Nov-2018	maxv	Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM software to effortlessly create and manage virtual machines via NVMM. It is mostly complete, only nvmm_assist_mem needs to be filled -- I have a draft for that, but it needs some more care. This Mem Assist should not be needed when emulating a system in x2apic mode, so theoretically the current form of libnvmm is sufficient to emulate a whole class of systems. Generally speaking, there are so many modes in x86 that it is difficult to handle each corner case without introducing a ton of checks that just slow down the common-case execution. Currently we check a limited number of things; we may add more checks in the future if they turn out to be needed, but that's rather low priority. Libnvmm is compiled and installed only on amd64. A man page (reviewed by wiz@) is provided.
1.2.2.4	18-Jan-2019	pgoyette	Synch with HEAD
1.2.2.3	26-Dec-2018	pgoyette	Sync with HEAD, resolve a few conflicts
1.2.2.2	26-Nov-2018	pgoyette	Sync with HEAD, resolve a couple of conflicts
1.2.2.1	10-Nov-2018	pgoyette	file libnvmm.3 was added on branch pgoyette-compat on 2018-11-26 01:52:13 +0000
1.19.4.2	10-Feb-2020	martin	Pull up following revision(s) (requested by maxv in ticket #688): share/man/man4/nvmm.4: revision 1.5 lib/libnvmm/libnvmm.3: revision 1.26 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.47 Mmh, as noted in PR/54847, this should be uint64_t, not uint16_t. Harmless because we use only the two lowest bits anyway. I believe this could be caught by KUBSAN; time to do another round of NVMM+K_SAN testing. Reference nvmmctl(8).
1.19.4.1	10-Nov-2019	martin	Pull up following revision(s) (requested by maxv in ticket #405): usr.sbin/nvmmctl/nvmmctl.8: revision 1.2 lib/libnvmm/libnvmm.3: revision 1.24 sys/dev/nvmm/nvmm.h: revision 1.11 lib/libnvmm/libnvmm.3: revision 1.25 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16 sys/dev/nvmm/nvmm.h: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18 share/mk/bsd.hostprog.mk: revision 1.82 lib/libnvmm/libnvmm.c: revision 1.15 distrib/sets/lists/base/md.amd64: revision 1.281 tests/lib/libnvmm/h_mem_assist.c: revision 1.13 lib/libnvmm/libnvmm.c: revision 1.16 tests/lib/libnvmm/h_mem_assist.c: revision 1.14 lib/libnvmm/libnvmm_x86.c: revision 1.32 lib/libnvmm/libnvmm.c: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.15 lib/libnvmm/libnvmm_x86.c: revision 1.33 lib/libnvmm/libnvmm.c: revision 1.18 usr.sbin/nvmmctl/Makefile: revision 1.1 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7 tests/lib/libnvmm/h_mem_assist.c: revision 1.16 lib/libnvmm/libnvmm_x86.c: revision 1.34 usr.sbin/nvmmctl/Makefile: revision 1.2 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8 tests/lib/libnvmm/h_mem_assist.c: revision 1.17 sys/dev/nvmm/nvmm_internal.h: revision 1.13 lib/libnvmm/libnvmm_x86.c: revision 1.35 lib/libnvmm/libnvmm_x86.c: revision 1.36 usr.sbin/postinstall/postinstall.in: revision 1.8 lib/libnvmm/libnvmm_x86.c: revision 1.37 lib/libnvmm/libnvmm_x86.c: revision 1.38 lib/libnvmm/libnvmm_x86.c: revision 1.39 usr.sbin/Makefile: revision 1.282 lib/libnvmm/nvmm.h: revision 1.13 lib/libnvmm/nvmm.h: revision 1.14 lib/libnvmm/nvmm.h: revision 1.15 sys/dev/nvmm/nvmm.c: revision 1.23 lib/libnvmm/nvmm.h: revision 1.16 sys/dev/nvmm/nvmm.c: revision 1.24 lib/libnvmm/nvmm.h: revision 1.17 sys/dev/nvmm/nvmm.c: revision 1.25 tests/lib/libnvmm/h_io_assist.c: revision 1.9 etc/MAKEDEV.tmpl: revision 1.209 tests/lib/libnvmm/h_io_assist.c: revision 1.10 tests/lib/libnvmm/h_io_assist.c: revision 1.11 etc/group: revision 1.35 distrib/sets/lists/man/mi: revision 1.1660 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51 sys/dev/nvmm/nvmm_ioctl.h: revision 1.8 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52 sys/dev/nvmm/nvmm_ioctl.h: revision 1.9 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53 usr.sbin/nvmmctl/nvmmctl.c: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.20 distrib/sets/lists/debug/md.amd64: revision 1.106 lib/libnvmm/libnvmm.3: revision 1.21 lib/libnvmm/libnvmm.3: revision 1.22 usr.sbin/nvmmctl/nvmmctl.8: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.23 Fix incorrect parsing: the R/M field uses a special GPR map when the address size is 16 bits, regardless of the actual operating mode. With this special map there can be two registers referenced at once, and also disp16-only. Implement this special behavior, and add associated tests. While here simplify a few things. With this in place, the Windows 95 installer initializes correctly. Part of PR/54611. add missing initializer Implement XCHG, add associated tests, and add comments to explain. With this in place the Windows 95 installer completes successfuly. Part of PR/54611. Improve nvmm_vcpu_dump(). Put back 'default', because llvm apparently doesn't realize that all cases are covered in the switch. Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added. Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU. Update the libnvmm man page: - Sync the naming with reality. - Replace "relevant" by "desired" and "virtualizer" by "emulator", closer to what I meant. - Add a "VCPU Configuration" section. - Add a "Machine Ownership" section. Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago. Use the new PTE naming, and define CR3_FRAME_ separately. No functional change. Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test. Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in qemu-nvmm. Add PCID support in the guests. This speeds up most 64bit guests, because since Meltdown, everybody uses PCID (including NetBSD). Change the way root_owner works: consider the calling process as root_owner not if it has root privileges, but if the /dev/nvmm device was opened with write permissions. Introduce the undocumented nvmm_root_init() function to achieve that. The goal is to simplify the logic and have more granularity, eg if we want a monitoring agent to access VMs but don't want to give this agent real root access on the system. A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure. Add nram in struct nvmm_ctl_mach_info. Add nvmmctl, with two commands for now. Macro tidyness. Sort SEE ALSO. should be fork(2), noticed by wiz Add debug entry for newly introduced nvmmctl utility. Annotate a covering switch as such to avoid warnings about missing returns. Forgot to put nvmmctl in the "nvmm" group. Add nvmm group.
1.19.2.3	13-Apr-2020	martin	Mostly merge changes from HEAD upto 20200411
1.19.2.2	10-Jun-2019	christos	Sync with HEAD
1.19.2.1	08-Jun-2019	christos	file libnvmm.3 was added on branch phil-wifi on 2019-06-10 22:05:25 +0000
1.28.4.1	02-Aug-2025	perseant	Sync with HEAD
1.20	06-Apr-2021	reinoud	Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without signals. This introduces a new kernel and userland NVMM version indicating this support. Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request. This is the missing libnvmm part I forgot to include in the origional commit.
1.19	05-Sep-2020	maxv	nvmm: update copyright headers
1.18	27-Oct-2019	maxv	Change the way root_owner works: consider the calling process as root_owner not if it has root privileges, but if the /dev/nvmm device was opened with write permissions. Introduce the undocumented nvmm_root_init() function to achieve that. The goal is to simplify the logic and have more granularity, eg if we want a monitoring agent to access VMs but don't want to give this agent real root access on the system.
1.17	27-Oct-2019	maxv	Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago.
1.16	23-Oct-2019	maxv	Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU.
1.15	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
1.14	08-Jun-2019	maxv	branches: 1.14.2; 1.14.4; Change the NVMM API to reduce data movements. Sent to tech-kern@.
1.13	11-May-2019	maxv	Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
1.12	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
1.11	29-Apr-2019	maxv	Remove useless calls to nvmm_init().
1.10	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
1.9	10-Apr-2019	maxv	Add the NVMM_CTL ioctl, always privileged regardless of the permissions of /dev/nvmm. We'll use it to provide a way for an admin to control the registered VMs in the kernel. Add an associated wrapper in libnvmm.
1.8	04-Apr-2019	maxv	Check the GPA permissions too in the Assists, because it is possible that the guest traps on a page the virtualizer marked as read-only (even if it appears as read-write in the HVA).
1.7	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
1.6	27-Dec-2018	maxv	Several improvements and fixes: * Change the Assist API. Rather than passing callbacks in each call, the callbacks are now registered beforehand. Then change the I/O Assist to fetch MMIO data via the Mem callback. This allows a guest to perform an I/O string operation on a memory that is itself an MMIO. * Introduce two new functions internal to libnvmm, read_guest_memory and write_guest_memory. They can handle mapped memory, MMIO memory and cross-page transactions. * Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned addresses. This simplifies a lot of things. * Support the MOVS instruction, and add a test for it. This instruction is special, in that it takes two implicit memory operands. In particular, it means that the two buffers can both be in MMIO memory, and we handle this case. * Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and there.
1.5	15-Dec-2018	maxv	Invert the mapping logic. Until now, the "owner" of the memory was the guest, and by calling nvmm_gpa_map(), the virtualizer was creating a view towards the guest memory. Qemu expects the contrary: it wants the owner to be the virtualizer, and nvmm_gpa_map should just create a view from the guest towards the virtualizer's address space. Under this scheme, it is legal to have two GPAs that point to the same HVA. Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just perform an enter into the desired UOBJ. With this change in place, all the mapping-related problems in Qemu+NVMM are fixed.
1.4	12-Dec-2018	maxv	Change the map/unmap functions, again.
1.3	29-Nov-2018	maxv	Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when there is an overlap. Close to what Qemu expects.
1.2	19-Nov-2018	maxv	branches: 1.2.2; Fix error handling of realloc, and use memmove because the areas overlap; noted by agc@. These _nvmm_area_add/delete functions don't make a lot of sense right now and will likely be rewritten to match the behavior expected by Qemu; but still fix for the time being. Also fix a collision check while here.
1.1	10-Nov-2018	maxv	Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM software to effortlessly create and manage virtual machines via NVMM. It is mostly complete, only nvmm_assist_mem needs to be filled -- I have a draft for that, but it needs some more care. This Mem Assist should not be needed when emulating a system in x2apic mode, so theoretically the current form of libnvmm is sufficient to emulate a whole class of systems. Generally speaking, there are so many modes in x86 that it is difficult to handle each corner case without introducing a ton of checks that just slow down the common-case execution. Currently we check a limited number of things; we may add more checks in the future if they turn out to be needed, but that's rather low priority. Libnvmm is compiled and installed only on amd64. A man page (reviewed by wiz@) is provided.
1.2.2.4	18-Jan-2019	pgoyette	Synch with HEAD
1.2.2.3	26-Dec-2018	pgoyette	Sync with HEAD, resolve a few conflicts
1.2.2.2	26-Nov-2018	pgoyette	Sync with HEAD, resolve a couple of conflicts
1.2.2.1	19-Nov-2018	pgoyette	file libnvmm.c was added on branch pgoyette-compat on 2018-11-26 01:52:13 +0000
1.14.4.1	10-Nov-2019	martin	Pull up following revision(s) (requested by maxv in ticket #405): usr.sbin/nvmmctl/nvmmctl.8: revision 1.2 lib/libnvmm/libnvmm.3: revision 1.24 sys/dev/nvmm/nvmm.h: revision 1.11 lib/libnvmm/libnvmm.3: revision 1.25 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16 sys/dev/nvmm/nvmm.h: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18 share/mk/bsd.hostprog.mk: revision 1.82 lib/libnvmm/libnvmm.c: revision 1.15 distrib/sets/lists/base/md.amd64: revision 1.281 tests/lib/libnvmm/h_mem_assist.c: revision 1.13 lib/libnvmm/libnvmm.c: revision 1.16 tests/lib/libnvmm/h_mem_assist.c: revision 1.14 lib/libnvmm/libnvmm_x86.c: revision 1.32 lib/libnvmm/libnvmm.c: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.15 lib/libnvmm/libnvmm_x86.c: revision 1.33 lib/libnvmm/libnvmm.c: revision 1.18 usr.sbin/nvmmctl/Makefile: revision 1.1 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7 tests/lib/libnvmm/h_mem_assist.c: revision 1.16 lib/libnvmm/libnvmm_x86.c: revision 1.34 usr.sbin/nvmmctl/Makefile: revision 1.2 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8 tests/lib/libnvmm/h_mem_assist.c: revision 1.17 sys/dev/nvmm/nvmm_internal.h: revision 1.13 lib/libnvmm/libnvmm_x86.c: revision 1.35 lib/libnvmm/libnvmm_x86.c: revision 1.36 usr.sbin/postinstall/postinstall.in: revision 1.8 lib/libnvmm/libnvmm_x86.c: revision 1.37 lib/libnvmm/libnvmm_x86.c: revision 1.38 lib/libnvmm/libnvmm_x86.c: revision 1.39 usr.sbin/Makefile: revision 1.282 lib/libnvmm/nvmm.h: revision 1.13 lib/libnvmm/nvmm.h: revision 1.14 lib/libnvmm/nvmm.h: revision 1.15 sys/dev/nvmm/nvmm.c: revision 1.23 lib/libnvmm/nvmm.h: revision 1.16 sys/dev/nvmm/nvmm.c: revision 1.24 lib/libnvmm/nvmm.h: revision 1.17 sys/dev/nvmm/nvmm.c: revision 1.25 tests/lib/libnvmm/h_io_assist.c: revision 1.9 etc/MAKEDEV.tmpl: revision 1.209 tests/lib/libnvmm/h_io_assist.c: revision 1.10 tests/lib/libnvmm/h_io_assist.c: revision 1.11 etc/group: revision 1.35 distrib/sets/lists/man/mi: revision 1.1660 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51 sys/dev/nvmm/nvmm_ioctl.h: revision 1.8 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52 sys/dev/nvmm/nvmm_ioctl.h: revision 1.9 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53 usr.sbin/nvmmctl/nvmmctl.c: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.20 distrib/sets/lists/debug/md.amd64: revision 1.106 lib/libnvmm/libnvmm.3: revision 1.21 lib/libnvmm/libnvmm.3: revision 1.22 usr.sbin/nvmmctl/nvmmctl.8: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.23 Fix incorrect parsing: the R/M field uses a special GPR map when the address size is 16 bits, regardless of the actual operating mode. With this special map there can be two registers referenced at once, and also disp16-only. Implement this special behavior, and add associated tests. While here simplify a few things. With this in place, the Windows 95 installer initializes correctly. Part of PR/54611. add missing initializer Implement XCHG, add associated tests, and add comments to explain. With this in place the Windows 95 installer completes successfuly. Part of PR/54611. Improve nvmm_vcpu_dump(). Put back 'default', because llvm apparently doesn't realize that all cases are covered in the switch. Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added. Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU. Update the libnvmm man page: - Sync the naming with reality. - Replace "relevant" by "desired" and "virtualizer" by "emulator", closer to what I meant. - Add a "VCPU Configuration" section. - Add a "Machine Ownership" section. Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago. Use the new PTE naming, and define CR3_FRAME_ separately. No functional change. Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test. Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in qemu-nvmm. Add PCID support in the guests. This speeds up most 64bit guests, because since Meltdown, everybody uses PCID (including NetBSD). Change the way root_owner works: consider the calling process as root_owner not if it has root privileges, but if the /dev/nvmm device was opened with write permissions. Introduce the undocumented nvmm_root_init() function to achieve that. The goal is to simplify the logic and have more granularity, eg if we want a monitoring agent to access VMs but don't want to give this agent real root access on the system. A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure. Add nram in struct nvmm_ctl_mach_info. Add nvmmctl, with two commands for now. Macro tidyness. Sort SEE ALSO. should be fork(2), noticed by wiz Add debug entry for newly introduced nvmmctl utility. Annotate a covering switch as such to avoid warnings about missing returns. Forgot to put nvmmctl in the "nvmm" group. Add nvmm group.
1.14.2.3	13-Apr-2020	martin	Mostly merge changes from HEAD upto 20200411
1.14.2.2	10-Jun-2019	christos	Sync with HEAD
1.14.2.1	08-Jun-2019	christos	file libnvmm.c was added on branch phil-wifi on 2019-06-10 22:05:25 +0000
1.43	27-Dec-2020	reinoud	Implement support for trapping REP CMPS instructions in NVMM. Qemu would abort hard when NVMM would get a memory trap on the instruction since it didn't know it.
1.42	31-Oct-2020	reinoud	Revert (REPE) CMPS support per request of Maxime, it is incorrect.
1.41	30-Oct-2020	reinoud	Implement missing (REPE) CMPS instruction support in NVMMs x86_decode(). In apparently rare cases the (REPE) CMPS instruction can trigger an memory assist. NVMM wouldn't recognize the instruction and thus couldn't assist and Qemu would abort.
1.40	05-Sep-2020	maxv	nvmm: update copyright headers
1.39	28-Oct-2019	joerg	Annotate a covering switch as such to avoid warnings about missing returns.
1.38	27-Oct-2019	maxv	Use the new PTE naming, and define CR3_FRAME_* separately. No functional change.
1.37	23-Oct-2019	maxv	Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU.
1.36	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
1.35	19-Oct-2019	maxv	Put back 'default', because llvm apparently doesn't realize that all cases are covered in the switch.
1.34	14-Oct-2019	maxv	Improve nvmm_vcpu_dump().
1.33	14-Oct-2019	maxv	Implement XCHG, add associated tests, and add comments to explain. With this in place the Windows 95 installer completes successfuly. Part of PR/54611.
1.32	13-Oct-2019	maxv	Fix incorrect parsing: the R/M field uses a special GPR map when the address size is 16 bits, regardless of the actual operating mode. With this special map there can be two registers referenced at once, and also disp16-only. Implement this special behavior, and add associated tests. While here simplify a few things. With this in place, the Windows 95 installer initializes correctly. Part of PR/54611.
1.31	08-Jun-2019	maxv	branches: 1.31.2; 1.31.4; Change the NVMM API to reduce data movements. Sent to tech-kern@.
1.30	11-May-2019	maxv	Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
1.29	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
1.28	04-Apr-2019	maxv	Check the GPA permissions too in the Assists, because it is possible that the guest traps on a page the virtualizer marked as read-only (even if it appears as read-write in the HVA).
1.27	07-Mar-2019	maxv	Micro optimizations: - Compress x86_rexpref, x86_regmodrm, x86_opcode and x86_instr. - Cache-align the register, opcode and group tables. - Modify the opcode tables to have 256 entries, and avoid a lookup.
1.26	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
1.25	26-Feb-2019	maxv	Set hardseg to -1 rather than 0, because 0 can be a valid segment.
1.24	17-Feb-2019	maxv	Fix handling of SIB instructions. We were jumping to the SIB node _before_ fetching the displacement, so the node would always think there was no displacement. This didn't alter the final GPA we would be touching - because it is fetched from the kernel directly and not from the computation -, but it altered the instruction length, and on some guests (like Fedora 64bit), the VCPU would resume execution at the wrong RIP and crash. Now these guests work.
1.23	15-Feb-2019	maxv	Remove the PSE check in the 32bit-PAE MMU. Setting CR4.PAE automatically enables PSE regardless of whether CR4.PSE is set or not, so we should just ignore it. With this in place I can boot Windows 8.1 on NVMM.
1.22	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
1.21	12-Feb-2019	maxv	Optimize: fetch only 5 bytes instead of 15, the instruction can have only up to five prefixes.
1.20	10-Feb-2019	christos	#### is not legal.
1.19	07-Feb-2019	maxv	Improvements: - Emulate the instructions by executing them directly on the host CPU. This is easier and probably faster than doing it in software manually. - Decode SUB from Primary, CMP from Group1, TEST from Group3, and add associated tests. - Handle correctly the cases where an instruction that always implicitly reads the register operand is executed with the mem operand as source (eg: "orq (%rbx),%rax"). - Fix the MMU handling of 32bit-PAE. Under PAE CR3 is not page-aligned, so there are extra bits that are valid. With these changes in place I can boot Windows XP on Qemu+NVMM.
1.18	01-Feb-2019	maxv	Fix two issues: * Uh I put the wrong masks in some GPRs, fuck. * When the opsize of MOVZX is 4, we need to combine the zero-extend from the instruction with the natural zero-extend of long mode. Add two associated tests.
1.17	27-Jan-2019	pgoyette	Merge the [pgoyette-compat] branch
1.16	26-Jan-2019	maxv	Ah, fix bug: when the opcode has an immediate, we fill the src with a register storage, but then we overwrite it without zeroing out the highest bits of the resulting immediate (which may contain garbage from the union).
1.15	13-Jan-2019	maxv	Handle more corner cases, clean up a little, and add a set of instructions in Group1.
1.14	08-Jan-2019	maxv	Handle REPN. FreeBSD has a "repn movs", which is a bit unusual, but doesn't seem illegal as far as I can tell from the AMD SDM. With that, I can boot FreeBSD on Qemu+NVMM.
1.13	07-Jan-2019	maxv	Optimize the legpref node: omit BRN (we don't care and it's the same as OVR_CS), inline the loops, sort the checks from most to least likely prefix, and use a compact structure.
1.12	07-Jan-2019	maxv	Optimize: on single memory operand instructions, take the GPA directly from the exit structure provided by the kernel. This saves an MMU translation, and sometimes complex address computation (eg SIB). Drop the GVA field, it is not useful to virtualizers.
1.11	07-Jan-2019	maxv	Improvements and fixes: * Decode AND/OR/XOR from Group1. * Sign-extend the immediates and displacements in 64bit mode. * Fix the storage of {read,write}_guest_memory, now that we batch certain IO operations we can copy more than 8 bytes, and shit hits the fan. * Remove the CR4_PSE check in the 64bit MMU. This bit is actually ignored in long mode, and some systems (like FreeBSD) don't set it.
1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
1.9	04-Jan-2019	maxv	In !64bit mode RIP-relative is null+disp32, handle that correctly.
1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
1.7	29-Dec-2018	maxv	Fix the segmentation check, the limit is relative, not absolute.
1.6	27-Dec-2018	maxv	Several improvements and fixes: * Change the Assist API. Rather than passing callbacks in each call, the callbacks are now registered beforehand. Then change the I/O Assist to fetch MMIO data via the Mem callback. This allows a guest to perform an I/O string operation on a memory that is itself an MMIO. * Introduce two new functions internal to libnvmm, read_guest_memory and write_guest_memory. They can handle mapped memory, MMIO memory and cross-page transactions. * Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned addresses. This simplifies a lot of things. * Support the MOVS instruction, and add a test for it. This instruction is special, in that it takes two implicit memory operands. In particular, it means that the two buffers can both be in MMIO memory, and we handle this case. * Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and there.
1.5	15-Dec-2018	maxv	Two changes: - Fix the I/O Assist, for INS* it is RDI and not RSI, and the register gets updated regardless of the REP prefix. - Fill in the Mem Assist. We decode and emulate certain instructions, and pass a mem descriptor to the callback to handle the transaction. The disassembler could use some polishing, and there are still a few instructions missing; but basically it works.
1.4	17-Nov-2018	maxv	branches: 1.4.2; Don't forget to set 'prot' when the guest has paging disabled.
1.3	13-Nov-2018	maya	Revert my own rev 1.2, the missing include was only when building the 32-bit compat library, we no longer do this.
1.2	11-Nov-2018	maya	Add missing include for struct nvmm_x64_state (Pointed out by the clang build)
1.1	10-Nov-2018	maxv	Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM software to effortlessly create and manage virtual machines via NVMM. It is mostly complete, only nvmm_assist_mem needs to be filled -- I have a draft for that, but it needs some more care. This Mem Assist should not be needed when emulating a system in x2apic mode, so theoretically the current form of libnvmm is sufficient to emulate a whole class of systems. Generally speaking, there are so many modes in x86 that it is difficult to handle each corner case without introducing a ton of checks that just slow down the common-case execution. Currently we check a limited number of things; we may add more checks in the future if they turn out to be needed, but that's rather low priority. Libnvmm is compiled and installed only on amd64. A man page (reviewed by wiz@) is provided.
1.4.2.5	26-Jan-2019	pgoyette	Sync with HEAD
1.4.2.4	18-Jan-2019	pgoyette	Synch with HEAD
1.4.2.3	26-Dec-2018	pgoyette	Sync with HEAD, resolve a few conflicts
1.4.2.2	26-Nov-2018	pgoyette	Sync with HEAD, resolve a couple of conflicts
1.4.2.1	17-Nov-2018	pgoyette	file libnvmm_x86.c was added on branch pgoyette-compat on 2018-11-26 01:52:13 +0000
1.31.4.1	10-Nov-2019	martin	Pull up following revision(s) (requested by maxv in ticket #405): usr.sbin/nvmmctl/nvmmctl.8: revision 1.2 lib/libnvmm/libnvmm.3: revision 1.24 sys/dev/nvmm/nvmm.h: revision 1.11 lib/libnvmm/libnvmm.3: revision 1.25 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16 sys/dev/nvmm/nvmm.h: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18 share/mk/bsd.hostprog.mk: revision 1.82 lib/libnvmm/libnvmm.c: revision 1.15 distrib/sets/lists/base/md.amd64: revision 1.281 tests/lib/libnvmm/h_mem_assist.c: revision 1.13 lib/libnvmm/libnvmm.c: revision 1.16 tests/lib/libnvmm/h_mem_assist.c: revision 1.14 lib/libnvmm/libnvmm_x86.c: revision 1.32 lib/libnvmm/libnvmm.c: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.15 lib/libnvmm/libnvmm_x86.c: revision 1.33 lib/libnvmm/libnvmm.c: revision 1.18 usr.sbin/nvmmctl/Makefile: revision 1.1 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7 tests/lib/libnvmm/h_mem_assist.c: revision 1.16 lib/libnvmm/libnvmm_x86.c: revision 1.34 usr.sbin/nvmmctl/Makefile: revision 1.2 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8 tests/lib/libnvmm/h_mem_assist.c: revision 1.17 sys/dev/nvmm/nvmm_internal.h: revision 1.13 lib/libnvmm/libnvmm_x86.c: revision 1.35 lib/libnvmm/libnvmm_x86.c: revision 1.36 usr.sbin/postinstall/postinstall.in: revision 1.8 lib/libnvmm/libnvmm_x86.c: revision 1.37 lib/libnvmm/libnvmm_x86.c: revision 1.38 lib/libnvmm/libnvmm_x86.c: revision 1.39 usr.sbin/Makefile: revision 1.282 lib/libnvmm/nvmm.h: revision 1.13 lib/libnvmm/nvmm.h: revision 1.14 lib/libnvmm/nvmm.h: revision 1.15 sys/dev/nvmm/nvmm.c: revision 1.23 lib/libnvmm/nvmm.h: revision 1.16 sys/dev/nvmm/nvmm.c: revision 1.24 lib/libnvmm/nvmm.h: revision 1.17 sys/dev/nvmm/nvmm.c: revision 1.25 tests/lib/libnvmm/h_io_assist.c: revision 1.9 etc/MAKEDEV.tmpl: revision 1.209 tests/lib/libnvmm/h_io_assist.c: revision 1.10 tests/lib/libnvmm/h_io_assist.c: revision 1.11 etc/group: revision 1.35 distrib/sets/lists/man/mi: revision 1.1660 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51 sys/dev/nvmm/nvmm_ioctl.h: revision 1.8 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52 sys/dev/nvmm/nvmm_ioctl.h: revision 1.9 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53 usr.sbin/nvmmctl/nvmmctl.c: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.20 distrib/sets/lists/debug/md.amd64: revision 1.106 lib/libnvmm/libnvmm.3: revision 1.21 lib/libnvmm/libnvmm.3: revision 1.22 usr.sbin/nvmmctl/nvmmctl.8: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.23 Fix incorrect parsing: the R/M field uses a special GPR map when the address size is 16 bits, regardless of the actual operating mode. With this special map there can be two registers referenced at once, and also disp16-only. Implement this special behavior, and add associated tests. While here simplify a few things. With this in place, the Windows 95 installer initializes correctly. Part of PR/54611. add missing initializer Implement XCHG, add associated tests, and add comments to explain. With this in place the Windows 95 installer completes successfuly. Part of PR/54611. Improve nvmm_vcpu_dump(). Put back 'default', because llvm apparently doesn't realize that all cases are covered in the switch. Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added. Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU. Update the libnvmm man page: - Sync the naming with reality. - Replace "relevant" by "desired" and "virtualizer" by "emulator", closer to what I meant. - Add a "VCPU Configuration" section. - Add a "Machine Ownership" section. Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago. Use the new PTE naming, and define CR3_FRAME_ separately. No functional change. Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test. Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in qemu-nvmm. Add PCID support in the guests. This speeds up most 64bit guests, because since Meltdown, everybody uses PCID (including NetBSD). Change the way root_owner works: consider the calling process as root_owner not if it has root privileges, but if the /dev/nvmm device was opened with write permissions. Introduce the undocumented nvmm_root_init() function to achieve that. The goal is to simplify the logic and have more granularity, eg if we want a monitoring agent to access VMs but don't want to give this agent real root access on the system. A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure. Add nram in struct nvmm_ctl_mach_info. Add nvmmctl, with two commands for now. Macro tidyness. Sort SEE ALSO. should be fork(2), noticed by wiz Add debug entry for newly introduced nvmmctl utility. Annotate a covering switch as such to avoid warnings about missing returns. Forgot to put nvmmctl in the "nvmm" group. Add nvmm group.
1.31.2.3	13-Apr-2020	martin	Mostly merge changes from HEAD upto 20200411
1.31.2.2	10-Jun-2019	christos	Sync with HEAD
1.31.2.1	08-Jun-2019	christos	file libnvmm_x86.c was added on branch phil-wifi on 2019-06-10 22:05:25 +0000
1.19	06-Apr-2021	reinoud	Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without signals. This introduces a new kernel and userland NVMM version indicating this support. Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request. This is the missing libnvmm part I forgot to include in the origional commit.
1.18	05-Sep-2020	maxv	nvmm: update copyright headers
1.17	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
1.16	27-Oct-2019	maxv	Change the way root_owner works: consider the calling process as root_owner not if it has root privileges, but if the /dev/nvmm device was opened with write permissions. Introduce the undocumented nvmm_root_init() function to achieve that. The goal is to simplify the logic and have more granularity, eg if we want a monitoring agent to access VMs but don't want to give this agent real root access on the system.
1.15	27-Oct-2019	maxv	Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago.
1.14	23-Oct-2019	maxv	Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU.
1.13	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
1.12	08-Jun-2019	maxv	branches: 1.12.2; 1.12.4; Change the NVMM API to reduce data movements. Sent to tech-kern@.
1.11	11-May-2019	maxv	Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
1.10	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
1.9	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
1.8	10-Apr-2019	maxv	Add the NVMM_CTL ioctl, always privileged regardless of the permissions of /dev/nvmm. We'll use it to provide a way for an admin to control the registered VMs in the kernel. Add an associated wrapper in libnvmm.
1.7	04-Apr-2019	maxv	Check the GPA permissions too in the Assists, because it is possible that the guest traps on a page the virtualizer marked as read-only (even if it appears as read-write in the HVA).
1.6	07-Jan-2019	maxv	Optimize: on single memory operand instructions, take the GPA directly from the exit structure provided by the kernel. This saves an MMU translation, and sometimes complex address computation (eg SIB). Drop the GVA field, it is not useful to virtualizers.
1.5	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
1.4	27-Dec-2018	maxv	Several improvements and fixes: * Change the Assist API. Rather than passing callbacks in each call, the callbacks are now registered beforehand. Then change the I/O Assist to fetch MMIO data via the Mem callback. This allows a guest to perform an I/O string operation on a memory that is itself an MMIO. * Introduce two new functions internal to libnvmm, read_guest_memory and write_guest_memory. They can handle mapped memory, MMIO memory and cross-page transactions. * Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned addresses. This simplifies a lot of things. * Support the MOVS instruction, and add a test for it. This instruction is special, in that it takes two implicit memory operands. In particular, it means that the two buffers can both be in MMIO memory, and we handle this case. * Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and there.
1.3	15-Dec-2018	maxv	Invert the mapping logic. Until now, the "owner" of the memory was the guest, and by calling nvmm_gpa_map(), the virtualizer was creating a view towards the guest memory. Qemu expects the contrary: it wants the owner to be the virtualizer, and nvmm_gpa_map should just create a view from the guest towards the virtualizer's address space. Under this scheme, it is legal to have two GPAs that point to the same HVA. Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just perform an enter into the desired UOBJ. With this change in place, all the mapping-related problems in Qemu+NVMM are fixed.
1.2	29-Nov-2018	maxv	Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when there is an overlap. Close to what Qemu expects.
1.1	10-Nov-2018	maxv	branches: 1.1.2; Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM software to effortlessly create and manage virtual machines via NVMM. It is mostly complete, only nvmm_assist_mem needs to be filled -- I have a draft for that, but it needs some more care. This Mem Assist should not be needed when emulating a system in x2apic mode, so theoretically the current form of libnvmm is sufficient to emulate a whole class of systems. Generally speaking, there are so many modes in x86 that it is difficult to handle each corner case without introducing a ton of checks that just slow down the common-case execution. Currently we check a limited number of things; we may add more checks in the future if they turn out to be needed, but that's rather low priority. Libnvmm is compiled and installed only on amd64. A man page (reviewed by wiz@) is provided.
1.1.2.4	18-Jan-2019	pgoyette	Synch with HEAD
1.1.2.3	26-Dec-2018	pgoyette	Sync with HEAD, resolve a few conflicts
1.1.2.2	26-Nov-2018	pgoyette	Sync with HEAD, resolve a couple of conflicts
1.1.2.1	10-Nov-2018	pgoyette	file nvmm.h was added on branch pgoyette-compat on 2018-11-26 01:52:13 +0000
1.12.4.1	10-Nov-2019	martin	Pull up following revision(s) (requested by maxv in ticket #405): usr.sbin/nvmmctl/nvmmctl.8: revision 1.2 lib/libnvmm/libnvmm.3: revision 1.24 sys/dev/nvmm/nvmm.h: revision 1.11 lib/libnvmm/libnvmm.3: revision 1.25 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16 sys/dev/nvmm/nvmm.h: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.12 sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18 share/mk/bsd.hostprog.mk: revision 1.82 lib/libnvmm/libnvmm.c: revision 1.15 distrib/sets/lists/base/md.amd64: revision 1.281 tests/lib/libnvmm/h_mem_assist.c: revision 1.13 lib/libnvmm/libnvmm.c: revision 1.16 tests/lib/libnvmm/h_mem_assist.c: revision 1.14 lib/libnvmm/libnvmm_x86.c: revision 1.32 lib/libnvmm/libnvmm.c: revision 1.17 tests/lib/libnvmm/h_mem_assist.c: revision 1.15 lib/libnvmm/libnvmm_x86.c: revision 1.33 lib/libnvmm/libnvmm.c: revision 1.18 usr.sbin/nvmmctl/Makefile: revision 1.1 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7 tests/lib/libnvmm/h_mem_assist.c: revision 1.16 lib/libnvmm/libnvmm_x86.c: revision 1.34 usr.sbin/nvmmctl/Makefile: revision 1.2 tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8 tests/lib/libnvmm/h_mem_assist.c: revision 1.17 sys/dev/nvmm/nvmm_internal.h: revision 1.13 lib/libnvmm/libnvmm_x86.c: revision 1.35 lib/libnvmm/libnvmm_x86.c: revision 1.36 usr.sbin/postinstall/postinstall.in: revision 1.8 lib/libnvmm/libnvmm_x86.c: revision 1.37 lib/libnvmm/libnvmm_x86.c: revision 1.38 lib/libnvmm/libnvmm_x86.c: revision 1.39 usr.sbin/Makefile: revision 1.282 lib/libnvmm/nvmm.h: revision 1.13 lib/libnvmm/nvmm.h: revision 1.14 lib/libnvmm/nvmm.h: revision 1.15 sys/dev/nvmm/nvmm.c: revision 1.23 lib/libnvmm/nvmm.h: revision 1.16 sys/dev/nvmm/nvmm.c: revision 1.24 lib/libnvmm/nvmm.h: revision 1.17 sys/dev/nvmm/nvmm.c: revision 1.25 tests/lib/libnvmm/h_io_assist.c: revision 1.9 etc/MAKEDEV.tmpl: revision 1.209 tests/lib/libnvmm/h_io_assist.c: revision 1.10 tests/lib/libnvmm/h_io_assist.c: revision 1.11 etc/group: revision 1.35 distrib/sets/lists/man/mi: revision 1.1660 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43 sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51 sys/dev/nvmm/nvmm_ioctl.h: revision 1.8 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52 sys/dev/nvmm/nvmm_ioctl.h: revision 1.9 sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53 usr.sbin/nvmmctl/nvmmctl.c: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.20 distrib/sets/lists/debug/md.amd64: revision 1.106 lib/libnvmm/libnvmm.3: revision 1.21 lib/libnvmm/libnvmm.3: revision 1.22 usr.sbin/nvmmctl/nvmmctl.8: revision 1.1 lib/libnvmm/libnvmm.3: revision 1.23 Fix incorrect parsing: the R/M field uses a special GPR map when the address size is 16 bits, regardless of the actual operating mode. With this special map there can be two registers referenced at once, and also disp16-only. Implement this special behavior, and add associated tests. While here simplify a few things. With this in place, the Windows 95 installer initializes correctly. Part of PR/54611. add missing initializer Implement XCHG, add associated tests, and add comments to explain. With this in place the Windows 95 installer completes successfuly. Part of PR/54611. Improve nvmm_vcpu_dump(). Put back 'default', because llvm apparently doesn't realize that all cases are covered in the switch. Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added. Three changes in libnvmm: - Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem structures. - Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'. - Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS, it now becomes per-VCPU. Update the libnvmm man page: - Sync the naming with reality. - Replace "relevant" by "desired" and "virtualizer" by "emulator", closer to what I meant. - Add a "VCPU Configuration" section. - Add a "Machine Ownership" section. Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few days ago. Use the new PTE naming, and define CR3_FRAME_ separately. No functional change. Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test. Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in qemu-nvmm. Add PCID support in the guests. This speeds up most 64bit guests, because since Meltdown, everybody uses PCID (including NetBSD). Change the way root_owner works: consider the calling process as root_owner not if it has root privileges, but if the /dev/nvmm device was opened with write permissions. Introduce the undocumented nvmm_root_init() function to achieve that. The goal is to simplify the logic and have more granularity, eg if we want a monitoring agent to access VMs but don't want to give this agent real root access on the system. A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure. Add nram in struct nvmm_ctl_mach_info. Add nvmmctl, with two commands for now. Macro tidyness. Sort SEE ALSO. should be fork(2), noticed by wiz Add debug entry for newly introduced nvmmctl utility. Annotate a covering switch as such to avoid warnings about missing returns. Forgot to put nvmmctl in the "nvmm" group. Add nvmm group.
1.12.2.3	13-Apr-2020	martin	Mostly merge changes from HEAD upto 20200411
1.12.2.2	10-Jun-2019	christos	Sync with HEAD
1.12.2.1	08-Jun-2019	christos	file nvmm.h was added on branch phil-wifi on 2019-06-10 22:05:25 +0000
1.1	21-Nov-2024	riastradh	branches: 1.1.4; libnvmm: Add expected symbols list. amd64-specific nvmm.x86_64.expsym for now -- I expect this to be machine-dependent for a while. To make sure future architectures get expsyms, I've also added an empty machine-independent nvmm.expsym which will trigger a build failure. PR lib/58838: shared libraries in base should all have expsym lists
1.1.4.2	02-Aug-2025	perseant	Sync with HEAD
1.1.4.1	21-Nov-2024	perseant	file nvmm.x86_64.expsym was added on branch perseant-exfatfs on 2025-08-02 05:54:53 +0000
1.1	10-Nov-2018	maxv	branches: 1.1.2; 1.1.4; Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM software to effortlessly create and manage virtual machines via NVMM. It is mostly complete, only nvmm_assist_mem needs to be filled -- I have a draft for that, but it needs some more care. This Mem Assist should not be needed when emulating a system in x2apic mode, so theoretically the current form of libnvmm is sufficient to emulate a whole class of systems. Generally speaking, there are so many modes in x86 that it is difficult to handle each corner case without introducing a ton of checks that just slow down the common-case execution. Currently we check a limited number of things; we may add more checks in the future if they turn out to be needed, but that's rather low priority. Libnvmm is compiled and installed only on amd64. A man page (reviewed by wiz@) is provided.
1.1.4.2	10-Jun-2019	christos	Sync with HEAD
1.1.4.1	10-Nov-2018	christos	file shlib_version was added on branch phil-wifi on 2019-06-10 22:05:25 +0000
1.1.2.2	26-Nov-2018	pgoyette	Sync with HEAD, resolve a couple of conflicts
1.1.2.1	10-Nov-2018	pgoyette	file shlib_version was added on branch pgoyette-compat on 2018-11-26 01:52:13 +0000

OpenGrok