Home | History | Annotate | only in /src/sys/dev/nvmm/x86
History log of /src/sys/dev/nvmm/x86
RevisionDateAuthorComments
 1.1 07-Nov-2018  maxv branches: 1.1.2; 1.1.6;
Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that
provides support for hardware-accelerated virtualization on NetBSD.

It is made of an MI frontend, to which MD backends can be plugged. One
MD backend is implemented, x86-SVM, for x86 AMD CPUs.

We install

/usr/include/dev/nvmm/nvmm.h
/usr/include/dev/nvmm/nvmm_ioctl.h
/usr/include/dev/nvmm/{arch}/nvmm_{arch}.h

And the kernel module. For now, the only architecture where we do that
is amd64 (arch=x86).

NVMM is not enabled by default in amd64-GENERIC, but is instead easily
modloadable.

Sent to tech-kern@ a month ago. Validated with kASan, and optimized
with tprof.
 1.1.6.2 10-Jun-2019  christos Sync with HEAD
 1.1.6.1 07-Nov-2018  christos file Makefile was added on branch phil-wifi on 2019-06-10 22:07:14 +0000
 1.1.2.2 26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.1.2.1 07-Nov-2018  pgoyette file Makefile was added on branch pgoyette-compat on 2018-11-26 01:52:32 +0000
 1.23 06-Oct-2022  msaitoh Update some AMD CPUID bits:

- Rename FSREP_MOV to FSRM.
- Add Memory Bandwidth Enforcement (MBE)
- Add AMD's PPIN. Rename CPUID_SEF_PPIN to CPUID_SEF_INTEL_PPIN.
- Add Collaborative Processor Performance Control (CPPC).
- Add HOST_MCE_OVERRIDE.
- Add some unknown bits as Bxx.
- Add comments.
- Use __BIT().
 1.22 20-Aug-2022  riastradh x86: Move page attribute table bits to x86/pat.h.
 1.21 08-Sep-2020  maxv nvmm: cosmetic changes

- Style.
- Explicitly include ioccom.h.
 1.20 06-Sep-2020  riastradh Fix fallout from previous uvm.h cleanup.

- pmap(9) needs uvm/uvm_extern.h.

- x86/pmap.h is not usable on its own; it is only usable if included
via uvm/uvm_extern.h (-> uvm/uvm_pmap.h -> machine/pmap.h).

- Make nvmm.h and nvmm_internal.h standalone.
 1.19 05-Sep-2020  riastradh Round of uvm.h cleanup.

The poorly named uvm.h is generally supposed to be for uvm-internal
users only.

- Narrow it to files that actually need it -- mostly files that need
to query whether curlwp is the pagedaemon, which should maybe be
exposed by an external header.

- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
by it. We should split up uvm_extern.h but this will serve for now
to reduce the uvm.h dependencies.

- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
UVMHIST(ubchist), since ubchist is declared in uvm.h but the
reference evaporates if UVMHIST is not defined, so we reduce header
file dependencies.

- Make uvm_device.h and uvm_swap.h independently includable while
here.

ok chs@
 1.18 05-Sep-2020  maxv x86: fix several CPUID flags

- Rename: CPUID_PN -> CPUID_PSN
CPUID_CFLUSH -> CPUID_CLFSH
CPUID_SBF -> CPUID_PBE
CPUID_LZCNT -> CPUID_ABM
CPUID_P1GB -> CPUID_PAGE1GB
CPUID2_PCLMUL -> CPUID2_PCLMULQDQ
CPUID2_CID -> CPUID2_CNXTID
CPUID2_xTPR -> CPUID2_XTPR
CPUID2_AES -> CPUID2_AESNI
To match the x86 specification and the other OSes.

- Remove: CPUID_B10, CPUID_B20, CPUID_IA64. They do not exist.
 1.17 05-Sep-2020  maxv nvmm: update copyright headers
 1.16 04-Sep-2020  maxv nvmm-x86: improve the CPUID emulation

- Mask DTES64, DS_CPL, CID, SDBG, xTPR, PN.
- B10, B20 and IA64 do not exist, so just remove them.
 1.15 22-Aug-2020  maxv nvmm-x86: hide more CPUID flags, mostly related to perf monitors
 1.14 20-Aug-2020  maxv nvmm-x86: improve the CPUID emulation

- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.13 20-Aug-2020  maxv nvmm-x86: advertise the SERIALIZE instruction, available on future CPUs
 1.12 11-Aug-2020  maxv Hide OSPKE. NFC since the host never uses PKU, but still.
 1.11 05-Aug-2020  maxv Improve the CPUID emulation:

- Hide SGX*, PKU, WAITPKG, and SKINIT, because they are not supported.
- Hide HLE and RTM, part of TSX. Because TSX is just too buggy and we
cannot guarantee that it remains enabled in the guest (if for example
the host disables TSX while the guest is running). Nobody wants this
crap anyway, so bye-bye.
- Advertise FSREP_MOV, because no reason to hide it.
 1.10 05-Aug-2020  maxv Make it easier to understand what's going on, no functional change.
 1.9 09-May-2020  maxv Improve the CPUID emulation of basic leaves:
- Hide DCA and PQM, they cannot be used in guests.
- On Intel, explicitly handle each basic leaf until 0x16.
- On AMD, explicitly handle each basic leaf until 0x0D.
 1.8 16-Nov-2019  maxv Don't report MWAITX by default.
 1.7 15-May-2019  maxv branches: 1.7.2; 1.7.4;
NVMM: Expose MD_CLEAR to the guests.
 1.6 06-Apr-2019  maxv Replace the misc[] state by a new compressed nvmm_x64_state_intr structure,
which describes the interruptibility state of the guest.

Add evt_pending, read-only, that allows the virtualizer to know if an event
is pending.
 1.5 03-Apr-2019  maxv VMX: if PAT is not valid, #GP on WRMSR, rather than crashing the guest.
 1.4 03-Apr-2019  maxv Add MSR_TSC.
 1.3 03-Mar-2019  maxv Choose which CPUID bits to allow, rather than which bits to disallow. This
is clearer, and also forward compatible with future CPUs.

While here be more consistent when allowing the bits, and sync between
nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest
state we provide is only x86+SSE. Fixes a CentOS panic when booting on
NVMM, reported by Jared McNeill, thanks.
 1.2 26-Feb-2019  maxv Change the layout of the SEG state:

- Reorder it, to match the CPU encoding. This is the universal order,
also used by Qemu. Drop the seg_to_nvmm[] tables.

- Compress it. This divides its size by two.

- Rename some of its fields, to better match the x86 spec. Also, take S
out of Type, this was a NetBSD-ism that was likely confusing to other
people.
 1.1 23-Feb-2019  maxv Install the x86 RESET state at VCPU creation time, for convenience, so
that the libnvmm users can expect a functional VCPU right away.
 1.7.4.8 15-Oct-2022  martin Pull up following revision(s) (requested by msaitoh in ticket #1542):

sys/arch/x86/include/specialreg.h: revision 1.189
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.23
usr.sbin/cpuctl/arch/i386.c: revision 1.128
sys/arch/x86/include/specialreg.h: revision 1.190
sys/arch/x86/include/specialreg.h: revision 1.191
sys/arch/x86/include/specialreg.h: revision 1.192

s/shareing/sharing/. No functional change.

Add top-down slots event bit of architectural performance monitoring leaf.

Modify CPUID Fn0000000a %ebx's string. Add new string for %ecx.

Modify output of CPUID Fn0000000a.
old:
cpu0: Perfmon-eax 0x8300805<VERSION=0x5,GPCounter=0x8,GPBitwidth=0x30>
cpu0: Perfmon-eax 0x8300805<Vectorlen=0x8>
cpu0: Perfmon-edx 0x8604<FixedFunc=0x4,FFBitwidth=0x30,ANYTHREADDEPR>
new:
cpu0: Perfmon: Ver. 5
cpu0: Perfmon: General: bitwidth 48, 8 counters
cpu0: Perfmon: General: avail 0xff<CORECYCL,INST,REFCYCL,LLCREF,LLCMISS,BRINST>
cpu0: Perfmon: General: avail 0xff<BRMISPR,TOPDOWNSLOT>
cpu0: Perfmon: Fixed: bitwidth 48, 4 counters
cpu0: Perfmon: Fixed: avail 0xf<INST,CLK_CORETHREAD,CLK_REF_TSC,TOPDOWNSLOT>

Update some AMD CPUID bits:
- Rename FSREP_MOV to FSRM.
- Add Memory Bandwidth Enforcement (MBE)
- Add AMD's PPIN. Rename CPUID_SEF_PPIN to CPUID_SEF_INTEL_PPIN.
- Add Collaborative Processor Performance Control (CPPC).
- Add HOST_MCE_OVERRIDE.
- Add some unknown bits as Bxx.
- Add comments.
- Use __BIT().
 1.7.4.7 08-Dec-2021  martin Pull up the following revisions, requested by msaitoh in ticket #1391:

sys/arch/x86/include/specialreg.h 1.171, 1.173-1.178
sys/arch/x86/x86/identcpu.c 1.106, 1.117,
1.122 via patch
sys/dev/nvmm/x86/nvmm_x86.c 1.18
sys/external/bsd/drm2/drm/drm_cache.c 1.14
sys/external/bsd/drm2/include/asm/cpufeature.h 1.5
usr.sbin/cpuctl/arch/i386.c 1.114-1.117


- Add LA57, PKE, PKS, CET, CET_U, CET_S, HWP, KL, AVX512_BF16, TME_EN
and PCONFIG.
- Rename some macros to match the x86 specification and the other OSes.
- Print CPUID 0x8000008 %ebx on Intel, too.
- Print CPUID leaf 7 subleaf 1.
- Identify Tiger Lake, 3rd gen Xeon Scalable (Ice Lake), Elkhart Lake
and Jasper Lake.
- Add comment.
- KNF. Whitespace fix.
 1.7.4.6 13-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1077):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.74
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.16

Improve emulation of MSR_IA32_ARCH_CAPABILITIES: publish only the *_NO
bits. Initially they were the only ones there, but Intel then added other
bits we aren't interested in, and they must be filtered out.

nvmm-x86-svm: improve the handling of MSR_EFER

Intercept reads of it as well, just to mask EFER_SVME, which the guest
doesn't need to see.

nvmm-x86: improve the CPUID emulation

- Mask DTES64, DS_CPL, CID, SDBG, xTPR, PN.
- B10, B20 and IA64 do not exist, so just remove them.
 1.7.4.5 29-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1068):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.71
sys/dev/nvmm/nvmm.c: revision 1.34
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.72
sys/dev/nvmm/nvmm.c: revision 1.35
sys/dev/nvmm/nvmm.c: revision 1.36
sys/dev/nvmm/x86/nvmm_x86_svmfunc.S: revision 1.5
sys/dev/nvmm/nvmm.c: revision 1.37
sys/dev/nvmm/x86/nvmm_x86_vmxfunc.S: revision 1.5
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.15
sys/dev/nvmm/nvmm_ioctl.h: revision 1.10

Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.

nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes

nvmm: localify a variable that doesn't need to be global

nvmm: use relaxed atomics to read nmachines

nvmm-x86-svm: dedup code

nvmm-x86: hide more CPUID flags, mostly related to perf monitors

nvmm: misc improvements
- use mach->ncpus to get the number of vcpus, now that we have it
- don't forget to decrement mach->ncpus when a machine gets killed
- add more __predict_false()

nvmm-x86-svm: don't forget to intercept INVD
INVD executed in the guest can be dangerous for the host, due to CPU
caches being flushed without write-back.

nvmm: slightly clarify

nvmm: explicitly include atomic.h
 1.7.4.4 26-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1058):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.19
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.71
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.11
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.13
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.14

Improve the CPUID emulation:
- Hide SGX*, PKU, WAITPKG, and SKINIT, because they are not supported.
- Hide HLE and RTM, part of TSX. Because TSX is just too buggy and we
cannot guarantee that it remains enabled in the guest (if for example
the host disables TSX while the guest is running). Nobody wants this
crap anyway, so bye-bye.
- Advertise FSREP_MOV, because no reason to hide it.

Hide OSPKE. NFC since the host never uses PKU, but still.

Improve the CPUID emulation on nvmm-intel:
- Limit the highest extended leaf.
- Limit 0x00000007 to ECX=0, for future-proofness.

nvmm-x86-svm: improve the CPUID emulation

Limit the hypervisor range, and properly handle each basic leaf until 0xD.

nvmm-x86: advertise the SERIALIZE instruction, available on future CPUs

nvmm-x86: improve the CPUID emulation
- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.7.4.3 18-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1055):

sys/dev/nvmm/nvmm.h: revision 1.13
sys/dev/nvmm/nvmm.h: revision 1.14
sys/dev/nvmm/nvmm.c: revision 1.33
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.67
sys/dev/nvmm/nvmm_internal.h: revision 1.17
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.67
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.10

Put the few x86-specific structures under #ifdef __x86_64__, for clarity.

Make it easier to understand what's going on, no functional change.

Add new field definitions.

Add new field definitions, and intercept everything, for future-proofness.

Add CTASSERT.
 1.7.4.2 21-May-2020  martin Pull up following revision(s) (requested by maxv in ticket #919):

sys/dev/nvmm/x86/nvmm_x86.c: revision 1.9
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.60
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.61
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.56
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.57
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.58
sys/dev/nvmm/nvmm.c: revision 1.29

Improve the CPUID emulation of basic leaves:
- Hide DCA and PQM, they cannot be used in guests.
- On Intel, explicitly handle each basic leaf until 0x16.
- On AMD, explicitly handle each basic leaf until 0x0D.

Respect the convention for the hypervisor information: return the highest
hypervisor leaf in 0x40000000.EAX.

Improve the CPUID emulation on nvmm-intel: limit the highest basic and
hypervisor leaves.

Complete rev1.26: reset nvmm_impl to NULL in nvmm_fini().
 1.7.4.1 16-Nov-2019  martin Pull up following revision(s) (requested by jmcneill in ticket #434):

sys/dev/nvmm/x86/nvmm_x86.c: revision 1.8

Don't report MWAITX by default.
 1.7.2.3 13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.7.2.2 10-Jun-2019  christos Sync with HEAD
 1.7.2.1 15-May-2019  christos file nvmm_x86.c was added on branch phil-wifi on 2019-06-10 22:07:14 +0000
 1.21 26-Mar-2021  reinoud Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without
signals. This introduces a new kernel and userland NVMM version indicating
this support.

Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request.
 1.20 05-Sep-2020  maxv branches: 1.20.2; 1.20.4;
nvmm: update copyright headers
 1.19 20-Aug-2020  maxv nvmm-x86: improve the CPUID emulation

- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.18 28-Oct-2019  maxv A few changes:

- Use smaller types in struct nvmm_capability.
- Use smaller type for nvmm_io.port.
- Switch exitstate to a compacted structure.
 1.17 27-Oct-2019  maxv Add a new VCPU conf option, that allows userland to request VMEXITs after a
TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs.

The reason for wanting this option is that certain OSes (like Win10 64bit)
manage interrupt priority in hardware via CR8 directly, and for these OSes,
the emulator may want to sync its internal TPR state on each change.

Add two new fields in cap.arch, to report the conf capabilities. Report TPR
only on Intel for now, not AMD, because I don't have a recent AMD CPU on
which to test.
 1.16 23-Oct-2019  maxv Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
 1.15 11-May-2019  maxv branches: 1.15.2; 1.15.4;
Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.
 1.14 01-May-2019  maxv Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.
 1.13 28-Apr-2019  maxv Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.
 1.12 27-Apr-2019  maxv Reorder the NVMM headers, to make a clear(er) distinction between MI and
MD. Also use #defines for the exit reasons rather than an union. No ABI
change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
 1.11 06-Apr-2019  maxv Replace the misc[] state by a new compressed nvmm_x64_state_intr structure,
which describes the interruptibility state of the guest.

Add evt_pending, read-only, that allows the virtualizer to know if an event
is pending.
 1.10 03-Apr-2019  maxv VMX: if PAT is not valid, #GP on WRMSR, rather than crashing the guest.
 1.9 03-Apr-2019  maxv Add MSR_TSC.
 1.8 03-Mar-2019  maxv Choose which CPUID bits to allow, rather than which bits to disallow. This
is clearer, and also forward compatible with future CPUs.

While here be more consistent when allowing the bits, and sync between
nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest
state we provide is only x86+SSE. Fixes a CentOS panic when booting on
NVMM, reported by Jared McNeill, thanks.
 1.7 26-Feb-2019  maxv Change the layout of the SEG state:

- Reorder it, to match the CPU encoding. This is the universal order,
also used by Qemu. Drop the seg_to_nvmm[] tables.

- Compress it. This divides its size by two.

- Rename some of its fields, to better match the x86 spec. Also, take S
out of Type, this was a NetBSD-ism that was likely confusing to other
people.
 1.6 23-Feb-2019  maxv Install the x86 RESET state at VCPU creation time, for convenience, so
that the libnvmm users can expect a functional VCPU right away.
 1.5 14-Feb-2019  maxv Harmonize the handling of the CPL between AMD and Intel.

AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET
instructions do not force SS.DPL to predefined values. On Intel they do,
so the CPL on Intel is just the guest's SS.DPL value.

Even though technically possible on AMD, there is no sane reason for a
guest kernel to set a non-three SS.DPL, doing that would mess up several
common segmentation practices and wouldn't be compatible with Intel.

So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL.
Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually
increases performance on AMD: to detect interrupt windows the virtualizer
has to modify some fields of misc[], and because CPL was there, we had to
flush the SEG set of the VMCB cache. Now there is no flush necessary.

While here remove the CPL check for XSETBV on Intel, contrary to AMD
Intel checks the CPL before the intercept, so if we receive an XSETBV
VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the
way my check was wrong in the first place, it was reading SS.RPL instead
of SS.DPL.
 1.4 13-Feb-2019  maxv Reorder the GPRs to match the CPU encoding, simplifies things on Intel.
 1.3 06-Jan-2019  maxv Improvements and fixes in NVMM.

Kernel driver:

* Don't take an extra (unneeded) reference to the UAO.

* Provide npc for HLT. I'm not really happy with it right now, will
likely be revisited.

* Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide
them in the exitstate too.

* Don't take the TPR into account when processing INTs. The virtualizer
can do that itself (Qemu already does).

* Provide a hypervisor signature in CPUID, and hide SVM.

* Ignore certain MSRs. One special case is MSR_NB_CFG in which we set
NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.

* If the LWP has pending signals or softints, leave, rather than waiting
for a rescheduling to happen later. This reduces interrupt processing
time in the guest (Qemu sends a signal to the thread, and now we leave
right away). This could be improved even more by sending an actual IPI
to the CPU, but I'll see later.

Libnvmm:

* Fix the MMU translation of large pages, we need to add the lower bits
too.

* Change the IO and Mem structures to take a pointer rather than a
static array. This provides more flexibility.

* Batch together the str+rep IO transactions. We do one big memory
read/write, and then send the IO commands to the hypervisor all at
once. This considerably increases performance.

* Decode MOVZX.

With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0
in a VM with multiple VCPUs, connect to the network, etc.
 1.2 25-Nov-2018  maxv branches: 1.2.2;
Add RFLAGS in the exitstate.
 1.1 07-Nov-2018  maxv Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that
provides support for hardware-accelerated virtualization on NetBSD.

It is made of an MI frontend, to which MD backends can be plugged. One
MD backend is implemented, x86-SVM, for x86 AMD CPUs.

We install

/usr/include/dev/nvmm/nvmm.h
/usr/include/dev/nvmm/nvmm_ioctl.h
/usr/include/dev/nvmm/{arch}/nvmm_{arch}.h

And the kernel module. For now, the only architecture where we do that
is amd64 (arch=x86).

NVMM is not enabled by default in amd64-GENERIC, but is instead easily
modloadable.

Sent to tech-kern@ a month ago. Validated with kASan, and optimized
with tprof.
 1.2.2.3 18-Jan-2019  pgoyette Synch with HEAD
 1.2.2.2 26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.2.2.1 25-Nov-2018  pgoyette file nvmm_x86.h was added on branch pgoyette-compat on 2018-11-26 01:52:32 +0000
 1.15.4.2 26-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1058):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.19
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.71
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.11
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.13
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.14

Improve the CPUID emulation:
- Hide SGX*, PKU, WAITPKG, and SKINIT, because they are not supported.
- Hide HLE and RTM, part of TSX. Because TSX is just too buggy and we
cannot guarantee that it remains enabled in the guest (if for example
the host disables TSX while the guest is running). Nobody wants this
crap anyway, so bye-bye.
- Advertise FSREP_MOV, because no reason to hide it.

Hide OSPKE. NFC since the host never uses PKU, but still.

Improve the CPUID emulation on nvmm-intel:
- Limit the highest extended leaf.
- Limit 0x00000007 to ECX=0, for future-proofness.

nvmm-x86-svm: improve the CPUID emulation

Limit the hypervisor range, and properly handle each basic leaf until 0xD.

nvmm-x86: advertise the SERIALIZE instruction, available on future CPUs

nvmm-x86: improve the CPUID emulation
- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.15.4.1 10-Nov-2019  martin Pull up following revision(s) (requested by maxv in ticket #405):

usr.sbin/nvmmctl/nvmmctl.8: revision 1.2
lib/libnvmm/libnvmm.3: revision 1.24
sys/dev/nvmm/nvmm.h: revision 1.11
lib/libnvmm/libnvmm.3: revision 1.25
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16
sys/dev/nvmm/nvmm.h: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17
tests/lib/libnvmm/h_mem_assist.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18
share/mk/bsd.hostprog.mk: revision 1.82
lib/libnvmm/libnvmm.c: revision 1.15
distrib/sets/lists/base/md.amd64: revision 1.281
tests/lib/libnvmm/h_mem_assist.c: revision 1.13
lib/libnvmm/libnvmm.c: revision 1.16
tests/lib/libnvmm/h_mem_assist.c: revision 1.14
lib/libnvmm/libnvmm_x86.c: revision 1.32
lib/libnvmm/libnvmm.c: revision 1.17
tests/lib/libnvmm/h_mem_assist.c: revision 1.15
lib/libnvmm/libnvmm_x86.c: revision 1.33
lib/libnvmm/libnvmm.c: revision 1.18
usr.sbin/nvmmctl/Makefile: revision 1.1
tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7
tests/lib/libnvmm/h_mem_assist.c: revision 1.16
lib/libnvmm/libnvmm_x86.c: revision 1.34
usr.sbin/nvmmctl/Makefile: revision 1.2
tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8
tests/lib/libnvmm/h_mem_assist.c: revision 1.17
sys/dev/nvmm/nvmm_internal.h: revision 1.13
lib/libnvmm/libnvmm_x86.c: revision 1.35
lib/libnvmm/libnvmm_x86.c: revision 1.36
usr.sbin/postinstall/postinstall.in: revision 1.8
lib/libnvmm/libnvmm_x86.c: revision 1.37
lib/libnvmm/libnvmm_x86.c: revision 1.38
lib/libnvmm/libnvmm_x86.c: revision 1.39
usr.sbin/Makefile: revision 1.282
lib/libnvmm/nvmm.h: revision 1.13
lib/libnvmm/nvmm.h: revision 1.14
lib/libnvmm/nvmm.h: revision 1.15
sys/dev/nvmm/nvmm.c: revision 1.23
lib/libnvmm/nvmm.h: revision 1.16
sys/dev/nvmm/nvmm.c: revision 1.24
lib/libnvmm/nvmm.h: revision 1.17
sys/dev/nvmm/nvmm.c: revision 1.25
tests/lib/libnvmm/h_io_assist.c: revision 1.9
etc/MAKEDEV.tmpl: revision 1.209
tests/lib/libnvmm/h_io_assist.c: revision 1.10
tests/lib/libnvmm/h_io_assist.c: revision 1.11
etc/group: revision 1.35
distrib/sets/lists/man/mi: revision 1.1660
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51
sys/dev/nvmm/nvmm_ioctl.h: revision 1.8
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52
sys/dev/nvmm/nvmm_ioctl.h: revision 1.9
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53
usr.sbin/nvmmctl/nvmmctl.c: revision 1.1
lib/libnvmm/libnvmm.3: revision 1.20
distrib/sets/lists/debug/md.amd64: revision 1.106
lib/libnvmm/libnvmm.3: revision 1.21
lib/libnvmm/libnvmm.3: revision 1.22
usr.sbin/nvmmctl/nvmmctl.8: revision 1.1
lib/libnvmm/libnvmm.3: revision 1.23

Fix incorrect parsing: the R/M field uses a special GPR map when the
address size is 16 bits, regardless of the actual operating mode. With
this special map there can be two registers referenced at once, and
also disp16-only.
Implement this special behavior, and add associated tests. While here
simplify a few things.
With this in place, the Windows 95 installer initializes correctly.
Part of PR/54611.
add missing initializer
Implement XCHG, add associated tests, and add comments to explain. With
this in place the Windows 95 installer completes successfuly.
Part of PR/54611.
Improve nvmm_vcpu_dump().
Put back 'default', because llvm apparently doesn't realize that all cases
are covered in the switch.
Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.
- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.
- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().
- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.
- Slightly reorganize the MI/MD definitions, for internal clarity.
- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.
- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t
- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.
- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.
- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.
- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
Three changes in libnvmm:
- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.
- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.
- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.
Update the libnvmm man page:
- Sync the naming with reality.
- Replace "relevant" by "desired" and "virtualizer" by "emulator", closer
to what I meant.
- Add a "VCPU Configuration" section.
- Add a "Machine Ownership" section.
Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.
Use the new PTE naming, and define CR3_FRAME_* separately. No functional
change.
Add a new VCPU conf option, that allows userland to request VMEXITs after a
TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs.
The reason for wanting this option is that certain OSes (like Win10 64bit)
manage interrupt priority in hardware via CR8 directly, and for these OSes,
the emulator may want to sync its internal TPR state on each change.
Add two new fields in cap.arch, to report the conf capabilities. Report TPR
only on Intel for now, not AMD, because I don't have a recent AMD CPU on
which to test.
Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and
fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in
qemu-nvmm.
Add PCID support in the guests. This speeds up most 64bit guests, because
since Meltdown, everybody uses PCID (including NetBSD).
Change the way root_owner works: consider the calling process as root_owner
not if it has root privileges, but if the /dev/nvmm device was opened with
write permissions. Introduce the undocumented nvmm_root_init() function to
achieve that.
The goal is to simplify the logic and have more granularity, eg if we want
a monitoring agent to access VMs but don't want to give this agent real
root access on the system.
A few changes:
- Use smaller types in struct nvmm_capability.
- Use smaller type for nvmm_io.port.
- Switch exitstate to a compacted structure.
Add nram in struct nvmm_ctl_mach_info.
Add nvmmctl, with two commands for now.
Macro tidyness.
Sort SEE ALSO.
should be fork(2), noticed by wiz
Add debug entry for newly introduced nvmmctl utility.
Annotate a covering switch as such to avoid warnings about missing
returns.
Forgot to put nvmmctl in the "nvmm" group.
Add nvmm group.
 1.15.2.3 13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.15.2.2 10-Jun-2019  christos Sync with HEAD
 1.15.2.1 11-May-2019  christos file nvmm_x86.h was added on branch phil-wifi on 2019-06-10 22:07:14 +0000
 1.20.4.1 03-Apr-2021  thorpej Sync with HEAD.
 1.20.2.1 03-Apr-2021  thorpej Sync with HEAD.
 1.90 15-Aug-2025  skrll Remove unnecessary casts.

Same code before and after change.
 1.89 21-Apr-2025  riastradh nvmm/x86: Mark comments that should be synced between vmx/svm.

...so that I don't so easily forget to apply typo fixes in one to the
other.
 1.88 19-Apr-2025  andvar mirror recent typo fixes in nvmm_x86_vmx.c.
 1.87 11-Apr-2025  imil nvmm(4): implement CPUID leaf 0x40000010, VMware compatible TSC and LAPIC
frequency detection. Partially fixes PR kern/59170
 1.86 09-Mar-2025  riastradh nvmm(4): Add comments explaining unknown CPUID behaviour.

Sprinkle some more comments about the CPUID ranges too.

No functional change intended.
 1.85 23-Feb-2023  riastradh branches: 1.85.6;
nvmm: Filter CR4 bits on x86 SVM (AMD).

In particular, prohibit PKE, Protection Key Enable, which requires
some additional management of CPU state by nvmm.
 1.84 20-Aug-2022  riastradh branches: 1.84.4;
x86: Split most of pmap.h into pmap_private.h or vmparam.h.

This way pmap.h only contains the MD definition of the MI pmap(9)
API, which loads of things in the kernel rely on, so changing x86
pmap internals no longer requires recompiling the entire kernel every
time.

Callers needing these internals must now use machine/pmap_private.h.
Note: This is not x86/pmap_private.h because it contains three parts:

1. CPU-specific (different for i386/amd64) definitions used by...

2. common definitions, including Xenisms like xpmap_ptetomach,
further used by...

3. more CPU-specific inlines for pmap_pte_* operations

So {amd64,i386}/pmap_private.h defines 1, includes x86/pmap_private.h
for 2, and then defines 3. Maybe we should split that out into a new
pmap_pte.h to reduce this trouble.

No functional change intended, other than that some .c files must
include machine/pmap_private.h when previously uvm/uvm_pmap.h
polluted the namespace with pmap internals.

Note: This migrates part of i386/pmap.h into i386/vmparam.h --
specifically the parts that are needed for several constants defined
in vmparam.h:

VM_MAXUSER_ADDRESS
VM_MAX_ADDRESS
VM_MAX_KERNEL_ADDRESS
VM_MIN_KERNEL_ADDRESS

Since i386 needs PDP_SIZE in vmparam.h, I added it there on amd64
too, just to keep things parallel.
 1.83 26-Mar-2021  reinoud Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without
signals. This introduces a new kernel and userland NVMM version indicating
this support.

Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request.
 1.82 24-Oct-2020  mgorny branches: 1.82.2; 1.82.4;
Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).
 1.81 08-Sep-2020  maxv nvmm-x86: avoid hogging behavior observed recently

When the FPU code got rewritten in NetBSD, the dependency on IPL_HIGH was
eliminated, and I took _vcpu_guest_fpu_enter() out of the VCPU loop since
there was no need to be in the splhigh window.

Later, the code was switched to use the kernel FPU API, API that works at
IPL_VM, not at IPL_NONE.

These two changes mean that the whole VCPU loop is now executing at IPL_VM,
which is not desired, because it introduces a delay in interrupt processing
on the host in certain cases.

Fix this by putting _vcpu_guest_fpu_enter() back inside the VCPU loop.
 1.80 08-Sep-2020  maxv nvmm: cosmetic changes

- Style.
- Explicitly include ioccom.h.
 1.79 06-Sep-2020  riastradh Fix fallout from previous uvm.h cleanup.

- pmap(9) needs uvm/uvm_extern.h.

- x86/pmap.h is not usable on its own; it is only usable if included
via uvm/uvm_extern.h (-> uvm/uvm_pmap.h -> machine/pmap.h).

- Make nvmm.h and nvmm_internal.h standalone.
 1.78 05-Sep-2020  riastradh Round of uvm.h cleanup.

The poorly named uvm.h is generally supposed to be for uvm-internal
users only.

- Narrow it to files that actually need it -- mostly files that need
to query whether curlwp is the pagedaemon, which should maybe be
exposed by an external header.

- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
by it. We should split up uvm_extern.h but this will serve for now
to reduce the uvm.h dependencies.

- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
UVMHIST(ubchist), since ubchist is declared in uvm.h but the
reference evaporates if UVMHIST is not defined, so we reduce header
file dependencies.

- Make uvm_device.h and uvm_swap.h independently includable while
here.

ok chs@
 1.77 05-Sep-2020  maxv x86: rename PGEX_X -> PGEX_I

To match the x86 specification and the other OSes.
 1.76 05-Sep-2020  maxv nvmm: update copyright headers
 1.75 04-Sep-2020  maxv nvmm-x86-svm: check the SVM revision

Only revision 1 exists, but check it, for future-proofness.
 1.74 26-Aug-2020  maxv nvmm-x86-svm: improve the handling of MSR_EFER

Intercept reads of it as well, just to mask EFER_SVME, which the guest
doesn't need to see.
 1.73 26-Aug-2020  maxv nvmm-x86: improve the handling of RFLAGS.RF

- When injecting certain exceptions, set RF. For us to have an up-to-date
view of RFLAGS, we commit the state before the event.
- When advancing RIP, clear RF.
 1.72 26-Aug-2020  maxv nvmm-x86-svm: don't forget to intercept INVD

INVD executed in the guest can be dangerous for the host, due to CPU
caches being flushed without write-back.
 1.71 22-Aug-2020  maxv nvmm-x86-svm: dedup code
 1.70 20-Aug-2020  maxv nvmm-x86: improve the CPUID emulation

- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.69 18-Aug-2020  maxv nvmm-x86-svm: improve the CPUID emulation

Limit the hypervisor range, and properly handle each basic leaf until 0xD.
 1.68 18-Aug-2020  maxv nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
 1.67 05-Aug-2020  maxv Add new field definitions, and intercept everything, for future-proofness.
 1.66 05-Aug-2020  maxv Use ULL, to make it clear we are unsigned.
 1.65 19-Jul-2020  maxv Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel
itself uses the fpu.
 1.64 19-Jul-2020  maxv The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no
effect. Disable interrupts earlier instead. This prevents a possible race
against such IPIs.
 1.63 03-Jul-2020  maxv Print the backend name when attaching.
 1.62 24-May-2020  maxv Gather the conditions to return from the VCPU loops in nvmm_return_needed(),
and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors:

- When a VM initializes, the many nested page faults that need processing
could cause the calling thread to occupy the CPU too much if we're unlucky
and are only getting repeated nested page faults thousands of times in a
row.

- When the emulator calls nvmm_vcpu_run() and immediately sends a signal to
stop the VCPU, it's better to check signals earlier and leave right away,
rather than doing a round of VCPU run that could increase the time spent
by the emulator waiting for the return.
 1.61 10-May-2020  maxv Respect the convention for the hypervisor information: return the highest
hypervisor leaf in 0x40000000.EAX.
 1.60 09-May-2020  maxv Improve the CPUID emulation of basic leaves:
- Hide DCA and PQM, they cannot be used in guests.
- On Intel, explicitly handle each basic leaf until 0x16.
- On AMD, explicitly handle each basic leaf until 0x0D.
 1.59 30-Apr-2020  maxv When the identification fails, print the reason.
 1.58 22-Mar-2020  ad x86 pmap:

- Give pmap_remove_all() its own version of pmap_remove_ptes() that on native
x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on
'build.sh release' for me.

- pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The
caller waits for the competing operation to finish.

- Bring 'options TLBSTATS' up to date.
 1.57 14-Mar-2020  ad - Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.
 1.56 21-Feb-2020  joerg Explicitly cast pointers to uintptr_t before casting to enums. They are
not necessarily the same size. Don't cast pointers to bool, check for
NULL instead.
 1.55 10-Dec-2019  ad branches: 1.55.2;
pg->phys_addr > VM_PAGE_TO_PHYS(pg)
 1.54 20-Nov-2019  maxv Hide XSAVES-specific stuff and the masked extended states.
 1.53 28-Oct-2019  maxv A few changes:

- Use smaller types in struct nvmm_capability.
- Use smaller type for nvmm_io.port.
- Switch exitstate to a compacted structure.
 1.52 27-Oct-2019  maxv Add a new VCPU conf option, that allows userland to request VMEXITs after a
TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs.

The reason for wanting this option is that certain OSes (like Win10 64bit)
manage interrupt priority in hardware via CR8 directly, and for these OSes,
the emulator may want to sync its internal TPR state on each change.

Add two new fields in cap.arch, to report the conf capabilities. Report TPR
only on Intel for now, not AMD, because I don't have a recent AMD CPU on
which to test.
 1.51 23-Oct-2019  maxv Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
 1.50 12-Oct-2019  maxv Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.
 1.49 04-Oct-2019  maxv Switch to the new PTE naming.
 1.48 04-Oct-2019  maxv Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed
version.
 1.47 04-Oct-2019  maxv Add definitions for RDPRU, MCOMMIT, GMET and VTE.
 1.46 11-May-2019  maxv branches: 1.46.2; 1.46.4;
Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.
 1.45 01-May-2019  maxv Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.
 1.44 29-Apr-2019  maxv Stop taking care of the INT/NMI windows in the kernel, the emulator is
supposed to do that itself.
 1.43 28-Apr-2019  maxv Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.
 1.42 27-Apr-2019  maxv Reorder the NVMM headers, to make a clear(er) distinction between MI and
MD. Also use #defines for the exit reasons rather than an union. No ABI
change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
 1.41 27-Apr-2019  maxv If guest events were being processed when a #VMEXIT occurred, reschedule
the events rather than dismissing them. This can happen for instance when a
guest wants to process an exception and an #NPF occurs on the guest IDT. In
practice it occurs only when the host swapped out specific guest pages.
 1.40 24-Apr-2019  maxv Provide the hardware error code for NVMM_EXIT_INVALID, useful when
debugging.
 1.39 20-Apr-2019  maxv Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the
guest relies only on ECX to initialize/copy the FPU state (like NetBSD
does), spurious #GPs can be encountered because the bitmap is clobbered.
 1.38 07-Apr-2019  maxv Invert the filtering priority: now the kernel-managed cpuid leaves are
overwritable by the virtualizer. This is useful to virtualizers that want
to 100% control every leaf.
 1.37 06-Apr-2019  maxv Replace the misc[] state by a new compressed nvmm_x64_state_intr structure,
which describes the interruptibility state of the guest.

Add evt_pending, read-only, that allows the virtualizer to know if an event
is pending.
 1.36 03-Apr-2019  maxv Add MSR_TSC.
 1.35 21-Mar-2019  maxv Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.
 1.34 14-Mar-2019  maxv Reduce the mask of the VTPR, only the first four bits matter.
 1.33 03-Mar-2019  maxv Choose which CPUID bits to allow, rather than which bits to disallow. This
is clearer, and also forward compatible with future CPUs.

While here be more consistent when allowing the bits, and sync between
nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest
state we provide is only x86+SSE. Fixes a CentOS panic when booting on
NVMM, reported by Jared McNeill, thanks.
 1.32 26-Feb-2019  maxv Change the layout of the SEG state:

- Reorder it, to match the CPU encoding. This is the universal order,
also used by Qemu. Drop the seg_to_nvmm[] tables.

- Compress it. This divides its size by two.

- Rename some of its fields, to better match the x86 spec. Also, take S
out of Type, this was a NetBSD-ism that was likely confusing to other
people.
 1.31 23-Feb-2019  maxv Install the x86 RESET state at VCPU creation time, for convenience, so
that the libnvmm users can expect a functional VCPU right away.
 1.30 23-Feb-2019  maxv Reorder the functions, and constify setstate. No functional change.
 1.29 21-Feb-2019  maxv Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU
mutexes which can sleep, but their context does not allow it.

Rewrite the TLB handling code to fix that. It becomes a bit complex. In
short, we use a per-VM generation number, which we increase on each TLB
flush, before sending a broadcast IPI to everybody. The IPIs cause a
#VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen
with a per-VCPU copy, and apply the flushes as neededi lazily.

The behavior differs between AMD and Intel; in short, on Intel we don't
flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now,
we need to maintain a kcpuset to know which VCPU's hTLBs are active on
which hCPU. This creates some redundancy on Intel, ie there are cases
where we flush the hTLB several times unnecessarily; but hTLB flushes are
very rare, so there is no real performance regression.

The thing is lock-less and non-blocking, so it solves our problem.
 1.28 21-Feb-2019  maxv Clarify the gTLB code a little.
 1.27 18-Feb-2019  maxv Ah, finally found you. Fix scheduling bug in NVMM.

When processing guest page faults, we were calling uvm_fault with
preemption disabled. The thing is, uvm_fault may block, and if it does,
we land in sleepq_block which calls mi_switch; so we get switched away
while we explicitly asked not to be. From then on things could go really
wrong.

Fix that by processing such faults in MI, where we have preemption enabled
and are allowed to block.

A KASSERT in sleepq_block (or before) would have helped.
 1.26 16-Feb-2019  maxv Ah no, adapt previous, on AMD RAX is in the VMCB.
 1.25 16-Feb-2019  maxv Improve the FPU detection: hide XSAVES because we're not allowing it, and
don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE.

With these changes in place, I can boot Windows 10 on NVMM.
 1.24 15-Feb-2019  maxv Initialize the guest TSC to zero at VCPU creation time, and handle guest
writes to MSR_TSC at run time.

This is imprecise, because the hardware does not provide a way to preserve
the TSC during #VMEXITs, but that's fine enough.
 1.23 14-Feb-2019  maxv Harmonize the handling of the CPL between AMD and Intel.

AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET
instructions do not force SS.DPL to predefined values. On Intel they do,
so the CPL on Intel is just the guest's SS.DPL value.

Even though technically possible on AMD, there is no sane reason for a
guest kernel to set a non-three SS.DPL, doing that would mess up several
common segmentation practices and wouldn't be compatible with Intel.

So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL.
Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually
increases performance on AMD: to detect interrupt windows the virtualizer
has to modify some fields of misc[], and because CPL was there, we had to
flush the SEG set of the VMCB cache. Now there is no flush necessary.

While here remove the CPL check for XSETBV on Intel, contrary to AMD
Intel checks the CPL before the intercept, so if we receive an XSETBV
VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the
way my check was wrong in the first place, it was reading SS.RPL instead
of SS.DPL.
 1.22 13-Feb-2019  maxv Drop support for software interrupts. I had initially added that to cover
the three event types available on AMD, but Intel has seven of them, all
with weird and twisted meanings, and they require extra parameters.

Software interrupts should not be used anyway.
 1.21 13-Feb-2019  maxv Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather
than saving them on each VMENTRY, save them only once, at VCPU creation
time.
 1.20 12-Feb-2019  maxv Optimize: the hardware does not clear the TLB flush command after a
VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest
TLB. While here also fix the processing of EFER-induced flushes, they
shouldn't be delayed.
 1.19 04-Feb-2019  maxv Improvements:

- Guest reads/writes to PAT land in gPAT, so no need to emulate them.

- When emulating EFER, don't advance the RIP if a fault occurs, and don't
forget to flush the VMCB cache accordingly.
 1.18 26-Jan-2019  maxv Remove nvmm_exit_memory.npc, useless.
 1.17 24-Jan-2019  maxv Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu
state" which occurs after the instruction executed, rather than an
instruction intercept which occurs before. Disable the shadow and the intr
window in kernel mode, and advance the RIP, so that the virtualizer doesn't
have to do it itself. This saves two syscalls and one VMCB cache flush.

Provide npc for other instruction intercepts, in case someone is
interested.
 1.16 20-Jan-2019  maxv Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.
 1.15 13-Jan-2019  maxv Reset DR7 before loading DR0-3, to prevent a fault if the host process
has dbregs enabled.
 1.14 10-Jan-2019  maxv Optimize:

* Don't save/restore the host CR2, we don't care because we're not in a
#PF context (and preemption switches already handle CR2 safely).

* Don't save/restore the host FS and GS, just reset them to zero after
VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't
apply to FS and GS.

* Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost
of saving/restoring them when there's no reason to leave the loop.
 1.13 08-Jan-2019  maxv Optimize: don't keep a full copy of the guest state, rather take only what
is needed. This avoids expensive memcpy's.

Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
 1.12 07-Jan-2019  maxv Optimize: cache the guest state entirely in the VMCB-cache, flush it on a
state-by-state basis when needed.
 1.11 06-Jan-2019  maxv Add more VMCB fields. Also remove debugging code I mistakenly committed
in the previous revision. No functional change.
 1.10 06-Jan-2019  maxv Improvements and fixes in NVMM.

Kernel driver:

* Don't take an extra (unneeded) reference to the UAO.

* Provide npc for HLT. I'm not really happy with it right now, will
likely be revisited.

* Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide
them in the exitstate too.

* Don't take the TPR into account when processing INTs. The virtualizer
can do that itself (Qemu already does).

* Provide a hypervisor signature in CPUID, and hide SVM.

* Ignore certain MSRs. One special case is MSR_NB_CFG in which we set
NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.

* If the LWP has pending signals or softints, leave, rather than waiting
for a rescheduling to happen later. This reduces interrupt processing
time in the guest (Qemu sends a signal to the thread, and now we leave
right away). This could be improved even more by sending an actual IPI
to the CPU, but I'll see later.

Libnvmm:

* Fix the MMU translation of large pages, we need to add the lower bits
too.

* Change the IO and Mem structures to take a pointer rather than a
static array. This provides more flexibility.

* Batch together the str+rep IO transactions. We do one big memory
read/write, and then send the IO commands to the hypervisor all at
once. This considerably increases performance.

* Decode MOVZX.

With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0
in a VM with multiple VCPUs, connect to the network, etc.
 1.9 03-Jan-2019  maxv Fix another gross copy-pasto.
 1.8 02-Jan-2019  maxv When there's no DecodeAssist in hardware, decode manually in software. This
is needed on certain AMD CPUs (like mine): the segment base of OUTS can be
overridden, and it is wrong to just assume DS.

We fetch the instruction and look at the prefixes if any to determine the
correct segment.
 1.7 13-Dec-2018  maxv Don't forget to advance the RIP after an XSETBV emulation.
 1.6 25-Nov-2018  maxv branches: 1.6.2;
Add RFLAGS in the exitstate.
 1.5 22-Nov-2018  maxv Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
 1.4 19-Nov-2018  maxv Rename one constant, for clarity.
 1.3 14-Nov-2018  maxv Take RAX from the VMCB and not the VCPU state, the latter is not
synchronized and contains old values.
 1.2 10-Nov-2018  maxv Remove unused cpu_msr.h includes.
 1.1 07-Nov-2018  maxv Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that
provides support for hardware-accelerated virtualization on NetBSD.

It is made of an MI frontend, to which MD backends can be plugged. One
MD backend is implemented, x86-SVM, for x86 AMD CPUs.

We install

/usr/include/dev/nvmm/nvmm.h
/usr/include/dev/nvmm/nvmm_ioctl.h
/usr/include/dev/nvmm/{arch}/nvmm_{arch}.h

And the kernel module. For now, the only architecture where we do that
is amd64 (arch=x86).

NVMM is not enabled by default in amd64-GENERIC, but is instead easily
modloadable.

Sent to tech-kern@ a month ago. Validated with kASan, and optimized
with tprof.
 1.6.2.5 26-Jan-2019  pgoyette Sync with HEAD
 1.6.2.4 18-Jan-2019  pgoyette Synch with HEAD
 1.6.2.3 26-Dec-2018  pgoyette Sync with HEAD, resolve a few conflicts
 1.6.2.2 26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.6.2.1 25-Nov-2018  pgoyette file nvmm_x86_svm.c was added on branch pgoyette-compat on 2018-11-26 01:52:32 +0000
 1.46.4.14 25-Jul-2023  martin Pull up following revision(s) (requested by riastradh in ticket #1666):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.85

nvmm: Filter CR4 bits on x86 SVM (AMD).

In particular, prohibit PKE, Protection Key Enable, which requires
some additional management of CPU state by nvmm.
 1.46.4.13 13-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1078):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.73
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.73
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.74

nvmm-x86-vmx: improve the handling of CR4
- Filter out certain features we don't want the guest to enable. This is
for general correctness, and future-proofness.
- Flush the guest TLB when certain flags change.

nvmm-x86: improve the handling of RFLAGS.RF
- When injecting certain exceptions, set RF. For us to have an up-to-date
view of RFLAGS, we commit the state before the event.
- When advancing RIP, clear RF.
 1.46.4.12 13-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1077):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.74
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.16

Improve emulation of MSR_IA32_ARCH_CAPABILITIES: publish only the *_NO
bits. Initially they were the only ones there, but Intel then added other
bits we aren't interested in, and they must be filtered out.

nvmm-x86-svm: improve the handling of MSR_EFER

Intercept reads of it as well, just to mask EFER_SVME, which the guest
doesn't need to see.

nvmm-x86: improve the CPUID emulation

- Mask DTES64, DS_CPL, CID, SDBG, xTPR, PN.
- B10, B20 and IA64 do not exist, so just remove them.
 1.46.4.11 04-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1076):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.75
sys/arch/x86/include/specialreg.h: revision 1.172
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.72

nvmm-x86-vmx: fix detection of the BIOS lock

If it's locked, ensure it's locked with VMX enabled. If it's not locked,
then lock it ourselves with VMX enabled.

Should fix NetBSD PR/55596.

-

Add a few more CPUID flags.

-

nvmm-x86-svm: check the SVM revision
Only revision 1 exists, but check it, for future-proofness.
 1.46.4.10 29-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1068):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.71
sys/dev/nvmm/nvmm.c: revision 1.34
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.72
sys/dev/nvmm/nvmm.c: revision 1.35
sys/dev/nvmm/nvmm.c: revision 1.36
sys/dev/nvmm/x86/nvmm_x86_svmfunc.S: revision 1.5
sys/dev/nvmm/nvmm.c: revision 1.37
sys/dev/nvmm/x86/nvmm_x86_vmxfunc.S: revision 1.5
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.15
sys/dev/nvmm/nvmm_ioctl.h: revision 1.10

Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.

nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes

nvmm: localify a variable that doesn't need to be global

nvmm: use relaxed atomics to read nmachines

nvmm-x86-svm: dedup code

nvmm-x86: hide more CPUID flags, mostly related to perf monitors

nvmm: misc improvements
- use mach->ncpus to get the number of vcpus, now that we have it
- don't forget to decrement mach->ncpus when a machine gets killed
- add more __predict_false()

nvmm-x86-svm: don't forget to intercept INVD
INVD executed in the guest can be dangerous for the host, due to CPU
caches being flushed without write-back.

nvmm: slightly clarify

nvmm: explicitly include atomic.h
 1.46.4.9 26-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1058):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.19
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.71
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.11
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.13
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.14

Improve the CPUID emulation:
- Hide SGX*, PKU, WAITPKG, and SKINIT, because they are not supported.
- Hide HLE and RTM, part of TSX. Because TSX is just too buggy and we
cannot guarantee that it remains enabled in the guest (if for example
the host disables TSX while the guest is running). Nobody wants this
crap anyway, so bye-bye.
- Advertise FSREP_MOV, because no reason to hide it.

Hide OSPKE. NFC since the host never uses PKU, but still.

Improve the CPUID emulation on nvmm-intel:
- Limit the highest extended leaf.
- Limit 0x00000007 to ECX=0, for future-proofness.

nvmm-x86-svm: improve the CPUID emulation

Limit the hypervisor range, and properly handle each basic leaf until 0xD.

nvmm-x86: advertise the SERIALIZE instruction, available on future CPUs

nvmm-x86: improve the CPUID emulation
- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.46.4.8 18-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1055):

sys/dev/nvmm/nvmm.h: revision 1.13
sys/dev/nvmm/nvmm.h: revision 1.14
sys/dev/nvmm/nvmm.c: revision 1.33
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.67
sys/dev/nvmm/nvmm_internal.h: revision 1.17
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.67
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.10

Put the few x86-specific structures under #ifdef __x86_64__, for clarity.

Make it easier to understand what's going on, no functional change.

Add new field definitions.

Add new field definitions, and intercept everything, for future-proofness.

Add CTASSERT.
 1.46.4.7 05-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1041):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.66
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.50
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.66
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.46
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.49
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.55
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.56

pg->phys_addr > VM_PAGE_TO_PHYS(pg)

Explicitly cast pointers to uintptr_t before casting to enums. They are
not necessarily the same size. Don't cast pointers to bool, check for
NULL instead.

vmx_vmptrst(): only used when DIAGNOSTIC

Simplify, remove unnecessary #ifdef DIAGNOSTIC around KASSERTs.

Use ULL, to make it clear we are unsigned.
 1.46.4.6 02-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1032):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.60 (patch)
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.61 (patch)
sys/dev/nvmm/nvmm.c: revision 1.30
sys/dev/nvmm/nvmm.c: revision 1.31
sys/dev/nvmm/nvmm.c: revision 1.32
sys/dev/nvmm/nvmm_internal.h: revision 1.15
sys/dev/nvmm/nvmm_internal.h: revision 1.16
sys/dev/nvmm/files.nvmm: revision 1.3
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.62 (patch)
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.63 (patch)
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.59 (patch)
sys/modules/nvmm/nvmm.ioconf: revision 1.2

Gather the conditions to return from the VCPU loops in nvmm_return_needed(),
and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors:

- When a VM initializes, the many nested page faults that need processing
could cause the calling thread to occupy the CPU too much if we're unlucky
and are only getting repeated nested page faults thousands of times in a
row.

- When the emulator calls nvmm_vcpu_run() and immediately sends a signal to
stop the VCPU, it's better to check signals earlier and leave right away,
rather than doing a round of VCPU run that could increase the time spent
by the emulator waiting for the return.

style

Register NVMM as an actual pseudo-device. Without PMF handler, to
explicitly disallow ACPI suspend if NVMM is running.

Should fix PR/55406.

Print the backend name when attaching.
 1.46.4.5 21-May-2020  martin Pull up following revision(s) (requested by maxv in ticket #919):

sys/dev/nvmm/x86/nvmm_x86.c: revision 1.9
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.60
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.61
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.56
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.57
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.58
sys/dev/nvmm/nvmm.c: revision 1.29

Improve the CPUID emulation of basic leaves:
- Hide DCA and PQM, they cannot be used in guests.
- On Intel, explicitly handle each basic leaf until 0x16.
- On AMD, explicitly handle each basic leaf until 0x0D.

Respect the convention for the hypervisor information: return the highest
hypervisor leaf in 0x40000000.EAX.

Improve the CPUID emulation on nvmm-intel: limit the highest basic and
hypervisor leaves.

Complete rev1.26: reset nvmm_impl to NULL in nvmm_fini().
 1.46.4.4 13-May-2020  martin Pull up following revision(s) (requested by maxv in ticket #898):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.59
sys/dev/nvmm/nvmm_internal.h: revision 1.14
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.53
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.54
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.55
sys/dev/nvmm/nvmm.c: revision 1.27
sys/dev/nvmm/nvmm.c: revision 1.28

When the identification fails, print the reason.

If we were processing a software int/excp, and got a VMEXIT in the middle,
we must also reflect the instruction length, otherwise the next VMENTER
fails and Qemu shuts the guest down.

On Intel CPUs, CPUID leaf 0xB, too, provides topology information, so
filter it correctly, to avoid inconsistencies if the host has SMT.

This fixes HaikuOS which fetches SMT information from there and would
panic because of the inconsistencies.
 1.46.4.3 25-Nov-2019  martin Pull up following revision(s) (requested by maxv in ticket #475):

tests/lib/libnvmm/h_mem_assist.c: revision 1.18
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.45
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.54

Hide XSAVES-specific stuff and the masked extended states.

Several improvements. In particular, reduce CS.limit, because Intel CPUs
perform strict sanity checks, and the previous (too high) limit caused the
VM entry to fail.
 1.46.4.2 10-Nov-2019  martin Pull up following revision(s) (requested by maxv in ticket #405):

usr.sbin/nvmmctl/nvmmctl.8: revision 1.2
lib/libnvmm/libnvmm.3: revision 1.24
sys/dev/nvmm/nvmm.h: revision 1.11
lib/libnvmm/libnvmm.3: revision 1.25
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16
sys/dev/nvmm/nvmm.h: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17
tests/lib/libnvmm/h_mem_assist.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18
share/mk/bsd.hostprog.mk: revision 1.82
lib/libnvmm/libnvmm.c: revision 1.15
distrib/sets/lists/base/md.amd64: revision 1.281
tests/lib/libnvmm/h_mem_assist.c: revision 1.13
lib/libnvmm/libnvmm.c: revision 1.16
tests/lib/libnvmm/h_mem_assist.c: revision 1.14
lib/libnvmm/libnvmm_x86.c: revision 1.32
lib/libnvmm/libnvmm.c: revision 1.17
tests/lib/libnvmm/h_mem_assist.c: revision 1.15
lib/libnvmm/libnvmm_x86.c: revision 1.33
lib/libnvmm/libnvmm.c: revision 1.18
usr.sbin/nvmmctl/Makefile: revision 1.1
tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7
tests/lib/libnvmm/h_mem_assist.c: revision 1.16
lib/libnvmm/libnvmm_x86.c: revision 1.34
usr.sbin/nvmmctl/Makefile: revision 1.2
tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8
tests/lib/libnvmm/h_mem_assist.c: revision 1.17
sys/dev/nvmm/nvmm_internal.h: revision 1.13
lib/libnvmm/libnvmm_x86.c: revision 1.35
lib/libnvmm/libnvmm_x86.c: revision 1.36
usr.sbin/postinstall/postinstall.in: revision 1.8
lib/libnvmm/libnvmm_x86.c: revision 1.37
lib/libnvmm/libnvmm_x86.c: revision 1.38
lib/libnvmm/libnvmm_x86.c: revision 1.39
usr.sbin/Makefile: revision 1.282
lib/libnvmm/nvmm.h: revision 1.13
lib/libnvmm/nvmm.h: revision 1.14
lib/libnvmm/nvmm.h: revision 1.15
sys/dev/nvmm/nvmm.c: revision 1.23
lib/libnvmm/nvmm.h: revision 1.16
sys/dev/nvmm/nvmm.c: revision 1.24
lib/libnvmm/nvmm.h: revision 1.17
sys/dev/nvmm/nvmm.c: revision 1.25
tests/lib/libnvmm/h_io_assist.c: revision 1.9
etc/MAKEDEV.tmpl: revision 1.209
tests/lib/libnvmm/h_io_assist.c: revision 1.10
tests/lib/libnvmm/h_io_assist.c: revision 1.11
etc/group: revision 1.35
distrib/sets/lists/man/mi: revision 1.1660
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51
sys/dev/nvmm/nvmm_ioctl.h: revision 1.8
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52
sys/dev/nvmm/nvmm_ioctl.h: revision 1.9
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53
usr.sbin/nvmmctl/nvmmctl.c: revision 1.1
lib/libnvmm/libnvmm.3: revision 1.20
distrib/sets/lists/debug/md.amd64: revision 1.106
lib/libnvmm/libnvmm.3: revision 1.21
lib/libnvmm/libnvmm.3: revision 1.22
usr.sbin/nvmmctl/nvmmctl.8: revision 1.1
lib/libnvmm/libnvmm.3: revision 1.23

Fix incorrect parsing: the R/M field uses a special GPR map when the
address size is 16 bits, regardless of the actual operating mode. With
this special map there can be two registers referenced at once, and
also disp16-only.
Implement this special behavior, and add associated tests. While here
simplify a few things.
With this in place, the Windows 95 installer initializes correctly.
Part of PR/54611.
add missing initializer
Implement XCHG, add associated tests, and add comments to explain. With
this in place the Windows 95 installer completes successfuly.
Part of PR/54611.
Improve nvmm_vcpu_dump().
Put back 'default', because llvm apparently doesn't realize that all cases
are covered in the switch.
Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.
- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.
- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().
- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.
- Slightly reorganize the MI/MD definitions, for internal clarity.
- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.
- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t
- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.
- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.
- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.
- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
Three changes in libnvmm:
- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.
- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.
- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.
Update the libnvmm man page:
- Sync the naming with reality.
- Replace "relevant" by "desired" and "virtualizer" by "emulator", closer
to what I meant.
- Add a "VCPU Configuration" section.
- Add a "Machine Ownership" section.
Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.
Use the new PTE naming, and define CR3_FRAME_* separately. No functional
change.
Add a new VCPU conf option, that allows userland to request VMEXITs after a
TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs.
The reason for wanting this option is that certain OSes (like Win10 64bit)
manage interrupt priority in hardware via CR8 directly, and for these OSes,
the emulator may want to sync its internal TPR state on each change.
Add two new fields in cap.arch, to report the conf capabilities. Report TPR
only on Intel for now, not AMD, because I don't have a recent AMD CPU on
which to test.
Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and
fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in
qemu-nvmm.
Add PCID support in the guests. This speeds up most 64bit guests, because
since Meltdown, everybody uses PCID (including NetBSD).
Change the way root_owner works: consider the calling process as root_owner
not if it has root privileges, but if the /dev/nvmm device was opened with
write permissions. Introduce the undocumented nvmm_root_init() function to
achieve that.
The goal is to simplify the logic and have more granularity, eg if we want
a monitoring agent to access VMs but don't want to give this agent real
root access on the system.
A few changes:
- Use smaller types in struct nvmm_capability.
- Use smaller type for nvmm_io.port.
- Switch exitstate to a compacted structure.
Add nram in struct nvmm_ctl_mach_info.
Add nvmmctl, with two commands for now.
Macro tidyness.
Sort SEE ALSO.
should be fork(2), noticed by wiz
Add debug entry for newly introduced nvmmctl utility.
Annotate a covering switch as such to avoid warnings about missing
returns.
Forgot to put nvmmctl in the "nvmm" group.
Add nvmm group.
 1.46.4.1 06-Oct-2019  martin Pull up following revision(s) (requested by maxv in ticket #287):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.38
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.47
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.48
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.49

Add definitions for RDPRU, MCOMMIT, GMET and VTE.

Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed
version.

Switch to the new PTE naming.
 1.46.2.3 13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.46.2.2 10-Jun-2019  christos Sync with HEAD
 1.46.2.1 11-May-2019  christos file nvmm_x86_svm.c was added on branch phil-wifi on 2019-06-10 22:07:14 +0000
 1.55.2.1 29-Feb-2020  ad Sync with head.
 1.82.4.1 03-Apr-2021  thorpej Sync with HEAD.
 1.82.2.1 03-Apr-2021  thorpej Sync with HEAD.
 1.84.4.1 25-Jul-2023  martin Pull up following revision(s) (requested by riastradh in ticket #246):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.85

nvmm: Filter CR4 bits on x86 SVM (AMD).

In particular, prohibit PKE, Protection Key Enable, which requires
some additional management of CPU state by nvmm.
 1.85.6.1 02-Aug-2025  perseant Sync with HEAD
 1.6 05-Sep-2020  maxv nvmm: update copyright headers
 1.5 11-Aug-2020  maxv Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.
 1.4 19-Jul-2020  maxv The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no
effect. Disable interrupts earlier instead. This prevents a possible race
against such IPIs.
 1.3 24-Apr-2019  maxv branches: 1.3.2; 1.3.4;
Match the structure order, for better cache utilization.
 1.2 10-Jan-2019  maxv Optimize:

* Don't save/restore the host CR2, we don't care because we're not in a
#PF context (and preemption switches already handle CR2 safely).

* Don't save/restore the host FS and GS, just reset them to zero after
VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't
apply to FS and GS.

* Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost
of saving/restoring them when there's no reason to leave the loop.
 1.1 07-Nov-2018  maxv branches: 1.1.2;
Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that
provides support for hardware-accelerated virtualization on NetBSD.

It is made of an MI frontend, to which MD backends can be plugged. One
MD backend is implemented, x86-SVM, for x86 AMD CPUs.

We install

/usr/include/dev/nvmm/nvmm.h
/usr/include/dev/nvmm/nvmm_ioctl.h
/usr/include/dev/nvmm/{arch}/nvmm_{arch}.h

And the kernel module. For now, the only architecture where we do that
is amd64 (arch=x86).

NVMM is not enabled by default in amd64-GENERIC, but is instead easily
modloadable.

Sent to tech-kern@ a month ago. Validated with kASan, and optimized
with tprof.
 1.1.2.3 18-Jan-2019  pgoyette Synch with HEAD
 1.1.2.2 26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.1.2.1 07-Nov-2018  pgoyette file nvmm_x86_svmfunc.S was added on branch pgoyette-compat on 2018-11-26 01:52:32 +0000
 1.3.4.1 29-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1068):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.71
sys/dev/nvmm/nvmm.c: revision 1.34
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.72
sys/dev/nvmm/nvmm.c: revision 1.35
sys/dev/nvmm/nvmm.c: revision 1.36
sys/dev/nvmm/x86/nvmm_x86_svmfunc.S: revision 1.5
sys/dev/nvmm/nvmm.c: revision 1.37
sys/dev/nvmm/x86/nvmm_x86_vmxfunc.S: revision 1.5
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.15
sys/dev/nvmm/nvmm_ioctl.h: revision 1.10

Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.

nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes

nvmm: localify a variable that doesn't need to be global

nvmm: use relaxed atomics to read nmachines

nvmm-x86-svm: dedup code

nvmm-x86: hide more CPUID flags, mostly related to perf monitors

nvmm: misc improvements
- use mach->ncpus to get the number of vcpus, now that we have it
- don't forget to decrement mach->ncpus when a machine gets killed
- add more __predict_false()

nvmm-x86-svm: don't forget to intercept INVD
INVD executed in the guest can be dangerous for the host, due to CPU
caches being flushed without write-back.

nvmm: slightly clarify

nvmm: explicitly include atomic.h
 1.3.2.2 10-Jun-2019  christos Sync with HEAD
 1.3.2.1 24-Apr-2019  christos file nvmm_x86_svmfunc.S was added on branch phil-wifi on 2019-06-10 22:07:14 +0000
 1.91 15-Aug-2025  skrll Remove unnecessary casts.

Same code before and after change.
 1.90 21-Apr-2025  riastradh nvmm/x86: Mark comments that should be synced between vmx/svm.

...so that I don't so easily forget to apply typo fixes in one to the
other.
 1.89 13-Apr-2025  riastradh nvmm(4): Fix typos in comments about CPUID.

No functional change intended.
 1.88 11-Apr-2025  imil nvmm(4): implement CPUID leaf 0x40000010, VMware compatible TSC and LAPIC
frequency detection. Partially fixes PR kern/59170
 1.87 09-Mar-2025  riastradh nvmm(4): Add comments explaining unknown CPUID behaviour.

Sprinkle some more comments about the CPUID ranges too.

No functional change intended.
 1.86 06-Nov-2023  rin branches: 1.86.6;
nvmm_x86_vmx: vmx_vmptrst: Sprinkle __diagused to fix clang !DIAGNOSTIC build
 1.85 13-Sep-2022  riastradh branches: 1.85.4;
nvmm(4): Add suspend/resume support.

New MD nvmm_impl callbacks:

- .suspend_interrupt forces all VMs on all physical CPUs to exit.
- .vcpu_suspend suspends an individual vCPU on a machine.
- .machine_suspend suspends an individual machine.
- .suspend suspends the whole system.
- .resume resumes the whole system.
- .machine_resume resumes an individual machine.
- .vcpu_resume resumes an indidivudal vCPU on a machine.

Suspending nvmm:

1. causes new VM operations (ioctl and close) to block until resumed,
2. uses .suspend_interrupt to interrupt any concurrent and force them
to return early, and then
3. uses the various suspend callbacks to suspend all vCPUs, machines,
and the whole system -- all vCPUs before the machine they're on,
and all machines before the system.

Resuming nvmm does the reverse of (3) -- resume system, resume each
machine and then the vCPUs on that machine -- and then unblocks
operations.

Implemented only for x86-vmx for now:

- suspend_interrupt triggers a TLB IPI to cause VM exits;
- vcpu_suspend issues VMCLEAR to force any in-CPU state to be written
to memory;
- machine_suspend does nothing;
- suspend does VMXOFF on all CPUs;
- resume does VMXON on all CPUs;
- machine_resume does nothing; and
- vcpu_resume just marks each vCPU as valid but inactive so
subsequent use will clear it and load it with vmptrld.

x86-svm left as an exercise for the reader.
 1.84 20-Aug-2022  riastradh x86: Split most of pmap.h into pmap_private.h or vmparam.h.

This way pmap.h only contains the MD definition of the MI pmap(9)
API, which loads of things in the kernel rely on, so changing x86
pmap internals no longer requires recompiling the entire kernel every
time.

Callers needing these internals must now use machine/pmap_private.h.
Note: This is not x86/pmap_private.h because it contains three parts:

1. CPU-specific (different for i386/amd64) definitions used by...

2. common definitions, including Xenisms like xpmap_ptetomach,
further used by...

3. more CPU-specific inlines for pmap_pte_* operations

So {amd64,i386}/pmap_private.h defines 1, includes x86/pmap_private.h
for 2, and then defines 3. Maybe we should split that out into a new
pmap_pte.h to reduce this trouble.

No functional change intended, other than that some .c files must
include machine/pmap_private.h when previously uvm/uvm_pmap.h
polluted the namespace with pmap internals.

Note: This migrates part of i386/pmap.h into i386/vmparam.h --
specifically the parts that are needed for several constants defined
in vmparam.h:

VM_MAXUSER_ADDRESS
VM_MAX_ADDRESS
VM_MAX_KERNEL_ADDRESS
VM_MIN_KERNEL_ADDRESS

Since i386 needs PDP_SIZE in vmparam.h, I added it there on amd64
too, just to keep things parallel.
 1.83 13-May-2022  tnn nvmm_x86_vmx.c: remove an #ifdef DIAGNOSTIC, it is wrong since r1.66
 1.82 26-Mar-2021  reinoud Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without
signals. This introduces a new kernel and userland NVMM version indicating
this support.

Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request.
 1.81 24-Oct-2020  mgorny branches: 1.81.2; 1.81.4;
Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).
 1.80 08-Sep-2020  maxv nvmm-x86: avoid hogging behavior observed recently

When the FPU code got rewritten in NetBSD, the dependency on IPL_HIGH was
eliminated, and I took _vcpu_guest_fpu_enter() out of the VCPU loop since
there was no need to be in the splhigh window.

Later, the code was switched to use the kernel FPU API, API that works at
IPL_VM, not at IPL_NONE.

These two changes mean that the whole VCPU loop is now executing at IPL_VM,
which is not desired, because it introduces a delay in interrupt processing
on the host in certain cases.

Fix this by putting _vcpu_guest_fpu_enter() back inside the VCPU loop.
 1.79 08-Sep-2020  maxv nvmm-x86-vmx: improve the handling of CR0

- CR0_ET is hard-wired to 1 in the cpu, so force CR0_ET to 1 in the
shadow.
- Clarify.
 1.78 06-Sep-2020  riastradh Fix fallout from previous uvm.h cleanup.

- pmap(9) needs uvm/uvm_extern.h.

- x86/pmap.h is not usable on its own; it is only usable if included
via uvm/uvm_extern.h (-> uvm/uvm_pmap.h -> machine/pmap.h).

- Make nvmm.h and nvmm_internal.h standalone.
 1.77 05-Sep-2020  riastradh Round of uvm.h cleanup.

The poorly named uvm.h is generally supposed to be for uvm-internal
users only.

- Narrow it to files that actually need it -- mostly files that need
to query whether curlwp is the pagedaemon, which should maybe be
exposed by an external header.

- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
by it. We should split up uvm_extern.h but this will serve for now
to reduce the uvm.h dependencies.

- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
UVMHIST(ubchist), since ubchist is declared in uvm.h but the
reference evaporates if UVMHIST is not defined, so we reduce header
file dependencies.

- Make uvm_device.h and uvm_swap.h independently includable while
here.

ok chs@
 1.76 05-Sep-2020  maxv nvmm: update copyright headers
 1.75 04-Sep-2020  maxv nvmm-x86-vmx: improve the handling of CR0

- Flush the guest TLB when certain CR0 bits change.
- If the guest updates a static bit in CR0, then reflect the change in
VMCS_CR0_SHADOW, for the guest to get the illusion that the change was
applied. The "real" CR0 static bits remain unchanged.
- In vmx_vcpu_{g,s}et_state(), take VMCS_CR0_SHADOW into account.
- Slightly modify the CR4 handling code, just for more symmetry with CR0.
 1.74 26-Aug-2020  maxv nvmm-x86: improve the handling of RFLAGS.RF

- When injecting certain exceptions, set RF. For us to have an up-to-date
view of RFLAGS, we commit the state before the event.
- When advancing RIP, clear RF.
 1.73 26-Aug-2020  maxv nvmm-x86-vmx: improve the handling of CR4

- Filter out certain features we don't want the guest to enable. This is
for general correctness, and future-proofness.
- Flush the guest TLB when certain flags change.
 1.72 22-Aug-2020  maxv nvmm-x86-vmx: fix detection of the BIOS lock

If it's locked, ensure it's locked with VMX enabled. If it's not locked,
then lock it ourselves with VMX enabled.

Should fix NetBSD PR/55596.
 1.71 20-Aug-2020  maxv nvmm-x86: improve the CPUID emulation

- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.70 18-Aug-2020  maxv nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
 1.69 11-Aug-2020  maxv Improve the CPUID emulation on nvmm-intel:

- Limit the highest extended leaf.
- Limit 0x00000007 to ECX=0, for future-proofness.
 1.68 11-Aug-2020  maxv Improve emulation of MSR_IA32_ARCH_CAPABILITIES: publish only the *_NO
bits. Initially they were the only ones there, but Intel then added other
bits we aren't interested in, and they must be filtered out.
 1.67 05-Aug-2020  maxv Add new field definitions.
 1.66 05-Aug-2020  maxv Simplify, remove unnecessary #ifdef DIAGNOSTIC around KASSERTs.
 1.65 19-Jul-2020  maxv Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel
itself uses the fpu.
 1.64 19-Jul-2020  maxv The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no
effect. Disable interrupts earlier instead. This prevents a possible race
against such IPIs.
 1.63 18-Jul-2020  maxv Now that the IDT is per-CPU, it must be saved/restored on each CPU
independently.
 1.62 14-Jul-2020  yamaguchi Introduce per-cpu IDTs

This is realized by following modifications:
- Add IDT pages and its allocation maps for each cpu in "struct cpu_info"
- Load per-cpu IDTs at cpu_init_idt(struct cpu_info*)
- Copy the IDT entries for cpu0 to other CPUs at attach
- These are, for example, exceptions, db, system calls, etc.

And, added a kernel option named PCPU_IDT to enable the feature.
 1.61 03-Jul-2020  maxv Print the backend name when attaching.
 1.60 18-Jun-2020  maxv style
 1.59 24-May-2020  maxv Gather the conditions to return from the VCPU loops in nvmm_return_needed(),
and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors:

- When a VM initializes, the many nested page faults that need processing
could cause the calling thread to occupy the CPU too much if we're unlucky
and are only getting repeated nested page faults thousands of times in a
row.

- When the emulator calls nvmm_vcpu_run() and immediately sends a signal to
stop the VCPU, it's better to check signals earlier and leave right away,
rather than doing a round of VCPU run that could increase the time spent
by the emulator waiting for the return.
 1.58 21-May-2020  maxv Improve the CPUID emulation on nvmm-intel: limit the highest basic and
hypervisor leaves.
 1.57 10-May-2020  maxv Respect the convention for the hypervisor information: return the highest
hypervisor leaf in 0x40000000.EAX.
 1.56 09-May-2020  maxv Improve the CPUID emulation of basic leaves:
- Hide DCA and PQM, they cannot be used in guests.
- On Intel, explicitly handle each basic leaf until 0x16.
- On AMD, explicitly handle each basic leaf until 0x0D.
 1.55 09-May-2020  maxv On Intel CPUs, CPUID leaf 0xB, too, provides topology information, so
filter it correctly, to avoid inconsistencies if the host has SMT.

This fixes HaikuOS which fetches SMT information from there and would
panic because of the inconsistencies.
 1.54 30-Apr-2020  maxv If we were processing a software int/excp, and got a VMEXIT in the middle,
we must also reflect the instruction length, otherwise the next VMENTER
fails and Qemu shuts the guest down.
 1.53 30-Apr-2020  maxv When the identification fails, print the reason.
 1.52 22-Mar-2020  ad x86 pmap:

- Give pmap_remove_all() its own version of pmap_remove_ptes() that on native
x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on
'build.sh release' for me.

- pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The
caller waits for the competing operation to finish.

- Bring 'options TLBSTATS' up to date.
 1.51 14-Mar-2020  ad - Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.
 1.50 12-Mar-2020  tnn vmx_vmptrst(): only used when DIAGNOSTIC
 1.49 21-Feb-2020  joerg Explicitly cast pointers to uintptr_t before casting to enums. They are
not necessarily the same size. Don't cast pointers to bool, check for
NULL instead.
 1.48 09-Jan-2020  maxv Registering the host's CR0 is done outside of the VCPU loop, so it must be
cleared because it is also cleared inside the loop.

Not clearing it could trigger DNAs on VMEXITs, because STTS/CLTS are still
here as part of debugging since my FPU overhaul.
 1.47 09-Jan-2020  maxv Mmh, as noted in PR/54847, this should be uint64_t, not uint16_t. Harmless
because we use only the two lowest bits anyway.

I believe this could be caught by KUBSAN; time to do another round of
NVMM+K_SAN testing.
 1.46 10-Dec-2019  ad branches: 1.46.2;
pg->phys_addr > VM_PAGE_TO_PHYS(pg)
 1.45 20-Nov-2019  maxv Hide XSAVES-specific stuff and the masked extended states.
 1.44 28-Oct-2019  maxv A few changes:

- Use smaller types in struct nvmm_capability.
- Use smaller type for nvmm_io.port.
- Switch exitstate to a compacted structure.
 1.43 27-Oct-2019  maxv Add PCID support in the guests. This speeds up most 64bit guests, because
since Meltdown, everybody uses PCID (including NetBSD).
 1.42 27-Oct-2019  maxv Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and
fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in
qemu-nvmm.
 1.41 27-Oct-2019  maxv Add a new VCPU conf option, that allows userland to request VMEXITs after a
TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs.

The reason for wanting this option is that certain OSes (like Win10 64bit)
manage interrupt priority in hardware via CR8 directly, and for these OSes,
the emulator may want to sync its internal TPR state on each change.

Add two new fields in cap.arch, to report the conf capabilities. Report TPR
only on Intel for now, not AMD, because I don't have a recent AMD CPU on
which to test.
 1.40 23-Oct-2019  maxv Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
 1.39 12-Oct-2019  maxv Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.
 1.38 04-Oct-2019  maxv Switch to the new PTE naming.
 1.37 13-Sep-2019  maxv Always set hwcode on error. Useful for debugging.
 1.36 16-Jun-2019  maxv branches: 1.36.2;
Make sure VMX-outside-SMX is allowed. It may not be if the BIOS decided to
disable VMX. Seen on an HP laptop, where NVMM would panic because of that.
 1.35 18-May-2019  maxv branches: 1.35.2;
Now that SVS cannot be disabled at run time, MSR_LSTAR is static, so no
need to save it on each VM enter.
 1.34 11-May-2019  maxv Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.
 1.33 01-May-2019  maxv Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.
 1.32 29-Apr-2019  maxv Stop taking care of the INT/NMI windows in the kernel, the emulator is
supposed to do that itself.
 1.31 28-Apr-2019  maxv Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.
 1.30 27-Apr-2019  maxv Reorder the NVMM headers, to make a clear(er) distinction between MI and
MD. Also use #defines for the exit reasons rather than an union. No ABI
change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
 1.29 27-Apr-2019  maxv If guest events were being processed when a #VMEXIT occurred, reschedule
the events rather than dismissing them. This can happen for instance when a
guest wants to process an exception and an #NPF occurs on the guest IDT. In
practice it occurs only when the host swapped out specific guest pages.
 1.28 27-Apr-2019  maxv Optimize nvmm-intel, use inlined GCC assembly rather than function calls.
 1.27 24-Apr-2019  maxv Provide the hardware error code for NVMM_EXIT_INVALID, useful when
debugging.
 1.26 20-Apr-2019  maxv Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the
guest relies only on ECX to initialize/copy the FPU state (like NetBSD
does), spurious #GPs can be encountered because the bitmap is clobbered.
 1.25 07-Apr-2019  maxv Invert the filtering priority: now the kernel-managed cpuid leaves are
overwritable by the virtualizer. This is useful to virtualizers that want
to 100% control every leaf.
 1.24 06-Apr-2019  maxv Replace the misc[] state by a new compressed nvmm_x64_state_intr structure,
which describes the interruptibility state of the guest.

Add evt_pending, read-only, that allows the virtualizer to know if an event
is pending.
 1.23 03-Apr-2019  maxv VMX: if PAT is not valid, #GP on WRMSR, rather than crashing the guest.
 1.22 03-Apr-2019  maxv Add new VMCS bits.
 1.21 03-Apr-2019  maxv Add MSR_TSC.
 1.20 21-Mar-2019  maxv Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.
 1.19 14-Mar-2019  maxv Optimize NVMM-Intel: keep the VMCS active on the host CPU, and lazy-switch
it on demand only when needed. This allows the CPU to use the cached
version of the guest state, rather than the in-memory copy of it. This is
much more performant.

A VMCS must be active on only one CPU, but one CPU can have several active
VMCSs at the same time.

We keep track of which CPU each VMCS is active on. When we want to execute
a VCPU, we determine whether its VMCS is loaded on another CPU, and if so
send an IPI to ask it to unbusy that VMCS. In most cases the VMCS is
already active on the current CPU, so we don't have to do anything and can
proceed with a fast VMRESUME.

We send IPIs with kpreemption enabled but with a bound LWP, because we
don't want to get context-switched to the CPU we just sent an IPI to.

Overall, with this in place, I see a ~15% performance increase in the
guests on NVMM-Intel.
 1.18 14-Mar-2019  maxv Move a KASSERT, applies to all branches.
 1.17 07-Mar-2019  maxv Parse EXC_NMI on nvmm-intel, and don't return NVMM_EXIT_INVALID if we
received a host NMI, otherwise the guest could get killed if an NMI comes
in, typically when the host runs tprof at the same time.

Already handled on nvmm-amd.
 1.16 03-Mar-2019  maxv Choose which CPUID bits to allow, rather than which bits to disallow. This
is clearer, and also forward compatible with future CPUs.

While here be more consistent when allowing the bits, and sync between
nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest
state we provide is only x86+SSE. Fixes a CentOS panic when booting on
NVMM, reported by Jared McNeill, thanks.
 1.15 26-Feb-2019  maxv Change the layout of the SEG state:

- Reorder it, to match the CPU encoding. This is the universal order,
also used by Qemu. Drop the seg_to_nvmm[] tables.

- Compress it. This divides its size by two.

- Rename some of its fields, to better match the x86 spec. Also, take S
out of Type, this was a NetBSD-ism that was likely confusing to other
people.
 1.14 23-Feb-2019  maxv Install the x86 RESET state at VCPU creation time, for convenience, so
that the libnvmm users can expect a functional VCPU right away.
 1.13 23-Feb-2019  maxv Add support for CPUs that don't have the EPT_{A,D} bits.

On such CPUs, these bits are ignored by the hardware. We don't care about
setting them, however, we must always assume they are set. Modify the pmap
code to do that.

While here, in pmap_ept_remove_pte, don't flush the TLB when it's not
needed.

Tested on an old Intel Celeron.
 1.12 23-Feb-2019  maxv Reorder the functions, and constify setstate. No functional change.
 1.11 22-Feb-2019  maxv Fix omission: if we receive a guest trap on CR0, and if the original
instruction would have resulted in Long Mode being enabled, we need to
manually enable Long Mode ourselves. We were already doing that correctly
in setstate, but not in the CR0 trap handler.

Problem initially reported by Aymeric Vincent; ArchLinux wouldn't boot,
now it does and works correctly.

While here, add CR0_ET in the CR0 mask, for the associated shadow to
be taken into account. Normally this shadow bit shouldn't be necessary,
but for now I keep it regardless.
 1.10 21-Feb-2019  maxv Reorder the detection in vmx_ident(), to fix panic on old CPUs. We must
read MSR_IA32_VMX_EPT_VPID_CAP _after_ ensuring EPT is there, because if
it's not, the rdmsr faults.
 1.9 21-Feb-2019  maxv Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU
mutexes which can sleep, but their context does not allow it.

Rewrite the TLB handling code to fix that. It becomes a bit complex. In
short, we use a per-VM generation number, which we increase on each TLB
flush, before sending a broadcast IPI to everybody. The IPIs cause a
#VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen
with a per-VCPU copy, and apply the flushes as neededi lazily.

The behavior differs between AMD and Intel; in short, on Intel we don't
flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now,
we need to maintain a kcpuset to know which VCPU's hTLBs are active on
which hCPU. This creates some redundancy on Intel, ie there are cases
where we flush the hTLB several times unnecessarily; but hTLB flushes are
very rare, so there is no real performance regression.

The thing is lock-less and non-blocking, so it solves our problem.
 1.8 21-Feb-2019  maxv Clarify the gTLB code a little.
 1.7 18-Feb-2019  maxv Ah, finally found you. Fix scheduling bug in NVMM.

When processing guest page faults, we were calling uvm_fault with
preemption disabled. The thing is, uvm_fault may block, and if it does,
we land in sleepq_block which calls mi_switch; so we get switched away
while we explicitly asked not to be. From then on things could go really
wrong.

Fix that by processing such faults in MI, where we have preemption enabled
and are allowed to block.

A KASSERT in sleepq_block (or before) would have helped.
 1.6 16-Feb-2019  maxv Improve the FPU detection: hide XSAVES because we're not allowing it, and
don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE.

With these changes in place, I can boot Windows 10 on NVMM.
 1.5 16-Feb-2019  maxv Handle MSR_MISC_ENABLE on NVMM-Intel (Intel-specific).
 1.4 15-Feb-2019  maxv Initialize the guest TSC to zero at VCPU creation time, and handle guest
writes to MSR_TSC at run time.

This is imprecise, because the hardware does not provide a way to preserve
the TSC during #VMEXITs, but that's fine enough.
 1.3 14-Feb-2019  maxv Harmonize the handling of the CPL between AMD and Intel.

AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET
instructions do not force SS.DPL to predefined values. On Intel they do,
so the CPL on Intel is just the guest's SS.DPL value.

Even though technically possible on AMD, there is no sane reason for a
guest kernel to set a non-three SS.DPL, doing that would mess up several
common segmentation practices and wouldn't be compatible with Intel.

So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL.
Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually
increases performance on AMD: to detect interrupt windows the virtualizer
has to modify some fields of misc[], and because CPL was there, we had to
flush the SEG set of the VMCB cache. Now there is no flush necessary.

While here remove the CPL check for XSETBV on Intel, contrary to AMD
Intel checks the CPL before the intercept, so if we receive an XSETBV
VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the
way my check was wrong in the first place, it was reading SS.RPL instead
of SS.DPL.
 1.2 14-Feb-2019  maxv On AMD, the segments have a simple "present" bit. On Intel however there
is an extra "unusable" bit, which has a twisted meaning. We can't just
ignore this bit, because when unset, the CPU performs extra checks on the
other attributes, which may cause VMENTRY to fail and the guest to be
killed.

Typically, on Qemu, some guests like Windows XP trigger two consecutive
getstate+setstate calls, and while processing them, we end up wrongfully
removing the "unusable" bits that were previously set.

Fix that by forcing "unusable = !present". Each hypervisor I could check
does something different, but this seems to be the least problematic
solution for now.

While here, the fields of vmx_guest_segs are VMX indexes, so they should
be uint64_t (no functional change).
 1.1 13-Feb-2019  maxv Add Intel-VMX support in NVMM. This allows us to run hardware-accelerated
VMs on Intel CPUs. Overall this implementation is fast and reliable, I am
able to run NetBSD VMs with many VCPUs on a quad-core Intel i5.

NVMM-Intel applies several optimizations already present in NVMM-AMD, and
has a code structure similar to it. No change was needed in the NVMM MI
frontend, or in libnvmm.

Some differences exist against AMD:

- On Intel the ASID space is big, so we don't fall back to a shared ASID
when there are more VCPUs executing than available ASIDs in the host,
contrary to AMD. There are enough ASIDs for the maximum number of VCPUs
supported by NVMM.

- On Intel there are two TLBs we need to take care of, one for the host
(EPT) and one for the guest (VPID). Changes in EPT paging flush the
host TLB, changes to the guest mode flush the guest TLB.

- On Intel there is no easy way to set/fetch the VTPR, so we intercept
reads/writes to CR8 and maintain a software TPR, that we give to the
virtualizer as if it was the effective TPR in the guest.

- On Intel, because of SVS, the host CR4 and LSTAR are not static, so
we're forced to save them on each VMENTRY.

- There is extra Intel weirdness we need to take care of, for example the
reserved bits in CR0 and CR4 when accesses trap.

While this implementation is functional and can already run many OSes, we
likely have a problem on 32bit-PAE guests, because they require special
care on Intel CPUs, and currently we don't handle that correctly; such
guests may misbehave for now (without altering the host stability). I
expect to fix that soon.
 1.35.2.3 13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.35.2.2 10-Jun-2019  christos Sync with HEAD
 1.35.2.1 18-May-2019  christos file nvmm_x86_vmx.c was added on branch phil-wifi on 2019-06-10 22:07:14 +0000
 1.36.2.15 13-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1078):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.73
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.73
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.74

nvmm-x86-vmx: improve the handling of CR4
- Filter out certain features we don't want the guest to enable. This is
for general correctness, and future-proofness.
- Flush the guest TLB when certain flags change.

nvmm-x86: improve the handling of RFLAGS.RF
- When injecting certain exceptions, set RF. For us to have an up-to-date
view of RFLAGS, we commit the state before the event.
- When advancing RIP, clear RF.
 1.36.2.14 13-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1077):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.74
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.16

Improve emulation of MSR_IA32_ARCH_CAPABILITIES: publish only the *_NO
bits. Initially they were the only ones there, but Intel then added other
bits we aren't interested in, and they must be filtered out.

nvmm-x86-svm: improve the handling of MSR_EFER

Intercept reads of it as well, just to mask EFER_SVME, which the guest
doesn't need to see.

nvmm-x86: improve the CPUID emulation

- Mask DTES64, DS_CPL, CID, SDBG, xTPR, PN.
- B10, B20 and IA64 do not exist, so just remove them.
 1.36.2.13 04-Sep-2020  martin Pull up following revision(s) (requested by maxv in ticket #1076):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.75
sys/arch/x86/include/specialreg.h: revision 1.172
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.72

nvmm-x86-vmx: fix detection of the BIOS lock

If it's locked, ensure it's locked with VMX enabled. If it's not locked,
then lock it ourselves with VMX enabled.

Should fix NetBSD PR/55596.

-

Add a few more CPUID flags.

-

nvmm-x86-svm: check the SVM revision
Only revision 1 exists, but check it, for future-proofness.
 1.36.2.12 29-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1068):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.71
sys/dev/nvmm/nvmm.c: revision 1.34
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.72
sys/dev/nvmm/nvmm.c: revision 1.35
sys/dev/nvmm/nvmm.c: revision 1.36
sys/dev/nvmm/x86/nvmm_x86_svmfunc.S: revision 1.5
sys/dev/nvmm/nvmm.c: revision 1.37
sys/dev/nvmm/x86/nvmm_x86_vmxfunc.S: revision 1.5
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.15
sys/dev/nvmm/nvmm_ioctl.h: revision 1.10

Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.

nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes

nvmm: localify a variable that doesn't need to be global

nvmm: use relaxed atomics to read nmachines

nvmm-x86-svm: dedup code

nvmm-x86: hide more CPUID flags, mostly related to perf monitors

nvmm: misc improvements
- use mach->ncpus to get the number of vcpus, now that we have it
- don't forget to decrement mach->ncpus when a machine gets killed
- add more __predict_false()

nvmm-x86-svm: don't forget to intercept INVD
INVD executed in the guest can be dangerous for the host, due to CPU
caches being flushed without write-back.

nvmm: slightly clarify

nvmm: explicitly include atomic.h
 1.36.2.11 26-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1058):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.19
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.71
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.69
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.11
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.13
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.14

Improve the CPUID emulation:
- Hide SGX*, PKU, WAITPKG, and SKINIT, because they are not supported.
- Hide HLE and RTM, part of TSX. Because TSX is just too buggy and we
cannot guarantee that it remains enabled in the guest (if for example
the host disables TSX while the guest is running). Nobody wants this
crap anyway, so bye-bye.
- Advertise FSREP_MOV, because no reason to hide it.

Hide OSPKE. NFC since the host never uses PKU, but still.

Improve the CPUID emulation on nvmm-intel:
- Limit the highest extended leaf.
- Limit 0x00000007 to ECX=0, for future-proofness.

nvmm-x86-svm: improve the CPUID emulation

Limit the hypervisor range, and properly handle each basic leaf until 0xD.

nvmm-x86: advertise the SERIALIZE instruction, available on future CPUs

nvmm-x86: improve the CPUID emulation
- x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter
contains extended features we must filter out. Apply the same in
x86-vmx for symmetry.
- x86-svm: explicitly handle extended leaves until 0x8000001F, and
truncate to it.
 1.36.2.10 18-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1055):

sys/dev/nvmm/nvmm.h: revision 1.13
sys/dev/nvmm/nvmm.h: revision 1.14
sys/dev/nvmm/nvmm.c: revision 1.33
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.67
sys/dev/nvmm/nvmm_internal.h: revision 1.17
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.67
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.10

Put the few x86-specific structures under #ifdef __x86_64__, for clarity.

Make it easier to understand what's going on, no functional change.

Add new field definitions.

Add new field definitions, and intercept everything, for future-proofness.

Add CTASSERT.
 1.36.2.9 05-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1041):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.66
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.50
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.66
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.46
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.49
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.55
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.56

pg->phys_addr > VM_PAGE_TO_PHYS(pg)

Explicitly cast pointers to uintptr_t before casting to enums. They are
not necessarily the same size. Don't cast pointers to bool, check for
NULL instead.

vmx_vmptrst(): only used when DIAGNOSTIC

Simplify, remove unnecessary #ifdef DIAGNOSTIC around KASSERTs.

Use ULL, to make it clear we are unsigned.
 1.36.2.8 02-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1032):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.60 (patch)
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.61 (patch)
sys/dev/nvmm/nvmm.c: revision 1.30
sys/dev/nvmm/nvmm.c: revision 1.31
sys/dev/nvmm/nvmm.c: revision 1.32
sys/dev/nvmm/nvmm_internal.h: revision 1.15
sys/dev/nvmm/nvmm_internal.h: revision 1.16
sys/dev/nvmm/files.nvmm: revision 1.3
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.62 (patch)
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.63 (patch)
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.59 (patch)
sys/modules/nvmm/nvmm.ioconf: revision 1.2

Gather the conditions to return from the VCPU loops in nvmm_return_needed(),
and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors:

- When a VM initializes, the many nested page faults that need processing
could cause the calling thread to occupy the CPU too much if we're unlucky
and are only getting repeated nested page faults thousands of times in a
row.

- When the emulator calls nvmm_vcpu_run() and immediately sends a signal to
stop the VCPU, it's better to check signals earlier and leave right away,
rather than doing a round of VCPU run that could increase the time spent
by the emulator waiting for the return.

style

Register NVMM as an actual pseudo-device. Without PMF handler, to
explicitly disallow ACPI suspend if NVMM is running.

Should fix PR/55406.

Print the backend name when attaching.
 1.36.2.7 21-May-2020  martin Pull up following revision(s) (requested by maxv in ticket #919):

sys/dev/nvmm/x86/nvmm_x86.c: revision 1.9
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.60
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.61
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.56
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.57
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.58
sys/dev/nvmm/nvmm.c: revision 1.29

Improve the CPUID emulation of basic leaves:
- Hide DCA and PQM, they cannot be used in guests.
- On Intel, explicitly handle each basic leaf until 0x16.
- On AMD, explicitly handle each basic leaf until 0x0D.

Respect the convention for the hypervisor information: return the highest
hypervisor leaf in 0x40000000.EAX.

Improve the CPUID emulation on nvmm-intel: limit the highest basic and
hypervisor leaves.

Complete rev1.26: reset nvmm_impl to NULL in nvmm_fini().
 1.36.2.6 13-May-2020  martin Pull up following revision(s) (requested by maxv in ticket #898):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.59
sys/dev/nvmm/nvmm_internal.h: revision 1.14
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.53
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.54
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.55
sys/dev/nvmm/nvmm.c: revision 1.27
sys/dev/nvmm/nvmm.c: revision 1.28

When the identification fails, print the reason.

If we were processing a software int/excp, and got a VMEXIT in the middle,
we must also reflect the instruction length, otherwise the next VMENTER
fails and Qemu shuts the guest down.

On Intel CPUs, CPUID leaf 0xB, too, provides topology information, so
filter it correctly, to avoid inconsistencies if the host has SMT.

This fixes HaikuOS which fetches SMT information from there and would
panic because of the inconsistencies.
 1.36.2.5 10-Feb-2020  martin Pull up following revision(s) (requested by maxv in ticket #688):

share/man/man4/nvmm.4: revision 1.5
lib/libnvmm/libnvmm.3: revision 1.26
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.47

Mmh, as noted in PR/54847, this should be uint64_t, not uint16_t. Harmless
because we use only the two lowest bits anyway.

I believe this could be caught by KUBSAN; time to do another round of
NVMM+K_SAN testing.

Reference nvmmctl(8).
 1.36.2.4 25-Nov-2019  martin Pull up following revision(s) (requested by maxv in ticket #475):

tests/lib/libnvmm/h_mem_assist.c: revision 1.18
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.45
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.54

Hide XSAVES-specific stuff and the masked extended states.

Several improvements. In particular, reduce CS.limit, because Intel CPUs
perform strict sanity checks, and the previous (too high) limit caused the
VM entry to fail.
 1.36.2.3 10-Nov-2019  martin Pull up following revision(s) (requested by maxv in ticket #405):

usr.sbin/nvmmctl/nvmmctl.8: revision 1.2
lib/libnvmm/libnvmm.3: revision 1.24
sys/dev/nvmm/nvmm.h: revision 1.11
lib/libnvmm/libnvmm.3: revision 1.25
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.16
sys/dev/nvmm/nvmm.h: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.17
tests/lib/libnvmm/h_mem_assist.c: revision 1.12
sys/dev/nvmm/x86/nvmm_x86.h: revision 1.18
share/mk/bsd.hostprog.mk: revision 1.82
lib/libnvmm/libnvmm.c: revision 1.15
distrib/sets/lists/base/md.amd64: revision 1.281
tests/lib/libnvmm/h_mem_assist.c: revision 1.13
lib/libnvmm/libnvmm.c: revision 1.16
tests/lib/libnvmm/h_mem_assist.c: revision 1.14
lib/libnvmm/libnvmm_x86.c: revision 1.32
lib/libnvmm/libnvmm.c: revision 1.17
tests/lib/libnvmm/h_mem_assist.c: revision 1.15
lib/libnvmm/libnvmm_x86.c: revision 1.33
lib/libnvmm/libnvmm.c: revision 1.18
usr.sbin/nvmmctl/Makefile: revision 1.1
tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.7
tests/lib/libnvmm/h_mem_assist.c: revision 1.16
lib/libnvmm/libnvmm_x86.c: revision 1.34
usr.sbin/nvmmctl/Makefile: revision 1.2
tests/lib/libnvmm/h_mem_assist_asm.S: revision 1.8
tests/lib/libnvmm/h_mem_assist.c: revision 1.17
sys/dev/nvmm/nvmm_internal.h: revision 1.13
lib/libnvmm/libnvmm_x86.c: revision 1.35
lib/libnvmm/libnvmm_x86.c: revision 1.36
usr.sbin/postinstall/postinstall.in: revision 1.8
lib/libnvmm/libnvmm_x86.c: revision 1.37
lib/libnvmm/libnvmm_x86.c: revision 1.38
lib/libnvmm/libnvmm_x86.c: revision 1.39
usr.sbin/Makefile: revision 1.282
lib/libnvmm/nvmm.h: revision 1.13
lib/libnvmm/nvmm.h: revision 1.14
lib/libnvmm/nvmm.h: revision 1.15
sys/dev/nvmm/nvmm.c: revision 1.23
lib/libnvmm/nvmm.h: revision 1.16
sys/dev/nvmm/nvmm.c: revision 1.24
lib/libnvmm/nvmm.h: revision 1.17
sys/dev/nvmm/nvmm.c: revision 1.25
tests/lib/libnvmm/h_io_assist.c: revision 1.9
etc/MAKEDEV.tmpl: revision 1.209
tests/lib/libnvmm/h_io_assist.c: revision 1.10
tests/lib/libnvmm/h_io_assist.c: revision 1.11
etc/group: revision 1.35
distrib/sets/lists/man/mi: revision 1.1660
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.40
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.41
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.42
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.43
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.44
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.51
sys/dev/nvmm/nvmm_ioctl.h: revision 1.8
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.52
sys/dev/nvmm/nvmm_ioctl.h: revision 1.9
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.53
usr.sbin/nvmmctl/nvmmctl.c: revision 1.1
lib/libnvmm/libnvmm.3: revision 1.20
distrib/sets/lists/debug/md.amd64: revision 1.106
lib/libnvmm/libnvmm.3: revision 1.21
lib/libnvmm/libnvmm.3: revision 1.22
usr.sbin/nvmmctl/nvmmctl.8: revision 1.1
lib/libnvmm/libnvmm.3: revision 1.23

Fix incorrect parsing: the R/M field uses a special GPR map when the
address size is 16 bits, regardless of the actual operating mode. With
this special map there can be two registers referenced at once, and
also disp16-only.
Implement this special behavior, and add associated tests. While here
simplify a few things.
With this in place, the Windows 95 installer initializes correctly.
Part of PR/54611.
add missing initializer
Implement XCHG, add associated tests, and add comments to explain. With
this in place the Windows 95 installer completes successfuly.
Part of PR/54611.
Improve nvmm_vcpu_dump().
Put back 'default', because llvm apparently doesn't realize that all cases
are covered in the switch.
Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.
- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.
- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().
- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.
- Slightly reorganize the MI/MD definitions, for internal clarity.
- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.
- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t
- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.
- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.
- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.
- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
Three changes in libnvmm:
- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.
- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.
- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.
Update the libnvmm man page:
- Sync the naming with reality.
- Replace "relevant" by "desired" and "virtualizer" by "emulator", closer
to what I meant.
- Add a "VCPU Configuration" section.
- Add a "Machine Ownership" section.
Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.
Use the new PTE naming, and define CR3_FRAME_* separately. No functional
change.
Add a new VCPU conf option, that allows userland to request VMEXITs after a
TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs.
The reason for wanting this option is that certain OSes (like Win10 64bit)
manage interrupt priority in hardware via CR8 directly, and for these OSes,
the emulator may want to sync its internal TPR state on each change.
Add two new fields in cap.arch, to report the conf capabilities. Report TPR
only on Intel for now, not AMD, because I don't have a recent AMD CPU on
which to test.
Mask CPUID leaf 0x0A on Intel, because we don't want the guest to try (and
fail) to probe the PMC MSRs. This avoids "Unexpected WRMSR" warnings in
qemu-nvmm.
Add PCID support in the guests. This speeds up most 64bit guests, because
since Meltdown, everybody uses PCID (including NetBSD).
Change the way root_owner works: consider the calling process as root_owner
not if it has root privileges, but if the /dev/nvmm device was opened with
write permissions. Introduce the undocumented nvmm_root_init() function to
achieve that.
The goal is to simplify the logic and have more granularity, eg if we want
a monitoring agent to access VMs but don't want to give this agent real
root access on the system.
A few changes:
- Use smaller types in struct nvmm_capability.
- Use smaller type for nvmm_io.port.
- Switch exitstate to a compacted structure.
Add nram in struct nvmm_ctl_mach_info.
Add nvmmctl, with two commands for now.
Macro tidyness.
Sort SEE ALSO.
should be fork(2), noticed by wiz
Add debug entry for newly introduced nvmmctl utility.
Annotate a covering switch as such to avoid warnings about missing
returns.
Forgot to put nvmmctl in the "nvmm" group.
Add nvmm group.
 1.36.2.2 06-Oct-2019  martin Pull up following revision(s) (requested by maxv in ticket #287):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.38
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.47
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.48
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.49

Add definitions for RDPRU, MCOMMIT, GMET and VTE.

Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed
version.

Switch to the new PTE naming.
 1.36.2.1 24-Sep-2019  martin Pull up following revision(s) (requested by maxv in ticket #239):

sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.37

Always set hwcode on error. Useful for debugging.
 1.46.2.2 29-Feb-2020  ad Sync with head.
 1.46.2.1 17-Jan-2020  ad Sync with head.
 1.81.4.1 03-Apr-2021  thorpej Sync with HEAD.
 1.81.2.1 03-Apr-2021  thorpej Sync with HEAD.
 1.85.4.1 14-Dec-2023  martin Pull up following revision(s) (requested by rin in ticket #494):

sys/arch/xen/x86/xen_ipi.c: revision 1.42
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.86

xen_ipi: valid_ipimask: Sprinkle __diagused to fix clang !DIAGNOSTIC build

nvmm_x86_vmx: vmx_vmptrst: Sprinkle __diagused to fix clang !DIAGNOSTIC build
 1.86.6.1 02-Aug-2025  perseant Sync with HEAD
 1.6 05-Sep-2020  maxv nvmm: update copyright headers
 1.5 11-Aug-2020  maxv Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.
 1.4 19-Jul-2020  maxv The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no
effect. Disable interrupts earlier instead. This prevents a possible race
against such IPIs.
 1.3 27-Apr-2019  maxv branches: 1.3.2; 1.3.4;
Optimize nvmm-intel, use inlined GCC assembly rather than function calls.
 1.2 24-Apr-2019  maxv Match the structure order, for better cache utilization.
 1.1 13-Feb-2019  maxv Add Intel-VMX support in NVMM. This allows us to run hardware-accelerated
VMs on Intel CPUs. Overall this implementation is fast and reliable, I am
able to run NetBSD VMs with many VCPUs on a quad-core Intel i5.

NVMM-Intel applies several optimizations already present in NVMM-AMD, and
has a code structure similar to it. No change was needed in the NVMM MI
frontend, or in libnvmm.

Some differences exist against AMD:

- On Intel the ASID space is big, so we don't fall back to a shared ASID
when there are more VCPUs executing than available ASIDs in the host,
contrary to AMD. There are enough ASIDs for the maximum number of VCPUs
supported by NVMM.

- On Intel there are two TLBs we need to take care of, one for the host
(EPT) and one for the guest (VPID). Changes in EPT paging flush the
host TLB, changes to the guest mode flush the guest TLB.

- On Intel there is no easy way to set/fetch the VTPR, so we intercept
reads/writes to CR8 and maintain a software TPR, that we give to the
virtualizer as if it was the effective TPR in the guest.

- On Intel, because of SVS, the host CR4 and LSTAR are not static, so
we're forced to save them on each VMENTRY.

- There is extra Intel weirdness we need to take care of, for example the
reserved bits in CR0 and CR4 when accesses trap.

While this implementation is functional and can already run many OSes, we
likely have a problem on 32bit-PAE guests, because they require special
care on Intel CPUs, and currently we don't handle that correctly; such
guests may misbehave for now (without altering the host stability). I
expect to fix that soon.
 1.3.4.1 29-Aug-2020  martin Pull up following revision(s) (requested by maxv in ticket #1068):

sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.71
sys/dev/nvmm/nvmm.c: revision 1.34
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.72
sys/dev/nvmm/nvmm.c: revision 1.35
sys/dev/nvmm/nvmm.c: revision 1.36
sys/dev/nvmm/x86/nvmm_x86_svmfunc.S: revision 1.5
sys/dev/nvmm/nvmm.c: revision 1.37
sys/dev/nvmm/x86/nvmm_x86_vmxfunc.S: revision 1.5
sys/dev/nvmm/x86/nvmm_x86_vmx.c: revision 1.70
sys/dev/nvmm/x86/nvmm_x86_svm.c: revision 1.68
sys/dev/nvmm/x86/nvmm_x86.c: revision 1.15
sys/dev/nvmm/nvmm_ioctl.h: revision 1.10

Micro-optimize: use pushq instead of pushw. To avoid LCP stalls and
unaligned stack accesses.

nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes

nvmm: localify a variable that doesn't need to be global

nvmm: use relaxed atomics to read nmachines

nvmm-x86-svm: dedup code

nvmm-x86: hide more CPUID flags, mostly related to perf monitors

nvmm: misc improvements
- use mach->ncpus to get the number of vcpus, now that we have it
- don't forget to decrement mach->ncpus when a machine gets killed
- add more __predict_false()

nvmm-x86-svm: don't forget to intercept INVD
INVD executed in the guest can be dangerous for the host, due to CPU
caches being flushed without write-back.

nvmm: slightly clarify

nvmm: explicitly include atomic.h
 1.3.2.2 10-Jun-2019  christos Sync with HEAD
 1.3.2.1 27-Apr-2019  christos file nvmm_x86_vmxfunc.S was added on branch phil-wifi on 2019-06-10 22:07:14 +0000

RSS XML Feed