History log of /src/sys/arch/x86/include/pmap.h
Revision  Date  Author  Comments
 1.134  20-Aug-2022  riastradh x86: Move definition of struct pmap to pmap_private.h.

This makes pmap_resident_count and pmap_wired_count out-of-line
functions instead of inline. No functional change intended
otherwise.
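
A minimal sketch of what the out-of-line move means in practice; the
pm_stats field name is an assumption here, and this is not the committed
code:

    /* pmap.h: prototype only; struct pmap stays opaque to callers */
    long    pmap_resident_count(struct pmap *);

    /* pmap.c: definition, with <machine/pmap_private.h> in scope */
    long
    pmap_resident_count(struct pmap *pmap)
    {
            return pmap->pm_stats.resident_count;
    }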
 1.133  20-Aug-2022  riastradh x86: Split most of pmap.h into pmap_private.h or vmparam.h.

This way pmap.h only contains the MD definition of the MI pmap(9)
API, which loads of things in the kernel rely on, so changing x86
pmap internals no longer requires recompiling the entire kernel every
time.

Callers needing these internals must now use machine/pmap_private.h.
Note: This is not x86/pmap_private.h because it contains three parts:

1. CPU-specific (different for i386/amd64) definitions used by...

2. common definitions, including Xenisms like xpmap_ptetomach,
further used by...

3. more CPU-specific inlines for pmap_pte_* operations

So {amd64,i386}/pmap_private.h defines 1, includes x86/pmap_private.h
for 2, and then defines 3. Maybe we should split that out into a new
pmap_pte.h to reduce this trouble.

No functional change intended, other than that some .c files must
include machine/pmap_private.h when previously uvm/uvm_pmap.h
polluted the namespace with pmap internals.

Note: This migrates part of i386/pmap.h into i386/vmparam.h --
specifically the parts that are needed for several constants defined
in vmparam.h:

VM_MAXUSER_ADDRESS
VM_MAX_ADDRESS
VM_MAX_KERNEL_ADDRESS
VM_MIN_KERNEL_ADDRESS

Since i386 needs PDP_SIZE in vmparam.h, I added it there on amd64
too, just to keep things parallel.
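
The three-part layering, as an illustrative sketch (comments only, not the
committed headers):

    /* amd64/pmap_private.h (likewise i386/pmap_private.h) */
    /* 1. CPU-specific definitions (different for i386/amd64) */
    #include <x86/pmap_private.h>   /* 2. common definitions, Xenisms */
    /* 3. CPU-specific pmap_pte_* inlines */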
 1.132  20-Aug-2022  riastradh x86: Move pl*_i, pl_i_roundup, and ptp_va2o out of x86/pmap.h.

- pl[1-4]_i -> x86/pte.h
- pl_i, pl_i_roundup, ptp_va2o -> x86/pmap.c
 1.131  20-Aug-2022  riastradh x86: Move struct vm_page_md to common x86/pmap.h.
 1.130  20-Aug-2022  riastradh x86: Split bootspace out of x86/pmap.h into new x86/bootspace.h.
 1.129  20-Aug-2022  riastradh x86: Move page attribute table bits to x86/pat.h.
 1.128  18-Jun-2022  andvar fix typos in word "functions" in comments, mainly s/fuctions/functions/.
 1.127  30-Apr-2021  christos Merge the x86 gdt function and constant definitions
 1.126  30-Apr-2021  christos Bump MAX_USERLDT_SIZE to the max size (wastes some memory). wine needs more
than PAGE_SIZE and fails spuriously.
XXX: Note the duplicate definition hacks. Should really create <x86/gdt.h>,
put just the constants there and unify them.
This would also avoid the hack in: src/tests/lib/libi386/t_user_ldt.c#46
 1.125  19-Jul-2020  maxv branches: 1.125.6;
Revert most of ad's movs/stos change. Instead do a lot simpler: declare
svs_quad_copy() used by SVS only, with no need for instrumentation, because
SVS is disabled when sanitizers are on.
 1.124  14-Jul-2020  yamaguchi Introduce per-cpu IDTs

This is realized by the following modifications:
- Add IDT pages and their allocation maps for each cpu in "struct cpu_info"
- Load per-cpu IDTs at cpu_init_idt(struct cpu_info*)
- Copy the IDT entries for cpu0 to other CPUs at attach
- These are, for example, exceptions, db, system calls, etc.

Also, add a kernel option named PCPU_IDT to enable the feature.
 1.123  24-Jun-2020  maxv remove unused x86_stos
 1.122  27-May-2020  ad - Add a couple of wrapper functions around STOS and MOVS and use them to zero
and copy PTEs in preference to memset()/memcpy().

- Remove related SSE / pageidlezero stuff.
 1.121  26-May-2020  bouyer Ajust pmap_enter_ma() for upcoming new Xen privcmd ioctl:
pass flags to xpq_update_foreign()
Introduce a pmap MD flag: PMAP_MD_XEN_NOTR, which cause xpq_update_foreign()
to use the MMU_PT_UPDATE_NO_TRANSLATE flag.
make xpq_update_foreign() return the raw Xen error. This will cause
pmap_enter_ma() to return a negative error number in this case, but the
only user of this code path is privcmd.c and it can deal with it.

Add pmap_enter_gnt(), which maps a set of Xen grant entries at the
specified va in the specified pmap. Use the hooks implemented for EPT to
keep track of mapped grant entries in the pmap, and unmap them
when pmap_remove() is called. This requires pmap_remove() to be split
into a pmap_remove_locked(), to be called from pmap_remove_gnt().
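
A minimal sketch of the split, assuming the pmap's pm_lock mutex; not the
committed code:

    void
    pmap_remove(struct pmap *pmap, vaddr_t sva, vaddr_t eva)
    {
            mutex_enter(&pmap->pm_lock);
            pmap_remove_locked(pmap, sva, eva);
            mutex_exit(&pmap->pm_lock);
    }

    /* pmap_remove_gnt() likewise takes pm_lock and calls
     * pmap_remove_locked() for the ranges it tears down. */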
 1.120  08-May-2020  riastradh Factor randomization out of slotspace_rand.

slotspace_rand becomes deterministic; the randomization moves into
the callers instead. Why?

There are two callers of slotspace_rand:

- x86/pmap.c pmap_bootstrap
- amd64/amd64.c init_slotspace

When the randomization was introduced, it used an x86-only
`cpu_earlyrng' abstraction that would hash rdseed/rdrand and rdtsc
output together. Except init_slotspace ran before cpu_probe, so
cpu_feature was not yet filled out, so during init_slotspace, the
only randomization was rdtsc.

In the course of the recent entropy overhaul, I replaced cpu_earlyrng
by entropy_extract, and moved cpu_init_rng much earlier -- but still
after cpu_probe -- in order to reduce the number of abstractions
lying around and the number of copies of rdrand/rdseed logic. In so
doing I added some annoying complication (see curcpu_available) to
kern_entropy.c to make it work early enough for init_slotspace, and
dropped the rdtsc.

For pmap_bootstrap that didn't substantively change anything. But
for init_slotspace, it removed the only randomization. To mitigate
this, this commit pulls the randomization out of slotspace_rand into
pmap_bootstrap and init_slotspace, so that

(a) init_slotspace can use rdtsc and a little private entropy pool in
order to restore the prior (weak) randomization it had, and

(b) pmap_bootstrap, which runs a little bit later, can continue to
use entropy_extract normally and get rdrand/rdseed too.

A subsequent commit will move cpu_init_rng just a wee bit later,
after cpu_init_msrs, so the kern_entropy.c complications can go away.
Perhaps someone else more wizardly with x86 can find a way to make
init_slotspace run a little later too, after cpu_probe and after
cpu_init_msrs and after cpu_rng_init, but I am not that wizardly.
 1.119  25-Apr-2020  bouyer Merge the bouyer-xenpvh branch, bringing in Xen PV drivers support under HVM
guests in GENERIC.
Xen support can be disabled at runtime with
boot -c
disable hypervisor
 1.118  24-Apr-2020  maxv Give the ldt a fixed size of one page (512 slots), and drop the variable-
sized mechanism that was too complex.

This fixes a race between USER_LDT and SVS: during context switches, the
way SVS installs the new ldt relies on the ldt pointer AND the ldt size,
but both cannot be accessed atomically at the same time.
 1.117  05-Apr-2020  ad branches: 1.117.2;
Allocate PV entries in PAGE_SIZE chunks, and cache partially allocated PV
pages with the pmap. Worth about 2-3% sys time on build.sh for me.
 1.116  22-Mar-2020  ad x86 pmap:

- Give pmap_remove_all() its own version of pmap_remove_ptes() that on native
x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on
'build.sh release' for me.

- pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The
caller waits for the competing operation to finish.

- Bring 'options TLBSTATS' up to date.
 1.115  17-Mar-2020  ad Hallelujah, the bug has been found. Resurrect prior changes, to be fixed
with following commit.
 1.114  17-Mar-2020  ad Back out the recent pmap changes until I can figure out what is going on
with pmap_page_remove() (to pmap.c rev 1.365).
 1.113  14-Mar-2020  ad PR kern/55071 (Panic shortly after running X11 due to kernel diagnostic assertion "mutex_owned(&pp->pp_lock)")

- Fix a locking bug in pmap_pp_clear_attrs(), and in pmap_pp_remove() do the
TLB shootdown while still holding the target pmap's lock.

Also:

- Finish PV list locking for x86 & update comments around same.

- Keep track of the min/max index of PTEs inserted into each PTP, and use
that to clip ranges of VAs passed to pmap_remove_ptes().

- Based on the above, implement a pmap_remove_all() for x86 that clears out
the pmap in a single pass. Makes exit() / fork() much cheaper.
 1.112  14-Mar-2020  ad pmap_remove_all(): Return a boolean value to indicate the behaviour. If
true, all mappings have been removed, the pmap is totally cleared out, and
UVM can then avoid doing the work to call pmap_remove() for each map entry.
If false, either nothing has been done, or some helpful arch-specific voodoo
has taken place.
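
A hypothetical caller-side illustration of the new contract (simplified;
not what uvm actually does verbatim):

    if (pmap_remove_all(map->pmap)) {
            /* true: pmap totally cleared, skip per-entry removal */
    } else {
            /* false: remove mappings entry by entry as before */
            pmap_remove(map->pmap, entry->start, entry->end);
    }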
 1.111  10-Mar-2020  ad - pmap_check_inuse() is expensive so make it DEBUG not DIAGNOSTIC.

- Put PV locking back in place with only a minor performance impact.
pmap_enter() still needs more work - it's not easy to satisfy all the
competing requirements so I'll do that with another change.

- Use pmap_find_ptp() (lookup only) in preference to pmap_get_ptp() (alloc).
Make pm_ptphint indexed by VA not PA. Replace the per-pmap radixtree for
dynamic PV entries with a per-PTP rbtree. Cuts system time during kernel
build by ~10% for me.
 1.110  23-Feb-2020  ad UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
 1.109  12-Jan-2020  ad x86 pmap:

- It turns out that every page the pmap frees is necessarily zeroed. Tell
the VM system about this and use the pmap as a source of pre-zeroed pages.

- Redo deferred freeing of PTPs more elegantly, including the integration with
pmap_remove_all(). This fixes problems with nvmm, and possibly also a crash
discovered during fuzzing.

Reported-by: syzbot+a97186518c84f1d85c0c@syzkaller.appspotmail.com
 1.108  04-Jan-2020  ad branches: 1.108.2;
x86 pmap improvements, reducing system time during a build by about 15% on
my test machine:

- Replace the global pv_hash with a per-pmap record of dynamically allocated
pv entries. The data structure used for this can be changed easily, and
has no special concurrency requirements. For now go with radixtree.

- Change pmap_pdp_cache back into a pool; cache the page directory with the
pmap, and avoid contention on pmaps_lock by adjusting the global list in
the pool_cache ctor & dtor. Align struct pmap and its lock, and update
some comments.

- Simplify pv_entry lists slightly. Allow both PP_EMBEDDED and dynamically
allocated entries to co-exist on a single page. This adds a pointer to
struct vm_page on x86, but shrinks pv_entry to 32 bytes (which also gets
it nicely aligned).

- More elegantly solve the chicken-and-egg problem introduced into the pmap
with radixtree lookup for pages, where we need PTEs mapped and page
allocations to happen under a single hold of the pmap's lock. While here
undo some cut-n-paste.

- Don't adjust pmap_kernel's stats with atomics, because its mutex is now
held in the places the stats are changed.
 1.107  15-Dec-2019  ad uvm_pagerealloc() can now block because of radixtree manipulation, so defer
freeing PTPs until pmap_unmap_ptes(), where we still have the pmap locked
but can finally tolerate context switches again.

To be revisited soon: pmap_map_ptes() seems broken WRT other pmap load.

Reported-by: syzbot+689fb7dab41abff8e75a@syzkaller.appspotmail.com
Reported-by: syzbot+3e7bbf37d37d451b25d7@syzkaller.appspotmail.com
 1.106  08-Dec-2019  ad Merge x86 pmap changes from yamt-pagecache:

- Deal better with the multi-level pmap object locking kludge.
- Handle uvm_pagealloc() being able to block.
 1.105  14-Nov-2019  maxv Add support for Kernel Memory Sanitizer (kMSan). It detects uninitialized
memory used by the kernel at run time, and just like kASan and kCSan, it
is an excellent feature. It has already detected 38 uninitialized variables
in the kernel during my testing, which I have since discreetly fixed.

We use two shadows:
- "shad", to track uninitialized memory with a bit granularity (1:1).
Each bit set to 1 in the shad corresponds to one uninitialized bit of
real kernel memory.
- "orig", to track the origin of the memory with a 4-byte granularity
(1:1). Each uint32_t cell in the orig indicates the origin of the
associated uint32_t of real kernel memory.

The memory consumption of these shadows is substantial, so at least 4GB of
RAM is recommended to run kMSan.

The compiler inserts calls to specific __msan_* functions on each memory
access, to manage both the shad and the orig and detect uninitialized
memory accesses that change the execution flow (like an "if" on an
uninitialized variable).

We mark as uninit several types of memory buffers (stack, pools, kmem,
malloc, uvm_km), and check each buffer passed to copyout, copyoutstr,
bwrite, if_transmit_lock and DMA operations, to detect uninitialized memory
that leaves the system. This allows us to detect kernel info leaks in a way
that is more efficient and also more user-friendly than KLEAK.

Contrary to kASan, kMSan requires comprehensive coverage, ie we cannot
tolerate having one non-instrumented function, because this could cause
false positives. kMSan cannot instrument ASM functions, so I converted
most of them to __asm__ inlines, which kMSan is able to instrument. Those
that remain receive special treatment.

Contrary to kASan again, kMSan uses a TLS, so we must context-switch this
TLS during interrupts. We use different contexts depending on the interrupt
level.

The orig tracks precisely the origin of a buffer. We use a special encoding
for the orig values, and pack together in each uint32_t cell of the orig:
- a code designating the type of memory (Stack, Pool, etc), and
- a compressed pointer, which points either (1) to a string containing
the name of the variable associated with the cell, or (2) to an area
in the kernel .text section which we resolve to a symbol name + offset.

This encoding allows us not to consume extra memory for associating
information with each cell, and produces a precise output, that can tell
for example the name of an uninitialized variable on the stack, the
function in which it was pushed on the stack, and the function where we
accessed this uninitialized variable.

kMSan is available with LLVM, but not with GCC.

The code is organized in a way that is similar to kASan and kCSan, so it
means that other architectures than amd64 can be supported.
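
A sketch of the orig-cell packing described above; the field widths and the
helper name are assumptions, not the committed layout:

    /* pack a memory-type code and a compressed pointer into one
     * 4-byte orig cell */
    static inline uint32_t
    kmsan_orig_encode(uint32_t type, uintptr_t ptr)
    {
            return (type << 28) | (uint32_t)(ptr & 0x0fffffff);
    }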
 1.104  13-Nov-2019  maxv Rename:
PP_ATTRS_M -> PP_ATTRS_D
PP_ATTRS_U -> PP_ATTRS_A
For consistency.
 1.103  05-Oct-2019  maxv Switch to the new PTE naming. No binary diff (tested with MKREPRO).
 1.102  07-Aug-2019  maxv Add support for USER_LDT in SVS. This allows us to have both enabled at
the same time.

We allocate an LDT for each CPU in the GDT and map an area for it, in
addition to the default LDT already present. In context switches between
different processes, we choose between the default or the per-cpu LDT
selector: if the user set specific LDT entries, we memcpy them to the
per-cpu LDT and load the per-cpu selector.

Tested by Naveen Narayanan (with Wine on amd64).
 1.101  29-May-2019  maxv branches: 1.101.2;
Add PCID support in SVS. This avoids TLB flushes during kernel<->user
transitions, which greatly reduces the performance penalty introduced by
SVS.

We use two ASIDs, 0 (kern) and 1 (user), and use invpcid to flush pages
in both ASIDs.

The read-only machdep.svs.pcid={0,1} sysctl is added, and indicates whether
SVS+PCID is in use.
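
Illustrative only: with PCID, returning to userland boils down to a %cr3
load of roughly this shape (the constants and user_pdirpa are assumptions;
bit 63 is the architectural no-flush bit):

    #define PCID_KERN   0ULL            /* kernel ASID */
    #define PCID_USER   1ULL            /* user ASID */
    #define CR3_NOFLUSH (1ULL << 63)    /* keep this PCID's TLB entries */

    /* switch page tables without a full TLB flush */
    lcr3(user_pdirpa | PCID_USER | CR3_NOFLUSH);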
 1.100  10-Mar-2019  maxv Two changes:

* Allow large pages to be passed in pmap_pdes_valid; this happens under
DDB when it reads RIP (.text), called via pmap_extract.

* Invert a branch in pmap_extract, so that 'l_cpu' is not touched if we're
dealing with the kernel pmap.

This fixes 'boot -d'.
 1.99  09-Mar-2019  maxv Start replacing the x86 PTE bits.
 1.98  23-Feb-2019  maxv Move PATENTRY into pmap.h, will be used outside.
 1.97  13-Feb-2019  maxv Add the EPT pmap code, used by Intel-VMX.

The idea is that under NVMM, we don't want to implement the hypervisor page
tables manually in NVMM directly, because we want pageable guests; that is,
we want to allow UVM to unmap guest pages when the host comes under
pressure.

Contrary to AMD-SVM, Intel-VMX uses a different set of PTE bits from
native, and this has three important consequences:

- We can't use the native PTE bits, so each time we want to modify the
page tables, we need to know whether we're dealing with a native pmap
or an EPT pmap. This is accomplished with callbacks, that handle
everything PTE-related.

- There is no recursive slot possible, so we can't use pmap_map_ptes().
Rather, we walk down the EPT trees via the direct map, and that's
actually a lot simpler (and probably faster too...).

- The kernel is never mapped in an EPT pmap. An EPT pmap cannot be loaded
on the host. This has two sub-consequences: at creation time we must
zero out all of the top-level PTEs, and at destruction time we force
the page out of the pool cache and into the pool, to ensure that a next
allocation will invoke pmap_pdp_ctor() to create a native pmap and not
recycle some stale EPT entries.

To create an EPT pmap, the caller must invoke pmap_ept_transform() on a
newly-allocated native pmap. And that's about it, from then on the EPT
callbacks will be invoked, and the pmap can be destroyed via the usual
pmap_destroy(). The TLB shootdown callback is not initialized however,
it is the responsibility of the hypervisor (NVMM) to set it.

There are some twisted cases that we need to handle. For example if
pmap_is_referenced() is called on a physical page that is entered both by
a native pmap and by an EPT pmap, we take the Accessed bits from the
two pmaps using different PTE sets in each case, and combine them into a
generic PP_ATTRS_U flag (that does not depend on the pmap type).

Given that the EPT layout is a 4-Level tree with the same address space as
native x86_64, we allow ourselves to use a few native macros in EPT, such
as pmap_pa2pte(), rather than re-defining them with "ept" in the name.

Even though this EPT code is rather complex, it is not too intrusive: just
a few callbacks in a few pmap functions, predicted-false to give priority
to native. So this comes with no messy #ifdef or performance cost.
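
A minimal sketch of the callback idea (the field name and shape are
assumptions):

    struct pmap {
            /* ... */
            void (*pm_remove)(struct pmap *, vaddr_t, vaddr_t);
    };

    /* call sites predict native as the fast path */
    if (__predict_false(pmap->pm_remove != NULL)) {
            (*pmap->pm_remove)(pmap, sva, eva);     /* EPT variant */
            return;
    }
    /* ... native removal continues below ... */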
 1.96  11-Feb-2019  cherry We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.
 1.95  01-Feb-2019  maxv Add the remaining pmap callbacks, will be used by NVMM-VMX.
 1.94  01-Feb-2019  maxv Change the format of the pp_attrs field: instead of using PTE bits
directly, use abstracted bits that are converted from/to PTE bits when
needed (in pmap_sync_pv).

This allows us to use the same pp_attrs for pmaps that have PTE bits at
different locations.
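
Sketch of the conversion direction described here, using the PTE/PP_ATTRS
names that appear later in this log (revs 1.103/1.104); the bit choices are
assumptions:

    static inline uint8_t
    pmap_pte_to_pp_attrs(pt_entry_t pte)
    {
            uint8_t attrs = 0;

            if (pte & PTE_D)                /* modified */
                    attrs |= PP_ATTRS_D;
            if (pte & PTE_A)                /* accessed */
                    attrs |= PP_ATTRS_A;
            return attrs;
    }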
 1.93  17-Dec-2018  maxv Add two pmap fields, will be used by NVMM-VMX. Also apply a few cosmetic
changes.
 1.92  06-Dec-2018  maxv Fix inconsistency, these are indexes and not types, no real functional
change.
 1.91  19-Nov-2018  maxv Introduce pl_pi, will be used soon.
 1.90  19-Nov-2018  maxv Rename 'mask' -> 'frame', we will use the real 'mask' soon.
 1.89  07-Nov-2018  maxv Add two pmap fields, will be used by NVMM.
 1.88  29-Aug-2018  maxv clean up a little
 1.87  29-Aug-2018  maxv Remove the constants of the DMAP, they are unused, and move NL4_SLOT_DIRECT
into amd64/.
 1.86  29-Aug-2018  maxv Simplify the ASLR stuff, we don't care about resizable areas now, and it
makes the code more complicated for no good reason.
 1.85  20-Aug-2018  maxv Add support for kASan on amd64. Written by me, with some parts inspired
from Siddharth Muralee's initial work. This feature can detect several
kinds of memory bugs, and it's an excellent feature.

It can be enabled by uncommenting these three lines in GENERIC:

#makeoptions KASAN=1 # Kernel Address Sanitizer
#options KASAN
#no options SVS

The kernel is compiled without SVS, without DMAP and without PCPU area.
A shadow area is created at boot time, and it can cover the upper 128TB
of the address space. This area is populated gradually as we allocate
memory. With this design the memory consumption is kept at its lowest
level.

The compiler calls the __asan_* functions each time a memory access is
done. We verify whether this access is legal by looking at the shadow
area.

We declare our own special memcpy/memset/etc functions, because the
compiler's builtins don't add the __asan_* instrumentation.

Initially all the mappings are marked as valid. During dynamic
allocations, we add a redzone, which we mark as invalid. Any access on
it will trigger a kASan error message. Additionally, the compiler adds
a redzone on global variables, and we mark these redzones as invalid too.
The illegal-access detection works with a 1-byte granularity.

For now, we cover three areas:

- global variables
- kmem_alloc-ated areas
- malloc-ated areas

More will come, but that's a good start.
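
What a compiler-inserted check amounts to, as a sketch; the helper names
are hypothetical, not the committed API:

    void
    __asan_load8(unsigned long addr)
    {
            /* each shadow byte covers 8 bytes of kernel memory */
            int8_t s = *kasan_md_addr_to_shad(addr);    /* hypothetical */

            if (__predict_false(s != 0))
                    kasan_report_load(addr, 8);         /* hypothetical */
    }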
 1.84  12-Aug-2018  maxv Move the PCPU area from slot 384 to slot 510, to avoid creating too much
fragmentation in the slot space (384 is in the middle of the kernel half
of the VA).
 1.83  12-Aug-2018  maxv Randomize the main memory on Xen, same as native. Tested on amd64-dom0.
 1.82  12-Aug-2018  maxv Add a new area, SLAREA_HYPV, which indicates the slots used by the
hypervisor, in our case Xen.
 1.81  21-Jul-2018  maxv More ASLR. Randomize the location of the direct map at boot time on amd64.
This doesn't need "options KASLR" and works on GENERIC. Will soon be
enabled by default.

The location of the areas is abstracted in a slotspace structure. Ideally
we should always use this structure when touching the L4 slots, instead of
the current cocktail of global variables and constants.

machdep initializes the structure with the default values, and we then
randomize its dmap entry. Ideally machdep should randomize everything at
once, but in the case of the direct map its size is determined a little
later in the boot procedure, so we're forced to randomize its location
later too.
 1.80  20-Jun-2018  maxv branches: 1.80.2;
Add and use bootspace.smodule. Initialize it in locore/prekern to better
hide the specifics from the "upper" layers. This allows for greater
flexibility.
 1.79  19-May-2018  jakllsch remove some remaining uvm_emap(9)-related function prototypes
 1.78  19-May-2018  jdolecek Remove emap support. Unfortunately it never got to state where it would be
used and usable, due to reliability and limited & complicated MD support.

Going forward, we need to concentrate on interface which do not map anything
into kernel in first place (such as direct map or KVA-less I/O), rather
than making those mappings cheaper to do.
 1.77  08-May-2018  maxv Mitigation for the SS bug, CVE-2018-8897. We disabled dbregs a month ago
in -current and -8 so we are not particularly affected anymore.

The #DB handler runs on ist3, if we decide to process the exception we
copy the iret frame on the correct non-ist stack and continue as usual.
 1.76  04-Mar-2018  jdolecek branches: 1.76.2;
drop pmap_update_2pg(), just call pmap_update_pg() separately for each
 1.75  18-Jan-2018  maxv Unmap the kernel heap from the user page tables (SVS).

This implementation is optimized and organized in such a way that we
don't need to copy the kernel stack to a safe place during user<->kernel
transitions. We create two VAs that point to the same physical page; one
will be mapped in userland and is offset in order to contain only the
trapframe, the other is mapped in the kernel and maps the entire stack.

Sent on tech-kern@ a week ago.
 1.74  11-Jan-2018  maxv Add ist0 to pcpu_entry.
 1.73  05-Jan-2018  maxv Add a __HAVE_PCPU_AREA option, enabled by default on native amd64 but not
Xen.

With this option, the CPU structures that must always be present in the
CPU's page tables are moved on L4 slot 384, which means address
0xffffc00000000000.

A new pcpu_area structure is defined. It contains shared structures (IDT,
LDT), and then an array of pcpu_entry structures, indexed by cpu_index(ci).
Theoretically the LDT should be in the array, but this will be done later.

During the boot procedure, cpu0 calls pmap_init_pcpu, which creates a
page tree that is able to map the pcpu_area structure entirely. cpu0 then
immediately maps the shared structures. Later, every CPU goes through
cpu_pcpuarea_init, which allocates physical pages and kenters the relevant
pcpu_entry to them. Finally, each pointer is replaced to point to pcpuarea.

The point of this change is to make sure that the structures that must
always be present in the page tables have their own L4 slot. Until now
their L4 slot was that of pmap_kernel, and making a distinction between
what must be mapped and what does not need to be was complicated.

Even in the non-speculative-bug case this change makes some sense: there
are several x86 instructions that leak the addresses of the CPU structures,
and putting these structures inside pmap_kernel actually offered a way to
compute the address of the kernel heap - which would have made ASLR on it
plainly useless, had we implemented that.

Note that, for now, pcpuarea does not contain rsp0.

Unfortunately this change adds many #ifdefs, and makes the code harder to
understand. There is also some duplication, but that will be solved later.
 1.72  28-Dec-2017  maxv Use variables in PMAP_DIRECT_*, so that the location of the direct map can
change.
 1.71  11-Nov-2017  maxv Modify the layout of the bootspace structure, in such a way that it can
contain several kernel segments of the same type (eg several .text
segments). Some parts are still a bit messy but will be cleaned up soon.

I cannot compile-test this change on i386, but it seems fine enough.

NOTE: you need to rebuild and reinstall a new prekern after this change.
 1.70  29-Oct-2017  maxv Add a fifth region, called "head". On kaslr kernels it contains the ELF
Header and the ELF Section Headers. On normal kernels it is empty (the
headers are in the "boot" region).

Note: if you're using GENERIC_KASLR, you also need to rebuild the prekern.
 1.69  30-Sep-2017  maxv Add a bootspace structure. It describes the physical and virtual space
layout created by the early kernel bootstrap code. Start using it, and
eliminate several references to KERNBASE and other global symbols. While
here clean up xen-i386, it's really tiring.
 1.68  29-Sep-2017  ozaki-r Fix build

sys/arch/x86/x86/cpu.c:920:20: error: 'pmap_largepages' undeclared (first use in this function)
smp_data.large = (pmap_largepages != 0);
^
 1.67  17-Jun-2017  maxv Actually, use slot 456 instead, so that it fits a cache line.
 1.66  14-Jun-2017  maxv Give the direct map 32 slots (16TB of va). This matches MAXPHYSMEM, in
such a way that the direct map is no longer the limiting factor for high
memory systems.
 1.65  14-Jun-2017  maxv Move the direct map from slot 509 to slot 460. We will increase its size
dynamically.
 1.64  23-Mar-2017  maxv branches: 1.64.6;
Remove PG_k completely.
 1.63  05-Mar-2017  maxv Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.

On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.

However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.

Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.

With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.62  11-Feb-2017  maxv Instead of using a global array with per-cpu indexes, embed the tmp VAs
into cpu_info directly. This concerns only {i386, Xen-i386, Xen-amd64},
because amd64 already has a direct map that is way faster than that.

There are two major issues with the global array: maxcpus entries are
allocated while it is unlikely that common i386 machines have so many
cpus, and the base VA of these entries is not cache-line-aligned, which
mostly guarantees cache-line-thrashing each time the VAs are entered.

Now the number of tmp VAs allocated is proportionate to the number of CPUs
attached (which therefore reduces memory consumption), and the base is
properly aligned.

On my 3-core AMD, the number of DC_refills_L2 events triggered when
performing 5x10^6 calls to pmap_zero_page on two dedicated cores is on
average divided by two with this patch.

Discussed on tech-kern a little.
 1.61  08-Nov-2016  christos branches: 1.61.2;
PR/49691: KAMADA Ken'ichi: free deferred ptp mappings if present.
XXX: pullup-7
 1.60  19-Sep-2016  maya move function prototype to x86, so it is available to amd64 too
 1.59  25-Jul-2016  maxv The L1 entry of the first page of the data segment is overwritten for the
LAPIC page, and set as RWX+PG_N. The LAPIC pa is fixed, and its va resides
in the data segment. Because of this error-prone design, the kernel image
map is not linear, and I first thought it was a bug (as I vaguely said in
PR/51148). Using large pages for the data segment is therefore wrong, since
the first page does not actually belong to the data segment (even if its va
is in the range). This bug is not triggered currently, since local_apic is
not large-page-aligned.

We will certainly have to allocate a va dynamically instead of using the
first page of data; but for now, disable large pages on the data segment,
and map the LAPIC as RW.

This is the last x86-specific RWX page.
 1.58  01-Jul-2016  maxv branches: 1.58.2;
Define pmap_pg_nx globally. Will be used soon.
 1.57  11-Nov-2015  skrll Split out the pmap_pv_track stuff for use by others.

Discussed with riastradh@
 1.56  03-Apr-2015  riastradh Implement pmap_pv(9) for x86 for P->V tracking of unmanaged pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html
 1.55  17-Oct-2013  christos branches: 1.55.4; 1.55.6;
__USE() unused variables
 1.54  23-Jun-2013  uebayasi branches: 1.54.2;
Remove obsolete comment. OK'ed by rmind@.
 1.53  13-Nov-2012  chs add a pmap_kremove_local() that doesn't do TLB invalidations
on other CPUs. this is only intended for use while writing
kernel crash dumps. remove unused pmap_map().
 1.52  20-Apr-2012  rmind branches: 1.52.2;
- Convert x86 MD code, mainly pmap(9) e.g. TLB shootdown code, to use
kcpuset(9) and thus replace hardcoded CPU bitmasks. This removes the
limitation of maximum CPUs.

- Support up to 256 CPUs on amd64 architecture by default.

Bug fixes, improvements, completion of Xen part and testing on 64-core
AMD Opteron(tm) Processor 6282 SE (also, as Xen HVM domU with 128 CPUs)
by Manuel Bouyer.
 1.51  11-Mar-2012  jym Alternate PTEs got killed a few weeks ago. Clean up unused prototypes.
 1.50  17-Feb-2012  bouyer Apply patch proposed in PR port-xen/45975 (this does not solve the exact
problem reported here but is part of the solution):
xen_kpm_sync() is not working as expected,
leading to races between CPUs.
1) the check (xpq_cpu != &x86_curcpu) is always false because we
have different x86_curcpu symbols with different addresses in the kernel.
Fortunately, all addresses disassemble to the same code.
Because of this we always use the code intended for bootstrap, which doesn't
use cross-calls or lock.

2) once 1) above is fixed, xen_kpm_sync() will use xcalls to sync other CPUs,
which cause it to sleep and pmap.c doesn't like that. It triggers this
KASSERT() in pmap_unmap_ptes():
KASSERT(pmap->pm_ncsw == curlwp->l_ncsw);
3) pmap->pm_cpus is not safe for the purpose of xen_kpm_sync(), which
needs to know on which CPU a pmap is loaded *now*:
pmap->pm_cpus is cleared before cpu_load_pmap() is called to switch
to a new pmap, leaving a window where a pmap is still in a CPU's
ci_kpm_pdir but not in pm_cpus. As a virtual CPU may be preempted
by the hypervisor at any time, it can be large enough to let another
CPU free the PTP and reuse it as a normal page.

To fix 2), avoid cross-calls and IPIs completely, and instead
use a mutex to update all CPU's ci_kpm_pdir from the local CPU.
It's safe because we just need to update the table page, a tlbflush IPI will
happen later. As a side effect, we don't need a different code for bootstrap,
fixing 1). The mutex added to struct cpu needs a small headers reorganisation.

To fix 3), introduce a pm_xen_ptp_cpus which is updated from
cpu_load_pmap(), with the ci_kpm_mtx mutex held. Checking it with
ci_kpm_mtx held will avoid overwriting the wrong pmap's ci_kpm_pdir.

While there I removed the unused pmap_is_active() function;
and added some more details to DIAGNOSTIC panics.
 1.49  04-Dec-2011  chs branches: 1.49.2;
map all of physical memory using large pages.
ported from openbsd years ago by Murray Armfield,
updated for changes since then by me.
 1.48  23-Nov-2011  jym branches: 1.48.2;
No more users of xpmap_update(). Use pmap_pte_*() functions now.
 1.47  23-Nov-2011  jym Move Xen-specific functions to Xen pmap. Requested by cherry@.

Un'ifdef XEN in xen_pmap.c, it is always defined there.
 1.46  20-Nov-2011  jym Expose pmap_pdp_cache publicly to x86/xen pmap. Provide suspend/resume
callbacks for Xen pmap.

Make the internal callbacks of pmap_pdp_cache static.

XXX the implementation of pool_cache_invalidate(9) is still wrong, and
IMHO this needs fixing before -6. See
http://mail-index.netbsd.org/tech-kern/2011/11/18/msg011924.html
 1.45  08-Nov-2011  cherry Expose the PG_k #define pt/pd bit to both xen and "baremetal" x86. This is required, since kernel pages are mapped with user permissions in XEN/amd64 since the VM kernel runs in ring3. Since XEN/i386(including PAE) runs in ring1, supervisor mode is appropriate for these ports. We need to share this since the pmap implementation is still shared. Once the xen implementation is sufficiently independant of the x86 one, this can be made private to xen/include/xenpmap.h
 1.44  06-Nov-2011  cherry [merging from cherry-xenmp] Make the xen MMU op queue locking api private. Implement per-cpu queues.
 1.43  18-Oct-2011  jym branches: 1.43.2;
Make "pmaps" (list of non-kernel pmaps) and "pmaps_lock" externally
visible. Required by pmap MD code that could reside in other
files, notably Xen's pmap.
 1.42  20-Sep-2011  jym Merge jym-xensuspend branch in -current. ok bouyer@.

Goal: save/restore support in NetBSD domUs, for i386, i386 PAE and amd64.

Executive summary:
- split all Xen drivers (xenbus(4), grant tables, xbd(4), xennet(4))
in two parts: suspend and resume, and hook them to pmf(9).
- modify pmap so that Xen hypervisor does not cry out loud in case
it finds "unexpected" recursive memory mappings
- provide a sysctl(7), machdep.xen.suspend, to command suspend from
userland via powerd(8). Note: a suspend can only be handled correctly
when dom0 requested it, so provide a mechanism that will prevent the
kernel from blindly validating the user's commands

The code is still in experimental state, use at your own risk: restore
can corrupt backend communications rings; this can completely thrash
dom0 as it will loop at a high interrupt level trying to honor
all domU requests.

XXX PAE suspend does not work in amd64 currently, due to (yet again!)
page validation issues with hypervisor. Will fix.

XXX secondary CPUs are not suspended, I will write the handlers
in sync with cherry's Xen MP work.

Tested under i386 and amd64, bear in mind ring corruption though.

No build break expected, GENERICs and XEN* kernels should be fine.
./build.sh distribution still running. In any case: sorry if it does
break for you, contact me directly for reports.
 1.41  13-Aug-2011  cherry Add locking around ops to the hypervisor MMU "queue".
 1.40  13-Jun-2011  tls Fix Xen kernel builds (pmap_is_curpmap can't be static)
 1.39  12-Jun-2011  rmind Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
 1.38  07-May-2011  jym branches: 1.38.2;
Do as the comment says, use ilog2(). This gets optimized directly at
compile time, no call to fls() is needed.
 1.37  25-Apr-2011  yamt comment
 1.36  25-Apr-2011  yamt remove unused ptei
 1.35  11-Feb-2011  jmcneill add bus_space_mmap support for BUS_SPACE_MAP_PREFETCHABLE, ok matt@
 1.34  01-Feb-2011  chuck udpate license clauses on my code to match the new-style BSD licenses.
remove no-longer-valid wustl email address for me.
based on diff that rmind@ sent me.

no functional change with this commit.
 1.33  24-Jul-2010  jym branches: 1.33.2; 1.33.4;
Welcome PAE inside i386 current.

This patch is inspired by work previously done by Jeremy Morse, ported by me
to -current, merged with the work previously done for port-xen, together with
additionals fixes and improvements.

PAE option is disabled by default in GENERIC (but will be enabled in ALL in
the next few days).

In short, PAE switches the CPU to a mode where physical addresses become
36 bits (64 GiB). Virtual address space remains at 32 bits (4 GiB). To cope
with the increased size of physical addresses, they are manipulated as
64-bit variables by the kernel and MMU.

When supported by the CPU, it also allows the use of the NX/XD bit that
provides no-execution right enforcement on a per physical page basis.

Notes:

- reworked locore.S

- introduce cpu_load_pmap(), used to switch pmap for the curcpu. Due to the
different handling of pmap mappings with PAE vs !PAE, Xen vs native, details
are hidden within this function. This helps calling it from assembly,
as some features, like BIOS calls, switch to pmap_kernel before mapping
trampoline code in low memory.

- some changes in bioscall and kvm86_call, to reflect the above.

- the L3 is "pinned" per-CPU, and is only manipulated by a
reduced set of functions within pmap. To track the L3, I added two
elements to struct cpu_info, namely ci_l3_pdirpa (PA of the L3), and
ci_l3_pdir (the L3 VA). Rest of the code considers that it runs "just
like" a normal i386, except that the L2 is 4 pages long (PTP_LEVELS is
still 2).

- similar to the ci_pae_l3_pdir{,pa} variables, amd64's xen_current_user_pgd
becomes an element of cpu_info (slowly paving the way for MP world).

- bootinfo_source struct declaration is modified, to cope with paddr_t size
change with PAE (it is not correct to assume that bs_addr is a paddr_t when
compiled with PAE - it should remain 32 bits). bs_addrs is now a
void * array (in bootloader's code under i386/stand/, the bs_addrs
is a physaddr_t, which is an unsigned long).

- fixes in multiboot code (same reason as bootinfo): paddr_t size
change. I used Elf32_* types, use RELOC() where necessary, and move the
memcpy() functions out of the if/else if (I do not expect sym and str tables
to overlap with ELF).

- 64 bits atomic functions for pmap

- all pmap_pdirpa access are now done through the pmap_pdirpa macro. It
hides the L3/L2 stuff from PAE, as well as the pm_pdirpa change in
struct pmap (it now becomes a PDP_SIZE array, with or without PAE).

- manipulation of recursive mappings ( PDIR_SLOT_{,A}PTEs ) is done via
loops on PDP_SIZE.

See also http://mail-index.netbsd.org/port-i386/2010/07/17/msg002062.html

No objection raised on port-i386@ and port-xen@ for about a week.

XXX kvm(3) will be fixed in another patch to properly handle both PAE and !PAE
kernel dumps (VA => PA macros are slightly different, and need proper 64 bits
PA support in kvm_i386).

XXX Mixing PAE and !PAE modules may lead to unwanted/unexpected results. This
cannot be solved easily, and needs lots of thinking before being declared
safe (paddr_t/bus_addr_t size handling, PD/PT macros abstractions).
 1.32  15-Jul-2010  jym Make the comment about PDPpaddr more thorough.
 1.31  06-Jul-2010  cegger Turn PMAP_NOCACHE into MI flag.
Add MI flags PMAP_WRITE_COMBINE, PMAP_WRITE_BACK, PMAP_NOCACHE_OVR.
Update pmap(9) manpage.

hppa: Remove MD PMAP_NOCACHE flag as it exists as MI flag
mips: Rename MD PMAP_NOCACHE to PGC_NOCACHE.

x86: Implement new MI flags using Page-Attribute Tables.
x86: Implement BUS_SPACE_MAP_PREFETCHABLE.

Patch presented on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2010/06/30/msg008458.html

No comments on this last version.
 1.30  10-May-2010  dyoung Provide pmap_enter_ma(), pmap_extract_ma(), pmap_kenter_ma() in all x86
kernels, and use them in the bus_space(9) implementation instead of ugly
Xen #ifdef-age. In a non-Xen kernel, the _ma() functions either call or
alias the equivalent _pa() functions.

Reviewed on port-xen@netbsd.org and port-i386@netbsd.org. Passes
rmind@'s and bouyer@'s inspection. Tested on i386 and on Xen DOMU /
DOM0.
 1.29  09-Feb-2010  jym branches: 1.29.2;
Fix typos in comments.
 1.28  11-Nov-2009  cegger branches: 1.28.2;
update comment: we use PMAP_NOCACHE for both pmap_enter and pmap_kenter_pa
 1.27  07-Nov-2009  cegger Add a flags argument to pmap_kenter_pa(9).
Patch showed on tech-kern@ http://mail-index.netbsd.org/tech-kern/2009/11/04/msg006434.html
No objections.
 1.26  19-Jul-2009  rmind pmap_emap_sync: add an argument, and do not perform pmap_load() during
context switch (pmap_destroy() path seems to be unsafe), instead just
perform tlbflush(). Slightly inefficient, but good enough for now.
 1.25  28-Jun-2009  rmind Ephemeral mapping (emap) implementation. Concept is based on the idea that
activity of other threads will perform the TLB flush for the processes using
emap as a side effect. To track that, global and per-CPU generation numbers
are used. This idea was suggested by Andrew Doran; various improvements to
it by me. Notes:

- For now, zero-copy on pipe is not yet enabled.
- TCP socket code would likely need more work.
- Additional UVM loaning improvements are needed.

Proposed on <tech-kern>, silence there.
Quickly reviewed by <ad>.
 1.24  22-Apr-2009  cegger change pmap flags argument from int to u_int.
forgot to commit this.
 1.23  18-Apr-2009  cegger Introduce PMAP_NOCACHE as first PMAP MD bit in x86. Make use of it in pmap_enter().
This saves one extra TLB flush when mapping dma-safe memory.
Presented on tech-kern@, port-i386@ and port-amd64@
ok ad@
 1.22  21-Mar-2009  ad PR port-i386/40143 Viewing an mpeg transport stream with mplayer causes crash

Fix numerous problems:

1. LDT updates are not atomic.

2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.

3. LDTR can be leaked over context switch.

4. GDT slot allocations can race, giving the same LDT slot to two procs.

5. Incomplete interrupt/trap frames can be stacked.

6. In some rare cases segment faults are not handled correctly.
 1.21  09-Dec-2008  pooka branches: 1.21.2;
Make pmap_kernel() an MI macro for struct pmap *kernel_pmap_ptr,
which is now the "API" provided by the pmap module. pmap_kernel()
remains as the syntactic sugar.

Bonus cosmetics round: move all the pmap_t pointer typedefs into
uvm_pmap.h.

Thanks to Greg Oster for providing cpu muscle for doing test builds.
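
The shape of the resulting MI interface, as a sketch of what uvm_pmap.h
provides:

    extern struct pmap *const kernel_pmap_ptr;

    /* pmap_kernel() remains as the syntactic sugar */
    #define pmap_kernel()   kernel_pmap_ptr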
 1.20  16-Sep-2008  bouyer branches: 1.20.2; 1.20.4;
Implement the arch-dependent p2m frame lists list. This adds support for
'xm dump-core' for NetBSD domUs.
From Jean-Yves Migeon (jean-yves dot migeon at espci dot fr)
 1.19  24-Jun-2008  jmcneill branches: 1.19.2;
Define PMAP_FORK -- this was lost in the vmlocking merge, and is required
by options USER_LDT.
 1.18  05-Jun-2008  ad branches: 1.18.2;
pmap_remove_all() for x86. Also, always defer freeing ptps to pmap_update().
There may be a better way to do this, but for now this is simple and avoids
potential bugs.

Proposed on tech-kern and discussed with chs@.
 1.17  04-Jun-2008  ad Revert unintentional change.
 1.16  04-Jun-2008  ad vm_page: put TAILQ_ENTRY into a union with LIST_ENTRY, so we can use both.
 1.15  02-Jun-2008  ad - Don't bother using sse to copy/zero pages on demand. It turns out not
to be worth it.
- If the machine has sse, re-enable zeroing pages in the idle loop and
use the sse instructions so that we don't blow out the cache.
 1.14  03-May-2008  ad branches: 1.14.2;
Back out previous which was not thought through properly.
 1.13  03-May-2008  ad Implement pmap_remove_all().
 1.12  23-Jan-2008  bouyer branches: 1.12.6; 1.12.8; 1.12.10;
Merge the bouyer-xeni386 branch. This brings in PAE support to NetBSD xeni386
(domU only). PAE support is enabled by 'options PAE', see the new XEN3PAE_DOMU
and INSTALL_XEN3PAE_DOMU kernel config files.

See the comments in arch/i386/include/{pte.h,pmap.h} to see how it works.
In short, we still handle it as a 2-level MMU, with the second level page
directory being 4 pages in size. pmap switching is done by switching the
L2 pages in the L3 entries, instead of loading %cr3. This is almost required
by Xen, which handles the last L2 page (the one mapping 0xc0000000 - 0xffffffff)
in a very special way. But this approach should also work for native PAE
support if ever supported (in fact, the pmap should almost support native
PAE; what's missing is bootstrap code in locore.S).
 1.11  20-Jan-2008  yamt - rewrite P->V tracking.
- use a hash rather than SPLAY trees.
SPLAY tree is a wrong algorithm to use here.
will be revisited if it slows down anything other than
micro-benchmarks.
- optimize the single mapping case (it's a common case) by
embedding an entry into mdpage.
- don't keep a pmap pointer as it can be obtained from ptp.
(discussed on port-i386 some years ago.)
ideally, a single paddr_t should be enough to describe a pte.
but it needs some more thoughts as it can increase computational
costs.
- pmap_enter: simplify and fix races with pmap_sync_pv.
- don't bother to lock pm_obj[i] where i > 0, unless DIAGNOSTIC.
- kill mp_link to save space.
- add many KASSERTs.
 1.10  11-Jan-2008  bouyer Merge the bouyer-xeni386 branch to head, at tag bouyer-xeni386-merge1 (the
branch is still active and will see i386 PAE support development).
Summary of changes:
- switch xeni386 to the x86/x86/pmap.c, and the xen/x86/x86_xpmap.c
pmap bootstrap.
- merge back most of xen/i386/ to i386/i386
- change the build to reduce diffs between i386 and amd64 in file locations
- remove include files that were identical to the i386/amd64 counterparts,
the build will find them via the xen-ma/machine link.
 1.9  08-Jan-2008  yamt kill unused PMF_USER_RELOAD.
 1.8  02-Jan-2008  yamt g/c pv_page stuffs.
 1.7  25-Dec-2007  perry Convert many of the uses of __attribute__ to equivalent
__packed, __unused and __dead macros from cdefs.h
 1.6  09-Dec-2007  jmcneill branches: 1.6.2;
Merge jmcneill-pm branch.
 1.5  22-Nov-2007  bouyer branches: 1.5.2; 1.5.4;
Pull up the bouyer-xenamd64 branch to HEAD. This brings in amd64 support
to NetBSD/Xen, both Dom0 and DomU.
 1.4  15-Nov-2007  ad Remove support for 80386 level CPUs. PR port-i386/36163.
 1.3  07-Nov-2007  ad Merge from vmlocking:

- pool_cache changes.
- Debugger/procfs locking fixes.
- Other minor changes.
 1.2  18-Oct-2007  yamt branches: 1.2.2; 1.2.4; 1.2.6; 1.2.8; 1.2.10;
merge yamt-x86pmap branch.

- reduce differences between amd64 and i386. notably, share pmap.c
between them. it makes several i386 pmap improvements available to
amd64, including tlb shootdown reduction and bug fixes from Stephan Uphoff.
- implement deferred pmap switching for amd64.
- remove LARGEPAGES option. always use large pages if available.
also, make it work on amd64.
 1.1  08-Oct-2007  yamt branches: 1.1.2; 1.1.4;
file pmap.h was initially added on branch yamt-x86pmap.
 1.1.4.4  18-Nov-2007  bouyer Sync with HEAD
 1.1.4.3  13-Nov-2007  bouyer Sync with HEAD
 1.1.4.2  25-Oct-2007  bouyer Finish sync with HEAD. Especially use the new x86 pmap for xenamd64.
For this:
- rename pmap_pte_set() to pmap_pte_testset()
- make pmap_pte_set() a function or macro for non-atomic PTE write
- define and use pmap_pa2pte()/pmap_pte2pa() to read/write PTE entries
- define pmap_pte_flush() which is a nop in x86 case, and flush the
MMUops queue in the Xen case
 1.1.4.1  25-Oct-2007  bouyer Sync with HEAD.
 1.1.2.3  18-Oct-2007  yamt #ifdef out an unused member for x86_64.
 1.1.2.2  14-Oct-2007  yamt move pl_i_roundup to a header.
 1.1.2.1  08-Oct-2007  yamt merge some parts of x86 pmap.h.
 1.2.10.5  23-Mar-2008  matt sync with HEAD
 1.2.10.4  09-Jan-2008  matt sync with HEAD
 1.2.10.3  08-Nov-2007  matt sync with -HEAD
 1.2.10.2  06-Nov-2007  matt sync with HEAD
 1.2.10.1  18-Oct-2007  matt file pmap.h was added on branch matt-armv6 on 2007-11-06 23:23:38 +0000
 1.2.8.4  18-Feb-2008  mjf Sync with HEAD.
 1.2.8.3  27-Dec-2007  mjf Sync with HEAD.
 1.2.8.2  08-Dec-2007  mjf Sync with HEAD.
 1.2.8.1  19-Nov-2007  mjf Sync with HEAD.
 1.2.6.6  04-Feb-2008  yamt sync with head.
 1.2.6.5  21-Jan-2008  yamt sync with head
 1.2.6.4  07-Dec-2007  yamt sync with head
 1.2.6.3  15-Nov-2007  yamt sync with head.
 1.2.6.2  27-Oct-2007  yamt sync with head.
 1.2.6.1  18-Oct-2007  yamt file pmap.h was added on branch yamt-lazymbuf on 2007-10-27 11:28:56 +0000
 1.2.4.6  27-Nov-2007  joerg Sync with HEAD. amd64 Xen support needs testing.
 1.2.4.5  21-Nov-2007  joerg Sync with HEAD.
 1.2.4.4  11-Nov-2007  joerg Sync with HEAD.
 1.2.4.3  28-Oct-2007  joerg Cosmetic: reduce diff to HEAD.
 1.2.4.2  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.2.4.1  18-Oct-2007  joerg file pmap.h was added on branch jmcneill-pm on 2007-10-26 15:43:44 +0000
 1.2.2.4  03-Dec-2007  ad Sync with HEAD.
 1.2.2.3  24-Oct-2007  ad Use a pool_cache to allocate pv entries. PR port-i386/37193.
 1.2.2.2  23-Oct-2007  ad Sync with head.
 1.2.2.1  18-Oct-2007  ad file pmap.h was added on branch vmlocking on 2007-10-23 20:36:40 +0000
 1.5.4.1  11-Dec-2007  yamt sync with head.
 1.5.2.1  26-Dec-2007  ad Sync with head.
 1.6.2.7  20-Jan-2008  bouyer Sync with HEAD
 1.6.2.6  17-Jan-2008  bouyer - Fix L2_SLOT_APTE value (not sure how I got this value but it was definitively
wrong)
- Use global variable for the PAE L3 page adresses, so that pmap.c can get it
from the bootstrap code
- Extent the size of our virtual PDP from 3 to 4 pages, so that pmap->pm_pdir[]
is contigous for the whole VA range. The last page is a shadow of
the kernel's real PDP (L3[3]).
- make pm_pdirpa an array of 4 paddr_t if using PAE. introduce a
pmap_pdirpa macro to get the physical address of a given PD entry.
- fix pmap_map_pte

The kernel now boots single-user. fsck will cause a kernel fault in
pmap_pdes_invalid() on exit.
 1.6.2.5  13-Jan-2008  bouyer Work in progress on xeni386 PAE support:
Make xeni386 build with a 64bit paddr_t. For this, vaddr_t vs. paddr_t vs.
pointer usage had to be clarified.
If 'options PAE' is present in a Xen3 kernel, switch paddr_t, pd_entry_t
and pt_entry_t to 64bits, and add the PAE entry in the __xen_guest ELF section.
 1.6.2.4  10-Jan-2008  bouyer Sync with HEAD
 1.6.2.3  02-Jan-2008  bouyer Sync with HEAD
 1.6.2.2  13-Dec-2007  bouyer - make amd64 XEN3 kernels build again
- pin the pdp pages in the PDP cache constructor, and unpin them in the
destructor. garbage-collect PMF_USER_XPIN.
 1.6.2.1  11-Dec-2007  bouyer Switch i386 to x86/x86/pmap.c
 1.12.10.5  11-Aug-2010  yamt sync with head.
 1.12.10.4  11-Mar-2010  yamt sync with head
 1.12.10.3  19-Aug-2009  yamt sync with head.
 1.12.10.2  18-Jul-2009  yamt sync with head.
 1.12.10.1  04-May-2009  yamt sync with head.
 1.12.8.2  17-Jun-2008  yamt sync with head.
 1.12.8.1  04-Jun-2008  yamt sync with head
 1.12.6.4  17-Jan-2009  mjf Sync with HEAD.
 1.12.6.3  28-Sep-2008  mjf Sync with HEAD.
 1.12.6.2  29-Jun-2008  mjf Sync with HEAD.
 1.12.6.1  05-Jun-2008  mjf Sync with HEAD.

Also fix build.
 1.14.2.3  24-Sep-2008  wrstuden Merge in changes between wrstuden-revivesa-base-2 and
wrstuden-revivesa-base-3.
 1.14.2.2  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.14.2.1  23-Jun-2008  wrstuden Sync w/ -current. 34 merge conflicts to follow.
 1.18.2.1  27-Jun-2008  simonb Sync with head.
 1.19.2.2  13-Dec-2008  haad Update haad-dm branch to haad-dm-base2.
 1.19.2.1  19-Oct-2008  haad Sync with HEAD.
 1.20.4.1  04-Apr-2009  snj Pull up following revision(s) (requested by ad in ticket #656):
sys/arch/amd64/amd64/gdt.c: revision 1.21 via patch
sys/arch/amd64/amd64/machdep.c: revision 1.129 via patch
sys/arch/i386/i386/gdt.c: revision 1.47 via patch
sys/arch/i386/i386/kvm86.c: revision 1.17 via patch
sys/arch/i386/i386/locore.S: revision 1.85 via patch
sys/arch/i386/i386/machdep.c: revision 1.666 via patch
sys/arch/i386/i386/vector.S: revision 1.45 via patch
sys/arch/i386/include/pcb.h: revision 1.47 via patch
sys/arch/x86/include/pmap.h: revision 1.22 via patch
sys/arch/x86/include/sysarch.h: revision 1.8 via patch
sys/arch/x86/x86/pmap.c: revision 1.80 via patch
sys/arch/x86/x86/sys_machdep.c: revision 1.17 via patch
sys/compat/linux/arch/i386/linux_machdep.c: revision 1.143 via patch
sys/kern/init_main.c: revision 1.384 via patch
PR port-i386/40143 Viewing an mpeg transport stream with mplayer causes crash
Fix numerous problems:
1. LDT updates are not atomic.
2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.
3. LDTR can be leaked over context switch.
4. GDT slot allocations can race, giving the same LDT slot to two procs.
5. Incomplete interrupt/trap frames can be stacked.
6. In some rare cases segment faults are not handled correctly.
 1.20.2.2  28-Apr-2009  skrll Sync with HEAD.
 1.20.2.1  19-Jan-2009  skrll Sync with HEAD.
 1.21.2.11  27-Aug-2011  jym Sync with HEAD. Most notably: uvm/pmap work done by rmind@, and MP Xen
work of cherry@.

No regression observed on suspend/restore.
 1.21.2.10  26-May-2011  jym Pull-up some modifications from -current to my branch.
 1.21.2.9  02-May-2011  jym Sync with head.
 1.21.2.8  28-Mar-2011  jym Sync with HEAD. TODO before merge:
- shortcut for suspend code in sysmon, when powerd(8) is not running.
Borrow ``xs_watch'' thread context?
- bug hunting in xbd + xennet resume. Rings are currently thrashed upon
resume, so the current implementation force-flushes them on suspend. It's not
really needed.
 1.21.2.7  24-Oct-2010  jym Sync with HEAD
 1.21.2.6  01-Nov-2009  jym - Upgrade suspend/resume code to comply with Xen2 removal.
- Add support for PAE domUs suspend/resume.
- Fix an issue regarding initialization of the xbd ring I/O that could end
badly during resume, with invalid block operations submitted to dom0 backend.

NetBSD supports PAE under x86_32 by considering the L2 page as being
4 pages long instead of 1.

Xen validates the page types during resume. Sadly, the hypervisor handles
alternative recursive mappings (== PG/PD entries pointing to pages other
than self) inadequately, which could lead to incorrect page pinning.

As a result, the important change with this patch is to clear these alternative
mappings during suspend, and reset them back to their former selves upon
resume. For PAE, all 4 PDIR_SLOT_PTE entries can be considered
alternative recursive mappings.

See comments in pmap.c for further details.

Now, let the testing and bug hunting begin.
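A minimal sketch of the suspend/resume handling described above, assuming
the usual x86 pmap definitions (pd_entry_t, PDIR_SLOT_PTE, PDP_SIZE == 4
on PAE, pmap_pte_set()); the save area and function names are illustrative:

	/*
	 * Sketch: clear the 4 recursive PDIR_SLOT_PTE entries before
	 * suspend, since Xen mishandles alternative recursive mappings
	 * while pinning, and restore them on resume.
	 */
	static pd_entry_t saved_rec[PDP_SIZE];	/* hypothetical save area */

	static void
	pmap_suspend_clear_recursive(pd_entry_t *pdir)
	{
		for (int i = 0; i < PDP_SIZE; i++) {
			saved_rec[i] = pdir[PDIR_SLOT_PTE + i];
			pmap_pte_set(&pdir[PDIR_SLOT_PTE + i], 0);
		}
	}

	static void
	pmap_resume_restore_recursive(pd_entry_t *pdir)
	{
		for (int i = 0; i < PDP_SIZE; i++)
			pmap_pte_set(&pdir[PDIR_SLOT_PTE + i], saved_rec[i]);
	}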
 1.21.2.5  01-Nov-2009  jym Sync with HEAD.
 1.21.2.4  24-Jul-2009  jym - rework the page pinning API, so that a function is now provided for
each level of indirection encountered during virtual memory translation. Update
pmap accordingly. Pinning looks cleaner that way, and it offers the possibility
of pinning lower-level pages if necessary (NetBSD does not currently do so).

- some fixes and comments to explain how page validation/invalidation take
place during save/restore/migrate under Xen. L2 shadow entries from PAE are now
handled, so basically, suspend/resume works with PAE.

- fixes an issue reported by Christoph (cegger@) for xencons suspend/resume
in dom0.

TODO:

- PAE save/restore is currently limited to single-user only, multi-user
support requires modifications in PAE pmap that should be discussed first. See
the comments about the L2 shadow pages cached in pmap_pdp_cache in this commit.

- grant table bug is still there; do not use the kernels of this branch
to test suspend/resume, unless you want to experience bad crashes in dom0,
and push the big red button.

Now there is light at the end of the tunnel :)

Note: XEN2 kernels will neither build nor work with this branch.
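The per-level pin functions mentioned above can be pictured as thin
wrappers over a generic queue operation, one per level of the page-table
hierarchy; a sketch assuming xpq_queue_pin_table() enqueues the Xen
mmuext op (MMUEXT_PIN_L[1-4]_TABLE are Xen interface constants):

	static inline void
	xpq_queue_pin_l1_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L1_TABLE);
	}

	static inline void
	xpq_queue_pin_l2_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L2_TABLE);
	}

	static inline void
	xpq_queue_pin_l3_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L3_TABLE);
	}

	static inline void
	xpq_queue_pin_l4_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L4_TABLE);
	}

This shape lets a PAE kernel pin its four L2 pages individually, and
leaves room to pin lower-level pages later if wanted.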
 1.21.2.3  23-Jul-2009  jym Sync with HEAD.
 1.21.2.2  31-May-2009  jym Modifications for the Xen suspend/migrate/resume branch:

- introduce xenbus_device_{suspend,resume}() functions. These are routines
used to suspend/resume MI parts of the Xenbus device interfaces, like updating
frontend/backend devices' paths found in XenStore.

- introduce HYPERVISOR_sysctl(), a hypercall used only by Xentools to obtain
information from the hypervisor (listing VMs, printing console, etc.). I use it
to query xenconsole from ddb(), as a last resort in case of a panic() in
dom0 (xm not being available). Currently unused in the branch; could be, if
requested.

- disable the rwlock(9) used to protect code that could use transient MFNs.
It could trigger nasty context switches in places where it should not.

- fix some bugs in the xennet/xbd suspend/resume pmf(9) handlers.

- following XenSource's design, talk_to_otherend() is now called
watch_otherend(), and free_otherend_details() is used by Xenbus device
suspend/resume routines.

- some slight modifications in pmap regarding APDP. Introduce an inline
function (pmap_unmap_apdp_pde()) that clears APDP entry for the current pmap.

- similarly, implement pmap_unmap_all_apdp_pdes() that iterates through all
pmaps and tears down APDP, as Xen does not handle them properly.

TODO/XXX:

- pmap_unmap_apdp_pde() does not handle the APDP shadow entry of PAE. It will,
once I figure out how PAE uses it.

- revisit the pmap locking issue regarding transient MFNs. As NetBSD does not
use kernel preemption and MP for Xen, this can be skipped for now. See
http://mail-index.netbsd.org/port-xen/2009/04/27/msg004903.html for details.

- fix a bug regarding grant tables which could technically DoS a dom0 if
ridiculously high consumer/producer indexes are passed down in the ring during
a resume.

All in all, once the grant table index issue and APDP PAE are fixed, next step
is to torture test this branch.

Tested under i386 PAE and non-PAE, Xen3 dom0 and domU. amd64 is only
compile-tested.
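A sketch of the two APDP helpers described above, assuming the x86 pmap's
APDP_PDE slot, the global pmaps list with its pmaps_lock, and
pmap_pte_set()/pmap_pte_flush(); the per-pmap clearing helper is
hypothetical:

	static inline void
	pmap_unmap_apdp_pde(void)
	{
		/* Clear the alternative recursive slot of the current pmap. */
		pmap_pte_set(APDP_PDE, 0);
		pmap_pte_flush();
	}

	static void
	pmap_unmap_all_apdp_pdes(void)
	{
		struct pmap *pm;

		/* Xen does not handle APDP entries properly, so tear
		 * them all down before handing control to it. */
		mutex_enter(&pmaps_lock);
		LIST_FOREACH(pm, &pmaps, pm_list) {
			pmap_clear_apdp(pm);	/* hypothetical helper */
		}
		mutex_exit(&pmaps_lock);
	}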
 1.21.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.28.2.2  17-Aug-2010  uebayasi Sync with HEAD.
 1.28.2.1  30-Apr-2010  uebayasi Sync with HEAD.
 1.29.2.11  31-May-2011  rmind sync with head
 1.29.2.10  19-May-2011  rmind Implement sharing of vnode_t::v_interlock amongst vnodes:
- Lock is shared amongst UVM objects using uvm_obj_setlock() or getnewvnode().
- Adjust vnode cache to handle unsharing, add VI_LOCKSHARE flag for that.
- Use sharing in tmpfs and layerfs for underlying object.
- Simplify locking in ubc_fault().
- Sprinkle some asserts.

Discussed with ad@.
 1.29.2.9  17-Mar-2011  rmind - Fix tlbflushg() to behave like tlbflush() if the page global extension (PGE)
is not (yet) enabled. This fixes an issue with a stale TLB entry, experienced
early on boot, when PGE is not yet set on the primary CPU.
- Rewrite i386/amd64 TLB interrupt handlers in C (only stubs are in assembly),
which simplifies and unifies (under x86) code, plus fixes a few bugs.
- cpu_attach: remove assignment to cpus_running, as the primary CPU might not be
attached first, which causes a reset (and thus missed secondary CPUs).
 1.29.2.8  08-Mar-2011  rmind struct pmap_tlb_mailbox: make tm_pending and tm_gen volatile.
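The point of the volatile qualifiers: remote CPUs poll these fields while
the initiating CPU updates them, so every read must go to memory. A sketch
with an illustrative layout:

	struct pmap_tlb_mailbox {
		volatile uint32_t tm_pending;	/* CPUs yet to acknowledge */
		volatile uint32_t tm_gen;	/* generation of the request */
	};

	/* Initiator side (illustrative): without volatile, the compiler
	 * could hoist the load out of the loop and spin forever. */
	static void
	tlb_shootdown_wait(struct pmap_tlb_mailbox *tm)
	{
		while (tm->tm_pending != 0)
			x86_pause();
	}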
 1.29.2.7  05-Mar-2011  rmind sync with head
 1.29.2.6  31-May-2010  rmind - Split off Xen versions of pmap_map_ptes/pmap_unmap_ptes into Xen pmap,
also move pmap_apte_flush() with pmap_unmap_apdp() there.
- Make Xen buildable.
 1.29.2.5  30-May-2010  rmind sync with head
 1.29.2.4  26-May-2010  rmind Split x86 TLB shootdown code into a separate file.
Code part is under TNF license, as per pmap.c 1.105.2.4 revision.
 1.29.2.3  26-Apr-2010  rmind Partly rewrite the amd64 TLB shootdown handler for the changes in x86 pmap.
At this point, the branch seems to pass preliminary stress tests on amd64.
 1.29.2.2  26-Apr-2010  rmind Apply renovated patch to significantly reduce TLB shootdowns in x86 pmap,
also provide TLBSTATS option to measure and track TLB shootdowns. Details:

http://mail-index.netbsd.org/port-i386/2009/01/11/msg001018.html

Patch from Andrew Doran, proposed on tech-x86 [sic], in January 2009.

XXX: amd64 and xen are not done yet; work in progress.
 1.29.2.1  16-Mar-2010  rmind Change struct uvm_object::vmobjlock to be dynamically allocated with
mutex_obj_alloc(). It allows us to share the locks among UVM objects.
 1.33.4.2  17-Feb-2011  bouyer Sync with HEAD
 1.33.4.1  08-Feb-2011  bouyer Sync with HEAD
 1.33.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.38.2.3  20-Sep-2011  cherry Remove the "xpq lock", since we have per-cpu mmu queues now. This may need further testing. Also add some preliminary locking around queue-ops in the network backend driver
 1.38.2.2  23-Jun-2011  cherry Catchup with rmind-uvmplock merge.
 1.38.2.1  03-Jun-2011  cherry Initial import of xen MP sources, with kernel and userspace tests.
- this is a source preview.
- boots to single user.
- spurious interrupt and pmap-related panics are normal
 1.43.2.6  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.43.2.5  16-Jan-2013  yamt sync with (a bit old) head
 1.43.2.4  23-May-2012  yamt sync with head.
 1.43.2.3  17-Apr-2012  yamt sync with head
 1.43.2.2  18-Nov-2011  yamt share a lock among pmap uobjs
 1.43.2.1  10-Nov-2011  yamt sync with head
 1.48.2.3  29-Apr-2012  mrg sync to latest -current.
 1.48.2.2  05-Apr-2012  mrg sync to latest -current.
 1.48.2.1  18-Feb-2012  mrg merge to -current.
 1.49.2.3  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1441):
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
sys/arch/x86/include/pmap.h: revision 1.63 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
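In PTE terms the fix is simply that kernel mappings must carry PG_k (zero)
rather than PG_u; a didactic sketch with a hypothetical helper name:

	/*
	 * Sketch: pick the user bit by address space.  A kernel VA with
	 * PG_u set would be tagged global by Xen and survive in the TLB
	 * after the return to userland, which is the hole described above.
	 */
	static inline pt_entry_t
	pmap_pte_user_bit(vaddr_t va)
	{
		if (va < VM_MAXUSER_ADDRESS)
			return PG_u;	/* userland: Xen marks it global */
		return PG_k;		/* kernel: stays local, gets flushed */
	}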
 1.49.2.2  09-May-2012  riz branches: 1.49.2.2.4; 1.49.2.2.6;
Pull up following revision(s) (requested by rmind in ticket #202):
sys/arch/x86/include/cpuvar.h: revision 1.46
sys/arch/xen/include/xenpmap.h: revision 1.34
sys/arch/i386/include/param.h: revision 1.77
sys/arch/x86/x86/pmap_tlb.c: revision 1.5
sys/arch/x86/x86/pmap_tlb.c: revision 1.6
sys/arch/i386/i386/genassym.cf: revision 1.92
sys/arch/xen/x86/cpu.c: revision 1.91
sys/arch/x86/x86/pmap.c: revision 1.177
sys/arch/xen/x86/xen_pmap.c: revision 1.21
sys/arch/x86/acpi/acpi_wakeup.c: revision 1.31
sys/kern/subr_kcpuset.c: revision 1.5
sys/arch/amd64/include/param.h: revision 1.18
sys/sys/kcpuset.h: revision 1.5
sys/arch/x86/x86/mtrr_i686.c: revision 1.26
sys/arch/x86/x86/mtrr_i686.c: revision 1.27
sys/arch/xen/x86/x86_xpmap.c: revision 1.43
sys/arch/x86/x86/cpu.c: revision 1.98
sys/arch/amd64/amd64/mptramp.S: revision 1.14
sys/kern/sys_sched.c: revision 1.42
sys/arch/amd64/amd64/genassym.cf: revision 1.50
sys/arch/i386/i386/mptramp.S: revision 1.24
sys/arch/x86/include/pmap.h: revision 1.52
sys/arch/x86/include/cpu.h: revision 1.50
- Convert x86 MD code, mainly pmap(9) e.g. TLB shootdown code, to use
kcpuset(9) and thus replace hardcoded CPU bitmasks. This removes the
limitation of maximum CPUs.
- Support up to 256 CPUs on amd64 architecture by default.
Bug fixes, improvements, completion of Xen part and testing on 64-core
AMD Opteron(tm) Processor 6282 SE (also, as Xen HVM domU with 128 CPUs)
by Manuel Bouyer.
- pmap_tlb_shootdown: do not overwrite tp_cpumask with pm_cpus, but merge
like pm_kernel_cpus. Remove unnecessary intersection with kcpuset_running.
Do not reset tp_userpmap if pmap_kernel().
- Remove pmap_tlb_mailbox_t wrapping, which is pointless after recent changes.
- pmap_tlb_invalidate, pmap_tlb_intr: constify for packet structure.
i686_mtrr_init_first: handle the case when there are no variable-size MTRR
registers available (i686_mtrr_vcnt == 0).
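A sketch of the tp_cpumask handling fix noted above, using the kcpuset(9)
API (kcpuset_merge() and pmap_kernel() are real interfaces; the packet
structure shown is an illustrative subset):

	struct pmap_tlb_packet {		/* illustrative subset */
		kcpuset_t *tp_cpumask;		/* target CPUs */
		bool tp_userpmap;		/* user pmap involved? */
	};

	static void
	pmap_tlb_shootdown_add(struct pmap_tlb_packet *tp, struct pmap *pm)
	{
		if (pm == pmap_kernel()) {
			/* Do not touch tp_userpmap here. */
			kcpuset_merge(tp->tp_cpumask, pm->pm_kernel_cpus);
		} else {
			/* Merge, do not overwrite, the pmap's CPU set. */
			kcpuset_merge(tp->tp_cpumask, pm->pm_cpus);
			tp->tp_userpmap = true;
		}
	}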
 1.49.2.1  22-Feb-2012  riz Pull up following revision(s) (requested by bouyer in ticket #29):
sys/arch/xen/x86/x86_xpmap.c: revision 1.39
sys/arch/xen/include/hypervisor.h: revision 1.37
sys/arch/xen/include/intr.h: revision 1.34
sys/arch/xen/x86/xen_ipi.c: revision 1.10
sys/arch/x86/x86/cpu.c: revision 1.97
sys/arch/x86/include/cpu.h: revision 1.48
sys/uvm/uvm_map.c: revision 1.315
sys/arch/x86/x86/pmap.c: revision 1.165
sys/arch/xen/x86/cpu.c: revision 1.81
sys/arch/x86/x86/pmap.c: revision 1.167
sys/arch/xen/x86/cpu.c: revision 1.82
sys/arch/x86/x86/pmap.c: revision 1.168
sys/arch/xen/x86/xen_pmap.c: revision 1.17
sys/uvm/uvm_km.c: revision 1.122
sys/uvm/uvm_kmguard.c: revision 1.10
sys/arch/x86/include/pmap.h: revision 1.50
Apply patch proposed in PR port-xen/45975 (this does not solve the exact
problem reported here but is part of the solution):
xen_kpm_sync() is not working as expected,
leading to races between CPUs.
1) the check (xpq_cpu != &x86_curcpu) is always false because we
have different x86_curcpu symbols with different addresses in the kernel.
Fortunately, all addresses disassemble to the same code.
Because of this we always use the code intended for bootstrap, which doesn't
use cross-calls or locks.
2) once 1) above is fixed, xen_kpm_sync() will use xcalls to sync other CPUs,
which causes it to sleep, and pmap.c doesn't like that. It triggers this
KASSERT() in pmap_unmap_ptes():
KASSERT(pmap->pm_ncsw == curlwp->l_ncsw);
3) pmap->pm_cpus is not safe for the purpose of xen_kpm_sync(), which
needs to know on which CPU a pmap is loaded *now*:
pmap->pm_cpus is cleared before cpu_load_pmap() is called to switch
to a new pmap, leaving a window where a pmap is still in a CPU's
ci_kpm_pdir but not in pm_cpus. As a virtual CPU may be preempted
by the hypervisor at any time, this window can be large enough to let another
CPU free the PTP and reuse it as a normal page.
To fix 2), avoid cross-calls and IPIs completely, and instead
use a mutex to update all CPUs' ci_kpm_pdir from the local CPU.
It's safe because we just need to update the table page; a tlbflush IPI will
happen later. As a side effect, we don't need different code for bootstrap,
fixing 1). The mutex added to struct cpu needs a small header reorganisation.
To fix 3), introduce pm_xen_ptp_cpus, which is updated from
cpu_pmap_load() with the ci_kpm_mtx mutex held. Checking it with
ci_kpm_mtx held will avoid overwriting the wrong pmap's ci_kpm_pdir.
While there I removed the unused pmap_is_active() function, and
added some more details to DIAGNOSTIC panics.
When using uvm_km_pgremove_intrsafe() make sure mappings are removed
before returning the pages to the free pool. Otherwise, under Xen,
a page which still has a writable mapping could be allocated for
a PDP by another CPU and the hypervisor would refuse it (this is
PR port-xen/45975).
For this, move the pmap_kremove() calls inside uvm_km_pgremove_intrsafe(),
and do pmap_kremove()/uvm_pagefree() in batch of (at most) 16 entries
(as suggested by Chuck Silvers on tech-kern@, see also
http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012727.html and
followups).
Avoid early use of xen_kpm_sync(); locks are not available at this time.
Don't call cpu_init() twice.
Makes LOCKDEBUG kernels boot again
Revert pmap_pte_flush() -> xpq_flush_queue() in previous.
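A sketch of the approach chosen for 2): walk the CPUs from the local CPU
and update each ci_kpm_pdir under its ci_kpm_mtx, leaving TLB invalidation
to the later tlbflush IPI (the update helper is hypothetical):

	static void
	xen_kpm_sync_sketch(struct pmap *pmap, int index)
	{
		CPU_INFO_ITERATOR cii;
		struct cpu_info *ci;

		for (CPU_INFO_FOREACH(cii, ci)) {
			mutex_enter(&ci->ci_kpm_mtx);
			/* Only the table page is updated here; the
			 * TLBs are flushed later by the usual IPI. */
			xen_kpm_pd_update(ci->ci_kpm_pdir, pmap, index);	/* hypothetical */
			mutex_exit(&ci->ci_kpm_mtx);
		}
	}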
 1.49.2.2.6.1  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1441):
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
sys/arch/x86/include/pmap.h: revision 1.63 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.49.2.2.4.1  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1441):
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
sys/arch/x86/include/pmap.h: revision 1.63 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.52.2.3  03-Dec-2017  jdolecek update from HEAD
 1.52.2.2  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.52.2.1  20-Nov-2012  tls Resync to 2012-11-19 00:00:00 UTC
 1.54.2.1  18-May-2014  rmind sync with head
 1.55.6.6  28-Aug-2017  skrll Sync with HEAD
 1.55.6.5  05-Dec-2016  skrll Sync with HEAD
 1.55.6.4  05-Oct-2016  skrll Sync with HEAD
 1.55.6.3  09-Jul-2016  skrll Sync with HEAD
 1.55.6.2  27-Dec-2015  skrll Sync with HEAD (as of 26th Dec)
 1.55.6.1  06-Apr-2015  skrll Sync with HEAD
 1.55.4.3  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1388):
sys/arch/x86/x86/pmap.c: revision 1.241
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.55.4.2  18-Dec-2016  snj Pull up following revision(s) (requested by riastradh in ticket #1316):
sys/arch/x86/x86/pmap.c: revision 1.223
sys/arch/x86/x86/vm_machdep.c: revision 1.26
sys/arch/x86/include/pmap.h: revision 1.61
PR/49691: KAMADA Ken'ichi: free deferred ptp mappings if present.
XXX: pullup-7
 1.55.4.1  23-Apr-2015  snj branches: 1.55.4.1.2; 1.55.4.1.4;
Pull up following revision(s) (requested by mrg in ticket #718):
sys/arch/x86/include/pmap.h: revision 1.56
sys/arch/x86/x86/pmap.c: revision 1.188
sys/dev/pci/agp_amd64.c: revision 1.8
sys/dev/pci/agp_i810.c: revision 1.118
sys/external/bsd/drm2/dist/drm/i915/i915_dma.c: revision 1.16
sys/external/bsd/drm2/dist/drm/i915/i915_gem.c: revision 1.29
sys/external/bsd/drm2/dist/drm/nouveau/nouveau_agp.c: revision 1.3
sys/external/bsd/drm2/dist/drm/nouveau/nouveau_ttm.c: revision 1.4
sys/external/bsd/drm2/dist/drm/radeon/atombios_crtc.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_agp.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_display.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_legacy_crtc.c: revision 1.2
sys/external/bsd/drm2/dist/drm/radeon/radeon_object.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_ttm.c: revision 1.7
sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c: revisions 1.7-1.10
sys/external/bsd/drm2/dist/drm/ttm/ttm_bo_util.c: revision 1.5
sys/external/bsd/drm2/i915drm/intelfb.c: revision 1.13
sys/external/bsd/drm2/include/drm/drm_wait_netbsd.h: revisions 1.12, 1.13
sys/external/bsd/drm2/include/linux/mm.h: revision 1.5
sys/external/bsd/drm2/include/linux/pci.h: revisions 1.16, 1.17
sys/external/bsd/drm2/nouveau/nouveaufb.c: revision 1.2
sys/external/bsd/drm2/radeon/radeon_pci.c: revisions 1.8, 1.9
sys/uvm/uvm_init.c: revision 1.46
Hack against the blank console problem:
Leave the CLUT alone on ancient cards. At least this leaves us with a
semi-working console (red and blue are flipped). Leave an example of what
seems to be happening but disable it because colors are better than 444 bit
greyscale.
--
Initialize P->V tracking for unmanaged device pages in uvm_init.

Conditional on __HAVE_PMAP_PV_TRACK until we add it to all pmaps.

MI part of pmap_pv(9) change proposed on tech-kern:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html
--
Implement pmap_pv(9) for x86 for P->V tracking of unmanaged pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html
--
Use pmap_pv(9) to remove mappings of Intel graphics aperture pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html

Further background at:

https://mail-index.netbsd.org/tech-kern/2014/07/23/msg017392.html
--
Use pmap_pv(9) to remove mappings of device pages in TTM.

Adapt nouveau and radeon to do pmap_pv_track for their device pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html

Further background at:

https://mail-index.netbsd.org/tech-kern/2014/07/23/msg017392.html
--
Fix error branches in agp_amd64.c.

- agp_generic_detach always.
- Free asc if it was allocated. (Found by Brainy, noted by maxv@.)
- Free the GATT if it was allocated.
--
pmf_device_register returns false on failure, not true
--
In DRM_SPIN_WAIT_ON, don't stop after waiting only one tick.

Continue the loop to recheck the condition and count the whole
duration.
--
Don't use the video BIOS memory as an i915 flush page!
--
Don't let anyone else allocate the video BIOS either.
--
Missed a zero: it's 0x100000, not 0x10000.
--
Don't reserve if atomic -- caller must have pre-pinned the buffer.
--
Don't reserve if atomic -- caller must have pre-pinned the buffer.
--
Almost add radeondrmkms suspend/resume support; unfortunately it doesn't work.
--
Need the page's uvm object lock to do pmap_page_protect.
--
Use KASSERTMSG to show bad base/offset.
--
KASSERT about page-alignment on initialization too.
--
Don't break when hardclock_ticks wraps around.

Since we now only count time spent in wait, rather than determining
the end time and checking whether we've passed it, timeouts might be
marginally longer in effect. Unlikely to be an issue.
--
Remove broken drm2 vm_mmap stub. Can't possibly have ever worked.
--
Apply some of the additional changes from Arto Huusko in PR#49645:
- call pmf_device_deregister on detach.

I've kept the "resume = true" for the radeon_resume_kms() call as it
seems to work for me (indeed, code inspection shows it is unused
on NetBSD :-)

My old nforce4 box, which can resume old drm (or could, last I tried
several years ago) while X and GL apps were running, can at least
survive a resume if X hasn't started. My one attempt so far with
X exited, but having run, did not work.
--
First attempt to make ttm_buffer_object_transfer less bogus.
--
Make sure mem.bus.is_iomem is initialized. PR 49833
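Driver-side usage of the pmap_pv(9) tracking added here might look like
the following sketch; pmap_pv_track()/pmap_pv_untrack() are per the
proposal, pmap_pv_protect() is assumed from the TTM usage, and the
aperture address/size are placeholders:

	void
	example_attach_aperture(paddr_t base, psize_t size)
	{
		/* Begin P->V tracking for the unmanaged device pages. */
		pmap_pv_track(base, size);
	}

	void
	example_detach_aperture(paddr_t base, psize_t size)
	{
		/* Remove remaining mappings page by page (assumed
		 * interface), then stop tracking. */
		for (paddr_t pa = base; pa < base + size; pa += PAGE_SIZE)
			pmap_pv_protect(pa, VM_PROT_NONE);
		pmap_pv_untrack(base, size);
	}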
 1.55.4.1.4.2  13-Mar-2017  skrll Sync with netbsd-7-1-RELEASE
 1.55.4.1.4.1  18-Jan-2017  skrll Sync with netbsd-5
 1.55.4.1.2.2  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1388):
sys/arch/x86/include/pmap.h: revision 1.63 via patch
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.55.4.1.2.1  18-Dec-2016  snj Pull up following revision(s) (requested by riastradh in ticket #1316):
sys/arch/x86/x86/pmap.c: revision 1.223
sys/arch/x86/x86/vm_machdep.c: revision 1.26
sys/arch/x86/include/pmap.h: revision 1.61
PR/49691: KAMADA Ken'ichi: free deferred ptp mappings if present.
XXX: pullup-7
 1.58.2.5  26-Apr-2017  pgoyette Sync with HEAD
 1.58.2.4  20-Mar-2017  pgoyette Sync with HEAD
 1.58.2.3  07-Jan-2017  pgoyette Sync with HEAD. (Note that most of these changes are simply $NetBSD$
tag issues.)
 1.58.2.2  04-Nov-2016  pgoyette Sync with HEAD
 1.58.2.1  26-Jul-2016  pgoyette Sync with HEAD
 1.61.2.1  21-Apr-2017  bouyer Sync with HEAD
 1.64.6.2  22-Mar-2018  martin Pull up the following revisions, requested by maxv in ticket #652:

sys/arch/amd64/amd64/amd64_trap.S up to 1.39 (partial, patch)
sys/arch/amd64/amd64/db_machdep.c 1.6 (patch)
sys/arch/amd64/amd64/genassym.cf 1.65,1.66,1.67 (patch)
sys/arch/amd64/amd64/locore.S up to 1.159 (partial, patch)
sys/arch/amd64/amd64/machdep.c 1.299-1.302 (patch)
sys/arch/amd64/amd64/trap.c up to 1.113 (partial, patch)
sys/arch/amd64/amd64/vector.S up to 1.61 (partial, patch)
sys/arch/amd64/conf/GENERIC 1.477,1.478 (patch)
sys/arch/amd64/conf/kern.ldscript 1.26 (patch)
sys/arch/amd64/include/frameasm.h up to 1.37 (partial, patch)
sys/arch/amd64/include/param.h 1.25 (patch)
sys/arch/amd64/include/pmap.h 1.41,1.43,1.44 (patch)
sys/arch/x86/conf/files.x86 1.91,1.93 (patch)
sys/arch/x86/include/cpu.h 1.88,1.89 (patch)
sys/arch/x86/include/pmap.h 1.75 (patch)
sys/arch/x86/x86/cpu.c 1.144,1.146,1.148,1.149 (patch)
sys/arch/x86/x86/pmap.c up to 1.289 (partial, patch)
sys/arch/x86/x86/vm_machdep.c 1.31,1.32 (patch)
sys/arch/x86/x86/x86_machdep.c 1.104,1.106,1.108 (patch)
sys/arch/x86/x86/svs.c 1.1-1.14
sys/arch/xen/conf/files.compat 1.30 (patch)

Backport SVS. Not enabled yet.
 1.64.6.1  16-Mar-2018  martin Pull up the following revisions (via patch), requested by maxv in #635:

sys/arch/amd64/amd64/gdt.c 1.39-1.45 (patch)
sys/arch/amd64/amd64/machdep.c 1.284,1.287,1.288 (patch)
sys/arch/amd64/include/param.h 1.23 (patch)
sys/arch/amd64/include/types.h 1.53 (patch)
sys/arch/x86/include/cpu.h 1.87 (patch)
sys/arch/x86/include/pmap.h 1.73,1.74 (patch)
sys/arch/x86/x86/cpu.c 1.142 (patch)
sys/arch/x86/x86/intr.c 1.117 (partial),1.120 (patch)
sys/arch/x86/x86/pmap.c 1.276 (patch)

Initialize ist0 in cpu_init_tss.
Backport __HAVE_PCPU_AREA.
 1.76.2.6  26-Dec-2018  pgoyette Sync with HEAD, resolve a few conflicts
 1.76.2.5  26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.76.2.4  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.76.2.3  28-Jul-2018  pgoyette Sync with HEAD
 1.76.2.2  25-Jun-2018  pgoyette Sync with HEAD
 1.76.2.1  21-May-2018  pgoyette Sync with HEAD
 1.80.2.3  13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.80.2.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.80.2.1  10-Jun-2019  christos Sync with HEAD
 1.101.2.1  31-May-2020  martin Pull up following revision(s) (requested by bouyer in ticket #935):

sys/arch/xen/x86/x86_xpmap.c: revision 1.89
sys/arch/x86/include/pmap.h: revision 1.121
sys/arch/xen/xen/privcmd.c: revision 1.58
sys/external/mit/xen-include-public/dist/xen/include/public/memory.h: revision 1.2
sys/arch/xen/include/xenpmap.h: revision 1.44
sys/arch/xen/include/xenio.h: revision 1.12
sys/arch/x86/x86/pmap.c: revision 1.394
(all via patch)

Adjust pmap_enter_ma() for the upcoming new Xen privcmd ioctl:
pass flags to xpq_update_foreign()

Introduce a pmap MD flag, PMAP_MD_XEN_NOTR, which causes xpq_update_foreign()
to use the MMU_PT_UPDATE_NO_TRANSLATE flag.
Make xpq_update_foreign() return the raw Xen error. This will cause
pmap_enter_ma() to return a negative error number in this case, but the
only user of this code path is privcmd.c and it can deal with it.

Add pmap_enter_gnt(), which maps a set of Xen grant entries at the
specified va in the specified pmap. Use the hooks implemented for EPT to
keep track of mapped grant entries in the pmap, and unmap them
when pmap_remove() is called. This requires pmap_remove() to be split
into a pmap_remove_locked(), to be called from pmap_remove_gnt().

Implement new ioctl, needed by Xen 4.13:
IOCTL_PRIVCMD_MMAPBATCH_V2
IOCTL_PRIVCMD_MMAP_RESOURCE
IOCTL_GNTDEV_MMAP_GRANT_REF
IOCTL_GNTDEV_ALLOC_GRANT_REF

Always enable declarations needed by privcmd.c
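From the privcmd side, the new flag and error convention might be used
like this sketch (pmap_enter_ma() is assumed to take a machine address,
physical address, protection, flags and domid; the errno translation
helper is hypothetical):

	static int
	privcmd_map_foreign_sketch(struct pmap *pm, vaddr_t va, paddr_t ma,
	    int domid)
	{
		int error;

		error = pmap_enter_ma(pm, va, ma, 0,
		    VM_PROT_READ | VM_PROT_WRITE,
		    PMAP_CANFAIL | PMAP_MD_XEN_NOTR, domid);
		if (error < 0) {
			/* Raw Xen error passed through from
			 * xpq_update_foreign(); translate it. */
			error = privcmd_xen2bsd_errno(error);	/* hypothetical */
		}
		return error;
	}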
 1.108.2.2  29-Feb-2020  ad Sync with head.
 1.108.2.1  17-Jan-2020  ad Sync with head.
 1.117.2.1  25-Apr-2020  bouyer Sync with bouyer-xenpvh-base2 (HEAD)
 1.125.6.1  13-May-2021  thorpej Sync with HEAD.
