History log of /src/sys/arch/x86/include/pmap.h
Revision  Date  Author  Comments
 1.134  20-Aug-2022  riastradh x86: Move definition of struct pmap to pmap_private.h.

This makes pmap_resident_count and pmap_wired_count out-of-line
functions instead of inline. No functional change intended
otherwise.
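
A minimal sketch of what the out-of-line move means in practice; the
pm_stats field name is an assumption here, and this is not the committed
code:

    /* pmap.h: prototype only; struct pmap stays opaque to callers */
    long    pmap_resident_count(struct pmap *);

    /* pmap.c: definition, with <machine/pmap_private.h> in scope */
    long
    pmap_resident_count(struct pmap *pmap)
    {
            return pmap->pm_stats.resident_count;
    }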
 1.133  20-Aug-2022  riastradh x86: Split most of pmap.h into pmap_private.h or vmparam.h.

This way pmap.h only contains the MD definition of the MI pmap(9)
API, which loads of things in the kernel rely on, so changing x86
pmap internals no longer requires recompiling the entire kernel every
time.

Callers needing these internals must now use machine/pmap_private.h.
Note: This is not x86/pmap_private.h because it contains three parts:

1. CPU-specific (different for i386/amd64) definitions used by...

2. common definitions, including Xenisms like xpmap_ptetomach,
further used by...

3. more CPU-specific inlines for pmap_pte_* operations

So {amd64,i386}/pmap_private.h defines 1, includes x86/pmap_private.h
for 2, and then defines 3. Maybe we should split that out into a new
pmap_pte.h to reduce this trouble.

No functional change intended, other than that some .c files must
include machine/pmap_private.h when previously uvm/uvm_pmap.h
polluted the namespace with pmap internals.

Note: This migrates part of i386/pmap.h into i386/vmparam.h --
specifically the parts that are needed for several constants defined
in vmparam.h:

VM_MAXUSER_ADDRESS
VM_MAX_ADDRESS
VM_MAX_KERNEL_ADDRESS
VM_MIN_KERNEL_ADDRESS

Since i386 needs PDP_SIZE in vmparam.h, I added it there on amd64
too, just to keep things parallel.
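
The three-part layering, as an illustrative sketch (comments only, not the
committed headers):

    /* amd64/pmap_private.h (likewise i386/pmap_private.h) */
    /* 1. CPU-specific definitions (different for i386/amd64) */
    #include <x86/pmap_private.h>   /* 2. common definitions, Xenisms */
    /* 3. CPU-specific pmap_pte_* inlines */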
 1.132  20-Aug-2022  riastradh x86: Move pl*_i, pl_i_roundup, and ptp_va2o out of x86/pmap.h.

- pl[1-4]_i -> x86/pte.h
- pl_i, pl_i_roundup, ptp_va2o -> x86/pmap.c
 1.131  20-Aug-2022  riastradh x86: Move struct vm_page_md to common x86/pmap.h.
 1.130  20-Aug-2022  riastradh x86: Split bootspace out of x86/pmap.h into new x86/bootspace.h.
 1.129  20-Aug-2022  riastradh x86: Move page attribute table bits to x86/pat.h.
 1.128  18-Jun-2022  andvar fix typos in word "functions" in comments, mainly s/fuctions/functions/.
 1.127  30-Apr-2021  christos Merge the x86 gdt function and constant definitions
 1.126  30-Apr-2021  christos Bump MAX_USERLDT_SIZE to the max size (wastes some memory). wine needs more
than PAGE_SIZE and fails spuriously.
XXX: Note the duplicate definition hacks. Should really create <x86/gdt.h>,
put just the constants there and unify them.
This would also avoid the hack in: src/tests/lib/libi386/t_user_ldt.c#46
 1.125  19-Jul-2020  maxv branches: 1.125.6;
Revert most of ad's movs/stos change. Instead do a lot simpler: declare
svs_quad_copy() used by SVS only, with no need for instrumentation, because
SVS is disabled when sanitizers are on.
 1.124  14-Jul-2020  yamaguchi Introduce per-cpu IDTs

This is realized by the following modifications:
- Add IDT pages and their allocation maps for each cpu in "struct cpu_info"
- Load per-cpu IDTs at cpu_init_idt(struct cpu_info*)
- Copy the IDT entries for cpu0 to other CPUs at attach
- These are, for example, exceptions, db, system calls, etc.

Also, add a kernel option named PCPU_IDT to enable the feature.
 1.123  24-Jun-2020  maxv remove unused x86_stos
 1.122  27-May-2020  ad - Add a couple of wrapper functions around STOS and MOVS and use them to zero
and copy PTEs in preference to memset()/memcpy().

- Remove related SSE / pageidlezero stuff.
 1.121  26-May-2020  bouyer Ajust pmap_enter_ma() for upcoming new Xen privcmd ioctl:
pass flags to xpq_update_foreign()
Introduce a pmap MD flag: PMAP_MD_XEN_NOTR, which cause xpq_update_foreign()
to use the MMU_PT_UPDATE_NO_TRANSLATE flag.
make xpq_update_foreign() return the raw Xen error. This will cause
pmap_enter_ma() to return a negative error number in this case, but the
only user of this code path is privcmd.c and it can deal with it.

Add pmap_enter_gnt(), which maps a set of Xen grant entries at the
specified va in the specified pmap. Use the hooks implemented for EPT to
keep track of mapped grant entries in the pmap, and unmap them
when pmap_remove() is called. This requires pmap_remove() to be split
into a pmap_remove_locked(), to be called from pmap_remove_gnt().
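
A minimal sketch of the split, assuming the pmap's pm_lock mutex; not the
committed code:

    void
    pmap_remove(struct pmap *pmap, vaddr_t sva, vaddr_t eva)
    {
            mutex_enter(&pmap->pm_lock);
            pmap_remove_locked(pmap, sva, eva);
            mutex_exit(&pmap->pm_lock);
    }

    /* pmap_remove_gnt() likewise takes pm_lock and calls
     * pmap_remove_locked() for the ranges it tears down. */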
 1.120  08-May-2020  riastradh Factor randomization out of slotspace_rand.

slotspace_rand becomes deterministic; the randomization moves into
the callers instead. Why?

There are two callers of slotspace_rand:

- x86/pmap.c pmap_bootstrap
- amd64/amd64.c init_slotspace

When the randomization was introduced, it used an x86-only
`cpu_earlyrng' abstraction that would hash rdseed/rdrand and rdtsc
output together. Except init_slotspace ran before cpu_probe, so
cpu_feature was not yet filled out, so during init_slotspace, the
only randomization was rdtsc.

In the course of the recent entropy overhaul, I replaced cpu_earlyrng
by entropy_extract, and moved cpu_init_rng much earlier -- but still
after cpu_probe -- in order to reduce the number of abstractions
lying around and the number of copies of rdrand/rdseed logic. In so
doing I added some annoying complication (see curcpu_available) to
kern_entropy.c to make it work early enough for init_slotspace, and
dropped the rdtsc.

For pmap_bootstrap that didn't substantively change anything. But
for init_slotspace, it removed the only randomization. To mitigate
this, this commit pulls the randomization out of slotspace_rand into
pmap_bootstrap and init_slotspace, so that

(a) init_slotspace can use rdtsc and a little private entropy pool in
order to restore the prior (weak) randomization it had, and

(b) pmap_bootstrap, which runs a little bit later, can continue to
use entropy_extract normally and get rdrand/rdseed too.

A subsequent commit will move cpu_init_rng just a wee bit later,
after cpu_init_msrs, so the kern_entropy.c complications can go away.
Perhaps someone else more wizardly with x86 can find a way to make
init_slotspace run a little later too, after cpu_probe and after
cpu_init_msrs and after cpu_rng_init, but I am not that wizardly.
 1.119  25-Apr-2020  bouyer Merge the bouyer-xenpvh branch, bringing in Xen PV drivers support under HVM
guests in GENERIC.
Xen support can be disabled at runtime with
boot -c
disable hypervisor
 1.118  24-Apr-2020  maxv Give the ldt a fixed size of one page (512 slots), and drop the variable-
sized mechanism that was too complex.

This fixes a race between USER_LDT and SVS: during context switches, the
way SVS installs the new ldt relies on the ldt pointer AND the ldt size,
but both cannot be accessed atomically at the same time.
 1.117  05-Apr-2020  ad branches: 1.117.2;
Allocate PV entries in PAGE_SIZE chunks, and cache partially allocated PV
pages with the pmap. Worth about 2-3% sys time on build.sh for me.
 1.116  22-Mar-2020  ad x86 pmap:

- Give pmap_remove_all() its own version of pmap_remove_ptes() that on native
x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on
'build.sh release' for me.

- pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The
caller waits for the competing operation to finish.

- Bring 'options TLBSTATS' up to date.
 1.115  17-Mar-2020  ad Hallelujah, the bug has been found. Resurrect prior changes, to be fixed
with following commit.
 1.114  17-Mar-2020  ad Back out the recent pmap changes until I can figure out what is going on
with pmap_page_remove() (to pmap.c rev 1.365).
 1.113  14-Mar-2020  ad PR kern/55071 (Panic shortly after running X11 due to kernel diagnostic assertion "mutex_owned(&pp->pp_lock)")

- Fix a locking bug in pmap_pp_clear_attrs(), and in pmap_pp_remove() do the
TLB shootdown while still holding the target pmap's lock.

Also:

- Finish PV list locking for x86 & update comments around same.

- Keep track of the min/max index of PTEs inserted into each PTP, and use
that to clip ranges of VAs passed to pmap_remove_ptes().

- Based on the above, implement a pmap_remove_all() for x86 that clears out
the pmap in a single pass. Makes exit() / fork() much cheaper.
 1.112  14-Mar-2020  ad pmap_remove_all(): Return a boolean value to indicate the behaviour. If
true, all mappings have been removed, the pmap is totally cleared out, and
UVM can then avoid doing the work to call pmap_remove() for each map entry.
If false, either nothing has been done, or some helpful arch-specific voodoo
has taken place.
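
A hypothetical caller-side illustration of the new contract (simplified;
not what uvm actually does verbatim):

    if (pmap_remove_all(map->pmap)) {
            /* true: pmap totally cleared, skip per-entry removal */
    } else {
            /* false: remove mappings entry by entry as before */
            pmap_remove(map->pmap, entry->start, entry->end);
    }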
 1.111  10-Mar-2020  ad - pmap_check_inuse() is expensive so make it DEBUG not DIAGNOSTIC.

- Put PV locking back in place with only a minor performance impact.
pmap_enter() still needs more work - it's not easy to satisfy all the
competing requirements so I'll do that with another change.

- Use pmap_find_ptp() (lookup only) in preference to pmap_get_ptp() (alloc).
Make pm_ptphint indexed by VA not PA. Replace the per-pmap radixtree for
dynamic PV entries with a per-PTP rbtree. Cuts system time during kernel
build by ~10% for me.
 1.110  23-Feb-2020  ad UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
 1.109  12-Jan-2020  ad x86 pmap:

- It turns out that every page the pmap frees is necessarily zeroed. Tell
the VM system about this and use the pmap as a source of pre-zeroed pages.

- Redo deferred freeing of PTPs more elegantly, including the integration with
pmap_remove_all(). This fixes problems with nvmm, and possibly also a crash
discovered during fuzzing.

Reported-by: syzbot+a97186518c84f1d85c0c@syzkaller.appspotmail.com
 1.108  04-Jan-2020  ad branches: 1.108.2;
x86 pmap improvements, reducing system time during a build by about 15% on
my test machine:

- Replace the global pv_hash with a per-pmap record of dynamically allocated
pv entries. The data structure used for this can be changed easily, and
has no special concurrency requirements. For now go with radixtree.

- Change pmap_pdp_cache back into a pool; cache the page directory with the
pmap, and avoid contention on pmaps_lock by adjusting the global list in
the pool_cache ctor & dtor. Align struct pmap and its lock, and update
some comments.

- Simplify pv_entry lists slightly. Allow both PP_EMBEDDED and dynamically
allocated entries to co-exist on a single page. This adds a pointer to
struct vm_page on x86, but shrinks pv_entry to 32 bytes (which also gets
it nicely aligned).

- More elegantly solve the chicken-and-egg problem introduced into the pmap
with radixtree lookup for pages, where we need PTEs mapped and page
allocations to happen under a single hold of the pmap's lock. While here
undo some cut-n-paste.

- Don't adjust pmap_kernel's stats with atomics, because its mutex is now
held in the places the stats are changed.
 1.107  15-Dec-2019  ad uvm_pagerealloc() can now block because of radixtree manipulation, so defer
freeing PTPs until pmap_unmap_ptes(), where we still have the pmap locked
but can finally tolerate context switches again.

To be revisited soon: pmap_map_ptes() seems broken WRT other pmap load.

Reported-by: syzbot+689fb7dab41abff8e75a@syzkaller.appspotmail.com
Reported-by: syzbot+3e7bbf37d37d451b25d7@syzkaller.appspotmail.com
 1.106  08-Dec-2019  ad Merge x86 pmap changes from yamt-pagecache:

- Deal better with the multi-level pmap object locking kludge.
- Handle uvm_pagealloc() being able to block.
 1.105  14-Nov-2019  maxv Add support for Kernel Memory Sanitizer (kMSan). It detects uninitialized
memory used by the kernel at run time, and just like kASan and kCSan, it
is an excellent feature. It has already detected 38 uninitialized variables
in the kernel during my testing, which I have since discreetly fixed.

We use two shadows:
- "shad", to track uninitialized memory with a bit granularity (1:1).
Each bit set to 1 in the shad corresponds to one uninitialized bit of
real kernel memory.
- "orig", to track the origin of the memory with a 4-byte granularity
(1:1). Each uint32_t cell in the orig indicates the origin of the
associated uint32_t of real kernel memory.

The memory consumption of these shadows is substantial, so at least 4GB of
RAM is recommended to run kMSan.

The compiler inserts calls to specific __msan_* functions on each memory
access, to manage both the shad and the orig and detect uninitialized
memory accesses that change the execution flow (like an "if" on an
uninitialized variable).

We mark as uninit several types of memory buffers (stack, pools, kmem,
malloc, uvm_km), and check each buffer passed to copyout, copyoutstr,
bwrite, if_transmit_lock and DMA operations, to detect uninitialized memory
that leaves the system. This allows us to detect kernel info leaks in a way
that is more efficient and also more user-friendly than KLEAK.

Contrary to kASan, kMSan requires comprehensive coverage, ie we cannot
tolerate having one non-instrumented function, because this could cause
false positives. kMSan cannot instrument ASM functions, so I converted
most of them to __asm__ inlines, which kMSan is able to instrument. Those
that remain receive special treatment.

Contrary to kASan again, kMSan uses a TLS, so we must context-switch this
TLS during interrupts. We use different contexts depending on the interrupt
level.

The orig tracks precisely the origin of a buffer. We use a special encoding
for the orig values, and pack together in each uint32_t cell of the orig:
- a code designating the type of memory (Stack, Pool, etc), and
- a compressed pointer, which points either (1) to a string containing
the name of the variable associated with the cell, or (2) to an area
in the kernel .text section which we resolve to a symbol name + offset.

This encoding allows us not to consume extra memory for associating
information with each cell, and produces a precise output, that can tell
for example the name of an uninitialized variable on the stack, the
function in which it was pushed on the stack, and the function where we
accessed this uninitialized variable.

kMSan is available with LLVM, but not with GCC.

The code is organized in a way that is similar to kASan and kCSan, so it
means that other architectures than amd64 can be supported.
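
A sketch of the orig-cell packing described above; the field widths and the
helper name are assumptions, not the committed layout:

    /* pack a memory-type code and a compressed pointer into one
     * 4-byte orig cell */
    static inline uint32_t
    kmsan_orig_encode(uint32_t type, uintptr_t ptr)
    {
            return (type << 28) | (uint32_t)(ptr & 0x0fffffff);
    }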
 1.104  13-Nov-2019  maxv Rename:
PP_ATTRS_M -> PP_ATTRS_D
PP_ATTRS_U -> PP_ATTRS_A
For consistency.
 1.103  05-Oct-2019  maxv Switch to the new PTE naming. No binary diff (tested with MKREPRO).
 1.102  07-Aug-2019  maxv Add support for USER_LDT in SVS. This allows us to have both enabled at
the same time.

We allocate an LDT for each CPU in the GDT and map an area for it, in
addition to the default LDT already present. In context switches between
different processes, we choose between the default or the per-cpu LDT
selector: if the user set specific LDT entries, we memcpy them to the
per-cpu LDT and load the per-cpu selector.

Tested by Naveen Narayanan (with Wine on amd64).
 1.101  29-May-2019  maxv branches: 1.101.2;
Add PCID support in SVS. This avoids TLB flushes during kernel<->user
transitions, which greatly reduces the performance penalty introduced by
SVS.

We use two ASIDs, 0 (kern) and 1 (user), and use invpcid to flush pages
in both ASIDs.

The read-only machdep.svs.pcid={0,1} sysctl is added, and indicates whether
SVS+PCID is in use.
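
Illustrative only: with PCID, returning to userland boils down to a %cr3
load of roughly this shape (the constants and user_pdirpa are assumptions;
bit 63 is the architectural no-flush bit):

    #define PCID_KERN   0ULL            /* kernel ASID */
    #define PCID_USER   1ULL            /* user ASID */
    #define CR3_NOFLUSH (1ULL << 63)    /* keep this PCID's TLB entries */

    /* switch page tables without a full TLB flush */
    lcr3(user_pdirpa | PCID_USER | CR3_NOFLUSH);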
 1.100  10-Mar-2019  maxv Two changes:

* Allow large pages to be passed in pmap_pdes_valid; this happens under
DDB when it reads RIP (.text), called via pmap_extract.

* Invert a branch in pmap_extract, so that 'l_cpu' is not touched if we're
dealing with the kernel pmap.

This fixes 'boot -d'.
 1.99  09-Mar-2019  maxv Start replacing the x86 PTE bits.
 1.98  23-Feb-2019  maxv Move PATENTRY into pmap.h, will be used outside.
 1.97  13-Feb-2019  maxv Add the EPT pmap code, used by Intel-VMX.

The idea is that under NVMM, we don't want to implement the hypervisor page
tables manually in NVMM directly, because we want pageable guests; that is,
we want to allow UVM to unmap guest pages when the host comes under
pressure.

Contrary to AMD-SVM, Intel-VMX uses a different set of PTE bits from
native, and this has three important consequences:

- We can't use the native PTE bits, so each time we want to modify the
page tables, we need to know whether we're dealing with a native pmap
or an EPT pmap. This is accomplished with callbacks, that handle
everything PTE-related.

- There is no recursive slot possible, so we can't use pmap_map_ptes().
Rather, we walk down the EPT trees via the direct map, and that's
actually a lot simpler (and probably faster too...).

- The kernel is never mapped in an EPT pmap. An EPT pmap cannot be loaded
on the host. This has two sub-consequences: at creation time we must
zero out all of the top-level PTEs, and at destruction time we force
the page out of the pool cache and into the pool, to ensure that a next
allocation will invoke pmap_pdp_ctor() to create a native pmap and not
recycle some stale EPT entries.

To create an EPT pmap, the caller must invoke pmap_ept_transform() on a
newly-allocated native pmap. And that's about it, from then on the EPT
callbacks will be invoked, and the pmap can be destroyed via the usual
pmap_destroy(). The TLB shootdown callback is not initialized however,
it is the responsibility of the hypervisor (NVMM) to set it.

There are some twisted cases that we need to handle. For example if
pmap_is_referenced() is called on a physical page that is entered both by
a native pmap and by an EPT pmap, we take the Accessed bits from the
two pmaps using different PTE sets in each case, and combine them into a
generic PP_ATTRS_U flag (that does not depend on the pmap type).

Given that the EPT layout is a 4-Level tree with the same address space as
native x86_64, we allow ourselves to use a few native macros in EPT, such
as pmap_pa2pte(), rather than re-defining them with "ept" in the name.

Even though this EPT code is rather complex, it is not too intrusive: just
a few callbacks in a few pmap functions, predicted-false to give priority
to native. So this comes with no messy #ifdef or performance cost.
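
A minimal sketch of the callback idea (the field name and shape are
assumptions):

    struct pmap {
            /* ... */
            void (*pm_remove)(struct pmap *, vaddr_t, vaddr_t);
    };

    /* call sites predict native as the fast path */
    if (__predict_false(pmap->pm_remove != NULL)) {
            (*pmap->pm_remove)(pmap, sva, eva);     /* EPT variant */
            return;
    }
    /* ... native removal continues below ... */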
 1.96  11-Feb-2019  cherry We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.
 1.95  01-Feb-2019  maxv Add the remaining pmap callbacks, will be used by NVMM-VMX.
 1.94  01-Feb-2019  maxv Change the format of the pp_attrs field: instead of using PTE bits
directly, use abstracted bits that are converted from/to PTE bits when
needed (in pmap_sync_pv).

This allows us to use the same pp_attrs for pmaps that have PTE bits at
different locations.
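
Sketch of the conversion direction described here, using the PTE/PP_ATTRS
names that appear later in this log (revs 1.103/1.104); the bit choices are
assumptions:

    static inline uint8_t
    pmap_pte_to_pp_attrs(pt_entry_t pte)
    {
            uint8_t attrs = 0;

            if (pte & PTE_D)                /* modified */
                    attrs |= PP_ATTRS_D;
            if (pte & PTE_A)                /* accessed */
                    attrs |= PP_ATTRS_A;
            return attrs;
    }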
 1.93  17-Dec-2018  maxv Add two pmap fields, will be used by NVMM-VMX. Also apply a few cosmetic
changes.
 1.92  06-Dec-2018  maxv Fix inconsistency, these are indexes and not types, no real functional
change.
 1.91  19-Nov-2018  maxv Introduce pl_pi, will be used soon.
 1.90  19-Nov-2018  maxv Rename 'mask' -> 'frame', we will use the real 'mask' soon.
 1.89  07-Nov-2018  maxv Add two pmap fields, will be used by NVMM.
 1.88  29-Aug-2018  maxv clean up a little
 1.87  29-Aug-2018  maxv Remove the constants of the DMAP, they are unused, and move NL4_SLOT_DIRECT
into amd64/.
 1.86  29-Aug-2018  maxv Simplify the ASLR stuff, we don't care about resizable areas now, and it
makes the code more complicated for no good reason.
 1.85  20-Aug-2018  maxv Add support for kASan on amd64. Written by me, with some parts inspired
from Siddharth Muralee's initial work. This feature can detect several
kinds of memory bugs, and it's an excellent feature.

It can be enabled by uncommenting these three lines in GENERIC:

#makeoptions KASAN=1 # Kernel Address Sanitizer
#options KASAN
#no options SVS

The kernel is compiled without SVS, without DMAP and without PCPU area.
A shadow area is created at boot time, and it can cover the upper 128TB
of the address space. This area is populated gradually as we allocate
memory. With this design the memory consumption is kept at its lowest
level.

The compiler calls the __asan_* functions each time a memory access is
done. We verify whether this access is legal by looking at the shadow
area.

We declare our own special memcpy/memset/etc functions, because the
compiler's builtins don't add the __asan_* instrumentation.

Initially all the mappings are marked as valid. During dynamic
allocations, we add a redzone, which we mark as invalid. Any access on
it will trigger a kASan error message. Additionally, the compiler adds
a redzone on global variables, and we mark these redzones as invalid too.
The illegal-access detection works with a 1-byte granularity.

For now, we cover three areas:

- global variables
- kmem_alloc-ated areas
- malloc-ated areas

More will come, but that's a good start.
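
What a compiler-inserted check amounts to, as a sketch; the helper names
are hypothetical, not the committed API:

    void
    __asan_load8(unsigned long addr)
    {
            /* each shadow byte covers 8 bytes of kernel memory */
            int8_t s = *kasan_md_addr_to_shad(addr);    /* hypothetical */

            if (__predict_false(s != 0))
                    kasan_report_load(addr, 8);         /* hypothetical */
    }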
 1.84  12-Aug-2018  maxv Move the PCPU area from slot 384 to slot 510, to avoid creating too much
fragmentation in the slot space (384 is in the middle of the kernel half
of the VA).
 1.83  12-Aug-2018  maxv Randomize the main memory on Xen, same as native. Tested on amd64-dom0.
 1.82  12-Aug-2018  maxv Add a new area, SLAREA_HYPV, which indicates the slots used by the
hypervisor, in our case Xen.
 1.81  21-Jul-2018  maxv More ASLR. Randomize the location of the direct map at boot time on amd64.
This doesn't need "options KASLR" and works on GENERIC. Will soon be
enabled by default.

The location of the areas is abstracted in a slotspace structure. Ideally
we should always use this structure when touching the L4 slots, instead of
the current cocktail of global variables and constants.

machdep initializes the structure with the default values, and we then
randomize its dmap entry. Ideally machdep should randomize everything at
once, but in the case of the direct map its size is determined a little
later in the boot procedure, so we're forced to randomize its location
later too.
 1.80  20-Jun-2018  maxv branches: 1.80.2;
Add and use bootspace.smodule. Initialize it in locore/prekern to better
hide the specifics from the "upper" layers. This allows for greater
flexibility.
 1.79  19-May-2018  jakllsch remove some remaining uvm_emap(9)-related function prototypes
 1.78  19-May-2018  jdolecek Remove emap support. Unfortunately it never got to state where it would be
used and usable, due to reliability and limited & complicated MD support.

Going forward, we need to concentrate on interface which do not map anything
into kernel in first place (such as direct map or KVA-less I/O), rather
than making those mappings cheaper to do.
 1.77  08-May-2018  maxv Mitigation for the SS bug, CVE-2018-8897. We disabled dbregs a month ago
in -current and -8 so we are not particularly affected anymore.

The #DB handler runs on ist3, if we decide to process the exception we
copy the iret frame on the correct non-ist stack and continue as usual.
 1.76  04-Mar-2018  jdolecek branches: 1.76.2;
drop pmap_update_2pg(), just call pmap_update_pg() separately for each
 1.75  18-Jan-2018  maxv Unmap the kernel heap from the user page tables (SVS).

This implementation is optimized and organized in such a way that we
don't need to copy the kernel stack to a safe place during user<->kernel
transitions. We create two VAs that point to the same physical page; one
will be mapped in userland and is offset in order to contain only the
trapframe, the other is mapped in the kernel and maps the entire stack.

Sent on tech-kern@ a week ago.
 1.74  11-Jan-2018  maxv Add ist0 to pcpu_entry.
 1.73  05-Jan-2018  maxv Add a __HAVE_PCPU_AREA option, enabled by default on native amd64 but not
Xen.

With this option, the CPU structures that must always be present in the
CPU's page tables are moved on L4 slot 384, which means address
0xffffc00000000000.

A new pcpu_area structure is defined. It contains shared structures (IDT,
LDT), and then an array of pcpu_entry structures, indexed by cpu_index(ci).
Theoretically the LDT should be in the array, but this will be done later.

During the boot procedure, cpu0 calls pmap_init_pcpu, which creates a
page tree that is able to map the pcpu_area structure entirely. cpu0 then
immediately maps the shared structures. Later, every CPU goes through
cpu_pcpuarea_init, which allocates physical pages and kenters the relevant
pcpu_entry to them. Finally, each pointer is replaced to point to pcpuarea.

The point of this change is to make sure that the structures that must
always be present in the page tables have their own L4 slot. Until now
their L4 slot was that of pmap_kernel, and making a distinction between
what must be mapped and what does not need to be was complicated.

Even in the non-speculative-bug case this change makes some sense: there
are several x86 instructions that leak the addresses of the CPU structures,
and putting these structures inside pmap_kernel actually offered a way to
compute the address of the kernel heap - which would have made ASLR on it
plainly useless, had we implemented that.

Note that, for now, pcpuarea does not contain rsp0.

Unfortunately this change adds many #ifdefs, and makes the code harder to
understand. There is also some duplication, but that will be solved later.
 1.72  28-Dec-2017  maxv Use variables in PMAP_DIRECT_*, so that the location of the direct map can
change.
 1.71  11-Nov-2017  maxv Modify the layout of the bootspace structure, in such a way that it can
contain several kernel segments of the same type (eg several .text
segments). Some parts are still a bit messy but will be cleaned up soon.

I cannot compile-test this change on i386, but it seems fine enough.

NOTE: you need to rebuild and reinstall a new prekern after this change.
 1.70  29-Oct-2017  maxv Add a fifth region, called "head". On kaslr kernels it contains the ELF
Header and the ELF Section Headers. On normal kernels it is empty (the
headers are in the "boot" region).

Note: if you're using GENERIC_KASLR, you also need to rebuild the prekern.
 1.69  30-Sep-2017  maxv Add a bootspace structure. It describes the physical and virtual space
layout created by the early kernel bootstrap code. Start using it, and
eliminate several references to KERNBASE and other global symbols. While
here clean up xen-i386, it's really tiring.
 1.68  29-Sep-2017  ozaki-r Fix build

sys/arch/x86/x86/cpu.c:920:20: error: 'pmap_largepages' undeclared (first use in this function)
smp_data.large = (pmap_largepages != 0);
^
 1.67  17-Jun-2017  maxv Actually, use slot 456 instead, so that it fits a cache line.
 1.66  14-Jun-2017  maxv Give the direct map 32 slots (16TB of va). This matches MAXPHYSMEM, in
such a way that the direct map is no longer the limiting factor for high
memory systems.
 1.65  14-Jun-2017  maxv Move the direct map from slot 509 to slot 460. We will increase its size
dynamically.
 1.64  23-Mar-2017  maxv branches: 1.64.6;
Remove PG_k completely.
 1.63  05-Mar-2017  maxv Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.

On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.

However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.

Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.

With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.62  11-Feb-2017  maxv Instead of using a global array with per-cpu indexes, embed the tmp VAs
into cpu_info directly. This concerns only {i386, Xen-i386, Xen-amd64},
because amd64 already has a direct map that is way faster than that.

There are two major issues with the global array: maxcpus entries are
allocated while it is unlikely that common i386 machines have so many
cpus, and the base VA of these entries is not cache-line-aligned, which
mostly guarantees cache-line-thrashing each time the VAs are entered.

Now the number of tmp VAs allocated is proportionate to the number of CPUs
attached (which therefore reduces memory consumption), and the base is
properly aligned.

On my 3-core AMD, the number of DC_refills_L2 events triggered when
performing 5x10^6 calls to pmap_zero_page on two dedicated cores is on
average divided by two with this patch.

Discussed on tech-kern a little.
 1.61  08-Nov-2016  christos branches: 1.61.2;
PR/49691: KAMADA Ken'ichi: free deferred ptp mappings if present.
XXX: pullup-7
 1.60  19-Sep-2016  maya move function prototype to x86, so it is available to amd64 too
 1.59  25-Jul-2016  maxv The L1 entry of the first page of the data segment is overwritten for the
LAPIC page, and set as RWX+PG_N. The LAPIC pa is fixed, and its va resides
in the data segment. Because of this error-prone design, the kernel image
map is not linear, and I first thought it was a bug (as I vaguely said in
PR/51148). Using large pages for the data segment is therefore wrong, since
the first page does not actually belong to the data segment (even if its va
is in the range). This bug is not triggered currently, since local_apic is
not large-page-aligned.

We will certainly have to allocate a va dynamically instead of using the
first page of data; but for now, disable large pages on the data segment,
and map the LAPIC as RW.

This is the last x86-specific RWX page.
 1.58  01-Jul-2016  maxv branches: 1.58.2;
Define pmap_pg_nx globally. Will be used soon.
 1.57  11-Nov-2015  skrll Split out the pmap_pv_track stuff for use by others.

Discussed with riastradh@
 1.56  03-Apr-2015  riastradh Implement pmap_pv(9) for x86 for P->V tracking of unmanaged pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html
 1.55  17-Oct-2013  christos branches: 1.55.4; 1.55.6;
__USE() unused variables
 1.54  23-Jun-2013  uebayasi branches: 1.54.2;
Remove obsolete comment. OK'ed by rmind@.
 1.53  13-Nov-2012  chs add a pmap_kremove_local() that doesn't do TLB invalidations
on other CPUs. this is only intended for use while writing
kernel crash dumps. remove unused pmap_map().
 1.52  20-Apr-2012  rmind branches: 1.52.2;
- Convert x86 MD code, mainly pmap(9) e.g. TLB shootdown code, to use
kcpuset(9) and thus replace hardcoded CPU bitmasks. This removes the
limitation of maximum CPUs.

- Support up to 256 CPUs on amd64 architecture by default.

Bug fixes, improvements, completion of Xen part and testing on 64-core
AMD Opteron(tm) Processor 6282 SE (also, as Xen HVM domU with 128 CPUs)
by Manuel Bouyer.
 1.51  11-Mar-2012  jym Alternate PTEs got killed a few weeks ago. Clean up unused prototypes.
 1.50  17-Feb-2012  bouyer Apply patch proposed in PR port-xen/45975 (this does not solve the exact
problem reported here but is part of the solution):
xen_kpm_sync() is not working as expected,
leading to races between CPUs.
1) the check (xpq_cpu != &x86_curcpu) is always false because we
have different x86_curcpu symbols with different addresses in the kernel.
Fortunately, all addresses disassemble to the same code.
Because of this we always use the code intended for bootstrap, which doesn't
use cross-calls or lock.

2) once 1) above is fixed, xen_kpm_sync() will use xcalls to sync other CPUs,
which cause it to sleep and pmap.c doesn't like that. It triggers this
KASSERT() in pmap_unmap_ptes():
KASSERT(pmap->pm_ncsw == curlwp->l_ncsw);
3) pmap->pm_cpus is not safe for the purpose of xen_kpm_sync(), which
needs to know on which CPU a pmap is loaded *now*:
pmap->pm_cpus is cleared before cpu_load_pmap() is called to switch
to a new pmap, leaving a window where a pmap is still in a CPU's
ci_kpm_pdir but not in pm_cpus. As a virtual CPU may be preempted
by the hypervisor at any time, it can be large enough to let another
CPU free the PTP and reuse it as a normal page.

To fix 2), avoid cross-calls and IPIs completely, and instead
use a mutex to update all CPU's ci_kpm_pdir from the local CPU.
It's safe because we just need to update the table page, a tlbflush IPI will
happen later. As a side effect, we don't need a different code for bootstrap,
fixing 1). The mutex added to struct cpu needs a small headers reorganisation.

To fix 3), introduce a pm_xen_ptp_cpus which is updated from
cpu_load_pmap(), with the ci_kpm_mtx mutex held. Checking it with
ci_kpm_mtx held will avoid overwriting the wrong pmap's ci_kpm_pdir.

While there I removed the unused pmap_is_active() function;
and added some more details to DIAGNOSTIC panics.
 1.49  04-Dec-2011  chs branches: 1.49.2;
map all of physical memory using large pages.
ported from openbsd years ago by Murray Armfield,
updated for changes since then by me.
 1.48  23-Nov-2011  jym branches: 1.48.2;
No more users of xpmap_update(). Use pmap_pte_*() functions now.
 1.47  23-Nov-2011  jym Move Xen-specific functions to Xen pmap. Requested by cherry@.

Un'ifdef XEN in xen_pmap.c, it is always defined there.
 1.46  20-Nov-2011  jym Expose pmap_pdp_cache publicly to x86/xen pmap. Provide suspend/resume
callbacks for Xen pmap.

Make the internal callbacks of pmap_pdp_cache static.

XXX the implementation of pool_cache_invalidate(9) is still wrong, and
IMHO this needs fixing before -6. See
http://mail-index.netbsd.org/tech-kern/2011/11/18/msg011924.html
 1.45  08-Nov-2011  cherry Expose the PG_k #define pt/pd bit to both xen and "baremetal" x86. This is required, since kernel pages are mapped with user permissions in XEN/amd64 since the VM kernel runs in ring3. Since XEN/i386(including PAE) runs in ring1, supervisor mode is appropriate for these ports. We need to share this since the pmap implementation is still shared. Once the xen implementation is sufficiently independant of the x86 one, this can be made private to xen/include/xenpmap.h
 1.44  06-Nov-2011  cherry [merging from cherry-xenmp] Make the xen MMU op queue locking api private. Implement per-cpu queues.
 1.43  18-Oct-2011  jym branches: 1.43.2;
Make "pmaps" (list of non-kernel pmaps) and "pmaps_lock" externally
visible. Required by pmap MD code that could reside in other
files, notably Xen's pmap.
 1.42  20-Sep-2011  jym Merge jym-xensuspend branch in -current. ok bouyer@.

Goal: save/restore support in NetBSD domUs, for i386, i386 PAE and amd64.

Executive summary:
- split all Xen drivers (xenbus(4), grant tables, xbd(4), xennet(4))
in two parts: suspend and resume, and hook them to pmf(9).
- modify pmap so that Xen hypervisor does not cry out loud in case
it finds "unexpected" recursive memory mappings
- provide a sysctl(7), machdep.xen.suspend, to command suspend from
userland via powerd(8). Note: a suspend can only be handled correctly
when dom0 requested it, so provide a mechanism that will prevent the
kernel from blindly validating the user's commands

The code is still in experimental state, use at your own risk: restore
can corrupt backend communications rings; this can completely thrash
dom0 as it will loop at a high interrupt level trying to honor
all domU requests.

XXX PAE suspend does not work in amd64 currently, due to (yet again!)
page validation issues with hypervisor. Will fix.

XXX secondary CPUs are not suspended, I will write the handlers
in sync with cherry's Xen MP work.

Tested under i386 and amd64, bear in mind ring corruption though.

No build break expected, GENERICs and XEN* kernels should be fine.
./build.sh distribution still running. In any case: sorry if it does
break for you, contact me directly for reports.
 1.41  13-Aug-2011  cherry Add locking around ops to the hypervisor MMU "queue".
 1.40  13-Jun-2011  tls Fix Xen kernel builds (pmap_is_curpmap can't be static)
 1.39  12-Jun-2011  rmind Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
 1.38  07-May-2011  jym branches: 1.38.2;
Do as the comment says, use ilog2(). This gets optimized directly at
compile time, no call to fls() is needed.
 1.37  25-Apr-2011  yamt comment
 1.36  25-Apr-2011  yamt remove unused ptei
 1.35  11-Feb-2011  jmcneill add bus_space_mmap support for BUS_SPACE_MAP_PREFETCHABLE, ok matt@
 1.34  01-Feb-2011  chuck udpate license clauses on my code to match the new-style BSD licenses.
remove no-longer-valid wustl email address for me.
based on diff that rmind@ sent me.

no functional change with this commit.
 1.33  24-Jul-2010  jym branches: 1.33.2; 1.33.4;
Welcome PAE inside i386 current.

This patch is inspired by work previously done by Jeremy Morse, ported by me
to -current, merged with the work previously done for port-xen, together with
additionals fixes and improvements.

PAE option is disabled by default in GENERIC (but will be enabled in ALL in
the next few days).

In short, PAE switches the CPU to a mode where physical addresses become
36 bits (64 GiB). Virtual address space remains at 32 bits (4 GiB). To cope
with the increased size of physical addresses, they are manipulated as
64-bit variables by the kernel and MMU.

When supported by the CPU, it also allows the use of the NX/XD bit that
provides no-execution right enforcement on a per physical page basis.

Notes:

- reworked locore.S

- introduce cpu_load_pmap(), used to switch pmap for the curcpu. Due to the
different handling of pmap mappings with PAE vs !PAE, Xen vs native, details
are hidden within this function. This helps calling it from assembly,
as some features, like BIOS calls, switch to pmap_kernel before mapping
trampoline code in low memory.

- some changes in bioscall and kvm86_call, to reflect the above.

- the L3 is "pinned" per-CPU, and is only manipulated by a
reduced set of functions within pmap. To track the L3, I added two
elements to struct cpu_info, namely ci_l3_pdirpa (PA of the L3), and
ci_l3_pdir (the L3 VA). Rest of the code considers that it runs "just
like" a normal i386, except that the L2 is 4 pages long (PTP_LEVELS is
still 2).

- similar to the ci_pae_l3_pdir{,pa} variables, amd64's xen_current_user_pgd
becomes an element of cpu_info (slowly paving the way for MP world).

- bootinfo_source struct declaration is modified, to cope with paddr_t size
change with PAE (it is not correct to assume that bs_addr is a paddr_t when
compiled with PAE - it should remain 32 bits). bs_addrs is now a
void * array (in bootloader's code under i386/stand/, the bs_addrs
is a physaddr_t, which is an unsigned long).

- fixes in multiboot code (same reason as bootinfo): paddr_t size
change. I used Elf32_* types, use RELOC() where necessary, and move the
memcpy() functions out of the if/else if (I do not expect sym and str tables
to overlap with ELF).

- 64 bits atomic functions for pmap

- all pmap_pdirpa access are now done through the pmap_pdirpa macro. It
hides the L3/L2 stuff from PAE, as well as the pm_pdirpa change in
struct pmap (it now becomes a PDP_SIZE array, with or without PAE).

- manipulation of recursive mappings ( PDIR_SLOT_{,A}PTEs ) is done via
loops on PDP_SIZE.

See also http://mail-index.netbsd.org/port-i386/2010/07/17/msg002062.html

No objection raised on port-i386@ and port-xen@ for about a week.

XXX kvm(3) will be fixed in another patch to properly handle both PAE and !PAE
kernel dumps (VA => PA macros are slightly different, and need proper 64 bits
PA support in kvm_i386).

XXX Mixing PAE and !PAE modules may lead to unwanted/unexpected results. This
cannot be solved easily, and needs lots of thinking before being declared
safe (paddr_t/bus_addr_t size handling, PD/PT macros abstractions).
 1.32  15-Jul-2010  jym Make the comment about PDPpaddr more thorough.
 1.31  06-Jul-2010  cegger Turn PMAP_NOCACHE into MI flag.
Add MI flags PMAP_WRITE_COMBINE, PMAP_WRITE_BACK, PMAP_NOCACHE_OVR.
Update pmap(9) manpage.

hppa: Remove MD PMAP_NOCACHE flag as it exists as MI flag
mips: Rename MD PMAP_NOCACHE to PGC_NOCACHE.

x86: Implement new MI flags using Page-Attribute Tables.
x86: Implement BUS_SPACE_MAP_PREFETCHABLE.

Patch presented on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2010/06/30/msg008458.html

No comments on this last version.
 1.30  10-May-2010  dyoung Provide pmap_enter_ma(), pmap_extract_ma(), pmap_kenter_ma() in all x86
kernels, and use them in the bus_space(9) implementation instead of ugly
Xen #ifdef-age. In a non-Xen kernel, the _ma() functions either call or
alias the equivalent _pa() functions.

Reviewed on port-xen@netbsd.org and port-i386@netbsd.org. Passes
rmind@'s and bouyer@'s inspection. Tested on i386 and on Xen DOMU /
DOM0.
 1.29  09-Feb-2010  jym branches: 1.29.2;
Fix typos in comments.
 1.28  11-Nov-2009  cegger branches: 1.28.2;
update comment: we use PMAP_NOCACHE for both pmap_enter and pmap_kenter_pa
 1.27  07-Nov-2009  cegger Add a flags argument to pmap_kenter_pa(9).
Patch showed on tech-kern@ http://mail-index.netbsd.org/tech-kern/2009/11/04/msg006434.html
No objections.
 1.26  19-Jul-2009  rmind pmap_emap_sync: add an argument, and do not perform pmap_load() during
context switch (pmap_destroy() path seems to be unsafe), instead just
perform tlbflush(). Slightly inefficient, but good enough for now.
 1.25  28-Jun-2009  rmind Ephemeral mapping (emap) implementation. Concept is based on the idea that
activity of other threads will perform the TLB flush for the processes using
emap as a side effect. To track that, global and per-CPU generation numbers
are used. This idea was suggested by Andrew Doran; various improvements to
it by me. Notes:

- For now, zero-copy on pipe is not yet enabled.
- TCP socket code would likely need more work.
- Additional UVM loaning improvements are needed.

Proposed on <tech-kern>, silence there.
Quickly reviewed by <ad>.
 1.24  22-Apr-2009  cegger change pmap flags argument from int to u_int.
forgot to commit this.
 1.23  18-Apr-2009  cegger Introduce PMAP_NOCACHE as first PMAP MD bit in x86. Make use of it in pmap_enter().
This saves one extra TLB flush when mapping dma-safe memory.
Presented on tech-kern@, port-i386@ and port-amd64@
ok ad@
 1.22  21-Mar-2009  ad PR port-i386/40143 Viewing an mpeg transport stream with mplayer causes crash

Fix numerous problems:

1. LDT updates are not atomic.

2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.

3. LDTR can be leaked over context switch.

4. GDT slot allocations can race, giving the same LDT slot to two procs.

5. Incomplete interrupt/trap frames can be stacked.

6. In some rare cases segment faults are not handled correctly.
 1.21  09-Dec-2008  pooka branches: 1.21.2;
Make pmap_kernel() an MI macro for struct pmap *kernel_pmap_ptr,
which is now the "API" provided by the pmap module. pmap_kernel()
remains as the syntactic sugar.

Bonus cosmetics round: move all the pmap_t pointer typedefs into
uvm_pmap.h.

Thanks to Greg Oster for providing cpu muscle for doing test builds.
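
The shape of the resulting MI interface, as a sketch of what uvm_pmap.h
provides:

    extern struct pmap *const kernel_pmap_ptr;

    /* pmap_kernel() remains as the syntactic sugar */
    #define pmap_kernel()   kernel_pmap_ptr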
 1.20  16-Sep-2008  bouyer branches: 1.20.2; 1.20.4;
Implement the arch-dependent p2m frame lists list. This adds support for
'xm dump-core' for NetBSD domUs.
From Jean-Yves Migeon (jean-yves dot migeon at espci dot fr)
 1.19  24-Jun-2008  jmcneill branches: 1.19.2;
Define PMAP_FORK -- this was lost in the vmlocking merge, and is required
by options USER_LDT.
 1.18  05-Jun-2008  ad branches: 1.18.2;
pmap_remove_all() for x86. Also, always defer freeing ptps to pmap_update().
There may be a better way to do this, but for now this is simple and avoids
potential bugs.

Proposed on tech-kern and discussed with chs@.
 1.17  04-Jun-2008  ad Revert unintentional change.
 1.16  04-Jun-2008  ad vm_page: put TAILQ_ENTRY into a union with LIST_ENTRY, so we can use both.
 1.15  02-Jun-2008  ad - Don't bother using sse to copy/zero pages on demand. It turns out not
to be worth it.
- If the machine has sse, re-enable zeroing pages in the idle loop and
use the sse instructions so that we don't blow out the cache.
 1.14  03-May-2008  ad branches: 1.14.2;
Back out previous which was not thought through properly.
 1.13  03-May-2008  ad Implement pmap_remove_all().
 1.12  23-Jan-2008  bouyer branches: 1.12.6; 1.12.8; 1.12.10;
Merge the bouyer-xeni386 branch. This brings in PAE support to NetBSD xeni386
(domU only). PAE support is enabled by 'options PAE', see the new XEN3PAE_DOMU
and INSTALL_XEN3PAE_DOMU kernel config files.

See the comments in arch/i386/include/{pte.h,pmap.h} to see how it works.
In short, we still handle it as a 2-level MMU, with the second level page
directory being 4 pages in size. pmap switching is done by switching the
L2 pages in the L3 entries, instead of loading %cr3. This is almost required
by Xen, which handles the last L2 page (the one mapping 0xc0000000 - 0xffffffff)
in a very special way. But this approach should also work for native PAE
support if ever supported (in fact, the pmap should almost support native
PAE; what's missing is bootstrap code in locore.S).
 1.11  20-Jan-2008  yamt - rewrite P->V tracking.
- use a hash rather than SPLAY trees.
SPLAY tree is a wrong algorithm to use here.
will be revisited if it slows down anything other than
micro-benchmarks.
- optimize the single mapping case (it's a common case) by
embedding an entry into mdpage.
- don't keep a pmap pointer as it can be obtained from ptp.
(discussed on port-i386 some years ago.)
ideally, a single paddr_t should be enough to describe a pte.
but it needs some more thoughts as it can increase computational
costs.
- pmap_enter: simplify and fix races with pmap_sync_pv.
- don't bother to lock pm_obj[i] where i > 0, unless DIAGNOSTIC.
- kill mp_link to save space.
- add many KASSERTs.
 1.10  11-Jan-2008  bouyer Merge the bouyer-xeni386 branch to head, at tag bouyer-xeni386-merge1 (the
branch is still active and will see i386 PAE support development).
Summary of changes:
- switch xeni386 to the x86/x86/pmap.c, and the xen/x86/x86_xpmap.c
pmap bootstrap.
- merge back most of xen/i386/ to i386/i386
- change the build to reduce diffs between i386 and amd64 in file locations
- remove include files that were identical to the i386/amd64 counterparts,
the build will find them via the xen-ma/machine link.
 1.9  08-Jan-2008  yamt kill unused PMF_USER_RELOAD.
 1.8  02-Jan-2008  yamt g/c pv_page stuffs.
 1.7  25-Dec-2007  perry Convert many of the uses of __attribute__ to equivalent
__packed, __unused and __dead macros from cdefs.h
 1.6  09-Dec-2007  jmcneill branches: 1.6.2;
Merge jmcneill-pm branch.
 1.5  22-Nov-2007  bouyer branches: 1.5.2; 1.5.4;
Pull up the bouyer-xenamd64 branch to HEAD. This brings in amd64 support
to NetBSD/Xen, both Dom0 and DomU.
 1.4  15-Nov-2007  ad Remove support for 80386 level CPUs. PR port-i386/36163.
 1.3  07-Nov-2007  ad Merge from vmlocking:

- pool_cache changes.
- Debugger/procfs locking fixes.
- Other minor changes.
 1.2  18-Oct-2007  yamt branches: 1.2.2; 1.2.4; 1.2.6; 1.2.8; 1.2.10;
merge yamt-x86pmap branch.

- reduce differences between amd64 and i386. notably, share pmap.c
between them. it makes several i386 pmap improvements available to
amd64, including tlb shootdown reduction and bug fixes from Stephan Uphoff.
- implement deferred pmap switching for amd64.
- remove LARGEPAGES option. always use large pages if available.
also, make it work on amd64.
 1.1  08-Oct-2007  yamt branches: 1.1.2; 1.1.4;
file pmap.h was initially added on branch yamt-x86pmap.
 1.1.4.4  18-Nov-2007  bouyer Sync with HEAD
 1.1.4.3  13-Nov-2007  bouyer Sync with HEAD
 1.1.4.2  25-Oct-2007  bouyer Finish sync with HEAD. Especially use the new x86 pmap for xenamd64.
For this:
- rename pmap_pte_set() to pmap_pte_testset()
- make pmap_pte_set() a function or macro for non-atomic PTE write
- define and use pmap_pa2pte()/pmap_pte2pa() to read/write PTE entries
- define pmap_pte_flush() which is a nop in x86 case, and flush the
MMUops queue in the Xen case
 1.1.4.1  25-Oct-2007  bouyer Sync with HEAD.
 1.1.2.3  18-Oct-2007  yamt #ifdef out an unused member for x86_64.
 1.1.2.2  14-Oct-2007  yamt move pl_i_roundup to a header.
 1.1.2.1  08-Oct-2007  yamt merge some parts of x86 pmap.h.
 1.2.10.5  23-Mar-2008  matt sync with HEAD
 1.2.10.4  09-Jan-2008  matt sync with HEAD
 1.2.10.3  08-Nov-2007  matt sync with -HEAD
 1.2.10.2  06-Nov-2007  matt sync with HEAD
 1.2.10.1  18-Oct-2007  matt file pmap.h was added on branch matt-armv6 on 2007-11-06 23:23:38 +0000
 1.2.8.4  18-Feb-2008  mjf Sync with HEAD.
 1.2.8.3  27-Dec-2007  mjf Sync with HEAD.
 1.2.8.2  08-Dec-2007  mjf Sync with HEAD.
 1.2.8.1  19-Nov-2007  mjf Sync with HEAD.
 1.2.6.6  04-Feb-2008  yamt sync with head.
 1.2.6.5  21-Jan-2008  yamt sync with head
 1.2.6.4  07-Dec-2007  yamt sync with head
 1.2.6.3  15-Nov-2007  yamt sync with head.
 1.2.6.2  27-Oct-2007  yamt sync with head.
 1.2.6.1  18-Oct-2007  yamt file pmap.h was added on branch yamt-lazymbuf on 2007-10-27 11:28:56 +0000
 1.2.4.6  27-Nov-2007  joerg Sync with HEAD. amd64 Xen support needs testing.
 1.2.4.5  21-Nov-2007  joerg Sync with HEAD.
 1.2.4.4  11-Nov-2007  joerg Sync with HEAD.
 1.2.4.3  28-Oct-2007  joerg Cosmetic: reduce diff to HEAD.
 1.2.4.2  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.2.4.1  18-Oct-2007  joerg file pmap.h was added on branch jmcneill-pm on 2007-10-26 15:43:44 +0000
 1.2.2.4  03-Dec-2007  ad Sync with HEAD.
 1.2.2.3  24-Oct-2007  ad Use a pool_cache to allocate pv entries. PR port-i386/37193.
 1.2.2.2  23-Oct-2007  ad Sync with head.
 1.2.2.1  18-Oct-2007  ad file pmap.h was added on branch vmlocking on 2007-10-23 20:36:40 +0000
 1.5.4.1  11-Dec-2007  yamt sync with head.
 1.5.2.1  26-Dec-2007  ad Sync with head.
 1.6.2.7  20-Jan-2008  bouyer Sync with HEAD
 1.6.2.6  17-Jan-2008  bouyer - Fix L2_SLOT_APTE value (not sure how I got this value but it was definitively
wrong)
- Use global variable for the PAE L3 page adresses, so that pmap.c can get it
from the bootstrap code
- Extent the size of our virtual PDP from 3 to 4 pages, so that pmap->pm_pdir[]
is contigous for the whole VA range. The last page is a shadow of
the kernel's real PDP (L3[3]).
- make pm_pdirpa an array of 4 paddr_t if using PAE. introduce a
pmap_pdirpa macro to get the physical address of a given PD entry.
- fix pmap_map_pte

The kernel now boots single-user. fsck will cause a kernel fault in
pmap_pdes_invalid() on exit.
 1.6.2.5  13-Jan-2008  bouyer Work in progress on xeni386 PAE support:
Make xeni386 build with a 64bit paddr_t. For this, vaddr_t vs. paddr_t vs.
pointer usage had to be clarified.
If 'options PAE' is present in a Xen3 kernel, switch paddr_t, pd_entry_t
and pt_entry_t to 64bits, and add the PAE entry in the __xen_guest ELF section.
 1.6.2.4  10-Jan-2008  bouyer Sync with HEAD
 1.6.2.3  02-Jan-2008  bouyer Sync with HEAD
 1.6.2.2  13-Dec-2007  bouyer - make amd64 XEN3 kernels build again
- pin the pdp pages in the PDP cache constructor, and unpin them in the
destructor. garbage-collect PMF_USER_XPIN.
 1.6.2.1  11-Dec-2007  bouyer Switch i386 to x86/x86/pmap.c
 1.12.10.5  11-Aug-2010  yamt sync with head.
 1.12.10.4  11-Mar-2010  yamt sync with head
 1.12.10.3  19-Aug-2009  yamt sync with head.
 1.12.10.2  18-Jul-2009  yamt sync with head.
 1.12.10.1  04-May-2009  yamt sync with head.
 1.12.8.2  17-Jun-2008  yamt sync with head.
 1.12.8.1  04-Jun-2008  yamt sync with head
 1.12.6.4  17-Jan-2009  mjf Sync with HEAD.
 1.12.6.3  28-Sep-2008  mjf Sync with HEAD.
 1.12.6.2  29-Jun-2008  mjf Sync with HEAD.
 1.12.6.1  05-Jun-2008  mjf Sync with HEAD.

Also fix build.
 1.14.2.3  24-Sep-2008  wrstuden Merge in changes between wrstuden-revivesa-base-2 and
wrstuden-revivesa-base-3.
 1.14.2.2  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.14.2.1  23-Jun-2008  wrstuden Sync w/ -current. 34 merge conflicts to follow.
 1.18.2.1  27-Jun-2008  simonb Sync with head.
 1.19.2.2  13-Dec-2008  haad Update haad-dm branch to haad-dm-base2.
 1.19.2.1  19-Oct-2008  haad Sync with HEAD.
 1.20.4.1  04-Apr-2009  snj Pull up following revision(s) (requested by ad in ticket #656):
sys/arch/amd64/amd64/gdt.c: revision 1.21 via patch
sys/arch/amd64/amd64/machdep.c: revision 1.129 via patch
sys/arch/i386/i386/gdt.c: revision 1.47 via patch
sys/arch/i386/i386/kvm86.c: revision 1.17 via patch
sys/arch/i386/i386/locore.S: revision 1.85 via patch
sys/arch/i386/i386/machdep.c: revision 1.666 via patch
sys/arch/i386/i386/vector.S: revision 1.45 via patch
sys/arch/i386/include/pcb.h: revision 1.47 via patch
sys/arch/x86/include/pmap.h: revision 1.22 via patch
sys/arch/x86/include/sysarch.h: revision 1.8 via patch
sys/arch/x86/x86/pmap.c: revision 1.80 via patch
sys/arch/x86/x86/sys_machdep.c: revision 1.17 via patch
sys/compat/linux/arch/i386/linux_machdep.c: revision 1.143 via patch
sys/kern/init_main.c: revision 1.384 via patch
PR port-i386/40143 Viewing an mpeg transport stream with mplayer causes crash
Fix numerous problems:
1. LDT updates are not atomic.
2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.
3. LDTR can be leaked over context switch.
4. GDT slot allocations can race, giving the same LDT slot to two procs.
5. Incomplete interrupt/trap frames can be stacked.
6. In some rare cases segment faults are not handled correctly.
 1.20.2.2  28-Apr-2009  skrll Sync with HEAD.
 1.20.2.1  19-Jan-2009  skrll Sync with HEAD.
 1.21.2.11  27-Aug-2011  jym Sync with HEAD. Most notably: uvm/pmap work done by rmind@, and MP Xen
work of cherry@.

No regression observed on suspend/restore.
 1.21.2.10  26-May-2011  jym Pull-up some modifications from -current to my branch.
 1.21.2.9  02-May-2011  jym Sync with head.
 1.21.2.8  28-Mar-2011  jym Sync with HEAD. TODO before merge:
- shortcut for suspend code in sysmon, when powerd(8) is not running.
Borrow ``xs_watch'' thread context?
- bug hunting in xbd + xennet resume. Rings are currently thrashed upon
resume, so the current implementation force-flushes them on suspend. It's not
really needed.
 1.21.2.7  24-Oct-2010  jym Sync with HEAD
 1.21.2.6  01-Nov-2009  jym - Upgrade suspend/resume code to comply with Xen2 removal.
- Add support for PAE domUs suspend/resume.
- Fix an issue regarding initialization of the xbd ring I/O that could end
badly during resume, with invalid block operations submitted to dom0 backend.

NetBSD supports PAE under x86_32 by considering the L2 page as being
4 pages long instead of 1.

Xen validates the page types during resume. Sadly, the hypervisor handles
alternative recursive mappings (== PG/PD entries pointing to pages other
than self) inadequately, which could lead to incorrect page pinning.

As a result, the important change with this patch is to clear these alternative
mappings during suspend, and reset them back to their former selves upon
resume. For PAE, all 4 PDIR_SLOT_PTE entries can be considered
alternative recursive mappings.

See comments in pmap.c for further details.

Now, let the testing and bug hunting begin.
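A minimal sketch of the suspend/resume handling described above, assuming
the usual x86 pmap definitions (pd_entry_t, PDIR_SLOT_PTE, PDP_SIZE == 4
on PAE, pmap_pte_set()); the save area and function names are illustrative:

	/*
	 * Sketch: clear the 4 recursive PDIR_SLOT_PTE entries before
	 * suspend, since Xen mishandles alternative recursive mappings
	 * while pinning, and restore them on resume.
	 */
	static pd_entry_t saved_rec[PDP_SIZE];	/* hypothetical save area */

	static void
	pmap_suspend_clear_recursive(pd_entry_t *pdir)
	{
		for (int i = 0; i < PDP_SIZE; i++) {
			saved_rec[i] = pdir[PDIR_SLOT_PTE + i];
			pmap_pte_set(&pdir[PDIR_SLOT_PTE + i], 0);
		}
	}

	static void
	pmap_resume_restore_recursive(pd_entry_t *pdir)
	{
		for (int i = 0; i < PDP_SIZE; i++)
			pmap_pte_set(&pdir[PDIR_SLOT_PTE + i], saved_rec[i]);
	}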
 1.21.2.5  01-Nov-2009  jym Sync with HEAD.
 1.21.2.4  24-Jul-2009  jym - rework the page pinning API, so that a function is now provided for
each level of indirection encountered during virtual memory translation. Update
pmap accordingly. Pinning looks cleaner that way, and it offers the possibility
of pinning lower-level pages if necessary (NetBSD does not currently do so).

- some fixes and comments to explain how page validation/invalidation take
place during save/restore/migrate under Xen. L2 shadow entries from PAE are now
handled, so basically, suspend/resume works with PAE.

- fixes an issue reported by Christoph (cegger@) for xencons suspend/resume
in dom0.

TODO:

- PAE save/restore is currently limited to single-user only, multi-user
support requires modifications in PAE pmap that should be discussed first. See
the comments about the L2 shadow pages cached in pmap_pdp_cache in this commit.

- grant table bug is still there; do not use the kernels of this branch
to test suspend/resume, unless you want to experience bad crashes in dom0,
and push the big red button.

Now there is light at the end of the tunnel :)

Note: XEN2 kernels will neither build nor work with this branch.
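The per-level pin functions mentioned above can be pictured as thin
wrappers over a generic queue operation, one per level of the page-table
hierarchy; a sketch assuming xpq_queue_pin_table() enqueues the Xen
mmuext op (MMUEXT_PIN_L[1-4]_TABLE are Xen interface constants):

	static inline void
	xpq_queue_pin_l1_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L1_TABLE);
	}

	static inline void
	xpq_queue_pin_l2_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L2_TABLE);
	}

	static inline void
	xpq_queue_pin_l3_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L3_TABLE);
	}

	static inline void
	xpq_queue_pin_l4_table(paddr_t pa)
	{
		xpq_queue_pin_table(pa, MMUEXT_PIN_L4_TABLE);
	}

This shape lets a PAE kernel pin its four L2 pages individually, and
leaves room to pin lower-level pages later if wanted.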
 1.21.2.3  23-Jul-2009  jym Sync with HEAD.
 1.21.2.2  31-May-2009  jym Modifications for the Xen suspend/migrate/resume branch:

- introduce xenbus_device_{suspend,resume}() functions. These are routines
used to suspend/resume MI parts of the Xenbus device interfaces, like updating
frontend/backend devices' paths found in XenStore.

- introduce HYPERVISOR_sysctl(), a hypercall used only by Xentools to obtain
information from the hypervisor (listing VMs, printing console, etc.). I use it
to query xenconsole from ddb(), as a last resort in case of a panic() in
dom0 (xm not being available). Currently unused in the branch; could be, if
requested.

- disable the rwlock(9) used to protect code that could use transient MFNs.
It could trigger nasty context switches in places where it should not.

- fix some bugs in the xennet/xbd suspend/resume pmf(9) handlers.

- following XenSource's design, talk_to_otherend() is now called
watch_otherend(), and free_otherend_details() is used by Xenbus device
suspend/resume routines.

- some slight modifications in pmap regarding APDP. Introduce an inline
function (pmap_unmap_apdp_pde()) that clears APDP entry for the current pmap.

- similarly, implement pmap_unmap_all_apdp_pdes() that iterates through all
pmaps and tears down APDP, as Xen does not handle them properly.

TODO/XXX:

- pmap_unmap_apdp_pde() does not handle the APDP shadow entry of PAE. It will,
once I figure out how PAE uses it.

- revisit the pmap locking issue regarding transient MFNs. As NetBSD does not
use kernel preemption and MP for Xen, this can be skipped for now. See
http://mail-index.netbsd.org/port-xen/2009/04/27/msg004903.html for details.

- fix a bug regarding grant tables which could technically DoS a dom0 if
ridiculously high consumer/producer indexes are passed down in the ring during
a resume.

All in all, once the grant table index issue and APDP PAE are fixed, next step
is to torture test this branch.

Tested under i386 PAE and non-PAE, Xen3 dom0 and domU. amd64 is only
compile-tested.
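A sketch of the two APDP helpers described above, assuming the x86 pmap's
APDP_PDE slot, the global pmaps list with its pmaps_lock, and
pmap_pte_set()/pmap_pte_flush(); the per-pmap clearing helper is
hypothetical:

	static inline void
	pmap_unmap_apdp_pde(void)
	{
		/* Clear the alternative recursive slot of the current pmap. */
		pmap_pte_set(APDP_PDE, 0);
		pmap_pte_flush();
	}

	static void
	pmap_unmap_all_apdp_pdes(void)
	{
		struct pmap *pm;

		/* Xen does not handle APDP entries properly, so tear
		 * them all down before handing control to it. */
		mutex_enter(&pmaps_lock);
		LIST_FOREACH(pm, &pmaps, pm_list) {
			pmap_clear_apdp(pm);	/* hypothetical helper */
		}
		mutex_exit(&pmaps_lock);
	}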
 1.21.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.28.2.2  17-Aug-2010  uebayasi Sync with HEAD.
 1.28.2.1  30-Apr-2010  uebayasi Sync with HEAD.
 1.29.2.11  31-May-2011  rmind sync with head
 1.29.2.10  19-May-2011  rmind Implement sharing of vnode_t::v_interlock amongst vnodes:
- Lock is shared amongst UVM objects using uvm_obj_setlock() or getnewvnode().
- Adjust vnode cache to handle unsharing, add VI_LOCKSHARE flag for that.
- Use sharing in tmpfs and layerfs for underlying object.
- Simplify locking in ubc_fault().
- Sprinkle some asserts.

Discussed with ad@.
 1.29.2.9  17-Mar-2011  rmind - Fix tlbflushg() to behave like tlbflush() if the page global extension (PGE)
is not (yet) enabled. This fixes an issue with a stale TLB entry, experienced
early on boot, when PGE is not yet set on the primary CPU.
- Rewrite i386/amd64 TLB interrupt handlers in C (only stubs are in assembly),
which simplifies and unifies (under x86) code, plus fixes a few bugs.
- cpu_attach: remove assignment to cpus_running, as the primary CPU might not be
attached first, which causes a reset (and thus missed secondary CPUs).
 1.29.2.8  08-Mar-2011  rmind struct pmap_tlb_mailbox: make tm_pending and tm_gen volatile.
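The point of the volatile qualifiers: remote CPUs poll these fields while
the initiating CPU updates them, so every read must go to memory. A sketch
with an illustrative layout:

	struct pmap_tlb_mailbox {
		volatile uint32_t tm_pending;	/* CPUs yet to acknowledge */
		volatile uint32_t tm_gen;	/* generation of the request */
	};

	/* Initiator side (illustrative): without volatile, the compiler
	 * could hoist the load out of the loop and spin forever. */
	static void
	tlb_shootdown_wait(struct pmap_tlb_mailbox *tm)
	{
		while (tm->tm_pending != 0)
			x86_pause();
	}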
 1.29.2.7  05-Mar-2011  rmind sync with head
 1.29.2.6  31-May-2010  rmind - Split off Xen versions of pmap_map_ptes/pmap_unmap_ptes into Xen pmap,
also move pmap_apte_flush() with pmap_unmap_apdp() there.
- Make Xen buildable.
 1.29.2.5  30-May-2010  rmind sync with head
 1.29.2.4  26-May-2010  rmind Split x86 TLB shootdown code into a separate file.
Code part is under TNF license, as per pmap.c 1.105.2.4 revision.
 1.29.2.3  26-Apr-2010  rmind Partly rewrite the amd64 TLB shootdown handler for the changes in x86 pmap.
At this point, the branch seems to pass preliminary stress tests on amd64.
 1.29.2.2  26-Apr-2010  rmind Apply renovated patch to significantly reduce TLB shootdowns in x86 pmap,
also provide TLBSTATS option to measure and track TLB shootdowns. Details:

http://mail-index.netbsd.org/port-i386/2009/01/11/msg001018.html

Patch from Andrew Doran, proposed on tech-x86 [sic], in January 2009.

XXX: amd64 and xen are not done yet; work in progress.
 1.29.2.1  16-Mar-2010  rmind Change struct uvm_object::vmobjlock to be dynamically allocated with
mutex_obj_alloc(). It allows us to share the locks among UVM objects.
 1.33.4.2  17-Feb-2011  bouyer Sync with HEAD
 1.33.4.1  08-Feb-2011  bouyer Sync with HEAD
 1.33.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.38.2.3  20-Sep-2011  cherry Remove the "xpq lock", since we have per-cpu mmu queues now. This may need further testing. Also add some preliminary locking around queue-ops in the network backend driver
 1.38.2.2  23-Jun-2011  cherry Catchup with rmind-uvmplock merge.
 1.38.2.1  03-Jun-2011  cherry Initial import of xen MP sources, with kernel and userspace tests.
- this is a source preview.
- boots to single user.
- spurious interrupt and pmap-related panics are normal
 1.43.2.6  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.43.2.5  16-Jan-2013  yamt sync with (a bit old) head
 1.43.2.4  23-May-2012  yamt sync with head.
 1.43.2.3  17-Apr-2012  yamt sync with head
 1.43.2.2  18-Nov-2011  yamt share a lock among pmap uobjs
 1.43.2.1  10-Nov-2011  yamt sync with head
 1.48.2.3  29-Apr-2012  mrg sync to latest -current.
 1.48.2.2  05-Apr-2012  mrg sync to latest -current.
 1.48.2.1  18-Feb-2012  mrg merge to -current.
 1.49.2.3  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1441):
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
sys/arch/x86/include/pmap.h: revision 1.63 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
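In PTE terms the fix is simply that kernel mappings must carry PG_k (zero)
rather than PG_u; a didactic sketch with a hypothetical helper name:

	/*
	 * Sketch: pick the user bit by address space.  A kernel VA with
	 * PG_u set would be tagged global by Xen and survive in the TLB
	 * after the return to userland, which is the hole described above.
	 */
	static inline pt_entry_t
	pmap_pte_user_bit(vaddr_t va)
	{
		if (va < VM_MAXUSER_ADDRESS)
			return PG_u;	/* userland: Xen marks it global */
		return PG_k;		/* kernel: stays local, gets flushed */
	}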
 1.49.2.2  09-May-2012  riz branches: 1.49.2.2.4; 1.49.2.2.6;
Pull up following revision(s) (requested by rmind in ticket #202):
sys/arch/x86/include/cpuvar.h: revision 1.46
sys/arch/xen/include/xenpmap.h: revision 1.34
sys/arch/i386/include/param.h: revision 1.77
sys/arch/x86/x86/pmap_tlb.c: revision 1.5
sys/arch/x86/x86/pmap_tlb.c: revision 1.6
sys/arch/i386/i386/genassym.cf: revision 1.92
sys/arch/xen/x86/cpu.c: revision 1.91
sys/arch/x86/x86/pmap.c: revision 1.177
sys/arch/xen/x86/xen_pmap.c: revision 1.21
sys/arch/x86/acpi/acpi_wakeup.c: revision 1.31
sys/kern/subr_kcpuset.c: revision 1.5
sys/arch/amd64/include/param.h: revision 1.18
sys/sys/kcpuset.h: revision 1.5
sys/arch/x86/x86/mtrr_i686.c: revision 1.26
sys/arch/x86/x86/mtrr_i686.c: revision 1.27
sys/arch/xen/x86/x86_xpmap.c: revision 1.43
sys/arch/x86/x86/cpu.c: revision 1.98
sys/arch/amd64/amd64/mptramp.S: revision 1.14
sys/kern/sys_sched.c: revision 1.42
sys/arch/amd64/amd64/genassym.cf: revision 1.50
sys/arch/i386/i386/mptramp.S: revision 1.24
sys/arch/x86/include/pmap.h: revision 1.52
sys/arch/x86/include/cpu.h: revision 1.50
- Convert x86 MD code, mainly pmap(9) e.g. TLB shootdown code, to use
kcpuset(9) and thus replace hardcoded CPU bitmasks. This removes the
limitation of maximum CPUs.
- Support up to 256 CPUs on amd64 architecture by default.
Bug fixes, improvements, completion of Xen part and testing on 64-core
AMD Opteron(tm) Processor 6282 SE (also, as Xen HVM domU with 128 CPUs)
by Manuel Bouyer.
- pmap_tlb_shootdown: do not overwrite tp_cpumask with pm_cpus, but merge
like pm_kernel_cpus. Remove unnecessary intersection with kcpuset_running.
Do not reset tp_userpmap if pmap_kernel().
- Remove pmap_tlb_mailbox_t wrapping, which is pointless after recent changes.
- pmap_tlb_invalidate, pmap_tlb_intr: constify for packet structure.
i686_mtrr_init_first: handle the case when there are no variable-size MTRR
registers available (i686_mtrr_vcnt == 0).
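A sketch of the tp_cpumask handling fix noted above, using the kcpuset(9)
API (kcpuset_merge() and pmap_kernel() are real interfaces; the packet
structure shown is an illustrative subset):

	struct pmap_tlb_packet {		/* illustrative subset */
		kcpuset_t *tp_cpumask;		/* target CPUs */
		bool tp_userpmap;		/* user pmap involved? */
	};

	static void
	pmap_tlb_shootdown_add(struct pmap_tlb_packet *tp, struct pmap *pm)
	{
		if (pm == pmap_kernel()) {
			/* Do not touch tp_userpmap here. */
			kcpuset_merge(tp->tp_cpumask, pm->pm_kernel_cpus);
		} else {
			/* Merge, do not overwrite, the pmap's CPU set. */
			kcpuset_merge(tp->tp_cpumask, pm->pm_cpus);
			tp->tp_userpmap = true;
		}
	}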
 1.49.2.1  22-Feb-2012  riz Pull up following revision(s) (requested by bouyer in ticket #29):
sys/arch/xen/x86/x86_xpmap.c: revision 1.39
sys/arch/xen/include/hypervisor.h: revision 1.37
sys/arch/xen/include/intr.h: revision 1.34
sys/arch/xen/x86/xen_ipi.c: revision 1.10
sys/arch/x86/x86/cpu.c: revision 1.97
sys/arch/x86/include/cpu.h: revision 1.48
sys/uvm/uvm_map.c: revision 1.315
sys/arch/x86/x86/pmap.c: revision 1.165
sys/arch/xen/x86/cpu.c: revision 1.81
sys/arch/x86/x86/pmap.c: revision 1.167
sys/arch/xen/x86/cpu.c: revision 1.82
sys/arch/x86/x86/pmap.c: revision 1.168
sys/arch/xen/x86/xen_pmap.c: revision 1.17
sys/uvm/uvm_km.c: revision 1.122
sys/uvm/uvm_kmguard.c: revision 1.10
sys/arch/x86/include/pmap.h: revision 1.50
Apply patch proposed in PR port-xen/45975 (this does not solve the exact
problem reported here but is part of the solution):
xen_kpm_sync() is not working as expected,
leading to races between CPUs.
1) the check (xpq_cpu != &x86_curcpu) is always false because we
have different x86_curcpu symbols with different addresses in the kernel.
Fortunately, all addresses disassemble to the same code.
Because of this we always use the code intended for bootstrap, which doesn't
use cross-calls or locks.
2) once 1) above is fixed, xen_kpm_sync() will use xcalls to sync other CPUs,
which causes it to sleep, and pmap.c doesn't like that. It triggers this
KASSERT() in pmap_unmap_ptes():
KASSERT(pmap->pm_ncsw == curlwp->l_ncsw);
3) pmap->pm_cpus is not safe for the purpose of xen_kpm_sync(), which
needs to know on which CPU a pmap is loaded *now*:
pmap->pm_cpus is cleared before cpu_load_pmap() is called to switch
to a new pmap, leaving a window where a pmap is still in a CPU's
ci_kpm_pdir but not in pm_cpus. As a virtual CPU may be preempted
by the hypervisor at any time, this window can be large enough to let another
CPU free the PTP and reuse it as a normal page.
To fix 2), avoid cross-calls and IPIs completely, and instead
use a mutex to update all CPUs' ci_kpm_pdir from the local CPU.
It's safe because we just need to update the table page; a tlbflush IPI will
happen later. As a side effect, we don't need different code for bootstrap,
fixing 1). The mutex added to struct cpu needs a small header reorganisation.
To fix 3), introduce pm_xen_ptp_cpus, which is updated from
cpu_pmap_load() with the ci_kpm_mtx mutex held. Checking it with
ci_kpm_mtx held will avoid overwriting the wrong pmap's ci_kpm_pdir.
While there I removed the unused pmap_is_active() function, and
added some more details to DIAGNOSTIC panics.
When using uvm_km_pgremove_intrsafe() make sure mappings are removed
before returning the pages to the free pool. Otherwise, under Xen,
a page which still has a writable mapping could be allocated for
a PDP by another CPU and the hypervisor would refuse it (this is
PR port-xen/45975).
For this, move the pmap_kremove() calls inside uvm_km_pgremove_intrsafe(),
and do pmap_kremove()/uvm_pagefree() in batch of (at most) 16 entries
(as suggested by Chuck Silvers on tech-kern@, see also
http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012727.html and
followups).
Avoid early use of xen_kpm_sync(); locks are not available at this time.
Don't call cpu_init() twice.
Makes LOCKDEBUG kernels boot again
Revert pmap_pte_flush() -> xpq_flush_queue() in previous.
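A sketch of the approach chosen for 2): walk the CPUs from the local CPU
and update each ci_kpm_pdir under its ci_kpm_mtx, leaving TLB invalidation
to the later tlbflush IPI (the update helper is hypothetical):

	static void
	xen_kpm_sync_sketch(struct pmap *pmap, int index)
	{
		CPU_INFO_ITERATOR cii;
		struct cpu_info *ci;

		for (CPU_INFO_FOREACH(cii, ci)) {
			mutex_enter(&ci->ci_kpm_mtx);
			/* Only the table page is updated here; the
			 * TLBs are flushed later by the usual IPI. */
			xen_kpm_pd_update(ci->ci_kpm_pdir, pmap, index);	/* hypothetical */
			mutex_exit(&ci->ci_kpm_mtx);
		}
	}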
 1.49.2.2.6.1  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1441):
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
sys/arch/x86/include/pmap.h: revision 1.63 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.49.2.2.4.1  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1441):
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
sys/arch/x86/include/pmap.h: revision 1.63 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.52.2.3  03-Dec-2017  jdolecek update from HEAD
 1.52.2.2  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.52.2.1  20-Nov-2012  tls Resync to 2012-11-19 00:00:00 UTC
 1.54.2.1  18-May-2014  rmind sync with head
 1.55.6.6  28-Aug-2017  skrll Sync with HEAD
 1.55.6.5  05-Dec-2016  skrll Sync with HEAD
 1.55.6.4  05-Oct-2016  skrll Sync with HEAD
 1.55.6.3  09-Jul-2016  skrll Sync with HEAD
 1.55.6.2  27-Dec-2015  skrll Sync with HEAD (as of 26th Dec)
 1.55.6.1  06-Apr-2015  skrll Sync with HEAD
 1.55.4.3  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1388):
sys/arch/x86/x86/pmap.c: revision 1.241
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.55.4.2  18-Dec-2016  snj Pull up following revision(s) (requested by riastradh in ticket #1316):
sys/arch/x86/x86/pmap.c: revision 1.223
sys/arch/x86/x86/vm_machdep.c: revision 1.26
sys/arch/x86/include/pmap.h: revision 1.61
PR/49691: KAMADA Ken'ichi: free deferred ptp mappings if present.
XXX: pullup-7
 1.55.4.1  23-Apr-2015  snj branches: 1.55.4.1.2; 1.55.4.1.4;
Pull up following revision(s) (requested by mrg in ticket #718):
sys/arch/x86/include/pmap.h: revision 1.56
sys/arch/x86/x86/pmap.c: revision 1.188
sys/dev/pci/agp_amd64.c: revision 1.8
sys/dev/pci/agp_i810.c: revision 1.118
sys/external/bsd/drm2/dist/drm/i915/i915_dma.c: revision 1.16
sys/external/bsd/drm2/dist/drm/i915/i915_gem.c: revision 1.29
sys/external/bsd/drm2/dist/drm/nouveau/nouveau_agp.c: revision 1.3
sys/external/bsd/drm2/dist/drm/nouveau/nouveau_ttm.c: revision 1.4
sys/external/bsd/drm2/dist/drm/radeon/atombios_crtc.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_agp.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_display.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_legacy_crtc.c: revision 1.2
sys/external/bsd/drm2/dist/drm/radeon/radeon_object.c: revision 1.3
sys/external/bsd/drm2/dist/drm/radeon/radeon_ttm.c: revision 1.7
sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c: revisions 1.7-1.10
sys/external/bsd/drm2/dist/drm/ttm/ttm_bo_util.c: revision 1.5
sys/external/bsd/drm2/i915drm/intelfb.c: revision 1.13
sys/external/bsd/drm2/include/drm/drm_wait_netbsd.h: revisions 1.12, 1.13
sys/external/bsd/drm2/include/linux/mm.h: revision 1.5
sys/external/bsd/drm2/include/linux/pci.h: revisions 1.16, 1.17
sys/external/bsd/drm2/nouveau/nouveaufb.c: revision 1.2
sys/external/bsd/drm2/radeon/radeon_pci.c: revisions 1.8, 1.9
sys/uvm/uvm_init.c: revision 1.46
Hack against the blank console problem:
Leave the CLUT alone on ancient cards. At least this leaves us with a
semi-working console (red and blue are flipped). Leave an example of what
seems to be happening but disable it because colors are better than 444 bit
greyscale.
--
Initialize P->V tracking for unmanaged device pages in uvm_init.

Conditional on __HAVE_PMAP_PV_TRACK until we add it to all pmaps.

MI part of pmap_pv(9) change proposed on tech-kern:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html
--
Implement pmap_pv(9) for x86 for P->V tracking of unmanaged pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html
--
Use pmap_pv(9) to remove mappings of Intel graphics aperture pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html

Further background at:

https://mail-index.netbsd.org/tech-kern/2014/07/23/msg017392.html
--
Use pmap_pv(9) to remove mappings of device pages in TTM.

Adapt nouveau and radeon to do pmap_pv_track for their device pages.

Proposed on tech-kern with no objections:

https://mail-index.netbsd.org/tech-kern/2015/03/26/msg018561.html

Further background at:

https://mail-index.netbsd.org/tech-kern/2014/07/23/msg017392.html
--
Fix error branches in agp_amd64.c.

- agp_generic_detach always.
- Free asc if it was allocated. (Found by Brainy, noted by maxv@.)
- Free the GATT if it was allocated.
--
pmf_device_register returns false on failure, not true
--
In DRM_SPIN_WAIT_ON, don't stop after waiting only one tick.

Continue the loop to recheck the condition and count the whole
duration.
--
Don't use the video BIOS memory as an i915 flush page!
--
Don't let anyone else allocate the video BIOS either.
--
Missed a zero: it's 0x100000, not 0x10000.
--
Don't reserve if atomic -- caller must have pre-pinned the buffer.
--
Don't reserve if atomic -- caller must have pre-pinned the buffer.
--
Almost add radeondrmkms suspend/resume support; unfortunately it doesn't work.
--
Need the page's uvm object lock to do pmap_page_protect.
--
Use KASSERTMSG to show bad base/offset.
--
KASSERT about page-alignment on initialization too.
--
Don't break when hardclock_ticks wraps around.

Since we now only count time spent in wait, rather than determining
the end time and checking whether we've passed it, timeouts might be
marginally longer in effect. Unlikely to be an issue.
--
Remove broken drm2 vm_mmap stub. Can't possibly have ever worked.
--
Apply some of the additional changes from Arto Huusko in PR#49645:
- call pmf_device_deregister on detach.

I've kept the "resume = true" for the radeon_resume_kms() call as it
seems to work for me (indeed, code inspection shows it is unused
on NetBSD :-)

My old nforce4 box, which can resume old drm (or could, last I tried
several years ago) while X and GL apps were running, can at least
survive a resume if X hasn't started. My one attempt so far with
X exited, but having run, did not work.
--
First attempt to make ttm_buffer_object_transfer less bogus.
--
Make sure mem.bus.is_iomem is initialized. PR 49833
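Driver-side usage of the pmap_pv(9) tracking added here might look like
the following sketch; pmap_pv_track()/pmap_pv_untrack() are per the
proposal, pmap_pv_protect() is assumed from the TTM usage, and the
aperture address/size are placeholders:

	void
	example_attach_aperture(paddr_t base, psize_t size)
	{
		/* Begin P->V tracking for the unmanaged device pages. */
		pmap_pv_track(base, size);
	}

	void
	example_detach_aperture(paddr_t base, psize_t size)
	{
		/* Remove remaining mappings page by page (assumed
		 * interface), then stop tracking. */
		for (paddr_t pa = base; pa < base + size; pa += PAGE_SIZE)
			pmap_pv_protect(pa, VM_PROT_NONE);
		pmap_pv_untrack(base, size);
	}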
 1.55.4.1.4.2  13-Mar-2017  skrll Sync with netbsd-7-1-RELEASE
 1.55.4.1.4.1  18-Jan-2017  skrll Sync with netbsd-5
 1.55.4.1.2.2  06-Mar-2017  snj Pull up following revision(s) (requested by bouyer in ticket #1388):
sys/arch/x86/include/pmap.h: revision 1.63 via patch
sys/arch/x86/x86/pmap.c: revision 1.241 via patch
Should be PG_k, doesn't change anything.
--
Remove PG_u from the kernel pages on Xen. Otherwise there is no privilege
separation between the kernel and userland.
On Xen-amd64, the kernel runs in ring3 just like userland, and the
separation is guaranteed by the hypervisor - each syscall/trap is
intercepted by Xen and sent manually to the kernel. Before that, the
hypervisor modifies the page tables so that the kernel becomes accessible.
Later, when returning to userland, the hypervisor removes the kernel pages
and flushes the TLB.
However, TLB flushes are costly, and in order to reduce the number of pages
flushed Xen marks the userland pages as global, while keeping the kernel
ones as local. This way, when returning to userland, only the kernel pages
get flushed - which makes sense since they are the only ones that got
removed from the mapping.
Xen differentiates the userland pages by looking at their PG_u bit in the
PTE; if a page has this bit then Xen tags it as global, otherwise Xen
manually adds the bit but keeps the page as local. The thing is, since we
set PG_u in the kernel pages, Xen believes our kernel pages are in fact
userland pages, so it marks them as global. Therefore, when returning to
userland, the kernel pages indeed get removed from the page tree, but are
not flushed from the TLB. Which means that they are still accessible.
With this - and depending on the DTLB size - userland has a small window
where it can read/write to the last kernel pages accessed, which is enough
to completely escalate privileges: the sysent structure systematically gets
read when performing a syscall, and chances are that it will still be
cached in the TLB. Userland can then use this to patch a chosen syscall,
make it point to a userland function, retrieve %gs and compute the address
of its credentials, and finally grant itself root privileges.
 1.55.4.1.2.1  18-Dec-2016  snj Pull up following revision(s) (requested by riastradh in ticket #1316):
sys/arch/x86/x86/pmap.c: revision 1.223
sys/arch/x86/x86/vm_machdep.c: revision 1.26
sys/arch/x86/include/pmap.h: revision 1.61
PR/49691: KAMADA Ken'ichi: free deferred ptp mappings if present.
XXX: pullup-7
 1.58.2.5  26-Apr-2017  pgoyette Sync with HEAD
 1.58.2.4  20-Mar-2017  pgoyette Sync with HEAD
 1.58.2.3  07-Jan-2017  pgoyette Sync with HEAD. (Note that most of these changes are simply $NetBSD$
tag issues.)
 1.58.2.2  04-Nov-2016  pgoyette Sync with HEAD
 1.58.2.1  26-Jul-2016  pgoyette Sync with HEAD
 1.61.2.1  21-Apr-2017  bouyer Sync with HEAD
 1.64.6.2  22-Mar-2018  martin Pull up the following revisions, requested by maxv in ticket #652:

sys/arch/amd64/amd64/amd64_trap.S up to 1.39 (partial, patch)
sys/arch/amd64/amd64/db_machdep.c 1.6 (patch)
sys/arch/amd64/amd64/genassym.cf 1.65,1.66,1.67 (patch)
sys/arch/amd64/amd64/locore.S up to 1.159 (partial, patch)
sys/arch/amd64/amd64/machdep.c 1.299-1.302 (patch)
sys/arch/amd64/amd64/trap.c up to 1.113 (partial, patch)
sys/arch/amd64/amd64/vector.S up to 1.61 (partial, patch)
sys/arch/amd64/conf/GENERIC 1.477,1.478 (patch)
sys/arch/amd64/conf/kern.ldscript 1.26 (patch)
sys/arch/amd64/include/frameasm.h up to 1.37 (partial, patch)
sys/arch/amd64/include/param.h 1.25 (patch)
sys/arch/amd64/include/pmap.h 1.41,1.43,1.44 (patch)
sys/arch/x86/conf/files.x86 1.91,1.93 (patch)
sys/arch/x86/include/cpu.h 1.88,1.89 (patch)
sys/arch/x86/include/pmap.h 1.75 (patch)
sys/arch/x86/x86/cpu.c 1.144,1.146,1.148,1.149 (patch)
sys/arch/x86/x86/pmap.c up to 1.289 (partial, patch)
sys/arch/x86/x86/vm_machdep.c 1.31,1.32 (patch)
sys/arch/x86/x86/x86_machdep.c 1.104,1.106,1.108 (patch)
sys/arch/x86/x86/svs.c 1.1-1.14
sys/arch/xen/conf/files.compat 1.30 (patch)

Backport SVS. Not enabled yet.
 1.64.6.1  16-Mar-2018  martin Pull up the following revisions (via patch), requested by maxv in #635:

sys/arch/amd64/amd64/gdt.c 1.39-1.45 (patch)
sys/arch/amd64/amd64/machdep.c 1.284,1.287,1.288 (patch)
sys/arch/amd64/include/param.h 1.23 (patch)
sys/arch/amd64/include/types.h 1.53 (patch)
sys/arch/x86/include/cpu.h 1.87 (patch)
sys/arch/x86/include/pmap.h 1.73,1.74 (patch)
sys/arch/x86/x86/cpu.c 1.142 (patch)
sys/arch/x86/x86/intr.c 1.117 (partial),1.120 (patch)
sys/arch/x86/x86/pmap.c 1.276 (patch)

Initialize ist0 in cpu_init_tss.
Backport __HAVE_PCPU_AREA.
 1.76.2.6  26-Dec-2018  pgoyette Sync with HEAD, resolve a few conflicts
 1.76.2.5  26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.76.2.4  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.76.2.3  28-Jul-2018  pgoyette Sync with HEAD
 1.76.2.2  25-Jun-2018  pgoyette Sync with HEAD
 1.76.2.1  21-May-2018  pgoyette Sync with HEAD
 1.80.2.3  13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.80.2.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.80.2.1  10-Jun-2019  christos Sync with HEAD
 1.101.2.1  31-May-2020  martin Pull up following revision(s) (requested by bouyer in ticket #935):

sys/arch/xen/x86/x86_xpmap.c: revision 1.89
sys/arch/x86/include/pmap.h: revision 1.121
sys/arch/xen/xen/privcmd.c: revision 1.58
sys/external/mit/xen-include-public/dist/xen/include/public/memory.h: revision 1.2
sys/arch/xen/include/xenpmap.h: revision 1.44
sys/arch/xen/include/xenio.h: revision 1.12
sys/arch/x86/x86/pmap.c: revision 1.394
(all via patch)

Adjust pmap_enter_ma() for the upcoming new Xen privcmd ioctl:
pass flags to xpq_update_foreign()

Introduce a pmap MD flag, PMAP_MD_XEN_NOTR, which causes xpq_update_foreign()
to use the MMU_PT_UPDATE_NO_TRANSLATE flag.
Make xpq_update_foreign() return the raw Xen error. This will cause
pmap_enter_ma() to return a negative error number in this case, but the
only user of this code path is privcmd.c and it can deal with it.

Add pmap_enter_gnt(), which maps a set of Xen grant entries at the
specified va in the specified pmap. Use the hooks implemented for EPT to
keep track of mapped grant entries in the pmap, and unmap them
when pmap_remove() is called. This requires pmap_remove() to be split
into a pmap_remove_locked(), to be called from pmap_remove_gnt().

Implement new ioctl, needed by Xen 4.13:
IOCTL_PRIVCMD_MMAPBATCH_V2
IOCTL_PRIVCMD_MMAP_RESOURCE
IOCTL_GNTDEV_MMAP_GRANT_REF
IOCTL_GNTDEV_ALLOC_GRANT_REF

Always enable declarations needed by privcmd.c
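From the privcmd side, the new flag and error convention might be used
like this sketch (pmap_enter_ma() is assumed to take a machine address,
physical address, protection, flags and domid; the errno translation
helper is hypothetical):

	static int
	privcmd_map_foreign_sketch(struct pmap *pm, vaddr_t va, paddr_t ma,
	    int domid)
	{
		int error;

		error = pmap_enter_ma(pm, va, ma, 0,
		    VM_PROT_READ | VM_PROT_WRITE,
		    PMAP_CANFAIL | PMAP_MD_XEN_NOTR, domid);
		if (error < 0) {
			/* Raw Xen error passed through from
			 * xpq_update_foreign(); translate it. */
			error = privcmd_xen2bsd_errno(error);	/* hypothetical */
		}
		return error;
	}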
 1.108.2.2  29-Feb-2020  ad Sync with head.
 1.108.2.1  17-Jan-2020  ad Sync with head.
 1.117.2.1  25-Apr-2020  bouyer Sync with bouyer-xenpvh-base2 (HEAD)
 1.125.6.1  13-May-2021  thorpej Sync with HEAD.
