Assumptions/statements:
- Code in kernel is bad
- Synchronous message passing is inherently insecure
- in-kernel async messaging is either too slow or too complex
- SMP systems are common
- processors have caches
- 32 bit archs are only used for "small" systems
- pages are "small", small enough that there are "many" on a system
Goals:
- provide a virtualization layer with IPC (an os with minimal abstractions)
- sufficiency for full-scale high performance/reliability systems
- sufficiency for partial posix userland libraries with non-abysmal performance
- allow user to take advantage of SMP systems and caching
- minimal in-kernel code
- ability to reasonably debug userland processes (can write a debugger)
- reasonably architecture independent (properly written C code is portable)
- makes no assumptions about userland programming language (may have no stacks)
- support for untrusted device drivers and other servers
- Hard realtime support
Non-goals:
- faster than monolithic kernels
- supply an abstraction layer (this is userland's problem)
- run *nix code
- support for large, high performance systems on 32 bit archs
- (that is, we don't care about scalability on 32 bit)
- support for systems without paged virtual memory
Terminology
-
Capability:
A general term referring to the dual of ACLs. In this system capabilities reside in address spaces and have addresses. These addresses are therefore process specific and are the only naming for a capability. Thus capabilities do not have canonical names across the system. This is where the security of capabilities is derived.
ALL system resources are represented as capabilities.
-
VCPU:
This is similar to the concept of a "kernel thread". Basically a VCPU is a point of execution, which may or may not actually be running at any time. We use the term VCPU to avoid the term "thread", which in this document is used to refer exclusively to the userland concept of a thread. A VCPU should be thought of as a virtual processor. A process using more than one VCPU should be thought of as an SMP system, and in fact the only reason to do this under this API is to take advantage of SMP.
-
PCPU:
An actual real physical CPU on a system.
-
Address space:
Conceptually this is a virtual memory mapping. Similar to a "process" on some systems, but only the memory part of that concept, not the execution part. A running VCPU is always running in some given address space, that is the virtual memory mapping the system will use for normal instructions.
-
Process:
This refers to a collection of VCPU's and address spaces. Please note that this term is extremely amorphous, as VCPU's can change address spaces, and can actually indirectly work with address spaces that they are not running in. A process is most likely one or more VCPU's running in a single address space.
-
Kernel process:
The kernel process is the core piece of the system which is "colocated" with the kernel for efficiency reasons, but not for permission or access reasons. This is a useful concept because this piece of the system can be easily replaced to change the userland semantics. Basically, by thinking of this portion of the kernel as nothing special besides a "process", we ensure our call semantics are easily shimmable, emulatable, etc., making the system more flexible. The implementation need not follow this design.
Kernel API
Basic Types:
-
vr_ptr_#
"virtual register". This is probably just a register, but it could be something else depending on the arch. It must be large enough to hold a pointer to memory.
-
ptr_t
a pointer, large enough to hold a pointer to anywhere in a virtual address space.
-
uintptr_t
same number of bits as ptr_t, but not a pointer, and unsigned. Meant to imply usage as an integer.
-
ptrdiff_t
same number of bits as ptr_t, but not a pointer, and signed. Meant to imply usage as a pointer offset. (Note that this is inconsistent with the C99 spec, which does not guarantee that ptrdiff_t is large enough for a full pointer; here it is, which is consistent with 99% of C implementations.)
-
size_t
basically the same as uintptr_t, but a size. It is possible that this could be smaller than uintptr_t on some special-case arch.
-
cap_t (capability pointer)
subtype of ptr_t
An entity residing in one or more address spaces. All entities residing in an address space are capabilities. A capability may reside in multiple address spaces; each residence is referred to as an "instance" of the capability. Capabilities do not inherently have any alignment restrictions beyond those required by the architecture itself (a byte, for example); it is expected that every valid address is a valid capability address. Some specific capabilities do have alignment restrictions, again due to architectural restrictions, so these restrictions are architecture specific (see mem_cap_t). Different capabilities may have different sizes.
-
Capability instance properties:
- single-use: if true, will automatically be revoked on first use (a message received on it, or a message sent).
-
Capability properties:
- parent: This is a notion of who spawned this capability. If the parent capability is freed, this one will be as well. This is the handle used to allow secure, reliable freeing of resources.
- owner: This is a notion of which address space "owns" this capability. When the capability is unmapped from the "owner" it is freed and thus revoked from all address spaces. Owner also affects broadcast semantics when a capability is held by more than two address spaces.
These properties are described for the purpose of explaining semantics, but they do not exist in any hard sense to the user. They are not readable or writable directly, but may be manipulated via calls to the kernel.
-
pure_cap_t: (pure capability pointer)
subtype of cap_t
A pointer to a pure capability. A pure capability is simply a capability which is not a memory capability, that is, a capability which is not backed by memory. These capabilities have no alignment restrictions besides basic architecture addressing restrictions. These capabilities have size equal to their alignment. This concept is useful only as a way to say that something is not memory backed.
-
mem_cap_t (memory capability pointer)
subtype of cap_t
A pointer to a memory capability. This is simply one or more pages of physical memory. The correct way to think of a memory capability is as a capability which also happens to be backed by RAM, with that RAM mapped into the address space at the same address as the capability. These pages can allow or disallow any of read/write/execute, but all physical memory which is part of a mem_cap has the same mode. Depending on the arch, turning off a permission may or may not actually disallow the given operation.
- note, it is okay for a mem_cap to point to anything that acts like RAM but is not just simple RAM, such as memory mapped IO, an AGP aperture, etc.
- note2, mem_cap_t has architecture specific alignment requirements. On most archs the capability (and the capability's size) must be page aligned.
Additional cap instance properties:
- read: whether the backing pages are readable
- write: whether the backing pages are writable
- execute: whether the backing pages are executable
- pages: A list of the actual memory backing this cap
-
fmem_cap_t (folded memory capability pointer)
This refers to the same object as a memory capability pointer, but it is not yet "unfolded", meaning that the backing RAM does not actually live in this address space yet (and consequently this cap can live anywhere in memory, with no restrictions). As a result it has all of the same properties as a mem_cap_t.
-
addr_cap_t: (address space capability pointer)
subtype of pure_cap_t
This is a pointer to an address space capability. The address space capability is simply used to address an address space. It's a handle passed to the kernel which uniquely identifies a particular address space from within this address space. This allows us to control multiple address spaces.
-
vcpu_cap_t: (virtual cpu capability pointer)
subtype of mem_cap_t
This is a pointer to a VCPU capability. The vcpu capability is used to control the VCPU. On most archs this is 1 page in length.
Its memory layout looks like so:
- {uintptr_t upcall, regstate_t state, msg_queue_t send, msg_queue_t recv}
The send and recv queues split the remaining space in the cap. Note that the send and recv queues default to reasonable states, with "head" and "tail" initialized.
"interrupt" specifies whether or not this VCPU is willing to accept an upcall at this time (nonzero means it will); this defaults to 0.
It should be understood that whatever program a cap refers to can effectively supply whatever "properties" of that capability they wish. Any capability is functionally just a pure capability, with other semantics supplied by whatever program or set of programs are talking on that capability. These are simply basic capabilities that are required to be able to write anything, and thus have semantics supplied by the kernel.
Syscalls:
-
Send:
- vr_ptr_1 = address space cap to send on
- vr_ptr_2 = capability to send on
- vr_ptr_3 = address space cap to get sending cap from
- vr_ptr_4 = capability to send
- vr_ptr_5 = message to send
On success, calling this syscall is identical to a preemption event. It sends the capability at address vr_ptr_4 in the address space at address vr_ptr_3, on the capability at address vr_ptr_2 in the address space at vr_ptr_1. The value of vr_ptr_5 is sent as well.
On failure a return code will be placed in vr_ptr_1:
- 1 if the address in vr_ptr_1 does not point to a capability
- 2 if vr_ptr_1 is not an address space capability
- 3 if the address in vr_ptr_2 does not point to a capability
- 5 if the address in vr_ptr_4 does not point to a capability
- 6 if vr_ptr_4 is not an address space capability
- 7 if the address in vr_ptr_5 does not point to a capability
Capability addressing semantics:
Please note that all capability addressing is address space centric, meaning that sends go to address spaces which have the capability mapped. Address spaces can then be associated with VCPU's to actually generate the upcall events in a given VCPU.
Each capability has an "owner". There are two pieces of semantics related to owners. An owner refers to an address space, not a VCPU. The important piece of semantics here is which address space the capability is addressed with: the "owning" address space, or another address space that it is also mapped into.
- When a capability is released from the owner, it is forcefully revoked from all other mappings, meaning that it will be removed from all address spaces immediately, and the capability itself is freed. When this occurs a message is sent to every other holder of the capability.
- Freeing a capability further causes every "child" capability (a capability created using this capability) to be freed, and thus revoked. This is recursive of course.
- When a send is performed on a capability from the owning address space, it is sent to all non-owners. When a send is performed from a non-owning address space, it is sent to the owner only.
Kernel Process API
Strictly speaking the kernel is talked to exactly like any other process on the system. It does need an API though, and thus it requires some concepts of "special" capability types. These are all subtypes of either mem_cap or pure_cap, and thus are not actually "special" except as part of an API, like any normal process API.
-
msg_queue_t: (message queue)
{size_t size, ptrdiff_t head, ptrdiff_t tail, char[] data}
Note that it does not have a specific size. This is treated exactly as a ring buffer. The beginning of each message must contain the size of the message in bytes as the first word. head and tail are OFFSETS into data; they are not pointers. When head == tail the queue should be considered empty. When (head + msg_size) % sizeof(data) == tail the queue should be considered full, and "msg" should not be inserted. Upon read, data should be read from where tail points, and then tail should be incremented by the appropriate size. Upon write, data should be written to where head points, and head should be incremented by the appropriate size. Note that head MUST be incremented only AFTER data is added, and tail MUST be incremented only AFTER data is copied out! This is a lockless queue and WILL BE USED AS SUCH! Be sure to use appropriate memory barriers for your architecture or your code may not function properly on SMP.
-
regstate_t: (register state)
This is a lump of memory large enough to hold the entire register state for the architecture.
-
msg_type_t: (enum of all message types the kernel understands)
The values for this run in order of the messages as defined in the Kernel API section.
-
perm_cap_t: (permissions capability pointer)
subtype of pure_cap_t
This indicates permission to use some resource. In particular this is used for allocation. A perm cap may conceptually have a "count" of 0 through countable infinity, inclusive. In practice we represent this count as a finite integer, where all 1's is infinity. Decrementing or otherwise subtracting from infinity has the expected result of not changing it. Allocation is performed by sending a capability on the perm_cap_t, along with a number of *things* to be allocated. The other side will then send the things to be allocated over the newly shared capability. For reasons of security the capability that is sent should be marked as "single use". Note that the high bit of the count is reserved for another meaning, so it does not count towards the count.
Splitting of a perm_cap_t (creating another one with some of the count of the original) is performed if the high bit of the value sent is 1. In this case the value of the rest of the bits (besides that 1) will be the count of the number of *things* broken off into the new capability. The new perm_cap_t will then be sent back, similar to newly allocated caps.
Kernel API:
To talk to the kernel we send it messages using the vcpu_cap_t. To avoid races it is recommended that each VCPU use its own capability to talk to the kernel. To send a message we simply insert it into the "send" queue in the vcpu_cap_t, and call "send" on that vcpu_cap_t (and the address space that it's mapped to); the kernel will drop any responses in the "recv" queue, and if it responds it will also call "send" on the capability. The kernel can be modeled 100% correctly as an independent process which is asynchronously receiving messages. Once something is in that queue the kernel is free to perform the operation whenever it would like; "send" is therefore really more like "yielding" to the kernel. The kernel will however always perform the requested operations in strict queue order (at least as they are visible to the user). In this way dependent operations can be requested in a batch.
We're going to use the notation: resp msg_name(arg1, arg2, ...)
It should be understood that this is not a function call, but simply means
placing {uintptr_t ind, msg_type_t msg_name, arg1, arg2, ...} in the send
queue of the vcpu_cap_t. The kernel will respond with {uintptr_t ind, resp}
in the "recv" queue. If there is no "resp" the kernel will not respond.
-
new_addr_cap(addr_cap_t perm_addr, perm_addr_cap_t perm,
addr_cap_t newcap_addr, addr_cap_t newcap)
Allocates a new address space at newcap in newcap_addr using permission perm in perm_addr. Decrements perm's count by one. Ownership for the new cap is assigned to the new address space.
-
new_vcpu_cap(addr_cap_t perm_addr, perm_vcpu_cap_t perm,
addr_cap_t newcap_addr, vcpu_cap_t newcap)
Allocates a new vcpu at newcap in newcap_addr using permission perm in perm_addr. Decrements perm's count by one. Ownership for the new cap is assigned to the new address space.
-
new_pure_cap(addr_cap_t perm_addr, perm_pure_cap_t perm,
addr_cap_t newcap_addr, pure_cap_t newcap)
Allocates a new pure cap at newcap in newcap_addr using permission perm in perm_addr. Decrements perm's count by one. Ownership for the new cap is assigned to the new address space.
-
new_mem_cap(addr_cap_t perm_addr, perm_mem_cap_t perm,
addr_cap_t newcap_addr, mem_cap_t newcap, size_t size)
Allocates a new mem_cap_t at newcap with "size" of memory backing the cap in newcap_addr using permission perm in perm_addr.
Note that the cap will take up "size" in the address space, and "size" MUST be page aligned. Removes "size" from perm. Ownership for the new cap is assigned to the new address space.
-
new_sig_cap(addr_cap_t perm_addr, perm_pure_cap_t perm,
addr_cap_t addr, addr_cap_t newcap_addr, sig_cap_t newcap)
Allocates a new sig_cap_t bound to addr using permission perm in perm_addr. The new cap is placed at newcap in newcap_addr.
-
split_mem_cap(addr_cap_t mem_addr, mem_cap_t mem,
addr_cap_t newcap_addr, mem_cap_t newcap, size_t size)
(And you thought a perm_cap_t with zero in it was useless!) This will split a memory capability into two memory capabilities. The new capability gets "size" of the memory and the old gets the rest. Please note that "size" must be smaller than the full size of the cap, and it must also be page aligned. Ownership for the new cap is assigned to the new address space.
-
move_cap(addr_cap_t from_addr, cap_t from_cap, addr_cap_t to_addr, cap_t to_cap,
rwe_t rwe, int owner)
Moves a capability from one address to another. rwe will be ANDed with the current rwe, and the cap will be set to those permissions. If "owner" is set and from_addr is the current owner, then to_addr will be set as the new owner.
Please note that moving to the current place is the proper procedure for dropping permissions.
-
copy_cap(addr_cap_t from_addr, cap_t from_cap, addr_cap_t to_addr, cap_t to_cap,
rwe_t rwe, int owner)
Shallow copies the capability from one address to the other. This creates a new reference to the cap, so that it resides in the old place AND the new place. rwe will be ANDed with the current rwe, and the new cap will be set to those permissions. If owner is set and "from_addr" is the current owner, ownership will be transferred to "to_addr".
-
copy_mem(addr_cap_t src_addr, ptr_t src, addr_cap_t dest_addr, ptr_t dest,
size_t size)
Copies data from inside of a memory capability from one place to another. This is useful because it allows us to reach inside of an address space that is not the one our VCPU is currently running in.
-
set_upcall(addr_cap_t vcpu_addr, vcpu_cap_t vcpu, addr_cap_t upcall_addr,
ptr_t upcall)
When an upcall occurs on vcpu in vcpu_addr it will context switch to upcall_addr and begin execution at "upcall".
-
set_vcpu_addr(addr_cap_t vcpu_addr, vcpu_cap_t vcpu, addr_cap_t addr)
This subscribes vcpu to events coming from addr. That is, any capability in addr which gets a "send" on it (in the appropriate address space) will cause an upcall. Note that a vcpu can only be subscribed to one address space at a time, so this will also unsubscribe the VCPU from events from other address spaces.
-
uintptr_t _count()
This simply returns a count of the number of PCPUs on the system.
-
set_vcpu_assoc(addr_cap_t vcpu_addr, vcpu_cap_t vcpu, size_t size,
char[] assoc)
"assoc" is the PCPU association of this VCPU; that is, it is a bitmask of which physical CPUs this VCPU is willing to run on. The default is to allow running on all processors. A 1 in bit position i allows it to run on PCPU i. The bitmask should be read as a multibyte integer would be for the given architecture.
#TODO: do we want a processor permissions system? so we can protect a processor for use by only one VCPU for instance?
Upcalls and Events:
Events:
-
Send
So, what does it mean for an address space to be "sent" to on a capability? Being sent to generates an event in the address space in which the capability is mapped. This event contains, at a minimum, the address of the capability in the destination address space. If the "send" syscall generating the event also sent a capability, then the event will also conceptually contain the capability being sent.
Footnote: during transfer this "in-flight" capability is considered to be a reference to the capability, and if ownership is being transferred it is considered to be the current owner (all messages will be delayed, though, until it is mapped into a proper owner, as if upcalls were disabled). This is important for capability ref-counting and freeing semantics. As far as the user is concerned, they will see the address that the sent capability gets mapped to in the destination address space.
These two pieces of information are the only information contained in a send event.
A portion of every address space is reserved for the kernel to map incoming capabilities to. What portion is architecture specific, but we recommend that the top quarter of the address space is kernel, and the next quarter is reserved for these mappings. If no addresses are free in this area the kernel will just drop it somewhere else in the address space. The kernel is free to choose any free address in the appropriate range in each case.
- There are no other event types :D (this way they can all be simulated)
junk I'm still figuring out/haven't typed up yet
Special events:
I think we want to avoid having these at all
preemption events come from vcpu_cap_t (preempt written in VCPU buffer)
kernel call return events come from vcpu_cap_t (msg written in VCPU buffer)
HMMMMMM.... not sure about this, be sure preempt can be emulated with send
sig_cap semantics (death of addr if no upcalls available)
Swap API:
Upcall System: