Will Jhun wjhun US systems - kernel - networking - embedded

push event nanovms/nanos

Will Jhun

commit sha ab844122d331d7e9321cb1ec5506d61dfba861b4

blockq_check_timeout(): protect against race between action invocation and inserting waiter The blockq previously had a known limitation whereby a wakeup occurring between initial invocation of the blockq action and insertion into the waiting list could go undetected. While the futex code had a specific workaround for this, this race has since led to missed wakeups in other areas, calling for a more general solution. With this change, blockq_wake_one(), under blockq lock, sets the wake flag if a waiting thread could not be dequeued. blockq_check_timeout() first clears the flag prior to invoking the action, then checks (under blockq lock) whether the flag was set by a wake within this interval. If it was, a wakeup may (though not necessarily) have been missed, and blockq_wake_one() is called after releasing the locks. This solution may lead to a spurious wakeup in the case that the state change indicated by the wake was in fact detected by the blockq action. This would seem to affect only the futex code; however, such spurious futex wakeups (a FUTEX_WAIT returning 0, not triggered by a FUTEX_WAKE) are explicitly allowed. (See the FUTEX_WAIT entry under the RETURN VALUE section of the futex(2) manpage.) Following this pattern (which appears to be common among pthreads and POSIX interfaces), all blockq actions should be written to assume that spurious invocations are possible. The aforementioned workaround in the futex code (spinning on non-zero waiters for a wake) has been removed.
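As a rough illustration of the handshake described above, here is a minimal userspace sketch using a pthread mutex in place of the blockq lock; the bq_* names and the shape of the action callback are placeholders, not the actual nanos API.

```c
#include <pthread.h>
#include <stdbool.h>

struct bq {
    pthread_mutex_t lock;   /* stands in for the blockq lock */
    bool wake;              /* set by a wake that found no waiter to dequeue */
    /* ... waiter list elided ... */
};

/* Wake side: if no waiter could be dequeued, record that a wake happened. */
void bq_wake_one(struct bq *q)
{
    pthread_mutex_lock(&q->lock);
    bool dequeued = false;  /* dequeue attempt elided in this sketch */
    if (!dequeued)
        q->wake = true;
    pthread_mutex_unlock(&q->lock);
    /* ...schedule the dequeued waiter here if one was found... */
}

/* Blocking side: clear the flag, run the action, then check (under the lock)
 * whether a wake raced with the action before the waiter was inserted. */
void bq_check(struct bq *q, bool (*action_blocks)(void *), void *arg)
{
    q->wake = false;                    /* 1. clear prior to the action */
    bool blocked = action_blocks(arg);  /* 2. action may decide to block */
    pthread_mutex_lock(&q->lock);
    bool missed = blocked && q->wake;   /* 3. wake arrived in the window? */
    /* ...insert the waiter here if blocked... */
    pthread_mutex_unlock(&q->lock);
    if (missed)
        bq_wake_one(q);                 /* 4. possibly spurious, by design */
}
```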

view details

Will Jhun

commit sha 799bed3daf68c33c393645d5c3acebccf097fcc7

io_submit(): replace syscall context before returning, hold reference per iocb op The aio code was issuing read and write requests on behalf of the running syscall context without managing the lifetime of the context or replacing the cpu's default syscall context. The following scenario would lead to a crash during the aio test: An aio_complete executes within the saved syscall context but on a different cpu than the one that issued the requests. The originating cpu, still having the same syscall context set as default, then serves a syscall. The low-level syscall entry begins running on the stack of this context, expecting exclusive access to it, and corruption of the stack ensues. The solution presented here uses the active syscall context to encompass all continuations as a result of an io_submit(), despite such operations no longer being associated with a syscall. The default syscall context is replaced by calling check_syscall_context_replace() in io_submit(), and each issued I/O operation holds a reference to the context. The context is finally freed and recycled once the last operation completes.

view details

Will Jhun

commit sha 0290f1f39852dd4b450d146088a859da2aac07bb

io_uring_enter(): replace syscall context before returning, hold references for r/w ops As with aio, io_uring also was issuing read and write requests on behalf of the running syscall context without managing the lifetime of the context or replacing the cpu's default syscall context. The solution presented here uses the active syscall context to encompass all continuations stemming from io_uring_enter(), despite such operations no longer being associated with a syscall. The default syscall context is replaced by calling check_syscall_context_replace(), and each issued I/O operation (via iour_iov() or iour_rw()) holds a reference to the context. The context is finally freed and recycled once the last operation completes.

view details

Will Jhun

commit sha ad280412ac72dd3033381f98ab1a36aa5278c25c

mutex implementation Building off of the context switching work, this change implements a two-phase mutex lock which will first spin for some number of iterations (specified on mutex allocation) before finally suspending the running context if the mutex could not be acquired. Suspended contexts are queued and processed in a FIFO fashion, with the unlock function dequeuing and directly scheduling any context waiting on the mutex. This is useful for preventing cores from being excessively tied up with spinning while there is resource contention. The spin factor argument given on instantiation may be adjusted according to the needs of the workload; a value of 0 indicates no spinning is desired - in which case a context will suspend whenever the mutex cannot be immediately acquired - but it may also be made large to prioritize spinning, leaving suspension as a fallback option that frees up cpu cycles for other contexts to run. The new context_suspend() function suspends a context by saving the running processor state to its frame and then adjusting it to return to the caller on frame return. As a first client, the lwIP lock has been converted from a spinlock to a mutex. It is instantiated with a large spin factor in order to maintain existing performance levels observed with our webg test. As with spinlocks, bus contention increasingly affects performance as the number of cores contending for a mutex grows. This could potentially be ameliorated in the future by using something like the MCS lock in place of the compare-and-swap loop for the first phase. Using MCS would also ensure that waiters are processed in FIFO order for both phases of lock acquisition, not just the second one.
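A minimal userspace sketch of the two-phase acquire described above; a condition variable stands in for context_suspend() and direct scheduling, and the names are illustrative rather than the kernel's API.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
    unsigned long spin;        /* spin factor chosen at allocation time */
    pthread_mutex_t waitlock;  /* protects the waiter handoff (condvar here) */
    pthread_cond_t waiters;
} two_phase_mutex;             /* waitlock/waiters need the usual initializers */

static bool try_acquire(two_phase_mutex *m)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&m->locked, &expected, true);
}

void tpm_lock(two_phase_mutex *m)
{
    for (unsigned long i = 0; i < m->spin; i++)     /* phase 1: spin */
        if (try_acquire(m))
            return;
    pthread_mutex_lock(&m->waitlock);               /* phase 2: suspend */
    while (!try_acquire(m))
        pthread_cond_wait(&m->waiters, &m->waitlock);
    pthread_mutex_unlock(&m->waitlock);
}

void tpm_unlock(two_phase_mutex *m)
{
    pthread_mutex_lock(&m->waitlock);
    atomic_store(&m->locked, false);
    pthread_cond_signal(&m->waiters);   /* the kernel version instead dequeues
                                           and directly schedules a context */
    pthread_mutex_unlock(&m->waitlock);
}
```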

view details

Will Jhun

commit sha 848ef3b675d6024df07b56f3cfd023ecbd9c3430

address netsock bugs exposed by mutex lock Some races in the netsock and webg tests were exposed as a result of the lwIP lock acquisition causing context suspends. - In sock_read_bh_internal() and socket_write_tcp_bh_internal(), acquire the lwIP lock prior to inspecting socket state. - In netsock_poll(), net_loop_poll_queued was being cleared before acquiring the lwIP lock. If the context were to be suspended when taking the lock, further unnecessary enqueues could pile up. Move the lock acquire prior to clearing the flag. - In socket_write_tcp_bh_internal(), proactively invoke netif_poll_all() if the TCP socket sndbuf is full, avoiding the need to continue blocking in the case that netsock_poll() is scheduled but awaiting dispatch.

view details

Will Jhun

commit sha da159351d6dfa758e44dda1f2d7132f09b49f7bc

use lock-free list to implement queue of waiting contexts Using a fixed-size queue of waiters and spinning on failing enqueue operations ultimately defeats the mutex property that a context awaiting a mutex lock cannot tie up a CPU indefinitely. To remedy this, the queue has been substituted with a singly-linked list of waiting contexts. The lock-free list follows the one used in the MCS lock, albeit for enqueuing waiters and not spinning.
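The enqueue can be sketched with C11 atomics, following the MCS-style tail exchange mentioned above; the types and names below are illustrative, not the kernel's.

```c
#include <stdatomic.h>
#include <stddef.h>

struct waiter {
    struct waiter *_Atomic next;
    void *ctx;                      /* the suspended context */
};

struct waitlist {
    struct waiter *_Atomic tail;    /* most recently enqueued waiter, or NULL */
};

/* Enqueue by atomically swapping the tail, then linking the previous tail to
 * the new node; returns true if the list was previously empty. */
_Bool waitlist_enqueue(struct waitlist *l, struct waiter *w)
{
    atomic_store(&w->next, (struct waiter *)NULL);
    struct waiter *prev = atomic_exchange(&l->tail, w);
    if (prev) {
        atomic_store(&prev->next, w);
        return 0;
    }
    return 1;
}
```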

view details

push time in 20 hours

PullRequestReviewEvent

push event nanovms/nanos

Justin Sanders

commit sha c3a488e4be78e8697fc422f05f6cb934d2192bf9

Add getsockopt support for reuseport/addr and tcp_nodelay This adds support in getsockopt for reading the values of SO_REUSEADDR, SO_REUSEPORT, and TCP_NODELAY. SO_REUSEPORT is not currently supported, so it always returns 0 and setting it is a no-op. This also fixes setting TCP_NODELAY in setsockopt, which was setting the opposite of the requested value.
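A rough userspace-style sketch of the option handling described above; the socket state fields are placeholders, not the kernel's netsock structures.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <string.h>

struct sock_state { int reuseaddr; int nodelay; };

static int sock_getsockopt(struct sock_state *so, int level, int optname,
                           void *optval, socklen_t *optlen)
{
    int val;
    if (level == SOL_SOCKET && optname == SO_REUSEADDR)
        val = so->reuseaddr;
    else if (level == SOL_SOCKET && optname == SO_REUSEPORT)
        val = 0;                     /* not supported: always reads as 0 */
    else if (level == IPPROTO_TCP && optname == TCP_NODELAY)
        val = so->nodelay;           /* setsockopt now stores the value as
                                        requested rather than its inverse */
    else
        return -1;
    if (*optlen > sizeof(val))
        *optlen = sizeof(val);
    memcpy(optval, &val, *optlen);
    return 0;
}
```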

view details

sanderssj

commit sha 2dbf2cabd0c16e20670afeb3f73e737e0b6e8784

Merge pull request #1652 from nanovms/getsockopt-newoptions Add getsockopt support for reuseport/addr and tcp_nodelay

view details

Will Jhun

commit sha e183b55a7b5dff1e38c3cba98c91f59ceb716f28

restore correct detection of runtime test errors by defaulting to qemu debug exit (#1653) With the introduction of acpica and the default installation of the acpi_powerdown vm_halt method on the PC platform, runtime CI tests have not been passing the test exit codes back to the shell. This results in "make runtime-tests" passing even if tests have failed with a non-zero exit code. This adds a "debug_exit" manifest flag to indicate that vm_exit() should return to qemu via the QEMU_HALT() function regardless of whether the platform has installed a vm_halt method. An option has been added to mkfs to allow the amending of manifest root tuples from the command line. Runtime tests are now built with "debug_exit:t" added to the root tuples by default.
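A hedged sketch of the exit-path selection this flag enables; the helper names below (vm_halt_method, qemu_debug_exit) are placeholders rather than the actual kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the QEMU-visible exit; on PC this is typically an I/O port
 * write to QEMU's isa-debug-exit device so the exit code reaches the shell. */
static void qemu_debug_exit(uint8_t code)
{
    (void)code;   /* port write elided in this sketch */
}

typedef void (*halt_method)(uint8_t code);

static halt_method vm_halt_method;  /* e.g. acpi_powerdown, when installed */
static bool debug_exit;             /* set from the "debug_exit" root tuple */

void vm_exit(uint8_t code)
{
    if (vm_halt_method && !debug_exit)
        vm_halt_method(code);       /* normal shutdown path */
    qemu_debug_exit(code);          /* fallback, or forced by debug_exit:t */
}
```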

view details

Will Jhun

commit sha c577c42e88cd5aac8a746d47ad78c984b88d916d

queue: allow specification of entry size with *_n variants This change allows multiple (power-of-2) word queueing using the *_n* variants of the enqueue and dequeue methods. The same size must be used for all operations on a given queue (though the test changes the size at quiescent state); operations of mixed sizes are not supported.

view details

Will Jhun

commit sha 217963da37d7f92469032ca03ea797869b4f2600

asynchronous completion queues This creates new scheduling queues for deferred processing of status_handlers and other closures whose execution may (or needs to) happen asynchronously from variable application. The async_apply*() functions take a closure bound with variable arguments and enqueue it to the specified async queue. The queues are dispatched directly from the runloop, so queued continuations are free to do frame returns, switch to new stacks or take other paths that culminate in a call to runloop() rather than returning directly. This change is necessary for pending context switching work.

view details

Will Jhun

commit sha c15d5eebc43e9f945b53b753a7d385c4984f46a5

use async_apply_1() for virtqueue completion processing The async queues greatly simplify virtqueue completion handling. The service queue and task are removed in favor of queueing directly from the interrupt handler. Used vqmsgs are now recycled via a free list rather than being deallocated.

view details

Will Jhun

commit sha beca65c73813ed426569a27a42a6598c99795a65

async completions for pagecache and blockq This moves blockq action invocation to the async queues (note: there are currently missing thread_resumes, but this will be fixed with context switches in async dispatch). io_completions are now called directly from bq action handlers rather than stored in the thread for later completion. Pagecache completions are now queued for asynchronous completion. In futex_wake_one(), now that the blockq wake happens asynchronously, take the valid returned thread as proof of a wake, adjust the waiter count and return at once (this function could previously wake up multiple waiters, despite its name). In pagecache_write_sg(), move new, whole pages to the writing state immediately to prevent an intervening page fill. runloop_internal() now checks the queues again before going to sleep to catch work queued during service. Blockq actions now must re-insert themselves into the blockq if a wakeup occurs for a thread but blocking needs to continue. The blockq_block_required() helper has been added for this purpose.

view details

Will Jhun

commit sha ad868cf2e1a03081337e35d7e9ac5bb8e9ccd66c

initial commit for new context type and embedding of context within closures This work introduces a redefined context type that associates a running frame with methods for pausing, resuming and scheduling contexts, a fault handler, and a transient heap that can be used for allocations that exist for the lifetime of the context. kernel_contexts and unix threads now extend this base type, and the syscall_context type has been created to encompass syscall operations. Syscall operations which block awaiting continuation now stamp their respective context in the continuation closure using contextual_closure(). Areas of the kernel which service such continuations now invoke them using async_apply*() functions, causing the context saved in such closures to be resumed. Now that each syscall in progress is represented by its own syscall_context, it is possible to service concurrent page faults for multiple syscalls. unix_fault_handler now handles faults in a way more orthogonal to the context type, instead relying on the context methods to effectively suspend and resume the context when blocking is necessary for faults on file-backed memory. Frame storage embedded within the context is now a fixed size, representing only a basic frame save. The extended frame (FP / SIMD) storage is now allocated separately, pointed to by the FRAME_EXTENDED field in the basic frame.
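A structural sketch of what such a base context type can look like; the member names follow the description above and are not taken from the actual headers.

```c
#include <stdint.h>

typedef void (*context_method)(void *ctx);

struct context_sketch {
    uint64_t frame[32];              /* fixed-size basic register save (size illustrative) */
    void *frame_extended;            /* FP/SIMD state, allocated separately; the real
                                        frame records it via a FRAME_EXTENDED slot */
    context_method pause;            /* save state when switched out */
    context_method resume;           /* restore state when switched in */
    context_method schedule_return;  /* queue the context to run again */
    void (*fault_handler)(void *ctx, void *frame);
    void *transient_heap;            /* allocations live as long as the context */
};

/* kernel_context, syscall_context and unix threads extend this base type;
 * contextual_closure() stamps the active context into a closure so that
 * async_apply*() can resume it when the closure finally runs. */
```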

view details

Will Jhun

commit sha a3184b7b102edcf1f7c690285facee7eb049e307

remove kernel lock, fold async queues into one

view details

Will Jhun

commit sha 3679c8b30442d7c7d8483bcfae02b39076b1a998

force release console lock before outputting in fault handler

view details

Will Jhun

commit sha d36dfd928212760d2c2a3c92fa44e2b49250125b

make additional closures contextual where necessary; implement context acquire and release / locking to guarantee safe frame handoff on immediate completion of contextual closure when running with multiple processors; make kernel-specific __stack_chk_fail; do syscall frame return via separate callback instead of resume; use async apply for flush:service_list(); have syscall_finish optionally reschedule thread to avoid race between schedule_thread() and syscall_finish()

view details

Will Jhun

commit sha bf0e8e420c1c2ac93d46342f4952572dd5662a46

fault handling during sigframe setup

view details

Will Jhun

commit sha ec0e1c261c5cf44ec3966130410e3f6ba7c6f48d

pagecache_write_sg_finish: borrow thread frame handler for current kernel_context in lieu of context switch; add pre_suspend and schedule_return context methods to make fault handling more orthogonal to context type

view details

Will Jhun

commit sha cc6e744e5a6e839ef307ca766526d8ddc6df38e1

aarch64 support for new context type

view details

Will Jhun

commit sha 2aef46414e9fed7d49d1801306e1d2409c900778

use vft for pause, resume, pre_suspend and schedule_return instead of thunks; embed refcount in context struct; fix kernel context reuse, delay recycling of kernel and syscall contexts until path completes (run from cpu_queue in runloop); fix context release in interrupt handlers; embed stacks in kernel and syscall contexts; hold context reference while frame is full, remove unnecessary reference in blockq_check*(); cleanup, rebase fixes

view details

Will Jhun

commit sha d97e6f2e994362f0980d7462ed903e990cc46b51

return kernel and syscall contexts to free list of originating cpu The prior scheme for recycling kernel and syscall contexts was naively returning retired contexts to the free list of the cpu where the final release occurred. This was leading to an imbalance of cached contexts amongst cpus and an excess of contexts being allocated to compensate. To remedy this, use the lock-free queue to return freed contexts to the cpu from which they originated. This creates an upper bound on the amount of context data that may be cached per processor (adjustable via constants in config.h). Contexts that cannot be returned to full free queues are simply deallocated. Note that these queues have multiple producers but only a single consumer (the owning cpu), so the more optimal dequeue_single() is used when removing contexts from the queue.
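The caching policy can be pictured with a simplified sketch; note that the real implementation uses the kernel's bounded lock-free queue with multiple producers and a single consumer, whereas this illustration is not thread-safe.

```c
#include <stdlib.h>

#define MAX_CACHED_CONTEXTS 8   /* analogous to the constants in config.h */

struct kctx { int origin_cpu; /* ... */ };

struct percpu_cache {
    struct kctx *free_ctx[MAX_CACHED_CONTEXTS];
    int count;
};

/* Return a retired context to the cpu it came from, or deallocate it when
 * that cpu's cache is full, bounding per-cpu cached context data. */
void context_release(struct percpu_cache *cpus, struct kctx *c)
{
    struct percpu_cache *origin = &cpus[c->origin_cpu];
    if (origin->count < MAX_CACHED_CONTEXTS)
        origin->free_ctx[origin->count++] = c;
    else
        free(c);
}

/* Owner cpu side: single consumer, so the real queue can use the cheaper
 * dequeue_single(); NULL here would mean "allocate a fresh context". */
struct kctx *context_from_cache(struct percpu_cache *cpu)
{
    return cpu->count ? cpu->free_ctx[--cpu->count] : NULL;
}
```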

view details

Will Jhun

commit sha 05a0acb27f3cf23d82eea658893f89350a55f49b

increase cpu_queue size; validate enqueues when freeing syscall and kernel contexts An undetected overflowing of the cpu_queue was causing kernel contexts to be leaked. Add asserts to enqueues and increase the default size of cpu_queue.

view details

Will Jhun

commit sha 2271388896ac00d14e76a6ee6ed1912869e176af

replace switch_stack functions Adding variants of switch_stack for additional arguments revealed that the x86 implementations were missing register clobbers, possibly leading to incorrect assembly. This rewrites the implementations for both aarch64 and x86_64, making some attempt to lessen redundancy by composing the definitions with macros.
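For illustration, here is a generic x86_64 stack switch with an explicit clobber list; this is a sketch of the general technique (and of why missing clobbers matter), not the nanos implementation.

```c
/* Switch to a new stack and call target(arg0); target must not return, since
 * the previous stack is abandoned.  The "r" constraints keep the operands in
 * registers across the rsp change, and the clobber list tells the compiler
 * that rdi and memory may be modified, preventing it from caching live values
 * in registers the call will trample. */
static inline void switch_stack_1(void *stack_top,
                                  void (*target)(unsigned long),
                                  unsigned long arg0)
{
    asm volatile("mov %0, %%rsp\n\t"
                 "mov %2, %%rdi\n\t"
                 "call *%1\n\t"
                 :
                 : "r"(stack_top), "r"(target), "r"(arg0)
                 : "rdi", "memory");
    __builtin_unreachable();
}
```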

view details

Will Jhun

commit sha d1d5fdf6f1628c2bae696577a9340e6304da30d6

context_switch_and_branch() Add context_switch_and_branch() to better facilitate context switches which resume on a new stack. Such switches previously depended on a group of inlined functions not addressing the active stack even after releasing the corresponding context. To avoid the risk that the stack of an outgoing context might be touched after the release, these paths have been refactored to first switch to the new stack before setting the new context, releasing the outgoing context and calling the target function. Since runloop() is called immediately after (optionally) returning from the target function, the runloop trampoline helpers are no longer required.

view details

Will Jhun

commit sha c2b861903823893bf6a399f74bf1a1242b3a6d41

pending_fault_complete(): protect service of pending_fault dependents with p->faulting_lock This fixes a race whereby a thread could be added to a pending_fault's dependents vector during or after service of the vector in pending_fault_complete(). Such an occurrence could result in a thread hanging indefinitely on a page fault. This moves the taking of p->faulting_lock in pending_fault_complete() up prior to servicing the dependents vector.

view details

push time in 5 days

push event nanovms/nanos

Will Jhun

commit sha e183b55a7b5dff1e38c3cba98c91f59ceb716f28

restore correct detection of runtime test errors by defaulting to qemu debug exit (#1653) With the introduction of acpica and the default installation of the acpi_powerdown vm_halt method on the PC platform, runtime CI tests have not been passing the test exit codes back to the shell. This results in "make runtime-tests" passing even if tests have failed with a non-zero exit code. This adds a "debug_exit" manifest flag to indicate that vm_exit() should return to qemu via the QEMU_HALT() function regardless of whether the platform has installed a vm_halt method. An option has been added to mkfs to allow the amending of manifest root tuples from the command line. Runtime tests are now built with "debug_exit:t" added to the root tuples by default.

view details

push time in 5 days

PR merged nanovms/nanos

restore correct detection of runtime test errors by defaulting to qemu debug exit

With the introduction of acpica and the default installation of the acpi_powerdown vm_halt method on the PC platform, runtime CI tests have not been passing the test exit codes back to the shell. This results in "make runtime-tests" passing even if tests have failed with a non-zero exit code.

This adds a "debug_exit" manifest flag to indicate that vm_exit() should return to qemu via the QEMU_HALT() function regardless of whether the platform has installed a vm_halt method. An option has been added to mkfs to allow the amending of manifest root tuples from the command line. Runtime tests are now built with "debug_exit:t" added to the root tuples by default.

+67 -22

0 comment

4 changed files

wjhun

pr closed time in 5 days

push event nanovms/nanos

Will Jhun

commit sha dc63eb21eca95e56d8a56d3c3f94666a0c24fa0b

epoll_wait_notify(): avoid reporting duplicate epoll_events for an fd This addresses the scenario where check_fdesc() (invoked from epoll_wait) and wait_notify() (invoked from notify_dispatch()) could both invoke epoll_wait_notify() for a given fd. This is a reasonable occurrence, as a call to epoll_wait() could pick up a coincident event state change prior to completion of the ensuing notify_dispatch(). This race can lead to duplicate epoll_events being reported to userspace. Such a scenario would occur at a low frequency in the signalfd portion of the signal runtime test. In this case, the duplicate events entry led to a superfluous call to read(2), which would block indefinitely. epoll_wait_notify() will now scan the reported events array to look for a matching data field and, if found, update the associated events field with the union of the two reported events. (It is necessary to use the union of events rather than supplanting the entry in the array; otherwise edge-triggered events could be omitted.)
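The merge step can be sketched as follows; the array handling is simplified relative to the kernel code, and the capacity check is elided.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Stage an event for userspace, folding it into an existing entry for the
 * same fd (matched by the data field) instead of adding a duplicate. */
static void report_event(struct epoll_event *events, int *count,
                         uint64_t data, uint32_t new_events)
{
    for (int i = 0; i < *count; i++) {
        if (events[i].data.u64 == data) {
            events[i].events |= new_events;   /* union, so edge-triggered
                                                 bits are not lost */
            return;
        }
    }
    events[*count].data.u64 = data;           /* no match: append new entry */
    events[*count].events = new_events;
    (*count)++;
}
```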

view details

Will Jhun

commit sha d46ad2b50ea1410663890832ae74fb32c969f3d9

mutex implementation Building off of the context switching work, this change implements a two-phase mutex lock which will first spin for some number of iterations (specified on mutex allocation) before finally suspending the running context if the mutex could not be acquired. Suspended contexts are queued and processed in a FIFO fashion, with the unlock function dequeuing and directly scheduling any context waiting on the mutex. This is useful for preventing cores from being excessively tied up with spinning while there is resource contention. The spin factor argument given on instantiation may be adjusted according to the needs of the workload; a value of 0 indicates no spinning is desired - in which case a context will suspend whenever the mutex cannot be immediately acquired - but it may also be made large to prioritize spinning, leaving suspension as a fallback option that frees up cpu cycles for other contexts to run. The new context_suspend() function suspends a context by saving the running processor state to its frame and then adjusting it to return to the caller on frame return. As a first client, the lwIP lock has been converted from a spinlock to a mutex. It is instantiated with a large spin factor in order to maintain existing performance levels observed with our webg test. As with spinlocks, bus contention increasingly affects performance as the number of cores contending for a mutex grows. This could potentially be ameliorated in the future by using something like the MCS lock in place of the compare-and-swap loop for the first phase. Using MCS would also ensure that waiters are processed in FIFO order for both phases of lock acquisition, not just the second one.

view details

Will Jhun

commit sha a422fd4c5485e82a6ff921700d6ff4bd710bfa4c

address netsock bugs exposed by mutex lock Some races in the netsock and webg tests were exposed as a result of the lwIP lock acquisition causing context suspends. - In sock_read_bh_internal() and socket_write_tcp_bh_internal(), acquire the lwIP lock prior to inspecting socket state. - In netsock_poll(), net_loop_poll_queued was being cleared before acquiring the lwIP lock. If the context were to be suspended when taking the lock, further unnecessary enqueues could pile up. Move the lock acquire prior to clearing the flag. - In socket_write_tcp_bh_internal(), proactively invoke netif_poll_all() if the TCP socket sndbuf is full, avoiding the need to continue blocking in the case that netsock_poll() is scheduled but awaiting dispatch.

view details

Will Jhun

commit sha 9b71446c72fd8c86c219d465003758cf89fff400

use lock-free list to implement queue of waiting contexts Using a fixed-size queue of waiters and spinning on failing enqueue operations ultimately defeats the mutex property that a context awaiting a mutex lock cannot tie up a CPU indefinitely. To remedy this, the queue has been substituted with a singly-linked list of waiting contexts. The lock-free list follows the one used in the MCS lock, albeit for enqueuing waiters and not spinning.

view details

push time in 6 days

PullRequestReviewEvent

Pull request review comment nanovms/nanos

riscv64 support

```diff
 word expected_sum(word n, word nt) {
 void *terminus(void *k) {
     word x = 0, v;
+    double y = 0.0;
     pipelock p= k;
     while ((v = pipelock_read(p)) != pipe_exit) {
         x += v;
+        y += (double)v;
     }
-    if (x == expected_sum(NNUMS, nthreads)) {
+    if (x == expected_sum(NNUMS, nthreads) && y == expected_float_sum(NNUMS, nthreads)) {
```

Due to floating point precision errors, it may be insufficient to use a direct comparison of doubles here. One solution would be to take the difference of y and the result of expected_float_sum() and compare that against some very small number.
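For example, a tolerance-based comparison along those lines might look like this; the threshold is an arbitrary placeholder that would be tuned to the test.

```c
#include <math.h>
#include <stdbool.h>

/* Compare doubles with a tolerance scaled to the magnitude of the expected
 * value, instead of an exact == comparison. */
static bool close_enough(double actual, double expected)
{
    return fabs(actual - expected) <= 1e-9 * fmax(1.0, fabs(expected));
}
```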

sanderssj

comment created time in 7 days

Pull request review comment nanovms/nanos

riscv64 support

```diff
+#include <kernel.h>
+#include <pci.h>
+#include <drivers/console.h>
+
+//#define PCI_PLATFORM_DEBUG
+#ifdef PCI_PLATFORM_DEBUG
+#define pci_plat_debug rprintf
+#else
+#define pci_plat_debug(...) do { } while(0)
+#endif
+
+#define PIO_DATA (mmio_base_addr(PCIE_PIO))
+#define pio_in8(port) mmio_read_8(PIO_DATA + port)
+#define pio_in16(port) mmio_read_16(PIO_DATA + port)
+#define pio_in32(port) mmio_read_32(PIO_DATA + port)
+#define pio_out8(port, source) mmio_write_8(PIO_DATA + port, source)
+#define pio_out16(port, source) mmio_write_16(PIO_DATA + port, source)
+#define pio_out32(port, source) mmio_write_32(PIO_DATA + port, source)
+
+/* stub ... really shouldn't be hardwired into console.c */
+void vga_pci_register(kernel_heaps kh, console_attach a)
+{
+}
+
+u32 pci_cfgread(pci_dev dev, int reg, int bytes)
```

These definitions for PCIE_PIO and PCIE_ECAM could be put in a common place and shared with platform/virt - maybe move into src/kernel/pci.h?

sanderssj

comment created time in 9 days

PullRequestReviewEvent

PR opened nanovms/nanos

restore correct detection of runtime test errors by defaulting to qemu debug exit

With the introduction of acpica and the default installation of the acpi_powerdown vm_halt method on the PC platform, runtime CI tests have not been passing the test exit codes back to the shell. This results in "make runtime-tests" passing even if tests have failed with a non-zero exit code.

This adds a "debug_exit" manifest flag to indicate that vm_exit() should return to qemu via the QEMU_HALT() function regardless of whether the platform has installed a vm_halt method. An option has been added to mkfs to allow the amending of manifest root tuples from the command line. Runtime tests are now built with "debug_exit:t" added to the root tuples by default.

+67 -22

0 comment

4 changed files

pr created time in 7 days

create branch nanovms/nanos

branch : qemu-default-debug-exit

created branch time in 7 days

PullRequestReviewEvent

push event nanovms/nanos

Will Jhun

commit sha 7834280a1b4a593debcb5b646d90fef80182e0f5

remove utime_updated() and stime_updated() The utime_updated() and stime_updated() functions may access the "start_time" field of a thread or a thread's syscall_context, respectively. This is potentially unsafe as these fields, as well as the thread's syscall field, may be updated without holding the thread lock. These functions are also unnecessary because a thread's execution times are accumulated on each context switch, providing updates at a finer granularity (e.g. with each interrupt, syscall entry, blocking or suspending of a syscall, etc.).

view details

Will Jhun

commit sha 5f5d9afd3adf238d2998b463abcc8e5d0aef33ec

remove scheduling queue parameters from virtqueue interface With removal of the kernel lock and virtqueue completions now being applied asynchronously from a single queue, the sched_queue parameters to virtqueue_alloc() and associated functions are now obsolete.

view details

Will Jhun

commit sha 5872c586d29ab1ace8e7919dc3ed505aaa338755

mutex implementation Building off of the context switching work, this change implements a two-phase mutex lock which will first spin for some number of iterations (specified on mutex allocation) before finally suspending the running context if the mutex could not be acquired. Suspended contexts are queued and processed in a FIFO fashion, with the unlock function dequeuing and directly scheduling any context waiting on the mutex. This is useful for preventing cores from being excessively tied up with spinning while there is resource contention. The spin factor argument given on instantiation may be adjusted according to the needs of the workload; a value of 0 indicates no spinning is desired - in which case a context will suspend whenever the mutex cannot be immediately acquired - but it may also be made large to prioritize spinning, leaving suspension as a fallback option that frees up cpu cycles for other contexts to run. The new context_suspend() function suspends a context by saving the running processor state to its frame and then adjusting it to return to the caller on frame return. As a first client, the lwIP lock has been converted from a spinlock to a mutex. It is instantiated with a large spin factor in order to maintain existing performance levels observed with our webg test. As with spinlocks, bus contention increasingly affects performance as the number of cores contending for a mutex grows. This could potentially be ameliorated in the future by using something like the MCS lock in place of the compare-and-swap loop for the first phase. Using MCS would also ensure that waiters are processed in FIFO order for both phases of lock acquisition, not just the second one.

view details

Will Jhun

commit sha 3b96a71d208222cadc8e986f233cebcbf6ce4671

address netsock bugs exposed by mutex lock Some races in the netsock and webg tests were exposed as a result of the lwIP lock acquisition causing context suspends. - In sock_read_bh_internal() and socket_write_tcp_bh_internal(), acquire the lwIP lock prior to inspecting socket state. - In netsock_poll(), net_loop_poll_queued was being cleared before acquiring the lwIP lock. If the context were to be suspended when taking the lock, further unnecessary enqueues could pile up. Move the lock acquire prior to clearing the flag. - In socket_write_tcp_bh_internal(), proactively invoke netif_poll_all() if the TCP socket sndbuf is full, avoiding the need to continue blocking in the case that netsock_poll() is scheduled but awaiting dispatch.

view details

Will Jhun

commit sha b4453551463f22f60334637d6015f72f573c4fd9

use lock-free list to implement queue of waiting contexts Using a fixed-size queue of waiters and spinning on failing enqueue operations ultimately defeats the mutex property that a context awaiting a mutex lock cannot tie up a CPU indefinitely. To remedy this, the queue has been substituted with a singly-linked list of waiting contexts. The lock-free list follows the one used in the MCS lock, albeit for enqueuing waiters and not spinning.

view details

push time in 9 days

PullRequestReviewEvent

Pull request review commentnanovms/nanos

new context type, support for simultaneous blocking page faults, removal of kernel lock

```diff
 void process_get_cwd(process p, filesystem *cwd_fs, inode *cwd)
     process_unlock(p);
 }
 
-void thread_enter_user(thread in)
-{
-    thread_resume(in);
-    in->sysctx = false;
-}
-
-void thread_enter_system(thread t)
-{
-    if (!t->sysctx) {
-        timestamp here = now(CLOCK_ID_MONOTONIC_RAW);
-        timestamp diff = here - t->start_time;
-        t->utime += diff;
-        t->start_time = here;
-        t->sysctx = true;
-        set_current_thread(&t->thrd);
-    }
-}
-
-void thread_pause(thread t)
-{
-    if (get_current_thread() != &t->thrd)
-        return;
-    timestamp diff = now(CLOCK_ID_MONOTONIC_RAW) - t->start_time;
-    if (t->sysctx) {
-        t->stime += diff;
-    }
-    else {
-        t->utime += diff;
-    }
-    context f = thread_frame(t);
-    thread_frame_save_fpsimd(f);
-    thread_frame_save_tls(f);
-    set_current_thread(0);
-}
-
-void thread_resume(thread t)
-{
-    nanos_thread old = get_current_thread();
-    if (old && old != &t->thrd)
-        apply(old->pause);
-    count_syscall_resume(t);
-    if (get_current_thread() == &t->thrd)
-        return;
-    t->start_time = now(CLOCK_ID_MONOTONIC_RAW);
-    set_current_thread(&t->thrd);
-}
-
 static timestamp utime_updated(thread t)
 {
+    thread_lock(t);
     timestamp ts = t->utime;
-    if (!t->sysctx)
+    if (t->start_time != 0)
         ts += now(CLOCK_ID_MONOTONIC_RAW) - t->start_time;
+    thread_unlock(t);
     return ts;
 }
 
 static timestamp stime_updated(thread t)
 {
+    thread_lock(t);
     timestamp ts = t->stime;
-    if (t->sysctx)
-        ts += now(CLOCK_ID_MONOTONIC_RAW) - t->start_time;
+    if (t->syscall && t->syscall->start_time)
+        ts += now(CLOCK_ID_MONOTONIC_RAW) - t->syscall->start_time;
```

Rather than requiring the thread lock to access t->syscall, I removed utime_updated() and stime_updated() (both of which are unsafe, as the start_time fields are also not protected by the thread or other lock) and have the calling routines simply read t->utime and t->stime directly.

The downside is that the times will not be current to now(), but the assumption here is that the granularity given by times being accumulated on each context switch will give sufficient resolution and accuracy for the dependent syscalls (times, getrusage, and clock_gettime).

wjhun

comment created time in 9 days

push event nanovms/nanos

Will Jhun

commit sha 7834280a1b4a593debcb5b646d90fef80182e0f5

remove utime_updated() and stime_updated() The utime_updated() and stime_updated() functions may access the "start_time" field of a thread or a thread's syscall_context, respectively. This is potentially unsafe as these fields, as well as the thread's syscall field, may be updated without holding the thread lock. These functions are also unnecessary because a thread's execution times are accumulated on each context switch, providing updates at a finer granularity (e.g. with each interrupt, syscall entry, blocking or suspending of a syscall, etc.).

view details

Will Jhun

commit sha 5f5d9afd3adf238d2998b463abcc8e5d0aef33ec

remove scheduling queue parameters from virtqueue interface With removal of the kernel lock and virtqueue completions now being applied asynchronously from a single queue, the sched_queue parameters to virtqueue_alloc() and associated functions are now obsolete.

view details

push time in 9 days

push event nanovms/nanos

Will Jhun

commit sha 7419dc7ab85991afb14291a8d38350e9f3349771

revert conditional lock taking in blockq actions Now that blockq actions are serviced asynchronously, it is no longer necessary to conditionalize the acquisition of a lock when code that wakes the blockq is holding the lock. Such conditionals are now unsafe and need to be removed.

view details

Will Jhun

commit sha 29020ccf8bcac6abc7e78df2fb5efa2b3c6e5ce6

mutex implementation Building off of the context switching work, this change implements a two-phase mutex lock which will first spin for some number of iterations (specified on mutex allocation) before finally suspending the running context if the mutex could not be acquired. Suspended contexts are queued and processed in a FIFO fashion, with the unlock function dequeuing and directly scheduling any context waiting on the mutex. This is useful for preventing cores from being excessively tied up with spinning while there is resource contention. The spin factor argument given on instantiation may be adjusted according to the needs of the workload; a value of 0 indicates no spinning is desired - in which case a context will suspend whenever the mutex cannot be immediately acquired - but it may also be made large to prioritize spinning, leaving suspension as a fallback option that frees up cpu cycles for other contexts to run. The new context_suspend() function suspends a context by saving the running processor state to its frame and then adjusting it to return to the caller on frame return. As a first client, the lwIP lock has been converted from a spinlock to a mutex. It is instantiated with a large spin factor in order to maintain existing performance levels observed with our webg test. As with spinlocks, bus contention increasingly affects performance as the number of cores contending for a mutex grows. This could potentially be ameliorated in the future by using something like the MCS lock in place of the compare-and-swap loop for the first phase. Using MCS would also ensure that waiters are processed in FIFO order for both phases of lock acquisition, not just the second one.

view details

Will Jhun

commit sha 07b1e114170a91a17836a06821fc33db129419c4

address netsock bugs exposed by mutex lock Some races in the netsock and webg tests were exposed as a result of the lwIP lock acquisition causing context suspends. - In sock_read_bh_internal() and socket_write_tcp_bh_internal(), acquire the lwIP lock prior to inspecting socket state. - In netsock_poll(), net_loop_poll_queued was being cleared before acquiring the lwIP lock. If the context were to be suspended when taking the lock, further unnecessary enqueues could pile up. Move the lock acquire prior to clearing the flag. - In socket_write_tcp_bh_internal(), proactively invoke netif_poll_all() if the TCP socket sndbuf is full, avoiding the need to continue blocking in the case that netsock_poll() is scheduled but awaiting dispatch.

view details

Will Jhun

commit sha f2a4217801f1da6d70e1116a6c21dc5e50e10e84

use lock-free list to implement queue of waiting contexts Using a fixed-size queue of waiters and spinning on failing enqueue operations ultimately defeats the mutex property that a context awaiting a mutex lock cannot tie up a CPU indefinitely. To remedy this, the queue has been substituted with a singly-linked list of waiting contexts. The lock-free list follows the one used in the MCS lock, albeit for enqueuing waiters and not spinning.

view details

push time in 11 days

push event nanovms/nanos

Will Jhun

commit sha 7419dc7ab85991afb14291a8d38350e9f3349771

revert conditional lock taking in blockq actions Now that blockq actions are serviced asynchronously, it is no longer necessary to conditionalize the acquisition of a lock when code that wakes the blockq is holding the lock. Such conditionals are now unsafe and need to be removed.

view details

push time in 11 days

push event nanovms/nanos

Francesco Lavra

commit sha 964bb308b87a19ce6d94d08c7aff8d490934aeea

x86_64 cpu: add support for SMEP and UMIP Supervisor Memory Execute Protection (SMEP) is a security feature that generates a page fault exception if the kernel tries to execute an instruction from a memory address accessible by the user program. If this exception occurs, the kernel dumps the current execution context to the console and halts. User-Mode Instruction Prevention (UMIP) is a security feature that generates a general protection exception if the user program tries to execute one of the SGDT, SIDT, SLDT, SMSW, and STR instructions. If this exception occurs, the kernel sends a SEGV signal to the program. Tested on a Google Cloud N2 instance with an Intel Ice Lake CPU.
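For reference, enabling these features amounts to setting the corresponding CR4 bits once CPUID reports support; the following is a generic ring-0 sketch (not the nanos code), using the bit positions from the Intel SDM (SMEP is CR4 bit 20, UMIP is CR4 bit 11), with the CPUID checks elided.

```c
#include <stdint.h>

#define CR4_UMIP (1ull << 11)
#define CR4_SMEP (1ull << 20)

static inline void cr4_set(uint64_t bits)
{
    uint64_t cr4;
    asm volatile("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= bits;
    asm volatile("mov %0, %%cr4" :: "r"(cr4));
}

/* called during per-cpu init, only after CPUID reports both features */
static void enable_smep_umip(void)
{
    cr4_set(CR4_SMEP | CR4_UMIP);
}
```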

view details

Justin Sanders

commit sha ef62dd58355103e6048b012707b8e279f099a4d7

Aarch64: report atomic hwcap in program aux vector The dynamic linker can use the cpu's capabilities as reported in the aux vector pushed onto the program's initial stack to generate additional subdirectories to search when looking for dynamic libraries. This allows glibc to load hardware optimized versions of a library when available. On the aarch64 platform, glibc checks for atomic instructions on the cpu to prefer those versions, located in /usr/lib/aarch64-linux-gnu/atomics. If the filesystem only contains the atomics version but atomics are not reported in hwcaps, the dynamic library loading will fail as the atomics subdirectory will not be checked.

view details

Francesco Lavra

commit sha b6595578a1809e9d5e4ada3fd5523642b6583832

Hyper-V netvsc: remove promiscuous mode setup Promiscuous mode is never used in the kernel, thus the `hv_promisc_mode` global variable is not needed and can be removed.

view details

Francesco Lavra

commit sha 8b3ad5b2d38b2d7bed8cb5debe29e0c7afee5ea2

x86 TLB flush: remove flush_heap static variable This variable is only used in init_flush(), where its value is available as a function argument, thus we can remove it.

view details

Francesco Lavra

commit sha 964455f2ef1a7cc4b4cb821f20a2c35365cf17cc

x86 MP: remove `apboot` static variable This variable is being initialized to a fixed value, thus it can be replaced by a preprocessor define.

view details

Francesco Lavra

commit sha 99c0db5e1f5da9bf300bf08f4620de42a5751666

Global/static variables: add const modifier where appropriate Adding the const modifier instructs the linker to put those variables in the rodata section, which can be mapped as read-only in the MMU and thus protected against unwanted modifications. The `-n` linker flag, which turns off page alignment of ELF sections, may cause issues because the boot code that maps the kernel ELF binary expects the file offset of all sections to have the same page offset as the base virtual address, and the `-n` flag may cause this requirement to not be satisfied; thus, this flag is being removed. In order to prevent the linker from aligning ELF sections to 2MB boundaries (which would considerably increase the kernel binary file size), the `-z max-page-size=4096` linker flag is being added.

view details

Francesco Lavra

commit sha 41d8930521056bdaf1ec2dd8efb6dd16c564caac

Global/static variables: implement ro-after-init feature Certain global and static variables are initialized during boot and never modified after boot. In order to prevent unwanted modifications, these variables are now placed in special linker sections that are initially mapped as read-write but subsequently re-mapped as read-only. There are two such sections: one (selected by the RO_AFTER_INIT attribute) is for variables declared with fixed-value initializers, and the other (selected by the BSS_RO_AFTER_INIT attribute) is for uninitialized variables.
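A sketch of how such attributes are typically defined; the section names here are illustrative and not necessarily the ones used in the nanos linker scripts.

```c
/* Variables tagged this way land in dedicated sections that the linker script
 * groups together so they can be remapped read-only once boot-time
 * initialization is finished. */
#define RO_AFTER_INIT     __attribute__((section(".ro_after_init")))
#define BSS_RO_AFTER_INIT __attribute__((section(".bss.ro_after_init")))

RO_AFTER_INIT static int max_cpus = 64;        /* fixed-value initializer */
BSS_RO_AFTER_INIT static void *boot_params;    /* written once during boot */
```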

view details

Francesco Lavra

commit sha eed24389ab8aad8cfbaf5e5d3b3f55356d8a24ae

syscalls: truncate: fix call to filesystem_put_node() filesystem_put_node() should be called only if the previous call to filesystem_get_node() was successful, otherwise the filesystem reference count is wrongly decremented (and the filesystem lock is wrongly released if held by another thread).

view details

sanderssj

commit sha 61352ddb0a8824b7661dd9f31071d45f04e936fd

Merge branch 'master' into aarch64-atomic-hwcap

view details

sanderssj

commit sha d0504aa6602c724a07c355c85732bb3c0bf9bd07

Merge pull request #1646 from nanovms/aarch64-atomic-hwcap Aarch64: report atomic hwcap in program aux vector

view details

Francesco Lavra

commit sha 94e0bc7258c46bbd18dd01e7aa09fee7e1b7eeae

VMbus timer interrupt: add call to schedule_timer_service() Without this call, kernel timers are not serviced when a timer peripheral fires, which may cause among other things failure to get an IP address via DHCP on Azure instances.

view details

Francesco Lavra

commit sha 12381c4c068adb4079fc6bd8a6f6ca5e9e08ddba

net connection_handler: use input_buffer_handler for incoming data The closure that handles incoming data on a given network connection may close the connection (by invoking the output buffer handler with a null buffer argument); if this happens, the direct_conn struct is deallocated, thus it should not be accessed anymore. This change replaces the buffer_handler closure type for handling incoming data with a new input_buffer_handler type, whose return value is a boolean that indicates whether the connection has been closed. The code in net/direct.c uses this return value to ensure that the direct_conn struct is not accessed after closing a connection. In addition, since in the case of client connections the associated direct struct is also deallocated, the direct_conn_closed() function has been modified to return a boolean value that indicates whether the direct struct has been deallocated. This change fixes a crash occurring on Microsoft Azure instances when the cloud_init klib reports the instance status to the wire server.
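The essence of the new closure type is that the callee reports whether it tore down the connection, so the caller knows not to touch freed state; a sketch of the pattern with illustrative types (not the Nanos definitions):

#include <stdbool.h>
#include <stddef.h>

struct conn;    /* opaque, illustrative stand-in for direct_conn */

/* returns true if the handler closed (and possibly freed) the connection */
typedef bool (*input_buffer_handler)(struct conn *c, const void *buf, size_t len);

static void service_input(struct conn *c, input_buffer_handler ih,
                          const void *buf, size_t len)
{
    bool closed = ih(c, buf, len);
    if (closed)
        return;     /* c may have been deallocated: do not dereference it */
    /* safe to keep using c here, e.g. to re-arm the receive path */
}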

view details

Francesco Lavra

commit sha 2014435c52551179a105fc8ea91a7b44cf31cb17

Unix pipe: only report EPOLLIN event when data is available to read The Unix pipe implementation is incorrectly reporting both EPOLLIN and EPOLLHUP events on a pipe reader file descriptor after closing the writer file descriptor. This is causing `Null check operator used on a null value` unhandled exceptions when running a Dart program without AOT compilation. This change fixes the above issue by only reporting EPOLLIN if data is available in the pipe buffer. A test case is being added to the pipe runtime test to check that the correct poll events are reported after closing the write side of a pipe.
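The expected semantics can be observed from userspace with a few standard POSIX calls; on Linux, after the writer closes, the reader reports POLLHUP in both cases and POLLIN only while unread data remains.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[8];
    if (pipe(fds) < 0)
        return 1;
    write(fds[1], "x", 1);
    close(fds[1]);                      /* writer closed, one byte still buffered */

    struct pollfd p = { .fd = fds[0], .events = POLLIN };
    poll(&p, 1, 0);
    printf("with data: POLLIN=%d POLLHUP=%d\n",
           !!(p.revents & POLLIN), !!(p.revents & POLLHUP));

    read(fds[0], buf, sizeof(buf));     /* drain the pipe */
    p.revents = 0;
    poll(&p, 1, 0);
    printf("drained:   POLLIN=%d POLLHUP=%d\n",
           !!(p.revents & POLLIN), !!(p.revents & POLLHUP));
    return 0;
}

The first poll should report both flags, the second only POLLHUP, matching the behavior the added runtime test case checks for.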

view details

Francesco Lavra

commit sha 50a9b79d2663c1bae87b598007fe83a987e526ee

Network socket: modify poll events to match Linux implementation In Linux, a TCP socket that has never been connected reports (POLLHUP | POLLOUT) events when polled, and a TCP socket whose connection with the remote peer has been closed reports (POLLIN | POLLOUT). This commit modifies the events reported in socket_events() to match the Linux implementation. This fixes an issue that makes the redis server (version 5.0.5) use 100% CPU when subject to a workload from YCSB; more specifically, the redis server relies on TCP sockets whose connection with the remote peer has been closed to report the EPOLLIN event.
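The never-connected case is easy to observe with a few lines of portable socket code (the expected POLLHUP | POLLOUT result is per the commit message above; the peer-closed case additionally needs a remote end):

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return 1;
    struct pollfd p = { .fd = s, .events = POLLIN | POLLOUT };
    poll(&p, 1, 0);
    printf("never-connected TCP socket: POLLIN=%d POLLOUT=%d POLLHUP=%d\n",
           !!(p.revents & POLLIN), !!(p.revents & POLLOUT),
           !!(p.revents & POLLHUP));
    close(s);
    return 0;
}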

view details

Will Jhun

commit sha 7b06c5b6ba0b6f014677fde2ccd023e930b5b80e

queue: allow specification of entry size with *_n variants This change allows queueing of multi-word entries (any power-of-2 number of words) using the *_n variants of the enqueue and dequeue methods. The same entry size must be used for all operations on a given queue (the test only changes the size at a quiescent state); mixing operations of different sizes is not supported.

view details

Will Jhun

commit sha f1250320983077e9982c7985ef1aad7e35fac883

asynchronous completion queues This creates new scheduling queues for deferred processing of status_handlers and other closures whose execution may, or may need to, happen asynchronously from variable application. The async_apply*() functions take a closure bound with its arguments and enqueue it on the specified async queue. The queues are dispatched directly from the runloop, so queued continuations are free to do frame returns, switch to new stacks or take other paths that culminate in a call to runloop() rather than returning directly. This change is necessary for pending context switching work.
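A minimal, generic sketch of the deferred-apply idea (not the Nanos API; the real async_apply*() functions, queue types and closure binding differ): bind the function with its argument at enqueue time and invoke it later from the runloop.

#include <stddef.h>

typedef void (*apply_1)(void *arg);

struct async_item { apply_1 fn; void *arg; };

#define ASYNC_QLEN 64
static struct async_item async_q[ASYNC_QLEN];
static size_t async_head, async_tail;   /* single-producer, single-consumer sketch */

static int async_apply_1_sketch(apply_1 fn, void *arg)
{
    if (async_tail - async_head == ASYNC_QLEN)
        return 0;                               /* queue full */
    async_q[async_tail++ % ASYNC_QLEN] = (struct async_item){ fn, arg };
    return 1;
}

/* dispatched from the scheduler loop; a queued item is free to switch stacks
 * or re-enter the runloop instead of returning here */
static void async_dispatch(void)
{
    while (async_head != async_tail) {
        struct async_item it = async_q[async_head++ % ASYNC_QLEN];
        it.fn(it.arg);
    }
}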

view details

Will Jhun

commit sha c62d0dfb41d8fca84893f68ac0453ca43dcbf818

use async_apply_1() for virtqueue completion processing The async queues greatly simplify virtqueue completion handling. The service queue and task are removed in favor of queueing directly from the interrupt handler. Used vqmsgs are now recycled with a free list rather than deallocated.

view details

Will Jhun

commit sha 27efa558787363bc35d54e92bac5f13f1d440eb9

async completions for pagecache and blockq This moves blockq action invocation to the async queues (note: some thread_resumes are currently missing, but this will be fixed with context switches in async dispatch). io_completions are now called directly from bq action handlers rather than stored in the thread for later completion. Pagecache completions are now queued for asynchronous completion. In futex_wake_one(), now that the blockq wake happens asynchronously, take the valid returned thread as proof of a wake, adjust the waiter count and return at once (this function could previously wake up multiple waiters, despite its name). In pagecache_write_sg(), move new, whole pages to the writing state immediately to prevent an intervening page fill. In runloop_internal(), check the queues again before going to sleep to catch work queued during service. Blockq actions now must re-insert themselves into the blockq if a wakeup occurs for a thread but blocking needs to continue; the blockq_block_required() helper has been added for this purpose.

view details

Will Jhun

commit sha 472cd5faa6362df1808bf1161a2749b1ee2adeee

initial commit for new context type and embedding of context within closures This work introduces a redefined context type that associates a running frame with methods for pausing, resuming and scheduling contexts, a fault handler, and a transient heap that can be used for allocations that exist for the lifetime of the context. kernel_contexts and unix threads now extend this base type, and the syscall_context type has been created to encompass syscall operations. Syscall operations which block awaiting continuation now stamp their respective context in the continuation closure using contextual_closure(). Areas of the kernel which service such continuations now invoke them using the async_apply*() functions, so that the context saved in such closures is resumed. Now that each syscall in progress is represented by its own syscall_context, it is possible to service concurrent page faults for multiple syscalls. unix_fault_handler now handles faults in a way more orthogonal to the context type, instead relying on the context methods to suspend and resume the context when blocking is necessary for faults on file-backed memory. Frame storage embedded within the context is now a fixed size, representing only a basic frame save. The extended frame (FP / SIMD) storage is now allocated separately, pointed to by the FRAME_EXTENDED field in the basic frame.

view details

Will Jhun

commit sha e9a64e1c45fef3ffed0f5607b98a55e2fcedf104

remove kernel lock, fold async queues into one

view details

push time in 12 days

push eventnanovms/nanos

Will Jhun

commit sha 8d4e659912afa8fcc5a6459f95089bb87a1afd6f

pending_fault_complete(): protect service of pending_fault dependents with p->faulting_lock This fixes a race whereby a thread could be added to a pending_fault's dependents vector during or after service of the vector in pending_fault_complete(). Such an occurrence could result in a thread hanging indefinitely on a page fault. The taking of p->faulting_lock in pending_fault_complete() is moved up to before the dependents vector is serviced.
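The shape of the race and of the fix, in generic form (illustrative names, with a pthread mutex standing in for p->faulting_lock): the lock must be held while the dependents list is drained, so a late arrival either lands on the list before the drain or observes that the fault has already completed.

#include <pthread.h>

struct waiter { struct waiter *next; void (*wake)(struct waiter *); };

struct pending {
    pthread_mutex_t lock;           /* stands in for p->faulting_lock */
    struct waiter *dependents;
    int complete;
};

static void pending_complete(struct pending *p)
{
    pthread_mutex_lock(&p->lock);   /* taken BEFORE servicing dependents */
    p->complete = 1;                /* new arrivals now see completion */
    struct waiter *w = p->dependents;
    p->dependents = 0;
    pthread_mutex_unlock(&p->lock);
    while (w) {                     /* wake outside the lock */
        struct waiter *next = w->next;  /* read next before wake may free w */
        w->wake(w);
        w = next;
    }
}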

view details

push time in 12 days

Pull request review commentnanovms/nanos

mutex implementation

+#include <kernel.h>
+
+//#define MUTEX_DEBUG
+#ifdef MUTEX_DEBUG
+#define mutex_debug(x, ...) do {log_printf(" MTX", "%s: " x, __func__, ##__VA_ARGS__);} while(0)
+#else
+#define mutex_debug(x, ...)
+#endif
+
+#ifdef KERNEL
+#define mutex_pause kern_pause
+#else
+#define mutex_pause()
+#endif
+
+/* This implements a two-phase mutex which optionally allows a degree for
+   spinning when aquiring the lock. A limitation of this is that the spinning
+   is only for an uncondended lock, and the waiter does not enter the queue of
+   waiters until the second phase (context suspend). As such, waiters for a
+   mutex are not strictly processed in FIFO order.
+
+   Linux uses an MCS lock for the first phase, but the unqueueing (removal)
+   operation is complex (yet necessary in order to migrate waiters into the
+   queue of suspended tasks). It may be worth implementing something similar
+   here, perhaps moreso for the lessened bus contention than the FIFO
+   property. */
+
+static inline void mutex_acquired(mutex m, context ctx)
+{
+    assert(!m->turn);
+    m->turn = ctx;
+    ctx->waiting_on = 0;
+}
+
+static inline boolean mutex_cas_take(mutex m, context ctx)
+{
+    if (compare_and_swap_32(&m->count, 0, 1)) {
+        mutex_acquired(m, ctx);
+        return true;
+    }
+    return false;
+}
+
+static inline boolean mutex_lock_internal(mutex m, boolean wait)
+{
+    cpuinfo ci = current_cpu();
+    context ctx = get_current_context(ci);
+
+    mutex_debug("cpu %d, mutex %p, wait %d, ra %p\n", ci->id, m, wait,
+                __builtin_return_address(0));
+    mutex_debug("   ctx %p, turn %p, count %d\n", ctx, m->turn, m->count);
+
+    /* not preemptable (but could become so if needed) */
+    if (m->turn == ctx)
+        halt("%s: lock already held - cpu %d, mutex %p, ctx %p, ra %p\n", __func__,
+             ci->id, m, ctx, __builtin_return_address(0));
+
+    if (!wait)
+        return mutex_cas_take(m, ctx);
+
+    u64 spins_remain = m->spin_iterations;
+    while (spins_remain-- > 0) {
+        if (mutex_cas_take(m, ctx))
+            return true;
+        mutex_pause();
+    }
+
+    if (fetch_and_add_32(&m->count, 1) == 0) {
+        mutex_acquired(m, ctx);
+        return true;
+    }
+
+    /* race covered by dequeue loop in unlock */
+    assert(!frame_is_full(ctx->frame));
+
+    mutex_debug("ctx %p about to wait, count %d\n", ctx, m->count);
+    ctx->waiting_on = m;
+    while (!enqueue(m->waiters, ctx)) {
+        mutex_pause();      /* XXX timeout */
+    }

Ok, I think I have a solution. In the latest commit, the waiters queue is replaced by a lock-free singly-linked list (based on the one described in the MCS lock papers) composed of waiting contexts.

For another PR, I'll look at adding generic support for such a list - e.g. to use for pending_fault dependents. There are probably a number of places in the kernel that could benefit from such a lock-free list.

wjhun

comment created time in 12 days

push eventnanovms/nanos

Will Jhun

commit sha e92901ccc003416bd5ab31383b5b3fa9060ab43f

address netsock bugs exposed by mutex lock Some races in the netsock and webg tests were exposed as a result of the lwIP lock acquisition causing context suspends. - In sock_read_bh_internal() and socket_write_tcp_bh_internal(), acquire the lwIP lock prior to inspecting socket state. - In netsock_poll(), net_loop_poll_queued was being cleared before acquiring the lwIP lock. If the context were to be suspended when taking the lock, further unnecessary enqueues could pile up. Move the lock acquire prior to clearing the flag. - In socket_write_tcp_bh_internal(), proactively invoke netif_poll_all() if the TCP socket sndbuf is full, avoiding the need to continue blocking in the case that netsock_poll() is scheduled but awaiting dispatch.

view details

Will Jhun

commit sha a47becbdd2a579fe665908555ee8e32dce68e547

use lock-free list to implement queue of waiting contexts Using a fixed-size queue of waiters and spinning on failing enqueue operations ultimately defeats the mutex property that a context awaiting a mutex lock cannot tie up a CPU indefinitely. To remedy this, the queue has been substituted with a singly-linked list of waiting contexts. The lock-free list follows the one used in the MCS lock, albeit for enqueuing waiters and not spinning.
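A minimal sketch of this kind of MCS-style waiter enqueue, written with C11 atomics (not the kernel's code, which uses its own atomic primitives and context type): atomically swap the tail, then link the predecessor to the new node, with no bound on the number of waiters and no spinning on a full queue.

#include <stdatomic.h>
#include <stddef.h>

struct waiter {
    _Atomic(struct waiter *) next;
    /* ... context to resume when the mutex is handed over ... */
};

struct waitlist {
    _Atomic(struct waiter *) tail;      /* most recently enqueued waiter, or NULL */
};

static void waitlist_enqueue(struct waitlist *l, struct waiter *w)
{
    atomic_store_explicit(&w->next, NULL, memory_order_relaxed);
    struct waiter *prev =
        atomic_exchange_explicit(&l->tail, w, memory_order_acq_rel);
    if (prev)
        atomic_store_explicit(&prev->next, w, memory_order_release);
    /* prev == NULL means the list was empty and w is now the sole waiter;
     * the unlock path would detect this and hand the mutex to w. */
}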

view details

push time in 12 days
