These notes were made in an attempt to audit all execution paths that stem from interrupt handlers through enqueues to bhqueue and subsequent bhqueue service, as well as the request paths originating under the kernel lock that might possibly contend with these paths that run outside of the lock.
Most issues found here concern a lack of safety with heap allocation, primarily from heap (de)allocations in bh service as well as deallocation of objects upstream as a result of closure_finish() or a refcount release. There are a handful of allocations at interrupt level as well. The proposed solution here is to add spinlocks to the lower-level heaps (physical, which is already protected by a lock in page.c, plus virtual_huge, virtual_page, and backed) and add a lock-protected mcache for memory that might be (de)allocated in bh service (or in interrupt handlers, though I would personally prefer to see these removed altogether if possible). The 'general' mcache will remain unlocked for use under the kernel lock only (for which we can add debug assertions that the kernel lock is in fact held during (de)allocs). The places that require use of a locked heap are indicated by the TODO items below (as well as some that otherwise require additional locking, or might just need verification).
BUG items indicate general bugs that were found during the audit, and COMPLETION items highlight the areas where a status handler completion is invoked (no action required for these - added for reference).
While there are many TODO items listed here, most would be cleared out by the use of locked heaps.
Note that completions up into the syscall level are not covered here; I think it will be possible to invoke these from the runqueue with a minor change to the per-page completion handling. Ideally, we can isolate all these safety issues to the pagecache level and lower.
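The debug assertion idea for the unlocked 'general' heap could look roughly like the sketch below. Everything in it is hypothetical: `kernel_lock_held`, `general_alloc()`, `general_dealloc()` and the violation counter are illustrative stand-ins rather than existing kernel interfaces, and malloc/free stand in for the mcache. A real debug build would presumably halt instead of counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for "is the kernel lock held?" (not a real kernel API) */
static bool kernel_lock_held;

/* Number of (de)allocs seen without the lock; a debug kernel would halt instead */
static unsigned long general_heap_violations;

static void kernel_lock_acquire(void) { kernel_lock_held = true; }
static void kernel_lock_release(void) { kernel_lock_held = false; }

/* Debug check run on every general heap operation */
static void general_heap_check(void)
{
    if (!kernel_lock_held)
        general_heap_violations++;
}

/* Unlocked 'general' heap alloc; only legal under the kernel lock */
static void *general_alloc(size_t len)
{
    assert(len > 0);
    general_heap_check();
    return malloc(len);
}

/* Unlocked 'general' heap dealloc; only legal under the kernel lock */
static void general_dealloc(void *p)
{
    general_heap_check();
    free(p);
}
```

With such a check in place, any bh-service or interrupt-level path that still touches the general heap would show up as a violation during testing instead of as silent heap corruption.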
drivers
ata-pci: apci->service (ata_pci_service)
ata_pci_service_reqs()
ata_pci_service_req()
debug msgs could alloc, so maybe rprintf & friends should
use locked heap (not performance sensitive anyway) - actually they use stack allocations
ata_io_cmd_dma()
ata_set_lba()
kernel_delay
no allocs, but calls now() and kern_pause - must be
safe outside kern lock - TODO
may call timm() on error - so errheap in runtime_init.c
should use locked heap - TODO
ata_io_cmd() (for pio fallback)
also may call timm()
ata_wait()
ata pio and kernel_delay, covered above
status handler invoked from here - COMPLETION
status handler invoked on error or request completion - COMPLETION
closure_finish - r->completion in storvsc_io() must come
from locked heap - TODO
storvsc_free_request - list insert under lock, safe
storvsc_process_hcb_queue
storvsc_action_io
create_storvsc_request - fills reqp, don't see any heap
activity
hv_storvsc_io_request
vmbus_chan_send_prplist
vmbus_txbr_write - no heap ops found, looks safe
vmbus_chan_signal_tx - atomic sets and hypercall - safe
vmbus_chan_send - similar to above, looks safe
bhqueue_enqueue_irqsafe
kernel_demand_pf_complete
the lock-free queue structure allows multiple writers, however
it is not safe for writing to a queue from both the interrupt
handler and kernel mode
so all enqueues to the bhqueue from the kernel should
disable interrupts (ok, ints are disabled already, but we
should still protect these cases in case one day we want to
handle interrupts during kernel processing) - maybe change
this to a more generic "enqueue_irqsafe" - TODO
virtqueue
virtqueue_service_vqmsgs
vq_interrupt
BUG: vq->service should be enqueued to vq->sched_queue, not
bhqueue
vq->service: virtqueue_service_vqmsgs
BUG: dequeue from service_queue needs to be irqsafe, make a
"dequeue_irqsafe" - Actually it doesn't - there's no contention with a dequeue in interrupt level, and the enqueue in interrupt level shouldn't affect the dequeue.
m->completion applied here - COMPLETION
deallocate_vqmsg
allocate_vqmsg must allocate vqmsg and m->descv buffer
from locked heap - TODO
sg_buf_release - shouldn't cause dealloc upstream,
especially if reference to page is held during
read - but verify --- on review, a dealloc could happen if a page was touched and then released before read completion - but with the cache using a locked heap, this shouldn't be an issue
status_handler invoked for saved write error or zero range
COMPLETION
allocate_merge to pagecache_write_sg_finish
needs to come from safe heap - TODO
initiate reads for any RMW pages
see touch_or_fill_page_nodelocked above
prepare pages loop
allocate_page_nodelocked - see above
status_handler invoked on alloc fail - COMPLETION
merge should be discarded - minor, but BUG
realloc_pagelocked - see above
enqueue_page_completion_statelocked - see above
may immediately invoke pagecache_write_sg_finish if no I/O
initiated with merge
pagecache_write_sg_finish - may be invoked either from
pagecache_write_sg (syscall or dirty page commit) or from merge
completion on the bhqueue
page_lookup_nodelocked
rbtree_lookup for node->pages - protected by node lock
read error / bound(complete) == true path
BUG: write_count decrement, state change to NEW and refcount
release should only happen when bound(complete) is true, not
on a read error...
pagecache_page_queue_completions_locked
these are actually only invoked after blocking I/O
completion, and thus are always run from the bh path
we have completions which should be run with the kernel
lock held (i.e. syscall -> pagecache_{read,write}sg,
pagecache_sync*) as well as intermediate completions
which need to run on the bhqueue (pagecache_map_page
(page fault), pagecache_write_sg_finish)
so we can invoke the bh completions directly and, if
any runqueue ones remain, queue a runqueue service
routine to handle them - TODO
we have both syscall completions and internal kernel
completions here (e.g. page fault,
pagecache_write_sg_finish, pagecache_sync_*)
queues pc->service_completions
(pagecache_service_completions) to bhqueue
pagecache_service_completions (from bhqueue)
status_handlers invoked from here - COMPLETION
deallocate_vector - see
enqueue_page_completion_statelocked above
refcount_release on page - releases reference held during
write operation
could possibly lead to pagecache_page being freed if page
was evicted before i/o operation finished
pagecache_page_free
change_page_state_locked: move to free list, safe
dealloc from contiguous, safe if contiguous heap is
closure_finish() - see allocate_merge note in
pagecache_write_sg above
read completion (status ok) / apply write path
allocate_sg_list - see above
applies error to completion on alloc fail - COMPLETION
sg_list_tail_add
see above
sg_copy_to_buf
sg_buf_release - could possibly cause dealloc upstream,
though many cases will be a user buffer with null
refcount - need to verify paths anyway - TODO
fs_write (filesystem_storage_write)
change to schedule write from runqueue (locked) - TODO
may do write finish as internal func, and skip merge if no
rmw or read pending pages
Merge completion now scheduled from runqueue. Simpler to leave merge in-place, and completion closure is used for write completion regardless of whether or not reads are outstanding.
completion path:
pagecache_read_page_complete
called on storage read completion from bh path, or
directly from filesystem_storage_read if no I/O (e.g. all
extents uninited)
incomplete read error handling - TODO / file issue - Errors are handled, but there is a console dump of the error with a TODO to make a facility for capturing and reporting (with rate limiting) device I/O errors
change_page_state_locked - locked, no heap ops
pagecache_page_queue_completions_locked - see above
sg_list_release
release refcount held on pages while reading - similar to
refcount_release for pagecache_write_sg_finish noted above
deallocate_sg_list
add to sg free list, safe
closure_finish - covered in request path above
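The generic "enqueue_irqsafe" suggested under bhqueue_enqueue_irqsafe above can be sketched as a wrapper that saves the interrupt state, disables interrupts, enqueues, and restores the saved state. This is a user-space illustration only: `irq_enabled`, `irq_disable_save()` and `irq_restore()` simulate the CPU interrupt flag, and the small ring buffer is a hypothetical stand-in, not the kernel's lock-free queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SIZE 8 /* hypothetical capacity */

/* Simulated interrupt-enable flag; a real kernel would read and modify CPU flags */
static bool irq_enabled = true;

/* Disable interrupts and return the previous state */
static bool irq_disable_save(void)
{
    bool was = irq_enabled;
    irq_enabled = false;
    return was;
}

/* Restore the state returned by irq_disable_save() */
static void irq_restore(bool was) { irq_enabled = was; }

struct queue {
    void *slots[QUEUE_SIZE];
    unsigned head, tail; /* free-running counters */
};

/* Raw enqueue: must not be interrupted by another writer on this CPU */
static bool enqueue_raw(struct queue *q, void *p)
{
    assert(!irq_enabled);
    if (q->tail - q->head == QUEUE_SIZE)
        return false; /* full */
    q->slots[q->tail % QUEUE_SIZE] = p;
    q->tail++;
    return true;
}

/* Safe from kernel mode whether or not interrupts were already disabled */
static bool enqueue_irqsafe(struct queue *q, void *p)
{
    bool was = irq_disable_save();
    bool ok = enqueue_raw(q, p);
    irq_restore(was);
    return ok;
}

/* Dequeue; returns NULL when the queue is empty */
static void *dequeue(struct queue *q)
{
    if (q->head == q->tail)
        return NULL;
    void *p = q->slots[q->head % QUEUE_SIZE];
    q->head++;
    return p;
}
```

Saving and restoring the prior state, rather than unconditionally re-enabling, keeps the wrapper correct both today (interrupts already disabled in kernel mode) and if interrupts are ever handled during kernel processing.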
As enumerated in #1302, there are many paths in the kernel where the kernel lock is not held and yet allocations or deallocations are being made from one of the kernel heaps. To address this issue, these changes introduce a "locked" kernel heap. Like general, this is an mcache heap, accessed through a locking wrapper heap which guards accesses with a spinlock. The "backed" heap is also accessed via a locking wrapper.
Since id_heap-specific methods wouldn't be covered by a generic wrapper, a "locking" flag is specified on id_heap creation. The virtual huge and page heaps are now locking and thus safe to use from any context.
Allocations made (and released) under protection of the kernel lock should continue to use the general heap, as this will avoid unnecessary spinlock operations.
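A minimal sketch of the locking wrapper idea follows, assuming a simplified heap interface with only `alloc` and `dealloc` methods. The `struct heap` layout, `counting_heap` (a malloc-backed stand-in for the mcache), and the C11 `atomic_flag` spinlock are all assumptions made for illustration and do not reflect the actual kernel definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical minimal heap interface */
typedef struct heap {
    void *(*alloc)(struct heap *h, size_t len);
    void (*dealloc)(struct heap *h, void *p, size_t len);
} *heap;

/* Unlocked parent heap (stand-in for an mcache), backed by malloc */
struct counting_heap {
    struct heap h;    /* must be first so a heap pointer can be cast back */
    size_t allocated; /* bytes outstanding; unsafe to update concurrently */
};

static void *counting_alloc(struct heap *h, size_t len)
{
    struct counting_heap *c = (struct counting_heap *)h;
    void *p = malloc(len);
    if (p)
        c->allocated += len;
    return p;
}

static void counting_dealloc(struct heap *h, void *p, size_t len)
{
    struct counting_heap *c = (struct counting_heap *)h;
    free(p);
    c->allocated -= len;
}

/* Wrapper heap: forwards to the parent with a spinlock held */
struct locking_heap {
    struct heap h;
    heap parent;
    atomic_flag lock;
};

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

static void *locking_alloc(struct heap *h, size_t len)
{
    struct locking_heap *lh = (struct locking_heap *)h;
    spin_lock(&lh->lock);
    void *p = lh->parent->alloc(lh->parent, len);
    spin_unlock(&lh->lock);
    return p;
}

static void locking_dealloc(struct heap *h, void *p, size_t len)
{
    struct locking_heap *lh = (struct locking_heap *)h;
    spin_lock(&lh->lock);
    lh->parent->dealloc(lh->parent, p, len);
    spin_unlock(&lh->lock);
}

static void locking_heap_init(struct locking_heap *lh, heap parent)
{
    assert(parent != NULL);
    lh->h.alloc = locking_alloc;
    lh->h.dealloc = locking_dealloc;
    lh->parent = parent;
    atomic_flag_clear(&lh->lock);
}
```

Callers outside the kernel lock would hold only the wrapper's `heap` pointer, while code running under the kernel lock keeps using its own unlocked heap and pays no spinlock cost. A spinlock taken at interrupt level would additionally need interrupts disabled while held, which this sketch does not model.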
drivers
storvsc: vmbus_chan_open
virtio_scsi
addressed in incomplete handling of virtio scsi events #1441
would be a locked heap
s->v->virtio_dev.contiguous, should be locked in new
arrangement
heap - TODO
virtio_storage
pvscsi
meta from dev->general - but meta is only used on objcache
alloc, so should be safe
change to dequeue_irqsafe - TODO - as with virtqueue, dequeue doesn't contend with another...
list_push_back - (fixed in pvscsi: take queue_lock before enqueueing to hcb_queue #1442)
xen.c: xen_grant_init
(interrupt level), need to use irq-safe heap or defer - TODO / BUG
xenblk
irq-safe locked heap - TODO
heap - TODO
xen_grant_init above
xen_grant_init above
completions
(check net drivers too - anything on bhqueue?)
pagecache/tfs
request path:
pagecache read
rbtree traversal - but should always happen with kernel lock held - verify - TODO - nope, we should be covered by node lock
if alterations are made from bh path - verify - TODO
is this redundant? should be covered in touch_or_fill_page_nodelocked... maybe BUG - looks ok on review - the reserve here is for alloc, not touch
completion will happen in bh path - TODO
pagecache write - pn->cache_write (pagecache_write_sg)