track safety of interrupt handler and bhqueue processing paths #1302

wjhun · 2020-09-30T17:18:30Z

These notes were made in an attempt to audit all execution paths that stem from interrupt handlers through enqueues to bhqueue and subsequent bhqueue service, as well as the request paths originating under the kernel lock that might possibly contend with these paths that run outside of the lock.

Most issues found here concern a lack of safety with heap allocation, primarily from heap (de)allocations in bh service as well as deallocation of objects upstream as a result of closure_finish() or a refcount release. There are a handful of allocations at interrupt level as well. The proposed solution here is to add spinlocks to lower-level heaps (physical - which is already protected by a lock in page.c, virtual_huge, virtual_page, and backed) and add a lock-protected mcache for memory that might be (de)allocated in bh service (or in interrupt handlers, though I would personally prefer to see these removed altogether if possible). The 'general' mcache will remain unlocked for use under the kernel lock only (for which we can add debug assertions that the kernel lock is in fact held during (de)allocs). The places that require use of a locked heap are indicated by the TODO items below (as well as some that otherwise require additional locking, or might just need verification).

BUG items indicate general bugs that were found during the audit, and COMPLETION items highlight the areas where a status handler completion is invoked (no action required for these - added for reference).

While there are many TODO items listed here, most would be cleared out by the use of locked heaps.

Note that completions up into the syscall level are not covered here; I think it will be possible to invoke these from the runqueue with a minor change to the per-page completion handling. Ideally, we can isolate all these safety issues to the pagecache level and lower.

drivers

ata-pci: apci->service (ata_pci_service)
- ata_pci_service_reqs()
  - ata_pci_service_req()
    - debugs msgs could alloc, so maybe rprintf & friends should
      use locked heap (not performance sensitive anyway) actually they use stack allocations
    - ata_io_cmd_dma()
      - ata_set_lba()
        
        kernel_delay
        
        no allocs, but calls now() and kern_pause - must be
        safe outside kern lock - TODO
    - may call timm() on error - so errheap in runtime_init.c
      should use locked heap - TODO
    - ata_io_cmd() (for pio fallback)
      - also may call timm()
      - ata_wait()
        
        ata pio and kernel_delay, covered above
      - status handler invoked from here - COMPLETION
    - status handler invoked on error or request completion - COMPLETION
  - deallocates ata_pci_req
    - dev->h needs to be locked heap - TODO
storvsc: vmbus_chan_open
- hv_storvsc_on_channel_callback()
  - vmbus_chan_recv - optional debug output, otherwise looks safe
    - vmbus_rxbr_peek - no heap activity, uses spinlock
    - vmbus_rxbr_read - same
  - hv_storvsc_on_iocompletion()
    - storvsc_io_done
      - storvsc_cam_strvis - string func, no heap ops
  - hcb->completion applied here
    - storvsc_scsi_io_done
      - timm(), covered above
      - status handler called from here - COMPLETION
      - closure_finish - r->completion in storvsc_io() must come
        from locked heap - TODO
  - storvsc_free_request - list insert under lock, safe
  - storvsc_process_hcb_queue
    - storvsc_action_io
      - create_storvsc_request - fills reqp, don't see any heap
        activity
      - hv_storvsc_io_request
        
        vmbus_chan_send_prplist
        
        vmbus_txbr_write - no heap ops found, looks safe
        
        vmbus_chan_signal_tx - atomic sets and hypercall - safe
        
        vmbus_chan_send - similar to above, looks safe
bhqueue_enqueue_irqsafe
- kernel_demand_pf_complete
- the lock-free queue structure allows multiple writers, however
  it is not safe for writing to a queue from both the interrupt
  handler and kernel mode
  - so all enqueues to the bhqueue from the kernel should
    disable interrupts (ok, ints are disabled already, but we
    should still protect these cases in case one day we want to
    handle interrupts during kernel processing) - maybe change
    this to a more generic "enqueue_irqsafe" - TODO
virtqueue
- virtqueue_service_vqmsgs
- vq_interrupt
  - BUG: vq->service should be enqueued to vq->sched_queue, not
    bhqueue
- vq->service: virtqueue_service_vqmsgs
  - BUG: dequeue from service_queue needs to be irqsafe, make a
    "dequeue_irqsafe" Actually it doesn't - there's no contention with a dequeue in interrupt level, and the enqueue in interrupt level shouldn't affect the dequeue.
  - m->completion applied here - COMPLETION
  - deallocate_vqmsg
    - allocate_vqmsg must allocate vqmsg and m->descv buffer
      from locked heap - TODO
virtio_scsi
- virtio_scsi
  - virtio_scsi_enqueue_event:
    - allocation needs to come from locked heap - TODO
    - virtio_scsi_event_complete
      - calls virtio_scsi_enqueue_event to re-enqueue event -
        addressed in incomplete handling of virtio scsi events #1441
  - virtio_scsi_enqueue_request
    - allocation needs to come from locked heap - TODO
    - virtio_scsi_request_complete
      - applies vsr_complete completion - COMPLETION
        
        note that contiguous (backed) heap in proposed arrangement
        would be a locked heap
  - virtio_scsi_io
    - virtio_scsi_alloc_request
      - virtio_scsi_request allocation comes from
        s->v->virtio_dev.contiguous, should be locked in new
        arrangement
    - virtio_scsi_io_done allocation needs to come from locked
      heap - TODO
  - virtio_scsi_io_done
    - calls timm, addressed above
    - status_handler completion invoked here - COMPLETION
    - closure_finish, see virtio_scsi_io above
- s->command - unused? (not a bug - no need to queue requests there right now)
- s->eventq - completion re-enqueues itself - unused? see above
- s->requestq - virtio_scsi_enqueue_request above
virtio_storage
- storage_rw_internal
  - allocate_virtio_blk_req
    - allocate from contiguous, should be safe
  - allocate_vqmsg - covered above
  - must allocate complete closure from locked queue - TODO
- complete
  - timm() - covered above
  - status_handler completion invoked here - COMPLETION
  - deallocate_virtio_blk_req - contiguous heap
pvscsi
- pvscsi_attach
  - allocation of dev->hcb_objcache comes from contiguous, and
    meta from dev->general - but meta is only used on objcache
    alloc, so should be safe
- pvscsi_rx_service_bh
  - ~~change to dequeue_irqsafe - TODO~~ - as with virtqueue, dequeue doesn't contend with another...
  - apply hcb->completion - COMPLETION
  - pvscsi_hcb_dealloc
    - dealloc from contiguous - safe
    - dealloc from dev->hcb_objcache - should be safe, see above
- pvscsi_action_io (should take the lock for
  list_push_back - fixed in pvscsi: take queue_lock before enqueueing to hcb_queue #1442)
  - pvscsi_execute_ccb - looks safe
xen.c: xen_grant_init
- gt->entry_heap may have deallocations from xenblk_service_ring
  (interrupt level), need to use irq-safe heap or defer - TODO / BUG
  - check if xennet has violations too - TODO
xenblk
- xenblk_probe
  - xbd->h is used in xenblk_service_ring (int level), must be
    irq-safe locked heap - TODO
  - xbd->rreqs also allocated from general, should be int-safe
    heap - TODO
- xenblk_event_handler
  - xenblk_service_ring
    - xenblk_get_rreq
      - allocates from xbd->h, see above
      - vector_push to xbd->rreqs, see above
    - xen_grant_page_access - uses entry_heap - see
      xen_grant_init above
  - xenblk_service_pending - called from int handler and from io ops
    - xen_revoke_page_access
      - deallocation from xen_info.gtab.entry_heap - see
        xen_grant_init above
      - timm(), so errheap needs to be irqsafe too
  - xen_notify_evtchn - hypercall, should be safe
- xenblk_io - called within kern lock but may contend with bh
  completions
  - xenblk_get_req
    - alloc from xbd->h, see above TODO
  - xenblk_service_pending - see above
- xenblk_bh_service
  - status_handler invoked from here - COMPLETION

(check net drivers too - anything on bhqueue?)

pagecache/tfs

request path:

pagecache read
- entry points into touch_or_fill_page_nodelocked
  - pagecache_read_sg
  - pagecache_write_sg
    - touch_or_fill_page_by_num_nodelocked
  - pagecache_node_fetch_pages (readahead)
  - pagecache_map_page
  - pagecache_map_page_if_filled
    - touch_or_fill_page_by_num_nodelocked
- pagecache_read_sg
  - allocate_merge - safe heap - TODO
  - rbtree traversal - but should always happen with kernel lock
    held - verify - TODO - nope, we should be covered by node lock
  - allocate_page_nodelocked
    - contiguous alloc, safe
    - pagecache_page alloc, needs safe heap - TODO
    - rbtree_insert_node
      - safe from here, but rbtree may need to live on safe heap
        if alterations are made from bh path - verify - TODO
  - realloc_pagelocked
    - allocates page from pc->contiguous; should be safe
    - is this redundant? should be covered in
      touch_or_fill_page_nodelocked... maybe BUG looks ok on review - the reserve here is for alloc, not touch
  - sg_list_tail_add
    - covered below
  - status_handler invoked on alloc failure or no blocking I/O
    - COMPLETION
- touch_or_fill_page_nodelocked
  - enqueue_page_completion_statelocked
    - needs to allocate vector from safe heap - TODO
  - realloc_pagelocked
    - see above
  - allocate_sg_list
    - allocates from internal sg_heap - needs to be safe - TODO
  - sg_list_tail_add
    - buffer_extend, should be safe if sg_heap is safe
  - pagecache_read_page_complete closure
    - needs to be allocated from safe heap - TODO
  - fs_read (tfs: filesystem_storage_read)
    - allocate_merge - needs to come from safe heap as merge
      completion will happen in bh path - TODO
    - rangemap_range_lookup_with_gaps
      - read_extent
        
        filesystem_storage_op - looks safe, verify buf release
        
        block device io read
        
        sg_list_head_remove - looks safe
        
        sg_buf_release - shouldn't cause dealloc upstream,
        especially if reference to page is held during
        read - but verify --- on review, a dealloc could happen if a page was touched and then released before read completion - but with the cache using a locked heap, this shouldn't be an issue
        
        sg_zero_fill - looks safe
        
        sg_list_head_remove
        
        sg_buf_release
      - zero_hole
pagecache write - pn->cache_write (pagecache_write_sg)
- status_handler invoked for saved write error or zero range
  - COMPLETION
- allocate_merge to pagecache_write_sg_finish
  - needs to come from safe heap - TODO
- initiate reads for any RMW pages
  - see touch_or_fill_page_nodelocked above
- prepare pages loop
  - allocate_page_nodelocked - see above
  - status_handler invoked on alloc fail - COMPLETION
    - merge should be discarded - minor, but BUG
  - realloc_pagelocked - see above
  - enqueue_page_completion_statelocked - see above
  - may immediately invoke pagecache_write_sg_finish if no I/O
    initiated with merge
pagecache_write_sg_finish - may be invoked either from
pagecache_write_sg (syscall or dirty page commit) or from merge
completion on the bhqueue
- page_lookup_nodelocked
  - rbtree_lookup for node->pages - protected by node lock
- read error / bound(complete) == true path
  - BUG: write_count decrement, state change to NEW and refcount
    release should only happen when bound(complete) is true, not
    on a read error...
  - pagecache_page_queue_completions_locked
    - these are actually only be invoked after blocking I/O
      completion, and thus are always run from the bh path
      - we have completions which should be run with the kernel
        lock held (i.e. syscall -> pagecache_{read,write}sg,
        pagecache_sync*) as well as intermediate completions
        which need to run on the bhqueue (pagecache_map_page
        (page fault), pagecache_write_sg_finish)
        
        so we can invoke the bh completions directly and, if
        any runqueue ones remain, queue a runqueue service
        routine to handle them - TODO
      - we have both syscall completions and internal kernel
        completions here (e.g. page fault,
        pagecache_write_sg_finish, pagecache_sync_*
    - queues pc->service_completions
      (pagecache_service_completions) to bhqueue
      - pagecache_service_completions (from bhqueue)
        
        status_handlers invoked from here - COMPLETION
        
        deallocate_vector - see
        enqueue_page_completion_statelocked above
  - refcount_release on page - releases reference held during
    write operation
    - could possibly lead to pagecache_page being freed if page
      was evicted before i/o operation finished
    - pagecache_page_free
      - change_page_state_locked: move to free list, safe
      - dealloc from contiguous, safe if contiguous heap is
    - closure_finish() - see allocate_merge note in
      pagecache_write_sg above
- read completion (status ok) / apply write path
  - allocate_sg_list - see above
    - applies error to completion on alloc fail - COMPLETION
  - sg_list_tail_add
    - see above
  - sg_copy_to_buf
    - sg_buf_release - could possibly cause dealloc upstream,
      though many cases will be a user buffer with null
      refcount - need to verify paths anyway - TODO
  - fs_write (filesystem_storage_write)
    - change to schedule write from runqueue (locked) - TODO
      - may do write finish as internal func, and skip merge if no
        rmw or read pending pages
        
        Merge completion now scheduled from runqueue. Simpler to leave merge in-place, and completion closure is used for write completion regardless of whether or not reads are outstanding.

completion path:

pagecache_read_page_complete
- called on storage read completion from bh path, or
  directly from filesystem_storage_read if no I/O (e.g. all
  extents uninited)
- ~~incomplete read error handling - TODO / file issue~~ - Errors are handled, but there is a console dump of the error with a TODO to make a facility for capturing and reporting (with rate limiting) device I/O errors
- change_page_state_locked - locked, no heap ops
- pagecache_page_queue_completions_locked - see above
- sg_list_release
  - release refcount held on pages while reading - similar to
    refcount_release for pagecount_write_sg_finish noted above
- deallocate_sg_list
  - add to sg free list, safe
- closure_finish - covered in request path above

As enumerated in #1302, there are many paths in the kernel where the kernel lock is not held and yet allocations or deallocations are being made from one of the kernel heaps. To address this issue, these changes introduce a "locked" kernel heap. Like general, this is an mcache heap, accessed through a locking wrapper heap which guards accesses with a spinlock. The "backed" heap is also accessed via a locking wrapper. Since id_heap-specific methods wouldn't be covered by a generic wrapper, a "locking" flag is specified on id_heap creation. The virtual huge and page heaps are now locking and thus safe to use from any context. Allocations made (and released) under protection of the kernel lock should continue to use the general heap, as this will avoid unnecessary spinlock operations.

wjhun · 2021-03-29T14:49:07Z

All issues in the checklist have been resolved; closing.

wjhun added memory smp labels Sep 30, 2020

wjhun self-assigned this Sep 30, 2020

wjhun changed the title ~~track safety of bhqueue processing paths~~ track safety of interrupt handler and bhqueue processing paths Sep 30, 2020

wjhun mentioned this issue Oct 2, 2020

more smp fixes #1304

Merged

wjhun mentioned this issue Nov 23, 2020

locking heap wrapper #1337

Merged

wjhun closed this as completed Mar 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

track safety of interrupt handler and bhqueue processing paths #1302

track safety of interrupt handler and bhqueue processing paths #1302

wjhun commented Sep 30, 2020 •

edited

Loading

wjhun commented Mar 29, 2021

track safety of interrupt handler and bhqueue processing paths #1302

track safety of interrupt handler and bhqueue processing paths #1302

Comments

wjhun commented Sep 30, 2020 • edited Loading

drivers

pagecache/tfs

wjhun commented Mar 29, 2021

wjhun commented Sep 30, 2020 •

edited

Loading