Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

track safety of interrupt handler and bhqueue processing paths #1302

Closed
38 tasks done
wjhun opened this issue Sep 30, 2020 · 1 comment
Closed
38 tasks done

track safety of interrupt handler and bhqueue processing paths #1302

wjhun opened this issue Sep 30, 2020 · 1 comment
Assignees

Comments

@wjhun
Copy link
Contributor

wjhun commented Sep 30, 2020

These notes were made in an attempt to audit all execution paths that stem from interrupt handlers through enqueues to bhqueue and subsequent bhqueue service, as well as the request paths originating under the kernel lock that might possibly contend with these paths that run outside of the lock.

Most issues found here concern a lack of safety with heap allocation, primarily from heap (de)allocations in bh service as well as deallocation of objects upstream as a result of closure_finish() or a refcount release. There are a handful of allocations at interrupt level as well. The proposed solution here is to add spinlocks to lower-level heaps (physical - which is already protected by a lock in page.c, virtual_huge, virtual_page, and backed) and add a lock-protected mcache for memory that might be (de)allocated in bh service (or in interrupt handlers, though I would personally prefer to see these removed altogether if possible). The 'general' mcache will remain unlocked for use under the kernel lock only (for which we can add debug assertions that the kernel lock is in fact held during (de)allocs). The places that require use of a locked heap are indicated by the TODO items below (as well as some that otherwise require additional locking, or might just need verification).

BUG items indicate general bugs that were found during the audit, and COMPLETION items highlight the areas where a status handler completion is invoked (no action required for these - added for reference).

While there are many TODO items listed here, most would be cleared out by the use of locked heaps.

Note that completions up into the syscall level are not covered here; I think it will be possible to invoke these from the runqueue with a minor change to the per-page completion handling. Ideally, we can isolate all these safety issues to the pagecache level and lower.

drivers

  • ata-pci: apci->service (ata_pci_service)

    • ata_pci_service_reqs()
      • ata_pci_service_req()
        • debugs msgs could alloc, so maybe rprintf & friends should
          use locked heap (not performance sensitive anyway)
          actually they use stack allocations
        • ata_io_cmd_dma()
          • ata_set_lba()
            • kernel_delay
              • no allocs, but calls now() and kern_pause - must be
                safe outside kern lock - TODO
        • may call timm() on error - so errheap in runtime_init.c
          should use locked heap - TODO
        • ata_io_cmd() (for pio fallback)
          • also may call timm()
          • ata_wait()
            • ata pio and kernel_delay, covered above
          • status handler invoked from here - COMPLETION
        • status handler invoked on error or request completion - COMPLETION
      • deallocates ata_pci_req
        • dev->h needs to be locked heap - TODO
  • storvsc: vmbus_chan_open

    • hv_storvsc_on_channel_callback()
      • vmbus_chan_recv - optional debug output, otherwise looks safe
        • vmbus_rxbr_peek - no heap activity, uses spinlock
        • vmbus_rxbr_read - same
      • hv_storvsc_on_iocompletion()
        • storvsc_io_done
          • storvsc_cam_strvis - string func, no heap ops
      • hcb->completion applied here
        • storvsc_scsi_io_done
          • timm(), covered above
          • status handler called from here - COMPLETION
          • closure_finish - r->completion in storvsc_io() must come
            from locked heap - TODO
      • storvsc_free_request - list insert under lock, safe
      • storvsc_process_hcb_queue
        • storvsc_action_io
          • create_storvsc_request - fills reqp, don't see any heap
            activity
          • hv_storvsc_io_request
            • vmbus_chan_send_prplist
              • vmbus_txbr_write - no heap ops found, looks safe
              • vmbus_chan_signal_tx - atomic sets and hypercall - safe
            • vmbus_chan_send - similar to above, looks safe
  • bhqueue_enqueue_irqsafe

    • kernel_demand_pf_complete

    • the lock-free queue structure allows multiple writers, however
      it is not safe for writing to a queue from both the interrupt
      handler and kernel mode

      • so all enqueues to the bhqueue from the kernel should
        disable interrupts (ok, ints are disabled already, but we
        should still protect these cases in case one day we want to
        handle interrupts during kernel processing) - maybe change
        this to a more generic "enqueue_irqsafe" - TODO
  • virtqueue

    • virtqueue_service_vqmsgs
    • vq_interrupt
      • BUG: vq->service should be enqueued to vq->sched_queue, not
        bhqueue
    • vq->service: virtqueue_service_vqmsgs
      • BUG: dequeue from service_queue needs to be irqsafe, make a
        "dequeue_irqsafe"
        Actually it doesn't - there's no contention with a dequeue in interrupt level, and the enqueue in interrupt level shouldn't affect the dequeue.
      • m->completion applied here - COMPLETION
      • deallocate_vqmsg
        • allocate_vqmsg must allocate vqmsg and m->descv buffer
          from locked heap - TODO
  • virtio_scsi

    • virtio_scsi
      • virtio_scsi_enqueue_event:
      • virtio_scsi_enqueue_request
        • allocation needs to come from locked heap - TODO
        • virtio_scsi_request_complete
          • applies vsr_complete completion - COMPLETION
            • note that contiguous (backed) heap in proposed arrangement
              would be a locked heap
      • virtio_scsi_io
        • virtio_scsi_alloc_request
          • virtio_scsi_request allocation comes from
            s->v->virtio_dev.contiguous, should be locked in new
            arrangement
        • virtio_scsi_io_done allocation needs to come from locked
          heap - TODO
      • virtio_scsi_io_done
        • calls timm, addressed above
        • status_handler completion invoked here - COMPLETION
        • closure_finish, see virtio_scsi_io above
    • s->command - unused? (not a bug - no need to queue requests there right now)
    • s->eventq - completion re-enqueues itself - unused? see above
    • s->requestq - virtio_scsi_enqueue_request above
  • virtio_storage

    • storage_rw_internal
      • allocate_virtio_blk_req
        • allocate from contiguous, should be safe
      • allocate_vqmsg - covered above
      • must allocate complete closure from locked queue - TODO
    • complete
      • timm() - covered above
      • status_handler completion invoked here - COMPLETION
      • deallocate_virtio_blk_req - contiguous heap
  • pvscsi

    • pvscsi_attach
      • allocation of dev->hcb_objcache comes from contiguous, and
        meta from dev->general - but meta is only used on objcache
        alloc, so should be safe
    • pvscsi_rx_service_bh
      • change to dequeue_irqsafe - TODO - as with virtqueue, dequeue doesn't contend with another...
      • apply hcb->completion - COMPLETION
      • pvscsi_hcb_dealloc
        • dealloc from contiguous - safe
        • dealloc from dev->hcb_objcache - should be safe, see above
    • pvscsi_action_io (should take the lock for
      list_push_back - fixed in pvscsi: take queue_lock before enqueueing to hcb_queue #1442)
      • pvscsi_execute_ccb - looks safe
  • xen.c: xen_grant_init

    • gt->entry_heap may have deallocations from xenblk_service_ring
      (interrupt level), need to use irq-safe heap or defer - TODO / BUG
      • check if xennet has violations too - TODO
  • xenblk

    • xenblk_probe
      • xbd->h is used in xenblk_service_ring (int level), must be
        irq-safe locked heap - TODO
      • xbd->rreqs also allocated from general, should be int-safe
        heap - TODO
    • xenblk_event_handler
      • xenblk_service_ring
        • xenblk_get_rreq
          • allocates from xbd->h, see above
          • vector_push to xbd->rreqs, see above
        • xen_grant_page_access - uses entry_heap - see
          xen_grant_init above
      • xenblk_service_pending - called from int handler and from io ops
        • xen_revoke_page_access
          • deallocation from xen_info.gtab.entry_heap - see
            xen_grant_init above
          • timm(), so errheap needs to be irqsafe too
      • xen_notify_evtchn - hypercall, should be safe
    • xenblk_io - called within kern lock but may contend with bh
      completions
      • xenblk_get_req
        • alloc from xbd->h, see above TODO
      • xenblk_service_pending - see above
    • xenblk_bh_service
      • status_handler invoked from here - COMPLETION

(check net drivers too - anything on bhqueue?)

pagecache/tfs

request path:

  • pagecache read

    • entry points into touch_or_fill_page_nodelocked
      • pagecache_read_sg
      • pagecache_write_sg
        • touch_or_fill_page_by_num_nodelocked
      • pagecache_node_fetch_pages (readahead)
      • pagecache_map_page
      • pagecache_map_page_if_filled
        • touch_or_fill_page_by_num_nodelocked
    • pagecache_read_sg
      • allocate_merge - safe heap - TODO
      • rbtree traversal - but should always happen with kernel lock
        held - verify - TODO
        - nope, we should be covered by node lock
      • allocate_page_nodelocked
        • contiguous alloc, safe
        • pagecache_page alloc, needs safe heap - TODO
        • rbtree_insert_node
          • safe from here, but rbtree may need to live on safe heap
            if alterations are made from bh path - verify - TODO
      • realloc_pagelocked
        • allocates page from pc->contiguous; should be safe
        • is this redundant? should be covered in
          touch_or_fill_page_nodelocked... maybe BUG
          looks ok on review - the reserve here is for alloc, not touch
      • sg_list_tail_add
        • covered below
      • status_handler invoked on alloc failure or no blocking I/O
        • COMPLETION
    • touch_or_fill_page_nodelocked
      • enqueue_page_completion_statelocked
        • needs to allocate vector from safe heap - TODO
      • realloc_pagelocked
        • see above
      • allocate_sg_list
        • allocates from internal sg_heap - needs to be safe - TODO
      • sg_list_tail_add
        • buffer_extend, should be safe if sg_heap is safe
      • pagecache_read_page_complete closure
        • needs to be allocated from safe heap - TODO
      • fs_read (tfs: filesystem_storage_read)
        • allocate_merge - needs to come from safe heap as merge
          completion will happen in bh path - TODO
        • rangemap_range_lookup_with_gaps
          • read_extent
            • filesystem_storage_op - looks safe, verify buf release
              • block device io read
              • sg_list_head_remove - looks safe
              • sg_buf_release - shouldn't cause dealloc upstream,
                especially if reference to page is held during
                read - but verify --- on review, a dealloc could happen if a page was touched and then released before read completion - but with the cache using a locked heap, this shouldn't be an issue
            • sg_zero_fill - looks safe
              • sg_list_head_remove
              • sg_buf_release
          • zero_hole
  • pagecache write - pn->cache_write (pagecache_write_sg)

    • status_handler invoked for saved write error or zero range
      • COMPLETION
    • allocate_merge to pagecache_write_sg_finish
      • needs to come from safe heap - TODO
    • initiate reads for any RMW pages
      • see touch_or_fill_page_nodelocked above
    • prepare pages loop
      • allocate_page_nodelocked - see above
      • status_handler invoked on alloc fail - COMPLETION
        • merge should be discarded - minor, but BUG
      • realloc_pagelocked - see above
      • enqueue_page_completion_statelocked - see above
      • may immediately invoke pagecache_write_sg_finish if no I/O
        initiated with merge
  • pagecache_write_sg_finish - may be invoked either from
    pagecache_write_sg (syscall or dirty page commit) or from merge
    completion on the bhqueue

    • page_lookup_nodelocked
      • rbtree_lookup for node->pages - protected by node lock
    • read error / bound(complete) == true path
      • BUG: write_count decrement, state change to NEW and refcount
        release should only happen when bound(complete) is true, not
        on a read error...
      • pagecache_page_queue_completions_locked
        • these are actually only be invoked after blocking I/O
          completion, and thus are always run from the bh path
          • we have completions which should be run with the kernel
            lock held (i.e. syscall -> pagecache_{read,write}sg,
            pagecache_sync
            *) as well as intermediate completions
            which need to run on the bhqueue (pagecache_map_page
            (page fault), pagecache_write_sg_finish)
            • so we can invoke the bh completions directly and, if
              any runqueue ones remain, queue a runqueue service
              routine to handle them - TODO
          • we have both syscall completions and internal kernel
            completions here (e.g. page fault,
            pagecache_write_sg_finish, pagecache_sync_*
        • queues pc->service_completions
          (pagecache_service_completions) to bhqueue
          • pagecache_service_completions (from bhqueue)
            • status_handlers invoked from here - COMPLETION
            • deallocate_vector - see
              enqueue_page_completion_statelocked above
      • refcount_release on page - releases reference held during
        write operation
        • could possibly lead to pagecache_page being freed if page
          was evicted before i/o operation finished
        • pagecache_page_free
          • change_page_state_locked: move to free list, safe
          • dealloc from contiguous, safe if contiguous heap is
        • closure_finish() - see allocate_merge note in
          pagecache_write_sg above
    • read completion (status ok) / apply write path
      • allocate_sg_list - see above
        • applies error to completion on alloc fail - COMPLETION
      • sg_list_tail_add
        • see above
      • sg_copy_to_buf
        • sg_buf_release - could possibly cause dealloc upstream,
          though many cases will be a user buffer with null
          refcount - need to verify paths anyway - TODO
      • fs_write (filesystem_storage_write)
        • change to schedule write from runqueue (locked) - TODO
          • may do write finish as internal func, and skip merge if no
            rmw or read pending pages
            • Merge completion now scheduled from runqueue. Simpler to leave merge in-place, and completion closure is used for write completion regardless of whether or not reads are outstanding.

completion path:

  • pagecache_read_page_complete
    • called on storage read completion from bh path, or
      directly from filesystem_storage_read if no I/O (e.g. all
      extents uninited)
    • incomplete read error handling - TODO / file issue - Errors are handled, but there is a console dump of the error with a TODO to make a facility for capturing and reporting (with rate limiting) device I/O errors
    • change_page_state_locked - locked, no heap ops
    • pagecache_page_queue_completions_locked - see above
    • sg_list_release
      • release refcount held on pages while reading - similar to
        refcount_release for pagecount_write_sg_finish noted above
    • deallocate_sg_list
      • add to sg free list, safe
    • closure_finish - covered in request path above
@wjhun wjhun self-assigned this Sep 30, 2020
@wjhun wjhun changed the title track safety of bhqueue processing paths track safety of interrupt handler and bhqueue processing paths Sep 30, 2020
@wjhun wjhun mentioned this issue Oct 2, 2020
wjhun added a commit that referenced this issue Nov 25, 2020
As enumerated in #1302, there are many paths in the kernel where the kernel lock is not held and yet allocations or deallocations are being made from one of the kernel heaps. To address this issue, these changes introduce a "locked" kernel heap. Like general, this is an mcache heap, accessed through a locking wrapper heap which guards accesses with a spinlock. The "backed" heap is also accessed via a locking wrapper.

Since id_heap-specific methods wouldn't be covered by a generic wrapper, a "locking" flag is specified on id_heap creation. The virtual huge and page heaps are now locking and thus safe to use from any context.

Allocations made (and released) under protection of the kernel lock should continue to use the general heap, as this will avoid unnecessary spinlock operations.
@wjhun
Copy link
Contributor Author

wjhun commented Mar 29, 2021

All issues in the checklist have been resolved; closing.

@wjhun wjhun closed this as completed Mar 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant