Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] rump hypcall integration for LKL #255

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

thehajime
Copy link
Member

@thehajime thehajime commented Oct 28, 2016

(since this is RFC, I'd be happy to have any comments and suggestions from yours)

This PR provides an initial implementation of rump kernel hypercalls (*1) which brings us a couple of interesting features to LKL.

The highlight of new features are:

  • additional host environments, NetBSD, qemu-arm, qemu-x86_64, xen other than linux/freebsd/windows which current LKL supports.
    • execuse: I haven't tested that much, only tested with qemu-arm, qemu-x86_64 on Linux.
  • non hijack-based existing application support with a dedicated libc bind to LKL using musl libc
    • rumprun/frankenlibc unikernel integration
  • new thread primitive based on green thread (a.k.a. userspace thread, related to official usermode thread support in LKL #250 ?)

The addition of rump hypercalls does not interfere the current implementation of LKL, but with additional external repositories (*2), you will run LKLed applications with rump hypercall.

Non-hijack application support requires to build an (existing) application by dedicated build toolchain (cross compilation as rumprun does), but it will solve most of hijack limitations such as conflicted symbols/namespaces, non exposed symbols (e.g., replace hidden symbols in glibc). With this toolchain, I confirmed nginx, netperf, and ghc (a haskell compiler) somehow work fine, though those need more work to be more than just a hello world.

The green thread implementation is obviously useful to avoid the expensive task of context switches. I saw some of netperf benchmarks outperform pthread based one (will re-measure with netperf in my spare time).
(rump hypercall has pthread-based thread implementation but I didn't test it in this PR).

I'm going to upstream the external repositories (*2) as well as this PR so, this PR will be a base of all of external dependencies.

I think merging non-master branch of LKL and synchronize periodically with master branch is also reasonable since the changes (of this PR) are a bit large.

*1 rump kernel
http://book.rumpkernel.org

*2 additional (external) repositories


This change is Reviewable

@@ -29,6 +29,11 @@ void calibrate_delay(void)
{
}

void read_persistent_clock(struct timespec *ts)
Copy link

@Rondom Rondom Oct 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to make this public i.e. in addition add some helper function in user space using settimeofday/adjtimex that syncs the host time with the lkl time. One possible use case is syncing the clock with the host in case of drift in a long-running application or a simple jump of the host time due to manual change. I know everyone can write this quickly themselves, but it would be a nice convenience function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the comment @Rondom.

It might make sense to make this public i.e. in addition add some helper function in user space using settimeofday/adjtimex that syncs the host time with the lkl time.

I don't understand this comment; what do you mean by making a function public ? btw, 66513c9 exposes all kernel functions by putting a prefix to avoid symbol collision.
(see #17 (comment) more detail)

though you might be aware of this, this additional function overrides a function of kernel (below), which each architecture can define its own way to access to a clock.

void __weak read_persistent_clock(struct timespec *ts) in kernel/time/timekeeping.c.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware (and not really awake while writing that answer). Thanks for the explanation.

@@ -32,8 +32,8 @@ CONFIG_EXT4_FS_SECURITY=y
CONFIG_XFS_FS=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_ISO9660_FS=y
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
#CONFIG_BTRFS_FS=y
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for disabling btrfs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is because this LKL kernel stops during make test, though I need to investigate more what's the root cause.

https://circleci.com/gh/libos-nuse/frankenlibc/136

I was suspecting some of busy loops may cause the stuck.

@tavip
Copy link
Member

tavip commented Nov 1, 2016

Awesome ! This looks good at a first glace except two minor issues.

One is the BTRFS issue. Do you see the stuck issue when running w/ rump or with regular Linux backend? If it is under rump/qemu you can probably start gdb and see where it is looping. On the other hand I think that we perhaps should look into adding proper IRQ support, since we have at least one new target that supports it.

There is another concertn I have with regard to the lkl__sync ops, there might be slower then the __sync primitives and that could cause some performance regressions.

@thehajime
Copy link
Member Author

One is the BTRFS issue. Do you see the stuck issue when running w/ rump or with regular Linux backend? If it is under rump/qemu you can probably start gdb and see where it is looping. On the other hand I think that we perhaps should look into adding proper IRQ support, since we have at least one new target that supports it.

https://gist.github.com/thehajime/f8af72bf625e90d397c2584a285d97d3

it stopped at do_xor_speed(), jiffies is not updated at all during the loop (I guess we fixed this before).
I will look at if the non-preemptive scheduler (with green thread implementation) works badly.

There is another concertn I have with regard to the lkl__sync ops, there might be slower then the __sync primitives and that could cause some performance regressions.

in the current version, I can't find proper atomic ops (at least can't use __sync prefixed atomic ops) on arm-none-eabi target so I needed to wrap the ops. x86_64 target should work as before without any performance regressions.

In any case, since atomic ops seems to not have a portable interface, I thought it's a good idea having wrapper functions (or adding host_ops entries).

Copy link
Member

@liuyuan10 liuyuan10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LKL_HIJACK_DEBUG doesn't work anymore for me. Can you also rebase to master so that I can check the performance of it?

@@ -22,6 +22,7 @@ struct thread_info {
lkl_thread_t tid;
struct task_struct *prev_sched;
unsigned long stackend;
void *rump_client; /* for syscall proxy */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addr of stackend should be the end of thread info?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though I couldn't find how stackend is used, I agree with you by changing the place of new member.

@@ -102,7 +110,6 @@ int lkl_get_virtio_blkdev(int disk_id, uint32_t *pdevid)

ret = snprintf(sysfs_path + sysfs_path_len, sizeof(sysfs_path) - sysfs_path_len, "/%s/block",
virtio_name);
free(virtio_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to do so? seems like d_name is pointing a memory that will be freed in lkl_closedir

Copy link
Member Author

@thehajime thehajime Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be fixed now with replacing strdup by lkl_ops.mem_alloc + memcpy.

@thehajime thehajime force-pushed the rump-hypcall-upstream branch 2 times, most recently from a40346e to 78522ae Compare November 2, 2016 05:38
@thehajime
Copy link
Member Author

The LKL_HIJACK_DEBUG doesn't work anymore for me.

should be fixed now. thanks !

Can you also rebase to master so that I can check the performance of it?

I have a trouble with a rebased version on qemu-x86_64 (on linux and rump/frankenlibc are working fine). It seems like the direct syscall patch (e1cd839) doesn't like me - revert the commit with a hello world program works fine.

(#257 didn't help me btw)

you can try a measurement with hijack library - this should work fine now.

@liuyuan10
Copy link
Member

Yes. The performance looks neutral. The direct syscall requires support of long jmp. Maybe you want to check if it works as expected on qemu?

@thehajime
Copy link
Member Author

The direct syscall requires support of long jmp. Maybe you want to check if it works as expected on qemu?

setjmp/longjmp seems to be working on qemu. incomplete pthread implementation outside of LKL might be an issue with the direct syscall implementation (e1cd839). so #267 didn't help either.

I will split some of commits which is independent from rump integration, while investigating the above issue separately.

@thehajime thehajime force-pushed the rump-hypcall-upstream branch 2 times, most recently from 51eca75 to 76abb0b Compare November 25, 2016 09:03
@tavip
Copy link
Member

tavip commented Dec 15, 2016

Can one of the admins verify this patch?

This will allow us to reimplpement atomic ops with lock/touch/unlock way
if underlying hosts don't support a particular atomic ops.

Signed-off-by: Hajime Tazaki <[email protected]>
This only handles signals right after a system call was invoked.

Signed-off-by: Hajime Tazaki <[email protected]>
This allows us to define IRQ related operation outside of the kernel by
host_ops implementation.

Signed-off-by: Hajime Tazaki <[email protected]>
This host_ops entry will provide a way to pass any variables (e.g.,
environmental variables) to an LKL application from the host
environment.

Signed-off-by: Hajime Tazaki <[email protected]>
This commit adds an implementation of new host_ops by using rump
hypercall (wrapped as an LKL host_ops) provided by rumpkernel project.
The rump hypercall implementation is specifically integrated with the
one provided by frankenlibc project, which is an alternative way to use
rump hypercalls.

There are other rump hypercall implementations such as rumprun, NetBSD
kernel tree, but this commit only focuses on the frankenlibc
integration.

The new features covered by this commits ranges in various ways, but
here is some highlights.

- New thread primitives
  frankenlibc implements user-space thread implementation based on
  makecontext(3)&co with non-preemptive (cooperative) scheduler
  implementation.
- LKL specific standard library bind
  frankenlibc offers cross-compiler build chains to build an LKL
  application bound to LKL-versioned standard library (i.e., libc). Our
  musl libc implementation is provided by different repository.
  So, any application can be built with this toolchain to use LKL
  instead of host libc.
- New host environments support
  LKLed application now has potentials to run on the platform which rump
  hypercall implementations support.  frankenlibc supports Liunx,
  FreeBSD, NetBSD, and qemu-arm.  Others supposrts run an LKL
  applications on hypervisors (kvm, xen, etc)

The frankenlibc code is located at the following repository, but it will
be upstreamed to the original one (justincormack/frankenlibc).

https://github.com/libos-nuse/frankenlibc

Signed-off-by: Hajime Tazaki <[email protected]>
qemu-arm support with frankenlibc & lkl is limited at this moment: no
network (virtio-net) is not supported yet, etc.

Signed-off-by: Hajime Tazaki <[email protected]>
The code is not built as-is but put an implementation using rump syscall
proxy which makes an LKL application communicate with outside of the
process.

Signed-off-by: Hajime Tazaki <[email protected]>
This commit offers an integration with rumprun unikernel.  It is only
tested with qemu-system-x86_64 but it potentially works fine with x86
baremetal machines as well as xen hypervior.

An initial PCI device support is also implemented by relying on exposed
virtual devices from hyperviros: virtio-net device is tested for the
moment.

The rumprun repository to support lkl build is not upstreamed yet - you
can try with the following repository before it will be in.

https://github.com/libos-nuse/rumprun

Signed-off-by: Hajime Tazaki <[email protected]>
Signed-off-by: Hajime Tazaki <[email protected]>
@thehajime thehajime force-pushed the rump-hypcall-upstream branch from 7d95a54 to f82b294 Compare February 13, 2017 00:11
This reverts commit 2c507e4 ("device core: Remove deprecated
create_singlethread_workqueue").

This is a temporary fix for qemu-arm hangs on the following command that
schedule_work() triggers (?) dead-lock that nanosleep on hello->main()
never returns.

$ qemu-system-arm -M versatilepb -m 512M -nographic  -serial null \
-semihosting -kernel rumpobj/tests/hello

This should be fixed in a transparent way in a future.

Signed-off-by: Hajime Tazaki <[email protected]>
@copumpkin
Copy link

This looks cool. What's the status on it?

@thehajime
Copy link
Member Author

@copumpkin I've constantly updating this branch. 2 PRs were also submitted to rumpkernel but still no immediate upstream yet.

thehajime pushed a commit to libos-nuse/lkl-linux that referenced this pull request Nov 8, 2017
commit 355627f upstream.

Commit 7c05126 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm().  For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.

Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().

With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e866 ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function.  For example:

    $ gcc reproducer.c -o reproducer -lpthread
    $ nm reproducer | grep fork_thread
    0000000000400719 t fork_thread
    $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
    $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
    $ ./reproducer

Here is the use-after-free reported by KASAN:

    BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
    Read of size 8 at addr ffff8800320a8b88 by task reproducer/198

    CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 lkl#255
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
     dump_stack+0xdb/0x185
     print_address_description+0x7e/0x290
     kasan_report+0x23b/0x350
     __asan_report_load8_noabort+0x19/0x20
     uprobe_clear_state+0x1c4/0x200
     mmput+0xd6/0x360
     do_exit+0x740/0x1670
     do_group_exit+0x13f/0x380
     get_signal+0x597/0x17d0
     do_signal+0x99/0x1df0
     exit_to_usermode_loop+0x166/0x1e0
     syscall_return_slowpath+0x258/0x2c0
     entry_SYSCALL_64_fastpath+0xbc/0xbe

    ...

    Allocated by task 199:
     save_stack_trace+0x1b/0x20
     kasan_kmalloc+0xfc/0x180
     kmem_cache_alloc_trace+0xf3/0x330
     __create_xol_area+0x10f/0x780
     uprobe_notify_resume+0x1674/0x2210
     exit_to_usermode_loop+0x150/0x1e0
     prepare_exit_to_usermode+0x14b/0x180
     retint_user+0x8/0x20

    Freed by task 199:
     save_stack_trace+0x1b/0x20
     kasan_slab_free+0xa8/0x1a0
     kfree+0xba/0x210
     uprobe_clear_state+0x151/0x200
     mmput+0xd6/0x360
     copy_process.part.8+0x605f/0x65d0
     _do_fork+0x1a5/0xbd0
     SyS_clone+0x19/0x20
     do_syscall_64+0x22f/0x660
     return_from_SYSCALL_64+0x0/0x7a

Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 7c05126 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
octaviansoldea pushed a commit to octaviansoldea/lkl-linux that referenced this pull request Nov 10, 2017
Commit 7c05126 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm().  For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.

Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().

With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e866 ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function.  For example:

    $ gcc reproducer.c -o reproducer -lpthread
    $ nm reproducer | grep fork_thread
    0000000000400719 t fork_thread
    $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
    $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
    $ ./reproducer

Here is the use-after-free reported by KASAN:

    BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
    Read of size 8 at addr ffff8800320a8b88 by task reproducer/198

    CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 lkl#255
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
     dump_stack+0xdb/0x185
     print_address_description+0x7e/0x290
     kasan_report+0x23b/0x350
     __asan_report_load8_noabort+0x19/0x20
     uprobe_clear_state+0x1c4/0x200
     mmput+0xd6/0x360
     do_exit+0x740/0x1670
     do_group_exit+0x13f/0x380
     get_signal+0x597/0x17d0
     do_signal+0x99/0x1df0
     exit_to_usermode_loop+0x166/0x1e0
     syscall_return_slowpath+0x258/0x2c0
     entry_SYSCALL_64_fastpath+0xbc/0xbe

    ...

    Allocated by task 199:
     save_stack_trace+0x1b/0x20
     kasan_kmalloc+0xfc/0x180
     kmem_cache_alloc_trace+0xf3/0x330
     __create_xol_area+0x10f/0x780
     uprobe_notify_resume+0x1674/0x2210
     exit_to_usermode_loop+0x150/0x1e0
     prepare_exit_to_usermode+0x14b/0x180
     retint_user+0x8/0x20

    Freed by task 199:
     save_stack_trace+0x1b/0x20
     kasan_slab_free+0xa8/0x1a0
     kfree+0xba/0x210
     uprobe_clear_state+0x151/0x200
     mmput+0xd6/0x360
     copy_process.part.8+0x605f/0x65d0
     _do_fork+0x1a5/0xbd0
     SyS_clone+0x19/0x20
     do_syscall_64+0x22f/0x660
     return_from_SYSCALL_64+0x0/0x7a

Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 7c05126 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>    [4.7+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
@laijs
Copy link

laijs commented Nov 15, 2017

I'm curious how the signals be delivered to the rump-client, cooperated lkl app or hijacked app?

@thehajime
Copy link
Member Author

I'm curious how the signals be delivered to the rump-client, cooperated lkl app or hijacked app?

Did you mean the rump-client as syscall proxy from external processes ?
If yes, then signals are not delivered to the client: only input/output of syscalls are proxyed/delivered to the client.

If not, would you care to elaborate ?

@laijs
Copy link

laijs commented Nov 17, 2017

Sure, I meant the rump-client as syscall proxy from external processes.
Thank you and I understood the signals are not delivered to the client.
But why the signals need to be handled in the kernel? since it can't be delivered.

@thehajime
Copy link
Member Author

But why the signals need to be handled in the kernel? since it can't be delivered.

this rump hypcall integration has several modes of execution: external process via syscall proxy is one of them. musl libc integration with no dependency to host kernel is the other one.

I am specifically trying to support alarm(3) used in netperf with this signal related patch.
Unlike hijack library mode as of current LKL, alarm(3) calls go to an LKL instance, wait for a signal (SIGALRM) but no delivery before this patch. (In the hijack mode, it is delivered by the host since alarm(3) goes to the host)

So, we need to have some sort of do_signal() and calling it at a proper point. it seems that calling this in run_syscall() is not the best place: any suggestions are welcome.

Is that clear to you ? Let me know if not.

@gparmer
Copy link

gparmer commented Apr 9, 2018

Is there a blocker on this? We have a rump hypcall-based environment, and would like to use Linux, but I've been waiting to put time into lkl until this was pulled in. Should we stop waiting?


if (write) {
if (size == 1) {
#ifdef __x86_64__
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this means that x86-32 is not supported?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LKL does not support x86-32, yet #341

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a messy part of lots of ifdefs and a result of what i personally tried. this can be expanded a platform which is currently not supported (e.g., x86-32).


static void *rump_tls_get(struct lkl_tls_key *key)
{
return rumpuser_curlwp();
Copy link

@gparmer gparmer Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't get the TLS variable given the key, correct? Knowing rumpkernel's handling of TLS (where gets are implicit on __thread variables), it seems like there might be a mismatch between the LKL and rumpkernel models here. I like explicit calls to get TLS based on a key in LKL much more.

I suppose the assumption here is that LKL only uses TLS to store the currently active thread?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you can see, this function drops the key argument the result only indicate a TLS val, not localized by a key.

I suppose the assumption here is that LKL only uses TLS to store the currently active thread?

Not really, just a limitation of in this RFC stage.

.mem_free = rump_mem_free,
.ioremap = lkl_ioremap,
.iomem_access = lkl_iomem_access,
.jmp_buf_set = jmp_buf_set,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the LKL way of doing this. Perhaps this isn't the place to ask this, but why not include a more abstract function that simply switches to another thread. Could be implemented with setjmp/longjmp, or with OS primitives below this layer.

Copy link
Member Author

@thehajime thehajime Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(asking these question here should be fine with me)

I'm not sure if I understand your question. maybe the commit 79131f2 is relevant to this question ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gparmer I have an old branch that removes most of this low level ops and switches to higher level opts like alloc/switch/free contexts, see: https://github.com/tavip/linux/commits/the-expanse, in particular tavip@e182fe0.

The reason I did not merge this is that we have optimizations to reduce the system call latency (see e1cd839) which I could not map well with the new scheme.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for the information and references! @thehajime @tavip

I think that the patches @tavip referenced for host context switching do what I was contemplating. I didn't foresee the complications with the system call path (nor do I fully understand the interaction now).

From my perspective (i.e. wanting to not use lkl as a library, but as a normal resource manager and abstraction provider, while using host abstractions such as threads and memory management), some of this seemed a little overly complex. However, given the goals of the system, this is all making quite a bit more sense. It seems really quite well engineered.

I don't mean to distract lkl at all, but I wonder if a number of simplifying assumptions would simplify the design. For example, if I could make the following set of assumptions:

  • all host threads that will make "system calls" into the library are created using the thread creation APIs of lkl (with the exception of the initial thread -- that executes main),
  • all interrupts are executed as host threads that the lkl creates and controls (thus execute from the Linux perspective more like the interrupt threads in the real-time patches), and
  • the context switch API focused on switching host threads.

Could it enable "direct" system calls (without switching stacks/contexts)? Would it allow preemptive execution? SMP?

Again, sorry for distracting from the main lkl goals and thrust. I certainly understand that lkl will not adopt these assumptions, given its goals. However, I'd very much value your opinion if these assumptions could simplify some of the interactions with the host system.

return ret;
}

struct lkl_dev_net_ops rumpfd_ops = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LKL way of doing things is to export the raw hardware I/O to the underlying/hosting system. In the BMK rumpkernel implementation, you actually use the devices within the RK for I/O. Given that you've implemented the PCI rumpuser interface, are LKL device drivers supported?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this particular piece of code (rumpfd_ops) is used by https://github.com/libos-nuse/frankenlibc, which is a variant of rumprun unikernel. and you're right: with rumprun bmk LKL can use drivers/net/virtio-net.c or drivers/block/virtio-blk.c. I have tried e1000 driver of Linux with qemu e1000 emulation over LKL. Though recently rumprun LKL is not well maintained (I hope I do as early as possible).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm familiar with rumprun (we have our own hacked up version with a backend for our own OS). I was not sure that you'd be able to run native drivers (e.g. e1000) since I didn't see the virtual to physical rumpuser function (which is used in bmk, for example to enable devices to understand how to interact with DMA).

Great to hear that devices are at least possible, if not supported!

I was unaware of the mainlining initiative so I certainly understand why this patchset is currently unmaintained.

@thehajime
Copy link
Member Author

Is there a blocker on this? We have a rump hypcall-based environment, and would like to use Linux, but I've been waiting to put time into lkl until this was pulled in. Should we stop waiting?

@gparmer thanks for the comment (and sorry for the late response). My personal opinion on this PR is still not changed: I would go for upstream basic LKL part to linus tree first, then add more extensions like this PR after that.

But for the mean time, it would be of course great to receive any comments/suggestions/patches :)

(ref: #304)

rodionov pushed a commit to rodionov/lkl that referenced this pull request Jan 2, 2025
Add a new test case which performs double query of the bpf_mprog through
libbpf API, but also via raw bpf(2) syscall. This is testing to gather
first the count and then in a subsequent probe the full information with
the program array without clearing passed structs in between.

  # ./vmtest.sh -- ./test_progs -t tc_opts
  [...]
  ./test_progs -t tc_opts
  [    1.398818] tsc: Refined TSC clocksource calibration: 3407.999 MHz
  [    1.400263] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd336761, max_idle_ns: 440795243819 ns
  [    1.402734] clocksource: Switched to clocksource tsc
  [    1.426639] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.428112] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  lkl#252     tc_opts_after:OK
  lkl#253     tc_opts_append:OK
  lkl#254     tc_opts_basic:OK
  lkl#255     tc_opts_before:OK
  lkl#256     tc_opts_chain_classic:OK
  lkl#257     tc_opts_chain_mixed:OK
  lkl#258     tc_opts_delete_empty:OK
  lkl#259     tc_opts_demixed:OK
  lkl#260     tc_opts_detach:OK
  lkl#261     tc_opts_detach_after:OK
  lkl#262     tc_opts_detach_before:OK
  lkl#263     tc_opts_dev_cleanup:OK
  lkl#264     tc_opts_invalid:OK
  lkl#265     tc_opts_max:OK
  lkl#266     tc_opts_mixed:OK
  lkl#267     tc_opts_prepend:OK
  lkl#268     tc_opts_query:OK            <--- (new test)
  lkl#269     tc_opts_replace:OK
  lkl#270     tc_opts_revision:OK
  Summary: 19/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
rodionov pushed a commit to rodionov/lkl that referenced this pull request Jan 2, 2025
Add a new test case to query on an empty bpf_mprog and pass the revision
directly into expected_revision for attachment to assert that this does
succeed.

  ./test_progs -t tc_opts
  [    1.406778] tsc: Refined TSC clocksource calibration: 3407.990 MHz
  [    1.408863] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcaf6eb0, max_idle_ns: 440795321766 ns
  [    1.412419] clocksource: Switched to clocksource tsc
  [    1.428671] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.430260] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  lkl#252     tc_opts_after:OK
  lkl#253     tc_opts_append:OK
  lkl#254     tc_opts_basic:OK
  lkl#255     tc_opts_before:OK
  lkl#256     tc_opts_chain_classic:OK
  lkl#257     tc_opts_chain_mixed:OK
  lkl#258     tc_opts_delete_empty:OK
  lkl#259     tc_opts_demixed:OK
  lkl#260     tc_opts_detach:OK
  lkl#261     tc_opts_detach_after:OK
  lkl#262     tc_opts_detach_before:OK
  lkl#263     tc_opts_dev_cleanup:OK
  lkl#264     tc_opts_invalid:OK
  lkl#265     tc_opts_max:OK
  lkl#266     tc_opts_mixed:OK
  lkl#267     tc_opts_prepend:OK
  lkl#268     tc_opts_query:OK
  lkl#269     tc_opts_query_attach:OK     <--- (new test)
  lkl#270     tc_opts_replace:OK
  lkl#271     tc_opts_revision:OK
  Summary: 20/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants