===========================================================================
NSDI '17 Review #295A
---------------------------------------------------------------------------
Paper #295: Lock-in-Pop: Securing Privileged Operating System Kernels by
Keeping on the Beaten Path
---------------------------------------------------------------------------
Overall merit: 2. Weak Reject (This paper doesn't
belong in the conference, but it
won't upset me.)
Reviewer expertise: 3. Knowledgeable
===== Paper summary =====
Lind is a new approach to containerization on Linux in which
applications only execute "popular" kernel code paths (the Lock-in-Pop
concept), which are less likely to be buggy. Lind uses Google Native
Client to forward system calls to a SafePOSIX library OS implemented
in restricted Python, which limits which host system calls can be
accessed. Results are presented for a limited set of applications,
with 3-6x performance overhead for any workload that is not purely
computational (i.e., primes).
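To make the forwarding architecture concrete, here is a minimal sketch
of how such a layer might look, assuming a Python dispatch table inside
the sandbox; all names here (POPULAR_HOST_CALLS, SAFE_POSIX_HANDLERS,
forward_syscall) are hypothetical and not taken from the paper:

    import os

    # Hypothetical allow-list of host calls assumed to lie on "popular"
    # kernel paths.
    POPULAR_HOST_CALLS = {"open", "read", "write", "close"}

    def safeposix_open(path, flags):
        # Emulated POSIX open(): perform any policy checks here, then
        # fall back on a host call that is on the popular list.
        return os.open(path, flags)

    # Map POSIX calls issued by the sandboxed program to their emulated
    # implementations (only open() shown).
    SAFE_POSIX_HANDLERS = {"open": safeposix_open}

    def forward_syscall(name, *args):
        # Invoked by the sandbox when untrusted code makes a POSIX call.
        handler = SAFE_POSIX_HANDLERS.get(name)
        if handler is None:
            raise PermissionError("call %r is not exposed" % name)
        return handler(*args)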
===== Strengths =====
The basic idea of limiting execution to only popular kernel code paths
is good. The analysis of code path popularity is interesting, clear,
and easy to understand.
===== Areas for improvement =====
Paper has little to do with networked systems. Overhead is
prohibitive. Evaluation very limited in terms of real applications,
so not clear how well the implementation works. Measure of bug
triggering based on code reachability not very realistic.
===== Comments for author =====
I like the basic idea of your approach as a solution for building more
reliable virtual systems, limiting execution to more popular code
paths. That being said, I have two main concerns in the context of an NSDI
submission.
First, this paper has little to do with networked systems. Granted,
virtualization and reliability are important for networked systems,
but your approach, evaluation, etc. are largely missing the networked
part. Grep, primes, cat, etc. are obviously not networked, and yes,
you have Apache in your workload, but that is about it. This really
belongs in something more like USENIX ATC.
Second, the evaluation is not very thorough. In terms of performance
measurements, the workloads are quite limited, and even for these
limited workloads, the performance is pretty bad: 3-6x worse than
native. This isn't for microbenchmarks, but for applications like
Apache, which makes this unusable in practice. Would replacing Repy
with a C implementation help? Python does not seem to be a very good
choice for this kind of work. Are the workloads limited because the
implementation isn't robust enough to run real workloads? Is it
because the premise behind your approach, limiting execution to
popular code paths, has implications for real applications in terms
of functionality?
The typical argument for taking a non-full-virtualization approach
(i.e., containers, etc.) is lower overhead, at the potential cost of
higher security risks / less isolation. But your overhead is worse
than that of full virtualization. So given the performance tradeoff,
it is not clear to me why you would use your approach instead of full
virtualization.
While code reachability is a useful approach for the preliminary bug
analysis you did, I would have liked to see a more rigorous measure
of vulnerabilities for the systems you evaluated, to see which
vulnerabilities can happen in practice. Nevertheless, the results you
did have were good, with Lind vulnerable to only 1 out of 35 tested
vulnerabilities, while other container technologies are exposed to 8
or 12 and baseline Linux is vulnerable to all 35. But the reduction
in reachable vulnerabilities is not a good tradeoff against the huge
performance hit.
"Functionality recreation" and "functionality re-creation" seem to be
inconsistently used. I assume the latter is better since the former
means functionality leisure activities ;-).
===========================================================================
NSDI '17 Review #295B
---------------------------------------------------------------------------
Paper #295: Lock-in-Pop: Securing Privileged Operating System Kernels by
Keeping on the Beaten Path
---------------------------------------------------------------------------
Overall merit: 3. Weak Accept (I can't complain about
this paper being accepted, but I'm
not enthusiastic.)
Reviewer expertise: 3. Knowledgeable
===== Paper summary =====
Lock-in-Pop improves the security of container virtual machines. The
paper argues that common system calls have fewer kernel
vulnerabilities than uncommon system calls, with support from
vulnerability reports. As a result, the paper proposes narrowing the
system call interface exposed to container VMs. Common system calls
are invoked directly, while uncommon calls are emulated in a sandboxed
environment whose implementations use only common syscalls.
Lock-in-Pop is a prototype system that implements this technique, and
the paper evaluates how Lock-in-Pop avoids vulnerabilities and what
its overhead is.
===== Strengths =====
+ syscall path/bug frequency analysis
+ prototype implementation of syscall emulation library
===== Areas for improvement =====
- solidify Lock-in-Pop's usage model
===== Comments for author =====
The argument for restricting the set of system calls directly
accessible to user-level programs is an old one (insert cliche of
Multics). But I haven't seen Lock-in-Pop's argument before for kernel
vulnerabilities: restrict to the most popular system calls because
they have been exercised the most and hence have the fewest bugs that
could lead to kernel vulnerabilities. The bug notification
evaluations support the argument, and the combination of sandbox and
syscall funnel/emulation layer viably implements the approach (albeit
under constraints, such as each application using a single process).
The paper spends a fair amount of time defending the scope of work
relative to other approaches, particularly compared to using hardware
virtualization systems, both at the start of the paper (Section 2) and
the end (Section 7). To me, this speaks to an issue that the paper
doesn't really come to grips with: who is the user group for a system
like Lock-in-Pop, and what is the usage scenario?
Are you expecting all users to eventually be running all of their
applications on Lock-in-Pop to better defend themselves from malware
exploiting kernel vulnerabilities? This goal strikes me as a long shot
given the overhead involved, the practical constraint of only
supporting applications that run in a single process, and the
isolation that each application runs in. All of this combined does
not seem to support a realistic contemporary computing environment.
Or are you targeting a usage scenario where users are experimenting
with an application and want extra protection while doing so? If so,
then Lock-in-Pop seems more realistic: you care much less about
overhead, and the isolation and additional protection are a plus. My
reaction is that getting a better grip on your usage scenario will
help with the scoping and overhead arguments.
Section 1: The intro starts by motivating using OSVMs like Docker or
LXC, but highlights bugs in hardware virtualization systems like
Virtual Box and VMware Workstation...and then spends substantial time
distancing itself from VB and VMW, going back to Docker and LXC in the
evaluation. If you want to stay removed from hardware virtualization
systems, you should leave them out of the discussion.
Section 2: What about adding yet another layer of defense and running
applications in Lind on Linux running in VMware Workstation on Windows?
Section 3.2: "Therefore, consistent with prior research [28], the rate
of defect occurrence per LOC follows a Poisson distribution [34]."
This seems like a bit of a leap. There has been much work exploring
bugs in OS code since [28] was published in 1989. Does newer work
support this assumption?
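For what it's worth, the assumption being questioned is, as I read it,
the standard one: if defects arise independently at a constant rate
\lambda per line of code, then the number of defects K in a component
of L lines is Poisson distributed,

    \Pr[K = k] = \frac{(\lambda L)^k e^{-\lambda L}}{k!},
    \qquad k = 0, 1, 2, \ldots

Whether a constant per-LOC defect rate is still a reasonable model for
modern kernel code is exactly what more recent empirical work should
speak to.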
Section 6.4: Why "legacy applications"? Is that a constraint imposed
by the systems you use, like Repy?
===========================================================================
NSDI '17 Review #295C
---------------------------------------------------------------------------
Paper #295: Lock-in-Pop: Securing Privileged Operating System Kernels by
Keeping on the Beaten Path
---------------------------------------------------------------------------
Overall merit: 3. Weak Accept (I can't complain about
this paper being accepted, but I'm
not enthusiastic.)
Reviewer expertise: 2. Some familiarity
===== Paper summary =====
The paper observes that the most commonly used syscalls statistically
have fewer security vulnerabilities; hence it suggests running
applications in a confined sandbox that only uses such syscalls.
The paper is well executed, going through the identification of the
popularity of code paths by tracing the activity of a set of users,
then finding how such paths are affected by 69 Linux kernel
vulnerabilities, and finally implementing and evaluating a prototype
of the design.
===== Strengths =====
Neat idea and well executed paper.
===== Areas for improvement =====
Scope? NSDI is perhaps one of the less appropriate venues for this type of work -- networking aspects are close to non-existent in this work.
===== Comments for author =====
Some suggestions for clarifications:
I am not sure that Lock-in-Pop could be considered a different category from "Functionality Re-Creation".
From the description in Sec. 4.2 and 4.3 they seem the same, except for the way the set of original kernel services is selected.
I am curious where you set the boundary between popular kernel paths and other paths. Is "popular" defined as "those used in the first experiment in Sec. 3.1"?
What is the relation between the kernel APIs in Table 1 (or whatever is needed by your sandbox) and your "popular" set?
Is it the case that your ideal choice of pop-APIs is constrained by whatever your sandbox requires to operate?
As an example, Sec. 5.2 seems to suggest that you are reimplementing the filesystem in the sandbox;
it is unclear what happens in terms of device drivers (many are perhaps in the "unreached" set in Fig. 1, but potentially accessible through the socket and device I/O Repy functions).
===========================================================================
NSDI '17 Review #295D
---------------------------------------------------------------------------
Paper #295: Lock-in-Pop: Securing Privileged Operating System Kernels by
Keeping on the Beaten Path
---------------------------------------------------------------------------
Overall merit: 1. Strong Reject (I'll argue against
this paper.)
Reviewer expertise: 3. Knowledgeable
===== Paper summary =====
Lock-in-pop reduces the risk of kernel compromise by running untrusted
code in a NACL sandbox and restricting system calls to those that
remain on popular paths in the kernel. The hypothesis is that popular
paths are likely to be better tested and have fewer bugs; the paper
presents evidence on bug distribution to support this hypothesis ---
2.5% of CVE bugs were found in the 12% of the code that was "popular"
while 97.5% of CVE bugs were found in the 88% of the code that was
identified as less popular.
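Making the implied comparison of densities explicit (my arithmetic
from the figures above): popular code contains

    \frac{2.5\%}{12\%} \approx 0.21

of the CVE bugs per unit of code, versus

    \frac{97.5\%}{88\%} \approx 1.11

for the rest, i.e., unpopular code has roughly 5x the CVE density of
popular code. This ratio is the quantitative core of the hypothesis.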
The last paragraph of the related work section asserts that
Lock-in-Pop differs from prior systems like Drawbridge because such
systems "do not have a sandbox environment to properly contain buggy
or malicious behavior while lind can offer this more secure
environment." Please clarify -- isn't this exactly what, for example,
Drawbridge has? (e.g., https://www.microsoft.com/en-us/research/project/drawbridge/:
"Drawbridge is a research prototype of a new form of virtualization
for application sandboxing. Drawbridge combines two core technologies:
First, a picoprocess, which is a process-based isolation container
with a minimal kernel API surface. Second, a library OS, which is a
version of Windows enlightened to run efficiently within a
picoprocess. ... The Drawbridge picoprocess is a lightweight, secure
isolation container. It is built from an OS process address space, but
with all traditional OS services removed. The application binary
interface (ABI) between code running in the picoprocess and the OS
follows the design patterns of hardware VMs; it consists of a closed
set of 45 downcalls with fixed semantics that provide a stateless
interface. All ABI calls are serviced by the security monitor, which
plays a role similar to the hypervisor or VM monitor in traditional
hardware VM designs."
===== Strengths =====
The popularity heuristic for identifying safe(r) paths in the kernel
seems potentially effective and useful.
===== Areas for improvement =====
* The performance overheads (1x to 6x) are often substantial and
likely limit how widely these techniques could be adopted
* Comparison of the mechanisms here to prior work should be
clarified. At the end of the day, the mechanisms here don't seem
particularly interesting or novel -- we just need a mechanism to
restrict the set of system calls applications can make, and the
system uses what appear to be fairly standard techniques (SFI,
system call interposition, libOS) to meet those requirements.
* In my view, most of the potential contribution here is around the
idea of limiting execution to popular kernel paths (and handling
other kernel functionality in a libOS). However, the exposition and
evaluation on that front could be improved.
* Identification of popular paths: The methodology is to use
logging of paths touched by a small sample set of users. It is not
clear how general the results are. As we look at more users and more
diverse applications, how much of the kernel becomes "popular"?
* Mapping from path to allowed system calls and arguments. The
paper is not clear on how the monitor restricts execution to the
identified popular paths. I'm guessing you somehow generate filters
on allowed system calls. Do you allow any system call with any
argument made in your trace? Do you generalize those arguments (so
that different parameters can be passed in)? If so, do you restrict
the parameters?
* Mapping from trace to allowed system calls and
arguments. It is not clear how the authors convert from the trace of
popular paths to the list of allowed system calls in Table 1. It
appears that the list in Table 1 is not the complete list of system
calls observed in the trace (no fork/exec? no ioctl?). If the actual
approach was to use heuristics to reject apparently "unsafe" rather
than "unpopular" calls, then the results are much less convincing.
* Supported workloads. What workloads can and cannot be
supported with a library OS over the set of system calls in
Table 1? (I can see that eliminating fork/exec makes your
system safer, but it would seem to reduce the set of workloads
you can run.)
===== Comments for author =====
I don't understand how the measurements of popular paths get
translated into rules for what requests your sandbox kernel will
admit. For example, you traced several days of activity, but Section
5.2 explains how you simulate directories... so were there no system
calls to access directories in your traces? Or were they below some
threshold of frequency (if so, what was that threshold, and what
other used-but-not-popular-enough calls were omitted)? Do you allow
any arguments to a set of system calls, or do you restrict them? If
the former, to what extent do executions actually stay on "popular"
paths? (If the latter, same question.)
Or is the real conclusion that we should use a minimal set of system
calls (rather than a popular set)?
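To make the question concrete, here is the kind of policy object I am
imagining the trace would have to be compiled into; all of the names
and the policy contents below are hypothetical, not taken from the
paper:

    # Hypothetical policy derived from a popular-path trace. Each entry
    # allows one syscall; the value is an optional argument check.
    ALLOWED_SYSCALLS = {
        "read":  None,                # any arguments accepted
        "write": None,
        # example of an argument restriction, purely illustrative:
        "open":  lambda path, flags: not path.startswith("/dev/"),
        # no fork/exec, no ioctl -- mirroring what Table 1 appears to omit
    }

    def admit(call, *args):
        # Should the sandbox forward this call to the host kernel?
        if call not in ALLOWED_SYSCALLS:
            return False              # unpopular call: refuse or emulate
        check = ALLOWED_SYSCALLS[call]
        return True if check is None else check(*args)

The questions above amount to: what is in this table, how was it
derived from the trace, and how much of the argument space does it
actually constrain?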
Per https://www.microsoft.com/en-us/research/project/drawbridge/,
Drawbridge allows 45 system calls. Looking at Table 1, you allow a
similar number. Compare the lists of calls. Which are similar and
which are different, and how does that relate to the "popularity"
metric (and to security)?