Open exchange on fw2tar #757
-
Hi there, happy to talk a little about fw2tar. I'm wrapping up a PhD focused on firmware rehosting and dynamic analysis of firmware in general. Given a Linux-based firmware image, I want to get it up and running under emulation with PANDA.re. My expertise (and the bulk of my research) is focused on runtime analysis and modification of an emulated guest. But a critical input to this is an accurate root filesystem for a given firmware image. Previously I've used binwalk/firmadyne's extractor, but now it seems like unblob is the best tool around. There are 3 goals I have with fw2tar that don't seem to overlap with unblob, which is why I threw those utility scripts into a stand-alone repo instead of opening up PRs:
Root filesystem detection: This isn't anything too fancy - my scripts search for some standard Linux directories and files to identify potential root filesystems. For each, we create a tar archive and try to exclude any recursive extraction artifacts. After prototyping a few approaches around multiple extraction passes with restricted depth, I ended up just excluding directories with

Maintain permissions: If we want to boot a system using an extracted filesystem, it causes all sorts of problems if the permissions are modified. One of my collaborators, @off-by-1-error, opened an issue when we noticed no unblob-produced files were executable. That issue got fixed (thanks), but I recently noticed the

Support HPC environments: My research is focused on large-scale analyses of firmware - I'm working with thousands of firmware images and running them on a supercomputer where I don't have root access. I'm able to use singularity (basically a very limited docker alternative) so I can install software and run experiments at scale.

After I use fakeroot to run my modified unblob to preserve permissions and build an archive of a root filesystem, I then feed the filesystem into PANDA.re with a custom kernel and emulate the target.

I don't have any particular asks for the unblob team - I'm grateful you all have tackled the hard problem of filesystem extraction and helped with the issues we've opened! If you have any interest in supporting any of the use cases I mentioned above, I'd certainly prefer your implementations over mine. Any analyses you'd like to build for identifying Linux root filesystems, or changes that add the ability to produce tar archives with correct permissions, would be awesome. But I'm not sure if those would be broadly useful to other users.
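For readers who want something concrete, here is a minimal sketch of the "tar archive with correct permissions, minus extraction artifacts" step, using only Python's standard `tarfile` module. It is not the fw2tar implementation. The function name `archive_rootfs` and the `is_artifact` predicate are hypothetical; the rule for what counts as an artifact is left to the caller because it is not spelled out above.

```python
import tarfile
from pathlib import Path
from typing import Callable, Optional


def archive_rootfs(root: Path, out_tar: Path,
                   is_artifact: Callable[[str], bool]) -> None:
    """Write `root` into a gzip-compressed tar, keeping modes as seen on disk.

    `is_artifact` is a caller-supplied (hypothetical) predicate that receives
    the archive-relative member name and returns True for entries to skip.
    `out_tar` should live outside `root` so the archive does not include itself.
    """

    def keep(info: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
        # Returning None drops the member. For a directory, tarfile then
        # does not descend into it, so the whole subtree is skipped.
        if is_artifact(info.name):
            return None
        return info

    with tarfile.open(out_tar, "w:gz") as tar:
        # tarfile records mode, uid, gid and mtime from lstat(); run under
        # fakeroot (or as root) if ownership needs to be meaningful.
        tar.add(root, arcname=".", filter=keep)
```

Usage would look like `archive_rootfs(Path("extracted/rootfs"), Path("rootfs.tar.gz"), my_predicate)`, where `my_predicate` encodes whatever naming rule identifies extraction artifacts in a given pipeline.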
I do think I found an unblob bug around symlink handling and the

Let me flip the question around and ask if there's anything I can do to help you - I'm running both

I'm currently re-running my extraction at scale with #755 + #756 applied on my fork - I'll report back on how well those work and open new issues if I run into any other errors at runtime. But if there's anything else you'd like me to check at scale, let me know!

I like the direction you're exploring with permissions as metadata. If that was working, I think we'd be able to just run unblob (with no permissions), consume the metadata, and then build the filesystem archive from there.
-
I love projects that have a good scope and stick to it. It makes sense that you wouldn't be interested in those use cases.

I can collect some performance data and share it next time I run on the full corpus. I'd guess the median time for unblob is ~2s vs ~10s for binwalk. In one extreme case unblob extracted a filesystem in 10s that binwalk took 300s to do.

These symlink bugs might be above my pay grade. I opened a PR with some fixes in #763, but I don't love what I built. In testing I found the symlinks utility (with patches to support rewriting links relative to a directory) wasn't working well - dangling symlinks weren't updated and would then point outside the extraction directory. I certainly won't be offended if you want to throw away all my code and build your own fix - a more unified interface + comprehensive tests for it would probably be a better design. If you're going to go in that direction, let me know and I can share some unit tests and minimal inputs I created while hacking on that PR.

As for the comparison between binwalk and unblob, I'm looking within identified root directories and comparing the files that are present. I don't have a list of input blobs that unblob didn't extract, just files that are/aren't present in the output. With all the changes on my branch (#755, #756 + an additional fix for directories, and #763) unblob is looking quite good - there are a bunch of files only present in the unblob extractions (15,527 in my last run) and the files produced by binwalk that don't map directly to a file in the unblob extraction are either:
Without #755, #756 + my additional fix for it, a few extractors were sometimes failing and many files were missing when that happened. Without #763 a large number of valid symlinks were missing.

I have a few non-public firmware corpora, but unfortunately none of them are mine so I can't redistribute them. I'm currently testing with the corpus from Greenhouse. But when I find failures I'm usually able to find the firmware online somewhere and share links to individual files.
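As a rough illustration of the comparison described above (looking inside two identified root directories and checking which files are present in each), a sketch along these lines would do the set arithmetic. The function names are hypothetical and this is not the script used for the numbers in this thread.

```python
import os
from pathlib import Path
from typing import Set, Tuple


def relative_files(root: Path) -> Set[str]:
    """Collect non-directory entries under `root` as root-relative POSIX paths.

    os.walk does not follow directory symlinks by default; file symlinks and
    dangling symlinks show up in `filenames` and are therefore included.
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add((Path(dirpath) / name).relative_to(root).as_posix())
    return found


def compare_roots(a: Path, b: Path) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (only in a, only in b, in both) for two extracted root dirs."""
    files_a, files_b = relative_files(a), relative_files(b)
    return files_a - files_b, files_b - files_a, files_a & files_b
```

Calling `compare_roots(unblob_root, binwalk_root)` yields the three sets; the "only in" sets are the ones worth inspecting by hand.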
-
Just wanted to say thanks for the close review of my PR and support for all the issues I'm opening. And sorry again for the confusion in that first big PR I opened; hopefully the smaller PR and issues with PoCs will be more useful for y'all.
-
I know you all have some different goals with extracted file permissions, so I wanted to ask about this before I bother cleaning up code and opening PRs: if I'm expanding extractors and the filesystem class to better preserve permissions, would you be interested in getting any of those changes into unblob? Of course the permission bits you're adding would change the final permissions, but if other bits (e.g.,

For example, rehosting@2e4f43a expands Filesystem.write_chunks to take a

Unrelated to that, I also wanted to share some results from large-scale comparisons with Binwalk. In my analysis I'm running both a (slightly forked) Binwalk and Unblob with my changes*, looking for directories that seem to be the root of a Linux filesystem (e.g., checking for standard directories and at least a few executable files), then selecting the largest good-looking directory found by each extractor. With this approach, I get:
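As an aside on the method rather than the results, the selection heuristic described in this comment (standard directories, at least a few executables, then the largest candidate) could be sketched as below. The directory list, the thresholds, and the use of total regular-file bytes as the meaning of "largest" are all assumptions made for illustration, as are the function names.

```python
import os
import stat
from pathlib import Path
from typing import Iterator, Optional

# Assumed list of "standard" directories; the real scripts may check others.
STANDARD_DIRS = ("bin", "etc", "lib", "sbin", "usr", "var")


def _regular_files(path: Path) -> Iterator[os.stat_result]:
    """Yield lstat() results for regular files under `path`."""
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # unreadable entry: skip it rather than abort
            if stat.S_ISREG(st.st_mode):
                yield st


def looks_like_rootfs(path: Path, min_dirs: int = 4, min_execs: int = 3) -> bool:
    """Heuristic: enough standard directories and a few executable files."""
    present = sum(1 for d in STANDARD_DIRS if (path / d).is_dir())
    if present < min_dirs:
        return False
    execs = sum(1 for st in _regular_files(path) if st.st_mode & 0o111)
    return execs >= min_execs


def tree_size(path: Path) -> int:
    """Total bytes of regular files under `path` (assumed size metric)."""
    return sum(st.st_size for st in _regular_files(path))


def pick_largest_root(extract_dir: Path) -> Optional[Path]:
    """Return the largest directory that passes the heuristic, or None."""
    best, best_size = None, -1
    for dirpath, _dirs, _files in os.walk(extract_dir):
        candidate = Path(dirpath)
        if looks_like_rootfs(candidate):
            size = tree_size(candidate)
            if size > best_size:
                best, best_size = candidate, size
    return best
```

Note that the executable check depends on the extractor preserving mode bits, which is one reason the permission handling discussed in this thread matters for this kind of comparison.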
-
I think both of these changes make sense. What's your take, @e3krisztian? We expose the mode for
Can you expand a bit? Which format-specific extractors are you using?
Do you know the kind of filesystems making up these 95 entries? I would suppose it's custom squashfs that our sasquatch fork can't cover, but I would be happy to know the exact details.
Is it because they're encrypted, or are you observing some filesystems that are not supported by the general public version of unblob?
-
We received two excellent bug reports from @AndrewFasano, who's working on https://github.com/AndrewFasano/fw2tar which is, according to the README:
They maintain a fork with a few changes applied to unblob to support their permission preservation: main...AndrewFasano:unblob:main
So, these are a few things I would like to address in this thread:

fw2tar: is it part of academic research? What are your plans for it? Do you have a list of filesystems you plan on supporting?

To give everyone a bit of background: our approach - which has not been implemented yet - would be for format handlers / extractors to yield metadata about the files they extract (ownership, permissions, timestamps) and have them saved in the unblob report so it can be used by external tools relying on unblob (either by re-applying these permissions / ownerships, or simply showing a view of it).
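To make the "metadata in the report, consumed by an external tool" idea a little more tangible, here is a sketch of what the consuming side might look like. Since the approach is described as not yet implemented, everything about the format is an assumption: the JSON layout (a list of records with `path`, `mode`, `uid`, `gid`, `mtime`), the `FileMeta` type, and the function names are hypothetical and not unblob's report schema. The sketch builds a tar archive from an extracted tree and overrides the recorded attributes, so it needs neither root nor fakeroot.

```python
import json
import tarfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass
class FileMeta:
    """Hypothetical per-file record; not unblob's actual report schema."""
    mode: int
    uid: int
    gid: int
    mtime: int


def load_metadata(report_path: Path) -> Dict[str, FileMeta]:
    """Read an assumed JSON list of {path, mode, uid, gid, mtime} records."""
    records = json.loads(report_path.read_text())
    return {r["path"]: FileMeta(r["mode"], r["uid"], r["gid"], r["mtime"])
            for r in records}


def build_archive(root: Path, meta: Dict[str, FileMeta], out_tar: Path) -> None:
    """Tar up `root`, replacing on-disk attributes with the recorded ones."""

    def restore(info: tarfile.TarInfo) -> tarfile.TarInfo:
        m = meta.get(info.name)  # member names are root-relative, e.g. "bin/sh"
        if m is not None:
            info.mode, info.uid, info.gid, info.mtime = m.mode, m.uid, m.gid, m.mtime
            # Clear symbolic names so consumers fall back to the numeric ids.
            info.uname = info.gname = ""
        return info

    with tarfile.open(out_tar, "w") as tar:
        for child in sorted(root.iterdir()):
            tar.add(child, arcname=child.name, filter=restore)
```

The design choice here is to apply the metadata while writing the archive rather than to the files on disk, which keeps the extraction directory readable by an unprivileged user.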
We can't rely on the fact that all our users run unblob under fakeroot, so we must adapt permissions as we recurse through extracted content; otherwise we would lose visibility into files and directories that have strict permissions or wrong ownership, and would end up raising OSErrors all the time.

Some references: