This repository has been archived by the owner on Mar 23, 2021. It is now read-only.
In my collection, I have a number of Thunderbird email folders. Thunderbird stores an entire folder in one large mbox file. Apparently, snoop (or Tika?) detects it as generic plain text and extracts all of its content as a single document, without processing the individual emails it contains.
This is bad not only because the individual emails are not extracted, but also because indexing these large text files (well over 1 GB) causes Elasticsearch to crash (see this issue: liquidinvestigations/hoover-search#31).
I am not sure how to handle this. For now, I am thinking of converting the Thunderbird mbox files to pst or eml before giving them to hoover. Or is there a way to detect and handle mbox files? What would you suggest?
Hi @grenwi, you're right, snoop does not detect mbox files. They are a bit tricky: `file` detects them as plain text, so we'd need some custom way of detecting them. Your best bet is to convert them to individual eml files; perhaps this script will help you.
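For illustration, here is a minimal sketch of both ideas using only Python's standard `mailbox` module. It is not the script mentioned above and not part of snoop. The function names `looks_like_mbox` and `split_mbox` and the numbered output file names are made up for this example. The detection is a rough heuristic that only checks for the leading `From ` separator line, so an ordinary text file that happens to start with "From " would also match.

```python
import mailbox
from pathlib import Path


def looks_like_mbox(path):
    """Rough heuristic (an assumption, not snoop's logic):
    an mbox file begins with a 'From ' separator line."""
    with open(path, "rb") as f:
        return f.read(5) == b"From "


def split_mbox(mbox_path, out_dir):
    """Write every message of an mbox file to its own .eml file.

    Returns the number of messages written. Messages are read one
    at a time, so the whole mbox is never held in memory.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # create=False: raise instead of creating an empty mailbox
    # when the path does not exist.
    box = mailbox.mbox(str(mbox_path), create=False)
    count = 0
    try:
        for key in box.iterkeys():
            # get_bytes() returns the raw message without the
            # leading 'From ' separator line.
            raw = box.get_bytes(key)
            (out / f"{count:06d}.eml").write_bytes(raw)
            count += 1
    finally:
        box.close()
    return count
```

Usage would be along the lines of `split_mbox("Inbox", "Inbox.eml.d")` for each Thunderbird folder file (both paths are placeholders), after which the resulting directory of eml files can be given to hoover in place of the original mbox.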
By the way, I suggest taking a look at snoop2. It's a rewrite of the snoop module, with the processing pipeline modeled as a graph of dependent tasks, so you can re-run the relevant parts when the code changes.
How are mbox files handled by snoop?