Added JSON-Formatted Output #67

Open · wants to merge 6 commits into master
Conversation

PoorBillionaire (Contributor) commented Jun 25, 2016

Greets--

Short: This pull request adds functionality to the samples/amcache.py script to output results as JSON-formatted strings with the -j command-line switch.

Long:
amcache.py would be a great tool to use when sampling an environment for Amcache.hve artifacts at scale. I have considered various ways to consume and log the data it parses: shipping it to Elasticsearch or Logstash (via syslog/filebeat), or interpreting it with other Python scripts.

Of the options I have considered, JSON-formatted output would be extremely valuable - for example, in Logstash the syslog and beats plugins contain codec functionality for JSON-formatted events. Similarly, other Python scripts could benefit from JSON-formatted strings by loading them as dictionary objects to access and act upon the various key/value combinations.

Sample output:

dev@computer:~/python-registry$ python samples/amcache.py Amcache.hve -j

{"pe_checksum": 161438, "modified_timestamp": "2013-08-22 13:25:38.066004", "sha1": "0000f783a29297a42e86e7f2ef17d91737eb5add732d", "pe_sizeofimage": 143360, "first_run": "2014-11-21 09:54:02.063293", "language": 1033, "linker_timestamp": "2013-08-22 02:47:53", "company": "Microsoft Corporation", "switchbackcontext": 72057594037929472, "product": "Microsoft® Windows® Operating System", "header_hash": "0101c6ba8c455c8332340bc4a73f782f15b427336eff", "modified_timestamp2": "2013-08-22 13:25:38.053610", "version": "6.3.9600.16384 (winblue_rtm.130821-1623)", "id": "-", "file_description": "Dism Host Servicing Process", "created_timestamp": "2014-11-21 09:54:01.969542", "path": "C:\\Users\\Administrator\\AppData\\Local\\Temp\\9D1571B1-DEC2-4D4D-8166-81378BC8398F\\DismHost.exe", "version_number": "6.3.9600.16384", "size": 140392}

Note: I am highly inadequate when it comes to handling unicode strings, but I found a combination of methods that seemed to do the job. I would appreciate any further testing to ensure it works consistently with the rest of your project.

Thanks!

Adam
Tw: @_TrapLoop

import sys
import logging
import datetime
from collections import namedtuple
import json
williballenthin (Owner):

style nit pick: i like to order my imports by line length (with group of stdlib, pip-installable, and project-local modules).
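applied to the block above, that ordering might look something like this (all five imports are stdlib, so they form a single group; the reordering is just a sketch of the suggestion):

import sys
import json
import logging
import datetime
from collections import namedtuple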

williballenthin (Owner) commented Jun 25, 2016

these changes are great! thanks for taking the time to make enhancements, and then explain your reasoning. i provided some comments on the code, which are meant to be totally respectful and constructive. let me know if you have any questions or issues, and i'm happy to discuss.

@matthewdunwoody is currently working on an amcache.py invoker tool to make it easy to process large zip files of hives, and distribute the processing across many cores. i'll encourage him to make it available online for your enjoyment.

    print(json.dumps(document, ensure_ascii=False).encode("utf-8"))
else:
    w = unicodecsv.writer(sys.stdout, delimiter="|", quotechar="\"",
                          quoting=unicodecsv.QUOTE_MINIMAL, encoding="utf-8")
    w.writerow(map(lambda e: e.name, FIELDS))
    for e in ee:
        print(e)
        exit(type(e.path))
PoorBillionaire (Contributor, Author):

...I don't even remember doing this.

PoorBillionaire (Contributor, Author):

Thanks for the great feedback, it was very helpful. Let me know if those syntax and style changes work for you.

I need to do a bit of reading before I can say more about the JSON vs. jsonl formatting - I'll update this thread later tonight.

PoorBillionaire (Contributor, Author) commented Jun 29, 2016

Apologies for the delayed response.

I've thought a bit about the formatting: one large JSON document per hive file, or jsonl with each line representing a parsed entry in a given hive file. As you noted, Willi, the pros of jsonl involve incremental processing and streaming. There is also a nice simplicity to the jsonl approach, given the larger volume of small, self-contained entries parsed from the Amcache hive.
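To illustrate, jsonl output would simply be one self-contained object per line, along these lines (all values are placeholders, not real output):

{"path": "<path1>", "sha1": "<hash1>", "first_run": "<timestamp1>"}
{"path": "<path2>", "sha1": "<hash2>", "first_run": "<timestamp2>"}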

In the case of one JSON document, the benefit is that one could easily shovel the output of a single document to another script - though I feel like there is a decent amount of added complexity to the object. Each key in the document would need to be unique. That key could be a processed value of some kind - perhaps I could parse the base file name from the path attribute to be the key, with the value containing the other attributes from the FIELDS object:

{
    "<baseFileName1>": {
        "path": "<path>", "hash": "<hash>", "timestamp": "<timestamp>"
    },
    "<baseFileName2>": {
        "path": "<path>", "hash": "<hash>", "timestamp": "<timestamp>"
    }
}

...unless you had a different structure in mind.

Sheepishly, I suppose I'm saying "it depends". For me, it would be more natural to use jsonl output, which I could ship to Elasticsearch directly and easily without worrying much about parsing - or use an indexer to perform additional processing if needed. I am certainly open to other opinions; I am probably way too comfortable in my current way of thinking.

PoorBillionaire (Contributor, Author) commented Jun 29, 2016

Additionally, I realized today that regardless of the format used, I have been ignoring the fact that (in my use case, at least) I will need the subject hostname or IP address available at run time: if an analyst collects at scale, potentially thousands of Amcache.hve files are produced, and none of the analysis after that matters unless we can tie a given amcache entry to the host it was collected from.

In that case, perhaps the do_json piece would be wrapped in a function which takes a host identifier as an optional parameter. If the host parameter is provided, that value would be inserted into the JSON document; the value itself could be supplied at the command line. Thoughts?
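Something like this, perhaps (a rough sketch only: do_json and FIELDS come from this pull request, while the --host plumbing, the dict construction, and the default=str handling of datetime values are assumptions on my part):

import json

def do_json(ee, host=None):
    # emit one JSON document per parsed amcache entry (jsonl style);
    # `ee` is the iterable of parsed entries, assuming each entry exposes
    # one attribute per name in the module-level FIELDS list of amcache.py;
    # `host` is an optional identifier, e.g. from a --host switch (assumed name)
    for e in ee:
        document = {f.name: getattr(e, f.name) for f in FIELDS}
        if host is not None:
            document["host"] = host
        # default=str stringifies datetime values (assumption about input types)
        print(json.dumps(document, ensure_ascii=False, default=str))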
