Commit dd90be5

author
Florian Roth
committed
Updated README
Former-commit-id: 6249402 Former-commit-id: 579e936
1 parent 5645a37 commit dd90be5

1 file changed

+59
-92
lines changed

README.md

+59-92
@@ -9,88 +9,52 @@
99

1010
Yara Rule Generator
1111
by Florian Roth
12-
February 2017
13-
Version 0.17.0
12+
August 2017
13+
Version 0.18.0
1414

1515
### What does yarGen do?
1616

1717
yarGen is a generator for [YARA](https://github.com/plusvic/yara/) rules
1818

19-
The main principle is the creation of yara rules from strings found in malware
20-
files while removing all strings that also appear in goodware files. Therefore
21-
yarGen includes a big goodware strings and opcode database as ZIP archives that
22-
have to be extracted before the first use.
19+
The main principle is the creation of yara rules from strings found in malware files while removing all strings that also appear in goodware files. Therefore yarGen includes a big goodware strings and opcode database as ZIP archives that have to be extracted before the first use.
2320

24-
Since version 0.12.0 yarGen does not completely remove the goodware strings from
25-
the analysis process but includes them with a very low score depending on the
26-
number of occurences in goodware samples. The rules will be included if no
21+
Since version 0.12.0 yarGen does not completely remove the goodware strings from the analysis process but includes them with a very low score depending on the number of occurrences in goodware samples. The rules will be included if no
2722
better strings can be found and marked with a comment /* Goodware rule */.
28-
Force yarGen to remvoe all goodware strings with --excludegood. Also
29-
since version 0.12.0 yarGen allows to place the "strings.xml" from
30-
[PEstudio](https://winitor.com/) in the program directory in order to apply the
31-
blacklist definition during the string analysis process. You'll get better
32-
results.
33-
34-
Since version 0.14.0 it uses naive-bayes-classifier by Mustafa Atik and Nejdet
35-
Yucesoy in order to classify the string and detect useful words instead of
36-
compression/encryption garbage.
37-
38-
Since version 0.15.0 yarGen supports opcode elements extracted from the
39-
.text sections of PE files. During database creation it splits the .text
40-
sections with the regex [\x00]{3,} and takes the first 16 bytes of each part
41-
to build an opcode database from goodware PE files. During rule creation on
42-
sample files it compares the goodware opcodes with the opcodes extracted from
43-
the malware samples and removes all opcodes that also appear in the goodware
44-
database. (there is no further magic in it yet - no XOR loop detection etc.)
45-
The option to activate opcode integration is '--opcodes'.
46-
47-
Since version 0.16.0 yarGen supports the Binarly. Binarly is a "binary search
48-
engine" that can search arbitrary byte patterns through the contents of tens
49-
of millions of samples, instantly. It allows you to quickly get answers to
50-
questions like “What other files contain this code/string?” or “Can this
51-
code/string be found in clean applications or malware samples?”. This means
52-
that you can use Binarly to quickly verify the quality of your YARA strings.
53-
Furthermore, Binarly has a YARA file search functionality, which you can
54-
use to scan their entire collection (currently at 7.5+ Million PE files, 3.5M
55-
clean - over 6TB) with your rule in a less than a minute.
56-
For yarGen I integrated their [public API](https://github.com/binarlyhq/binarly-sdk).
57-
In order to be able to use it you just need an API key that you can get for
58-
free if you contact them at [email protected]. The option to activate binarly
59-
lookups is '--binarly'.
60-
61-
Since version 0.17.0 yarGen allows creating multiple databases for
62-
opcodes and strings. You can now easily create a new database by using
63-
"-c" and an identifier "-i identifier" e.g. "office". It will then create two new
64-
database files named "good-strings-office.db" and "good-opcodes-office.db"
65-
that will be initialized during startup with the built-in databases.
66-
67-
The rule generation process also tries to identify similarities between the
68-
files that get analyzed and then combines the strings to so called "super rules".
69-
Up to now the super rule generation does not remove the simple rule for the
70-
files that have been combined in a single super rule. This means that there
71-
is some redundancy when super rules are created. You can supress a simple rule
72-
for a file that was already covered by super rule by using --nosimple.
23+
Force yarGen to remove all goodware strings with --excludegood. Also since version 0.12.0 yarGen allows you to place the "strings.xml" from [PEstudio](https://winitor.com/) in the program directory in order to apply the blacklist definition during the string analysis process. You'll get better results.
24+
25+
Since version 0.14.0 it uses naive-bayes-classifier by Mustafa Atik and Nejdet Yucesoy in order to classify the string and detect useful words instead of compression/encryption garbage.
26+
27+
Since version 0.15.0 yarGen supports opcode elements extracted from the `.text` sections of PE files. During database creation it splits the `.text` sections with the regex [\x00]{3,} and takes the first 16 bytes of each part
28+
to build an opcode database from goodware PE files. During rule creation on sample files it compares the goodware opcodes with the opcodes extracted from the malware samples and removes all opcodes that also appear in the goodware
29+
database. (there is no further magic in it yet - no XOR loop detection etc.) The option to activate opcode integration is '--opcodes'.
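The opcode handling described above can be sketched in a few lines of Python. This is only an illustration of the split-and-truncate idea, not yarGen's actual code: the function names `extract_opcode_candidates` and `filter_goodware` are hypothetical, the input is assumed to be the raw bytes of a `.text` section, and keeping parts shorter than 16 bytes is an assumption as well.

```python
import re

# Runs of three or more NUL bytes separate the parts (regex quoted in the text above).
NULL_RUN = re.compile(rb"[\x00]{3,}")


def extract_opcode_candidates(text_section, length=16):
    """Split the section on NUL runs and keep the first `length` bytes of each part.

    Hypothetical helper: returns the candidates as hex strings in a set.
    """
    candidates = set()
    for part in NULL_RUN.split(text_section):
        if part:  # skip empty parts at the start or end of the section
            candidates.add(part[:length].hex())
    return candidates


def filter_goodware(sample_opcodes, goodware_opcodes):
    """Drop every sample opcode that also appears in the goodware set."""
    return sample_opcodes - goodware_opcodes
```

A goodware opcode set built with `extract_opcode_candidates` from clean PE files would then be passed as the second argument of `filter_goodware` during rule creation.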
30+
31+
Since version 0.17.0 yarGen allows creating multiple databases for opcodes and strings. You can now easily create a new database by using "-c" and an identifier "-i identifier" e.g. "office". It will then create two new
32+
database files named "good-strings-office.db" and "good-opcodes-office.db" that will be initialized during startup with the built-in databases.
33+
34+
Since version 0.18.0 yarGen supports extra conditions that make use of the `pe` module. This includes [imphash](https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html) values and the PE file's exports. We provide pre-generated imphash and export databases.
35+
36+
The rule generation process also tries to identify similarities between the files that get analyzed and then combines the strings into so-called **super rules**. The super rule generation does not remove the simple rules for the files that have been combined in a single super rule, so some redundancy remains when super rules are created. You can suppress the simple rule for a file that is already covered by a super rule by using --nosimple.
7337

7438
### Installation
7539

76-
1. Make sure you have at least 4GB of RAM on the machine you plan to use yarGen (6GB if opcodes are included in rule generation, use with --opcodes)
40+
1. Make sure you have at least 4GB of RAM on the machine you plan to use yarGen (8GB if opcodes are included in rule generation, use with --opcodes)
7741
2. Download the latest release from the "release" section
7842
3. Install all dependencies with ```sudo pip install scandir lxml naiveBayesClassifier pefile``` (@twpDone reported that in case of errors try ```sudo pip install pefile``` and ```sudo pip install scandir lxml naiveBayesClassifier```)
79-
4. Clone and install [Binarly-SDK](https://github.com/binarlyhq/binarly-sdk/) and install it with ```python ./setup.py install```
80-
5. Run python ```yarGen.py --update``` to automatically download the built-in databases or download them manuall from [here](https://drive.google.com/drive/folders/0B2S_IOa0MiOHS0xmekR6VWRhZ28) and place them in a new './dbs' sub folder
43+
4. Run ```python yarGen.py --update``` to automatically download the built-in databases. They are saved into the './dbs' sub folder. (Download: 913 MB)
8144
6. See help with ```python yarGen.py --help``` for more information on the command line parameters
8245

8346
### Memory Requirements
84-
Warning: yarGen pulls the whole goodstring database to memory and uses at least
85-
4 GB of memory for a few seconds - 6 GB if opcodes evaluation is used.
8647

87-
I've already tried to migrate the database to sqlite but the numerous string
88-
comparisons and lookups made the analysis inacceptably slow.
48+
Warning: yarGen pulls the whole goodstring database to memory and uses at least 3 GB of memory for a few seconds - 6 GB if opcodes evaluation is activated (--opcodes).
49+
50+
I've already tried to migrate the database to sqlite but the numerous string comparisons and lookups made the analysis painfully slow.
8951

9052
# Multiple Database Support
53+
9154
yarGen allows creating multiple databases for opcodes or strings. You can easily create a new database by using "-c" for new database creation and "-i identifier" to give the new database a unique identifier, e.g. "office". It will then create two new database files named "good-strings-office.db" and "good-opcodes-office.db" that will from then on be initialized during startup with the built-in databases.
9255

9356
### Example
57+
9458
Create a new strings and opcodes database from an Office 2013 program directory:
9559
```
9660
yarGen.py -c --opcodes -i office -g /opt/packs/office2013
@@ -106,28 +70,18 @@ You can update the once created databases with the "-u" parameter
10670
```
10771
yarGen.py -u --opcodes -i office -g /opt/packs/office365
10872
```
109-
This would update the "office" databases with new strings extracted from files in the given directory.
110-
111-
## Binarly
112-
113-
In order to use the Binarly lookup, you need an API key placed in a file named
114-
```apikey.txt``` in the ```./config``` subfolder.
115-
116-
Request an Binarly API key by mail to: [email protected]
117-
118-
### Offline
119-
Feb 2017: The Binarly API service is currently offline. There will be a replacement in the near future which will then be supported by yarGen.
73+
This would update the "office" databases with new strings extracted from files in the given directory.
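As a rough mental model of what such an update does, the sketch below merges newly extracted strings into an existing database. It assumes, for illustration only, that a database behaves like a counter that maps each string to the number of goodware occurrences; the function name `update_database` is hypothetical and the on-disk format of the `.db` files is not covered here.

```python
from collections import Counter


def update_database(existing, new_items):
    """Return a merged copy of `existing` with the occurrences of `new_items` added.

    Assumption: the database maps a string to its goodware occurrence count.
    """
    merged = Counter(existing)  # copy, so the original database stays untouched
    merged.update(new_items)    # adds one occurrence per extracted item
    return merged
```

Strings that are already known get a higher count, while unseen strings are appended with a count of one.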
12074

12175
## Command Line Parameters
12276

12377
```
124-
usage: yarGen.py [-h] [-m M] [-l min-size] [-z min-score] [-x high-scoring]
78+
usage: yarGen.py [-h] [-m M] [-y min-size] [-z min-score] [-x high-scoring]
12579
[-s max-size] [-rc maxstrings] [--excludegood]
126-
[-o output_rule_file] [-a author] [-r ref] [-p prefix]
127-
[--score] [--nosimple] [--nomagic] [--nofilesize] [-fm FM]
128-
[--globalrule] [--nosuper] [-g G] [-u] [-c] [-i I] [--nr]
129-
[--oe] [-fs size-in-MB] [--debug] [--opcodes] [-n opcode-num]
130-
[--binarly]
80+
[-o output_rule_file] [-a author] [-r ref] [-l lic]
81+
[-p prefix] [--score] [--nosimple] [--nomagic] [--nofilesize]
82+
[-fm FM] [--globalrule] [--nosuper] [--update] [-g G] [-u]
83+
[-c] [-i I] [--nr] [--oe] [-fs size-in-MB] [--noextras]
84+
[--debug] [--opcodes] [-n opcode-num]
13185
13286
yarGen
13387
@@ -136,10 +90,10 @@ optional arguments:
13690
13791
Rule Creation:
13892
-m M Path to scan for malware
139-
-l min-size Minimum string length to consider (default=8)
93+
-y min-size Minimum string length to consider (default=8)
14094
-z min-score Minimum score to consider (default=5)
14195
-x high-scoring Score required to set string as 'highly specific
142-
string' (default: 30, +10 with binarly)
96+
string' (default: 30)
14397
-s max-size Maximum length to consider (default=128)
14498
-rc maxstrings Maximum number of strings per rule (default=20,
14599
intelligent filtering will be applied)
@@ -149,6 +103,7 @@ Rule Output:
149103
-o output_rule_file Output rule file
150104
-a author Author Name
151105
-r ref Reference
106+
-l lic License
152107
-p prefix Prefix for the rule description
153108
--score Show the string scores as comments in the rules
154109
--nosimple Skip simple rule creation for files included in super
@@ -162,9 +117,12 @@ Rule Output:
162117
various files
163118
164119
Database Operations:
120+
--update Update the local strings and opcodes dbs from the
121+
online repository
165122
-g G Path to scan for goodware (dont use the database
166123
shipped with yaraGen)
167-
-u Update local standard goodware database (use with -g)
124+
-u Update local standard goodware database with a new
125+
analysis result (used with -g)
168126
-c Create new local goodware database (use with -g and
169127
optionally -i "identifier")
170128
-i I Specify an identifier for the newly created databases
@@ -176,14 +134,14 @@ General Options:
176134
--oe Only scan executable extensions EXE, DLL, ASP, JSP,
177135
PHP, BIN, INFECTED
178136
-fs size-in-MB Max file size in MB to analyze (default=10)
137+
--noextras Don't use extras like Imphash or PE header specifics
179138
--debug Debug output
180139
181140
Other Features:
182141
--opcodes Do use the OpCode feature (use this if not enough high
183142
scoring strings can be found)
184143
-n opcode-num Number of opcodes to add if not enough high scoring
185144
string could be found (default=3)
186-
--binarly Use binarly to lookup string statistics
187145
```
188146

189147
## Best Practice
@@ -240,14 +198,6 @@ In order to use only strings for your rules that match a certain minimum score u
240198

241199
```python yarGen.py --opcodes -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case33 -o rules33.yar```
242200

243-
### Exclude all strings from Goodware samples
244-
245-
```python yarGen.py --excludegood -m /opt/mal/case_441```
246-
247-
### Supress simple rule if alreay covered by a super rules
248-
249-
```python yarGen.py --nosimple -m /opt/mal/case_441```
250-
251201
### Show debugging output
252202

253203
```python yarGen.py --debug -m /opt/mal/case_441```
@@ -262,10 +212,27 @@ This will generate two new databases for strings and opcodes named:
262212

263213
The new databases will automatically be initialized during startup and are from then on used for rule generation.
264214

265-
### Update a goodware strings database (append new strings to the old ones)
215+
### Update a goodware strings database (append new strings, opcodes, imphashes, exports to the old ones)
266216

267217
```python yarGen.py -u -g /home/user/Downloads/office365 -i office```
268218

269219
### My Best Practice Command Line
270220

271-
```python yarGen.py --opcodes -a "Florian Roth" -r "Internal Reserahc" -m /opt/mal/apt_case_32 -o rules32.yar```
221+
```python yarGen.py -a "Florian Roth" -r "Internal Research" -m /opt/mal/apt_case_32```
222+
223+
# db-lookup.py
224+
225+
A tool named `db-lookup.py`, introduced with version 0.18.0, allows you to query the local databases through a simple command line interface. The interface takes an input value, which can be a `string`, `export` or `imphash` value, detects the query type and then performs a lookup in the loaded databases. This lets you check whether the value appears in the goodware that was processed to generate the databases.
226+
227+
This is a nice feature that helps you to answer the following questions:
228+
229+
* Does this string appear in goodware samples of my database?
230+
* Does this export name appear in goodware samples of my database?
231+
* Does a sample in my goodware database have this imphash?
232+
233+
However, there are several drawbacks:
234+
235+
* It only matches on the full string (no contains, no startswith, no endswith)
236+
* Opcode lookup is not supported (yet)
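The full-match limitation is easy to picture with a small sketch. The code below is not taken from `db-lookup.py`; it assumes that each loaded database is a counter keyed by the exact value, and the `lookup` function and the sample database contents are hypothetical.

```python
from collections import Counter


def lookup(value, databases):
    """Return {database name: count} for every database that holds exactly `value`.

    Only exact key matches are found; substrings and prefixes are not.
    """
    return {name: db[value] for name, db in databases.items() if value in db}


# Hypothetical example databases with made-up counts
demo_dbs = {
    "strings": Counter({"kernel32.dll": 120}),
    "exports": Counter({"DllRegisterServer": 40}),
}
```

With these example databases, `lookup("kernel32.dll", demo_dbs)` finds the entry in the strings database, while `lookup("kernel32", demo_dbs)` returns an empty result because only the full value is used as the key.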
237+
238+
I plan to release a new project named `Valknut` which extracts overlapping byte sequences from samples and creates searchable databases. This project will be the new backend API for yarGen allowing all kinds of queries, opcodes and string values, ascii and wide formatted.
