Generate Code-Based YARA Rules using Machine Learning.
clava was developed during an industry project at Hochschule Luzern with the goal of automatically creating YARA rules from a given malware sample. Rules created with clava should not be used in production, but they can assist during rule development. Because this project is heavily inspired by yarGen, see also Florian Roth's blog post "How to post-process YARA rules generated by yarGen".
We've kept the machine learning part intentionally rudimentary to demonstrate how much can be achieved with simple techniques: in this first iteration we used a simple logistic regression classifier trained on the term frequency weights of mnemonic n-grams. See Summary for a quick roundup of the research. As a next step, one could explore more sophisticated techniques to improve the results; see the section Contribute below for some ideas. If you are interested in the written report, feel free to contact me.
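To make the idea of "term frequency weights of mnemonic n-grams" concrete, here is a minimal, standard-library-only sketch. It is not clava's actual code: the function names, the toy mnemonic sequences, the labels, and the hand-rolled gradient descent are all assumptions made for illustration, and a real pipeline would obtain mnemonics from a disassembler and would likely rely on an established machine learning library.

```python
import math
from collections import Counter


def mnemonic_ngrams(mnemonics, n=2):
    """Return all contiguous n-grams of a mnemonic sequence as tuples."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def term_frequencies(mnemonics, n=2):
    """Map each n-gram to its relative frequency within one sample."""
    grams = mnemonic_ngrams(mnemonics, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def predict_proba(weights, bias, features):
    """Logistic model: probability that a sample belongs to class 1."""
    z = bias + sum(weights.get(g, 0.0) * v for g, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, labels, epochs=500, lr=0.5):
    """Fit weights with stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            bias -= lr * error
            for gram, value in features.items():
                weights[gram] = weights.get(gram, 0.0) - lr * error * value
    return weights, bias


# Hypothetical toy data: one "malicious" (1) and one "benign" (0) sequence.
malicious = term_frequencies(["push", "mov", "xor", "xor", "call"])
benign = term_frequencies(["push", "mov", "add", "ret"])
weights, bias = train_logistic_regression([malicious, benign], [1, 0])

# N-grams with the largest positive weights are the most indicative of
# class 1 and would be natural candidates for rule patterns.
top = sorted(weights, key=weights.get, reverse=True)[:3]
print(top)
```

The appeal of such a simple linear model is that its weights are directly interpretable: each n-gram has exactly one weight, so the most indicative code sequences can be read off by sorting.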
clava was heavily inspired by these projects:
Note: At the moment, the models are not public. However, you can easily train a model on your own dataset. Instructions will follow.
To install clava, clone this repository and run:
$ python setup.py install
clava offers a simple CLI to interact. To list all available options, run:
$ clava -h
To generate a YARA rule based on a sample:
$ clava yara <path/to/sample>
During development, I recommend installing clava in editable mode:
$ pip install -e .[dev]
clava uses pytest. To run the test suite with a set of predefined settings, run:
$ make tests
Alternatively, you can run pytest against the tests/ directory with your own settings.