Quite recently I came across a really interesting paper on how to build a system that automatically discovers, analyzes and collects indicators of compromise (IoCs). The authors achieve this by combining the characteristic way IoCs are usually described in technical articles with dependency graphs, support vector machines (SVMs) and a number of other techniques. The results are inspiring: they reach as much as 95% precision and 90% recall.

From the abstract:

In this paper, we present iACE, an innovation solution for fully automated IOC extraction. Our approach is based on the observation that the IOCs in technical articles are often described in a predictable way: being connected to a set of context terms (e.g., “download”) through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., “malware”, “download”) within the sentences in a technical article, and further analyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that the IOC is commonly presented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., download from an external source). Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates a remarkable performance: it generated 900K OpenIOC items with a precision of 95% and a coverage over 90%, which is way beyond what the state-of-the-art NLP technique and industry IOC tool can achieve, at a speed of thousands of articles per hour.
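To make the core observation a bit more concrete, here is a minimal sketch of the first step only: finding a putative IOC token and the context terms that share a sentence with it. This is not the authors' implementation. iACE goes further and analyzes the grammatical relations between the tokens with graph mining, whereas this sketch substitutes plain sentence-level co-occurrence. The pattern set, the context term list and the function name `candidate_iocs` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately small pattern set; real IOC formats are far more varied.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "file": re.compile(r"\b[\w-]+\.(?:zip|exe|dll)\b"),
}

# Hypothetical context terms, loosely inspired by the examples in the abstract.
CONTEXT_TERMS = {"download", "downloads", "downloaded", "malware", "c2", "dropper"}


def candidate_iocs(text):
    """Return (kind, token, context_terms) for IOC-like tokens that share a
    sentence with at least one context term.

    Sentence-level co-occurrence is a crude stand-in for the grammatical
    relation analysis described in the abstract.
    """
    results = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        context = sorted(words & CONTEXT_TERMS)
        if not context:
            continue  # no context term, so the token is not treated as a candidate
        for kind, pattern in IOC_PATTERNS.items():
            for match in pattern.finditer(sentence):
                results.append((kind, match.group(), context))
    return results


sample = (
    "The malware downloads payload.zip from 10.0.0.5. "
    "Our office printer is at 192.168.1.20."
)
found = candidate_iocs(sample)
```

In the sample, the IP address in the second sentence is skipped because no context term appears next to it, which is the kind of false positive the context requirement is meant to filter out.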

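PS: Precision and recall are specific terms used in, among other fields, machine learning. Precision is the share of the extracted items that are correct, and recall is the share of all correct items that were actually extracted.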
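For readers new to the two terms, a small sketch of how they are computed may help. The function name `precision_recall` and the example sets are made up for illustration and have nothing to do with the paper's data.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for a set of predicted items
    against the set of items that are actually correct."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: how many of the predicted items are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how many of the correct items were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: four extracted indicators, five real ones, three in common.
extracted = {"a", "b", "c", "d"}
real = {"a", "b", "c", "e", "f"}
precision, recall = precision_recall(extracted, real)
```

Here three of the four extracted items are correct (precision 0.75), while three of the five real items were found (recall 0.6).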