How to Write Simple but Sound Yara Rules – Part 2

by Florian Roth, Oct 17, 2015

A few months ago I wrote a blog article on “How to write simple but sound Yara rules”. Since then the techniques and tools mentioned there have improved. I’d like to give you a brief update on certain Yara features that I frequently use and on the tools that I use to generate and test my rules.

Handle Very Specific Strings Differently

In the past I was glad to see very specific strings in samples and sometimes used these strings as the only indicator for detection. E.g. whenever I found a certain typo in the PE header fields like “Micorsoft Corportation”, I cheered and thought that this would make a great signature. But, and I have to admit that now, this only makes a nice signature. A great signature not only matches a certain sample in the most condensed way but also aims to match similar samples created by the same author or group.
Look at the following rule:

rule Enfal_Malware_Backdoor {
	meta:
		description = "Generic Rule to detect the Enfal Malware"
		author = "Florian Roth"
		date = "2015/02/10"
		super_rule = 1
		hash0 = "6d484daba3927fc0744b1bbd7981a56ebef95790"
		hash1 = "d4071272cc1bf944e3867db299b3f5dce126f82b"
		hash2 = "6c7c8b804cc76e2c208c6e3b6453cb134d01fa41"
		score = 60
	strings:
		$x1 = "Micorsoft Corportation" fullword wide
		$x2 = "IM Monnitor Service" fullword wide
		$a1 = "imemonsvc.dll" fullword wide
		$a2 = "iphlpsvc.tmp" fullword
		$a3 = "{53A4988C-F91F-4054-9076-220AC5EC03F3}" fullword
		$s1 = "urlmon" fullword
		$s2 = "Registered trademarks and service marks are the property of their" wide
		$s3 = "XpsUnregisterServer" fullword
		$s4 = "XpsRegisterServer" fullword
	condition:
		uint16(0) == 0x5A4D and
		(
			( 1 of ($x*) ) or
			( 2 of ($a*) and all of ($s*) )
		)
}

When I review the 20 strings generated by yarGen, I try to sort the extracted strings into 3 different groups:

Very specific strings (one of them is sufficient for successful detection, e.g. IP addresses, payload URLs, PDB paths, user profile directories)
Specific strings (strings that look good but may appear in goodware as well, e.g. “wwwlib.dll”)
Other strings (even strings that appear in goodware; without random code from compressed or encrypted data; e.g. “ModuleStart”)

Then I create a condition that defines:

A Certain Magic Header (remove it in case of ASCII text like scripts or webshells)
1 of the very specific strings OR
some of the specific strings combined with many (but not all) of the common strings
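
The grouping and the condition scheme above can be sketched in a few lines of Python. This is an illustration only and not part of yarGen: the helper names `yara_escape` and `build_rule`, the threshold of two specific strings and the three-quarters share of common strings are assumptions picked to resemble the rules shown in this article. String modifiers such as fullword, wide or ascii are left out and would have to be added by hand.

```python
def yara_escape(value):
    """Escape backslashes and double quotes for a YARA text string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_rule(name, very_specific, specific, common,
               magic="uint16(0) == 0x5A4D"):
    """Render a rule skeleton from three string groups (hypothetical helper).

    very_specific -> $x*, specific -> $a*, common -> $s*.
    Pass magic=None for ASCII files such as scripts or webshells.
    """
    lines = ["rule %s {" % name, "\tstrings:"]
    for prefix, values in (("x", very_specific), ("a", specific), ("s", common)):
        for number, value in enumerate(values, start=1):
            lines.append('\t\t$%s%d = "%s"' % (prefix, number, yara_escape(value)))

    # Assumed threshold: roughly three quarters of the common strings.
    common_part = ""
    if common:
        common_part = " and %d of ($s*)" % max(1, len(common) * 3 // 4)

    if very_specific and specific:
        # One very specific string OR some specific plus many common strings.
        core = "( 1 of ($x*) ) or ( %d of ($a*)%s )" % (
            min(2, len(specific)), common_part)
    elif very_specific:
        # No middle group: require the common strings alongside an x string.
        core = "1 of ($x*)%s" % common_part
    elif specific:
        core = "%d of ($a*)%s" % (min(2, len(specific)), common_part)
    else:
        raise ValueError("need at least one very specific or specific string")

    condition = "%s and ( %s )" % (magic, core) if magic else core
    lines += ["\tcondition:", "\t\t" + condition, "}"]
    return "\n".join(lines)
```

Every defined string group is referenced in the generated condition, which matters because Yara refuses to compile rules with unreferenced strings. The thresholds are a starting point and should be tuned against goodware.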

Here is another example that has only very specific strings (x) and common strings (s):

rule Cobra_Trojan_Stage1 {
	meta:
		description = "Cobra Trojan - Stage 1"
		author = "Florian Roth"
		reference = "https://blog.gdatasoftware.com/blog/article/analysis-of-project-cobra.html"
		date = "2015/02/18"
		hash = "a28164de29e51f154be12d163ce5818fceb69233"
	strings:
		$x1 = "KmSvc.DLL" fullword wide
		$x2 = "SVCHostServiceDll_W2K3.dll" fullword ascii
		$s1 = "Microsoft Corporation. All rights reserved." fullword wide
		$s2 = "srservice" fullword wide
		$s3 = "Key Management Service" fullword wide
		$s4 = "msimghlp.dll" fullword wide
		$s5 = "_ServiceCtrlHandler@16" fullword ascii
		$s6 = "ModuleStart" fullword ascii
		$s7 = "ModuleStop" fullword ascii
		$s8 = "5.2.3790.3959 (srv03.sp2.070216-1710)" fullword wide
	condition:
		uint16(0) == 0x5A4D and filesize < 50000 and 1 of ($x*) and 6 of ($s*)
}

If you can’t create a rule that is sufficiently specific, I recommend the following methods to restrict the rule:

Magic Header (use it as first element in condition – see performance guidelines, e.g. “uint16(0) == 0x5A4D”)
File Size (malware that mimics valid system files, drivers or legitimate software often differs significantly in size; try to find the valid files online and set a size value in your rule, e.g. “filesize > 200KB and filesize < 600KB”)
String Location (see the “Location is Everything” section)
Exclude strings that occur in false positives (e.g. define $fp1 = "McAfeeSig" and add “not 1 of ($fp*)” to the condition)

Location is Everything

One of the most underestimated features of Yara is the possibility to define a range in which strings must occur in order to match. I used this technique to create a rule that detects Metasploit Meterpreter payloads quite reliably, even if they are encoded or cloaked. How does that work?
If you see malware code that is hidden in an overlay at the end of a valid executable (e.g. “ab.exe”) and you see only strings that are typical function exports or that mimic a well-known executable, ask the following questions:

Is it normal that these strings are located at this location in the file?
Is it normal that these strings occur more than once in that file?
Is the distance between two strings somehow specific?
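
To answer these three questions for a concrete sample, it helps to list where a string occurs. The following Python sketch is one possible way to do that. The function names `find_offsets` and `location_report` are hypothetical, and the sample is assumed to have been read into memory as bytes.

```python
def find_offsets(data, needle):
    """Return the start offset of every occurrence of needle in data."""
    offsets = []
    position = data.find(needle)
    while position != -1:
        offsets.append(position)
        # Continue one byte later so overlapping occurrences are found too.
        position = data.find(needle, position + 1)
    return offsets


def location_report(data, needle):
    """Summarize location, count and spacing of a string in a sample."""
    offsets = find_offsets(data, needle)
    return {
        "count": len(offsets),                              # more than once?
        "offsets": offsets,                                 # where in the file?
        "from_end": [len(data) - o for o in offsets],       # distance to EOF
        "gaps": [b - a for a, b in zip(offsets, offsets[1:])],  # spacing
    }
```

Running the same report on the legitimate file and on the suspicious one shows quickly whether the position, the number of occurrences or the spacing differs enough to be used in a condition.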

Malware Strings

In case of the unspecific malware code in the PE overlay, try to define a rule that looks for a certain file size (e.g. filesize > 800KB) and the malware strings relative to the end of the file (e.g. $s1 in (filesize-500..filesize)).
The following example shows an unspecified webshell that contains strings that an attacker may modify in future versions when it is deployed in a victim’s network. Always try to extract strings that are less likely to be changed.

Webshell Code PHP

The variable name “$code” is more likely to change than the function combination “@eval(gzinflate(base64_decode(” at the end of the file. It is possible that valid PHP code contains “eval(gzinflate(base64_decode(” somewhere in the code, but it is less likely that it occurs in the last 50 bytes of the file.
I therefore wrote the following rule:

rule Webshell_b374k_related_1 {
	meta:
		description = "Detects b374k related webshell"
		author = "Florian Roth"
		reference = "https://goo.gl/ZuzV2S"
		score = 65
		hash = "d5696b32d32177cf70eaaa5a28d1c5823526d87e20d3c62b747517c6d41656f7"
		date = "2015-10-17"
	strings:
		$m1 = "<?php"
		$s1 = "@eval(gzinflate(base64_decode(" ascii
	condition:
		$m1 at 0 and $s1 in (filesize-50..filesize) and filesize < 20KB
}
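
To try such a location-based idea on a folder of samples before writing the rule, the same checks can be emulated in plain Python. This is only a rough sketch and no replacement for Yara itself. The function names `in_tail` and `looks_like_packed_php_tail` are hypothetical, and the file is assumed to start with the PHP opening tag “<?php”.

```python
def in_tail(data, needle, window):
    """True if needle starts within the last `window` bytes of data.

    Mirrors the Yara expression `$s in (filesize-window..filesize)`,
    which tests the offset at which a match starts.
    """
    position = data.rfind(needle)
    # rfind returns -1 when nothing is found; handle that explicitly so
    # that very small files do not produce a false match.
    return position != -1 and position >= len(data) - window


def looks_like_packed_php_tail(data):
    """Emulate: header at offset 0, eval chain in the tail, size below 20KB."""
    return (
        data.startswith(b"<?php")
        and in_tail(data, b"@eval(gzinflate(base64_decode(", 50)
        and len(data) < 20 * 1024  # Yara's KB suffix multiplies by 1024
    )
```

A file in which the same function chain appears near the beginning and is followed by more code does not satisfy the check, which is exactly the distinction the location range is meant to draw.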

Performance Guidelines

I collected many ideas by Wesley Shields and Victor M. Alvarez and compiled them into a gist called “Yara Performance Guidelines”. This guide shows you how to write Yara rules that use fewer CPU cycles by avoiding CPU-intensive checks or by using the new condition-checking shortcuts introduced in Yara version 3.4.
Yara Performance Guidelines

PE Module

People sometimes ask why I don’t use the PE module. The reason is simple: I avoid using modules that are rather new and would like to see them thoroughly tested prior to using them in my scanners running in production environments. It is a great module and a lot of effort went into it. I would always recommend using the PE module in lab environments or sandboxes. In scanners that walk huge directory trees, however, a minor memory leak in one of the modules could lead to severe memory shortages. I’ll give it another year to prove its stability and then start using it in my rules.

yarGen

yarGen has had an opcode feature since the last minor version. It is active by default but only useful in cases in which not enough strings could be extracted.
I currently use the following parameters to create my rules:

python yarGen.py --noop -z 0 -a "Florian Roth" -r "http://link-to-sample" /mal/malware

The problem with the opcode feature is that it requires about 2.5 GB more main memory during rule creation. I’ll change it to an optional parameter in the next version.

yarAnalyzer

yarAnalyzer is a rather new tool that focuses on rule coverage. After creating a bigger rule set or a generic rule that should match on several samples you’d like to check the coverage of your rules in order to detect overlapping rules (which is often OK).
yarAnalyzer helps you to get an overview of:

rules that match on more than one sample
samples that show hits from more than one rule
rules without hits
samples without hits
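
The four views are simple set operations over the scan results. The following Python sketch shows the idea. It is not how yarAnalyzer is implemented, the function name `coverage_report` is hypothetical, and the scan results are assumed to be available as a mapping from rule name to the set of samples it matched.

```python
def coverage_report(matches, all_rules, all_samples):
    """Compute the four coverage views from scan results.

    matches: dict mapping rule name -> set of sample names it hit.
    """
    # Invert the mapping: which rules fired on each sample?
    rules_per_sample = {sample: set() for sample in all_samples}
    for rule, samples in matches.items():
        for sample in samples:
            rules_per_sample.setdefault(sample, set()).add(rule)

    return {
        "rules_multi_sample": sorted(
            r for r in all_rules if len(matches.get(r, ())) > 1),
        "samples_multi_rule": sorted(
            s for s, rules in rules_per_sample.items() if len(rules) > 1),
        "rules_without_hits": sorted(
            r for r in all_rules if not matches.get(r)),
        "samples_without_hits": sorted(
            s for s, rules in rules_per_sample.items() if not rules),
    }
```

Rules without hits and samples without hits are usually the most interesting entries, because they point to broken rules or to gaps in the rule set.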

yarAnalyzer Screenshot

yarAnalyzer Github Repository

String Extraction and Colorization

To review the strings in a sample I use a simple shell one-liner that a good friend sent me once.
“strings” version for Linux

#!/bin/bash
(strings -a -td "$@" | sed 's/^\(\s*[0-9][0-9]*\) \(.*\)$/\1 A \2/' ; strings -a -td -el "$@" | sed 's/^\(\s*[0-9][0-9]*\) \(.*\)$/\1 W \2/') | sort -n

“gstrings” version for OS X (sudo port install binutils)

#!/bin/bash
(gstrings -a -td "$@" | gsed 's/^\(\s*[0-9][0-9]*\) \(.*\)$/\1 A \2/' ; gstrings -a -td -el "$@" | gsed 's/^\(\s*[0-9][0-9]*\) \(.*\)$/\1 W \2/') | sort -n

It produces output as shown in the screenshot above (green text, captioned “Malware Strings”): the offset, an A for ASCII or a W for wide, and the string found at that offset.
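
On systems without GNU strings and sed, a similar listing can be approximated in Python. This is a rough sketch under a few assumptions: only printable ASCII characters (0x20 to 0x7e) are considered, the minimum length is four characters as in the default of strings, wide means UTF-16 little endian, and the function name `extract_strings` is hypothetical.

```python
import re

# Runs of at least four printable ASCII bytes.
ASCII_RE = re.compile(rb"[\x20-\x7e]{4,}")
# The same characters, each followed by a NUL byte (UTF-16LE).
WIDE_RE = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def extract_strings(data):
    """Return (offset, 'A' or 'W', text) tuples sorted by offset."""
    found = [(m.start(), "A", m.group().decode("ascii"))
             for m in ASCII_RE.finditer(data)]
    found += [(m.start(), "W", m.group().decode("utf-16-le"))
              for m in WIDE_RE.finditer(data)]
    return sorted(found)
```

The result can be printed line by line to get the same offset, type and string columns as the shell version.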
To colorize the strings, check out my new tool “prisma”, which colorizes any type of standard output.

Prisma STDOUT colorization

Contact

Follow me on Twitter: @Cyb3rOps
