How to Write Simple but Sound Yara Rules

by Florian Roth, Feb 16, 2015

During the last two years I wrote approximately 2000 Yara rules based on samples found during our incident response investigations. Many security professionals have noticed that Yara provides an easy and effective way to write custom rules based on strings or byte sequences found in their samples, and that it allows them, as end users, to create their own detection tools.
However, it makes me sad to see that researchers mainly publish two types of rules:

rules that generate many false positives and
rules that match only the specific sample and are not much better than a hash value.

I therefore decided to write an article on how to build optimal Yara rules, which can be used to scan single samples uploaded to a sandbox and whole file systems with a minimal chance of false positives.
These rules are based on strings contained in the samples and are easy to comprehend. You do not need to know how to reverse engineer executables, and I decided to avoid the new Yara modules like “pe”, which I still consider “testing” features that may lead to memory leaks or other errors when used in practice.

Automatic Rule Generation

At first I believed that automatically generated rules could never be as good as manually created ones. During my work on our IOC scanners THOR and LOKI I had to create hundreds of Yara rules manually, and it became clear that the manual approach has an obvious disadvantage. What I used to do was extract UNICODE and ASCII strings from my samples with the following commands:

strings -el sample.exe
strings -a sample.exe
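
Where the `strings` utility is not available, the two extractions can be approximated in a few lines of Python. This is only a rough sketch and not a replacement for the tool: the function names, the minimum length of 4 and the restriction to printable ASCII characters are assumptions made for illustration, and the byte blob at the end is made-up test data, not a real sample.

```python
import re

MIN_LEN = 4  # assumed minimum string length


def ascii_strings(data: bytes, min_len: int = MIN_LEN) -> list[str]:
    """Runs of printable ASCII bytes, roughly what `strings -a` prints."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def wide_strings(data: bytes, min_len: int = MIN_LEN) -> list[str]:
    """Runs of printable characters stored as UTF-16LE (each followed by a
    zero byte), roughly what `strings -el` prints."""
    pattern = rb"(?:[\x20-\x7e]\x00){%d,}" % min_len
    return [m.decode("utf-16-le") for m in re.findall(pattern, data)]


# Made-up test data: one wide string and one ASCII string between junk bytes
blob = (b"MZ\x90\x00\x03"
        + "SVCH0ST.EXE".encode("utf-16-le")
        + b"\x00\x00\xff\xfe"
        + b"msvcrt.bat\x00")

print(ascii_strings(blob))  # ['msvcrt.bat']
print(wide_strings(blob))   # ['SVCH0ST.EXE']
```

To run it against a real file, read the file with `open(path, "rb").read()` and pass the result to both functions.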

I prefer the UNICODE strings as they are often overlooked and less frequently changed within a certain malware/tool family. Make sure that you use UNICODE strings with the “wide” keyword and ASCII strings with the “ascii” keyword in your rules, and use “fullword” if there is a word boundary before and after the string. The problem with this method is that you cannot tell whether a string returned by the commands is unique to this malware or often used in goodware samples as well.
Look at the extracted strings in the following example:

NTLMSSP
%d.%d.%d.%d
%s\IPC$
\\%s
NT LM 0.12
%s%s%s
%s.exe %s
%s\Admin$\%s.exe
RtlUpcaseUnicodeStringToOemString
LoadLibrary( NTDLL.DLL ) Error:%d

Could you be sure that the string “NT LM 0.12” is a unique one, which is not used by legitimate software?
To accomplish this task I developed “yarGen”, a Yara rule generator that ships with a huge string database of common and benign software. To generate the database I used the Windows system folder files of Windows 2003, Windows 7 and Windows 2008 R2 server, typical software like Microsoft Office, 7zip, Firefox, Chrome and Cygwin, and the program folders of various Antivirus solutions. yarGen allows you to generate your own database or add folders with more goodware to the existing database.
yarGen extracts all ASCII and UNICODE strings from a sample and removes all strings that also appear in the goodware string database. Then it evaluates and scores every string by using fuzzy regular expressions and the “Gibberish Detector”, which allows yarGen to detect and prefer real language over character chains without meaning. The top 20 strings are integrated into the resulting rule.
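
The filter-then-score idea can be sketched in Python as follows. This is not yarGen's code: `select_strings` and `toy_score` are hypothetical names, the scoring heuristic is a crude stand-in for the fuzzy regular expressions and the Gibberish Detector, and the goodware set is made up for the example.

```python
from typing import Callable, Iterable


def select_strings(sample_strings: Iterable[str],
                   goodware: set[str],
                   score: Callable[[str], int],
                   top_n: int = 20) -> list[str]:
    """Drop strings known from goodware, then keep the best scored ones."""
    candidates = {s for s in sample_strings if s not in goodware}
    # Highest score first; ties are broken alphabetically for stable output
    ranked = sorted(candidates, key=lambda s: (-score(s), s))
    return ranked[:top_n]


def toy_score(s: str) -> int:
    """Hypothetical stand-in heuristic, not yarGen's actual scoring."""
    points = 0
    if s.lower().endswith((".exe", ".dll", ".bat", ".tmp")):
        points += 3  # looks like a file name
    if "%" in s:
        points += 1  # contains a format placeholder
    letters = sum(c.isalpha() for c in s)
    if letters * 2 < len(s):
        points -= 3  # mostly digits and symbols
    return points


# Strings from the listing above; the goodware set is an assumption
sample = ["NTLMSSP", "%s\\Admin$\\%s.exe", "%s.exe %s",
          "%d.%d.%d.%d", "NT LM 0.12"]
goodware_db = {"NTLMSSP", "RtlUpcaseUnicodeStringToOemString"}

selected = select_strings(sample, goodware_db, toy_score, top_n=2)
print(selected)
```

With this toy data the Admin$ format string ranks first, because it both looks like a file name and contains placeholders, while the strings that consist mostly of digits and symbols end up at the bottom.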
Let’s look at two examples from my work: a sample of the Enfal Trojan and an SMB Worm sample.
yarGen generates the following rule for the Enfal Trojan sample:

rule Enfal_Generic {
	meta:
		description = "Auto-generated rule - from 3 different files"
		author = "YarGen Rule Generator"
		reference = "not set"
		date = "2015/02/15"
		super_rule = 1
		hash0 = "6d484daba3927fc0744b1bbd7981a56ebef95790"
		hash1 = "d4071272cc1bf944e3867db299b3f5dce126f82b"
		hash2 = "6c7c8b804cc76e2c208c6e3b6453cb134d01fa41"
	strings:
		$s0 = "urlmon" fullword
		$s1 = "Registered trademarks and service marks are the property of their respec" wide
		$s2 = "Micorsoft Corportation" fullword wide
		$s3 = "IM Monnitor Service" fullword wide
		$s4 = "imemonsvc.dll" fullword wide
		$s5 = "iphlpsvc.tmp" fullword
		$s6 = "XpsUnregisterServer" fullword
		$s7 = "XpsRegisterServer" fullword
		$s8 = "{53A4988C-F91F-4054-9076-220AC5EC03F3}" fullword
		$s9 = "tEHt;HuD" fullword
		$s10 = "6.0.4.1624" fullword wide
		$s11 = "#*8;->)" fullword
		$s12 = "%/>#?#*8" fullword
		$s13 = "\\%04x%04x\\" fullword
		$s14 = "3,8,18" fullword
		$s15 = "3,4,15" fullword
		$s16 = "3,7,12" fullword
		$s17 = "3,4,13" fullword
		$s18 = "3,8,12" fullword
		$s19 = "3,8,15" fullword
		$s20 = "3,6,12" fullword
	condition:
		all of them
}

The resulting string set contains many useful strings but also random ASCII characters ($s9, $s11, $s12) that do match on the given sample but are less likely to produce the same result on other samples of the family.
yarGen generates the following rule for the SMB Worm sample:

rule sig_smb {
	meta:
		description = "Auto-generated rule - file smb.exe"
		author = "YarGen Rule Generator"
		reference = "not set"
		date = "2015/02/15"
		hash = "db6cae5734e433b195d8fc3252cbe58469e42bf3"
	strings:
		$s0 = "LoadLibrary( NTDLL.DLL ) Error:%d" fullword ascii
		$s1 = "SetServiceStatus failed, error code = %d" fullword ascii
		$s2 = "%s\\Admin$\\%s.exe" fullword ascii
		$s3 = "%s.exe %s" fullword ascii
		$s4 = "iloveyou" fullword ascii
		$s5 = "Microsoft@ Windows@ Operating System" fullword wide
		$s6 = "\\svchost.exe" fullword ascii
		$s7 = "secret" fullword ascii
		$s8 = "SVCH0ST.EXE" fullword wide
		$s9 = "msvcrt.bat" fullword ascii
		$s10 = "Hello123" fullword ascii
		$s11 = "princess" fullword ascii
		$s12 = "Password123" fullword ascii
		$s13 = "Password1" fullword ascii
		$s14 = "config.dat" fullword ascii
		$s15 = "sunshine" fullword ascii
		$s16 = "password <=14" fullword ascii
		$s17 = "del /a %1" fullword ascii
		$s18 = "del /a %0" fullword ascii
		$s19 = "result.dat" fullword ascii
		$s20 = "training" fullword ascii
	condition:
		all of them
}

The resulting rules are good enough to use them as they are, but they are far from an optimal solution. However it is good that so many strings have been found, which do not appear in the analyzed goodware samples.
If you don’t want to use or download yarGen, you could also use the online tool Yara Rule Generator provided by Joe Security, which was inspired by/based on yarGen.
It is not necessary to use a generator if your eye is trained and experienced. In this case just read the next section and select the strings to match the requirements of the (what I call) sufficiently generic Yara rules.

Sufficiently Generic Yara Rules

As I said in the introduction, rules that generate false positives are pretty annoying. However, the real tragedy is that most rules are far too specific to match on more than one sample and are therefore hardly more useful than a file hash.
What I tend to do with the rules is to check all the strings and sort them into at least two categories (the third one is optional):

Very specific strings = hard indicators for a malicious sample
Rare strings = likely that they do not appear in goodware samples, but possible
Strings that look common = (Optional) e.g. yarGen output strings that do not seem to be specific but didn’t appear in the goodware string database

Check out the modified rules in order to understand this splitting. Ignore the definition named $mz for now (I’ll explain it later) and look at the string definitions below it.
The definitions starting with $s contain the very specific strings, which I regard as so special that they would not appear in legitimate software. Note the typos in both strings: “Micorsoft Corportation” instead of “Microsoft Corporation” and “Monnitor” instead of “Monitor”.
The strings starting with $x seem to be special (I tend to google the strings) but I cannot say if they also appear in legitimate software. The definitions starting with $z seem to be ordinary but have not been part of the goodware string database so they have to be special in some way.

rule Enfal_Malware_Backdoor {
	meta:
		description = "Generic Rule to detect the Enfal Malware"
		author = "Florian Roth"
		date = "2015/02/10"
		super_rule = 1
		hash0 = "6d484daba3927fc0744b1bbd7981a56ebef95790"
		hash1 = "d4071272cc1bf944e3867db299b3f5dce126f82b"
		hash2 = "6c7c8b804cc76e2c208c6e3b6453cb134d01fa41"
	strings:
		$mz = { 4d 5a }
		$s1 = "Micorsoft Corportation" fullword wide
		$s2 = "IM Monnitor Service" fullword wide
		$x1 = "imemonsvc.dll" fullword wide
		$x2 = "iphlpsvc.tmp" fullword
		$x3 = "{53A4988C-F91F-4054-9076-220AC5EC03F3}" fullword
		$z1 = "urlmon" fullword
		$z2 = "Registered trademarks and service marks are the property of their" wide
		$z3 = "XpsUnregisterServer" fullword
		$z4 = "XpsRegisterServer" fullword
	condition:
		( $mz at 0 ) and
		(
			( 1 of ($s*) ) or
			( 2 of ($x*) and all of ($z*) )
		)
		and filesize < 40000
}

Now check the condition statement and notice that I combine the string checks with the magic header of an executable, defined by $mz, and a file size limit in order to exclude typical false positives like Antivirus signature files, browser cache or dictionary files. Set an ample file size value to avoid false negatives (e.g. samples between 100K and 200K => set file size < 300K).
You can see that I decided that a single occurrence of one of the very specific strings would trigger the rule: ( 1 of ($s*) )
Then I combine a bunch of less unique strings with most or all of the ordinary looking strings: ( 2 of ($x*) and all of ($z*) )
Let’s look at a second example (see below).
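
The header and file size checks can be tried out in Python before they go into a rule. The sketch below is only an illustration: the function names, the headroom factor of 1.5 and the rounding step are assumptions chosen so that the result matches the 100K/200K example above, not fixed recommendations.

```python
import math


def suggest_filesize_limit(sample_sizes: list[int],
                           headroom: float = 1.5,
                           step: int = 100_000) -> int:
    """Largest sample size times headroom, rounded up to a multiple of step."""
    return math.ceil(max(sample_sizes) * headroom / step) * step


def passes_prefilter(data: bytes, max_size: int) -> bool:
    """Python counterpart of `$mz at 0 and filesize < max_size`."""
    return data[:2] == b"MZ" and len(data) < max_size


limit = suggest_filesize_limit([100_000, 200_000])
print(limit)                                              # 300000

# Made-up file contents: a fake executable header and an HTML page
print(passes_prefilter(b"MZ" + b"\x00" * 1000, limit))     # True
print(passes_prefilter(b"<html>" + b"A" * 1000, limit))    # False
```

The second call shows why the header check is useful: a cached web page that happens to contain one of the strings never reaches the string conditions.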
$s1 is a very special string with string formatting placeholders “%s” in combination with an Admin$ share. $s2 seems to be the typical “svchost.exe” but contains the number “0” instead of an “O”, which is very uncommon and a clear indicator for something malicious.
All the definitions starting with $a are special but I cannot say for sure if they won’t appear in legitimate software. The strings defined by $x seem ordinary but were produced by yarGen, which means that they did not appear in the goodware string database.
This special example also contains a list of typical passwords, defined by $z1 to $z8.

rule SMB_Worm_Tool_Generic {
	meta:
		description = "Generic SMB Worm/Malware Signature"
		author = "Florian Roth"
		reference = "http://goo.gl/N3zx1m"
		date = "2015/02/08"
		hash = "db6cae5734e433b195d8fc3252cbe58469e42bf3"
	strings:
		$mz = { 4d 5a }
		$s1 = "%s\\Admin$\\%s.exe" fullword ascii
		$s2 = "SVCH0ST.EXE" fullword wide
		$a1 = "LoadLibrary( NTDLL.DLL ) Error:%d" fullword ascii
		$a2 = "\\svchost.exe" fullword ascii
		$a3 = "msvcrt.bat" fullword ascii
		$a4 = "Microsoft@ Windows@ Operating System" fullword wide
		$x1 = "%s.exe %s" fullword ascii
		$x2 = "password <=14" fullword ascii
		$x3 = "del /a %1" fullword ascii
		$x4 = "del /a %0" fullword ascii
		$x5 = "SetServiceStatus failed, error code = %d" fullword ascii
		$z1 = "secret" fullword ascii
		$z2 = "Hello123" fullword ascii
		$z3 = "princess" fullword ascii
		$z4 = "Password123" fullword ascii
		$z5 = "Password1" fullword ascii
		$z6 = "sunshine" fullword ascii
		$z7 = "training" fullword ascii
		$z8 = "iloveyou" fullword ascii
	condition:
		$mz at 0 and
		(
			( 1 of ($s*) and 1 of ($x*) ) or
			( all of ($a*) and 2 of ($x*) ) or
			( 5 of ($z*) and 2 of ($x*) )
		) and
		filesize < 200000
}

You see that I combined the string definitions in a similar way as before. This method in combination with the magic header and the file size should be a good starting point for the final stage – testing.
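
A note on operator precedence when writing conditions like this one: in Yara, as in Python, `and` binds more tightly than `or`, so a chain of `or` alternatives has to be wrapped in its own pair of parentheses if the header and file size checks are supposed to apply to every alternative. The following Python sketch models the string groups as plain booleans to show the difference; the function and parameter names are made up for the illustration.

```python
def without_outer_parentheses(mz: bool, s_hit: bool, a_hit: bool,
                              z_hit: bool, small: bool) -> bool:
    # Parsed as: (mz and s_hit) or a_hit or (z_hit and small)
    return mz and s_hit or a_hit or z_hit and small


def with_outer_parentheses(mz: bool, s_hit: bool, a_hit: bool,
                           z_hit: bool, small: bool) -> bool:
    # Header and size checks now guard every alternative
    return mz and (s_hit or a_hit or z_hit) and small


# A file without an MZ header in which only the $a group matches
case = dict(mz=False, s_hit=False, a_hit=True, z_hit=False, small=True)

print(without_outer_parentheses(**case))  # True
print(with_outer_parentheses(**case))     # False
```

Without the enclosing parentheses the middle alternative fires on any file type and any file size, which is exactly the kind of false positive the header and size checks are meant to prevent.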

Testing

Testing the rules is very important. It seems that most authors decide that the rules are good enough if they match on the given samples.
You should definitely do the following checks:

Scan the malware samples
Scan a big goodware archive

To carry out the tests, download the Yara scanner and run it from the command line. The goodware directory should include system files from various Windows versions, typical software and possible false positive sources (e.g. typical CMS software if you wrote Yara rules that match on malicious web shells).

[Figure: Yara Rule Testing on Samples and Goodware]

If the rule matches the malicious samples and does not generate a match on the goodware archive, it is good enough to be tested in practice.
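
Both checks can be scripted so that they are repeated whenever a rule changes. The sketch below assumes that the `yara` command line scanner is on the PATH, that it prints one `RULE_NAME FILE_PATH` line per match, and that `-r` scans a directory recursively. The function names and any file or directory names passed to them are hypothetical.

```python
import subprocess


def parse_matches(output: str) -> set[tuple[str, str]]:
    """Turn `RULE_NAME FILE_PATH` lines into (rule, path) pairs."""
    matches = set()
    for line in output.splitlines():
        # Rule names contain no spaces, file paths may
        rule, _, path = line.partition(" ")
        if rule and path:
            matches.add((rule, path))
    return matches


def scan(rules_file: str, target_dir: str) -> set[tuple[str, str]]:
    """Run the yara scanner recursively on a directory."""
    result = subprocess.run(["yara", "-r", rules_file, target_dir],
                            capture_output=True, text=True, check=False)
    return parse_matches(result.stdout)


def run_checks(rules_file: str, samples_dir: str, goodware_dir: str):
    """Return (matches on the samples, matches on the goodware archive)."""
    return scan(rules_file, samples_dir), scan(rules_file, goodware_dir)
```

Calling `run_checks` with a rule file, a samples directory and a goodware directory returns two sets: the first should contain every malware sample, the second should be empty.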

Update

Make sure to check Part 2 of “How to Write Simple but Sound Yara Rules”.

About the author:

Florian Roth

Florian Roth serves as the Head of Research and Development at Nextron Systems. With a background in IT security since 2000, he has delved deep into nation-state cyber attacks since 2012. Florian has developed the THOR Scanner and actively engages with the community via his Twitter handle @cyb3rops. He has contributed to open-source projects, including 'Sigma', a generic SIEM rule format, and 'LOKI', an open-source scanner. Additionally, he has shared valuable resources like a mapping of APT groups and operations and an Antivirus Event Analysis Cheat Sheet.