Malware Pattern Matching

Obfuscated malware code can be challenging to detect, let alone read. Generative language models make it easy for malware authors to increase the level of obfuscation. When confronted with a newly discovered sample, it is the security researcher’s responsibility to create a suitable signature so as to improve the rate of detection.

Currently, the most popular method of PHP malware obfuscation is the use of randomly generated comments. A preview of just how aggressive this technique can be is shown below:

PHP

<?php /*d&m-(H@AnJ#(F5+*/parse_str#U=~LxnADRDY!:3Y@f!`m!"aGcz
(/*A4y4,bzBv}f0.-TKoj(WG}WEr*///|lvjIKl_qyh=l|*/Dib22$rt>"M3BHpZFKo($p
'0'//RV (H# r=I}sR6#?xGf{pI3},8"*,C:V`dV&+k n<T.V
.//hOPD5,EEl>D8pRU22gRx1HC`^Sb.|UWW
'=' #;&AGT9I>6}P.vxwE\?O #
.#R'eaw99\=~&[g[VeJ~iV,K+5!oX<nN^rh8:SUNxGI!W&*2c
'%'/*4Bd2WS$xZfdxSlci''MUIF}t@,cp!)[#/]2R*/./*y;C_B2e!DL%t(=18(JJXL*crn~*/'6' 	//cr+@9aoHZ;3}ow%|R9grk
.#QTOgvGg0Jt4Cw1e>x"]NW
'1'/*Rp\PbMfd\2pV1|Tia;zU )\\Krb5vWf$abp>^1[h*/.  #&8&#8aJ?,"FVa6QN9jt;
'%'/*::x%_|c.9=8O%r:GOf iUR.td*/.#@yd(?allny0mL.xA*<^*'pK>5iwbuS,Lx+bA`
'7'/*gl/vx=)Gnl*'SP*/./*+:6=sY0$NE-}0N*b^5S</KFCDQ0 U9nmG"lr.^*/'2'/*N1\`D "E_$#s29;o=.pOp^>X.7l='vt*/. #ElB3\^V*@sWB/ms%<<HkN`a^
'%'// e] ,zLTRhk7%-4k#SoH=y6p6m
.//;Jv?ZUoG*Ajop"Y
'7'//'}Hc@&'1niOuNbE2lz wf+"Gj;`Rr%bYOP<M
. //Op^xtDgO3#9~k;:ZO*D*_za_-jj"7(gFq
'2'#=M4UG E;TqouT#++h}@0"a*h
.  /*Ni#q~:,%H%mg~z*/'%'//=ATw{[?49C}Q%T+U`\nT}Xa4=tmi
./*utbvA".J3C6f;wSP{~&wbtyo7HHlD>s<5j+<ATQ8qov/?@[*/'6'#0Q,(Iy^%ze}CW5W,S9JwS4!p
;/*i0nDEG4u}$+*/@/*88I>p:f^Ynsgfeo1~A&&VQS~3Xf$(F[m{;^G"*/eval	/*>z-NHKm9.09~byL9k)s/]OM:}Nfd"uZ&N*/(#UWtZC$]1"'36{Fv9:7Z5V=]Xxoq?z

Imagine responding to an incident on a website affected by this malware, especially without a security background. What would the search pattern be for a YARA rule or even plain grep? A first take might key on the quoted digits such as '0', '1' or '7', which is a perfectly valid observation. Perhaps the exposed eval?

The true weakness is the abundance of /* */ PHP comments. These comments are randomly generated and placed, but this otherwise powerful strategy can be turned against the malware.

First, we must ensure the regex is as simple and efficient as possible. In this case $re = /\/\*\S{5,25}\*\// is a good fit: it matches a /* */ comment whose body is 5 to 25 characters long and contains no whitespace. It is extraordinarily rare for legitimate PHP comments in flowerpot styling to contain no whitespace, and rarer still for such comments to appear in groups.
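To see what the pattern does and does not catch, it can be tried outside of YARA. The following is a minimal sketch using Python's re module, not part of the rule itself. The name find_comments and the "legitimate" input are hypothetical, chosen only for illustration; the obfuscated input is the start of the sample above.

```python
import re

# Python spelling of the article's pattern /\/\*\S{5,25}\*\//.
# Forward slashes need no escaping here; bytes are used because
# scanners operate on raw file contents.
COMMENT_RE = re.compile(rb"/\*\S{5,25}\*/")

# Start of the sample shown earlier.
obfuscated = b"<?php /*d&m-(H@AnJ#(F5+*/parse_str"
# Hypothetical clean code with an ordinary comment (assumption).
legitimate = b"<?php /* Connect to the database. */ $db = connect();"


def find_comments(data: bytes) -> list:
    """Return every whitespace-free /* */ comment with a 5-25 character body."""
    return COMMENT_RE.findall(data)


print(find_comments(obfuscated))   # one match: the random comment
print(find_comments(legitimate))   # no match: the body contains spaces
```

The ordinary comment is skipped because its body starts with a space, so \S{5,25} cannot match at that position.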

The next requirement is to ensure there are at least two such comments per line. This could be achieved using regex alone, but that is less resource efficient than using YARA's conditional logic operators. Here is the condition:

PLAINTEXT

condition:
  #re > 1 and
  @re[2] - (@re[1] + !re[1]) <= 64

The number of comment matches must be at least two, which for this malware is certain. The condition then measures the gap between the end of the first match (@re[1] + !re[1]) and the start of the second (@re[2]), and requires it to be at most 64 bytes to ensure close proximity. This does not guarantee that both comments are on the same line, but that is not a strict requirement.
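The arithmetic in the condition can be emulated in a few lines of Python for experimentation. This is only an approximation under stated assumptions: rule_matches and MAX_GAP are hypothetical names, the inputs are made up, and Python's finditer returns non-overlapping matches, which can differ from YARA's match list in edge cases such as comments that directly adjoin each other.

```python
import re

COMMENT_RE = re.compile(rb"/\*\S{5,25}\*/")
MAX_GAP = 64  # byte threshold taken from the condition


def rule_matches(data: bytes) -> bool:
    """Approximate: #re > 1 and @re[2] - (@re[1] + !re[1]) <= 64."""
    matches = list(COMMENT_RE.finditer(data))
    if len(matches) < 2:                 # #re > 1
        return False
    first, second = matches[0], matches[1]
    # first.end() plays the role of @re[1] + !re[1],
    # second.start() plays the role of @re[2].
    gap = second.start() - first.end()
    return gap <= MAX_GAP


# Hypothetical inputs (assumptions, not taken from real malware).
close_pair = b"<?php /*aB3$xQ9!*/ eval /*Zk7#pL2@*/ ("
far_pair = b"/*aB3$xQ9!*/" + b" " * 100 + b"/*Zk7#pL2@*/"
single = b"<?php /*aB3$xQ9!*/ echo 1;"

print(rule_matches(close_pair))  # True: gap is 6 bytes
print(rule_matches(far_pair))    # False: gap is 100 bytes
print(rule_matches(single))      # False: only one match
```

As in the condition, only the first two matches are compared, so a file whose first two random comments are far apart would not be flagged by this sketch.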

A single clean and concise expression, combined with built-in YARA conditions, results in a powerful rule that can detect not only this specific sample but also variants of it with high accuracy. No other aspects of the malware need consideration. It can even detect entirely different malware, as long as the same commonly used obfuscation technique is involved.

There is no need for overly complex rules when the power of YARA is fully understood. It is not a regex-only finite state machine.

Thanks for reading!