How can I reject base64 encoded spam email?
Solution 1:
Don't do this with Postfix body_checks; write a SpamAssassin rule for it instead. SpamAssassin decodes the message body before applying its rules. Something like:
body LOCAL_QUANZHOUCOOWAY /Quanzhoucooway/
score LOCAL_QUANZHOUCOOWAY 7.0
describe LOCAL_QUANZHOUCOOWAY Block word Quanzhoucooway
These rules belong in /etc/mail/spamassassin/local.cf (or ~/.spamassassin/user_prefs).
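To see why decoding first matters, here is a minimal Python sketch. The test message, its headers and the variable names are made up for illustration. It only shows that the keyword is invisible in the raw base64 body and visible once the body is decoded, which is the step a decoding filter performs before applying its rules.

```python
import base64
from email import message_from_string

KEYWORD = "Quanzhoucooway"

# Hypothetical test message: body text and headers are invented for this demo.
body_text = "Dear customer, greetings from Quanzhoucooway trading company.\n"
encoded_body = base64.encodebytes(body_text.encode("utf-8")).decode("ascii")
raw_message = (
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded_body
)

msg = message_from_string(raw_message)
raw_payload = msg.get_payload()                          # still base64 text
decoded = msg.get_payload(decode=True).decode("utf-8")   # transfer encoding undone

found_in_raw = KEYWORD in raw_payload
found_in_decoded = KEYWORD in decoded
print("raw body:", found_in_raw, "/ decoded body:", found_in_decoded)
```

A pattern that only ever sees the raw lines is in the first situation; a filter that decodes the body is in the second.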
Solution 2:
Technically, you could directly filter the base64 encoded data for keywords. I'm not saying it's a practical or a reasonable thing to do, given the existence of better and simpler alternatives (as described e.g. in Esa's answer above), but it is possible.
The trick is to realize that base64 encoding is a deterministic mapping of 3-byte blocks of raw unencoded data into 4-character blocks of base64 characters. Thus, any time a certain sequence of 3-byte blocks appears in the unencoded data, the same sequence of 4-character blocks will appear in the encoded version.
For example, if you enter the string "Quanzhoucooway" into a base64 encoder, you'll get the output "UXVhbnpob3Vjb293YXk=". Since the length of the input is not a multiple of 3 bytes, the output contains some padding at the end, but if we drop the final "=" sign and the last actual base64 character "k" (since it also encodes some padding bits), we get the string "UXVhbnpob3Vjb293YX", which is guaranteed to appear in the base64-encoded data whenever the byte triplets "Qua", "nzh", "ouc", "oow" and the partial triplet "ay" appear in the input in that order.
But, of course, the string "Quanzhoucooway" might not start exactly on a triplet boundary. For example, if we encode the string "XQuanzhoucooway" instead, we get the output "WFF1YW56aG91Y29vd2F5", which looks completely different. This time, the input length is divisible by three, so there are no padding characters to discard at the end, but we do need to discard the first two characters ("WF"), which each encode some of the bits from the prepended "X" byte, leaving us with "F1YW56aG91Y29vd2F5".
Finally, base64 encoding "XXQuanzhoucooway" gives the output "WFhRdWFuemhvdWNvb3dheQ==", which has padding at both ends. Removing the first three characters "WFh" (which encode the "XX" prefix) and the last three characters "Q==" (which encode the zero bit padding at the end), we're left with the string "RdWFuemhvdWNvb3dhe". Thus, we obtain the following three base64-encoded strings:
UXVhbnpob3Vjb293YX
F1YW56aG91Y29vd2F5
RdWFuemhvdWNvb3dhe
of which (at least) one must appear in the base64 encoded form of any input string containing the word "Quanzhoucooway".
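The three alignments above can be reproduced with a few lines of Python. This is only a sketch of the procedure just described; the function name base64_fragments and the choice of "X" as the filler byte are assumptions made for illustration.

```python
import base64

def base64_fragments(keyword: bytes) -> list:
    """Return the base64 fragments of keyword for the three triplet alignments."""
    fragments = []
    for shift in range(3):
        # Prepend 0, 1 or 2 filler bytes to move the keyword off the boundary.
        data = b"X" * shift + keyword
        encoded = base64.b64encode(data).decode("ascii").rstrip("=")
        # Leading characters that carry bits of the filler bytes: 0, 2 or 3.
        head = (0, 2, 3)[shift]
        # If the input length is not a multiple of 3, the last remaining
        # character also encodes zero padding bits, so drop it too.
        tail = 1 if len(data) % 3 else 0
        fragments.append(encoded[head:len(encoded) - tail])
    return fragments

for fragment in base64_fragments(b"Quanzhoucooway"):
    print(fragment)
```

For "Quanzhoucooway" this prints the same three strings as listed above.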
Of course, if you're unlucky, the base64 encoder may insert a line break in the middle of them, between any two encoded triplets. (Your example message, for instance, has one between "F1YW56" and "aG91Y29vd2F5".) Thus, to reliably match these strings with regexps, you'd need something like the following (using PCRE syntax):
/UXVh\s*bnpo\s*b3Vj\s*b293\s*YX/ DISCARD
/F1\s*YW56\s*aG91\s*Y29v\s*d2F5/ DISCARD
/R\s*dWFu\s*emhv\s*dWNv\s*b3dh\s*e/ DISCARD
Generating these patterns by hand is kind of tedious, but it wouldn't be hard to write a simple script to do it in your favorite programming language, at least as long as it provides a base64 encoder.
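As a rough idea of what such a script could look like, here is a Python sketch. The function names are made up, and the DISCARD action is simply copied from the rules above. It splits each fragment at the encoder's 4-character group boundaries and allows optional whitespace between groups.

```python
import base64
import re

def _escape(group: str) -> str:
    # "+" is a regex metacharacter and "/" is the pattern delimiter used above.
    return group.replace("+", r"\+").replace("/", r"\/")

def base64_patterns(keyword: bytes) -> list:
    """Build one whitespace-tolerant pattern per triplet alignment."""
    patterns = []
    for shift in range(3):
        data = b"X" * shift + keyword
        encoded = base64.b64encode(data).decode("ascii").rstrip("=")
        head = (0, 2, 3)[shift]              # characters tainted by filler bytes
        tail = 1 if len(data) % 3 else 0     # character tainted by padding bits
        end = len(encoded) - tail
        # Cut at multiples of 4, where the encoder may insert a line break.
        cuts = [head] + list(range(4, end, 4)) + [end]
        groups = [encoded[a:b] for a, b in zip(cuts, cuts[1:])]
        patterns.append(r"\s*".join(_escape(g) for g in groups))
    return patterns

for pattern in base64_patterns(b"Quanzhoucooway"):
    print(f"/{pattern}/ DISCARD")

# Quick self-check against a wrapped body where a line break splits the keyword.
wrapped = base64.encodebytes(b"A" * 50 + b"Quanzhoucooway" + b" tail").decode("ascii")
print(any(re.search(p, wrapped) for p in base64_patterns(b"Quanzhoucooway")))
```

The self-check uses Python's re module rather than PCRE; for patterns this simple the two behave the same.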
If you really wanted, you could even implement case-insensitive matching by base64 encoding both the lowercase and the uppercase version of the keyword and combining them into a regexp that matches any combination of them. For example, the base64 encoding of "quanzhoucooway" is "cXVhbnpob3Vjb293YXk=" while that of "QUANZHOUCOOWAY" is "UVVBTlpIT1VDT09XQVk=", so the rule:
/[cU][XV]V[hB]\s*[bT][nl]p[oI]\s*[bT][31]V[jD]\s*[bT][20]9[3X]\s*[YQ][XV]/ DISCARD
will match the base64 encoded word "Quanzhoucooway" in any case, provided that it begins on a triplet boundary. Generating the other two corresponding regexps for the shifted versions is left as an exercise. ;)
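For anyone who would rather not do the exercise by hand, here is a hedged Python sketch that builds all three case-insensitive patterns. The function names are invented, and it assumes a keyword made of plain ASCII letters. It works because each base64 character carries the case bit (0x20) of at most one input byte, so a two-character class per position covers every mixed-case spelling.

```python
import base64
import re

def _groups(data: bytes, shift: int) -> list:
    """Encode filler + data and split the usable part into 4-character groups."""
    padded = b"X" * shift + data
    encoded = base64.b64encode(padded).decode("ascii").rstrip("=")
    head = (0, 2, 3)[shift]
    end = len(encoded) - (1 if len(padded) % 3 else 0)
    cuts = [head] + list(range(4, end, 4)) + [end]
    return [encoded[a:b] for a, b in zip(cuts, cuts[1:])]

def _escape(ch: str) -> str:
    return {"+": r"\+", "/": r"\/"}.get(ch, ch)

def case_insensitive_patterns(keyword: str) -> list:
    patterns = []
    for shift in range(3):
        lower = _groups(keyword.lower().encode("ascii"), shift)
        upper = _groups(keyword.upper().encode("ascii"), shift)
        parts = []
        for lo_group, up_group in zip(lower, upper):
            # Same character in both encodings: emit it; otherwise a 2-way class.
            parts.append("".join(
                _escape(lo) if lo == up else "[" + _escape(lo) + _escape(up) + "]"
                for lo, up in zip(lo_group, up_group)
            ))
        patterns.append(r"\s*".join(parts))
    return patterns

patterns = case_insensitive_patterns("Quanzhoucooway")
for pattern in patterns:
    print(f"/{pattern}/ DISCARD")

# Mixed-case spelling, shifted one byte off the triplet boundary.
sample = base64.b64encode(b"zqUaNzHoUcOoWaY").decode("ascii")
print(bool(re.search(patterns[1], sample)))
```

The first pattern it prints is the rule shown above; the other two are the shifted versions.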
Alas, doing anything more complicated than simple substring matching like this quickly becomes impractical. But at least it's a neat trick. In principle, it could even be useful, if you for some reason could not use SpamAssassin or any other filter that can decode the base64 encoding before filtering. But if you can do that, instead of using hacks like this, you certainly should.