Replace MAC address with UUID
The following perl script uses either the Digest::MD5 or Digest::SHA module to transform a MAC address to a hash, using a secret salt. See the man pages of the modules for more details on them. It's worth noting that Digest::SHA has several more algorithms to choose from.
The code is written to make it easy to choose a different hashing algorithm: uncomment one and comment out the others to choose whichever suits you best. BTW, the output from the _base64 versions of the functions is a little shorter than that of the _hex functions (22 versus 32 characters for MD5) but looks more like line noise.
I simplified your provided regex (couldn't see any need for a look-behind). You may need to tweak that a bit to work with your input data. You didn't provide any sample, so I just guessed.
#!/usr/bin/perl
use strict;
use warnings;

# choose one of the following digest modules:
use Digest::MD5 qw(md5_hex md5_base64);
#use Digest::SHA qw(sha256_hex sha256_base64);

my $salt = 'secret salt phrase';

# store seen MAC addresses in a hash so we only have to calculate the digest
# for them once. This speed optimisation is only useful if the input file
# is large AND any given MAC address may be seen many times.
my %macs = ();

while (<>) {
    if (m/clientMac:\s*([A-Z0-9]{12})/i) {
        my $mac = $1;
        if (!defined($macs{$mac})) {
            # choose one of the following digest conversions:
            #my $uuid = sha256_hex($mac . $salt);
            #my $uuid = sha256_base64($mac . $salt);
            my $uuid = md5_hex($mac . $salt);
            #my $uuid = md5_base64($mac . $salt);
            $macs{$mac} = $uuid;
        }
        # no /o flag here: it would compile the pattern only once and so
        # keep matching the first MAC seen instead of the current one
        s/(clientMac:\s*)\Q$mac\E/$1$macs{$mac}/gi;
    }
    print;
}
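A quick way to spot-check the script's output is to compute the same digest for a single address from the shell. This is only a minimal sketch, assuming GNU coreutils' md5sum is available; the function name mac_digest and the sample MAC are made up for illustration. It hashes the MAC immediately followed by the salt, so it should print the same value as md5_hex($mac . $salt) in the script.

```shell
#!/bin/sh
# Hypothetical helper: hex MD5 of the MAC followed by the salt, i.e. the
# shell counterpart of Perl's md5_hex($mac . $salt).
mac_digest() {  # usage: mac_digest MAC SALT
    # printf avoids the trailing newline that echo would add to the hashed data
    printf '%s%s' "$1" "$2" | md5sum | cut -d' ' -f1
}

mac_digest 'AABBCCDDEEFF' 'secret salt phrase'
```

If you switch the script to sha256_hex, replacing md5sum with sha256sum in the helper should give the matching value.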
As requested in the comment, here is an example of how to perform such a substitution with sed. You used the linux tag, so it should be safe to use GNU sed with its e flag for the s command:
sed -E 'h;s/.*clientMac":\s"([A-Z0-9]{12}).*/echo secretKey\1|md5sum/e;T
G;s/([0-9a-f]+)\s*-\n(.*clientMac":\s")[A-Z0-9]{12}(.*)/\2\1\3/' logfile
Explanation:
- The h command saves the line to the hold space, so we can restore it after messing up the line (-;
- s/.*clientMac":\s"([A-Z0-9]{12}).*/echo secretKey\1|md5sum/e matches the whole line, putting the actual MAC in () to be reused in the replacement. The replacement forms the command to be executed: echoing the MAC along with the "salt" and piping it into md5sum. The e flag makes sed execute this in the shell and put the result back in the buffer.
- T branches to the end of the script if no replacement was made. This is to print lines without a MAC unmodified. The following commands are executed only if a replacement was made.
- G appends the original line from the hold buffer, so now we have the md5sum output, a newline and the original line in the buffer.
- s/([0-9a-f]+)\s*-\n(.*clientMac":\s")[A-Z0-9]{12}(.*)/\2\1\3/ captures the MD5 in the first pair of () (hex digits only, so the spaces and the - that md5sum prints after the digest are left out), the line before the MAC in the second and the rest of the line after the MAC in the third, thus \2\1\3 replaces the MAC with the MD5.
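The steps above can be retraced by hand for one line, which also shows why the last substitution has to get rid of the trailing - that md5sum prints. This is a rough sketch only: the sample line and the variable names are assumptions, since the real log format was not shown.

```shell
#!/bin/sh
line='{"clientMac": "AABBCCDDEEFF", "ssid": "guest"}'   # assumed sample input
mac='AABBCCDDEEFF'

# What the e flag runs: for stdin, md5sum prints the digest, two spaces and "-".
raw=$(echo "secretKey$mac" | md5sum)
digest=${raw%% *}            # keep only the hex digest

# What the final substitution achieves: the MAC swapped for the digest.
result=$(printf '%s\n' "$line" | sed "s/$mac/$digest/")

printf '%s\n' "$raw" "$result"
```

Note that echo appends a newline, so the hashed data is the key, the MAC and a newline, exactly as in the sed command.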
As an alternative approach, sometimes I used simple line numbers as obfuscation value. This makes the output more compact and more readable.
Also, awk is a good tool when one needs to perform "smart" operations on a text file, having a more readable language than sed. The "smart" operation to perform in this case is to avoid re-executing the obfuscation algorithm when any one MAC address is encountered more than once. This can speed up operations quite a lot if you have thousands of lines referring to a small number of MAC addresses.
In practice, consider the following script, which also handles possible multiple MAC addresses occurring on any one line, identifying and replacing each occurrence, and then prints a mapping table at the end:
awk -v pat='clientMac"\\s*"[[:xdigit:]]{12}' -v table='sort -k 1,1n | column -t' -- '
    $0 ~ pat {
        for (i=1; i <= NF; i++)
            if (match($i, pat)) {
                if (!($i in cache))
                    cache[$i]=NR "." i
                $i = "MAC:" cache[$i]
            }
    }
    1
    END {
        print "---Table: "FILENAME"\nnum MAC" | table
        for (mac in cache)
            print cache[mac], mac | table
    }
' file.log
The table at the end can be easily separated from the main output by an additional editing step, or by just making the command string in the -v table= argument redirect its output to a file, like in -v table='sort -k 1,1n | column -t > table'. It can also be removed altogether by just removing the entire END{ … } block.
As a variant, using a real encryption engine to compute obfuscation values and hence with no mapping table at the end:
awk -v pat='clientMac"\\s*"[[:xdigit:]]{12}' -v crypter='openssl enc -aes-256-cbc -a -pass file:mypassfile' -- '
    $0 ~ pat {
        for (i=1; i <= NF; i++)
            if (match($i, pat)) {
                addr = cache[$i]
                if (addr == "") {
                    "echo '\''" $i "'\'' | " crypter | getline addr
                    cache[$i] = addr
                }
                $i = "MAC:" addr
            }
    }
    1
' file.log
Here I used openssl as the encryption engine, selecting its aes-256-cbc cipher (with base64-encoded output in order to be text-friendly), and making it read the encryption secret from a file named mypassfile.
Strings encrypted with a symmetric cipher (like aes-256-cbc) can be decrypted by knowing the secret used (the contents of mypassfile, which you want to keep to yourself), therefore they can be reversed. Also, since openssl uses a random salt by default, each run produces different values for the same input. Not using a salt (option -nosalt) would make openssl produce the same value on each run, which is less secure, but on the other hand would produce shorter texts while still being encrypted.
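To illustrate the reversibility, here is a small round-trip sketch using the same cipher options as the crypter command. The helper names and the throwaway pass file are assumptions made up for this example; in real use you would point them at your own mypassfile.

```shell
#!/bin/sh
# Hypothetical helpers; the second argument is the file holding the secret.
# stderr is discarded because newer openssl versions print a key-derivation
# warning for these options.
encrypt_value() {  # usage: encrypt_value TEXT PASSFILE
    echo "$1" | openssl enc -aes-256-cbc -a -pass "file:$2" 2>/dev/null
}
decrypt_value() {  # usage: decrypt_value BASE64TEXT PASSFILE
    echo "$1" | openssl enc -d -aes-256-cbc -a -pass "file:$2" 2>/dev/null
}

passfile=$(mktemp)                   # throwaway stand-in for mypassfile
echo 'example secret' > "$passfile"

enc=$(encrypt_value 'AABBCCDDEEFF' "$passfile")
echo "$enc"                          # differs on every run (random salt)
decrypt_value "$enc" "$passfile"     # recovers the original text
rm -f "$passfile"
```

Anyone holding the pass file can do the same with the values in your obfuscated log, which is exactly why the secret has to stay private.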
The same awk script would work for other external commands instead of openssl by just replacing the command in the -v crypter= argument to awk, as long as the external command you choose can accept input from stdin and print output to stdout.
Strings hashed with algorithms like MD5 or SHA instead are one-way only (i.e. they can't be reversed), and always produce the same value for the same input, therefore you'd want to "salt" them so that the computed values produced in output can't just be searched over all possible MAC addresses. You might add a random "salt" as in the following slightly modified script:
awk -v pat='clientMac"\\s*"[[:xdigit:]]{12}' -v crypter='sha256sum' -- '
    $0 ~ pat {
        for (i=1; i <= NF; i++)
            if (match($i, pat)) {
                addr = cache[$i]
                if (addr == "") {
                    "(dd if=/dev/random bs=16 count=1 2>/dev/null; echo '\''" $i "'\'') | " crypter | getline addr
                    cache[$i] = addr
                }
                $i = "MAC:" addr
            }
    }
    1
' file.log
This latter script uses a 16-byte (pseudo-)random value as "salt", thus producing a different hash value on each run over the same data.
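If you would rather be able to reproduce a run later, one possible variation (not what the script above does, which draws fresh random bytes for every address it hashes) is to draw a single random salt per run and keep it somewhere safe. A minimal sketch, assuming sha256sum, od and /dev/urandom are available; the helper name salted_hash is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: SHA-256 of the salt followed by the value.
salted_hash() {  # usage: salted_hash SALT VALUE
    printf '%s%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

# One random 16-byte salt per run, hex-encoded so it can be printed or saved.
salt=$(od -An -tx1 -N16 /dev/urandom | tr -d ' \n')

salted_hash "$salt" 'AABBCCDDEEFF'
salted_hash "$salt" 'AABBCCDDEEFF'    # same salt, so the same value again
```

With the salt saved, the same address maps to the same value across runs; with the salt discarded, the behaviour is closer to the script above, in that a later run cannot reproduce the values.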