Concurrency of a find, hash value, and replace across a large number of rows
FWIW I think this is the fastest way you could do it in a shell script:
$ cat tst.sh
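#!/usr/bin/env bash
for file in "$@"; do
    # Split each line on double quotes; a[1] is the value to hash
    while IFS='"' read -ra a; do
        sha=$(printf '%s' "${a[1]}" | sha1sum)
        # sha1sum prints "<hash>  -"; %% strips everything from the first space on
        sha="${sha%% *}"
        printf '%s"%s"%s"%s"%s"%s"%s"\n' "${a[0]}" "$sha" "${a[2]}" "${a[3]}" "${a[4]}" "$sha" "${a[6]}"
    done < "$file"
done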
$ ./tst.sh file
$ cat file
"e8bb6adbb44a2f4c795da6986c8f008d05938fac" : ["200000", "e8bb6adbb44a2f4c795da6986c8f008d05938fac"]"
"aaac41fe0491d5855591b849453a58c206d424df" : ["200000", "aaac41fe0491d5855591b849453a58c206d424df"]"
But, as I mentioned in the comments, you'd be better off, for speed of execution, using a tool with SHA-1 hashing built in, e.g. Python, since the loop above starts a new sha1sum process for every row.
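To make that concrete, here is a minimal Python sketch of the same transformation as tst.sh, assuming each line has the shape shown in the output above (the key is the first quoted string and is repeated as the third). The names `hash_line` and `hash_stream` are mine, not from the question. Because `hashlib.sha1` runs in-process, no external `sha1sum` is started per row.

```python
import hashlib


def hash_line(line):
    """Replace the 1st and 3rd quoted strings with the SHA-1 of the 1st.

    Mirrors tst.sh: split on double quotes, so parts[1] and parts[5]
    correspond to a[1] and a[5] in the bash array.
    """
    text = line.rstrip("\n")
    parts = text.split('"')
    if len(parts) < 6:
        # Not in the assumed format; pass the line through unchanged.
        return text
    digest = hashlib.sha1(parts[1].encode()).hexdigest()
    parts[1] = digest
    parts[5] = digest
    # Joining on the quote restores every other character as it was.
    return '"'.join(parts)


def hash_stream(src, dst):
    """Apply hash_line to every line of src, writing the result to dst."""
    for line in src:
        dst.write(hash_line(line) + "\n")
```

Calling `hash_stream(sys.stdin, sys.stdout)` at the bottom of the script (after `import sys`) would turn it into a filter that can be used like tst.sh with redirected input.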
As advised by Ed Morton, with a little help from Python.
Create a Python script /tmp/sha1.py and make it executable:
#! /usr/local/bin/python -u
import hashlib
import sys
# For every input line, replace the first word with its SHA-1 and echo the line back
for line in sys.stdin:
    words = line.split()
    str_hash = hashlib.sha1(words[0].encode())
    words[0] = str_hash.hexdigest()
    print(" ".join(words))
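The first line should contain the correct location of your Python interpreter, but don't remove the "-u": it keeps Python's output unbuffered, so each hash is sent back as soon as it is computed instead of sitting in a buffer while the ksh script waits for it.
Then a ksh script, which you should also make executable: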
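#! /usr/bin/ksh
# Start sha1.py as a co-process: write to it with >&p, read from it with <&p
/tmp/sha1.py |&
for y in files*
do
    while read -r A B
    do
        # Strip the surrounding double quotes without eval, so the data is never executed
        A=${A#\"}
        A=${A%\"}
        print -r -- "$A" >&p
        read -r A <&p
        print -r -- "\"$A\" $B"
    done < "$y" > "TMP.$y"
    mv "TMP.$y" "$y"
done
# terminate sha1.py
exec 3>&p
exec 3>&-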
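The co-process exchange above is a strict one-line-in, one-line-out protocol: the caller writes a value, then blocks until the worker answers. For readers without ksh, a rough Python equivalent of the calling side might look like the sketch below. It assumes Python 3.7 or later and uses an inline stand-in (`WORKER`) for /tmp/sha1.py so that the example is self-contained; `hash_via_worker` is a hypothetical name.

```python
import subprocess
import sys

# Stand-in for /tmp/sha1.py (an assumption, so the sketch needs no extra file).
WORKER = r'''
import hashlib, sys
for line in sys.stdin:
    words = line.split()
    if words:
        words[0] = hashlib.sha1(words[0].encode()).hexdigest()
    print(" ".join(words))
'''


def hash_via_worker(values):
    """Send each value to one long-lived worker and collect its replies."""
    proc = subprocess.Popen(
        [sys.executable, "-u", "-c", WORKER],  # -u: worker replies unbuffered
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,
    )
    results = []
    try:
        for value in values:
            proc.stdin.write(value + "\n")
            proc.stdin.flush()  # make sure the request really leaves our side
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()  # EOF on stdin ends the worker's loop
        proc.stdout.close()
        proc.wait()
    return results
```

Without `-u` the worker's replies would sit in its stdout buffer, and the first `readline()` would block forever, which is the same reason the flag must stay in the shebang line of sha1.py.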
Now, if you want performance, you should let Python handle a complete file at once. The following script treats each input line as a filename and does your dirty work:
#! /usr/local/bin/python
import hashlib
import os
import sys
# Each line on stdin names a file to rewrite
for IFileNmX in sys.stdin:
    IFileNm = IFileNmX.strip()
    OFileNm = ".".join(["TMP", IFileNm])
    with open(IFileNm, 'r') as IFile, open(OFileNm, 'w') as OFile:
        # Iterate over the file directly so it is not loaded into memory all at once
        for line in IFile:
            words = line.split()
            word1 = words[0].strip('"')
            str_hash = hashlib.sha1(word1.encode())
            words[0] = "".join(['"', str_hash.hexdigest(), '"'])
            OFile.write("".join([" ".join(words), '\n']))
    # Replace the original only after the rewritten copy is complete
    os.rename(OFileNm, IFileNm)
If you call this script /tmp/sha1f.py, and make it executable, I wonder how many minutes
ls files* | /tmp/sha1f.py
would take. My system took 12 seconds to deal with a 400 MB file of a million lines. But that's boasting, of course.
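Since each file is rewritten independently, the per-file approach also lends itself to handling several files at once, which is what the question's title asks about. The following is only a sketch under that assumption, using `concurrent.futures.ProcessPoolExecutor` from the standard library. The names `hash_first_field`, `hash_file`, and `hash_files` are illustrative, and whether it actually beats a single process will depend on the disk and the number of cores.

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor


def hash_first_field(line):
    """Return the line with its first word replaced by a quoted SHA-1."""
    words = line.split()
    if not words:
        return line  # leave blank lines alone
    key = words[0].strip('"')
    words[0] = '"' + hashlib.sha1(key.encode()).hexdigest() + '"'
    return " ".join(words) + "\n"


def hash_file(path):
    """Rewrite one file via a temporary copy, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(path) as src, open(tmp_path, "w") as dst:
        for line in src:
            dst.write(hash_first_field(line))
    os.replace(tmp_path, path)
    return path


def hash_files(paths, workers=None):
    """Process several files in parallel, one worker process per file at a time."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # list() waits for all files and re-raises any worker exception.
        return list(pool.map(hash_file, paths))
```

A call such as `hash_files(glob.glob("files*"))` should sit under an `if __name__ == "__main__":` guard, because some platforms start worker processes by re-importing the script.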