How to know if a text file is a subset of another
If those file contents are called file1, file2 and file3 in order of appearance, then you can do it with the following one-liner:
# python -c "x=open('file1').read(); y=open('file2').read(); print(x in y or y in x)"
True
# python -c "x=open('file2').read(); y=open('file1').read(); print(x in y or y in x)"
True
# python -c "x=open('file1').read(); y=open('file3').read(); print(x in y or y in x)"
False
With perl:
if perl -0777 -e '$n = <>; $h = <>; exit(index($h,$n)<0)' needle.txt haystack.txt
then echo needle.txt is found in haystack.txt
fi
-0octal defines the record delimiter. When that octal number is greater than 0377 (the maximum byte value), there is no delimiter, which is equivalent to setting $/ = undef. In that case, <> returns the full content of a single file; this is slurp mode.
Once we have the content of the files in the two variables $h and $n, we can use index() to determine whether one is found in the other.
However, this means both files are held entirely in memory, so the method won't work for very large files.
For mmappable files (usually regular files and most seekable files such as block devices), that can be worked around by using mmap() on the files, for example with the Sys::Mmap perl module:
if
perl -MSys::Mmap -le '
open N, "<", $ARGV[0] or die "$ARGV[0]: $!";
open H, "<", $ARGV[1] or die "$ARGV[1]: $!";
mmap($n, 0, PROT_READ, MAP_SHARED, N);
mmap($h, 0, PROT_READ, MAP_SHARED, H);
exit (index($h, $n) < 0)' needle.txt haystack.txt
then
echo needle.txt is found in haystack.txt
fi
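For comparison, the same mmap idea can be sketched in Python, the language of the first answer. This is only an illustration, not part of the answer above: the function name file_contains is made up here, and the size checks assume regular files, since os.path.getsize may report 0 for block devices.

```python
import mmap
import os


def file_contains(needle_path, haystack_path):
    """Return True if the bytes of needle_path occur contiguously in haystack_path."""
    needle_size = os.path.getsize(needle_path)
    haystack_size = os.path.getsize(haystack_path)
    # An empty needle is trivially contained, and mmap cannot map an empty file.
    if needle_size == 0:
        return True
    # A needle longer than the haystack can never be found in it.
    if needle_size > haystack_size:
        return False
    with open(needle_path, "rb") as n, open(haystack_path, "rb") as h:
        # Map both files read-only; the kernel pages data in on demand
        # instead of the script reading whole files into memory.
        with mmap.mmap(n.fileno(), 0, access=mmap.ACCESS_READ) as needle, \
             mmap.mmap(h.fileno(), 0, access=mmap.ACCESS_READ) as haystack:
            return haystack.find(needle) != -1
```

Calling file_contains("needle.txt", "haystack.txt") mirrors the perl example: it returns True when needle.txt is found in haystack.txt. To test in both directions, as the one-liner does, call it twice with the arguments swapped.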
I found a solution thanks to this question.
Basically I am testing two files, a.txt and b.txt, with this script:
#!/bin/bash
first_cmp=$(diff --unchanged-line-format= --old-line-format= --new-line-format='%L' "$1" "$2" | wc -l)
second_cmp=$(diff --unchanged-line-format= --old-line-format= --new-line-format='%L' "$2" "$1" | wc -l)
if [ "$first_cmp" -eq 0 ] || [ "$second_cmp" -eq 0 ]
then
echo "Subset"
exit 0
else
echo "Not subset"
exit 1
fi
If one file is a subset of the other, the script returns 0 (true); otherwise it returns 1.
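Note that this is a line-based test, not a substring test: as far as I can tell, diff reports no new lines when every line of one file appears in the other in the same order, possibly with other lines in between. A rough Python sketch of that relation follows, assuming this reading of the script is right. The function names are hypothetical, and diff's own heuristics may differ on large inputs.

```python
def lines_in_order(small_path, big_path):
    """Return True if every line of small_path occurs in big_path, in the same order."""
    with open(small_path, "rb") as small, open(big_path, "rb") as big:
        remaining = iter(big)
        # "line in remaining" consumes the iterator up to the first match,
        # so each later line must be found after the previous one.
        return all(line in remaining for line in small)


def is_line_subset(path_a, path_b):
    """Return True if either file's lines all occur, in order, in the other file."""
    return lines_in_order(path_a, path_b) or lines_in_order(path_b, path_a)
```

Lines are compared as raw bytes including the newline, so a final line without a trailing newline will not match the same text followed by one.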