Printing lines from one file if part of them appears in another. Both files are millions of lines long
You can do this very easily using grep:
$ grep -Ff 123.txt 789.txt
http://www.a.com/kgjdk-jgjg/
http://www.b.com/gsjahk123/
http://www.c.com/abc.txt
The command above will print all lines from file 789.txt that contain any of the lines from 123.txt. The -f means "read the patterns to search for from this file", and the -F tells grep to treat the search patterns as fixed strings rather than as regular expressions, which is its default.
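To make the effect of -F concrete, here is a minimal sketch on made-up data. The scratch directory, the file names patterns.txt and urls.txt, and their contents are assumptions for illustration only, not the files from the question.

```shell
# Hypothetical sample data in a scratch directory (not the asker's real files).
tmp=$(mktemp -d)
printf '%s\n' 'a.com' 'b.com' > "$tmp/patterns.txt"
printf '%s\n' 'http://www.a.com/x' 'http://www.axcom/y' 'http://www.c.com/z' > "$tmp/urls.txt"

# With -F each pattern is a literal string, so "." only matches a real dot.
fixed=$(grep -Ff "$tmp/patterns.txt" "$tmp/urls.txt")

# Without -F the patterns are regular expressions, so "a.com" also matches "axcom".
regex=$(grep -f "$tmp/patterns.txt" "$tmp/urls.txt")

printf 'fixed:\n%s\nregex:\n%s\n' "$fixed" "$regex"
rm -rf "$tmp"
```

With URLs as patterns, the literal matching of -F is usually what you want, since characters such as . and ? would otherwise be interpreted as regex operators.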
This will not work if the lines of 123.txt contain trailing spaces: grep will treat the spaces as part of the pattern to look for, so the pattern will not match where the text occurs within a word. For example, the pattern "foo " (note the trailing space) will not match foobar. To remove trailing spaces from your file, run this command:
$ sed 's/ *$//' 123.txt > new_file
Then use new_file as the pattern file for grep:
$ grep -Ff new_file 789.txt
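The trailing-space problem and the fix can be reproduced with a small sketch. The scratch files patterns.txt, data.txt and clean.txt are hypothetical stand-ins for 123.txt, 789.txt and new_file.

```shell
# Hypothetical stand-ins for 123.txt (patterns) and 789.txt (data).
tmp=$(mktemp -d)
printf 'foo \n' > "$tmp/patterns.txt"   # pattern with a trailing space
printf 'foobar\n' > "$tmp/data.txt"

# The trailing space is part of the pattern, so "foobar" does not match.
# grep exits non-zero when nothing matches, hence the "|| true".
before=$(grep -cFf "$tmp/patterns.txt" "$tmp/data.txt" || true)

# Strip trailing spaces from the patterns, then count matches again.
sed 's/ *$//' "$tmp/patterns.txt" > "$tmp/clean.txt"
after=$(grep -cFf "$tmp/clean.txt" "$tmp/data.txt" || true)

echo "matching lines before: $before, after: $after"
rm -rf "$tmp"
```

One thing to watch for: if a line in the pattern file consists only of spaces, stripping them leaves an empty pattern, and an empty pattern matches every line.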
You can also do this without a new file, using sed's -i flag, which edits the file in place:
$ sed -i.bak 's/ *$//' 123.txt
This will change file 123.txt and keep a copy of the original called 123.txt.bak.
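(Note that -i is not part of POSIX sed. The attached form -i.bak shown here is the GNU sed syntax; BSD sed documents the suffix as a separate argument, -i .bak, with a space in between, although it generally accepts the attached form as well.)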
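A quick way to check what the in-place edit does before running it on a file with millions of lines is to try it on a throwaway copy. This sketch assumes a sed that supports -i with an attached suffix (GNU sed does); list.txt is a hypothetical file name.

```shell
# Throwaway file standing in for 123.txt.
tmp=$(mktemp -d)
printf 'foo \n' > "$tmp/list.txt"

# Edit in place, keeping the original as list.txt.bak.
sed -i.bak 's/ *$//' "$tmp/list.txt"

# Brackets make the trailing space (or its absence) visible.
edited="[$(cat "$tmp/list.txt")]"
backup="[$(cat "$tmp/list.txt.bak")]"
echo "edited: $edited  backup: $backup"
rm -rf "$tmp"
```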
If the files are sorted, as in your example, and always follow that pattern, you could instead write:
join -t/ -1 3 -2 3 123.txt 789.txt |
sed -n 's,\([^/]*/\)\([^/]*://\)\2,\2\1,p'
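That would be the most efficient option, since join reads each sorted file only once and does not need to hold all the patterns in memory.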