Basic grep/awk help - extracting all lines containing a list of terms from one file into a separate file
To extract the lines from data.txt with the genes listed in genelist.txt:
grep -w -F -f genelist.txt data.txt > newdata.txt
grep options used:

-w                  tells grep to match whole words only (i.e. so ABC123 won't also match ABC1234)
-F                  search for fixed strings (plain text) rather than regular expressions
-f genelist.txt     read search patterns from the file
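For example, if genelist.txt and data.txt looked something like this (the contents below are just an illustration, assuming TAB-separated columns with the gene name first; the numbers are made up):

$ cat genelist.txt
Gene A
Gene C
$ cat data.txt
Gene	Sample 1	Sample 2
Gene A	1.5	2.0
Gene B	0.2	0.9
Gene C	0.7	3.1
Gene D	4.4	1.8
$ grep -w -F -f genelist.txt data.txt > newdata.txt
$ cat newdata.txt
Gene A	1.5	2.0
Gene C	0.7	3.1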
If you want the header (Sample 1, Sample 2, etc) line as well:
grep -w -F -f genelist.txt -e Sample data.txt > newdata.txt
-e Sample           also search for "Sample"
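With the illustrative data.txt sketched above (where the header line contains "Sample"), the header would then be carried through as well:

$ grep -w -F -f genelist.txt -e Sample data.txt
Gene	Sample 1	Sample 2
Gene A	1.5	2.0
Gene C	0.7	3.1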
To find lines in genelist.txt that aren't in newdata.txt:
grep -v -w -F -f <(sed -E -e 's/(\t|  +).*//' newdata.txt) genelist.txt
-v                  invert the search, print non-matching lines
The rest of the grep options are the same, but instead of using a file with the -f option, it's using something called process substitution, which allows you to use a command in place of an actual file. Whatever output the command produces is treated as the "file"'s contents.
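As a quick aside, process substitution is a bash/zsh feature. For example, this compares two sorted command outputs without creating any temporary files (the file names here are just placeholders):

$ diff <(sort file1.txt) <(sort file2.txt)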
In this case, we're using the command sed -E -e 's/(\t|  +).*//' newdata.txt, which outputs each line of newdata.txt after first deleting everything from either the first TAB character or the first pair of spaces it sees. In other words, it keeps only the first field (e.g. "Gene A"). I had to use TAB or double space because a) I wasn't sure if your data was space-separated or TAB-separated and b) the first fields in your example contained spaces, so a single space couldn't be used as the delimiter.
sed options used:

-E                      use extended regular expressions, so we can use plain (, ), and +, which are more readable than having to escape them with \ as \(, \), \+
-e 's/(\t|  +).*//'     specifies the sed script to apply against the input (newdata.txt)
Running that command on your sample data.txt would produce the following output:

$ sed -E -e 's/(\t|  +).*//' data.txt
Gene A
Gene B
Gene C
Gene D
Anyway, the output of that sed command is used as the list of search patterns by the grep command.
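Putting it together: if genelist.txt contained, say, a hypothetical Gene X that never made it into newdata.txt, the command would print just that line:

$ grep -v -w -F -f <(sed -E -e 's/(\t|  +).*//' newdata.txt) genelist.txt
Gene X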
To actually answer your question:
fgrep -w -f genelist.txt data.txt >results.txt
fgrep               looks for fixed strings, rather than regular expressions (as grep and egrep do)
-w                  tells fgrep to match whole words, so ABC123 won't match ABC1234
-f genelist.txt     tells fgrep to read search patterns from genelist.txt
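A quick way to see the effect of -w, using the IDs from the explanation above:

$ printf 'ABC123\nABC1234\n' | fgrep -w ABC123
ABC123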
Seeing which genes from genelist.txt were not included in the extraction is a little more complicated. One way to do it:
awk '{ print $1 }' results.txt | fgrep -w -v -f - genelist.txt >outsiders.txt
awk '{ print $1 }'  prints the first column in a text file; this is the list of matched genes
fgrep               again matches fixed strings
-w                  tells fgrep to match whole words
-v                  tells it to print lines that don't match
-f -                tells it to read the list of patterns from stdin, that is, the list of matched genes from awk
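For instance, assuming results.txt has single-word gene IDs in its first column (the contents below are made up for illustration):

$ cat results.txt
ABC123	0.5	1.2
DEF456	2.3	0.8
$ awk '{ print $1 }' results.txt
ABC123
DEF456

The fgrep -w -v -f - genelist.txt part then prints every line of genelist.txt whose gene is not in that list.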
You can also make things a little more efficient by eliminating duplicates from the list of matched genes before searching, by inserting sort -u between awk and fgrep:
awk '{ print $1 }' results.txt | sort -u | fgrep -w -v -f - genelist.txt >outsiders.txt
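As a small illustration of what sort -u does (made-up IDs):

$ printf 'ABC123\nDEF456\nABC123\n' | sort -u
ABC123
DEF456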
This is quite an undertaking without any previous Linux experience. However, I think I understand what you need, and it shouldn't be too difficult. Pardon me in advance: this is a very concise crash course and a very basic explanation, but I'd be happy to expand in detail if it doesn't make sense, or edit as necessary.
If you simply want to take the contents of data.txt and append them to another file, you could simply use cat data.txt >> newfile.txt. (newfile.txt is the other file you mentioned it going to - the name is arbitrary.)
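A minimal sketch of the difference between >> and >, using the same file names:

$ cat data.txt >> newfile.txt    # appends data.txt to the end of newfile.txt (creates it if missing)
$ cat data.txt > newfile.txt     # a single > would overwrite newfile.txt instead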
If you want to print out the lines for a specific name, you could use cat data.txt | grep ABCD123 >> newfile.txt and change ABCD123 to whatever you want.
This command will ONLY output the lines found using grep (kind of like a "search" function, but it works line by line).
The "|" is called piping, and when coupled with "grep" command, acts a little like a filter for whatever you're looking for. (cat zoofile.txt | grep pandas
for instance will look for all lines including the word "pandas" is a file names "zoofile." Note Linux IS CASE SENSITIVE and will only find EXACTLY what you put in. If you want ALL instances of either "panda, pandas, panderoons, or pandering, you could use pand*, where * is a wildcard and could be any character from 0 to 255 bits in length. This would pick up pand to pandzzzzzzzzzz and anything in between, including numbers).
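A tiny illustration of that substring matching and of case sensitivity (the animal lines are made up):

$ printf 'Giant panda\nred pandas\nPANDA FACTS\n' | grep pand
Giant panda
red pandas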
You can use awk for more fancy column parsing (it's one of my favorite tools!) but it doesn't seem like it would fit here unless you ONLY want data from one of the columns based on certain parameters.
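For instance, a hedged sketch (the column number and threshold here are made up): this would print the name in column 1 only for rows where the value in column 2 is greater than 5:

$ awk '$2 > 5 { print $1 }' data.txt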
Finally, here is a good place to learn a bit about the command line. This may help with grep, but it doesn't cover awk.
https://www.codecademy.com/learn/learn-the-command-line
After that, this should cover awk in more detail. There are a lot of VERY expansive courses on awk, but they're easy to get lost in. This is a practical site that demonstrates more what you're looking to do.
https://www.ibm.com/developerworks/library/l-awk1/
EDIT - after re-reading, I may have missed something - are you looking to compare the two files and print out only things that match from one to the other? Please advise and provide an example and I'd be happy to edit my answer accordingly.