Remove line breaks in a FASTA file
This awk program:
% awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta
Will yield:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.
On lines that do start with a >, print the stored newline character (if any) and the line. Then reset n, so that no extra newline is printed at the end if this is the last line of the file.
End with a newline, if required.
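To try this out without a real file, here is a minimal sketch. It assumes a made-up two-record sample, and the wrapper name unwrap_fasta is hypothetical; the awk program itself is the one shown above.

```shell
# unwrap_fasta is a hypothetical wrapper name; the awk program is the one above.
unwrap_fasta() {
  awk '
    !/^>/ { printf "%s", $0; n = "\n" }  # sequence line: print it with no line break
    /^>/  { print n $0; n = "" }         # header: end the pending sequence line first
    END   { printf "%s", n }             # end the last sequence line, if one is pending
  ' "$@"
}

# Made-up sample input; each sequence is wrapped over two lines.
sample='>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA'

unwrapped=$(printf '%s\n' "$sample" | unwrap_fasta)
printf '%s\n' "$unwrapped"
```

Because the wrapper passes its arguments through to awk, it can also be called with a file name, for example unwrap_fasta input.fasta.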
Note:
By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.
--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
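This is easy to check for yourself. A small sketch, reusing the variable name n from the program above and never assigning it:

```shell
# n is never assigned, so awk treats it as "" in string context and 0 in numeric context.
result=$(awk 'BEGIN {
  printf "[%s]\n", n   # string context: nothing between the brackets
  print n + 0          # numeric context
  print length(n)      # length of the empty string
}')

printf '%s\n' "$result"
```

That is why the first header line can safely do print n $0 before n has ever been set.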
Here is another awk one-liner that should work for your case:
awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:
awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file
Explanation:
For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.
Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:
awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
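If you want to convince yourself that the two versions in this answer behave the same, a rough check along these lines should do. The sample data is made up, and the variable names sample, a and b are only for illustration:

```shell
# Made-up sample: two records whose sequences are wrapped over two lines.
sample='>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA'

# Output of the two-rule version with an explicit END newline.
a=$(printf '%s\n' "$sample" |
  awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }')

# Output of the shorter version that sets an empty ORS.
b=$(printf '%s\n' "$sample" |
  awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1')

# Command substitution strips trailing newlines, so this compares everything else.
if [ "$a" = "$b" ]; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```

On this sample both commands should produce the same four lines, so the check is expected to report a match.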