Unix fold command behaving strangely
Your issue has nothing to do with the encoding of your file. The fold
utility is quite primitive and breaks lines at particular lengths, but it does not join lines.
You may also want to be careful with retaining the fasta header lines as they are (i.e., not fold these).
awk -v W=50 '
/^>/ { if (seq != "") print seq; print; seq = ""; next }
{
seq = seq $1
while (length(seq) > W) {
print substr(seq, 1,W)
seq = substr(seq, 1+W)
}
}
END { if (seq != "") print seq }' file.fa
This awk
command would reformat your sequence to 50 characters, leaving the header lines intact. The width, 50, is adjustable through the W
variable and may be set to any positive integer.
The first block in the code handles header lines and will output the accumulated sequence bit from the previous sequence, if there is any left to output, before passing along the header line unmodified to the output.
The second block accumulates a line of sequence and will possibly output the accumulated sequence if it's long enough, in appropriate chunks.
The last block (END
) outputs any left-over sequence when reaching the end of input.
Running this on a file consisting of two copies of your sequence will produce
>m64093_191209_130050/133911/ccs_64
TTCAGGCTGTGTTCCATTTGATTTAAAATCAAATAATTTCATTCGCGTCA
GAACACCTGGTTTCACGACCATAAATAATTTACCAGTGAATCGAGGCTCA
ATTATAGATCCTCGGACGCGAGTTCTCGGTTGACGAGTGGGATTCGAATT
ATTTTTCACCGAAAATTTTAGTCGACGAGTTCAGATAAATTTGTTCGGGA
TAAAATCATCTGAGTAGGTCGGGCTTCTGAATTTCGTATTCTTGCGAGCA
ATGAATTTTAAATAATCATCGGACATACCAATTTTTGGAACAATAATGTT
CCGAACATCCCGAAAATATAGGAAGAGCCCGGATAGATAAAAATAAACAC
>m64093_191209_130050/133911/ccs_64
TTCAGGCTGTGTTCCATTTGATTTAAAATCAAATAATTTCATTCGCGTCA
GAACACCTGGTTTCACGACCATAAATAATTTACCAGTGAATCGAGGCTCA
ATTATAGATCCTCGGACGCGAGTTCTCGGTTGACGAGTGGGATTCGAATT
ATTTTTCACCGAAAATTTTAGTCGACGAGTTCAGATAAATTTGTTCGGGA
TAAAATCATCTGAGTAGGTCGGGCTTCTGAATTTCGTATTCTTGCGAGCA
ATGAATTTTAAATAATCATCGGACATACCAATTTTTGGAACAATAATGTT
CCGAACATCCCGAAAATATAGGAAGAGCCCGGATAGATAAAAATAAACAC
Changing W
to 30 gives
>m64093_191209_130050/133911/ccs_64
TTCAGGCTGTGTTCCATTTGATTTAAAATC
AAATAATTTCATTCGCGTCAGAACACCTGG
TTTCACGACCATAAATAATTTACCAGTGAA
TCGAGGCTCAATTATAGATCCTCGGACGCG
AGTTCTCGGTTGACGAGTGGGATTCGAATT
ATTTTTCACCGAAAATTTTAGTCGACGAGT
TCAGATAAATTTGTTCGGGATAAAATCATC
TGAGTAGGTCGGGCTTCTGAATTTCGTATT
CTTGCGAGCAATGAATTTTAAATAATCATC
GGACATACCAATTTTTGGAACAATAATGTT
CCGAACATCCCGAAAATATAGGAAGAGCCC
GGATAGATAAAAATAAACAC
>m64093_191209_130050/133911/ccs_64
TTCAGGCTGTGTTCCATTTGATTTAAAATC
AAATAATTTCATTCGCGTCAGAACACCTGG
TTTCACGACCATAAATAATTTACCAGTGAA
TCGAGGCTCAATTATAGATCCTCGGACGCG
AGTTCTCGGTTGACGAGTGGGATTCGAATT
ATTTTTCACCGAAAATTTTAGTCGACGAGT
TCAGATAAATTTGTTCGGGATAAAATCATC
TGAGTAGGTCGGGCTTCTGAATTTCGTATT
CTTGCGAGCAATGAATTTTAAATAATCATC
GGACATACCAATTTTTGGAACAATAATGTT
CCGAACATCCCGAAAATATAGGAAGAGCCC
GGATAGATAAAAATAAACAC
You may also be interested in the FASTX-Toolkit from CSHL. I've never use this myself, but it seems to include a "FASTA Formatter (changes the width of sequences line in a FASTA file)". The latest release of the tools are from 2014 (quite old), so you may want to compile them yourself from source rather than using one of the provided precompiled binaries, unless your particular Unix distribution provides a package (check your package repository).
try this:
<input.fasta tr -d '\n'|fold -w 50 >output.fasta
This uses tr
to remove the existing end-of-lines and then formats the resulting single line into multiple lines each with max length 50.
To keep the first line at its current length, and not combine it with following lines, this should work (and ends the output with an end-of-line):
awk '{if (NR==1) {print $0 gensub(/ /, " ", "g", sprintf("%*s", 50-length($0), ""))} else print $0}' input.fasta|tr -d '\n'|sed '$s/$/\n/'|fold -w 50|awk '{$1=$1};1' >output.fasta
This is just how fold
works. You've never seed it before because you haven't had lines of this length before. The folding happens on each line separately. So, if the line's length isn't an exact multiple of the size you are folding to, you will get that kind of output. For example:
$ perl -le 'for (0..2){print "A" x 12}'
AAAAAAAAAAAA
AAAAAAAAAAAA
AAAAAAAAAAAA
$ perl -le 'for (0..2){print "A" x 12}' | fold -w 6
AAAAAA
AAAAAA
AAAAAA
AAAAAA
AAAAAA
AAAAAA
$ perl -le 'for (0..2){print "A" x 12}' | fold -w 7
AAAAAAA
AAAAA
AAAAAAA
AAAAA
AAAAAAA
AAAAA
Now, this isn't actually a problem. That is still a valid fasta file, but it isn't very pretty. As a workaround, you can take the FastaToTbl
and TblToFasta
scripts I have posted previously and do:
$ FastaToTbl input.fasta | TblToFasta
>m64093_191209_130050/133911/ccs_64
TTCAGGCTGTGTTCCATTTGATTTAAAATCAAATAATTTCATTCGCGTCAGAACACCTGG
TTTCACGACCATAAATAATTTACCAGTGAATCGAGGCTCAATTATAGATCCTCGGACGCG
AGTTCTCGGTTGACGAGTGGGATTCGAATTATTTTTCACCGAAAATTTTAGTCGACGAGT
TCAGATAAATTTGTTCGGGATAAAATCATCTGAGTAGGTCGGGCTTCTGAATTTCGTATT
CTTGCGAGCAATGAATTTTAAATAATCATCGGACATACCAATTTTTGGAACAATAATGTT
CCGAACATCCCGAAAATATAGGAAGAGCCCGGATAGATAAAAATAAACAC
The TblToFasta
script will ensure that the output is the standard 60bp per line. If you really need 50 instead, you could do (this assumes GNU sed
):
$ FastaToTbl input.fasta | sed -E 's/^/>/;s/\t/\n/ ' | sed -E 's/(.{50})/\1\n/g'
>m64093_191209_130050/133911/ccs_64
TTCAGGCTGTGTTCCATTTGATTTAAAATCAAATAATTTCATTCGCGTCA
GAACACCTGGTTTCACGACCATAAATAATTTACCAGTGAATCGAGGCTCA
ATTATAGATCCTCGGACGCGAGTTCTCGGTTGACGAGTGGGATTCGAATT
ATTTTTCACCGAAAATTTTAGTCGACGAGTTCAGATAAATTTGTTCGGGA
TAAAATCATCTGAGTAGGTCGGGCTTCTGAATTTCGTATTCTTGCGAGCA
ATGAATTTTAAATAATCATCGGACATACCAATTTTTGGAACAATAATGTT
CCGAACATCCCGAAAATATAGGAAGAGCCCGGATAGATAAAAATAAACAC