Loop over a list of strings and increment letter count in a corresponding sublist
sequences = { "MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEIC\
DSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNN\
ESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQI\
FLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRS\
GKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIF\
GAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEV\
EGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHK\
CDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKG\
NIRCNICI"};
counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
"M", "F", "P", "S", "T", "W", "Y", "V"}, {0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};
and the code:
new = Values[
(CharacterCounts /@ sequences)[[All, First@counts]]
];
counts[[2 ;;]] += new;
counts
{{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"}, {31, 27, 45, 30, 18, 27, 25, 42, 11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}}
I can propose two things that speed up the letter counting tremendously:
1.) Use ToCharacterCode
to convert your strings to packed arrays of integers.
2.) Use a compiled funcion for additive matrix assembly.
Additive assembly of each row can be obtained with this little function.
cAssembleRow = Compile[{{a, _Integer, 1}, {max, _Integer}},
Block[{b},
b = Table[0, {max}];
Do[b[[Compile`GetElement[a, i]]]++, {i, 1, Length[a]}];
b
],
CompilationTarget -> "C",
RuntimeAttributes -> {Listable},
Parallelization -> True,
RuntimeOptions -> "Speed"
];
Borrowing a bit of code from kglr but cranking up the amount of strings and their length:
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {1000, 1000}];
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
Here is how kglr's and Kuba's very elegant solutions perform. lcs2a
is a modification of Kuba's code to cope with Missing[AbsentKey]
which may occur when some of the elements of letters
do not occur in any of the elements in sequences
(as kglr pointed out in a comment). It is also a bit faster.
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
lcs2a = Lookup[CharacterCounts[sequences], letters, 0]; // RepeatedTiming // First
3.59
0.075
0.059
My version is a bit more clunky, but it does the job several times faster:
i0 = ToCharacterCode["A"][[1]] - 1;
letterpos = ToCharacterCode[StringJoin[letters]] - i0;
lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All,letterpos]]; // RepeatedTiming // First
0.0094
When all letters occur in each element of `sequences, then all results are equal:
lcs == lcs2 == lcs2a == lcs3
True
You can use LetterCounts
as follows:
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L",
"K", "M", "F", "P", "S", "T", "W", "Y", "V"};
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {10, 100}];
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0] ;
counts = Join[{letters}, lcs];
counts // Grid