Use of DeleteCases to Delete columns of a matrix by specifying multiple criteria for deletion
This can be done in stages. It's easier to delete rows rather than columns, so we will temporarily work with the transpose:
arr = Transpose[{{"G", "T", "T", "A", "A", "C", "G", "A", "C", "-"},
{"G", "T", "T", "A", "G", "C", "G", "T", "C", "-"},
{"C", "A", "A", "T", "T", "C", "G", "T", "C", "G"},
{"T", "A", "A", "T", "T", "C", "G", "T", "C", "A"},
{"C", "A", "T", "A", "C", "C", "C", "G", "A", "A"},
{"-", "A", "T", "A", "C", "C", "G", "G", "A", "A"}}];
Apply the first criterion:
a1 = DeleteCases[arr, v_ /; MemberQ[v, "-"]]
{{"T", "T", "A", "A", "A", "A"},
{"T", "T", "A", "A", "T", "T"},
{"A", "A", "T", "T", "A", "A"},
{"A", "G", "T", "T", "C", "C"},
{"C", "C", "C", "C", "C", "C"},
{"G", "G", "G", "G", "C", "G"},
{"A", "T", "T", "T", "G", "G"},
{"C", "C", "C", "C", "A", "A"}}
The second and third criteria can be applied simultaneously:
a2 = DeleteCases[a1, v_ /; Length[Union[v]] != 2]
{{"T", "T", "A", "A", "A", "A"},
{"T", "T", "A", "A", "T", "T"},
{"A", "A", "T", "T", "A", "A"},
{"G", "G", "G", "G", "C", "G"},
{"C", "C", "C", "C", "A", "A"}}
Finally, apply the last criterion:
a3 = DeleteCases[a2, v_ /; MemberQ[Tally[v][[All, -1]], 1]]
{{"T", "T", "A", "A", "A", "A"},
{"T", "T", "A", "A", "T", "T"},
{"A", "A", "T", "T", "A", "A"},
{"C", "C", "C", "C", "A", "A"}}
and transpose back to the desired format:
Transpose[a3]
{{"T", "T", "A", "C"},
{"T", "T", "A", "C"},
{"A", "A", "T", "C"},
{"A", "A", "T", "C"},
{"A", "T", "A", "A"},
{"A", "T", "A", "A"}}
In fact, all the criteria can be brought together in a single DeleteCases[]
call:
Transpose[DeleteCases[arr, v_ /;
(MemberQ[v, "-"] || Length[Union[v]] != 2 ||
MemberQ[Tally[v][[All, -1]], 1])]]
(I really wish you hadn't deleted the quote marks in your original post; putting them back in for writing this answer was annoying.)
If, like in kglr's answer, you also want to track column indices, this can be done with minimal changes to the code above, where we exploit the fact that DeleteCases[]
also works on associations:
assoc = AssociationThread[Range[Length[arr]], arr];
filt = DeleteCases[assoc, v_ /; (MemberQ[v, "-"] || Length[Union[v]] != 2 ||
MemberQ[Tally[v][[All, -1]], 1])]
<|2 -> {"T", "T", "A", "A", "A", "A"},
3 -> {"T", "T", "A", "A", "T", "T"},
4 -> {"A", "A", "T", "T", "A", "A"},
9 -> {"C", "C", "C", "C", "A", "A"}|>
The column indices can then be retrieved using Keys[filt]
, and the array itself can be reconstituted with Transpose[Values[filt]]
.
m = {{"G", "T", "T", "A", "A", "C", "G", "A", "C", "-"},
{"G", "T", "T", "A", "G", "C", "G", "T", "C", "-"},
{"C", "A", "A", "T", "T", "C", "G", "T", "C", "G"},
{"T", "A", "A", "T", "T", "C", "G", "T", "C", "A"},
{"C", "A", "T", "A", "C", "C", "C", "G", "A", "A"},
{"-", "A", "T", "A", "C", "C", "G", "G", "A", "A"}};
mindexed = MapIndexed[#2 -> # &, m, {2}]
criterion1 = FreeQ[{"-", _}] @* Tally @* Values;
criterion23 = Not@*(MemberQ[{1, 3, 4}, #] &) @* Length @* Tally @* Values;
criterion4 = (Or[Length@# != 2, FreeQ[{_, 1}]@#] &) @* Tally @* Values;
result = Fold[Select[#2]@# &, Transpose[mindexed], {criterion1, criterion23, criterion4}]
MatrixForm @ Transpose @ result
MatrixForm @ Transpose @ Values @ result
MatrixForm[Transpose@Keys@result, TableDirections -> {Column, Row, Row}]
Grid[Transpose[result] /. Rule -> (Column[{#2, #}, Alignment -> Center] &),
Dividers -> All]
You can use Cases
or DeleteCases
instead of Select
:
result2 = Fold[Cases[_?#2]@# &,
Transpose[mindexed], {criterion1, criterion23, criterion4}];
result3 = Fold[DeleteCases[Except[_?#2]]@# &,
Transpose[mindexed], {criterion1, criterion23, criterion4}];
result == result2 == result3
True
Since you're only deleting and testing columns, my approach here would be to group only by columns (and associating it with their index), and then test those. One way to do that is
MapIndexed[First[#2] -> #1 &, Transpose[m]]
The First
is because MapIndexed
includes the position via the second argument as a single-element list, e.g. {3}
, even if we're working wit ha list of lists.
For ease with pattern-matching and extraction (though I don't know if it actually has any effect on performance), I'm going to get rid of the ->
and use just a list.
columns = MapIndexed[{First[#2], #1} &, Transpose[m]]
(You could also use Transpose[{Table[i, {i,1,Length[Transpose[m]]}], Transpose[m]}]
, or MapThread[List, {Table[i, {i,1,Length[Transpose[m]]}], Transpose[m]}]
; I don't know if either of these is faster.)
Then you can DeleteCases easily! (Also, by the way, I'm assuming you're working with strings "A"
, "-"
, etc. If not, you can do so with m = Map[ToString, m, {2}]
.)
Method 1 (DeleteCases)
columns = DeleteCases[columns, _?(MemberQ[Last[#],"-"] &), 1]
columns = DeleteCases[columns, _?(Length[DeleteDuplicates[Last[#]]]!=2 &), 1]
(*Note that the above takes care of both criteria 2 and 3 at once!*)
columns = DeleteCases[columns, _?(MemberQ[Tally[Last[#]], 1, {2}] &), 1]
Notice, though, that we're doing tests for each pattern. So it might be better to use something which applies tests, such as Select
or Pick
, especially for large data.
Method 2 (Select)
columns = Select[columns,
(!MemberQ[Last[#],"-"] && Length[DeleteDuplicates[Last[#]]]==2 && !MemberQ[Tally[Last[#]], 1, {2}]) &]
Note that Mathematica short-circuits &&
, so it makes sense to evaluate the "easiest" things first in the &&
expression.
We could also get rid of that pesky Last
everywhere by creating a selector list (one which is True
in every position you want to keep, and False
elsewhere), and then using Pick
. I don't know if that would be better or worse in terms of performance, but it would look like:
Method 3.1 (Pick)
(*The same as the function in Select, but without the Last around every # *)
selector = (!MemberQ[#,"-"] && Length[DeleteDuplicates[#]]==2 && !MemberQ[Tally[#], 1, {2}]) & /@ Transpose[m]
columns = Pick[columns, selector]
It might also make sense to instead use Tally
on every list in the first place and just match these tallies to create the selector.
Method 3.2 (Pick, Tally)
selector = MatchQ[{{Except["-"],Except[1]},{Except["-"],Except[1]}}] /@ Tally /@ Transpose[m]
columns = Pick[columns, selector]
It might also instead make sense to get a bunch of indices for which this pattern is matched, and then use Part
to extract those indices.
Method 4 (Part, Tally)
indices = Flatten@Position[Tally /@ Transpose[m], {{Except["-"],Except[1]},{Except["-"],Except[1]}}, 1]
columns = Part[columns, indices]
(One could also somehow mark off the ones that contain "-" before Tally
and then use the (maybe?) higher performance pattern {{_, Except[1]}, {_, Except[1]}}
, but that might be a tiny bit more complicated to construct.)
I don't know which of these will deliver usable performance given the size of your matrix, but hopefully at least one of them will! (I tested them, and they all at least work on your test case.) Let me know if you have any questions, or if the implementation of any of these is unclear! :)
Edit: oops, I see this got similar answers while I was writing this. However, there are still some novelties in this answer, so I'll leave it!