Compare two DataTables to determine rows in one but not the other
would I have to iterate through each row on each DataTable to check if they are the same.
Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.
Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:
1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:
- Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
- Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.
This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes )
2: An alternative approach, which may be more or less efficient depending on how big your data is:
- Run through table 1, and for each row, stick it's ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
- Run through table 2, and for each row, see if the ID (or hashcode etc) is present in the dictionary. You're exploiting the fact that dictionaries have really fast - O(1) I think? lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.
I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)
You can use the Merge and GetChanges methods on the DataTable to do this:
A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)
IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);