How do I intersect two lists in OCaml?

My OCaml isn't the best, but I hacked this function together which will intersect sorted lists:

let rec intersect l1 l2 =
    match l1 with [] -> []
        | h1::t1 -> (
          match l2 with [] -> []
              | h2::t2 when h1 < h2 -> intersect t1 l2
              | h2::t2 when h1 > h2 -> intersect l1 t2
              | h2::t2 -> (
                match intersect t1 t2 with [] -> [h1]
                    | h3::t3 as l when h3 = h1 -> l
                    | h3::t3 as l -> h1::l
              )
        );;

which should run in O(n+m) time. Basically it checks the first element of each list. If they are equal, it stores the result of the recursive call to their tails, and then checks to see if the head of the stored result is equal to the heads of the lists. If it isn't, it inserts it, otherwise it's a duplicate and it ignores it.

If they aren't equal, it just advances whichever one is smaller.


As Franck and Rémi said, converting your lists to sets (from stdlib module Set) costs n log(n), and then Sets provides a linear implementation of intersection. Franck also mentioned the equivalent alternative to sort the lists, and then traverse them in a synchronized way. These are roughly the same (and by the way, in both cases you need to be able to provide a total ordering on the elements in your lists).

If intersections are an important part of your algorithm and you want them to be faster in the case of two sets of elements that are only slightly different, you need to switch to a mergeable structure such as Patricia trees. See files pt* in http://www.lri.fr/~filliatr/ftp/ocaml/ds/ .

If you need intersection to be fast in all cases, you have the possibility to use hash-consed Patricia trees. Hash-consing helps to recognize structurally identical sub-trees, and help to build efficient caches for previous operations by making comparison cheap.

Patricia trees can not use an arbitrary type as key (typically they are presented with ints as keys). But you can sometimes work around this limitation by numbering at creation each value you intend to use as a key.