Filter subsets based on length?
When I get stuck with a little confusion in filtering, I go a level up and use foldr
in this case would be as simple as:
filterLength3 = foldr (\x rs -> if (length x) == 3 then x : rs else rs) []
filterLength3 (subsets [1,2,3,4,5])
output
=> [[1,2,3],[1,2,4],[1,3,4],[2,3,4],[1,2,5],[1,3,5],[2,3,5],[1,4,5],[2,4,5],[3,4,5]]
With filter
should be:
filter ((==3) . length) (subsets [1,2,3,4,5])
=> [[1,2,3],[1,2,4],[1,3,4],[2,3,4],[1,2,5],[1,3,5],[2,3,5],[1,4,5],[2,4,5],[3,4,5]]
Edit
After thinking a lot, and with the help of chi, and asking this question I was able to solve it:
import Data.List
subsetsOfThree ws = [ [x,y,z] | (x:xs) <- tails ws, (y:ys) <- tails xs, z <- ys ]
some examples:
subsetsOfThree [1..3]
=> [[1,2,3]]
subsetsOfThree [1..4]
=> [[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
subsetsOfThree [1..5]
=> [[1,2,3],[1,2,4],[1,2,5],[1,3,4],[1,3,5],[1,4,5],[2,3,4],[2,3,5],[2,4,5],[3,4,5]]
subsetsOfThree [1..10]
=> [[1,2,3],[1,2,4],[1,2,5],[1,2,6],[1,2,7],[1,2,8],[1,2,9],[1,2,10],[1,3,4],[1,3,5],[1,3,6],[1,3,7],[1,3,8],[1,3,9],[1,3,10],[1,4,5],[1,4,6],[1,4,7],[1,4,8],[1,4,9],[1,4,10],[1,5,6],[1,5,7],[1,5,8],[1,5,9],[1,5,10],[1,6,7],[1,6,8],[1,6,9],[1,6,10],[1,7,8],[1,7,9],[1,7,10],[1,8,9],[1,8,10],[1,9,10],[2,3,4],[2,3,5],[2,3,6],[2,3,7],[2,3,8],[2,3,9],[2,3,10],[2,4,5],[2,4,6],[2,4,7],[2,4,8],[2,4,9],[2,4,10],[2,5,6],[2,5,7],[2,5,8],[2,5,9],[2,5,10],[2,6,7],[2,6,8],[2,6,9],[2,6,10],[2,7,8],[2,7,9],[2,7,10],[2,8,9],[2,8,10],[2,9,10],[3,4,5],[3,4,6],[3,4,7],[3,4,8],[3,4,9],[3,4,10],[3,5,6],[3,5,7],[3,5,8],[3,5,9],[3,5,10],[3,6,7],[3,6,8],[3,6,9],[3,6,10],[3,7,8],[3,7,9],[3,7,10],[3,8,9],[3,8,10],[3,9,10],[4,5,6],[4,5,7],[4,5,8],[4,5,9],[4,5,10],[4,6,7],[4,6,8],[4,6,9],[4,6,10],[4,7,8],[4,7,9],[4,7,10],[4,8,9],[4,8,10],[4,9,10],[5,6,7],[5,6,8],[5,6,9],[5,6,10],[5,7,8],[5,7,9],[5,7,10],[5,8,9],[5,8,10],[5,9,10],[6,7,8],[6,7,9],[6,7,10],[6,8,9],[6,8,10],[6,9,10],[7,8,9],[7,8,10],[7,9,10],[8,9,10]]
And now you are able to make your monster a little puppet:
length $ subsetsOfThree [1..10]
=> 120
length $ subsetsOfThree [1..20]
=> 1140
length $ subsetsOfThree [1..50]
=> 19600
length $ subsetsOfThree [1..100]
=> 161700
length $ subsetsOfThree [1..500]
=> 20708500
The number of subsets for a list of 100 elements is about 2100 ≃ 1.26*1030, a really huge number. So the filter
approach does not seem practical. The problem should be solved by manipulating lists containing just a few numbers between 1 and 100.
So we aim to write a function to be named kSubsets
which returns the list of all subsets of cardinality k:
kSubsets :: Int -> [a] -> [[a]]
where k is the first argument.
A solution based on recursive list processing:
A possible way to build the functionality of kSubsets
consists in using an auxiliary kIndexSubsets
function which computes the zero-based indexes of the elements, instead of the elements themselves. The kIndexSubsets
function can be written in a recursive fashion.
In that case, the kSubsets
function is essentially a wrapper which maps the element indexes to the actual list elements. This gives the following code:
import qualified Data.Map as M
import qualified Data.Maybe as Mb
import qualified Data.List as L
kIndexSubsets :: Int -> Int -> [[Int]]
kIndexSubsets 0 _ = [[]]
kIndexSubsets k nn =
-- first element chosen must leave room for (k-1) elements after itself
let lastChoice = if (k > nn)
then error "k above nn in kIndexSubsets"
else (nn -k)
choices = [0 .. lastChoice]
-- for each possible first element, recursively compute
-- all the possible tails:
fn hd = let tails1 = kIndexSubsets (k-1) (nn - (hd+1))
-- rebase subsequent indexes:
tails2 = map (map (\x -> (x+hd+1))) tails1
in -- add new leftmost element:
map (\ls -> hd:ls) tails2
in
concatMap fn choices
-- return the list of all subsets of ls having k elements:
kSubsets :: Int -> [a] -> [[a]]
kSubsets 0 _ = [[]]
kSubsets k ls =
let nn = length ls
-- need a map for fast access to elements of ls:
ma = M.fromList $ zip [0..] ls
extractor ix = Mb.fromJust(M.lookup ix ma)
indexSubSets = kIndexSubsets k nn
in
map (map extractor) indexSubSets
We can now test our kSubsets
function. This involves checking that the length of the resulting output list conforms to the classic combinatorics formula, that is n!/(k! * (n-k)!) where n is the length of the input list.
*Main> let ls = "ABCDEFGH"
*Main> kSubsets 0 ls
[""]
*Main> kSubsets 1 ls
["A","B","C","D","E","F","G","H"]
*Main> kSubsets 2 ls
["AB","AC","AD","AE","AF","AG","AH","BC","BD","BE","BF","BG","BH","CD","CE","CF","CG","CH","DE","DF","DG","DH","EF","EG","EH","FG","FH","GH"]
*Main> kSubsets 3 ls
["ABC","ABD","ABE","ABF","ABG","ABH","ACD","ACE","ACF","ACG","ACH","ADE","ADF","ADG","ADH","AEF","AEG","AEH","AFG","AFH","AGH","BCD","BCE","BCF","BCG","BCH","BDE","BDF","BDG","BDH","BEF","BEG","BEH","BFG","BFH","BGH","CDE","CDF","CDG","CDH","CEF","CEG","CEH","CFG","CFH","CGH","DEF","DEG","DEH","DFG","DFH","DGH","EFG","EFH","EGH","FGH"]
*Main>
*Main> kSubsets 7 ls
["ABCDEFG","ABCDEFH","ABCDEGH","ABCDFGH","ABCEFGH","ABDEFGH","ACDEFGH","BCDEFGH"]
*Main>
*Main> kSubsets 8 ls
["ABCDEFGH"]
*Main>
*Main>
*Main> div ((100*99*98)::Integer) ((2*3)::Integer)
161700
*Main>
*Main> length $ kSubsets 3 [ 1 .. 100 ]
161700
*Main>
*Main> div ((100*99*98*97*96)::Integer) ((2*3*4*5)::Integer)
75287520
*Main> length $ kSubsets 5 [ 1 .. 100 ]
75287520
*Main>
The evaluation of kSubsets 3 [ 1 .. 100 ]
takes less than 50 msec on a plain vanilla x86-64 Linux machine.
An alternative solution based on a state machine:
The (reversed) list of chosen indexes is taken to be the state of an automaton, and we advance the state step by step, until this is no longer possible, at which point the list of sublists is complete.
Basically, if there is room to advance the rightmost index, fine, otherwise we recurse to advance the rest of the list, and then move the rightmost index as far left as possible.
The approach gives this alternative source code for kIndexSubsets
, in which the key piece is the ksAdvance
stepping function:
import qualified Data.Map as M
import qualified Data.Maybe as Mb
import qualified Data.List as L
-- works on the *reversed* list of chosen indexes:
ksAdvance :: Int -> Int -> Maybe [Int] -> Maybe [Int]
ksAdvance k nn Nothing = Nothing
ksAdvance k nn (Just []) = Nothing
ksAdvance k nn (Just (h:rls)) =
if (h == (nn-1))
then -- cannot advance rightmost index, so must recurse
let mbols2 = ksAdvance (k-1) (nn-1) (Just rls)
in
case mbols2 of
Nothing -> Nothing
Just ols2 -> let y = ((head ols2)+1) in Just (y:ols2)
else -- just advance rightmost index:
Just ((h+1):rls)
kIndexSubsets :: Int -> Int -> [[Int]]
kIndexSubsets 0 _ = [[]]
kIndexSubsets k nn =
let startList = reverse $ [ 0 .. (k-1) ]
cutList = takeWhile Mb.isJust
mbls = cutList $ iterate (ksAdvance k nn) (Just startList)
in
map (reverse . Mb.fromJust) mbls
This algorithm seems less memory-hungry and faster than the first one.
Using this main program for a quick performance test, with subsets of 5 elements out of 100, generating 75287520 subsets:
kSubsets :: Int -> [a] -> [[a]]
kSubsets 0 _ = [[]]
kSubsets k ls =
let nn = length ls
-- need a map for fast access to elements of ls:
ma = M.fromList $ zip [0..] ls
eltFromIndex = \ix -> Mb.fromJust (M.lookup ix ma)
indexSubSets = kIndexSubsets k nn
in
map (map eltFromIndex) indexSubSets
main = do
let nn = 100
let k = 5
let ls = [ 1 .. nn ]::[Int]
let str = "count of " ++ (show k) ++ " out of " ++ (show nn) ++
" elements subsets = " ++ (show $ length (kSubsets k ls))
putStrLn $ str
Memory performance is improved:
$ /usr/bin/time ./kSubsets03.x +RTS -s
count of 5 out of 100 elements subsets = 75287520
4,529,861,272 bytes allocated in the heap
623,240 bytes copied during GC
44,504 bytes maximum residency (2 sample(s))
29,224 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
...
Productivity 98.4% of total user, 98.5% of total elapsed
0.70user 0.00system 0:00.72elapsed 99%CPU (0avgtext+0avgdata 4724maxresident)k
0inputs+0outputs (0major+436minor)pagefaults 0swaps
$
Not yet as good as Fortran but getting close :-)
Here's a general solution for length-n subsets not using filter.
Where our initial list is x:xs
, notice that we can partition these subsets into those containing x
and those not containing x
. This shows us a nice recursive structure; the first partition is x
prepended to each length-(n-1) subset of xs
, and the second is just the length-n subsets of xs
.
subsetsOfLength n (x:xs) = map (x:) (subsetsOfLength (n-1) xs) ++ subsetsOfLength n xs
All we need are the base cases. There is a single length-0 subset, and no subset is larger than the original:
subsets 0 _ = [[]]
subsets _ [] = []
Stick these bases above the recursive step and throw an appropriate type signature on it, and we're done.
λ> subsetsOfLength 3 [1..5]
[[1,2,3],[1,2,4],[1,2,5],[1,3,4],[1,3,5],[1,4,5],[2,3,4],[2,3,5],[2,4,5],[3,4,5]]
λ> length $ subsetsOfLength 5 [1..100]
252
Nice.
Be careful. (++)
is slow; if you know at compile-time the length you'll be using, Damián Rafael Lattenero's tails
approach may be more performant. Not entirely sure about this, though. Also, depending on the values, you might do well to swap the operands of (++)
. I haven't yet done the math.