Find k smallest sum n-tuples

Bug Fixes

I believe there are two bugs in getTupleSetsC implementation. When shifting queue elements back, part specification: 2 ;; qels + 1 is used before decrementing qels which causes Part::take error when queue is full. Also when creating new push element push[[j]] += 1 is done unconditionally which leads to unbounded increase of indices. After fixing those bugs we get function which we'll use as reference for benchmarks:

getTupleSetsCb3m2a1 =
  Compile[
   {{vals, _Real, 2}, {n, _Integer}},
   Module[
    {
     dim,
     k,
     queue,
     costs,
     defel,
     tups,
     pop,
     push,
     cost,
     qels = 0,
     solns = 0,
     order
     },
    (*dim = Length@vals;*)
    {dim, k} = Dimensions@vals;
    (* Initialize priority queue and cost vector *)

    defel = Table[-1, {$, dim}];
queue = Table[defel, {$, n}];
costs = Table[-1., {$, n}];
(* Initialize solutions *)
tups = Table[defel, {$, n}];
(* Determine first element *)
push = Table[1, {$, dim}];
    (* Compute cost *)
    cost = 0.;
    Do[cost += vals[[i, push[[i]]]], {i, dim}];
    (* Push onto queue *)
    ++qels;
    queue[[qels]] = push;
    costs[[qels]] = cost;
    (* Iterate *)
    pop = defel;
    Do[
     (* pop the first element, add it to the tups array *)

     pop = queue[[1]];
     tups[[++solns]] = pop;
     (* shift the queue elements back *)

     (*queue[[;; qels]] = queue[[2 ;; qels + 1]];
     costs[[;; qels]] = costs[[2 ;; qels + 1]];
     qels--;*)
     qels--;
     If[qels > 0,
       queue[[;; qels]] = queue[[2 ;; qels + 1]];
       costs[[;; qels]] = costs[[2 ;; qels + 1]];
     ];
     Do[
      (* If j-th index is already maximal there's no tuple with incremented j-th index.
         Since it also means that j-th index > 1, we can not only continue to next j, but break the loop. *)
      If[pop[[j]] == k, Break[]];

      (* Create new push element *)
      push = pop;
      push[[j]] += 1;
      (* Compute cost *)
      cost = 0.;
      Do[cost += vals[[i, push[[i]]]], {i, dim}];
      (* determine where on queue would be added *)
      ++qels;
      (* allocate more queue if necessary *)

      If[qels > Length@queue,
       queue =
        Join[queue, 
         Table[defel, {$$, 2*Length@queue}]
     ];
   costs =
    Join[costs, 
     Table[-1., {$$, 2*Length@queue}]
         ]
       ];
      (* push onto queue *)
      queue[[qels]] = push;
      costs[[qels]] = cost;
      (* If next element would be added in later iteration, break *)

           If[pop[[j]] > 1, Break[]],
      {j, dim}
      ];
     (* resort by priority *)

     order = Ordering[costs[[;; qels]]];
     queue[[;; qels]] = queue[[;; qels]][[order]];
     costs[[;; qels]] = costs[[;; qels]][[order]],
     {$, n - 1}
     ];
    pop = queue[[1]];
    tups[[++solns]] = pop;
    tups
    ],
    CompilationTarget -> "C",
    RuntimeOptions -> "Speed"
   ] // Last;

Performance improvements

I don't have better algorithm than one used by OP. So let's just change few implementation details.

First obvious, and already hinted by OP, thing is priority queue implementation. Instead of shifted and resorted array let's use heap-ordered binary tree.

As explained in linked answer to CSSE question our queue can be bounded, and when maximal size is reached we'll remove unneeded most expensive elements. Heap is not an asymptotically optimal data structure when comes to removing most expensive elements, but in this particular algorithm, I believe, our heap operates almost all the time below its maximal capacity. So occasional, if present at all, non-optimal removals will hopefully be "amortized". And we'll benefit from lower constants in optimal operations of adding arbitrary elements and removing cheapest ones.

To not waste time on swapping whole tuples when restoring heap property, we'll use heap of indices to base arrays containing tuples and their costs. To quickly find empty spots in those base arrays we'll also keep an array based linked list of empty spots.

We'll keep track of largest cost of tuples present in queue and won't push a tuple if we already have enough cheaper ones.

Since new tuple is always created from old one by changing one index, we don't have to loop over all its elements to calculate its cost, it's enough to focus on changed element: subtract cost of old element and add cost of new one.

Above changes result in following code:

getTupleSetsCjkuczm = Hold@Compile[{{vals, _Real, 2}, {k, _Integer}},
  Module[{m, n, d, firstEmpty, noEmpty, nextEmpty, queueCurrent, queue, currentCost, largestCost, emptyCost, costs, tuples, result},
    {m, n} = Dimensions@vals;
    d = k + m;
    firstEmpty = 1;
    noEmpty = d + 1;
    nextEmpty = Range[2, noEmpty];
    queueCurrent = 1;
    queue = Table[0, d];
    largestCost = 0.;
    emptyCost = 2. m Max@vals;
    costs = Table[emptyCost, d];
    tuples = Table[1, d, m];
    result = Table[1, k, m];

    (* Compute cost of first tuple. *)
    currentCost = vals[[1, 1]];
    Do[currentCost += vals[[i, 1]], {i, 2, m}];

    Do[
      Do[
        Module[{oldInd, newInd, newCost},
          oldInd = result[[i, j]];

          (* If j-th index is already maximal there's no tuple with incremented j-th index.
             Since it also means that j-th index > 1, we can not only continue to next j, but break the loop. *)
          If[oldInd == n, Break[]];

          newInd = oldInd + 1;
          newCost = currentCost - vals[[j, oldInd]] + vals[[j, newInd]];

          (* If we already have enough cheaper tuples, don't push this one. *)
          If[queueCurrent > k - i && newCost >= largestCost, If[oldInd > 1, Break[], Continue[]]];

          (* If there are no empty spots, leave only k - i cheapes elements, we don't need rest. *)
          If[firstEmpty == noEmpty,
            Module[{oPrev},
              (* Create new queue. This is slow, but should be very rare. *)
              queueCurrent = k - i + 1;
              queue = Ordering@costs;
              largestCost = costs[[queue[[k - i]]]];

              (* Update list of empty spots. *)
              oPrev = queue[[queueCurrent]];
              firstEmpty = oPrev;
              Do[
                With[{o = queue[[l]]},
                  nextEmpty[[oPrev]] = o;
                  oPrev = o;
                ]
                ,
                {l, k - i + 2, d}
              ];
              nextEmpty[[oPrev]] = noEmpty;
            ];
          ];

          (* Store tuple and its cost. *)
          tuples[[firstEmpty]] = result[[i]];
          tuples[[firstEmpty, j]] = newInd;
          costs[[firstEmpty]] = newCost;
          If[newCost > largestCost, largestCost = newCost];

          (* Push index into queue. *)
          queue[[queueCurrent++]] = firstEmpty;
          (* Restore heap property. *)
          Module[{l, p},
            l = queueCurrent - 1;
            p = Quotient[l - 2, 2] + 1;
            While[l > 1 && costs[[queue[[l]]]] < costs[[queue[[p]]]],
              With[{lth = queue[[l]]},
                queue[[l]] = queue[[p]];
                queue[[p]] = lth;
              ];
              l = p;
              p = Quotient[l - 2, 2] + 1;
            ];
          ];

          (* Remove first empty spot from list of empty spots. *)
          firstEmpty = nextEmpty[[firstEmpty]];

          (* If next element would be added in later iteration, break. *)
          If[oldInd > 1, Break[]]
        ]
        ,
        {j, m}
      ];

      Module[{cheapest},
        (* Pop cheapest element from queue. *)
        cheapest = queue[[1]];
        queue[[1]] = queue[[--queueCurrent]];
        (* Restore heap property. *)
        Module[{j = 1},
          While[True,
            Module[{l, r, o, jth, jthCost},
              l =  2 j;
              r = 2 j + 1;
              jth = queue[[j]];
              jthCost = costs[[jth]];
              o = If[r < queueCurrent && costs[[queue[[r]]]] < jthCost,
                If[costs[[queue[[l]]]] < costs[[queue[[r]]]], l, r]
              (* else *),
                If[l < queueCurrent && costs[[queue[[l]]]] < jthCost, l, 0]
              ];
              If[o < 1, Break[]];
              queue[[j]] = queue[[o]];
              queue[[o]] = jth;
              j = o;
            ];
          ];
        ];

        currentCost = costs[[cheapest]];
        costs[[cheapest]] = emptyCost;

        (* Add cheapest element to result array. *)
        result[[i + 1]] = tuples[[cheapest]];

        (* Prepend cheapest index to list of empty indices. *)
        nextEmpty[[cheapest]] = firstEmpty;
        firstEmpty = cheapest;
      ];
      ,
      {i, k - 1}
    ];
    result
  ],
  CompilationTarget -> "C",
  RuntimeOptions -> "Speed"
] /. Part -> Compile`GetElement //. HoldPattern[Compile`GetElement[x__] = y_] :> (Part@x = y) // ReleaseHold // Last;

Tests

On small data sets we can test that we get same results as ones obtained from Sorted lists of all Tuples:

getCosts = Total[Transpose@MapThread[#1[[#2]] &, {#1, Transpose@#2}], {2}] &;

SeedRandom@0
testData = Sort /@ RandomInteger[{0, 500}, {2, 4}]
res = Sort@Total[Tuples@testData, {2}];
Array[getCosts[testData, getTupleSetsCb3m2a1[testData, #]] === Take[res, #] &, Length@res, 1, And] // AbsoluteTiming
Array[getCosts[testData, getTupleSetsCjkuczm[testData, #]] === Take[res, #] &, Length@res, 1, And] // AbsoluteTiming
(* {{32, 266, 334, 399}, {30, 270, 272, 414}} *)
(* {0.00083,  True} *)
(* {0.000243, True} *)

testData = Sort /@ RandomInteger[{0, 500}, {6, 3}]
res = Sort@Total[Tuples@testData, {2}];
Array[getCosts[testData, getTupleSetsCb3m2a1[testData, #]] === Take[res, #] &, Length@res, 1, And] // AbsoluteTiming
Array[getCosts[testData, getTupleSetsCjkuczm[testData, #]] === Take[res, #] &, Length@res, 1, And] // AbsoluteTiming
(* {{195, 266, 342}, {138, 309, 473}, {264, 338, 351}, {111, 126, 356}, {1, 77, 179}, {210, 421, 495}} *)
(* {1.44529,  True} *)
(* {0.089297, True} *)

testData = Sort /@ RandomInteger[{0, 500}, {5, 5}]
res = Sort@Total[Tuples@testData, {2}];
Array[getCosts[testData, getTupleSetsCb3m2a1[testData, #]] === Take[res, #] &, Length@res, 1, And] // AbsoluteTiming
Array[getCosts[testData, getTupleSetsCjkuczm[testData, #]] === Take[res, #] &, Length@res, 1, And] // AbsoluteTiming
(* {{14, 201, 249, 315, 487}, {269, 321, 340, 411, 464}, {23, 229, 357, 372, 377}, {89, 205, 297, 420, 442}, {171, 268, 298, 374, 481}} *)
(* {71.903,   True} *)
(* { 1.43877, True} *)

Benchmarks

Let's compare time and memory usage on data from OP:

tests = Sort /@ BlockRandom[SeedRandom@1; Table[RandomSample[Range[0, 500], 100], 6]];
res1 = getTupleSetsCb3m2a1[tests, 100]; // MaxMemoryUsed // RepeatedTiming
res2 = getTupleSetsCjkuczm[tests, 100]; // MaxMemoryUsed // RepeatedTiming
getCosts[tests, res1] === getCosts[tests, res2]
(* {0.000523, 59296} *)
(* {0.000032, 24032} *)
(* True *)

res1 = getTupleSetsCb3m2a1[tests, 10^4]; // MaxMemoryUsed // RepeatedTiming
res2 = getTupleSetsCjkuczm[tests, 10^4]; // MaxMemoryUsed // RepeatedTiming
getCosts[tests, res1] === getCosts[tests, res2]
(* {1.24,    1783744} *)
(* {0.00288, 1673128} *)
(* True *)

And on other combinations of data sizes:

Needs@"GeneralUtilities`"
Table[
  With[{testData = Sort /@ RandomInteger[{0, 500}, {i, j}]},
    BenchmarkPlot[<|"b3m2a1" -> (getTupleSetsCb3m2a1[testData, #] &), "jkuczm" -> (getTupleSetsCjkuczm[testData, #] &)|>, Identity, 2^Range@16]
  ]
  ,
  {i, 2^(3 Range@3)}, {j, 2^(3 Range@3)}
] // TableForm[#, TableHeadings -> {2^(3 Range@3), 2^(3 Range@3)}, TableAlignments -> Center] &

Benchmark Plots


Could do this as a polynomial algebra problem with matrix values used as exponents, with a distinct variable for each, summed in rows and product taken of row sums. Use a distinct variable to record total degrees. We set this up as below.

mat = {{1, 2, 3}, {5, 6, 7}, {3, 4, 5}};
xmat = Array[x, Dimensions[mat]];
prod = Apply[Times, Apply[Plus, xmat^mat, {1}]] /. 
  Thread[Flatten[xmat] -> t*Flatten[xmat]]

(* Out[142]= (t x[1, 1] + t^2 x[1, 2]^2 + t^3 x[1, 3]^3) (t^5 x[2, 1]^5 +
    t^6 x[2, 2]^6 + t^7 x[2, 3]^7) (t^3 x[3, 1]^3 + t^4 x[3, 2]^4 + 
   t^5 x[3, 3]^5) *)

Extract the min possible value, as the minimal exponent in t.

minexpon = Exponent[prod, t, Min]

(* Out[122]= 9 *)

Now we define a function to extract lists of matrix column indices that all give rise to a particular power (row being denoted by position in the list). This version also gives the matrix entries in a separate list

getIndexVectors[prod_, deg_] := Module[
  {coeffs},
  coeffs = Coefficient[prod, t, deg];
  If [Head[coeffs] === Plus, coeffs = Apply[List, coeffs], 
   coeffs = {coeffs}];
  Map[Transpose, coeffs /. Times -> List /. x[i_, j_]^k_. :> {j, k}]
  ]

Some tests:

In[143]:= getIndexVectors[prod, minexpon]

(* Out[143]= {{{1, 1, 1}, {1, 5, 3}}} *)

In[144]:= getIndexVectors[prod, minexpon + 1]

(* Out[144]= {{{2, 1, 1}, {2, 5, 3}}, {{1, 2, 1}, {1, 6, 3}}, {{1, 1, 
   2}, {1, 5, 4}}} *)