Filtering list under limitations
The original algorithm you used will always tend to minimize the number of results, because in any mutually exclusive choice between items the highest-scoring item wins. In this way the algorithm operates like a sieve, eliminating many lower-scoring items.
In order to support choosing a set of at least size X (10 in this case) from an original set of Y items (11 in your example), you will need to collect a list of decision sets rather than eliminating items by score alone. A decision set (m,n) is a set of m items from which you must choose n items to keep and eliminate the rest. Since most of the rules in your system are of the form "at most a single item with attribute x", most decision sets in your list will be of the form (m,1) - choose 1 of the m items, eliminate the rest.
The first pass over the full item set populates the decision set list, and the second pass goes over that list and chooses, from each decision set, the items to eliminate from the original set. Once a decision is made and the item(s) are eliminated from the original set, the decision set is removed from the list (resolved). Once the list of decision sets has been cleared, your original set is legal.
The goal is to clear the decision set list in at most Y-X eliminations. Since an item can appear in multiple decision sets, you can also give each item a "survival score". The survival score is the maximum number of other items that will have to be eliminated if this item is kept. It is calculated per item by going over each decision set (m,n) the item appears in and adding m-n to its accumulated score.
Let's look at your example, and build its decision sets:
- Item1(Category A, Author 1)
- Item2(Category A, Author 2)
- Item3(Category A, Author 3)
- Item4(Category B, Author 2)
- Item5(Category C, Author 5)
- Item6(Category D, Author 6)
- Item7(Category E, Author 7)
- Item8(Category F, Author 8)
- Item9(Category G, Author 9)
- Item10(Category H, Author 10)
- Item11(Category I, Author 11)
The decision sets we compile are (note the survival scores in parentheses):
- Author decision set (2,1) = {item 2 (2), item 4 (1)}
- Category decision set (3,2) = {item 1 (1), item 2 (2), item 3 (1)}
Our goal is to resolve the decision set list in at most 1 elimination. You can see that all items have a survival score of 1 (meaning that keeping them will result in at most 1 other item being eliminated) except for item 2, which has a survival score of 2 (keeping it will eliminate at most 2 items). We cannot afford 2 eliminations, and therefore we cannot afford to keep item 2 regardless of its score. Eliminating it resolves both decision sets and is the only option.
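For concreteness, here is a minimal sketch in Python of the two passes on this example. The rule limits (at most 1 item per author, at most 2 items per category), the (id, category, author) item tuples and the helper names are assumptions made for illustration, not part of your original algorithm:

    from collections import defaultdict

    # Hypothetical item model for the example above: (id, category, author).
    items = [(1, "A", 1), (2, "A", 2), (3, "A", 3), (4, "B", 2), (5, "C", 5),
             (6, "D", 6), (7, "E", 7), (8, "F", 8), (9, "G", 9), (10, "H", 10),
             (11, "I", 11)]

    def build_decision_sets(items, max_per_author=1, max_per_category=2):
        """First pass: group items sharing an over-represented attribute into
        (members, keep_n) pairs, i.e. decision sets (m, n)."""
        by_author, by_category = defaultdict(list), defaultdict(list)
        for item_id, category, author in items:
            by_author[author].append(item_id)
            by_category[category].append(item_id)
        sets = []
        for ids in by_author.values():
            if len(ids) > max_per_author:
                sets.append((ids, max_per_author))      # e.g. ([2, 4], 1)
        for ids in by_category.values():
            if len(ids) > max_per_category:
                sets.append((ids, max_per_category))    # e.g. ([1, 2, 3], 2)
        return sets

    def survival_scores(decision_sets):
        """Add m - n to every member of each decision set (m, n)."""
        scores = defaultdict(int)
        for members, keep_n in decision_sets:
            for item_id in members:
                scores[item_id] += len(members) - keep_n
        return scores

    scores = survival_scores(build_decision_sets(items))
    budget = len(items) - 10                             # Y - X = 1 allowed elimination
    print(dict(scores))                                  # {2: 2, 4: 1, 1: 1, 3: 1}
    print([i for i, s in scores.items() if s > budget])  # [2] - item 2 cannot be kept

With an elimination budget of Y-X = 1, only item 2 exceeds it, which matches the conclusion above.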
The more general algorithm can be more complex: in every iteration you eliminate the items whose survival score you cannot afford, and if you are not near that limit, use a combination of score and survival score to decide which one must go.
I think your solution is pretty good, not to mention efficient; however, you're not dealing with cases in which selecting the next-highest scorer is the best solution. In your case, this can only happen when selecting one item would result in a top 10 list of fewer than 10 items.
Decision sets (as suggested by @Assafs) would do the trick; however, a generic decision set algorithm is very inefficient for the general case of your problem. Having to build full decision sets for millions of entries just to avoid a case where there are not enough input items seems like overkill to me.
Moreover, the description of the "top 10" result set does not reveal what the correct result set is in all cases. For instance, if you add more items to the example you give, there's nothing to indicate that removing item 2 is the correct approach. It's not clear how we're supposed to measure which 10 are the "top 10".
What I suggest is adding a partial decision-making process to your algorithm that, on the one hand, does not compromise performance and, on the other hand, provides the means to tackle the short-input problem.
To implement this, you need to keep a map from each of your already selected items to the items it caused to be disqualified. On top of that, you need to keep a map from each disqualified item to the items that disqualified it.
After you get the result set from your greedy algorithm, if you have fewer than 10 results, you go over the result set and replace items that disqualify a lot of other items with the items they are disqualifying.
How you decide which items to check first is really up to you, since, again, there's no description of how to determine the "top 10". You could first eliminate lower-score items that disqualify more than one other item, or look for the item that disqualifies the most, etc.
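A minimal sketch of what those two maps and the replacement pass might look like; the names (fill_short_result, disqualifies, disqualified_by) and the ordering heuristic are mine and are just one way to read the idea above:

    def fill_short_result(selected, disqualifies, disqualified_by, target=10):
        """'selected' is the greedy result, 'disqualifies' maps a kept item to the
        set of items it knocked out, and 'disqualified_by' maps a knocked-out item
        to the set of kept items that knocked it out."""
        result = list(selected)
        # One possible order: visit the kept items that disqualified the most others first.
        for item in sorted(selected, key=lambda i: len(disqualifies.get(i, ())), reverse=True):
            if len(result) >= target:
                break
            # Items disqualified only by this kept item become free if we drop it.
            freed = [v for v in disqualifies.get(item, set())
                     if disqualified_by.get(v, set()) <= {item}]
            if len(freed) > 1:
                # Trading one kept item for two or more freed items grows the result set.
                result.remove(item)
                result.extend(freed)
        return result[:target]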
The important thing is that whatever algorithm you choose, your result set has a maximum of 10 items; therefore, even if you come up with an over-the-top O(n^3) algorithm, it's still a constant 1000-operation algorithm, which in practice is O(1).
Building the maps needed to implement this also adds only constant overhead per item during the run of your initial algorithm.
You have a list of items and you need to remove some of them to achieve your goal. You need to check whether removing each candidate item can give you a better solution. Try removing each item that could improve your list and see if you have achieved your goal.
Here are the steps to solve it with recursion:
- Build a list of candidates for removal from your list (Item1, Item2, Item3 and Item4 in the case of your example). If the candidate list is empty, the list is already legal and you are done.
- If the list has 10 or fewer items, you cannot remove anything more, so this branch fails.
- Remove each candidate one by one, call the recursion, and then put the removed item back.
- If the recursion returns true, you are done.
Here is some pseudo code
    bool recurse(list l) {
        candidates = buildCandidateList(l)
        if (candidates.size() == 0) {
            // the list is legal - take the top 10 items from it
            return true
        }
        if (l.size() <= 10) {
            // cannot remove anything more
            return false
        }
        for each c in candidates {
            remove c from l
            if (recurse(l) == true) {
                // a legal list was found - take the top 10 items from it
                return true
            }
            add c back into l
        }
        // no removal worked; just get the first 10 items
        return false
    }
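Here is a runnable sketch of the same recursion in Python, assuming the limits implied by the example above (at most 1 item per author, at most 2 items per category); the (id, category, author) item tuples, rule constants and helper names are made up for illustration:

    from collections import Counter

    MAX_PER_AUTHOR, MAX_PER_CATEGORY = 1, 2   # assumed limits, not stated in the question

    def build_candidate_list(items):
        """Items that currently violate a rule and are therefore candidates for removal."""
        authors = Counter(author for _, _, author in items)
        categories = Counter(category for _, category, _ in items)
        return [it for it in items
                if authors[it[2]] > MAX_PER_AUTHOR or categories[it[1]] > MAX_PER_CATEGORY]

    def recurse(items):
        candidates = build_candidate_list(items)
        if not candidates:
            return True                  # list is legal - take its top 10 items
        if len(items) <= 10:
            return False                 # cannot remove anything more
        for c in candidates:
            items.remove(c)
            if recurse(items):
                return True              # a legal list was found down this branch
            items.append(c)              # backtrack
        return False                     # no removal worked; just take the first 10

    items = [(1, "A", 1), (2, "A", 2), (3, "A", 3), (4, "B", 2), (5, "C", 5),
             (6, "D", 6), (7, "E", 7), (8, "F", 8), (9, "G", 9), (10, "H", 10),
             (11, "I", 11)]
    print(recurse(items), len(items))    # True 10 - item 2 was removed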
This solution is based on backtracking with recursion. Alternatively, I think it is also possible to build a bottom-up dynamic programming solution, which would do better in terms of big-O time complexity.