Finding all combinations based on multiple conditions for a large list
It is a classical operations research problem.
There are tons of algorithms that permit to find an optimal (or just a very good depending on the algorithm) solution :
- Mixed-Integer Programming
- Metaheuristics
- Constraint Programming
- ...
Here is a code that will find the optimal solution using MIP, ortools library and default solver COIN-OR :
from ortools.linear_solver import pywraplp
import pandas as pd
solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
cyclist_df = pd.read_csv('cyclists.csv')
# Variables
variables_name = {}
variables_team = {}
for _, row in cyclist_df.iterrows():
variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
if row['Ploeg'] not in variables_team:
variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))
# Constraints
# Link cyclist <-> team
for team, var in variables_team.items():
constraint = solver.Constraint(0, solver.infinity())
constraint.SetCoefficient(var, 1)
for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
constraint.SetCoefficient(variables_name[cyclist], -1)
# Max 4 cyclist per team
for team, var in variables_team.items():
constraint = solver.Constraint(0, 4)
constraint.SetCoefficient(var, 1)
# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
constraint_max_cyclists.SetCoefficient(cyclist, 1)
# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])
# Objective
objective = solver.Objective()
objective.SetMaximization()
for _, row in cyclist_df.iterrows():
objective.SetCoefficient(variables_name[row['Naam']], row['Punten totaal:'])
# Solve and retrieve solution
solver.Solve()
chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]
print(cyclist_df[cyclist_df.Naam.isin(chosen_cyclists)])
Prints :
Naam Ploeg Punten totaal: Waarde
1 SAGAN Peter BORA - hansgrohe 522 11.5
2 GROENEWEGEN Dylan Team Jumbo-Visma 205 11.0
8 VIVIANI Elia Deceuninck - Quick Step 273 9.5
11 ALAPHILIPPE Julian Deceuninck - Quick Step 399 9.0
14 PINOT Thibaut Groupama - FDJ 155 8.5
15 MATTHEWS Michael Team Sunweb 323 8.5
22 TRENTIN Matteo Mitchelton-Scott 218 7.5
24 COLBRELLI Sonny Bahrain Merida 238 6.5
25 VAN AVERMAET Greg CCC Team 192 6.5
44 STUYVEN Jasper Trek - Segafredo 201 4.5
51 CICCONE Giulio Trek - Segafredo 153 4.0
82 TEUNISSEN Mike Team Jumbo-Visma 255 3.0
83 HERRADA Jesús Cofidis, Solutions Crédits 255 3.0
104 NIZZOLO Giacomo Dimension Data 121 2.5
123 MEURISSE Xandro Wanty - Groupe Gobert 141 2.0
151 TRATNIK Jan Bahrain Merida 87 1.0
How does this code solve the problem ? As @KyleParsons said, it looks like the knapsack problem and can be modelized using Integer Programming.
Let's define variables Xi (0 <= i <= nb_cyclists)
and Yj (0 <= j <= nb_teams)
.
Xi = 1 if cyclist n°i is chosen, =0 otherwise
Yj = n where n is the number of cyclists chosen within team j
To define the relation between those variable, you can model these constraints :
# Link cyclist <-> team
For all j, Yj >= sum(Xi, for all i where Xi is part of team j)
To select only 4 cyclists per team max, you create these constraints :
# Max 4 cyclist per team
For all j, Yj <= 4
To select 16 cyclists, here are the associated constraints :
# Min 16 cyclists
sum(Xi, 1<=i<=nb_cyclists) >= 16
# Max 16 cyclists
sum(Xi, 1<=i<=nb_cyclists) <= 16
The cost constraint :
# Max cost
sum(ci * Xi, 1<=i<=n_cyclists) <= 100
# where ci = cost of cyclist i
Then you can maximize
# Objective
max sum(pi * Xi, 1<=i<=n_cyclists)
# where pi = nb_points of cyclist i
Notice that we model the problem using linear objective and linear inequation constraints. If Xi and Yj would be continous variables, this problem would be polynomial (Linear programming) and could be solved using :
- Interior point methodes (polynomial solution)
- Simplex (non polynomial but more effective in practice)
Because these variables are integers (Integer Programming or Mixed Integer Programming), the problem is known as be part of NP_complete class (cannot be solved using polynomial solutions unless you are a genious). Solvers like COIN-OR
use complex Branch & Bound or Branch & Cut methods to solve them efficiently. ortools
provides a nice wrapper to use COIN with python. These tools are free & open source.
All these methods have the advantage of finding an optimal solution without iterating on all the possible solutions (and considerably reduce the combinatorics).
I add an other answer for your question :
The CSV I posted was actually modified, my original one also contains a list for each rider with their score for each stage. This list looks like this
[0, 40, 13, 0, 2, 55, 1, 17, 0, 14]
. I am trying to find the team that performs the best overall. So I have a pool of 16 cyclists, from which the score of 10 cyclists counts towards the score of each day. The scores for each day are then summed to get a total score. The purpose is to get this final total score as high as possible.
If you think I should edit my first post please let me know, I think that it is more clear like this because my first post is quite dense and answers the initial question.
Let's introduce a new variable :
Zik = 1 if cyclist i is selected and is one of the top 10 in your team on day k
You need to add these constraints to link variables Zik and Xi (variable Zik cannot be = 1 if cyclist i is not selected i.e if Xi = 0)
For all i, sum(Zik, 1<=k<=n_days) <= n_days * Xi
And these constraints to select 10 cyclists per day :
For all k, sum(Zik, 1<=i<=n_cyclists) <= 10
Finally, your objective could be written like this :
Maximize sum(pik * Xi * Zik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k
And here is the thinking part. An objective written like this is not linear (notice the multiplication between the two variables X and Z). Fortunately, there are both binaries and there is a trick to transform this formula to its linear form.
Let's introduce again new variables Lik (Lik = Xi * Zik
) to linearize the objective.
The objective can now be written like this and be linear :
Maximize sum(pik * Lik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k
And we need now to add these constraints to make Lik
equal to Xi * Zik
:
For all i,k : Xi + Zik - 1 <= Lik
For all i,k : Lik <= 1/2 * (Xi + Zik)
And voilà. This is the beauty of mathematics, you can model a lot of things with linear equations. I presented advanced notions and it is normal if you don't understand them at first glance.
I simulated the score per day column on this file.
Here is the Python code to solve the new problem :
import ast
from ortools.linear_solver import pywraplp
import pandas as pd
solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
cyclist_df = pd.read_csv('cyclists_day.csv')
cyclist_df['Punten_day'] = cyclist_df['Punten_day'].apply(ast.literal_eval)
# Variables
variables_name = {}
variables_team = {}
variables_name_per_day = {}
variables_linear = {}
for _, row in cyclist_df.iterrows():
variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
if row['Ploeg'] not in variables_team:
variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))
for k in range(10):
variables_name_per_day[(row['Naam'], k)] = solver.IntVar(0, 1, 'z_{}_{}'.format(row['Naam'], k))
variables_linear[(row['Naam'], k)] = solver.IntVar(0, 1, 'l_{}_{}'.format(row['Naam'], k))
# Link cyclist <-> team
for team, var in variables_team.items():
constraint = solver.Constraint(0, solver.infinity())
constraint.SetCoefficient(var, 1)
for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
constraint.SetCoefficient(variables_name[cyclist], -1)
# Max 4 cyclist per team
for team, var in variables_team.items():
constraint = solver.Constraint(0, 4)
constraint.SetCoefficient(var, 1)
# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
constraint_max_cyclists.SetCoefficient(cyclist, 1)
# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])
# Link Zik and Xi
for name, cyclist in variables_name.items():
constraint_link_cyclist_day = solver.Constraint(-solver.infinity(), 0)
constraint_link_cyclist_day.SetCoefficient(cyclist, - 10)
for k in range(10):
constraint_link_cyclist_day.SetCoefficient(variables_name_per_day[name, k], 1)
# Min/Max 10 cyclists per day
for k in range(10):
constraint_cyclist_per_day = solver.Constraint(10, 10)
for name in cyclist_df.Naam:
constraint_cyclist_per_day.SetCoefficient(variables_name_per_day[name, k], 1)
# Linearization constraints
for name, cyclist in variables_name.items():
for k in range(10):
constraint_linearization1 = solver.Constraint(-solver.infinity(), 1)
constraint_linearization2 = solver.Constraint(-solver.infinity(), 0)
constraint_linearization1.SetCoefficient(cyclist, 1)
constraint_linearization1.SetCoefficient(variables_name_per_day[name, k], 1)
constraint_linearization1.SetCoefficient(variables_linear[name, k], -1)
constraint_linearization2.SetCoefficient(cyclist, -1/2)
constraint_linearization2.SetCoefficient(variables_name_per_day[name, k], -1/2)
constraint_linearization2.SetCoefficient(variables_linear[name, k], 1)
# Objective
objective = solver.Objective()
objective.SetMaximization()
for _, row in cyclist_df.iterrows():
for k in range(10):
objective.SetCoefficient(variables_linear[row['Naam'], k], row['Punten_day'][k])
solver.Solve()
chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]
print('\n'.join(chosen_cyclists))
for k in range(10):
print('\nDay {} :'.format(k + 1))
chosen_cyclists_day = [name for (name, day), variable in variables_name_per_day.items()
if (day == k and variable.solution_value() > 0.5)]
assert len(chosen_cyclists_day) == 10
assert all(chosen_cyclists_day[i] in chosen_cyclists for i in range(10))
print('\n'.join(chosen_cyclists_day))
Here are the results :
Your team :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
BENOOT Tiesj
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro
GRELLIER Fabien
Selected cyclists per day
Day 1 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
Day 2 :
SAGAN Peter
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
NIZZOLO Giacomo
MEURISSE Xandro
Day 3 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
MATTHEWS Michael
TRENTIN Matteo
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
Day 4 :
SAGAN Peter
VIVIANI Elia
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
Day 5 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
CICCONE Giulio
HERRADA Jesús
Day 6 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
Day 7 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro
Day 8 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo
MEURISSE Xandro
Day 9 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
TEUNISSEN Mike
HERRADA Jesús
Day 10 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
PINOT Thibaut
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo
Let's compare the results of answer 1 and answer 2 print(solver.Objective().Value())
:
You get 3738.0
with the first model, 3129.087388325567
with the second one. The value is lower because you select only 10 cyclists per stage instead of 16.
Now if keep the first solution and use the new scoring method, we get 3122.9477585307413
We could consider that the first model is good enough : we didn't have to introduce new variables/constraints, model stays simple and we got a solution almost as good as the complex model. Sometimes it is not necessary to be 100% accurate and a model can be solved more easily and quickly with some approximations.