How to find set of lowest sum of distinct column elements in python?
I wanted to experiment with genetic algorithms, and this seemed like a good optimization type problem to apply it to. With 15 rows that could be in any order, there's 15! permutations, or 1.0e+12. A brute-force approach to try all permutations isn't practical.
I have the function below that calculates the "fitness" of the individuals in the population. The score is a combination of the mean and standard deviation. My math might not be entirely sound and I'm definitely winging it with numpy, but it does seem to produce good results.
def calculate_fitness(population):
fitness_scores = []
for individual in population:
# Group the rows in 3's according to the columns.
proj_a = individual[ : 3,1] # First 3 rows, column 1.
proj_b = individual[ 3: 6,2] # Next 3 rows, column 2, etc.
proj_c = individual[ 6: 9,3]
proj_d = individual[ 9:12,4]
proj_e = individual[12:15,5] # Bottom 3 rows, last column.
arr = np.array([proj_a, proj_b, proj_c, proj_d, proj_e])
mean = arr.mean() # Mean.
std = np.abs(arr.std()) # Standard deviation.
# We want both the lowest mean and lowest standard deviation.
# For simplicity, let's just add them and use that as the score.
fitness_scores.append(mean + std)
# Invert and scale the values so they can be used as weights
# for random selection.
fitness_scores = np.array(fitness_scores)
fitness_scores = (fitness_scores.max() + .3 ) - fitness_scores
fitness_scores /= (fitness_scores.max() + .07)
fitness_scores *= 100
return fitness_scores
Output - the first 3 rows belong to A, the next 3 to B and so forth:
employee proj_A proj_B proj_C proj_D proj_E
A3 1 2 4 3 5
C4 1 2 3 4 5
A1 1 5 3 4 2
C2 3 1 2 5 4
B5 2 1 3 5 4
C5 2 1 4 5 4
A2 4 2 1 3 5
A5 1 3 2 5 4
B3 2 3 1 5 4
B1 5 4 1 2 3
C3 5 3 4 1 2
C1 2 3 4 1 5
B2 4 5 3 2 1
B4 5 3 4 2 1
A4 4 5 3 2 1
In this grouping it seems everyone is very happy and it's probably the optimal mix.
Here everyone is extremely happy with all 1's except A3 who gets a 3.
employee proj_A proj_B proj_C proj_D proj_E
C4 1 _ _ _ _
A1 1 _ _ _ _
A5 1 _ _ _ _
B5 _ 1 _ _ _
C2 _ 1 _ _ _
C5 _ 1 _ _ _
A2 _ _ 1 _ _
B3 _ _ 1 _ _
B1 _ _ 1 _ _
C1 _ _ _ 1 _
A3 _ _ _ 3 _
C3 _ _ _ 1 _
A4 _ _ _ _ 1
B4 _ _ _ _ 1
B2 _ _ _ _ 1
I found that adjusting for a high rate of mutation, and protecting the top 5 individuals from mutation and death greatly improves the results.
Parents are selected by taking 4 individuals randomly using their fitness scores as weights to prefer higher fitness parents. The top of the 4 is then matched against any of the others that doesn't have an identical fitness score to try and prevent inbreeding and keep the population diversity in a good range.
Each iteration, one individual dies, two parents are selected and produce a child, and at a 50% rate an individual is selected and mutated by randomly swapping a couple of its rows.
The population I've found best is 150 members, and 1k to 2k iterations seems to get consistent results.