fisher's linear discriminant in Python


enter image description here

See for more information.

Implementation using Iris

Since you want to use LDA for dimensionality reduction but provide only 2d data I am showing how to perform this procedure on the iris dataset.

Let's import libraries

    import pandas as pd
    import numpy as np
    import sklearn as sk
    from collections import Counter
    from sklearn import datasets
    # load dataset and transform to pandas df
    X, y = datasets.load_iris(return_X_y=True)
    X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(4)])
    y = pd.DataFrame(y, columns=['labels'])
    tot = pd.concat([X,y], axis=1)
    # calculate class means
    class_means = tot.groupby('labels').mean()
    total_mean = X.mean()

The class_means are given by:


    feat_0  feat_1  feat_2  feat_3
0   5.006   3.428   1.462   0.246
1   5.936   2.770   4.260   1.326
2   6.588   2.974   5.552   2.026

![enter image description here

To do this, we first subtract the class means from each observation (basically we calculate x - m_i from the equation above). Subtract the corresponding class mean from each observation. Since we want to calculate

x_mi = tot.transform(lambda x: x - class_means.loc[x['labels']], axis=1).drop('labels', 1)

def kronecker_and_sum(df, weights):
    S = np.zeros((df.shape[1], df.shape[1]))
    for idx, row in df.iterrows():
        x_m = row.as_matrix().reshape(df.shape[1],1)
        S += weights[idx]*, x_m.T)
    return S

# Each x_mi is weighted with 1. Now we use the kronecker_and_sum function to calculate the within-class scatter matrix S_w
S_w = kronecker_and_sum(x_mi, 150*[1])


mi_m = class_means.transform(lambda x: x - total_mean, axis=1)
# Each mi_m is weighted with the number of observations per class which is 50 for each class in this example. We use kronecker_and_sum to calculate the between-class scatter matrix.

S_b=kronecker_and_sum(mi_m, 3*[50])

enter image description here

eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_w).dot(S_b))

We only need to consider the eigenvalues which are remarkably different from zero (in this case only the first two)

array([ 3.21919292e+01,  2.85391043e-01,  6.53468167e-15, -2.24877550e-15])

Transform X with the matrix of the two eigenvectors which correspond to the highest eigenvalues

    W = eig_vecs[:, :2]
    X_trafo =, W)
    tot_trafo = pd.concat([pd.DataFrame(X_trafo, index=range(len(X_trafo))), y], 1)
    # plot the result
    tot_trafo.plot.scatter(x=0, y=1, c='labels', colormap='viridis')

enter image description here We have reduced the dimensions from 4 to 2 and chosen the space in such a way, that the classes can be well seperated.

Scikit-learn usage

Scikit has LDA support aswell. What we did in dozens of lines can be done with the following lines of code:

from sklearn import discriminant_analysis
lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2)
X_trafo_sk = lda.fit_transform(X,y)
pd.DataFrame(np.hstack((X_trafo_sk, y))).plot.scatter(x=0, y=1, c=2, colormap='viridis')

I'm not giving a plot here, cause it is the same as in our derived example (except for a 180 degree rotation).

Before answering your question, I will first touch the basic difference between PCA and (F)LDA. In PCA you don't know anything about underlying classes, but you assume that the information about classes separability lies in the variance of data. So you rotate your original axes (sometimes it is called projecting all the data onto new ones) in such way that your first new axis is pointing to the direction of most variance, second one is perpendicular to the first one and pointing to the direction of most residiual variance, and so on. This way a PCA transformation results in a (sub)space of the same dimensionality as the original one. Than you can take only first 2 dimensions, rejecting the rest, hence getting a dimensionality reduction from k dimensions to only 2.

LDA works a bit differently. In this case you know in advance how many classes there are in your data, and you can find their mean and covariance matrices. What Fisher criterion does it finds a direction in which the mean between classes is maximized, while at the same time total variability is minimized (total variability is a mean of within-class covariance matrices). And for each two classes there is only one such line. This is why when your data has C classes, LDA can provide you at most C-1 dimensions, regardless of the original data dimensionality. In your case this means that as you have only 2 classes A and B, you will get a one-dimensional projection, i.e. a line. And this is exactly what you have in your picture: original 2d data is projected on to a line. The direction of the line is the solution of the eigenproblem. Let's generate data that is similar to your picture:

a = np.random.multivariate_normal((1.5, 3), [[0.5, 0], [0, .05]], 30)
b = np.random.multivariate_normal((4, 1.5), [[0.5, 0], [0, .05]], 30)
plt.plot(a[:,0], a[:,1], 'b.', b[:,0], b[:,1], 'r.')
mu_a, mu_b = a.mean(axis=0).reshape(-1,1), b.mean(axis=0).reshape(-1,1)
Sw = np.cov(a.T) + np.cov(b.T)
inv_S = np.linalg.inv(Sw)
res =  # the trick
# more general solution
# Sb = (mu_a-mu_b)*((mu_a-mu_b).T)
# eig_vals, eig_vecs = np.linalg.eig(
# res = sorted(zip(eig_vals, eig_vecs), reverse=True)[0][1] # take only eigenvec corresponding to largest (and the only one) eigenvalue
# res = res / np.linalg.norm(res)

plt.plot([-res[0], res[0]], [-res[1], res[1]]) # this is the solution
plt.plot(mu_a[0], mu_a[1], 'cx')
plt.plot(mu_b[0], mu_b[1], 'yx')

# let's project data point on it
r = res.reshape(2,)
n2 = np.linalg.norm(r)**2
for pt in a:
    prj = r * / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'b.:', alpha=0.2)
for pt in b:
    prj = r * / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'r.:', alpha=0.2)

The resulting projection is calculated using a neat trick for two class problem. You can read details on it here in section 1.6.

enter image description here

Regarding the "examples" you mention in your question. I believe you need to repeat the process for each example, as it is a different set of data point probably with different distributions. Also, put attention that estimated mean (mu_a, mu_b) and class covariance matrices would be slightly different from the ones that data was generated with, especially for small sample size.