NumPy: sort an ndarray on multiple columns
Sort a NumPy ndarray by its 1st, 2nd or 3rd column:
>>> a = np.array([[1,30,200], [2,20,300], [3,10,100]])
>>> a
array([[  1,  30, 200],
       [  2,  20, 300],
       [  3,  10, 100]])
>>> a[a[:,2].argsort()] #sort by the 3rd column ascending
array([[  3,  10, 100],
       [  1,  30, 200],
       [  2,  20, 300]])
>>> a[a[:,2].argsort()][::-1] #sort by the 3rd column descending
array([[  2,  20, 300],
       [  1,  30, 200],
       [  3,  10, 100]])
>>> a[a[:,1].argsort()] #sort by the 2nd column ascending
array([[  3,  10, 100],
       [  2,  20, 300],
       [  1,  30, 200]])
To explain what is going on here: argsort() does not return the sorted values.
It returns an array of integer indices that would put its input in ascending order:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
>>> x = np.array([15, 30, 4, 80, 6])
>>> np.argsort(x)
array([2, 4, 0, 1, 3])
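To make the link between those indices and the sorted values concrete, here is a small sketch using the same x as above. The variable names order, ascending and descending are only illustrative.

```python
import numpy as np

x = np.array([15, 30, 4, 80, 6])

# argsort returns the positions that would put x in ascending order
order = np.argsort(x)

# indexing x with those positions yields the sorted values
ascending = x[order]

# reversing the index array gives descending order
descending = x[order[::-1]]

print(order)       # [2 4 0 1 3]
print(ascending)   # [ 4  6 15 30 80]
print(descending)  # [80 30 15  6  4]
```

This is the same mechanism as a[a[:,2].argsort()]: the index array computed from one column is used to reorder whole rows.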
Sort by column 1, then by column 2, then by column 3 (np.lexsort uses the last key in the tuple as the primary key):
>>> a = np.array([[2,30,200], [1,30,200], [1,10,200]])
>>> a
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])
>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))]
array([[  1,  10, 200],
       [  1,  30, 200],
       [  2,  30, 200]])
>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))][::-1] #reverse
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])
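Reversing the whole result with [::-1] flips every key at once. When only one key should be descending, one common approach for numeric columns is to negate that key before passing it to lexsort. A minimal sketch, assuming integer data; the array b and the name idx are made up for illustration:

```python
import numpy as np

b = np.array([[2, 30, 200],
              [1, 30, 200],
              [1, 10, 200]])

# lexsort uses the LAST key as the primary key:
# primary = column 0 ascending, secondary = column 1 descending (negated)
idx = np.lexsort((-b[:, 1], b[:, 0]))

print(b[idx])
# [[  1  30 200]
#  [  1  10 200]
#  [  2  30 200]]
```

Negation only works for numeric keys. For strings or other types a different trick would be needed.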
Import a file, letting NumPy guess the column types, and sort in place:
import numpy as np
# infile is the path to the input text file (assumed to be defined elsewhere)
# let numpy guess the type with dtype=None
my_data = np.genfromtxt(infile, dtype=None, names=["a", "b", "c", "d"])
# access columns by name
print(my_data["b"]) # column 1
# sort in place by column 1, breaking ties with column 0
my_data.sort(order=["b", "a"])
# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", my_data, fmt="%d\t%d\t%.6f\t%.6f")
Alternatively, specify the input format explicitly and sort into a new array:
import numpy as np
# tell numpy the first 2 columns are int and the last 2 are floats
my_data = np.genfromtxt(infile, dtype=[('a', '<i8'), ('b', '<i8'), ('x', '<f8'), ('d', '<f8')])
# access columns by name
print(my_data["b"]) # column 1
# get the indices to sort the array using lexsort
# the last element of the tuple (column 1) is used as the primary key
ind = np.lexsort((my_data["a"], my_data["b"]))
# create a new, sorted array
sorted_data = my_data[ind]
# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", sorted_data, fmt="%d\t%d\t%.6f\t%.6f")
Output:
2 1 2.000000 0.000000
3 1 2.000000 0.000000
4 1 2.000000 0.000000
2 2 100.000000 0.000000
3 2 4.000000 0.000000
4 2 4.000000 0.000000
2 3 100.000000 0.000000
3 3 6.000000 0.000000
4 3 6.000000 0.000000
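The two file-based snippets above assume that infile points to a whitespace-separated text file. To try the approach without a file on disk, a sketch like the following reads from and writes to in-memory buffers instead. The sample rows are invented for illustration and are not the data that produced the output above.

```python
import io
import numpy as np

# hypothetical sample input: four whitespace-separated columns
text = """\
2 3 100.0 0.0
3 1 2.0 0.0
2 1 2.0 0.0
3 2 4.0 0.0
"""

data = np.genfromtxt(
    io.StringIO(text),
    dtype=[("a", "<i8"), ("b", "<i8"), ("c", "<f8"), ("d", "<f8")],
)

# sort in place by field "b", breaking ties with field "a"
data.sort(order=["b", "a"])

# write tab-separated output to an in-memory buffer instead of a file
out = io.StringIO()
np.savetxt(out, data, fmt="%d\t%d\t%.6f\t%.6f")
print(out.getvalue())
```

Swapping io.StringIO(text) for a real path and out for a filename gives back the file-based version.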
This method, which sorts via a temporary record array, works for any 2-D NumPy array:
import numpy as np
my_data = [[ 2., 1., 2., 0.],
[ 2., 2., 100., 0.],
[ 2., 3., 100., 0.],
[ 3., 1., 2., 0.],
[ 3., 2., 4., 0.],
[ 3., 3., 6., 0.],
[ 4., 1., 2., 0.],
[ 4., 2., 4., 0.],
[ 4., 3., 6., 0.]]
my_data = np.array(my_data)
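# build a record array whose fields are the sort keys: column 1 first, then column 0
# (np.rec is the public alias; np.core.records is deprecated in NumPy 2.x)
r = np.rec.fromarrays([my_data[:,1],my_data[:,0]],names='a,b')
# argsort on a record array compares field 'a' first, then field 'b'
my_data = my_data[r.argsort()]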
print(my_data)
Result:
[[ 2. 1. 2. 0.]
[ 3. 1. 2. 0.]
[ 4. 1. 2. 0.]
[ 2. 2. 100. 0.]
[ 3. 2. 4. 0.]
[ 4. 2. 4. 0.]
[ 2. 3. 100. 0.]
[ 3. 3. 6. 0.]
[ 4. 3. 6. 0.]]
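The same ordering can also be obtained without building a record array, by passing the key columns straight to np.lexsort. A rough sketch under that assumption; the helper name sort_by_columns is hypothetical and not part of NumPy:

```python
import numpy as np

def sort_by_columns(arr, cols):
    """Return the rows of a 2-D array sorted by cols (first entry = primary key)."""
    # lexsort treats the last key as primary, so pass the keys in reverse
    keys = tuple(arr[:, c] for c in reversed(cols))
    return arr[np.lexsort(keys)]

data = np.array([[2., 1., 2., 0.],
                 [2., 2., 100., 0.],
                 [3., 1., 2., 0.],
                 [3., 2., 4., 0.]])

# sort by column 1, then by column 0
result = sort_by_columns(data, [1, 0])
print(result)
```

Listing the columns in priority order and reversing them inside the helper hides the "last key is primary" convention of lexsort from the caller.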