How to randomly split a DataFrame into several smaller DataFrames?

A simple demo:

import numpy as np
import pandas as pd

df = pd.DataFrame({"movie_id": np.arange(1, 25),
                   "borda": np.random.randint(1, 25, size=(24,))})
n_split = 5
# positional indices used to select rows from the DataFrame
ixs = np.arange(df.shape[0])
np.random.shuffle(ixs)
# np.split with an integer section count raises an error when the rows
# cannot be divided equally, so we compute the split points ourselves;
# n_split parts need (n_split - 1) split points
split_points = [i * df.shape[0] // n_split for i in range(1, n_split)]
# use each group of shuffled indices to select one part
for ix in np.split(ixs, split_points):
    print(df.iloc[ix])

The result:

    borda  movie_id
8       3         9
10      2        11
22     14        23
7      14         8

    borda  movie_id
0      16         1
20      4        21
17     15        18
15      1        16
6       6         7

    borda  movie_id
9       9        10
19      4        20
5       1         6
16     23        17
21     20        22

    borda  movie_id
11     24        12
23      5        24
1      22         2
12      7        13
18     15        19

    borda  movie_id
3      11         4
14     10        15
2       6         3
4       7         5
13     21        14
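
The demo above can be wrapped in a small helper. This is only a sketch: the function name random_split and its seed parameter are made up for illustration, and np.random.default_rng is swapped in for np.random.shuffle so the result can be repeated. The split points are computed exactly as in the demo, which for 24 rows and 5 parts gives cut positions 4, 9, 14 and 19.

```python
import numpy as np
import pandas as pd


def random_split(df, n_split, seed=None):
    """Split df into n_split random, non-overlapping parts (hypothetical helper)."""
    n_rows = df.shape[0]
    rng = np.random.default_rng(seed)
    # shuffled row positions, playing the role of ixs in the demo
    ixs = rng.permutation(n_rows)
    # (n_split - 1) cut positions; integer division copes with uneven sizes
    split_points = [i * n_rows // n_split for i in range(1, n_split)]
    return [df.iloc[ix] for ix in np.split(ixs, split_points)]


# illustrative data: the borda values are placeholders, not the asker's data
df = pd.DataFrame({"movie_id": np.arange(1, 25),
                   "borda": np.arange(24)})
parts = random_split(df, n_split=5, seed=0)
sizes = [len(p) for p in parts]  # 24 rows, 5 parts -> [4, 5, 5, 5, 5]
```

Every row lands in exactly one part, so concatenating the parts gives back all 24 rows in shuffled order.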

Use np.array_split

shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)

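df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled frame into parts that are as equal in size as possible; unlike np.split, it does not require the row count to divide evenly.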

It gives you:

for part in result:
    print(part,'\n')

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
5          6  5  0  0  0  0  0  0  5   0   0   0     10
4          5  3  0  0  0  0  0  0  0   0   0   0      3
7          8  1  0  0  0  4  5  0  0   0   4   0     14
16        17  3  0  0  4  0  0  0  0   0   0   0      7
22        23  4  0  0  0  4  3  0  0   5   0   0     16 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13        14  5  4  0  0  5  0  0  0   0   0   0     14
14        15  5  0  0  0  3  0  0  0   0   5   5     18
21        22  4  0  0  0  3  5  5  0   5   4   0     26
1          2  3  0  0  3  0  0  0  0   0   0   0      6
20        21  1  0  0  3  3  0  0  0   0   0   0      7 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
10        11  2  0  4  0  0  3  3  0   4   2   0     18
9         10  3  2  0  0  0  4  0  0   0   0   0      9
11        12  5  0  0  0  4  5  0  0   5   2   0     21
8          9  5  0  0  0  4  5  0  0   4   5   0     23
12        13  5  4  0  0  2  0  0  0   3   0   0     14 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
18        19  5  3  0  0  4  0  0  0   0   0   0     12
3          4  3  0  0  0  0  5  0  0   4   0   5     17
0          1  5  4  0  4  4  0  0  0   4   0   0     21
23        24  3  0  0  4  0  0  0  0   0   3   0     10
6          7  4  0  0  0  2  5  3  4   4   0   0     22 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
17        18  4  0  0  0  0  0  0  0   0   0   0      4
2          3  4  0  0  0  0  0  0  0   0   0   0      4
15        16  5  0  0  0  0  0  0  0   4   0   0      9
19        20  4  0  0  0  0  0  0  0   0   0   0      4

IIUC, you can do this:

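frames = {}
# np.split needs the number of rows to be divisible by 6
for e, i in enumerate(np.split(df, 6)):
    # shuffle the rows of this chunk and store it under df_1, df_2, ...
    frames['df_' + str(e + 1)] = pd.DataFrame(np.random.permutation(i), columns=df.columns)
print(frames['df_1'])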

   movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
0         4  3  0  0  0  0  5  0  0   4   0   5     17
1         3  4  0  0  0  0  0  0  0   0   0   0      4
2         2  3  0  0  3  0  0  0  0   0   0   0      6
3         1  5  4  0  4  4  0  0  0   4   0   0     21

Explanation: np.split(df, 6) splits df into 6 chunks of equal size, so the number of rows must be divisible by 6. pd.DataFrame(np.random.permutation(i), columns=df.columns) shuffles the rows of each chunk and builds a new DataFrame from them, which is stored in a dictionary named frames. Note that rows are shuffled only within each chunk, not across the whole DataFrame: df_1 above holds movie_id 1 to 4 in a random order.

Finally, access each part by its key; the value returned is a DataFrame. For example, try print(frames['df_1']), print(frames['df_2']), and so on. Each one is a random permutation of one split of the DataFrame.
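
Because the approach above shuffles rows only inside each consecutive chunk, it may help to compare it with shuffling the whole frame first. The sketch below shows both. It is an illustration with assumptions: it uses sample(frac=1) in place of np.random.permutation (so the original index labels are kept), iloc slices in place of np.split, and the names within and across plus the random_state values are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

# illustrative data: the borda values are placeholders
df = pd.DataFrame({"movie_id": np.arange(1, 25),
                   "borda": np.arange(24)})
n_parts = 6
block = len(df) // n_parts  # 24 rows / 6 parts = 4 rows each

# Variant 1: consecutive blocks, shuffled only inside each block
within = {}
for k in range(n_parts):
    chunk = df.iloc[k * block:(k + 1) * block]
    within["df_" + str(k + 1)] = chunk.sample(frac=1, random_state=k)

# Variant 2: shuffle the whole frame first, then cut it into blocks
shuffled = df.sample(frac=1, random_state=0)
across = {}
for k in range(n_parts):
    across["df_" + str(k + 1)] = shuffled.iloc[k * block:(k + 1) * block]

# df_1 of variant 1 always holds movie_id 1..4, only in a random order
first_ids = sorted(within["df_1"]["movie_id"].tolist())  # [1, 2, 3, 4]
```

Variant 2 is the one to use if each smaller DataFrame should be a random sample of the whole; variant 1 only randomizes the row order inside fixed groups.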
