Simple way to concatenate Dask dataframes (horizontal, axis=1, columns)
The solution (from the comments by @Primer):
- repartition both dataframes to the same number of partitions and reset their indexes, so the rows can be matched up partition by partition
- use assign instead of concat to attach the label column
The final code:
import dask.dataframe as dd

# Read the point cloud and its labels; both files have the same number of rows.
df = dd.read_csv(
    'data/untermaederbrunnen_station1_xyz_intensity_rgb.txt',
    delimiter=' ', header=None,
    names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(
    'data/untermaederbrunnen_station1_xyz_intensity_rgb.labels',
    header=None, names=['label'])
# Sanity checks: len(df), len(df_label), df_label.label.isnull().sum().compute()

# Same number of partitions and a fresh index on both sides,
# so that assign can line the rows up partition by partition.
df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_label = df_label.repartition(npartitions=200)
df_label = df_label.reset_index(drop=True)

# Attach the label column instead of calling dd.concat.
df = df.assign(label=df_label.label)
df.head()
I had a similar problem, and the solution was simply to compute the chunk sizes of each Dask array I was going to put into the dataframe, using .compute_chunk_sizes(). After that there were no issues concatenating them into a dataframe along axis=1.
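A minimal sketch of that approach. The arrays and the boolean mask here are made up for illustration (boolean indexing is one common way to end up with unknown chunk sizes); only compute_chunk_sizes() and the axis=1 concatenation come from the answer above:

import dask.array as da
import dask.dataframe as dd

# Hypothetical arrays: boolean masking leaves their chunk sizes unknown (nan).
x = da.random.random(1_000_000, chunks=100_000)
mask = x > 0.5
a = x[mask]
b = (x * 2)[mask]

# Compute the chunk sizes so the arrays get known, matching partitions.
a = a.compute_chunk_sizes()
b = b.compute_chunk_sizes()

# Each 1-D array becomes a series with known divisions; axis=1 now works.
df = dd.concat([dd.from_dask_array(a, columns='a'),
                dd.from_dask_array(b, columns='b')], axis=1)
df.head()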
I had the same problem and solved it by making sure that both dataframes have the same number of partitions (since we already know that both have the same length):
# Match the partitioning of the two equal-length dataframes,
# then concatenate column-wise.
df = df.repartition(npartitions=200)
df_label = df_label.repartition(npartitions=200)
df = dd.concat([df, df_label], axis=1)
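A quick, hypothetical sanity check (not part of the original answer): the result should keep the original row count and gain the label column.

assert 'label' in df.columns
print(len(df))  # triggers a compute; should equal the length of the inputs
df.head()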