Simple way to concatenate Dask dataframes (horizontal, axis=1, columns)
The solution (from the comments by @Primer):
- repartition both dataframes to the same number of partitions and reset their indexes, so the rows can be matched up partition by partition
- use assign instead of concat to attach the label column
The final code:
import dask.dataframe as dd

# Read the point cloud and its labels; both files have the same number of rows.
df = dd.read_csv(
    'data/untermaederbrunnen_station1_xyz_intensity_rgb.txt',
    delimiter=' ', header=None,
    names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(
    'data/untermaederbrunnen_station1_xyz_intensity_rgb.labels',
    header=None, names=['label'])
# Sanity checks: len(df), len(df_label), df_label.label.isnull().sum().compute()

# Same number of partitions and a fresh index on both sides,
# so that assign can line the rows up partition by partition.
df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_label = df_label.repartition(npartitions=200)
df_label = df_label.reset_index(drop=True)

# Attach the label column instead of calling dd.concat.
df = df.assign(label=df_label.label)
df.head()
I had a similar problem, and the solution was simply to compute the chunk sizes of each Dask array I was going to put into the dataframe, using .compute_chunk_sizes(). After that there were no issues concatenating them into a dataframe along axis=1.
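A minimal sketch of that approach. The arrays and the boolean mask here are made up for illustration (boolean indexing is one common way to end up with unknown chunk sizes); only compute_chunk_sizes() and the axis=1 concatenation come from the answer above:

import dask.array as da
import dask.dataframe as dd

# Hypothetical arrays: boolean masking leaves their chunk sizes unknown (nan).
x = da.random.random(1_000_000, chunks=100_000)
mask = x > 0.5
a = x[mask]
b = (x * 2)[mask]

# Compute the chunk sizes so the arrays get known, matching partitions.
a = a.compute_chunk_sizes()
b = b.compute_chunk_sizes()

# Each 1-D array becomes a series with known divisions; axis=1 now works.
df = dd.concat([dd.from_dask_array(a, columns='a'),
                dd.from_dask_array(b, columns='b')], axis=1)
df.head()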
I had the same problem and solved it by making sure that both dataframes have the same number of partitions (since we already know that both have the same length):
# Match the partitioning of the two equal-length dataframes,
# then concatenate column-wise.
df = df.repartition(npartitions=200)
df_label = df_label.repartition(npartitions=200)
df = dd.concat([df, df_label], axis=1)
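A quick, hypothetical sanity check (not part of the original answer): the result should keep the original row count and gain the label column.

assert 'label' in df.columns
print(len(df))  # triggers a compute; should equal the length of the inputs
df.head()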