Python: Extract dimension data from dataframe string column and create columns with values for each of them
Option 1: I prefer splitting several times:
new_series = (df.set_index('ID')
.all_dimensions
.str.split(',', expand=True)
.stack()
.reset_index(level=-1, drop=True)
)
# split a second time to separate each measurement's name from its value
new_df = (new_series.str
.split(':', expand=True)
.reset_index()
)
# stripping off leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()
# unstack to get the desired table:
new_df.set_index(['ID', 0])[1].unstack()
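If your pandas version has Series.explode (0.25 or later, as far as I know), the same two-step split can be written a bit more compactly. This is only a sketch, not a replacement for the code above: the sample frame is rebuilt from the rows shown in this thread, and the names pairs, name, value and wide are my own.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in this thread.
df = pd.DataFrame({
    "ID": [12, 34, 56, 78, 90],
    "all_dimensions": [
        "Height:2 cm,Volume: 4cl,Weight:100g",
        "Length: 10cm, Height: 5 cm",
        "Depth: 80cm",
        "Weight: 2 kg, Length: 7 cm",
        "Diameter: 4 cm, Volume: 50 cl",
    ],
})

# First split: one row per "name:value" item, keeping ID as the index.
# Second split: separate the name from the value (only at the first ':').
pairs = (df.set_index("ID")["all_dimensions"]
           .str.split(",")
           .explode()
           .str.split(":", n=1, expand=True))
pairs.columns = ["name", "value"]  # illustrative column names

# Strip leading/trailing spaces from both columns.
pairs = pairs.apply(lambda col: col.str.strip())

# Move the measurement names into the columns.
wide = pairs.set_index("name", append=True)["value"].unstack()
print(wide)
```

This assumes each ID lists a given measurement at most once; a repeated name for the same ID would make unstack fail on the duplicate index entry.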
Option 2: Use split(',|:'), as you tried:
# splitting
new_series = (df.set_index('ID')
.all_dimensions
.str.split(',|:', expand=True)
.stack()
.reset_index(level=-1, drop=True)
)
# concat along axis=1 to get a dataframe with two columns;
# after reset_index the columns are ('ID', 0, 1), where 0 is the measurement name and 1 its value
new_df = (pd.concat((new_series[::2].str.strip(),
new_series[1::2].str.strip()), axis=1)
.reset_index())
new_df.set_index(['ID', 0])[1].unstack()
Output:
    Depth  Diameter  Height  Length  Volume  Weight
ID
12    NaN       NaN    2 cm     NaN     4cl    100g
34    NaN       NaN    5 cm    10cm     NaN     NaN
56   80cm       NaN     NaN     NaN     NaN     NaN
78    NaN       NaN     NaN    7 cm     NaN    2 kg
90    NaN      4 cm     NaN     NaN   50 cl     NaN
This is a tricky one: each string needs to be split, and every item produced by the split converted to a dict. The DataFrame constructor can then rebuild the columns from those dicts.
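import pandas as pd
from collections import ChainMap
# one {name: value} dict per measurement; strip spaces so ' Height' and 'Height' end up in the same column
d = [[{k.strip(): v.strip()} for k, v in (y.split(':', 1) for y in x.split(','))] for x in df.all_dimensions]
# merge each row's small dicts into a single dict
data = [dict(ChainMap(*x)) for x in d]
# build the new columns (sorted by name) and attach them to the original frame
s = pd.DataFrame(data).sort_index(axis=1)
df = pd.concat([df, s], axis=1)
df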
Out[26]:
   ID                       all_dimensions  Depth  ...  Length  Volume  Weight
0  12  Height:2 cm,Volume: 4cl,Weight:100g    NaN  ...     NaN     4cl    100g
1  34           Length: 10cm, Height: 5 cm    NaN  ...    10cm     NaN     NaN
2  56                          Depth: 80cm   80cm  ...     NaN     NaN     NaN
3  78           Weight: 2 kg, Length: 7 cm    NaN  ...    7 cm     NaN    2 kg
4  90        Diameter: 4 cm, Volume: 50 cl    NaN  ...     NaN   50 cl     NaN
[5 rows x 8 columns]
Check one of the new columns:
df['Height']
Out[28]:
0 2 cm
1 5 cm
2 NaN
3 NaN
4 NaN
Name: Height, dtype: object
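If the nested comprehension is hard to read, the same dict-based idea can be sketched with a small helper that parses one string at a time. This is just one way to write it: parse_dimensions is a hypothetical name of mine, and the sample frame is rebuilt from the rows shown above.

```python
import pandas as pd


def parse_dimensions(text):
    """Turn 'Height:2 cm,Volume: 4cl' into {'Height': '2 cm', 'Volume': '4cl'}."""
    result = {}
    for item in text.split(","):
        # partition splits at the first ':' only
        name, _, value = item.partition(":")
        result[name.strip()] = value.strip()
    return result


# Sample data rebuilt from the rows shown above.
df = pd.DataFrame({
    "ID": [12, 34, 56, 78, 90],
    "all_dimensions": [
        "Height:2 cm,Volume: 4cl,Weight:100g",
        "Length: 10cm, Height: 5 cm",
        "Depth: 80cm",
        "Weight: 2 kg, Length: 7 cm",
        "Diameter: 4 cm, Volume: 50 cl",
    ],
})

# One dict per row; the DataFrame constructor fills missing keys with NaN.
parsed = pd.DataFrame(df["all_dimensions"].map(parse_dimensions).tolist(),
                      index=df.index)

# Attach the new columns, sorted by name, to the original frame.
out = df.join(parsed.sort_index(axis=1))
print(out)
```

Because the keys are stripped before the frame is built, there is no need to merge ' Height' and 'Height' afterwards.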