Data Type Recognition/Guessing of CSV Data in Python
After putting some thought into it, this is how I would design the algorithm myself:
- For performance reasons, take a sample of each column (say, 1%)
- Run a regex match against each cell in the sample to check its data type
- Choose the appropriate data type for the column based on the frequency distribution
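Purely as an illustration of that design, here is a minimal sketch. The regex patterns, the 1% sample with a minimum absolute size, and the 95% threshold are placeholder assumptions rather than answers to the two questions below:

```python
import csv
import random
import re
from collections import Counter

# Candidate types and the regexes used to recognise them; these patterns
# are illustrative assumptions, not a fixed standard.
TYPE_PATTERNS = [
    ("integer", re.compile(r"^[+-]?\d+$")),
    ("float",   re.compile(r"^[+-]?\d*\.\d+([eE][+-]?\d+)?$")),
    ("date",    re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]

def guess_cell_type(value):
    """Return the first type whose regex matches, falling back to string."""
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.match(value.strip()):
            return type_name
    return "string"

def guess_column_types(path, sample_fraction=0.01, min_sample=100, threshold=0.95):
    """Sample rows, regex-match each cell, and pick the most frequent type
    per column if it clears the threshold; otherwise fall back to string."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(reader)  # fine for moderate files; stream for huge ones

    sample_size = max(min_sample, int(len(rows) * sample_fraction))
    sample = random.sample(rows, min(sample_size, len(rows)))

    guesses = []
    for col_index, col_name in enumerate(header):
        counts = Counter(
            guess_cell_type(row[col_index])
            for row in sample
            if col_index < len(row) and row[col_index] != ""
        )
        if not counts:
            guesses.append((col_name, "string"))
            continue
        best_type, best_count = counts.most_common(1)[0]
        share = best_count / sum(counts.values())
        guesses.append((col_name, best_type if share >= threshold else "string"))
    return guesses
```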
Two questions arise:
- What's a sufficient sample size? For small data sets? For large data sets?
- What's a high enough threshold for selecting a data type based on the frequency distribution?
You may be interested in this Python library, which does exactly this kind of type guessing on CSV and XLS files for you:
- https://github.com/okfn/messytables
- https://messytables.readthedocs.org/ - docs
It happily scales to very large files and to streaming data off the internet, etc.
There is also an even simpler wrapper library that includes a command-line tool, dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
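For reference, typical usage looks roughly like this (adapted from the messytables documentation; treat it as a sketch, since API details may differ between versions):

```python
from messytables import (CSVTableSet, type_guess, types_processor,
                         headers_guess, headers_processor, offset_processor)

with open('example.csv', 'rb') as fh:
    # Load the file as a table set (a CSV file contains exactly one table).
    table_set = CSVTableSet(fh)
    row_set = table_set.tables[0]

    # Guess where the header row is, then register processors so that
    # iteration labels the cells and starts after the header.
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))

    # Guess column types from a sample of the rows.
    types = type_guess(row_set.sample, strict=True)
    print(types)  # one guessed cell type per column

    # Optionally apply the guessed types, so iterating the row set
    # yields cells already cast to those types.
    row_set.register_processor(types_processor(types))
```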
Maybe csvsql (part of the csvkit suite) could be useful here? No idea how efficient it is, but it definitely gets the job done for generating SQL CREATE TABLE statements out of CSVs.
$ csvsql so_many_columns.csv >> sql_create_table_with_char_types.txt