Data Type Recognition/Guessing of CSV Data in Python
After putting some thought into it, this is how I would design the algorithm myself:
- For performance reasons, take a sample of each column (say, 1%)
- Run a regex match against each cell in the sample to check its data type
- Choose the appropriate data type for the column based on the frequency distribution
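Purely as an illustration of that design, here is a minimal sketch. The regex patterns, the 1% sample with a minimum absolute size, and the 95% threshold are placeholder assumptions rather than answers to the two questions below:

```python
import csv
import random
import re
from collections import Counter

# Candidate types and the regexes used to recognise them; these patterns
# are illustrative assumptions, not a fixed standard.
TYPE_PATTERNS = [
    ("integer", re.compile(r"^[+-]?\d+$")),
    ("float",   re.compile(r"^[+-]?\d*\.\d+([eE][+-]?\d+)?$")),
    ("date",    re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]

def guess_cell_type(value):
    """Return the first type whose regex matches, falling back to string."""
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.match(value.strip()):
            return type_name
    return "string"

def guess_column_types(path, sample_fraction=0.01, min_sample=100, threshold=0.95):
    """Sample rows, regex-match each cell, and pick the most frequent type
    per column if it clears the threshold; otherwise fall back to string."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(reader)  # fine for moderate files; stream for huge ones

    sample_size = max(min_sample, int(len(rows) * sample_fraction))
    sample = random.sample(rows, min(sample_size, len(rows)))

    guesses = []
    for col_index, col_name in enumerate(header):
        counts = Counter(
            guess_cell_type(row[col_index])
            for row in sample
            if col_index < len(row) and row[col_index] != ""
        )
        if not counts:
            guesses.append((col_name, "string"))
            continue
        best_type, best_count = counts.most_common(1)[0]
        share = best_count / sum(counts.values())
        guesses.append((col_name, best_type if share >= threshold else "string"))
    return guesses
```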
Two questions arise:
- What's a sufficient sample size? For small data sets? For large data sets?
- What's a high enough threshold for selecting a data type based on the frequency distribution?
You may be interested in this Python library, which does exactly this kind of type guessing on CSV and XLS files for you:
- https://github.com/okfn/messytables
- https://messytables.readthedocs.org/ - docs
It happily scales to very large files and to streaming data off the internet, etc.
There is also an even simpler wrapper library that includes a command-line tool, dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
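For reference, typical usage looks roughly like this (adapted from the messytables documentation; treat it as a sketch, since API details may differ between versions):

```python
from messytables import (CSVTableSet, type_guess, types_processor,
                         headers_guess, headers_processor, offset_processor)

with open('example.csv', 'rb') as fh:
    # Load the file as a table set (a CSV file contains exactly one table).
    table_set = CSVTableSet(fh)
    row_set = table_set.tables[0]

    # Guess where the header row is, then register processors so that
    # iteration labels the cells and starts after the header.
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))

    # Guess column types from a sample of the rows.
    types = type_guess(row_set.sample, strict=True)
    print(types)  # one guessed cell type per column

    # Optionally apply the guessed types, so iterating the row set
    # yields cells already cast to those types.
    row_set.register_processor(types_processor(types))
```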
Maybe csvsql (part of the csvkit suite) could be useful here? No idea how efficient it is, but it definitely gets the job done for generating SQL CREATE TABLE statements out of CSVs.
$ csvsql so_many_columns.csv >> sql_create_table_with_char_types.txt