Using numpy.genfromtxt to read a csv file with strings containing commas
A better approach is a function that combines the power of the standard csv module with NumPy's recfromcsv. For instance, the csv module offers fine control over dialects, quotes, escape characters, etc., which you can add to the example below.
The example recfromcsv_mod function below reads a complicated CSV file the way Microsoft Excel would, including commas within quoted fields. Internally, the function uses a generator that rewrites each row with tab delimiters.
import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)
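As a rough illustration of the csv customization mentioned above, the sketch below passes explicit formatting options to csv.reader instead of relying on the sniffer. The sample text is made up for this example; the point is only that skipinitialspace=True lets the reader recognize a quote that follows a space after the delimiter.

```python
import csv
import io

# Hypothetical sample: note the space after each comma.
sample = 'year, city, value\n2011, "Lexington, KY", 4.0\n'

# With skipinitialspace=True the leading space is dropped, so the quote
# is seen at the start of the field and the embedded comma is preserved.
strict_rows = list(csv.reader(
    io.StringIO(sample), delimiter=",", quotechar='"', skipinitialspace=True))

# With default settings the field starts with a space, so the quote is
# treated as ordinary text and the embedded comma splits the field.
default_rows = list(csv.reader(io.StringIO(sample)))

print(strict_rows[1])
print(len(default_rows[1]))
```

The same keyword arguments could be passed to the csv.reader call inside rewrite_csv_as_tab if the sniffer guesses wrong for a particular file.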
If you are using NumPy, you probably want to work with a numpy.ndarray. This will give you one:
import pandas
data = pandas.read_csv('file.csv').to_numpy()
pandas handles the "Lexington, KY" case correctly.
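A minimal sketch of that claim, using an in-memory string in place of file.csv (the sample rows are invented for illustration):

```python
import io

import pandas as pd

# Hypothetical sample; the last row has a comma inside a quoted field.
sample = 'year,city,value\n2012,"Louisville KY",3.5\n2011,"Lexington, KY",4.0\n'

df = pd.read_csv(io.StringIO(sample))

# Convert the DataFrame to a plain NumPy array (object dtype here,
# because the columns hold a mix of numbers and strings).
arr = df.to_numpy()

print(arr.shape)
print(arr[1, 1])
```

Because the columns have different types, the resulting array holds generic Python objects; converting individual columns (for example df["value"].to_numpy()) keeps numeric dtypes.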
The problem is the additional comma inside the quoted field; np.genfromtxt does not deal with that.
One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a NumPy array if you like.
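A minimal sketch of that list-based approach, assuming a small in-memory sample rather than a real file (the data and variable names are illustrative only):

```python
import csv
import io

import numpy as np

# Hypothetical sample; the last row has a comma inside a quoted field.
sample = 'year,city,value\n2012,"Louisville KY",3.5\n2011,"Lexington, KY",4.0\n'

# csv.reader honours the quotes, so every row splits into three fields.
rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

# The result is a 2-D array of strings; convert columns afterwards as needed.
arr = np.array(records)
values = arr[:, 2].astype(float)

print(arr.shape)
print(values)
```

For a file on disk, the io.StringIO object would be replaced by a file opened with newline='', as the csv module documentation recommends.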
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np
np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")
This replaces only the delimiter commas with tabs on the fly, leaving the commas inside quoted fields untouched.
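For columns of mixed types, the same tab-rewriting trick can also produce a structured array. The sketch below is one possible variant, not the original answer's code: it assumes an invented in-memory sample and asks np.genfromtxt to take field names from the header row (names=True) and infer each column's type (dtype=None).

```python
import csv
import io

import numpy as np

# Hypothetical sample; the last row has a comma inside a quoted field.
sample = 'year,city,value\n2012,"Louisville KY",3.5\n2011,"Lexington, KY",4.0\n'

# Let csv.reader resolve the quoting, then re-join each row with tabs.
lines = ("\t".join(row) for row in csv.reader(io.StringIO(sample)))

# names=True reads field names from the first line; dtype=None infers
# a type per column; encoding=None keeps text columns as str.
data = np.genfromtxt(lines, delimiter="\t", names=True, dtype=None, encoding=None)

print(data.dtype.names)
print(data["city"])
```

This only works if no field contains a tab itself; if that is a possibility, another separator that never occurs in the data would have to be chosen.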
You can use pandas (which is becoming the default library for working with dataframes (heterogeneous data) in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.
Apart from its powerful CSV reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in NumPy you give is all strings, although you could use structured arrays).