Changing strings to floats in an imported .csv
You are correct that Python's built-in csv module is very primitive at handling mixed data types: it does all its type conversion at import time, and even then offers a very restrictive menu of options, which will mangle most real-world datasets (inconsistent quoting and escaping; missing or incomplete values in Booleans and factors; mismatched Unicode encoding resulting in phantom quote or escape characters inside fields; incomplete lines that can cause exceptions). Fixing CSV import is one of the countless benefits of pandas. So the ultimate answer is indeed to stop using the built-in csv module and start using pandas. But let's start with the literal answer to your question.
First you asked "How to convert strings to floats, on csv import". The answer to that is to open the file with csv.reader(..., quoting=csv.QUOTE_NONNUMERIC), as per the csv doc:
csv.QUOTE_NONNUMERIC: Instructs the reader to convert all non-quoted fields to type float.
That works if you're OK with all unquoted fields (integer, float, text, Boolean etc.) being converted to float, which is generally a bad idea for many reasons (missing or NA values in Booleans or factors will get silently squelched). Moreover, it will fail (raise ValueError) on unquoted text fields. So it's brittle and needs to be protected with try/except.
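To make that behaviour concrete, here is a minimal sketch. The helper name read_nonnumeric and the sample data are made up for illustration; only csv.reader and csv.QUOTE_NONNUMERIC come from the standard library.

```python
import csv
import io

def read_nonnumeric(text):
    """Parse CSV text, converting every unquoted field to float.

    Hypothetical helper for illustration: stops at the first row that
    contains an unquoted field which is not a valid float.
    """
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONNUMERIC)
    rows = []
    try:
        for row in reader:
            rows.append(row)
    except ValueError as exc:
        # An unquoted, non-numeric field ends up here.
        print(f"line {reader.line_num}: {exc}")
    return rows

good = '"name",1,2.5\n"other",3,4\n'   # text is quoted, numbers are not
bad = '"name",1,2.5\nunquoted,3,4\n'   # second row has unquoted text

print(read_nonnumeric(good))  # quoted fields stay str, the rest become float
print(read_nonnumeric(bad))   # only the rows read before the failure
```

Note that the integer 1 comes back as 1.0: every unquoted field becomes a float, whether you wanted that or not.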
Then you asked: 'I suppose the overall question is really just "What's the easiest way to read, organize, and synthesize data in .csv or excel format using Python?"'
To which the crappy csv.reader solution is, again, to open with csv.reader(..., quoting=csv.QUOTE_NONNUMERIC).
But as @geoffspear correctly replied 'The answer to your "overall question" may be "Pandas", although it's a bit vague.'
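For comparison, a minimal sketch of the pandas route, assuming pandas is installed. The column names and sample data are invented for illustration; with a real file you would pass its path to pd.read_csv instead of a StringIO object.

```python
import io

import pandas as pd

# Made-up sample data: a text column, a float column with a missing
# value, and an integer column.
csv_text = "name,x,y\na,1.5,2\nb,,4\n"

# read_csv infers a type per column instead of per file.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# The missing value in x becomes NaN rather than raising an error.
print(df["x"].isna().tolist())

# Convert a single column to float explicitly when you need to.
y_as_float = df["y"].astype(float)
print(y_as_float.tolist())
```

The point of the sketch is that type inference happens column by column, so text columns stay text while numeric columns become numeric.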
Try something like the following:

    import csv

    def read_lines():
        # newline='' lets the csv module handle line endings itself
        with open('testdata.csv', newline='') as data:
            reader = csv.reader(data)
            for row in reader:
                # raises ValueError if any cell is not a valid float
                yield [float(i) for i in row]

    for i in read_lines():
        print(i)

    # to get a list, instead of a generator, use
    xy = list(read_lines())
As for the easiest way, I suggest you look at the xlrd and xlwt modules for Excel files; personally, I always have a hard time with all the varying CSV formats.
When converting a bunch of strings to floats, you should use a try/except to catch errors:
    def conv(s):
        try:
            s = float(s)
        except ValueError:
            pass
        return s

    print([conv(s) for s in ['1.1', 'bls', '1', 'nan', 'not a float']])
    # [1.1, 'bls', 1.0, nan, 'not a float']
Notice that the strings that cannot be converted are simply passed through unchanged.
A CSV file is a text file, so you can use similar functionality:

    import csv

    def readLines():
        def conv(s):
            try:
                s = float(s)
            except ValueError:
                pass
            return s

        # newline='' lets the csv module handle line endings itself
        with open('testdata.csv', newline='') as data:
            reader = csv.reader(data)
            for row in reader:
                for cell in row:
                    y = conv(cell)
                    # do whatever with the single value
                    # (a float, or the original string if conversion failed)
                # OR
                # yield [conv(cell) for cell in row]  if you want to write a generator...
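The generator variant mentioned in the last comment can be sketched as follows. This version runs against in-memory data so it can be tried without a file; read_rows and the sample rows are made up for illustration.

```python
import csv
import io

def conv(s):
    """Return s as a float if possible, otherwise return it unchanged."""
    try:
        return float(s)
    except ValueError:
        return s

def read_rows(lines):
    """Yield each CSV row with its numeric cells converted to float."""
    for row in csv.reader(lines):
        yield [conv(cell) for cell in row]

# Any iterable of lines works here, including an open file object.
sample = io.StringIO("id,value\nabc,1.5\ndef,n/a\n")
rows = list(read_rows(sample))
print(rows)
# [['id', 'value'], ['abc', 1.5], ['def', 'n/a']]
```

One caveat with this approach: float() also accepts strings such as 'nan' and 'inf', so a text column containing those words would be converted too.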