How to read the csv file properly if each row contains different number of fields (number quite big)?

As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', 'rb') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1], restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError, e:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant

If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:

import pandas as pd
import csv


rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv', 'rb') as f_input:
    for row in f_input:
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123   
1      3  14423321  TGjsk123  hjf2329ccc   

                                    review  
0  I love this phone as it is easy to used  
1                          Suck restaurant  

row[::-1] is used to reverse the text of the whole line, the [2:] skips over the line ending which is now at the start of the line. Each line is then split on spaces. A list comprehension then re-reverses each split entry. Finally rows is appended to first by taking the fixed 5 column entries (now at the start). The remaining entries are then joined back together with a space and added as the final column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed width format, and you don't have to worry if the column widths being used change over time.


It looks like this is a fixed width file. Pandas supplies read_fwf for this exact purpose. The following code reads the file correctly for me. You may want to mess around with the widths a little if it doesn't work perfectly.

pandas.read_fwf('test.fwf', 
                 widths=[13, 12, 13, 5, 100], 
                 names=['user', 'item', 'time', 'rating', 'review'])

If the columns still line up with the edited version (where the rating comes first), you just need to add the correct specification. A guide line like the following helps to do this quickly:

0        1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
  I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                          hjf2329ccc   TGjsk123     14423321    3     

So the new command becomes:

pandas.read_fwf('test.fwf', 
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]], 
                names=['review', 'user', 'item', 'time', 'rating'])

Usecols refers to the name of the columns in the input file. If your file doesn't have those columns named like that (user, item, rating) it won't know which columns you're referring to. Instead you should pass an index like usecols=[0,1,2].

Also, names refers to what you're calling the columns you import. So, I think you cannot have four names upon importing 3 columns. Does this work?

pd.read_csv(filename, sep = " ", 
                      header = None, 
                      names = ["user","item","rating"], 
                      usecols = [0,1,2])

The tokenizing error looks like a problem with the delimiter. It may try to parse your review text column as many columns, because "I" "love" "this" ... are all separated by spaces. Hopefully if you're only reading the first three columns you can avoid throwing an error, but if not you could consider parsing row-by-row (for example, here: http://cmdlinetips.com/2011/08/three-ways-to-read-a-text-file-line-by-line-in-python/) and writing to a DataFrame from there.

Tags:

Python

Pandas

Csv