Python Pandas read_csv skip rows but keep header
Great answers already. Consider this generalized scenario:
Say your xls/csv has junk in the top 2 rows (rows #0 and #1). Row #2 (the 3rd row) is the real header, and you want to load 10 rows starting from row #50 (i.e. the 51st row).
Here's the snippet:
pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
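To check this without a real file, here is a minimal sketch that builds a comparable CSV in memory. The column names (id, value) and the junk text are made up for illustration; only the read_csv arguments match the snippet above.

```python
import io

import pandas as pd

# Hypothetical file: 2 junk lines, a header on line 2, then data from line 3 on.
lines = ["junk line 0", "junk line 1", "id,value"]
lines += [f"{n},{n * 10}" for n in range(3, 100)]  # id equals the file line number
buf = io.StringIO("\n".join(lines))

# header=2 picks file line 2 as the header; lines 3..49 are skipped,
# so the first data row comes from file line 50, and nrows=10 stops after 10 rows.
df = pd.read_csv(buf, header=2, skiprows=range(3, 50), nrows=10)

print(list(df.columns))
print(df["id"].tolist())
```

Because each id was set to its own line number, the printed list shows directly which file lines were loaded.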
You can pass a list of row numbers to skiprows instead of an integer.
By giving the function the integer 10, you're just skipping the first 10 lines.
To keep row 0 (as the header) and skip rows 1 through 9, so that the data starts at row 10, you can write:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
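A quick way to see the difference between the integer and the range is to run both against a small in-memory file. This is only a sketch: the pipe-delimited content and the column names n and sq are invented for the example.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited file: header on line 0, data on lines 1..14.
text = "n|sq\n" + "\n".join(f"{i}|{i * i}" for i in range(1, 15))

# skiprows=10 drops lines 0..9, header included, so line 10 is treated as the header.
as_int = pd.read_csv(io.StringIO(text), sep="|", skiprows=10)

# skiprows=range(1, 10) drops lines 1..9 only, so the real header is kept.
as_range = pd.read_csv(io.StringIO(text), sep="|", skiprows=range(1, 10))

print(list(as_int.columns))
print(list(as_range.columns), as_range["n"].tolist())
```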
Other ways to skip rows using read_csv
The two main ways to control which rows read_csv uses are the header or skiprows parameters.
Suppose we have the following CSV file with one column:
a
b
c
d
e
f
In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")), recreated before each call because reading consumes the buffer.
Read all lines as values (no header; column names default to integers):
>>> pd.read_csv(f, header=None)
   0
0  a
1  b
2  c
3  d
4  e
5  f
Use a particular row as the header (skip all lines before that):
>>> pd.read_csv(f, header=3)
   d
0  e
1  f
Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):
>>> pd.read_csv(f, header=[2, 4])
   c
   e
0  f
Skip N rows from the start of the file (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=3)
   d
0  e
1  f
Skip one or more rows by giving the row indices (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=[2, 4])
   a
0  b
1  d
2  f
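The five calls above can be reproduced in one script. The helper fresh() is a name introduced here, not part of pandas; it rebuilds the buffer each time because a StringIO object is exhausted after one read.

```python
import io

import pandas as pd


def fresh():
    """Return a new in-memory copy of the six-line sample file."""
    return io.StringIO("\n".join("abcdef"))


no_header = pd.read_csv(fresh(), header=None)      # all six lines are data
header_3 = pd.read_csv(fresh(), header=3)          # line 3 ("d") is the header
multi = pd.read_csv(fresh(), header=[2, 4])        # lines 2 and 4 form the header
skip_n = pd.read_csv(fresh(), skiprows=3)          # lines 0..2 are dropped
skip_list = pd.read_csv(fresh(), skiprows=[2, 4])  # lines 2 and 4 are dropped

print(no_header[0].tolist())
print(header_3["d"].tolist())
print(multi.columns.nlevels, multi.iloc[0, 0])
print(skip_n["d"].tolist())
print(skip_list["a"].tolist())
```

Comparing header_3 with skip_n shows that header=3 and skiprows=3 give the same frame for this file, while a list passed to header and a list passed to skiprows behave very differently.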
To expand on @AlexRiley's answer, the skiprows argument takes a list of numbers that determines which rows to skip. So:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
is the same as:
pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])
The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range, which produces a sequence of integers) and pass it to skiprows.
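As a sketch of that approach, the ignore list can be assembled from several pieces before it is passed in. The file content and the name rows_to_ignore are invented for this example.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited file: header on line 0, data on lines 1..12.
text = "n|label\n" + "\n".join(f"{i}|row{i}" for i in range(1, 13))

# Build the ignore list: a contiguous block from range() plus one hand-picked line.
rows_to_ignore = list(range(1, 10)) + [11]
df = pd.read_csv(io.StringIO(text), sep="|", skiprows=rows_to_ignore)

# Passing the range directly should match the spelled-out list.
from_range = pd.read_csv(io.StringIO(text), sep="|", skiprows=range(1, 10))
from_list = pd.read_csv(io.StringIO(text), sep="|", skiprows=[1, 2, 3, 4, 5, 6, 7, 8, 9])

print(df["n"].tolist())
print(from_range.equals(from_list))
```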