How do I keep the first 200 lines of all the csv files in a directory using bash?
Assuming that all the CSV files are in the current directory and that they all have a .csv filename suffix:
for file in ./*.csv; do
    head -n 200 "$file" >"$file.200"
done
This outputs the first 200 lines of each CSV file to a new file, using head and a redirection. The new file's name is the same as the old one's, but with .200 appended to the end. There is no check for whether the new filename already exists.
If you want to replace the originals:
for file in ./*.csv; do
    head -n 200 "$file" >"$file.200" &&
        mv "$file.200" "$file"
done
The && at the end of the head command makes it so that the mv won't be run if there was some issue with running head.
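The same logic written with an explicit if statement, which some find easier to read (an equivalent sketch):
for file in ./*.csv; do
    # only replace the original if head wrote the truncated copy successfully
    if head -n 200 "$file" >"$file.200"; then
        mv "$file.200" "$file"
    fi
done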
If your CSV files are scattered in subdirectories under the current directory, then use shopt -s globstar and replace the pattern ./*.csv in the loop with ./**/*.csv. This will locate any CSV file in or below the current directory and perform the operation on each. The ** globbing pattern matches "recursively" down into subdirectories, but only if the globstar shell option is set.
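For example, the replace-in-place loop above, adapted to recurse into subdirectories (a sketch; it reuses the same .200 temporary-file convention):
shopt -s globstar    # enable ** recursive globbing
for file in ./**/*.csv; do
    head -n 200 "$file" >"$file.200" &&
        mv "$file.200" "$file"
done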
For CSV files containing data with embedded newlines, the above will not work properly, as you may truncate a record in the middle. Instead, you would have to use some CSV-aware tool to do the job for you.
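To make the problem concrete, here is a small illustration (demo.csv is a throwaway file made up for this example):
# a two-column CSV whose single data record has a field spanning two lines
printf '%s\n' 'a,b' '"line one' 'line two",x' >demo.csv
# head counts physical lines, so it cuts that record in half:
head -n 2 demo.csv
This prints the header line and only the first half of the quoted field, leaving a truncated, invalid record behind.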
The following uses CSVkit, a set of command-line tools for parsing and, in general, working with CSV files, together with jq, a tool for working with JSON. There is no tool in CSVkit that can truncate a CSV file at a particular point, but we can convert the CSV files to JSON and use jq to output only the first 200 records:
for file in ./*.csv; do
    csvjson -H "$file" | jq -r '.[:200][] | map(values) | @csv' >"$file.200" &&
        mv "$file.200" "$file"
done
Given a CSV file like the short example below,
a,b,c
1,2,3
"hello, world",2 3,4
"hello
there","my good
man",nice weather for ducks
the csvjson command would produce
[
  {
    "a": "a",
    "b": "b",
    "c": "c"
  },
  {
    "a": "1",
    "b": "2",
    "c": "3"
  },
  {
    "a": "hello, world",
    "b": "2 3",
    "c": "4"
  },
  {
    "a": "hello\nthere",
    "b": "my good\nman",
    "c": "nice weather for ducks"
  }
]
The jq tool would then take this and, for each object in the array (restricted to the first 200 objects), extract the values as an array and format it as CSV.
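As a small standalone demonstration of that jq filter, using a three-record array and a slice of 2 instead of 200:
printf '%s' '[{"a":"1","b":"2"},{"a":"3","b":"4"},{"a":"5","b":"6"}]' |
    jq -r '.[:2][] | map(values) | @csv'
This prints "1","2" and "3","4" on separate lines, i.e. only the first two records, formatted as CSV.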
It's probably possible to do this transformation directly with csvpy, another tool in CSVkit, but as my Python skills are non-existent, I will not attempt to come up with a solution that does that.
Previous answers copy data and overwrite files. This technique should keep the same inodes, do no copying, and run a whole lot faster. For each file:
(a) Find the byte length of its first 200 lines.
(b) Truncate the file to that length, using truncate from GNU coreutils or the truncate found on some BSD systems:
SZ="$( head -n 200 -- "${file}" | wc -c )"
truncate -s "${SZ}" -- "${file}"
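Wrapped in the same sort of loop as the earlier answers (a sketch over the .csv files in the current directory):
for file in ./*.csv; do
    # byte count of the first 200 lines of this file
    SZ="$( head -n 200 -- "${file}" | wc -c )"
    # cut the file off at that byte offset, in place
    truncate -s "${SZ}" -- "${file}"
done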
Using sed with shell globbing:
sed -ni '1,200p' *.csv
Here -n suppresses the automatic printing of lines, 1,200p prints only lines 1 through 200, and -i (GNU sed) rewrites each file in place with that output.
Using globbing/sed/parallel:
printf '%s\n' *.csv | parallel -- sed -ni '1,200p' {}
This will find all .csv files in the current directory and feed them to GNU parallel, which will execute a sed command on them to keep only the first 200 lines. Note this will overwrite the files in place.
Or using head with parallel:
printf '%s\n' *.csv | parallel -- head -n 200 {} ">" {}.out
This will create new files with the .out suffix.
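If you then want the truncated copies to replace the originals, a similar parallel invocation could move them back (a sketch; note it overwrites the original .csv files):
printf '%s\n' *.csv | parallel -- mv {}.out {}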