How do I keep the first 200 lines of all the csv files in a directory using bash?
Assuming that all the CSV files are in the current directory and that they all have a .csv filename suffix:
for file in ./*.csv; do
    head -n 200 "$file" >"$file.200"
done
This outputs the first 200 lines of each CSV file to a new file, using head and a redirection. The new file's name is the same as the old one's, but with .200 appended to the end. There is no check for whether the new filename already exists.
If you want to replace the originals:
for file in ./*.csv; do
    head -n 200 "$file" >"$file.200" &&
        mv "$file.200" "$file"
done
The && at the end of the head command makes it so that the mv won't be run if there was some issue with running head.
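The same logic written with an explicit if statement, which some find easier to read (an equivalent sketch):
for file in ./*.csv; do
    # only replace the original if head wrote the truncated copy successfully
    if head -n 200 "$file" >"$file.200"; then
        mv "$file.200" "$file"
    fi
done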
If your CSV files are scattered in subdirectories under the current directory, then use shopt -s globstar and replace the pattern ./*.csv in the loop with ./**/*.csv. This will locate any CSV file in or below the current directory and perform the operation on each. The ** globbing pattern matches "recursively" down into subdirectories, but only if the globstar shell option is set.
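For example, the replace-in-place loop above, adapted to recurse into subdirectories (a sketch; it reuses the same .200 temporary-file convention):
shopt -s globstar    # enable ** recursive globbing
for file in ./**/*.csv; do
    head -n 200 "$file" >"$file.200" &&
        mv "$file.200" "$file"
done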
For CSV files containing data with embedded newlines, the above will not work properly, as you may truncate a record in the middle. Instead, you would have to use some CSV-aware tool to do the job for you.
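To make the problem concrete, here is a small illustration (demo.csv is a throwaway file made up for this example):
# a two-column CSV whose single data record has a field spanning two lines
printf '%s\n' 'a,b' '"line one' 'line two",x' >demo.csv
# head counts physical lines, so it cuts that record in half:
head -n 2 demo.csv
This prints the header line and only the first half of the quoted field, leaving a truncated, invalid record behind.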
The following uses CSVkit, a set of command-line tools for parsing and, in general, working with CSV files, together with jq, a tool for working with JSON. There is no tool in CSVkit that can truncate a CSV file at a particular point, but we can convert the CSV files to JSON and use jq to output only the first 200 records:
for file in ./*.csv; do
    csvjson -H "$file" | jq -r '.[:200][] | map(values) | @csv' >"$file.200" &&
        mv "$file.200" "$file"
done
Given a CSV file like the short example below,
a,b,c
1,2,3
"hello, world",2 3,4
"hello
there","my good
man",nice weather for ducks
the csvjson command would produce
[
  {
    "a": "a",
    "b": "b",
    "c": "c"
  },
  {
    "a": "1",
    "b": "2",
    "c": "3"
  },
  {
    "a": "hello, world",
    "b": "2 3",
    "c": "4"
  },
  {
    "a": "hello\nthere",
    "b": "my good\nman",
    "c": "nice weather for ducks"
  }
]
The jq tool would then take this and, for each object in the array (restricted to the first 200 objects), extract the values as an array and format it as CSV.
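As a small standalone demonstration of that jq filter, using a three-record array and a slice of 2 instead of 200:
printf '%s' '[{"a":"1","b":"2"},{"a":"3","b":"4"},{"a":"5","b":"6"}]' |
    jq -r '.[:2][] | map(values) | @csv'
This prints "1","2" and "3","4" on separate lines, i.e. only the first two records, formatted as CSV.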
It's probably possible to do this transformation directly with csvpy, another tool in CSVkit, but as my Python skills are non-existent, I will not attempt to come up with a solution that does that.
Previous answers copy data and overwrite files. This technique should keep the same inodes, do no copying, and run a whole lot faster. For each file:
(a) Find the byte length of its first 200 lines.
(b) Truncate the file to that length, using truncate from GNU coreutils or the truncate found on some BSD systems:
SZ="$( head -n 200 -- "${file}" | wc -c )"
truncate -s "${SZ}" -- "${file}"
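Wrapped in the same sort of loop as the earlier answers (a sketch over the .csv files in the current directory):
for file in ./*.csv; do
    # byte count of the first 200 lines of this file
    SZ="$( head -n 200 -- "${file}" | wc -c )"
    # cut the file off at that byte offset, in place
    truncate -s "${SZ}" -- "${file}"
done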
Using sed with shell globbing:
sed -ni '1,200p' *.csv
Here -n suppresses the automatic printing of lines, 1,200p prints only lines 1 through 200, and -i (GNU sed) rewrites each file in place with that output.
Using globbing/sed/parallel:
printf '%s\n' *.csv | parallel -- sed -ni '1,200p' {}
This will find all .csv files in the current directory and feed them to GNU parallel, which will execute a sed command on them to keep only the first 200 lines. Note this will overwrite the files in place.
Or using head with parallel:
printf '%s\n' *.csv | parallel -- head -n 200 {} ">" {}.out
This will create new files with the .out suffix.
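If you then want the truncated copies to replace the originals, a similar parallel invocation could move them back (a sketch; note it overwrites the original .csv files):
printf '%s\n' *.csv | parallel -- mv {}.out {}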