extract the 4th column from a csv file using unix command
UPDATE:
Actually, a much easier way is to set the record separator in gawk
:
$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4}' myFile.csv
"col4
"4th column
"4th column2
However, this will remove the trailing "
from the end of each column. To fix that you can print it yourself:
$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4"\""}' myFile.csv
"col4"
"4th column"
"4th column2"
If you don't want the quotes at all, you can set the field separator to ","
:
$ gawk 'BEGIN{RS="\"\n"; FS="\",\""}{print $3}' myFile.csv
col3
4th column
4th column2
The only way I can think of One way of doing this is to first modify the file and then parse it. In your example, the newline that actually separates two records is always following a "
:
"col1","col2","col3","col4" <-- here
1,"text1","<p>big <-- no "
If that is the case for the entire file, you can replace all newlines that are not immediately after a "
with a placeholder and so have everything in a single line. You can then parse normally with gawk
and finally replace the placeholder with the newline again. I will use the string &%&
as a placeholder since it is unlikely to exist in your file:
$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv | awk -F, '{print $4}'
"col4"
"4th column"
"4th column2"
The -p
flag for perl
means print each line of the input file
after applying the script given by -e
. Then there are 3 substitution (s/foo/bar/
) commands:
s/"\s*\n/"&%&/
: This will find any"
which is followed by 0 or more whitespace characters (\s*
) and then a newline character (\n
). It will replace that with"&%&
. The quotes are added to preserve the format and the&%&
is just a random placeholder, it could be anything that does not appear in your file.s/\n//g;
: since the real newlines have been replaced with the placeholder, we can now safely remove all remaining newlines in this record. This means that all lines of the current record have now been concatenated into the current line.s/&%&/\n/
: This turns the placeholder back into a normal new line.
To understand the output of the command run it without gawk
:
$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv
"col1","col2","col3","col4"
1,"text1","<p>big html text</p>","4th column"
2,"text2","<p>big2 html2 text2</p>","4th column2"
So, you now have your long records on single lines and this is perfect food for gawk
.
You can also do it directly in Perl:
perl -ne '$/="\"\n"; chomp;@a=split(/,/);print "$a[3]\"\n"' myFile.csv
"col4"
"4th column"
"4th column2"
This is using a bit more Perl magic. The $/
special variable is the input record separator. By setting it to "\n
we tell Perl to split lines not at \n
but only at "\n"
so that each record will be treated as a single line. Once that is done, chomp
removes the newline from the end of the line (for printing later) and split
splits each record (on ,
) and saves it in the array @a
. Finally, we print the 4th element of the array (arrays are numbered from 0 so that is $a[3]
) which is the 4th column.
And even more magic, turn on auto spitting (-a
) and split on commas (F","
). This will split each record into the special @F
array and you can print the 4th element of the array:
$ perl -F"," -ane '$/="\"\n";chomp;print "$F[3]"' myFile.csv
"col4"
"4th column"
"4th column2"
I would recommend using a battle-tested CSV parsing module. For example:
perl -MText::CSV -E '
$csv = Text::CSV->new({binary=>1});
while ($row = $csv->getline(STDIN)) {say $row->[3]}
' < file.csv
col4
4th column
4th column2
or this produces the same results:
ruby -rcsv -e 'CSV.foreach(ARGV.shift) {|row| puts row[3]}' file.csv
Python :
python -c "import csv,sys; print '\n'.join([ r[3] for r in csv.reader(open(sys.argv[1]))])" myfile.csv
A memory-conservative solution for large files that iterates through the file a line at a time unlike the above approach that loads the contents of the file into memory via a list
#!/usr/bin/env python
import sys
import csv
with open(sys.argv[1]) as f:
for row in csv.reader(f):
print(row[3])
TEST RESULT of all Solutions:
OS: Ubuntu 12.04
Public CSV Data Download from : http://seanlahman.com/baseball-archive/statistics/
Versions details
root@ubuntu:~# python --version
Python 2.7.3
root@ubuntu:~# ruby --version
ruby 1.8.7 (2011-06-30 patchlevel 352) [i686-linux]
root@ubuntu:~# perl --version
This is perl 5, version 14, subversion 2 (v5.14.2) built for i686-linux-gnu-thread-multi-64int
Result with time
root@ubuntu:~# time python -c "import csv,sys; print '\n'.join([ r[3] for r in csv.reader(open(sys.argv[1]))])" Master.csv > /tmp/python
real 0m1.112s
user 0m0.056s
sys 0m0.316s
root@ubuntu:~# time ruby -rcsv -e 'CSV.foreach(ARGV.shift) {|row| puts row[3]}' Master.csv > /tmp/ruby
real 0m24.582s
user 0m23.397s
sys 0m0.448s
root@ubuntu:~# time perl -MText::CSV -E '
> $csv = Text::CSV->new({binary=>1});
> while ($row = $csv->getline(STDIN)) {say $row->[3]}
> ' < Master.csv > /tmp/perl
real 0m7.049s
user 0m5.876s
sys 0m0.468s