What's the easiest way to get the headers from a CSV file in Ruby?
It looks like CSV.read will give you access to a headers method:
headers = CSV.read("file.csv", headers: true).headers
# => ["A", "B", "C"]
The above is really just a shortcut for CSV.open("file.csv", headers: true).read.headers. You could have gotten to it using CSV.open as you tried, but since CSV.open doesn't actually read the file when you call it, it has no way of knowing what the headers are until it has read some data. This is why it just returns true in your example. After reading some data, it finally returns the headers:
table = CSV.open("file.csv", :headers => true)
table.headers
# => true
table.read
# => #<CSV::Table mode:col_or_row row_count:2>
table.headers
# => ["A", "B", "C"]
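If you only need the headers, reading the whole table is not required for them to become available. Here is a minimal sketch, assuming a small sample file (the helper name headers_before_and_after and the sample data are made up for illustration), that pulls a single row with shift instead of calling read:

```ruby
require 'csv'
require 'tempfile'

# Returns the value of CSV#headers before anything is read and again after
# pulling a single row with #shift, instead of calling #read on the whole file.
def headers_before_and_after(path)
  CSV.open(path, headers: true) do |csv|
    before = csv.headers   # nothing parsed yet
    csv.shift              # parses the header line plus one data row
    [before, csv.headers]
  end
end

# Hypothetical sample data, written to a temp file so the sketch is runnable.
Tempfile.create(['sample', '.csv']) do |file|
  file.write("A,B,C\n1,2,3\n4,5,6\n")
  file.flush
  p headers_before_and_after(file.path)
  # => [true, ["A", "B", "C"]]
end
```

The block form of CSV.open also closes the file for you when the block finishes.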
In my opinion the best way to do this is:
headers = CSV.foreach('file.csv').first
Please note that it's very tempting to use CSV.read('file.csv', headers: true).headers, but the catch is that CSV.read loads the complete file into memory, which increases your memory footprint and makes it very slow for bigger files. Whenever possible, please use CSV.foreach.
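To spell out why the CSV.foreach version is cheap: called without a block, CSV.foreach returns an Enumerator, so first stops after the first row. A small sketch, assuming the header is the first line of the file (the helper name csv_headers and the sample data are hypothetical):

```ruby
require 'csv'
require 'tempfile'

# Returns the first row of the file as an array of strings, or nil if the
# file is empty. Without a block, CSV.foreach returns an Enumerator, and
# Enumerable#first stops iterating after one row.
def csv_headers(path)
  CSV.foreach(path).first
end

# Hypothetical sample data, written to a temp file so the sketch is runnable.
Tempfile.create(['sample', '.csv']) do |file|
  file.write("A,B,C\n1,2,3\n4,5,6\n")
  file.flush
  p csv_headers(file.path)
  # => ["A", "B", "C"]
end
```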
Below are the benchmarks for just a 20 MB file:
Ruby version: ruby 2.4.1p111
File size: 20M
****************
Time and memory usage with CSV.foreach:
Time: 0.0 seconds
Memory: 0.04 MB
****************
Time and memory usage with CSV.read:
Time: 5.88 seconds
Memory: 314.25 MB
A 20 MB file increased the memory footprint by 314 MB with CSV.read; imagine what a 1 GB file will do to your system. In short, please do not use CSV.read on large files. I did, and the system went down for a 300 MB file.
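The same idea applies when you need more than the headers. As a rough sketch (the helper sum_column, the column name and the sample data are all made up for illustration), CSV.foreach with a block hands you one row at a time, so the whole file never has to be built up as a table in memory:

```ruby
require 'csv'
require 'tempfile'

# Sums one column by streaming the file row by row.
# `column` is a hypothetical header name chosen by the caller.
def sum_column(path, column)
  total = 0.0
  CSV.foreach(path, headers: true) do |row|
    total += row[column].to_f   # a missing value (nil) counts as 0.0
  end
  total
end

# Hypothetical sample data, written to a temp file so the sketch is runnable.
Tempfile.create(['sample', '.csv']) do |file|
  file.write("item,price\napple,1.5\npear,2.5\n")
  file.flush
  p sum_column(file.path, 'price')
  # => 4.0
end
```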
For further reading, here is a very good article on handling big files.
Also, below is the script I used for benchmarking CSV.foreach and CSV.read:
require 'benchmark'
require 'csv'

def print_memory_usage
  memory_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_after = `ps -o rss= -p #{Process.pid}`.to_i
  puts "Memory: #{((memory_after - memory_before) / 1024.0).round(2)} MB"
end

def print_time_spent
  time = Benchmark.realtime do
    yield
  end
  puts "Time: #{time.round(2)} seconds"
end

file_path = '{path_to_csv_file}'

puts 'Ruby version: ' + `ruby -v`
puts 'File size:' + `du -h #{file_path}`

puts 'Time and memory usage with CSV.foreach: '
print_memory_usage do
  print_time_spent do
    headers = CSV.foreach(file_path, headers: false).first
  end
end

puts 'Time and memory usage with CSV.read:'
print_memory_usage do
  print_time_spent do
    headers = CSV.read(file_path, headers: true).headers
  end
end
If you want a shorter answer, you can try:
headers = CSV.open("file.csv", &:readline)
# => ["A", "B", "C"]
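If the file uses a different delimiter, the same one-liner accepts the usual CSV options. A minimal sketch, assuming a semicolon-separated file (the helper name read_header_row and the sample data are hypothetical; col_sep is the standard CSV option):

```ruby
require 'csv'
require 'tempfile'

# Reads just the first line and returns it as an array (nil for an empty file).
# The block form of CSV.open closes the file once the line has been read.
def read_header_row(path, col_sep: ',')
  CSV.open(path, col_sep: col_sep, &:readline)
end

# Hypothetical semicolon-separated sample, written to a temp file.
Tempfile.create(['sample', '.csv']) do |file|
  file.write("A;B;C\n1;2;3\n")
  file.flush
  p read_header_row(file.path, col_sep: ';')
  # => ["A", "B", "C"]
end
```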