Ruby : How can I detect/intelligently guess the delimiter used in a CSV file?
Looks like the py implementation just checks a few dialects: excel or excel_tab. So, a simple implementation of something that just checks for ","
or "\t"
is:
COMMON_DELIMITERS = ['","',"\"\t\""].freeze
def sniff(path)
first_line = File.open(path).first
return unless first_line
snif = {}
COMMON_DELIMITERS.each do |delim|
snif[delim] = first_line.count(delim)
end
snif = snif.sort { |a,b| b[1]<=>a[1] }
snif[0][0] if snif.size > 0
end
Note: that would return the full delimiter it finds, e.g. ","
, so to get ,
you could change the snif[0][0]
to snif[0][0][1]
.
Also, I'm using count(delim)
because it is a little faster, but if you added a delimiter that is composed of two (or more) characters of the same type like --
, then it would could each occurrence twice (or more) when weighing the type, so in that case, it may be better to use scan(delim).length
.
Here is Gary S. Weaver answer as we are using it in production. Good solution that works well.
class ColSepSniffer
NoColumnSeparatorFound = Class.new(StandardError)
EmptyFile = Class.new(StandardError)
COMMON_DELIMITERS = [
'","',
'"|"',
'";"'
].freeze
def initialize(path:)
@path = path
end
def self.find(path)
new(path: path).find
end
def find
fail EmptyFile unless first
if valid?
delimiters[0][0][1]
else
fail NoColumnSeparatorFound
end
end
private
def valid?
!delimiters.collect(&:last).reduce(:+).zero?
end
# delimiters #=> [["\"|\"", 54], ["\",\"", 0], ["\";\"", 0]]
# delimiters[0] #=> ["\";\"", 54]
# delimiters[0][0] #=> "\",\""
# delimiters[0][0][1] #=> ";"
def delimiters
@delimiters ||= COMMON_DELIMITERS.inject({}, &count).sort(&most_found)
end
def most_found
->(a, b) { b[1] <=> a[1] }
end
def count
->(hash, delimiter) { hash[delimiter] = first.count(delimiter); hash }
end
def first
@first ||= file.first
end
def file
@file ||= File.open(@path)
end
end
Spec
require "spec_helper"
describe ColSepSniffer do
describe ".find" do
subject(:find) { described_class.find(path) }
let(:path) { "./spec/fixtures/google/products.csv" }
context "when , delimiter" do
it "returns separator" do
expect(find).to eq(',')
end
end
context "when ; delimiter" do
let(:path) { "./spec/fixtures/google/products_with_semi_colon_seperator.csv" }
it "returns separator" do
expect(find).to eq(';')
end
end
context "when | delimiter" do
let(:path) { "./spec/fixtures/google/products_with_bar_seperator.csv" }
it "returns separator" do
expect(find).to eq('|')
end
end
context "when empty file" do
it "raises error" do
expect(File).to receive(:open) { [] }
expect { find }.to raise_error(described_class::EmptyFile)
end
end
context "when no column separator is found" do
it "raises error" do
expect(File).to receive(:open) { [''] }
expect { find }.to raise_error(described_class::NoColumnSeparatorFound)
end
end
end
end
I'm not aware of any sniffer implementation in the CSV library included in Ruby 1.9. It will try to auto-discover the row separator, but the column separator is assumed to be a comma by default.
One idea would be to try parsing a sample number of rows (5% of total maybe?) using each of the possible separators. Whichever separator results in the same number of columns most consistently is probably the correct separator.