HTML to Plain Text with Ruby?

Actually, this is much simpler:

require 'nokogiri'

# my_html is a String containing the HTML markup to convert
puts Nokogiri::HTML(my_html).text

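You still have line-break issues, though: the text method joins the text of adjacent elements with nothing in between and keeps the whitespace of the source markup as-is, so you'll have to decide how you want to handle line breaks yourself.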


You could start with something like this:

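require 'open-uri'
require 'nokogiri'

uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
# URI.open comes from open-uri; plain open(uri) no longer accepts URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open(uri))
# Remove nodes whose contents should not appear in the text
doc.css('script, link').each { |node| node.remove }
# Collapse runs of repeated spaces and runs of repeated newlines
puts doc.css('body').text.squeeze(" \n")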
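Note that squeeze(" \n") only collapses runs of the same character, so a mixed sequence such as space, newline, space, newline survives. The sketch below uses a made-up sample string to show the difference; the regex version is one possible alternative, not the only way to tidy the output.

```ruby
# Made-up sample resembling text extracted from a page.
sample = "Title\n\n\n  indented   text \n \n end"

# squeeze collapses "\n\n\n" and runs of spaces,
# but leaves the alternating " \n \n " sequence alone.
squeezed = sample.squeeze(" \n")
p squeezed

# Alternative: treat any whitespace around a newline as one line break,
# then collapse the remaining runs of spaces.
tidy = sample.gsub(/[ \t]*\n\s*/, "\n").squeeze(' ')
p tidy
```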

Tags:

Ruby