September 21, 2008
How to strip tags, scripts and styles from HTML
Haven’t you sometimes wished that all the HTML tags, scripts and styles would just vanish from a web page, leaving you with pure text (for fun and profit)? Well, here’s how I am managing it:
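require "open-uri"
require "hpricot"
require "sanitize"
# Fetch the page (on Ruby 3.0+, call URI.open instead of open)
html = open("http://www.google.com")
hp = Hpricot(html.read)
# Remove the script and style elements along with their contents
hp.search("script").remove
hp.search("style").remove
# Strip the remaining tags; okTags is empty, so no tag is allowed through
sanitize(hp.innerHTML, okTags="")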
And the output?
“GoogleWeb Images News Orkut Groups Gmail more […]
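If Hpricot or sanitize is not installed, the same idea can be roughly approximated with the Ruby standard library alone. This is a minimal sketch, not the method used above: `strip_markup` is a hypothetical helper name, the sample markup is made up for illustration, and the regular expressions are deliberately naive (for example, they will trip over a `>` inside an attribute value), so a real parser is still the safer choice for arbitrary pages.

```ruby
require "cgi"

# Hypothetical helper (not from the original snippet): a dependency-free
# approximation of the "remove script/style, then strip tags" pipeline.
def strip_markup(html)
  text = html.dup
  # Drop <script> and <style> elements together with their contents.
  text.gsub!(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, " ")
  # Drop HTML comments.
  text.gsub!(/<!--.*?-->/m, " ")
  # Drop every remaining tag, keeping the text between tags.
  text.gsub!(/<[^>]+>/, " ")
  # Decode entities such as &amp; and collapse runs of whitespace.
  CGI.unescapeHTML(text).gsub(/\s+/, " ").strip
end

# Made-up sample markup, used only to exercise the helper.
sample = <<~HTML
  <html><head><title>Demo</title>
  <style>body { color: red; }</style>
  <script>alert("hi");</script></head>
  <body><h1>Hello</h1><p>Tom &amp; Jerry</p></body></html>
HTML

puts strip_markup(sample)
```

Each tag is replaced with a space rather than an empty string, so text from neighbouring elements does not run together the way "GoogleWeb" does in the output above.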


