September 21, 2008
(Posted at 5:07 am) How to strip tags, scripts and styles off HTML
Haven’t you sometimes wished that all the HTML tags, scripts and styles would just vanish from a page, leaving you with pure text (for fun and profit)? Well, here’s how I am managing it:
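# open-uri lets open() fetch a URL (on Ruby 3.0+ this would be URI.open)
require "open-uri"
require "hpricot"
require "sanitize"
html = open("http://www.google.com")
hp = Hpricot(html.read)
# Remove <script> and <style> elements together with their contents
hp.search("script").remove
hp.search("style").remove
# Strip the remaining tags; the empty okTags list means no tag is kept
sanitize(hp.innerHTML, okTags="")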
And output?
“GoogleWeb Images News Orkut Groups Gmail more ▼ Books Scholar Blogs YouTube Calendar Photos Documents Reader even more » iGoogle | Sign inIndia Advanced Search Preferences Language ToolsSearch: the web pages from India Google.co.in offered in: Hindi Bengali Telugu Marathi Tamil Gujarati Kannada Malayalam Punjabi Advertising Programs - About Google - Go to Google.com©2008 - Privacy”
Now you can put this text to any imaginable use, as I mentioned earlier, maybe for fun & profit.
Libraries - hpricot, sanitize, open-uri
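If you can't install those gems, here is a rough sketch of the same two steps (drop the script and style blocks, then strip whatever tags remain) using nothing but Ruby's standard library. It swaps the parser for plain regular expressions, so treat it as an approximation that will trip over malformed markup, not as a replacement for hpricot. The strip_html helper and the sample markup are made up for illustration.

```ruby
# Hypothetical helper: a regex-only approximation of the approach above.
def strip_html(html)
  # Step 1: drop <script> and <style> elements along with their contents.
  text = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, " ")
  # Step 2: drop every remaining tag, leaving a space where it was.
  text = text.gsub(/<[^>]+>/, " ")
  # Collapse the leftover runs of whitespace.
  text.gsub(/\s+/, " ").strip
end

# Made-up sample page, so the sketch runs without network access.
sample = <<~HTML
  <html><head><style>body { color: red; }</style>
  <script>alert("hi");</script></head>
  <body><h1>Hello</h1><p>Plain <b>text</b> only</p></body></html>
HTML

puts strip_html(sample)
```

Unlike the sanitize call, this puts a space wherever a tag used to be, so neighbouring words don't get glued together.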
Have fun!



hp.inner_text after removing the script and style tags
Comment by makuchaku — November 18, 2008 @ 12:21 pm