September 21, 2008

(Posted at 5:07 am)

How to strip tags, script and style off the HTML

Haven’t you sometimes wished that all the HTML, script and style tags would just vanish from a web page, leaving you with nothing but pure text (for fun and profit)? Well, here’s how I am managing it :)

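require "open-uri"
require "hpricot"
require "sanitize"

# Fetch the page and parse it with Hpricot
html = open("http://www.google.com")
hp = Hpricot(html.read)

# Drop <script> and <style> elements along with their contents
hp.search("script").remove
hp.search("style").remove

# Strip every remaining tag (okTags is empty, so no tag is allowed through)
puts sanitize(hp.innerHTML, okTags="")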

And the output?

“GoogleWeb Images News Orkut Groups Gmail more ▼ Books Scholar Blogs YouTube Calendar Photos Documents Reader even more » iGoogle | Sign inIndia   Advanced Search  Preferences  Language ToolsSearch: the web pages from India Google.co.in offered in: Hindi Bengali Telugu Marathi Tamil Gujarati Kannada Malayalam Punjabi Advertising Programs - About Google - Go to Google.com©2008 - Privacy”

Now you can put this text to any imaginable use - as I mentioned earlier - maybe fun & profit :)

Libraries - hpricot, sanitize, open-uri
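
If installing those gems is not an option, the same two steps (throw away script and style blocks, then strip whatever tags remain) can be roughly approximated with plain regular expressions. This is only a minimal sketch and not what the post itself uses: strip_markup and the sample string are hypothetical names made up for illustration, only a handful of entities are decoded, and regexes will trip over malformed HTML or a ">" inside an attribute value, which a real parser handles properly.

```ruby
# Hypothetical helper (not from the post): regex-only approximation of
# "remove script/style, then strip all tags". Assumes reasonably
# well-formed HTML; a real parser is far more robust.
def strip_markup(html)
  # Small, deliberately incomplete entity table (an assumption of this sketch)
  entities = { "&amp;" => "&", "&lt;" => "<", "&gt;" => ">",
               "&quot;" => "\"", "&nbsp;" => " " }

  text = html.gsub(/<!--.*?-->/m, "")                            # drop HTML comments
  text = text.gsub(/<(script|style)\b[^>]*>.*?<\/\1\s*>/mi, "")  # drop script/style blocks and their contents
  text = text.gsub(/<[^>]+>/, " ")                               # replace each remaining tag with a space
  text = text.gsub(/&(?:amp|lt|gt|quot|nbsp);/, entities)        # decode the entities listed above
  text.gsub(/\s+/, " ").strip                                    # collapse runs of whitespace
end

sample = "<html><head><style>p { color: red }</style></head>" \
         "<body><p>Hello &amp; <b>welcome</b></p><script>alert(1)</script></body></html>"
puts strip_markup(sample)
```

Replacing tags with a space rather than nothing keeps words in neighbouring elements from running together, at the cost of an occasional extra space that the final whitespace collapse cleans up.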

Have fun!

You can also just call hp.inner_text after removing the script and style tags.

Comment by makuchaku — November 18, 2008 @ 12:21 pm
