A web spider dream


(Spidering the web, literally!)

It's been one of my dreams for a long time to write my own spider. The idea first came to me while watching the movie The Matrix: in one of the opening scenes, while the lead character is sleeping, his computer automatically runs a search (about another character called "Morpheus") and gets the results for him.

Writing an automated search would take only a couple of minutes with any automation tool like AutoIt, but efficiently gathering data from the various search engines, then organizing, sorting, and storing it, requires talent of another kind...

So the journey begins: the making of the web spider.


The Googlebot

By the way, guys, did you know that Google gathers data from different sites using something like the web spider I mentioned earlier? The name of the spider they use is Googlebot.


Robots.txt in action

Another interesting thing I noticed is that almost every site I visited after learning about spiders has a robots.txt file, which states which parts of the site a web bot (spider) may visit (access). You too can see the file, and the names of the various spiders out there: just take the root URL of the site and append robots.txt to it.

Say you wanted the robots.txt of http://en.wikipedia.org.
Then type http://en.wikipedia.org/robots.txt
and you get the details of all the bots (spiders) allowed and disallowed by the Wikipedia folks. You can try this out for any site. Cool, huh?
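You can even check those rules from a program. Python ships a robots.txt parser in its standard library (urllib.robotparser); here is a small sketch using it. The rules and the bot names ("BadBot", "MySpider") below are made-up examples, not taken from any real site:

```python
from urllib import robotparser

# A hypothetical robots.txt: one bot banned everywhere,
# everyone else only barred from /private/.
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)  # rp.set_url(...) + rp.read() would fetch a live file instead

print(rp.can_fetch("BadBot", "http://example.com/page"))        # False
print(rp.can_fetch("MySpider", "http://example.com/page"))      # True
print(rp.can_fetch("MySpider", "http://example.com/private/x")) # False
```

A polite spider calls can_fetch() before every download and skips any URL the site has disallowed.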


So in basic terms, a web spider is software that visits sites and gets the data for you, rather than you manually visiting each page with a web browser.
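The core of that "visit a site, get the data" loop is just downloading a page and pulling the links out of it so there is somewhere to crawl next. Here is a minimal sketch of the link-extraction half, using only the standard library; the HTML snippet is hard-coded so the example runs without a network connection (a real spider would fetch it with something like urllib.request.urlopen(url).read()):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute ones the
                    # spider can fetch later.
                    self.links.append(urljoin(self.base_url, value))

# Hard-coded page content for illustration only.
html = '<a href="/wiki/Spider">Spider</a> <a href="robots.txt">robots</a>'
parser = LinkExtractor("http://en.wikipedia.org/")
parser.feed(html)
print(parser.links)
# ['http://en.wikipedia.org/wiki/Spider', 'http://en.wikipedia.org/robots.txt']
```

Feed each newly discovered link back into the same loop (after checking robots.txt) and you have the skeleton of a crawler.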

In future posts, we shall begin the journey of actually coding a spider and using it.

(As a tribute to the movie "The Matrix", this post was intentionally colored green)
