Wikiwix crawling robot

The Wikiwix robot works in three different modes:
  • Crawling the Wikimedia projects: for all projects of the Wikimedia Foundation, our robot listens to the recent changes (RC) IRC bot at irc.wikimedia.org and crawls pages as soon as they are modified or created. This keeps our search engine for Wikipedia and its sister projects up to date.
  • Crawling web pages for our Twitter-based search engine. This complements the encyclopedic Wikipedia search and gives you access to what is buzzing on the internet right now.
  • Crawling entire websites on demand: you can build a custom search engine for your favorite websites with Wikimarks.

How to control our robot

Prevent our robot from crawling your website

  • In the robots.txt file, located at the root of your website (http://yoursite.com/robots.txt), put the following lines:

    User-agent: wikiwix
    Disallow: /

  • Or, to block just a subtree or particular pages:

    User-agent: wikiwix
    Disallow: /subtree
    Disallow: /somewhere/particular-page.html
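
If you want to check rules like these before publishing them, the following Python sketch may help. It uses the standard urllib.robotparser module; the helper name wikiwix_allowed and the sample URLs are illustrative assumptions, and the standard parser's matching may differ in small details from what our robot does.

```python
from urllib.robotparser import RobotFileParser


def wikiwix_allowed(robots_txt: str, url: str, agent: str = "wikiwix") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; it is not part of the Wikiwix robot.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Blocks the whole site for the wikiwix user agent.
BLOCK_ALL = """\
User-agent: wikiwix
Disallow: /
"""

# Blocks only one subtree and one particular page.
BLOCK_SOME = """\
User-agent: wikiwix
Disallow: /subtree
Disallow: /somewhere/particular-page.html
"""

if __name__ == "__main__":
    for path in ("/subtree/page.html", "/somewhere/particular-page.html", "/other.html"):
        url = "http://yoursite.com" + path
        print(path, wikiwix_allowed(BLOCK_SOME, url))
```

Disallow values are matched as path prefixes, so /subtree also covers everything below it, while pages outside the listed paths remain crawlable.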

Prevent our robot from crawling some pages

We respect robots meta tags:
  • Putting the line <meta name="robots" content="noindex,nofollow"/> in a page prevents us from indexing the page and from following the links in it.
  • Any combination of index or noindex with follow or nofollow tells us whether to index the page and whether to follow its links.
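
As an illustration of how such a tag can be read, here is a minimal Python sketch based on the standard html.parser module. The class RobotsMetaParser and the function robots_directives are hypothetical names, and the default of index plus follow when no tag is present is an assumption based on common convention, not a description of our robot's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects index/follow directives from <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        # Assumed defaults: index the page and follow its links.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for token in (attributes.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "index":
                self.index = True
            elif token == "nofollow":
                self.follow = False
            elif token == "follow":
                self.follow = True


def robots_directives(html: str) -> tuple[bool, bool]:
    """Return (index, follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow"/></head></html>'
    print(robots_directives(page))
```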