Robots.txt is 25 years old

The /robots.txt is 25 years old.

Thanks to Gary Illyes, Webmaster Trends Analyst at Google, for pointing that out.

So we celebrate with cake!

[Image: a robots.txt cake, inspired by last.fm's /robots.txt]

More importantly, see the Formalizing the Robots Exclusion Protocol Specification post on Google's Webmaster Central Blog for news about an effort to, somewhat belatedly, turn it into a standard. See the draft over at the IETF.

This milestone prompted me to go spelunking through the archives to put together a timeline. Pure nostalgia, and I expect no one cares. But if you do, read on.

The actual announcement came in July 1994 (msg):

Date: Sun, 3 Jul 1994 18:58:37 +0200
From: Martijn Koster <m.koster@nexor.co.uk>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: ANNOUNCE: A Standard for Robot Exclusion

Some of you may remember a proposed standard for robot exclusion
I posted to this forum last year.
In short it is a method of guiding robots away from certain areas in a
Web server's URL space, by providing a simple text file on the server.
This is especially handy if you have large archives, CGI scripts with
massive URL subtrees, temporary information, or you simply don't want
to serve robots.

After some recent discussions on the Robots mailing list some changes
were proposed, and a new version of this document is now available
from <http://web.nexor.co.uk/mak/doc/robots/robots.html>. Most of the
robots in operation either use it already, or have promised support
soon.

-- Martijn
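
For anyone who has never looked inside one, a minimal /robots.txt in the format that announcement describes might read as follows; the paths and the robot name here are made up purely for illustration:

# Keep all robots out of the CGI scripts and the temporary area,
# and keep one (hypothetical) robot out entirely.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: ExampleBot
Disallow: /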

But obviously some work preceded that.

In 1993 I ran a web server on web.nexor.co.uk; I announced that to the www-talk mailing list on 4 May 1993. I don't recall what software it used (for some dynamic sites I started with shell scripts launched from inetd), but by 28 June 1993 I was running Tony Sanders' Plexus server, a web server implemented in Perl. The infrastructure was a scrounged Sun 4/330 UNIX server, connected to the internet over a company leased line (ISDN perhaps?).

From an old web page:

Then in September 1993 my Web server mailed me a report indicating someone had been retrieving an unusually large number of documents from my Web server. Further investigation indicated that at three points in time the same sequence of documents was retrieved at an enormous rate, in the order of 1 document per second. The sequence of documents showed that they were all interlinked; a Web wanderer had been at work.

I also found an email exchange with Wayne Allen at MCC, dated 23 Sept 1993, in which I complained about his frequent repeated automated downloads; so perhaps that was what motivated me to try and do something about it.

I did see Matthew Gray's home page mentioning the World Wide Web Wanderer, and found an email quote from him dated 6 Jun 1994: "W4v1.0 was written in June of 1993 as a simple depth first search that kept the entire database in memory of where it had been and dumped to disk when it had exhausted a document tree. Very simple."

And per this article, Jonathon Fletcher was working on his crawler for JumpStation in September 1993.

So other web robots were around at the time.

Charlie Stross describes overloading my server with his Websnarf crawler. I found emails from him dated 7 Mar 1994; I'm not sure if there were others.

Anyway, suffice it to say this was an issue that affected my attempts to offer web services.

The first public mention of "a proposed standard that will allow WWW server maintainers to indicate if they want robots to access their server, and if so which parts" was on 25 Feb 1994, to the www-talk@www0.cern.ch mailing list (msg):

Date: Fri, 25 Feb 1994 13:18:03 --100
From: Martijn Koster <m.koster@nexor.co.uk>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Important: Spiders, Robots and Web Wanderers

Recognising that robots exist and will never go away, I have
setup a page devoted to gathering as much info about active 
robots as possible: http://web.nexor.co.uk/mak/doc/robots/robots.html
(Please use this exact URL for all accesses).

It contains codes of practice for robot writers, a list of all known
robots in use, and most importantly a proposed standard that will
allow WWW server maintainers to indicate if they want robots to access
their server, and if so which parts.

This proposed standard doesn't require any server/client/protocol
changes, and can provide a partial solution to problems caused by
robots. I am inviting comments on it, but I do hope we can keep the
discussion focused, and not degenerate in a "robots are good/bad"
discussion that won't be resolved.

Robots are one of the few aspects of the web that cause operational
problems and cause people grief. At the same time they do provide very
useful services. This standard should minimise the problems and may
well maximise the benefits, so I think we need to sort this out as
soon as possible. The major robot writers are in favour of this idea,
so I don't see any fundamental problems.

Unfortunately, no copy of that version survives.

That same day, 25 Feb 1994, a discussion titled "WHAT TO DO ABOUT MISERABLE AUTOMATED DOWNLOADS???" was started in comp.infosystems.www by "PG" (Paul Ginsparg, then at Los Alamos National Lab), after a robot visit from Imperial College London; I mentioned the proposal in that discussion too.

Back then the proposal used the filename /RobotsNotWanted.txt (see this mention). By May 1994 requests for it were being spotted in the wild.

In May 1994 the "First International Conference on the World-Wide Web" (now just The Web Conference) was held in Geneva. I was there to present a paper on "Aliweb - Archie-Like Indexing in the Web" (paper), which included a mention:

Currently there is an effort underway to prevent some of these problems: there are Guidelines for Robot Writers and there is a Proposal for Robot Exclusion.

And in Roy Fielding's presentation of MOMspider (paper):

As an additional precaution, MOMspider periodically looks for and obeys any restrictions found in a site’s /RobotsNotWanted.txt document as per the standard proposed by Martijn Koster [Koster94]. Before any link is tested, the destination site is looked-up in a table of recently accessed sites (the definition of "recently" can be set by the user). If it is not found, that site’s /RobotsNotWanted.txt document is requested and, if found, parsed for any restrictions to be placed on MOMspider robots. Any such restrictions are added to the user’s avoid list and the site is added to the site table, both with expiration dates indicating when the site must be checked again. Although this opens the possibility for a discrepancy to exist between the restrictions applied and the contents of a recently changed /RobotsNotWanted.txt document, it is necessary to avoid a condition where the site checks cause a greater load on the server than would the maintenance requests alone. An example sites file is provided in Appendix D.
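
That pattern, caching each site's rules and re-fetching once the entry expires, is still roughly what polite crawlers do. Here is a rough sketch in Python; the class name, the one-day TTL and the URLs are assumptions of mine, and the parsing is delegated to the standard library's urllib.robotparser rather than anything MOMspider-specific:

import time
import urllib.robotparser

class RobotsCache:
    # Per-site robots rules with an expiration date, in the spirit of the
    # site table described above. All names and the TTL are hypothetical.
    def __init__(self, ttl_seconds=24 * 60 * 60):  # "recently" = one day
        self.ttl = ttl_seconds
        self.sites = {}  # hostname -> (parser, expires_at)

    def allowed(self, agent, host, path):
        parser, expires = self.sites.get(host, (None, 0.0))
        if parser is None or time.time() > expires:
            # (Re)fetch and parse this site's robots.txt, and note when the
            # cached entry must be checked again.
            parser = urllib.robotparser.RobotFileParser(f"http://{host}/robots.txt")
            parser.read()
            self.sites[host] = (parser, time.time() + self.ttl)
        return parser.can_fetch(agent, f"http://{host}{path}")

cache = RobotsCache()
print(cache.allowed("ExampleBot", "www.example.com", "/docs/page.html"))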

David Eichmann's presentation of The RBSE Spider (paper) also mentions an implementation:

A third table, added more recently following the derivation of the robot exclusion protocol, is a permanent list of patterns to match against for suppression.

After the conference, on 1 Jun 1994 a mailing list was created to discuss all things web robot.

Date: Mon, 6 Jun 1994 09:39:16 +0100
From: Martijn Koster <m.koster@nexor.co.uk>
Subject:  Topics

...
  - robotsnotwanted proposal

    I'd like to get some more discussion on this. As all the robot writers
    are on this list we should be able to decide on something that can
    easily be implemented by robots and users. The only outstanding issue
    is the name of the file; it is too long for DOS-based servers. Is there
    any problem with changing the filename to robotsp.txt (for robots
    policy) ?

Of note was a discussion about the name of the file:

Date: Wed, 15 Jun 1994 14:58:51 +0100
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Proposed name change for /RobotsNotWanted.txt

Hello all,

I was just editing my MOMspider paper for final submission in the 
WWW94 proceedings (what a pain!) and noticed that I have several
references to the name /RobotsNotWanted.txt in the text.  I would
like to change the name before it gets written in stone (i.e. before
I hand over copyright to Elsevier).

I propose that the name be:   /spiders.txt


Reasons:  1) It fits within the 8.3 filename restrictions for PCs
          2) It is easy to remember and hard to mistake (i.e. no mixed case)
          3) It is more web-ish than /robots.txt
          4) It does not imply that all robots are excluded (/norobots.txt)

So, what's the general consensus?  I need to have a decision within
the next 24 hours in order to get my paper done on time ;-)


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Proposed name change for /RobotsNotWanted.txt 

Hmmm...not a whole lot of consensus out there.

Acronyms such as "racl.txt" are too hard to remember.

I think the extension needs to reflect the content-type, not its purpose.
Specifically, it is not fair to ask people to define a new type just for
this file.

On the other hand, we could always call it "robots.pl" and require
the format to be in Perl4.  ;-)

...

Just my NZ half-penny ...
=======================================================================

Okay, the voting so far, counting my own (I think):

RTF = your's truly                  Y = Yes
GvR = Guido van Rossum              N = No
MLM = Michael L. Mauldin            O = Okay, maybe, don't care, ...
MAK = Martijn Koster
JRL = John R. R. Leavitt
CT  = Christophe Tronche


spiders.txt robots.txt robots.lmt racl.txt agents.pol agents.txt avoidURL.txt
----------- ---------- ---------- -------- ---------- ---------- ------------
RTF   Y         O          N          N        N          O          Y
GvR   N         Y                                                       
MLM   O         O                              Y 
MAK   N         Y
JRL   N         Y          Y
CT                         N          Y


and a grand total of USD $0.04 + FF $0.02 + NZD $0.005

and finally the decision:

Date: Fri, 17 Jun 1994 10:54:41 +0100
From: Martijn Koster <m.koster@nexor.co.uk>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: Evolving Standard 

> Ok.. so how is the standard emerging out of all this turmoil?
>
> "robots.txt"?
> empty file = all robots permitted?
> '#' = comment character?
> no "Expires" lines?

Yes. I'll be changing the document accordingly.

-- Martijn

I found Lycos Perl code from Jun 1994, and was also chuffed to see a mailing list message with a norobots.py implementation from Guido.van.Rossum@cwi.nl on 21 Jun 1994.

The next Web Conference I attended was the Third International WWW Conference in Darmstadt in April 1995, where Brian Pinkerton organised a workshop on web-wide indexing, which was followed by a meeting in the Braustübl pub. But I don't recall any specific changes to robots.txt there. Whereas the previous year the environment had felt very academic, there was now a sense of commercial competition, and in the months and years following there wasn't a lot of collaboration. I had ideas for data sharing between robots, but that never happened.

There have been some proposed extensions, but I don't think they've gained much traction.

In 2004 Microsoft's MSN Search (the forerunner of Bing) proposed a Crawl-delay directive, but I don't believe Google ever implemented it.

In 2005 Google, Microsoft and Yahoo introduced sitemaps (sitemaps.org), which added a Sitemap directive, e.g. Sitemap: http://www.example.com/sitemap.xml. It's a bit of an odd fit in the file syntax, as it is "independent of the user-agent line". But it is nice to see something at least partly in the spirit of ALIWEB: you get a list of URLs, but no content (which would be carpet-bombed by SEO types, and probably not accurately reflect the pages in question).
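
To make the placement concrete, here is a hypothetical robots.txt showing both extensions alongside the original directives; Crawl-delay lives inside a User-agent group, while the Sitemap line stands on its own:

User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10

# Not tied to any User-agent group
Sitemap: http://www.example.com/sitemap.xml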

In 2005 Healthcare Advocates sued the law firm Harding Earley Follmer & Frailey and the Internet Archive over /robots.txt handling. In 2006 the Internet Archive settled out of court; the case against Harding et al. was decided in favour of the defendants. See The Register and Internetcases.

In 2010, RFC5785 (since obsoleted by RFC8615) defined "well-known URIs" with a /.well-known/ URL prefix and an official registration process, mainly to prevent naming collisions. It notes:

What impact does this have on existing mechanisms, such as P3P and robots.txt?

None, until they choose to use this mechanism.

There's an obvious chicken-and-egg issue there.

In 2014 Baidu won a lawsuit against Qihoo 360 over the latter ignoring /robots.txt; Qihoo was ordered to pay ¥700,000 (US$113,506).

In 2017, the Internet Archive announced they would ignore /robots.txt because "robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes".

Now it's 2019 and robots.txt endures: according to Google, it is used by over 500 million websites.

I continue to host the robotstxt.org website as a reference, and I rather like Google's About robots.txt documentation, tester tool, and blog post.

I hope you enjoyed this little trip down memory lane.

Further reading:

Google's Birthday Tweets and Announcements

From Twitter: