/robots.txt is 25 years old. Here I describe
how it came about.
Thanks to Gary Illyes, Webmaster Trends Analyst at Google, for pointing out the birthday. We celebrate with cake!
More importantly, see the Formalizing the Robots Exclusion Protocol Specification post on Google's Webmaster Central Blog for news about an effort to, somewhat belatedly, turn it into a standard. See the draft over at the IETF.
This milestone prompted me to go spelunking through the archives to put together a timeline. Pure nostalgia, and I expect no-one cares. But if you do, read on.
The actual announcement was on 3 July 1994 (msg):
Date: Sun, 3 Jul 1994 18:58:37 +0200
From: Martijn Koster <firstname.lastname@example.org>
To: Multiple recipients of list <email@example.com>
Subject: ANNOUNCE: A Standard for Robot Exclusion

Some of you may remember a proposed standard for robot exclusion I posted to this forum last year. In short it is a method of guiding robots away from certain areas in a Web server's URL space, by providing a simple text file on the server. This is especially handy if you have large archives, CGI scripts with massive URL subtrees, temporary information, or you simply don't want to serve robots.

After some recent discussions on the Robots mailing list some changes were proposed, and a new version of this document is now available from <http://web.nexor.co.uk/mak/doc/robots/robots.html>. Most of the robots in operation either use it already, or have promised support soon.

-- Martijn
But obviously some work preceded that.
In 1993 I ran a web server on web.nexor.co.uk; I announced
that to the www-talk mailing list on 4 May 1993. I don't recall
what software that used (for some dynamic sites I started with
shell scripts launched by inetd), but by 28 June 1993
I was running Tony Sanders' Plexus server, a web server implemented
in Perl. The infrastructure was a scrounged Sun 4/330 UNIX server,
connected to the internet over a company leased line (ISDN
perhaps?).
From an old web page:
Then in September 1993 my Web server mailed me a report indicating someone had been retrieving an unusually large number of documents from my Web server. Further investigation indicated that at three points in time the same sequence of documents was retrieved at an enourmeous rate, in the order of 1 document per second. The sequence of documents showed that they were all interlinked; A Web wanderer had been at work.
I also found an email exchange with Wayne Allen at MCC dated 23 Sept 1993 where I complained about his frequent repeated automated downloads; so perhaps that was what motivated me to try and do something about it.
I did see Matthew Gray's home page mentioning World Wide Web Wanderer, and an email quote from him on 6 Jun 1994: "W4v1.0 was written in June of 1993 as a simple depth first search that kept the entire database in memory of where it had been and dumped to disk when it had exhausted a document tree. Very simple."
And per this article, Jonathon Fletcher was working on his crawler for JumpStation in September 1993.
So other web robots were around at the time.
Charlie Stross describes overloading my server with his Websnarf crawler. I found emails from him from 7 Mar 1994; I'm not sure if there were others.
Anyway, suffice it to say this was an issue that affected my attempts to offer web services.
The first public mention of "a proposed standard that will allow WWW server maintainers to indicate if they want robots to access their server, and if so which parts" was on 25 Feb 1994, to the firstname.lastname@example.org mailing list (msg):
Date: Fri, 25 Feb 1994 13:18:03 --100
From: Martijn Koster <email@example.com>
To: Multiple recipients of list <firstname.lastname@example.org>
Subject: Important: Spiders, Robots and Web Wanderers

Recognising that robots exist and will never go away, I have setup a page devoted to gathering as much info about active robots as possible:

http://web.nexor.co.uk/mak/doc/robots/robots.html

(Please use this exact URL for all accesses).

It contains codes of practice for robot writers, a list of all known robots in use, and most importantly a proposed standard that will allow WWW server maintainers to indicate if they want robots to access their server, and if so which parts. This proposed standard doesn't require any server/client/protocol changes, and can provide a partial solution to problems caused by robots.

I am inviting comments on it, but I do hope we can keep the discussion focused, and not degenerate in a "robots are good/bad" discussion that won't be resolved.

Robots are one of the few aspects of the web that cause operational problems and cause people grief. At the same time they do provide very useful services. This standard should minimise the problems and may well maximise the benefits, so I think we need to sort this out as soon as possible. The major robot writers are in favour of this idea, so I don't see any fundamental problems.
Unfortunately, no copy of that version survives.
That same day, 25 Feb 1994, there was a discussion in comp.infosystems.www started by "PG" (Paul Ginsparg, then at Los Alamos National Lab) titled "WHAT TO DO ABOUT MISERABLE AUTOMATED DOWNLOADS???", after a robot visit from Imperial College London, and I mentioned the proposal in that discussion too.
In May 1994 the "First International Conference on the World-Wide Web" (now just The Web Conference) was held in Geneva. I was there to present a paper on "Aliweb - Archie-Like Indexing in the Web" (paper), which included a mention:
Currently there is an effort underway to prevent some of these problems: there are Guidelines for Robot Writers and there is a Proposal for Robot Exclusion.
Roy Fielding's presentation of MOMspider (paper) describes an implementation:
As an additional precaution, MOMspider periodically looks for and obeys any restrictions found in a site’s /RobotsNotWanted.txt document as per the standard proposed by Martijn Koster [Koster94]. Before any link is tested, the destination site is looked-up in a table of recently accessed sites (the definition of "recently" can be set by the user). If it is not found, that site’s /RobotsNotWanted.txt document is requested and, if found, parsed for any restrictions to be placed on MOMspider robots. Any such restrictions are added to the user’s avoid list and the site is added to the site table, both with expiration dates indicating when the site must be checked again. Although this opens the possibility for a discrepancy to exist between the restrictions applied and the contents of a recently changed /RobotsNotWanted.txt document, it is necessary to avoid a condition where the site checks cause a greater load on the server than would the maintenance requests alone. An example sites file is provided in Appendix D.
David Eichmann's presentation of The RBSE Spider (paper) also mentions an implementation:
A third table, added more recently following the derivation of the robot exclusion protocol, is a permanent list of patterns to match against for suppression.
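The site table Fielding describes is, in effect, a cache of per-site restrictions with expiry dates. A minimal Python sketch of that idea follows. It is not MOMspider's code (which was written in Perl), and the names SiteTable, fetch_rules and max_age are hypothetical, chosen only for illustration. The function that fetches a site's restrictions is passed in, so the example runs without network access.

```python
import time
from typing import Callable, Dict, List, Tuple


class SiteTable:
    """Remember each site's exclusion rules until they expire (illustrative only)."""

    def __init__(self, fetch_rules: Callable[[str], List[str]],
                 max_age: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch_rules = fetch_rules  # hypothetical: returns path prefixes to avoid
        self._max_age = max_age          # the user's definition of "recently", in seconds
        self._clock = clock
        self._sites: Dict[str, Tuple[float, List[str]]] = {}

    def rules_for(self, site: str) -> List[str]:
        now = self._clock()
        cached = self._sites.get(site)
        if cached is not None and cached[0] > now:
            return cached[1]             # still fresh: no extra request to the site
        rules = self._fetch_rules(site)  # first visit or expired entry: ask again
        self._sites[site] = (now + self._max_age, rules)
        return rules

    def may_test(self, site: str, path: str) -> bool:
        """True if no cached restriction covers this path."""
        return not any(path.startswith(prefix) for prefix in self.rules_for(site))


# Demonstration with a fake fetcher and a fake clock (both assumptions).
calls: List[str] = []

def fake_fetch(site: str) -> List[str]:
    calls.append(site)
    return ["/cgi-bin/"]

now = [0.0]
table = SiteTable(fake_fetch, max_age=60.0, clock=lambda: now[0])
first = table.may_test("www.example.com", "/index.html")       # fetches the rules
second = table.may_test("www.example.com", "/cgi-bin/search")  # answered from the table
now[0] = 61.0                                                  # entry has expired
third = table.may_test("www.example.com", "/index.html")       # fetches again
```

The trade-off is the one the paper names: rules may be stale for up to max_age seconds, in exchange for not requesting the exclusion file before every link check.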
After the conference, on 1 Jun 1994 a mailing list was created to discuss all things web robot.
Date: Mon, 6 Jun 1994 09:39:16 +0100
From: Martijn Koster <email@example.com>
Subject: Topics

...

- robotsnotwanted proposal

I'd like to get some more discussion on this. As all the robot writers are on this list we should be able to decide on something that can easily be implemented by robots and users. The only outstanding issue is the name of the file; it is too long for DOS-based servers. Is there any problem with changing the filename to robotsp.txt (for robots policy) ?
Of note was a discussion about the name of the file:
Date: Wed, 15 Jun 1994 14:58:51 +0100
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
To: /CNfirstname.lastname@example.org
Subject: Proposed name change for /RobotsNotWanted.txt

Hello all,

I was just editing my MOMspider paper for final submission in the WWW94 proceedings (what a pain!) and noticed that I have several references to the name /RobotsNotWanted.txt in the text. I would like to change the name before it gets written in stone (i.e. before I hand over copyright to Elsevier).

I propose that the name be:

/spiders.txt

Reasons:
1) It fits within the 8.3 filename restrictions for PCs
2) It is easy to remember and hard to mistake (i.e. no mixed case)
3) It is more web-ish than /robots.txt
4) It does not imply that all robots are excluded (/norobots.txt)

So, what's the general consensus? I need to have a decision within the next 24 hours in order to get my paper done on time ;-)

....Roy Fielding ICS Grad Student, University of California, Irvine USA
(email@example.com)
<A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
To: /CNfirstname.lastname@example.org
Subject: Re: Proposed name change for /RobotsNotWanted.txt

Hmmm...not a whole lot of consensus out there.

Acronyms such as "racl.txt" are too hard to remember. I think the extension needs to reflect the content-type, not its purpose. Specifically, it is not fair to ask people to define a new type just for this file. On the other hand, we could always call it "robots.pl" and require the format to be in Perl4. ;-)

...

Just my NZ half-penny ...

=======================================================================

Okay, the voting so far, counting my own (I think):

RTF = your's truly           Y = Yes
GvR = Guido van Rossum       N = No
MLM = Michael L. Mauldin     O = Okay, maybe, don't care, ...
MAK = Martijn Koster
JRL = John R. R. Leavitt
CT = Christophe Tronche

spiders.txt robots.txt robots.lmt racl.txt agents.pol agents.txt avoidURL.txt
----------- ---------- ---------- -------- ---------- ---------- ------------
RTF Y O N N N O Y
GvR N Y
MLM O O Y
MAK N Y
JRL N Y Y
CT N Y

and a grand total of USD $0.04 + FF $0.02 + NZD $0.005
Date: Fri, 17 Jun 1994 10:54:41 +0100
From: Martijn Koster <email@example.com>
Cc: /CNfirstname.lastname@example.org
Subject: Re: Evolving Standard

> Ok.. so how is the standard emerging out of all this turmoil?
>
> "robots.txt"?
> empty file = all robots permitted?
> '#' = comment character?
> no "Expires" lines?

Yes. I'll be changing the document accordingly.

-- Martijn
I found Lycos Perl code from Jun 1994, and was also chuffed to see a mailing
list message with a norobots.py implementation from
Guido.van.Rossum@cwi.nl on 21 Jun 1994.
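The conventions settled in that thread (an empty file permits all robots, and '#' starts a comment) can be tried out with the urllib.robotparser module in today's Python standard library. This is a small sketch for illustration, not the 1994 norobots.py; the robot name ExampleBot and the www.example.com URLs are made up.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# An empty file places no restrictions on any robot.
empty = parse_robots("")
empty_allows = empty.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search")

# '#' starts a comment; the remaining lines keep all robots out of /cgi-bin/.
rules = parse_robots(
    "# keep robots out of the CGI scripts\n"
    "User-agent: *\n"
    "Disallow: /cgi-bin/   # massive URL subtree\n"
)
cgi_allowed = rules.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search")
home_allowed = rules.can_fetch("ExampleBot", "http://www.example.com/index.html")

print(empty_allows, cgi_allowed, home_allowed)
```

Matching here is a plain prefix test on the URL path, which is all the original proposal asked of a robot.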
The next Web Conference I attended was the Third International
WWW Conference in Darmstadt in April 1995, where Brian Pinkerton
organised a workshop for web-wide indexing, which was followed by a
meeting in the Braustübl pub. But I
don't recall any specific changes to /robots.txt.
Whereas the previous year the environment felt very academic, there
was a new sense of commercial competition. In the months and years
following there wasn't a lot of collaboration. I had ideas for data
sharing between robots, but that never happened.
There have been some proposed extensions but I don't think they've gotten traction.
In 2005 Google, Microsoft and Yahoo introduced sitemaps (sitemaps.org), which add a
line of the form Sitemap: http://www.example.com/sitemap.xml to the file. It's a
bit of an odd fit in the file syntax, as it is "independent of
the user-agent line". But it is nice to see something at least
partly in the spirit of ALIWEB: you get a list of URLs, but no
content (which would be carpet-bombed by SEO types, and probably
not accurately reflect the pages in question).
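To see what "independent of the user-agent line" means in practice, here is a short sketch using Python's standard urllib.robotparser, assuming Python 3.8 or later for the site_maps() method. The file content, the robot name ExampleBot and the /private/ path are invented for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    # The Sitemap line sits outside any User-agent group.
    "Sitemap: http://www.example.com/sitemap.xml",
    "",
    "User-agent: *",
    "Disallow: /private/",
])

# The sitemap URL is collected separately from the per-robot rules.
sitemaps = parser.site_maps()

# The exclusion rules still apply as usual.
private_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/private/page.html")

print(sitemaps, private_allowed)
```

The Sitemap line could equally be placed at the end of the file; the parser does not attach it to any group of rules.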
In 2005 Healthcare Advocates sued Harding Earley Follmer & Frailey and the
Internet Archive over /robots.txt handling. In
2006 the Internet Archive settled out of court. The case against
Harding et al was
found in favour of the defendants. See The
What impact does this have on existing mechanisms, such as P3P and robots.txt?
None, until they choose to use this mechanism.
There's an obvious chicken-and-egg issue there.
In 2014 Baidu
won a lawsuit against Qihoo 360 over the latter ignoring
/robots.txt; Qihoo 360 was ordered to pay ¥700,000.
In 2017, the Internet Archive
announced they would ignore /robots.txt, saying that
"robots.txt files that are geared toward search engine crawlers do
not necessarily serve our archival purposes".
Now it's 2019, and
robots.txt endures; it is used by
over 500 million websites according to Google.
I hope you enjoyed this little trip down memory lane.
- One Million robots.txt Files
- A Larger Scale Study of Robots.txt
- Exploring some of the lesser known robots.txt directives
- Search Engine Watch and Search Engine Land
- Fun with robots.txt
- Original and funny robots.txt files
- Ask Inbound: What's the best robots.txt file you've ever seen?
- What are robots.txt files? Featuring 15 of our favourites
Google's Birthday Tweets and Announcements
It's 1994 and crawlers are overwhelming servers 🥺. To help webmasters, Martijn Koster (@makuk66), a webmaster himself, proposes a protocol to control what URLs crawlers may access on sites 🧐. - Google Webmasters (@googlewmc) July 1, 2019
The robots.txt protocol is very simple, yet incredibly effective: by specifying a user-agent and rules for it, webmasters have granular control over what crawlers may access. It doesn't matter if it's a single URL, a certain file-type, or a whole site– robots.txt works for each. pic.twitter.com/fOlFFE2yMi - Google Webmasters (@googlewmc) July 1, 2019
It's been 25 years, and the Robots Exclusion Protocol never became an official standard. While it was adopted by all major search engines, it didn't cover everything: does a 500 HTTP status code mean that the crawler can crawl anything or nothing? 😕 pic.twitter.com/imqoVQW92V - Google Webmasters (@googlewmc) July 1, 2019
Today we're announcing that after 25 years of being a de-facto standard, we worked with Martijn Koster (@makuk66), webmasters, and other search engines to make the Robots Exclusion Protocol an official standard! https://t.co/Kcb9flvU0b - Google Webmasters (@googlewmc) July 1, 2019
In 25 years, robots.txt has been widely adopted– in fact over 500 million websites use it! While user-agent, disallow, and allow are the most popular lines in all robots.txt files, we've also seen rules that allowed Googlebot to "Learn Emotion" or "Assimilate The Pickled Pixie". pic.twitter.com/tmCApqVesh - Google Webmasters (@googlewmc) July 1, 2019
But there are also lots of typos in robots.txt files. Most people miss colons in the rules, and some misspell them. What should crawlers do with a rule named "Dis Allow"? pic.twitter.com/nZEIyPYI9R - Google Webmasters (@googlewmc) July 1, 2019
To help developers create parsers that reflect the Robots Exclusion Protocol requirements, we're releasing our robots.txt parser as open source! Updated to cover all corner cases, the parser ensures that Googlebot only crawls what it's allowed to. https://t.co/NmbLRzDkHF - Google Webmasters (@googlewmc) July 1, 2019
Happy 25th birthday, robots.txt! You make the Internet a better place. You're the real MVP! pic.twitter.com/vxvZTcHpR3 - Google Webmasters (@googlewmc) July 1, 2019