# Define access-restrictions for robots/spiders
# http://www.robotstxt.org/wc/norobots.html

# By default we allow robots to access all areas of our site
# already accessible to anonymous users
User-agent: *
Disallow: /search

# Add Googlebot-specific syntax extension to exclude forms
# that are repeated for each piece of content in the site
# the wildcard is only supported by Googlebot
# http://www.google.com/support/webmasters/bin/answer.py?answer=40367&ctx=sibling
User-Agent: Googlebot
Disallow: /*sendto_form$
Disallow: /*folder_factories$
Disallow: /*?searchterm=*
# Don't spider all the old revisions of stuff. 4 hits a second from
# Googlebot eats two of our logical cores for not much added value:
Disallow: /*?rev=*
Disallow: /*&rev=*

# Penn State's Google Search Appliance comes at us so hard and fast that
# it burns 60% of our CPU, even though it gives us hardly any referrals.
# We're also taking up more of their index space than any other site at
# the university, so much that they asked me if my server was
# misconfigured. ;-) Really limit what it spiders:
User-Agent: PennStateSpider
Disallow: /*sendto_form$
Disallow: /*folder_factories$
Disallow: /*?rev=*
Disallow: /*&rev=*
Disallow: /*?searchterm=*
Disallow: /trac/weblion/changeset/*
Disallow: /trac/weblion/browser/*
Disallow: /trac/weblion/log/*
Disallow: /trac/weblion/export/*