It is often said that if you do not want your information to be stolen, don’t put it on the Internet. However, the Internet has become an integral part of our lives, and most of us end up publishing some kind of web site, blog, or forum. Even if you don’t tell anyone about your web site, once it is published it will eventually be discovered.
How, you ask? By robot indexing programs, a.k.a. bots, crawlers, and spiders. These little programs swarm across the Internet visiting every web site they can find, caching and logging site information in their databases. Often created by search engines to help index pages, they roam the Internet freely, crawling web sites around the clock.
Normally this is an acceptable part of the Internet, but some search engines crawl so aggressively that they consume noticeable bandwidth. And some bots are malicious, stealing photos from web sites or harvesting email addresses so that they can be spammed. The simplest way to discourage these bots is to place a robots.txt file at the root of the site with instructions telling them to stay away.
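A minimal robots.txt that tells a specific crawler to stay away looks like this (YandexBot is used as the example here, matching the filtering rule later in this article):

```
User-agent: YandexBot
Disallow: /
```

`Disallow: /` asks the named crawler not to fetch anything on the site; a `User-agent: *` record would apply the same instruction to all crawlers.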
However, this approach has a fundamental weakness: robots.txt is purely advisory. A bot can still hit the site, ignoring both your robots.txt file and your wish not to be indexed.
But there is good news. If your site runs on an IIS 7 server, you have another alternative: the Request Filtering feature built into IIS 7. Because it is enforced by the web server itself, before the request ever reaches your application, it cannot simply be bypassed by a bot.
The setup is fairly simple, and the easiest and fastest way to create your RequestFiltering rule is to define it in your application’s web.config file. The requestFiltering element goes inside the <system.webServer><security> elements; if your application’s web.config does not already contain them, you can create them. Once those elements exist, add the following configuration to set up your RequestFiltering rule.
<requestFiltering>
  <filteringRules>
    <filteringRule name="BlockSearchEngines" scanUrl="false" scanQueryString="false">
      <scanHeaders>
        <clear />
        <add requestHeader="User-Agent" />
      </scanHeaders>
      <appliesTo>
        <clear />
      </appliesTo>
      <denyStrings>
        <clear />
        <add string="YandexBot" />
      </denyStrings>
    </filteringRule>
  </filteringRules>
</requestFiltering>
<authentication>
  <basicAuthentication enabled="true" />
  <anonymousAuthentication enabled="true" />
</authentication>
You can name the filtering rule whatever you’d like, but in the scanHeaders section the requestHeader attribute must be set to “User-Agent” so the rule inspects that header. Within the denyStrings section, each “add string” element specifies a user-agent substring to block. In this example it is set to YandexBot, which blocks the crawler of Yandex, a search engine based in Russia. The same technique blocks other crawlers such as Googlebot or Bingbot.
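To block several crawlers with a single rule, list each user-agent substring as its own deny string. A sketch of the denyStrings section extended this way:

```xml
<denyStrings>
  <clear />
  <add string="YandexBot" />
  <add string="Googlebot" />
  <add string="Bingbot" />
</denyStrings>
```

Any request whose User-Agent header contains one of these substrings will be rejected by the rule.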
If you want to verify that this rule is actually blocking these bots, download the raw HTTP logs from the server and search them for the cs(User-Agent) field. For the blocked requests, the sc-status (status code) field should show a 404 HTTP response. The adjacent sc-substatus field carries a substatus code that further qualifies the primary HTTP response code.
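Checking the logs by hand gets tedious, so here is a minimal sketch of that check in Python. It assumes the common W3C extended log field layout shown in the #Fields directive; the sample log lines are hypothetical, included only so the snippet is self-contained:

```python
# Sketch: find IIS log entries denied by a RequestFiltering rule (sc-status 404).
# SAMPLE_LOG is a hypothetical stand-in for a real exYYMMDD.log file; note that
# W3C logs replace spaces inside field values (like the User-Agent) with '+'.

SAMPLE_LOG = """\
#Fields: date time cs-method cs-uri-stem cs(User-Agent) sc-status sc-substatus
2011-08-01 10:15:02 GET /index.html Mozilla/5.0+(compatible;+YandexBot/3.0) 404 19
2011-08-01 10:15:09 GET /index.html Mozilla/5.0+(compatible;+Googlebot/2.1) 200 0
"""

def blocked_requests(log_text):
    """Return (user-agent, sc-status, sc-substatus) for every 404 entry."""
    fields = []
    hits = []
    for line in log_text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]      # column names follow the directive
            continue
        if line.startswith("#") or not line.strip():
            continue                        # skip other directives / blank lines
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") == "404":
            hits.append((row["cs(User-Agent)"], row["sc-status"], row["sc-substatus"]))
    return hits

print(blocked_requests(SAMPLE_LOG))
```

Point it at your real log text instead of SAMPLE_LOG and the blocked YandexBot request surfaces as a 404 with its accompanying substatus code.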
Here is a list of the potential substatus codes you may see once your RequestFiltering rule is in place.