
Managing Your Site's Accessibility for Search Engines

Your site is useless if it can’t be indexed by search engines. If you want it to show up in the search results, you need to make sure that search engines can access it. However, sometimes you'll want to restrict access to certain parts of your site: perhaps you want to hide irrelevant pages or private documents. In this article you'll learn how to manage your site's accessibility for search engines via a robots.txt file or the robots meta tag.


Benefits of Robots Files and Tags

Before we dig into the details of how to create a robots.txt file or robots meta tag, we should take a look at their benefits. There are some scenarios where their implementation might come in handy, such as:

  • Preventing duplicate content from being indexed (e.g. printable versions of pages).
  • Keeping new but incomplete pages out of the index.
  • Restricting search engines from indexing confidential pages or files.

Duplicate content dilutes your SEO efforts, as search engines find it hard to decide which version is the most relevant for a user's search query. This problem can be prevented by blocking the duplicate pages via a robots file or tag. There are other ways to manage duplicate content as well, such as canonical URLs, but they're beyond the scope of this article.

If you have new but incomplete pages online, it's best to block them from crawlers to prevent them from being indexed. This might be useful for new product pages, for example: if you want to keep them secret until launch, add a robots file or tag.

Some websites have confidential pages or files that aren't blocked by a login form. An easy way to hide these from search engines is via the robots.txt file or meta tag.

Now that we know why we should manage the accessibility of certain pages, it's time to learn how we can do this.


The robots.txt File

Crawlers are workaholics. They want to index as much as possible, unless you tell them otherwise.

When a crawler visits your website, it will search for the robots.txt file. This file gives it instructions on which pages should be indexed and which should be ignored. By creating a robots.txt file you can prevent crawlers from accessing certain parts of your website.

The robots.txt file must be placed in the top-level directory of your site, for example: www.domain.com/robots.txt. The filename is case sensitive, too: it must be all lowercase.
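
Crawlers only look for the file at the root of the domain; a robots.txt file stored in a subdirectory is simply ignored. For illustration (the paths below are made up):

www.domain.com/robots.txt — found and obeyed
www.domain.com/blog/robots.txt — ignored by crawlers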

Warning: if you add a robots.txt file to your website, please double-check for errors. You don’t want to inadvertently block crawlers from indexing important pages.


Creating a robots.txt File

robots.txt is a simple text file containing one or more records. Each record has two elements: user-agent and disallow.

The user-agent element specifies which crawlers a record applies to. Disallow tells those crawlers which parts of the website they may not index.

A record will look something like this:

User-agent: *
Disallow: 

The record above gives search engines access to all pages. The asterisk (*) targets all crawlers, and because the disallow field is left empty, they can index everything.

However, by adding a forward slash to the disallow field, we can prevent all crawlers from indexing anything from our website:

User-agent: *
Disallow: / 

We can also choose to target a single crawler. Take a look at the example below:

User-agent: Googlebot
Disallow: /private-directory/ 

This record tells Google not to index the private directory; Googlebot is the crawler Google uses for web search. For a complete list of all crawlers, visit the web robots database.
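
A robots.txt file can also contain several records at once, separated by blank lines. Each crawler follows the record that names it and falls back to the asterisk record otherwise. A quick sketch, with made-up directory names:

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Bingbot
Disallow: /not-for-bing/

User-agent: *
Disallow: /private-directory/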

Writing a separate record for every disallowed path would be time-consuming. Fortunately, we can add multiple disallow lines to the same record.

User-agent: Bingbot
Disallow: /sample-directory/
Disallow: /an-uninteresting-page.html
Disallow: /pictures/logo.jpg 

This will prevent Bing from indexing the sample directory, the uninteresting page and the logo.
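
You can also annotate your records with comments: crawlers ignore everything from a hash sign (#) to the end of the line. A small sketch (the directory name is a placeholder):

# Keep printable duplicates of pages out of the index
User-agent: *
Disallow: /print/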

Wildcards

Strictly speaking, robots.txt doesn't support regular expressions, but the major search engines do understand a couple of wildcard characters that we can use in our rules.

For example, a lot of people use WordPress as a CMS. Visitors can use the built-in search function to find posts about a certain topic, and the URL for a search query has the following structure: http://domain.com/?s=searchquery.

If I want to block those search results from being indexed, I can use the asterisk (*) wildcard, which matches any sequence of characters. The robots.txt record will look like this:

User-agent: *
Disallow: /*?s=

This blocks every URL that contains ?s=, wherever it appears.

You can also use wildcards to prevent certain file types from being indexed. The following record will block all .png images:

User-agent: *
Disallow: /*.png$ 

Don’t forget the dollar sign at the end: it marks the end of the URL, so the rule only matches URLs that actually end in .png.
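
Both wildcards can be combined in a single record. As a sketch, the following (hypothetical) rules block every PDF file and every URL that contains a query string, a pattern described in Google's documentation:

User-agent: *
Disallow: /*.pdf$
Disallow: /*?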

Testing Your robots.txt File

It’s always a good idea to test your robots.txt file to see if you’ve made any mistakes. You can use Google Webmaster Tools for this.

Under ‘Health’ you’ll find the ‘Blocked URLs’ page, which shows all the information about your file. You can also test changes there before uploading them.


Robots Meta Tag

The robots meta tag is used to manage crawler access on a per-page basis. It tells search engines whether the page may be indexed or archived, and whether the links on the page may be followed.

This is what the robots meta tag looks like:


<head>
	<meta name="robots" content="noindex" />
</head>

This meta tag prevents crawlers from indexing the web page. Besides “noindex” there are several other values that might be useful:

  • index: this page can be indexed.
  • noindex: this page can’t be displayed in the search results.
  • follow: the links on this page can be followed.
  • nofollow: the links on this page can’t be followed.
  • archive: a cached copy of this page is permitted.
  • noarchive: a cached copy of this page isn’t permitted.

Multiple values can be combined in a single robots meta tag, for example:


<head>
	<meta name="robots" content="noindex, nofollow" />
</head>

This markup prevents crawlers from indexing the page and following its links.

If you happen to use conflicting values, Google will apply the most restrictive one. Say you use “index” and “noindex” in the same tag: the page will not be indexed, just to be safe.
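
Like the robots.txt file, the robots meta tag can target a single crawler instead of all of them: put the crawler's name in the name attribute. For example, Google's crawler reads tags named “googlebot”, so the following sketch keeps only Google from indexing the page:

<head>
	<meta name="googlebot" content="noindex" />
</head>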


Should I Use robots.txt or Meta Tags?

As we've discussed, there are two ways to manage the accessibility of web pages: a robots.txt file and meta tags.

The robots.txt file is great for blocking complete directories or certain file types. With a single line of text you can do a lot of work (and potentially a lot of damage!). If you want to block an individual page, though, it’s best to use the robots meta tag.

Sometimes URLs that are blocked via the robots.txt file can still appear in the search results: when a lot of links point to a page and Google believes it’s the most relevant result for the search query, it will show up anyway. If you absolutely don’t want the page to be displayed, add the noindex meta tag instead, and make sure the page isn’t also blocked in robots.txt; crawlers have to be able to fetch the page to see the tag. This may sound complicated, but Matt Cutts explains everything in detail in Uncrawled URLs in search results on YouTube.


Conclusion

With the robots.txt file and robots meta tags you can easily manage your site’s accessibility for search engines.

Don’t forget to check and double-check your meta tags and robots.txt file to prevent inadvertently blocking crawlers from indexing important pages.
