Robots.txt is an important tool for managing the indexing of your website. It allows you to specify exactly which parts of your website can be indexed and which parts cannot. It is essential to properly configure this file to ensure optimal visibility and protection of confidential information.
Why do you need robots.txt?
Robots.txt is important for several reasons:
- Controlling indexing: Website owners can use robots.txt to specify which parts of their site should be indexed and which should not. For example, you may want to exclude administrative pages or confidential content from indexing. Such pages provide no useful information to users, so they can "spam" search engine results, which search engines generally treat as a bad sign.
- Protecting confidentiality: If your site contains confidential information that you don't want search engines to show, robots.txt lets you keep it away from crawlers. For example, your site may contain internal documents, users' personal data, and so on; if these pages end up in the index, anyone will be able to find and open them.
- Saving bandwidth and crawl budget: Search engine crawlers consume bandwidth when crawling a website. For example, if your site has 50 thousand pages but only 10 thousand of them actually need to be crawled, letting crawlers wander through the remaining pages will slow down the indexing of the ones that matter. By specifying in robots.txt which pages should not be crawled, you not only save resources and speed up crawling of the site, but also keep low-value pages out of the index.
So, to summarize, robots.txt is needed to tell search engines what to index and what not to index. For example, product, service, and category pages should all be shown to Google, but authorization pages, technical template pages, and personal user information should be hidden.
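For example, a minimal sketch of such a file might look like the one below; the paths /login/, /admin/, and /personal/ are placeholders, and your CMS will likely use different URLs:
User-agent: *
Disallow: /login/
Disallow: /admin/
Disallow: /personal/
Anything not matched by a Disallow rule, such as the product, service, and category pages mentioned above, remains open to crawlers.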
What do robots.txt records mean?
To understand what each entry means, let's take a completed robots.txt file as an example.
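A small file assembled from the four records discussed below might look like this; the domain and paths are placeholders:
User-agent: *
Disallow: /*?set_filter=*
Allow: /local/*.gif
Sitemap: https://domain.com/sitemap.xml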
In total, there are 4 records that you need to know:
- User-agent: * - this record specifies which search engine crawlers the rules apply to. The * mark means the rules apply to all of them.
- Disallow: /*?set_filter=* - this entry lists the paths that need to be closed from indexing. In this case, every link that contains ?set_filter= is closed; the asterisks stand for whatever content comes before and after it, for example:
https://domain.com/123?set_filter=123/
https://domain.com/category:farby?set_filter=colir:zelenyy/
Such pages will be closed.
- Allow: /local/*.gif - this entry explicitly opens paths or files for crawling. In general, search engines crawl every page that is not blocked, so why allow certain URLs, images, or scripts at all? The reason is that Allow lets you carve out exceptions to broader Disallow rules: when you block, for example, filter pages, you usually still want the images and scripts used on those pages to be crawlable. So while a link like
https://domain.com/123?set_filter=123/
is blocked, the .gif images it uses must still be crawled (a short illustration follows this list).
- Sitemap: https://domain.com/sitemap.xml - this record should be placed at the very end and points to the sitemap.
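As a hypothetical illustration of how Allow carves out an exception, the following pair blocks an entire directory but keeps the .gif images inside it open to crawlers (the /local/ directory is just an example):
Disallow: /local/
Allow: /local/*.gif
The point is that a more specific Allow rule can open files inside a section that a broader Disallow rule closes.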
How does robots.txt differ between CMSs?
If you use a popular CMS for your website, you can start from a ready-made robots.txt template.
An example for WordPress:
User-agent: *
Disallow: /author/
Disallow: /wp-
Disallow: /readme
Disallow: /search
Disallow: *?s=
Disallow: *&s=
Disallow: */reviews/
Disallow: */attachment/
Disallow: */embed
Disallow: */page/
Disallow: *ycl=
Disallow: *gcl=
Disallow: *cpa=
Disallow: *utm=
Disallow: *clid=
Disallow: *openstat=
Allow: /wp-*/*.css
Allow: /wp-*/*.js
Allow: /wp-*/*.jpg
Allow: /wp-*/*.png
Allow: /wp-*/*.gif
Allow: /wp-*/*.woff
Sitemap: https://domain.com/sitemap.xml
An example for OpenCart:
User-agent: *
Disallow: /*route=account/
Disallow: /*route=affiliate/
Disallow: /*route=checkout/
Disallow: /*route=product/search
Disallow: /index.php?route=product/product*&manufacturer_id=
Disallow: /admin
Disallow: /catalog
Disallow: /system
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?order=
Disallow: /*&order=
Disallow: /*?limit=
Disallow: /*&limit=
Disallow: /*?filter=
Disallow: /*&filter=
Disallow: /*?filter_name=
Disallow: /*&filter_name=
Disallow: /*?filter_sub_category=
Disallow: /*&filter_sub_category=
Disallow: /*?filter_description=
Disallow: /*&filter_description=
Disallow: /*?tracking=
Disallow: /*&tracking=
Sitemap: https://domain.com/sitemap.xml
However, directories may differ between versions of a particular CMS, so we advise you to manually check which pages are technical and whether a page that is useful to users has accidentally been blocked. As for custom-built sites, you need to review all the pages and scripts that have to remain crawlable and fill out robots.txt very carefully.
Also, if you are not sure whether your robots.txt file is filled in correctly, you can use Google's tool:
https://www.google.com/webmasters/tools/robots-testing-tool
Summary
Robots.txt is an extremely important tool for correct website indexing.
If you do not create a robots.txt file, search engines will index every page of your website they can reach. This can lead to undesirable consequences, such as confidential information ending up in the index, duplicate pages, and search results cluttered with low-value pages. Therefore, as soon as you plan to launch your website, analyze what exactly you want to show search engines and what you should hide.