Do you want to optimize your WordPress robots.txt file? Not sure why the robots.txt file matters for your SEO, or how to use it? What do Google and others say about robots.txt files?
Learn about robots.txt files
A robots.txt file is a file at the root of your site that indicates which parts of your site you don’t want accessed by search engine crawlers. The file uses the Robots Exclusion Standard, a protocol with a small set of commands that can be used to indicate access to your site by section and by specific kinds of web crawlers (such as mobile crawlers vs. desktop crawlers). The robots.txt file usually resides in your site’s root folder; you will need to connect to your site using an FTP client or the cPanel file manager to view it.
It is just like any ordinary text file, and you can open it with a plain text editor such as Notepad. Robots.txt is an essential part of SEO: this small text file at the root of your website can contribute to serious optimization of your site.
If you do not have a robots.txt file in your site’s root directory, then you can always create one. All you need to do is create a new text file on your computer and save it as robots.txt. Next, simply upload it to your site’s root folder.
What is robots.txt used for?
The first line of a robots.txt file usually names a user agent. The user agent is the name of the search bot you are trying to communicate with, for example Googlebot or Bingbot. You can use an asterisk (*) to instruct all bots.
The next line follows with Allow or Disallow instructions for search engines, so they know which parts you want them to index, and which ones you don’t want indexed.
See a sample robots.txt file:
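Here is a minimal sample illustrating both parts (the directory name /private/ is illustrative, not a standard path):

```
# Applies to all crawlers
User-agent: *
Disallow: /private/

# Applies only to Google's crawler
User-agent: Googlebot
Allow: /
```

Each User-agent line starts a new record, and the Allow/Disallow lines beneath it apply only to that record.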
For non-image files (that is, web pages) robots.txt should only be used to control crawling traffic, typically because you don’t want your server to be overwhelmed by Google’s crawler or to waste crawl budget crawling unimportant or similar pages on your site. You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file. If you want to block your page from search results, use another method such as password protection or noindex tags or directives.
Robots.txt does prevent image files from appearing in Google search results. (However, it does not prevent other pages or users from linking to your image.)
You can use robots.txt to block resource files such as unimportant image, script, or style files, if you think that pages loaded without these resources will not be significantly affected by the loss. However, if the absence of these resources makes the page harder to understand for Google’s crawler, you should not block them, or else Google won’t do a good job of analyzing your pages that depend on those resources.
Understand the limitations of robots.txt
Before you build your robots.txt, you should know the risks of this URL blocking method. At times, you might want to consider other mechanisms to ensure your URLs are not findable on the web.
· Robots.txt instructions are directives only
The instructions in robots.txt files cannot enforce crawler behavior to your site; instead, these instructions act as directives to the crawlers accessing your site. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not. Therefore, if you want to keep information secure from web crawlers, it’s better to use other blocking methods, such as password-protecting private files on your server.
· Different crawlers interpret syntax differently
Although respectable web crawlers follow the directives in a robots.txt file, each crawler might interpret the directives differently. You should know the proper syntax for addressing different web crawlers as some might not understand certain instructions.
· Your robots.txt directives can’t prevent references to your URLs from other sites
While Google won’t crawl or index the content blocked by robots.txt, it might still find and index a disallowed URL from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.
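For reference, the noindex directive mentioned above can be applied either as a meta tag in the page’s HTML or as an HTTP response header; both forms are standard and recognized by Google:

```html
<!-- Place in the page's <head> to keep it out of search results -->
<meta name="robots" content="noindex">
```

The equivalent response header, useful for non-HTML files such as PDFs, is X-Robots-Tag: noindex.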
Create a robots.txt file
In order to make a robots.txt file, you need access to the root of your domain. If you’re unsure about how to access the root, you can contact your web hosting service provider. Also, if you know you can’t access the root of the domain, you can use alternative blocking methods, such as password-protecting the files on your server or inserting meta tags into your HTML.
You can make or edit an existing robots.txt file using the robots.txt Tester tool. This allows you to test your changes as you adjust your robots.txt.
Learn robots.txt syntax
The simplest robots.txt file uses two keywords, User-agent and Disallow. User-agents are search engine robots (or web crawler software); most user-agents are listed in the Web Robots Database. Disallow is a command for the user-agent that tells it not to access a particular URL. To give a crawler access to a particular URL that is a child directory within a disallowed parent directory, you can use a third keyword, Allow.
Google uses several user-agents, such as Googlebot for Google Search and Googlebot-Image for Google Image Search. Most Google user-agents follow the rules you set up for Googlebot, but you can override this option and make specific rules for only certain Google user-agents as well.
The syntax for using the keywords is as follows:
User-agent: [the name of the robot the following rule applies to]
Disallow: [the URL path you want to block]
Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]
A User-agent line and the Disallow rules beneath it are together considered a single entry in the file, where the Disallow rule applies only to the user-agent(s) specified above it. You can include as many entries as you want, and multiple Disallow lines can apply to multiple user-agents, all in one entry. You can set the User-agent command to apply to all web crawlers by listing an asterisk (*) as in the example below:
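For instance, this entry (with illustrative directory names) applies the same Disallow rules to every crawler:

```
User-agent: *
Disallow: /calendar/
Disallow: /junk/
```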
URL blocking commands to use in your robots.txt file
| To block | Directives |
| --- | --- |
| The entire site, with a forward slash (/) | `Disallow: /` |
| A directory and its contents, by following the directory name with a forward slash | `Disallow: /sample-directory/` |
| A webpage, by listing the page after the slash | `Disallow: /private_file.html` |
| A specific image from Google Images (the image path here is illustrative) | `User-agent: Googlebot-Image`<br>`Disallow: /images/example.jpg` |
| All images on your site from Google Images | `User-agent: Googlebot-Image`<br>`Disallow: /` |
| Files of a specific file type (for example, .gif) | `User-agent: Googlebot`<br>`Disallow: /*.gif$` |
| Pages on your site, while still showing AdSense ads on those pages: disallow all web crawlers other than Mediapartners-Google. This hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site. | `User-agent: *`<br>`Disallow: /`<br>`User-agent: Mediapartners-Google`<br>`Allow: /` |
Pattern-matching rules to streamline your robots.txt code
| Pattern rule | Directives |
| --- | --- |
| To block any sequence of characters, use an asterisk (*). For instance, this blocks access to all subdirectories that begin with the word “private” | `User-agent: Googlebot`<br>`Disallow: /private*/` |
| To block access to all URLs that include a question mark (?). For example, this blocks URLs that begin with your domain name, followed by any string, followed by a question mark, followed by any string | `User-agent: Googlebot`<br>`Disallow: /*?` |
| To block any URLs that end in a specific way, use $. For instance, this blocks any URLs that end with .xls | `User-agent: Googlebot`<br>`Disallow: /*.xls$` |
You can also block patterns by combining the Allow and Disallow directives. Consider the case where a ? indicates a session ID: URLs that contain these IDs should typically be blocked from Google to prevent web crawlers from crawling duplicate pages. Meanwhile, if some URLs ending with ? are versions of the page that you do want included, you can combine the two directives:
1. The Allow: /*?$ directive allows any URL that ends in a ? (more specifically, it allows a URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
2. The Disallow: /*? directive blocks any URL that includes a ? (more specifically, it blocks a URL that begins with your domain name, followed by a string, followed by a question mark, followed by a string).
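Put together as a single entry (using Googlebot as the user agent, as in the earlier samples), the combination looks like this:

```
User-agent: Googlebot
Allow: /*?$
Disallow: /*?
```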
Save your robots.txt file
You must apply the following saving conventions so that Googlebot and other web crawlers can find and identify your robots.txt file:
- You must save your robots.txt code as a plain text file,
- You must place the file in the highest-level directory of your site (the root of your domain), and
- The file must be named robots.txt.
As an example, a robots.txt file saved at the root of example.com, at the URL address http://www.example.com/robots.txt, can be discovered by web crawlers, but a robots.txt file at http://www.example.com/not_root/robots.txt cannot be found by any web crawler.
Submit your updated robots.txt to Google
The Submit function of the robots.txt Tester tool allows you to easily deploy a new robots.txt file for your site and ask Google to crawl and index it more quickly. Update and notify Google of changes to your robots.txt file by following the steps below.
- Click Submit in the bottom-right corner of the robots.txt editor. This action opens up a Submit dialog.
- Download your edited robots.txt code from the robots.txt Tester page by clicking Download in the Submit dialog.
- Upload your new robots.txt file to the root of your domain as a text file named robots.txt (the URL for your robots.txt file should be /robots.txt).
If you do not have permission to upload files to the root of your domain, you should contact your domain manager to make changes.
For example, if your site home page resides under subdomain.example.com/site/example/, you likely cannot update the robots file subdomain.example.com/robots.txt. In this case, you should contact the owner of example.com/ to make any necessary changes to the robots.txt file.
- Click Verify live version to see that your live robots.txt is the version that you want Google to crawl.
- Click Submit live version to notify Google that changes have been made to your robots.txt file and request that Google crawl it.
- Check that your newest version was successfully crawled by Google by refreshing the page in your browser to update the tool’s editor and see your live robots.txt code. After you refresh the page, you can also click the dropdown above the text editor to view the timestamp of when Google first saw the latest version of your robots.txt file.
Optimize WordPress Robots.txt File
The WordPress robots.txt file plays a major role in search engine ranking. It helps block search engine bots from indexing and crawling unimportant parts of your blog. Many popular blogs use very simple robots.txt files, whose contents vary depending on the needs of the specific site. Such a robots.txt file simply tells all bots to index all content and provides links to the site’s XML sitemaps.
Example of a robots.txt file for use on WordPress:
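A typical minimal WordPress robots.txt looks something like the following (the sitemap URL is a placeholder; adjust the paths to your own site):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: http://www.example.com/sitemap.xml
```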
You can either edit your WordPress robots.txt file by logging into your server via FTP, or you can use a plugin like Robots Meta to edit the robots.txt file from the WordPress dashboard. There are a few things you should add to your robots.txt file along with your sitemap URL. Adding the sitemap URL helps search engine bots find your sitemap file, and thus index your pages faster.
Here is a sample robots.txt file for any domain. In the Sitemap line, replace the sitemap URL with your blog’s URL:
# disallow all files in these directories
User-agent: Mediapartners-Google*
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /
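For the record that the “disallow” comment above refers to, bloggers commonly disallow directories such as these (the directory names are illustrative), finishing with the Sitemap line described earlier:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/

Sitemap: http://www.example.com/sitemap.xml
```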
Make your sitemap available to Google (Submit your sitemap to Google)
There are two different ways to make your sitemap available to Google:
- Submit it to Google using the Search Console Sitemaps tool
- Insert the following line anywhere in your robots.txt file, specifying the path to your sitemap:
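The line takes the absolute URL of your sitemap file; for example, with example.com standing in for your domain:

```
Sitemap: http://www.example.com/sitemap.xml
```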