How to Fix "Indexed, though blocked by robots.txt" in GSC
"Indexed, though blocked by robots.txt" appears in Google Search Console (GSC) when Google has indexed URLs that it is not allowed to crawl.
In most cases, this is a straightforward issue where crawling is blocked in the robots.txt file.
But there are a few more conditions that can trigger the issue, so let's go through the following troubleshooting process to diagnose and fix the issues as efficiently as possible:
The first step is to ask oneself if one wants Google to index the URL.
If one doesn't want the URL indexed, just add a noindex meta robots tag and make sure to allow crawling, assuming it's canonical.
If one blocks the crawling of a page, Google can still index it because crawling and indexing are two different things.
Unless Google can crawl a page, it won't see the noindex meta tag and may still index the page because other pages link to it.
If the URL canonicalizes to another page, do not add a noindex meta robots tag.
Just make sure the proper canonicalization signals are in place, including a canonical tag on the canonical page, and allow crawling for the signals to pass and consolidate properly.
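To confirm which signals a page actually sends, its HTML can be inspected for the meta robots tag and the canonical link. The following is a minimal sketch using only Python's standard library; the class name, function name, and sample markup are hypothetical and exist purely for illustration. It reads only the HTML, so it does not cover directives sent in HTTP headers.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Collect meta robots directives and the canonical URL from a page."""

    def __init__(self):
        super().__init__()
        self.robots = []       # directives such as "noindex" or "follow"
        self.canonical = None  # href of <link rel="canonical">, if present

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.robots += [d.strip().lower() for d in content.split(",") if d.strip()]
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def indexing_signals(html):
    """Return whether the page asks for noindex, and its canonical URL."""
    parser = IndexingSignals()
    parser.feed(html)
    return {"noindex": "noindex" in parser.robots, "canonical": parser.canonical}


# Hypothetical page markup, for illustration only.
sample = """
<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/page">
</head><body></body></html>
"""
print(indexing_signals(sample))
```

Google can only read these signals if the page is crawlable, so a noindex result here is useful only when the URL is not blocked in robots.txt.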
If one wants the URL to be indexed, one has to understand why Google can't crawl the URL and remove the block.
The most likely cause is a crawl block in the robots.txt file.
Check for a crawl block in the robots.txt file.
Check for intermittent blocks.
Check for a user-agent block.
Check for an IP block.
Look for a crawl block in the robots.txt file
If one knows what to look for, or does not have access to GSC, one can go to domain.com/robots.txt to find the file.
Look for a disallow statement, such as:
Disallow: /
A specific user agent can be mentioned, or it can block everyone.
If the site is new or has recently launched, one may want to look for:
User-agent: *
Disallow: /
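This check can also be scripted. The sketch below uses Python's standard urllib.robotparser module to test whether a given robots.txt allows a user agent to fetch a URL. The helper name and the two sample files are hypothetical. Python's parser does not replicate every detail of Google's matching rules, so treat the result as a first pass rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents, for illustration only.
blocked = "User-agent: *\nDisallow: /\n"
partial = "User-agent: *\nDisallow: /private/\n"

print(is_crawl_allowed(blocked, "Googlebot", "https://example.com/page"))       # False
print(is_crawl_allowed(partial, "Googlebot", "https://example.com/page"))       # True
print(is_crawl_allowed(partial, "Googlebot", "https://example.com/private/x"))  # False
```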
If no block is visible, it is possible that someone has already fixed the robots.txt file and resolved the problem before one started looking into it.
To fix the issue, remove the disallow statement causing the block. How to do this depends on the platform in use.
WordPress:
If the problem is affecting the entire website, the most likely cause is that a setting in WordPress has been checked to disallow indexing.
Click on "Settings"
Click on "Reading"
Make sure that the "Search engine visibility" option is not checked.
WordPress with Yoast
If one uses the Yoast SEO plugin, one can directly edit the robots.txt file to remove the blocking instruction.
Click on "Yoast SEO"
Click on "Tools"
Click on "File Editor"
WordPress with Rank Math
Similar to Yoast, Rank Math allows one to directly edit the robots.txt file.
Click on "Rank Math"
Click on "General settings"
Click on "Edit robots.txt file"
FTP or hosting
If one has FTP access to the site, one can directly edit the robots.txt file to remove the disallow statement causing the problem.
The hosting provider may also provide access to a file manager which provides direct access to the robots.txt file.
Check for intermittent blocks:
Intermittent problems can be more difficult to resolve because the conditions causing the blockage may not always be present.
A good starting point is the history of the robots.txt file. In GSC, clicking the drop-down list shows past versions of the file, which one can click to see what they contained.
The Wayback Machine on archive.org also keeps a history of robots.txt files for the websites it crawls.
One can click on any of the dates for which it has data and see what the file included that day.
Alternatively, use the beta version of the Wayback Machine's Changes report, which makes it easy to see content changes between two different versions.
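Once two versions of the file have been saved locally, they can also be compared offline. The sketch below uses Python's standard difflib module to print a unified diff; the function name and both snapshots are hypothetical examples, not data from a real site.

```python
import difflib


def robots_diff(old, new):
    """Return unified-diff lines between two robots.txt versions."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="older", tofile="newer", lineterm="",
    ))


# Hypothetical snapshots, for illustration only.
older = "User-agent: *\nDisallow:\n"
newer = "User-agent: *\nDisallow: /\n"

for line in robots_diff(older, newer):
    print(line)
```

A line starting with "+Disallow: /" in the output would indicate that a site-wide block was introduced between the two snapshots.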
The process for resolving intermittent blocks will depend on the cause of the problem.
For example, one possible cause would be a shared cache between a test environment and a live environment.
When the test environment cache is active, the robots.txt file might include a block directive, and when the live environment cache is active, the site can be crawled.
In this case, one would want to split the cache or perhaps exclude .txt files from the cache in the test environment.
Check for user-agent blocks:
User-agent blocks occur when a site blocks a specific user-agent like Googlebot or AhrefsBot.
The site detects a specific bot and blocks the corresponding user agent.
If one can view a page correctly in a normal browser but gets blocked after changing the user agent, the specific user agent entered is being blocked.
One can specify a particular user agent using the Chrome developer tools. Another option is to use a browser extension that switches user agents.
One can also test for user-agent blocks with a cURL command.
Here's how to do it in Windows:
Press Windows + R to open a "Run" box.
Type "cmd", then click "OK".
Enter a cURL command:
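The exact command is not reproduced here. With cURL, the -A option sets the user agent string sent with the request. The same comparison can be sketched in Python with the standard urllib module, as below. The function names, the placeholder URL, and the agent strings are assumptions for illustration, and the "blocked" test is only a heuristic based on status codes.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this user agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 403 when the server refuses this user agent


def looks_blocked(browser_status, bot_status):
    """Heuristic: the bot is refused where a browser-like agent succeeds."""
    return browser_status < 400 <= bot_status


# Usage (performs real network requests; URL and agent strings are placeholders):
# browser = fetch_status("https://example.com/", "Mozilla/5.0")
# bot = fetch_status("https://example.com/", "Googlebot")
# print(looks_blocked(browser, bot))
```

A request that hangs or is dropped raises an exception instead of returning a status code, which is itself a hint that the user agent is being treated differently.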
This is another case where the fix depends on where the block is found.
Many different systems can block a bot, including .htaccess, server configuration, firewalls, a CDN, or even something one cannot see that the hosting provider controls.
The best bet may be to contact the hosting provider or the CDN and ask them where the block is coming from.
Check for IP blocks:
If one has confirmed that the URL is not blocked by the robots.txt file and has ruled out user-agent blocks, the cause is probably an IP block.
IP blocks are difficult to track down.
As with user-agent blocks, the best bet may be to contact the hosting provider or CDN and ask them where the block is coming from and how it can be resolved.