Robots.txt is a helpful tool you can use to give search engine crawlers guidelines on how you want them to crawl your website.
Although it is not a foolproof way to keep a web page entirely out of Google, it can help prevent your site or server from being overloaded by crawler requests and keep crawlers away from selected parts of the website.
What Does Blocked by robots.txt Mean?
What Is Robots.txt?
Robots.txt is a plain text file placed in the root directory of your website that tells search engine crawlers which pages on your website they may crawl and which they should ignore.
In other words, the file lets crawlers know whether they are allowed to access a page or file. A page blocked by robots.txt is less likely to be crawled and to appear in search results, which gives you some control over how your site shows up.
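For a quick sense of the syntax, here are the two simplest robots.txt files you can write. The first allows every crawler to access the whole site, because an empty Disallow rule blocks nothing:

User-agent: *
Disallow:

The second blocks every crawler from the entire site:

User-agent: *
Disallow: /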
How Does It Work and How Does It Affect SEO?
Blocking pages with robots.txt is a delicate balancing act: if you overdo it, you can hurt your ranking, but if you strategically select which pages to block, you can give yourself a small SEO boost.
When a search engine crawler requests a page blocked by robots.txt, Google may report it as a soft 404 error. A few of these won’t hurt, but too many can hurt your ranking because they lead to a slower crawl rate and a wasted crawl budget.
So, we recommend only blocking pages you’re absolutely sure you want to restrict access to and don’t want to be found by search engines.
For example, you might want to:
- keep parts of a site, like admin pages, private
- prevent duplicate content or media files from appearing in the SERPs
- avoid indexation problems
- keep specific files, such as images or PDFs, away from search engines
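A robots.txt covering cases like these might look something like the sketch below. The /admin/ folder, the session-ID parameter, and the PDF rule are hypothetical examples rather than paths from a real site, and the * and $ wildcards rely on the pattern matching Google’s crawlers support:

User-agent: *
Disallow: /admin/
Disallow: /*?sessionid=
Disallow: /*.pdf$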
How to Create robots.txt
The first thing you need to do is create a text file and name it “robots.txt.” You can do this using any text editor, such as Notepad or TextEdit. Once you’ve created the file, open it and add the following lines of code:
User-agent: *
Disallow: /file-name.html
Replace “file-name.html” with the name of the page you want to stop from getting crawled. You can also use wildcards to block multiple pages at once. For example, if you wanted to block all pages that start with “product,” you would use the following code:
User-agent: *
Disallow: /product*
Save the file and upload it to your website’s root directory. If you want to address specific search engines individually, you can! Simply name the bot in the user-agent section. For example:
User-agent: Googlebot
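A user-agent line on its own doesn’t block anything until at least one rule follows it, so a complete group aimed at a single bot looks like the sketch below (the /example-page/ path is only a placeholder):

User-agent: Googlebot
Disallow: /example-page/

Crawlers follow the most specific group that names them, so Googlebot would obey these rules instead of the ones listed under User-agent: *.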
How to Save and Apply Your robots.txt File
When you’re ready to save the file, go to File > Save As. In the “Save as type” drop-down menu, select “All Files.”
Name your file “robots.txt,” typing the .txt extension yourself, to ensure the file is saved as plain text with exactly that name rather than with a different or doubled extension.
- You can upload the file to your website using any FTP client, such as FileZilla or Cyberduck.
- Once you’ve connected to your website over FTP, open the folder where your website’s files are stored (on many hosts it’s called public_html).
- Finally, drag and drop your robots.txt file into that folder. Once it’s uploaded, your file will be live on your website.
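To confirm the upload worked, open the file’s URL in a browser; robots.txt always lives at the root of the domain (the domain below is just a placeholder):

https://www.example.com/robots.txt

If the rules you wrote appear, crawlers will see them too; if you get a 404, the file most likely didn’t land in the root directory.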
What Does “Indexed, though blocked by robots.txt” Mean and How to Fix It?
“Indexed, though blocked by robots.txt” is a warning that appears in Google Search Console indicating that Google has indexed a URL even though it is blocked by your robots.txt file.
This happens because the block only stops Google from crawling the page, not from indexing it: if Google finds links pointing to the URL and deems it important enough, it can index it based on those links, even though it can’t be sure whether you actually want it in the results.
To fix this issue and keep a web page out of Google search results in the future, export the list of affected URLs from Google Search Console, comb through them, and check which ones you genuinely don’t want to appear in the index.
For those pages, update your robots.txt accordingly, apply meta robots “noindex” tags, remove any internal links pointing to them, or password-protect them, and save your changes. Keep in mind that Google can only see a noindex tag on a page it is allowed to crawl, so if you rely on noindex, don’t also block that page in robots.txt.
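For reference, the meta robots tag mentioned above is a single line placed in the <head> of the page you want dropped from the index; this is its standard form:

<meta name="robots" content="noindex">

The next time Google crawls the page and sees the tag, it will remove the page from its index.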