robots.txt: Controlling What Google Crawls on Your Site

robots.txt tells bots where they may and may not go. A missing or misconfigured file wastes crawl budget and exposes internal paths.

robots.txt is a small text file at the site root that tells crawlers which paths to crawl and which to skip. Without one, the default is "crawl everything" - bots wander into /wp-admin/, /wp-includes/, and internal search query strings.

Why this matters

Google allocates each site a finite crawl budget. Without a sound robots.txt, that budget is spent on admin pages, internal search queries (/?s=...), and media attachment pages - none of which should appear in search. On larger sites this delays new-content discovery by days or weeks.

Worse: unwanted crawls sometimes lead to indexing of admin URLs, internal search results, or thank-you pages. These leak into Google search results and can be served to users - a messy and occasionally embarrassing exposure.

Indirect security angle: malicious bots crawl /wp-admin/ and /xmlrpc.php looking for vulnerabilities. robots.txt does not block them (they ignore the file), but it does steer legitimate crawlers away, conserving server resources for real content.

How to detect

Quick check: visit https://YOUR-SITE/robots.txt - it should return content. An empty body, a 404, or a file containing nothing but a permit-all rule means nothing is being steered.
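For reference, the permit-all file looks like this - an empty Disallow value blocks nothing, so it is functionally the same as having no robots.txt at all:

# Permits everything; equivalent to no robots.txt
User-agent: *
Disallow: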

Complementary check: in Google Search Console > Settings > robots.txt (the new report), Google shows the file it fetched and any syntax errors.

Third check: in the Coverage (Page indexing) report, look for "Crawled - currently not indexed". A pile of internal-search or wp-admin URLs there means robots.txt is not steering crawlers properly.

How to fix

WordPress serves a virtual robots.txt automatically (since 5.5 it also appends a Sitemap line for the built-in sitemap). It covers the bare minimum, but most sites need more.
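For context, the virtual file looks roughly like this (the exact Sitemap URL depends on your setup, and SEO plugins replace it with their own index):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/wp-sitemap.xml

There are two approaches to going beyond it: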

Approach 1 - SEO plugin: Yoast / Rank Math / RankPlus all let you edit robots.txt through their UI. In Yoast: SEO > Tools > File editor > robots.txt. In Rank Math: General Settings > Edit robots.txt.

Approach 2 - physical file: create a file named robots.txt at the site root (public_html/). It overrides the virtual one. A solid baseline:

User-agent: *
Allow: /
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Disallow: /readme.html
Disallow: /wp-login.php

Sitemap: https://example.com/sitemap.xml

Add a Sitemap: line pointing at your real sitemap URL (Yoast serves it at /sitemap_index.xml, core at /wp-sitemap.xml) - Google reads it automatically.

After changes, validate in Google Search Console > Settings > robots.txt - it shows status "Fetched" with the last crawl timestamp.

Common mistakes

First and most common: using robots.txt to "hide" sensitive paths. The file is public - anyone can read it, so attackers see exactly the paths you tried to hide. For keeping pages out of search use meta robots noindex; for real access control use HTTP Basic Auth in .htaccess.
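As an illustration, a minimal Basic Auth sketch for the admin area, assuming Apache 2.4 and a .htaccess file inside /wp-admin/ - the realm name and the .htpasswd path are placeholders to adapt:

# /wp-admin/.htaccess (example; create the .htpasswd with the htpasswd utility)
AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user

# Keep admin-ajax.php reachable so front-end AJAX from themes and plugins still works
<Files admin-ajax.php>
    Require all granted
</Files>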

Second mistake: Disallow: /wp-includes/. That directory holds JavaScript and CSS Google needs to render the page. Blocking it cripples rendering and indirectly damages SEO. Leave it open.

Third mistake: accidental Disallow: /. That single line blocks the entire site. I have seen production sites carrying a Disallow: / inherited from staging - SEO disappears overnight. Always verify after a deploy that this line is absent.
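The dangerous file is easy to miss because it is one character away from the harmless permit-all version shown above:

# Blocks the entire site - note the slash after Disallow:
User-agent: *
Disallow: /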

Fourth mistake: ignoring noindex on internal search pages. Disallow: /?s= in robots.txt is fine, but if internal search pages are already indexed, blocking the crawl will not remove them. Add a meta robots noindex tag (and keep the pages crawlable until they drop out) to actively remove them.
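One way to set that tag in WordPress (5.7 or later, which introduced the wp_robots filter) is a short snippet in the active theme's functions.php or a mu-plugin - a sketch, not the only route, since most SEO plugins expose the same toggle in their settings:

// In the active theme's functions.php (or a mu-plugin), WordPress 5.7+
add_filter( 'wp_robots', function ( $robots ) {
    if ( is_search() ) {
        $robots['noindex'] = true; // keep internal search results out of the index
        $robots['follow']  = true; // but still follow links on the page
    }
    return $robots;
} );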

Fifth mistake: forgetting to update after a domain change. After moving from example.com to example.co.uk, ensure Sitemap: points to the new domain.

Verifying the fix

Reload /robots.txt and confirm the content is right. In Search Console > Settings > robots.txt, verify status "Fetched" with no errors. Run URL Inspection on a page you want crawled (such as the homepage) and confirm there is no "Blocked by robots.txt" message. After two weeks, watch the Coverage report - the count under "Blocked by robots.txt" should match only what you intentionally blocked.

Tip: do not use robots.txt to block pages that already carry noindex. If Google cannot crawl a page, it cannot see the noindex - and the page can linger in results for a long time. Leave such pages crawlable so the noindex is honored.