FREE
+55 XP
robots.txt Deep Dive
Googlebot arrives at a new site
"The very first thing I do at any site is read /robots.txt. No file? I crawl everything. File exists? I follow its rules. This isn't a request โ it's a protocol."
robots.txt is a plain text file at the site root (https://example.com/robots.txt) that controls which sections search bots can access.

robots.txt Directives
| Directive | What it does | Example |
|---|---|---|
| User-agent | Which bot the rule applies to | `User-agent: *` (all bots) |
| Disallow | Blocks crawling of a path | `Disallow: /admin/` |
| Allow | Permits a sub-path inside a Disallowed area | `Allow: /admin/public/` |
| Crawl-delay | Pause between bot requests in seconds (ignored by Googlebot) | `Crawl-delay: 2` |
| Sitemap | Points to the XML sitemap | `Sitemap: https://site.com/sitemap.xml` |
Real-World robots.txt Example

```
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /admin/assets/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
```
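You can test rules like these locally with Python's standard-library `urllib.robotparser`. One caveat to hedge: Python's parser applies the *first* matching rule in file order, while Google uses the *longest* matching rule, so in this sketch the more specific `Allow` line is placed before the `Disallow` it carves out of.

```python
from urllib.robotparser import RobotFileParser

# Rules adapted from the example above. Python's RobotFileParser is
# order-sensitive (first match wins), so Allow comes before Disallow here.
rules = """\
User-agent: *
Allow: /admin/assets/
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/"))         # False
print(rp.can_fetch("*", "https://example.com/admin/assets/"))  # True
print(rp.can_fetch("*", "https://example.com/products/"))      # True
```

Handy for a pre-launch sanity check, but always confirm against Google's own tester, since matching semantics differ.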
Wildcards: * and $

- `*` matches any sequence of characters: `Disallow: /*.pdf$` blocks all PDF files
- `$` anchors the end of the URL: `Disallow: /search$` blocks /search but not /search/results
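These two wildcards map directly onto regular expressions. A minimal sketch of the translation (the helper function name is ours, not part of any standard library):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a robots.txt path pattern to a compiled regex (illustrative helper)."""
    anchored = pattern.endswith("$")          # '$' anchors the end of the URL
    core = pattern[:-1] if anchored else pattern
    # Escape regex metacharacters, then restore '*' as "any character sequence".
    body = re.escape(core).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))  # True
print(bool(robots_pattern_to_regex("/search$").match("/search")))           # True
print(bool(robots_pattern_to_regex("/search$").match("/search/results")))   # False
```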
Common Mistakes

| Mistake | Consequence |
|---|---|
| Blocking /static/ or /css/ with Disallow | Googlebot can't render the page, so it treats it as broken |
| Thinking Disallow = noindex | The page can still be indexed via external links |
| Forgetting to block parameterized URLs | Duplicates like ?sort=&page= waste crawl budget |
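For the last row, a typical fix looks like this (illustrative only: the parameter names `sort` and `page` are from the example above and must match your own site's URLs):

```
User-agent: *
# Block faceted/paginated duplicates (parameter names are site-specific)
Disallow: /*?sort=
Disallow: /*&page=
```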
Alex checks robots.txt in GSC
"Google Search Console โ Settings โ robots.txt Tester. Enter a URL and instantly see: blocked or not. An essential tool before any launch."
robots.txt checklist:
✅ File is accessible at /robots.txt
✅ CSS, JS and images are NOT blocked
✅ /admin/, /cart/, /checkout/ are blocked
✅ Sitemap directive points to the current XML
✅ Tested in GSC robots.txt Tester
✅ Disallow ≠ noindex: important content not hidden via robots
⚠️ Remember: Disallow blocks CRAWLING, not INDEXING. To remove a page from the index, use a noindex meta tag, but keep the page crawlable: otherwise the bot will never see the noindex.

🎮 Test yourself: which robots.txt directive blocks crawling?
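In practice, the noindex approach is a single tag in the page's `<head>` (standard robots meta tag; the page itself must not be Disallowed in robots.txt):

```html
<!-- The bot must be able to crawl this page to see the tag -->
<meta name="robots" content="noindex">
```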
Lesson Task
Test your knowledge and earn +20 XP