Why Robots.txt Matters for SEO
Crawl Control
Proper robots.txt files help you:
- Prevent crawling of duplicate content
- Block private/admin sections
- Guide crawlers to important pages
Server Efficiency
Optimized crawling reduces:
- Server load during crawls
- Bandwidth consumption
- Unnecessary crawl budget usage
Security
Adds a layer of protection by:
- Hiding development environments
- Blocking sensitive file types
- Keeping crawlers away from login pages
Robots.txt FAQ
Everything you need to know about controlling search engine crawlers
A robots.txt file tells search engine crawlers:
- What to crawl: By specifying allowed paths
- What to avoid: Through disallowed directories/files
- How to crawl: Using crawl-delay directives
- Where to find sitemaps: Via sitemap declarations
It's the first file crawlers look for when visiting your site (a request to yourdomain.com/robots.txt should return HTTP status code 200).
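For example, a minimal file covering all four roles might look like the sketch below (the paths and domain are placeholders, and Crawl-delay is only honored by some crawlers):

```
User-agent: *
Allow: /blog/
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
```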
No! Robots.txt only controls crawling, not indexing. Pages blocked via robots.txt may still appear in search results with a "No information is available for this page" snippet if:
- Other sites link to them (Google may infer their existence)
- They're included in XML sitemaps
- They have canonical tags pointing to them
To prevent indexing, use:
- `<meta name="robots" content="noindex">` tags
- Password protection
- `X-Robots-Tag` HTTP headers
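Of these, the X-Robots-Tag header works at the HTTP level, which makes it useful for non-HTML files such as PDFs. A rough sketch of a response carrying it (the header values are illustrative):

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```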
The core robots.txt directives are:

Rule | Example | Purpose |
---|---|---|
User-agent: | `User-agent: Googlebot` | Specifies which crawler the rules apply to |
Disallow: | `Disallow: /private/` | Blocks crawling of specific paths |
Allow: | `Allow: /public/` | Overrides Disallow for specific paths |
Crawl-delay: | `Crawl-delay: 10` | Sets delay between crawl requests (seconds) |
Sitemap: | `Sitemap: https://example.com/sitemap.xml` | Indicates location of XML sitemap |
Clean-param: | `Clean-param: ref /products/` | Ignores specified URL parameters |
Note: Each directive must be on its own line, and paths are case-sensitive.
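To illustrate the case-sensitivity point, these two rules match different URLs (the paths are hypothetical):

```
User-agent: *
# Blocks /Private/report.html but not /private/report.html
Disallow: /Private/
# Blocks /private/report.html
Disallow: /private/
```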
Target specialized crawlers with their specific user-agent names:
```
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot-News
Disallow: /draft-articles/

User-agent: Bingbot
Disallow: /temp/
```
Major crawlers and their user-agents:
- Google (desktop): Googlebot
- Google (mobile): Googlebot-Mobile
- Google Images: Googlebot-Image
- Bing: Bingbot
- Yandex: YandexBot
- Baidu: Baiduspider
- Facebook: facebookexternalhit
While not a direct ranking factor, robots.txt impacts SEO by:
Positive Effects
- Preserves crawl budget for important pages
- Prevents duplicate content issues
- Reduces server load during crawls
Potential Risks
- Accidental blocking of important pages
- Over-restrictive crawl delays
- Blocking CSS/JS needed for rendering
Pro Tip: Always test changes in Google Search Console's robots.txt tester before deployment.
The robots.txt file must be located at your site's root:
- Correct: `https://example.com/robots.txt`
- Incorrect: `https://example.com/subfolder/robots.txt`
Implementation checklist:
- Upload to root directory (typically public_html or www)
- Ensure UTF-8 encoding
- Verify HTTP 200 status code
- Keep file size under 500KB
- Use plain text format (.txt extension)
Disallow is the primary directive, while Allow creates exceptions:
Scenario: Block /private/ except one subfolder
```
User-agent: *
Disallow: /private/
Allow: /private/public-files/
```
Key rules:
- The most specific (longest) matching path takes precedence for Google
- Some crawlers instead apply rules in the order listed (first match wins), so ordering still matters for compatibility
- Google supports unlimited Allow directives, while other crawlers may not
Avoid `Disallow:` with an empty value when you mean to block something - an empty value actually allows all crawling!
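A quick sketch of the contrast (generic user-agent and paths):

```
# Empty value: nothing is blocked, everything may be crawled
User-agent: *
Disallow:

# A single slash blocks crawling of the entire site
User-agent: *
Disallow: /
```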
Support varies by search engine:
Pattern | Google | Bing | Yandex | Example |
---|---|---|---|---|
`*` (wildcard) | ✅ Yes | ❌ No | ✅ Yes | `Disallow: /*.jpg$` |
`$` (end anchor) | ✅ Yes | ❌ No | ✅ Yes | `Disallow: /print$` |
Full regex | ❌ No | ❌ No | ✅ Limited | `Disallow: /user/*/profile` |
Best practice: Stick to basic patterns for maximum compatibility.
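For instance, crawlers that do support `*` and `$` (see the table above) will interpret patterns like these (the parameter name is a hypothetical example):

```
User-agent: *
# Block any URL ending in .pdf
Disallow: /*.pdf$
# Block URLs containing a session parameter
Disallow: /*?sessionid=
```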
Crawl delay recommendations based on server capacity:
- Shared hosting: `Crawl-delay: 10` (one request every 10 seconds)
- VPS: `Crawl-delay: 5` (one request every 5 seconds)
- Dedicated server: `Crawl-delay: 1` to `Crawl-delay: 3` (one request every 1-3 seconds)
Note: Google ignores crawl-delay and instead uses dynamic crawling based on server response times. This directive primarily affects Bing/Yandex.
Yes, you can include multiple sitemap declarations:
```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
Sitemap: https://example.com/image-sitemap.xml
```
Best practices:
- Include all relevant sitemaps (pages, images, videos, news)
- Use absolute URLs
- Place sitemap declarations at the top of the file
- Keep sitemaps updated (crawlers check them frequently)
Pro Tip: Also submit sitemaps directly to Google Search Console for faster discovery.
Sensitive Information
Avoid exposing private paths like `/admin/` or `/wp-login/` - this actually reveals their location to malicious bots.
CSS/JS Files
Never block `.css` or `.js` files - Google needs these to properly render and index pages.
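If you must block an assets directory, a safer sketch is to carve out exceptions for the CSS and JS subfolders rather than blocking everything (the paths are placeholders):

```
User-agent: *
# Carve out rendering resources so crawlers can still fetch them
Allow: /assets/css/
Allow: /assets/js/
Disallow: /assets/
```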
Important Content
Don't accidentally block pages you want indexed - double-check with the URL Inspection tool in Google Search Console.
Use these free testing tools:
- Google's Tester - the official validator in Search Console
- Bing Validator - in Bing Webmaster Tools
- Robots.txt Tester - a third-party advanced validator
Testing checklist:
- Verify syntax is error-free
- Check that no important pages are blocked
- Test with different user-agents
- Confirm sitemap is accessible