
Why Robots.txt Matters for SEO

Crawl Control

Proper robots.txt files help you:

  • Prevent crawling of duplicate content
  • Block private/admin sections
  • Guide crawlers to important pages
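
For example, a minimal robots.txt covering these three goals might look like the sketch below (the /search and /admin/ paths are placeholders for your own URLs):

# Keep crawlers out of internal search results (duplicate/thin content) and the admin area
User-agent: *
Disallow: /search
Disallow: /admin/

# Guide crawlers to the pages that matter
Sitemap: https://example.com/sitemap.xml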

Server Efficiency

Optimized crawling reduces:

  • Server load during crawls
  • Bandwidth consumption
  • Unnecessary crawl budget usage

Security

Adds a layer of protection by:

  • Hiding development environments
  • Blocking sensitive file types
  • Preventing login page indexing

Robots.txt FAQ

Everything you need to know about controlling search engine crawlers

A robots.txt file tells search engine crawlers:

  • What to crawl: By specifying allowed paths
  • What to avoid: Through disallowed directories/files
  • How to crawl: Using crawl-delay directives
  • Where to find sitemaps: Via sitemap declarations

It's the first file crawlers look for when visiting your site (requesting yourdomain.com/robots.txt should return an HTTP 200 status code).
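
As a sketch, a single file can handle all four jobs (the paths are illustrative, and as noted later in this FAQ, Google ignores Crawl-delay):

User-agent: *
Allow: /blog/
Disallow: /checkout/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml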

Blocking a page in robots.txt does not prevent it from being indexed: robots.txt only controls crawling, not indexing. Pages blocked via robots.txt may still appear in search results with a "No information is available for this page" snippet if:

  • Other sites link to them (Google may infer their existence)
  • They're included in XML sitemaps
  • They have canonical tags pointing to them

To prevent indexing, use:

  1. <meta name="robots" content="noindex"> tags
  2. Password protection
  3. X-Robots-Tag HTTP headers
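
For non-HTML resources such as PDFs, the X-Robots-Tag header is the usual choice. A hypothetical response from a server configured to send it:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex

Keep in mind that crawlers can only see a noindex tag or header on pages they are allowed to crawl, so don't combine it with a robots.txt Disallow rule for the same URL.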

The core directives are:

Rule          Example                                    Purpose
User-agent:   User-agent: Googlebot                      Specifies which crawler the rules apply to
Disallow:     Disallow: /private/                        Blocks crawling of specific paths
Allow:        Allow: /public/                            Overrides Disallow for specific paths
Crawl-delay:  Crawl-delay: 10                            Sets a delay between crawl requests (seconds)
Sitemap:      Sitemap: https://example.com/sitemap.xml   Indicates the location of an XML sitemap
Clean-param:  Clean-param: ref /products/                Tells Yandex to ignore the specified URL parameter

Note: Each directive must be on its own line, and paths are case-sensitive.
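
Case sensitivity trips people up, so a quick illustrative example (the path is a placeholder):

User-agent: *
Disallow: /Private/

# Blocks /Private/report.pdf, but /private/report.pdf remains crawlable
# because paths are matched case-sensitively.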

Target specialized crawlers with their specific user-agent names:

User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot-News
Disallow: /draft-articles/

User-agent: Bingbot
Disallow: /temp/

Major crawlers and their user-agents:

  • Google (desktop): Googlebot
  • Google (smartphone): Googlebot (the current smartphone crawler uses the same Googlebot token; the older Googlebot-Mobile agent is retired)
  • Google Images: Googlebot-Image
  • Bing: Bingbot
  • Yandex: YandexBot
  • Baidu: Baiduspider
  • Facebook: facebookexternalhit
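
Note that crawlers obey only the most specific user-agent group that matches them and skip the generic * group when a named group exists, so repeat any shared rules inside each named group. A short sketch with placeholder paths:

User-agent: *
Disallow: /temp/

User-agent: Googlebot-Image
Disallow: /temp/
Disallow: /press-photos/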

While not a direct ranking factor, robots.txt affects SEO in the following ways:

Positive Effects

  • Preserves crawl budget for important pages
  • Prevents duplicate content issues
  • Reduces server load during crawls

Potential Risks

  • Accidental blocking of important pages
  • Over-restrictive crawl delays
  • Blocking CSS/JS needed for rendering

Pro Tip: Always test changes in Google Search Console's robots.txt report (the replacement for the old robots.txt Tester) before deployment.
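
To illustrate the CSS/JS risk above, a broad Disallow can be narrowed with Allow exceptions instead of unblocking the whole directory (the /assets/ path is a placeholder, and wildcard support varies by crawler, as covered later in this FAQ):

# Too broad: also blocks stylesheets and scripts needed for rendering
User-agent: *
Disallow: /assets/

# Safer: keep the block but carve out CSS and JS
User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$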

The robots.txt file must be located at your site's root:

  • Correct: https://example.com/robots.txt
  • Incorrect: https://example.com/subfolder/robots.txt

Implementation checklist:

  1. Upload to root directory (typically public_html or www)
  2. Ensure UTF-8 encoding
  3. Verify HTTP 200 status code
  4. Keep file size under 500KB
  5. Use plain text format (.txt extension)
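
To spot-check the location, status code, and content type in one go, a simple HEAD request works (substitute your own domain):

curl -I https://example.com/robots.txt

The first line of the output should report a 200 status, and the Content-Type header should be text/plain.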

Disallow is the primary directive, while Allow creates exceptions:

Scenario: Block /private/ except one subfolder
User-agent: *
Disallow: /private/
Allow: /private/public-files/

Key rules:

  • Longer, more specific paths take precedence over shorter ones
  • For Google, the order of directives does not matter: the most specific (longest) matching rule wins, and Allow wins when rules are equally specific
  • Allow is supported by the major crawlers (Google, Bing, Yandex), but some older bots ignore it

Watch out: Disallow: with an empty value allows crawling of everything rather than blocking it (see the example below).
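
For example, this group blocks nothing at all, because the empty Disallow value means "allow everything":

User-agent: *
Disallow: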

Support varies by search engine:

Pattern          Google   Bing     Yandex   Example
* (wildcard)     ✅ Yes   ✅ Yes   ✅ Yes   Disallow: /*.jpg$
$ (end anchor)   ✅ Yes   ✅ Yes   ✅ Yes   Disallow: /print$
Full regex       ❌ No    ❌ No    ❌ No    Not supported; only * and $ are recognized

Best practice: Stick to basic patterns for maximum compatibility.
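
Two common uses of these patterns, assuming Google-style wildcard handling (the parameter name and extension are placeholders):

User-agent: *
Disallow: /*?sessionid=
Disallow: /*.pdf$

The first rule blocks any URL containing the sessionid parameter; the second blocks URLs that end in .pdf.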

Crawl delay recommendations based on server capacity:

Shared Hosting

Crawl-delay: 10

(1 request every 10 seconds)

VPS

Crawl-delay: 5

(1 request every 5 seconds)

Dedicated Server

Crawl-delay: 1

(1 request per second; raise the value to 2 or 3 seconds for a lighter load)

Note: Google ignores crawl-delay and instead uses dynamic crawling based on server response times. This directive primarily affects Bing/Yandex.
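
Crawl-delay belongs inside a user-agent group, so it can be set per crawler. A sketch for a shared host that only wants to slow Bingbot down:

User-agent: Bingbot
Crawl-delay: 10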

You can include multiple sitemap declarations in a single robots.txt file:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
Sitemap: https://example.com/image-sitemap.xml

Best practices:

  • Include all relevant sitemaps (pages, images, videos, news)
  • Use absolute URLs
  • Sitemap lines can appear anywhere in the file; grouping them at the top keeps it readable
  • Keep sitemaps updated (crawlers check them frequently)

Pro Tip: Also submit sitemaps directly to Google Search Console for faster discovery.

Sensitive Information

Avoid listing private paths like /admin/ or /wp-login/ in robots.txt - the file is publicly readable, so disallowing them simply advertises their location to malicious bots. Protect them with authentication instead.

CSS/JS Files

Never block .css or .js - Google needs these to properly render and index pages.

Important Content

Don't accidentally block pages you want indexed - double-check with the URL Inspection tool in Google Search Console.

Test your file with free tools such as the robots.txt report in Google Search Console and the robots.txt tester in Bing Webmaster Tools.

Testing checklist:

  1. Verify syntax is error-free
  2. Check if important pages are blocked
  3. Test with different user-agents
  4. Confirm sitemap is accessible
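
Putting it together, a small file that would pass these checks for a hypothetical example.com:

# Default rules for all crawlers
User-agent: *
Disallow: /search
Disallow: /checkout/

# Bing only: same blocks, plus a gentler crawl rate
User-agent: Bingbot
Disallow: /search
Disallow: /checkout/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml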