Executive Summary
This whitepaper analyzes how the 8 main AI bots crawl websites, how they differ technically, and how to configure your robots.txt to maximize visibility on each platform.
Key finding: 23% of sites inadvertently block at least one critical AI bot.
The 8 AI Bots You Should Know
Critical Impact Bots
These bots are essential for visibility on major AI platforms:
1. GPTBot (OpenAI)
- Purpose: Model training and ChatGPT web browsing
- User-Agent: GPTBot/1.0
- Documentation: openai.com/gptbot
- Impact: Critical - Feeds ChatGPT with updated information
2. ChatGPT-User (OpenAI)
- Purpose: Real-time web browsing in ChatGPT
- User-Agent: ChatGPT-User
- Impact: Critical - Real-time user searches
3. ClaudeBot (Anthropic)
- Purpose: Crawling for Claude AI
- User-Agent: ClaudeBot/1.0
- Impact: Critical - Fast-growing model
4. Google-Extended (Google)
- Purpose: Controls whether your content is used for Gemini training (separate from Googlebot's Search indexing)
- User-Agent: Google-Extended (a robots.txt product token only; it sends no requests of its own, the fetching is done by Google's existing crawlers)
- Impact: Critical - Google ecosystem integration
High Impact Bots
5. PerplexityBot (Perplexity)
- Purpose: Conversational search engine
- User-Agent: PerplexityBot
- Impact: High - Direct citations with sources
6. Applebot-Extended (Apple)
- Purpose: Apple Intelligence and Siri
- User-Agent: Applebot-Extended (a robots.txt token; the crawling itself is performed by Applebot)
- Impact: High - iOS/macOS ecosystem
Medium Impact Bots
7. Googlebot (Google)
- Purpose: Google Search indexing (not AI-specific)
- User-Agent: Googlebot
- Impact: High for SEO, medium for direct GEO
8. CCBot (Common Crawl)
- Purpose: Research dataset used to train LLMs
- User-Agent: CCBot/2.0
- Impact: Medium - Base for many models
Technical Differences Between Bots
| Bot | Crawl Frequency | Respects robots.txt | Processes JavaScript | Size Limit |
|---|---|---|---|---|
| GPTBot | Daily/Weekly | Yes | Limited | ~100KB |
| ChatGPT-User | Real-time | Yes | Yes (headless) | ~50KB |
| ClaudeBot | Weekly | Yes | Limited | ~100KB |
| Google-Extended | Continuous | Yes | Yes | No limit |
| PerplexityBot | Real-time | Yes | Yes | ~100KB |
Optimal robots.txt Configuration
Recommended Configuration (Maximum Visibility)
```
# AI Bots - Allow full access
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

# Traditional search bots
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Default rule
User-agent: *
Allow: /

# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml
```

Common Errors That Block AI Bots
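Before deploying, it is worth checking the file programmatically. A minimal sketch using Python's built-in `urllib.robotparser`, parsing an inline (shortened) copy of the recommended configuration; the domain is a placeholder:

```python
# Minimal sketch: confirm the recommended robots.txt grants access to each
# AI bot, using Python's built-in robots.txt parser. ROBOTS_TXT is a
# shortened inline copy of the recommended config; swap in your real file.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot",
           "Google-Extended", "PerplexityBot"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in AI_BOTS:
    verdict = "allowed" if parser.can_fetch(bot, "https://yourdomain.com/") else "BLOCKED"
    print(f"{bot}: {verdict}")
```

Run this against your production robots.txt after every deploy; a single bot flipping to BLOCKED is exactly the kind of silent regression this whitepaper warns about.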
Error 1: Global Disallow Without Exceptions
Incorrect:

```
User-agent: *
Disallow: /
```

Correct:

```
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
```

Error 2: Blocking Specific AI Bots
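The reason the corrected file works: under the Robots Exclusion Protocol (RFC 9309), a crawler obeys only the most specific `User-agent` group that matches it, not the `*` group. A sketch with Python's built-in parser (paths are illustrative):

```python
# Sketch: a specific User-agent group overrides the "*" catch-all group.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Allow: /
"""

p = RobotFileParser()
p.parse(rules.splitlines())

# GPTBot matches its own group, so even /admin/ is allowed for it...
print(p.can_fetch("GPTBot", "https://yourdomain.com/admin/"))     # True
# ...while bots without a dedicated group fall back to the "*" rules.
print(p.can_fetch("RandomBot", "https://yourdomain.com/admin/"))  # False
print(p.can_fetch("RandomBot", "https://yourdomain.com/blog/"))   # True
```

Note the flip side: because GPTBot's group contains only `Allow: /`, GPTBot is no longer bound by the `Disallow: /admin/` in the `*` group. If AI bots should also skip those paths, repeat the `Disallow` lines inside each bot's group.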
Some sites inherit configurations that block AI bots:
```
# BAD - Blocks visibility in ChatGPT
User-agent: GPTBot
Disallow: /
```

Error 3: Not Having robots.txt
Without a robots.txt file, well-behaved bots assume everything is allowed. However, you then have:
- No control over which pages get crawled
- No way to point crawlers to your sitemap
- A harder time monitoring bot behavior
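One way to get that monitoring back is to scan your server access logs for known AI user agents. A sketch assuming a standard Nginx/Apache combined log format; the log path and the UA substrings are illustrative and should be adjusted to your setup:

```python
# Sketch: count AI-bot hits in an access log (combined log format assumed).
# The log path and UA substrings are assumptions; adapt them to your stack.
from collections import Counter
from pathlib import Path

AI_UA_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot",
                "PerplexityBot", "Applebot", "CCBot"]

def count_ai_hits(log_path: str) -> Counter:
    """Tally log lines whose User-Agent field mentions a known AI bot."""
    counts = Counter()
    for line in Path(log_path).read_text(errors="replace").splitlines():
        for token in AI_UA_TOKENS:
            if token in line:
                counts[token] += 1
                break  # one bot per line is enough
    return counts

# Example: print(count_ai_hits("/var/log/nginx/access.log"))
```

Running this weekly gives a cheap baseline: if GPTBot hits drop to zero after a robots.txt change, something broke.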
How to Verify Your Configuration
1. Review current robots.txt
```
curl https://yourdomain.com/robots.txt
```

2. Verify each bot individually

```
# Check if GPTBot can access
curl -A "GPTBot/1.0" https://yourdomain.com/
```

3. Use validation tools
- Google Search Console (for Googlebot)
- Our GEO audit (for all AI bots)
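The manual curl checks can also be scripted. A sketch that downloads a site's robots.txt and reports, per AI bot, whether the homepage may be fetched; the domain is a placeholder, and the rule matching is delegated to Python's built-in `urllib.robotparser`:

```python
# Sketch: audit which AI bots a site's robots.txt allows on the homepage.
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended",
           "PerplexityBot", "Applebot-Extended", "CCBot"]

def audit_robots(robots_text: str, site: str) -> dict:
    """Map each AI bot to True (allowed) / False (blocked) for the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {bot: parser.can_fetch(bot, site + "/") for bot in AI_BOTS}

def audit_site(domain: str) -> dict:
    """Fetch a live robots.txt and audit it."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return audit_robots(text, f"https://{domain}")

# Example: print(audit_site("yourdomain.com"))
```

Splitting the parsing (`audit_robots`) from the fetching (`audit_site`) lets you test proposed robots.txt changes before they go live.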
Data from Our Audits
Analyzing 500+ sites, we found:
| Finding | Percentage |
|---|---|
| Allow all AI bots | 54% |
| Block at least 1 critical bot | 23% |
| No robots.txt | 12% |
| Block all bots | 11% |
Most frequently blocked bots:
1. GPTBot (blocked on 18% of sites)
2. CCBot (blocked on 15% of sites)
3. ClaudeBot (blocked on 9% of sites)
Recommendations by Use Case
For Maximum AI Visibility
- Allow all listed bots
- Include sitemap.xml
- Update content regularly
For Selective Control
- Allow critical bots (GPTBot, ClaudeBot, Google-Extended)
- Block training bots if data use is a concern (CCBot)
For Sites with Sensitive Content
- Use selective Disallow by path, not by bot
- Keep public content accessible to AI bots
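A sketch of that path-based approach (the paths are illustrative): sensitive areas are disallowed for every crawler, while everything else stays open, AI bots included:

```
# Sensitive paths blocked for ALL crawlers; public content stays open
User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /customer-data/
Allow: /
```

Because there is a single `User-agent: *` group and no bot-specific groups, GPTBot, ClaudeBot, and the rest all inherit exactly the same rules: no per-bot blocking needed.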
Conclusions
Your robots.txt configuration is fundamental to visibility in AI systems. A single common misconfiguration can exclude you completely from ChatGPT, Claude, or Perplexity.
Immediate actions:
1. Review your current robots.txt
2. Verify that the 4 critical bots have access
3. Add sitemap if you don't have one
4. Regularly monitor changes in bot policies