Executive Summary
RAG (Retrieval Augmented Generation) is how conversational AI products such as ChatGPT and Perplexity pull external content into their answers. This whitepaper explains how to optimize your content so that these systems retrieve and cite it.
Key finding: Sites averaging 20-40 words per section are 2.8x more likely to be included in AI responses.
What is RAG and Why Does It Matter?
RAG (Retrieval Augmented Generation) is an architecture that combines:
1. Retrieval: The system searches for relevant information from external sources
2. Augmentation: The retrieved information is injected into the model's context
3. Generation: The model generates responses based on retrieved information
Simplified flow:
User asks → System searches sources → Retrieves relevant fragments → LLM processes with context → Generates response with citations
How RAG Systems Process Your Content
Step 1: Crawling and Indexing
AI bots crawl your site and extract content:
- Visible text in semantic elements
- Metadata and Schema.org
- Headings and structure
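As a rough illustration of the extraction step, the sketch below collects headings and paragraph text found inside a <main> or <article> container using Python's standard html.parser module. The class name SemanticTextExtractor and the helper extract are hypothetical, and real AI crawlers are likely more sophisticated; this only shows why content outside semantic elements can be missed.

```python
from html.parser import HTMLParser


class SemanticTextExtractor(HTMLParser):
    """Collects headings and paragraph text found inside <main> or <article>."""

    CONTAINERS = {"main", "article"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of semantic containers we are inside
        self.current = None   # tag currently being captured, if any
        self.buffer = []
        self.items = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.depth += 1
        elif self.depth and (tag in self.HEADINGS or tag == "p"):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.depth:
            self.depth -= 1
        elif tag == self.current:
            # Collapse runs of whitespace before storing the text.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.items.append((tag, text))
            self.current = None


def extract(html):
    parser = SemanticTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Text that sits in a bare <div> outside the container is ignored by this sketch, which mirrors the advice to wrap main content in a semantic element.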
Step 2: Chunking (Segmentation)
Content is divided into "chunks" or fragments:
- Typically 100-500 tokens per chunk
- Semantic boundaries are preserved (paragraphs, sections)
- Each chunk is indexed independently
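A minimal sketch of paragraph-preserving chunking follows. It assumes tokens can be approximated by whitespace-separated words (production systems use a model-specific tokenizer), and the function name chunk_paragraphs and the default max_tokens value are illustrative choices, not taken from any particular RAG system.

```python
def chunk_paragraphs(paragraphs, max_tokens=300):
    """Group whole paragraphs into chunks of at most max_tokens tokens.

    Tokens are approximated by whitespace-separated words. A paragraph
    longer than max_tokens becomes its own oversized chunk rather than
    being cut mid-sentence.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n = len(paragraph.split())
        # Close the current chunk if this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunk boundaries fall only between paragraphs, short self-contained paragraphs survive intact, while a single long block is more likely to be split by a real chunker at an arbitrary point.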
Step 3: Embedding
Each chunk is converted into a numerical vector:
- Captures semantic meaning
- Enables similarity search
- Typical dimensions: 768-1536
Step 4: Retrieval
When a user asks a question:
- The question is converted to an embedding
- The index is searched for the chunks most similar to it
- The top-ranked chunks are sent to the LLM
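Steps 3 and 4 can be sketched together. The embedding below is a toy stand-in (a hashed bag-of-words vector), not a learned model, so it only captures word overlap rather than semantic meaning; the names embed, cosine, retrieve, and the DIM value are assumptions made for illustration.

```python
import math
import re
import zlib

DIM = 256  # toy size; real embedding models typically use 768-1536 dimensions


def embed(text):
    """Toy embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks ranked by similarity to the question."""
    q = embed(question)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for _, c in scored[:top_k]]
```

Calling retrieve("What is FAQPage markup?", chunks) ranks chunks by shared vocabulary with the question. A real system would swap embed for a trained embedding model and the sorted scan for a vector index, but the ranking logic is the same.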
Key Metrics for RAG Optimization
1. Average Words Per Section
This metric indicates how well-segmented your content is.
| Range | Evaluation | RAG Impact |
|---|---|---|
| < 15 | Too short | Chunks with little context |
| 15-25 | Low optimal | Good for FAQs |
| 25-40 | Optimal | Ideal for educational content |
| 40-60 | Acceptable | Potentially long chunks |
| > 60 | Suboptimal | Risk of semantic cuts |
Optimized site example:
A well-segmented site has approximately 27 words per section, 14 headings, and a total of 381 well-distributed words.
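The figure in this example can be reproduced with simple arithmetic, assuming the metric is total words divided by the number of headings (each heading opening one section). That definition and the handling of the table's shared boundaries (25, 40, 60) are assumptions; the function names are illustrative.

```python
def words_per_section(total_words, heading_count):
    """Average words per section, treating each heading as one section."""
    if heading_count <= 0:
        raise ValueError("heading_count must be positive")
    return total_words / heading_count


def evaluate(avg):
    """Map an average to the evaluation labels used in the table."""
    if avg < 15:
        return "Too short"
    if avg < 25:
        return "Low optimal"
    if avg <= 40:
        return "Optimal"
    if avg <= 60:
        return "Acceptable"
    return "Suboptimal"


avg = words_per_section(381, 14)
print(round(avg, 1), evaluate(avg))  # prints: 27.2 Optimal
```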
2. Entity Density
Measures the proportion of named entities vs total text.
| Range | Evaluation |
|---|---|
| < 0.05 | Generic content |
| 0.05-0.10 | Low in entities |
| 0.10-0.20 | Optimal |
| 0.20-0.30 | High in entities |
| > 0.30 | Saturated |
Why it matters: RAG systems prioritize chunks with clear entities because they're easier to relate to specific queries.
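For a quick self-check, entity density can be approximated without an NER model. The heuristic below is an assumption, not the method used in the audits: it counts a word as an entity if it starts with an uppercase letter and does not open a sentence, then divides by the total word count. A proper measurement would use a named-entity recognizer.

```python
import re


def entity_density(text):
    """Rough entity density: share of words that look like proper nouns.

    Heuristic only: a word counts if it starts with an uppercase letter
    and is not the first word of its sentence.
    """
    total = 0
    entities = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", sentence)
        total += len(words)
        entities += sum(1 for w in words[1:] if w[0].isupper())
    return entities / total if total else 0.0
```

For the text "The model ChatGPT was built by OpenAI. It answers questions." this counts 2 entity-like words out of 10, giving 0.2. The heuristic misses entities that open a sentence and miscounts capitalized common words, so treat the result as a rough indicator only.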
3. Semantic Structure
| Metric | Optimal Value | Reason |
|---|---|---|
| Semantic container | Yes | Defines main content |
| Paragraph count | 10-25 | Good segmentation |
| Hierarchy jumps | 0 | Clear hierarchy |
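One way to check the "hierarchy jumps" metric is to list heading levels in document order and count every place where the level increases by more than one (for example h2 followed directly by h4). This is a sketch under that assumed definition; the regex-based heading_levels helper is a simplification that ignores headings generated by scripts.

```python
import re


def heading_levels(html):
    """Return heading levels (1-6) in the order they appear in the HTML."""
    return [int(n) for n in re.findall(r"<h([1-6])\b", html, flags=re.IGNORECASE)]


def hierarchy_jumps(levels):
    """Count headings that skip a level on the way down (e.g. h2 -> h4)."""
    return sum(1 for prev, cur in zip(levels, levels[1:]) if cur - prev > 1)
```

Moving back up the hierarchy (h3 followed by h2) is not counted as a jump, since closing a subsection is normal structure.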
Optimization Strategies
Strategy 1: Structure for Natural Chunks
Bad:
<div>
All content in a single long block without
clear structure or intermediate headings...
</div>
Good:
<article>
<section>
<h2>Main Topic</h2>
<p>Topic-specific content...</p>
</section>
<section>
<h2>Related Subtopic</h2>
<p>Subtopic content...</p>
</section>
</article>
Strategy 2: Front-loading Information
Place the most important information at the beginning of each section. Start with a clear definition, then context, and finally technical details.
Strategy 3: Clear and Consistent Entities
Use complete and consistent names:
Bad:
- "The model" (ambiguous)
- "GPT" (incomplete)
- "That works better" (vague pronoun)
Good:
- "ChatGPT by OpenAI"
- "GPT-4 (OpenAI's language model)"
- "The GEO strategy works better"
Strategy 4: Explicit Questions and Answers
RAG systems favor Q&A format content. Implement Schema.org FAQPage for frequently asked questions.
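A minimal sketch of generating FAQPage markup follows. The JSON-LD shape (FAQPage, with Question items in mainEntity, each carrying an acceptedAnswer of type Answer) follows the Schema.org vocabulary; the helper name faq_jsonld and the sample question are illustrative.

```python
import json


def faq_jsonld(pairs):
    """Build Schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


markup = faq_jsonld([
    ("What is RAG?",
     "RAG (Retrieval Augmented Generation) combines retrieval with text generation."),
])
# Wrap the JSON-LD in a script tag for inclusion in the page.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The questions and answers in the markup should match the text visible on the page, so the structured data and the retrievable content stay consistent.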
Data from Our Audits
We analyzed 500+ sites and their presence in AI responses:
| Characteristic | Cited Sites | Non-Cited Sites |
|---|---|---|
| Words per section | 28.4 | 67.2 |
| Entity density | 0.14 | 0.06 |
| Semantic container | 89% | 34% |
| FAQ schema | 67% | 12% |
Correlations found:
- +287% citations with correct semantic structure
- +156% citations with FAQ schema
- +89% citations with entity density > 0.10
RAG Optimization Checklist
Structure
- Container <main> or <article> defines main content
- Headings in hierarchy without jumps (h1→h2→h3)
- 25-40 average words per section
- 2-4 sentence paragraphs
Content
- Key information at beginning of sections
- Explicit and consistent named entities
- Clear definitions of technical terms
- Q&A format where natural
Schema.org
- FAQPage for FAQ content
- Article with headline and description
- Organization for brand identity
Technical
- Text-to-code ratio > 0.05
- No content gating (paywalls, login walls)
- Fast loading (< 3s)
- Correct semantic HTML
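The text-to-code ratio in the checklist can be estimated as shown below. The definition used here (characters of visible text divided by characters of raw HTML, ignoring script and style content) is an assumption, since audit tools differ in how they compute it; the names are illustrative.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Gathers text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def text_to_code_ratio(html):
    """Visible text length divided by total HTML length (0.0 for empty input)."""
    if not html:
        return 0.0
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join("".join(collector.parts).split())
    return len(text) / len(html)
```

A page whose result falls below the checklist threshold of 0.05 is mostly markup and scripts, leaving little text for a crawler to extract.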
Conclusions
Optimizing for RAG systems is not optional if you want AI visibility. Products like ChatGPT and Perplexity can only cite your site if they can retrieve relevant, well-structured information from it.
Priority actions:
1. Audit your heading structure
2. Calculate your current entity density
3. Implement FAQ schema where relevant
4. Ensure each section is a self-contained chunk