Whitepaper

Optimizing for RAG Systems

How to structure your content for Retrieval Augmented Generation systems. Optimal segmentation metrics, entity density and chunking.


Executive Summary

Retrieval Augmented Generation (RAG) powers the web-grounded answers of ChatGPT, Perplexity, and other conversational AI products. This whitepaper explains how to optimize your content so that these systems retrieve and cite it.

Key finding: Sites averaging 20-40 words per section are 2.8x more likely to be included in AI responses.

What is RAG and Why Does It Matter?

RAG (Retrieval Augmented Generation) is an architecture that combines three steps:

1. Retrieval: The system searches external sources for relevant information

2. Augmentation: The retrieved information is injected into the model's context

3. Generation: The model generates a response based on the retrieved information

Simplified flow:

User asks → System searches sources → Retrieves relevant fragments → LLM processes with context → Generates response with citations

How RAG Systems Process Your Content

Step 1: Crawling and Indexing

AI bots crawl your site and extract content:

  • Visible text in semantic elements
  • Metadata and Schema.org
  • Headings and structure
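
As a rough illustration of this extraction step, the sketch below uses Python's standard html.parser module to pull visible text out of headings, paragraphs and list items while ignoring script and style content. The CONTENT_TAGS set and the extract_blocks helper are assumptions made for this example; it does not claim to reproduce how any particular AI crawler works.

```python
from html.parser import HTMLParser

# Tags treated as content-bearing; this set is an assumption for illustration.
CONTENT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}
# Tags whose text is never visible to a reader.
SKIP_TAGS = {"script", "style"}


class ContentExtractor(HTMLParser):
    """Collects (tag, text) pairs for headings, paragraphs and list items."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # extracted (tag, text) pairs in document order
        self._current = None  # content tag currently open, if any
        self._buffer = []     # text fragments collected for that tag
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in CONTENT_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._current:
            # Collapse runs of whitespace before storing the block.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._skip_depth == 0:
            self._buffer.append(data)


def extract_blocks(html):
    """Return the (tag, text) blocks found in an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Text that sits directly inside a generic <div>, outside any heading, paragraph or list item, is dropped by this sketch, which is one reason unstructured markup can end up invisible to an extractor.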

Step 2: Chunking (Segmentation)

Content is divided into "chunks" or fragments:

  • Typically 100-500 tokens per chunk
  • Semantic boundaries are preserved (paragraphs, sections)
  • Each chunk is indexed independently
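
A minimal sketch of boundary-preserving chunking, assuming paragraphs are separated by blank lines and approximating tokens with whitespace-separated words (real systems count tokens with the model's own tokenizer). The chunk_paragraphs name and the max_tokens default are hypothetical.

```python
def chunk_paragraphs(text, max_tokens=300):
    """Group paragraphs into chunks of at most max_tokens whitespace tokens.

    Paragraph boundaries (blank lines) are never split, so a single paragraph
    longer than max_tokens becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the splitter only cuts between paragraphs, short and self-contained paragraphs give it more places to cut cleanly.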

Step 3: Embedding

Each chunk is converted into a numerical vector:

  • Captures semantic meaning
  • Enables similarity search
  • Typical dimensions: 768-1536
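
To make the idea of "text as a vector" concrete, here is a toy stand-in for an embedding model. It uses feature hashing rather than a learned model, so it only reflects word overlap; the function names and the 768-dimension default are assumptions chosen to match the range above.

```python
import hashlib
import math


def toy_embedding(text, dim=768):
    """Map text to a unit-length vector of size dim using feature hashing.

    This is NOT a learned embedding: it only captures word overlap, whereas
    production models also capture meaning (synonyms, paraphrases).
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # Hash each word to a stable position in the vector.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

Two texts that share words land on overlapping positions and therefore score above zero, which is the property similarity search relies on.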

Step 4: Retrieval

When a user asks:

  • The question is converted to an embedding
  • The index is searched for the chunks most similar to that embedding
  • Top chunks are sent to the LLM
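
The retrieval step can be sketched as a cosine-similarity ranking over precomputed chunk vectors. This is a brute-force illustration under simple assumptions; production systems typically use an approximate nearest-neighbor index, and the retrieve_top_k name and the dictionary-based index are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, index, k=3):
    """Return the k (score, chunk_id) pairs most similar to query_vec.

    index maps a chunk id to its precomputed embedding vector.
    """
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Only the chunks returned here reach the LLM, so a chunk that mixes several topics competes poorly against one that answers a single question.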

Key Metrics for RAG Optimization

1. Average Words Per Section

This metric indicates how well segmented your content is.

Range | Evaluation  | RAG Impact
< 15  | Too short   | Chunks with little context
15-25 | Low optimal | Good for FAQs
25-40 | Optimal     | Ideal for educational content
40-60 | Acceptable  | Potentially long chunks
> 60  | Suboptimal  | Risk of semantic cuts

Optimized site example:

A well-segmented page with 381 words distributed under 14 headings averages approximately 27 words per section (381 / 14 ≈ 27.2).
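
The metric can be computed as total words divided by number of sections. The sketch below assumes the page has already been split into (heading, body) pairs; both function names are hypothetical, and because the ranges in the table share their endpoints (25 and 40 each appear twice), the choice of which band a boundary value falls into is also an assumption.

```python
def average_words_per_section(sections):
    """sections is a list of (heading, body_text) pairs."""
    if not sections:
        return 0.0
    total_words = sum(len(body.split()) for _, body in sections)
    return total_words / len(sections)


def evaluate_words_per_section(avg):
    """Label an average using the ranges from the table above.

    Boundary values are assigned by assumption: 25 and 40 count as
    "Optimal", 60 counts as "Acceptable".
    """
    if avg < 15:
        return "Too short"
    if avg < 25:
        return "Low optimal"
    if avg <= 40:
        return "Optimal"
    if avg <= 60:
        return "Acceptable"
    return "Suboptimal"
```

For the example page, evaluate_words_per_section(381 / 14) falls in the "Optimal" band.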

2. Entity Density

Entity density measures the proportion of the text that consists of named entities (people, organizations, products, places).

Range     | Evaluation
< 0.05    | Generic content
0.05-0.10 | Low in entities
0.10-0.20 | Optimal
0.20-0.30 | High in entities
> 0.30    | Saturated

Why it matters: Chunks that name clear entities are easier to match to specific queries, so RAG systems tend to prioritize them.
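
One way to estimate entity density is to count the words that belong to an entity mention and divide by the total word count. The sketch below assumes that definition and replaces a real named-entity recognizer with a caller-supplied list of entity names; the entity_density function and its word-matching rule are illustrative assumptions, not the method used in the audits.

```python
import re

# A "word" here is a run of letters, digits, apostrophes or hyphens (assumption).
WORD_PATTERN = re.compile(r"[\w'-]+")


def entity_density(text, entities):
    """Share of words in text that belong to a mention of a known entity.

    entities is a caller-supplied list of names (a stand-in for a real
    named-entity recognizer). Matching is case-insensitive and whole-word.
    """
    words = WORD_PATTERN.findall(text)
    if not words:
        return 0.0
    lowered = [w.lower() for w in words]
    covered = [False] * len(words)
    for entity in entities:
        parts = [p.lower() for p in WORD_PATTERN.findall(entity)]
        if not parts:
            continue
        # Mark every position where the full entity name occurs.
        for i in range(len(lowered) - len(parts) + 1):
            if lowered[i:i + len(parts)] == parts:
                for j in range(i, i + len(parts)):
                    covered[j] = True
    return sum(covered) / len(words)
```

Words are marked rather than counted per match, so overlapping entity names are not double-counted.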

3. Semantic Structure

Metric             | Optimal Value | Reason
Semantic container | Yes           | Defines main content
Paragraph count    | 10-25         | Good segmentation
Hierarchy jumps    | 0             | Clear hierarchy
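
A hierarchy jump can be detected by comparing each heading level with the one before it. The sketch below assumes a jump means going more than one level deeper in a single step (for example h1 directly to h3); the function name is hypothetical.

```python
def count_hierarchy_jumps(levels):
    """Count headings that are more than one level deeper than the previous one.

    levels is the sequence of heading levels in document order,
    e.g. [1, 2, 3, 2] for h1, h2, h3, h2. Moving back up is never a jump.
    """
    return sum(1 for prev, cur in zip(levels, levels[1:]) if cur - prev > 1)
```

For instance, the sequence h1, h3, h2, h4 contains two jumps (h1 to h3 and h2 to h4), while h1, h2, h3, h2, h3 contains none.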

Optimization Strategies

Strategy 1: Structure for Natural Chunks

Bad:

<div>
  All content in a single long block without
  clear structure or intermediate headings...
</div>

Good:

<article>
  <section>
    <h2>Main Topic</h2>
    <p>Topic-specific content...</p>
  </section>

  <section>
    <h2>Related Subtopic</h2>
    <p>Subtopic content...</p>
  </section>
</article>

Strategy 2: Front-loading Information

Place the most important information at the beginning of each section. Start with a clear definition, then context, and finally technical details.

Strategy 3: Clear and Consistent Entities

Use complete and consistent names:

Bad:

  • "The model" (ambiguous)
  • "GPT" (incomplete)
  • "That works better" (vague pronoun)

Good:

  • "ChatGPT by OpenAI"
  • "GPT-4 (OpenAI's language model)"
  • "The GEO strategy works better"

Strategy 4: Explicit Questions and Answers

RAG systems favor content in Q&A format. Mark up frequently asked questions with Schema.org FAQPage.
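
As a sketch of what that markup looks like, the helper below builds a Schema.org FAQPage object from question-and-answer pairs and serializes it as JSON-LD. The FAQPage, Question, Answer, mainEntity, name, acceptedAnswer and text terms come from the Schema.org vocabulary; the function names are hypothetical.

```python
import json


def faq_page_jsonld(pairs):
    """Build a Schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize the object for embedding in the page's HTML."""
    # Escape "</" so answer text cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

The visible FAQ text on the page should match the questions and answers in the markup.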

Data from Our Audits

We analyzed 500+ sites and their presence in AI responses:

Characteristic     | Cited Sites | Non-Cited Sites
Words per section  | 28.4        | 67.2
Entity density     | 0.14        | 0.06
Semantic container | 89%         | 34%
FAQ schema         | 67%         | 12%

Citation uplift associated with each characteristic:

  • +287% citations with correct semantic structure
  • +156% citations with FAQ schema
  • +89% citations with entity density > 0.10

RAG Optimization Checklist

Structure

  • A <main> or <article> container defines the main content
  • Headings in hierarchy without jumps (h1→h2→h3)
  • 25-40 average words per section
  • 2-4 sentence paragraphs

Content

  • Key information at the beginning of each section
  • Explicit and consistent named entities
  • Clear definitions of technical terms
  • Q&A format where natural

Schema.org

  • FAQPage for FAQ content
  • Article with headline and description
  • Organization for brand identity

Technical

  • Text-to-code ratio > 0.05
  • No content gating (paywalls, login walls)
  • Fast loading (< 3s)
  • Correct semantic HTML
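
The text-to-code ratio in this checklist can be approximated as the length of the visible text divided by the length of the full HTML source. That definition is an assumption (audit tools differ in what they count), and the helper names below are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def text_to_code_ratio(html):
    """Visible-text characters divided by total HTML characters."""
    if not html:
        return 0.0
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation does not inflate the text length.
    visible = " ".join(" ".join(parser.parts).split())
    return len(visible) / len(html)
```

A page that is mostly inline scripts and styles scores near zero under this measure, even if it looks content-rich in a browser.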

Conclusions

Optimizing for RAG systems is not optional if you want AI visibility. Products like ChatGPT and Perplexity depend on retrieving relevant information from sites like yours.

Priority actions:

1. Audit your heading structure

2. Calculate your current entity density

3. Implement FAQ schema where relevant

4. Ensure each section is a self-contained chunk