
A Hybrid Strategy for Instant Document Grounding

Technical Whitepaper v1.2

November 2024

Prepared for:

AI Architects, CTOs, and Technical Decision Makers

YellowMind.ai by Yellowsys

AI That Manages What Matters


Executive Summary

This whitepaper presents a hybrid strategy for document grounding that requires neither heavyweight infrastructure nor persistent indexing. The approach adapts its processing pipeline to document size, achieving sub-second ingest latency for the majority of enterprise documents while maintaining high retrieval quality.

Key Innovations

1. Tiered Processing Architecture: Documents are automatically routed to optimal processing pipelines based on size and structure characteristics.

2. BM25-First Retrieval: Lexical search handles 80%+ of queries instantly; embeddings computed only when needed.

3. Lazy Embedding Strategy: Eliminates the embedding bottleneck that plagues traditional RAG implementations.

4. Progressive Ingest: Large documents become queryable page-by-page, enabling interaction before full processing completes.

Performance Targets

| Metric                | Target       | Approach                 |
|-----------------------|--------------|--------------------------|
| Ingest (< 5 pages)    | < 200ms      | Direct context injection |
| Ingest (5-20 pages)   | < 500ms      | BM25 indexing only       |
| Ingest (> 20 pages)   | < 1s partial | Progressive streaming    |
| Query (keyword-rich)  | < 50ms       | BM25 direct              |
| Query (semantic)      | < 500ms      | Lazy embed + rerank      |

Table 1: Performance targets by document size and query type


The Document Grounding Challenge

Traditional RAG Limitations

Traditional Retrieval-Augmented Generation (RAG) pipelines follow a sequential process that creates significant latency bottlenecks:

Upload → Parse → Chunk → Embed → Index → Query → Retrieve → Generate

The embedding step is particularly problematic. Even with optimized models, embedding a 20-page document (approximately 100-150 chunks) requires 3-10 seconds of processing time. This latency breaks the instant upload experience that users have come to expect from consumer AI products like ChatGPT and Microsoft Copilot.

User Experience Gap

Modern users expect immediate responsiveness. When uploading a document to ask questions, they anticipate:

• Instant acknowledgment (< 200ms)

• Ability to ask questions within 1-2 seconds

• No visible loading states or progress bars for small documents

• Graceful handling of larger documents with clear progress indication

Traditional RAG fails to meet these expectations, forcing users to wait through visible processing states that erode confidence in the system.

The Size-Complexity Mismatch

A critical insight is that most document interactions involve relatively small files. Analysis of enterprise document workflows reveals:

• 60-70% of documents are under 5 pages (emails, memos, invoices)

• 25-30% are 5-20 pages (contracts, reports, proposals)

• < 10% exceed 20 pages (manuals, specifications, legal documents)

Yet traditional RAG applies the same heavyweight processing to a 2-page invoice as to a 200-page technical manual. This one-size-fits-all approach is fundamentally inefficient.


Hybrid Instant Grounding Strategy

The hybrid strategy addresses the size-complexity mismatch by implementing three distinct processing tiers, each optimized for a specific document size range.

Tier Classification

| Tier   | Document Size | Token Range | Processing Strategy          |
|--------|---------------|-------------|------------------------------|
| Tier 1 | < 5 pages     | 2K - 10K    | Direct Context Injection     |
| Tier 2 | 5 - 20 pages  | 10K - 50K   | BM25-First + Lazy Embedding  |
| Tier 3 | > 20 pages    | 50K+        | Progressive + Hierarchical   |

Table 2: Tier classification by document characteristics
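To make the routing rule concrete, the sketch below classifies an incoming document by estimated token count. It is a minimal illustration only: the 4-characters-per-token heuristic is an assumption, and the boundaries follow Table 2.

```python
from enum import Enum

class Tier(Enum):
    DIRECT_CONTEXT = 1   # Tier 1: inject the whole document into context
    BM25_FIRST = 2       # Tier 2: BM25 index now, embeddings only on demand
    PROGRESSIVE = 3      # Tier 3: page-by-page ingest + hierarchical retrieval

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose (assumption).
    return max(1, len(text) // 4)

def classify_tier(text: str) -> Tier:
    # Boundaries follow Table 2.
    tokens = estimate_tokens(text)
    if tokens <= 10_000:
        return Tier.DIRECT_CONTEXT
    if tokens <= 50_000:
        return Tier.BM25_FIRST
    return Tier.PROGRESSIVE
```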

Architecture Diagram

The following diagram illustrates the decision flow and processing pipeline:

Figure 1: Tier-based processing architecture


Tier 1: Direct Context Injection

Rationale

For small documents, the overhead of chunking, indexing, and retrieval actually hurts performance and quality. Modern LLMs with 128K+ context windows can easily accommodate a 5-page document (~10K tokens) with room to spare. The LLM's native attention mechanism becomes the retrieval system.

Processing Pipeline

1. Document Reception (0-50ms): Validate file type and size, generate session-scoped document ID, acknowledge upload immediately.

2. Fast Parse (50-150ms): Extract text preserving basic structure, convert to clean markdown format, preserve tables as markdown tables.

3. Metadata Extraction (20-50ms): Infer document title, detect language, identify document type heuristically, extract obvious entities.

4. Cache for Session (10ms): Store full markdown text in session memory. No chunking, no indexing, no embedding.

Query Handling

Queries are handled by prepending the full document to the system prompt. The LLM attends to relevant portions naturally through its attention mechanism.
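A minimal sketch of this query path, assuming the session cache already holds the parsed markdown; the message structure and prompt wording are illustrative, not a fixed API.

```python
def build_tier1_messages(document_md: str, question: str,
                         history: list[dict] | None = None) -> list[dict]:
    """Prepend the full document to the system prompt; the LLM's own attention
    does the retrieval. No chunking, no index, no embeddings."""
    system = (
        "You are answering questions about the document below. "
        "Quote or reference the document where relevant.\n\n"
        "--- DOCUMENT START ---\n"
        f"{document_md}\n"
        "--- DOCUMENT END ---"
    )
    return [{"role": "system", "content": system}, *(history or []),
            {"role": "user", "content": question}]
```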

Advantages

• Zero retrieval latency: No search step between query and generation

• Perfect recall: LLM sees everything, nothing is missed due to retrieval failures

• Context coherence: No chunking artifacts, relationships preserved

• Simplicity: Minimal moving parts, fewer failure modes

Limitations

• Token cost: Every query pays for full document in input tokens

• Context budget: Leaves less room for conversation history

• Scale ceiling: Not viable beyond ~12K tokens

Tier Promotion Triggers

Documents are automatically promoted to Tier 2 processing when:

• Document exceeds 12K tokens (~6 pages)

• Multiple documents uploaded in same session

• User explicitly requests precise citation/sourcing
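The triggers above reduce to a simple predicate. The session fields below are hypothetical bookkeeping for illustration; the 12K-token boundary is the one stated above.

```python
from dataclasses import dataclass

@dataclass
class SessionState:
    # Hypothetical session bookkeeping, for illustration only.
    document_tokens: int = 0
    documents_uploaded: int = 0
    citation_mode_requested: bool = False

def should_promote_to_tier2(session: SessionState) -> bool:
    """Promote out of direct context injection when any trigger fires."""
    return (
        session.document_tokens > 12_000      # ~6 pages
        or session.documents_uploaded > 1     # multiple documents in session
        or session.citation_mode_requested    # user wants precise sourcing
    )
```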


Tier 2: BM25-First with Lazy Embedding

Rationale

Mid-size documents are too large for comfortable context injection but small enough that full embedding is overkill. BM25 lexical search handles 80%+ of queries effectively. Embeddings are computed only when lexical matching fails to achieve sufficient confidence.

Processing Pipeline

1. Structure-Aware Parsing (100-300ms): Parse with layout detection (headings, paragraphs, lists, tables), identify document hierarchy, preserve page boundaries.

2. Semantic Chunking (50-150ms): Chunk by semantic boundaries (not fixed token counts), respect heading hierarchy, target 256-512 tokens with flexibility.

3. Enrichment (50-100ms): Extract keywords per chunk (TF-IDF style), identify entities using pattern matching, build parent context breadcrumbs.

4. BM25 Index Construction (20-50ms): Build inverted index over normalized chunk text, include keywords with boosted weight, store document frequency statistics.
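Step 4 can be sketched with an off-the-shelf BM25 implementation. The example below assumes the rank_bm25 package and simple regex tokenization; keyword boosting is approximated by appending each chunk's extracted keywords to its token stream.

```python
# pip install rank-bm25
import re
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    # Simple normalization: lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_bm25_index(chunks: list[str], keywords: list[list[str]]) -> BM25Okapi:
    """Build an in-memory BM25 index over the chunk text.

    Keyword boosting is approximated by appending each chunk's extracted
    keywords to its token stream, which raises their term frequency."""
    corpus = [
        tokenize(chunk) + [kw.lower() for kw in kws]
        for chunk, kws in zip(chunks, keywords)
    ]
    return BM25Okapi(corpus)

# Usage: scores = build_bm25_index(chunks, keywords).get_scores(tokenize(query))
```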

Query Pipeline

Figure 2: Tier 2 query pipeline with confidence-based routing

Confidence Threshold Calibration

| Query Type               | Threshold | Rationale                                 |
|--------------------------|-----------|-------------------------------------------|
| Factual ("what is X")    | 0.75      | Needs precise match                       |
| Definition ("define X")  | 0.70      | Usually keyword-rich                      |
| Summary ("summarize")    | 0.50      | Broad queries, multiple chunks acceptable |
| Comparison ("X vs Y")    | 0.65      | Needs both comparison terms               |
| Procedural ("how to")    | 0.60      | Often has clear action keywords           |

Table 3: Confidence thresholds by query type
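The routing shown in Figure 2 can be sketched as follows. The mapping from raw BM25 scores to a 0-1 confidence is an illustrative heuristic, and `embed_and_rerank` stands in for the lazy embedding path described below; neither is the production implementation.

```python
import numpy as np

# Thresholds from Table 3; query-type detection itself is out of scope here.
THRESHOLDS = {
    "factual": 0.75, "definition": 0.70, "summary": 0.50,
    "comparison": 0.65, "procedural": 0.60,
}

def bm25_confidence(scores: np.ndarray) -> float:
    """Illustrative heuristic: confidence is high when the top BM25 score is
    strong in absolute terms AND clearly ahead of the runner-up."""
    top = float(scores.max())
    if top <= 0.0:
        return 0.0
    runner_up = float(np.partition(scores, -2)[-2]) if len(scores) > 1 else 0.0
    strength = top / (top + 5.0)       # saturates toward 1.0 for large scores
    margin = max(0.0, min(1.0, 1.0 - runner_up / top))
    return 0.5 * strength + 0.5 * margin

def retrieve(query_tokens, query_type, bm25, chunks, embed_and_rerank):
    """BM25 first; fall back to lazy embedding only below the threshold."""
    scores = np.asarray(bm25.get_scores(query_tokens))
    top_ids = np.argsort(scores)[::-1][:5]
    candidates = [chunks[i] for i in top_ids]
    if bm25_confidence(scores) >= THRESHOLDS.get(query_type, 0.65):
        return candidates                                 # fast path, no embeddings
    return embed_and_rerank(query_tokens, candidates)     # lazy semantic path
```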

Lazy Embedding Cache

When embeddings are computed for reranking, they are cached in session memory. Subsequent queries benefit from cached embeddings, and over multiple queries, popular chunks become "warm." This approach provides organic embedding coverage growth without upfront compute investment.
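A minimal sketch of such a session-scoped cache, assuming an embedding function that maps a list of texts to one vector each (a sentence-transformers style `encode` would fit); the cosine reranking is illustrative.

```python
import numpy as np

class LazyEmbeddingCache:
    """Session-scoped cache: chunks are embedded only when a query needs
    reranking, and the vectors are reused by every later query."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn                  # list[str] -> list of vectors
        self.vectors: dict[int, np.ndarray] = {}  # chunk id -> embedding

    def get(self, chunk_ids: list[int], chunks: list[str]) -> np.ndarray:
        missing = [i for i in chunk_ids if i not in self.vectors]
        if missing:
            # Only the cold chunks are embedded; warm chunks are free.
            for i, vec in zip(missing, self.embed_fn([chunks[i] for i in missing])):
                self.vectors[i] = np.asarray(vec, dtype=float)
        return np.stack([self.vectors[i] for i in chunk_ids])

def cosine_rerank(query_vec: np.ndarray, candidate_vecs: np.ndarray,
                  chunk_ids: list[int]) -> list[int]:
    # Rerank the BM25 candidate set by cosine similarity to the query.
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return [chunk_ids[i] for i in np.argsort(c @ q)[::-1]]
```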


Tier 3: Progressive Ingest + Hierarchical Retrieval

Rationale

Large documents cannot be processed synchronously without breaking the user experience. The solution is to make content available progressively and to use two-stage retrieval that first identifies relevant sections before diving deep.

Progressive Ingest Pipeline

Documents are processed page-by-page with immediate indexing:

1. Streaming Parse: Each page is parsed independently (100-200ms per page)

2. Immediate Indexing: Each page becomes queryable as soon as it's processed

3. Progress Feedback: UI shows "5/30 pages ready - you can start asking questions"

4. Background Map Generation: Parallel process builds document structure map (500-1500 tokens)
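The loop below sketches the progressive path; the page list, the per-page indexing hook, and the progress callback are illustrative stand-ins for the parser, the session BM25 index, and the UI update.

```python
def progressive_ingest(pages: list[str], index_page, on_progress) -> None:
    """Index a large document page by page so it becomes queryable early.

    `pages` is the list of raw page texts, `index_page` adds one parsed page
    to the session's BM25 index, and `on_progress` drives UI feedback such as
    "5/30 pages ready". All three are illustrative hooks, not a fixed API."""
    total = len(pages)
    for page_no, page_text in enumerate(pages, start=1):
        index_page(page_no, page_text)   # page is queryable from this moment
        on_progress(page_no, total)      # e.g. "{page_no}/{total} pages ready"
    # A parallel background task would build the document map (TOC, section
    # summaries, key entities) used by hierarchical retrieval; omitted here.
```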

Hierarchical Retrieval

Large document queries use two-stage retrieval:

Stage 1: Section Selection (200-400ms)

The document map (table of contents, section summaries, key entities) is presented to the LLM with the user's query. The LLM identifies which sections are most likely to contain relevant information.

Stage 2: Deep Retrieval (100-300ms)

BM25 search is restricted to the selected sections, the standard confidence check and lazy reranking are applied, and the grounding context is then formatted from the results.
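The two stages can be sketched as a single function; the document map format, the LLM-based section selector, and the section-to-chunk mapping are assumptions for illustration.

```python
def hierarchical_retrieve(query: str, doc_map: str,
                          sections: dict[str, list[str]],
                          llm_select, bm25_search) -> list[str]:
    """Two-stage retrieval for large documents.

    Stage 1: the compact document map (TOC + section summaries) and the query
    go to the LLM, which names the most relevant sections.
    Stage 2: BM25 runs only over chunks in those sections, followed by the
    usual confidence check and lazy reranking.
    `llm_select` and `bm25_search` are illustrative hooks."""
    selected_ids = llm_select(query, doc_map)            # e.g. ["3.2", "5.1"]
    candidate_chunks = [
        chunk for sid in selected_ids for chunk in sections.get(sid, [])
    ]
    return bm25_search(query, candidate_chunks)
```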

Partial Availability Handling

When users query before processing completes, the system can:

• Answer from available content, with a disclaimer such as "Based on the first 12 pages..."

• Smart prediction: if the table of contents indicates the relevant section lies in unprocessed pages, notify the user

• Wait and notify: offer to wait for complete processing or proceed with partial results
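A small sketch of the decision logic when a query arrives mid-ingest; the TOC-based page prediction is assumed to come from an upstream step, and the returned mode strings are illustrative.

```python
def answer_mode(processed_pages: int, total_pages: int,
                predicted_pages: list[int] | None) -> str:
    """Decide how to respond while ingest is still running.

    `predicted_pages` lists pages the TOC suggests are relevant (produced by a
    hypothetical upstream step); None means no prediction is available."""
    if processed_pages >= total_pages:
        return "answer_full"
    if predicted_pages and all(p > processed_pages for p in predicted_pages):
        # The relevant section has not been processed yet: offer to wait.
        return "offer_wait"
    # Otherwise answer from available content, with an explicit disclaimer
    # such as "Based on the first {processed_pages} pages...".
    return "answer_partial_with_disclaimer"
```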


Competitive Benchmark Analysis

This section compares the YellowMind Hybrid Strategy against the document grounding approaches used by Microsoft Copilot, ChatGPT (OpenAI), and Claude (Anthropic).

Competitor Strategy Overview

Microsoft Copilot

Copilot leverages enterprise-grade RAG with a pre-built Semantic Index over Microsoft 365 content. Documents in SharePoint/OneDrive are continuously indexed in the background, enabling zero-latency retrieval for existing content. However, newly uploaded documents require processing time similar to traditional RAG.

ChatGPT (OpenAI)

ChatGPT uses hybrid context injection combined with Code Interpreter for document processing. Small files are injected directly into context, while larger files are processed through Code Interpreter for extraction. The 128K token context window handles most documents without chunking.

Claude (Anthropic)

Claude relies on industry-leading context windows (200K tokens) with direct injection preferred over retrieval. PDF and document parsing is built into the model, and extended thinking enables complex document analysis. This context-first approach eliminates chunking artifacts for most documents.

Ingest Latency Comparison

| System            | < 5 pages | 5-20 pages | 20-50 pages | 50+ pages   |
|-------------------|-----------|------------|-------------|-------------|
| YellowMind Hybrid | ~200ms    | ~500ms     | ~1s partial | Progressive |
| Microsoft Copilot | ~0ms*     | ~0ms*      | ~0ms*       | ~0ms*       |
| ChatGPT           | ~300ms    | ~1-2s      | ~3-5s       | May timeout |
| Claude            | ~200ms    | ~500ms     | ~1-2s       | ~2-5s       |

Table 4: Ingest latency comparison (* Copilot assumes pre-indexed M365 content)

Query Latency Comparison

| System            | Simple Factual | Semantic/Conceptual | Cross-Section |
|-------------------|----------------|---------------------|---------------|
| YellowMind Hybrid | ~50ms          | ~500ms              | ~300ms        |
| Microsoft Copilot | ~200ms         | ~300ms              | ~400ms        |
| ChatGPT           | ~500ms         | ~800ms              | ~1000ms       |
| Claude            | ~400ms         | ~600ms              | ~800ms        |

Table 5: Query latency by query complexity


Retrieval Quality Assessment

| System     | Keyword | Semantic | Multi-hop | Citations |
|------------|---------|----------|-----------|-----------|
| YellowMind | ★★★★★   | ★★★★☆    | ★★★☆☆     | ★★★★★     |
| Copilot    | ★★★★☆   | ★★★★★    | ★★★★☆     | ★★★★☆     |
| ChatGPT    | ★★★☆☆   | ★★★★☆    | ★★★☆☆     | ★★★☆☆     |
| Claude     | ★★★★☆   | ★★★★★    | ★★★★★     | ★★★★☆     |

Table 6: Retrieval quality assessment by capability

Cost Efficiency Analysis

| System     | Token Cost | Storage Cost | Compute Efficiency |
|------------|------------|--------------|--------------------|
| YellowMind | Low        | Low          | High               |
| Copilot    | Medium     | High         | Medium             |
| ChatGPT    | High       | None         | Low                |
| Claude     | High       | None         | Low                |

Table 7: Cost efficiency comparison


Strategic Positioning

The YellowMind Hybrid Strategy occupies a unique position in the document grounding landscape:

Figure 3: Competitive positioning matrix

Key Differentiators

| Advantage             | vs. Copilot      | vs. ChatGPT     | vs. Claude       |
|-----------------------|------------------|-----------------|------------------|
| No ecosystem lock-in  | ✓ Better         | Similar         | Similar          |
| Instant small-doc UX  | ✓ Faster uploads | ✓ Much faster   | ✓ Faster         |
| Adaptive processing   | ✓ No pre-index   | ✓ Smarter scale | ✓ More efficient |
| Cost efficiency       | Similar          | ✓ Lower tokens  | ✓ Lower tokens   |
| Multi-doc native      | Similar          | ✓ Better        | ✓ Better         |
| Precise citations     | Similar          | ✓ Much better   | ✓ Better         |
| Sovereign/on-prem     | ✓ Available      | N/A             | N/A              |

Table 8: Key differentiators by competitor


Deployment Recommendations

Infrastructure Requirements

The hybrid strategy is designed to run efficiently on modest infrastructure:

• Compute: 2-4 vCPUs per instance, scales horizontally

• Memory: 4-8 GB RAM (BM25 indexes are memory-efficient)

• Storage: Session-scoped (ephemeral), no persistent vector store required

• Optional: Redis for distributed session cache in multi-instance deployments
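For the multi-instance case, the session cache can be backed by Redis with a short TTL so storage stays ephemeral. The sketch below uses redis-py; the key naming and one-hour TTL are assumptions.

```python
# pip install redis
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 60 * 60   # sessions are ephemeral; 1 hour is an assumption

def cache_document(session_id: str, doc_id: str, markdown: str) -> None:
    # Session-scoped key with a TTL: no persistent document store required.
    r.setex(f"session:{session_id}:doc:{doc_id}", SESSION_TTL_SECONDS, markdown)

def load_document(session_id: str, doc_id: str) -> str | None:
    return r.get(f"session:{session_id}:doc:{doc_id}")
```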

Integration Path

1. Phase 1 - Core Module: Deploy instant grounding service alongside existing chat infrastructure

2. Phase 2 - File Type Expansion: Add parsers for additional formats (PPTX, email, images with OCR)

3. Phase 3 - Semantic Enhancement: Integrate embedding model for lazy reranking

4. Phase 4 - Enterprise Connectors: Add SharePoint, Google Drive, Confluence integrations

Success Metrics

Track these KPIs to measure hybrid strategy effectiveness:

• P95 Ingest Latency: Target < 500ms for documents under 20 pages

• P95 Query Latency: Target < 100ms for BM25-only queries

• Lazy Embed Rate: Target < 20% of queries require embedding computation

• Retrieval Precision: Target > 85% relevant chunks in top-5 results

• User Satisfaction: Target > 4.0/5.0 on document Q&A experience
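These KPIs can be computed directly from per-request logs. The sketch below assumes each request record carries its latency and a flag for whether lazy embedding was triggered; the record layout is illustrative.

```python
import numpy as np

def kpi_report(ingest_ms: list[float], query_ms: list[float],
               used_embedding: list[bool]) -> dict:
    """Headline KPIs from raw per-request measurements (layout is illustrative)."""
    return {
        "p95_ingest_ms": float(np.percentile(ingest_ms, 95)),   # target < 500 for docs under 20 pages
        "p95_query_ms": float(np.percentile(query_ms, 95)),     # target < 100 for BM25-only queries
        "lazy_embed_rate": sum(used_embedding) / max(1, len(used_embedding)),  # target < 0.20
    }
```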

Conclusion

The YellowMind Hybrid Strategy represents a paradigm shift in document grounding architecture. By recognizing that document size and query patterns should drive processing decisions, the strategy achieves:

• Copilot-like speed for the 90%+ of documents under 20 pages

• ChatGPT-like simplicity without ecosystem requirements

• Claude-like quality through intelligent retrieval

• Unique advantages in multi-document scenarios and precise citations

The lazy embedding approach is particularly powerful: it provides the cost efficiency of lexical search for most queries while preserving the option for semantic understanding when needed—without the upfront compute investment that makes traditional RAG slow.

For organizations seeking to deliver consumer-grade document AI experiences without the infrastructure complexity of enterprise search platforms, the hybrid strategy offers a compelling path forward.

─────────────────────────────────────────────

For technical discussions and partnership inquiries:

www.yellowmind.ai