A Hybrid Strategy for Instant Document Grounding
Technical Whitepaper v1.2
November 2024
Prepared for:
AI Architects, CTOs, and Technical Decision Makers
YellowMind.ai by Yellowsys
AI That Manages What Matters
Executive Summary
This whitepaper presents a novel hybrid strategy for grounding LLM responses in uploaded documents that requires neither heavyweight infrastructure nor persistent indexing. The approach adapts its processing pipeline to document size, achieving sub-second ingest latency for the majority of enterprise documents while maintaining high retrieval quality.
Key Innovations
1. Tiered Processing Architecture: Documents are automatically routed to optimal processing pipelines based on size and structure characteristics.
2. BM25-First Retrieval: Lexical search handles 80%+ of queries instantly; embeddings computed only when needed.
3. Lazy Embedding Strategy: Eliminates the embedding bottleneck that plagues traditional RAG implementations.
4. Progressive Ingest: Large documents become queryable page-by-page, enabling interaction before full processing completes.
Performance Targets
| Metric | Target | Approach |
|---|---|---|
| Ingest (< 5 pages) | < 200ms | Direct context injection |
| Ingest (5-20 pages) | < 500ms | BM25 indexing only |
| Ingest (> 20 pages) | < 1s partial | Progressive streaming |
| Query (keyword-rich) | < 50ms | BM25 direct |
| Query (semantic) | < 500ms | Lazy embed + rerank |

Table 1: Performance targets by document size and query type
The Document Grounding Challenge
Traditional RAG Limitations
Traditional Retrieval-Augmented Generation (RAG) pipelines follow a sequential process that creates significant latency bottlenecks:
Upload → Parse → Chunk → Embed → Index → Query → Retrieve → Generate
The embedding step is particularly problematic. Even with optimized models, embedding a 20-page document (approximately 100-150 chunks) requires 3-10 seconds of processing time. This latency breaks the instant upload experience that users have come to expect from consumer AI products like ChatGPT and Microsoft Copilot.
User Experience Gap
Modern users expect immediate responsiveness. When uploading a document to ask questions, they anticipate:
• Instant acknowledgment (< 200ms)
• Ability to ask questions within 1-2 seconds
• No visible loading states or progress bars for small documents
• Graceful handling of larger documents with clear progress indication
Traditional RAG fails to meet these expectations, forcing users to wait through visible processing states that erode confidence in the system.
The Size-Complexity Mismatch
A critical insight is that most document interactions involve relatively small files. Analysis of enterprise document workflows reveals:
• 60-70% of documents are under 5 pages (emails, memos, invoices)
• 25-30% are 5-20 pages (contracts, reports, proposals)
• < 10% exceed 20 pages (manuals, specifications, legal documents)
Yet traditional RAG applies the same heavyweight processing to a 2-page invoice as to a 200-page technical manual. This one-size-fits-all approach is fundamentally inefficient.
Hybrid Instant Grounding Strategy
The hybrid strategy addresses the size-complexity mismatch by implementing three distinct processing tiers, each optimized for a specific document size range.
Tier Classification
| Tier | Document Size | Token Range | Processing Strategy |
|---|---|---|---|
| Tier 1 | < 5 pages | 2K - 10K | Direct Context Injection |
| Tier 2 | 5 - 20 pages | 10K - 50K | BM25-First + Lazy Embedding |
| Tier 3 | > 20 pages | 50K+ | Progressive + Hierarchical |

Table 2: Tier classification by document characteristics
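As an illustration, a minimal routing sketch is shown below. The token thresholds mirror Table 2, while the characters-per-token heuristic and function names are assumptions made for the example rather than part of the production implementation.

```python
# Minimal tier-routing sketch. Thresholds follow Table 2; the
# chars-per-token heuristic and names are illustrative assumptions.

TIER1_MAX_TOKENS = 10_000   # < 5 pages: direct context injection
TIER2_MAX_TOKENS = 50_000   # 5-20 pages: BM25-first + lazy embedding

def estimate_tokens(text: str) -> int:
    """Cheap token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def classify_tier(text: str) -> int:
    """Route a parsed document to a processing tier by estimated size."""
    tokens = estimate_tokens(text)
    if tokens <= TIER1_MAX_TOKENS:
        return 1   # inject full text into the prompt
    if tokens <= TIER2_MAX_TOKENS:
        return 2   # chunk + BM25 index, embed lazily
    return 3       # progressive ingest + hierarchical retrieval

if __name__ == "__main__":
    sample = "Invoice #1234 ... " * 50
    print(classify_tier(sample))  # -> 1 for a small document
```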
Architecture Diagram
The following diagram illustrates the decision flow and processing pipeline:
Figure 1: Tier-based processing architecture
Tier 1: Direct Context Injection
Rationale
For small documents, the overhead of chunking, indexing, and retrieval actually hurts performance and quality. Modern LLMs with 128K+ context windows can easily accommodate a 5-page document (~10K tokens) with room to spare. The LLM's native attention mechanism becomes the retrieval system.
Processing Pipeline
1. Document Reception (0-50ms): Validate file type and size, generate session-scoped document ID, acknowledge upload immediately.
2. Fast Parse (50-150ms): Extract text preserving basic structure, convert to clean markdown format, preserve tables as markdown tables.
3. Metadata Extraction (20-50ms): Infer document title, detect language, identify document type heuristically, extract obvious entities.
4. Cache for Session (10ms): Store full markdown text in session memory. No chunking, no indexing, no embedding.
Query Handling
Queries are handled by prepending the full document to the system prompt. The LLM attends to relevant portions naturally through its attention mechanism.
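The sketch below illustrates this approach: the cached markdown is placed ahead of the question and no retrieval step runs. The prompt template and function name are illustrative assumptions, not the product's exact prompt.

```python
# Illustrative Tier 1 query handling: no retrieval step, the full
# cached markdown is prepended to the system prompt. Names and the
# prompt template are assumptions for this sketch.

def build_tier1_prompt(document_markdown: str, question: str) -> list[dict]:
    """Return chat messages that ground the answer in the full document."""
    system = (
        "You are answering questions about the document below. "
        "Use only the document content; say so if the answer is not present.\n\n"
        "--- DOCUMENT START ---\n"
        f"{document_markdown}\n"
        "--- DOCUMENT END ---"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```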
Advantages
• Zero retrieval latency: No search step between query and generation
• Perfect recall: LLM sees everything, nothing is missed due to retrieval failures
• Context coherence: No chunking artifacts, relationships preserved
• Simplicity: Minimal moving parts, fewer failure modes
Limitations
• Token cost: Every query pays for full document in input tokens
• Context budget: Leaves less room for conversation history
• Scale ceiling: Not viable beyond ~12K tokens
Tier Promotion Triggers
Documents are automatically promoted to Tier 2 processing when:
• Document exceeds 12K tokens (~6 pages)
• Multiple documents uploaded in same session
• User explicitly requests precise citation/sourcing
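A minimal sketch of the promotion check, assuming a session that tracks document count and the user's citation preference (field names are illustrative):

```python
# Sketch of the Tier 1 -> Tier 2 promotion check. The 12K-token ceiling
# follows the triggers above; parameter names are illustrative assumptions.

PROMOTION_TOKEN_LIMIT = 12_000

def should_promote_to_tier2(doc_tokens: int,
                            docs_in_session: int,
                            user_wants_citations: bool) -> bool:
    return (
        doc_tokens > PROMOTION_TOKEN_LIMIT
        or docs_in_session > 1
        or user_wants_citations
    )
```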
Tier 2: BM25-First with Lazy Embedding
Rationale
Mid-size documents are too large for comfortable context injection but small enough that full embedding is overkill. BM25 lexical search handles 80%+ of queries effectively. Embeddings are computed only when lexical matching fails to achieve sufficient confidence.
Processing Pipeline
1. Structure-Aware Parsing (100-300ms): Parse with layout detection (headings, paragraphs, lists, tables), identify document hierarchy, preserve page boundaries.
2. Semantic Chunking (50-150ms): Chunk by semantic boundaries (not fixed token counts), respect heading hierarchy, target 256-512 tokens with flexibility.
3. Enrichment (50-100ms): Extract keywords per chunk (TF-IDF style), identify entities using pattern matching, build parent context breadcrumbs.
4. BM25 Index Construction (20-50ms): Build inverted index over normalized chunk text, include keywords with boosted weight, store document frequency statistics.
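As a sketch of step 4, a minimal index could be built with the open-source rank_bm25 package; the tokenizer is simplified and keyword boosting is omitted, so this is an illustration rather than the production indexer.

```python
# Minimal BM25 index over pre-chunked text, sketched with the open-source
# rank_bm25 package (pip install rank-bm25). The tokenizer is a simplified
# assumption; keyword boosting and document-frequency stats are omitted.
import re
from rank_bm25 import BM25Okapi

def normalize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_bm25_index(chunks: list[str]) -> BM25Okapi:
    """Build an inverted index over normalized chunk text."""
    return BM25Okapi([normalize(c) for c in chunks])

def bm25_search(index: BM25Okapi, chunks: list[str], query: str, k: int = 5):
    """Return the top-k chunks with their BM25 scores."""
    scores = index.get_scores(normalize(query))
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], float(scores[i])) for i in ranked[:k]]
```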
Query Pipeline
Figure 2: Tier 2 query pipeline with confidence-based routing
Confidence Threshold Calibration
| Query Type | Threshold | Rationale |
|---|---|---|
| Factual ("what is X") | 0.75 | Needs precise match |
| Definition ("define X") | 0.70 | Usually keyword-rich |
| Summary ("summarize") | 0.50 | Broad queries, multiple chunks acceptable |
| Comparison ("X vs Y") | 0.65 | Needs both comparison terms |
| Procedural ("how to") | 0.60 | Often has clear action keywords |

Table 3: Confidence thresholds by query type
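A minimal lookup of these thresholds might look as follows; the keyword-based query classifier is a simplified assumption standing in for the real one.

```python
# Illustrative mapping from query type to BM25 confidence threshold
# (values from Table 3). The keyword-based classifier is a simplified
# assumption, not the production query classifier.
THRESHOLDS = {
    "factual": 0.75,
    "definition": 0.70,
    "summary": 0.50,
    "comparison": 0.65,
    "procedural": 0.60,
}

def classify_query(query: str) -> str:
    q = query.lower()
    if q.startswith("define") or "definition of" in q:
        return "definition"
    if "summarize" in q or "summary" in q:
        return "summary"
    if " vs " in q or "compare" in q:
        return "comparison"
    if q.startswith("how to") or q.startswith("how do"):
        return "procedural"
    return "factual"

def confidence_threshold(query: str) -> float:
    return THRESHOLDS[classify_query(query)]
```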
Lazy Embedding Cache
When embeddings are computed for reranking, they are cached in session memory. Subsequent queries benefit from cached embeddings, and over multiple queries, popular chunks become "warm." This approach provides organic embedding coverage growth without upfront compute investment.
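The sketch below illustrates the lazy pattern: only chunks that reach reranking are embedded, and their vectors are cached per session. The embed_fn callable and the in-memory cache are assumptions; no specific embedding model is implied.

```python
# Lazy embedding sketch: embeddings are computed only for chunks that
# reach reranking, then cached for the session. embed_fn is any
# text -> vector callable (an assumption; no specific model implied).
import math

class LazyEmbeddingCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache: dict[int, list[float]] = {}   # chunk index -> vector

    def get(self, chunk_id: int, text: str) -> list[float]:
        if chunk_id not in self._cache:
            self._cache[chunk_id] = self.embed_fn(text)   # "warm" the chunk
        return self._cache[chunk_id]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(cache: LazyEmbeddingCache, query: str,
           candidates: list[tuple[int, str]]) -> list[tuple[int, str, float]]:
    """Embed the query plus only the candidate chunks, then sort by similarity."""
    q_vec = cache.embed_fn(query)
    scored = [(cid, text, cosine(q_vec, cache.get(cid, text)))
              for cid, text in candidates]
    return sorted(scored, key=lambda t: t[2], reverse=True)
```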
Tier 3: Progressive Ingest + Hierarchical Retrieval
Large documents cannot be processed synchronously without breaking the user experience. The solution is to make content available progressively and to use two-stage retrieval that identifies relevant sections before retrieving within them.
Progressive Ingest Pipeline
Documents are processed page-by-page with immediate indexing:
1. Streaming Parse: Each page is parsed independently (100-200ms per page)
2. Immediate Indexing: Each page becomes queryable as soon as it's processed
3. Progress Feedback: UI shows "5/30 pages ready - you can start asking questions"
4. Background Map Generation: Parallel process builds document structure map (500-1500 tokens)
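A sketch of the streaming loop is shown below, with parse_page and the index interface abstracted as placeholders (both assumptions):

```python
# Progressive ingest sketch: each page becomes searchable as soon as it
# is parsed. parse_page() and index_chunks() are placeholders (assumptions)
# standing in for the real parser and BM25 index.
from typing import Callable, Iterable, Iterator

def progressive_ingest(pages: Iterable[bytes],
                       parse_page: Callable[[bytes], list[str]],
                       index_chunks: Callable[[int, list[str]], None]) -> Iterator[int]:
    """Parse and index one page at a time, yielding the count of ready pages."""
    for page_number, raw_page in enumerate(pages, start=1):
        chunks = parse_page(raw_page)          # ~100-200ms per page
        index_chunks(page_number, chunks)      # page is queryable from here on
        yield page_number                      # drives "5/30 pages ready" UI

# Usage sketch:
# for ready_pages in progressive_ingest(pdf_pages, parse_page, bm25.add_page):
#     notify_ui(f"{ready_pages}/{total_pages} pages ready")
```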
Hierarchical Retrieval
Large document queries use two-stage retrieval:
Stage 1: Section Selection (200-400ms)
The document map (table of contents, section summaries, key entities) is presented to the LLM with the user's query. The LLM identifies which sections are most likely to contain relevant information.
Stage 2: Deep Retrieval (100-300ms)
BM25 search is filtered to only the selected sections, standard confidence check and lazy reranking applied, then grounding context is formatted from results.
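A hedged sketch of the two-stage flow, with the map-based LLM call and the section-filtered BM25 search abstracted behind callables (both assumptions):

```python
# Two-stage hierarchical retrieval sketch. select_sections_llm and
# bm25_search_in_sections are placeholders for the map-based LLM call
# and the section-filtered lexical search described above.
from typing import Callable

def hierarchical_retrieve(query: str,
                          document_map: str,
                          select_sections_llm: Callable[[str, str], list[str]],
                          bm25_search_in_sections: Callable[[str, list[str]], list[str]],
                          ) -> list[str]:
    # Stage 1: ask the LLM which sections of the map are likely relevant.
    sections = select_sections_llm(document_map, query)
    # Stage 2: run BM25 only over chunks belonging to those sections.
    return bm25_search_in_sections(query, sections)
```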
Partial Availability Handling
When users query before processing completes, the system can:
• Answer from available content with disclaimer: "Based on the first 12 pages..."
• Smart prediction: If TOC shows relevant section is in unprocessed pages, notify user
• Wait and notify: Offer to wait for complete processing or proceed with partial results
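A small sketch of how such a disclaimer might be chosen, assuming the document map exposes the start page of the most relevant section (parameter names are illustrative):

```python
# Sketch of the partial-availability check: answer from what is indexed,
# but warn if the table of contents suggests the relevant section has not
# been processed yet. Parameter names are assumptions.
def partial_availability_notice(pages_ready: int, total_pages: int,
                                relevant_section_start_page: int | None) -> str | None:
    if pages_ready >= total_pages:
        return None  # fully processed, no disclaimer needed
    if relevant_section_start_page and relevant_section_start_page > pages_ready:
        return (f"The section that likely answers this starts on page "
                f"{relevant_section_start_page}, which is still being processed "
                f"({pages_ready}/{total_pages} pages ready).")
    return f"Based on the first {pages_ready} of {total_pages} pages..."
```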
Competitive Benchmark Analysis
This section compares the YellowMind Hybrid Strategy against the document grounding approaches used by Microsoft Copilot, ChatGPT (OpenAI), and Claude (Anthropic).
Competitor Strategy Overview
Microsoft Copilot
Copilot leverages enterprise-grade RAG with a pre-built Semantic Index over Microsoft 365 content. Documents in SharePoint/OneDrive are continuously indexed in the background, enabling zero-latency retrieval for existing content. However, newly uploaded documents require processing time similar to traditional RAG.
ChatGPT (OpenAI)
ChatGPT uses hybrid context injection combined with Code Interpreter for document processing. Small files are injected directly into context, while larger files are processed through Code Interpreter for extraction. The 128K token context window handles most documents without chunking.
Claude (Anthropic)
Claude relies on industry-leading context windows (200K tokens) with direct injection preferred over retrieval. PDF and document parsing is built into the model, and extended thinking enables complex document analysis. This context-first approach eliminates chunking artifacts for most documents.
Ingest Latency Comparison
| System | < 5 pages | 5-20 pages | 20-50 pages | 50+ pages |
|---|---|---|---|---|
| YellowMind Hybrid | ~200ms | ~500ms | ~1s partial | Progressive |
| Microsoft Copilot | ~0ms* | ~0ms* | ~0ms* | ~0ms* |
| ChatGPT | ~300ms | ~1-2s | ~3-5s | May timeout |
| Claude | ~200ms | ~500ms | ~1-2s | ~2-5s |

Table 4: Ingest latency comparison (* Copilot assumes pre-indexed M365 content)
Query Latency Comparison
| System | Simple Factual | Semantic/Conceptual | Cross-Section |
|---|---|---|---|
| YellowMind Hybrid | ~50ms | ~500ms | ~300ms |
| Microsoft Copilot | ~200ms | ~300ms | ~400ms |
| ChatGPT | ~500ms | ~800ms | ~1000ms |
| Claude | ~400ms | ~600ms | ~800ms |

Table 5: Query latency by query complexity
Retrieval Quality Assessment
| System | Keyword | Semantic | Multi-hop | Citations |
|---|---|---|---|---|
| YellowMind | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★★ |
| Copilot | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★☆ |
| ChatGPT | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ |
| Claude | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★☆ |

Table 6: Retrieval quality assessment by capability
Cost Efficiency Analysis
| System | Token Cost | Storage Cost | Compute Efficiency |
|---|---|---|---|
| YellowMind | Low | Low | High |
| Copilot | Medium | High | Medium |
| ChatGPT | High | None | Low |
| Claude | High | None | Low |

Table 7: Cost efficiency comparison
Strategic Positioning
The YellowMind Hybrid Strategy occupies a unique position in the document grounding landscape:
Figure 3: Competitive positioning matrix
Key Differentiators
| Advantage | vs. Copilot | vs. ChatGPT | vs. Claude |
|---|---|---|---|
| No ecosystem lock-in | ✓ Better | Similar | Similar |
| Instant small-doc UX | ✓ Faster uploads | ✓ Much faster | ✓ Faster |
| Adaptive processing | ✓ No pre-index | ✓ Smarter scale | ✓ More efficient |
| Cost efficiency | Similar | ✓ Lower tokens | ✓ Lower tokens |
| Multi-doc native | Similar | ✓ Better | ✓ Better |
| Precise citations | Similar | ✓ Much better | ✓ Better |
| Sovereign/on-prem | ✓ Available | N/A | N/A |

Table 8: Key differentiators by competitor
Deployment Recommendations
Infrastructure Requirements
The hybrid strategy is designed to run efficiently on modest infrastructure:
• Compute: 2-4 vCPUs per instance, scales horizontally
• Memory: 4-8 GB RAM (BM25 indexes are memory-efficient)
• Storage: Session-scoped (ephemeral), no persistent vector store required
• Optional: Redis for distributed session cache in multi-instance deployments
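For the optional Redis path, a minimal session-cache sketch using the redis-py client is shown below; the key layout, TTL, and JSON serialization are deployment assumptions.

```python
# Optional distributed session cache sketch using redis-py (pip install redis).
# Key layout, TTL, and serialization format are deployment assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 60 * 60  # sessions are ephemeral: expire after 1 hour

def cache_document(session_id: str, doc_id: str, markdown: str, chunks: list[str]):
    key = f"session:{session_id}:doc:{doc_id}"
    r.set(key, json.dumps({"markdown": markdown, "chunks": chunks}),
          ex=SESSION_TTL_SECONDS)

def load_document(session_id: str, doc_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}:doc:{doc_id}")
    return json.loads(raw) if raw else None
```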
Integration Path
1. Phase 1 - Core Module: Deploy instant grounding service alongside existing chat infrastructure
2. Phase 2 - File Type Expansion: Add parsers for additional formats (PPTX, email, images with OCR)
3. Phase 3 - Semantic Enhancement: Integrate embedding model for lazy reranking
4. Phase 4 - Enterprise Connectors: Add SharePoint, Google Drive, Confluence integrations
Success Metrics
Track these KPIs to measure hybrid strategy effectiveness:
• P95 Ingest Latency: Target < 500ms for documents under 20 pages
• P95 Query Latency: Target < 100ms for BM25-only queries
• Lazy Embed Rate: Target < 20% of queries require embedding computation
• Retrieval Precision: Target > 85% relevant chunks in top-5 results
• User Satisfaction: Target > 4.0/5.0 on document Q&A experience
Conclusion
The YellowMind Hybrid Strategy represents a paradigm shift in document grounding architecture. By recognizing that document size and query patterns should drive processing decisions, the strategy achieves:
• Copilot-like speed for the 90%+ of documents under 20 pages
• ChatGPT-like simplicity without ecosystem requirements
• Claude-like quality through intelligent retrieval
• Unique advantages in multi-document scenarios and precise citations
The lazy embedding approach is particularly powerful: it provides the cost efficiency of lexical search for most queries while preserving the option for semantic understanding when needed—without the upfront compute investment that makes traditional RAG slow.
For organizations seeking to deliver consumer-grade document AI experiences without the infrastructure complexity of enterprise search platforms, the hybrid strategy offers a compelling path forward.
─────────────────────────────────────────────
For technical discussions and partnership inquiries:
