How to Build a Fully Searchable AI Knowledge Base with OpenKB, OpenRouter, and Llama



import textwrap

DOCS = {
   "transformer_architecture.md": textwrap.dedent("""\
       # Transformer Architecture


       ## Overview
       The Transformer is a deep learning architecture introduced in "Attention Is All
       You Need" (Vaswani et al., 2017). It replaced recurrent networks with a
       self-attention mechanism, enabling parallel training and better long-range
       dependency modelling.


       ## Key Components
       - **Multi-Head Self-Attention**: Computes attention in h parallel heads, each
         with its own learned Q/K/V projections, then concatenates and projects
         (see the sketch after this list).
       - **Feed-Forward Network (FFN)**: Two linear layers with a ReLU activation,
         applied position-wise.
       - **Positional Encoding**: Sinusoidal or learned embeddings that inject
         sequence-order information, since attention is permutation-invariant.
       - **Layer Normalisation**: Applied before (Pre-LN) or after (Post-LN) each
         sub-layer, stabilising gradients.
       - **Residual Connections**: Added around each sub-layer to ease gradient flow.
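
       A minimal NumPy sketch of the scaled dot-product attention computed inside
       each head (an illustrative sketch, not the paper's code; the function name
       is ours):

       ```python
       import numpy as np

       def scaled_dot_product_attention(Q, K, V):
           # Q, K: (seq, d_k); V: (seq, d_v)
           d_k = Q.shape[-1]
           scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq)
           weights = np.exp(scores - scores.max(-1, keepdims=True))
           weights /= weights.sum(-1, keepdims=True)         # row-wise softmax
           return weights @ V
       ```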


       ## Encoder vs Decoder
       The encoder stack processes input tokens bidirectionally (e.g. BERT).
       The decoder stack uses causal (masked) attention over previous outputs;
       encoder-decoder models such as T5 add cross-attention over encoder outputs,
       while decoder-only models such as GPT omit it.


       ## Scaling Laws
       Kaplan et al. (2020) showed that model loss decreases predictably as a power
       law with compute, data, and parameter count. This motivated GPT-3 (175B) and
       subsequent large language models.
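       These fits take the form L(X) ≈ (X_c / X)^α for X in {parameters, data,
       compute}, with small positive exponents, so each doubling of scale buys a
       predictable but diminishing reduction in loss.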


       ## Limitations
       - Quadratic complexity in sequence length: O(n^2)
       - No inherent recurrence -> long-context challenges
       - High memory footprint during training


       ## References
       Vaswani et al. (2017). Attention Is All You Need. NeurIPS.
       Kaplan et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
   """),


   "rag_systems.md": textwrap.dedent("""\
       # Retrieval-Augmented Generation (RAG)


       ## Definition
       RAG augments a generative LLM with a retrieval step: given a query, relevant
       documents are fetched from a corpus and prepended to the prompt, giving the
       model grounded context beyond its training data.


       ## Architecture
       1. **Indexing Phase** — Documents are chunked, embedded via a bi-encoder
          (e.g. text-embedding-3-large), and stored in a vector database (e.g.
          Faiss, Pinecone, Weaviate).
       2. **Retrieval Phase** — The user query is embedded; approximate nearest-
          neighbour (ANN) search returns the top-k chunks.
       3. **Generation Phase** — Retrieved chunks + query are passed to the LLM,
          which synthesises a final answer (see the sketch after this list).
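
       A compact sketch of the three phases, assuming hypothetical embed() and
       llm() callables (embed returning unit-norm vectors); the names and
       signatures are illustrative, not a fixed API:

       ```python
       import numpy as np

       def build_index(chunks, embed):
           # stack unit-norm chunk embeddings into an (n_chunks, dim) matrix
           return np.stack([embed(c) for c in chunks])

       def retrieve(query, chunks, index, embed, k=3):
           sims = index @ embed(query)        # cosine similarity for unit vectors
           return [chunks[i] for i in np.argsort(-sims)[:k]]

       def answer(query, chunks, index, embed, llm):
           context = " --- ".join(retrieve(query, chunks, index, embed))
           return llm(f"Context: {context} Question: {query}")
       ```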


       ## Variants
       - **Dense Retrieval**: DPR, Contriever — queries and docs in the same space.
       - **Sparse Retrieval**: BM25 — term frequency-based, no embeddings needed.
       - **Hybrid Retrieval**: Reciprocal Rank Fusion (RRF) combines dense + sparse
         (see the sketch after this list).
       - **Re-ranking**: A cross-encoder re-scores the top-k before the LLM sees them.
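
       A minimal sketch of RRF over ranked result lists (k = 60 is the constant
       from Cormack et al., 2009; the function name is ours):

       ```python
       def reciprocal_rank_fusion(rankings, k=60):
           # rankings: one ranked list of doc IDs per retriever (dense, sparse, ...)
           scores = {}
           for ranking in rankings:
               for rank, doc_id in enumerate(ranking, start=1):
                   scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
           return sorted(scores, key=scores.get, reverse=True)
       ```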


       ## Challenges
       - Context window limits: long retrieved passages may not fit.
       - Retrieval quality is a hard ceiling on generation quality.
       - Chunking strategy significantly affects recall.
       - Multi-hop questions require iterative retrieval (IRCoT, ReAct).


       ## Relationship to Transformers
       RAG systems rely on transformer-based encoders for embedding and decoder
       models for generation. The quality of the embedding model directly determines
       retrieval precision and recall.


       ## References
       Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive
       NLP Tasks. NeurIPS.
       Gao et al. (2023). Retrieval-Augmented Generation for Large Language Models:
       A Survey. arXiv:2312.10997.
   """),


   "knowledge_graph_integration.md": textwrap.dedent("""\
       # Knowledge Graphs and LLM Integration


       ## What is a Knowledge Graph?
       A knowledge graph (KG) is a directed labelled graph of entities (nodes) and
       relations (edges): (subject, predicate, object) triples, e.g.
       (Vaswani, authored, "Attention Is All You Need").


       ## Why Combine KGs with LLMs?
       LLMs hallucinate facts; KGs provide structured, verifiable ground truth.
       KGs are hard to query in natural language; LLMs provide the interface.
       Together they enable faithful, grounded, explainable question answering.


       ## Integration Strategies
       ### KG-Augmented Generation (KGAG)
       Retrieve triples or sub-graphs instead of text chunks, serialise into text,
       then feed to the LLM prompt.
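
       A minimal sketch of the serialisation step (names are illustrative):

       ```python
       def serialise_triples(triples):
           # (subject, predicate, object) -> one short sentence per triple
           return " ".join(f"{s} {p} {o}." for s, p, o in triples)

       context = serialise_triples([
           ("Vaswani", "authored", "Attention Is All You Need"),
       ])
       ```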


       ### LLM-Assisted KG Construction
       LLMs extract (subject, relation, object) triples from unstructured text,
       reducing manual curation effort significantly.
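
       One illustrative recipe: ask the LLM for pipe-delimited triples, then parse
       its output (the prompt wording and parser are assumptions, not a standard
       interface):

       ```python
       EXTRACTION_PROMPT = (
           "Extract (subject, relation, object) triples from the text below. "
           "Return one triple per line as: subject | relation | object. "
           "Text: {text}"
       )

       def parse_triples(llm_output):
           rows = [r for r in llm_output.splitlines() if "|" in r]
           return [tuple(p.strip() for p in r.split("|")) for r in rows]
       ```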


       ### GraphRAG (Microsoft Research, 2024)
       GraphRAG extracts an entity knowledge graph from the corpus, clusters it
       into communities, and generates a summary for each community. At query time,
       a map-reduce pass over these community summaries produces answers that
       outperform flat vector RAG on sensemaking tasks.


       ## Challenges
       - KG construction quality depends on extraction LLM accuracy.
       - Graph databases add infrastructure complexity.
       - Ontology design requires domain expertise.
       - KGs go stale without continuous update pipelines.


       ## Relation to RAG and Transformers
       KG integration addresses two key RAG limitations: lack of structured reasoning
       and inability to follow multi-hop relations.


       ## References
       Pan et al. (2023). Unifying Large Language Models and Knowledge Graphs:
       A Roadmap. IEEE Intelligent Systems.
   """),
}
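
With the corpus in place (here the DOCS dict stands in for pages you might keep in OpenKB), the sketch below wires up a minimal search-and-answer loop: it chunks each document by heading, ranks chunks with a naive term-overlap score, and sends the top matches to a Llama model through OpenRouter's OpenAI-compatible chat endpoint. The chunking scheme, scorer, and model ID are illustrative assumptions rather than the only valid choices; set OPENROUTER_API_KEY in your environment before running.

import os
import re
import requests

def chunk_by_heading(docs):
    # split each markdown document on level-2 headings into (source, text) chunks
    chunks = []
    for name, text in docs.items():
        for part in re.split(r"(?m)^## ", text):
            part = part.strip()
            if part:
                chunks.append((name, part))
    return chunks

def term_overlap(query, text):
    # naive scorer; a production system would use embeddings or BM25 instead
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t)

def retrieve(query, chunks, k=3):
    return sorted(chunks, key=lambda c: term_overlap(query, c[1]), reverse=True)[:k]

def ask(query, docs):
    top = retrieve(query, chunk_by_heading(docs))
    context = "\n\n".join(f"[{name}]\n{text}" for name, text in top)
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            # any Llama model ID available on OpenRouter should work here
            "model": "meta-llama/llama-3.1-8b-instruct",
            "messages": [
                {"role": "system",
                 "content": "Answer using only the provided context."},
                {"role": "user",
                 "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("Why do RAG systems depend on transformer models?", DOCS))

The term-overlap scorer keeps the example dependency-free apart from requests; swapping it for an embedding model and a vector index is the natural upgrade and leaves the rest of the loop unchanged.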