Building a Temporal Graph Reasoning Platform on MongoDB: A Deep Dive 🚀

Are you ready to go beyond basic RAG and build truly intelligent systems? James Melvin from LexisNexis shares his journey of constructing a temporal graph reasoning platform on MongoDB, tackling the complexities of data, time, and AI readiness. This session dives deep into the challenges and innovations behind creating a robust knowledge graph solution.

The Pillars of Knowledge Graphs: Ontology, Graph, and Vectors 🏗️

James emphasizes that building effective knowledge graphs hinges on three non-negotiable components:

  • Ontologies 🧠: These are crucial for semantic governance and normalization. In a world where users express information in myriad ways (think of the nine different ways to say “United States”), ontologies act as the semantic layer, defining meaning, relationships, and hierarchies. They ensure that your system understands the nuances of your domain, like the difference between reporting on “Asia” versus “Southeast Asia.”
  • Knowledge Graphs 🕸️: While great for grounding RAG, knowledge graphs can be brittle if they are just a collection of factual statements. The structure of the graph and the entities it is built on (the underlying nouns) matter as much as the facts themselves.
  • Vectors 💡: Well-understood for semantic similarity, vectors alone struggle with complex relationships and temporal context. Simply counting mentions of “Asia” or “Germany” doesn’t reveal the underlying structure or trends.
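To make the ontology's role concrete, here is a minimal sketch of alias normalization and hierarchy lookup. The alias table, the hierarchy, and the function names are all hypothetical; a production ontology would typically be loaded from RDF rather than hard-coded, and the talk does not describe the team's actual implementation.

```python
# Hypothetical alias table: surface forms mapped to one canonical concept.
ALIASES = {
    "united states": "United States",
    "united states of america": "United States",
    "usa": "United States",
    "u.s.": "United States",
    "us": "United States",
}

# Hypothetical "broader than" links: child concept -> parent concept.
BROADER = {
    "Vietnam": "Southeast Asia",
    "Southeast Asia": "Asia",
    "Germany": "Europe",
}


def normalize(mention: str) -> str:
    """Map a raw mention to its canonical concept, if an alias is known."""
    key = " ".join(mention.lower().split())
    return ALIASES.get(key, mention.strip())


def ancestors(concept: str) -> list[str]:
    """Walk up the hierarchy and return every broader concept, nearest first."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def is_within(concept: str, region: str) -> bool:
    """True if `concept` is `region` itself or sits anywhere beneath it."""
    return concept == region or region in ancestors(concept)
```

With this in place, a report tagged "Vietnam" can be counted toward both "Southeast Asia" and "Asia", while a report tagged "Germany" is counted toward neither.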

The “Synchronization Tax” and the Quest for a Unified Solution 🏦

A major hurdle in traditional approaches is the synchronization tax. When ontologies, knowledge graphs, and vectors reside in separate systems (e.g., Neo4j for graphs and a separate vector store), keeping them in sync becomes a monumental task, and any drift between them can surface as hallucinations in your AI models.

This challenge led James and his team to seek a true hybrid database – one that could natively manage ontologies (understanding formats like RDF), store nodes, relationships, and vectors, all within a single system.

The Problem with Plain RAG: Noise and Hallucinations 📢

James highlights a fundamental limitation of standard RAG:

  • Noise: Using only semantic similarity (like cosine similarity) can introduce a lot of noise. If you query for “Bob’s project,” a simple search might return thousands of irrelevant documents. The goal is to narrow down to relevant documents, which is incredibly difficult without understanding the underlying relationships.
  • Hallucinations: The more data you feed an LLM without proper context and structure, the higher the chance of it generating incorrect information.

Building a Temporal Knowledge Graph: Time is of the Essence ⏳

A critical missing piece in many RAG systems is the understanding of time. James explains:

  • Temporal Reasoning: In domains like finance, today’s share price of $100 and tomorrow’s $110 are both true statements. However, answering questions like “What has happened to this share price over the last 6 months?” requires understanding the trajectory and temporal aspect.
  • LLM Limitations with Dates: LLMs struggle with the concept of dates and temporal context. A mention of “Friday” in a piece of text is ambiguous without knowing which Friday. This leads to a lack of context and potential misinterpretations.

To address this, the team set out to build a temporal knowledge graph where time is a first-class citizen.
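The "which Friday?" problem can be illustrated with a small sketch that anchors a bare weekday to a document's publication date. The rule used here (pick the most recent such day on or before publication) is an assumption chosen for illustration, and `resolve_weekday` is a hypothetical helper, not something described in the talk. Real text would also need tense and phrases like "next Friday" handled.

```python
from datetime import date, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def resolve_weekday(mention: str, published: date) -> date:
    """Resolve a bare weekday name to a concrete date.

    Assumption: the mention refers to the most recent occurrence of that
    weekday on or before the publication date.
    """
    target = WEEKDAYS.index(mention.strip().lower())
    days_back = (published.weekday() - target) % 7
    return published - timedelta(days=days_back)


# An article published on Wednesday 2024-05-15 that mentions "Friday"
# is anchored to the preceding Friday under the assumption above.
resolved = resolve_weekday("Friday", date(2024, 5, 15))
```

Once the mention is resolved to an absolute date at ingestion time, the LLM never has to guess what "Friday" meant.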

From Data Ingestion to AI Readiness: The ETL Process ⚙️

The platform employs a robust ETL process to prepare data for AI:

  • Multiple Sources: Data is ingested from various sources, including external searches, pricing databases, and cloud services.
  • Flow-Driven Transformation: Each data source is treated as a distinct “flow” with its own transformation logic, inspired by approaches like Snowflake but implemented in-house to avoid vendor costs.
  • Semantic Enrichment & Pre-processing: Data undergoes semantic enrichment and pre-processing to make it AI-ready. James defines AI-ready as the ability to explain the data and its structure to an LLM in just a few sentences, avoiding lengthy system prompts.

Hybrid Search: The Best of All Worlds 🔎

The platform leverages a hybrid retrieval strategy, combining multiple search methodologies:

  • Vector Search: For semantic similarity.
  • Graph Traversal: To navigate relationships within the knowledge graph.
  • Community Search (Leiden Algorithm): To identify local and global themes, which is crucial for understanding overarching trends across a large body of text (the example given is “Alice in Wonderland”). This addresses the limitation that an LLM cannot read every document to grasp a theme.
  • MongoDB’s BM25: For traditional keyword-based search.
  • Temporal Filter: To restrict results to a defined timeframe.
  • Fusion and Re-ranking: Combining results from different search methods for a more accurate outcome.
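The talk does not say which fusion method the team uses. One common option is reciprocal rank fusion (RRF), sketched below as an illustration only. The retriever names and document ids are made up, and `k = 60` is a conventional default rather than a value from the session.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked id lists into one list of (id, score) pairs.

    Each retriever contributes 1 / (k + rank) for every document it
    returns, so documents ranked highly by several retrievers rise to
    the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from three of the retrievers listed above.
fused = reciprocal_rank_fusion({
    "vector": ["d1", "d2", "d3"],
    "bm25": ["d2", "d4"],
    "graph": ["d2", "d1"],
})
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `d2` comes first because all three retrievers returned it, followed by `d1`, which two of them returned. RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarity and BM25 scores are not on the same scale.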

The “Temporal Triple” and the Power of Time-Stamped Assertions 🕰️

A key innovation is the “temporal triple,” extending the traditional Subject-Predicate-Object of knowledge graphs to include time.

  • Subject-Predicate-Object-Time: This structure allows for precise tracking of assertions. For example, instead of just “Shell owns X,” it becomes “Shell owned X from [start date] to [end date].” If the end date is null, it means they currently own it. This eliminates ambiguity and the need to reason over every record.
  • Bi-Temporal Querying: The system supports bi-temporal querying, distinguishing between business time (when an event occurred) and service time (when the data was recorded or updated). The distinction matters because fields such as created_at and updated_at are audit fields: they record when a row changed, not when the fact it describes was true.
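A temporal triple and a bi-temporal lookup might be sketched as follows. The field names, the half-open validity interval, and the sample data (including the second owner, "Acme") are assumptions for illustration. The sketch also ignores superseded versions of the same assertion, which a full bi-temporal store would keep.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalTriple:
    subject: str
    predicate: str
    obj: str
    valid_from: date            # business time: when the fact became true
    valid_to: Optional[date]    # business time: None means still true
    recorded_at: date           # service time: when the system learned it


def holds_at(triple: TemporalTriple, on: date) -> bool:
    """True if the fact was valid on `on` (interval is [valid_from, valid_to))."""
    return triple.valid_from <= on and (
        triple.valid_to is None or on < triple.valid_to
    )


def as_of(
    triples: list[TemporalTriple], business_date: date, known_by: date
) -> list[TemporalTriple]:
    """Facts valid on `business_date`, using only what was recorded by `known_by`."""
    return [
        t for t in triples
        if t.recorded_at <= known_by and holds_at(t, business_date)
    ]


# Hypothetical assertions about who owns asset "X".
triples = [
    TemporalTriple("Shell", "owns", "X", date(2015, 1, 1), date(2020, 6, 30), date(2015, 2, 1)),
    TemporalTriple("Acme", "owns", "X", date(2020, 6, 30), None, date(2020, 7, 15)),
]

owners_2018 = [t.subject for t in as_of(triples, date(2018, 1, 1), date(2024, 1, 1))]
owners_now = [t.subject for t in as_of(triples, date(2024, 1, 1), date(2024, 1, 1))]
```

Separating `valid_from`/`valid_to` from `recorded_at` lets a query ask both "who owned X in 2018?" and "what did we know about X on a given day?" without conflating the two.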

Rust and MongoDB: Speed, Efficiency, and Unified Storage ⚡

The team opted for Rust for performance-critical components, citing its speed (on average, 14 times faster than Python in the team's unit tests) and superior memory management. With the computation itself this fast, LLM calls are now the primary bottleneck, and the time saved elsewhere can be spent on them.

MongoDB serves as the unified storage solution, offering:

  • Unified Storage: Documents, vectors, graph edges, facts, and ordered data all reside in one place.
  • Simplified Management: Easier backup, security, and permission management (entitlements).
  • Shared Workspace: Enables developers to experiment with LLMs and graphs before production deployment.
  • Hybrid Retrieval: The ability to leverage multiple search methodologies provides the best of all worlds, significantly increasing the chances of delivering the right data to the LLM at the right time.
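As one illustration of what unified storage makes possible, the sketch below builds a single aggregation pipeline that starts with a vector search and then walks graph edges from the matched entities. It assumes MongoDB Atlas Vector Search is available. The collection name `edges`, the index name, and every field name are hypothetical, since the talk does not describe the team's schema. The function only builds the pipeline as plain dictionaries and does not connect to a database.

```python
def entity_neighbourhood_pipeline(
    query_vector: list[float], max_hops: int = 2, limit: int = 5
) -> list[dict]:
    """Build an aggregation pipeline: vector search, then graph expansion.

    Hypothetical schema: entity documents carry an `embedding` field, and
    an `edges` collection stores {source, target, predicate} documents.
    """
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    return [
        {
            "$vectorSearch": {
                "index": "entity_embedding_index",  # hypothetical index name
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": limit * 20,
                "limit": limit,
            }
        },
        {
            "$graphLookup": {
                "from": "edges",
                "startWith": "$_id",            # begin at each matched entity
                "connectFromField": "target",   # follow each edge to its target
                "connectToField": "source",     # then find edges leaving that node
                "as": "reachable_edges",
                "maxDepth": max_hops - 1,       # maxDepth 0 means a single hop
            }
        },
    ]


pipeline = entity_neighbourhood_pipeline([0.1, 0.2, 0.3], max_hops=2, limit=5)
```

With a driver such as PyMongo, a pipeline like this would be passed to `collection.aggregate(pipeline)`. Because vectors and edges live in the same cluster, there is no second system to keep in sync.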

Tackling Tabular Facts and the Limitations of Text-to-SQL 📊

A significant challenge lies in extracting insights from tabular facts. While text-to-SQL is often touted, it struggles with complex data structures and ambiguous queries. James shares that their text-to-SQL solutions achieved only 60% accuracy, failing to understand nuances like regional hierarchies or specific port relationships.

The graph-based approach, however, helps the LLM generate more accurate MQL (MongoDB Query Language) queries by supplying context and narrowing the search to the relevant data points.
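A rough sketch of how graph context can constrain a query: the hierarchy is used to expand a region into all of its members, and the result becomes an explicit MQL filter rather than something the LLM has to guess. The hierarchy, the field names `region` and `observed_at`, and the helper names are all hypothetical.

```python
from datetime import datetime

# Hypothetical parent -> children hierarchy taken from the graph.
NARROWER = {
    "Asia": ["Southeast Asia", "East Asia"],
    "Southeast Asia": ["Singapore", "Vietnam"],
    "East Asia": ["Japan"],
}


def expand(region: str) -> list[str]:
    """Return the region plus everything beneath it, sorted alphabetically."""
    found, stack = [], [region]
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(NARROWER.get(current, []))
    return sorted(found)


def build_filter(region: str, start: datetime, end: datetime) -> dict:
    """Build an MQL filter for facts in a region within [start, end)."""
    return {
        "region": {"$in": expand(region)},
        "observed_at": {"$gte": start, "$lt": end},
    }


mql_filter = build_filter(
    "Southeast Asia", datetime(2024, 1, 1), datetime(2024, 7, 1)
)
```

A question about "Southeast Asia" now matches rows tagged "Singapore" or "Vietnam" as well, which is the kind of regional hierarchy the text-to-SQL attempts reportedly missed.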

Key Lessons Learned: The Power of a Unified Platform 🌟

The journey has yielded invaluable lessons:

  • One MongoDB Cluster, Multiple Databases: The flexibility of MongoDB allows for specialized databases within a single cluster.
  • Single Source of Truth: Consolidating all data types (documents, vectors, graphs, facts) into one system simplifies everything.
  • Hybrid Retrieval is Key: Combining multiple search strategies offers superior accuracy and relevance.
  • Embrace Temporal Data: Understanding and modeling time is essential for true reasoning.

James’s team successfully built a robust temporal graph reasoning platform on MongoDB, demonstrating that by unifying ontologies, knowledge graphs, and vectors, and by embracing temporal data, you can unlock a new level of reasoning in AI systems.
