A structured 16-problem map for RAG and LLM pipeline debugging

Hi everyone,

I’m an independent builder working on a structured “16-problem ProblemMap” for diagnosing common RAG and LLM pipeline failures.

Instead of proposing yet another framework, the goal is to provide a fixed diagnostic lens for where things typically break in real systems. The map categorizes recurring failure modes across:

  • ingestion and chunking mismatches

  • embedding and vector store inconsistencies

  • retriever ranking and recall gaps

  • evaluation blind spots

  • hallucination and guardrail leakage

  • deployment and bootstrap ordering issues

In practice, many issues that appear to be “model problems” are actually structural mismatches between components. Having a stable taxonomy has helped reduce trial-and-error debugging in my own work.

The project is maintained in the GitHub account onestardao under the name WFGY.
The ProblemMap component has been referenced or integrated into several academic labs, RAG infrastructure projects, and curated AI research lists.

I’m sharing this here mainly to get feedback from people building or researching LLM systems:

  • Which failure modes do you encounter most often?

  • Do you find structured taxonomies useful in production settings?

  • What categories feel missing or too rigid?

Would appreciate thoughtful discussion.

The failure modes I encounter most often:

  • Context Fragmentation: chunking that destroys semantic meaning.
  • “Lost in the Middle”: the correct passage is retrieved but ignored by the LLM.
  • Query Gap: users phrase queries in different terminology than the technical docs.
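To make “Context Fragmentation” concrete, here is a minimal, illustrative sketch (not from the ProblemMap itself) contrasting naive fixed-size chunking, which happily splits mid-word, with a sentence-aware packer that keeps whole sentences together. The function names and the crude `". "` sentence splitter are assumptions for illustration; a real pipeline would use a proper sentence segmenter.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` characters, regardless of meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text: str, max_size: int) -> list[str]:
    """Sentence-aware chunking: pack whole sentences until the budget is hit.

    Uses a crude ". " split for brevity; it mishandles abbreviations,
    so treat it as a sketch, not production code.
    """
    chunks, current = [], ""
    for sentence in text.split(". "):
        sentence = sentence.rstrip(".") + "."
        if current and len(current) + len(sentence) + 1 > max_size:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

text = "Refunds require a receipt. Returns are accepted within 30 days."
print(fixed_size_chunks(text, 20)[0])  # "Refunds require a re" — split mid-word
print(sentence_chunks(text, 40))       # each chunk is a complete sentence
```

The fixed-size variant strands “re” from “receipt” in one chunk and “ceipt.” in the next, so neither chunk embeds well; the sentence-aware variant keeps each fact retrievable as a unit.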

Do taxonomies help in production? Yes, but primarily as a communication tool. In production, “it’s hallucinating” is a useless bug report. A taxonomy allows a team to:

  • Assign Ownership: Is this a Data Engineering problem (Ingestion/Chunking), a DevOps problem (Vector Store latency), or a Prompt Engineering problem (Guardrails)?
  • Benchmark Progress: It allows you to track “Recall @ K” or “Faithfulness” scores over time. Without a taxonomy, you are playing “Whac-A-Mole”—fixing one hallucination only to break the retriever’s ranking elsewhere.
  • Standardize Post-Mortems: It moves the team from “vibes-based” evaluation to a structured diagnostic workflow.
Categories that feel missing:

  • Temporal Drift: handling stale data and metadata.
  • Agentic Complexity: failures in multi-step reasoning or “multi-hop” retrieval.
  • Economics: latency and token cost treated as first-class “production failures.”
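The “Recall @ K” metric mentioned above is simple enough to track per release. A minimal sketch (function name and document IDs are hypothetical, not part of the ProblemMap):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# 2 of the 3 relevant docs surface in the top 5 -> 0.666...
print(recall_at_k(["d3", "d7", "d1", "d9", "d4"], {"d1", "d3", "d8"}, k=5))
```

Logging this per taxonomy category (ingestion vs. retrieval vs. ranking) is what turns “Whac-A-Mole” fixes into measurable regressions.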