Story: The Day I Finished the Last Line

From the first line of code, to a production HTTPS endpoint, to the final ✅ stamped on the “Production Checklist”: one full round of the project.

That day I opened the dev notes and read from the top. It looked like a finished product, but the more I read, the more it looked like a coordinate system.

Every change adds a capability point. But “it runs” and “I understand why others designed it this way” are two different things. A finished product answers “do we have these capabilities?”; a coordinate system answers “how do these capabilities interact, which is the base layer, which is the top, which can be split, and which can’t?”

Part 11 isn’t a summary index of Parts 02-10. It re-encodes the whole project into 3 high-level observations. The specific technical details were covered in each Part; this one is about the things I only saw clearly after doing them.

These 3 things are:

  1. Capabilities can be collapsed: 11 flags collapsed into 5 modes—the shift from feature-oriented to system-oriented design
  2. Learning paths can be re-encoded: re-sort the entire learning order after one round of implementation—a refactoring the teacher has to go through
  3. What you don’t do matters more than what you do: the deferred and explicitly-not-now list is more worth talking about than the done list

A note up front: these 3 things aren’t “my success story”; they’re “the structure I only saw after finishing”. If you’re in the middle of building a RAG project right now, I hope these 3 observations save you a stretch of the road.


Observation 1: 11 Flags → 5 Modes

The original design had 11 independent flags, plus the RAG_SYNTHESIS_MODE setting:

auto_route
auto_rewrite
retrieval_mode
use_reranker
use_parent_expansion
use_context_compression
use_long_context_reorder
use_faithfulness_check
use_ragas_check
use_citation_check
use_llm_citation_judge
RAG_SYNTHESIS_MODE

Each flag toggles one behavior, and the combinations add up to dozens of query pipelines; if all 11 flags were plain on/off switches, the theoretical ceiling would be 2^11 = 2,048. That sounds flexible, but in practice it outsources decisions to the caller.

Problems surfaced quickly:

  • The caller doesn’t know which flags to turn on or off
  • Different callers turn on different flags, so during debugging you can’t tell which pipeline a given query ran
  • Adding a new capability becomes “one more flag”, and the system becomes harder and harder to read

I eventually collapsed it into 5 modes:

fast      — direct lookup, no evaluation
safe      — adds faithfulness check, citation
deep_eval — runs RAGAS and citation judge synchronously
creative  — LLM-first synthesis
agentic   — allows multi-step workflow, tool routing

The difference after the collapse:

  • The caller just picks a mode, doesn’t think about flag combinations
  • The pipeline is deterministic (5 paths), so when debugging, the mode alone tells you which path ran
  • Adding a new capability becomes “which mode does this belong to” instead of “one more flag”
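A minimal sketch of what the collapse can look like in code, assuming the modes are implemented as named presets over internal switches. The class names, the `resolve_config` helper, the `llm_first_synthesis` and `allow_tool_routing` fields, and the exact flag values per mode are all hypothetical; only the five mode names and the four check flags come from the lists above.

```python
from dataclasses import dataclass
from enum import Enum


class QueryMode(str, Enum):
    """The five caller-facing modes."""
    FAST = "fast"
    SAFE = "safe"
    DEEP_EVAL = "deep_eval"
    CREATIVE = "creative"
    AGENTIC = "agentic"


@dataclass(frozen=True)
class PipelineConfig:
    """Internal switches; callers never set these directly."""
    use_faithfulness_check: bool = False
    use_citation_check: bool = False
    use_ragas_check: bool = False
    use_llm_citation_judge: bool = False
    llm_first_synthesis: bool = False  # hypothetical field name
    allow_tool_routing: bool = False   # hypothetical field name


# Assumed presets: which switches each mode turns on is a guess based on
# the one-line mode descriptions, not the project's actual configuration.
MODE_PRESETS = {
    QueryMode.FAST: PipelineConfig(),
    QueryMode.SAFE: PipelineConfig(
        use_faithfulness_check=True, use_citation_check=True
    ),
    QueryMode.DEEP_EVAL: PipelineConfig(
        use_ragas_check=True, use_llm_citation_judge=True
    ),
    QueryMode.CREATIVE: PipelineConfig(llm_first_synthesis=True),
    QueryMode.AGENTIC: PipelineConfig(allow_tool_routing=True),
}


def resolve_config(mode: str) -> PipelineConfig:
    """Turn a caller-supplied mode name into the single pipeline it selects."""
    try:
        return MODE_PRESETS[QueryMode(mode)]
    except ValueError:
        valid = ", ".join(m.value for m in QueryMode)
        raise ValueError(
            f"unknown mode {mode!r}; expected one of: {valid}"
        ) from None
```

With this shape, a caller passes only `"safe"` or `"deep_eval"`, and logging the mode name next to each query is enough to know which of the 5 paths ran.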

Takeaway: Once a system exposes more than about 5 independent binary switches, callers have, in my experience, effectively given up trying to understand it. Collapsing into modes turns “combinatorial explosion” back into “a few paths”.

This is the thing I only truly understood after finishing: not every capability should be a user-facing switch. Collapsing capabilities into modes is system design, not interface polish.


Observation 2: Re-sort the Learning Path After One Round

I wrote teaching notes as I went. The original learning path was sorted by “what I know, so teach it first”:

- LlamaIndex Workflows
- Tool Routing
- MCP Tool Call Integration
- Evaluate / Reflect / Refine / Retry Loop
- FastAPI Auth
- Dockerize FastAPI + Qdrant
- Deploy to VPS
- Cloud Deployment Options

After one round of implementation, I found two problems with the order:

  1. Tool Routing and MCP came after Workflows, but Workflows use tool routing internally. Teaching Workflows first leaves students stuck on how Workflows dispatch tools
  2. FastAPI Auth and Docker were separated, but in practice the auth middleware and Docker port planning are coupled. Taught separately, students only discover at deploy time that auth can’t use loopback inside the container

So I broke up the entire path and re-sorted it. The re-sorted order is by “where students actually get stuck”, not by “what I know”:

- Production Query Modes / Cost Profiles
- Runtime Budget Guard
- LlamaIndex Workflows
- Tool Routing
- MCP Tool Call Integration
- Reflect / Refine / Retry Loop
- Async Evaluators
- Ingestion Job Queue
- FastAPI Auth + Permission-aware Retrieval Hardening
- Document Management APIs
- Citation Source Viewer Payload
- Dockerize FastAPI + Qdrant + Worker
- Cloud Deployment Options
- Oracle E2 + Qdrant Cloud decision
- Deploy API to Oracle E2 + Qdrant Cloud
- Cloudflare domain + HTTPS
- Production Checklist Final Pass

FastAPI Auth moved from position 5 to position 9, into the block that leads into Dockerization, because auth can’t use loopback inside a container, a coupling that only surfaces at deploy time.

This wasn’t covered in Part 03 or Part 04, and is only covered here in Part 11—because this isn’t teaching content, it’s teaching design itself.

Takeaway: Re-sorting the learning path after one round of implementation is a refactoring the teacher has to go through. The first path is sorted by “what I know”; the re-sort is sorted by “where students get stuck”. The latter matters more than the former.


Observation 3: What You Don’t Do Matters More Than What You Do

Parts 02-10 walked through “building it”. But a production RAG system runs stably not only because of what you built, but also because of what you explicitly chose not to build.

Only the deferred items (out of scope for this series) are listed here:

- Multi-LLM provider abstraction (Anthropic / OpenAI / Gemini all supported)
- Self-hosted Qdrant (currently using Qdrant Cloud)
- Streaming response (currently returns complete answer)
- WebSocket push (citation changes proactively notify client)
- Multi-region failover (VM is only in us-ashburn-1)
- Production-style observability stack (Prometheus + Grafana + alert)
- A/B testing framework (5 query modes have no traffic split)
- Fine-tuning embedding model (using off-the-shelf OpenAI / BGE)
- RAGAS / LLM judge switched to local models (currently cloud LLM)
- In-house prompt registry (prompts scattered inside each module)

Each of these is “buildable”. Each is also “not being built right now”.

Why this matters more than Done:

The existence of Deferred means you know what you’re not doing. A RAG project’s “not done” list tells you more about the system’s maturity than its “done” list.

  • “We’re in production” → What does that mean? How much traffic? What SLA? Unspecified
  • “We have 11 capability flags” → So what? How does the caller use them? Unspecified
  • “We didn’t do multi-LLM abstraction” → Why? Cost? Time? Vendor lock-in? That reason is the real information
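One way to keep the reason attached to each deferred item is to store the list as data rather than prose. The sketch below is an illustration under assumptions, not how this project tracks its list: `DeferredItem`, its field names, and `missing_reasons` are hypothetical. The three entries are copied from the list above, and their reasons are deliberately left empty because this post does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeferredItem:
    name: str
    current_state: str            # what the system does today instead
    reason: Optional[str] = None  # why it is deferred; None = not written down


# Entries copied from the deferred list; reasons are intentionally unset.
DEFERRED = [
    DeferredItem("Self-hosted Qdrant", "using Qdrant Cloud"),
    DeferredItem("Streaming response", "returns complete answer"),
    DeferredItem("Multi-region failover", "VM is only in us-ashburn-1"),
]


def missing_reasons(items: List[DeferredItem]) -> List[str]:
    """Return the names of deferred items that have no recorded reason."""
    return [
        item.name
        for item in items
        if not (item.reason and item.reason.strip())
    ]
```

Running `missing_reasons(DEFERRED)` over the whole list gives a concrete to-do: every name it returns is a decision whose “why” has not been written down yet.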

Source materials: the Deferred / Not Now section of 3-3, plus the implicit assumptions in the closing section of 4-1.

Takeaway: A RAG project’s “not done” list IS its design philosophy. Writing this list down is harder than writing the Done list, and far more valuable.


Alignment with Part 10: These Are My Choices, Not General Advice

Part 10’s closing covered 4 architecture choices (Oracle VM / Qdrant Cloud / Cloudflare / Docker Compose). Those 4 are “system layer” choices. The 3 choices in this Part 11 are “methodology layer” choices.

| Choice | Part 10 (system layer) | Part 11 (methodology layer) |
| --- | --- | --- |
| How to pick VM | Oracle Cloud Always Free | |
| How to pick Vector DB | Qdrant Cloud | |
| How to pick HTTPS | Cloudflare Origin Cert | |
| How to pick orchestration | Docker Compose (not k8s) | |
| How to manage capabilities | | 11 flags → 5 modes |
| How to order learning path | | re-sort by production sequence |
| How to define scope | | explicit deferred list |

Same principle applies: these 3 are “why I chose this”, not “you should do this”. If you’re building a different RAG project, the conclusions may flip entirely—for example, if your caller is an engineering team, 11 flags might be fine; if your curriculum is for experts, the production sequence might not be right; your deferred list might not need to be deferred at all.

What these 3 have in common isn’t “my answer is right”—it’s “after finishing, I really know what I chose, and what I gave up”. That is closure.


What the Series Closure Looks Like

Part 01 is the interactive demo entry. Parts 02-10 walk through the whole arc. Part 11 collapses them into a coordinate system.

But what does “coordinate system” actually mean? It’s not “this system has X, Y, Z capabilities”—it’s “how do these capabilities interact, how can a reader enter”.

My own way of entering:

  1. Want to understand how RAG works → Parts 02 and 03
  2. Want to see what a real project looks like → Parts 05-09
  3. Want to push RAG to production → Part 10
  4. Want to see how the whole project reads, how to assess your own progress → Part 11 (this Part)
  5. Want to compare framework choices → Part 04 (looking back)

Readers can enter from anywhere. Part 11 isn’t “the last article”, it’s “the first page of the coordinate system”—after you finish Parts 02-10, if you want the panoramic view, come back here; after you read this, if any section interests you, jump back to Parts 02-10.


For Readers Building a RAG Project Right Now

If I had to summarize these 3 things for readers in the middle of building a RAG project:

The first one (flags → modes) is a reminder for system designers: when you feel flags multiplying and callers randomly flipping switches, stop and collapse them into modes. This isn’t interface simplification, it’s system design.

The second one (re-sort learning path) is a reminder for teachers: the first path is sorted by “what I know”—the re-sort is sorted by “where students get stuck”. If you teach others or keep a record of your own learning path, run one round then come back and re-sort.

The third one (“not done” list) is a reminder for project owners: the Done list is “we’ve built it”; the “not done” list is “we haven’t built it, and we know why we haven’t”. The latter is harder to write and better at helping you locate your own progress.

If any of these 3 makes you stop and think for a moment, that’s the value of closure.