What It’s Like Building AI Systems in 2026
No AI agent frameworks were created during the writing of this article. Three were deprecated.
Hey, I haven’t built anything in a while. I want to code up something simple using AI tools. Just a chatbot for our internal docs. I was thinking I’ll just use the Anthropic API and reference our internal markdown docs in the prompt.
Oh god. No. Nobody does that anymore. You need an agentic RAG pipeline with tool use and a memory layer.
OK. What’s RAG?
Retrieval Augmented Generation. You chunk your docs, embed them, store them in a vector database, retrieve the top-k chunks at query time, and stuff those into the prompt.
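For the record, the whole idea fits in a screenful. A minimal sketch, where a toy deterministic embedding stands in for a real embedding model and a NumPy array plays the vector database:

```python
# Minimal RAG: chunk -> embed -> index -> retrieve top-k -> stuff into prompt.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a real embedding model, seeded so the example runs
    # without an API key. Swap in actual embeddings for anything real.
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def chunk(doc: str, size: int = 500) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

docs = ["# Onboarding\nAsk IT for a laptop...", "# Expenses\nFile within 30 days..."]
chunks = [c for d in docs for c in chunk(d)]
index = np.stack([embed(c) for c in chunks])        # the "vector database"

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)                   # cosine similarity (unit vectors)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "How do I file an expense?"
prompt = "Answer using these docs:\n" + "\n---\n".join(retrieve(question)) + f"\n\nQ: {question}"
```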
Right, I’ve heard of that. Which vector database?
Depends. Pinecone if you want managed, Qdrant if you want self-hosted, Weaviate if you want hybrid, pgvector if you already have Postgres, LanceDB if you’re cool, Turbopuffer if you read Hacker News, Chroma if it’s still 2023, or Milvus if you hate yourself.
I just want to pick one.
Actually, don’t bother. Pure vector search is not dead, exactly, which means it is dead in conference-talk years. You need hybrid search with BM25 plus dense retrieval, then a reranker on top. Then maybe a graph layer for entity relationships. GraphRAG. Or LightRAG. Or HippoRAG. Or just give up and use long context.
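If you do go hybrid, the fusion step itself is small. One common recipe is reciprocal rank fusion, sketched below with made-up document IDs; the reranker would then see only the top of the fused list:

```python
# Reciprocal rank fusion: merge best-first rankings from several retrievers.
# Documents ranked highly by both BM25 and dense search float to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["doc3", "doc1", "doc7"]    # keyword index, best first
dense_hits = ["doc1", "doc9", "doc3"]    # vector index, best first
fused = rrf([bm25_hits, dense_hits])     # hand the top of this to a reranker
```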
Can I just use long context? Gemini has a 2 million token context window.
You could, but then your costs explode and latency dies. Unless you use prompt caching. Anthropic, OpenAI, and Google all have it now, but they all work differently, so you’ll want a cache-aware router.
A router?
OpenRouter, Portkey, Helicone, LiteLLM, take your pick. They route between providers, handle fallbacks, track costs, and abstract the API differences. Otherwise, when Anthropic has an outage your whole product dies.
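Underneath the dashboards, the routing part is about ten lines. A sketch, with the provider calls as hypothetical stand-ins for your actual SDK clients:

```python
# Bare-bones fallback routing: try providers in order, return the first success.
def route(prompt: str, providers: list) -> str:
    last_err: Exception | None = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as err:             # outage, rate limit, timeout...
            print(f"{name} failed: {err}")   # cost/latency tracking goes here
            last_err = err
    raise RuntimeError("all providers are down") from last_err

# providers = [("anthropic", call_anthropic), ("openai", call_openai)]
# answer = route("Summarize the onboarding doc", providers)
```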
So my simple chatbot now needs traffic control?
Only if you care about resilience.
Fine. So now I just call the LLM with the retrieved chunks?
No. You agentify it. The model decides what to retrieve, reads the chunks, decides if it needs more, retrieves again. Tool use. That’s the paradigm.
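Strip away the framework branding and the paradigm is a loop. A minimal sketch, where llm and search are hypothetical stubs and llm returns either a search request or a final answer:

```python
# The agentic retrieval loop: the model asks for more context until it can answer.
def agent(question: str, llm, search, max_steps: int = 5) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        reply = llm(question=question, context=context)   # returns a dict
        if reply["action"] == "search":                   # model wants more evidence
            context.extend(search(reply["query"]))
        else:                                             # model says it's done
            return reply["answer"]
    return "Ran out of retrieval budget."
```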
How do I build that?
LangChain. Or LlamaIndex. Or LangGraph. Or CrewAI. Or AutoGen. Or Mastra. Or Pydantic AI. Or Haystack. Or DSPy. Or Mirascope. Or OpenAI Agents SDK. Or Google ADK. Or Microsoft Agent Framework. Or just write it yourself because all of them are abstraction soup.
Should I write it myself?
You should write it yourself.
OK, I’ll write it myself. What are you using to write code these days?
Cursor, Windsurf, Zed, Claude Code, Codex, Aider, Cline, Continue, OpenCode. Take your pick.
I’ve heard good things about Claude Code.
It’s solid, but make sure you have the right plugins. Also add skills.
Skills?
A folder with a markdown file that tells the agent what to do. Like a prompt, but in a folder. They called it a skill because “folder of prompts” doesn’t sell.
I was mostly hoping it would help me write the code.
That’s how it starts.
What about Lovable or Replit?
Those are for prototypes, internal tools, and the occasional product someone accidentally ships. Lovable is for when the CEO wants something for the all-hands tomorrow. Replit Agent is for when the CEO wants something for the all-hands in fifteen minutes. v0 is for when you need a landing page that looks like every other landing page.
Got it, I’ll use Claude Code.
Until next month, when something else ships and the team migrates again.
Whatever. I’ll start with Claude Code.
Back to the application you’re building. You’ll need observability. Plug in Langfuse. Or LangSmith. Or Braintrust. Or Arize Phoenix. Or W&B Weave. Or Helicone. Or Laminar.
What do those do?
Trace your LLM calls so you can debug them. Because the model is non-deterministic, your pipeline has 17 steps, and when something breaks you have no idea which step did it.
I haven’t built the bot yet, but I already need a dashboard to explain why it doesn’t work.
Exactly. Now you’re thinking like a platform team.
Right. Anyway, my bot now retrieves docs and answers questions.
You also need evals.
What’s an eval?
A test, but for vibes. You define a dataset of inputs and either golden outputs or LLM-as-judge rubrics, then you run your pipeline against the dataset every time you change a prompt. Otherwise you’ll fix one thing and silently break five others.
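Concretely, an eval harness can start this small. A sketch where pipeline is your bot and judge, if supplied, is another LLM call scoring against a rubric; both are illustrative stubs:

```python
# Run the pipeline over a fixed dataset on every prompt change; score each case
# against a golden output, or with an LLM-as-judge rubric if one is supplied.
dataset = [
    {"input": "How do I reset my VPN?", "golden": "Open the IT portal and..."},
    # ...more cases as users surprise you
]

def run_evals(pipeline, judge=None) -> float:
    passed = 0
    for case in dataset:
        answer = pipeline(case["input"])
        if judge is not None:
            passed += judge(case["input"], answer)   # rubric score, 0 or 1
        else:
            passed += int(answer.strip() == case["golden"].strip())
    return passed / len(dataset)   # watch this number before and after every change
```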
This is feeling like a lot for a chatbot for internal company docs.
We haven’t gotten to memory yet. Your bot needs to remember user preferences across sessions. Use Mem0 or Letta or Zep. Or roll your own with a summarization loop and a user profile table.
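The roll-your-own version is honestly the least mysterious option. A sketch of the summarization loop plus user profile table, where summarize is a hypothetical LLM call that folds a session into the running profile:

```python
# After each session, merge the transcript into a per-user summary.
import sqlite3

db = sqlite3.connect("memory.db")
db.execute("CREATE TABLE IF NOT EXISTS profiles (user_id TEXT PRIMARY KEY, summary TEXT)")

def remember(user_id: str, transcript: str, summarize) -> None:
    row = db.execute("SELECT summary FROM profiles WHERE user_id = ?",
                     (user_id,)).fetchone()
    merged = summarize(row[0] if row else "", transcript)  # LLM merges old + new
    db.execute("INSERT OR REPLACE INTO profiles VALUES (?, ?)", (user_id, merged))
    db.commit()
```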
OK, memory.
And you’ll want to expose your docs over MCP so other agents can use them.
What’s MCP?
Model Context Protocol. Anthropic released it in late 2024. A standard protocol for agents to talk to tools and data sources. Everyone has an MCP server, a registry entry, or at least a roadmap slide.
So my docs become an MCP server, and any agent can query them?
Right. Then your chatbot doesn’t even need its own retrieval logic. It just connects to the MCP server. Or rather, to your MCP gateway, because MCP gives you a protocol, not your auth model, rate limits, audit logs, tenant isolation, and governance story.
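The server half really is short, at least before the gateway shows up. A sketch assuming the official Python SDK’s FastMCP interface; the tool name and the stubbed retrieval are illustrative:

```python
# Expose internal docs as an MCP tool any client can call.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs")

def retrieve(query: str) -> list[str]:
    return ["(top matching chunks from your index would go here)"]  # stub

@mcp.tool()
def search_docs(query: str) -> str:
    """Return the most relevant internal-doc chunks for a query."""
    return "\n---\n".join(retrieve(query))

if __name__ == "__main__":
    mcp.run()   # speaks MCP over stdio; gateway, auth, and audit logs go in front
```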
Why is there a gateway in front of the protocol that was supposed to be the standard?
Because standards tell systems how to talk. They do not make your compliance team less alive.
I asked for a chatbot and accidentally summoned enterprise architecture.
That’s how you know it’s ready for production.
Hold on, didn’t I read last week that MCP is actually on the way out?
It’s complicated. Tool definitions can blow out your context. Some clients eagerly stuff every available tool schema into the prompt. Thirty tools across five servers can be fifteen to twenty thousand tokens before the user has typed a word. And the model still picks the wrong tool.
So what do I do instead?
Give the agent a CLI. One binary, one entrypoint. The agent learns the subcommands at runtime by running --help. Fewer tokens, more flexible, and it composes with everything Unix already gives you.
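For the docs bot, that could be a plain argparse script. Everything below is illustrative, but the point stands: --help becomes the entire integration surface:

```python
# docsctl: one entrypoint; the agent discovers subcommands via `docsctl --help`.
import argparse

DOCS = {"onboarding.md": "Ask IT for a laptop...", "expenses.md": "File within 30 days..."}

parser = argparse.ArgumentParser(prog="docsctl", description="Query internal docs.")
sub = parser.add_subparsers(dest="command", required=True)
sub.add_parser("search", help="Keyword search over the docs").add_argument("query")
sub.add_parser("show", help="Print one document by name").add_argument("name")

args = parser.parse_args()
if args.command == "search":
    print("\n".join(n for n, text in DOCS.items() if args.query.lower() in text.lower()))
else:
    print(DOCS.get(args.name, "no such doc"))
```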
So MCP is dead?
No. MCP is fine for some things, just not all the things people are using it for. Use both. Or neither. Or wrap them in a skill.
I’ll ignore that. So I have a chatbot, RAG, MCP, evals, observability, memory. Am I done?
Are you using the right model?
I was going to use an Anthropic model.
Which one?
I don’t know, the latest one.
Opus is smart but expensive. Sonnet is the workhorse. Haiku is cheap but sometimes too eager. Don’t rule out OpenAI models, or even open-source models. Then there’s the new one that came out yesterday, which I haven’t tested yet. Also, are you on the API, or Bedrock, or Vertex? They all have slightly different rate limits and slightly different region availability.
Should I just route between models based on query complexity?
Now you need a router model, which is itself an LLM, which means you have an LLM choosing which LLM to call, and you’ll pay for both.
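In miniature it looks like this; all three callables are hypothetical stand-ins, and yes, every question gets billed twice:

```python
# An LLM choosing which LLM to call. Two bills per question, as promised.
def answer(query: str, cheap_llm, call_haiku, call_opus) -> str:
    verdict = cheap_llm(f"Is this question simple or complex? {query}")  # bill #1
    model = call_opus if "complex" in verdict.lower() else call_haiku
    return model(query)                                                  # bill #2
```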
I’m losing my will to live.
Good. That means you’re past the prototype phase. Now let’s discuss fine-tuning.
It’s a simple chatbot. Fine-tuning feels like overkill.
You should at least distill. Take Opus traces, fine-tune a small open model on them, save 90% on costs.
That sounds reasonable.
Until the base model changes, your serving provider tweaks behavior, and the cheap clone drifts just enough to ruin your evals without throwing an error.
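For what it’s worth, step one of that distillation is mostly data plumbing: turning production traces into a fine-tuning file. A sketch assuming a simple prompt/completion trace shape and chat-style JSONL output:

```python
# Convert teacher-model traces into a JSONL fine-tuning set for a small model.
import json

traces = [
    {"prompt": "How do I reset my VPN?", "completion": "Open the IT portal and..."},
]

with open("distill.jsonl", "w") as f:
    for t in traces:
        record = {"messages": [
            {"role": "user", "content": t["prompt"]},
            {"role": "assistant", "content": t["completion"]},  # the teacher's answer
        ]}
        f.write(json.dumps(record) + "\n")
```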
What about agents talking to other agents? I keep hearing about that.
A2A. Agent-to-Agent protocol. Google announced it in 2025. Like MCP, but for agents instead of tools. So now you have agents calling agents calling agents calling MCP servers calling tools.
Who’s debugging that?
Nobody. That’s why we have observability platforms. Six of them.
Step back for a second. You know a lot about these tools. What have you actually built with all these?
What do you mean?
Like, what have you shipped? What’s in production? How are your customers benefiting?
We have a really solid agentic framework now. The eval harness is best-in-class internally. We rebuilt the orchestrator twice. We’re on our third vector store.
Right, but what does it do?
It’s a platform. A foundation.
For what?
Future use cases.
Has it generated any revenue?
Not directly. But the team has gotten really sharp at MCP server architecture.
Has anyone outside your team used it?
We did a brown bag.
How long have you been working on this?
Fourteen months.
My docs are already in markdown. I’ll just reference them in the system prompt. I’ll build with Claude Code and call it a day.
That might work for you. It would never work for us.
Why not?
Our CEO said we’re an AI-first company. You can’t walk into planning with “I put markdown in a prompt.” Where’s the platform? Where are the agents? What do we put on the slide?
I’m not sure “create an over-engineered platform filled with tools” is what they meant.
Then where do the agents go?
Wherever the customer’s problem requires them. Not one step earlier.
That sounds like regular engineering. Plus, how will I increase my token usage?
Token usage?
We’re expected to spend a certain amount on tokens per month. We even have an internal leaderboard of which developer has the highest LLM token usage.
Are you measuring return on investment for those tokens?
What do you mean?
Spending money on tokens for the sake of token usage makes no sense. What value does it bring to the business?
It’s part of our AI-first strategy. More token usage means higher AI adoption. I gotta show I’m embracing AI by pumping my token usage.
That sounds expensive. How do you measure the value added from your token usage?
Pffft, “tOkEnS aRe ExPeNsIvE”. You sound like a Luddite. I used 20 billion tokens last month.
What did you build with 20 billion tokens?
A dashboard showing that I used 20 billion tokens.
So you used tokens to build a dashboard about using tokens?
That’s the adoption flywheel.
This was inspired by Jose Aguinaga and his article “How it feels to learn JavaScript in 2016”. Ten years later, we’re doing the same thing. History doesn’t repeat itself, but it rhymes.