Autodidact: An AI Agent That Learns Like a New Employee

v1.0 of Autodidact — an open-source self-evolving local-first AI agent — is shipping today on PyPI.

Most AI products today pay frontier-model prices for every query, even simple ones, even repeats. Autodidact does what humans do: think first, escalate when stuck, and remember the answer so it doesn’t have to ask twice. After 30 queries on my dev workload: 67% answered locally, ~70% of cost saved versus an all-cloud baseline. Install it with pip install autodidact.

Figure: Autodidact session summary (30 queries, 67% local + memory, $0.70 saved)


Why I built this

I’m passionate about local LLMs and self-learning AI. I’ve always wondered: why can’t an AI agent work like a human? Have a local brain; when asked, think first; if unsure, ask someone smarter (a cloud model, or search); then learn from the answer so next time you don’t need to ask. Today’s AI agents work the opposite way: every query, fresh round-trip, full price - even when similar questions come up.

The pattern bothered me because it’s not how a smart human assistant would behave. The whole point of having someone in the room is that they accumulate context.

I started sketching what an alternative would look like. The shape was obvious almost immediately:

  • A local model as the default - fast, free, on your machine
  • An escalation path to the cloud when the local model wasn’t confident
  • A learning loop that distilled the cloud’s response into local knowledge so the next similar query could be answered locally

The name came later. Autodidact - someone who learns by themselves. The agent that escalates the first time and learns by the second.

What stuck with me wasn’t the architecture, though. It was the realization that this is roughly how every knowledge worker operates. We don’t ask senior people every question; we ask the ones we can’t figure out, and we remember the answers. The whole machine of human collaboration runs on this loop. AI agents currently don’t.

So I started building it.


What v1.0 does

When a query comes in, Autodidact runs through a four-stage routing decision:

Query → Think  (check memory)
      → Try    (local model answers if confident)
      → Ask    (escalate to cloud when uncertain)
      → Learn  (store the answer for next time)
      ──────────────────────────────────────────
      Next similar query → Answer from memory, $0.00

Every response in the chat is annotated with which path it took:

  • [CLOUD] - escalated, paid cloud cost, fact extracted in the background
  • [LOCAL] - local model answered using its own knowledge or ingested docs
  • [LOCAL+MEMORY] - local model answered, grounded by previously-learned facts
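
The four stages and the three tags above can be sketched as one function. This is a minimal illustration, not Autodidact's actual code: `Memory`, `route`, the `local_answer` and `cloud_answer` callables, and the 0.5 threshold are all hypothetical stand-ins, and the toy memory matches normalized query text exactly rather than by similarity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Memory:
    """Toy fact store keyed by normalized query text (hypothetical, exact match only)."""
    facts: dict[str, str] = field(default_factory=dict)

    def recall(self, query: str) -> Optional[str]:
        return self.facts.get(query.strip().lower())

    def store(self, query: str, fact: str) -> None:
        self.facts[query.strip().lower()] = fact


def route(
    query: str,
    memory: Memory,
    local_answer: Callable[[str, Optional[str]], tuple[str, float]],
    cloud_answer: Callable[[str], str],
    threshold: float = 0.5,  # assumed cutoff, not a value taken from Autodidact
) -> tuple[str, str]:
    """Return (tag, answer) following Think -> Try -> Ask -> Learn."""
    # Think: look for a previously learned fact.
    fact = memory.recall(query)
    # Try: the local model answers, grounded by the fact when there is one.
    answer, confidence = local_answer(query, fact)
    if confidence >= threshold:
        return ("[LOCAL+MEMORY]" if fact else "[LOCAL]", answer)
    # Ask: the local model is unsure, so escalate to the cloud model.
    answer = cloud_answer(query)
    # Learn: keep the answer so the next similar query can stay local.
    memory.store(query, answer)
    return ("[CLOUD]", answer)


def fake_local(query: str, fact: Optional[str]) -> tuple[str, float]:
    # Stand-in local model: confident only when a stored fact is available.
    return (fact, 0.9) if fact else ("not sure", 0.1)


def fake_cloud(query: str) -> str:
    # Stand-in cloud model with a canned answer.
    return "PTO is 20 days per year."


mem = Memory()
print(route("What is our PTO policy?", mem, fake_local, fake_cloud))   # first ask: [CLOUD]
print(route("what is our pto policy? ", mem, fake_local, fake_cloud))  # repeat: [LOCAL+MEMORY]
```

The second call never reaches `fake_cloud`, which is the whole point of the loop: the cloud cost is paid once and the repeat is answered from memory.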

The visible difference between [CLOUD] and [LOCAL+MEMORY] on similar follow-up questions is the demo. You ask something hard - the agent escalates, learns. A few queries later, you ask something close to it - and it answers locally, for free, citing the memory it just stored.

That’s the magic moment. Everything else in v1.0 — the setup wizard, the hybrid retrieval, the document synthesis — exists to make that moment actually happen reliably.

What’s in v1.0

  • Confidence-based routing. A grounded self-assessment (GSA) pre-screen + logprob uncertainty + refusal detection. The signals (and the choice to use them) are grounded in my recent paper on zero-shot confidence estimation, which measured AUROC 0.65–0.83 across 3 model families × 2 datasets and beat supervised baselines on out-of-distribution queries.
  • Hybrid retrieval. BM25 keyword search (FTS5) + vector similarity (FAISS), merged via Reciprocal Rank Fusion. Finds documents by exact terms and by meaning.
  • Document synthesis. autodidact learn <path> doesn’t just chunk-and-index. It extracts key facts into memory in the background. The agent answers from internalized knowledge, not raw chunks.
  • Five setup modes. Local+Cloud (default), Cloud+Cloud (no GPU required), Local+Local (fully offline learning), Custom OpenAI-compatible server (llama.cpp, LM Studio, vLLM), or Local-only.
  • Multi-provider. Ollama, any OpenAI-compatible API (10 cloud presets including OpenRouter, DeepSeek, Anthropic proxies), or AWS Bedrock.
  • Local-first. All state in one portable SQLite file (~/.autodidact/memory.db). Works offline after setup.

Quickstart is four commands:

pip install autodidact
autodidact init                      # interactive setup wizard
autodidact learn <path-to-docs>      # ingest your docs/codebase
autodidact chat                      # start asking questions

If Ollama isn’t installed, the wizard offers to install it. If your model isn’t pulled, it pulls it. If your daemon isn’t running, it starts it. The roughly three-minute path from pip install autodidact to first chat is deliberate: most of the engineering work in v1.0 went into making that path reliable across environments.


The hard parts

The architecture I sketched in my head was clean. Implementation surfaced three problems I didn’t fully see going in.

1. The routing decision: when does the local model actually know?

The naive approach is to ask the model: “Are you confident you can answer this?”

It doesn’t work. Self-reported model confidence is poorly calibrated - the same Llama 3.1 8B can confidently produce a wrong answer one minute and hedge on something it knows the next. Token log-probabilities, the most-cited proxy for confidence, are slightly better but not great. The routing literature (RouteLLM and similar) tries to fix this with supervised classifiers — train a model to predict whether the local model will get a question right, then use the classifier’s prediction.

That works in-distribution. It collapses out-of-distribution. The classifier learns which kinds of MMLU questions the local model gets right, not what the model knows. Move to TriviaQA and the routing falls apart.

I spent about a month measuring different signals on different models, and the finding that surprised me became the paper I published in May: a single zero-shot signal, average token log-probability, matches or beats the trained classifier. Across 3 model families × 2 datasets × ~4,500 queries, logprob-based routing scored AUROC 0.65–0.83. RouteLLM-style supervised routing topped out lower and broke when the question distribution shifted.
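
To make the metric concrete, here is a small sketch of how an average-logprob signal can be scored with AUROC. The function names are hypothetical and the four data points are made up for illustration; they are not results from the paper. AUROC here is computed directly as the fraction of (correct, incorrect) answer pairs in which the correct answer has the higher score.

```python
def mean_logprob(token_logprobs: list[float]) -> float:
    """Average token log-probability of one generated answer (higher = more confident)."""
    return sum(token_logprobs) / len(token_logprobs)


def auroc(scores: list[float], correct: list[bool]) -> float:
    """Probability that a correct answer outscores an incorrect one (ties count half)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up per-token logprobs for four answers, plus whether each answer was right.
answers = [
    ([-0.1, -0.2, -0.1], True),
    ([-0.3, -0.5, -0.4], True),
    ([-1.2, -0.9, -1.5], False),
    ([-0.2, -0.6, -0.1], False),
]
scores = [mean_logprob(lp) for lp, _ in answers]
labels = [ok for _, ok in answers]
print(auroc(scores, labels))  # 0.75: three of the four pairs are ordered correctly
```

An AUROC of 0.5 would mean the signal is no better than chance at separating right answers from wrong ones; 1.0 would mean a single threshold separates them perfectly.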

That finding is the foundation of Autodidact’s routing layer. The implementation in v1.0 combines the logprob signal with two others: a grounded self-assessment probe (“based on what you know, can you answer?”) and a refusal detector (catches the model when it’s about to say “I don’t have enough information”). The combination is more robust than any single signal.
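
One way the three signals could be combined is shown below. This is a sketch under stated assumptions, not the v1.0 implementation: the "escalate if any signal fires" rule, the refusal phrase list, the `should_escalate` name, and the -0.8 logprob floor are all hypothetical, and a real threshold would need tuning per model.

```python
import re

# Hypothetical refusal phrases; a real detector would need a broader list.
REFUSAL_PATTERN = re.compile(
    r"i don['’]?t have enough information"
    r"|i don['’]?t know"
    r"|i can(?:not|['’]?t) answer",
    re.IGNORECASE,
)


def looks_like_refusal(answer: str) -> bool:
    """True if the answer contains one of the known refusal phrases."""
    return bool(REFUSAL_PATTERN.search(answer))


def should_escalate(
    self_assessment_yes: bool,
    answer: str,
    token_logprobs: list[float],
    logprob_floor: float = -0.8,  # assumed threshold, not a value from Autodidact
) -> bool:
    """Escalate to the cloud if any of the three signals says the model is unsure."""
    # Signal 1: the self-assessment probe said "no, I can't answer this".
    if not self_assessment_yes:
        return True
    # Signal 2: the draft answer reads like a refusal.
    if looks_like_refusal(answer):
        return True
    # Signal 3: average token log-probability is below the floor.
    if not token_logprobs:
        return True  # no logprobs available, so treat the answer as unverified
    return sum(token_logprobs) / len(token_logprobs) < logprob_floor


print(should_escalate(True, "Paris is the capital of France.", [-0.1, -0.2]))  # False
print(should_escalate(True, "I don't have enough information.", [-0.1]))       # True
```

Treating the signals as independent vetoes keeps the logic easy to audit: each false negative on escalation can be traced to the one signal that should have fired.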

I wrote up the full empirical work in a separate post and in the paper. Short version: confidence calibration is a real engineering problem, not just a research curiosity, and getting it right was the unlock.

2. Learning extraction: what should the agent actually store?

This one I’m less sure about, even now.

When the cloud responds with a 30-line answer to a question, what do you save? A few options I considered:

  1. The full cloud response, verbatim. Easy. Wasteful. The next paraphrase of the question doesn’t match well, and you’ve paid storage cost for redundant context.
  2. A summary of the cloud response. Better, but generated by what? The local model? The cloud model? Both have their failure modes.
  3. Structured facts extracted from the cloud response. What I went with. The cloud’s answer to “What’s our PTO policy?” gets distilled into something like “Company PTO is 20 days per year, accrued monthly.” Stored separately. Retrievable by similarity.

Option 3 is what v1.0 does. The extraction runs in the background after each cloud response, doesn’t block the user, and feeds into the same hybrid retrieval that handles ingested documents. The result is the [LOCAL+MEMORY] path you see in the demo.

But “structured fact extraction” hides a lot of decisions:

  • Granularity. Should “Company PTO is 20 days per year” be one fact or two (“PTO is 20 days” + “accrued monthly”)? Fine-grained facts retrieve better but are harder to deduplicate.
  • Deduplication. Same fact stated three different ways across three queries. v1.0 uses similarity-based dedup on insert; this works but isn’t perfect.
  • Staleness. Cloud answers can become outdated. v1.0 flags when a stored fact is older than a threshold but doesn’t proactively re-verify; v2.0’s plan is a self-verification cycle.
  • What counts as a “fact” vs. a “skill.” v1.0 only does facts. Procedures - how to do something - don’t extract well into the same store. That’s a v2.0 problem.
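
The deduplication and staleness decisions above can be sketched as a small store. Everything here is hypothetical: the `FactStore` class, the 0.7 dedup threshold, and the 90-day staleness window are illustrative values, and token-overlap (Jaccard) similarity stands in for the embedding similarity a real system would use.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?").lower() for w in text.split()} - {""}


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding cosine similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class Fact:
    text: str
    stored_at: float  # seconds since the epoch


@dataclass
class FactStore:
    dedup_threshold: float = 0.7          # assumed value
    max_age_seconds: float = 90 * 86400   # assumed staleness window (90 days)
    facts: list[Fact] = field(default_factory=list)

    def insert(self, text: str, now: Optional[float] = None) -> bool:
        """Add a fact unless a near-duplicate already exists. Returns True if stored."""
        now = time.time() if now is None else now
        if any(jaccard(f.text, text) >= self.dedup_threshold for f in self.facts):
            return False
        self.facts.append(Fact(text, now))
        return True

    def stale(self, now: Optional[float] = None) -> list[Fact]:
        """Facts older than the window: flagged only, never re-verified here."""
        now = time.time() if now is None else now
        return [f for f in self.facts if now - f.stored_at > self.max_age_seconds]


store = FactStore()
print(store.insert("Company PTO is 20 days per year, accrued monthly.", now=0.0))  # True
print(store.insert("Company PTO is 20 days per year accrued monthly", now=1.0))    # False
```

The granularity trade-off shows up directly in the threshold: split the PTO fact in two and each half shares fewer tokens with a restated version, so the same 0.7 cutoff lets more near-duplicates through.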

This is the part of the project I’m least confident is right. v1.0 ships my best current answer. I’m genuinely interested in better ones.

3. Memory management at scale

The first version of the memory store was just a vector DB with cosine similarity. It worked for ~50 facts. By 500 facts, it started missing obvious matches when query phrasing diverged from the stored fact’s phrasing. Vector similarity is sensitive to surface form in ways that aren’t always semantic.

What v1.0 ships is a hybrid: BM25 (which catches exact-term matches that embeddings miss) plus vector similarity (which catches semantic matches that BM25 misses), merged via Reciprocal Rank Fusion (which doesn’t require tuning weights between the two). Hybrid retrieval was a measurable improvement, not just on stored facts but on document chunks too.
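
For readers who haven't met Reciprocal Rank Fusion, a minimal sketch follows. Each document scores the sum of 1 / (k + rank) over the lists it appears in, so only rank positions matter and the raw BM25 and cosine scores never need to be put on a common scale. The document IDs are made up, and k = 60 is the constant commonly used with RRF; whether Autodidact uses that value is an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs using score(d) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_pto", "doc_holidays", "doc_payroll"]   # keyword ranking (made up)
vector_hits = ["doc_leave", "doc_pto", "doc_holidays"]   # embedding ranking (made up)
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc_pto', 'doc_holidays', 'doc_leave', 'doc_payroll']
```

Documents found by both retrievers rise to the top, while `doc_leave`, ranked first by only one retriever, still beats `doc_payroll`, ranked third by only one.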

The deeper question, whether facts should decay over time, is something I deferred. Right now, facts live forever. An Ebbinghaus-style forgetting curve would be more biologically realistic and would let the system gracefully handle staleness, but designing the decay function carefully matters: you don’t want the system to forget the facts a user actually relies on. v1.0 punts; v2.0 will revisit.
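
To show what such a decay function might look like, here is one possible shape, purely as a thought experiment: v1.0 does none of this. Retention follows the exponential form R = exp(-t / S), and each use of a fact multiplies its stability S so that facts a user relies on fade more slowly. The function names, the 30-day base stability, and the doubling factor are all hypothetical.

```python
import math


def retention(age_days: float, stability_days: float) -> float:
    """Ebbinghaus-style exponential retention: R = exp(-t / S)."""
    return math.exp(-age_days / stability_days)


def stability_after_use(base_days: float, uses: int, growth: float = 2.0) -> float:
    """Each use multiplies stability by `growth` (an assumed factor)."""
    return base_days * growth ** uses


unused = retention(age_days=30, stability_days=stability_after_use(30, uses=0))
relied_on = retention(age_days=30, stability_days=stability_after_use(30, uses=3))
print(round(unused, 3), round(relied_on, 3))  # 0.368 0.882
```

After 30 days an untouched fact retains about 37% of its weight, while one used three times retains about 88%, which is the behavior the paragraph above asks for. Picking the base stability and growth factor is exactly the design work that remains open.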


What v1.0 is NOT

I want to be specific about scope. Software gets harder to use when its scope is overstated, and the v1.0 / v2.0 distinction matters.

v1.0 is a chat agent over your documents and codebase. It can ingest a folder of markdown / code / text files, answer questions about the content, learn from cloud escalations, and grow its memory over time.

v1.0 is not:

  • An agent with tool execution. You can’t (yet) ask it to run a terminal command, edit a file, or hit an API. Tool execution is in v2.0.
  • A skill-learning agent. It learns facts from cloud responses. It doesn’t yet learn procedures — patterns of action you want it to repeat. Skill extraction is also v2.0.
  • A drop-in replacement for Cursor/Aider/Claude Code. v1.5 will ship an autodidact serve proxy that exposes an OpenAI-compatible API; until then, integration with those tools is manual.
  • An MCP server. Coming in v2.0 alongside the tool execution layer.

The roadmap is honest about which of these ships when:

Version  What                                                                          Status
v1.0.x   Hybrid retrieval, document synthesis, 5 setup modes, confidence routing       Shipping now
v1.5     Topic pages, USearch, parallel retrieval, autodidact serve proxy              In progress
v2.0     Tool execution, skill learning, tiered routing, reranking, self-verification  Designed
v3.0     Agent network: agents teaching each other                                     Planned

I’d rather ship a small thing that works than a big thing that nearly works.

A note on the broader landscape: local-first AI with cloud fallback is a pattern that’s emerging across the ecosystem — Claude Code with local LLMs, LM Studio with router scripts, various wrapper tools. Autodidact does what those don’t: it learns from each cloud query so the next similar one stays local. The routing is a means to an end; the learning loop is the point.


What I want from readers

Three asks, in increasing order of commitment:

Try it. pip install autodidact && autodidact init && autodidact chat. Three minutes of your time. Tell me when it does something stupid. Tell me when it surprises you. Both are useful.

Tell me when the routing is wrong. v1.0 has known limitations: the local model occasionally answers confidently when it shouldn’t (false negative on escalation) and occasionally escalates when it didn’t need to (false positive). If you find a clear case where the routing decision was wrong, file an issue with the query. That’s how v1.1 gets better.

Help build v1.5 / v2.0. The README has a “Good first issues” list:

  • autodidact serve - OpenAI-compatible proxy mode, drop-in for Cursor / Aider
  • MCP server for Claude Desktop / Cursor / Gemini CLI
  • Topic-based knowledge pages (a v1.5 core feature — the architecture is in docs/DESIGN-V2.md)
  • Skill extraction from cloud responses (the harder, more interesting v2.0 problem)
  • Cross-encoder reranking on retrieval candidates

If any of those interest you and you want context, open an issue or DM me. The codebase is small enough to onboard a contributor in an afternoon.


Closing

Most AI products today pay frontier-model prices for every query. Autodidact pays them once.

The thesis is simple. The implementation surfaced more interesting problems than I expected. v1.0 is what I have so far - small, scoped, honest about what it does and doesn’t do. If it makes one developer’s day a little cheaper or a little smarter, the project is worth it; if it sparks a conversation that produces a better v2.0, even better.

I’m at paulnng@icloud.com and on LinkedIn. The code is at github.com/BuffaloTechRider/Autodidact.

Build the thing that bothers you about the way things work. That’s how the next thing gets made.