Autodidact — A Self-Evolving AI Agent

A self-evolving AI agent that learns like a new employee.
pip install autodidact

Overview

Autodidact is an open-source AI agent built around a simple idea: most queries don’t need a frontier model. A local model on your machine can answer them in milliseconds, for free, and the agent escalates to a cloud model only when the local answer is uncertain. Each escalation becomes permanent local knowledge, so the next similar query is answered locally, at zero cost.

The thesis is grounded in published research. The routing layer is built on findings from my paper on zero-shot confidence estimation, which showed that average token log-probability matches or beats supervised routing baselines (RouteLLM) — without per-model training data.

Status: v1.0.x shipping on PyPI. v1.5 (topic pages, OpenAI-compatible proxy, parallel retrieval) in progress. v2.0 (tool execution, skill learning) designed. v3.0 (agent network) planned.

How It Works

Query → Think  (check memory)
      → Try    (local model answers if confident)
      → Ask    (escalate to cloud when uncertain)
      → Learn  (store the answer for next time)
      ──────────────────────────────────────────
      Next similar query → Answer from memory, $0.00

Every response shows the reasoning in real time:

You> What's our PTO policy?
[THINKING] Checking memory... no relevant entries (0 hits)
[CLOUD] Escalated — gpt-4o-mini took 1.2s
[LEARNED] ✅ Stored: "Company PTO is 20 days per year, accrued monthly."
  💰 $0.003 | Confidence: 0.34 → escalated | ✅ Learned

You> How much vacation do I get?
[THINKING] Checking memory... found 1 similar entry (similarity: 0.91)
[MEMORY] Company PTO is 20 days per year, accrued monthly.
  💰 $0.00 | Confidence: 0.91 | Route: memory

That’s the magic moment: the agent answers the second question from memory, for free, because it kept the answer it learned from the escalation moments earlier, even though the question was worded differently.
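
The Think, Try, Ask, Learn loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Autodidact's actual code: the Agent class, both thresholds, and the string-similarity recall are hypothetical names and values chosen for the example. In particular, difflib string similarity stands in for real embedding search, so it only matches near-identical wording, not paraphrases like the PTO/vacation pair in the transcript.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, Optional


@dataclass
class Answer:
    text: str
    route: str         # "memory", "local", or "cloud"
    confidence: float  # similarity for memory hits, model confidence otherwise


@dataclass
class Agent:
    # Hypothetical interfaces: the local model returns (answer, confidence),
    # the cloud model returns just an answer.
    local_model: Callable[[str], tuple]
    cloud_model: Callable[[str], str]
    memory: dict = field(default_factory=dict)  # stored query -> learned answer
    memory_threshold: float = 0.8               # assumed value
    confidence_threshold: float = 0.6           # assumed value

    def _recall(self, query: str) -> tuple:
        """Return (best stored query or None, similarity in [0, 1])."""
        best: Optional[str] = None
        best_sim = 0.0
        for stored in self.memory:
            sim = SequenceMatcher(None, query.lower(), stored.lower()).ratio()
            if sim > best_sim:
                best, best_sim = stored, sim
        return best, best_sim

    def ask(self, query: str) -> Answer:
        # Think: check memory first.
        hit, sim = self._recall(query)
        if hit is not None and sim >= self.memory_threshold:
            return Answer(self.memory[hit], "memory", sim)
        # Try: let the local model answer if it is confident enough.
        text, conf = self.local_model(query)
        if conf >= self.confidence_threshold:
            return Answer(text, "local", conf)
        # Ask: escalate to the cloud model.
        text = self.cloud_model(query)
        # Learn: store the cloud answer so the next similar query stays local.
        self.memory[query] = text
        return Answer(text, "cloud", conf)


if __name__ == "__main__":
    agent = Agent(
        local_model=lambda q: ("not sure", 0.34),
        cloud_model=lambda q: "Company PTO is 20 days per year, accrued monthly.",
    )
    print(agent.ask("What's our PTO policy?").route)  # cloud
    print(agent.ask("what's our PTO policy?").route)  # memory
```

The ordering is the point of the sketch: memory is consulted before any model runs, and the only path that costs money is also the only path that writes to memory.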

What’s in v1.0.x

Zero-friction setup wizard. Auto-detects Ollama, pulls models, starts the daemon, retries on failure. Presets for 10 cloud providers.
Five setup modes. Local+Cloud, Cloud+Cloud (no GPU), Local+Local (offline learning), Custom server (llama.cpp / LM Studio / vLLM), Local-only.
Hybrid retrieval. BM25 keyword search (FTS5) + vector similarity (FAISS), merged via Reciprocal Rank Fusion. Finds documents by exact terms and meaning.
Document synthesis. autodidact learn doesn’t just index — it extracts key facts into memory in the background. The agent answers from internalized knowledge, not raw chunks.
Confidence-based routing. GSA pre-screen + logprob uncertainty + refusal detection. The escalation decision uses the signals validated in my paper.
Learning from escalations. Structured knowledge extraction from cloud responses, deduplication on insert, all in the background.
Visible learning UX. [THINKING], [MEMORY], [LOCAL], [CLOUD], [LEARNED] tags surface what the agent is doing and why.
Cost tracking. autodidact savings reports cumulative cost avoided versus an all-cloud baseline.
Local-first. All state in one portable SQLite file. Works offline after setup.
Multi-provider. Ollama, any OpenAI-compatible server, AWS Bedrock. Ten cloud-provider presets.
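
To make the confidence-based routing item concrete, here is a rough sketch of an escalation check built from two of the signals named above: token log-probabilities and refusal detection. Everything in it is an assumption for illustration. The function names, the refusal phrases, the 0.6 threshold, and the choice to map the mean log-probability through exp() are hypothetical, and the GSA pre-screen is left out because its definition is not given here.

```python
import math

# Hypothetical refusal phrases; a real detector would likely be broader.
REFUSAL_MARKERS = ("i don't know", "i'm not sure", "i cannot", "i can't")


def mean_logprob_confidence(token_logprobs: list) -> float:
    """Map the average token log-probability to a score in [0, 1].

    exp(mean logprob) is the geometric mean of the token probabilities.
    An empty list is treated as zero confidence.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def looks_like_refusal(text: str) -> bool:
    """Return True if the answer contains any known refusal phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def should_escalate(text: str, token_logprobs: list, threshold: float = 0.6) -> bool:
    """Escalate on an explicit refusal or on low average token confidence."""
    if looks_like_refusal(text):
        return True
    return mean_logprob_confidence(token_logprobs) < threshold
```

Because the score comes straight from the log-probabilities the local model already emits, a check like this needs no per-model training data, which is the property the routing thesis relies on.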

The Interesting Engineering Problems

Building this surfaced design questions that aren’t in the textbooks:

Confidence calibration is hard. Self-reported model confidence is poorly calibrated. The paper behind this project established empirically which signals work and which don’t: logprob_uncertainty ended up as the dominant signal (AUROC 0.65–0.83 across 3 model families × 2 datasets).
Hybrid retrieval beats either approach alone. BM25 catches exact-term matches that embeddings miss; embeddings catch semantic matches that BM25 misses. Reciprocal Rank Fusion combines them without tuning weights.
Document synthesis vs. raw chunks. Storing extracted facts produces noticeably better answers than dumping raw chunks into context. The synthesis happens in the background so it doesn’t block ingestion.
Cold-start matters more than the routing. A brand-new agent with an empty brain is useless. autodidact learn <path> solves this by ingesting documents up front; the routing and learning loop then takes over.
Failure recovery in setup is the user experience. If Ollama isn’t installed, isn’t running, or the model isn’t pulled, the wizard handles each case with a clear retry path. Most of the engineering effort in v1.0 went into making the first three minutes of pip install autodidact reliable.
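
The hybrid retrieval point is easiest to see in code. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in, so it needs only ranks, not comparable scores. The sketch below is a generic version of the technique, not Autodidact's implementation: the function name and document IDs are made up, and k = 60 is the commonly used default rather than a value confirmed by this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken alphabetically for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical results from the keyword and vector retrievers.
    bm25_hits = ["pto_policy", "holiday_calendar", "expense_policy"]
    vector_hits = ["leave_faq", "pto_policy", "holiday_calendar"]
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
    # ['pto_policy', 'holiday_calendar', 'leave_faq', 'expense_policy']
```

Documents that both retrievers return rise to the top, while a document found by only one retriever still survives in the fused list. That is why no weight tuning between BM25 scores and cosine similarities is required.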

Roadmap

Version	What	Status
v1.0.x	Hybrid retrieval, document synthesis, 5 setup modes, confidence routing	Current
v1.5	Topic-based knowledge pages, USearch, parallel retrieval, OpenAI-compatible proxy	In progress
v2.0	Tool execution, skill learning, tiered routing, cross-encoder reranking	Designed
v3.0	Agent network — agents teaching each other	Planned

Tech Stack

Python 3.10+ · SQLite (WAL mode) · FAISS · Pydantic v2 · Typer + Rich · Ollama / OpenAI-compatible / AWS Bedrock

Why This Matters

The architectural decisions here (local-first inference with cloud fallback, confidence-based routing, learning from escalation, graceful degradation) are the same patterns that show up in any safety-critical AI deployment. The thesis is that AI products should be deliberate about when to use a frontier model, transparent about why, and structured so that failures fall back to a working path rather than fail outright.

It’s the same discipline I’ve been practicing while shipping automotive AI, packaged as something an individual developer can install and run on their own machine.

Paul Nguyen