Appendix E: Bibliography
Canonical papers, books, and repositories referenced throughout the course, organized by category.
Foundational Papers
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Wei et al. (2022) — NeurIPS 2022. Established that step-by-step reasoning in the prompt dramatically improves LLM performance on complex tasks. The foundation for ReAct and most modern agent reasoning patterns. https://arxiv.org/abs/2201.11903
ReAct: Synergizing Reasoning and Acting in Language Models Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). Defined the Thought→Action→Observation loop that is the basis of the agent loop in Chapter 3. https://arxiv.org/abs/2210.03629
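The Thought→Action→Observation loop can be sketched in a few lines. This is a minimal illustration, not ReAct's actual prompt format: `call_model` and `run_tool` are stubs standing in for a real LLM call and tool executor, and the `Thought:`/`Action:`/`Final:` markers are assumed conventions.

```python
# Minimal sketch of the ReAct Thought->Action->Observation loop.
# call_model and run_tool are stubs; a real agent calls an LLM and real tools.

def call_model(transcript: str) -> str:
    # Stub: a real model emits a Thought plus either an Action or a Final answer.
    if "Observation:" in transcript:
        return "Thought: I have the result.\nFinal: 42"
    return "Thought: I need to look this up.\nAction: search[meaning of life]"

def run_tool(action: str) -> str:
    # Stub tool executor: parse "name[arg]" and return a canned observation.
    name, arg = action.split("[", 1)
    return f"result for {arg.rstrip(']')}"

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = call_model(transcript)          # Thought (+ Action or Final)
        transcript += "\n" + step
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        if "Action:" not in step:
            break                              # malformed step: stop the loop
        action = step.split("Action:", 1)[1].strip()
        transcript += f"\nObservation: {run_tool(action)}"  # feed result back
    return "gave up"

print(react_loop("meaning of life"))  # -> 42
```

The essential property is that the observation is appended to the transcript before the next model call, so each step conditions on all prior tool results.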
Self-Refine: Iterative Refinement with Self-Feedback Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). Formalised the generator→critic→refine loop. Directly motivates the two-agent pattern in Chapter 6. https://arxiv.org/abs/2303.17651
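The generator→critic→refine loop reduces to a simple control structure. The sketch below uses stubbed model calls (`generate`, `critique`, `refine` are placeholders, not the paper's prompts); the one real idea it shows is the stopping condition, where iteration ends when the critic has no feedback.

```python
# Sketch of the Self-Refine generator->critic->refine loop, with stubs
# in place of the three model calls.

def generate(task: str) -> str:
    return "draft v1"                     # stub generator

def critique(draft: str):
    # Stub critic: returns feedback, or None once satisfied.
    return None if draft.endswith("v3") else "make it better"

def refine(draft: str, feedback: str) -> str:
    version = int(draft.rsplit("v", 1)[1])
    return f"draft v{version + 1}"        # stub refiner: bump the version

def self_refine(task: str, max_rounds: int = 5) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:              # critic satisfied: stop iterating
            break
        draft = refine(draft, feedback)
    return draft

print(self_refine("write a poem"))  # -> draft v3
```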
Reflexion: Language Agents with Verbal Reinforcement Learning Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Introduced verbal reinforcement: agents reflect on failures and store reflections in memory to avoid repeating mistakes. Informs the self-improvement patterns in Chapters 6 and 11. https://arxiv.org/abs/2303.11366
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. — Meta (2023). Showed that models can learn to call APIs by self-supervising on their own generated examples. Influential for understanding tool-use at a conceptual level. https://arxiv.org/abs/2302.04761
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). An early multi-model orchestration system using ChatGPT as a controller. Demonstrated the orchestrator-worker pattern before it became standard. https://arxiv.org/abs/2303.17580
Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Introduced the skill library concept: an agent that builds a library of reusable, tested capabilities through self-directed exploration. Directly implemented in Chapter 11. https://arxiv.org/abs/2305.16291
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. — Microsoft (2023). Formalised multi-agent conversation as a programming model. The human proxy agent remains one of the cleanest HITL implementations. https://arxiv.org/abs/2308.08155
Mixture-of-Agents Enhances Large Language Model Capabilities Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., & Zou, J. (2024). Formalised the MoA pattern: multiple agents independently generate responses, then an aggregator synthesises them. Implemented in Chapter 8. https://arxiv.org/abs/2406.04692
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance Chen, L., Zaharia, M., & Zou, J. (2023). Introduced cascade-based routing across model tiers with quality gating. The classic cost-reduction baseline that motivates the heuristic and learned routers in Chapter 9 and Appendix: Routing Research. https://arxiv.org/abs/2305.05176
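The cascade idea is: try cheap models first and escalate only when a quality gate rejects the answer. A minimal sketch, with illustrative tier names and a stubbed scorer (FrugalGPT trains a small scoring model for the gate):

```python
# Sketch of a FrugalGPT-style cascade across model tiers with quality gating.
# Tier names, costs, and the scorer are illustrative stand-ins.

TIERS = [
    ("small-model",  0.001),   # (name, cost per call)
    ("medium-model", 0.01),
    ("large-model",  0.10),
]

def ask(model: str, prompt: str) -> str:
    # Stub: only the large model answers this prompt well.
    return "good answer" if model == "large-model" else "weak answer"

def quality_score(answer: str) -> float:
    # Stub quality gate; in FrugalGPT this is a learned scorer.
    return 0.9 if answer == "good answer" else 0.2

def cascade(prompt: str, threshold: float = 0.8):
    spent = 0.0
    for model, cost in TIERS:
        answer = ask(model, prompt)
        spent += cost
        if quality_score(answer) >= threshold:
            return answer, spent       # gate passed: stop escalating
    return answer, spent               # fall through: keep the top-tier answer

answer, spent = cascade("hard question")
print(answer, round(spent, 3))  # -> good answer 0.111
```

The cost saving comes from the easy queries that never reach the expensive tier; the worst case (as above) pays for every tier.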
Lost in the Middle: How Language Models Use Long Contexts Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). The empirical finding that LLMs tend to underweight information placed in the middle of long contexts relative to the beginning and end. Motivates the compaction and placement strategies in Chapter 9. https://arxiv.org/abs/2307.03172
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., et al. — Stanford (2023). Introduced the concept of programming (not prompting) language models via declarative signatures and automated optimisation. Referenced in Chapter 12. https://arxiv.org/abs/2310.03714
Constitutional AI: Harmlessness from AI Feedback Bai, Y., Kadavath, S., Kundu, S., Askell, A., et al. — Anthropic (2022). Introduced the technique of using a model to critique and revise its own outputs against a written constitution. Foundational for alignment-aware agent design in Chapter 10. https://arxiv.org/abs/2212.08073
Anthropic Sources
Building Effective Agents Anthropic (2024). The canonical taxonomy of six agentic workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and agents). The architectural backbone of this entire course. https://www.anthropic.com/research/building-effective-agents
Model Context Protocol (MCP) Specification Anthropic (2024). The open protocol for standardising model-tool interactions, built on JSON-RPC 2.0. Implemented in Chapter 4. https://modelcontextprotocol.io
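On the wire, an MCP tool invocation is an ordinary JSON-RPC 2.0 request. The `tools/call` method and params shape follow the MCP spec; the tool name and arguments below are illustrative:

```python
import json

# Sketch of the JSON-RPC 2.0 framing MCP uses for a tool invocation.
# "tools/call" and the name/arguments params follow the MCP spec;
# the specific tool (read_file) and its arguments are illustrative.

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_file",                     # illustrative tool name
        "arguments": {"path": "notes/todo.md"},  # illustrative arguments
    },
}
print(json.dumps(request))
```

The server replies with a JSON-RPC response carrying the same `id`, which is what lets clients multiplex concurrent tool calls over one connection.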
Anthropic Prompt Caching Documentation Anthropic (2024). The production reference for prompt caching implementation, cache hit rates, and cost modelling. Central to Chapter 2. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
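The core mechanic is a `cache_control` breakpoint that marks the large, stable prefix so later calls can reuse it. A sketch of the request shape, following the caching docs; the model id and prompt text are illustrative:

```python
# Sketch of a Messages API request body with a prompt-caching breakpoint.
# The cache_control marker on the system block tells the API to cache the
# prefix up to and including that block. Model id is illustrative.

request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LARGE STABLE SYSTEM PROMPT (instructions, examples)...",
            "cache_control": {"type": "ephemeral"},  # cache prefix ends here
        }
    ],
    # Only the per-call suffix below changes between requests, so the
    # cached prefix is reused and billed at the discounted cache-read rate.
    "messages": [{"role": "user", "content": "Question that changes per call"}],
}
```

The placement rule matters: everything before the breakpoint must be byte-identical across calls for a cache hit, which is why volatile content belongs after it.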
Claude Code Source Code (v2.1.100) Anthropic (open-sourced April 2026). The primary reference for all production patterns in this course. See README_SOTA.md for the full analysis. https://github.com/anthropics/claude-code
Benchmarks
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). Defined the gold-standard benchmark for software engineering agents. The Lite and Verified variants are used in Chapter 12. https://arxiv.org/abs/2310.06770
GAIA: A Benchmark for General AI Assistants Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. — Meta (2023). Defined a benchmark of real-world questions requiring multi-step reasoning, web access, and tool use. Used in Chapter 12. https://arxiv.org/abs/2311.12983
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). The canonical reference for LLM-as-judge methodology, including position bias analysis and mitigation strategies. Required reading before Chapter 7. https://arxiv.org/abs/2306.05685
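The standard mitigation for position bias in pairwise judging is to evaluate both orderings and count a win only when the verdicts agree. A sketch with a stubbed (deliberately biased) judge; the marker strings and scoring rule are illustrative, not MT-Bench's actual prompts:

```python
# Sketch of pairwise LLM-as-judge with the position-swap mitigation:
# judge (A, B) and (B, A); inconsistent verdicts are scored as a tie.
# `judge` is a stub with a built-in position bias for demonstration.

def judge(question: str, first: str, second: str) -> str:
    # Stub judge: prefers the first slot unless one answer is much longer
    # (a crude quality proxy standing in for a real model verdict).
    if len(second) > len(first) * 2:
        return "second"
    return "first"

def compare(question: str, a: str, b: str) -> str:
    v1 = judge(question, a, b)   # A in the first position
    v2 = judge(question, b, a)   # B in the first position
    if v1 == "first" and v2 == "second":
        return "A"               # A wins in both orderings
    if v1 == "second" and v2 == "first":
        return "B"               # B wins in both orderings
    return "tie"                 # verdicts flipped with position: call it a tie

print(compare("q", "short", "short too"))          # bias cancels out -> tie
print(compare("q", "ok", "a much longer answer"))  # -> B
```

The swap doubles judging cost but turns a systematic bias into a detectable inconsistency, which is the trade-off Zheng et al. analyse.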
Tooling & Infrastructure
Designing Data-Intensive Applications Kleppmann, M. (2017). O'Reilly Media. The definitive reference for distributed systems engineering: replication, partitioning, consistency, consensus, and stream processing. Chapters 5, 7, and 11 directly inform the persistence and crash-recovery patterns in Chapter 11 of this course.
Agent-to-Agent (A2A) Protocol Specification Google (2025). Open protocol for inter-agent communication across frameworks. Referenced in Chapter 12. https://google.github.io/A2A
OpenTelemetry Semantic Conventions for Generative AI OpenTelemetry (2024–2025). The gen_ai.* attribute conventions used throughout Chapter 7 and beyond. https://opentelemetry.io/docs/specs/semconv/gen-ai/
AutoGPT Richards, T. B. (2023) — GitHub. The first widely viral autonomous agent. Demonstrated both the public appetite for long-running autonomous agents and the failure modes of unconstrained tool use. https://github.com/Significant-Gravitas/AutoGPT
LangChain Chase, H. (2022–) — GitHub. The most widely adopted agent framework. Its evolution from chains to LangGraph tracks the maturation of agent architecture thinking. https://github.com/langchain-ai/langchain
LangGraph LangChain AI (2024–). The graph-based successor to LangChain's chain abstraction; the most mature framework for stateful agentic workflows. https://github.com/langchain-ai/langgraph
CrewAI Moura, J. (2024–). Role-based multi-agent framework; useful reference for comparing role specialisation patterns with Chapter 6's approach. https://github.com/crewAIInc/crewAI
OpenAI Agents SDK OpenAI (2025–). Production SDK with native MCP, Responses API, and Guardrails support. Surveyed in Appendix A. https://github.com/openai/openai-agents-python
Google Agent Development Kit (ADK) Google (2025–). Multi-agent framework with native A2A support. Surveyed in Appendix A. https://google.github.io/adk-docs/
Swarms (kyegomez) Gomez, K. (2023–). Large-scale swarm framework supporting many parallel topologies; useful for experiments at scale. https://github.com/kyegomez/swarms
Further Reading by Chapter
The original per-chapter "Further Reading" lists are consolidated below. Each entry is a single line of short pointers; full citations for most appear in the sections above. Follow up on a pointer when you want depth on that specific technique.
Chapter 1 — Raw Call
- Anthropic Messages API, API versioning, count_tokens endpoint, Anthropic pricing
- Attention Is All You Need (Vaswani 2017); Language Models Are Few-Shot Learners (Brown 2020); BPE tokenization (Sennrich 2016); The Illustrated Transformer
- tiktoken; httpx docs
Chapter 2 — Providers & Prompt Caching
- Anthropic prompt caching; OpenAI chat completions; LiteLLM; Gemini API; Groq API; Ollama API
- Designing Data-Intensive Applications (Kleppmann 2017) — Ch1 for abstraction layer reliability
Chapter 3 — Agent Loop, Tools & MCP
- ReAct (Yao 2022); Toolformer (Schick 2023); HuggingGPT (Shen 2023); Self-Refine (Madaan 2023)
- MCP spec; Anthropic tool-use docs; OpenAI function calling
- Sandboxing: seccomp-bpf; Poka-yoke: Zero Quality Control (Shingo 1986)
Chapter 4 — State & Collaboration
- MemGPT (Packer 2023); Reflexion (Shinn 2023); Voyager (Wang 2023)
- Agent-Context Protocol (Anthropic); Condorcet's jury theorem
Chapter 5 — Evaluation & Observability
- LLM-as-judge (Zheng 2023); HELM (Stanford); OpenTelemetry GenAI conventions
- Practitioner tools: LangSmith, Phoenix/Arize, DeepEval, Ragas
Chapter 6 — Orchestrator-Workers
Chapter 7 — Routing, Compaction & Guardrails
- Lost in the Middle (Liu 2023); Constitutional AI (Bai 2022); Prompt Injection attacks (Greshake 2023)
- Anthropic HITL patterns; OWASP LLM Top 10
Chapter 8 — Production, Skills & Plugins
- Voyager skill library (Wang 2023); Kafka (append-only logs); DDIA Ch5/7/11 (Kleppmann 2017)
- Claude Code plugins; Claude Code v2.1.100 source
- systemd service pattern
Chapter 9 — Capstone
- SWE-bench (Jimenez 2024); SWE-bench Verified; GAIA (Mialon 2023); TAU-bench (Yao 2024)
- DSPy; A2A protocol; Agentic RL survey
- Anthropic's Six Agentic Patterns — canonical taxonomy this book maps onto