Appendix E: Bibliography
Canonical papers, books, and repositories referenced throughout the course, organized by category.
Foundational Papers
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Wei et al. (2022) — NeurIPS 2022. Established that step-by-step reasoning in the prompt dramatically improves LLM performance on complex tasks. The foundation for ReAct and most modern agent reasoning patterns. https://arxiv.org/abs/2201.11903
ReAct: Synergizing Reasoning and Acting in Language Models Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). Defined the Thought→Action→Observation loop that is the basis of the agent loop in Chapter 3. https://arxiv.org/abs/2210.03629
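The Thought→Action→Observation loop can be sketched in a few lines. This is a minimal illustration, not ReAct's actual prompt format: `call_model` and `run_tool` are stubs standing in for a real LLM call and tool executor, and the `Thought:`/`Action:`/`Final:` markers are assumed conventions.

```python
# Minimal sketch of the ReAct Thought->Action->Observation loop.
# call_model and run_tool are stubs; a real agent calls an LLM and real tools.

def call_model(transcript: str) -> str:
    # Stub: a real model emits a Thought plus either an Action or a Final answer.
    if "Observation:" in transcript:
        return "Thought: I have the result.\nFinal: 42"
    return "Thought: I need to look this up.\nAction: search[meaning of life]"

def run_tool(action: str) -> str:
    # Stub tool executor: parse "name[arg]" and return a canned observation.
    name, arg = action.split("[", 1)
    return f"result for {arg.rstrip(']')}"

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = call_model(transcript)          # Thought (+ Action or Final)
        transcript += "\n" + step
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        if "Action:" not in step:
            break                              # malformed step: stop the loop
        action = step.split("Action:", 1)[1].strip()
        transcript += f"\nObservation: {run_tool(action)}"  # feed result back
    return "gave up"

print(react_loop("meaning of life"))  # -> 42
```

The essential property is that the observation is appended to the transcript before the next model call, so each step conditions on all prior tool results.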
Self-Refine: Iterative Refinement with Self-Feedback Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). Formalised the generator→critic→refine loop. Directly motivates the two-agent pattern in Chapter 6. https://arxiv.org/abs/2303.17651
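The generator→critic→refine loop reduces to a simple control structure. The sketch below uses stubbed model calls (`generate`, `critique`, `refine` are placeholders, not the paper's prompts); the one real idea it shows is the stopping condition, where iteration ends when the critic has no feedback.

```python
# Sketch of the Self-Refine generator->critic->refine loop, with stubs
# in place of the three model calls.

def generate(task: str) -> str:
    return "draft v1"                     # stub generator

def critique(draft: str):
    # Stub critic: returns feedback, or None once satisfied.
    return None if draft.endswith("v3") else "make it better"

def refine(draft: str, feedback: str) -> str:
    version = int(draft.rsplit("v", 1)[1])
    return f"draft v{version + 1}"        # stub refiner: bump the version

def self_refine(task: str, max_rounds: int = 5) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:              # critic satisfied: stop iterating
            break
        draft = refine(draft, feedback)
    return draft

print(self_refine("write a poem"))  # -> draft v3
```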
Reflexion: Language Agents with Verbal Reinforcement Learning Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Introduced verbal reinforcement: agents reflect on failures and store reflections in memory to avoid repeating mistakes. Informs the self-improvement patterns in Chapters 6 and 11. https://arxiv.org/abs/2303.11366
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. — Meta (2023). Showed that models can learn to call APIs by self-supervising on their own generated examples. Influential for understanding tool-use at a conceptual level. https://arxiv.org/abs/2302.04761
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). An early multi-model orchestration system using ChatGPT as a controller. Demonstrated the orchestrator-worker pattern before it became standard. https://arxiv.org/abs/2303.17580
Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Introduced the skill library concept: an agent that builds a library of reusable, tested capabilities through self-directed exploration. Directly implemented in Chapter 11. https://arxiv.org/abs/2305.16291
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. — Microsoft (2023). Formalised multi-agent conversation as a programming model. The human proxy agent remains one of the cleanest HITL implementations. https://arxiv.org/abs/2308.08155
Mixture-of-Agents Enhances Large Language Model Capabilities Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., & Zou, J. (2024). Formalised the MoA pattern: multiple agents independently generate responses, then an aggregator synthesises them. Implemented in Chapter 8. https://arxiv.org/abs/2406.04692
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance Chen, L., Zaharia, M., & Zou, J. (2023). Introduced cascade-based routing across model tiers with quality gating. The classic cost-reduction baseline that motivates the heuristic and learned routers in Chapter 9 and Appendix: Routing Research. https://arxiv.org/abs/2305.05176
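The cascade idea is: try cheap models first and escalate only when a quality gate rejects the answer. A minimal sketch, with illustrative tier names and a stubbed scorer (FrugalGPT trains a small scoring model for the gate):

```python
# Sketch of a FrugalGPT-style cascade across model tiers with quality gating.
# Tier names, costs, and the scorer are illustrative stand-ins.

TIERS = [
    ("small-model",  0.001),   # (name, cost per call)
    ("medium-model", 0.01),
    ("large-model",  0.10),
]

def ask(model: str, prompt: str) -> str:
    # Stub: only the large model answers this prompt well.
    return "good answer" if model == "large-model" else "weak answer"

def quality_score(answer: str) -> float:
    # Stub quality gate; in FrugalGPT this is a learned scorer.
    return 0.9 if answer == "good answer" else 0.2

def cascade(prompt: str, threshold: float = 0.8):
    spent = 0.0
    for model, cost in TIERS:
        answer = ask(model, prompt)
        spent += cost
        if quality_score(answer) >= threshold:
            return answer, spent       # gate passed: stop escalating
    return answer, spent               # fall through: keep the top-tier answer

answer, spent = cascade("hard question")
print(answer, round(spent, 3))  # -> good answer 0.111
```

The cost saving comes from the easy queries that never reach the expensive tier; the worst case (as above) pays for every tier.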
Lost in the Middle: How Language Models Use Long Contexts Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). The empirical finding that LLMs tend to underweight information placed in the middle of long contexts relative to the beginning and end. Motivates the compaction and placement strategies in Chapter 9. https://arxiv.org/abs/2307.03172
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., et al. — Stanford (2023). Introduced the concept of programming (not prompting) language models via declarative signatures and automated optimisation. Referenced in Chapter 12. https://arxiv.org/abs/2310.03714
Constitutional AI: Harmlessness from AI Feedback Bai, Y., Kadavath, S., Kundu, S., Askell, A., et al. — Anthropic (2022). Introduced the technique of using a model to critique and revise its own outputs against a written constitution. Foundational for alignment-aware agent design in Chapter 10. https://arxiv.org/abs/2212.08073
Anthropic Sources
Building Effective Agents Anthropic (2024). The canonical taxonomy of six agentic workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and agents). The architectural backbone of this entire course. https://www.anthropic.com/research/building-effective-agents
Model Context Protocol (MCP) Specification Anthropic (2024). The open protocol for standardising model-tool interactions, built on JSON-RPC 2.0. Implemented in Chapter 4. https://modelcontextprotocol.io
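On the wire, an MCP tool invocation is an ordinary JSON-RPC 2.0 request. The `tools/call` method and params shape follow the MCP spec; the tool name and arguments below are illustrative:

```python
import json

# Sketch of the JSON-RPC 2.0 framing MCP uses for a tool invocation.
# "tools/call" and the name/arguments params follow the MCP spec;
# the specific tool (read_file) and its arguments are illustrative.

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_file",                     # illustrative tool name
        "arguments": {"path": "notes/todo.md"},  # illustrative arguments
    },
}
print(json.dumps(request))
```

The server replies with a JSON-RPC response carrying the same `id`, which is what lets clients multiplex concurrent tool calls over one connection.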
Anthropic Prompt Caching Documentation Anthropic (2024). The production reference for prompt caching implementation, cache hit rates, and cost modelling. Central to Chapter 2. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
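The core mechanic is a `cache_control` breakpoint that marks the large, stable prefix so later calls can reuse it. A sketch of the request shape, following the caching docs; the model id and prompt text are illustrative:

```python
# Sketch of a Messages API request body with a prompt-caching breakpoint.
# The cache_control marker on the system block tells the API to cache the
# prefix up to and including that block. Model id is illustrative.

request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LARGE STABLE SYSTEM PROMPT (instructions, examples)...",
            "cache_control": {"type": "ephemeral"},  # cache prefix ends here
        }
    ],
    # Only the per-call suffix below changes between requests, so the
    # cached prefix is reused and billed at the discounted cache-read rate.
    "messages": [{"role": "user", "content": "Question that changes per call"}],
}
```

The placement rule matters: everything before the breakpoint must be byte-identical across calls for a cache hit, which is why volatile content belongs after it.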
Claude Code Source Code (v2.1.100) Anthropic (open-sourced April 2026). The primary reference for all production patterns in this course. See README_SOTA.md for the full analysis. https://github.com/anthropics/claude-code
Benchmarks
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). Defined the gold-standard benchmark for software engineering agents. The Lite and Verified variants are used in Chapter 12. https://arxiv.org/abs/2310.06770
GAIA: A Benchmark for General AI Assistants Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. — Meta (2023). Defined a benchmark of real-world questions requiring multi-step reasoning, web access, and tool use. Used in Chapter 12. https://arxiv.org/abs/2311.12983
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). The canonical reference for LLM-as-judge methodology, including position bias analysis and mitigation strategies. Required reading before Chapter 7. https://arxiv.org/abs/2306.05685
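The standard mitigation for position bias in pairwise judging is to evaluate both orderings and count a win only when the verdicts agree. A sketch with a stubbed (deliberately biased) judge; the marker strings and scoring rule are illustrative, not MT-Bench's actual prompts:

```python
# Sketch of pairwise LLM-as-judge with the position-swap mitigation:
# judge (A, B) and (B, A); inconsistent verdicts are scored as a tie.
# `judge` is a stub with a built-in position bias for demonstration.

def judge(question: str, first: str, second: str) -> str:
    # Stub judge: prefers the first slot unless one answer is much longer
    # (a crude quality proxy standing in for a real model verdict).
    if len(second) > len(first) * 2:
        return "second"
    return "first"

def compare(question: str, a: str, b: str) -> str:
    v1 = judge(question, a, b)   # A in the first position
    v2 = judge(question, b, a)   # B in the first position
    if v1 == "first" and v2 == "second":
        return "A"               # A wins in both orderings
    if v1 == "second" and v2 == "first":
        return "B"               # B wins in both orderings
    return "tie"                 # verdicts flipped with position: call it a tie

print(compare("q", "short", "short too"))          # bias cancels out -> tie
print(compare("q", "ok", "a much longer answer"))  # -> B
```

The swap doubles judging cost but turns a systematic bias into a detectable inconsistency, which is the trade-off Zheng et al. analyse.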
Tooling & Infrastructure
Designing Data-Intensive Applications Kleppmann, M. (2017). O'Reilly Media. The definitive reference for distributed systems engineering: replication, partitioning, consistency, consensus, and stream processing. Chapters 5, 7, and 11 directly inform the persistence and crash-recovery patterns in Chapter 11 of this course.
Agent-to-Agent (A2A) Protocol Specification Google (2025). Open protocol for inter-agent communication across frameworks. Referenced in Chapter 12. https://google.github.io/A2A
OpenTelemetry Semantic Conventions for Generative AI OpenTelemetry (2024–2025). The gen_ai.* attribute conventions used throughout Chapter 7 and beyond. https://opentelemetry.io/docs/specs/semconv/gen-ai/
AutoGPT Richards, T. B. (2023) — GitHub. The first widely viral autonomous agent. Demonstrated both the public appetite for long-running autonomous agents and the failure modes of unconstrained tool use. https://github.com/Significant-Gravitas/AutoGPT
LangChain Chase, H. (2022–) — GitHub. The most widely adopted agent framework. Its evolution from chains to LangGraph tracks the maturation of agent architecture thinking. https://github.com/langchain-ai/langchain
LangGraph LangChain AI (2024–). The graph-based successor to LangChain's chain abstraction; the most mature framework for stateful agentic workflows. https://github.com/langchain-ai/langgraph
CrewAI Moura, J. (2024–). Role-based multi-agent framework; useful reference for comparing role specialisation patterns with Chapter 6's approach. https://github.com/crewAIInc/crewAI
OpenAI Agents SDK OpenAI (2025–). Production SDK with native MCP, Responses API, and Guardrails support. Surveyed in Appendix A. https://github.com/openai/openai-agents-python
Google Agent Development Kit (ADK) Google (2025–). Multi-agent framework with native A2A support. Surveyed in Appendix A. https://google.github.io/adk-docs/
Swarms (kyegomez) Gomez, K. (2023–). Large-scale swarm framework supporting many parallel topologies; useful for experiments at scale. https://github.com/kyegomez/swarms
Further Reading by Chapter
The original per-chapter "Further Reading" lists are consolidated below. Each entry is a single line of short pointers; full citations for most appear in the sections above. Follow up on a pointer when you want depth on that specific technique.
Chapter 1 — Raw Call
- Anthropic Messages API, API versioning, count_tokens endpoint, Anthropic pricing
- Attention Is All You Need (Vaswani 2017); Language Models Are Few-Shot Learners (Brown 2020); BPE tokenization (Sennrich 2016); The Illustrated Transformer
- tiktoken; httpx docs
Chapter 2 — Providers & Prompt Caching
- Anthropic prompt caching; OpenAI chat completions; LiteLLM; Gemini API; Groq API; Ollama API
- Designing Data-Intensive Applications (Kleppmann 2017) — Ch1 for abstraction layer reliability
Chapter 3 — Agent Loop, Tools & MCP
- ReAct (Yao 2022); Toolformer (Schick 2023); HuggingGPT (Shen 2023); Self-Refine (Madaan 2023)
- MCP spec; Anthropic tool-use docs; OpenAI function calling
- Sandboxing: seccomp-bpf; Poka-yoke: Zero Quality Control (Shingo 1986)
Chapter 4 — State & Collaboration
- MemGPT (Packer 2023); Reflexion (Shinn 2023); Voyager (Wang 2023)
- Agent-Context Protocol (Anthropic); Condorcet's jury theorem
Chapter 5 — Evaluation & Observability
- LLM-as-judge (Zheng 2023); HELM (Stanford); OpenTelemetry GenAI conventions
- Practitioner tools: LangSmith, Phoenix/Arize, DeepEval, Ragas
Chapter 6 — Orchestrator-Workers
Chapter 7 — Routing, Compaction & Guardrails
- Lost in the Middle (Liu 2023); Constitutional AI (Bai 2022); Prompt Injection attacks (Greshake 2023)
- Anthropic HITL patterns; OWASP LLM Top 10
Chapter 8 — Production, Skills & Plugins
- Voyager skill library (Wang 2023); Kafka (append-only logs); DDIA Ch5/7/11 (Kleppmann 2017)
- Claude Code plugins; Claude Code v2.1.100 source
- systemd service pattern
Chapter 9 — Capstone
- SWE-bench (Jimenez 2024); SWE-bench Verified; GAIA (Mialon 2023); TAU-bench (Yao 2024)
- DSPy; A2A protocol; Agentic RL survey
- Anthropic's Six Agentic Patterns — canonical taxonomy this book maps onto