
Appendix E: Bibliography#

Canonical papers, books, and repositories referenced throughout the course, organized by category.


Foundational Papers#

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022) — NeurIPS 2022. Established that step-by-step reasoning in the prompt dramatically improves LLM performance on complex tasks. The foundation for ReAct and most modern agent reasoning patterns. https://arxiv.org/abs/2201.11903


ReAct: Synergizing Reasoning and Acting in Language Models Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). Defined the Thought→Action→Observation loop that is the basis of the agent loop in Chapter 3. https://arxiv.org/abs/2210.03629
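
The Thought→Action→Observation loop can be sketched in a few lines. Here `fake_model` is a stub standing in for the LLM call, and `lookup` is a toy tool; a real implementation parses the model's free-text `Action:` lines the same way.

```python
def lookup(city):
    # toy "weather API" tool
    return {"Paris": "18C, cloudy"}.get(city, "unknown")

TOOLS = {"lookup": lookup}

def fake_model(history):
    # Stand-in for the LLM: act once, then answer from the observation.
    observations = [h for h in history if h.startswith("Observation:")]
    if not observations:
        return "Thought: I need the weather.\nAction: lookup[Paris]"
    return "Final Answer: " + observations[-1].removeprefix("Observation: ")

def react(question, max_steps=5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = fake_model(history)
        history.append(step)
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # parse "Action: tool[arg]" and feed the tool result back in
        tool, arg = step.split("Action:", 1)[1].strip().rstrip("]").split("[", 1)
        history.append(f"Observation: {TOOLS[tool](arg)}")
    return None

answer = react("What is the weather in Paris?")
```

The `max_steps` cap is the standard guard against a model that never emits a final answer.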


Self-Refine: Iterative Refinement with Self-Feedback Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). Formalised the generator→critic→refine loop. Directly motivates the two-agent pattern in Chapter 6. https://arxiv.org/abs/2303.17651
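
The generator→critic→refine loop reduces to a short control structure. The three stubs below stand in for three differently prompted LLM calls; the critic signals convergence by returning no feedback.

```python
def generate(task):
    return "draft: " + task            # first attempt (stub for an LLM call)

def critique(text):
    # critic stub: returns feedback, or None when satisfied
    return None if text.endswith("!") else "end with an exclamation mark"

def refine(text, feedback):
    return text + "!"                  # refiner stub: applies the feedback

def self_refine(task, max_rounds=3):
    output = generate(task)
    for _ in range(max_rounds):
        feedback = critique(output)
        if feedback is None:           # critic satisfied: stop iterating
            break
        output = refine(output, feedback)
    return output

result = self_refine("summarise the report")
```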


Reflexion: Language Agents with Verbal Reinforcement Learning Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Introduced verbal reinforcement: agents reflect on failures and store reflections in memory to avoid repeating mistakes. Informs the self-improvement patterns in Chapters 6 and 11. https://arxiv.org/abs/2303.11366


Toolformer: Language Models Can Teach Themselves to Use Tools Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. — Meta (2023). Showed that models can learn to call APIs by self-supervising on their own generated examples. Influential for understanding tool-use at a conceptual level. https://arxiv.org/abs/2302.04761


HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). An early multi-model orchestration system using ChatGPT as a controller. Demonstrated the orchestrator-worker pattern before it became standard. https://arxiv.org/abs/2303.17580



Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Introduced the skill library concept: an agent that builds a library of reusable, tested capabilities through self-directed exploration. Directly implemented in Chapter 11. https://arxiv.org/abs/2305.16291


AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. — Microsoft (2023). Formalised multi-agent conversation as a programming model. The human proxy agent remains one of the cleanest HITL implementations. https://arxiv.org/abs/2308.08155


Mixture-of-Agents Enhances Large Language Model Capabilities Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., & Zou, J. (2024). Formalised the MoA pattern: multiple agents independently generate responses, then an aggregator synthesises them. Implemented in Chapter 8. https://arxiv.org/abs/2406.04692
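
The MoA pattern in miniature: several proposer stubs answer independently, then an aggregator synthesises a final answer from all proposals. The paper's aggregator is itself an LLM prompted with every proposal; the majority vote below is a deliberately simplified stand-in.

```python
from collections import Counter

def proposer_a(q): return "Paris"
def proposer_b(q): return "Paris, the capital"
def proposer_c(q): return "paris"

def aggregate(question, proposals):
    # Simplified aggregation: normalise and majority-vote. The MoA
    # paper instead feeds all proposals to an aggregator LLM.
    votes = Counter(p.split(",")[0].strip().lower() for p in proposals)
    return votes.most_common(1)[0][0]

def mixture_of_agents(question, proposers):
    proposals = [p(question) for p in proposers]   # independent generations
    return aggregate(question, proposals)

answer = mixture_of_agents("Capital of France?",
                           [proposer_a, proposer_b, proposer_c])
```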


FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance Chen, L., Zaharia, M., & Zou, J. (2023). Introduced cascade-based routing across model tiers with quality gating. The classic cost-reduction baseline that motivates the heuristic and learned routers in Chapter 9 and Appendix: Routing Research. https://arxiv.org/abs/2305.05176
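
The cascade idea can be sketched as a loop over model tiers with a quality gate: try the cheapest model first and escalate only when its answer fails the gate. Tiers, prices, and the quality scores here are illustrative stubs; FrugalGPT uses a learned scorer.

```python
TIERS = [
    # (name, cost per call, model stub returning (answer, quality score))
    ("small",  0.001, lambda q: ("maybe 4", 0.4)),
    ("medium", 0.01,  lambda q: ("4", 0.7)),
    ("large",  0.1,   lambda q: ("4", 0.95)),
]

def cascade(question, threshold=0.6):
    spent = 0.0
    for name, cost, model in TIERS:
        answer, quality = model(question)
        spent += cost
        if quality >= threshold:       # quality gate passed: stop escalating
            return answer, name, spent
    return answer, name, spent         # fell through to the largest tier

answer, tier, cost = cascade("What is 2 + 2?")
```

Here the medium tier clears the gate, so the expensive tier is never called.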


Lost in the Middle: How Language Models Use Long Contexts Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). The empirical finding that LLMs tend to underweight information placed in the middle of long contexts relative to the beginning and end. Motivates the compaction and placement strategies in Chapter 9. https://arxiv.org/abs/2307.03172
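
One placement strategy this finding motivates: order retrieved documents so the highest-scored ones sit at the edges of the context and the weakest are buried in the middle. A minimal sketch (the helper name and alternation scheme are this sketch's own):

```python
def edge_order(docs_by_score):
    """docs_by_score is sorted best-first. Alternate the best documents
    to the front and back of the context, pushing the weakest to the
    middle, where models attend least."""
    front, back = [], []
    for i, doc in enumerate(docs_by_score):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["d1", "d2", "d3", "d4", "d5"]   # already sorted best-first
ordered = edge_order(docs)
# "d1" and "d2" land at the edges; "d5" ends up in the middle
```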


DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., & Potts, C. — Stanford (2023). Introduced the concept of programming (not prompting) language models via declarative signatures and automated optimisation. Referenced in Chapter 12. https://arxiv.org/abs/2310.03714


Constitutional AI: Harmlessness from AI Feedback Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., . . . Kaplan, J. — Anthropic (2022). Introduced the technique of using a model to critique and revise its own outputs against a written constitution. Foundational for alignment-aware agent design in Chapter 10. https://arxiv.org/abs/2212.08073


Anthropic Sources#

Building Effective Agents Anthropic (2024). The canonical taxonomy of agentic patterns: five workflows (prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer) plus autonomous agents. The architectural backbone of this entire course. https://www.anthropic.com/research/building-effective-agents


Model Context Protocol (MCP) Specification Anthropic (2024). The open protocol for standardising model-tool interactions, built on JSON-RPC 2.0. Implemented in Chapter 4. https://modelcontextprotocol.io
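
Because MCP rides on JSON-RPC 2.0, a tool invocation is just a `tools/call` request carrying the tool name and its arguments. The tool and argument names below are illustrative, not from any real server:

```python
import json

# Shape of an MCP tool-call request over the wire. The envelope fields
# (jsonrpc, id, method, params) come from JSON-RPC 2.0; the tool name
# "read_file" and its argument are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_file",
        "arguments": {"path": "README.md"},
    },
}
wire = json.dumps(request)   # serialised payload sent over stdio or HTTP
```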


Anthropic Prompt Caching Documentation Anthropic (2024). The production reference for prompt caching implementation, cache hit rates, and cost modelling. Central to Chapter 2. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
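
The core mechanism documented there is a `cache_control` breakpoint that marks a stable prefix (typically a large system prompt) for reuse across calls. Shown below as a plain request dict rather than a live API call; the model id and prompt text are placeholder examples.

```python
LONG_SYSTEM_PROMPT = "You are a code-review agent. ..."  # imagine ~10k tokens

# Shape of a cached request: everything up to and including the block
# carrying cache_control is eligible for caching, so subsequent calls
# that share this prefix pay the reduced cached-input rate.
request = {
    "model": "claude-sonnet-4-20250514",   # example model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    "messages": [{"role": "user", "content": "Review this diff: ..."}],
}
```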


Claude Code Source Code (v2.1.100) Anthropic (open-sourced April 2026). The primary reference for all production patterns in this course. See README_SOTA.md for the full analysis. https://github.com/anthropics/claude-code


Benchmarks#

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). Defined the gold-standard benchmark for software engineering agents. The Lite and Verified variants are used in Chapter 12. https://arxiv.org/abs/2310.06770


GAIA: A Benchmark for General AI Assistants Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. — Meta (2023). Defined a benchmark of real-world questions requiring multi-step reasoning, web access, and tool use. Used in Chapter 12. https://arxiv.org/abs/2311.12983


Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). The canonical reference for LLM-as-judge methodology, including position bias analysis and mitigation strategies. Required reading before Chapter 7. https://arxiv.org/abs/2306.05685
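
The paper's standard mitigation for position bias is to judge each pair twice with the order swapped and count a verdict only when both runs agree. The stub judge below always prefers whatever sits in the first position, i.e. it is maximally position-biased, so the swap test correctly demotes its verdict to a tie:

```python
def judge(answer_1, answer_2):
    # Biased stub standing in for an LLM judge: always prefers position 1.
    return "first"

def debiased_verdict(a, b):
    v1 = judge(a, b)                      # a in position 1
    v2 = judge(b, a)                      # order swapped
    r1 = "a" if v1 == "first" else "b"    # map run 1 back to a/b
    r2 = "b" if v2 == "first" else "a"    # map run 2 back to a/b
    return r1 if r1 == r2 else "tie"      # require agreement

verdict = debiased_verdict("answer A", "answer B")
```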


Tooling & Infrastructure#

Designing Data-Intensive Applications Kleppmann, M. (2017). O'Reilly Media. The definitive reference for distributed systems engineering: replication, partitioning, consistency, consensus, and stream processing. Chapters 5, 7, and 11 directly inform the persistence and crash-recovery patterns in Chapter 11 of this course.


Agent-to-Agent (A2A) Protocol Specification Google (2025). Open protocol for inter-agent communication across frameworks. Referenced in Chapter 12. https://google.github.io/A2A


OpenTelemetry Semantic Conventions for Generative AI OpenTelemetry (2024–2025). The gen_ai.* attribute conventions used throughout Chapter 7 and beyond. https://opentelemetry.io/docs/specs/semconv/gen-ai/
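
A span instrumented under these conventions carries `gen_ai.*` attributes like the following, shown here as a plain dict rather than via the OpenTelemetry SDK; attribute names follow the semconv, values are examples.

```python
# Example gen_ai.* attributes for one model-call span.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",   # example value
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 340,
}

# Cost dashboards typically aggregate just the usage attributes.
usage = {k: v for k, v in span_attributes.items()
         if k.startswith("gen_ai.usage.")}
```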


AutoGPT Richards, T. B. (2023) — GitHub. The first widely viral autonomous agent. Demonstrated the public appetite for long-running autonomous agents, and the failure modes of unconstrained tool use. https://github.com/Significant-Gravitas/AutoGPT


LangChain Chase, H. (2022–) — GitHub. The most widely adopted agent framework. Its evolution from chains to LangGraph tracks the maturation of agent architecture thinking. https://github.com/langchain-ai/langchain


LangGraph LangChain AI (2024–). The graph-based successor to LangChain's chain abstraction; the most mature framework for stateful agentic workflows. https://github.com/langchain-ai/langgraph


CrewAI Moura, J. (2024–). Role-based multi-agent framework; useful reference for comparing role specialisation patterns with Chapter 6's approach. https://github.com/crewAIInc/crewAI


OpenAI Agents SDK OpenAI (2025–). Production SDK with native MCP, Responses API, and Guardrails support. Surveyed in Appendix A. https://github.com/openai/openai-agents-python


Google Agent Development Kit (ADK) Google (2025–). Multi-agent framework with native A2A support. Surveyed in Appendix A. https://google.github.io/adk-docs/


Swarms (kyegomez) Gomez, K. (2023–). Large-scale swarm framework supporting many parallel topologies; useful for experiments at scale. https://github.com/kyegomez/swarms



Further Reading by Chapter#

The original per-chapter "Further Reading" lists are consolidated below. Each entry is a single line containing only a link; follow it when you want depth on that specific technique.

Chapter 1 — Raw Call#

Chapter 2 — Providers & Prompt Caching#

Chapter 3 — Agent Loop, Tools & MCP#

Chapter 4 — State & Collaboration#

Chapter 5 — Evaluation & Observability#

Chapter 6 — Orchestrator-Workers#

Chapter 7 — Routing, Compaction & Guardrails#

Chapter 8 — Production, Skills & Plugins#

Chapter 9 — Capstone#