📚 Academic References
This project builds on a range of academic and technical sources on sparse autoencoders, interpretability, and language modeling. The following references primarily informed the theoretical framework of the thesis:
- Anthropic Interpretability Team. (Jan 2025). Circuits Updates. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2025/january-update/index.html
- Bloom, J. (Feb 2, 2024). Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small. Retrieved from LessWrong: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., ... Olah, C. (Oct 4, 2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2023/monosemantic-features/index.html
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33.
- Bussmann, B., Leask, P., & Nanda, N. (2024). BatchTopK Sparse Autoencoders. arXiv preprint.
- Chaudhary, M., & Geiger, A. (2024). Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small. arXiv preprint.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (Vol. 1).
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., ... Olah, C. (Sept 14, 2022). Toy Models of Superposition. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2022/toy_model/index.html
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., ... Olah, C. (Dec 22, 2021). A Mathematical Framework for Transformer Circuits. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2021/framework/index.html
- Gao, L., Dupré de la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., ... Wu, J. (2024). Scaling and Evaluating Sparse Autoencoders. arXiv preprint.
- Li, Y., Ildiz, M. E., Papailiopoulos, D., & Oymak, S. (2023). Transformers as Algorithms: Generalization and Stability in In-Context Learning. In International Conference on Machine Learning (ICML).
- Mikolov, T., Yih, W.-t., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In NAACL-HLT, pp. 746–751.
- O'Neill, C., Ye, C., Iyer, K., & Wu, J. F. (2024). Disentangling Dense Embeddings with Sparse Autoencoders. arXiv preprint.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
- Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., ... Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint.
- Rajamanoharan, S., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint.
- Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., ... Henighan, T. (May 21, 2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E. H., Narang, S., ... Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35.
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., ... Chi, E. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv preprint.