📚 Academic References
This project builds on a range of academic and technical sources on sparse autoencoders, interpretability, and language modeling. The following references primarily informed the theoretical framework of the thesis:
- Anthropic Interpretability Team. (Jan 2025). Circuits Updates. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2025/january-update/index.html
- Bloom, J. (Feb 2, 2024). Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small. Retrieved from LessWrong: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., ... Olah, C. (Oct 4, 2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2023/monosemantic-features/index.html
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33.
- Bussmann, B., Leask, P., & Nanda, N. (2024). BatchTopK Sparse Autoencoders. arXiv preprint.
- Chaudhary, M., & Geiger, A. (2024). Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small. arXiv preprint.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (Vol. 1).
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., ... Olah, C. (Sept 14, 2022). Toy Models of Superposition. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2022/toy_model/index.html
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., ... Olah, C. (Dec 22, 2021). A Mathematical Framework for Transformer Circuits. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2021/framework/index.html
- Gao, L., Dupré de la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., ... Wu, J. (2024). Scaling and Evaluating Sparse Autoencoders. arXiv preprint.
- Li, Y., Ildiz, M. E., Papailiopoulos, D., & Oymak, S. (2023). Transformers as Algorithms: Generalization and Stability in In-Context Learning. In International Conference on Machine Learning (ICML).
- Mikolov, T., Yih, W.-t., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In NAACL-HLT, pp. 746–751.
- O'Neill, C., Ye, C., Iyer, K., & Wu, J. F. (2024). Disentangling Dense Embeddings with Sparse Autoencoders. arXiv preprint.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
- Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., ... Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint.
- Rajamanoharan, S., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint.
- Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., ... Henighan, T. (May 21, 2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Retrieved from Transformer Circuits Thread: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E. H., Narang, S., ... Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35.
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., ... Chi, E. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv preprint.