Introduction
Motivation
One of the main focuses of mechanistic interpretability is to understand how language models represent information in their embeddings. The field has mostly studied the representations in the residual stream of autoregressive transformers, and little work has been done on other types of language models, such as sentence embedding models. To date, the only paper that has studied the representations of sentence embedding models is this one. This project uses the ideas presented in that paper to study the representations of Sentence-BERT, one of the most popular (if not the most popular) sentence embedding models. More details about the model can be found here.
One of the biggest obstacles when training a Sparse Autoencoder (SAE), especially on a limited budget, is the amount of compute required. An SAE extracts monosemantic features from one specific part of a model and cannot be reused elsewhere, so one would normally need to train a separate SAE for each part of the model, which requires a large amount of computation and training time. With a sentence embedding model, however, there is no need to train an SAE for each part of the model: we can train a single SAE on the final embeddings and obtain a monosemantic representation of the embedding space used to encode the sentences.
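As an illustration, a minimal sketch of this setup is shown below. The Sentence-BERT checkpoint name, the hidden-layer expansion factor, and the L1 sparsity coefficient are illustrative assumptions, not the exact configuration used in this project.

```python
# Minimal sketch: training an SAE on Sentence-BERT embeddings.
# Model name, sizes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the embedding
        return x_hat, f

# 1) Embed sentences once with Sentence-BERT (no per-layer hooks needed).
st_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
sentences = ["The cat sat on the mat.", "Stocks fell sharply on Monday."]
emb = torch.tensor(st_model.encode(sentences))      # shape: (n_sentences, d_in)

# 2) Train the SAE on the final embeddings with an L1 sparsity penalty.
sae = SparseAutoencoder(d_in=emb.shape[1], d_hidden=8 * emb.shape[1])
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                     # assumed sparsity strength

for step in range(100):                             # toy loop; real training iterates over batches
    x_hat, f = sae(emb)
    loss = ((x_hat - emb) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```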
Once the SAE is trained, its features must be interpreted. An SAE typically has far too many features to analyze each one manually, so considerable effort has gone into developing automated interpretation methods. The most common approach is to ask an LLM to generate a description of each feature. However, making a huge number of API calls to an LLM is costly, which makes this approach impractical on a limited budget. This project presents a new feature interpretation method that does not require any API calls; it is based on keyword extraction via KeyBERT.
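A minimal sketch of how such a KeyBERT-based interpretation step could look is shown below. The pipeline (collect the top-activating sentences for a feature, then extract keywords from them) is an assumed simplification of the method, and the function name, sentence count, and n-gram range are illustrative choices.

```python
# Minimal sketch: labeling one SAE feature with KeyBERT instead of LLM API calls.
# The exact procedure is an assumption; names and parameters are illustrative.
import torch
from keybert import KeyBERT

def describe_feature(feature_idx, sentences, feature_acts, top_k_sentences=20):
    """Label an SAE feature with keywords from its top-activating sentences."""
    acts = feature_acts[:, feature_idx]                      # activations of this feature
    top_idx = torch.topk(acts, k=min(top_k_sentences, len(sentences))).indices
    top_text = " ".join(sentences[i] for i in top_idx.tolist())

    kw_model = KeyBERT()                                     # uses a Sentence-BERT backbone internally
    keywords = kw_model.extract_keywords(
        top_text,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=5,
    )
    return [kw for kw, score in keywords]

# Usage: feature_acts is the (n_sentences, n_features) matrix of SAE activations.
# print(describe_feature(42, sentences, feature_acts))
```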
Project Objectives
This project aims to explore the use of Sparse Autoencoders for interpreting
Sentence-BERT
embeddings. The main goals are:
- Train an SAE on Sentence-BERT embeddings from different knowledge domains.
- Present a method to interpret the features obtained from the SAEs.
Scope
This work does not aim to improve downstream task performance, nor to benchmark new autoencoder architectures. Instead, it focuses on applying existing SAE architectures to a new domain and on developing a method to interpret the resulting features.
The entire pipeline is modular and open-source, and can be adapted to other embedding models or datasets.
This documentation is intended to be modular and accessible. Each section can be read independently depending on your interest, whether you're focused on model training, interpretability methods, or practical application.