Welcome to the Tegmark AI Safety Group!

Are you excited about AI but concerned about humanity losing control of it? Then please consider collaborating with our group! Our main focus is on mechanistic interpretability (MI): given a trained neural network that exhibits intelligent behavior, how can we figure out how it works, preferably automatically? Today's large language models and other powerful AI systems tend to be opaque black boxes, offering few guarantees that they will behave as desired. In order of increasing ambition level, here are our three motivations:

  1. Diagnose trustworthiness
  2. Improve trustworthiness
  3. Guarantee trustworthiness

For 2) and 3), we work on techniques for automatically extracting the knowledge learned during training. For 3), we are interested in how to reimplement the extracted knowledge and algorithms in a computational architecture where we can formally verify that it will do what we want. In addition to these efforts, which support alignment of a single AI to its user, we are also interested in game theory and mechanism design useful for multi-scale alignment (incentivizing people, companies, etc. to use AI in ways that further the common good).

Members

Max Tegmark

PI

Website  /  Twitter

Peter S. Park

Postdoc

Website  /  Twitter

Eric J. Michaud

PhD student

Website  /  Twitter

Ziming Liu

PhD student

Website  /  Twitter

Josh Engels

PhD student

Website   

Wes Gurnee

PhD student

Website  /  Twitter

Isaac Liao

Master's student

Website

Vedang Lad

Master's student

Website

Research

Below are examples of our mechanistic interpretability research so far, including the automatic discovery of knowledge representations, hidden symmetries, modularity, and conserved quantities. You'll find a complete list of our publications here.

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Ziming Liu, Eric Gan, Max Tegmark
arXiv / GitHub / Colab Demo

To make neural networks more like brains, we embed neurons into a geometric space and maximize locality of neuron connections. The resulting networks demonstrate extreme sparsity and modularity, which makes mechanistic interpretability much easier.
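
As a rough illustration of the locality idea (a minimal sketch, not the BIMT implementation from the paper; the neuron coordinates and penalty weight below are made up), one can assign each neuron a coordinate and add a distance-weighted L1 penalty so that training favors short, local connections:

```python
import torch
import torch.nn as nn

# Illustrative sketch: give each neuron a 1-D coordinate within its layer and
# penalize each weight in proportion to the distance between the neurons it
# connects, so training prefers sparse, local wiring.

def locality_penalty(layer: nn.Linear, in_pos: torch.Tensor, out_pos: torch.Tensor) -> torch.Tensor:
    # Distance between every output-neuron and input-neuron coordinate, shape (out, in).
    dist = (out_pos[:, None] - in_pos[None, :]).abs()
    # L1 penalty on weights, weighted by connection length.
    return (layer.weight.abs() * dist).sum()

# Example: one layer with evenly spaced neuron coordinates (hypothetical setup).
layer = nn.Linear(16, 8)
in_pos = torch.linspace(0, 1, 16)
out_pos = torch.linspace(0, 1, 8)

x = torch.randn(32, 16)
task_loss = layer(x).pow(2).mean()          # placeholder for the real task loss
loss = task_loss + 1e-3 * locality_penalty(layer, in_pos, out_pos)
loss.backward()
```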

The Quantization Model of Neural Scaling
Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
arXiv / GitHub

We develop a model of neural scaling laws in which a Zipf distribution over discrete subtasks translates into power-law scaling of the loss with the number of network parameters and the amount of training data.
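
A toy numerical illustration of the intuition (the exponent and subtask count below are arbitrary, not values from the paper): if subtask k occurs with Zipfian frequency proportional to k^-(alpha+1), and a model has learned the n most frequent subtasks, then the loss contributed by the remaining subtasks falls off roughly as n^-alpha:

```python
import numpy as np

# Toy check of the Zipf-to-power-law argument with made-up parameters.
alpha = 0.5
K = 10**6                                    # total number of subtasks (illustrative)
k = np.arange(1, K + 1)
p = k ** -(alpha + 1)
p /= p.sum()                                 # normalized Zipf distribution over subtasks

for n in [10, 100, 1000, 10000]:
    residual = p[n:].sum()                   # loss from subtasks not yet learned
    # The ratio residual / n**-alpha should stay roughly constant,
    # i.e. the residual loss scales as a power law in n.
    print(f"n={n:>6}  residual ≈ {residual:.4e}  n^-alpha = {n**-alpha:.4e}")
```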

Omnigrok: Grokking Beyond Algorithmic Data
Ziming Liu, Eric J. Michaud, Max Tegmark
ICLR 2023 (Spotlight)
arXiv / GitHub

We explain the phenomenon of "grokking" in neural networks in terms of the interplay between generalization and network weight norm, and use this understanding to control grokking: we can induce grokking (delay generalization) on a wide range of tasks and reduce grokking (accelerate generalization) on algorithmic tasks.
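
One simple knob consistent with this weight-norm picture (a hedged sketch, not the paper's exact experimental setup; the model and scale factors below are made up) is rescaling the weights at initialization: a larger initial norm tends to delay generalization, while a standard or constrained norm tends to remove the delay.

```python
import torch
import torch.nn as nn

def rescale_init(model: nn.Module, scale: float) -> None:
    """Multiply all weights by `scale` at initialization (illustrative knob only)."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(scale)

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
rescale_init(model, scale=8.0)    # large initial norm: expect delayed generalization
# rescale_init(model, scale=1.0)  # standard norm: generalization tends to track training
```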

Towards Understanding Grokking: An Effective Theory of Representation Learning
Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams
ICLR 2023 (Spotlight)
arXiv / GitHub

We study the relationship between generalization and the formation of structured representations in neural networks trained on algorithmic tasks.


This website is based on a template by Jon Barron