Welcome to the Tegmark AI Safety Group!

Are you excited about AI but concerned about humanity losing control of it? Then please consider collaborating with our group! Our main focus is on mechanistic interpretability (MI): given a trained neural network that exhibits intelligent behavior, how can we figure out how it works, preferably automatically? Today's large language models and other powerful AI systems tend to be opaque black boxes, offering few guarantees that they will behave as desired. In order of increasing ambition level, here are our three motivations:

  1. Diagnose trustworthiness
  2. Improve trustworthiness
  3. Guarantee trustworthiness

For 2) and 3), we work on techniques for automatically extracting the knowledge learned during training. For 3), we are interested in how to reimplement the extracted knowledge and algorithms in a computational architecture where we can formally verify that it will do what we want. In addition to these efforts, which support alignment of a single AI to its user, we are also interested in game theory and mechanism design useful for multi-scale alignment (incentivizing people, companies, etc. to use AI in ways that further the common good).

Members

Max Tegmark

PI

Website  /  Twitter

Peter S. Park

Postdoc

Website  /  Twitter

Eric J. Michaud

PhD student

Website  /  Twitter

Ziming Liu

PhD student

Website  /  Twitter

Josh Engels

PhD student

Website   

Wes Gurnee

PhD student

Website  /  Twitter

Isaac Liao

Master's student

Website

Vedang Lad

Master's student

Website

Research

Below are examples of our mechanistic interpretability research so far, including the automatic discovery of knowledge representations, hidden symmetries, modularity, and conserved quantities. You'll find a complete list of our publications here.

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Ziming Liu, Eric Gan, Max Tegmark
arXiv / GitHub / Colab Demo

To make neural networks more like brains, we embed neurons into a geometric space and maximize locality of neuron connections. The resulting networks demonstrate extreme sparsity and modularity, which makes mechanistic interpretability much easier.
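
As a rough illustration of the locality idea (a minimal sketch, not the BIMT implementation from the paper; the neuron coordinates and penalty weight below are made up), one can assign each neuron a coordinate and add a distance-weighted L1 penalty so that training favors short, local connections:

```python
import torch
import torch.nn as nn

# Illustrative sketch: give each neuron a 1-D coordinate within its layer and
# penalize each weight in proportion to the distance between the neurons it
# connects, so training prefers sparse, local wiring.

def locality_penalty(layer: nn.Linear, in_pos: torch.Tensor, out_pos: torch.Tensor) -> torch.Tensor:
    # Distance between every output-neuron and input-neuron coordinate, shape (out, in).
    dist = (out_pos[:, None] - in_pos[None, :]).abs()
    # L1 penalty on weights, weighted by connection length.
    return (layer.weight.abs() * dist).sum()

# Example: one layer with evenly spaced neuron coordinates (hypothetical setup).
layer = nn.Linear(16, 8)
in_pos = torch.linspace(0, 1, 16)
out_pos = torch.linspace(0, 1, 8)

x = torch.randn(32, 16)
task_loss = layer(x).pow(2).mean()          # placeholder for the real task loss
loss = task_loss + 1e-3 * locality_penalty(layer, in_pos, out_pos)
loss.backward()
```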

The Quantization Model of Neural Scaling
Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
arXiv / GitHub

We develop a model of neural scaling laws in which a Zipf distribution over discrete subtasks translates into power-law scaling of the loss with the number of network parameters and the amount of training data.
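
A toy numerical illustration of the intuition (the exponent and subtask count below are arbitrary, not values from the paper): if subtask k occurs with Zipfian frequency proportional to k^-(alpha+1), and a model has learned the n most frequent subtasks, then the loss contributed by the remaining subtasks falls off roughly as n^-alpha:

```python
import numpy as np

# Toy check of the Zipf-to-power-law argument with made-up parameters.
alpha = 0.5
K = 10**6                                    # total number of subtasks (illustrative)
k = np.arange(1, K + 1)
p = k ** -(alpha + 1)
p /= p.sum()                                 # normalized Zipf distribution over subtasks

for n in [10, 100, 1000, 10000]:
    residual = p[n:].sum()                   # loss from subtasks not yet learned
    # The ratio residual / n**-alpha should stay roughly constant,
    # i.e. the residual loss scales as a power law in n.
    print(f"n={n:>6}  residual ≈ {residual:.4e}  n^-alpha = {n**-alpha:.4e}")
```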

Omnigrok: Grokking Beyond Algorithmic Data
Ziming Liu, Eric J. Michaud, Max Tegmark
ICLR 2023 (Spotlight)
arXiv / GitHub

We explain the phenomenon of "grokking" in neural networks in terms of the interplay between generalization and network weight norm, and use this understanding to control grokking: we can induce grokking (delay generalization) on a wide range of tasks and reduce grokking (accelerate generalization) on algorithmic tasks.
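
One simple knob consistent with this weight-norm picture (a hedged sketch, not the paper's exact experimental setup; the model and scale factors below are made up) is rescaling the weights at initialization: a larger initial norm tends to delay generalization, while a standard or constrained norm tends to remove the delay.

```python
import torch
import torch.nn as nn

def rescale_init(model: nn.Module, scale: float) -> None:
    """Multiply all weights by `scale` at initialization (illustrative knob only)."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(scale)

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
rescale_init(model, scale=8.0)    # large initial norm: expect delayed generalization
# rescale_init(model, scale=1.0)  # standard norm: generalization tends to track training
```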

Towards Understanding Grokking: An Effective Theory of Representation Learning
Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams
ICLR 2023 (Spotlight)
arXiv / GitHub

We study the relationship between generalization and the formation of structured representations in neural networks trained on algorithmic tasks.


This website is based on a template by Jon Barron