Deep Learning Reading Group

A weekly casual reading group to explore recent work in deep learning and related machine learning topics. The listed material should be read before each session so that it can be discussed in an open format. Staff and students from all backgrounds who are interested in these topics are welcome, as we aim to cover a broad mix of both theoretical and application-focused papers. This semester will focus on transformers, the fundamental architecture behind state-of-the-art image recognition systems and large language models.

Sessions are held in person at Melbourne Connect.

This group is led by Dr Liam Hodgkinson, Lecturer (Data Science).

To receive information directly, sign up to the dedicated group mailing list. If you have issues with the mailing list form, please email mcds@unimelb.edu.au.

Upcoming Discussion Sessions and Readings


17 April 2024 (Tues)

(Methods) Mamba: Linear-Time Sequence Modeling with Selective State Spaces (https://arxiv.org/abs/2312.00752):

The Transformer architecture is ubiquitous in deep learning, underlying models achieving state-of-the-art performance across almost every machine learning task. However, Transformers have one major limitation at scale: inference quickly becomes infeasible on long sequences. State space models have become a popular alternative for addressing this issue. This paper presents Mamba, the latest evolution of the state space model architecture, which achieves performance highly competitive with Transformers.
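
As a rough illustration of why state space models scale well with sequence length, here is a minimal sketch (our own toy code, not the Mamba architecture; names, dimensions, and parameter values are illustrative) of a plain linear state space recurrence, whose cost grows linearly in sequence length rather than quadratically as in full self-attention:

import numpy as np

# Minimal sketch of a (non-selective) linear state space recurrence:
#   h_t = A h_{t-1} + B x_t,   y_t = C h_t
# Each step costs O(d_state * d_model), so a length-T sequence costs O(T),
# in contrast to the O(T^2) pairwise interactions of full self-attention.
# Mamba additionally makes the parameters depend on the input ("selective"),
# which is not reproduced here.

def ssm_scan(x, A, B, C):
    """x: (T, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]   # state update: one step per token
        ys[t] = C @ h          # readout
    return ys

rng = np.random.default_rng(0)
d_in, d_state, d_out, T = 4, 8, 4, 1000
y = ssm_scan(rng.normal(size=(T, d_in)),
             0.9 * np.eye(d_state),                 # stable state transition
             rng.normal(size=(d_state, d_in)) * 0.1,
             rng.normal(size=(d_out, d_state)) * 0.1)
print(y.shape)  # (1000, 4)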

24 April 2024 (Wed)

(Theory) An Explanation of In-Context Learning as Implicit Bayesian Inference (https://arxiv.org/abs/2111.02080):

Large language models can perform in-context learning, where the model learns at inference time to accomplish a downstream task specified only through a prompt. This paper provides one explanation for this phenomenon using ideas from Bayesian statistics. See also the blog post: https://www.inference.vc/implicit-bayesian-inference-in-sequence-models/
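
Very roughly, and in notation of our own choosing rather than the paper's, the Bayesian reading treats the prompt as evidence about a latent concept \theta, with the model's prediction acting like a posterior average:

p(\text{output} \mid \text{prompt}) = \int p(\text{output} \mid \text{prompt}, \theta)\, p(\theta \mid \text{prompt})\, d\theta

On this view, a good prompt concentrates the posterior p(\theta \mid \text{prompt}) on the intended task, so the model appears to "learn" without any parameter updates.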

1 May 2024 (Wed)

(Methods) Single-Model Uncertainties for Deep Learning (https://arxiv.org/abs/1811.00908):

Quantifying uncertainty in the predictions of deep learning models is difficult in general, but remains important for drawing inferences. Here, a simple and general scheme is proposed: augment the model with an additional input specifying the desired quantile level, which can be adjusted at test time to predict arbitrary quantiles from a single network.
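
As a rough sketch of this idea (our own minimal PyTorch code, not the authors' implementation; the architecture, data, and hyperparameters are placeholders), the network receives the quantile level tau as an extra input and is trained with the pinball (quantile) loss:

import torch
import torch.nn as nn

# Minimal sketch: a regression network that takes the quantile level tau as an
# extra input and is trained with the pinball loss, so one model can be queried
# for any quantile at test time.

net = nn.Sequential(nn.Linear(1 + 1, 64), nn.ReLU(), nn.Linear(64, 1))

def pinball_loss(pred, target, tau):
    err = target - pred
    return torch.mean(torch.maximum(tau * err, (tau - 1) * err))

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.randn(256, 1)
y = x + 0.3 * torch.randn(256, 1)          # toy regression data
for _ in range(200):
    tau = torch.rand(256, 1)               # random quantile level per example
    pred = net(torch.cat([x, tau], dim=1))
    loss = pinball_loss(pred, y, tau)
    opt.zero_grad(); loss.backward(); opt.step()

# Query the 5th and 95th percentiles at a new input to form a prediction interval.
x_new = torch.zeros(1, 1)
lo = net(torch.cat([x_new, torch.full((1, 1), 0.05)], dim=1))
hi = net(torch.cat([x_new, torch.full((1, 1), 0.95)], dim=1))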

8 May 2024 (Wed)

(Theory) Linear attention is (maybe) all you need (to understand transformer optimization) (https://arxiv.org/abs/2310.01082):

From a theoretical point of view, transformers have several distinctive properties that set them apart from other neural network architectures. This paper outlines many of these, and shows that they can be replicated and studied using a (very) basic transformer model.
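
For concreteness, the kind of stripped-down model used in this line of work might contain layers like the following (our own sketch of linear, i.e. softmax-free, self-attention; the paper's exact setup may differ in its details):

import numpy as np

# Minimal sketch of a single linear self-attention layer: the softmax is dropped,
# so the token-mixing map is just (X W_Q)(X W_K)^T applied to the values.

def linear_attention(X, W_Q, W_K, W_V):
    """X: (T, d); weight matrices: (d, d). Returns (T, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return (Q @ K.T) @ V / X.shape[0]   # unnormalized attention, averaged over tokens

rng = np.random.default_rng(0)
T, d = 16, 8
X = rng.normal(size=(T, d))
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
print(linear_attention(X, *Ws).shape)  # (16, 8)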

15 May 2024 (Wed)

(Methods) Optuna: A Next-generation Hyperparameter Optimization Framework (https://arxiv.org/abs/1907.10902):

Hyperparameter tuning in deep learning is often conducted by grid search, i.e. trying a fixed set of hyperparameter combinations and selecting the best-performing one. However, there are more principled approaches worth discussing. This software paper outlines a few of these in a more practical context.
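
For a flavour of the library, here is a minimal Optuna usage sketch with a toy objective standing in for an actual training run:

import optuna

# Minimal sketch of Optuna's define-by-run API. In practice the objective would
# train a model with the suggested hyperparameters and return a validation metric.

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)   # sampled, not gridded
    depth = trial.suggest_int("depth", 1, 6)
    # Stand-in for "train a model with (lr, depth) and return validation loss":
    return (lr - 1e-3) ** 2 + 0.01 * depth

study = optuna.create_study(direction="minimize")   # uses the TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params)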

22 May 2024 (Wed)

(Methods) The Forward-Forward Algorithm: Some Preliminary Investigations (https://arxiv.org/abs/2212.13345):

Essentially all deep learning models today are trained end-to-end with a gradient-based procedure (backpropagation). This paper discusses a curious and controversial alternative that trains networks using only forward passes.
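
As a very rough sketch of the idea (our own simplification; how negative data are constructed, normalization between layers, and other details from the paper are omitted), each layer is trained locally so that a "goodness" score is high on positive data and low on negative data:

import torch
import torch.nn as nn

# Rough sketch of one forward-forward layer update: the layer is trained locally,
# with no gradients flowing between layers. Goodness here is the mean of squared
# activations (a simplification of the paper's measure).

layer = nn.Linear(784, 500)
opt = torch.optim.SGD(layer.parameters(), lr=0.03)
threshold = 2.0   # illustrative value

def goodness(h):
    return (h ** 2).mean(dim=1)

x_pos = torch.randn(64, 784)   # stand-ins for positive (real) examples
x_neg = torch.randn(64, 784)   # stand-ins for negative (fake) examples

for _ in range(100):
    g_pos = goodness(torch.relu(layer(x_pos)))
    g_neg = goodness(torch.relu(layer(x_neg)))
    # Push positive goodness above the threshold and negative goodness below it.
    loss = torch.log1p(torch.exp(-(g_pos - threshold))).mean() \
         + torch.log1p(torch.exp(g_neg - threshold)).mean()
    opt.zero_grad(); loss.backward(); opt.step()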

29 May 2024 (Wed)

(Theory) A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning (https://arxiv.org/abs/2310.18988):

The conventional statistical wisdom of the bias-variance tradeoff is broken in the deep learning regime, where the “bigger is better” rule of thumb reigns supreme. One theoretical explanation for these heuristics is the double descent phenomenon. This paper provides a recent nuanced take on the nature of the phenomenon and how to approach deep learning theory from the statistical point of view.
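
As a generic toy illustration of double descent (not the paper's experiments), one can fit minimum-norm least squares on an increasing number of random ReLU features and watch the test error typically peak near the interpolation threshold before falling again:

import numpy as np

# Toy double descent demo: minimum-norm least squares on random ReLU features.
# Test error typically peaks near the interpolation threshold (p close to n_train)
# and then decreases again as the number of features grows further.

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_true

for p in [10, 50, 90, 100, 110, 200, 1000]:       # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F_tr) @ y_tr            # minimum-norm least squares fit
    print(p, np.mean((F_te @ beta - y_te) ** 2))  # test error vs parameter count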

Past Deep Learning Discussion Sessions and Readings