HN Reader

Normalizing Flows Are Capable Generative Models

169

It’s pretty great that despite having large data centers capable of doing this kind of computation, Apple continues to make things work locally. I think there is a lot of value in being able to hold the entirety of a product in hand.

5 days agoby imoverclocked

As far as I'm aware, this is the largest Normalizing Flow that exists, and I think they undermined their work by not mentioning this...

Their ImageNet model (4_1024_8_8_0.05[0]) is ~820M while AFHQ is ~472M. Prior to that there is DenseFlow[1] and MaCow[2], which are both <200M parameters. For more comparison, that makes DenseFlow and MaCow smaller than iDDPM[3] (270M params) and ADM[4] (553M for 256 unconditional). And now, it isn't uncommon for modern diffusion models to have several billion parameters![5] (from this we get some numbers on ImageNet-256, which allows a direct comparison, making TarFlow closer to MaskDiT/2 and much smaller than SimpleDiffusion and VDM++, both of which are in billions. But note that this is 128 vs 256!)

Essentially, the argument here is that you can scale (Composable) Normalizing Flows just as well as diffusion models. There's a lot of extra benefits you get too in the latent space, but that's a much longer discussion. Honestly, the TarFlow method is simple and there's probably a lot of improvements that can be made. But don't take that as a knock on this paper! I actually really appreciated it and it really set out to show what they tried to show. The real thing is just no one trained flows at this scale before and this really needs to be highlighted.

The tldr: people have really just overlooked different model architectures

[0] Used a third party reproduction so might be different but their AFHQ-256 model matches at 472M params https://github.com/encoreus/GS-Jacobi_for_TarFlow

[1] https://arxiv.org/abs/2106.04627

[2] https://arxiv.org/abs/1902.04208

[3] https://arxiv.org/abs/2102.09672

[4] https://arxiv.org/abs/2105.05233

[5] https://arxiv.org/abs/2401.11605

[Side note] Hey, if the TarFlow team is hiring, I'd love to work with you guys

4 days agoby godelski

I've been working on a JAX implementation for my own projects. I've implemented everything in the paper except guidance.

See here: https://github.com/homerjed/transformer_flow

I'm happy to see the return of normalising flows - exact likelihood models have many benefits. I found the model needed soft-clipping on some operations to ensure numerical stability.

I wonder if adding transformers can be done for the GLOW algorithm since attention and 1x1 convolutions could be made to do the same operation.

3 days agoby kleskling

https://github.com/bayesiains/nflows

5 days agoby tiahura

i've been trying to keep up with this field (image generation) so here's quick notes I took:

Claude's Summary: "Normalizing flows aren't dead, they just needed modern techniques"

My Summary: "Transformers aren't just for text"

1. SOTA model for likelihood on ImageNet 64×64, first ever sub 3.2 (Bits Per Dimension) prev was 2.99 by a hybrid diffusion model

2. Autoregressive (transformers) approach, right now diffusion is the most popular in this space (it's much faster but a diff approach)

tl;dr of autoregressive vs diffusion (there's also other approaches)

Autoregression: step based, generate a little then more then more

Diffusion: generate a lot of noise then try to clean it up

The diffusion approach that is the baseline for sota is Flow Matching from Meta: https://arxiv.org/abs/2210.02747 -- lots of fun reading material if you throw both of these into an LLM and ask it to summarize the approaches!

4 days agoby jc4p

I wonder if it’s noticeably faster or slower than the common way on the same set of hardware.

5 days agoby MBCook

normalizing flow might be unpopular but definitely not a forgotten technique

5 days agoby lnyan

Earlier discussion: https://news.ycombinator.com/item?id=44358535

4 days agoby layer8

flows make sense here not just for size but cuz they're fully invertible and deterministic. imagine running same gen on 3 iphones, same output. means apple can kinda ensure same input gives same output across devices, chips, runs. no weird variance or sampling noise. good for caching, testing, user trust all that. fits apple's whole determinism dna and more of predictable gen at scale

5 days agoby b0a04gl