First, I think this is really cool. It's great to see novel generative architectures.
Here are my thoughts on the statistics behind this. First, let D be the data sample. Start with
the expectation of -Log[P(D)] (standard generative model objective).
We then condition on the model output at step N and marginalize over it:
= - expectation of Log[Sum over model outputs at step N {P(D | model output at step N) * P(model output at step N)}]
Now use Jensen's inequality to transform this to
<= - expectation of Sum over model outputs at step N{Log[P(D | model output at step N) * P(model output at step N)]}
Apply Log product to sum rule
= - expectation of Sum over model outputs at step N {Log(P(D | model output at step N)) + Log(P(model output at step N))}
If we assume there is some normally distributed observation noise, we can transform the first term into the standard L2 objective (up to constants).
= expectation of Sum over model outputs at step N {L2 distance(D, model output at step N) - Log(P(model output at step N))}
Apply linearity of expectation
= Sum over model outputs at step N [expectation of{L2 distance(D, model output at step N)}]
- Sum over model outputs at step N [expectation of {Log(P(model output at step N))}]
and the summations can be replaced with sampling
= expectation of {L2 distance(D, model output at step N)}
- expectation of {Log(P(model output at step N))}
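To compress the notation, here is a restatement of the bound so far (writing z_N for the model output at step N, reading the sampled sums above as expectations over z_N, and assuming Gaussian observation noise with variance sigma^2):

```latex
-\,\mathbb{E}_{D}\big[\log P(D)\big]
  \;\le\; \mathbb{E}_{D,\,z_N}\big[-\log P(D \mid z_N)\big]
        \;-\; \mathbb{E}_{z_N}\big[\log P(z_N)\big],
\qquad
-\log P(D \mid z_N) \;=\; \frac{\lVert D - z_N \rVert_2^2}{2\sigma^2} + \text{const.}
```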
Now, focusing on just the - expectation of Log(P(sampled model output at step N)) term.
= - expectation of Log[P(model output at step N)]
and condition on the prior step (marginalizing over it) to get
= - expectation of Log[Sum over possible samples at step N - 1 of
(P(sample at step N | sample at step N - 1) * P(sample at step N - 1))]
Now, each conditional P(sample at step T | sample at step T - 1) is approximately equal to 1/K. This is enforced by the Split-and-Prune operations, which try to keep each output sampled at roughly equal frequency.
So this is approximately equal to
≃ - expectation of Log[Sum over possible samples at N-1 of (1/K * P(possible sample at step N - 1))]
And you get an upper bound by only considering the actual sample.
<= - expectation of Log[1/K * P(sample at step N - 1)]
And applying some log rules you get
= Log(K) - expectation of Log[P(sample at step N - 1)]
Now, you have (approximately) expectation of -Log[P(sample at step N)] <= Log(K) - expectation of Log[P(sample at step N - 1)]. You can repeatedly apply this transformation until step 0 to get
(approximately) expectation of -Log[P(sample at step N)] <= N * Log(K) - expectation of Log[P(sample at step 0)]
and assume that P(sample at step 0) is 1 (i.e., the step-0 sample is fixed) to get
expectation of -Log[P(sample at step N)] <= N * Log(K)
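Written compactly (with z_t the sample at step t, the approximate 1/K selection probability from above, and P(z_0) = 1), the recursion telescopes as:

```latex
-\,\mathbb{E}\big[\log P(z_N)\big]
  \;\lesssim\; \log K - \mathbb{E}\big[\log P(z_{N-1})\big]
  \;\lesssim\; 2\log K - \mathbb{E}\big[\log P(z_{N-2})\big]
  \;\lesssim\; \cdots
  \;\lesssim\; N\log K - \mathbb{E}\big[\log P(z_0)\big]
  \;=\; N\log K.
```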
Plugging this back into the main objective, we get (assuming the Split-and-Prune is perfect)
expectation of -Log[P(D)] <= expectation of {L2 distance(D, sampled model output at step N)} + N * Log(K)
And this makes sense. You are providing the model with an additional Log_2(K) bits of information every time you perform an argmin operation, so in total you have provided the model with N * Log_2(K) bits of information. However, this term is a constant, so the gradient-based optimizer can ignore it.
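To make the bound concrete, here is a minimal numerical sketch of the selection-and-loss logic as I understand it. Everything here (the candidate generator, the shapes, the values of K, N and dim) is a hypothetical stand-in rather than the paper's actual architecture; the point is only that the trainable part of the bound is the L2 terms, while N * Log_2(K) is a constant offset.

```python
import numpy as np

# Toy sketch of the bound: at each of N steps a layer proposes K candidates,
# the argmin-by-L2 candidate is selected (injecting log2(K) bits of information),
# and the L2 distance to the data is the trainable part of the NLL upper bound.
rng = np.random.default_rng(0)
N, K, dim = 10, 64, 32          # hypothetical depth, candidates per step, data dim

D = rng.normal(size=dim)        # stand-in data sample
z = np.zeros(dim)               # deterministic step-0 sample (P = 1)
l2_losses = []

for step in range(N):
    # Stand-in candidate generator: K perturbations of the current sample.
    candidates = z + 0.5 * rng.normal(size=(K, dim))
    # argmin selection: pick the candidate closest to D.
    z = candidates[np.argmin(((candidates - D) ** 2).sum(axis=1))]
    l2_losses.append(((z - D) ** 2).sum())

# Trainable terms of the bound vs. the constant information term.
print("final L2:", l2_losses[-1])
print("constant offset (bits):", N * np.log2(K))
```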
So, given this analysis my conclusions are:
1) The Split-and-Prune is a load-bearing component of the architecture with regard to its statistical correctness (a toy sketch of the balancing it performs is included after this list). I'm not entirely sure how this fits with the gradient-based optimizer. Is it working with the gradient-based optimizer, fighting it, or somewhere in the middle? I think the answer to this question will strongly affect this approach's scalability. It will also need a more in-depth analysis of how deviations from perfect splitting affect the upper bound on the loss.
2) With regard to statistical correctness, only the L2 distance between the output at step N and D matters; the L2 losses in the middle layers can be considered auxiliary losses. Maybe the final L2 loss, or the L2 losses deeper in the model, should be weighted more heavily? In final evaluation, the intermediate L2 losses can be ignored.
3) Future possibilities could include some sort of RL to determine the number of samples K and the depth N on a dynamic basis. Even a split with K = 2 increases the NLL upper bound by Log_2(2) = 1 bit. For many samples, beyond a given depth the increase in loss from the additional information outweighs the decrease in L2 loss. This also points to another difficulty: it is hard to give the model fractional bits of information in this Discrete Distribution Network architecture, whereas diffusion models and autoregressive models can handle fractional bits. This could be another point of future development.
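On point 1, here is a toy, count-based sketch of the kind of frequency balancing I have in mind for Split-and-Prune. This is purely my guess at the mechanism (split the over-selected candidate, prune the under-selected one so usage stays near 1/K); the actual operation in the paper may differ.

```python
import numpy as np

# Toy frequency-balancing sketch (my guess at the mechanism, not the paper's rule):
# track how often each of the K candidates wins the argmin, then periodically
# split the over-used candidate and prune the under-used one so usage stays near 1/K.
rng = np.random.default_rng(0)
K, dim, steps = 8, 4, 2000

candidates = rng.normal(size=(K, dim))   # hypothetical candidate outputs of one layer
counts = np.zeros(K)

for t in range(steps):
    target = rng.normal(size=dim)                     # stand-in data sample
    k = np.argmin(((candidates - target) ** 2).sum(axis=1))
    counts[k] += 1

    if (t + 1) % 200 == 0:                            # periodic Split-and-Prune
        hi, lo = np.argmax(counts), np.argmin(counts)
        # Split: clone the over-selected candidate with a small perturbation,
        # overwriting (pruning) the under-selected slot.
        candidates[lo] = candidates[hi] + 0.01 * rng.normal(size=dim)
        counts[hi] /= 2                               # the clone will share these wins
        counts[lo] = counts[hi]

print("selection frequencies:", counts / counts.sum())  # should drift toward 1/K
```

If the balancing holds, the P(sample at step T | sample at step T - 1) ≈ 1/K assumption in the derivation is justified; how far the real operation deviates from it is exactly the slack that would loosen the N * Log(K) term.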