Parameter generation has long struggled to match the scale of today’s large vision and language models, curbing its broader utility. In this paper, we introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters—up to hundreds of millions—on a single GPU. Our approach first partitions a network’s parameters into non-overlapping ‘tokens’, each corresponding to a distinct portion of the model. A recurrent mechanism then learns the inter-token relationships, producing ‘prototypes’ which serve as conditions for a diffusion process that ultimately synthesizes the full parameters. Across a spectrum of architectures and tasks—including ResNets, ConvNeXts and ViTs on ImageNet-1K and COCO, and even LoRA-based LLMs—RPG achieves performance on par with fully trained networks while avoiding excessive memory overhead. Notably, it generalizes beyond its training set to generate valid parameters for previously unseen tasks, highlighting its flexibility in dynamic and open-ended scenarios. By overcoming the longstanding memory and scalability barriers, RPG serves as a critical advance in ‘AI generating AI’, potentially enabling efficient weight generation at scales previously deemed infeasible.
Figure 1: Partial roadmap of vision, language, and parameter generation models. The parameter count of vision or language models is at least 10^4 times larger than that of generated parameters.
Looking back on the journey of deep learning, scaling up neural networks has been one of the most important keys to its remarkable success across various tasks. In contrast, parameter generation, from HyperNetwork to recent diffusion-based methods, has struggled to scale up effectively, limiting its practical applications. As shown in Figure 1, the scale gap between vision (or language) models and generated parameters is at least 10^4, posing significant challenges for this field.
Figure 2: P-diff usually confronts out-of-memory issues when scaling up.
Figure 3: SANE individually generates parameter parts, which leads to poor performance of the generated parameters.
There is a fundamental trade-off between scale and performance in parameter generation, which significantly limits its practical applications. Specifically, P-diff flattens parameters into one-dimensional vectors, leading to out-of-memory issues when scaling up. Meanwhile, SANE generates local parameter parts independently, neglecting the relationships between them, which results in poor performance, especially when generating large-scale parameters. Large Language Models (LLMs), in contrast, have demonstrated exceptional capabilities in context modeling and efficiently handle the trade-off between memory consumption and sequence length. This motivates our central question: can we treat network parameters as textual data?
Figure 4: Our approach comprises two key components: parameter tokenization and recurrent diffusion. In the figure above, we show the inference process of recurrent diffusion. The permutation state and position embedding are fed into the recurrent model. Then, the outputs of the recurrent model serve as conditions for the diffusion process, which generates the entire neural network parameters.
In this work, we propose Recurrent diffusion for large-scale neural network Parameters Generation (RPG). Our approach first divides the trained network parameters into a set of non-overlapping parameter parts (simply called ‘tokens’ in the following). Subsequently, we use a recurrent model to learn the relationships among the tokens. Finally, the outputs of the recurrent model, as conditions, are fed into a diffusion process to generate the network parameters.
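To make the pipeline concrete, below is a minimal PyTorch sketch of the three stages described above: parameter tokenization, recurrent prototype modeling, and prototype-conditioned denoising. The token size, the GRU backbone, and the MLP denoiser are illustrative assumptions, not the exact RPG configuration.

```python
import torch
import torch.nn as nn


def tokenize_parameters(model: nn.Module, token_size: int = 8192) -> torch.Tensor:
    """Flatten trained parameters and split them into non-overlapping tokens."""
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    pad = (-flat.numel()) % token_size            # zero-pad so the last token is full
    flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.view(-1, token_size)              # (num_tokens, token_size)


class RecurrentPrototyper(nn.Module):
    """Recurrent model producing one conditioning 'prototype' per parameter token."""
    def __init__(self, token_size: int, hidden: int = 1024, max_tokens: int = 4096):
        super().__init__()
        self.proj = nn.Linear(token_size, hidden)
        self.pos = nn.Embedding(max_tokens, hidden)   # token position embedding
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_tokens, token_size) -> prototypes: (num_tokens, hidden)
        pos = self.pos(torch.arange(tokens.size(0), device=tokens.device))
        x = (self.proj(tokens) + pos).unsqueeze(0)    # add batch dimension
        out, _ = self.rnn(x)
        return out.squeeze(0)


class TokenDenoiser(nn.Module):
    """Diffusion denoiser for one token, conditioned on its prototype (MLP stand-in)."""
    def __init__(self, token_size: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_size + hidden + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, token_size),
        )

    def forward(self, noisy, prototype, t):
        # noisy: (N, token_size), prototype: (N, hidden), t: normalized timestep
        t = t.expand(noisy.size(0), 1)
        return self.net(torch.cat([noisy, prototype, t], dim=-1))
```

Under this sketch, training would add Gaussian noise to each parameter token and regress it with the corresponding prototype as condition; at inference, tokens are sampled from pure noise and reshaped back into the original parameter layout.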
Figure 5: RPG can generate the complete parameters for ViT-Base and ConvNeXt-L within minutes, achieving results comparable to the original models. Additionally, our approach can be run on various NVIDIA GPUs.
The efficiency of generating large-scale parameters is crucial for evaluating the practicality of our approach. In Figure 5, we present the time cost for generating models of ViT-Base and ConvNeXt-L across various DDIM sampling steps. All results are obtained with a single NVIDIA H100 80G GPU. Our approach enables the generation of models within minutes, achieving performance comparable to that of the original models. Notably, the inference memory requirement is approximately 20GB, so RPG can be deployed on NVIDIA RTX 3090 or similar-level GPUs.
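For reference, deterministic DDIM sampling over a subsampled timestep schedule looks roughly like the sketch below. The linear noise schedule and the denoiser/prototype interfaces follow the assumptions of the sketch above, not the paper's exact settings.

```python
import torch

@torch.no_grad()
def ddim_sample(denoiser, prototypes, token_size, steps=50, T=1000):
    # Standard linear-beta schedule; alpha_bar[t] = prod_{s<=t} (1 - beta_s).
    alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, T), dim=0)
    ts = torch.linspace(T - 1, 0, steps).long()            # subsampled timesteps
    x = torch.randn(prototypes.size(0), token_size)        # start every token from noise
    for i, t in enumerate(ts):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        eps = denoiser(x, prototypes, t.float().view(1, 1) / T)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean token
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x.flatten()                                      # flattened generated parameters
```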
Figure 6: An illustration of our binary embedding strategy and dataset construction. Left: binary embeddings (1022 in total) encode different CIFAR-10 classification tasks, where 1s indicate classes to be classified together (e.g., 'ship' and 'truck' in the first example). Right: the dataset consists of parameter-encoding pairs, formed by network parameters with their corresponding binary embeddings. These pairs are split into non-overlapping training and validation sets.
To assess RPG’s capability to generate models for unseen tasks, we construct various binary classification tasks on CIFAR-10 and encode each task as a 10-bit binary embedding, where each bit corresponds to a CIFAR-10 category. This encoding yields 2^10 possible binary embeddings; after removing the two trivial cases (all 0s and all 1s), we obtain 1022 valid embeddings. For each embedding, we collect its corresponding model parameters, forming embedding-parameter pairs. These pairs are then split into non-overlapping training and validation sets, allowing us to evaluate RPG’s generalization to unseen tasks.
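The encoding itself is straightforward to reproduce; the sketch below enumerates the 1022 valid 10-bit embeddings and splits them into non-overlapping sets. The 900/122 split ratio and the random seed are illustrative assumptions.

```python
import itertools
import random
import torch

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

# Enumerate all 2^10 bit patterns and drop the two trivial ones -> 1022 embeddings.
embeddings = [torch.tensor(bits, dtype=torch.float32)
              for bits in itertools.product([0, 1], repeat=10)
              if 0 < sum(bits) < 10]
assert len(embeddings) == 1022

# Example: the classes grouped into the positive class for one embedding.
positive = [c for c, b in zip(CIFAR10_CLASSES, embeddings[0]) if b == 1]

# Non-overlapping split of the embedding-parameter pairs (assumed 900/122 ratio).
random.seed(0)
random.shuffle(embeddings)
train_embeddings, val_embeddings = embeddings[:900], embeddings[900:]
```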
Figure 7: Left: comparison between original and generated models on unseen embeddings (tasks). Right: comparison of original and generated models on three unseen binary embeddings.
We compare the results of our approach and the original models on unseen binary embeddings in Figure 7 (left). Notably, RPG yields commendable performance on these unseen tasks, even though it was never trained on the specific unseen embeddings. This demonstrates the strong practicality and potential of our approach for generating models under unseen tasks. We also visualize the original and generated models for both seen and unseen tasks in Figure 7 (right). Surprisingly, we find that our approach can learn unseen parameter patterns, demonstrating its potential generalization ability.
Figure 8: Examples of RPG-generated models guided by binary embeddings from a large language model (Qwen2.5-3B), demonstrating neural network parameter generation conditioned on natural language. For the first example, we give the LLM the prompt: “Give me a model to select all living things.” With the binary embedding provided by the LLM, RPG then generates a ViT-Tiny classifier. We then use images from CIFAR-10 to evaluate the model’s accuracy: the model should classify “bird”, “cat”, “deer”, “dog”, “frog”, and “horse” into the positive class, which we use as the ground truth. The resulting accuracy is 97.1%. The second example follows the same process.
To demonstrate application scenarios of our model, we utilize a large language model to generate binary embeddings that guide RPG in generating the corresponding classification models. We provide two example results in Figure 8. This experiment demonstrates our method’s capability to generate neural network parameters based on natural language guidance, highlighting its potential applications.
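As a rough illustration of this workflow, the sketch below asks an LLM for a 10-bit class mask and parses it into the binary embedding that conditions RPG. The query_llm helper and rpg.generate call are hypothetical placeholders; only the 10-bit encoding convention comes from the setup above.

```python
import re
import torch

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

def embedding_from_prompt(prompt: str, query_llm) -> torch.Tensor:
    """Ask the LLM for a 10-bit mask over the CIFAR-10 classes and parse it."""
    instruction = (
        f"{prompt}\nAnswer with a 10-bit binary string, one bit per class in this "
        f"order: {', '.join(CIFAR10_CLASSES)}. Use 1 for classes in the positive class."
    )
    reply = query_llm(instruction)                 # e.g. "0011111100"
    bits = re.search(r"[01]{10}", reply).group()   # extract the first 10-bit string
    return torch.tensor([int(b) for b in bits], dtype=torch.float32)

# Usage (names are illustrative):
#   emb = embedding_from_prompt("Give me a model to select all living things.", query_llm)
#   params = rpg.generate(emb)   # RPG synthesizes, e.g., a ViT-Tiny classifier
```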
Our approach demonstrates promising results in large-scale parameter generation across various vision, language, and unseen tasks. However, we acknowledge that achieving true ‘AI generating AI’ remains a distant goal. First, while our method shows potential in generating models for unseen tasks, it currently cannot generate parameters for novel model architectures. Second, our approach is constrained to modeling parameter relationships within a single task, potentially limiting its practical applicability. Future work should therefore focus on simultaneously modeling parameter relationships across diverse architectures and tasks. Such an approach could yield a more powerful and versatile parameter generator, bringing us closer to the ‘AI generating AI’ era. We hope our approach will inspire and encourage future research in neural network parameter generation.
@misc{wang2025recurrent,
title={Recurrent Diffusion for Large-Scale Parameter Generation},
author={Wang, Kai and Tang, Dongwen and Zhao, Wangbo and You, Yang},
year={2025},
}