CODA: Repurposing Continuous VAEs for Discrete Tokenization

Zeyu Liu1*    Zanlin Ni1*    Yeguo Hua1    Xin Deng2    Xiao Ma3    Cheng Zhong3    Gao Huang1†   
*Equal Contribution    †Corresponding Author   
1Tsinghua University         2Renmin University         3Lenovo Research, AI Lab        

Main Idea


(a) Conventional discrete VQ tokenizers learn to compress and discretize inherently continuous visual signals into codes simultaneously. This leads to multiple challenges in training, and the resulting unsatisfactory latent space becomes a bottleneck that limits the performance of discrete token-based generation models.

(b) Our proposed CODA tokenizer leverages continuous VAEs for compression, directly discretizing the latent space.

(c) Quantitative comparisons between VQGAN and our proposed CODA tokenizer.



Abstract

Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes.

Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization.

Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs --- already optimized for perceptual compression --- into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs.

Empirically, with a 6x smaller training budget than standard VQGAN, our approach achieves 100% codebook utilization and reconstruction FIDs (rFID) of 0.43 and 1.34 for 8x and 16x compression on the ImageNet 256 x 256 benchmark.



Approach

Enhanced Representational Capacity


Left: Visualization of latent space approximation and the lack of representational capacity: (a) the original latent space of the continuous VAE, (b) the latent space approximated by vector quantization, and (c) the latent space approximated by residual quantization. The discrepancy indicates substantial information loss during the VQ approximation process.


Right: Effect of residual quantization levels and enhanced representational capacity on tokenizer performance. With more levels of residual quantization, the quantization error is consistently reduced, and reconstruction performance (measured by rFID) steadily improves.



Sparse Unambiguous Assignment


Left: Visualization of top assignment confidence scores for 16 randomly selected continuous VAE features. For vector quantization, we visualize the distance from each code to the continuous feature, with lower distance indicating higher confidence. This reveals a clear pattern of ambiguity in code assignment.


Right: Visualization of training dynamics. With the proposed sparse attention-based assignment, codes are pushed to fully occupy the latent space, whereas vector quantization shows limited coverage. This enforces fuller latent space coverage and unambiguous codebook assignment, along with improved training dynamics.



Pipeline


Illustration of our CODA tokenizer

(a) A residual quantization process of L levels iteratively refines the approximation of a continuous VAE vector f through a composite of multiple quantization layers, progressively minimizing the quantization error. Meanwhile, as the continuous VAE vector is approximated by a combination of L discrete codes, the representational capacity is significantly enlarged (with a per-level codebook of size K, L levels can in principle represent up to K^L distinct combinations). A code sketch of this step follows below.

(b) The attention-based quantization process frames discretization as a retrieval task. Continuous features and codebook embeddings are projected and normalized onto a unit hypersphere, where a softmax attention matrix is computed to determine the confidence of code selection. As codes compete within the softmax attention framework, this approach ensures a sparse and unambiguous assignment (see the second sketch below).
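
To make the residual quantization in (a) concrete, here is a minimal PyTorch-style sketch. It is a simplified illustration under our own assumptions (the function name, tensor shapes, and a plain nearest-neighbor lookup are illustrative), not the exact CODA implementation.

import torch

def residual_quantize(f, codebooks):
    # f:         (batch, dim) continuous VAE features
    # codebooks: list of L tensors, each of shape (codebook_size, dim)
    residual = f
    approximation = torch.zeros_like(f)
    indices = []
    for codebook in codebooks:                     # one pass per quantization level
        dists = torch.cdist(residual, codebook)    # (batch, codebook_size) distances
        idx = dists.argmin(dim=-1)                 # nearest code for the current residual
        quantized = codebook[idx]                  # (batch, dim) selected code embeddings
        approximation = approximation + quantized  # refine the running approximation
        residual = residual - quantized            # pass the remaining error to the next level
        indices.append(idx)
    return approximation, indices                  # sum of L codes and their per-level indices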

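The attention-based assignment in (b) can be sketched in a similar spirit. The module below is only an illustrative approximation under assumed names (AttentionAssign, temperature, the two linear projections), not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAssign(nn.Module):
    def __init__(self, dim, codebook_size, temperature=0.1):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.proj_f = nn.Linear(dim, dim)  # projects continuous features (queries)
        self.proj_c = nn.Linear(dim, dim)  # projects codebook embeddings (keys)
        self.temperature = temperature

    def forward(self, features):
        q = F.normalize(self.proj_f(features), dim=-1)       # queries on the unit hypersphere
        k = F.normalize(self.proj_c(self.codebook), dim=-1)  # keys on the unit hypersphere
        confidence = (q @ k.t() / self.temperature).softmax(dim=-1)  # codes compete via softmax
        soft_code = confidence @ self.codebook  # soft, differentiable assignment for training
        hard_idx = confidence.argmax(dim=-1)    # hard code index used at inference
        return soft_code, hard_idx, confidence

Because each feature's confidence scores sum to one over the whole codebook, raising one code's score necessarily suppresses the others, which is what encourages the sparse, unambiguous assignment described above.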


Main Results


Reconstruction results on ImageNet



Generation results on ImageNet



Visualizations



BibTeX


      Coming Soon.
    

Acknowledgements


Website adapted from the following template.