Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

CVPR 2026
1Tianjin University, 2Independent Researcher
*Corresponding author   Project lead
LayerBind overview figure

Figure 1. We propose LayerBind, a training-free strategy that gives text-to-image DiT models regional and occlusion controllability. Compared to prior methods, LayerBind produces customized images that better respect the specified spatial layout and occlusion relations while preserving image quality. The design is based on context-sharing and region-branching generation, which enables editable generation such as swapping instances or changing their visible order.

Abstract

Region-instructed layout control in text-to-image generation is highly practical, yet existing methods have two main limitations: training-based approaches inherit data bias and often degrade image quality, and current techniques struggle to control occlusion order, which limits real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during generation, LayerBind enables precise control over both regions and occlusion. Motivated by the observation that spatial layout and occlusion are established at very early denoising stages, our method proceeds in two phases: Layer-wise Instance Initialization and Layer-wise Semantic Nursing. The first phase creates per-instance branches that share background context and fuses them early, according to the desired layer order, to establish a structured latent layout. The second phase reinforces regional details and preserves occlusion order through layer-wise attention enhancement and a transparency scheduler. LayerBind is training-free and plug-and-play, and it supports editable workflows such as changing per-region instances and rearranging the visible order.
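The early fusion step described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that each instance is given as a rectangular box, that latents are small 2D grids of floats, and that fusion is a hard paste applied back to front (a painter's-algorithm composite). The names `fuse_layers`, `Box`, and `Grid` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# Hypothetical types for this sketch: a box is (top, left, bottom, right),
# end-exclusive, and a latent is a 2D grid of floats.
Box = Tuple[int, int, int, int]
Grid = List[List[float]]


def fuse_layers(background: Grid, branches: List[Grid], boxes: List[Box]) -> Grid:
    """Paste each instance branch into the background, back to front.

    `branches` and `boxes` are ordered from the rearmost layer to the
    frontmost one, so later entries overwrite earlier ones where they overlap.
    This hard paste is an assumption, not the fusion rule from the paper.
    """
    fused = [row[:] for row in background]  # copy so the input is not mutated
    for branch, (top, left, bottom, right) in zip(branches, boxes):
        for y in range(top, bottom):
            for x in range(left, right):
                fused[y][x] = branch[y][x]
    return fused


# Toy 4x4 example: instance A (value 1.0) sits behind instance B (value 2.0).
background = [[0.0] * 4 for _ in range(4)]
branch_a = [[1.0] * 4 for _ in range(4)]
branch_b = [[2.0] * 4 for _ in range(4)]
boxes = [(0, 0, 3, 3), (1, 1, 4, 4)]

fused = fuse_layers(background, [branch_a, branch_b], boxes)
for row in fused:
    print(row)
```

In this sketch, reversing the order of `branches` and `boxes` flips which instance wins in the overlapping cells, which mirrors the idea of rearranging the visible order without regenerating the individual branches.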

Pipeline

LayerBind pipeline

Figure 2. Overview of the LayerBind pipeline. Layer-wise Instance Initialization splits early denoising into background and instance branches. Each instance branch generates independently with shared context and is fused to establish the initial layered layout. Layer-wise Semantic Nursing then performs sequential layer-wise attention updates to refine per-region semantics and maintain occlusion consistency throughout denoising.
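One ingredient that a layer-wise refinement stage plausibly needs is a per-instance visible region: the part of each instance's box that is not covered by any layer in front of it. The sketch below computes such masks under stated assumptions (rectangular boxes, a back-to-front ordering, boolean grids). How LayerBind actually uses regions inside attention is not specified here, and `visible_masks` is a hypothetical helper name.

```python
from typing import List, Tuple

# Hypothetical types for this sketch: a box is (top, left, bottom, right),
# end-exclusive, and a mask is a 2D grid of booleans.
Box = Tuple[int, int, int, int]
Mask = List[List[bool]]


def visible_masks(height: int, width: int, boxes: List[Box]) -> List[Mask]:
    """Return one visibility mask per instance.

    `boxes` is ordered back to front. The function walks the layers from
    front to back and marks a cell as visible for a layer only if no layer
    in front of it has already claimed that cell.
    """
    covered = [[False] * width for _ in range(height)]
    front_to_back: List[Mask] = []
    for top, left, bottom, right in reversed(boxes):
        mask = [[False] * width for _ in range(height)]
        for y in range(top, bottom):
            for x in range(left, right):
                if not covered[y][x]:
                    mask[y][x] = True
                    covered[y][x] = True
        front_to_back.append(mask)
    # Restore the caller's back-to-front ordering.
    return list(reversed(front_to_back))


# Toy 4x4 example with two overlapping 3x3 boxes; the second box is in front.
boxes = [(0, 0, 3, 3), (1, 1, 4, 4)]
masks = visible_masks(4, 4, boxes)
visible_counts = [sum(cell for row in mask for cell in row) for mask in masks]
print(visible_counts)
```

Masks of this kind could, for example, restrict which image tokens a given instance prompt is allowed to influence, so that an occluded instance does not repaint the region owned by the layer in front of it.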

Galleries

BibTeX

@inproceedings{LayerBind2026,
  title={Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers},
  author={Ruidong Chen and Yancheng Bai and Xuanpu Zhang and Jianhao Zeng and Lanjun Wang and Dan Song and Lei Sun and Xiangxiang Chu and Anan Liu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}