Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

CVPR 2026

Ruidong Chen¹, Yancheng Bai²‡, Xuanpu Zhang¹, Jianhao Zeng¹, Lanjun Wang¹, Dan Song¹, Lei Sun², Xiangxiang Chu², Anan Liu¹*

¹Tianjin University, ²Independent Researcher
*Corresponding author, ‡Project lead

Figure 1. We propose LayerBind, a training-free strategy that gives text-to-image DiT models regional and occlusion controllability. Compared to prior methods, LayerBind produces customized images that better respect the specified spatial layout and occlusion relations while preserving image quality. The design is based on context-sharing and region-branching generation, which enables editable generation such as changing instances or the visible order.

Abstract

Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: training-based approaches inherit data bias and often degrade image quality, while current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during generation, LayerBind enables precise regional and occlusion controllability. Motivated by the observation that spatial layout and occlusion are established at very early denoising stages, our method proceeds in two phases: Layer-wise Instance Initialization and Layer-wise Semantic Nursing. The first phase creates per-instance branches with shared background context and fuses them early, according to the desired layer order, to establish a structured latent layout. The second phase reinforces regional details and preserves occlusion order through layer-wise attention enhancement and a transparency scheduler. LayerBind is training-free, plug-and-play, and supports editable workflows such as changing per-region instances and rearranging the visible order.
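As a rough illustration of what "fusing branches according to the desired layer order" could look like, the sketch below composites per-instance latents over a background latent from back to front. This is not the authors' implementation: the function name `fuse_layers`, the helper `box_mask`, the use of hard box masks, and the convention that `order` lists instances from rearmost to frontmost are all assumptions made for the example.

```python
import numpy as np

def fuse_layers(background, instances, masks, order):
    """Composite instance latents over a background, back to front.

    background: (C, H, W) latent from the background branch.
    instances:  list of (C, H, W) latents, one per instance branch.
    masks:      list of (H, W) arrays in [0, 1] marking each instance region.
    order:      instance indices from rearmost to frontmost
                (hypothetical convention, not taken from the paper).
    """
    fused = background.copy()
    for idx in order:
        m = masks[idx][None, :, :]  # broadcast the mask over channels
        # Later (front) layers overwrite earlier (rear) layers where they overlap.
        fused = m * instances[idx] + (1.0 - m) * fused
    return fused

def box_mask(h, w, top, left, bottom, right):
    """Hard rectangular mask; 1 inside the box, 0 elsewhere."""
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask

# Toy example: constant-valued latents make the result easy to inspect.
C, H, W = 4, 8, 8
background = np.zeros((C, H, W), dtype=np.float32)
instances = [np.full((C, H, W), 1.0, dtype=np.float32),
             np.full((C, H, W), 2.0, dtype=np.float32)]
masks = [box_mask(H, W, 1, 1, 5, 5), box_mask(H, W, 3, 3, 7, 7)]

fused_a = fuse_layers(background, instances, masks, order=[0, 1])  # instance 1 in front
fused_b = fuse_layers(background, instances, masks, order=[1, 0])  # instance 0 in front
```

In this toy setup, swapping the two entries of `order` changes only the overlapping area of the two boxes, which is the behaviour one would expect from rearranging the visible order.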

Pipeline

Figure 2. Overview of the LayerBind pipeline. Layer-wise Instance Initialization splits early denoising into background and instance branches. Each instance branch generates independently with shared context and is fused to establish the initial layered layout. Layer-wise Semantic Nursing then performs sequential layer-wise attention updates to refine per-region semantics and maintain occlusion consistency throughout denoising.
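Maintaining occlusion consistency requires knowing which part of each region is actually visible once nearer layers are placed on top. The page does not describe how LayerBind derives this, so the following is only a minimal sketch under assumed conventions: `visible_masks` is a hypothetical helper, masks are hard boolean regions, and `order` lists instances from rearmost to frontmost.

```python
import numpy as np

def visible_masks(masks, order):
    """Return the visible part of each instance region.

    masks: list of (H, W) boolean arrays, one per instance.
    order: instance indices from rearmost to frontmost
           (hypothetical convention, not taken from the paper).
    """
    visible = [None] * len(masks)
    covered = np.zeros_like(masks[0], dtype=bool)
    # Walk front to back: a pixel is visible for an instance only if
    # no instance nearer to the viewer has already claimed it.
    for idx in reversed(order):
        m = masks[idx].astype(bool)
        visible[idx] = m & ~covered
        covered |= m
    return visible

# Toy example with two overlapping 4x4 boxes on an 8x8 grid.
H, W = 8, 8
rear = np.zeros((H, W), dtype=bool)
rear[1:5, 1:5] = True
front = np.zeros((H, W), dtype=bool)
front[3:7, 3:7] = True

vis = visible_masks([rear, front], order=[0, 1])
rear_visible_pixels = int(vis[0].sum())
front_visible_pixels = int(vis[1].sum())
```

Masks of this kind could, for instance, restrict which image tokens a region's text prompt is allowed to influence during per-layer updates; whether LayerBind does exactly this is not stated here.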

Galleries

Application 1. Flexible occlusion control and instance modification. LayerBind supports controllable reordering of visible layers and targeted per-instance edits while preserving global scene coherence.

BibTeX

@article{chen2026layer,
  title={Layer-wise instance binding for regional and occlusion control in text-to-image diffusion transformers},
  author={Chen, Ruidong and Bai, Yancheng and Zhang, Xuanpu and Zeng, Jianhao and Wang, Lanjun and Song, Dan and Sun, Lei and Chu, Xiangxiang and Liu, Anan},
  journal={arXiv preprint arXiv:2603.05769},
  year={2026}
}

Application 2. Composited image editing with branch-wise instructions. By treating the original generation as shared background context, LayerBind enables localized edits with consistent layout and occlusion relationships.
