Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. Is it possible to make such models generate generic identities as easily as famous ones, i.e., by just using a name? In this paper, we explore the existence of a ``Name Space'', where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by the text embeddings of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are then used as supervision to learn an encoder that predicts the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in ensuring that generated images exhibit good identity consistency. Note that, like the names of celebrities, our predicted name embeddings are disentangled from the semantics of the text inputs, so the original generation capability of text-to-image models is well preserved. Moreover, by simply plugging in such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models.
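To make the extraction step concrete, below is a minimal sketch of obtaining a ground-truth name embedding with a CLIP text encoder via the HuggingFace transformers library; the checkpoint, the example name, and the token-span selection are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# One of SDXL's two text encoders (CLIP ViT-L); SDXL concatenates its output
# with that of a second encoder (768 + 1280 = 2048 dims). One encoder is
# shown here for brevity; the checkpoint choice is an assumption.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

# Embed a celebrity name as it appears in Laion5B captions.
tokens = tokenizer("Albert Einstein", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    hidden = text_encoder(**tokens).last_hidden_state   # (1, 77, 768)

# Keep only the positions of the actual name tokens; how this span is
# selected is an implementation detail we assume here.
n_name_tokens = int(tokens.attention_mask.sum()) - 2    # drop BOS/EOS
F_ID_gt = hidden[:, 1 : 1 + n_name_tokens]              # ground-truth "name" embedding
```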
Overview of the Proposed Method. (a) Dataset Construction. Celebrity images and their corresponding names are extracted from the Laion5B dataset, and name embeddings are generated from the names with a text encoder $E_{text}$. (b) Image Encoder Architecture and Training. Features are extracted from input images by two CLIP image encoders and fed to a three-layer fully connected network that produces the ``name'' prediction; a Mean Squared Error (MSE) loss is computed against the ground-truth name embedding $F_{ID}^{gt}$. (c) Inference Pipeline. The name embedding predicted by the image encoder $E_{image}$ is combined with the original text embeddings by ``name'' prepending ($NP(\cdot)$) to obtain the final embeddings $F_{prompt}^{ID}$, which then guide a U-Net during denoising. (d) ``Name'' Integration. There is no need to set a specific placeholder or specify its position: simply inserting the name embedding between the start token (red block) and the first semantic token (blue block) suffices to achieve consistent identity generation. Padding tokens (yellow blocks) pushed beyond the length of 77 are discarded.
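The following is a minimal sketch of components (b) and (d) above; the feature dimensions, layer widths, number of name tokens, and function names are our own illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NamePredictor(nn.Module):
    """(b): concatenate two CLIP image features and map them through a
    three-layer fully connected head to a "name" embedding."""
    def __init__(self, in_dim=768 + 1280, hidden=2048, n_tokens=2, token_dim=2048):
        super().__init__()
        self.n_tokens, self.token_dim = n_tokens, token_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, n_tokens * token_dim),
        )

    def forward(self, feat_a, feat_b):
        out = self.mlp(torch.cat([feat_a, feat_b], dim=-1))
        return out.view(-1, self.n_tokens, self.token_dim)

def name_prepend(text_emb, name_emb, max_len=77):
    """(d) NP(.): insert the name embedding between the start token and the
    first semantic token; padding pushed past max_len is discarded."""
    bos, rest = text_emb[:, :1], text_emb[:, 1:]
    return torch.cat([bos, name_emb, rest], dim=1)[:, :max_len]

# Training objective of (b): MSE against the ground-truth name embedding.
predictor = NamePredictor()
feat_a, feat_b = torch.randn(4, 768), torch.randn(4, 1280)   # dummy CLIP features
F_ID_gt = torch.randn(4, 2, 2048)                            # dummy supervision
loss = F.mse_loss(predictor(feat_a, feat_b), F_ID_gt)
```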
Visualization and Comparison. The results cover four tasks, i.e., scene construction, stylization, action control, and emotional editing. Please devote attention to the semantic consistency and visual aesthetics of the images with the same level of concern as given to ID consistency. The results demonstrate that our approach maintains ID consistency while fully preserving the original semantic performance (complex semantic consistency) of the generator (SDXL), a feat not achieved by other works.
In the $\mathcal N$ space, any given point corresponds to a person's identity. For real individuals, we can obtain their mapping into the $\mathcal N$ space using the image encoder proposed and trained in this paper, thereby achieving consistent identity generation. Furthermore, by interpolating between any two points (i.e., name embeddings) within the name space, we can create new fictional characters.
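As a concrete instance, a simple linear blend between two name embeddings realizes such interpolation (linear interpolation is our assumption of how this is instantiated; the embedding shapes are placeholders).

```python
import torch

def interpolate_names(name_a: torch.Tensor, name_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend two points of the N space: alpha=0 yields identity A, alpha=1 identity B."""
    return (1.0 - alpha) * name_a + alpha * name_b

# Sweep alpha to morph identity A into identity B through fictional intermediates.
name_a, name_b = torch.randn(1, 2, 2048), torch.randn(1, 2, 2048)   # placeholder embeddings
fictional = [interpolate_names(name_a, name_b, t.item()) for t in torch.linspace(0, 1, 5)]
```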
The results indicate that, on the one hand, even well-performing generative models such as SDXL exhibit a significant gap between their image outputs and real images. On the other hand, the gap between real test images and generated training images also impacts inference performance. It is therefore evident that the dataset constructed in this study, LaionCele, constitutes a substantial contribution to the field.
The proposed $\mathcal N$ space for consistent ID generation is agnostic to the generative model and can be seamlessly integrated with any SDXL-based variant to achieve ID-consistent image generation. This figure presents consistent ID generation results with styleUnet. The experimental outcomes indicate that a ``name'' sampled from the $\mathcal N$ space can be readily applied to alternative generative models to produce ID-consistent generations without compromising their original specialized capabilities.
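As a sketch of how a predicted ``name'' plugs into an SDXL-derived checkpoint via the diffusers library: the checkpoint path and the dummy `name_emb` are placeholders, and `name_prepend` refers to the $NP(\cdot)$ sketch given earlier.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Any SDXL-derived variant, e.g., a checkpoint downloaded from Civitai.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "path/to/sdxl-variant", torch_dtype=torch.float16
).to("cuda")

# Dummy name embedding; in practice predicted by our image encoder E_image.
name_emb = torch.randn(1, 2, 2048)

# Encode the prompt, then splice in the name embedding with NP(.).
prompt_embeds, _, pooled_embeds, _ = pipe.encode_prompt("a person reading in a cafe")
prompt_embeds = name_prepend(prompt_embeds, name_emb.to(prompt_embeds))

image = pipe(prompt_embeds=prompt_embeds,
             pooled_prompt_embeds=pooled_embeds).images[0]
```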
The construction of the LaionCele dataset represents one of the significant contributions of this paper, with the detailed construction process summarized in this figure.
This figure demonstrates the identity and semantic consistency of our method on decorative generation tasks.
Supplementary cases demonstrating ID consistency and semantic consistency in action control generation tasks.
Supplementary cases demonstrating ID consistency and semantic consistency in scene construction generation tasks.
Supplementary cases demonstrating ID consistency and semantic consistency in emotional editing tasks.
Supplementary cases demonstrating ID consistency and semantic consistency in stylization generation tasks.
The experimental data presented in Table 2 of the main text indicate that fine-tuning enhances generative ID consistency. Here, we demonstrate the effect of LoRA fine-tuning on the LaionCele dataset using our approach.