
Preetish Kakkar

Hariharan Ragothaman

Ananya Ghosh Chowdhury

Abstract

In this paper, we propose a new pipeline for generating and editing 3D-aware images that addresses a set of critical problems around semantic and appearance consistency across modalities. The proposed method relaxes rigid constraints on the spatial arrangement of visual attributes, and its cross-attention-based disentanglement process affords a higher degree of control. Extensive evaluations show that the framework outperforms existing baselines, achieving an FID of 21.28 and a KID of 0.008, indicating improved image quality. It further attains an mIoU of 0.49 and a pixel accuracy of 0.88, confirming the model's semantic alignment capability, and a Free-Viewpoint Video (FVV) identity score of 0.53, validating its ability to preserve subject identity in the FVV setting. These results demonstrate the framework's flexibility: it accepts multi-modal inputs, including pure noise, textual descriptions, and reference images, and can therefore serve a variety of creative disciplines such as art generation, virtual reality, and gaming. Beyond its contribution to 3D-aware image generation, this work also lays a foundation for future research on related problems in image synthesis and real-time applications.
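For readers unfamiliar with the mechanism the abstract refers to, the sketch below shows a generic cross-attention layer of the kind commonly used to condition image features on another modality (e.g., text tokens). The class name, dimensions, and PyTorch framing are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of cross-attention between image features and a
# conditioning modality. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim_q: int, dim_kv: int, dim_head: int = 64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_q, dim_head, bias=False)   # queries from image features
        self.to_k = nn.Linear(dim_kv, dim_head, bias=False)  # keys from conditioning tokens
        self.to_v = nn.Linear(dim_kv, dim_head, bias=False)  # values from conditioning tokens
        self.to_out = nn.Linear(dim_head, dim_q)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_image_tokens, dim_q); context: (batch, n_cond_tokens, dim_kv)
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        # Each image token aggregates conditioning values weighted by relevance,
        # which is what allows attribute-level control from the conditioning input.
        return self.to_out(attn @ v)

# Usage: 256 image tokens attend over 77 conditioning tokens.
x = torch.randn(1, 256, 128)
ctx = torch.randn(1, 77, 512)
out = CrossAttention(128, 512)(x, ctx)
print(out.shape)  # torch.Size([1, 256, 128])
```

Because the attention weights tie each image token to specific conditioning tokens, mechanisms of this kind are a natural basis for the attribute disentanglement the abstract describes.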
