In recent years, progress in vision-language models has transformed multi-modal understanding. However, the understanding of synthetic, generated images remains underexplored. These images pose a unique challenge due to their diverse content and stylistic attributes, which current models struggle to fully comprehend. To address this gap, we introduce JourneyDB, a large-scale dataset designed specifically for multi-modal visual understanding of generated images.
This curated dataset contains over 4 million diverse, high-quality generated images, each paired with the textual prompt that produced it. In addition, we have developed four benchmarks that quantify performance in understanding generated images from both content and style perspectives: prompt inversion, style retrieval, image captioning, and visual question answering.
Finally, we evaluate the performance of existing leading multi-modal models using JourneyDB, presenting a detailed analysis of their effectiveness in understanding generated content. We hope that JourneyDB and our proposed benchmarks will drive further research in generative content understanding. The dataset will be made available at https://journeydb.github.io.
Introduction
In recent years, vision-language models, which combine visual and textual understanding, have improved dramatically, leading to revolutionary developments in multi-modal understanding. One area within this domain remains largely uncharted, however: the understanding of synthetic, generatively created images. Characterized by their diverse content and styles, synthetic images present a unique set of challenges that traditional models have yet to overcome.
JourneyDB: A Dataset for Generative Image Understanding
To tackle this problem, we have created JourneyDB, a large-scale dataset developed specifically for multi-modal visual understanding of generated images. The dataset comprises over 4 million diverse, high-quality generated images, and each entry pairs an image with the text prompt used to generate it.
The dataset provides a rich variety of image and text data, giving researchers and AI developers an ideal ground for training and evaluating models that understand generated images.
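To make the structure of an entry concrete, the Python sketch below shows one way such image-prompt records might be loaded. The schema here (field names such as image_path, prompt, style, and caption) is purely illustrative; the released dataset defines its own format.

```python
import json
from dataclasses import dataclass, field

# Illustrative record layout for one JourneyDB entry. The field names
# are assumptions for this sketch, not the dataset's actual schema.
@dataclass
class JourneyDBEntry:
    image_path: str   # path to the generated image file
    prompt: str       # text prompt used to generate the image
    style: list = field(default_factory=list)  # style keywords (assumed)
    caption: str = ""                           # descriptive caption (assumed)

def load_entries(annotation_file: str) -> list:
    """Load annotations, assuming one JSON object per line."""
    entries = []
    with open(annotation_file, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            entries.append(JourneyDBEntry(
                image_path=record["image_path"],
                prompt=record["prompt"],
                style=record.get("style", []),
                caption=record.get("caption", ""),
            ))
    return entries
```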
Benchmarks for Evaluating Performance
As part of our work on JourneyDB, we have also established four benchmarks to measure models' performance in understanding generative images, focusing on both content and stylistic interpretation.
Prompt Inversion: This involves generating a text prompt from a given image, effectively "inverting" the image creation process.
Style Retrieval: In this task, models must identify and retrieve images with similar stylistic properties from the dataset.
Image Captioning: This requires the model to generate a descriptive caption for each image.
Visual Question Answering: This task tests a model's ability to answer questions about the content and style of a given image.
Together, these benchmarks provide a robust framework for evaluating a model's ability to understand generated images; a minimal sketch of how prompt inversion might be scored follows below.
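As one concrete illustration, a prompt-inversion output can be compared against the ground-truth prompt with a text-similarity metric. The sketch below uses a simple token-overlap F1 as a stand-in; this is our illustrative choice, not necessarily the official benchmark's metric, which may instead rely on n-gram or embedding-based similarity.

```python
def prompt_inversion_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference prompt.

    A deliberately simple stand-in metric: real evaluations typically
    use BLEU-style n-gram scores or embedding similarity instead.
    """
    pred_tokens = set(predicted.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = pred_tokens & ref_tokens
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: score a model's inverted prompt against the ground truth.
pred = "a watercolor painting of a fox in a forest"
ref = "watercolor painting, fox, forest, soft light"
print(f"token-overlap F1: {prompt_inversion_f1(pred, ref):.2f}")
```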
Evaluating Current Models with JourneyDB
We used JourneyDB to assess current state-of-the-art multi-modal models and observed varied levels of success in comprehending generated content: while some models interpreted certain image styles effectively, others struggled, indicating clear room for improvement.
Conclusion
JourneyDB represents an exciting opportunity for the continued advancement of generative content understanding. By providing a large-scale dataset and rigorous benchmarks, we hope to inspire further research and development in this field, leading to models that can effectively understand the complexities and nuances of generated images. JourneyDB will be made available to the research community at https://journeydb.github.io.