In the grand theatre of life, our brains serve as both audience and projector, watching and creating an unbroken sequence of experiences. In this naturalistic paradigm, the brain is a moviegoer watching the non-stop film of its own life. A central challenge in cognitive neuroscience is to decode the information carried by this complex brain activity, and in particular to reconstruct human vision from brain recordings.
Reconstructing human vision, especially with non-invasive tools such as functional Magnetic Resonance Imaging (fMRI), is an exciting yet arduous task. Neuroimaging data is complex and expensive to acquire, and non-invasive recordings are affected by noise and other interference. Even so, deep learning and representation learning have made significant strides, notably in learning useful fMRI features from a limited number of fMRI-annotation pairs, and they are giving us valuable insight into human perception.
Recreating the Perpetual Theatre of Human Vision
Unlike a static image, our vision is a fluid, diverse sequence of scenes, movements, and objects. Capturing this dynamic visual experience is difficult because of the nature of fMRI, which measures blood-oxygenation-level-dependent (BOLD) signals and takes a snapshot of brain activity only every few seconds. Each snapshot is essentially an "average" of brain activity over that period, which may span many video frames with varied visual stimuli.
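To make the mismatch in time scales concrete, here is a minimal Python sketch of how many video frames end up inside one fMRI snapshot. The 2-second sampling interval and the 30 frames-per-second rate are illustrative assumptions, not values taken from the study, and the plain average is a deliberate simplification: the real BOLD signal is also delayed and smoothed by the hemodynamic response.

```python
def frames_per_scan(scan_interval_s: float, fps: float) -> int:
    """Number of video frames shown during one fMRI sampling interval."""
    return max(1, round(scan_interval_s * fps))


def average_per_scan(frame_values, scan_interval_s, fps):
    """Average a per-frame scalar (e.g. brightness) over each sampling interval.

    Simplification: a real BOLD response is not a plain average of the stimulus.
    """
    n = frames_per_scan(scan_interval_s, fps)
    averages = []
    for start in range(0, len(frame_values), n):
        window = frame_values[start:start + n]
        averages.append(sum(window) / len(window))
    return averages


# Assumed values: one scan every 2 s, video at 30 frames per second.
print(frames_per_scan(2.0, 30))
```

Under these assumed values, a single snapshot has to stand in for dozens of frames, which is why the frames cannot be decoded one by one.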
In our latest work, we present MinD-Video, a two-module pipeline designed to bridge the gap between image and video brain decoding. The model learns from brain signals progressively, gaining a deeper understanding of the semantic space over multiple stages. First, large-scale unsupervised learning with masked brain modeling is used to learn general visual fMRI features. These features are then fine-tuned through co-training with an augmented Stable Diffusion model tailored to video generation under fMRI guidance.
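The masking step behind masked brain modeling can be sketched in a few lines of Python. This is only an illustration of the general idea (hide a share of the signal and ask a model to reconstruct it). The patch size, the 75% mask ratio, and the function name `mask_patches` are hypothetical choices, not details from the study.

```python
import random


def mask_patches(signal, patch_size, mask_ratio, seed=0):
    """Split a 1D signal into patches and zero out a random share of them.

    Returns the patches (masked ones replaced by zeros) and the sorted
    indices of the masked patches, which a model would try to reconstruct.
    """
    patches = [signal[i:i + patch_size] for i in range(0, len(signal), patch_size)]
    rng = random.Random(seed)
    n_masked = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_masked))
    visible = [
        [0.0] * len(patch) if i in masked_idx else list(patch)
        for i, patch in enumerate(patches)
    ]
    return visible, sorted(masked_idx)


# Hypothetical toy signal of 8 values, split into 4 patches, 75% masked.
visible, masked = mask_patches([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2, 0.75)
print(visible, masked)
```

Because no labels are involved, this kind of pre-training can use large amounts of unannotated fMRI data before the scarce annotated pairs come into play.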
A Step-by-Step Approach: From Learning to Generating
MinD-Video's progressive learning scheme begins with an fMRI encoder that learns brain features in multiple stages. After pre-training, the encoder is aligned with the Contrastive Language-Image Pre-Training (CLIP) space through contrastive learning, using the multimodal annotations of the dataset to distill semantically relevant features.
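Contrastive alignment of this kind is commonly implemented with a symmetric cross-entropy loss over cosine similarities, as in CLIP itself. The pure-Python sketch below shows that generic loss; it is an assumption that the study uses this exact form, and the temperature of 0.07 and the function names are illustrative.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(fmri_embeddings, target_embeddings, temperature=0.07):
    """Symmetric contrastive loss between paired fMRI and target embeddings."""
    a = [_normalize(v) for v in fmri_embeddings]
    b = [_normalize(v) for v in target_embeddings]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy embeddings: matched pairs should give a lower loss than shuffled pairs.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)
```

Minimizing this loss pulls each fMRI embedding toward the embedding of its own annotation and pushes it away from the others in the batch.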
In the next stage, the learned features are fine-tuned through co-training with an augmented Stable Diffusion model. This model is designed for scene-dynamic video generation with near-frame attention and is adapted to fMRI conditioning through adversarial guidance. The modular design makes the decoding pipeline more flexible: the fMRI encoder and the augmented Stable Diffusion model can be trained separately and then fine-tuned together.
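One way to picture near-frame attention is as a rule for which frames each generated frame may look at. The short Python sketch below assumes a causal window covering the current frame and its two predecessors. The window size and the function name are assumptions made for illustration, not confirmed details of the model.

```python
def near_frame_indices(num_frames: int, window: int = 2):
    """For each frame, list the frame indices it attends to.

    Assumption: each frame sees itself plus up to `window` previous frames.
    """
    return [list(range(max(0, i - window), i + 1)) for i in range(num_frames)]


for frame, neighbours in enumerate(near_frame_indices(4)):
    print(frame, neighbours)
```

Restricting attention to nearby frames keeps consecutive frames consistent while still letting the scene change over the course of a clip, since no frame is tied to the very first one.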
Outperforming State-of-the-Art Models
By combining the progressive learning scheme with the augmented Stable Diffusion model, MinD-Video reconstructs high-quality videos that capture semantics, motion, and scene dynamics. The method achieves 85% accuracy on semantic metrics and 0.19 on the structural similarity index (SSIM), outperforming the previous state of the art by 45%.
In addition, attention analysis shows that the model's attention maps onto the visual cortex and higher cognitive networks, which suggests that the model is biologically plausible and interpretable.
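For readers unfamiliar with SSIM, the Python sketch below computes a simplified, single-window version of the index from its standard formula. Practical implementations average the index over many local windows of an image, so this global variant is only meant to show what the number measures. The function name and the toy pixel values are illustrative.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized lists of pixel values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


pixels = [0.0, 0.25, 0.5, 1.0]
print(global_ssim(pixels, pixels))        # identical inputs score about 1
print(global_ssim(pixels, pixels[::-1]))  # a mirrored pattern scores lower
```

SSIM compares brightness, contrast, and structure at the pixel level, so it complements the semantic metrics, which only check whether the right content was decoded.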
Looking Beyond
The advances made with MinD-Video bring us a step closer to understanding the ceaseless spectacle of human perception. By learning and fine-tuning features progressively, the pipeline offers a flexible and adaptable approach to brain decoding. Challenges remain, such as the variability of the hemodynamic response (HR) across subjects and brain regions, but the progress made in this study points to exciting prospects for cognitive neuroscience. In the grand theatre of life, we are getting closer to understanding not just the moviegoer but also