Meta unveils I-JEPA, an AI computer vision model that learns more like humans do

June 14, 2023  20:05

Meta Platforms Inc., a leading company in artificial intelligence (AI) research, is making significant strides toward an architecture that enables machines to learn internal models of the world. Yann LeCun, Chief AI Scientist at Meta, envisions this architecture as a way for AI models to learn faster, plan complex tasks effectively, and adapt readily to unfamiliar situations. Today, Meta's AI team announced the first AI model built on a key component of LeCun's vision: the Image Joint Embedding Predictive Architecture (I-JEPA).

Unlike traditional AI models that compare images pixel by pixel, I-JEPA learns by building an internal model of the outside world and comparing abstract representations of images rather than the pixels themselves. This approach closely resembles the way humans acquire new concepts. I-JEPA is built on the premise that humans passively accumulate substantial background knowledge as they observe the world, and it seeks to replicate this learning process by capturing common-sense background knowledge and encoding it into digital representations that can be accessed later. To be efficient, those representations must be learned in a self-supervised manner, directly from unlabeled data such as images and sounds, rather than from manually labeled datasets.

At its core, I-JEPA predicts the representation of one part of an input from the representations of other parts of the same input. This sets it apart from generative AI models, which remove or distort portions of the input and then attempt to reconstruct the missing pixels. The generative approach tends to focus on irrelevant details, striving to fill in every bit of missing information despite the inherently unpredictable nature of the world. As a result, generative methods often make mistakes a human never would, such as rendering a hand with extra fingers.

I-JEPA circumvents these pitfalls by predicting missing information in a more human-like manner: it uses abstract prediction targets from which unnecessary pixel-level details have been eliminated. Through this approach, I-JEPA's predictor can model spatial uncertainty in a static image from partially observable context, predicting higher-level information about unseen regions rather than fixating on individual pixel values.
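In rough terms, this kind of training pairs a context encoder, a target encoder, and a predictor, with the loss computed between patch representations rather than pixels. The sketch below is a deliberately simplified illustration of that idea, not Meta's released implementation: the module sizes, the random context/target split, and the pooling-based predictor are assumptions made for brevity (the actual I-JEPA predictor, for instance, conditions on positional information for each target block).

```python
# Minimal sketch of a joint-embedding predictive training step (illustrative only).
# All names, sizes, and the masking scheme are assumptions, not Meta's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, DIM = 196, 256  # e.g. a 14x14 grid of patch embeddings (assumed sizes)

def make_encoder():
    # Stand-in for the much larger vision transformer encoders used in practice.
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

context_encoder = make_encoder()           # sees only the visible "context" patches
target_encoder = make_encoder()            # produces the prediction targets
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():      # targets come from a frozen, EMA-updated copy
    p.requires_grad_(False)

predictor = nn.Sequential(                 # predicts target representations from context
    nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM)
)

def training_step(patch_embeddings, optimizer, ema_decay=0.996):
    """One step on a batch of patch embeddings, shape (B, PATCHES, DIM).
    In practice these would come from patchifying an image and projecting it."""
    # Randomly split patches into a visible context block and a hidden target block.
    perm = torch.randperm(PATCHES)
    ctx_idx, tgt_idx = perm[: PATCHES // 2], perm[PATCHES // 2 :]

    ctx_repr = context_encoder(patch_embeddings[:, ctx_idx])      # (B, |ctx|, DIM)
    with torch.no_grad():
        tgt_repr = target_encoder(patch_embeddings)[:, tgt_idx]   # (B, |tgt|, DIM)

    # Predict each hidden patch's representation from the pooled context.
    pooled = ctx_repr.mean(dim=1, keepdim=True)                   # (B, 1, DIM)
    pred = predictor(pooled).expand(-1, tgt_idx.numel(), -1)

    loss = F.smooth_l1_loss(pred, tgt_repr)   # loss lives in representation space, not pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Slowly move the target encoder toward the context encoder (exponential moving average).
    with torch.no_grad():
        for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
            pt.mul_(ema_decay).add_(pc, alpha=1 - ema_decay)
    return loss.item()

# Tiny usage example with random data standing in for real patch embeddings.
optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)
print(training_step(torch.randn(8, PATCHES, DIM), optimizer))
```

Because the target encoder's weights track the context encoder only through a slow moving average, the prediction targets stay stable from step to step while still reflecting what the model has learned, which is what allows the loss to be defined entirely over representations rather than raw pixels.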

Meta reports that I-JEPA performs strongly across a range of computer vision benchmarks while being considerably more computationally efficient than other widely used computer vision models. Moreover, the representations learned by I-JEPA can be reused for other applications without extensive fine-tuning. Meta's researchers highlight the results, stating, "For example, we train a 632-million-parameter visual transformer model using 16 A100 GPUs in under 72 hours and it achieves state-of-the-art performance for low-shot classification on ImageNet, with only 12 labeled examples per class." By comparison, alternative methods typically consume 2 to 10 times more GPU-hours and achieve worse error rates with the same amount of data.

Meta emphasizes the significant potential of architectures that can learn competitive off-the-shelf representations without relying on additional knowledge encoded in hand-crafted image transformations. To foster collaboration and innovation, Meta's researchers are open-sourcing both I-JEPA's training code and model checkpoints. Moving forward, their objective is to expand the approach to other domains, including image-text paired data and video data, anticipating exciting applications for JEPA models in tasks like video understanding. They believe this marks a crucial step towards the application and scalability of self-supervised methods in learning a comprehensive model of the world. 
