To understand scenes from images, videos, or 3D data, computer vision often relies on models trained on large datasets. As the field has matured, algorithms are now expected to perform in real-world conditions that go beyond those seen at training time. Hence, new tasks have emerged, such as generalization, open-world vision (adjusting to unseen conditions like lighting or weather), robustness to adversarial attacks, and image generation. Weakly-supervised learning helps address these challenges because it relaxes the need for costly annotations and reduces the biases of training datasets, thus paving the way to real-world applications such as autonomous driving, mobile robotics, virtual reality, and image generation.
This workshop will dive into the latest research with talks from renowned speakers. Among other topics, we will address techniques such as vision-language models (VLMs), transfer learning, diffusion models, contrastive learning, vision transformers (ViTs), continual learning, and neural fields, while focusing on how to relax supervision (fewer labels, less data), adapt to unseen data, or benefit from other modalities (text, image + text, video + text).

Thanks to everyone for the great workshop. See the awards below 🏆

Invited Speakers

Stéphane Lathuilière, Telecom Paris
Mohamed H. Elhoseiny, KAUST
Mathilde Caron, Google
Mathieu Salzmann, EPFL
Gül Varol, École des Ponts

Program

Times are local Ghana time (GMT).
11:00am - Workshop start
11:00am - Opening remarks. Raoul de Charette, Inria
11:10am - Stéphane Lathuilière, Telecom Paris
Data Frugality in Image Generation, Image Generation for Data Frugality
Abstract:
In this talk, we aim to showcase recent developments that demonstrate how image generation tasks can now be addressed with only a few examples. We'll explain how image synthesis models conditioned on semantic maps can be trained effectively using a small set of samples through a transfer learning approach. Next, we'll explore how recent text-conditioned diffusion models go even further, enabling semantic image synthesis in a zero-shot manner.
Furthermore, we'll delve into the potential of deep image-generation techniques to facilitate learning in perception tasks with minimal training data. Specifically, we'll investigate how a semantic segmentation model can be adapted from one visual domain to another using just a single unlabeled sample. This adaptation is achieved through leveraging pre-trained text-to-image models. While earlier methods relied on style transfer for such adaptations, we'll illustrate how text-to-image diffusion models can generate synthetic target datasets that replicate real-world scenes and styles. This method allows guiding image generation towards specific concepts while retaining spatial context from a single training image.
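As a rough illustration of the idea of generating a synthetic target dataset with a text-to-image diffusion model (a minimal sketch only, not the speaker's actual pipeline; the checkpoint name and prompts below are placeholder assumptions), one could use the Hugging Face diffusers library:

```python
# Sketch: synthesize a few "target domain" images from text prompts with an
# off-the-shelf diffusion model; such images could later be pseudo-labeled or
# used to adapt a segmentation model. Checkpoint and prompts are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a city street at night in heavy rain, dashcam photo",
    "a foggy suburban road at dawn, dashcam photo",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synthetic_target_{i:03d}.png")
```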
12:00pm - Mohamed H. Elhoseiny, KAUST
Imaginative Vision Language Models: Towards human-level imaginative AI skills transforming species discovery, content creation, self-driving cars, and Emotional Health / Health Care
Abstract:
Most existing AI learning methods can be categorized into supervised, semi-supervised, and unsupervised methods. These approaches rely on defining empirical risks or losses on the provided labeled and/or unlabeled data. Beyond extracting learning signals from labeled/unlabeled training data, we will reflect in this talk on a class of methods that can learn beyond the vocabulary they were trained on and can compose or create novel concepts. Specifically, we address the question of how these AI skills may assist species discovery, content creation, self-driving cars, emotional health, and more. We refer to this class of techniques as imaginative AI methods, and we will dive into how we developed several approaches to build machine learning methods that can See, Create, Drive, and Feel. See: recognize unseen visual concepts through imaginative learning signals, and extend this to a continual setting where seen and unseen classes change dynamically. Create: generate novel art and fashion with creativity losses. Drive: improve trajectory forecasting for autonomous driving by modeling hallucinative driving intents. Feel: generate emotional descriptions of visual art that are metaphoric and go beyond grounded descriptions, and build these AI systems to be more inclusive of multiple cultures. I will conclude by pointing out future directions where imaginative AI may help develop better assistive technology, a more multicultural and inclusive metaverse, emotional health, and drug discovery.
🍽️ Lunch break 🍽️
2:00pm - Mathilde Caron, Google
Large-scale and Efficient Visual Understanding with Transformers
Abstract:
I will present various advances that we have made in developing accurate and scalable models for visual and multi-modal understanding that require few annotations. My presentation will include:
DINO: We expose our observations from adapting self-supervised learning to vision transformers: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image; second, these features transfer to downstream tasks with minimal adaptation and are, for instance, excellent k-NN classifiers (see the sketch after this list).
Verbs in action: Current video models and benchmarks have a single frame/object/noun bias. We tackle this problem by bringing together a set of benchmarks that focus more on "verb" or "temporal" understanding and use LLMs to create harder text pairs for contrastive pretraining that forces video models to pay more attention to verbs.
RECO: We propose to equip existing foundation models with the ability to refine their embedding with cross-modal retrieved information from an external memory at inference time, which greatly improves zero-shot predictions on fine-grained recognition tasks.
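The DINO point above mentions that frozen self-supervised ViT features already make excellent k-NN classifiers. Below is a minimal sketch of that evaluation protocol, assuming the publicly released DINO ViT-S/16 weights from torch.hub; the dataset paths and the value of k are placeholders, and this is an illustration rather than the authors' exact evaluation code.

```python
# Sketch: frozen DINO ViT-S/16 features + a k-NN classifier (no fine-tuning).
import torch
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from sklearn.neighbors import KNeighborsClassifier

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def extract(split_dir):
    # Returns frozen [CLS] features (384-d for ViT-S/16) and labels.
    feats, labels = [], []
    loader = DataLoader(ImageFolder(split_dir, transform), batch_size=64)
    with torch.no_grad():
        for x, y in loader:
            feats.append(model(x))
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

train_f, train_y = extract("data/train")  # placeholder paths
test_f, test_y = extract("data/val")
knn = KNeighborsClassifier(n_neighbors=20).fit(train_f, train_y)
print("k-NN accuracy:", knn.score(test_f, test_y))
```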
2:50pm - Mathieu Salzmann, EPFL
Generalizing to Unseen Objects without Re-training
Abstract:
In many practical situations, the objects observed when deploying a deep learning model may differ from those seen during training. This is, for example, the case in automated driving, where obstacles on the road can be of any kind, significantly differing from the standard categories used to train an object detector. It also occurs in the context of space debris capture, which relies on estimating the 6D pose of debris that may differ from the objects seen during training. In such situations, no annotations are provided for the new objects, and re-training or fine-tuning the model is often not practical. Nevertheless, such unseen objects must be handled to ensure operational safety. In this talk, I will present our recent progress towards training deep learning models able to generalize to new object categories at test time. I will focus on the scenarios of road obstacle detection and 6D object pose estimation, and will show that, in both cases, generalization can be facilitated by learning to compare images.
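The closing idea, generalizing by learning to compare images, can be pictured with a generic siamese-style sketch. The model below is purely illustrative (an assumed toy architecture, not the speaker's actual method): a shared encoder embeds a query crop and a reference template, and their cosine similarity acts as the comparison score, so that once trained on seen objects the same comparison can be applied to unseen ones without re-training.

```python
# Toy "learning to compare" sketch: shared encoder + cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class CompareNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.encoder = backbone

    def forward(self, query, reference):
        q = F.normalize(self.encoder(query), dim=1)
        r = F.normalize(self.encoder(reference), dim=1)
        return (q * r).sum(dim=1)  # cosine similarity per pair

net = CompareNet()
query = torch.randn(4, 3, 224, 224)      # e.g. candidate road patches
reference = torch.randn(4, 3, 224, 224)  # e.g. templates of a new object
scores = net(query, reference)           # higher = more likely the same object
print(scores.shape)  # torch.Size([4])
```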
3:40pm - Spotlight presentations

  • Mirror-Aware Neural Humans. Daniel Ajisafe, James Tang, Shih-Yang Su, Bastian Wandt, Helge Rhodin
  • COVID-Attention: Efficient COVID19 Detection using Pre-trained Deep Models based on Vision Transformers and X-ray Images. Imed-eddine Haouli, Walid Hariri, Hassina Seridi-Bouchelaghem
  • 3D reconstructions of brain from MRI scans using neural radiance fields. Khadija Iddrisu, Sylwia Malec, Alessandro Crimi
  • Mobile-Based Early Skin Disease Diagnosis for Melanin-Rich Skins. Mathews Jahnical Jere
  • Hybrid Optimization of Coccidiosis Chicken Disease Prediction, Detection and Prevention Using Deep Learning Frameworks. David Wairimu
4:00pm - 👁️🗣️🤝🏾 Poster sessions (15 posters) + ☕ Coffee Break

4:50pm - Gül Varol, École des Ponts
Automatic annotation of open-vocabulary sign language videos
Abstract:
Research on sign language technologies has suffered from the lack of data to train machine learning models. This talk will describe our recent efforts on scalable approaches to automatically annotate continuous sign language videos, with the goal of building a large-scale dataset. In particular, we leverage weakly-aligned subtitles from sign-interpreted broadcast footage. These subtitles provide us with candidate keywords to search for and localise individual signs. To this end, we develop several sign spotting techniques: (i) using mouthing cues at the lip region, (ii) looking up videos from sign language dictionaries, and (iii) exploiting the sign localisation that emerges from the attention mechanism of a sequence prediction model. We further tackle the subtitle alignment problem to improve their synchronization with signing. With these methods, we build the BBC-Oxford British Sign Language Dataset (BOBSL): more than a thousand hours of continuous signing videos containing millions of sign instance annotations from a large vocabulary. More information about the dataset can be found at https://arxiv.org/abs/2111.03635
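To make the dictionary-based spotting idea (ii) concrete, here is a toy sketch assuming precomputed per-frame video features and a single embedded dictionary exemplar; the window length, threshold, and feature dimensions are illustrative placeholders, not the parameters used in the talk.

```python
# Toy sign spotting: slide a window over continuous-video features and keep
# segments whose pooled embedding is similar to a dictionary sign embedding.
import torch
import torch.nn.functional as F

def spot_sign(video_feats, dict_feat, window=16, threshold=0.8):
    """video_feats: [T, D] per-frame features; dict_feat: [D] exemplar embedding."""
    dict_feat = F.normalize(dict_feat, dim=0)
    hits = []
    for t in range(video_feats.shape[0] - window + 1):
        clip = F.normalize(video_feats[t:t + window].mean(dim=0), dim=0)
        score = torch.dot(clip, dict_feat).item()
        if score > threshold:
            hits.append((t, t + window, score))  # candidate localisation
    return hits

# Example with random placeholder features.
print(spot_sign(torch.randn(300, 256), torch.randn(256)))
```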
5:40pm - Panel and Closing Remarks + 🏆 Announcement of the Awards 🏆

6:00pm - Workshop end

Awards

Best paper award

Mirror-Aware Neural Humans
Daniel Ajisafe, James Tang, Shih-Yang Su, Bastian Wandt, Helge Rhodin
Prize: GoPro Hero11 camera

Best poster award

Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition
Marwa Dhiaf, Mohamed Ali Souibgui, Ahmed Cheikh Rouhou, Kai Wang, Yuyang Liu, Yousri Kessentini, Alicia Fornés
Prize: Nvidia Jetson Nano

Honorable mention award

Segmentation of Tuta Absoluta’s Damage on Tomato Plants: A Computer Vision Approach
Loyani K. Loyani, Karen Bradshaw, Dina Machuve

Call for Papers

We welcome submissions of short or regular papers on ANY computer vision topic for presentation at the poster session. Submissions can be original or recently published work.

Submission deadline: August 13th 2023 (Anywhere on Earth).
Submission deadline (EXTENSION): August 20th 2023 (Anywhere on Earth).
🏆 Best papers will be awarded with a prize. 🏆
1000€ prize pool (GoPro, Nvidia)

Submission website:
https://openreview.net/group?id=DeepLearningIndaba.com/2023/Workshop/WSCV

Instructions:
Submissions should be 4 to 8 pages (excluding references).
We encourage submissions to use our double-column LaTeX kit, but we will accept single- or double-column submissions in any format. Anonymity is optional.
We accept submissions that are original, under review, or already published.


Topics of interest include, but are not limited to, the themes listed in the workshop description above.

Organizers

Fabio Pizzati, Oxford Uni.
Patrick Pérez, Valeo.ai
Tuan-Hung Vu, Valeo.ai
Andrei Bursuc, Valeo.ai

Volunteers

Abdulhaq Adetunji Salako, Research fellow
Imane Hamzaoui, ESI
Benjamin Rukundo


📢 Want to volunteer? Any questions? Contact Raoul de Charette.


Thanks to our award sponsors