Since the acquisition and annotation of real-world data is complex, computer vision datasets only capture a fraction of our continuous world. To cope with unseen conditions, fight biases in the training data, or simply reduce our dependency on data, algorithms may be trained in a weakly-/un-supervised fashion. Recently, novel avenues of research have emerged to relax supervision (fewer labels, less data), for example using multimodal models, generative AI, transfer learning, continual learning, etc. This lets us foresee new frontiers of computer vision, holding immense potential for African society.
This 3rd WSCV edition will gather leading computer vision figures, with keynotes and lightning talks on topics such as:
zero-shot training, multimodal models / foundation models, open-vocabulary, self-/un-supervised training, diffusion models, robustness and uncertainty estimation; as well as talks on African initiatives.
📢 The workshop will have a poster session showcasing participants' work on computer vision.
Prizes will be awarded 🏆
News 08/21: Paper submission is now closed. DLI participants can still present their DLI poster at the workshop (cf. contact at the bottom). Prize details have been announced ;-)
Invited Speakers
Vicky Kalogeiton
École Polytechnique
Daniel Omeiza
University of Oxford
Joyce Nakatumba-Nabende
Makerere University
Oriane Siméoni
valeo.ai
Candace Ross
Meta AI
Daniel Ajisafe
University of British Columbia
Pierluigi Zama Ramirez
University of Bologna
Program
Times are local Dakar time (GMT).
08:30am
- Workshop start, opening remarks (Raoul de Charette, Inria)
08:30am
- Oriane Siméoni, valeo.ai: Object localization (almost) for free harnessing self-supervised features
[+] Abstract
The localization of objects in images is today at the heart of many perception systems. However, training object detectors requires large and expensive annotation campaigns for a finite, pre-defined vocabulary. Instead, being able to discover objects in images without knowing in advance which objects populate a dataset is an exciting prospect. In this talk we will discuss solutions that exploit self-supervised pre-trained features to perform class-agnostic object localization with zero annotation, without requiring object proposals or expensive exploration of image collections. Then, we will investigate means to unite unsupervised object localization with VLM open-vocabulary features, leading to good-quality open-vocabulary semantic segmentation with no extra annotation.
09:30am
- Pierluigi Zama Ramirez, University of Bologna: Neural Processing of 3D Neural Fields
[+] Abstract
In recent years, Neural Fields have emerged as an effective tool for encoding diverse continuous signals such as images, videos, and 3D shapes. However, given that Neural Fields are essentially neural networks, it remains unclear whether and how they can be seamlessly integrated into deep learning pipelines for solving downstream tasks. This presentation delves into the novel research problem of Neural Fields processing through deep learning pipelines, exploring techniques for leveraging this data representation to perform tasks such as classification, segmentation, and even more complex tasks like natural language understanding of the neural field content.
10:30am
☕ Coffee Break
11:00am
- Daniel Ajisafe, University of British Columbia: Behind the Scenes - Learning Human Body Pose, Shape and Appearance from Mirror Videos
[+] Abstract
Humans exist as an essential part of the world, and it is important to develop algorithms that can reconstruct humans in their full digital form. While prior works have attempted to collect 3D data using marker suits or to achieve this purpose with multiple cameras, mirrors are an affordable and widely available alternative, producing a reflection of the person that is temporally synchronized. In this talk, I will uncover what is behind the scenes: specifically, our main contributions, which extend articulated neural radiance fields to include a notion of the mirror and make them sample-efficient over potential occlusion regions. I will also demonstrate the benefit of learning a complete body model from mirror scenes.
11:30am
- Joyce Nakatumba-Nabende, Makerere AI Lab: (TBD) Computer vision and African Initiatives at Makerere AI Lab
[+] Abstract
TBD
🍽️ Lunch break / 🎓 Mentoring lunch (upon registration at the workshop)
2:00pm
- Daniel Omeiza, University of Oxford: Providing Explanations for Responsible Autonomous Driving
[+] Abstract
The increasing development of sophisticated AI models over the last few years has been characterised by rising societal concerns around safety and trust. Proponents of responsible AI (RAI) have advocated for explainability, a desirable requirement for AI technologies, including agent-based systems such as autonomous vehicles (AVs). AVs should be able to explain what they have ‘seen’, done, and might do in the environments in which they operate, and do so in intelligible forms. In this talk, I will motivate the need for explainability in autonomous driving, talk about existing efforts in CV and NLP to make AVs explainable, and discuss some research efforts from our group at Oxford to make AVs explainable and safer.
2:45pm
- Spotlight presentations
- BioNAS: Incorporating Bio-inspired Learning Rules to Neural Architecture Search
- RGB UAV Imagery Segmentation: Comparative Study
- AJA-pose: A Framework for Animal Pose Estimation based on VHR Network Architecture
3:00pm
👁️🗣️🤝🏾 Poster session (20 posters) + ☕ Coffee Break
4:00pm
- Vicky Kalogeiton, École Polytechnique: Multimodality for story-level understanding and generation of visual data
[+] Abstract
In this talk, I will address the importance of multimodality (i.e. using more than one modality, such as video, audio, text, masks and clinical data) for story-level recognition and generation. First, I will focus on story-level multimodal video understanding, as audio, faces, and visual temporal structure come naturally with the videos, and we can exploit them for free (FunnyNet-W and Short Film Dataset). Then, I will show some examples of visual generation from text and other modalities (ET, CAD, DynamicGuidance).
5:00pm
- Panel + 🏆 Announcement of the Awards 🏆
5:30pm
- Workshop end
Call for Papers
We welcome submissions of short/regular papers on any computer vision topic, for presentation at the poster session.
Submissions can be original or recently published work.
Submission deadline: August 11th 2024 (Anywhere on Earth), extended to August 20th 2024 (Anywhere on Earth).
🏆 Prizes will be awarded. 🏆
$500 cash (gift from Snap)
GoPro HERO12 (gift from GoPro)
Submission website: OpenReview.
Instructions:
Submissions should be 4 to 8 pages (excluding reference pages).
We encourage submissions to use our double-column LaTeX kit, but we will accept single- or double-column submissions in any format. Anonymity is optional.
We accept submissions that are original, under review, or already published.
⚠️ If you don't yet have an OpenReview account, note that account creation can take a few days for validation. Create your account as soon as possible.
The topics of interest include, but are not limited to:
- 3D computer vision
- Adversarial learning, adversarial attack for vision algorithms
- Autonomous agents with vision (reinforcement/imitation learning)
- Biometrics, face, gesture, body pose
- Computational photography, image and video synthesis
- Explainable, fair, accountable, privacy-preserving, ethical computer vision
- Foundation models, multimodal large language models, etc.
- Image recognition and understanding (object detection, categorization, segmentation, scene modeling, visual reasoning)
- Low-level and physics-based vision
- Semi-/Self-/Un-supervised learning and Few-/Zero-shot algorithms
- Transfer learning (domain adaptation, etc.)
- Video understanding (tracking, action recognition, etc.)
- Multi-modal vision (image+text, image+sound, etc.)
Organizers
Raoul de Charette
Inria
Fabio Pizzati
Oxford Uni.
Tuan-Hung Vu
valeo.ai
Andrei Bursuc
valeo.ai
Sileye Ba
L'Oréal
Volunteers
Lama Moukheiber
University at Buffalo
Benjamin Rukundo
Makerere University
Loyani Loyani Kisula
NM-AIST
Volunteers are welcome to help at the workshop. Just contact us.
Important dates
- Submission deadline: August 11, 2024 (AoE), extended to August 20, 2024 (AoE).
- Decision notification: August 23, 2024.
- Workshop date: September 6, 2024.
📢 Want to volunteer? Any questions? Contact Raoul de Charette.