Overview

In the context of the 2022 Deep Learning Indaba, this hybrid workshop will focus on computer vision algorithms that use fewer labels and/or less data. Classically, machine learning relies on abundant annotated data. This reliance is prone to cultural bias, since datasets are commonly recorded in Western countries, as well as to distribution bias towards frequent events: rare conditions such as adverse weather or lighting are seldom included in public datasets.
Speakers will present alternative strategies to reduce the need for labels (e.g., domain adaptation, domain generalization) or for data (e.g., few-/zero-shot learning, continual learning). Other strategies that rely on accessible priors will also be presented, such as self-supervised, cross-modal, or model-based learning. The applications will focus on, but not be limited to, autonomous driving and robotics.


The recording is available:
Workshop on Weakly Supervised Computer Vision

Invited Speakers

Matthieu Cord

Sorbonne Uni / Valeo.ai
Gabriela Csurka

Naver Labs Europe
Sileye Ba

L'Oréal
Fabio Cermelli

Politecnico di Torino
Umberto Michieli

Samsung Research
Fatma Güney

Koç University

Program

All times are local Tunis time (GMT+1).
2:00pm
- Workshop start
2:00pm
- Matthieu Cord, Sorbonne Uni / Valeo.ai
Vision Transformers - Video
[+] Abstract
Originally proposed in natural language processing, transformers are attracting growing interest in computer vision, providing state-of-the-art results for tasks such as image classification or object detection. In this talk, I present the underlying motivation and the basic architecture of Vision Transformers (ViT). I detail how they differ from classical convolution-based architectures for classification, and present a more general framework for different computer vision tasks. I also present ViT pre-training with large multimodal vision-and-language datasets, for downstream tasks with few-shot or zero-shot supervision.
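As a rough illustration of the patch-token pipeline described above (not the speaker's implementation), a minimal Vision Transformer classifier in PyTorch might look as follows; all layer sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT sketch: split the image into patches, embed them linearly,
    add a class token and positional embeddings, run a standard Transformer
    encoder, and classify from the class token."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=6,
                 heads=3, num_classes=1000):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                          # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])               # classify from class token
```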
2:45pm
- Umberto Michieli, Samsung Research
Learning to Segment Images with Limited Data across Devices, Domains and Tasks - Video
[+] Abstract
Dense prediction tasks, such as semantic segmentation, are nowadays tackled with data-hungry deep learning architectures. However, oftentimes only limited data is available. In this talk, we argue the need for versatility of deep neural architectures from various perspectives.
First, we discuss the federated learning (FL) paradigm to train deep architectures in a distributed setting where data is available only at remote clients. We address the non-i.i.d. distribution of samples among clients via 1) a naïve FL optimizer that is fair from the users' perspective, and 2) a prototype-guided FL optimizer that is also evaluated on FL segmentation benchmarks.
Second, we briefly overview model adaptation to unseen visual domains with no ground truth annotations available and we discuss a recent synthetic dataset (SELMA) to aid the segmentation task on such domains.
Finally, we empower deep models to recognize novel semantic concepts without forgetting previously learned ones. We investigate continual semantic segmentation via knowledge distillation, latent space regularization, and replay samples retrieved via weakly-supervised GANs or web-crawled images.
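As an illustration of the knowledge-distillation ingredient mentioned above, the sketch below shows a generic per-pixel distillation term for continual segmentation; it is a simplified, assumption-laden example, not the presented method:

```python
import torch.nn.functional as F

def segmentation_distillation_loss(new_logits, old_logits, T=2.0):
    """Per-pixel distillation for continual semantic segmentation (sketch):
    the frozen old model's soft predictions on previously seen channels
    supervise the corresponding channels of the new model, discouraging
    forgetting of old classes.

    new_logits: (B, C_new, H, W) from the model being trained
    old_logits: (B, C_old, H, W) from the frozen previous-step model
    """
    c_old = old_logits.shape[1]
    log_p_new = F.log_softmax(new_logits[:, :c_old] / T, dim=1)
    p_old = F.softmax(old_logits / T, dim=1)
    # Cross-entropy between old soft targets and new predictions, averaged over pixels
    return -(p_old * log_p_new).sum(dim=1).mean() * (T * T)
```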
3:15pm
- Sileye Ba, L'Oréal
Real-time Make Up Virtual-Try-On Through Deep Inverse Graphics - Video
[+] Abstract
Augmented reality applications have rapidly spread across online retail platforms and social media, allowing consumers to virtually try on a large variety of products, such as makeup, hair dyeing, or shoes. However, parametrizing a renderer to synthesize realistic images of a given product remains a challenging task that requires expert knowledge. While recent work has introduced neural rendering methods for virtual try-on from example images, current approaches are based on large generative models that cannot be used in real time on mobile devices. This calls for a hybrid method that combines the advantages of computer graphics and neural rendering approaches.
In this work, we propose a novel deep learning framework to build a real-time inverse graphics encoder that learns to map a single example image into the parameter space of a given augmented reality rendering engine. Our method leverages self-supervised learning and does not require labeled training data, which makes it extendable to many virtual try-on applications. Furthermore, most augmented reality renderers are not differentiable in practice, due to algorithmic choices or implementation constraints needed to reach real-time performance on portable devices. To relax the need for a graphics-based differentiable renderer in inverse graphics problems, we introduce a trainable imitator module. Our imitator is a generative network that learns to accurately reproduce the behavior of a given non-differentiable renderer. We propose a novel rendering sensitivity loss to train the imitator, which ensures that the network learns an accurate and continuous representation of each rendering parameter. Automatically learning a differentiable renderer, as proposed here, could benefit various inverse graphics tasks.
Our framework enables novel applications where consumers can virtually try on a new, unknown product from an inspirational reference image on social media. It can also be used by computer graphics artists to automatically create realistic renderings from a reference product image.
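To make the imitator idea concrete, the sketch below trains a differentiable network to mimic a black-box renderer; `renderer`, `sample_faces`, and `sample_params` are hypothetical helpers, and the actual framework also relies on the rendering sensitivity loss described in the abstract, which is omitted here:

```python
import torch
import torch.nn as nn

def train_imitator(imitator: nn.Module, renderer, sample_faces, sample_params,
                   steps=10_000, lr=1e-4):
    """Sketch of imitator training (illustrative, not the presented code):
    a generative network learns to reproduce a non-differentiable rendering
    engine, so gradients can later flow from rendered images back to the
    rendering parameters when fitting an inverse graphics encoder."""
    opt = torch.optim.Adam(imitator.parameters(), lr=lr)
    for _ in range(steps):
        faces, params = sample_faces(), sample_params()   # hypothetical data samplers
        with torch.no_grad():
            target = renderer(faces, params)              # black-box, non-differentiable rendering
        pred = imitator(faces, params)                    # differentiable imitation
        loss = (pred - target).abs().mean()               # simple L1 reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return imitator
```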
3:45pm
- Poster session & Coffee break

  • Deep Learning Architecture For Brain Vessel Segmentation. K. Iddrisu. [poster]
  • Single-modality and joint fusion deep learning for diabetic retinopathy diagnosis. K. El-ateif. [poster]
  • Co-attention Mechanism with Multi-Modal Factorized Bilinear Pooling for Medical Image Question Answering. V.-S. Mfogo, J. Fadugba, X. Chen, G. Gkioxari. [poster]
  • AFRIGAN: African Fashion Style Generator using Generative Adversarial Networks (GANs). M.-M. Salami, W. Oyewusi, O. Adekanmbi. [poster]
  • Sign-to-Speech Model for Sign Language Understanding: A Case Study of Nigerian Sign Language. S. Kolawole. [poster]
  • ATLAS: Universal Function Approximator for Memory Retention. H. Van Deventer. [poster]
  • Application of Artificial Intelligence (AI) and Collective Intelligence (CI) for Diagnosis of Breast Cancer. A. Jimoh, A. Adeniran, M. Klein. [poster]
4:30pm
- Gabriela Csurka, Naver Labs Europe
Learning from unlabeled data at NLE - Video
[+] Abstract
In this talk, after a few words about Naver and Naver Labs, I will briefly present a few recently published works from our lab related to self-supervision, such as MoCHi (Mixing of Contrastive Hard negatives) and ICMLM (Image-Conditioned Masked Language Modelling), and to continual domain adaptation, such as CDAML (Continual DA with Meta-Learning) and OASiS (Online Adaptation for Semantic Image Segmentation).
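For readers unfamiliar with hard-negative mixing, the sketch below shows the general idea behind approaches such as MoCHi (synthesizing extra negatives from the hardest ones); it is an illustrative simplification, not Naver Labs' code:

```python
import torch
import torch.nn.functional as F

def mix_hard_negatives(q, queue, n_hard=64, n_synth=32):
    """Hard-negative mixing sketch for contrastive learning: pick the
    negatives most similar to the query and create extra synthetic negatives
    as convex combinations of pairs of them.

    q:     (D,) L2-normalised query embedding
    queue: (K, D) L2-normalised negative embeddings
    """
    sims = queue @ q                                   # (K,) similarity to the query
    hard = queue[sims.topk(n_hard).indices]            # (n_hard, D) hardest negatives
    i = torch.randint(0, n_hard, (n_synth,))
    j = torch.randint(0, n_hard, (n_synth,))
    alpha = torch.rand(n_synth, 1)
    synth = alpha * hard[i] + (1 - alpha) * hard[j]    # convex combinations of hard pairs
    synth = F.normalize(synth, dim=1)                  # project back onto the unit sphere
    return torch.cat([queue, synth], dim=0)            # extended negative set
```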
5:15pm
- Fabio Cermelli, Politecnico di Torino
Semantic Segmentation from Weak and Partial Annotations - Video
[+] Abstract
Due to the rise of deep learning and the accessibility of big human-annotated datasets, tremendous progress has been made in the fundamental computer vision task of semantic segmentation. However, because each pixel of the image needs to have a label, annotations are quite expensive. As a result, the annotation cost hinders the applications of semantic segmentation in the real world. In the presentation, we outline ways that significantly lower the cost by utilizing less expensive and more readily available annotations.
We first look into the use of partial annotations, where labels are given only for specific areas of the image. We begin with an incremental learning setting, where the objective is to extend a model to learn new classes without forgetting and without being given annotations for previously learned classes. We present a straightforward adjustment of the cross-entropy loss to deal with this situation. The proposed losses are then extended to point- and scribble-supervised segmentation, where only a small portion of the image's pixels are annotated.
Finally, we consider the scenario of having no pixel-level information at all. The goal is to learn a segmentation model using cheap and widely available image-level labels, which only indicate the presence of an object in the image without providing any localization cue. We review the current state of the art and illustrate existing solutions based on Class Activation Maps. Then, extending these techniques, we introduce a framework that learns to segment new classes over time from image-level labels.
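As a hedged illustration of the kind of cross-entropy adjustment mentioned above, the sketch below treats background pixels as possibly belonging to old classes; it follows the general "background shift" idea rather than the exact losses presented in the talk:

```python
import torch
import torch.nn.functional as F

def unbiased_cross_entropy(logits, target, n_old_classes):
    """Cross-entropy variant for class-incremental segmentation (sketch):
    pixels labelled as background in the current step may in fact belong to
    an old class, so the background probability is taken as the sum over
    background + old-class probabilities.

    logits: (B, C, H, W) scores; channel 0 = background, 1..n_old = old classes
    target: (B, H, W) labels containing only background (0) and new classes
    """
    log_probs = F.log_softmax(logits, dim=1)
    # Merge background and old-class probabilities into a single channel 0
    merged_bg = torch.logsumexp(log_probs[:, :n_old_classes + 1], dim=1, keepdim=True)
    log_probs = torch.cat([merged_bg, log_probs[:, n_old_classes + 1:]], dim=1)
    # Remap new-class labels so they index the reduced channel axis
    remapped = target.clone()
    fg = target > n_old_classes
    remapped[fg] = target[fg] - n_old_classes
    return F.nll_loss(log_probs, remapped)
```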
5:45pm
- Fatma Güney, Koç University
Predictive World Models in Autonomous Driving - Video
[+] Abstract
I will talk about future prediction in video sequences. We propose to address the inherent uncertainty in future predictions with stochastic models. While most previous methods predict the future in pixel space, we propose to also predict the future in motion space, to model appearance and motion history separately. We then extend our solution to real-world driving scenarios where the background moves according to the ego-motion of the vehicle. We predict the changes in the static part by modeling the structure and ego-motion. Conditioned on the static prediction, we predict the remaining changes in the dynamic part, which correspond to independently moving objects. Finally, we propose to combine information from multiple cameras into a Bird's Eye View (BEV) representation and predict the future in that compact representation. We efficiently learn the temporal dynamics of the BEV representation with a state space model. Our models outperform previous methods on the standard future frame prediction datasets MNIST, KTH, and BAIR, and especially on the real-world driving datasets KITTI, Cityscapes, and nuScenes.
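As a generic illustration of latent-variable stochastic prediction (not the presented architecture), the toy model below summarises past frames with a recurrent state, samples a Gaussian latent to capture future uncertainty, and decodes the next frame; all feature sizes are placeholders:

```python
import torch
import torch.nn as nn

class StochasticPredictor(nn.Module):
    """Toy stochastic future predictor: a GRU state summarises the past,
    a Gaussian latent models the many possible futures, and a decoder
    emits the predicted next-frame features."""
    def __init__(self, frame_dim=1024, hidden=256, z_dim=32):
        super().__init__()
        self.rnn = nn.GRUCell(frame_dim + z_dim, hidden)
        self.prior = nn.Linear(hidden, 2 * z_dim)        # predicts mean and log-variance
        self.decoder = nn.Linear(hidden, frame_dim)

    def forward(self, frames):                           # frames: (T, B, frame_dim)
        h = frames.new_zeros(frames.size(1), self.rnn.hidden_size)
        preds = []
        for x in frames:
            mu, logvar = self.prior(h).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sample one future mode
            h = self.rnn(torch.cat([x, z], dim=-1), h)
            preds.append(self.decoder(h))                # predicted next-frame features
        return torch.stack(preds)                        # (T, B, frame_dim)
```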
6:15pm
- Workshop end

Organizers

Fabio Pizzati

Inria and Unibo
Patrick Pérez

Valeo.ai
Tuan-Hung Vu

Valeo.ai
Andrei Bursuc

Valeo.ai
Massimiliano Mancini

Uni. of Tübingen

Call for Papers

To foster interactions, attendees of the 2022 Deep Learning Indaba are invited to submit any work related to computer vision (not limited to weakly supervised methods) for presentation at the poster session. Original articles as well as previously published ones can be submitted.

Please submit a PDF of your work on CMT: https://cmt3.research.microsoft.com/WSCV2022/
The deadline is extended to August 7th (11:59pm AoE).


Relevant papers (of at least 4 pages) will be selected by the organizing board for presentation at the poster session.

The topics of interest include, but are not limited to:
  1. 3D computer vision
  2. Adversarial learning, adversarial attack for vision algorithms
  3. Autonomous agents with vision (reinforcement/imitation learning)
  4. Biometrics, face, gesture, body pose
  5. Computational photography, image and video synthesis
  6. Explainable, fair, accountable, privacy-preserving, ethical computer vision
  7. Image recognition and understanding (object detection, categorization, segmentation, scene modeling, visual reasoning)
  8. Low-level and physics-based vision
  9. Semi-/Self-/Un-supervised learning and Few-/Zero-shot algorithms
  10. Transfer learning (domain adaptation, etc.)
  11. Video understanding (tracking, action recognition, etc.)
  12. Multi-modal vision (image+text, image+sound, etc.)

PDF version

Any questions? Contact Raoul de Charette.