I am an independent group leader at the University of Tübingen leading the group on Scalable Trustworthy AI (STAI).
I am interested in training reliable models (
I have been a research scientist at NAVER AI Lab for 3.5 years. I received my PhD in computer vision and machine learning at Max-Planck Institute for Informatics in 2018, under the supervision of Bernt Schiele and Mario Fritz, with a focus on the privacy and security implications of CV and ML (Thesis). I received the Master of Mathematics with Distinction in 2014 and Bachelor of Arts in Mathematics as a Wrangler in 2013, both at University of Cambridge.
I started compiling the principles for life and research 🍎.
Email  /  Google Scholar  /  LinkedIn  /  Twitter  /  Github
I have tried to push certain fronts in ML research to make models truly useful and deployable in real life. They can be grouped into a few keywords.
Robustness. Changes in the input distribution shall not disrupt the model's predictive power. Ideally, a model should be robust against the shifts in input domain (e.g. natural and adversarial perturbations) and confounders (e.g. fairness).
Uncertainty. A model should know when it is going to get it wrong. This allows the users and downstream systems to make sensible and safe decisions based on the estimated confidence levels.
Human Annotation. An integral part of training a high-performance model is the human supervision. I have sought cost-effective ways to extract useful supervisory signals from humans.
Privacy & Security. There are different privacy and security angles with which ML can be analyzed. One may question the "stealability" of a black-box model as an IP; one may also question the privacy guarantees for user data in the federated learning setup. Still others may wonder whether certain level privacy is achievable at all on internet, with the increasing volume of user data online and more widespread use of machine learning algorithms to process such data.
Evaluation. Correct evaluation is undoubtably important in research and industrial applications, yet it is surprisingly difficult. I have cleaned up benchmarks and evaluation protocols in a few domains.
Large-Scale ML. Some of the methodologies I have been involved in are designed for large-scale ML. They typically require minimal changes to the original ML system but bring consistent gains across the board.
See slides and video (3 Aug 2022) for an overview of the past researches and future research ideas for the scalable trustworthy AI.
Supervised learning trains models with (X,Y) data. The (X,Y) data comes from the annotation procedure where annotators provide the correct Y for each X. But behind the scene, annotators generate much more data than the (X,Y) data themselves: they unintionally generate auxiliary information during the annotation task, such as the history of corrections and the time-series of mouse traces and clicks. We call them annotation byproducts (AB) Z. We propose the new paradigm of learning using annotation byproducts (LUAB), where models are trained with the triplets (X,Y,Z) involving the ABs. We reproduce the original annotation procedures for ImageNet and COCO to generate AB-enriched datasets: ImageNet-AB and COCO-AB. we show that the auxiliary Z may help models be better aligned with human recognition mechanisms, leading to improved model robustness.
We finally came up with some theoretical guarantees for probabilistic embeddings! Given a spherical embedding space with a von-Mises-Fisher (vMF) family of true latent embedding distribution, one may identify the true latent vMF for every data point up to rotations with a Monte-Carlo version of InfoNCE (called MCInfoNCE). This result is a probabilistic extension of the work by Zimmerman et al.
ViT’s itchy point seems to be the uniform attention. ViTs are hungry for denser connections, yet dense connections are hard to achieve because of softmax's steep gradient around the uniform attention. We manually insert additional uniform attention layers in ViT models. This is very cheap! It turns out to be an effective trick for increasing the capacity and generalisation for ViT models, especially for the smaller versions.
Several recent studies have reported positive correlations between in-distribution (ID) and out-of-distribution (OOD) generalisation performances. In particular, Wenzel et al. (2022) found that none of the 31k networks examined on 172 dataset pairs has shown a trade-off, or a negative correlation, between the ID and OOD performances. They further recommend that, to improve the OOD generalisation, one can instead focus on improving the ID generalisation. We argue that this may not always be true. We present counterexamples where one does observe a trade-off between ID and OOD generalisation. We point to the selection method for networks as the key reason for the contradicting observations. We alter the recommendation to the field in a more nuanced manner.
A classifier gets biased when its decision boundary separates the bias attribute (e.g. gender attribute for profession prediction). Some prior de-biasing methods correct the decision boundary by identifying the bias-conflicting samples in the training data (e.g. female mechanical engineers) and giving more weight on them. We go one step further. We argue that it's more effective to augment the whole convex hull between usual data points (e.g. male mechanical engineers) and bias-conflicting samples (e.g. female mechanical engineers). We do this through simple Mixup. It effectively de-biases a model, even in the presence of strong label noise, arguably the greatest arch-enemy for a de-biasing method.
Image-captioning benchmarks such as COCO Captions contain lots of nonsense. For the same image on the left, the caption that goes "Playing tennis with a racket" is deemed correct, while "Swinging a tennis racket" is penalised. This comes from the erratic recipe for constructing the datasets: (1) let annotators write down 5 captions per image and (2) consider only those 5 captions to be correct matches. We show that this practice introduces a lot of noise in the evaluation benchmarks. We then introduce a novel image-captioning dataset based on the MS-COCO Captions that captures the model performances more precisely.
Dataset condensation is the art of compactifying a training dataset. The aim is that a model trained on a condensed dataset is similar to the one trained on the original dataset, most importantly in terms of model accuracy (e.g. 91%-accuracy MNIST classifier with only 1 sample per class). We introduce many practical tricks to make data condensation work beyond the toy setting. We present the first data condensation method that actually works on images with sizes as large as 224x224, instead of 32x32!
Weakly-supervised semantic segmentation (WSSS) is the task of solving pixel-wise class assignment with only the image-level supervision. The problem is ill-posed because the image-level labels alone do not let models distinguish foreground (FG) objects (e.g. train) from spuriously-correlated background (BG) cues (e.g. rail). Researchers have sought external sources of information, such as shape prior, to address the ill-posedness. In this paper, we explore a novel source: BG images (e.g. rail images without a train). Conceptually, telling models what are not the FG cues is equivalent to telling them what actually are the FG cues; BG images are sufficient for turning the problem into a well-posed one. Collecting such BG data is cost-efficient, requiring orders of magnitude less annotation costs than the already-cheap image-level labels.
Shortcut learning is emerging as a key limitation of the current generation of machine learning models (CVPR'20, ICML'20). In this work, instead of proposing yet another solution, we take a step back and deepen our understanding of the problem. For example, trained on a dataset where both colour and shape are valid cues for recognising the object, models of different types (MLP, CNN, and ViT) choose to use colour over shape. Why is that? We provide an explanation from the parameter-space perspective. Read the paper. Worth it!
This is an NLP paper. There have been many attempts at enlarging the training text data for few-shot text classification, like back-translation (e.g. En-Fr-En) and the use of pre-trained language models. Unlike those, we propose an augmentation method that is fully aware of the underlying grammatical structure of the sentence. Importantly, our method generates a set of synonymous sentences that are both grammatically correct and grammatically diverse! Here we gain quite some points in few-shot text classification benchmarks. Another contribution is viewing the train-val split as part of the method and seeking the best splitting strategy when data augmentation is being used. It turns out that splitting the few-shot labelled samples S into disjoint train-val splits (train split is then augmented) is sub-optimal; a better strategy is to use the augmented source data S' as the train split and the original S itself as the validation split!
Journal extension of CVPR'20! It now contains more analyses, including the evaluation of input gradient variants as Weakly-Supervised Object Localization (WSOL) methods.
It is difficult to find a CV researcher or practitioner who hasn't used (or at least heard of) the Class Activation Maps (CAM). It is a seminal feature attribution method that has left a deep mark on the vision research and applications. Notwithstanding its popularity, we found some practical and conceptual issues that makes CAM not as interpretable as it should be. We address the issues with a probabilistic treatment of the last layers of CNNs where the latent cue variable Z is trained via Marginal Likelihood (ML) or Expectation-Maximisation (EM) algorithms. The resulting Class Activation Latent Maps, or CALM, produces more precise and interpretable score maps.
The Tranformer architecture has successfully been adapted to visual models (e.g. ViT). However, Transformers, originally designed for language modelling, and ViT assign a constant ratio of computational loads between spatial and channel dimensions at different depths. We postulate this as a suboptimal design choice, as CNNs assign different ratios at different depths to maximise the utility of compute. We thus present Pooling-based Vision Transformer (PiT) that does this.
Recovering the dynamical systems, or the data generation process, behind time series data enables an effective and robust prediction, interpretation, and forecasting. There exist prior methods for recovering either continuous or discrete dynamics, but not the mixture. The underlying dynamics behind many real-world systems contain both continuous and discrete elements. For example, an aircraft essentially follows a continuous dynamics but goes through a discrete mode shift at touchdown. Such a system is referred to as a Stochastic Hybrid System (SHS). We present a framework that recovers SHS from time series data using ingredients like Neural ODEs and latent variable models.
ImageNet labels contain lots of noise (e.g. Shankar et al.). There have been efforts to fix them on the evaluation set, but not yet on the training set. We fix them on the training set (published at codebase), but with the help of a bigger image classifier, to make the task feasible at all. This is another trick that will improve the ImageNet & downstream task accuracies across the board.
Given an image, there are many ways to describe it in text. Given a text description, there are likewise many possible images that suits the description. Cross-model associations are of many-to-many nature. The usual deterministic embeddings cannot model this well. We introduce a probabilistic embedding scheme based on the Hedged Instance Embedding (ICLR'19) to handle the many-to-many mapping gracefully. We address another crucial issue with evaluation: your method gets either penalised or rewarded for retrieving synonymous sentences. This is because of the non-exhaustive true matches in the eval set. Since ground-up collection of such matches is too expensive, we introduce a novel surrogate measure Plausible-Match R-Precision based on the estimated true matches.
When you apply a momentum-based optimizer over scale-invariant parameters, their norms increase quite a bit. The norm increase doesn't contribute anything to the loss minimization while only slowing down the convergence. We fix this by appending a projection operation on SGD and Adam. This leads to performance improvements across the board.
Data augmentation is not as extensively studied in the video recognition tasks as in the static image recognition domain. We study the extension of popular static-image augmentation method, such as CutMix, on video recognition tasks.
Evaluating generative models is tricky. There are Inception Score and Fréchet Inception Distance measures indeed, and then (Improved) Precision and Recall metrics to separately examine the fidelity and diversity aspects. Yet, they are still not perfect. We address the issues with Improved Precision and Recall metrics and propose new metrics: Density and Coverage.
Models pick up correlations, rather than causal mechanisms, between inputs and outputs. De-biasing (and fairness) researches have guided models on "which cues to look at" through explicit bias labels or by re-weighting or re-generating training data to remove bias. We show that, for many application scenarios, it is possible to encode the "cues to look at" through model architecture and such expensive strategies are no longer needed.
I have long waited for this moment since CVPR'17. Weakly-Supervised Object Localization, or WSOL, has in fact been not weakly supervised in a strict sense. Design choices and hyperparameters are validated with the localization annotations! This paper explains why researchers had to rely on localization validation -- without localization supervision, there is no way to force a model to not extract cues from background regions. We propose a new fair benchmark acknowledging the need for localization annotations and show that WSOL methods since CAM in 2016 have not introduced much gain.
Scene text recognition works well, but there are remaining corner cases. An example is texts with unusual orientations and arrangements (e.g. BMW logo). We focus on this corner case and propose a model based on self-attention.
Journal version of my first paper ICCV'15, after five years! We have developed the version two of the ICCV'15 system that outperforms the methods that have appeared in the meantime.
Book chapter version of ICLR'18! We build connections between our black-box inspection methodology and the explainable AI.
Federated learning allows sensitive user data to never leave the device and still be used for training. It is considered a safer option than sending the user data directly to the server. But is it? We show that users may be identified and linked based on the model updates communicated between the device and server.
There has been a line of research on simple regularization techniques like CutMix (ICCV'19) and other lines of research on robustness and uncertainty. We make a happy marriage of the two and measure how well the regularization techniques improve robustness and uncertainty of a model.
A simple solution that works surprisingly well! Cut and paste patches from other images during training. Quite likely, you will see a performance boost.
Scene text recognition field has long suffered from the lack of a unified agreement on the evaluation protocol. We provide a standard protocol. We also provide a unified view on the previous methods and discover a novel combination of existing modules that turns out to be the state of the art.
There has been quite some work on representing uncertainty for classification or regression tasks. Is there a way to represent uncertainty for instance embedding models too? We show that it is possible to train probabilistic representatitons for instances based on their inherent ambiguity.
Can a bad guy hijack an RL agent? We show that it is possible to let an agent pursue an alternative reward by introducing small adversarial perturbations in the input stream.
Recipes for training a high-performance model are not cheap. Think about the GPU-and-research-scientist-and-engineer hours to find the right architectural components and optimizer hyperparameters. What if they can be stolen by examining the model responses to certain inputs?
Adversarial perturbation solutions (ICCV'17) produce visually pleasant protections with high protection rates, but their effects may be confined to a handful of recognition systems. We propose another solution based on face inpainting that changes the face to a fictitious yet natural-looking identity. It is effective against a broader set of recognition systems.
We stop and look back on the visual privacy papers (ICCV'15, ECCV'16, ICCV'17).
If face blurring doesn't work (ECCV'16), how should we shield our personal photos online against recognition systems? We propose a solution based on adversarial perturbations and the game theoretic considerations for the evaluation therein.
There has been quite some work around training models for localization tasks (e.g. semantic segmentation) from the image tag supervision only. But is this fundamentally possible without relying on extensive validation with full localization annotations? We argue that certain priors are necessary at the very least to encode the extent of objects. Saliency, we argue, is a handy prior.
We casually use pronouns to refer to others. For machines, however, referring to people with pronouns necessitates new types of data and training strategies to explicitly localize and link people across frames. We do that.
But can you still be recognized even with a blur on your face? Quite likely.
You are a janitor at Taj Mahal. Against you will, sightseers take photos with your face in the background. How can you opt out of being present in someone else's photo? We present a mobile-system based solution.
How well does a CNN model recognize people in personal photos? Even when people don't look at cameras, CNN finds out who they are, based on the context (e.g. location and social connections).
As machine learning technology gets applied to actual products and solutions, new challenges have emerged. Models unexpectedly fail to generalise well to small changes in the distribution; some models are found to utilise sensitive features that could treat certain demographic user groups unfairly; models tend to be confident on novel types of data; models cannot communicate the rationale behind their decisions effectively with the end users like medical staff to maximise the human-machine synergies. Collectively, we face a trustworthiness issue with the current machine learning technology. A large fraction of the machine learning research nowadays is dedicated to expanding the frontier of Trustworthy Machine Learning (TML). The course covers a theoretical and technical background for key topics in TML. We conduct a critical review of important classical and contemporary research papers on related topics and provide hands-on practicals to implement TML techniques.
ImageNet symbolises the stellar achievements in ML and CV in the past decade.
It has served as the go-to benchmark for model architectures and training techniques and as a common pre-training dataset for numerous downstream tasks.
As of 2021, ImageNet is going through a creative destruction.
As the SOTA models are saturating towards the upper bound of the benchmark, new versions of the benchmarks are being proposed (
Deeply-learned computer vision models are data-hungry and manual annotations are expensive. Can we train models with “weaker” annotations? This tutorial provides an overview of the vast literature on weakly supervised learning methods in computer vision. We also discuss the limitations of current state-of-the-art methods and evaluation metrics. We propose future research directions that hopefully will spur disruptive progress in weakly supervised learning.