Fifth Workshop on Computer Vision for AR/VR

(CV4ARVR)

October 16, 2021

Organized in conjunction with ICCV 2021


@cv4arvr


Q&A

Do you have a question for the authors? Join the Virtual Poster Session on Discord.

Join the CV4ARVR 2021 Discord server and leave a question for the authors, or participate in the live Q&A session on October 16 during ICCV; see the program for timing. For the best experience, we recommend using the Discord desktop app.


Extended Abstracts

Play all videos here (YouTube playlist)

Previous years: 2020, 2019


DIREG3D: DIrectly REGress 3D Hands from Multiple Cameras

Ashar Ali (Qualcomm Technologies, Inc.); Upal Mahbub (Qualcomm); Gokce Dane (Qualcomm); Gerhard Reitmayr (Qualcomm)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

In this paper, we present DIREG3D, a holistic framework for 3D Hand Tracking. The proposed framework is capable of utilizing camera intrinsic parameters, 3D geometry, intermediate 2D cues, and visual information to regress parameters for accurately representing a Hand Mesh model. Our experiments show that information like the size of the 2D hand, its distance from the optical center, and radial distortion is useful for deriving highly reliable 3D poses in camera space from just monocular information. Furthermore, we extend these results to a multi-view camera setup by fusing features from different viewpoints.

PDF
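
To illustrate why the 2D hand size and its offset from the optical center carry depth information, here is a minimal pinhole-camera sketch in Python. It is not the DIREG3D model; the focal length, metric hand span, and keypoint coordinates are hypothetical placeholders.

```python
import numpy as np

def approximate_hand_depth(hand_span_px, focal_length_px, hand_span_m=0.18):
    """Pinhole-camera depth estimate: an object of known metric size that spans
    fewer pixels must be farther away (depth = f * size / pixels).
    hand_span_m is a hypothetical average adult hand span in metres."""
    return focal_length_px * hand_span_m / hand_span_px

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a 2D keypoint (u, v) into camera space using the intrinsics; the
    keypoint's offset from the optical center (cx, cy) determines x and y."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical example: a hand spanning 120 px seen by a camera with a 600 px focal length.
depth = approximate_hand_depth(hand_span_px=120.0, focal_length_px=600.0)
wrist_3d = backproject(u=380.0, v=260.0, depth=depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(depth, wrist_3d)
```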


Dynamics-Regulated Kinematic Policy for Egocentric Pose Estimation

Zhengyi Luo (Carnegie Mellon University); Ryo Hachiuma (Keio University); Ye Yuan (Carnegie Mellon University); Kris Kitani (Carnegie Mellon University)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

We propose a method for object-aware 3D egocentric pose estimation that tightly integrates kinematics modeling, dynamics modeling, and scene object information. Unlike prior kinematics- or dynamics-based approaches, where the two components are used disjointly, we synergize the two approaches via dynamics-regulated training. At each timestep, a kinematic model is used to provide a target pose using video evidence and simulation state. Then, a pre-learned dynamics model attempts to mimic the kinematic pose in a physics simulator. By comparing the pose instructed by the kinematic model with the pose generated by the dynamics model, we can use their misalignment to further improve the kinematic model. By factoring in the 6DoF pose of objects (e.g., chairs, boxes) in the scene, we demonstrate, for the first time, the ability to estimate physically plausible 3D human-object interactions using a single wearable camera. We evaluate our egocentric pose estimation method in both controlled laboratory settings and real-world scenarios.

PDF
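
A toy sketch of the dynamics-regulated training loop described in the abstract: a kinematic model proposes a target pose from video evidence and simulator state, a frozen dynamics stand-in mimics it, and the misalignment supervises the kinematic model. The modules, dimensions, and simulator surrogate below are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

POSE_DIM, OBS_DIM = 69, 128  # hypothetical pose and per-frame observation sizes

kinematic_model = nn.Sequential(nn.Linear(OBS_DIM + POSE_DIM, 256), nn.ReLU(), nn.Linear(256, POSE_DIM))
# Stand-in for a pre-learned dynamics policy running inside a physics simulator.
dynamics_model = nn.Sequential(nn.Linear(2 * POSE_DIM, 256), nn.ReLU(), nn.Linear(256, POSE_DIM))
optimizer = torch.optim.Adam(kinematic_model.parameters(), lr=1e-4)

def rollout_step(video_feat, sim_state):
    # Kinematic model proposes a target pose from video evidence and simulation state.
    target_pose = kinematic_model(torch.cat([video_feat, sim_state], dim=-1))
    # The frozen dynamics stand-in tries to mimic that pose.
    with torch.no_grad():
        simulated_pose = dynamics_model(torch.cat([sim_state, target_pose], dim=-1))
    # The misalignment between the two poses is the training signal for the kinematic model.
    loss = torch.mean((target_pose - simulated_pose) ** 2)
    return loss, simulated_pose

video_feat = torch.randn(1, OBS_DIM)  # placeholder for egocentric video features
sim_state = torch.zeros(1, POSE_DIM)  # placeholder for the simulator's current pose
for _ in range(10):
    loss, sim_state = rollout_step(video_feat, sim_state)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```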


FaceEraser: Removing Facial Parts for Augmented Reality

Miao Hua (Bytedance); Lijie Liu (ByteDance Ltd.); Ziyang Cheng (ByteDance); Qian He (Bytedance); Bingchuan Li (Bytedance); Zili Yi (ByteDance)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

Our task is to remove all facial parts (e.g., eyebrows, eyes, mouth and nose), and then impose visual elements onto the “blank” face for augmented reality. Conventional object removal methods rely on image inpainting techniques (e.g., EdgeConnect, HiFill) that are trained in a self-supervised manner with randomly manipulated image pairs. Specifically, given a set of natural images, randomly masked images are used as inputs and the raw images are treated as ground truths. However, this technique does not satisfy the requirements of facial parts removal, as it is hard to obtain “ground-truth” images with real “blank” faces. Simple techniques such as color averaging or PatchMatch fail to assure texture or color coherency. To address this issue, we propose a novel data generation technique to produce paired training data that closely mimics “blank” faces. In addition, we propose a novel network architecture that improves inpainting quality for our task. Finally, we demonstrate various face-oriented augmented reality applications on top of our facial parts removal model. Our method has been integrated into commercial products and its effectiveness has been verified with unconstrained user inputs. The source code is released at https://github.com/duxingren14/FaceEraser for research purposes.

PDF
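
For context, the conventional self-supervised inpainting pairing that the abstract contrasts against (randomly masked images as input, raw images as ground truth) can be sketched as follows. The mask geometry and placeholder image are illustrative only; this is not the paper's blank-face data generation technique.

```python
import numpy as np

def random_mask(h, w, num_rects=3, rng=None):
    """Random rectangular holes; 1 marks pixels to be inpainted."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w, 1), dtype=np.float32)
    for _ in range(num_rects):
        rh, rw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
        y, x = rng.integers(0, h - rh), rng.integers(0, w - rw)
        mask[y:y + rh, x:x + rw] = 1.0
    return mask

def make_training_pair(image):
    """Conventional self-supervised pair: masked image as input, raw image as target."""
    mask = random_mask(*image.shape[:2])
    masked = image * (1.0 - mask)  # zero out the hole
    return np.concatenate([masked, mask], axis=-1), image  # input (RGB + mask), ground truth

face = np.random.rand(256, 256, 3).astype(np.float32)  # placeholder for a real face image
inp, gt = make_training_pair(face)
```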


Indoor Scene Augmentation via Limited Scene Priors

Mohammad Keshavarzi (University of California Berkeley); Christian Reyes (University of California Berkeley); Ritika Shrivastava (University of California Berkeley); Oladapo Afolabi (U.C. Berkeley); Luisa Caldas (University of California Berkeley); Allen Y. Yang (University of California Berkeley)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

Contextually completing scenes via learning-based scene synthesis techniques has gained momentum in the computer vision and graphics communities, with applications in augmented and virtual reality. In this extended abstract, we present a novel contextual scene augmentation system that can be trained with limited scene priors. Our proposed method combines a novel scene graph extraction and parametric data augmentation method with a Graph Attention and Siamese network architecture, followed by an Autoencoder network, to perform training with limited scene priors. We show the effectiveness of our proposed system by conducting a comparative study with alternative systems on the Matterport3D dataset. Our results indicate that our scene augmentation outperforms prior art in scene synthesis when only limited scene priors are available. Finally, to demonstrate our system in action, we present an augmented reality application in which objects can be contextually augmented in real time.

PDF
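
A rough illustration of what a scene graph extraction step might look like, with objects as nodes and proximity-based relations as edges. The object fields, distance threshold, and relation label are hypothetical and do not reflect the paper's graph construction.

```python
import numpy as np

def build_scene_graph(objects, near_threshold=1.5):
    """objects: list of dicts with 'label' and 'centroid' (x, y, z in metres).
    Returns nodes and edges for a simple proximity-based scene graph."""
    nodes = [obj["label"] for obj in objects]
    edges = []
    for i, a in enumerate(objects):
        for j, b in enumerate(objects):
            if i >= j:
                continue
            dist = float(np.linalg.norm(np.asarray(a["centroid"]) - np.asarray(b["centroid"])))
            if dist < near_threshold:
                edges.append((i, j, {"relation": "near", "distance": dist}))
    return nodes, edges

# Hypothetical room: a sofa, a coffee table, and a lamp.
room = [
    {"label": "sofa", "centroid": (0.0, 0.0, 0.0)},
    {"label": "coffee_table", "centroid": (1.0, 0.0, 0.2)},
    {"label": "lamp", "centroid": (3.0, 0.0, 1.5)},
]
print(build_scene_graph(room))
```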


On-device Real-time Hand Gesture Recognition

George Sung (Google LLC); Kanstantsin Sokal (Google); Esha Uboweja (Google LLC); Valentin Bazarevsky (Google LLC); Jonathan Baccash (Google); Eduard Gabriel Bazavan (Google); Chuo-Ling Chang (Google LLC); Matthias Grundmann (Google Research)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

We present an on-device real-time hand gesture recognition (HGR) system, which detects a set of predefined static gestures from a single RGB camera. The system consists of two parts: a hand skeleton tracker and a gesture classifier. We use MediaPipe Hands as the basis of the hand skeleton tracker, improve the keypoint accuracy, and add the estimation of 3D keypoints in a world metric space. We create two different gesture classifiers, one based on heuristics and the other using neural networks (NN).

PDF
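
The heuristic classifier branch can be approximated with simple rules over MediaPipe Hands keypoints. The open-palm rule below is a hypothetical illustration for an upright hand, not the gesture set or classifier used in the paper.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Fingertip / PIP landmark indices from the 21-point MediaPipe hand model.
FINGER_TIPS = [8, 12, 16, 20]
FINGER_PIPS = [6, 10, 14, 18]

def classify_open_palm(landmarks):
    """Toy heuristic: a finger counts as extended if its tip lies above its PIP
    joint in image coordinates (smaller y); four extended fingers => open palm."""
    extended = sum(
        landmarks[tip].y < landmarks[pip].y
        for tip, pip in zip(FINGER_TIPS, FINGER_PIPS)
    )
    return "open_palm" if extended == 4 else "other"

image = cv2.imread("hand.jpg")  # hypothetical input frame
with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        print(classify_open_palm(lm))
```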


Residual Aligned: Gradient Optimization for Non-Negative Image Synthesis

Yu Shen (Cornell Tech); Katie Z Luo (Cornell University); Guandao Yang (Cornell University); Harald Haraldsson (Cornell Tech); Serge Belongie (University of Copenhagen)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

In this work, we address an important problem of optical see-through (OST) augmented reality: non-negative image synthesis. Most image generation methods fail under this condition since they assume full control over each pixel, whereas an additive display cannot create darker pixels by adding light. In order to solve the non-negative image generation problem in AR image synthesis, prior works have attempted to utilize optical illusions to simulate human vision, but fail to preserve lightness constancy well under conditions such as high dynamic range. In our paper, we instead propose a method that preserves lightness constancy at a local level, thus capturing high-frequency details. Compared with existing work, our method shows strong performance in image-to-image translation tasks, particularly in scenarios such as large-scale images, high-resolution images, and high-dynamic-range image transfer.

PDF
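
The non-negativity constraint (an optical see-through display can only add light) can be sketched as a constrained gradient optimization; the softplus parameterization and plain L2 objective below are simplifications, not the paper's residual-aligned method.

```python
import torch
import torch.nn.functional as F

# Hypothetical background seen through the display and desired target appearance, in [0, 1].
background = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)

# Parameterize the added light through softplus so it can never go negative.
params = torch.zeros_like(background, requires_grad=True)
optimizer = torch.optim.Adam([params], lr=0.05)

for _ in range(200):
    perceived = background + F.softplus(params)  # the display can only brighten pixels
    loss = F.mse_loss(perceived, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```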


Soft Expectation and Deep Maximization for Image Feature Detection

Alexander Mai (University of California, San Diego); Allen Y Yang (UC Berkeley, USA); Dominique E Meyer (UC San Diego)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

Central to the application of many multi-view geometry algorithms is the extraction of matching points between multiple viewpoints, enabling classical tasks such as camera pose estimation and 3D reconstruction. Many approaches that characterize these points have been proposed, based on hand-tuned appearance models or data-driven learning methods. We propose Soft Expectation and Deep Maximization (SEDM), an iterative unsupervised learning process that directly optimizes the repeatability of the features by posing the problem in a manner similar to expectation maximization (EM). We found convergence to be reliable and the new model to be more invariant to lighting and better at localizing the underlying 3D points in a scene, improving SfM quality compared to SuperPoint and R2D2.

PDF
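
The EM-style alternation can be sketched as a loop over a soft expectation step (aggregate the current detector's responses across views into a soft target) and a deep maximization step (retrain the detector on that target). The detector, the assumption of pixel-aligned views, and the loss below are placeholder stand-ins, not the SEDM formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical fully convolutional detector producing a per-pixel keypoint score map.
detector = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)

def expectation_step(views):
    """Soft E-step stand-in: average the current score maps of co-registered views to
    form a soft target that rewards points detected consistently across viewpoints."""
    with torch.no_grad():
        scores = torch.stack([torch.sigmoid(detector(v)) for v in views])
        return scores.mean(dim=0)

def maximization_step(views, soft_target):
    """M-step: fit the detector to the soft target on every view."""
    loss = sum(F.binary_cross_entropy_with_logits(detector(v), soft_target) for v in views)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

views = [torch.rand(1, 3, 64, 64) for _ in range(4)]  # placeholder: pre-aligned views of one scene
for _ in range(5):
    target = expectation_step(views)
    maximization_step(views, target)
```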


Unconstrained Scene Generation with Locally Conditioned Radiance Fields

Terrance DeVries (Apple); Miguel Angel Bautista (Apple); Nitish Srivastava (Apple); Graham Taylor (University of Guelph); Joshua M Susskind (Apple)

ICCV Workshop on Computer Vision for Augmented and Virtual Reality, 2021

We tackle the challenge of learning a distribution over complex, realistic, indoor scenes. In this paper, we introduce Generative Scene Networks (GSN), which learns to decompose scenes into a collection of many local radiance fields that can be rendered from a freely moving camera. Our model can be used as a prior to generate new scenes, or to complete a scene given only sparse 2D observations. Recent work has shown that generative models of radiance fields can capture properties such as multi-view consistency and view-dependent lighting. However, these models are specialized for constrained viewing of single objects, such as cars or faces. Due to the size and complexity of realistic indoor environments, existing models lack the representational capacity to adequately capture them. Our decomposition scheme scales to larger and more complex scenes while preserving details and diversity, and the learned prior enables high-quality rendering from viewpoints that are significantly different from observed viewpoints. When compared to existing models, GSN produces quantitatively higher-quality scene renderings across several different scene datasets.

PDF
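
The local-conditioning idea (a grid of local latent codes, each conditioning the radiance field in its neighborhood) can be illustrated with a bilinear latent lookup feeding a shared MLP. The grid size, latent dimension, and MLP below are hypothetical; this is not the GSN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, GRID = 32, 16  # hypothetical latent size and 2D floor-plan grid resolution

# One local latent code per floor-plan cell (learned or predicted in practice).
latent_grid = nn.Parameter(torch.randn(1, LATENT_DIM, GRID, GRID))
radiance_mlp = nn.Sequential(nn.Linear(LATENT_DIM + 3, 128), nn.ReLU(), nn.Linear(128, 4))  # -> RGB + density

def query_radiance(points):
    """points: (N, 3) scene coordinates in [-1, 1]. Each point bilinearly interpolates the
    local latent codes on the floor plan (x, z) and is decoded by a shared MLP."""
    floor_xy = points[:, [0, 2]].view(1, -1, 1, 2)            # (1, N, 1, 2) for grid_sample
    local_z = F.grid_sample(latent_grid, floor_xy, align_corners=True)
    local_z = local_z.view(LATENT_DIM, -1).t()                # (N, LATENT_DIM)
    return radiance_mlp(torch.cat([local_z, points], dim=-1)) # (N, 4): RGB + sigma

pts = torch.rand(1024, 3) * 2 - 1  # random query points in the scene volume
rgb_sigma = query_radiance(pts)
```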