Summary

CVPR 2019 was dope (and complicated). I was blown away by how brilliant everybody was. Frankly, as someone who’s just tapping into Computer Vision and seeking to learn more about the field, it was intimidating and I understood very little of the content. This conference is definitely meant for those who are “in-the-know.”

With that said though, it was still a fascinating conference. My main takeaways (or at least the ones I understood) dealt with the different applications and the different types of research within the Computer Vision field today.

So to summarize my experience, I wanted to share some of the eccentric presentations and posters that stood out. This is in no way, shape, or form a comprehensive list, and it is not a judgment of the academic contributions these works make to the field. My own list would probably change tomorrow as I look through my notes and pictures again. Regardless, even if these summaries don’t do justice to the wonderful works that were presented, I hope they’ll pique your interest as they did mine.

All Talks, Tutorials, and Workshops also had a corresponding poster; they are all listed here.

Some of the major topics showcased at the conference were:

  • Deep Learning
  • 3D Multiview
  • Action & Video
  • Recognition
  • Segmentation, Grouping, and Shape
  • Statistics, Physics, Theory, and Datasets
  • Face & Body
  • Motion & Biometrics
  • Synthesis
  • Computational Photography & Graphics
  • Low-Level & Optimization
  • Scenes & Representation
  • Language & Reasoning
  • Applications, Medical, and Robotics
  • 3D Single View & RGBD

Top 8 (In no particular order)

Timeception for Complex Action Recognition

https://noureldien.com/research/timeception/
GitHub: https://github.com/noureldien/timeception

From the abstract: “We revisit the conventional definition of activity and restrict it to “Complex Action”: a set of one-actions with a weak temporal pattern that serves a specific purpose. […] In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reasons about minute-long temporal patterns, a factor of 8 longer than best related works.”

Take-aways:

  • There exists a dataset called Charades that contains slightly longer video clips labeled with more “complex actions”
  • “Simple actions” are actions like “jumping” or “cutting,” but everyday actions are more complex, like “cooking breakfast” or “cleaning the house.” Methods to detect these complex actions are not as prevalent
  • The paper proposes three contributions (a toy sketch of the core idea follows this list):
  • “We introduce a convolutional temporal layer [to] effectively and efficiently learn minute-long action ranges of 1024 timesteps, a factor of 8 longer than best related work.”
  • “We introduce multi-scale temporal kernels to account for large variations in duration of action components.”
  • “We use temporal-only convolutions, which are better suited for complex actions than spatiotemporal counterparts.”
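
To make the “multi-scale temporal-only convolution” idea a bit more concrete, here is a minimal sketch of my own in PyTorch. It is not the authors’ Timeception layer (which also uses channel grouping and other efficiency tricks); it just shows several 1-D convolutions with different kernel sizes running over the time axis of per-frame features, with their outputs concatenated.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Toy multi-scale temporal-only convolution (illustration only):
    each branch convolves over the time axis with a different kernel size,
    and the branch outputs are concatenated along the channel axis."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        per_branch = channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, per_branch, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):  # x: (batch, channels, timesteps)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Toy usage: a "minute-long" clip represented as 1024 timesteps of 64-dim features.
clip = torch.randn(2, 64, 1024)
layer = MultiScaleTemporalConv(64)
print(layer(clip).shape)  # torch.Size([2, 64, 1024])
```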

Speech2Face

https://speech2face.github.io/

From the abstract: “How much can we infer about a person’s looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural videos of people speaking from Internet/Youtube. During training, our model learns audiovisual, voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity.”
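
My rough reading of how the model is trained, sketched below as toy code (the architecture details here are made up, and this is not the authors’ implementation): a CNN encodes a speech spectrogram into a face-embedding vector, training pulls that vector toward the embedding a pretrained face-recognition network produces for a frame of the same speaker, and a separate decoder can then turn the embedding into a face image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoiceEncoder(nn.Module):
    """Toy voice encoder (illustration only): maps a spectrogram to a
    face-embedding-sized vector."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, spectrogram):  # (batch, 1, freq, time)
        return self.net(spectrogram)

# Toy training step: pull the voice embedding toward a stand-in "face embedding"
# that, in the real setup, would come from a pretrained face-recognition network.
encoder = VoiceEncoder()
spec = torch.randn(4, 1, 64, 128)   # fake spectrograms
face_emb = torch.randn(4, 512)      # placeholder for face-net features of the speaker
loss = F.l1_loss(encoder(spec), face_emb)
loss.backward()
```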

Take-aways:

  • Not sure how useful this actually is, but it’s still cool as hell
  • Please do read the “Ethical Considerations” section of their website

OpenCV 4.x and more new tools for CV R&D

Tutorial Link: https://opencv.org/cvpr-2019-tutorial.html/
Recommended section:
Updates on OpenCV 4.x — http://dl.opencv.org/present/cvpr_opencv.pdf

Panoptic Segmentation

https://research.fb.com/publications/panoptic-segmentation/
Corresponding presentation: http://presentations.cocodataset.org/COCO17-Invited-PanopticAlexKirillov.pdf

Summary: Panoptic segmentation is a mix of object detection/instance segmentation and semantic segmentation. Object detection means putting a bounding box around an object in a given image and labeling what that object is; instance segmentation additionally delineates each object’s pixels. Semantic segmentation means associating each pixel with a class label. Panoptic segmentation combines the two: every pixel gets a class label, and pixels of countable “thing” classes also get an instance ID.
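
A tiny sketch of my own of what a panoptic output looks like in practice (the LABEL_DIVISOR packing is an assumed convention borrowed from common implementations, not something taken from the paper): each pixel carries a semantic class ID, and pixels of countable classes also carry an instance ID.

```python
import numpy as np

# Toy 4x6 image: sky ("stuff") on top, two people ("things") below.
H, W, LABEL_DIVISOR = 4, 6, 1000  # LABEL_DIVISOR is an assumed packing convention

semantic = np.zeros((H, W), dtype=np.int64)  # class 0 = sky
semantic[2:, :] = 1                          # class 1 = person
instance = np.zeros((H, W), dtype=np.int64)  # 0 = no instance ("stuff")
instance[2:, :3] = 1                         # person #1
instance[2:, 3:] = 2                         # person #2

# Pack both into a single panoptic ID map.
panoptic = semantic * LABEL_DIVISOR + instance
print(np.unique(panoptic))  # [   0 1001 1002]
```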

Inserting Videos into Videos

http://openaccess.thecvf.com/content_CVPR_2019/html/Lee_Inserting_Videos_Into_Videos_CVPR_2019_paper.html

From the abstract: “Our main task is, given an object video and a scene video, to insert the object video at a user-specified location in the scene video so that the resulting video looks realistic. We aim to handle different object motions and complex backgrounds without expensive segmentation annotations.”

This one is best seen through the qualitative examples in the paper.

SFNet: Learning Object-aware Semantic Correspondence

http://openaccess.thecvf.com/content_CVPR_2019/html/Lee_SFNet_Learning_Object-Aware_Semantic_Correspondence_CVPR_2019_paper.html

From the abstract: “We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task.”

Take-aways:

  • Semantic correspondence means matching specific components in one image to the equivalent components in another image
  • Again, best showcased by the qualitative examples in the paper (a toy matching sketch follows this list)
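
As a toy illustration of what a dense correspondence/flow field is (my own sketch, not SFNet itself): given CNN feature maps of two images, every location in image A can be matched to its most similar location in image B by cosine similarity.

```python
import torch
import torch.nn.functional as F

def dense_correspondence(feat_a, feat_b):
    """Toy matcher (illustration only): for every location in image A, find the
    best-matching location in image B by cosine similarity of feature vectors.
    feat_a, feat_b: (C, H, W) feature maps from any pretrained backbone."""
    C, H, W = feat_a.shape
    a = F.normalize(feat_a.reshape(C, -1), dim=0)  # (C, H*W), unit-norm per location
    b = F.normalize(feat_b.reshape(C, -1), dim=0)
    corr = a.t() @ b                               # (H*W, H*W) similarity matrix
    best = corr.argmax(dim=1)                      # index in B for each location in A
    ys, xs = best // W, best % W
    return torch.stack([xs, ys]).reshape(2, H, W)  # x/y coordinates in B per A pixel

# Toy usage with random "features".
flow = dense_correspondence(torch.randn(64, 20, 20), torch.randn(64, 20, 20))
print(flow.shape)  # torch.Size([2, 20, 20])
```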

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

https://arxiv.org/abs/1811.11683
https://github.com/hassanhub/MultiGrounding (code had not yet been uploaded as of 07–16–2019)

From the Introduction: “Phrase grounding is the task of localizing within an image a given natural language input phrase. This ability to link text and image content is a key component of many visual semantic tasks such as image captioning, visual question answering, text-based image retrieval, and robotic navigation. It is especially challenging as it requires a good representation of both the visual and textual domain and an effective way of linking them.”
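
As a toy illustration of the general idea (my own sketch, not the paper’s multi-level model): once visual features and a phrase embedding live in a common semantic space, scoring every spatial location against the phrase gives a heatmap of where the phrase grounds in the image.

```python
import torch
import torch.nn.functional as F

def grounding_heatmap(visual_feats, phrase_emb):
    """Toy phrase-grounding heatmap (illustration only).
    visual_feats: (C, H, W) image feature map; phrase_emb: (C,) text embedding;
    both are assumed to already be projected into a shared semantic space."""
    C, H, W = visual_feats.shape
    v = F.normalize(visual_feats.reshape(C, -1), dim=0)  # (C, H*W)
    p = F.normalize(phrase_emb, dim=0)                   # (C,)
    scores = p @ v                                       # cosine similarity per location
    return scores.softmax(dim=0).reshape(H, W)           # normalized localization map

# Toy usage: a 7x7 feature map and a phrase embedding, both 256-dimensional.
heat = grounding_heatmap(torch.randn(256, 7, 7), torch.randn(256))
print(heat.shape)  # torch.Size([7, 7])
```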

Complete the Look: Scene-based Complementary Product Recommendation

https://arxiv.org/abs/1812.01748
Medium article: https://medium.com/@Pinterest_Engineering/introducing-complete-the-look-a-scene-based-complementary-recommendation-system-eb891c3fe88

Summary: A recommendation system that takes an image as input and does two things. First, it identifies clothing items in that image and finds products similar to each item; second, it recommends other items of clothing that may go well with the products in the input image.
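
Here is a rough toy sketch of the “find similar, then recommend complementary” flow (my own illustration using plain embedding distances as a stand-in for Pinterest’s learned style/compatibility model):

```python
import numpy as np

def recommend(scene_item_emb, catalog_embs, catalog_categories, item_category, k=3):
    """Toy recommender (illustration only): given the embedding of a clothing item
    detected in the scene image, return (1) the closest catalog products in the
    same category and (2) the closest products from other categories as
    "complementary" picks."""
    order = np.argsort(np.linalg.norm(catalog_embs - scene_item_emb, axis=1))
    same = [i for i in order if catalog_categories[i] == item_category]
    other = [i for i in order if catalog_categories[i] != item_category]
    return same[:k], other[:k]

# Toy usage: 100 catalog products with 32-dim "style" embeddings.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(100, 32))
categories = rng.choice(["top", "bottom", "shoes"], size=100)
similar, complementary = recommend(rng.normal(size=32), catalog, categories, "top")
print(similar, complementary)
```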

Take-aways:

  • More application-focused than other posters/papers
  • The products the system recommends are based on training data of commonly matched products that are often seen worn together on fashion models in images