Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture

Abstract

       In this paper we address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.

1.     Introduction

        Scene understanding is a central problem in vision that has many different aspects. These include semantic labels describing the identity of different scene portions; surface normal or depth estimates describing the physical geometry; instance labels of the extent of individual objects; and affordances capturing possible interactions of people with the environment. Many of these are often represented with a pixel-map containing a value or label for each pixel, e.g. a map containing the semantic label of the object visible at each pixel, or the vector coordinates of the surface normal orientation.
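
To make the pixel-map idea concrete, here is a minimal sketch in plain Python. The 2x3 image size, the class names, and every depth and normal value are made up for illustration; the only point is that each output is a map with one entry per pixel: a label, a scalar depth, or a unit 3-vector.

```python
import math

# Hypothetical 2x3 "image": every pixel gets one entry in each output map.
HEIGHT, WIDTH = 2, 3

# Semantic label map: one class name per pixel (class names are made up).
label_map = [
    ["wall", "wall", "floor"],
    ["wall", "floor", "floor"],
]

# Depth map: one distance in metres per pixel (values are made up).
depth_map = [
    [2.5, 2.4, 1.8],
    [2.5, 1.6, 1.2],
]

def unit(v):
    """Scale a 3-vector to unit length, as a surface normal should be."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

# Surface normal map: one unit 3-vector (x, y, z) per pixel.
WALL_NORMAL = unit((0.0, 0.0, 1.0))   # assumed to face the camera
FLOOR_NORMAL = unit((0.0, 1.0, 0.2))  # assumed to point mostly upwards
normal_map = [
    [WALL_NORMAL if label == "wall" else FLOOR_NORMAL for label in row]
    for row in label_map
]

def shape(pixel_map):
    """Return (rows, columns) of a nested-list pixel map."""
    return len(pixel_map), len(pixel_map[0])
```

All three maps share the image's spatial layout, which is what lets one architecture produce any of them by changing only what is stored at each pixel.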

        In this paper, we address three of these tasks: depth prediction, surface normal estimation and semantic segmentation, all using a single common architecture. Our multi-scale approach generates pixel-maps directly from an input image, without the need for low-level superpixels or contours, and is able to align to many image details using a series of convolutional network stacks applied at increasing resolution. At test time, all three outputs can be generated in real time (~30Hz). We achieve state-of-the-art results on all three tasks we investigate, demonstrating our model’s versatility.
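
The coarse-to-fine idea can be sketched roughly as follows. This is a toy illustration, not the paper's network: nothing is learned, and the helper names (`downsample`, `upsample`, `refine`, `predict`), the three resolutions (1/4, 1/2, full) and the 0.5 blending weight are all assumptions. The simple `refine` function stands in for a convolutional stack. Only the data flow is meant to carry over: a low-resolution prediction is upsampled and then refined with features computed at the next, higher resolution.

```python
def downsample(image, factor):
    """Average-pool a 2D list by an integer factor."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(image[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor)) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def upsample(pixel_map, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    return [
        [pixel_map[r // factor][c // factor]
         for c in range(len(pixel_map[0]) * factor)]
        for r in range(len(pixel_map) * factor)
    ]

def refine(previous, features, weight=0.5):
    """Stand-in for a learned stack: blend the coarser prediction with
    features computed at the current resolution."""
    return [
        [weight * p + (1 - weight) * f for p, f in zip(prev_row, feat_row)]
        for prev_row, feat_row in zip(previous, features)
    ]

def predict(image):
    """Run three stages at 1/4, 1/2 and full resolution."""
    coarse = downsample(image, 4)                             # global, low-res guess
    mid = refine(upsample(coarse, 2), downsample(image, 2))   # add mid-level detail
    fine = refine(upsample(mid, 2), image)                    # add fine detail
    return coarse, mid, fine

image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
coarse, mid, fine = predict(image)
print(len(coarse), len(mid), len(fine))  # prints: 2 4 8
```

Each stage sees the previous stage's output alongside the image at its own resolution, so later stages only need to add detail and do not have to predict the whole map from scratch.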

        There are several advantages in developing a general model for pixel-map regression. First, applications to new tasks may be quickly developed, with much of the new work lying in defining an appropriate training set and loss function; in this light, our work is a step towards building off-the-shelf regressor models that can be used for many applications. In addition, use of a single architecture helps simplify the implementation of systems that require multiple modalities, e.g. robotics or augmented reality. Lastly, in the case of depth and normals, much of the computation can be shared between modalities, making the system more efficient.
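
One way to picture "new task = new training set and loss function" is a small task table like the sketch below. The table layout, the function names, the four-class label set and the specific losses (squared error for depth, negative dot product for normals, softmax cross-entropy for labels) are illustrative assumptions and are not taken from the text above. The sketch only shows that the per-pixel output size and the loss can change while everything else stays the same.

```python
import math

def squared_error(pred, target):
    """Depth: squared difference between two scalars."""
    return (pred - target) ** 2

def negative_dot(pred, target):
    """Normals: negative dot product of two unit 3-vectors (lower is better)."""
    return -sum(p * t for p, t in zip(pred, target))

def cross_entropy(scores, target_index):
    """Labels: softmax cross-entropy over per-class scores."""
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[target_index]

# Hypothetical task table: only the output size and the loss differ.
TASKS = {
    "depth":   {"channels": 1, "loss": squared_error},
    "normals": {"channels": 3, "loss": negative_dot},
    "labels":  {"channels": 4, "loss": cross_entropy},  # 4 classes, made up
}

def mean_loss(task, predictions, targets):
    """Average a task's per-pixel loss over a flat list of pixels."""
    loss = TASKS[task]["loss"]
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / len(targets)
```

Under this framing, supporting another pixel-map task would mean adding one more entry to `TASKS` plus a matching training set.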

 

Reference: 07-Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2650-2658. Cited 1982 times.

 


Article source: 博客园 (cnblogs)

Original link: https://www.cnblogs.com/ProfSnail/p/14898831.html
