Cityscapes
The Cityscapes Dataset for Semantic Urban Scene Understanding
Abstract
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes.
To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
 
Common settings
- All baselines were trained on 8 GPUs with a total batch size of 8 (1 image per GPU), using the linear scaling rule to scale the learning rate.
 
- All models were trained on cityscapes_train and tested on cityscapes_val.
- The 1x training schedule denotes 64 epochs, which corresponds to slightly fewer than the 24k iterations of the original schedule reported in the Mask R-CNN paper.
 
- COCO pre-trained weights are used to initialize the models.
 
- A conversion script is provided to convert Cityscapes into COCO format. Please refer to install.md for details.
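The batch size, learning-rate scaling, and schedule length listed above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the reference learning rate (0.02 at a batch size of 16) and the training-set size (2975 finely annotated images) are assumptions that are not stated in this document, and the helper names `scaled_lr` and `total_iterations` are hypothetical.

```python
import math


def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: the learning rate scales in proportion to the total batch size."""
    return base_lr * batch_size / base_batch_size


def total_iterations(num_images, batch_size, epochs):
    """Iterations in a schedule, counting the final partial batch of each epoch."""
    return epochs * math.ceil(num_images / batch_size)


# Settings stated in the list above.
NUM_GPUS = 8
IMAGES_PER_GPU = 1
EPOCHS_1X = 64
BATCH_SIZE = NUM_GPUS * IMAGES_PER_GPU

# Assumptions (not stated in this document).
ASSUMED_BASE_LR = 0.02          # reference learning rate ...
ASSUMED_BASE_BATCH_SIZE = 16    # ... at this reference batch size
ASSUMED_TRAIN_IMAGES = 2975     # size of the finely annotated training split

lr = scaled_lr(ASSUMED_BASE_LR, ASSUMED_BASE_BATCH_SIZE, BATCH_SIZE)
iters = total_iterations(ASSUMED_TRAIN_IMAGES, BATCH_SIZE, EPOCHS_1X)

print(f"batch size {BATCH_SIZE}: lr = {lr}, {EPOCHS_1X} epochs = {iters} iterations")
```

Under these assumptions, halving the batch size from 16 to 8 halves the learning rate to 0.01, and 64 epochs come to 23808 iterations, which is consistent with "slightly fewer than 24k".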
 
CityscapesDataset implements three evaluation methods. bbox and segm are the standard COCO bbox and mask AP. cityscapes is the official Cityscapes dataset evaluation, whose scores may be slightly higher than those of the COCO-style evaluation.
Faster R-CNN
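| Backbone | Style | Lr schd | Scale | Mem (GB) | Inf time (fps) | box AP | Config | Download |
| :------: | :-----: | :-----: | :------: | :------: | :------------: | :----: | :----: | :------: |
| R-50-FPN | pytorch | 1x | 800-1024 | 5.2 | - | 40.3 | config | model \| log |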
Mask R-CNN
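| Backbone | Style | Lr schd | Scale | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| :------: | :-----: | :-----: | :------: | :------: | :------------: | :----: | :-----: | :----: | :------: |
| R-50-FPN | pytorch | 1x | 800-1024 | 5.3 | - | 40.9 | 36.4 | config | model \| log |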
Citation
@inproceedings{Cordts2016Cityscapes,
   title={The Cityscapes Dataset for Semantic Urban Scene Understanding},
   author={Cordts, Marius and Omran, Mohamed and Ramos, Sebastian and Rehfeld, Timo and Enzweiler, Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and Schiele, Bernt},
   booktitle={Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
   year={2016}
}