# ViTDet

## Description

This is an implementation of [ViTDet](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet) based on [MMDetection](https://github.com/open-mmlab/mmdetection), [MMCV](https://github.com/open-mmlab/mmcv), and [MMEngine](https://github.com/open-mmlab/mmengine).

## Usage

### Training commands

Following the original [setting](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet), this project is trained with a total batch size of 64 (16 GPUs with 4 images per GPU).

In MMDetection's root directory, run the following command to train the model:

```bash
GPUS=${GPUS} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
```

Below is an example of using 16 GPUs to train ViTDet on a Slurm partition named _dev_, with the work-dir set to a shared file system. (A single-machine alternative without Slurm is sketched at the end of this README.)

```shell
GPUS=16 ./tools/slurm_train.sh dev vitdet_mask_b projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py /nfs/xxxx/vitdet_mask-rcnn_vit-b-mae_lsj-100e
```

### Testing commands

In MMDetection's root directory, run the following command to test the model. (A multi-GPU variant using the released checkpoint is sketched at the end of this README.)

```bash
python tools/test.py projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py ${CHECKPOINT_PATH}
```

## Results

Based on MMDetection, this project closely matches the test and training accuracy of the official [ViTDet](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).

|                           Method                           | Backbone | Pretrained Model |  Training set  |   Test set   | Epoch | Val Box AP | Val Mask AP |                                                                                                                              Download                                                                                                                              |
| :--------------------------------------------------------: | :------: | :--------------: | :------------: | :----------: | :---: | :--------: | :---------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| [ViTDet](./configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py) |  ViT-B   |       MAE        | COCO2017 Train | COCO2017 Val |  100  |    51.6    |    45.7     | [model](https://download.openmmlab.com/mmdetection/v3.0/vitdet/vitdet_mask-rcnn_vit-b-mae_lsj-100e/vitdet_mask-rcnn_vit-b-mae_lsj-100e_20230328_153519-e15fe294.pth) / [log](https://download.openmmlab.com/mmdetection/v3.0/vitdet/vitdet_mask-rcnn_vit-b-mae_lsj-100e/vitdet_mask-rcnn_vit-b-mae_lsj-100e_20230328_153519.log.json) |

**Note**:

1. The mask AP is slightly lower than that of the official [repo](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).
2. Code and weights for other model versions will be released in the future.

## Citation

```latex
@article{li2022exploring,
  title={Exploring plain vision transformer backbones for object detection},
  author={Li, Yanghao and Mao, Hanzi and Girshick, Ross and He, Kaiming},
  journal={arXiv preprint arXiv:2203.16527},
  year={2022}
}
```

## Checklist

- [x] Milestone 1: PR-ready, and acceptable to be one of the `projects/`.

  - [x] Finish the code

  - [x] Basic docstrings & proper citation

  - [x] Test-time correctness

  - [x] A full README

- [x] Milestone 2: Indicates a successful model implementation.

  - [x] Training-time correctness

- [ ] Milestone 3: Good to be a part of our core package!

  - [ ] Type hints and docstrings

  - [ ] Unit tests

  - [ ] Code polishing

  - [ ] Metafile.yml

- [ ] Move your modules into the core package following the codebase's file hierarchy structure.
- [ ] Refactor your modules into the core package following the codebase's file hierarchy structure.
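
## Additional usage sketches

For single-machine training without Slurm, MMDetection also ships `tools/dist_train.sh`. Below is a minimal sketch, assuming 8 local GPUs and an illustrative `work_dirs` path (neither is part of the original recipe); note that the original schedule uses a total batch size of 64, so with fewer GPUs the per-GPU batch size or learning rate would need adjusting.

```bash
# Sketch: single-machine multi-GPU training (assumes 8 local GPUs).
# With 8 GPUs x 4 images per GPU the effective batch size is 32, half of
# the original 64, so the learning rate or per-GPU batch size would need
# tuning to reproduce the reported results.
bash ./tools/dist_train.sh \
    projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py \
    8 \
    --work-dir work_dirs/vitdet_mask-rcnn_vit-b-mae_lsj-100e
```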
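Similarly, the released checkpoint from the results table can be evaluated on multiple GPUs with `tools/dist_test.sh`. A sketch, assuming 8 local GPUs, a `checkpoints/` directory, and COCO2017 prepared under `data/coco` as in the standard MMDetection layout:

```bash
# Sketch: download the released checkpoint and evaluate on COCO2017 val.
# The checkpoints/ directory and the GPU count are illustrative assumptions.
mkdir -p checkpoints
wget -O checkpoints/vitdet_mask-rcnn_vit-b-mae_lsj-100e.pth \
    https://download.openmmlab.com/mmdetection/v3.0/vitdet/vitdet_mask-rcnn_vit-b-mae_lsj-100e/vitdet_mask-rcnn_vit-b-mae_lsj-100e_20230328_153519-e15fe294.pth
bash ./tools/dist_test.sh \
    projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py \
    checkpoints/vitdet_mask-rcnn_vit-b-mae_lsj-100e.pth \
    8
```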