3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning.

GitHub Link

The GitHub link is https://github.com/dsx0511/3dmotformer

Introduction

The "3DMOTFormer" repository is the official implementation of the ICCV 2023 paper "3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking." The paper addresses accurate 3D object tracking for autonomous vehicles with a learned framework, 3DMOTFormer, built on the transformer architecture. The framework uses an Edge-Augmented Graph Transformer to reason frame by frame on track-detection graphs and performs data association through edge classification. To narrow the gap between training and inference, it proposes an online training strategy. Using CenterPoint detections, the approach achieves state-of-the-art results on the nuScenes validation and test splits. The repository provides installation instructions and data preparation steps for replication.

Content

Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) works typically rely on non-learned, model-based algorithms such as the Kalman filter, but these require many manually tuned parameters. Learning-based approaches, on the other hand, face the problem of adapting training to the online setting, which leads to an inevitable distribution mismatch between training and inference and to suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame by frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves state-of-the-art 71.2% and 68.2% AMOTA on the nuScenes validation and test splits, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors.
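
To make the idea of association via edge classification on a track-detection bipartite graph more concrete, here is a minimal Python sketch. It is not the repository's code. The function names, the 2D positions, the distance gate `max_dist`, and the score `threshold` are all hypothetical. The `score_edges` function is a simple distance-based stand-in for the learned Edge-Augmented Graph Transformer, and the greedy decoding step is an assumption made for illustration. The loop in `track_sequence` only mirrors the frame-by-frame, autoregressive flavor of the method: tracks used at each frame are the tracker's own outputs from the previous frame.

```python
from math import hypot


def build_edges(tracks, detections, max_dist=2.0):
    """Build bipartite edges between tracks and detections closer than max_dist."""
    edges = []
    for t_id, (tx, ty) in tracks.items():
        for d_idx, (dx, dy) in enumerate(detections):
            dist = hypot(tx - dx, ty - dy)
            if dist <= max_dist:
                edges.append((t_id, d_idx, dist))
    return edges


def score_edges(edges, max_dist=2.0):
    """Stand-in for a learned edge classifier: closer pairs get higher scores."""
    return [(t, d, 1.0 - dist / max_dist) for t, d, dist in edges]


def greedy_associate(scored_edges, threshold=0.5):
    """Pick edges by descending score so each track and detection is used once."""
    matches, used_dets = {}, set()
    for t, d, score in sorted(scored_edges, key=lambda e: e[2], reverse=True):
        if score < threshold:
            break  # all remaining edges are classified as "no match"
        if t in matches or d in used_dets:
            continue
        matches[t] = d
        used_dets.add(d)
    return matches


def track_sequence(frames, max_dist=2.0, threshold=0.5):
    """Process frames in order, feeding each frame the tracks from the last one."""
    tracks, next_id, history = {}, 0, []
    for detections in frames:
        edges = build_edges(tracks, detections, max_dist)
        matches = greedy_associate(score_edges(edges, max_dist), threshold)
        matched = set(matches.values())
        # Matched tracks take the position of their associated detection.
        new_tracks = {t: detections[d] for t, d in matches.items()}
        # Unmatched detections start new tracks.
        for d_idx, det in enumerate(detections):
            if d_idx not in matched:
                new_tracks[next_id] = det
                next_id += 1
        # Simplification: unmatched tracks are dropped immediately.
        tracks = new_tracks
        history.append(dict(tracks))
    return history


frames = [[(0.0, 0.0), (10.0, 0.0)], [(0.5, 0.0), (10.2, 0.0)], [(1.0, 0.0)]]
history = track_sequence(frames)
```

In this toy sequence, the two objects keep their track IDs across the first two frames, and the track whose object disappears in the last frame is dropped. A real tracker would usually keep unmatched tracks alive for some frames and would score edges with a trained model rather than a distance heuristic.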
