Multi-Modal End-to-End Holistic Perception for Autonomous Driving - Ph.D. Proposal
This post describes my Ph.D. proposal to the Foundation for Science and Technology (FCT) in Portugal. The proposal was ranked 3rd out of 140 candidates in the Electrical and Electronic Engineering evaluation panel of the 2023 FCT Ph.D. Scholarships. My Ph.D. is taking place at the University of Aveiro in Portugal, and I will collaborate in the future with the Computer Vision Center (CVC) of the Universitat Autònoma de Barcelona (UAB).
Abstract
Autonomous driving systems require accurate perception algorithms to navigate safely through traffic. Despite recent progress, perception systems still face challenges in achieving high accuracy and robustness, and a complete holistic perception of the scene has yet to be achieved. End-to-end perception approaches jointly learn perception tasks, transforming raw sensor data directly into object detections with motion predictions. To improve trajectory predictions, recent approaches have attempted to model interactions between actors (dynamic objects), whose behavior depends on each other and on their interplay with the scene. However, current models only consider interactions within the same object class. Modeling interactions between actors from different classes improves perception accuracy and enables a more holistic understanding of the scene by preventing misinterpretation of object behavior. In this context, this Ph.D. proposal aims to develop a multi-modal end-to-end holistic perception approach capable of modeling both inter-class and intra-class interactions between actors, as well as their interplay with the scene.
Keywords
- Object Detection
- Motion Prediction
- Holistic Perception
- End-to-End Perception
- Deep Learning
- Autonomous Driving
State of the Art
Autonomous driving is an emerging technology that holds the promise of revolutionizing transportation. Autonomous driving systems are usually developed using a modular approach [1, 2]. One of the most critical components is the perception module, which is responsible for accurately assessing the environment surrounding the vehicle to enable safe navigation through traffic [3]. The perception module must be accurate, robust, and operate in real-time [4].
The perception module is divided into three tasks: object detection, object tracking, and motion prediction [5]. These tasks are usually learned independently and executed sequentially. Information is not shared among them, and uncertainty is rarely propagated between them, leading to information loss [6, 7]. Recently, end-to-end perception approaches have emerged that jointly learn and optimize all these tasks within a single neural network using multi-task learning [5–12]. These approaches are efficient for real-time operation because computational resources are shared across tasks, and all three tasks have direct access to the raw data, leveraging the knowledge shared between them. They have yielded promising results, reducing false negatives for distant and occluded objects and false positives by accumulating evidence over time [5, 6, 13, 14].
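To make the shared-computation idea concrete, below is a minimal PyTorch sketch of a single network with a shared backbone over a bird's-eye-view grid and two lightweight heads, one for detection and one for motion forecasting. This is not a reproduction of any cited model; the tensor shapes, layer sizes, and output parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyEndToEndPerception(nn.Module):
    """Illustrative end-to-end model: shared BEV backbone + two task heads."""

    def __init__(self, in_channels=32, horizon=6):
        super().__init__()
        # Shared backbone over a voxelized bird's-eye-view pseudo-image.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: per-cell objectness score + box parameters (x, y, w, l, yaw).
        self.det_head = nn.Conv2d(128, 1 + 5, kernel_size=1)
        # Forecasting head: per-cell (dx, dy) displacement for each future time step.
        self.motion_head = nn.Conv2d(128, 2 * horizon, kernel_size=1)

    def forward(self, bev):                 # bev: (B, C, H, W)
        feats = self.backbone(bev)          # shared features reused by both tasks
        return self.det_head(feats), self.motion_head(feats)

model = TinyEndToEndPerception()
det_out, motion_out = model(torch.randn(1, 32, 256, 256))
print(det_out.shape, motion_out.shape)     # (1, 6, 64, 64) and (1, 12, 64, 64)
# Multi-task training would combine per-task losses, e.g.:
# loss = detection_loss(det_out, det_targets) + w * forecasting_loss(motion_out, motion_targets)
```

Because both heads read the same backbone features, the detection and forecasting tasks share most of the computation, which is what makes this family of models attractive for real-time use.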
Most of these approaches predict the trajectory of each actor (dynamic object) independently from its past trajectory, without taking the interactions between actors into account. This strategy undermines the accuracy of the predicted trajectories, as the evolution of a trajectory is highly dependent on the behavior of other actors and on environmental factors. To address this limitation, recent approaches have attempted to model the interactions between actors and their interplay with the scene [7, 14–17]. These approaches model spatial and temporal dependencies using different neural network architectures, such as graph neural networks (GNNs) [16], recurrent neural networks (RNNs) [17], convolutional neural networks (CNNs) [14], and transformers [17]. These interaction models improved the accuracy of perception systems, demonstrating their importance for motion prediction. However, these approaches focus on a single category of objects, such as cars or pedestrians, and inter-class interactions are not considered.
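As a toy illustration of how such interaction models operate, the snippet below performs one round of message passing over a fully connected graph of actor feature vectors, in the spirit of the graph-based approaches cited above. It is a generic sketch rather than a reproduction of any cited architecture; the feature dimension and the GRU-based update rule are assumptions.

```python
import torch
import torch.nn as nn

class ActorMessagePassing(nn.Module):
    """One round of message passing on a fully connected actor graph (illustrative)."""

    def __init__(self, dim=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.GRUCell(dim, dim)

    def forward(self, actor_feats):              # (N, dim): one feature vector per actor
        n = actor_feats.size(0)
        # Build a message for every ordered pair (receiver i, sender j).
        senders = actor_feats.unsqueeze(0).expand(n, n, -1)
        receivers = actor_feats.unsqueeze(1).expand(n, n, -1)
        messages = self.edge_mlp(torch.cat([receivers, senders], dim=-1))
        # Mask out self-messages, then aggregate over all senders.
        mask = 1.0 - torch.eye(n, device=actor_feats.device).unsqueeze(-1)
        aggregated = (messages * mask).sum(dim=1)
        return self.update(aggregated, actor_feats)   # refined, interaction-aware actor states

refined = ActorMessagePassing()(torch.randn(5, 64))   # 5 actors, 64-dim features
```

The refined actor states can then feed a trajectory decoder, so that each prediction depends on the other actors rather than only on the actor's own past motion.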
The interactions between actors from different classes have not been studied so far and remain a research gap that this Ph.D. proposal aims to address. Modeling these interactions is an essential step toward more accurate and reliable perception systems, since such interactions can significantly affect the behavior and trajectory of each actor. For example, the presence of a cyclist in the vicinity of a car can significantly change how the car behaves and, in turn, the car's trajectory can influence that of the cyclist. Failing to model these interactions can therefore lead to inaccurate predictions and potential safety hazards. In this context, this Ph.D. proposal aims to develop a complete end-to-end holistic perception approach capable of modeling the interactions between actors from the same and different classes, and their interplay with the scene. Additionally, the goal is to explore the benefits of sensor fusion between different data modalities [17–22], allowing the system to work properly across a large set of environments with diverse weather and lighting conditions [23].
Objectives
The main objective of this Ph.D. proposal is to develop a multi-modal end-to-end holistic perception approach for autonomous driving. Its primary contribution is a complete holistic perception of the scene, considering the interactions between actors from the same and different classes (cars, buses, trucks, pedestrians, and cyclists), and their interplay with the scene (lanes, traffic signs, and traffic states), in order to accurately detect them and predict their future motions. The joint detection and motion prediction of each actor in the scene will be based on three complementary components:
- Processing its past motion
- Modeling the interactions with other actors from the same and different classes
- Processing the contextual information about the scene
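The sketch below shows one way these three streams could be combined for every actor in a scene. It is only a minimal illustration of the intended direction, not a committed design: the GRU past-motion encoder, the learned class embedding that lets a single attention module mix cars, buses, trucks, pedestrians, and cyclists (i.e., both intra- and inter-class interactions), the pooled scene feature, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 5  # car, bus, truck, pedestrian, cyclist

class HolisticActorPredictor(nn.Module):
    """Fuses past motion, intra-/inter-class interactions, and scene context per actor."""

    def __init__(self, dim=64, horizon=6):
        super().__init__()
        self.motion_enc = nn.GRU(input_size=2, hidden_size=dim, batch_first=True)
        self.class_emb = nn.Embedding(NUM_CLASSES, dim)        # lets attention tell classes apart
        self.interaction = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.scene_enc = nn.Linear(128, dim)                   # e.g. pooled map/raster features
        self.decoder = nn.Linear(3 * dim, 2 * horizon)         # future (x, y) waypoints

    def forward(self, past_xy, classes, scene_feat):
        # past_xy: (N, T, 2) past positions, classes: (N,) class ids, scene_feat: (N, 128)
        _, h = self.motion_enc(past_xy)                        # h: (1, N, dim)
        motion = h.squeeze(0) + self.class_emb(classes)        # class-aware actor tokens
        tokens = motion.unsqueeze(0)                           # one "scene" of N actor tokens
        interaction, _ = self.interaction(tokens, tokens, tokens)
        fused = torch.cat([motion, interaction.squeeze(0), self.scene_enc(scene_feat)], dim=-1)
        return self.decoder(fused).view(past_xy.size(0), -1, 2)   # (N, horizon, 2)

model = HolisticActorPredictor()
future = model(torch.randn(7, 10, 2), torch.randint(0, NUM_CLASSES, (7,)), torch.randn(7, 128))
print(future.shape)   # torch.Size([7, 6, 2])
```

The key point of the sketch is that attention runs over all actors at once, regardless of class, so the predicted trajectory of a car can be conditioned on a nearby cyclist and vice versa.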
Based on this objective, the proposed research question is: How can interactions between actors from different object categories be effectively modeled in an end-to-end perception approach, in order to accurately detect them, predict their future motions, and improve the overall holistic perception of the scene? In summary, this Ph.D. proposal can be divided into three specific objectives:
- Explore perception datasets:
- We intend to use the two most widely used large-scale datasets, nuScenes [24] and the Waymo Open Dataset [25], which provide thousands of real-world labeled scenes with diverse weather and lighting conditions, and offer benchmarks with standard metrics.
- Development of end-to-end holistic perception architectures:
- Our objective is to explore and combine multi-task learning [6] with RNNs and transformers, which are suited to capturing temporal dependencies between time steps and spatial dependencies between actors, respectively [17].
- Explore sensor fusion techniques:
- Most end-to-end perception approaches receive only point clouds as input [18]. Point clouds are usually transformed into bird's-eye-view and range-view representations, which are suitable for fusing with HD maps and RGB images, respectively [21]. Therefore, our objective is to combine the benefits of fusing these four data modalities.
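For intuition about the two point-cloud representations mentioned above, the NumPy sketch below rasterizes a raw LiDAR sweep into a bird's-eye-view occupancy grid and a range image. Grid resolution, vertical field of view, and beam count are arbitrary assumptions; a real pipeline would also encode height slices, intensity, and multiple sweeps, and would fuse HD-map rasters and camera features on top of these grids.

```python
import numpy as np

def to_bev(points, x_range=(-50, 50), y_range=(-50, 50), res=0.25):
    """Binary bird's-eye-view occupancy grid from an (N, 3) point cloud."""
    w = int((x_range[1] - x_range[0]) / res)
    h = int((y_range[1] - y_range[0]) / res)
    grid = np.zeros((h, w), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    grid[yi[keep], xi[keep]] = 1.0           # object footprint is range-invariant here
    return grid

def to_range_view(points, n_beams=32, n_cols=1024, v_fov=(-30.0, 10.0)):
    """Range image: rows = vertical beams, columns = azimuth bins, values = distance."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points[:, :3], axis=1)
    azimuth = np.arctan2(y, x)                               # [-pi, pi]
    elevation = np.degrees(np.arcsin(z / np.maximum(rng, 1e-6)))
    row = ((elevation - v_fov[0]) / (v_fov[1] - v_fov[0]) * (n_beams - 1)).astype(int)
    col = ((azimuth + np.pi) / (2 * np.pi) * (n_cols - 1)).astype(int)
    keep = (row >= 0) & (row < n_beams)
    image = np.zeros((n_beams, n_cols), dtype=np.float32)
    image[row[keep], col[keep]] = rng[keep]   # nearby objects cover more pixels than distant ones
    return image

points = np.random.uniform(-40, 40, size=(2000, 3))
print(to_bev(points).shape, to_range_view(points).shape)    # (400, 400) (32, 1024)
```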
Detailed Description of the Tasks
The detailed description of this Ph.D. proposal consists of seven tasks to be completed over a four-year period starting from 01/12/2023. We outline each task in detail and also present the risks involved and respective mitigation plans.
- Task 1 - State-of-the-art review (43 months):
- In this task, all relevant publications in the field of end-to-end perception will be continuously analyzed throughout the entire time span of the Ph.D. The goal is to conduct a narrative review of the literature to stay up to date with new methods in this area and, if needed, adapt the research statement to explore new opportunities and ideas in the field.
- Task 2 - Perception datasets (3 months):
- Perception algorithms now rely almost exclusively on neural networks, which are data-driven approaches that require large-scale datasets. Accordingly, the goal of this task is to explore nuScenes [24] and the Waymo Open Dataset [25], which are real-world datasets with thousands of labeled scenes captured by the full sensor suite of a real autonomous vehicle. The intention is to explore how to use their data to develop our perception algorithm. Additionally, we will study the standard metrics used to evaluate the perception tasks, such as mean average precision, average multiple object tracking accuracy, average displacement error, and trajectory collision rate, among others (a worked example of the displacement-error metrics is given after this task list). These metrics are used in the datasets' benchmarks, where our algorithm can be compared with state-of-the-art perception algorithms.
- Task 3 - End-to-end holistic perception architectures (13 months):
- We intend to explore and implement state-of-the-art end-to-end perception algorithms that model a holistic perception of the scene. From these algorithms, we will gather ideas on how to model the interactions between actors and their interplay with the scene. The most common architectures used to model these interactions are RNNs and transformers [17], combined with the multi-task learning paradigm for end-to-end perception [6]. A comparison and evaluation of their advantages and disadvantages will be performed to decide on the best architecture, or combination of architectures, for the holistic perception model. The goal is to develop an end-to-end holistic perception algorithm that captures the complex interplay between multiple actors across different object categories and their interactions with the scene. At the end of this task, we expect to demonstrate the importance of modeling inter-class interactions between actors by improving the accuracy and reliability of the perception system, which is the key objective of this Ph.D. proposal.
- Task 4 - Sensor fusion techniques (13 months):
- Most end-to-end perception approaches receive point clouds and HD maps as input, and only a few have explored the benefits of sensor fusion with RGB images [17–21]. Fusing data modalities from different sensors can overcome the shortcomings of individual sensors working independently [23]. In this context, the goal is to explore sensor fusion techniques that aggregate the benefits of fusing both bird's-eye-view and range-view point cloud representations with RGB images and HD maps. RGB images provide color and high-resolution data, and HD maps provide semantic information about the scene, such as road markings, lane boundaries, and traffic signs [20]. One advantage of the bird's-eye-view is that the size of objects remains constant regardless of range; however, this representation loses information needed to detect smaller objects due to the discretization of the point cloud into voxels. The range-view, on the other hand, is the native representation of point clouds, providing strong detection performance for smaller objects, but the size of objects varies with range [21]. The selected sensor fusion technique will be incorporated into our perception algorithm to improve its robustness and accuracy.
- Task 5 - Domain adaptation for AtlasCar 2 in Aveiro (8 months):
- AtlasCar 2 [26] is an autonomous vehicle developed by the Atlas Project at the University of Aveiro. We intend to explore domain adaptation to adapt our perception algorithm to receive as input data from the sensor suite of the AtlasCar 2, which may differ from the sensor suite (and respective calibration) of the autonomous vehicle used to produce the public datasets. The goal is to test and validate our perception algorithm with an autonomous vehicle available at the hosting institution.
- Task 6 - Writing of articles (6 months):
- The goal is to publish our state-of-the-art contributions in international journals and conferences (milestone 1). The international conferences in which we aim to publish are the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), the IEEE International Conference on Computer Vision (ICCV), and the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Concerning international journals, we aim to publish in IEEE Transactions on Pattern Analysis and Machine Intelligence and IEEE Transactions on Neural Networks and Learning Systems.
- Task 7 - Writing of the thesis (7 months):
- This task is concerned with writing the Ph.D. thesis (milestone 2). An introductory contextualization and a systematic literature review will be presented, as well as a summary of the research questions that the Ph.D. aimed to answer. The document will then present all relevant contributions of the Ph.D., alongside the methodologies, results, and conclusions.
- Acknowledged risks and mitigation plans:
- A high-impact risk is the shutdown of the Atlas Project and, consequently, the non-availability of the AtlasCar 2 for task 5. In this case, we plan to use the CARLA Simulator [27] from the foreign hosting institution CVC to mitigate this risk; with CARLA, we can implement domain adaptation for several sensor suites and setups, not only the one available in the AtlasCar 2 and the one from the datasets (a minimal CARLA sensor-configuration sketch is included after this task list). Another risk is a poor choice of interaction architectures and sensor fusion techniques in development tasks 3 and 4, leading to suboptimal performance. In this case, we plan to explore other potential architectures and assess their feasibility, such as GNNs, and also consider dropping one data representation to reduce the complexity of the system. A further high-impact risk is a malfunction of the hardware (DeepLar server - DEM/UA) available to develop and train the perception algorithms. To mitigate this risk, the supervision team has several contacts and partners at other institutions (including CVC), where we can request the use of their hardware.
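As a worked example of the displacement-error metrics mentioned in Task 2, the snippet below computes the average and final displacement error for a single toy trajectory; the definitions follow the common usage of these metrics, and the numbers are made up.

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean L2 distance over all future time steps."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final displacement error: L2 distance at the last predicted time step."""
    return np.linalg.norm(pred[-1] - gt[-1])

# Toy 3-step prediction for one actor, (x, y) positions in metres.
pred = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
gt   = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5]])
print(ade(pred, gt), fde(pred, gt))   # 1.0  1.5
```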
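Finally, to illustrate the CARLA-based mitigation for Task 5, the sketch below attaches a configurable LiDAR to an ego vehicle so that different sensor suites (channel counts, ranges, mounting heights) can be emulated for domain adaptation experiments. It assumes a CARLA 0.9.x server running locally on the default port; the blueprint and attribute names are those exposed by recent 0.9 releases, and the vehicle model and sensor settings are arbitrary choices.

```python
import numpy as np
import carla

client = carla.Client('localhost', 2000)     # assumes a local CARLA server on the default port
client.set_timeout(10.0)
world = client.get_world()
blueprints = world.get_blueprint_library()

# Spawn an ego vehicle at the first predefined spawn point of the current map.
vehicle_bp = blueprints.find('vehicle.tesla.model3')
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

# Configure a LiDAR to mimic a given sensor suite (channel count, range, rotation rate).
lidar_bp = blueprints.find('sensor.lidar.ray_cast')
lidar_bp.set_attribute('channels', '32')
lidar_bp.set_attribute('range', '80')
lidar_bp.set_attribute('rotation_frequency', '10')
lidar = world.spawn_actor(lidar_bp, carla.Transform(carla.Location(z=1.8)), attach_to=vehicle)

def on_scan(measurement):
    # Each measurement carries a flat float32 buffer of (x, y, z, intensity) points.
    points = np.frombuffer(measurement.raw_data, dtype=np.float32).reshape(-1, 4)
    print(points.shape)

lidar.listen(on_scan)
```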
Timetable - Gantt Chart
Figure 1: Gantt chart of the Ph.D. proposal.
References
[1] A. Kendall et al., “Learning to Drive in a Day”, in 2019 International Conference on Robotics and Automation (ICRA), May 2019, pp. 8248–8254. doi: 10.1109/ICRA.2019.8793742.
[2] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A Survey of Autonomous Driving: Common Practices and Emerging Technologies”, IEEE Access, vol. 8, pp. 58443–58469, Jun. 2020, doi: 10.1109/ACCESS.2020.2983149.
[3] R. Qian, X. Lai, and X. Li, “3D Object Detection for Autonomous Driving: A Survey”, Pattern Recognition, vol. 130, p. 108796, Oct. 2022, doi: 10.1016/j.patcog.2022.108796.
[4] D. Feng et al., “Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges”, IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 3, pp. 1341–1360, Mar. 2021, doi: 10.1109/TITS.2020.2972974.
[5] M. Liang et al., “PnPNet: End-to-End Perception and Prediction With Tracking in the Loop”, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11550–11559. doi: 10.1109/CVPR42600.2020.01157.
[6] W. Luo, B. Yang, and R. Urtasun, “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net”, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 3569–3577. doi: 10.1109/CVPR.2018.00376.
[7] Z. Zhang, J. Gao, J. Mao, Y. Liu, D. Anguelov, and C. Li, “STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction”, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11343–11352. doi: 10.1109/CVPR42600.2020.01136.
[8] W. Zeng et al., “End-To-End Interpretable Neural Motion Planner”, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, pp. 8652–8661. doi: 10.1109/CVPR.2019.00886.
[9] F. Duffhauss and S. A. Baur, “PillarFlowNet: A Real-time Deep Multitask Network for LiDAR-based 3D Object Detection and Scene Flow Estimation”, in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2020, pp. 10734–10741. doi: 10.1109/IROS45743.2020.9341002.
[10] G. P. Meyer et al., “LaserFlow: Efficient and Probabilistic Object Detection and Motion Forecasting”, IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 526–533, Apr. 2021, doi: 10.1109/LRA.2020.3047793.
[11] P. Wu, S. Chen, and D. N. Metaxas, “MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps”, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11382–11392. doi: 10.1109/CVPR42600.2020.01140.
[12] Y. H. Khalil and H. T. Mouftah, “LidNet: Boosting Perception and Motion Prediction from a Sequence of LIDAR Point Clouds for Autonomous Driving”, in GLOBECOM 2022 - 2022 IEEE Global Communications Conference, Dec. 2022, pp. 3533–3538. doi: 10.1109/GLOBECOM48099.2022.10001152.
[13] S. Ye, H. Yao, W. Wang, Y. Fu, and Z. Pan, “SDAPNet: End-to-End Multi-task Simultaneous Detection and Prediction Network”, in 2021 International Joint Conference on Neural Networks (IJCNN), Jul. 2021, pp. 1–8. doi: 10.1109/IJCNN52387.2021.9533290.
[14] S. Casas, W. Luo, and R. Urtasun, “IntentNet: Learning to Predict Intention from Raw Sensor Data”, in 2018 2nd Annual Conference on Robot Learning (CoRL 2018), Jan. 2018, vol. 87, pp. 947–956. [Online]. Available: http://arxiv.org/abs/2101.07907
[15] W. Luo, C. Park, A. Cornman, B. Sapp, and D. Anguelov, “JFP: Joint Future Prediction with Interactive Multi-Agent Modeling for Autonomous Driving”, in 6th Conference on Robot Learning (CoRL 2022), Dec. 2022, pp. 1–11. [Online]. Available: http://arxiv.org/abs/2212.08710
[16] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “SpAGNN: Spatially-Aware Graph Neural Networks for Relational Behavior Forecasting from Sensor Data”, in 2020 IEEE International Conference on Robotics and Automation (ICRA), May 2020, pp. 9491–9497. doi: 10.1109/ICRA40945.2020.9196697.
[17] L. L. Li et al., “End-to-end Contextual Perception and Prediction with Interaction Transformer”, in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2020, pp. 5784–5791. doi: 10.1109/IROS45743.2020.9341392.
[18] A. Mohta, F.-C. Chou, B. C. Becker, C. Vallespi-Gonzalez, and N. Djuric, “Investigating the Effect of Sensor Modalities in Multi-Sensor Detection-Prediction Models”, in Machine Learning for Autonomous Driving Workshop at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Jan. 2021. [Online]. Available: http://arxiv.org/abs/2101.03279
[19] Y. H. Khalil and H. T. Mouftah, “LiCaNext: Incorporating Sequential Range Residuals for Additional Advancement in Joint Perception and Motion Prediction”, IEEE Access, vol. 9, pp. 146244–146255, 2021, doi: 10.1109/ACCESS.2021.3123169.
[20] Y. H. Khalil and H. T. Mouftah, “LiCaNet: Further Enhancement of Joint Perception and Motion Prediction Based on Multi-Modal Fusion”, IEEE Open Journal of Intelligent Transportation Systems, vol. 3, pp. 222–235, 2022, doi: 10.1109/OJITS.2022.3160888.
[21] S. Fadadu et al., “Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving”, in 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Jan. 2022, pp. 3292–3300. doi: 10.1109/WACV51458.2022.00335.
[22] Y. H. Khalil and H. T. Mouftah, “End-to-End Multi-View Fusion for Enhanced Perception and Motion Prediction”, in 2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall), Sep. 2021, pp. 1–6. doi: 10.1109/VTC2021-Fall52928.2021.9625271.
[23] D. J. Yeong, G. Velasco-Hernandez, J. Barry, and J. Walsh, “Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review”, Sensors, vol. 21, no. 6, p. 2140, Mar. 2021, doi: 10.3390/s21062140.
[24] H. Caesar et al., “nuScenes: A Multimodal Dataset for Autonomous Driving”, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11618–11628. doi: 10.1109/CVPR42600.2020.01164.
[25] P. Sun et al., “Scalability in Perception for Autonomous Driving: Waymo Open Dataset”, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 2443–2451. doi: 10.1109/CVPR42600.2020.00252.
[26] V. Santos et al., “ATLASCAR - technologies for a computer assisted driving system on board a common automobile”, in 13th International IEEE Conference on Intelligent Transportation Systems, Sep. 2010, pp. 1421–1427. doi: 10.1109/ITSC.2010.5625031.
[27] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An Open Urban Driving Simulator”, in 1st Conference on Robot Learning (CoRL 2017), Nov. 2017, pp. 1–16. [Online]. Available: http://arxiv.org/abs/1711.03938