Abstract
The spatial heterogeneity and temporal variability of traffic in urban environments make traffic emissions inference challenging. To address this challenge, this study introduces a novel geographical context-based approach utilizing high-resolution taxi GPS data, incorporating multidimensional contextual factors such as road data, points of interest (POI), weather data, and population density. The proposed method can enhance the precision of traffic emissions inference compared to conventional macroscopic estimation techniques. To overcome the issue of missing data in traffic emissions inference from taxi data, three ensemble machine learning algorithms—Random Forest, Gradient Boosting Decision Trees (GBDT), and eXtreme Gradient Boosting (XGBoost)—are employed. These algorithms efficiently handle a substantial volume of taxi GPS data, achieving reduced computational time and model complexity. The proposed framework establishes localized models for each road segment, taking into consideration both geographical and external features that characterize the urban environment. This localized modeling contributes significantly to a more profound understanding of traffic dynamics. A thorough comparative analysis is conducted to assess the performance of the proposed method. Results indicate that incorporating multidimensional urban features is advantageous for traffic speed inference. Among the ensemble learning models, Random Forest outperforms others when dealing with a small missing rate or limited sample size, while XGBoost exhibits superior performance for larger missing rates or substantial sample sizes. Additionally, an analysis of the feature importance in traffic speed highlights that road network features are the most significant factors, followed by temporal characteristics, spatial attributes, POI data, and weather information. 
Finally, leveraging inferred traffic speed and volume information, emissions from large-scale urban road traffic are inferred based on the COPERT model. In contrast to methods relying on complex, multi-source data for emission estimation, our approach utilizes simple and easily accessible data, enabling precise estimation of emissions on a large-scale spatiotemporal basis.
1 Introduction
In recent years, climate change has emerged as a critical global concern, giving rise to a multitude of environmental challenges. These include rising global temperatures, shifting weather patterns, and an increased frequency of extreme weather events, which negatively affect ecosystems, water resources, and human health (Jordan et al., 2018; Yu et al., 2020d). The relentless economic growth has driven rapid urban expansion in many cities. This has led to problems like traffic congestion, environmental degradation, and a significant increase in vehicle-related air pollutants. These issues contribute to the exacerbation of climate change (Brueckner, 2007; Sui et al., 2020). Therefore, it is crucial to estimate and assess urban traffic emissions for the development of sustainable cities (Liu et al., 2019; Zhang et al., 2019b). Advanced GPS technology for positioning and navigation has significantly improved the collection of data across extensive road networks, particularly through GPS-equipped taxis (Castro et al., 2013; Yu et al., 2022). Estimating traffic emissions using taxi data offers advantages such as broader coverage, lower costs, and easy accessibility compared to using loop detectors and cameras (Li et al., 2019a). Leveraging taxi GPS data for traffic emission estimation provides the opportunity for fine-grained measurements, forming a critical foundation for emission control and management.
Fine-grained emission estimation poses several challenges. First, existing city-scale emission estimation models often work at a macroscopic level, relying on factors like fuel consumption and vehicle mileage. This makes it challenging to achieve detailed spatiotemporal emission estimation within urban areas. Second, when using taxi GPS data to estimate emissions on urban road networks, there may be missing data for certain road segments. This leads to gaps in the spatiotemporal coverage of road network emissions. Lastly, the extensive and complex urban data used for inferring traffic emissions are costly and difficult to acquire. As a result, models developed for one city are not easily transferable to other cities, which hinders fine-grained control and management of traffic emissions.
To overcome these challenges, the study introduces methods for estimating traffic emissions and energy consumption, addressing the challenges in both modeling and data aspects of traffic emissions inference. Specifically, the COPERT model is applied at the road segment level to accurately assess traffic emissions in urban area segments. Additionally, it tackles missing data issues in road segment data, ensuring complete spatiotemporal coverage of segment speeds. By utilizing readily available data sources, this research achieves accurate and comprehensive spatiotemporal coverage of urban traffic emissions in a large-scale urban traffic network. The methods are expected to provide more accurate data support for urban energy management and sustainable development, offering a more reliable basis for decision-making and resource allocation.
This paper makes the following key contributions:
(1) Accurate inference of traffic emissions: Unlike traditional macroscopic emission estimation models, this study combines taxi GPS data and urban data with the COPERT model to infer traffic emissions in urban areas at the road segment level, which offers a detailed understanding of specific road segments and their contributions to traffic emissions.
(2) Effective handling of missing data in traffic emission inference: By using machine learning methods and a variety of urban data sources, missing speed information for road segments is inferred, ensuring complete spatiotemporal coverage of road segment speeds and emissions.
(3) Inference of large-scale urban traffic emissions from easily accessible urban data: This study infers large-scale urban traffic emissions by leveraging readily available urban data sources, introducing a transferable method for effective traffic emission inference in diverse cities.
In this study, traffic speeds within a road network are determined through map matching using taxi GPS data. To estimate traffic speeds in the network, the spatial and temporal characteristics of traffic speeds are considered, alongside multidimensional data including road networks, weather, POI, population density, and taxi GPS records. Three ensemble machine learning algorithms (algorithms that combine multiple machine learning models to improve overall predictive performance and accuracy), namely Random Forest, GBDT, and XGBoost, are applied to infer the missing traffic speeds within a large-scale urban network. The performance of these algorithms is evaluated under five scenarios: random missing, varying sample sizes, differing time intervals, continuous missing data, and spatially random missing data. Finally, comprehensive spatiotemporal coverage of traffic emissions is achieved by applying the COPERT model to the inferred traffic speed, the traffic volume, and the vehicle types we set.
The paper is structured as follows: Section 2 describes a literature review of traffic emissions and traffic speed inference. Section 3 outlines the methodology for traffic emissions inference. Section 4 presents a case study on traffic emissions estimation using Chengdu taxi GPS data. Finally, Section 5 concludes the paper and discusses potential future research directions.
2 Related works
2.1 Traffic emission inference
In the realm of analyzing vehicle emissions and urban air pollution, researchers have developed diverse methodologies to evaluate the influence of traffic on air quality and to compute emissions. These approaches can be broadly categorized as top-down and bottom-up methods (Cai et al., 2021).
Top-down emission estimation methods rely on a range of macro-level data sources and input parameters for estimating overall emissions. These methods commonly incorporate data on fuel consumption (Palocz-Andresen et al., 2013), vehicle registration (Hao et al., 2011), and emission factors. The fuel consumption-based approach employs aggregated fuel consumption data, which may include fuel sales records and energy statistics. The vehicle registration and emission factors method involves data on registered vehicles, such as vehicle types, and their associated emission factors, typically derived from laboratory measurements and standards. Additionally, the energy consumption-based approach utilizes energy consumption data from diverse sources, encompassing data on electricity consumption and other energy forms. These top-down emission estimation methods primarily focus on providing a macro-level estimation of emissions for a specific region or area, relying on aggregated data sources. However, they lack the ability to provide detailed and fine-grained emission estimates.
On the other hand, bottom-up methods encompass both macro-level and micro-level traffic emission estimation approaches (Yu et al., 2020a, c; Zhang et al., 2019a). A macro-level model, relying on Vehicle Kilometers Traveled (VKT) data (Fukuda et al., 2013), offers a comprehensive regional or city-level emission perspective. Furthermore, microsimulation software such as IVE and MOVES can simulate individual vehicle behaviors to facilitate precise emission calculations but necessitate extensive input data. MOVES estimates emissions from on-road and off-road mobile sources, taking into account parameters such as vehicle speed, driving cycles, fuel properties, and meteorological data. It models emission processes, including tailpipe emissions, brake wear, tire wear, and operational losses (Liu et al., 2013). IVE operates at the microsimulation level, requiring detailed vehicle motion, speed, and acceleration data (Yao et al., 2006). The CAL3QHC model focuses on pollutant dispersion, such as carbon monoxide, in proximity to roadways (Sun et al., 2020). Inputs encompass traffic data, road geometry, meteorological conditions, and emission factors. These methods provide high-resolution emission data but are more suitable for small-scale traffic studies and planning, enabling consideration of details like speed, acceleration, and route choice, among others, for emission estimation.
Simulation methods often fall short in accurately replicating real-world scenarios, particularly for large-scale urban emissions estimation. Data-driven emission estimation methods offer a more precise calculation approach by utilizing actual data sources, such as GPS data, to attain high-precision emission estimates. GPS-based methods, which track vehicle movements and driving conditions, offer real-time insights into emission patterns at specific locations. Nyhan et al. (2016) employed GPS trajectory data from a vast taxi fleet using a microscopic emissions model, allowing for precise predictions of air pollution emissions in Singapore. An artificial neural network model was effectively utilized to identify taxis with elevated emissions based on remote sensing data (Zeng et al., 2007). A novel vehicle speed profile estimation model was proposed, leveraging license plate recognition (LPR) data and a car-following model to accurately estimate speed profiles, enhancing micro-emission models and emissions calculations (Mo et al., 2017). Nocera et al. (2018) introduced the TANINO model, which effectively addresses carbon emissions from road transport by optimizing traffic flow data and infrastructure considerations.
However, these data-driven traffic emission methods have not yet addressed the issue of missing data for segments within the entire road network. This limitation prevents us from obtaining comprehensive spatiotemporal coverage of traffic speed and emissions across the urban road network, thereby hindering the city’s progress towards sustainable development. With the advancement of urban big data, it has become possible to make precise inferences about emissions using data from multiple sources. Yet, due to the complexity and difficulty in obtaining heterogeneous multi-source data, existing traffic emission models lack transferability and generalization capabilities. To address this issue, this paper proposes a traffic emission inference method that utilizes readily available data from multiple sources and completes missing data, thereby achieving comprehensive spatiotemporal coverage of traffic emissions.
2.2 Traffic speed inference
Traffic speed inference falls into three primary categories: machine learning, interpolation, and statistical methods. Interpolation techniques often struggle to account for daily traffic speed variations or make full use of current-day information. Statistical methods, which assume a probability distribution of traffic speed, may not be suitable for various types of traffic data. In contrast, machine learning methods efficiently adapt to data structures, making them a favorable choice. Therefore, this paper utilizes three machine learning algorithms, namely Random Forest, GBDT, and XGBoost, to infer traffic speed within urban networks. The selection of the three algorithms is based on their ability to handle complex data, provide feature importance assessments, and operate efficiently on large datasets. These attributes make them highly suitable for traffic speed inference, thereby providing accurate inputs for traffic emission models.
When inferring missing values, it’s essential to consider the spatial and temporal dynamics of traffic speed, as traffic conditions evolve within the spatiotemporal domain. Most existing research tends to focus on capturing spatial or temporal traffic features while overlooking road attributes and the urban environment. Few studies incorporate the broader urban context of traffic. Ben Said and Erradi (2022) proposed an enhanced Candecomp Parafac (CP) completion approach that considers urban and temporal aspects using POI datasets. However, this method primarily focuses on urban factors derived from POI, and the grid-based completion approach may not fully capture road characteristics. In large urban traffic networks, inferring traffic speed is more complex compared to individual roads or highway networks due to the stochastic environment.
Hence, it’s of utmost importance to account for the urban traffic environment when inferring traffic speed within urban networks. In this context, we consider not only the spatial and temporal dynamics of traffic speed but also various factors such as road network features, POI, population density, and weather conditions during the inference process. Table 1 offers an overview of research conducted on traffic state inference. Numerous studies have explored speed estimation, with the research scope spanning from individual roads to entire road networks.
3 Methodology
3.1 Methodology framework
The study aims to derive road network speeds from taxi GPS data and employ machine learning algorithms to estimate traffic speeds for missing road segments, thus achieving comprehensive spatiotemporal coverage of traffic speeds and traffic emissions within urban traffic networks.
This study comprises four main parts: map matching of taxi data to the road network, traffic speed inference based on multi-source data, evaluation of traffic speed inference performance in different scenarios, and traffic emission inference. First, map matching is conducted based on taxi trajectory data and road network structure to obtain segment speeds. Subsequently, leveraging multiple data sources such as road network data, temporal features, spatial features, POI data, and weather data, three machine learning algorithms, namely Random Forest, GBDT, and XGBoost, are employed to infer missing road network speeds. The performance of traffic speed inference is compared under five conditions: randomly missing rates, varying sample sizes, different time intervals, continuous missing rates, and spatially random missing. Finally, based on the completely inferred spatiotemporal traffic speeds and the provided information on traffic flow and vehicle types, traffic emissions with full spatiotemporal coverage are calculated. The proposed framework is illustrated in Fig. 1.
3.2 Method for obtaining road segment speed
To obtain the traffic speed of road segments, we first extract the instantaneous speed of vehicles based on Taxi GPS data, match them to the corresponding road segments and finally calculate the speed of the road segments.
Instantaneous speed is derived from GPS data by dividing the distance between two consecutive GPS sampling points by the travel time between them, which gives the taxi’s immediate velocity, as described by the equation:
\[ v = \frac{d}{t_1 - t_0} \tag{1} \]
Here, d denotes the distance traveled between two GPS sampling points, t0 represents the timestamp at the current moment, and t1 refers to the timestamp at the subsequent sampling point.
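The speed calculation described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the haversine distance, the Earth radius constant, and the function names `haversine_km` and `instantaneous_speed_kmh` are assumptions introduced for the example, and the study may measure d along the matched road geometry instead.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumption of this sketch)


def haversine_km(lon0, lat0, lon1, lat1):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi0, phi1 = math.radians(lat0), math.radians(lat1)
    dphi = phi1 - phi0
    dlmb = math.radians(lon1 - lon0)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi0) * math.cos(phi1) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def instantaneous_speed_kmh(p0, p1):
    """Speed between two GPS fixes, each given as (timestamp_s, lon, lat)."""
    t0, lon0, lat0 = p0
    t1, lon1, lat1 = p1
    dt_hours = (t1 - t0) / 3600.0
    if dt_hours <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # distance d divided by elapsed time t1 - t0
    return haversine_km(lon0, lat0, lon1, lat1) / dt_hours
```

With fixes logged every three seconds, as in the data used later in the paper, consecutive records of the same taxi would be passed pairwise to `instantaneous_speed_kmh`.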
To acquire road information for each GPS point, we employ a geometry-based map-matching technique. Road segment speeds are aggregated into 10-min intervals by averaging the instantaneous velocities of all GPS data points on the road segment during the specified time span. The calculation is as follows:
\[ \overline{v} = \frac{1}{n}\sum_{i=1}^{n} v_i \tag{2} \]
Where \(\overline{v}\) represents the average velocity of the road segment in a given interval, \(v_i\) denotes the \(i\)-th individual velocity in that interval, and \(n\) stands for the total number of velocities or data points.
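The 10-min aggregation can be illustrated as follows. This is a sketch under stated assumptions: the record layout `(segment_id, timestamp_s, speed_kmh)` and the function name `segment_mean_speeds` are hypothetical, and map matching is assumed to have already assigned each GPS point to a segment.

```python
from collections import defaultdict

INTERVAL_SECONDS = 600  # 10-min aggregation window, as in the text


def segment_mean_speeds(records, interval=INTERVAL_SECONDS):
    """Average matched speeds per (segment_id, time bin).

    records: iterable of (segment_id, timestamp_s, speed_kmh).
    Returns {(segment_id, bin_index): (mean_speed_kmh, n_points)}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # running [speed total, count]
    for segment_id, ts, speed in records:
        key = (segment_id, int(ts // interval))
        sums[key][0] += speed
        sums[key][1] += 1
    return {key: (total / n, n) for key, (total, n) in sums.items()}
```

Keeping the point count next to each mean makes it straightforward to apply a minimum-sample rule afterwards.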
If no taxis traverse a road segment, it is classified as having missing speeds. The minimum number of taxi samples required for a road segment is determined with the statistical approach of Walpole et al. (1998), as described in Eq. 3:
\[ n = \left(\frac{Z_{\alpha/2}\, S_n}{E}\right)^{2} \tag{3} \]
In this equation, Zα/2 represents the quantile of the standard normal distribution under the desired confidence level, Sn is the standard deviation of speed, and E indicates the allowable relative error. The minimum number of taxi samples is established as 30, with a speed estimation error of 3 km/h at a 95% confidence level. As a result, road segments with fewer than 30 taxi GPS sampling points during the specified period are designated as road segments with missing speed data and will not be utilized for data input.
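A small sketch of this sampling rule is given below. It assumes the standard sample-size formula for estimating a mean, n = (Z·S/E)², rounded up; the function names and the standard deviation passed in the usage note are hypothetical, since the paper does not report the value of Sn it used.

```python
import math
from statistics import NormalDist


def min_sample_size(std_kmh, error_kmh, confidence=0.95):
    """Smallest n such that (Z * S / E)^2 <= n at the given confidence level."""
    # two-sided quantile Z_{alpha/2} of the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_kmh / error_kmh) ** 2)


def is_missing(n_points, threshold=30):
    """Flag a (segment, interval) cell as missing below the sample threshold."""
    return n_points < threshold
```

For example, `min_sample_size(8.0, 3.0)` evaluates the rule for an illustrative speed standard deviation of 8 km/h and an allowable error of 3 km/h, while `is_missing` applies the threshold of 30 points stated in the text.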
3.3 Feature input of model
Traffic speed exhibits spatiotemporal dependencies, and accurate inference must account for these spatiotemporal characteristics. Traffic speed is also influenced by a variety of external factors. Weather conditions such as rainfall reduce vehicle speed. POIs represent different urban functional areas, such as residential, commercial, and transportation facilities; they mark high-traffic areas in urban centers, where traffic speeds are lower during peak hours. Population density, which refers to the number of individuals per unit area, also plays a significant role in traffic congestion, especially in densely populated regions. Therefore, this study incorporates a wide range of data, including road network information, POIs, weather data, and population density, to enhance the accuracy of traffic speed inference. These multidimensional features are summarized in Table 2.
To represent the spatial information of roads, the study area is divided into a 5 × 5 grid, with each grid cell covering an area of 2.35 square kilometers.
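As an illustration of how a location could be turned into such a spatial feature, the sketch below maps a coordinate to a cell index. The bounding box is taken from the study-area description in Section 4.1; the equal-angle cell boundaries, the row-major numbering, and the function name `grid_cell` are assumptions of this example and not necessarily the authors' scheme.

```python
# Study-area bounding box as reported in Section 4.1
LON_MIN, LON_MAX = 104.03, 104.13
LAT_MIN, LAT_MAX = 30.65, 30.73
GRID_SIZE = 5  # 5 x 5 grid


def grid_cell(lon, lat, n=GRID_SIZE):
    """Return a row-major cell index in [0, n*n), or None outside the area."""
    if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
        return None
    # clamp so that points on the upper boundary fall into the last cell
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * n), n - 1)
    row = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * n), n - 1)
    return row * n + col
```

A road segment could then be assigned the cell of its midpoint, although other assignment rules are equally plausible.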
3.4 Machine learning algorithms for traffic speed inference
(1) Random Forest: The Random Forest algorithm trains each decision tree on a bootstrap sample of the data and considers a random subset of features at each split. It makes predictions by averaging (regression) or majority voting (classification) over the trees. Random Forest is highly versatile and robust against overfitting because of the bagging approach, in which each decision tree is trained independently. It excels at handling high-dimensional data, which is crucial for traffic speed inference given the multifaceted nature of traffic data and potential noise. Additionally, Random Forest provides an assessment of feature importance, helping us understand which factors most significantly impact traffic speeds.
(2) GBDT: The GBDT algorithm combines decision trees using a boosting technique. Each new tree is fitted to the residuals of the current ensemble, so the model sequentially corrects the errors made by previous trees, resulting in high predictive accuracy. GBDT handles various types of data well, especially structured and tabular data. By focusing on the residuals of the previous model at each step, GBDT is particularly adept at capturing complex patterns in the data. This capability is essential for traffic speed inference to capture the intricate relationship between urban features and traffic speed.
(3) XGBoost: XGBoost is an advanced machine learning algorithm based on GBDT, designed for parallel computing. Unlike GBDT, XGBoost includes a regularization term in the objective function to control model complexity and prevent overfitting, resulting in robust models that generalize well. XGBoost’s efficient computation is beneficial for handling the large volumes of traffic speed data and GPS data points from urban traffic networks. The objective function is defined as follows:
\[ J(\phi) = \sum_{i} l\left(\widehat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^{2} \]
Where \(J(\phi)\) is the objective function of the XGBoost model; \(\sum_{i} l(\widehat{y}_i, y_i)\) denotes the summation of the loss function \(l\) applied to the predicted values \(\widehat{y}_i\) and actual values \(y_i\), assessing the errors in the model’s predictions; and \(\sum_{k}\Omega(f_k)\) represents the regularization term \(\Omega\) summed over each tree \(f_k\), controlling the complexity of the model. In the regularization term \(\Omega(f)\), \(\gamma T\) is the product of the number of leaf nodes \(T\) and a regularization parameter \(\gamma\), and \(\frac{1}{2}\lambda \|w\|^{2}\) is an \(\ell_2\)-norm regularization term on the leaf weights \(w\), where \(\lambda\) is the regularization parameter and \(\|w\|^{2}\) is the squared norm of the weights.
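To make the role of the two terms concrete, the regularized objective can be evaluated by hand for a toy ensemble. The sketch below is illustrative only: it assumes a squared-error loss for \(l\) (the paper does not state which loss was used), represents each tree solely by its list of leaf weights, and uses hypothetical function names rather than any XGBoost API.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a single tree."""
    n_leaves = len(leaf_weights)  # T, the number of leaf nodes
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(y_true, y_pred, trees, gamma=1.0, lam=1.0):
    """J = sum_i l(y_hat_i, y_i) + sum_k Omega(f_k).

    Assumes squared-error loss; `trees` is a list of leaf-weight lists.
    """
    loss = sum((p - y) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(tree_penalty(w, gamma, lam) for w in trees)
```

Increasing `gamma` penalizes trees with many leaves, while increasing `lam` shrinks large leaf weights, which is how the objective trades prediction error against model complexity.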
Three evaluation metrics were selected to assess the model’s performance, namely Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), and R-squared. MAPE measures the average of the absolute percentage differences between predicted and actual values. It indicates the accuracy of a model by expressing the size of the error as a percentage. RMSE is the square root of the mean of the squared differences between predicted and actual values, and it measures the average magnitude of the error. R-squared, or the coefficient of determination, quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables (features). It reflects the goodness of fit of the model. The formulas are as follows.
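Mean Absolute Percentage Error (MAPE):
\[ \mathrm{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{y_i - f(x_i)}{y_i}\right| \]
Root Mean Squared Error (RMSE):
\[ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - f(x_i)\right)^{2}} \]
Here, \(m\) represents the total number of missing items, \(y_i\) is the actual value of the \(i\)-th missing item, and \(f(x_i)\) is the imputed value of the \(i\)-th missing item.
R-squared (R2):
\[ R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \overline{y}\right)^{2}} \]
Where \(y_i\) represents the observed values, \(\widehat{y}_i\) represents the predicted values, \(\overline{y}\) is the mean of the observed values, and \(n\) is the number of observations.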
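The three metrics follow directly from their definitions and can be computed without any external library. The function names below are chosen for this sketch; the implementations use the standard textbook forms of MAPE, RMSE, and R-squared.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (undefined if any y is 0)."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root of the mean squared difference between actual and imputed values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the share of variance explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

Because MAPE divides by the observed speed, it weights errors on slow segments more heavily than RMSE does, which is one reason the two criteria can rank algorithms differently.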
3.5 Traffic emissions inference
Traffic emissions are estimated using the COPERT model in conjunction with traffic speed, traffic volume, and vehicle types. Traffic speed is inferred with the method described above, while traffic flow is set based on road types and taxi GPS data. Vehicle types are categorized into heavy-duty trucks, light-duty trucks, and passenger cars, and the traffic volume of each vehicle type is calculated from the total volume according to each type’s share of vehicle ownership. Subsequently, the COPERT model is utilized to calculate emissions across the road network.
The COPERT model estimates the emissions \(E_{i,j}\) of type \(j\) (PM2.5, NOx, and FC) for trip \(i\) using emission factors, as follows:
\[ E_{i,j} = F_{i,j}\, l_i \]
Where \(E_{i,j}\) is the emissions of type \(j\) of trip \(i\) (unit: g), \(F_{i,j}\) is the hot emission factor of \(j\) of trip \(i\) (unit: g/km), and \(l_i\) is the length of trip \(i\) (unit: km).
The emission factor of \(j\) (PM2.5, NOx, and FC) is calculated by:
Where \(v_i\) is the average speed of the vehicle on trip \(i\) (unit: km/h). Emission factors are speed-dependent and expressed in g/km. Additionally, they are determined by parameters \(\alpha_j\), \(\beta_j\), \(\gamma_j\), \(\varepsilon_j\), \(\zeta_j\), and \(\eta_j\), which are experimentally obtained according to fuel, vehicle class, and engine technology of vehicles.
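The way speed, volume, and vehicle type combine in the emission calculation can be sketched as below. Several points are assumptions of this example rather than statements from the paper: the speed-dependent emission factor is passed in as a caller-supplied function because its exact functional form and coefficients are not reproduced here, segment-level emissions are obtained by summing count × factor × length over vehicle types, and all function names are hypothetical.

```python
def split_volume(total_volume, ownership_share):
    """Split a segment's total volume across vehicle types by ownership share."""
    total_share = sum(ownership_share.values())
    return {vt: total_volume * s / total_share for vt, s in ownership_share.items()}


def trip_emission_g(factor_g_per_km, length_km):
    """E = F * l: emission factor (g/km) times travelled length (km)."""
    return factor_g_per_km * length_km


def segment_emission_g(speed_kmh, length_km, volume_by_type, factor_by_type):
    """Sum emissions over vehicle types for one segment and time interval.

    factor_by_type maps a vehicle type to a function speed_kmh -> g/km.
    These functions are placeholders supplied by the caller; they stand in
    for the speed-dependent emission factors of the chosen pollutant.
    """
    total = 0.0
    for vehicle_type, count in volume_by_type.items():
        factor = factor_by_type[vehicle_type](speed_kmh)
        total += count * trip_emission_g(factor, length_km)
    return total
```

In use, `factor_by_type` would hold one function per vehicle class for a given pollutant, and the call would be repeated for every segment and interval with the observed or inferred speed.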
4 Results
4.1 Data and study area
The taxi GPS data, procured from DiDi Chuxing, spans 2 months from October 1st, 2018, to November 30th, 2018, and encompasses approximately one-quarter of Chengdu, China’s urban area, ranging from 104.03°E to 104.13°E in longitude and 30.65°N to 30.73°N in latitude. GPS data was logged at three-second intervals, with each entry containing a real-time timestamp, driver ID, longitude, and latitude, culminating in a total of approximately 2,319.8 million GPS records.
Figure 2a delineates the temporal fluctuations in the mean quantity of Taxi GPS records on weekdays and weekends, revealing over 2 million data records per hour during daytime hours and variations between weekdays and weekends. From 7:00 to 10:00, the hourly Taxi GPS records on weekdays surpass those on weekends, whereas from 11:00 to 22:00, the hourly GPS records on weekends exceed those on weekdays. Figure 2b illustrates the spatial distribution of GPS records from 08:00 to 08:10 on November 1, 2018. The density of GPS records appeared to be higher in the city center and on roads with higher grades.
Ultimately, 5,301,069 valid data points were obtained. Figure 2c illustrates the temporal missing rates of road segments. The missing rate is comparatively high from 0:00 to 6:00 and low from 7:00 to 21:00, and it rises again at 23:00.
Figure 2d displays the hourly average speed of primary roads over the 2 months. The low-speed period on weekends appears at 10:00, 2 h later than on weekdays. Figure 2e shows a heat map of traffic speed over the 2 months at 10-min intervals. Visually, the traffic pattern during the National holiday resembles that of weekends.
4.2 Feature importance
Feature importance for traffic speed is also analyzed. The features can be grouped into road network, temporal, spatial, POI-related, and weather features. Road network attributes take the highest precedence, with road type being the most significant feature, as shown in Fig. 3. It is followed by road length, direction (one-way or two-way), road identifier, and the presence of overhead bridges. Among temporal characteristics, the hour of the day is the most important, followed by the date, minute, weekday, month, and holiday. Among POI features, restaurant-related points of interest are the most important, followed by life services, transportation facilities, educational facilities, companies, shopping, residence, and accommodation. Meteorological conditions also play a role, with gale having the most influence, followed by breeze, better weather, good weather, and rainfall.
The Random Forest algorithm’s performance improves as the number of features increases, as shown in Fig. 4. Optimal performance is achieved with a feature dimension of 52, which indicates that all features contribute to the algorithm. Performance is poor when only one feature, such as road type, is selected. However, the score improves substantially when a combination of features is considered: road type, taxi GPS data, road length, direction (one-way or two-way), hour, road identifier, presence of overhead bridges, and POI restaurants. This combination, which captures both road and spatiotemporal attributes, achieves an R2 of 0.87.
4.3 Inference of traffic speed
We consider five distinct scenarios to assess the performance of the three machine learning algorithms:
- Randomly Missing Data: In this scenario, data points are missing randomly and independently of each other. These missing data points are isolated and scattered throughout the dataset.
- Varying Sample Sizes: The datasets are divided into different sample sizes, including 1 day, 1 week, 2 weeks, 1 month, and 2 months. This allows for assessing how the algorithms perform under varying data availability.
- Differing Time Intervals: Traffic speeds are aggregated into different time intervals, such as 10-min, 20-min, 30-min, and 1-h intervals. This helps us understand how the algorithms handle data with varying temporal resolutions.
- Continuous Missing Data: In this scenario, data is continuously missing at a range of rates, from 5 to 50%, over a time span of 1 day. For instance, a 50% continuous missing rate means that within 24 h of data records, there will be 12 h of missing data points. This scenario is replicated five times to reduce the influence of any single missing pattern on the measured algorithm performance.
- Spatially Random Missing: In this scenario, data is randomly missing across different spatial positions in the dataset. The spatially random missing rate varies from 5 to 50%, and the time frame is 1 day. For example, a 25% spatially random missing rate means that 25% of the road segments within the road network will be entirely missing, and the spatial distribution of the missing data is random. This experiment is conducted ten times to assess algorithm performance under varying spatial data availability.
These scenarios are designed to comprehensively evaluate how the three machine learning algorithms perform under different data-missing conditions and help determine their robustness and effectiveness in handling real-world data with missing values.
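The three masking patterns among these scenarios (random, continuous, and spatially random missing) can be generated as in the following sketch. It is an illustration under stated assumptions: observations are keyed by `(segment, time_bin)`, the continuous pattern removes one contiguous block per segment with a random start, and the function names are hypothetical. The paper does not specify these details.

```python
import random


def random_missing(keys, rate, rng):
    """Mask individual (segment, bin) observations independently at random."""
    keys = list(keys)
    return set(rng.sample(keys, round(rate * len(keys))))


def continuous_missing(segments, n_bins, rate, rng):
    """Mask one contiguous block of time bins per segment."""
    block = round(rate * n_bins)
    masked = set()
    for seg in segments:
        start = rng.randrange(n_bins - block + 1)  # random block start
        masked.update((seg, b) for b in range(start, start + block))
    return masked


def spatial_missing(segments, n_bins, rate, rng):
    """Mask every time bin of a randomly chosen subset of segments."""
    segments = list(segments)
    dropped = rng.sample(segments, round(rate * len(segments)))
    return {(seg, b) for seg in dropped for b in range(n_bins)}
```

Passing a seeded `random.Random` instance as `rng` makes each replication reproducible, which is convenient when a scenario is repeated several times and the results are averaged.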
In the context of traffic speed inference, Random Forest, Gradient Boosting Decision Trees (GBDT), and XGBoost have been employed as machine learning algorithms, and their performance has been evaluated across various scenarios, encompassing randomly missing data, varying sample sizes, different time intervals, continuous missing data, and spatially random missing data. These assessments underscore the effectiveness of the proposed methodology and the generalization capabilities of these machine learning algorithms, as shown in Fig. 5.
-
(1)
Random Missing Data: As the rate of randomly missing data increases, the performance of the three models experiences a modest decline, but it sustains a certain level of precision even under high missing rates. This demonstrates the resilience of these algorithms in the face of varying random missing rates.
-
(2)
Varying Sample Sizes: Within the analysis time frame spanning from 1 day to 1 month, the inference errors of the three algorithms decrease as the time range expands and the sample size increases. However, in 2-month scenarios, the performance with the most extensive sample size is found to be inferior to that of the 2-week and 1-month scenarios. This discrepancy may be attributed to subtle disparities in traffic speed between months and the inclusion of the month feature during feature selection.
(3) Differing Time Intervals: Algorithm performance improves as the time interval broadens, with the model exhibiting optimal performance when the time interval is 1 h. This suggests that the model is more effective with data containing larger traffic speed time intervals. Notably, machine learning algorithms demonstrate superior performance with smaller data inputs but more extensive time intervals of traffic speed. This is possibly due to a larger number of data points used in traffic speed calculations when the interval is 1 h, resulting in reduced noise and an increased inferred dimension, ultimately leading to enhanced performance.
(4) Continuous Missing and Spatial Random Missing Data: With increasing continuous missing rates and spatial random missing rates, the algorithm’s performance experiences a slight decline. Nevertheless, it maintains a certain degree of precision, indicating its effectiveness under varying continuous missing and spatial random missing rates.
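The time-interval observation in item (3) can be made concrete with a small sketch: a wider interval pools more GPS points into each speed value, so each average is less noisy. The record layout, the 10-minute comparison interval, and the sample readings are assumptions for illustration only.

```python
from collections import defaultdict


def mean_speed_by_interval(records, interval_minutes):
    """Average point speeds within fixed-width time buckets.

    records: iterable of (minute_of_day, speed_kmh) pairs for one road segment.
    Returns a dict mapping bucket index -> mean speed in that bucket.
    """
    buckets = defaultdict(list)
    for minute, speed in records:
        buckets[minute // interval_minutes].append(speed)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


# Hypothetical GPS-derived speeds on one segment during the first hour.
records = [(0, 30.0), (12, 34.0), (25, 26.0), (40, 38.0), (55, 32.0)]

ten_min = mean_speed_by_interval(records, 10)  # one point per non-empty bucket
hourly = mean_speed_by_interval(records, 60)   # all five points in one bucket
```

In this toy case each 10-minute value rests on a single reading and one bucket is empty, whereas the hourly value averages all five readings.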
In summary, these observations affirm the robustness and reliability of Random Forest, GBDT, and XGBoost as machine learning algorithms for traffic speed inference. They demonstrate their ability to provide accurate results across a variety of data-missing scenarios, including random missing data, different sample sizes, time intervals, and spatial distributions.
4.4 Best algorithm across five scenarios
The best-performing algorithm in each of the five scenarios, evaluated by Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE), is summarized as follows:
- Randomly Missing Data: In the context of randomly missing data, Random Forest exhibits superior performance when the missing rate ranges from 10 to 50% under the MAPE criterion, while XGBoost excels when the missing rate is between 60 and 90%. For RMSE, Random Forest demonstrates optimal performance when the missing rate is between 10 and 40%, whereas XGBoost outperforms at missing rates spanning from 50 to 90%, as shown in Table 3.
- Varying Sample Sizes: For varying sample sizes, under the MAPE criterion, Random Forest proves optimal for sample sizes of 1 day, 1 month, and 2 months, while XGBoost excels in the 1-week and 2-week sample size categories. Under the RMSE criterion, Random Forest is the superior choice when the sample size is limited to 1 day, while XGBoost dominates in the other sample size categories, as shown in Table 4.
- Differing Traffic Speed Intervals: Regarding differing traffic speed intervals, Random Forest demonstrates the best performance under the MAPE criterion, whereas XGBoost prevails under the RMSE criterion, as shown in Table 5.
- Continuous Missing Data: In the continuous missing data pattern, under the MAPE criterion, Random Forest excels at missing rates spanning from 5 to 35%, while XGBoost is optimal when the missing rate ranges from 40 to 50%. Under the RMSE criterion, Random Forest achieves superior results when the missing rate lies between 5 and 30%, with XGBoost outperforming when the missing rate is within the 35 to 50% range, as shown in Table 6.
- Spatially Random Missing Data: In the spatially random missing data pattern, under the MAPE criterion, Random Forest outperforms when the missing rate ranges from 5 to 35%, while XGBoost is superior when the missing rate lies between 40 and 50%. Under the RMSE criterion, Random Forest dominates when the missing rate is 5% or 15%, with XGBoost prevailing in the remaining missing rate categories, as shown in Table 6.
In conclusion, the performance of Random Forest is superior when confronted with small missing rates, small sample sizes, and larger inference values. Conversely, XGBoost proves superior when faced with large missing rates and sizable sample sizes. These findings provide valuable guidance for selecting the most suitable algorithm based on specific data characteristics and missing data scenarios.
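Because the ranking above depends on whether MAPE or RMSE is used, a short sketch of both criteria may help. The model names `model_a` and `model_b` and all numbers below are hypothetical; they are not results from the study and only show how the two criteria can disagree.

```python
import math


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def best_by_metric(actual, predictions):
    """Return the lowest-error model name per metric, plus all scores."""
    scores = {
        name: {"MAPE": mape(actual, pred), "RMSE": rmse(actual, pred)}
        for name, pred in predictions.items()
    }
    best = {m: min(scores, key=lambda n: scores[n][m]) for m in ("MAPE", "RMSE")}
    return best, scores


# Made-up speeds (km/h): one slow and one fast observation.
actual = [10.0, 50.0]
predictions = {
    "model_a": [10.0, 44.0],  # exact on the slow value, 6 km/h off on the fast one
    "model_b": [13.0, 52.0],  # 3 km/h off on the slow value, 2 km/h off on the fast one
}
best, scores = best_by_metric(actual, predictions)
```

MAPE weights each error relative to the true speed, so an error on a slow segment counts heavily, while RMSE penalizes large absolute errors regardless of the speed level. This is one reason a model can lead under one criterion and trail under the other.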
The results of spatio-temporal traffic speed inference using three machine learning algorithms are presented in Fig. 6. In the temporal diagram, we focus on the traffic speed of a specific road segment connecting the First Ring Road and the Second Ring Road on October 10th, with a randomly missing rate of 30% applied. The inferred traffic speed is consistent over time and shows no significant outliers, indicating that the proposed approach captures temporal patterns and trends effectively. The accompanying box plot likewise displays no extreme outliers.
In the spatial diagram, Fig. 7, we specifically visualize the traffic speed on October 10th at 18:00 within the road network. The traffic speed inference maintains spatial continuity, further affirming the efficacy of the suggested method in accurately capturing spatial patterns and providing reliable insights into traffic speed variations.
4.5 Inference of traffic emissions
After the complete spatiotemporal traffic speed was obtained, we applied it to estimate the traffic emissions, which is the basis for sustainable urban development and urban carbon reduction. The COPERT model calculates traffic emissions, and we obtain fine-grained traffic emissions data for urban networks using the inferred traffic speed, traffic volume, and road types we set from taxi GPS data. Figure 8 illustrates PM2.5, NOx emissions, and fuel consumption at 18:00 on October 10th. It is evident that roads with high grades exhibit higher emissions, while areas in the city center tend to experience elevated emissions due to heavy traffic.
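The step from inferred speed and volume to segment-level emissions can be sketched as a speed-dependent emission factor multiplied by traffic volume and segment length. The sketch below assumes that general structure only; `toy_emission_factor` is a hypothetical placeholder curve, not a COPERT coefficient set, and the segment values are invented.

```python
def link_emission(speed_kmh, volume_veh, length_km, emission_factor):
    """Emission in grams for one road segment and one time slice.

    emission_factor: callable mapping speed (km/h) to grams per vehicle-km.
    """
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return emission_factor(speed_kmh) * volume_veh * length_km


def toy_emission_factor(speed_kmh):
    """HYPOTHETICAL U-shaped g/km curve for illustration, not COPERT values."""
    return 0.5 + 20.0 / speed_kmh + 0.0001 * speed_kmh ** 2


# Two invented segments with the same volume and length but different speeds.
segments = [
    {"id": "congested", "speed": 20.0, "volume": 100, "length": 0.5},
    {"id": "free_flow", "speed": 50.0, "volume": 100, "length": 0.5},
]
emissions = {
    s["id"]: link_emission(s["speed"], s["volume"], s["length"], toy_emission_factor)
    for s in segments
}
```

Passing the emission factor as a callable keeps the aggregation logic separate from the pollutant-specific and vehicle-class-specific curves, which would be supplied by the emission model in practice.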
Furthermore, an analysis of the spatial variation in emissions at different times of the day was conducted. As shown in Fig. 9, the difference in emissions between 18:00 and 22:00 was calculated. For most roads, emissions were higher at 18:00, especially for roads located in the city center. During the evening rush hour at 18:00, traffic congestion and increased traffic flow lead to higher emissions. Conversely, emissions on certain roads increased by 22:00, particularly for roads with higher road grades. This may be attributed to the increased nighttime heavy truck traffic volume on high-grade roads, where vehicles maintain higher speeds. Elevated speeds result in greater fuel consumption and, consequently, higher emissions.
5 Conclusions
Traffic emission estimation plays a pivotal role in intelligent transportation systems, forming the basis for traffic information dissemination, traffic management and control, and safety management. Utilizing taxi-based traffic emission inference offers several advantages in terms of coverage, cost-effectiveness, and accessibility. However, challenges arise when specific road segments lack traffic data due to infrequent taxi data.
To address this issue and attain comprehensive spatiotemporal coverage of traffic emissions in urban networks, we employ three machine learning algorithms, namely, Random Forest, GBDT, and XGBoost, to infer missing traffic speed data, which is the basis for traffic emissions estimation. By using taxi GPS information and multi-dimensional data, traffic speed inference was conducted through a comparative study of these algorithms across five types of missing speed scenarios. In contrast to conventional traffic emission estimation methods, this approach enables a more precise assessment of urban road network emissions. Unlike methods that rely on complex, multi-source data for emission estimation, our approach leverages simple and readily available data to achieve precise emission estimation on a large-scale spatiotemporal basis. This approach demonstrates strong generalizability and can be easily adapted for use in other cities.
Our proposed approach takes into account the intricate urban and transportation environment while simplifying model complexity and computational time, all while maintaining high inference accuracy. Experimental results highlight the effectiveness of considering multi-dimensional data, leading to improved accuracy compared to models with fewer traffic speed features. Among the features considered for traffic speed inference, road network features are found to be the most significant, followed by time, spatial, and POI features, with weather data playing a less crucial role. We also assess the performance of Random Forest with various feature combinations and observe that as the number of features increases, algorithm performance improves, underscoring the enhancement achieved by incorporating road and urban environment features. Furthermore, the three algorithms demonstrate high accuracy in inferring missing values across various scenarios, including random missing data, varying time intervals, and continuous missing data, underscoring the effectiveness of our proposed approach. Notably, Random Forest excels in scenarios with small missing rates or sample sizes and large inference values, highlighting its robustness when dealing with sparse traffic data. This may be due to Random Forest’s approach of performing bootstrap sampling for each tree, making it less sensitive to minor data gaps. Additionally, by averaging the outcomes of multiple trees, it achieves more stable and accurate results, particularly when inferring traffic speeds with larger values. In contrast, XGBoost outperforms in cases with higher missing rates or larger sample sizes. This indicates the robustness of XGBoost when dealing with a large amount of missing traffic data. This could be attributed to XGBoost’s parallel processing capabilities and its gradient boosting method, which lead to faster convergence and superior performance when handling extensive datasets with substantial missing traffic data. Finally, utilizing inferred traffic speed and traffic volume information, emissions from large-scale urban road traffic are inferred based on the COPERT model.
The research indicates that the effectiveness of traffic speed inference algorithms changes with different data loss scenarios, necessitating tailored algorithm selection for accurate traffic and emission predictions. Random data loss might stem from sensor issues or transmission errors, while varying sample sizes and time intervals reflect monitoring duration and precision needs. Continuous and spatially random data loss could be due to sensor failures or distribution challenges. The study evaluates algorithm performance and suggests optimal approaches for precise emission inference, and also analyzes emission variances between peak and off-peak times and across roads. The research demonstrates that the traffic speed inference framework constructed in this paper, using readily available data, achieves high-precision traffic speed inference at the large-scale road network level. To adapt this model to other cities with different traffic characteristics, the model parameters can be calibrated based on the specific traffic patterns, vehicle composition, and emission factors of those cities, allowing the method to be transferred and applied in other urban contexts.
Policy recommendations include improving the traffic carbon emission monitoring system and control policies, establishing a comprehensive management method for road emissions, and improving the carbon emission calculation method. The use of big data for emission inference and the development of a monitoring platform are encouraged. The promotion of clean energy, particularly in heavy-duty trucks and high-displacement vehicles, is emphasized to reduce reliance on traditional fuels.
In future research, it is advisable to explore various data types, such as loop detectors, cameras, or other traffic mode data, like bus GPS data, to infer traffic emissions. Additionally, the investigation of alternative machine learning methods capable of efficiently handling large-scale data with high inference accuracy can lead to enhanced application outcomes.
Availability of data and materials
The datasets generated during the current study are available from the corresponding author on reasonable request.
References
Bae, B., Kim, H., Lim, H., Liu, Y., Han, L. D., & Freeze, P. B. (2018). Missing data imputation for traffic flow speed using spatio-temporal cokriging. Transportation Research Part C: Emerging Technologies, 88, 124–139.
Ben Said, A., & Erradi, A. (2022). Spatiotemporal tensor completion for improved urban traffic imputation. IEEE Transactions on Intelligent Transportation Systems, 23(7), 6836–6849.
Brueckner, J. K. (2007). Urban growth boundaries: An effective second-best remedy for unpriced traffic congestion? Journal of Housing Economics, 16(3), 263–273.
Cai, B., Zhang, L., Xia, C., Yang, L., Liu, H., Jiang, L., Cao, L., Lei, Y., Yan, G., & Wang, J. (2021). A new model for China’s CO2 emission pathway using the top-down and bottom-up approaches. Chinese Journal of Population, Resources and Environment, 19(4), 291–294.
Castro, P. S., Zhang, D., Chen, C., Li, S., & Pan, G. (2013). From taxi GPS traces to social and community dynamics: A survey. ACM Computing Surveys, 46(2), 1–34.
Chang, G., Zhang, Y., & Yao, D. (2012). Missing data imputation for traffic flow based on improved local least squares. Tsinghua Science and Technology, 17(3), 304–309.
Duan, Y., Lv, Y., Liu, Y.-L., & Wang, F.-Y. (2016). An efficient realization of deep learning for traffic data imputation. Transportation Research Part C: Emerging Technologies, 72, 168–181.
Fukuda, A., Satiennam, T., Ito, H., Imura, D., & Kedsadayurat, S. (2013). Study on estimation of VKT and fuel consumption in Khon Kaen City, Thailand. Journal of the Eastern Asia Society for Transportation Studies, 10, 113–130.
Hao, H., Wang, H., & Ouyang, M. (2011). Fuel conservation and GHG (greenhouse gas) emissions mitigation scenarios for China’s passenger vehicle fleet. Energy, 36(11), 6520–6528.
Jordan, A., Huitema, D., & Forster, J. (2018). Governing climate change: Polycentricity in action? Cambridge University Press.
Li, T., Wu, J., Dang, A., Liao, L., & Xu, M. (2019a). Emission pattern mining based on taxi trajectory data in Beijing. Journal of Cleaner Production, 206, 688–700.
Li, L., Zhang, J., Wang, Y., & Ran, B. (2019b). Missing value imputation for traffic-related time series data based on a multi-view learning method. IEEE Transactions on Intelligent Transportation Systems, 20(8), 2933–2943.
Li, H., Li, M., Lin, X., He, F., & Wang, Y. (2020). A spatiotemporal approach for traffic data imputation with complicated missing patterns. Transportation Research Part C: Emerging Technologies, 119, 102730.
Liu, H., Chen, X., Wang, Y., & Han, S. (2013). Vehicle emission and near-road air quality modeling for Shanghai, China: Based on global positioning system data from taxis and revised MOVES emission inventory. Transportation Research Record, 2340(1), 38–48.
Liu, J., Han, K., Chen, X. M., & Ong, G. P. (2019). Spatial-temporal inference of urban traffic emissions based on taxi trajectories and multi-source urban data. Transportation Research Part C: Emerging Technologies, 106, 145–165.
Mo, B., Li, R., & Zhan, X. (2017). Speed profile estimation using license plate recognition data. Transportation Research Part C: Emerging Technologies, 82, 358–378.
Ni, D., & Leonard, J., II. (2005). Markov chain Monte Carlo multiple imputation using Bayesian networks for incomplete intelligent transportation systems data. Transportation Research Record, 1935(1), 57–67.
Nocera, S., Ruiz-Alarcón-Quintero, C., & Cavallaro, F. (2018). Assessing carbon emissions from road transport through traffic flow estimators. Transportation Research Part C: Emerging Technologies, 95, 125–148.
Nyhan, M., Sobolevsky, S., Kang, C., Robinson, P., Corti, A., Szell, M., Streets, D., Lu, Z., Britter, R., Barrett, S. R. H., & Ratti, C. (2016). Predicting vehicular emissions in high spatial resolution using pervasively measured transportation data and microscopic emissions model. Atmospheric Environment, 140, 352–363.
Palocz-Andresen, M. (2013). Decreasing fuel consumption and exhaust gas emissions in transportation: Sensing, control and reduction of emissions (1st ed., Vol. 14). Springer.
Qu, L., Hu, J., Li, L., & Zhang, Y. (2009). PPCA-based missing data imputation for traffic flow volume: A systematical approach. IEEE Transactions on Intelligent Transportation Systems, 10(3), 512–522.
Sui, Y., Zhang, H., Shang, W., Sun, R., Wang, C., Ji, J., Song, X., & Shao, F. (2020). Mining urban sustainable performance: Spatio-temporal emission potential changes of urban transit buses in post-COVID-19 future. Applied Energy, 280, 115966.
Sun, D., Yin, Z., & Cao, P. (2020). An improved CAL3QHC model and the application in vehicle emission mitigation schemes for urban signalized intersections. Building and Environment, 183, 107213.
Tak, S., Woo, S., & Yeo, H. (2016). Data-driven imputation method for traffic data in sectional units of road links. IEEE Transactions on Intelligent Transportation Systems, 17(6), 1762–1771.
Walpole, R. E., Myers, R. H., & Myers, S. L. (1998). Probability and statistics for engineers and scientists (6th ed.). Prentice Hall.
Yang, H., Yang, J., Han, L. D., Liu, X., Pu, L., Chin, S.-M., & Hwang, H.-L. (2018). A kriging based spatiotemporal approach for traffic volume data imputation. PLoS One, 13(4), e0195957.
Yang, B., Kang, Y., Yuan, Y., Li, H., & Wang, F. (2022). St-fvgan: Filling series traffic missing values with generative adversarial network. Transportation Letters, 14(4), 407–415.
Yao, Z.-L., He, K.-B., Wang, Q.-D., Huo, H., Liu, H., He, C.-Y., & James, L. (2006). Application study of IVE vehicle emission model. Huanjing Kexue, 27(10), 1928–1933.
Yu, Q., Li, W., Yang, D., & Xie, Y. (2020a). Policy zoning for efficient land utilization based on spatio-temporal integration between the bicycle-sharing service and the metro transit. Sustainability, 13(1), 141.
Yu, J., Stettler, M. E. J., Angeloudis, P., Hu, S., & Chen, X. (2020b). Urban network-wide traffic speed estimation with massive ride-sourcing GPS traces. Transportation Research Part C: Emerging Technologies, 112, 136–152.
Yu, Q., Zhang, H., Li, W., Song, X., & Shibasaki, R. (2020c). Mobile phone GPS data in urban customized bus: Dynamic line design and emission reduction potentials analysis. Journal of Cleaner Production, 272, 122471.
Yu, Q., Zhang, H., Li, W., Sui, Y., Song, X., Yang, D., Shibasaki, R., & Jiang, W. (2020d). Mobile phone data in urban bicycle-sharing: Market-oriented sub-area division and spatial analysis on emission reduction potentials. Journal of Cleaner Production, 254, 119974.
Yu, Q., Li, W., Zhang, H., & Chen, J. (2022). GPS data in taxi-sharing system: Analysis of potential demand and assessment of fuel consumption based on routing probability model. Applied Energy, 314, 118923.
Zeng, J., Guo, H.-F., & Hu, Y.-M. (2007). Artificial neural network model for identifying taxi gross emitter from remote sensing data of vehicle emission. Journal of Environmental Sciences, 19(4), 427–431.
Zhang, H., Li, R., Chen, B., Lin, H., Zhang, Q., Liu, M., Chen, L., & Wang, X. (2019a). Evolution of the life cycle primary PM2.5 emissions in globalized production systems. Environment International, 131, 104996.
Zhang, H., Song, X., Long, Y., Xia, T., Fang, K., Zheng, J., Huang, D., Shibasaki, R., & Liang, Y. (2019b). Mobile phone GPS data in urban bicycle-sharing: Layout optimization and emissions reduction analysis. Applied Energy, 242, 138–147.
Acknowledgements
An earlier version of this paper was presented at the 25th International Conference of Hong Kong Society for Transportation Studies (http://www.hksts.org/conf20l.pdf). The authors express their gratitude to DiDi Chuxing for providing sample data.
Funding
This work is supported by Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University (Grant No. K202301).
Author information
Contributions
Jiaxing Li: Conceptualization, Methodology, Software, Writing - Original Draft, Data Curation, Visualization. Chaozhe Jiang: Writing - Review & Editing, Conceptualization. Ke Han: Writing - Review & Editing, Conceptualization, Supervision. Qing Yu: Writing - Review & Editing, Conceptualization, Supervision, Funding acquisition. Haoran Zhang: Writing - Review & Editing, Conceptualization, Supervision, Funding acquisition.
Ethics declarations
Competing interests
There are no conflicts of interest to declare.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, J., Jiang, C., Han, K. et al. High-resolution spatiotemporal inference of urban road traffic emissions using taxi GPS and multi-source urban features data: a case study in Chengdu, China. Urban Info 3, 17 (2024). https://doi.org/10.1007/s44212-024-00045-9