URBAN TRAFFIC PREDICTION
Introduction
Urban traffic prediction is an essential component of transportation systems worldwide. It matters because it links the safety of the people who run their businesses in a city to the transportation system they depend on. It also allows researchers to develop statistical models that explain the spatial correlations exhibited by transportation systems (Qin, 2013). By implication, traffic prediction can help explain the geographical factors that influence economic activity. Recent technologies make it far easier to collect traffic data, which in turn supports modelling, analysis and statistical inference. However, predicting urban traffic is tedious and complex, so the researcher must clean the data before carrying out any further analysis.
Exploratory Data Analysis of the Urban Traffic
Exploratory data analysis (EDA) is one of the basic analytical techniques applied in research. Its main purpose is to characterize the distribution of the underlying data set so that sound inferences can be drawn from it. A typical exploratory technique is to examine the mean, standard deviation, variance, kurtosis and skewness of a data set (Qin, 2013). Exploring the chosen data in this way is also a prerequisite for developing a good predictive model. For the current case study on urban traffic prediction, the first steps were to locate the data through Google searches and to clean it. To support a sound methodological approach, the data was extracted from the census bureau website, which was appropriate because it holds the traffic control variables recorded for the United States. The following is a preview of the GPS data collected to track traffic in major states of the United States.
rideable_type | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
docked_bike | 5/27/2020 10:16 | Franklin St & Jackson Blvd | 36 | Wabash Ave & Grand Ave | 199 | 41.8777 | -87.6353 | 41.8915 | -87.6268 | member |
docked_bike | 5/25/2020 11:05 | Clark St & Wrightwood Ave | 340 | Clark St & Leland Ave | 326 | 41.9295 | -87.6431 | 41.9671 | -87.6674 | casual |
docked_bike | 5/2/2020 15:48 | Kedzie Ave & Milwaukee Ave | 260 | Kedzie Ave & Milwaukee Ave | 260 | 41.9296 | -87.7079 | 41.9296 | -87.7079 | casual |
docked_bike | 5/2/2020 16:39 | Clarendon Ave & Leland Ave | 251 | Lake Shore Dr & Wellington Ave | 157 | 41.968 | -87.65 | 41.9367 | -87.6368 | casual |
docked_bike | 5/29/2020 13:27 | Hermitage Ave & Polk St | 261 | Halsted St & Archer Ave | 206 | 41.8715 | -87.6699 | 41.8472 | -87.6468 | member |
docked_bike | 5/29/2020 14:14 | Halsted St & Archer Ave | 206 | May St & Taylor St | 22 | 41.8472 | -87.6468 | 41.8695 | -87.6555 | member |
docked_bike | 5/20/2020 13:46 | Hermitage Ave & Polk St | 261 | Hermitage Ave & Polk St | 261 | 41.8715 | -87.6699 | 41.8715 | -87.6699 | member |
docked_bike | 5/6/2020 19:07 | Ritchie Ct & Banks St | 180 | Ritchie Ct & Banks St | 180 | 41.9069 | -87.6262 | 41.9069 | -87.6262 | casual |
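The descriptive statistics named above can be computed directly from the preview rows. The sketch below is illustrative only: it summarises the seventh column of the preview (the start latitude), the helper name `describe` is hypothetical, and skewness and kurtosis use the population (divide-by-n) moment form, which is one of several common conventions.

```python
import statistics

# Start latitudes copied from the eight preview rows above
start_lat = [41.8777, 41.9295, 41.9296, 41.968, 41.8715, 41.8472, 41.8715, 41.9069]

def describe(values):
    """Return mean, variance, standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments in population form (dividing by n)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

summary = describe(start_lat)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With only eight rows the figures are not meaningful in themselves; the same function would be applied to each numeric column of the full data set.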
Data Cleaning
The data set above required cleaning and the elimination of outliers. The following steps were therefore carried out before fitting a statistical model:
- Dividing the data into smaller units that can be quantified easily.
- Grouping the units that share the same measurements and removing those that carry no meaningful information (Ho Yu, 2010).
- Identifying possible sources of variation and outliers and eliminating them.
As a result, outside sources of variation and error were removed from the data.
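One way to carry out the last cleaning step is sketched below. The source does not say which outlier rule was used, so Tukey's interquartile-range rule is an assumption here, as are the helper name `remove_outliers_iqr`, the multiplier `k = 1.5` and the trip durations, which are invented for illustration.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical trip durations in minutes; None marks a missing reading
raw_durations = [12, 15, None, 9, 14, 11, 240, 13, 10]

# Drop records that carry no usable measurement
present = [d for d in raw_durations if d is not None]

# Drop extreme values that would distort the model
cleaned = remove_outliers_iqr(present)
print(cleaned)
```

The 240-minute trip lies far above the upper fence and is removed, while the remaining values keep their original order.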
Dimension Reduction
The dimensional scaling approach applies to short-term urban traffic prediction, which is most common on busy urban roads. Analytically, the dimension reduction approach follows three steps: correlation analysis, qualitative analysis and multivariate regression (Han & Huang, 2020). For the urban traffic prediction in this case study, the following steps will be applied.
- Selection of appropriate historical data without outliers based on qualitative analysis
In this step, specific road networks will be selected, and the data will be filtered according to the selection criterion. The traffic points with the highest concentration will also be set as targets.
- Grouping the data using the multidimensional method
The selected data will be grouped into units representing different road networks and traffic variables based on the specified streets to ease analysis.
- Reducing the data completely
This is the last step of the dimension reduction method and uses Pearson correlation coefficients to filter the data. Since the targeted model requires generally low correlation among its inputs, target areas with high correlation values will be eliminated, which improves the robustness of the model.
The following formula is essential in calculating the Pearson coefficient and will give the researcher an enhanced prediction model.
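For two variables x and y observed over n records, the standard sample Pearson coefficient can be written as follows. The symbols x_i, y_i, \bar{x}, \bar{y} and n are notation introduced here for illustration and are not taken from the source.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here \bar{x} and \bar{y} are the sample means of the two variables. A value of r_{xy} near +1 or -1 indicates a strong linear association, and a value near 0 indicates a weak one.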
Feature Engineering
Feature engineering is a vital step in machine learning and involves transforming the data and encoding the given variables to fit the required model. The following steps will be incorporated into the methodology.
- Frequency data filtering
In this step, the data will be filtered depending on the traffic concentration in the target areas. The main focus will be to eliminate variables with uninformative data.
- Encoding categorical variables
In this step, bin counts will be computed and the categorical variables transformed so that they meet the requirement of low correlation.
- Stacking the model based on the data transformed
This step will stack all the variables incorporated in the model in their log-transformed form.
- Extracting the required categorical variables
After the chosen variables have been transformed into log form, the variables defining the traffic data set will be extracted to meet the minimum requirements of an unbiased model.
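The feature engineering steps above can be sketched on the preview rows as follows. This is a minimal illustration under stated assumptions: the threshold `MIN_TRIPS`, the 0/1 coding of the rider type, the use of log(1 + count) and the field names in `features` are all choices made here, not details given in the source.

```python
import math
from collections import Counter

# Start stations and rider types taken from the preview rows above
trips = [
    ("Franklin St & Jackson Blvd", "member"),
    ("Clark St & Wrightwood Ave", "casual"),
    ("Kedzie Ave & Milwaukee Ave", "casual"),
    ("Clarendon Ave & Leland Ave", "casual"),
    ("Hermitage Ave & Polk St", "member"),
    ("Halsted St & Archer Ave", "member"),
    ("Hermitage Ave & Polk St", "member"),
    ("Ritchie Ct & Banks St", "casual"),
]

# Bin counting: number of trips that start at each station
station_counts = Counter(station for station, _ in trips)

def encode_rider(rider_type):
    """Encode the categorical rider type as a 0/1 indicator."""
    return 1 if rider_type == "member" else 0

# Stack the encoded and log-transformed variables into one record per trip
features = [
    {
        "station": station,
        "log_station_trips": math.log1p(station_counts[station]),
        "is_member": encode_rider(rider),
    }
    for station, rider in trips
]

# Frequency filtering: keep stations with at least MIN_TRIPS trips (hypothetical threshold)
MIN_TRIPS = 2
frequent = [row for row in features if station_counts[row["station"]] >= MIN_TRIPS]
print(frequent)
```

On the eight preview rows only one station appears twice, so the filter keeps two records; on the full data set the threshold would need to be chosen from the observed distribution of counts.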
Choice of modelling techniques
Accurate prediction of urban traffic depends on the model adopted in this case study, which must be adequate and free from outside sources of variation. Meeting this requirement is difficult and calls for thoroughly cleaned data without outliers (Zhao, Ukkusuri & Lu, 2018). The backpropagation neural network will build on bivariate correlation analysis and multiple linear regression, the two models most appropriate for predicting urban traffic in the target areas. Together they are sufficient and provide a sound basis for statistical inference. Other approaches, such as purely descriptive analysis, are not appropriate because they offer limited predictive insight.
Multivariate Analysis
Under multivariate analysis, regression analysis will be carried out. The independent (explanatory) and dependent (response) variables will be specified (Grohman, 2004). The significance level will be set at alpha = 0.05 so that appropriate inferences can be drawn. The model is as follows:
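In its standard form, a multiple linear regression with k explanatory variables and n observations can be written as shown here. The symbols y_i, x_{ij}, \beta_j and \varepsilon_i are notation assumed for illustration rather than taken from the source.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,
\qquad i = 1, \dots, n
```

Here y_i is the response for observation i, x_{ij} is the value of the j-th explanatory variable, \beta_0 is the intercept and \varepsilon_i is the error term. Under this notation, a coefficient \beta_j is judged significant when the p-value for the test of H_0: \beta_j = 0 falls below alpha = 0.05.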
Bivariate Correlation
Under this modelling type, Pearson correlation coefficients will be determined for the log-transformed variables, giving the strength of association within the urban traffic data set. It is worth noting that the Pearson coefficient always lies between -1 and +1. After the modelling, one can infer patterns of urban traffic across many states.
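A minimal sketch of this calculation on log-transformed variables is given below. The function name `pearson` and the two hourly count series are hypothetical and serve only to show the mechanics.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical hourly counts at one target location
trips_per_hour = [12, 30, 45, 80, 150]
vehicles_per_hour = [100, 220, 310, 520, 900]

# Log-transform both variables before measuring their association
log_trips = [math.log(v) for v in trips_per_hour]
log_vehicles = [math.log(v) for v in vehicles_per_hour]

r = pearson(log_trips, log_vehicles)
print(f"r = {r:.3f}")
```

The log transform requires strictly positive counts; series that contain zeros would need a shifted transform such as log(1 + count).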
Hyperparameter Optimisation and Model Evaluation
The Bayesian optimization technique is the most appropriate for the urban traffic prediction data set, because the number of estimators and explanatory variables is sufficient to optimize the chosen models fully. In addition, the Bayesian technique will allow the researcher to draw appropriate conclusions while ensuring that multicollinearity among the variables is minimal.
For model evaluation, binary logistic regression and the Bayes approach will provide good insight into the entire model (Zhao, Ukkusuri & Lu, 2018). The significance level, denoted alpha, is 0.05 for both the Bayesian and the regression coefficients. Therefore, to test the model's significance, the researcher compares the p-value obtained for each coefficient with alpha and draws the appropriate inferences.
Scalability Issues
For the chosen model, scalability measures how its effectiveness changes as the system is scaled down or up. Most transportation systems rely on technology to control traffic. The artificial neural network for the chosen model will therefore be tested before further implementation (Zhao, Ukkusuri & Lu, 2018). This requires that the following assumptions be met.
- The data must be normally distributed with mean 0 and variance 1.
- The skewness and kurtosis must fall within the required positive or negative range.
- There are no missing values and outliers within the data set.
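The first and third assumptions can be checked with a short routine such as the one below. It is a sketch under stated assumptions: the helper name `standardise` and the sample counts are hypothetical, and z-scoring only rescales the data to mean 0 and variance 1 without making its distribution normal.

```python
import statistics

def standardise(values):
    """Rescale values to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardise a constant sequence")
    return [(v - mean) / sd for v in values]

# Hypothetical hourly traffic counts; None would mark a missing value
counts = [12.0, 30.0, 45.0, 80.0, 150.0]

# Check the no-missing-values assumption before rescaling
has_missing = any(v is None for v in counts)

z_scores = standardise(counts)
print(has_missing, [round(z, 3) for z in z_scores])
```

Normality itself would still need to be assessed separately, for example through the skewness and kurtosis of the standardised values.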
However, the researcher is likely to encounter the following challenges in building a predictive model for urban traffic data.
- Extracting data from the required domain websites is complicated and time-consuming, so the researcher might end up with insufficient data.
- Cleaning the data is tedious and time-consuming. If outliers are not eliminated, the results may be unreliable, reducing the effectiveness of the entire model.
- Modelling and interpreting the regression and Bayes coefficients is complex and requires excellent knowledge of machine learning.
Ethical considerations
For any research to be effective and yield the required results, it must follow the ethical guidelines stipulated for researchers. This study will request permission to access data from the traffic control website of the chosen locale. Where members of the public are involved, consent forms will be sent to assure them of the confidentiality of any information obtained from them.
References
Zhao, Y., Ukkusuri, S., & Lu, J. (2018). Multidimensional Scaling-Based Data Dimension Reduction Method for Application in Short-Term Traffic Flow Prediction for Urban Road Network. Journal of Advanced Transportation, 2018, 1-10. doi: 10.1155/2018/3876841
Qin, Z. (2013). The Urban Road Short-Term Traffic Flow Prediction Research. Applied Mechanics and Materials, 423-426, 2954-2956. doi: 10.4028/www.scientific.net/amm.423-426.2954
Han, L., & Huang, Y. (2020). Short-term traffic flow prediction of road network based on deep learning. IET Intelligent Transport Systems, 14(6), 495-503. doi: 10.1049/iet-its.2019.0133
Ho Yu, C. (2010). Exploratory data analysis in the context of data mining and resampling. International Journal of Psychological Research, 3(1), 9-22. doi: 10.21500/20112084.819
Grohman, W. (2004). Using Convex Sets for Exploratory Data Analysis and Visualization. Data Mining and Knowledge Discovery, 9(3), 275-295. doi: 10.1023/b:dami.0000040906.82842.b5