
This paper proposes a robust outlier detection method for large-scale traffic data based on unsupervised regression. Traffic data are collected every day from loops, sensors, and digital cameras all around a city, and the resulting volume is at big-data scale. Outliers are regarded as abnormal traffic situations, such as traffic jams, low traffic flows, or incidents, as well as errors and noise introduced in data storage and transmission. The traffic data tackled in this paper are represented by spatial-temporal (ST) signals. Principal component analysis (PCA) is used for dimension reduction, and the coefficients of the first two principal components of the ST signals provide an (x, y)-coordinate representation. In the regression method, the (x, y)-coordinate points of inliers are measured by Standardized Residual (SR), Hat Matrix (HM), and Cook's Distance (CD), and outliers are assumed to show large changes in these three metrics under the best-fit regression model. Experimental results of the proposed method on the Level 1 data give detection success rates (DSRs) of 97.37% (SR), 91.19% (HM), and 94.28% (CD) for the linear regression model, and 96.80% (SR), 89.71% (HM), and 93.14% (CD) for the quadratic regression model. For the finer-granularity Level 2 data, the regression method with the CD metric achieves a DSR of 94.44%.
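The pipeline summarized above (PCA projection to two coordinates, then regression diagnostics) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `pca_two_coords`, `regression_diagnostics`, and `flag_outliers` are hypothetical, the input is assumed to hold one ST signal per row, and the cutoffs (|SR| > 3, leverage > 2p/n, CD > 4/n) are common rules of thumb rather than values reported here.

```python
import numpy as np

def pca_two_coords(signals):
    """Project each row (assumed: one ST signal) onto the first two principal components."""
    centered = signals - signals.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:2].T
    return scores[:, 0], scores[:, 1]

def regression_diagnostics(x, y, degree=1):
    """Fit a polynomial regression of y on x and return (SR, leverage, Cook's distance)."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    n, p = X.shape
    # Thin QR gives the hat-matrix diagonal without forming the n-by-n matrix:
    # H = Q Q^T, so h_ii is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    resid = y - Q @ (Q.T @ y)          # residuals of the least-squares fit
    s2 = resid @ resid / (n - p)       # residual variance estimate
    sr = resid / np.sqrt(s2 * (1.0 - h))   # internally standardized residuals
    cd = sr**2 * h / (p * (1.0 - h))       # Cook's distance
    return sr, h, cd

def flag_outliers(sr, h, cd, n_params, sr_cut=3.0):
    """Apply rule-of-thumb cutoffs (assumptions, not taken from the paper)."""
    n = sr.size
    return {
        "SR": np.abs(sr) > sr_cut,
        "HM": h > 2.0 * n_params / n,
        "CD": cd > 4.0 / n,
    }
```

Calling `regression_diagnostics(x, y, degree=1)` corresponds to the linear model and `degree=2` to the quadratic model; `n_params` in `flag_outliers` is then `degree + 1`. The QR route is chosen only to keep memory linear in the number of points, which matters for large data sets.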