A Survey of Challenges Facing Streaming Data

  • Sikha Bagui University of West Florida
  • Katie Jin University of West Florida
Keywords: Streaming Data, Sliding Windows, Concept Drift, Data Preprocessing, Data Reduction, Data Streams

Abstract

This survey enumerates and analyzes existing methods for data stream processing, organized around the challenges facing streaming data. The challenges addressed are preprocessing of streaming data, detecting and handling concept drift, data reduction for data streams, approximate queries, and blocking operations.

Author Biographies

Sikha Bagui, University of West Florida

Dr. Sikha Bagui is Professor and Askew Fellow in the Department of Computer Science at The University of West Florida, Pensacola, Florida. Dr. Bagui actively publishes peer-reviewed journal articles in the areas of database design, data mining, Big Data and Big Data analytics, and machine learning. She has worked on funded as well as unfunded research projects and has over 70 peer-reviewed publications. She has also co-authored several books on databases and SQL, and she serves as Associate Editor and editorial board member for several journals.

Katie Jin, University of West Florida

Katie Jin completed her MS in Computer Science at The University of West Florida. Her interests are in Big Data Analytics and Machine Learning.

Published
2020-08-01
How to Cite
Bagui, S., & Jin, K. (2020). A Survey of Challenges Facing Streaming Data. Transactions on Machine Learning and Artificial Intelligence, 8(4), 63-73. https://doi.org/10.14738/tmlai.84.8579