Optimizing Hadoop for Small File Management
Keywords: Cloud Hadoop, HDFS, Small Files, SequenceFile, MapFile
HDFS is one of the most widely used distributed file systems, offering high availability and scalability on low-cost hardware. It is delivered as the storage component of the Hadoop framework. Coupled with MapReduce, the processing component, it has become the de facto platform for managing big data. However, HDFS was designed specifically to handle a huge number of large files; when it comes to a large number of small files, Hadoop deployments may be inefficient. In this paper, we propose a new strategy for managing small files. Our approach consists of two principal phases. The first phase consolidates the small-file inputs of more than one client and stores them contiguously in the first allocated block, in SequenceFile format, continuing into the next blocks as each one fills. This avoids allocating separate blocks for different streams, which reduces requests for available blocks and reduces the metadata held in NameNode memory, because a group of small files packaged in a SequenceFile on the same block requires one entry instead of one entry per small file. The second phase analyzes the attributes of the stored small files and distributes them so that the most frequently accessed files are referenced by an additional index, in MapFile format, to reduce read overhead during random access.
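The two phases described above can be illustrated with a minimal, in-memory sketch. It does not use the Hadoop API and is not the implementation evaluated in this paper: the names `Block`, `pack_small_files`, `build_hot_index`, the access-count threshold, and the tiny block size in the demo are all assumptions made for illustration. As a further simplification, a file is never split across two blocks, whereas real SequenceFile records are not bound to block boundaries.

```python
from dataclasses import dataclass, field

# Assumed block size in bytes; real clusters configure this value.
BLOCK_SIZE = 128 * 1024 * 1024


@dataclass
class Block:
    """One container block holding (name, offset, length) records back to back."""
    records: list = field(default_factory=list)
    used: int = 0


def pack_small_files(files, block_size=BLOCK_SIZE):
    """Phase 1 (sketch): append files from all clients into the current block
    and open a new block only when the current one cannot hold the next file.

    `files` is an iterable of (name, size_in_bytes) pairs.
    """
    blocks = [Block()]
    for name, size in files:
        if size > block_size:
            raise ValueError(f"{name} does not fit in one block")
        current = blocks[-1]
        if current.used + size > block_size:
            current = Block()
            blocks.append(current)
        # Record where the file starts inside its block, then advance.
        current.records.append((name, current.used, size))
        current.used += size
    return blocks


def build_hot_index(blocks, access_counts, threshold):
    """Phase 2 (sketch): map each frequently read file to
    (block_number, offset, length) so it can be located without a scan.
    """
    index = {}
    for block_no, block in enumerate(blocks):
        for name, offset, length in block.records:
            if access_counts.get(name, 0) >= threshold:
                index[name] = (block_no, offset, length)
    return index


# Hypothetical inputs from two clients, with a tiny block size for readability.
files = [("a.log", 40), ("b.log", 30), ("c.log", 50), ("d.log", 20)]
blocks = pack_small_files(files, block_size=100)
hot = build_hot_index(blocks, {"c.log": 9, "a.log": 1}, threshold=5)

print("entries without packing:", len(files))
print("entries with packing:", len(blocks))
print("hot-file index:", hot)
```

In this sketch the number of container blocks stands in for the number of NameNode entries, and the dictionary stands in for the additional MapFile-style index over the most frequently accessed files.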