An Empirical Study of Clustering Algorithms to extract Knowledge from PubMed Articles
DOI:
https://doi.org/10.14738/tmlai.53.3106
Keywords:
Information Retrieval, Text Mining, Fuzzy Clustering, Influenza Virus, PubMed.
Abstract
Extracting useful information from biomedical literature is a major research focus today, since almost all articles are available on the web in electronic form. Information retrieval (IR) from biomedical literature means finding useful patterns in an unstructured text corpus that satisfy an information need. In this paper, intelligent text analysis is carried out on PubMed articles related to the influenza virus. In this context, various algorithms are discussed for revealing information from PubMed articles, such as the year-wise count of articles containing influenza virus related terms (viz. H1N1, H5N1, H7N1, etc.) and the publication count per country, which indicates outbreaks of the disease in those countries. The articles may be grouped by searching for the "influenza virus strain" keyword pattern with the help of regular expressions. Automatic text categorization is another challenging issue in text mining. We applied the k-means, fuzzy c-means, and fuzzy c-shell algorithms for automatic categorization of text articles. The association between words is computed from their co-occurrence, which further helps to categorize the documents. The basic k-means clustering algorithm is first applied to cluster the documents; then, to handle the fuzzy nature of words that may belong to more than one cluster, fuzzy c-means clustering is applied to form more accurate clusters. Because the fuzzy c-means method clusters documents that lie in linear spaces but not in circular, spherical, or ellipsoidal spaces, a new method is proposed here that considers clusters of documents within the radius of a circle.
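As an illustration of the regular-expression step described in the abstract, the following minimal sketch counts articles per year for each influenza subtype term. It is not the authors' implementation: the pattern `H<number>N<number>`, the function name `count_subtypes_by_year`, and the sample records are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed pattern: influenza A subtypes written as H<number>N<number>, e.g. H1N1.
SUBTYPE_RE = re.compile(r"\bH\d{1,2}N\d{1,2}\b", re.IGNORECASE)


def count_subtypes_by_year(records):
    """Count articles per (year, subtype); an article counts once per subtype."""
    counts = Counter()
    for year, text in records:
        # A set removes repeated mentions of the same subtype within one article.
        subtypes = {match.upper() for match in SUBTYPE_RE.findall(text)}
        for subtype in subtypes:
            counts[(year, subtype)] += 1
    return counts


# Hypothetical records: (publication year, abstract text).
records = [
    (2009, "Pandemic H1N1 influenza spread rapidly; H1N1 cases rose."),
    (2009, "Comparison of H1N1 and H5N1 pathogenicity."),
    (2013, "Human infection with avian h7n9 virus."),
]

counts = count_subtypes_by_year(records)
for (year, subtype), n in sorted(counts.items()):
    print(year, subtype, n)
```

Counting each subtype at most once per article matches the "year-wise count of articles" wording; counting every mention instead would only require replacing the set with the raw list of matches.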