An Empirical Study of Clustering Algorithms to extract Knowledge from PubMed Articles

Authors

  • Deepak Agnihotri Dept. of Computer Applications, National Institute of Technology Raipur, CG-492010, INDIA http://orcid.org/0000-0002-1536-2261
  • Kesari Verma Dept. of Computer Applications, National Institute of Technology Raipur, CG-492010, INDIA
  • Priyanka Tripathi Dept. of Computer Engineering and Applications, National Institute of Technical Teachers Training and Research Bhopal, MP, INDIA

DOI:

https://doi.org/10.14738/tmlai.53.3106

Keywords:

Information Retrieval, Text Mining, Fuzzy Clustering, Influenza Virus, PubMed.

Abstract

Extracting useful information from biomedical literature has become a major research thrust now that almost all articles are available on the web in electronic form. Information retrieval (IR) from biomedical literature means finding useful patterns in an unstructured text corpus that satisfy an information need. In this paper, intelligent text analysis is carried out on PubMed articles related to the influenza virus. Various algorithms are discussed for revealing information from these articles, such as the year-wise count of articles containing influenza virus related terms (e.g., H1N1, H5N1, and H7N1) and the publication count per country, which reflects outbreaks of the disease in those countries. The articles may be grouped by searching for the “influenza virus strain” keyword pattern with the help of regular expressions. Automatic text categorization is another challenging issue in text mining. We applied the k-means, fuzzy c-means, and fuzzy c-shell algorithms for automatic categorization of text articles. The association between words is computed from their co-occurrence, which in turn helps to categorize the documents. The basic k-means clustering algorithm is first applied to cluster the documents; then, to handle the fuzzy nature of words that may belong to more than one cluster, fuzzy c-means clustering is applied to form more accurate clusters. Because the fuzzy c-means method clusters documents that lie in linear spaces but not those in circular, spherical, or ellipsoidal spaces, a new method is proposed here that considers clusters of documents within the radius of a circle.
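
The regular-expression grouping and year-wise counting described in the abstract might look roughly like the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the pattern H<number>N<number>, the function names, the (year, text) record layout, and the sample records are all hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: influenza A subtypes written as H<number>N<number>, e.g. H1N1.
SUBTYPE_RE = re.compile(r"\bH(\d{1,2})N(\d{1,2})\b", re.IGNORECASE)


def subtypes_in(text):
    """Return the set of normalized subtype labels (e.g. 'H5N1') found in text."""
    return {f"H{h}N{n}" for h, n in SUBTYPE_RE.findall(text)}


def yearly_subtype_counts(articles):
    """Count, per (year, subtype), how many articles mention that subtype.

    `articles` is an iterable of (year, text) pairs. This record layout is an
    assumption made for illustration only.
    """
    counts = Counter()
    for year, text in articles:
        # A set is used so that each article counts once per subtype.
        for subtype in subtypes_in(text):
            counts[(year, subtype)] += 1
    return counts


# Made-up sample records, not data from the study.
sample = [
    (2009, "Pandemic H1N1 influenza virus spread rapidly; h1n1 cases rose."),
    (2009, "Comparison of H1N1 and H5N1 strains."),
    (2013, "Avian influenza H7N9 in humans."),
]
counts = yearly_subtype_counts(sample)
```

Counting each article once per subtype keeps the totals interpretable as "number of articles", which matches the year-wise article count the abstract refers to; counting every textual mention instead would be an equally simple variant.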

Author Biographies

Deepak Agnihotri, Dept. of Computer Applications, National Institute of Technology Raipur, CG-492010, INDIA

Ph.D. Research Scholar,

Department of Computer Applications,

National Institute of Technology Raipur,

CG, 492010, India.

Kesari Verma, Dept. of Computer Applications, National Institute of Technology Raipur, CG-492010, INDIA

Assistant Prof.

Department of Computer Applications,

National Institute of Technology Raipur,

CG, 492010, India.

Priyanka Tripathi, Dept. of Computer Engineering and Applications, National Institute of Technical Teachers Training and Research Bhopal, MP, INDIA

Associate Professor,

Dept. of Computer Engineering and Applications, National Institute of Technical Teachers Training and Research Bhopal, MP, INDIA

Published

2017-07-13

How to Cite

Agnihotri, D., Verma, K., & Tripathi, P. (2017). An Empirical Study of Clustering Algorithms to extract Knowledge from PubMed Articles. Transactions on Engineering and Computing Sciences, 5(3), 13. https://doi.org/10.14738/tmlai.53.3106