Transactions on Engineering and Computing Sciences - Vol. 12, No. 2
Publication Date: April 25, 2024
DOI:10.14738/tecs.122.16892.
Abu Alrub, M., & Shaout, A. (2024). Power DataMate Tool: Leveraging Logistic Regression Classification for Interactive Data
Modeling. Transactions on Engineering and Computing Sciences, 12(2). 195-216.
Services for Science and Education – United Kingdom
Power DataMate Tool: Leveraging Logistic Regression
Classification for Interactive Data Modeling
Mahmoud Abu Alrub
ORCID: 0000-0002-8916-0259
The Electrical and Computer Engineering Department,
The University of Michigan – Dearborn, USA
Adnan Shaout
The Electrical and Computer Engineering Department,
The University of Michigan – Dearborn, USA
ABSTRACT
The demand for efficient predictive modeling techniques has become crucial due to
the growing occurrence of binary classification problems in diverse fields. This
paper presents a new and innovative software tool “Power DataMate” (PDM) which
performs logistic regression (LR) classification as a potent technique for data
modeling. PDM is a powerful tool in capturing and analyzing correlations across
varied datasets. One objective of PDM is to focus on logistic regression for inquiry
based on its capability to represent intricate interactions between predictors and
the binary response variable. Another important goal of the new tool is to forecast
the likelihood of discovering Primary Keys (PK) and Foreign Keys (FK) within
datasets. The new tool PDM also allows users not only to automatically have their
data for a project modeled, but also interactively review and confirm primary keys
and features for further data analysis and modeling. The research includes a
comprehensive evaluation of model performance indicators, including accuracy,
precision, and recall. Results show an accuracy of 89% for PK prediction and
82% for FK prediction. These results are the first of their kind and could be a starting
point for further model enhancements and data analytics research, especially for
projects that include large data files, where the PDM end user can choose to
interactively feed the learning algorithm for better outcomes.
Keywords: Data Mining, Data Modeling, Data Classification, Logistic Regression, Primary
Key, Foreign Key.
INTRODUCTION
In a time of unparalleled data abundance, extracting valuable knowledge from large and
complex datasets has become a critical task for researchers and data scientists in many fields.
It is more important than ever to use data modeling techniques as we find ourselves at the
nexus of information and technology. The present paper delves into the field of data modeling
as the combination of methodologies from statistics, machine learning, and database systems
which are employed to extract probabilities and insights that may be utilized for decision-making and strategic planning purposes.
Data classification is an essential component of the expansive domain of data mining.
Classification refers to the procedure of assigning predetermined labels or categories to
instances based on their distinctive features [1]. Logistic regression, in turn, has become
a fundamental component in the domain of statistical modeling and machine learning [2].
Its applications are many, encompassing the prediction of consumer behavior and medical
outcomes, the detection of fraudulent operations, and the assessment of risk factors
(see, for example, [3]).
Data modeling entails the methodical depiction of an entity's data resources and the
interconnections that exist among them [4]. It acts as a foundational plan for creating databases,
providing a theoretical structure that facilitates the connection between unprocessed data and
practical knowledge. In the contemporary era of digitalization, the quantity of data produced
daily has attained unparalleled magnitudes. Data mining (DM), in turn, refers to the
systematic exploration and analysis of large volumes of information with the objective of identifying
patterns, trends, correlations, and essential insights. The process integrates methodologies
from diverse disciplines, including statistics, machine learning, and database management, to
convert unprocessed data into practical insights [5].
The process of data classification entails the methodical grouping of data based on
predetermined criteria, including confidentiality and integrity [5]. The classification problem in the
context of data mining is the practice of giving predetermined labels or categories to instances
by considering their distinct properties. The task involves creating precise,
effective, and comprehensible models that can automatically assign predetermined labels to
instances of data. Such models enable the automation of processes and provide
improved decision-making support in a wide range of applications [5-6]. Within the expansive
realm of statistical modeling and machine learning, logistic regression emerges as a potent and
extensively utilized methodology that specifically addresses binary or multi-class classification
challenges. Logistic regression, a fundamental technique in statistical analysis and probability
theory, is utilized to describe the association between input variables and categorical outputs
[2][7].
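As a minimal sketch of that association for the binary case, the snippet below passes a weighted sum of the input variables through the logistic (sigmoid) function to obtain a probability, then thresholds it at 0.5 to produce a class label. The coefficients, the input vector, and the function names are hypothetical values chosen only for illustration; they are not taken from PDM.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, weights, bias):
    """Estimate P(y = 1 | x) for a binary logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


# Hypothetical coefficients and input, for illustration only.
weights = [2.0, -1.0]
bias = -0.5

p = predict_proba([1.0, 0.5], weights, bias)  # linear score z = 1.0
label = 1 if p >= 0.5 else 0                  # threshold the probability
```

The 0.5 threshold is a common default rather than a requirement; a different cut-off trades precision against recall.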
This paper presents a new software tool called Power DataMate (PDM) that
enables users to create data management projects, review and analyze datasets, understand
their initial entities, features, and instances, and model the data automatically or interactively.
PDM is a comprehensive web-based data modeling application.
The paper is organized as follows: Section 2 explores the related work in data modeling using
classification and logistic regression techniques, mainly in PK and FK identification; Section 3
reviews the existing academic knowledge and serves as a natural foundation for the rest of the
study; Section 4 elaborates on PDM functionalities in terms of predictive, automated, and
interactive classification models; Section 5 describes how PDM analyzes, trains, tests, and builds
a data model; Section 6 draws attention to the study's limits and scope; and lastly, Section 7
summarizes the key findings, gives interpretations, and concludes with recommendations for
PDM data models.
RELATED WORK
Several studies have explained the classification problem and how logistic regression helps in
modeling data using machine learning techniques. However, many of them stop short of mining
the data to understand its relations by discovering the PKs and FKs. Our interest lies in
automatically modeling data from distinct types of data sources by classifying PKs and FKs,
and in interactively allowing users to provide input on keys and features to improve the maturity
of our prediction model. This section reviews the existing work on data modeling
through data mining, data modeling using logistic regression, and lastly PK and FK discovery.
Data Modeling through Data Mining
Data mining can be classified from various viewpoints: (1) in terms of data, it can be categorized
as supervised, semi-supervised, or unsupervised; (2) in terms of purpose, it can serve
predictive analysis or descriptive analysis; (3) in terms of the algorithm used, it can be categorized
into classification, regression, clustering, pattern mining, etc. [8]. Because we aim to use
supervised data for predictive classification, it is necessary to ensure manageability and
traceability. Thus, it is recommended to follow a guided approach when undertaking data
mining activities. To facilitate this, various data mining standard processes have been
developed [9-12].
Applying a data mining model to input data without a well-organized application, often known
as "data dredging" [13], can result in the production of meaningless or unintelligible patterns,
ultimately leading to data mining or modeling failure. The utilization of a standardized data
mining framework could serve as a quantifiable guide for individuals to adhere to when
engaging in problem definition, data preprocessing, classification model selection, and data
modeling. In this contribution, PDM allows users to create a well-structured data
modeling project and apply a nine-step framework to ensure quality classification results, as
will be presented in Section 5.
The rapid expansion of data and databases necessitates the creation of novel technologies and
tools to transform data efficiently and autonomously into valuable information and knowledge.
Data mining has gained significant prominence as a study field, as evidenced by its growing
importance in several studies [14-16]. Unlike existing work that either discusses the data
mining techniques and algorithms or applies them to domain-specific areas such as the field
of education, PDM encapsulates the process of extracting valuable information from extensive
datasets in a simplified user journey so a data analyst can interpret the knowledge of a refined
data model.
Data Modeling using Classification, Logistic Regression
Supervised learning algorithms are characterized by having prior knowledge of the class
attribute values in the dataset before executing the algorithm. The data is referred to as labeled
data or training data [17]. The elements in this collection are tuples represented as (x, y), where
x is a feature vector and y is the class attribute, often a scalar value. Supervised learning constructs a
model that establishes a mapping between the input variable x and the output variable y [12].
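The mapping from x to y can be illustrated with a small, self-contained sketch that fits a binary logistic regression to labeled (x, y) tuples using batch gradient descent. Everything specific here is an assumption made for illustration: the two features (the fraction of distinct values and the fraction of nulls in a column), the toy data, the learning rate, and the function names are hypothetical and are not the features or training procedure reported for PDM.

```python
import math

# Labeled training data as (x, y) tuples: x is a feature vector, y is the class label.
# Hypothetical features: [fraction of distinct values, fraction of null values].
# Hypothetical labels: 1 = key-like column, 0 = not key-like.
training_data = [
    ([1.00, 0.00], 1),
    ([0.98, 0.00], 1),
    ([0.95, 0.01], 1),
    ([0.30, 0.10], 0),
    ([0.10, 0.30], 0),
    ([0.05, 0.00], 0),
]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(x, w, b):
    """Linear score b + w . x for one instance."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def fit(data, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(data)
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in data:
            error = sigmoid(score(x, w, b)) - y  # predicted probability minus label
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        w = [wi - learning_rate * g / n for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b


def predict(x, w, b):
    """Assign class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(score(x, w, b)) >= 0.5 else 0


w, b = fit(training_data)
predictions = [predict(x, w, b) for x, _ in training_data]
```

In practice a library implementation with regularization and a held-out test set would usually replace this hand-written loop; the sketch only makes the (x, y) to model relationship concrete.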
In accordance with the above, we utilize the supervised learning algorithm through the logistic