Transactions on Engineering and Computing Sciences - Vol. 12, No. 2

Publication Date: April 25, 2024

DOI:10.14738/tecs.122.16892.

Abu Alrub, M., & Shaout, A. (2024). Power DataMate Tool: Leveraging Logistic Regression Classification for Interactive Data

Modeling. Transactions on Engineering and Computing Sciences, 12(2). 195-216.

Services for Science and Education – United Kingdom

Power DataMate Tool: Leveraging Logistic Regression

Classification for Interactive Data Modeling

Mahmoud Abu Alrub

ORCID: 0000-0002-8916-0259

The Electrical and Computer Engineering Department,

The University of Michigan – Dearborn, USA

Adnan Shaout

The Electrical and Computer Engineering Department,

The University of Michigan – Dearborn, USA

ABSTRACT

The demand for efficient predictive modeling techniques has become crucial due to

the growing occurrence of binary classification problems in diverse fields. This

paper presents a new software tool, “Power DataMate” (PDM), which

applies logistic regression (LR) classification as a technique for data

modeling. PDM is designed for capturing and analyzing correlations across

varied datasets. One objective of PDM is to use logistic regression because of

its capability to represent intricate interactions between predictors and

the binary response variable. Another important goal of the new tool is to forecast

the likelihood of discovering Primary Keys (PK) and Foreign Keys (FK) within

datasets. PDM also allows users not only to have their project data modeled

automatically, but also to interactively review and confirm primary keys

and features for further data analysis and modeling. The research entails a

comprehensive evaluation of model performance indicators, including accuracy,

precision, and recall. Results show a prediction accuracy of 89% for PKs and

82% for FKs. These results are the first of their kind and could be a starting

point for further model enhancements and data analytics research, especially for

projects that include large data files, where the PDM end user can

interactively feed the learning algorithm for better outcomes.

Keywords: Data Mining, Data Modeling, Data Classification, Logistic Regression, Primary

Key, Foreign Key.

INTRODUCTION

In a time of unparalleled data abundance, extracting valuable knowledge from large and

complex datasets has become a critical task for researchers and data scientists in many fields.

It is more important than ever to use data modeling techniques as we find ourselves at the

nexus of information and technology. The present paper delves into the field of data modeling

as the combination of methodologies from statistics, machine learning, and database systems

which are employed to extract probabilities and insights that may be utilized for decision-making and strategic planning purposes.

Data classification is an essential component within the expansive domain of data mining.

Classification refers to the procedure of assigning predetermined labels or categories to

instances, taking into consideration their distinctive features [1]. Logistic

regression, in turn, has become a fundamental component in the domain of statistical

modeling and machine learning [2]. The applications of this method are many, encompassing

the prediction of consumer behavior and medical results, the detection of fraudulent

operations, and the assessment of risk factors; see, for example, [3].

Data modeling entails the methodical depiction of an entity's data resources and the

interconnections that exist among them [4]. It acts as a foundational plan for creating databases,

providing a theoretical structure that facilitates the connection between unprocessed data and

practical knowledge. In the contemporary era of digitalization, the quantity of data produced

daily has attained unparalleled magnitudes. Data mining (DM), in turn, refers to the

systematic exploration and analysis of extensive information with the objective of identifying

patterns, trends, correlations, and essential insights. The process integrates methodologies

from diverse disciplines, including statistics, machine learning, and database management, to

convert unprocessed data into practical insights [5].

The process of data classification entails the methodical grouping of data based on

predetermined criteria, including confidentiality and integrity [5]. The classification problem in the

context of data mining is the practice of giving predetermined labels or categories to instances

by considering their distinct properties. The task involves creating precise,

effective, and comprehensible models that can automatically assign predetermined labels to

instances of data. Such models would enable the automation of processes and provide

improved decision-making support in a wide range of applications [5-6]. Within the expansive

realm of statistical modeling and machine learning, logistic regression emerges as a potent and

extensively utilized methodology that specifically addresses binary or multi-class classification

challenges. Logistic regression, a fundamental technique in statistical analysis and probability

theory, is utilized to describe the association between input variables and categorical outputs

[2][7].
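
For reference, the binary form of this association is commonly written as shown below. This is the standard textbook formulation rather than a formula taken from this paper; the weight vector w and the intercept b are notation assumed here for illustration, with x the input vector and y the binary class label.

```latex
P(y = 1 \mid x) = \sigma(w^{\top} x + b) = \frac{1}{1 + e^{-(w^{\top} x + b)}},
\qquad
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = w^{\top} x + b
```

The second identity states that the log-odds of the positive class are linear in the inputs, which is why the fitted weights can be read as the direction and strength of each predictor's effect.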

This paper presents a new software tool called Power DataMate (PDM) that

enables the user to create data management projects, review and analyze datasets, understand

their initial entities, features, and instances, and automatically or interactively model the data.

PDM is a web-based comprehensive data modeling application.

The paper is organized as follows: Section 2 explores the related work in data modeling using

classification and logistic regression techniques, mainly in PK and FK identification; Section 3

reviews the existing academic knowledge and serves as a foundation for the rest of the

study; Section 4 elaborates on PDM functionalities in terms of predictive, automated, and

interactive classification models; Section 5 describes how PDM analyzes, trains, tests, and builds

a data model; Section 6 draws attention to the study's limits and scope; and lastly, Section 7

summarizes the key findings, gives interpretations, and concludes with recommendations for

PDM data models.

RELATED WORK

Several studies have explained the classification problem and how logistic regression helps in

modeling data using machine learning techniques. However, many stop short of mining

the data to understand its relations by discovering PKs and FKs. Our interest lies in

automatically modeling the data from distinct types of data sources by classifying PKs and FKs

and interactively allowing users to provide input on keys and features to improve the maturity

of our prediction model. This section investigates the existing work in terms of data modeling

through data mining, data modeling using logistic regression, and lastly PK and FK discovery.

Data Modeling through Data Mining

Data mining can be classified from various viewpoints: (1) in terms of data, it can be categorized

as supervised, semi-supervised, or unsupervised; (2) in terms of purpose, it can serve

the aim of predictive analysis or descriptive analysis; and (3) in terms of the algorithm used, it can be categorized

into classification, regression, clustering, pattern mining, etc. [8]. While we aim to use

supervised data for predictive classification, it is necessary to ensure manageability and

traceability. Thus, it is recommended to follow a guided approach when undertaking data

mining activities. To facilitate this, various data mining standard processes have been

developed [9-12].

Applying a data mining model to input data without a well-organized approach, a practice often known

as "data dredging" [13], can result in the production of meaningless or unintelligible patterns,

ultimately leading to data mining or modeling failure. The utilization of a standardized data

mining framework could serve as a quantifiable guide for individuals to adhere to when

engaging in problem definition, data preprocessing, classification model selection, and data

modeling. PDM lets a user create a well-structured data

modeling project and apply a nine-step framework to ensure quality classification results, as

presented in Section 5.

The rapid expansion of data and databases necessitates the creation of novel technologies and

tools to transform data efficiently and autonomously into valuable information and knowledge.

Data mining has gained significant prominence as a study field, as evidenced by its growing

importance in several studies [14-16]. Unlike existing work that either discusses the data

mining techniques and algorithms or applies them to domain-specific areas such as the field

of education, PDM encapsulates the process of extracting valuable information from extensive

datasets in a simplified user journey so a data analyst can interpret the knowledge of a refined

data model.

Data Modeling using Classification, Logistic Regression

Supervised learning algorithms are characterized by having prior knowledge of the class

attribute values in the dataset before executing the algorithm. The data is referred to as labeled

data or training data [17]. The elements in this collection are tuples represented as (x, y), where

x is a vector and y is the class attribute, often a scalar value. Supervised learning constructs a

model that establishes a mapping between the input variable x and the output variable y [12].
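
The supervised setting described above can be illustrated with a minimal sketch in Python. It fits a logistic regression model to labeled (x, y) tuples using plain batch gradient descent, which is chosen here only for brevity and is not stated to be the fitting method used by PDM. The two features (a column's uniqueness ratio and null ratio), the toy training data, and all function names are hypothetical assumptions made for illustration; they are not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(data, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss.

    data: list of (x, y) tuples, where x is a list of floats and y is 0 or 1.
    """
    n = len(data)
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the mean gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Map an input vector x to a class label (1 or 0)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return int(p >= threshold)


# Hypothetical labeled data: x = [uniqueness ratio, null ratio] of a column,
# y = 1 if the column is treated as a primary key, else 0. Values are made up.
training_data = [
    ([1.00, 0.00], 1),
    ([0.98, 0.00], 1),
    ([1.00, 0.01], 1),
    ([0.10, 0.20], 0),
    ([0.35, 0.05], 0),
    ([0.02, 0.00], 0),
]

w, b = train_logistic(training_data)
print("weights:", w, "bias:", b)
print("prediction for [0.99, 0.00]:", predict(w, b, [0.99, 0.00]))
print("prediction for [0.05, 0.10]:", predict(w, b, [0.05, 0.10]))
```

The learned pair (w, b) is the mapping from x to y that the paragraph refers to. In practice, a library implementation with regularization and a held-out test set would normally replace this hand-written loop.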

In accordance with the above, we utilize the supervised learning algorithm through the logistic