In the original dataset, a total of six samples have null values: four in the “Ca (Number of Major Vessels)” feature and two in the “Thal (Thallium Heart Rate)” feature. Since the null values are so few, these samples can simply be removed from the dataset. The dataset used in this study contains a total of 1025 samples, of which 499 belong to the disease (1) class and 526 to the no-disease (0) class. Histograms of all features in the Cleveland heart disease dataset are shown in Figure 1.
Histograms of features in the heart disease dataset.
The performance of ML models depends on the quality of the features used as input. As the number of features in a dataset grows, irrelevant or redundant features can degrade prediction performance and increase computational cost. By reducing the number of features, the model can obtain more accurate results and run faster and more efficiently. ML models are shaped by the data used in the learning process, and selecting the best features makes what the model learns more generalizable, so that it works better on new data. Some features in a dataset contribute little to the result while increasing the computational complexity of the model; removing such features reduces noise and helps the model achieve better results. Feature selection is also important for understanding the nature of the dataset, since well-chosen features help people interpret the data. In this study, the Jellyfish algorithm was used to select the best features from the dataset.
Presented in 2021, the Jellyfish optimization algorithm is a type of swarm intelligence algorithm that is inspired by the food-finding behavior of jellyfish in the ocean. It is used to solve optimization problems, particularly in the field of engineering and computer science. According to the literature, the Jellyfish algorithm outperforms many well-known meta-heuristic algorithms in most real-world applications. In the Jellyfish algorithm, a group of artificial agents or particles, called “jellyfish,” move in a three-dimensional space, searching for the optimal solution to a problem. The algorithm is based on a set of rules that simulate the behavior of real-life jellyfish. The algorithm uses a combination of random and deterministic movements to explore the search space and exploit promising solutions. Each Jellyfish has a set of properties that are updated at each iteration, based on its own and the swarm’s best-known solutions. These properties include its position, velocity, and acceleration. The Jellyfish algorithm has been successfully applied to a range of optimization problems, including clustering, feature selection, and image segmentation. It has been shown to perform well in high-dimensional search spaces and can handle multiple objectives and constraints. Overall, the Jellyfish algorithm is a promising optimization technique that takes inspiration from nature to solve complex problems in a computationally efficient way. Figure 2 shows the behavior of jellyfish in the sea and the modeling of group movements [ 25 ].
Jellyfish behaviors for modeling a jellyfish optimization algorithm [ 25 ].
The Jellyfish algorithm has the following three behaviors:
Ocean currents in the sea carry nutrients that can attract jellyfish. The direction of the ocean current can be defined as a vector, as in Equation (1):
In this regard, e_c is the absorption factor, a control parameter of the algorithm. This equation can be expanded as Equation (2):
In this equation, X* is the best jellyfish and μ is the mean location of the jellyfish population. For simplicity, d_f = e_c · μ can be assumed, so this equation can be presented in a more general form in Equation (3):
The random distribution of jellyfish can be considered normal, as shown in Equations (4) and (5):
In these relations, σ is the standard deviation of the spatial distribution of the jellyfish. Figure 3 shows the jellyfish scattered around the mean point according to a normal distribution.
Normal distribution of jellyfish in the ocean [ 25 ].
Figure 4 depicts the displacement process of each jellyfish under the influence of ocean water force and under the influence of the jellyfish group.
The movement of jellyfish in the ocean with the force of ocean movements and group movements [ 25 ].
The terms d_f and e_c can be rewritten as Equations (6) and (7), respectively:
Now we can rewrite Equation (3) based on Equation (6) and present it in Equation (8):
Jellyfish are also carried by the ocean current (water waves); this movement is expressed in Equation (9):
Equation (9) can be extended to Equation (10):
In this relation, β is a number greater than zero, usually set to β = 3. Jellyfish also exhibit group movements, which are of two kinds: passive and active. In the passive mode, they mostly search around their own positions. To model passive motion, Equation (11) is used to move them:
In this relation, γ is the motion coefficient, a positive number usually set to 0.1; U_b is the upper bound and L_b the lower bound of each dimension. In the active behavior mode, a jellyfish X_i randomly selects another jellyfish X_j, and there are two cases. If the fitness of X_i is better than that of X_j, Equation (12) is used to move; otherwise, Equation (13) is used:
Equation (14) is used to switch between ocean movements and group movements:
In this regard, t is the current iteration number of the algorithm, and Max_t is the maximum number of iterations. The time-control value c(t), which includes a random term, is plotted in Figure 5 for one run. At each update, if c(t) is greater than 0.5, the jellyfish update follows the ocean current (waves); if it is less than 0.5, it follows the group movements.
Random function to determine the type of motion of the type of force of ocean motions and group motions [ 25 ].
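To make the update rules above concrete, the following Python sketch gives a minimal, illustrative implementation of the jellyfish search optimizer. It is not the MATLAB code used in this study; the population size, β = 3, γ = 0.1, and the greedy replacement step are assumptions chosen to mirror Equations (1)–(14) as described in the text.

```python
import numpy as np

def jellyfish_search(cost_fn, dim, lb, ub, n_pop=30, max_iter=100, beta=3.0, gamma=0.1):
    """Minimal sketch of the jellyfish search optimizer (minimization)."""
    rng = np.random.default_rng(0)
    X = lb + rng.random((n_pop, dim)) * (ub - lb)          # initial population
    cost = np.array([cost_fn(x) for x in X])
    best = X[cost.argmin()].copy()

    for t in range(1, max_iter + 1):
        c_t = abs((1 - t / max_iter) * (2 * rng.random() - 1))        # time control, Eq. (14)
        for i in range(n_pop):
            if c_t >= 0.5:                                            # ocean-current movement
                trend = best - beta * rng.random() * X.mean(axis=0)   # Eqs. (3)-(8)
                X_new = X[i] + rng.random(dim) * trend
            elif rng.random() > (1 - c_t):                            # passive group motion, Eq. (11)
                X_new = X[i] + gamma * rng.random(dim) * (ub - lb)
            else:                                                     # active group motion, Eqs. (12)-(13)
                j = rng.integers(n_pop)
                step = (X[j] - X[i]) if cost[j] < cost[i] else (X[i] - X[j])
                X_new = X[i] + rng.random(dim) * step
            X_new = np.clip(X_new, lb, ub)                            # keep jellyfish inside the bounds
            new_cost = cost_fn(X_new)
            if new_cost < cost[i]:                                    # greedy replacement (assumption)
                X[i], cost[i] = X_new, new_cost
        best = X[cost.argmin()].copy()
    return best, cost.min()
```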
Machine Learning refers to the use of computer algorithms that can learn to perform a particular task from sample data without explicitly programmed instructions. ML uses advanced statistical techniques to learn distinctive patterns from training data to make the most accurate predictions of new data. In applications such as disease prediction, ML models can often be developed using supervised learning methods. Supervised learning requires that training samples are correctly labeled. In its simplest form, the output is a binary variable with a value of 1 for patient subjects and 0 for healthy subjects. To obtain robust ML models, it is recommended to use balanced training samples from healthy and patient subjects. If several diseases are to be included in the ML model, the binary classification can be easily extended to the multi-class case. Therefore, supervised learning algorithms associate input variables with labeled outputs. In this study, we compare the performance of four different ML models using supervised learning, such as ANN, DT, Adaboost, and SVM.
The ANN used here is a multilayer perceptron, one of the most basic and popular neural network models. It is a network with one or more hidden layers and is often used to solve classification or regression problems. An ANN consists of an input layer, one or more hidden layers, and an output layer, and each layer contains one or more nodes (neurons). The input layer introduces data into the network and contains a node for each attribute. The hidden layers process the data. The output layer produces the results and, in classification problems, contains a node for each class. The ANN works by multiplying each node's inputs by their weights, passing the sum through an activation function, and computing the output. The activation function determines the output of each node; non-linear functions such as the sigmoid, ReLU, or tanh are often used. During training, the weights are initialized randomly and then optimized using the backpropagation algorithm, which minimizes the difference between the target outputs and the outputs of the network. ANNs can be applied to many different types of data, can be combined with other neural network models, and can be extended to solve more complex problems.
The DT algorithm tries to classify data using a tree structure. The algorithm creates a set of decision rules that parse data according to a specific set of features. This set of decision rules is interconnected along the branches of the tree, forming a decision tree. Each branch corresponds to a decision rule, and each leaf node provides a class or value estimate. The algorithm helps to separate the classes by parsing the data. Each decomposition is accomplished by selecting a feature and dividing it among the values of that feature.
Adaboost (Adaptive Boosting) is an ML algorithm used to solve classification and regression problems. Adaboost algorithm works by combining weak classifiers (weak learners) into strong classifiers (strong learners). The algorithm starts by weighing each sample in the dataset. Initially, each sample has an equal weight. Then, a weak classifier is trained, and this classifier is selected considering the classification accuracy. The selected classifier reduces the weight of the samples it classifies as correct and increases the weight of the samples it classifies as incorrect. Next, a new weak classifier is trained with the weighted samples, and the process is repeated. This process continues until a predetermined number of weak classifiers are trained. Finally, a weighted vote is performed according to the classification accuracy of each weak classifier. As a result of this voting, a powerful classifier is obtained for classifying the given samples.
SVM is a preferred ML algorithm because it is resistant to outliers and gives good results when the data size grows. SVM represents data points in an n-dimensional space and tries to find the best hyperplane separating samples belonging to different classes. However, in some cases, data points cannot be separated linearly. In these cases, the SVM’s solution is found using more complex hyperplanes. The kernel trick allows the SVM to work with data that can be separated more easily in higher dimensional spaces by moving the data to higher dimensional spaces (kernel space). This allows it to perform the separation using more complex hyperplanes for the non-linearly separable dataset. The kernel trick works by using different kernel functions, especially the radial basis function (RBF) and the polynomial kernel. These kernel functions operate based on the properties of data points (distance, similarity, inner product, etc.) and allow the SVM to find an appropriate hyperplane that it can use to separate data in higher dimensional spaces.
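For illustration, the four classifiers described above can be instantiated and compared with 10-fold cross-validation as in the sketch below. The study itself used MATLAB R2022a, so this scikit-learn version and its hyperparameters (hidden layer sizes, number of boosting rounds, RBF kernel) are assumptions, not the exact settings of the experiments.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Illustrative stand-ins for the four models compared in this study
models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
}

def compare_models(X, y, cv=10):
    """Mean 10-fold cross-validated accuracy for each model."""
    return {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
```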
The main aim of this study is to provide clinicians with a tool to help them diagnose heart problems early. Therefore, it will be easier to effectively treat patients early and avoid serious consequences. In this study, the performance of different ML models using the Jellyfish algorithm and feature selection for heart disease prediction was compared, and we attempted to obtain the highest performance ML model. The summary of the proposed method is shown in Figure 6 . As seen in Figure 6 , firstly, the Jellyfish algorithm that was presented in 2021 was applied to the dataset to obtain the best features. The Jellyfish algorithm tries to find optimal solutions to various optimization problems by simulating the intelligent behavior of jellyfish. The Jellyfish algorithm does not get stuck in local minimums and reaches the global minimum faster than other optimization algorithms. The algorithm has attracted great attention around the world due to its simplicity of implementation, few parameters, and flexibility. Because of these advantages, the Jellyfish algorithm was preferred in this study to select the best features from the dataset. The Jellyfish algorithm has an effective feature selection role, and a binary version of it is used in this study. This algorithm starts with a population, which is a collection of potential solutions with the best features. The best features are selected for transfer to the next step in each iteration of the algorithm, which ultimately results in the best solution for the features. After creating a new dataset with the best features, this dataset was used for training four different classifiers such as ANN, DT, Adaboost, and SVM. The ML models obtained after the training were tested, and their performances were compared using metrics such as Accuracy, Sensitivity, Specificity, and Area Under Curve, and the ML model with the best performance was selected. A 10-fold cross-validation was used in the training and test phase of ML algorithms. This selected model has high performance in separating and classifying new data samples into two classes as no disease and diseased. In this study, MATLAB (version R2022a) was used for feature selection and classification.
Flowchart of the proposed approach for heart disease prediction.
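A minimal sketch of the feature-selection step is given below, assuming the binary use of the optimizer outlined earlier: each candidate position is thresholded into a feature mask, and the cost is one minus the cross-validated SVM accuracy on the selected features. The threshold of 0.5 and the function names are illustrative assumptions; the authors' MATLAB implementation may differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def feature_subset_cost(position, X, y, cv=10):
    """Cost of a candidate subset: 1 - cross-validated SVM accuracy on the selected columns."""
    mask = position > 0.5                 # binarize the continuous jellyfish position
    if not mask.any():                    # penalize empty feature subsets
        return 1.0
    acc = cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=cv).mean()
    return 1.0 - acc

# Hypothetical usage with the jellyfish_search sketch above (X, y: feature matrix and labels):
# best_pos, _ = jellyfish_search(lambda p: feature_subset_cost(p, X, y),
#                                dim=X.shape[1], lb=0.0, ub=1.0)
# selected_features = best_pos > 0.5      # mask of features passed on to the classifiers
```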
3.1. Performance Metrics
A table known as the confusion matrix is used to evaluate the performance of ML models. The confusion matrix is a table showing the difference between the actual and predicted classes. Each row of the confusion matrix represents an instance in the predicted class, while each column represents an instance in the real class (and vice versa). The confusion matrix usually contains four different terms: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
True Positive (TP) refers to cases where actual positives are correctly predicted as positive. False Positive (FP) refers to cases where actual negatives are incorrectly predicted as positive.
True Negative (TN) refers to situations where what is negative is correctly predicted as negative.
False Negative (FN) refers to cases where actual positives are incorrectly predicted as negative.
Using these terms, performance metrics such as Accuracy, Sensitivity, Specificity, and Area Under Curve (AUC) are calculated. These evaluation criteria, commonly used in the context of binary classification tasks, are calculated as follows.
Accuracy: the proportion of true predictions (both true positives and true negatives) out of all predictions. It is calculated as (TP + TN)/(TP + TN + FP + FN).
Sensitivity (also called recall or true positive rate): the proportion of true positives out of all actual positive cases. It is calculated as TP/(TP + FN).
Specificity: the proportion of true negatives out of all actual negative cases. It is calculated as TN/(TN + FP).
Area Under the Curve (AUC): the area under the ROC (Receiver Operating Characteristic) curve, taking a value between 0 and 1. An AUC of 0 means the classifier predicts every class incorrectly, while an AUC of 1 means it predicts every class correctly.
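As a quick check of these formulas, the first three metrics can be computed directly from the confusion-matrix counts, as in the small example below (AUC, by contrast, is computed from predicted scores via the ROC curve rather than from the four counts alone); the counts shown are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall / true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return accuracy, sensitivity, specificity

# Example with made-up counts: 98 TP, 2 FP, 96 TN, 4 FN
print(classification_metrics(98, 2, 96, 4))   # -> (0.97, 0.9608, 0.9796) approximately
```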
In this section, the proposed method has been applied to the test data, and the results have been compared across the ML methods ANN, Decision Tree, AdaBoost, and SVM. Four performance metrics, Sensitivity, Specificity, Accuracy, and Area Under Curve, have been calculated. In total, 70% of the data were selected for training and 30% for testing. Other training/testing ratios were also tried, but the best performance was obtained with these percentages. The performance evaluation results of the ML models without feature selection by the Jellyfish algorithm are given in Table 2.
Performance comparison of different ML models without the Jellyfish algorithm.
Model | Sensitivity (%) | Specificity (%) | Accuracy (%) | AUC (%) |
---|---|---|---|---|
ANN | 97.53 | 98.63 | 98.08 | 69.03 |
Decision Tree | 97.69 | 97.17 | 97.43 | 75.83 |
AdaBoost | 97.22 | 98.47 | 97.84 | 78.82 |
SVM | 98.21 | 97.96 | 98.09 | 90.21 |
According to the results of the studies, the classification accuracy of the ANN, DT, AdaBoost, and SVM classifier models was 98.08%, 97.43%, 97.84%, and 98.09%, respectively. The SVM classifier model was the most accurate when compared to the other ML models, and the accuracy rose to 98.09%. The results as graphical illustrations are shown in Figure 7 .
Graphical representation of performance evaluation results of ML models without feature selection.
The performance evaluation results of the ML models, when feature selection is applied with the Jellyfish optimization algorithm, are given in Table 3 .
Performance comparison of different ML models when applying feature selection with the Jellyfish algorithm.
Model with JF | Sensitivity (%) | Specificity (%) | Accuracy (%) | AUC (%) |
---|---|---|---|---|
ANN with JF | 98.22 | 98.89 | 97.99 | 79.33 |
DT with JF | 98.07 | 98.34 | 97.55 | 81.98 |
AdaBoost with JF | 98.12 | 98.07 | 98.24 | 84.92 |
SVM with JF | 98.56 | 98.37 | 98.47 | 94.48 |
According to the results of the studies, the accuracy of the ANN–JF, DT–JF, AdaBoost–JF, and SVM–JF was 97.99%, 97.55%, 98.24%, and 98.47%, respectively. The SVM-based Jellyfish approach was the most accurate when compared to the other methods, and the accuracy rose to 98.47% when feature selection was combined with the Jellyfish algorithm. The results as a graphical illustration are shown in Figure 8 .
Graphical representation of performance evaluation results of ML models with feature selection.
The method of combining feature selection based on the Jellyfish optimization algorithm and SVM has higher Area Under Curve values than the other methods. In this method, the best features can be selected by using the Jellyfish algorithm and the SVM method to classify the data more accurately than other ML methods.
Furthermore, a case comparison between the current study and references [ 26 , 27 ] has been conducted by the classification accuracy evaluation criteria, with the findings displayed in Table 4 .
Comparison of the approach proposed in this study with some studies in the literature in terms of classification accuracy.
Reference | Dataset | Accuracy (%) |
---|---|---|
[ ] | Cleveland and Statlog heart dataset | 89 |
[ ] | Cleveland heart dataset | 88.5 |
[ ] | Cleveland heart dataset | 94.6 |
[ ] | Cleveland and Statlog heart dataset | 85.29 |
[ ] | Cleveland heart dataset | 91.8 |
[ ] | Cleveland heart dataset | 90.16 |
[ ] | South African heart dataset | 78.1 |
Proposed method (Jellyfish + SVM) | Cleveland heart dataset | 98.47 |
The suggested approach in this study achieves favorable outcomes in the evaluation criteria. The classification accuracy of its prediction of heart disease is also higher than that of some studies in the literature and comparable techniques.
As seen in Table 4, the proposed method reached 98.47% accuracy. This result shows that the optimum features can be used for heart disease diagnosis. The best features selected by the Jellyfish algorithm improve the accuracy of the results, because the features that the Jellyfish algorithm does not select can degrade classification performance if retained. In classical methods such as Principal Component Analysis (PCA), by contrast, some features that are not very important may still be selected, which can reduce classifier performance.
The best cost of the feature selection, the Root Mean Square Error, and the accuracy of the proposed method are shown in Figure 9a–c, respectively.
( a ) Best cost of feature selection, ( b ) Root Mean Square Error, and ( c ) accuracy of the proposed method.
As seen in Figure 9 a, the best cost of feature selection is obtained in 50 iterations, and this value is 0.0004, which is close to zero. Also, Figure 9 b shows the Root Mean Square Error that reached 0.030 in the fourth iteration.
Heart Valve Disease refers to any condition that affects the heart valves. The heart has four valves, known as mitral, tricuspid, aortic, and pulmonary, which open and close to allow blood to flow in one direction through the heart. Heart Valve Disease occurs when one or more of the valves work improperly. When the valves are healthy, they keep blood flowing smoothly through the heart and body. But when the valves are diseased, they may not open and close properly, causing blood to back up or leak in the wrong direction. Procedures to repair or replace heart valves can include balloon valvuloplasty, surgical valve repair, or surgical valve replacement.
Heart Failure is a condition in which the heart is unable to pump enough blood to meet the body’s needs. The heart may be weakened, stiffened, or damaged, and is unable to efficiently circulate blood throughout the body. This can lead to fluid build-up in the lungs, legs, and other areas of the body. There are two main types of heart failure: systolic and diastolic. Systolic heart failure occurs when the heart’s ability to contract and pump blood is impaired, while diastolic heart failure occurs when the heart is stiff and unable to fill with blood properly. Heart failure can be caused by a variety of factors, including coronary artery disease, high blood pressure, heart valve disease, heart attack, and certain medications.
The findings show that, compared with previous approaches, the proposed strategy improves the accuracy of heart disease diagnosis. The results of this study demonstrate the potential of artificial intelligence, particularly ML, to significantly influence heart disease diagnostic decisions. The steady increase in computing power and increased data availability through mobile apps and the digital transformation of the global healthcare system are driving the growth of artificial intelligence and ML further. Therefore, future research will continue to use these techniques to translate them into routine clinical practice, thus paving the way for improved diagnostic decision-making suited to the specific needs of individual patients.
Machine learning algorithms for the diagnosis of heart diseases may have significant potential in the medical diagnosis process. These algorithms can be trained on datasets to perform tasks such as diagnosing specific heart diseases, assessing risk factors, and recommending treatment options. However, the potential risks and problems of these applications should also be considered. Several aspects of this debate can be addressed:
Data quality and accuracy: The proposed algorithm requires sufficient and high-quality data to produce accurate and reliable results. Therefore, the datasets used should not contain incomplete, inaccurate, or misleading data. Especially in a field such as heart disease, misdiagnosis recommendations can be errors that can have serious consequences.
Understandability of the algorithm: It may be necessary to explain to doctors how the algorithm and its parameters work. If doctors do not understand the decision processes of the algorithm, they may find it difficult to fully trust its results.
Data privacy and security: Privacy and security concerns may arise when using patients’ medical data. It is important that the data are properly stored and protected from unauthorized access and malicious use. This should be considered during the implementation of the algorithms in clinical practice.
Physician–patient relationship: Some patients may find it difficult to trust their doctors regarding a diagnosis or treatment recommendation made by the algorithm, or may be skeptical about the results of the algorithm. The proposed algorithm should only be considered as a tool to assist physicians in their decision-making process. It should not be perceived as interfering with doctors’ decision-making.
This study aimed to obtain a highly accurate and reliable intelligent medical diagnosis model based on ML with the Jellyfish optimization algorithm using the Cleveland data set for early prediction of heart disease. One of the important factors affecting the performance of an ML model is the number of features in the dataset used. Choosing the right features can help the model better understand the data and give more accurate results. Selecting the right features can improve the performance of the model, while selecting too many features can increase the complexity of the model and cause overfitting. Therefore, the number of features must be accurately determined. To avoid the overfitting problem due to the large number of features in the Cleveland dataset used in this study, the best features were selected from the dataset by using the Jellyfish algorithm. The Jellyfish algorithm is a swarm-based metaheuristic algorithm that can be used with ML methods to optimize hyperparameters. The optimum features obtained from the dataset were used in the training and testing stages of four different ML algorithms (ANN, DT, AdaBoost, and SVM). Then, the performances of the obtained models were compared. The results show that the accuracy rates of all ML models improved after the dataset was subjected to feature selection with the Jellyfish algorithm. The highest classification accuracy (98.47%) was obtained with the SVM model trained using the dataset optimized with the Jellyfish algorithm. The Sensitivity, Specificity, Accuracy, and AUC for SVM without using the Jellyfish algorithm were obtained at 98.21%, 97.96%, 98.09%, and 90.21%, respectively. However, by using the Jellyfish algorithm, these values have been obtained as 98.56%, 98.37%, 98.47%, and 94.48%, respectively.
This research received no external funding.
Conceptualization, A.A.A., H.P.; methodology, A.A.A.; software, A.A.A.; validation, A.A.A., H.P.; formal analysis, H.P.; investigation, A.A.A., H.P.; resources, A.A.A., H.P.; data curation, A.A.A.; writing—original draft preparation, A.A.A.; writing—review and editing, A.A.A., H.P.; visualization, H.P.; supervision, H.P. All authors have read and agreed to the published version of the manuscript.
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent was obtained from all subjects involved in the study.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
In recent years, the number of cases of heart disease has been increasing greatly, and heart disease is associated with a high mortality rate. Moreover, with the development of technology, advanced types of equipment have been invented to help patients measure health conditions at home and predict the risk of having heart disease. This research aims to find the accuracy of self-measurable physical health indicators compared to the full set of indicators measured by healthcare providers in predicting heart disease using five machine learning models. Five models were used to predict heart disease: Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest. The database used for the research contains 13 types of health test results and the heart disease risk for 303 patients. The all matrices consisted of all 13 test results, while the home matrices included the 6 results that can be tested at home. After constructing the five models for both the home matrices and all matrices, the accuracy score and false negative rate were computed for each of the five models. The results showed that all matrices had higher accuracy scores than home matrices in all five models. The false negative rates were lower or equal for all matrices than for home matrices across the five machine learning models. The conclusion drawn from the results is that home-measured physical health indicators are less accurate than the full set of physical indicators in predicting patients’ risk of heart disease. Therefore, without further development of home-testable indicators, the full set of physical health indicators is preferred for measuring heart disease risk.
Machine Learning, Data Visualization, Feature Engineering, Health, Heart Disease
1. Introduction
Heart disease, caused by abnormal heart and blood vessel conditions, is widely considered a direct threat to human life and health. It is one of the significant diseases exerting irreversible effects on many middle-aged and older people, in whom fatal complications are highly likely to result [1]. Makino states that the absolute risk of cardiovascular heart disease is associated with disability and death among people 65 years or older [2]. The World Health Organization (WHO) declared that an estimated 17.7 million people died from cardiovascular disorders in 2015, accounting for one-third of all deaths that year [3]. According to the Australian Bureau of Statistics, heart ailments were one of Australia’s two leading causes of mortality [4]. Owing to its extremely negative influence on human health, a great deal of effort has been devoted to studying the onset of heart disease, trying to prevent and reduce its incidence in a timely and efficacious manner. Moreover, to prevent the adverse effects of heart disease, it is advisable to use sophisticated equipment to detect potential heart risks in advance. Currently, qualified health organizations can conduct many tests, including blood tests, echocardiography, chest X-rays, magnetic resonance imaging (MRI), electrocardiograms, physical examinations, and exercise stress tests, that provide medical doctors with valuable information for their diagnosis and their views on the patient’s heart failure risk level [5].
There are several risk factors for heart failure, corresponding to different test indexes. A significant amount of relevant research has been carried out to reveal the potential attributes of a heart attack. Heart disease risk depends on factors such as sex, age, smoking, hypertension, and diabetes [6]. Wilson et al. [7] suggest that indexes including blood pressure, total cholesterol, and age are essential in predicting coronary heart disease. The effects of sex differences on traditional cardiovascular risk factors are considered notable [8]. Heart rate is also a powerful indicator of a patient’s potential heart attack risk [9]. The attributes of heart disease can be roughly divided into two types according to whether the indicators can be measured at home. It is considered worthwhile to compare the accuracy of indicators measured at home with those measured in hospitals, which is useful for future heart disease testing.
Computational technology and statistical approaches have been popular in discovering the relationship between heart disease and patients’ health conditions [10] [11]. They can help predict the potential risk of heart disease in advance, based on the patient’s underlying physical condition, thereby reducing the probability of dying from a heart attack. Many statistical methods based on computer calculation have been applied to predict heart attacks [12]. Due to its high accuracy, SVM has been widely applied as a classification method to predict heart attacks [13]. Akkaya used logistic regression and the k-NN algorithm to estimate heart failure and achieved promising outcomes [13]. With the adoption of Random Forest, the best accuracy of 82.18% has been achieved through modification of feature selection [14]. These algorithms have been proven to predict the risk of heart disease effectively, which helps researchers and doctors make better judgments about heart disease.
Although these machine learning techniques have been acknowledged and refined continuously to improve prediction performance, few investigators have examined and compared the accuracy of home-tested versus in-hospital measures for predicting heart disease risk. If the indicators measured at home can predict a patient’s risk of heart disease well, then patients can be tested by themselves or their families instead of having to go to the hospital for testing. Therefore, the innovation of this article lies in that it not only uses five machine learning algorithms to model data on heart patients, but also compares these algorithms’ predictive contributions for heart disease using indicators measured at home versus those measured in the hospital.
This study aims to compare patients’ physical condition indicators measured at home and in the hospital, using five different prediction methods to explore their accuracy in heart disease prediction. Accordingly, the research question “How do machine learning algorithms perform with only self-measurable physical condition indicators compared to algorithms with all physical condition indicators?” will be answered.
2. Data Description
We used the Cleveland heart data set from the UCI machine learning repository. The selected data are made up of 14 variables and 303 instances. Overall, there are 13 predictor variables and 1 categorical response variable (target). Among these variables, the numerical variables are age, trtbps, chol, thalach, and oldpeak; the categorical variables are sex, exang, cp, fbs, rest_ecg, slp, thall, and target. The table below explains the meaning of each variable; detailed information can be seen in Table 1.
From Figure 1 we can see that in the data set, most patients with heart attack are aged between 50 and 60, while only a few people with heart failure are aged under 30 or above 70. The range of this attribute is 29 to 77, illustrating the wide span of ages.
Figure 1 . Age of heart diseased patients.
Chol is the cholesterol of the patients, fetched via BMI sensor. According to Figure 2, most patients’ cholesterol is around 230 mg/dl, and the overall distribution is slightly right-skewed.
According to Figure 3, most patients’ maximum heart rates cluster between 140 and 180. A few patients have extremely low or high heart rates, specifically below 100 or above 200.
When it comes to resting blood pressure (Figure 4), a great number of patients have resting blood pressure around 100 to 140 mmHg. Only a few have abnormal values of around 160 mmHg or below 100 mmHg.
Table 1 . Variable description.
Figure 2 . Chol of heart diseased patients.
Figure 3 . Maximum heart rate achieved of heart diseased patients.
Figure 4 . Resting blood pressure of heart diseased patients.
3. Methodology
3.1. Data Processing
For data description, the research utilized the describe function and pandas profiling in Python to summarize the dataset. The raw data contained 14 variables for 303 patients. Chi-square values, extra-trees classifiers, and correlation matrices were computed for data analysis. The Chi-square values and correlation matrices showed that no variables were highly correlated, and all variables were selected for model building. Moreover, all numerical variables were standardized using StandardScaler.
The 13 independent variables were divided into home matrices and all matrices. Home matrices consisted of 6 variables—age, sex, resting blood pressure, cholesterol, fasting blood sugar, and thalassemia. All matrices included all 13 independent variables. The research created the training set and test sets with 80% training data and 20% testing data.
The helper function was used in Python to show each model’s accuracy score, false negative rate, and confusion matrix. The accuracy score was used to measure the percentage of correctly predicted patients who had or did not have a risk for heart disease. The score showed the accuracy of each model in predicting the correct heart disease risks for patients. The false negative rate measured the percentage of patients with a high risk for heart disease but was mispredicted as having a low risk for heart disease. The false negative rate was significant because misprediction may lead to late treatment for the patients. Those values were used in the final model comparison to conclude the accuracy of self-measured home matrices compared to all matrices.
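The preprocessing described above can be sketched as follows. The file name, the exact column names (taken from Table 1), and the helper function are assumptions for illustration; the paper scaled the numerical variables, whereas for brevity this sketch scales every selected column.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                 # assumed file: 303 patients, 14 columns

home_cols = ["age", "sex", "trtbps", "chol", "fbs", "thall"]   # self-measurable indicators
all_cols = [c for c in df.columns if c != "target"]            # all 13 indicators

def make_split(columns, test_size=0.2, seed=42):
    """Scale the chosen feature columns and return an 80/20 train/test split."""
    X = StandardScaler().fit_transform(df[columns])
    y = df["target"]
    return train_test_split(X, y, test_size=test_size, random_state=seed)

X_train_home, X_test_home, y_train_home, y_test_home = make_split(home_cols)
X_train_all, X_test_all, y_train_all, y_test_all = make_split(all_cols)
```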
3.2. Machine Learning Algorithms
The research built five models for both the home matrices and all matrices.
3.2.1. Logistic Regression
Logistic Regression is a model for predicting a binary outcome using the observations of a data set. The research selected this model because the output variable is a binary outcome, taking either high risk or no risk for heart disease. LogisticRegression from the sklearn package in Python was used to build the model. The liblinear solver (a library for large linear classification) was chosen for the logistic models because the dataset size is relatively small.
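A hedged sketch of this configuration, reusing the 80/20 split from the preprocessing sketch above:

```python
from sklearn.linear_model import LogisticRegression

# liblinear: the "library for large linear classification" solver, suited to small datasets
logreg = LogisticRegression(solver="liblinear", random_state=42)
logreg.fit(X_train_all, y_train_all)
print("Logistic Regression accuracy:", logreg.score(X_test_all, y_test_all))
```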
3.2.2. K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a classification algorithm that estimates the likelihood of a data point belonging to a group according to its distance to the nearest points. The research tried 1 to 20 as the number of neighbors. The K Neighbors Classifier score was calculated for each number of neighbors, and a line chart was created with the number of neighbors as x and the score as y. The research chose K = 8 since it had the highest score.
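The k-selection procedure described above might look like the following sketch (the plotted line chart is omitted; the paper reports k = 8 as the best value):

```python
from sklearn.neighbors import KNeighborsClassifier

knn_scores = []
for k in range(1, 21):                                   # try 1 to 20 neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_all, y_train_all)
    knn_scores.append(knn.score(X_test_all, y_test_all))

best_k = knn_scores.index(max(knn_scores)) + 1           # index 0 corresponds to k = 1
print("Best k:", best_k)
```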
3.2.3. Support Vector Machine
The Support Vector Machine was chosen as one of the models because it is an algorithm for both classification and regression. The research used svm from the sklearn.svm package in Python. The radial basis function kernel was selected, gamma was set to 0.01, and the regularization parameter was set to 1 for the two machine learning models.
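A sketch of the stated SVM configuration (RBF kernel, gamma = 0.01, regularization parameter C = 1), again reusing the earlier split:

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf", gamma=0.01, C=1)     # RBF kernel with the reported hyperparameters
svm.fit(X_train_all, y_train_all)
print("SVM accuracy:", svm.score(X_test_all, y_test_all))
```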
3.2.4. Decision Tree
The Decision Tree was chosen because it is a nonparametric machine learning model for classification and regression. The research drew a line graph using the maximum depth from 1 to 30 as x and the Decision Tree Classifier score as y. A maximum depth of 10 was picked for model building because it had the highest score.
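The depth search can be sketched in the same way as the KNN search; the paper reports a maximum depth of 10 as the best choice:

```python
from sklearn.tree import DecisionTreeClassifier

depth_scores = []
for depth in range(1, 31):                               # maximum depth from 1 to 30
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train_all, y_train_all)
    depth_scores.append(tree.score(X_test_all, y_test_all))

best_depth = depth_scores.index(max(depth_scores)) + 1
print("Best maximum depth:", best_depth)
```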
3.2.5. Random Forest
Random Forest is an ensemble algorithm consisting of decision trees. RandomForestClassifier from the sklearn.ensemble package was used to build the home and all matrices models. The number of estimators was set to 1000 in both the home and all matrices models.
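A sketch of the Random Forest configuration, together with the accuracy score and false negative rate used in the comparison that follows:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X_train_all, y_train_all)

tn, fp, fn, tp = confusion_matrix(y_test_all, rf.predict(X_test_all)).ravel()
print("Accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("False negative rate:", fn / (fn + tp))
```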
Raw data, after some preprocessing, are fed into machine learning algorithms. Afterward, the accuracy score and the false negative rate are obtained.
4.1. Accuracy
According to Table 2 , the Logistic Regression and Support Vector Model have the highest accuracy score at 88.52% within the machine learning algorithms with all physical condition indicators. In comparison, the Decision Tree has the lowest accuracy score with only 85.25%. Within the machine learning algorithms with only physical condition indicators measured at home, Logistic Regression has the highest accuracy score at 73.77%, while the Support Vector Model has the lowest accuracy at only 68.85%.
After comparing the accuracy between machine learning algorithms with only physical condition indicators measured at home and algorithms with all physical condition indicators, it is concluded that algorithms with only physical condition indicators measured at home do not perform as accurately as algorithms with all physical condition indicators. The difference in accuracy ranges from 14.75% to 19.67%.
4.2. False Negative Rate
From the false negative rate perspective ( Table 3 ), it is observed that the Decision Tree has the highest false negative rate within the algorithms with all physical condition indicators. In contrast, Logistic Regression has the lowest false negative rate. Within the algorithms with only physical condition indicators measured at home, K Nearest Neighbors and Random Forest have the highest false negative rate, while Decision Tree has the lowest false negative rate.
Table 2 . The table shows the accuracy score of machine learning algorithms with all physical condition indicators and only self-measurable indicators. Orange represents the algorithm with the highest accuracy score. Green represents the algorithm with the lowest accuracy score.
Table 3 . The table shows the false negative rate of machine learning algorithms with all physical condition indicators and only self-measurable indicators. Orange represents the algorithm with the highest false negative rate. Green represents the algorithm with the lowest false negative rate.
After comparison, it is concluded that machine learning algorithms with all physical condition indicators have a much lower false negative rate than algorithms with only physical condition indicators measured at home. Note that the false negative rate for the Decision Tree is the same for both groups. This is probably due to the randomness of the data splitting process, as the test set is only 20% of the entire data set, which is about 60 data samples. The difference between the algorithms ranges from 0% to 17.65%.
5. Conclusions and Discussion
5.1. Conclusion
To answer the research question of this study, it is concluded that the machine learning algorithms with only self-measurable physical condition indicators do not predict as accurately as machine learning algorithms with all physical condition indicators. Not only do algorithms with self-measurable physical condition indicators not predict the heart disease outcome as accurately as algorithms with all physical condition indicators, but they are also more likely to falsely predict not having heart disease among patients with heart disease. Thus, machine learning algorithms with only self-measurable physical condition indicators should not be used until more indicators are measurable at home in the future.
5.2. Study Limitation
The findings of this study have to be seen in light of some limitations. It is noteworthy that the dataset used in this study is a subset of the original database, which contained 76 attributes rather than the 14 used here. Among the original 76 attributes, other attributes might be measurable at home and could thus improve the accuracy and reduce the false negative rate of the machine learning algorithms that use only self-measurable physical condition indicators.
5.3. Future Work
The limitations of this study have indicated the following areas as recommendations for future work. First, include other health attributes from the original dataset to discover the machine learning algorithm with the highest accuracy and lowest false negative rate. Second, since every patient has different health conditions, it is recommended to group the patients with similar health conditions and ages to investigate each machine learning algorithm’s accuracy and false negative rate.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
[ ] | Heron, M. (2012) Deaths: Leading Causes for 2008. National Vital Statistics Reports: From the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System, 60, 1-94. |
[ ] | Makino, K., Lee, S., Bae, S., Chiba, I., Harada, K., Katayama, O., Shinkai, Y. and Shimada, H. (2021) Absolute Cardiovascular Disease Risk Assessed in Old Age Predicts Disability and Mortality: A Retrospective Cohort Study of Community-Dwelling Older Adults. Journal of the American Heart Association, 10, e022004. https://doi.org/10.1161/JAHA.121.022004 |
[ ] | WHO (2017) Cardiovascular Diseases. http://www.who.int/mediacentre/factsheets/fs317/en/ |
[ ] | ABS (2009) Causes of Death, Australia. Australian Bureau of Statistics. http://abs.gov.au/ausstats/[email protected]/Products/696C1CF9601E4D8DCA25788400127BF0?opendocument |
[ ] | AHA (2017) American Heart Association. http://www.heart.org |
[ ] | Liu, X., Wang, X.L., Su, Q., Zhang, M., Zhu, Y.H., Wang, Q.G. and Wang, Q. (2017) A Hybrid Classification System for Heart Disease Diagnosis Based on the RFRS Method. Computational and Mathematical Methods in Medicine, 2017, 1-11. https://doi.org/10.1155/2017/8272091 |
[ ] | Wilson, P.W.F., D’Agostino, R.B., Levy, D., Belanger, A.M., Silbershatz, H. and Kannel, W.B. (1998) Prediction of Coronary Heart Disease Using Risk Factor Categories. Circulation, 97, 1837-1847. https://doi.org/10.1161/01.CIR.97.18.1837 |
[ ] | Liu, W., Tang, Q., Jin, J., et al. (2021) Sex Differences in Cardiovascular Risk Factors for Myocardial Infarction. Herz, 46, 115-122. https://doi.org/10.1007/s00059-020-04911-5 |
[ ] | Lee, H.G., Noh, K.Y. and Ryu, K.H. (2007) Mining Biosignal Data: Coronary Artery Disease Diagnosis Using Linear and Nonlinear Features of HRV. In: Emerging Technologies in Knowledge Discovery and Data Mining, PAKDD 2007, Lecture Notes in Computer Science, Vol. 4819, Springer, Berlin, Heidelberg. |
[ ] | Nahar, J., Imam, T., Tickle, K.S. and Chen, Y.-P.P. (2013) Association Rule Mining to Detect Factors Which Contribute to Heart Disease in Males and Females. Expert Systems with Applications, 40, 1086-1093. https://doi.org/10.1016/j.eswa.2012.08.028 |
[ ] | Desai, F., Chowdhury, D., Kaur, R., Peeters, M., Arya, R.C., Wander, G.S., Gill, S.S. and Buyya, R. (2022) HealthCloud: A System for Monitoring Health Status of Heart Patients Using Machine Learning and Cloud Computing. Internet of Things, 17, Article ID: 100485. https://doi.org/10.1016/j.iot.2021.100485 |
[ ] | Nahar, J., Imam, T., Tickle, K.S. and Chen, Y.P.P. (2013) Computational Intelligence for Heart Disease Diagnosis: A Medical Knowledge Driven Approach. Expert Systems with Applications, 40, 96-104. https://doi.org/10.1016/j.eswa.2012.07.032 |
[ ] | Xing, Y.W., Wang, J., Zhao, Z.H. and Gao, Y.H. (2007) Combination Data Mining Methods with New Medical Data to Predicting Outcome of Coronary Heart Disease. Convergence Information Technology, Gwangju, 21-23 November 2007, 868-872. https://doi.org/10.1109/ICCIT.2007.204 |
[ ] | Akkaya, B., Sener, E. and Gursu, C. (2022) A Comparative Study of Heart Disease Prediction Using Machine Learning Techniques. 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, 9-11 June 2022, 1-8. https://doi.org/10.1109/HORA55278.2022.9799978 |
Scientific Reports, volume 10, Article number: 19747 (2020)
Heart disease is a fatal human disease that is rapidly increasing globally in both developed and undeveloped countries and, consequently, causes death. Normally, in this disease, the heart fails to supply a sufficient amount of blood to other parts of the body for them to accomplish their normal functions. Early and on-time diagnosis of this problem is essential for preventing further damage to patients and saving their lives. Among the conventional invasive techniques, angiography is considered the most well-known technique for diagnosing heart problems, but it has some limitations. On the other hand, non-invasive methods, such as intelligent learning-based computational techniques, are found to be more suitable and effective for heart disease diagnosis. Here, an intelligent computational predictive system is introduced for the identification and diagnosis of cardiac disease. In this study, various machine learning classification algorithms are investigated. In order to remove irrelevant and noisy data from the extracted feature space, four distinct feature selection algorithms are applied, and the results of each feature selection algorithm along with the classifiers are analyzed. Several performance metrics, namely accuracy, sensitivity, specificity, AUC, F1-score, MCC, and the ROC curve, are used to observe the effectiveness and strength of the developed model. The classification rates of the developed system are examined on both the full and optimal feature spaces; consequently, the performance of the developed model is boosted when the highly variated optimal feature space is used. In addition, the P-value and Chi-square are also computed for the ET classifier along with each feature selection technique. It is anticipated that the proposed system will be useful and helpful for physicians in diagnosing heart disease accurately and effectively.
Introduction
Heart disease is considered one of the most perilous and life-threatening chronic diseases all over the world. In heart disease, the heart typically fails to supply sufficient blood to other parts of the body for them to accomplish their normal functions 1 . Heart failure occurs due to blockage and narrowing of the coronary arteries, which are responsible for supplying blood to the heart itself 2 . A recent survey reveals that the United States is the country most affected by heart disease, with a very high proportion of heart disease patients 3 . The most common symptoms of heart disease include physical weakness, shortness of breath, swollen feet, and fatigue, with associated signs 4 . The risk of heart disease may be increased by a person's lifestyle, such as smoking, an unhealthy diet, a high cholesterol level, high blood pressure, and lack of exercise and fitness 5 . Heart disease has several types, of which coronary artery disease (CAD) is the most common and can lead to chest pain, stroke, and heart attack. Other types include heart rhythm problems, congestive heart failure, congenital heart disease (present at birth), and cardiovascular disease (CVD). Initially, traditional investigation techniques were used for the identification of heart disease; however, they were found to be complex 6 . Owing to the non-availability of medical diagnostic tools and medical experts, specifically in undeveloped countries, diagnosis and treatment of heart disease are very difficult 7 . However, a precise and appropriate diagnosis of heart disease is imperative to prevent further damage to the patient 8 . Heart disease is a fatal disease that is rapidly increasing in both economically developed and undeveloped countries. According to a report generated by the World Health Organization (WHO), an estimated 17.9 million people died from CVD in 2016, representing approximately 30% of all global deaths. According to one report, 0.2 million people die from heart disease annually in Pakistan, and the number of victims increases rapidly every year. The European Society of Cardiology (ESC) has published a report in which 26.5 million adults were identified as having heart disease, with 3.8 million new cases identified each year. About 50–55% of heart disease patients die within the first 1–3 years, and the cost of heart disease treatment is about 4% of the overall annual healthcare budget 9 .
Conventional invasive methods for the diagnosis of heart disease are based on the medical history of a patient, physical test results, and investigation of related symptoms by doctors 10 . Among the conventional methods, angiography is considered one of the most precise techniques for the identification of heart problems. Conversely, angiography has some drawbacks, such as high cost, various side effects, and the need for strong technical expertise 11 . Conventional methods often lead to imprecise diagnoses and take more time due to human error. In addition, this is a very expensive and computationally intensive approach to diagnosis and takes time for assessment 12 .
To overcome the issues in conventional invasive-based methods for the identification of heart disease, researchers attempted to develop different non-invasive smart healthcare systems based on predictive machine learning techniques namely: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayes (NB), and Decision Tree (DT), etc. 13 . As a result, the death ratio of heart disease patients has been decreased 14 . In literature, the Cleveland heart disease dataset is extensively utilized by the researchers 15 , 16 .
In this regard, Robert et al . 17 have used a logistic regression classification algorithm for heart disease detection and obtained an accuracy of 77.1%. Similarly, Wankhade et al . 18 have used a multi-layer perceptron (MLP) classifier for heart disease diagnosis and attained accuracy of 80%. Likewise, Allahverdi et al . 19 have developed a heart disease classification system in which they integrated neural networks with an artificial neural network and attained an accuracy of 82.4%. In a sequel, Awang et al . 20 have used NB and DT for the diagnosis and prediction of heart disease and achieved reasonable results in terms of accuracy. They achieved an accuracy of 82.7% with NB and 80.4% with DT. Oyedodum and Olaniye 21 have proposed a three-phase system for the prediction of heart disease using ANN. Das and Turkoglu 22 have proposed an ANN ensemble-based predictive model for the prediction of heart disease. Similarly, Paul and Robin 23 have used the adaptive fuzzy ensemble method for the prediction of heart disease. Likewise, Tomov et al. 24 have introduced a deep neural network for heart disease prediction and his proposed model performed well and produced good outcomes. Further, Manogaran and Varatharajan 25 have introduced the concept of a hybrid recommendation system for diagnosing heart disease and their model has given considerable results. Alizadehsani et al . 26 have developed a non-invasive based model for the prediction of coronary artery disease and showed some good results regarding the accuracy and other performance assessment metrics. Amin et al . 27 have proposed a framework of a hybrid system for the identification of cardiac disease, using machine learning, and attained an accuracy of 86.0%. Similarly, Mohan et al . 28 have proposed another intelligent system that integrates RF with a linear model for the prediction of heart disease and achieved the classification accuracy of 88.7%. Likewise, Liaqat et al . 29 have developed an expert system that uses stacked SVM for the prediction of heart disease and obtained 91.11% classification accuracy on selected features.
The contribution of the current work is to introduce an intelligent medical decision system for the diagnosis of heart disease based on contemporary machine learning algorithms. In this study, 10 machine learning classification algorithms of different natures, such as Logistic Regression (LR), Decision Tree (DT), Naïve Bayes (NB), Random Forest (RF), and Artificial Neural Network (ANN), are implemented in order to select the best model for timely and accurate detection of heart disease at an early stage. Four feature selection algorithms, Fast Correlation-Based Filter Solution (FCBF), minimal redundancy maximal relevance (mRMR), Least Absolute Shrinkage and Selection Operator (LASSO), and Relief, have been used for selecting the vital and more correlated features that truly reflect the pattern of the desired target. Our developed system has been trained and tested on the Cleveland (S 1 ) and Hungarian (S 2 ) heart disease datasets, which are available online in the UCI machine learning repository. All the processing and computations were performed using the Anaconda IDE. Python has been used as the tool for implementing all the classifiers. The main packages and libraries used include pandas, NumPy, matplotlib, scikit-learn (sklearn), and seaborn. The main contributions of our proposed work are given below:
The performance of all classifiers has been tested on full feature spaces in terms of all performance evaluation matrices specifically accuracy.
The performances of the classifiers are tested on selected feature spaces, selected through various feature selection algorithms mentioned above.
The research study recommends which feature selection algorithm is feasible with which classification algorithm for developing a high-level intelligent system for the diagnosis of heart disease patients.
The rest of the paper is organized as: “ Results and discussion ” section represents the results and discussion, “ Material and methods ” section describes the material and methods used in this paper. Finally, we conclude our proposed research work in “ Conclusion ” section.
This section of the paper discusses the experimental results of various contemporary classification algorithms. At first, the performance of all used classification models i.e. K-Nearest Neighbors (KNN), Decision Tree (DT), Extra-Tree Classifier (ETC), Random Forest (RF), Logistic Regression (LR), Naïve Bayes (NB), Artificial Neural Network (ANN), Support Vector Machine (SVM), Adaboost (AB), and Gradient Boosting (GB) along with full feature space is evaluated. After that, four feature selection algorithms (FSA): Fast Correlation-Based Filter (FCBF), Minimal Redundancy Maximal Relevance (mRMR), Least Absolute Shrinkage and Selection Operator (LASSO), and Relief are applied to select the prominent and high variant features from feature space. Furthermore, the selected feature spaces are provided to classification algorithms as input to analyze the significance of feature selection techniques. The cross-validation techniques i.e. k-fold (10-fold) are applied on both the full and selected feature spaces to analyze the generalization power of the proposed model. Various performance evaluation metrics are implemented for measuring the performances of the classification models.
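Since the study reports using Python and scikit-learn, the evaluation protocol can be sketched roughly as below: each classifier is scored with 10-fold cross-validation on the full feature space using several of the metrics listed above (specificity requires a custom scorer and is omitted here). The classifier list is abbreviated and the hyperparameters are assumptions, not the exact settings of the study.

```python
from sklearn.model_selection import cross_validate
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# X, y: full feature space and labels of the heart disease dataset (assumed already loaded)
classifiers = {
    "ET": ExtraTreesClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
}
scoring = {"acc": "accuracy", "sens": "recall", "auc": "roc_auc",
           "f1": "f1", "mcc": "matthews_corrcoef"}

for name, clf in classifiers.items():
    res = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name, {m: round(res[f"test_{m}"].mean(), 3) for m in scoring})
```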
The experimental outcomes of the applied classification algorithms on the full feature space of the two benchmark datasets, obtained using the 10-fold cross-validation (CV) technique, are shown in Tables 1 and 2, respectively.
The experimental results demonstrate that the ET classifier performed quite well in terms of all performance evaluation metrics compared to the other classifiers using 10-fold CV. ET achieved 92.09% accuracy, 91.82% sensitivity, 92.38% specificity, 97.92% AUC, 92.84% precision, 0.92 F1-score, and 0.84 MCC. Specificity indicates that the diagnostic test was negative and the individual does not have the disease, while sensitivity indicates that the diagnostic test was positive and the patient has heart disease. In the case of the KNN classification model, multiple experiments were performed considering various values of k, i.e., k = 3, 5, 7, 9, 13, and 15. KNN showed the best performance at k = 7, achieving a classification accuracy of 85.55%, 85.93% sensitivity, 85.17% specificity, 95.64% AUC, 86.09% precision, 0.86 F1-score, and 0.71 MCC. Similarly, the DT classifier achieved an accuracy of 86.82%, 89.73% sensitivity, 83.76% specificity, 91.89% AUC, 85.40% precision, 0.87 F1-score, and 0.73 MCC. Likewise, the GB classifier yielded an accuracy of 91.34%, 90.32% sensitivity, 91.52% specificity, 96.87% AUC, 92.14% precision, 0.92 F1-score, and 0.83 MCC. After empirically evaluating the success rates of all classifiers, it is observed that the ET classifier outperformed all the other classification algorithms in terms of accuracy, sensitivity, and specificity, whereas NB showed the lowest performance in terms of accuracy, sensitivity, and specificity. The ROC curves of all classification algorithms on the full feature space are shown in Fig. 1.
ROC curves of all classifiers on full feature space using 10-fold cross-validation on S 1 .
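As a rough illustration of the k-selection procedure described above, the following sketch (not the authors' original code) evaluates KNN for several values of k with 10-fold cross-validation and keeps the best-scoring one; X and y are assumed to be the preprocessed feature matrix and class labels.

```python
# Illustrative KNN k-sweep; X (features) and y (labels) are assumed loaded and preprocessed.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

scores = {}
for k in (3, 5, 7, 9, 13, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)  # the experiments above report k = 7 as the best value
print(best_k, round(scores[best_k], 4))
```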
In the case of dataset S2, which is composed of 1025 instances, of which 525 belong to the positive class and 500 to the negative class, ET again obtained quite good results compared to the other classifiers using the 10-fold cross-validation test: 96.74% accuracy, 96.36% sensitivity, 97.40% specificity, and 0.93 MCC, as shown in Table 2.
FCBF feature selection technique.
The FCBF feature selection technique is applied to select the best subset of the feature space. In this attempt, subspaces of various lengths are generated and tested. The best results are achieved by the classification algorithms on the subset of the feature space with n = 6 using 10-fold CV. Table 3 shows the performance measures of the classifiers executed on the feature space selected by FCBF.
Table 3 demonstrates that the ET classifier obtained quite good results, including an accuracy of 94.14%, 94.29% sensitivity, and a specificity of 93.98%. In contrast, NB reported the lowest performance compared to the other classification algorithms. The performance of the classification algorithms is also illustrated in Fig. 2 using ROC curves.
ROC curve of all classifiers on selected features by FCBF feature selection algorithm.
The mRMR feature selection technique is used to select a subset of features that enhances the performance of the classifiers. The best results are reported on a subset of the feature space with n = 6, as shown in Table 4.
In the case of mRMR, the success rates of the ET classifier are again the best in terms of all performance evaluation metrics compared to the other classifiers. ET attained 93.42% accuracy, 93.92% sensitivity, and a specificity of 93.88%. In contrast, NB achieved the lowest accuracy of 81.84%. Figure 3 shows the ROC curves of all ten classifiers using the mRMR feature selection algorithm.
ROC curve of all classifiers on selected features using the mRMR feature selection algorithm.
To choose an optimal feature space that not only reduces the computational cost but also improves the performance of the classifiers, the LASSO feature selection technique is applied. After performing various experiments on different subsets of the feature space, the best results are again noted on the subspace of n = 6. The outcomes on the best-selected feature space are reported in Table 5 using 10-fold CV.
Table 5 demonstrates that the predicted outcomes of the ET classifier are considerable and better than those of the other classifiers. ET achieved 89.36% accuracy, 88.21% sensitivity, and a specificity of 90.58%. Likewise, GB yielded the second-best result, with an accuracy of 88.47%, 89.54% sensitivity, and a specificity of 87.37%. In contrast, LR performed worst, achieving 80.77% accuracy, 83.46% sensitivity, and a specificity of 77.95%. The ROC curves of the classifiers are shown in Fig. 4.
ROC curve of all classifiers on selected feature space using the LASSO feature selection algorithm.
Next, the Relief feature selection technique is applied to investigate the performance of the classifiers on different sub-feature spaces using the wrapper method. After empirically analyzing the results of the classifiers on different subsets of the feature space, it is observed that the performance of the classifiers is best on the subspace of length n = 6. The results on the optimal feature space using the 10-fold CV technique are listed in Table 6.
Again, the ET classifier performed outstandingly in terms of all performance evaluation metrics compared to the other classifiers. ET obtained an accuracy of 94.41%, 94.93% sensitivity, and a specificity of 94.89%. In contrast, NB showed the lowest performance, achieving 80.29% accuracy, 81.93% sensitivity, and a specificity of 78.55%. The ROC curves of the classifiers are shown in Fig. 5.
ROC curve of all classifiers on selected features selected by the Relief feature selection algorithm.
After executing the classification algorithms on both the full and the selected feature spaces in order to select the optimal algorithm for the operational engine, the empirical results reveal that ET performed well among all the classification algorithms used, not only on the full feature space but also on the selected feature spaces. Furthermore, the ET classifier obtained a quite promising accuracy of 94.41% with the Relief feature selection technique. Overall, the performance of ET is better in terms of most of the measures, whereas the other classifiers show good results on one measure but worse results on others. In addition, the performance of the ET classifier is evaluated with 10-fold CV in combination with sub-feature spaces of varying length, from 1 to 12 features with a step size of 1, to check the stability and discrimination power of the classifier, as described in 30. Doing so will assist readers in better understanding the impact of the number of selected features on the performance of the classifiers. The same process is repeated for the other dataset, S2 (Hungarian heart disease dataset), to determine the impact of the selected features on the classification performance.
Tables 7 and 8 show the performance of the ET classifier using 10-fold CV in combination with different feature sub-spaces, from 1 to 12 features with a step size of 1. The experimental results show that the performance of the ET classifier is affected significantly by the length of the sub-feature space. It is concluded that these achievements are ascribed to the Relief feature selection technique, which not only reduces the feature space but also enhances the predictive power of the classifiers. The ET classifier has also played a promising role in these achievements because it has clearly and precisely learned the pattern of the target class. In addition, the performance of the ET classifier is evaluated with 5-fold and 7-fold CV in combination with sub-spaces of length 5 and 7 to check the stability and discrimination power of the classifier, and it is also tested on the other dataset, S2 (Hungarian heart disease dataset). The results are shown in the supplementary materials.
In Table 9, the P-value and Chi-square statistic are also computed for the ET classifier in combination with the optimal feature space of each feature selection technique.
Further, a comparative study of the developed system is conducted against other state-of-the-art machine learning approaches discussed in the literature. Table 10 presents a brief description and the classification accuracies of those approaches. The results demonstrate that the success rate of our proposed model is higher than that of the existing models in the literature.
The following subsections describe the materials and the methods used in this paper.
The first and most fundamental step in developing an intelligent computational model is to construct or obtain a problem-related dataset that truly and effectively reflects the pattern of the target class. A well-organized, problem-related dataset has a high influence on the performance of the computational model. Considering the significance of the dataset, two datasets, the Cleveland heart disease dataset (S1) and the Hungarian heart disease dataset (S2), are used. They are available online at the University of California Irvine (UCI) machine learning repository and the UCI Kaggle repository, and various researchers have used them in their studies 28, 31, 32. S1 consists of 304 instances, where each instance has 13 distinct attributes along with the target label, and these are selected for training. The dataset is composed of two classes: presence or absence of heart disease. S2 is composed of 1025 instances, of which 525 belong to the positive class and the remaining 500 to the negative class. The two datasets share the same attribute description, with identical attributes. The complete description of the datasets and their 13 attributes is given in Table 11.
The main aim of the developed system is to identify heart problems in human beings. In this study, four distinct feature selection techniques, namely FCBF, mRMR, Relief, and LASSO, are applied to the provided dataset in order to remove noisy and redundant features and to select variant features, which may consequently enhance the performance of the proposed model. Various machine learning classification algorithms are used in this study, including KNN, DT, ET, RF, LR, NB, ANN, SVM, AB, and GB. Different evaluation metrics are computed to assess the performance of the classification algorithms. The methodology of the proposed system is carried out in five stages: dataset preprocessing, feature selection, cross-validation, classification, and performance evaluation of the classifiers. The framework of the proposed system is illustrated in Fig. 6.
An Intelligent Hybrid Framework for the prediction of heart disease.
Data preprocessing is the process of transforming raw data into meaningful patterns and is crucial for a good representation of the data. Various preprocessing approaches, such as missing-value removal, standard scaling, and Min-Max scaling, are applied to the dataset in order to make it more effective for classification.
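A minimal sketch of this preprocessing step is shown below; the CSV file name and the "target" column label are illustrative assumptions, not the authors' exact file layout.

```python
# Illustrative preprocessing sketch: missing-value removal, standard scaling,
# and Min-Max scaling. "cleveland.csv" and the "target" column are assumed names.
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.read_csv("cleveland.csv").dropna()       # drop instances with missing values

X = df.drop(columns=["target"])                  # the 13 predictor attributes
y = df["target"]                                 # 1 = disease, 0 = no disease
feature_names = list(X.columns)

X_standard = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
X_minmax = MinMaxScaler().fit_transform(X)       # rescale each feature to [0, 1]
```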
A feature selection technique selects the optimal feature sub-space from all the features in a dataset. It is crucial because the classification performance sometimes degrades due to irrelevant features in the dataset. Feature selection improves the performance of the classification algorithms and also reduces their execution time. In this research study, four feature selection techniques are used; they are described below:
Fast correlation-based filter (FCBF): The FCBF feature selection algorithm follows a sequential search strategy. It first considers the full feature set and uses symmetric uncertainty to measure the dependencies of the features on each other and on the target output label. It then selects the most important features using a backward sequential search strategy. FCBF performs well on high-dimensional datasets. Table 12 shows the features (n = 6) selected by the FCBF feature selection algorithm. Each attribute is given a weight based on its importance; according to the FCBF feature selection technique, the most important features are THA and CPT, as shown in Table 12. The ranking that FCBF gives to all the features of the dataset is shown in Fig. 7.
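FCBF itself is not shipped with scikit-learn; the sketch below only illustrates the symmetric-uncertainty ranking at its core (the redundancy-removal search is omitted) and assumes the features in X are discrete or have already been discretized.

```python
# Symmetric uncertainty SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), bounded in [0, 1].
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_bits(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y):
    mi_bits = mutual_info_score(x, y) / np.log(2)   # sklearn returns nats; convert to bits
    denom = entropy_bits(x) + entropy_bits(y)
    return 2.0 * mi_bits / denom if denom > 0 else 0.0

su = {col: symmetric_uncertainty(X[col], y) for col in X.columns}
top6_fcbf = sorted(su, key=su.get, reverse=True)[:6]  # six highest-SU features
```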
Minimal redundancy maximal relevance (mRMR): mRMR uses a heuristic approach to select the most vital features, those having minimum redundancy with each other and maximum relevance to the target. Because it follows a heuristic approach, it evaluates one feature at a time and computes its pairwise redundancy with the other features. The mRMR feature selection algorithm is not well suited to problems with very high-dimensional feature domains 33. The features (n = 6) selected by the mRMR feature selection algorithm are listed in Table 13; among these attributes, PES and CPT have the highest scores. Figure 7 shows the ranking assigned by the mRMR feature selection algorithm to all attributes in the feature space.
Features ranking by four feature selection algorithms (FCBF, LASSO, mRMR, Relief).
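The following greedy sketch approximates the mRMR idea described above, measuring relevance and redundancy with mutual information; it is an assumption-laden illustration rather than the exact scoring used by the authors, and it again assumes discrete (or discretized) features.

```python
# Greedy mRMR sketch: at each step pick the feature maximizing
# relevance minus the mean redundancy with the already-selected features.
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, n_selected=6):
    remaining = list(X.columns)
    selected = []
    relevance = {c: mutual_info_score(X[c], y) for c in remaining}
    while remaining and len(selected) < n_selected:
        def score(c):
            if not selected:
                return relevance[c]
            redundancy = sum(mutual_info_score(X[c], X[s]) for s in selected) / len(selected)
            return relevance[c] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

top6_mrmr = mrmr_select(X, y)
```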
Least absolute shrinkage and selection operator (LASSO): LASSO selects features by shrinking the absolute values of the feature coefficients; features whose coefficients shrink to zero are removed from the feature subset, while features with larger non-zero coefficients are retained. LASSO works well when many feature coefficients are small. Note that some irrelevant features with higher coefficient values may nevertheless be selected and included in the feature subset 30. Table 14 presents the six most profound attributes, which have a strong correlation with the target, and their scores as selected by the LASSO feature selection algorithm. Figure 7 shows the important features and their scores as given by the LASSO feature selection algorithm.
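A minimal LASSO-based selection sketch is given below; the regularization strength alpha is an illustrative value rather than the one used in the study, and X_standard, y, and feature_names are assumed from the preprocessing sketch above.

```python
# LASSO feature selection sketch: fit a LASSO model on standardized features,
# drop zero coefficients, and keep the six largest-magnitude coefficients.
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01).fit(X_standard, y)   # alpha is illustrative only
coef = np.abs(lasso.coef_)

ranking = np.argsort(coef)[::-1]               # feature indices sorted by |coefficient|
top6_lasso = [feature_names[i] for i in ranking[:6] if coef[i] > 0]
```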
Relief feature selection algorithm: Relief uses the concept of instance-based learning to allocate a weight to each attribute based on its significance. The weight of an attribute reflects its capability to differentiate among class values. Attributes are ranked by weight, and those whose weight exceeds a user-specified cutoff are chosen as the final subset 34. The Relief feature selection algorithm selects the most significant attributes, those with the greatest effect on the target 35. The algorithm operates by selecting instances randomly from the training samples. For each sampled instance, the nearest instance of the same class (nearest hit) and of the opposite class (nearest miss) is identified. The weight of an attribute is updated according to how well its values differentiate the sampled instance from its nearest miss and nearest hit. An attribute receives a high weight if it discriminates between instances from different classes and has the same value for instances of the same class.
The weight update of an attribute works on a simple idea (line 6 of the algorithm): if the instance Ri and its nearest hit NH have dissimilar values (i.e., the diff value is large), the attribute separates two instances of the same class, which is not desirable, so the attribute's weight is reduced. On the other hand, if the instance Ri and its nearest miss NM have dissimilar values, the attribute separates two instances of different classes, which is desirable, so the weight is increased. The six most important features selected by the Relief algorithm are listed in descending order in Table 15. Based on the weight values, the most vital features are CPT and Age. Figure 7 shows the important features and their ranking as given by the Relief feature selection algorithm.
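The sketch below is a compact, simplified version of the basic Relief update just described (binary classes, numeric features scaled to [0, 1]); published ReliefF variants differ in their sampling and neighbor handling.

```python
# Basic Relief sketch: penalize attributes that differ on the nearest hit,
# reward attributes that differ on the nearest miss.
import numpy as np

def relief_weights(X, y, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to every instance
        dists[i] = np.inf                      # never pick the sampled instance itself
        hit = np.where(y == y[i])[0]
        miss = np.where(y != y[i])[0]
        nh = hit[np.argmin(dists[hit])]        # nearest hit (same class)
        nm = miss[np.argmin(dists[miss])]      # nearest miss (opposite class)
        w += (np.abs(X[i] - X[nm]) - np.abs(X[i] - X[nh])) / n_samples
    return w

weights = relief_weights(X_minmax, y)          # rank attributes by their final weight
```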
Various machine learning classification algorithms are investigated in this study for the early detection of heart disease. Each classification algorithm has its own significance, and its importance varies from application to application. In this paper, 10 classification algorithms of distinct natures, namely KNN, DT, ET, GB, RF, SVM, AB, NB, LR, and ANN, are applied in order to select the best and most generalizable prediction model.
Validation of the prediction model is an essential step in machine learning processes. In this paper, the k-fold cross-validation method is applied to validate the results of the above-mentioned classification models.
In k-fold CV, the whole dataset is split into k equal parts. At each iteration, (k-1) parts are used for training and the remaining part is used for testing; this process continues for k iterations. Various researchers have used different values of k for CV; here, k = 10 is used for the experimental work because it produces good results. In 10-fold CV, 90% of the data is used for training the model and the remaining 10% for testing at each iteration. Finally, the mean of the results over all iterations is taken as the final result.
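A minimal sketch of this protocol with scikit-learn is shown below, using the Extra Trees classifier as an example; X_sel stands for whichever feature space (full or selected) is being evaluated.

```python
# 10-fold cross-validation sketch; the mean over the folds is reported as the result.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
fold_accuracy = cross_val_score(model, X_sel, y, cv=10, scoring="accuracy")
print(fold_accuracy.mean())
```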
To measure the performance of the classification algorithms used in this paper, various evaluation metrics have been implemented, including accuracy, sensitivity, specificity, F1-score, recall, Matthews correlation coefficient (MCC), AUC score, and the ROC curve. All these measures are calculated from the confusion matrix described in Table 16.
In the confusion matrix, True Negative (TN) indicates that the patient does not have heart disease and the model predicts the same, i.e., a healthy person is correctly classified by the model.
True Positive (TP) indicates that the patient has heart disease and the model predicts the same, i.e., a person having heart disease is correctly classified by the model.
False Positive (FP) indicates that the patient does not have heart disease but the model predicts that the patient does, i.e., a healthy person is incorrectly classified by the model. This is also called a type-1 error.
False Negative (FN) indicates that the patient has heart disease but the model predicts that the patient does not, i.e., a person having heart disease is incorrectly classified by the model. This is also called a type-2 error.
Accuracy: The accuracy of the classification model shows the overall performance of the model and can be calculated by the formula given below:
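The formula itself did not survive extraction; in terms of the confusion-matrix counts defined above, the standard form is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$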
Specificity: Specificity is the ratio of correctly classified healthy people to the total number of healthy people. It means the prediction is negative and the person is healthy. The formula for calculating specificity is given as follows:
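The standard form, restored from the definitions above, is:

$$\text{Specificity} = \frac{TN}{TN + FP}$$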
Sensitivity: Sensitivity is the ratio of correctly classified heart patients to the total number of patients having heart disease. It means the model prediction is positive and the person has heart disease. The formula for calculating sensitivity is given below:
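The standard form is:

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}$$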
Precision: Precision is the ratio of correctly predicted positive cases to all cases predicted as positive by the classification model/algorithm. Precision can be calculated by the following formula:
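The standard form is:

$$\text{Precision} = \frac{TP}{TP + FP}$$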
F1-score: The F1-score is the harmonic mean of precision and recall (sensitivity). Its value ranges between 0 and 1; a value of 1 indicates good performance of the classification algorithm, while a value of 0 indicates poor performance.
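In standard form:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$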
MCC: MCC is a correlation coefficient between the actual and predicted results. MCC produces values between -1 and +1, where -1 represents a completely wrong prediction, 0 means that the classifier generates random predictions, and +1 represents an ideal prediction of the classification model. The formula for calculating MCC is given below:
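The standard form is:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$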
Finally, we examine the predictive ability of the machine learning classification algorithms with the help of the receiver operating characteristic (ROC) curve, which is a graphical representation of the performance of ML classifiers. The area under the curve (AUC) summarizes the ROC curve of a classifier, and the performance of a classification algorithm is directly linked to the AUC: the larger the AUC, the better the performance of the classification algorithm.
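A small sketch of producing cross-validated ROC and AUC values with scikit-learn is shown below; model, X_sel, and y are assumed from the cross-validation sketch above.

```python
# ROC curve and AUC from cross-validated probability estimates.
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict

proba = cross_val_predict(model, X_sel, y, cv=10, method="predict_proba")[:, 1]
fpr, tpr, _ = roc_curve(y, proba)    # points of the ROC curve
print(roc_auc_score(y, proba))       # area under the curve
```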
In this study, 10 different machine learning classification algorithms, namely LR, DT, NB, RF, ANN, KNN, GB, SVM, AB, and ET, are implemented in order to select the best model for early and accurate detection of heart disease. Four feature selection algorithms, FCBF, mRMR, LASSO, and Relief, have been used to select the most vital and correlated features that truly reflect the pattern of the desired target. Our developed intelligent computational model has been trained and tested on two datasets, the Cleveland (S1) and Hungarian (S2) heart disease datasets. Python has been used as the tool for implementing and simulating the results of all the classification algorithms used.
The performance of all classification models has been tested in terms of various performance metrics on the full feature space as well as on the selected feature spaces obtained through the various feature selection algorithms. This research study recommends which feature selection algorithm is best paired with which classification model for developing a high-level intelligent system for the diagnosis of patients having heart disease. From the simulation results, it is observed that ET is the best classifier and Relief is the optimal feature selection algorithm. In addition, the P-value and Chi-square statistic are computed for the ET classifier along with each feature selection algorithm. It is anticipated that the proposed system will be useful and helpful for doctors and other caregivers in diagnosing heart disease patients accurately and effectively at an early stage.
Heart disease is one of the most devastating and fatal chronic diseases; its prevalence is rapidly increasing in both economically developed and developing countries, and it causes many deaths. This damage can be reduced considerably if the patient is diagnosed at an early stage and proper treatment is provided. In this paper, we developed an intelligent predictive system based on contemporary machine learning algorithms for the prediction and diagnosis of heart disease. The developed system was evaluated on two datasets, the Cleveland (S1) and Hungarian (S2) heart disease datasets, and was trained and tested on the full feature set as well as on optimal feature subsets. Ten classification algorithms (KNN, DT, RF, NB, SVM, AB, ET, GB, LR, and ANN) and four feature selection algorithms (FCBF, mRMR, LASSO, and Relief) were used. A feature selection algorithm selects the most significant features from the feature space, which not only reduces the classification errors but also shrinks the feature space. To assess the performance of the classification algorithms, various performance evaluation metrics were used, including accuracy, sensitivity, specificity, AUC, F1-score, MCC, and the ROC curve. The classification accuracies of the top two classification algorithms, ET and GB, on the full feature set were 92.09% and 91.34%, respectively. After applying the feature selection algorithms, the classification accuracy of ET with the Relief feature selection algorithm increased from 92.09% to 94.41%, and the accuracy of GB increased from 91.34% to 93.36% with the FCBF feature selection algorithm. Thus, the ET classifier with the Relief feature selection algorithm performs excellently. The P-value and Chi-square statistic were also computed for the ET classifier with each feature selection technique. Future work on this research study will use more optimization techniques, feature selection algorithms, and classification algorithms to further improve the performance of the predictive system for the diagnosis of heart disease.
Bui, A. L., Horwich, T. B. & Fonarow, G. C. Epidemiology and risk profile of heart failure. Nat. Rev. Cardiol. 8 , 30 (2011).
Polat, K. & Güneş, S. Artificial immune recognition system with fuzzy resource allocation mechanism classifier, principal component analysis, and FFT method based new hybrid automated identification system for classification of EEG signals. Expert Syst. Appl. 34 , 2039–2048 (2010).
Heidenreich, P. A. et al. Forecasting the future of cardiovascular disease in the United States: A policy statement from the American Heart Association. Circulation 123 , 933–944 (2011).
Durairaj, M. & Ramasamy, N. A comparison of the perceptive approaches for preprocessing the data set for predicting fertility success rate. Int. J. Control Theory Appl. 9 , 255–260 (2016).
Das, R., Turkoglu, I. & Sengur, A. Effective diagnosis of heart disease through neural networks ensembles. Expert Syst. Appl. 36 , 7675–7680 (2012).
Allen, L. A. et al. Decision making in advanced heart failure: A scientific statement from the American Heart Association. Circulation 125 , 1928–1952 (2014).
Yang, H. & Garibaldi, J. M. A hybrid model for automatic identification of risk factors for heart disease. J. Biomed. Inform. 58 , S171–S182 (2015).
Alizadehsani, R., Hosseini, M. J., Sani, Z. A., Ghandeharioun, A. & Boghrati, R. In 2012 IEEE 12th International Conference on Data Mining Workshops. 9–16 (IEEE, New York).
Arabasadi, Z., Alizadehsani, R., Roshanzamir, M., Moosaei, H. & Yarifard, A. A. Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm. Comput. Methods Programs Biomed. 141 , 19–26 (2017).
Samuel, O. W., Asogbon, G. M., Sangaiah, A. K., Fang, P. & Li, G. An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction. Expert Syst. Appl. 68 , 163–172 (2017).
Patil, S. B. & Kumaraswamy, Y. Intelligent and effective heart attack prediction system using data mining and artificial neural network. Eur. J. Sci. Res. 31 , 642–656 (2009).
Vanisree, K. & Singaraju, J. Decision support system for congenital heart disease diagnosis based on signs and symptoms using neural networks. Int. J. Comput. Appl. 19 , 6–12 (2015).
Edmonds, B. In Proceedings of AISB Symposium on Socially Inspired Computing 1–12 (Hatfield, 2005).
Methaila, A., Kansal, P., Arya, H. & Kumar, P. Early heart disease prediction using data mining techniques. Comput. Sci. Inf. Technol. J. https://doi.org/10.5121/csit.2014.4807 (2014).
Samuel, O. W., Asogbon, G. M., Sangaiah, A. K., Fang, P. & Li, G. An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction. Expert Syst. Appl. 68 , 163–172 (2018).
Nazir, S., Shahzad, S., Mahfooz, S. & Nazir, M. Fuzzy logic based decision support system for component security evaluation. Int. Arab J. Inf. Technol. 15 , 224–231 (2018).
Detrano, R. et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64 , 304–310 (2009).
Gudadhe, M., Wankhade, K. & Dongre, S. In 2010 International Conference on Computer and Communication Technology (ICCCT) , 741–745 (IEEE, New York).
Kahramanli, H. & Allahverdi, N. Design of a hybrid system for the diabetes and heart diseases. Expert Syst. Appl. 35 , 82–89 (2013).
Palaniappan, S. & Awang, R. In 2012 IEEE/ACS International Conference on Computer Systems and Applications 108–115 (IEEE, New York).
Olaniyi, E. O., Oyedotun, O. K. & Adnan, K. Heart diseases diagnosis using neural networks arbitration. Int. J. Intel. Syst. Appl. 7 , 72 (2015).
Das, R., Turkoglu, I. & Sengur, A. Effective diagnosis of heart disease through neural networks ensembles. Expert Syst. Appl. 36 , 7675–7680 (2011).
Paul, A. K., Shill, P. C., Rabin, M. R. I. & Murase, K. Adaptive weighted fuzzy rule-based system for the risk level assessment of heart disease. Applied Intelligence 48 , 1739–1756 (2018).
Tomov, N.-S. & Tomov, S. On deep neural networks for detecting heart disease. arXiv:1808.07168 (2018).
Manogaran, G., Varatharajan, R. & Priyan, M. Hybrid recommendation system for heart disease diagnosis based on multiple kernel learning with adaptive neuro-fuzzy inference system. Multimedia Tools Appl. 77 , 4379–4399 (2018).
Alizadehsani, R. et al. Non-invasive detection of coronary artery disease in high-risk patients based on the stenosis prediction of separate coronary arteries. Comput. Methods Programs Biomed. 162 , 119–127 (2018).
Haq, A. U., Li, J. P., Memon, M. H., Nazir, S. & Sun, R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Inf. Syst. 2018 , 3860146. https://doi.org/10.1155/2018/3860146 (2018).
Mohan, S., Thirumalai, C. & Srivastava, G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7 , 81542–81554 (2019).
Ali, L. et al. An optimized stacked support vector machines based expert system for the effective prediction of heart failure. IEEE Access 7 , 54007–54014 (2019).
Peng, H., Long, F. & Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27 (8), 1226–1238 (2005).
Palaniappan, S. & Awang, R. In 2008 IEEE/ACS International Conference on Computer Systems and Applications 108–115 (IEEE, New York).
Ali, L., Niamat, A., Golilarz, N. A., Ali, A. & Xingzhong, X. An expert system based on optimized stacked support vector machines for effective diagnosis of heart disease. IEEE Access (2019).
Pérez, N. P., López, M. A. G., Silva, A. & Ramos, I. Improving the Mann-Whitney statistical test for feature selection: An approach in breast cancer diagnosis on mammography. Artif. Intell. Med. 63 , 19–31 (2015).
Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 73 , 273–282 (2011).
Peng, H., Long, F. & Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27 , 1226–1238 (2012).
de Silva, A. M. & Leong, P. H. Grammar-Based Feature Generation for Time-Series Prediction (Springer, Berlin, 2015).
This research was supported by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044815).
Authors and affiliations.
Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
Yar Muhammad, Muhammad Tahir & Maqsood Hayat
Department of Electronic and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
Kil To Chong
All authors contributed equally.
Correspondence to Maqsood Hayat or Kil To Chong .
Competing interests.
The authors declare no competing interests.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information.
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
Cite this article.
Muhammad, Y., Tahir, M., Hayat, M. et al. Early and accurate detection and diagnosis of heart disease using intelligent computational model. Sci Rep 10 , 19747 (2020). https://doi.org/10.1038/s41598-020-76635-9
Received : 03 April 2020
Accepted : 28 October 2020
Published : 12 November 2020
DOI : https://doi.org/10.1038/s41598-020-76635-9