ORIGINAL RESEARCH article

Cardiovascular diseases prediction by machine learning incorporation with deep learning.

Sivakannan Subramani

  • 1 Department of Advanced Computing, St. Joseph's University, Bengaluru, Karnataka, India
  • 2 Department of Computer Engineering and Applications, GLA University, Mathura, Uttar Pradesh, India
  • 3 Department of Mechanical Engineering, Kongu Engineering College, Perundurai, Erode, Tamil Nadu, India
  • 4 Department of VLSI Microelectronics, Saveetha School of Engineering, SIMATS, Chennai, Tamil Nadu, India
  • 5 Faculty of Science, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
  • 6 Department of Biotechnology, Parul Institute of Applied Sciences and Centre of Research for Development, Parul University, Vadodara, India
  • 7 Department of Biology, College of Science, University of Hail, Hail, Saudi Arabia
  • 8 Centre for Drug Discovery and Development, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu, India
  • 9 Department of Bioinformatics, Saveetha School of Engineering, SIMATS, Chennai, Tamil Nadu, India
  • 10 Unit of Biochemistry, Centre of Excellence for Biomaterials Engineering, Faculty of Medicine, AIMST University, Semeling, Bedong, Malaysia
  • 11 Centre for Excellence for Biomaterials Science, AIMST University, Semeling, Bedong, Malaysia
  • 12 Department of Computational Biology, Saveetha School of Engineering, SIMATS, Chennai, Tamil Nadu, India

It is not yet known what causes cardiovascular disease (CVD), but we do know that it is associated with a high risk of death, as well as severe morbidity and disability. There is an urgent need for AI-based technologies that can promptly and reliably predict the future outcomes of individuals who have cardiovascular disease. The Internet of Things (IoT) is serving as a driving force behind the development of CVD prediction. Machine learning (ML) is used to analyse and make predictions based on the data that IoT devices collect. Traditional machine learning algorithms are unable to account for variation in the data and achieve only low accuracy in their model predictions. This research presents a collection of machine learning models that address this problem by taking into account the data observation mechanisms and training procedures of a number of different algorithms. To verify the efficacy of our strategy, we combined the Heart Dataset with other classification models. The proposed method achieves an accuracy of nearly 96%, outperforming existing methods, and a complete analysis over several metrics is provided. Research in the field of deep learning will benefit from additional data from a large number of medical institutions, which may be used for the development of artificial neural network structures.

1. Introduction

Cardiovascular disease (CVD), which is the leading cause of death globally, has become a significant problem in public health all over the world. As a result, patients, their families, and the governments of these countries have incurred substantial socioeconomic expenses. Patients at high risk for CVD can be identified by prediction models that use risk stratification. After that, measures that are tailored to this group, such as dietary changes and the use of statins, can help reduce that risk and contribute to the primary prevention of CVD ( 1 ).

Several guidelines for the evaluation and management of CVD have suggested using predictive models as a means of identifying patients at high risk and assisting with clinical decision-making. The Pooled Cohort Equations and the Framingham CV risk equation6, for example, have both been subjected to independent evaluations in a variety of populations; however, the findings indicated that both of these equations were only weakly discriminating and had a poor level of calibration ( 2 ).

As a direct consequence of this, the predictive power of the majority of the models that are now in use is restricted, and there is room for advancement. For instance, the assumption of linearity is necessary for logistic regression, while the assumption of predictor independence is necessary for the Cox proportional hazard model ( 3 ).

In the area of study pertaining to the cardiovascular system, machine learning (ML) algorithms have been demonstrated to be extremely helpful predictors. They are more adept than standard statistical models at capturing the complex interactions and nonlinear linkages that exist between the variables and the results ( 4 ). Several different investigations ( 5–15 ) came to the conclusion that random forests (RF) and support vector machines (SVM) performed better than traditional models.

Cardiovascular diseases such as coronary artery disease (CAD), atrial fibrillation (AF), and other cardiac or vascular ailments continue to be the leading cause of death in the world ( 10 ). As people's living standards improve and their stress levels continue to rise, the number of people who suffer from CVD is growing at an alarming rate.

According to the most recent estimations ( 16 , 17 ), CVD will be responsible for the deaths of about 23 million people by the year 2030. Infarction of the myocardium, atrial fibrillation, and heart failure are all instances of different types of CVD ( 18 , 19 ). The incidence of cardiovascular disease can be influenced by a number of factors, including racial or ethnic background, age, gender, body mass index (BMI), height, and length of torso, as well as the outcomes of blood tests that evaluate factors such as renal function, liver function, and cholesterol levels ( 20 , 21 ) which is shown in Figure 1 .


Figure 1. Several factors influencing the incidence of cardiovascular disease.

The development of a wide variety of health problems can be influenced by the complex interactions that take place between these factors. Standard statistical approaches are incapable of accounting for all of the intricate causal links that exist between risk factors because there are so many of them ( 22 , 23 ). In this day and age of big data, the Internet of Things (IoT) has been shown to be of critical importance. It has made it possible for patients to use smart drugs and smart bracelets to monitor and collect accurate data during a pandemic ( 24 ).

Researchers are employing artificial intelligence (AI) in an effort to mine new medical information that clinicians can use to better understand the symptoms of various diseases and, as a result, make more informed decisions for patients ( 25 ). This comes as the prevalence of Internet of Things (IoT) data grows within healthcare systems. Current efforts to standardise medical data and to organise national health screening data ( 26–28 ) make it possible to investigate previously unknown risk factors. These risk variables may correlate with the occurrence of the disease, which means that they could offer insights into the mechanisms underlying the disease. Furthermore, accurate disease incidence prediction models necessitate the analysis of large amounts of data ( 29 , 30 ). The use of artificial intelligence (AI) and massive amounts of data in the prediction of CVD models is becoming increasingly common.

The main contribution and novelty of this research is mentioned below:

• To extract a total of 11 distinct characteristics from the dataset.

• After that, we started by normalising the data and then proceeded to divide the Heart dataset into training and testing sets using an 8:2 split.

• Afterwards, the GBDT model is combined with the SHAP method for the purpose of selecting features.

• It helps to construct a stacking model consisting of a base learner layer in addition to a meta learner layer.

• Finally, the stacking model is evaluated across several performance metrics.

2. Background

Weng et al. ( 31 ) tested four different models using clinical data from over 300,000 patients in the United Kingdom. According to the findings, NN was the method that produced the most accurate CVD prediction results for the larger amount of data analysed.

The three traditional machine learning models that Dimopoulos et al. ( 32 ) tested and evaluated on ATTICA data with 2020 samples, a small CVD dataset, were K-Nearest Neighbour (KNN), Random Forest (RF), and Decision Tree. In the comparison, RF was shown to have produced the best results using the HellenicSCORE tool, which is a calibration of the ESC Score.

In view of the growing popularity of machine learning techniques in IoT applications, Mohan et al. ( 15 ) proposed a hybrid HRFLM strategy as a means of further improving the accuracy of model predictions.

An IoT-ML method was investigated by Akash et al. ( 33 ) with the goal of predicting the condition of the cardiovascular system in the human body. The model uses machine learning (ML) techniques to compute and predict the patient's cardiovascular health after it has obtained essential data from the human body. These data include the patient's heart rate, ECG signal, and cholesterol.

In the examination by Yang et al. ( 34 ) of local regions with separate prediction models, LR was utilised to evaluate 30 cardiovascular disease-related characteristics in more than 200,000 high-risk participants in eastern China. The results of the experiments led to the development of an RF model that is better suited to eastern China.

The idea of a stacking model was introduced to the study of CVDs for the first time by Yang et al. ( 35 ). Data on air pollution and weather were considered in order to better understand how the stacking model influences the daily hospitalisation rate for CVDs. To assist in the construction of the stacking model, a first layer of five base learners was constructed.

During this period, digital, fully automated ecosystems as well as cyber-physical systems are growing fast and finding applications all over the world. The creation of smart healthcare, which offers tools and processes for the early diagnosis of life-threatening disorders, is one example of the innovative concepts and technical compositions being implemented in nearly every industry. As the fourth industrial revolution moves towards a society that is more technologically advanced, there is an urgent requirement for additional research into CVD, as Zheng et al. ( 36 ) note.

3. Proposed method

The first thing that needs to be done is to combine the data from the Heart Dataset, which already contains information from the Cleveland, Hungarian, and Switzerland datasets, as well as data from Long Beach VA and Statlog (Heart). From the five sources, we extracted a total of 11 distinct characteristics. After that, we normalised the data and then divided the Heart dataset into training and testing sets using an 8:2 split. Afterwards, the GBDT model is combined with the SHAP method for the purpose of selecting features.
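The normalisation and 8:2 split described above can be sketched as follows. This is a minimal sketch on synthetic data: the min-max scaler is an assumption, since the paper only says the data were "normalised", and the 918-sample, 11-feature shape merely mirrors the Heart Dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 918-sample, 11-feature Heart data; the
# min-max scaler is an assumption, as the paper only says "normalised".
X, y = make_classification(n_samples=918, n_features=11, random_state=0)

# 8:2 train/test split, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fit the scaler on the training portion only to avoid leakage.
scaler = MinMaxScaler().fit(X_train)
X_train_n = scaler.transform(X_train)
X_test_n = scaler.transform(X_test)
```

Fitting the scaler on the training split only, then applying it to the test split, avoids information leaking from the held-out data.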

In the following stage, we will construct a stacking model consisting of a base learner layer in addition to a meta learner layer. The study uses RF, LR, MLP, ET, and CatBoost classifiers to serve as our base learners. LR is utilised in the role of the meta learner. Finally, the suggested stacking model is assessed with regard to its accuracy, precision, recall, F1 score, and area under the curve (AUC). In order to evaluate the model adaptability to new contexts, we made use of a publicly available dataset known as the Heart Attack Dataset.

The Cleveland, Hungarian, Switzerland, Long Beach VA, and Statlog (Heart) datasets, together with others from the machine learning repository at the University of California, Irvine (UCI), were combined to form the Heart Dataset. We began with a total of 1,190 samples, and after deleting 272 duplicates, we were left with 918 unique samples. The whole Heart dataset is displayed in Table 1 ; it consists of 11 features that were taken from five different datasets containing significant relevant features.
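The merge-and-deduplicate step can be sketched as below. This is a minimal sketch assuming the five source tables share the same 11 feature columns plus the target label; the function name is hypothetical.

```python
import pandas as pd

def build_heart_dataset(frames):
    """Concatenate the source tables and drop exact duplicate rows,
    mirroring the paper's 1,190 -> 918 reduction."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)

# Hypothetical usage with the five UCI source files, already loaded
# with identical column layouts:
# heart = build_heart_dataset([pd.read_csv(p) for p in source_paths])
```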


Table 1 . Heart dataset features.

3.1. Feature select and analysis

Selecting the ideal subset of features that have a significant impact on the prediction outcomes can both increase model performance and save a considerable amount of runtime. This process is referred to as feature selection.

The three most common methods for selecting features are filters, wrappers, and embedding. Our research utilised the embedded approach known as GBDT as a means of selecting feature variables, because embedded techniques offer superior prediction performance compared to filter methods and are noticeably quicker than wrapper methods.
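An embedded selection of this kind can be sketched with scikit-learn, using a fitted gradient-boosting model's impurity-based importances to keep or drop features; the synthetic data and the "median" threshold are illustrative assumptions, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 11-feature Heart data.
X, y = make_classification(n_samples=200, n_features=11,
                           n_informative=5, random_state=0)

# Embedded selection: the fitted GBDT's impurity-based feature
# importances decide which features survive.
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
selector = SelectFromModel(gbdt, prefit=True, threshold="median")
X_selected = selector.transform(X)
```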

GBDT makes use of an additive model and a forward stepwise algorithm in order to achieve learning. For non-leaf nodes, the significance of a feature increases proportionately with the magnitude of the reduction in weighted impurity that occurs during splitting.

Because of this, it is not possible to provide a detailed explanation of the role that each attribute plays in determining the overall accuracy of the predictions made by the integrated GBDT. To address this issue, we make use of feature attribution, in which the explanatory model is a linear function of the feature attribution values.

$$ g(z') = \phi_0 + \sum_{i=1}^{N} \phi_i z'_i \qquad (1) $$

where N is the number of features, $\phi_i$ is the feature attribution value, and $z'_i$ indicates whether feature i is valid or not.

The $\phi_i$ value of Equation (1) can be determined by employing the Tree SHAP estimation methodology, which is founded on the concepts of game theory, and used as the feature attribution value. Below is the formulation, for a model f and a set S of non-zero indices in z′, of the classical Shapley value attribution $\phi_i$ for each feature.

$$ \phi_i = \sum_{S \subseteq M \setminus \{i\}} \frac{|S|!\,(|M| - |S| - 1)!}{|M|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right] \qquad (2) $$

where M is the input feature set.

It is essential to keep in mind that the SHAP strategy is model-specific and adapted to the context. In contrast to the tree model gain, this method can produce consistent results for global feature attributions. In the course of our study, we make use of the SHAP methodology in order to isolate and assess individual characteristics.

In addition to this, we investigate the ways in which various characteristics interact with one another in order to improve our ability to predict outcomes. We refer to the contribution of an individual feature as the individual feature contribution, and the contribution of a feature interaction as the joint feature contribution $\Phi_{i,j}$. In the same way as the Shapley value, the Shapley interaction index is calculated using a formula, and the joint contribution of features i and j can be found as follows.

When i ≠ j:

$$ \Phi_{i,j} = \sum_{S \subseteq M \setminus \{i,j\}} \frac{|S|!\,(|M| - |S| - 2)!}{2\,(|M| - 1)!}\, \nabla_{ij}(S) \qquad (3) $$

$$ \nabla_{ij}(S) = f_x(S \cup \{i,j\}) - f_x(S \cup \{i\}) - f_x(S \cup \{j\}) + f_x(S) \qquad (4) $$

where S ranges over subsets of the feature indices, i and j index the feature contributions, and $\Phi_{i,j}$ is the Shapley interaction index.

Equations (3) and (4) in the GBDT model quantify the pairwise relationships between joint features. Thus, when evaluating the model, one can obtain a clear picture of how interacting factors contribute together.

3.2. Model building

Model fusion is most effective when each individual base learner has a strong capacity for learning and the degree of correlation between the learners is low. When the individual learners are already accurate, a fusion of models will be more successful if the individual learners are diverse. This is the foundation upon which the concept of error-ambiguity decomposition is built.

This suggests that when picking the base learners, we should take into account the performance of each individual learner as well as its distinctiveness. Theoretically, it is conceivable to build layers of the stacking model indefinitely as long as each fundamental classifier is operational; this, of course, increases the complexity of the model.

To ensure accuracy while reducing the complexity of the model, we employ a stacking model comprised of a two-tiered structure consisting of base learners and meta-learners. SVM, KNN, LR, and ET were considered as possible base learner models for the prediction of CVDs; XGBoost, LightGBM, CatBoost, and MLP were the other options considered. After selecting the most reliable models as base learners, we restricted the pool of candidates to five models that together provide a comprehensive representation of the candidate pool. The Optuna framework was used to obtain the optimal values for the model parameters.

After running a 5-fold CV, this model may generate a large number of features. 5-fold CV is a technique frequently utilised in the first layer of a stacking framework to collect input features for the second layer. This paper employs logistic regression (LR) as the classifier for the fusion model predictions, since generalised linear models have historically been employed in the second layer of the stacking architecture. Because the second layer does not require more complex functions, a simpler model is used here.

The base learners are the LR, RF, DT, MLP, and CatBoost models. At the beginning, we split the data into training and testing sets at an 8:2 ratio. Within the training set, we utilise a 5-fold CV for each of the five base learners. A single base learner produces five predictions, and each of these is arranged in a vertical column within a one-dimensional matrix. The second stage of training is then based on a five-dimensional matrix developed from the outputs of the five learners.

When applied to the testing set, the 5-fold CV model is utilised once more to make predictions about our initial testing set, which again produces five predictions. The base learners' outputs are concatenated into a matrix for the second stage. We use LR as the meta-learner so that it does not overfit. In the second step of the process, we use these predictions to build the final results.
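The two-tier stack can be sketched with scikit-learn's `StackingClassifier`, which internally generates the out-of-fold predictions described above. GradientBoosting stands in here for CatBoost so the sketch needs only scikit-learn, and the data are synthetic; this is an illustration of the architecture, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Base layer: out-of-fold predictions from 5-fold CV become the input
# features of the LR meta-learner. GradientBoosting stands in for
# CatBoost so the sketch needs only scikit-learn.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("mlp", MLPClassifier(max_iter=500, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```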

4. Results and discussion

The outcomes of the experiments are discussed here to illustrate the benefits of the proposed stacking paradigm. Python version 3.9.7 was used throughout every test. In this investigation, the sklearn 1.0.2 toolbox is used for model prediction, the SHAP 0.40.0 toolbox for feature selection, and the Optuna 2.10.0 framework to determine the optimum values for the model parameters, as shown in Table 2 . We executed 10 splits of the dataset using various random seeds in order to account for the small sample size of this study and the aforementioned randomisation, and then averaged the results of all 10 experiments.


Table 2 . Software specifications.
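The ten-seed averaging protocol can be sketched as below; the LR model and synthetic data are illustrative stand-ins, and only the loop structure mirrors the paper's procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Ten differently seeded 8:2 splits; the reported figure is the mean
# of the per-split metric, damping small-sample variance.
aucs = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
mean_auc = float(np.mean(aucs))
```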

Before we started the feature selection process, our dataset contained a total of 11 features. Using the Tree SHAP approach, the contribution value corresponding to each feature in the sample dataset can be determined. The ranking of the feature contributions is determined using the average SHAP value over all samples. The contributions of the global features are depicted in accordance with the GBDT model. ST Slope and Chest Pain Type have a significant influence on the condition of patients with cardiovascular disease. To cut the model's running time further, unnecessary features must be eliminated. We chose to adopt a cutoff of 0.02, which led to the elimination of the Resting ECG characteristic while permitting the retention of the other 10 features. We used the AUC to evaluate performance both before and after feature selection. Although the AUC values of GBDT went down, the drop was not substantial, and there was no statistically significant difference across metrics such as AUC, threshold, sensitivity, and specificity, as shown in Figures 2 – 5 .


Figure 2 . Area under the curve (AUC).


Figure 3 . Threshold probability.


Figure 4 . Sensitivity (%).


Figure 5 . Specificity (%).
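The 0.02 mean-|SHAP| cutoff that removes Resting ECG can be sketched as a simple threshold filter. The feature names match the Heart dataset, but the numeric scores below are illustrative placeholders, not the paper's values.

```python
import numpy as np

# Hypothetical mean-|SHAP| scores: the 11 feature names match the
# Heart dataset, but the numbers are illustrative, not the paper's.
features = ["ST_Slope", "ChestPainType", "Oldpeak", "Cholesterol",
            "MaxHR", "ExerciseAngina", "Age", "Sex", "FastingBS",
            "RestingBP", "RestingECG"]
mean_abs_shap = np.array([0.62, 0.41, 0.18, 0.15, 0.12, 0.11,
                          0.08, 0.06, 0.04, 0.03, 0.01])

# Keep features whose mean |SHAP| clears the 0.02 cutoff.
kept = [f for f, s in zip(features, mean_abs_shap) if s >= 0.02]
dropped = [f for f, s in zip(features, mean_abs_shap) if s < 0.02]
```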

The incidence of CVD was quite low in this experiment, resulting in poor PPV and NPV values for each of the seven different ML models. Because of this, their therapeutic value may suffer as a result of an increase in the number of false-positive results. The probabilities that were predicted by each machine learning model were unique, and the risk distribution for LR was comparable to that of SVM. Patients who had a CVD episode had estimated risks that were greater, across all ML models, than those patients who had not had a CVD episode. The plots also demonstrated that all ML models overestimated the risks of those individuals who had not experienced any CVD events. This finding suggests that this variable may also affect how well a model predicts what will happen.

It is necessary to have a risk model in order to determine whether individuals have a high probability of developing CVD. We intended to test seven machine learning (ML)-based models to evaluate how accurately they could predict the risk of CVD. The findings demonstrated that each one of them had good to excellent discrimination and that they were all accurately calibrated. When it came to forecasting the risk of CVD, penalised LR, like SVM, performed better than the other machine learning models. The specificity of the SVM was higher than that of the LR, while the LR had higher sensitivity. It is possible that higher specificity was favoured in this Kazakh Chinese group because the majority of the population was nomadic and there were few medical services available. In addition, when taking calibration and DCA into consideration, SVM fared marginally better than LR. Because of this, SVM and LR can be used to find people in this group who are at a higher risk of CVD and to determine whether putting risk-mitigation interventions in place for these people will improve their CVD outcomes during the clinical decision-making process.

Logistic regression has been widely used in the clinic to construct predictive models due to the ease with which it may be interpreted and its general straightforwardness. In a study aimed at predicting myocardial ischemia, both LR and SVM were shown to have the same level of predictive ability, which is consistent with our findings. A recent and exhaustive study concluded that there is no performance benefit to be gained from using ML in clinical prediction models over using LR. It was determined that when machine learning algorithms are applied to small datasets with a limited number of predictors, LR models might outperform ML models. It is possible that the small sample size and the L1-penalised technique used in this work explain the superior performance of LR in comparison to other machine learning models, with the exception of SVM.

The Support Vector Machine (SVM) is a well-known supervised machine learning approach that has found applications in a wide variety of sectors. The fundamental idea behind SVM is to locate the hyperplane that can effectively classify the data while providing the largest geometric margin. In addition, it possesses powerful kernel capabilities, which simplify the process of dealing with nonlinear classification issues. The outstanding performance of SVM demonstrates that it is a great tool for tackling classification challenges on small, non-linear, and high-dimensional datasets. In our experiment, we observed that the SVM performed significantly better than other machine learning models.
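A minimal sketch of the margin-maximising, kernelised classifier described above, on synthetic data; the RBF kernel and C value are illustrative assumptions rather than the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# RBF-kernel SVM: maximise the geometric margin in a kernel-induced
# feature space, handling non-linear boundaries; features are scaled
# first because the margin is distance-based.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv_accuracy = cross_val_score(svm, X, y, cv=5).mean()
```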

When it comes to classification, RF is among the most successful ensemble learning strategies. However, its predictions were not as accurate as those generated by the LR and SVM algorithms. It is likely that the limited number of participants in this study prevented RF from achieving its full potential as a prediction tool. The concept of variable importance was utilised to locate potential indicators of CVD. Some studies suggest that RF may be capable of revealing crucial but undisclosed predictors.

According to the results of feature selection based on RF, the age of the patient was the most significant predictor in the classification of CVD. In this study, it was discovered that certain risk factors, such as smoking and alcohol intake, were not as predictive as previously believed. Previous studies have shown that the synthetic indices BAI and LHR are both very good indicators of cardiovascular disease. Inflammation plays a significant part in the formation of atherosclerotic plaques as well as in the progression of cardiovascular disease, as shown in Figures 6 – 11 .


Figure 6 . PPV (%).


Figure 7 . NPV (%).


Figure 8 . Youden index.


Figure 9 . High-risk patients (%).


Figure 10 . Brier score.


Figure 11 . Hosmer-Lemeshow-2.

There is evidence that inflammatory cytokines, such as high-sensitivity CRP (hs-CRP) and interleukin-6, are associated with an elevated risk of cardiovascular disease. The hs-CRP inflammatory marker was included in the Reynolds Risk Score in order to account for its role as a potential contributor to cardiovascular disease. hs-CRP has been shown in a number of other epidemiological studies to be an important predictor of CVD. These studies have also shown that hs-CRP acts as a mediator in the pathogenesis of vascular disease and is a marker of endothelial dysfunction. These findings are consistent with those of the aforementioned studies. It was discovered that hs-CRP increases the risk of atherosclerotic plaque rupture in addition to destabilising atherosclerotic plaques via NO, IL-6, and prostacyclin.

In addition, hs-CRP has been demonstrated to enhance thrombosis and cardiomyocyte apoptosis in response to hypoxia, which provides further support for its position as a risk factor for cardiovascular disease. It has been demonstrated that IL-6 is a factor in the course of atherosclerosis and that it promotes the creation of atherosclerotic plaques. This factor may have contributed to the increase in the number of cases of CVD. Taking statins, which can reduce a person's chance of acquiring CVD, is beneficial for a great number of people. In clinical practise, hs-CRP and IL-6 can be used as biomarkers for the early diagnosis of patients who have an increased likelihood of developing cardiovascular disease.

According to the findings of our study, a decrease in ADP was associated with an increased risk of developing cardiovascular disease. The adipose hormone ADP, which is secreted by adipocytes, has anti-inflammatory effects. These effects manifest themselves as a reduction in the levels of CRP and lymphocyte recruitment in atherosclerotic lesions, a reduction in the expression of TNF-α, and an increase in the production of cytokines that are protective against inflammation. On the other hand, there is evidence from a small number of studies that suggests an increase in ADP may assist in avoiding an ischemic stroke. Increased NEFA concentrations have been associated with an increased risk of cardiovascular disease in previous research, and our study came to the same conclusion. The possible effects of NEFA on cardiovascular disease include the potential to exacerbate a number of illnesses, including type 2 diabetes, hypertension, the metabolic syndrome, and endothelial deterioration. Patients may have a lower chance of developing cardiovascular disease (CVD) if treatment raises their ADP levels.

The risk prediction models that are currently being used in CVD domains were built using traditional statistical methodologies, as many studies have revealed. Nevertheless, these models have been proven to be erroneous in external populations. In the field of cardiology, machine learning algorithms have proven to be superior methods for deriving predictions from big datasets that are notoriously difficult to understand. No prior assumptions are made by machine learning algorithms, which means that any data can be used to develop accurate and resilient models. Because of this, ML is able to model more complex relationships between outcomes and predictors, which are typically more challenging to express using standard statistical methods. Discovering interactions between marginal predictors can help improve risk-management strategies when using ML.

Machine learning has the potential to identify new genotypes and phenotypes for a variety of CVDs, as well as novel risk factors for CVDs. All aspects of medical image recognition, diagnosis, outcome prediction, and prognosis evaluation can be improved with the application of more sophisticated machine learning techniques such as deep learning and artificial neural networks (ANN). It is possible that, in the future, cardiologists will make better clinical decisions if they use machine learning models rather than the CVD risk stratifications that are currently used. On the other hand, most ML models may be hard for medical professionals to understand and use, which may limit how often they can be used in clinical settings.

5. Conclusion

According to the findings of this research, a stacking fusion model-based classifier performs better than individual models on all assessment criteria. This finding suggests that stacking models can combine the benefits of a variety of model types to achieve superior prediction performance. The recommended stacking approach offers improved prediction performance, increased resilience, and increased utility for individuals who are at high risk of developing cardiovascular disease. Hospitals can utilise this information to identify patients who are at a high risk of developing cardiovascular disease and provide them with early clinical intervention in order to reduce that risk. Research in the field of deep learning will benefit from additional data from a large number of medical institutions, which may be used for the development of artificial neural network structures or for the usage of deep learning frameworks in the future. In future work, other deep learning algorithms can be incorporated into Internet of Things (IoT) environments to achieve higher accuracy, which can be useful to hospitals and help save human lives.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

MA and KS: conceptualization. SS, NV, and MA: methodology and investigation. TU and LA-k: software. MSo, TU, and NA: validation. KA and RK: formal analysis. KA and MA: data curation. SS and NV: writing—original draft preparation. MA, MSo, LA-k, and RK: writing—review and editing. LA-k, NA, and RK: supervision. All authors contributed to the article and approved the submitted version.

Acknowledgments

The authors are thankful to the Princess Nourah bint Abdulrahman University Researchers Supporting Project, number PNURSP2023R82, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.


Keywords: cardiovascular disease, AI-based technologies, internet of things, machine learning, computational method

Citation: Subramani S, Varshney N, Anand MV, Soudagar MEM, Al-keridis LA, Upadhyay TK, Alshammari N, Saeed M, Subramanian K, Anbarasu K and Rohini K (2023) Cardiovascular diseases prediction by machine learning incorporation with deep learning. Front. Med . 10:1150933. doi: 10.3389/fmed.2023.1150933

Received: 25 January 2023; Accepted: 09 March 2023; Published: 17 April 2023.


Copyright © 2023 Subramani, Varshney, Anand, Soudagar, Al-keridis, Upadhyay, Alshammari, Saeed, Subramanian, Anbarasu and Rohini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rohini Karunakaran, [email protected]


  • Open access
  • Published: 10 October 2023

Early prediction of heart disease with data analysis using supervised learning with stochastic gradient boosting

  • Anil Pandurang Jawalkar 1 ,
  • Pandla Swetcha 1 ,
  • Nuka Manasvi 1 ,
  • Pakki Sreekala 1 ,
  • Samudrala Aishwarya 1 ,
  • Potru Kanaka Durga Bhavani 1 &
  • Pendem Anjani 1  

Journal of Engineering and Applied Science volume 70, Article number: 122 (2023)


Heart diseases are consistently ranked among the top causes of mortality worldwide. Early detection and accurate heart disease prediction can help effectively manage and prevent the disease; however, traditional methods have failed to improve heart disease classification performance. This article therefore proposes a machine learning approach for heart disease prediction (HDP) using a decision tree-based random forest (DTRF) classifier with loss optimization. First, a dataset of patient records labeled for the presence or absence of heart disease is preprocessed. A DTRF classifier is then trained on the dataset with the stochastic gradient boosting (SGB) loss optimization technique, and its performance is evaluated on a separate test dataset. The results demonstrate that the proposed HDP-DTRF approach achieved 86% precision, 86% recall, 85% F1-score, and 96% accuracy on publicly available real-world datasets, which is higher than traditional methods.

Introduction

In the United States, one person dies from cardiovascular disease roughly every 36 s. Coronary heart disease is the leading cause of mortality in the USA, accounting for one out of every four fatalities each year; it claims the lives of about 0.66 million people annually [ 1 ]. The expenditures associated with cardiovascular disease are significant for the US healthcare system: in 2021 and 2022 it resulted in annual costs of around $219 billion, owing to the increased demand for medical treatment and medication and to the productivity lost to deaths. Table 1 provides the statistics of the heart disease dataset with total heart disease cases, deaths, case fatality rate, and total vaccinations. A prompt diagnosis also aids in preventing heart failure, another potential cause of mortality in certain cases [ 2 ]. Because many traits put a person at risk of acquiring the ailment, it is difficult to diagnose heart disease in its earlier stages. Diabetes, hypertension, elevated cholesterol levels, an irregular pulse rhythm, and a wide variety of other conditions are among the risk factors that can contribute to it [ 3 ]. These ailments are grouped and discussed under "heart disease," an umbrella term. The symptoms of cardiac disease can differ considerably from one individual to the next and from one condition to another within the same patient [ 4 ]. Identifying and classifying cardiovascular diseases is a continuous process that succeeds when carried out by a qualified professional with appropriate knowledge and skill in the relevant field. Many aspects, such as age, diabetes, smoking, being overweight, and a diet high in junk food, have been shown either to cause heart disease or to raise the risk of developing it [ 5 ].

Most hospitals use management software to monitor the clinical and patient data they collect, and such systems generate a vast quantity of information on patients. These data are seldom used for decision-making support in clinical settings; they are precious, yet a significant portion of the knowledge they contain is left unused [ 6 ]. Because of the sheer volume of data involved, translating the acquired clinical data into information that intelligent systems can use to assist healthcare practitioners in making decisions is a process fraught with difficulties [ 7 ]. Intelligent systems put this knowledge to use to enhance the quality of treatment provided to patients. This issue also motivated research on the processing of medical images: because there were not enough specialists and too many cases were misdiagnosed, an automated detection method that was both quick and effective became necessary [ 8 ].

The primary objective of the research centers on the effective utilization of a classifier model that categorizes and identifies vital components within complex medical data. This categorization is a critical step towards early diagnosis of cardiovascular diseases, potentially contributing to improved patient outcomes and healthcare management [ 9 ]. However, predicting disease at an early stage is not without challenges. One significant factor is the inherent complexity of the predictive methods employed in the classification process [ 10 ]. The intricate nature of these methods can make the underlying decision-making difficult to interpret, which might impede the integration of such models into clinical practice. Furthermore, the efficiency of disease prediction models is affected by their execution time: swift diagnosis and intervention are crucial in medical conditions, and time-intensive models might not align with the urgency required for timely medical decisions. Researchers [ 11 ] have investigated various alternative strategies to forecast cardiovascular diseases; timely treatment and diagnosis have the potential to save countless lives. The novel contributions of this work are as follows:

Preprocessing of HDP dataset with normalization, exploratory data analysis (EDA), data visualization, and extraction of top correlated features.

Implementation of DTRF classifier for training preprocessed dataset, which can accurately predict the presence or absence of heart disease.

The SGB loss optimization is used to reduce the losses generated during the training process, which tunes the hyperparameters of DTRF.

The rest of the article is organized as follows: Sect. 2 gives a detailed literature survey analysis. Section 3 gives a detailed analysis of the proposed HDP-DTRF with multiple modules. Section 4 gives a detailed simulation analysis of the proposed HDP-DTRF. Section 5 concludes the article.

Literature survey

Rani et al. [ 12 ] designed a novel hybrid decision support system to diagnose cardiac ailments early. They effectively addressed the missing data challenge by employing multivariate imputations through chained equations. Additionally, their unique approach to feature selection involved a fusion of genetic algorithms (GA) and recursive feature reduction. Notably, the integration of random forest classifiers played a pivotal role in significantly enhancing the accuracy of their system. However, despite these advancements, their hybrid approach’s complexity might have posed challenges in terms of interpretability and practical implementation. Kavitha et al. [ 13 ] embraced machine learning techniques to forecast cardiac diseases. They introduced a hybrid model by incorporating random forest as the base classifier. This hybridization aimed to enhance prediction accuracy; however, their decision to capture and store user input parameters for future use was intriguing but yielded suboptimal classification performance. This unique approach could be viewed as an innovative attempt to integrate patient-specific information, yet the exact impact on overall performance warrants further investigation.

Mohan et al. [ 14 ] further advanced the field by employing a hybrid model that combined random forest with a linear model to predict cardiovascular diseases. Through this amalgamation of different classification approaches and feature combinations, they achieved commendable performance with an accuracy of 88.7%. However, it is worth noting that while hybrid models show promise, the trade-offs between complexity and interpretability could influence their practical utility in real-world clinical settings. To predict heart diseases, Shah et al. [ 15 ] adopted supervised learning techniques, including Naive Bayes, decision trees, K-nearest neighbor (KNN), and random forest algorithms. Their choice of utilizing the Cleveland database from the UCI repository as their data source added a sense of universality to their findings. However, the lack of customization in data sources might limit the applicability of their model to diverse patient populations with varying characteristics. Guo et al. [ 16 ] contributed to the field by harnessing an improved learning machine (ILM) model in conjunction with machine learning techniques. Integrating novel feature combinations and categorization methods showcased their dedication to enhancing performance and accuracy. Nonetheless, while their approach exhibits promising results, the precise impact of specific feature combinations on prediction accuracy could have been further explored. Ahmed et al. [ 17 ] presented an innovative real-time prediction system for cardiac diseases using Apache Spark and Apache Kafka. This system, characterized by its three-tier architecture—offline model building, online prediction, and stream processing pipeline—highlighted its commitment to harnessing cutting-edge technologies for practical medical applications. However, the scalability and resource requirements of such real-time systems, especially in healthcare settings with limited computational resources, could be an area of concern.

Kataria et al. [ 18 ] comprehensively analyzed and compared various machine learning algorithms for predicting heart disease. Their focus on analyzing the algorithms’ ability to predict heart disease effectively sheds light on their dedication to identifying the most suitable model. However, their study’s outcome might have been further enriched by addressing the unique challenges posed by individual attributes, such as high blood pressure and diabetes, in a more customized manner. Kannan et al. [ 19 ] meticulously evaluated machine learning algorithms to predict and diagnose cardiac sickness. By selecting 14 criteria from the UCI Cardiac Datasets, they showcased their dedication to designing a comprehensive study. Nevertheless, a deeper analysis of how these algorithms perform with specific criteria and their contributions to accurate predictions could provide more actionable insights.

Ali et al. [ 20 ] conducted a detailed analysis of supervised machine-learning algorithms for predicting cardiac disease. Their thorough evaluation of decision trees, k-nearest neighbors, and logistic regression classifiers (LRC) provided a well-rounded perspective on the strengths and limitations of each method. However, a more fine-grained analysis of how these algorithms perform under various parameter configurations and feature combinations might offer additional insights into their potential use cases. Mienye et al. [ 21 ] introduced an enhanced technique for ensemble learning, utilizing decision trees, random forests, and support vector machine classifiers. The voting system they employed to aggregate results showcased their innovative approach to combining various methods. However, the potential trade-offs between ensemble complexity and the robustness of predictions could be considered for future refinement. Dutta et al. [ 22 ] revolutionized the field by introducing convolutional neural networks (CNNs) for predicting coronary heart disease. Their approach, leveraging the power of CNNs on a large dataset of ECG signals, showcased the potential for deep learning techniques in healthcare. However, the requirement for extensive computational resources and potential challenges in model interpretability could be areas warranting further attention. Latha et al. [ 23 ] demonstrated ensemble classification approaches. Combined with a bagging technique, their utilization of decision trees, naive Bayes, and random forest exemplified their determination to achieve robust results. Nevertheless, the potential interplay between different ensemble techniques and their effectiveness under various scenarios could be explored further.

Ishaq et al. [ 24 ] introduced the concept of using the synthetic minority oversampling technique (SMOTE) in conjunction with efficient data mining methods to improve survival prediction for heart failure patients. Their emphasis on addressing class imbalance through SMOTE showcased their awareness of real-world challenges in healthcare datasets. However, the potential impact of the SMOTE method on individual patient subgroups and its implications for model fairness could be areas of future exploration. Asadi et al. [ 25 ] proposed a unique cardiac disease detection technique based on random forest swarm optimization. Their use of a large dataset for evaluation underscored their dedication to robust testing. However, the potential influence of dataset characteristics and the algorithm’s sensitivity to various parameters on prediction performance could be investigated further.

Proposed methodology

Heart disease is a significant health problem worldwide and is responsible for many deaths every year. Traditional methods for diagnosing heart disease are often time-consuming, expensive, and inaccurate. Therefore, there is a need for more accurate and efficient methods for predicting and diagnosing heart disease. The article aims to provide a detailed analysis of the proposed HDP-DTRF approach and its performance in accurately predicting the presence or absence of heart disease. The results demonstrate the effectiveness of the proposed approach, which can lead to improved diagnosis and treatment of heart disease, ultimately leading to better health outcomes for patients.

Figure  1 shows the proposed HDP-DTRF block diagram. The initial step in the proposed approach is the preprocessing of a dataset consisting of patient records with known labels indicating the presence or absence of heart disease. The dataset is then used to train a DTRF classifier with the SGB loss optimization technique. The performance of the trained classifier is evaluated using a separate publicly available real-world test dataset, and the results show that the proposed HDP-DTRF approach can accurately predict the presence or absence of heart disease. Using decision trees in the random forest classifier enables the algorithm to handle nonlinear data and make accurate predictions even with missing or noisy data. Applying the SGB loss optimization technique further enhances the algorithm’s performance by improving the convergence rate and avoiding overfitting. The proposed approach can be useful in clinical decision-making processes, enabling medical professionals to predict the likelihood of heart disease in patients and take appropriate preventive measures.

Figure 1: Block diagram for the proposed HDP-DTRF system

The detailed operation of the proposed HDP-DTRF system is illustrated as follows:

Step 1: Data preprocessing: Gather a dataset containing patient records, where each record includes features such as age, blood pressure, and cholesterol levels, along with labels indicating whether the patient has heart disease. Remove duplicate records, handle missing values (e.g., imputing missing data or removing instances with missing values), and eliminate irrelevant or redundant features. Encode categorical variables (like gender) into numerical values using techniques like one-hot encoding. Scale numerical features to bring them to a common scale, which can prevent features with larger ranges from dominating the model.
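The preprocessing in Step 1 can be sketched as follows. This is a minimal illustration on toy records; the column names (`age`, `sex`, `chol`, `target`) and values are assumptions for demonstration, not the paper's actual dataset schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy patient records (not the paper's data).
df = pd.DataFrame({
    "age":    [63, 45, 63, 54],
    "sex":    ["M", "F", "M", "F"],   # categorical feature
    "chol":   [233, 204, 233, 239],   # serum cholesterol
    "target": [1, 0, 1, 0],           # 1 = heart disease present
})

df = df.drop_duplicates()                                    # remove duplicate records
df = df.dropna()                                             # simplest missing-value handling
df = pd.get_dummies(df, columns=["sex"], drop_first=True)    # one-hot encode categoricals

num_cols = ["age", "chol"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])  # bring features to a common scale

X, y = df.drop(columns="target"), df["target"]
```

More elaborate imputation (rather than dropping rows) would follow the same pattern.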

Step 2: Training the DTRF classifier: Initialize an empty random forest ensemble. For each tree in the ensemble, randomly sample the training data with replacement. It creates a bootstrapped dataset for training each tree, ensuring diversity in the data subsets. Construct a decision tree using the bootstrapped dataset. At each node of the tree, split the data based on the feature that provides the best separation, determined using metrics like Gini impurity or information gain. Add the constructed decision tree to the random forest ensemble. Repeat the process to create the ensemble’s desired number of decision trees.

Step 3: SGB optimization: Initialize the model by setting the initial prediction to the mean of the target labels. Calculate the negative gradient of the loss function (such as mean squared error or log loss) concerning the current model’s predictions. This gradient represents the direction in which the model’s predictions need to be adjusted to minimize the loss. Train a new decision tree using the negative gradient as the target. This new tree will help correct the errors made by the previous model iterations. Update the model’s predictions by adding the predictions of the new tree, scaled by a learning rate. This step moves the model closer to the correct predictions. Repeat the process for a predefined number of iterations. Each iteration focuses on improving the model’s predictions based on the errors made in the previous iterations.
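The boosting loop of Step 3 can be sketched from scratch as below. This is a hedged illustration using squared loss, whose negative gradient is simply the residual; the per-round subsampling supplies the "stochastic" part of SGB, and the synthetic data and hyperparameters are arbitrary choices, not the paper's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # toy binary target

F = np.full(len(y), y.mean())          # initial prediction: mean of the labels
learning_rate, trees = 0.1, []

for _ in range(100):
    # Stochastic step: fit each round on a random half of the training set.
    idx = rng.choice(len(y), size=len(y) // 2, replace=False)
    residual = y[idx] - F[idx]         # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X[idx], residual)
    F += learning_rate * tree.predict(X)   # update, scaled by the learning rate
    trees.append(tree)

train_acc = ((F > 0.5).astype(float) == y).mean()
```

Each iteration corrects the errors of the previous ones, exactly as the step describes; a log-loss variant would replace the residual with the corresponding negative gradient.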

Step 4: Performance evaluation: Use a separate real-world test dataset that was not used during training to evaluate the performance of the trained HDP-DTRF classifier.
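Steps 2–4 together can be approximated with scikit-learn's `GradientBoostingClassifier`, which builds an ensemble of decision trees fitted by stochastic gradient boosting whenever `subsample < 1.0`. This is a sketch on synthetic data standing in for the paper's real-world heart dataset, so the resulting scores are not the paper's reported figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled heart-disease dataset (13 features).
X, y = make_classification(n_samples=500, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 subsample=0.8,      # stochastic subsampling
                                 random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Step 4: evaluate on the held-out test split.
metrics = {
    "precision": precision_score(y_te, pred),
    "recall":    recall_score(y_te, pred),
    "f1":        f1_score(y_te, pred),
    "accuracy":  accuracy_score(y_te, pred),
}
```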

DTRF classifier

The DTRF classifier, an ensemble learning model, centers on the decision tree as its core component. As illustrated in Fig.  2 , the DTRF comprises multiple decision trees trained with the bagging technique. During classification, when a sample requiring classification is input, the final outcome is determined by a majority vote over the outputs of the individual decision trees [ 26 ]. In classifying high-dimensional data, the DTRF model outperforms standalone decision trees by effectively addressing overfitting, displaying robust resistance to noise and outliers, and demonstrating exceptional scalability and parallel processing capability. Notably, the strength of DTRF stems from its essentially parameter-free, data-driven nature: the model requires no prior classification knowledge from the user and learns classification rules from observed instances, which enhances its adaptability to various data scenarios. The essence of the DTRF model lies in utilizing \(K\) decision trees. Each tree contributes a single "vote" for the category it deems most fitting, thereby participating in determining the class to which the independent variable \(X\) under consideration should be allocated. This approach harnesses the collective wisdom of multiple trees, producing accurate and robust classification outcomes that capitalize on the diverse insights provided by each tree. The mathematical analysis of DTRF is as follows:

Figure 2: Block diagram of DTRF

\( \left\{ h\left(X,{\theta }_{k}\right),\; k=1,2,\dots ,K \right\} \)  (1)

Here, \(K\) represents the number of decision trees present in the DTRF. In this context, \({\theta }_{k}\) is a collection of independent random vectors uniformly distributed amongst themselves. Here, \(K\) individual decision trees are generated. Each tree provides its prediction for the category that best fits the independent variable \(X\) . The predictions made by the \(K\) decision trees are combined through a voting mechanism to determine the final category assignment for the independent variable \(X\) . It is important to note that the given Eq. ( 1 ) indicates the ensemble nature of the DTRF model, where multiple decision trees work collectively to enhance predictive accuracy and robustness. The collection of \({\theta }_{k}\) represents the varied parameter sets for each decision tree within the ensemble.
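The voting mechanism described above reduces to a simple majority count over the per-tree predictions. A minimal sketch, with plain functions standing in for trained trees:

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the class receiving the most votes among the tree classifiers."""
    votes = [tree(x) for tree in trees]        # each "tree" maps a sample to a class
    return Counter(votes).most_common(1)[0][0]

# Three stand-in "trees" (plain functions here, not real trained trees):
trees = [lambda x: 1, lambda x: 0, lambda x: 1]
prediction = forest_predict(trees, x=None)     # majority of votes [1, 0, 1] is 1
```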

The following procedures must be followed to produce a DTRF:

Step 1: The \(K\) classification and regression trees are generated by drawing \(K\) bootstrap sample sets from the original training set, using random sampling with replacement; this procedure is repeated until all \(K\) sample sets have been extracted.

Step 2: At each node of a tree, m features are randomly selected from the full feature set of the training set (m ≪ n, where n is the total number of features). Only one of the m candidate features is employed in the node-splitting procedure: the one with the greatest classification potential. To determine this, DTRF calculates the amount of information contained in each candidate feature.

Step 3: Each tree is grown to its full depth and is never pruned.

Step 4: The generated trees together constitute the DTRF, and freshly received data are classified by the DTRF. The number of votes cast by the tree classifiers determines the classification outcome.
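Steps 1–4 can be sketched in a few lines of pure Python. The sketch substitutes one-feature decision stumps for full classification trees and uses a toy one-dimensional dataset; all names, values, and the stump learner are illustrative, not from the paper:

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-level tree (stump) on one bootstrap sample of (value, label) pairs.

    The stump predicts class 1 when the feature exceeds a threshold; the
    threshold with the most correct predictions on this sample is kept.
    """
    best = None
    for threshold, _ in sample:
        correct = sum((x > threshold) == bool(y) for x, y in sample)
        if best is None or correct > best[1]:
            best = (threshold, correct)
    return best[0]

def train_forest(data, k, seed=0):
    """Step 1: draw K bootstrap samples (with replacement); train one stump each."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    """Step 4: each tree casts one vote; the majority vote decides the class."""
    votes = Counter(int(x > t) for t in forest)
    return votes.most_common(1)[0][0]

# Toy data: feature values below 5 belong to class 0, above 5 to class 1.
data = [(v, 0) for v in range(5)] + [(v, 1) for v in range(6, 11)]
forest = train_forest(data, k=11)
print(predict(forest, 2), predict(forest, 9))  # class 0, then class 1
```

A production DTRF would grow full trees over randomly selected feature subsets (Step 2) rather than stumps, but the bootstrap-then-vote mechanics are the same.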

Several important indicators of generalization performance are inherent to DTRFs: the similarity and correlation between the individual decision trees, the generalization error, and the system’s overall ability to generalize. A system’s decision-making efficacy is determined by how well it generalizes to fresh data that follows the same distribution as the training set [ 27 ]. Reducing the generalization error benefits the system’s performance and generalizability. The generalization error is defined as follows:
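Assuming the passage follows Breiman's standard random-forest analysis, the generalization error just described takes the form:

```latex
PE^{*} = P_{X,Y}\big(Mr(X, Y) < 0\big)
```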

Here, \(P{E}^{*}\) denotes the generalization error, the subscripts \(X\) and \(Y\) indicate the space over which the probability is defined, and \(Mr (X, Y)\) is the margin function, defined as follows:
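A reconstruction of the margin function, again assuming Breiman's standard formulation and consistent with the symbol definitions that follow:

```latex
Mr(X, Y) = \operatorname{avg}_k I\big(h_k(X) = Y\big) \;-\; \max_{J \neq Y} \operatorname{avg}_k I\big(h_k(X) = J\big)
```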

Here, \(X\) stands for the input sample, \(Y\) indicates the correct classification, and \(J\) indicates an incorrect one. Specifically, \(h(g)\) is a representation of a sequence model for classification, \(I(g)\) indicates an indicator function, and \({avg}_{k}(g)\) means averaging over the \(k\) trees. The margin function measures how many more votes the correct classification for sample X receives than any incorrect classification. As the value of the margin function grows, so does the classifier’s confidence in its accuracy. The convergence formulation of the generalization error is as follows [ 28 ]:

As the number of decision trees grows, the generalization error tends toward an upper limit, as predicted by the preceding formulation, and the model does not over-fit. The classification power of each tree and the correlation between trees are used to estimate the maximum allowed generalization error. The DTRF model aims to produce a forest with a small correlation coefficient and strong classification power. The classification intensity ( \(S\) ) is the mathematical expectation of the variable \(mr(X, Y)\) over the whole sample space.

Here, \(\theta\) and \(\theta {\prime}\) are independent and identically distributed random vectors, and \(\overline{\rho }\) is the mean correlation coefficient, taken as the expectation \({E}_{X, Y}\), between \(mr(\theta , X, Y)\) and \(mr(\theta {\prime}, X, Y)\):

Among them, \(sd(\theta )\) can be expressed as follows:

Equation ( 7 ) is a metric that quantifies the degree to which the trees \(h(X,\theta )\) and \(h(X,\theta {\prime})\) are correlated with one another on the dataset consisting of X , Y ; the more strongly the trees are correlated, the larger \(\overline{\rho }\) becomes. The upper limit of the generalization error is obtained using the following formula, which is based on the Chebyshev inequality:
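The Chebyshev-based bound in Breiman's analysis, which this passage appears to reference, is:

```latex
PE^{*} \;\le\; \frac{\overline{\rho}\,\big(1 - S^{2}\big)}{S^{2}}
```

where \(S\) is the classification intensity and \(\overline{\rho}\) the mean correlation between trees.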

The generalization error limit of a DTRF grows with the correlation \(\overline{\rho }\) between individual decision trees and falls as the classification intensity \(S\) of the individual trees increases. That is to say, the greater the strength \(S\) and the weaker the inter-tree correlation, the lower the error limit. If the DTRF is to improve its classification accuracy, the threshold for generalization error must be lowered.

SGB loss optimization

The SGB optimization approach has recently seen increased use in deep-learning applications that demand more from the learning procedure than conventional means can provide. The base learning rate that SGB uses, alpha, does not fluctuate during training. To increase performance in scenarios with sparse gradients (for example, computer vision tasks) and in online, non-stationary (noisy) settings, the algorithm additionally maintains a per-parameter learning rate that is updated based on the average of recent gradient magnitudes for each weight (that is, on how rapidly each weight is changing). The chain rule of calculus is applied to compute the partial derivatives: the gradient of the loss with respect to the weights and biases tells us how the loss varies as a function of those parameters. Let us assume that we have a training dataset with N samples, denoted as { \({x}_{i}, {y}_{i}\) } for i = 1, 2, …, N , where \({x}_{i}\) is the input and \({y}_{i}\) is the true label or target value. A decision tree with parameters θ predicts the output \({\widehat{\mathrm{y}}}_{i}\) for input \({x}_{i}\) . The output can be any function of the parameters and the input, represented as \({\widehat{\mathrm{y}}}_{i} = f({x}_{i}, \theta ).\) The goal is to minimize the difference between the predicted output \({\widehat{\mathrm{y}}}_{i}\) and the true label \({y}_{i}\) , which is typically done by defining a loss function \(L({\widehat{\mathrm{y}}}_{i}, {y}_{i})\) that quantifies the difference between the predicted and true values.
The total loss over the entire dataset is then defined as the sum of the individual losses over all samples:
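Written out, the total loss referred to here is:

```latex
L_{total} = \sum_{i=1}^{N} L\big(\hat{y}_i, y_i\big)
```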

The optimization algorithm estimates the values of the parameters \(\theta\) that minimize this total loss. This is typically done using gradient descent, which updates the parameters \(\theta\) in the direction opposite to the gradient of the total loss with respect to the parameters:

Here, \(\alpha\) is the learning rate, which controls the size of the parameter update, and \({\nabla }_{\theta }{L}_{total}\) is the gradient of the total loss with respect to the parameters θ. SGB can oscillate and take a long time to converge due to noisy gradients. Momentum is a technique that helps SGB converge faster by adding a fraction of the previous update to the current update:

Here, \({v}_{t}\) is the momentum term at iteration \(t, \beta\) is the momentum coefficient, typically set to 0.9 or 0.99, and the other terms are as previously defined.
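The two update rules above can be run on a toy one-dimensional quadratic loss. The loss function, starting point, and hyperparameter values below are illustrative choices, and the exponential-moving-average form of the momentum term is one common variant:

```python
# Gradient descent with momentum on the toy loss L(theta) = (theta - 3)^2.
# alpha is the learning rate and beta the momentum coefficient; all values
# here are illustrative, not taken from the paper.

def grad(theta):
    return 2.0 * (theta - 3.0)  # dL/dtheta for L(theta) = (theta - 3)^2

theta, v = 0.0, 0.0
alpha, beta = 0.1, 0.9
for _ in range(300):
    v = beta * v + (1 - beta) * grad(theta)  # momentum term v_t
    theta -= alpha * v                       # parameter update
print(abs(theta - 3.0) < 1e-4)  # converged to the minimiser theta = 3
```

With beta = 0, the loop reduces to plain gradient descent; the momentum term smooths the noisy per-step gradients and damps oscillation.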

Results and discussion

This section gives a detailed performance analysis of the proposed HDP-DTRF. The performance of the proposed method is measured using multiple performance metrics, which are reported for the proposed method as well as for the existing methods. All methods are evaluated on the same publicly available real-world dataset.

The Cleveland Heart Disease dataset contains data on 303 patients who were evaluated for heart disease. The dataset was obtained from the open-access UCI Machine Learning Repository. Each patient is represented by 14 attributes, which include demographic and clinical information such as age, sex, chest pain type, resting blood pressure, serum cholesterol level, and exercise test results. Each of the 303 records corresponds to a unique patient and includes values for all 14 attributes along with the diagnosis of heart disease (present or absent). Table 2 provides a detailed description of the dataset. Researchers and data scientists can use this dataset to develop predictive models for heart disease diagnosis or explore relationships between the different variables in the dataset. With 303 records, this dataset is relatively small compared to other medical datasets. However, it is still widely used in heart disease research due to its rich attributes and long history of use in research studies.

EDA is essential in understanding and analyzing any dataset, including the Cleveland Heart Disease dataset. EDA involves examining the dataset’s basic properties, identifying missing values, checking data distributions, and exploring relationships between variables. Figure  3 shows the EDA of the dataset. Figure  3 (a) shows the count for each target class. Here, the no heart disease class contains 138 records, and the heart disease presented class contains 165 records. Figure  3 (b) shows the percentages of male and female records in the dataset. Here, the dataset contains 68.32% male and 31.68% female records. Figure  3 (c) shows the percentage of records for chest pain experienced by the patient in the dataset. Here, the dataset contains 47.19% of records in typical angina, 16.50% in atypical angina, 28.71% in non-anginal pain, and 7.59% in the asymptomatic class. Figure  3 (d) shows the percentage of records for fasting blood sugar in the dataset. Here, the dataset contains 85.15% of records in the fasting blood sugar (> 120 mg/dl) class and 14.85% of records in the fasting blood sugar (< 120 mg/dl) class. Figure  4 shows the heart disease frequency by age for both no disease and disease classes. The output contains histogram levels that show the frequency of heart disease by age. Here, the counts of patients with and without heart disease are shown in red and green colors. The overlap between the bars shows how the frequency of heart disease varies with age, with patient ages ranging from 29 to 77 years.

figure 3

EDA of the dataset. a Count for each target class. b Male–female distribution. c Chest pain experienced by patient distribution. d Fasting blood sugar distribution

figure 4

Heart disease frequency by age

Figure  5 shows the frequencies for different columns of the dataset, which contains the frequencies of chest pain type, fasting blood sugar, rest ECG, exercise-induced angina, st_slope, and number of major vessel columns. Exploring the frequencies of different variables in a dataset is crucial in understanding the data and gaining insights about the underlying phenomena. By analyzing the frequency of values in each variable, we can better understand the data distribution and identify potential patterns, relationships, and outliers that are important for further analysis. For example, understanding the frequency of different chest pain types in a heart disease dataset reveals whether certain types of chest pain are more strongly associated with the disease than others. Similarly, analyzing the frequency of different fasting blood sugar levels helps to identify potential risk factors for heart disease. Overall, exploring the frequencies of variables is an important step in the EDA process, as it provides a starting point for identifying potential relationships and patterns in the data.

figure 5

Frequencies for different columns of the dataset

Performance evaluation

Table 3 shows the class-specific performance evaluation of HDP-DTRF. Here, the performance was measured for class-0 (no heart disease) and class-1 (heart disease presented) classes. Further, macro average and weighted average performances were also measured. Macro average treats all classes equally, regardless of their size. It calculates the average performance metrics across all classes, giving each class an equal weight. It means that the performance of smaller classes will have the same impact on the metric as larger classes. Then, the weighted average considers the size of each class. It calculates the average performance metric across all classes but gives each class a weight proportional to its size. It means that the performance of larger classes will have a greater impact on the metric than smaller classes.
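The two averaging schemes can be sketched directly. The per-class precision values below are hypothetical; the class supports match the dataset (138 no-disease vs. 165 disease records):

```python
# Macro vs. weighted averaging of a per-class metric.
precision = {0: 0.90, 1: 0.80}   # hypothetical per-class precision values
support = {0: 138, 1: 165}       # class sizes from the Cleveland dataset

# Macro average: every class contributes equally, regardless of size.
macro = sum(precision.values()) / len(precision)

# Weighted average: each class is weighted by its number of records.
weighted = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print(round(macro, 4), round(weighted, 4))
```

Because class 1 is the larger class here, its (lower) precision pulls the weighted average below the macro average.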

Table 4 shows the class-0 performance comparison of various methods. Here, the proposed HDP-DTRF improved precision by 5.75%, recall by 1.37%, F1-score by 6%, and accuracy by 2.45% compared to KNN [ 15 ]. Then, the proposed HDP-DTRF improved precision by 3.45%, recall by 0.63%, F1-score by 3.61%, and accuracy by 1.45% compared to ILM [ 16 ]. Then, the proposed HDP-DTRF improved precision by 2.30%, recall by 1.27%, F1-score by 3.61%, and accuracy by 1.03% compared to LRC [ 20 ]. Table 5 shows the class-1 performance comparison of various methods. Here, KNN [ 15 ] shows a 2.35% lower precision, a 4.40% lower recall, a 3.53% lower F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method. Then, ILM shows a 2.35% lower precision, a 5.49% lower recall, a 1.14% lower F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method. Then, LRC [ 20 ] shows a 4.71% lower precision, an 11.11% lower recall, a 2.27% lower F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method.

Table 6 shows the macro average performance comparison of various methods. For KNN [ 15 ], the percentage improvements are 7.5% for precision, 13.3% for recall, 10.4% for F1-score, and 6.7% for accuracy. For ILM [ 16 ], the percentage improvements are achieved as 2.4% for precision, 6.1% for recall, 6.0% for F1-score, and 3.2% for accuracy. For LRC [ 20 ], the percentage improvements are achieved as 3.4% for precision, 10.0% for recall, 6.0% for F1-score, and 4.3% for accuracy achieved by the proposed method. Table 7 shows the weighted average performance comparison of various methods. For KNN [ 15 ], the percentage improvements are 6.5% for precision, 3.3% for recall, 1.4% for F1-score, and 6.7% for accuracy. For ILM [ 16 ], the percentage improvements are achieved as 2.4% for precision, 5.1% for recall, 6.0% for F1-score, and 3.2% for accuracy. For LRC [ 20 ], the percentage improvements are achieved as 1.4% for precision, 1.0% for recall, 6.0% for F1-score, and 4.3% for accuracy achieved by the proposed method.

The ROC curve of the proposed HDP-DTRF is seen in Fig.  6 . The true positive rate (TPR) is shown against the false-positive rate (FPR) on the ROC curve, which considers various threshold values. In the context of the HDP-DTRF technique, the ROC curve illustrates the degree to which the model can differentiate between positive and negative heart disease instances. The model’s performance is greater when it has a higher TPR and a lower FPR. The ROC curve that represents the HDP-DTRF approach that has been suggested is used to find the best classification threshold, which strikes a balance between sensitivity and specificity in the diagnostic process. If there is a point on the ROC curve that is closer to the top left corner, this implies that the model is doing better.
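The (FPR, TPR) pairs behind such a ROC curve come from sweeping a decision threshold over the classifier's scores. A minimal sketch, using made-up scores and labels rather than the paper's predictions:

```python
# Compute ROC points by sweeping a decision threshold over classifier scores.
def roc_points(scores, labels, thresholds):
    pos = sum(labels)        # number of positive (disease) samples
    neg = len(labels) - pos  # number of negative (no-disease) samples
    points = []
    for t in thresholds:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        points.append((fp / neg, tp / pos))  # (FPR, TPR) at threshold t
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # illustrative predicted scores
labels = [1, 1, 0, 1, 0, 0]              # illustrative true classes
for fpr, tpr in roc_points(scores, labels, [0.5, 0.0]):
    print(fpr, tpr)
```

Lowering the threshold moves the operating point toward (1, 1); the threshold whose point lies closest to the top-left corner balances sensitivity and specificity.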

figure 6

ROC curve of proposed HDP-DTRF

Conclusions

This article proposes a machine-learning approach for heart disease prediction. The approach uses a DTRF classifier with loss optimization and involves preprocessing a dataset of patient records to determine the presence or absence of heart disease. The DTRF classifier is then trained with SGB loss optimization and evaluated using a separate test dataset. The proposed HDP-DTRF improved class-specific performance as well as macro- and weighted-average performance measures. Overall, the proposed HDP-DTRF improved precision by 2.30%, recall by 1.27%, F1-score by 3.61%, and accuracy by 1.03% compared to traditional methodologies. Further, this work can be extended with deep learning-based classification combined with machine learning feature analysis.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Abbreviations

  • HDP: Heart disease prediction
  • DTRF: Decision tree-based random forest
  • SGB: Stochastic gradient boosting
  • FP: False positive
  • FN: False negative
  • TN: True negative
  • TP: True positive

Bhatt CM et al (2023) Effective heart disease prediction using machine learning techniques. Algorithms 16(2):88


Dileep P et al (2023) An automatic heart disease prediction using cluster-based bi-directional LSTM (C-BiLSTM) algorithm. Neural Comput Appl 35(10):7253–7266

Jain A et al (2023) Optimized levy flight model for heart disease prediction using CNN framework in big data application. Exp Syst Appl 223:119859

Nandy S et al (2023) An intelligent heart disease prediction system based on swarm-artificial neural network. Neural Comput Appl 35(20):14723–14737

Hassan D et al (2023) Heart disease prediction based on pre-trained deep neural networks combined with principal component analysis. Biomed Signal Proc Contr 79:104019

Ozcan M et al (2023) A classification and regression tree algorithm for heart disease modeling and prediction. Healthc Anal 3:100130

Saranya G et al (2023) A novel feature selection approach with integrated feature sensitivity and feature correlation for improved heart disease prediction. J Ambient Intell Humaniz Comput 14(9):12005–12019

Sudha VK et al (2023) Hybrid CNN and LSTM network for heart disease prediction. SN Comp Sc 4(2):172

Chaurasia V et al (2023) Novel method of characterization of heart disease prediction using sequential feature selection-based ensemble technique. Biomed Mat Dev 1–10. https://doi.org/10.1007/s44174-022-00060-x

Ogundepo EA et al (2023) Performance analysis of supervised classification models on heart disease prediction. Innov Syst Software Eng 19(1):129–144

de Vries S et al (2023) Development and validation of risk prediction models for coronary heart disease and heart failure after treatment for Hodgkin lymphoma. J Clin Oncol 41(1):86–95

Vijaya Kishore V, Kalpana V (2020) Effect of Noise on Segmentation Evaluation Parameters. In: Pant, M., Kumar Sharma, T., Arya, R., Sahana, B., Zolfagharinia, H. (eds) Soft Computing: Theories and Applications. Advances in Intelligent Systems and Computing, vol 1154. Springer, Singapore. https://doi.org/10.1007/978-981-15-4032-5_41 .

Kalpana V, Vijaya Kishore V, Praveena K (2020) A Common Framework for the Extraction of ILD Patterns from CT Image. In: Hitendra Sarma, T., Sankar, V., Shaik, R. (eds) Emerging Trends in Electrical, Communications, and Information Technologies. Lecture Notes in Electrical Engineering, vol 569. Springer, Singapore. https://doi.org/10.1007/978-981-13-8942-9_42

Annamalai M, Muthiah P (2022) An Early Prediction of Tumor in Heart by Cardiac Masses Classification in Echocardiogram Images Using Robust Back Propagation Neural Network Classifier. Brazilian Archives of Biology and Technology. 65. https://doi.org/10.1590/1678-4324-2022210316

Shah D et al (2020) Heart disease prediction using machine learning techniques. SN Comput Sci 1:345

Guo C et al (2020) Recursion enhanced random forest with an improved linear model (RERF-ILM) for heart disease detection on the internet of medical things platform. IEEE Access 8:59247–59256

Ahmed H et al (2020) Heart disease identification from patients’ social posts, machine learning solution on Spark. Future Gen Comp Syst 111:714–722

Katarya R et al (2021) Machine learning techniques for heart disease prediction: a comparative study and analysis. Health Technol 11:87–97

Kannan R et al (2019) Machine learning algorithms with ROC curve for predicting and diagnosing the heart disease. In: Soft Computing and Medical Bioinformatics. Springer, Singapore


Ali MM et al (2021) Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput Biol Med 136:104672

Mienye ID et al (2020) An improved ensemble learning approach for the prediction of heart disease risk. Inform Med Unlocked 20:100402

Dutta A et al (2020) An efficient convolutional neural network for coronary heart disease prediction. Expert Syst Appl 159:113408

Latha CBC et al (2019) Improving the accuracy of heart disease risk prediction based on ensemble classification techniques. Inform Med Unlocked 16:100203

Ishaq A et al (2021) Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques. IEEE Access 9:39707–39716

Asadi S et al (2021) Random forest swarm optimization-based for heart diseases diagnosis. J Biomed Inform 115:103690

Asif D et al (2023) Enhancing heart disease prediction through ensemble learning techniques with hyperparameter optimization. Algorithms 16(6):308

David VAR S, Govinda E, Ganapriya K, Dhanapal R, Manikandan A (2023) An automatic brain tumors detection and classification using deep convolutional neural network with VGG-19. In: 2023 2nd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, pp 1–5. https://doi.org/10.1109/ICAECA56562.2023.10200949

Radwan M et al (2023) MLHeartDisPrediction: heart disease prediction using machine learning. J Comp Commun 2(1):50-65


Acknowledgements

Not applicable.

Funding

No funding was received from any government or private concern.

Author information

Authors and affiliations.

Department of Information Technology, Malla Reddy Engineering College for Women (UGC-Autonomous), Maisammaguda, Hyderabad, India

Anil Pandurang Jawalkar, Pandla Swetcha, Nuka Manasvi, Pakki Sreekala, Samudrala Aishwarya, Potru Kanaka Durga Bhavani & Pendem Anjani


Contributions

A.P.J, P.S., and N.M. contributed to the technical content of the paper, and P.S. and S.A. contributed to the conceptual content and architectural design. P.K., D.B., and P.A. contributed to the guidance and counseling on the writing of the paper.

Corresponding author

Correspondence to Anil Pandurang Jawalkar .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Jawalkar, A.P., Swetcha, P., Manasvi, N. et al. Early prediction of heart disease with data analysis using supervised learning with stochastic gradient boosting. J. Eng. Appl. Sci. 70 , 122 (2023). https://doi.org/10.1186/s44147-023-00280-y


Received : 31 May 2023

Accepted : 05 September 2023

Published : 10 October 2023

DOI : https://doi.org/10.1186/s44147-023-00280-y


  • Heart disease
  • Machine learning
  • Decision tree
  • Random forest
  • Loss optimization

  • Menzies, T.; Kocagüneli, E.; Minku, L.; Peters, F.; Turhan, B. Using Goals in Model-Based Reasoning. In Sharing Data and Models in Software Engineering ; Morgan Kaufmann: San Francisco, CA, USA, 2015; pp. 321–353. [ Google Scholar ] [ CrossRef ]
  • Fayez, M.; Kurnaz, S. Novel method for diagnosis diseases using advanced high-performance machine learning system. Appl. Nanosci. 2021 . [ Google Scholar ] [ CrossRef ]
  • Hassan, C.A.U.; Iqbal, J.; Irfan, R.; Hussain, S.; Algarni, A.D.; Bukhari, S.S.H.; Alturki, N.; Ullah, S.S. Effectively Predicting the Presence of Coronary Heart Disease Using Machine Learning Classifiers. Sensors 2022 , 22 , 7227. [ Google Scholar ] [ CrossRef ]
  • Subahi, A.F.; Khalaf, O.I.; Alotaibi, Y.; Natarajan, R.; Mahadev, N.; Ramesh, T. Modified Self-Adaptive Bayesian Algorithm for Smart Heart Disease Prediction in IoT System. Sustainability 2022 , 14 , 14208. [ Google Scholar ] [ CrossRef ]


Authors | Novel Approach | Best Accuracy | Dataset
Shorewala, 2021 [ ] | Stacking of KNN, random forest, and SVM outputs with logistic regression as the meta-classifier | 75.1% (stacked model) | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes)
Maiga et al., 2019 [ ] | Random forest, naive Bayes, logistic regression, and KNN | 70% | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes)
Waigi et al., 2020 [ ] | Decision tree | 72.77% (decision tree) | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes)
Ouf and ElSeddawy, 2021 [ ] | Repeated random sampling with random forest | 89.01% (random forest classifier) | UCI cardiovascular dataset (303 patients, 14 attributes)
Khan and Mondal, 2020 [ ] | Holdout cross-validation with a neural network on the Kaggle dataset | 71.82% (neural networks) | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes)
 | Cross-validation method with logistic regression (solver: lbfgs), k = 30 | 72.72% | Kaggle cardiovascular disease dataset 1 (462 patients, 12 attributes)
 | Cross-validation method with linear SVM, k = 10 | 72.22% | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes)
Feature | Variable | Min and Max Values
Age | Age | Min: 10,798 and max: 23,713
Height | Height | Min: 55 and max: 250
Weight | Weight | Min: 10 and max: 200
Gender | Gender | 1: female, 2: male
Systolic blood pressure | ap_hi | Min: −150 and max: 16,020
Diastolic blood pressure | ap_lo | Min: −70 and max: 11,000
Cholesterol | Chol | Categorical value = 1 (min) to 3 (max)
Glucose | Gluc | Categorical value = 1 (min) to 3 (max)
Smoking | Smoke | 1: yes, 0: no
Alcohol intake | Alco | 1: yes, 0: no
Physical activity | Active | 1: yes, 0: no
Presence or absence of cardiovascular disease | Cardio | 1: yes, 0: no
MAP Values | Category
≥70 and <80 | 1
≥80 and <90 | 2
≥90 and <100 | 3
≥100 and <110 | 4
≥110 and <120 | 5
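For illustration, mean arterial pressure can be approximated from the systolic and diastolic readings as MAP ≈ (SBP + 2 × DBP)/3 and then binned into the ordinal categories above. This sketch is ours, not the source's code, and it assumes the category-3 range is ≥90 and <100 (the printed table appears to repeat the 100–110 range); out-of-range values are returned as None here, since the source does not say how they are handled:

```python
def mean_arterial_pressure(ap_hi, ap_lo):
    """Standard MAP approximation from systolic (ap_hi) and diastolic (ap_lo)."""
    return (ap_hi + 2 * ap_lo) / 3

def map_category(map_value):
    """Bin a MAP value into the ordinal categories of the table above."""
    bins = [(70, 80, 1), (80, 90, 2), (90, 100, 3), (100, 110, 4), (110, 120, 5)]
    for low, high, cat in bins:
        if low <= map_value < high:
            return cat
    return None  # outside the listed ranges; handling unspecified in the source

print(map_category(mean_arterial_pressure(120, 80)))  # MAP ≈ 93.3 -> category 3
```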
Feature | Variable | Min and Max Values
Gender | gender | 1: male, 2: female
Age | Age | Categorical values = 0 (min) to 6 (max)
BMI | BMI_Class | Categorical values = 0 (min) to 5 (max)
Mean arterial pressure | MAP_Class | Categorical values = 0 (min) to 5 (max)
Cholesterol | Cholesterol | Categorical values = 1 (min) to 3 (max)
Glucose | Gluc | Categorical values = 1 (min) to 3 (max)
Smoking | Smoke | 1: yes, 0: no
Alcohol intake | Alco | 1: yes, 0: no
Physical activity | Active | 1: yes, 0: no
Presence or absence of cardiovascular disease | Cardio | 1: yes, 0: no
Model | Accuracy (Without CV) | Accuracy (CV) | Precision (Without CV) | Precision (CV) | Recall (Without CV) | Recall (CV) | F1-Score (Without CV) | F1-Score (CV) | AUC
MLP | 86.94 | 87.28 | 89.03 | 88.70 | 82.95 | 84.85 | 85.88 | 86.71 | 0.95
RF | 86.92 | 87.05 | 88.52 | 89.42 | 83.46 | 83.43 | 85.91 | 86.32 | 0.95
DT | 86.53 | 86.37 | 90.10 | 89.58 | 81.17 | 81.61 | 85.40 | 85.42 | 0.94
XGB | 87.02 | 86.87 | 89.62 | 88.93 | 82.11 | 83.57 | 86.30 | 86.16 | 0.95

Share and Cite

Bhatt, C.M.; Patel, P.; Ghetia, T.; Mazzeo, P.L. Effective Heart Disease Prediction Using Machine Learning Techniques. Algorithms 2023 , 16 , 88. https://doi.org/10.3390/a16020088




Title: Heart Diseases Prediction Using Block-Chain and Machine Learning

Abstract: People around the globe are dying of heart disease in large numbers. A major reason behind the rapid increase in deaths from heart disease is that no healthcare infrastructure has been developed that provides a secure way of storing and transmitting patient data. Because of redundancy in the patient data, it is difficult for cardiac professionals to predict the disease early on. This rapid increase in the death rate can be controlled by monitoring key attributes such as blood pressure, cholesterol level, body weight, and addiction to smoking, and by intervening in the early stages. Patient data can be monitored by cardiac professionals (CPs) using an advanced framework in healthcare departments, and blockchain provides a reliable way of securing that data. Advanced systems offering new ways of dealing with diseases have been developed in healthcare as well. In this article, a Machine Learning (ML) algorithm known as the sine-cosine weighted k-nearest neighbor (SCA-WKNN) is used to predict heart disease with the maximum accuracy among the existing approaches. Blockchain technology is used in this research to secure the data throughout the session, which yields more accurate results. The performance of the system can be improved by using this algorithm, and the proposed dataset has been improved by drawing on several resources.
Comments: 23 pages, 19 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: [cs.LG]


  • Diagnostics (Basel)
  • PMC10378171


Prediction of Heart Disease Based on Machine Learning Using Jellyfish Optimization Algorithm

Ahmad Ayid Ahmad

1 Computer Engineering Department, Gazi University, Ankara 06560, Turkey; polath@gazi.edu.tr

2 Information Technology Department, Kirkuk University, Kirkuk 36013, Iraq

Huseyin Polat

Associated Data

No data were used to support this study.

Heart disease is one of the most known and deadly diseases in the world, and many people lose their lives from this disease every year. Early detection of this disease is vital to save people’s lives. Machine Learning (ML), an artificial intelligence technology, is one of the most convenient, fastest, and low-cost ways to detect disease. In this study, we aim to obtain an ML model that can predict heart disease with the highest possible performance using the Cleveland heart disease dataset. The features in the dataset used to train the model and the selection of the ML algorithm have a significant impact on the performance of the model. To avoid overfitting (due to the curse of dimensionality) due to the large number of features in the Cleveland dataset, the dataset was reduced to a lower dimensional subspace using the Jellyfish optimization algorithm. The Jellyfish algorithm has a high convergence speed and is flexible to find the best features. The models obtained by training the feature-selected dataset with different ML algorithms were tested, and their performances were compared. The highest performance was obtained for the SVM classifier model trained on the dataset with the Jellyfish algorithm, with Sensitivity, Specificity, Accuracy, and Area Under Curve of 98.56%, 98.37%, 98.47%, and 94.48%, respectively. The results show that the combination of the Jellyfish optimization algorithm and SVM classifier has the highest performance for use in heart disease prediction.

1. Introduction

According to the World Health Organization, despite significant advances in diagnosis and treatment, mortality from heart disease remains the leading cause of death worldwide, accounting for about one-third of annual deaths [ 1 ]. “Heart disease” is a general term used to describe a group of heart conditions and diseases, including Coronary Artery Disease, Arrhythmia, Heart Valve Disease, and Heart Failure, which cause the heart not to pump blood healthily.

The most common type of heart disease is Coronary Artery Disease. The disease is a medical condition in which the coronary arteries that supply blood to the heart muscle become narrowed or blocked due to plaque build-up on their inner walls. This can lead to serious complications such as a heart attack, heart failure, and arrhythmias, as it reduces blood flow to the heart muscle. In some cases, procedures such as angioplasty or bypass surgery may be necessary to improve blood flow to the heart.

The second common heart disease is Arrhythmia. Arrhythmia is caused by disturbances in the normal electrical activity of the heart. The normal beating rhythm of the heart is disrupted because the electrical impulses in the heart responsible for synchronizing the heartbeat are not working properly. As a result, the heartbeat may be faster, slower, or more irregular than normal [ 2 , 3 ]. Millions of people worldwide are affected by Arrhythmia. Symptoms can include a fast or irregular heartbeat, shortness of breath, dizziness or fainting, chest pain or discomfort, fatigue, and weakness. There are many different types of arrhythmias, and some types of arrhythmias are harmless, while others can be life-threatening. While many people may experience occasional episodes of mild arrhythmia in their lives, some people may struggle with more serious types of arrhythmias. For example, a type of Arrhythmia known as Atrial Fibrillation can occur in about 10% of adults over the age of 60 and can increase the risk of stroke. On the other hand, a serious type of Arrhythmia known as Ventricular Fibrillation is considered a cause of heart attacks and can be fatal. Some types of arrhythmias can be inherited, while others can be caused by lifestyle factors or other heart diseases. In most early-diagnosed cases, arrhythmias can be treated. Patients with these disorders are much less likely to die suddenly if they receive prompt, thorough diagnosis and medical care [ 4 , 5 ].

The main reasons for the significant increase in heart disease in recent years are people’s lifestyle, lack of exercise, and consumption of various processed foods. Heart disease in its advanced stages can cause heart attacks and endanger the lives of patients, so it is necessary to detect the disease quickly and in its early stages with intelligent and therapeutic methods. One of the major challenges in the diagnosis of heart disease is the reluctance of patients to participate in clinical trials. On the other hand, the cost of these trials is high, and they take a lot of time, which is why they receive little attention. In contrast to clinical methods for diagnosing heart disease, some methods can be used to analyze the pattern of the disease by analyzing information from patients and healthy people [ 6 ].

In recent years, applications of artificial intelligence technology, especially Machine Learning (ML), in the field of auxiliary diagnosis have developed rapidly, and efficient progress has been made in automatic detection applications [ 7 , 8 , 9 , 10 ]. The advantage of ML methods is that they can diagnose diseases, such as heart disease, with low-cost and reasonable accuracy [ 11 ]. ML techniques for diagnosing heart disease do not require multiple clinical trials, most of which are invasive, and a set of information and features can help to diagnose the disease with high accuracy. It should be noted that although ML technology has made advances in the automatic diagnosis of heart disease, the approval of doctors is still a necessary link in diagnosis and treatment. It is also clear that ML-based disease diagnosis offers an opportunity to increase doctors’ work efficiency and generate economic benefits. In the age of big data, with ever-expanding datasets and the development of new ML algorithms, it is expected that ML applications will undoubtedly have a major impact on automated heart disease prediction [ 12 , 13 , 14 , 15 , 16 ]. In the literature, there are research papers that try to predict heart disease with different datasets and different types of ML algorithms.

Dubey A. K. et al. examined the performance of ML models such as Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), SVM with grid search (SVMG), K-Nearest Neighbor (KNN) and Naïve Bayes (NB) for heart disease classification. Cleveland and Statlog datasets from the UCI Machine Learning repository were used for training and testing. The experimental results show that LR and SVM classifier models perform better on the Cleveland dataset with 89% accuracy, while LR performs better on the Statlog dataset with 93% accuracy [ 17 ].

Karthick K. et al. used SVM, Gaussian Naive Bayes (GNB), LR, LightGBM, XGBoost, and RF algorithms to build an ML model for heart disease risk prediction. In this study, the authors applied the Chi-square statistical test to select the best features from the Cleveland heart disease dataset. After feature selection, the RF classifier model obtained the highest classification accuracy rate of 88.5% [ 18 ].

Veisi H. et al. developed various ML models such as DT, RF, SVM, XGBoost, and Multilayer Perceptron (MLP) using the Cleveland heart disease dataset to predict heart disease. Various preprocessing (outlier detection, normalization, etc.) and feature selection processes were applied to the dataset. Among the ML models evaluated, the highest accuracy of 94.6% was achieved using the MLP [ 19 ].

Sarra R. R. et al. proposed a new classification model based on SVM for better prediction of heart disease using the Cleveland and Statlog datasets from the UCI Machine Learning repository. The χ 2 statistical optimal feature selection method was used to improve the prediction accuracy of the model. The performance of the proposed model is evaluated against traditional classifier models using various performance metrics, and the results showed that the accuracy improved from 85.29% to 89.7% by applying the proposed model [ 20 ].

Malavika G. et al. investigated the use of ML algorithms to predict heart disease. The heart disease dataset from the UCI repository was used for this study. They used various ML algorithms, including LR, KNN, SVM, NB, DT, and RF, to predict heart disease, and their performances were compared. The results showed that RF (91.80%) had the highest accuracy in predicting heart disease, followed by NB (88.52%) and SVM (88.52%). The authors concluded that ML algorithms could be a useful tool in predicting heart disease and could potentially help doctors diagnose and treat patients more accurately [ 21 ].

Sahoo G. K. et al. compared the performance of LR, KNN, SVM, NB, DT, RF, and XG Boost Machine Learning models for predicting heart disease. The Cleveland heart disease dataset from the UCI ML repository was used to train the models. Comparing the results of the tested ML algorithms, the RF algorithm performed the best, with a classification accuracy of 90.16% [ 22 ].

The exploration of various ML techniques for predicting coronary artery disease is addressed in [ 23 ]. The study used 462 medical instances and nine features from the South African heart disease dataset, consisting of 302 healthy records and 160 records with coronary heart disease. In this study, the k-means algorithm, along with the synthetic minority oversampling technique, was used to solve the problem of imbalanced data. A comparative analysis showed that four different ML techniques, such as LR, SVM, KNN, and artificial neural network (ANN), can accurately predict coronary artery disease events from clinical data. The results showed that SVM had the highest accuracy performance (78.1%) [ 23 ].

In Ahmad G. N. et al.’s study, Cleveland, Hungarian, Switzerland, Statlog, and Long Beach VA datasets were combined to obtain a larger dataset compared to existing heart disease datasets. They compared the performances of LR, KNN, SVM, Nu-Support Vector Classifier (Nu-SVC), DT, RF, NB, ANN, AdaBoost, Gradient Boosting (GB), Linear Discriminants Analysis (LDA) and Quadratic Discriminant Analysis (QDA), algorithms for heart disease classification. In this study, the authors claimed that the best classification accuracy of 100% was achieved with the RF algorithm [ 24 ].

The main objective of this study is to use a metaheuristic method, the Jellyfish optimization algorithm, to select the optimal features from the heart disease dataset and then to use Machine Learning methods to classify healthy and non-healthy heart disease data. Some of the features contribute little to the classification of heart disease. The Jellyfish algorithm was selected because of its advantages, such as its high convergence speed and high accuracy in finding the best features.

2. Material and Method

This paper presents a performance analysis of different ML techniques based on selecting the meaningful features of the dataset in the hope of improving heart disease prediction accuracy. In this study, the performance of different ML models such as ANN, DT, Adaboost, and SVM using the Jellyfish algorithm and feature selection for the prediction of heart disease was compared, aiming at obtaining the highest performance model. The Cleveland dataset used in this study was obtained from the Kaggle Machine Learning repository.

2.1. Dataset

The Cleveland heart disease dataset is commonly used for heart disease prediction with supervised Machine Learning. The Cleveland dataset used here was obtained from the Kaggle Machine Learning repository. It was collected for use in health research by the Cleveland Clinic Foundation in 1988. In the original version of this dataset, 76 different features of 303 subjects were recorded. However, most researchers use only 14 of these features, including the target class feature. These features include age, gender, blood pressure, cholesterol, blood sugar, and several other health metrics. The original Cleveland dataset has five class labels, with integer values ranging from zero (no presence) to four. Experiments on the Cleveland dataset have focused on discriminating between presence (values 1, 2, 3, 4) and absence (value 0). However, the number of samples per class is not homogeneous (values 0, 1, 2, 3, and 4 have 164, 55, 36, 35, and 13 samples, respectively). Researchers therefore suggest reducing the five class labels of this dataset to two classes: 0 = no disease and 1 = disease. The target feature refers to the presence of heart disease in the subject. Table 1 shows the features included in the Cleveland heart disease dataset.

List of features in the Cleveland heart disease dataset.

Order | Feature | Description | Feature Value Range
1 | Age | Age in years | 29 to 77
2 | Sex | Gender | Value 1 = male; Value 0 = female
3 | Cp | Chest pain type | Value 0: typical angina; Value 1: atypical angina; Value 2: non-anginal pain; Value 3: asymptomatic
4 | Trestbps | Resting blood pressure (in mm Hg on admission to the hospital) | 94 to 200
5 | Chol | Serum cholesterol in mg/dL | 126 to 564
6 | Fbs | Fasting blood sugar > 120 mg/dL | Value 1 = true; Value 0 = false
7 | Restecg | Resting electrocardiographic results | Value 0: normal; Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of >0.05 mV); Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8 | Thalach | Maximum heart rate achieved | 71 to 202
9 | Exang | Exercise-induced angina | Value 1 = yes; Value 0 = no
10 | Oldpeak | ST depression induced by exercise relative to rest | 0 to 6.2
11 | Slope | The slope of the peak exercise ST segment | Value 0: upsloping; Value 1: flat; Value 2: downsloping
12 | Ca | Number of major vessels | Number of major vessels (0–3) colored by fluoroscopy
13 | Thal | Thallium heart rate | Value 0 = normal; Value 1 = fixed defect; Value 2 = reversible defect
14 | Target | Diagnosis of heart disease | Value 0 = no disease; Value 1 = disease

In the original dataset, a total of 6 samples have null values; 4 samples in the “Ca (Number of Major Vessels)” feature and 2 samples in the “Thal (Thallium Heart Rate)” feature. Since null values are very few, these samples can be removed from the dataset. The dataset used in this study contains a total of 1025 samples. A total of 499 samples belong to the disease (1), and 526 of these samples belong to the no disease (0) class. Histograms of all features in the Cleveland heart disease dataset are shown in Figure 1 .
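The preprocessing just described (dropping the few rows with null values and binarizing the five-level target) can be sketched in plain Python. This is our own illustration, with rows represented as dictionaries rather than the authors' actual pipeline:

```python
def preprocess(rows):
    """Drop rows containing null values and map target labels 1-4 to 1."""
    cleaned = []
    for row in rows:
        if any(v is None for v in row.values()):
            continue  # remove the few samples with null 'Ca' or 'Thal' values
        row = dict(row)
        row["target"] = 0 if row["target"] == 0 else 1  # 1..4 -> disease (1)
        cleaned.append(row)
    return cleaned

# Hypothetical toy rows for illustration only.
sample = [
    {"Age": 63, "Ca": 0, "target": 0},
    {"Age": 51, "Ca": None, "target": 2},  # null value -> removed
    {"Age": 58, "Ca": 1, "target": 3},     # label 3 -> disease (1)
]
print(preprocess(sample))
```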

Figure 1. Histograms of features in the heart disease dataset.

2.2. Feature Selection and Dimension Reduce

The performance of ML models depends on the quality of the features used as input. As the number of features in the datasets increases, the prediction performance of the model decreases, and the computational costs increase. By reducing the number of features, the model can obtain more accurate results and work faster and more efficiently. ML models are designed according to the data used in the learning process. Selecting the best features makes the features learned by the model more generalizable. Thus, it makes the model work better with new data. Some features in the datasets are not important to the result and increase the computational complexity of the model. Removing unnecessary features reduces noise and helps the model achieve better results. Also, feature selection is important for understanding the nature of the dataset. Well-chosen features help people better understand the data. In this study, the Jellyfish algorithm was used to select the best features from the dataset.

Presented in 2021, the Jellyfish optimization algorithm is a type of swarm intelligence algorithm that is inspired by the food-finding behavior of jellyfish in the ocean. It is used to solve optimization problems, particularly in the field of engineering and computer science. According to the literature, the Jellyfish algorithm outperforms many well-known meta-heuristic algorithms in most real-world applications. In the Jellyfish algorithm, a group of artificial agents or particles, called “jellyfish,” move in a three-dimensional space, searching for the optimal solution to a problem. The algorithm is based on a set of rules that simulate the behavior of real-life jellyfish. The algorithm uses a combination of random and deterministic movements to explore the search space and exploit promising solutions. Each Jellyfish has a set of properties that are updated at each iteration, based on its own and the swarm’s best-known solutions. These properties include its position, velocity, and acceleration. The Jellyfish algorithm has been successfully applied to a range of optimization problems, including clustering, feature selection, and image segmentation. It has been shown to perform well in high-dimensional search spaces and can handle multiple objectives and constraints. Overall, the Jellyfish algorithm is a promising optimization technique that takes inspiration from nature to solve complex problems in a computationally efficient way. Figure 2 shows the behavior of jellyfish in the sea and the modeling of group movements [ 25 ].

Figure 2. Jellyfish behaviors for modeling a jellyfish optimization algorithm [ 25 ].

The Jellyfish algorithm has the following three behaviors:

  • A walker or jellyfish either follows the ocean current or moves within the group and can switch between the two modes intermittently;
  • The jellyfish move in the ocean in search of food. They are more attracted to places where there is a lot of food;
  • The amount of food found is determined by the jellyfish's location and the corresponding objective function.

Ocean currents contain nutrients that can attract jellyfish. The direction of the ocean current can be defined with a vector, as in Equation (1):

trend = (1/n_pop) × Σ_i trend_i,  where trend_i = X* − e_c × X_i    (1)

In this regard, e_c > 0 is the attraction (absorption) factor. This equation can be extended as Equation (2):

trend = (1/n_pop) × Σ_i (X* − e_c × X_i) = X* − e_c × μ    (2)

In this equation, X* is the best jellyfish, and μ is the mean location of the jellyfish population. For simplicity, df = e_c × μ can be assumed, so the equation can be written more generally as Equation (3):

trend = X* − df    (3)

The spatial distribution of jellyfish around the mean location can be considered normal, as shown in Equations (4) and (5):

df = β × σ    (4)

σ = rand(0,1) × μ    (5)

In these relationships, σ is the standard deviation of the jellyfish distribution. Figure 3 shows the normal distribution of jellyfish scattering around the mean point.

Figure 3. Normal distribution of jellyfish in the ocean [ 25 ].

Figure 4 depicts the displacement process of each jellyfish under the influence of ocean water force and under the influence of the jellyfish group.

Figure 4. The movement of jellyfish in the ocean with the force of ocean movements and group movements [ 25 ].

The quantities df and e_c can then be rewritten as Equations (6) and (7), respectively:

df = β × rand(0,1) × μ    (6)

e_c = β × rand(0,1)    (7)

Equation (3) can now be rewritten based on Equation (6) and presented as Equation (8):

trend = X* − β × rand(0,1) × μ    (8)

The jellyfish are carried along by the ocean current, and their updated location is given in Equation (9):

X_i(t + 1) = X_i(t) + rand(0,1) × trend    (9)

Equation (9) can be expanded into Equation (10):

X_i(t + 1) = X_i(t) + rand(0,1) × (X* − β × rand(0,1) × μ)    (10)

In this relation, β is a number greater than zero and is usually set to β = 3. Jellyfish also have group movements of two types: passive and active. In the passive mode, they mostly search around their own location. Passive motion is modeled by Equation (11):

X_i(t + 1) = X_i(t) + γ × rand(0,1) × (U_b − L_b)    (11)

In this relation, γ is a positive motion coefficient, usually set to 0.1; U_b is the upper bound of each dimension, and L_b is the lower bound. In the active mode, a jellyfish X_i randomly selects another jellyfish X_j, and there are two cases. If the merit of X_i is greater than that of X_j, Equation (12) is used to move; otherwise, Equation (13) is used:

X_i(t + 1) = X_i(t) + rand(0,1) × (X_i − X_j)    (12)

X_i(t + 1) = X_i(t) + rand(0,1) × (X_j − X_i)    (13)

Equation (14) is used to switch between ocean-current movements and group movements:

c(t) = |(1 − t/Max_t) × (2 × rand(0,1) − 1)|    (14)

In this regard, t is the current iteration number of the algorithm, and Max_t is the maximum number of iterations. Figure 5 shows c(t) over the iterations of one experiment. For each update, if the random function c(t) is greater than 0.5, the jellyfish update follows the ocean waves; if it is less than 0.5, it follows the group movements.
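The update rules above can be condensed into a short, self-contained Python loop. This is an illustrative re-implementation under our own assumptions (function names, population size, greedy replacement of worse positions), not the authors' code; the sphere function stands in for the real fitness:

```python
import random

def jellyfish_search(objective, lb, ub, n_pop=20, max_iter=200,
                     beta=3.0, gamma=0.1, seed=42):
    """Minimize `objective` over box bounds [lb, ub] with a Jellyfish-Search-style loop."""
    rng = random.Random(seed)
    dim = len(lb)
    clip = lambda x: [min(max(v, lb[d]), ub[d]) for d, v in enumerate(x)]
    pop = [[rng.uniform(lb[d], ub[d]) for d in range(dim)] for _ in range(n_pop)]
    fit = [objective(x) for x in pop]
    best_idx = min(range(n_pop), key=lambda i: fit[i])
    best, best_fit = pop[best_idx][:], fit[best_idx]

    for t in range(1, max_iter + 1):
        mu = [sum(x[d] for x in pop) / n_pop for d in range(dim)]  # mean location
        for i in range(n_pop):
            c_t = abs((1 - t / max_iter) * (2 * rng.random() - 1))  # Eq. (14)
            if c_t >= 0.5:  # follow the ocean current, Eq. (10)
                trend = [best[d] - beta * rng.random() * mu[d] for d in range(dim)]
                new = [pop[i][d] + rng.random() * trend[d] for d in range(dim)]
            elif rng.random() > 1 - c_t:  # passive group motion, Eq. (11)
                new = [pop[i][d] + gamma * rng.random() * (ub[d] - lb[d])
                       for d in range(dim)]
            else:  # active group motion, Eqs. (12)-(13)
                j = rng.randrange(n_pop)
                direction = [pop[j][d] - pop[i][d] for d in range(dim)]
                if fit[i] < fit[j]:  # X_i has better merit: move away from X_j
                    direction = [-d_ for d_ in direction]
                new = [pop[i][d] + rng.random() * direction[d] for d in range(dim)]
            new = clip(new)
            new_fit = objective(new)
            if new_fit < fit[i]:  # keep the move only if the "food" improves
                pop[i], fit[i] = new, new_fit
                if new_fit < best_fit:
                    best, best_fit = new[:], new_fit
    return best, best_fit

# Example: minimize the 2D sphere function on [-5, 5] x [-5, 5].
sphere = lambda x: sum(v * v for v in x)
best, best_fit = jellyfish_search(sphere, lb=[-5, -5], ub=[5, 5])
```

For feature selection, each continuous dimension would additionally be thresholded into a binary keep/drop mask and the objective replaced by a classifier's validation error.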

Figure 5. Random function determining whether a jellyfish follows the ocean current or the group movements [ 25 ].

2.3. Machine Learning Algorithms

Machine Learning refers to the use of computer algorithms that can learn to perform a particular task from sample data without explicitly programmed instructions. ML uses advanced statistical techniques to learn distinctive patterns from training data to make the most accurate predictions of new data. In applications such as disease prediction, ML models can often be developed using supervised learning methods. Supervised learning requires that training samples are correctly labeled. In its simplest form, the output is a binary variable with a value of 1 for patient subjects and 0 for healthy subjects. To obtain robust ML models, it is recommended to use balanced training samples from healthy and patient subjects. If several diseases are to be included in the ML model, the binary classification can be easily extended to the multi-class case. Therefore, supervised learning algorithms associate input variables with labeled outputs. In this study, we compare the performance of four different ML models using supervised learning, such as ANN, DT, Adaboost, and SVM.

The ANN used here is a feed-forward multilayer network, one of the most basic and popular neural network models. It is a network with one or more hidden layers and is often used to solve classification or regression problems. An ANN consists of an input layer, one or more hidden layers, and an output layer. Each layer contains one or more nodes (neurons). The input layer introduces data into the network and contains a node for each attribute. Hidden layers process the data. The output layer produces the results and, in classification problems, contains a node for each class. An ANN works by multiplying each node’s inputs by their weights, passing the result through an activation function, and computing the output. The activation function determines the output of each node, and non-linear functions such as sigmoid, ReLU, or tanh are often used. During training, the weights are randomly initialized and then optimized using the backpropagation algorithm, which minimizes the difference between the target outputs and the outputs of the network. ANNs can be applied to many different types of data, used in conjunction with other neural network models, and extended to solve more complex problems.
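The ANN described above can be sketched with a few lines of scikit-learn. This is an illustrative example only, not the study's MATLAB implementation; the synthetic data, layer size, and random seeds are all assumptions for demonstration.

```python
# Illustrative ANN sketch: a small feed-forward network trained with
# backpropagation on synthetic stand-in data (NOT the heart dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One hidden layer of 16 ReLU units; weights start random and are tuned
# by backpropagation to minimize the training loss.
ann = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=2000, random_state=0)
ann.fit(X_train, y_train)
ann_acc = ann.score(X_test, y_test)
```

The same structure extends to deeper networks by passing more entries in `hidden_layer_sizes`.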

The DT algorithm classifies data using a tree structure. The algorithm creates a set of decision rules that partition the data according to specific features. These decision rules are connected along the branches of the tree, forming a decision tree: each branch corresponds to a decision rule, and each leaf node provides a class or value estimate. The algorithm separates the classes by recursively partitioning the data, where each split is made by selecting a feature and dividing the data according to the values of that feature.

AdaBoost (Adaptive Boosting) is an ML algorithm used to solve classification and regression problems. AdaBoost works by combining weak classifiers (weak learners) into a strong classifier (strong learner). The algorithm starts by weighting each sample in the dataset; initially, each sample has equal weight. A weak classifier is then trained and selected according to its classification accuracy. The selected classifier reduces the weight of the samples it classifies correctly and increases the weight of the samples it misclassifies. Next, a new weak classifier is trained on the re-weighted samples, and the process is repeated until a predetermined number of weak classifiers has been trained. Finally, a vote weighted by the classification accuracy of each weak classifier is performed; the result of this voting is a powerful classifier for the given samples.
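A minimal AdaBoost sketch, assuming scikit-learn and synthetic stand-in data rather than the study's dataset. The default weak learner in `AdaBoostClassifier` is a depth-1 decision tree (a stump), which matches the re-weighting scheme described above.

```python
# Illustrative AdaBoost example on synthetic data (not the study's dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# 50 boosting rounds; each round re-weights the samples that the previous
# weak learner (a decision stump, the default) misclassified.
ada = AdaBoostClassifier(n_estimators=50, random_state=1)
ada.fit(X_tr, y_tr)
ada_acc = ada.score(X_te, y_te)
```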

SVM is a preferred ML algorithm because it is robust to outliers and performs well as the data size grows. SVM represents data points in an n-dimensional space and tries to find the best hyperplane separating samples belonging to different classes. In some cases, however, the data points cannot be separated linearly; the SVM then finds its solution using more complex decision boundaries. The kernel trick allows the SVM to work with data that can be separated more easily in higher-dimensional spaces by implicitly mapping the data into such a space (the kernel space). This allows it to separate non-linearly separable datasets using more complex decision boundaries. The kernel trick is realized through different kernel functions, especially the radial basis function (RBF) and the polynomial kernel. These kernel functions operate on the properties of data points (distance, similarity, inner product, etc.) and allow the SVM to find an appropriate separating hyperplane in the higher-dimensional space.
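The effect of the kernel trick can be demonstrated on a toy dataset of concentric rings, which no linear boundary can separate. This is a generic illustration under scikit-learn, not part of the study's pipeline.

```python
# Kernel trick demonstration: a linear SVM fails on circularly separable
# data, while an RBF-kernel SVM separates it in the implicit feature space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
# The RBF kernel handles the non-linearly separable rings far better.
```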

2.4. Methodology

The main aim of this study is to provide clinicians with a tool to help them diagnose heart problems early, making it easier to treat patients effectively and avoid serious consequences. In this study, the performance of different ML models using the Jellyfish algorithm for feature selection in heart disease prediction was compared, and we attempted to obtain the highest-performing ML model. A summary of the proposed method is shown in Figure 6 . As seen in Figure 6 , the Jellyfish algorithm, presented in 2021, was first applied to the dataset to obtain the best features. The Jellyfish algorithm seeks optimal solutions to various optimization problems by simulating the intelligent behavior of jellyfish. It is less prone to getting stuck in local minima and reaches the global minimum faster than many other optimization algorithms. The algorithm has attracted great attention due to its simplicity of implementation, few parameters, and flexibility. Because of these advantages, the Jellyfish algorithm was preferred in this study to select the best features from the dataset; a binary version of it is used here for feature selection. The algorithm starts with a population, a collection of potential feature subsets. In each iteration, the best features are carried forward, ultimately yielding the best feature subset. After creating a new dataset with the best features, this dataset was used to train four different classifiers: ANN, DT, AdaBoost, and SVM. The ML models obtained after training were tested, their performances were compared using Accuracy, Sensitivity, Specificity, and Area Under the Curve, and the best-performing model was selected.
A 10-fold cross-validation was used in the training and testing phases of the ML algorithms. The selected model shows high performance in classifying new data samples into two classes, diseased and not diseased. In this study, MATLAB (version R2022a) was used for feature selection and classification.
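The fitness evaluation inside such a wrapper feature selector can be sketched as follows. This is a hedged illustration, not the paper's MATLAB code: the `mask_fitness` helper, the SVM scorer, the synthetic data, and the example mask are all assumptions for demonstration. In the actual method, the binary mask would be the jellyfish position being optimized.

```python
# Hedged sketch: scoring a binary feature mask (as a binary Jellyfish
# search would produce) by 10-fold cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

def mask_fitness(mask, X, y):
    """Mean 10-fold CV accuracy of an SVM on the selected feature subset."""
    if not mask.any():          # an empty subset is worthless
        return 0.0
    return cross_val_score(SVC(), X[:, mask], y, cv=10).mean()

mask = np.zeros(13, dtype=bool)
mask[[0, 2, 5, 7]] = True       # an arbitrary example subset of 4 features
score = mask_fitness(mask, X, y)
```

The optimizer's job is then simply to search the space of masks for the one maximizing this score.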


Flowchart of the proposed approach for heart disease prediction.

3. Experimental Test Results

3.1. Performance Metrics

A table known as the confusion matrix is used to evaluate the performance of ML models. The confusion matrix is a table showing the difference between the actual and predicted classes. Each row of the confusion matrix represents an instance in the predicted class, while each column represents an instance in the real class (and vice versa). The confusion matrix usually contains four different terms: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).

True Positive (TP) refers to situations where actual positives are correctly predicted as positive. False Positive (FP) refers to situations where actual negatives are incorrectly predicted as positive.

True Negative (TN) refers to situations where what is negative is correctly predicted as negative.

False Negative (FN) refers to situations in which true positives are incorrectly predicted as negatives.

Using these terms, performance metrics such as Accuracy, Sensitivity, Specificity, and Area Under Curve (AUC) are calculated. These evaluation criteria, commonly used in the context of binary classification tasks, are calculated as follows.

Accuracy: the proportion of true predictions (both true positives and true negatives) out of all predictions. It is calculated as (TP + TN)/(TP + TN + FP + FN).

Sensitivity (also called recall or true positive rate): the proportion of true positives out of all actual positive cases. It is calculated as TP/(TP + FN).

Specificity: the proportion of true negatives out of all actual negative cases. It is calculated as TN/(TN + FP).

Area Under the Curve (AUC): the area under the ROC (Receiver Operating Characteristic) curve, which takes a value between 0 and 1. An AUC of 0 means the classifier predicts all classes incorrectly; an AUC of 1 means it predicts all classes correctly.
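The three count-based metrics above follow directly from the confusion-matrix terms. The sketch below uses made-up counts purely for illustration.

```python
# Accuracy, Sensitivity, and Specificity computed from confusion-matrix
# counts, matching the formulas in the text. Example counts are invented.
def binary_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = binary_metrics(tp=90, tn=85, fp=5, fn=10)
# acc = 175/190 ≈ 0.921, sens = 90/100 = 0.90, spec = 85/90 ≈ 0.944
```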

3.2. Test Results

In this section, the proposed method was applied to the test data, and the results were compared with other ML methods: ANN, Decision Tree, AdaBoost, and SVM. Four performance metrics (Sensitivity, Specificity, Accuracy, and Area Under the Curve) were calculated. In total, 70% of the data were used for training and 30% for testing. Other train/test proportions were also evaluated, but the best performance was obtained with this split. The performance evaluation results of the ML models without feature selection by the Jellyfish algorithm are given in Table 2 .

Performance comparison of different ML models without the Jellyfish algorithm.

Model           Sensitivity (%)   Specificity (%)   Accuracy (%)   AUC (%)
ANN             97.53             98.63             98.08          69.03
Decision Tree   97.69             97.17             97.43          75.83
AdaBoost        97.22             98.47             97.84          78.82
SVM             98.21             97.96             98.09          90.21

According to the results, the classification accuracy of the ANN, DT, AdaBoost, and SVM classifier models was 98.08%, 97.43%, 97.84%, and 98.09%, respectively. The SVM classifier was the most accurate of the ML models, at 98.09%. The results are shown graphically in Figure 7 .


Graphical representation of performance evaluation results of ML models without feature selection.

The performance evaluation results of the ML models, when feature selection is applied with the Jellyfish optimization algorithm, are given in Table 3 .

Performance comparison of different ML models when applying feature selection with the Jellyfish algorithm.

Model with JF      Sensitivity (%)   Specificity (%)   Accuracy (%)   AUC (%)
ANN with JF        98.22             98.89             97.99          79.33
DT with JF         98.07             98.34             97.55          81.98
AdaBoost with JF   98.12             98.07             98.24          84.92
SVM with JF        98.56             98.37             98.47          94.48

According to the results, the accuracy of ANN–JF, DT–JF, AdaBoost–JF, and SVM–JF was 97.99%, 97.55%, 98.24%, and 98.47%, respectively. The SVM-based Jellyfish approach was the most accurate of the methods, reaching 98.47% when feature selection with the Jellyfish algorithm was applied. The results are shown graphically in Figure 8 .


Graphical representation of performance evaluation results of ML models with feature selection.

The method of combining feature selection based on the Jellyfish optimization algorithm and SVM has higher Area Under Curve values than the other methods. In this method, the best features can be selected by using the Jellyfish algorithm and the SVM method to classify the data more accurately than other ML methods.

Furthermore, the current study has been compared with references [ 26 , 27 ] in terms of classification accuracy, with the findings displayed in Table 4 .

Comparison of the approach proposed in this study with some studies in the literature in terms of classification accuracy.

Reference         Dataset                               Accuracy (%)
[ ]               Cleveland and Statlog heart dataset   89
[ ]               Cleveland heart dataset               88.5
[ ]               Cleveland heart dataset               94.6
[ ]               Cleveland and Statlog heart dataset   85.29
[ ]               Cleveland heart dataset               91.8
[ ]               Cleveland heart dataset               90.16
[ ]               South African heart dataset           78.1
Proposed method   Cleveland heart dataset               98.47

The suggested approach in this study achieves favorable outcomes in the evaluation criteria. The classification accuracy of its prediction of heart disease is also higher than that of some studies in the literature and comparable techniques.

As seen in Table 4 , the proposed method reached 98.47% accuracy. This result shows that the optimum features can be used for heart disease diagnosis. The best features selected by Jellyfish improve the accuracy of results, because some of the features that are not selected by the Jellyfish algorithm can reduce the performance of the classification results. However, in classical methods such as Principal Component Analysis (PCA), some of the features that are not so important can be selected, which can reduce classifier model performance.

The best cost of feature selection, the Root Mean Square Error, and the accuracy of the proposed method are shown in Figure 9 a–c, respectively.


( a ) Best cost of feature selection, ( b ) Root Mean Square Error, and ( c ) accuracy of the proposed method.

As seen in Figure 9 a, the best cost of feature selection is obtained in 50 iterations, and this value is 0.0004, which is close to zero. Also, Figure 9 b shows the Root Mean Square Error that reached 0.030 in the fourth iteration.

Heart Valve Disease refers to any condition that affects the heart valves. The heart has four valves, known as mitral, tricuspid, aortic, and pulmonary, which open and close to allow blood to flow in one direction through the heart. Heart Valve Disease occurs when one or more of the valves work improperly. When the valves are healthy, they keep blood flowing smoothly through the heart and body. But when the valves are diseased, they may not open and close properly, causing blood to back up or leak in the wrong direction. Procedures to repair or replace heart valves can include balloon valvuloplasty, surgical valve repair, or surgical valve replacement.

Heart Failure is a condition in which the heart is unable to pump enough blood to meet the body’s needs. The heart may be weakened, stiffened, or damaged, and is unable to efficiently circulate blood throughout the body. This can lead to fluid build-up in the lungs, legs, and other areas of the body. There are two main types of heart failure: systolic and diastolic. Systolic heart failure occurs when the heart’s ability to contract and pump blood is impaired, while diastolic heart failure occurs when the heart is stiff and unable to fill with blood properly. Heart failure can be caused by a variety of factors, including coronary artery disease, high blood pressure, heart valve disease, heart attack, and certain medications.

The findings show that, compared with previous approaches, the proposed strategy improves percent accuracy in heart disease diagnosis. The results of this study demonstrate the potential of artificial intelligence, particularly ML, to significantly influence heart disease diagnostic decisions. The steady increase in computing power and increased data availability through mobile apps and the digital transformation of the global healthcare system are driving the growth of artificial intelligence and ML further. Therefore, future research will continue to use these techniques to translate them into routine clinical practice, thus paving the way for improved diagnostic decision-making to suit the specific needs of individual patients.

Machine learning algorithms for the diagnosis of heart diseases may have significant potential in the medical diagnosis process. These algorithms can be trained on datasets to perform tasks such as diagnosing specific heart diseases, assessing risk factors, and recommending treatment options. However, the potential risks and problems of these applications should also be considered. Several aspects of this debate can be addressed:

Data quality and accuracy: The proposed algorithm requires sufficient and high-quality data to produce accurate and reliable results. Therefore, the datasets used should not contain incomplete, inaccurate, or misleading data. Especially in a field such as heart disease, misdiagnosis recommendations can be errors that can have serious consequences.

Understandability of the algorithm: It may be necessary to explain to doctors how the algorithm and its parameters work. If doctors do not understand the decision processes of the algorithm, they may find it difficult to fully trust its results.

Data privacy and security: Privacy and security concerns may arise when using patients’ medical data. It is important that the data are properly secured and protected from unauthorized access and malicious use. This should be considered during the implementation of algorithms in clinical practice.

Physician–patient relationship: Some patients may find it difficult to trust their doctors regarding a diagnosis or treatment recommendation made by the algorithm, or may be skeptical about the results of the algorithm. The proposed algorithm should only be considered as a tool to assist physicians in their decision-making process. It should not be perceived as interfering with doctors’ decision-making.

4. Conclusions

This study aimed to obtain a highly accurate and reliable intelligent medical diagnosis model based on ML with the Jellyfish optimization algorithm using the Cleveland data set for early prediction of heart disease. One of the important factors affecting the performance of an ML model is the number of features in the dataset used. Choosing the right features can help the model better understand the data and give more accurate results. Selecting the right features can improve the performance of the model, while selecting too many features can increase the complexity of the model and cause overfitting. Therefore, the number of features must be accurately determined. To avoid the overfitting problem due to the large number of features in the Cleveland dataset used in this study, the best features were selected from the dataset by using the Jellyfish algorithm. The Jellyfish algorithm is a swarm-based metaheuristic algorithm that can be used with ML methods to optimize hyperparameters. The optimum features obtained from the dataset were used in the training and testing stages of four different ML algorithms (ANN, DT, AdaBoost, and SVM). Then, the performances of the obtained models were compared. The results show that the accuracy rates of all ML models improved after the dataset was subjected to feature selection with the Jellyfish algorithm. The highest classification accuracy (98.47%) was obtained with the SVM model trained using the dataset optimized with the Jellyfish algorithm. The Sensitivity, Specificity, Accuracy, and AUC for SVM without using the Jellyfish algorithm were obtained at 98.21%, 97.96%, 98.09%, and 90.21%, respectively. However, by using the Jellyfish algorithm, these values have been obtained as 98.56%, 98.37%, 98.47%, and 94.48%, respectively.

Funding Statement

This research received no external funding.

Author Contributions

Conceptualization, A.A.A., H.P.; methodology, A.A.A.; software, A.A.A.; validation, A.A.A., H.P.; formal analysis, H.P.; investigation, A.A.A., H.P.; resources, A.A.A., H.P.; data curation, A.A.A.; writing—original draft preparation, A.A.A.; writing—review and editing, A.A.A., H.P.; visualization, H.P.; supervision, H.P. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


In recent years, the number of cases of heart disease has been increasing greatly, and heart disease is associated with a high mortality rate. Moreover, with the development of technology, advanced equipment has been invented to help patients measure health conditions at home and predict the risk of heart disease. This research aims to assess the accuracy of self-measurable physical health indicators, compared with the full set of indicators measured by healthcare providers, in predicting heart disease using five machine learning models: Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest. The database used for the research contains 13 types of health test results and the heart disease risk for 303 patients. All matrices consisted of all 13 test results, while the home matrices included the 6 results that can be tested at home. After constructing the five models for both the home matrices and all matrices, the accuracy score and false negative rate were computed for each model. The results showed that all matrices had higher accuracy scores than home matrices for all five models, and the false negative rates of all matrices were lower than or equal to those of the home matrices. The conclusion drawn from these results is that home-measured physical health indicators are less accurate than the full set of physical indicators in predicting patients’ risk of heart disease. Therefore, until home-testable indicators are developed further, the full set of physical health indicators is preferred for measuring heart disease risk.

Machine Learning , Data Visualization , Feature Engineering , Health , Heart Disease


1. Introduction

Heart disease, caused by abnormal heart and blood vessel conditions, is widely considered a direct threat to human life and health. It is one of the significant diseases exerting irreversible effects on many middle-aged and older people, and fatal complications are highly likely to result [1]. Makino states that the absolute risk of cardiovascular heart disease is associated with disability and death among people 65 years or older [2]. The World Health Organization (WHO) declared that an estimated 17.7 million people died from cardiovascular disorders in 2015, accounting for one-third of all deaths that year [3]. According to the Australian Bureau of Statistics, heart ailment was one of Australia’s two highest causes of mortality [4]. Owing to its extremely negative influence on human health, a great deal of effort has been devoted to studying the onset of heart disease, trying to prevent and reduce its incidence in a timely and efficacious manner. Moreover, to prevent the adverse effects of heart disease, it is advisable to use sophisticated equipment to detect potential heart risks in advance. Currently, qualified health organizations can conduct many tests, including blood tests, echocardiography, chest X-rays, magnetic resonance imaging (MRI), electrocardiograms, physical examinations, and exercise stress tests, which provide medical doctors with valuable information for their diagnosis and their assessment of the patient’s heart failure risk level [5].

There are several risk factors for heart failure, corresponding to different test indexes. A significant amount of relevant research has been carried out to reveal the potential attributes of a heart attack. Sex, age, smoking, hypertension, and diabetes are risk factors for heart disease [6]. Peter et al. [7] suggest that indexes including blood pressure, total cholesterol, and age are essential in predicting coronary heart disease. The effects of sex differences on traditional cardiovascular risk factors are considered notable [8]. Heart rate is also a powerful indicator of a patient’s potential heart attack risk [9]. The attributes of heart disease can be roughly divided into two types according to whether the indicators can be measured at home. It is considered worthwhile to compare the accuracy of indicators measured at home with those measured in hospitals, which is useful for future heart disease testing.

Computational technologies and statistical approaches have become popular for discovering the relationship between heart disease and patients’ health conditions [10] [11]. They can help predict the potential risk of heart disease from the patient’s underlying physical condition in advance, thereby reducing the probability of dying from a heart attack. Many statistical methods based on computer calculation have been applied to predict heart attacks [12]. Due to its high accuracy, SVM has been widely applied as a classification method to predict heart attacks [13]. Akkaya used logistic regression and the k-NN algorithm to estimate heart failure and achieved promising outcomes [13]. With the adoption of Random Forest, a best accuracy of 82.18% has been achieved through modified feature selection [14]. These algorithms have been proven to predict the risk of heart disease effectively, helping researchers and doctors make better judgments about heart disease.

Although these machine learning techniques have been acknowledged and refined continuously to increase prediction performance, few investigators have examined the relative accuracy of home-tested versus in-hospital measures for predicting heart disease risk. If the indicators measured at home can predict the patient’s risk of heart disease well, then patients can be tested by themselves or their families instead of having to go to the hospital. Therefore, the innovation of this article lies in the fact that it not only used five machine learning algorithms on data from heart patients, but also compared the contribution of these algorithms to heart disease prediction with home-measured versus hospital-measured indicators.

This study aims to compare the patient’s physical condition indicators measured at home and in the hospital, using five different prediction methods to explore their accuracy in heart disease prediction. Accordingly, the research question “How do machine learning algorithms perform with only self-measurable physical condition indicators compared to algorithms with all physical condition indicators?” will be answered.

2. Data Description

We used the Cleveland heart data set from the UCI machine learning repository. The selected data comprise 14 variables and 303 instances: 13 predictor variables and 1 categorical response variable (target). Among these, the numerical variables are age, trtbps, chol, thalach, and oldpeak; the categorical variables are sex, exang, cp, fbs, rest_ecg, slp, thall, and target. Table 1 describes the meaning of each variable.

From Figure 1 we can see that, in the data set, most patients with heart attack are aged between 50 and 60, while only a few people under 30 or above 70 have heart failure. The range of this attribute is 29 to 77, illustrating the wide span of ages.

Figure 1 . Age of heart diseased patients.

The chol variable is the patient’s cholesterol, fetched via BMI sensor. According to Figure 2 , most patients’ cholesterol is around 230 mg/dl, and the distribution shows a slight right skew.

According to Figure 3 , most patients’ maximum heart rates cluster between 140 and 180. Some patients have extremely low or high heart rates, specifically below 100 or above 200.

When it comes to resting blood pressure ( Figure 4 ), a great number of patients have resting blood pressure around 100 to 140 mmHg. Only a few have abnormal values of around 160 mmHg or below 100 mmHg.

Table 1 . Variable description.

Figure 2 . Chol of heart diseased patients.

Figure 3 . Maximum heart rate achieved of heart diseased patients.

Figure 4 . Resting blood pressure of heart diseased patients.

3. Methodology

3.1. Data Processing

For data description, the research used the describe function and pandas profiling in Python to summarize the dataset. The raw data contained 14 variables for 303 patients. Chi-square values, extra-trees classifiers, and correlation matrices were used for data analysis. The Chi-square values and correlation matrices showed that no variables were highly correlated, so all variables were selected for model building. Moreover, all numerical variables were standardized using StandardScaler.

The 13 independent variables were divided into home matrices and all matrices. Home matrices consisted of 6 variables—age, sex, resting blood pressure, cholesterol, fasting blood sugar, and thalassemia. All matrices included all 13 independent variables. The research created the training set and test sets with 80% training data and 20% testing data.
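The preparation steps above can be sketched as follows. This is an illustration under assumptions: random stand-in data replaces the Cleveland dataset, and the home-measurable column indices are hypothetical, not the dataset's actual ordering.

```python
# Sketch of the preprocessing described above: standardize, carve out a
# hypothetical "home" feature subset, and split 80/20. Stand-in data only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(size=(303, 13))      # 13 indicators for 303 patients
y = rng.integers(0, 2, size=303)        # heart-disease risk label

home_cols = [0, 1, 2, 3, 4, 5]          # hypothetical home-measurable subset
X_home = X_all[:, home_cols]

X_all_scaled = StandardScaler().fit_transform(X_all)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all_scaled, y, test_size=0.2, random_state=0)
```

With 303 patients, an 80/20 split yields 242 training and 61 test samples.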

A helper function was used in Python to show each model’s accuracy score, false negative rate, and confusion matrix. The accuracy score measured the percentage of patients whose heart disease risk (high or low) was correctly predicted, showing the accuracy of each model in predicting the correct heart disease risk for patients. The false negative rate measured the percentage of patients with a high risk for heart disease who were mispredicted as having a low risk. The false negative rate is significant because such mispredictions may lead to late treatment. These values were used in the final model comparison to assess the accuracy of self-measured home matrices compared to all matrices.
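A possible version of that helper is sketched below. The function name and label convention (1 = high risk) are assumptions; the study's actual helper is not shown in the text.

```python
# Hypothetical helper: accuracy and false negative rate from true and
# predicted labels, where 1 = high risk and 0 = low risk.
def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    fnr = fn / (fn + tp)     # share of high-risk patients the model missed
    return accuracy, fnr

acc, fnr = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
# acc = 4/5 = 0.8, fnr = 1/3 ≈ 0.333
```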

3.2. Machine Learning Algorithms

The research built five models for both the home matrices and all matrices.

3.2.1. Logistics Regression

Logistic Regression is a model for predicting a binary outcome from the observations of a data set. The research selected this model because the output variable is binary, taking either high risk or no risk for heart disease. The LogisticRegression class from the sklearn package in Python was used to build the model. The liblinear solver (a library for large linear classification) was chosen for the logistic models because the dataset is relatively small.

3.2.2. K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a classification algorithm that assesses the likelihood of a data point belonging to a group according to its distance to the nearest points. The research considered 1 to 20 as the number of neighbors and calculated the K Neighbors Classifier score for each. A line chart was created with the number of neighbors as x and the K Neighbors Classifier score as y. The research chose K = 8 since it had the highest score.
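The k-selection procedure can be sketched as below, scoring k = 1..20 on held-out data and keeping the best. Synthetic data stands in for the study's dataset, so the winning k here need not be 8.

```python
# Illustrative sketch of choosing k for KNN by held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Score every candidate k and keep the best one (the study settled on k = 8).
scores = {k: KNeighborsClassifier(n_neighbors=k)
             .fit(X_tr, y_tr).score(X_te, y_te)
          for k in range(1, 21)}
best_k = max(scores, key=scores.get)
```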

3.2.3. Support Vector Machine

Support Vector Machine was chosen as one of the models because it is an algorithm for classification and regression. The research used svm from the sklearn.svm package in Python. The radial basis function kernel was selected, gamma was set to 0.01, and the regularization parameter was set to 1 for the two machine learning models.
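The stated configuration maps directly onto scikit-learn's `SVC`. The sketch below fits it to synthetic stand-in data rather than the study's dataset.

```python
# The SVM configuration described above: RBF kernel, gamma = 0.01, C = 1.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf", gamma=0.01, C=1.0)
svm.fit(X_tr, y_tr)
svm_acc = svm.score(X_te, y_te)
```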

3.2.4. Decision Tree

Decision Tree was chosen because it is a nonparametric machine learning model for classification and regression. The research drew a line graph using the maximum depth from 1 to 30 as x and the Decision Tree Classifier score as y. A maximum depth of 10 was picked for model building because it had the highest score.

3.2.5. Random Forest

Random Forest is an ensemble algorithm consisting of decision trees. RandomForestClassifier from the sklearn.ensemble package was used to build the home and all matrices models. The number of estimators was set to 1000 in both models.
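The Random Forest setup described above is a one-liner in scikit-learn; again, synthetic data stands in for the study's dataset.

```python
# Random Forest as configured above: 1000 trees from sklearn.ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=1000, random_state=0)
rf.fit(X_tr, y_tr)
rf_acc = rf.score(X_te, y_te)
```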

Raw data, after some preprocessing, are fed into machine learning algorithms. Afterward, the accuracy score and the false negative rate are obtained.

4.1. Accuracy

According to Table 2 , Logistic Regression and the Support Vector Machine have the highest accuracy score, 88.52%, among the machine learning algorithms with all physical condition indicators, while the Decision Tree has the lowest accuracy score at only 85.25%. Among the machine learning algorithms with only physical condition indicators measured at home, Logistic Regression has the highest accuracy score at 73.77%, while the Support Vector Machine has the lowest at only 68.85%.

After comparing the accuracy between machine learning algorithms with only physical condition indicators measured at home and algorithms with all physical condition indicators, it is concluded that algorithms with only physical condition indicators measured at home do not perform as accurately as algorithms with all physical condition indicators. The difference in accuracy ranges from 14.75% to 19.67%.

4.2. False Negative Rate

From the false negative rate perspective ( Table 3 ), it is observed that the Decision Tree has the highest false negative rate within the algorithms with all physical condition indicators. In contrast, Logistic Regression has the lowest false negative rate. Within the algorithms with only physical condition indicators measured at home, K Nearest Neighbors and Random Forest have the highest false negative rate, while Decision Tree has the lowest false negative rate.

Table 2 . The table shows the accuracy score of machine learning algorithms with all physical condition indicators and only self-measurable indicators. Orange represents the algorithm with the highest accuracy score. Green represents the algorithm with the lowest accuracy score.

Table 3 . The table shows the false negative rate of machine learning algorithms with all physical condition indicators and only self-measurable indicators. Orange represents the algorithm with the highest false negative rate. Green represents the algorithm with the lowest false negative rate.

After comparison, it is concluded that machine learning algorithms with all physical condition indicators have a much lower false negative rate than algorithms with only physical condition indicators measured at home. Note that the false negative rate for the Decision Tree is the same for both groups. This is probably due to the randomness of the data splitting process, as the test set is only 20% of the entire data set, which is about 60 data samples. The difference between the algorithms ranges from 0% to 17.65%.

5. Conclusions and Discussion

5.1. Conclusion

To answer the research question of this study, it is concluded that machine learning algorithms with only self-measurable physical condition indicators do not predict as accurately as machine learning algorithms with all physical condition indicators. Not only do they predict the heart disease outcome less accurately, but they are also more likely to falsely predict the absence of heart disease among patients who have it. Thus, machine learning algorithms with only self-measurable physical condition indicators should not be used until more indicators become measurable at home.

5.2. Study Limitation

The findings of this study have to be seen in light of some limitations. It is noteworthy that the dataset used in this study is a subset of the original database, which contained 76 attributes rather than the 14 used here. Among the original 76 attributes, others could be measured at home and could thus improve the accuracy and reduce the false negative rate of the machine learning algorithms with only self-measurable physical condition indicators.

5.3. Future Work

The limitations of this study have indicated the following areas as recommendations for future work. First, include other health attributes from the original dataset to discover the machine learning algorithm with the highest accuracy and lowest false negative rate. Second, since every patient has different health conditions, it is recommended to group the patients with similar health conditions and ages to investigate each machine learning algorithm’s accuracy and false negative rate.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License .


  • Open access
  • Published: 12 November 2020

Early and accurate detection and diagnosis of heart disease using intelligent computational model

  • Yar Muhammad 1 ,
  • Muhammad Tahir 1 ,
  • Maqsood Hayat 1 &
  • Kil To Chong 2  

Scientific Reports volume  10 , Article number:  19747 ( 2020 ) Cite this article


Subjects: Cardiovascular diseases, Computational biology and bioinformatics, Health care, Heart failure

Heart disease is a fatal human disease, rapidly increases globally in both developed and undeveloped countries and consequently, causes death. Normally, in this disease, the heart fails to supply a sufficient amount of blood to other parts of the body in order to accomplish their normal functionalities. Early and on-time diagnosing of this problem is very essential for preventing patients from more damage and saving their lives. Among the conventional invasive-based techniques, angiography is considered to be the most well-known technique for diagnosing heart problems but it has some limitations. On the other hand, the non-invasive based methods, like intelligent learning-based computational techniques are found more upright and effectual for the heart disease diagnosis. Here, an intelligent computational predictive system is introduced for the identification and diagnosis of cardiac disease. In this study, various machine learning classification algorithms are investigated. In order to remove irrelevant and noisy data from extracted feature space, four distinct feature selection algorithms are applied and the results of each feature selection algorithm along with classifiers are analyzed. Several performance metrics namely: accuracy, sensitivity, specificity, AUC, F1-score, MCC, and ROC curve are used to observe the effectiveness and strength of the developed model. The classification rates of the developed system are examined on both full and optimal feature spaces, consequently, the performance of the developed model is boosted in case of high variated optimal feature space. In addition, P-value and Chi-square are also computed for the ET classifier along with each feature selection technique. It is anticipated that the proposed system will be useful and helpful for the physician to diagnose heart disease accurately and effectively.


Introduction

Heart disease is considered one of the most perilous and life-threatening chronic diseases all over the world. In heart disease, the heart normally fails to supply sufficient blood to other parts of the body to accomplish their normal functionality 1 . Heart failure occurs due to blockage and narrowing of the coronary arteries, which are responsible for supplying blood to the heart itself 2 . A recent survey reveals that the United States is the country most affected by heart disease, where the ratio of heart disease patients is very high 3 . The most common symptoms of heart disease include physical weakness, shortness of breath, swollen feet, and weariness with associated signs 4 . The risk of heart disease may be increased by a person's lifestyle, such as smoking, an unhealthy diet, a high cholesterol level, high blood pressure, and a lack of exercise and fitness 5 . Heart disease has several types, of which coronary artery disease (CAD) is the most common and can lead to chest pain, stroke, and heart attack. The other types of heart disease include heart rhythm problems, congestive heart failure, congenital heart disease (heart disease present at birth), and cardiovascular disease (CVD). Initially, traditional investigative techniques were used for the identification of heart disease; however, they were found to be complex 6 . Owing to the non-availability of medical diagnostic tools and medical experts, specifically in undeveloped countries, diagnosis and treatment of heart disease are very complex 7 . However, a precise and timely diagnosis of heart disease is imperative to prevent further damage to the patient 8 . Heart disease is a fatal disease that is rapidly increasing in both economically developed and undeveloped countries. According to a report generated by the World Health Organization (WHO), an average of 17.90 million people died from CVD in 2016, representing approximately 30% of all global deaths. According to a report, 0.2 million people die from heart disease annually in Pakistan, and every year the number of victims rapidly increases. The European Society of Cardiology (ESC) has published a report in which 26.5 million adults were identified as having heart disease, with 3.8 million new cases identified each year. About 50–55% of heart disease patients die within the first 1–3 years, and the cost of heart disease treatment is about 4% of the overall annual healthcare budget 9 .

Conventional invasive methods for the diagnosis of heart disease were based on a patient's medical history, physical test results, and investigation of related symptoms by doctors 10 . Among the conventional methods, angiography is considered one of the most precise techniques for the identification of heart problems. Conversely, angiography has some drawbacks, such as high cost, various side effects, and the need for strong technological knowledge 11 . Conventional methods often lead to imprecise diagnoses and take more time due to human error. In addition, this is a very expensive and computationally intensive approach to disease diagnosis and takes time to assess 12 .

To overcome the issues in conventional invasive methods for the identification of heart disease, researchers attempted to develop different non-invasive smart healthcare systems based on predictive machine learning techniques, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayes (NB), and Decision Tree (DT), etc. 13 . As a result, the death ratio of heart disease patients has decreased 14 . In the literature, the Cleveland heart disease dataset is extensively utilized by researchers 15 , 16 .

In this regard, Robert et al. 17 used a logistic regression classification algorithm for heart disease detection and obtained an accuracy of 77.1%. Similarly, Wankhade et al. 18 used a multi-layer perceptron (MLP) classifier for heart disease diagnosis and attained an accuracy of 80%. Likewise, Allahverdi et al. 19 developed a heart disease classification system in which they integrated neural networks with an artificial neural network and attained an accuracy of 82.4%. In a sequel, Awang et al. 20 used NB and DT for the diagnosis and prediction of heart disease and achieved reasonable results in terms of accuracy: 82.7% with NB and 80.4% with DT. Oyedodum and Olaniye 21 proposed a three-phase system for the prediction of heart disease using ANN. Das and Turkoglu 22 proposed an ANN ensemble-based predictive model for the prediction of heart disease. Similarly, Paul and Robin 23 used the adaptive fuzzy ensemble method for the prediction of heart disease. Likewise, Tomov et al. 24 introduced a deep neural network for heart disease prediction, and their proposed model performed well and produced good outcomes. Further, Manogaran and Varatharajan 25 introduced the concept of a hybrid recommendation system for diagnosing heart disease, and their model gave considerable results. Alizadehsani et al. 26 developed a non-invasive model for the prediction of coronary artery disease and showed good results in terms of accuracy and other performance assessment metrics. Amin et al. 27 proposed a hybrid system framework for the identification of cardiac disease using machine learning and attained an accuracy of 86.0%. Similarly, Mohan et al. 28 proposed another intelligent system that integrates RF with a linear model for the prediction of heart disease and achieved a classification accuracy of 88.7%. Likewise, Liaqat et al. 29 developed an expert system that uses stacked SVM for the prediction of heart disease and obtained 91.11% classification accuracy on selected features.

The contribution of the current work is to introduce an intelligent medical decision system for the diagnosis of heart disease based on contemporary machine learning algorithms. In this study, 10 machine learning classification algorithms of different natures, such as Logistic Regression (LR), Decision Tree (DT), Naïve Bayes (NB), Random Forest (RF), and Artificial Neural Network (ANN), are implemented in order to select the best model for timely and accurate detection of heart disease at an early stage. Four feature selection algorithms, Fast Correlation-Based Filter Solution (FCBF), minimal redundancy maximal relevance (mRMR), Least Absolute Shrinkage and Selection Operator (LASSO), and Relief, have been used to select the vital and most correlated features that truly reflect the motif of the desired target. Our developed system has been trained and tested on the Cleveland (S 1 ) and Hungarian (S 2 ) heart disease datasets, which are available online from the UCI machine learning repository. All processing and computations were performed using the Anaconda IDE, with Python used as the tool for implementing all the classifiers. The main packages and libraries used include pandas, NumPy, matplotlib, scikit-learn (sklearn), and seaborn. The main contributions of our proposed work are given below:

The performance of all classifiers has been tested on the full feature space in terms of all performance evaluation metrics, specifically accuracy.

The performances of the classifiers are tested on the feature spaces selected by the various feature selection algorithms mentioned above.

The research study recommends which feature selection algorithm is feasible with which classification algorithm for developing a high-level intelligent system for diagnosing heart disease patients.

The rest of the paper is organized as follows: the " Results and discussion " section presents the results and discussion, and the " Material and methods " section describes the materials and methods used in this paper. Finally, we conclude our proposed research work in the " Conclusion " section.

Results and discussion

This section of the paper discusses the experimental results of various contemporary classification algorithms. At first, the performance of all used classification models i.e. K-Nearest Neighbors (KNN), Decision Tree (DT), Extra-Tree Classifier (ETC), Random Forest (RF), Logistic Regression (LR), Naïve Bayes (NB), Artificial Neural Network (ANN), Support Vector Machine (SVM), Adaboost (AB), and Gradient Boosting (GB) along with full feature space is evaluated. After that, four feature selection algorithms (FSA): Fast Correlation-Based Filter (FCBF), Minimal Redundancy Maximal Relevance (mRMR), Least Absolute Shrinkage and Selection Operator (LASSO), and Relief are applied to select the prominent and high variant features from feature space. Furthermore, the selected feature spaces are provided to classification algorithms as input to analyze the significance of feature selection techniques. The cross-validation techniques i.e. k-fold (10-fold) are applied on both the full and selected feature spaces to analyze the generalization power of the proposed model. Various performance evaluation metrics are implemented for measuring the performances of the classification models.
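The evaluation protocol described above can be sketched with scikit-learn's `cross_validate`; the Extra-Trees classifier, synthetic data, and this subset of scoring metrics are stand-ins for the paper's actual setup:

```python
# Sketch: 10-fold cross-validation with several scoring metrics at once.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

scoring = ["accuracy", "recall", "precision", "f1", "roc_auc"]
cv = cross_validate(ExtraTreesClassifier(random_state=0), X, y, cv=10, scoring=scoring)

# mean score across the 10 folds for each metric
print({m: round(cv[f"test_{m}"].mean(), 3) for m in scoring})
```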

Classifiers’ predictive outcomes on full feature space

The experimental outcomes of the applied classification algorithms on the full feature space of the two benchmark datasets by using 10-fold cross-validation (CV) techniques are shown in Tables 1 and 2 , respectively.

The experimental results demonstrated that the ET classifier performed quite well in terms of all performance evaluation metrics compared to the other classifiers using 10-fold CV. ET achieved 92.09% accuracy, 91.82% sensitivity, 92.38% specificity, 97.92% AUC, 92.84% precision, 0.92 F1-score and 0.84 MCC. Specificity indicates that the diagnostic test was negative and the individual does not have the disease, while sensitivity indicates that the diagnostic test was positive and the patient has heart disease. In the case of the KNN classification model, multiple experiments were performed with various values of k, i.e. k = 3, 5, 7, 9, 13, and 15. KNN showed the best performance at k = 7 and achieved a classification accuracy of 85.55%, 85.93% sensitivity, 85.17% specificity, 95.64% AUC, 86.09% precision, 0.86 F1-score, and 0.71 MCC. Similarly, the DT classifier achieved an accuracy of 86.82%, 89.73% sensitivity, 83.76% specificity, 91.89% AUC, 85.40% precision, 0.87 F1-score, and 0.73 MCC. Likewise, the GB classifier yielded an accuracy of 91.34%, 90.32% sensitivity, 91.52% specificity, 96.87% AUC, 92.14% precision, 0.92 F1-score, and 0.83 MCC. After empirically evaluating the success rates of all classifiers, it is observed that the ET classifier outperformed all the other classification algorithms in terms of accuracy, sensitivity, and specificity, whereas NB showed the lowest performance on these measures. The ROC curves of all classification algorithms on the full feature space are represented in Fig. 1 .

figure 1

ROC curves of all classifiers on full feature space using 10-fold cross-validation on S 1 .

In the case of dataset S 2 , composed of 1025 total instances of which 525 belong to the positive class and 500 to the negative class, ET again obtained quite good results compared to the other classifiers using a 10-fold cross-validation test: 96.74% accuracy, 96.36% sensitivity, 97.40% specificity, and 0.93 MCC, as shown in Table 2 .

Classifiers’ predictive outcomes on selected feature space

FCBF feature selection technique

The FCBF feature selection technique is applied to select the best subset of the feature space. In this attempt, subspaces of various lengths are generated and tested. Finally, the best results are achieved by the classification algorithms on the subset of the feature space with n = 6 using a 10-fold CV. Table 3 shows various performance measures of the classifiers executed on the feature space selected by FCBF.

Table 3 demonstrates that the ET classifier obtained quite good results including accuracy of 94.14%, 94.29% sensitivity, and specificity of 93.98%. In contrast, NB reported the lowest performance compared to the other classification algorithms. The performance of classification algorithms is also illustrated in Fig.  2 by using ROC curves.

figure 2

ROC curve of all classifiers on selected features by FCBF feature selection algorithm.

mRMR feature selection technique

The mRMR feature selection technique is used to select a subset of features that enhances the performance of the classifiers. The best results are reported on a subset of the feature space with n = 6, as shown in Table 4 .

In the case of mRMR, the success rates of the ET classifier are again the best in terms of all performance evaluation metrics compared to the other classifiers. ET attained 93.42% accuracy, 93.92% sensitivity, and 93.88% specificity. In contrast, NB achieved the lowest accuracy at 81.84%. Figure 3 shows the ROC curves of all ten classifiers using the mRMR feature selection algorithm.

figure 3

ROC curve of all classifiers on selected features using the mRMR feature selection algorithm.

LASSO feature selection technique

In order to choose the optimal feature space, which not only reduces computational cost but also improves the performance of the classifiers, the LASSO feature selection technique is applied. After performing various experiments on different subsets of the feature space, the best results are again noted on the subspace of n = 6. The predicted outcomes on the best-selected feature space are reported in Table 5 using the 10-fold CV.
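One common way to realize LASSO-based selection, sketched here under assumptions (scikit-learn's `SelectFromModel` over an L1-penalized `Lasso`, an arbitrary `alpha`, and synthetic data; the paper does not specify these details):

```python
# Sketch: keep features with nonzero L1-penalized coefficients, capped at n = 6.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)

# alpha = 0.01 is an illustrative choice, not the paper's setting
selector = SelectFromModel(Lasso(alpha=0.01), max_features=6).fit(X, y)
selected = np.flatnonzero(selector.get_support())
print("selected feature indices:", selected)
```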

Table 5 demonstrates that the predicted outcomes of the ET classifier are considerable and better than those of the other classifiers. ET achieved 89.36% accuracy, 88.21% sensitivity, and 90.58% specificity. Likewise, GB yielded the second-best result, with an accuracy of 88.47%, 89.54% sensitivity, and 87.37% specificity, whereas LR performed worst, achieving 80.77% accuracy, 83.46% sensitivity, and 77.95% specificity. ROC curves of the classifiers are shown in Fig. 4 .

figure 4

ROC curve of all classifiers on selected feature space using the LASSO feature selection algorithm.

Relief feature selection technique

In a sequel, another feature selection technique Relief is applied to investigate the performance of classifiers on different sub-feature spaces by using the wrapper method. After empirically analyzing the results of the classifiers on a different subset of feature spaces, it is observed that the performance of classifiers is outstanding on the sub-space of length (n = 6). The results of the optimal feature space on the 10-fold CV technique are listed in Table 6 .
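The core Relief idea for a binary target can be sketched as follows; this is an illustration only, not the exact wrapper-based variant used in the paper: reward features that differ on the nearest opposite-class sample ("miss") and penalize those that differ on the nearest same-class sample ("hit"):

```python
# Sketch: minimal Relief feature weighting for a binary target.
import numpy as np

def relief_weights(X, y):
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to every sample
        dist[i] = np.inf                     # never pick the sample itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))   # nearest same-class sample
        miss = np.argmin(np.where(diff, dist, np.inf))  # nearest other-class sample
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)   # only feature 0 is informative
weights = relief_weights(X, y)
print(weights.round(3))         # feature 0 should receive the largest weight
```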

Again, the ET classifier performed outstandingly in terms of all performance evaluation metrics as compared to other classifiers. ET has obtained an accuracy of 94.41%, 94.93% sensitivity, and specificity of 94.89%. In contrast, NB has shown the lowest performance and achieved 80.29% accuracy, 81.93% sensitivity, and specificity of 78.55%. The ROC curves of the classifiers are demonstrated in Fig.  5 .

figure 5

ROC curve of all classifiers on selected features selected by the Relief feature selection algorithm.

After executing the classification algorithms on both full and selected feature spaces in order to select the optimal algorithm for the operational engine, the empirical results revealed that ET performed well not only on the full feature space but also on the optimal selected feature space among all the used classification algorithms. Furthermore, the ET classifier obtained quite promising accuracy with the Relief feature selection technique, namely 94.41%. Overall, the performance of ET is better in terms of most of the measures, while the other classifiers show good results on one measure but worse on others. In addition, the performance of the ET classifier is also evaluated on a 10-fold CV in combination with different sub-feature spaces of varying length, from 1 to 12 with a step size of 1, to check the stability and discrimination power of the classifier, as described in 30 . Doing so will assist readers in better understanding the impact of the number of selected features on the performance of the classifiers. The same process is repeated for the other dataset, S 2 (Hungarian heart disease dataset), to determine the impact of the selected features on classification performance.

Tables 7 and 8 show the performance of the ET classifier using 10-fold CV in combination with different feature sub-spaces from 1 to 12 with a step size of 1. The experimental results show that the performance of the ET classifier is affected significantly by varying the length of the sub-feature spaces. Finally, it is concluded that these achievements are ascribed to the Relief feature selection technique, which not only reduces the feature space but also enhances the predictive power of the classifiers, and to the ET classifier, which clearly and precisely learned the motif of the target class and reflected it truly. In addition, the performance of the ET classifier is also evaluated on 5-fold and 7-fold CV in combination with sub-spaces of length 5 and 7 to check the stability and discrimination power of the classifier, and it is also tested on the other dataset, S 2 (Hungarian heart disease dataset). The results are shown in the supplementary materials .

In Table 9 , P-value and Chi-Square values are also computed for the ET classifier in combination with the optimal feature spaces of different feature selection techniques.
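Per-feature chi-square scores and p-values of this kind can be sketched with scikit-learn's `chi2` (which requires non-negative inputs, hence the Min–Max scaling); the data here are synthetic stand-ins:

```python
# Sketch: chi-square statistic and p-value per feature against the class label.
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_pos = MinMaxScaler().fit_transform(X)   # chi2 needs non-negative features

chi2_scores, p_values = chi2(X_pos, y)
print(p_values.round(4))
```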

Performance comparison with existing models

Further, a comparative study of the developed system is conducted against other state-of-the-art machine learning approaches discussed in the literature. Table 10 presents a brief description and the classification accuracies of those approaches. The results demonstrate that our proposed model's success rate is high compared to existing models in the literature.

Material and methods

The following subsections describe the materials and methods used in this paper.

The first and rudimentary step in developing an intelligent computational model is to construct or obtain a problem-related dataset that truly and effectively reflects the pattern of the target class. A well-organized, problem-related dataset has a high influence on the performance of the computational model. Given the significance of the dataset, two datasets, the Cleveland heart disease dataset (S 1 ) and the Hungarian heart disease dataset (S 2 ), are used; they are available online at the University of California Irvine (UCI) machine learning repository and the UCI Kaggle repository, and various researchers have used them in their studies 28 , 31 , 32 . S 1 consists of 304 instances, where each instance has 13 distinct attributes along with the target label, and is selected for training. The dataset is composed of two classes: presence or absence of heart disease. S 2 is composed of 1025 instances, of which 525 belong to the positive class and the remaining 500 to the negative class. The attributes of both datasets are the same and carry the same descriptions. The complete description and information of the datasets with 13 attributes are given in Table 11 .

Proposed system methodology

The main theme of the developed system is to identify heart problems in human beings. In this study, four distinct feature selection techniques, namely FCBF, mRMR, Relief, and LASSO, are applied to the provided dataset in order to remove noisy and redundant features and select variant features, which may consequently enhance the performance of the proposed model. Various machine learning classification algorithms are used in this study, including KNN, DT, ETC, RF, LR, NB, ANN, SVM, AB, and GB. Different evaluation metrics are computed to assess the performance of the classification algorithms. The methodology of the proposed system is carried out in five stages: dataset preprocessing, feature selection, cross-validation, classification, and performance evaluation of the classifiers. The framework of the proposed system is illustrated in Fig. 6 .

figure 6

An Intelligent Hybrid Framework for the prediction of heart disease.

Preprocessing of data

Data preprocessing is the process of transforming raw data into meaningful patterns and is crucial for a good representation of the data. Various preprocessing approaches such as missing-value removal, standard scaling, and Min–Max scaling are applied to the dataset in order to make it more effective for classification.
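The paper does not show its preprocessing code; the three steps named above can be sketched as follows (a minimal NumPy version, assuming numeric feature columns; in practice scikit-learn's `StandardScaler` and `MinMaxScaler` perform the same per-column transforms):

```python
import numpy as np

def standard_scale(X):
    # zero mean, unit variance per column (what StandardScaler does)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    # rescale each column to the [0, 1] range (what MinMaxScaler does)
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def drop_missing(X):
    # missing-value removal: discard any row containing a NaN
    return X[~np.isnan(X).any(axis=1)]
```

Scaling matters for distance- and margin-based classifiers in the study (KNN, SVM, ANN), which are sensitive to the raw ranges of attributes such as cholesterol versus age.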

Feature selection algorithms

A feature selection technique selects an optimal feature sub-space from all the features in a dataset. This is crucial because classification performance can degrade due to irrelevant features. Feature selection improves the performance of classification algorithms and also reduces their execution time. In this research study, four feature selection techniques are used, as listed below:

Fast correlation-based filter (FCBF): The FCBF feature selection algorithm follows a sequential search strategy. It starts from the full feature set and uses symmetric uncertainty to measure the dependencies of the features on each other and on the target output label. It then selects the most important features using a backward sequential search strategy. FCBF performs well on high-dimensional datasets. Table 12 shows the features (n = 6) selected by the FCBF algorithm. Each attribute is given a weight based on its importance; according to FCBF, the most important features are THA and CPT, as shown in Table 12. The ranking that FCBF gives to all the features of the dataset is shown in Fig. 7.
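The symmetric uncertainty score at the core of FCBF can be sketched as follows (a minimal illustrative version for discrete-valued features; the function names are ours, and the paper's actual implementation is not shown):

```python
import math
from collections import Counter

def entropy(xs):
    # Shannon entropy H(X) of a discrete variable, in bits
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]:
    # 0 means X and Y are independent, 1 means they determine each other
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))  # mutual information I(X; Y)
    return 2 * mi / (hx + hy) if (hx + hy) else 0.0
```

FCBF ranks each feature by its SU with the class label, then walks that ranking and discards any feature whose SU with an already-kept feature exceeds its SU with the label (a "redundant peer" dominates it).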

Minimal redundancy maximal relevance (mRMR): mRMR uses a heuristic approach to select the most vital features, those with minimum redundancy among themselves and maximum relevance to the target. Because it follows a greedy strategy, it evaluates one candidate feature at a time and computes its pairwise redundancy with the already-selected features. The mRMR feature selection algorithm is not suitable for very high-dimensional feature problems 33. The features (n = 6) selected by mRMR are listed in Table 13; among these attributes, PES and CPT have the highest scores. Figure 7 shows the ranking that mRMR assigns to all attributes in the feature space.

figure 7

Features ranking by four feature selection algorithms (FCBF, LASSO, mRMR, Relief).

Least absolute shrinkage and selection operator (LASSO): LASSO selects features by shrinking the absolute values of the feature coefficients; features whose coefficients are driven to zero are removed from the feature subset. The features retaining non-zero coefficient values are kept in the subset of features and the rest are eliminated. However, some irrelevant features with higher coefficient values may still be selected and included in the subset 30. Table 14 presents the six attributes most strongly correlated with the target, together with the scores assigned by the LASSO feature selection algorithm. Figure 7 shows the important features and their scoring values as given by LASSO.
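The soft-thresholding step that drives small coefficients exactly to zero can be sketched with a toy coordinate-descent solver (an illustration of the mechanism; in practice a library routine such as scikit-learn's `Lasso` would be used, and the helper names here are ours):

```python
import numpy as np

def soft_threshold(rho, alpha):
    # the LASSO shrinkage operator: pulls small values exactly to zero
    if rho < -alpha:
        return rho + alpha
    if rho > alpha:
        return rho - alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    # minimizes 0.5 * ||y - Xw||^2 + alpha * ||w||_1, one coordinate at a time
    _, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j])
    return w

def lasso_select(X, y, alpha, names):
    # keep only the features whose coefficient was not shrunk to zero
    w = lasso_coordinate_descent(X, y, alpha)
    return [name for name, wj in zip(names, w) if wj != 0.0]
```

A feature uncorrelated with the target gets a sub-threshold `rho` and a coefficient of exactly zero, which is what makes LASSO usable as a feature selector rather than just a regularizer.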

Relief feature selection algorithm: Relief utilizes the concept of instance-based learning, allocating a weight to each attribute based on its significance. The weight of an attribute reflects its capability to differentiate among class values. Attributes are ranked by weight, and those whose weight exceeds a user-specified cutoff are chosen as the final subset 34. The Relief algorithm thus selects the most significant attributes, those with the greatest effect on the target 35. The algorithm operates by selecting instances randomly from the training samples. For each sampled instance, the nearest instance of the same class (nearest hit) and of the opposite class (nearest miss) is identified. The weight of an attribute is updated according to how well its values differentiate the sampled instance from its nearest hit and nearest miss. If an attribute discriminates between instances from different classes and has the same value for instances of the same class, it receives a high weight.

figure a

The weight updating of attributes works on a simple idea (line 6): if instance Ri and its nearest hit NH have dissimilar values (i.e. the diff value is large), the attribute separates two instances of the same class, which is not desirable, so the attribute's weight is reduced. On the other hand, if instance Ri and its nearest miss NM have dissimilar values, the attribute separates two instances of different classes, which is desirable, so the weight is increased. The six most important features selected by the Relief algorithm are listed in descending order in Table 15. Based on the weight values, the most vital features are CPT and Age. Figure 7 shows the important features and their ranking as given by the Relief feature selection algorithm.
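That hit/miss update rule can be sketched as follows (a minimal Relief for binary classes and numeric features already scaled to [0, 1]; a toy illustration rather than the exact implementation used in the study):

```python
import numpy as np

def relief(X, y, n_samples=None, rng=None):
    # basic Relief: sample instances, find nearest hit/miss, update weights
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    idx = rng.integers(0, n, n_samples or n)
    w = np.zeros(p)
    for i in idx:
        dists = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to all rows
        dists[i] = np.inf                      # exclude the instance itself
        same, other = (y == y[i]), (y != y[i])
        nh = np.argmin(np.where(same, dists, np.inf))   # nearest hit
        nm = np.argmin(np.where(other, dists, np.inf))  # nearest miss
        # penalize attributes that differ from the hit (same class),
        # reward attributes that differ from the miss (opposite class)
        w += -np.abs(X[i] - X[nh]) + np.abs(X[i] - X[nm])
    return w / len(idx)
```

On toy data where one feature tracks the class and another is noise, the tracking feature accumulates positive weight while the noise feature is pushed negative, which is exactly the ranking behaviour shown in Fig. 7.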

Machine learning classification algorithms

In this study, various machine learning classification algorithms are investigated for early detection of heart disease. Each classification algorithm has its own significance, and its reported importance varies from application to application. In this paper, 10 classification algorithms of distinct nature, namely KNN, DT, ET, GB, RF, SVM, AB, NB, LR, and ANN, are applied in order to select the best and most generalizable prediction model.

Classifier validation method

Validation of the prediction model is an essential step in the machine learning process. In this paper, the K-fold cross-validation method is applied to validate the results of the above-mentioned classification models.

K-fold cross validation (CV)

In K-fold CV, the whole dataset is split into k equal parts. At each iteration, k − 1 parts are used for training and the remaining part for testing; this process continues for k iterations. Various researchers have used different values of k for CV. Here k = 10 is used for the experimental work because it produces good results. In tenfold CV, 90% of the data is used for training the model and the remaining 10% for testing at each iteration. Finally, the mean of the results of all iterations is taken as the final result.
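The tenfold procedure can be sketched as follows (a minimal NumPy version; `model_fit_predict` is an illustrative stand-in for fitting and applying any of the classifiers above, and scikit-learn's `KFold`/`cross_val_score` provide the same behaviour):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    # shuffle once, then split the indices into k roughly equal folds
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(model_fit_predict, X, y, k=10):
    folds = kfold_indices(len(y), k)
    scores = []
    for i in range(k):
        test = folds[i]                       # 1 fold held out (10% for k=10)
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = model_fit_predict(X[train], y[train], X[test])
        scores.append(np.mean(preds == y[test]))
    return float(np.mean(scores))             # final result: mean over k folds
```

Because every instance is tested exactly once and trained on k − 1 times, the mean score is less sensitive to a lucky or unlucky single split than a one-off train/test partition.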

Performance evaluation metrics

To measure the performance of the classification algorithms used in this paper, various evaluation metrics have been computed, including accuracy, sensitivity, specificity, F1-score, recall, Matthews correlation coefficient (MCC), AUC score, and the ROC curve. All these measures are calculated from the confusion matrix described in Table 16.

In the confusion matrix, True Negative (TN) means the patient does not have heart disease and the model predicts the same, i.e. a healthy person is correctly classified by the model.

True Positive (TP) means the patient has heart disease and the model predicts the same, i.e. a person with heart disease is correctly classified by the model.

False Positive (FP) means the patient does not have heart disease but the model predicts that the patient has, i.e. a healthy person is incorrectly classified by the model. This is also called a type-1 error.

False Negative (FN) means the patient has heart disease but the model predicts that the patient has not, i.e. a person with heart disease is incorrectly classified by the model. This is also called a type-2 error.

Accuracy: The accuracy of the classification model shows the overall performance of the model and is calculated as Accuracy = (TP + TN) / (TP + TN + FP + FN).

Specificity: Specificity is the ratio of correctly classified healthy people to the total number of healthy people; i.e. the prediction is negative and the person is healthy. It is calculated as Specificity = TN / (TN + FP).

Sensitivity: Sensitivity is the ratio of correctly classified heart patients to the total number of patients having heart disease; i.e. the model prediction is positive and the person has heart disease. It is calculated as Sensitivity = TP / (TP + FN).

Precision: Precision is the ratio of true positives to all positives predicted by the classification model, calculated as Precision = TP / (TP + FP).

F1-score: F1 is the harmonic mean of precision and sensitivity (recall): F1 = 2 × (Precision × Sensitivity) / (Precision + Sensitivity). Its value ranges between 0 and 1, where 1 indicates good performance of the classification algorithm and 0 indicates bad performance.

MCC: The Matthews correlation coefficient is a correlation coefficient between the actual and predicted results. MCC yields values between − 1 and + 1, where − 1 represents a completely wrong prediction, 0 means the classifier generates random predictions, and + 1 represents an ideal prediction. It is calculated as MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

Finally, we examine the predictive ability of the machine learning classification algorithms with the help of the receiver operating characteristic (ROC) curve, which is a graphical representation of classifier performance. The area under the curve (AUC) summarizes the ROC of a classifier, and the performance of a classification algorithm is directly linked with AUC: the larger the AUC, the better the performance of the classification algorithm.
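All of the scores above derive from the four counts in Table 16; a minimal sketch (an illustrative helper of our own, not the paper's code):

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    # every metric below is computed from the four confusion-matrix counts
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision   = tp / (tp + fp)
    f1  = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f1": f1, "mcc": mcc}
```

Reporting sensitivity and specificity alongside accuracy matters here because a classifier can reach high accuracy on an imbalanced clinical dataset while still missing many true patients; MCC is the single score least fooled by that imbalance.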

In this study, 10 different machine learning classification algorithms, namely LR, DT, NB, RF, ANN, KNN, GB, SVM, AB, and ET, are implemented in order to select the best model for early and accurate detection of heart disease. Four feature selection algorithms, FCBF, mRMR, LASSO, and Relief, have been used to select the most vital and correlated features that truly reflect the pattern of the desired target. Our intelligent computational model has been trained and tested on two datasets, the Cleveland (S1) and Hungarian (S2) heart disease datasets. Python has been used as the tool for implementing and simulating the results of all the utilized classification algorithms.

The performance of all classification models has been tested in terms of various performance metrics on the full feature space as well as on the feature spaces selected by the various feature selection algorithms. This research study recommends which feature selection algorithm is best paired with which classification model for developing a high-level intelligent system for diagnosing patients with heart disease. From the simulation results, it is observed that ET is the best classifier while Relief is the optimal feature selection algorithm. In addition, the P-value and Chi-square statistic are computed for the ET classifier with each feature selection algorithm. It is anticipated that the proposed system will be useful and helpful for doctors and other caregivers in diagnosing heart disease accurately and effectively at an early stage.

Heart disease is one of the most devastating and fatal chronic diseases; it is rapidly increasing in both economically developed and developing countries and causes death. This damage can be reduced considerably if the patient is diagnosed at an early stage and proper treatment is provided. In this paper, we developed an intelligent predictive system based on contemporary machine learning algorithms for the prediction and diagnosis of heart disease. The developed system was checked on two datasets, the Cleveland (S1) and Hungarian (S2) heart disease datasets, and was trained and tested on both the full and the optimal feature sets. Ten classification algorithms (KNN, DT, RF, NB, SVM, AB, ET, GB, LR, and ANN) and four feature selection algorithms (FCBF, mRMR, LASSO, and Relief) were used. The feature selection algorithms select the most significant features from the feature space, which not only reduces classification errors but also shrinks the feature space. To assess the performance of the classification algorithms, various evaluation metrics were used, such as accuracy, sensitivity, specificity, AUC, F1-score, MCC, and the ROC curve. The classification accuracies of the top two classification algorithms, ET and GB, on the full feature set were 92.09% and 91.34% respectively. After applying feature selection, the classification accuracy of ET with the Relief feature selection algorithm increased from 92.09 to 94.41%, and the accuracy of GB increased from 91.34 to 93.36% with the FCBF feature selection algorithm. Thus, the ET classifier with the Relief feature selection algorithm performs best. The P-value and Chi-square statistic were also computed for the ET classifier with each feature selection technique.
The future work of this research study is to use more optimization techniques, feature selection algorithms, and classification algorithms to improve the performance of the predictive system for the diagnosis of heart disease.

Bui, A. L., Horwich, T. B. & Fonarow, G. C. Epidemiology and risk profile of heart failure. Nat. Rev. Cardiol. 8 , 30 (2011).


Polat, K. & Güneş, S. Artificial immune recognition system with fuzzy resource allocation mechanism classifier, principal component analysis, and FFT method based new hybrid automated identification system for classification of EEG signals. Expert Syst. Appl. 34, 2039–2048 (2008).


Heidenreich, P. A. et al. Forecasting the future of cardiovascular disease in the United States: A policy statement from the American Heart Association. Circulation 123 , 933–944 (2011).

Durairaj, M. & Ramasamy, N. A comparison of the perceptive approaches for preprocessing the data set for predicting fertility success rate. Int. J. Control Theory Appl. 9 , 255–260 (2016).


Das, R., Turkoglu, I. & Sengur, A. Effective diagnosis of heart disease through neural networks ensembles. Expert Syst. Appl. 36, 7675–7680 (2009).

Allen, L. A. et al. Decision making in advanced heart failure: A scientific statement from the American Heart Association. Circulation 125, 1928–1952 (2012).

Yang, H. & Garibaldi, J. M. A hybrid model for automatic identification of risk factors for heart disease. J. Biomed. Inform. 58 , S171–S182 (2015).


Alizadehsani, R., Hosseini, M. J., Sani, Z. A., Ghandeharioun, A. & Boghrati, R. In 2012 IEEE 12th International Conference on Data Mining Workshops. 9–16 (IEEE, New York).

Arabasadi, Z., Alizadehsani, R., Roshanzamir, M., Moosaei, H. & Yarifard, A. A. Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm. Comput. Methods Programs Biomed. 141 , 19–26 (2017).

Samuel, O. W., Asogbon, G. M., Sangaiah, A. K., Fang, P. & Li, G. An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction. Expert Syst. Appl. 68 , 163–172 (2017).

Patil, S. B. & Kumaraswamy, Y. Intelligent and effective heart attack prediction system using data mining and artificial neural network. Eur. J. Sci. Res. 31 , 642–656 (2009).

Vanisree, K. & Singaraju, J. Decision support system for congenital heart disease diagnosis based on signs and symptoms using neural networks. Int. J. Comput. Appl. 19 , 6–12 (2015).

B. Edmonds. In Proceedings of AISB Symposium on Socially Inspired Computing 1–12 (Hatfield, 2005).

Methaila, A., Kansal, P., Arya, H. & Kumar, P. Early heart disease prediction using data mining techniques. Comput. Sci. Inf. Technol. J. https://doi.org/10.5121/csit.2014.4807 (2014).

Samuel, O. W., Asogbon, G. M., Sangaiah, A. K., Fang, P. & Li, G. An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction. Expert Syst. Appl. 68, 163–172 (2017).

Nazir, S., Shahzad, S., Mahfooz, S. & Nazir, M. Fuzzy logic based decision support system for component security evaluation. Int. Arab J. Inf. Technol. 15 , 224–231 (2018).

Detrano, R. et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64, 304–310 (1989).

Gudadhe, M., Wankhade, K. & Dongre, S. In 2010 International Conference on Computer and Communication Technology (ICCCT) , 741–745 (IEEE, New York).

Kahramanli, H. & Allahverdi, N. Design of a hybrid system for the diabetes and heart diseases. Expert Syst. Appl. 35, 82–89 (2008).

Palaniappan, S. & Awang, R. In 2012 IEEE/ACS International Conference on Computer Systems and Applications 108–115 (IEEE, New York).

Olaniyi, E. O., Oyedotun, O. K. & Adnan, K. Heart diseases diagnosis using neural networks arbitration. Int. J. Intel. Syst. Appl. 7 , 72 (2015).

Das, R., Turkoglu, I. & Sengur, A. Effective diagnosis of heart disease through neural networks ensembles. Expert Syst. Appl. 36, 7675–7680 (2009).

Paul, A. K., Shill, P. C., Rabin, M. R. I. & Murase, K. Adaptive weighted fuzzy rule-based system for the risk level assessment of heart disease. Appl. Intell. 48, 1739–1756 (2018).

Tomov, N.-S. & Tomov, S. On deep neural networks for detecting heart disease. arXiv:1808.07168 (2018).

Manogaran, G., Varatharajan, R. & Priyan, M. Hybrid recommendation system for heart disease diagnosis based on multiple kernel learning with adaptive neuro-fuzzy inference system. Multimedia Tools Appl. 77 , 4379–4399 (2018).

Alizadehsani, R. et al. Non-invasive detection of coronary artery disease in high-risk patients based on the stenosis prediction of separate coronary arteries. Comput. Methods Programs Biomed. 162 , 119–127 (2018).

Haq, A. U., Li, J. P., Memon, M. H., Nazir, S. & Sun, R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Inf. Syst. 2018 , 3860146. https://doi.org/10.1155/2018/3860146 (2018).

Mohan, S., Thirumalai, C. & Srivastava, G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7 , 81542–81554 (2019).

Ali, L. et al. An optimized stacked support vector machines based expert system for the effective prediction of heart failure. IEEE Access 7 , 54007–54014 (2019).

Peng, H., Long, F. & Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27 (8), 1226–1238 (2005).

Palaniappan, S. & Awang, R. In 2008 IEEE/ACS International Conference on Computer Systems and Applications 108–115 (IEEE, New York).

Ali, L., Niamat, A., Golilarz, N. A., Ali, A. & Xingzhong, X. An expert system based on optimized stacked support vector machines for effective diagnosis of heart disease. IEEE Access (2019).

Pérez, N. P., López, M. A. G., Silva, A. & Ramos, I. Improving the Mann-Whitney statistical test for feature selection: An approach in breast cancer diagnosis on mammography. Artif. Intell. Med. 63 , 19–31 (2015).

Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 73 , 273–282 (2011).


Peng, H., Long, F. & Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).

de Silva, A. M. & Leong, P. H. Grammar-Based Feature Generation for Time-Series Prediction (Springer, Berlin, 2015).



Acknowledgements

This research was supported by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044815).

Author information

Authors and affiliations.

Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan

Yar Muhammad, Muhammad Tahir & Maqsood Hayat

Department of Electronic and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea

Kil To Chong


Contributions

All authors have equal contributions.

Corresponding authors

Correspondence to Maqsood Hayat or Kil To Chong .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Muhammad, Y., Tahir, M., Hayat, M. et al. Early and accurate detection and diagnosis of heart disease using intelligent computational model. Sci Rep 10 , 19747 (2020). https://doi.org/10.1038/s41598-020-76635-9


Received : 03 April 2020

Accepted : 28 October 2020

Published : 12 November 2020

DOI : https://doi.org/10.1038/s41598-020-76635-9


This article is cited by

Comprehensive evaluation and performance analysis of machine learning in heart disease prediction.

  • Halah A. Al-Alshaikh
  • Abeer A. AlSanad

Scientific Reports (2024)

Heart Disease Prediction Using Weighted K-Nearest Neighbor Algorithm

  • Khalidou Abdoulaye Barry
  • Youness Manzali
  • Mohamed Elfar

Operations Research Forum (2024)

Future prediction for precautionary measures associated with heart-related issues based on IoT prototype

  • Ganesh Keshaorao Yenurkar
  • Aniket Pathade

Multimedia Tools and Applications (2024)

An improved machine learning-based prediction framework for early detection of events in heart failure patients using mHealth

  • Deepak Kumar
  • Keerthiveena Balraj
  • Anurag S. Rathore

Health and Technology (2024)

Identification and classification of pneumonia disease using a deep learning-based intelligent computational framework

  • Lanying Tang

Neural Computing and Applications (2023)

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

heart disease prediction using machine learning research paper 2023

A Comprehensive Review of Artificial Intelligence and Machine Learning : Concepts, Trends, and Applications

  • September 2024
  • International Journal of Scientific Research in Science and Technology 11(5):126-142
  • 11(5):126-142

Akanksha Mishra

  • This person is not on ResearchGate, or hasn't claimed this research yet.

Discover the world's research

  • 25+ million members
  • 160+ million publication pages
  • 2.3+ billion citations
  • David Silver
  • Thomas Hubert
  • Julian Schrittwieser
  • Demis Hassabis

Y. Bengio

  • Geoffrey E. Hinton
  • Diederik P. Kingma

Max Welling

  • I Sutskever
  • Inc Gartner
  • Recruit researchers
  • Join for free
  • Login Email Tip: Most researchers use their institutional email address as their ResearchGate login Password Forgot password? Keep me logged in Log in or Continue with Google Welcome back! Please log in. Email · Hint Tip: Most researchers use their institutional email address as their ResearchGate login Password Forgot password? Keep me logged in Log in or Continue with Google No account? Sign up

IEEE Account

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

heart disease prediction using machine learning research paper 2023

Advances in Artificial-Business Analytics and Quantum Machine Learning

Select Proceedings of the 3rd International Conference, Com-IT-Con 2023, Volume 1

  • Conference proceedings
  • © 2024
  • K. C. Santosh 0 ,
  • Sandeep Kumar Sood 1 ,
  • Hari Mohan Pandey 2 ,
  • Charu Virmani 3

Department of Computer Science, University of South Dakota, Vermillion, USA

You can also search for this editor in PubMed   Google Scholar

Department of Computer Applications, National Institute of Technology Kuruks, Kurukshetra, India

School of science and technology, bournemouth university, poole, uk, manav rachna international institute of research and studies, faridabad, india.

  • Discusses applications of new, emerging techniques for practitioners and researchers
  • Highlights trends, perspectives, and prospects in cloud and parallel computing
  • Includes selected papers from the 3rd International Conference on COM-IT-CON 2023

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1191)

Included in the following conference series:

  • COMITCON: International Conference on Artificial-Business Analytics, Quantum and Machine Learning

Conference proceedings info: COMITCON 2023.

275 Accesses

This is a preview of subscription content, log in via an institution to check access.

Access this book

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Other ways to access

Licence this eBook for your library

Institutional subscriptions

About this book

This book presents select proceedings of the 3rd International Conference on “Artificial-Business Analytics, Quantum and Machine Learning: Trends, Perspectives, and Prospects” (Com-IT-Con 2023) held at the Manav Rachna University in July 2023. It covers topics such as artificial intelligence and business analytics, virtual/augmented reality, quantum information systems, cyber security, data science, and machine learning. The book is useful for researchers and professionals interested in the broad field of communication engineering.

  • Conference Proceedings
  • COMITCON Proceedings
  • Cloud Management and Operations
  • COMITCON 2023
  • Computational Biology & Bioinformatics
  • Data Science & Machine Learning
  • Scalability and Reliability
  • AI & Business Analytics
  • Virtual Reality & Augmented Reality
  • Classification and Clustering
  • Artificial Intelligence

Table of contents (60 papers)

Front matter, weed localization: comparison of different transfer learning models with u-net.

  • Neha Shekhawat, Seema Verma, F. H. Juwono, Wong Kitt Wei, Catur Apriono, I. Gde Dharma Nugraha

Blended Canopy with k-Means Clustering of States Based on Crime Cases Against Children

  • Suresh Babu Changalasetty, Lalitha Saroja Thota, Sreelasya Changalasetty, Yerraginnela Shravani, Ahmed Said Badawy, Wade Ghribi

A Hybrid Approach for Scalable Load Balancing Using Virtual Machine Migration and Dynamic Resource Allocation

  • Monika Yadav, Atul Mishra

Particle Swarm Optimization for Efficient Data Dissemination in VANETs

  • Arvind Kumar, Prashant Dixit, S. S. Tyagi

A Reactive Approach for High-Accuracy and Data-Driven Customer Behaviour Analysis and Prediction

  • Priyank Sirohi, Niraj Singhal, Syed Vilayat Ali Rizvi, Pradeep Kumar

A Brief Survey on Fabric Defect Detection

  • Rashi Singh, Vibha Pratap

CNN Based Real Time Detection of Words from Lip Movements and Automated into Text

  • Avipriya Bardhan, Ankit Singh, Shree Harsh Attri

Application of Artificial Intelligence for the Diagnosis of Dementia (Alzheimer): A Systematic Evaluation

  • Purushottam Kumar Pandey, Jyoti Pruthi, Surbhi Bhatia

Shortest Job First with Gateway-Based Resource Management Strategy for Fog Enabled Cloud Computing

  • Sunakshi Mehta, Supriya Raheja, Manoj Kumar

Software Vulnerability Analysis Based on Statistical Characteristics

  • Birendra Kumar Verma, Ajay Kumar Yadav

Hate Speech Detection on Twitter: A Comparative Evaluation of Different Machine Learning Techniques

  • Aryan Rastogi, Arjit Kumar, Daarshik Dwivedi, Abhishek Pratap Singh, Suruchi Saberwal, Mehboob Alam

Detection of Heart Disease Using Machine Learning

  • Supriya Raheja, Navya Ray

SVM-Based Framework for Breast Cancer Detection

  • Manik Jain, Sumit Das, Vidushi Gandhi, Monika Goyal, Stuti Saxena

Hand Gesture Recognition System Using Machine Learning

  • Milind Udbhav, Robin Kumar Attri, Prateek Garg, Meenu Vijarania, Swati Gupta, Akshat Aggarwal

Deep Learning Method for Plant Disease Recognition and Prediction

  • Khushi Singh, Aditya Prakhar, Deependra Rastogi

Lung Cancer Detection by Using CNN Architecture Models

  • Dattatray G. Takale, Parishit N. Mahalle, Sachin R. Sakhare, Piyush P. Gawali, Gopal Deshmukh, Vajid Khan et al.

Deep Fake Analyser: A Review Based on Detecting the Deepfakes

  • Santosh Kumar Srivastava, Durgesh Srivastava, Praveen Kantha, Manoj Kumar Mahto, Sheo Kumar, Hare Ram Singh

Predicting the Outcomes of La Liga Matches

  • Vineet Sharma, Jasleen Kaur, Gurjapna Kaur, Sumit Kumar, Vidhi Khanduja

Artifact Detection and Removal in EEG: A Review of Methods and Contemporary Usage

  • Vinod Prakash, Dharmender Kumar

Other volumes

Editors and affiliations.

K. C. Santosh

Sandeep Kumar Sood

Hari Mohan Pandey

Charu Virmani

About the editors

KC Santosh, a highly accomplished AI expert, is the chair of the Department of Computer Science at the University of South Dakota. He served the National Institutes of Health (NIH) as a research fellow. Before that, he worked as a postdoctoral research scientist at the LORIA research center, University de Lorraine in direct collaboration with industrial partner, ITESOFT, France. He earned his Ph.D. in Computer Science—Artificial Intelligence from the INRIA Nancy Grand Est research center (France). He has demonstrated expertise in artificial intelligence, machine learning, pattern recognition, and computer vision with various application domains such as healthcare informatics and medical imaging, document imaging, biometrics, forensics, speech/audio analysis, and the Internet of Things. He is highly motivated in academic leadership, and his contributions have established USD as a pioneer in AI programs within the state of SD.

Sandeep Kumar Sood received a Ph.D. degree in computer science and engineering from IIT Roorkee, Roorkee, India, in 2010. He is currently working as a head and an associate professor in the Department of Computer Applications, NIT Kurukshetra, Haryana, India. He has authored or co-authored more than 115 SCI/SCIE-indexed research publications. According to Google Scholar, he has an h-index of 34 and an i10-index of 96. His research interests include network and information security, fog computing, cloud computing, IoT, and big data analytics.

Hari Mohan Pandey works in data science and artificial intelligence at the School of Technology, Bournemouth University, UK. He specializes in Computer Science and Engineering. His research areas include artificial intelligence, soft computing techniques, natural language processing, language acquisition, machine learning, deep learning, and computer vision. He is the author of several books in computer science and engineering (algorithms, programming, and evolutionary algorithms). He has published over 150 scientific papers in reputed journals and conferences. He serves on the editorial boards of reputed journals as action editor, associate editor, and guest editor, and is a reviewer for top international conferences such as GECCO, CEC, IJCNN, BMVC, and AAAI. He has delivered expert talks as a keynote and invited speaker. He is a Fellow of the HEA under the UK Professional Standards Framework (UKPSF).

Charu Pujara has worked as Head of the Department of Computer Science and Engineering, School of Engineering, at Manav Rachna International Institute of Research and Studies, Faridabad, Haryana. She holds a B.E. in Information Technology, an M.Tech. in Computer Science and Engineering, and a doctoral degree in Computer Engineering, with expertise in cyber security and artificial intelligence. Her vision has played an important role in promoting technology, automation, and skill management.

Bibliographic Information

Book Title : Advances in Artificial-Business Analytics and Quantum Machine Learning

Book Subtitle : Select Proceedings of the 3rd International Conference, Com-IT-Con 2023, Volume 1

Editors : K. C. Santosh, Sandeep Kumar Sood, Hari Mohan Pandey, Charu Virmani

Series Title : Lecture Notes in Electrical Engineering

DOI : https://doi.org/10.1007/978-981-97-2508-3

Publisher : Springer Singapore

eBook Packages : Computer Science , Computer Science (R0)

Copyright Information : The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024

Softcover ISBN : 978-981-97-2507-6 Published: 19 September 2024

eBook ISBN : 978-981-97-2508-3 Published: 18 September 2024

Series ISSN : 1876-1100

Series E-ISSN : 1876-1119

Edition Number : 1

Number of Pages : XII, 826

Number of Illustrations : 69 b/w illustrations, 312 illustrations in colour

Topics : Systems and Data Security , Artificial Intelligence , Computer Applications
