Document Type

Article

Publication Date

10-25-2021

Publication Title

Computer Methods and Programs in Biomedicine Update

Volume

1

First page number

1

Last page number

8

Abstract

Background: Logistic regression is a classification model in machine learning that is used extensively in clinical analysis. It uses probabilistic estimates, which help in understanding the relationship between the dependent variable and one or more independent variables. Diabetes is one of the most common diseases worldwide, and early detection may prevent its progression and avoid other complications. In this work, we design a model that predicts whether a patient has diabetes based on certain diagnostic measurements included in the dataset, and we explore various techniques to boost its performance and accuracy.

Methods: Logistic regression is the main algorithm used in this paper, and the analysis is carried out in a Python IDE. The experiment uses two datasets: the PIMA Indians Diabetes dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and a dataset from Vanderbilt, based on a study of rural African Americans in Virginia. Feature selection is carried out using two different methods. Ensemble methods are then used to improve performance by producing better predictions than a single model.

Results: The accuracy and runtimes are captured for the original datasets and for the versions obtained after applying feature selection and ensemble techniques. A comparison is shown in each case. The highest accuracy obtained was around 78% for Dataset 1, after employing the Max Voting ensemble technique, and around 93% for Dataset 2, after using the Max Voting and Stacking ensemble techniques.

Conclusion: Logistic regression has been shown to be an efficient algorithm for building prediction models. This study also shows that, apart from the choice of algorithm, other factors can improve the accuracy and runtime of the model, such as data preprocessing, removal of redundant and null values, normalization, cross-validation, feature selection, and the use of ensemble techniques.
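
The Max Voting ensemble named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `max_vote` and the three prediction lists are hypothetical, and the labels (1 = diabetic, 0 = not diabetic) are made-up values rather than results from either dataset. The sketch only shows the general idea of taking the most common label across several classifiers for each patient.

```python
from collections import Counter


def max_vote(predictions_per_model):
    """Combine class labels from several models by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    Returns one label per sample: the label predicted by the most models.
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    # zip(*...) groups the votes that every model cast for the same sample
    for votes in zip(*predictions_per_model):
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined


# Hypothetical predictions from three classifiers (illustrative values only)
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0]
model_c = [0, 0, 1, 1, 1]

ensemble = max_vote([model_a, model_b, model_c])
print(ensemble)
```

With an odd number of binary classifiers, as here, no ties can occur. A real pipeline would obtain each list from a fitted model and would need an explicit tie-breaking rule for other configurations.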

Keywords

Diabetes; ensemble methods; feature selection; logistic regression; prediction model

Disciplines

Computational Biology | Diseases | Nutritional and Metabolic Diseases

File Format

PDF

File Size

3200 KB

Language

English

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
