Breast Cancer Disease Prediction Using Ensemble Techniques

T. Chalapathi Rao *   Kshiramani Naik **
*-** Department of Information Technology, Veer Surendra Sai University of Technology, Burla, Odisha, India.

Abstract

Breast Cancer is a highly lethal cancer that predominantly affects women and is a leading cause of cancer death worldwide. Cancer is characterized by the uncontrolled division of abnormal cells and their invasion into surrounding tissues. Early detection is crucial, as Breast Cancer accounts for a significant percentage of cancer diagnoses and deaths among women. Accurate classification of malignant and benign tumors is necessary to avoid unnecessary tests. Researchers have developed numerous automated classification methods for Breast Cancer, with soft computing techniques being widely used due to their high classification performance. Machine learning algorithms, known for their ability to identify critical features from medical datasets, are also extensively used in Breast Cancer prediction. This study therefore employs Boosting algorithms to predict Breast Cancer accurately. Over the years, research efforts have contributed to a decline in the Breast Cancer mortality rate.

Keywords :

Introduction

Breast cancer is a type of cancer that originates from the breast tissue. Its symptoms may manifest as a lump in the breast, alteration in breast shape, skin dimpling, discharge from the nipple, nipple inversion, or a red or scaly patch of skin. Advanced stages of the disease may cause bone pain, swollen lymph nodes, breathing difficulty or jaundice (yellowing of the skin).

The risk factors associated with the development of breast cancer include obesity, physical inactivity, alcohol consumption, hormone replacement therapy during menopause, exposure to ionizing radiation, early onset of menstruation, having children at an older age or not at all, advanced age, previous history of breast cancer, and a family history of the disease. An inherited genetic predisposition accounts for approximately 5-10% of all cases.

With the rapid growth of digital technologies, healthcare centers are storing an enormous amount of complex data in their databases, which presents a challenge when it comes to analysis. In this regard, data mining techniques and machine learning algorithms have important roles in the analysis of medical data. These techniques and algorithms can be applied directly to datasets, enabling the creation of models and the derivation of valuable insights and inferences, including the detection of diseases such as breast cancer.

Disease Prediction using Ensemble Techniques

Ensemble techniques can be used for disease prediction by combining the predictions of multiple models to improve overall accuracy. They work by training multiple models on the same dataset and combining their predictions using methods such as averaging or voting. One popular ensemble technique is random forest, which is a collection of decision trees. Each decision tree is trained on a random subset of the training data, and the final prediction is made by aggregating the predictions of all the trees in the forest. Random forest has been successfully used for disease prediction in various fields, such as cancer diagnosis and heart disease prediction. Another ensemble technique is gradient boosting, which builds an ensemble of weak models, such as decision trees, and iteratively improves its performance by adding new models that focus on correcting the errors of the previous models. Gradient boosting has been used for disease prediction in areas such as diabetes, heart disease, and breast cancer.
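The two combination rules mentioned above, voting and averaging, can be illustrated with a minimal sketch. The function names, the class labels, the 0.5 threshold, and the three base-model outputs are hypothetical values chosen for illustration; they are not taken from this study.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the label predicted by the largest number of base models."""
    return Counter(predictions).most_common(1)[0][0]


def average_probability(probabilities, threshold=0.5):
    """Average the base models' malignant-class probabilities, then threshold."""
    avg = mean(probabilities)
    label = "malignant" if avg >= threshold else "benign"
    return label, avg


# Hypothetical outputs of three base models for a single tumor sample.
votes = ["malignant", "benign", "malignant"]
probs = [0.9, 0.4, 0.6]

voted_label = majority_vote(votes)
avg_label, avg_prob = average_probability(probs)
print(voted_label, avg_label, round(avg_prob, 3))
```

Voting uses only the hard labels, whereas averaging also takes each model's confidence into account, so the two rules can disagree on borderline samples.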

Ensemble techniques can also be used for feature selection, where a subset of the most relevant features is selected to improve the accuracy of the prediction. For example, in a study on Alzheimer's disease prediction, an ensemble of decision trees was used to identify the most important features for predicting Alzheimer's disease, such as age, gender, and education level. Overall, these techniques can be a powerful tool for disease prediction, providing higher accuracy and robustness compared to single models. However, careful selection of the models and parameters, as well as appropriate validation and testing, are critical for successful disease prediction using ensemble techniques.

1. Literature Review

Before examining breast cancer prediction in depth, this section establishes a baseline by reviewing the problem and the terms associated with it.

The FDA has approved Sacituzumab Govitecan for the treatment of triple-negative breast cancer that has spread to other parts of the body. Immunotherapy with Sacituzumab Govitecan may induce changes in the body's immune system and may interfere with the ability of tumor cells to grow and spread. Treatment with Tucatinib improved the survival of women in the HER2CLIMB trial, including some whose cancer had spread to the brain. Trastuzumab Deruxtecan improved the survival of patients and shrank many tumours in the DESTINY-Breast01 trial. Women with early-stage breast cancer and high recurrence scores on the Oncotype DX test who received chemotherapy with hormone therapy had better long-term outcomes. Typical clinical applications of image classification include skin disease identification in dermatology and eye disease recognition in ophthalmology, such as diabetic retinopathy (Iwama et al., 2022).

Islam et al. (2020a) discuss the potential of wearable technology in responding to COVID-19 through monitoring, symptom prediction, and contact tracing, with a comparative analysis of devices and future trends. SR and Rajaguru (2021) review the algorithms and techniques used for the detection and classification of breast tumors, focusing on machine learning and imaging modalities, and discuss current trends and challenges for future research in breast cancer diagnosis. Bayrak et al. (2019) compare the classification performance of two machine learning techniques, including Support Vector Machine, on the Wisconsin Breast Cancer dataset using accuracy, precision, recall, and ROC Area metrics, and find that Support Vector Machine has the highest accuracy. Li and Chen (2018) compare five classification models (Decision Tree, Random Forest, Support Vector Machine, Neural Network, and Logistic Regression) for predicting breast cancer outcomes using two datasets. They found that Random Forest outperformed the other models in accuracy, F-measure, and AUC values, and consider the model useful for practical applications. Islam et al. (2020b) compare five supervised machine learning techniques for the early detection of breast cancer using the Wisconsin Breast Cancer dataset. Artificial neural networks achieved the highest accuracy, precision, and F1 score, while the Support Vector Machine had lower scores.

Müller and Kramer (2021) used deep learning features together with classical features, including GIST and bag-of-words, for detecting pathology in chest X-rays. The results show AUCs of 0.78-0.95 for various pathologies, highlighting the strength and robustness of the CNN features. The study suggests that deep learning with non-medical image databases can be a good substitute for domain-specific representations in medical image recognition tasks.

Unless the Intersection over Union (IoU) threshold is set too high, anchor-based methods typically offer greater accuracy compared to region-based methods. This implies that anchor-based methods are generally more reliable in terms of precisely predicting object boundaries and locations. However, if the IoU threshold is set excessively high, region-based methods can outperform anchor-based methods in terms of accuracy. It is important to select the appropriate method based on the specific needs of the application, and the desired tradeoff between accuracy and computational efficiency (Voulodimos et al., 2018).
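Since the comparison above depends on the IoU threshold, a short sketch of how IoU is computed for two axis-aligned bounding boxes may help. The `(x1, y1, x2, y2)` box format, the function name, the example boxes, and the 0.5 threshold are assumptions made for illustration only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes.
predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)

score = iou(predicted, ground_truth)
is_match = score >= 0.5  # assumed threshold; a detection counts only above it
print(round(score, 3), is_match)
```

Raising the threshold makes the matching criterion stricter, so the same set of predictions yields fewer accepted detections.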

Müller and Kramer (2021) present MIScnn, an open-source Python framework for medical image segmentation. The framework supports the training, prediction, and evaluation of deep learning models, and has been applied to kidney tumor segmentation. The library is freely available in a Git repository. Pymia is a Python-based open-source package designed for data handling and evaluation in deep learning-based medical image processing. The package provides flexible data handling, allowing 2D, 3D, full, or patch-wise data processing, independent of the deep learning framework being used. Advancements in deep learning have resulted in neural network algorithms that surpass human performance in image classification and segmentation. As a result, computer scientists and neuro-oncologists are working closely together to gain a better understanding of these advanced deep learning techniques. The building blocks of artificial neural networks, such as convolutional neural networks, are being studied along with imaging features such as Genotype. Medical image segmentation tools have been developed to aid clinical diagnosis and computer-assisted surgery. Compared with classical segmentation methods, the accuracy of these tools is greatly improved by the use of deep learning algorithms. While these methods are complex and highly variable, efforts are being made to ensure the reproducibility of results.

This literature review focuses on the variability and reproducibility of results obtained using deep learning techniques for medical images. Specifically, the study investigates the framework of deep learning, sources of variability, and methods of evaluating segmentation results.

2. Proposed System

AdaBoost, Gradient Boost, and XGBoost are powerful techniques for building ensemble models in machine learning.

Overall, these boosting techniques have proven to be very effective in improving the accuracy and robustness of machine learning models, and they are widely used in many real-world applications. The proposed system uses these boosting algorithms, together with supporting tools, to detect the disease from a patient's data. By comparing the patient's data against the collected dataset, the system predicts the likelihood of Breast Cancer for that patient. The dataset is passed to the classification model of the system, where the data is first pre-processed for later use and feature selection is then performed.

The data is then classified using the AdaBoost, Gradient Boost, and XGBoost algorithms.
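To make the boosting idea concrete, the following is a minimal from-scratch sketch of discrete AdaBoost with decision stumps. It is not the implementation used in this work and does not reproduce Gradient Boost or XGBoost; the function names, the +1/-1 label encoding, and the one-feature toy data are assumptions chosen only to show how later weak learners concentrate on previously misclassified samples.

```python
import math


def fit_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] > thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    """Apply a fitted stump (err, feature index, threshold, sign) to one sample."""
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign


def adaboost_fit(X, y, rounds=3):
    """Train `rounds` stumps, reweighting the samples after each round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # weight of this stump
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Hypothetical toy data: one feature, labels +1 / -1 (not from the study's dataset).
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_fit(X, y, rounds=3)
train_preds = [adaboost_predict(model, row) for row in X]
train_accuracy = sum(p == t for p, t in zip(train_preds, y)) / len(y)
print(train_accuracy)
```

On this toy data no single stump separates the classes, but the weighted combination of three stumps classifies every training point correctly, which is the behavior boosting relies on.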

3. Results and Discussions

A confusion matrix is a performance evaluation tool that shows the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for a binary classification problem. It visualizes and summarizes the performance of a classification algorithm. The XGBoost algorithm can be evaluated using a confusion matrix, which is presented in Table 1.

Table 1. Performance Evaluation of Confusion Matrix for the XGBoost Algorithm

Here is an example of a confusion matrix for the XGBoost algorithm:

The accuracy of the XGBoost algorithm can be computed using the values in the confusion matrix as follows,

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where,
TP - (True Positive) indicates the number of positive samples correctly predicted as positive.
TN - (True Negative) indicates the number of negative samples correctly predicted as negative.
FP - (False Positive) indicates the number of negative samples incorrectly predicted as positive.
FN - (False Negative) indicates the number of positive samples incorrectly predicted as negative.
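The accuracy formula above can be checked with a short sketch that counts the four confusion-matrix cells from true and predicted labels. The function names and the ten labels below are made up for illustration (1 = malignant, 0 = benign); they are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p != positive and t != positive:
            tn += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative labels only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)
```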

The recommendation model receives the dataset and performs risk analysis, providing probability estimates based on various classification scenarios. It also determines which dataset to use and which to avoid. The model evaluates the results to provide an overall structured form of data for Breast Cancer detection. The proposed approach identifies Breast Cancer at early stages, providing decision support for doctors when making a clinical diagnosis. This model could become a valuable tool for detecting Breast Cancer and could assist doctors in confirming a diagnosis. Figures 1 and 2 show the user details page and user dataset.

Figure 1. User Details Page

Figure 2. User Dataset

The accuracy of the XGBoost algorithm is the ratio of the number of correctly predicted samples to the total number of samples in the dataset. Figures 3 and 4 show the confusion matrix for the XGBoost algorithm and the classification result on the basis of accuracy, and Figure 5 shows the comparative analysis of the algorithms based on accuracy.

Figure 3. Confusion Matrix of XG Boost Algorithm

Figure 4. Classification Result on Basis of Accuracy

Figure 5. Comparative Analysis of Algorithms Based on Accuracy

The accuracies of the adopted algorithms are listed in Table 2.

Table 2. Accuracy of Various Algorithms

Conclusion

Breast Cancer is a serious health concern that affects many women worldwide. Early diagnosis is crucial in improving patient outcomes, and automated classification methods using machine learning have shown promise in predicting the presence of Breast Cancer. Soft computing techniques have been widely used for cancer diagnosis, and Boosting algorithms such as eXtreme Gradient Boosting (XGBoost) have demonstrated high accuracy in predicting Breast Cancer. Overall, the results of this work suggest that machine learning algorithms can be useful tools in the early detection and prediction of Breast Cancer, and could potentially aid healthcare professionals in making more informed decisions about patient care.

Future work in this area could involve exploring the use of Deep Learning network models, such as Feed Forward Neural Networks, Recurrent Neural Networks, and Convolutional Neural Networks, to potentially improve the accuracy of Breast Cancer prediction using the same dataset. Additionally, further research could evaluate the performance of other machine learning algorithms for Breast Cancer prediction and compare their results with those obtained using the XGBoost algorithm.

References

[1]. Bayrak, E. A., Kırcı, P., & Ensari, T. (2019, April). Comparison of machine learning methods for breast cancer diagnosis. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT) (pp. 1-3). IEEE. https://doi.org/10.1109/EBBT.2019.8741990
[2]. Islam, M. M., Haque, M. R., Iqbal, H., Hasan, M. M., Hasan, M., & Kabir, M. N. (2020b). Breast cancer prediction: a comparative study using machine learning techniques. SN Computer Science, 1, 1-14. https://doi.org/10.1007/s42979-020-00305-w
[3]. Islam, M. M., Mahmud, S., Muhammad, L. J., Islam, M. R., Nooruddin, S., & Ayon, S. I. (2020a). Wearable technology to assist the patients infected with novel coronavirus (COVID-19). SN Computer Science, 1, 1-9. https://doi.org/10.1007/s42979-020-00335-4
[4]. Iwama, E., Zenke, Y., Sugawara, S., Daga, H., Morise, M., Yanagitani, N., ... & Okamoto, I. (2022). Trastuzumab emtansine for patients with non-small cell lung cancer positive for human epidermal growth factor receptor 2 exon-20 insertion mutations. European Journal of Cancer, 162, 99-106. https://doi.org/10.1016/j.ejca.2021.11.021
[5]. Li, Y., & Chen, Z. (2018). Performance evaluation of machine learning methods for breast cancer prediction. Applied and Computational Mathematics, 7(4), 212-216. https://doi.org/10.11648/j.acm.20180704.15
[6]. Müller, D., & Kramer, F. (2021). MIScnn: a framework for medical image segmentation with convolutional neural networks and deep learning. BMC Medical Imaging, 21(1), 1-11. https://doi.org/10.1186/s12880-020-00543-7
[7]. SR, S. C., & Rajaguru, H. (2021). A systematic review on screening, examining and classification of breast cancer. 2021 Smart Technologies, Communication and Robotics (STCR), 1-4. https://doi.org/10.1109/STCR51658.2021.9588828
[8]. Voulodimos, A., Doulamis, N., Doulamis, A., & Protopapadakis, E. (2018). Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018. https://doi.org/10.1155/2018/7068349