Lecture
Metabolomics Supported with Machine Learning and Computational Advances for Early Glioma Diagnostics
- ICM Saal 2
- Type: Lecture
Lecture description
A. Godlewski, Bialystok/PL; K. Solowiej, Bialystok/PL; J. Chilimoniuk, Bialystok/PL; P. Mojsak, Bialystok/PL; J. Godzien, Bialystok/PL; M. Burdukiewicz, Bialystok/PL; T. Lyson, Bialystok/PL; M. Ciborowski, Bialystok/PL
Gliomas remain among the most challenging brain tumors to diagnose early because of their molecular and clinical heterogeneity. Slow-growing low-grade tumors in particular often remain asymptomatic or cause only nonspecific symptoms, and can be missed for long periods. In contrast, high-grade gliomas typically present with more acute, obvious neurological deficits that prompt rapid imaging and diagnosis. As a result, patients with high-grade glioma outnumber those with low-grade glioma in clinical cohorts. In this study, we present an integrated computational–metabolomic workflow that combines targeted and untargeted plasma metabolomics with advanced data processing and machine learning methods to support early glioma detection. In the first part of the study, targeted plasma metabolomic profiles from glioma, meningioma, and control patients were used to train supervised machine learning models (e.g., random forest (RF), support vector machine (SVM), or EvoHDTree) and to derive plasma metabolite panels that discriminate glioma (with emphasis on low-grade disease) from meningioma and healthy controls. Performance was evaluated using cross-validation, F1-score, and AUC. The results demonstrated that EvoHDTree can yield compact, interpretable diagnostic panels with F1-scores up to approximately 0.95 and AUC up to about 0.87 across multiple glioma and meningioma comparisons, while highlighting key amino acid and lipid pathways involved in tumor biology. In the next step, we addressed the problem of unbalanced groups, which can bias statistical tests and machine learning models, reduce the power to detect true differences, and distort normalization and effect sizes. Untargeted GC-MS and LC-MS metabolomics data were obtained for the same plasma samples and analyzed together with the targeted data described above. Oversampling and undersampling algorithms, such as SMOTE (synthetic minority oversampling technique) and RUS (random undersampling), were applied to these datasets, and their impact on univariate statistics, distributional properties, inter-platform correlations, and machine learning performance was assessed. The results showed that oversampling can recover most of the metabolites that are significant in the full datasets, but at the cost of more false positives and distorted distributions, whereas RUS better preserves distributions and correlations but may reduce power and accuracy, depending on sample size and dataset. Nevertheless, SMOTE may be beneficial in certain cases: we demonstrated that even a small number of samples can yield the same statistically significant metabolites as the full dataset. Although algorithms that address class imbalance can be applied in metabolomic studies, careful consideration of their use and critical evaluation of the results are essential.
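The two resampling strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy data, the group sizes, and the choice of three neighbours are all assumptions made for illustration, and the oversampler is a simplified SMOTE-style interpolation written with the Python standard library rather than a call to any particular package.

```python
import math
import random


def random_undersample(majority, n_keep, rng):
    """RUS: keep a random subset of majority-class samples, without replacement."""
    return rng.sample(majority, n_keep)


def smote_like_oversample(minority, n_new, k, rng):
    """SMOTE-style oversampling (simplified): each synthetic sample lies on the
    line segment between a minority sample and one of its k nearest minority
    neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


rng = random.Random(0)

# Hypothetical toy data: two "metabolite" features per sample.
# Group sizes (40 vs 8) are invented to mimic an imbalanced cohort.
high_grade = [[rng.gauss(1.0, 0.3), rng.gauss(2.0, 0.3)] for _ in range(40)]
low_grade = [[rng.gauss(0.0, 0.3), rng.gauss(0.5, 0.3)] for _ in range(8)]

# Option 1: shrink the majority class to the minority size (RUS).
undersampled_majority = random_undersample(high_grade, len(low_grade), rng)

# Option 2: grow the minority class to the majority size (SMOTE-style).
synthetic_minority = smote_like_oversample(
    low_grade, len(high_grade) - len(low_grade), k=3, rng=rng
)
oversampled_minority = low_grade + synthetic_minority

print(len(undersampled_majority), len(oversampled_minority))
```

In this sketch, undersampling returns only measured samples, whereas every synthetic sample is an interpolation between two measured ones and is therefore not an independent observation. That difference is one plausible reading of the trade-off the abstract reports between preserved distributions (RUS) and recovered significance with more false positives (oversampling).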
Acknowledgement: A.G. acknowledges funding from the National Science Centre, Poland (grant number: 2025/57/N/NZ5/03554).