Lecture
Molecular Encoding as a Tool to Enable Data Science-Assisted Analytical Measurements
- ICM Saal 5
- Type: Lecture
Lecture description
K.A. Schug; V.C.P. Chen; J. Rosenberger; C. Kan; Y. Yang; M. Ghasemloo; K.R. Saraswat; L. Ho Manh; J. Sood; N. Bhakta
Encoding molecules with molecular descriptors gives a quantitative description of each molecule in the form of a vector, which allows molecules to be incorporated in prediction and optimization algorithms, among other data treatments. Molecules can be encoded with a variety of structural, topological, and/or physicochemical descriptors. The chosen encoding should capture the attributes of the molecule that are important to the process under investigation. For example, we have encoded molecules with bond counts, atom counts, custom descriptors, and engineered descriptors to predict gas phase vacuum ultraviolet/ultraviolet (VUV/UV) absorption properties using machine learning (ML) [1,2]. Specific descriptors were engineered to capture properties, such as aromaticity and double bond conjugation, which are known to influence molecular absorption. ML-based spectral prediction was shown to outperform quantum chemical predictions based on time-dependent density functional theory [1], especially when interwavelength correlation was accounted for using principal components, an approach that also substantially improved model training time [2]. In another example, molecular encoding was used to evaluate molecular diversity and molecular similarity in the context of analytical extractions and separations [3,4]. Molecular encoding enabled the selection of a set of diverse pharmaceutical compounds for subsequent surrogate optimization of an on-line supercritical fluid extraction – supercritical fluid separation (SFE-SFC) system. This approach allowed evaluation of optimal settings across similar and diverse molecule types, as determined by the quantitative encoding relationships [3]. As part of this work, a new means of molecular encoding, called cumulative binarization, was devised; it provides greater granularity in distinguishing molecules encoded using a variety of molecular descriptors. Further, molecular descriptors devised from structural aspects were shown in some cases to correlate well with physicochemical descriptors, such as chromatographic retention factors and Abraham solvation parameters [4]. This work was supported by the National Science Foundation [5].
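As a rough illustration of the idea of encoding molecules as descriptor vectors and then selecting a diverse subset, the following minimal Python sketch may help. It is not the authors' method: the descriptor set, the hand-entered counts, the Euclidean distance, and the greedy max-min selection rule are all assumptions chosen for simplicity, and the names `DESCRIPTORS`, `MOLECULES`, `euclidean`, and `select_diverse` are hypothetical.

```python
from math import sqrt

# Hypothetical descriptor order (an assumption, not taken from the cited work).
# Aromatic bonds are represented by the ring count, not as double bonds.
DESCRIPTORS = ("n_C", "n_O", "n_N", "n_double_bonds", "n_aromatic_rings")

# Hand-entered illustrative counts; each molecule becomes one vector.
MOLECULES = {
    "benzene":  (6, 0, 0, 0, 1),
    "toluene":  (7, 0, 0, 0, 1),
    "ethanol":  (2, 1, 0, 0, 0),
    "acetone":  (3, 1, 0, 1, 0),
    "pyridine": (5, 0, 1, 0, 1),
}


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse(vectors, k):
    """Greedy max-min selection of k mutually dissimilar molecules.

    Seeds with the farthest-apart pair, then repeatedly adds the molecule
    whose nearest already-chosen neighbour is farthest away.
    """
    names = sorted(vectors)
    if k <= 0:
        return []
    if k >= len(names):
        return names
    pairs = ((a, b) for i, a in enumerate(names) for b in names[i + 1:])
    seed = max(pairs, key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]))
    chosen = list(seed)
    while len(chosen) < k:
        remaining = [n for n in names if n not in chosen]
        best = max(
            remaining,
            key=lambda n: min(euclidean(vectors[n], vectors[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen[:k]


if __name__ == "__main__":
    print(select_diverse(MOLECULES, 3))
```

In practice the descriptor values would typically be computed by a cheminformatics toolkit and scaled before distances are taken, since unscaled counts let large-valued descriptors dominate the distance.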
1. Ho Manh, L.; Chen, V.; Rosenberger, J.; Wang, S.; Yang, Y.; Schug, K.A. J. Chem. Inf. Model. 2024, 64, 5547-5556.
2. Ghasemloo, M.; Chen, V.; Rosenberger, J.; Schug, K.A. J. Chem. Inf. Model. 2026, 66, 299-309.
3. Bhakta, N.; Sood, J.; Yang, Y.; Black, D.; Rosenberger, J.; Chen, V.; Schug, K.A. J. Chromatogr. A 2026, 1767, 466624.
4. Yang, Y.; Chen, V.C.P.; Kan, C.; Rosenberger, J.; Saraswat, K.R.; Ghasemloo, M.; Schug, K.A. J. Chem. Inf. Model. 2026, 66, 371-386.
5. National Science Foundation grant CHE-2108767.