Deep learning strategy for limited and imbalanced data
Xuemei Pu*, Yuanyuan Jiang, Jiali Guo, Songran Yang
College of Chemistry, Sichuan University, Chengdu 610064
EXTENDED ABSTRACT: Artificial intelligence, and deep learning in particular, has shown great promise and application potential in many fields, owing to its strong capacity to learn from complex data without manual feature engineering. However, this advantage depends on big data. In the real world, data are often limited and imbalanced, which is considered a difficult setting for deep learning. Cocrystals play an important role in various fields, yet choosing a suitable coformer remains an experimental challenge. Furthermore, the data available in the cocrystal field are limited, in particular for negative samples. Developing a highly efficient and universal computational strategy for cocrystal screening therefore remains a challenge. Motivated by this challenge, we develop a novel graph neural network (GNN) based learning framework to rapidly predict cocrystal formation. A large and reliable data set containing 7871 samples is first constructed. A complementary feature representation is then proposed that combines the molecular graph with molecular descriptors derived from prior knowledge. A new GNN architecture is explored to effectively embed this prior knowledge into “end-to-end” learning on the molecular graph, in which a multi-head attention mechanism is introduced to further optimize the feature space. As a result, our model achieves 98.86% accuracy, greatly surpassing several traditional machine learning models and classic GNN models. Furthermore, out-of-distribution prediction on energetic cocrystals reaches 97.11% accuracy, showing strong generalization. The strategy proposed in this work can provide useful guidelines for applying deep learning to limited and imbalanced data. All data and source code are available at https://github.com/Saoge123/ccgnet.
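To illustrate how a molecular graph and prior-knowledge descriptors might be combined, the following is a minimal NumPy sketch of a forward pass. It is not the CCGNet implementation from the repository above. It assumes one mean-aggregation message-passing step, an attention readout with two heads, and simple concatenation of the pooled graph vector with a descriptor vector. All function names, layer sizes, the block-diagonal pairing of the two molecules, and the random weights are hypothetical choices made only for this example.

```python
import numpy as np

def graph_conv(adj, feats, weight):
    """One message-passing step: mean over neighbours (plus self-loop), then ReLU."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)      # node degrees for mean aggregation
    return np.maximum((a_hat / deg) @ feats @ weight, 0.0)

def attention_readout(node_h, head_weights):
    """Multi-head attention pooling: each head scores every node, the scores are
    softmax-normalised over nodes, and each head returns a weighted node sum."""
    scores = node_h @ head_weights              # (n_nodes, n_heads)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=0, keepdims=True)     # softmax over nodes
    return (alpha.T @ node_h).reshape(-1)       # (n_heads * hidden,)

def predict_cocrystal(adj, atom_feats, descriptors, params):
    """Fuse the learned graph embedding with descriptors; return a probability."""
    h = graph_conv(adj, atom_feats, params["w_conv"])
    graph_vec = attention_readout(h, params["w_att"])
    fused = np.concatenate([graph_vec, descriptors])     # complementary features
    logit = fused @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid

# Toy pair: two 2-atom molecules placed in one block-diagonal adjacency matrix
# (an assumed way to represent a candidate pair as a single graph).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
atom_feats = rng.normal(size=(4, 5))            # 5 hypothetical atom features
descriptors = rng.normal(size=3)                # 3 hypothetical pair descriptors
params = {
    "w_conv": rng.normal(scale=0.1, size=(5, 8)),
    "w_att": rng.normal(scale=0.1, size=(8, 2)),         # 2 attention heads
    "w_out": rng.normal(scale=0.1, size=(2 * 8 + 3,)),
    "b_out": 0.0,
}
prob = predict_cocrystal(adj, atom_feats, descriptors, params)
```

With untrained random weights the returned probability carries no meaning. The sketch only shows where descriptor information could enter the model alongside the graph embedding, and how attention weights over atoms could shape the pooled representation.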
Dr. Xuemei Pu is a professor at the College of Chemistry, Sichuan University, and a member of the Computational Chemistry Professional Committee of the Chinese Chemical Society. In recent years, she has carried out a series of research projects on functional materials and in biomedicine, supported by multiple grants from the National Natural Science Foundation of China. She has co-authored more than 100 SCI-indexed papers and applied for 10 patents (4 granted) and four computer software copyrights.