The 3rd Forum of Materials Genome Engineering

4-16. Procedure for building a standard specific dataset based on materials genome database

Hong Wang¹, Haiqing Yin² and Lanting Zhang¹

1. Materials Genome Initiative Center and the School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240

2. Materials Design Group, Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing, Beijing, 100083

Abstract: The recently released CSTM standard “General Rules for Materials Genome Engineering Data” stipulates the information that must be collected and the content that must be included during data production, to meet the requirements of data-driven materials science research. According to the “General Rules”, materials genome engineering (MGE) data are divided into three classes: sample information, source data (unprocessed data) and processed data (data obtained by analysis and processing of existing data). Each action (sample preparation, characterization, calculation or data processing) is defined as a stand-alone entry unit and assigned an independent resource identifier (a DOI, or an identifier under the Chinese national standard GB/T 32843-2016). Each data entry should include metadata on the corresponding action that are as complete as possible. By deliberately breaking the direct link between a material and its parameters, the “General Rules” are designed to give maximum flexibility in using and combining data, so that the use and collection of data conform to the FAIR principles (Findable, Accessible, Interoperable and Reusable) and data sharing is promoted. Listing sample information as a class of data is a unique choice. Its greatest advantage is that the samples themselves become a social resource conforming to the FAIR principles, so that samples, like data, can be found, shared and reused.

Keywords: Intelligent design of materials; Software; Database; Industrial product design
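The entry structure described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the standard: the field names (`identifier`, `entry_class`, `action`, `metadata`, `derived_from`), the `example:` identifiers and the metadata values are all hypothetical, chosen only to show how three classes of stand-alone entries might be linked so that a processed result can be traced back to its sample.

```python
from dataclasses import dataclass
from enum import Enum


class EntryClass(Enum):
    # The three data classes named in the abstract.
    SAMPLE = "sample information"
    SOURCE = "source data"
    PROCESSED = "processed data"


@dataclass
class Entry:
    # Placeholder string; a real entry would carry a DOI or a
    # GB/T 32843-2016 identifier.
    identifier: str
    entry_class: EntryClass
    action: str              # the action that produced this entry
    metadata: dict           # record of how the action was carried out
    derived_from: tuple = () # hypothetical link field: identifiers of parent entries


def find_samples(identifier, registry):
    """Walk the derived_from links back to the sample entries."""
    entry = registry[identifier]
    if entry.entry_class is EntryClass.SAMPLE:
        return {entry.identifier}
    samples = set()
    for parent in entry.derived_from:
        samples |= find_samples(parent, registry)
    return samples


# Made-up entries for illustration only.
entries = [
    Entry("example:sample-001", EntryClass.SAMPLE, "sample preparation",
          {"composition": "Fe-0.2C", "method": "arc melting"}),
    Entry("example:source-001", EntryClass.SOURCE, "characterization",
          {"technique": "XRD"}, ("example:sample-001",)),
    Entry("example:processed-001", EntryClass.PROCESSED, "data processing",
          {"analysis": "phase identification"}, ("example:source-001",)),
]
registry = {e.identifier: e for e in entries}

print(find_samples("example:processed-001", registry))
```

Because no entry stores a fixed material-parameter pair, the same sample entry could be referenced by any number of later characterization or processing entries.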

A standardized method for building specific datasets from materials genome engineering data

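Hong Wang¹, Haiqing Yin², Lanting Zhang¹

1. Materials Genome Initiative Center, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240

2. Materials Design Group, Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing, Beijing, 100083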

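Abstract: The recently released General Rules for Materials Genome Engineering Data (T/CSTM 00120-2019), a standard issued by the Chinese association-standards committee for materials and testing (CSTM), stipulates the information that must be collected and the content that must be included during data production, based on the data needs of materials science in the data-driven paradigm. It divides data into three classes: sample information, source data (unprocessed data) and derived data (data obtained by analysis and processing). Each data entry is assigned a unique and permanent scientific resource identifier (a DOI, or one under the national standard GB/T 32843-2016) and carries, as metadata, a thorough record of the corresponding sample preparation, characterization or data processing event. The “General Rules” are intended to break the material-parameter association inherent in today's data and thereby provide maximum flexibility in using and combining data, so that the use and collection of data conform to the FAIR principles (Findable, Accessible, Interoperable, Reusable) and the sharing of data across society is promoted. Listing samples as a separate class of data is something no earlier data standard has done. Its greatest advantage is that the data are not the only thing to satisfy the FAIR principles: the samples themselves also become a social resource conforming to FAIR, so that they can be shared, used for multiple purposes and reused.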

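Existing materials data present a one-to-one relationship between material and parameter. Databases are organized around a specific class of material or a specific application area and gather the related material parameters together. Datasets formed this way establish, directly or in part, the relationships among composition, structure, processing and properties. This form of database matches current usage habits, addresses what users focus on in applications, and is well liked by a broad user base.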

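Over time, more and more data conforming to the MGE General Rules will be generated and collected, becoming an important part of society's materials data resources. Because such data do not directly present the form users are accustomed to, a dataset comparable to an existing materials database can be assembled only by searching on the relevant keywords with a filtering logic chosen for the need at hand. It is therefore necessary to establish a standardized method or procedure, together with the associated keyword groups, so that the important information contained in these data can be assembled efficiently and automatically, in a standardized form, into datasets that meet the need. Such a standardized method may be fixed in the form of application programs (apps), serving as a bridge between MGE data and existing materials databases.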

Keywords: Materials genome engineering; Standard; Database; Keyword group; Specific dataset
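As a rough illustration of the idea of applying a keyword group with a fixed filtering logic, the sketch below assembles a material-centric table from independent entries. It is not the procedure proposed by the authors: the entry layout, the `all_of`/`any_of` form of the keyword group, the function name `build_dataset` and every value shown are assumptions made up for this example.

```python
from collections import defaultdict

# Made-up, flattened entries for illustration only; each entry is
# independent and points to its sample through a placeholder identifier.
entries = [
    {"id": "example:sample-001", "sample": "example:sample-001",
     "keywords": {"steel", "arc melting"},
     "values": {"composition": "Fe-0.2C"}},
    {"id": "example:source-001", "sample": "example:sample-001",
     "keywords": {"steel", "hardness"},
     "values": {"hardness_HV": 180}},
    {"id": "example:source-002", "sample": "example:sample-002",
     "keywords": {"ceramic", "hardness"},
     "values": {"hardness_HV": 1200}},
]


def build_dataset(entries, keyword_group):
    """Select entries matching the keyword group and merge them per sample.

    keyword_group is a hypothetical structure with two sets:
    "all_of" (every keyword must be present) and
    "any_of" (at least one must be present; an empty set disables the check).
    """
    rows = defaultdict(dict)
    for entry in entries:
        keywords = entry["keywords"]
        if not keyword_group["all_of"] <= keywords:
            continue
        if keyword_group["any_of"] and not (keyword_group["any_of"] & keywords):
            continue
        # Merge the entry's values into the row of its sample,
        # rebuilding the familiar material-parameter view.
        rows[entry["sample"]].update(entry["values"])
    return dict(rows)


steel_group = {"all_of": {"steel"}, "any_of": {"arc melting", "hardness"}}
dataset = build_dataset(entries, steel_group)
print(dataset)
```

Fixing the keyword group and the merge rule in code is one way such a procedure could be packaged as an app, so that the same query yields the same dataset each time it is run.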

Brief Introduction of Speaker

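Hong Wang

Director of the Materials Genome Initiative Center at Shanghai Jiao Tong University, “Zhiyuan” Chair Professor, and chair of the Materials Genome Engineering field committee of the Chinese materials testing standards committee (CSTM). Has served as an expert on key materials genome consulting projects of the Chinese Academy of Engineering and the Chinese Academy of Sciences. Current research focuses on the theory of materials genome engineering, data standards, high-throughput materials preparation and characterization techniques, and applications of machine learning in materials.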

Email: hongwang2@sjtu.edu.cn