Yukari Katsura1,2), M. Kumagai3,4), M. Kaneshige5), Y. Ando2), S. Gunji1,2), Y. Imai2,3),
R. Sato1,2), T. Kodani1,2), R. Ni1), K. Kimura1), K. Tsuda1,2,3)
1. Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan; 2. Research and Services Division of Materials Data and Integrated System (MaDIS), National Institute for Materials Science (NIMS), Ibaraki, Japan;
3. Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan; 4. SAKURA Internet Inc., Osaka, Japan;
5. X-Ability Co., Ltd., Tokyo, Japan
Abstract: The development of materials informatics has been supported by large-scale (10⁴-10⁶) DFT calculation datasets. However, for predicting properties that cannot be calculated, such as the thermoelectric properties of real samples, experimental datasets are preferred. The problem with experimental data is that most published experimental data are not in digital form, and the available datasets are small (10¹-10²). Although materials science tells us that the properties of a compound depend strongly on composition, fabrication method and microstructure, materials informatics has used only one or a few samples to represent the whole range of experimental properties of a compound.
In materials science papers, most experimental data are provided as plot images. A typical plot image contains both the good samples and other samples for comparison. The original experimental data can be extracted by tracing the plots. Such scientific data can be shared freely, without violating the copyrights of the publishers.
Recently, we have developed an open web system named Starrydata (https://www.starrydata2.org) to encourage researchers to extract and share data from published plot images. The system is designed so that no copyrighted materials, such as full-text PDF files and figure images, are shared. We share only the links to the original papers, the numerical data and their descriptions, and the bibliographic information (titles, authors, journal names, etc.).
Starrydata allows users worldwide to browse, add, edit and download the data, free of charge. Starrydata is designed to host a number of databases in different fields of materials science. The 'ThermoelectricMaterials' database, for experimental thermoelectric properties, is our first database in Starrydata. Within each database, users can create and browse their own lists of favorite papers, as in an online reference manager. A user can add papers to a paper list either by searching within the default list named 'All papers', or by providing a plain-text list of DOIs (Digital Object Identifiers). The latter method is especially useful when the user wants to introduce new papers to Starrydata, or to share a paper list with somebody else. Once a paper is recognized, its bibliographic information is displayed automatically. Each paper is assigned a 'Starrydata ID' (SID / paper ID) and a data collection subpage. If the user has access to the original paper, they can open the full text, take a screenshot of a target figure, and paste the figure image into the WebPlotDigitizer tool embedded in Starrydata. The original data can then be extracted by semi-automatic color detection or by manual clicking. Finally, the user copies the extracted data and pastes them into Starrydata together with a figure name, a sample name, a sample composition, the physical properties, and the units of the x and y axes. When the Save button is pressed, Starrydata automatically converts the units and records the data in the database.
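The unit conversion performed on saving can be pictured with a minimal sketch. The abstract does not describe how Starrydata implements this step, so the unit labels, the `TO_SI` table and the function names below are all hypothetical; the sketch only shows one simple way to map digitized (x, y) points to SI units.

```python
# Hypothetical conversion table (not Starrydata's actual one):
# unit label -> (scale, offset) such that value_SI = value * scale + offset.
TO_SI = {
    "K": (1.0, 0.0),            # temperature, already SI
    "degC": (1.0, 273.15),      # Celsius -> kelvin
    "V/K": (1.0, 0.0),          # Seebeck coefficient, already SI
    "mV/K": (1e-3, 0.0),
    "uV/K": (1e-6, 0.0),
    "S/m": (1.0, 0.0),          # electrical conductivity, already SI
    "S/cm": (1e2, 0.0),
    "W/(m K)": (1.0, 0.0),      # thermal conductivity, already SI
    "mW/(cm K)": (1e-1, 0.0),
}


def to_si(value, unit):
    """Convert one number from the given unit label to its SI unit."""
    scale, offset = TO_SI[unit]
    return value * scale + offset


def convert_curve(points, x_unit, y_unit):
    """Convert a digitized curve, given as (x, y) pairs, to SI units."""
    return [(to_si(x, x_unit), to_si(y, y_unit)) for x, y in points]
```

For example, `convert_curve([(27.0, 150.0)], "degC", "uV/K")` maps a point read off a plot in °C and μV/K to kelvin and V/K. A lookup table keeps the conversion auditable, at the cost of rejecting any unit label that has not been registered.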
By using Starrydata, we collected the temperature (T) dependences of the Seebeck coefficient S, electrical conductivity σ, electrical resistivity ρ, total thermal conductivity κ, power factor P = S²σ and dimensionless figure of merit ZT = PT/κ = S²σT/κ, from 3,900+ papers. The collected data include 70,000+ curves from 20,000+ physical samples of thermoelectric materials, presented in 18,000+ figure images. During the presentation, we demonstrate example analyses of this dataset, including Jonker plots, direct prediction of thermoelectric properties by machine learning, and combination with first-principles calculations to estimate the electron relaxation time of each sample.
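The relations between the collected quantities can be written out as a short sketch. The function names are ours, and the numerical values in the usage note are illustrative only, not taken from the dataset; all inputs are assumed to be in SI units (S in V/K, σ in S/m, κ in W/(m K), T in K).

```python
def conductivity_from_resistivity(resistivity):
    """sigma = 1 / rho, with rho in ohm m and sigma in S/m."""
    return 1.0 / resistivity


def power_factor(seebeck, conductivity):
    """P = S^2 * sigma, in W/(m K^2)."""
    return seebeck ** 2 * conductivity


def figure_of_merit(seebeck, conductivity, thermal_conductivity, temperature):
    """Dimensionless ZT = P * T / kappa = S^2 * sigma * T / kappa."""
    return power_factor(seebeck, conductivity) * temperature / thermal_conductivity
```

With made-up inputs S = 200 μV/K, σ = 1 × 10⁵ S/m, κ = 1.5 W/(m K) and T = 300 K, `power_factor(2e-4, 1e5)` gives about 4 × 10⁻³ W/(m K²) and `figure_of_merit(2e-4, 1e5, 1.5, 300.0)` gives about 0.8.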
The data on Starrydata can be downloaded collectively as data files by pressing the 'Get raw data' or 'Get interpolated data' button of a paper list. We provide two types of data files: a single two-dimensional table, and a set of linked tables like those of a relational database. An Application Programming Interface (API) is also provided to search for samples by their constituent elements, and to retrieve data automatically.
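To illustrate what an interpolated data file might contain, the sketch below resamples one digitized curve onto a common temperature grid. The abstract does not state which interpolation scheme Starrydata uses; piecewise-linear interpolation without extrapolation is assumed here, and the function name is hypothetical.

```python
import bisect


def interpolate_curve(points, grid):
    """Linearly interpolate (T, value) pairs onto the temperatures in `grid`.

    Grid temperatures outside the measured range yield None, since this
    sketch deliberately does not extrapolate.
    """
    pts = sorted(points)
    if not pts:
        return [None] * len(grid)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    result = []
    for t in grid:
        if t < xs[0] or t > xs[-1]:
            result.append(None)
            continue
        i = bisect.bisect_left(xs, t)  # first index with xs[i] >= t
        if xs[i] == t:
            result.append(ys[i])       # exact hit on a measured point
        else:
            x0, y0, x1, y1 = xs[i - 1], ys[i - 1], xs[i], ys[i]
            result.append(y0 + (y1 - y0) * (t - x0) / (x1 - x0))
    return result
```

Resampling every curve onto the same grid is what makes it possible to combine S, σ and κ curves of one sample that were digitized at different temperatures.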
As an application of this dataset, we demonstrated direct prediction of thermoelectric properties from compositions, by machine learning. About 7,000 compositions of semiconductors on the Materials Project4 were sorted according to their predicted values of thermoelectric figure of merit ZT. We also demonstrated optimization of complex sample compositions by sampling about 10,000 arbitrary compositions, to improve the predicted values of ZT.
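The ranking step described above can be sketched as follows. The abstract does not specify the descriptors or the machine-learning model, so this is only an assumed outline: compositions are turned into atomic fractions with a simple parser (no parentheses or hydrates), and `predict` stands for any trained model supplied by the caller. Both function names are hypothetical.

```python
import re

# Element symbol followed by an optional amount, e.g. "Bi0.5" or "Te".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def element_fractions(formula):
    """Parse a flat formula such as 'Bi0.5Sb1.5Te3' into atomic fractions."""
    counts = {}
    for element, amount in _TOKEN.findall(formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def rank_by_predicted_zt(formulas, predict):
    """Sort formulas by predict(fractions), highest predicted ZT first.

    `predict` is a caller-supplied function mapping an element-fraction
    dictionary to a predicted ZT value; no model is included here.
    """
    return sorted(formulas, key=lambda f: predict(element_fractions(f)), reverse=True)
```

Any regressor trained on the collected curves could be wrapped as `predict`; the sorting itself is independent of the model chosen.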
Yukari Katsura, PhD, is an Assistant Professor in the Department of Advanced Materials Science, the Graduate School of Frontier Sciences, the University of Tokyo. She is also a Special Researcher in the Center for Materials Research by Information Integration (CMI2), Research and Services Division of Materials Data and Integrated System (MaDIS), National Institute for Materials Science (NIMS), Japan, and a Guest Researcher in the RIKEN Center for Advanced Intelligence Project (AIP), Japan.
She is a materials scientist who is strongly inspired by systematic searches for new functional materials. She started her research career in 2009 in the Department of Applied Chemistry, School of Engineering, the University of Tokyo, as an experimental materials scientist working to improve the critical current properties of MgB₂ superconductors. During her Ph.D. course, she searched for new superconductors by combining experiments and first-principles calculations. After obtaining her PhD, she contributed to establishing theories for designing new thermoelectric materials by using first-principles calculations. In collaboration with Masaya Kumagai, she started her project named Starrydata in 2015, to collect experimental data from plot images in published papers. She established the 'Thermoelectric Database Working Group' in the Thermoelectrics Society of Japan, to start open collaborations across laboratories in different institutions to collect and share large-scale experimental data for materials informatics. Recently she became the project leader of a 5.5-year project (JST-CREST) entitled 'Development of innovative functional materials based on large-scale search for new crystals', to search for new crystals by using materials informatics, in collaboration with experimental materials scientists.
Email: katsura@phys.mm.t.u-tokyo.ac.jp