Overview of Tools and Resources
Before we dive into data mining and analysis, it is important to pause and understand the tools that will guide us along the way. In modern materials science, the challenge is not just generating data; it is navigating, organizing, and interpreting the massive volumes of information that already exist. To do this effectively, researchers rely on a combination of computational libraries, databases, and programming environments that make the exploration of materials both rigorous and approachable.
In this section, we highlight the key resources that will power our journey. These tools were chosen not only because they are widely used in cutting-edge research, but also because they strike a balance between accessibility and depth. By combining them, we will be able to move seamlessly from raw data to meaningful insights, all while building reproducible workflows that mirror real-world research practices.
Note
Computational tools and software libraries evolve continuously, and changes in package versions or execution environments (e.g., Google Colab) may occasionally lead to unexpected errors. Rather than viewing this as a limitation, we treat it as an authentic aspect of computational scientific practice. Students are encouraged to critically interpret error messages, consult documentation, and use AI-assisted tools responsibly to help diagnose and resolve issues. Developing these skills is an important component of computational literacy.
Getting Started: Cloning and Setting up the Environment
All of the materials for this tutorial are hosted in a public repository on GitHub, an online platform for sharing and collaborating on code. GitHub lets users access all files, track changes, and contribute to open-source projects. Users can clone the repository, that is, copy it to their local machine, and install all required packages with a single command. This step installs the exact package versions used in our tested environment, which greatly reduces the likelihood of dependency conflicts caused by version mismatches. The entire tutorial can also be run on Google Colab, a browser-based platform that requires no local installation. This option is especially helpful for students or instructors working on shared or restricted machines, and for remote collaborative work.
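After installation, it can be useful to confirm which package versions are actually present in the environment. The following is a minimal sketch using only the Python standard library. The helper name `report_versions` is our own illustration, not part of the tutorial repository, and the package names are assumed from the tools described in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names assumed from the tools listed in this section.
for name, ver in report_versions(["pymatgen", "matminer", "ase"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Comparing this output with the versions pinned in the repository is a quick first step when diagnosing an unexpected error.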
Jupyter Notebooks: Our Computational Laboratory
The tutorial is written as a set of Jupyter notebooks, interactive documents that combine executable code, text, equations, and visualizations. Students can read explanations and run code cells, all within the same interface. This format supports exploratory learning and encourages active engagement with the material. Whether running locally or on Google Colab, Jupyter notebooks serve as an accessible entry point for hands-on scientific computing.
Python: The Language That Glues Everything Together
Python is the programming language used throughout the tutorial. It is widely used in science and technology and is valued for its simplicity, readability, and large ecosystem of scientific libraries, supported by a large worldwide community of users. Python connects all the other tools in this workflow. Students and instructors do not need prior programming experience to follow this tutorial: all code is explained step by step and annotated to encourage understanding. By the end of this activity, students will have gained familiarity with Python syntax and basic data manipulation techniques, which are valuable skills in scientific research, academia, and industry.
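To give a flavor of the syntax and data manipulation involved, here is a small illustrative sketch that is not taken from the tutorial notebooks. It stores rounded standard atomic masses in a dictionary and computes the molar mass of two compounds. The names `ATOMIC_MASS`, `molar_mass`, and `compounds` are hypothetical.

```python
# Rounded standard atomic masses in g/mol.
ATOMIC_MASS = {"Na": 22.990, "Cl": 35.45, "Si": 28.085, "O": 15.999}


def molar_mass(composition):
    """Sum atomic masses weighted by the number of atoms of each element."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


# Each compound is described by how many atoms of each element it contains.
compounds = {
    "NaCl": {"Na": 1, "Cl": 1},
    "SiO2": {"Si": 1, "O": 2},
}

for formula, composition in compounds.items():
    print(f"{formula}: {molar_mass(composition):.2f} g/mol")
# prints:
# NaCl: 58.44 g/mol
# SiO2: 60.08 g/mol
```

Dictionaries, functions, and loops like these are the building blocks the later notebooks rely on.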
Materials Project: A Database of Materials
The Materials Project is an open-access online database. It is an international initiative that provides free access to data on thousands of materials, including electronic structure, geometry, thermodynamic properties, and spectral information, together with compositional analysis, with the goal of calculating the properties of all inorganic materials. Students can use this resource to explore real materials data and retrieve relevant information for their own analyses. In the tutorial, the Materials Project is accessed both through its website and through an Application Programming Interface (API), a tool that allows programs to communicate with external services. In this context, the API lets users retrieve materials data directly from the Materials Project using Python commands, without having to download files manually from the website. This allows students to automate data retrieval and perform custom searches.
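A programmatic query might look like the sketch below. It assumes the `mp-api` client package and its `MPRester` class. The exact method path (`materials.summary.search`) and field names reflect recent versions of that client and may differ in the version pinned by the tutorial. The helper names `get_api_key` and `fetch_summaries` are our own.

```python
import os


def get_api_key(env_var="MP_API_KEY"):
    """Read the API key from an environment variable instead of hard-coding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


def fetch_summaries(formula, api_key):
    """Query summary data for one chemical formula (requires network access)."""
    # Imported here so this file still loads where mp-api is not installed.
    from mp_api.client import MPRester

    # Assumed call path for recent mp-api releases; check the installed version.
    with MPRester(api_key) as mpr:
        return mpr.materials.summary.search(
            formula=formula,
            fields=["material_id", "formula_pretty", "band_gap"],
        )
```

With a valid key set, `fetch_summaries("Si", get_api_key())` would return one record per matching entry. Keeping the key in an environment variable avoids accidentally sharing it inside a notebook.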
Pymatgen: A Python Interface for Materials Data
Pymatgen (Python Materials Genomics) is a Python library that acts as a bridge between the Materials Project and your Jupyter notebook, whether it runs locally or on Google Colab. With Pymatgen, students can query the Materials Project using an API key, a personal access code provided for free on the Materials Project website. This key identifies the user and allows them to make a certain number of automated data requests per day, which encourages responsible use of the database. Students can then retrieve detailed structural information and perform basic analyses. Pymatgen provides simple functions for working with structures, converting file formats, and visualizing data, making it an essential component of this tutorial.
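The kind of structure handling described above can be sketched without any API call by building a structure by hand. The example below assumes Pymatgen's `Structure` and `Lattice` classes. The CsCl cell and the 4.2 angstrom lattice parameter are illustrative values, not data retrieved from the Materials Project, and the helper names are hypothetical.

```python
def build_cscl():
    """Build a simple cubic CsCl structure by hand (no API call needed)."""
    # Imported inside the function so this file loads even if pymatgen is missing.
    from pymatgen.core import Lattice, Structure

    lattice = Lattice.cubic(4.2)  # illustrative lattice parameter in angstroms
    # Fractional coordinates: Cs at the corner, Cl at the body center.
    return Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])


def summarize(structure):
    """Collect a few basic quantities that pymatgen derives from a structure."""
    return {
        "formula": structure.composition.reduced_formula,
        "n_sites": len(structure),
        "volume_A3": structure.volume,
        "density_g_cm3": structure.density,
    }
```

Calling `summarize(build_cscl())` returns the reduced formula, the number of sites, the cell volume, and the density. The same `summarize` function would work on a structure retrieved from the Materials Project.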
Matminer: Data Mining for Materials Science
Matminer is a Python library developed to support data mining and machine learning in materials science. It provides access to curated datasets and to external data sources such as the widely known Materials Project, Citrine, and the Materials Data Facility (MDF). It also includes tools for converting material structures into numerical descriptors that can be used in statistical analysis or modeling. In this tutorial, students use Matminer to explore material properties and identify trends across different compounds. This exposes them to data-driven inquiry methods and the basic concepts of materials informatics.
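To make the idea of a numerical descriptor concrete, the sketch below computes two simple features from a composition by hand. This is a plain-Python illustration of the concept and not Matminer's API; Matminer's featurizer classes automate this kind of calculation across many elemental properties. The names `ELECTRONEGATIVITY` and `electronegativity_descriptors` are hypothetical.

```python
# Pauling electronegativities for a few elements.
ELECTRONEGATIVITY = {"Na": 0.93, "Cl": 3.16, "Si": 1.90, "O": 3.44}


def electronegativity_descriptors(composition):
    """Turn a composition such as {"Si": 1, "O": 2} into a small feature set.

    Returns the composition-weighted mean and the range (max - min) of the
    elemental electronegativities.
    """
    total_atoms = sum(composition.values())
    values = [ELECTRONEGATIVITY[element] for element in composition]
    weighted_sum = sum(
        ELECTRONEGATIVITY[element] * count for element, count in composition.items()
    )
    return {"mean": weighted_sum / total_atoms, "range": max(values) - min(values)}


for formula, composition in {"NaCl": {"Na": 1, "Cl": 1}, "SiO2": {"Si": 1, "O": 2}}.items():
    features = electronegativity_descriptors(composition)
    print(f"{formula}: mean={features['mean']:.3f}, range={features['range']:.2f}")
```

Once every compound is reduced to the same set of numbers, compounds can be compared, plotted, or passed to a statistical model.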
ASE (Atomic Simulation Environment): Structure Manipulation and Visualization
ASE is a Python library designed for creating, modifying, and visualizing atomic structures. It also implements structure optimization (energy minimization) algorithms, among many other practical applications. In this tutorial, students and instructors use ASE to manipulate structures retrieved from the Materials Project, for example by introducing point defects. ASE also includes built-in visualization capabilities that allow students to see the atomic arrangements they are working with. These visual insights help reinforce the connection between data and physical structure and make abstract concepts more tangible.
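Introducing a point defect can be sketched as follows, assuming ASE's `bulk` builder. For simplicity the example builds a copper supercell locally rather than starting from a structure retrieved from the Materials Project, and the helper name `make_vacancy_supercell` is our own.

```python
def make_vacancy_supercell(symbol="Cu", repeat=(2, 2, 2), vacancy_index=0):
    """Build a bulk supercell and remove one atom to create a vacancy.

    Returns the defective supercell and the atom count before removal.
    """
    # Imported inside the function so this file loads even if ASE is missing.
    from ase.build import bulk

    # Conventional cubic cell; ASE looks up the reference lattice for the element.
    supercell = bulk(symbol, cubic=True).repeat(repeat)
    n_before = len(supercell)
    del supercell[vacancy_index]  # deleting an atom leaves a vacancy at its site
    return supercell, n_before
```

For copper, the conventional fcc cell holds 4 atoms, so the default 2 x 2 x 2 supercell has 32 atoms before the vacancy is created and 31 afterwards. The result can then be inspected with ASE's `view` function from `ase.visualize`.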
Background and Recommended Resources
This Jupyter Book is designed to be accessible to students and instructors with little or no prior experience in programming. All activities are presented in a guided, step-by-step format, and the code cells can be run as written. However, users who wish to extend the examples, modify the workflows, or develop their own notebooks may benefit from additional familiarity with Jupyter notebooks and basic Python concepts.
For readers new to these tools, the following freely available resources provide excellent introductions and opportunities for self-directed learning:
Jupyter Documentation: An official introduction to the Jupyter Notebook environment, including how cells work, how to combine code and Markdown, and best practices for interactive computing. https://docs.jupyter.org
Software Carpentry, Programming with Python & Jupyter Notebooks: Community-developed lessons designed specifically for scientists with little or no prior coding experience. These materials emphasize practical workflows and reproducible research. https://software-carpentry.org/lessons/
Python Data Science Handbook (Jake VanderPlas): An open-access, notebook-based resource introducing Python tools commonly used in scientific data analysis, including NumPy and pandas. https://jakevdp.github.io/PythonDataScienceHandbook/
Readers are encouraged to consult these resources as needed. Encountering errors, exploring documentation, and learning how to troubleshoot computational workflows are valuable components of developing computational literacy and reflect authentic scientific practice.
Next Steps
Together, these tools will form the foundation of everything we do in this project. Now that we have a sense of the digital “toolbox” available to us, it is time to start using it. We will begin by exploring one of the most important resources in computational materials science: the Materials Project, a vast database of computed properties for thousands of compounds.