MLAI-Yonsei/GenMol_MatterGen

This repository adds a pipeline for conditioned fine-tuning of MatterGen, generating samples with the fine-tuned model, and computing their properties using the GPAW package.

Installation

The easiest way to install prerequisites is via uv, a fast Python package and project manager.

The MatterGen environment can be installed via the following command (assumes you are running Linux and have a CUDA GPU):

pip install uv
uv venv .venv --python 3.10 
source .venv/bin/activate
cd /your/path/mattergen
uv pip install -e .

Note that our datasets and model checkpoints are provided inside this repo via Git Large File Storage (LFS). To find out whether LFS is installed on your machine, run

git lfs --version

If this prints some version like git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1), you can skip the following step.

Install Git LFS

If Git LFS was not installed before you cloned this repo, you can install it via:

sudo apt install git-lfs
git lfs install

Get started with a trained model

Microsoft provides checkpoints of an unconditional base version of MatterGen as well as fine-tuned models for these properties:

  • mattergen_base: unconditional base model trained on Alex-MP-20 (PT)
  • mp_20_base: unconditional base model trained on MP-20 (PT)
  • chemical_system: fine-tuned model conditioned on chemical system (FT)
  • space_group: fine-tuned model conditioned on space group (FT)
  • dft_mag_density: fine-tuned model conditioned on magnetic density from DFT (FT)
  • dft_band_gap: fine-tuned model conditioned on band gap from DFT (FT)
  • ml_bulk_modulus: fine-tuned model conditioned on bulk modulus from ML predictor (FT)
  • dft_mag_density_hhi_score: fine-tuned model jointly conditioned on magnetic density from DFT and HHI score (FT)
  • chemical_system_energy_above_hull: fine-tuned model jointly conditioned on chemical system and energy above hull from DFT (FT)

The Microsoft-provided models are located at checkpoints/<model_name> and are also available on Hugging Face. By default, they are downloaded from Hugging Face when requested. You can also manually download them from Git LFS via

git lfs pull -I checkpoints/<model_name> --exclude="" 

For reproducibility, the fine-tuned model files (trained by MLAI) are stored in the saves folder.

Generating materials

Unconditional generation (using PT model from Microsoft)

To sample from the pre-trained base model, run the following command.

export MODEL_NAME=mattergen_base
export RESULTS_PATH=results/  # Samples will be written to this directory

# generate batch_size * num_batches samples
mattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --num_batches 1

This script will write the following files into $RESULTS_PATH:

  • generated_crystals_cif.zip: a ZIP file containing a single .cif file per generated structure.
  • generated_crystals.extxyz: a single file containing the individual generated structures as frames.
  • If --record-trajectories == True (default): generated_trajectories.zip: a ZIP file containing a .extxyz file per generated structure, which contains the full denoising trajectory for each individual structure.
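To sanity-check a finished run, the output files listed above can be inspected with Python's standard library alone. This is a minimal sketch; the file names follow the list above, and `summarize_results` is a hypothetical helper, not part of MatterGen:

```python
import zipfile
from pathlib import Path


def summarize_results(results_path: str) -> dict:
    """Count generated structures in a MatterGen results directory."""
    results = Path(results_path)
    summary = {}
    cif_zip = results / "generated_crystals_cif.zip"
    if cif_zip.exists():
        with zipfile.ZipFile(cif_zip) as zf:
            # one .cif file per generated structure
            summary["num_cifs"] = sum(1 for n in zf.namelist() if n.endswith(".cif"))
    extxyz = results / "generated_crystals.extxyz"
    if extxyz.exists():
        # frames are concatenated in one file; record its size as a quick check
        summary["extxyz_bytes"] = extxyz.stat().st_size
    return summary
```

With `--batch_size=16 --num_batches 1` as above, `summarize_results($RESULTS_PATH)` should report 16 CIF files.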

Property-conditioned generation (using FT model from Microsoft)

With a fine-tuned model, you can generate materials conditioned on a target property. For example, to sample from the model trained on magnetic density, you can run the following command.

export MODEL_NAME=dft_mag_density
export RESULTS_PATH="results/$MODEL_NAME/"  # Samples will be written to this directory, e.g., `results/dft_mag_density`

# Generate conditional samples with a target magnetic density of 0.15
mattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on="{'dft_mag_density': 0.15}" --diffusion_guidance_factor=2.0

Property-conditioned generation (manually fine-tuned model)

export PROPERTY_NAME=dft_band_gap
export PROPERTY_CONDITION=2 
export MODEL_PATH=/your/path/model
# MODEL_PATH = the directory right above the /lightning_logs folder
export RESULTS_PATH="results/$PROPERTY_NAME/$PROPERTY_CONDITION" 

mattergen-generate $RESULTS_PATH --model_path=$MODEL_PATH --batch_size=16 --properties_to_condition_on="{'$PROPERTY_NAME': $PROPERTY_CONDITION}" --diffusion_guidance_factor=2.0

The fine-tuned model files from MLAI are stored in the saves folder. For multiple property-conditioned generation, simply provide multiple conditions in the --properties_to_condition_on argument, like --properties_to_condition_on="{'energy_above_hull': 0.05, 'dft_band_gap': 2}"
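The value passed to --properties_to_condition_on is a Python dict literal. Before launching a long generation run, it can be validated offline with the standard library; a small sketch, assuming the CLI accepts exactly the literal shown above:

```python
import ast

# The condition string, exactly as it would be passed on the command line.
condition = "{'energy_above_hull': 0.05, 'dft_band_gap': 2}"

# ast.literal_eval safely parses Python literals without executing code,
# so a malformed condition fails here instead of inside a long job.
parsed = ast.literal_eval(condition)
print(parsed)  # {'energy_above_hull': 0.05, 'dft_band_gap': 2}
```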

Evaluation

Once you have generated a list of structures contained in $RESULTS_PATH (either using MatterGen or another method), you can relax the structures using the default MatterSim machine learning force field (see repository) and compute novelty, uniqueness, stability (using energy estimated by MatterSim), and other metrics via the following command:

git lfs pull -I data-release/alex-mp/reference_MP2020correction.gz --exclude=""  # first download the reference dataset from Git LFS
mattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as="$RESULTS_PATH/metrics.json"

This script will write metrics.json containing the metric results to $RESULTS_PATH and will print it to your console.
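Since the metrics file is plain JSON, it can be post-processed with the standard library; a minimal sketch (the metric name in the test below is illustrative, not a guaranteed key):

```python
import json


def load_metrics(path: str) -> dict:
    """Load the metrics.json written by mattergen-evaluate
    and return it as a plain dict (metric name -> value)."""
    with open(path) as f:
        return json.load(f)
```

For example, `load_metrics(f"{results_path}/metrics.json")` returns a dict you can log or tabulate across runs.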

Note

If you run into an error like TypeError: Optimizer.converged() missing 1 required positional argument: 'gradient', change lines 121-122 of .venv/lib/python3.10/site-packages/mattersim/applications/batch_relax.py from

opt.step()
if opt.converged():

->

opt.step()
gradient = opt.optimizable.get_gradient()
if opt.converged(gradient):

Train MatterGen yourself

Before we can train MatterGen from scratch, we have to unpack and preprocess the dataset files.

Pre-process a dataset for training

To preprocess our larger alex_mp_20 dataset, run:

# Download file from LFS
git remote add upstream https://github.com/microsoft/mattergen.git 
git remote -v

git lfs install
git lfs fetch upstream --include="data-release/alex-mp/alex_mp_20.zip" --exclude=""

git lfs checkout data-release/alex-mp/alex_mp_20.zip

git lfs pull -I data-release/alex-mp/alex_mp_20.zip --exclude=""
unzip data-release/alex-mp/alex_mp_20.zip -d datasets
csv-to-dataset --csv-folder datasets/alex_mp_20/ --dataset-name alex_mp_20 --cache-folder datasets/cache

The above code includes additional modifications to resolve LFS server issues that occur when working with the forked repository instead of the original GitHub repo. This will take some time (~1h). You will get preprocessed data files in datasets/cache/alex_mp_20.

Training (Pre-training)

To train the MatterGen base model on alex_mp_20, use the following command:

mattergen-train data_module=alex_mp_20 ~trainer.logger trainer.accumulate_grad_batches=4

Note

For Apple Silicon training, add ~trainer.strategy trainer.accelerator=mps to the above command.

Tip

Note that a single GPU's memory usually is not enough for the batch size of 512, hence we accumulate gradients over 4 batches. If you still run out of memory, increase this further.
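The arithmetic behind the tip above can be made explicit; the per-step batch of 128 is an assumption for illustration only:

```python
# Illustrative arithmetic only: the per-step batch that fits in GPU
# memory is assumed; 512 is the target batch size mentioned above.
per_step_batch = 128          # what fits on one GPU (assumed)
accumulate_grad_batches = 4   # the value passed to mattergen-train
effective_batch = per_step_batch * accumulate_grad_batches
print(effective_batch)  # 512
```

Doubling accumulate_grad_batches halves the per-step memory footprint while keeping the same effective batch size.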

Crystal structure prediction

Even though not a focus of our paper, you can also train MatterGen in crystal structure prediction (CSP) mode, where it does not denoise the atom types during generation. This gives you the ability to condition on a specific chemical formula for generation. You can train MatterGen in this mode by passing --config-name=csp to run.py.

To sample from this model, pass --target_compositions=['{"<element1>": <number_of_element1_atoms>, "<element2>": <number_of_element2_atoms>, ..., "<elementN>": <number_of_elementN_atoms>}'] --sampling-config-name=csp to generate.py. An example composition could be --target_compositions=['{"Na": 1, "Cl": 1}'].
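Each entry in --target_compositions maps an element symbol to an atom count. The helper below (hypothetical, not MatterGen API) just renders such a dict as a compact formula string to double-check the composition you are about to request:

```python
def composition_to_formula(comp: dict) -> str:
    """Render a {'element': count} dict such as {'Na': 1, 'Cl': 1}
    as a compact formula string (hypothetical helper, not MatterGen API)."""
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in comp.items())


print(composition_to_formula({"Na": 1, "Cl": 1}))  # NaCl
```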

Fine-tuning on property data

You can change where the fine-tuned model files are saved by editing the conf/finetune.yaml file. By default, fine-tuned models are saved in the saves folder.

Properties that can be fine-tuned on:

  • space_group
  • energy_above_hull
  • dft_band_gap
  • dft_bulk_modulus
  • dft_mag_density
  • hhi_score
  • ml_bulk_modulus

You can fine-tune the MatterGen base model using the following command.

export PROPERTY=dft_mag_density 
mattergen-finetune adapter.pretrained_name=mattergen_base data_module=alex_mp_20 +lightning_module/diffusion_module/model/property_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY=$PROPERTY ~trainer.logger data_module.properties=["$PROPERTY"] trainer.accumulate_grad_batches=4

Note

On a single A6000 GPU, setting trainer.accumulate_grad_batches=4 is mandatory to avoid an out-of-memory (OOM) error. If OOM errors still occur, increase trainer.accumulate_grad_batches further.

dft_mag_density denotes the target property for fine-tuning.

Multi-property fine-tuning

You can also fine-tune MatterGen on multiple properties. For instance, to fine-tune it on dft_mag_density and dft_band_gap, you can use the following command.

export PROPERTY1=dft_mag_density
export PROPERTY2=dft_band_gap 
export MODEL_NAME=mattergen_base
mattergen-finetune adapter.pretrained_name=$MODEL_NAME data_module=mp_20 +lightning_module/diffusion_module/model/property_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY1=$PROPERTY1 +lightning_module/diffusion_module/model/property_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY2=$PROPERTY2 ~trainer.logger data_module.properties=["$PROPERTY1","$PROPERTY2"]

Tip

Add more properties analogously by adding these overrides:

  1. +lightning_module/diffusion_module/model/property_embeddings@adapter.adapter.property_embeddings_adapt.<my_property>=<my_property>
  2. Add <my_property> to the data_module.properties=["$PROPERTY1","$PROPERTY2",...,<my_property>] override.
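The repeated override in step 1 is error-prone to type by hand. This hypothetical Python helper assembles the override list for any number of properties; the Hydra key is copied verbatim from the command above:

```python
# The Hydra config-group key used in the fine-tuning commands above.
EMBEDDING_KEY = (
    "+lightning_module/diffusion_module/model/property_embeddings"
    "@adapter.adapter.property_embeddings_adapt"
)


def finetune_overrides(properties: list[str]) -> list[str]:
    """Build the per-property embedding overrides plus the
    data_module.properties override (hypothetical helper)."""
    overrides = [f"{EMBEDDING_KEY}.{p}={p}" for p in properties]
    props = ",".join(f'"{p}"' for p in properties)
    overrides.append(f"data_module.properties=[{props}]")
    return overrides
```

For example, `finetune_overrides(["dft_mag_density", "dft_band_gap"])` reproduces the overrides in the two-property command above, ready to append to the mattergen-finetune invocation.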

Note

For Apple Silicon training, add ~trainer.strategy trainer.accelerator=mps to the above command.

GPAW Calculation

In the gpaw folder, there is code that checks whether the property values (the conditioning targets) of the conditioned-generation results were correctly reproduced. Install the required packages with

conda env create -f ./gpaw/environment.yml

We recommend creating this as a separate environment from the MatterGen .venv, because of package conflicts. Then download the files necessary for GPAW computations using the commands below.

# download files
gpaw install-data --gpaw --version=24.11.0 /your/path/gpaw/setups
# downloaded file location
export GPAW_SETUP_PATH=/your/path/gpaw/setups
echo 'export GPAW_SETUP_PATH=/your/path/gpaw/setups' >> ~/.bashrc
source ~/.bashrc
# check
echo $GPAW_SETUP_PATH

Running the executable inside gpaw/calculate_bash runs the GPAW computation. An example is shown below:

python /your/path/gpaw/properties_mattergen.py   --cif_dir /your/path/cif_file   --pattern "*.cif"   --prop dft_bulk_modulus   --out /your/path/gpaw/out_gen_dft_bulk_modulus/300   --k 3 3 3   --smear 0.05   --span 0.05

When you run the code, a CSV file and a JSON file are created in the directory specified by --out.
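The CSV and JSON outputs can be collected with the standard library; a sketch under the assumption that --out contains one or more *.csv row files and *.json metadata files (adjust the globs to whatever properties_mattergen.py actually writes):

```python
import csv
import json
from pathlib import Path


def read_gpaw_outputs(out_dir: str):
    """Collect all CSV rows and merge all JSON dicts from the
    directory passed as --out (file names are assumptions)."""
    out = Path(out_dir)
    rows, meta = [], {}
    for csv_path in out.glob("*.csv"):
        with open(csv_path, newline="") as f:
            rows.extend(csv.DictReader(f))
    for json_path in out.glob("*.json"):
        meta.update(json.loads(json_path.read_text()))
    return rows, meta
```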

You can compute other properties simply by changing the --prop argument.

Also, GPAW is not as fast as the VASP package; for efficiency, the k-point mesh is set to 3 3 3, which leads to slightly lower accuracy in the computed results.

Troubleshooting History

There were no issues during the installation or generation steps. The problem occurred during the evaluation step, and a detailed explanation is given in that section. (Resolved by modifying the code.)

During the pre-process a dataset for training stage, working with the forked repository caused an issue where the LFS server could not be accessed. This was resolved by adding a step to connect to the original repository’s LFS server.

Fine-tuning itself works without problems, but during the generation stage, note that the argument names differ between generating with a pre-trained model (--pretrained-name) and generating with a fine-tuned model (--model_path).
