A Complete Workflow for a Single Element

This notebook runs the full ASSYST pipeline end-to-end on a unary system (copper). The four canonical steps are:

  1. Sampling — generate symmetric random structures for every space group.

  2. Relaxing — minimise volume first, then full cell + positions, to find the relevant pockets of the potential energy surface.

  3. Perturbing — apply random rattles and stretches around those minima to populate the surrounding PES.

  4. Combining — collect everything, filter unphysical configurations, evaluate the reference and store the resulting training data.

In a real workflow the relaxations and final evaluations are carried out with DFT (or any other reference); here we use a Morse potential so the notebook runs quickly.

Imports

import pickle
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

from assyst.crystals import Formulas, sample
from assyst.filters import DistanceFilter, AspectFilter, VolumeFilter
from assyst.relaxations import VolumeRelax, FullRelax, relax
from assyst.perturbations import RandomChoice, Rattle, Stretch, perturb

Configuration

max_num is the largest number of atoms per generated structure. 2 keeps the example fast; 10 is the typical production value and yields ~10k training structures.

max_num = 2
out_dir = Path(f"Unary/{max_num}")
out_dir.mkdir(parents=True, exist_ok=True)

Reference potential

Any ASE calculator works. We pick a Morse potential whose parameters were chosen so that its minimum sits near the experimental Cu-Cu bond length.

For more interesting reference data the universal Grace graph-ACE models in assyst.calculators are a good drop-in replacement (they require the optional tensorpotential dependency).

from ase.calculators.morse import MorsePotential

reference = MorsePotential(epsilon=0.3, r0=2.55265548 * 1.10619396, rho0=4)

1. Sampling random structures

Build the formulas to sample (Cu, Cu_2, …, up to max_num) and let sample generate one structure for every compatible space group.

AspectFilter(6) filters out very elongated unit cells that pyxtal occasionally produces.

formulas = Formulas.range('Cu', max_num + 1)
formulas
Formulas(atoms=({'Cu': 0}, {'Cu': 1}, {'Cu': 2}))
spg = list(filter(
    AspectFilter(6),
    sample(
        formulas,
        spacegroups=range(1, 230 + 1),
        max_atoms=max_num,
    )
))
print(f"Generated {len(spg)} symmetric seeds")
Generated 168 symmetric seeds

Just to create a small check point, dump generated structures into a pickle file. This is not required for assyst.

with open(out_dir / 'spg.pckl', 'wb') as f:
    pickle.dump(spg, f)

2. Relaxation

ASSYST applies a two-step relaxation:

  1. VolumeRelax — relax the volume only, keeping shape and internal positions frozen. Quickly puts each seed near a sensible energy scale.

  2. FullRelax — relax cell shape, volume and atomic positions together, landing in a (local) minimum of the PES.

In production both would be carried out with DFT; here the Morse potential makes the cost negligible. For experimentation, assyst.relaxations also provides CellRelax and SymmetryRelax.

volset = VolumeRelax(max_steps=100, force_tolerance=1e-3)
allset = FullRelax(max_steps=100, force_tolerance=1e-3)
%%time
volmin = list(relax(tqdm(spg, desc='volume'), volset, reference))
CPU times: user 2min 24s, sys: 139 ms, total: 2min 25s
Wall time: 48.7 s
%%time
allmin = list(relax(tqdm(volmin, desc='full'), allset, reference))
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.432556521112339e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.432556521112339e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 3.7157276183604643e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 3.7157276183604643e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 5.516864777792211e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 5.516864777792211e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 8.15181096918948e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 8.15181096918948e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.3749153830356407e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.3749153830356407e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.4866777525490544e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.4866777525490544e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.421109299871798e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.421109299871798e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.452048037342556e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.452048037342556e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.9627799244203634e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.9627799244203634e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.2935924938929445e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.2935924938929445e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.4872886565241954e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.4872886565241954e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 3.021149907848636e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 3.021149907848636e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.37529756320406e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.37529756320406e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.9385885786932446e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.9385885786932446e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 3.6379884134723173e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 3.6379884134723173e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.989922080203508e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.989922080203508e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.3682633429615987e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.3682633429615987e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.89860856904846e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.89860856904846e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 3.8198492634984746e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 2.535116799634508e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 2.535116799634508e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 3.7910927379480105e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 3.7910927379480105e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:624: RuntimeWarning: logm result may be inaccurate, approximate err = 5.716936716033083e-13
  cur_deform_grad_log = self.logm(cur_deform_grad)
/home/ponder/micromamba/envs/assyst/lib/python3.12/site-packages/ase/filters.py:606: RuntimeWarning: logm result may be inaccurate, approximate err = 5.716936716033083e-13
  pos[natoms:] = self.logm(pos[natoms:]) * self.exp_cell_factor
CPU times: user 9min 35s, sys: 234 ms, total: 9min 35s
Wall time: 3min 13s
with open(out_dir / 'volmin.pckl', 'wb') as f:
    pickle.dump(volmin, f)
with open(out_dir / 'allmin.pckl', 'wb') as f:
    pickle.dump(allmin, f)

3. Random perturbations

Around each minimum we sample the surrounding PES with three families of random perturbations described in the ASSYST paper:

  1. mostly positional noise with a small cell change (rattle),

  2. mostly hydrostatic cell change,

  3. mostly shear cell change.

The hydrostatic and shear strains are combined with RandomChoice so that hydrostatic changes are drawn slightly more often. The final mods sequence is applied to every fully-minimised structure. DistanceFilter discards perturbations that placed atoms unphysically close.

rattle = Rattle(0.25) + Stretch(hydro=0.05, shear=0.005)
hydro = Stretch(hydro=0.80, shear=0.05)
shear = Stretch(hydro=0.05, shear=0.20)
stretch = RandomChoice(hydro, shear, 0.7)

# four rattles + four stretches per minimum reproduces the published settings
mods = 4 * [rattle] + 4 * [stretch]
%%time
random = list(perturb(allmin, mods, filters=[DistanceFilter({'Cu': 1})]))
print(f"Produced {len(random)} perturbed structures")
Produced 1196 perturbed structures
CPU times: user 1.13 s, sys: 4.92 ms, total: 1.13 s
Wall time: 1.08 s
with open(out_dir / 'random.pckl', 'wb') as f:
    pickle.dump(random, f)

4. Combine and evaluate

Concatenate the sampled, relaxed and perturbed structures, run the distance and volume filters once more to reject any leftover pathological configurations, and finally evaluate the reference to obtain a training dataset of energies and forces.

everything = list(filter(
    VolumeFilter(300),
    filter(DistanceFilter({'Cu': 1}), spg + volmin + allmin + random),
))
print(f"Final training set: {len(everything)} structures")
Final training set: 1573 structures
rows = []
for s in tqdm(everything, desc='evaluating'):
    s.calc = reference
    rows.append({
        'ase_atoms': s,
        'energy': s.get_potential_energy(),
        'forces': s.get_forces(),
        'number_of_atoms': len(s),
    })
df = pd.DataFrame(rows)
df.to_pickle(out_dir / 'everything_training_data.pckl.gz')
df.head()
ase_atoms energy forces number_of_atoms
0 (Atom('Cu', [-1.3265240321355765, -7.353189190... -1.802887 [[6.071532165918825e-18, -7.979727989493313e-1... 1
1 (Atom('Cu', [0.0, 0.0, 0.24375080364530244], i... -2.192044 [[7.556889142223966e-17, 2.886146183156413e-16... 1
2 (Atom('Cu', [0.0, 0.0, 0.0], index=0)) -1.702372 [[4.5102810375396984e-17, 1.708702623837155e-1... 1
3 (Atom('Cu', [-1.1459510315187336, -1.145951031... -1.921211 [[6.895525817007808e-17, -3.673276960380889e-1... 1
4 (Atom('Cu', [-0.7029469904758565, 1.2175399025... -2.453490 [[5.204170427930421e-18, 7.28583859910259e-17,... 1

Inspect the training data

Energy per atom, energy vs volume and force component distributions are a quick sanity check that the dataset spans the expected ranges.

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

(df['energy'] / df['number_of_atoms']).plot.hist(bins=100, log=True, ax=axes[0])
axes[0].set_xlabel('Energy per atom [eV/atom]')
axes[0].set_ylabel('# Structures')

axes[1].scatter(
    df['ase_atoms'].map(lambda s: s.cell.volume / len(s)),
    df['energy'] / df['number_of_atoms'],
    marker='.', alpha=0.5,
)
axes[1].set_xlim(0, 300)
axes[1].set_xlabel(r'Volume [$\mathrm{\AA}^3$/atom]')
axes[1].set_ylabel('Energy per atom [eV/atom]')

df['forces'].explode().explode().infer_objects().plot.hist(bins=100, log=True, ax=axes[2])
axes[2].set_xlabel(r'Force component [eV/$\mathrm{\AA}$]')
axes[2].set_ylabel('# Components')

plt.tight_layout()
../_images/b3c8a16937e4e1476f68ba605b2577446341a870b268464fee67c1da3a37ac2c.png

Optional: post-hoc filtering

If the resulting energies and forces still cover ranges that are too extreme, EnergyFilter and ForceFilter (or pandas query) can prune them. These act on ase.Atoms objects that carry a SinglePointCalculator, which is what relax and the loop above attach.

df.query('-3 <= energy / number_of_atoms < 10')
ase_atoms energy forces number_of_atoms
0 (Atom('Cu', [-1.3265240321355765, -7.353189190... -1.802887 [[6.071532165918825e-18, -7.979727989493313e-1... 1
1 (Atom('Cu', [0.0, 0.0, 0.24375080364530244], i... -2.192044 [[7.556889142223966e-17, 2.886146183156413e-16... 1
2 (Atom('Cu', [0.0, 0.0, 0.0], index=0)) -1.702372 [[4.5102810375396984e-17, 1.708702623837155e-1... 1
3 (Atom('Cu', [-1.1459510315187336, -1.145951031... -1.921211 [[6.895525817007808e-17, -3.673276960380889e-1... 1
4 (Atom('Cu', [-0.7029469904758565, 1.2175399025... -2.453490 [[5.204170427930421e-18, 7.28583859910259e-17,... 1
... ... ... ... ...
1568 (Atom('Cu', [0.07877639345419643, 0.3276080267... -5.287408 [[-2.5511153646601943, 0.8436956669030645, -1.... 2
1569 (Atom('Cu', [0.0, 0.0, 0.0], index=0), Atom('C... -4.389131 [[-5.963111948670274e-17, -3.4876615484513707e... 2
1570 (Atom('Cu', [0.0, 0.0, 0.0], index=0), Atom('C... -4.417568 [[-2.8027710360922775e-15, -1.7008963681952594... 2
1571 (Atom('Cu', [0.0, 0.0, 0.0], index=0), Atom('C... -5.346236 [[-2.4824977143400595e-15, -1.9329156331071573... 2
1572 (Atom('Cu', [0.0, 0.0, 0.0], index=0), Atom('C... -4.695105 [[-2.8142381681322062e-15, 3.803390538441569e-... 2

1573 rows × 4 columns

Next steps

This notebook stops at the training set. Fitting a potential is then the job of a downstream MLIP package — e.g. ACE via python-ace, MTP via mlip-3, or any other framework that reads a list of structures with energies and forces. The dataframe saved in Unary/<max_num>/everything_training_data.pckl.gz is in the format consumed by pyace.linearacefit.LinearACEDataset.

The companion notebook ACEFitting.ipynb walks through training a small linear ACE potential on the dataset produced here.