Design

ASSYST is designed around two principles:

  1. small, orthogonal classes, and

  2. clean, abstract interfaces between them.

This has advantages on heterogeneous HPC: the individual stages of an MLIP-generation workflow require very different resources (serial preparation, multi-node DFT, GPU-accelerated fitting), and a loosely coupled package lets each stage be dispatched independently — e.g. via executorlib. The same loose coupling is what keeps ASSYST future-proof against new MLIP architectures or new reference-data sources.

ASSYST is organised around four small, independent submodules — one per stage of the training-set construction pipeline. Each submodule exposes a top-level function that consumes an iterable of ase.Atoms and produces another iterable of ase.Atoms, plus a family of small, frozen dataclasses.dataclass() configuration objects that describe how the stage should behave.

ASSYST module overview

Outline of data and control flow in the package. Green boxes are the cleanly separated workflow steps; arrows between them show the flow of generated structures as lists of ase.Atoms. Ochre boxes give example code for the small modular classes that configure each step. Violet boxes mark the points where ASSYST loosely couples to external codes — either via the ASE calculator interface or by overloading a minimal set of documented methods.

The pipeline

Reading the figure top-to-bottom, the canonical ASSYST workflow is:

  1. assyst.crystalssample() enumerates random symmetric crystals for the requested formulas and space groups.

  2. assyst.relaxationsrelax() minimises each candidate using an ASE calculator, driven by a Relax subclass (VolumeRelax, SymmetryRelax, FullRelax, …).

  3. assyst.perturbationsperturb() expands each minimum into a cloud of nearby configurations via Rattle, Stretch, and their additive/random combinations.

  4. assyst.filtersfilter() (the standard built-in) is composed with FilterBase subclasses such as DistanceFilter, VolumeFilter, EnergyFilter, and ForceFilter to discard pathological structures.

The stages are deliberately decoupled: each takes plain ASE atoms in and emits plain ASE atoms out, so any step can be skipped, swapped, or re-ordered without touching the others.

Why configuration objects?

Every stage separates what to do (the top-level function) from how to do it (the dataclass argument). This keeps three properties stable across the whole API:

  • Composability. Filters and perturbations support algebraic operators (&, |, +, RandomChoice) so complex behaviour is built from small pieces rather than added as new keyword arguments.

  • Reproducibility. Configuration objects are frozen dataclasses and therefore hashable and picklable — settings can be stored alongside the generated structures and reloaded verbatim.

  • Extensibility. Each configuration class is a small subclass with a single hook (e.g. apply_filter_and_constraints() for relaxations, __call__() for filters). See Custom Relaxers for a worked example of plugging in a non-ASE engine.

Role of the ASE calculator

ASSYST never owns a reference method. An ase.calculators.calculator.Calculator is passed into relax() and is the only place energies and forces enter the pipeline. All other stages operate on pure geometry, which is why a single workflow scales unchanged from a cheap Morse potential during prototyping to production DFT runs.

Provenance and reproducibility

Each generated structure carries a UUID in Atoms.info that is updated by assyst.utils.update_uuid() whenever a stage produces a new structure from an existing one. Parent/child relationships between structures are tracked alongside, so the full history of any data point in the final training set is recoverable; this property is exercised in the test suite. See Lineage Tracking and Metadata for the downstream tooling that builds on it.

Where ASSYST relies on random initialisation or perturbation, the state of the underlying random number generators is exposed so that runs can be reproduced bit-for-bit.