
Input file#

The required settings are provided via an input YAML file. This file consists of several sections, each devoted to a particular group of pacemaker settings. The sections are listed below.

Cutoff and (optional) metadata#

  • Global cutoff for the neighborlist constructor is set up as:
cutoff: 10.0
  • Metadata (optional)

These are arbitrary string key-value pairs that will be added to the potential YAML file:

metadata:
  info: some info
  comment: some comment
  purpose: some purpose

In addition, the starttime and user fields will be added automatically.

Dataset specification section#

This section is denoted by the key

data:
  ...

The fitting dataset can be queried automatically from structdb (if the corresponding structdborm package is installed and the connection to the database is configured; see the structdb.ini file in your home folder). Alternatively, the dataset can be saved to a file as a pickled pandas dataframe with specially named columns.

Example:

data: # dataset specification section
  # data configuration section
  config:
    element: Al                    # element name
    calculator: FHI-aims/PBE/tight # calculator type from `structdb` 
    # ref_energy: -1.234           # single atom reference energy
                                   # if not specified, then it will be queried from database

  # seed: 42                       # random seed for shuffling the data  
  # query_limit: 1000              # limiting number of entries to query from `structdb`
                                   # ignored if reading from cache

  # cache_ref_df: True             # whether to store the queried or modified dataset into file, default - True
  # filename: some.pckl.gzip       # force to read reference pickled dataframe from given file
  # ignore_weights: False          # whether to ignore energy and force weighting columns in dataframe
  # datapath: ../data              # path to folder with cache files with pickled dataframes 

Alternatively, instead of the data::config section, one can specify just the cache file with the pickled dataframe via data::filename:

data: 
  filename: small_df_tf_atoms.pckl
  datapath: ../tests/
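Such a cache file can be produced directly with pandas. The sketch below writes a toy dataframe to the .pckl.gzip format and reads it back; the column names used here are illustrative placeholders only, not the exact schema pacemaker requires (see notebooks/data_preprocess.ipynb for that):

```python
import pandas as pd

# toy reference dataframe; "energy" and "forces" are placeholder column
# names for illustration, not the exact layout pacemaker expects
df = pd.DataFrame({
    "energy": [-3.36, -3.42],                            # energy per structure (eV)
    "forces": [[[0.0, 0.0, 0.0]], [[0.1, -0.1, 0.0]]],   # per-atom forces (eV/A)
})

# pandas infers gzip compression from the .gzip file extension
df.to_pickle("small_df.pckl.gzip", compression="gzip")

# read it back to verify the round trip
df2 = pd.read_pickle("small_df.pckl.gzip", compression="gzip")
```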

The data::datapath option, if not provided, can be replaced with the environment variable PACEMAKERDATAPATH.
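For example, setting the variable in the shell before invoking pacemaker is equivalent to providing data::datapath in the input file:

```shell
# equivalent to setting `data::datapath` in input.yaml
export PACEMAKERDATAPATH=../data
echo "$PACEMAKERDATAPATH"
```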

An example of creating a subselection of the fitting dataframe and saving it is given in notebooks/data_preprocess.ipynb.

An example of generating custom energy/forces weights is given in notebooks/data_custom_weights.ipynb.

Querying data#

You can just query and preprocess the data, without running the potential fitting. Here is a minimalistic input YAML:

# input.yaml file

cutoff: 10.0  # use larger cutoff to have excess neighbour list
data: # dataset specification section
  config:
    element: Al                    # element name
    calculator: FHI-aims/PBE/tight # calculator type from `structdb`
  seed: 42
  datapath: ../data                # path to the directory with cache files
  # query_limit: 100               # number of entries to query  

Preparing the data / constructing neighbor list#

You can use an existing .pckl.gzip dataset and generate all necessary columns for it, including neighbour lists. Here is a minimalistic input YAML:

# input.yaml file

cutoff: 10.

data:
  filename: my_dataset.pckl.gzip

backend:
  evaluator: tensorpot  # pyace, tensorpot

Then execute pacemaker --prepare-data input.yaml. Generation of my_dataset.pckl.gzip from, for example, pyiron is shown in notebooks/convert-pyiron-to-pacemaker.ipynb.

Test set#

You can provide a test set either as a fraction or a fixed number of samples from the train set (option test_size), or as a separate pckl.gzip file (option test_filename):

data:
  test_filename: my_test_dataset.pckl.gzip

or

data:
  test_size: 100 # would take 100 samples randomly from train/fit set
  # test_size: 0.1 #  if <1 - would take given fraction of samples randomly from train/fit set
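The two forms of test_size can be read as follows. This small helper only illustrates the documented semantics; it is not pacemaker's actual implementation:

```python
def resolve_test_size(test_size, n_total):
    """Illustration of the documented `test_size` semantics:
    a value < 1 is a fraction of the train/fit set,
    a value >= 1 is an absolute number of samples."""
    if test_size < 1:
        return int(round(test_size * n_total))
    return int(test_size)

print(resolve_test_size(100, 5000))  # -> 100 samples
print(resolve_test_size(0.1, 5000))  # -> 500 samples (10% of 5000)
```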

Interatomic potential (or B-basis) configuration#

Basis configuration#

In order to specify the B-basis potential, you have to provide four main components (a.k.a. the basis shape): elements, embeddings for each element, bonds for each possible pair of elements, and functions for each possible combination of elements (unary, binary, ternary, etc.), as follows:

potential:
  deltaSplineBins: 0.001
  elements: [Al, Ni]  # list of all elements

  # Embeddings are specified for each individual elements,
  # all parameters could be distinct for different species
  embeddings: ## possible keywords: ALL, UNARY, elements: Al, Ni
    Al: {
      npot: 'FinnisSinclairShiftedScaled',
      fs_parameters: [1, 1, 1, 0.5], ## non-linear embedding function: 1*rho_1^1 + 1*rho_2^0.5
      ndensity: 2,

      # core repulsion parameters
      rho_core_cut: 200000,
      drho_core_cut: 250
    }

    Ni: {
      npot: 'FinnisSinclairShiftedScaled',
      fs_parameters: [1, 1], ## linear embedding function: 1*rho_1^1
      ndensity: 1,

      # core repulsion parameters
      rho_core_cut: 3000,
      drho_core_cut: 150
    }

  ## Bonds are specified for each possible pair of elements
  ## One could use keywords: ALL (covers AlAl, NiNi, AlNi, NiAl)
  bonds: ## possible keywords: ALL, UNARY, BINARY, element pairs as AlAl, AlNi, NiAl, etc...
    ALL: {
        radbase: ChebExpCos,
        radparameters: [5.25],

        ## outer cutoff, applied in a range [rcut - dcut, rcut]
        rcut: 5,
        dcut: 0.01,

        ## inner cutoff, applied in a range [r_in, r_in + delta_in]
        r_in: 1.0,
        delta_in: 0.5,

        ## core-repulsion parameters `prefactor` and `lambda` in
        ## prefactor*exp(-lambda*r^2)/r, >0 only r<r_in+delta_in
        core-repulsion: [0.0, 5.0],
    }

    ## BINARY overwrites ALL settings when they are repeated
    BINARY: {
        radbase: ChebPow,
        radparameters: [6.25],

        ## cutoff may vary for different bonds
        rcut: 5.5,
        dcut: 0.01,

        ## inner cutoff, applied in a range [r_in, r_in + delta_in]
        r_in: 1.0,
        delta_in: 0.5,

        ## core-repulsion parameters `prefactor` and `lambda` in
        ## prefactor*exp(-lambda*r^2)/r, >0 only r<r_in+delta_in
        core-repulsion: [0.0, 5.0],
    }

  ## possible keywords: ALL, UNARY, BINARY, TERNARY, QUATERNARY, QUINARY,
  ##  element combinations as (Al,Al), (Al, Ni), (Al, Ni, Zn), etc...
  functions: 
    UNARY: {
      nradmax_by_orders: [15, 3, 2, 2, 1],
      lmax_by_orders: [ 0, 2, 2, 1, 1],
      # coefs_init: zero # initialization of functions coefficients: zero (default) or random
    }

    BINARY: {
      nradmax_by_orders: [15, 2, 2, 2],
      lmax_by_orders: [ 0, 2, 2, 1],
      # coefs_init: zero # initialization of functions coefficients: zero (default) or random
    }
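The core-repulsion expression quoted in the bond settings above can be written out explicitly. The function below is a direct transcription of that documented formula; pacemaker evaluates it internally, so this is for illustration only:

```python
import math

def core_repulsion(r, prefactor, lam, r_in, delta_in):
    """prefactor * exp(-lambda * r^2) / r, active only for r < r_in + delta_in
    (a transcription of the documented formula, not pacemaker's internal code)."""
    if r >= r_in + delta_in:
        return 0.0
    return prefactor * math.exp(-lam * r ** 2) / r

# with core-repulsion: [0.0, 5.0] the prefactor is zero, so the term vanishes
print(core_repulsion(1.2, 0.0, 5.0, 1.0, 0.5))  # -> 0.0
```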

In the sections embeddings, bonds and functions one could use the keywords ALL, UNARY, BINARY, TERNARY, QUATERNARY, QUINARY. The settings provided by a more specific keyword will override those from a less specific one, i.e. ALL < UNARY < BINARY < ('Al','Ni').
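This precedence rule amounts to merging dictionaries from least to most specific keyword, so that later entries override earlier ones. A minimal sketch (illustrative only; pacemaker performs this resolution internally):

```python
def resolve_settings(section, chain):
    """Merge settings from least to most specific keyword, so that later
    (more specific) entries override earlier ones."""
    merged = {}
    for key in chain:
        merged.update(section.get(key, {}))
    return merged

bonds = {
    "ALL": {"radbase": "ChebExpCos", "rcut": 5.0},
    "BINARY": {"radbase": "ChebPow", "rcut": 5.5},
}

# for the Al-Ni pair, BINARY overrides ALL
print(resolve_settings(bonds, ["ALL", "BINARY", "AlNi"]))
# -> {'radbase': 'ChebPow', 'rcut': 5.5}
```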

Upfitting#

If you want to continue fitting an existing potential from a potential.yaml file, then specify

potential: potential.yaml

Alternatively, one could use the pacemaker ... -p potential.yaml option.

To specify both the initial and the target potential from files, one can provide:

potential: 
  filename: potential.yaml

  ## in "ladder" fitting scheme, potential from which to start the fit
  # initial_potential: initial_potential.yaml

  ## reset the potential from potential.yaml, i.e. set radial coefficients to delta_nk and function coefficients to zero
  # reset: true 

or, alternatively, one could use the pacemaker ... -p potential.yaml -ip initial_potential.yaml options.

Fitting settings#

An example of the fit section is:

fit:
    ## LOSS FUNCTION OPTIONS ##
    loss: {
      ## [0..1] or auto, relative force weight,
      ## kappa = 0 - energies-only fit,
      ## kappa = 1 - forces-only fit
      ## auto - determined from dataset based on variance of energies and forces
      kappa: 0,
      ## L1-regularization coefficient
      L1_coeffs: 0,
      ## L2-regularization coefficient
      L2_coeffs: 0,
      ## w0 radial smoothness regularization coefficient
      w0_rad: 0,
      ## w1 radial smoothness regularization coefficient
      w1_rad: 0,
      ## w2 radial smoothness regularization coefficient
      w2_rad: 0  
    }

    ## DATA WEIGHTING OPTIONS ##
    weighting: {
        ## weights for the structures' energies/forces are assigned according to the distance to E_min:
        ## convex hull (energy: convex_hull) or minimal energy per atom (energy: cohesive)
        type: EnergyBasedWeightingPolicy,
        ## number of structures to randomly select from the initial dataset
        nfit: 10000,         
        ## only the structures with energy up to E_min + DEup will be selected
        DEup: 10.0,  ## eV, upper energy range (E_min + DElow, E_min + DEup)        
        ## only the structures with maximal force on atom  up to DFup will be selected
        DFup: 50.0, ## eV/A
        ## lower energy range (E_min, E_min + DElow)
        DElow: 1.0,  ## eV
        ## delta_E  shift for weights, see paper
        DE: 1.0,
        ## delta_F  shift for weights, see paper
        DF: 1.0,
        ## 0<wlow<1 or None: if provided, renormalizes the weights of the structures in the lower energy range (see DElow)
        wlow: 0.75,        
        ##  "convex_hull" or "cohesive" : method to compute the E_min
        energy: convex_hull,        
        ## structures types: all (default), bulk or cluster
        reftype: all,        
        ## random number seed
        seed: 42 
    }

    ## Custom weights: a pckl.gzip file with `w_energy` and `w_forces` columns,
    ## indexed consistently with the main dataset, should be provided
    #weighting: {type: ExternalWeightingPolicy, filename: custom_weights_only.pckl.gzip}

    ## OPTIMIZATION OPTIONS ##
    optimizer: BFGS # BFGS, L-BFGS-B, Nelder-Mead, etc. : scipy minimization algorithm
    ## additional options for scipy.minimize(..., options={...}, ...)
    #options: {maxcor: 100}    
    maxiter: 1000 # maximum number of iterations for EACH scipy minimization round

    ## EXTRA OPTIONS ##
    repulsion: auto            # set inner cutoff based on the minimal distance in the dataset

    #trainable_parameters: ALL  # ALL, UNARY, BINARY, ..., radial, func, {"AlNi": "func"}, {"AlNi": {"func","radial"}}, ...

    ## (optional) number of consecutive runs of the fitting algorithm (for each ladder step); helps convergence
    #fit_cycles: 1   

    ## starting from second fit_cycle:

    ## applies Gaussian noise with specified relative sigma/mean ratio to all potential trainable coefficients
    #noise_relative_sigma: 1e-3

    ## applies Gaussian noise with specified absolute sigma to all potential trainable coefficients
    #noise_absolute_sigma: 1e-3  

    # reset the function coefficients according to Gaussian distribution with given sigma; enable ensemble fitting mode
    #randomize_func_coeffs: 1e-3  

    ## LADDER SCHEME (i.e. hierarchical fitting) ##      
    ## enables hierarchical fitting (LADDER SCHEME), which sequentially adds a specified number of B-functions (LADDER STEP)
    #ladder_step: [10, 0.02]  
    ##      - integer >= 1 - number of basis functions to add in ladder scheme,
    ##      - float between 0 and 1 - relative ladder step size wrt. current basis step
    ##      - list of both above values - select maximum between two possibilities on each iteration 
    ##     see "Ladder scheme fitting" for more info


    ## Possible values for ladder_type:
    ## body_order  -  new basis functions are added according to the body-order, i.e., a function with higher body-order
    ##                will not be added until the list of functions of the previous body-order is exhausted
    ## power_order -  the order of adding new basis functions is defined by the "power rank" p of a function.
    ##                p = len(ns) + sum(ns) + sum(ls). Functions with the smallest p are added first      
    #ladder_type: body_order    


    ## callbacks during the fitting. Module quick_validation.py should be available for import
    ## see example/pacemaker_with_callback for more details and examples
    #callbacks:
    #  - quick_validation.test_fcc_potential_callback
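For the commented-out ExternalWeightingPolicy above, the weights file is a pickled dataframe with w_energy and w_forces columns. The sketch below builds a toy version with uniform placeholder weights; the per-atom layout of w_forces is an assumption here for illustration (see notebooks/data_custom_weights.ipynb for a realistic recipe):

```python
import numpy as np
import pandas as pd

# toy custom-weights dataframe for ExternalWeightingPolicy:
# uniform placeholder weights for 3 structures with 4 atoms each
# (the per-atom layout of `w_forces` is an illustrative assumption)
n_structures = 3
weights = pd.DataFrame({
    "w_energy": np.full(n_structures, 1.0 / n_structures),
    "w_forces": [np.full(4, 1.0 / (n_structures * 4))] * n_structures,
})

weights.to_pickle("custom_weights_only.pckl.gzip", compression="gzip")
```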

If not specified, then uniform weighting, an energy-only fit (kappa=0), fit_cycles=1, and noise_relative_sigma=0 will be used.

If the ladder fitting scheme is used, then an intermediate version of the potential after each ladder step will be saved into interim_potential_ladder_step_{LADDER_STEP}.yaml.

Backend specification#

backend:
  evaluator: tensorpot  # pyace, tensorpot

  ## for the `tensorpot` evaluator, the following options are available:
  # batch_size: 10            # batch size for loss function evaluation, default is 10
  # batch_size_reduction: True # automatic batch_size reduction if not enough memory (default - True) 
  # batch_size_reduction_factor: 1.618  # batch size reduction factor
  # display_step: 20          # frequency of detailed metric calculation and printing

  ## for the `pyace` evaluator, the following options are available:
  # parallel_mode: process    # process, serial  - parallelization mode for `pyace` evaluator
  # n_workers: 4              # number of parallel workers for `process` parallelization mode

Alternatively, the backend can be selected via pacemaker ... -b tensorpot

Ladder (hierarchical) basis extension#

In a ladder scheme, the potential is extended by adding new portions of basis functions step by step, forming a "ladder" from the initial potential to the final potential. The following settings should be added to the input YAML file:

  • Specify the final potential shape by providing the potential section:
potential:
  deltaSplineBins: 0.001
  element: Al
  ...
  • Specify the initial potential by providing the initial_potential option in the potential section:
potential:
    ...
    initial_potential: some_start_or_interim_potential.yaml    # potential to start fit from

If the initial potential is not specified, the fit will start from an empty potential. Alternatively, you can specify the initial potential via the command-line option

pacemaker ... -ip some_start_or_interim_potential.yaml

  • Specify ladder_step in fit section:
fit:
    ...

    ladder_step: [10, 0.02]       
## Possible values:
##  - integer >= 1 - number of basis functions to add in ladder scheme,
##  - float between 0 and 1 - relative ladder step size wrt. current basis step
##  - list of both above values - select maximum between two possibilities on each iteration  
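Read as a rule, a list-valued ladder_step takes the maximum of the two interpretations on each iteration. The helper below only illustrates that documented rule; it is not pacemaker's implementation:

```python
def ladder_increment(current_size, ladder_step):
    """Number of basis functions to add on the next ladder step:
    an integer entry is an absolute count, a float < 1 is a fraction of the
    current basis size, and the maximum over all entries is taken."""
    increments = []
    for step in ladder_step:
        if step < 1:
            increments.append(int(step * current_size))
        else:
            increments.append(int(step))
    return max(increments)

print(ladder_increment(100, [10, 0.02]))   # -> 10  (max of 10 and 2)
print(ladder_increment(1000, [10, 0.02]))  # -> 20  (max of 10 and 20)
```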

See example/ladder_fit_pyace.yaml and example/ladder_fit_tensorpot.yaml for example input files.