[TCOD] TCOD dictionaries and data presentation -- 3 levels of description

Stefaan Cottenier Stefaan.Cottenier at UGent.be
Tue Jul 29 13:31:26 UTC 2014


Hello everybody,

You did a great job, Saulius, digesting this diverse bunch of opinions
into a sound compromise! Some comments (I collect several of the
previous mails hereafter):

> I am toying with the idea that TCOD should have 3 levels of structure
> description:

Yes, that makes sense. Your rationale for the three levels is really
good (one comment about level 1, though - see hereafter).

>
> -- level 0: - cell constants; - atomic coordinates; - literature
> reference.
>
> Standard CIF dictionaries are enough for such description.

What I feel is missing in this minimalistic set is the XC functional.
Without that information, the relevance of predicted unit cell
information is limited. (It can be retrieved from the literature
reference, sure. And if people complete level 1, then the info is
available anyway. But as this is really necessary information, it would
be good to force users to include it by asking for it at the mandatory
level 0.)

The question then immediately arises of how to identify the
XC functional unambiguously (a question that would arise equally well if
this were kept in level 1). One possibility would be to use the
identifiers used in LIBXC
(http://www.tddft.org/programs/octopus/wiki/index.php/Libxc). LIBXC is
widely accepted, code-independent and open source.

Specifying the XC functional limits us to DFT only, however. Perhaps the
keyword should rather be 'level of theory'. If a LIBXC identifier is
given, this implies it is DFT with the quoted functional. If the value
is not a LIBXC identifier, it refers to a non-DFT method (Hartree-Fock,
GW, QMC, ...). It might be tricky to create a relevant list of non-DFT
methods, and right now it is perhaps of limited use, but the number of
predictions by these methods is likely to surge in the coming years.



[added later: OK, the _tcod_model variable more or less does this, as I
see now. Hence, my comment boils down to replacing the value 'DFT' by
the LIBXC identifier for the relevant functional]
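To make the 'level of theory' idea concrete, here is a minimal sketch of how a validator could classify such a tag. The identifier sets below are only tiny illustrative subsets (the LIBXC names are real, but a real implementation would query LIBXC itself for the full list), and the function name is my own invention:

```python
# Illustrative subset of LIBXC functional identifiers (assumption:
# a real check would use the full list shipped with LIBXC).
LIBXC_IDS = {"LDA_X", "LDA_C_PW", "GGA_X_PBE", "GGA_C_PBE",
             "HYB_GGA_XC_B3LYP"}

# Hypothetical list of non-DFT method labels; compiling a definitive
# list is exactly the tricky part discussed above.
NON_DFT_METHODS = {"HF", "MP2", "GW", "QMC", "CCSD(T)"}

def classify_level_of_theory(tag):
    """Return 'DFT' if every token is a LIBXC identifier,
    'non-DFT' if the tag names a known non-DFT method,
    and 'unknown' otherwise."""
    tokens = tag.replace("+", " ").split()
    if tokens and all(t in LIBXC_IDS for t in tokens):
        return "DFT"
    if tag in NON_DFT_METHODS:
        return "non-DFT"
    return "unknown"
```

So a value like "GGA_X_PBE+GGA_C_PBE" would imply DFT with PBE, while "GW" would be recognized as a non-DFT method.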
>
> -- level 1: - same as in level 0, plus any parameters that permit
> qualified person to judge if the structure has converged and how
> good it is. The parameters should include residual force on atoms
> (2014-02-10 16:16, Björkman Torbjörn), energy change(s) in the last
> cycle(s), and references to basis set, pseudo-potentials, XC
> functionals, etc. (as described in our dictionaries and in
> http://www.xml-cml.org/dictionary/compchem/). Basis sets can be
> referenced as in https://bse.pnl.gov/bse/portal;

I might be ruining the beautiful simplicity of the scheme now, but it
seems to me that the present level 1 covers two different things. What
about splitting it into two levels:

* One that lists the main technical settings (basis set,
pseudopotentials, k-mesh, XC functional [although the latter might
belong at level 0]), such that a qualified person can assess the
credibility of the calculation.

* One that lists information to assess the degree of convergence.

The former is information at the input stage, the latter at the output
stage. Two different aspects. The advantage of splitting the level is
that probably fewer people would take the effort to collect the
convergence info; if both were kept in one level, such people might then
discard that entire level, settings included.
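The split above can be sketched as two separate records, where the output-stage part is optional. All field names here are illustrative placeholders, not actual TCOD tags:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TechnicalSettings:
    """Input-stage settings: lets a qualified person judge credibility.
    Field names are hypothetical, not real TCOD data names."""
    basis_set: str
    pseudopotentials: str
    k_mesh: tuple
    xc_functional: str  # could arguably live at level 0 instead

@dataclass
class ConvergenceInfo:
    """Output-stage information: lets one judge convergence."""
    max_residual_force: float
    energy_change_last_cycle: float

@dataclass
class Level1Entry:
    settings: TechnicalSettings
    # Convergence info can be omitted without discarding the settings.
    convergence: Optional[ConvergenceInfo] = None
```

The point of the optional field is exactly the argument above: a submitter unwilling to dig up convergence data can still supply the settings.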

> One more related question:
>
> If I understand correctly, from what I recall from the QM lectures,
> we can have in principle two kinds of boundary conditions for
> localised particle wave functions:
>
> a) vanishing at the infinity (modelling single molecule in vacuum),
> and b) periodic (modelling an ideal crystal).
>
> From what I have read in manuals of the QM codes, most implement b),
> and a) is approximated by putting a molecule in a large enough "unit
> cell" so that interactions between molecule images are negligible.
>
> Is this a correct view? Does any code implement (a) as a separate
> mode of computation?
>
> In either case, we should probably have a special tag that
> distinguishes "true" crystal structures from the "convenience" unit
> cells that are non-physical but are set up solely to solve a
> molecule structure problem with the same code that also deals with
> crystals. Any ideas how to tell from the computations which mode was
> used?

Yes, this looks to be correct. There are some codes that have a 3D
(crystals), 2D (surfaces) and 1D (molecules) implementation, without
mimicking the missing dimension by a finite vacuum region (FLEUR does
this, for instance).

Since the goal of TCOD is to document truly infinite crystals, however,
there will be no calculations for molecules in a big, almost empty unit
cell in the database anyway. What could occur are non-periodic cluster
models that mimic infinite crystals -- rather the opposite situation.
Having a keyword that differentiates between 'periodic' and
'non-periodic' calculations would be sufficient to filter these.

> I still hold to the statement that it is almost impossible to
> reproduce the exact results using two different codes even though the
> settings (xc functional, basis, pseudopotential, k-point sampling
> etc.) are the same. The numerical implementations could be
> significantly different and there are always a bunch of hard-coded
> parameters which are different for different implementations.

Interesting statement, as we are currently working on a quantitative
proof of what you say (Torbjörn is involved in this as well). You can
watch a 15-minute talk on this topic at
https://www.youtube.com/pasSSaMMnnE , and inspect a snapshot of the
current results at https://molmod.ugent.be/deltacodesdft. [so far the
advertisement ;-) ]

> I don't think it is so important to have the energy convergence
> reported. It's a crystal structure database, so it is probably
> enough to report 8 things from the last step:
>
> Max Force on Atoms
> RMS Force on Atoms
> Max Displacement on Atoms
> RMS Displacement on Atoms
>
> If the cell optimization is performed:
>
> Max Force on Cell
> RMS Force on Cell
> Max Displacement on Cell
> RMS Displacement on Cell
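For the per-atom quantities in the quoted list, the computation is straightforward; a minimal sketch from a final-step force array (the cell analogues would instead use the stress tensor and cell-parameter changes, where available):

```python
import math

def force_summary(forces):
    """forces: list of (fx, fy, fz) per atom, e.g. in eV/Angstrom.
    Returns (max |F|, RMS |F|) over all atoms, the first two
    quantities of the quoted list."""
    norms = [math.sqrt(fx * fx + fy * fy + fz * fz)
             for fx, fy, fz in forces]
    rms = math.sqrt(sum(n * n for n in norms) / len(norms))
    return max(norms), rms
```

The same two-number summary (max and RMS) applies unchanged to the displacement vectors of the atoms.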

This illustrates why I hesitate to keep this information within level 1:
not all codes can provide it. Forces are probably available everywhere,
but cell optimization by 'forces' requires the stress tensor formalism.
That's peanuts for plane-wave codes, but it has not been developed for
all of the more involved basis sets. There is not a single LAPW code,
for instance, that can optimize a unit cell in that way (they have to
resort to energy minimization, which is fair as long as the symmetry of
the cell is sufficiently high).

> As I said, this could be extremely tricky. Don't think it makes much
>  sense to store the output logs, since they can get huge. What we
> could do, is to define a couple (3-5) of "Reference codes" which
> could be supported for checking/benchmarking purposes. VASP,
> QuantumEspresso, GPAW? Which inputs we could store and structures
> submitted from other codes should stick to those benchmarks. As far
> as I know some of the databases use this in their philosophy (e.g.
> they would only accept VASP results calculated using some minimum
> criteria). In this way one can be consistent.

I agree that level 2 is the tricky one. One has to be careful here to
create a pragmatic solution without investing too much time. What I see
as a fair target is that those (few?) people who are willing to provide
information for a full recreation of their results should be able to do
so. Full stop. The amount of work it takes on their side is not our
worry (for some codes this will be far easier to do than for others).
Perhaps the most pragmatic thing to do is to provide the possibility to
insert a verbatim section with the required input files, code version
and run commands needed to reproduce the results, without any checking
of whether or not this information is actually complete.

Working with reference codes, hmm, ... tricky as well. This implicitly
sends the message that TCOD favours/promotes some codes over others.
Moreover, minimal requirements may evolve over time. What about this
alternative: provide a separate algorithm that can, for a given code,
assess whether or not a given entry is sufficiently reliable. If crucial
information needed for this assessment is missing, the algorithm answers
'unable to decide'. The advantage is that such algorithms are not baked
into the TCOD entries, but are stand-alone. They can evolve over time
(thresholds can be raised, and more refined quality criteria can be
added), different people can each contribute an algorithm for a
different code (an expert on one code writes the algorithm for that
specific code), and it carries no implicit quality judgement (if someone
is unhappy that the algorithm for his/her pet code is missing, they can
provide one themselves), etc.
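The three-valued outcome of such a stand-alone check could look like the sketch below. The threshold value and the tag name are invented for illustration; in the scheme proposed above they would be chosen by the expert writing the check for a specific code:

```python
def assess_entry(entry, max_force_threshold=0.05):
    """Stand-alone reliability check for one (hypothetical) code.
    Returns 'reliable', 'unreliable', or 'unable to decide'.
    'max_residual_force' is an illustrative tag name, not a real
    TCOD data name; the threshold is likewise an assumption."""
    force = entry.get("max_residual_force")
    if force is None:
        # Crucial information missing: refuse to judge.
        return "unable to decide"
    return "reliable" if force <= max_force_threshold else "unreliable"
```

Because the check lives outside the database entries, tightening the threshold later changes the verdicts without touching any stored data.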

>> In either case, we should probably have a special tag that
>> distinguishes "true" crystal structures from the "convenience" unit
>> cells that are non-physical but are set up solely to solve a
>> molecule structure problem with the same code that also deals with
>> crystals. Any ideas how to tell from the computations which mode
>> was used?
>
> We should be only interested in periodic structures. Or I would be
> even more strict - 3D periodic (crystals). We should not worry about
>  the molecules or clusters... or leave it to others :-) We just
> should not have any limitation on the supercell size of the
> structure, but in general it should be strictly 3D periodic for this
> database. That would make life much easier and give the database
> some shape on what we are trying to systematize.

I agree with Linas. The 'C' in TCOD excludes the need to deal with
molecules in vacuo. That's an entirely different world (with even more
degrees of freedom than in our 3D crystal world), and there are other
databases that deal with these objects.

> As a person affiliated with molecular biology and drug design, I
> would be of course also interested in comparing crystal structures
> with gas-phase (or I would rather say in vacuo) QM optimised
> structures; but these, as Peter has told me, are much more diverse
> and thus more difficult to manage in a coherent database. So we can
> postpone their addition; in any case, we'll have to flag them
> carefully and probably keep separately (different ID range or
> namespace?).

Saulius, you seem to be interested in formation energies of molecular
crystals (the energy difference between the actual crystal and the
individual molecules)? Although from an experimental point of view this
information is considered a property of the crystal (and therefore could
have its place in a crystal database), it is actually a difference
between a solid-state property and a molecular property. As most
computational methods are tailored towards either molecules or solids
(and have severe limitations for the other class), such formation
energies are bound to have problems. I agree this would be useful
information, but it is hard to produce really accurate numbers, and I
doubt whether many people will have data that are worth adding to a
database.

Best,
Stefaan
