Hello everybody,
You did a great job, Saulius, digesting this diverse bunch of opinions into a sound compromise! Some comments (I collect several of the previous mails hereafter):
I am toying with the idea that TCOD should have 3 levels of structure description:
Yes, that makes sense. Your rationale for the three levels is really good (one comment about level 1, though - see hereafter).
-- level 0:
   - cell constants;
   - atomic coordinates;
   - literature reference.
Standard CIF dictionaries are enough for such description.
What I feel is missing from this minimalistic set is the XC functional. Without that information, the relevance of predicted unit cell information is limited. (It can be retrieved from the literature reference, sure. And if people complete level 1, then the info is available anyway. But as this is really necessary information, it would be good to force users to include it by asking for it at the mandatory level 0.)
The question then immediately arises how to identify the XC functional unambiguously (a question that would arise equally well if this were kept in level 1). One possibility would be to use the identifiers used in LIBXC (http://www.tddft.org/programs/octopus/wiki/index.php/Libxc). This is widely accepted, code-independent and open source.
Specifying the XC functional limits us to DFT only. Perhaps the keyword should rather be 'level-of-theory'. If a LIBXC identifier is given, then this implies it is DFT with the quoted functional. If the value is not a LIBXC identifier, it refers to a non-DFT method (Hartree-Fock, GW, QMC, ...). It might be tricky to create a relevant list of non-DFT methods, and right now it is perhaps of limited use, but the number of predictions by these methods is likely going to surge in the coming years.
[added later: OK, the _tcod_model variable more or less does this, as I see now. Hence, my comment boils down to replacing the value 'DFT' by the LIBXC identifier for the relevant functional]
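To make this concrete: a minimal Python sketch of how such a rule could work, assuming a hypothetical 'level-of-theory' keyword and a deliberately tiny subset of LIBXC names (nothing here is an agreed TCOD convention):

    # Hypothetical rule for a 'level-of-theory' keyword: if the value is a
    # (combination of) LIBXC identifier(s), the entry is DFT with that
    # functional; otherwise it names a non-DFT method.

    # Deliberately tiny subset of LIBXC functional names, for illustration
    # only -- the authoritative list lives in LIBXC itself.
    LIBXC_NAMES = {"LDA_X", "LDA_C_PW", "GGA_X_PBE", "GGA_C_PBE"}

    NON_DFT_METHODS = {"HF", "GW", "QMC"}  # illustrative, surely incomplete

    def interpret_level_of_theory(value: str) -> str:
        parts = [p.strip().upper() for p in value.split("+")]
        if all(p in LIBXC_NAMES for p in parts):
            return "DFT, XC functional: " + "+".join(parts)
        if value.strip().upper() in NON_DFT_METHODS:
            return "non-DFT method: " + value.strip().upper()
        return "unable to classify"

    print(interpret_level_of_theory("GGA_X_PBE+GGA_C_PBE"))  # DFT
    print(interpret_level_of_theory("GW"))                   # non-DFT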
-- level 1: same as in level 0, plus any parameters that permit a qualified person to judge if the structure has converged and how good it is. The parameters should include the residual force on atoms (2014-02-10 16:16, Björkman Torbjörn), the energy change(s) in the last cycle(s), and references to basis sets, pseudo-potentials, XC functionals, etc. (as described in our dictionaries and in http://www.xml-cml.org/dictionary/compchem/). Basis sets can be referenced as in https://bse.pnl.gov/bse/portal;
I might be ruining the beautiful simplicity of the scheme now, but it seems to me that the present level 1 covers two different things. What about splitting this into two levels:
* One that lists the main technical settings (basis set, pseudopotentials, k-mesh, XC [although the latter might go to level 0]), such that a qualified person can assess the credibility of the calculation.
* One that lists information to assess the level of convergence.
The former is information at the input stage, the latter at the output stage. Two different aspects. The advantage of splitting is that probably fewer people will take the effort to collect the convergence info; if both aspects sit in one level, such people might discard that level entirely, whereas after the split they can at least still provide the settings.
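A minimal sketch of what the two levels could contain, with entirely made-up field names (these are not existing CIF/TCOD tags):

    # Sketch of the proposed split of the current level 1; all field names
    # and values are invented for illustration.

    # (a) Technical settings -- fixed at the INPUT stage:
    level_1a_settings = {
        "basis_set": "plane waves, 500 eV cutoff",
        "pseudopotentials": "PAW dataset (hypothetical)",
        "k_mesh": "8x8x8 Monkhorst-Pack",
        "xc_functional": "GGA_X_PBE+GGA_C_PBE",  # or already at level 0
    }

    # (b) Convergence indicators -- known only at the OUTPUT stage:
    level_1b_convergence = {
        "max_residual_force_eV_per_A": 0.004,
        "energy_change_last_cycle_eV": 1e-6,
    }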
One more related question:
If I understand correctly, from what I recall from the QM lectures, we can in principle have two kinds of boundary conditions for localised particle wave functions:
a) vanishing at infinity (modelling a single molecule in vacuum), and b) periodic (modelling an ideal crystal).
From what I have read in the manuals of the QM codes, most implement b), and a) is approximated by putting a molecule in a large enough "unit cell" so that interactions between molecule images are negligible.
Is this a correct view? Does any code implement (a) as a separate mode of computation?
In either case, we should probably have a special tag that distinguishes "true" crystal structures from the "convenience" unit cells that are non-physical but are set up solely to solve a molecule structure problem with the same code that also deals with crystals. Any ideas how to tell from the computations which mode was used?
Yes, this looks to be correct. There are some codes that have a 3D (crystals), 2D (surfaces) and 1D (molecules) implementation, without mimicking the missing dimension by a finite vacuum (FLEUR does this, for instance).
However, as the goal of TCOD is to document truly infinite crystals, there will be no calculations for molecules in a big, almost empty unit cell in the database anyway. What could occur are non-periodic cluster models that mimic infinite crystals -- rather the opposite situation. Having a keyword that differentiates between 'periodic' and 'non-periodic' calculations would be sufficient to filter these.
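A trivial sketch of such a filter, assuming a hypothetical periodicity keyword (name and values invented):

    # Filtering database entries on a hypothetical periodicity keyword
    # that separates periodic calculations from non-periodic cluster models.
    entries = [
        {"id": "entry-1", "periodicity": "periodic"},      # true crystal
        {"id": "entry-2", "periodicity": "non-periodic"},  # cluster model
    ]

    periodic_entries = [e for e in entries if e["periodicity"] == "periodic"]
    print(periodic_entries)  # keeps only entry-1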
I still hold to the statement that it is almost impossible to reproduce exact results using two different codes, even when the settings (XC functional, basis, pseudopotential, k-point sampling, etc.) are the same. The numerical implementations can differ significantly, and there is always a bunch of hard-coded parameters that differ between implementations.
Interesting statement, as we are currently working on a quantitative proof of exactly that (Torbjörn is involved in this as well). You can watch a 15-minute talk on this topic at https://www.youtube.com/pasSSaMMnnE and inspect a snapshot of the current results at https://molmod.ugent.be/deltacodesdft. [so far the advertisement ;-) ]
I don't think it is so important to have the energy convergence reported. It's a crystal structure database, so it is probably enough to report 8 things from the last step:
- Max Force on Atoms
- RMS Force on Atoms
- Max Displacement on Atoms
- RMS Displacement on Atoms
If the cell optimization is performed:
- Max Force on Cell
- RMS Force on Cell
- Max Displacement on Cell
- RMS Displacement on Cell
This illustrates why I hesitate to keep this information within level 1: not all codes can provide it. Forces are probably available everywhere, but cell optimization by 'forces' requires the stress-tensor formalism. That's peanuts for plane-wave codes, but it has not been developed for all of the more involved basis sets. There is not a single LAPW code, for instance, that can optimize a unit cell in that way (they have to resort to energy minimization, which is fair as long as the symmetry of the cell is sufficiently high).
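For concreteness, a minimal Python sketch of how these max/RMS indicators could be computed from raw arrays of the last step (array values and units are purely illustrative):

    # Computing max/RMS indicators from raw arrays of the last
    # optimization step (illustrative values; eV/A for forces, A for
    # displacements).
    import numpy as np

    def max_and_rms(vectors):
        """Max and RMS of the per-row vector norms."""
        norms = np.linalg.norm(np.asarray(vectors), axis=1)
        return norms.max(), float(np.sqrt(np.mean(norms**2)))

    forces_on_atoms = [[0.001, 0.0, 0.002], [0.0, 0.003, 0.0]]
    displacements_of_atoms = [[0.01, 0.0, 0.0], [0.0, 0.02, 0.0]]

    print("Max/RMS force on atoms:       ", max_and_rms(forces_on_atoms))
    print("Max/RMS displacement on atoms:", max_and_rms(displacements_of_atoms))
    # If the cell is optimized too, the same reduction would be applied to
    # the stress/strain-derived "forces" and "displacements" on the cell.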
As I said, this could be extremely tricky. I don't think it makes much sense to store the output logs, since they can get huge. What we could do is define a couple (3-5) of 'reference codes' that would be supported for checking/benchmarking purposes. VASP, QuantumEspresso, GPAW? We could store the inputs for these, and structures submitted from other codes should stick to those benchmarks. As far as I know, some of the databases use this in their philosophy (e.g. they would only accept VASP results calculated using some minimum criteria). In this way one can be consistent.
I agree that level 2 is the tricky one. One has to be careful here to create a pragmatic solution without investing too much time. What I see as a fair target is that those (few?) people who are willing to provide information for a full recreation of their results should be able to do that. Full stop. The amount of work it takes on their side is not our worry (for some codes this will be far easier to do than for others). Perhaps the most pragmatic thing to do is to provide the possibility to insert a verbatim section with the required input files, code version and run commands needed to reproduce the results, without any check on whether or not this information is actually complete.
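A minimal sketch of what such a verbatim record could contain (all field names hypothetical; the content would be stored as-is, unchecked):

    # Sketch of a verbatim 'reproduction' record; field names are invented
    # and the content is stored exactly as submitted.
    reproduction_record = {
        "code": "some-dft-code",          # hypothetical code name
        "code_version": "x.y.z",
        "run_command": "mpirun -np 16 some-dft-code < input.in",
        "input_files": {"input.in": "verbatim file contents go here"},
    }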
Working with reference codes, hmm... Tricky as well. This implicitly gives the message that TCOD favours/promotes some codes more than others. Moreover, minimal requirements may evolve over time. What about this alternative: provide a separate algorithm that can, for a given code, assess whether or not a given entry is sufficiently reliable. If crucial information for this assessment is missing, the algorithm answers 'unable to decide'. The advantage is that such algorithms are not baked into the TCOD entries, but stand alone. They can evolve over time (thresholds can be increased, and more refined criteria to assess quality can be included), different people can each contribute a different algorithm (an expert on one code writes the algorithm for that specific code), it carries no implicit quality ranking (if someone is unhappy that the algorithm for his/her pet code is missing, they can provide one themselves), etc.
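A minimal sketch of what such a stand-alone assessor could look like (thresholds and field names invented for illustration):

    # Stand-alone, per-code quality assessor: it is not baked into the
    # TCOD entries, so thresholds and criteria can evolve independently.

    def assess_entry_for_code_x(entry: dict) -> str:
        """Return 'reliable', 'not reliable' or 'unable to decide'."""
        cutoff = entry.get("plane_wave_cutoff_eV")
        kpts = entry.get("k_points_per_axis")
        if cutoff is None or kpts is None:
            return "unable to decide"        # crucial information missing
        if cutoff >= 400 and kpts >= 6:      # thresholds may be raised later
            return "reliable"
        return "not reliable"

    print(assess_entry_for_code_x({"plane_wave_cutoff_eV": 500,
                                   "k_points_per_axis": 8}))     # reliable
    print(assess_entry_for_code_x({"plane_wave_cutoff_eV": 200}))  # unable to decide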
In either case, we should probably have a special tag that distinguishes "true" crystal structures from the "convenience" unit cells that are non-physical but are set up solely to solve a molecule structure problem with the same code that also deals with crystals. Any ideas how to tell from the computations which mode was used?
We should only be interested in periodic structures. Or I would be even more strict: 3D-periodic (crystals). We should not worry about molecules or clusters... or leave them to others :-) We just should not put any limitation on the supercell size of the structure, but in general it should be strictly 3D-periodic for this database. That would make life much easier and give the database a clear scope for what we are trying to systematize.
I agree with Linas. The 'C' in TCOD excludes the need to deal with molecules in vacuo. That's an entirely different world (with even many more degrees of freedom than our 3D crystal world), and there are other databases that deal with these objects.
As a person affiliated with molecular biology and drug design, I would of course also be interested in comparing crystal structures with gas-phase (or I would rather say in vacuo) QM-optimised structures; but these, as Peter has told me, are much more diverse and thus more difficult to manage in a coherent database. So we can postpone their addition; in any case, we'll have to flag them carefully and probably keep them separately (different ID range or namespace?).
Saulius, you seem to be interested in formation energies of molecular crystals (the energy difference between the actual crystal and the individual molecules)? Although from an experimental point of view this information is considered a property of the crystal (and could therefore have its place in a crystal database), it is actually a difference between a solid-state and a molecular property. As most computational methods are tailored towards either molecules or solids (and have severe limitations on the other class), such formation energies are bound to have problems. I agree this would be useful information, but it is hard to produce really accurate numbers, and I doubt whether many people will have data worth adding to a database.
Best, Stefaan