[TCOD] Questions still left regarding TCOD dictionaries

Lubomir Smrcok Lubomir.Smrcok at savba.sk
Thu Nov 6 07:00:32 UTC 2014


The more you store, the less you get ... :-) 
lubo 


----- Original Message -----

From: "Linas Vilciauskas" <linas.vilciauskas at nyu.edu> 
To: tcod at lists.crystallography.net 
Sent: Wednesday, November 5, 2014 6:53:26 PM 
Subject: Re: [TCOD] Questions still left regarding TCOD dictionaries 

Dear Saulius and TCOD'ers 

One of the first questions for me when looking at the new dictionary was the _tcod_structure_type entry. I think we are mixing up several terminologies from the experimental and computational worlds in ways that simply do not make sense. I understand "crystal-ground-state" and "crystal-excited-state", but I honestly have no idea what "crystal-metastable" means, especially when a lifetime is involved. Kinetics is virtually inaccessible in the kind of simulations we are talking about and should be left out of the discussion. By a metastable structure, theorists usually mean a stable structure (a local minimum) on the potential (free) energy surface that is not the global minimum of that surface. The two minima might be separated by a high barrier (and there might be defects, etc.), which to an experimentalist makes the structure show up as kinetically stabilized (metastable), with a finite lifetime (seconds to thousands of years). So what experimentalists call a metastable structure could be a perfectly good local minimum in DFT (the same as "crystal-ground-state"), which might even show up as the global minimum depending on the theory level used or, e.g., the inclusion of entropic effects. 

My bottom line is that in DFT there is nothing like a lifetime... 

I'm also not entirely sure what the difference is between "crystal-soft-phonon" and "crystal-transition-state". Is there going to be a cut-off frequency below which phonons are called "soft"? Or are imaginary-frequency phonons meant, which would imply a negative Hessian eigenvalue? Would that make the structure a transition state then? 
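
Just to make the Hessian point concrete, here is how I would classify a stationary point from the eigenvalues of a (mass-weighted) Hessian -- a minimal Python sketch of my own, not anything from the dictionary: 

    import numpy as np

    def classify_stationary_point(hessian, tol=1e-6):
        """Classify a stationary point by its (mass-weighted) Hessian.
        A negative eigenvalue corresponds to an imaginary vibrational
        frequency; exactly one such mode is the textbook first-order
        transition state, none means a (local) minimum."""
        eigenvalues = np.linalg.eigvalsh(hessian)  # symmetric -> real spectrum
        n_negative = int(np.sum(eigenvalues < -tol))
        if n_negative == 0:
            return "minimum"
        if n_negative == 1:
            return "first-order transition state"
        return "higher-order saddle (%d imaginary modes)" % n_negative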

If one includes transition states, I think one has to add a massive dictionary describing the plethora of methods for locating them. At some point this becomes intractable... My suggestion would be to drop all of these keywords completely: crystal-metastable, crystal-soft-phonon, crystal-transition-state, vacuum-ground-state, vacuum-excited-state, vacuum-metastable, and vacuum-transition-state. 

At the end of the day, this is a crystallographic database, and it should have a clearly defined face and purpose, meaning that we should not care about (or accept) molecules (vacuum structures). 


On Wed, Nov 5, 2014 at 11:34 AM, Björkman Torbjörn <torbjorn.bjorkman at aalto.fi> wrote: 


Dear all, 

Stefaan supplies good information for everything as usual, so just a couple of remarks here. 

>>> f) If e) is "yes", then we can talk about "microcycles" (electron w/f 
>>> refinement) and "macrocycles" (nuclei/cell shifted in each 
>>> "macrocycle"). We could also document total energy changes in all these 
>>> cycles, to monitor the convergence, as I suggest in 
>>> _tcod_computation_cycle_... data items. Is such a view acceptable? What is 
>>> the terminology used in different codes? Will such a table be useful? Will 
>>> it be easy to obtain from most codes? 
>> 
>> I advise staying away from that. Your view is correct, but this 
>> information is often not mentioned even in papers. Documenting such 
>> changes would be arbitrary, to some extent. The final relaxed geometry 
>> is well-defined, but what do you take as the unrelaxed starting point...? 
> 
> The idea is that the unrelaxed structure starts somewhere at high 
> energies, and then converges to a low energy that no longer changes 
> significantly with the refinement cycles. For me, this would be evidence 
> that the process has converged. I guess when one does a calculation, one 
> looks at such energy traces, doesn't one? 
> 
> If so, then it makes sense to have tools to record the traces in CIFs, 
> as evidence of convergence and for convergence checks. 
> 
> I also want to point out that the presence of the data items (for energy 
> tables) in the dictionary does not imply any obligation to use them. Any 
> CIF will be correct and valid without them. It's just that if we decide to 
> include them, there will be a publicly announced way to do so. 

Here I would say that what one normally pays attention to is just the residual forces/energy changes at convergence; documenting the starting point is not something that is normally done. We should also be a little wary here, since the starting point is very often data taken straight out of proprietary databases... it might well lead to a lot of trouble. 




I agree with Stefaan and Torbjörn. The only things which might be useful are the final structure and some information on residual (maximum) forces, and maybe energy changes... 
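
The force check, at least, is trivial to automate. A minimal sketch, assuming the forces arrive as an (N, 3) array in eV/A; the 0.05 eV/A threshold is just a common choice, not a proposal: 

    import numpy as np

    def forces_converged(forces, fmax=0.05):
        """True if the largest per-atom residual force is below fmax;
        `forces` is an (N, 3) array of Cartesian components in eV/A."""
        return np.linalg.norm(forces, axis=1).max() < fmax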


<blockquote>


>>> Is it enough for DFT to specify 
>>> just an energy cut-off, assuming plane wave bases (for a given 
>>> pseudopotential), or are there different possible bases also among plane 
>>> waves (I guess there should not be, but maybe I'm missing something...)? 
>> 
>> For plane waves the energy cut-off is the only quantity. But there are 
>> many other types of basis sets that are not plane waves. For these, more 
>> specification might be needed (although often it is contained in the 
>> published definition of the basis set). 
> 
> I see. OK, so I understand that DFT can work with both PW and localized 
> bases, and that the exact basis should be specified in the input file (and 
> might be documented in the CIF). 

There are even "hybrid" schemes, like anything based on the muffin-tin geometry (LAPW, LMTO, ...) which come with a local and PW-like part and where the enumeration of basis functions may be by the PW's (like LAPW) or by the local part (like LMTO). There are wavelet-based codes which I have no idea how they converge. And so on. We have unfortunately a rather messier situation than the quantum chemists. 

</blockquote>


I think it is inevitable that strings will be needed for the basis set description... e.g. CP2K uses plane waves and Gaussians at the same time, so one needs both the cut-off and the Gaussian basis set name. Wavelets are even messier, since they form a kind of multi-resolution method with several parameters. 
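
Just as an illustration (the keys below are made up for this e-mail, they are not proposed data names), the minimum information for a CP2K-style GPW setup would be something like: 

    # Hypothetical record for a CP2K GPW (Gaussians + plane waves) run;
    # the keys and the values are illustrative, not TCOD data names.
    gpw_basis = {
        "type": "mixed Gaussian / plane-wave (GPW)",
        "pw_cutoff_Ry": 280,                     # plane-wave (density) cutoff
        "gaussian_basis": "DZVP-MOLOPT-SR-GTH",  # named Gaussian basis set
        "pseudopotential": "GTH-PBE",
    }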

<blockquote>



>>> n) Am I correct to assume that the "total energy" reported by codes will 
>>> always be the sum of separate energy terms (Coulomb, exchange, 
>>> 1-electron, 2-electron, etc.)? Is there an interest to have them 
>>> recorded in the result data files (CIFs) separately? If yes, what is the 
>>> "Hartree energy" (is it a sum of all single electron energies in the SCF 
>>> for each of them?), "Ewald energy" (is it the electrostatic lattice 
>>> energy, obtained by Ewald summation?) and the rest in the values from 
>>> the AbInit output file? Are these terms consistent across QM codes? 
>> 
>> Also here I think this is asking for way too much detail. Most codes can 
>> indeed split up the total energy into many contributions, but papers 
>> usually do not report that (only in the special cases when there is 
>> useful information in the splitting). If papers don't do it, databases 
>> shouldn't either -- that feels like a sound criterion. 
> 
> Interesting idea. Well, that makes our life easier. 

It is in fact worse than that, because the precise nature of this split depends on technical details such as how you solve the Poisson equation and how you treat core states. This varies between the codes, so in general there is no possibility of making a sound comparison of these partial energies between different methods. So I don't think much is gained by documenting them. 

</blockquote>


Completely agree. Different terms can be evaluated using different methods, so in many cases the comparison will not make sense. 

<blockquote>

... 
> May I disagree to some extent. If we can do better checks than journals 
> do, why shouldn't we? That would be a useful tool, a help for journal 
> reviewers, and one possible way to improve the situation. 
... 
> In short, I think that it would be quite useful to have uniform *actual* 
> convergence criteria in the CIF output, and check them before inserting 
> computations into TCOD, like we check crystal structure symmetry, bond 
> distances, parameter shifts or R-factors. 

Here I tend to agree, but at the same time I would not want us to impose too draconian criteria for acceptance, as long as the obtained result is documented. Perhaps some two-level system with "hard" acceptance criteria and a "warning" level would be good? 

</blockquote>


That's true. There might be, for example, some problem with converging the geometry (high residual forces on the atoms or cell), which could be checked against some threshold. But on the other hand, if the R-factor is high (w.r.t. the experiment), what does it mean: maybe the DFT model is not correct, or maybe it is just not correct for this type of system? Shall we accept an entry on naphthalene, anthracene or pentacene calculated using plain PBE? Is there an ultimate method which would be able to calculate everything correctly? The answer being: no... 
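
For the forces at least, Torbjörn's two-level idea could be as simple as the following sketch (the thresholds are placeholders, not values I am proposing): 

    def screen_deposit(max_force, hard=0.5, soft=0.05):
        """Two-level screening of a deposited relaxation (eV/A):
        reject above `hard`, accept with a warning between `soft`
        and `hard`, accept cleanly below `soft`."""
        if max_force >= hard:
            return "reject"
        if max_force >= soft:
            return "accept with warning"
        return "accept"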

<blockquote>


Hi, Stefaan, 

many thanks for your quick and very informative answer. I'll adjust the 
dictionaries accordingly. Below are some of my comments. 

On 2014-11-05 13:33, Stefaan Cottenier wrote: 
> Let me comment on those questions that I can answer on the spot, without 
> looking into the dictionary yet: 
> 
>> a) the units of energy that we started to use in the dictionaries are eV 
>> (electron-volts). For distances, we should probably use Angstroms, 
>> since then we can more easily compare computation results with 
>> crystallographic experimental data. This naturally suggests eV/A as the 
>> unit of force. Is that OK, or should we rather use SI units of a 
>> similar scale (say aJ -- atto-Joules)? The only problem I 
>> see with eV is that it is measured, not defined from the basic SI units 
>> ( http://en.wikipedia.org/wiki/Electronvolt , 
>> http://physics.nist.gov/cuu/Units/outside.html ). 

> Although SI units would in principle be the right choice, nobody uses 
> them in this context. At least eV and Angstrom have a special status 
> ('tolerated units' within SI), hence allowing eV, Angstrom and therefore 
> eV/A for forces is a fair compromise. 

Good. So we stay with eV and A, with eV/A for forces, as already 
documented in our dictionaries. 
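
For the record, conversion to SI remains trivial even though eV is a measured unit; a small sketch using the CODATA 2010 value of the elementary charge: 

    EV_TO_J = 1.602176565e-19                  # 1 eV in joules (CODATA 2010)
    ANGSTROM_TO_M = 1e-10                      # 1 A in metres
    EV_PER_A_TO_N = EV_TO_J / ANGSTROM_TO_M    # ~1.6e-9 N

    # Example: a residual force of 0.05 eV/A expressed in SI units.
    print(0.05 * EV_PER_A_TO_N, "N")           # ~8.0e-11 N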

>> b) Is nuclear electric dipole moment used/necessary for DFT computations 
>> (_dft_atom_type_nuclear_dipole)? 
>> 
>> c) If b) is "yes", what units we should use for electric dipole (and 
>> higher moments) -- Debayes, e*A (amount of unit charges times 
>> Angstroems), or something else? 
> 
> There must be some kind of confusion here -- my old nuclear physics 
> courses always emphasized that nuclei do not have an electric dipole 
> moment. Do you mean the magnetic dipole moment or the nuclear quadrupole 
> moment? Anyway, nuclear properties are never required for DFT 
> calculations as such, but they can be used to convert DFT predictions 
> into quantities that are experimentally accessible. I don't see the need, 
> however, to keep track of this in a computational database. 

OK, this is a gap in my education -- I must have overlooked the zero 
nuclear electric dipole during my university years... 

Setting aside the theoretical question of whether the quarks can move in 
such a way as to give a non-zero dipole, I am removing 
_dft_atom_type_nuclear_dipole as lacking theoretical justification and 
empirical evidence. For the magnetic dipole, a 
_dft_atom_type_magn_nuclear_moment could be introduced if needed (for 
orbital and spin magnetic moments the data names are already there). 

>> e) If I understand correctly, DFT and most other QM methods operate 
>> under Born-Oppenheimer approximation; under this approximation, electron 
>> densities (electron wave-functions) are optimised to minimal energy at 
>> fixed nuclei and unit cell parameters, and when this converges, the 
>> nuclei parameters and/or cell constants are changed slightly (e.g. along 
>> gradients), and the electron energy is minimised again. Is this a 
>> correct view? Is it a universal situation across QM codes? 
> 
> Yes, correct. It is pretty universal. There are some special-purpose 
> applications that do not make the Born-Oppenheimer approximation, but 
> that's really a minority. 

OK. Thanks for confirmation. 

>> f) If e) is "yes", then we can talk about "microcycles" (electron w/f 
>> refinement) and "macrocycles" (nuclei/cell shifted in each 
>> "macrocycle"). We could also document total energy changes in all these 
>> cycles, to monitor the convergence, as I suggest in 
>> _tcod_computation_cycle_... data items. Is such a view acceptable? What is 
>> the terminology used in different codes? Will such a table be useful? Will 
>> it be easy to obtain from most codes? 
> 
> I advise staying away from that. Your view is correct, but this 
> information is often not mentioned even in papers. Documenting such 
> changes would be arbitrary, to some extent. The final relaxed geometry 
> is well-defined, but what do you take as the unrelaxed starting point...? 

The idea is that the unrelaxed structure starts somewhere at high 
energies, and then converges to a low energy that no longer changes 
significantly with the refinement cycles. For me, this would be evidence 
that the process has converged. I guess when one does a calculation, one 
looks at such energy traces, doesn't one? 

If so, then it makes sense to have tools to record the traces in CIFs, 
as evidence of convergence and for convergence checks. 

I also want to point out that the presence of the data items (for energy 
tables) in the dictionary does not imply any obligation to use them. Any 
CIF will be correct and valid without them. It's just that if we decide to 
include them, there will be a publicly announced way to do so. 

>> h) The CML CompChem dictionary mentions SCF as a method. I know HF and 
>> its modifications are SCF; is DFT technically also SCF? Are there more 
>> SCF methods that are not HF? Should we include "SCF" into the 
>> enumeration values of the _tcod_model as a separate model? 
> 
> 'SCF' refers only to the fact that a particular iterative solving scheme 
> is used. As such, I would consider that term less informative 
> than HF or DFT (one could even imagine doing DFT without SCF, although 
> in practice this is very rarely done). 

OK; so we skip SCF as a separate "model". 

>> j) I have taken the _dft_basisset_type list from 
>> http://www.xml-cml.org/dictionary/compchem/#basisSet , to strive, at least 
>> potentially, for the possibility of a CIF->CML->CIF round trip. The two 
>> big classes of basis functions, as I have learned, are localised 
>> (Slater, Gaussian) and plane wave. Should we introduce such a 
>> classification on top of the _dft_basisset_type enumerator? 
> 
> I don't think so. It will be implicit in the name people use for the 
> basis set. 

OK. The type of the basis set should be implied from the basis set data 
(name, files, reference). 

>> Are 
>> localised bases relevant for DFT at all? 
> 
> Yes, sure (SIESTA, for instance). 

Good to know. Thanks for the info! 

>> Is it enough for DFT to specify 
>> just an energy cut-off, assuming plane wave bases (for a given 
>> pseudopotential), or are there different possible bases also among plane 
>> waves (I guess there should not be, but maybe I'm missing something...)? 
> 
> For plane waves the energy cut-off is the only quantity. But there are 
> many other types of basis sets that are not plane waves. For these, more 
> specification might be needed (although often it is contained in the 
> published definition of the basis set). 

I see. OK, so I understand that DFT can work with both PW and localized 
bases, and that the exact basis should be specified in the input file (and 
might be documented in the CIF). 

>> l) There are a lot of *_conv data items declared (e.g. 
>> _dft_basisset_energy_conv). Are they for convergence tests? Or for 
>> convolutions? What is their proposed definition? 
> 
> These are the convergence criteria; for instance: stop the iterative 
> (SCF) cycle once the total energy changes by less than 
> _dft_basisset_energy_conv during the last few iterations. 

Perfect. Now, are these the *desired* criteria, or the *obtained* values 
(i.e. actual values of the computation)? Although probably we can assume 
that in any case the energy change at the end of the computation was 
less than the specified _dft_basisset_energy_conv value, and the same 
for other *_conv values, right? 

I'll add units and explanations to the dictionary. 
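
To make the intended semantics explicit, what I have in mind is the usual stopping test; a schematic sketch, in which the toy 'step' stands in for the per-iteration solver of a real code: 

    def scf_loop(step, energy_conv=1e-6, max_iter=200):
        """Stop the SCF iteration once the total-energy change falls
        below energy_conv (eV), in the sense of _dft_basisset_energy_conv;
        `step` is any callable returning the current total energy."""
        e_old = float("inf")
        for _ in range(max_iter):
            e_new = step()
            if abs(e_new - e_old) < energy_conv:
                return e_new
            e_old = e_new
        raise RuntimeError("SCF did not converge")

    # Toy stand-in: a damped sequence converging to -100 eV.
    energies = (-100.0 + 10.0 * 0.5 ** n for n in range(300))
    print(scf_loop(lambda: next(energies)))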

>> m) Is _dft_cell_energy the same as the "total energy" reported by some 
>> codes? Can we rename it to _dft_total_energy? 
> 
> Probably yes. 

Hmmm... We need some explanation in the dictionary of how these values are 
supposed to be used. 

>> n) Am I correct to assume that the "total energy" reported by codes will 
>> always be the sum of separate energy terms (Coulomb, exchange, 
>> 1-electron, 2-electron, etc.)? Is there an interest to have them 
>> recorded in the result data files (CIFs) separately? If yes, what is the 
>> "Hartree energy" (is it a sum of all single electron energies in the SCF 
>> for each of them?), "Ewald energy" (is it the electrostatic lattice 
>> energy, obtained by Ewald summation?) and the rest in the values from 
>> the AbInit output file? Are these terms consistent across QM codes? 
> 
> Also here I think this is asking for way too much detail. Most codes can 
> indeed split up the total energy into many contributions, but papers 
> usually do not report that (only in the special cases when there is 
> useful information in the splitting). If papers don't do it, databases 
> shouldn't either -- that feels like a sound criterion. 

Interesting idea. Well, that makes our life easier. 

On the other hand, electronic media, like databases, can record and make 
usable more information than a traditional paper or PDF publication. We 
should not overlook such possibilities, and should use them when needed. 

For example, in protein crystallography, structure factors were not 
reported in publications at the very beginning, due to the sheer volume 
of data (a protein crystal can give you a million unique reflections); 
but today it is a must, and a self-evident thing, to deposit such data 
electronically into the PDB. 

>> o) How does one check that a computation has converged with respect to 
>> k-points, E-cutoff, smearing and other parameters, and that the 
>> pseudopotential is selected correctly? From the Abinit tutorial 
>> ( http://flex.phys.tohoku.ac.jp/texi/abinit/Tutorial/lesson_4.html ) I got 
>> the impression that one needs to run the computation with different values 
>> of these parameters, and see that the total energy, or other gauge values, 
>> no longer change significantly when these parameters are increased. Is 
>> that right? If yes, are there codes that do this automatically? Should 
>> we require the dependence of Etotal (or of the coordinates) on k-grid, 
>> E-cutoff and smearing, to check convergence when depositing to TCOD? Or 
>> should the TCOD side check this automatically when appropriate (say for 
>> F/LOSS codes)? 
> 
> "k-points, E-cuttof, smear and other parameters" are indeed tested as 
> you describe. 

OK. Thanks for clarification. 

I think it would be beneficial to have such checks included in the 
results file... 
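
Such a scan would be easy to express uniformly; a minimal sketch, in which `run` stands for a real single-point calculation at a given parameter value (the toy energy model is only there to make the example runnable): 

    def converge_parameter(run, values, tol=0.05):
        """Return the first parameter value (e.g. E-cutoff, k-grid
        density) for which the total energy (eV) changes by less
        than tol relative to the previous point of the scan."""
        e_prev = None
        for v in values:
            e = run(v)
            if e_prev is not None and abs(e - e_prev) < tol:
                return v, e
            e_prev = e
        raise RuntimeError("not converged within the scanned range")

    # Toy model: the energy approaches -50 eV as the cutoff grows.
    print(converge_parameter(lambda ec: -50.0 + 100.0 / ec,
                             range(100, 1000, 100)))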

> ... The pseudopotential can't be tested that way; what people 
> usually do is verify whether the numerically converged results obtained 
> with a particular pseudo agree with experiment. 

I see. OK, we'll have to trust that the PP was selected properly. 

Actually, TCOD + COD can check the computations against empirical 
(crystallographic) data -- say interatomic distances, bonds, angles, 
coordination sphere geometry, etc. The results might be interesting -- 
significant discrepancies will either predict unseen new phenomena or 
point out problems that can be fixed readily. 

</blockquote>


I commented on that before. Theory might never agree with experiment. If they agree, is it for a good reason? If they don't, is the reason known? 

<blockquote>

> Doing such tests is the responsibility of each user. In principle, 
> journals should not publish ab initio results if such tests are missing. 
> Journals are not that strict, unfortunately. And some researchers are 
> not very careful in that respect. 
> 
> It's a longstanding problem that is gradually being solved, because 
> computers are so fast now that the default settings of most codes are 
> sufficiently accurate for many cases, even if a researcher does not 
> explicitly test them. 
> 
> Also here, TCOD shouldn't try to do better than the journals do. 

May I disagree to some extent. If we can do better checks than journals 
do, why shouldn't we? That would be a useful tool, a help for journal 
reviewers, and one possible way to improved the situation. 

In crystallography, the situation is sometimes similar. IUCr journals 
are very good at checking structures, but some chemical journals, even 
the "high profile" ones, give you CIFs that may even have syntax errors 
in them! To me, this hints that nobody bothered to look at the data 
before publication. But how can they then claim that the paper was "peer 
reviewed"? The text apparently was, but the data probably were not. This 
is not a good way to work in today's data-driven sciences. 

COD does such checks, and we plan more -- and they help, e.g., when personal 
communications or prepublication structures are deposited. My personal 
experience with this was very positive -- the first personal communication 
I tried to send to COD was flagged as not properly converged; indeed this 
was an oversight, and a couple more refinement cycles fixed the problem. 

I find such checks to be a very useful tool, so why not have something 
similar for TCOD? Especially when we expect a large number of 
structures from wide-scale computational projects, where not every 
computation is checked manually. 

>> p) What are other obvious things that one could get wrong in QM/DFT 
>> computations, that could be checked formally? 
> 
> That's an interesting one... With no answer from my side. If there is 
> anything that can go obviously wrong, the codes will have an internal 
> test for it already. 

Well, an obviously wrong thing would be insufficient checks for 
convergence (too coarse a k-grid, too few minimization steps, etc.). 

Usually, experienced computational chemists will check for these, but 
every so often an MSc student who is just starting to learn things will 
compute a structure while the experienced boss happens to be away at a 
conference... The structure may look reasonable, especially to an 
inexperienced eye, but in fact it may be inaccurate... You know what I 
mean :) 

In short, I think that it would be quite useful to have uniform *actual* 
convergence criteria in the CIF output, and check them before inserting 
computations into TCOD, like we check crystal structure symmetry, bond 
distances, parameter shifts or R-factors. 
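
A deposit-time check could then be as simple as comparing the achieved value against the declared criterion. A sketch with one made-up data name (_tcod_final_energy_change is hypothetical; only _dft_basisset_energy_conv is from the draft dictionary): 

    def convergence_check(cif_values):
        """Compare the achieved final energy change against the
        requested criterion, both read from the deposited CIF."""
        requested = cif_values["_dft_basisset_energy_conv"]
        achieved = cif_values["_tcod_final_energy_change"]  # hypothetical
        return achieved <= requested

    print(convergence_check({
        "_dft_basisset_energy_conv": 1e-6,
        "_tcod_final_energy_change": 3e-7,
    }))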

The question is, what should be the universal criteria for QChem? 

Regards, 
Saulius 

-- 
Saulius Gražulis 

Biotechnologijos institutas 
Graičiūno 8 
02241 Vilnius-28 
Lithuania 

Tel.: internal BTI: 226 
Vilnius BTI: (8-5)-260-25-56 
mobile TELE2: (8-684)-49-802 
mobile OMNIT: (8-614)-36-366 


</blockquote>



Best, 

Linas 
-- 
Linas Vilciauskas 
Postdoctoral Research Fellow 
Dept. of Chemistry 
New York University 
100 Washington Square East / 830BB 
New York, NY 10003 
linas.vilciauskas at nyu.edu 
Phone: +1-347-614-6831 

