[TCOD] Fwd: Re: Fwd: some discussion topics related to TCOD

Saulius Gražulis grazulis at ibt.lt
Sun Sep 8 11:36:11 UTC 2013


Dear Stefaan,
dear colleagues,

On 2013-09-06 19:02, Stefaan Cottenier wrote:
> (as it seems there have been technical problems with TCOD, I'm not 
> sure whether this 3-months-ago mail actually reached all of you or 
> not. I am just resending it.)

Many thanks for reposting your e-mail. The TCOD mailing list, indeed,
had a configuration error, but now I hope I have fixed it and our
discussions can continue unhindered.

> -------- Original Message --------
> Subject: some discussion topics related to TCOD
> Date: Fri, 24 May 2013 14:34:08 +0200
> From: Stefaan Cottenier <Stefaan.Cottenier at UGent.be>
> To: tcod at lists.crystallography.net
> 
> 
> Dear colleagues,
> 
> Today, we held a group discussion at CMM (http://molmod.ugent.be) 
> about TCOD from the perspective of users and/or structure donators. 
> That did not lead to clear conclusions, but rather to a series of 
> thoughts that can be a starting point for further discussions or 
> actions. I'll list those here (I guess that is what this mailing list
> is meant for):

Thanks a lot, your input is very important, and we definitely need to
take into account everyone's needs and opinions to make TCOD useful.

> 1) You are probably aware of other database initiatives for computed 
> crystal structures. Is there a vision on whether TCOD wants to 
> 'compete' with those, or whether TCOD tries to fill a niche that is 
> not served by other databases?
> 
> For instance, consider https://www.materialsproject.org/. This is a 
> database that aims doing a VASP geometry optimization for every 
> structure in ICSD, starting from the ICSD cif (quite ironically, you
>  can read their starting geometry (=ICSD info) for free, without 
> having an ICSD yourself...). They use their own dedicated 
> supercomputer infrastructure to run all this, and upload only results
> achieved by themselves. No input by others, this in order to keep
> control over quality and consistency. It has over 30.000 crystals by
> now, and apart from the structure info also computed properties are
> being added.

Sure enough, there are already several approaches besides TCOD that
implement collections of theoretical structures in one way or another. I
have read a bit about https://www.materialsproject.org/ and I am
still familiarising myself with the others you have mentioned in your
2013-08-20 01:44 e-mail (I would like to encourage you to repost it to
the TCOD list if you do not object to making it open).

The main difference of TCOD is that, unlike the Materials Project group
who, as you say, "upload only results achieved by themselves", TCOD is
(*already* :) open to a wide range of contributors, including the
Materials Project if they wish or agree to share their data on an Open
Data basis. This may be both TCOD's weakness and its strength. As you
say, "No input by others, this in order to keep control over quality and
consistency" -- true. But on the other hand, the scientific community is
broad and getting broader, there are more programs than VASP (even
if VASP were considered the best), and it would be interesting to
compare various calculations with each other, and calculations with
experimental results -- a COD/TCOD/PCOD bundle would make this easy to
accomplish if done properly.

> I quote here a paragraph from a project proposal which we submitted 
> earlier this year, and which contains info/references about other 
> databases:
> 
> "As is often the case for revolutions, this idea has been realized 
> simultaneously and more or less independently at several places, 
> emphasizing different aspects. The Materials Project [Jai11,MP] is
> an initiative at MIT, where basic properties of all crystalline
> solids documented in the (experimental) Inorganic Crystal Structure
> Database (ICSD) [ICSD] have been computed by DFT. The computed
> formation energies are used to construct secondary data bases of
> binary and ternary phase diagrams [Ong08, Jai11b]. Similar
> initiatives, each with their own focus and in different stages of
> growth are AFLOWLIB [Set10,Cur12] (Duke University), OQMD [Wol12]
> (Northwestern University) and CompES [CES] (NIMS, Japan). Other
> database initiatives that emphasize collaborative data sharing (see
> also Sec. 2.3, WP3) are ESTEST [Yua10,Yua12] at UC Davis (US), the 
> Computational Materials Repository (CMR) [Lan12] at DTU (Denmark) and
> the AIDA environment [Koz13]."
> 
> [Jai11] A. Jain et al., Computational Materials Science 50 (2011)
>   2295
> [Jai11b] A. Jain et al., Physical Review B 84 (2011) 045115
> [Ong08] S. P. Ong, L. Wang, B. Kang, G. Ceder, Chemistry of Materials
>   20 (2008) 1798
> [Set10] W. Setyawan, S. Curtarolo, Computational Materials Science
>   49 (2010) 299
> [Cur12] S. Curtarolo et al., Computational Materials Science 58
>   (2012) 227
> [Wol12] Open Quantum Mechanical Database,
>   http://wolverton.northwestern.edu/oqmd (no public access yet)
> [CES] http://caldb.nims.go.jp/index_en.html (Y. Chen, A. Nogami,
>   H. Ohtani, N. Tatara)
> [Yua10] G. Yuan, F. Gygi, Computational Science & Discovery 3 (2010)
>   015004
> [Yua12] G. Yuan, F. Gygi, Computer Physics Communications 183 (2012)
>   1744
> [Lan12] D. Landis et al., Computing in Science and Engineering,
>   nov/dec 2012, p. 51 (http://dx.doi.org/10.1109/MCSE.2012.16)
> [Koz13] B. Kozinsky, N. Marzari, N. Bonini, J. Garg, G. Pizzi,
>   A. Cepellotti, M. Fornari, contributions at the MRS Fall meeting
>   (Boston, Nov. 2012) and the DPG Spring meeting (Regensburg, March
>   2013).

You are right, the idea of having an Open theoretical structure database
is floating in the air. But is there a database that is:

a) open (as in Open Data), permitting unrestricted sharing, reuse and
republication of data if cited properly;
b) accepting contributions from all relevant sources;
c) up and running right now?

If yes, then TCOD is probably not needed. Indeed, if the Computational
Materials Repository accepts computed depositions, then it is probably
the one that does the job. If, however, it does not -- we give it a try,
why not? The need is there, and TCOD was basically a reaction to
requests from the community to deposit computed crystal structures,
which COD currently either rejects or checks poorly.

> (2) From our perspective, we don't understand the need to separate 
> experimental and computational databases. As long as every structure
>  is properly tagged as 'experimental' or 'computed', we see no reason
>  to separate them. It will only lead to the burden of having to run 
> queries twice. In any case, a unified search web page that searches 
> in both data bases simultaneously seems useful to us (that is, de 
> facto, treating (T)COD as one database).

These concerns will be addressed. The division between COD and TCOD is
purely technical; searches will be possible across the union of COD,
TCOD and PCOD.

The reason for the separation is the different data requirements for
experimental and computed structures -- COD needs criteria that show how
well the model matches the observed data (R-factors, CC's, finally, the
Fobs data itself); TCOD needs computational parameters that indicate
convergence of the process, etc.

> (3) Quality control. If a deposited computed structure has been 
> published, the reference to the publication serves as quality 
> control.

Absolutely. The COD/TCOD framework does store full provenance
information in the records, and we treat published structures as
peer-reviewed.

Still, "peer-reviewed" does not always mean "correct" or "accurate",
especially when it comes to data...

> That's probably similar for deposited experimental structures, isn't
>  it?

Yes it is.

> If a computed structure has not (yet) been published, how to assess 
> its quality? Well, let us turn the question around: if a not yet 
> published experimental structure is deposited, how do you judge its 
> quality...?

For an experimental structure, what counts is its correspondence to the
observed data (Iobs/Fobs) and to our background knowledge (reasonable
chemistry). The correspondence to data can be expressed in numeric terms
in many cases -- Rcryst, CC, GoF are parameters that are routinely cited
in experimental CIFs to show the size of discrepancies between the model
and the experiment. No single parameter is perfect, for sure, and most
have been criticized, but they do give a rough (and, most importantly,
machine-processable) picture of the model-observation fit.

As for "reasonable chemistry", parameters are less formalised, but an
experienced chemist can detect a lot of suspicious cases by looking at
the structure, and we have now compiled comprehensive statistics on COD
(the master's thesis work of Andrius Merkys) so that we can now detect
"low probability" structures automatically.

> It makes sense to do it in the same way: as long as basic information
> is provided on how the calculation has been done, later users should
> make their own judgement.

Absolutely. But the numeric quality criteria (QC) are needed so that we
can monitor the incoming structures automatically -- no-one will be
willing or able to sift manually through the million+ structures, right?

> (4) A quality control measure that could make sense for both 
> experimental and computed structures, is to allow people to add 
> remarks that appear on the web page of that structure ('considering 
> the small basis set that has been used, I do not trust this result',
>  or 'this result has been contradicted by <ref>').

You hit the nerve! That's what we are doing right now for COD, and TCOD
will immediately benefit from the system as well.

I would like to point out, however, that community consensus does not
necessarily mean a correct judgement, just a widely accepted one. We had
all known for a couple of hundred years that crystals do not have
fivefold symmetry axes -- and yet Shechtman discovered quasicrystals,
now well confirmed (right?).

Another problem with community review is that we simply do not have time
to look through each and every file carefully enough. So computers
should help us; my strategy would be to flag structures automatically as
being "usual" or "unusual", and then ask people whether they find the
provided evidence "convincing" or "unconvincing", ideally in the way you
have proposed, 'this result has been contradicted by <ref>', or
otherwise with sound argumentation.

The "unusual" and "unconvincing" structures will probably need to be
reinvestigated; the "unusual" but "convincing" ones might be real gems!

I hope this strategy would be applicable both for COD and TCOD.
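
The flagging strategy above could be sketched roughly as follows. This
is only an illustration under stated assumptions: the z-score test, the
3-sigma limit, the vote counts and all names (Review, is_unusual,
triage) are hypothetical and are not part of any existing COD or TCOD
software.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """Community verdict on one structure (hypothetical record layout)."""
    convincing: int = 0      # votes that the provided evidence is convincing
    unconvincing: int = 0    # votes that it is not


def is_unusual(value: float, mean: float, stddev: float,
               z_limit: float = 3.0) -> bool:
    """Flag a numeric descriptor lying far from the database statistics.

    The z-score criterion and the default limit are assumptions made
    for this sketch, not an established TCOD rule.
    """
    if stddev <= 0:
        raise ValueError("stddev must be positive")
    return abs(value - mean) / stddev > z_limit


def triage(unusual: bool, review: Review) -> str:
    """Combine the automatic flag with the community verdict."""
    if not unusual:
        return "usual"
    if review.convincing > review.unconvincing:
        return "unusual, convincing: possible gem"
    if review.unconvincing > review.convincing:
        return "unusual, unconvincing: reinvestigate"
    return "unusual, undecided"
```

For example, with made-up numbers, triage(is_unusual(1.95, 1.53, 0.02),
Review(unconvincing=2)) puts the structure into the "reinvestigate"
group, while the same structure with more "convincing" votes would be
reported as a possible gem.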

> Such information by the community can help people who are not
> experts in either computing or measuring crystal structures to decide
> to which extent they can trust a particular result.

Sure. Let's present it in the *COD search results? Group structures by
"reliability index"?

> (Another variant is a feature on the web page of each structure to 
> flag it if you have doubts, such that experts can look at it).

Will be done automatically in the near future.

> (5) COD has cifs as entries only. If you want to go as far as 
> including input info to reproduce each calculation, it might be hard
>  to stick cifs-only for TCOD. It will then be necessary to link 
> additional files to the cif of each entry.

Adding additional files is possible; the upload site needs to be fixed
to accept and store arbitrary archives, but this is easy to do.

For TCOD, I envisage the possibility of storing files in arbitrary
formats. A good strategy, IMHO, is outlined in Tim Berners-Lee's Open
data definition (I learned about it from Peter Murray-Rust's talk,
http://www.iucr.org/__data/assets/pdf_file/0018/80280/murrayrust3.pdf).
We can start by accepting additional files that outline the computation
process, and formalise data presentations and formats later.

Of course, the better defined the formats are, the more useful these
auxiliary files will be. It would be nice if the community agreed on
comprehensive and open computational data formats, and also deposited
descriptions of their computational workflow, not just the final
results. Whether this goal is attainable remains to be seen...

> (6) We are split over whether it makes sense to require that each 
> deposited calculation is exactly reproducible.

So what if we say that at the moment this is "nice to have", but not a
"must have" option?

> Some basic information should be given, of course.

Agreed. And it is vital that the community agrees on what information is
necessary, and that we formalise it in some ontology. CIF dictionaries which
you have started are a good possibility for this; other ways are also
possible.

> This will be mainly information about the method that does not depend
> on the specific implementation (code). Furthermore, the main 
> technical input parameters that are specific for the code used, 
> should be given as well (if you want to do that in a structured way,
>  you will immediately hit the problem that the required set of
> numbers will be a different kind of set for every other code...).

IUCr does a similar thing for crystal structure refinement data: they
just include the input script of the refinement program verbatim, in the
data item "_iucr_refine_instructions_details"; see, for example:

http://www.crystallography.net/cod/2234930.cif

Maybe such an approach would be viable for DFT computations? Say, in a
_dft_computations_details data item?
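
As a rough sketch of what storing an input script verbatim could look
like, the helper below wraps arbitrary text into a semicolon-delimited
CIF text field. The data name _dft_computations_details is only the
tentative name suggested above, not an item of any published dictionary,
and the function name cif_text_field is an assumption of this sketch.

```python
def cif_text_field(tag: str, text: str) -> str:
    """Return a CIF data item whose value is a semicolon-delimited text field.

    Minimal sketch: it only checks the two things that would break the
    field, a malformed data name and a text line starting with ';'.
    """
    if not tag.startswith("_") or any(ch.isspace() for ch in tag):
        raise ValueError("a CIF data name starts with '_' and has no whitespace")
    lines = text.splitlines()
    # A line starting with ';' inside the value would end the field early.
    if any(line.startswith(";") for line in lines):
        raise ValueError("text lines must not start with ';'")
    return "\n".join([tag, ";"] + lines + [";"]) + "\n"


if __name__ == "__main__":
    # Placeholder text standing in for a real program input file.
    script = "cutoff_energy = <value>\nk_points = <value>"
    print(cif_text_field("_dft_computations_details", script), end="")
```

The design choice here is to refuse text that cannot be stored safely
rather than to escape it silently, so that the deposited script stays
byte-for-byte identical to what the program actually read.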

> Third in line should then be a free field for 'any other special 
> settings' that could apply. With this kind of info, people can 
> reproduce to a large extent (95%) the same calculation. If you want 
> to have it reproduced exactly, then much more input (files) will be 
> required. Which might not be worth the effort.

Could be. Actually, storing just the information you have outlined,
and not the complete input files, would make our database maintainers'
lives easier as well.

Would a program, something like Torbjörn's 'cif2cell', be able to
reconstruct at least approximate input scripts to repeat the
computation? If yes, this would probably be what we really need.

> (7) One of our enthusiastic PhD students with many years of 
> experience in web design and databases immediately had a mental map 
> about how to implement an automated version of (6) -- if the 
> TCOD-team is willing to take him aboard to code that, he'll probably
>  agree ;-).

Sure! Please invite him/her! Cordially welcome!

> (8) In any case, it should be clear for the one who provides the 
> structure, what exactly is requested.

A part of this would be CIF dictionaries and other ontologies,
supplemented by deposition check software, supplemented by *good*
documentation (hopefully ;).

> A web form where the information under (6) can be filled out (or the
>  automated tool of (7) that extracts all this info from the output of
>  each mainstream code) will be required in order to have consistent 
> data.

Good idea!

> (9) Which kind of computed structures will TCOD accept? Ground state 
> structures, obviously?

Sure!

> Metastable structures probably as well?

Why not, if marked or otherwise detectable as such.

> Also if they are not dynamically stable (soft phonon mode)? (that is
>  not routinely examined)

Sure, very interesting to see such beasts. Again, marked properly.

> And what about transition state structures? (never observable in 
> experiment, yet very useful to know -- and hard to find -- in order 
> to understand reactions)

Absolutely, this is what many chemists and biologists (including
structural biologists) are interested in. I guess most structural
enzymologists would love to see what a transition state of the
substrate-to-product reaction of their beloved enzyme is, wouldn't they?

> If the latter are allowed, then they should be tagged as such.

For sure. Actually, your list is a good addition for the dictionary --
each of your suggestions might become an enumeration value in a data
item describing the struct
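
A minimal sketch of how such an enumeration might be checked at
deposition time is given below. The four values mirror the cases listed
under (9); their spellings, the class name StructureKind and the
function parse_structure_kind are hypothetical and do not come from any
existing dictionary.

```python
from enum import Enum


class StructureKind(Enum):
    """Hypothetical enumeration values, taken from the cases in point (9)."""
    GROUND_STATE = "ground_state"
    METASTABLE = "metastable"
    DYNAMICALLY_UNSTABLE = "dynamically_unstable"
    TRANSITION_STATE = "transition_state"


def parse_structure_kind(value: str) -> StructureKind:
    """Map a deposited value to an enumeration member, rejecting unknown ones."""
    try:
        return StructureKind(value.strip().lower())
    except ValueError:
        allowed = ", ".join(kind.value for kind in StructureKind)
        raise ValueError(
            f"unknown structure kind {value!r}; allowed: {allowed}"
        ) from None
```

A deposition check could call parse_structure_kind() on the value found
in the file and report the list of allowed values when it fails.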

Thanks for your input!

I will soon put together your ideas and those of others as suggestions
for action on this mailing list.

Regards,
Saulius

-- 
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366




