[Cod-users] Specifying values 'less than something' in CIFs?

Saulius Grazulis grazulis at ibt.lt
Sun Apr 29 12:02:54 UTC 2012


Hi, Peter,

more detailed data curation discussion below:

On 04/29/2012 02:11 PM, Peter Murray-Rust wrote:

>     Clearly, the value has incorrect ESU syntax, but it has slipped through
>     all syntax and semantics checks until now, and would probably go
>     unnoticed by many programs that only take numeric value but do not use
>     ESU. And those that interpret ESU would either report an error or, even
>     worse, yield unpredictable and possibly incorrect results. And we do not
>     know this until we run that specific program on the specific CIF.
> 
> Yes - my JUMBOConverter system will do something similar.

May I have a look at it and/or use it?

>     b) make software that detects and, when possible, corrects the most
>     common 'mistakes'; e.g. it is probably safe to change '100C' to '373.15'
>     (Kelvins), with a benign warning (COD deposition tools do this on
>     the fly).
> 
> This is really difficult. I don't like changing things and certainly not
> without metadata. Maybe an additional local COD_ field that gives the
> heuristic value. That way we don't corrupt the past but also allow
> people to search and compute on "better" information

I have given some thought to this, and I hope our treatment of data is
fair and careful enough (but comments are welcome):

what we do is:

a) when changing a data value, in a way which we assume is unambiguous
and straightforward, we leave _cod_depositor_comments describing the
change; this is done automatically for automatic changes and we also
have the same policy for manual changes.

b) all our data are versioned in a Subversion repository; thus, all
changes can be tracked back, reviewed and reverted if necessary.

c) I keep track of original data file names and blocks. While ambiguous
in theory, in practice this nearly always lets us track down the
original file that was used for COD deposition. We also keep the
originally submitted files in a separate file tree/repository. Thus, we
can always trace the provenance of the data.

Given this, I think it is more important to have reasonably correct data
ready for automatic processing than to record every possible
aberration.
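
A minimal sketch of the kind of correction described in (a), assuming a
hypothetical helper normalise_temperature() and a comment wording made
up for illustration; this is not the actual code or message text of the
COD deposition tools. It converts Celsius strings to Kelvin, maps a few
"room temperature" spellings to 295(2), and returns a comment for every
change so that it can be recorded in _cod_depositor_comments:

```python
import re

# Spellings treated as "room temperature"; the list is illustrative only.
ROOM_TEMPERATURE = {"rt", "room temp.", "room temperature", "ambient temp."}


def normalise_temperature(raw):
    """Return (value, comment); comment is None when nothing was changed."""
    text = raw.strip()

    # Already a plain number, optionally with an uncertainty in parentheses.
    if re.fullmatch(r"\d+(\.\d+)?(\(\d+\))?", text):
        return text, None

    # Celsius values such as '100C' or '100 deg. C' are converted to Kelvin.
    match = re.fullmatch(
        r"(-?\d+(?:\.\d+)?)\s*(?:deg\.?|degrees)?\s*C", text, re.IGNORECASE
    )
    if match:
        value = f"{float(match.group(1)) + 273.15:.2f}"
        return value, f"'{raw}' was changed to '{value}' (Celsius to Kelvin)"

    # "Room temperature" is replaced by the justified guess of 295(2) K.
    if text.lower() in ROOM_TEMPERATURE:
        return "295(2)", f"'{raw}' was changed to '295(2)' (room temperature)"

    # Anything else is left for a human to look at.
    raise ValueError(f"cannot interpret temperature value {raw!r}")
```

Values that match none of the patterns raise an error instead of being
guessed, which keeps the automatic changes limited to the unambiguous
cases.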

> What about adding the field:
> 
> COD_diffrn_standards_decay_% '<1'
> 
> and removing the old one. Removal of an invalid field is better than
> throwing away the whole CIF (which is the only logical approach to error). 

>     For anything that is understood by us humans as "room temperature" (RT,
>     "room temp.", "ambient temp.", etc.), we assume that the average is
>     meant to be 22 deg. C (comfort level in a lab), and the uncertainty to
>     be +/- 2 degrees (assuming it is unlikely that human crystallographers
>     would measure above 28 degrees C or below 16 degrees, yielding a 60%
>     (1*sigma) confidence interval of 2 degrees, on a broader side), ending
>     up with a "justified wild guess" of 295(2) (Kelvins).
> 
> Again - convert to
> COD_temperature...

I have considered this approach, but I find it awkward. The reason is
that any processor that will try to extract, say, temperature from a CIF
will naturally process the _diffrn_ambient_temperature data item. If
this is unreadable, it will (should?) report an error.

If we now move these values to _cod_diffrn_ambient_temperature, we will
have a large number of tags that contain essentially the same
information -- AMCSD would need to introduce
_amcsd_diffrn_ambient_temperature if using the same approach, CrystalEye
(when providing me with files autoconverted from ChemML :) would use
something like _crystaleye_diffrn_ambient_temperature ... The list of
data names is open-ended, so no software will keep up. In addition, the
values by definition do *not* conform to the value syntax/semantics,
so what do you do with them? I guess such values are useless in general.
So we end up essentially deleting data that are nevertheless correct
from a human point of view (and authors might understandably be upset
with this, arguing "well, don't you guys understand what '100 degrees C'
means?!").

So my approach is the following. When I can reasonably *derive* a
correct value, I put this derived value into the cif_core data item;
e.g. we restore _diffrn_ambient_temperature value as we think the author
has intended to publish it, and we use the IUCr CIF requirements for its
syntax.

We then invent, if necessary, a _cod_original_diffrn_ambient_temperature
data item, described in cif_cod.dic
(http://www.crystallography.net/cif/dictionaries/cif_cod.dic), that
contains the *original* value verbatim; this value is present only when
the original value was different from what we now have in the COD CIF.
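
The policy of the two paragraphs above can be sketched as follows. The
sketch assumes that a CIF data block is held as a plain Python dict
mapping data names to string values; the function name
set_derived_value() is hypothetical and does not come from any COD
tool. Only the _cod_original_ naming convention is taken from the
description above:

```python
def set_derived_value(block, name, derived):
    """Store a derived value under the core data name.

    The previous value is copied verbatim to the matching
    _cod_original_... item, but only if it differs from the derived one.
    """
    original = block.get(name)
    if original is not None and original != derived:
        # '_diffrn_ambient_temperature' becomes
        # '_cod_original_diffrn_ambient_temperature'.
        block["_cod_original_" + name.lstrip("_")] = original
    block[name] = derived
    return block


block = {"_diffrn_ambient_temperature": "100C"}
set_derived_value(block, "_diffrn_ambient_temperature", "373.15")
print(block)
```

After the call, the core item holds the derived value and the verbatim
original sits next to it, so nothing the author wrote is lost.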

This policy is implemented in full for the _cell_volume data item. In
COD CIFs, _cell_volume contains the volume computed from the cell
constants. If the original value was different, it is copied to
_cod_original_cell_volume. The same holds for the
_symmetry_space_group_name_* tags.
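
For reference, computing a cell volume from the cell constants can be
sketched with the standard formula for a general (triclinic) cell. The
function name cell_volume() is an assumption made for this example, and
the sketch ignores standard uncertainties, which a real implementation
would also have to propagate:

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a unit cell; lengths in Angstroms, angles in degrees.

    V = a*b*c*sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                   + 2*cos(alpha)*cos(beta)*cos(gamma))
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(
        1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg
    )


# A cubic cell with a = 10 Angstroms has a volume of about 1000 cubic Angstroms.
print(round(cell_volume(10.0, 10.0, 10.0, 90.0, 90.0, 90.0), 3))
```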

In this way, you can process COD CIFs assuming that all values are
obtained in a uniform way and satisfy the invariants that COD ensures.
If you doubt whether the values are OK, or if you want to find out
whether and how they were changed, you can consult the corresponding
_cod_original_... value -- and this can be done automatically.
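
As an illustration of such an automatic check, the sketch below lists
every data item for which an original value was kept. It assumes a data
block represented as a plain Python dict of data names and string
values, and the helper name changed_items() is hypothetical; it relies
only on the _cod_original_ prefix convention described above:

```python
PREFIX = "_cod_original_"


def changed_items(block):
    """Map each changed data name to its (original, current) values."""
    changes = {}
    for name, original in block.items():
        if name.startswith(PREFIX):
            # '_cod_original_cell_volume' refers to '_cell_volume'.
            current_name = "_" + name[len(PREFIX):]
            changes[current_name] = (original, block.get(current_name))
    return changes


block = {
    "_cell_volume": "1000.0",
    "_cod_original_cell_volume": "1002",
    "_cell_length_a": "10.0",
}
for name, (original, current) in changed_items(block).items():
    print(f"{name}: {original} -> {current}")
```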

For the temperature values, we instead have a human-readable
_cod_depositor_comments message describing the change. But now that I am
writing this text to you, I realise that we should have _cod_original_...
data items for temperature values as well, in a machine-readable form.

Since all our data changes are versioned and all original files are
tracked, I hope there will be no serious problems in assuring reliable
data and establishing data provenance.

Regards,
Saulius

-- 
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366


