[CIG-MC] AGU's new data policy

Louise Kellogg lhkellogg at ucdavis.edu
Tue Mar 6 14:34:06 PST 2018


Great minds think alike, Katie! We were just discussing this idea at the CIG
staff meeting this morning.

As it happens, I have a meeting with AGU's Brooks Hanson on a related topic
(publication of software), two weeks from now, and I'd be happy to bring
this whole topic to the table. As CIG, the organization, we represent a
community and I hope we can help influence this direction. Someone else,
either here or on the AGU communities discussion of this issue, also raised
the issue Matt hints at - barriers to access to publishing in AGU journals
for scientists with fewer resources.

It's important to ask what the goal of archiving data is, in order to
decide what data to archive. Is it reproducibility or replicability? They
are not the same thing, as I was reminded this morning, and the distinction
is important.  Here is a helpful discussion of the differences, published
in a neuroscience journal, but drawing on the work of geophysicist Jon
Claerbout: (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5778115/). The
entire article is interesting, but I'll only quote the following:

>
> Claerbout defined “reproducing” to mean “running the same software on the
> same input data and obtaining the same results” (Rougier et al., 2017),
> going so far as to state that “[j]udgement of the reproducibility of
> computationally oriented research no longer requires an expert—a clerk can
> do it” (Claerbout and Karrenbach, 1992). As a complement, replicating a
> published result is then defined to mean “writing and then running new
> software based on the description of a computational model or method
> provided in the original publication, and obtaining results that are
> similar enough …” (Rougier et al., 2017). I will refer to these definitions
> of “reproducibility” and “replicability” as Claerbout terminology; they
> have also been recommended in social, behavioral and economic sciences
> (Bollen et al., 2015).


Replicability is the higher standard, and more useful. But it does not
necessarily require bit-for-bit conservation of the output of a model.
Instead, it may focus on the complete specification of the model and the
methods used. That would require input files and perhaps other input
information with sufficient metadata to explain what everything is, and it
would require archiving of the software in an archive like CIG's, and
citation including the specific version number to ensure that the methods
are specified completely. Here the work that CIG has been doing on software
attribution is relevant.  See for example the citation tool:
https://geodynamics.org/cig/abc which spells out in detail how to cite
software, by version, both in a manuscript and in the references and
acknowledgements sections of a paper.  These practices are in line with
those recommended by library, information, and data scientists.
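
As a rough illustration of the "complete specification" described above, the
input files and the exact software version could be recorded together in a
small machine-readable manifest. What follows is only a minimal sketch in
Python, not something CIG or AGU prescribes: the function names, the manifest
fields, and the idea of passing the code name, version, and DOI as plain
strings are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_dir, code_name, code_version, code_doi, descriptions):
    """Describe every file under input_dir plus the software version used.

    `descriptions` maps a relative file path to a short human-readable note;
    files without an entry get an empty description. All field names here
    are hypothetical, chosen only for this sketch.
    """
    root = Path(input_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        files.append({
            "path": rel,
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "description": descriptions.get(rel, ""),
        })
    return {
        "software": {
            "name": code_name,
            "version": code_version,
            "doi": code_doi,
        },
        "inputs": files,
    }


def manifest_to_json(manifest) -> str:
    """Serialize the manifest so it can be uploaded with the input files."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

The checksums let a later reader confirm they have the same input files,
while the version and DOI fields pin down the methods. Nothing here stores or
compares model output, consistent with the point that replicability need not
mean bit-for-bit identical results.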

There are instances where a partial or complete preservation of the model
may be essential. One example could be benchmark models.  Another
example is provided by our geodynamo group, which is using leadership-class
computers to produce a very limited set of model runs that they intend to
mine for knowledge going forward. That model output needs to be preserved
as an observational dataset would. Another example: it may be a good idea
to preserve additional information on model design, such as the FEM mesh
when it is particularly complex.

Thanks to Lorraine Hwang for reminding me of the distinction between
reproducibility and replicability, for coming up with the FEM mesh
preservation example, and for leading the charge on software citation.

Best,

Louise



On Tue, Mar 6, 2018 at 9:28 AM, Cooper, Catherine M <cmcooper at wsu.edu>
wrote:

>
> I wonder if it wouldn’t be helpful to have a community statement as to
> what we consider “data” and what we agree needs to be shared for
> reproducibility (which we all agree is important)?  But it seems like we
> might need to do some outreach on this if there is some misunderstanding
> about model output as data amongst AGU and NASA (this has come up in
> proposal reviews).
>
>
> On Mar 6, 2018, at 9:01 AM, Juliane Dannberg <judannberg at gmail.com> wrote:
>
> My experience with this is similar to what Thorsten describes. I also
> regularly have TB-sized model output, and usually include the DOI of the
> version of the code I used in the paper, upload all input files, scripts,
> etc. that I used as supplementary material, and include a sentence that "all
> input files necessary to reproduce the model results are included in the
> supplementary material". So far, that seemed to be an acceptable solution,
> also for AGU journals.
>
> But I agree that there doesn't seem to be a good way to archive TB-sized
> model output over long periods of time...
>
> Best,
> Juliane
>
> ----------------------------------------------------------------------
> Juliane Dannberg
> Postdoctoral Fellow, Colorado State University
> http://www.math.colostate.edu/~dannberg/
>
>
>
> On 3/6/2018 at 9:39 AM, Thorsten Becker wrote:
>
> The way I have interpreted AGU's guidelines for geodynamic studies as AGU
> editor is to not ask for archiving of model output, but to ask for general
> access to all material that would be needed to recreate that output, or
> some simpler version of it that is proof of concept, i.e., input data, input
> files, and a DOI to the version of the code, for example, if a community
> code is used.
>
> The general idea is, of course, to make things reproducible, and AGU and
> Wiley are among those who realize that this can cause problems, and are
> working on solutions with the community.
>
> One particular issue is that I have not asked for verification that
> results are actually reproducible, and have taken authors' assurances that
> codes will be shared at face value (except when the publications were of a
> technical nature, where we ask reviewers to actually try to download and
> run the software, which almost never works). I think that part
> might change, in that publishers may ask for a code access link and somehow
> archive this.
>
> I can also see some solutions akin to asking for a Docker setup, archived
> somewhere, that will allow anyone to rerun the models. There are
> interesting challenges involved, but in the end, I think moving to more
> openness and reproducibility is a good thing, and the success of CIG shows
> how some issues that were raised before we moved into this model resolved
> themselves. Things are not perfect, but we're making progress.
>
> My personal experience with publishing numerical stuff in highly visible
> journals is that, within a week, there are people actually asking to get
> all the code and all the input files to rerun our models, and we've always
> shared all of our stuff, of course. I realize that this is a
> significant workload (particularly for my grad students who actually put
> this stuff together...) and somehow AGU and publishers need to do more to
> support people with large data volumes, seismological inversions being
> another example.
>
>
> Thorsten Becker - UTIG & DGS, JSG, UT Austin
> <http://www-udc.ig.utexas.edu/external/becker/>
>
> On Tue, Mar 6, 2018 at 7:17 AM, Scott King <sdk at vt.edu> wrote:
>
>>
>> AGU journals have a new data policy requiring that all the data from the
>> work must be in a publicly accessible repository.  In general I think this
>> is a good thing.   They provide several possible solutions.   From the
>> editor letter…
>>
>> "*AGU requires that data needed to understand and build upon the
>> published research be available in public repositories following best
>> practices
>> <http://publications.agu.org/author-resource-center/publication-policies/data-policy/data-policy-faq/>.
>> This includes an explicit statement in the Acknowledgments section on where
>> users can access or find the data for this paper. Citations to archived
>> data should be included in your reference list and all references,
>> including those cited in the supplement, should be included in the main
>> reference list. All listed references must be available to the general
>> reader by the time of acceptance.*”
>>
>> They list several possible repositories, none of which seem appropriate
>> for 2.9 TB of CitcomS results. Set aside the philosophical issue that model
>> results are not “data” (they don’t accept that).   I have the output used
>> in the published figures down to a reasonable size, but I’m curious what
>> others are doing.  Has anyone else run into this yet?  (If not, you will.)
>> I’m curious if there is a community consensus regarding a repository where
>> all geodynamics results would/could end up, as opposed to ending up with
>> them scattered across 3-4 (or more) potential repositories.  Maybe that’s
>> not something to worry about, but since this is new and I, at least, have
>> had no time to think it through, I’m curious what others are doing.
>>
>> Thoughts?
>>
>> Cheers,
>>
>> Scott
>>
>>
>>
>>
>> _______________________________________________
>> CIG-MC mailing list
>> CIG-MC at geodynamics.org
>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>>
>
>
>



-- 
**********************************
Louise Kellogg
Professor, Department of Earth and Planetary Sciences
University of California, Davis
***********************************