Mutability of Content in DataONE

Overview

All content synchronized by DataONE is immutable, and so resolution of a persistent identifier (PID) will always result in a pointer (URI) to a set of bytes that are in all respects identical to the original. Version 2.0 of the DataONE APIs introduced the ability to associate an optional series identifier (SID) with an object. Unlike a PID, resolution of a SID will always result in a pointer (URI) to a set of bytes that represent the latest revision of an object.

A revision or obsolescence chain is constructed by setting the obsoletes and obsoletedBy properties of the new and old objects respectively. For example, here PID_B represents the latest revision of the object because it obsoletes PID_A (object PID_A has a value of “PID_B” in its system metadata obsoletedBy property, and object PID_B has a value of “PID_A” in its system metadata obsoletes property):

+------------+                      +------------+
|            | ----- obsoletes ---> |            |
|   PID_B    |                      |   PID_A    |
|            | <--- obsoletedBy --- |            |
+------------+                      +------------+

             resolve(PID_A) => PID_A
             resolve(PID_B) => PID_B

In version 1.x of DataONE, it was necessary to manually follow the obsolescence chain in order to find the latest version of an object. This process is simplified in version 2.x and later through the use of series identifiers. The previous example can be augmented with series identifiers:

+------------+                      +------------+
|            | ----- obsoletes ---> |            |
|   PID_B    |                      |   PID_A    |
|   SID_1    |                      |   SID_1    |
|            | <--- obsoletedBy --- |            |
+------------+                      +------------+

             resolve(PID_A) => PID_A
             resolve(PID_B) => PID_B
             resolve(SID_1) => PID_B

Each object in the obsolescence chain has the same value for the series identifier (“SID_1”), and calling resolve() with the value “SID_1” will result in the URIs from which the object “PID_B” may be retrieved, since that object is the latest revision in the obsolescence chain.

The availability of PIDs and SIDs means users may now refer to objects using either a PID, when it is necessary or appropriate to refer to the exact set of bytes that represent an object, or a SID, when referring to the latest version of an object. The former is important for repeatable analyses, since the same content may be reliably referenced and retrieved. The latter is important for referencing the most up-to-date revision of some object, and so may be useful, for example, to perform analysis with the latest information available.

Unless indicated otherwise, the DataONE version 2.x and later APIs will accept either a PID or a SID when an identifier is specified as a request parameter.

Resolving Series Identifiers

In a perfect world, all obsolescence chains would have complete, bi-directional links, and the latest version of an object could be determined simply by examining the set of all objects with the same SID and selecting the object that is not obsoletedBy anything else. Obsolescence chains may be incomplete for various reasons, however, and in such situations resolution of series identifiers should still operate consistently.
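The ideal case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `SYSMETA` dictionary and the `resolve` function are hypothetical stand-ins and are not part of any DataONE client library, and the function returns a PID rather than the URIs a real resolve call would return.

```python
from typing import Dict, Optional

# Hypothetical in-memory system metadata store: PID -> selected fields.
SYSMETA: Dict[str, Dict[str, Optional[str]]] = {
    "PID_A": {"seriesId": "SID_1", "obsoletes": None, "obsoletedBy": "PID_B"},
    "PID_B": {"seriesId": "SID_1", "obsoletes": "PID_A", "obsoletedBy": None},
}


def resolve(identifier: str, sysmeta: Dict[str, Dict[str, Optional[str]]] = SYSMETA) -> str:
    """Return the PID that an identifier (PID or SID) refers to.

    Assumes a complete, bi-directional obsolescence chain, so exactly one
    object in the series has no obsoletedBy value.
    """
    # A PID always refers to itself.
    if identifier in sysmeta:
        return identifier
    # Otherwise treat the identifier as a SID and pick the non-obsoleted object.
    heads = [
        pid
        for pid, fields in sysmeta.items()
        if fields["seriesId"] == identifier and fields["obsoletedBy"] is None
    ]
    if len(heads) != 1:
        raise LookupError(f"cannot resolve {identifier!r} with the ideal-chain rule")
    return heads[0]
```

With the two-object example from the overview, `resolve("PID_A")` gives `PID_A` and `resolve("SID_1")` gives `PID_B`. Chains that are incomplete need additional rules beyond this sketch.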

The following series of scenarios demonstrates the behavior of the DataONE system when resolving a seriesId to a specific object. Resolution relies primarily on the obsoletes and obsoletedBy entries, falling back to the date when an object was added to a Member Node (dateUploaded) to determine the newer version.

The following notation is used herein:

\(P_i\)

Refers to a Persistent Identifier (PID)

\(S_i\)

Refers to a Series Identifier (SID)

\(t_i\)

The value of dateUploaded for an object

\(t_1\) < \(t_2\)

\(t_1\) is older than \(t_2\)

\(P_i \binom{S_j}{t_k}\)

An object with identifier (PID) \(P_i\), a seriesId (SID) of \(S_j\), and a dateUploaded of \(t_k\).

\(P_i\) \(\rightarrow\) \(P_j\)

\(P_i\) has an obsoletedBy entry that contains the value \(P_j\)

\(P_i\) \(\leftarrow\) \(P_j\)

\(P_j\) has an obsoletes entry that contains the value \(P_i\)

\(P_i\) \(\leftrightarrows\) \(P_j\)

\(P_i\) has an obsoletedBy entry that contains the value \(P_j\) and \(P_j\) has an obsoletes entry that contains the value \(P_i\).

\(P_i\) \(\square\) \(P_j\)

Neither obsoletedBy nor obsoletes is set by \(P_i\) or \(P_j\).

??

Object was not synchronized, and so unknown to DataONE

\(resolve(S_i) \Rrightarrow P_j\)

Resolving SID \(S_i\) results in PID \(P_j\)

Case 1

(1)\[\begin{split}P_1\binom{S_1}{t_1} & \leftrightarrows P_2\binom{S_1}{t_2} \\ t_1 & < t_2 \\ resolve(S_1) & \Rrightarrow P_2\end{split}\]

A set of objects \(O = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

All objects in \(O\) are participants in an obsolescence chain since \(P_2\) obsoletes \(P_1\) and \(P_1\) is obsoletedBy \(P_2\).

All elements of the obsolescence chain \(P_1 \leftrightarrows P_2\) have the same series identifier, \(S_1\).

The dateUploaded of \(P_1\) is older than that of \(P_2\).

This is a perfect obsolescence chain and resolving \(S_1\) will result in the object identified by \(P_2\).

Case 2

(2)\[\begin{split}P_1\binom{S_1}{t_1}\; & \square \; P_2\binom{S_1}{t_2} \\ t_1 & < t_2 \\ resolve(S_1) & \Rrightarrow P_2\end{split}\]

A set of objects \(O = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

No obsolescence information associates objects in \(O\).

The dateUploaded of \(P_1\) is older than that of \(P_2\).

No obsolescence assertions are made, so resolution is inferred by the most recent value of dateUploaded.

Case 3

(3)\[\begin{split}P_1\binom{S_1}{t_1}\; & \leftarrow \; P_2\binom{S_1}{t_2} \\ t_1 & < t_2 \\ resolve(S_1) & \Rrightarrow P_2\end{split}\]

A set of objects \(O = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

All objects in \(O\) are participants in an obsolescence chain since \(P_2\) obsoletes \(P_1\) even though \(P_1\) does not assert it is obsoletedBy \(P_2\).

All elements of the obsolescence chain \(P_1 \leftarrow P_2\) have the same series identifier, \(S_1\).

The dateUploaded of \(P_1\) is older than that of \(P_2\).

This is a damaged, but consistent obsolescence chain and resolving \(S_1\) will result in the object identified by \(P_2\).

Case 4

(4)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \leftrightarrows P_3\binom{S_2}{t_3}\\ t_1 < t_2 & < t_3 \\ resolve(S_1) &\Rrightarrow P_2 \\ resolve(S_2) &\Rrightarrow P_3 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

A set of objects \(O_{S_2} = \lbrace P_3 \rbrace\) has the series identifier, \(S_2\).

Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a full, bi-directional obsolescence chain.

In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain; however, it is the newest version in the obsolescence chain identified by \(S_1\).

Resolving \(S_2\) will result in \(P_3\).

Case 5

(5)\[\begin{split}P_1\binom{S_1}{t_1} \leftarrow P_2\binom{S_1}{t_2}& \leftarrow P_3\binom{S_2}{t_3}\\ t_1 < t_2 & < t_3 \\ resolve(S_1) &\Rrightarrow P_2 \\ resolve(S_2) &\Rrightarrow P_3 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

A set of objects \(O_{S_2} = \lbrace P_3 \rbrace\) has the series identifier, \(S_2\).

Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a damaged, though consistent obsolescence chain.

In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain; however, it is the newest version in the obsolescence chain identified by \(S_1\).

Resolving \(S_2\) will result in \(P_3\).

Case 6

(6)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \leftrightarrows P_3\binom{}{t_3}\\ t_1 < t_2 & < t_3 \\ resolve(S_1) &\Rrightarrow P_2 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O = O_{S_1} \cup \lbrace P_3 \rbrace\) all participate in an obsolescence chain.

In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain; however, it is the newest version in the obsolescence chain identified by \(S_1\).

Case 7

(7)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \leftrightarrows P_3\binom{}{t_3} \leftrightarrows P_4\binom{S_2}{t_4} \\ t_1 < t_2 & < t_3 < t_4\\ resolve(S_1) &\Rrightarrow P_2 \\ resolve(S_2) &\Rrightarrow P_4\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

A set of objects \(O_{S_2} = \lbrace P_4 \rbrace\) has the series identifier, \(S_2\).

Objects \(O = O_{S_1} \cup \lbrace P_3 \rbrace \cup O_{S_2}\) all participate in an obsolescence chain.

In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain; however, it is the newest version in the obsolescence chain identified by \(S_1\).

Resolving \(S_2\) will result in \(P_4\).

Case 8

(8)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \rightarrow ?? \leftarrow P_4\binom{S_1}{t_4} \\ t_1 < t_2 & < t_4\\ resolve(S_1) &\Rrightarrow P_4 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) all participate in an obsolescence chain; however, the chain is broken, with no way to traverse between \(P_2\) and \(P_4\), because the object that \(P_2\) indicates it is obsoletedBy and that \(P_4\) indicates it obsoletes is not recorded by the DataONE Coordinating Nodes (does not resolve).

In this case resolving \(S_1\) will result in \(P_4\) since that is the most recent object in the set of objects \(O_{S_1}\).

Case 9

(9)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \;\square\; ?? \leftarrow P_4\binom{S_1}{t_4} \\ t_1 < t_2 & < t_4\\ resolve(S_1) &\Rrightarrow P_4 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) all participate in an obsolescence chain; however, the chain is broken, with no way to traverse between \(P_2\) and \(P_4\), because the object that \(P_4\) indicates it obsoletes is not recorded by the DataONE Coordinating Nodes (does not resolve).

In this case resolving \(S_1\) will result in \(P_4\) since that is the most recent object in the set of objects \(O_{S_1}\).

Case 10

(10)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \rightarrow P_{del}\binom{}{} \leftarrow P_4\binom{S_1}{t_4} \\ t_1 < t_2 & < t_4\\ resolve(S_1) &\Rrightarrow P_4 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).

The object \(P_{del}\) was deleted from the system, so the identifier is known, but the object and associated system metadata are no longer available.

Objects \(O_{S_1}\) all participate in an obsolescence chain; however, the chain is broken, with no way to traverse between \(P_2\) and \(P_4\), because \(P_{del}\), the object that \(P_2\) indicates it is obsoletedBy and that \(P_4\) indicates it obsoletes, is no longer available (does not resolve).

In this case resolving \(S_1\) will result in \(P_4\) since that is the most recent object in the set of objects \(O_{S_1}\).

Case 11

(11)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2}& \leftrightarrows archived\biggl[P_3\binom{S_1}{t_3}\biggr] \\ t_1 < t_2 & < t_3 \\ resolve(S_1) &\Rrightarrow P_3 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_3 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) all participate in an obsolescence chain.

Object \(P_3\) has been archived, and so is not discoverable.

In this case resolving \(S_1\) will result in \(P_3\) which is the most recent object in the obsolescence chain even though it is archived.

Case 12

(12)\[\begin{split}P_1\binom{S_1}{t_1} & \leftrightarrows P_2\binom{S_1}{t_2} \rightarrow ?? \\ t_1 & < t_2 \\ resolve(S_1) &\Rrightarrow P_2 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) participate in an obsolescence chain which is damaged by \(P_2\) indicating it is obsoletedBy some object that is not resolvable.

In this case resolving \(S_1\) will result in \(P_2\) which is the most recent resolvable object in the obsolescence chain.

Case 13

(13)\[\begin{split}P_1\binom{S_1}{t_1} & \leftarrow P_2\binom{S_1}{t_2} \rightarrow ?? \\ t_1 & < t_2 \\ resolve(S_1) &\Rrightarrow P_2 \\\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) participate in a damaged obsolescence chain since \(P_2\) indicates it is obsoletedBy some object that is not resolvable, and \(P_1\) does not assert it is obsoletedBy \(P_2\).

In this case resolving \(S_1\) will result in \(P_2\) which is the most recent resolvable object in the obsolescence chain.

Case 14

(14)\[\begin{split}P_1\binom{S_1}{t_1} \leftarrow P_2\binom{S_1}{t_2}& \rightarrow P_3\binom{S_2}{t_3}\\ t_1 < t_2 & < t_3 \\ resolve(S_1) &\Rrightarrow P_2 \\ resolve(S_2) &\Rrightarrow P_3\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

A set of objects \(O_{S_2} = \lbrace P_3 \rbrace\) has the series identifier, \(S_2\).

Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a damaged obsolescence chain, with \(P_1\) not indicating it is obsoleted by \(P_2\), and \(P_3\) not indicating that it obsoletes \(P_2\).

In this case resolving \(S_1\) will result in \(P_2\), which is not the most recent object in the obsolescence chain; however, it is the newest version in the obsolescence chain identified by \(S_1\).

\(S_2\) will resolve to \(P_3\).

Case 15

(15)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2} \; & \square \; ?? \leftarrow P_4\binom{S_1}{t_4} \leftrightarrows P_5\binom{S_2}{t_5}\\ t_1 < t_2 & < t_4 < t_5\\ resolve(S_1) &\Rrightarrow P_4 \\ resolve(S_2) &\Rrightarrow P_5\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).

A set of objects \(O_{S_2} = \lbrace P_5 \rbrace\) has the series identifier, \(S_2\).

Objects \(O = O_{S_1} \cup O_{S_2}\) all participate in a damaged obsolescence chain: the object that \(P_4\) indicates it obsoletes (??) is unknown to DataONE, so there is no assertion of the relationship between \(P_2\) and \(P_4\).

In this case resolving \(S_1\) will result in \(P_4\), which is not the most recent object in the obsolescence chain; however, it is the newest version in the obsolescence chain identified by \(S_1\).

Resolving \(S_2\) will result in \(P_5\).

Case 16

(16)\[\begin{split}P_1\binom{S_1}{t_1} \leftarrow P_2\binom{S_1}{t_2} & \rightarrow ?? \leftarrow P_4\binom{S_2}{t_4} \\ t_1 < t_2 & < t_4\\ resolve(S_1) &\Rrightarrow P_2 \\ resolve(S_2) &\Rrightarrow P_4\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2 \rbrace\) have the same series identifier, \(S_1\).

A set of objects \(O_{S_2} = \lbrace P_4 \rbrace\) has the series identifier, \(S_2\).

Objects \(O_{S_1}\) and \(O_{S_2}\) are both damaged obsolescence chains, though the Coordinating Nodes may infer an association between them: even though the object that \(P_2\) is obsoletedBy and that \(P_4\) obsoletes cannot be resolved, \(P_2.obsoletedBy\) and \(P_4.obsoletes\) have the same value.

In this case resolving \(S_1\) will result in \(P_2\) which is the most recent resolvable object in the obsolescence chain.

Resolving \(S_2\) will result in \(P_4\).

Case 17

(17)\[\begin{split}P_1\binom{S_1}{t_1} \leftarrow P_2\binom{S_1}{t_2} & \rightarrow ?? \leftarrow P_4\binom{S_1}{t_4} \\ t_1 < t_2 & < t_4\\ resolve(S_1) &\Rrightarrow P_4\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) form a damaged obsolescence chain, though it can be inferred that \(P_2\) is obsoletedBy, and \(P_4\) obsoletes, the same object: even though that object cannot be resolved, \(P_2.obsoletedBy\) and \(P_4.obsoletes\) have the same value.

In this case resolving \(S_1\) will result in \(P_4\).

Case 18

(18)\[\begin{split}P_1\binom{S_1}{t_1} \leftrightarrows P_2\binom{S_1}{t_2} & \rightarrow ?? \; \square \; ?? \leftarrow P_5\binom{S_1}{t_5} \\ t_1 < t_2 & < t_5\\ resolve(S_1) &\Rrightarrow P_5\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_5 \rbrace\) have the same series identifier, \(S_1\).

The obsolescence chain \(O_{S_1}\) is broken, with no way to traverse from \(P_2\) to \(P_5\).

The dateUploaded places \(P_5\) as the newest object with the series Id of \(S_1\).

Resolving \(S_1\) results in \(P_5\).

Case 19

(19)\[\begin{split}P_1\binom{S_1}{t_1} \leftarrow P_2\binom{S_1}{t_2} & \leftarrow P_3\binom{S_1}{t_3} \\ t_1 > t_2 & > t_3\\ resolve(S_1) &\Rrightarrow P_3\end{split}\]

A set of objects \(O_{S_1} = \lbrace P_1, P_2, P_3 \rbrace\) have the same series identifier, \(S_1\).

Objects \(O_{S_1}\) form a damaged obsolescence chain since only obsoletes values are specified.

The dateUploaded of \(P_1\) is newer than \(P_2\), which in turn is newer than \(P_3\).

In this case resolving \(S_1\) will result in \(P_3\) even though \(P_1\) is the most recently uploaded object, since the obsolescence chain overrides the upload times.

Referencing Content by Identifier

The use of the PID or SID for either citation or analysis workflows is up to the user and is context dependent. In general, DataONE anticipates DATA and RESOURCE_MAP objects will be referenced by PID, to ensure reproducibility; and in general, METADATA documents will be referenced by SID, to take advantage of any data curation / correction efforts that would not otherwise affect scientific reproducibility. Additionally, clues for the content submitter’s preference can be found in the format of the identifiers themselves. For example, DOIs and EZIDs take a recognizable format, and are often encouraged in scientific communities for citations, so an end-user might take that into consideration when deciding which identifier to choose.

Todo

guidance on RESOURCE_MAPS - initial thoughts: depends on references to DATA objects, whether they be SIDs or PIDs

Assigning Identifiers

Depending on the Member Node used as the primary repository, content originators may have some choice in assigning identifiers. For those that do, it is advised that they assign PIDs and SIDs according to the typical usage pattern described above.

Some Member Nodes may not preserve past versions of content, in which case the PID is likely to be automatically generated, and the submitter only has to determine the SID, and may not need to know the difference between the SID and PID. Other Member Nodes may still be at v1 of the DataONE APIs and only allow assignment of the PID.

Limits on the Series

The SID is used to conceptually represent an object that may vary modestly over time, but remains conceptually the same. Content contributors should be careful to apply reasonable limits on the scope of documents such that an entity does not deviate too much from the original item. Where it would, a new / different series should be initiated.

Requirements on Member Node Implementations

For Member Nodes that employ a mutable content storage model, the only additional DataONE requirement is that the Member Node generate a SystemMetadata document for the updated content, containing:

  1. unique PID in systemMetadata.identifier field

  2. new checksum

  3. the previous PID in the systemMetadata.obsoletes field

Ideally, the SystemMetadata of now unavailable versions will be maintained, with the obsoletedBy field populated with the PID of the version that replaced it.
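As a rough sketch of these requirements, the following Python function derives system metadata for updated content from the previous version's system metadata. The plain-dictionary layout, the function name `sysmeta_for_update`, and the choice of SHA-256 as the checksum algorithm are assumptions made for illustration only; a real Member Node would produce a SystemMetadata document conforming to the DataONE schema.

```python
import hashlib
from typing import Any, Dict, Tuple


def sysmeta_for_update(
    old_sysmeta: Dict[str, Any], new_pid: str, new_bytes: bytes
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (new system metadata, updated copy of the old system metadata)."""
    if new_pid == old_sysmeta["identifier"]:
        raise ValueError("updated content needs a new, unique PID")
    new_sysmeta = {
        # 1. unique PID for the updated content
        "identifier": new_pid,
        # the series identifier, if any, is carried over unchanged
        "seriesId": old_sysmeta.get("seriesId"),
        # 2. new checksum computed over the updated bytes (algorithm is an assumption)
        "checksum": {
            "algorithm": "SHA-256",
            "value": hashlib.sha256(new_bytes).hexdigest(),
        },
        # 3. the previous PID in the obsoletes field
        "obsoletes": old_sysmeta["identifier"],
        "obsoletedBy": None,
    }
    # Ideally the old record is kept and pointed at its replacement.
    old_updated = dict(old_sysmeta, obsoletedBy=new_pid)
    return new_sysmeta, old_updated
```

The old record is returned as a modified copy rather than changed in place, so a node that does not retain back-version metadata can simply discard it.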

Some Member Nodes may opt to preserve recent back-versions to aid the complete capture of versions by the DataONE network via synchronization.

Replication of unhosted back-versions

DataONE will attempt to synchronize all versions it is made aware of through the synchronization process, but may miss short-lived versions that exist only between two synchronization runs for the Member Node. Please note, also, that the synchronization schedule is not guaranteed. Periods of DataONE maintenance may suspend synchronization, or high CN load could prolong the synchronization interval.

Member Nodes keen to make sure versions have the highest chance of synchronization can choose to issue a CNCore.synchronize() command that will put the item on the synchronization queue instead of waiting for the harvest interval.

Conversely, if the Member Node expressly doesn’t want DataONE to preserve back-versions, it can set the systemMetadata.replicationPolicy.numberReplicas field to 0.

Synchronizing Content from Mutable Member Nodes

At its core, DataONE is in the business of preserving definite versions of content through centrally coordinated peer-to-peer replication. That is, DataONE Coordinating Nodes direct certain Member Nodes to replicate newly synchronized objects from the originating Member Node to better preserve them. New versions of objects appear as first class immutable objects with unique PIDs, even if originating from mutable Member Nodes.

From the DataONE perspective the only difference between objects from mutable Member Nodes and immutable Member Nodes is the completeness of the series of versions it is able to synchronize and replicate.

The Problem

Current DataONE replication processes and fixity checks depend on content identified by a PID that does not change. If this were not enforced, mutable content from a member node would not be differentiated from corrupt copies of the object and our replication and recovery features would attempt to correct the byte inconsistency. The immutability requirement helps to ensure reproducible results of any use of an object. Any analysis on a data set repeated sometime in the future should yield identical results (within the limits of precision of the analytical tools) and this is one of the major guiding principles in creating DataONE as a long term data repository federation. By simply overwriting existing content using the same identifier, nodes cannot be relied upon for repeatable retrieval of content.

Proposal

The proposal for supporting “mutable” content is to allow a series identifier (SID) to facilitate the semantics of citing an object at the conceptual level, instead of the version level. As content changes over time, new identifiers (PIDs) will still be used to mark each change, but the conceptual object can continue to be referred to with an unchanging identifier (SID). The member node will be responsible for creating each version and assigning a unique PID to it and these objects will be synchronized and replicated to other DataONE member nodes as they are today. So instead of allowing content to be directly modified, we are allowing strongly-versioned chains to be referenced by an identifier; and relaxing the requirement that all revisions be resolvable forever.

The Series Identifier

The proposed solution is to model and implement a “series identifier” (SID) along with modified services that would work with both SIDs and PIDs. From a DataONE perspective, the series identifiers would be assigned to all versions of an object, be unique in DataONE (assigned to only one version chain), and would be reserved just as PIDs - from the same namespace. The series identifier, once assigned to the version chain, would similarly be immutable, and could apply to all new versions of the item. It is also assumed that in order to coordinate users to use one identifier for citations, that the cardinality for the citation identifier would be 0..1. The semantics for making API calls with a SID would, in general, be to return responses as if the call were made with the most current PID.

Member Nodes that only maintain the latest version of an item would be required to use a new PID for any updated content, and modify the System Metadata appropriately so that the new version can be synchronized with the network. The same SID would typically be used for the updated object, although we would allow the revision chain to shift to a new SID as desired by the client and/or member node.

It cannot be assumed that a user with an identifier in hand knows whether it is a SID or a PID, so DataONE expects the user to refer to the System Metadata once it has the item to determine if the identifier used in the call matches the PID or the SID. Similarly, they could interrogate search results for the same information. For high-level interfaces, like D1Client.getD1Object(id), the PID of the object returned may or may not match the passed in ‘id’. So, high-level functions or applications that use resolve will have to make sure they handle the new resolving semantics.
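The check described here amounts to comparing the identifier used in the call with two fields of the returned System Metadata. A minimal sketch, assuming the system metadata has already been retrieved and is represented as a plain dictionary (the function name and dictionary layout are hypothetical, not a DataONE client API):

```python
from typing import Any, Dict


def identifier_kind(requested_id: str, sysmeta: Dict[str, Any]) -> str:
    """Report whether requested_id matched the object's PID or its SID."""
    if requested_id == sysmeta.get("identifier"):
        return "PID"
    if requested_id == sysmeta.get("seriesId"):
        return "SID"
    raise ValueError(f"{requested_id!r} matches neither the PID nor the SID")
```

A caller that needs repeatable retrieval could use a result of `"SID"` as a cue to record the object's PID from the system metadata alongside the identifier it started with.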

It is recommended that search indexes include a search field for the series identifier that can also be returned in the results.

Semantics of “Current”

A SID chain closes with two types of ends:

Type 1: An object on the SID chain doesn’t have the “obsoletedBy” field.

Example:

P1(S1) ⟺ P2(S1)

P2 is a type 1 end.

Type 2: An object on the SID chain does have the “obsoletedBy” field, but the PID in the “obsoletedBy” field has a different SID (including no SID value).

Examples:

P1(S1) ⟺ P2(S2)

P1(S1) ⟺ P2()

P1 is a type 2 end on both chains.

It is tricky to determine a type 2 end if the object in the “obsoletedBy” field is missing. For example, in P1(S1) ⟺ P2(S1) ⟹ ??, we do not know the series id of the object “??”. So we generally consider such an object a type 2 end unless we are sure it is not an end, that is, unless another object in the chain (with the same series id) obsoletes the missing object.

In the previous example [P1(S1) ⟺ P2(S1) ⟹ ??], P2 is a type 2 end (case 12).

However, in P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S1), P2 is not an end (case 8), since “??” is in the obsoletes field of P4, which has the same series id, S1. We can be sure that “??” has the series id S1 as well, so P2 is not an end.

In P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S2), P2 is a type 2 end even though “??” is in the obsoletes field of P4, because P4 has a different series id, S2, and so we cannot be sure whether “??” has the series id S1 or S2.
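The classification of ends described above can be sketched as follows. The dictionary-based object model and the names `obj` and `chain_ends` are hypothetical and chosen for illustration; a missing object (“??”) is modeled as a PID that is referenced but absent from the `objects` dictionary.

```python
from typing import Dict, Optional


def obj(sid: Optional[str], obsoletes: Optional[str] = None,
        obsoleted_by: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Build a minimal, hypothetical system metadata record."""
    return {"sid": sid, "obsoletes": obsoletes, "obsoletedBy": obsoleted_by}


def chain_ends(objects: Dict[str, Dict[str, Optional[str]]], sid: str) -> Dict[str, int]:
    """Return {pid: end type (1 or 2)} for the ends of the chain with this SID."""
    members = {pid: o for pid, o in objects.items() if o["sid"] == sid}
    ends: Dict[str, int] = {}
    for pid, o in members.items():
        successor = o["obsoletedBy"]
        if successor is None:
            ends[pid] = 1  # type 1: no obsoletedBy value
        elif successor in objects:
            if objects[successor]["sid"] != sid:
                ends[pid] = 2  # type 2: successor has a different (or no) SID
        elif not any(m["obsoletes"] == successor for m in members.values()):
            # Successor is missing and no object in the same series obsoletes it,
            # so its series is unknown and this object is treated as an end.
            ends[pid] = 2
    return ends
```

For the three examples above, this sketch reports P2 as a type 2 end when “??” is referenced only by P2, reports P4 as the only end when P4(S1) obsoletes “??”, and again reports P2 as a type 2 end of S1 when the object obsoleting “??” is P4(S2).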

Ideally, if there is one and only one end on a SID chain, this end will be the HEAD (current) version. Such chains are called ideal chains.

If there is more than one end on a SID chain because the system metadata is incomplete, it is hard to determine which one is the real end. Such a chain is not an ideal chain, and the following mechanism is used to determine the HEAD version:
  1. Choose the end with the latest dateUploaded in the chain as the temporary HEAD version. This rule is sufficient as long as the dateUploaded time stamps of the objects are reliable.

  2. Because the time stamps may be unreliable, test whether any object with the SAME SID obsoletes the temporary HEAD on the obsolescence chain. If nothing obsoletes the temporary HEAD, the temporary HEAD is the final HEAD; otherwise, the end of that obsolescence chain is the final HEAD.

Take this example: P1(S1, t1) ⟸ P2(S1, t2) ⟸ P3(S1, t3) (case 19), where t1, t2 and t3 are time stamps and t1 > t2 > t3. The time stamps are out of order: the newest version, P3, was uploaded the earliest, while the oldest version, P1, was uploaded the latest.
  1. This chain has three type 1 ends - P1, P2 and P3. It is not an ideal chain.

  2. Choose P1, which has the latest dateUploaded, as the temporary HEAD.

  3. P2 obsoletes P1, and P3 obsoletes P2, on the obsolescence chain P1 ⟸ P2 ⟸ P3. So we choose the end of the whole chain - P3 - as the final HEAD.
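The two-step selection of the HEAD can be sketched as below. The function takes the ends of the chain as already determined, and both the record layout and the name `pick_head` are assumptions for illustration; dateUploaded is modeled as an integer so that larger means later.

```python
from typing import Any, Dict, List


def pick_head(ends: List[Dict[str, Any]], members: List[Dict[str, Any]]) -> str:
    """Choose the HEAD PID for a series.

    ends:    the chain ends among the members (type 1 or type 2)
    members: every object that carries the series identifier
    """
    # Step 1: the end with the latest dateUploaded is the temporary HEAD.
    head = max(ends, key=lambda o: o["dateUploaded"])
    # Step 2: while some member of the same series obsoletes the current
    # HEAD, move along the obsoletes assertions to the end of the chain.
    visited = {head["pid"]}
    while True:
        newer = next(
            (m for m in members
             if m["obsoletes"] == head["pid"] and m["pid"] not in visited),
            None,
        )
        if newer is None:
            return head["pid"]
        head = newer
        visited.add(head["pid"])
```

Applied to the case 19 example, step 1 picks P1 and step 2 walks through P2 to P3. The `visited` set is a defensive addition, not something the text above calls for, so that inconsistent metadata containing a cycle cannot make the loop run forever.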

Version Storage

Mutable content implies that back-versions of content may not be readily available on the nodes that originally produce the content. For metadata and resource maps, the coordinating nodes will store previous versions of objects during the synchronization process, but any data updates will result in only the latest version being available at the originating node. If the data objects were replicated (as is the hope), it is likely that previous versions of the data can still be resolved from replica target nodes, though this is dependent on replication policies, synchronization schedules and the availability of replica storage across the federation.

The current DataONE storage model, through the MN_Storage.update method, places responsibility for storing versions squarely on the submitter. Each update to the object requires a new unique identifier (PID) and must state which PID the new version is obsoleting. We will continue to require that unique PIDs are provided for each and every version of an object, but the member node will not be required to maintain a copy of previous revisions if it chooses not to. An optional series identifier (SID) can be provided with object SystemMetadata to group revisions together and to provide a convenient way to refer to the latest version of the object.

Version preservation

As is currently the case, the member node should maintain all versions of content using unique identifiers (PID) and synchronization will harvest each new revision to the network. While there will be no requirement that the Member node continue to make available the object identified by the obsoleted PID, the hope is that they will persist the data history as best they can. If the objects in the revision chain have a SID assigned, the new PID will be considered the latest version of this series.

The member node can allow access to the current version of the object using MN_Read.get(sid) as a convenience and any reference to the SID would resolve to the latest version of the object with a potentially different checksum and PID from what was originally present when the citation was distributed.

The member node must [minimally] maintain system metadata for the current revision of the object. Any updated object is still required to be identified by a new unique PID, but would include the same SID used in the previous version. The obsoletes field should indicate that the new PID replaces the previous PID. The coordinating node learns about the updated content during synchronization because there is:

  • a new PID

  • an updated dateSystemMetadataUpdated timestamp

  • an updated checksum (other fields may also be updated).

N.B. Multiple revisions between synchronization periods would not result in multiple versions recorded in the federation - just the revision[s] that happened to be synchronized would be persisted in DataONE. This leaves open the possibility of an end user retrieving a version from the MN that will ultimately not be persisted in perpetuity.
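The effect can be illustrated with a toy simulation. It assumes a Member Node that exposes only its latest revision and a harvest that records whatever is exposed at each harvest time; the function name and the integer time stamps are hypothetical, and real synchronization is more involved.

```python
from typing import List, Tuple


def harvested_pids(revisions: List[Tuple[int, str]], harvest_times: List[int]) -> List[str]:
    """Return the PIDs captured when only the latest revision is visible.

    revisions:     (creation time, PID) pairs for each revision on the Member Node
    harvest_times: times at which synchronization harvests the node
    """
    captured: List[str] = []
    for t in sorted(harvest_times):
        # PIDs created up to time t, oldest first; only the last one is exposed.
        visible = [pid for created, pid in sorted(revisions) if created <= t]
        if visible and visible[-1] not in captured:
            captured.append(visible[-1])
    return captured
```

With revisions created at times 1, 2 and 5 and harvests at times 3 and 6, only the second and third revisions are captured; the first was replaced before any harvest took place.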

Working drafts vs. Repository publishing

DataONE essentially considers member nodes as the originators of selected versions of content. That is, not every intermediate revision on the way to a final product should necessarily be saved for future reference. Organizations following the mutable content model for storage may wish to limit the objects returned by listObjects() to those that are considered to be in their publishable form. Certainly these objects can later be updated as needed, but minimizing draft-status objects will reduce the amount of [possibly irretrievable] draft content floating around the federated network.

Types of Mutable Objects

As illustrated in the optional use cases, the rate and regularity of change of objects can be widely variable. The more frequent the change, the less likely that all versions would need to be reproduced, and the utility of complete version history diminishes. One can imagine a member node serving up an unrecorded data stream, such as a web-cam, delaying creating a version until a user calls MN.get() on the item, by tee’ing the output stream to file while returning the object.

Additionally, the need to keep past versions may be less important for metadata objects (for example, when correcting typos that do not change the meaning or interpretation of the data) than for data objects or resource maps.

Accumulating datasets

The use case of mutable data objects that grow with new records appended to the end of a table, for example, was given as a common practice for some groups, and one that would produce progressively redundant information with each persisted version. The motivation for rolling up records accumulated over time into a single object, instead of publishing a new data file for each batch, is ease of use for end users. Using a SID to access the data object will always give the latest snapshot of the data records, whereas old revisions may or may not remain accessible.

Mixed metadata-data objects

Objects like NetCDF files that include both metadata and data in the same object will be managed with the same PID and SID considerations. If only the metadata portion of the file is modified, the SID may remain the same, but a new PID and checksum must be created and made available for synchronization. The old revision may immediately become inaccessible using the PID and that is allowable under the proposal.

Retrieval / Citation Support

Implicit in the support for versioned content is support for retrieval of, or possibly just resolution to, the current object bytes by the identifier assigned in the originating system. At a minimum, CNs will be required to support calculating which is the current version of a series of versions and returning it or its identifier. This will be accomplished using the series identifier (SID) associated with object[s] in a revision chain. The “current” version of an object is defined as the non-obsoleted object with a SID that matches the requested identifier. Objects that are marked as “archived” may be returned as the most current version, but they should not be seen in default search interfaces. Since DataONE identifiers have no special formatting semantics, those following a citation will not know by looking at the identifier whether it refers to a specific version (PID) or the latest version of the item (SID), so services may be provided to easily investigate an entire version series. Existing services allow clients to deduce this information by inspecting the system metadata for the identifier and following any obsolescence properties as needed.
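
The definition of the “current” version can be written down as a short lookup. This is a minimal sketch over a hypothetical in-memory catalog (the CATALOG dictionary and current_pid() are illustrative names, not a DataONE API), and it applies the definition literally: a SID denotes the one object in the series that has no obsoletedBy value. Archived status is deliberately not consulted, since archived objects may still be returned as the most current version.

```python
# Hypothetical catalog: PID -> the two properties needed for the lookup.
CATALOG = {
    "PID_A": {"sid": "SID_1", "obsoleted_by": "PID_B"},
    "PID_B": {"sid": "SID_1", "obsoleted_by": None},
    "PID_X": {"sid": None, "obsoleted_by": None},
}


def current_pid(identifier, catalog):
    """Return the PID that an identifier (PID or SID) denotes."""
    if identifier in catalog:             # a PID always denotes itself
        return identifier
    heads = [
        pid for pid, meta in catalog.items()
        if meta["sid"] == identifier and meta["obsoleted_by"] is None
    ]
    if len(heads) != 1:
        raise LookupError(f"no unique current version for {identifier!r}")
    return heads[0]


head = current_pid("SID_1", CATALOG)
same = current_pid("PID_A", CATALOG)
```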

Retrieval vs. Resolution

Because the content of an object is retrieved in a separate call from its system metadata, use of the SID for MN Read API calls is troublesome because the content may be updated between the two calls. It would be impossible to tell if the bytes retrieved were incorrect (bit rot) or correct (newer version) when comparing checksums in this case. If data consistency is important to the caller, the PID should be used to guarantee that only the expected bytes (or a NotFound exception) are returned by any MN.get calls.
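
The ambiguity can be reproduced with a toy node that follows the mutable content model, that is, one that discards the old revision when a new one arrives. ToyMemberNode, its methods, and the NotFound class below are hypothetical stand-ins written for this sketch; they only imitate the two-call pattern described above.

```python
from hashlib import sha256


class NotFound(Exception):
    """Stand-in for the NotFound exception mentioned above."""


class ToyMemberNode:
    """Hypothetical node that keeps only the head revision of a series."""

    def __init__(self):
        self.objects = {}   # PID -> bytes
        self.head = {}      # SID -> head PID

    def put(self, pid, sid, data):
        previous = self.head.get(sid)
        if previous is not None:
            del self.objects[previous]    # old revision becomes inaccessible
        self.objects[pid] = data
        self.head[sid] = pid

    def _to_pid(self, identifier):
        pid = self.head.get(identifier, identifier)
        if pid not in self.objects:
            raise NotFound(identifier)
        return pid

    def get(self, identifier):
        return self.objects[self._to_pid(identifier)]

    def get_system_metadata(self, identifier):
        pid = self._to_pid(identifier)
        digest = sha256(self.objects[pid]).hexdigest()
        return {"identifier": pid, "checksum": digest}


mn = ToyMemberNode()
mn.put("PID_A", "SID_1", b"first revision")
sysmeta = mn.get_system_metadata("SID_1")      # call 1: metadata via the SID
mn.put("PID_B", "SID_1", b"second revision")   # update lands between the calls

# Call 2 via the SID: the checksum no longer matches, and the caller cannot
# tell corruption from a newer version.
sid_matches = sha256(mn.get("SID_1")).hexdigest() == sysmeta["checksum"]

# Call 2 via the PID: either the expected bytes or NotFound, never a surprise.
try:
    mn.get(sysmeta["identifier"])
    pid_outcome = "bytes"
except NotFound:
    pid_outcome = "NotFound"
```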

Those making a citation may wish to cite a specific version, or the latest current version. Followers of citations may wish to, if given an identifier representing a specific version (PID), find out what is the latest version (another, newer PID, or the SID). Conversely, if given a series identifier that navigates to the latest version, they may wish to find out what the content was at some previous point in time (e.g., the time of the citation) by following the obsolescence chain backward.
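
Walking backward from the latest version to the one that was current at citation time might look like the sketch below. It assumes each system metadata record carries an upload date (named dateUploaded here) alongside the obsoletes property; the CHAIN data and version_at() are invented for the example, and in practice an older revision found this way may no longer be retrievable.

```python
from datetime import date

# Hypothetical chain, newest last; values stand in for system metadata.
CHAIN = {
    "PID_A": {"obsoletes": None, "dateUploaded": date(2012, 3, 1)},
    "PID_B": {"obsoletes": "PID_A", "dateUploaded": date(2013, 6, 15)},
    "PID_C": {"obsoletes": "PID_B", "dateUploaded": date(2014, 1, 10)},
}


def version_at(head_pid, when, get_system_metadata):
    """Follow 'obsoletes' links back to the revision current on 'when'."""
    pid = head_pid
    while pid is not None:
        sysmeta = get_system_metadata(pid)
        if sysmeta["dateUploaded"] <= when:
            return pid                     # newest revision not after 'when'
        pid = sysmeta["obsoletes"]
    return None                            # nothing existed yet on that date


cited = version_at("PID_C", date(2013, 12, 31), CHAIN.__getitem__)
too_early = version_at("PID_C", date(2011, 1, 1), CHAIN.__getitem__)
```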

Service development plans

DataONE will be providing CN services for navigating to the latest version of an object, since the only way to do it currently is for clients to serially retrieve the system metadata for versions in the chain until they reach the head version, which can be inefficient. A new method to retrieve the entire version history is also under consideration.
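
The serial traversal that clients perform today can be sketched as follows. The get_system_metadata argument is a placeholder for whatever call returns system metadata for a PID; here it is backed by a hypothetical dictionary so the example runs on its own. The hop counter shows why the approach scales poorly: one metadata fetch is needed per revision in the chain.

```python
# Hypothetical chain; each value stands in for a system metadata document.
CHAIN = {
    "PID_A": {"obsoletedBy": "PID_B"},
    "PID_B": {"obsoletedBy": "PID_C"},
    "PID_C": {"obsoletedBy": None},
}


def find_head(pid, get_system_metadata):
    """Follow obsoletedBy links; return (head PID, number of fetches)."""
    seen = set()
    fetches = 0
    while True:
        if pid in seen:
            raise ValueError(f"cycle in obsolescence chain at {pid!r}")
        seen.add(pid)
        sysmeta = get_system_metadata(pid)   # one round trip per revision
        fetches += 1
        successor = sysmeta["obsoletedBy"]
        if successor is None:
            return pid, fetches
        pid = successor


head, fetches = find_head("PID_A", CHAIN.__getitem__)
```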

MN API method changes

MN.get(Identifier id):

Identifier can be either a PID or SID, and if a SID, return the bytes of the HEAD PID in the series.

MN.getSystemMetadata(Identifier id):

If PID, return SystemMetadata of PID. If SID, return HEAD PID SystemMetadata.

MN.describe(Identifier id):

If PID, return header for PID. If SID, return header for HEAD PID.

MN.getChecksum(Identifier pid):

Requires PID to effectively verify data integrity.

MN.create(Identifier pid, object, SystemMetadata):

Identifier must be PID and included in accompanying systemMetadata. SID may be included in accompanying systemMetadata if known at time of creation. The SID mustn’t exist in the system.

MN.update(Identifier id, Identifier newPid, SystemMetadata):

Identifier id may be a PID or SID; in the case of a SID, the method works with the HEAD PID of the chain. The new Identifier must be a PID and must match the accompanying SystemMetadata. The new SID can match the old SID in the previous SystemMetadata (objects are in the same series), or it can be any unique SID that does not already exist in the system (newly assigning a SID or shifting the SID because of a “scientifically meaningful change”). Moreover, the new system metadata may omit the SID, regardless of whether the previous version has one.
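
The SID rules for MN.update() reduce to three acceptable outcomes and one rejection. The function below is an illustrative check only; classify_update_sid() and the identifiers_in_use set are hypothetical, and a real member node would consult its own identifier store.

```python
def classify_update_sid(old_sid, new_sid, identifiers_in_use):
    """Classify the SID carried by the new system metadata in an update.

    Raises ValueError when the SID is neither the old SID nor unused.
    """
    if new_sid is None:
        return "no-sid"          # the new revision may omit the SID
    if new_sid == old_sid:
        return "same-series"     # objects stay in the same series
    if new_sid in identifiers_in_use:
        raise ValueError(f"SID {new_sid!r} already exists in the system")
    return "new-series"          # newly assigned or shifted SID


in_use = {"PID_A", "SID_1", "SID_9"}
same = classify_update_sid("SID_1", "SID_1", in_use)
dropped = classify_update_sid("SID_1", None, in_use)
shifted = classify_update_sid("SID_1", "SID_2", in_use)
```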

MN.getLogRecords(?idFilter):

Filter can be PID or SID. The MN should resolve the SID to the HEAD PID, and return the log records for that PID. If a client wishes to retrieve log records for the entire family of objects referenced by a SID, then the client should retrieve a list of PIDs for the SID, then call getLogRecords for each PID to retrieve the entire set of log records. The Log.identifier field will only contain PID values, never SIDs.
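
The client-side loop for collecting log records across a whole series could look like this. The list_pids() and get_log_records() functions are hypothetical stand-ins for the listObjects and getLogRecords calls, backed by in-memory data so that the sketch is self-contained.

```python
# Hypothetical node contents: PID -> SID, plus a flat event log keyed by PID.
OBJECTS = {"PID_A": "SID_1", "PID_B": "SID_1", "PID_Z": "SID_2"}
LOG = [
    {"identifier": "PID_A", "event": "read"},
    {"identifier": "PID_B", "event": "read"},
    {"identifier": "PID_B", "event": "update"},
    {"identifier": "PID_Z", "event": "read"},
]


def list_pids(sid):
    """Stand-in for listing the PIDs that share a SID."""
    return [pid for pid, s in OBJECTS.items() if s == sid]


def get_log_records(pid):
    """Stand-in for a per-PID log query; records carry PIDs only."""
    return [rec for rec in LOG if rec["identifier"] == pid]


def log_records_for_series(sid):
    """Gather log records for every revision in the series."""
    records = []
    for pid in list_pids(sid):
        records.extend(get_log_records(pid))
    return records


series_log = log_records_for_series("SID_1")
```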

MN.delete(Identifier id):

Identifier can be PID or SID. If PID, delete that specific version. If a SID, delete the HEAD PID version.

MN.archive(Identifier id):

Identifier can be PID or SID. If PID, archive that specific version. If a SID, archive the HEAD PID.

MN.isAuthorized(Identifier id):

Can accept either PID or SID, but in the case of a SID parameter only reports on the accessPolicy for the HEAD PID.

MN.synchronizationFailed(Identifier pid):

Inter-node communication should only use PIDs for identifying objects.

MN.replicate(Identifier id):

No changes in behavior. SystemMetadata object has changed structure so there is a change in signature. Replication is based on the PID so that we can ensure content has not been corrupted.

MN.getReplica(Identifier id):

Can only make requests for PIDs so that checksum integrity can be verified.

MN.systemMetadataChanged(Identifier, serialVersion, dateSysMetaModified):

May be called on the MN if the CN infers an obsoletes relationship for a new PID based on a shared SID. Identifier can be either a PID or a SID. If a SID, the MN will fetch SystemMetadata from the CN using SID (which will return the HEAD PID SystemMetadata).

MN.listObjects(?identifier=XXX):

Returns an ObjectList like normal, but can be filtered by identifier (SID or PID). If the Identifier is a PID, it returns just the single entry for that PID. If the Identifier is a SID, it returns the entries (PIDs) for all objects that have that SID.

MN.view(Identifier id):

Can accept either PID or SID. If a PID, get the formatted view for the specified version. If a SID, get the view for the HEAD PID.

MN.getPackage(Identifier id):

Can accept either PID or SID. If a PID, get the package of the specified version. If a SID, get the package of the HEAD PID.

MN.updateSystemMetadata(Identifier id, SystemMetadata newSysmeta):
Requires a PID. The SID can exist in the newSysmeta object. Since SID is immutable, the SID in the newSysmeta should match the current SID if it exists. If current system metadata doesn’t have a SID, the new SID can be one of the following cases:
  1. The new SID is null (without a SID).

  2. The new SID is a unique identifier which doesn’t exist in the system.

  3. The new SID matches the SID in the system metadata of the object in the “obsoletes” value.

  4. The new SID matches the SID in the system metadata of the object in the “obsoletedBy” value.
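
The cases above can be collected into a single predicate. This is a sketch of one reading of the rules, not a reference implementation: sid_change_allowed() and the CATALOG layout are hypothetical, and the function treats “should match the current SID if it exists” as a strict requirement, so removing an existing SID is rejected.

```python
# Hypothetical store: PID -> the properties relevant to the SID rules.
CATALOG = {
    "PID_A": {"sid": "SID_1", "obsoletes": None, "obsoleted_by": "PID_B"},
    "PID_B": {"sid": None, "obsoletes": "PID_A", "obsoleted_by": "PID_C"},
    "PID_C": {"sid": "SID_1", "obsoletes": "PID_B", "obsoleted_by": None},
    "PID_Q": {"sid": "SID_7", "obsoletes": None, "obsoleted_by": None},
}


def sid_change_allowed(pid, new_sid, catalog):
    """Return True if new_sid is acceptable in updated system metadata."""
    current = catalog[pid]
    if current["sid"] is not None:
        return new_sid == current["sid"]          # an existing SID is immutable
    if new_sid is None:
        return True                               # case 1: still no SID
    in_use = set(catalog) | {m["sid"] for m in catalog.values() if m["sid"]}
    if new_sid not in in_use:
        return True                               # case 2: brand-new identifier
    neighbours = (current["obsoletes"], current["obsoleted_by"])
    return any(                                   # cases 3 and 4
        n in catalog and catalog[n]["sid"] == new_sid for n in neighbours
    )


joins_series = sid_change_allowed("PID_B", "SID_1", CATALOG)
steals_sid = sid_change_allowed("PID_B", "SID_7", CATALOG)
```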

CN API method changes

CN.get(Identifier id):

Behaves the same as MN

CN.describe(Identifier id):

Behaves the same as MN

CN.getSystemMetadata(Identifier id):

Behaves the same as MN. N.B. This method can be used with a SID to locate the PID of the latest version which may be sufficient without implementing a getHead() method.

CN.getChecksum(Identifier id):

Behaves the same as MN

CN.getLogRecords(?idFilter):

Behaves the same as MN

CN.create(Identifier pid, object, SystemMetadata):
Identifier must be PID and included in accompanying systemMetadata. SID may be included in accompanying systemMetadata if known at time of creation. The SID can be one of the following cases:
  1. The SID is a unique identifier which doesn’t exist in the system.

  2. The SID matches the SID in the system metadata of the object in the “obsoletes” value.

  3. The SID matches the SID in the system metadata of the object in the “obsoletedBy” value.

CN.registerSystemMetadata(Identifier pid, SystemMetadata sysmeta):
Requires a PID, allows a SID in the SystemMetadata. The SID can be one of the following cases:
  1. The SID is a unique identifier which doesn’t exist in the system.

  2. The SID matches the SID in the system metadata of the object in the “obsoletes” value.

  3. The SID matches the SID in the system metadata of the object in the “obsoletedBy” value.

CN.updateSystemMetadata(Identifier id, SystemMetadata newSysmeta):

Behaves the same as MN

CN.delete(Identifier id):

Behaves the same as MN

CN.archive(Identifier id):

Behaves the same as MN

CN.reserveIdentifier(Identifier id):

Accepts PID or SID values and treats them exactly the same.

CN.hasReservation(Identifier id):

Accepts PID or SID values and treats them exactly the same.

CN.resolve(Identifier):

If PID, resolve it. If a SID, then resolve the HEAD PID.

CN.isAuthorized(Identifier id):

Behaves the same as MN

CN.isNodeAuthorized(Identifier id):

Only accept PID since it is a replication related method. No behavior change.

CN.updateReplicationMetadata(Identifier id):

Only accept PID since it is a replication related method. No behavior change.

CN.deleteReplicationMetadata(Identifier id):

Only accept PID since it is a replication related method. No behavior change.

CN.setReplicationStatus(Identifier id):

Only accept PID since it is a replication related method. No behavior change.

CN.setReplicationPolicy():

Only accept PID since it is a replication related method. No behavior change.

CN.setRightsHolder():

Ownerships apply to particular revisions, not the entire chain. If a SID is passed in to a method that affects one of these policies, the change is applied to the HEAD PID for that series.

CN.setAccessPolicy():

Policies apply to particular revisions, not the entire chain. If a SID is passed in to a method that affects one of these policies, the change is applied to the HEAD PID for that series.
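
The per-revision scope of a policy change can be illustrated with a toy store. Everything here is hypothetical (the POLICIES and HEAD dictionaries, the flattened policy representation, and the set_access_policy() helper); the point is only that a SID argument is translated to the HEAD PID and that earlier revisions keep their own policies.

```python
# Hypothetical state: per-PID policies and the current head of each series.
POLICIES = {
    "PID_A": {"public": "read"},
    "PID_B": {"public": "read"},
}
HEAD = {"SID_1": "PID_B"}


def set_access_policy(identifier, policy):
    """Apply a policy to one revision; a SID selects the HEAD PID."""
    pid = HEAD.get(identifier, identifier)   # SID -> head PID, PID -> itself
    if pid not in POLICIES:
        raise KeyError(identifier)
    POLICIES[pid] = dict(policy)
    return pid


changed = set_access_policy("SID_1", {"alice": "write"})
```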

CN.setObsoletedBy(Identifier id, Identifier obsoletedByPid):

Only PIDs can be used when expressing obsolescence chain.

CN.view(Identifier id):

Behaves the same as MN

CN.listObjects(?identifier=XXX):

Behaves the same as MN

Use Cases

The use cases below organize the identified requirements related to mutable content, with the most relevant use cases listed first.

Prioritized goals

1. Data preservation

Defined as activities that help ensure continued discoverability and usefulness, usually in reference to metadata rather than data.

  • metadata adaptation / improvement

  • metadata correction

  • absent a “push” notification, users should be able to easily determine if they have the most current version of something, and easily and quickly get it.

2. Mutable Content Member Node support

For institutions following a mutable content model:

  • Provide a path forward for integrating into DataONE network.

  • Minimize the burden of adaptation to working with versioned content.

  • Allow use of their identifiers in DataONE in the context they are familiar with (if their identifier always points to the latest, in DataONE it should too)

  • Options for maintaining past versions

  • Differentiating between incremental internal saves, vs. new revision.

3. Citation support

  • avoid unnecessary costs associated with obtaining resolvable identifiers (e.g., DOIs) for each version

  • coordinating citation by a common identifier for citation tracking

  • ensuring that the cited object is the same when accessed as when it was originally used

  • ability to cite a version as well as the conceptual object

Optional

4. Support for frequently changing / overwritten data

What is the best way to version mutable data that changes frequently but may or may not be used? For example, a “current time” object that is replaced every minute, or a “current weather radar” object that is replaced every 3 hours.

  • preserving every version could be very expensive for very little value

  • what mechanisms could be employed to minimize the overhead?

The underlying dynamic here is the rate of mutation vs. the rate of synchronization.

5. Support for accumulating datasets

This means supporting data objects that add records over time, either:

  • within pre-defined bounds e.g. “2013 year-to-date” (the metadata could stay the same, while data changes)

  • without pre-defined bounds e.g. “JGoodall primate observation log”?

6. Support for mixed metadata/data objects

Some formats, for example netCDF, combine data with metadata, so the metadata should be allowed to change without impacting the consistency assessment of the data itself.

  • changes in the file are treated like any other change; they will be versioned, but may be referenced using a seriesId

7. Supporting ‘unrecorded’ data streams

Mutable content can theoretically include things that are live feeds from sensors, but are otherwise not captured.

This proposal does not accommodate streams unless they have discrete snapshots that can be referenced as part of a seriesId.

  • Should we allow identifiers to resolve to a URL that returns an input stream?

  • Can we prevent it?

  • Can we mark it as the user’s responsibility to do the mn.create?