Immutability of Content in DataONE

Overview

To support the goals of preservation and scientific reproducibility, all registered objects in DataONE are considered immutable, with each object representing a published snapshot of data or metadata associated with a specific time. DataONE manages the registration, indexing, and replication of these snapshots throughout the DataONE network of Member Nodes. Upon this foundation, DataONE can guarantee that the exact byte array returned through the DataONE Read APIs (MNRead and CNRead) is the one submitted and registered.

Any repository that assigns unique identifiers to snapshots (or revisions) can participate as a DataONE Member Node, irrespective of whether it retains past snapshots. This is accomplished by the use of two identifiers, one representing the revision, and the other representing the changing (or mutable) entity. For those Member Nodes only managing the mutable entity, as long as a unique revision-level identifier is generated upon each update to the entity, DataONE will not reject the update. In situations where the rate of change is faster than DataONE’s Member Node synchronization, it is possible that some snapshots will fail to be registered. However, since that revision’s unique identifier is never indexed or otherwise made available, the chance of needing to retrieve that snapshot in the future (and not finding it) is very small.

The two identifiers are known as:

Persistent Identifier (PID)

declared in the systemMetadata.identifier field. This identifier represents the snapshot or revision DataONE replicates among the DataONE federation.

Series Identifier (SID)

declared in the systemMetadata.seriesId field. This identifier represents the mutable content, and resolves to the latest revision among all registered revisions when used in the DataONE Read APIs.

DataONE relies on content originators to generate the identifiers for each snapshot being registered (the series identifier is optional), and to determine which field will hold the “citable” identifier.

Changes constituting a new snapshot

DataONE considers any change that results in a different byte array of content to be a new snapshot, and thus a new object to be registered. Subtle changes, such as whitespace differences, although potentially meaningless, therefore constitute a new object. If changed content is not identified with a new PID, the content held on that Member Node is invalid. Member Nodes that periodically regenerate their stored content or manipulate it upon retrieval will need to take extra care to validate checksums after regeneration or manipulation and resolve any discrepancies in content they may encounter.
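The byte-level rule above can be checked with a checksum comparison. The following is a minimal sketch, not DataONE code: the function names and the choice of SHA-256 are assumptions made for illustration, and a Member Node would use whichever algorithm is recorded in the object's system metadata.

```python
import hashlib


def checksum(content: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the exact byte array (algorithm is an assumption)."""
    return hashlib.new(algorithm, content).hexdigest()


def needs_new_pid(registered_checksum: str, current_content: bytes,
                  algorithm: str = "sha256") -> bool:
    """True when the stored bytes no longer match the registered snapshot."""
    return checksum(current_content, algorithm) != registered_checksum


# Hypothetical content: regeneration introduced a single trailing space.
original = b"site,temp\nA,21.5\n"
regenerated = b"site,temp\nA,21.5 \n"

registered = checksum(original)
unchanged = needs_new_pid(registered, original)       # bytes identical
changed = needs_new_pid(registered, regenerated)      # whitespace differs
```

Here `unchanged` is false and `changed` is true: the whitespace-only difference is enough to require a new PID.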

Changes constituting a new series

The SID is provided expressly to group the snapshots of a single entity that is stable over time. It was not intended to represent highly volatile entities, or those that significantly “drift” over time. Member Nodes are encouraged to instead use registered services for volatile content, and create new entities when significant change in the scope of an entity occurs.

It is not clear that individual contributors will always have the means to register a service, and so may have entities organized for input, such as a file that accumulates observations. If items such as these are registered as an object, the rightsHolders should be mindful to apply reasonable temporal bounds.

When the scope of an item has changed significantly, it is permissible to supply a new seriesId to the next snapshot while still relating the two items by obsoletes and obsoletedBy. However, this is not necessary; it may be more straightforward to simply save the new entity and relate the snapshots through provenance mechanisms instead of through system metadata.

Usage Conventions

DataONE anticipates data consumers using other contributors’ data will prefer to cite using the PID, for the certainty that provides, and will prefer using the SID when citing indirectly (when citing metadata). Similarly, we anticipate content originators will wish to promote one identifier for citation. For those content providers using both identifiers, it is recommended to assign the preferred identifier according to anticipated data consumer preference.

Aggregating Download Statistics

To aid in aggregating download statistics for data providers, DataONE provides the cn/v2/query/logsolr query endpoint. Both the PID and SID fields are included in the index records to allow straightforward retrieval of download statistics by either field.

Identifier resolution in DataONE APIs

All DataONE APIs accepting an Identifier must treat PIDs as requests for the exact snapshot, and SIDs as requests for the latest snapshot that the DataONE Node has knowledge of. Due to the distributed nature of snapshot replication, it is possible that a replica Member Node does not know about the latest snapshot, in which case a request by SID to that node should return a previous snapshot. For all nodes, even the authoritative Member Node, a request by PID for a snapshot that it doesn’t host must return a NotFound exception.

End users should therefore rely on v2.cn.resolve for object retrieval. If the Identifier used for the resolve is a SID, this method will return an ObjectLocationList for the latest known snapshot.
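The node-level behavior described in this section can be sketched with a small in-memory model. This is an illustration only: the `Node`, `Snapshot`, and `NotFound` names are hypothetical stand-ins, and for brevity the node picks the latest dateUploaded among the series members it holds rather than following obsoletedBy links.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


class NotFound(Exception):
    """Stand-in for the DataONE NotFound exception."""


@dataclass(frozen=True)
class Snapshot:
    pid: str
    series_id: Optional[str]
    date_uploaded: datetime
    content: bytes


class Node:
    """Hypothetical node that only knows the snapshots stored on it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._by_pid: Dict[str, Snapshot] = {}

    def store(self, snapshot: Snapshot) -> None:
        self._by_pid[snapshot.pid] = snapshot

    def get(self, identifier: str) -> bytes:
        # A PID is a request for that exact snapshot.
        if identifier in self._by_pid:
            return self._by_pid[identifier].content
        # Otherwise treat the identifier as a SID: return the latest
        # snapshot of that series that this node knows about.
        members = [s for s in self._by_pid.values() if s.series_id == identifier]
        if not members:
            raise NotFound(identifier)
        return max(members, key=lambda s: s.date_uploaded).content


# Hypothetical data: the replica holds only the older snapshot of series "S".
older = Snapshot("P1", "S", datetime(2015, 1, 1), b"v1")
newer = Snapshot("P2", "S", datetime(2015, 2, 1), b"v2")

authoritative = Node("M")
authoritative.store(newer)
replica = Node("R1")
replica.store(older)
```

With this setup, `replica.get("S")` returns the older bytes, `authoritative.get("S")` returns the newer bytes, and `replica.get("P2")` raises `NotFound`, which is why resolving through the Coordinating Node is the safer route for end users.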

Series Identifier resolution to the head revision

The primary way of determining the head of a series is via the obsoletedBy field in the system metadata, which links to the next snapshot in the chain. With all of the snapshots synchronized, there should be only one snapshot in the series that is not obsoleted, and that is the head. With incomplete synchronization, there may be more than one snapshot that is not obsoleted, and in these cases, the one with the latest dateUploaded value will be chosen as the head.

Use of the obsoleted fields as the primary indicator for the head of the series is preferred because it is a direct reflection of the rightsHolder’s intentions, whereas the dateUploaded value is only a reflection of the order in which the Member Node processes uploaded content.

Importance of the obsolete fields

Member Nodes that manage by mutable entity (don’t preserve prior snapshots) should populate the obsoletes and obsoletedBy fields, even if they do not plan to preserve older snapshots. Replica nodes and the DataONE Coordinating Nodes can use these fields to optimize queries for finding the head of the series.

Question: should mutable Member Nodes keep systemMetadata documents for snapshots they no longer have? (It would allow the obsoletedBy fields to be synchronized, but would it conflict with the behavior of deleted items, and would it matter?)

Mutable Member Node example

To illustrate by way of example, author A uploads an item with the identifier S to Member Node M, not through the DataONE API but through M’s primary API. M builds a systemMetadata document for S, generating a PID, P1, to uniquely identify the initial snapshot, assigns an upload date of D1, and uses the identifier S for the seriesId. DataONE synchronizes the object, and replicates the snapshot P1 to one other Member Node, R1. The author, A, then saves changes to the item, whereupon M generates another PID, P2, to uniquely identify this newer snapshot, uses S in the seriesId field, puts P1 in the obsoletes field, and D2 in the dateUploaded field. (Size and checksum are also calculated for the new snapshot.) This is synchronized and replicated to a different Member Node, R2.

When v2.cn.resolve(S) is called, an ObjectLocationList for P2 is returned, listing Member Nodes M and R2 as locations for retrieval. A call to M.get(P2) or R2.get(P2) will return the latest snapshot, as will the same call using S as the identifier instead. However, a call to R1.get(P2) will return NotFound, because it was not a replication target for that snapshot, and a call to R1.get(S) will return the initial snapshot, because it has snapshot P1 with the associated seriesId S.

Notice, too, that v2.cn.resolve(P1) will return an ObjectLocationList containing both M and R1, although retrieval from M is no longer possible, since M doesn’t preserve past snapshots. M.get(P1) should return a NotFound, and the client will move on to R1, and be able to retrieve P1 with R1.get(P1).
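The client-side fallback in this paragraph can be sketched as a loop over the resolved locations. This is a minimal illustration under stated assumptions: `retrieve`, `m_get`, and `r1_get` are hypothetical names, and each location is modeled as a plain callable rather than a real Member Node client.

```python
from typing import Callable, Iterable, Tuple


class NotFound(Exception):
    """Stand-in for the DataONE NotFound exception."""


def retrieve(pid: str, locations: Iterable[Tuple[str, Callable[[str], bytes]]]) -> bytes:
    """Try each resolved location in order, moving on when one returns NotFound."""
    for _node_name, get in locations:
        try:
            return get(pid)
        except NotFound:
            continue
    # No listed location could serve the snapshot.
    raise NotFound(pid)


def m_get(pid: str) -> bytes:
    # M does not preserve past snapshots, so P1 is gone from it.
    raise NotFound(pid)


def r1_get(pid: str) -> bytes:
    # R1 received P1 as a replica (content is hypothetical).
    return b"initial snapshot"


data = retrieve("P1", [("M", m_get), ("R1", r1_get)])
```

The call succeeds on the second location, so `data` holds the bytes served by R1. If every location raised `NotFound`, `retrieve` would raise it as well.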

The CN, when v2.cn.resolve(S) was called, determined the head of the series by first finding all of the snapshots where the seriesId is S, and obsoletedBy is null or the obsoletedBy object has a different seriesId. In this case, since the P1 systemMetadata is never updated to fill in the obsoletedBy field, the algorithm will get both P1 and P2. It will then notice that P2 has the later date of D2, so will choose P2 as the head of the series.
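The head-selection rule just described can be sketched as follows. This is not the Coordinating Node implementation: the `SysMeta` record and `head_of_series` function are hypothetical, the dates standing in for D1 and D2 are invented, and keeping a snapshot whose successor is unknown to the index is an assumption the text does not settle.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SysMeta:
    pid: str
    series_id: Optional[str]
    date_uploaded: datetime
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


def head_of_series(sid: str, index: Dict[str, SysMeta]) -> Optional[SysMeta]:
    """Pick the head of series `sid` from the synchronized system metadata."""
    candidates: List[SysMeta] = []
    for sm in index.values():
        if sm.series_id != sid:
            continue
        if sm.obsoleted_by is None:
            candidates.append(sm)
            continue
        successor = index.get(sm.obsoleted_by)
        # Keep snapshots obsoleted by a member of a different series.
        # ASSUMPTION: a successor that was never synchronized is treated
        # the same way, so the series does not end up without a head.
        if successor is None or successor.series_id != sid:
            candidates.append(sm)
    if not candidates:
        return None
    # Tie-break among non-obsoleted snapshots by latest dateUploaded.
    return max(candidates, key=lambda sm: sm.date_uploaded)


# P1's obsoletedBy was never filled in, so both P1 and P2 are candidates.
index = {
    "P1": SysMeta("P1", "S", datetime(2015, 1, 1)),
    "P2": SysMeta("P2", "S", datetime(2015, 2, 1), obsoletes="P1"),
}
head = head_of_series("S", index)
```

Both snapshots pass the filter, and P2 is chosen as the head because its upload date is later.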

Suppose now that A spawns two more snapshots in quick succession, and DataONE synchronizes afterwards. It misses P3(S) but picks up P4(S). A call to cn.resolve(S) will return an ObjectLocationList for the P4 snapshot, since it is the latest of all non-obsoleted snapshots.

Later, A makes some changes and realizes that the content is significantly different from previous revisions, so renames it S2. The system treats it as related, so links to the P4 snapshot with the obsoletes field. P5(S2) is now hosted on M, but P4(S) is gone. A call to cn.resolve(S) will return an ObjectLocationList for P4, but M.get(P4) will return NotFound, and the client will have to retrieve from a replica Member Node, if possible. A call to cn.resolve(S2) will return an ObjectLocationList for P5. Note also that M.get(S) will not be able to resolve the SID to any PID, since it doesn’t host any of the snapshots of S.

Summary

  • the PID represents the snapshot, and snapshots are immutable

  • the SID represents the entity, and can be applied to several connected snapshots.

  • A particular SID cannot be used if it is either reserved by or in use by someone else.

    • Specifically, the CN checks that the submitter has CHANGE_PERMISSION on the current head of the series

    • if not in use, checks submitter against cn.hasReservation(SID)

  • cannot put SID in obsoletes and obsoletedBy fields

  • SID resolution is “latest upload” among the set of snapshots not obsoleted by another member of the series.

  • not all revisions are guaranteed to be synchronized or replicated

    • is dependent on synchronization frequency, CN availability for sync

    • missing revisions, if synced but not replicated will appear as NotFound exceptions on forwarded resolve requests

  • unsynchronized revisions might appear in the obsoletes/obsoletedBy fields of existing revisions

  • cn.listObjects(idFilter=sid)?? retrieves all synchronized revisions

  • once the seriesId is set, it cannot be changed, because changing it would break the trust that the identifier always gets the user to a conceptually equivalent object.
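
The SID usage rule in the summary (in use by someone else, or reserved by someone else) can be sketched as a single check. This is a hedged illustration, not the CN's code: `may_use_sid` and its arguments are hypothetical, the set of subjects holding CHANGE_PERMISSION on the head and the reservation table are passed in as plain data, and allowing a SID that is neither in use nor reserved is an assumption.

```python
from typing import Dict, Optional, Set


def may_use_sid(submitter: str,
                sid: str,
                head_change_permission: Optional[Set[str]],
                reservations: Dict[str, str]) -> bool:
    """Decide whether `submitter` may attach `sid` to a new snapshot.

    head_change_permission: subjects with CHANGE_PERMISSION on the current
        head of the series, or None when the SID is not yet in use.
    reservations: hypothetical map of reserved identifier -> reserving subject.
    """
    if head_change_permission is not None:
        # SID already in use: the submitter must control the current head.
        return submitter in head_change_permission
    # SID not in use: it must not be reserved by someone else.
    owner = reservations.get(sid)
    # ASSUMPTION: an unreserved, unused SID is free for anyone to use.
    return owner is None or owner == submitter


# Hypothetical subjects and identifiers.
reservations = {"S-reserved": "alice"}
in_use_ok = may_use_sid("alice", "S", {"alice"}, reservations)
in_use_denied = may_use_sid("bob", "S", {"alice"}, reservations)
reserved_denied = may_use_sid("bob", "S-reserved", None, reservations)
fresh_ok = may_use_sid("bob", "S-new", None, reservations)
```

In this sketch only the subject who controls the head, or who holds the reservation, may extend or start the series under that SID.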