It does not make sense to have a string without knowing what encoding it uses [Spolsky2003].
In DataONE content may be transferred multiple times between multiple locations, and each transfer must result in an accurate representation of the original content. DataONE achieves this by transferring byte copies of content between clients and servers using the HTTP protocol, and verifying that the checksum computed by the origin matches that retrieved. Hence the bytes are accurately transferred and can be reliably transferred again by the consumer.
In order to properly interpret how to use the object, the consumer must know the media type of the object. The media type (formerly the MIME or Multipurpose Internet Mail Extensions Type) is metadata about an object that can be used by the consumer to determine what the object is. The IANA (Internet Assigned Numbers Authority [IANA]) provides a controlled list of media types [IANA_MEDIA] (henceforth “IANA Media Types”) that are used during internet transfer of objects to inform the receiver of the type of object being transferred.
The media type can be determined several ways:
examine the bytes of the object
infer from the file name of the object
additional metadata provided by the object producer
The most reliable general solution is for the media type metadata to be provided by the object producer. This is especially important for ambiguous object types such as text documents since the character encoding can in many cases only be reliably determined by the application that created the document.
In some cases, the IANA Media Type by itself does not provide sufficient
information for a consumer to reliably process an object. For example a text
document with IANA Media Type of
text/plain may have been created using any
of hundreds of character sets [IANA_CHARS]. In these cases, an additional
charset parameter is specified, and this information along with the IANA
Media Type is required to properly interpret a text file.
DataONE expands on the metadata describing an object by recording additional
Types.SystemMetadata that accompanies every object.
Amongst this additional metadata is a
formatId that, like the IANA Media
Type, provides a pointer to additional information (a
Types.ObjectFormat) about the object for the benefit of downstream
ObjectFormat structure is a controlled list of object
classifications that augments the IANA Media Type to support use by analytical
tools employed by researchers and other.
In this manner the combination of an object and it’s System Metadata provides the information necessary for a consumer to discern what the object is and so what applications might be used to ingest the object.
Once available, the media type metadata should be preserved with the object to ensure that downstream consumers can utilize the content in the same way without resorting to inference mechanisms with potentially different results. Hence it is essential that media type information is considered an integral part of the action of transferring an object between systems.
When a server sends an object to a user agent (e.g. a CN acting as a client
retrieving a Science Metadata document from a MN, a script accessing content, or
a browser viewing something from a CN), the server should specify the media type
Content-Type field of the accompanying HTTP headers [RFC2616 Section
Content-Type entity-header field indicates the media type
[IANA_MEDIA] (formerly known as “MIME Type” or “Multipurpose Internet Mail
Extensions Type”) of the entity-body sent to the recipient [RFC2616]. The media
type entry of the Content-Type header is used to to inform the consumer of what
the bytes in the payload represent.
The server may also include a suggested filename in the Content-Disposition HTTP header [RFC6266]. This can be useful for consumers as it specifies a filename that may be used by default for the content, and also provides a hint as to the type of content being provided (i.e. through the file name extension).
All content in DataONE is accompanied by System Metadata which is used to
provide persistent information about the associated object that is useful for
maintaining the object state and for consumers. Content type in DataONE is
indicated in System Metadata by freference to an
Types.Object Format, a
complex structure that contains a
formatName and a
formatType. In version 2.0 APIs,
extended to include mimeType and extension.
The use of a controlled list of object formats may be problematic however, when considering that a particular type of object may have multiple media types (e.g. an Excel spreadsheet) or may require more detail such as character encoding information (e.g. a CSV or XML document) that may not be reliably inferred from the object bytes.
Hence, the system metadata for an object should also include optional properties for the media type specific to the object, the character encoding, and the filename. This information may be provided with the object System Metadata or in the Content-Type and Content-Disposition headers. Where the information in the headers conflicts with that in the System Metadata, the System Metadata should prevail (since presumably the system metadata was set correctly by the origin, whereas a misconfigured server may be setting an incorrect value).
(no change) The objectFormat is used to indicate to a consumer application more detailed information than is available through the media type.
mimeType element of the Draft v2.0 API should be renamed “mediaType”
and used to specify the default media type for an object should that
information not be explicitly provided through the Content-Type header
provided by the producer (Issue #)
The media type as provided by the producer of the object should be specified
and should be preserved as part of the system metadata so that the media type
may be reliably presented to downstream consumers. When specified in the
Content-Type header, the media type overrides the default value present
in the associated objectFormat. When present in System Metadata, that value
overrides a value presented in the ``Content-Type``header. In practice,
System Metadata is retrieved separately from the object, and so such an
override will optional for consumers.
For text media sub-types, or content that is textual (e.g. media type =
should be provided in the
Content-Type header. When provided, this value
must be persisted in the system metadata associated with an object. When
charset is specified in the System Metadata, it overrides a value that
may be present in the Content-Type header. In practice, System Metadata is
retrieved separately from the object, and so such an override will optional
A filename should be provided in a
Content-Disposition header by a
producer and should be preserved in the system metadata associated with the
object. When present in the System Metadata, that value overrides a value in
Content-Disposition header. In practice, System Metadata is retrieved
separately from the object, and so such an override will optional for
The purpose of the HTTP
Content-Type header is to inform the receiver of a
byte stream what the payload actually is. Parameters may be included with the
Content-Type to provide additional information for the consumer (e.g. the
charset parameter for text sub-types).
Media type tracking in Version 1.x is largely delegated to the ObjectFormat referenced in the SystemMetadata associated with an object. A content producer may provide a Content-Type header, but this information is not preserved as part of the DataONE infrastructure. Hence, consumers that intend to re-expose the object should endeavor to record the provided Content-Type and provide tha header when re-transmitting the object. Such an action is however, undefined within the Version 1.x DataONE service interfaces.
Lacking an explicltly set Content-Type, a Node may infer the Content-Type from the ObjectFormat
mediaType value is specified in SystemMetadata
The SystemMetdata.mediaType value is used to set the Content-Type header value. The SystemMetadata.mediaType overrides a value that may be set in the referenced ObjectFormat.
mediaType value not specified in SystemMetadata, available in ObjectFormat
mediaType value not specified in SystemMetadata or ObjectFormat
text/xml are equivalent [RFC7303 Section
The use of UTF-8, without a BOM, is RECOMMENDED for all XML MIME entities [RFC7303].
The document character set for XML is Unicode (ISO 10646), which means that XML processors should behave as if they used Unicode internally. However, that does not mean an XML document must be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode.
A challenge with XML documents is that there are three locations where character encoding information may be provided:
A Byte Order Marker (BOM) at the begining of the entity body
An XML encoding property present at the start of the document
A charset property present in the
Content-Type HTTP header
Each of these are optional, and when present may provide conflicting information. Section 3.2 of RFC7303 provides guidelines for how to infer the character encoding of a document. In order of priority:
A BOM (Section 3.3) is authoritative if it is present in an XML MIME entity;
In the absence of a BOM (Section 3.3), the charset parameter is authoritative if it is present.
If an XML MIME entity is received where the charset parameter is omitted, no information is being provided about the character encoding by the MIME Content-Type header. XML-aware consumers MUST follow the requirements in section 4.3.3 of [XML] that directly address this case. XML-unaware MIME consumers SHOULD NOT assume a default encoding in this case.
Section 8 of RFC7303 provides several examples of consistent and inconsistent XML encoding.
An important consequence of the document character set is that values of numeric character references (such as ǵ and ǵ for LATIN SMALL LETTER G WITH ACUTE) are interpreted as Unicode characters - no matter what encoding you use for your document. This is a common source of error among those who are not clear about the distinction.
Note that not all Unicode characters can be used anywhere in XML. Certain characters are excluded from use in tag names (elements and attributes), and XML 1.1 expands significantly on the range of characters that may be used compared with XML 1.0.
MIME media type name: text
MIME subtype name: csv
Required parameters: none
Optional parameters: charset, header
Common usage of CSV is US-ASCII, but other character sets defined by IANA for the “text” tree may be used in conjunction with the “charset” parameter.
The “header” parameter indicates the presence or absence of the header line.Valid values are “present” or “absent”. Implementors choosing not to use this parameter must make their own decisions as to whether the header line is present or absent.
As per section 4.1.1. of RFC 2046 , this media type uses CRLF to denote line breaks.However, implementors should be aware that some implementations may use other values.
Obsoleted in favor of
Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets:
00 00 00 xx UTF-32BE 00 xx 00 xx UTF-16BE xx 00 00 00 UTF-32LE xx 00 xx 00 UTF-16LE xx xx xx xx UTF-8