DataONE Usage Statistics

Overview

DataONE Member Nodes and Coordinating Nodes record access events that result from DataONE API calls. A list of access events and the API calls that logged these events is shown in Table 1.

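Table 1. Access Events

============  =========================  ================
Access event  DataONE MN API call        Metacat API call
============  =========================  ================
create        MNStorage.create()         action=insert
delete        MNStorage.delete()         action=delete
read          MNRead.get()               action=read
replicate     MNReplication.replicate()
update        MNStorage.update()         action=update
============  =========================  ================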

The content of the access event log records is described here: LoggingSchema.html.

The access event log records are harvested from each MN in the network and aggregated into a common search index by the Log Aggregation Facility, which is described here: LogAggregator.html. The Event Log Index is implemented as an Apache Solr instance and can be queried with standard Solr queries through the DataONE service endpoint https://cn.dataone.org/cn/v1/query/logsolr.

The Solr search platform provides query capabilities such as field faceting, range filtering and numeric field statistics. Applied to the access events harvested from the MNs, these capabilities provide network-wide usage statistics from a single search index.

The section Example Queries gives several examples of usage information that can be obtained from the Event Log Index.

Event Log Index

Table 2. Solr index schema

==================  =========  ==============================================================================
Name                Solr Type  Comment
==================  =========  ==============================================================================
id                  string     added after harvest
dateAggregated      date       added after harvest
isPublic            boolean    added after harvest, obtained from systemmetadata
readPermission      string     added after harvest, obtained from systemmetadata, filtered during query
entryId             string     obtained from MN event log
pid                 string     added after harvest, obtained from systemmetadata
ipAddress           string     obtained from MN event log, filtered during query
userAgent           string     obtained from MN event log
subject             string     obtained from MN event log, filtered during query
event               string     obtained from MN event log
dateLogged          date       obtained from MN event log
nodeId              string     obtained from MN event log
rightsHolder        string     added after harvest, obtained from systemmetadata, filtered during query
formatId            string     added after harvest, obtained from systemmetadata
formatType          string     added after harvest, obtained from systemmetadata
size                slong      added after harvest, obtained from systemmetadata
country             string     added after harvest, determined from ipAddress
region              string     added after harvest, determined from ipAddress
city                string     added after harvest, determined from ipAddress
geohash_1           string     added after harvest, determined from ipAddress
geohash_2           string     added after harvest, determined from ipAddress
geohash_3           string     added after harvest, determined from ipAddress
geohash_4           string     added after harvest, determined from ipAddress
geohash_5           string     added after harvest, determined from ipAddress
geohash_6           string     added after harvest, determined from ipAddress
geohash_7           string     added after harvest, determined from ipAddress
geohash_8           string     added after harvest, determined from ipAddress
geohash_9           string     added after harvest, determined from ipAddress
location            location   added after harvest, determined from ipAddress
inFullRobotList     boolean    added after harvest, determined based on log processing for COUNTER compliance
inPartialRobotList  boolean    added after harvest, determined based on log processing for COUNTER compliance
isRepeatVisit       boolean    added after harvest, determined based on log processing for COUNTER compliance
==================  =========  ==============================================================================
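The geohash_1 through geohash_9 fields suggest geohash prefixes of increasing length, although this document does not say how they are computed. The following Python sketch assumes the standard geohash encoding and assumes that geohash_N holds the first N characters of the hash. The function names are illustrative and are not part of any DataONE API.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    value = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value = value << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


def geohash_fields(lat, lon, max_len=9):
    """Assumed mapping: geohash_N is the first N characters of the full hash."""
    full = geohash_encode(lat, lon, max_len)
    return {f"geohash_{n}": full[:n] for n in range(1, max_len + 1)}
```

Under these assumptions, faceting on a shorter prefix field groups events into larger cells, and faceting on a longer one groups them into smaller cells.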

Access to Event Log Index

Access to the Event Log Index adheres to the DataONE identity and authentication protocols described here: Authentication.html. The level of access allowed when querying the index is determined by your DataONE Authentication Session Identity.

CN Administrators

CN Administrators have full access to the index and can therefore select index entries based on any field and view the entire contents of the index entries.

Authenticated session access

Clients (e.g. web browsers) that have established an authenticated session using a DataONE identity have access to information for any pids for which they are the rightsHolder, or for which an access policy grants them write access. For example, if the authenticated subject is 'uid=smith,o=NCEAS,dc=ecoinformatics,dc=org', the client can query index entries for pids whose access policies allow write access to that subject. This level of access allows only summary information to be viewed; the full content of index entries cannot be viewed.

Public Access

All other access is considered non-privileged public access, in which case only index entries associated with pids that have an access policy granting public read can be queried. This level of access allows only summary information to be viewed; the full content of index entries cannot be viewed.

In addition to these access rules, certain fields are considered sensitive such that they cannot be included in Solr field queries (i.e. &fq=<field name>) or included in Solr facet queries (i.e. &facet.field=<field name>). The fields from the Event Log Index that are considered sensitive are rightsHolder, ipAddress, subject and readPermission.
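A client can check a query for sensitive fields before sending it. The following Python sketch is a client-side helper only: the function name is hypothetical, and how the service itself handles such parameters (rejecting or ignoring them) is not specified in this section.

```python
from urllib.parse import parse_qsl, urlsplit

# Sensitive fields as listed in the text above.
SENSITIVE_FIELDS = {"rightsHolder", "ipAddress", "subject", "readPermission"}


def sensitive_fields_used(url):
    """Return the sensitive fields referenced by fq or facet.field parameters."""
    used = set()
    for name, value in parse_qsl(urlsplit(url).query):
        if name == "fq":
            # A field query has the form field:value; drop any +/- prefix.
            field = value.split(":", 1)[0].lstrip("+-")
        elif name == "facet.field":
            field = value
        else:
            continue
        if field in SENSITIVE_FIELDS:
            used.add(field)
    return used


example = (
    "https://cn.dataone.org/cn/v2/query/logsolr/"
    "?q=*:*&fq=event:read&facet=true&facet.field=ipAddress"
)
print(sorted(sensitive_fields_used(example)))
```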

COUNTER Compliance

While unfiltered log records are useful for some system monitoring and related activities, scientifically-meaningful analysis of log records requires that we correct log records for common events that would otherwise artificially inflate the statistics, such as access by web-indexing robots and multiple accesses from the same individual. Within the publishing community, the COUNTER standard has been used to provide a consistent set of guidelines as to how resource access statistics should be reported. To be COUNTER-compliant, DataONE provides three filters on log files:

  1. Only allow status 200 and 304 on READ requests

    This ensures that redirect requests (302) are only counted once, and that unsuccessful requests are ignored.

  2. Exclude robots

    This ensures that the myriad web-robots that constantly index web-accessible content do not artificially inflate results.

  3. Exclude repeat visits within a certain time window

    This ensures that accidental double-clicks on a link or repeated requests from a client tool in a short time period are only counted once.

Compliance with these three COUNTER requirements is implemented as two boolean index fields (isRepeatVisit and inFullRobotList) which indicate, for each record, whether that record adheres to the COUNTER standards outlined above. Client queries that should report only COUNTER-compliant results add filter expressions to the query (isRepeatVisit:false and inFullRobotList:false), which removes all non-compliant records from the usage statistics reports.

The field inFullRobotList indicates whether or not the logged event originated from a request issued by a user agent found in the full list of web robots, with the value true indicating that the user agent is a web robot, and thus the event record is not COUNTER compliant.

DataONE will maintain a list of known Internet robots to be used for filtering addresses, and this list will be updated periodically, at least annually, as new robots become known.

The field isRepeatVisit indicates whether or not a duplicate request has occurred for the same IP address and pid within a certain time window (currently 30 seconds), with a value of true indicating that an entry is a repeat request.
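The repeat-visit rule can be illustrated with a short Python sketch. It is not the DataONE log processing code. It assumes that each request is compared with the immediately preceding request for the same IP address and pid, and the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=30)  # window length stated in the text above


def flag_repeat_visits(events, window=WINDOW):
    """events: list of (ip_address, pid, date_logged) tuples.

    Returns a list of booleans in input order; True marks a repeat visit.
    """
    flags = [False] * len(events)
    last_seen = {}  # (ip_address, pid) -> time of the previous request
    # Process events in chronological order, remembering their input positions.
    for i in sorted(range(len(events)), key=lambda i: events[i][2]):
        ip, pid, when = events[i]
        previous = last_seen.get((ip, pid))
        flags[i] = previous is not None and when - previous <= window
        last_seen[(ip, pid)] = when
    return flags


t0 = datetime(2014, 3, 1, 0, 0, 0)
sample = [
    ("10.0.0.1", "pidA", t0),
    ("10.0.0.1", "pidA", t0 + timedelta(seconds=10)),  # same IP and pid, 10 s later
    ("10.0.0.1", "pidB", t0 + timedelta(seconds=12)),  # different pid
    ("10.0.0.1", "pidA", t0 + timedelta(seconds=60)),  # 50 s after the previous pidA request
]
print(flag_repeat_visits(sample))
```

Comparing against the previous request, rather than against the last counted request, is a design choice of this sketch. A long run of rapid requests is then counted once.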

The following query will return the count of all read events that have passed the COUNTER compliance tests:

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:read&fq=inFullRobotList:false&fq=isRepeatVisit:false
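A query like this can also be assembled programmatically. The Python sketch below only builds the URL. The helper name is hypothetical, and rows=0 is an added assumption so that only the count is returned, not the records.

```python
from urllib.parse import urlencode

LOGSOLR = "https://cn.dataone.org/cn/v2/query/logsolr/"


def counter_compliant_url(query="event:read", extra_filters=()):
    """Build a logsolr URL that keeps only COUNTER-compliant records."""
    params = [
        ("q", query),
        ("fq", "inFullRobotList:false"),  # exclude known web robots
        ("fq", "isRepeatVisit:false"),    # exclude repeat visits
    ]
    params += [("fq", f) for f in extra_filters]
    params.append(("rows", "0"))  # assumption: only the count is needed
    return LOGSOLR + "?" + urlencode(params)


print(counter_compliant_url())
```

urlencode percent-encodes the colons (":" becomes "%3A"), which is equivalent to the literal form shown above once the server decodes the request.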

The event index is updated once a day with event entries from all active member nodes, with the most current information being from the previous day.

In addition to the COUNTER-related fields, the field inPartialRobotList indicates whether or not the user agent was found in a list that contains a subset of the full robots list. This subset represents a less strict interpretation of which user agents are considered web robots and does not include user agents such as ‘java’, ‘libwww’ and ‘Wget’. A value of true indicates that a match was found in the less strict web robots list. This field is not used in COUNTER compliance filtering.
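The relationship between the two robot fields can be sketched in Python. The list contents and the case-insensitive substring matching are assumptions made for illustration, because the actual DataONE robot lists and matching rules are not reproduced in this document.

```python
# Hypothetical list entries; only 'java', 'libwww' and 'Wget' are named in the text.
FULL_ROBOT_LIST = ["googlebot", "java", "libwww", "wget"]
PARTIAL_ROBOT_LIST = ["googlebot"]  # subset that omits generic client tools


def robot_flags(user_agent):
    """Return the two robot-related index flags for a user agent string."""
    agent = user_agent.lower()
    return {
        "inFullRobotList": any(entry in agent for entry in FULL_ROBOT_LIST),
        "inPartialRobotList": any(entry in agent for entry in PARTIAL_ROBOT_LIST),
    }


print(robot_flags("Wget/1.14 (linux-gnu)"))
```

With these assumed lists, a Wget request is excluded by the COUNTER filter (inFullRobotList is true) but is still counted by a query that filters only on inPartialRobotList.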

Statistics Service Usage

The following sections show example queries that can be sent to the Event Log Solr index. Note: to make the examples easier to read, the output of some of the example queries has been edited, with removed lines replaced by an ellipsis, i.e. ‘…’.

Retrieve pids for a specified subject

The following example shows a query for the download volume of pids read by subjects matching uid*smith*, with download size statistics aggregated by pid:

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=subject:uid*smith*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid

The following result is returned:

<?xml version="1.0"?>
<response>
  ...
  <result name="response" numFound="96" start="0"/>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="size">
        <double name="min">135.0</double>
        <double name="max">1.5209072E8</double>
        <double name="sum">1.082767665E9</double>
        <long name="count">96</long>
        <long name="missing">0</long>
        <double name="sumOfSquares">1.13751276670495792E17</double>
        <double name="mean">1.127882984375E7</double>
        <double name="stddev">3.2692977584385287E7</double>
        <lst name="facets">
          <lst name="pid">
            <lst name="doi:10.6085/AA/pisco_intertidal.45.1">
              <double name="min">2.8738045E7</double>
              <double name="max">2.8738045E7</double>
              <double name="sum">2.8738045E7</double>
              <long name="count">1</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">8.25875230422025E14</double>
              <double name="mean">2.8738045E7</double>
              <double name="stddev">0.0</double>
            </lst>
            <lst name="doi:10.6085/AA/MLPA_intertidal.30.10">
              <double name="min">2984.0</double>
              <double name="max">2984.0</double>
              <double name="sum">11936.0</double>
              <long name="count">4</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">3.5617024E7</double>
              <double name="mean">2984.0</double>
              <double name="stddev">0.0</double>
            </lst>
            <lst name="doi:10.6085/AA/pisco_snbs.19.1">
              <double name="min">52335.0</double>
              <double name="max">52335.0</double>
              <double name="sum">104670.0</double>
              <long name="count">2</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">5.47790445E9</double>
              <double name="mean">52335.0</double>
              <double name="stddev">0.0</double>
            </lst>
            ...
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
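A client will usually extract a few values from such a response and ignore the rest. The Python sketch below parses a trimmed, well-formed copy of the output above with the standard library and collects the byte sum and event count for each pid. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above, limited to the elements used here.
SAMPLE = """<response>
  <result name="response" numFound="96" start="0"/>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="size">
        <double name="sum">1.082767665E9</double>
        <long name="count">96</long>
        <lst name="facets">
          <lst name="pid">
            <lst name="doi:10.6085/AA/MLPA_intertidal.30.10">
              <double name="sum">11936.0</double>
              <long name="count">4</long>
            </lst>
            <lst name="doi:10.6085/AA/pisco_snbs.19.1">
              <double name="sum">104670.0</double>
              <long name="count">2</long>
            </lst>
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>"""


def bytes_by_pid(xml_text):
    """Map each pid to (total bytes, number of events) from a stats.facet=pid response."""
    root = ET.fromstring(xml_text)
    totals = {}
    for entry in root.findall(".//lst[@name='facets']/lst[@name='pid']/lst"):
        total = entry.find("double[@name='sum']")
        count = entry.find("long[@name='count']")
        totals[entry.get("name")] = (float(total.text), int(count.text))
    return totals


for pid, (total, count) in bytes_by_pid(SAMPLE).items():
    print(pid, total, count)
```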

The previous query can be constrained to a specific time period by adding a time range, for example:

&fq=dateLogged:[2013-01-01T23:59:59Z TO 2013-12-31T23:59:59Z]

or using Solr date math keywords:

&fq=dateLogged:[NOW-1MONTH TO NOW]
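When the range comes from application code, the filter value can be generated from datetime objects. This is a minimal Python sketch with hypothetical helper names. It requires timezone-aware datetimes so that the trailing Z (UTC) is always accurate.

```python
from datetime import datetime, timezone


def solr_timestamp(moment):
    """Format a timezone-aware datetime as a UTC timestamp in the form Solr expects."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_logged_filter(start, end):
    """Build the value of an fq parameter restricting dateLogged to [start, end]."""
    return f"dateLogged:[{solr_timestamp(start)} TO {solr_timestamp(end)}]"


year_2013 = date_logged_filter(
    datetime(2013, 1, 1, tzinfo=timezone.utc),
    datetime(2013, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print("&fq=" + year_2013)
```

The value contains spaces and brackets, so it should be URL-encoded (for example with urllib.parse.urlencode) before it is sent in a request.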

Data upload counts

The following query shows counts of data uploads by formatId for a specified rightsHolder (PISCO):

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&facet=true&fq=rightsHolder:uid*PISCO*&fq=event:create&facet.field=formatId&facet.mincount=1
<?xml version="1.0"?>
<response>
  ...
  <result name="response" numFound="40928" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="formatId">
        <int name="eml://ecoinformatics.org/eml-2.0.1">32932</int>
        <int name="text/csv">5236</int>
        <int name="application/octet-stream">2570</int>
        <int name="eml://ecoinformatics.org/eml-2.0.0">100</int>
        <int name="eml://ecoinformatics.org/eml-2.1.0">28</int>
        <int name="-//ecoinformatics.org//eml-dataset-2.0.0beta6//EN">19</int>
        <int name="-//ecoinformatics.org//eml-entity-2.0.0beta6//EN">12</int>
        <int name="-//ecoinformatics.org//eml-attribute-2.0.0beta6//EN">11</int>
        <int name="-//ecoinformatics.org//eml-access-2.0.0beta6//EN">7</int>
        <int name="-//ecoinformatics.org//eml-physical-2.0.0beta6//EN">6</int>
        <int name="image/jpeg">3</int>
        <int name="text/plain">3</int>
        <int name="-//ecoinformatics.org//eml-project-2.0.0beta6//EN">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
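Field facet counts such as these are easy to turn into a dictionary. The Python sketch below parses a trimmed copy of the response above. The function name is hypothetical, and only the first three formatId entries are included in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the facet response above (first three formatId entries only).
SAMPLE = """<response>
  <result name="response" numFound="40928" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="formatId">
        <int name="eml://ecoinformatics.org/eml-2.0.1">32932</int>
        <int name="text/csv">5236</int>
        <int name="application/octet-stream">2570</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_counts(xml_text, field):
    """Return {facet value: count} for one facet.field in a Solr XML response."""
    root = ET.fromstring(xml_text)
    path = f".//lst[@name='facet_fields']/lst[@name='{field}']/int"
    return {node.get("name"): int(node.text) for node in root.findall(path)}


counts = facet_counts(SAMPLE, "formatId")
for format_id, count in counts.items():
    print(format_id, count)
```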

Data download counts by month

The following query shows download counts for objects held by a specific rightsHolder (PISCO) for each month in 2013:

https://cn.dataone.org/cn/v1/query/logsolr/?q=*:*&fq=rightsHolder:uid*PISCO*&fq=event:read&facet=true&facet.range=dateLogged&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH
<?xml version="1.0"?>
<response>
   ...
  <result name="response" numFound="3623404" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields"/>
    <lst name="facet_dates"/>
    <lst name="facet_ranges">
      <lst name="dateLogged">
        <lst name="counts">
          <int name="2013-01-01T01:01:01Z">56962</int>
          <int name="2013-02-01T01:01:01Z">23656</int>
          <int name="2013-03-01T01:01:01Z">46167</int>
          <int name="2013-04-01T01:01:01Z">58562</int>
          <int name="2013-05-01T01:01:01Z">65192</int>
          <int name="2013-06-01T01:01:01Z">203082</int>
          <int name="2013-07-01T01:01:01Z">66013</int>
          <int name="2013-08-01T01:01:01Z">92320</int>
          <int name="2013-09-01T01:01:01Z">23059</int>
          <int name="2013-10-01T01:01:01Z">16135</int>
          <int name="2013-11-01T01:01:01Z">73831</int>
          <int name="2013-12-01T01:01:01Z">44968</int>
        </lst>
        <str name="gap">+1MONTH</str>
        <date name="start">2013-01-01T01:01:01Z</date>
        <date name="end">2014-01-01T01:01:01Z</date>
      </lst>
    </lst>
  </lst>
</response>
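The gap value in the query above is written as %2B1MONTH because a literal "+" in a URL query string is decoded as a space. The Python sketch below builds a comparable monthly range query and lets urlencode handle that encoding. The helper name is hypothetical. The midnight range boundaries and rows=0 are assumptions that differ slightly from the query above.

```python
from urllib.parse import urlencode


def monthly_counts_url(base, filters, year):
    """Build a range-facet query that counts events per month of one year."""
    params = [("q", "*:*")]
    params += [("fq", f) for f in filters]
    params += [
        ("rows", "0"),  # assumption: only the facet counts are needed
        ("facet", "true"),
        ("facet.range", "dateLogged"),
        ("facet.range.start", f"{year}-01-01T00:00:00Z"),
        ("facet.range.end", f"{year + 1}-01-01T00:00:00Z"),
        ("facet.range.gap", "+1MONTH"),  # urlencode turns "+" into "%2B"
    ]
    return base + "?" + urlencode(params)


url = monthly_counts_url(
    "https://cn.dataone.org/cn/v2/query/logsolr/", ["event:read"], 2013
)
print(url)
```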

Read counts for format type EML

The following query shows all activity on EML metadata held by a specific rightsHolder (PISCO), with counts by event type and by month in 2013:

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=rightsHolder:uid*PISCO*&fq=formatId:eml*&facet=true&facet.field=event&facet.range=dateLogged&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH
<?xml version="1.0"?>
<response>
  ...
  <result name="response" numFound="3504705" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="event">
        <int name="read">3327009</int>
        <int name="delete">51249</int>
        <int name="update">47593</int>
        <int name="synchronization_failed">45752</int>
        <int name="create">33060</int>
        <int name="replicate">42</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges">
      <lst name="dateLogged">
        <lst name="counts">
          <int name="2013-01-01T01:01:01Z">54815</int>
          <int name="2013-02-01T01:01:01Z">18652</int>
          <int name="2013-03-01T01:01:01Z">45043</int>
          <int name="2013-04-01T01:01:01Z">58420</int>
          <int name="2013-05-01T01:01:01Z">64208</int>
          <int name="2013-06-01T01:01:01Z">136014</int>
          <int name="2013-07-01T01:01:01Z">65417</int>
          <int name="2013-08-01T01:01:01Z">92103</int>
          <int name="2013-09-01T01:01:01Z">22899</int>
          <int name="2013-10-01T01:01:01Z">15522</int>
          <int name="2013-11-01T01:01:01Z">73340</int>
          <int name="2013-12-01T01:01:01Z">44745</int>
        </lst>
        <str name="gap">+1MONTH</str>
        <date name="start">2013-01-01T01:01:01Z</date>
        <date name="end">2014-01-01T01:01:01Z</date>
      </lst>
    </lst>
  </lst>
</response>

Download volume for pids

The following query reports upload size statistics, aggregated by formatId, for all pids created by rightsHolder PISCO:

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=rightsHolder:uid=*PISCO*&fq=event:create&stats=true&stats.field=size&rows=0&stats.facet=formatId
<result name="response" numFound="14721" start="0"/>
  ...
        <lst name="facets">
          <lst name="formatId">
            <lst name="eml://ecoinformatics.org/eml-2.0.0">
              <double name="min">3582.0</double>
              <double name="max">29176.0</double>
              <double name="sum">604461.0</double>
              <long name="count">43</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">1.1348783711E10</double>
              <double name="mean">14057.232558139534</double>
              <double name="stddev">8240.051522137841</double>
            </lst>
            <lst name="eml://ecoinformatics.org/eml-2.0.1">
              <double name="min">938.0</double>
              <double name="max">646484.0</double>
              <double name="sum">2.37265549E8</double>
              <long name="count">14668</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">7.985322030167E12</double>
              <double name="mean">16175.72600218162</double>
              <double name="stddev">16815.75005078953</double>
            </lst>
            ...
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>

Note

To improve legibility, the examples that follow do not include the result output. The reader is encouraged to copy and paste the sample queries into a web browser to view the resulting output.

Select events using time range based on date of access event

https://cn.dataone.org/cn/v2/query/logsolr/?q=dateLogged:[2014-03-01T00:00:01Z TO 2014-03-31T00:00:01Z]

Counts of event types

https://cn.dataone.org/cn/v2/query/logsolr/?q=dateLogged:[* TO NOW]&facet=true&facet.field=event

Wildcard search for pids

https://cn.dataone.org/cn/v2/query/logsolr/?q=pid:doi*&facet=true&facet.field=pid&facet.mincount=1

Spatial search for events within 10km of the latitude, longitude of Santa Barbara, CA

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq={!geofilt sfield=location pt=34.4329,-119.837 d=10}
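The geofilt expression contains braces, spaces and other characters that should be URL-encoded when the query is sent from code and not pasted into a browser. The following Python sketch builds the same filter. The helper names are hypothetical, and the distance d is in kilometres, as in the example above.

```python
from urllib.parse import urlencode

LOGSOLR = "https://cn.dataone.org/cn/v2/query/logsolr/"


def geofilt(lat, lon, distance_km, sfield="location"):
    """Build a Solr geofilt filter for events within distance_km of a point."""
    return f"{{!geofilt sfield={sfield} pt={lat},{lon} d={distance_km}}}"


def spatial_search_url(lat, lon, distance_km):
    """Return a logsolr URL with the geofilt expression safely encoded."""
    params = [("q", "*:*"), ("fq", geofilt(lat, lon, distance_km))]
    return LOGSOLR + "?" + urlencode(params)


print(geofilt(34.4329, -119.837, 10))
print(spatial_search_url(34.4329, -119.837, 10))
```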

Search by city name for events occurring in Albuquerque

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=city:Albuquerque

Events aggregated by location name

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:create&facet=true&facet.field=city

Download (read) counts by month for all data format types

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=event:read&formatType=DATA&facet=true&facet.range=dateLogged&facet.range.start=2014-01-01T00:00:00.000Z&facet.range.end=2015-01-01T00:00:00.000Z&facet.range.gap=%2B1MONTH

Download (read) counts by month for all format types, counter-compliant

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=event:read&counterCompliant=true&facet=true&facet.range=dateLogged&facet.range.start=2014-01-01T00:00:00.000Z&facet.range.end=2015-01-01T00:00:00.000Z&facet.range.gap=%2B1MONTH

Metadata read counts by month for all metadata format types

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=event:read&formatType=METADATA&facet=true&facet.range=dateLogged&facet.range.start=2014-01-01T00:00:00.000Z&facet.range.end=2015-01-01T00:00:00.000Z&facet.range.gap=%2B1MONTH

Byte count for read events for May 2013

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:read&fq=dateLogged:[2013-05-01T00:00:00.000Z TO 2013-05-31T23:59:59.999Z]&stats=true&stats.field=size&sort=size%20desc&rows=0

Bytes downloaded for subject=cjones aggregated by formatId

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=subject:uid=*cjones*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=formatId

Download (read) counts for node KNB, excluding web crawler accesses and duplicate (repeat) visits (within a short time interval, i.e. 30 seconds)

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:read&fq=inFullRobotList:false&fq=isRepeatVisit:false&fq=nodeId:urn\:node\:KNB
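The backslashes in urn\:node\:KNB are needed because the colon separates the field name from the value in Solr query syntax. The Python sketch below escapes such characters in a term before it is placed in a filter. The helper name is hypothetical, and the character set follows the special characters commonly listed for the Lucene query parser, so it may escape more than a given Solr version strictly requires.

```python
import re

# Characters treated as special by the Lucene/Solr query parser (assumed set).
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(value):
    """Prefix each special character in a query term with a backslash."""
    return _SPECIAL.sub(r"\\\1", value)


node_filter = "nodeId:" + escape_term("urn:node:KNB")
print(node_filter)
```

Wildcards are escaped too, so this helper suits exact identifiers such as node IDs or pids. It should not be applied to patterns like doi* where the asterisk is meant as a wildcard.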