Abstract

The Allotrope Data Format (ADF) [ADF] consists of several APIs as well as ontologies. It defines an interface and file format for storing scientific observations from analytical chemistry. This document constitutes the specification of the Allotrope Data Format Data Cube (ADF-DC) API for storing and reading analytical data. It defines how to store one- or multi-dimensional data.

Disclaimer

THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current Allotrope publications and the latest revision of this technical report can be found in the Allotrope technical reports index at http://purl.allotrope.org/TR/.

This document is part of a set of specifications on the Allotrope Data Format [ADF]

This document was published by the Allotrope Foundation as a First Public Working Draft. This document is intended to become an Allotrope Recommendation. If you wish to make comments regarding this document, please send them to more.info@allotrope.org. All comments are welcome.

Publication as a First Public Working Draft does not imply endorsement by the Allotrope Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Table of Contents


1. Introduction

The Allotrope Data Format (ADF) defines an interface and file format for storing scientific observations from analytical chemistry. It is intended for long-term stability of archived analytical data and fast real-time access to it. The ADF Data Cube API (ADF-DC) defines an interface for storing n-dimensional analytical result data in the form of data cubes. ADF-DC uses the vocabulary of the W3C Data Cube Ontology [QB] to describe the basic structure and metadata of data cubes and observations. The ADF Data Cube Ontology [ADF-DCO] extends [QB] with advanced concepts, such as specific selections of subsets of data cubes, scale types, complex data types and order functions.

ADF is based on the Hierarchical Data Format [HDF5], which is specifically designed to store large amounts of numerical data. The ADF Data Cube to HDF5 Mapping Ontology [ADF-DCO-HDF] provides classes and properties to define the mapping between the abstract data cubes, defined in terms of the data cube ontology, and their concrete HDF5 representations in the ADF file. That is, ADF-DCO-HDF defines the mapping between functional and physical representations. The physical representation in HDF5 is described by an HDF5 ontology, which is based on the official HDF5 specifications [HDF5].

This document is structured as follows: First, the role of the ADF Data Cube API within the ADF high-level structure is shown. Second, the general requirements for an ADF Data Cube API are listed. Third, a use case data cube on mass spectroscopy is described, which later examples refer to in order to illustrate the specified methods. Finally, the different API methods for creation, writing and reading are specified in detail with their corresponding parameters. For each of the specified methods, example RDF representations of the corresponding metadata are provided.

1.1 Document Conventions

1.1.1 Naming Conventions

The IRI of an entity has two parts: the namespace and the local identifier. Within one RDF document, a namespace may be abbreviated by a shorter prefix. For instance, the namespace IRI http://www.w3.org/2002/07/owl# is commonly associated with the prefix owl:, and one can write owl:Class instead of the full IRI http://www.w3.org/2002/07/owl#Class.

Namespaces

Within this specification, the following namespace prefix bindings are used:

Prefix        Namespace
owl:          http://www.w3.org/2002/07/owl#
rdf:          http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:         http://www.w3.org/2000/01/rdf-schema#
xsd:          http://www.w3.org/2001/XMLSchema#
dct:          http://purl.org/dc/terms/
skos:         http://www.w3.org/2004/02/skos/core#
qb:           http://purl.org/linked-data/cube#
af-x:         http://purl.allotrope.org/ontologies/property#
adf-dp:       http://purl.allotrope.org/ontologies/datapackage#
adf-dc:       http://purl.allotrope.org/ontologies/datacube#
adf-dc-hdf:   http://purl.allotrope.org/ontologies/datacube-to-hdf5-map#
ex:           http://example.com/ns#

1.1.2 Indication of Requirement Levels

Within this document the definitions of MUST, SHOULD and MAY are used as defined in [rfc2119].

1.1.3 Diagram Notation

This document uses the Unified Modeling Language [UML] to illustrate some concepts and visualize RDF graphs. These diagrams are non-normative and SHOULD NOT be interpreted according to the strict semantics specified by the UML specification.

1.1.4 Number Formatting

Within this document, decimal numbers will use a dot "." as the decimal mark.

2. ADF High-Level Structure

The next figure illustrates the ADF Data Cube API within the high-level structure of the Allotrope Data Format (ADF) [ADF] API stack:

Fig. 1 The high-level structure of the Allotrope Data Format (ADF) API stack.

This document specifies the methods which MUST be provided by the ADF Data Cube API.

3. General Requirements

The following key requirements MUST be addressed by the ADF Data Cube API:

The following requirements on ADF-DC SHOULD be addressed:

4. Use Case

The following figure describes a typical Liquid Chromatography Mass Spectroscopy (LC/MS) measurement for a single sample:

Fig. 2 Liquid Chromatography Mass Spectroscopy (LC/MS) for a single sample.
This use case can be represented by a data cube with the following structure:
  • Two dimension components. The sample index is an optional dimension component that is used for illustrative examples below.
  • One measure component.

5. ADF Data Cube Operations

This section describes the core operations that MUST be provided by the ADF-DC API: creating a data cube as well as writing data into and reading (subsets of) data from it. These methods are described in the following subsections.

5.1 Creating Data Cubes

According to the RDF Data Cube Vocabulary [QB], a data cube qb:DataSet MUST specify a structure which is expressed through a data structure definition qb:DataStructureDefinition (DSD). The DSD defines the components of a data cube through component specifications qb:ComponentSpecification, which describe either dimension components adf-dc:Dimension or measure components adf-dc:Measure. Details on the classes and properties of the ADF Data Cube Ontology are specified in [ADF-DCO].

The ADF-DC API MUST provide a method to create a data cube by specifying all its components, i.e. by explicitly specifying the data structure through dimensions and measures. The API MUST provide a method to create a data cube by reusing an existing data structure definition (DSD) which is represented in the form of RDF triples. The detailed specifications of these methods are given in the following subsections.

In general, the API methods for creation of a data cube MUST provide a parameter to specify an IRI for the data cube. Specification of a label or title for the data cube SHOULD be possible as well.
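As a non-normative illustration of these creation parameters, the required structure can be sketched as a minimal in-memory model. All class and field names below are hypothetical and not part of this specification:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Measure:
    measure_property: str              # IRI of an owl:ObjectProperty or owl:DatatypeProperty
    data_type: str                     # IRI of the component data type
    scale_type: Optional[str] = None   # e.g. "adf-dc:RatioScale" (SHOULD)

@dataclass
class Dimension:
    dimension_property: str            # IRI, unique within one DSD
    data_type: str
    order: Optional[int] = None        # qb:order (SHOULD)
    size: Optional[int] = None         # maximal number of entries (MAY)

@dataclass
class DataStructureDefinition:
    iri: str
    dimensions: List[Dimension] = field(default_factory=list)
    measures: List[Measure] = field(default_factory=list)

@dataclass
class DataCube:
    iri: str                           # MUST be specifiable
    structure: DataStructureDefinition
    label: Optional[str] = None        # rdfs:label / dct:title (SHOULD)

# creating the use-case cube with an explicitly defined structure
dsd = DataStructureDefinition("ex:UseCaseDSD")
dsd.dimensions.append(Dimension("af-x:index", "xsd:long", order=1))
dsd.measures.append(Measure("af-x:intensity", "xsd:int", "adf-dc:RatioScale"))
cube = DataCube("ex:UseCaseDataSet", dsd, label="Data Cube for LC/MS use case")
```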

5.1.1 Explicit Definition of Dimensions and Measures

The API MUST provide a method to describe the structure of the data cube by explicitly defining one or more measures and, optionally, one or more dimensions.

5.1.1.1 Definition of Measures

A measure adf-dc:Measure is a component specification for dependent data values, which represent the measured values. The required parameters and the metadata descriptions that MUST be persisted in ADF-TS are described in the following subsections.

5.1.1.1.1 Parameters

The API method for definition of a measure of a data cube MUST provide the following parameters:

  • Exactly one measure property specified by an IRI which MUST represent an owl:ObjectProperty or an owl:DatatypeProperty.
  • Exactly one component data type specified through an IRI.
    If the measure property is an owl:DatatypeProperty, the IRI of the component data type MUST be one of the standard XSD data types (xsd:integer, xsd:decimal, xsd:string etc.).
    If the measure property is an owl:ObjectProperty, the IRI of the component data type MUST be either rdfs:Resource or a complex data type represented by an IRI that represents a data shape [SHACL-ED]. If a data shape is specified, its structure SHOULD be accessible to the API, e.g., by explicit representation within ADF-TS.
The API method SHOULD provide the following parameters:
  • Exactly one order function specified through an IRI which is an instance of one of the subclasses of adf-dc:OrderFunction. Any order function MUST be compatible with the component data type. For instance, the native order adf-dc:nativeOrder is only applicable to components with a primitive component data type that represents real numbers.
  • Exactly one scale type specified through an IRI which MUST be one of the following: adf-dc:NominalScale, adf-dc:OrdinalScale, adf-dc:CardinalScale, adf-dc:IntervalScale or adf-dc:RatioScale.
The API method MAY provide the following parameters:
  • Exactly one fill value according to the data type specified for the component data type parameter. A fill value is a default value that is used for observations with missing measurement entries. That is, the fill value is either a value of a primitive data type or an instance of a complex data type that conforms to the specified data shape.
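The parameter rules above can be sketched as a small validation routine. The property-kind strings and the (deliberately partial) set of XSD types below are illustrative assumptions, not part of this specification:

```python
# Illustrative subset of the standard XSD data types; the spec's list is open-ended.
XSD_TYPES = {"xsd:integer", "xsd:decimal", "xsd:string", "xsd:int", "xsd:double"}
SCALE_TYPES = {"adf-dc:NominalScale", "adf-dc:OrdinalScale",
               "adf-dc:CardinalScale", "adf-dc:IntervalScale", "adf-dc:RatioScale"}

def validate_measure(property_kind, data_type, scale_type=None):
    """Check a measure definition against the parameter rules.

    property_kind is 'datatype' (owl:DatatypeProperty) or 'object'
    (owl:ObjectProperty); both strings are hypothetical markers."""
    if property_kind == "datatype" and data_type not in XSD_TYPES:
        return False  # datatype properties require a standard XSD data type
    if scale_type is not None and scale_type not in SCALE_TYPES:
        return False  # scale type must be one of the five ADF-DC scale types
    return True
```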

5.1.1.1.2 Persistence in ADF Data Description

The API method for definition of a measure of a data cube MUST persist the description of the measure in ADF-TS according to the following structure:

Example 1
	ex:intensityMeasure 
		a                           adf-dc:RatioScale ,       # MAY
									adf-dc:Measure ,
									qb:ComponentSpecification ;
		adf-dc:componentDataType    xsd:int ;
		qb:measure                  «af-x:intensity» .        # the measure property
							

5.1.1.2 Definition of Dimensions

A dimension adf-dc:Dimension is a component specification for independent data values. The required parameters and the metadata descriptions that MUST be persisted in ADF-TS are described in the following subsections.

5.1.1.2.1 Parameters

The API method MUST provide the following parameters:

  • Exactly one dimension property specified through an IRI which MUST represent an owl:ObjectProperty or an owl:DatatypeProperty.
  • The specified dimension property MUST be unique within one data structure definition.
  • Exactly one component data type specified through an IRI which MUST be either one of the standard XSD data types (xsd:integer, xsd:decimal, xsd:string etc.) or a complex data type represented by a data shape. The specified data shape MUST be accessible to the API, e.g., by explicit representation within the ADF-TS.
The API method SHOULD provide the following parameters:
  • Exactly one order function specified through an IRI which is an instance of one of the subclasses of adf-dc:OrderFunction. Any order function MUST be compatible with the component data type.
  • Exactly one scale type specified through an IRI which MUST be one of the following: adf-dc:NominalScale, adf-dc:OrdinalScale, adf-dc:CardinalScale, adf-dc:IntervalScale or adf-dc:RatioScale.
  • Exactly one order value qb:order specified through an integer which defines the order of the dimension components. The order value MUST be distinct for different dimension components of one data structure definition.
The API method MAY provide the following parameters:
  • Exactly one size specified through an integer that defines the maximal number of entries of the dimension.

5.1.1.2.2 Persistence in ADF Data Description

The API method MUST persist the description of the dimension in ADF-TS as follows:

Example 2
	ex:sampleIndexDimension
			a                         adf-dc:RatioScale ,           # MAY
									  adf-dc:Dimension ,
									  qb:ComponentSpecification ;
			adf-dc:componentDataType  afs-qudt:ArbitraryUnitValue ;
			adf-dc:orderedBy          adf-dc:nativeOrder ;          # SHOULD
			qb:dimension              «af-x:index» ;
			qb:order                  "1"^^xsd:long .               # SHOULD
							

Note
When using HDF5 as persistence layer for data cubes, the maximum number of dimensions supported by HDF5 is 32.

After creation of a data cube with measures and dimensions, the metadata descriptions MUST be written to ADF-TS according to the data model specified in [ADF-DCO] and [ADF-DCO-HDF]. In addition to the RDF descriptions for the sample index dimension and the intensity measure, the creation of a data cube MUST result in the following triples describing the data structure:
Example 3
ex:UseCaseDataSet
        a             qb:DataSet ;
        rdfs:label    "Data Cube for LC/MS use case" ;          # SHOULD
        dct:title     "Data Cube for LC/MS use case" ;          # SHOULD
        qb:structure  ex:UseCaseDSD .
ex:UseCaseDSD
        a               qb:DataStructureDefinition ;
        qb:component    ex:sampleIndexDimension ,
                        ex:massPerChargeDimension ,
                        ex:retentionTimeDimension ,
                        ex:intensityMeasure  .

5.1.2 Reference to a Data Structure Definition

The structure of a data cube is expressed by a data structure definition (DSD) qb:DataStructureDefinition which defines the dimensions and measures of the data cube. There MUST be a way to reuse an explicitly created DSD, since many data cubes share a common structure. There MAY be a way to reuse an implicitly created DSD; a DSD is created implicitly for every created data cube that does not reuse an existing DSD. Thus, the API MUST provide a method to specify the structure of a data cube at creation time by reference to the IRI of a predefined DSD. The detailed requirements of this method are described next. The method for creation of a reusable DSD is described afterwards.

5.1.2.1 Parameters

The API method for definition of a data cube by reference to a DSD MUST provide the following parameters:

  • Exactly one IRI MUST be specified for the identification of the DSD. The specified DSD with corresponding dimension and measure component specifications MUST be accessible to the API, e.g., by explicit representation within the ADF-TS.
The API method MAY provide the following parameters:
  • For each dimension component of the DSD exactly one size specified through an integer that defines the maximal number of entries for the dimension component.

5.1.2.2 Persistence in ADF Data Description

The API method MUST persist the description of the data cube in ADF-TS as described above; however, in this case the description of the DSD is already available and does not have to be defined again.

Example 4
ex:UseCaseDataSet
        a             qb:DataSet ;
        rdfs:label    "Data Cube for LC/MS use case" ;  # SHOULD
        dct:title     "Data Cube for LC/MS use case" ;  # SHOULD
        qb:structure  ex:UseCaseDSD .                   # MUST reference the reused DSD

5.2 Creation of a Data Structure Definition

The API MUST provide a method to create a reusable data structure definition qb:DataStructureDefinition that can be referenced at creation of a data cube. In general, the parameters for creation of measures and dimensions listed above for explicit creation of a data cube MUST also be provided for creation of a DSD. Further requirements for the creation of a DSD are described next.

5.2.1 Parameters

The API method for definition of a data structure definition (DSD) MUST provide parameters for creation of measures and dimensions as listed above. Furthermore, the following parameters MUST be provided:

  • The API MUST provide a parameter for specification of the IRI of the DSD.
The API method SHOULD provide the following parameters:
  • It SHOULD be possible to define at least one label for the DSD.
The API method MAY provide the following parameters:
  • Exactly one size for each dimension component that is specified through an integer and defines the maximal number of entries in the corresponding dimension.

5.2.2 Persistence in ADF Data Description

The API method for definition of a DSD MUST persist the description of the DSD in ADF-TS as follows:

Example 5

ex:UseCaseDSD                                       # MUST: IRI of the DSD
    a               qb:DataStructureDefinition ;    # MUST
    rdfs:label      "LC/MS use case DSD" ;          # SHOULD: a label for the DSD
    qb:component    ex:sampleIndexDimension ,
                    ex:massPerChargeDimension ,
                    ex:retentionTimeDimension ,
                    ex:intensityMeasure  .
                

5.3 Creating Data Selections

According to the ADF-DCO, a data selection is an n-dimensional subset of the data of a data cube. A data selection is specified in the form of a selection structure definition adf-dc:SelectionStructureDefinition, which is defined as "a set of component selections on the components of a data structure definition". The selection is based on dimension adf-dc:Dimension and measure adf-dc:Measure components. For each dimension component, exactly one dimension selection MUST be defined. For measure components, at least one measure selection MUST be defined.

The ADF-DC API MUST provide a method to create data selections. The API MUST provide methods to create data selections based on business values (functional selection). The API MAY provide methods to create data selections based on index values (physical selection). In general, the scale type of a component determines which types of selection are possible. For example, on nominal scales only point selections are possible. Furthermore, range selections MUST only be allowed when components have an associated order function.
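The relationship between scale type, order function and permitted selection types can be sketched as follows; the function name and the selection-type strings are illustrative, not defined by this specification:

```python
def allowed_selections(scale_type, has_order_function):
    """Selection types permitted for a component:
    point selections are always possible; range selections require an
    order function and are excluded on nominal scales."""
    selections = {"point"}
    if scale_type != "adf-dc:NominalScale" and has_order_function:
        selections.add("range")
    return selections
```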

5.3.1 Dimension Selections

The API MUST provide methods for specifying a dimension selection adf-dc:DimensionSelection, i.e. a selection on a dimension component.

5.3.1.1 Parameters

The API method for selections on dimension scales MUST provide the following parameters:

  • Independent of the scale type, the API MUST provide a parameter for specifying the selection of values from a scale dimension by one specific dimension value.
  • For dimensions with any scale type except adf-dc:NominalScale, the API MUST provide a parameter for specifying the selection of values from a scale dimension by a value range with a minimum and a maximum value.
  • In case of a selection by value range on a scale with a complex data type, the API MUST base the selection on the order function associated with the scale. In particular, for scales with the data type quantity value qudt:QuantityValue, the API MUST support a range selection for values with different units but the same quantity kind as defined in [QUDT].
  • The method MUST allow selections to be defined for each dimension component of the data cube.
Note
Note that a range selection on quantity values MAY involve a conversion of numeric values when different units are used. Corresponding factors are provided by the Quantities, Units, Dimensions and Data Types Ontology [QUDT].
The API method MAY provide the following parameters:
  • Independent of the scale type, the API MAY provide a parameter for specifying the selection of values from a scale dimension by a set of specific dimension values.
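A range selection over quantity values with mixed units can be sketched as below. The conversion table is a hypothetical stand-in for the conversion multipliers provided by [QUDT]; the unit IRIs and function name are illustrative:

```python
# Illustrative conversion multipliers to a common base unit (seconds);
# in practice these factors come from the QUDT ontology.
TO_SECONDS = {"unit:SEC": 1.0, "unit:MIN": 60.0}

def in_range(value, unit, lo, lo_unit, hi, hi_unit):
    """Range selection over quantity values of the same quantity kind:
    convert everything to a common unit, then compare numerically."""
    v = value * TO_SECONDS[unit]
    return TO_SECONDS[lo_unit] * lo <= v <= TO_SECONDS[hi_unit] * hi

# 90 s lies within the retention-time range [1 min, 2 min]
```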

5.3.2 Measure Selections

The API MUST provide methods for specifying a measure selection adf-dc:MeasureSelection, i.e. a selection on a measure component. The scale type of the measure component determines which selections are possible.

5.3.2.1 Parameters

The API method for selections on measures MUST provide the same parameters as specified for a dimension selection. Additionally, it MUST provide the following parameters:

  • The API method MUST provide a parameter to select the specific measure components for which range or point selections MAY be defined.
  • In case of a measure with a complex data type, the API MUST provide an optional parameter that allows a property path to be defined to specify the element of the measure.

5.3.3 Convenience Methods

The API SHOULD provide a convenience method to select the complete content of the data cube.

5.4 Writing Data into a Data Cube

The ADF-DC API MUST provide a method to write data into a data cube. There MUST be methods for writing data into a data cube using simple n-dimensional array structures. Writing SHOULD be done via data selections.

For all methods that write data into a data cube, the API MUST provide exactly one parameter for the IRI of the corresponding target data cube. Other method-specific parameters are described below.

5.4.1 Principle

Writing into a data cube SHOULD be based on the following principle: a source data selection is written to a target data selection. The values of the source data selection are read in the order of the dimensions and written in the same order into the target data selection. The following figure illustrates this principle: on a 3x5 data cube, a 3x2 data selection (marked in green) is created. This source data selection is written to a 2x3 data selection (marked in red) on the target 3x3 data cube.

Fig. 3 Principle of writing data into a Data Cube
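The principle can be sketched with plain lists: values are read from the source selection in dimension (row-major) order and written, in the same order, into a differently shaped target selection. The helper names are illustrative only:

```python
def flatten(selection):
    """Read the values of a 2-D selection in dimension (row-major) order."""
    return [v for row in selection for v in row]

def reshape(values, rows, cols):
    """Write values, in the same order, into a rows x cols target selection."""
    assert len(values) == rows * cols, "selections must have the same cardinality"
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

# a 3x2 source selection written to a 2x3 target selection (cf. Fig. 3)
source = [[1, 2], [3, 4], [5, 6]]
target = reshape(flatten(source), 2, 3)   # [[1, 2, 3], [4, 5, 6]]
```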

5.4.2 Parameters

The API method for writing data into a data cube MUST provide the following parameters:

  • A parameter for specifying the data selection of the data cube that the data is written to.
  • A parameter for specifying the data selection that contains the data that should be written to the data cube.
  • If the target and source data cubes have different measures, the API MUST provide a parameter for specifying the injective mapping from source to target measures.
  • The API method MUST provide a parameter to specify the data that should be written to the data cube; this data MAY be provided in the form of an n-dimensional array.
The API method for writing data into a data cube MAY provide the following parameters:
  • The API MAY provide a parameter to specify how to handle cases where the target data selection already contains values (e.g., overwriting or keeping existing values).

5.5 Reading Data from a Data Cube

The ADF-DC API MUST provide a method to read the data from a data cube. In particular, the API MUST provide a method to read data in the form of n-dimensional arrays. In general, the API MAY realize the reading of a data cube by creation of a (copy of the) data cube that is to be read. Thus, the principle for reading is the same as for writing, only vice versa: a data selection on a source data cube is written into a selection of a target data cube. Because of this, the API method for writing MAY be reused and the API MAY omit a separate method for reading.

5.5.1 Reading the Complete Data from a Data Cube

The ADF-DC API MUST provide a method to read the complete data from a data cube.

5.5.1.1 Parameters

The API method for reading from a data cube MUST provide the following parameters:

  • The API MUST provide a parameter for specification of the IRI of the data cube that is to be read.

5.5.2 Reading Data from a Data Cube by a Data Selection

The API MUST provide a method for reading data from a data cube by specifying a data selection.

5.5.2.1 Parameters

The API method for reading from a data cube MUST provide the following parameters:

  • The API MUST provide a parameter for specification of the IRI of the data cube that is to be read.
  • The API MUST provide a parameter for specifying a data selection (as described in Dimension Selection) that defines the values to be read from the data cube.
  • The method MUST allow selections to be defined for each scale dimension and measure.

5.6 Complex Data Types

Complex data types are represented in RDF as data shapes using the Shapes Constraint Language [SHACL-ED]. They are referenced, e.g., by measure and dimension component specifications. As described above, the API MUST provide method parameters to specify complex data types for dimension and measure components. The API MAY also provide a method to create new complex data types that can be referenced.

5.7 HDF5 Specifics

Persisting data in HDF5 files poses some additional requirements.

5.7.1 Describing Dimensions

Regarding dimension components, there are some HDF5 specifics that MAY be supported by the API. These are listed in the following subsections.

5.7.1.1 Chunking

Chunking enables storage space to be managed far more efficiently.

A chunk size of 1000 means that dimension values are stored in chunks of 1000 × (size of the data type used), e.g. 1000 × 8 bytes per double value = 8 KB per chunk. Only chunks that actually hold data require storage space. Without chunking, the storage space requested for the dimension would be completely allocated immediately upon creation.
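The storage arithmetic above can be sketched as follows; the function name is illustrative, and 1 KB is taken as 1000 bytes to match the example:

```python
import math

def allocated_bytes(values_written, chunk_len=1000, value_size=8):
    """Bytes allocated when only chunks that actually hold data are stored:
    chunk_len values per chunk, value_size bytes per value (8 for a double)."""
    chunks_in_use = math.ceil(values_written / chunk_len)
    return chunks_in_use * chunk_len * value_size

# 1500 doubles occupy 2 chunks of 8 KB each; without chunking, the full
# requested extent would be allocated up front regardless of values_written.
```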

5.7.1.2 Describing Scales of Dimensions

The API MAY provide a parameter on the dimension creation method to specify the scale of a dimension by scale mappings, in order to allow more efficient storage of dimension values. If the API provides a corresponding parameter, it SHOULD be possible to specify the following scale mappings:

  • An identity scale mapping adf-dco-hdf:IdentityScaleMapping that specifies that the dimension index is equal to the dimension value.
  • An explicit scale mapping adf-dco-hdf:ExplicitScaleMapping that specifies a dimension mapping which defines the explicit mapping of dimension indexes to dimension values.
  • A function scale mapping adf-dco-hdf:FunctionScaleMapping that specifies an index function, which defines the mapping of a dimension index to a dimension value by some mathematical function. For example, linear functions and various logarithms are provided by [ADF-DCO-HDF].
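As a non-normative illustration, the three kinds of scale mapping can be sketched as follows; all names and sample values are hypothetical:

```python
# Identity scale mapping: dimension index i maps to the value i itself.
def identity(i):
    return float(i)

# Explicit scale mapping: an explicit index -> value table.
explicit = {0: 10.0, 1: 12.5, 2: 17.0}

# Function scale mapping: the value is computed from the index by a
# mathematical function, here a linear one (slope and offset are made up).
def linear(i, slope=0.5, offset=0.0):
    return slope * i + offset

# e.g. a retention-time axis sampled every 0.5 s starting at 0:
retention_times = [linear(i) for i in range(4)]  # [0.0, 0.5, 1.0, 1.5]
```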

5.8 Change History

Version Release Date Remarks
0.3.0 2015-04-30 Initial working draft version
0.4.0 2015-06-18
  • included disclaimer information
  • harmonized layout across ADF specifications
  • harmonized structure across ADF specifications
1.0.0 2015-09-29
  • general overhaul of the specification: removed code examples and strictly focused on definitions
  • added section for indication of requirement levels
  • updated versions, dates and document status
1.1.0 RC 2016-03-11
  • updated versions, dates and document status
  • added section on number formatting to document conventions
  • updated Fig. 1
1.1.0 RF 2016-03-31
  • updated versions, dates and document status
1.1.5 2016-05-13
  • updated versions and dates
1.2.0 Preview 2016-09-23
  • updated versions and dates
1.2.0 RC 2016-12-07
  • updated versions and dates
1.3.0 Preview 2017-03-31
  • updated versions and dates
1.3.0 RF 2017-06-30
  • updated versions and dates

A. References

A.1 Normative references

[ADF]
Allotrope. Allotrope Data Format Overview. URL: http://purl.allotrope.org/TR/adf/
[ADF-DCO]
Allotrope. ADF Data Cube Ontology. URL: http://purl.allotrope.org/TR/adf-dco/
[ADF-DCO-HDF]
Allotrope. ADF Data Cube to HDF5 Mapping Ontology. URL: http://purl.allotrope.org/TR/adf-dco-hdf/
[HDF5]
The HDF Group. HDF5 File Format Specification 2.0. URL: http://www.hdfgroup.org/HDF5/doc/H5.format.html
[QB]
W3C. The RDF Data Cube Vocabulary. URL: http://www.w3.org/TR/vocab-data-cube/
[QUDT]
QUDT organization. QUDT Documentation. URL: http://qudt.org/
[SHACL-ED]
RDF Data Shapes Working Group. Shapes Constraint Language, Editor's Draft. URL: http://w3c.github.io/data-shapes/shacl/
[rfc2119]
S. Bradner. IETF. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119

A.2 Informative references

[UML]
Object Management Group. The Unified Modeling Language (UML). URL: http://www.omg.org/spec/UML/2.5/PDF