The Allotrope Data Format (ADF) [[!ADF]] consists of several APIs, taxonomies and ontologies. This document describes the Allotrope Data Format Data Cube Ontology (ADF-DCO), which allows the structure and content of n-dimensional data to be described. ADF-DCO is based on the RDF Data Cube Vocabulary (QB) [[!QB]] and extends it with concepts for complex data types, data selections, scales, order functions, indexes and HDF mappings.
THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This document is part of a set of specifications on the Allotrope Data Format (ADF) [[!ADF]].
The Allotrope Data Format (ADF) defines an interface for storing scientific observations from analytical chemistry. It is intended for long-term stability of archived analytical data and fast real-time access to it. The ADF Data Cube API (ADF-DC) defines an interface for storing raw analytical data. ADF-DCO uses the RDF Data Cube Vocabulary [[!QB]] and maps the abstract data cubes defined by the terms of the data cube ontology to their concrete HDF5 representations in the ADF file. The structure and metadata of HDF5 objects are described by an HDF5 ontology, which is based on the HDF5 specifications.
This document is structured as follows: First, the role of the ADF Data Cube API within the high-level structure of the ADF [[!ADF]] API stack is presented. Then, the requirements for the ADF Data Cube Ontology are described, followed by an overview of the structure of ADF-DCO and its relation to QB. Next, the concept of primitive and complex data types is explained and illustrated with example representations. The main section then presents the ADF-DCO concepts in detail, starting with the extensions to QB component specifications and the introduction of scales and order functions. Finally, subsetting of data cubes and HDF5 mappings are described.
ADF-DCO will be published under http://purl.allotrope.org/ontologies/datacube
The IRI of an entity has two parts: the namespace and the local identifier.
Within one RDF document the namespace might be associated with a shorter prefix.
For instance, the namespace IRI http://www.w3.org/2002/07/owl# is commonly associated with the prefix owl:, and one can write owl:Class instead of the full IRI http://www.w3.org/2002/07/owl#Class.
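As a minimal illustration of this convention, the following Python sketch expands a prefixed name into a full IRI. The `PREFIXES` table and the `expand` function are hypothetical names introduced here for illustration; only the owl binding is taken from the text above.

```python
# Hypothetical prefix table; only the owl binding comes from the text.
PREFIXES = {"owl": "http://www.w3.org/2002/07/owl#"}


def expand(prefixed_name: str, prefixes: dict = PREFIXES) -> str:
    """Expand a prefixed name such as 'owl:Class' into a full IRI."""
    prefix, separator, local_id = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    # The full IRI is the namespace followed by the local identifier.
    return prefixes[prefix] + local_id


full_iri = expand("owl:Class")
```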
Within the biomedical domain the local identifier is often an alphanumeric ID which is not human readable.
The Allotrope Foundation Taxonomies follow this approach, e.g. a process is represented as
To enhance readability within this document, the preferred label from the ontology or taxonomy is used for the corresponding entity.
I.e., instead of
af-p:AFP_0001617 the corresponding entity is named as
If the namespace is clear by the context the prefix MAY be omitted and the entity is named simply
If the label contains spaces, the entity MAY be surrounded by Guillemets to avoid ambiguities, e.g.
Within this specification, the following namespace prefix bindings are used:
Within this document the definitions of MUST, SHOULD and MAY are used as defined in [[!rfc2119]].
Within this document, decimal numbers will use a dot "." as the decimal mark.
The following figure illustrates the high-level structure of the Allotrope Data Format (ADF) API stack:
This document focuses on the ADF Data Cube Ontology, which is used by the ADF Data Cube API [[ADF-DC]] highlighted in the figure above.
The ADF Data Cube Ontology (ADF-DCO) provides a data model for the structure of n-dimensional data and subsets thereof. Specifications of the meta-data of any n-dimensional data structures covered by Allotrope Foundation Ontologies (AFO) MUST be possible. Thus, ADF-DCO MUST provide the following means:
The following figure illustrates the high-level structure of the ADF Data Cube (ADF-DC) API with the ADF Data Cube Ontology and its components.
ADF-DCO imports and thus extends the RDF Data Cube Vocabulary (QB). The ADF-DC API is based on the vocabulary and data structures of ADF-DCO.
The following figure illustrates the high-level structure of the RDF Data Cube Vocabulary (QB) [[!QB]]:
In QB the central classes are
qb:DataStructureDefinition defines the different components of a data set and thus provides the structure for the observations contained in a
qb:DataStructureDefinition is reusable by many
The ADF Data Cube Ontology (ADF-DCO) extends the RDF Data Cube Vocabulary by classes and properties for representation of complex data types, scales,
order functions and data selections.
The following figure illustrates the relation between high-level classes of the ADF Data Cube Ontology and their relations to classes from QB.
ADF-DCO extends the QB ontology with several concepts.
The QB classes
qb:ComponentSpecification remain central
in ADF-DCO; however, extensions for selections are defined in a parallel structure.
E.g., ADF-DCO defines
DataSelection, a corresponding
SelectionStructureDefinition and a
The details of the extensions are described in the next section.
The general schema of a
qb:DataStructureDefinition is illustrated in the [[QB]] schema above.
ADF-DCO makes several extensions to the basic schema defined by [[QB]] in order to allow efficient storage of observation data in [[HDF5]].
These extensions are described in the following subsections.
The data types associated with the component of a
qb:DataSet can be either primitive or complex.
Primitive data types encompass all valid primitive RDF types (for details see [[rdf11-concepts]] section on data types)
xsd:string etc. as well as
Complex data types MAY be used when a single primitive data type is not sufficient to represent the values of one component. For instance, a measurement value with a unit or an error MAY be expressed by a complex data type.
In ADF-DCO, a complex data type MUST be represented by a shape according to the Shape Constraint Language (SHACL) [[SHACL]]. While SHACL allows very complex graph patterns to be specified, ADF-DCO defines the following restrictions on the usage of the SHACL vocabulary for the specification of complex data types:
sh:minCount = sh:maxCount = 1.
The Allotrope Foundation Ontologies (AFO) [[AFO]] reuse [[!QUDT]] for the representation of quantity values and units. A quantity value, defined according to [[!QUDT]], is the most prominent example of a complex data type. In addition to the numeric value, a unit MUST be specified. This is necessary if measurement values of one data set use different units. The following shape describes the structure of a quantity value:
ex:QuantityValueType a sh:Shape ;
    sh:property [
        sh:predicate qudt:numericValue ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:Literal ;
        sh:datatype xsd:double
    ] ;
    sh:property [
        sh:predicate qudt:unit ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;    # the specified unit has to be an IRI
        sh:class qudt:Unit      # the unit must be an instance of qudt:Unit
    ] .
The following representation would conform to the ex:QuantityValueType shape:
ex:MassValue a qudt:QuantityValue ; qudt:numericValue "15"^^xsd:double ; qudt:unit qudt-unit:Gram .
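The cardinality restriction (each property occurring exactly once) can be illustrated with a small Python sketch. This is not a SHACL engine: the dictionary `QUANTITY_VALUE_SHAPE` and the function `conforms` are hypothetical simplifications that only check value counts and Python value kinds, and nodes are modelled as plain dictionaries mapping a property to a list of values.

```python
# Simplified stand-in for ex:QuantityValueType: every property occurs exactly once.
QUANTITY_VALUE_SHAPE = {
    "qudt:numericValue": {"min": 1, "max": 1, "kind": float},
    "qudt:unit": {"min": 1, "max": 1, "kind": str},
}


def conforms(node: dict, shape: dict) -> bool:
    """Return True if each constrained property has an allowed count and kind."""
    for prop, rule in shape.items():
        values = node.get(prop, [])
        if not rule["min"] <= len(values) <= rule["max"]:
            return False
        if not all(isinstance(value, rule["kind"]) for value in values):
            return False
    return True


mass_value = {"qudt:numericValue": [15.0], "qudt:unit": ["qudt-unit:Gram"]}
missing_unit = {"qudt:numericValue": [15.0]}
two_units = {"qudt:numericValue": [15.0],
             "qudt:unit": ["qudt-unit:Gram", "qudt-unit:Kilogram"]}
```

In this sketch `mass_value` conforms, while `missing_unit` and `two_units` violate the minimum and maximum count respectively.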
Complex data types can also be nested. For instance, the data type of a weighing result can be specified with tare and net weight, both represented by a complex data type with numeric value, error and a predefined unit:
ex:WeighingResultType a sh:Shape ;
    sh:property [
        sh:predicate «af-x:tare weight» ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:Blank ;               # the object of 'tare weight' MAY be a blank node
        sh:valueShape ex:MassValueType       # use nested mass value shape
    ] ;
    sh:property [
        sh:predicate «af-x:net weight» ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:Blank ;
        sh:valueShape ex:MassValueType       # use nested mass value shape
    ] .

ex:MassValueType a sh:Shape ;
    sh:property [
        sh:predicate qudt:numericValue ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:Literal ;
        sh:datatype xsd:double
    ] ;
    sh:property [
        sh:predicate qudt:standardUncertainty ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:Literal ;
        sh:datatype xsd:double
    ] ;
    sh:property [
        sh:predicate qudt:unit ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:hasValue qudt-unit:Gram           # the unit MUST be qudt-unit:Gram
    ] .
The following result representation conforms to the shape:
ex:WeighingResult a af-r:Result ;
    «af-x:tare weight» [
        qudt:numericValue "25.3332"^^xsd:double ;
        qudt:standardUncertainty "0.2"^^xsd:double ;
        qudt:unit qudt-unit:Gram
    ] ;
    «af-x:net weight» [
        qudt:numericValue "20.219"^^xsd:double ;
        qudt:standardUncertainty "0.2"^^xsd:double ;
        qudt:unit qudt-unit:Gram
    ] .
The following [[SHACL]] property constraints MAY be used to describe complex data types in ADF-DCO:
In [[QB]] each data set (or cube) has an associated data structure definition which defines the components of the cube. These components are either dimensions or measures. Dimension components represent independent variables and identify the observations. Measure components represent dependent variables and store the observation values.
In addition to the [[!QB]] concepts, ADF-DCO defines
adf-dc:Measure as subclasses of
In [[!QB]] the distinction of measures and dimensions is implicitly defined by the
Further, ADF-DCO defines the annotation property
adf-dc:componentDataType which relates a
qb:ComponentSpecification with the type of values comprised by the component.
Examples are XSD data types such as
xsd:double, and instances of
ex:SampleDimension a adf-dc:Dimension, qb:ComponentSpecification ;
    qb:dimension «af-x:measured sample» ;
    qb:order 1 ;
    adf-dc:componentDataType rdfs:Resource .

ex:MeasureCount a adf-dc:Measure, qb:ComponentSpecification ;
    qb:measure «af-x:total cell count» ;
    adf-dc:componentDataType xsd:integer .

ex:MeasureWeight a adf-dc:Measure, qb:ComponentSpecification ;
    qb:measure «af-x:net weight» ;
    adf-dc:componentDataType ex:QuantityValueType .
A scale is a categorization of types of variables. Scales are essential for defining subsets (ranges); however, [[QB]] currently does not support scales.
In ADF-DCO, scales define an additional type for
adf-dc:Dimensions and characterize the type of data values for the component.
Defining scale types for dimensions is important because the scale type also determines which operations and selections are possible on the data values of the component.
The following figure illustrates the types of scales defined in the ADF Data Cube Ontology:
The nominal scale differentiates between items or subjects based only on their names or (meta-)categories and other qualitative classifications they belong to. Dichotomous data thus involves the construction of classifications as well as the classification of items. In general, nominal scales are used whenever the values of a component can only be tested for equality (==, !=) and no 'natural' order can be defined. This is the case, for example, for measured colors ("red", "blue", ...) or for the IRIs of measured samples (continued example):
ex:SampleDimension a qb:ComponentSpecification , adf-dc:Dimension , adf-dc:NominalScale ; qb:dimension «af-x:measured sample» .
An ordinal scale is a scale which allows for rank order (1st, 2nd, 3rd, etc. or very good, good, average, bad) by which data can be sorted, but still does not allow for relative degree of difference between them. Ordinal scales are used for example for an index dimension that represents subsequent measurements or a peak list:
ex:IndexDimension a qb:ComponentSpecification , adf-dc:Dimension , adf-dc:OrdinalScale ; qb:dimension «af-x:index» .
A cardinal scale is a scale where the difference between two values can be measured and its meaning is independent of the absolute values. There are two types of cardinal scales, namely interval scale and ratio scale which are described next.
An interval scale is a cardinal scale where the ratio between values is not meaningful. E.g. temperature on the Celsius scale has an arbitrarily defined zero point (the freezing point of a particular substance under particular conditions). Ratios are not allowed since 20 °C cannot be said to be 'twice as hot' as 10 °C. Another example is a date measured from an arbitrary epoch (such as AD). As with temperatures, multiplication or division cannot be carried out directly between any two dates.
ex:TemperatureDimension a qb:ComponentSpecification , adf-dc:Dimension , adf-dc:IntervalScale ; qb:dimension «af-x:temperature» .
A ratio scale is a scale which possesses a meaningful (unique and non-arbitrary) zero value. Most measurements in the physical sciences and engineering are done on ratio scales. Examples include mass, length, duration, plane angle, energy and electric charge. Ratios are allowed because having a non-arbitrary zero point makes it meaningful to say, for example, that one object has "twice the length" of another (= is "twice as long").
ex:MassDimension a qb:ComponentSpecification , adf-dc:Dimension , adf-dc:RatioScale ; qb:dimension «af-x:net weight» .
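The relationship between scale types and permitted operations can be summarized in a short Python sketch. The names `Scale`, `OPERATIONS` and `supports_range_selection` are hypothetical, and the sketch assumes that each stronger scale also permits the operations of the weaker ones, which the text above suggests but does not state as a rule.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Assumption: every scale inherits the operations of the weaker scales.
OPERATIONS = {
    Scale.NOMINAL: {"equality"},
    Scale.ORDINAL: {"equality", "ordering"},
    Scale.INTERVAL: {"equality", "ordering", "difference"},
    Scale.RATIO: {"equality", "ordering", "difference", "ratio"},
}


def supports_range_selection(scale: Scale) -> bool:
    """A range needs ordered values, so ordering must be a permitted operation."""
    return "ordering" in OPERATIONS[scale]
```

Under these assumptions a sample dimension on a nominal scale only admits equality tests, while a temperature dimension on an interval scale admits ordering and differences but not ratios.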
An order function specifies the comparison of values for a component specification. Depending on the data type associated with a component specification and the scale type different types of order functions can be specified.
The following figure illustrates the different order functions defined by ADF-DCO:
For native, lexicographical and quantity value order functions, standard instances are defined.
That is, additionally to the classes
adf-dc:quantityValueOrder are defined in ADF-DCO.
Thus, order functions have to be specified in detail only in the complex case.
In the following, the different order functions are described in detail and illustrated with examples. The basis for this is a mass measurement result which has three components: index, time and mass measure:
ex:MassMeasurementResult a qb:DataSet ;
    qb:structure ex:MassMeasurementResultStructure .

ex:MassMeasurementResultStructure a qb:DataStructureDefinition ;
    qb:component ex:IndexDimension ;
    qb:component ex:TimeDimension ;
    qb:component ex:MassMeasure .
An observation of this data set MAY look like this:
ex:obs123 a qb:Observation ;
    qb:dataSet ex:MassMeasurementResult ;
    «af-x:index» 4 ;
    «af-x:event duration» [
        qudt:numericValue '24.02'^^xsd:double ;
        qudt:unit qudt-unit:SecondTime
    ] ;
    ex:complexMassMeasure [
        «af-x:net weight» [
            qudt:numericValue '14.0'^^xsd:double ;
            qudt:standardUncertainty '0.2'^^xsd:double ;
            qudt:unit qudt-unit:Gram
        ] ;
        «af-x:tare weight» [
            qudt:numericValue '15.0'^^xsd:double ;
            qudt:standardUncertainty '0.8'^^xsd:double ;
            qudt:unit qudt-unit:Gram
        ]
    ] .
A native order is defined as the total order of real numbers or subsets thereof if a component specification refers to
The native order for timestamps and durations (
xsd:duration) is in increasing time and the native order of booleans (
false is less than
Regarding the example above, the
ex:IndexDimension can be associated with
a native order as follows:
ex:IndexDimension a adf-dc:Dimension, adf-dc:OrdinalScale ; adf-dc:orderedBy adf-dc:nativeOrder .
The framework MUST support implementation of native orders for all primitive XSD data types.
The lexicographical order is an order function which defines an order of elements by characters.
It is defined for example for a component specification with component data type
# Adaptation of the example above: the integer property «af-x:index» is replaced by a string identifier property
ex:obs123 a qb:Observation ;
    qb:dataSet ex:MassMeasurementResult ;
    dct:identifier "sample 4" .

ex:IndexDimension a adf-dc:Dimension, adf-dc:OrdinalScale ;
    qb:dimension dct:identifier ;
    adf-dc:orderedBy adf-dc:lexicographicalOrder .
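The practical difference between a native order on integers and a lexicographical order on string identifiers can be seen in a small Python sketch. The identifiers follow the "sample 4" pattern of the example; the other values are made up for illustration, and Python's built-in string comparison (by code point) is used as one possible lexicographical order.

```python
indexes = [10, 4, 2]
identifiers = [f"sample {index}" for index in indexes]

# Native order: numeric comparison of the integer values.
native_sorted = sorted(indexes)

# Lexicographical order: character-by-character comparison of the strings,
# so "sample 10" sorts before "sample 2" because "1" precedes "2".
lexicographic_sorted = sorted(identifiers)
```

Here `native_sorted` is `[2, 4, 10]`, whereas `lexicographic_sorted` is `['sample 10', 'sample 2', 'sample 4']`, which is why the choice of order function matters for range selections.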
The quantity value order is an order function which is defined for instances of
which describe values with the same underlying quantity kind. This ordering orders the quantity values by their numeric values
normalized to the SI standard unit, e.g. all length quantity values are first converted internally into meters and then
ordered by the normalized numeric value.
Regarding the example above, the
ex:TimeDimension can be associated with a quantity value order as follows:
ex:TimeDimension adf-dc:orderedBy adf-dc:quantityValueOrder .
The properties defined in the property path MUST be specified by full URIs.
The framework MUST implement the actual comparison functions, which tolerate quantity values with different units of the same quantity kind. E.g. a duration specified in minutes is comparable to a duration specified in seconds. The required conversion factors are provided by [[!QUDT]].
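A minimal Python sketch of such a comparison is shown below. The conversion table `TO_SECONDS` is a hypothetical stand-in for the factors that would be obtained from QUDT, and quantity values are modelled as plain dictionaries rather than RDF nodes.

```python
# Hypothetical conversion factors to the SI unit (second); a real
# implementation would obtain them from QUDT.
TO_SECONDS = {
    "qudt-unit:SecondTime": 1.0,
    "qudt-unit:MinuteTime": 60.0,
}


def si_value(quantity_value: dict) -> float:
    """Normalize the numeric value of a duration to seconds."""
    return quantity_value["numericValue"] * TO_SECONDS[quantity_value["unit"]]


durations = [
    {"numericValue": 2.0, "unit": "qudt-unit:MinuteTime"},
    {"numericValue": 90.0, "unit": "qudt-unit:SecondTime"},
    {"numericValue": 24.02, "unit": "qudt-unit:SecondTime"},
]

# Quantity value order: sort by the normalized numeric value.
ordered = sorted(durations, key=si_value)
```

After normalization, the two-minute duration sorts last even though its numeric value is the smallest.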
A complex value order is an order function which is defined for component specifications with complex data types which are represented by shapes. It consists of a set of items, each defining a property path of the shape, an order number and a (sub-)order function for the respective values. Thus, a complex value order can be considered a sorted list of orders specified for property paths of a shape. A complex value order is a generalization of the quantity value order.
ex:MassMeasure adf-dc:orderedBy ex:complexResultOrder .

ex:complexResultOrder a adf-dc:ComplexValueOrder ;
    adf-dc:hasItem [
        adf-dc:propertyPath "«af-x:net weight»/qudt:numericValue" ;   # points to a primitive value
        adf-dc:order 1 ;                                              # the net weight numeric value is considered first
        adf-dc:orderedBy adf-dc:nativeOrder                           # standard reference to native order
    ] ;
    adf-dc:hasItem [
        adf-dc:propertyPath "«af-x:tare weight»" ;                    # points to a complex value
        adf-dc:order 2 ;                                              # the tare weight is considered second within the complex result order
        adf-dc:orderedBy adf-dc:quantityValueOrder                    # reference to an order function which handles the complex data type
    ] .
The second item of
ex:complexResultOrder shows how nesting works in order functions.
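The idea of a complex value order as a sorted list of sub-orders can be sketched in Python as a composite sort key. All names (`ITEMS`, `resolve`, `complex_key`, `weighing`) are hypothetical, values are modelled as nested dictionaries, and the quantity sub-order is simplified by assuming that all quantity values already use the same unit.

```python
def resolve(value, path):
    """Follow a '/'-separated property path through nested dictionaries."""
    for step in path.split("/"):
        value = value[step]
    return value


def native(value):
    return value


def quantity(value):
    # Simplification: assumes every quantity value already uses the same unit.
    return value["qudt:numericValue"]


# (order number, property path, sub-order), mirroring ex:complexResultOrder.
ITEMS = [
    (2, "af-x:tare weight", quantity),
    (1, "af-x:net weight/qudt:numericValue", native),
]


def complex_key(result):
    """Build a tuple of sub-keys, arranged by ascending order number."""
    ordered_items = sorted(ITEMS, key=lambda item: item[0])
    return tuple(sub_order(resolve(result, path)) for _, path, sub_order in ordered_items)


def weighing(net, tare):
    return {
        "af-x:net weight": {"qudt:numericValue": net},
        "af-x:tare weight": {"qudt:numericValue": tare},
    }


results = [weighing(14.0, 15.0), weighing(10.0, 20.0), weighing(14.0, 12.0)]
ranked = sorted(results, key=complex_key)
```

Results are compared by net weight first; the tare weight only breaks ties between results with equal net weight.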
ADF-DCO provides comprehensive means to select subsets of data contained in a data cube (
The RDF Data Cube Vocabulary [[!QB]] provides the concept of a slice.
A slice denotes a subset of a data set, defined by fixing a subset of the dimensional values.
That is a
qb:Slice allows restricting selected dimensions to some set of single values.
ADF-DCO introduces the concept of data selections. Data selections can be based on dimensions or measures.
If a selection is (solely) based on the dimensions of the cube, such a selection is called a data slab. The slices defined in QB are a special kind of slabs.
If a selection is based on the measures, it is a projection or a filter. If a subset of multiple measures is selected or parts of the datatypes of the measures are selected via property paths, then the selection is a data projection.
If the selection is based on the observed values, then it is a data filter. While filters and slabs look similar, there is an important difference between them: In the case of dimensions, it is required that each value of the dimension is distinct from another within the same dimension. No duplicate values are possible, otherwise the dimension could never identify the point in the cube. For measures this limitation does not exist.
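The three kinds of selection can be contrasted on a tiny in-memory cube. The following Python sketch models observations as dictionaries; the component names and the second measure `viability` are hypothetical and only serve to make the projection visible.

```python
DIMENSIONS = ("index", "sample")

observations = [
    {"index": 1, "sample": "ex:sample1", "count": 98, "viability": 0.91},
    {"index": 2, "sample": "ex:sample1", "count": 120, "viability": 0.88},
    {"index": 3, "sample": "ex:sample2", "count": 98, "viability": 0.95},
]

# Data slab: the selection is based solely on dimension values.
slab = [o for o in observations if o["sample"] == "ex:sample1"]

# Data filter: the selection is based on observed (measure) values,
# which, unlike dimension values, may contain duplicates.
filtered = [o for o in observations if o["count"] == 98]

# Data projection: only a subset of the measures is kept.
projection = [{key: o[key] for key in (*DIMENSIONS, "count")} for o in observations]
```

The filter returns two observations sharing the measure value 98, something that cannot happen for a fully specified point in the dimensions.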
A data selection is an n-dimensional subset of a data cube where the set of observations (i.e. the measures) is selected based
on component selections on dimensions or measures.
The following figure illustrates the
adf-dc:DataSelection with related classes in QB:
For each dimension component of the data structure definition, the selection structure definition MUST specify a dimension selection.
For each measure component of the data structure definition that is part of the selection, the selection structure definition MUST
specify a measure selection. There MUST be at least one measure selection defined.
A data selection allows defining different types of component selections on components such as point or range selections. If no point
or range selection is specified, the component selection is assumed to be an unbounded selection, however the type of selection SHOULD be
The different types of selections are described below.
adf-dc:DataSelection provides more possibilities to define selections on dimensions; it extends the concept of
qb:Slice. In particular, each slice can be represented as a data selection on dimensions by using point selections.
The running example for the following subsections is a cell counter measurement result, expressed by a data cube with two dimension components (index and sample) and one measurement component (total cell count).
# An observation of the data cube
ex:obs13 a qb:Observation ;
    qb:dataSet ex:CellCounterResultSet ;
    «af-x:index» 3 ;
    «af-x:sample» ex:sample1 ;
    «af-x:total cell count» 98 .

# The data cube has a structure
ex:CellCounterResultSet a qb:DataSet, «af-x:cell counter measurement result» ;
    qb:structure ex:CellCountStructure .

# The data structure definition specifies three components
ex:CellCountStructure a qb:DataStructureDefinition ;
    qb:component ex:IndexDimension ;
    qb:component ex:SampleDimension ;
    qb:component ex:CellCountMeasure .

# dimension: index
ex:IndexDimension a qb:ComponentSpecification, adf-dc:Dimension, adf-dc:OrdinalScale ;
    qb:dimension «af-x:index» ;
    qb:order 1 ;
    adf-dc:componentDataType xsd:integer .

# dimension: sample
ex:SampleDimension a qb:ComponentSpecification, adf-dc:Dimension, adf-dc:NominalScale ;
    qb:dimension «af-x:sample» ;
    qb:order 2 ;
    adf-dc:componentDataType rdfs:Resource .

# measure: cell count
ex:CellCountMeasure a qb:ComponentSpecification, adf-dc:Measure, adf-dc:OrdinalScale ;
    qb:measure «af-x:total cell count» ;
    adf-dc:componentDataType xsd:integer .
A component selection is a specification of a set of values of a single component specification of a data cube.
The following figure illustrates the different subclasses of
adf-dc:ComponentSelection that further define
the kind of selection used:
Selections are defined either by a list of items (point selection) or by a range (range selection), depending on the type of component specification.
The other criterion is whether the selection is made on a dimension or on a measure: a component selection on a measure is
a measure selection, and a component selection on a dimension is a dimension selection. If a measure selection is not the unbounded
selection, then the data selection acts as a filter on the data cube. If not all measures of the data structure definition are selected, or
the measure selection defines property paths on the underlying datatype, the data selection acts as a projection on the data cube.
A data selection needs to specify exactly one component selection for each dimension component, so the data slab part MUST always be fully defined.
Based on the data structure definition, an
adf-dc:DataSelection with a corresponding
adf-dc:SelectionStructureDefinition is specified as follows:
# The data selection is a slab
ex:CellCounterSlab a adf-dc:DataSelection ;
    adf-dc:dataSelectionOf ex:CellCounterResultSet ;
    qb:structure ex:CellCountSlabStructure .

# The selection structure definition specifies two component selections for the dimension components
# defining the slab and includes the measure component
ex:CellCountSlabStructure a adf-dc:SelectionStructureDefinition ;
    adf-dc:selects ex:IndexDimensionSelection ;
    adf-dc:selects ex:SampleDimensionSelection ;
    adf-dc:selects ex:CellCountMeasureSelection .
The kind of scale of the dimension component determines which selections are allowed.
A point selection is a component selection on a dimension or measure that selects a set of distinct and named values.
# The sample dimension selection is a point selection
ex:SampleDimensionSelection a adf-dc:PointSelection, adf-dc:DimensionSelection ;
    adf-dc:selectionOn ex:SampleDimension ;
    adf-dc:hasItem ex:sample1, ex:sample2, ex:sample10 .
An unbounded selection is a component selection on a dimension or measure which selects all values.
# The cell count measure selection is an unbounded selection
ex:CellCountMeasureSelection a adf-dc:UnboundedSelection, adf-dc:MeasureSelection ;
    adf-dc:selectionOn ex:CellCountMeasure .
A range selection is a component selection on a component specification which represents an ordinal scale. A range selection MUST define a minimum value and a maximum value when the component data type is primitive. If the component has a complex data type, then maximum and minimum need to be specified by reference to an example complex type according to the shape defined by the component specification.
# The index dimension selection defines a range
ex:IndexDimensionSelection a adf-dc:RangeSelection, adf-dc:DimensionSelection ;
    adf-dc:selectionOn ex:IndexDimension ;
    adf-dc:minimumValue 2 ;
    adf-dc:maximumValue 10 .
It is also possible to define unbounded and right or left bounded range selections.
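How point, range and unbounded selections combine into a data selection can be sketched in Python with simple predicates. The helper names (`point`, `value_range`, `unbounded`, `select`) are hypothetical, the cube is a list of dictionaries with made-up observations, and the sketch assumes that range bounds are inclusive, which the text above does not specify.

```python
from typing import Optional


def point(items):
    """Point selection: accept only the listed values."""
    allowed = set(items)
    return lambda value: value in allowed


def value_range(minimum: Optional[int] = None, maximum: Optional[int] = None):
    """Range selection; a missing bound leaves that side open (inclusive bounds assumed)."""
    return lambda value: (minimum is None or value >= minimum) and (
        maximum is None or value <= maximum
    )


def unbounded():
    """Unbounded selection: accept every value."""
    return lambda value: True


# One component selection per component, as in the cell counter example.
selection = {
    "index": value_range(2, 10),
    "sample": point(["ex:sample1", "ex:sample2", "ex:sample10"]),
    "count": unbounded(),
}


def select(observations, component_selections):
    """Keep the observations accepted by every component selection."""
    return [
        o for o in observations
        if all(accepts(o[name]) for name, accepts in component_selections.items())
    ]


cube = [
    {"index": 1, "sample": "ex:sample1", "count": 75},
    {"index": 3, "sample": "ex:sample1", "count": 98},
    {"index": 4, "sample": "ex:sample7", "count": 64},
]
selected = select(cube, selection)
```

Only the observation with index 3 remains: index 1 falls outside the range and `ex:sample7` is not among the selected points. A left-bounded range would be written as `value_range(minimum=2)` in this sketch.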
If a dimension component is associated with a complex data type, the minimum and maximum MUST reference complex data values accordingly. For instance a cube may specify a time dimension component with a complex data type as follows:
ex:TimeDimension a qb:ComponentSpecification, adf-dc:Dimension, adf-dc:IntervalScale ; qb:dimension «af-x:total retention time»; qb:order 1; adf-dc:componentDataType ex:DurationType .
The duration type has a numeric value and a unit which is specified through a shape:
ex:DurationType a sh:Shape ;
    sh:property [
        sh:predicate qudt:numericValue ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:Literal ;
        sh:datatype xsd:double
    ] ;
    sh:property [
        sh:predicate qudt:unit ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:class qudt:TimeUnit      # i.e. allowed values are e.g. qudt-unit:SecondTime etc.
    ] .
Accordingly, the range selection on a component with complex data type defines two complex values for minimum and maximum:
ex:TimeDimensionSelection a adf-dc:RangeSelection, adf-dc:DimensionSelection ;
    adf-dc:selectionOn ex:TimeDimension ;
    # from 0.5 seconds
    adf-dc:minimum [
        a qudt:QuantityValue ;
        qudt:numericValue "0.5"^^xsd:double ;
        qudt:unit qudt-unit:SecondTime
    ] ;
    # up to half an hour
    adf-dc:maximum [
        a qudt:QuantityValue ;
        qudt:numericValue "30"^^xsd:double ;
        qudt:unit qudt-unit:MinuteTime
    ] .
For quantity values of the same quantity kind (as defined by [[QUDT]]) value comparisons will be directly implemented by the ADF Data Cube API. For complex data types, an order function MUST be specified, which is then referred to during comparison.
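A range test with complex bounds in different units might look as follows in Python. This is a sketch under stated assumptions: quantity values are modelled as `(numeric value, unit)` tuples, the unit identifiers are plain strings, the factor table `TO_SECONDS` is a hypothetical stand-in for QUDT, and the bounds are treated as inclusive.

```python
# Hypothetical conversion factors; a real implementation would use QUDT.
TO_SECONDS = {"qudt-unit:SecondTime": 1.0, "qudt-unit:MinuteTime": 60.0}


def seconds(quantity_value):
    """Convert a (numeric value, unit) pair to seconds."""
    numeric_value, unit = quantity_value
    return numeric_value * TO_SECONDS[unit]


def in_range(value, minimum, maximum):
    """Compare in the normalized unit so that mixed units are tolerated."""
    return seconds(minimum) <= seconds(value) <= seconds(maximum)


minimum = (0.5, "qudt-unit:SecondTime")    # from 0.5 seconds
maximum = (30.0, "qudt-unit:MinuteTime")   # up to half an hour

inside = in_range((10.0, "qudt-unit:MinuteTime"), minimum, maximum)
too_early = in_range((0.25, "qudt-unit:SecondTime"), minimum, maximum)
too_late = in_range((1801.0, "qudt-unit:SecondTime"), minimum, maximum)
```

Comparing the raw numeric values would wrongly reject a value of 1000 seconds against the upper bound of 30; normalizing first avoids this.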
While a selection structure definition MUST provide component selections for all dimension components of the related data structure definition,
component selections on measure components MUST be defined on at least one measure component. Measure components that are not selected are
not part of the resulting data selection. The measure selection MUST also be one of point, range, or unbounded selection. The datacube shape
library [[AFS-DC]] defines
adf-dc:UnboundedSelection as the default, but the RDF data description SHOULD state this
Measure selections also allow a property path to be defined if the data type is complex and defined by a shape. This is another way of building
a projection on the data. The next example shows how a projection on the complex datatype
ex:DurationType on a time
measure can be defined.
ex:TimeMeasure a qb:ComponentSpecification, adf-dc:Measure, adf-dc:IntervalScale ;
    qb:measure «af-x:total retention time» ;
    qb:order 1 ;
    adf-dc:componentDataType ex:DurationType .

# selection on measure, no filtering on data
ex:TimeMeasureSelection a adf-dc:UnboundedSelection, adf-dc:MeasureSelection ;
    adf-dc:selectionOn ex:TimeMeasure ;
    # projection on the numeric value part of the complex data type DurationType
    adf-dc:propertyPath "qudt:numericValue" .
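The effect of such a property path can be illustrated with a short Python sketch. The function `project` is a hypothetical helper, complex values are modelled as dictionaries keyed by property name, and the sample values are made up for illustration.

```python
def project(value: dict, property_path: str):
    """Follow a '/'-separated property path into a complex value."""
    for step in property_path.split("/"):
        value = value[step]
    return value


retention_times = [
    {"qudt:numericValue": 0.5, "qudt:unit": "qudt-unit:SecondTime"},
    {"qudt:numericValue": 30.0, "qudt:unit": "qudt-unit:MinuteTime"},
]

# Unbounded selection (no filtering) combined with a projection on the numeric value.
numeric_only = [project(value, "qudt:numericValue") for value in retention_times]
```

All observations are kept, but only the numeric part of each complex value is returned. Note that the unit is dropped by this projection, so it is most useful when all values share one unit.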