Copyright © 2014-2017 Allotrope Foundation, All Rights Reserved. Confidential Draft.
The Allotrope Data Format (ADF) [ADF] consists of several APIs and taxonomies. This document constitutes the specification of the ADF Check Sum Computation API for creating hash codes on an ADF file, or parts thereof.
THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current Allotrope publications and the latest revision of this technical report can be found in the Allotrope technical reports index at http://purl.allotrope.org/TR/.
This document is part of a set of specifications on the Allotrope Data Format [ADF].
This document was published by the Allotrope Foundation as a First Public Working Draft. This document is intended to become an Allotrope Recommendation. If you wish to make comments regarding this document, please send them to email@example.com. All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the Allotrope Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
An underlying design principle of the ADF check sum is that a local change (e.g., in a part of a data cube) must not require re-reading the entire file to compute the check sum. In principle, ADF is designed to support a choice between multiple algorithms to compute the check sum. Currently, there is only one implementation, which depends on the internal storage representation as an HDF5 file. This document describes this algorithm and its application to the various component parts of the ADF file.
The document is structured as follows: First, the prerequisites are defined, followed by a detailed description of the hash computation on different parts of an ADF file.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [RFC2119].
Within this specification, the following namespace prefix bindings are used:
Within this document, decimal numbers will use a dot "." as the decimal mark.
In the following, we often need to represent lists of bytes. For each byte, we use two hexadecimal digits; for several bytes, we simply append those digits.
We use the terms "check sum" and "hash" synonymously.
When converting data to byte strings, different data types must be handled separately. In the rest of the document, we always specify which data type is used when an object is converted to bytes.
The integer data types are written in two’s complement with "big endian" byte order.
The table below shows an example for each type:
|Java data type||C# data type||Example value||Bytes|
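The integer rule can be illustrated with a minimal sketch. It assumes the Java-style widths byte, short, int, and long (1, 2, 4, and 8 bytes); the helper name `encode_integer` and the format table are hypothetical and not part of the ADF API.

```python
import struct

# Big-endian (">") two's complement formats; the mapping is an assumption
# made for this sketch, based on the Java widths of these types.
INT_FORMATS = {"byte": ">b", "short": ">h", "int": ">i", "long": ">q"}


def encode_integer(value: int, type_name: str) -> bytes:
    """Return the big-endian two's complement bytes of value."""
    return struct.pack(INT_FORMATS[type_name], value)


print(encode_integer(1, "int").hex())     # 00000001
print(encode_integer(-2, "short").hex())  # fffe
```

Negative values show the two's complement form: -2 as a short becomes fffe.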
The floating point data types (float and double) are written according to IEEE 754 with “big endian” byte order.
|Java data type||C# data type||Example value||Bytes|
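As a hedged illustration of the floating point rule, the following sketch packs values as IEEE 754 single and double precision in big-endian order. The function names are hypothetical.

```python
import struct


def encode_float(value: float) -> bytes:
    """IEEE 754 single precision (4 bytes), big-endian."""
    return struct.pack(">f", value)


def encode_double(value: float) -> bytes:
    """IEEE 754 double precision (8 bytes), big-endian."""
    return struct.pack(">d", value)


print(encode_float(1.0).hex())   # 3f800000
print(encode_double(1.0).hex())  # 3ff0000000000000
```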
We encode strings by first writing the number of characters of that string as int (see above), followed by the string itself encoded as UTF-8.
Example: "Hällo World!" (umlaut ä instead of e is intended for the example) is encoded to the bytes 0000000c48c3a46c6c6f20576f726c6421. The first four bytes, 0000000c, encode the number of characters (12), and the next 13 bytes are the UTF-8 encoding of the string. Note that the umlaut ä is encoded to 2 bytes (c3a4) after the H (48), which leads to one byte more than the length of the string.
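The example above can be reproduced with a short sketch. It assumes that "number of characters" can be taken as Python's `len()` of the string, which agrees with the example; for characters outside the Basic Multilingual Plane, a Java or C# implementation may count differently. The name `encode_string` is hypothetical.

```python
import struct


def encode_string(text: str) -> bytes:
    """Length prefix (4-byte big-endian int) followed by the UTF-8 bytes."""
    # Assumption: the prefix counts characters, not encoded bytes.
    return struct.pack(">i", len(text)) + text.encode("utf-8")


encoded = encode_string("H\u00e4llo World!")
print(encoded.hex())  # 0000000c48c3a46c6c6f20576f726c6421
```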
ADF uses existing message digest algorithms to convert a stream of bytes to a check sum.
The currently supported algorithms are MD2, MD5, SHA-1, SHA-256, SHA-384, and SHA-512. The checksums in ADF (in general and for the DataPackage) are by default intended to ensure the integrity of the ADF file against accidental storage or transfer errors. To do this while providing the best performance, the default algorithm is set to MD5. This algorithm is no longer considered to be cryptographically secure, but for the intended purpose it is one of the best candidates. However, users can choose a configuration with one of the other algorithms. The checksum service can be configured per ADF file when activating hashing for the file. For the DataPackage, the algorithm can be set individually for each DataPackage file that is created.
When we state that data is added to a message digest, we mean that the data is converted to bytes (by the rules above), and then the resulting bytes are sequentially added to the digest algorithm. This must yield the same result as constructing an array of bytes that is filled by the outcomes of the conversions and then running the message digest algorithm on it.
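The equivalence stated above can be checked with a small sketch that uses Python's `hashlib`. SHA-256 is used here only for illustration, and the function names and sample values are hypothetical.

```python
import hashlib
import struct

# Three already-converted values: an int, a double, and a length-prefixed string.
parts = [struct.pack(">i", 42), struct.pack(">d", 1.5), b"\x00\x00\x00\x02ok"]


def digest_incremental(chunks, algorithm="sha256"):
    """Add each converted value to the digest one after the other."""
    md = hashlib.new(algorithm)
    for chunk in chunks:
        md.update(chunk)
    return md.digest()


def digest_one_shot(chunks, algorithm="sha256"):
    """Build one byte array first, then run the digest algorithm on it."""
    return hashlib.new(algorithm, b"".join(chunks)).digest()


print(digest_incremental(parts) == digest_one_shot(parts))  # True
```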
We use a hierarchical approach to combine the hash values of several parts of the ADF file.
In ADF, we use this mechanism at several levels to minimize computation costs.
For the sake of brevity, we use a fictitious digest algorithm, SHA-32, for examples, where we just take the first 4 bytes of the SHA-256 algorithm’s result.
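A minimal sketch of the fictitious SHA-32 digest, assuming nothing beyond the definition above (the function name `sha32` is hypothetical):

```python
import hashlib


def sha32(data: bytes) -> bytes:
    """Fictitious example digest: the first 4 bytes of the SHA-256 result."""
    return hashlib.sha256(data).digest()[:4]


print(sha32(b"").hex())  # e3b0c442
```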
In an ADF file's meta model, the file itself is represented by a resource ?F, which is of type hdf:File.
If hashing is enabled, the meta-data model will contain the statement
?F adf-audit:hasDigestMethod [ a adf-audit:DigestMethod; adf-audit:hasCanonicalizationAlgorithm adf-audit:c14n-adf-hdf-1.0 ; adf-audit:hasDigestAlgorithm lc-hash:sha256 ]
where adf-audit:c14n-adf-hdf-1.0 refers to the chosen algorithm and lc-hash:sha256 to the message digest algorithm used to calculate a digest from a byte stream.
Currently, there are two implementations of an ADF hash algorithm, of which one is only available for read-only access to older files to support backward compatibility. The only currently available choice when initializing check sums for a file is presented below. The configuration will allow other implementations to be used in the future.
Valid options for the message digest algorithm are:
The file's hash value is stored in the HDF root group's attribute ADF_CHECKSUM, encoded as a hexadecimal string.
adf-audit:c14n-adf-hdf-2.0. This is the current default implementation for the check sums.
The algorithm is specific to the usage of HDF as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.
The computation of the hash value is based on the structure of the ADF file's HDF representation. We define the computation of hashes for HDF datasets and HDF groups. The overall hash value of the file is defined as the HDF root group's hash value.
For each HDF group or dataset, the computed check sum is stored in its ADF_CHECKSUM attribute. This makes it possible to recompute the complete check sum when only a part of the file has been changed, without re-reading the unchanged parts. During verification, it also makes it possible to detect which part of the file contains corrupted data.
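The benefit of storing a check sum per group and dataset can be sketched with a small in-memory tree. This is an illustration under stated assumptions, not the ADF algorithm: the `Node` class, the use of SHA-256, and the rule "node data followed by the children's check sums in name order" are all hypothetical, and `checksum` merely plays the role of the stored ADF_CHECKSUM value.

```python
import hashlib


class Node:
    """Hypothetical stand-in for an HDF group or dataset."""

    def __init__(self, name, data=b"", children=()):
        self.name = name
        self.data = data
        self.children = {child.name: child for child in children}
        self.checksum = None  # cached value, analogous to ADF_CHECKSUM


def recompute(node, path=()):
    """Recompute check sums along path only; reuse cached values elsewhere."""
    if path:
        recompute(node.children[path[0]], path[1:])
    md = hashlib.sha256()
    md.update(node.data)
    for name in sorted(node.children):
        child = node.children[name]
        if child.checksum is None:  # never hashed before: hash it fully
            recompute(child)
        md.update(child.checksum)
    node.checksum = md.digest()
    return node.checksum


root = Node("/", children=[Node("a", b"big cube"), Node("b", b"small")])
recompute(root)
root.children["b"].data = b"changed"
recompute(root, path=("b",))  # "a" is not re-read; its cached value is reused
```

Comparing the stored and recomputed values node by node during verification would likewise narrow a mismatch down to the affected subtree.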
The hash value of an HDF group is based on its attributes and its child elements (other groups and datasets). The following values are added to the message digest algorithm:
H5T.INTEGER whose size is smaller than 4, or whose size is 4 and whose sign is not 0) are encoded as integers.
H5T.INTEGER which do not fulfill the conditions of the previous item and whose size is smaller than 8, or whose size is 8 and whose sign is not 0) are encoded as longs.
H5T.STRING) are encoded as UTF-8 strings.
/check-sums and any sub-group of it will not be included in the computation of the hash value.
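The two integer rules above can be read as a decision function. The sketch below assumes that a sign value of 0 denotes an unsigned type, following the HDF5 convention for sign codes; the function name is hypothetical, and attributes that match neither rule are reported as `None` because the text above does not say how they are handled.

```python
from typing import Optional


def integer_attribute_encoding(size: int, sign: int) -> Optional[str]:
    """Pick the encoding for an H5T.INTEGER attribute of the given byte size."""
    if size < 4 or (size == 4 and sign != 0):
        return "int"   # fits into a signed 4-byte integer
    if size < 8 or (size == 8 and sign != 0):
        return "long"  # fits into a signed 8-byte integer
    return None        # not covered by the rules quoted above


print(integer_attribute_encoding(4, 0))  # long
```

Under this reading, an unsigned 4-byte attribute is encoded as a long because its full value range does not fit into a signed 4-byte integer.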
For every data set, a check sum data set with the same number of dimensions is generated.
It has the same path and name as the original dataset, but is located under
the HDF group
For every dimension i, a number hashblock_i is chosen that describes the number of elements that are grouped together in that direction to compute a single hash value.
We denote the size of the data set in dimension i by size_i.
In the hash data set, we store a check sum for each such group (block) of elements. Each dimension of the hash data set is of size hashsize_i = ceil(size_i / hashblock_i), that is, size_i / hashblock_i rounded up.
How hashblock_i is chosen is up to the implementation.
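The size computation can be sketched as follows. The function names are hypothetical, and `block_of` additionally assumes 0-based element coordinates and blocks that are aligned to the start of each dimension.

```python
def hash_shape(sizes, hash_blocks):
    """Per-dimension size of the hash data set: size / hashblock, rounded up."""
    if len(sizes) != len(hash_blocks):
        raise ValueError("one block size is required per dimension")
    # -(-a // b) is an integer-only ceiling division.
    return tuple(-(-size // block) for size, block in zip(sizes, hash_blocks))


def block_of(coords, hash_blocks):
    """Index of the block that contains the element at the given coordinates."""
    return tuple(c // b for c, b in zip(coords, hash_blocks))


print(hash_shape((2500, 1200), (1000, 500)))  # (3, 3)
print(block_of((2499, 1199), (1000, 500)))    # (2, 2)
```

With this layout, a change to a single element invalidates exactly one block check sum.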
The chosen values are stored in the HDF attribute
as a string which contains the comma-separated block sizes.
Each hash value of a block is the result of computing the message digest of the following values:
The data set used to store the check sums is of data type byte; the last dimension is multiplied by the length of the resulting digest (e.g. 32 for SHA-256) to store all bytes of the hash in the data set.
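A short sketch of the resulting storage shape, assuming the per-dimension hash sizes are already known. The function name is hypothetical, and the digest length is taken from Python's `hashlib`.

```python
import hashlib


def checksum_storage_shape(hash_sizes, algorithm="sha256"):
    """Shape of the byte data set that stores one digest per block."""
    digest_length = hashlib.new(algorithm).digest_size  # 32 for SHA-256
    # Only the last dimension is widened to hold all bytes of each digest.
    return tuple(hash_sizes[:-1]) + (hash_sizes[-1] * digest_length,)


print(checksum_storage_shape((3, 3)))  # (3, 96)
```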
The overall check sum for the data set is computed by adding the following data to the message digest algorithm:
Please note: This algorithm is deprecated and only documented here for reference to support older ADF files. Current implementations do not support writing files with it and future versions might drop support completely.
The algorithm is specific to the usage of HDF as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.
To compute the hash code of the file, a hash code is first computed for each separate part of the ADF file.
The overall hash for the data cubes is computed by adding
Information on the hash is stored in the meta data for the Cube ?Cube with
?Cube adf-audit:hasDigest [ adf-audit:hasDigestMethod lc-hash:sha256; adf-audit:digestValue "..."^^xsd:base64Binary ]
where "..." contains the base-64-encoded hash value of the data cube.
To compute the hash of a data cube, we first determine from the meta-data the measure datasets and scale datasets that are used to store the content of the data cube. The RDF resources representing the data sets are sorted alphabetically by their URI. The hash value of each data set is then added to the message digest in this order. How the hash value of a data set is computed is described below.
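The sorting and combination step can be sketched as below. Several points are assumptions made for illustration: "alphabetically" is taken as plain code point order, each data set hash is added as raw digest bytes, and the function names and URIs are hypothetical. The last helper shows the base-64 form used for the stored digest value.

```python
import base64
import hashlib


def combine_dataset_hashes(hashes_by_uri, algorithm="sha256"):
    """Add the data set hashes to one digest, ordered by data set URI."""
    md = hashlib.new(algorithm)
    for uri in sorted(hashes_by_uri):  # assumption: code point order
        md.update(hashes_by_uri[uri])
    return md.digest()


def digest_value_literal(digest: bytes) -> str:
    """Base-64 text form of a digest, as used for xsd:base64Binary values."""
    return base64.b64encode(digest).decode("ascii")


hashes = {
    "adf://cube/scale": hashlib.sha256(b"scale data").digest(),
    "adf://cube/measure": hashlib.sha256(b"measure data").digest(),
}
print(digest_value_literal(combine_dataset_hashes(hashes)))
```

Sorting by URI makes the result independent of the order in which the data sets were discovered in the meta-data.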
If the data cube contains strings or IRIs in its measurements or scales, the cube contains a dictionary that contains a representation of the strings. If a dictionary is present for the data cube, all the data sets of the dictionary are sorted alphabetically by the URI of the RDF resources that represent them. For these datasets, additional attributes of the hash set are included in the hash value, see below.
For computing the hash value of the data package, we only consider the data sets of the files, not the directory layout and other meta-data like time of last change. The directory layout and meta-data are stored in the technical model of the ADF file, whose hash computation is described below.
Each file in the data description is backed by an HDF data set.
First, we take the RDF representation of all files (resources whose type is adf-dp:File) and sort them by their URI.
For every file, we add
Several HDF data sets are used to store the information. Their hash values are added to the message digest for the computation of the overall data description hash. The following list contains the data sets whose hash values are added, with attribute names that must be included in the hash (see below).
|Path||Attributes||Include size attribute of group|
For each data set in the list, the following data is added to the message digest:
The hash values of the data sets are not stored in the data
description itself. They are stored as a hexadecimal string
in the HDF attribute "
Like the data description, the audit store contains RDF graphs.
To compute the audit's check sum, we first add the dictionary's
dataset's check sums analogously to the data description,
only the path is relative to
Then, all data sets that are included in any audit trail are sorted alphabetically by their URI and for each data set
In the current version of ADF, the file format HDF5 is used as the underlying storage solution. Most of the information in an ADF file is stored in HDF5 data sets. This section describes how the check sum of a data set is computed.
Because data sets can contain large amounts of data, we again apply the approach of combining multiple hashes into one overall hash.
A data set has n dimensions and a data type. Each entry of the data set belongs to the data type. In ADF, the data types byte, short, int, long, float and double can be used.
When we use coordinates of elements of a data set in the following, we start with 0 (not 1).
adf://some_dataset. A resource representing the hash data set looks like the following:
<adf://some_arbitrary_uri> a adf-hash:HashDataSet;
  adf-audit:hasDigest [
    a adf-audit:Digest;
    adf-hash:checksumDatasetOf <adf://some_dataset>;
    audit:hasDigestMethod [
      a audit:DigestMethod;
      adf-hash:hashDimension [
        a adf-hash:HashDimensionDescription;
        hdf:order 1;
        adf-hash:hashBlockSize 1000;
      ];
      adf-hash:hashDimension [
        a adf-hash:HashDimensionDescription;
        hdf:order 2;
        adf-hash:hashBlockSize 500;
      ];
      adf-audit:hasDigestAlgorithm lc-hash:sha-256;
    ]
  ]