
Time Series Query JSON Representation

The JSON representation of a Time Series Query, as part of the Time Series Query API.

Introduction

When representing a Time Series Query in this API (as request or response object in an HTTP call) there are two levels of JSON objects to work with: the query entity specification, and the query specification itself.

Most of the API endpoints work with the entity specification. Only the POST /data/query REST endpoint, which executes a query without persisting it, expects a bare query specification in its request body.
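
For example, executing an ad-hoc (non-persisted) query could look like this (a sketch, assuming the apiKey:apiSecret basic authentication used in the examples below):

curl -u apiKey:apiSecret \
  -H "Content-Type: application/json" \
  -d '{ "window": "PT1H", "data": [ { "resource": "151CF", "metric": "temperature" } ] }' \
  "https://ts-analytics.waylay.io/data/query"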

curl https://ts-analytics.waylay.io/schema/query.json | jq '.definitions.query'

The Time Series Query JSON representation is also documented in the Open API documentation as the request body of POST /data/query.

Its formal JSON Schema can be resolved at https://ts-analytics.waylay.io/schema/query.json#/definitions/query

Overview

{
  "resource": "151DF",
  "window" : "PT48H",
  "freq": "PT5M",
  "data": [
    {
      "metric": "temperature",
      "aggregation": "mean"
    },
    {
      "resource": "151CF",
      "metric": "temperature",
      "aggregation": "max",
      "interpolation" : {
          "method" : "pad"
      }
    }
  ]
}

A Query JSON object has the following top-level properties, each optional:

- "data": a list of series specifications, each selecting a single time series with its own "resource", "metric", "aggregation" and "interpolation" settings.
- "resource", "metric", "aggregation", "interpolation": top-level defaults for the corresponding series-level properties.
- "from", "until", "window", "freq", "periods": the time line specification (see Time line properties below).

When one of "resource", "metric", "aggregation" or "interpolation" is not present at series level, the corresponding top-level definition is used as default: in the overview example, the first series inherits the top-level "resource": "151DF", while the second overrides it with "151CF". Note that these top-level defaults can also be overridden during execution of the query.

If the previous example were stored as demo_query, the following call would override the top-level resource and window parameters (but not the "resource": "151CF" specification in the second series):

curl -u apiKey:apiSecret "https://ts-analytics.waylay.io/data/query/demo_query?resource=151D8&window=P3D"
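
With these overrides, the executed query effectively resolves to the following specification (a sketch: only the top-level resource and window are replaced):

{
  "resource": "151D8",
  "window": "P3D",
  "freq": "PT5M",
  "data": [
    { "metric": "temperature", "aggregation": "mean" },
    { "resource": "151CF", "metric": "temperature", "aggregation": "max",
      "interpolation": { "method": "pad" } }
  ]
}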

Source properties

The "resource" and "metric" properties correspond with the resource id and metric name with which a time series was ingested into the Waylay platform. These properties can occur as top-level default, or as specification for each individual series in the "data" list.

| attribute | type/format | example | description |
| --- | --- | --- | --- |
| resource | string | "M001" | The id of the Waylay resource under which the time series was ingested. |
| metric | string | "speed" | The metric name, corresponding to keys in the ingestion messages. |

Time line properties

These properties establish the time index for which data is retrieved. Both its size and sampling frequency (for aggregation) can be specified. These properties can only be specified at the top level.

Note: Strictly speaking, "freq" is a misnomer for what is specified to be a sampling interval (with dimension time), and not a frequency (with dimension one over time). This was inherited from the pandas library, which is used internally and in the Python SDK to represent time series data sets.

| attribute | type/format | example | description |
| --- | --- | --- | --- |
| from | string (ISO8601 timestamp) | "2018-01-23T13:06:11+00" | Start of range (inclusive), as absolute timestamp. |
| | unix epoch milliseconds | 1526814524739 | |
| | string (ISO8601 duration) | "-P3W" | Start of range (inclusive), relative to the current time. |
| until | string (ISO8601 timestamp) | "2018-01-23T13:06:11+00" | End of range (exclusive), as absolute timestamp. |
| | unix epoch milliseconds | 1526814524739 | |
| | string (ISO8601 duration) | "-P1D" | End of range (exclusive), relative to the current time. |
| freq | string (ISO8601 duration) | "PT1H" | Sampling interval used to regularize/group the data. |
| periods | integer | 24 | Size of the data range, in freq units. |
| window | string (ISO8601 duration) | "PT1H" | Absolute size of the data range. |

Examples

{ "from" : "2020-05-05T12:00:00Z", "until" : "2020-05-06T12:00:00Z"}

{ "window": "P2D"}

{ "from" : "2020-05-05T12:00:00Z", "periods": 48, "freq": "PT1H"}

{ "window" : "P14D", "until" : "2020-05-05T12:00:00Z", "periods": 200}

{ "periods" : 800, "until" : "-PT6H", "freq": "PT5M" }

Allowed time line specifications

The five time line keywords specify a sample interval that essentially has two degrees of freedom, with the sampling rate as a third degree of freedom. This flexibility comes with some common-sense rules to avoid overspecification: combinations that fix the same degree of freedom more than once are not allowed, as sketched below.
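
For example, the following specification would be overspecified: "from" and "until" already fix both boundaries, so adding "window" fixes the interval size a second time (a hypothetical illustration, assuming such a combination is rejected):

{ "from": "2020-05-05T12:00:00Z", "until": "2020-05-06T12:00:00Z", "window": "P2D" }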

Not specifying a time window hence defaults to an interval of one week, ending now, with no sampling frequency specified:

{ 
  "until" : "now",
  "window" : "P1W"
}

If a range is still underspecified, the defaults shown above are used: "until" defaults to "now", and the range size defaults to one week ("window": "P1W").

Effective time line

These are examples from a data set with a 15-minute interval. First we show an unaggregated query:

POST /data/query?render.iso_timestamp=true
{ 
  "window": "PT1H", "until": "2020-09-08T12:10:00",
  "data": [ { "resource": "151CF", "metric": "temperature"} ]
}
{ 
  ...
  "columns": ["timestamp", "timestamp_iso", {"resource": "151CF", "metric": "temperature"}],
  "data": [
    [1599564229000, "2020-09-08T11:23:49+00:00", 21],
    [1599565129000, "2020-09-08T11:38:49+00:00", 22],
    [1599566029000, "2020-09-08T11:53:49+00:00", 21],
    [1599566930000, "2020-09-08T12:08:50+00:00", 22]
  ],
  "window_spec": {
    "from": 1599564229000,
    "until": 1599567830000,
    "window": "PT1H",
    "freq": "PT15M"
  }
  ... 
}

When a sampling interval is specified (directly as "freq", or indirectly as "periods" within a fixed-size window), the input specification of a time line might differ from the effective time line used to execute the query.

To guarantee consistent aggregation results for normal usage, the effective window boundaries are aligned to the sampling interval, so that only complete sampling intervals are reported.

The example below shows a count aggregation at a 1 hour sampling interval for a 3 hour window. It reports 2020-09-08T11:00:00 as the last timestamp, counting the data points in the interval [2020-09-08T11:00:00, 2020-09-08T12:00:00[, the last 'full' interval before the specified until date of 2020-09-08T12:10:00.

POST /data/query?render.iso_timestamp=true
{ 
  "freq": "PT1H", "aggregation": "count",
  "window": "PT3H", "until": "2020-09-08T12:10:00",
  "data": [ { "resource": "151CF", "metric": "temperature"} ]
}
{
  ...
  "columns": ["timestamp", "timestamp_iso", {"resource": "151CF", "metric": "temperature"}],
  "data": [
    [1599555600000, "2020-09-08T09:00:00+00:00", 4],
    [1599559200000, "2020-09-08T10:00:00+00:00", 4],
    [1599562800000, "2020-09-08T11:00:00+00:00", 4]
  ],
  "window_spec": {
    "from": 1599555600000,
    "until": 1599566400000,
    "window": "PT3H",
    "freq": "PT1H"
  }
  ...
}
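
The effective window in this response can be derived as follows (an illustrative reading of the returned "window_spec"):

# requested: "until": 2020-09-08T12:10:00, "window": PT3H, "freq": PT1H
# until is aligned down to a sampling boundary: 2020-09-08T12:00:00 (1599566400000)
# from = until - window:                        2020-09-08T09:00:00 (1599555600000)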

As a result of these alignment rules, the effective window can differ from the one requested. When executing a Time Series Query, the effective window is returned in the "window_spec" property of the data set. See the Query Execution API for details.

Aggregation

If the time line specification implies a sampling interval (e.g. by specifying "freq"), "aggregation" defines how data is summarised at this frequency. It can be given at series level, or as a top-level property that acts as default.

Example aggregation values:

    "mean" "median" "min" "max" "last" "sum" "count" "std"

The "aggregation" parameter specifies how the input data in one sample interval (from t up to, but excluding t + freq) is converted to a single value at timestamp t.

Consult the aggregation value references or the built-in documentation of the Query Designer for available values and feature details.

| attribute | type/format | example | description |
| --- | --- | --- | --- |
| aggregation | string (see API spec for enumeration) | max, percentile(0.20), ... | Method to aggregate the values that fall into buckets of size freq. |

Interpolation

The "interpolation" property defines whether, and how to treat missing values. This can occur in three circumstances:

When specified, the value of "interpolation" is an object with a "method" property and, depending on the method, additional "value", "order" or "limit" properties.
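
For example, a series specification that linearly interpolates up to 5 consecutive missing values could look like this (a sketch combining values from the table below):

{
  "resource": "151CF",
  "metric": "temperature",
  "aggregation": "mean",
  "interpolation": { "method": "linear", "limit": 5 }
}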

For the most common interpolation methods, all available values and feature details, consult the interpolation references or the built-in documentation of the Query Designer.

| attribute | type/format | example | description |
| --- | --- | --- | --- |
| method | string (see API spec for enumeration) | linear, last, ... | Method to interpolate missing data after resampling up to the frequency. |
| value | number | 999.03 | Fixed value used for interpolation (when "method" is "fixed"). |
| order | number | 3 | Order of interpolation for the polynomial and spline interpolation methods. |
| limit | number | 5 | Maximal number of consecutive values that can be interpolated. |

Further Discussion

Known limitations

The following are known limitations of the v0.19.0 Time Series Query feature with respect to query specification:

duplicate input specifications

When multiple series have the same resource, metric and aggregation, duplicate series specifications are dropped from the query result.
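
For example, in the following query the second, identical series specification would be dropped, and the result would contain a single temperature column (illustrative):

{
  "freq": "PT1H",
  "data": [
    { "resource": "151CF", "metric": "temperature", "aggregation": "mean" },
    { "resource": "151CF", "metric": "temperature", "aggregation": "mean" }
  ]
}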

multiple interpolations for the same input specs

As of version v0.19.0, multiple series specifications cannot differ only in their interpolation. A duplicate specification with the same resource, metric and aggregation but a different interpolation will be refused when the query is executed.

Timestamping data and half-open intervals

Half-open intervals labeled with the start timestamp are intuitive for date-related ranges. For example

{ "from": "2018-01-02", "until": "2018-01-05", "freq": "PT1H" }

will result in a range of three days, and includes 3 × 24 = 72 data points with timestamps:

[ 
    "2018-01-02T00:00+00",
    "2018-01-02T01:00+00",
    ... 
    "2018-01-04T22:00+00",
    "2018-01-04T23:00+00"
] 

All time intervals in this API are specified as half-open intervals, including the from and excluding the until boundary.

This corresponds to a convention to label subsequent intervals with the start timestamp of the interval, which is natural given our time conventions for years, days, hours …

Depending on the notation conventions near you, this is written mathematically as [t, t + freq) or [t, t + freq[.

This also holds for sampling/aggregation grids: the input range for each aggregated data point is a half-open interval: a data point labeled with timestamp t will be the aggregation of all data in the range between t up to and NOT including t + freq. Hence an aggregated measurement with timestamp t = "2018-01-04T23:00+00" will only take into account values with timestamps up till "2018-01-04T23:59:59.999999+00".

Imagine an IoT event stream that every ten minutes (at exactly rounded timestamps 00:00, 00:10, ...) reports an incremental counter of the events that happened up till then. The query window

{ 
  "from": "2018-01-02", "until": "2018-01-05", 
  "freq": "PT1H" , "aggregation": "max" 
}

will result in the same hourly labels, but each label effectively reports on the data captured in the hour starting ten minutes before it:

{
    "2018-01-02T00:00+00" : "max count for data captured after 2018-01-01T23:50 up till 2018-01-02T00:50 ",
    "2018-01-02T01:00+00" : "max count for data captured after 2018-01-02T00:50 up till 2018-01-01T01:50 ",
    ... 
    "2018-01-04T22:00+00" : "max count for data captured after 2018-01-04T21:50 up till 2018-01-04T22:50 ",
    "2018-01-04T23:00+00" : "max count for data captured after 2018-01-04T22:50 up till 2018-01-04T23:50 "
}

This worst case only happens when IoT data is ingested with timestamps at exactly the sampling boundaries, labeling the preceding period.

In the context of IoT measurement data (rather than of date-aggregated data), this might lead to unexpected results.

The Time Series Query convention to always label aggregated buckets with the first timestamp might conflict with a data processing pipeline that labels aggregated data with the emit timestamp, which is normally after the end of the captured time interval.

In applications where this is counter-intuitive, the application could show t + freq, or the complete interval, as the timestamp label to its users, rather than t. But in its interaction with this REST API, it should use the first-timestamp convention.

When using time series that were already aggregated before being ingested into Waylay, one should be aware of what the timestamps mean: do they label the start, the end, or even the middle of an aggregation window? Watch out for data that originally did not carry event timestamps and was timestamped by the Waylay broker with ingestion timestamps.