ETL

The Waylay ETL-service

  1. ETL-export service
  2. ETL-pipeline service

1. ETL-export service

Waylay supports exporting time series data and metadata to an Amazon S3 or Google Cloud Storage bucket that you own, by means of an ETL-export service.

Two types of data are exported:

  • TS: time series data
  • MD: metadata

Time series data can be exported as:

  • a CSV file with the data in the format "resource","metric","timestamp","value"
  • an Avro file with the data grouped per "resource" and "metric"

The metadata is exported as newline-delimited JSON files, containing the resource types, the resources and the change log for resources.

The export can be done on a daily, weekly or monthly basis.
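As an illustration, the CSV exports can be loaded directly with pandas; the file name below is a placeholder and we assume the files contain no header row.

```python
import pandas as pd

# Placeholder file name; actual export files are named per tenant and export run.
df = pd.read_csv(
    "ts_export.csv",
    names=["resource", "metric", "timestamp", "value"],  # export format: resource,metric,timestamp,value
    parse_dates=["timestamp"],
)

# e.g. the number of samples per resource and metric
print(df.groupby(["resource", "metric"]).size())
```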

2. ETL-pipeline service

In addition to the ETL-export service, Waylay also offers optional professional services for the next steps beyond the data export. One such service is getting the data into a professional data pipeline. Below we explain how we do this at Waylay, using the example of open data from our customer, the City of Ghent.

The Waylay Apache Beam ETL-BI pipeline

Let’s start with the export files for time series data and metadata of the city-of-ghent tenant, which are delivered daily via the Waylay ETL-export service.

ETL Ghent cloud storage Fig. 1

With an initial run of our Apache Beam ETL-BI pipeline, we discovered the JSON metadata that is available for the resources of this tenant.

ETL Ghent statistics Fig. 3

Based on this input, we configured the schema configuration file of the ETL pipeline (Fig. 4):

  • we disabled the hashing of resource ids (as this is open data)
  • we documented metadata elements (such as the geolocation longitude and latitude elements) as explicitly mapped data elements; without this, the data would still be imported, but as generic key-value properties (a hypothetical sketch of these choices follows Fig. 4)

ETL Ghent configuration Fig. 4
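The schema configuration file itself has a pipeline-specific format; purely as a hypothetical Python sketch, the two choices above amount to something like this:

```python
# Hypothetical sketch only: the real schema configuration file has its own format.
schema_config = {
    # open data, so resource ids are left readable instead of being hashed
    "hash_resource_ids": False,
    # metadata elements mapped to explicit columns; anything not listed here
    # would still be imported, but as generic key-value properties
    "explicit_metadata": {
        "latitude": {"source": "geoLocation.latitude", "type": "FLOAT"},
        "longitude": {"source": "geoLocation.longitude", "type": "FLOAT"},
    },
}
```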

Apache Beam lets you create data processing pipelines using a Java or Python SDK.
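As a minimal sketch with the Python SDK (not our actual pipeline code; bucket and table names are placeholders), a batch pipeline that parses the CSV export and loads it into BigQuery could look roughly like this:

```python
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # export format: "resource","metric","timestamp","value"
    resource, metric, timestamp, value = next(csv.reader([line]))
    return {"resource": resource, "metric": metric, "timestamp": timestamp, "value": float(value)}


def run():
    # runner, project, region, ... are taken from the command-line flags
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadExport" >> beam.io.ReadFromText("gs://my-bucket/ts_export_*.csv")
            | "ParseRows" >> beam.Map(parse_line)
            # assumes the destination table already exists with a matching schema
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:waylay.timeseries",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```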

ETL Ghent dataflow Fig. 5

These pipelines can run on multiple platforms: you can test locally, but they are intended to run on distributed execution engines such as Google Dataflow or Apache Spark. Dataflow can massively parallelize your batch job, automatically scaling to hundreds of workers to speed up your big data pipelines.

Google Dataflow provides an overview of these jobs and a graphical representation of each running process step, including its progress on input and output data collections. In our dataset, only an initial (multi-day) data import was large enough to trigger scaling to multiple workers, which we limited to a maximum of 10 machines.

The Dataflow engine took care of allocating these workers from the available pool of computing resources.
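The worker cap is a standard Dataflow pipeline option; with placeholder project and bucket names, the options for such a run could be set up along these lines:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region and bucket; max_num_workers caps Dataflow's
# autoscaling at 10 machines, as described above.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    max_num_workers=10,
)
```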

ETL Ghent dataflow Fig. 6

ETL Ghent dataflow Fig. 7

ETL Ghent dataflow Fig. 8

As Apache Beam provides a unified model for stream and batch processing, most of our setup would remain unchanged if we chose to switch from batch files to a near-real-time streaming integration with the Waylay platform.
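As a sketch of what that switch would mean with the Python SDK (the topic name is a placeholder), essentially only the source changes:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        # streaming source instead of the daily export files; placeholder topic name
        | beam.io.ReadFromPubSub(topic="projects/my-gcp-project/topics/waylay-timeseries")
        | beam.Map(lambda message: message.decode("utf-8"))
        # ... followed by the same parsing and BigQuery write steps as in the batch pipeline
    )
```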

Our ETL-BI pipeline is configured to import the data into Google BigQuery, a serverless, low-cost data warehouse.

ETL Ghent bigquery Fig. 9

ETL Ghent bigquery Fig. 10

Data exploration and visualization

Some initial data exploration can be done in the web-based SQL interface of BigQuery, but for the real data work we switched to pandas, a Python-based data analysis and visualization toolkit.
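As a sketch (project, dataset and column names are placeholders), pulling a subset of the data into a pandas DataFrame can be done with the BigQuery client library:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials and project

# Placeholder table and column names for illustration.
query = """
    SELECT resource, metric, timestamp, value
    FROM `my-gcp-project.waylay.timeseries`
    WHERE metric = 'averageSpeed'
      AND timestamp >= '2020-01-01'
"""

df = client.query(query).to_dataframe()
print(df.describe())
```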

The JupyterLab interactive notebook format allows us to tell a data story to our peers, showing how we:

  1. import a subset of the data from BigQuery into a pandas data structure

  2. analyse the different available resource types and metric classes

    Ghent resource type overview Fig. 11

  3. zoom in on datasets such as the traffic measurements on the R40 inner-city ring road of Ghent (a short pandas sketch follows Fig. 12).

ETL Ghent jupyterlab query Fig. 12
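As a small illustration of steps 2 and 3, this boils down to standard pandas group-by and filter operations; the rows and column names below are dummy placeholders, purely for illustration:

```python
import pandas as pd

# Dummy rows and hypothetical column names; in practice this DataFrame comes from
# the BigQuery query above, joined with the resource metadata.
df = pd.DataFrame([
    {"resource": "r40-p1", "resourceType": "trafficMeasurement", "metric": "averageSpeed"},
    {"resource": "r40-p2", "resourceType": "trafficMeasurement", "metric": "occupancy"},
    {"resource": "parking-1", "resourceType": "parkingGarage", "metric": "availableSpots"},
])

print(df.groupby("resourceType")["resource"].nunique())  # resources per resource type

r40 = df[df["resourceType"] == "trafficMeasurement"]      # zoom in on the R40 traffic data
print(r40["metric"].unique())
```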

We illustrated the data set with:

  • A plot of the average traffic speed and density at different locations in and around Ghent:

    R40 speed density map Fig. 13

  • The same plot, but showing the hourly evolution averaged over a period:

    R40 speed density map hourly evolution Fig. 14

  • The correlation between lane occupancy (OCC) on a fixed road segment, average speed, and car throughput at selected measurement points (a scatter-matrix sketch follows the figures below):

    R40 scatter matrix Fig. 15

    R40 scatter matrix Fig. 16

    R40 scatter matrix Fig. 17
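A scatter matrix like the ones above can be produced directly with pandas; the values and column names below are dummy placeholders, purely for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Dummy values and hypothetical column names, purely for illustration.
df = pd.DataFrame({
    "occupancy": [0.10, 0.25, 0.45, 0.70, 0.85],
    "averageSpeed": [95.0, 80.0, 55.0, 30.0, 15.0],
    "throughput": [400, 800, 1100, 900, 500],
})

scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()
```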

Of course, there are many other options for Business Intelligence reporting and visualization. A more polished tool, used by many data analysts in the enterprise world, is Tableau. Here is an example of a visualisation in Tableau:

ETL Ghent tableau Fig. 18

Advanced Data Analytics

Once your Waylay data is in a Big Data infrastructure, data exploration and reporting as explained above are only the first steps to extract value from your data assets. As a data scientist, you can use this data to train and test models that detect conditions or make predictions. Once validated, these models can deliver their value in the Waylay platform, e.g. as an anomaly detection service or a forecast service.
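Purely as an illustrative sketch (not a Waylay service, and with dummy data), such an anomaly detection model could be trained with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Dummy training data purely for illustration; in practice this would be a feature
# matrix built from the exported time series.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=50.0, scale=5.0, size=(1000, 1))

model = IsolationForest(contamination=0.01, random_state=42).fit(X_train)

# -1 flags an anomaly, 1 a normal observation
print(model.predict([[52.0], [120.0]]))
```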

Conclusion

Whether you want to integrate Waylay data into a BI reporting environment or make it available to your data scientists, we can assist you in building an ETL pipeline to transform and load time series data and metadata from your Waylay platform. This lets you build data products that can deliver their value either through integration into the Waylay platform or through other IT systems in your organization.