The Waylay ETL-service
1. ETL-export service
Waylay supports exporting time series data and metadata to your own Amazon S3 bucket or Google Cloud Storage bucket by means of an ETL-export service.
Two types of data are exported:
- TS: time series data
- MD: metadata
The time series data can be exported as:
- CSV files, with each row in the format "resource","metric","timestamp","value"
- AVRO files, with the data grouped per "resource" and "metric"
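As a sketch of what consuming such a CSV export could look like, the snippet below parses rows in the "resource","metric","timestamp","value" format. The sample data and field values are illustrative, not taken from an actual export:

```python
import csv
import io

# Hypothetical sample rows in the "resource","metric","timestamp","value" format
sample = '''"thermostat-1","temperature","2020-01-01T00:00:00Z","21.5"
"thermostat-1","humidity","2020-01-01T00:00:00Z","40"
'''

rows = []
for resource, metric, timestamp, value in csv.reader(io.StringIO(sample)):
    rows.append({"resource": resource, "metric": metric,
                 "timestamp": timestamp, "value": float(value)})

print(rows[0]["metric"])  # temperature
```

In a real pipeline the same parsing would run over the daily export files instead of an in-memory sample.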
The metadata is exported as newline-delimited JSON files. The resource types, resources and the change log for resources are exported.
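Newline-delimited JSON is straightforward to consume: one JSON document per line. The sketch below shows the idea; the field names (`id`, `resourceTypeId`, `geo`) are assumptions for illustration, not the exact export schema:

```python
import json
import io

# Hypothetical newline-delimited JSON metadata export (field names are illustrative)
ndjson = "\n".join([
    json.dumps({"id": "thermostat-1", "resourceTypeId": "thermostat",
                "geo": {"lat": 51.05, "lon": 3.72}}),
    json.dumps({"id": "thermostat-2", "resourceTypeId": "thermostat"}),
])

# Parse each non-empty line into a resource record
resources = [json.loads(line) for line in io.StringIO(ndjson) if line.strip()]
print(len(resources))  # 2
```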
The export can be done on a daily, weekly or monthly basis.
2. ETL-pipeline service
In addition to the ETL-export service, Waylay also offers optional professional services to take the next steps beyond the data export. One such additional service is getting the data into a professional data pipeline. Below we explain how we do this at Waylay, using the example of open data that we process for our customer, the City of Ghent.
The Waylay Apache Beam ETL-BI pipeline
Let’s start with the export files for time series data and metadata of the city-of-ghent tenant, which are delivered daily via the Waylay ETL-export service.
With an initial run of our Apache Beam ETL-BI pipeline, we discovered the JSON metadata that is available for the resources of this tenant.
Based on this input, we configured the schema configuration file of the ETL pipeline (fig.4):
- we disabled the hashing of resource IDs (as this is open data)
- we documented metadata elements (such as the geolocation longitude and latitude elements) as explicitly mapped data elements. Without this step, the data would still be imported, but as generic key-value properties.
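A schema configuration along these lines might look as follows. This is a hypothetical sketch: the key names and structure of the actual Waylay pipeline configuration file may differ.

```json
{
  "hashResourceIds": false,
  "mappedMetadataElements": [
    { "source": "geo.lon", "column": "longitude", "type": "FLOAT" },
    { "source": "geo.lat", "column": "latitude",  "type": "FLOAT" }
  ]
}
```

Any metadata element not listed under the explicit mappings would still be imported, but as a generic key-value property.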
Apache Beam lets you create data processing pipelines using a Java or Python API.
These pipelines can run on multiple platforms: you can test locally, but they are intended to run on cloud platforms such as Google Dataflow or Apache Spark. Dataflow can massively parallelize your batch job, automatically scaling to hundreds of workers to speed up your big data pipelines.
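The heart of such a pipeline is a per-element transform, which is plain Python. Below is a sketch of the transform a Beam pipeline could apply to each line of the TS CSV export; the wiring in the comment uses standard Beam primitives (`ReadFromText`, `Map`, `WriteToBigQuery`), but the file path and field names are assumptions for illustration:

```python
import csv

# Per-element transform that a Beam pipeline would apply with beam.Map
# (or a DoFn) to each line of the TS CSV export; field names are illustrative.
def parse_ts_line(line):
    resource, metric, timestamp, value = next(csv.reader([line]))
    return {"resource": resource, "metric": metric,
            "timestamp": timestamp, "value": float(value)}

# In an Apache Beam pipeline this would be wired roughly as:
#   with beam.Pipeline(options=options) as p:
#       (p
#        | beam.io.ReadFromText("gs://my-bucket/export/ts.csv")  # hypothetical path
#        | beam.Map(parse_ts_line)
#        | beam.io.WriteToBigQuery(table, schema=table_schema))

record = parse_ts_line('"r40-segment-1","speed","2020-01-01T00:00:00Z","57.0"')
print(record["value"])  # 57.0
```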
Google Dataflow provides an overview of these jobs, and a graphical representation of each running process step, including its progress on input and output data collections. In our dataset, only the initial (multi-day) data import was large enough to trigger scaling to multiple workers, which we capped at a maximum of 10 machines.
The Dataflow engine takes care of allocating these workers from your pool of computing resources.
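With the Python SDK, such a worker cap is passed as a pipeline option when launching the job; `--max_num_workers` is a standard Dataflow option, while the project, region and bucket names below are placeholders:

```
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=europe-west1 \
  --max_num_workers=10 \
  --temp_location=gs://my-bucket/tmp
```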
As Apache Beam provides a unified model for stream and batch processing, most of our setup would remain unchanged if we chose to switch from batch files to near-real-time streaming integration with the Waylay platform.
Our ETL-BI pipeline is configured to import the data into Google BigQuery, a serverless, low-cost data warehouse.
Data exploration and visualization
Some initial data exploration can be done in the web-based SQL interface of BigQuery, but for the real data work we switched to pandas, a Python-based data analysis and visualization toolkit.
The JupyterLab interactive notebook format allows us to tell an interactive data story to our peers, showing how we:
- import a subset of the data from BigQuery into a pandas data structure
- analyse the different available resource types and metric classes
- zoom in on datasets such as the traffic measurements on the R40 inner city ring of Ghent
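The hourly averaging behind such an analysis is, in the notebook, a one-line pandas groupby; the sketch below shows the same aggregation in plain Python with illustrative data, to make the logic explicit:

```python
from collections import defaultdict
from statistics import mean

# Illustrative speed measurements: (ISO timestamp, km/h) for one R40 segment
measurements = [
    ("2020-01-01T07:10:00Z", 42.0),
    ("2020-01-01T07:40:00Z", 38.0),
    ("2020-01-01T08:05:00Z", 55.0),
]

# Group by the hour part of the timestamp and average, as a pandas
# df.groupby(df.index.hour).mean() would do.
by_hour = defaultdict(list)
for ts, speed in measurements:
    by_hour[ts[11:13]].append(speed)

hourly_avg = {hour: mean(values) for hour, values in by_hour.items()}
print(hourly_avg)  # {'07': 40.0, '08': 55.0}
```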
We illustrated the data set with:
- a plot of the average traffic speed and density at different locations in and around Ghent
- the same plot, but showing the hourly evolution averaged over a period
- the correlation between lane occupancy (OCC) on a fixed road segment, average speed, and car throughput at selected measurement points
Of course, there are many other options for Business Intelligence reporting and visualization. A more polished tool, used by many data analysts in the enterprise world, is Tableau. Here is an example of a visualization in Tableau:
Advanced Data Analytics
Once your Waylay data is in a big data infrastructure, the data exploration and reporting explained above are only the first steps to extract value from your data assets. As a data scientist, you use this data to train and test models that detect conditions or make predictions. Once validated, these models can deliver their value in the Waylay platform, e.g. as an anomaly detection service or a forecast service.
Whether you want to integrate Waylay data into a BI reporting environment, or make it available to your data scientists, we can assist you in building an ETL pipeline to transform and load time series data and metadata from your Waylay platform. This will let you build data products that can deliver their value either through integration into the Waylay platform or through other IT systems of your organization.