Cloud-based Data Refinery via the IBM DataWorks Service in Bluemix

Here is another quick introduction to one of the cool services in IBM Bluemix – the DataWorks service that IBM announced at the end of last year.

Before large amounts of data from various sources can be analyzed, the data typically has to be cleansed and prepared. This includes activities like joining data from multiple sources, filtering out unnecessary parts, sorting and classifying data, and so forth. The actual processing and analysis is then simpler with tools like dashDB, Hadoop or Watson Analytics. The DataWorks service in Bluemix provides functionality for this preparation step.
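To make the preparation steps concrete, here is a minimal sketch in plain Python of joining, filtering, sorting and classifying records. It does not use DataWorks at all; the sample records, the `prepare` function and the `large_threshold` parameter are made up for illustration.

```python
# Hypothetical sample records; in practice these would come from two separate sources.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0, "status": "shipped"},
    {"order_id": 11, "customer_id": 2, "amount": 40.0, "status": "cancelled"},
    {"order_id": 12, "customer_id": 2, "amount": 990.0, "status": "shipped"},
    {"order_id": 13, "customer_id": 1, "amount": 15.0, "status": "shipped"},
]


def prepare(customers, orders, large_threshold=100.0):
    """Join orders with customers, drop cancelled orders, sort and classify."""
    # Join: look up the customer name for each order.
    names = {c["customer_id"]: c["name"] for c in customers}
    joined = [
        dict(o, name=names[o["customer_id"]])
        for o in orders
        if o["customer_id"] in names
    ]
    # Filter: remove the parts that are not needed for analysis.
    kept = [r for r in joined if r["status"] != "cancelled"]
    # Sort: largest orders first.
    kept.sort(key=lambda r: r["amount"], reverse=True)
    # Classify: tag each order by size (the threshold is an arbitrary assumption).
    for r in kept:
        r["size"] = "large" if r["amount"] >= large_threshold else "small"
    return kept


prepared = prepare(customers, orders)
for row in prepared:
    print(row["order_id"], row["name"], row["amount"], row["size"])
```

A data refinery service does the same kind of work, only against real sources and at a much larger scale.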

The service comes with a graphical tool, Forge (beta), to load data from various sources. These can be cloud sources like Salesforce or Amazon Redshift, or databases that run on-premises like DB2 and Oracle. To reach on-premises sources from the cloud, the DataWorks service comes with a secure gateway that you can also use separately on Bluemix. The loaded data can then be shaped easily with Forge, e.g. sorted or filtered.

After this step you create activities that can be triggered on a schedule to move the shaped data into targets like Cloudant, dashDB, Watson Analytics or SQL databases. These targets provide built-in analytics functionality, can be queried by application developers for the data they are interested in, or both.

The same functionality that is available in Forge (and more) can also be invoked by application developers via APIs. There is a Data Load REST API to load data (sample) and a Data Profiling REST API that analyzes your cloud-based data source so you can understand the structure and content of the data (sample). Additionally, there is an Address Cleansing REST API to verify US addresses (sample).
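To give a feel for what "understanding the structure and content of the data" means, here is a small local sketch of data profiling in plain Python. It is not the Data Profiling REST API and does not call any DataWorks endpoint; the `profile` function, its output fields and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def profile(rows):
    """Return simple per-column statistics for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report


# Hypothetical sample rows with a few gaps in them.
rows = [
    {"city": "Austin", "zip": "78701", "population": 964000},
    {"city": "Boston", "zip": "", "population": 675000},
    {"city": "Austin", "zip": "78702", "population": None},
]

report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

Statistics like these show quickly which columns have gaps or mixed types and need cleansing before the data is loaded anywhere.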

Check out this video to see some of this in action.