Volume 28, Issue 4 pp. 1016-1040
Special Issue Paper

A semantic-aware data generator for ETL workflows

Naiqiao Du

Corresponding Author

Department of Computer Science and Technology, Tsinghua University, Beijing, China

Correspondence to: Naiqiao Du, Department of Computer Science and Technology, Tsinghua University, Beijing, China.

Xiaojun Ye

School of Software, Tsinghua University, Beijing, China

Jianmin Wang

School of Software, Tsinghua University, Beijing, China

First published: 22 April 2013

Summary

Extract, transform, and load (ETL) processes organized as workflows play an important role in future data integration for cloud services. ETL designers and administrators need test data sets that reflect the semantics of ETL workflow workloads in order to evaluate the ETL systems they develop, but populating a test ETL system with meaningful workload data is difficult. In this paper, we propose a semantic-aware data generator for ETL workflows. Given ETL workflow models and workload characterizations, the generator produces synthetic data that capture the semantics of the ETL activities. It does so in three stages. First, we derive the expected cardinalities of all source, intermediate, and target data sets involved in the ETL workflow model from user-specified cardinality requirements. Second, following the concept of symbolic testing, we generate symbolic rather than concrete data for the ETL activities and translate the semantics of the ETL workflow model into constraints over these symbols. Finally, we derive concrete data by solving these constraints. The generator may facilitate ETL workload test case generation for performance and functional evaluation of ETL toolkits, as well as benchmarking of ETL workflow solutions. Copyright © 2013 John Wiley & Sons, Ltd.
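
The three stages described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the single-filter workflow, the `amount` column, the selectivity parameter, the interval-based `Constraint` type, and all function names are hypothetical and chosen only for illustration, and a simple random pick within an interval stands in for a real constraint solver.

```python
import random
from dataclasses import dataclass


@dataclass
class Constraint:
    """Hypothetical constraint: symbol must lie in [low, high] (inclusive)."""
    symbol: str
    low: int
    high: int


def derive_cardinalities(source_rows, filter_selectivity):
    # Stage 1: derive the expected size of each data set from the
    # user-specified source size and filter selectivity (assumed inputs).
    target_rows = round(source_rows * filter_selectivity)
    return {"source": source_rows, "target": target_rows}


def generate_symbolic_rows(cards, threshold, domain=(0, 1000)):
    # Stage 2: create one symbol per source row and express the filter
    # semantics (amount > threshold) as constraints over those symbols.
    constraints = []
    for i in range(cards["source"]):
        symbol = f"amount_{i}"
        if i < cards["target"]:
            # This row must pass the filter.
            constraints.append(Constraint(symbol, threshold + 1, domain[1]))
        else:
            # This row must be rejected by the filter.
            constraints.append(Constraint(symbol, domain[0], threshold))
    return constraints


def solve(constraints, seed=0):
    # Stage 3: turn symbols into concrete values that satisfy the constraints.
    rng = random.Random(seed)
    return {c.symbol: rng.randint(c.low, c.high) for c in constraints}


def run_filter(values, threshold):
    # The ETL activity under test: keep rows with amount > threshold.
    return [v for v in values if v > threshold]


cards = derive_cardinalities(source_rows=10, filter_selectivity=0.4)
constraints = generate_symbolic_rows(cards, threshold=100)
data = solve(constraints)
passed = run_filter(list(data.values()), threshold=100)
```

Under these assumptions, running the filter over the generated values yields exactly the target cardinality derived in the first stage, which is the property a semantic-aware generator aims to guarantee.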
