The Matilda (Mining Artificially Generated Data) project originated from the German research grant "Strategies for Robust Design of Structures", funded by the State of Thuringia. The overall goal of this line of research is to apply data mining techniques to engineering problems, specifically to civil engineering design tasks, but also to the model formation and analysis tasks that arise in diagnosis problems.


Introduction.  Data for civil engineering structures such as bridges is stored in the IFC-Bridge format, an international standard used in building information modeling and part of the IFC (Industry Foundation Classes) data model. A novel extension has recently been developed based on the NURBS (Non-Uniform Rational Basis Spline) paradigm, which is useful for modeling curved surfaces where regular geometry and lines do not suffice. These models can be simulated with the Finite Element Method under varying geometry and material properties to identify optimal designs. The Hasetal bridge in Suhl, southwest Thuringia, Germany, pictured below with simulation results, is one example that has been driving the initial research. Here, the warm-colored regions of the graphic could indicate displacement, strain, or stress results of concern.

Hasetal Bridge

Simulation data mining can be applied in this setting to generate knowledge and decision rules from large numbers of simulations. In particular, it is of great value when bridge design is impeded by the time required to generate simulation results for large models in an interactive setting. Instead, simulation results could be batch-processed in the background while the designer goes on with other work, and be returned later on demand. If the requested model is not yet available, its nearest neighbor could be returned in the interim as an approximate preview, allowing the designer to continue without interruption. In addition, when the designer faces a design space of enormous size, it is also interesting to predict the expected behavior of models from training data made up of existing simulation results.
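The nearest-neighbor preview described above can be sketched in a few lines. This is a minimal illustration and not the project's implementation: the cache contents, the choice of parameters (span, deck thickness, Young's modulus), the result field, and the range-scaled Euclidean distance are all assumptions made for the example.

```python
import math

# Hypothetical cache of finished simulations: design parameters -> result.
# Parameter meaning (span in m, deck thickness in m, Young's modulus in GPa)
# and the result values are illustrative assumptions, not project data.
finished = {
    (40.0, 0.50, 30.0): {"max_displacement_mm": 12.1},
    (40.0, 0.60, 30.0): {"max_displacement_mm": 10.4},
    (60.0, 0.50, 35.0): {"max_displacement_mm": 21.7},
}

def scaled_distance(a, b, ranges):
    """Euclidean distance after dividing each parameter by its range."""
    return math.sqrt(sum(((x - y) / r) ** 2 for x, y, r in zip(a, b, ranges)))

def preview(request, cache):
    """Return (parameters, result, is_exact) for the best available model."""
    if request in cache:
        return request, cache[request], True
    keys = list(cache)
    # Range of each parameter over the cache; 1.0 avoids division by zero.
    ranges = [max(col) - min(col) or 1.0 for col in zip(*keys)]
    nearest = min(keys, key=lambda k: scaled_distance(request, k, ranges))
    return nearest, cache[nearest], False

params, result, is_exact = preview((42.0, 0.58, 30.0), finished)
print(params, result, is_exact)
```

Scaling each parameter by its range keeps a parameter with large numeric values from dominating the distance. A learned similarity measure could replace `scaled_distance` without changing the lookup logic.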

Similarity Measurement.  A similarity measure can support simulation data mining, based on the idea that models of similar design will have similar simulation results. Under this assumption, similar models can be looked up for any given model, or models of similar design can be identified from their simulation behavior. The development of this idea follows six steps.

First, a subset of the design space such as geometry and material parameters is considered for exploration (step 1). Next, the simulation results are produced using the Finite Element Method (step 2). Following that, thousands of simulation measurements are aggregated into a more manageable subset (step 3). Then clustering technology is applied to generate knowledge about nearest neighbors (step 4), and an appropriate set of this knowledge is sampled (step 5). Finally, class probability estimates from machine learning classification technology are exploited for producing similarity scores (step 6). With the exception of the simulations, all steps pose interesting data mining questions, for which many competing alternatives have to be considered and evaluated.
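One possible reading of step 6 is sketched below, under strong simplifying assumptions: each model has a single design parameter, cluster labels from step 4 are given, pairs of models are labeled "similar" when they share a cluster, and a logistic regression trained by gradient descent supplies the class probability. The data, the pair feature, and the choice of classifier are all hypothetical and are not taken from the project.

```python
import math
from itertools import combinations

# Toy stand-in for steps 1-5: one design parameter per model and a cluster
# label per model. Both lists are made-up values for illustration only.
design = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
cluster = [0, 0, 0, 1, 1, 1]

# Training pairs: feature = parameter difference, label = 1 if same cluster.
pairs = [(abs(design[i] - design[j]), 1.0 if cluster[i] == cluster[j] else 0.0)
         for i, j in combinations(range(len(design)), 2)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(samples, steps=5000, rate=1.0):
    """Fit p(similar | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in samples)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in samples)
        w -= rate * grad_w / len(samples)
        b -= rate * grad_b / len(samples)
    return w, b

w, b = fit_logistic(pairs)

def similarity(param_a, param_b):
    """Step 6: the class probability for 'similar' serves as the score."""
    return sigmoid(w * abs(param_a - param_b) + b)

print(round(similarity(0.05, 0.15), 3), round(similarity(0.05, 1.05), 3))
```

The score lies between 0 and 1 and falls as the design parameters move apart, so it can rank candidate neighbors for models that were never simulated.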

Mining and Storing BIG Data.  A student project in Summer 2012 called "Mining and Storing BIG Data" studied the relationship between simulation results and machine learning results. As a sub-theme, there is interest in applying so-called "big data" technology to address bottlenecks. The Hadoop and Mahout frameworks are the technologies applied to this "big data" theme. Hadoop is used for concurrent processing of the numerical simulations, which is an inherently parallelizable task. Mahout provides a library of supervised and unsupervised learning methods and enables concurrent processing of those parts of the work that are not otherwise parallelizable.

As another sub-theme, there is interest in implementing methods for making the work easier to reproduce and disseminate. An online version of the six-step "simulation pipeline" implemented as a TIRA experiment addresses reproducibility. The TIRA web service provides an online framework to allow researchers to share experiments on the web for others to reproduce, and provides other features such as the ability to explore experiment parameters, monitor experiment progress, and reuse cached results. The poster and talk slides produced during the student group project are available.

Domain Decomposition.  There is also interest in speeding up numerical analysis in general, so that the findings can be applied efficiently elsewhere. A method called domain decomposition can be applied to parallelize the processing. To do this, the domain on which the numerical analysis is solved is broken into several sub-domains for concurrent processing on modern computing architectures. The full solution is then reconstructed using the overlapping parts of each sub-domain.

The additive Schwarz method can be applied as one implementation of domain decomposition. In this problem setting, there is a complex trade-off to manage between the number of iterations required by the additive Schwarz method and the total size of the problem, including the redundancy introduced by the overlaps. As a simple solution, a human could organize the sub-domains into a checkerboard pattern and apply a uniform overlap. However, our current work shows that more customized and efficient solutions can be developed with regression analysis from machine learning. Our code for supporting this work is available for download: (1.4 MB).
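The trade-off between iterations and overlap can be observed on a toy problem. The sketch below is not the project's code: it assumes a one-dimensional Poisson problem (-u'' = 1 with zero boundary values), two sub-domains, and a damped additive Schwarz iteration with a damping factor of 0.5, which is chosen here so that the plain iteration converges. The function names and parameters are hypothetical.

```python
def solve_tridiagonal(rhs):
    """Thomas algorithm for the matrix tridiag(-1, 2, -1)."""
    m = len(rhs)
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def residual(u, f):
    """r = f - A u for A = tridiag(-1, 2, -1) with zero boundary values."""
    n = len(u)
    return [f[i] - (2.0 * u[i]
                    - (u[i - 1] if i > 0 else 0.0)
                    - (u[i + 1] if i < n - 1 else 0.0)) for i in range(n)]

def additive_schwarz(n, half_overlap, damping=0.5, tol=1e-10, max_iter=10000):
    """Damped additive Schwarz for -u'' = 1 on (0, 1), two sub-domains."""
    h = 1.0 / (n + 1)
    f = [h * h] * n                      # right-hand side scaled by h^2
    mid = n // 2
    subdomains = [(0, mid + half_overlap), (mid - half_overlap, n)]
    u = [0.0] * n
    for iteration in range(1, max_iter + 1):
        r = residual(u, f)
        if max(abs(v) for v in r) < tol:
            return u, iteration - 1, sum(hi - lo for lo, hi in subdomains)
        correction = [0.0] * n
        for lo, hi in subdomains:        # independent local solves
            for i, e in enumerate(solve_tridiagonal(r[lo:hi]), start=lo):
                correction[i] += e
        u = [ui + damping * ci for ui, ci in zip(u, correction)]
    raise RuntimeError("no convergence within max_iter")

for half_overlap in (1, 5):
    u, iterations, total_size = additive_schwarz(40, half_overlap)
    print(half_overlap, iterations, total_size)
```

Each printed line shows the half-overlap in grid points, the number of iterations, and the summed size of the sub-problems. A wider overlap reduces the iteration count but enlarges the local solves, which is the quantity a learned overlap choice would try to balance. The local solves inside one iteration do not depend on each other, so they could run concurrently.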


Students: David Wiesner, Katja Müller, Peter Hirsch, Jens Opolka, Tom Paschke, and Michael Völske.


Oliver Niggemann, Stefan Windmann, Sören Volgmann, Andreas Bunte, and Benno Stein. Using Learned Models for the Root Cause Analysis of Cyber-Physical Production Systems. In Proceedings of the 25th International Workshop on Principles of Diagnosis (DX 2014), September 2014.
Steven Burrows, Jörg Frochte, Michael Völske, Ana Belén Martinez Torres, and Benno Stein. Learning Overlap Optimization for Domain Decomposition Methods. In Jian Pei et al., editors, 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 13), pages 438-449, Berlin Heidelberg New York, April 2013. Springer.
Tim Gollub, Benno Stein, Steven Burrows, and Dennis Hoppe. TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments. In A Min Tjoa, Stephen Liddle, Klaus-Dieter Schewe, and Xiaofang Zhou, editors, 9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA, pages 151-155, Los Alamitos, California, September 2012. IEEE. ISBN 978-1-4673-2621-6. ISSN 1529-4188.
Oliver Niggemann, Benno Stein, Asmir Vodencarevic, Alexander Maier, and Hans Kleine Büning. Learning Behavior Models for Hybrid Timed Systems. In Jörg Hoffmann and Bart Selman, editors, 26th International Conference on Artificial Intelligence (AAAI 12), pages 1083-1090, Palo Alto, California, July 2012. AAAI. ISBN 978-1-57735-568-7.
Oliver Niggemann, Benno Stein, and Alexander Maier. Modeling Problems with Machine Learning - A Classification Scheme of Model Learning Approaches for Technical Systems. In Holger Giese, Michaela Huhn, Jan Philipps, and Bernhard Schätz, editors, 8th Dagstuhl Workshop Model-Based Development of Embedded Systems (MBEES 12), pages 21-29, Munich, Germany, February 2012. fortiss GmbH.
Steven Burrows, Benno Stein, Jörg Frochte, David Wiesner, and Katja Müller. Simulation Data Mining for Supporting Bridge Design. In Peter Christen et al., editors, 9th Australasian Data Mining Conference (AusDM 2011), volume 121 of CRPIT, pages 163-170, New York, December 2011. ACM. ISBN 978-1-921770-02-9.
Hamish Cunningham, Norbert Fuhr, and Benno Stein. Challenges in Document Mining (Dagstuhl Seminar 11171). Dagstuhl Reports, 1(4):65-99, August 2011.