Big Data Analytics by Online Ensemble Learning (BOEL)

Developing tools for generating knowledge from big data sets is vital for meeting the digital explosion challenge. The proposed research project will tackle fundamental problems at the core of this task, namely big data analytics. Specifically, it will develop algorithms and methods for online analysis of very large and rapidly growing data sets.

The overall purpose is to apply algorithms and methods from the machine learning and distributed computing fields to produce system solutions and software tools for industry-relevant data mining problems, as identified by the two industrial partners, AstraZeneca and Scania.

For AstraZeneca, one key goal of this project is to find improved techniques for generating in silico models from experimental data, thus reducing the cost of laboratory (in vitro) and clinical (in vivo) experiments in drug discovery projects. For Scania, the overall goal is to develop methods for structuring, analysing and mining its large warehouse of maintenance data.

This project will investigate distributed architectures for data warehousing, with the aim of alleviating the storage vs. performance trade-off and meeting the demands of big data. One specific predictive task is to determine how usage patterns and vehicle configurations affect fuel consumption. Insights into this will be valuable not only to Scania and its customers, but also to the environment. The premise of this project is that ensemble techniques are the key to obtaining accurate and robust predictive models. Thus, improving existing ensemble techniques, or devising new and accurate ones suitable for big data analytics, is the principal goal of this project.

One important issue in predictive modeling is that models often serve the dual purpose of prediction and exploration, which regularly necessitates interpretable models. Since ensemble models are inherently opaque, reducing this accuracy vs. comprehensibility trade-off is another key task for this project.

Finally, since online analysis and predictive modeling of very large data sets require computations to complete in reasonable time, integration with high-performance computing research is crucial. The research undertaken in the project is cutting edge, both scientifically and technologically, and is integrated with core R&D processes at the industrial partners. Hence, academic researchers will work with real-world data on industry-relevant problems. Proposed solutions will be employed in ongoing R&D projects at the industrial partners and compared against existing “best practices”. Both companies will thus utilize front-line research to improve key R&D processes in their respective fields. Crucially for the success of the project, both Scania and AstraZeneca have committed substantial amounts of time from senior and highly qualified employees. In addition, both industrial partners are involved in most work packages, creating an arena for the exchange of problems, ideas and solutions, thus enabling true co-production of advanced research.

Collaboration Partners