AI Takes Over Data Preparation

28/11/2025

AI Takes Over Data Preparation: New Study Shows It Can Predict Most BI Prep Steps

Data preparation has long been regarded as one of the most time-consuming and labor-intensive stages of analytics. Experts note that in many business intelligence (BI) projects, the majority of effort is spent not on building dashboards, but on cleaning, transforming, and combining data.
A new study published in 2025 in the Proceedings of the VLDB Endowment suggests that this may soon change.

Researchers have introduced Auto-Prep: Holistic Prediction of Data Preparation Steps, a model capable of predicting more than 70 percent of the transformation and join operations commonly performed during data preparation. The findings are based on an analysis of more than 2,000 real-world BI projects.

A Unified Model for Transformations and Joins

Unlike conventional tools, which are designed to handle either transformations or joins, Auto-Prep combines both within a single framework. The authors use a graph-based representation, treating tables as nodes and potential preparation steps as edges.

This holistic approach acknowledges that transformation and join operations often influence one another. By evaluating them together, Auto-Prep is able to predict not only the correct operations but also the correct order in which they should be applied.
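
To make the graph idea concrete, here is a toy sketch in Python. It is not the Auto-Prep algorithm: the table names, candidate steps, confidence scores, and the greedy selection rule are all hypothetical stand-ins for whatever joint optimization the study actually uses. It only illustrates why ordering matters, since a join may become possible only after a transformation has been applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate preparation step (an edge in the table graph)."""
    name: str
    kind: str             # "transform" or "join"
    tables: tuple         # one table for a transform, two for a join
    score: float          # made-up confidence in [0, 1]
    requires: tuple = ()  # names of steps that must run first


def plan(candidates, threshold=0.5):
    """Drop low-confidence steps, then greedily order the rest so that
    every step runs only after its prerequisites."""
    kept = [c for c in candidates if c.score >= threshold]
    done = set()
    ordered = []
    while True:
        ready = [
            c for c in kept
            if c.name not in done and all(r in done for r in c.requires)
        ]
        if not ready:
            break
        best = max(ready, key=lambda c: c.score)
        ordered.append(best)
        done.add(best.name)
    return ordered


# Illustrative input: the join on customers only works once the key is split out.
CANDIDATES = [
    Candidate("join_sales_customers", "join", ("sales", "customers"), 0.9,
              requires=("split_customer_key",)),
    Candidate("split_customer_key", "transform", ("sales",), 0.7),
    Candidate("join_sales_regions", "join", ("sales", "regions"), 0.3),
    Candidate("unpivot_budget", "transform", ("budget",), 0.8),
]

for step in plan(CANDIDATES):
    print(step.kind, step.name)
```

In this toy run the low-scoring join is discarded, and the customer join is scheduled only after the transformation it depends on, even though the join itself has the highest score.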

Notably, the model outperformed large language models, including GPT-4, especially in determining the optimal sequence of data preparation steps.

Key Findings and Practical Implications

When tested on a large sample of Power BI projects, Auto-Prep demonstrated:

  • more than 70% accuracy in predicting data preparation steps,

  • strong performance even in projects with many tables,

  • high F1 scores for transformations, reaching approximately 0.76.
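
For readers unfamiliar with the metric, F1 is the harmonic mean of precision (how many predicted steps were correct) and recall (how many of the real steps were predicted). The short Python sketch below computes it for one project; the step names are invented for illustration, and the study's exact matching criteria may differ.

```python
def f1_score(predicted, actual):
    """F1 for a set of predicted steps versus the steps actually performed."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: three of four predicted steps match the analyst's work.
predicted_steps = {"unpivot", "split_column", "change_type", "remove_rows"}
actual_steps = {"unpivot", "split_column", "change_type", "fill_down"}

print(f1_score(predicted_steps, actual_steps))  # precision and recall are both 3/4
```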

Researchers point out that data preparation remains the primary bottleneck in analytics workflows. Automating this phase could dramatically reduce the time required to deploy BI solutions and free analysts from repetitive, manual tasks.

Future Applications

Although Auto-Prep is designed for self-service BI environments, experts say the underlying concept has potential far beyond that. Possible applications include:

  • ETL and data engineering pipelines,

  • automated preparation for machine learning tasks,

  • no-code/low-code data platforms,

  • enterprise-scale data transformation systems.

The study marks a significant step toward more comprehensive automation in analytics. If tools like Auto-Prep gain traction, the role of data professionals may shift from manual data manipulation toward higher-level analytical and strategic work.