How to evaluate a data cleaning tool

Donatello Santoro edited this page Nov 22, 2016 · 4 revisions

BART can be used to compare and evaluate data cleaning tools, using the following steps.

1. Generate a dirty db

The first step is to start from a clean db and ask BART to generate errors w.r.t. a set of quality constraints. You control how errors are introduced by choosing, for each vio-gen query, the percentage of errors to introduce. BART can apply the errors directly on the DBMS (property applyCellChanges) or export a dirty version of the database to CSV files (property exportDirtyDB).

2. Export expected changes

To evaluate a tool, you need to compare the expected changes (the errors BART introduced) with the generated changes (the repairs the tool produced). You can export the expected changes using the property exportCellChanges or using the GUI.
[Screenshot: BART Export Changes]

The CSV format for expected changes requires 3 columns (separated by commas).

  1. cell ref (OID.attributeName)
  2. clean value
  3. dirty value

Example
2.name, John, JXXn
3.ssn, 123, 555
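
A file in this format can be loaded with a few lines of Python. The following is a minimal sketch, not part of BART: the names parse_expected_changes and ExpectedChange are hypothetical, and it assumes the file has no header row and that the blank after each comma is not part of the value.

```python
import csv
import io
from typing import Dict, NamedTuple


class ExpectedChange(NamedTuple):
    clean: str  # value before BART introduced the error
    dirty: str  # value after BART introduced the error


def parse_expected_changes(text: str) -> Dict[str, ExpectedChange]:
    """Map each cell ref (OID.attributeName) to its clean and dirty values."""
    changes = {}
    # skipinitialspace drops the blank that follows each comma in the example.
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore empty lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        cell_ref, clean, dirty = (field.strip() for field in row)
        changes[cell_ref] = ExpectedChange(clean, dirty)
    return changes


sample = "2.name, John, JXXn\n3.ssn, 123, 555\n"
expected = parse_expected_changes(sample)
print(expected["2.name"])
```

Keying the result by cell ref makes it easy to look up the clean value of a cell when the repairs are checked later.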

3. Repair the database and export generated changes

Now you can use a data-cleaning tool to clean the dirty db. At the end of the process, you need to export the generated changes. The CSV format for generated changes requires 2 columns (separated by commas).

  1. cell ref (OID.attributeName)
  2. repaired value

Example
2.name, John
3.ssn, _L1

Note that some data-cleaning tools may introduce variables instead of constant values (_L1 in the example).
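
If the cleaning tool does not export this format itself, a small script can read it back and flag the variables. The sketch below is illustrative only: parse_generated_changes and is_variable are hypothetical names, and it assumes that every variable starts with the "_L" prefix seen in the example, which may not hold for every tool.

```python
import csv
import io
from typing import Dict


def parse_generated_changes(text: str) -> Dict[str, str]:
    """Map each cell ref (OID.attributeName) to the repaired value."""
    repairs = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore empty lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        cell_ref, repaired = (field.strip() for field in row)
        repairs[cell_ref] = repaired
    return repairs


def is_variable(value: str, prefix: str = "_L") -> bool:
    # Assumption: variables share the "_L" prefix shown in the example.
    return value.startswith(prefix)


generated = parse_generated_changes("2.name, John\n3.ssn, _L1\n")
variables = {ref for ref, value in generated.items() if is_variable(value)}
print(variables)
```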

4. Compare changes

Now that you have expected and generated changes, you can use BART to compare them.
[Screenshot: Compare button]

The inputs are the paths of the two CSV files and the score to assign when the generated value is a variable.
[Screenshot: comparison dialog]
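
To make the role of the variable score concrete, here is one plausible way such a comparison could be computed. This is an assumption for illustration, not BART's actual metric: the function score_repairs, the precision/recall/F1 formulation, and the "_L" variable prefix are all hypothetical. A repair that matches the clean value earns full credit, a variable earns the configured partial score, and anything else earns nothing.

```python
from typing import Dict, Tuple


def score_repairs(
    expected_clean: Dict[str, str],
    generated: Dict[str, str],
    variable_score: float = 0.5,
    variable_prefix: str = "_L",
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for a set of generated repairs."""
    credit = 0.0
    for cell_ref, repaired in generated.items():
        if cell_ref not in expected_clean:
            continue  # the tool changed a cell that was never made dirty
        if repaired.startswith(variable_prefix):
            credit += variable_score  # partial credit for a variable
        elif repaired == expected_clean[cell_ref]:
            credit += 1.0  # exact repair
    precision = credit / len(generated) if generated else 0.0
    recall = credit / len(expected_clean) if expected_clean else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Cell refs mapped to clean values (expected) and repaired values (generated).
expected_clean = {"2.name": "John", "3.ssn": "123"}
generated = {"2.name": "John", "3.ssn": "_L1"}
precision, recall, f1 = score_repairs(expected_clean, generated, variable_score=0.5)
print(precision, recall, f1)
```

With a variable score of 0.5, the two example repairs earn 1.5 credits out of 2, so precision and recall are both 0.75 in this sketch. A score of 0 treats variables as wrong repairs, and a score of 1 treats them as correct.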