Skip to content

Clean data filter

David Megginson edited this page Jul 26, 2018 · 14 revisions
Clean Data filter form.

The Clean data filter on the Recipe page automatically fixes small formatting inconsistencies with dates, numbers, character case, and whitespace.

Options

Each of the options takes a list of Tag patterns, separated by commas, like "sector-code,subsector". The hash sign "#" is optional.

Normalise whitespace: clean the whitespace in the matching columns. Removes any leading and trailing whitespace, and converts all internal whitespace (spaces, tabs, newlines, etc.) to a single space, so that (for example) " Food Security " becomes "Food Security".

Standardise lat/lon: attempt to parse any latitude and longitude values in columns identified by the hashtag provided (#geo is a good choice) and standardise them into a decimal format. Will operate only if the attribute +lat, +lon, or +coord is present in the actual column (doesn't need to be in the pattern).

Convert to uppercase: convert the matching column(s) to uppercase, so that (for example) "WaSH" becomes "WASH".

Convert to lowercase: convert the matching column(s) to lowercase, so that (for example) "MarketPlace" becomes "marketplace".

Standardise dates: attempt to convert any dates found in the matching column(s) to standard ISO 8601 date format YYYY-MM-DD, so that (for example) "5/13/16" becomes "2016-05-13" (see Date format below if you want to change the default). If the HXL Proxy cannot convert the date, it leaves the original contents unchanged. Note: some dates, such as "5/6/2013", are ambiguous ("2013-05-06" or "2013-06-05"?). The HXL Proxy will scan to see whether unambiguous instances of DDMM or MMDD are more common, and will use that as default (preferring international DDMM in the case of a tie).

Date format: (applies only if tags supplied for Standardise dates.). A strftime format string for dates, such as %Y/%m to generate dates like "2016/03". If you leave this blank, you will get dates in YYYY-MM-DD format ("%Y-%m-%d"), which is what you should normally use.

Standardise numbers: attempt to convert any numbers found in the matching column(s) to machine-readable format, so that (for example) "2, 000" becomes "2000". If the HXL Proxy cannot convert the number, it leaves the original contents unchanged.

Number format: (applies only if tags supplied for Standardise numbers.) A printf format string for numbers, excluding the leading "%", such as 0.2f for two decimal places. Use for rounding numbers, adding leading zeros, etc. The exact string supported are documented in the Format Specification Mini-Language Note that using "d" to round an floating-point will fail, and result in no cleaning; the correct format string to round to zero (or more) decimal places is .0f

Purge bad dates, numbers, and lat/lon during cleaning: If checked, whenever the filter can't clean a number, date, or lat/lon, it will remove the value from the dataset (by default, the value is left untouched). This option will ignore values in columns that aren't being cleaned.

Clean data only in rows matching this query: skip cleaning in any rows that don't match this row query. For example, the row query "org=unicef" would perform the cleaning only where the #org column had the value "unicef" (case-insensitive).

Typical use cases

A dataset is inconsistent in the way it represents organisation names, sometimes adding whitespace and sometimes using mixed case, so that (for example) "World Vision" can appear as "␣␣world␣vision", "World␣␣Vision", "WORLD␣VISION␣␣", etc. Using the Convert to uppercase and Normalise whitespace filters, you can convert all #org columns to have consistent values like "WORLD VISION".

A dataset contains dates in several different formats, copy-pasted from different sources, so that May 31, 2015 can appear as "31/5/2015", "May 31, 2015", "5/31/15", etc. Use the Normalise dates filter to change all values in #date columns to ISO 8601 format (e.g. "2015-05-31") for easier sorting, analysis, and visualisation.

You want to roll dates up to the nearest week, month, or year, before using the Count rows filter to generate a report. Enter #date (or something more specific) in the "Standardise dates" field, and a strftime format string in the "Date format" field. Sample format strings:

  • "%Y" — round the date up to the year, e.g. "2018"
  • "%Y-m" — round the date up to the month, e.g. "2018-01"
  • "%Y-W%V" — round the date up to the ISO week, e.g. "2018-W03".

Example

This example shows the Normalise dates option.

Input

#org #output+activities #date+reported
Red Cross 15 7/13/2016
UNICEF 23 July 1, 2016
MSF 12 21-07-16

Output

#org #output+activities #date+reported
Red Cross 15 2016-07-13
UNICEF 23 2016-07-01
MSF 12 2016-07-21
Clone this wiki locally