-
Notifications
You must be signed in to change notification settings - Fork 56
Cheatsheet
Nitin Motgi edited this page Jul 30, 2017
·
1 revision
Name | Usage | Description |
---|---|---|
SWAP | swap <column1> <column2> | Swaps the column names of two columns. |
ENCODE | encode <base32 | base64 |
XPATH | xpath <column> <destination> <xpath> | Extract a single XML element or attribute using XPath. |
GENERATE-UUID | generate-uuid <column> | Populates a column with a universally unique identifier (UUID) of the record. |
LOWERCASE | lowercase <column> | Changes the column values to lowercase. |
WRITE-AS-CSV | write-as-csv <column> | Writes the records files as well-formatted CSV |
PARSE-AS-PROTOBUF | parse-as-protobuf <column> <schema-id> <record-name> [version] | Parses column as protobuf encoded memory representations. |
HASH | hash <column> <algorithm> [<encode=true | false>] |
JSON-PATH | json-path <source> <destination> <json-path-expression> | Parses JSON elements using a DSL (a JSON path expression). |
MASK-NUMBER | mask-number <column> <pattern> | Masks a column value using the specified masking pattern. |
TEXT-DISTANCE | text-distance <method> <column1> <column2> <destination> | Calculates a text distance measure between two columns containing string. |
PARSE-XML-TO-JSON | parse-xml-to-json <column> [<depth>] | Parses a XML document to JSON representation. |
PARSE-AS-HL7 | parse-as-hl7 <column> [<depth>] | Parses <column> for Health Level 7 Version 2 (HL7 V2) messages; <depth> indicates at which point JSON object enumeration terminates. |
FIND-AND-REPLACE | find-and-replace <column> <sed-expression> | Finds and replaces text in column values using a sed-format expression. |
RENAME | rename <old> <new> | Renames an existing column. |
PARSE-AS-AVRO | parse-as-avro <column> <schema-id> <json | binary> [version] |
FILL-NULL-OR-EMPTY | fill-null-or-empty <column> <fixed-value> | Fills a value of a column with a fixed value if it is either null or empty. |
SET-TYPE | set-type <column> <type> | Converting data type of a column. |
RTRIM | rtrim <column> | Trimming whitespace from right side of a string. |
INVOKE-HTTP | invoke-http <url> <column>[,<column>] <header>[,<header>] | [EXPERIMENTAL] Invokes an HTTP endpoint, passing columns as a JSON map (potentially slow). |
COLUMNS-REPLACE | columns-replace <sed-expression> | Modifies column names in bulk using a sed-format expression. |
SEND-TO-ERROR | send-to-error <condition> | Send records that match condition to the error collector. |
SET-RECORD-DELIM | set-record-delim <column> <delimiter> [<limit>] | Sets the record delimiter. |
SET-VARIABLE | set-variable <variable> <expression> | Sets the value for a transient variable for the record being processed. |
SET-CHARSET | set-charset <column> <charset> | Sets the character set decoding to UTF-8. |
WRITE-AS-JSON-OBJECT | write-as-json-object <dest-column> [<src-column>[,<src-column>] | Creates a JSON object based on source columns specified. JSON object is written into dest-column. |
KEEP | keep <column>[,<column>*] | Keeps the specified columns and drops all others. |
CUT-CHARACTER | cut-character <source> <destination> <type> <range | indexes> |
SPLIT-TO-ROWS | split-to-rows <column> <separator> | Splits a column into multiple rows, copies the rest of the columns. |
XPATH-ARRAY | xpath-array <column> <destination> <xpath> | Extract XML element or attributes as JSON array using XPath. |
FAIL | fail <condition> | Fails when the condition is evaluated to true. |
INCREMENT-VARIABLE | increment-variable <variable> <value> <expression> | Wrangler - A interactive tool for data cleansing and transformation. |
PARSE-AS-XML | parse-as-xml <column> | Parses a column as XML. |
PARSE-AS-FIXED-LENGTH | parse-as-fixed-length <column> <width>[,<width>*] [<padding-character>] | Parses fixed-length records using the specified widths and padding-character. |
CHANGE-COLUMN-CASE | change-column-case lower | upper |
SPLIT-EMAIL | split-email <column> | Split a email into account and domain. |
URL-ENCODE | url-encode <column> | URL encode a column value. |
WRITE-AS-JSON-MAP | write-as-json-map <column> | Writes all record columns as JSON map. |
MASK-SHUFFLE | mask-shuffle <column> | Masks a column value by shuffling characters while maintaining the same length. |
DROP | drop <column>[,<column>*] | Drop one or more columns. |
DECODE | decode <base32 | base64 |
SPLIT | split <source> <delimiter> <new-column-1> <new-column-2> | [DEPRECATED] Use 'split-to-columns' or 'split-to-rows'. |
PARSE-AS-SIMPLE-DATE | parse-as-simple-date <column> <format> | Parses a column as date using format. |
DIFF-DATE | diff-date <column1> <column2> <destination> | Calculates the difference in milliseconds between two Date objects.Positive if <column2> earlier. Must use 'parse-as-date' or 'parse-as-simple-date' first. |
INDEXSPLIT | indexsplit <source> <start> <end> <destination> | [DEPRECATED] Use the 'split-to-columns' or 'parse-as-fixed-length' directives instead. |
PARSE-AS-AVRO-FILE | parse-as-avro-file <column> | parse-as-avro-file <column>. |
FILTER-ROW-IF-TRUE | filter-row-if-true <condition> | [DEPRECATED] Filters rows if condition is evaluated to true. Use 'filter-rows-on' instead. |
SPLIT-URL | split-url <column> | Split a url into it's components host,protocol,port,etc. |
FORMAT-DATE | format-date <column> <format> | Formats a column using a date-time format. Use 'parse-as-date` beforehand. |
QUANTIZE | quantize <source> <destination> <[range1:range2)=value>,[<range1:range2=value>]* | Quanitize the range of numbers into label values. |
PARSE-AS-EXCEL | parse-as-excel <column> [<sheet number | sheet name>] |
PARSE-AS-DATE | parse-as-date <column> [<timezone>] | Parses column values as dates using natural language processing and automatically identifying the format (expensive in terms of time consumed). |
TABLE-LOOKUP | table-lookup <column> <table> | Uses the given column as a key to perform a lookup into the specified table. |
FILTER-ROWS-ON | filter-rows-on empty-or-null-columns <column>[,<column>*] | Filters row that have empty or null columns. |
TRIM | trim <column> | Trimming whitespace from both sides of a string. |
URL-DECODE | url-decode <column> | URL decode a column value. |
FLATTEN | flatten <column>[,<column>*] | Separates array elements of one or more columns into indvidual records, copying the other columns. |
UPPERCASE | uppercase <column> | Changes the column values to uppercase. |
CATALOG-LOOKUP | catalog-lookup <catalog> <column> | Looks-up values from pre-loaded (static) catalogs. |
PARSE-AS-LOG | parse-as-log <column> <format> | Parses Apache HTTPD and NGINX logs. |
LTRIM | ltrim <column> | Trimming whitespace from left side of a string. |
EXTRACT-REGEX-GROUPS | extract-regex-groups <column> <regex-with-groups> | Extracts data from a regex group into its own column. |
PARSE-AS-CSV | parse-as-csv <column> <delimiter> [<header=true | false>] |
FILTER-ROW-IF-MATCHED | filter-row-if-matched <column> <regex> | [DEPRECATED] Filters rows if the regex is matched. Use 'filter-rows-on' instead. |
PARSE-AS-JSON | parse-as-json <column> [<depth>] | Parses a column as JSON. |
SET COLUMN | set column <column> <jexl-expression> | Sets a column by evaluating a JEXL expression. |
STEMMING | stemming <column> | Apply Porter Stemming on the column value. |
COPY | copy <source> <destination> [<force=true | false>] |
SET-COLUMN | set-column <column> <expression> | Sets a column the result of expression execution. |
SPLIT-TO-COLUMNS | split-to-columns <column> <regex> | Splits a column into one or more columns around matches of the specified regular expression. |
CLEANSE-COLUMN-NAME | cleanse-column-names | Sanatizes column names: trims, lowercases, and replaces all but [A-Z][a-z][0-9].with an underscore ''. |
SET COLUMNS | set columns <columm>[,<column>*] | Sets the name of columns, in the order they are specified. |
TITLECASE | titlecase <column> | Changes the column values to title case. |
MERGE | merge <column1> <column2> <new-column> <separator> | Merges values from two columns using a separator into a new column. |
TEXT-METRIC | text-metric <method> <column1> <column2> <destination> | Calculates the metric for comparing two string values. |
SET FORMAT | set format csv <delimiter> <skip empty lines> | [DEPRECATED] Parses the predefined column as CSV. Use 'parse-as-csv' instead. |
FORMAT-UNIX-TIMESTAMP | format-unix-timestamp <column> <format> | Formats a UNIX timestamp using the specified format |
FILTER-ROW-IF-NOT-MATCHED | filter-row-if-not-matched <column> <regex> | Filters rows if the regex does not match |
FILTER-ROW-IF-FALSE | filter-row-if-false <condition> | Filters rows if the condition evaluates to false |
- Introduction
- Get Started
- Concepts
- System Directives
- Parsers
- Output Formatters
- Transformations
- Encoders and Decoders
- Unique ID Generation
- Date Transformations
- Lookups
- Hashing and Masking
- Row Operations
- Column Operations
- NLP
- Transient In-Memory Counters
- Transformation Functions
- Custom Directives (UDD)
- What is UDD ?
- Building UDD
- UDD Lifecycle
- Field Level Lineage
- Aliasing and Restricting
- Embedding Directives
- Schema Registry
- Pipeline Transform
- Cheatsheet
- Roadmap
- Technical Documents
- Custom Directive Internals