-
Notifications
You must be signed in to change notification settings - Fork 340
Sample Programs of Performing Distinct on a File
【Abstract】
This article explains methods of performing distinct operation on small files and big files, and offers sample program in esProc SPL. Those methods include removing or retaining duplicates by rows or by key filed.Sample Programs of Performing Distinct on a File
Sometimes during data processing we need to remove duplicate data from a file or retain only duplicates in a file. Now let’s look at how we can get these done over small files or big files through sample programs in esProc SPL. The methods include row-wise deletion or deletion by key field. esProc is the professional data computing engine. The SPL language it uses boasts a set of all-round function libraries for handling set-based operations. That makes it easy and simple to perform distinct over files.
In a text file, each row is a string. We want to retain only one of each set of duplicate rows. To do this, we read in each row in the file as string to form a set of strings and perform distinct over the set.
Example: paint.txt records the IDs and names of students who register in the painting class. Some students have registered more than once, so we need to delete the extra duplicate records and write data to paint1.txt. Below is part of paint.txt:
20121102-Joan
20121107-Jack
20121113-Mike
20121107-Jack
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/paint.txt").read@n() | Read in each row of paint.txt to form a set |
2 | =A1.id() | Delete duplicates from A1’s set |
3 | =file("e:/txt/paint1.txt").write(A2) | Write A2, the deduplicated set, into paint1.txt |
We have a file that has columns of data. The first row contains column headers and detailed data starts from the second row. We want to compare values in the key field, and delete rows containing duplicate key values or retain them only.
Below is a part of the orders table of 2018 (order_2018.xlsx):
Example 1: Get the unique IDs of all customers that place an order in 2018 and write them into 2018c.xlsx.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/order_2018.xlsx").xlsimport@t() | Import orders data of 2018 |
2 | =A1.id(CustomerId) | Get IDs of all different customers |
3 | =file("e:/txt/2018c.xlsx").xlsexport(A2) | Export unique IDs to 2018c.xlsx |
Example 2: Find the different products each customer bought in 2018, and write key fields CustomerId and ProductId to 2018c_p.xlsx.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/order_2018.xlsx").xlsimport@t(CustomerId,ProductId) | Import key fields of orders table of 2018 |
2 | =A1.group@1(CustomerId,ProductId) | Group by the key fields; @1 option enables getting the first record of each group only |
3 | =file("e:/txt/2018c_p.xlsx").xlsexport@t(A2) | Export A2’s result to 2018c_p.xlsx |
Example: Get records where the customer bought same product many times and export them to 2018c_rebuy.xlsx.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/order_2018.xlsx").xlsimport@t() | Import orders data of 2018 |
2 | =A1.group(CustomerId,ProductId) | Put records where same customer buys same product into one group |
3 | =A2.select(~.count()>1).conj() | Get groups where the number of records is greater than 1 and concatenate their records together |
4 | =file("e:/txt/2018c_rebuy.xlsx").xlsexport@t(A3) | Export A3’s result to 2018c_rebuy.xlsx |
A big file cannot be wholly loaded into the memory. So we can’t import all records to make comparison to find the duplicates. Instead we need to do the retrieval and comparison batch by batch. esProc SPL has the cursor mechanism that makes it convenient to handle big file processing, including the distinct operation.
In a big text file, each row is a string. We want to retain only one of each set of duplicate rows. To do this, we read in each row in the file as string to form a set of strings with the cursor and perform distinct over the cursor.
Example: all.txt records information of real property owners nationwide, including their IDs and names. Below is part of the file:
510121198802213364-Joan
110113199203259852-Jack
201264197206271113-Mike
Some owners have more than one house in different states and so they appear multiple times in the file. Now we want to remove the duplicates and retain only one registration record and export the result to all2.txt.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/all.txt").cursor@s() | Create the cursor; @s option enables generating a one-field table sequence with values being rows |
2 | =A1.groupx(_1) | Group the one field in the cursor to remove the duplicate rows |
3 | =file("e:/txt/all2.txt").export(A2) | Export the result to all2.txt. |
We still take the orders table as an example. It is orders.xlsx, a big file that contains orders of all years.
Example 1: Get the unique IDs of all customers that place an order and write them into customers.xlsx.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/orders.xlsx").xlsimport@tc() | Create cursor of the orders table |
2 | =A1.groupx(CustomerId) | Group by CustomerId to get different customers |
3 | =file("e:/txt/customers.xlsx").xlsexport@t(A2) | Write customer IDs to customers.xlsx |
Find the different products each customer bought, and write key fields CustomerId and ProductId to c_p.xlsx.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/orders.xlsx").xlsimport@tc() | Create cursor of the orders table |
2 | =A1.groupx(CustomerId,ProductId) | Group by key field to get the IDs and products of different customers |
3 | =file("e:/txt/c_p.xlsx").xlsexport@t(A2) | Export result to c_p.xlsx |
Example: Get records where the customer bought same product many times and export them to c_rebuy.xlsx.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =file("e:/txt/orders.xlsx").xlsimport@tc().sortx(CustomerId,ProductId) | Create cursor of the orders table and sort it by key field |
2 | =A1.group(CustomerId,ProductId) | Put records where same customer buys same product into one group |
3 | =A2.select(~.count()>1).conj() | Get groups where the number of records is greater than 1 and concatenate their records together |
4 | =file("e:/txt/c_rebuy.xlsx").xlsexport@t(A3) | Export A3’s result to c_rebuy.xlsx |
Find more examples in SPL CookBook
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code