-
Notifications
You must be signed in to change notification settings - Fork 339
SPL Practice:Data Comparison between Different Types of Databases
Data comparison between databases of different types means comparing data between two tables having same logical structures in different types of databases to find the differences.
The difficulty of data comparison needs to overcome is the differences of data types and the data processing approaches between different database products. The problem is that different results will be obtained after the same data is imported to and exported from different databases.
Common data types mainly include numeric, string and date. Now we use Oracle and MySQL to respectively create data tables that both contain all the three commonly seen data types and have same simple logical structure, compare differences of the two databases, and explain how to compare data between different types of databases using SPL.
Create a data table in Oracle:
CREATE TABLE "TEST001" (
"F1" NUMBER(2, 0),
"F2" VARCHAR2(100),
"F3" DATE,
"F4" BINARY_DOUBLE
)
Create a data table in MySQL:
CREATE TABLE `test001` (
`f1` int,
`f2` varchar(100),
`f3` datetime,
`f4` decimal(10,2)
)
SPL code of generating the test data and importing it to the database according to the above structures:
A | |
---|---|
1 | =rand@s(1) |
2 | =10.new(~:f1,rands("qwertyuiopasdfghjklzxcvbnm",5):f2,datetime("2011-11-11 11:11:11"):f3,rand(99999)/100:f4) |
3 | >A2(3).run(#2=null),A2(5).run(#2="") |
4 | =connect@l("mysql8") |
5 | =A4.update(A2,test001,f1,f2,f3,f4;f1) |
6 | >A4.close() |
7 | =connect@l("oracle12c") |
8 | =A7.update(A2,test001,f1,f2,f3,f4;f1) |
9 | =A7.close() |
Below is data imported to the database:
Oracle:
f1 | f2 | f3 | f4 |
---|---|---|---|
1 | kqipd | 2011-11-11 11:11:11 | 156.82 |
2 | xvvak | 2011-11-11 11:11:11 | 17.53 |
3 | null | 2011-11-11 11:11:11 | 708.24 |
4 | twtgq | 2011-11-11 11:11:11 | 825.2 |
5 | null | 2011-11-11 11:11:11 | 628.82 |
6 | hdjeb | 2011-11-11 11:11:11 | 3.12 |
7 | zurnb | 2011-11-11 11:11:11 | 463.28 |
8 | khcdv | 2011-11-11 11:11:11 | 903.9 |
9 | dhdhn | 2011-11-11 11:11:11 | 956.73 |
10 | vkacd | 2011-11-11 11:11:11 | 647.69 |
Myqsl:
f1 | f2 | f3 | f4 |
---|---|---|---|
1 | kqipd | 2011-11-11 11:11:11 | 156.82 |
2 | xvvak | 2011-11-11 11:11:11 | 17.53 |
3 | null | 2011-11-11 11:11:11 | 708.24 |
4 | twtgq | 2011-11-11 11:11:11 | 825.2 |
5 | 2011-11-11 11:11:11 | 628.82 | |
6 | hdjeb | 2011-11-11 11:11:11 | 3.12 |
7 | zurnb | 2011-11-11 11:11:11 | 463.28 |
8 | khcdv | 2011-11-11 11:11:11 | 903.9 |
9 | dhdhn | 2011-11-11 11:11:11 | 956.73 |
10 | vkacd | 2011-11-11 11:11:11 | 647.69 |
In Oracle, the record where f1 field is 5 has null value instead of an empty string under f2. This is because Oracle does not distinguish the empty string (‘’) from null.
First convert the empty string in MySQL to null and then make the comparison; otherwise, data in both sides will never be equal. If the field name is already known, we can perform the conversion directly in SQL. For example, select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1.
When only the table name is known but data structure is unknown, we can first write SQL to return the table’s structure (column names and data types) and then compose the SQL of converting the empty string to null in SPL. Below is the code:
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'") |
3 | =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c() |
4 | =A1.query@x("select"/A3/"from test001") |
This refers to the comparison scenarios where the size of data to be compared is relatively small and can fit into the memory.
Still use the above data structure and suppose f1 is the primary key, the data comparison code is as follows:
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1") |
3 | =connect@l("oracle12c") |
4 | =A3.query@x("select * from test001 order by 1") |
5 | =join@f(A2:t1,f1;A4:t2,f1) |
A5 performs full join on A2’s table sequence and A4’s table sequence according to f1 (the primary key).
The result shows that no records are intersected. Yet as f1 values in both sides are from 1 to 10, the two tables should be equal. What is the reason behind this?
Because, by default, SPL’s join() uses hash values to make the comparison. Oracle only has the number type. Though we can specify the range of numbers through the parameter defined using DDL number (for determining the corresponding data type in Java), Oracle JDBC will not handle the data type but always return the decimal type. As a result, the f1 values on both sides are equal but they have different data types as well as different hash values.
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1") |
3 | =connect@l("oracle12c") |
4 | =A3.query@x("select * from test001 order by 1").run(#1=int(#1)) |
5 | =join@f(A2:t1,f1;A4:t2,f1) |
A4 converts f1 values of decimal type to the int type.
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1") |
3 | =connect@l("oracle12c") |
4 | =A3.query@x("select * from test001 order by 1") |
5 | =join@fm(A2:t1,f1;A4:t2,f1) |
As both tables are ordered by f1, A5 uses the merge join to compare numeric values instead of hash values.
Then we full-join them to view the newly-added records, the deleted ones and the modified ones. Below is the code:
A | |
---|---|
6 | =A5.select(!t1).(t2) |
7 | =A5.select(!t2).(t1) |
8 | =A5.select(t1 && t2 && (${A2.fname().("t1."/~/"!=t2."/~).concat("||")})) |
A6 finds records that exist in t1 but that do not exist in t2.
A7 finds records that do not exist in t2 but that exist in t1.
A8 finds records of t1 and t2 that have same primary key values but that have different values in other fields.
When no primary keys are specified for data tables to be compared, we need to compare the whole columns of data:
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'") |
3 | =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c() |
4 | =A1.query@x("select"/A3/"from test001 order by"/A2.(#).concat@c()) |
5 | =connect@l("oracle12c") |
6 | =A5.query("SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME ='TEST001'") |
7 | =A5.query@x("select"/A6.(#1).concat@c()/"from test001 order by"/A6.(#).concat@c()) |
8 | =join@mf(A4:t1,${A4.fname().concat@c()};A7:t2,${A7.fname().concat@c()}) |
9 | =A8.select(!t1).(t2) |
10 | =A8.select(!t2).(t1) |
As no primary key is specified, we just find out the records on both sides where there are any fields whose values are not equal.
A6 finds records that do not exist in t1 but that exist in t2.
A7 finds records that do not exist in t2 but that exist in t1.
This refers to the comparison scenarios where the size of data to be compared is too large to fit into the memory.
As the cursor is traversed once, we need to use SPL’s channel functionality in order to find the added, deleted and modified records at a time. The code is as follows:
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'") |
3 | =A1.cursor@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1") |
4 | =connect@l("oracle12c") |
5 | =A4.cursor@x("select * from test001 order by 1") |
6 | =joinx@f(A3:t1,f1;A5:t2,f1) |
7 | =channel(A6) |
8 | =channel(A6) |
9 | =channel(A6) |
10 | >A7.select(!t1).(t2) |
11 | >A7.fetch() |
12 | >A8.select(!t2).(t1) |
13 | >A8.fetch() |
14 | >A9.select(t1 && t2 && (${A2.(#1).("t1."/~/"!=t2."/~).concat("||")})) |
15 | >A9.fetch() |
16 | =A6.skip() |
17 | =A7.result() |
18 | =A8.result() |
19 | =A9.result() |
Unlike data tables (table sequences) having small volume of data, joinx(), be default, requires that the cursors be ordered by the joining fields and will not use hash value comparison method. This means result error due to different data types can be avoided. The cs.group() function is similar. It also requires that the cursors be ordered by the grouping fields by default and will not perform hash value comparisons.
A17 finds records that do not exist in t1 but that exist in t2.
A18 finds records that do not exist in t2 but that exist in t1.
A19 finds records of t1 and t2 that have same primary key values but that have different values in other fields.
We still use channel to make the comparison. The code is as follows:
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'") |
3 | =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c() |
4 | =A1.cursor@x("select"/A3/"from test001 order by"/A2.(#).concat@c()) |
5 | =connect@l("oracle12c") |
6 | =A5.query("SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME ='TEST001'") |
7 | =A5.cursor@x("select"/A6.(#1).concat@c()/"from test001 order by"/A6.(#).concat@c()) |
8 | =joinx@f(A4:t1,${A2.(#1).concat@c()};A7:t2,${A2.(#1).concat@c()}) |
9 | =channel(A8) |
10 | =channel(A8) |
11 | >A9.select(!t1).(t2) |
12 | >A9.fetch() |
13 | >A10.select(!t2).(t1) |
14 | >A10.fetch() |
15 | =A8.skip() |
16 | =A9.result() |
17 | =A10.result() |
A16 finds records that do not exist in t1 but that exist in t2.
A17 finds records that do not exist in t2 but that exist in t1.
ch.result returns a table sequence. When the comparison result is also large, we can export the result to a file to avoid that there is not enough memory, as shown below:
A | |
---|---|
1 | =connect@l("mysql8") |
2 | =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'") |
3 | =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c() |
4 | =A1.cursor@x("select"/A3/"from test001 order by"/A2.(#).concat@c()) |
5 | =connect@l("oracle12c") |
6 | =A5.query("SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME ='TEST001'") |
7 | =A5.cursor@x("select"/A6.(#1).concat@c()/"from test001 order by"/A6.(#).concat@c()) |
8 | =joinx@f(A4:t1,${A2.(#1).concat@c()};A7:t2,${A2.(#1).concat@c()}) |
9 | =channel(A8) |
10 | =channel(A8) |
11 | >A9.select(!t1).(t2) |
12 | =A9.fetch(file("t2.b")) |
13 | >A10.select(!t2).(t1) |
14 | >A10.fetch(file("t1.b")) |
15 | =A8.skip() |
In ch.fetch(f) in both A12 and A14, f is a bin file.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code