Zingg 0.4.0 in Docker consumes lots of disk space #893

Open
iqoOopi opened this issue Sep 15, 2024 · 26 comments
Assignees
Labels
question Further information is requested

Comments

@iqoOopi commented Sep 15, 2024

I'm running Zingg in Docker on 2.7M records, matching on 10 columns, on a 6-core, 32 GB RAM machine. It has been running for 14 hours now and has already consumed 230 GB of disk space.

For the past few hours it looks like it has been writing to rdd_71_1 nonstop.

Is this normal?

[4 screenshots: disk usage and continuous writes to rdd_71_1]

@iqoOopi iqoOopi added the question Further information is requested label Sep 15, 2024
@sonalgoyal (Member)

That should not happen. What is the labelDataSampleSize, and how many matches have you labeled?

@iqoOopi (Author) commented Sep 15, 2024

So I trained the model on 60K records first and ran the match on those 60K. Everything worked and the results were accurate. labelDataSampleSize for the 60K records was 0.1, and the number of labeled matches was roughly 60 records.

Then I wanted to try 2.7M records (same table schema, just more records), so I changed labelDataSampleSize to 0.001. The findAndLabel command works fine, but the match never finishes and took all my disk space (more than 300 GB).

@sonalgoyal (Member)

It is better to run train on a data size close to the one you want to run match on. Can you please run train with the bigger 2.7M dataset and try match after that?

@iqoOopi (Author) commented Sep 15, 2024

Oh, I forgot to mention: after I switched to 2.7M, I re-ran a few rounds of findAndLabel and retrained the model as well. Then I started the match.

@sonalgoyal (Member)

OK. It seems the blocking model has not been trained well. Zingg jobs are more compute intensive than memory intensive, and the training samples help learn the blocking model, which is what parallelises the job across cores. Do you have a lot of null values in your dataset which may be getting clubbed together?

@iqoOopi (Author) commented Sep 15, 2024

Yes, each record has 15 fields, and all of them are nullable. I have quite a lot of records that only have values for 2 fields (firstName, phone), while the other 13 fields are null.
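To make the "zero signal" point below concrete: a quick standalone Python sketch (nothing Zingg-specific; the field names and threshold are illustrative) that estimates how many records carry too few populated fields to block on. Records like these tend to collapse into one giant candidate block, which can show up as a single huge task spilling to disk:

```python
# Hypothetical diagnostic, not part of Zingg: count how many match fields
# in each record actually carry a value.
MATCH_FIELDS = ["firstName", "lastName", "phone", "email", "dob"]  # illustrative subset

def non_null_ratio(record, fields=MATCH_FIELDS):
    """Fraction of match fields that are populated (not None / empty)."""
    filled = sum(1 for f in fields if record.get(f) not in (None, ""))
    return filled / len(fields)

def low_signal_records(records, threshold=0.5):
    """Records below the threshold give the blocking model little to work with."""
    return [r for r in records if non_null_ratio(r) < threshold]

records = [
    {"firstName": "Ann", "phone": "555-0100"},                      # 2 of 5 fields filled
    {"firstName": "Bob", "lastName": "Lee", "email": "b@x.com",
     "phone": "555-0101", "dob": "1990-01-01"},                     # all 5 filled
]
print(len(low_signal_records(records)))  # → 1
```

Running something like this over a sample of the 2.7M rows would show what share of the data is in the sparse "firstName + phone only" shape described above.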

@sonalgoyal (Member)

Ah. That’s a tricky one. Are the null fields important to the match?

@iqoOopi (Author) commented Sep 15, 2024

Yes, they are important when they have a value, like SIN number, DOB, email address, etc.

@sonalgoyal (Member)

I suspect you will need a lot more training data covering matches across these null-value combinations. Nulls are tough to block on, as they carry zero signal. Maybe add more labeled pairs through trainingSamples and see if that changes things?
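One illustrative shape for such extra labeled pairs, as a CSV (the z_cluster/z_isMatch column convention follows Zingg's labeled-data output, but verify the exact format against your version's docs; all names and values here are made up). The idea is to deliberately include sparse records so the model sees matches and non-matches among null-heavy rows:

```
z_cluster,z_isMatch,firstName,phone,email
0,1,Ann,5550100,
0,1,Anne,5550100,
1,0,Bob,5550100,bob@x.com
1,0,Rob,5550199,
```

Here cluster 0 is a labeled match on phone alone with email null, and cluster 1 a labeled non-match despite a shared field, which is the kind of signal blocking needs for this dataset.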

@iqoOopi (Author) commented Sep 15, 2024

> Seems it's not trained the blocking model.

Will try training more. Btw, how did you figure out it is not using the blocking model?

@sonalgoyal (Member)

I have seen it in the past on one dataset which did not have a lot of values populated. If there is no signal, it is hard to learn. I think we should build some way to let users know this while running Zingg.

@sonalgoyal sonalgoyal self-assigned this Sep 15, 2024
@iqoOopi (Author) commented Sep 16, 2024

[screenshot: warning messages in the Zingg log]

Also, when running Zingg, how can I tell whether it is actually analyzing anything? There is nothing in the log, so it is hard to tell whether it is working normally or whether I should abort the current task.

@sonalgoyal (Member)

These warnings are ok. If the logs are not moving at all, that may be a sign.

@sonalgoyal (Member)

If you are familiar with Spark, you could look at the Spark GUI.

@iqoOopi (Author) commented Sep 16, 2024

Btw, in the config.json file I have a "trainingSamples" and a "data" section, both pointing to SQL Server tables. Does the schema order matter? In trainingSamples I have "schema": "A string, B string, C string", but in the data section the schema is "A string, C string, B string". Since they are SQL tables, I think the order should not matter; I just want to confirm.

@iqoOopi (Author) commented Sep 16, 2024

[2 screenshots: Spark GUI showing one active job with 3 tasks]

I just restarted the match after retraining with around 100 labeled matches. In the Spark GUI I see only 1 active job with 3 tasks. Is this normal?

@sonalgoyal (Member)

What is the numPartitions setting? For better parallelisation, you want it to be 4-5 times your number of cores.

@iqoOopi (Author) commented Sep 16, 2024

Ah, thanks @sonalgoyal for the quick reply. In the config.json file it is only 4. My CPU has 4 cores / 8 threads (the Docker interface shows 8 CPUs), so should I give it 4 × 5 or 8 × 5? Also, in the zingg.conf file I saw a "spark.default.parallelism" setting commented out; what should that value be?
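For the arithmetic: with 8 logical CPUs visible to Docker, the 4-5× guideline gives roughly 32-40 partitions. An illustrative pair of settings (key names are the ones mentioned in this thread; the exact value is a tuning choice, not prescribed by Zingg):

```json
{
  "numPartitions": 40
}
```

and, in zingg.conf, uncommenting the corresponding Spark setting to match:

```
spark.default.parallelism=40
```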

@iqoOopi (Author) commented Sep 16, 2024

> Btw, in the config.json file I have "trainingSamples" and "data" sections, both pointing to SQL Server tables... does the schema order matter?

@sonalgoyal, also, how about this question?

@iqoOopi (Author) commented Sep 16, 2024

I set both numPartitions and spark.default.parallelism to 20. Does this look normal?

[3 screenshots: Spark GUI after changing the partition settings]

@sonalgoyal (Member)

> Btw, in the config.json file I have "trainingSamples" and "data" sections, both pointing to SQL Server tables... does the schema order matter?
>
> @sonalgoyal, also, how about this question?

I think the SQL Server dataframe should be read correctly, but I am not 100% sure, as that's a case we have not tested. Are you seeing an issue?

@iqoOopi (Author) commented Sep 16, 2024

> I think the SQL Server dataframe should be read correctly, but I am not 100% sure, as that's a case we have not tested. Are you seeing an issue?

No, I have not seen any issues. From the findAndLabel command results, I can see the models are doing their job.

@sonalgoyal (Member)

how did it go @iqoOopi ?

@iqoOopi (Author) commented Oct 2, 2024

No success yet; it never finished the scan of our 2.7M records.

@sonalgoyal (Member)

can you share the complete logs?

@sonalgoyal (Member)

This is likely a case of a poorly formed blocking model. This can happen due to too little training data, but the user has mentioned that they have labelled sufficiently. Hard to say more without logs or sample data. @iqoOopi, would you be open to a debug session on this?


2 participants