
Debugging old batch processing tasks #84

Open · 8 of 9 tasks

thiagoyeds opened this issue Feb 12, 2020 · 13 comments
thiagoyeds (Contributor) commented Feb 12, 2020

There is currently a new deployment running on the LSD OpenStack cloud service that has reprocessed all of the old tasks already processed successfully by previous SAPS deployments. There are two main reasons for reprocessing them: first, there are N Swift containers storing the results of these tasks with different directory-tree layouts, which have changed over time; second, the algorithms currently used in the inputdownloading, preprocessing, and processing phases are more optimized. Over roughly two weeks, the SAPS instance of the LSD deployment generated the results of these tasks; however, some of them failed, which started this debugging work to find the causes and, where applicable, fix the algorithms so that the tasks reach the success state.

Link to the debugging worksheet for the old batch of 888 tasks: https://docs.google.com/spreadsheets/d/146qlrD564PtgeiHAL_sFOgmzE6-5JaT0M0hlgZDIDCo/edit?usp=sharing

Tasks:

  • Analyze results of submission 1
  • Describe the source (for example, noaa, ufcg, google) of each inputdownloading task with an error - by @fubica
  • Resubmit tasks ("download error", "index.csv.gz", "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl") - by @thiagomanel
  • Analyze results of submission 2
  • Resubmit tasks that still remained in error after submission 2 - by @thiagoyeds and @thiagomanel
  • Analyze results of submission 3
  • Resubmit all tasks still failing after submission 3 - by @thiagomanel
  • Analyze results of submission 4
  • For each type of error, describe the procedure that should be adopted to resolve it (if known) or what should be done to advance the investigation of its cause
thiagoyeds (Contributor, Author) commented Feb 24, 2020

Submission 1

@wesleymonte and I have finished analyzing the batch; here is the information about the failed tasks:

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | 9 |
| inputdownloading | index.csv.gz | 19 |
| inputdownloading | no download options | 34 |
| preprocessing | cannot open station file | 8 |
| preprocessing | cannot open MTL file | 2 |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | 2 |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | 22 |
| processing | points matrix | 10 |
| processing | cold pixel candidates | 1 |

Some of these errors are easily resolved through resubmission, such as:

  • download error (9)
  • index.csv.gz (19)
  • no download options (34)

Others may have been caused by failures in the NFS client mount or some other transient factor in the OpenStack service; resubmission may solve these as well:

  • cannot open station file (8)
  • cannot open MTL file (2)
  • unreadable mtl (2)

The cloud coverage problem is odd; it should not occur in any case. I suspect that the new mask condition added to the code (& fmask != 20480]) may have increased the computed cloud percentage and caused this problem.
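To illustrate the suspicion, here is a minimal sketch; the fmask codes and the original mask expression are assumptions, not the real preprocessing code:

```python
import numpy as np

# Appending another exclusion term ("& fmask != 20480") can only shrink the
# set of pixels counted as clear, so the computed cloud percentage can only
# stay the same or increase -- possibly past the rejection threshold.
CLOUD = 2800            # hypothetical "cloud" code
EXTRA_EXCLUDED = 20480  # the code newly excluded by the added condition

fmask = np.random.choice([2720, CLOUD, EXTRA_EXCLUDED], size=100_000,
                         p=[0.75, 0.15, 0.10])

clear_old = fmask != CLOUD                                # previous mask (assumed)
clear_new = (fmask != CLOUD) & (fmask != EXTRA_EXCLUDED)  # with the added term

print(100.0 * (1.0 - clear_old.mean()))  # ~15% of pixels treated as cloud
print(100.0 * (1.0 - clear_new.mean()))  # ~25% of pixels treated as cloud
```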

Finally, we have the most complicated problems, which need further study before a solution can be found:

  • v[] length zero (22)
  • points matrix (10)
  • cold pixel candidates (1)

Note: The rah cycle (timeout) problem is common in this processing approach; it may be solved by increasing the execution timeout that aborts the step (currently 2 hours).
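For reference, a minimal sketch of what raising that timeout could look like, assuming the processing step is launched as an external process (the command and the enforcement mechanism are assumptions, not necessarily how SAPS does it):

```python
import subprocess

RAH_TIMEOUT_SECONDS = 4 * 60 * 60  # e.g. raise the current 2h limit to 4h

def run_processing_step(command):
    """Run the processing step, aborting it after RAH_TIMEOUT_SECONDS."""
    try:
        subprocess.run(command, check=True, timeout=RAH_TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        # Same failure mode as the "rah cycle (timeout)" rows above,
        # just triggered later because of the larger limit.
        print("processing step exceeded the rah cycle timeout")
        raise
```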

How should we proceed, @thiagomanel?

thiagomanel (Member) commented:

Good job, @ThiagoWhispher and @wesleymonte.

I'd go for resubmitting the tasks you think would be solved by resubmission: "download error", "index.csv.gz", "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl".

Next week we can discuss the other, more complex, ones.

fubica (Member) commented Feb 24, 2020 via email:


It would also be nice to think about whether some of these errors could have been circumvented if we had some sort of retry mechanism in our code.

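A minimal sketch of such a retry mechanism (generic Python, not existing SAPS code; the operation it wraps is hypothetical):

```python
import time

def with_retries(operation, attempts=3, base_delay=30):
    """Run `operation`, retrying on failure with exponential backoff.

    `operation` stands for any step prone to transient failures, such as
    downloading index.csv.gz or reading files from the NFS mount.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {delay}s")
            time.sleep(delay)
```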

thiagomanel (Member) commented:

Indeed, @fubica.

@ThiagoWhispher, annotate in our spreadsheet the source (e.g., noaa, ufcg, google) of each inputdownloading task in error.

thiagomanel (Member) commented:

... these annotations should help us discuss Fubica's suggestion (or at least part of it).

thiagoyeds (Contributor, Author) commented Mar 2, 2020

Hi @thiagomanel @fubica,
For the task "describe the source (for example, noaa, ufcg, google) of each inputdownloading task with an error" requested by @fubica, we have the following:

| inputdownloading error | description | possible causes |
| --- | --- | --- |
| download error | a strange error: for the tasks with this problem there is no data recorded in the debug folder in permanent storage | problem in the initiation/execution of the worker launched by Arrebol for the download step; communication problem with the Google dataset |
| index.csv.gz | raised when trying to unzip the index.csv.gz file, which fails with `gzip: index.csv.gz: not in gzip format` | index.csv.gz corrupted during download; communication problem with the Google dataset; internet connection problem |
| no download options | raised when the satellite image files cannot be downloaded from the Google dataset through any of the options offered (the number of options varies and is extracted from the index.csv spreadsheet) | false negatives in the Google dataset response with the googleapis version (the inputdownloading script used to process this batch of tasks) |

Note: index.csv.gz is a spreadsheet with information about the satellite images available through the Google dataset (download link: https://storage.googleapis.com/gcp-public-data-landsat/index.csv.gz).
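For the "index.csv.gz" case, a minimal sketch of how the corrupted-download situation could be detected and retried before gunzip fails (Python, not the actual inputdownloading script):

```python
import gzip
import urllib.request

INDEX_URL = "https://storage.googleapis.com/gcp-public-data-landsat/index.csv.gz"

def download_index(path="index.csv.gz", attempts=3):
    """Download the Landsat index and verify it really is a gzip file,
    retrying when the download comes back corrupted
    (the "not in gzip format" case)."""
    for attempt in range(1, attempts + 1):
        urllib.request.urlretrieve(INDEX_URL, path)
        try:
            with gzip.open(path, "rb") as fh:
                fh.read(1024)  # raises if the file is not valid gzip
            return path
        except OSError:  # gzip.BadGzipFile is a subclass of OSError
            print(f"attempt {attempt}: corrupted index.csv.gz, retrying...")
    raise RuntimeError("index.csv.gz still corrupted after retries")
```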

thiagoyeds (Contributor, Author) commented:

Hi @thiagomanel,
The task "Resubmit tasks ("download error", "index.csv.gz", "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl")" requested by you has been started; the tasks are currently being reprocessed. After all of them have finished, I will perform a new analysis and bring new information. (I named this resubmission "submission 2".)

thiagoyeds (Contributor, Author) commented Mar 3, 2020

Hi again @thiagomanel,
I completed the task "Analyze results of submission 2"; here is the information about it:

Submission 2

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | **9 -> 2** |
| inputdownloading | index.csv.gz | **19 -> 1** |
| inputdownloading | no download options | **34 -> 5** |
| preprocessing | cannot open station file | **8 -> 6** |
| preprocessing | cannot open MTL file | **2 -> 0** |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | **2 -> 0** |
| preprocessing | unknown | **0 -> 1** |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | 22 |
| processing | points matrix | 10 |
| processing | cold pixel candidates | 1 |

First of all, remember that submission 2 is actually a resubmission, using the same step versions as submission 1, of the tasks that failed with the following errors:

  • download error
  • index.csv.gz
  • no download options
  • cannot open station file
  • cannot open MTL file
  • unreadable mtl

As you can see in the table (entries in bold), all of the errors mentioned above have reduced in frequency. I still believe that all of these cases can be solved just by resubmitting the tasks, except for the "cannot open station file" error, which may be something more serious.

This submission targeted 74 tasks with the aforementioned failures; as a general result, the number of successful tasks increased from 778 to 836, that is, 58 of the 74 succeeded.

Note that a new error called "unknown" appeared: the task's inputdownloading step completed successfully, but something went wrong in the preprocessing step that did not record any data in the task's debugging directory. Perhaps it will be solved by resubmission, but that is hard to say.

thiagoyeds (Contributor, Author) commented:

@thiagomanel agreed to resubmit the tasks that remained in error after submission 2. I will start the reprocessing and, after it finishes, I will do the analysis and post a new comment about it.

thiagoyeds (Contributor, Author) commented Mar 4, 2020

Submission 3

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | **2 -> 2** |
| inputdownloading | index.csv.gz | **1 -> 0** |
| inputdownloading | no download options | **5 -> 1** |
| preprocessing | cannot open station file | **6 -> 4** |
| preprocessing | cannot open MTL file | 0 |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | 0 |
| preprocessing | unknown | **1 -> 0** |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | 22 |
| processing | points matrix | 11 |
| processing | cold pixel candidates | 1 |

As you can see in the table (entries in bold), the frequency of some errors has been reduced. More details on each error:

  • The "unknown" problem was resolved and the task was successful.
  • The only "no download options" problem that remains is strange, not even on the EarthExplorer website I am finding it, but the version of googleapis (inputdownloading script) found some way to download, however fails to look for the QA band of the satellite image for the task.
  • The problems of "cannot open station file" have decreased, but I don't know if it is still possible to reduce them, this is something to be checked.
  • There are two very strange "download error" cases, I still have no idea what it might be, it should be studied more calmly.

This submission targeted 15 tasks with the aforementioned failures; as a general result, the number of successful tasks increased from 836 to 844, that is, 8 of the 15 succeeded.

thiagoyeds (Contributor, Author) commented:

In discussion with @thiagomanel, we decided to resubmit all tasks still failing after submission 3, whether the problem is one of the errors already addressed in submissions 2 and 3 or one of the more problematic ones. The goal is to check whether the processing makes any progress and to identify the cases that resubmission alone resolves.

thiagoyeds added this to the mid-april-2020 milestone on Mar 31, 2020
thiagoyeds (Contributor, Author) commented:

Submission 4

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | 2 |
| inputdownloading | index.csv.gz | 0 |
| inputdownloading | no download options | 1 |
| preprocessing | cannot open station file | **4 -> 3** |
| preprocessing | cannot open MTL file | 0 |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | 0 |
| preprocessing | unknown | 0 |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | **22 -> 2** |
| processing | points matrix | **11 -> 10** |
| processing | cold pixel candidates | 1 |

As you can see in the table (entries in bold), the frequency of some errors has been reduced. More details on each error:

  • The only "no download options" problem that remains is strange, not even on the EarthExplorer website I am finding it, but the version of googleapis (inputdownloading script) found some way to download, however fails to look for the QA band of the satellite image for the task.
  • The problems of "cannot open station file" have decreased, but I don't know if it is still possible to reduce them, this is something to be checked.
  • There are two very strange "download error" cases, I still have no idea what it might be, it should be studied more calmly.
  • Regarding the problem of "cloud coverage", I don't know what happened, but this error should not occur in any case of this batch, however, it may have happened that this image was processed with some version that did not check the percentage of cloud making it proceed in the processing normally.
  • "v[] length zero" errors are drastically reduced only with resubmission, but the real cause is not yet known
  • There was 1 case of the "points matrix" error that was solved, I don't know why it happened

This submission targeted 44 tasks in the failure state; as a general result, the number of successful tasks increased from 844 to 866, that is, 22 of the 44 succeeded.

thiagoyeds (Contributor, Author) commented:

Error handling solutions or procedures

| description | solution or procedure |
| --- | --- |
| download error | analyze the metadata table used by the googleapis version and verify the data integrity of each of the failed tasks |
| no download options | use the usgsapis version instead of googleapis, because the Google dataset is missing the QA band for the task that failed |
| cannot open station file | maybe resubmission, but ideally run each step one at a time in the localpipeline and check the final files to find the point of failure |
| cloud coverage | nothing to do here; I believe the only task that fell into this case was processed with an algorithm version that had no cloud-percentage validation (first check whether the image really has high cloud coverage) |
| rah cycle (timeout) | a somewhat tedious case; maybe try running only one task at a time (we have 3 workers, so 1 would process the task and the other 2 would stay idle) to prevent the cloud resources from being shared and creating some kind of bottleneck in the processing |
| v[] length zero | with the various resubmissions done, the number of these cases dropped from 22 to 2; I believe resubmission can solve the remaining cases |
| points matrix / cold pixel candidates | this will be the most problematic case: first consult Italo, then try isolated resubmission (1 worker working, 2 idle), and finally run the localpipeline step by step and investigate the cause more carefully |
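For the "no download options" procedure, a minimal sketch of how the fallback to usgsapis could be decided; the BASE_URL source and the `<SCENE_ID>_BQA.TIF` naming convention are assumptions, not verified against the SAPS scripts:

```python
import urllib.error
import urllib.request

def has_qa_band(base_url: str, scene_id: str) -> bool:
    """Return True if the scene's QA band is present in the Google dataset.

    Assumes base_url is the scene's BASE_URL taken from index.csv (gs:// form)
    and that the QA band is named '<SCENE_ID>_BQA.TIF' (both assumptions).
    """
    http_base = base_url.replace("gs://", "https://storage.googleapis.com/")
    url = f"{http_base.rstrip('/')}/{scene_id}_BQA.TIF"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200
    except urllib.error.URLError:  # 404, other HTTP errors, network failures
        return False

def choose_downloader(base_url: str, scene_id: str) -> str:
    """Hypothetical selection: prefer googleapis, fall back to usgsapis."""
    return "googleapis" if has_qa_band(base_url, scene_id) else "usgsapis"
```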
