
Debugging old batch processing tasks #84

Open · 8 of 9 tasks

thiagoyeds opened this issue Feb 12, 2020 · 13 comments
thiagoyeds (Contributor) commented Feb 12, 2020

There is currently a new deployment running on the LSD OpenStack cloud service that has reprocessed all of the old tasks already processed successfully by previous SAPS deployments. There are two main reasons for reprocessing them: first, there are N Swift containers storing the results of these tasks with different directory-tree layouts, which have changed over time; second, the algorithms currently used in the inputdownloading, preprocessing, and processing phases are more optimized. Over roughly two weeks, the SAPS instance of the LSD deployment generated the results of these tasks; however, some of them failed, which started this debugging work to find the causes and, where applicable, fix the algorithms so that the tasks reach the success state.

Link to the debugging worksheet for the old batch of 888 tasks: https://docs.google.com/spreadsheets/d/146qlrD564PtgeiHAL_sFOgmzE6-5JaT0M0hlgZDIDCo/edit?usp=sharing

Tasks:

  • Analyze results of submission 1
  • Describe the source (for example, noaa, ufcg, google) of each inputdownloading task with an error - by @fubica
  • Resubmit tasks ("download error", "index.csv.gz", "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl") - by @thiagomanel
  • Analyze results of submission 2
  • Resubmit tasks that still remained in error after submission 2 - by @thiagoyeds and @thiagomanel
  • Analyze results of submission 3
  • Resubmit all tasks still failing after submission 3 - by @thiagomanel
  • Analyze results of submission 4
  • For each type of error, describe the procedure that should be adopted to resolve it (if known) or what should be done to advance the investigation of its cause
thiagoyeds (Contributor, Author) commented Feb 24, 2020

Submission 1

@wesleymonte and I have finished analyzing the batch; here is the information about the failed tasks:

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | 9 |
| inputdownloading | index.csv.gz | 19 |
| inputdownloading | no download options | 34 |
| preprocessing | cannot open station file | 8 |
| preprocessing | cannot open MTL file | 2 |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | 2 |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | 22 |
| processing | points matrix | 10 |
| processing | cold pixel candidates | 1 |

Some of these errors are easily resolved through resubmission, such as:

  • download error (9)
  • index.csv.gz (19)
  • no download options (34)

Others may have been caused by failures in the NFS client mount or some other transient factor in the OpenStack service; resubmission may solve these as well:

  • cannot open station file (8)
  • cannot open MTL file (2)
  • unreadable mtl (2)

The cloud coverage problem is odd; it should not occur in any case. I suspect that the new mask condition added to the code (& fmask != 20480]) may have increased the computed cloud percentage and caused this problem.
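To illustrate the suspicion, here is a minimal sketch; the fmask codes and the original mask expression are assumptions, not the real preprocessing code:

```python
import numpy as np

# Appending another exclusion term ("& fmask != 20480") can only shrink the
# set of pixels counted as clear, so the computed cloud percentage can only
# stay the same or increase -- possibly past the rejection threshold.
CLOUD = 2800            # hypothetical "cloud" code
EXTRA_EXCLUDED = 20480  # the code newly excluded by the added condition

fmask = np.random.choice([2720, CLOUD, EXTRA_EXCLUDED], size=100_000,
                         p=[0.75, 0.15, 0.10])

clear_old = fmask != CLOUD                                # previous mask (assumed)
clear_new = (fmask != CLOUD) & (fmask != EXTRA_EXCLUDED)  # with the added term

print(100.0 * (1.0 - clear_old.mean()))  # ~15% of pixels treated as cloud
print(100.0 * (1.0 - clear_new.mean()))  # ~25% of pixels treated as cloud
```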

Finally, we have the most complicated problems, which need further study before a solution can be found:

  • v[] length zero (22)
  • points matrix (10)
  • cold pixel candidates (1)

Note: The rah cycle (timeout) problem is common in this processing approach; it may be solved by increasing the execution timeout that aborts the step (currently 2 hours).
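For reference, a minimal sketch of what raising that timeout could look like, assuming the processing step is launched as an external process (the command and the enforcement mechanism are assumptions, not necessarily how SAPS does it):

```python
import subprocess

RAH_TIMEOUT_SECONDS = 4 * 60 * 60  # e.g. raise the current 2h limit to 4h

def run_processing_step(command):
    """Run the processing step, aborting it after RAH_TIMEOUT_SECONDS."""
    try:
        subprocess.run(command, check=True, timeout=RAH_TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        # Same failure mode as the "rah cycle (timeout)" rows above,
        # just triggered later because of the larger limit.
        print("processing step exceeded the rah cycle timeout")
        raise
```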

How should we proceed, @thiagomanel?

thiagomanel (Member) commented:

Good job, @ThiagoWhispher and @wesleymonte.

I'd go for resubmitting the tasks you think would be solved by resubmission: "download error", "index.csv.gz", "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl".

Next week we can discuss the other, more complex, ones.

fubica (Member) commented Feb 24, 2020 via email:


It would also be nice to think about whether some of these errors could have been circumvented if we had some sort of retry mechanism in our code.

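A minimal sketch of such a retry mechanism (generic Python, not existing SAPS code; the operation it wraps is hypothetical):

```python
import time

def with_retries(operation, attempts=3, base_delay=30):
    """Run `operation`, retrying on failure with exponential backoff.

    `operation` stands for any step prone to transient failures, such as
    downloading index.csv.gz or reading files from the NFS mount.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {delay}s")
            time.sleep(delay)
```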

thiagomanel (Member) commented:

Indeed, @fubica.

@ThiagoWhispher, annotate in our spreadsheet the source (e.g., noaa, ufcg, google) of each inputdownloading task in error.

thiagomanel (Member) commented:

... these annotations should help us discuss Fubica's suggestion (or at least part of it).

thiagoyeds (Contributor, Author) commented Mar 2, 2020

Hi @thiagomanel @fubica,
For the task "describe the source (for example, noaa, ufcg, google) of each inputdownloading task with an error" requested by @fubica, we have the following:

| inputdownloading error | description | possible causes |
| --- | --- | --- |
| download error | a strange error: for the tasks with this problem there is no data recorded in the debug folder in permanent storage | problem in the initiation/execution of the worker launched by Arrebol for the download step; communication problem with the Google dataset |
| index.csv.gz | raised when trying to unzip the index.csv.gz file, which fails with `gzip: index.csv.gz: not in gzip format` | index.csv.gz corrupted during download; communication problem with the Google dataset; internet connection problem |
| no download options | raised when the satellite image files cannot be downloaded from the Google dataset through any of the options offered (the number of options varies and is extracted from the index.csv spreadsheet) | false negatives in the Google dataset response with the googleapis version (the inputdownloading script used to process this batch of tasks) |

Note: index.csv.gz is a spreadsheet with information about the satellite images available through the Google dataset (download link: https://storage.googleapis.com/gcp-public-data-landsat/index.csv.gz).
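For the "index.csv.gz" case, a minimal sketch of how the corrupted-download situation could be detected and retried before gunzip fails (Python, not the actual inputdownloading script):

```python
import gzip
import urllib.request

INDEX_URL = "https://storage.googleapis.com/gcp-public-data-landsat/index.csv.gz"

def download_index(path="index.csv.gz", attempts=3):
    """Download the Landsat index and verify it really is a gzip file,
    retrying when the download comes back corrupted
    (the "not in gzip format" case)."""
    for attempt in range(1, attempts + 1):
        urllib.request.urlretrieve(INDEX_URL, path)
        try:
            with gzip.open(path, "rb") as fh:
                fh.read(1024)  # raises if the file is not valid gzip
            return path
        except OSError:  # gzip.BadGzipFile is a subclass of OSError
            print(f"attempt {attempt}: corrupted index.csv.gz, retrying...")
    raise RuntimeError("index.csv.gz still corrupted after retries")
```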

thiagoyeds (Contributor, Author) commented:

Hi @thiagomanel,
The task "Resubmit tasks ("download error", "index.csv.gz", "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl")" requested by you has been started; the tasks are currently being reprocessed. After all of them have finished, I will perform a new analysis and bring new information. (I named this resubmission "submission 2".)

thiagoyeds (Contributor, Author) commented Mar 3, 2020

Hi again @thiagomanel,
I completed the task "Analyze results of submission 2"; here is the information about it:

Submission 2

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | **9 -> 2** |
| inputdownloading | index.csv.gz | **19 -> 1** |
| inputdownloading | no download options | **34 -> 5** |
| preprocessing | cannot open station file | **8 -> 6** |
| preprocessing | cannot open MTL file | **2 -> 0** |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | **2 -> 0** |
| preprocessing | unknown | **0 -> 1** |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | 22 |
| processing | points matrix | 10 |
| processing | cold pixel candidates | 1 |

First of all, remember that submission 2 is actually a resubmission, using the same step versions as submission 1, of the tasks that failed with the following errors:

  • download error
  • index.csv.gz
  • no download options
  • cannot open station file
  • cannot open MTL file
  • unreadable mtl

As you can see in the table (entries in bold), all of the errors mentioned above have reduced in frequency. I still believe that all of these cases can be solved just by resubmitting the tasks, except for the "cannot open station file" error, which may be something more serious.

This submission targeted 74 tasks with the aforementioned failures; as a general result, the number of successful tasks increased from 778 to 836, that is, 58 of the 74 succeeded.

Note that a new error called "unknown" appeared: the task's inputdownloading step completed successfully, but something went wrong in the preprocessing step that did not record any data in the task's debugging directory. Perhaps it will be solved by resubmission, but that is hard to say.

thiagoyeds (Contributor, Author) commented:

@thiagomanel agreed to resubmit the tasks that remained in error after submission 2. I will start the reprocessing and, after it finishes, I will do the analysis and post a new comment about it.

thiagoyeds (Contributor, Author) commented Mar 4, 2020

Submission 3

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | **2 -> 2** |
| inputdownloading | index.csv.gz | **1 -> 0** |
| inputdownloading | no download options | **5 -> 1** |
| preprocessing | cannot open station file | **6 -> 4** |
| preprocessing | cannot open MTL file | 0 |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | 0 |
| preprocessing | unknown | **1 -> 0** |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | 22 |
| processing | points matrix | 11 |
| processing | cold pixel candidates | 1 |

As you can see in the table (entries in bold), the frequency of some errors has been reduced. More details on each error:

  • The "unknown" problem was resolved and the task was successful.
  • The only "no download options" problem that remains is strange, not even on the EarthExplorer website I am finding it, but the version of googleapis (inputdownloading script) found some way to download, however fails to look for the QA band of the satellite image for the task.
  • The problems of "cannot open station file" have decreased, but I don't know if it is still possible to reduce them, this is something to be checked.
  • There are two very strange "download error" cases, I still have no idea what it might be, it should be studied more calmly.

This submission targeted 15 tasks with the aforementioned failures; as a general result, the number of successful tasks increased from 836 to 844, that is, 8 of the 15 succeeded.

thiagoyeds (Contributor, Author) commented:

In discussion with @thiagomanel, we decided to resubmit all tasks still failing after submission 3, whether the problem is one of the errors already addressed in submissions 2 and 3 or one of the more problematic ones. The goal is to check whether the processing makes any progress and to identify the cases that resubmission alone resolves.

thiagoyeds added this to the mid-april-2020 milestone on Mar 31, 2020
thiagoyeds (Contributor, Author) commented:

Submission 4

| step | description | count |
| --- | --- | --- |
| inputdownloading | download error | 2 |
| inputdownloading | index.csv.gz | 0 |
| inputdownloading | no download options | 1 |
| preprocessing | cannot open station file | **4 -> 3** |
| preprocessing | cannot open MTL file | 0 |
| preprocessing | cloud coverage | 1 |
| preprocessing | unreadable mtl | 0 |
| preprocessing | unknown | 0 |
| processing | rah cycle (timeout) | 2 |
| processing | v[] length zero | **22 -> 2** |
| processing | points matrix | **11 -> 10** |
| processing | cold pixel candidates | 1 |

As you can see in the table (entries in bold), the frequency of some errors has been reduced. More details on each error:

  • The only "no download options" problem that remains is strange, not even on the EarthExplorer website I am finding it, but the version of googleapis (inputdownloading script) found some way to download, however fails to look for the QA band of the satellite image for the task.
  • The problems of "cannot open station file" have decreased, but I don't know if it is still possible to reduce them, this is something to be checked.
  • There are two very strange "download error" cases, I still have no idea what it might be, it should be studied more calmly.
  • Regarding the problem of "cloud coverage", I don't know what happened, but this error should not occur in any case of this batch, however, it may have happened that this image was processed with some version that did not check the percentage of cloud making it proceed in the processing normally.
  • "v[] length zero" errors are drastically reduced only with resubmission, but the real cause is not yet known
  • There was 1 case of the "points matrix" error that was solved, I don't know why it happened

This submission targeted 44 tasks in the failure state; as a general result, the number of successful tasks increased from 844 to 866, that is, 22 of the 44 succeeded.

thiagoyeds (Contributor, Author) commented:

Error handling solutions or procedures

| description | solution or procedure |
| --- | --- |
| download error | analyze the metadata table used by the googleapis version and verify the data integrity of each of the failed tasks |
| no download options | use the usgsapis version instead of googleapis, because the Google dataset is missing the QA band for the task that failed |
| cannot open station file | maybe resubmission, but ideally run each step one at a time in the localpipeline and check the final files to find the point of failure |
| cloud coverage | nothing to do here; I believe the only task that fell into this case was processed with an algorithm version that had no cloud-percentage validation (first check whether the image really has high cloud coverage) |
| rah cycle (timeout) | a somewhat tedious case; maybe try running only one task at a time (we have 3 workers, so 1 would process the task and the other 2 would stay idle) to prevent the cloud resources from being shared and creating some kind of bottleneck in the processing |
| v[] length zero | with the various resubmissions done, the number of these cases dropped from 22 to 2; I believe resubmission can solve the remaining cases |
| points matrix / cold pixel candidates | this will be the most problematic case: first consult Italo, then try isolated resubmission (1 worker working, 2 idle), and finally run the localpipeline step by step and investigate the cause more carefully |
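For the "no download options" procedure, a minimal sketch of how the fallback to usgsapis could be decided; the BASE_URL source and the `<SCENE_ID>_BQA.TIF` naming convention are assumptions, not verified against the SAPS scripts:

```python
import urllib.error
import urllib.request

def has_qa_band(base_url: str, scene_id: str) -> bool:
    """Return True if the scene's QA band is present in the Google dataset.

    Assumes base_url is the scene's BASE_URL taken from index.csv (gs:// form)
    and that the QA band is named '<SCENE_ID>_BQA.TIF' (both assumptions).
    """
    http_base = base_url.replace("gs://", "https://storage.googleapis.com/")
    url = f"{http_base.rstrip('/')}/{scene_id}_BQA.TIF"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200
    except urllib.error.URLError:  # 404, other HTTP errors, network failures
        return False

def choose_downloader(base_url: str, scene_id: str) -> str:
    """Hypothetical selection: prefer googleapis, fall back to usgsapis."""
    return "googleapis" if has_qa_band(base_url, scene_id) else "usgsapis"
```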
