Debugging old batch processing tasks #84
Submission 1
@wesleymonte and I have finished analyzing the batch; here is the information about the failed tasks:
Some of these errors are easily resolved through resubmission, such as:
There are some that may have been caused by flaws in the NFS client mount or some other momentary factor in the OpenStack service; resubmission may well solve these:
The cloud coverage problem is odd; it should not occur in any case. I suspect the newly added mask (
And then we have the most complicated problems, which need further study to find their solution:
Note: the rah cycle (timeout) problem is common in this processing approach; it may be solved by increasing the execution timeout trigger (currently 2 hours). How are we going to proceed, @thiagomanel? |
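Since the rah cycle failure is a timeout, one remedy is making the step timeout configurable rather than fixed. A minimal sketch, assuming a step is launched as a subprocess; all names and values here are hypothetical, not taken from the SAPS codebase:

```python
import subprocess

# Illustrative: run one processing step with a configurable timeout
# instead of a hard-coded 2-hour limit. Names are hypothetical.
RAH_CYCLE_TIMEOUT_SECONDS = 4 * 60 * 60  # e.g. raise the trigger from 2h to 4h

def run_step(command, timeout=RAH_CYCLE_TIMEOUT_SECONDS):
    """Run a task step; return True on success, False on failure or timeout."""
    try:
        completed = subprocess.run(command, timeout=timeout)
        return completed.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

A timed-out step then reports failure cleanly instead of being killed mid-run, and the trigger value can be tuned per deployment.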
good job, @ThiagoWhispher and @wesleymonte. I'd go for resubmitting the tasks you guys think will be solved by resubmission: download error, index.csv.gz, "no download options", "cannot open station file", "cannot open MTL file", and "unreadable mtl". Next week we can discuss the other, complex, ones. |
It would also be nice to consider whether some of these errors could have been circumvented if we had some sort of retry mechanism in our code.
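A minimal sketch of such a retry mechanism, with exponential backoff; the function name, defaults, and the set of retriable exceptions are assumptions for illustration, not SAPS code:

```python
import time

def retry(operation, attempts=3, base_delay=1.0, retriable=(IOError,)):
    """Call `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: let the scheduler mark the task failed
            time.sleep(base_delay * (2 ** attempt))
```

A transient download error or a momentary NFS/OpenStack hiccup would then be absorbed inside the task instead of failing the whole run and requiring manual resubmission.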
|
indeed, @fubica. @ThiagoWhispher, annotate in our spreadsheet the source (e.g. noaa, ufcg, google) of each inputdownloading task in error. |
... these annotations should help us discuss (part of) Fubica's suggestion |
Hi @thiagomanel @fubica
Note: index.csv.gz is a spreadsheet with various information about satellite images available through the Google dataset (download link: https://storage.googleapis.com/gcp-public-data-landsat/index.csv.gz ) |
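For reference, an index like this can be filtered without decompressing it to disk. A sketch using only the standard library; the column names (SCENE_ID, CLOUD_COVER) are my assumption about the header and should be verified against the real file:

```python
import csv
import gzip
import io

def scenes_below_cloud_cover(gzipped_bytes, threshold):
    """Yield scene ids from a gzipped CSV index with cloud cover below threshold."""
    with gzip.open(io.BytesIO(gzipped_bytes), mode="rt") as handle:
        for row in csv.DictReader(handle):
            if float(row["CLOUD_COVER"]) < threshold:
                yield row["SCENE_ID"]

# Tiny in-memory sample standing in for index.csv.gz
sample = gzip.compress(b"SCENE_ID,CLOUD_COVER\nLC8001A,12.5\nLC8001B,80.0\n")
print(list(scenes_below_cloud_cover(sample, 50.0)))  # prints ['LC8001A']
```

Streaming row by row matters here because the real index covers the entire Landsat archive and is far larger than a typical task's memory budget.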
Hi @thiagomanel |
Hi again @thiagomanel
Submission 2
First of all, remember that submission 2 is actually a resubmission of the failed tasks, using the same step versions as submission 1, for errors like:
As you can see in the table (terms in bold), all the previously mentioned errors have dropped in frequency. I still believe all cases can be solved just by resubmitting tasks, except for the "cannot open station file" error, which may be something more serious. This submission targeted 74 tasks in the aforementioned failure states; successful tasks increased from 778 to 836, that is, 58 of the 74 succeeded. Note that a new error called "unknown" appeared: the task's inputdownloading step completed successfully, but a problem in the preprocessing step left no log data in the task's debugging directory. It may be solved by resubmission, but that is hard to affirm. |
@thiagomanel agreed to resubmit the tasks that still remained in error after submission 2. I will start the reprocessing and, once it finishes, do the analysis and post a new comment here. |
Submission 3
As you can see in the table (terms in bold), the frequency of some errors has reduced. More details about each error:
This submission targeted 15 tasks in the aforementioned failure states; successful tasks increased from 836 to 844, that is, 8 of the 15 succeeded. |
In discussion with @thiagomanel, we decided to resubmit all tasks still failing after submission 3, whether the problem is one of the errors already worked on in submissions 2 and 3 or one of the more problematic ones. The goal is to check whether processing makes any progress and to identify which cases are resolved by resubmission alone. |
Submission 4
As you can see in the table (terms in bold), the frequency of some errors has reduced. More details about each error:
This submission targeted 44 tasks in the failure state; successful tasks increased from 844 to 866, that is, 22 of the 44 succeeded. |
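The per-submission arithmetic reported in this thread (gains of 58/74, 8/15, and 22/44) can be sanity-checked with a tiny helper; the numbers below are the before/after success totals quoted in the comments above:

```python
def recovered(successes_before, successes_after, resubmitted):
    """Return (newly successful, still failing) for one resubmission round."""
    gained = successes_after - successes_before
    return gained, resubmitted - gained

print(recovered(778, 836, 74))  # submission 2 -> (58, 16)
print(recovered(836, 844, 15))  # submission 3 -> (8, 7)
print(recovered(844, 866, 44))  # submission 4 -> (22, 22)
```

The shrinking recovery rate (78%, then 53%, then 50%) is consistent with the remaining failures being the "hard" cases that plain resubmission does not fix.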
Error handling solutions or procedures
|
Currently there is a new deployment on the LSD OpenStack cloud service that reprocessed all the old tasks already completed by previous SAPS deployments. There are two main reasons for reprocessing them: first, there are N Swift containers storing the results of these tasks with different directory-tree organization patterns, which have changed over time; second, the current algorithms used in the inputdownloading, preprocessing, and processing phases are more optimized. Over roughly two weeks, the LSD SAPS deployment generated the task results; however, some tasks failed, which started this debugging work on their causes and on corrections to the algorithms (where applicable) to bring them to the success state.
Link to the debugging worksheet for the old batch of 888 tasks: https://docs.google.com/spreadsheets/d/146qlrD564PtgeiHAL_sFOgmzE6-5JaT0M0hlgZDIDCo/edit?usp=sharing
Tasks: