This repository has been archived by the owner on Feb 24, 2022. It is now read-only.

zOld Evolving ASO action items

dciangot edited this page May 18, 2017 · 1 revision

This page combines the other 3 pages, written as summaries of discussions at CERN.

Unlike those, this page is meant to hold an evolving list of action items together with their status, since most of these action items can't be captured as Git issues.

History

2016, April 12, created

2016, April 13, updated with output of Apr 12 CrabDev meeting and reordered by status

2016, April 19, updated by Stefano in light of comments added by Marco

2016, April 19, further updated by Stefano after CrabDev meeting

LAST UPDATED ON APRIL 19, 2016


Legend: D: Done - PP: PostPoned - IP: InProgress - TD: ToDo - U: Urgent

ACTION ITEMS

Urgent

InProgress

  1. IP move to Erlang for everything Diego

    Erlang is in test for the ftc_cp view and could be deployed in May. As the size of the db increases, the Erlang view is gradually becoming better than the Java one. In a test on the asynctransfer1 clone from production on Diego C.'s private VM, a rebuild takes 12 min (Erlang) vs 20 min (Java).

  2. IP session reuse Diego

    A patch is ready and will go into production with the May deployment. It will also require a change to the ftcp_all view.

  3. IP have a review and look into possible major changes all

    We made a plan to replace Couch with Oracle, which was accepted. Brian also proposed a distributed ASO alternative, which we are reviewing. We need to converge on a final plan.

    A plan update emerged from the Apr 12 meeting. More discussion is likely needed, but the current understanding is:

    • Justas has almost completed the work to fill Oracle from CRAB, so he will finish it.
    • Switching from central ASO to distributed ASO is seen as a bigger change and possibly more work than the current plan, in particular because it adds a dependency on FTS in all schedds.
    • We have agreed anyhow on the plan to add a task-monitor thread to the DAGMAN as per Brian's suggestion, which will take care of task status update and publication.
    • So ideally in ~June we will have some experience with the additional thread and can start playing with an Oracle-based ASO. At least some components could be ported to Oracle then, e.g. Diego could do the File Transfer. At that point (or along the way) we will regroup and refocus, and if we decide to throw away the Oracle port, so be it.

    Again, this is Stefano's attempt at capturing people's opinions, not yet a consensus on a decision, since throwing away work is never nice. But we may not be able to choose wisely without some exploratory work.

  4. IP Provisioning of new machines (DiegoC. + StefanoB. will follow up with the HTTP group)

    In progress, but an uphill struggle. David Lange is helping and Ian Bird is involved. Status of the negotiation with CERN IT is tracked in https://cern.service-now.com/service-portal/view-request.do?n=RQF0557816

  5. IP Clean-up old views DiegoC.

    Ongoing, for the May deployment. Need to cross-check which views are needed or unused on the CRAB server side.

  6. IP move publication to schedd Marco

    Now tracked as: https://github.com/dmwm/CRABServer/issues/5197

  7. IP Speed up transition from ASO to direct and back in times of trouble

    Now tracked as: https://github.com/dmwm/CRABServer/issues/5198

  8. IP CRAB to check stageout OK even before resubmit (now only before submit) Justas

    Now tracked as: https://github.com/dmwm/CRABServer/issues/5132

ToDo

  1. TD review current PhEDEx based ASO monitor Stefano and Diego

    Understand its flow, FTS interaction, proxy usage, etc. To be done Apr 20-22 when Stefano is at CERN.

PostPoned

  1. PP enable users to kill xfers via Crab (i.e. crab kill also kills xfers) Diego

    Needs a brand new ASO component. Postponed until it is clear that we go for Oracle.

  2. PP multiple FTS servers: server should be a property of the transfer, not of the ASO instance Justas

    Postponed until we are well along on the migration to Oracle.
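    A minimal sketch of what item 2 could look like, assuming the FTS server becomes a field of each transfer record and submissions are then grouped per server. The class, the field names and the example URLs are hypothetical and are not taken from the ASO code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One file transfer. All field names here are hypothetical."""
    lfn: str
    source: str
    destination: str
    fts_server: str  # chosen per transfer, not per ASO instance


def group_by_fts_server(transfers):
    """Bucket transfers by the FTS server each one asks for."""
    groups = defaultdict(list)
    for transfer in transfers:
        groups[transfer.fts_server].append(transfer)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder sites and servers, for illustration only.
    pending = [
        Transfer("/store/user/a/f1.root", "SITE_A", "SITE_B", "https://fts-1.example.org"),
        Transfer("/store/user/a/f2.root", "SITE_A", "SITE_C", "https://fts-2.example.org"),
        Transfer("/store/user/b/f3.root", "SITE_D", "SITE_B", "https://fts-1.example.org"),
    ]
    for server, batch in group_by_fts_server(pending).items():
        print(server, [t.lfn for t in batch])
```

    With this layout a single ASO instance could serve several FTS servers, since the choice travels with each transfer instead of sitting in the instance configuration.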

Done

  1. D Handle more than 100 users Hassen?

    Fixed and tested here: https://github.com/dmwm/AsyncStageout/issues/4426

  2. D finalize detailed plan for ASO rotation Diego

    Done. Rotate every month at the time cmsweb is updated.

  3. D optimize grouping of views Diego

    Done. ftc_cp view split in production on Apr 12. No reason yet to do more.

  4. D CRAB should be able to throttle load on slow channels by killing/holding/slowing DAGs Marco

    No way we can do this this year.

  5. D more/better monitoring ?

    Done. We are happy with the new Kibana page, but we should add back Justas' scripts, which give details of transfers and xfer times summarised per day.

  6. D automatic/easy-to-do/forced merge jobs ? Avoid having to deal with zillions of small files :-(

    Not in 2016.

  7. D Define operational cost of running 4 instances of ASO (1 is marginal, Stefano says he'd feel better with 3 active and one draining at any time). So we start with 2 next week and see.

    Done. We converged on 3 instances with the HTTP group.

  8. D tell management of our troubles and ask for help (done by stefano after the meeting)

    Done. Let's cancel, ok ?

  9. D That is a hand-made BigCouch, what's happened/happening to BigCouch in production ?

    BigCouch is not an option for the HTTP group. We will go to Oracle instead.

  10. D Fix the update API of WMCore to delete + put instead of post HassenR.

    • On April 19 the CRAB meeting agreed to keep it as it is, since it is CouchDB related and will go away with the move to Oracle

  11. D black list of banned stageout locations from SSB, not configuration ? Is it needed once we have the two above ?

    • On April 19 the CRAB meeting agreed to keep the situation we have and try not to use it