Roadmap 0.22.0 #145

Open
kreczko opened this issue Mar 28, 2022 · 2 comments

kreczko commented Mar 28, 2022

With version 0.20.0 almost ready, it is time to plan the next release.

For 0.21.0, the biggest feature will be multi-tree support (e.g. for CMS L1T and LUX-ZEPLIN).
A preview is available in #123.

The only changes from the user perspective are the dataset and processing configs:

```yaml
# data.yml

datasets:
  - eventtype: null
    files:
      # - http://fast-hep-data.web.cern.ch/fast-hep-data/cms/L1T/CMS_L1T_study.root
      - data/CMS_L1T_study.root
    name: test
    nevents:
      "l1CaloTowerEmuTree/L1CaloTowerTree": 1853
      "l1CaloTowerTree/L1CaloTowerTree": 1853
    nfiles: 1
    tree:
      - "l1CaloTowerEmuTree/L1CaloTowerTree"
      - "l1CaloTowerTree/L1CaloTowerTree"
```

nevents now has one entry per tree, and tree accepts a list of tree paths.
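
To make the shape of the new dataset config concrete, here is a minimal Python sketch of how a dataset entry could be normalized so that single-tree and multi-tree configs look the same downstream. The `normalize_dataset` helper is hypothetical and not part of any FAST-HEP package; treating a plain string `tree` and an integer `nevents` as the single-tree form is an assumption made for illustration. The dict mirrors the data.yml entry above.

```python
def normalize_dataset(dataset):
    """Return a copy where `tree` is a list and `nevents` is a dict keyed by tree.

    Hypothetical helper for illustration; not an existing FAST-HEP function.
    """
    trees = dataset["tree"]
    if isinstance(trees, str):  # assumed single-tree form
        trees = [trees]
    nevents = dataset["nevents"]
    if isinstance(nevents, int):  # assumed single event count
        nevents = {tree: nevents for tree in trees}
    # Every listed tree needs its own event count.
    missing = [tree for tree in trees if tree not in nevents]
    if missing:
        raise ValueError(f"no nevents entry for tree(s): {missing}")
    return {**dataset, "tree": list(trees), "nevents": dict(nevents)}


# Same content as the data.yml entry above, written as a Python dict.
multi_tree = {
    "name": "test",
    "eventtype": None,
    "files": ["data/CMS_L1T_study.root"],
    "nfiles": 1,
    "nevents": {
        "l1CaloTowerEmuTree/L1CaloTowerTree": 1853,
        "l1CaloTowerTree/L1CaloTowerTree": 1853,
    },
    "tree": [
        "l1CaloTowerEmuTree/L1CaloTowerTree",
        "l1CaloTowerTree/L1CaloTowerTree",
    ],
}
normalized = normalize_dataset(multi_tree)

# A single-tree entry (tree name "events" is made up) ends up in the same shape.
single = normalize_dataset({"name": "old", "tree": "events", "nevents": 10})
```

With this shape, code that loops over `dataset["tree"]` and looks up `dataset["nevents"][tree]` does not need to care which form the user wrote.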

Variables can then be addressed by <tree name>.<directory>.<tree>.<branch>:

```yaml
# processing.yml
stages:
    # - diff: "fast_carpenter.Define"
    - var_def: "fast_carpenter.Define"
    - event_selection: 'fast_carpenter.CutFlow'
    - histogram: "fast_carpenter.BinnedDataframe"

var_def:
    variables:
        - nCaloTower: 'l1CaloTowerTree.L1CaloTowerTree.L1CaloTower.nTower'
        - caloEt: l1CaloTowerTree.L1CaloTowerTree.L1CaloTower.iet

event_selection:
    selection:
        All:
            - nCaloTower > 0

diff:
    variables:
        - diff_emu: "l1CaloTowerTree.L1CaloTowerTree.L1CaloTower.iet - l1CaloTowerTree.L1CaloTowerTree.L1CaloTower.et"

histogram:
    binning:
        # - {in: diff_emu, bins: {edges: [-100, -50, -20, 0, 20, 50, 100]}}
        - {in: "nCaloTower", out: nCaloTower, bins: {edges: [0, 1, 5, 10, 20]}}
        - {in: "caloEt", out: caloEt, bins: {edges: [0, 10, 20, 50, 80, 120, 200, 400]}}
```
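
As a rough illustration of the dotted addressing used in processing.yml, the sketch below splits a variable name into the tree it belongs to and the remaining branch path. The `split_variable` function is hypothetical, and it assumes the dotted prefix is simply the tree path from data.yml with "/" replaced by "."; the actual lookup in the preview may work differently.

```python
def split_variable(name, trees):
    """Split a dotted variable name into (tree path, branch path).

    Hypothetical helper: assumes the variable starts with the tree path
    from data.yml, with "/" written as ".".
    """
    # Try longer tree paths first so the most specific prefix wins.
    for tree in sorted(trees, key=len, reverse=True):
        prefix = tree.replace("/", ".") + "."
        if name.startswith(prefix):
            return tree, name[len(prefix):]
    raise KeyError(f"{name!r} does not start with any known tree")


trees = [
    "l1CaloTowerEmuTree/L1CaloTowerTree",
    "l1CaloTowerTree/L1CaloTowerTree",
]
tree, branch = split_variable(
    "l1CaloTowerTree.L1CaloTowerTree.L1CaloTower.nTower", trees
)
print(tree)    # l1CaloTowerTree/L1CaloTowerTree
print(branch)  # L1CaloTower.nTower
```

Matching on the full tree path keeps the emulated and unpacked trees apart even though both end in `L1CaloTowerTree`.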
@YeungOnion

Hi, I'm interested in contributing to tooling that lets researchers focus on what they choose to do when performing or specifying an analysis, and from what I've seen, I like FAST's approach. It looks like this team has tried to learn lessons from existing software engineering problems.

Since this roadmap post is the newest and a little out of date, is the slowdown from competing priorities or is it a technical matter?

I've been recently getting into this and I'm wondering if there's an area I could sink some time into to better understand the complexity of these problems. Is this the right place to contribute, or is there something else you guys are working on now?


P.S. I'm naively wanting to make a new implementation based on the Datalog or edn spec, but I think the YADL approach will provide enough flexibility and tooling support while still constraining users to largely rely on the logical implementation of the analysis. And as usual, another "better" spec never makes things more standardized, just more opinionated.

kreczko commented Jan 10, 2024

Hi @YeungOnion,

Thank you for your interest, and apologies for the delay in answering.

Regarding the roadmap:
We've tried a few things out, including rewriting the core, but could not get the result to our satisfaction.
As such, we are attempting a rewrite of FAST-HEP which is described in https://fast-hep.github.io/developers-corner/, but to summarize:

  1. Focus on YAML -> workflow that can be exported to Dask (via Airflow, Prefect or other mechanisms)
  2. Rewrite other FAST-HEP modules with step 1 and JupyterHub use in mind
  3. Only write code for "added value" functionality that is not provided by the big HEP contributors (IRIS-HEP, Scikit-HEP)

The longer version is that we can only fully leverage the Dask-added functionality provided by awkward, hist and coffea if our core supports Dask as well, which would be very hard to achieve with the existing code base.
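
To give a feel for what "YAML -> workflow" in step 1 could mean, here is a small Python sketch that chains a list of named stages into a task graph, loosely in the spirit of Dask's task-graph dictionaries. This is not how fasthep-flow is implemented; `stages_to_graph`, `run_graph` and the toy stage functions are all hypothetical, and the tiny evaluator only exists so the example runs without Dask installed.

```python
def stages_to_graph(stages, source_key="events"):
    """Chain named stages into a task graph stored as a plain dict.

    Each value is a tuple (callable, dependency key), so every stage
    consumes the output of the stage before it.
    """
    graph = {}
    previous = source_key
    for name, func in stages:
        graph[name] = (func, previous)
        previous = name
    return graph, previous


def run_graph(graph, key, inputs):
    """Tiny recursive evaluator so the sketch runs without Dask."""
    if key in inputs:
        return inputs[key]
    func, dependency = graph[key]
    return func(run_graph(graph, dependency, inputs))


# Toy stand-ins for real stages; the names follow the stage list in the
# processing config earlier in this issue.
def var_def(events):
    return [{**e, "nCaloTower": len(e["iet"])} for e in events]


def event_selection(events):
    return [e for e in events if e["nCaloTower"] > 0]


stages = [("var_def", var_def), ("event_selection", event_selection)]
graph, last = stages_to_graph(stages)

events = [{"iet": [3, 7]}, {"iet": []}, {"iet": [12]}]  # made-up input
selected = run_graph(graph, last, {"events": events})
print([e["nCaloTower"] for e in selected])  # [2, 1]
```

The point of such an intermediate representation is that the same stage list could be handed to different backends (a local loop, Dask, or a workflow manager) without changing the YAML.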

I've circulated the fictional docs to the biggest users of FAST-HEP tools; they are now available online:
https://fasthep-flow.readthedocs.io/en/latest/index.html

The goal here is to gather the use-cases and make sure we provide the minimum needed functionality within FAST-HEP, while also making it easier to include your own or 3rd party code. The implementation is currently quite bare, and while the above docs talk about Apache Airflow, we are currently testing Prefect, which seems a bit easier to integrate.

If you are interested in contributing (feedback, docs, code), please head to https://github.com/FAST-HEP/fasthep-flow.
I will be creating issues and a minimally functioning prototype in the coming weeks.
I've also just enabled Discussions for that repo and I will be pointing people towards it: https://github.com/FAST-HEP/fasthep-flow/discussions
