Build DAG from functions directly #1268

zilto · 2025-01-06T01:40:52Z

Summary

Add a first-class way to build a DAG from function objects (in contrast to today's DIY / hacky approaches). The user API could be either:

Builder().with_functions()
or an intermediary my_module = module_from_functions(*fns) that is then passed to .with_modules()
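
As a rough illustration of the second option, here is a minimal sketch using only the standard library. The helper name module_from_functions comes from the proposal above, but its signature, the module_name parameter, and its behavior are assumptions made for illustration. This is not an existing Hamilton API, and a real integration may also need to adjust fn.__module__ or register the module, depending on how Hamilton filters functions.

```python
import inspect
import types
from typing import Callable


def module_from_functions(*fns: Callable, module_name: str = "adhoc_dataflow") -> types.ModuleType:
    """Hypothetical helper: wrap plain functions in a synthetic module object."""
    module = types.ModuleType(module_name)
    for fn in fns:
        if not inspect.isfunction(fn):
            raise TypeError(f"expected a plain function, got {type(fn).__name__}")
        # Expose the function as a module attribute under its own name.
        setattr(module, fn.__name__, fn)
    return module


def raw_value() -> int:
    return 2


def doubled(raw_value: int) -> int:
    return raw_value * 2


my_module = module_from_functions(raw_value, doubled)
# List the non-dunder attributes of the synthetic module.
print(sorted(name for name in vars(my_module) if not name.startswith("__")))
# ['doubled', 'raw_value']
```

The resulting object is what would then be handed to .with_modules(); whether Hamilton accepts it unchanged is not something this sketch verifies.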

Goals:

simplify codebase
foundation for better hierarchical / nested graph structures (i.e., subdags)

Current

At the core of Hamilton, users:

write functions in a Python module ("dataflow code")
load that module in the "driver code" to build a DAG
execute the DAG via the Driver

Problem

It is currently possible to build Hamilton DAGs from functions, but we have no official "here's how you do it" that we guarantee we'll support.
- the name ad_hoc_utils signals exactly the opposite
In a notebook, there's no good reason to create a module from a notebook function before passing it to Hamilton (except our own constraints)
Python module machinery is complex and adds indirection to the codebase (tests, notebook extension, LSP)

Benefits

greatly simplify many unit tests
facilitate marimo integration

Hamilton 2.0 / Broader perspective

There's no well-defined structure or purpose to Hamilton top-level modules (e.g., nodes, graph_types, graph_utils, graph, ad_hoc_utils, base, hamilton.common, models). I propose a structure that matches the Hamilton lifecycle:

hamilton.parser: everything that deals with source code: how functions are written, whether type annotations are present (not type matching), collecting functions from modules, converting a notebook cell string to a module, removing comments and docstrings before hashing source code
hamilton.compiler: converting code to DAG: structuring the DAG from functions, applying function modifiers, validating types, etc.
remove ad_hoc_utils


skrawcz · 2025-01-06T05:04:29Z

Yes there could be a big refactor for Hamilton 2.0.

Otherwise, a small reason for making things look like a module was so that we wouldn't have to rewire the internals of Hamilton, which assumed a module would be passed in. The larger reason it wasn't enabled from the beginning, and why ad_hoc_utils was named that on purpose, is that we wanted people curating their code into modules as part of an SDLC; coupling functions with where you execute them makes them less usable and modular, since you likely couple execution imports with the ones for the functions... There have been a lot of lessons and learnings since then, so there could be improvements here.

Question:

What's the blocker for marimo integration?

skrawcz · 2025-01-06T06:27:37Z

Otherwise some API things to think through / check:

What is the proposed API?
Where can functions come from? Only the current module? or any module?
How would this interact with modules if they're provided?
Would this impact serialization / deserialization, e.g. for ray parallel...
Would this break any Hamilton UI assumptions?

Dev-iL · 2025-01-06T07:52:38Z

What types of callables will be supported? Lambdas? Static methods from classes? How to restrict the user to only the allowed callable types?

zilto · 2025-01-06T19:38:19Z

Current main code path (roughly):

functions are defined in a file, let's say dataflow.py
dataflow.py is imported into the module dataflow
hamilton.graph_utils.find_functions() retrieves "hamilton functions" from the module dataflow
find_functions() is called in two places (serving the same purpose): in hamilton.graph.create_function_graph() and @subdag()
we get a FunctionGraph, Driver, etc.
the Node object directly retrieves the __module__ information from the function
Parts of the codebase use Node metadata about the module

In other words, the module abstraction is currently irrelevant for building the DAG. It only matters for downstream usage.

Propositions:

the current hamilton.graph_utils.find_functions() does 2 things: get functions from a module and determine if they are valid "hamilton functions". We need to decouple these two operations for flexibility.
Change create_function_graph() to take in functions instead of modules. The modules passed are irrelevant to this operation
the module metadata attributes could be left empty for dynamic cases (would need to evaluate what are the affected downstream code paths).

Answers

Question:
What's the blocker for marimo integration?

Current API options around "get functions from the current namespace and build a DAG" look hacky and have poor ergonomics. Solving this problem would be the same as providing a smoother "if __name__ == "__main__", run this as a DAG"

What is the proposed API?

The current propositions don't need to involve a user-facing API change. They would make developer life easy and would provide a first-class way of passing functions to create a function graph / driver

Simple user-facing API options:

allow .with_modules(...) to take in functions too
add .with_functions(...) to the Builder, and create_function_graph() will work on the functions and modules.

Where can functions come from? Only the current module? or any module?

Could be from anywhere. Doesn't need to come from a file. Our options to maintain compatibility:

all "anonymous" functions are put in the same namespace (i.e., module)
we tweak downstream paths to accept empty module attribute (my preferred option)
users must provide namespace or we automatically assign a uuid (similar to create_temporary_module())

How would this interact with modules if they're provided?

No change to current behavior because the module metadata plays no role in graph building and core Hamilton features.

Would this impact serialization / deserialization, e.g. for ray parallel...

Don't know the details of this. If you have an instantiated function, you must have the pickleable bytes of the instantiated function, and you probably have available source code (from a .py file or from an interactive session) that you can gather and use to re-instantiate the function remotely. Doesn't sound like a blocker; all orchestrators have to deal with that.
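
One nuance worth checking here: the standard library's pickle serializes functions by reference (module plus qualified name), not by value. The small experiment below, which uses only the standard library and a made-up make_local_function helper, shows the difference between an importable function and one created dynamically. It says nothing about how Hamilton or Ray handle this in practice.

```python
import json
import pickle


def make_local_function():
    """Return a function that only exists inside this call (not importable by name)."""
    def add_one(x: int) -> int:
        return x + 1
    return add_one


# A function importable from a module pickles as a reference to that module attribute.
restored = pickle.loads(pickle.dumps(json.dumps))
print(restored is json.dumps)  # True

# A dynamically created function has no importable name, so stdlib pickle refuses it.
# The exception type varies across Python versions, hence the two-type except clause.
try:
    pickle.dumps(make_local_function())
    local_pickled = True
except (pickle.PicklingError, AttributeError):
    local_pickled = False
print(local_pickled)  # False
```

Functions that don't come from an importable module would therefore need a by-value serializer (cloudpickle is one such library) or the source-code route described above.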

Would this break any Hamilton UI assumptions?

Don't know the assumptions of the Hamilton UI. If there is a blocking assumption, it's better to change it? The main potential limitation is that we have some UI components that expect a non-empty metadata field.

What types of callables will be supported? Lambdas? Static methods from classes? How to restrict the user to only the allowed callable types?

Lambdas and static methods are not currently supported and are not within the scope of what I intended here. Though, building graphs from functions would introduce a simple way to create nodes and graphs from lambdas and static methods.
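
On the question of how to restrict users to allowed callable types, one possible guard is sketched below. The name require_plain_function and its rules are hypothetical and only illustrate what the standard inspect module can distinguish; this is not how Hamilton validates functions today.

```python
import inspect
from typing import Callable


def require_plain_function(obj: object) -> Callable:
    """Hypothetical guard: accept only named functions created with `def`."""
    if not inspect.isfunction(obj):
        raise TypeError(f"unsupported callable type: {type(obj).__name__}")
    if obj.__name__ == "<lambda>":
        raise TypeError("lambdas have no usable node name")
    return obj


class Transforms:
    @staticmethod
    def scaled(value: float) -> float:
        return value * 10


def total(a: int, b: int) -> int:
    return a + b


print(require_plain_function(total).__name__)  # total

# A lambda, a builtin, and a raw staticmethod descriptor are all rejected.
for rejected in (lambda x: x, len, Transforms.__dict__["scaled"]):
    try:
        require_plain_function(rejected)
    except TypeError as err:
        print(f"rejected: {err}")
```

Note that a static method accessed through its class (Transforms.scaled) is an ordinary function at runtime and passes this guard, so excluding it would need an extra check, for example on __qualname__.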

mulliken · 2025-01-29T22:50:01Z

I asked a question in Slack about this feature or something similar, and was asked to post the context on the issue directly. Granted, I am brand new to Hamilton and am merely scoping it out for possible adoption, so my perspective and context might not be as relevant. I am interested in using Hamilton to help maintain data science workflows and envision creating a shared node/subgraph library and an analytic store with individual modules corresponding to each data science workflow. Thus I imagine I would create a workflow my_workflow.py to add to the analytic store that imports a few functions from node_library, and the builder for that workflow would probably invoke Builder().with_modules([node_library, my_workflow]). While Hamilton promises that the function names align assuming the graph can be created, and visualizing the graph will help a developer see the flow, it would be nice to just look at the Builder code and know exactly which outside nodes you are bringing in. I think this would be helpful in the case where there is turnover and someone has to inherit a workflow and quickly get up to speed with it.

I also think for rapid development, importing a list of functions to be built can be useful when you want to cannibalize another workflow someone has written for a few useful functions. In the long run, this probably calls for a refactor and integration of shared components into a shared node library, but perhaps a random data scientist might not have maintainer permissions for that library. If you import someone else's module wholesale, especially a complicated one, you run the risk of accidental overrides and collisions, or possibly whole extra subgraphs from the other workflow that get executed unnecessarily. You would have to carefully check you are overriding the correct functions and not invoking extra subgraphs, whereas if you could just import the few functions or subgraphs you wanted, you would just make sure your own DAG is compatible with those, allowing for safe and rapid ad hoc development.

Again, I have only been playing with toy problems and these might be non issues with more experience, but I am interested in this feature.

zilto added enhancement New feature or request core-work Work that is "core". Likely overseen by core team in most cases. labels Jan 6, 2025

skrawcz changed the title ~~Build DAG from functions~~ Build DAG from functions directly Jan 6, 2025

skrawcz added question Further information is requested and removed enhancement New feature or request labels Jan 6, 2025

zilto mentioned this issue Jan 13, 2025

[WIP] feat: build FunctionGraph from functions #1271

Open
