Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ballista Python Update #1091

Open
Tracked by #1068
milenkovicm opened this issue Oct 20, 2024 · 7 comments · May be fixed by #1100
Open
Tracked by #1068

Ballista Python Update #1091

milenkovicm opened this issue Oct 20, 2024 · 7 comments · May be fixed by #1100
Labels
enhancement New feature or request

Comments

@milenkovicm
Copy link
Contributor

milenkovicm commented Oct 20, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

With changes done in #1088 and introduction with SessionContextExt we could make changes in pyballista and support datafusion-python context directly instead of BallistaContext. This would unify datafusion and ballista
python interface enabling users to change from single node deployment to cluster deployment with single line change.

Describe the solution you'd like

I don't think we need to re-invent the wheel here, we just need to copy what https://github.com/apache/datafusion-ray is doing and do same for ballista. This PR should provide support for methods provided by SessionContextExt.

Something similar to:

from datafusion.context import SessionContext
from pyballista import StandaloneBallista, RemoteBallista

ctx : SessionContext = StandaloneBallista()

df = ctx.sql("SELECT 1")

Propose a proper, python, way to initialize datafuson::PySessionContext

I'm not python expert thus can't really propose ergonomic python interface, so not sure should we
use objects like StandaloneBallista or static methods like Ballista.standalone() although,
later would be trivial to make it may not be the most python ergonomics

use ballista::prelude::SessionContextExt;
use datafusion::prelude::SessionContext;
use datafusion_python::{context::PySessionContext, utils::wait_for_future};

#[pymethods]
impl Ballista {
    #[staticmethod]
    pub fn standalone(py: Python) -> PyResult<PySessionContext> {
        let session_context = SessionContext::standalone();
        let ctx = wait_for_future(py, session_context)?;
        Ok(ctx.into())
    }
}

It would be great if we align with datafusion-ray approach.

Make standalone optional dependency

can we make Ballista standalone optional dependency?

pip install pyballista 

should install remote mode only

pip install pyballista['standalone'] 

should install remote and standalone, providing easy way to test ballista applications.

Consider renaming python package

As pyballista is not published, can we consider renaming package to something like: datafusion-distributed or datafusion-ballista to align with other packages.

To me datafusion-distributed, makes sense and can consider renaming ballista (client) crate to same name, keeping ballista- prefix for executor and scheduler.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context

Do we need to keep current pyballista implementation or we can remove it with this PR?

Relates to #1069, #1088

@milenkovicm milenkovicm added the enhancement New feature or request label Oct 20, 2024
@milenkovicm
Copy link
Contributor Author

@andygrove , @timsaucer , @Michael-J-Ward your opinion would be greatly appreciated

@timsaucer
Copy link

I’ve skimmed through the PR you linked, but I’ll need to take some time to review in more detail. I can think of one holdup regarding the python interface but hopefully it won’t be too difficult to overcome.

@milenkovicm
Copy link
Contributor Author

Thanks @timsaucer
Looking forward for your input

I did a quick poc on top of #1088 to demonstrate functionality at dc21f9e

@tbar4
Copy link
Contributor

tbar4 commented Oct 20, 2024

@milenkovicm interested in supporting on this once Tim and team review your PR for #1069

@milenkovicm
Copy link
Contributor Author

One thing I've run into trying to do things like this is that because rust doesn't have a stable ABI, the above code probably only works when you have the identical versions of DataFusion and identical compilers. That might not be a hard blocker for ballista since it's the same group, but it does create a hard dependency that can make it difficult to maintain. For that reason I've been working on a PR to add some FFI functionality to DF core with the specific goal of allowing extension for things like TableProviders. Based on the discussion in the datafusion-ray channel I think we might also look at having ballista or ray provide a QueryPlanner that we register with the SessionContext.

referencing @timsaucer comment from discord https://discord.com/channels/885562378132000778/1297108588183027753/1298220209609642034

@timsaucer
Copy link

Sorry I haven’t been able lately to give this more attention, but I hope next week my time clears up some.

@milenkovicm
Copy link
Contributor Author

No worries @timsaucer, I just want to note important point you brought. Thanks a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants