improve data loading speed with Dask or NumPy #37

Open
sreichl opened this issue Dec 15, 2023 · 0 comments

sreichl commented Dec 15, 2023

Test this on a single script first, e.g., pca.py.

Dask: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across multiple cores or even multiple machines.

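import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load data lazily with Dask (data_path is assumed to be defined earlier in the script)
# note: dd.read_csv does not support index_col, so the first column is read as a regular column
ddata = dd.read_csv(data_path)

# drop the first (index) column and convert the remaining columns to a Dask array
feature_columns = list(ddata.columns[1:])
data_array = ddata[feature_columns].to_dask_array(lengths=True)

# standardize data
# note: scikit-learn is expected to materialize the Dask array as an in-memory NumPy array here
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)

# PCA transformation
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)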
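
The issue title also mentions NumPy, but no NumPy variant is given here. The following is only a minimal sketch of what one could look like, assuming the input is a purely numeric CSV with one header row and row labels in the first column. The function names (`load_numeric_csv`, `standardize`, `pca_svd`) are hypothetical and not part of pca.py, and no claim is made that this is faster than pandas or Dask without benchmarking it on real data.

```python
import numpy as np


def load_numeric_csv(path, delimiter=","):
    """Load the numeric block of a CSV, skipping the header row and first column.

    Assumes every column after the first one is numeric (hypothetical layout).
    """
    with open(path, "r", encoding="utf-8") as handle:
        header = handle.readline().strip().split(delimiter)
    values = np.loadtxt(
        path,
        delimiter=delimiter,
        skiprows=1,                            # skip the header row
        usecols=list(range(1, len(header))),   # skip the row-label column
        ndmin=2,
    )
    return header[1:], values


def standardize(values):
    """Center each column and scale it to unit (population) standard deviation."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (values - mean) / std


def pca_svd(values):
    """PCA through a thin SVD of the centered data.

    Returns the projected samples, the principal axes (one per row),
    and the variance explained by each component.
    """
    centered = values - values.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s
    explained_variance = s**2 / (values.shape[0] - 1)
    return scores, vt, explained_variance
```

Usage would be along the lines of `columns, values = load_numeric_csv(data_path)` followed by `scores, components, variance = pca_svd(standardize(values))`. Component signs may differ from scikit-learn's `PCA`, which applies its own sign convention, and the row labels are dropped here, so they would have to be read separately if they are needed downstream.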
@sreichl sreichl self-assigned this Dec 15, 2023
@sreichl sreichl added the enhancement New feature or request label Dec 15, 2023
@sreichl sreichl changed the title improve data loading speed with Dask improve data loading speed with Dask or NumPy Feb 22, 2024