improve data loading speed with Dask or NumPy #37

sreichl · 2023-12-15T17:31:40Z

Test it on a single script first, e.g., pca.py.

Dask: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across multiple cores or even multiple machines.

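import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load data lazily with dask (data_path is assumed to be defined earlier in the script)
# note: dd.read_csv does not accept index_col, so the index is set in a separate step
ddata = dd.read_csv(data_path)
ddata = ddata.set_index(ddata.columns[0])  # sorts rows by the index, which may change their order

# convert to dask array; lengths=True computes the chunk sizes up front
data_array = ddata.to_dask_array(lengths=True)

# standardize data (scikit-learn is expected to materialize the dask array in memory here)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)

# PCA transformation
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)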

sreichl self-assigned this Dec 15, 2023

sreichl added the enhancement (New feature or request) label Dec 15, 2023

sreichl changed the title ~~improve data loading speed with Dask~~ improve data loading speed with Dask or NumPy Feb 22, 2024
