Diffxpy returns log2fc and coef_mle values as dask array #223

Open
lippspatrick opened this issue May 9, 2023 · 3 comments

@lippspatrick

Hello,
thanks a lot for your package.

I'm currently facing an issue: after running the Wald test, the values for log2fc and coef_mle are returned as dask arrays rather than as numeric values.

I'm using the following versions:
anndata 0.7.8
diffxpy 0.7.4+103.gceb79aa ig/batchglm_api_update
batchglm 0.7.4.dev_0
dask 2021.4.1
sparse 0.9.1

I'm using batchglm and diffxpy from those branches because of issue #93.

I'm running the following code:

adata.X = adata.layers["raw_counts"]
res = de.test.wald(data=adata, formula_loc="~ 1 + GROUP", factor_loc_totest="GROUP")

This prints:

The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.
The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.
/lippat00/python/diffxpy_packages/batchglm/batchglm/models/base_glm/utils.py:102: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  np.vstack([np.mean(densify(x[np.where(grouping == g)[0], :]), axis=0) for g in np.unique(grouping)])
/lippat00/python/diffxpy_packages/batchglm/batchglm/models/base_glm/utils.py:151: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  np.vstack([np.mean(densify(x[np.where(grouping == g)[0], :]), axis=0) for g in np.unique(grouping)])
/lippat00/python/diffxpy_packages/batchglm/batchglm/models/base_glm/utils.py:168: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  [np.mean(np.square(densify(x[np.where(grouping == g)[0], :])), axis=0) for g in np.unique(grouping)]

training location model: False
training scale model: True
iter 0: ll=70663625.678866
iter 1: ll=70663625.678866, converged: 0.00% (loc: 100.00%, scale update: False), in 0.00sec
iter 2: ll=50303650.242279, converged: 11.08% (loc: 11.08%, scale update: True), in 177.80sec
iter 3: ll=50303650.242279, converged: 11.08% (loc: 100.00%, scale update: False), in 0.00sec
iter 4: ll=50076239.645603, converged: 79.02% (loc: 79.02%, scale update: True), in 149.27sec
iter 5: ll=50076239.645603, converged: 79.02% (loc: 100.00%, scale update: False), in 0.00sec
iter 6: ll=50039459.932892, converged: 92.41% (loc: 92.41%, scale update: True), in 59.11sec
iter 7: ll=50039459.932892, converged: 92.41% (loc: 100.00%, scale update: False), in 0.00sec
iter 8: ll=50031801.602276, converged: 98.12% (loc: 98.12%, scale update: True), in 44.64sec
iter 9: ll=50031801.602276, converged: 98.12% (loc: 100.00%, scale update: False), in 0.00sec
iter 10: ll=50030709.007629, converged: 99.70% (loc: 99.70%, scale update: True), in 41.18sec
iter 11: ll=50030709.007629, converged: 99.70% (loc: 100.00%, scale update: False), in 0.00sec
iter 12: ll=50030695.725333, converged: 99.97% (loc: 99.97%, scale update: True), in 36.95sec
iter 13: ll=50030695.725333, converged: 99.97% (loc: 100.00%, scale update: False), in 0.00sec
iter 14: ll=50030652.029279, converged: 99.99% (loc: 99.99%, scale update: True), in 23.52sec
iter 15: ll=50030652.029279, converged: 99.99% (loc: 100.00%, scale update: False), in 0.00sec
iter 16: ll=50030652.029279, converged: 100.00% (loc: 100.00%, scale update: True), in 155.19sec
/mambaforge/envs/diffxpy/lib/python3.9/site-packages/dask/array/core.py:2912: RuntimeWarning: divide by zero encountered in true_divide
size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))
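
Side note: the PerformanceWarning above suggests a per-slice context manager, but the same option can also be set process-wide through dask's public config API; a minimal sketch:

import dask

# Accept large chunks globally and silence the PerformanceWarning
# (trade-off: individual chunks may use more memory when slicing).
dask.config.set({"array.slicing.split_large_chunks": False})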

However, res.summary() then returns:

gene        pval      qval      log2fc                                              mean      zero_mean  grad          coef_mle                                            coef_sd       ll
AL627309.1  0.080515  0.237762  [dask.array<getitem, shape=(), dtype=float64, ...  0.001634  False      3.163008e-03  [dask.array<getitem, shape=(), dtype=float64, ...  6.267832e-01  0.000000

I also tried converting the matrix with adata.X = adata.X.toarray(), which returns the same result.
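
As a stop-gap until this is fixed, the lazy cells can be computed after the fact. A minimal sketch operating on the summary frame, assuming each affected cell holds a 0-d dask array (or a length-1 wrapper around one) as in the output above; to_scalar is a hypothetical helper:

import numpy as np

summary_df = res.summary()

def to_scalar(v):
    # np.asarray triggers the dask array's __array__ protocol, i.e. an
    # implicit compute(); ravel()[0] unwraps 0-d arrays and length-1 wrappers.
    try:
        return float(np.asarray(v).ravel()[0])
    except (TypeError, ValueError):
        return v  # leave non-numeric cells untouched

for col in ["log2fc", "coef_mle", "coef_sd"]:
    if col in summary_df.columns:
        summary_df[col] = summary_df[col].map(to_scalar)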

Looking forward to your response. In case you need additional information, please let me know.

Best,
Patrick

@c-westhoven

I have the same issue, with summary() returning dask arrays.

I'm using:
anndata 0.9.2
diffxpy 0.7.4+103.gceb79aa ig/batchglm_api_update
batchglm 0.7.4.dev
dask 2021.4.1
sparse 0.9.1

My adata.X is raw data.

Running de.test.wald prints:

The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.

The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.
AnnData object with n_obs × n_vars = 300 × 500
    obs: 'sample_x', 'dataset', 'condition'
training location model: False
training scale model: True
iter   0: ll=142879.916890
iter   1: ll=142879.916890, converged: 0.00% (loc: 100.00%, scale update: False), in 0.00sec
iter   2: ll=32021.056719, converged: 41.20% (loc: 41.20%, scale update: True), in 3.86sec
iter   3: ll=32021.056719, converged: 41.20% (loc: 100.00%, scale update: False), in 0.00sec
iter   4: ll=30460.923372, converged: 87.40% (loc: 87.40%, scale update: True), in 2.47sec
iter   5: ll=30460.923372, converged: 87.40% (loc: 100.00%, scale update: False), in 0.00sec
iter   6: ll=30281.772796, converged: 96.00% (loc: 96.00%, scale update: True), in 1.26sec
iter   7: ll=30281.772796, converged: 96.00% (loc: 100.00%, scale update: False), in 0.00sec
iter   8: ll=30258.987554, converged: 99.20% (loc: 99.20%, scale update: True), in 1.01sec
iter   9: ll=30258.987554, converged: 99.20% (loc: 100.00%, scale update: False), in 0.00sec
iter  10: ll=30250.794861, converged: 99.80% (loc: 99.80%, scale update: True), in 0.73sec
iter  11: ll=30250.794861, converged: 99.80% (loc: 100.00%, scale update: False), in 0.00sec
iter  12: ll=30250.794861, converged: 100.00% (loc: 100.00%, scale update: True), in 0.53sec
/home/c490n/.local/lib/python3.8/site-packages/dask/array/core.py:2912: RuntimeWarning: divide by zero encountered in true_divide
  size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))

My .summary() output does not include coef_mle or coef_sd.

@c-westhoven

I'm able to get numeric values for coef_mle by converting self.theta_mle to an np.array in _test in det.py under diffxpy/testing/. Without that conversion, the dask array containing the MLE information for all elements is saved into the summary df.

def _test(self):
    """
    Returns the p-value for differential expression for each gene

    :return: np.ndarray
    """
    # Check whether single or multiple parameters are tested.
    # For a single parameter, the Wald statistic distribution is approximated
    # with a normal distribution; for multiple parameters, a chi-square distribution is used.
    self.theta_mle = self.model_estim.model_container.theta_location[self.coef_loc_totest]
    dir_path = os.path.dirname(os.path.realpath(__file__))
    if len(self.coef_loc_totest) == 1:
        print("_test, == 1", dir_path, self.theta_mle)
        # The change: materialize the lazy dask array as a concrete numpy array.
        self.theta_mle = np.array(self.theta_mle[0])
        print("_test, == 1, np.array", dir_path, self.theta_mle)
        ...
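
(Wrapping the value in np.array() works because NumPy calls the dask array's __array__ protocol, which computes the task graph and returns a concrete ndarray; it is equivalent to calling .compute() explicitly.)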

@c-westhoven
Copy link

I'm not sure whether these quick fixes address the underlying issue.

But for log2fc I also wrapped the return value of the log2_fold_change function in det.py in np.array(). Adding np.array to the log_fold_change function should also be possible.

    def log2_fold_change(self, **kwargs):
        """
        Calculates the pairwise log_2 fold change(s) for this DifferentialExpressionTest.
        """
        # return self.log_fold_change(base=2, **kwargs)
        return np.array(self.log_fold_change(base=2, **kwargs))

With the change to log2_fold_change, my error in test.plot_volcano ('int' object is not iterable) was resolved as well.
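
Forcing the conversion at the public surface like this keeps batchglm's lazy dask representation internal while guaranteeing that callers, including plotting helpers such as plot_volcano, receive concrete NumPy values.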
