Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse concatenate #2286

Merged
merged 6 commits into from
May 18, 2017
Merged

Sparse concatenate #2286

merged 6 commits into from
May 18, 2017

Conversation

astaric
Copy link
Member

@astaric astaric commented May 5, 2017

Issue

Add support for sparse tables to Table.concatenate. Fixes #2154.

Description of changes

Modify Table.extend to use vstack to concatenate compatible table, use scipy.sparse.vstack if some of the arrays are sparse.

Includes
  • Code changes
  • Tests
  • Documentation

@astaric astaric force-pushed the sparse-concatenate branch from aeee778 to dd9ee3a Compare May 5, 2017 16:37
@codecov-io
Copy link

codecov-io commented May 5, 2017

Codecov Report

Merging #2286 into master will decrease coverage by 0.02%.
The diff coverage is 93.18%.

@@            Coverage Diff             @@
##           master    #2286      +/-   ##
==========================================
- Coverage    73.2%   73.18%   -0.03%     
==========================================
  Files         316      316              
  Lines       55306    55287      -19     
==========================================
- Hits        40488    40462      -26     
- Misses      14818    14825       +7

@astaric astaric force-pushed the sparse-concatenate branch from dd9ee3a to 0759397 Compare May 8, 2017 10:20
@astaric
Copy link
Member Author

astaric commented May 8, 2017

@lanzagar the (randomly) failing test has nothing to do with this PR. I have restarted the build, hopefully, it will pass this time :)

I am tagging you since you have merged the latest changes to the mosaic (#2133). Can you take a look?

@nikicc
Copy link
Contributor

nikicc commented May 9, 2017

I tested it on text and it doesn't crash any more. Though, the same strange behaviour as described in #2304 happens on BoW and prevents to properly test.

@astaric
Copy link
Member Author

astaric commented May 9, 2017

Union and intersection are computed according to the "sameness" of the variable, not equality. Since BoW features have compute_value set, variables from different BoWs are considered different, since there is no way to tell if both compute values are the same.

If we modify the widget to only compare type, name, (and values for discrete variables), what do we set for compute_value on the output?

Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

(len(attr_cols) or len(class_cols)):
raise TypeError(
"Ordinary attributes can only have primitive values")
if len(attr_cols):
if len(attr_cols) == 1:
if sp.issparse(self.X) and len(attr_cols) == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? Scipy seems to be able to assign 2d arrays to columns regardless if attr_cols = 2 or attr_cols = [2]. Is this here just for efficiency?

>>> import numpy as np
>>> import scipy.sparse as sp
>>> x = sp.csr_matrix(np.eye(2))
>>> x.toarray()
array([[ 1.,  0.],
       [ 0.,  1.]])
>>> x[:, [1]] = np.array([[2], [2]])
/Users/Niko/anaconda/envs/orange3/lib/python3.5/site-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
>>> x.toarray()
array([[ 1.,  2.],
       [ 0.,  2.]])
>>> x[:, 1] = np.array([[3], [3]])
>>> x.toarray()
array([[ 1.,  3.],
       [ 0.,  3.]])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original check has been here for a long time :)

I have removed the if completely.

@@ -6,6 +6,30 @@
ANNOTATED_DATA_FEATURE_NAME = "Selected"


def add_columns(domain, attributes=(), class_vars=(), metas=()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this maybe be put in domain.py?

@astaric astaric force-pushed the sparse-concatenate branch from eb3d9d7 to 344dcfc Compare May 18, 2017 06:22
astaric added 6 commits May 18, 2017 08:24
Use vstack that works with both sparse and dense tables.
- Allow setting of whole columns/rows in X and Y (when setting np.ndarray)
- Convert indices for sparse tables only. Otherwise columns in X and Y
  have to be set with 1d array and columns in metas with 2d array.
Use avoid using Table.from_numpy to make it compatible with classes
extending Table.
@astaric astaric force-pushed the sparse-concatenate branch from 344dcfc to f337bf6 Compare May 18, 2017 06:24
@nikicc nikicc merged commit 696e786 into biolab:master May 18, 2017
@astaric astaric deleted the sparse-concatenate branch September 8, 2017 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants