Tabmat v4 (#286)
* Minimal implementation (tests green)

* Remove sum method and rely on np.sum

* Force DenseMatrix to always be 2-dimensional

* Add __repr__ and __str__ methods

* Fix as_mx

* Fix ufunc return value

* Wrap SparseMatrix, too

* Demo of how the ufunc interface can be implemented

* Do not subclass csc_matrix

* Demonstrate binary ufuncs for sparse

* Add tocsc method

* Fix type checks

* Minor improvements

* ufunc support for categoricals

* Remove __array_ufunc__ interface

* Remove numpy operator mixin

* Add hstack function

* Add method for unpacking underlying array

* Add __matmul__ methods to SparseMatrix

* Stricter and more consistent indexing

* Be consistent when instantiating from 1d arrays

* Add column name metadata to `tabmat` matrices (#278)

* Add column name getters

* Matrix names are also combined

* Add names to constructors

* Add indexing support for column names

* Remove unnecessary code

* Better default column names

* Reduce code duplication

* Saner defaults

* Add convenient getters and setters

* Fix indexing

* Smarter setter for categorical matrices

* Add tests

* Fix subsetting with np.newaxis

* Remove the walrus :(

* Fix test

* Fix indexing with np.ix_

* Propagate column names where it makes sense

* Fix merge mistake

* Add changelog entry

* Matrices from formulas (#267)

* Add an experimental tabmat materializer class

* Nicer way of handling interactions

* Have proper column names [skip ci]

* Make dummy ordering consistent with pandas [skip ci]

* Fix mistake in categorical interactions [skip ci]

* Add formulaic to environment files

Have not added to the conda recipe yet.
Should probably be optional.

* Add from_formula constructor

* Add some tests

* Add more tests

* Major refactoring

 - simplify categorical interactions
 - NaNs in categoricals should be handled correctly
 - parity with formulaic in categorical names

* Make name formatting customizable

 - interaction_separator
 - categorical_format
 - intercept_name

* Add formulaic to conda recipe

* Implement `C()` function to convert to categoricals

* Auto-convert strings to categories

* Fix C() not working from materializer interface

* Add the pandasmaterializer tests from formulaic

* Add formulaic to setup.py deps

* Implement suggestions from code review

* Clean up code

 - Add docstrings
 - Add type hints
 - Rename some classes

* Pin formulaic minimum version

* Add support for architectures not supported by xsimd (#262)

* Release 3.1.9 (#263)

* Pre-commit autoupdate (#264)

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>

* Add params for density and cardinality thresholds

* Skip python 3.6 build

* Refactor to avoid circular imports

* Interaction of dropped and NA is dropped

* Add type hint for context

* Add unit tests for interactable vectors

* Add more checks

* Change argument name

* Make C() stateful (remember levels)

* Add test for categorizer state

* More correct handling of encoding categoricals

* Make adding an intercept implicitly parametrizable

Default is False

* Add na_action parameter to constructor

* Add test for sparse numerical columns

* Add option to not add the constant column

* Pre-commit autoupdate (#274)

* Pre-commit autoupdate (#276)

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>

* Bump pypa/gh-action-pypi-publish from 1.8.6 to 1.8.7 (#277)

Bumps [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) from 1.8.6 to 1.8.7.
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.8.6...v1.8.7)

---
updated-dependencies:
- dependency-name: pypa/gh-action-pypi-publish
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (#279)

Bumps [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) from 1.8.7 to 1.8.8.
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.8.7...v1.8.8)

---
updated-dependencies:
- dependency-name: pypa/gh-action-pypi-publish
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pypa/cibuildwheel from 2.13.1 to 2.14.1 (#280)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.13.1 to 2.14.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.13.1...v2.14.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Minimal implementation (tests green)

* Remove sum method and rely on np.sum

* Force DenseMatrix to always be 2-dimensional

* Add __repr__ and __str__ methods

* Fix as_mx

* Fix ufunc return value

* Wrap SparseMatrix, too

* Demo of how the ufunc interface can be implemented

* Do not subclass csc_matrix

* Improve the performance of `from_pandas` in the case of low-cardinality categoricals (#275)

* Improve the performance of `from_pandas`

* Update changelog according to review

* Add benchmark data to .gitignore (#282)

* Demonstrate binary ufuncs for sparse

* Add tocsc method

* Fix type checks

* Minor improvements

* ufunc support for categoricals

* Remove __array_ufunc__ interface

* Remove numpy operator mixin

* Add hstack function

* Add method for unpacking underlying array

* Add __matmul__ methods to SparseMatrix

* Stricter and more consistent indexing

* Be consistent when instantiating from 1d arrays

* Adjust tests to work with v4

* Fix type hints

* Add changelog entry

* term and column names for formula-based matrices

* Fix handling of formula-based names

* Add tests for formula-based names

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>
Co-authored-by: Uwe L. Korn <[email protected]>
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Apply Matthias' suggestions

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

* Allow missing values in `CategoricalMatrix` (#281)

* Add missing support to categoricals

* Rename functions

* Parametrize missing behavior in constructors

* Return a maskedarray from recover_orig

* Propagate missing_method when indexing

* Add tests

* Template all the things!

* Privatize has_missing attribute

* Add changelog entry

* Add option to treat missing values as a category

* Update changelog

* Raise if the missing category already exists

* Add tests for missing name and raise on existing

* Don't skip tests (they are fast)

* Apply suggestions from review

* Fix indexing

* Fix intercept name in formulas

* Add missing categorical functionality to formulas

* Much cooler handling of missing categoricals

* Add changelog entry

* Correctly create missing category from model_spec (#297)

* pyupgrade 3.9

* make ruff and mypy happy

* bump minimum formulaic version (stateful transforms)

* add test case with custom cat format

* pin formulaic minimum version to 0.6 (#340)

* cosmetics

* Raise for unseen categories when materializing from an existing `ModelSpec` (#341)

* Raise error on unseen levels when materializing

* Fix test for unseen categories

* Add test for raising on unseen categories

* Properly handle missings when checking for unseen

* Expand test for unseen missings

* Improve attribute name

* Add comment about dropping missings in tests for new levels

* consistent tense

* typo

* slightly improve wording

* Describe breaking change

* improve wording

* review comments

* add change from #356

* fix

* set default context to None

* add scope to other test, too

* tiny docstring cosmetics

* remove duplicate . [skip-ci]

* more docstring formatting

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Uwe L. Korn <[email protected]>
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marc-Antoine Schmidt <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>
8 people authored Apr 23, 2024
1 parent 89ea092 commit 4e8f1d6
Showing 23 changed files with 3,781 additions and 321 deletions.
20 changes: 16 additions & 4 deletions CHANGELOG.rst
@@ -7,17 +7,29 @@
Changelog
=========

Unreleased
----------
4.0.0 - 2024-04-23
------------------

**Breaking changes**:

- To unify the API, :class:`DenseMatrix` does not inherit from :class:`np.ndarray` anymore. To convert a :class:`DenseMatrix` to a :class:`np.ndarray`, use :meth:`DenseMatrix.unpack`.
- Similarly, :class:`SparseMatrix` does not inherit from :class:`sps.csc_matrix` anymore. To convert a :class:`SparseMatrix` to a :class:`sps.csc_matrix`, use :meth:`SparseMatrix.unpack`.

**New features:**

- Added column name and term name metadata to :class:`MatrixBase` objects. These are automatically populated when initializing a :class:`MatrixBase` from a :class:`pandas.DataFrame`. In addition, they can be accessed and modified via the :attr:`MatrixBase.column_names` and :attr:`MatrixBase.term_names` properties.
- Added a formula interface for creating tabmat matrices from pandas data frames. See :func:`tabmat.from_formula` for details.
- Added support for missing values in :class:`CategoricalMatrix` by either creating a separate category for them or treating them as all-zero rows.
- Added support for handling missing categorical values in pandas data frames.

**Bug fix:**

- Added cython compiler directive legacy_implicit_noexcept = True to fix performance regression with cython 3.
- Added cython compiler directive ``legacy_implicit_noexcept = True`` to fix performance regression with cython 3.

**Other changes:**

- Refactored the pre-commit hooks to use ruff.
- Refactored CategoricalMatrix's transpose_matvec to be deterministic when using OpenMP.
- Refactored :meth:`CategoricalMatrix.transpose_matvec` to be deterministic when using OpenMP.
- Adjusted transformation to sparse format in :func:`tabmat.from_pandas` to future changes in pandas.

3.1.13 - 2023-10-17
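
The 4.0.0 changelog entry above describes two ways of handling missing values in a categorical column: giving them a category of their own, or encoding them as all-zero rows. The pure-Python sketch below illustrates the difference on a plain one-hot encoding. It is not tabmat's implementation: `one_hot`, the option values `"fail"`, `"zero"` and `"convert"`, and the default `missing_name` are hypothetical names chosen for illustration. Only the idea of a `missing_method` setting, and of raising when the missing category already exists, comes from the commit message.

```python
def one_hot(values, missing_method="fail", missing_name="(MISSING)"):
    """One-hot encode `values`; `None` marks a missing entry.

    Hypothetical options (assumptions of this sketch):
      "fail"    -> raise if any value is missing
      "zero"    -> encode a missing value as an all-zero row
      "convert" -> add a dedicated category called `missing_name`
    """
    if missing_method not in ("fail", "zero", "convert"):
        raise ValueError(f"unknown missing_method: {missing_method!r}")

    levels = sorted({v for v in values if v is not None})
    has_missing = any(v is None for v in values)

    if has_missing and missing_method == "fail":
        raise ValueError("missing values found and missing_method='fail'")

    if has_missing and missing_method == "convert":
        # Refuse to silently merge real observations with missing ones.
        if missing_name in levels:
            raise ValueError(f"category {missing_name!r} already exists")
        levels.append(missing_name)
        values = [missing_name if v is None else v for v in values]

    index = {level: j for j, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        if v is not None:  # under "zero", a missing value keeps an all-zero row
            row[index[v]] = 1
        rows.append(row)
    return levels, rows


data = ["a", None, "b", "a"]
levels_zero, rows_zero = one_hot(data, missing_method="zero")
levels_conv, rows_conv = one_hot(data, missing_method="convert")
```

With `"zero"`, the second row is `[0, 0]` and the level list stays `["a", "b"]`. With `"convert"`, a third column is appended for the missing category and the second row becomes `[0, 0, 1]`.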
1 change: 1 addition & 0 deletions conda.recipe/meta.yaml
@@ -38,6 +38,7 @@ requirements:
- {{ pin_compatible('numpy') }}
- pandas
- scipy
- formulaic>=0.6

test:
requires:
1 change: 1 addition & 0 deletions environment-win.yml
@@ -5,6 +5,7 @@ channels:
dependencies:
- libblas>=0=*mkl
- pandas
- formulaic>=0.6

# development tools
- click
1 change: 1 addition & 0 deletions environment.yml
@@ -4,6 +4,7 @@ channels:
- nodefaults
dependencies:
- pandas
- formulaic>=0.6

# development tools
- click
2 changes: 1 addition & 1 deletion setup.py
@@ -157,7 +157,7 @@
],
package_dir={"": "src"},
packages=find_packages(where="src"),
install_requires=["numpy", "pandas", "scipy"],
install_requires=["numpy", "pandas", "scipy", "formulaic>=0.6"],
python_requires=">=3.9",
ext_modules=cythonize(
ext_modules,
7 changes: 5 additions & 2 deletions src/tabmat/__init__.py
@@ -1,11 +1,11 @@
import importlib.metadata

from .categorical_matrix import CategoricalMatrix
from .constructor import from_csc, from_pandas
from .constructor import from_csc, from_formula, from_pandas
from .dense_matrix import DenseMatrix
from .matrix_base import MatrixBase
from .sparse_matrix import SparseMatrix
from .split_matrix import SplitMatrix
from .split_matrix import SplitMatrix, as_tabmat, hstack
from .standardized_mat import StandardizedMatrix

try:
@@ -21,5 +21,8 @@
"SplitMatrix",
"CategoricalMatrix",
"from_csc",
"from_formula",
"from_pandas",
"as_tabmat",
"hstack",
]
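
The breaking change recorded in the changelog (matrix classes wrap their underlying array instead of inheriting from it, and expose it through an `unpack` method) can be sketched with NumPy alone. `DenseWrapper` below is a hypothetical stand-in, not tabmat's `DenseMatrix`. Treating a one-dimensional input as a single column is an assumption of this sketch; the commit message only says that the matrix is always two-dimensional and that instantiation from 1d arrays is handled consistently.

```python
import numpy as np


class DenseWrapper:
    """Hypothetical wrapper that holds an ndarray rather than subclassing it."""

    def __init__(self, array):
        array = np.asarray(array)
        if array.ndim == 1:
            # Assumption of this sketch: a 1d input becomes a single column.
            array = array[:, np.newaxis]
        if array.ndim != 2:
            raise ValueError("input must be 1- or 2-dimensional")
        self._array = array

    @property
    def shape(self):
        return self._array.shape

    def unpack(self):
        """Return the underlying ndarray."""
        return self._array

    def __repr__(self):
        return f"DenseWrapper(shape={self.shape})"


m = DenseWrapper([1.0, 2.0, 3.0])
is_ndarray = isinstance(m, np.ndarray)    # the wrapper itself is not an ndarray
raw = m.unpack()                          # explicit conversion back to NumPy
total = float(np.sum(raw))
```

Code that previously relied on `isinstance(x, np.ndarray)` or on ndarray methods being available directly would, under this pattern, call `unpack()` first.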