Issue/237 #241
Open: NickEdwards7502 wants to merge 53 commits into dev from issue/237
FEAT: Implemented RF class method for fitting the model
FEAT: Implemented RF class method for obtaining importance analysis from a fitted RF
FEAT: Implemented RF class method for returning oob error
FEAT: Implemented RF class method for obtaining FDR from a fitted model
FEAT: Implemented RF class method for exporting forest to JSON
REFACTOR: Make RF model available at package level
CHORE: Added type checking to all methods

REFACTOR: Removed FeatureSource and ImportanceAnalysis classes from core
REFACTOR: Added FeatureSource import so features can be returned as a class instantiation

REFACTOR: Removed imp analysis and model training
FEAT: Added conversion from feature to RDD (python)
FEAT: Added conversion from feature to RDD (scala)
CHORE: Added type checking

due to import order warning (#237)

separate wrapper file (#237)
REFACTOR: Updated important_variables and variable_importance methods to convert to pandas DataFrames

REFACTOR: Removed model training from object instantiation and updated class to accept a model as a parameter
REFACTOR: Added normalisation as an optional parameter for variable importance methods
FEAT: Updated variableImportance method to include splitCount in return as it is required for local FDR analysis

and passes back to python context (#237)

from importAnalysis method of AnalyticsFunctions (#237)

FIX: Update export function to process trees in batches instead of collecting the whole forest as a map, as this led to OOM errors on large forests

REFACTOR: Refactor to mirror changes to python wrapper
FEAT: Include FDR calculation in unit test

FEAT: Implement function for manhattan plotting negative log p values

STYLE: Format with black

FEAT: Add wrapper class for importing covariates
FEAT: Add wrapper class for unioning features and covariates

REFACTOR: Include covariate filtering in manhattan plot function
STYLE: Format with black (#237)

FEAT: Add functions for importing std and transposed CSVs
FEAT: Add function for unioning features and covariates

Reference changed to importTransposedCSV

REFACTOR: Remove python component of converting Feature RDD to pandas
FEAT: Add RDD slice to DF function

REFACTOR: Remove conversion of whole RDD to DataFrame
FEAT: Add function for slicing rows and columns and converting to DF
NickEdwards7502 added the enhancement, dependencies, java, and python labels on Oct 2, 2024
* .bgz loader function implemented by Christina
* Update python wrapper to include imputation strategy parameter
* Update scala API to pass imputation strategy to VCFFeatureSource
* Create functions to handle mode and zero imputation strategies
* Added imputation strategy to test cases
* Added imputation strategy to FeatureSource cli
* Remove sparkPar from test cases due to changes in class signature
* Updated DefVariantToFeatureConverterTest to use zeros imputation
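For reviewers unfamiliar with the two imputation strategies named in this commit, here is a minimal sketch of the idea in plain Python. It is not the code from this PR: the function name `impute`, the strategy strings `"mode"` and `"zeros"`, and the use of `None` to mark a missing genotype call are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional


def impute(calls: List[Optional[int]], strategy: str = "mode") -> List[int]:
    """Replace missing calls (None) using the chosen strategy.

    Hypothetical helper: the names and the None encoding are assumptions,
    not VariantSpark's actual API.
    """
    if strategy == "zeros":
        # Zero imputation: every missing call becomes 0.
        fill = 0
    elif strategy == "mode":
        # Mode imputation: use the most common observed call.
        observed = [c for c in calls if c is not None]
        # Fall back to 0 when every call is missing.
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
    else:
        raise ValueError(f"unknown imputation strategy: {strategy!r}")
    return [fill if c is None else c for c in calls]
```

With this sketch, `impute([0, 1, None, 1, 2], "mode")` fills the gap with `1`, while `impute([0, None, 2], "zeros")` fills it with `0`.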
NickEdwards7502 force-pushed the issue/237 branch from a5673e5 to b686d75 on October 17, 2024 06:07
Major issues and features addressed in this update
VariantSpark's Python wrapper has been refactored to create Random Forest models from a standalone class
* python/varspark/rfmodel.py
* python/varspark/core.py
* python/varspark/__init__.py
* src/main/scala/au/csiro/variantspark/api/GetRFModel.scala
A non-hail export model function was created
* src/main/scala/au/csiro/variantspark/api/ExportModel.scala
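The commit history for this PR notes that the export function processes trees in batches because collecting the whole forest at once led to OOM errors on large forests. The sketch below illustrates that general pattern in Python; it is not the Scala code in ExportModel.scala. The names `batches` and `export_forest`, the `{"trees": [...]}` layout, and the default batch size are assumptions for illustration only.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List, TextIO


def batches(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def export_forest(trees: Iterable[Dict], out: TextIO, batch_size: int = 100) -> int:
    """Stream trees to `out` as JSON, holding one batch in memory at a time.

    Hypothetical sketch: the output layout is an assumption, not the
    format written by VariantSpark.
    """
    written = 0
    out.write('{"trees": [')
    for chunk in batches(trees, batch_size):
        for tree in chunk:
            if written:
                out.write(", ")
            out.write(json.dumps(tree))
            written += 1
    out.write("]}")
    return written
```

Because `trees` can be a generator, only `batch_size` trees need to be materialised at any moment, which is the property the fix relies on.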
The FeatureSource class, which provides wrapper functionality for initialising genotype data for model training, has been moved to a standalone class
* head(nrows, ncols) allows the first n rows and columns to be viewed as a pandas DataFrame
* python/varspark/featuresource.py
* python/varspark/core.py
* src/main/scala/au/csiro/variantspark/input/FeatureSource.scala
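To make the intent of `head(nrows, ncols)` concrete, here is a rough sketch of slicing only the first rows and columns instead of converting the whole dataset. It is a stand-in written with plain Python iterables rather than an RDD, and the function signature, the `(label, values)` feature layout, and the `sample_names` argument are assumptions, not the wrapper's real interface.

```python
from itertools import islice
from typing import Dict, Iterable, List, Tuple


def head(
    features: Iterable[Tuple[str, List[int]]],
    sample_names: List[str],
    nrows: int = 10,
    ncols: int = 10,
) -> Dict[str, Dict[str, int]]:
    """Return the first `nrows` features restricted to the first `ncols` samples.

    Hypothetical sketch: only the requested slice is ever read from
    `features`, so a large source is never fully materialised.
    """
    cols = sample_names[:ncols]
    table: Dict[str, Dict[str, int]] = {}
    for label, values in islice(features, nrows):
        table[label] = dict(zip(cols, values[:ncols]))
    return table
```

The nested dictionary maps each feature label to its per-sample values; if pandas is available, it could be turned into a DataFrame with `pandas.DataFrame.from_dict(table, orient="index")`.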
Covariate support was extended
* Covariates use the FeatureSource wrapper class and are also of type RDD[Feature]; they also support head()
* src/main/scala/au/csiro/variantspark/api/VSContext.scala
* src/main/scala/au/csiro/variantspark/input/CsvStdFeatureSource.scala
* src/main/scala/au/csiro/variantspark/input/UnionedFeatureSource.scala
* python/varspark/lfdrvsnohail.py
Importance analyses were moved to a standalone Python wrapper class
* important_variables() and variable_importance() are now returned as pandas DataFrames
* The split count is now included in the output of variable_importance() (required for Local FDR calculations)
* precision supports rounding for variable_importance()
* normalized indicates whether to normalise importances for both functions
* python/varspark/importanceanalysis.py
* python/varspark/core.py
* src/main/scala/au/csiro/variantspark/api/ImportanceAnalysis.scala
* src/main/scala/au/csiro/variantspark/api/AnalyticsFunctions.scala
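As a rough illustration of what the `precision` and `normalized` options might do to a set of importance scores, consider the sketch below. It is an assumption-laden stand-in: the helper name `scale_importances` is hypothetical, and normalising so that the scores sum to 1 is one plausible reading of "normalise", not something this PR description confirms.

```python
from typing import Dict, Optional


def scale_importances(
    raw: Dict[str, float],
    precision: Optional[int] = None,
    normalized: bool = False,
) -> Dict[str, float]:
    """Optionally normalise importances to sum to 1, then optionally round.

    Hypothetical helper: the sum-to-one scheme is an assumption.
    """
    values = dict(raw)
    if normalized:
        total = sum(values.values())
        # Leave the scores untouched if there is nothing to divide by.
        if total > 0:
            values = {name: score / total for name, score in values.items()}
    if precision is not None:
        values = {name: round(score, precision) for name, score in values.items()}
    return values
```

For example, `scale_importances({"a": 3.0, "b": 1.0}, normalized=True)` gives `{"a": 0.75, "b": 0.25}` under this scheme. Rounding is applied after normalisation so that `precision` controls the values the caller actually sees.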
Moved lfdr file to non-hail Python directory
* python/varspark/hail/lfdrvs.py
* python/varspark/lfdrvs.py
Updated all test cases according to the above changes
* src/test/scala/au/csiro/variantspark/api
  * /CommonPairwiseOperationTest.scala
  * /ImportanceApiTest.scala
* src/test/scala/au/csiro/variantspark/misc
  * /ReproducibilityTest.scala
  * /CovariateReproducibilityTest.scala
* src/test/scala/au/csiro/variantspark/test
  * /TestSparkContext.scala
* python/varspark/test
  * /test_core.py
  * /test_hail.py
  * /test_pvalues_calculation.py
* src/test/scala/au/csiro/variantspark/work/hail
  * /HailApiApp.scala
Removed all files used exclusively in hail version
* python/varspark/hail
  * __init__.py
  * context.py
  * hail.py
  * methods.py
  * plot.py
* src/main/scala/au/csiro/variantspark/hail/methods
  * RFModel.scala
Removed hail installation from pom.xml