Issue/237 #241

Open
wants to merge 53 commits into
base: dev
53 commits
5edfbfa
DEV: Created standalone class for Random Forest Models (#237)
NickEdwards7502 Sep 11, 2024
80a9c59
DEV: Updated varspark python wrapper (#237)
NickEdwards7502 Sep 11, 2024
23520ec
DEV: Created standalone FeatureSource class in separate file (#237)
NickEdwards7502 Sep 11, 2024
4560998
REFACTOR: Remove unnecessary hail import for hail rf wrapper
NickEdwards7502 Sep 11, 2024
b8b39fd
DEV: Created standalone ImportanceAnalysis class in
NickEdwards7502 Sep 11, 2024
4bfaac9
DEV: Updated ImportanceAnalysis scala class (#237)
NickEdwards7502 Sep 11, 2024
0fc736f
DEV: Created scala function that trains a forest
NickEdwards7502 Sep 11, 2024
ea069d6
REFACTOR: Removed model definition and training
NickEdwards7502 Sep 11, 2024
e08f12a
DEV: Create no-hail equivalent of JSON model export (#237)
NickEdwards7502 Sep 11, 2024
ddc5912
REFACTOR: Update importance API test cases to reflect changes (#237)
NickEdwards7502 Sep 11, 2024
f6d40d4
REFACTOR: Update reproducibility test case to reflect changes (#237)
NickEdwards7502 Sep 11, 2024
3356d9a
DEV: Update python unit testing (#237)
NickEdwards7502 Sep 11, 2024
59f40bc
DEV: Create no hail lfdr class (#237)
NickEdwards7502 Sep 11, 2024
3f8066b
DEV: Create temp hail notebook for testing JSON export OOM (#237)
NickEdwards7502 Sep 11, 2024
de29b45
DEV: Create temp notebook for demonstrating VS functionality
NickEdwards7502 Sep 11, 2024
fe2db4c
DEV: Add covariate import wrapper function (#237)
NickEdwards7502 Sep 13, 2024
a9b9570
DEV: Create python class for covariate imports (#237)
NickEdwards7502 Sep 13, 2024
3ea4c8c
STYLE: Format with black (#237)
NickEdwards7502 Sep 13, 2024
8f11e62
REFACTOR: Remove covariatesource as not required (#237)
NickEdwards7502 Sep 19, 2024
d671f35
STYLE: Format with black (#237)
NickEdwards7502 Sep 19, 2024
209a463
DEV: Add wrapper functions for covariate support (#237)
NickEdwards7502 Sep 19, 2024
b94afcc
STYLE: Format with black (#237)
NickEdwards7502 Sep 19, 2024
04daae2
STYLE: Format with black (#237)
NickEdwards7502 Sep 19, 2024
30732ba
DEV: Update lfdr to support covariates (#237)
NickEdwards7502 Sep 19, 2024
3381e68
STYLE: Format with scalafmt (#237)
NickEdwards7502 Sep 19, 2024
37f4193
DEV: Update VSContext to support covariates (#237)
NickEdwards7502 Sep 19, 2024
dfae3c2
DEV: Update std CSV features to support optional variable type specs …
NickEdwards7502 Sep 19, 2024
9733844
DEV: Create class for returning union of features and covariates (#237)
NickEdwards7502 Sep 19, 2024
dd32e0f
CHORE: Add reproducibility test that includes covariates (#237)
NickEdwards7502 Sep 19, 2024
769ce76
CHORE: Remove print statements from RF reproducibility test (#237)
NickEdwards7502 Sep 19, 2024
e9a23cf
CHORE: Update no hail notebook to include covariates (#237)
NickEdwards7502 Sep 19, 2024
4379b0a
DEV: Update fdr estimation to return p-value cutoff (#237)
NickEdwards7502 Sep 19, 2024
d2048d0
FIX: Correct significance line plot to use cutoff p-value (#237)
NickEdwards7502 Sep 19, 2024
2416b8e
REFACTOR: Update pairwise operation tests based on import changes (#237)
NickEdwards7502 Sep 19, 2024
1529bd8
CHORE: Update nohail notebook with significance line changes (#237)
NickEdwards7502 Sep 19, 2024
12d6137
FIX: Add covariate type parameter to python wrapper for std csv (#237)
NickEdwards7502 Sep 19, 2024
4df6e32
DEV: Replace to_df functionality with head (#237)
NickEdwards7502 Sep 19, 2024
b1fe760
DEV: Update FeatureSource dataframe conversion (#237)
NickEdwards7502 Sep 19, 2024
4506139
CHORE: Update nohail demo to include df slicing (#237)
NickEdwards7502 Sep 19, 2024
07cd144
REFACTOR: Make ExportModel compatible with JsonRFAnalyser (#237)
NickEdwards7502 Sep 20, 2024
cef13f4
REFACTOR: Make head concrete & inherited by all feature sources (#237)
NickEdwards7502 Sep 23, 2024
b6f4e3b
FIX: Update regex import for lfdr DataFrame parsing (#237)
NickEdwards7502 Sep 23, 2024
9b7f83b
CHORE: Remove cutoff pvalue print statement (#237)
NickEdwards7502 Oct 2, 2024
f66f445
CHORE: Remove VSHail python wrapper files (#237)
NickEdwards7502 Oct 2, 2024
e6e637c
CHORE: Remove VSHail Random Forest scala class (#237)
NickEdwards7502 Oct 2, 2024
8f82d29
CHORE: Update spark context test to use non-hail Kryo registrator (#237)
NickEdwards7502 Oct 2, 2024
1be8d66
CHORE: Remove test cases specifically for hail (#237)
NickEdwards7502 Oct 2, 2024
bd53f27
CHORE: Remove hail from maven build (#237)
NickEdwards7502 Oct 2, 2024
45ae200
REFACTOR: Update lfdr filename and python wrapper reference (#237)
NickEdwards7502 Oct 2, 2024
fe70285
CHORE: Remove hail unit tests (#237)
NickEdwards7502 Oct 2, 2024
279bd5b
FIX: Update LocalFDR import statement in rfmodel.py (#237)
NickEdwards7502 Oct 17, 2024
5ad8cc0
DEV: Integrate bgzipped file support in VCF import API (#237)
NickEdwards7502 Oct 17, 2024
b686d75
DEV: Implement imputation for VCF features (#237)
NickEdwards7502 Oct 17, 2024
171 changes: 171 additions & 0 deletions examples/notebooks/run_importance_chr22_hail.ipynb
@@ -0,0 +1,171 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"using variant-spark jar at '/home/edw222/Projects/VariantSpark/familiarisation/VariantSpark/target/variant-spark_2.12-0.5.3-SNAPSHOT-all.jar'\n",
"2024-09-10 10:03:30 Hail: WARN: This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.\n",
" Compatibility is not guaranteed.\n",
"2024-09-10 10:03:30 Hail: INFO: SparkUI: http://vpn-internal-gateway-check.nexus.csiro.au:4041\n",
"Running on Apache Spark version 3.1.2\n",
"SparkUI available at http://vpn-internal-gateway-check.nexus.csiro.au:4041\n",
"Welcome to\n",
" __ __ <>__\n",
" / /_/ /__ __/ /\n",
" / __ / _ `/ / /\n",
" /_/ /_/\\_,_/_/_/ version 0.2.74-0c3a74d12093\n",
"LOGGING: writing to /home/edw222/Projects/VariantSpark/familiarisation/VariantSpark/examples/notebooks/hail-20240910-1003-0.2.74-0c3a74d12093.log\n"
]
}
],
"source": [
"import hail as hl\n",
"import varspark.hail as vshl\n",
"vshl.init()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div class=\"bk-root\">\n",
" <a href=\"https://bokeh.org\" target=\"_blank\" class=\"bk-logo bk-logo-small bk-logo-notebook\"></a>\n",
" <span id=\"1002\">Loading BokehJS ...</span>\n",
" </div>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/javascript": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n var JS_MIME_TYPE = 'application/javascript';\n var HTML_MIME_TYPE = 'text/html';\n var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n var CLASS_NAME = 'output_bokeh rendered_html';\n\n /**\n * Render data to the DOM node\n */\n function render(props, node) {\n var script = document.createElement(\"script\");\n node.appendChild(script);\n }\n\n /**\n * Handle when an output is cleared or removed\n */\n function handleClearOutput(event, handle) {\n var cell = handle.cell;\n\n var id = cell.output_area._bokeh_element_id;\n var server_id = cell.output_area._bokeh_server_id;\n // Clean up Bokeh references\n if (id != null && id in Bokeh.index) {\n Bokeh.index[id].model.document.clear();\n delete Bokeh.index[id];\n }\n\n if (server_id !== undefined) {\n // Clean up Bokeh references\n var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n cell.notebook.kernel.execute(cmd, {\n iopub: {\n output: function(msg) {\n var id = msg.content.text.trim();\n if (id in Bokeh.index) {\n Bokeh.index[id].model.document.clear();\n delete Bokeh.index[id];\n }\n }\n }\n });\n // Destroy server and session\n var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n cell.notebook.kernel.execute(cmd);\n }\n }\n\n /**\n * Handle when a new output is added\n */\n function handleAddOutput(event, handle) {\n var output_area = handle.output_area;\n var output = handle.output;\n\n // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n return\n }\n\n var toinsert = 
output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n\n if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n // store reference to embed id on output_area\n output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n }\n if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n var bk_div = document.createElement(\"div\");\n bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n var script_attrs = bk_div.children[0].attributes;\n for (var i = 0; i < script_attrs.length; i++) {\n toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n }\n // store reference to server id on output_area\n output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n }\n }\n\n function register_renderer(events, OutputArea) {\n\n function append_mime(data, metadata, element) {\n // create a DOM node to render to\n var toinsert = this.create_output_subarea(\n metadata,\n CLASS_NAME,\n EXEC_MIME_TYPE\n );\n this.keyboard_manager.register_events(toinsert);\n // Render to node\n var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n render(props, toinsert[toinsert.length - 1]);\n element.append(toinsert);\n return toinsert\n }\n\n /* Handle when an output is cleared or removed */\n events.on('clear_output.CodeCell', handleClearOutput);\n events.on('delete.Cell', handleClearOutput);\n\n /* Handle when a new output is added */\n events.on('output_added.OutputArea', handleAddOutput);\n\n /**\n * Register the mime type and append_mime function with output_area\n */\n OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n /* Is output safe? 
*/\n safe: true,\n /* Index of renderer in `output_area.display_order` */\n index: 0\n });\n }\n\n // register the mime type if in Jupyter Notebook environment and previously unregistered\n if (root.Jupyter !== undefined) {\n var events = require('base/js/events');\n var OutputArea = require('notebook/js/outputarea').OutputArea;\n\n if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n register_renderer(events, OutputArea);\n }\n }\n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"<div style='background-color: #fdd'>\\n\"+\n \"<p>\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"</p>\\n\"+\n \"<ul>\\n\"+\n \"<li>re-rerun `output_notebook()` to attempt to load from CDN again, or</li>\\n\"+\n \"<li>use INLINE resources instead, as so:</li>\\n\"+\n \"</ul>\\n\"+\n \"<code>\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"</code>\\n\"+\n \"</div>\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"1002\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = 
[];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error() {\n console.error(\"failed to load \" + url);\n }\n\n for (var i = 0; i < css_urls.length; i++) {\n var url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };var element = document.getElementById(\"1002\");\n if (element == null) {\n console.error(\"Bokeh: ERROR: autoload.js configured with elementid '1002' but no matching script tag was found. 
\")\n return false;\n }\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n \n var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.4.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-1.4.0.min.js\"];\n var css_urls = [];\n \n\n var inline_js = [\n function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\n function(Bokeh) {\n \n \n }\n ];\n\n function run_inline_js() {\n \n if (root.Bokeh !== undefined || force === true) {\n \n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"1002\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));",
"application/vnd.bokehjs_load.v0+json": ""
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from hail.plot import show\n",
"from pprint import pprint\n",
"hl.plot.output_notebook()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"vds = hl.import_vcf('../../data/chr22_1000.vcf')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-09-10 10:03:30 Hail: INFO: Reading table to impute column types\n",
"2024-09-10 10:03:31 Hail: INFO: Finished type imputation\n",
" Loading field 'sample' as type str (imputed)\n",
" Loading field 'x22_16050408' as type int32 (imputed)\n",
" Loading field 'x22_16050612' as type str (imputed)\n",
" Loading field 'x22_16050678' as type str (imputed)\n",
" Loading field 'x22_16050984' as type int32 (imputed)\n",
" Loading field 'x22_16051107' as type int32 (imputed)\n",
" Loading field 'x22_16051249' as type int32 (imputed)\n",
" Loading field 'x22_16051347' as type int32 (imputed)\n",
" Loading field 'x22_16051453' as type int32 (imputed)\n",
" Loading field 'x22_16051477' as type int32 (imputed)\n",
" Loading field 'x22_16051480' as type int32 (imputed)\n"
]
}
],
"source": [
"labels = hl.import_table('../../data/chr22-labels-hail.csv', impute = True, delimiter=\",\").key_by('sample')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"vds = vds.annotate_cols(label = labels[vds.s])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-09-10 10:03:31 Hail: INFO: Coerced almost-sorted dataset\n"
]
}
],
"source": [
"rf_model = vshl.random_forest_model(y=vds.label['x22_16050408'],\n",
" x=vds.GT.n_alt_alleles(), seed = 13, mtry_fraction = 0.05, min_node_size = 5, max_depth = 10)\n",
"rf_model.fit_trees(300, 50)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Saving model to: hailExportUnlabelled.json\n"
]
}
],
"source": [
"rf_model.to_json(\"hailExportUnlabelled.json\", False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
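Most of the notebook above is captured output (Hail logs and the BokehJS loader), so the actual workflow is easier to review as a plain script. The following is a minimal sketch, not part of this PR, that joins the `source` of each code cell and ignores `outputs`. The helper name `code_cells_to_script` and the two-cell `example` dict are hypothetical; only the `cells`, `cell_type`, and `source` keys are taken from the notebook JSON shown in the diff.

```python
def code_cells_to_script(notebook: dict) -> str:
    """Join the source of every code cell into one script, dropping outputs."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # nbformat 4 stores source either as a list of lines or as one string.
        text = source if isinstance(source, str) else "".join(source)
        if text.strip():
            chunks.append(text.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# Hypothetical stand-in that mirrors two cells of the notebook in this diff.
example = {
    "cells": [
        {
            "cell_type": "code",
            "outputs": [{"output_type": "stream", "text": ["log line\n"]}],
            "source": [
                "import hail as hl\n",
                "import varspark.hail as vshl\n",
                "vshl.init()",
            ],
        },
        {
            "cell_type": "code",
            "outputs": [],
            "source": ["vds = hl.import_vcf('../../data/chr22_1000.vcf')"],
        },
    ],
    "nbformat": 4,
    "nbformat_minor": 2,
}

script = code_cells_to_script(example)
print(script)
```

To apply it to the real file, the notebook can be read with the standard library's `json.load` and the resulting dict passed to the helper.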