-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolution to Issue 114 #115
Conversation
plotStackedLineage: no visible binding for global variable ‘cpcols’ Added a random color-pallete as default, can be changed later on. (source for color palette : https://www.colorcombos.com/color-schemes/16952/ColorCombo16952.html)
selectLongestDuplicate: no visible binding for global variable ‘AccNum’ selectLongestDuplicate: no visible binding for global variable ‘row.orig’ [Changes] Added @importFrom rlang .data to import the .data pronoun. Used .data$AccNum instead of AccNum to refer to the column within dplyr functions. Used .data$row.orig instead of row.orig to refer to the column within dplyr functions. Used mutate to initialise prot. Changed merge() to left_join() for consistency with dplyr. Used seq_len() instead of : for creating row numbers. Simplified the longest row selection using which.max(). Used filter() instead of negative indexing to remove rows.
@the-mayer : Kindly review this PR. |
Adding a blank commit to link PR to the issue
prot <- prot %>% | ||
dplyr::mutate(!!by_column := stringr::str_replace_all( | ||
.data[[by_column]], | ||
c( | ||
"\\." = "_d_", | ||
" " = "_", | ||
"\\+" = " ", | ||
"-" = "__", | ||
regex_identify_repeats = "\\1(s)", | ||
"__" = "-", | ||
" " = "+", | ||
"_d_" = "." | ||
) | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice -- thank you for streamlining this code, while also fixing the R-CMD check error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @Klangina. I appreciate the additional work to streamline the codebase while also fixing the errors reported by R-CMD check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommendations
-
Code Robustness:
- Plot Function (
plotEstimatedWallTimes
): Add checks to ensuredf_walltimes
has the expected structure before performing operations likegather
andmutate
. This will help avoid runtime errors or unexpected warnings. - Duplicate Selection Function (
selectLongestDuplicate
): Include validation to confirm that the specifiedcolumn
exists inprot
, preventing errors if an invalid column name is provided.
- Plot Function (
-
Documentation:
- Plot Function: Improve documentation by detailing the required structure of
df_walltimes
, helping users understand the data prerequisites. - Duplicate Selection Function: Add information on the expected structure of
prot
, including that it should containAccNum
as a primary identifier, to aid users in correctly preparing their data.
- Plot Function: Improve documentation by detailing the required structure of
Summary
Both functions are structured effectively for their purposes, offering clear data transformation, targeted output (a line plot and filtered data frame, respectively), and suitable error handling. Minor documentation enhancements and a few additional error checks would further increase their reliability and usability for developers unfamiliar with the data structure.
@@ -657,13 +658,13 @@ plotEstimatedWallTimes <- function() { | |||
df_walltimes <- tidyr::gather(df_walltimes, | |||
key = "advanced_opts", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data Manipulation:
df_walltimes is reshaped using tidyr::gather, converting it into a long format suitable for ggplot.
p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = n_inputs, | ||
y = est_walltime, | ||
color = advanced_opts)) + | ||
dplyr::mutate(est_walltime = .data$est_walltime / 3600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Estimated wall times (in seconds) are converted to hours using dplyr::mutate. This transformation is essential for an accurate representation in hours on the y-axis.
dplyr::mutate(est_walltime = .data$est_walltime / 3600) | ||
p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = .data$n_inputs, | ||
y = .data$est_walltime, | ||
color = .data$advanced_opts)) + | ||
ggplot2::geom_line() + | ||
ggplot2::labs( | ||
title = "MolEvolvR estimated runtimes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plotting:
-
The plot is created with ggplot2, setting n_inputs on the x-axis and est_walltime (in hours) on the y-axis.
-
color represents different values of advanced_opts, adding clarity by differentiating between options.
-
geom_line() is appropriate for connecting points in a line plot, aligning well with the goal of showing runtime trends.
-
Labels are clear, and the plot title is descriptive, providing context.
prot$row.orig <- 1:nrow(prot) | ||
|
||
col <- rlang::sym(column) | ||
prot <- prot %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Structure & Functionality
Parameter Handling:
The column parameter is converted to a symbol using rlang::sym, enabling flexible selection of any column by name. This is useful for adaptability across datasets with varying column names.
|
||
col <- rlang::sym(column) | ||
prot <- prot %>% | ||
mutate(row.orig = seq_len(n())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking Original Rows:
mutate(row.orig = seq_len(n())) is used to create a unique identifier for each row, ensuring original row positions are preserved even after filtering. This addition is necessary for accurately managing duplicate entries.
|
||
dup_acc <- dups$AccNum | ||
dup_acc <- unique(dups$AccNum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate Identification:
Duplicate entries are identified by grouping on AccNum and then filtered to include only groups with more than one entry. This is well-suited for cases where AccNum uniquely identifies duplicates.
|
||
longest <- dup_rows[which(nchar(pull(dup_rows, {{ col }})) == max(nchar(pull(dup_rows, {{ col }}))))[1], "row.orig"] | ||
longest <- dup_rows$row.orig[which.max(nchar(pull(dup_rows, !!col)))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Selecting the Longest Entry:
-
For each duplicate group, entries are compared based on the length of text in the specified column (column). The function identifies the row with the longest text entry using which.max(nchar(pull(...))), an efficient approach for selecting the desired row within each duplicate group.
-
Rows not selected as the longest entries are stored in remove_rows to exclude them later.
unique_dups <- prot[-remove_rows, ] %>% select(-row.orig) | ||
unique_dups <- prot %>% | ||
filter(!.data$row.orig %in% remove_rows) %>% | ||
select(-.data$row.orig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filtering and Output:
Finally, filter(!.data$row.orig %in% remove_rows) is used to retain only the longest entries, and row.orig is removed to clean up the output data.
Description
What kind of change(s) are included?
Checklist
Please ensure that all boxes are checked before indicating that this pull request is ready for review.