Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolution to Issue 114 #115

Merged
merged 8 commits into from
Oct 30, 2024
Merged

Resolution to Issue 114 #115

merged 8 commits into from
Oct 30, 2024

Conversation

Klangina
Copy link
Contributor

Description

What kind of change(s) are included?

  • Feature (adds or updates new capabilities)
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • I have added comments to my code to help provide understanding.
  • I have added a test which covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.
  • Reviewer assignment: Tag a relevant team member to review and approve the changes.

plotEstimatedWallTimes : get_powerset : <anonymous>: no visible global
     function definition for ‘combn’
   plotEstimatedWallTimes: no visible binding for global variable
     ‘n_inputs’
 plotEstimatedWallTimes: no visible binding for global variable
     ‘advanced_opts’
plotStackedLineage: no visible binding for global variable ‘cpcols’
Added a random color-pallete as default, can be changed later on. (source for color palette : https://www.colorcombos.com/color-schemes/16952/ColorCombo16952.html)
selectLongestDuplicate: no visible binding for global variable ‘AccNum’
selectLongestDuplicate: no visible binding for global variable  ‘row.orig’

[Changes]
Added @importFrom rlang .data to import the .data pronoun.
    Used .data$AccNum instead of AccNum to refer to the column within dplyr functions.
    Used .data$row.orig instead of row.orig to refer to the column within dplyr functions.
    Used mutate to initialise prot.
    Changed merge() to left_join() for consistency with dplyr.
    Used seq_len() instead of : for creating row numbers.
    Simplified the longest row selection using which.max().
    Used filter() instead of negative indexing to remove rows.
@Klangina
Copy link
Contributor Author

@the-mayer : Kindly review this PR.

Adding a blank commit to link PR to the issue
Comment on lines +210 to +223
prot <- prot %>%
dplyr::mutate(!!by_column := stringr::str_replace_all(
.data[[by_column]],
c(
"\\." = "_d_",
" " = "_",
"\\+" = " ",
"-" = "__",
regex_identify_repeats = "\\1(s)",
"__" = "-",
" " = "+",
"_d_" = "."
)
))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice -- thank you for streamlining this code, while also fixing the R-CMD check error.

Copy link
Collaborator

@the-mayer the-mayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @Klangina. I appreciate the additional work to streamline the codebase while also fixing the errors reported by R-CMD check.

@the-mayer the-mayer merged commit b758992 into JRaviLab:main Oct 30, 2024
1 check passed
Copy link
Collaborator

@Joiejoie1 Joiejoie1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommendations

  1. Code Robustness:

    • Plot Function (plotEstimatedWallTimes): Add checks to ensure df_walltimes has the expected structure before performing operations like gather and mutate. This will help avoid runtime errors or unexpected warnings.
    • Duplicate Selection Function (selectLongestDuplicate): Include validation to confirm that the specified column exists in prot, preventing errors if an invalid column name is provided.
  2. Documentation:

    • Plot Function: Improve documentation by detailing the required structure of df_walltimes, helping users understand the data prerequisites.
    • Duplicate Selection Function: Add information on the expected structure of prot, including that it should contain AccNum as a primary identifier, to aid users in correctly preparing their data.

Summary

Both functions are structured effectively for their purposes, offering clear data transformation, targeted output (a line plot and filtered data frame, respectively), and suitable error handling. Minor documentation enhancements and a few additional error checks would further increase their reliability and usability for developers unfamiliar with the data structure.

@@ -657,13 +658,13 @@ plotEstimatedWallTimes <- function() {
df_walltimes <- tidyr::gather(df_walltimes,
key = "advanced_opts",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data Manipulation:
df_walltimes is reshaped using tidyr::gather, converting it into a long format suitable for ggplot.

p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = n_inputs,
y = est_walltime,
color = advanced_opts)) +
dplyr::mutate(est_walltime = .data$est_walltime / 3600)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Estimated wall times (in seconds) are converted to hours using dplyr::mutate. This transformation is essential for an accurate representation in hours on the y-axis.

dplyr::mutate(est_walltime = .data$est_walltime / 3600)
p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = .data$n_inputs,
y = .data$est_walltime,
color = .data$advanced_opts)) +
ggplot2::geom_line() +
ggplot2::labs(
title = "MolEvolvR estimated runtimes",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plotting:

  • The plot is created with ggplot2, setting n_inputs on the x-axis and est_walltime (in hours) on the y-axis.

  • color represents different values of advanced_opts, adding clarity by differentiating between options.

  • geom_line() is appropriate for connecting points in a line plot, aligning well with the goal of showing runtime trends.

  • Labels are clear, and the plot title is descriptive, providing context.

prot$row.orig <- 1:nrow(prot)

col <- rlang::sym(column)
prot <- prot %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Structure & Functionality
Parameter Handling:

The column parameter is converted to a symbol using rlang::sym, enabling flexible selection of any column by name. This is useful for adaptability across datasets with varying column names.


col <- rlang::sym(column)
prot <- prot %>%
mutate(row.orig = seq_len(n()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking Original Rows:

mutate(row.orig = seq_len(n())) is used to create a unique identifier for each row, ensuring original row positions are preserved even after filtering. This addition is necessary for accurately managing duplicate entries.


dup_acc <- dups$AccNum
dup_acc <- unique(dups$AccNum)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate Identification:

Duplicate entries are identified by grouping on AccNum and then filtered to include only groups with more than one entry. This is well-suited for cases where AccNum uniquely identifies duplicates.


longest <- dup_rows[which(nchar(pull(dup_rows, {{ col }})) == max(nchar(pull(dup_rows, {{ col }}))))[1], "row.orig"]
longest <- dup_rows$row.orig[which.max(nchar(pull(dup_rows, !!col)))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selecting the Longest Entry:

  • For each duplicate group, entries are compared based on the length of text in the specified column (column). The function identifies the row with the longest text entry using which.max(nchar(pull(...))), an efficient approach for selecting the desired row within each duplicate group.

  • Rows not selected as the longest entries are stored in remove_rows to exclude them later.

unique_dups <- prot[-remove_rows, ] %>% select(-row.orig)
unique_dups <- prot %>%
filter(!.data$row.orig %in% remove_rows) %>%
select(-.data$row.orig)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering and Output:

Finally, filter(!.data$row.orig %in% remove_rows) is used to retain only the longest entries, and row.orig is removed to clean up the output data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants