Resolution to Issue 114 #115

Klangina · 2024-10-28T19:38:56Z

Description

What kind of change(s) are included?

Feature (adds or updates new capabilities)
Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

I have read and followed the CONTRIBUTING.md guidelines.
I have searched for existing content to ensure this is not a duplicate.
I have performed a self-review of these additions (including spelling, grammar, and related).
I have added comments to my code to help provide understanding.
I have added a test which covers the code changes found within this PR.
I have deleted all non-relevant text in this pull request template.
Reviewer assignment: Tag a relevant team member to review and approve the changes.

…le ‘.’

plotEstimatedWallTimes : get_powerset : <anonymous>: no visible global function definition for ‘combn’
plotEstimatedWallTimes: no visible binding for global variable ‘n_inputs’
plotEstimatedWallTimes: no visible binding for global variable ‘advanced_opts’
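A note of the form "no visible global function definition for ‘combn’" usually goes away once the call is namespaced as utils::combn() or imported with @importFrom utils combn. The sketch below assumes a helper shaped like get_powerset; the body is illustrative only and is not the package's actual implementation.

```r
# Hypothetical stand-in for get_powerset(); not the MolEvolvR implementation.
# Writing utils::combn() makes the origin of combn() explicit for R CMD check.
get_powerset <- function(vec) {
  # For each subset size k, list every k-element combination of vec
  subsets <- lapply(seq_along(vec), function(k) {
    utils::combn(vec, k, simplify = FALSE)
  })
  # Flatten the per-size lists into one list of non-empty subsets
  unlist(subsets, recursive = FALSE)
}

# Made-up option names, used only to exercise the helper
opts <- c("option_a", "option_b")
powerset <- get_powerset(opts)
n_subsets <- length(powerset)
```

With two options this yields the three non-empty subsets; the empty set is deliberately left out of this sketch.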

plotStackedLineage: no visible binding for global variable ‘cpcols’
Added a randomly chosen color palette as the default; it can be changed later. (Source for the color palette: https://www.colorcombos.com/color-schemes/16952/ColorCombo16952.html)

selectLongestDuplicate: no visible binding for global variable ‘AccNum’
selectLongestDuplicate: no visible binding for global variable ‘row.orig’

[Changes]
- Added @importFrom rlang .data to import the .data pronoun.
- Used .data$AccNum instead of AccNum to refer to the column within dplyr functions.
- Used .data$row.orig instead of row.orig to refer to the column within dplyr functions.
- Used mutate to initialise prot.
- Changed merge() to left_join() for consistency with dplyr.
- Used seq_len() instead of : for creating row numbers.
- Simplified the longest row selection using which.max().
- Used filter() instead of negative indexing to remove rows.
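As a rough illustration of the .data changes listed above: inside dplyr verbs, .data$AccNum tells R CMD check that AccNum is a column rather than an undefined global. The data frame below is made up; only the column name AccNum comes from this PR, and Species is an assumed example column.

```r
library(dplyr)

# Toy data; the values are invented for illustration
prot <- data.frame(
  AccNum = c("A1", "A1", "B2"),
  Species = c("E. coli", "Escherichia coli", "B. subtilis"),
  stringsAsFactors = FALSE
)

# In package code this pairs with the roxygen tag:
#   #' @importFrom rlang .data
with_len <- prot %>%
  mutate(name_length = nchar(.data$Species)) %>%
  filter(.data$AccNum == "A1")
```

The pronoun works in a plain script without the import; the roxygen tag matters only when the code lives in a package that is checked with R CMD check.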

Klangina · 2024-10-28T19:39:40Z

@the-mayer : Kindly review this PR.

Adding a blank commit to link PR to the issue

the-mayer · 2024-10-30T21:08:10Z

R/cleanup.R

+    prot <- prot %>%
+        dplyr::mutate(!!by_column := stringr::str_replace_all(
+            .data[[by_column]],
+            c(
+                "\\." = "_d_",
+                " " = "_",
+                "\\+" = " ",
+                "-" = "__",
+                regex_identify_repeats = "\\1(s)",
+                "__" = "-",
+                " " = "+",
+                "_d_" = "."
+            )
+        ))


Very nice -- thank you for streamlining this code, while also fixing the R-CMD check error.
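For readers unfamiliar with the idiom in this hunk: when stringr::str_replace_all() receives a named character vector, each name is a pattern and each value its replacement, applied one after another in order. One detail worth checking, assuming regex_identify_repeats is a variable that holds a pattern (its definition is not shown here): names written inside c() are literal strings, so a pattern stored in a variable is usually attached with setNames(). The pattern and input below are hypothetical.

```r
library(stringr)

# Hypothetical pattern variable standing in for a regex built elsewhere
repeat_pattern <- "(\\w+)( \\1)+"

x <- "kinase kinase domain.v1"

# Replacements are applied in order:
#   1. protect literal dots, 2. collapse repeated words, 3. restore dots
replacements <- c(
  "\\." = "_d_",
  setNames("\\1(s)", repeat_pattern),
  "_d_" = "."
)

result <- str_replace_all(x, replacements)
```

Protecting the dot first keeps it out of reach of the repeat pattern, and the final step puts it back.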

the-mayer

Thank you for your contribution @Klangina. I appreciate the additional work to streamline the codebase while also fixing the errors reported by R-CMD check.

Joiejoie1

Recommendations

Code Robustness:
- Plot Function (plotEstimatedWallTimes): Add checks to ensure df_walltimes has the expected structure before performing operations like gather and mutate. This will help avoid runtime errors or unexpected warnings.
- Duplicate Selection Function (selectLongestDuplicate): Include validation to confirm that the specified column exists in prot, preventing errors if an invalid column name is provided.
Documentation:
- Plot Function: Improve documentation by detailing the required structure of df_walltimes, helping users understand the data prerequisites.
- Duplicate Selection Function: Add information on the expected structure of prot, including that it should contain AccNum as a primary identifier, to aid users in correctly preparing their data.
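The checks recommended above could look something like the following sketch. assert_has_columns is a hypothetical helper, not an existing MolEvolvR function, and the error messages are placeholders.

```r
# Hypothetical validation helper; the name and messages are illustrative
assert_has_columns <- function(df, cols, arg = "df") {
  if (!is.data.frame(df)) {
    stop("`", arg, "` must be a data frame.", call. = FALSE)
  }
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "`", arg, "` is missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

prot <- data.frame(AccNum = c("A1", "B2"), Species = c("x", "y"))

# Passes silently when the columns exist
assert_has_columns(prot, c("AccNum", "Species"), arg = "prot")

# Fails with a readable message when a column is absent
err <- tryCatch(
  assert_has_columns(prot, c("AccNum", "Lineage"), arg = "prot"),
  error = function(e) conditionMessage(e)
)
```

Calling such a helper at the top of selectLongestDuplicate or plotEstimatedWallTimes would surface a bad input before any dplyr or tidyr call runs.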

Summary

Both functions are structured effectively for their purposes, offering clear data transformation, targeted output (a line plot and filtered data frame, respectively), and suitable error handling. Minor documentation enhancements and a few additional error checks would further increase their reliability and usability for developers unfamiliar with the data structure.

Joiejoie1 · 2024-10-31T04:20:47Z

R/assign_job_queue.R

@@ -657,13 +658,13 @@ plotEstimatedWallTimes <- function() {
    df_walltimes <- tidyr::gather(df_walltimes,
                                  key = "advanced_opts",


Data Manipulation:
df_walltimes is reshaped using tidyr::gather, converting it into a long format suitable for ggplot.
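To make the reshaping concrete, here is a small sketch with invented numbers. The key name "advanced_opts" comes from the hunk above; the wide column names opt_a and opt_b and the value name est_walltime as used here are assumptions about the input. tidyr documents gather() as superseded, so the equivalent pivot_longer() call is shown alongside it.

```r
library(tidyr)

# Made-up wide table: one row per input size, one column per option set
df_wide <- data.frame(
  n_inputs = c(1, 10, 100),
  opt_a = c(60, 600, 6000),
  opt_b = c(120, 1200, 12000)
)

# Long format: one row per (n_inputs, option) pair
df_long <- gather(df_wide,
  key = "advanced_opts", value = "est_walltime", -n_inputs
)

# Same reshaping with the newer interface
df_long2 <- pivot_longer(df_wide,
  cols = -n_inputs,
  names_to = "advanced_opts", values_to = "est_walltime"
)
```

Both calls turn the three-row wide table into six rows, which is the shape ggplot2 expects when color is mapped to the option name.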

Joiejoie1 · 2024-10-31T04:25:05Z

R/assign_job_queue.R

-    p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = n_inputs,
-                                                    y = est_walltime,
-                                                    color = advanced_opts)) +
+      dplyr::mutate(est_walltime = .data$est_walltime / 3600)


Estimated wall times (in seconds) are converted to hours using dplyr::mutate, dividing by 3600 so that the y-axis reads in hours rather than seconds.
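A minimal sketch of that conversion on made-up values, keeping the column names from the hunk:

```r
library(dplyr)

# Invented wall times in seconds
df_walltimes <- data.frame(
  n_inputs = c(10, 20, 40),
  est_walltime = c(1800, 3600, 7200)
)

# 3600 seconds per hour; .data$ marks est_walltime as a column
df_hours <- df_walltimes %>%
  mutate(est_walltime = .data$est_walltime / 3600)
```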

Joiejoie1 · 2024-10-31T04:32:34Z

R/assign_job_queue.R

+      dplyr::mutate(est_walltime = .data$est_walltime / 3600)
+    p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = .data$n_inputs, 
+                                                    y = .data$est_walltime, 
+                                                    color = .data$advanced_opts)) +
      ggplot2::geom_line() +
      ggplot2::labs(
        title = "MolEvolvR estimated runtimes",


Plotting:

The plot is created with ggplot2, setting n_inputs on the x-axis and est_walltime (in hours) on the y-axis.

color represents different values of advanced_opts, adding clarity by differentiating between options.

geom_line() is appropriate for connecting points in a line plot, aligning well with the goal of showing runtime trends.

Labels are clear, and the plot title is descriptive, providing context.
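The plotting step described above can be reproduced on toy data roughly as follows. The title is taken from the hunk; the axis and legend labels and all values are assumptions made for this sketch.

```r
library(ggplot2)

# Invented long-format data, already converted to hours
df_long <- data.frame(
  n_inputs = rep(c(1, 10, 100), times = 2),
  est_walltime = c(0.1, 0.5, 2, 0.2, 1, 4),
  advanced_opts = rep(c("opt_a", "opt_b"), each = 3)
)

p <- ggplot(df_long, aes(
  x = .data$n_inputs,
  y = .data$est_walltime,
  color = .data$advanced_opts
)) +
  geom_line() +
  labs(
    title = "MolEvolvR estimated runtimes",
    x = "Number of inputs",             # assumed label
    y = "Estimated wall time (hours)",  # assumed label
    color = "Advanced options"          # assumed label
  )
```

Using .data$ inside aes() is what silences the "no visible binding" notes for n_inputs and advanced_opts when this code sits inside a package.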

Joiejoie1 · 2024-10-31T06:06:08Z

R/cleanup.R

-    prot$row.orig <- 1:nrow(prot)
-
+    col <- rlang::sym(column)
+    prot <- prot %>% 


Code Structure & Functionality
Parameter Handling:

The column parameter is converted to a symbol using rlang::sym, enabling flexible selection of any column by name. This is useful for adaptability across datasets with varying column names.
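As a short sketch of that pattern: rlang::sym() turns a column name held in a string into a symbol, and !! injects it into a dplyr call. The data and the column name Species are made up. The .data[[column]] form shown at the end is an alternative that avoids the symbol entirely.

```r
library(dplyr)

prot <- data.frame(
  AccNum = c("A1", "A1", "B2"),
  Species = c("E. coli", "Escherichia coli", "B. subtilis"),
  stringsAsFactors = FALSE
)

column <- "Species"        # column chosen by the caller, as a string
col <- rlang::sym(column)  # string -> symbol

# Inject the symbol wherever a bare column name would be written
values <- pull(prot, !!col)
with_len <- prot %>% mutate(len = nchar(!!col))

# Equivalent subsetting through the .data pronoun
with_len2 <- prot %>% mutate(len = nchar(.data[[column]]))
```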

Joiejoie1 · 2024-10-31T06:10:23Z

R/cleanup.R

-
+    col <- rlang::sym(column)
+    prot <- prot %>% 
+        mutate(row.orig = seq_len(n()))


Marking Original Rows:

mutate(row.orig = seq_len(n())) is used to create a unique identifier for each row, ensuring original row positions are preserved even after filtering. This addition is necessary for accurately managing duplicate entries.
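One reason seq_len() is generally preferred over 1:nrow(prot) is the zero-row case, sketched here in base R:

```r
# A data frame with no rows
empty <- data.frame(AccNum = character(0))

# 1:0 counts downwards, giving two indices for zero rows
colon_ids <- 1:nrow(empty)

# seq_len(0) gives an empty integer vector, as intended
safe_ids <- seq_len(nrow(empty))
```

Here colon_ids is c(1L, 0L) while safe_ids is integer(0), so only the seq_len() form can be assigned as a row identifier without error on empty input.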

Joiejoie1 · 2024-10-31T06:12:36Z

R/cleanup.R


-    dup_acc <- dups$AccNum
+    dup_acc <- unique(dups$AccNum)


Duplicate Identification:

Duplicate entries are identified by grouping on AccNum and keeping only the groups with more than one entry. This works well when rows that share an AccNum are duplicates of one another. Wrapping dups$AccNum in unique() means each duplicated accession is processed once.
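The grouping step can be sketched on made-up accessions like this. In package code the column would be written as .data$AccNum; the bare name is used here because this is a standalone script.

```r
library(dplyr)

prot <- data.frame(
  AccNum = c("A1", "A1", "B2", "C3", "C3", "C3"),
  stringsAsFactors = FALSE
)

# Keep only rows whose accession occurs more than once
dups <- prot %>%
  group_by(AccNum) %>%
  filter(n() > 1) %>%
  ungroup()

# One entry per duplicated accession
dup_acc <- unique(dups$AccNum)
```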

Joiejoie1 · 2024-10-31T06:15:58Z

R/cleanup.R


-        longest <- dup_rows[which(nchar(pull(dup_rows, {{ col }})) == max(nchar(pull(dup_rows, {{ col }}))))[1], "row.orig"]
+        longest <- dup_rows$row.orig[which.max(nchar(pull(dup_rows, !!col)))]


Selecting the Longest Entry:

For each duplicate group, entries are compared by the length of the text in the specified column. The function identifies the row with the longest entry using which.max(nchar(pull(...))), which replaces the earlier which(... == max(...))[1] expression with a single call that returns the same row.

Rows not selected as the longest entries are stored in remove_rows to exclude them later.
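A base R sketch of the selection on one invented duplicate group. which.max() returns the index of the first maximum, so ties resolve to the earliest row, matching the [1] in the old expression.

```r
# One duplicate group; rows 7 and 9 tie for the longest Species text
dup_rows <- data.frame(
  row.orig = c(4L, 7L, 9L),
  Species = c("E. coli", "Escherichia coli", "Escherichia Coli"),
  stringsAsFactors = FALSE
)

# Index of the first longest entry, mapped back to its original row id
longest <- dup_rows$row.orig[which.max(nchar(dup_rows$Species))]

# Every other row in the group is marked for removal
remove_rows <- setdiff(dup_rows$row.orig, longest)
```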

Joiejoie1 · 2024-10-31T06:17:27Z

R/cleanup.R

-    unique_dups <- prot[-remove_rows, ] %>% select(-row.orig)
+    unique_dups <- prot %>% 
+        filter(!.data$row.orig %in% remove_rows) %>% 
+        select(-.data$row.orig)


Filtering and Output:

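Finally, filter(!.data$row.orig %in% remove_rows) drops the rows flagged in remove_rows, so each duplicated AccNum keeps only its longest entry while non-duplicated rows pass through unchanged. The helper column row.orig is then removed to clean up the output data.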
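The switch away from negative indexing matters when nothing needs removing, as this sketch on toy data shows. The select() call uses all_of() because tidyselect (from version 1.2.0) deprecates .data inside selection helpers; that substitution is a suggestion made here, not what the PR merged.

```r
library(dplyr)

prot <- data.frame(
  AccNum = c("A1", "B2", "C3"),
  row.orig = 1:3,
  stringsAsFactors = FALSE
)

# No duplicates found, so nothing is flagged
remove_rows <- integer(0)

# Negative indexing with an empty vector selects zero rows
dropped_all <- prot[-remove_rows, ]

# The logical filter keeps every row, as intended
kept <- prot %>%
  filter(!.data$row.orig %in% remove_rows) %>%
  select(-all_of("row.orig"))
```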

Klangina added 4 commits October 28, 2024 21:00

[Fixed] condenseRepeatedDomains: no visible binding for global variab…

34218ab

…le ‘.’

[FIXED]

590c568

plotEstimatedWallTimes : get_powerset : <anonymous>: no visible global function definition for ‘combn’
plotEstimatedWallTimes: no visible binding for global variable ‘n_inputs’
plotEstimatedWallTimes: no visible binding for global variable ‘advanced_opts’

[FIXED]

74d0e62

plotStackedLineage: no visible binding for global variable ‘cpcols’
Added a randomly chosen color palette as the default; it can be changed later. (Source for the color palette: https://www.colorcombos.com/color-schemes/16952/ColorCombo16952.html)

Resolves Issue JRaviLab#114:

f62f69d

Adding a blank commit to link PR to the issue

Klangina force-pushed the Issue-114 branch from 42f7213 to f62f69d Compare October 28, 2024 19:43

the-mayer requested review from Joiejoie1 and the-mayer October 29, 2024 16:48

Merge commit 'c838c4e082460f6d48f8da98a3a37a66e248dd4e'

65baec7

the-mayer linked an issue Oct 30, 2024 that may be closed by this pull request

R-CMD Check : selectLongestDuplicate, plotEstimatedWallTimes, removeTails, condenseRepeatedDomains, plotStackedLineage #114

Closed

10 tasks

the-mayer added 2 commits October 30, 2024 14:52

add additional .data prefix

9cd85eb

update docs/NAMESPACE

4667118

the-mayer reviewed Oct 30, 2024

the-mayer approved these changes Oct 30, 2024

the-mayer merged commit b758992 into JRaviLab:main Oct 30, 2024
1 check passed

Joiejoie1 reviewed Oct 31, 2024
