URL discovery in CSV files where values are not wrapped in quotes #1299

Open
cicdguy opened this issue Nov 19, 2023 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments


cicdguy commented Nov 19, 2023

Hello,

I'm using lychee 0.13.0 and running it against this file:
https://github.com/pharmaverse/admiraldiscovery/blob/06e6e55b884ef91de9ae457606ed66defc9dba14/data-raw/admiral-lookup-book.csv

Like so:

lychee **/*.csv

And I get the following result:

⠚ 1/47 ETA 80s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed
⠚ 2/47 ETA 39s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Ne
⠚ 3/47 ETA 25s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network
⠚ 4/47 ETA 19s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Networ
⠚ 5/47 ETA 15s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network
⠚ 6/47 ETA 12s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Netwo
⠚ 7/47 ETA 10s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed
⠚ 8/47 ETA 9s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Net
⠚ 9/47 ETA 8s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network
⠚ 10/47 ETA 7s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network
⠚ 11/47 ETA 6s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Ne
⠚ 12/47 ETA 6s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Netw
⠚ 13/47 ETA 5s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network
⠚ 14/47 ETA 4s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network e
⠚ 15/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed:
⠚ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 17/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Netwo
⠒ 18/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Netwo
⠒ 19/47 ETA 1s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Netwo
⠒ 20/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network e
⠒ 21/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: N
⠒ 22/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Fai
⠒ 23/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network e
⠒ 24/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Ne
⠒ 25/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network
⠒ 26/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Net
⠒ 27/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network
⠒ 32/47 ETA 0s █████████████░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Netwo
⠒ 33/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed:
⠒ 34/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: N
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠒ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
  47/47 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[data-raw/admiral-lookup-book.csv]:
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_query.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_wbc_abs.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_obs_number.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_pchg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Network error: Not Found

🔍 47 Total ✅ 12 OK 🚫 35 Errors (HTTP:35)

When I modify the file by adding quotes around the URLs in the CSV, I get the expected result.

❯ lychee **/*.csv
  47/47 ETA 0s ████████████████████ Finished extracting links           
  🔍 47 Total ✅ 47 OK 🚫 0 Errors
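
For anyone who needs the quoting workaround in the meantime, here is a minimal sketch that re-emits a CSV with every field wrapped in double quotes. It uses only Python's standard `csv` module; the function name `quote_all_fields` and the sample row are made up for illustration and are not taken from the file linked above.

```python
import csv
import io


def quote_all_fields(csv_text: str) -> str:
    """Re-emit CSV text with every field wrapped in double quotes."""
    # newline="" lets the csv module handle line endings itself.
    rows = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# Illustrative row only; the column layout is an assumption.
sample = "derive_var_base,https://example.com/derive_var_base.html,Template\n"
print(quote_all_fields(sample))
```

With every field quoted, the closing quote separates the URL from the comma, which matches the all-OK run shown above.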

Although commas are allowed (safe) characters in URLs, would it be possible for lychee to detect CSV files and extract URLs from them without requiring the URL strings to be wrapped in quotes?
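
To illustrate why the trailing `,Template` cannot simply be dropped by a generic URL scanner: RFC 3986 lists the comma among the characters permitted in a path, so the fused string is itself a well-formed URL. The short check below uses Python's standard `urllib.parse`; it only shows how such a string parses and says nothing about how lychee or linkify work internally.

```python
from urllib.parse import urlsplit

# The string as it appears in an unquoted CSV row, comma and next field included.
fused = "https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template"

parts = urlsplit(fused)
# The comma and the following field end up as part of the path.
print(parts.netloc)
print(parts.path)
```

Since the path legitimately includes `,Template`, telling the two fields apart needs knowledge of the file format rather than of URL syntax.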

@cicdguy cicdguy changed the title Pattern matching in CSV files where fields are not wrapped in quotes URL discovery in CSV files where values are not wrapped in quotes Nov 19, 2023
Member

mre commented Nov 30, 2023

Thanks for creating the issue.
I think we should discuss that with the folks at linkify, which is the plaintext parser we use.
I don't know if it will be an easy fix for them, though. 😕
Could you still open an issue over there and ask for feedback?

Author

cicdguy commented Nov 30, 2023

@mre - thank you for your response. Yes, definitely, I can open an issue there and request feedback.

Member

mre commented Mar 3, 2024

@robinst suggested using a CSV parser and passing individual cells to linkify. This has a few advantages:

  • Clear separation of concerns between the tools.
  • Doesn't require a (potentially breaking) change in linkify.
  • Can be extended to other file formats.

I think this is the way forward. @cicdguy, perhaps you want to close the linkify issue again, and we can focus on fixing this issue in lychee itself as per the above plan? What do you think? 😃
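
A rough sketch of the plan above, written in Python rather than Rust purely for brevity: split the input with a real CSV parser first, then run the plaintext link finder on each cell. The regex `URL_RE` is a hypothetical, deliberately simple stand-in for linkify, and the function names and sample row are assumptions for illustration, not lychee code.

```python
import csv
import io
import re

# Hypothetical stand-in for a plaintext link finder; like many such
# scanners it accepts commas, because commas are valid inside URLs.
URL_RE = re.compile(r"https?://[^\s\"<>]+")


def urls_from_text(text):
    """Scan raw text as one blob (the behaviour that fuses fields)."""
    return URL_RE.findall(text)


def urls_from_csv(csv_text):
    """Parse the CSV first, then scan each cell on its own."""
    urls = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        for cell in row:
            urls.extend(urls_from_text(cell))
    return urls


# Illustrative row only; the column layout is an assumption.
sample = "derive_var_base,https://example.com/reference/derive_var_base.html,Template\n"
print(urls_from_text(sample))  # URL still carries ",Template"
print(urls_from_csv(sample))   # URL ends at ".html"
```

One consequence of this design: a URL that really does contain a comma would have to be quoted in the CSV so the parser keeps it in a single cell, which is the usual CSV convention anyway.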

Author

cicdguy commented Mar 3, 2024

Sounds great. Thank you @mre!

mre added the enhancement and help wanted labels and removed the waiting-for-feedback and waiting-for-upstream labels on May 13, 2024