New command: Parse Lines (Alt+A) #453

Open
ProgerXP opened this issue Dec 24, 2022 · 0 comments

This command is added as the last item in Edit > Block (P&arse). It is similar to Modify Lines (Alt+M) in how it treats the initial selection, in its dialog layout with SysLink controls and in remembering input values. The dialog has two buttons on the side and the following controls: the label &Parse each line:, an input, the checkbox Use |scanf()| instead of regular &expression (checked by default; |...| marks a link to an online reference), a group of links and texts similar to Alt+M's, the label &Replace matching line:, an input, and another group of links and texts. The links and texts will be determined later.

Operation:

  1. If the scanf() checkbox is checked, preprocess the Parse string (see below).
    • if it isn't checked and the regexp is malformed, either disable OK (as done in Find) or focus the Parse input and exit
  2. Walk the document line by line like Alt+M does; for every line:
    • run vsscanf() or the regexp on it; if vsscanf()'s result is not exactly N or if the regexp doesn't match, skip to the next line
    • replace the line with the result of calling vsprintf() with the Replace string
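
The regular-expression branch of these steps can be sketched as follows, in Python rather than in the editor's own code. The name parse_lines, the use of re.search() and the use of Python's % operator as a stand-in for vsprintf() are assumptions made for illustration only; in particular, Python's % operator has no n$ positions.

```python
import re


def parse_lines(lines, pattern, replace):
    """Return the rewritten lines, or None if the pattern is malformed."""
    try:
        regex = re.compile(pattern)
    except re.error:
        # Step 1: a malformed regexp stops the command before any line changes.
        return None
    result = []
    for line in lines:
        match = regex.search(line)
        if match is None:
            # Step 2: a line that does not match is skipped (left unchanged).
            result.append(line)
            continue
        # Stand-in for vsprintf(): the captured groups become the arguments.
        result.append(replace % match.groups())
    return result
```

For example, parse_lines(["a;b", "plain"], r"([^;]+);([^;]+)", "%s-%s") rewrites only the first line. A Replace string whose specifier count differs from the number of groups raises TypeError here; a real implementation would validate that up front.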

It is very easy to trigger undefined behaviour, and even a crash, with bad format strings. It is also possible that the vs...f() functions cannot be used in our scenario. In that case, rather than calling them once per input line, call them once per format specifier in the format string and/or write a custom implementation. If one call per line is made, preprocessing determines N, the number of format specifiers producing data (i.e. all specifiers except %* and %%), and may do some checks to avoid UB.
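
The preprocessing pass might look like the sketch below (Python, for illustration; count_specifiers and SPEC are hypothetical names). It accepts only the scanf() subset listed in this issue and rejects everything else, which is one way to do the "checks to avoid UB". It returns two numbers because standard sscanf() does not count %n in its return value, so the number of stored arguments and the expected return value can differ.

```python
import re

# One specifier of the limited scanf() subset: %%, %n, or [*][width]conversion.
SPEC = re.compile(r"%(?:(%)|(n)|(\*)?([1-9]\d*)?([dixfsc]|\[\^?\]?[^\]]*\]))")


def count_specifiers(fmt):
    """Return (stored, converted) for a Parse format string.

    stored    - arguments the call writes to (N in the text; includes %n)
    converted - conversions counted by the return value of sscanf()
    Raises ValueError on a specifier outside the supported subset.
    """
    stored = converted = 0
    pos = fmt.find("%")
    while pos >= 0:
        m = SPEC.match(fmt, pos)
        if m is None:
            raise ValueError("unsupported specifier at offset %d" % pos)
        percent, n, star = m.group(1), m.group(2), m.group(3)
        if n:
            stored += 1            # %n stores a value but is not a conversion
        elif not percent and not star:
            stored += 1
            converted += 1
        # Continue after the whole specifier so a % inside [...] is not rescanned.
        pos = fmt.find("%", m.end())
    return stored, converted
```

With this sketch, count_specifiers("%s%n") yields (2, 1): two arguments are written, but a successful sscanf() call reports one conversion.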

All functions use the neutral locale ("C"); in particular, there is no digit grouping (12,345) and . is the decimal separator (3.14). This corresponds to Math Eval's copy result format.

We must support the following limited feature set of scanf():

  • a format specifier begins with %, followed by % to consume a literal % or by n to store the number of characters read so far
  • otherwise, % may be followed by * (performs the match but doesn't store the result), then by an unsigned positive number ("width") used by c and s
  • finally, the specifier ends with one of d i x f s c or a character class (below); the aliases X e g E a are not supported
  • "length" (l h hh j ll L q t z) is not supported; d i x are always signed int or long, f is float or double (the same type as used for Math Eval), s c [...] are always wide
  • a character class is defined between [ ] brackets; the initial [ may be followed by ^ (negative class), then by ] (a literal ] that is part of the class), then by any number of characters and/or 7-bit ASCII ranges made with - (a-z), then by an optional - before the final ] (a literal - that is part of the class)
  • if a custom implementation is used, whitespace in the format string must be treated in a non-standard way: require and consume at least 1 isspace() character (the standard allows 0)
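
The character class rules are the most intricate item in this list, so here is a small Python sketch of how a custom implementation might parse one. parse_char_class and scan_class are hypothetical helper names, and rejecting reversed ranges is an assumption that the list above does not spell out.

```python
def parse_char_class(fmt, pos):
    """Parse a class starting at fmt[pos] == '['.

    Returns (negated, chars, end) where end is the index after the final ].
    """
    if fmt[pos:pos + 1] != "[":
        raise ValueError("expected '['")
    i = pos + 1
    negated = fmt.startswith("^", i)
    if negated:
        i += 1
    chars = set()
    if fmt.startswith("]", i):          # a leading ] is a literal member
        chars.add("]")
        i += 1
    while True:
        if i >= len(fmt):
            raise ValueError("unterminated character class")
        ch = fmt[i]
        if ch == "]":
            return negated, chars, i + 1
        if fmt[i + 1:i + 2] == "-" and fmt[i + 2:i + 3] not in ("", "]"):
            lo, hi = ch, fmt[i + 2]
            # Ranges are limited to 7-bit characters; reversed ranges are
            # rejected here (an assumption, see the lead-in).
            if ord(hi) > 127 or ord(lo) > ord(hi):
                raise ValueError("bad range %s-%s" % (lo, hi))
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            chars.add(ch)               # includes a - right before the final ]
            i += 1


def scan_class(text, start, negated, chars):
    """Return the longest run of text from start that belongs to the class."""
    end = start
    while end < len(text) and (text[end] in chars) != negated:
        end += 1
    return text[start:end]
```

For instance, parsing "[^;]" gives a negated class containing only ;, and scan_class("a;b", 0, True, {";"}) then reads "a".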

Feature set of printf():

  • a format specifier begins with %, followed by % to output a literal %
  • otherwise, % may be followed by position$ (an unsigned positive number followed by $), by "flags", by "width" and by .precision
  • without "position" the specifier consumes the next argument; with n$ it consumes the nth argument
  • "flags" may be - (change justification from right to left) or one of 0 _ _0 + +0, where _ stands for a space (for numeric specifiers; 0 changes the padding symbol from space to 0; _ outputs a space if the number is non-negative while + outputs a literal + in this case)
  • "width" sets padding and may be a positive number, * (consume the next argument as the value) or *n$ (consume the nth argument as the value)
  • "precision" (a non-negative number) sets the length of the mantissa (for numeric specifiers) or of the whole string; it may also be .* or .*n$ (non-standard, only with a custom implementation)
  • finally, the specifier ends with one of d o x X f c s and, if a custom sprintf() implementation is used, n with different behaviour (it outputs the number of characters written so far rather than storing it)
  • the aliases and specifiers i u e E F g G a A C S p n m are not supported
  • as with scanf(), "length" (hh h l ll q L j z Z t) is not supported and is automatic
  • lifted limitations of standard C printf(): the n$ style may be mixed with the non-n$ style (may be solved by preprocessing the format string) and arguments may be skipped (1$ and 3$ may be used in a format string with 2$ unused)
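
A rough Python sketch of the argument-resolution part of a custom sprintf() follows. format_line and SPEC are hypothetical names, formatting of each single value is delegated to Python's % operator, and the custom n specifier and .* precision are left out. One assumption to flag: specifiers without n$ advance their own cursor, independent of any n$ references; the list above does not define this.

```python
import re

# %% or [n$][flags][width][.precision]conversion; the tail is optional so that
# a stray % still matches and can be reported as an error.
SPEC = re.compile(
    r"%(?:(?P<pct>%)|"
    r"(?:(?P<pos>[1-9]\d*)\$)?"
    r"(?P<flags>[-+ 0]*)"
    r"(?P<width>\*(?:[1-9]\d*\$)?|[1-9]\d*)?"
    r"(?:\.(?P<prec>\d+))?"
    r"(?P<conv>[doxXfcs]))?"
)


def format_line(fmt, args):
    """Expand fmt with args, allowing mixed and skipped n$ positions."""
    next_index = 0  # cursor for specifiers without n$ (assumed independent)

    def take(position):
        nonlocal next_index
        if position:                      # "n" from an n$ reference, 1-based
            return args[int(position) - 1]
        value = args[next_index]
        next_index += 1
        return value

    def expand(m):
        if m.group("pct"):
            return "%"
        if m.group("conv") is None:
            raise ValueError("unsupported specifier at offset %d" % m.start())
        width = m.group("width") or ""
        if width.startswith("*"):         # * or *n$: the width is an argument
            width = str(take(width[1:-1]))
        prec = "." + m.group("prec") if m.group("prec") is not None else ""
        value = take(m.group("pos"))
        # Format one value with a plain, position-free specifier.
        spec = "%" + m.group("flags") + width + prec + m.group("conv")
        return spec % (value,)

    return SPEC.sub(expand, fmt)
```

With this sketch, format_line("%3$s %s", ["x", "y", "z"]) gives "z x": argument 2 is never used and the two styles are mixed in one format string.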

This feature will allow changing the column order in a CSV (Parse = %[^;];%s, Replace = %2$s;%1$s) as well as aligning text:

int n = 0;
char[] str = "foo\n";

Parse = %s %s = %[^;];, Replace = %6s %-3s = %s;:

   int n   = 0;
char[] str = "foo\n";

Combined with Sort (Alt+O), one can order lines by their length:

a
ccc
bb

First Alt+A (Parse = %s%n, Replace = %2$06d %1$s), then Alt+O (Logical number comparison):

000001 a
000002 bb
000003 ccc
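
The two-step recipe above can be simulated in a few lines of Python to check the expected result. This is only an illustration: length_keys and sort_by_length are hypothetical names, the regular expression stands in for %s%n, and sorted() stands in for Alt+O (with six-digit zero padding, plain string order matches numeric order).

```python
import re


def length_keys(lines):
    """Prefix each line's first word with the character count that %n reports."""
    out = []
    for line in lines:
        # %s skips leading whitespace, then reads a run of non-whitespace;
        # %n is the number of characters consumed up to that point.
        m = re.match(r"\s*(\S+)", line)
        if m is None:
            out.append(line)   # no match: the line is left unchanged
            continue
        out.append("%06d %s" % (m.end(), m.group(1)))
    return out


def sort_by_length(lines):
    """Apply the Alt+A step, then sort the keyed lines."""
    return sorted(length_keys(lines))
```

Calling sort_by_length(["a", "ccc", "bb"]) reproduces the three lines shown above.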