MestreLion/topuniq
Sort input by count, printing totals and percentages.

Think of it as sort | uniq -c | sort -nr on steroids ;)

Sample output:

    $ topuniq --min-count=100 examples/2-icon-types.txt
      39564 100.0% Total (8)
      25373  64.1% png
      12128  30.7% svg
       1290   3.3% xpm
        685   1.7% icon
         88   0.2% Other (4)

A more complex example:

    $ topuniq --min-perc=1 examples/3-shebangs.txt \
        --total-last --label-total="TOTAL: %d unique shebangs" \
        --sort-other --label-other="(other %d unique shebangs)"
        330  26.7% #!/bin/sh
        148  12.0% #!/usr/bin/perl -w
        145  11.7% #!/usr/bin/python
        143  11.6% #!/usr/bin/perl
        117   9.5% (other 35 unique shebangs)
         90   7.3% #! /bin/sh
         80   6.5% #!/bin/bash
         42   3.4% #!/usr/bin/env python
         39   3.2% #! /usr/bin/perl -w
         25   2.0% #! /usr/bin/python
         22   1.8% #! /usr/bin/perl
         21   1.7% #! /bin/bash
         20   1.6% #!/bin/sh -e
         14   1.1% #! /usr/bin/env perl
       1236 100.0% TOTAL: 48 unique shebangs

As a drop-in replacement for cmd | sort | uniq -c | sort -nr
(using cat just to show pipeline usage, I know it is redundant)

    $ cat examples/2-icon-types.txt | topuniq --no-total --no-perc
      25373 png
      12128 svg
       1290 xpm
        685 icon
         53 theme
         33 cache
          1 txt
          1 svgz

"Enhancing" previously saved data generated by cmd | sort | uniq -c | sort -nr
(yes, lame and cheesy option name, but I could not think of a better one...)

    $ topuniq --enhance-uniq --top=10 examples/4-shebangs-preprocessed.txt
       1236 100.0% Total (53)
        328  26.5% #!/bin/sh
        146  11.8% #!/usr/bin/perl -w
        145  11.7% #!/usr/bin/python
        141  11.4% #!/usr/bin/perl
         90   7.3% #! /bin/sh
         80   6.5% #!/bin/bash
         42   3.4% #!/usr/bin/env python
         39   3.2% #! /usr/bin/perl -w
         25   2.0% #! /usr/bin/python
         21   1.7% #! /usr/bin/perl
        179  14.5% Other (43)

Performance comparisons with sort | uniq -c | sort -nr
(always using the 41277 lines, 235KB examples/1-man-bash-words.txt, average of 3 runs of 'time' in a 100 iterations loop)

Reference:

    sort | uniq -c | sort -nr
    real 0m10.042s

Worst case scenario - no min-* or top-* filter

    topuniq
    real 0m14.360s (gawk)
    real 0m13.294s (mawk)

Direct comparison - no-op, same output as reference (no, I didn't optimize for that... yet ;)

    topuniq --no-total --no-perc
    real 0m14.201s (gawk)
    real 0m13.252s (mawk)

Best case scenario - using min-count > total (not cheating with --stop-after-*, of course)

    topuniq --min-count=3000
    real 0m11.797s (gawk)
    real 0m11.739s (mawk)

Not bad, not bad at all ;) ... and soon to be hugely improved.

Wishlist:
(A.K.A. "Things I would add if I did not fear bloat and feature-creep")

- Optimize for some common option combinations:
    --no-perc + no --min-perc: do not calculate percentages at all
    --no-other: do not update *['other'] arrays
    --no-total + --no-perc + no filters: skip awk entirely ;)
    --enhance-uniq: skip last sort -nr
- Add position column, and --no-pos option. Very useful for long lists, but nothing grep -n or pasting to an editor can't do. Position would be blank for <other>, even if sorted.
- Add yet another percentage: position %, same value --top-perc uses to filter. To answer the question "what does being #15 in this list mean?". Besides, I already calculate it, so why not show it? ;) --no/show--pos-perc
- Add 2 more percentages: cumulative % of lines above (Up) and below (Down). Useful for analyzing thresholds. --no-perc-up and --no-perc-down to disable (maybe --no-percsum-*? Anyway, --show-* to enable if not default). % down would of course also count lines filtered in <other> and not printed. Example:
    40: 145 0.4% 56.2% 43.4% bash
- This is starting to look like a spreadsheet, so I'd better add headers. Optional (--show-header) and customizable, of course.
- Request this sweet, useful tool to be included in Debian?

So you think any of these features are worth having? Leave a comment, or ask for them in "Issues". I would gladly add them in the next release!

Full manual, from --help:

    Usage: topuniq [options] [FILE...]

    If FILE is not given, read from standard input.
    For numeric input options, NUM must be a positive integer (digits only).
    All options requiring arguments accept both --option=ARG or --option ARG forms
    Options not listed here, if any, are appended to uniq -c

    Options:
    -h|--help              show this page.

    --min-count=NUM        only print lines with count >= NUM
    --min-perc=NUM         only print lines with count percent >= NUM%
    --top=NUM              only print the top NUM lines. 0 = all lines
    --top-perc=NUM         only print the top NUM% lines

        All lines with count less than any of the above options will be
        grouped together as a single <other> line, printed last by default.
        Setting a minimum higher than total, either count or percentage,
        will effectively disable printing the <total> line.
        For --top-* options, NUM does not include the total.

    --stop-after-top=NUM   stop reading after NUM top unique lines
    --stop-after-count=NUM stop reading after lines with count < NUM

        Unlike --min-* and --top-* options, the above will discard lines,
        thus affecting <total>, <other> and all percentages.
        --stop-after-top is equivalent to 'head -nNUM' after sort -nr and
        before topuniq's enhancements. For both, NUM=0 disables the option

    --precision=NUM        use NUM decimal digits for the percentages, default 1
    --no-perc              do not print percentages
    --no-total             do not print <total> line
    --no-other             do not print <other> line
    --total-last           print <total> line last instead of first
    --sort-other           print <other> line in sorted position
    --label-total=LABEL    use LABEL for <total> line, default "Total (%d)"
    --label-other=LABEL    use LABEL for <other> line, default "Other (%d)"

        For the --label-* options, optional "%d" prints the number of
        unique lines that <total> or <other> represents

    --enhance-uniq         consider input as already processed by
                           sort | uniq -c, skip it and process from there.
                           Useful for enhancing previously saved data

    Environment Variables:
    topuniq uses sort and uniq, so the user locale, particularly LC_COLLATE,
    affects ordering and unique matching, as well as sort performance.
    LC_NUMERIC affects decimal separator when printing percentages.
    Use LC_ALL=C for the fastest and locale-independent results.

    Examples:
    # Ignore lines with count < 10%, using case-insensitive uniq
    topuniq --min-perc=10 --no-other --ignore-case

    # Top 20, sorting <others> within the list, and customizing its label
    topuniq --top=20 --sort-other --label-other="Other %d unique lines"

    # Enhance an existing input, discarding lines with count < 10
    topuniq my_uniq_data.txt --enhance-uniq --stop-after-count=10

    # Behaves exactly like sort | uniq -c | sort -nr
    topuniq --no-total --no-perc

    For input data, some examples you may pipe directly to topuniq:

    # Words in Bash's manual page
    man bash | tr '[:punct:][:blank:]' '\n' | sed '/^$/d'

    # Icon types in /usr/share/icons
    find /usr/share/icons -type f -name "*.*" | awk -F. '{print $NF}'

    # Shebangs from /usr/bin scripts
    for f in /usr/bin/*; do [ -f "$f" ] && head -n1 "$f" | grep ^#!; done

    Copyright (C) 2012 Rodrigo Silva (MestreLion) <[email protected]>
    License: GPLv3 or later. See <http://www.gnu.org/licenses/gpl.html>
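
For readers curious about the basic idea behind the totals and percentages, here is a minimal sketch that assumes only POSIX sort, uniq and awk. It is not topuniq's actual code: the function name count_with_perc is hypothetical, the column widths are arbitrary, and none of topuniq's options (filters, <other> grouping, labels, precision) are handled.

```shell
# Hypothetical helper, NOT part of topuniq: count unique input lines and
# print a total line followed by each line's count and percentage.
# The body runs in a subshell so LC_ALL=C does not leak to the caller.
count_with_perc() (
    export LC_ALL=C    # locale-independent ordering and decimal point
    sort | uniq -c | sort -nr |
    awk '
        {
            count[NR] = $1               # uniq -c puts the count first
            line = $0
            sub(/^ *[0-9]+ /, "", line)  # drop the count, keep the text
            text[NR] = line
            total += $1
        }
        END {
            if (NR == 0) exit            # empty input: print nothing
            printf "%7d %5.1f%% Total (%d)\n", total, 100, NR
            for (i = 1; i <= NR; i++)
                printf "%7d %5.1f%% %s\n", count[i], 100 * count[i] / total, text[i]
        }'
)
```

For example, printf 'png\nsvg\npng\npng\n' | count_with_perc prints a total of 4 for 2 unique lines, then png with 3 (75.0%) and svg with 1 (25.0%). Buffering every line in awk before printing is what allows the total to come first; printing the total last would let the percentages be computed only after a second pass or the same buffering.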