GitHub - MestreLion/topuniq: Think of it as sort | uniq -c

Sort input by count, printing totals and percentages.

Think of it as sort | uniq -c | sort -nr on steroids ;)

Sample output:

$ topuniq --min-count=100 examples/2-icon-types.txt
  39564 100.0% Total (8)
  25373  64.1% png
  12128  30.7% svg
   1290   3.3% xpm
    685   1.7% icon
     88   0.2% Other (4)
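
For readers curious about what such a report involves, here is a minimal
sketch built from plain sort, uniq and awk. It is not topuniq's actual code:
the function name count_with_perc and its single argument (a threshold that
mimics --min-count) are made up for illustration.

```shell
# Hypothetical helper, NOT topuniq's implementation: count unique lines from
# stdin, print a total line and percentages, and group rare lines as "Other".
count_with_perc() {
    min_count=${1:-1}    # assumed threshold, mimics --min-count
    LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -k1,1nr |
    LC_ALL=C awk -v min="$min_count" '
    {
        count[NR] = $1 + 0            # uniq -c prints "  COUNT LINE"
        sub(/^ *[0-9]+ /, "")         # strip the count, keep the line itself
        line[NR] = $0
        total += count[NR]
    }
    END {
        if (total == 0) exit          # empty input: nothing to report
        printf "%7d %5.1f%% Total (%d)\n", total, 100, NR
        for (i = 1; i <= NR; i++) {
            if (count[i] >= min + 0) {
                printf "%7d %5.1f%% %s\n", count[i], 100 * count[i] / total, line[i]
            } else {
                other += count[i]; n_other++
            }
        }
        if (n_other)
            printf "%7d %5.1f%% Other (%d)\n", other, 100 * other / total, n_other
    }'
}
```

It would be used like: count_with_perc 100 < examples/2-icon-types.txt
The whole input is buffered in awk arrays because percentages need the grand
total, which is only known after the last line has been read.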


A more complex example:

$ topuniq --min-perc=1 examples/3-shebangs.txt \
          --total-last --label-total="TOTAL: %d unique shebangs" \
          --sort-other --label-other="(other %d unique shebangs)"
    330  26.7% #!/bin/sh
    148  12.0% #!/usr/bin/perl -w
    145  11.7% #!/usr/bin/python
    143  11.6% #!/usr/bin/perl
    117   9.5% (other 35 unique shebangs)
     90   7.3% #! /bin/sh
     80   6.5% #!/bin/bash
     42   3.4% #!/usr/bin/env python
     39   3.2% #! /usr/bin/perl -w
     25   2.0% #! /usr/bin/python
     22   1.8% #! /usr/bin/perl
     21   1.7% #! /bin/bash
     20   1.6% #!/bin/sh -e
     14   1.1% #! /usr/bin/env perl
   1236 100.0% TOTAL: 48 unique shebangs


As a drop-in replacement for cmd | sort | uniq -c | sort -nr
(using cat just to show pipeline usage, I know it is redundant)

$ cat examples/2-icon-types.txt | topuniq --no-total --no-perc
  25373 png
  12128 svg
   1290 xpm
    685 icon
     53 theme
     33 cache
      1 txt
      1 svgz


"Enhancing" previously saved data generated by cmd | sort | uniq -c | sort -nr
(yes, lame and cheesy option name, but I could not think of a better one...)

$ topuniq --enhance-uniq --top=10 examples/4-shebangs-preprocessed.txt
   1236 100.0% Total (53)
    328  26.5% #!/bin/sh
    146  11.8% #!/usr/bin/perl -w
    145  11.7% #!/usr/bin/python
    141  11.4% #!/usr/bin/perl
     90   7.3% #! /bin/sh
     80   6.5% #!/bin/bash
     42   3.4% #!/usr/bin/env python
     39   3.2% #! /usr/bin/perl -w
     25   2.0% #! /usr/bin/python
     21   1.7% #! /usr/bin/perl
    179  14.5% Other (43)
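
The idea of post-processing saved counts can be sketched as below. This is
only an illustration under stated assumptions, not how topuniq does it: the
function name enhance_saved and its argument (mimicking --top, with 0 meaning
all lines) are hypothetical, and the input is assumed to be "COUNT TEXT"
lines as written by uniq -c.

```shell
# Hypothetical sketch, NOT topuniq's code: read saved "uniq -c" output from
# stdin, keep the top N lines and group everything else as "Other".
enhance_saved() {
    top=${1:-10}    # assumed parameter, mimics --top (0 = all lines)
    LC_ALL=C sort -k1,1nr |
    LC_ALL=C awk -v top="$top" '
    BEGIN { top = top + 0 }
    {
        n = $1 + 0
        sub(/^ *[0-9]+ /, "")         # what remains is the original line
        total += n
        if (top == 0 || NR <= top) { count[NR] = n; line[NR] = $0 }
        else                       { other += n; n_other++ }
    }
    END {
        if (total == 0) exit
        printf "%7d %5.1f%% Total (%d)\n", total, 100, NR
        shown = (top == 0 || NR < top) ? NR : top
        for (i = 1; i <= shown; i++)
            printf "%7d %5.1f%% %s\n", count[i], 100 * count[i] / total, line[i]
        if (n_other)
            printf "%7d %5.1f%% Other (%d)\n", other, 100 * other / total, n_other
    }'
}
```

It would be used like: enhance_saved 10 < examples/4-shebangs-preprocessed.txt
Since the counts already exist, the initial sort | uniq -c step is skipped
and only the numeric re-sort remains.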


Performance comparisons with sort | uniq -c | sort -nr
(always using the 41277-line, 235KB examples/1-man-bash-words.txt; average of
3 runs of 'time' on a 100-iteration loop)

Reference:
sort | uniq -c | sort -nr:                         real	0m10.042s

Worst case scenario - no min-* or top-* filter
topuniq                                            real	0m14.360s (gawk)
                                                   real	0m13.294s (mawk)

Direct comparison - no enhancements, same output as reference
(no, I didn't optimize for that... yet ;)
topuniq --no-total --no-perc                       real	0m14.201s (gawk)
                                                   real	0m13.252s (mawk)

Best case scenario - using min-count > total
(not cheating with --stop-after-*, of course)
topuniq --min-count=3000                           real	0m11.797s (gawk)
                                                   real	0m11.739s (mawk)

Not bad, not bad at all ;)
... and soon to be hugely improved.
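
A timing loop along these lines could reproduce such measurements. It is a
guess at the method, not the script actually used for the numbers above: the
helper names bench and reference are hypothetical, and timing a shell
function requires a shell where time is a keyword, such as bash.

```shell
# Hypothetical benchmark helper: run a command N times on the same input
# file, discarding its output, so that `time` measures the whole loop.
bench() {
    iterations=$1; input=$2; shift 2
    i=0
    while [ "$i" -lt "$iterations" ]; do
        "$@" < "$input" > /dev/null
        i=$((i + 1))
    done
}

# The reference pipeline, wrapped in a function so it can be passed to bench
reference() { sort | uniq -c | sort -nr; }

# Example invocations (bash), one per tool being compared:
#   time bench 100 examples/1-man-bash-words.txt reference
#   time bench 100 examples/1-man-bash-words.txt ./topuniq
```

Sending the output to /dev/null keeps terminal rendering out of the
measurement, so the figures reflect only the sorting and counting work.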

Wishlist:
(A.K.A. "Things I would add if I did not fear bloat and feature-creep")

- Optimize for some common option combinations:
	--no-perc + no --min-perc : do not calculate percentages at all
	--no-other: do not update *['other'] arrays
	--no-total + --no-perc + no filters: skip awk entirely ;)
	--enhance-uniq: skip last sort -nr

- Add position column, and --no-pos option. Very useful for long lists, but
  nothing grep -n or pasting to an editor can't do. Position would be blank
  for <other>, even if sorted.

- Add yet another percentage: position %, the same value --top-perc uses to
  filter. It answers the question "what does being #15 in this list mean?".
  Besides, I already calculate it, so why not show it? ;)
  --no-pos-perc / --show-pos-perc

- Add 2 more percentages: cumulative % of lines above (Up) and below (Down).
  Useful for analyzing thresholds. --no-perc-up and --no-perc-down to disable
  (maybe --no-percsum-*? Anyway, --show-* to enable if not default)
  % down would of course also count lines filtered in <other> and not printed.
  Example:  40:    145   0.4%  56.2%  43.4% bash

- This is starting to look like a spreadsheet, so I'd better add headers.
  Optional (--show-header) and customizable, of course.

- Request this sweet, useful tool to be included in Debian?

Think any of these features are worth having? Leave a comment, or ask
for them in "Issues". I would gladly add them in the next release!
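
To make the cumulative Up/Down percentages from the wishlist concrete, here
is one possible reading of them as a sketch. The interpretation is an
assumption drawn from the single example line given there, where the three
percentages add up to 100: Up is the share of all counts above the current
line, Down the share below it. The function name cumulative_perc is
hypothetical, and filtering into <other> is left out.

```shell
# Hypothetical sketch of the wishlist columns, not an existing topuniq
# feature. Input: "COUNT TEXT" lines already sorted by count, descending
# (for example the output of sort | uniq -c | sort -nr).
cumulative_perc() {
    LC_ALL=C awk '
    {
        count[NR] = $1 + 0
        sub(/^ *[0-9]+ /, "")
        line[NR] = $0
        total += count[NR]
    }
    END {
        above = 0                     # sum of counts on the lines above
        for (i = 1; i <= NR; i++) {
            perc = 100 * count[i] / total
            up   = 100 * above / total
            down = 100 * (total - above - count[i]) / total
            printf "%3d: %6d %5.1f%% %5.1f%% %5.1f%% %s\n",
                   i, count[i], perc, up, down, line[i]
            above += count[i]
        }
    }'
}
```

It would be used like: sort FILE | uniq -c | sort -nr | cumulative_perc
The columns are position, count, own %, % above and % below, followed by the
line itself.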


Full manual, from --help:

Usage: topuniq [options] [FILE...]

If FILE is not given, read from standard input. For numeric input
options, NUM must be a positive integer (digits only). All options
requiring arguments accept both --option=ARG or --option ARG forms
Options not listed here, if any, are appended to uniq -c

Options:
  -h|--help              show this page.

  --min-count=NUM        only print lines with count >= NUM
  --min-perc=NUM         only print lines with count percent >= NUM%
  --top=NUM              only print the top NUM lines. 0 = all lines
  --top-perc=NUM         only print the top NUM% lines

  All lines with count less than any of the above options will be
  grouped together as a single <other> line, printed last by default.
  Setting a minimum higher than total, either count or percentage,
  will effectively disable printing the <total> line. For --top-*
  options, NUM does not include the total.

  --stop-after-top=NUM   stop reading after NUM top unique lines
  --stop-after-count=NUM stop reading after lines with count < NUM

  Unlike --min-* and --top-* options, the above will discard lines,
  thus affecting <total>, <other> and all percentages.
  --stop-after-top is equivalent to 'head -nNUM' after sort -nr and
  before topuniq's enhancements. For both, NUM=0 disables the option

  --precision=NUM        use NUM decimal digits for the percentages,
                         default 1

  --no-perc              do not print percentages
  --no-total             do not print <total> line
  --no-other             do not print <other> line

  --total-last           print <total> line last instead of first
  --sort-other           print <other> line in sorted position

  --label-total=LABEL    use LABEL for <total> line, default "Total (%d)"
  --label-other=LABEL    use LABEL for <other> line, default "Other (%d)"

  For the --label-* options, optional "%d" prints the number of unique
  lines that <total> or <other> represents

  --enhance-uniq         consider input as already processed by
                         sort | uniq -c, skip it and process from there.
                         Useful for enhancing previously saved data

Environment Variables:

  topuniq uses sort and uniq, so the user locale, particularly
  LC_COLLATE, affects ordering and unique matching, as well as sort
  performance. LC_NUMERIC affects decimal separator when printing
  percentages. Use LC_ALL=C for the fastest and locale-independent
  results.

Examples:

# Ignore lines with count < 10%, using case-insensitive uniq
topuniq --min-perc=10 --no-other --ignore-case

# Top 20, sorting <other> within the list, and customizing its label
topuniq --top=20 --sort-other --label-other="Other %d unique lines"

# Enhance an existing input, discarding lines with count < 10
topuniq my_uniq_data.txt --enhance-uniq --stop-after-count=10

# Behaves exactly like sort | uniq -c | sort -nr
topuniq --no-total --no-perc

For input data, some examples you may pipe directly to topuniq:

# Words in Bash's manual page
man bash | tr '[:punct:][:blank:]' '\n' | sed '/^$/d'

# Icon types in /usr/share/icons
find /usr/share/icons -type f -name "*.*" | awk -F. '{print $NF}'

# Shebangs from /usr/bin scripts
for f in /usr/bin/*; do [ -f "$f" ] && head -n1 "$f" | grep '^#!'; done

Copyright (C) 2012 Rodrigo Silva (MestreLion) <[email protected]>
License: GPLv3 or later. See <http://www.gnu.org/licenses/gpl.html>