Skip to content

Latest commit

 

History

History
796 lines (504 loc) · 14.4 KB

docs.rst

File metadata and controls

796 lines (504 loc) · 14.4 KB

pyin

Like sed, but Python!

A personal project.

Warning

This project makes heavy use of Python's builtin eval() function, and comes with same set of security considerations. Mostly this means only executing trusted strings, but be sure to read the relevant Python documentation.

The primary interface to pyin is the $ pyin command-line utility. It seeks to provide shorthand for the little scripts we developers write when quickly transforming small amounts of data. It can be quite flexible and powerful, but at a certain point the expressions become difficult to read and edit.

Some (most? all?) shells (posix?) disable variable expansion inside of single quotes. Keep this in mind when writing expressions.

This Python snippet:

>>> import itertools as it
>>> with open('LICENSE.txt') as f:
...     line = next(f)
>>> line = line.strip()
>>> line = line.lower()
>>> line.count('n')
2

is equivalent to:

$ head -1 LICENSE.txt | pyin 'i.lower()' 'i.count("n")'
2

$ pyin is primarily designed to read data from stdin or a file, however in some cases it is necessary to generate your own data:

$ pyin --gen 'range(3)'
0
1
2

The --gen flag's only requirement is that it produce an iterable object:

$ pyin --gen '{"key": "value"}'
key

A directive is a special shorthand for a pre-defined operation. All directives start with the % character. Directives are split into two categores: `Item Directives`_ and `Stream Directives`_. The former modifies each item in the stream, and the latter has the ability to completely change the stream itself. For example, %json is an Item Directive, and %csvd is a Stream Directive.

A directive takes the place of a Python expression:

$ echo '[1, 2, 3]' | pyin %json 'sum(i)'
6

In some cases a directive has different behavior depending on what is passed to it. In this example the first %json call is decoding JSON data to a Python object, and the second is encoding:

$ echo '[1, 2, 3]' | pyin %json %json
[1, 2, 3]

A list of all directives and their use appears later in this document.

Expressions are typically executed against each element in the stream, but it is also possible to modify the underlying data stream directly:

All Python expressions are parsed for importable objects and automatically imported:

$ echo 'LICENSE.txt' | pyin 'os.path.exists(i)'
True

An expression containing a reference to an invalid object will fail to execute:

$ echo 'LICENSE.txt' | pyin 'os.path.ex(i)'
ERROR: module 'posixpath' has no attribute 'ex'

Python statements can be executed during setup to modify the global scope:

$ ls LICENSE.txt \
  | pyin \
    -s "from os.path import exists" \
    '(i, exists(i))' \
('LICENSE.txt', True)

A more complex example mixing directives, expressions, etc.:

$ head -4 LICENSE.txt \
  | pyin \
    %filter i \
    'i.split()' \
    'i[::2]' \
    %stream '[" ".join(i) for i in s]'
New License
Copyright 2015-2023, D.
All reserved.

is equivalent to the Python code:

>>> import itertools as it
>>> with open('LICENSE.txt') as f:
...     # Take first 4 lines
...     for i in it.islice(f, 4):
...         # Remove lines only containing whitespace
...         i = i.strip()
...         if not i:
...             continue
...         # Take every-other word
...         i = i.split()
...         i = i[::2]
...         print(" ".join(i))
New License
Copyright 2015-2023, D.
All reserved.

pyin makes use of Python's builtin eval(), which executes code within a scope with local and global variables. pyin only places the data being evaluated within the local variables, but provides a full global scope containing all of the normal Python builtins plus some aliases to potentially useful modules and functions. This scope is somewhat hidden but can be investigated:

$ pyin \
    --gen 'range(1)' \
    %stream '_scope.items()' \
    %filterfalse 'i[0].startswith("_")' \
    'f"{i[0]} {type(i[1])} {i[1].__name__}"'
it <class 'module'> itertools
op <class 'module'> operator
reduce <class 'builtin_function_or_method'> reduce

This is admittedly very hard to read, but rebuilding the command one expression at a time should reveal what is happening.

A directive is a special operation that may or may not be possible to express as a Python expression. The %json directive is an example of one that is easy to re-implement, and the %csv directive is one that would be extremely difficult.

Some directives require one or more arguments. They are noted as:

%directive argument

and are described below each notation.

Text processing.

%join string

Equivalent to:

'<string>.join(i)'

Equivalent to:

'i.lower()'

Equivalent to:

'i.strip()'

See also %lstrips.

%lstrips string

Equivalent to:

'i.lstrip(<string>)'

See also %lstrip.

%partition string

Equivalent to:

'i.partition(<string>)'
%replace old new

Equivalent to:

'i.replace(<old>, <new>)'
%rpartition string

Equivalent to:

'i.rpartition(<string>)'

Equivalent to:

'i.rstrip()'

See also %rstrips.

%rstrips string

Equivalent to:

'i.rstrip(<string>)'

See also %rstrip.

Equivalent to:

'i.split()'

See also %splits.

%splits string

Equivalent to:

'i.split(<string>)'

See also %split.

Equivalent to:

'i.strip()'
%strips string

Equivalent to:

'i.strip(<string>)'

See also %strip.

Equivalent to:

'i.upper()'

Cast from one type to another.

Equivalent to:

'bool(i)'

Equivalent to:

'dict(i)'

Equivalent to:

'float(i)'

Equivalent to:

'int(i)'

Equivalent to:

'list(i)'

Equivalent to:

'set(i)'

Equivalent to:

'set(i)'

Equivalent to:

'tuple(i)'

Data elimination.

%filter <expression>

Include items matching the expression. Equivalent to:

%stream 'filter(<expression>, s)'
%filterfalse <expression>

Exclude items matching the expression. Equivalent to:

%stream 'itertools.filterfalse(<expression>, s)'

Parsing and serializing well-known formats.

Encode/decode a CSV. If the input is a stream it is read with csv.DictReader() in a manner that is equivalent to: %stream 'csv.DictReader(s)'.

If the input data is a dictionary, first a header row is written with all fields, and then all records are written with csv.QUOTE_ALL. It is not feasible to recreate this behavior with an expression.

Encode and decode JSON data. If the input is a string, this is equivalent to:

'json.loads(i)'

otherwise:

'json.dumps(i)'

Altering how items within the stream are grouped.

Accumulate all elements in the stream into a single iterable object. Equivalent to %stream '[list(s)]'.

%stream 'itertools.batched(s, N)'

For Python 3.12 onward, this is equivalent to %stream 'itertools.batched(s, <N>)'. For older versions of Python:

>>> from itertools import islice
>>> def batched(stream, N):
...     stream = iter(stream)
...     while chunk := tuple(it.islice(stream, N)):
...         yield tuple(chunk)
>>> result = batched(range(5), 2)
>>> print(list(result))
[(0, 1), (2, 3), (4,)]

Equivalent to:

%stream 'itertools.chain(s)'

Equivalent to:

%stream 'it.islice(s)'

Reverse the entire stream. Done in a memory efficient manner. Equivalent to both of the snippets below. See %rev for more details.

%stream 'reversed(s)'
%stream 's[::-1]'

Directives not belonging to another category.

In theory this is equivalent to "reversed(i)", but in practice often equivalent to "i[::-1]". Calling reversed() on a string produces a reversed object, but reversing a string with slicing like string[::-1] does produce a string. Same for lists and tuples. This directive knows about a few of these special cases, and attempts to preserve type. it will sometimes be wrong.

%eval <expression>

Mostly users do not need to be aware of this directive. Internally, pyin assumes that any expression not associated with a directive belongs to %eval. In code terms, these are equivalent:

'i + 1'
%eval 'i + 1'
%evalif <sentinel expression> <expression>

Only evaluate expression if the sentinel expression evaulates as true.

%evalif 'i % 2 == 0' 'i ** 2'
%exec <statement>

Execute a Python statement instead of an expression.

%execif <sentinel expression> <statement>

Like %evalif, but only execute statement if sentinel expression evaluates as true.

%execif 'i % 2 != 0' 'i -= 2'
%stream <expression>

Evaluate an expression on the stream itself.

Manual for the pyin Python library. pyin.eval() is mostly what users should interact with.

Evaluate one or more Python expressions against a stream of data. This snippet:

>>> import pyin
>>> stream = range(3)
>>> expressions = ['i + 1', '[i] * 3', '%json']
>>> for item in pyin.eval(expressions, stream):
...     print(item)
[1, 1, 1]
[2, 2, 2]
[3, 3, 3]

is equivalent to:

$ pyin --gen 'range(3)' 'i + 1' '[i] * 3' %json
[1, 1, 1]
[2, 2, 2]
[3, 3, 3]

Entrypoint to the CLI for use within Python. Does not catch all exceptions. A compliant argument parser is available via the argparse_parser() function. While not part of the official API, the _cli_entrypoint() function may be worth referencing. It contains an additional layer of error handling for the $ pyin utility and exists to bridge the gap between the shell and main().

An argparse.ArgumentParser() compatible with main().

Parses expressions and constructs the operation objects necessary to execute them. Users should not need to interact with this function.

Parses expressions and attempts to import the objects they reference into a single global scope. Users should not need to interact with this function.

Base class for implementing an operation. One operation implements one or more directives. See section below on Implementing an Operation.

Like pyin.OpBase(), but for an operation that is instantiated with additional information. These classes receive a global scope intended for use with Python's builtin eval() and exec() functions, and variable names in which data should be placed when evaluating an expression.

Indicates a directive is invalid or unusable in some manner. Subclasses RuntimeError() to trigger some of the special traceback handling in $ pyin. Typically this exception should only be raised when something is systematically wrong with how pyin was developed, not incorrect use.

An operation is a single class containing the code implementing one or more directives. Each operation class can implement multiple directives. Admittedly, the term operation doesn't make much sense, but here we are.

In theory this is pluggable...

A directive should ideally map directly to a Python function or common shell utility. For example, the %rev directive is identical to the $ rev utility. %reversed would also be an acceptable name, but is probably too long. However, directives should have one name and one name only - it is not OK to register both %rev and %reversed and use one as an alias for the other. Stick with the Zen of Python:

$ python -m this | grep "There should be one"
There should be one-- and preferably only one --obvious way to do it.

Subclassing pyin.OpBase()

An operation must subclass pyin.OpBase() and implement at least the __call__() method. The operation lists which directives it supports, at call time knows which directive it is executing, and receives a global scope to execute within. See the pyin.OpBase() class's source code for more information. pyin.OpEval() and pyin.OpJSON() are also good references.