Commit 3bc45cf (1 parent: d40f31e). Showing 1 changed file with 322 additions and 0 deletions.
# PDFix Pipeline Runner #

The **PDFix Pipeline Runner** is a command-line tool that executes automated workflows ("pipelines") for processing PDF documents. A pipeline is built from modular actions, each defined in a JSON configuration file, that perform tasks such as OCR, language detection, and compliance checks (e.g., PDF/UA standards). Actions can run either locally or inside Docker containers, so the same pipeline definition can be used across different environments.

Custom pipelines are built by chaining actions through their input/output dependencies, so that each action transforms the file or updates the metadata produced by the previous one. The runner also provides logging options and a version display option for debugging.

### Key Features:
- **Flexible Pipelines:** Define and execute workflows using a JSON-based configuration.
- **Modular Actions:** Support for both local CLI tools and Docker-based operations.
- **Dynamic Arguments:** Use macros for referencing outputs between actions.
- **Cross-Platform Support:** Compatible with Windows, macOS, and Linux.
- **Error Handling:** Customizable acceptable return codes for robust error management.
- **Logging:** Configurable directories for storing execution logs.

## Pipeline execution

Minimal pipeline-runner execution:
```bash
./pipeline-runner -p <pipeline.json>
```

Parameter description:
```
-p,--pipeline <pipeline_config>    Path to the pipeline configuration JSON
-w,-workingdir <working-dir>       A directory to store any input files and files generated during the
                                   execution of a pipeline. A system temporary folder is used
                                   if no folder is provided. (optional)
--log <log-location>               A directory or file path where execution log files will be written.
--log_level <level>                A level of logged information [none, critical, error, warning, info, debug]
--version                          Display the application version only
```
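
When the runner is launched from a script, it can help to assemble the command line in one place. The following is a minimal sketch in Python that uses only the options listed above; the helper name `build_runner_command` and the example paths are hypothetical, and the runner binary is assumed to be in the current directory.

```python
import shlex

# Log levels as listed in the parameter description above.
LOG_LEVELS = {"none", "critical", "error", "warning", "info", "debug"}


def build_runner_command(pipeline, working_dir=None, log=None, log_level=None,
                         runner="./pipeline-runner"):
    """Assemble an argv list for pipeline-runner (hypothetical helper)."""
    argv = [runner, "-p", pipeline]
    if working_dir is not None:
        argv += ["-w", working_dir]
    if log is not None:
        argv += ["--log", log]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level}")
        argv += ["--log_level", log_level]
    return argv


cmd = build_runner_command("pipeline.json", working_dir="/tmp/work", log_level="info")
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.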

## Pipeline configuration

The configuration is a JSON document with an array of actions to be executed.

### Defining actions

The structure of this JSON is as follows:
```
{
  "title": "The command name",
  "actions": [
    {
      "name": "action-1",
      "configuration": {},
      "args": []
    },
    {
      "name": "action-2",
      "configuration": {},
      "args": []
    }
  ]
}
```
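
A configuration file can be checked for this basic shape before it is handed to the runner. The sketch below is a hypothetical pre-flight check written in Python, not part of the pipeline-runner itself; it assumes that `actions` must be an array and that every action needs a `name`, which is how the structure above reads.

```python
import json


def check_skeleton(doc):
    """Return a list of structural problems; an empty list means none were found."""
    actions = doc.get("actions")
    if not isinstance(actions, list):
        return ["'actions' must be an array"]
    problems = []
    for index, action in enumerate(actions):
        if not isinstance(action, dict):
            problems.append(f"action #{index}: must be a JSON object")
            continue
        if not isinstance(action.get("name"), str):
            problems.append(f"action #{index}: missing 'name'")
        if not isinstance(action.get("configuration", {}), dict):
            problems.append(f"action #{index}: 'configuration' must be an object")
        if not isinstance(action.get("args", []), list):
            problems.append(f"action #{index}: 'args' must be an array")
    return problems


pipeline = {
    "title": "The command name",
    "actions": [
        {"name": "action-1", "configuration": {}, "args": []},
        {"name": "action-2", "configuration": {}, "args": []},
    ],
}

# Round-trip through JSON text, as the runner would read it from a file.
text = json.dumps(pipeline, indent=2)
print(check_skeleton(json.loads(text)))
```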

## Action definition

Each action is described by a JSON node containing the instructions needed to execute it.

Supported action types are:
- **local**, a command-line application installed on the system
- **docker**, a command-line application packaged in a Docker image (*Docker must be installed on the system to execute such an action*)

Available actions are listed on [PDFix Marketplace](https://pdfix.net/products/actions-marketplace/)

Action example:
```
{
  "name": "action-1",
  "id": "action-1-id",
  "configuration": {
    "path": "<path to a local program>",
    "program": "my_cli_app -i ${input_pdf} -o ${output_pdf}",
    "platform": [ "windows", "darwin" ],
    "returnCodes": [ 0 ]
  },
  "args": []
}
```

### Action parameters

- **name** is a string identifier of an action. The action name can be referenced from argument values.
- **id** is a unique identifier of an action within the pipeline. This identifier is used when referencing values of other actions in the pipeline. If `id` is not provided, the processor uses the action name as the identifier.
- **configuration** defines how the action is executed and which platforms it supports
- **args** defines the arguments passed to the program when the action is executed
- **title** is an optional user-friendly name of an action
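
The fallback from `id` to `name` can be sketched in a few lines of Python. The function names below are hypothetical; the duplicate check reflects the statement that an identifier is unique within the pipeline.

```python
from collections import Counter


def effective_id(action):
    """Identifier used for references: `id` if present, otherwise `name`."""
    return action.get("id", action["name"])


def duplicate_ids(actions):
    """Return identifiers that occur more than once in the pipeline."""
    counts = Counter(effective_id(action) for action in actions)
    return sorted(ident for ident, n in counts.items() if n > 1)


actions = [
    {"name": "action-1", "id": "action-1-id"},
    {"name": "action-2"},                        # no id: the name is used
    {"name": "other", "id": "action-2"},         # clashes with the action above
]
print([effective_id(a) for a in actions])
print(duplicate_ids(actions))
```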

Example of an action whose argument references a value from another action in the pipeline:
```
{
  "actions": [
    {
      "name": "action-1",
      "id": "action-1-id",
      "configuration": {
        "path": "<path to a local program>",
        "program": "my_cli_app -i ${input_pdf} -o ${output_pdf}"
      },
      "args": [
        {
          "name": "input_pdf",
          "value": "/usr/tmp/input.pdf"
        },
        {
          "name": "output_pdf",
          "value": ""
        }
      ]
    },
    {
      "name": "action-2",
      "id": "action-2-id",
      "configuration": {
        "path": "<path to a local program>",
        "program": "my_cli2_app -i ${input_pdf} -o ${output_pdf}"
      },
      "args": [
        {
          "name": "input_pdf",
          "value": "${action-1-id.output_pdf}"
        },
        {
          "name": "output_pdf",
          "value": "/usr/tmp/processed.pdf"
        }
      ]
    }
  ]
}
```
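
To illustrate how a `${action-id.argument}` reference could be resolved, here is a minimal Python sketch. It is not the runner's implementation: it simply walks the actions in order and substitutes values of earlier actions. The intermediate path `/usr/tmp/step1.pdf` is made up for the example, because how the runner fills an empty `output_pdf` is not described here.

```python
import re

# Matches ${<action id>.<argument name>}; ids may contain hyphens.
REFERENCE = re.compile(r"\$\{([^${}.]+)\.([^${}.]+)\}")


def resolve_references(actions):
    """Return {action id: {argument name: value}} with references substituted."""
    known = {}

    def lookup(match):
        ref_id, ref_arg = match.group(1), match.group(2)
        try:
            return str(known[ref_id][ref_arg])
        except KeyError:
            raise KeyError(f"unresolved reference ${{{ref_id}.{ref_arg}}}") from None

    for action in actions:
        action_id = action.get("id", action["name"])
        resolved = {}
        for arg in action.get("args", []):
            value = arg["value"]
            if isinstance(value, str):
                value = REFERENCE.sub(lookup, value)
            resolved[arg["name"]] = value
        known[action_id] = resolved
    return known


actions = [
    {"name": "action-1", "id": "action-1-id", "args": [
        {"name": "input_pdf", "value": "/usr/tmp/input.pdf"},
        {"name": "output_pdf", "value": "/usr/tmp/step1.pdf"},  # made-up path
    ]},
    {"name": "action-2", "id": "action-2-id", "args": [
        {"name": "input_pdf", "value": "${action-1-id.output_pdf}"},
        {"name": "output_pdf", "value": "/usr/tmp/processed.pdf"},
    ]},
]
resolved = resolve_references(actions)
print(resolved["action-2-id"]["input_pdf"])
```

Only earlier actions are visible to `lookup`, so a reference to a later or misspelled action fails immediately instead of being passed to the program unchanged.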

### Configuration

**configuration** is a structure that defines how the external CLI program is executed.

Configuration parameters description:
- **path** is used for local actions and tells the processor where the executable of an action is located. It is not required if the command is available system-wide (e.g. `docker`, `java`, ...).
- **program** is the full command line to execute, including input and output parameters. It is highly recommended to put value macros into quotes: `\"${input_pdf}\"`. The `program` may also contain macros that the pipeline-runner provides automatically: `${working_directory}`, `${license_name}`, `${license_key}`.
- **platform**, if provided, defines the platforms supported by this action. The available values are:
  - `windows` for Windows operating systems
  - `darwin` for macOS operating systems
  - `linux` for Linux operating systems
- **returnCodes** is an array of acceptable return codes of the program. Any exit code not listed in this array is considered an error and terminates the pipeline. The default value is `[ 0 ]`.
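
The `platform` and `returnCodes` rules can be expressed as two small checks. This is a hedged Python sketch with hypothetical function names; treating a missing `platform` list as "supported everywhere" is an assumption based on the "if provided" wording.

```python
import sys


def current_platform(sys_platform=sys.platform):
    """Map Python's sys.platform to the platform names used in the configuration."""
    if sys_platform.startswith("win"):
        return "windows"
    if sys_platform.startswith("linux"):
        return "linux"
    return sys_platform  # "darwin" on macOS


def is_supported(configuration, platform):
    """True if the action may run on `platform` (no list is assumed to mean all)."""
    platforms = configuration.get("platform")
    return not platforms or platform in platforms


def is_success(configuration, exit_code):
    """True if `exit_code` is an acceptable return code; the default is [0]."""
    return exit_code in configuration.get("returnCodes", [0])


config = {"platform": ["windows", "darwin"], "returnCodes": [0, 2]}
print(is_supported(config, "linux"), is_success(config, 2), is_success({}, 1))
```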

### Arguments

**args** is an array of user-defined arguments. Their values replace the macros in the `program` string of the `configuration`, much like variable substitution in a bash command.

Each argument is defined by the required properties `name` and `value`. Further properties may carry additional information that helps to integrate the action into other applications or to optimize its execution. The argument name is used as the macro name, and the macro is replaced by the argument value in the `"program"` string. For example, the macro `${input_pdf}` is replaced with the `value` of the argument named `input_pdf`:
`"program": "my_cli -i ${input_pdf}"` becomes `"program": "my_cli -i /usr/tmp/input.pdf"`
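
The substitution described above can be sketched in Python as follows. The helper `expand_program` is hypothetical, and leaving unknown macros untouched is an assumption; the runner may handle them differently.

```python
import re

# Matches plain macros such as ${input_pdf}.
MACRO = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_program(program, args, provided=None):
    """Replace ${name} macros in `program` with argument values.

    `provided` holds values supplied by the runner itself, for example
    working_directory; user arguments take precedence in this sketch.
    """
    values = dict(provided or {})
    values.update({arg["name"]: str(arg["value"]) for arg in args})
    # Unknown macros are left as they are (an assumption of this sketch).
    return MACRO.sub(lambda m: values.get(m.group(1), m.group(0)), program)


args = [{"name": "input_pdf", "value": "/usr/tmp/input.pdf"}]
print(expand_program("my_cli -i ${input_pdf}", args))
```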

- **name** defines the name of the argument and, with it, the macro that is replaced by the value in the `program` string
- **value** defines the value of the argument
- **flags** defines argument *flags*. The value is written in decimal (for example `2` for *0x2*). The available flags are:
  - *0x0* no flags defined for the argument
  - *0x1* argument is read-only
  - *0x2* argument represents an input file of the command. If set, the argument value is *required*.
  - *0x4* argument represents an output file of the command. If set, the argument value is *optional*.
- **ext** defines the file extension of the value when the value refers to a file name.
- **type** defines the type of the argument value. Possible values are: `string`, `int`, `file_path`, `json`. The default value is `string`.
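
The flag values map naturally onto a bit-flag enumeration. The Python sketch below uses hypothetical names (`ArgFlags`, `check_argument`) and assumes that flags can be combined by adding their values, which the list above suggests but does not state.

```python
from enum import IntFlag

ARG_TYPES = {"string", "int", "file_path", "json"}


class ArgFlags(IntFlag):
    NONE = 0x0
    READ_ONLY = 0x1
    INPUT_FILE = 0x2
    OUTPUT_FILE = 0x4


def check_argument(arg):
    """Return a list of problems found in one argument definition."""
    problems = []
    flags = ArgFlags(arg.get("flags", 0))
    # An input file must have a value; an output file may leave it empty.
    if ArgFlags.INPUT_FILE in flags and not arg.get("value"):
        problems.append(f"{arg['name']}: input file requires a value")
    if arg.get("type", "string") not in ARG_TYPES:
        problems.append(f"{arg['name']}: unknown type {arg['type']!r}")
    return problems


print(check_argument({"name": "input_pdf", "flags": 2, "type": "file_path", "value": ""}))
print(check_argument({"name": "output_pdf", "flags": 4, "value": ""}))
```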

Program arguments example:
```
{
  "name": "input_pdf",
  "desc": "Path to PDF document you want to process",
  "flags": 2,
  "ext": "pdf",
  "type": "file_path",
  "value": "/usr/tmp/input.pdf"
}
```

## Examples
The following pipeline runs these steps:
- **OCR Document** with docker image `pdfix/ocr-tesseract`
- **Detect document language** with action in docker image `pdfix/lang-detect`
- **Autotag** PDF with PDFix SDK installed locally
- **Set PDF/UA Standard** in document metadata with PDFix SDK installed locally

The input file of this pipeline is */usr/tmp/this_is_input.pdf*. The output is saved to */usr/tmp/this_is_output.pdf*.
Paths to the input and output files should be set only in the first and last actions. The other actions refer to the files of their predecessors with macros such as `${action-id.output_pdf}`; in the example, the intermediate `output_pdf` values are left empty.

**Note:** Macros `${license_name}` and `${license_key}` are automatically provided by the pipeline-runner when the license is active on the system.

**An example of the full pipeline-runner configuration:**

```
{
  "actions": [
    {
      "args": [
        {
          "name": "input_pdf",
          "value": "/usr/tmp/this_is_input.pdf"
        },
        {
          "name": "output_pdf",
          "value": ""
        },
        {
          "name": "language",
          "value": "eng"
        }
      ],
      "configuration": {
        "path": "",
        "program": "docker run --platform linux/amd64 -v \"${working_directory}:/data\" --rm pdfix/ocr-tesseract:v0.4.4 --name \"${license_name}\" --key \"${license_key}\" ocr -i \"/data/${input_pdf}\" -o \"/data/${output_pdf}\" --lang \"${language}\"",
        "returnCodes": [
          0
        ]
      },
      "id": "ocr_tesseract",
      "name": "ocr_tesseract",
      "title": "OCR Tesseract"
    },
    {
      "args": [
        {
          "name": "input_pdf",
          "value": "${ocr_tesseract.output_pdf}"
        },
        {
          "name": "output_pdf",
          "value": ""
        }
      ],
      "configuration": {
        "path": "",
        "program": "docker run --platform linux/amd64 -v ${working_directory}:/data -w /data --rm pdfix/lang-detect:v0.4.4 --name \"${license_name}\" --key \"${license_key}\" lang-detect -i \"/data/${input_pdf}\" -o \"/data/${output_pdf}\"",
        "returnCodes": [
          0
        ]
      },
      "id": "language_detection",
      "name": "language_detection",
      "title": "Language Detection"
    },
    {
      "args": [
        {
          "name": "input_pdf",
          "value": "${language_detection.output_pdf}"
        },
        {
          "name": "output_pdf",
          "value": ""
        }
      ],
      "configuration": {
        "path": "/Applications/PDFix SDK",
        "program": "pdfix_app batch --input \"${input_pdf}\" --output \"${output_pdf}\" -c \"${config_json}\" --progress -m \"${license_name}\" -k \"${license_key}\"",
        "returnCodes": [
          0
        ]
      },
      "id": "add_tags",
      "name": "add_tags",
      "params": [
        {
          "name": "template",
          "value": ""
        },
        {
          "name": "preflight",
          "value": true
        },
        {
          "name": "standard_attrs",
          "value": "false"
        },
        {
          "name": "sequential_headings",
          "value": "true"
        }
      ],
      "title": "Autotag"
    },
    {
      "args": [
        {
          "name": "input_pdf",
          "value": "${add_tags.output_pdf}"
        },
        {
          "name": "output_pdf",
          "value": "/usr/tmp/this_is_output.pdf"
        }
      ],
      "configuration": {
        "path": "/Applications/PDFix SDK",
        "program": "pdfix_app batch --input \"${input_pdf}\" --output \"${output_pdf}\" -c \"${config_json}\" --progress -m \"${license_name}\" -k \"${license_key}\"",
        "returnCodes": [
          0
        ]
      },
      "id": "set_pdf_ua_standard",
      "name": "set_pdf_ua_standard",
      "params": [
        {
          "name": "part_number",
          "value": "1"
        }
      ],
      "title": "Set PDF/UA Standard"
    }
  ],
  "name": "custom_action4",
  "id": "custom_action4",
  "title": "Custom Action with OCR and PDF/UA fixing",
  "type": "custom",
  "version": {
    "major": 1,
    "minor": 0,
    "patch": 0
  }
}
```
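
In a pipeline of this length, a mistyped reference is easy to miss. The following Python sketch is a hypothetical lint step, not a feature of the pipeline-runner: it reports every `${action-id.argument}` macro in an argument value that does not point to an argument of an earlier action. The two-action pipeline in the example is made up to show one broken reference.

```python
import re

MACRO = re.compile(r"\$\{([^{}]+)\}")


def find_broken_references(pipeline):
    """Return messages for references that no earlier action can satisfy."""
    seen = {}       # action id -> set of argument names defined so far
    problems = []
    for action in pipeline["actions"]:
        action_id = action.get("id", action["name"])
        for arg in action.get("args", []):
            value = arg.get("value")
            if not isinstance(value, str):
                continue
            for macro in MACRO.findall(value):
                if "." not in macro:
                    continue  # plain macro, not a cross-action reference
                ref_id, ref_arg = macro.split(".", 1)
                if ref_arg not in seen.get(ref_id, set()):
                    problems.append(
                        f"{action_id}.{arg['name']}: unknown reference ${{{macro}}}"
                    )
        seen[action_id] = {arg["name"] for arg in action.get("args", [])}
    return problems


pipeline = {
    "actions": [
        {"name": "ocr", "args": [{"name": "output_pdf", "value": ""}]},
        # "output.pdf" instead of "output_pdf": a typical typo
        {"name": "tag", "args": [{"name": "input_pdf", "value": "${ocr.output.pdf}"}]},
    ]
}
print(find_broken_references(pipeline))
```

For a real configuration, the dictionary would come from `json.load` on the pipeline file before the runner is started.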