Commit
add: pymupdf/rag
iwilltry42 committed Oct 11, 2024
1 parent 0dddd76 commit ed50874
Showing 6 changed files with 686 additions and 0 deletions.
2 changes: 2 additions & 0 deletions LICENSE
@@ -1,3 +1,5 @@
DISCLAIMER: The files within the documentloaders/mupdf directory are differently licensed. Please refer to the LICENSE file in that folder for more information.

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
661 changes: 661 additions & 0 deletions documentloaders/mupdf/LICENSE

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions documentloaders/mupdf/foo.gpt
@@ -0,0 +1,5 @@
Tools: ./tool.gpt
Params: input: Path to input PDF file
Params: output: Path to output file where the extracted markdown content should be written to. Should end with .md

Call the MuPDF PDF Document Loader tool with the input and output parameters.
7 changes: 7 additions & 0 deletions documentloaders/mupdf/load.py
@@ -0,0 +1,7 @@
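import os
import pathlib

import pymupdf4llm

# The tool's `input` and `output` params are read from the INPUT and OUTPUT
# environment variables; os.environ raises a clear KeyError if one is missing.
md_text = pymupdf4llm.to_markdown(os.environ["INPUT"])

# Write the extracted markdown to the output file as UTF-8 bytes.
pathlib.Path(os.environ["OUTPUT"]).write_bytes(md_text.encode("utf-8"))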
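
The loader above reads its two paths straight from the environment. As a rough sketch only, and not part of this commit, the same conversion could be wrapped in a function that validates both paths first. The name `convert_pdf` and its `to_markdown` parameter are hypothetical, introduced here so the logic can be exercised without PyMuPDF installed; when no converter is passed, the sketch falls back to `pymupdf4llm.to_markdown`, as the script in the commit does.

```python
import os
import pathlib
from typing import Callable, Optional


def convert_pdf(
    input_path: Optional[str],
    output_path: Optional[str],
    to_markdown: Optional[Callable[[str], str]] = None,
) -> pathlib.Path:
    """Convert a PDF to markdown and write it to output_path.

    Hypothetical helper: `to_markdown` is an injectable converter that
    defaults to pymupdf4llm.to_markdown.
    """
    # Fail early with a readable message if either path is missing.
    if not input_path:
        raise ValueError("INPUT is not set")
    if not output_path:
        raise ValueError("OUTPUT is not set")

    if to_markdown is None:
        # Imported lazily so the helper can be tested with a stand-in converter.
        import pymupdf4llm

        to_markdown = pymupdf4llm.to_markdown

    out = pathlib.Path(output_path)
    # Create the destination directory if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(to_markdown(input_path), encoding="utf-8")
    return out
```

A script entry point could then call `convert_pdf(os.getenv("INPUT"), os.getenv("OUTPUT"))`. Whether the extra validation is worth the added lines for a seven-line loader is a judgment call.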
3 changes: 3 additions & 0 deletions documentloaders/mupdf/requirements.txt
@@ -0,0 +1,3 @@
-i https://pypi.org/simple
pymupdf==1.24.11; python_version >= '3.8'
pymupdf4llm==0.0.17
8 changes: 8 additions & 0 deletions documentloaders/mupdf/tool.gpt
@@ -0,0 +1,8 @@
Name: MuPDF PDF Document Loader
Params: input: Path to input PDF file
Params: output: Path to output file where the extracted markdown content should be written to. Should end with .md
metadata: supported-file-type: pdf

#!/usr/bin/env python3 ${GPTSCRIPT_TOOL_DIR}/load.py

