
[WIP] Version 4 with better open source model support and no langchain #223

Merged: 16 commits merged into main from the v42 branch, Jan 22, 2024

Conversation

whitead (Collaborator) commented Jan 4, 2024

  • OpenAI v1
  • Remove Langchain
  • Write numpy vector store
  • Implement MMR
  • Pydantic v2
  • Implement embedding model
  • Write unit tests of new LLM calls
  • Revise add_text/query flows to support custom embedding functions
  • Revise make_chain flows to support custom LLMs calls
  • Write migration notes (pickling set client)
  • Create llama file notes (note continuous batching, own embeddings)
  • Get langchain vector store adapter
  • Write better example for adding_docs with externally provided embeddings

@whitead whitead changed the title Version 4 with better open source model support and no langchain [WIP] Version 4 with better open source model support and no langchain Jan 4, 2024
@mrubash1 mrubash1 left a comment

Added several docstring requests and a suggested reformatting of get_score().

Outside of that, +1 to these features:

  • OpenAI v1
  • Remove Langchain
  • Write numpy vector store
  • Implement MMR
  • Pydantic v2
  • Implement embedding model
  • Write unit tests of new LLM calls



def guess_model_type(model_name: str) -> str:
import openai

Could add a docstring here:

"Determines the type of model (either 'chat' or 'completion') based on the model name."
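
For illustration, a minimal sketch of what that docstring plus one possible heuristic could look like; the substring checks below are assumptions, not necessarily the PR's actual logic:

```python
def guess_model_type(model_name: str) -> str:
    """Determine the model type ('chat' or 'completion') from the model name.

    Illustrative heuristic only: OpenAI completion-style models have
    historically carried "instruct" or a "text-" prefix in their names.
    """
    if "instruct" in model_name or model_name.startswith("text-"):
        return "completion"
    return "chat"
```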

paperqa/llms.py Outdated
skip_system: bool = False,
system_prompt: str = default_system_prompt,
) -> Callable[[dict, list[Callable[[str], None]] | None], Coroutine[Any, Any, str]]:
"""Create a function to execute a batch of prompts

Could you declare in the docstring that this make_chain function is a replacement for langchain in paperqa <=v41? I think this will help old users track this major change.

whitead (Collaborator, Author) replied:

done
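
To make the new calling convention concrete, here is a tiny self-contained analogue of what make_chain returns per the annotation in the hunk above. This is a stub for illustration, not the PR's implementation:

```python
import asyncio
from typing import Any, Callable, Coroutine


def make_chain_stub() -> Callable[[dict, list[Callable[[str], None]] | None], Coroutine[Any, Any, str]]:
    # Stand-in for the real make_chain: echoes the input instead of calling an LLM.
    async def execute(data: dict, callbacks: list[Callable[[str], None]] | None = None) -> str:
        text = f"answer to {data['question']}"
        if callbacks:
            # Streaming callbacks receive chunks of the completion as they arrive.
            for cb in callbacks:
                cb(text)
        return text

    return execute


chain = make_chain_stub()
print(asyncio.run(chain({"question": "What replaced langchain?"}, None)))
```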

paperqa/llms.py Outdated

return execute
else:
raise NotImplementedError(f"Unknown model type {llm_config['model_type']}")

+1 for the edge case handling :)

raise NotImplementedError(f"Unknown model type {llm_config['model_type']}")


def get_score(text: str) -> int:

I have a suggestion for making this logic more accessible (if I/ChatGPT understand the current logic):


```python
import re


def get_score(text: str) -> int:
    # Check for N/A in the last line as a substring
    last_line = text.split("\n")[-1].lower()
    if "n/a" in last_line or "na" in last_line:
        return 0

    # Search for score patterns
    patterns = [
        r"[sS]core[:is\s]+([0-9]+)",  # Score: number
        r"\(([0-9])\w*\/",            # (number/
        r"([0-9]+)\w*\/",             # number/
    ]

    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            score = int(match.group(1))
            return min(score, 10)  # clamp to the 0-10 scale

    # Default scores based on text length
    return 1 if len(text) < 100 else 5
```

whitead (Collaborator, Author) replied:

This didn't pass the unit tests - this function definitely confuses me. Will get back to this later though; maybe we can separate this from the regex approach.

whitead (Collaborator, Author) followed up:

Looked more into this - let's wait until the JSON output format we discussed is implemented.
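
For reference, a hedged sketch of what that JSON-format alternative could look like; the function name and the "score" key are assumptions for illustration, not anything agreed in this thread:

```python
import json


def get_score_json(text: str) -> int:
    """Hypothetical: parse a 0-10 score from a model constrained to JSON output,
    e.g. {"score": 7}, sidestepping the regex patterns above entirely."""
    try:
        return min(int(json.loads(text)["score"]), 10)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0
```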

"for the question below based on the provided context. "
qa_prompt = (
"Write an answer ({answer_length}) "
"for the question below based on the provided context. Ignore irrelevant context. "
"If the context provides insufficient information and the question cannot be directly answered, "
'reply "I cannot answer". '
"For each part of your answer, indicate which sources most support it "
"via valid citation markers at the end of sentences, like (Example2012). \n"

Could you swap the citation example to wikicrow style here:

(Qiu2020aarFDC pages 3-3)

from the completion prompt in https://github.com/Future-House/WikiCrow

whitead (Collaborator, Author) replied:

Put the latest prompts in.

@@ -1,8 +1,8 @@
from pathlib import Path
from typing import List

import tiktoken

cool upgrade

@@ -76,6 +76,12 @@ def parse_pdf(path: Path, doc: Doc, chunk_chars: int, overlap: int) -> List[Text
def parse_txt(
path: Path, doc: Doc, chunk_chars: int, overlap: int, html: bool = False
) -> List[Text]:
"""Parse a document into chunks, based on tiktoken encoding.

Might want to add a hyperlink to tiktoken for reference here in the docstring: https://github.com/openai/tiktoken

whitead (Collaborator, Author) replied:

done
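
As an aside for readers new to tiktoken, a minimal token-window chunking sketch. Note this is illustrative only: it counts chunk size in tokens, whereas paperqa's chunk_chars/overlap parameters may have different semantics:

```python
import tiktoken


def chunk_by_tokens(text: str, chunk_tokens: int = 300, overlap: int = 50) -> list[str]:
    # Encode once, then slide a fixed-size token window with overlap.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    return [
        enc.decode(tokens[i : i + chunk_tokens])
        for i in range(0, len(tokens), step)
    ]
```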

paperqa/types.py Outdated
Args:
query: Query vector.
k: Number of results to return.
lambda_: Weighting of relevance and diversity.

Could you add to the docstring that the lambda can be tuned for different applications of paperqa?

whitead (Collaborator, Author) replied:

done - moved to be part of vectorstore class
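
For readers unfamiliar with MMR, a minimal numpy sketch of max marginal relevance with the lambda_ trade-off described in the docstring above (illustrative only; the PR's vector store implementation may differ):

```python
import numpy as np


def mmr_indices(query: np.ndarray, vectors: np.ndarray, k: int, lambda_: float = 0.9) -> list[int]:
    """Max marginal relevance: lambda_=1.0 is pure relevance; lower values favor diversity."""
    # Normalize once so dot products are cosine similarities.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    selected: list[int] = []
    while len(selected) < min(k, len(v)):
        if selected:
            # Highest similarity to anything already picked = redundancy penalty.
            redundancy = (v @ v[selected].T).max(axis=1)
            scores = lambda_ * sims - (1 - lambda_) * redundancy
            scores[selected] = -np.inf
        else:
            scores = sims
        selected.append(int(np.argmax(scores)))
    return selected
```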

"""Returns the set of variables implied by the format string"""
format_dict = _FormatDict()
s.format_map(format_dict)
return format_dict.key_set


class PromptCollection(BaseModel):

+1 love the change here to be str
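
For context, a plausible reconstruction of the _FormatDict helper the snippet above relies on; the class body and the wrapper name get_formatted_variables are assumptions, since this hunk only shows the function tail. str.format_map calls __missing__ for each absent key, so collecting those keys yields the variables a prompt template expects:

```python
class _FormatDict(dict):
    """Record every key a format string asks for instead of raising KeyError."""

    def __init__(self) -> None:
        super().__init__()
        self.key_set: set[str] = set()

    def __missing__(self, key: str) -> str:
        self.key_set.add(key)
        return ""


def get_formatted_variables(s: str) -> set[str]:
    """Returns the set of variables implied by the format string"""
    format_dict = _FormatDict()
    s.format_map(format_dict)
    return format_dict.key_set


# e.g. checking which variables a custom prompt expects:
assert get_formatted_variables("Answer ({answer_length}): {question}") == {"answer_length", "question"}
```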

@whitead whitead requested a review from mrubash1 January 11, 2024 19:41
@mrubash1 mrubash1 left a comment

Great work overall, and cool to see the implementation of the langchain vectorstore.

Tiny comments around a typo and one or two for clarity; otherwise good to go.

client = None
else:
client = AsyncOpenAI()
# backwards compatibility


thanks for this

paperqa/docs.py Outdated (thread resolved)
super().__init__(**data)
self._client = client
self._embedding_client = embedding_client
# run this here (instead of automatically) so it has access to privates
# If I ever figure out a better way of validating privates


don't know either sorry

def clear_docs(self):
self.texts = []
self.docs = {}
self.docnames = set()

def __getstate__(self):
# You may wonder why make these private if we're just going


I still don't fully track your intention here - just fyi (and no need to change anything)
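
For anyone tracking the migration notes ("pickling set client" in the checklist above), a hedged sketch of the pattern this __getstate__ appears to implement; set_client is an assumed name here, not confirmed by the diff:

```python
def __getstate__(self):
    # API clients hold sockets and credentials, so they can't be pickled.
    # Drop them from the state; after unpickling, callers restore them
    # (e.g. via a hypothetical set_client()) before using the Docs object.
    state = self.__dict__.copy()
    state["_client"] = None
    state["_embedding_client"] = None
    return state
```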

list[Text],
(
await self.texts_index.max_marginal_relevance_search(
self._embedding_client, answer.question, k=_k, fetch_k=5 * _k


Should we make this fetch_k multiplier configurable in the function, with a default of 5?
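
A sketch of that suggestion; the method name and the fetch_k_multiplier parameter are assumed for illustration, not taken from the diff:

```python
async def aget_evidence(self, answer, k: int = 10, fetch_k_multiplier: int = 5):
    # Default of 5 preserves the current fetch_k = 5 * k behavior while
    # letting callers widen or narrow the candidate pool passed to MMR.
    matches = await self.texts_index.max_marginal_relevance_search(
        self._embedding_client, answer.question, k=k, fetch_k=fetch_k_multiplier * k
    )
```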

@whitead whitead merged commit 3cb16f2 into main Jan 22, 2024
1 check passed
@whitead whitead deleted the v42 branch January 22, 2024 22:36