[WIP] Version 4 with better open source model support and no langchain #223
Conversation
whitead commented Jan 4, 2024 (edited)
- OpenAI v1
- Remove Langchain
- Write numpy vector store
- Implement MMR (see the sketch below this list)
- Pydantic v2
- Implement embedding model
- Write unit tests of new LLM calls
- Revise add_text/query flows to support custom embedding functions
- Revise make_chain flows to support custom LLMs calls
- Write migration notes (pickling set client)
- Create llama file notes (note continuous batching, own embeddings)
- Get langchain vector store adapter
- Write better example for adding_docs with externally provided embeddings
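On the MMR item: a minimal sketch of maximal marginal relevance over a numpy store (a hypothetical standalone helper assuming unit-normalized embeddings, not necessarily what this PR ships):

import numpy as np

def mmr_indices(query: np.ndarray, vectors: np.ndarray, k: int = 4, lambda_: float = 0.5) -> list[int]:
    # relevance of every stored vector to the query (dot product == cosine under unit norms)
    relevance = vectors @ query
    selected: list[int] = []
    while len(selected) < min(k, len(vectors)):
        if selected:
            # penalize similarity to the closest already-selected vector
            diversity_penalty = (vectors @ vectors[selected].T).max(axis=1)
        else:
            diversity_penalty = np.zeros(len(vectors))
        scores = lambda_ * relevance - (1 - lambda_) * diversity_penalty
        scores[selected] = -np.inf  # never re-select
        selected.append(int(scores.argmax()))
    return selected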
Added several docstring requests and a reformatting of get_score().
Outside of that, +1 to these features:
- OpenAI v1
- Remove Langchain
- Write numpy vector store
- Implement MMR
- Pydantic v2
- Implement embedding model
- Write unit tests of new LLM calls
def guess_model_type(model_name: str) -> str:
    import openai
could add a docstring here, e.g.:
Determines the type of model (either 'chat' or 'completion') based on the model name
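In place, that might look like (a sketch, wording up to you):

def guess_model_type(model_name: str) -> str:
    """Determine the type of model ('chat' or 'completion') based on the model name.

    For example, 'gpt-3.5-turbo' is a chat model, while
    'gpt-3.5-turbo-instruct' is a completion model.
    """
    import openai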
paperqa/llms.py (Outdated)

    skip_system: bool = False,
    system_prompt: str = default_system_prompt,
) -> Callable[[dict, list[Callable[[str], None]] | None], Coroutine[Any, Any, str]]:
    """Create a function to execute a batch of prompts
Could you declare in the docstring that this make_chain function is a replacement for langchain in paperqa versions before v4?
I think this will help old users track this major change.
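e.g. the opening of the docstring could read (a suggestion, exact wording flexible):

"""Create a function to execute a batch of prompts.

This replaces the langchain chains used by paperqa before v4.
"""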
done
paperqa/llms.py (Outdated)

        return execute
    else:
        raise NotImplementedError(f"Unknown model type {llm_config['model_type']}")
+1 for the edge case handling :)
        raise NotImplementedError(f"Unknown model type {llm_config['model_type']}")


def get_score(text: str) -> int:
I have a suggestion for making this logic more accessible (if I/ChatGPT understand the current logic):
import re

def get_score(text: str) -> int:
    # Check for N/A in the last line as a substring
    last_line = text.split("\n")[-1].lower()
    if "n/a" in last_line or "na" in last_line:
        return 0
    # Search for score patterns
    patterns = [
        r"[sS]core[:is\s]+([0-9]+)",  # Score: number
        r"\(([0-9])\w*\/",  # (number/
        r"([0-9]+)\w*\/",  # number/
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            # Cap the score at 10
            return min(int(match.group(1)), 10)
    # Default scores based on text length
    return 1 if len(text) < 100 else 5
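For example, under this logic:

get_score("Score: 8")           # first pattern -> 8
get_score("Relevant. (7/10)")   # parenthesized pattern -> 7
get_score("The result is n/a")  # "n/a" in last line -> 0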
This didn't pass unit tests - this function confuses me for sure. Will get back to this later though - maybe we can separate this from regex
Looked more into this - let's wait until the JSON format thing we discussed is implemented
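For later reference, that could look roughly like this (a hypothetical sketch, assuming the eval prompt is changed to request a JSON object with a relevance_score field):

import json

def get_score(text: str) -> int:
    # assumes the LLM replied with e.g. {"summary": "...", "relevance_score": 7}
    try:
        return min(int(json.loads(text)["relevance_score"]), 10)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0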
paperqa/prompts.py (Outdated)

qa_prompt = (
    "Write an answer ({answer_length}) "
    "for the question below based on the provided context. Ignore irrelevant context. "
    "If the context provides insufficient information and the question cannot be directly answered, "
    'reply "I cannot answer". '
    "For each part of your answer, indicate which sources most support it "
    "via valid citation markers at the end of sentences, like (Example2012). \n"
Could you swap the citation to be WikiCrow style here:
(Qiu2020aarFDC pages 3-3)
from the completion prompt: https://github.com/Future-House/WikiCrow
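i.e. the example marker in the prompt would become something like:

"via valid citation markers at the end of sentences, like (Example2012 pages 3-4). \n"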
Put the latest prompts in
@@ -1,8 +1,8 @@
from pathlib import Path
from typing import List

import tiktoken
cool upgrade
@@ -76,6 +76,12 @@ def parse_pdf(path: Path, doc: Doc, chunk_chars: int, overlap: int) -> List[Text]
def parse_txt(
    path: Path, doc: Doc, chunk_chars: int, overlap: int, html: bool = False
) -> List[Text]:
    """Parse a document into chunks, based on tiktoken encoding.
might want to add the hyperlink to tiktoken for reference here in the docstring
https://github.com/openai/tiktoken
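For anyone new to it, the token round-trip the chunking relies on looks roughly like this (e.g. with the cl100k_base encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("a long document ...")
first_chunk = enc.decode(tokens[:3000])  # chunk boundaries measured in tokens, not characters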
done
paperqa/types.py (Outdated)

        Args:
            query: Query vector.
            k: Number of results to return.
            lambda_: Weighting of relevance and diversity.
Could you add to the docstring that the lambda can be tuned for different applications of paperqa?
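e.g. the entry could read something like this (assuming the usual MMR convention where lambda_ = 1 is pure relevance):

lambda_: Weighting of relevance and diversity, where lambda_ = 1 ranks by
    pure relevance and lambda_ = 0 maximizes diversity; can be tuned for
    different applications of paperqa.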
done - moved to be part of the vector store class
"""Returns the set of variables implied by the format string""" | ||
format_dict = _FormatDict() | ||
s.format_map(format_dict) | ||
return format_dict.key_set | ||
|
||
|
||
class PromptCollection(BaseModel): |
+1 love the change here to be str
Great work overall, and cool to see the implementation of the langchain vectorstore.
Tiny comments around a typo and 1 or 2 for clarity, otherwise good to go.
    client = None
else:
    client = AsyncOpenAI()
# backwards compatibility
thanks for this
super().__init__(**data)
self._client = client
self._embedding_client = embedding_client
# run this here (instead of automatically) so it has access to privates
# If I ever figure out a better way of validating privates
don't know either sorry
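One possible direction (a sketch, not verified against this PR): pydantic v2's model_post_init hook runs after __init__ and does have access to private attributes:

from typing import Any
from pydantic import BaseModel, PrivateAttr

class Docs(BaseModel):
    _client: Any = PrivateAttr(default=None)

    def model_post_init(self, __context: Any) -> None:
        # runs after validation/__init__, with privates like self._client available
        if self._client is None:
            pass  # e.g. validate or default the private client here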
def clear_docs(self):
    self.texts = []
    self.docs = {}
    self.docnames = set()


def __getstate__(self):
    # You may wonder why make these private if we're just going
I still don't fully track your intention here - just fyi (and no need to change anything)
list[Text],
(
    await self.texts_index.max_marginal_relevance_search(
        self._embedding_client, answer.question, k=_k, fetch_k=5 * _k
should we make this fetch_k multiplier configurable in the function, with a default of 5?
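i.e. something along these lines (fetch_k_multiplier is a hypothetical name):

async def aget_evidence(self, answer, k: int = 10, fetch_k_multiplier: int = 5):
    # fetch_k controls the MMR candidate pool; expose the multiplier instead of hard-coding 5
    matches = await self.texts_index.max_marginal_relevance_search(
        self._embedding_client, answer.question, k=k, fetch_k=fetch_k_multiplier * k
    )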