
[WIP] Version 4 with better open source model support and no langchain #223

Merged: 16 commits merged into main from the v42 branch, Jan 22, 2024

Conversation

whitead (Collaborator) commented Jan 4, 2024

  • OpenAI v1
  • Remove Langchain
  • Write numpy vector store
  • Implement MMR
  • Pydantic v2
  • Implement embedding model
  • Write unit tests of new LLM calls
  • Revise add_text/query flows to support custom embedding functions
  • Revise make_chain flows to support custom LLMs calls
  • Write migration notes (pickling set client)
  • Create llama file notes (note continuous batching, own embeddings)
  • Get langchain vector store adapter
  • Write better example for adding_docs with externally provided embeddings

@whitead whitead changed the title Version 4 with better open source model support and no langchain [WIP] Version 4 with better open source model support and no langchain Jan 4, 2024
@mrubash1 mrubash1 left a comment

Added several docstring requests and a suggested reformatting of get_score().

Outside of that, +1 to these features:

  • OpenAI v1
  • Remove Langchain
  • Write numpy vector store
  • Implement MMR
  • Pydantic v2
  • Implement embedding model
  • Write unit tests of new LLM calls



def guess_model_type(model_name: str) -> str:
import openai

Could add a docstring here:

"Determines the type of model (either 'chat' or 'completion') based on the model name."
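
For illustration, a minimal sketch of what that docstring plus one possible heuristic could look like; the substring checks below are assumptions, not necessarily the PR's actual logic:

```python
def guess_model_type(model_name: str) -> str:
    """Determine the model type ('chat' or 'completion') from the model name.

    Illustrative heuristic only: OpenAI completion-style models have
    historically carried "instruct" or a "text-" prefix in their names.
    """
    if "instruct" in model_name or model_name.startswith("text-"):
        return "completion"
    return "chat"
```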

paperqa/llms.py Outdated
skip_system: bool = False,
system_prompt: str = default_system_prompt,
) -> Callable[[dict, list[Callable[[str], None]] | None], Coroutine[Any, Any, str]]:
"""Create a function to execute a batch of prompts

Could you declare in the docstring that this make_chain function is a replacement for langchain in paperqa <=v41? I think this will help old users track this major change.

whitead (Collaborator, Author) replied:

done
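
To make the new calling convention concrete, here is a tiny self-contained analogue of what make_chain returns per the annotation in the hunk above. This is a stub for illustration, not the PR's implementation:

```python
import asyncio
from typing import Any, Callable, Coroutine


def make_chain_stub() -> Callable[[dict, list[Callable[[str], None]] | None], Coroutine[Any, Any, str]]:
    # Stand-in for the real make_chain: echoes the input instead of calling an LLM.
    async def execute(data: dict, callbacks: list[Callable[[str], None]] | None = None) -> str:
        text = f"answer to {data['question']}"
        if callbacks:
            # Streaming callbacks receive chunks of the completion as they arrive.
            for cb in callbacks:
                cb(text)
        return text

    return execute


chain = make_chain_stub()
print(asyncio.run(chain({"question": "What replaced langchain?"}, None)))
```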

paperqa/llms.py Outdated

return execute
else:
raise NotImplementedError(f"Unknown model type {llm_config['model_type']}")

+1 for the edge case handling :)

raise NotImplementedError(f"Unknown model type {llm_config['model_type']}")


def get_score(text: str) -> int:

I have a suggestion for making this logic more accessible (if I/ChatGPT understand the current logic):


```python
import re


def get_score(text: str) -> int:
    # Check for N/A in the last line as a substring
    last_line = text.split("\n")[-1].lower()
    if "n/a" in last_line or "na" in last_line:
        return 0

    # Search for score patterns
    patterns = [
        r"[sS]core[:is\s]+([0-9]+)",  # Score: number
        r"\(([0-9])\w*\/",            # (number/
        r"([0-9]+)\w*\/",             # number/
    ]

    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            score = int(match.group(1))
            return min(score, 10)  # clamp to the 0-10 scale

    # Default scores based on text length
    return 1 if len(text) < 100 else 5
```

whitead (Collaborator, Author) replied:

This didn't pass the unit tests - this function definitely confuses me. Will get back to this later though; maybe we can separate this from the regex approach.

whitead (Collaborator, Author) followed up:

Looked more into this - let's wait until the JSON output format we discussed is implemented.
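
For reference, a hedged sketch of what that JSON-format alternative could look like; the function name and the "score" key are assumptions for illustration, not anything agreed in this thread:

```python
import json


def get_score_json(text: str) -> int:
    """Hypothetical: parse a 0-10 score from a model constrained to JSON output,
    e.g. {"score": 7}, sidestepping the regex patterns above entirely."""
    try:
        return min(int(json.loads(text)["score"]), 10)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0
```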

"for the question below based on the provided context. "
qa_prompt = (
"Write an answer ({answer_length}) "
"for the question below based on the provided context. Ignore irrelevant context. "
"If the context provides insufficient information and the question cannot be directly answered, "
'reply "I cannot answer". '
"For each part of your answer, indicate which sources most support it "
"via valid citation markers at the end of sentences, like (Example2012). \n"

Could you swap the citation example to wikicrow style here:

(Qiu2020aarFDC pages 3-3)

from the completion prompt in https://github.com/Future-House/WikiCrow

whitead (Collaborator, Author) replied:

Put the latest prompts in.

@@ -1,8 +1,8 @@
from pathlib import Path
from typing import List

import tiktoken

cool upgrade

@@ -76,6 +76,12 @@ def parse_pdf(path: Path, doc: Doc, chunk_chars: int, overlap: int) -> List[Text
def parse_txt(
path: Path, doc: Doc, chunk_chars: int, overlap: int, html: bool = False
) -> List[Text]:
"""Parse a document into chunks, based on tiktoken encoding.

Might want to add a hyperlink to tiktoken for reference here in the docstring: https://github.com/openai/tiktoken

whitead (Collaborator, Author) replied:

done
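
As an aside for readers new to tiktoken, a minimal token-window chunking sketch. Note this is illustrative only: it counts chunk size in tokens, whereas paperqa's chunk_chars/overlap parameters may have different semantics:

```python
import tiktoken


def chunk_by_tokens(text: str, chunk_tokens: int = 300, overlap: int = 50) -> list[str]:
    # Encode once, then slide a fixed-size token window with overlap.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    return [
        enc.decode(tokens[i : i + chunk_tokens])
        for i in range(0, len(tokens), step)
    ]
```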

paperqa/types.py Outdated
Args:
query: Query vector.
k: Number of results to return.
lambda_: Weighting of relevance and diversity.

Could you add to the docstring that the lambda can be tuned for different applications of paperqa?

whitead (Collaborator, Author) replied:

done - moved to be part of vectorstore class
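
For readers unfamiliar with MMR, a minimal numpy sketch of max marginal relevance with the lambda_ trade-off described in the docstring above (illustrative only; the PR's vector store implementation may differ):

```python
import numpy as np


def mmr_indices(query: np.ndarray, vectors: np.ndarray, k: int, lambda_: float = 0.9) -> list[int]:
    """Max marginal relevance: lambda_=1.0 is pure relevance; lower values favor diversity."""
    # Normalize once so dot products are cosine similarities.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    selected: list[int] = []
    while len(selected) < min(k, len(v)):
        if selected:
            # Highest similarity to anything already picked = redundancy penalty.
            redundancy = (v @ v[selected].T).max(axis=1)
            scores = lambda_ * sims - (1 - lambda_) * redundancy
            scores[selected] = -np.inf
        else:
            scores = sims
        selected.append(int(np.argmax(scores)))
    return selected
```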

"""Returns the set of variables implied by the format string"""
format_dict = _FormatDict()
s.format_map(format_dict)
return format_dict.key_set


class PromptCollection(BaseModel):

+1 love the change here to be str
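
For context, a plausible reconstruction of the _FormatDict helper the snippet above relies on; the class body and the wrapper name get_formatted_variables are assumptions, since this hunk only shows the function tail. str.format_map calls __missing__ for each absent key, so collecting those keys yields the variables a prompt template expects:

```python
class _FormatDict(dict):
    """Record every key a format string asks for instead of raising KeyError."""

    def __init__(self) -> None:
        super().__init__()
        self.key_set: set[str] = set()

    def __missing__(self, key: str) -> str:
        self.key_set.add(key)
        return ""


def get_formatted_variables(s: str) -> set[str]:
    """Returns the set of variables implied by the format string"""
    format_dict = _FormatDict()
    s.format_map(format_dict)
    return format_dict.key_set


# e.g. checking which variables a custom prompt expects:
assert get_formatted_variables("Answer ({answer_length}): {question}") == {"answer_length", "question"}
```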

@whitead whitead requested a review from mrubash1 January 11, 2024 19:41
@mrubash1 mrubash1 left a comment

Great work overall, and cool to see the implementation of the langchain vectorstore.

Tiny comments around a typo and one or two for clarity; otherwise good to go.

client = None
else:
client = AsyncOpenAI()
# backwards compatibility


thanks for this

paperqa/docs.py Outdated (thread resolved)
super().__init__(**data)
self._client = client
self._embedding_client = embedding_client
# run this here (instead of automatically) so it has access to privates
# If I ever figure out a better way of validating privates


don't know either sorry

def clear_docs(self):
self.texts = []
self.docs = {}
self.docnames = set()

def __getstate__(self):
# You may wonder why make these private if we're just going


I still don't fully track your intention here - just fyi (and no need to change anything)
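
For anyone tracking the migration notes ("pickling set client" in the checklist above), a hedged sketch of the pattern this __getstate__ appears to implement; set_client is an assumed name here, not confirmed by the diff:

```python
def __getstate__(self):
    # API clients hold sockets and credentials, so they can't be pickled.
    # Drop them from the state; after unpickling, callers restore them
    # (e.g. via a hypothetical set_client()) before using the Docs object.
    state = self.__dict__.copy()
    state["_client"] = None
    state["_embedding_client"] = None
    return state
```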

list[Text],
(
await self.texts_index.max_marginal_relevance_search(
self._embedding_client, answer.question, k=_k, fetch_k=5 * _k


Should we make this fetch_k multiplier configurable in the function, with a default of 5?
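
A sketch of that suggestion; the method name and the fetch_k_multiplier parameter are assumed for illustration, not taken from the diff:

```python
async def aget_evidence(self, answer, k: int = 10, fetch_k_multiplier: int = 5):
    # Default of 5 preserves the current fetch_k = 5 * k behavior while
    # letting callers widen or narrow the candidate pool passed to MMR.
    matches = await self.texts_index.max_marginal_relevance_search(
        self._embedding_client, answer.question, k=k, fetch_k=fetch_k_multiplier * k
    )
```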

@whitead whitead merged commit 3cb16f2 into main Jan 22, 2024
1 check passed
@whitead whitead deleted the v42 branch January 22, 2024 22:36