Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text-to-SQL #24

Open
jgpruitt opened this issue Jun 10, 2024 · 1 comment
Open

Text-to-SQL #24

jgpruitt opened this issue Jun 10, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@jgpruitt
Copy link
Collaborator

Experiment with text-to-sql from within the database. Can we embed the pg_catalog and use it to power text-to-sql?

@jgpruitt jgpruitt added the enhancement New feature or request label Jun 10, 2024
@Wang-Yong2018
Copy link
Contributor

Hope Embedding pg_catalo can improve text to sql be more nice that traditional zeroshot or one-shot mod.

I based on zero shot idea develop a shiny llm app. It has 5 steps from convert catalog to final get data.

  1. convert pg_catalog into create table syntax
  2. get dbms name
  3. add user input.
  4. ask AI(for example) generate slq
  5. get data via AI based SQL

For example, I want to know "find the top 5 sold music name and total revenue and quantity as well as average unit price", the ai will generate sql for me basedon schema and question, I extract data using the ai sql.

image

Below is the detail steps.

  1. convert the pg_catalog into sql-ddl sentence, we can
"select
                        'CREATE TABLE ' || nspname || '.' || relname || ' (' || chr(10)||
                         array_to_string(
                          	array_agg(attname || 
                          	          ' '     || 
                          	          atttypid::regtype::text||
                          	          CASE attnotnull WHEN true THEN ' NOT NULL' ELSE ' NULL' end ||
                          	          coalesce('  -- COMMENT ' || quote_literal(description), '')|| 
                          	          ','  || chr(10)
                          	          ), '  '  
                          	          )  || chr(10)|| ');' as definition
                      from 
                          pg_attribute
                          join
                              pg_class on
                          	pg_class.oid = pg_attribute.attrelid
                          join
                              pg_namespace on
                          	pg_namespace.oid = pg_class.relnamespace
                          left join
                              pg_description on
                          	pg_description.objoid = pg_class.oid
                          	and pg_description.objsubid = pg_attribute.attnum
                      where
                        	nspname not in ('pg_catalog', 'information_schema')
                        	and relkind in ('r', 'v')
                      group by
                      	nspname, relname;"
  1. to get dbms name, as this is only postgres. we can say it is constranst

  2. system prompt for text to sql


/* 1. Given the following database schema: */
{sql_ddl}

/* 2. Answer user question followed :
{user_question}

/* 3. Follow below SQL code standard
 - 3.1. To begin with, check if the question can be answered by SQL, return explanation about how to change question.
 - 3.1.1 explanation should be comments with SQL single line comments format!!!
 - 3.2. If the question could be answer by SQL, use following format:
 - 3.2.1 Prefer to use common table expression.
 - 3.2.2 Only use the tables and fields defined in the database schema
 - 3.2.3 Generate single table query when possible.
 - 3.2.4 Only left join tables which have same field name and nor more than 3 tables join 
 - 3.2.5 If necessary,explaination code geneation  with SQL single line comments format
 - 3.2.6 the DBMS is {dbms_name}. if DBMS is sqlite, pls only use SQL-92 standard.
 - 3.2.7 the SQL code should be ending with a semicolon.
*/

  1. one-shot or few shot
    it is get data sample for each of the table records to enable large lanage model to understandard the data.
    as LLM has context input limit, so, ignore this step, except llm suport 16K or more input prompt.

=======================================================

  • final system prompt will be the template + sql_ddl + dbms name and user input
  • final ai generate sql will be depends the model and system prompt. There are so many articles discuss about the text to sql. In my viewpoint, gpt3.5 turbo , claude3 ,llama3 and deepseek v2(lowest price) showed good performance. GPT4(o) is too expensive for trail and test. And for somehow, it not work as well as gpt3.5 some time.

Attache my shiny_llm_map git link and demo line for your reference

  1. demo link https://r7prz5-yong-wang.shinyapps.io/shiny_llm_map/
  2. git link https://github.com/Wang-Yong2018/shiny_llm_map

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants