I wanted to make a front-end for my postGIS database using chatGPT, to make complex queries of biodiversity accessible from natural language prompts and then display the resulting data in an exploratory map.
I've been working on building a local postgis database to hold biodiversity data from GBIF. I've been learning SQL
in the process and ended up using chatGPT quite a bit to help me write complex SQL queries. I realized that chatGPT is quite good at generating SQL queries, and wondered if I could stand up a shiny
webapp that utilized a LLM (e.g. ChatGPT) to generate SQL queries and display the results on a map or in a table so that the data could be explored further.
The result here is a proof-of-concept, that ended up surprising me with how well it works after about a week of development
Unfortunately because I have this spun up on a private network I don't have a demo that we can actually mess around with. But click the above for a video demo.
This is able to produce results for somewhat complex questions about biodiversity data using natural language, that would take a significant amount of time for people that do not have a local GBIF database of all CA records and do not know SQL. A year ago it would have taken me a significant amount of time to answer the question: which county in CA has the most records of Darlingtonia californica? With this it takes the time to write out the prompt | the few minutes of waiting while the database query runs.
The webapp is able to generate complex SQL queries using relatively little information about the tables (see below). To me, this suggests that there is little overhead to adding additional data to the database. Conceivably, if I wanted to add a wildfire perimeter dataset like FRAP I would add a new table to the postgis database, add the table description to the initial chatGPT prompt, and run from there (and probably adding a few notes to the initial prompt as it tries the first few queries). This presents a number of opportunities for exploring complex relationships between biodiversity and environmental data that I would have had a hard time imagining before learning SQL.
- At the start of the session, we call an inital prompt to chatGPT using the
chatgpt
R package. This prompt is ingpt-initial-prompt.R
- This prompt contains information about a brief pre-amble about generating SQL queries and the purpose of the webapp
- A description of the tables (verbatim output of
\d table
inpsql
) - A few additional notes per table (e.g. county names are capitalized)
- A user then inputs a prompt and submits the prompt to chatGPT-4
- chatGPT returns it's best guess of a SQL query and this passed directly to the postGIS database.
- Because it's a persistent session with chatGPT you can tell it to correct any mistakes in the prompt
- If data are returned, spatial data are mapped with
sf
andleaflet
or rendered as a table if there isn't a geometry column.
I'm not a database engineer so these tables might be pretty funky. But these are the ones I told chatGPT about:
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
gbifid | bigint | not null | ||
taxonkey | integer | |||
year | smallint | |||
geom | geometry(Point,4326) |
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
taxonkey | integer | not null | ||
kingdom | text | |||
phylum | text | |||
class | text | |||
ordo | text | |||
family | text | |||
genus | text | |||
species | text | |||
infraspecificepithet | text | |||
taxonrank | text | |||
scientificname | text |
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
id | integer | not null | nextval("ca_layers_id_seq"::regclass) | |
category | text | |||
name | text | |||
geom | geometry(MultiPolygon,4326) |
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
gbifid | bigint | |||
cwhr_type | text | |||
gap30x30id | integer | |||
countyid | integer |
Column | Type | Collation | Nullable |
---|---|---|---|
conifer_class | text | ||
geom | geometry | ||
id | integer | not null |
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
class | text | |||
lastupdate | text | |||
source_dat | text | |||
label | text | |||
geom | geometry |