Integrating GNNs and LLMs for Enhanced Data Warehouse Understanding and Lineage Analysis #9839

Open
AJamal27891 opened this issue Dec 10, 2024 · 2 comments

@AJamal27891

🚀 The feature, motivation and pitch

@puririshi98
The Feature, Motivation, and Pitch:
Modern data warehouses face several critical challenges, including understanding data lineage, identifying data silos, and interpreting complex transformations in ETL processes. Existing systems, including those leveraging LLMs, fall short of addressing these challenges due to a lack of grounding in the structured relationships inherent in data warehouses.

I propose a feature to enable seamless integration of Graph Neural Networks (GNNs) and Large Language Models (LLMs) to model data warehouses as graphs, allowing for improved reasoning and understanding of:

  • Data Lineage: By representing transformations, dependencies, and data flow as graph structures, a hybrid GNN+LLM system can analyze and explain lineage paths.
  • Data Silos: Detecting disconnected components in the data warehouse graph to suggest potential integrations.
  • ETL Transformations: Providing insights into how raw data evolves through complex transformations into actionable insights.
  • Schema and Query Understanding: Modeling schemas as graphs can improve LLM capabilities to generate and interpret queries based on relationships between tables.
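The lineage and silo bullets above can be made concrete without any GNN at all. The following is a minimal sketch, assuming the warehouse is described by a list of (source, target) edges meaning "target is derived from source". The table names and helper functions are hypothetical and exist only for illustration; a real system would read these edges from a metadata catalog.

```python
from collections import defaultdict, deque

# Hypothetical warehouse: (source, target) means "target is derived from source".
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "mart.revenue"),
    ("staging.customers", "mart.revenue"),
    ("raw.web_logs", "staging.sessions"),  # never joined with the sales tables
]


def build_adjacency(edges):
    """Return the node set plus downstream and upstream adjacency maps."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
        nodes.update((src, dst))
    return nodes, downstream, upstream


def upstream_lineage(table, upstream):
    """Return every table that `table` transitively depends on (BFS)."""
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def find_silos(nodes, downstream, upstream):
    """Group tables into weakly connected components, largest first."""
    unvisited = set(nodes)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            # Ignore edge direction: a silo is disconnected either way.
            neighbours = downstream.get(current, set()) | upstream.get(current, set())
            for nxt in neighbours:
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


nodes, downstream, upstream = build_adjacency(EDGES)
lineage = upstream_lineage("mart.revenue", upstream)
silos = find_silos(nodes, downstream, upstream)
```

Here `lineage` holds the four tables that feed `mart.revenue`, and `silos` contains two components, the smaller of which is the isolated web-log pipeline. A GNN would operate on the same edge structure, with learned node features in place of plain names.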

The integration would involve using PyG (PyTorch Geometric) for GNN modeling and extending existing libraries for training hybrid GNN+LLM architectures. This will allow data warehouse systems to gain both structural awareness (from GNNs) and semantic reasoning (from LLMs), reducing hallucinations and improving the interpretability of predictions and queries.
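As a rough illustration of how structural context might reach the LLM, the sketch below serializes lineage edges as plain text and places them in the prompt. This is a deliberate simplification and not the proposed architecture: the proposal envisions GNN-derived structure, whereas this stand-in only shows the grounding idea. The function name, prompt wording, and table names are all hypothetical.

```python
def lineage_to_prompt(question, edges):
    """Build a prompt that lists known lineage edges before the question."""
    # Sort so the prompt is deterministic regardless of input order.
    lines = [f"- {src} -> {dst}" for src, dst in sorted(edges)]
    context = "\n".join(lines)
    return (
        "Known lineage edges (source -> derived table):\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the edges listed above."
    )


# Hypothetical edges; a real system would pull these from warehouse metadata.
EDGES = [
    ("staging.orders", "mart.revenue"),
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
]

prompt = lineage_to_prompt("Which raw tables feed mart.revenue?", EDGES)
```

Restricting the model to the listed edges is one plausible way to reduce hallucinated table relationships, though how well it works in practice would need to be measured.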

Alternatives

  • Existing LLM solutions provide semantic reasoning but often hallucinate without structured context.
  • Pure GNN solutions focus on structural reasoning but lack the language capabilities needed for intuitive query interaction.
  • Ensemble systems attempt to combine these capabilities but lack a unified framework for data warehouse tasks.

Additional context

  • Data lineage tools (e.g., Neo4j integrations or metadata graphing systems) could serve as a starting point for the graph representation of data warehouses.
  • Recent work on combining GNNs with LLMs for question answering and recommendation systems could provide foundational knowledge for this hybrid architecture.

This feature would enable the PyTorch Geometric community to explore real-world applications in data science, bridging the gap between NLP and data engineering.

@puririshi98
Contributor

This is great! Looking forward to reviewing your PR. Please tag this issue in your PR so it is tracked.

@puririshi98
Contributor

puririshi98 commented Dec 10, 2024

I recommended that he create examples/llm/relbench.py and integrate the reusable code pieces into torch_geometric (utils, etc.).

Use RelBench for the data.

He will need to extend the current GNN+LLM infrastructure to handle HeteroData objects formed from RelBench data.
