📙 Awesome Data Catalogs and Observability Platforms.

yeatsq/awesome-data-catalogs

Awesome Data Discovery and Observability

This repository contains a curated list of awesome data catalogs and observability platforms that help you discover, manage, and observe data in your organization.


Contents: Existing Data Discovery and Observability Solutions

| OSS Data Catalogs | Proprietary Monocloud DCs | Proprietary Observability Tools | Other Proprietary DCs |
| --- | --- | --- | --- |
| 📙 Amundsen | 📒 Google DC | 🔍 Monte Carlo | 📕 Alation |
| 📙 DataHub | 📒 Azure DC | 🔍 Databand | 📕 Atlan |
| 📙 Marquez | | 🔍 Datafold | 📕 Collibra |
| 📙 Atlas | | 🔍 Ataccama | 📕 DataGalaxy |
| 📙 CKAN | | | 📕 Informatica |
| 📙 Magda | | | 📕 Stemma |

High-Level Feature Comparison

| Tool | Specification-based | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Alation | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| Amundsen | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Ataccama | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| Atlan | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| Atlas | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Azure DC | ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
| CKAN | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
| Collibra | ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
| DataGalaxy | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ |
| Databand | ❌ | ? | ? | ? | ❌ | ? | ? | ? | ✔️ |
| Datafold | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| DataHub | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Google DC | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
| Informatica | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| Magda | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
| Marquez | OpenLineage | ✔️ | ❌ | ✔️ | ? | ❌ | ❌ | ❌ | ❌ |
| Monte Carlo | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| Stemma | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
| Talend | ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |

Definitions:

  • Specification-based - uses an open standard for collecting metadata, which shortens time-to-discovery and makes it possible to federate data catalogs.
  • Search-based - lets users search for data assets.
  • Lineage-based - provides lineage for all entities the solution operates on.
  • Network-based - provides rich context about data asset ownership.
  • Federation - the ability to map multiple data catalogs into a single UI to avoid repeated data collection.
  • End-to-end lineage - data lineage that includes all data assets used in the organization across all its data catalogs and ML tools.
  • ML 1st citizen - treats ML entities as first-class objects, so you can use them like any other data asset.
  • Data Quality - includes mature data quality assurance tools.
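
As a rough illustration of how the comparison above can be used for shortlisting, the sketch below encodes a few rows of the table as data and filters tools by required features. The names `FEATURES`, `MATRIX`, and `tools_with` are hypothetical helpers introduced only for this example; the values are copied from the table, with `None` standing for "?" and Marquez's "OpenLineage" entry treated as a yes for the specification column.

```python
# Column order follows the High-Level Feature Comparison table.
FEATURES = (
    "specification", "search", "network", "lineage", "federation",
    "ml_first_citizen", "data_quality", "end_to_end_lineage", "observability",
)

# A subset of rows from the table: True = yes, False = no, None = unknown ("?").
MATRIX = {
    "Amundsen":    (False, True, True, True, False, False, False, False, False),
    "CKAN":        (False, True, False, False, True, False, False, False, False),
    "Databand":    (False, None, None, None, False, None, None, None, True),
    "DataGalaxy":  (False, True, True, True, False, False, False, True, True),
    "Datafold":    (False, True, True, True, False, False, True, False, True),
    "Marquez":     (True, True, False, True, None, False, False, False, False),
    "Monte Carlo": (False, True, False, True, False, False, True, False, True),
}


def tools_with(*required, include_unknown=False):
    """Return the sorted names of tools that have every required feature.

    With include_unknown=True, a "?" entry is counted as a possible match.
    """
    matches = []
    for tool, row in MATRIX.items():
        flags = dict(zip(FEATURES, row))
        if all(
            flags[name] is True or (include_unknown and flags[name] is None)
            for name in required
        ):
            matches.append(tool)
    return sorted(matches)


if __name__ == "__main__":
    print(tools_with("data_quality", "observability"))
    print(tools_with("federation", include_unknown=True))
```

Keeping "?" as a separate state rather than folding it into "no" means a shortlist can be widened to tools whose support is simply not documented here.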

📙 Open-Source Data Catalogs

Amundsen

Website | GitHub Maintenance

Amundsen is a popular open-source data catalog for metadata management and data discovery that originated at Lyft.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: No
  • AI autowiring: No
  • Rich data profiling: No
  • Recommendations: Yes
  • Schemas, Description: Yes
  • Complex schemas: No
  • Data preview: Yes
  • Column statistics: Yes
  • Data owner: Yes
  • Top data users: Yes
  • Change notifications: No
  • Change feed: No
  • Deployment:
  • Supported data sources: Hive, Redshift, Druid, RDBMS, Presto, Snowflake

DataHub

Website | GitHub Maintenance

DataHub is an open-source data catalog featuring data discovery, data governance, and metadata management that originated at LinkedIn.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
More features
  • Strategy: Push, Pull
  • UX personalization: No
  • AI autowiring: No
  • Rich data profiling: No
  • Recommendations: ?
  • Schemas, Description: Yes
  • Complex schemas: No
  • Data preview: ?
  • Column statistics: No
  • Data owner: Yes
  • Top data users: ?
  • Change notifications: No
  • Change feed: No
  • Deployment:
  • Supported data sources: Hive, Kafka, RDBMS

Marquez

Website | GitHub Maintenance

Marquez is an open-source data catalog for the collection, aggregation, and visualization of a data ecosystem’s metadata that originated at WeWork.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenLineage | ✔️ | ❌ | ✔️ | ? | ❌ | ❌ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: No
  • AI autowiring: No
  • Rich data profiling: No
  • Recommendations: No
  • Schemas, Description: Yes
  • Complex schemas: No
  • Data preview: Yes
  • Column statistics: No
  • Data owner: Yes
  • Top data users: ?
  • Change notifications: No
  • Change feed: No
  • Deployment:
  • Supported data sources: S3, Kafka

Atlas

Website | GitHub Maintenance

Apache Atlas is an open-source data catalog for metadata collection, governance, and data democratization.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: No
  • AI autowiring: No
  • Rich data profiling: No
  • Recommendations: No
  • Schemas, Description: Yes
  • Complex schemas: No
  • Data preview: No
  • Column statistics: No
  • Data owner: No
  • Top data users: ?
  • Change notifications: Yes
  • Change feed: No
  • Deployment:
  • Supported data sources: HBase, Hive, Sqoop, Kafka, Storm

CKAN

Website | GitHub Maintenance

CKAN is an open-source data catalog for data management, powering data portals for governments and enterprises.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: No
  • AI autowiring: No
  • Rich data profiling: No
  • Recommendations: ?
  • Schemas, Description: ?
  • Complex schemas: ?
  • Data preview: ?
  • Column statistics: ?
  • Data owner: ?
  • Top data users: ?
  • Change notifications: ?
  • Change feed: ?
  • Deployment:
  • Supported data sources:

Magda

Website | GitHub Maintenance

Magda is an open-source data catalog that features data discovery, metadata enrichment, and federation, focused on geodata.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
More features
  • Strategy: Push via UI
  • UX personalization: No
  • AI autowiring: No
  • Rich data profiling: No
  • Recommendations: No
  • Schemas, Description: Yes
  • Complex schemas: No
  • Data preview: Yes
  • Column statistics: No
  • Data owner: Yes
  • Top data users: ?
  • Change notifications: No
  • Change feed: No
  • Deployment:
  • Supported data sources: Mostly geodata

📕 Proprietary Data Catalogs

Collibra

Website | GitHub

Collibra is an enterprise data catalog that helps users discover and understand the data that matters and derive impactful insights from it.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: Yes
  • AI autowiring: ?
  • Network-based: No
  • Rich data profiling: ?
  • Supported data sources:

Informatica

Website | GitHub

Informatica is an enterprise data catalog that provides an AI-powered data discovery engine to scan and catalog data assets.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: Yes
  • Rich data profiling: Yes
  • Supported data sources:

Alation

Website | GitHub

Alation is a collaborative data catalog that helps companies to drive value and business impact from their data.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: Yes
  • AI autowiring: No
  • Network-based: No
  • Rich data profiling: No
  • Supported data sources:

Atlan

Website | GitHub

Atlan is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
  • Strategy: Pull
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: No
  • Rich data profiling: ?
  • Supported data sources: Presto, Deequ, Atlas, Airflow, Hudi

DataGalaxy

Website | GitHub

DataGalaxy is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ |
More features
  • Strategy: Pull & Push
  • UX personalization: Yes
  • AI autowiring: Yes
  • Network-based: Yes
  • Rich data profiling: Yes
  • Supported data sources:

Stemma

Website

Stemma is a fully managed data catalog powered by the open-source data catalog Amundsen that helps data teams have total trust in their data.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: No
  • AI autowiring: No
  • Network-based: No
  • Rich data profiling: No
  • Supported data sources:

Talend

Website | GitHub

Talend is a data catalog that helps enterprises power critical business decisions with trusted data.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
  • Strategy: Push
  • UX personalization: Yes
  • AI autowiring: ?
  • Network-based: ?
  • Rich data profiling: Yes
  • Supported data sources:

📒 Monocloud Data Catalogs

Google Cloud Data Catalog

Website | GitHub

Google Cloud Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
  • Strategy: Pull
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: No
  • Rich data profiling: No
  • Supported data sources:

Azure Data Catalog

Website

Azure Data Catalog is a fully managed, enterprise-wide metadata catalog that makes data asset discovery straightforward.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
  • Strategy: Pull
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: ?
  • Rich data profiling: ?
  • Supported data sources:

🔍 Data Observability Platforms

Monte Carlo

Website

Monte Carlo is a data observability tool that helps increase trust in data by eliminating or preventing data downtime.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
More features
  • Strategy: Pull
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: ?
  • Rich data profiling: ?
  • Supported data sources: Snowflake, Hive, Kafka, Looker, Redshift, Tableau, Big Query, Airflow, Fivetran, Presto, Mode, Periscope, Databricks, Glue, dbt, Chartio, Spark, AWS, S3, data.world, Google Cloud Platform

Databand

Website | GitHub

Databand is an observability platform that helps data engineers identify and troubleshoot pipeline issues and data quality problems.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ? | ? | ? | ❌ | ? | ? | ? | ✔️ |
More features
  • Strategy: Push
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: ?
  • Rich data profiling: ?
  • Supported data sources:

Datafold

Website | GitHub

Datafold is a data monitoring and observability platform that gives you confidence in your data quality through diffs, profiling, and anomaly detection.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
More features
  • Strategy: Push
  • UX personalization: ?
  • AI autowiring: ?
  • Network-based: ?
  • Rich data profiling: ?
  • Supported data sources:

Ataccama

Website | GitHub

Ataccama is an enterprise data catalog and observability tool featuring data profiling and data quality management, designed for data professionals.

| Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
  • Strategy: Pull
  • UX personalization: Yes
  • AI autowiring: No
  • Network-based: No
  • Rich data profiling: Yes
  • Supported data sources:
