Introduction to Metadata Ingestion
Please see our Integrations page to browse our ingestion sources and filter on their features.
Integration Methods
DataHub offers three methods for metadata ingestion:
- UI ingestion
- CLI ingestion
- SDK-based ingestion
UI Ingestion
DataHub supports configuring and monitoring ingestion via the UI. For a detailed guide on UI ingestion, please refer to the UI Ingestion page.
CLI Ingestion
DataHub supports configuring ingestion via CLI. For more information, refer to the CLI Ingestion guide.
SDK-based Ingestion
In some cases, you might want to construct metadata events directly and emit them to DataHub programmatically. In that case, take a look at the Python emitter and Java emitter libraries, which can be called from your own code.
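As a rough illustration of emitting metadata from your own code, the sketch below uses the Python REST emitter to attach a description to a dataset. The server address, platform, dataset name and description are placeholder assumptions, and the DataHub import is kept inside the function so the file can be loaded even where the `acryl-datahub` package is not installed.

```python
# Placeholder values (assumptions for illustration, not taken from this guide).
GMS_SERVER = "http://localhost:8080"
PLATFORM = "bigquery"
DATASET_NAME = "my-project.my-dataset.user-table"


def emit_dataset_description(description: str) -> None:
    """Push a single dataset-properties aspect to DataHub over REST."""
    # Imported lazily so this module loads without acryl-datahub installed.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    # Build the URN that identifies the dataset inside DataHub.
    dataset_urn = builder.make_dataset_urn(PLATFORM, DATASET_NAME)

    # Wrap the aspect (here: dataset properties) in a change proposal.
    event = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(description=description),
    )

    # Send the event to the DataHub server.
    emitter = DatahubRestEmitter(gms_server=GMS_SERVER)
    emitter.emit(event)
```

Calling `emit_dataset_description("Table of registered users")` from a job or service sends the event as soon as the call runs, which is what makes this style suitable for pushing changes as they happen.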
If you want to configure and run a pipeline entirely from within your custom Python script, please refer to programmatic_pipeline.py, a basic MySQL to REST programmatic pipeline.
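A minimal sketch of such a programmatic pipeline might look like the following. It is an illustration and not a copy of programmatic_pipeline.py: the connection details, credentials and server URL are placeholder assumptions, and `build_recipe` and `run_recipe` are hypothetical helper names.

```python
def build_recipe() -> dict:
    """Return a recipe as a plain dict: one source and one sink."""
    return {
        "source": {
            "type": "mysql",
            "config": {
                # Placeholder connection details (assumptions).
                "host_port": "localhost:3306",
                "database": "db_name",
                "username": "user",
                "password": "pass",
            },
        },
        "sink": {
            "type": "datahub-rest",
            # Placeholder DataHub server address (assumption).
            "config": {"server": "http://localhost:8080"},
        },
    }


def run_recipe(recipe: dict) -> None:
    """Create a pipeline from the recipe dict and run it to completion."""
    # Imported lazily so this module loads without acryl-datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Raise an exception if the source or sink reported failures.
    pipeline.raise_from_status()
```

Calling `run_recipe(build_recipe())` requires a reachable MySQL instance and DataHub server. Keeping the recipe in a separate function makes it easy to inspect or adjust before running.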
Types of Integration
Integrations can be divided into two types based on how metadata reaches DataHub:
- Push-based integration
- Pull-based integration
Push-based Integration
Push-based integrations allow you to emit metadata directly from your data systems when metadata changes. This gives you low-latency metadata integration from the "active" agents in your data ecosystem. Examples of push-based integrations include Airflow, Spark, Great Expectations and Protobuf Schemas.
Pull-based Integration
Pull-based integrations allow you to "crawl" or "ingest" metadata from your data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible. This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack.
Core Concepts
The following are the core concepts related to ingestion:
- Sources: Data systems from which metadata is extracted (e.g. BigQuery, MySQL)
- Sinks: Destinations for metadata (e.g. File, DataHub)
- Recipe: The main configuration for ingestion, in the form of a .yaml file
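To make the relationship between these three concepts concrete, the sketch below describes a recipe as a Python dict with one source and one sink, then renders it in the YAML layout a recipe file would use. The specific config fields and values are placeholder assumptions, and `render_recipe` is a hypothetical helper that only handles nested dicts of simple values; it is not part of DataHub.

```python
import json

# A recipe pairs one source with one sink; the values are placeholders.
RECIPE = {
    "source": {
        "type": "mysql",
        "config": {"host_port": "localhost:3306", "database": "db_name"},
    },
    "sink": {
        "type": "file",
        "config": {"filename": "./metadata.json"},
    },
}


def render_recipe(node: dict, indent: int = 0) -> str:
    """Render a nested dict as YAML-style text with two-space indentation."""
    lines = []
    pad = "  " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            # Nested mapping: emit the key, then its children one level deeper.
            lines.append(f"{pad}{key}:")
            lines.append(render_recipe(value, indent + 1))
        else:
            # Scalar: JSON quoting is also valid YAML double-quoted style.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)


print(render_recipe(RECIPE))
```

The printed text has a top-level `source:` block and a top-level `sink:` block, each with a `type` and a nested `config`, which is the overall shape of a recipe file.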
For more advanced guides, please refer to the following: