# Deploy with Dagster

## Introduction to Dagster
Dagster is an orchestrator designed for developing and maintaining data assets, such as tables, datasets, machine learning models, and reports. Dagster ensures these processes are reliable and focuses on using software-defined assets (SDAs) to simplify complex data management, enhance the ability to reuse code, and provide a better understanding of data.
To read more, please refer to Dagster’s documentation, or see this community member's article for alternative approaches.
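To make the asset-centric model concrete, here is a minimal, illustrative software-defined asset (the asset name and body are hypothetical, not part of the example that follows):

```py
from dagster import asset

@asset
def top_issues():
    # A software-defined asset is a function whose output Dagster tracks
    # as a data artifact; this toy example just returns an in-memory table.
    return [{"id": 1, "title": "Hello, Dagster"}]
```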
## Dagster Cloud features
Dagster Cloud offers an enterprise-level orchestration service with serverless or hybrid deployment options. It incorporates native branching and built-in CI/CD to prioritize the developer experience. It enables scalable, cost-effective operations without the hassle of infrastructure management.
### Dagster deployment options: Serverless versus Hybrid
With the serverless option, Dagster fully hosts the orchestration engine. With the hybrid model, you run workloads on your own computing resources while Dagster manages the control plane, which reduces operational overhead without sacrificing security.
For more info, please refer to the Dagster Cloud docs.
## Using Dagster for free
Dagster offers a 30-day free trial during which you can explore its features, such as pipeline orchestration, data quality checks, and embedded ELT. You can try Dagster using its open-source version or by signing up for the trial.
## Building data pipelines with dlt

### How does dlt integrate with Dagster for pipeline orchestration?
dlt integrates with Dagster to provide a streamlined process for building, enhancing, and managing data pipelines. Developers can leverage dlt's capabilities for data extraction and loading alongside Dagster's orchestration features to efficiently manage and monitor data pipelines.

Dagster supports a native integration with dlt; here is a guide on how this integration works.
### Orchestrating a dlt pipeline on Dagster
Here's a concise guide to orchestrating a dlt pipeline with Dagster. We will create a pipeline that ingests GitHub issues data from a repository and loads it into DuckDB.
You can find the full example code in this repository.
The steps are as follows:
- Install Dagster and the dlt integration package using pip:

  ```sh
  pip install dagster dagster-webserver dagster-dg-cli
  pip install dagster-dlt
  ```
- Set up a Dagster project:

  ```sh
  mkdir dagster_github_issues
  cd dagster_github_issues
  create-dagster project github-issues
  ```
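  For orientation, these are the files the following steps will create inside the project (a sketch based only on the paths named in this guide; the exact scaffold produced by `create-dagster` may differ):

  ```text
  github_source/
  ├── github_pipeline.py   # dlt resource and source (step 3)
  ├── assets.py            # @dlt_assets definition (step 4)
  └── definitions.py       # Definitions object (step 5)
  ```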
- In your Dagster project, define the dlt pipeline in the `github_source` folder.

  Note: The dlt Dagster helper works only with dlt sources, so your resources should always be grouped in a source.

  ```py
  import dlt

  # BASE_URL and the pagination() helper are defined in the elided part of
  # the module; see the full example repository.
  ...

  @dlt.resource(
      table_name="issues",
      write_disposition="merge",
      primary_key="id",
  )
  def get_issues(
      updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z")
  ):
      # Fetch open issues updated since the last incremental cursor value.
      url = (
          f"{BASE_URL}?since={updated_at.last_value}&per_page=100&sort=updated"
          "&direction=desc&state=open"
      )
      yield pagination(url)

  @dlt.source
  def github_source():
      return get_issues()
  ```
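  The resource above assumes a `BASE_URL` constant and a `pagination` helper in the elided part of the module. A possible shape for them, purely illustrative (the repository linked above contains the actual implementation):

  ```py
  import requests

  # Illustrative placeholders for the names assumed by get_issues(); the real
  # definitions live in the elided part of the module.
  REPO = "dagster-io/dagster"  # hypothetical repository
  BASE_URL = f"https://api.github.com/repos/{REPO}/issues"

  def pagination(url):
      # Walk GitHub's paginated API via the Link header, yielding one page
      # (a list of issue dicts) at a time.
      while url:
          response = requests.get(url)
          response.raise_for_status()
          yield response.json()
          url = response.links.get("next", {}).get("url")
  ```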
- Create a `dlt_assets` definition.

  The `@dlt_assets` decorator takes a `dlt_source` and a `dlt_pipeline` parameter. In this example, we use the `github_source` source and create a `dlt_pipeline` to ingest data from GitHub to DuckDB.

  Here's an example of how to define assets (`github_source/assets.py`):

  ```py
  import dlt
  from dagster import AssetExecutionContext
  from dagster_dlt import DagsterDltResource, dlt_assets

  from .github_pipeline import github_source

  @dlt_assets(
      dlt_source=github_source(),
      dlt_pipeline=dlt.pipeline(
          pipeline_name="github_issues",
          dataset_name="github",
          destination="duckdb",
          progress="log",
      ),
      name="github",
      group_name="github",
  )
  def dagster_github_assets(context: AssetExecutionContext, dlt: DagsterDltResource):
      yield from dlt.run(context=context)
  ```

  For more information, please refer to Dagster’s documentation.
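  Before materializing through Dagster, you can sanity-check the source by running the pipeline directly with dlt. This is a minimal sketch, assuming you run it next to `github_pipeline.py`:

  ```py
  import dlt

  from github_pipeline import github_source

  # Run the same pipeline standalone to verify the source yields data.
  pipeline = dlt.pipeline(
      pipeline_name="github_issues",
      dataset_name="github",
      destination="duckdb",
  )
  load_info = pipeline.run(github_source())
  print(load_info)
  ```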
- Create the `Definitions` object.

  The last step is to include the assets and resource in a `Definitions` object (`github_source/definitions.py`). This enables Dagster tools to load everything we have defined:

  ```py
  from dagster import Definitions, load_assets_from_modules
  from dagster_dlt import DagsterDltResource

  from . import assets

  dlt_resource = DagsterDltResource()
  all_assets = load_assets_from_modules([assets])

  defs = Definitions(
      assets=all_assets,
      resources={
          "dlt": dlt_resource,
      },
  )
  ```
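  As a quick sanity check (a hypothetical snippet, not part of the example project), you can import the `Definitions` object directly; an import error here will also break the web server in the next step:

  ```py
  # Run from the directory containing the github_source package.
  from github_source.definitions import defs

  print(defs)  # confirms the Definitions object loads without errors
  ```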
- Run the web server locally:

  - Run the project:

    ```sh
    dg dev
    ```

  - Navigate to localhost:3000 in your web browser to access the Dagster UI.
- Run the pipeline.

  Now that you have a running instance of Dagster, you can run your data pipeline.

  To run the pipeline, go to Assets and click the Materialize button in the top right. In Dagster, materialization refers to executing the code associated with an asset to produce an output.

  The run logs will appear in your command line.
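If you prefer to trigger the same materialization from code, for example in a test, Dagster's `materialize` helper can run the assets defined above. This is a minimal sketch using the names from this guide:

```py
from dagster import materialize
from dagster_dlt import DagsterDltResource

from github_source.assets import dagster_github_assets

# Materialize the dlt assets outside the UI, providing the same resource
# the Definitions object wires up.
result = materialize(
    [dagster_github_assets],
    resources={"dlt": DagsterDltResource()},
)
assert result.success
```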
Want to see real-world examples of dlt in production? Check out how dlt is used internally at Dagster in the Dagster Open Platform project.
For a complete picture of Dagster's integration with dlt, please refer to their documentation. This documentation offers a detailed overview and steps for ingesting GitHub data and storing it in Snowflake. You can use a similar approach to build your pipelines.