Flyte SDK: Build Production-Ready Data and ML Pipelines in Python
The Flyte SDK is a Python library for writing clear, scalable, and reproducible data and machine learning pipelines. Define your logic as pure Python functions, compose them into workflows, and let Flyte handle the execution, whether on your laptop or a distributed cluster.
Why a Python SDK for Pipelines?
Writing complex data and ML pipelines often means juggling more than just the core logic. You end up manually managing task dependencies, retries, resource allocation, and data passing. Scaling from your laptop to a production cluster can require a significant rewrite, tangling your business logic with infrastructure code. This makes pipelines hard to test, reproduce, and maintain.
The Flyte SDK solves this by letting you define your pipeline's structure and requirements directly in Python, separating your logic from the underlying execution engine.
- From messy scripts to structured workflows: Turn a collection of scripts into a clear, directed graph of tasks.
- Focus on logic, not infrastructure: Define resource needs like memory or GPUs in your code, and let Flyte handle the provisioning.
- Develop locally, scale globally: Write and test your pipelines on your machine with confidence that they will run identically on a production Flyte cluster.
How to Think About Flyte
Think of your pipeline as a recipe. The Flyte SDK helps you write that recipe in a structured way.
- A Python function is a Task: Each step in your recipe is a Task. In the SDK, this is just a Python function. It takes inputs and produces outputs.

    # A simple task that doubles an integer
    async def double(x: int) -> int:
        return x * 2

- A workflow is a composition of Tasks: The full recipe, which defines the order and dependencies of your steps, is a Workflow. This is just another Python function that calls your tasks.

    # A workflow that calls the 'double' task
    async def main_workflow(n: int) -> int:
        result = await double(n)
        return result

- Data flows automatically: You simply return values from one task and pass them as arguments to the next. Flyte handles serializing and passing the data behind the scenes, even for complex types like Pandas DataFrames or files.
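For a concrete, minimal sketch of that data flow, the snippet below passes a user-defined dataclass between two tasks. The Order type and both task names are illustrative, not part of the SDK, and it assumes dataclasses are handled by the type engine like the other complex types mentioned above.

    from dataclasses import dataclass

    @dataclass
    class Order:
        sku: str
        quantity: int

    # Illustrative tasks: the Order returned by create_order is serialized
    # by Flyte and rematerialized inside total_items.
    async def create_order(sku: str, quantity: int) -> Order:
        return Order(sku=sku, quantity=quantity)

    async def total_items(order: Order) -> int:
        return order.quantity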
How It Works
The Flyte SDK provides a bridge between your Python code and a Flyte execution backend. The development lifecycle looks like this:
- Define: Write your tasks and workflows as Python functions.
- Compose: Call tasks from other tasks to build your pipeline's dependency graph.
- Run Locally: Use the flyte command-line tool to execute your workflow on your local machine for rapid testing and iteration.
- Package: The SDK packages your code and its dependencies into a reproducible format, typically a container image.
- Execute Remotely: A Flyte backend (like Union Cloud) runs your packaged workflow, managing the orchestration, data passing, retries, and resource allocation you defined in your code.
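As a rough sketch of that lifecycle in code, using the flyte.run entry point shown in the examples below; the remote half is an assumption about backend configuration, not a prescribed setup:

    import flyte

    async def double(x: int) -> int:
        return x * 2

    if __name__ == "__main__":
        # Local iteration: execute the workflow right here, as in the
        # fan-out example below.
        print(flyte.run(double, x=5))

        # Remote execution (assumed sketch): point the SDK at a Flyte
        # backend first, then submit the same function unchanged.
        # flyte.init_from_config("config.yaml")  # hypothetical config step
        # flyte.run(double, x=5)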
What Can I Build?
Here are a few common patterns you can build with the Flyte SDK.
Basic Fan-Out Workflow
Run multiple tasks in parallel and gather the results.
import asyncio
import flyte

async def square(x: int) -> int:
    return x * x

async def main(n: int = 5) -> list[int]:
    """Run square tasks in parallel for numbers from 0 to n-1."""
    tasks = [square(i) for i in range(n)]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    # To run locally: flyte run --local my_workflow.py main --n 10
    print(flyte.run(main, n=10))
Handle Data Files
Pass files and directories between tasks seamlessly. Flyte manages the data transfer between local and remote storage.
import flyte
from flyte.io import File

async def write_file(content: str) -> File:
    f = File.new_remote()
    async with f.open("w") as fh:
        await fh.write(content)
    return f

async def read_file(f: File) -> str:
    async with f.open("r") as fh:
        return await fh.read()

async def main(content: str = "hello world") -> str:
    file_obj = await write_file(content)
    read_content = await read_file(file_obj)
    print(read_content)
    return read_content
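As with the fan-out example, this can be driven locally with flyte.run(main) or with the CLI invocation shown in Getting Started below.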
Run Any Container
Orchestrate tasks that aren't written in Python using a ContainerTask.
import flyte
from flyte.extras import ContainerTask

greeting_task = ContainerTask(
    name="say_hello_in_bash",
    image="ubuntu:latest",
    command=["/bin/bash", "-c", "echo Hello, {{.inputs.name}}"],
    inputs={"name": str},
    outputs={"output": str},
)

async def main(name: str = "flyte") -> str:
    return await greeting_task(name=name)
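Before the container starts, the {{.inputs.name}} placeholder in the command is templated with the value of the name input, so the same task definition works for any input value.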
When to Use It
- Use Flyte SDK when:
    - You have a multi-step data, ML, or backend process.
    - You need to manage dependencies, parallelism, and data flow between tasks.
    - You require robust error handling, automatic retries, and resource management (CPU, memory, GPU).
    - You want to ensure your pipelines are reproducible and easy to test.
    - You need to scale your local code to a production-grade distributed environment.
- Consider other tools when:
    - Your application is a single, simple script.
    - You need a real-time, low-latency API server (Flyte is for asynchronous/batch workflows).
    - Your entire process is simple, runs quickly, and fits comfortably on a single machine.
Integrations
The Flyte SDK is designed to work with the tools you already use.
- Python: Requires Python 3.10 or newer.
- Operating Systems: Runs on Linux, macOS, and Windows.
- Data & ML Stack: Natively handles types from libraries like Pandas, Polars, PyArrow, and Scikit-learn.
- Distributed Computing: Includes plugins for running tasks on Dask, Ray, and Spark clusters.
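As a minimal sketch of what that native handling looks like (both task names are illustrative), a task can accept and return a pandas DataFrame directly, with no manual file I/O:

    import pandas as pd

    # Illustrative tasks: Flyte serializes the DataFrame between steps.
    async def make_table(rows: int) -> pd.DataFrame:
        return pd.DataFrame({"x": range(rows)})

    async def mean_x(df: pd.DataFrame) -> float:
        return float(df["x"].mean())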
Getting Started
- Install the SDK:

    pip install flyte

- Write your first workflow (e.g., in a file named workflow.py):

    import flyte

    async def double(x: int) -> int:
        return x * 2

    async def main(n: int) -> int:
        result = await double(n)
        final = await double(result)
        print(f"The final result is: {final}")
        return final

- Run it locally from your terminal:

    flyte run --local workflow.py main --n 5
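Once a Flyte backend is configured, the same file can run there without any code changes; only the invocation differs.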
Limitations & Assumptions
- Backend Required for Scale: The SDK provides a great local development experience, but for production-scale, distributed execution, it's designed to connect to a Flyte backend.
- Async-First: While synchronous (def) functions are supported, many examples and advanced concurrency patterns leverage Python's async/await syntax.
- Data Is Passed by Value: Flyte tasks are isolated. Data is passed between them by value, not by reference. This ensures reproducibility, but it means you should be mindful when passing very large objects between tasks.
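To make the pass-by-value point concrete, here is an illustrative sketch (not an SDK API): mutating an input inside a task cannot affect the caller's object, so rely on returned values.

    async def append_item(items: list[int]) -> list[int]:
        # On a Flyte backend this task receives its own deserialized copy
        # of `items`; mutating it in place cannot leak back to the caller.
        items.append(99)
        return items

    async def main() -> list[int]:
        data = [1, 2, 3]
        updated = await append_item(data)
        # Use the returned value; do not depend on in-place mutation of `data`.
        return updated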
Frequently Asked Questions
1. Is this the same as Flyte?
This is the Python SDK for the Flyte orchestration platform. You use the SDK to define your workflows, and the Flyte backend executes them.
2. Do I have to use async/await?
No, you can define tasks with standard def. However, async def is recommended for workflows that involve running multiple tasks concurrently, as it provides a natural way to express parallelism.
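As a quick illustration (function names are hypothetical), both styles below are valid task bodies; the async form is what lets a workflow fan out concurrent calls with asyncio.gather, as in the fan-out example above:

    import asyncio

    # A plain synchronous task: fine for simple, sequential steps.
    def add_one(x: int) -> int:
        return x + 1

    # Async tasks let a workflow launch several calls concurrently.
    async def double(x: int) -> int:
        return x * 2

    async def main(n: int) -> list[int]:
        return await asyncio.gather(*(double(i) for i in range(n)))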
3. How do I pass data between tasks?
Just return values from one task function and pass them as arguments to the next. The SDK handles the serialization and data passing for you, even for complex types like files or dataframes.
4. How does the SDK handle Python dependencies?
You define your dependencies in a pyproject.toml or requirements.txt file. Flyte packages your code and dependencies into a container image, ensuring a consistent and reproducible runtime environment.
5. Can I run tasks that aren't written in Python?
Yes. The ContainerTask lets you run any command inside a specified Docker container, allowing you to orchestrate tasks written in any language as part of your Python-defined workflow.