Understanding Flyte Runtime Errors

Flyte task executions can fail for many reasons. To provide clear diagnostics and enable robust error handling, Flyte categorizes these failures using a structured hierarchy of runtime errors. All of these errors derive from a common base class, and each one identifies whether the failure originated in user code, in a system component, or in an unclassified cause.

Core Runtime Error Structure

All Flyte runtime errors inherit from BaseRuntimeError. This foundational class captures essential details about a task execution failure, providing a consistent interface for error introspection.

The BaseRuntimeError constructor accepts:

  • code (str): A unique identifier for the specific error condition.
  • kind (ErrorKind): Categorizes the error as "user", "system", or "unknown". This is a critical attribute for understanding the source of the failure.
  • root_cause_message (str): A detailed message describing the underlying cause of the error. It is passed to the RuntimeError base class as the exception message, so it is available as the first element of error.args.
  • worker (str | None): An optional identifier for the worker process or component that encountered the error.

When a task fails, the system raises one of the specialized runtime error types, each inheriting from BaseRuntimeError and setting the kind attribute appropriately.
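
The hierarchy described above can be sketched in a few lines. This is a minimal stand-in written from the constructor description in this section, not the library's source code. The subclass constructors are assumed to take code and message (matching the calls shown later in this guide), and the default of None for worker is also an assumption.

```python
from typing import Literal, Optional

# Stand-in definitions that mirror the constructor described above.
# They are illustrative only and are not copied from the library.
ErrorKind = Literal["user", "system", "unknown"]


class BaseRuntimeError(RuntimeError):
    def __init__(
        self,
        code: str,
        kind: ErrorKind,
        root_cause_message: str,
        worker: Optional[str] = None,
    ):
        # The root-cause message becomes the standard exception message.
        super().__init__(root_cause_message)
        self.code = code
        self.kind = kind
        self.worker = worker


class RuntimeUserError(BaseRuntimeError):
    def __init__(self, code: str, message: str, worker: Optional[str] = None):
        super().__init__(code, "user", message, worker)


class RuntimeSystemError(BaseRuntimeError):
    def __init__(self, code: str, message: str, worker: Optional[str] = None):
        super().__init__(code, "system", message, worker)


class RuntimeUnknownError(BaseRuntimeError):
    def __init__(self, code: str, message: str, worker: Optional[str] = None):
        super().__init__(code, "unknown", message, worker)


err = RuntimeUserError(code="INVALID_INPUT", message="Input 'x' must be positive.")
print(err.kind, err.code, err.args[0])
```

Because every subclass fixes its own kind, a handler that catches BaseRuntimeError can branch on error.kind without checking the concrete type.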

Categorizing Runtime Errors

Flyte defines three primary categories of runtime errors to help developers quickly identify the source and nature of a task failure.

User Errors

RuntimeUserError indicates that the task execution failed due to an issue within the user's provided code. This is the most common type of error and typically requires a code fix or adjustment in the task definition.

When a RuntimeUserError is raised, its kind attribute is set to "user".

Example: A Python task attempts to access a dictionary key that does not exist, resulting in a KeyError. Flyte wraps this underlying exception as a RuntimeUserError.

from flytekit import task, workflow
from flytekit.exceptions.base import RuntimeUserError

@task
def my_failing_task(data: dict):
    try:
        # This will raise a KeyError if 'non_existent_key' is not in data
        value = data["non_existent_key"]
        print(f"Value: {value}")
    except KeyError as e:
        # In a real Flyte execution, the system would catch this.
        # For demonstration, we show how it maps to RuntimeUserError.
        # "from e" keeps the original KeyError attached as the cause.
        raise RuntimeUserError(
            code="KEY_NOT_FOUND",
            message=f"Missing expected key in input data: {e}",
        ) from e

@workflow
def my_workflow():
    my_failing_task(data={"some_key": "some_value"})

# When my_workflow runs, my_failing_task will raise RuntimeUserError.

System Errors

RuntimeSystemError signifies a failure originating from the Flyte system itself or its underlying infrastructure, rather than directly from the user's task logic. This could include issues with resource allocation, network connectivity within the Flyte cluster, or internal service failures. While less frequent than user errors, system errors require attention from platform administrators or a deeper understanding of the Flyte environment.

The kind attribute for RuntimeSystemError is set to "system".

Example: A task fails to launch because the underlying Kubernetes pod cannot be scheduled due to insufficient cluster resources. Flyte would report this as a RuntimeSystemError.

from flytekit.exceptions.base import RuntimeSystemError

# This error is typically raised internally by the Flyte system.
# Developers usually encounter it as an execution status.
# For illustration, imagine an internal component raising it:
def simulate_system_failure():
    raise RuntimeSystemError(
        code="K8S_POD_SCHEDULING_FAILED",
        message="Failed to schedule pod due to insufficient memory in the cluster.",
    )

# In a real scenario, you would see this in the Flyte UI or CLI output
# when a task fails due to infrastructure issues.

Unknown Errors

RuntimeUnknownError is a fallback category for failures that cannot be definitively classified as either user or system errors. This can occur when an unexpected exception is caught, or when the system lacks sufficient context to determine the root cause. While less specific, RuntimeUnknownError still indicates a problem requiring investigation.

The kind attribute for RuntimeUnknownError is set to "unknown".

Example: An external service called by a task returns an unhandled, cryptic error message that doesn't fit predefined user or system error patterns.

from flytekit import task, workflow
from flytekit.exceptions.base import RuntimeUnknownError

def call_external_service():
    # Simulate an external service call that returns an unexpected error
    raise ValueError(
        "External service returned an unexpected status: "
        "500 Internal Server Error with no details."
    )

@task
def process_data_with_external_service():
    try:
        call_external_service()
    except Exception as e:
        # If the system cannot classify the error, it might raise RuntimeUnknownError
        raise RuntimeUnknownError(
            code="EXTERNAL_SERVICE_UNCLASSIFIED_ERROR",
            message=f"An unclassified error occurred during external service call: {e}",
        ) from e

@workflow
def my_external_service_workflow():
    process_data_with_external_service()

# When my_external_service_workflow runs, process_data_with_external_service might raise RuntimeUnknownError.

Interpreting and Handling Runtime Errors

Understanding the kind and code attributes of a BaseRuntimeError instance is crucial for effective debugging and operational management.

  • Debugging User Errors: When kind is "user", focus debugging efforts on the task's code logic, input data, or dependencies. The root_cause_message provides specific details.
  • Addressing System Errors: If kind is "system", investigate the Flyte environment, cluster resources, or platform configurations. This often requires collaboration with platform administrators.
  • Investigating Unknown Errors: RuntimeUnknownError suggests a need for deeper investigation. Examine logs from the task execution and the Flyte system for any unhandled exceptions or unusual events. Consider enhancing error handling within tasks to catch more specific exceptions and re-raise them as RuntimeUserError with more context.
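
The last point, catching specific exceptions and re-raising them as user errors with more context, can be sketched as a small wrapper. This is only an illustration under stated assumptions: RuntimeUserError is a local stand-in so the snippet runs on its own, and USER_ERROR_CODES, run_with_classification, and lookup are hypothetical names that are not part of Flyte.

```python
from typing import Any, Callable, Dict, Type


class RuntimeUserError(RuntimeError):
    """Local stand-in for the user error type described in this guide."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code
        self.kind = "user"


# Hypothetical mapping from exception types to error codes.
USER_ERROR_CODES: Dict[Type[Exception], str] = {
    KeyError: "KEY_NOT_FOUND",
    ValueError: "INVALID_VALUE",
}


def run_with_classification(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn and convert recognized exceptions into RuntimeUserError."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        for exc_type, code in USER_ERROR_CODES.items():
            if isinstance(exc, exc_type):
                # Chain the original exception so its traceback is preserved.
                raise RuntimeUserError(
                    code=code,
                    message=f"{fn.__name__} failed: {exc!r}",
                ) from exc
        # Unrecognized exceptions propagate unchanged.
        raise


def lookup(data: dict, key: str) -> Any:
    return data[key]


try:
    run_with_classification(lookup, {"some_key": "some_value"}, "non_existent_key")
except RuntimeUserError as err:
    print(err.code, "-", err)
```

Exceptions that are not in the mapping are deliberately left alone, so anything the wrapper cannot classify still surfaces with its original type.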

Developers can catch BaseRuntimeError or its specific subclasses in their code, though typically Flyte handles the propagation and reporting of these errors. For advanced use cases, such as custom error handling within a task or integrating with external monitoring systems, direct interaction with these error types can be beneficial.

from flytekit.exceptions.base import (
    BaseRuntimeError,
    RuntimeSystemError,
    RuntimeUnknownError,
    RuntimeUserError,
)

def analyze_flyte_error(error: BaseRuntimeError):
    print(f"Error Code: {error.code}")
    print(f"Error Kind: {error.kind}")
    print(f"Root Cause: {error.args[0]}")  # root_cause_message is the first arg of RuntimeError
    if error.worker:
        print(f"Worker: {error.worker}")

    if isinstance(error, RuntimeUserError):
        print("Action: Review task code and inputs.")
    elif isinstance(error, RuntimeSystemError):
        print("Action: Check Flyte system health and resources.")
    elif isinstance(error, RuntimeUnknownError):
        print("Action: Investigate logs for unhandled exceptions.")
    else:
        print("Action: Unclassified runtime error.")

# Example usage (simulating an error caught by an external handler)
try:
    raise RuntimeUserError(code="INVALID_INPUT", message="Input 'x' must be positive.")
except BaseRuntimeError as e:
    analyze_flyte_error(e)

try:
    raise RuntimeSystemError(code="DB_CONNECTION_FAILED", message="Could not connect to database 'prod_db'.")
except BaseRuntimeError as e:
    analyze_flyte_error(e)

Best Practices

  • Provide Clear Error Messages: When raising custom exceptions within your tasks that Flyte might wrap, ensure the messages are descriptive. This aids in quickly diagnosing RuntimeUserError instances.
  • Monitor Error Kinds: Leverage the kind attribute in monitoring and alerting systems to differentiate between user-fixable issues and platform-level problems. This helps route alerts to the appropriate teams.
  • Refine Error Handling: Within complex tasks, catch specific exceptions and re-raise them with more context if necessary, or log them thoroughly before allowing them to propagate. This can help prevent RuntimeUnknownError by providing clearer classification.
  • Utilize code for Granularity: Define specific code values for common failure modes within your tasks. This allows for more granular error tracking and automated responses.
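
As a rough sketch of the monitoring and granularity practices above, the snippet below turns an error into an alert payload and picks a destination from its kind. Everything here is an assumption made for illustration: BaseRuntimeError is a minimal local stand-in, and ROUTES, its team names, and route_alert are hypothetical and not part of Flyte.

```python
from typing import Optional


class BaseRuntimeError(RuntimeError):
    """Minimal local stand-in with the attributes described in this guide."""

    def __init__(self, code: str, kind: str, root_cause_message: str,
                 worker: Optional[str] = None):
        super().__init__(root_cause_message)
        self.code = code
        self.kind = kind
        self.worker = worker


# Hypothetical routing table; the team names are placeholders.
ROUTES = {
    "user": "task-owners",
    "system": "platform-oncall",
    "unknown": "triage",
}


def route_alert(error: BaseRuntimeError) -> dict:
    """Build an alert payload and choose a destination from the error kind."""
    return {
        # Fall back to the triage queue for any kind not in the table.
        "route": ROUTES.get(error.kind, ROUTES["unknown"]),
        "kind": error.kind,
        "code": error.code,
        "message": str(error),
        "worker": error.worker,
    }


alert = route_alert(
    BaseRuntimeError(
        code="DB_CONNECTION_FAILED",
        kind="system",
        root_cause_message="Could not connect to database 'prod_db'.",
    )
)
print(alert["route"], alert["code"])
```

Keeping code in the payload lets a downstream system group or automate responses per failure mode, while kind alone decides who gets notified.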