Error Handling & Execution Reports

The platform classifies runtime errors by their source and can generate structured HTML reports for task executions. Together these make it clear why a task failed and give a single place to inspect the outcome of a run.

Error Handling

The error handling system standardizes how runtime failures are classified and communicated. All custom runtime errors derive from BaseRuntimeError, which introduces a fundamental categorization of errors into three kinds: user, system, and unknown. This classification helps developers quickly identify the source and nature of a problem.

Error Kinds

User Errors (RuntimeUserError): These errors originate from issues within the user's code, configuration, or data. They indicate that the task failed due to something the user can directly control or fix.
System Errors (RuntimeSystemError): These errors point to issues within the underlying platform infrastructure or its components. They suggest a problem outside the user's immediate control, potentially requiring platform-level investigation.
Unknown Errors (RuntimeUnknownError): This category serves as a fallback for errors that cannot be definitively classified as either user or system errors.

Base Runtime Error

The BaseRuntimeError class is the foundation for all custom runtime exceptions. It captures essential information about a failure:

code: A unique string identifier for the specific error type.
kind: The classification of the error (user, system, or unknown).
root_cause_message: A descriptive message explaining the error.
worker: An optional identifier for the worker that encountered the error.
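
As a rough illustration of how these fields and the three kinds fit together, the sketch below defines stand-in classes. The class and field names come from the description above; the constructor signatures, the `Exception` base class, and the `ErrorKind` alias are assumptions made for the example, not the platform's actual definitions.

```python
from typing import Literal, Optional

# Assumed alias for the three kinds named in the text.
ErrorKind = Literal["user", "system", "unknown"]


class BaseRuntimeError(Exception):
    """Stand-in base class; the real constructor may differ."""

    def __init__(
        self,
        code: str,
        kind: ErrorKind,
        root_cause_message: str,
        worker: Optional[str] = None,
    ) -> None:
        super().__init__(root_cause_message)
        self.code = code
        self.kind = kind
        self.root_cause_message = root_cause_message
        self.worker = worker


class RuntimeUserError(BaseRuntimeError):
    """Failure caused by user code, configuration, or data."""

    def __init__(self, code: str, message: str, worker: Optional[str] = None) -> None:
        super().__init__(code, "user", message, worker)


class RuntimeSystemError(BaseRuntimeError):
    """Failure caused by the platform infrastructure."""

    def __init__(self, code: str, message: str, worker: Optional[str] = None) -> None:
        super().__init__(code, "system", message, worker)


class RuntimeUnknownError(BaseRuntimeError):
    """Fallback when the source of the failure cannot be determined."""

    def __init__(self, code: str, message: str, worker: Optional[str] = None) -> None:
        super().__init__(code, "unknown", message, worker)


# Hypothetical error code and worker name, for illustration only.
err = RuntimeUserError("MissingColumn", "column 'id' not found", worker="worker-0")
```

In this sketch each subclass fixes its own kind, so callers never pass the kind explicitly and cannot mislabel an error.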

Common User Error Types

The system defines several specialized user error classes, all inheriting from RuntimeUserError, to cover common failure scenarios:

CustomError: Allows users to wrap and categorize their own exceptions. The from_exception class method is particularly useful for converting standard Python exceptions into a CustomError, using the exception's class name as the error code.
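```
from flyte.errors import CustomError

try:
    # User code that might raise an exception
    result = 1 / 0
except Exception as e:
    # Wrap the original exception; its class name becomes the error code
    raise CustomError.from_exception(e) from e
```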
DeploymentError: Raised when a task's deployment fails or its preconditions are not met.
ImageBuildError: Indicates a failure during the container image build process.
ImagePullBackOffError: Occurs when the system cannot pull the required container image.
InlineIOMaxBytesBreached: Raised when the data size for inline input/output exceeds the configured limit. This limit can be adjusted per task.
InvalidImageNameError: Signifies that a provided image name is malformed or invalid.
ModuleLoadError: Occurs when a Python module cannot be loaded, either because it does not exist or because it contains a syntax error.
NotInTaskContextError: Raised when an attempt is made to access task-specific context outside of an active task execution.
OOMError: Indicates an Out-Of-Memory error during task execution.
PrimaryContainerNotFoundError: Raised if the designated primary container for a task cannot be found.
ReferenceTaskError: Occurs when a task attempts to reference another task that does not exist.
RetriesExhaustedError: Signifies that a task failed even after exhausting all configured retry attempts.
RunAbortedError: Raised when a task execution is explicitly aborted by a user.

RuntimeDataValidationError: Indicates issues with serializing or deserializing task inputs or outputs.

```
from flyte.errors import RuntimeDataValidationError

def process_data(data_input: str) -> int:
    try:
        # Attempt to parse data_input as an integer
        parsed_value = int(data_input)
    except ValueError as e:
        # Report which variable failed validation and in which task
        raise RuntimeDataValidationError(var="data_input", e=e, task_name="my_parsing_task") from e
    return parsed_value
```

TaskInterruptedError: Raised when a task's underlying execution is interrupted.
TaskTimeoutError: Occurs when a task runs longer than its specified timeout duration.
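
To make the CustomError.from_exception behavior described earlier in this list concrete, here is a minimal stand-in. It only assumes what the text states (the exception's class name becomes the error code); using the exception's string form as the message, and the constructor shown, are assumptions of this sketch rather than the library's confirmed implementation.

```python
class CustomError(Exception):
    """Stand-in for the platform's CustomError (hypothetical constructor)."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.kind = "user"  # CustomError is described as a user error
        self.root_cause_message = message

    @classmethod
    def from_exception(cls, e: Exception) -> "CustomError":
        # The class name of the wrapped exception is used as the error code.
        return cls(type(e).__name__, str(e))


try:
    1 / 0
except Exception as e:
    wrapped = CustomError.from_exception(e)
```

Keying the code on the class name means that downstream tooling can group failures by exception type without parsing messages.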

System and Unknown Error Types

UnionRpcError: A RuntimeSystemError indicating a communication failure with the platform's server.
InitializationError: A BaseRuntimeError raised when the system is accessed before proper initialization.
LogsNotYetAvailableError: A BaseRuntimeError indicating that logs for a task are not yet ready for retrieval.
ActionNotFoundError: A plain Python RuntimeError (not a BaseRuntimeError) raised when the requested action does not exist.

When handling errors, it is best practice to catch specific RuntimeUserError types to provide targeted feedback or recovery mechanisms. For broader error handling, inspecting the kind attribute of a BaseRuntimeError instance allows for generalized handling of user, system, or unknown issues.
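
The two-level approach described above might look like the following sketch: a specific handler first, then a generic handler that branches on `kind`. The classes here are simplified stand-ins with hypothetical constructors, and `classify_failure`, `slow_task`, and `flaky_backend` are names invented for the example.

```python
from typing import Callable


class BaseRuntimeError(Exception):
    """Stand-in for the platform's base error (hypothetical constructor)."""

    def __init__(self, code: str, kind: str, root_cause_message: str) -> None:
        super().__init__(root_cause_message)
        self.code = code
        self.kind = kind
        self.root_cause_message = root_cause_message


class RuntimeUserError(BaseRuntimeError):
    def __init__(self, code: str, message: str) -> None:
        super().__init__(code, "user", message)


class TaskTimeoutError(RuntimeUserError):
    def __init__(self, message: str) -> None:
        super().__init__("TaskTimeoutError", message)


def classify_failure(run: Callable[[], None]) -> str:
    """Run a callable and map any runtime error to a coarse action."""
    try:
        run()
    except TaskTimeoutError as err:
        # Most specific handler first: a targeted response for timeouts.
        return f"raise the timeout: {err.root_cause_message}"
    except BaseRuntimeError as err:
        # Generic fallback: branch on the error kind.
        if err.kind == "user":
            return f"fix the task: {err.code}"
        if err.kind == "system":
            return "retry later or contact the platform team"
        return "needs investigation"
    return "ok"


def slow_task() -> None:
    raise TaskTimeoutError("exceeded 60s limit")


def flaky_backend() -> None:
    raise BaseRuntimeError("RpcUnavailable", "system", "server unreachable")


timeout_action = classify_failure(slow_task)
system_action = classify_failure(flaky_backend)
```

Ordering matters: because TaskTimeoutError is also a BaseRuntimeError, the specific `except` clause has to come before the generic one or it would never run.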

Execution Reports

The reporting utility generates structured, tabbed HTML reports for task executions. A report can present results, logs, diagnostic information, or any custom HTML content, organized by tab.

The `Report` Class

The Report class is the primary interface for creating and managing execution reports. A report can contain multiple tabs, each dedicated to a specific aspect of the execution.

Key Features:

Tab Management: Reports can have multiple named tabs. A default "Main" tab is created automatically.
HTML Output: The final report is rendered as a single HTML document, which can be displayed directly in environments like Jupyter notebooks or saved as a file.
Customizable Template: Reports use an internal HTML template, allowing for a consistent look and feel.

Methods:

__init__(self, name: str): Initializes a new report with a given name.
get_tab(self, name: str, create_if_missing: bool = True) -> Tab: Retrieves a tab by its name. If create_if_missing is True, a new tab is created if it doesn't already exist.
get_final_report(self) -> Union[str, "HTML"]: Generates the complete HTML report. If running in an IPython environment, it returns an IPython.core.display.HTML object for rich display; otherwise, it returns a raw HTML string.

The `Tab` Class

Each Tab within a Report holds a collection of HTML content snippets.

Methods:

__init__(self, name: str): Initializes a new tab with a given name.
log(self, content: str): Appends new HTML content to the tab. The content should be a valid HTML string fragment (e.g., a <div>, <p>, or <table>).
replace(self, content: str): Replaces all existing content in the tab with the provided HTML string.
get_html(self) -> str: Returns the concatenated HTML content of the tab.
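
The difference between log() and replace(), and the effect of create_if_missing, can be seen in a small stand-in model of the two classes. This is not the library's code: joining fragments with newlines and raising KeyError for a missing tab are assumptions of the sketch, and HTML templating is left out entirely.

```python
from typing import Dict, List


class Tab:
    """Stand-in for the Tab described above."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._content: List[str] = []

    def log(self, content: str) -> None:
        # Append one HTML fragment to the tab.
        self._content.append(content)

    def replace(self, content: str) -> None:
        # Discard everything logged so far and keep only this fragment.
        self._content = [content]

    def get_html(self) -> str:
        # Assumption: fragments are joined with newlines.
        return "\n".join(self._content)


class Report:
    """Stand-in for Report, covering tab lookup only."""

    def __init__(self, name: str) -> None:
        self.name = name
        # The default "Main" tab exists from the start.
        self._tabs: Dict[str, Tab] = {"Main": Tab("Main")}

    def get_tab(self, name: str, create_if_missing: bool = True) -> Tab:
        if name not in self._tabs:
            if not create_if_missing:
                # Hypothetical choice; the real behavior is not documented here.
                raise KeyError(name)
            self._tabs[name] = Tab(name)
        return self._tabs[name]


report = Report("demo")
status = report.get_tab("Status")
status.log("<p>step 1 done</p>")
status.log("<p>step 2 done</p>")
appended = status.get_html()   # both fragments, in order

status.replace("<p>all steps done</p>")
replaced = status.get_html()   # only the replacement fragment
```

A pattern this suggests is to use log() for append-only streams such as log lines, and replace() for a status panel that should always show only the latest state.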

Practical Implementation

To create and populate an execution report:

Instantiate a Report:

```
from flyte.report._report import Report

my_report = Report(name="My Task Execution Summary")
```

Access or Create Tabs:

```
main_tab = my_report.get_tab("Main")
logs_tab = my_report.get_tab("Execution Logs")
metrics_tab = my_report.get_tab("Performance Metrics")
```

Add Content to Tabs: Use log() to append content or replace() to set the entire content of a tab. Ensure the content is valid HTML.

```
main_tab.log("<h1>Task Completed Successfully!</h1>")
main_tab.log("<p>Details of the execution are provided in other tabs.</p>")

logs_tab.log("<pre>INFO: Task started at 2023-10-27 10:00:00\n</pre>")
logs_tab.log("<pre>INFO: Processing data...</pre>")
logs_tab.log("<pre>ERROR: Failed to connect to database!</pre>")

metrics_tab.log("<table><tr><th>Metric</th><th>Value</th></tr><tr><td>Duration</td><td>120s</td></tr></table>")
```

Generate and Display the Report:

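```
final_html_report = my_report.get_final_report()

# In a Jupyter environment, evaluating the object renders the report:
# final_html_report

# To save it to a file, extract the markup first. In IPython the result is an
# HTML display object whose markup is in its .data attribute; elsewhere it is
# already a plain string, so getattr falls back to the value itself.
# html_text = getattr(final_html_report, "data", final_html_report)
# with open("report.html", "w", encoding="utf-8") as f:
#     f.write(html_text)
```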

Best Practices

Logical Separation: Use different tabs to logically separate distinct types of information (e.g., summary, detailed logs, input/output data, visualizations).
HTML Safety: When adding content, be mindful that the reporting utility inserts raw HTML. Ensure any user-generated or external content is properly sanitized to prevent cross-site scripting (XSS) vulnerabilities if the report is to be shared widely.
Concise Content: While reports can be detailed, strive for conciseness within each tab to improve readability.
Error Reporting Integration: In case of a RuntimeUserError or RuntimeSystemError, the report can be used to log the error details, stack traces, or relevant diagnostic information in a dedicated "Errors" tab, providing a comprehensive view of the failure.
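
The HTML safety and error reporting points above can be combined in one helper that turns an exception into an escaped HTML fragment suitable for an "Errors" tab. This is a sketch using only the Python standard library; `format_error_html` is a hypothetical helper, not part of the reporting utility, and it reads `code` and `kind` defensively so that it also works for exceptions that lack those attributes.

```python
import html
import traceback


def format_error_html(exc: BaseException) -> str:
    """Build an HTML fragment describing an exception, with all text escaped."""
    tb_text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    # Fall back to generic values for exceptions without code/kind attributes.
    code = getattr(exc, "code", type(exc).__name__)
    kind = getattr(exc, "kind", "unknown")
    return (
        f"<h2>{html.escape(str(code))} ({html.escape(str(kind))})</h2>"
        f"<pre>{html.escape(tb_text)}</pre>"
    )


try:
    # The message contains markup that must not reach the report unescaped.
    raise ValueError("<script>alert(1)</script>")
except ValueError as exc:
    fragment = format_error_html(exc)

# With a report object in hand, the fragment would be logged like this:
# my_report.get_tab("Errors").log(fragment)
```

Escaping at the point where the fragment is built keeps the rule simple: anything that did not originate as trusted markup passes through html.escape before it is logged.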