System and Infrastructure Errors

These errors indicate issues originating from the underlying system or infrastructure components rather than application-specific logic. They often point to problems with service availability, communication, or the operational state of the platform. Understanding and handling these errors is crucial for building robust and resilient applications.

Handling System-Level Failures

Applications interacting with the system must anticipate and gracefully handle infrastructure-related exceptions. These errors typically derive from BaseRuntimeError or RuntimeError, providing a consistent base for exception handling. Implementing appropriate try...except blocks allows for recovery, retry mechanisms, or informative error reporting to users.

try:
    # Attempt an operation that might raise a system error
    result = some_system_operation()
except (UnionRpcError, LogsNotYetAvailableError, InitializationError, ActionNotFoundError) as e:
    # Log the error, notify administrators, or implement a retry strategy
    print(f"A system or infrastructure error occurred: {e}")
    # Depending on the error type, specific recovery actions can be taken
    if isinstance(e, UnionRpcError):
        print("Union server communication failed. Consider retrying or checking server status.")
    elif isinstance(e, InitializationError):
        print("System not initialized. Ensure proper setup before operations.")
    # Re-raise if the error cannot be handled at this level
    raise

Specific System and Infrastructure Error Types

The system defines several specific error types to categorize common infrastructure-related issues.
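
The inheritance relationships described in the sections that follow can be summarized with stand-in classes. This is a minimal sketch, not the library's source: the parent of each specific error follows the descriptions on this page, while two links are assumptions made only so the sketch runs on its own, namely that RuntimeSystemError derives from BaseRuntimeError and that BaseRuntimeError derives from Python's built-in RuntimeError. The describe() helper is hypothetical.

```python
# Stand-in classes for illustration only; in real code, import the error
# types from the SDK instead of redefining them.

class BaseRuntimeError(RuntimeError):
    """Assumed root of the runtime errors (its own parent is an assumption)."""

class RuntimeSystemError(BaseRuntimeError):
    """Assumed to sit between BaseRuntimeError and UnionRpcError."""

class UnionRpcError(RuntimeSystemError):
    """Communication with the Union server failed."""

class LogsNotYetAvailableError(BaseRuntimeError):
    """Logs for a task have not been generated or processed yet."""

class InitializationError(BaseRuntimeError):
    """An operation was attempted before the system was initialized."""

class ActionNotFoundError(RuntimeError):
    """The requested action does not exist."""

def describe(exc_type):
    """Return the class names from exc_type up to Exception (hypothetical helper)."""
    return [cls.__name__ for cls in exc_type.__mro__
            if cls not in (BaseException, object)]

print(describe(UnionRpcError))
```

Under these assumptions, a handler written as `except BaseRuntimeError` would catch the first three error types but not ActionNotFoundError, which is worth keeping in mind when choosing how broad a handler should be.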

UnionRpcError

The UnionRpcError indicates a failure in communication with the Union server. This error typically arises from network issues, an unresponsive server, or incorrect server configuration. It inherits from RuntimeSystemError, signifying a critical runtime problem.

Common Use Cases:

  • Any operation requiring a remote procedure call (RPC) to the Union server.
  • Fetching data or submitting tasks to the Union backend.

Practical Implementation: When a UnionRpcError occurs, applications should consider:

  1. Retries: For transient network issues, a retry mechanism with exponential backoff can be effective.
  2. Server Status Check: Inform the user or administrator that the Union server might be unavailable.
  3. Fallback Mechanisms: If possible, provide degraded functionality or cached data.

import time

def communicate_with_union_server():
    retries = 3
    for i in range(retries):
        try:
            # Simulate an RPC call that might fail
            # In a real scenario, this would be an actual network call
            if i < 2:  # Simulate failure for the first two attempts
                raise UnionRpcError("Failed to connect to Union server.")
            print("Successfully communicated with Union server.")
            return "Data from Union"
        except UnionRpcError as e:
            print(f"Attempt {i+1} failed: {e}")
            if i < retries - 1:
                time.sleep(2 ** i)  # Exponential backoff
            else:
                print("Max retries reached. Union server is unreachable.")
                raise  # Re-raise if all retries fail
    return None

# Example usage
try:
    data = communicate_with_union_server()
    if data:
        print(f"Received: {data}")
except UnionRpcError:
    print("Application could not retrieve data from Union server after multiple attempts.")

LogsNotYetAvailableError

The LogsNotYetAvailableError is raised when an attempt is made to retrieve logs for a task, but those logs have not yet been generated or processed by the system. This is a transient, timing-related condition: the requested logs are pending rather than missing. It inherits from BaseRuntimeError.

Common Use Cases:

  • Polling for task logs immediately after a task has been initiated.
  • Accessing logs for a task that is still in a very early stage of execution.

Practical Implementation: When encountering LogsNotYetAvailableError, the recommended approach is to implement a polling strategy. The application should wait for a short period and then re-attempt log retrieval.

import time

def get_task_logs(task_id: str):
    max_attempts = 5
    for attempt in range(max_attempts):
        try:
            # Simulate log retrieval
            # In a real scenario, this would be a call to a log service
            if attempt < 2:  # Simulate logs not being ready for first two attempts
                raise LogsNotYetAvailableError(f"Logs for task {task_id} are not yet available.")
            print(f"Logs for task {task_id} are now available.")
            return ["Log line 1", "Log line 2", "Log line 3"]
        except LogsNotYetAvailableError as e:
            print(f"Attempt {attempt+1}: {e}")
            # Wait only if another attempt will follow
            if attempt < max_attempts - 1:
                print("Retrying in 2 seconds...")
                time.sleep(2)
    raise RuntimeError(f"Failed to retrieve logs for task {task_id} after {max_attempts} attempts.")

# Example usage
try:
    logs = get_task_logs("task-123")
    for log in logs:
        print(log)
except RuntimeError as e:
    print(f"Error: {e}")

InitializationError

The InitializationError signals that an operation was attempted on the system before it was properly initialized. This error prevents operations on an unconfigured or unready system, ensuring that critical components are set up correctly before use. It inherits from BaseRuntimeError.

Common Use Cases:

  • Calling system-level functions or accessing system resources before the main init() or setup() function has completed.
  • Attempting to interact with a system component that relies on global or singleton initialization.

Practical Implementation: Developers must ensure that the system's initialization routine is called and successfully completes before any other system-dependent operations. This error serves as a safeguard against undefined behavior.

import time

_is_system_initialized = False

def initialize_system():
    global _is_system_initialized
    print("Initializing system...")
    # Simulate complex initialization logic
    time.sleep(1)
    _is_system_initialized = True
    print("System initialized successfully.")

def perform_system_operation():
    if not _is_system_initialized:
        raise InitializationError("Union system not initialized. Call initialize_system() first.")
    print("Performing system operation...")
    # ... actual operation ...
    return "Operation successful"

# Correct usage
initialize_system()
try:
    result = perform_system_operation()
    print(result)
except InitializationError as e:
    print(f"Error: {e}")

# Incorrect usage (demonstrates the error)
_is_system_initialized = False  # Reset for demonstration
try:
    result = perform_system_operation()
    print(result)
except InitializationError as e:
    print(f"Caught expected error: {e}")

ActionNotFoundError

The ActionNotFoundError is raised when a user or an application attempts to invoke an action that does not exist within the system's defined capabilities. This error indicates a mismatch between the requested operation and the available actions. It inherits from RuntimeError.

Common Use Cases:

  • Calling a non-existent API endpoint or command.
  • Requesting a specific task or workflow by an incorrect identifier.
  • User input specifying an invalid action.

Practical Implementation: When an ActionNotFoundError occurs, the application should typically:

  1. Validate Input: Ensure that user-provided or dynamically generated action names are valid against a known list of available actions.
  2. Provide Feedback: Inform the user that the requested action is not recognized.
  3. Log the Error: Record the attempt to access a non-existent action for debugging or security auditing.

AVAILABLE_ACTIONS = {"create_resource", "delete_resource", "list_resources"}

def execute_action(action_name: str, *args, **kwargs):
    if action_name not in AVAILABLE_ACTIONS:
        raise ActionNotFoundError(f"Action '{action_name}' does not exist.")

    print(f"Executing action: {action_name} with args: {args}, kwargs: {kwargs}")
    # ... actual action execution logic ...
    return f"Action '{action_name}' completed."

# Example usage
try:
    print(execute_action("create_resource", "my_item"))
    print(execute_action("list_resources"))
    print(execute_action("update_resource", "item_id_123", new_value="test"))  # This will raise an error
except ActionNotFoundError as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
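
The feedback and logging recommendations can be combined with validation in one place. The sketch below is only one way to do it: the `resolve_action` helper and the "actions" logger name are hypothetical, ActionNotFoundError is a stand-in so the snippet runs on its own, and the "did you mean" hint uses the standard library's difflib rather than anything provided by the system.

```python
import difflib
import logging

logger = logging.getLogger("actions")  # hypothetical logger name

class ActionNotFoundError(RuntimeError):
    """Stand-in for the SDK error so this sketch runs on its own."""

AVAILABLE_ACTIONS = {"create_resource", "delete_resource", "list_resources"}

def resolve_action(action_name):
    """Validate action_name; on failure, log it and raise with a hint."""
    if action_name in AVAILABLE_ACTIONS:
        return action_name
    # Log the rejected name for debugging or security auditing
    logger.warning("Unknown action requested: %r", action_name)
    # Suggest the closest known action, if one is reasonably similar
    close = difflib.get_close_matches(action_name, sorted(AVAILABLE_ACTIONS), n=1)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    raise ActionNotFoundError(f"Action '{action_name}' does not exist.{hint}")

# Example usage with a misspelled action name
try:
    resolve_action("delete_resorce")
except ActionNotFoundError as e:
    print(f"Error: {e}")
```

Keeping the hint inside the exception message means the same text can be shown to the user and written to the log without separate formatting code.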

Best Practices for Error Handling

  • Specificity: Catch specific error types (UnionRpcError, LogsNotYetAvailableError) before more general exceptions to implement targeted recovery logic.
  • Logging: Always log system and infrastructure errors with sufficient detail (timestamp, error message, stack trace) to aid in debugging and operational monitoring.
  • User Feedback: Provide clear, actionable feedback to end-users when an error occurs, avoiding cryptic messages.
  • Monitoring and Alerting: Integrate error handling with monitoring systems to trigger alerts for critical infrastructure failures.
  • Idempotency and Retries: Design operations to be idempotent where possible, and implement robust retry strategies for transient errors like UnionRpcError or LogsNotYetAvailableError.
  • Graceful Degradation: Consider how the application can continue to function, possibly with reduced capabilities, when a non-critical infrastructure component fails.
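
The specificity and logging practices can be illustrated together. This is a minimal sketch under stated assumptions: the error classes are stand-ins that mirror the inheritance described on this page (with RuntimeSystemError assumed to derive from BaseRuntimeError), and `run_guarded`, its outcome labels, and the "infra" logger name are hypothetical.

```python
import logging

logger = logging.getLogger("infra")  # hypothetical logger name

# Stand-ins so the sketch runs on its own; import the real classes in practice.
class BaseRuntimeError(RuntimeError): ...
class RuntimeSystemError(BaseRuntimeError): ...  # assumed parent relationship
class UnionRpcError(RuntimeSystemError): ...
class LogsNotYetAvailableError(BaseRuntimeError): ...

def run_guarded(operation):
    """Run operation() and map each failure to a coarse outcome label."""
    try:
        return "ok", operation()
    except (UnionRpcError, LogsNotYetAvailableError) as e:
        # Specific, transient errors come first: the caller may retry
        logger.warning("Transient infrastructure error: %s", e)
        return "retry", None
    except BaseRuntimeError:
        # More general handler comes last; logger.exception records the stack trace
        logger.exception("Unrecoverable runtime error")
        return "failed", None

def flaky():
    raise UnionRpcError("Failed to connect to Union server.")

print(run_guarded(lambda: 42))  # ('ok', 42)
print(run_guarded(flaky))       # ('retry', None)
```

If the general `except BaseRuntimeError` clause were listed first, it would also catch the transient errors and the retry branch would never run, which is why the order of handlers matters.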