Common User-Induced Errors

Common user-induced errors arise from misconfiguration, incorrect code, or improper interaction with the system, and they lead to predictable failures during task execution or deployment. Each error type points to a specific root cause, which helps developers identify and resolve issues quickly.

ImagePullBackOffError

This error indicates that the system failed to pull a specified container image.

Common Causes:

  • Invalid Image Name or Tag: The image name or tag is misspelled, or it refers to an image or version that does not exist in the registry.
  • Private Registry Authentication Failure: The system lacks the necessary credentials (e.g., Docker config, Kubernetes secrets) to access a private container registry.
  • Network Connectivity Issues: The environment where the task is running cannot reach the container registry due to network policies, DNS resolution failures, or temporary outages.
  • Image Deletion: The referenced image was removed from the registry.

Resolution and Prevention:

  • Verify the image name and tag for accuracy.
  • Ensure proper authentication mechanisms are configured for private registries.
  • Confirm network connectivity from the execution environment to the container registry.
  • Implement robust image lifecycle management to prevent accidental deletion of images in use.

InlineIOMaxBytesBreached

This error occurs when a task attempts to write more data to inline standard output or error streams than the configured maximum limit allows.

Common Causes:

  • Excessive Logging: A task generates a large volume of logs or prints extensive data to stdout/stderr.
  • Large Return Values: A task's output, intended for inline storage, exceeds the byte limit.

Resolution and Prevention:

  • Adjust max_inline_io_bytes: Increase the max_inline_io_bytes setting in the task definition to accommodate larger inline outputs.
  • Externalize Large Outputs: For substantial data, refactor tasks to store outputs in external persistent storage (e.g., object storage, databases) rather than relying on inline capture.
  • Optimize Logging: Reduce verbose logging, especially in production environments, or direct logs to a dedicated logging system.
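
The "Externalize Large Outputs" advice above can be sketched as follows. This is a minimal illustration, not a platform API: the helper `store_output`, the 1 KiB limit, and the local directory standing in for object storage are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical limit for illustration; the real value comes from the
# task's max_inline_io_bytes setting.
INLINE_LIMIT_BYTES = 1024


def store_output(value, storage_dir, limit=INLINE_LIMIT_BYTES):
    """Return the value inline if it is small, else write it out and return a reference."""
    payload = json.dumps(value).encode("utf-8")
    if len(payload) <= limit:
        return {"inline": value}
    # A local directory stands in for object storage in this sketch.
    path = Path(storage_dir) / "output.json"
    path.write_bytes(payload)
    return {"ref": str(path), "size_bytes": len(payload)}


with tempfile.TemporaryDirectory() as tmp:
    small = store_output({"status": "ok"}, tmp)
    large = store_output({"rows": list(range(5000))}, tmp)
    large_exists = Path(large["ref"]).exists()
```

Checking the serialized size before returning keeps small results inline and moves only oversized payloads to external storage, so downstream tasks receive a short reference instead of the data itself.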

TaskTimeoutError

This error signifies that a task's execution exceeded its predefined timeout duration.

Common Causes:

  • Inefficient Task Logic: The task's code is computationally expensive or contains inefficient algorithms.
  • External Dependency Latency: The task waits indefinitely for an external service or resource that is slow or unresponsive.
  • Infinite Loops or Deadlocks: The task's code enters an unintended loop or a deadlock state.
  • Insufficient Timeout Setting: The timeout value is set too low for the expected workload or external dependencies.

Resolution and Prevention:

  • Optimize Task Code: Profile and refactor task logic to improve performance.
  • Implement Robust External Calls: Use timeouts and retry mechanisms for external API calls or resource access.
  • Increase Timeout: Adjust the task's timeout setting to a more appropriate value based on expected execution times and external factors.
  • Monitor and Alert: Implement monitoring for task duration to identify potential bottlenecks before they lead to timeouts.
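
To avoid waiting indefinitely on an external dependency, a task can bound its own waits with a deadline that is shorter than the task timeout. The sketch below shows one way to do this with only the standard library; `wait_until` and `DeadlineExceeded` are hypothetical names introduced for the example.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the awaited condition is not met in time."""


def wait_until(check, timeout_s, interval_s=0.01):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            raise DeadlineExceeded(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)


# A dependency that becomes ready on the third poll.
polls = {"count": 0}


def ready_on_third_poll():
    polls["count"] += 1
    return polls["count"] >= 3


became_ready = wait_until(ready_on_third_poll, timeout_s=1.0)

# A dependency that never becomes ready fails fast instead of hanging.
try:
    wait_until(lambda: False, timeout_s=0.05)
    timed_out = False
except DeadlineExceeded:
    timed_out = True
```

Failing with an explicit deadline error inside the task usually gives a clearer log message than letting the platform terminate the task when its overall timeout expires.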

NotInTaskContextError

This error is raised when code attempts to access task-specific functionalities or context outside of an active task execution environment.

Common Causes:

  • Misplaced API Calls: Using task-specific APIs (e.g., for accessing inputs, outputs, or metadata) in code that runs outside the scope of a defined task.
  • Local Development Without Mocking: Running task code locally without a proper task context or mock environment, leading to calls that expect an active task.

Resolution and Prevention:

  • Scope Task Context Usage: Ensure all calls to task-specific APIs are encapsulated within the functions or methods designated as tasks.
  • Develop with Task Context in Mind: When developing locally, use provided SDK utilities or mock objects to simulate a task context if necessary, or structure code to clearly separate task-dependent logic.

InvalidImageNameError

This error indicates that the provided container image name does not conform to valid naming conventions or syntax.

Common Causes:

  • Syntax Errors: The image name contains unsupported characters, incorrect delimiters, or is malformed.
  • Empty or Null Name: An attempt was made to use an empty or null string as an image name.

Resolution and Prevention:

  • Validate Image Name Syntax: Always ensure image names adhere to standard Docker image naming conventions (e.g., registry/repository:tag).
  • Use Configuration Validation: Implement input validation for image names in configuration files or user interfaces.
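
A configuration check for image names might look like the sketch below. The pattern is a deliberately simplified approximation of the registry/repository:tag form (it ignores digests and is stricter than the real grammar in places, for example by requiring lowercase hosts), and `is_valid_image_name` is a hypothetical helper, so treat it as a first-pass filter rather than a definitive validator.

```python
import re

# Simplified building blocks; not the full container image reference grammar.
_HOST = r"[a-z0-9]+(?:[.-][a-z0-9]+)*(?::[0-9]+)?"
_COMPONENT = r"[a-z0-9]+(?:[._-]+[a-z0-9]+)*"
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"

IMAGE_RE = re.compile(
    rf"(?:{_HOST}/)?{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?"
)


def is_valid_image_name(name):
    """Return True if name looks like [registry/]repository[:tag]."""
    return isinstance(name, str) and IMAGE_RE.fullmatch(name) is not None


candidates = [
    "python:3.11",
    "ghcr.io/org/app:v1.2.3",
    "localhost:5000/app",
    "",
    "My Image:latest",
    "app:",
    None,
]
results = {repr(c): is_valid_image_name(c) for c in candidates}
```

Running the check when configuration is loaded surfaces empty, null, or malformed names before any pull is attempted.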

PrimaryContainerNotFoundError

This error occurs when the system cannot identify or locate the designated primary container within a multi-container task definition.

Common Causes:

  • Missing Primary Container Definition: The task configuration does not specify which container is the primary one, or the specified primary container name does not match any defined container.
  • Failed Primary Container Startup: The primary container's image failed to pull or start, making it unavailable.

Resolution and Prevention:

  • Explicitly Define Primary Container: Ensure the task definition clearly designates a primary container by name.
  • Verify Container Names: Double-check that the primary container name in the configuration exactly matches one of the defined containers.
  • Inspect Container Logs: If the primary container is defined but not found, investigate its individual logs for ImagePullBackOffError or other startup failures.
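
The name check described above can be automated before submission. The sketch below assumes a task definition represented as a plain dictionary; the keys `primary_container_name` and `containers` are hypothetical and will differ from the platform's real schema.

```python
def check_primary_container(spec):
    """Return a problem description, or None if the primary container resolves."""
    names = [c["name"] for c in spec.get("containers", [])]
    primary = spec.get("primary_container_name")
    if not primary:
        return "no primary container designated"
    if primary not in names:
        return f"primary container {primary!r} not among {names}"
    return None


good_spec = {
    "primary_container_name": "main",
    "containers": [{"name": "main"}, {"name": "sidecar"}],
}
# The comparison is exact, so a difference in case counts as a mismatch.
typo_spec = {
    "primary_container_name": "Main",
    "containers": [{"name": "main"}, {"name": "sidecar"}],
}
missing_spec = {"containers": [{"name": "main"}]}

problems = [check_primary_container(s) for s in (good_spec, typo_spec, missing_spec)]
```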

DeploymentError

This error signifies a failure during the deployment phase of a task, or when preconditions for deployment are not met.

Common Causes:

  • Resource Constraints: Insufficient compute resources (CPU, memory, disk) in the target environment to deploy the task.
  • Invalid Configuration: The task definition contains invalid parameters, references non-existent resources, or violates platform policies.
  • Missing Dependencies: Required external services, volumes, or network configurations are not available or correctly set up.
  • Infrastructure Issues: Underlying infrastructure components (e.g., orchestrator, network plugins) are experiencing issues.

Resolution and Prevention:

  • Review Deployment Logs: Examine detailed deployment logs for specific error messages and root causes.
  • Validate Task Configuration: Thoroughly check task definitions against schema and platform requirements.
  • Monitor Resource Utilization: Ensure the target environment has adequate resources.
  • Verify Infrastructure Health: Confirm that all necessary infrastructure components are operational.

RunAbortedError

This error indicates that a task execution was explicitly stopped or canceled by a user or an automated system.

Common Causes:

  • Manual Cancellation: A user initiated a stop command for the running task.
  • Automated Policy Enforcement: An external system or policy automatically terminated the task (e.g., due to cost limits, resource preemption).

Resolution and Prevention:

  • Understand Cancellation Policies: Be aware of any automated policies that might terminate tasks.
  • Implement Graceful Shutdowns: Design tasks to handle interruption signals gracefully, allowing for cleanup or checkpointing.
  • Review Audit Logs: Check system audit logs to determine who or what initiated the abort.

RetriesExhaustedError

This error occurs when a task fails repeatedly, exhausting all configured retry attempts without successful completion.

Common Causes:

  • Persistent Task Logic Errors: The task's code contains a bug that causes it to fail consistently, regardless of retries.
  • Environmental Instability: A recurring issue in the execution environment (e.g., flaky network, intermittent resource unavailability) prevents successful execution.
  • Incorrect Retry Strategy: The retry policy (e.g., number of retries, backoff strategy) is insufficient for the nature of the transient failures.

Resolution and Prevention:

  • Analyze Initial Failure Logs: Investigate the logs from the first failed attempt to identify the root cause, since later attempts often repeat the same failure.
  • Debug Task Code: Thoroughly debug the task's logic to fix any persistent bugs.
  • Assess Environmental Stability: Monitor the execution environment for recurring issues.
  • Adjust Retry Policy: Re-evaluate and potentially increase the number of retries or modify the backoff strategy if failures are genuinely transient.
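
The distinction between persistent bugs and transient failures can be made explicit in a retry policy. The following is a minimal sketch of exponential backoff that retries only errors marked as transient; `TransientError` and `run_with_retries` are hypothetical names, and a platform's built-in retry settings would normally be preferred where they exist.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retries(fn, max_retries=3, base_delay_s=0.01, sleep=time.sleep):
    """Call fn; retry on TransientError with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last failure
            sleep(base_delay_s * (2 ** attempt))
            attempt += 1


delays = []  # records requested delays instead of actually sleeping
calls = {"flaky": 0, "buggy": 0}


def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("temporary outage")
    return "done"


def buggy():
    calls["buggy"] += 1
    raise ValueError("persistent bug")


result = run_with_retries(flaky, sleep=delays.append)

# A non-transient error propagates on the first attempt, without retries.
try:
    run_with_retries(buggy, sleep=delays.append)
except ValueError:
    pass
```

Retrying only transient errors keeps a persistent bug from consuming the whole retry budget before it is reported.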

ModuleLoadError

This error is raised when the system cannot load a required Python module, either because the module does not exist or because it contains syntax errors.

Common Causes:

  • Incorrect Module Path: The specified module path is wrong, or the module file is not in the expected location.
  • Missing Dependencies: The module relies on external packages that are not installed in the task's environment.
  • Syntax Errors: The Python code within the module contains syntax errors that prevent it from being parsed.
  • Circular Imports: Two or more modules import each other, so one of them is accessed before it has finished initializing.

Resolution and Prevention:

  • Verify Module Path and Existence: Ensure the module file exists at the specified path and is accessible.
  • Manage Dependencies: Explicitly list and install all required Python packages in the task's environment (e.g., via requirements.txt).
  • Lint and Test Code: Use linters and unit tests to catch syntax errors and import issues early in development.
  • Simplify Import Structure: Refactor complex import graphs to avoid circular dependencies.
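
The causes above can often be told apart with a quick check run in the same environment as the task. The sketch below uses the standard library's `importlib`; the helper name `diagnose_import` and its return strings are illustrative choices, not part of any SDK.

```python
import importlib
import importlib.util


def diagnose_import(module_name):
    """Classify why a module fails to load, without raising."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError) as exc:
        # For dotted names, a missing parent package raises here.
        return f"cannot resolve: {exc}"
    if spec is None:
        return "not found"
    try:
        importlib.import_module(module_name)
    except SyntaxError as exc:
        return f"syntax error at line {exc.lineno}"
    except ImportError as exc:
        # Typically a missing dependency or a circular import.
        return f"import failed: {exc}"
    return "ok"


statuses = {
    name: diagnose_import(name)
    for name in ("json", "no_such_module_xyz_123")
}
```

Note that `import_module` executes the module's top-level code, so this check is best suited to modules whose import has no side effects.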

TaskInterruptedError

This error indicates that a task's execution was interrupted before it could complete, often due to external signals.

Common Causes:

  • System Shutdown/Restart: The underlying host or container orchestrator initiated a shutdown or restart.
  • Resource Preemption: In shared environments, the task's resources might have been preempted by higher-priority workloads.
  • Manual Intervention: An operator or automated system sent an interruption signal (e.g., SIGTERM) to the task.

Resolution and Prevention:

  • Implement Checkpointing: Design tasks to periodically save their state, allowing them to resume from the last checkpoint upon restart.
  • Handle Signals Gracefully: Implement signal handlers in task code to perform cleanup or save progress when an interruption signal is received.
  • Choose Appropriate Execution Environments: Select environments that offer guarantees against preemption if task interruption is critical.
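
Checkpointing and signal handling work well together: the handler only records that a stop was requested, and the work loop saves progress after each item so a restart can resume. The following is a minimal sketch under stated assumptions: the checkpoint is a local JSON file, and every function name is hypothetical.

```python
import json
import os
import signal
import tempfile
import threading

stop_event = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request; the loop reacts to it."""
    stop_event.set()


def install_handler():
    """Call once at task start, from the main thread."""
    signal.signal(signal.SIGTERM, request_stop)


def load_checkpoint(path):
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    # Write to a temp file, then rename, so a crash never leaves a partial file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_items(items, checkpoint_path, handle):
    """Process items from the last checkpoint; return False if interrupted."""
    for index in range(load_checkpoint(checkpoint_path), len(items)):
        if stop_event.is_set():
            return False  # progress up to this point is already saved
        handle(items[index])
        save_checkpoint(checkpoint_path, index + 1)
    return True


# Simulated run: an interruption arrives while item 2 is being handled.
seen = []


def handle(item):
    seen.append(item)
    if item == 2:
        request_stop(signal.SIGTERM, None)  # stands in for a real SIGTERM


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.json")
    first_run = process_items([0, 1, 2, 3, 4], checkpoint, handle)
    seen_after_first = list(seen)
    stop_event.clear()  # a restarted task begins with no pending stop
    second_run = process_items([0, 1, 2, 3, 4], checkpoint, seen.append)
```

The second run resumes at item 3 rather than repeating completed work, which is the property that makes interruption cheap.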

RuntimeDataValidationError

This error occurs when data accessed at runtime is invalid or malformed, or when serialization or deserialization fails. It can also indicate an attempt to access a non-existent resource.

Common Causes:

  • Schema Mismatch: Data passed between tasks or stored externally does not conform to the expected schema.
  • Serialization/Deserialization Issues: Data cannot be correctly converted to or from its runtime representation (e.g., JSON, Protobuf) due to corruption or type mismatches.
  • Missing or Invalid Inputs: A task attempts to read an input variable that was not provided or is in an unexpected format.
  • Referencing Non-Existent Outputs: A downstream task tries to access an output from an upstream task that either failed or did not produce the expected output.

Resolution and Prevention:

  • Define and Enforce Schemas: Use data validation libraries or explicit schemas for all task inputs and outputs.
  • Robust Serialization Logic: Implement error handling around data serialization and deserialization.
  • Validate Inputs: Add explicit validation checks for all task inputs at the beginning of task execution.
  • Ensure Upstream Success: Verify that upstream tasks completed successfully and produced the expected outputs before attempting to consume them.
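
Validating inputs at the start of a task can be as simple as checking a raw dictionary against a declared schema. The sketch below uses a standard-library dataclass; `TrainInputs` and its fields are a hypothetical schema, and a dedicated validation library may be a better fit for nested or optional fields.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainInputs:
    """Hypothetical input schema for a task."""
    dataset_uri: str
    epochs: int
    learning_rate: float


def parse_inputs(raw):
    """Validate a raw dict against TrainInputs, reporting every problem at once."""
    errors = []
    for field in fields(TrainInputs):
        if field.name not in raw:
            errors.append(f"missing input: {field.name}")
            continue
        value = raw[field.name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, field.type):
            errors.append(
                f"{field.name}: expected {field.type.__name__}, "
                f"got {type(value).__name__}"
            )
    if errors:
        raise ValueError("; ".join(errors))
    return TrainInputs(**{f.name: raw[f.name] for f in fields(TrainInputs)})


valid = parse_inputs(
    {"dataset_uri": "s3://bucket/data", "epochs": 3, "learning_rate": 0.1}
)

try:
    parse_inputs({"dataset_uri": "s3://bucket/data", "epochs": "3"})
    message = ""
except ValueError as exc:
    message = str(exc)
```

Collecting all problems before raising means a single failed run reports every missing or mistyped input, not just the first one.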

OOMError

This error signifies that a task's execution failed due to an out-of-memory condition, meaning it attempted to use more memory than allocated.

Common Causes:

  • Memory Leaks: The task's code has a memory leak, continuously consuming more memory over time.
  • Processing Large Datasets: The task attempts to load or process an excessively large dataset entirely in memory.
  • Insufficient Memory Allocation: The memory limit configured for the task is too low for its actual workload.

Resolution and Prevention:

  • Optimize Memory Usage: Refactor task code to reduce memory footprint, process data in chunks, or use memory-efficient data structures.
  • Increase Memory Limits: Adjust the task's memory allocation to a higher value if the workload genuinely requires more resources.
  • Profile Memory Usage: Use memory profiling tools to identify and fix memory leaks in the task's code.
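
Processing data in chunks keeps peak memory proportional to the chunk size rather than to the dataset size. The sketch below counts lines in a binary stream without loading it whole; `iter_chunks` and `count_newlines` are illustrative helpers, and the in-memory `BytesIO` object stands in for a large file.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive chunks of at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def count_newlines(stream, chunk_size=64 * 1024):
    """Count newline bytes while holding only one chunk in memory at a time."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(stream, chunk_size))


data = b"a\nb\nc\n" * 1000
# A tiny chunk size is used here only to exercise the chunk boundaries.
line_count = count_newlines(io.BytesIO(data), chunk_size=7)
```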

ReferenceTaskError

This error is raised when a task attempts to reference or depend on another task that does not exist within the current workflow or system.

Common Causes:

  • Typo in Task Name: The name of the referenced task is misspelled.
  • Non-Existent Task: The referenced task was never defined, has been deleted, or is not part of the current execution scope.
  • Incorrect Scope: Attempting to reference a task that exists but is not accessible from the current task's context (e.g., across different workflows or projects).

Resolution and Prevention:

  • Verify Task Names: Double-check the spelling and existence of all referenced task names.
  • Ensure Task Definition: Confirm that the referenced task is correctly defined and registered within the system.
  • Understand Scoping Rules: Be aware of how tasks can reference each other across different levels of abstraction (e.g., within a workflow, across workflows).
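
Typos in task names can be caught early by resolving references against the list of registered tasks and suggesting near matches. This sketch uses the standard library's `difflib`; `resolve_task` and the task names are hypothetical, and how the registered names are obtained depends on the platform.

```python
import difflib


def resolve_task(name, registered):
    """Return name if it is registered; otherwise raise with close matches."""
    if name in registered:
        return name
    suggestions = difflib.get_close_matches(name, registered, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise KeyError(f"unknown task {name!r}.{hint}")


registered = ["preprocess_data", "train_model", "evaluate_model"]

resolved = resolve_task("train_model", registered)

try:
    resolve_task("trian_model", registered)
    message = ""
except KeyError as exc:
    message = exc.args[0]
```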

ImageBuildError

This error indicates a failure during the process of building a container image.

Common Causes:

  • Dockerfile Syntax Errors: The Dockerfile contains invalid instructions or syntax.
  • Missing Build Context Files: Files or directories referenced in the Dockerfile (e.g., via COPY, ADD) are not present in the build context.
  • Dependency Installation Failures: Package managers (e.g., pip, apt, npm) fail to install required dependencies during the build process.
  • Resource Constraints on Build Agent: The build agent runs out of memory or disk space during the image build.
  • Network Issues During Build: The build process cannot reach external repositories to download packages or base images.

Resolution and Prevention:

  • Review Build Logs: Examine the detailed output from the image build process for specific error messages.
  • Validate Dockerfile: Thoroughly check the Dockerfile for syntax errors and logical issues.
  • Ensure Complete Build Context: Verify that all files and directories referenced in the Dockerfile are included in the build context.
  • Test Build Locally: Perform local image builds to catch errors early.
  • Monitor Build Agent Resources: Ensure the build environment has sufficient resources.
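
A quick pre-build check can confirm that the sources named in COPY and ADD instructions exist in the build context. The parser below is a simplified sketch, not a full Dockerfile parser: it ignores line continuations and the JSON-array form, and it skips wildcards, URLs, and multi-stage `COPY --from` instructions. The helper name `missing_copy_sources` is hypothetical.

```python
import shlex
import tempfile
from pathlib import Path


def missing_copy_sources(dockerfile_text, context_dir):
    """List COPY/ADD sources that are absent from the build context."""
    missing = []
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        keyword = stripped.split(None, 1)[0].upper() if stripped else ""
        if keyword not in ("COPY", "ADD"):
            continue
        parts = shlex.split(stripped)[1:]
        flags = [p for p in parts if p.startswith("--")]
        args = [p for p in parts if not p.startswith("--")]
        # Sources of COPY --from come from another stage, not the context.
        if any(f.startswith("--from=") for f in flags) or len(args) < 2:
            continue
        for src in args[:-1]:  # the last argument is the destination
            if "://" in src or "*" in src or "?" in src:
                continue  # URLs (ADD) and wildcards are out of scope here
            if not (Path(context_dir) / src).exists():
                missing.append(src)
    return missing


dockerfile = """\
FROM python:3.11-slim
COPY requirements.txt /app/
COPY src/ /app/src/
COPY --from=builder /out /app/out
"""

with tempfile.TemporaryDirectory() as ctx:
    (Path(ctx) / "requirements.txt").write_text("requests\n")
    missing = missing_copy_sources(dockerfile, ctx)
```

In this example only `src/` is reported, because `requirements.txt` exists in the context and the `--from=builder` source is skipped.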