# Core concepts

## Job

A job in Ocypod represents some task created by clients, which will be queued, then fetched and processed by workers.

Each job has a set of metadata associated with it, some of which is managed by Ocypod, and some of which can be created/updated by clients/workers.

### Job lifecycle and statuses

When a job is initially created, it's added to a queue and assigned the `queued` status.

Clients will then poll that queue for new jobs, receiving the job's payload (the contents of its `input` field) and the job's ID. The job is removed from its queue, and its status is set to `running`.

If the client completes the job, it will send a message to Ocypod asking it to update the job's status to `completed`. If there's some error/exception and the client can't finish the job, it will mark the job as `failed`.

If the client neither completes nor fails a job before the job's timeout (or heartbeat timeout) is exceeded, then Ocypod marks the job as `timed_out`.

Ocypod will periodically look at all failed and timed out jobs and check if they're eligible for automatic retries, and if so, will re-queue them.

### Job metadata

The Ocypod server maintains the following information about a job, some of which is immutable, some of which will be modified by Ocypod throughout a job's lifecycle, and some of which is modifiable by clients.

* `id` - autogenerated ID for the job, generated when a job is first created and queued
* `queue` - name of the queue the job was created in
* `status` - current status of the job
* `tags` - list of tags (if any) assigned to this job at creation time
* `created_at` - date/time this job was first created and queued
* `started_at` - date/time this job was accepted by a client, and the job's status changed to `running`
* `ended_at` - date/time this job stopped running, whether due to successful completion, timing out, or failure
* `last_heartbeat` - date/time the last heartbeat for this job was sent by the client executing it
* `input` - the job's payload, sent by the client creating this job; this typically contains the data needed for a worker to execute the job
* `output` - any information the client working on this job decides to store here, such as the job's result, progress information, or partial results; it can be set at any time while the job is running
* `timeout` - maximum execution time of the job before it's marked as timed out
* `heartbeat_timeout` - maximum time without receiving a heartbeat before the job is marked as timed out
* `expires_after` - amount of time this job metadata will persist in Ocypod after the job reaches a final state (i.e. `completed`/`failed`/`timed_out` with no retries remaining)
* `retries` - number of times this job will automatically be requeued on failure
* `retries_attempted` - number of times this job has failed and been requeued
* `retry_delays` - minimum amount of time to wait between each retry attempt
* `ended` - indicates whether the job is in a final state or not (i.e.
  completed, or failed/timed out with no retries remaining)

## Job Status

A job in Ocypod will always have one of the following statuses:

* `queued` - set by the server when a job is first created and added to a queue
* `running` - set by the server when a worker picks up a job
* `completed` - set by the client to mark a job as successfully completed
* `failed` - set by the client to mark a job as having failed
* `timed_out` - set by the server when a job exceeds either its `timeout` or `heartbeat_timeout`
* `cancelled` - set by the client to mark that a job has been cancelled

To aid clients that are checking on the status of jobs, each job also has an `ended` boolean field. This is set to `true` if the job is in its final state, or `false` otherwise.

A job is marked as ended in the following circumstances:

* job has `completed` status
* job has `cancelled` status
* job has `failed` status and 0 retries remaining
* job has `timed_out` status and 0 retries remaining

## Queue

Each queue in Ocypod has its own settings, which are used as defaults for jobs created on that queue (though they can be overridden on a per-job basis).

A queue in Ocypod is a [FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)), with new jobs being added to the beginning of the queue, and workers taking jobs from the end of the queue.

### Queue settings

Each queue has a number of settings, which are defaults that are applied to new jobs created in that queue. Each can be overridden on a per-job basis; they exist at the queue level for convenience.

---

#### `timeout`

This is the maximum amount of time a job can run before it's considered to have timed out.

It's specified as a human-readable duration string, e.g. "30s", "1h15m5s", "3w2d", etc.

To disable timeouts entirely, this can be set to "0s".

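The duration strings accepted by `timeout` can be illustrated with a small parser. This is a minimal sketch, not Ocypod's own parsing code: the names `parse_duration` and `timeout_enabled` are hypothetical, and it assumes only the `w`, `d`, `h`, `m`, and `s` suffixes that appear in the examples above. The server itself may accept other units or formats.

```python
import re

# Seconds per unit. Only these five suffixes are handled in this sketch;
# the set of units the Ocypod server accepts may be larger (assumption).
_UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_duration(text: str) -> int:
    """Convert a duration string such as "1h15m5s" into whole seconds."""
    text = text.strip()
    if not text:
        raise ValueError("empty duration string")
    total = 0
    pos = 0
    for match in _TOKEN.finditer(text):
        # Every token must start exactly where the previous one ended,
        # so stray characters between tokens are rejected.
        if match.start() != pos:
            raise ValueError(f"unexpected text in duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos != len(text):
        raise ValueError(f"unexpected text in duration: {text!r}")
    return total


def timeout_enabled(text: str) -> bool:
    """A zero duration such as "0s" means the timeout is disabled."""
    return parse_duration(text) > 0
```

Under these assumptions, `parse_duration("1h15m5s")` yields 4505 seconds, and `timeout_enabled("0s")` is `False`, matching the rule that "0s" disables the timeout.
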

---

#### `heartbeat_timeout`

For long-running jobs, it's recommended that workers send regular heartbeats to the Ocypod server to let it know that the job is still being processed. This allows timeouts or failures to be noticed much earlier than if just relying on `timeout`.

The `heartbeat_timeout` setting determines how long a job can be running without getting a heartbeat update before it's considered to have timed out.

It's specified as a human-readable duration string.

To disable heartbeat timeouts entirely, this can be set to "0s".

---

#### `expires_after`

This setting determines how long jobs that have ended (either successfully completed, failed, or timed out without any retries) will remain in the system. After this period of time, the job and its metadata will be cleared from Ocypod.

This is specified as a human-readable duration string, and can be set to "0s" to disable expiry entirely. In this case, you'll be responsible for managing and cleaning up old jobs manually.

---

#### `retries`

This controls the number of times that jobs created in this queue will be automatically retried. If a job fails or times out and has retries remaining, it will be re-queued.

To disable retries, this can be set to 0.

#### `retry_delays`

This configures an optional list of delays to apply whenever a job is retried. This allows for different backoff strategies to be configured, depending on the application.

If the number of retries exceeds the number of retry delays specified, then the last value will continue to be used.

E.g. configuring a queue with `retries: 4` and `retry_delays: ["10s", "1m", "5m"]` means that if a job in this queue keeps failing, Ocypod will wait 10 seconds before retrying for the 1st time, 1 minute before retrying a 2nd time, and 5 minutes before retrying for the 3rd and 4th times.

To disable retry delays, this can be omitted, or set to an empty list.

## Tag

A tag is a short string that can be attached to a job at creation time.
An endpoint for getting all job IDs by tag is provided. This allows separate jobs to be grouped together. Use cases include:

* attaching a batch ID tag to a related set of jobs
* using a username tag to track all jobs belonging to a user
* using a source tag to track the client/process that created a job
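
The grouping that tags provide can be pictured as a lookup from tag to job IDs. The sketch below is a toy in-memory model, not Ocypod's storage or API: the `TagIndex` class, its method names, the integer job IDs, and the example tag strings are all hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy mapping from tag to job IDs (illustrative only, not Ocypod code)."""

    def __init__(self):
        # Each tag maps to the IDs of the jobs created with that tag.
        self._jobs_by_tag = defaultdict(list)

    def add_job(self, job_id, tags):
        """Record a job under every tag attached to it at creation time."""
        for tag in tags:
            self._jobs_by_tag[tag].append(job_id)

    def job_ids(self, tag):
        """Return a copy of the IDs of all jobs carrying the given tag."""
        return list(self._jobs_by_tag.get(tag, []))


index = TagIndex()
# Hypothetical tags: one batch ID tag and two username tags.
index.add_job(1, ["batch-42", "user-alice"])
index.add_job(2, ["batch-42", "user-bob"])
index.add_job(3, ["user-alice"])

batch_jobs = index.job_ids("batch-42")
alice_jobs = index.job_ids("user-alice")
```

Here `batch_jobs` collects every job in the batch regardless of owner, while `alice_jobs` collects one user's jobs across batches, which is the kind of cross-cutting grouping the use cases above describe.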