Implementing a Queueable Framework That Handles Failures Gracefully

Background

Asynchronous Apex plays an important role in implementing business logic on the Salesforce platform. As the extent and complexity of record-triggered logic grows, especially around core objects like Account and Opportunity, running all logic synchronously can breach Salesforce transaction limits. Asynchronous Apex allows us to move some logic processing into a new transaction, with a new set of limits, to relieve this pressure, and Queueable Apex is often the most suitable choice.

However, Queueable Apex has not been a perfect platform on which to implement resilient business logic. In particular, it is left to the developer to manage exceptions so that a failure is logged and, in the case of a transient error, can be retried. A common pattern for processing large sets of records is to “chain” Queueable instances together, with each instance performing a chunk of the work and then enqueueing a new instance to process the next chunk, until all work is complete. A failure is particularly catastrophic in this scenario because it breaks the chain: once a record fails and raises an unhandled exception, none of the remaining records are processed.
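To make the problem concrete, here is a minimal sketch of plain chaining without any failure handling. The class name NaiveChunkJob, the chunk size, and the processChunk placeholder are hypothetical and only serve to illustrate the pattern.

```apex
// Hypothetical example of plain Queueable chaining, without the framework.
public class NaiveChunkJob implements Queueable {
  // Illustrative chunk size, not a platform limit.
  private static final Integer CHUNK_SIZE = 200;
  private List<Id> recordIds;

  public NaiveChunkJob(List<Id> recordIds) {
    this.recordIds = recordIds;
  }

  public void execute(QueueableContext context) {
    // Split the remaining work into the current chunk and the rest.
    List<Id> chunk = new List<Id>();
    List<Id> remaining = new List<Id>();
    for (Integer i = 0; i < recordIds.size(); i++) {
      if (i < CHUNK_SIZE) {
        chunk.add(recordIds[i]);
      } else {
        remaining.add(recordIds[i]);
      }
    }

    // If this call throws an unhandled exception, the enqueueJob call
    // below is never reached and the rest of the chain is lost.
    processChunk(chunk);

    if (!remaining.isEmpty()) {
      System.enqueueJob(new NaiveChunkJob(remaining));
    }
  }

  private void processChunk(List<Id> chunk) {
    // Placeholder for business logic.
  }
}
```

Nothing in this class records the failure or the Ids that were still waiting, which is the gap the rest of this post tries to close.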

Enter Transaction Finalizers

In the spring 2020 release, Salesforce introduced Transaction Finalizers, a new feature of Queueable Apex that gives developers more support to build resilient solutions with Queueable Apex. Finalizer classes can be attached to Queueable Apex jobs, and will be executed regardless of the outcome of the job, even in the case of an unhandled exception. This provides a straightforward way to attach actions such as logging, notifications, and retries to Queueable Apex.
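As a rough illustration of the mechanism, the sketch below attaches a finalizer that only writes to the debug log. The outer class FinalizerSketch and its inner classes are hypothetical names; System.attachFinalizer, FinalizerContext and ParentJobResult are the platform features described above.

```apex
// Hypothetical wrapper class so the sketch can be deployed as a single file.
public class FinalizerSketch {
  public class LoggingFinalizer implements Finalizer {
    public void execute(FinalizerContext context) {
      Id jobId = context.getAsyncApexJobId();
      if (context.getResult() == ParentJobResult.SUCCESS) {
        System.debug('Job ' + jobId + ' completed successfully');
      } else {
        // For UNHANDLED_EXCEPTION, getException() returns the exception
        // that ended the parent job.
        System.debug(
          LoggingLevel.ERROR,
          'Job ' + jobId + ' failed: ' + context.getException().getMessage()
        );
      }
    }
  }

  public class SampleJob implements Queueable {
    public void execute(QueueableContext context) {
      // Attach the finalizer before doing any work so that it still runs
      // if the code below throws.
      System.attachFinalizer(new LoggingFinalizer());
      // ... business logic that may throw ...
    }
  }
}
```

Enqueueing it works like any other Queueable: System.enqueueJob(new FinalizerSketch.SampleJob());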

The remainder of this post describes a Queueable Apex framework that uses a Transaction Finalizer to handle failures and allow for retries. At the bottom of the post, you will find a link to the source code, licensed under the MIT license.

The Framework

Key features of the framework:

- Uses a custom sObject Async_Job_Failure__c to store details of failed jobs, which can then be retried using the RetryService class.
- Handles Queueable job chaining: after a failure, the remainder of the chain will still be processed.
- Includes Apex tests for the entire framework in RecoverableQueueableTest, which uses a sample class RecoverableQueueableTestJob.

There are three core classes to the framework:

`RecoverableQueueable`

Extend this abstract class in your own classes to bring them under the framework: in your class declaration, instead of implements Queueable, use extends RecoverableQueueable.

In your class you must implement two methods: work, which contains your business logic, and getJobName, which returns the class name. RetryService later passes this name to Type.forName, so it must resolve to your class. If you want to chain your class, you also need to override getNextJob, which should return the next link in the chain (i.e. the next instance of your class to be enqueued when the current one completes) and should return null at the end of the chain.

public abstract inherited sharing class RecoverableQueueable implements Queueable {
  // Control flag to determine if the next job should be enqueued on success.
  // Defaults to true, but RetryService can set this to false.
  public Boolean chainOnSuccess = true;

  // Test-visible list to track chained jobs in unit tests without actually enqueueing them
  @TestVisible
  public static List<Queueable> testChainedJobs = new List<Queueable>();

  public void execute(QueueableContext context) {
    // Attach the finalizer before doing any work so that it still runs if work() throws.
    System.attachFinalizer(new JobFinalizer(this));

    // Execute the actual business logic
    this.work(context);

    // Handle Auto-Chaining
    if (this.chainOnSuccess) {
      Queueable next = this.getNextJob();
      if (next != null) {
        enqueueChain(next);
      }
    }
  }

  // Centralized method to handle chaining, allowing interception during tests
  public static void enqueueChain(Queueable job) {
    if (Test.isRunningTest()) {
      testChainedJobs.add(job);
    } else {
      System.enqueueJob(job);
    }
  }

  // Abstract method for subclasses to implement their logic
  protected abstract void work(QueueableContext context);

  // Get class name
  public abstract String getJobName();

  // Virtual method to provide the next job in the chain.
  // Override this if you want to implement a chain that continues even if this job fails.
  public virtual Queueable getNextJob() {
    return null;
  }
}

Looking under the hood of the class, you can see that when a RecoverableQueueable is executed by Apex, it does three things: (a) attaches the finalizer, which we will consider next, (b) executes the business logic in work, and (c) attempts to chain, provided chainOnSuccess still has its default value of true and getNextJob returns a non-null result.
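To show how a subclass might fit this contract, here is a minimal sketch of a chained job. The class name AccountRatingJob, the chunk size, and the business logic (setting the standard Rating field to 'Warm') are hypothetical; only the three overridden methods come from the framework. Note that getNextJob is derived purely from the job's own state, since the framework also calls it from the finalizer after a failure.

```apex
// Hypothetical subclass; the name, fields and business logic are illustrative.
public inherited sharing class AccountRatingJob extends RecoverableQueueable {
  // Illustrative chunk size, not a platform limit.
  private static final Integer CHUNK_SIZE = 200;

  // Ids still to be processed. This state ends up in Payload__c on failure.
  public List<Id> accountIds;

  public AccountRatingJob(List<Id> accountIds) {
    this.accountIds = accountIds;
  }

  protected override void work(QueueableContext context) {
    // Process only the first chunk of the remaining Ids.
    List<Id> chunk = slice(0, Math.min(CHUNK_SIZE, accountIds.size()));
    List<Account> accounts = [SELECT Id, Rating FROM Account WHERE Id IN :chunk];
    for (Account acc : accounts) {
      acc.Rating = 'Warm';
    }
    update accounts;
  }

  public override String getJobName() {
    return 'AccountRatingJob';
  }

  public override Queueable getNextJob() {
    // Return null at the end of the chain, as the framework expects.
    if (accountIds.size() <= CHUNK_SIZE) {
      return null;
    }
    return new AccountRatingJob(slice(CHUNK_SIZE, accountIds.size()));
  }

  // Copies accountIds[fromIndex] up to, but not including, accountIds[toIndex].
  private List<Id> slice(Integer fromIndex, Integer toIndex) {
    List<Id> result = new List<Id>();
    for (Integer i = fromIndex; i < toIndex; i++) {
      result.add(accountIds[i]);
    }
    return result;
  }
}
```

The chain would be started with System.enqueueJob(new AccountRatingJob(allAccountIds)), where allAccountIds is whatever list of Ids you need to process.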

`JobFinalizer`

public without sharing class JobFinalizer implements Finalizer {
  private RecoverableQueueable job;
  private String jobClassName;

  public JobFinalizer(RecoverableQueueable job) {
    this.job = job;
    // Capture the class name for deserialization later.
    this.jobClassName = job.getJobName();
  }

  public void execute(FinalizerContext context) {
    // We only care about failures (UNHANDLED_EXCEPTION)
    if (context.getResult() == ParentJobResult.UNHANDLED_EXCEPTION) {
      handleFailure(context);
    }
  }

  private void handleFailure(FinalizerContext context) {
    Exception ex = context.getException();

    Async_Job_Failure__c failure = new Async_Job_Failure__c();
    failure.Apex_Class_Name__c = this.jobClassName;
    failure.Job_ID__c = context.getAsyncApexJobId();
    failure.Error_Message__c = ex != null ? ex.getMessage() : 'Unknown Error';
    failure.Stack_Trace__c = ex != null ? ex.getStackTraceString() : '';
    failure.Payload__c = JSON.serialize(this.job);

    // Ensure field lengths are respected (assumes long text area fields of length 32,768; adjust as needed)
    if (
      failure.Error_Message__c != null &&
      failure.Error_Message__c.length() > 32768
    ) {
      failure.Error_Message__c = failure.Error_Message__c.substring(0, 32768);
    }

    if (
      failure.Stack_Trace__c != null &&
      failure.Stack_Trace__c.length() > 32768
    ) {
      failure.Stack_Trace__c = failure.Stack_Trace__c.substring(0, 32768);
    }

    failure.Status__c = this.attemptChainRecovery();

    insert failure;
  }

  private String attemptChainRecovery() {
    try {
      // Even if this job failed, we attempt to enqueue the next job in the chain
      // so that the rest of the process is not blocked.
      Queueable next = this.job.getNextJob();
      if (next != null) {
        RecoverableQueueable.enqueueChain(next);
        return 'Failed - chain recovered';
      } else {
        return 'Failed - no chain to recover';
      }
    } catch (Exception e) {
      System.debug(
        LoggingLevel.ERROR,
        'Failed to recover chain: ' + e.getMessage()
      );
      return 'Failed - chain recovery failed - ' + e.getMessage();
    }
  }
}

The finalizer class is instantiated with the RecoverableQueueable job. It does nothing in the event that the job was completed successfully. In the event of an unhandled exception, it executes handleFailure.

handleFailure serializes the job to JSON and creates a new Async_Job_Failure__c record, storing the JSON as well as the exception message and stack trace for analysis. It also attempts to recover the chain via attemptChainRecovery and records the outcome in the Status__c field of the failure record.

WARNING
Storing the serialized job on the sObject means the size of the JSON is limited by the length of the Payload__c field, which for a long text area is at most 131,072 characters. If this limitation is problematic in your use case, consider updating the handleFailure method to store the JSON as a file attached to the sObject instead.
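One possible shape for the file-based alternative is sketched below. JobPayloadFileStore is a hypothetical helper that is not part of the framework, and it assumes the failure record has already been inserted and has at most one file attached. handleFailure would need to insert the record first and then call savePayload, and RetryService would need to call loadPayload instead of reading Payload__c.

```apex
// Hypothetical helper, not part of the framework.
public without sharing class JobPayloadFileStore {
  // Stores the serialized job as a file linked to an existing failure record.
  public static Id savePayload(Id failureId, String payloadJson) {
    ContentVersion version = new ContentVersion();
    version.Title = 'Job payload';
    version.PathOnClient = 'payload.json';
    version.VersionData = Blob.valueOf(payloadJson);
    // Setting FirstPublishLocationId links the new file to the record.
    version.FirstPublishLocationId = failureId;
    insert version;
    return version.Id;
  }

  // Reads the payload back, assuming one file is attached to the record.
  public static String loadPayload(Id failureId) {
    ContentDocumentLink link = [
      SELECT ContentDocument.LatestPublishedVersionId
      FROM ContentDocumentLink
      WHERE LinkedEntityId = :failureId
      LIMIT 1
    ];
    ContentVersion version = [
      SELECT VersionData
      FROM ContentVersion
      WHERE Id = :link.ContentDocument.LatestPublishedVersionId
    ];
    return version.VersionData.toString();
  }
}
```

The payload is then bounded by the Apex heap limit rather than by a field length, at the cost of an extra DML statement in the finalizer and two extra queries on retry.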

attemptChainRecovery gets the next link in the chain from the job and attempts to enqueue it.

`RetryService`

The final element of the framework is the RetryService class. This class can be used to retry a failed job, represented by an Async_Job_Failure__c record.

public with sharing class RetryService {
  public static Id retryJob(Id failureId) {
    return retryJob(failureId, false); // Default: Do not continue chain on retry (assume chain continued via Finalizer)
  }

  public static Id retryJob(Id failureId, Boolean continueChain) {
    // Retrieve the failure record
    Async_Job_Failure__c failure = [
      SELECT Payload__c, Apex_Class_Name__c, Status__c
      FROM Async_Job_Failure__c
      WHERE Id = :failureId
      LIMIT 1
    ];

    // Prevent double-retries
    if (failure.Status__c == 'Retried') {
      throw new RetryException('This job has already been retried.');
    }

    // Resolve the Type
    Type jobType = Type.forName(failure.Apex_Class_Name__c);
    if (jobType == null) {
      throw new RetryException(
        'Could not find type: ' + failure.Apex_Class_Name__c
      );
    }

    // Deserialize the job state
    Object jobInstance = JSON.deserialize(failure.Payload__c, jobType);

    Id newJobId;

    // Re-enqueue the job
    if (jobInstance instanceof RecoverableQueueable) {
      RecoverableQueueable recoverable = (RecoverableQueueable) jobInstance;
      recoverable.chainOnSuccess = continueChain;
      newJobId = System.enqueueJob(recoverable);
    } else if (jobInstance instanceof Queueable) {
      // Fallback for non-recoverable queueables (shouldn't happen with this framework)
      newJobId = System.enqueueJob((Queueable) jobInstance);
    } else {
      throw new RetryException('Restored object is not a Queueable');
    }

    // Update status to prevent future retries
    failure.Status__c = 'Retried';
    failure.Retry_Job_ID__c = newJobId;
    update failure;

    return newJobId;
  }

  public class RetryException extends Exception {
  }
}

This class exposes the public method retryJob, which accepts the Id of an Async_Job_Failure__c record and re-enqueues the job that led to the failure, deserializing the job from the JSON stored in the Payload__c field of the failure record.

By default, retryJob will NOT continue the chain if the failed job was part of a chaining Queueable. It assumes that the finalizer already carried on with the rest of the chain after the original failure, as it tries to do. If you would like the retry to also continue the chain from the failed job, pass true as the second parameter to retryJob.
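As a usage sketch, the helper below retries a small batch of failure records. FailureRetryRunner is a hypothetical class that is not part of the framework, and it assumes that Status__c is a filterable field (for example, a text field) so that it can appear in a WHERE clause.

```apex
// Hypothetical helper, not part of the framework.
public with sharing class FailureRetryRunner {
  // Retries up to maxJobs failures that have not been retried yet and
  // returns the Ids of the newly enqueued jobs.
  public static List<Id> retryPending(Integer maxJobs) {
    List<Id> newJobIds = new List<Id>();
    // Assumes Status__c can be used in a filter.
    for (Async_Job_Failure__c failure : [
      SELECT Id
      FROM Async_Job_Failure__c
      WHERE Status__c != 'Retried'
      ORDER BY CreatedDate
      LIMIT :maxJobs
    ]) {
      try {
        // Use RetryService.retryJob(failure.Id, true) to also continue the chain.
        newJobIds.add(RetryService.retryJob(failure.Id));
      } catch (Exception e) {
        // Leave the record untouched so it can be inspected later.
        System.debug(
          LoggingLevel.ERROR,
          'Retry failed for ' + failure.Id + ': ' + e.getMessage()
        );
      }
    }
    return newJobIds;
  }
}
```

Calling FailureRetryRunner.retryPending(10) from anonymous Apex would retry the ten oldest outstanding failures. Each call to retryJob issues its own query, update and enqueue, so maxJobs should stay small enough to remain within the limits of a single transaction.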

Summary

I hope you find this simple framework to be useful as you consider how to implement resilient asynchronous logic on the Salesforce platform.

A link to the source code is below. Please use it at your own risk, no guarantees!

Source Code on GitHub