Handling Deployment Failures
This chapter explains how to handle failures during KuboCD Release deployments.
As described in Under the hood, KuboCD generates a FluxCD HelmRelease resource for each module in a package. The Flux helm-controller then executes the deployment using its embedded Helm instance.
The Flux HelmRelease resource provides several parameters to configure behavior when failures occur:
- HelmRelease.spec.timeout: The wait time for individual Kubernetes operations (e.g., Hooks) during a Helm action.
- HelmRelease.spec.install: Configuration for Helm install actions.
- HelmRelease.spec.upgrade: Configuration for Helm upgrade actions.
For full details, refer to the Flux documentation for install and upgrade.
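As a rough illustration of where these parameters sit, here is a minimal fragment of a HelmRelease spec. It is a sketch only: the required fields (chart reference, interval) are omitted, and all values are illustrative rather than taken from a KuboCD-generated resource.

```yaml
# Fragment of a Flux HelmRelease spec; required fields omitted for brevity.
spec:
  timeout: 3m0s        # wait time applied to every Helm action unless overridden
  install:
    timeout: 5m0s      # per-action override, applies to install only
  upgrade:
    remediation:
      retries: 3       # illustrative value
```

The per-action timeout under install or upgrade falls back to spec.timeout when it is not set.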
Failure Strategies
A common requirement is to automatically retry or remediate failed deployments. For example, the following configuration attempts to uninstall and re-install a failed release up to 10 times:
spec:
  install:
    strategy:
      name: RemediateOnFailure
    remediation:
      retries: 10
  upgrade:
    strategy:
      name: RemediateOnFailure
    remediation:
      retries: 10
To avoid manually defining these verbose configurations for every release, KuboCD allows you to define named strategies in the global configuration (Config).
Configuring Global Strategies
Sample Config resource:
apiVersion: kubocd.kubotal.io/v1alpha1
kind: Config
metadata:
  name: global-config
spec:
  defaultHelmTimeout: 3m0s
  defaultOnFailureStrategy: updateOnFailure
  onFailureStrategies:
    - name: stopOnFailure
      strategy: {}
    - name: reinstallOnFailure
      strategy:
        install:
          strategy:
            name: RemediateOnFailure
          remediation:
            retries: 10
        upgrade:
          strategy:
            name: RemediateOnFailure
          remediation:
            retries: 10
    - name: updateOnFailure
      strategy:
        install:
          strategy:
            name: RetryOnFailure
            retryInterval: 1m0s
        upgrade:
          strategy:
            name: RetryOnFailure
            retryInterval: 1m0s
Defined Strategies:
- stopOnFailure: Do nothing on failure (default Flux behavior).
- reinstallOnFailure: Uninstall and re-install up to 10 times.
- updateOnFailure: Retry the update every minute until success.
Default Settings:
- defaultOnFailureStrategy: Sets the strategy used by default for all deployments (updateOnFailure in this example).
- defaultHelmTimeout: Sets the default HelmRelease.spec.timeout (3 minutes).
Applying Strategies
Strategies can be applied or overridden at multiple levels:
- Global Default: As defined in defaultOnFailureStrategy.
- Package Level: Override per module using package.module[X].onFailureStrategy and timeout.
- Release Level: Override per module using moduleOverrides.
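A package-level override might look like the fragment below. This is a hedged sketch: only the onFailureStrategy and timeout attributes come from the text above, while the list key (modules), the module name, and the chosen values are assumptions made for illustration. The rest of the package manifest is omitted.

```yaml
# Hypothetical fragment of a package manifest.
# Assumed: the "modules" list key and the module name "main".
modules:
  - name: main
    onFailureStrategy: stopOnFailure   # must match a strategy name defined in the Config
    timeout: 5m                        # illustrative value
```

Setting the strategy in the package suits modules that should always behave the same way, whichever cluster-wide default is configured.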
Example: Overriding at Release Level
apiVersion: kubocd.kubotal.io/v1alpha1
kind: Release
metadata:
  name: podinfo1
  namespace: default
spec:
  package:
    repository: quay.io/kubodoc/packages/podinfo
    tag: 6.7.1-p01
  moduleOverrides:
    main:
      onFailureStrategy: reinstallOnFailure
      timeout: 2m
In this example, the main module will use the reinstallOnFailure strategy and a custom timeout of 2 minutes.
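To make the effect concrete, the fragment below sketches the fields one would expect to find in the HelmRelease generated for the main module. It assumes that KuboCD copies the install and upgrade blocks of the reinstallOnFailure strategy from the sample Config unchanged and maps the module timeout to HelmRelease.spec.timeout. The actual generated resource is not shown in this chapter, so treat this as an illustration rather than verified output.

```yaml
# Assumed fragment of the generated HelmRelease for module "main".
spec:
  timeout: 2m                      # from moduleOverrides.main.timeout
  install:
    strategy:
      name: RemediateOnFailure     # from the reinstallOnFailure strategy
    remediation:
      retries: 10
  upgrade:
    strategy:
      name: RemediateOnFailure
    remediation:
      retries: 10
```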