Retries¶
A Retry strategy defines the "cooling-off" period for a circuit breaker. It controls when the breaker should attempt to recover by transitioning from the OPEN state to the HALF_OPEN state. Choosing the right strategy is key to balancing fast recovery with giving a struggling service enough time to heal.
| Retry Type | Transition Timing | Best For... |
|---|---|---|
| Always | Immediately | Non-critical services where immediate retries are acceptable. |
| Never | Manual only | When recovery requires manual intervention by an operator. |
| Cooldown | After a fixed delay | A simple, predictable wait time before attempting recovery. |
| Backoff | After an exponentially increasing delay | An adaptive approach that waits longer after repeated failures. |
Always¶
This strategy moves the circuit to HALF_OPEN immediately after it opens. On any subsequent call, it will attempt a recovery.
Use with Caution
Always can be dangerous, as it encourages a "thundering herd" problem where many clients retry simultaneously, overwhelming a service that is trying to recover. It's best used for non-critical services where failures are known to be very brief.
from fluxgate import CircuitBreaker
from fluxgate.retries import Always
# Not generally recommended.
cb = CircuitBreaker(name="api", retry=Always(), ...)
Never¶
This strategy keeps the circuit in the OPEN state indefinitely until it is manually reset. This is useful when a service requires human intervention to be fixed.
from fluxgate import CircuitBreaker
from fluxgate.retries import Never
cb = CircuitBreaker(name="api", retry=Never(), ...)
# An operator must manually reset the breaker after fixing the service.
cb.reset()
Cooldown¶
This is a simple and common strategy that waits for a fixed duration (in seconds) before moving to HALF_OPEN.
It's a good default choice for services that tend to recover in a predictable amount of time.
from fluxgate import CircuitBreaker
from fluxgate.retries import Cooldown
# Wait for 60 seconds before the first recovery attempt.
cb = CircuitBreaker(
name="api",
retry=Cooldown(duration=60.0),
...
)
Backoff¶
This is the most robust and recommended strategy. It increases the wait time exponentially after each consecutive failure, giving a struggling service more and more time to recover.
The wait time is calculated as initial * (multiplier ** consecutive_failures).
from fluxgate import CircuitBreaker
from fluxgate.retries import Backoff
# The wait time starts at 10s, doubles after each failed recovery attempt,
# and is capped at a maximum of 300s.
cb = CircuitBreaker(
name="api",
retry=Backoff(
initial=10.0,
multiplier=2.0,
max_duration=300.0
),
...
)
# Sequence of wait times:
# 1st attempt -> 10s
# 2nd attempt -> 20s
# 3rd attempt -> 40s
# 4th attempt -> 80s
# 5th attempt -> 160s
# 6th+ attempt -> 300s (capped by max_duration)
Choosing the Right Retry Strategy¶
Comparison¶
| Feature | Always | Never | Cooldown | Backoff |
|---|---|---|---|---|
| Recovery | Immediate | Manual | Fixed Delay | Exponential Delay |
| Service Load | High | None | Medium | Low |
| Handles Repeated Failures? | No | N/A | No | Yes |
| Complexity | Very Simple | Very Simple | Simple | Medium |
| Recommended? | No | For special cases | Good default | Recommended |
When should I use Always?¶
Only for non-critical services where failures are known to be extremely brief and the service can handle a high volume of retries.
When should I use Never?¶
When a service requires manual intervention to fix. The circuit breaker will not attempt to recover on its own.
- Use case: During a planned deployment or when a service is taken down for maintenance.
When should I use Cooldown?¶
This is a great, simple default. It's best when you have a general idea of how long the service takes to recover.
- Use case: Protecting a service that has a predictable recovery time, like an external API with a fixed rate-limiting window.
When should I use Backoff?¶
This is the recommended strategy for most use cases. It gracefully backs off from a struggling service, giving it more time to recover after repeated failures.
- Use case: Protecting a critical downstream service that may be slow to restart or recover from an outage.
A Note on Jitter¶
Always Add Jitter
For both Cooldown and Backoff, it is highly recommended to add jitter. Jitter adds a small amount of randomness to the wait time, which helps prevent a "thundering herd" scenario where multiple instances of your service all try to recover at the exact same time.
from fluxgate.retries import Cooldown, Backoff
# For a 60s cooldown, jitter adds a +/- 6s random variation (54s to 66s).
retry_cooldown = Cooldown(duration=60.0, jitter_ratio=0.1)
# The same applies to each step of the backoff.
retry_backoff = Backoff(initial=10.0, jitter_ratio=0.1)