
Composable DSL for async operations implementing Timeout, Retry (with backoffs), Circuit Breaker, Rate Limiter, Bulkhead, Hedging, in-memory TTL Cache and Fallback, with real-time telemetry stream.
A Kotlin Multiplatform library providing resilience patterns (Timeout, Retry, Circuit Breaker, Rate Limiter, Bulkhead, Hedging, Cache, Fallback) for suspend functions. Compose them declaratively with a small DSL and observe runtime telemetry via Flow.
```kotlin
import com.santimattius.resilient.composition.resilient
import com.santimattius.resilient.retry.ExponentialBackoff
import kotlin.time.Duration.Companion.seconds
import kotlin.time.Duration.Companion.milliseconds

val policy = resilient {
    timeout { timeout = 2.seconds }
    retry {
        maxAttempts = 3
        backoffStrategy = ExponentialBackoff(initialDelay = 100.milliseconds)
    }
    circuitBreaker {
        failureThreshold = 5
        successThreshold = 2
        timeout = 30.seconds
    }
}

suspend fun call(): String = policy.execute { fetchData() }
```

Observe runtime events from policy.events (SharedFlow):
```kotlin
val job = scope.launch {
    policy.events.collect { event -> println(event) }
}
```

Outer → Inner (default):
The composition order is configurable. Use compositionOrder() to specify a custom order:
```kotlin
import com.santimattius.resilient.composition.OrderablePolicyType

val policy = resilient(scope) {
    compositionOrder(listOf(
        OrderablePolicyType.CACHE,           // Check cache first (after Fallback)
        OrderablePolicyType.COALESCE,        // Deduplicate in-flight requests by key
        OrderablePolicyType.TIMEOUT,         // Then apply timeout
        OrderablePolicyType.RETRY,           // Retry on failures
        OrderablePolicyType.CIRCUIT_BREAKER,
        OrderablePolicyType.RATE_LIMITER,
        OrderablePolicyType.BULKHEAD,
        OrderablePolicyType.HEDGING
    ))
    // Fallback is automatically added as the outermost policy
    // ... configure policies
}
```

Important Notes:
- Fallback is not part of OrderablePolicyType; it is always positioned outermost automatically.
- compositionOrder(...) accepts a subset: policy types not included are appended in the default order.
- Use compositionOrder(listOf(..., OrderablePolicyType.RETRY, OrderablePolicyType.TIMEOUT, ...)) if you want a per-attempt timeout.

The timeout policy aborts the operation after a configured duration. onTimeout runs only on actual timeouts. This is a single time limit for the block as wrapped by the policy (see Timeout vs Retry order). For a timeout per retry attempt, use Retry with perAttemptTimeout instead.
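The difference between a single time limit and a per-attempt limit is easiest to see as a worst-case time budget. The sketch below is illustrative arithmetic only: both helper functions are hypothetical and not part of the library, and it assumes the backoff delays are fixed and known in advance (no jitter).

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.Duration.Companion.seconds

// Timeout wraps Retry: one deadline covers every attempt and every backoff.
fun worstCaseTimeoutOuter(timeout: Duration): Duration = timeout

// Retry wraps Timeout (or perAttemptTimeout is used): each attempt gets its own
// deadline, and the backoff delays between attempts are added on top.
fun worstCaseRetryOuter(perAttemptTimeout: Duration, backoffs: List<Duration>): Duration {
    val attempts = backoffs.size + 1 // one backoff between each pair of attempts
    return perAttemptTimeout * attempts + backoffs.fold(Duration.ZERO) { acc, d -> acc + d }
}

fun main() {
    // Assumed scenario: 3 attempts, 2 s limit, backoffs of 100 ms and 200 ms.
    val backoffs = listOf(100.milliseconds, 200.milliseconds)
    println("timeout outer: ${worstCaseTimeoutOuter(2.seconds)}")
    println("retry outer:   ${worstCaseRetryOuter(2.seconds, backoffs)}")
}
```

Under these assumptions the caller waits at most 2 seconds when the timeout is outermost, and up to 6.3 seconds (3 × 2 s + 100 ms + 200 ms) when each attempt has its own limit.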
```kotlin
val policy = resilient {
    timeout {
        timeout = 3.seconds
        onTimeout = { /* log metric */ }
    }
}
```

The retry policy retries failing operations according to a backoff strategy and a predicate. Use shouldRetry to avoid retrying non-transient errors (e.g. 4xx client errors); retry only on 5xx, network/IO, or timeouts. The optional perAttemptTimeout limits each attempt (including the first) so that a single slow attempt does not consume the whole retry budget.
Optional shouldRetryResult retries when the block returns successfully but the value is not acceptable (e.g. HTTP 202, empty body, “not ready” flag). Return true from the predicate to request another attempt; it shares the same maxAttempts budget as exception-based retries. onRetry receives RetryableResultException (with lastValue) for telemetry; if attempts are exhausted while the predicate stays true, the last returned value is returned (unlike exception exhaustion, which rethrows).
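The two exhaustion behaviours described above can be sketched with a simplified, blocking retry loop. This is a hypothetical stand-in written for illustration, not the library's implementation: `retryBlocking` is an invented name, and delays, coroutines, and the onRetry callback are left out.

```kotlin
// Hypothetical stand-in for a retry loop; exceptions and unacceptable results
// draw from the same maxAttempts budget.
fun <T> retryBlocking(
    maxAttempts: Int,
    shouldRetry: (Throwable) -> Boolean = { true },
    shouldRetryResult: (T) -> Boolean = { false },
    block: (attempt: Int) -> T,
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    for (attempt in 1..maxAttempts) {
        val isLast = attempt == maxAttempts
        val value = try {
            block(attempt)
        } catch (e: Exception) {
            // Exception exhaustion (or a non-retryable error) rethrows.
            if (isLast || !shouldRetry(e)) throw e
            continue
        }
        // Result exhaustion returns the last value instead of throwing.
        if (isLast || !shouldRetryResult(value)) return value
    }
    error("unreachable: the last attempt always returns or throws")
}

fun main() {
    val statuses = listOf(503, 502, 500)
    val last = retryBlocking<Int>(maxAttempts = 3, shouldRetryResult = { it >= 500 }) { attempt ->
        statuses[attempt - 1]
    }
    println(last) // 500: the last unacceptable value is returned

    val outcome = runCatching {
        retryBlocking<Int>(maxAttempts = 2) { throw IllegalStateException("boom") }
    }
    println(outcome.isFailure) // true: the last exception is rethrown
}
```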
```kotlin
import com.santimattius.resilient.retry.*

val policy = resilient {
    retry {
        maxAttempts = 4
        shouldRetry = { it is java.io.IOException } // e.g. only retry IO; do not retry 4xx
        shouldRetryResult = { status -> (status as Int) >= 500 } // e.g. retry on 5xx-like codes returned as value
        perAttemptTimeout = 5.seconds // optional: timeout per attempt
        backoffStrategy = ExponentialBackoff(
            initialDelay = 200.milliseconds,
            maxDelay = 5.seconds,
            factor = 2.0,
            jitter = true
        )
    }
}
```

Supported backoffs: ExponentialBackoff, LinearBackoff, FixedBackoff.
→ In-depth documentation: docs/patterns/retry.md
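To get a feel for the delays an exponential backoff produces, the following sketch computes a schedule for the parameters used above. The formula (initialDelay × factor^(retry − 1), capped at maxDelay) is an assumption based on the usual definition of exponential backoff; the source does not spell out the library's exact formula, and jitter is omitted so the numbers are deterministic. `exponentialDelay` is a hypothetical helper name.

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.Duration.Companion.seconds

// Assumed formula: initialDelay * factor^(retry - 1), capped at maxDelay.
// `retry` is 1-based: retry = 1 is the delay before the second attempt.
fun exponentialDelay(retry: Int, initialDelay: Duration, maxDelay: Duration, factor: Double): Duration {
    var delay = initialDelay
    repeat(retry - 1) { delay = minOf(delay * factor, maxDelay) }
    return minOf(delay, maxDelay)
}

fun main() {
    val schedule = (1..6).map { exponentialDelay(it, 200.milliseconds, 5.seconds, 2.0) }
    println(schedule)
}
```

With initialDelay = 200 ms, factor = 2.0 and maxDelay = 5 s, the assumed schedule is 200 ms, 400 ms, 800 ms, 1.6 s, 3.2 s and then 5 s for every later retry.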
The circuit breaker prevents hammering failing downstreams. States: CLOSED → OPEN → HALF_OPEN.
By default, failureThreshold counts consecutive failures in CLOSED (a success resets the count). Set slidingWindow to a positive duration to open when failureThreshold failures fall within that time window (not necessarily consecutive); successes only prune expired timestamps from the window.
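The two counting modes described above can be modeled with a small tracker. This is a hypothetical model of the CLOSED-state accounting only (`FailureTracker` is an invented name, timestamps are plain milliseconds, and the OPEN and HALF_OPEN transitions are omitted); it is a sketch of the described behaviour, not the library's code.

```kotlin
// Hypothetical model of failure counting while the circuit is CLOSED.
class FailureTracker(private val failureThreshold: Int, private val slidingWindowMs: Long? = null) {
    private var consecutive = 0
    private val failureTimes = ArrayDeque<Long>()

    /** Records a success: resets the consecutive count, or only prunes the window. */
    fun onSuccess(nowMs: Long) {
        if (slidingWindowMs == null) consecutive = 0 else prune(nowMs)
    }

    /** Records a failure and returns true when the circuit should open. */
    fun onFailure(nowMs: Long): Boolean {
        if (slidingWindowMs == null) return ++consecutive >= failureThreshold
        prune(nowMs)
        failureTimes.addLast(nowMs)
        return failureTimes.size >= failureThreshold
    }

    // Drops failure timestamps that are older than the sliding window.
    private fun prune(nowMs: Long) {
        val window = slidingWindowMs ?: return
        while (failureTimes.isNotEmpty() && nowMs - failureTimes.first() > window) {
            failureTimes.removeFirst()
        }
    }
}

fun main() {
    // Same event sequence in both modes: fail, fail, success, fail.
    val consecutive = FailureTracker(failureThreshold = 3)
    consecutive.onFailure(0); consecutive.onFailure(1_000); consecutive.onSuccess(2_000)
    println(consecutive.onFailure(3_000)) // false: the success reset the count

    val windowed = FailureTracker(failureThreshold = 3, slidingWindowMs = 30_000L)
    windowed.onFailure(0); windowed.onFailure(1_000); windowed.onSuccess(2_000)
    println(windowed.onFailure(3_000)) // true: three failures fall within 30 s
}
```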
```kotlin
val policy = resilient(scope) {
    circuitBreaker {
        failureThreshold = 5
        successThreshold = 2
        halfOpenMaxCalls = 1
        timeout = 60.seconds       // OPEN duration
        slidingWindow = 30.seconds // optional: time-based failure window
    }
}
```

Token-bucket rate limiting: tokens refill at the start of each period (fixed-window refill). With maxCalls = 10 and period = 1.seconds, at most 10 calls are admitted in each one-second window. Each policy has a single bucket (a global limit), and timeoutWhenLimited optionally caps how long a caller waits when limited.
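A minimal sketch of fixed-window refill, assuming the behaviour described above: the allowance resets to maxCalls at each period boundary. `FixedWindowLimiter` is a hypothetical class written for illustration; it takes explicit millisecond timestamps instead of suspending, so it shows only the admission decision and the wait a limited caller would face.

```kotlin
// Hypothetical fixed-window limiter: all tokens are restored at each period boundary.
class FixedWindowLimiter(private val maxCalls: Int, private val periodMs: Long) {
    private var windowStart = 0L
    private var used = 0

    /** Returns true if a call at [nowMs] is admitted, false if it would have to wait. */
    fun tryAcquire(nowMs: Long): Boolean {
        if (nowMs - windowStart >= periodMs) {
            // Align the new window to a period boundary and refill.
            windowStart = nowMs - (nowMs - windowStart) % periodMs
            used = 0
        }
        if (used >= maxCalls) return false
        used++
        return true
    }

    /** Milliseconds until the next refill, valid after a rejected tryAcquire at [nowMs]. */
    fun waitMs(nowMs: Long): Long = windowStart + periodMs - nowMs
}

fun main() {
    val limiter = FixedWindowLimiter(maxCalls = 2, periodMs = 1_000)
    println(limiter.tryAcquire(0))     // true
    println(limiter.tryAcquire(100))   // true
    println(limiter.tryAcquire(500))   // false: window exhausted
    println(limiter.waitMs(500))       // 500
    println(limiter.tryAcquire(1_000)) // true: new window, tokens refilled
}
```

In this model a rejected caller would wait until the next boundary, or fail if that wait exceeded a configured maximum. Note that a fixed window bounds calls per window, not per arbitrary interval: a burst at the end of one window followed by a burst at the start of the next can briefly exceed maxCalls within one period-length span.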
```kotlin
val policy = resilient {
    rateLimiter {
        maxCalls = 10
        period = 1.seconds
        timeoutWhenLimited = 5.seconds // throw if wait would exceed
        onRateLimited = { /* metric */ }
    }
}
```

The bulkhead limits concurrent executions and queued waiters, with an optional acquire timeout.
```kotlin
val policy = resilient(scope) {
    bulkhead {
        maxConcurrentCalls = 8
        maxWaitingCalls = 32
        timeout = 2.seconds
    }
}
```

Named / shared bulkhead: use BulkheadRegistry so several policies share the same pool (e.g. one limit for all "database" calls). bulkhead { } and bulkheadNamed(...) cannot be combined in the same policy.
```kotlin
import com.santimattius.resilient.bulkhead.BulkheadRegistry

val bulkheads = BulkheadRegistry()

val policyA = resilient(scope) {
    bulkheadNamed(bulkheads, "database") {
        maxConcurrentCalls = 4
        maxWaitingCalls = 16
    }
}

val policyB = resilient(scope) {
    bulkheadNamed(bulkheads, "database") {
        maxConcurrentCalls = 4 // first registration wins; same instance as policyA
    }
}
```

Hedging launches parallel attempts and returns the first successful result. Use it carefully, because every extra attempt adds load on the downstream.
```kotlin
val policy = resilient {
    hedging {
        attempts = 3
        stagger = 50.milliseconds
    }
}
```

The cache policy stores successful results per key with a TTL. Use a fixed key or a keyProvider (suspend) for dynamic keys (e.g. per user, per request). Invalidation is available via policy.cacheHandle when cache is configured.
```kotlin
val policy = resilient(scope) {
    cache {
        key = "users:123" // fixed key
        // or keyProvider = { "user:${userId}" } // dynamic key
        ttl = 30.seconds
    }
}

// Invalidate by key or prefix (e.g. after write)
policy.cacheHandle?.invalidate("users:123")
policy.cacheHandle?.invalidatePrefix("user:")
```

The in-memory implementation is one possible CachePolicy; custom backends (e.g. persistent storage or Redis) can implement the same interface.
→ In-depth documentation: docs/patterns/cache.md
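The caching behaviour described above (store successful results per key, expire them after a TTL, invalidate by key or prefix) can be sketched in a few lines. `TtlCache` is a hypothetical, single-threaded illustration and does not implement the library's CachePolicy interface, whose shape the source does not show; the clock is injected so expiry can be demonstrated without sleeping.

```kotlin
// Hypothetical in-memory TTL cache; not thread-safe, for illustration only.
class TtlCache<V>(private val ttlMs: Long, private val clock: () -> Long) {
    private data class Entry<T>(val value: T, val expiresAt: Long)

    private val entries = mutableMapOf<String, Entry<V>>()

    /** Returns the cached value if still fresh; otherwise loads and stores it. */
    fun getOrPut(key: String, load: () -> V): V {
        val now = clock()
        entries[key]?.let { if (now < it.expiresAt) return it.value }
        // If load() throws, nothing is stored: only successful results are cached.
        return load().also { entries[key] = Entry(it, now + ttlMs) }
    }

    fun invalidate(key: String) {
        entries.remove(key)
    }

    fun invalidatePrefix(prefix: String) {
        entries.keys.removeAll { it.startsWith(prefix) }
    }
}

fun main() {
    var now = 0L
    var loads = 0
    val cache = TtlCache<String>(ttlMs = 30_000L) { now }
    val load = { loads++; "profile" }

    cache.getOrPut("users:123", load)
    cache.getOrPut("users:123", load)
    println(loads) // 1: second call is served from the cache

    now = 30_000L
    cache.getOrPut("users:123", load)
    println(loads) // 2: the entry expired after the TTL

    cache.invalidatePrefix("users:")
    cache.getOrPut("users:123", load)
    println(loads) // 3: the entry was invalidated by prefix
}
```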
The coalesce policy deduplicates concurrent in-flight executions by key. If multiple callers resolve the same key while the operation is still running, the block executes only once and all callers share the same result (or error). It does not cache completed results; for TTL caching, use cache { ... }.
```kotlin
val policy = resilient(scope) {
    coalesce {
        key = "profile:42" // fixed key
        // or keyProvider = { "profile:${userId}" } // dynamic key
    }
}
```

Use policy.getHealthSnapshot() to build health or readiness endpoints (e.g. Kubernetes probes, /health API). The snapshot includes circuit breaker state and counters, and bulkhead usage when configured.
```kotlin
import com.santimattius.resilient.circuitbreaker.CircuitState

val snapshot = policy.getHealthSnapshot()

// Circuit breaker: is the circuit open?
val healthy = snapshot.circuitBreaker?.state != CircuitState.OPEN
// Optional: snapshot.circuitBreaker?.failureCount, successCount for metrics

// Bulkhead: active and waiting counts
snapshot.bulkhead?.let { bh ->
    // bh.activeConcurrentCalls, bh.waitingCalls, bh.maxConcurrentCalls, bh.maxWaitingCalls
}
```

The fallback policy returns a fallback value when the operation fails. Make sure the fallback returns the same type as the block passed to execute(); otherwise a ClassCastException may occur at runtime.
```kotlin
import com.santimattius.resilient.fallback.FallbackConfig

val policy = resilient {
    fallback(FallbackConfig { err ->
        // produce a fallback from error (log/metric here)
        "fallback-value"
    })
}
```

```kotlin
val policy = resilient {
    cache { key = "profile:42"; ttl = 20.seconds }
    timeout { timeout = 2.seconds }
    retry {
        maxAttempts = 3
        backoffStrategy = ExponentialBackoff(initialDelay = 100.milliseconds)
    }
    circuitBreaker { failureThreshold = 5; successThreshold = 2; halfOpenMaxCalls = 1 }
    rateLimiter { maxCalls = 50; period = 1.seconds }
    bulkhead { maxConcurrentCalls = 10; maxWaitingCalls = 100 }
    hedging { attempts = 2 }
    // The fallback must produce the same type as the block (User here).
    // defaultUser is a hypothetical placeholder for a cached or default User.
    fallback(FallbackConfig { defaultUser })
}

suspend fun load(): User = policy.execute { loadProfile() }
```

A ready-to-run sample is available at:
androidApp/src/main/kotlin/com/santimattius/kmp/ResilientExample.kt

Use it from MainActivity:

```kotlin
setContent { ResilientExample() }
```

The resilient-test module provides utilities for testing resilience policies with fault injection and simplified policy builders.
Use FaultInjector to simulate failures, delays, and intermittent behavior in your tests:
```kotlin
import com.santimattius.resilient.test.FaultInjector
import kotlin.time.Duration.Companion.milliseconds

val injector = FaultInjector.builder()
    .failureRate(0.3)                // 30% chance of failure
    .delay(50.milliseconds)          // Add 50ms delay
    .delayJitter(true)               // Randomize delay ±20%
    .exception { CustomException() } // Custom exception
    .build()

val result = injector.execute {
    fetchData() // may throw or delay
}
```

Configuration:
- failureRate(rate: Double): Probability of throwing an exception (0.0 = never, 1.0 = always).
- exception(block: () -> Throwable): Factory for the exception to throw (default: FaultInjectedException).
- delay(duration: Duration): Fixed delay before executing the block (default: Duration.ZERO).
- delayJitter(enable: Boolean): Randomize delay ±20% (default: false).

Use PolicyBuilders to create policies with sensible test defaults:
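```kotlin
import com.santimattius.resilient.test.FaultInjector
import com.santimattius.resilient.test.PolicyBuilders
import com.santimattius.resilient.test.TestResilientScope
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.time.Duration.Companion.milliseconds
import kotlinx.coroutines.test.runTest

@Test
fun `test retry with fault injection`() = runTest {
    val scope = TestResilientScope(this)
    val policy = PolicyBuilders.retryPolicy(
        scope,
        maxAttempts = 3,
        initialDelay = 10.milliseconds
    )
    // Note: with a random 50% failure rate, all 3 attempts can still fail
    // (about 12.5% of runs if attempts fail independently), so this test is probabilistic.
    val injector = FaultInjector.builder()
        .failureRate(0.5)
        .build()

    val result = policy.execute {
        injector.execute { "success" }
    }

    assertEquals("success", result)
}
```

Available builders: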
- retryPolicy(scope, maxAttempts = 3, initialDelay = 10.milliseconds, maxDelay = 100.milliseconds, shouldRetry = { true })
- timeoutPolicy(scope, timeout = 1.seconds)
- circuitBreakerPolicy(scope, failureThreshold = 3, successThreshold = 2, timeout = 5.seconds)
- bulkheadPolicy(scope, maxConcurrentCalls = 2, maxWaitingCalls = 4)
- rateLimiterPolicy(scope, maxCalls = 5, period = 1.seconds)

Use TestResilientScope to create a test-friendly scope for policies:
```kotlin
import com.santimattius.resilient.composition.resilient
import com.santimattius.resilient.test.TestResilientScope
import kotlin.test.Test
import kotlinx.coroutines.test.runTest

@Test
fun `test policy lifecycle`() = runTest {
    val scope = TestResilientScope(this)
    val policy = resilient(scope) {
        retry { maxAttempts = 3 }
    }
    // ... test logic
    scope.cancel() // cleanup
}
```

Add the resilient-test module to your test dependencies:
```kotlin
// build.gradle.kts
kotlin {
    sourceSets {
        commonTest.dependencies {
            implementation("io.github.santimattius.resilient:resilient-test:1.3.0")
            implementation("org.jetbrains.kotlinx:kotlinx-coroutines-test:1.9.0")
        }
    }
}
```

- Use kotlinx-coroutines-test for virtual time with runTest and TestTimeSource.
- Use FaultInjector to simulate realistic failure scenarios (intermittent errors, slow responses).

Deep-dive documentation for each pattern covers its definition, when to apply it, a configuration reference, and extended examples:
| Pattern | Document |
|---|---|
| Timeout | docs/patterns/timeout.md |
| Retry | docs/patterns/retry.md |
| Circuit Breaker | docs/patterns/circuit-breaker.md |
| Rate Limiter | docs/patterns/rate-limiter.md |
| Bulkhead | docs/patterns/bulkhead.md |
| Hedging | docs/patterns/hedging.md |
| Cache | docs/patterns/cache.md |
| Fallback | docs/patterns/fallback.md |