The Graduation Path
Agent Workflow separates workflow definition from execution. The StepRunner interface is the seam; swap the bean, not the workflow:
| Level | Runner | What it adds |
|---|---|---|
| 0 | LocalStepRunner | In-process, zero overhead. Default. |
| 1 | CheckpointingStepRunner | JDBC crash recovery — resume from last completed step |
| 2 | TemporalStepRunner | Distributed durable execution via Temporal activities |
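The idea of the seam can be illustrated with a minimal sketch. The interface below is a hypothetical, simplified stand-in (`SketchStepRunner`, `InProcessRunner`, and `LoggingRunner` are illustration names, not the library's types or signatures); it only shows how a workflow that depends on an interface stays unchanged when the runner implementation is swapped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class RunnerSeamSketch {

    /** Hypothetical stand-in for the runner seam; not the library's actual signature. */
    interface SketchStepRunner {
        <I, O> O run(String stepName, Function<I, O> step, I input);
    }

    /** Level 0 analogue: invoke the step directly, in-process. */
    static class InProcessRunner implements SketchStepRunner {
        @Override
        public <I, O> O run(String stepName, Function<I, O> step, I input) {
            return step.apply(input);
        }
    }

    /** A runner that adds behavior around a delegate (here: recording step names). */
    static class LoggingRunner implements SketchStepRunner {
        final List<String> log = new ArrayList<>();
        private final SketchStepRunner delegate;

        LoggingRunner(SketchStepRunner delegate) {
            this.delegate = delegate;
        }

        @Override
        public <I, O> O run(String stepName, Function<I, O> step, I input) {
            log.add(stepName);
            return delegate.run(stepName, step, input);
        }
    }

    /** The workflow definition depends only on the interface, never on a concrete runner. */
    static String workflow(SketchStepRunner runner, String input) {
        String trimmed = runner.run("trim", (String s) -> s.trim(), input);
        return runner.run("upper", (String s) -> s.toUpperCase(), trimmed);
    }
}
```

Passing either runner to `workflow` produces the same result; only the execution behavior around each step differs.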
CheckpointingStepRunner
Persists step outputs to a JDBC database. On restart with the same runId, completed steps are skipped and their cached output is returned directly.
How it works
- Before executing a step, queries by (runId, stepName), the checkpoint key
- If a COMPLETED record exists, returns the cached outputPayload (skip)
- Otherwise, creates a STARTED record, executes the step, and upgrades it to COMPLETED with the serialized output
- On exception, records FAILED with the error message
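The steps above can be sketched with an in-memory map standing in for the JDBC table. This is an illustration under stated assumptions, not the library's implementation: `CheckpointSketch`, `Key`, `Row`, and `runStep` are hypothetical names, outputs are plain strings, and refusing to re-run a step that already has a FAILED record is one assumed way to model the no-automatic-retry rule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class CheckpointSketch {
    enum Status { STARTED, COMPLETED, FAILED }

    /** Checkpoint key: (runId, stepName). */
    record Key(String runId, String stepName) {}

    /** One checkpoint row; outputPayload and error stay null until they are known. */
    record Row(Status status, String outputPayload, String error) {}

    // In-memory stand-in for the checkpoint table (assumption for this sketch).
    final Map<Key, Row> store = new HashMap<>();

    String runStep(String runId, String stepName, Supplier<String> step) {
        Key key = new Key(runId, stepName);
        Row existing = store.get(key);
        if (existing != null && existing.status() == Status.COMPLETED) {
            return existing.outputPayload(); // skip: return the cached output
        }
        if (existing != null && existing.status() == Status.FAILED) {
            // Assumption: a FAILED record blocks re-execution until it is reset.
            throw new IllegalStateException("Step " + stepName + " is FAILED; reset required");
        }
        store.put(key, new Row(Status.STARTED, null, null));
        try {
            String output = step.get();
            store.put(key, new Row(Status.COMPLETED, output, null));
            return output;
        } catch (RuntimeException e) {
            store.put(key, new Row(Status.FAILED, null, e.getMessage()));
            throw e;
        }
    }
}
```

Calling `runStep` twice with the same runId and step name executes the supplier only once; the second call returns the stored payload.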
Maven coordinates
Requires a DataSource on the classpath. H2 works for development; use Postgres or MySQL for production.
Restart semantics
runId is the stable identity for a workflow instance. COMPLETED steps are skipped permanently for that runId. FAILED steps are not automatically retried — the system leaves them in place until an operator explicitly decides to retry.
This is intentional. Not all failures are transient: a bad prompt, a schema mismatch, or a programming error will fail again without a fix. Automatic retry would mask the real problem.
Crash-and-resume with CheckpointManager
When a step fails, call CheckpointManager.getRunState() to inspect what happened, then call resetFailedSteps() only after confirming the failure was transient.
resetFailedSteps deletes FAILED records; COMPLETED records are untouched. The next execution creates a fresh STARTED record for each reset step and re-runs it.
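The reset semantics can be shown with a small sketch over an in-memory run state. The method below only mirrors the described behavior (delete FAILED, keep COMPLETED); its name matches the text, but the signature, the `ResetSketch` class, and the step names are assumptions for illustration, not the CheckpointManager API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ResetSketch {
    enum Status { STARTED, COMPLETED, FAILED }

    /** State of a 4-step run that crashed at step 3 (step 4 never started). */
    static Map<String, Status> crashedRun() {
        Map<String, Status> steps = new LinkedHashMap<>();
        steps.put("step1", Status.COMPLETED);
        steps.put("step2", Status.COMPLETED);
        steps.put("step3", Status.FAILED);
        return steps;
    }

    /** Deletes FAILED records and leaves COMPLETED ones; returns how many were removed. */
    static int resetFailedSteps(Map<String, Status> stepsForRun) {
        int before = stepsForRun.size();
        stepsForRun.values().removeIf(status -> status == Status.FAILED);
        return before - stepsForRun.size();
    }
}
```

After the reset, step3 has no record, so the next execution treats it as a step that has never run, while step1 and step2 still resolve from their checkpoints.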
Basic crash-and-resume example
A 4-step workflow crashes at step 3. After an operator reset, steps 1-2 are skipped (cached) and step 3 is retried.
A complete runnable example is in workflow-dsl-examples/CrashRecoveryIT (@DataJpaTest + H2, no LLM needed).
JPA entities
Two JPA entities back the checkpoint system:
| Entity | Table | Purpose |
|---|---|---|
| AgentStepExecution | agent_step_executions | Per-step checkpoint. Key: (runId, stepName) unique constraint. Tracks status, output, tokens, cost. |
| AgentFlowExecution | agent_flow_executions | Per-run envelope. Tracks workflow name, steps total/completed, total cost. |
Supporting types: BatchStatus (severity-ordered enum) and ExitStatus (embeddable record with severity-based composition via and()).
Typed output deserialization
Each checkpoint stores the step’s output type alongside its serialized payload. On restore, CheckpointingStepRunner uses Class.forName(outputType) to deserialize back to the original type rather than raw Object. As a result, when a step is skipped and its cached output is returned to the next step, the type is preserved.
Steps that declare outputType() participate fully. Step.named() lambdas return Object.class by default; deserialization falls back to Jackson’s type inference for those.
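A minimal sketch of the restore path, under stated assumptions: a small parser table stands in for the JSON mapper, payloads are plain strings, and `TypedRestoreSketch`, `Checkpoint`, and `restore` are hypothetical names. It only illustrates how a stored type name resolved through Class.forName yields a typed value, and how an Object type name leaves the result untyped.

```java
import java.util.Map;
import java.util.function.Function;

class TypedRestoreSketch {
    /** Stored with each checkpoint: the output type name and the serialized payload. */
    record Checkpoint(String outputType, String outputPayload) {}

    // Stand-in for a JSON mapper: one parser per supported type (assumption).
    static final Map<Class<?>, Function<String, Object>> PARSERS = Map.of(
            Integer.class, s -> Integer.valueOf(s),
            Boolean.class, s -> Boolean.valueOf(s),
            String.class, s -> s);

    static Object restore(Checkpoint cp) {
        Class<?> type;
        try {
            type = Class.forName(cp.outputType()); // resolve the stored type name
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Unknown output type: " + cp.outputType(), e);
        }
        Function<String, Object> parser = PARSERS.get(type);
        if (parser == null) {
            // No typed parser (for example java.lang.Object): return the raw payload.
            return cp.outputPayload();
        }
        return type.cast(parser.apply(cp.outputPayload()));
    }
}
```

With `outputType` set to `java.lang.Integer`, the payload `"42"` comes back as an `Integer`; with `java.lang.Object`, the same payload comes back as the raw string.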
JdbcTraceRecorder
Records every step transition to a step_transitions table. Auto-creates the table on first use.
Each StepTransition record includes: run_id, workflow_name, from_step, to_step, timestamp, duration_ms, tokens_used, cost_usd, node_type, label.
Query traces for a run:
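One way to read the trace for a run is plain JDBC against the step_transitions table. This is a sketch, not the recorder's own query API: the table and column names come from the list above, while `TraceQuerySketch`, `transitionsFor`, the column types, and the ordering by timestamp are assumptions. Some databases treat `timestamp` as a reserved word and may require the column name to be quoted.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class TraceQuerySketch {
    // Parameterized query: one row per recorded transition of the given run.
    static final String SQL =
            "SELECT from_step, to_step, duration_ms FROM step_transitions "
            + "WHERE run_id = ? ORDER BY timestamp";

    /** Returns one human-readable line per transition, oldest first. */
    static List<String> transitionsFor(Connection conn, String runId) throws SQLException {
        List<String> lines = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, runId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lines.add(rs.getString("from_step") + " -> " + rs.getString("to_step")
                            + " (" + rs.getLong("duration_ms") + " ms)");
                }
            }
        }
        return lines;
    }
}
```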
TemporalStepRunner
Dispatches each step as a Temporal Activity. Steps must be registered with StepActivityImpl on the worker side.
Maven coordinates
Activity dispatch
Worker-side step registration
Registered steps are held in a ConcurrentHashMap registry. The activity creates a fresh AgentContext with the runId for each execution.
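The registry pattern described here can be sketched as follows. This does not reproduce StepActivityImpl: `StepRegistrySketch`, `SketchContext`, `register`, and `execute` are hypothetical names, and steps are modeled as string-to-string functions for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

class StepRegistrySketch {
    /** Hypothetical minimal context: carries only the runId. */
    record SketchContext(String runId) {}

    // Thread-safe name-to-step lookup shared by all activity executions.
    private final Map<String, BiFunction<SketchContext, String, String>> steps =
            new ConcurrentHashMap<>();

    /** Worker-side registration: makes a step available under its name. */
    void register(String name, BiFunction<SketchContext, String, String> step) {
        steps.put(name, step);
    }

    /** Activity-side execution: look up the step by name and run it with a fresh context. */
    String execute(String runId, String stepName, String input) {
        BiFunction<SketchContext, String, String> step = steps.get(stepName);
        if (step == null) {
            throw new IllegalArgumentException("No step registered under name: " + stepName);
        }
        return step.apply(new SketchContext(runId), input);
    }
}
```

Because each execution builds its context from the runId alone, a step in this sketch cannot rely on state carried over from a parent workflow.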
Steps dispatched via Temporal must be idempotent — Temporal may retry activities on timeout or failure.
Sub-workflows run inline, not as activities. A Workflow used as a step inside another Workflow bypasses the TemporalStepRunner and executes in-process. Only leaf steps are dispatched as Temporal activities. This is required for correct context propagation: the activity worker receives only the runId, not the full parent context.
Related
API Reference
StepRunner interface, TraceRecorder, WorkflowExecutor
DSL Primitives
Sequential, parallel, gate, loop, branch, and more