A microservice produced ISO 8601 datetimes with timezones. The downstream service expected Unix timestamps. The result: silent data corruption in production. The fix was documentation that should have existed from the start.
Pipelines fail at interfaces. The logic within each component works fine. The hand-off between components is where things break.
This is true for traditional software. It’s amplified when AI is involved, because AI generates code that may not match the implicit assumptions of the systems it integrates with.
The contract problem
Working with Claude Code, I built a data processing pipeline: an ingestion service that parsed logs, a transformation service that normalized the data, and an analytics service that aggregated results. Three services, each working correctly in isolation.
The ingestion service output timestamps in ISO format with timezone: 2025-02-27T22:36:03Z. The analytics service expected Unix timestamps in milliseconds: 1740695763000. The transformation service passed timestamps through unchanged, assuming consistency.
The result was a silent failure — aggregations grouped by nonsensical time buckets, which I didn’t notice until the dashboards showed impossible patterns.
Neither service was wrong. Neither service was broken. The contract between them was never defined.
Explicit is better than implicit
The fix wasn’t code — it was documentation. I directed Claude to add a “Data Format Contract” section to the ingestion service’s CLAUDE.md file:
```
# Timestamp format contract
timestamps: Unix milliseconds (integer)
# NOT ISO 8601 strings
# NOT Unix seconds
# All downstream consumers expect milliseconds
# Conversion happens at ingestion, not downstream
```
The consuming service’s CLAUDE.md got a matching section explaining what formats it accepts and what happens with malformed input.
With the contract documented, the AI assistant that helps me maintain these services knows what format to generate when creating new components, and what to check when debugging integration failures.
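To make the contract executable, here is a minimal Python sketch of the conversion step. The function name and the assertion are illustrative, not taken from the actual ingestion service:

```python
from datetime import datetime, timezone

def to_unix_ms(iso_string: str) -> int:
    """Convert an ISO 8601 timestamp to Unix milliseconds.

    Per the contract: conversion happens at ingestion, so everything
    downstream sees integers, never strings.
    """
    # datetime.fromisoformat() only accepts a trailing "Z" from
    # Python 3.11 on; normalize it for compatibility with 3.7+.
    dt = datetime.fromisoformat(iso_string.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Refuse to guess a timezone for naive timestamps.
        raise ValueError(f"timestamp lacks timezone info: {iso_string}")
    return int(dt.astimezone(timezone.utc).timestamp() * 1000)

assert to_unix_ms("2025-02-27T22:36:03Z") == 1740695763000
```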
What belongs in a data contract
For each data hand-off in a pipeline:
- Field names. Exact names, case sensitivity, whether underscores or camelCase.
- Field formats. Timestamp format, number precision, string encoding, null handling.
- Required vs optional. Which fields must be present, which can be omitted.
- Validation rules. What values are valid, what happens with invalid input.
- Limitations. What the producing service doesn’t handle, what the consuming service can’t process.
The last item — limitations — is often forgotten. Features get documented. Limitations don’t. But knowing what doesn’t work is as important as knowing what does.
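Most of these items can be made machine-checkable. A sketch in Python, with hypothetical field names for a log-processing hand-off; limitations are the exception, since they live in prose rather than in a validator:

```python
# Hypothetical contract for one hand-off, expressed as data so it
# can be checked at runtime. Field names here are illustrative.
CONTRACT = {
    "timestamp": {"type": int, "required": True},   # Unix milliseconds
    "level": {"type": str, "required": True,
              "valid": {"debug", "info", "warn", "error"}},
    "trace_id": {"type": str, "required": False},   # optional field
}

def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations; empty means conformant."""
    errors = []
    for name, spec in CONTRACT.items():
        if name not in record:
            if spec["required"]:
                errors.append(f"missing required field: {name}")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}, "
                          f"got {type(value).__name__}")
        elif "valid" in spec and value not in spec["valid"]:
            errors.append(f"{name}: invalid value {value!r}")
    return errors
```

Returning all violations at once, rather than raising on the first, makes integration failures faster to diagnose.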
API contracts
A related pattern applies to APIs and their clients.
I had a REST API with endpoints that returned paginated results. The API and client evolved separately. Claude added new response fields to the API without corresponding handling in the client. Claude added new query parameters to the client without corresponding support in the API.
The fix: document them together.
```
## Pagination Contract

| Field | Type | Required | Notes |
|-------|------|----------|-------|
| page | integer | yes | 1-indexed |
| per_page | integer | no | Default 20, max 100 |
| total | integer | response only | Total items across all pages |
| next_cursor | string | response only | Null if last page |

If a field exists in the response but not this table,
clients should ignore it (forward compatibility).
If a parameter exists in the client but not this table,
the API will return 400 (strict validation).
```
The table is the contract. API and client must both conform to it.
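The two rules under the table map directly onto client code. A Python sketch of the client side; the helper names are mine, not from the actual codebase:

```python
from dataclasses import dataclass
from typing import Optional

# Request parameters from the contract table.
ALLOWED_PARAMS = {"page", "per_page"}

@dataclass
class Page:
    page: int                   # 1-indexed
    total: int                  # total items across all pages
    next_cursor: Optional[str]  # None on the last page
    per_page: int = 20          # contract default

def parse_page(payload: dict) -> Page:
    # Forward compatibility: read only the contracted fields and
    # silently ignore anything the API adds later.
    return Page(
        page=payload["page"],
        per_page=payload.get("per_page", 20),
        total=payload["total"],
        next_cursor=payload.get("next_cursor"),
    )

def build_query(**params: int) -> dict:
    # Strict validation mirrors the API side: an uncontracted
    # parameter would earn a 400, so fail fast in the client instead.
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"parameters not in the contract: {unknown}")
    return params
```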
Known limitations documentation
When I asked Claude to add rate limiting to my API gateway, it worked — for a specific definition of “worked.” The limiter used a sliding window algorithm that could briefly exceed limits during high burst traffic.
This surprised me when I first encountered it in load testing. I’d expected hard limits.
The fix was documentation in the service’s limitations section:
```
## Known Limitations

### Rate limiting

- Algorithm: Sliding window (not token bucket)
- Burst tolerance: May briefly exceed limit by up to 10%
- Reset behavior: Gradual decay, not hard reset at window boundary
- Workaround: Set limits 10% below actual threshold for strict enforcement
```
Now I know what to expect. The AI assistant knows too, and can make informed decisions when generating code that interacts with the rate limiter.
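For reference, here is roughly what a weighted sliding-window counter looks like in Python. This is one common "sliding window" variant, and which one the gateway used is an assumption on my part; the point is that the blending of the previous window is exactly what produces the burst tolerance and gradual decay documented above:

```python
import time
from collections import defaultdict

class SlidingWindowLimiter:
    """Weighted sliding-window counter (one common variant).

    The current rate is estimated by blending the previous window's
    count with the current one. That approximation is what produces
    burst tolerance and gradual decay rather than hard resets.
    """

    def __init__(self, limit: int, window_seconds: float = 60.0):
        # Per the documented workaround: pass limit * 0.9 here if you
        # need strict enforcement at the nominal threshold.
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # window index -> request count
        # Note: old windows are never pruned in this sketch.

    def allow(self, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        current = int(now // self.window)
        elapsed = (now % self.window) / self.window
        # Weight the previous window by how much of it still overlaps
        # the trailing window ending now: this is the gradual decay.
        estimate = (self.counts[current - 1] * (1 - elapsed)
                    + self.counts[current])
        if estimate >= self.limit:
            return False
        self.counts[current] += 1
        return True
```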
Configuration contracts
Configuration formats are contracts too. Environment variable naming conventions, config file schemas, feature flag formats — these are specifications that multiple components depend on.
The failure mode: configuration conventions exist in one service’s code but don’t get communicated to other services or to the AI assistant generating new components.
The fix: surface configuration contracts in discoverable locations. Document the config schema in the repository root. Add key conventions to CLAUDE.md so they’re loaded with every session.
```
## Configuration Contract

Environment variables:
- PREFIX: SERVICE_NAME_ (e.g., ANALYTICS_DB_HOST)
- Format: UPPER_SNAKE_CASE
- Booleans: "true"/"false" strings, not 1/0

Feature flags:
- Format: feature.{domain}.{name}
- Values: boolean only
- Source: LaunchDarkly, cached locally
```
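Enforcing the contract at startup turns convention drift into an immediate, loud error instead of a silent misread. A minimal sketch, assuming a Python service; the prefix and variable names are examples, not the actual services':

```python
import os

def read_config_bool(name: str, prefix: str = "ANALYTICS_") -> bool:
    """Read a boolean environment variable under the documented
    conventions. The prefix and variable names are examples."""
    key = prefix + name  # e.g. ANALYTICS_CACHE_ENABLED
    if key != key.upper():
        raise ValueError(f"{key} violates the UPPER_SNAKE_CASE convention")
    raw = os.environ.get(key, "false")
    # Contract: booleans are the strings "true"/"false", never 1/0.
    if raw not in ("true", "false"):
        raise ValueError(f"{key}={raw!r}: expected 'true' or 'false'")
    return raw == "true"
```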
AI-specific considerations
AI assistants generate code based on patterns in their training and context. They don’t have inherent knowledge of your pipeline’s contracts.
If you don’t document that timestamps should be Unix milliseconds, the AI might generate ISO strings, or Unix seconds, or something else entirely. It’s not wrong — it just doesn’t know your convention.
The documentation serves two audiences:
- Human maintainers who need to understand the system
- AI assistants who need context for generating compatible code
Both benefit from explicit contracts. Neither can rely on implicit assumptions.
The documentation investment
Documenting contracts takes time upfront. It saves more time later.
Each format mismatch I’ve debugged consumed 30-60 minutes of investigation. Unclear behavior in pipelines leads to cascading errors as multiple services produce unexpected output. Silent failures — like incorrect timestamp aggregations — can persist for days before discovery.
Spending 10 minutes documenting a contract prevents hours of debugging. And once documented, the contract serves as specification for AI-assisted maintenance and extension.
Implementation pattern
When building a new service pipeline with AI assistance:
- Define contracts before implementation. What format will each service output? What format will the next service expect? Document first.
- Store contracts with code. CLAUDE.md in each repository, covering that service’s input/output contracts. The AI reads this automatically.
- Include limitations. What doesn’t the service handle? What edge cases produce unexpected results? Future you needs to know.
- Link related contracts. If service A outputs to service B, reference B’s input contract from A’s documentation.
- Update contracts when behavior changes. Contracts are living documentation. If the format changes, the contract must change too; a contract test, sketched below, is one way to enforce this.
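A contract test keeps the documentation honest: when the producer's output format drifts, the test fails before any downstream consumer does. A sketch assuming a hypothetical `ingest_line` entry point into the ingestion service, plus the `validate_record` helper sketched earlier:

```python
# `ingest_line` is a hypothetical entry point; `validate_record` is
# the contract validator sketched in an earlier section.
def test_ingestion_output_matches_contract():
    record = ingest_line('2025-02-27T22:36:03Z error "disk full"')
    assert isinstance(record["timestamp"], int)  # Unix ms, not an ISO string
    assert record["timestamp"] > 10**12          # milliseconds, not seconds
    assert validate_record(record) == []         # full contract check
```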
The meta-lesson
Pipelines are interface-heavy systems. The more interfaces, the more opportunities for implicit assumption failures.
AI-assisted development adds another layer: the AI itself is an interface participant, generating code based on its understanding of the system. If that understanding doesn’t include your contracts, the generated code may not conform to them.
Explicit documentation solves both problems. It tells human maintainers what to expect. It tells AI assistants what to generate.
The investment in contracts is an investment in reliability.
This article is part of the synthesis coding series. For related content on pipeline architecture, see RAG Architecture Lessons from Practice.
Rajiv Pant is President of Flatiron Software and Snapshot AI, where he leads organizational growth and AI innovation. He is former Chief Product & Technology Officer at The Wall Street Journal, The New York Times, and Hearst Magazines. Earlier in his career, he headed technology for Condé Nast’s brands including Reddit. Rajiv coined the terms “synthesis engineering” and “synthesis coding” to describe the systematic integration of human expertise with AI capabilities in professional software development. Connect with him on LinkedIn or read more at rajiv.com.