Data Availability

Last updated March 21, 2026

SynthLink data does not appear the moment a source publishes it. It becomes available progressively — through collection, normalization, storage, enrichment, and API exposure. This page explains that process and what it means for the data you consume.

Availability model

Availability in SynthLink is not a binary state. It describes the condition of data at a given point in the pipeline, not simply whether a document exists. A document can be available at the API level while its insight layer is still being generated. A source can be actively collected while some of its items are pending quality review. A document is fully available only once every stage has completed successfully, and each stage operates independently.

Stages of availability
1. Source publishes new content
2. Crawler detects and fetches the item on its next scheduled run
3. Item passes quality filtering and normalization
4. Document is written to the documents table → available via /api/v1/documents
5. Insight worker processes the document asynchronously
6. Insight is written to the insights table → available via /api/v1/insights

Steps 4 and 6 happen at different times. This means the documents and insights APIs do not reflect the same moment of completeness. Building on top of SynthLink requires understanding that both layers exist and that they update independently.

Document and insight timing

Documents and insights are produced by separate systems running on separate schedules. The crawler writes documents directly after each crawl cycle. The insight worker runs on its own recurring schedule and processes a bounded batch per cycle.

In practice, this means a document may be visible in the API before its insight exists. Enrichment is tracked internally as pending, completed, or failed. Public clients can infer readiness by checking whether an insight is present for the document in /api/v1/combined.

This is not an error state. It is a normal consequence of decoupled asynchronous processing. Applications that rely on insight fields should handle the case where an insight does not yet exist for a given document.
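
As a minimal sketch of that handling, the snippet below assumes a record from /api/v1/combined has already been parsed into a Python dict. The `insight` key, and the idea that it is missing or null until enrichment completes, are assumptions for illustration and not a confirmed response schema.

```python
from typing import Any, Optional


def insight_ready(record: dict[str, Any]) -> bool:
    """Return True when the combined record carries a non-empty insight."""
    # Assumption: the insight is nested under an "insight" key and is
    # absent or null until the insight worker has processed the document.
    return bool(record.get("insight"))


def summary_or_none(record: dict[str, Any]) -> Optional[str]:
    """Return the insight summary if one exists, otherwise None."""
    insight = record.get("insight") or {}
    # Assumption: the summary lives in an "llm_summary" field of the insight.
    return insight.get("llm_summary")
```

A caller can branch on `insight_ready(record)` and render the document on its own when it returns False, rather than treating the missing insight as a failure.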

| State | Document | Insight |
|---|---|---|
| Just collected | Available | Not yet created |
| Processing | Available | Pending (internal) |
| Fully enriched | Available | Completed (internal) |
| Enrichment failed | Available | Failed (internal) |

Freshness signals

Document timestamps should be interpreted carefully. SynthLink exposes ingestion time, not the source's original publication time.

created_at is the timestamp of first ingestion. It is set once and never modified. It tells you when SynthLink first encountered this document — not when the source originally published it.

The gap between source publication time and created_at reflects the crawl interval for that source. A document published at noon on a source with a 12-hour crawl interval may not appear in SynthLink until the next scheduled run — which could be up to 12 hours later.
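
The arithmetic above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the bound ignores failed crawl cycles and processing time, both of which can push ingestion later.

```python
from datetime import datetime, timedelta, timezone


def latest_expected_ingestion(published_at: datetime,
                              crawl_interval: timedelta) -> datetime:
    """Upper bound on first appearance, assuming every crawl succeeds."""
    # A crawl may have finished just before publication, so the worst
    # case is one full interval after the publication time.
    return published_at + crawl_interval


def ingestion_lag(published_at: datetime, created_at: datetime) -> timedelta:
    """Observed gap between source publication and first ingestion."""
    return created_at - published_at


published = datetime(2026, 3, 21, 12, 0, tzinfo=timezone.utc)
bound = latest_expected_ingestion(published, timedelta(hours=12))
print(bound.isoformat())  # 2026-03-22T00:00:00+00:00
```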

Note: Use created_at to sort or filter by when a document entered SynthLink. Use the Status page to assess whether a source is still being collected successfully.
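
A minimal sketch of filtering and sorting by created_at on the client side, assuming documents are already parsed into dicts and that created_at is an ISO 8601 string. The timestamp format is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp into an aware datetime."""
    # Assumption: created_at is ISO 8601. A trailing "Z" is normalized
    # because older Python versions do not accept it in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def ingested_since(documents: list[dict[str, Any]],
                   cutoff: datetime) -> list[dict[str, Any]]:
    """Documents first ingested at or after cutoff, newest first."""
    recent = [d for d in documents if parse_ts(d["created_at"]) >= cutoff]
    return sorted(recent, key=lambda d: parse_ts(d["created_at"]), reverse=True)
```

Because created_at records ingestion rather than publication, the result answers "what entered SynthLink since the cutoff", not "what was published since the cutoff".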

Partial availability

Not all data arrives at the same level of completeness. SynthLink exposes data progressively, and different fields may be populated at different times.

  • Document without insight

    A document may be available via /api/v1/documents before its enrichment is complete. The insight fields — llm_summary, keywords, tags, category — will not be present until the insight worker has processed it.

  • Summary without full content

    Some sources provide only a summary in their feed. If the detail page fetch fails or yields insufficient content, the document is stored with only the raw summary. The content field may be null or shorter than expected.

  • Source with reduced coverage

    If a crawl cycle only partially succeeds, for example when some items pass quality filtering while others do not, coverage for that source may be incomplete for that cycle. A later successful run can fill in the remaining items.

Availability has several independent layers, each updated on its own schedule. The API never blocks access to a document because its enrichment is incomplete — partial data is always better than no data for discovery and triage use cases.
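
The cases above can be handled defensively on the client. The sketch below assumes a parsed record with an optional nested `insight` object and a raw `summary` field; both names are hypothetical, while `content` and the four insight field names come from this page.

```python
from typing import Any

# Insight fields named on this page.
INSIGHT_FIELDS = ("llm_summary", "keywords", "tags", "category")


def completeness(record: dict[str, Any]) -> dict[str, Any]:
    """Describe which layers of a record are populated."""
    # Assumption: insight fields are nested under an "insight" key.
    insight = record.get("insight") or {}
    missing = [f for f in INSIGHT_FIELDS if insight.get(f) in (None, "", [])]
    return {
        "has_content": bool(record.get("content")),
        "has_insight": bool(insight),
        "missing_insight_fields": missing,
    }


def display_text(record: dict[str, Any]) -> str:
    """Prefer full content, then the raw summary, then an empty string."""
    # "summary" is a hypothetical name for the feed-provided summary.
    return record.get("content") or record.get("summary") or ""
```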

Reliability

SynthLink does not assume zero data loss. The system is designed around the reality that external sources fail, responses are inconsistent, and processing queues back up. Reliability is maintained through repeated collection and retry-based recovery rather than guaranteed single-pass delivery.

Crawlers retry failed requests with exponential backoff. Documents that pass quality filtering are upserted — meaning a document that was partially stored can be updated on the next successful crawl. Insight jobs that fail for transient reasons are re-queued with increasing delays. Only non-retryable failures — such as documents with no extractable text — are permanently marked as failed.
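
To illustrate the retry pattern described above, here is a generic sketch of exponential backoff with a non-retryable failure class. It is not SynthLink's implementation: the attempt count, base delay, growth factor, and all names are assumptions chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NonRetryableError(Exception):
    """A failure that retrying cannot fix, e.g. no extractable text."""


def backoff_delays(base: float, factor: float, count: int) -> list[float]:
    """Delays that grow geometrically: base, base*factor, base*factor^2, ..."""
    return [base * factor ** i for i in range(count)]


def run_with_retry(job: Callable[[], T], attempts: int = 4, base: float = 1.0,
                   factor: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run job, retrying transient failures with increasing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(base, factor, attempts - 1)
    for attempt in range(attempts):
        try:
            return job()
        except NonRetryableError:
            raise  # permanent failure: do not retry
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Production retry loops often add random jitter to the delays as well.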

The practical implication is that gaps in coverage are usually temporary. A missing document or insight today may be available after the next crawl cycle or retry window.

Note: If a source shows degraded status on the Status page for an extended period, it may indicate a persistent upstream issue rather than a transient failure.

Status interpretation

The Status page provides a real-time view of pipeline health. Understanding what it shows helps you interpret gaps or delays in the data you receive.

Worker runs

Shows the most recent crawl result for each source — whether it succeeded, how many records were processed, and when it last ran. A source that has not run recently may explain missing recent documents.

Integrity checks

Reports orphan insights, duplicate URLs, and completed insights with missing summaries. These numbers help identify systemic issues rather than individual document failures.
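
The three checks named above can be sketched over in-memory rows. The row shapes here (`id`, `url`, `document_id`, `status`, `llm_summary`) are assumptions for illustration and may not match the real tables; the actual checks are presumably run against the database.

```python
from collections import Counter
from typing import Any


def integrity_report(documents: list[dict[str, Any]],
                     insights: list[dict[str, Any]]) -> dict[str, list]:
    """Compute the three integrity signals over hypothetical row dicts."""
    doc_ids = {d["id"] for d in documents}

    # Orphan insights: insights whose document no longer exists.
    orphans = [i["document_id"] for i in insights
               if i["document_id"] not in doc_ids]

    # Duplicate URLs: the same URL stored under more than one document.
    url_counts = Counter(d["url"] for d in documents)
    duplicates = sorted(u for u, n in url_counts.items() if n > 1)

    # Completed insights that are missing their summary.
    missing = [i["document_id"] for i in insights
               if i.get("status") == "completed" and not i.get("llm_summary")]

    return {
        "orphan_insights": orphans,
        "duplicate_urls": duplicates,
        "completed_without_summary": missing,
    }
```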

API health

Confirms that the public endpoints are responding. A degraded API health signal means document and insight availability is affected across all sources.

Data availability is not a static property — it reflects the current state of a continuously running system. The Status page is the right place to start when the data you expect is not yet present.
