Insights

Last updated March 21, 2026

SynthLink Insights form the analysis layer that is generated after a document is collected. Crawlers store normalized source documents; the insight pipeline then turns each document into a compact, structured interpretation that is easier to search, filter, and consume in downstream applications.

What insights are

Insights do not replace the original document. They sit on top of it and provide a machine-friendly summary of what the document is about, which themes it contains, and how it can be categorized.

Insights are generated per document, and each one is exposed as a standalone insight record. The fields produced by the insight pipeline are the same for every source, regardless of where the original content came from.

Insight fields

{
  "llm_summary":  string,    // concise plain-language summary
  "keywords":     string[],  // key terms extracted from the document
  "tags":         string[],  // semantic topic tags
  "category":     string,    // top-level category label
  "source":       string,    // document source
  "created_at":   string     // ISO 8601 timestamp
}
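
As a rough illustration, the record above could be modeled on the client side as shown here. This is a minimal sketch, not an official SDK: the Insight class and the parse_insight helper are hypothetical names, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Insight:
    """Client-side model of one insight record (hypothetical helper type)."""
    llm_summary: str
    keywords: list[str]
    tags: list[str]
    category: str
    source: str
    created_at: datetime


def parse_insight(raw: dict[str, Any]) -> Insight:
    """Convert a decoded JSON object into an Insight, failing on missing keys."""
    # ISO 8601 timestamps often end in "Z"; datetime.fromisoformat only accepts
    # that suffix on Python 3.11+, so normalize it for older interpreters.
    created = datetime.fromisoformat(raw["created_at"].replace("Z", "+00:00"))
    return Insight(
        llm_summary=str(raw["llm_summary"]),
        keywords=[str(k) for k in raw["keywords"]],
        tags=[str(t) for t in raw["tags"]],
        category=str(raw["category"]),
        source=str(raw["source"]),
        created_at=created,
    )


# Made-up sample record, for illustration only.
sample = {
    "llm_summary": "A short summary of the document.",
    "keywords": ["retry", "backoff"],
    "tags": ["reliability"],
    "category": "engineering",
    "source": "example-blog",
    "created_at": "2026-03-21T09:30:00Z",
}
insight = parse_insight(sample)
print(insight.category, insight.created_at.isoformat())
```

Parsing into a typed object early means a missing or malformed field surfaces as a KeyError or ValueError at the boundary rather than deep inside application logic.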

Why they exist

Raw source documents are useful for traceability, but they are often too long, inconsistent, or source-specific to use directly in application logic. The insight layer exists to give every document a consistent analytical shape.

With that consistent shape, applications can:

- Render concise feeds and summaries without processing raw content
- Filter insights by category or keyword across all sources
- Build search and recommendation features on top of structured fields
- Consume multiple sources through a common interpretation layer

Insight pipeline

The insight pipeline starts after a document has been written to the documents table. It selects either new documents that do not yet have an insight record, or previously failed insight jobs that are eligible for retry.

For each target document, the pipeline chooses the best available source text — content when present, falling back to summary when full content is not available. This ensures documents collected from different sources can pass through the same enrichment flow.

Pipeline flow

documents table
  → insight-worker selects unprocessed or retryable documents
  → chooses content (preferred) or summary (fallback) as input
  → produces llm_summary, keywords, tags, category
  → writes to insights table with document_id link
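
The input-selection step in the flow above can be sketched as follows. This is an illustrative guess at the logic, not the worker's actual code: the function name choose_input_text is hypothetical, and treating whitespace-only text as absent is an assumption.

```python
from typing import Any, Optional


def choose_input_text(document: dict[str, Any]) -> Optional[str]:
    """Return content when present, else summary, else None."""
    # Preference order follows the pipeline description:
    # content first, summary as the fallback.
    for field in ("content", "summary"):
        value = document.get(field)
        # Assumption: empty or whitespace-only text counts as "not present".
        if isinstance(value, str) and value.strip():
            return value
    # Neither field holds usable text.
    return None


print(choose_input_text({"content": "Full article text", "summary": "Short"}))
print(choose_input_text({"content": "", "summary": "Short"}))
print(choose_input_text({}))
```

A None result corresponds to a document with no usable text, which a worker would presumably record as a failure rather than send for enrichment.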

Note: The insight pipeline runs on a recurring schedule, decoupled from the crawl cycle. A document may appear in the documents API before its insight is ready.

What the API returns

Insights are exposed through two read-only endpoints. Use /api/v1/insights when you only need insight records. Use /api/v1/combined when you need the source document and its insight together in a single response.

/api/v1/insights

Returns insight records only. Useful when you already have document data and need the insight layer.

/api/v1/combined

Returns document and insight merged into one payload. Useful when building feeds that show both source and analysis.

Freshness and timing

Insight generation is asynchronous. A document may appear in the documents API before its insight is available — this is expected behavior, not an error. In normal operation, insights are generated shortly after ingestion, but availability depends on queue volume and retry state.

The worker processes a bounded number of items per cycle and runs repeatedly on a schedule rather than inline with crawling. This keeps source collection and analysis decoupled, so a slow enrichment queue does not delay document availability.

Note: If you need to check whether an insight is ready, query /api/v1/combined and check whether the insight field is null. The /api/v1/insights endpoint returns completed insight records only.
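
The null check described in the note might look like this on the client. It is a sketch under assumptions: each combined record is taken to be a JSON object with an insight key holding either an object or null, and the id field and helper names are hypothetical.

```python
from typing import Any, Iterable, Optional


def get_ready_insight(combined: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the insight object if it is ready, or None if it is null or absent."""
    insight = combined.get("insight")
    return insight if isinstance(insight, dict) else None


def split_by_readiness(records: Iterable[dict[str, Any]]):
    """Partition combined records into (ready, waiting) lists."""
    ready, waiting = [], []
    for record in records:
        (ready if get_ready_insight(record) is not None else waiting).append(record)
    return ready, waiting


# Made-up records standing in for a decoded /api/v1/combined response.
records = [
    {"id": "doc-1", "insight": {"category": "engineering"}},
    {"id": "doc-2", "insight": None},  # insight not generated yet
]
ready, waiting = split_by_readiness(records)
print([r["id"] for r in ready], [r["id"] for r in waiting])
```

Records in the waiting list are still valid documents; a client can render them without a summary and pick up the insight on a later fetch.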

Failure and retry

Insight generation is designed as a retryable background job. If processing fails for a temporary reason, the job is retried with backoff at 1-, 5-, and 15-minute intervals, up to a maximum of 3 attempts.

Some failure types are classified as non-retryable — for example, when the source document has no usable text, or when the document record is missing. These are marked failed immediately without further retries.

Importantly, the document remains accessible even when its insight is missing or failed. The insight layer is additive — document availability is never blocked by enrichment state.

Status      Meaning                 Retried
pending     Queued or in progress   n/a
completed   Enrichment finished     n/a
failed      Temporary failure       Yes, up to 3×
failed      Non-retryable error     No
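
The retry rules in this section can be expressed as a small decision function. This is a sketch, not the worker's implementation: the reason codes and the name next_retry_delay are hypothetical, and it reads the limit as three retries, one per listed interval.

```python
from datetime import timedelta
from typing import Optional

# Backoff schedule from the text: 1, 5, then 15 minutes.
RETRY_DELAYS = (
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
)

# Hypothetical reason codes for the two non-retryable examples in the text:
# no usable source text, and a missing document record.
NON_RETRYABLE = frozenset({"no_usable_text", "document_missing"})


def next_retry_delay(failure_reason: str, retries_so_far: int) -> Optional[timedelta]:
    """Return how long to wait before the next retry, or None to stop retrying."""
    if failure_reason in NON_RETRYABLE:
        return None  # marked failed immediately
    if retries_so_far >= len(RETRY_DELAYS):
        return None  # retry budget exhausted
    return RETRY_DELAYS[retries_so_far]


print(next_retry_delay("timeout", 0))
print(next_retry_delay("timeout", 3))
print(next_retry_delay("no_usable_text", 0))
```

Keeping the schedule in a single tuple ties the retry limit to the number of intervals, so the two cannot drift apart.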

Usage notes

Insights should be treated as a convenience layer for discovery and application logic, not as a replacement for the original source. If precision matters, use insight fields for filtering and triage, then refer back to the source document URL for final verification.

The most useful pattern for building a document feed is straightforward:

1. Fetch recent documents or combined records from /api/v1/combined
2. Use category, keywords, and tags to narrow the set
3. Use llm_summary for preview rendering
4. Use the original document url for full verification
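
The steps above can be sketched as a small filtering function over records that have already been fetched. The record shape is an assumption (a top-level url plus a nested insight object that may be null), build_feed is a hypothetical name, and the sample data is made up.

```python
from typing import Any, Iterable, Optional


def build_feed(
    records: Iterable[dict[str, Any]],
    category: Optional[str] = None,
    keyword: Optional[str] = None,
    tag: Optional[str] = None,
) -> list[dict[str, str]]:
    """Filter combined records by insight fields and return preview entries."""
    feed = []
    for record in records:
        insight = record.get("insight")
        if not isinstance(insight, dict):
            continue  # insight is null or missing: not ready, so skip it
        if category is not None and insight.get("category") != category:
            continue
        if keyword is not None and keyword not in insight.get("keywords", []):
            continue
        if tag is not None and tag not in insight.get("tags", []):
            continue
        feed.append({
            "preview": insight.get("llm_summary", ""),  # generated text
            "url": record.get("url", ""),               # authoritative source
        })
    return feed


# Made-up records standing in for a decoded /api/v1/combined response.
records = [
    {"url": "https://example.com/a",
     "insight": {"llm_summary": "About retries.", "keywords": ["retry"],
                 "tags": ["reliability"], "category": "engineering"}},
    {"url": "https://example.com/b", "insight": None},
    {"url": "https://example.com/c",
     "insight": {"llm_summary": "About layouts.", "keywords": ["grid"],
                 "tags": ["ui"], "category": "design"}},
]
feed = build_feed(records, category="engineering", keyword="retry")
print(feed)
```

Each feed entry keeps the source URL next to the generated preview, so a renderer can always link back to the original document.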

Warning: llm_summary is a generated interpretation, not a verbatim excerpt. Always link users to the original source URL for authoritative content.
