Crawler
Last updated March 21, 2026
SynthLink's crawlers periodically collect data from public sources and convert it into normalized document records ready for LLM enrichment and API delivery. Each crawler handles a different external source, but all follow the same contract — fetch, filter, normalize, and upsert into the shared documents table.
Overview
The crawler layer has three core responsibilities. First, standardizing different external data formats into a single documents table schema. Second, controlling duplicates by URL while tracking the most recent observation time. Third, making stored documents reliably available to the downstream enrichment pipeline and the read-only API.
Every document produced by a crawler contains the same base fields regardless of source.
{
  "title": string,          // extracted from source
  "url": string,            // canonical URL after normalization
  "summary": string,        // raw excerpt — not LLM-generated
  "content": string,        // full body if available, else null
  "source": string,         // logical source identifier
  "content_source": string, // ingestion method: rss | detail | api
  "created_at": string      // first seen (ISO 8601)
}

source identifies which logical source a document belongs to. content_source describes how the content was actually obtained — whether only the RSS summary was stored, whether the detail page was fetched and parsed, or whether the content came directly from a structured API response.
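For TypeScript consumers, the same base schema can be restated as an interface. This is a sketch derived from the JSON above; the example values are illustrative only.

```typescript
// The base document record, restated as a TypeScript interface.
// Field meanings mirror the JSON schema above.
interface DocumentRecord {
  title: string;           // extracted from source
  url: string;             // canonical URL after normalization (dedup key)
  summary: string;         // raw excerpt, not LLM-generated
  content: string | null;  // full body if available, else null
  source: string;          // logical source identifier, e.g. "openai_news"
  content_source: string;  // how content was obtained: "rss", "detail", "api"
  created_at: string;      // first seen, ISO 8601
}

// A minimal example record (values are illustrative only).
const example: DocumentRecord = {
  title: "Example article",
  url: "https://example.com/post",
  summary: "Short excerpt from the feed.",
  content: null,
  source: "openai_news",
  content_source: "rss",
  created_at: "2026-03-21T00:00:00Z",
};
```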
Data collection
SynthLink uses three collection strategies depending on the source.
Feed-based collection
OpenAI, NASA, and arXiv crawlers read RSS or Atom feeds first, extracting titles, links, and summaries from each entry. OpenAI and NASA go a step further — after parsing the feed, they fetch the detail page HTML and attempt to extract longer body text from the article or main region. If the detail page yields insufficient content, the crawler falls back to the RSS summary rather than discarding the document.
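The fallback behavior can be sketched roughly as follows. The names here (extractBody, MIN_BODY_LENGTH, the fetchPage parameter) and the threshold value are illustrative assumptions, not the actual implementation.

```typescript
const MIN_BODY_LENGTH = 500; // assumed threshold, not the real value

// Try the detail page first; fall back to the RSS summary rather than
// discarding the document when the page fails or yields too little text.
async function resolveContent(
  detailUrl: string,
  rssSummary: string,
  fetchPage: (url: string) => Promise<string | null>,
): Promise<{ content: string; contentSource: "detail" | "rss" }> {
  try {
    const html = await fetchPage(detailUrl);
    const body = html ? extractBody(html) : null;
    if (body && body.length >= MIN_BODY_LENGTH) {
      return { content: body, contentSource: "detail" };
    }
  } catch {
    // Detail fetch failures must not discard the item.
  }
  return { content: rssSummary, contentSource: "rss" };
}

// Naive placeholder for extracting the article or main region.
function extractBody(html: string): string | null {
  const m = html.match(/<(article|main)[^>]*>([\s\S]*?)<\/\1>/i);
  return m ? m[2].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim() : null;
}
```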
API-based collection
GitHub, NVD, and Hacker News crawlers call public APIs directly. Because responses are already structured, no HTML parsing is needed. Instead, relevant fields are composed into a summary and content. For example, GitHub composes a summary from the repository description, star count, fork count, primary language, and topics. NVD structures the CVE description, CVSS score, KEV status, and reference links. Hacker News combines the story title, points, comment count, author, and story text.
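A sketch of the GitHub case: structured API fields are composed directly into a summary string. The field names follow GitHub's REST API, but the output format and the interface are illustrative assumptions.

```typescript
// Subset of GitHub's repository API response used for summary composition.
interface Repo {
  description: string | null;
  stargazers_count: number;
  forks_count: number;
  language: string | null;
  topics: string[];
}

// Compose a human-readable summary from structured fields; no HTML parsing
// is needed. The separator and labels are illustrative.
function composeRepoSummary(repo: Repo): string {
  const parts = [
    repo.description ?? "(no description)",
    `stars ${repo.stargazers_count} · forks ${repo.forks_count}`,
    repo.language ? `language: ${repo.language}` : null,
    repo.topics.length ? `topics: ${repo.topics.join(", ")}` : null,
  ];
  return parts.filter((p): p is string => p !== null).join(" | ");
}
```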
Hybrid collection
OpenAI and NASA use a hybrid approach — feed parsing for item discovery and detail page fetching for content enrichment. This separation means new items are detected quickly via the feed, and detail page failures do not block the entire crawl cycle.
Normalization and filtering
Crawlers do not store raw content as-is. Two steps run before every upsert.
URL normalization
Each crawler produces a canonical URL used as the deduplication key. Common transformations include removing tracking parameters (utm_source etc.), stripping trailing slashes, and normalizing domain variants. arXiv normalizes export.arxiv.org URLs and strips version suffixes to produce a stable arxiv.org/abs/... form. Hacker News substitutes the HN item URL when no external link is present.
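A minimal canonicalization pass combining the transformations above might look like this. The exact tracking-parameter list is an assumption.

```typescript
// Assumed tracking-parameter deny list; the real list may differ.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"];

function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  // Remove tracking parameters.
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  // arXiv: fold export.arxiv.org into arxiv.org and strip version suffixes.
  if (url.hostname === "export.arxiv.org") url.hostname = "arxiv.org";
  if (url.hostname === "arxiv.org") {
    url.pathname = url.pathname.replace(/v\d+$/, "");
  }
  // Strip a trailing slash (but keep the root path).
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```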
Quality filtering
Each crawler applies minimum quality thresholds before writing a record. OpenAI and NASA require a minimum body length and summary length — documents that fail both checks are discarded. arXiv rejects abstracts that are too short. GitHub, NVD, and Hacker News apply filters based on star count, CVSS score, story score, and time range respectively. The system is designed to expose only information worth surfacing through the API, not everything that was fetched.
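For the length-based sources, the threshold check can be sketched as below. The numbers and names are placeholders; a document is kept if either the body or the summary clears its bar, which mirrors the RSS-summary fallback described earlier.

```typescript
// Per-source minimum thresholds; the numbers used at runtime are not
// documented here, so treat these as placeholders.
interface Thresholds {
  minSummary: number;
  minBody: number;
}

// Discard only when both checks fail: a document whose body is too short
// can still survive on the strength of its summary.
function passesQualityFilter(
  summary: string,
  content: string | null,
  t: Thresholds,
): boolean {
  if (content && content.length >= t.minBody) return true;
  return summary.length >= t.minSummary;
}
```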
Data flow
After a crawler writes to the documents table, the data moves through two more stages before reaching external consumers.
external source
  → crawler (upsert into documents)
  → insight-worker (LLM enrichment → upsert into insights)
  → public API (/api/v1/documents, /api/v1/insights, /api/v1/combined)
The insight-worker picks up documents that do not yet have an insight, or insight records in a retryable failed state. It uses content as the LLM input when available, falling back to summary. The model output is parsed into llm_summary, keywords, tags, and category, then written to the insights table.
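The candidate selection and input choice could be expressed as below. The join key and the status/retry column names are assumptions inferred from this page, not the actual schema.

```typescript
// Candidate selection for enrichment, as a SQLite/D1-style query.
// Column names (status, retryable, retry_count) are assumptions.
const CANDIDATE_QUERY = `
  SELECT d.url, d.summary, d.content
  FROM documents d
  LEFT JOIN insights i ON i.url = d.url
  WHERE i.url IS NULL
     OR (i.status = 'failed' AND i.retryable = 1 AND i.retry_count < 3)
`;

// content is preferred as LLM input; summary is the fallback.
function llmInput(doc: { summary: string; content: string | null }): string {
  return doc.content ?? doc.summary;
}
```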
Upserts use url as the conflict key. If a document with the same URL already exists, the existing record is preserved. created_at is never modified after initial insertion.
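In SQLite-style SQL this shape of upsert might look as follows. Which columns (if any) are refreshed on conflict is an assumption; the one documented invariant is that created_at never appears in the conflict branch.

```typescript
// SQLite/D1-style upsert keyed on url. created_at is set only on first
// insert; it is deliberately absent from the conflict branch, so it is
// never modified afterwards. The updated column list is an assumption.
const UPSERT_SQL = `
  INSERT INTO documents
    (title, url, summary, content, source, content_source, created_at)
  VALUES (?, ?, ?, ?, ?, ?, ?)
  ON CONFLICT (url) DO UPDATE SET
    summary = excluded.summary,
    content = excluded.content
`;
```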
Automation
Each crawler implements both scheduled() and fetch() handlers, supporting scheduled runs and HTTP-triggered manual runs without any code changes.
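The dual entry points can be sketched like this, with runCrawl standing in for the real fetch, filter, normalize, and upsert cycle:

```typescript
// Stand-in for the actual crawl cycle; returns the processed-record count.
async function runCrawl(): Promise<number> {
  // ...fetch, filter, normalize, upsert (elided)...
  return 0;
}

const worker = {
  // Invoked by the cron trigger on a schedule.
  async scheduled(): Promise<void> {
    await runCrawl();
  },
  // Invoked by an HTTP request for manual runs.
  async fetch(): Promise<Response> {
    const processed = await runCrawl();
    return new Response(JSON.stringify({ processed }), {
      headers: { "content-type": "application/json" },
    });
  },
};

// In a real Cloudflare Worker this object would be the default export:
// export default worker;
```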
Schedules are managed at two levels. Some crawlers define a Cloudflare cron trigger in wrangler.toml. Others are registered with an external scheduler and tracked in the crawlers table alongside metadata such as trigger_url, cron_schedule, and cron_enabled. In practice, the active schedule for any given crawler may differ from what is defined in the repository — always refer to the Status page for the latest run history.
Failure handling
Crawlers are designed with the assumption that external requests can fail. Each worker uses up to three retries with exponential backoff. Transient errors (429, 5xx) are retried. NVD respects the retry-after header. OpenAI and NASA continue with the RSS summary if detail page fetching fails, rather than discarding the item.
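A generic version of this retry loop might look as follows. The base delay is an illustrative value, and the response shape is simplified.

```typescript
// Transient statuses worth retrying: rate limiting and server errors.
const isRetryable = (status: number): boolean => status === 429 || status >= 500;

// Up to maxRetries retries with exponential backoff (1s, 2s, 4s by default).
async function fetchWithRetry(
  doFetch: () => Promise<{ status: number; body?: string }>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<{ status: number; body?: string }> {
  let attempt = 0;
  for (;;) {
    const res = await doFetch();
    if (!isRetryable(res.status) || attempt >= maxRetries) return res;
    const delay = baseDelayMs * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delay));
    attempt++;
  }
}
```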
The insight-worker applies a separate retry policy for LLM enrichment. Failed insight records with retry_count < 3 are reprocessed with delays of 1, 5, and 15 minutes between attempts. Errors classified as non-retryable — such as missing_openrouter_api_key, empty_source_text, or missing_document — are marked permanently failed without further retries.
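The enrichment retry policy reduces to a small decision function. The error-code strings come from this page; the function name and signature are illustrative.

```typescript
// Fixed delay schedule in minutes between enrichment attempts.
const RETRY_DELAYS_MIN = [1, 5, 15];

// Error codes classified as permanently failed (from the document).
const NON_RETRYABLE = new Set([
  "missing_openrouter_api_key",
  "empty_source_text",
  "missing_document",
]);

// Returns the delay before the next attempt, or null when the record
// should be marked permanently failed.
function nextRetryDelayMin(retryCount: number, errorCode: string): number | null {
  if (NON_RETRYABLE.has(errorCode)) return null;          // never retried
  if (retryCount >= RETRY_DELAYS_MIN.length) return null; // budget exhausted
  return RETRY_DELAYS_MIN[retryCount];
}
```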
Note: A failed crawl cycle does not affect existing documents. Records already in the database remain accessible. The main observable effect is that no new documents from that source will appear until the next successful run.
Observability
Every crawler writes its run result to the worker_runs table — worker name, success flag, number of processed records, and error message if applicable. This is the data source for the crawler history shown on the Status page.
The insight-worker additionally writes to integrity_checks after each run — recording the count of orphan insights, duplicate URLs, and completed insights with an empty llm_summary. These values are also visible on the Status page.
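The three counts map naturally onto simple SQL probes. The queries below are sketches; the column names (status, llm_summary) are assumptions based on the schema described earlier on this page.

```typescript
// One query per integrity metric recorded in integrity_checks.
const INTEGRITY_QUERIES = {
  // Insights whose document no longer exists.
  orphan_insights: `
    SELECT COUNT(*) FROM insights i
    LEFT JOIN documents d ON d.url = i.url
    WHERE d.url IS NULL`,
  // URLs appearing more than once despite the upsert key.
  duplicate_urls: `
    SELECT COUNT(*) FROM (
      SELECT url FROM documents GROUP BY url HAVING COUNT(*) > 1
    )`,
  // Completed insights that somehow lack a summary.
  empty_summaries: `
    SELECT COUNT(*) FROM insights
    WHERE status = 'completed' AND (llm_summary IS NULL OR llm_summary = '')`,
};
```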
Sources
Each source section below follows the same format — input method, filter criteria, URL normalization, and operational notes.
OpenAI News
openai_news · every 12h
Input
RSS feed → detail page HTML
content_source
rss + detail
Filter
Minimum summary and body length thresholds
URL normalization
Tracking parameters (utm_source etc.) removed; trailing slash stripped
Falls back to RSS summary if detail page fetch fails or body is too short.
NASA Science
nasa_news · every 24h
Input
RSS feed → detail page HTML
content_source
rss + detail
Filter
Minimum summary and body length thresholds
URL normalization
Trailing slashes removed
Falls back to RSS summary if detail page fetch fails.
GitHub Trending
github_trending · every 6h
Input
GitHub public API
content_source
api
Filter
Top N repositories by stars
URL normalization
Trailing slash removed
Summary is composed from repo description, stars, forks, language, and topics.
arXiv Papers
arxiv · every 12h
Input
Atom feed
content_source
rss
Filter
Abstract minimum length threshold
URL normalization
Normalized to arxiv.org/abs/... format, version suffix removed
Summary is the abstract text. No detail page fetch.
Hacker News
hn · every 3h
Input
HN public API (top stories)
content_source
api
Filter
Minimum score threshold; falls back to HN item URL if no external link
URL normalization
External link preferred; HN item URL as fallback
Summary is composed from story title, points, comment count, author, and story text.
NVD CVE Feed
nvd · configured externally
Input
NVD REST API v2
content_source
api
Filter
Time range filter; respects retry-after header on 429
URL normalization
Canonical CVE URL (nvd.nist.gov/vuln/detail/...)
Summary includes CVE description, CVSS score, KEV status, and reference links.