AI2JSON
Web → JSON for AI reading.
Public sandbox
AI2JSON: URL → JSON

AI2JSON turns any public webpage into clean JSON for AI reading.

Most webpages are noisy: menus, footers, cookie banners, scripts. AI2JSON extracts the real content and returns a stable JSON document you can feed into LLMs and RAG pipelines. No summary. No interpretation. Just readable content + structure + a hash.

  • Extract main content
  • Keep sections[] structure
  • Deterministic output
  • Hash for diffing
Free. No API key. Rate-limited sandbox. Public HTTP(S) only. Spec: webpage.ai.v0.1
Value

Why JSON helps your AI more than raw HTML.

LLMs can ingest HTML, but it’s inefficient and error-prone: boilerplate dominates tokens, chunking is brittle, and “what did the model really read?” is unclear. AI2JSON gives you a small, predictable contract so your AI pipeline becomes simpler, cheaper, and easier to debug.

Less token waste

Navigation, cookie banners, and layout noise are removed so prompts are mostly real content.

Natural chunks

sections[] are stable content blocks for prompting and RAG indexing (better than “split every N chars”).

Deterministic debugging

Same URL → same schema. Easier to reproduce issues vs. scraping/parsing HTML differently per project.

Content fingerprint

content_hash lets you detect meaningful changes (docs/regulations/ToS) and avoid re-embedding unchanged pages.

Compare

What changes vs. “classic web reading”.

A fair comparison is not “AI can’t read HTML”; it can. The real differences are cost, reliability, and traceability.

Without AI2JSON: raw HTML / ad-hoc parsing

Your agent fetches a page and either dumps HTML into a prompt or implements custom parsing.

  • Token bloat: menus/footers/scripts dominate context.
  • Brittle parsing: every site breaks differently.
  • Chunking guesswork: “split by N chars” loses structure.
  • Hard to prove input: unclear what exact text was used later.
With AI2JSON: stable JSON contract

Your agent gets title + sections[] in a predictable schema, plus a SHA-256 fingerprint.

  • Cleaner prompts: content-first, less noise.
  • Stable ingestion: one schema across all sites.
  • Better RAG: sections are natural chunks for embeddings.
  • Traceability: store content_hash to prove what was read.
Prompt template (minimal)

You are given structured page content:
- title: {{title}}
- sections: {{sections}} (each has heading, level, text)

Task: Answer using only these sections. Cite section ids when possible.

Tip: For most use cases, only send: title + sections[].text
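As a sketch of how the template above can be filled from a response (the buildPrompt helper and its formatting are illustrative; only title, sections[], heading, level, id, and text come from the contract):

// Illustrative sketch: render the minimal prompt template from an AI2JSON document.
function buildPrompt(doc) {
  const sections = (doc.sections ?? [])
    .map(s => `[${s.id}] ${s.heading ?? ""} (level ${s.level})\n${s.text}`)
    .join("\n\n");
  return [
    "You are given structured page content:",
    `- title: ${doc.title}`,
    `- sections:\n${sections}`,
    "",
    "Task: Answer using only these sections. Cite section ids when possible."
  ].join("\n");
}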
JSON contract

webpage.ai.v0.1 — minimal, predictable, AI-friendly.

This sandbox returns webpage.ai.v0.1. In v0.1, the payload is intentionally simple: one main section with extracted text, plus metadata and a hash.

Field           Type           Meaning
spec            string         Always "webpage.ai.v0.1".
url             string         Requested URL.
canonical_url   string         Canonical URL (if detected).
last_fetched    string         ISO 8601 timestamp (UTC).
language        string/null    Best-effort language (from HTML).
content_hash    string         SHA-256 of extracted text (prefixed sha256:).
title           string         Page title.
type            string         article / homepage / unknown.
sections[]      array          Ordered blocks: heading, level, id, text.
meta            object         Best-effort metadata (v0.1: source).
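For illustration, a response following this contract could look like the example below. Field names and types match the table; the values are invented, and meta.source is left elided.

{
  "spec": "webpage.ai.v0.1",
  "url": "https://example.com/article",
  "canonical_url": "https://example.com/article",
  "last_fetched": "2024-01-01T00:00:00Z",
  "language": "en",
  "content_hash": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "title": "Example article",
  "type": "article",
  "sections": [
    { "heading": "Example article", "level": 1, "id": "s1", "text": "Extracted main content of the page." }
  ],
  "meta": { "source": "…" }
}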
Sandbox API (your live endpoint)

GET https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=

Example:
https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=https://wikipedia.org
Limits & abuse protection

Free by default, protected by design.

This is a public sandbox. It is rate-limited, restricted to public HTTP(S) URLs, and blocks obvious private/localhost targets (basic SSRF hardening). Very large pages or slow upstream responses may be rejected.

Common responses (JSON)
  • 200 OK
  • 400 bad_request (missing/invalid url)
  • 403 forbidden (private hosts / non-http(s))
  • 415 unsupported_content_type (non-HTML)
  • 429 rate_limited (Retry-After header)
  • 5xx upstream_or_internal

Notes
- No API key in sandbox.
- No JS rendering in v0.1 (static HTML only).
- Best-effort extraction (some pages will be not_extractable).
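A minimal client-side sketch for the 429 case, assuming Retry-After is sent in seconds (the fetchWithRetry helper below is illustrative, not part of the sandbox):

// Illustrative sketch: back off when the sandbox answers 429 rate_limited.
// Assumes Retry-After is in seconds; falls back to 2s if absent or unparsable.
async function fetchWithRetry(apiUrl, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(apiUrl, { headers: { Accept: "application/json" } });
    if (res.status !== 429) return res;
    const waitSeconds = Number(res.headers.get("Retry-After")) || 2;
    await new Promise(resolve => setTimeout(resolve, waitSeconds * 1000));
  }
  throw new Error("rate_limited: retries exhausted");
}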

Node.js fetch example

Store content_hash to skip re-processing unchanged pages.

const url = "https://wikipedia.org";
const api = "https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=" + encodeURIComponent(url);

const res = await fetch(api, { headers: { "Accept": "application/json" } });
if (!res.ok) throw new Error("HTTP " + res.status);
const doc = await res.json();

// Use in prompts / RAG:
const payload = {
  title: doc.title,
  sections: doc.sections?.map(s => ({ id: s.id, heading: s.heading, text: s.text })) ?? []
};

console.log(doc.content_hash, payload);
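Continuing from doc above, a minimal sketch of the skip-if-unchanged idea. The seenHashes store is an illustrative in-memory Map; any KV store or database works the same way.

// Illustrative sketch: skip re-processing when content_hash is unchanged.
// seenHashes is a stand-in for your own KV store / database.
const seenHashes = new Map(); // url -> content_hash

if (seenHashes.get(doc.url) === doc.content_hash) {
  console.log("unchanged, skipping re-embedding:", doc.url);
} else {
  seenHashes.set(doc.url, doc.content_hash);
  // re-chunk and re-embed doc.sections here
}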

RAG chunking tip

Index each section separately: section.id is your stable reference.

// Continuing from the fetch example: each section becomes one chunk.
const chunks = doc.sections.map(s => ({
  id: s.id,
  text: (s.heading ? s.heading + "\n" : "") + s.text
}));

// embeddings(chunks[i].text)
// store metadata: { url, content_hash, section_id: chunks[i].id }