AI2JSON turns any public webpage into clean JSON for AI reading.
Most webpages are noisy: menus, footers, cookie banners, scripts. AI2JSON extracts the real content and returns a stable JSON document you can feed into LLMs and RAG pipelines.
No summary. No interpretation. Just readable content + structure + a hash.
Free. No API key. Rate-limited sandbox. Public HTTP(S) only. Spec: webpage.ai.v0.1
Value
Why JSON helps your AI more than raw HTML.
LLMs can ingest HTML, but it’s inefficient and error-prone: boilerplate dominates tokens, chunking is brittle, and “what did the model really read?” is unclear.
AI2JSON gives you a small, predictable contract so your AI pipeline becomes simpler, cheaper, and easier to debug.
Less token waste
Navigation, cookie banners, and layout noise are removed so prompts are mostly real content.
Natural chunks
sections[] are stable content blocks for prompting and RAG indexing (better than “split every N chars”); see the chunking sketch below.
Deterministic debugging
Same URL → same schema, so issues are easier to reproduce than with HTML scraped and parsed differently in every project.
Content fingerprint
content_hash lets you detect meaningful changes (docs/regulations/ToS) and avoid re-embedding unchanged pages.
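A minimal sketch of turning sections[] into embedding-ready chunks; the toChunks name and the chunk shape are illustrative, not part of the spec:

```js
// Turn a webpage.ai.v0.1 payload into RAG-ready chunks.
// `doc` is the JSON returned by /v1/transform.
function toChunks(doc) {
  return doc.sections.map((section) => ({
    // Keying by content_hash + section id keeps chunk ids stable per content version.
    id: `${doc.content_hash}#${section.id}`,
    text: section.heading
      ? `${section.heading}\n\n${section.text}`
      : section.text,
    metadata: { url: doc.url, title: doc.title, level: section.level },
  }));
}
```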
Compare
What changes vs. “classic web reading”.
A fair comparison is not “AI can’t read HTML”; it can. The real differences are cost, reliability, and traceability.
Without AI2JSON (raw HTML / ad-hoc parsing)
Your agent fetches a page and either dumps HTML into a prompt or implements custom parsing.
Chunking guesswork: “split by N chars” loses structure.
Hard to prove input: it’s unclear afterwards exactly which text the model saw.
With AI2JSON (stable JSON contract)
Your agent gets title + sections[] in a predictable schema, plus a SHA-256 fingerprint.
Cleaner prompts: content-first, less noise.
Stable ingestion: one schema across all sites.
Better RAG: sections are natural chunks for embeddings.
Traceability: store content_hash to prove what was read.
Prompt template (minimal)
You are given structured page content:
- title: {{title}}
- sections: {{sections}} (each has heading, level, text)
Task: Answer using only these sections. Cite section ids when possible.
Tip
For most use cases, send only:
title + sections[].text
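A minimal sketch of assembling that prompt from a transform response, sending only the title and section text (buildPrompt is an illustrative name):

```js
// Build a compact prompt from title + sections[].text only.
function buildPrompt(doc) {
  const sections = doc.sections
    .map((s) => `[${s.id}] ${s.heading ?? ""}\n${s.text}`)
    .join("\n\n");
  return [
    "You are given structured page content:",
    `- title: ${doc.title}`,
    `- sections:\n${sections}`,
    "Task: Answer using only these sections. Cite section ids when possible.",
  ].join("\n");
}
```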
This sandbox returns webpage.ai.v0.1. In v0.1, the payload is intentionally simple: one main section with extracted text, plus metadata and a hash.
| Field | Type | Meaning |
| --- | --- | --- |
| spec | string | Always "webpage.ai.v0.1". |
| url | string | Requested URL. |
| canonical_url | string | Canonical URL (if detected). |
| last_fetched | string | ISO 8601 timestamp (UTC). |
| language | string/null | Best-effort language (from HTML). |
| content_hash | string | SHA-256 of extracted text (prefixed sha256:). |
| title | string | Page title. |
| type | string | article / homepage / unknown. |
| sections[] | array | Ordered blocks: heading, level, id, text. |
| meta | object | Best-effort metadata (v0.1: source). |
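An illustrative v0.1 payload shaped by the table above; all values are made up, and real output depends on the page:

```json
{
  "spec": "webpage.ai.v0.1",
  "url": "https://example.com/post",
  "canonical_url": "https://example.com/post",
  "last_fetched": "2024-01-01T12:00:00Z",
  "language": "en",
  "content_hash": "sha256:3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
  "title": "Example post",
  "type": "article",
  "sections": [
    { "heading": "Example post", "level": 1, "id": "s1", "text": "Extracted body text..." }
  ],
  "meta": { "source": "https://example.com/post" }
}
```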
Sandbox API (your live endpoint)
GET https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=
Example
https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=https://wikipedia.org
Limits & abuse protection
Free by default, protected by design.
This is a public sandbox. It is rate-limited, restricted to public HTTP(S) URLs, and blocks obvious private/localhost targets (basic SSRF hardening).
Very large pages or slow upstream responses may be rejected.
Common responses (JSON)
200 OK
400 bad_request (missing/invalid url)
403 forbidden (private hosts / non-http(s))
415 unsupported_content_type (non-HTML)
429 rate_limited (Retry-After header)
5xx upstream_or_internal
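A minimal sketch of honoring 429 + Retry-After; the single-retry strategy is an assumption, not part of the sandbox contract:

```js
// Fetch with one retry when the sandbox rate-limits the request.
async function fetchWithRetry(url) {
  const res = await fetch(url);
  if (res.status === 429) {
    // Retry-After is in seconds; fall back to 1s if the header is missing.
    const delay = Number(res.headers.get("Retry-After") ?? "1") * 1000;
    await new Promise((resolve) => setTimeout(resolve, delay));
    return fetch(url);
  }
  return res;
}
```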
Notes
- No API key in sandbox.
- No JS rendering in v0.1 (static HTML only).
- Best-effort extraction (some pages will return not_extractable).
Node.js fetch example
Store content_hash to skip re-processing unchanged pages.
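A minimal Node.js (18+) sketch; the in-memory Map cache is a stand-in for whatever store your pipeline uses:

```js
// Transform a page and skip downstream work when content_hash is unchanged.
const ENDPOINT = "https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform";
const seenHashes = new Map(); // url -> last content_hash (illustrative cache)

async function readPage(targetUrl) {
  const res = await fetch(`${ENDPOINT}?url=${encodeURIComponent(targetUrl)}`);
  if (!res.ok) throw new Error(`transform failed: ${res.status}`);
  const doc = await res.json();

  if (seenHashes.get(targetUrl) === doc.content_hash) {
    return null; // unchanged since last fetch: skip re-embedding
  }
  seenHashes.set(targetUrl, doc.content_hash);
  return doc; // new or changed content: process sections[]
}

readPage("https://wikipedia.org").then((doc) => {
  if (doc) console.log(doc.title, doc.sections.length);
  else console.log("unchanged, skipped");
});
```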