AI2JSON turns any public webpage into clean JSON for AI reading.
Most webpages are noisy: menus, footers, cookie banners, scripts. AI2JSON extracts the real content and returns a stable JSON document you can feed into LLMs and RAG pipelines.
No summary. No interpretation. Just readable content + structure + a hash.
Free. No API key. Rate-limited sandbox. Public HTTP(S) only. Spec: webpage.ai.v0.1
Value
Why JSON helps your AI more than raw HTML.
LLMs can ingest HTML, but it’s inefficient and error-prone: boilerplate dominates tokens, chunking is brittle, and “what did the model really read?” is unclear.
AI2JSON gives you a small, predictable contract so your AI pipeline becomes simpler, cheaper, and easier to debug.
Less token waste
Navigation, cookie banners, and layout noise are removed so prompts are mostly real content.
Natural chunks
sections[] are stable content blocks for prompting and RAG indexing (better than “split every N chars”); see the chunking sketch below.
Deterministic debugging
Same URL → same schema, so issues are easier to reproduce than with HTML scraped and parsed differently in every project.
Content fingerprint
content_hash lets you detect meaningful changes (docs/regulations/ToS) and avoid re-embedding unchanged pages.
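A minimal sketch of turning sections[] into embedding-ready chunks; the toChunks name and the chunk shape are illustrative, not part of the spec:

```js
// Turn a webpage.ai.v0.1 payload into RAG-ready chunks.
// `doc` is the JSON returned by /v1/transform.
function toChunks(doc) {
  return doc.sections.map((section) => ({
    // Keying by content_hash + section id keeps chunk ids stable per content version.
    id: `${doc.content_hash}#${section.id}`,
    text: section.heading
      ? `${section.heading}\n\n${section.text}`
      : section.text,
    metadata: { url: doc.url, title: doc.title, level: section.level },
  }));
}
```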
Compare
What changes vs. “classic web reading”.
A fair comparison is not “AI can’t read HTML”; it can. The real differences are cost, reliability, and traceability.
Without AI2JSON (raw HTML / ad-hoc parsing)
Your agent fetches a page and either dumps HTML into a prompt or implements custom parsing.
Chunking guesswork: “split by N chars” loses structure.
Hard to prove input: it’s unclear afterwards exactly which text the model saw.
With AI2JSON (stable JSON contract)
Your agent gets title + sections[] in a predictable schema, plus a SHA-256 fingerprint.
Cleaner prompts: content-first, less noise.
Stable ingestion: one schema across all sites.
Better RAG: sections are natural chunks for embeddings.
Traceability: store content_hash to prove what was read.
Prompt template (minimal)
You are given structured page content:
- title: {{title}}
- sections: {{sections}} (each has heading, level, text)
Task: Answer using only these sections. Cite section ids when possible.
Tip
For most use cases, send only:
title + sections[].text
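A minimal sketch of assembling that prompt from a transform response, sending only the title and section text (buildPrompt is an illustrative name):

```js
// Build a compact prompt from title + sections[].text only.
function buildPrompt(doc) {
  const sections = doc.sections
    .map((s) => `[${s.id}] ${s.heading ?? ""}\n${s.text}`)
    .join("\n\n");
  return [
    "You are given structured page content:",
    `- title: ${doc.title}`,
    `- sections:\n${sections}`,
    "Task: Answer using only these sections. Cite section ids when possible.",
  ].join("\n");
}
```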
This sandbox returns webpage.ai.v0.1. In v0.1, the payload is intentionally simple: one main section with extracted text, plus metadata and a hash.
| Field | Type | Meaning |
| --- | --- | --- |
| spec | string | Always "webpage.ai.v0.1". |
| url | string | Requested URL. |
| canonical_url | string | Canonical URL (if detected). |
| last_fetched | string | ISO 8601 timestamp (UTC). |
| language | string/null | Best-effort language (from HTML). |
| content_hash | string | SHA-256 of extracted text (prefixed sha256:). |
| title | string | Page title. |
| type | string | article / homepage / unknown. |
| sections[] | array | Ordered blocks: heading, level, id, text. |
| meta | object | Best-effort metadata (v0.1: source). |
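An illustrative v0.1 payload shaped by the table above; all values are made up, and real output depends on the page:

```json
{
  "spec": "webpage.ai.v0.1",
  "url": "https://example.com/post",
  "canonical_url": "https://example.com/post",
  "last_fetched": "2024-01-01T12:00:00Z",
  "language": "en",
  "content_hash": "sha256:3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
  "title": "Example post",
  "type": "article",
  "sections": [
    { "heading": "Example post", "level": 1, "id": "s1", "text": "Extracted body text..." }
  ],
  "meta": { "source": "https://example.com/post" }
}
```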
Sandbox API (your live endpoint)
GET https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=
Example
https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform?url=https://wikipedia.org
Limits & abuse protection
Free by default, protected by design.
This is a public sandbox. It is rate-limited, restricted to public HTTP(S) URLs, and blocks obvious private/localhost targets (basic SSRF hardening).
Very large pages or slow upstream responses may be rejected.
Common responses (JSON)
200 OK
400 bad_request (missing/invalid url)
403 forbidden (private hosts / non-http(s))
415 unsupported_content_type (non-HTML)
429 rate_limited (Retry-After header)
5xx upstream_or_internal
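A minimal sketch of honoring 429 + Retry-After; the single-retry strategy is an assumption, not part of the sandbox contract:

```js
// Fetch with one retry when the sandbox rate-limits the request.
async function fetchWithRetry(url) {
  const res = await fetch(url);
  if (res.status === 429) {
    // Retry-After is in seconds; fall back to 1s if the header is missing.
    const delay = Number(res.headers.get("Retry-After") ?? "1") * 1000;
    await new Promise((resolve) => setTimeout(resolve, delay));
    return fetch(url);
  }
  return res;
}
```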
Notes
- No API key in sandbox.
- No JS rendering in v0.1 (static HTML only).
- Best-effort extraction (some pages will return not_extractable).
Node.js fetch example
Store content_hash to skip re-processing unchanged pages.
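A minimal Node.js (18+) sketch; the in-memory Map cache is a stand-in for whatever store your pipeline uses:

```js
// Transform a page and skip downstream work when content_hash is unchanged.
const ENDPOINT = "https://ai2json-apiv1.jeason-bacoul.workers.dev/v1/transform";
const seenHashes = new Map(); // url -> last content_hash (illustrative cache)

async function readPage(targetUrl) {
  const res = await fetch(`${ENDPOINT}?url=${encodeURIComponent(targetUrl)}`);
  if (!res.ok) throw new Error(`transform failed: ${res.status}`);
  const doc = await res.json();

  if (seenHashes.get(targetUrl) === doc.content_hash) {
    return null; // unchanged since last fetch: skip re-embedding
  }
  seenHashes.set(targetUrl, doc.content_hash);
  return doc; // new or changed content: process sections[]
}

readPage("https://wikipedia.org").then((doc) => {
  if (doc) console.log(doc.title, doc.sections.length);
  else console.log("unchanged, skipped");
});
```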