# Reader Mode v0.15
Strip every piece of page noise — navigation, ads, cookie banners, sidebars, share buttons, related-articles widgets — and get back the pure content as clean Markdown, with title, author, date, and reading time.
## What It Does
Reader Mode applies a multi-signal noise removal pipeline to any webpage. Think of it as the "Reader View" button in Firefox or Safari, but for your code. The result is clean, LLM-ready Markdown with full article metadata attached.
- ✅ Removes 25+ classes of page noise (nav, ads, cookie banners, related articles, share widgets, sidebars, footers, pop-ups…)
- ✅ Candidate scoring picks the highest-density content block
- ✅ Extracts metadata: title, author, publish date, reading time, word count
- ✅ Returns clean Markdown — ready to pass to any LLM
- ✅ Works on news articles, blog posts, documentation, long-form essays
## CLI

```bash
# Enable reader mode with --readable
npx webpeel "https://techcrunch.com/2026/02/24/some-article" --readable

# JSON output — includes metadata fields
npx webpeel "https://techcrunch.com/2026/02/24/some-article" --readable --json

# Also works with browser rendering for JS-rendered pages
npx webpeel "https://medium.com/@user/article-slug" --readable --render --json
```
## API

```http
# Basic readable fetch
GET /v1/fetch?url=https://techcrunch.com/2026/02/24/some-article&readable=true
```

```bash
# With curl
curl "https://api.webpeel.dev/v1/fetch?url=https://techcrunch.com/2026/02/24/some-article&readable=true" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
### Query Parameters

| Parameter | Type | Description |
|---|---|---|
| `url` | string (required) | Page URL to fetch in reader mode |
| `readable` | boolean | Set to `true` to enable reader mode |
| `render` | boolean | Use headless browser for JS-rendered pages |
## MCP

Pass `readable: true` to `webpeel_fetch`:

```json
{
  "tool": "webpeel_fetch",
  "arguments": {
    "url": "https://techcrunch.com/2026/02/24/some-article",
    "readable": true
  }
}
```
## How It Works
Reader Mode runs a three-stage pipeline entirely in-process — no external calls:
### Stage 1 — Noise Removal (25+ patterns)
The following element types are stripped before any content scoring:
| Category | What gets removed |
|---|---|
| Navigation | `<nav>`, `[role=navigation]`, header nav bars, breadcrumb trails |
| Advertising | `.ad`, `.ads`, `.advertisement`, `[data-ad]`, iframe ads |
| Cookie / GDPR | `.cookie-banner`, `#consent`, GDPR overlays, "Accept cookies" dialogs |
| Sidebars | `aside`, `[role=complementary]`, `.sidebar`, `.widget-area` |
| Share buttons | `.share`, `.social-share`, `.addthis`, floating share bars |
| Related articles | `.related`, `.recommended`, `.you-may-also-like` |
| Comments | `#comments`, `.comment-section`, Disqus embeds |
| Footers | `<footer>`, `[role=contentinfo]`, site footers |
| Pop-ups / modals | `.modal`, `.popup`, newsletter sign-up overlays |
| Sticky bars | Fixed-position headers, scroll-triggered notification bars |
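The selector-stripping idea can be sketched with Python's standard-library `html.parser`. This is an illustrative toy, not webpeel's actual implementation: it covers only a handful of the categories above, and it discards attributes when re-emitting tags.

```python
from html.parser import HTMLParser

# Illustrative subset of the noise patterns from the table above.
NOISE_TAGS = {"nav", "aside", "footer"}
NOISE_CLASSES = {"ad", "ads", "cookie-banner", "sidebar", "share",
                 "related", "comment-section", "modal", "popup"}


class NoiseStripper(HTMLParser):
    """Drop noise subtrees; keep everything else (attributes are not re-emitted)."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a noise subtree

    def _is_noise(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        return tag in NOISE_TAGS or bool(classes & NOISE_CLASSES)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._is_noise(tag, attrs):
            self.skip_depth += 1
        else:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def strip_noise(html: str) -> str:
    """Remove noise elements before any content scoring."""
    parser = NoiseStripper()
    parser.feed(html)
    return "".join(parser.out)
```

A production pipeline would use a real CSS-selector engine and handle void elements; this sketch only shows the "strip before scoring" ordering.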
### Stage 2 — Candidate Scoring

After stripping noise, the remaining block-level elements (`<article>`, `<main>`, `<div>`, `<section>`) are scored by content density — the ratio of visible text to total HTML. The highest-scoring block is selected as the article body.
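Content-density scoring reduces to a small amount of code. A minimal sketch (the tag-stripping regex is a toy; the real scorer is internal to webpeel):

```python
import re


def text_density(html_block: str) -> float:
    """Ratio of visible text length to total HTML length for one block."""
    text = re.sub(r"<[^>]+>", "", html_block)   # toy tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) / max(len(html_block), 1)


def pick_article_body(blocks: list[str]) -> str:
    """Select the candidate block with the highest content density."""
    return max(blocks, key=text_density)
```

A link-heavy nav block scores low (mostly markup, little text), while a paragraph-heavy article block scores close to 1.0.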
### Stage 3 — Metadata Extraction

Structured metadata is extracted from `<meta>` tags, JSON-LD, and OpenGraph properties, then verified against visible on-page signals:

- **Title** — from `og:title`, `<title>`, or the largest heading
- **Author** — from `article:author`, byline heuristics, JSON-LD `Person`
- **Published date** — from `article:published_time`, `datePublished`, visible date patterns
- **Reading time** — calculated at 238 WPM (average adult reading speed)
- **Word count** — plain-text word count of the extracted body
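The reading-time and word-count fields are simple to derive from the extracted body. A sketch using the 238 WPM figure above (whether webpeel rounds or takes the ceiling is an assumption here; rounding matches the example output below):

```python
WORDS_PER_MINUTE = 238  # average adult silent-reading speed


def reading_stats(body: str) -> tuple[int, int]:
    """Return (word_count, reading_time_minutes) for an extracted article body."""
    word_count = len(body.split())
    reading_time = max(1, round(word_count / WORDS_PER_MINUTE))
    return word_count, reading_time
```

For a 980-word article this yields a reading time of 4 minutes, as in the example output.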
## Example Output

```json
{
  "url": "https://techcrunch.com/2026/02/24/some-article",
  "readable": true,
  "title": "The Rise of LLM-Free Web Agents",
  "author": "Jane Smith",
  "publishedAt": "2026-02-24T08:00:00Z",
  "readingTime": 4,
  "wordCount": 980,
  "content": "# The Rise of LLM-Free Web Agents\n\nFor years, building a reliable web agent meant stitching together an LLM, a browser, and a prompt. But a new wave of tools is changing that...\n\n## Why BM25 Is Enough for Most Tasks\n\nLarge language models are powerful, but they're also slow and expensive...",
  "excerpt": "For years, building a reliable web agent meant stitching together an LLM, a browser, and a prompt."
}
```
## SDK Usage

```ts
import { peel } from 'webpeel';

const result = await peel('https://techcrunch.com/2026/02/24/some-article', {
  readable: true
});

console.log(result.title);       // "The Rise of LLM-Free Web Agents"
console.log(result.author);      // "Jane Smith"
console.log(result.publishedAt); // "2026-02-24T08:00:00Z"
console.log(result.readingTime); // 4 (minutes)
console.log(result.wordCount);   // 980
console.log(result.content);     // Clean Markdown article body
```

```python
from webpeel import WebPeel

client = WebPeel()
result = client.scrape(
    "https://techcrunch.com/2026/02/24/some-article",
    readable=True
)

print(result.title)        # "The Rise of LLM-Free Web Agents"
print(result.author)       # "Jane Smith"
print(result.reading_time) # 4
print(result.word_count)   # 980
print(result.content)      # Clean Markdown
```
## When to Use Reader Mode
| Use case | Use Reader Mode? |
|---|---|
| News articles, blog posts | ✅ Yes — lots of surrounding noise |
| Documentation pages | ✅ Yes — strip nav and ads |
| Long-form essays | ✅ Yes — ideal use case |
| E-commerce product pages | ⚠️ Partial — use `--schema` instead for structured data |
| Search result pages | ❌ No — content IS the listing grid |
| Single-page apps (SPAs) | ✅ Yes, but add the `--render` flag |
| Twitter / GitHub / Reddit | ❌ No — use Domain Extractors instead |
### `--focus` for LLM efficiency

Stack Reader Mode with BM25 query filtering for maximum token efficiency: `--readable --focus "climate impact"`. Reader Mode strips noise first, then BM25 keeps only the most query-relevant paragraphs. Typical savings: 40–75% fewer tokens vs. raw HTML.
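The filtering step can be illustrated with a self-contained classic BM25 scorer over paragraphs. This is a sketch of the technique, not webpeel's internal `--focus` implementation (the `keep` cutoff is a hypothetical parameter for illustration):

```python
import math
import re


def bm25_scores(paragraphs: list[str], query: str,
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Classic BM25 score of each paragraph against the query."""
    docs = [re.findall(r"\w+", p.lower()) for p in paragraphs]
    terms = re.findall(r"\w+", query.lower())
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in terms:
            df = sum(1 for d in docs if term in d)          # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


def focus(paragraphs: list[str], query: str, keep: int = 2) -> list[str]:
    """Keep only the `keep` most query-relevant paragraphs, in document order."""
    scores = bm25_scores(paragraphs, query)
    top = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)[:keep]
    return [paragraphs[i] for i in sorted(top)]
```

Off-topic paragraphs score zero for the query terms and are dropped, which is where the token savings come from.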