# Reader Mode v0.15
Strip every piece of page noise — navigation, ads, cookie banners, sidebars, share buttons, related-articles widgets — and get back the pure content as clean Markdown, with title, author, date, and reading time.
## What It Does
Reader Mode applies a multi-signal noise removal pipeline to any webpage. Think of it as the "Reader View" button in Firefox or Safari, but for your code. The result is clean, LLM-ready Markdown with full article metadata attached.
- ✅ Removes 25+ classes of page noise (nav, ads, cookie banners, related articles, share widgets, sidebars, footers, pop-ups…)
- ✅ Candidate scoring picks the highest-density content block
- ✅ Extracts metadata: title, author, publish date, reading time, word count
- ✅ Returns clean Markdown — ready to pass to any LLM
- ✅ Works on news articles, blog posts, documentation, long-form essays
## CLI

```bash
# Enable reader mode with --readable
npx webpeel "https://techcrunch.com/2026/02/24/some-article" --readable

# JSON output — includes metadata fields
npx webpeel "https://techcrunch.com/2026/02/24/some-article" --readable --json

# Also works with browser rendering for JS-rendered pages
npx webpeel "https://medium.com/@user/article-slug" --readable --render --json
```
## API

```http
# Basic readable fetch
GET /v1/fetch?url=https://techcrunch.com/2026/02/24/some-article&readable=true
```

```bash
# With curl
curl "https://api.webpeel.dev/v1/fetch?url=https://techcrunch.com/2026/02/24/some-article&readable=true" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
### Query Parameters

| Parameter | Type | Description |
|---|---|---|
| `url` | string (required) | Page URL to fetch in reader mode |
| `readable` | boolean | Set to `true` to enable reader mode |
| `render` | boolean | Use headless browser for JS-rendered pages |
## MCP

Pass `readable: true` to `webpeel_fetch`:

```json
{
  "tool": "webpeel_fetch",
  "arguments": {
    "url": "https://techcrunch.com/2026/02/24/some-article",
    "readable": true
  }
}
```
## How It Works
Reader Mode runs a three-stage pipeline entirely in-process — no external calls:
### Stage 1 — Noise Removal (25+ patterns)
The following element types are stripped before any content scoring:
| Category | What gets removed |
|---|---|
| Navigation | `<nav>`, `[role=navigation]`, header nav bars, breadcrumb trails |
| Advertising | `.ad`, `.ads`, `.advertisement`, `[data-ad]`, iframe ads |
| Cookie / GDPR | `.cookie-banner`, `#consent`, GDPR overlays, "Accept cookies" dialogs |
| Sidebars | `aside`, `[role=complementary]`, `.sidebar`, `.widget-area` |
| Share buttons | `.share`, `.social-share`, `.addthis`, floating share bars |
| Related articles | `.related`, `.recommended`, `.you-may-also-like` |
| Comments | `#comments`, `.comment-section`, Disqus embeds |
| Footers | `<footer>`, `[role=contentinfo]`, site footers |
| Pop-ups / modals | `.modal`, `.popup`, newsletter sign-up overlays |
| Sticky bars | Fixed-position headers, scroll-triggered notification bars |
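The selector-stripping idea can be sketched with Python's standard-library `html.parser`. This is an illustrative toy, not webpeel's actual implementation: it covers only a handful of the categories above, and it discards attributes when re-emitting tags.

```python
from html.parser import HTMLParser

# Illustrative subset of the noise patterns from the table above.
NOISE_TAGS = {"nav", "aside", "footer"}
NOISE_CLASSES = {"ad", "ads", "cookie-banner", "sidebar", "share",
                 "related", "comment-section", "modal", "popup"}


class NoiseStripper(HTMLParser):
    """Drop noise subtrees; keep everything else (attributes are not re-emitted)."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a noise subtree

    def _is_noise(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        return tag in NOISE_TAGS or bool(classes & NOISE_CLASSES)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._is_noise(tag, attrs):
            self.skip_depth += 1
        else:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def strip_noise(html: str) -> str:
    """Remove noise elements before any content scoring."""
    parser = NoiseStripper()
    parser.feed(html)
    return "".join(parser.out)
```

A production pipeline would use a real CSS-selector engine and handle void elements; this sketch only shows the "strip before scoring" ordering.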
### Stage 2 — Candidate Scoring

After stripping noise, the remaining block-level elements (`<article>`, `<main>`, `<div>`, `<section>`) are scored by content density — the ratio of visible text to total HTML. The highest-scoring block is selected as the article body.
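Content-density scoring reduces to a small amount of code. A minimal sketch (the tag-stripping regex is a toy; the real scorer is internal to webpeel):

```python
import re


def text_density(html_block: str) -> float:
    """Ratio of visible text length to total HTML length for one block."""
    text = re.sub(r"<[^>]+>", "", html_block)   # toy tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) / max(len(html_block), 1)


def pick_article_body(blocks: list[str]) -> str:
    """Select the candidate block with the highest content density."""
    return max(blocks, key=text_density)
```

A link-heavy nav block scores low (mostly markup, little text), while a paragraph-heavy article block scores close to 1.0.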
### Stage 3 — Metadata Extraction

Structured metadata is extracted from `<meta>` tags, JSON-LD, and OpenGraph properties, then verified against visible on-page signals:

- **Title** — from `og:title`, `<title>`, or the largest heading
- **Author** — from `article:author`, byline heuristics, JSON-LD `Person`
- **Published date** — from `article:published_time`, `datePublished`, visible date patterns
- **Reading time** — calculated at 238 WPM (average adult reading speed)
- **Word count** — plain-text word count of the extracted body
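The reading-time and word-count fields are simple to derive from the extracted body. A sketch using the 238 WPM figure above (whether webpeel rounds or takes the ceiling is an assumption here; rounding matches the example output below):

```python
WORDS_PER_MINUTE = 238  # average adult silent-reading speed


def reading_stats(body: str) -> tuple[int, int]:
    """Return (word_count, reading_time_minutes) for an extracted article body."""
    word_count = len(body.split())
    reading_time = max(1, round(word_count / WORDS_PER_MINUTE))
    return word_count, reading_time
```

For a 980-word article this yields a reading time of 4 minutes, as in the example output.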
## Example Output

```json
{
  "url": "https://techcrunch.com/2026/02/24/some-article",
  "readable": true,
  "title": "The Rise of LLM-Free Web Agents",
  "author": "Jane Smith",
  "publishedAt": "2026-02-24T08:00:00Z",
  "readingTime": 4,
  "wordCount": 980,
  "content": "# The Rise of LLM-Free Web Agents\n\nFor years, building a reliable web agent meant stitching together an LLM, a browser, and a prompt. But a new wave of tools is changing that...\n\n## Why BM25 Is Enough for Most Tasks\n\nLarge language models are powerful, but they're also slow and expensive...",
  "excerpt": "For years, building a reliable web agent meant stitching together an LLM, a browser, and a prompt."
}
```
## SDK Usage

```ts
import { peel } from 'webpeel';

const result = await peel('https://techcrunch.com/2026/02/24/some-article', {
  readable: true
});

console.log(result.title);       // "The Rise of LLM-Free Web Agents"
console.log(result.author);      // "Jane Smith"
console.log(result.publishedAt); // "2026-02-24T08:00:00Z"
console.log(result.readingTime); // 4 (minutes)
console.log(result.wordCount);   // 980
console.log(result.content);     // Clean Markdown article body
```

```python
from webpeel import WebPeel

client = WebPeel()
result = client.scrape(
    "https://techcrunch.com/2026/02/24/some-article",
    readable=True
)

print(result.title)        # "The Rise of LLM-Free Web Agents"
print(result.author)       # "Jane Smith"
print(result.reading_time) # 4
print(result.word_count)   # 980
print(result.content)      # Clean Markdown
```
## When to Use Reader Mode
| Use case | Use Reader Mode? |
|---|---|
| News articles, blog posts | ✅ Yes — lots of surrounding noise |
| Documentation pages | ✅ Yes — strip nav and ads |
| Long-form essays | ✅ Yes — ideal use case |
| E-commerce product pages | ⚠️ Partial — use `--schema` instead for structured data |
| Search result pages | ❌ No — content IS the listing grid |
| Single-page apps (SPAs) | ✅ Yes, but add the `--render` flag |
| Twitter / GitHub / Reddit | ❌ No — use Domain Extractors instead |
### `--focus` for LLM efficiency

Stack Reader Mode with BM25 query filtering for maximum token efficiency: `--readable --focus "climate impact"`. Reader Mode strips noise first, then BM25 keeps only the most query-relevant paragraphs. Typical savings: 40–75% fewer tokens vs. raw HTML.
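The filtering step can be illustrated with a self-contained classic BM25 scorer over paragraphs. This is a sketch of the technique, not webpeel's internal `--focus` implementation (the `keep` cutoff is a hypothetical parameter for illustration):

```python
import math
import re


def bm25_scores(paragraphs: list[str], query: str,
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Classic BM25 score of each paragraph against the query."""
    docs = [re.findall(r"\w+", p.lower()) for p in paragraphs]
    terms = re.findall(r"\w+", query.lower())
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in terms:
            df = sum(1 for d in docs if term in d)          # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


def focus(paragraphs: list[str], query: str, keep: int = 2) -> list[str]:
    """Keep only the `keep` most query-relevant paragraphs, in document order."""
    scores = bm25_scores(paragraphs, query)
    top = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)[:keep]
    return [paragraphs[i] for i in sorted(top)]
```

Off-topic paragraphs score zero for the query terms and are dropped, which is where the token savings come from.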