# Domain Extractors v0.15
Pass a Twitter/X, Reddit, GitHub, or Hacker News URL to WebPeel and get back clean, structured data — tweets, comments, repo stats, stories. No special flags. Fully automatic.
## What It Does
Domain Extractors are purpose-built parsers that activate automatically when WebPeel detects a supported platform URL. Instead of scraping HTML (slow, brittle, blocked), each extractor calls the platform's own data API for reliable, structured output:
- 🐦 Twitter/X — tweet text, engagement metrics, threads, quoted tweets
- 🟠 Reddit — post + comments via Reddit's JSON API
- 🐙 GitHub — repo stats, README, open issues, recent PRs via GitHub API
- 🟧 Hacker News — stories + threaded comments via Firebase API
All extractors are zero-config: just fetch the URL as you normally would.
## CLI

```bash
# GitHub repo — structured stats + README
npx webpeel "https://github.com/webpeel/webpeel" --json

# Reddit post + comments
npx webpeel "https://reddit.com/r/programming/comments/xyz/title" --json

# Hacker News story + comments
npx webpeel "https://news.ycombinator.com/item?id=12345" --json

# Twitter/X tweet
npx webpeel "https://x.com/webpeel_dev/status/12345" --json

# Plain text output (no --json)
npx webpeel "https://github.com/webpeel/webpeel"
```
The structured data appears in the `domainData` field of the JSON output, alongside the standard `content` and `title` fields.
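Consuming that output is plain JSON handling. The payload below is an abridged, illustrative sample (mirroring the GitHub example later on this page), not a live response:

```python
import json

# Abridged sample of `npx webpeel <url> --json` output (illustrative values).
raw = """
{
  "title": "webpeel/webpeel",
  "content": "# WebPeel ...",
  "domainData": {
    "platform": "github",
    "repo": {"fullName": "webpeel/webpeel", "stars": 1842}
  }
}
"""

result = json.loads(raw)
if "domainData" in result:  # present only when a supported platform was detected
    repo = result["domainData"]["repo"]
    print(f'{repo["fullName"]}: {repo["stars"]} stars')
```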
## MCP

Domain Extractors run automatically via `webpeel_fetch`. No special tool name is required:

```json
{
  "tool": "webpeel_fetch",
  "arguments": {
    "url": "https://github.com/webpeel/webpeel"
  }
}
```

When a supported domain is detected, the response includes a `domainData` field with the structured platform data.
## Supported Platforms
### Twitter / X

Extracts tweet content and engagement using the tweet embed API — no Twitter developer account needed.

#### Supported URL patterns

- `https://x.com/:user/status/:id`
- `https://twitter.com/:user/status/:id`
#### Example output

```json
{
  "platform": "twitter",
  "tweet": {
    "id": "1234567890",
    "text": "Introducing WebPeel v0.15 — YouTube transcripts, BM25 Q&A, and more 🚀",
    "author": {
      "username": "webpeel_dev",
      "name": "WebPeel",
      "verified": false
    },
    "metrics": {
      "likes": 142,
      "retweets": 38,
      "replies": 17,
      "views": 4821
    },
    "createdAt": "2026-02-24T10:00:00Z",
    "quotedTweet": null,
    "thread": []
  }
}
```
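Twitter's public oEmbed endpoint (`publish.twitter.com/oembed`) is the standard way to read a tweet without a developer account; whether WebPeel calls exactly this endpoint is an assumption on our part, but the request shape looks like this:

```python
from urllib.parse import urlencode

def oembed_url(tweet_url: str) -> str:
    """Build a request URL for Twitter's public oEmbed endpoint."""
    query = urlencode({"url": tweet_url, "omit_script": "true"})
    return f"https://publish.twitter.com/oembed?{query}"

print(oembed_url("https://x.com/webpeel_dev/status/12345"))
```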
### Reddit

Uses Reddit's public `.json` API endpoint — no OAuth required, no scraping.

#### Supported URL patterns

- `https://reddit.com/r/:sub/comments/:id/:slug`
- `https://www.reddit.com/r/:sub/comments/:id/:slug`
- `https://old.reddit.com/r/:sub/comments/:id/:slug`
#### Example output

```json
{
  "platform": "reddit",
  "post": {
    "title": "WebPeel vs Firecrawl — an honest comparison",
    "author": "u/dev_tools_fan",
    "subreddit": "r/programming",
    "score": 312,
    "upvoteRatio": 0.94,
    "numComments": 47,
    "url": "https://reddit.com/r/programming/comments/xyz/...",
    "selftext": "I spent a weekend benchmarking both tools...",
    "createdAt": "2026-02-20T18:45:00Z"
  },
  "comments": [
    {
      "author": "u/alice",
      "score": 88,
      "text": "WebPeel's BM25 quick answer is genuinely useful for agent pipelines.",
      "replies": [
        {
          "author": "u/bob",
          "score": 31,
          "text": "Agreed — especially the no-LLM-key angle."
        }
      ]
    }
  ]
}
```
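Reddit's `.json` endpoint is just the post URL with a `.json` suffix appended, so the mapping is a one-liner (the helper name is ours, for illustration; URLs without query strings or fragments are assumed):

```python
def reddit_json_url(post_url: str) -> str:
    """Append Reddit's `.json` suffix to a post URL."""
    return post_url.rstrip("/") + ".json"

print(reddit_json_url("https://www.reddit.com/r/programming/comments/xyz/title/"))
# https://www.reddit.com/r/programming/comments/xyz/title.json
```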
### GitHub

Calls the GitHub REST API to return repo metadata, the README, open issues, and recent pull requests. No GitHub token is needed for public repos (60 requests/hour unauthenticated; set `GITHUB_TOKEN` for 5,000 requests/hour).

#### Supported URL patterns

- `https://github.com/:owner/:repo`
- `https://github.com/:owner/:repo/issues/:n`
- `https://github.com/:owner/:repo/pull/:n`
#### Example output

```json
{
  "platform": "github",
  "repo": {
    "name": "webpeel",
    "fullName": "webpeel/webpeel",
    "description": "Fast web fetching for AI agents",
    "stars": 1842,
    "forks": 94,
    "openIssues": 12,
    "language": "TypeScript",
    "license": "AGPL-3.0",
    "topics": ["web-scraping", "ai", "mcp", "typescript"],
    "homepage": "https://webpeel.dev",
    "pushedAt": "2026-02-24T09:30:00Z"
  },
  "readme": "# WebPeel\n\nFast web fetching for AI agents...",
  "recentIssues": [
    { "number": 201, "title": "Support for Playwright tracing", "state": "open", "createdAt": "2026-02-22T..." }
  ],
  "recentPRs": [
    { "number": 198, "title": "feat: YouTube transcript extractor", "state": "merged", "mergedAt": "2026-02-20T..." }
  ]
}
```
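A GitHub repo URL maps directly onto the REST endpoint `https://api.github.com/repos/:owner/:repo`. The helper below sketches that mapping plus the optional `GITHUB_TOKEN` header; the function name is ours, not WebPeel's:

```python
import os
from urllib.parse import urlparse

def github_api_request(repo_url: str) -> tuple[str, dict]:
    """Map a github.com repo URL to its REST API endpoint plus request headers."""
    owner, repo = urlparse(repo_url).path.strip("/").split("/")[:2]
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:  # authenticated: 5,000 req/hour instead of 60
        headers["Authorization"] = f"Bearer {token}"
    return f"https://api.github.com/repos/{owner}/{repo}", headers
```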
### Hacker News

Uses the official HN Firebase API — the same data powering news.ycombinator.com itself.

#### Supported URL patterns

- `https://news.ycombinator.com/item?id=:id`
- `https://news.ycombinator.com/` (front page)
#### Example output

```json
{
  "platform": "hackernews",
  "story": {
    "id": 42123456,
    "title": "WebPeel v0.15 – YouTube transcripts and BM25 Q&A",
    "url": "https://webpeel.dev/changelog",
    "score": 284,
    "author": "jliu",
    "numComments": 63,
    "createdAt": "2026-02-24T08:00:00Z"
  },
  "comments": [
    {
      "id": 42123500,
      "author": "pg",
      "text": "Nice. The BM25 quick answer is clever.",
      "score": 41,
      "replies": []
    }
  ]
}
```
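The Firebase endpoint for an item is derived from the `id` query parameter in the HN URL; the official API serves each item at `https://hacker-news.firebaseio.com/v0/item/:id.json`. A minimal sketch of that mapping (the helper name is ours):

```python
from urllib.parse import urlparse, parse_qs

def hn_item_endpoint(item_url: str) -> str:
    """Map a news.ycombinator.com item URL to the official Firebase API endpoint."""
    item_id = parse_qs(urlparse(item_url).query)["id"][0]
    return f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"

print(hn_item_endpoint("https://news.ycombinator.com/item?id=42123456"))
# https://hacker-news.firebaseio.com/v0/item/42123456.json
```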
## SDK Usage

### TypeScript

```typescript
import { peel } from 'webpeel';

// GitHub repo
const repo = await peel('https://github.com/webpeel/webpeel');
console.log(repo.domainData.repo.stars); // 1842
console.log(repo.domainData.readme);     // README markdown

// Reddit post
const post = await peel('https://reddit.com/r/programming/comments/xyz/title');
console.log(post.domainData.post.score); // 312
console.log(post.domainData.comments);   // threaded comments

// Hacker News story
const hn = await peel('https://news.ycombinator.com/item?id=42123456');
console.log(hn.domainData.story.title);
console.log(hn.domainData.comments);
```
### Python

```python
from webpeel import WebPeel

client = WebPeel()

# GitHub repo
repo = client.scrape("https://github.com/webpeel/webpeel")
print(repo.domain_data["repo"]["stars"])  # 1842
print(repo.domain_data["readme"])         # README markdown

# Reddit post
post = client.scrape("https://reddit.com/r/programming/comments/xyz/title")
print(post.domain_data["post"]["score"])
print(post.domain_data["comments"])
```
The GitHub extractor works without authentication for public repos (60 requests/hour). Set a `GITHUB_TOKEN` environment variable to raise the limit to 5,000 requests/hour. Generate a token at github.com/settings/tokens — no special scopes are needed for public repos.