Structured Extraction
Extract structured JSON data from any web page using LLM inference. Define a JSON Schema or natural language prompt — WebPeel fetches the page and returns typed, structured data.
LLM extraction is BYOK (bring your own key): pass your key via the llmApiKey parameter. Works with OpenAI, Anthropic (via proxy), or any OpenAI-compatible endpoint.
Endpoints
- POST /v1/extract: Fetch a URL and extract structured data using LLM inference and a JSON Schema or prompt.
- GET /v1/extract/auto: Automatically detect the page type and extract heuristic structured data without an LLM. No API key needed for the extraction itself.
POST /v1/extract
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | The URL to fetch and extract data from. |
| schema | object | Optional* | JSON Schema object defining the expected output shape. Either schema or prompt must be provided. |
| prompt | string | Optional* | Natural language instruction for extraction. Either schema or prompt must be provided. |
| llmApiKey | string | Optional | Your OpenAI-compatible API key. Falls back to server-configured OPENAI_API_KEY if omitted. |
| model | string | Optional | LLM model to use. Default: gpt-4o-mini. |
| baseUrl | string | Optional | Custom OpenAI-compatible base URL. Default: https://api.openai.com/v1. |
* At least one of schema or prompt is required.
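The mutual requirement above can be mirrored with a small client-side check before sending the request. This is a sketch (the helper name is illustrative); the server enforces the same rule and responds with an invalid_request error.

```python
def validate_extract_body(body: dict) -> None:
    """Client-side sanity check mirroring the server's invalid_request rule."""
    if not body.get("url"):
        raise ValueError("invalid_request: url is required")
    if "schema" not in body and "prompt" not in body:
        raise ValueError("invalid_request: provide at least one of schema or prompt")

# Passes: url plus a prompt
validate_extract_body({"url": "https://example.com", "prompt": "Extract the headline"})

# Raises ValueError: url alone is not enough
# validate_extract_body({"url": "https://example.com"})
```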
Response
{
  "success": true,
  "data": {
    "title": "Example Product",
    "price": "$29.99",
    "rating": 4.5,
    "inStock": true
  },
  "confidence": 0.91,
  "metadata": {
    "url": "https://example.com/product",
    "title": "Example Product Page",
    "tokensUsed": 842,
    "model": "gpt-4o-mini",
    "cost": 0.000252,
    "elapsed": 3241
  }
}
confidence is a 0–1 score reflecting extraction quality. LLM extraction scores 0.85–0.98 based on field fill rate. Heuristic extraction scores 0.65–0.70. A score below 0.3 means most fields returned null.
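Those bands suggest a simple acceptance gate before trusting the extracted fields. A minimal sketch, assuming a 0.8 cutoff (tune the threshold for your use case; the helper name is illustrative):

```python
def should_accept(result: dict, threshold: float = 0.8) -> bool:
    """Accept an extraction only when it succeeded and confidence clears the bar."""
    return bool(result.get("success")) and result.get("confidence", 0.0) >= threshold

print(should_accept({"success": True, "confidence": 0.91}))  # True: typical LLM extraction
print(should_accept({"success": True, "confidence": 0.25}))  # False: most fields likely null
```

Results below the threshold are good candidates for a retry with a tighter schema or a different model.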
GET /v1/extract/auto
Heuristic extraction without LLM — detects page type (product, article, job listing, etc.) and extracts known fields using DOM parsing.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | The URL to auto-extract from. |
Response
{
  "url": "https://example.com/article",
  "pageType": "article",
  "confidence": 0.68,
  "structured": {
    "type": "article",
    "title": "Breaking News: ...",
    "author": "Jane Doe",
    "publishedAt": "2024-03-04",
    "description": "A summary of the article..."
  }
}
Examples
# Extract company info from a Wikipedia page
curl -X POST https://api.webpeel.dev/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/OpenAI",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "founded": { "type": "string" },
        "headquarters": { "type": "string" },
        "ceo": { "type": "string" },
        "description": { "type": "string" },
        "products": { "type": "array", "items": { "type": "string" } }
      }
    },
    "llmApiKey": "sk-..."
  }'

# Response:
# {
#   "success": true,
#   "data": {
#     "name": "OpenAI",
#     "founded": "2015",
#     "headquarters": "San Francisco, California",
#     "ceo": "Sam Altman",
#     "description": "American artificial intelligence safety company...",
#     "products": ["ChatGPT", "GPT-4", "DALL-E", "Whisper", "Sora"]
#   },
#   "metadata": { "tokensUsed": 1240, "model": "gpt-4o-mini", "elapsed": 2841 }
# }
# Extract top HN stories with natural language prompt
curl -X POST https://api.webpeel.dev/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "prompt": "Extract the top 5 story titles and their point counts as a JSON array of objects with title and points fields.",
    "llmApiKey": "sk-..."
  }'

# Response:
# {
#   "success": true,
#   "data": [
#     { "title": "Llama 4 released", "points": 842 },
#     { "title": "Ask HN: Best tools for...", "points": 634 }
#   ]
# }
# No LLM key needed — heuristic extraction from article page
curl "https://api.webpeel.dev/v1/extract/auto?url=https://techcrunch.com/2025/01/15/example" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Response:
# {
#   "url": "https://techcrunch.com/2025/01/15/example",
#   "pageType": "article",
#   "structured": {
#     "type": "article",
#     "title": "Example Article Title",
#     "author": "Jane Doe",
#     "publishedAt": "2025-01-15",
#     "description": "Article summary..."
#   }
# }
# Auto-extraction works on product pages too
curl "https://api.webpeel.dev/v1/extract/auto?url=https://www.amazon.com/dp/B0EXAMPLE" \
  -H "Authorization: Bearer YOUR_API_KEY"
# Returns: { pageType: "product", structured: { name, price, rating, reviewCount, ... } }
// Extract product info from an e-commerce page
const response = await fetch('https://api.webpeel.dev/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.WEBPEEL_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://shop.example.com/product/widget-pro',
    schema: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Product name' },
        price: { type: 'number', description: 'Price in USD' },
        currency: { type: 'string', description: 'Currency code (USD, EUR, etc.)' },
        availability: { type: 'string', enum: ['in_stock', 'out_of_stock', 'limited'] },
        rating: { type: 'number', description: 'Average customer rating (0-5)' },
        reviewCount: { type: 'integer', description: 'Number of reviews' },
        images: { type: 'array', items: { type: 'string' }, description: 'Image URLs' },
        features: { type: 'array', items: { type: 'string' }, description: 'Key features list' },
      },
      required: ['name', 'price'],
    },
    llmApiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini',
  }),
});

const { data, metadata } = await response.json();
console.log(data);
// {
//   name: 'Widget Pro 3000',
//   price: 49.99,
//   currency: 'USD',
//   availability: 'in_stock',
//   rating: 4.7,
//   reviewCount: 1842,
//   images: ['https://...jpg'],
//   features: ['Wireless charging', 'Waterproof', '2-year warranty']
// }

console.log(`Model: ${metadata.model}, tokens: ${metadata.tokensUsed}, ${metadata.elapsed}ms`);
import requests, os

# Extract product info from any e-commerce page
response = requests.post(
    'https://api.webpeel.dev/v1/extract',
    headers={
        'Authorization': f'Bearer {os.environ["WEBPEEL_API_KEY"]}',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://shop.example.com/product/widget-pro',
        'schema': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'currency': {'type': 'string'},
                'availability': {'type': 'string', 'enum': ['in_stock', 'out_of_stock', 'limited']},
                'rating': {'type': 'number'},
                'reviewCount': {'type': 'integer'},
                'features': {'type': 'array', 'items': {'type': 'string'}},
            },
        },
        'llmApiKey': os.environ['OPENAI_API_KEY'],
        'model': 'gpt-4o-mini',
    },
)

result = response.json()
product = result['data']
meta = result['metadata']

print(f"{product['name']} — ${product['price']} ({product['availability']})")
print(f"Rating: {product.get('rating', 'N/A')} ({product.get('reviewCount', 0)} reviews)")
print(f"Tokens used: {meta['tokensUsed']}, elapsed: {meta['elapsed']}ms")
import requests, os

# Extract company info from a company homepage
response = requests.post(
    'https://api.webpeel.dev/v1/extract',
    headers={
        'Authorization': f'Bearer {os.environ["WEBPEEL_API_KEY"]}',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://stripe.com',
        'schema': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string', 'description': 'Company name'},
                'tagline': {'type': 'string', 'description': 'Main tagline or value proposition'},
                'description': {'type': 'string', 'description': 'What the company does'},
                'products': {'type': 'array', 'items': {'type': 'string'}, 'description': 'Main products/services'},
                'targetAudience': {'type': 'string', 'description': 'Who the product is for'},
            },
        },
        'llmApiKey': os.environ['OPENAI_API_KEY'],
    },
)

data = response.json()['data']
print(f"Company: {data['name']}")
print(f"Tagline: {data.get('tagline', 'N/A')}")
print(f"Products: {', '.join(data.get('products', []))}")
BYOK — Bring Your Own LLM Key
By default, /v1/extract uses OpenAI's API with your llmApiKey. You can point it to any OpenAI-compatible endpoint using the baseUrl parameter:
{
  "url": "https://example.com",
  "prompt": "Extract the main headline and summary",
  "llmApiKey": "your-key-here",
  "model": "llama3-70b",
  "baseUrl": "https://api.cerebras.ai/v1"
}
| Provider | baseUrl | Example model |
|---|---|---|
| OpenAI (default) | https://api.openai.com/v1 | gpt-4o-mini |
| Cerebras | https://api.cerebras.ai/v1 | llama3.1-70b |
| Groq | https://api.groq.com/openai/v1 | llama-3.3-70b-versatile |
| Ollama (local) | http://localhost:11434/v1 | llama3.2 |
| Together AI | https://api.together.xyz/v1 | mistralai/Mixtral-8x7B |
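The provider table maps naturally onto a small request-body builder. This is a sketch: the helper name and provider keys are illustrative, not part of the API, and the models shown are the examples from the table.

```python
# (baseUrl, example model) per provider, from the table above
PROVIDERS = {
    "openai":   ("https://api.openai.com/v1", "gpt-4o-mini"),
    "cerebras": ("https://api.cerebras.ai/v1", "llama3.1-70b"),
    "groq":     ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    "ollama":   ("http://localhost:11434/v1", "llama3.2"),
    "together": ("https://api.together.xyz/v1", "mistralai/Mixtral-8x7B"),
}

def byok_body(url: str, prompt: str, llm_api_key: str, provider: str = "openai") -> dict:
    """Build a /v1/extract request body targeting an OpenAI-compatible provider."""
    base_url, model = PROVIDERS[provider]
    return {
        "url": url,
        "prompt": prompt,
        "llmApiKey": llm_api_key,
        "model": model,
        "baseUrl": base_url,
    }

body = byok_body("https://example.com", "Extract the main headline", "ollama", provider="ollama")
# body["baseUrl"] is "http://localhost:11434/v1", body["model"] is "llama3.2"
```

POST the resulting dict to /v1/extract as in the earlier examples; only model and baseUrl change per provider.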
Heuristic vs. LLM Extraction
| | POST /v1/extract (LLM) | GET /v1/extract/auto (heuristic) |
|---|---|---|
| LLM required | Yes (BYOK) | No |
| Schema support | Yes — full JSON Schema | No — fixed output per page type |
| Page types | Any page | article, product, job listing |
| Speed | 2–5 seconds | Under 1 second |
| Best for | Custom schemas, complex pages | Quick scraping of known page types |
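A common pattern is to try the fast heuristic endpoint first and fall back to LLM extraction only when it comes up short. A sketch of the decision logic (the 0.6 cutoff is an assumption, not a documented threshold):

```python
def needs_llm_fallback(auto_result: dict, min_confidence: float = 0.6) -> bool:
    """Decide whether a /v1/extract/auto result warrants a POST /v1/extract retry."""
    if not auto_result.get("structured"):
        return True  # page type not recognized, nothing extracted
    return auto_result.get("confidence", 0.0) < min_confidence

# Recognized article with typical heuristic confidence: keep it
print(needs_llm_fallback({"pageType": "article", "confidence": 0.68,
                          "structured": {"title": "..."}}))  # False

# Unrecognized page: escalate to LLM extraction
print(needs_llm_fallback({"pageType": "unknown", "confidence": 0.1,
                          "structured": None}))  # True
```

This keeps the sub-second path for known page types and spends LLM tokens only on the hard cases.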
Errors
| Error Code | Status | Cause |
|---|---|---|
| invalid_request | 400 | Missing url, or neither schema nor prompt provided. |
| missing_api_key | 400 | No LLM API key in request and no server default configured. |
| llm_auth_failed | 401 | The provided llmApiKey was rejected by the LLM provider. |
| llm_rate_limited | 429 | The LLM provider returned a rate limit error. |
| extraction_failed | 500 | The page was fetched but LLM extraction failed. |
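When wiring these into a client, only the rate-limit code is worth an automatic retry; the 400/401 codes need a fixed request or key. A sketch of that classification (the buckets are a suggested policy, not API behavior):

```python
RETRYABLE = {"llm_rate_limited"}  # back off, then retry the same request
FIXABLE = {"invalid_request", "missing_api_key", "llm_auth_failed"}  # correct request/key first

def handle_error(code: str) -> str:
    """Map an error code from the table above to a client action."""
    if code in RETRYABLE:
        return "retry"
    if code in FIXABLE:
        return "fix"
    return "report"  # e.g. extraction_failed: page-specific, surface to the caller

print(handle_error("llm_rate_limited"))  # retry
print(handle_error("llm_auth_failed"))   # fix
```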
See Also
- Deep Research — Multi-round research agent for comprehensive cited reports
- Fetch API — Fetch any page as clean markdown
- Domain Extractors — Pre-built extractors for YouTube, Twitter, GitHub, and more
- Error Reference — All error codes and troubleshooting