Structured Extraction
Extract structured JSON data from any web page using LLM inference. Define a JSON Schema or natural language prompt — WebPeel fetches the page and returns typed, structured data.
LLM extraction is BYOK (bring your own key): pass your key via the llmApiKey parameter. Works with OpenAI, Anthropic (via proxy), or any OpenAI-compatible endpoint.
Endpoints
- POST /v1/extract: Fetch a URL and extract structured data using LLM inference and a JSON Schema or prompt.
- GET /v1/extract/auto: Automatically detect the page type and extract heuristic structured data without an LLM. No API key needed for the extraction itself.
POST /v1/extract
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | The URL to fetch and extract data from. |
| schema | object | Optional* | JSON Schema object defining the expected output shape. Either schema or prompt must be provided. |
| prompt | string | Optional* | Natural language instruction for extraction. Either schema or prompt must be provided. |
| llmApiKey | string | Optional | Your OpenAI-compatible API key. Falls back to server-configured OPENAI_API_KEY if omitted. |
| model | string | Optional | LLM model to use. Default: gpt-4o-mini. |
| baseUrl | string | Optional | Custom OpenAI-compatible base URL. Default: https://api.openai.com/v1. |
* At least one of schema or prompt is required.
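The mutual requirement above can be mirrored with a small client-side check before sending the request. This is a sketch (the helper name is illustrative); the server enforces the same rule and responds with an invalid_request error.

```python
def validate_extract_body(body: dict) -> None:
    """Client-side sanity check mirroring the server's invalid_request rule."""
    if not body.get("url"):
        raise ValueError("invalid_request: url is required")
    if "schema" not in body and "prompt" not in body:
        raise ValueError("invalid_request: provide at least one of schema or prompt")

# Passes: url plus a prompt
validate_extract_body({"url": "https://example.com", "prompt": "Extract the headline"})

# Raises ValueError: url alone is not enough
# validate_extract_body({"url": "https://example.com"})
```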
Response
{
  "success": true,
  "data": {
    "title": "Example Product",
    "price": "$29.99",
    "rating": 4.5,
    "inStock": true
  },
  "confidence": 0.91,
  "metadata": {
    "url": "https://example.com/product",
    "title": "Example Product Page",
    "tokensUsed": 842,
    "model": "gpt-4o-mini",
    "cost": 0.000252,
    "elapsed": 3241
  }
}
confidence is a 0–1 score reflecting extraction quality. LLM extraction scores 0.85–0.98 based on field fill rate. Heuristic extraction scores 0.65–0.70. A score below 0.3 means most fields returned null.
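Those bands suggest a simple acceptance gate before trusting the extracted fields. A minimal sketch, assuming a 0.8 cutoff (tune the threshold for your use case; the helper name is illustrative):

```python
def should_accept(result: dict, threshold: float = 0.8) -> bool:
    """Accept an extraction only when it succeeded and confidence clears the bar."""
    return bool(result.get("success")) and result.get("confidence", 0.0) >= threshold

print(should_accept({"success": True, "confidence": 0.91}))  # True: typical LLM extraction
print(should_accept({"success": True, "confidence": 0.25}))  # False: most fields likely null
```

Results below the threshold are good candidates for a retry with a tighter schema or a different model.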
GET /v1/extract/auto
Heuristic extraction without LLM — detects page type (product, article, job listing, etc.) and extracts known fields using DOM parsing.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | The URL to auto-extract from. |
Response
{
  "url": "https://example.com/article",
  "pageType": "article",
  "confidence": 0.68,
  "structured": {
    "type": "article",
    "title": "Breaking News: ...",
    "author": "Jane Doe",
    "publishedAt": "2024-03-04",
    "description": "A summary of the article..."
  }
}
Examples
# Extract company info from a Wikipedia page
curl -X POST https://api.webpeel.dev/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/OpenAI",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "founded": { "type": "string" },
        "headquarters": { "type": "string" },
        "ceo": { "type": "string" },
        "description": { "type": "string" },
        "products": { "type": "array", "items": { "type": "string" } }
      }
    },
    "llmApiKey": "sk-..."
  }'

# Response:
# {
#   "success": true,
#   "data": {
#     "name": "OpenAI",
#     "founded": "2015",
#     "headquarters": "San Francisco, California",
#     "ceo": "Sam Altman",
#     "description": "American artificial intelligence safety company...",
#     "products": ["ChatGPT", "GPT-4", "DALL-E", "Whisper", "Sora"]
#   },
#   "metadata": { "tokensUsed": 1240, "model": "gpt-4o-mini", "elapsed": 2841 }
# }
# Extract top HN stories with natural language prompt
curl -X POST https://api.webpeel.dev/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "prompt": "Extract the top 5 story titles and their point counts as a JSON array of objects with title and points fields.",
    "llmApiKey": "sk-..."
  }'

# Response:
# {
#   "success": true,
#   "data": [
#     { "title": "Llama 4 released", "points": 842 },
#     { "title": "Ask HN: Best tools for...", "points": 634 }
#   ]
# }
# No LLM key needed — heuristic extraction from article page
curl "https://api.webpeel.dev/v1/extract/auto?url=https://techcrunch.com/2025/01/15/example" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Response:
# {
#   "url": "https://techcrunch.com/2025/01/15/example",
#   "pageType": "article",
#   "structured": {
#     "type": "article",
#     "title": "Example Article Title",
#     "author": "Jane Doe",
#     "publishedAt": "2025-01-15",
#     "description": "Article summary..."
#   }
# }
# Auto-extraction works on product pages too
curl "https://api.webpeel.dev/v1/extract/auto?url=https://www.amazon.com/dp/B0EXAMPLE" \
  -H "Authorization: Bearer YOUR_API_KEY"
# Returns: { pageType: "product", structured: { name, price, rating, reviewCount, ... } }
// Extract product info from an e-commerce page
const response = await fetch('https://api.webpeel.dev/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.WEBPEEL_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://shop.example.com/product/widget-pro',
    schema: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Product name' },
        price: { type: 'number', description: 'Price in USD' },
        currency: { type: 'string', description: 'Currency code (USD, EUR, etc.)' },
        availability: { type: 'string', enum: ['in_stock', 'out_of_stock', 'limited'] },
        rating: { type: 'number', description: 'Average customer rating (0-5)' },
        reviewCount: { type: 'integer', description: 'Number of reviews' },
        images: { type: 'array', items: { type: 'string' }, description: 'Image URLs' },
        features: { type: 'array', items: { type: 'string' }, description: 'Key features list' },
      },
      required: ['name', 'price'],
    },
    llmApiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini',
  }),
});

const { data, metadata } = await response.json();
console.log(data);
// {
//   name: 'Widget Pro 3000',
//   price: 49.99,
//   currency: 'USD',
//   availability: 'in_stock',
//   rating: 4.7,
//   reviewCount: 1842,
//   images: ['https://...jpg'],
//   features: ['Wireless charging', 'Waterproof', '2-year warranty']
// }

console.log(`Model: ${metadata.model}, tokens: ${metadata.tokensUsed}, ${metadata.elapsed}ms`);
import requests, os

# Extract product info from any e-commerce page
response = requests.post(
    'https://api.webpeel.dev/v1/extract',
    headers={
        'Authorization': f'Bearer {os.environ["WEBPEEL_API_KEY"]}',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://shop.example.com/product/widget-pro',
        'schema': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'currency': {'type': 'string'},
                'availability': {'type': 'string', 'enum': ['in_stock', 'out_of_stock', 'limited']},
                'rating': {'type': 'number'},
                'reviewCount': {'type': 'integer'},
                'features': {'type': 'array', 'items': {'type': 'string'}},
            },
        },
        'llmApiKey': os.environ['OPENAI_API_KEY'],
        'model': 'gpt-4o-mini',
    },
)

result = response.json()
product = result['data']
meta = result['metadata']

print(f"{product['name']} — ${product['price']} ({product['availability']})")
print(f"Rating: {product.get('rating', 'N/A')} ({product.get('reviewCount', 0)} reviews)")
print(f"Tokens used: {meta['tokensUsed']}, elapsed: {meta['elapsed']}ms")
import requests, os

# Extract company info from a company homepage
response = requests.post(
    'https://api.webpeel.dev/v1/extract',
    headers={
        'Authorization': f'Bearer {os.environ["WEBPEEL_API_KEY"]}',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://stripe.com',
        'schema': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string', 'description': 'Company name'},
                'tagline': {'type': 'string', 'description': 'Main tagline or value proposition'},
                'description': {'type': 'string', 'description': 'What the company does'},
                'products': {'type': 'array', 'items': {'type': 'string'}, 'description': 'Main products/services'},
                'targetAudience': {'type': 'string', 'description': 'Who the product is for'},
            },
        },
        'llmApiKey': os.environ['OPENAI_API_KEY'],
    },
)

data = response.json()['data']
print(f"Company: {data['name']}")
print(f"Tagline: {data.get('tagline', 'N/A')}")
print(f"Products: {', '.join(data.get('products', []))}")
BYOK — Bring Your Own LLM Key
By default, /v1/extract uses OpenAI's API with your llmApiKey. You can point it to any OpenAI-compatible endpoint using the baseUrl parameter:
{
  "url": "https://example.com",
  "prompt": "Extract the main headline and summary",
  "llmApiKey": "your-key-here",
  "model": "llama3-70b",
  "baseUrl": "https://api.cerebras.ai/v1"
}
| Provider | baseUrl | Example model |
|---|---|---|
| OpenAI (default) | https://api.openai.com/v1 | gpt-4o-mini |
| Cerebras | https://api.cerebras.ai/v1 | llama3.1-70b |
| Groq | https://api.groq.com/openai/v1 | llama-3.3-70b-versatile |
| Ollama (local) | http://localhost:11434/v1 | llama3.2 |
| Together AI | https://api.together.xyz/v1 | mistralai/Mixtral-8x7B |
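The provider table maps naturally onto a small request-body builder. This is a sketch: the helper name and provider keys are illustrative, not part of the API, and the models shown are the examples from the table.

```python
# (baseUrl, example model) per provider, from the table above
PROVIDERS = {
    "openai":   ("https://api.openai.com/v1", "gpt-4o-mini"),
    "cerebras": ("https://api.cerebras.ai/v1", "llama3.1-70b"),
    "groq":     ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    "ollama":   ("http://localhost:11434/v1", "llama3.2"),
    "together": ("https://api.together.xyz/v1", "mistralai/Mixtral-8x7B"),
}

def byok_body(url: str, prompt: str, llm_api_key: str, provider: str = "openai") -> dict:
    """Build a /v1/extract request body targeting an OpenAI-compatible provider."""
    base_url, model = PROVIDERS[provider]
    return {
        "url": url,
        "prompt": prompt,
        "llmApiKey": llm_api_key,
        "model": model,
        "baseUrl": base_url,
    }

body = byok_body("https://example.com", "Extract the main headline", "ollama", provider="ollama")
# body["baseUrl"] is "http://localhost:11434/v1", body["model"] is "llama3.2"
```

POST the resulting dict to /v1/extract as in the earlier examples; only model and baseUrl change per provider.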
Heuristic vs. LLM Extraction
| | POST /v1/extract (LLM) | GET /v1/extract/auto (heuristic) |
|---|---|---|
| LLM required | Yes (BYOK) | No |
| Schema support | Yes — full JSON Schema | No — fixed output per page type |
| Page types | Any page | article, product, job listing |
| Speed | 2–5 seconds | Under 1 second |
| Best for | Custom schemas, complex pages | Quick scraping of known page types |
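A common pattern is to try the fast heuristic endpoint first and fall back to LLM extraction only when it comes up short. A sketch of the decision logic (the 0.6 cutoff is an assumption, not a documented threshold):

```python
def needs_llm_fallback(auto_result: dict, min_confidence: float = 0.6) -> bool:
    """Decide whether a /v1/extract/auto result warrants a POST /v1/extract retry."""
    if not auto_result.get("structured"):
        return True  # page type not recognized, nothing extracted
    return auto_result.get("confidence", 0.0) < min_confidence

# Recognized article with typical heuristic confidence: keep it
print(needs_llm_fallback({"pageType": "article", "confidence": 0.68,
                          "structured": {"title": "..."}}))  # False

# Unrecognized page: escalate to LLM extraction
print(needs_llm_fallback({"pageType": "unknown", "confidence": 0.1,
                          "structured": None}))  # True
```

This keeps the sub-second path for known page types and spends LLM tokens only on the hard cases.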
Errors
| Error Code | Status | Cause |
|---|---|---|
| invalid_request | 400 | Missing url, or neither schema nor prompt provided. |
| missing_api_key | 400 | No LLM API key in request and no server default configured. |
| llm_auth_failed | 401 | The provided llmApiKey was rejected by the LLM provider. |
| llm_rate_limited | 429 | The LLM provider returned a rate limit error. |
| extraction_failed | 500 | The page was fetched but LLM extraction failed. |
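When wiring these into a client, only the rate-limit code is worth an automatic retry; the 400/401 codes need a fixed request or key. A sketch of that classification (the buckets are a suggested policy, not API behavior):

```python
RETRYABLE = {"llm_rate_limited"}  # back off, then retry the same request
FIXABLE = {"invalid_request", "missing_api_key", "llm_auth_failed"}  # correct request/key first

def handle_error(code: str) -> str:
    """Map an error code from the table above to a client action."""
    if code in RETRYABLE:
        return "retry"
    if code in FIXABLE:
        return "fix"
    return "report"  # e.g. extraction_failed: page-specific, surface to the caller

print(handle_error("llm_rate_limited"))  # retry
print(handle_error("llm_auth_failed"))   # fix
```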
See Also
- Deep Research — Multi-round research agent for comprehensive cited reports
- Fetch API — Fetch any page as clean markdown
- Domain Extractors — Pre-built extractors for YouTube, Twitter, GitHub, and more
- Error Reference — All error codes and troubleshooting