Content Extraction API
Extract specific HTML elements (titles, headers, paragraphs) in a structured, LLM-ready format. Feed clean web data directly into RAG pipelines, content analysis tools, or AI applications without running your own scraper infrastructure.
v3.0 enables smart defaults — auto_proxy, auto_render, and retry are all on by default. The best proxy and rendering strategy is chosen automatically for each target domain.
Endpoint
POST https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID
Content-Type: application/json
Parameters
Body Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| site | string | Yes | — | The URL to scrape |
| selectors | object | No | — | CSS selector configurations for structured data extraction |
| html_elements | string | No | title,h1,h2,h3,h4,h5,p | Comma-separated HTML tags used to generate concatenatedText |
| JavaScript Rendering | | | | |
| wait_for_selector | string | No | — | Wait for this CSS selector to appear in the DOM before capturing. Automatically enables JS rendering. |
| scroll_to_bottom | boolean | No | false | Scroll the full page to trigger lazy-loaded content before capturing. Automatically enables JS rendering. |
| load_more_selector | string | No | — | CSS selector for a "Load more" button to click before capturing. Automatically enables JS rendering. |
| load_more_clicks | integer | No | 3 | How many times to click the load more button (1–10). |
| load_more_wait | integer | No | 1500 | Milliseconds to wait for new content after each click (0–5000). |
| load_more_scroll | boolean | No | true | Scroll to the bottom before each click to ensure the button is in view. |
Setting wait_for_selector, scroll_to_bottom, or load_more_selector automatically enables JavaScript rendering — you do not need to also pass full_render=true.
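
As a rough client-side illustration of the rule above, the following Python sketch checks whether a request body sets any of the three options that imply JavaScript rendering. The helper name `triggers_js_rendering` is hypothetical and not part of the API. Treating an empty string or `False` as "not set" is an assumption about how the server interprets those values.

```python
# Body parameters that the documentation says enable JS rendering on their own.
JS_TRIGGER_FIELDS = ("wait_for_selector", "scroll_to_bottom", "load_more_selector")


def triggers_js_rendering(body):
    """Return True if the body sets any option that implies JS rendering."""
    # Assumption: a missing key, an empty string, or False counts as "not set".
    return any(body.get(field) for field in JS_TRIGGER_FIELDS)


plain = {"site": "https://example.com"}
scrolling = {"site": "https://example.com/gallery", "scroll_to_bottom": True}

print(triggers_js_rendering(plain))      # False
print(triggers_js_rendering(scrolling))  # True
```

A check like this can help decide whether passing full_render=true would be redundant for a given request.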
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_ok | boolean | true | Allow cached results |
| max_cache_age | number | — | Max cache age in seconds |
| full_render | boolean | false | Use headless browser rendering |
| use_proxy | boolean | false | Route request through proxy |
| use_premium | boolean | false | Use premium proxy (requires plan support) |
| use_superior | boolean | false | Use superior proxy (requires plan support) |
| auto_proxy | boolean | true | Automatically select the best proxy for the target domain |
| auto_render | boolean | true | Automatically use headless rendering when beneficial |
| retry | boolean | true | Retry with proxy escalation on failure (requires plan support) |
| max_retries | number | 3 | Max retry attempts (1–4) |
| retry_escalate | boolean | true | Escalate proxy level on each retry |
| ai_sanitize | boolean | false | Enable prompt injection detection |
| ai_sanitize_mode | string | sanitize | One of: sanitize, warn, block |
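
The split between query parameters and JSON body parameters can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function name `build_extract_request` is hypothetical, and sending booleans as lowercase `true`/`false` is an assumption based on the `full_render=true` form used elsewhere on this page. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://opengraph.io/api/3.0/extract"


def build_extract_request(app_id, body, **query):
    """Build (but do not send) a POST request for the extract endpoint."""
    params = {"app_id": app_id}
    for key, value in query.items():
        # Assumption: booleans travel as lowercase true/false in the query string.
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[key] = value
    url = ENDPOINT + "?" + urllib.parse.urlencode(params)
    # Body parameters such as site and selectors go in the JSON payload.
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_extract_request(
    "YOUR_APP_ID",
    {"site": "https://example.com"},
    cache_ok=False,
    max_retries=2,
)
print(req.full_url)
```

Passing the resulting object to `urllib.request.urlopen(req)` would perform the call; that step is left out here so the sketch runs without network access or a real app ID.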
Selectors Format
The selectors object maps custom keys to CSS selector configurations:
{
"selectors": {
"pageTitle": {
"selector": "h1.title",
"type": "text"
},
"navLinks": {
"selector": "a.nav-link",
"multiple": true,
"type": "attr",
"attr": "href"
},
"firstParagraph": {
"selector": "article p",
"type": "text"
}
}
}
Selector Config Options
| Property | Type | Default | Description |
|---|---|---|---|
| selector | string | — | Any valid CSS selector |
| multiple | boolean | false | true returns all matches as an array; false returns only the first match |
| type | string | text | text extracts inner text; attr extracts an HTML attribute value |
| attr | string | — | The attribute to extract (required when type is attr) |
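
The options table above implies one cross-field rule: `attr` must be present when `type` is `attr`. A small pre-flight check along those lines might look like the following Python sketch. The function name `validate_selectors` is hypothetical, and treating a missing `selector` as an error is an assumption, since the table does not state whether the property is required.

```python
def validate_selectors(selectors):
    """Return a list of problems found in a selectors object (empty if none)."""
    problems = []
    for key, config in selectors.items():
        # Assumption: every entry needs a selector string to be useful.
        if not config.get("selector"):
            problems.append(f"{key}: missing 'selector'")
        # The documented default for type is text.
        kind = config.get("type", "text")
        if kind == "attr" and not config.get("attr"):
            problems.append(f"{key}: 'attr' is required when type is attr")
    return problems


selectors = {
    "pageTitle": {"selector": "h1.title", "type": "text"},
    "navLinks": {"selector": "a.nav-link", "multiple": True, "type": "attr"},
}
print(validate_selectors(selectors))
# ["navLinks: 'attr' is required when type is attr"]
```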
JavaScript Rendering
These options control how the page is rendered before content is captured. Any one of them is enough to trigger full headless browser rendering automatically.
Wait for an element
Use wait_for_selector when the content you need is rendered by JavaScript after the initial page load — common on single-page applications.
{
"site": "https://example.com/dashboard",
"wait_for_selector": "#content-loaded",
"selectors": {
"items": { "selector": ".item", "multiple": true, "type": "text" }
}
}
Scroll to load lazy content
Use scroll_to_bottom when images, cards, or links only load as the user scrolls — common on media grids and feeds.
{
"site": "https://example.com/gallery",
"scroll_to_bottom": true,
"selectors": {
"images": { "selector": "img.gallery-image", "multiple": true, "type": "attr", "attr": "src" }
}
}
Click "Load more" to reveal paginated content
Use load_more_selector when a site hides additional results behind a button. The API will click it up to load_more_clicks times, wait for new content to appear after each click, and capture the full HTML once done.
The first selector in your selectors object is automatically used to detect when new items have loaded after each click — no extra configuration needed.
curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
-H "Content-Type: application/json" \
-d '{
"site": "https://example.com/homes-for-sale",
"selectors": {
"listings": {
"selector": "a.listing-card",
"multiple": true,
"type": "attr",
"attr": "href"
}
},
"load_more_selector": "button.load-more",
"load_more_clicks": 3
}'
Load More Limits
| Parameter | Min | Max | Default |
|---|---|---|---|
| load_more_clicks | 1 | 10 | 3 |
| load_more_wait (ms) | 0 | 5000 | 1500 |
The click loop always stops early if the button disappears, becomes disabled, or no new items are detected — even before reaching load_more_clicks.
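
The documented ranges can be checked before a request is sent. The Python sketch below resolves the effective `load_more_clicks` and `load_more_wait` values, applying the documented defaults and raising on out-of-range input. The helper name `resolve_load_more` is hypothetical, and this page does not say whether the API itself rejects or clamps out-of-range values, so raising here is purely a client-side choice.

```python
LOAD_MORE_LIMITS = {
    # parameter: (min, max, default), copied from the limits table
    "load_more_clicks": (1, 10, 3),
    "load_more_wait": (0, 5000, 1500),
}


def resolve_load_more(body):
    """Return effective load-more settings, raising ValueError if out of range."""
    resolved = {}
    for name, (low, high, default) in LOAD_MORE_LIMITS.items():
        value = body.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        resolved[name] = value
    return resolved


print(resolve_load_more({"load_more_selector": "button.load-more", "load_more_clicks": 5}))
# {'load_more_clicks': 5, 'load_more_wait': 1500}
```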
Example Request
curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
-H "Content-Type: application/json" \
-d '{
"site": "https://example.com",
"selectors": {
"heading": {
"selector": "h1",
"type": "text"
},
"allLinks": {
"selector": "a",
"multiple": true,
"type": "attr",
"attr": "href"
}
}
}'
Example Response
{
"url": "https://example.com",
"concatenatedText": "Example Domain This domain is for use in illustrative examples...",
"data": {
"heading": "Example Domain",
"allLinks": ["https://www.iana.org/domains/example"]
}
}
Response Fields
| Field | Presence | Description |
|---|---|---|
| url | Always | The URL that was requested |
| concatenatedText | Always | Plain text extracted from the specified (or default) html_elements tags, concatenated into a single string |
| data | Only when selectors provided | An object containing extraction results keyed by your selector names |
| ai_safety | Only when ai_sanitize enabled | Prompt injection risk assessment |
LLM Tip: Use concatenatedText when feeding content to AI models for summarization. It provides clean text without HTML markup.
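
Because `data` and `ai_safety` are only present under certain conditions, client code should read them defensively. The following Python sketch parses the example response shown earlier; the variable names are illustrative only.

```python
import json

# The example response from this page, as it might arrive over the wire.
raw = """{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}"""

response = json.loads(raw)

text = response["concatenatedText"]   # documented as always present
data = response.get("data", {})       # present only when selectors were sent
safety = response.get("ai_safety")    # present only when ai_sanitize is enabled

print(data.get("heading"))            # Example Domain
print(len(data.get("allLinks", [])))  # 1
print(safety)                         # None
```

Here `text` is the value you would pass to a model for summarization, while `data` holds the structured results keyed by your selector names.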
AI Safety
When ai_sanitize is enabled, the response includes an ai_safety object with prompt injection risk assessment:
{
"ai_safety": {
"risk_score": 0.02,
"risk_level": "low",
"signals": {}
}
}
Use ai_sanitize_mode to control behavior: sanitize strips detected injections, warn adds flags but keeps content, and block rejects high-risk responses with a 422 error.
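
In warn mode the content is kept, so the caller decides what to do with the assessment. One possible approach is a simple gate on `risk_score`, sketched below in Python. The function name `passes_safety_gate` and the `0.5` threshold are hypothetical, and the sketch assumes `risk_score` runs from 0 to 1, which this page does not state. Letting unassessed responses through is also a design choice you may want to reverse.

```python
def passes_safety_gate(response, max_score=0.5):
    """Return True if the response may be forwarded to a model."""
    safety = response.get("ai_safety")
    if safety is None:
        # ai_sanitize was not enabled, so there is no assessment to check.
        return True
    # Assumption: risk_score is a number where higher means riskier.
    return safety["risk_score"] <= max_score


low_risk = {"ai_safety": {"risk_score": 0.02, "risk_level": "low", "signals": {}}}
high_risk = {"ai_safety": {"risk_score": 0.9, "signals": {}}}

print(passes_safety_gate(low_risk))   # True
print(passes_safety_gate(high_risk))  # False
```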
Errors
| Status | Code | Condition |
|---|---|---|
| 400 | — | Missing or invalid site URL |
| 400 | -2233 | Plan does not support the requested feature (premium proxy, retry, etc.) |
| 422 | -4001 | ai_sanitize_mode=block and high injection risk was detected |
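
The error table can be turned into a small lookup for logging or user-facing messages. The Python sketch below takes the HTTP status and error code as plain arguments, because this page does not document how the code appears in the error response body. The function name `describe_error` is hypothetical.

```python
def describe_error(status, code=None):
    """Map a documented (status, code) pair to a short explanation."""
    if status == 400 and code == -2233:
        return "Plan does not support the requested feature"
    if status == 422 and code == -4001:
        return "Blocked: high prompt injection risk detected"
    if status == 400:
        return "Missing or invalid site URL"
    # Anything else is outside the documented table.
    return f"Unexpected error (HTTP {status})"


print(describe_error(400))
print(describe_error(400, -2233))
print(describe_error(422, -4001))
```

The plan-related check comes before the generic 400 case so that the more specific code is not masked by the status alone.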
Use Cases
- AI/LLM data pipelines – feed clean text to language models
- Content analysis and summarization
- SEO content auditing – check heading structure
- Research and data collection
- Automated reporting
MCP Tool
This endpoint is available as the Extract Content tool in the OpenGraph MCP Server. Your AI assistant can extract elements directly without writing any code.