Content Extraction API

Extract specific HTML elements (titles, headings, paragraphs) in a structured, LLM-ready format. Feed clean web data directly into RAG pipelines, content analysis tools, or AI applications without running your own scraper infrastructure.

API Version

v3.0 enables smart defaults — auto_proxy, auto_render, and retry are all on by default. The best proxy and rendering strategy is chosen automatically for each target domain.

Endpoint

HTTP

POST https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID

Content-Type: application/json
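
As a rough illustration, the request above could be assembled in Python with only the standard library. The helper name `build_extract_request` is hypothetical (it is not part of any official SDK), and `YOUR_APP_ID` is a placeholder. The sketch only builds the request; nothing is sent.

```python
import json
import urllib.parse
import urllib.request

EXTRACT_URL = "https://opengraph.io/api/3.0/extract"


def build_extract_request(app_id, body):
    """Build (but do not send) a POST request for the extract endpoint.

    Illustrative helper only; the name and signature are assumptions.
    """
    query = urllib.parse.urlencode({"app_id": app_id})
    return urllib.request.Request(
        url=f"{EXTRACT_URL}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request("YOUR_APP_ID", {"site": "https://example.com"})

# To actually send it:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```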

Parameters

Body Parameters

Parameter	Type	Required	Default	Description
site	string	Yes	—	The URL to scrape
selectors	object	No	—	CSS selector configurations for structured data extraction
html_elements	string	No	title,h1,h2,h3,h4,h5,p	Comma-separated HTML tags used to generate `concatenatedText`
JavaScript Rendering
wait_for_selector	string	No	—	Wait for this CSS selector to appear in the DOM before capturing. Automatically enables JS rendering.
scroll_to_bottom	boolean	No	false	Scroll the full page to trigger lazy-loaded content before capturing. Automatically enables JS rendering.
load_more_selector	string	No	—	CSS selector for a "Load more" button to click before capturing. Automatically enables JS rendering.
load_more_clicks	integer	No	3	How many times to click the load more button (1–10).
load_more_wait	integer	No	1500	Milliseconds to wait for new content after each click (0–5000).
load_more_scroll	boolean	No	true	Scroll to the bottom before each click to ensure the button is in view.

Setting wait_for_selector, scroll_to_bottom, or load_more_selector automatically enables JavaScript rendering — you do not need to also pass full_render=true.

Query Parameters

Parameter	Type	Default	Description
cache_ok	boolean	true	Allow cached results
max_cache_age	number	—	Max cache age in seconds
full_render	boolean	false	Use headless browser rendering
use_proxy	boolean	false	Route request through proxy
use_premium	boolean	false	Use premium proxy (requires plan support)
use_superior	boolean	false	Use superior proxy (requires plan support)
auto_proxy	boolean	true	Automatically select the best proxy for the target domain
auto_render	boolean	true	Automatically use headless rendering when beneficial
retry	boolean	true	Retry with proxy escalation on failure (requires plan support)
max_retries	number	3	Max retry attempts (1–4)
retry_escalate	boolean	true	Escalate proxy level on each retry
ai_sanitize	boolean	false	Enable prompt injection detection
ai_sanitize_mode	string	sanitize	One of: `sanitize`, `warn`, `block`
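
The query parameters above travel in the URL, not the JSON body. A minimal Python sketch of encoding them follows; `build_query` is a hypothetical helper, and writing booleans as lowercase `true`/`false` is an assumption based on the `full_render=true` form used in this page.

```python
import urllib.parse


def build_query(app_id, **options):
    """Encode query parameters, writing booleans as lowercase true/false."""
    params = {"app_id": app_id}
    for name, value in options.items():
        if value is None:
            continue  # omit unset options so the server-side default applies
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    return urllib.parse.urlencode(params)


query = build_query(
    "YOUR_APP_ID",
    cache_ok=False,
    max_retries=2,
    ai_sanitize=True,
    ai_sanitize_mode="warn",
)
print(query)
# app_id=YOUR_APP_ID&cache_ok=false&max_retries=2&ai_sanitize=true&ai_sanitize_mode=warn
```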

Selectors Format

The selectors object maps custom keys to CSS selector configurations:

Selectors Example

{
  "selectors": {
    "pageTitle": {
      "selector": "h1.title",
      "type": "text"
    },
    "navLinks": {
      "selector": "a.nav-link",
      "multiple": true,
      "type": "attr",
      "attr": "href"
    },
    "firstParagraph": {
      "selector": "article p",
      "type": "text"
    }
  }
}

Selector Config Options

Property	Type	Default	Description
selector	string	—	Any valid CSS selector
multiple	boolean	false	`true` returns all matches as an array; `false` returns only the first match
type	string	text	`text` extracts inner text; `attr` extracts an HTML attribute value
attr	string	—	The attribute to extract (required when `type` is `attr`)
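
A selectors object can be checked on the client before it is sent. The sketch below is one possible way to do that in Python. `validate_selectors` is a hypothetical helper; the `attr` rule comes from the table above, while treating a missing `selector` as an error is an assumption.

```python
def validate_selectors(selectors):
    """Return a list of problems found in a selectors object (empty if none)."""
    problems = []
    for key, config in selectors.items():
        if not config.get("selector"):
            # Assumption: every entry needs a CSS selector to be useful.
            problems.append(f"{key}: missing 'selector'")
        kind = config.get("type", "text")  # documented default is "text"
        if kind == "attr" and not config.get("attr"):
            problems.append(f"{key}: 'attr' is required when type is 'attr'")
    return problems


selectors = {
    "pageTitle": {"selector": "h1.title", "type": "text"},
    "navLinks": {"selector": "a.nav-link", "multiple": True, "type": "attr"},
}
problems = validate_selectors(selectors)
print(problems)
# ["navLinks: 'attr' is required when type is 'attr'"]
```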

JavaScript Rendering

These options control how the page is rendered before content is captured. Setting wait_for_selector, scroll_to_bottom, or load_more_selector is enough on its own to trigger full headless browser rendering automatically.

Wait for an element

Use wait_for_selector when the content you need is rendered by JavaScript after the initial page load — common on single-page applications.

Wait for selector

{
  "site": "https://example.com/dashboard",
  "wait_for_selector": "#content-loaded",
  "selectors": {
    "items": { "selector": ".item", "multiple": true, "type": "text" }
  }
}

Scroll to load lazy content

Use scroll_to_bottom when images, cards, or links only load as the user scrolls — common on media grids and feeds.

Scroll to bottom

{
  "site": "https://example.com/gallery",
  "scroll_to_bottom": true,
  "selectors": {
    "images": { "selector": "img.gallery-image", "multiple": true, "type": "attr", "attr": "src" }
  }
}

Click "Load more" to reveal paginated content

Use load_more_selector when a site hides additional results behind a button. The API will click it up to load_more_clicks times, wait for new content to appear after each click, and capture the full HTML once done.

The first selector in your selectors object is automatically used to detect when new items have loaded after each click — no extra configuration needed.

curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "site": "https://example.com/homes-for-sale",
    "selectors": {
      "listings": {
        "selector": "a.listing-card",
        "multiple": true,
        "type": "attr",
        "attr": "href"
      }
    },
    "load_more_selector": "button.load-more",
    "load_more_clicks": 3
  }'

Load More Limits

Parameter	Min	Max	Default
`load_more_clicks`	1	10	3
`load_more_wait` (ms)	0	5000	1500
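
The limits in this table can be checked before a request is sent. This page does not say whether the API rejects or clamps out-of-range values, so the following Python sketch only flags them on the client; `check_load_more` is a hypothetical helper.

```python
# Ranges copied from the Load More Limits table.
LOAD_MORE_LIMITS = {
    "load_more_clicks": (1, 10),
    "load_more_wait": (0, 5000),  # milliseconds
}


def check_load_more(body):
    """Return a message for each load_more_* value outside its documented range."""
    errors = []
    for name, (low, high) in LOAD_MORE_LIMITS.items():
        if name in body and not low <= body[name] <= high:
            errors.append(f"{name}={body[name]} is outside {low}-{high}")
    return errors


errors = check_load_more({
    "site": "https://example.com/homes-for-sale",
    "load_more_selector": "button.load-more",
    "load_more_clicks": 12,
    "load_more_wait": 1500,
})
print(errors)
# ['load_more_clicks=12 is outside 1-10']
```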

The click loop always stops early if the button disappears, becomes disabled, or no new items are detected — even before reaching load_more_clicks.
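
The stop conditions above can be pictured with a small simulation. This is only a model of the described behavior, not the service's implementation. The function name, the `"enabled"`/`"disabled"`/`"gone"` state labels, and the item-count inputs are all invented for illustration.

```python
def simulate_load_more(item_counts, button_states, max_clicks=3):
    """Model the documented stop conditions of the click loop.

    item_counts[i]:   matched items after i clicks (index 0 is the initial page)
    button_states[i]: "enabled", "disabled", or "gone" just before click i + 1
    Returns (clicks_performed, reason_stopped).
    """
    clicks = 0
    while clicks < max_clicks:
        state = button_states[clicks]
        if state != "enabled":
            return clicks, f"button {state}"
        clicks += 1
        if item_counts[clicks] <= item_counts[clicks - 1]:
            return clicks, "no new items"
    return clicks, "reached load_more_clicks"


# The second click adds nothing, so the loop stops before the third click.
print(simulate_load_more([20, 40, 40, 60], ["enabled"] * 3))
# (2, 'no new items')

# The button disappears after the first click.
print(simulate_load_more([20, 40, 60, 80], ["enabled", "gone", "enabled"]))
# (1, 'button gone')
```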

Example Request

curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "site": "https://example.com",
    "selectors": {
      "heading": {
        "selector": "h1",
        "type": "text"
      },
      "allLinks": {
        "selector": "a",
        "multiple": true,
        "type": "attr",
        "attr": "href"
      }
    }
  }'

Example Response

{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}

Response Fields

Field	Presence	Description
url	Always	The URL that was requested
concatenatedText	Always	Plain text extracted from the specified (or default) `html_elements` tags, concatenated into a single string
data	Only when `selectors` provided	An object containing extraction results keyed by your selector names
ai_safety	Only when `ai_sanitize` enabled	Prompt injection risk assessment

LLM Tip: Use concatenatedText when feeding content to AI models for summarization. It provides clean text without HTML markup.
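
A short Python sketch of reading these fields from a parsed response follows. `text_for_llm` and its `max_chars` budget are hypothetical (the cap is not an API limit); the field names and their presence rules come from the table above.

```python
def text_for_llm(response, max_chars=4000):
    """Return (text, data) from an extract response, capping the text length."""
    text = response["concatenatedText"]  # documented as always present
    data = response.get("data", {})      # present only when selectors were sent
    return text[:max_chars], data


sample = {
    "url": "https://example.com",
    "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
    "data": {"heading": "Example Domain"},
}
text, data = text_for_llm(sample, max_chars=14)
print(text)             # Example Domain
print(data["heading"])  # Example Domain
```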

AI Safety

When ai_sanitize is enabled, the response includes an ai_safety object with prompt injection risk assessment:

AI Safety Response

{
  "ai_safety": {
    "risk_score": 0.02,
    "risk_level": "low",
    "signals": {}
  }
}

Use ai_sanitize_mode to control behavior: sanitize strips detected injections, warn adds flags but keeps content, and block rejects high-risk responses with a 422 error.
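
In `warn` mode the content is kept, so the caller decides what to do with flagged pages. One possible client-side check is sketched below in Python. The `review_ai_safety` helper and the 0.5 threshold are assumptions; this page does not define the `risk_score` scale beyond the sample value.

```python
def review_ai_safety(response, max_risk_score=0.5):
    """Classify a response as "unchecked", "ok", or "review".

    max_risk_score is an arbitrary illustrative threshold, not a documented one.
    """
    safety = response.get("ai_safety")  # present only when ai_sanitize is enabled
    if safety is None:
        return "unchecked"
    if safety["risk_score"] > max_risk_score:
        return "review"
    return "ok"


sample = {"ai_safety": {"risk_score": 0.02, "risk_level": "low", "signals": {}}}
print(review_ai_safety(sample))  # ok
print(review_ai_safety({}))      # unchecked
```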

Errors

Status	Code	Condition
400	—	Missing or invalid `site` URL
400	-2233	Plan does not support the requested feature (premium proxy, retry, etc.)
422	-4001	`ai_sanitize_mode=block` and high injection risk was detected
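
The rows above can be turned into a small lookup for client-side error handling. This page does not show where the numeric code appears in an error response, so the Python sketch below takes the status and code as plain arguments. `describe_error` is a hypothetical helper.

```python
# (HTTP status, error code) pairs taken from the Errors table.
KNOWN_ERRORS = {
    (400, -2233): "plan does not support the requested feature",
    (422, -4001): "blocked: high prompt injection risk detected",
}


def describe_error(status, code=None):
    """Map an HTTP status and optional error code to a readable message."""
    if (status, code) in KNOWN_ERRORS:
        return KNOWN_ERRORS[(status, code)]
    if status == 400 and code is None:
        return "missing or invalid site URL"
    return f"unexpected error (status {status}, code {code})"


print(describe_error(400))         # missing or invalid site URL
print(describe_error(400, -2233))  # plan does not support the requested feature
print(describe_error(422, -4001))  # blocked: high prompt injection risk detected
```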

Use Cases

AI/LLM data pipelines – feed clean text to language models
Content analysis and summarization
SEO content auditing – check heading structure
Research and data collection
Automated reporting

MCP Tool

This endpoint is available as the Extract Content tool in the OpenGraph MCP Server. Your AI assistant can extract elements directly without writing any code.
