OpenGraph.io

Content Extraction API

Extract specific HTML elements (titles, headers, paragraphs) in a structured, LLM-ready format. Feed clean web data directly into RAG pipelines, content analysis tools, or AI applications without running your own scraper infrastructure.

API Version

v3.0 enables smart defaults — auto_proxy, auto_render, and retry are all on by default. The best proxy and rendering strategy is chosen automatically for each target domain.

Endpoint

HTTP
POST https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID

Content-Type: application/json
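
The same request can be assembled from code. The following is a minimal Python sketch using only the standard library; the helper name build_extract_request is hypothetical and not part of any OpenGraph.io client. It constructs the request without sending it.

```python
import json
import urllib.request
from urllib.parse import urlencode

EXTRACT_URL = "https://opengraph.io/api/3.0/extract"


def build_extract_request(app_id, body):
    """Build (but do not send) a JSON POST request for the extract endpoint."""
    url = f"{EXTRACT_URL}?{urlencode({'app_id': app_id})}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request("YOUR_APP_ID", {"site": "https://example.com"})
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```

Keeping construction separate from sending makes the URL, headers, and body easy to inspect before any credits are spent.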

Parameters

Body Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| site | string | Yes | | The URL to scrape |
| selectors | object | No | | CSS selector configurations for structured data extraction |
| html_elements | string | No | title,h1,h2,h3,h4,h5,p | Comma-separated HTML tags used to generate concatenatedText |

JavaScript Rendering

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| wait_for_selector | string | No | | Wait for this CSS selector to appear in the DOM before capturing. Automatically enables JS rendering. |
| scroll_to_bottom | boolean | No | false | Scroll the full page to trigger lazy-loaded content before capturing. Automatically enables JS rendering. |
| load_more_selector | string | No | | CSS selector for a "Load more" button to click before capturing. Automatically enables JS rendering. |
| load_more_clicks | integer | No | 3 | How many times to click the load more button (1–10). |
| load_more_wait | integer | No | 1500 | Milliseconds to wait for new content after each click (0–5000). |
| load_more_scroll | boolean | No | true | Scroll to the bottom before each click to ensure the button is in view. |

Setting wait_for_selector, scroll_to_bottom, or load_more_selector automatically enables JavaScript rendering — you do not need to also pass full_render=true.
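
As a quick local check, the rule above can be sketched in Python. The function name triggers_js_rendering is hypothetical, and the sketch assumes that a false or empty value does not trigger rendering, which the documentation does not state explicitly.

```python
# The three body parameters documented as enabling JS rendering.
JS_RENDER_TRIGGERS = ("wait_for_selector", "scroll_to_bottom", "load_more_selector")


def triggers_js_rendering(body):
    """Return True if the body sets any option that enables JS rendering.

    Assumption: falsy values (false, empty string, missing) do not count.
    """
    return any(bool(body.get(name)) for name in JS_RENDER_TRIGGERS)


plain = {"site": "https://example.com"}
lazy = {"site": "https://example.com/gallery", "scroll_to_bottom": True}
print(triggers_js_rendering(plain), triggers_js_rendering(lazy))
```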

Query Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cache_ok | boolean | true | Allow cached results |
| max_cache_age | number | | Max cache age in seconds |
| full_render | boolean | false | Use headless browser rendering |
| use_proxy | boolean | false | Route request through proxy |
| use_premium | boolean | false | Use premium proxy (requires plan support) |
| use_superior | boolean | false | Use superior proxy (requires plan support) |
| auto_proxy | boolean | true | Automatically select the best proxy for the target domain |
| auto_render | boolean | true | Automatically use headless rendering when beneficial |
| retry | boolean | true | Retry with proxy escalation on failure (requires plan support) |
| max_retries | number | 3 | Max retry attempts (1–4) |
| retry_escalate | boolean | true | Escalate proxy level on each retry |
| ai_sanitize | boolean | false | Enable prompt injection detection |
| ai_sanitize_mode | string | sanitize | One of: sanitize, warn, block |
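
A possible way to assemble the query string in Python is sketched below. The helper build_query is hypothetical. It writes booleans as lowercase true/false, matching the full_render=true spelling used on this page; whether the API accepts other spellings is not documented here. The range and mode checks mirror the table above and run locally, before any request is made.

```python
from urllib.parse import urlencode

SANITIZE_MODES = {"sanitize", "warn", "block"}


def build_query(app_id, **options):
    """Encode query parameters, writing booleans as lowercase true/false."""
    if "max_retries" in options and not 1 <= options["max_retries"] <= 4:
        raise ValueError("max_retries must be between 1 and 4")
    if options.get("ai_sanitize_mode", "sanitize") not in SANITIZE_MODES:
        raise ValueError("ai_sanitize_mode must be sanitize, warn, or block")
    params = {"app_id": app_id}
    for name, value in options.items():
        # Python's str(True) is "True"; lowercase it for the query string.
        params[name] = str(value).lower() if isinstance(value, bool) else value
    return urlencode(params)


query = build_query("YOUR_APP_ID", full_render=True, max_retries=2, ai_sanitize_mode="warn")
url = f"https://opengraph.io/api/3.0/extract?{query}"
```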

Selectors Format

The selectors object maps custom keys to CSS selector configurations:

Selectors Example
{
  "selectors": {
    "pageTitle": {
      "selector": "h1.title",
      "type": "text"
    },
    "navLinks": {
      "selector": "a.nav-link",
      "multiple": true,
      "type": "attr",
      "attr": "href"
    },
    "firstParagraph": {
      "selector": "article p",
      "type": "text"
    }
  }
}

Selector Config Options

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| selector | string | | Any valid CSS selector |
| multiple | boolean | false | true returns all matches as an array; false returns only the first match |
| type | string | text | text extracts inner text; attr extracts an HTML attribute value |
| attr | string | | The attribute to extract (required when type is attr) |
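
The rules in this table can be checked before sending a request. The sketch below uses a hypothetical helper, selector_config, that builds one entry of the selectors object and rejects an attr selector that has no attribute name. It validates locally only; how the API itself responds to an invalid config is not covered here.

```python
from typing import Optional


def selector_config(selector: str, *, kind: str = "text",
                    attr: Optional[str] = None, multiple: bool = False) -> dict:
    """Return one selector config, checking the rules from the options table."""
    if kind not in ("text", "attr"):
        raise ValueError("type must be 'text' or 'attr'")
    if kind == "attr" and not attr:
        raise ValueError("attr is required when type is 'attr'")
    config = {"selector": selector, "type": kind}
    if multiple:
        config["multiple"] = True
    if kind == "attr":
        config["attr"] = attr
    return config


selectors = {
    "pageTitle": selector_config("h1.title"),
    "navLinks": selector_config("a.nav-link", kind="attr", attr="href", multiple=True),
}
```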

JavaScript Rendering

The wait_for_selector, scroll_to_bottom, and load_more_selector options control how the page is rendered before content is captured. Setting any one of them is enough to trigger full headless browser rendering automatically.

Wait for an element

Use wait_for_selector when the content you need is rendered by JavaScript after the initial page load — common on single-page applications.

Wait for selector
{
  "site": "https://example.com/dashboard",
  "wait_for_selector": "#content-loaded",
  "selectors": {
    "items": { "selector": ".item", "multiple": true, "type": "text" }
  }
}

Scroll to load lazy content

Use scroll_to_bottom when images, cards, or links only load as the user scrolls — common on media grids and feeds.

Scroll to bottom
{
  "site": "https://example.com/gallery",
  "scroll_to_bottom": true,
  "selectors": {
    "images": { "selector": "img.gallery-image", "multiple": true, "type": "attr", "attr": "src" }
  }
}

Click "Load more" to reveal paginated content

Use load_more_selector when a site hides additional results behind a button. The API will click it up to load_more_clicks times, wait for new content to appear after each click, and capture the full HTML once done.

The first selector in your selectors object is automatically used to detect when new items have loaded after each click — no extra configuration needed.

curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "site": "https://example.com/homes-for-sale",
    "selectors": {
      "listings": {
        "selector": "a.listing-card",
        "multiple": true,
        "type": "attr",
        "attr": "href"
      }
    },
    "load_more_selector": "button.load-more",
    "load_more_clicks": 3
  }'

Load More Limits

| Parameter | Min | Max | Default |
| --- | --- | --- | --- |
| load_more_clicks | 1 | 10 | 3 |
| load_more_wait (ms) | 0 | 5000 | 1500 |

The click loop always stops early if the button disappears, becomes disabled, or no new items are detected — even before reaching load_more_clicks.
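
To catch out-of-range values before a request is made, the limits above can be enforced client-side. This is a sketch with a hypothetical helper, load_more_body. Whether the API rejects or clamps out-of-range values is not stated on this page, so the sketch simply raises locally.

```python
def load_more_body(site, selectors, button_selector,
                   clicks=3, wait_ms=1500, scroll=True):
    """Build an extract body with load-more options, enforcing the documented ranges."""
    if not 1 <= clicks <= 10:
        raise ValueError("load_more_clicks must be between 1 and 10")
    if not 0 <= wait_ms <= 5000:
        raise ValueError("load_more_wait must be between 0 and 5000 ms")
    return {
        "site": site,
        "selectors": selectors,
        "load_more_selector": button_selector,
        "load_more_clicks": clicks,
        "load_more_wait": wait_ms,
        "load_more_scroll": scroll,
    }


body = load_more_body(
    "https://example.com/homes-for-sale",
    # Listed first so it is the selector used to detect newly loaded items.
    {"listings": {"selector": "a.listing-card", "multiple": True,
                  "type": "attr", "attr": "href"}},
    "button.load-more",
)
```

Python dictionaries keep insertion order and json.dumps preserves it, so the selector intended for new-item detection should be added to the selectors dictionary first.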

Example Request

curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "site": "https://example.com",
    "selectors": {
      "heading": {
        "selector": "h1",
        "type": "text"
      },
      "allLinks": {
        "selector": "a",
        "multiple": true,
        "type": "attr",
        "attr": "href"
      }
    }
  }'

Example Response

Response
{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}

Response Fields

| Field | Presence | Description |
| --- | --- | --- |
| url | Always | The URL that was requested |
| concatenatedText | Always | Plain text extracted from the specified (or default) html_elements tags, concatenated into a single string |
| data | Only when selectors provided | An object containing extraction results keyed by your selector names |
| ai_safety | Only when ai_sanitize enabled | Prompt injection risk assessment |

LLM Tip: Use concatenatedText when feeding content to AI models for summarization. It provides clean text without HTML markup.
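
Because data and ai_safety are only present in some responses, client code should read them defensively. The sketch below parses the example response shown above; the function name read_extract_response and the keys of the dictionary it returns are illustrative choices, not part of the API.

```python
import json

SAMPLE_RESPONSE = """
{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}
"""


def read_extract_response(payload):
    """Pull out the documented fields, tolerating the optional ones being absent."""
    return {
        "url": payload["url"],                  # always present
        "text": payload["concatenatedText"],    # always present
        "data": payload.get("data", {}),        # only when selectors were sent
        "ai_safety": payload.get("ai_safety"),  # only when ai_sanitize is enabled
    }


result = read_extract_response(json.loads(SAMPLE_RESPONSE))
```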

AI Safety

When ai_sanitize is enabled, the response includes an ai_safety object with prompt injection risk assessment:

AI Safety Response
{
  "ai_safety": {
    "risk_score": 0.02,
    "risk_level": "low",
    "signals": {}
  }
}

Use ai_sanitize_mode to control behavior: sanitize strips detected injections, warn adds flags but keeps content, and block rejects high-risk responses with a 422 error.
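
In warn mode the content is kept, so the caller decides what to do with the assessment. One possible client-side gate is sketched below. It assumes risk_score is a number where higher means riskier, which the sample response suggests but this page does not state. The 0.5 threshold and the function name safe_text_for_llm are arbitrary, hypothetical choices.

```python
def safe_text_for_llm(payload, max_risk=0.5):
    """Return concatenatedText, or None when the reported risk exceeds max_risk.

    Assumptions: risk_score is numeric and higher means riskier;
    max_risk=0.5 is an arbitrary illustrative threshold.
    """
    safety = payload.get("ai_safety")
    if safety is not None and safety.get("risk_score", 0.0) > max_risk:
        return None
    return payload["concatenatedText"]


low_risk = {
    "concatenatedText": "Example Domain",
    "ai_safety": {"risk_score": 0.02, "risk_level": "low", "signals": {}},
}
print(safe_text_for_llm(low_risk))
```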

Errors

| Status | Code | Condition |
| --- | --- | --- |
| 400 | | Missing or invalid site URL |
| 400 | -2233 | Plan does not support the requested feature (premium proxy, retry, etc.) |
| 422 | -4001 | ai_sanitize_mode=block and high injection risk was detected |

Use Cases

  • AI/LLM data pipelines – feed clean text to language models
  • Content analysis and summarization
  • SEO content auditing – check heading structure
  • Research and data collection
  • Automated reporting

MCP Tool

This endpoint is available as the Extract Content tool in the OpenGraph MCP Server. Your AI assistant can extract elements directly without writing any code.
