Web Crawl (scrapling)

📄 README

Why. Crawl web pages behind Cloudflare and extract their content as Markdown. Fetching is progressive: it starts with a lightweight fetcher (curl_cffi) and escalates to Playwright stealth mode when blocked. A Python venv is installed automatically on first call.

How it works. Registers the web_crawl tool. On first call, it creates .pi/scrapling-venv/ and installs scrapling[fetchers] and markdownify. Each call validates the URL, then acquires a concurrency semaphore (max 2 concurrent crawls, which keeps memory use manageable on an 8 GB RAM machine). It then runs a Python subprocess: a lightweight curl_cffi fetch first, then Playwright stealth mode if Cloudflare blocks the request. The HTML is converted to Markdown via markdownify and truncated according to the maxTokens parameter. The result is returned formatted as --- URL (via method) ---\ncontent. Configurable via config.json.
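The validation and truncation steps can be sketched in TypeScript as follows. This is a minimal illustration, not the extension's actual code: the helper names, the truncation notice text, and the rough estimate of 4 characters per token are all assumptions.

```typescript
// Hypothetical helpers. The 4-characters-per-token estimate is an assumption,
// not something the extension documents.
const CHARS_PER_TOKEN = 4;

// Accept any string that the WHATWG URL parser can parse (any scheme).
function isValidUrl(input: string): boolean {
  try {
    new URL(input); // throws a TypeError on malformed input
    return true;
  } catch {
    return false;
  }
}

// Cut the Markdown down to roughly maxTokens and append a notice.
// A maxTokens of 0 (or less) means "no limit".
function truncateByTokens(markdown: string, maxTokens: number): string {
  if (maxTokens <= 0) return markdown;
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (markdown.length <= maxChars) return markdown;
  return markdown.slice(0, maxChars) + "\n\n[content truncated]";
}
```

A real implementation might count tokens with a tokenizer instead of a character estimate; the control flow would stay the same.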

Troubleshooting: If crawling fails with Chromium errors, delete the venv and retry; it is recreated automatically:

rm -rf .pi/scrapling-venv

Location: .pi/extensions/scrapling/

Details

Architecture

Port-based adapter pattern for progressive web crawling:

├── index.ts           # Entry: tool registration, URL validation, concurrency semaphore (max 2)
├── crawler-engine.ts  # CrawlerEngine interface + crawl() orchestration
├── python-adapter.ts  # PythonAdapter: subprocess orchestration via pi.exec
├── python-script.ts   # Inline Python crawler script using scrapling library
├── venv-setup.ts      # Auto-create .pi/scrapling-venv + pip install scrapling[fetchers] + markdownify
├── types.ts           # CrawlResult, CrawlPage types
├── mock-adapter.ts    # Mock adapter for testing (no network)
└── test/              # Unit + integration tests
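To illustrate the port/adapter split, here is a hedged sketch of how the CrawlerEngine port and a mock adapter might fit together. The type names CrawlResult, CrawlPage, and CrawlerEngine come from the file listing above; every field name, the method signature, and the runCrawl helper are assumptions made for illustration and may not match the real files.

```typescript
// Field names below are illustrative guesses; the real types.ts may differ.
interface CrawlPage {
  url: string;
  method: string; // which fetcher produced the page, e.g. "curl_cffi"
  markdown: string;
}

interface CrawlResult {
  pages: CrawlPage[];
  error?: string;
}

// The "port": callers depend on this interface, not on a concrete adapter.
interface CrawlerEngine {
  crawl(url: string, maxPages: number): Promise<CrawlResult>;
}

// A mock adapter that returns canned pages and never touches the network.
class MockAdapter implements CrawlerEngine {
  private readonly canned = new Map<string, CrawlPage>();

  constructor(pages: CrawlPage[]) {
    for (const page of pages) this.canned.set(page.url, page);
  }

  async crawl(url: string, _maxPages: number): Promise<CrawlResult> {
    const page = this.canned.get(url);
    return page
      ? { pages: [page] }
      : { pages: [], error: `no mock result for ${url}` };
  }
}

// Hypothetical orchestration helper: works with any CrawlerEngine.
async function runCrawl(engine: CrawlerEngine, url: string, maxPages = 1): Promise<CrawlResult> {
  return engine.crawl(url, maxPages);
}
```

Because runCrawl only sees the interface, swapping the mock for a subprocess-backed adapter would not require changes at the call site.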

Progressive Fetch Strategy

flowchart TD
    A[Crawl URL] --> B[curl_cffi: lightweight fetch]
    B -- success --> C[Extract content]
    B -- Cloudflare block --> D[Playwright stealth mode]
    D -- success --> C
    D -- failure --> E[Fallback: report error]
    C --> F[markdownify: HTML→Markdown]
    F --> G[Truncate by maxTokens]
    G --> H[Return result]
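
The escalation logic in the flowchart can be restated in TypeScript as below. In the extension itself this logic runs inside the Python subprocess, so this is only a sketch of the control flow. The FetchOutcome type, the injected fetcher functions, the "playwright" method label, and the choice to report non-Cloudflare errors without escalating are all assumptions.

```typescript
// Possible results of a single fetch attempt (hypothetical shape).
type FetchOutcome =
  | { kind: "ok"; html: string }
  | { kind: "blocked" } // e.g. a Cloudflare challenge page
  | { kind: "error"; message: string };

type Fetcher = (url: string) => Promise<FetchOutcome>;

interface ProgressiveResult {
  method: string;
  html?: string;
  error?: string;
}

// Try the lightweight fetcher first; start the stealth fetcher only on a block.
async function progressiveFetch(
  url: string,
  light: Fetcher,
  stealth: Fetcher,
): Promise<ProgressiveResult> {
  const first = await light(url);
  if (first.kind === "ok") return { method: "curl_cffi", html: first.html };
  // Assumption: errors other than a block are reported without escalating.
  if (first.kind === "error") return { method: "curl_cffi", error: first.message };

  // Blocked: escalate to the heavier browser-based fetcher.
  const second = await stealth(url);
  if (second.kind === "ok") return { method: "playwright", html: second.html };
  const message = second.kind === "error" ? second.message : "blocked in stealth mode";
  return { method: "playwright", error: message };
}
```

Passing the fetchers in as parameters keeps the escalation order testable without a network or a browser.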

Key Design Decisions

  • Concurrency semaphore (max 2): Keeps memory use manageable on an 8 GB RAM machine. acquireCrawlLock() uses a polling loop (1000 ms interval). Polling is coarse for heavily concurrent workloads but adequate for infrequent crawl calls.
  • Progressive escalation: Starts with curl_cffi (lightweight, no browser). If Cloudflare blocks the request, escalates to Playwright stealth. The two fetchers never run at the same time.
  • Auto-installing venv: On first call, creates .pi/scrapling-venv/. If Chromium errors occur, the user can run rm -rf .pi/scrapling-venv and retry; the venv is recreated automatically.
  • maxPages cap at 10: A hard upper bound prevents runaway crawling. Default 1.
  • maxTokens truncation: Content is truncated with a notice. 0 = no limit.
  • URL validation via new URL(): Rejects invalid URLs early. No protocol restriction: http, https, ftp, and other schemes all pass validation.
  • Python subprocess via python-adapter.ts: Wraps pi.exec for subprocess orchestration. Handles signal cancellation for clean shutdown.
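
As an illustration of the first and fourth decisions, here is a minimal sketch of a polling-based lock and a maxPages clamp. Only the name acquireCrawlLock(), the limit of 2, the 1000 ms interval, the cap of 10, and the default of 1 come from the text above. The release-function return value, the activeCrawls() helper, the pollMs parameter, and the handling of zero or negative maxPages are assumptions.

```typescript
const MAX_CONCURRENT = 2;
const POLL_INTERVAL_MS = 1000;
const MAX_PAGES_CAP = 10;

let active = 0; // number of crawls currently holding the lock

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper for inspecting the current count.
function activeCrawls(): number {
  return active;
}

// Wait until a slot is free, then take it. Returns a release function
// (an assumed API shape) that is safe to call more than once.
async function acquireCrawlLock(pollMs: number = POLL_INTERVAL_MS): Promise<() => void> {
  while (active >= MAX_CONCURRENT) {
    await sleep(pollMs);
  }
  // No await between the check above and this increment, so no other
  // caller can slip in on the single-threaded event loop.
  active += 1;
  let released = false;
  return () => {
    if (!released) {
      released = true;
      active -= 1;
    }
  };
}

// Clamp the requested page count into [1, 10]; a missing value defaults to 1.
function clampMaxPages(requested?: number): number {
  if (requested === undefined || !Number.isFinite(requested)) return 1;
  return Math.min(MAX_PAGES_CAP, Math.max(1, Math.floor(requested)));
}
```

A caller would typically release the lock in a finally block so that a failed crawl does not leak a slot.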

Output Format

--- https://example.com (via curl_cffi) ---
# Page Title

Content extracted as Markdown...
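
A formatter producing this layout could look like the sketch below. The function names and the blank-line separator between multiple pages are assumptions; only the header line shape is taken from the example above.

```typescript
// Hypothetical formatter for a single crawled page.
function formatPage(url: string, method: string, markdown: string): string {
  return `--- ${url} (via ${method}) ---\n${markdown}`;
}

// Assumption: multiple pages are joined with one blank line between them.
function formatPages(pages: { url: string; method: string; markdown: string }[]): string {
  return pages.map((p) => formatPage(p.url, p.method, p.markdown)).join("\n\n");
}
```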

Troubleshooting

If crawling fails with Chromium errors, delete the venv and retry; it is recreated automatically:

rm -rf .pi/scrapling-venv

Testing

Tests cover:

  • Mock adapter: returns predefined results without network
  • Venv setup: creation, cache, re-creation on failure
  • Python adapter: subprocess invocation, cancellation via signal
  • Type validation: required fields, optional fields
  • Progressive fetch logic: lightweight → stealth escalation sequence

Copyright © 2026 SchneiderDaniel. Distributed under the MIT License.
