Web Crawl (scrapling)
Why. Crawl web pages behind Cloudflare and extract content as Markdown. Fetching is progressive: it starts lightweight (curl_cffi) and escalates to Playwright stealth when blocked. The Python venv is auto-installed on first call.
How it works. Registers the web_crawl tool. On first call, creates .pi/scrapling-venv/ with scrapling[fetchers] and markdownify. Validates the URL, then acquires a concurrency semaphore (max 2 concurrent crawls, to protect 8GB of RAM). Runs a Python subprocess: a lightweight curl_cffi fetch first, then Playwright stealth on a Cloudflare block. Extracts Markdown via markdownify and truncates it according to the maxTokens parameter. Returns output formatted as --- URL (via method) ---\ncontent. Configurable via config.json.
Troubleshooting: If crawling fails with Chromium errors, delete the venv and retry; it is recreated automatically:
rm -rf .pi/scrapling-venv
Location: .pi/extensions/scrapling/
Details
Architecture
Port-based adapter pattern for progressive web crawling:
├── index.ts # Entry: tool registration, URL validation, concurrency semaphore (max 2)
├── crawler-engine.ts # CrawlerEngine interface + crawl() orchestration
├── python-adapter.ts # PythonAdapter: subprocess orchestration via pi.exec
├── python-script.ts # Inline Python crawler script using scrapling library
├── venv-setup.ts # Auto-create .pi/scrapling-venv + pip install scrapling[fetchers] + markdownify
├── types.ts # CrawlResult, CrawlPage types
├── mock-adapter.ts # Mock adapter for testing (no network)
└── test/ # Unit + integration tests
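The port/adapter split can be illustrated with a minimal sketch. Only the names CrawlerEngine, CrawlResult, CrawlPage and the idea of a mock adapter come from the tree above; every field, method signature and the crawlWith helper below are assumptions made for illustration, not the extension's actual definitions.

```typescript
// Hypothetical shapes: only the type names appear in the source tree.
interface CrawlPage {
  url: string;
  method: string; // assumed label, e.g. "curl_cffi" or "playwright"
  markdown: string;
}

interface CrawlResult {
  pages: CrawlPage[];
  error?: string;
}

// The "port": anything that can crawl a URL.
interface CrawlerEngine {
  crawl(url: string, maxPages: number): Promise<CrawlResult>;
}

// A test adapter: returns canned pages and never touches the network.
class MockAdapter implements CrawlerEngine {
  private readonly canned: Map<string, CrawlPage>;

  constructor(canned: Map<string, CrawlPage>) {
    this.canned = canned;
  }

  async crawl(url: string, maxPages: number): Promise<CrawlResult> {
    const page = this.canned.get(url);
    if (!page) return { pages: [], error: `no canned page for ${url}` };
    return { pages: [page].slice(0, maxPages) };
  }
}

// Callers depend on the port only, so adapters can be swapped freely.
async function crawlWith(engine: CrawlerEngine, url: string): Promise<string> {
  const result = await engine.crawl(url, 1);
  return result.error ?? result.pages.map((p) => p.markdown).join("\n");
}
```

Because crawlWith only sees the interface, a subprocess-backed adapter and the mock can be exchanged without touching the calling code, which is what makes network-free tests possible.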
Progressive Fetch Strategy
flowchart TD
A[Crawl URL] --> B[curl_cffi: lightweight fetch]
B -- success --> C[Extract content]
B -- Cloudflare block --> D[Playwright stealth mode]
D -- success --> C
D -- failure --> E[Fallback: report error]
C --> F[markdownify: HTML → Markdown]
F --> G[Truncate by maxTokens]
G --> H[Return result]
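The escalation path in the flowchart can be sketched in TypeScript as follows. In the extension the fetching itself happens inside the Python script, so the Fetcher type, the FetchOutcome variants and the progressiveFetch function are hypothetical stand-ins. The sketch also assumes that only a block triggers escalation and that any other lightweight failure is reported directly; the source does not say how those are handled.

```typescript
// Hypothetical outcome of a single fetch attempt.
type FetchOutcome =
  | { kind: "ok"; html: string }
  | { kind: "blocked" } // e.g. a Cloudflare challenge page
  | { kind: "error"; message: string };

type Fetcher = (url: string) => Promise<FetchOutcome>;

interface FetchReport {
  method: "curl_cffi" | "playwright";
  html?: string;
  error?: string;
}

async function progressiveFetch(
  url: string,
  light: Fetcher,
  stealth: Fetcher,
): Promise<FetchReport> {
  // Step 1: lightweight attempt, no browser involved.
  const first = await light(url);
  if (first.kind === "ok") return { method: "curl_cffi", html: first.html };
  // Assumption: non-block failures are reported without escalating.
  if (first.kind === "error") return { method: "curl_cffi", error: first.message };

  // Step 2: escalate to the stealth browser only after a block.
  const second = await stealth(url);
  if (second.kind === "ok") return { method: "playwright", html: second.html };
  return {
    method: "playwright",
    error: second.kind === "error" ? second.message : "still blocked",
  };
}
```

The stealth fetcher is awaited only after the lightweight attempt has finished, so the two never run at the same time.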
Key Design Decisions
- Concurrency semaphore (max 2): Protects 8GB RAM. acquireCrawlLock() uses a polling loop (1000ms interval). That interval would be too coarse for heavily concurrent use, but it is safe for infrequent crawl calls.
- Progressive escalation: Starts with curl_cffi (lightweight, no browser). If Cloudflare blocks the request, escalates to Playwright stealth. The two never run at the same time.
- Auto-installing venv: On first call, creates .pi/scrapling-venv/. If Chromium errors occur, the user can run rm -rf .pi/scrapling-venv and retry; the venv is recreated automatically.
- maxPages cap at 10: Hard upper bound prevents runaway crawling. Default 1.
- maxTokens truncation: Content is truncated with a notice. 0 = no limit.
- URL validation via new URL(): Rejects invalid URLs early. No protocol restriction (http/https/ftp/etc).
- Python subprocess via python-adapter.ts: Wraps pi.exec for subprocess orchestration. Handles signal cancellation for a clean shutdown.
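The input checks and the polling lock described in this list might look roughly like the sketch below. The name acquireCrawlLock, the limit of 2, the 1000ms interval, the maxPages default and cap, and the use of new URL() come from the text; releaseCrawlLock, validateUrl, clampMaxPages, the module-level counter and the pollMs parameter are assumptions for illustration.

```typescript
const MAX_CONCURRENT = 2; // from the text above
const POLL_INTERVAL_MS = 1000; // from the text above
let active = 0; // hypothetical module-level counter

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until a slot is free. The check and the increment happen without an
// intervening await, so two waiters cannot claim the same slot.
async function acquireCrawlLock(pollMs: number = POLL_INTERVAL_MS): Promise<void> {
  while (active >= MAX_CONCURRENT) {
    await sleep(pollMs);
  }
  active++;
}

function releaseCrawlLock(): void {
  active = Math.max(0, active - 1);
}

// new URL() throws a TypeError on malformed input; any scheme is accepted.
function validateUrl(raw: string): URL {
  return new URL(raw);
}

// Default 1, hard upper bound 10; non-finite input falls back to the default.
function clampMaxPages(requested?: number): number {
  const n = Math.floor(requested ?? 1);
  if (!Number.isFinite(n)) return 1;
  return Math.min(10, Math.max(1, n));
}
```

A caller would typically pair acquireCrawlLock() with releaseCrawlLock() in a try/finally block so that a failed crawl still frees its slot.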
Output Format
--- https://example.com (via curl_cffi) ---
# Page Title
Content extracted as Markdown...
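A minimal sketch of how this output could be assembled and truncated is shown below. The header layout and the "0 = no limit" rule come from this document; the function names, the four-characters-per-token heuristic and the wording of the truncation notice are assumptions, since the source does not say how tokens are counted or what the notice looks like.

```typescript
// Assumption: a rough characters-per-token heuristic, not the extension's real counter.
const CHARS_PER_TOKEN = 4;

function truncateByTokens(markdown: string, maxTokens: number): string {
  if (maxTokens <= 0) return markdown; // 0 = no limit
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (markdown.length <= maxChars) return markdown;
  // Hypothetical notice text appended after the cut.
  return markdown.slice(0, maxChars) + `\n\n[Truncated at ~${maxTokens} tokens]`;
}

function formatPage(url: string, method: string, markdown: string, maxTokens: number = 0): string {
  return `--- ${url} (via ${method}) ---\n${truncateByTokens(markdown, maxTokens)}`;
}
```

For example, formatPage("https://example.com", "curl_cffi", "# Page Title") yields the header line followed by the untruncated Markdown, matching the layout shown above.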
Troubleshooting
If crawling fails with Chromium errors, delete the venv and retry; it is recreated automatically:
rm -rf .pi/scrapling-venv
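The auto-recreate behaviour can be pictured as a small planning step that runs before each crawl: if the venv's interpreter is missing, it is created and the packages are installed again. The sketch below is an assumption-laden illustration, not the contents of venv-setup.ts: the Step type, planVenvSetup, the POSIX bin/python layout, the python3 launcher name and the injected exists callback are all hypothetical.

```typescript
interface Step {
  cmd: string;
  args: string[];
}

// Returns the commands needed to make the venv usable; an empty list means
// the existing venv is reused. `exists` is injected so the logic can be
// exercised without touching the filesystem.
function planVenvSetup(venvDir: string, exists: (path: string) => boolean): Step[] {
  const python = `${venvDir}/bin/python`; // assumption: POSIX venv layout
  if (exists(python)) return [];
  return [
    { cmd: "python3", args: ["-m", "venv", venvDir] },
    { cmd: python, args: ["-m", "pip", "install", "scrapling[fetchers]", "markdownify"] },
  ];
}
```

After rm -rf .pi/scrapling-venv the interpreter path no longer exists, so the next call would produce both steps again.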
Testing
Tests cover:
- Mock adapter: returns predefined results without network
- Venv setup: creation, cache, re-creation on failure
- Python adapter: subprocess invocation, cancellation via signal
- Type validation: required fields, optional fields
- Progressive fetch logic: lightweight → stealth escalation sequence