Web

Description: Enable agents to scrape, crawl, and map websites.

Author: Arcade AI

Code: GitHub

Auth: API Key


The Arcade AI Web toolkit provides a pre-built set of tools for interacting with websites. These tools make it easy to build agents and AI apps that can:

  • Scrape web pages
  • Crawl websites
  • Map website structures
  • Retrieve crawl status and data
  • Cancel ongoing crawls

Install

pip install arcade_web

Available Tools

These tools are currently available in the Arcade AI Web toolkit.

  • ScrapeUrl: Scrape a URL and return data in specified formats.
  • CrawlWebsite: Crawl a website and return crawl status and data.
  • GetCrawlStatus: Retrieve the status of a crawl job.
  • GetCrawlData: Retrieve data from a completed crawl job.
  • CancelCrawl: Cancel an ongoing crawl job.
  • MapWebsite: Starting from a single URL, produce a map of an entire website.

If you need to perform an action that's not listed here, you can get in touch with us to request a new tool, or create your own tools.
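As a sketch of how these tools are typically invoked, a request pairs a tool name with its input parameters. The `Web.ScrapeUrl` tool-name format, the `build_tool_request` helper, and the `arcadepy` client shown in the comments are assumptions, not confirmed by this page; consult the Arcade docs for the canonical client API:

```python
# Illustrative sketch: assembling a Web toolkit request.
# The tool-name format ("Web.ScrapeUrl") and the client usage in the
# comment below are assumptions.

def build_tool_request(tool_name: str, tool_input: dict) -> dict:
    """Bundle a tool name with its input parameters (hypothetical helper)."""
    return {"tool_name": tool_name, "input": tool_input}

request = build_tool_request("Web.ScrapeUrl", {"url": "https://example.com"})

# With the (assumed) Python client, execution would look roughly like:
#   from arcadepy import Arcade
#   client = Arcade()  # picks up the API key from the environment
#   result = client.tools.execute(**request)
```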

ScrapeUrl


Scrape a URL and return data in specified formats.

Auth: API Key

Parameters

  • url (string, required) The URL to scrape.
  • formats (enum (Formats), optional) The format of the scraped web page. Defaults to Formats.MARKDOWN.
  • only_main_content (bool, optional) Only return the main content of the page. Defaults to True.
  • include_tags (list, optional) List of tags to include in the output.
  • exclude_tags (list, optional) List of tags to exclude from the output.
  • wait_for (int, optional) Delay in milliseconds before fetching content. Defaults to 10.
  • timeout (int, optional) Timeout in milliseconds for the request. Defaults to 30000.
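The defaults above can be collected into an input dict before executing the tool. Only the parameter names and default values come from the table; the helper itself is a hypothetical convenience, not part of the toolkit:

```python
# Hypothetical helper that mirrors ScrapeUrl's documented defaults.

def scrape_url_input(url: str, **overrides) -> dict:
    """Build a ScrapeUrl input dict, applying the documented defaults."""
    params = {
        "formats": "markdown",      # Formats.MARKDOWN
        "only_main_content": True,  # strip nav, footers, etc.
        "wait_for": 10,             # milliseconds before fetching content
        "timeout": 30000,           # request timeout in milliseconds
    }
    params.update(overrides)
    params["url"] = url
    return params

# Keep only <article> content and allow a slower page to load:
payload = scrape_url_input(
    "https://example.com/blog",
    include_tags=["article"],
    wait_for=500,
)
```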

CrawlWebsite


Crawl a website and return crawl status and data.

Auth: API Key

Parameters

  • url (string, required) The URL to crawl.
  • exclude_paths (list, optional) URL patterns to exclude from the crawl.
  • include_paths (list, optional) URL patterns to include in the crawl.
  • max_depth (int, optional) Maximum depth to crawl. Defaults to 2.
  • ignore_sitemap (bool, optional) Ignore the website sitemap. Defaults to True.
  • limit (int, optional) Maximum number of pages to crawl. Defaults to 10.
  • allow_backward_links (bool, optional) Allow navigation to previously linked pages. Defaults to False.
  • allow_external_links (bool, optional) Allow following links to external websites. Defaults to False.
  • webhook (string, optional) URL that receives a POST request when the crawl is started, updated, and completed.
  • async_crawl (bool, optional) Run the crawl asynchronously. Defaults to True.
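A crawl request is built the same way. The parameter names and defaults are from the list above; the helper and the example webhook URL are illustrative assumptions. Because async_crawl defaults to True, the call returns a crawl job ID rather than blocking until the crawl finishes:

```python
# Hypothetical sketch of a CrawlWebsite input using the documented defaults.

def crawl_website_input(url: str, **overrides) -> dict:
    """Build a CrawlWebsite input dict, applying the documented defaults."""
    params = {
        "max_depth": 2,
        "ignore_sitemap": True,
        "limit": 10,
        "allow_backward_links": False,
        "allow_external_links": False,
        "async_crawl": True,  # returns a crawl job ID instead of blocking
    }
    params.update(overrides)
    params["url"] = url
    return params

# Restrict the crawl to documentation pages and receive progress callbacks:
payload = crawl_website_input(
    "https://example.com",
    include_paths=["/docs/*"],
    webhook="https://example.com/hooks/crawl",  # hypothetical endpoint
)
```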

GetCrawlStatus


Retrieve the status of a crawl job.

Auth: API Key

Parameters

  • crawl_id (string, required) The ID of the crawl job.

GetCrawlData


Retrieve data from a completed crawl job.

Auth: API Key

Parameters

  • crawl_id (string, required) The ID of the crawl job.

CancelCrawl


Cancel an ongoing crawl job.

Auth: API Key

Parameters

  • crawl_id (string, required) The ID of the asynchronous crawl job to cancel.
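Together, GetCrawlStatus, GetCrawlData, and CancelCrawl form a simple polling loop around an asynchronous crawl job. The status strings below are assumptions about what the status payload reports, and `next_tool` is a hypothetical helper; verify the real values returned by GetCrawlStatus before relying on them:

```python
# Hypothetical status values; verify against real GetCrawlStatus output.

def next_tool(status: str) -> str:
    """Pick the follow-up tool for a crawl job based on its status."""
    if status == "completed":
        return "Web.GetCrawlData"    # fetch the finished crawl's data
    if status in ("queued", "scraping"):
        return "Web.GetCrawlStatus"  # still running: poll again
    return "Web.CancelCrawl"         # unexpected state: stop the job

# A polling loop would re-run GetCrawlStatus with the same crawl_id,
# sleeping between calls, until next_tool() stops returning
# "Web.GetCrawlStatus".
```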

MapWebsite


Starting from a single URL, produce a map of an entire website.

Auth: API Key

Parameters

  • url (string, required) The base URL to start crawling from.
  • search (string, optional) Search query to use for mapping.
  • ignore_sitemap (bool, optional) Ignore the website sitemap. Defaults to True.
  • include_subdomains (bool, optional) Include subdomains of the website. Defaults to False.
  • limit (int, optional) Maximum number of links to return. Defaults to 5000.
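A MapWebsite input follows the same shape; the search parameter narrows the map to matching links. As before, the helper is a hypothetical sketch built from the documented parameter names and defaults:

```python
# Hypothetical MapWebsite input builder using the documented defaults.

def map_website_input(url: str, **overrides) -> dict:
    """Build a MapWebsite input dict, applying the documented defaults."""
    params = {
        "ignore_sitemap": True,
        "include_subdomains": False,
        "limit": 5000,  # maximum number of links returned
    }
    params.update(overrides)
    params["url"] = url
    return params

# Map pricing-related pages across the site and its subdomains:
payload = map_website_input(
    "https://example.com",
    search="pricing",
    include_subdomains=True,
)
```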

Auth

The Arcade AI Web toolkit uses Firecrawl to scrape, crawl, and map websites.

Global Environment Variables: