Web
Description: Enable agents to scrape, crawl, and map websites.
Author: Arcade AI
Code: GitHub
Auth: API Key
The Arcade AI Web toolkit provides a pre-built set of tools for interacting with websites. These tools make it easy to build agents and AI apps that can:
- Scrape web pages
- Crawl websites
- Map website structures
- Retrieve crawl status and data
- Cancel ongoing crawls
Install
pip install arcade_web
Available Tools
These tools are currently available in the Arcade AI Web toolkit.
Tool Name | Description |
---|---|
ScrapeUrl | Scrape a URL and return data in specified formats. |
CrawlWebsite | Crawl a website and return crawl status and data. |
GetCrawlStatus | Retrieve the status of a crawl job. |
GetCrawlData | Retrieve data from a completed crawl job. |
CancelCrawl | Cancel an ongoing crawl job. |
MapWebsite | Map a website from a single URL to a map of the entire website. |
If you need to perform an action that's not listed here, you can get in touch with us to request a new tool, or create your own tools.
ScrapeUrl
Scrape a URL and return data in specified formats.
Auth:
- Environment Variables Required:
  - FIRECRAWL_API_KEY: Your Firecrawl API key.
Parameters
- url (string, required) The URL to scrape.
- formats (enum (Formats), optional) The format of the scraped web page. Defaults to Formats.MARKDOWN.
- only_main_content (bool, optional) Only return the main content of the page. Defaults to True.
- include_tags (list, optional) List of tags to include in the output.
- exclude_tags (list, optional) List of tags to exclude from the output.
- wait_for (int, optional) Delay in milliseconds before fetching content. Defaults to 10.
- timeout (int, optional) Timeout in milliseconds for the request. Defaults to 30000.
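As a rough illustration of how the ScrapeUrl parameters and their defaults fit together, the sketch below collects them in a small dataclass. `ScrapeUrlInput` and `to_dict` are hypothetical names made up for this example and are not part of `arcade_web`; how the resulting dictionary is handed to the tool is deliberately left out.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ScrapeUrlInput:
    """Hypothetical container mirroring the ScrapeUrl parameter list."""

    url: str                              # required
    formats: Optional[Any] = None         # left unset so the tool applies its own default
    only_main_content: bool = True
    include_tags: Optional[list] = None
    exclude_tags: Optional[list] = None
    wait_for: int = 10                    # milliseconds
    timeout: int = 30000                  # milliseconds

    def to_dict(self) -> dict:
        """Return the parameters as a dict, omitting anything left unset."""
        if not self.url:
            raise ValueError("url is required")
        return {k: v for k, v in asdict(self).items() if v is not None}


# Only the URL is mandatory; everything else falls back to the documented defaults.
params = ScrapeUrlInput(url="https://example.com", exclude_tags=["nav", "footer"])
print(params.to_dict())
```

Leaving unset options out of the dictionary keeps the request small and lets the tool's own defaults apply.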
CrawlWebsite
Crawl a website and return crawl status and data.
Auth:
- Environment Variables Required:
  - FIRECRAWL_API_KEY: Your Firecrawl API key.
Parameters
- url (string, required) The URL to crawl.
- exclude_paths (list, optional) URL patterns to exclude from the crawl.
- include_paths (list, optional) URL patterns to include in the crawl.
- max_depth (int, optional) Maximum depth to crawl. Defaults to 2.
- ignore_sitemap (bool, optional) Ignore the website sitemap. Defaults to True.
- limit (int, optional) Limit the number of pages to crawl. Defaults to 10.
- allow_backward_links (bool, optional) Enable navigation to previously linked pages. Defaults to False.
- allow_external_links (bool, optional) Allow following links to external websites. Defaults to False.
- webhook (string, optional) URL that receives a POST request when the crawl is started, updated, and completed.
- async_crawl (bool, optional) Run the crawl asynchronously. Defaults to True.
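The CrawlWebsite defaults can be made explicit in code so that a caller only states what differs. The sketch below is one possible way to do that; `build_crawl_input` and the two constants are hypothetical helpers for illustration, not part of the toolkit, and the default values are copied from the parameter list above.

```python
# Defaults copied from the CrawlWebsite parameter list.
CRAWL_DEFAULTS = {
    "max_depth": 2,
    "ignore_sitemap": True,
    "limit": 10,
    "allow_backward_links": False,
    "allow_external_links": False,
    "async_crawl": True,
}
# Parameters that have no documented default.
CRAWL_NO_DEFAULT = {"exclude_paths", "include_paths", "webhook"}


def build_crawl_input(url: str, **overrides) -> dict:
    """Hypothetical helper: merge caller overrides into the documented defaults."""
    unknown = set(overrides) - set(CRAWL_DEFAULTS) - CRAWL_NO_DEFAULT
    if unknown:
        # Catch misspelled parameter names before any request is made.
        raise ValueError(f"unknown CrawlWebsite parameters: {sorted(unknown)}")
    return {"url": url, **CRAWL_DEFAULTS, **overrides}


# Crawl up to 50 pages, restricted to a documentation subtree.
cfg = build_crawl_input("https://example.com", limit=50, include_paths=["/docs/*"])
print(cfg)
```

Rejecting unknown keys up front is a design choice for this sketch; it turns a typo such as `depth=3` into an immediate error rather than a silently ignored option.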
GetCrawlStatus
Retrieve the status of a crawl job.
Auth:
- Environment Variables Required:
  - FIRECRAWL_API_KEY: Your Firecrawl API key.
Parameters
- crawl_id (string, required) The ID of the crawl job.
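Because CrawlWebsite runs asynchronously by default, a caller usually has to check GetCrawlStatus repeatedly until the job finishes. The sketch below shows a generic polling loop under stated assumptions: the status call is injected as a plain function, and the terminal status strings (`completed`, `failed`, `cancelled`) are guesses, since this page does not document the response format. `wait_for_crawl` and the ID `crawl-123` are made up for the example.

```python
import time
from typing import Callable


def wait_for_crawl(
    get_status: Callable[[str], str],
    crawl_id: str,
    done_states: frozenset = frozenset({"completed", "failed", "cancelled"}),  # assumed values
    interval_s: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status(crawl_id) until it reports a terminal state or polls run out."""
    for attempt in range(max_polls):
        status = get_status(crawl_id)
        if status in done_states:
            return status
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"crawl {crawl_id} still running after {max_polls} polls")


# Stand-in for a real GetCrawlStatus call, for demonstration only.
fake_states = iter(["scraping", "scraping", "completed"])
result = wait_for_crawl(lambda _id: next(fake_states), "crawl-123", sleep=lambda s: None)
print(result)
```

Once the loop returns a successful terminal state, the same `crawl_id` can be passed to GetCrawlData; if the job is no longer needed, it can be passed to CancelCrawl instead.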
GetCrawlData
Retrieve data from a completed crawl job.
Auth:
- Environment Variables Required:
  - FIRECRAWL_API_KEY: Your Firecrawl API key.
Parameters
- crawl_id (string, required) The ID of the crawl job.
CancelCrawl
Cancel an ongoing crawl job.
Auth:
- Environment Variables Required:
  - FIRECRAWL_API_KEY: Your Firecrawl API key.
Parameters
- crawl_id (string, required) The ID of the asynchronous crawl job to cancel.
MapWebsite
Map a website from a single URL to a map of the entire website.
Auth:
- Environment Variables Required:
  - FIRECRAWL_API_KEY: Your Firecrawl API key.
Parameters
- url (string, required) The base URL to start crawling from.
- search (string, optional) Search query to use for mapping.
- ignore_sitemap (bool, optional) Ignore the website sitemap. Defaults to True.
- include_subdomains (bool, optional) Include subdomains of the website. Defaults to False.
- limit (int, optional) Maximum number of links to return. Defaults to 5000.
Auth
The Arcade AI Web toolkit uses Firecrawl to scrape, crawl, and map websites.
Global Environment Variables:
- FIRECRAWL_API_KEY: Your Firecrawl API key.
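Since every tool in this toolkit depends on FIRECRAWL_API_KEY, it can help to check for the variable before starting an agent. The following is a minimal sketch of such a check; `require_firecrawl_key` is a hypothetical helper, not something the toolkit provides.

```python
import os
from typing import Mapping


def require_firecrawl_key(env: Mapping[str, str] = os.environ) -> str:
    """Return FIRECRAWL_API_KEY from env, or raise if it is missing or blank."""
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return key


# Passing a mapping explicitly makes the check easy to exercise without touching the real environment.
print(require_firecrawl_key({"FIRECRAWL_API_KEY": "example-key"}))
```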