Web

Description: Enable agents to scrape, crawl, and map websites.

Author: Arcade AI

Code: GitHub

Auth: API Key


The Arcade AI Web toolkit provides a pre-built set of tools for interacting with websites. These tools make it easy to build agents and AI apps that can:

  • Scrape web pages
  • Crawl websites
  • Map website structures
  • Retrieve crawl status and data
  • Cancel ongoing crawls

Install

pip install arcade_web

Available Tools

These tools are currently available in the Arcade AI Web toolkit.

  • ScrapeUrl: Scrape a URL and return data in specified formats.
  • CrawlWebsite: Crawl a website and return crawl status and data.
  • GetCrawlStatus: Retrieve the status of a crawl job.
  • GetCrawlData: Retrieve data from a completed crawl job.
  • CancelCrawl: Cancel an ongoing crawl job.
  • MapWebsite: Starting from a single URL, produce a map of an entire website.

If you need to perform an action that's not listed here, you can get in touch with us to request a new tool, or create your own tools.
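As a sketch of how these tools are typically invoked, a request pairs a tool name with its input parameters. The `Web.ScrapeUrl` tool-name format, the `build_tool_request` helper, and the `arcadepy` client shown in the comments are assumptions, not confirmed by this page; consult the Arcade docs for the canonical client API:

```python
# Illustrative sketch: assembling a Web toolkit request.
# The tool-name format ("Web.ScrapeUrl") and the client usage in the
# comment below are assumptions.

def build_tool_request(tool_name: str, tool_input: dict) -> dict:
    """Bundle a tool name with its input parameters (hypothetical helper)."""
    return {"tool_name": tool_name, "input": tool_input}

request = build_tool_request("Web.ScrapeUrl", {"url": "https://example.com"})

# With the (assumed) Python client, execution would look roughly like:
#   from arcadepy import Arcade
#   client = Arcade()  # picks up the API key from the environment
#   result = client.tools.execute(**request)
```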

ScrapeUrl


Scrape a URL and return data in specified formats.

Auth: API Key

Parameters

  • url (string, required) The URL to scrape.
  • formats (enum (Formats), optional) The format of the scraped web page. Defaults to Formats.MARKDOWN.
  • only_main_content (bool, optional) Only return the main content of the page. Defaults to True.
  • include_tags (list, optional) List of tags to include in the output.
  • exclude_tags (list, optional) List of tags to exclude from the output.
  • wait_for (int, optional) Delay in milliseconds before fetching content. Defaults to 10.
  • timeout (int, optional) Timeout in milliseconds for the request. Defaults to 30000.
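The defaults above can be collected into an input dict before executing the tool. Only the parameter names and default values come from the table; the helper itself is a hypothetical convenience, not part of the toolkit:

```python
# Hypothetical helper that mirrors ScrapeUrl's documented defaults.

def scrape_url_input(url: str, **overrides) -> dict:
    """Build a ScrapeUrl input dict, applying the documented defaults."""
    params = {
        "formats": "markdown",      # Formats.MARKDOWN
        "only_main_content": True,  # strip nav, footers, etc.
        "wait_for": 10,             # milliseconds before fetching content
        "timeout": 30000,           # request timeout in milliseconds
    }
    params.update(overrides)
    params["url"] = url
    return params

# Keep only <article> content and allow a slower page to load:
payload = scrape_url_input(
    "https://example.com/blog",
    include_tags=["article"],
    wait_for=500,
)
```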

CrawlWebsite


Crawl a website and return crawl status and data.

Auth: API Key

Parameters

  • url (string, required) The URL to crawl.
  • exclude_paths (list, optional) URL patterns to exclude from the crawl.
  • include_paths (list, optional) URL patterns to include in the crawl.
  • max_depth (int, optional) Maximum depth to crawl. Defaults to 2.
  • ignore_sitemap (bool, optional) Ignore the website sitemap. Defaults to True.
  • limit (int, optional) Maximum number of pages to crawl. Defaults to 10.
  • allow_backward_links (bool, optional) Allow navigation to previously linked pages. Defaults to False.
  • allow_external_links (bool, optional) Allow following links to external websites. Defaults to False.
  • webhook (string, optional) URL that receives a POST request when the crawl is started, updated, and completed.
  • async_crawl (bool, optional) Run the crawl asynchronously. Defaults to True.
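A crawl request is built the same way. The parameter names and defaults are from the list above; the helper and the example webhook URL are illustrative assumptions. Because async_crawl defaults to True, the call returns a crawl job ID rather than blocking until the crawl finishes:

```python
# Hypothetical sketch of a CrawlWebsite input using the documented defaults.

def crawl_website_input(url: str, **overrides) -> dict:
    """Build a CrawlWebsite input dict, applying the documented defaults."""
    params = {
        "max_depth": 2,
        "ignore_sitemap": True,
        "limit": 10,
        "allow_backward_links": False,
        "allow_external_links": False,
        "async_crawl": True,  # returns a crawl job ID instead of blocking
    }
    params.update(overrides)
    params["url"] = url
    return params

# Restrict the crawl to documentation pages and receive progress callbacks:
payload = crawl_website_input(
    "https://example.com",
    include_paths=["/docs/*"],
    webhook="https://example.com/hooks/crawl",  # hypothetical endpoint
)
```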

GetCrawlStatus


Retrieve the status of a crawl job.

Auth: API Key

Parameters

  • crawl_id (string, required) The ID of the crawl job.

GetCrawlData


Retrieve data from a completed crawl job.

Auth: API Key

Parameters

  • crawl_id (string, required) The ID of the crawl job.

CancelCrawl


Cancel an ongoing crawl job.

Auth: API Key

Parameters

  • crawl_id (string, required) The ID of the asynchronous crawl job to cancel.
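Together, GetCrawlStatus, GetCrawlData, and CancelCrawl form a simple polling loop around an asynchronous crawl job. The status strings below are assumptions about what the status payload reports, and `next_tool` is a hypothetical helper; verify the real values returned by GetCrawlStatus before relying on them:

```python
# Hypothetical status values; verify against real GetCrawlStatus output.

def next_tool(status: str) -> str:
    """Pick the follow-up tool for a crawl job based on its status."""
    if status == "completed":
        return "Web.GetCrawlData"    # fetch the finished crawl's data
    if status in ("queued", "scraping"):
        return "Web.GetCrawlStatus"  # still running: poll again
    return "Web.CancelCrawl"         # unexpected state: stop the job

# A polling loop would re-run GetCrawlStatus with the same crawl_id,
# sleeping between calls, until next_tool() stops returning
# "Web.GetCrawlStatus".
```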

MapWebsite


Starting from a single URL, produce a map of an entire website.

Auth: API Key

Parameters

  • url (string, required) The base URL to start crawling from.
  • search (string, optional) Search query to use for mapping.
  • ignore_sitemap (bool, optional) Ignore the website sitemap. Defaults to True.
  • include_subdomains (bool, optional) Include subdomains of the website. Defaults to False.
  • limit (int, optional) Maximum number of links to return. Defaults to 5000.
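A MapWebsite input follows the same shape; the search parameter narrows the map to matching links. As before, the helper is a hypothetical sketch built from the documented parameter names and defaults:

```python
# Hypothetical MapWebsite input builder using the documented defaults.

def map_website_input(url: str, **overrides) -> dict:
    """Build a MapWebsite input dict, applying the documented defaults."""
    params = {
        "ignore_sitemap": True,
        "include_subdomains": False,
        "limit": 5000,  # maximum number of links returned
    }
    params.update(overrides)
    params["url"] = url
    return params

# Map pricing-related pages across the site and its subdomains:
payload = map_website_input(
    "https://example.com",
    search="pricing",
    include_subdomains=True,
)
```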

Auth

The Arcade AI Web toolkit uses Firecrawl to scrape, crawl, and map websites.

Global Environment Variables: