Web Scraping Glossary

45+ essential terms for web scraping, data extraction, anti-bot systems, and browser automation — with links to deeper reading.

A

AJAX: Asynchronous JavaScript and XML — a technique for loading data dynamically without a full page reload. Many modern websites use AJAX to fetch content, requiring scrapers to handle JavaScript rendering or intercept network requests.
Anti-Bot: Systems and techniques designed to detect and block automated traffic (bots, crawlers, scrapers) on websites. Examples include PerimeterX, Cloudflare Bot Management, Akamai Bot Manager, and DataDome.
API: Application Programming Interface. A programmatic interface that allows software components to communicate. When a site offers an official API, using it is usually preferable to scraping: the data is already structured, the interface is more stable than page markup, and access is governed by explicit terms of use.

B

Behavioral Biometrics: Analysis of how users physically interact with a website — mouse trajectories, scroll speed, click patterns, typing cadence. Anti-bot systems use this to distinguish humans from automated scripts.
Bot: An automated software agent that performs tasks on the internet. Bots include search engine crawlers (beneficial), scrapers (data collection), and malicious bots (fraud, spam).
Browser Extension: Software that extends browser functionality. CrawlPilot is a browser extension that enables point-and-click web scraping without coding, running entirely inside the user's browser.
Browser Fingerprinting: A tracking technique that collects browser and device attributes — screen resolution, installed fonts, canvas rendering, WebGL data, installed plugins — to create a persistent identifier.

C

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart. A challenge (such as image recognition or a text puzzle) presented to verify that a visitor is human. A common anti-scraping measure.
Cookie: A small piece of data stored by a browser that servers use to maintain sessions, authentication state, and user preferences. Scrapers often need to manage cookies to maintain authenticated sessions.
Crawler: A program that systematically browses the web, following links to discover and index pages. Distinguished from a scraper in that crawlers navigate broadly while scrapers target specific data on known pages.
CSS Selector: A pattern used in CSS and JavaScript to target specific HTML elements. CSS selectors are one of the two primary methods (along with XPath) for identifying elements to extract during scraping.
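
As a rough illustration of the Crawler entry above, the sketch below runs a breadth-first crawl over a tiny in-memory site. The `PAGES` dictionary, its URLs, and the `crawl` helper are made up for the example and stand in for real HTTP fetches.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML body (stands in for HTTP fetches).
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">Home</a>',
    "/c": "<p>No links here</p>",
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start):
    """Breadth-first crawl: visit each reachable page once, in discovery order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("/"))
```

The `seen` set is what keeps the crawler from looping when pages link back to each other, as `/b` does here.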

D

Data Pipeline: A series of automated data processing steps — extraction, transformation, enrichment, loading — that moves data from source websites to a destination storage or analysis system.
Datacenter Proxy: A proxy server hosted in a data center. Faster and cheaper than residential proxies but more easily detected as non-human traffic since the IP ranges are well-known to anti-bot systems.
DOM: Document Object Model — the tree-based in-memory representation of an HTML document. Scrapers navigate and query the DOM to locate and extract data from web pages.
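
To make the DOM entry above concrete, here is a minimal sketch that parses a small document into a tree and queries it, using Python's standard `xml.dom.minidom`. The markup and the `text_of` helper are invented for the example, and the snippet is deliberately well-formed because minidom is an XML parser rather than a forgiving HTML parser.

```python
from xml.dom import minidom

# Well-formed XHTML snippet; real-world HTML with unclosed tags
# would need a more tolerant HTML parser.
DOC = """<html><body>
  <ul id="products">
    <li class="item">Keyboard</li>
    <li class="item">Mouse</li>
  </ul>
</body></html>"""

dom = minidom.parseString(DOC)


def text_of(node):
    """Concatenate the text nodes directly under an element."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


# getElementsByTagName walks the tree and returns every matching element.
items = [text_of(li) for li in dom.getElementsByTagName("li")]
print(items)
```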

E

ETL: Extract, Transform, Load — the standard process for moving data: extract from source (scraping), transform into a clean format, and load into a database or analysis tool.
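
The three ETL stages above can be sketched in a few lines. This is only an illustration: the `RAW_HTML` sample, the `products` table, and the function names are assumptions, and a regular expression is used for extraction purely to keep the example short (a real scraper would normally use a proper HTML parser).

```python
import re
import sqlite3

# Hypothetical raw HTML standing in for a scraped page.
RAW_HTML = """
<div class="product"><span class="name"> Widget </span><span class="price">$1,299.50</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">$20.00</span></div>
"""


def extract(html):
    """Extract: pull (name, price) string pairs out of the markup."""
    pattern = r'<span class="name">(.*?)</span><span class="price">(.*?)</span>'
    return re.findall(pattern, html)


def transform(rows):
    """Transform: trim whitespace and convert price strings to floats."""
    return [
        (name.strip(), float(price.replace("$", "").replace(",", "")))
        for name, price in rows
    ]


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_HTML)), conn)
result = conn.execute("SELECT name, price FROM products ORDER BY price").fetchall()
print(result)
```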

F

Fingerprinting: The practice of collecting browser, device, or network attributes to identify and track users or detect bots. Includes browser fingerprinting, TLS fingerprinting, and canvas fingerprinting.

H

Headless Browser: A web browser that runs without a graphical interface and can be controlled programmatically. Used in scraping to render JavaScript-heavy pages. Examples: headless Chrome via Puppeteer, headless Firefox via Playwright.
HTTP Headers: Metadata sent with every HTTP request and response. Headers like User-Agent, Accept-Language, and Referer can be inspected by anti-bot systems to detect automated traffic.
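
As a small illustration of the HTTP Headers entry above, the sketch below builds a request with explicit headers using Python's standard `urllib.request`, without sending it. The URL, the header values, and the `build_request` helper are placeholders chosen for the example.

```python
from urllib.request import Request

# Example header values; the User-Agent string and URLs are illustrative only.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}


def build_request(url):
    """Build (but do not send) a request that carries explicit headers."""
    return Request(url, headers=HEADERS)


req = build_request("https://example.com/products")

# urllib stores header names via str.capitalize(), e.g. "User-agent".
sent_headers = dict(req.header_items())
for name, value in sent_headers.items():
    print(f"{name}: {value}")
```

Inspecting `header_items()` before sending is a cheap way to check what a server, or an anti-bot system in front of it, would actually see.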

I

Infinite Scroll: A UX pattern where content loads automatically as the user scrolls down, rather than paginating. Requires scrapers to simulate scroll events or intercept API calls to collect all data.
IP Rotation: The practice of cycling through multiple IP addresses when scraping to avoid rate limiting and IP bans. Achieved through proxy pools or rotating proxy services.
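
A minimal sketch of the IP Rotation idea above: each outgoing request is assigned the next proxy from a pool in round-robin order. The proxy addresses are documentation-range placeholders rather than real proxies, and `proxy_rotation` is a hypothetical helper.

```python
from itertools import cycle

# Hypothetical proxy endpoints (TEST-NET addresses, not real proxies).
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(pool):
    """Yield a proxies mapping for each request, cycling through the pool."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXY_POOL)
urls = [f"https://example.com/page/{n}" for n in range(1, 5)]

# Pair each URL with the proxy it would be sent through.
assignments = [(url, next(rotation)["http"]) for url in urls]
for url, proxy in assignments:
    print(url, "via", proxy)
```

Each yielded mapping has the scheme-to-URL shape that `urllib.request.ProxyHandler` accepts, so it could be passed to an opener when real requests are made.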

J

JavaScript Rendering: Executing JavaScript code to generate the final HTML content of a page. Required for scraping Single-Page Applications (SPAs) where content is loaded dynamically rather than present in the initial HTML response.
JSON: JavaScript Object Notation. A lightweight, human-readable data format, and one of the most common formats for structured data exports from web scrapers.
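
For the JSON entry above, a short sketch of exporting scraped records with Python's standard `json` module. The records themselves are invented sample data.

```python
import json

# Hypothetical scraped records.
records = [
    {"title": "Café Table", "price": 149.0, "in_stock": True},
    {"title": "Desk Lamp", "price": 39.5, "in_stock": False},
]

# ensure_ascii=False keeps non-ASCII text readable; indent=2 pretty-prints.
payload = json.dumps(records, ensure_ascii=False, indent=2)
print(payload)

# Parsing the text back yields the same Python objects.
restored = json.loads(payload)
```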

L

Local-First: An architecture where all data processing happens on the user's device rather than a remote server. CrawlPilot is local-first: scraped data never leaves your browser, ensuring privacy.

M

MCP: Model Context Protocol — an open standard by Anthropic for connecting AI models to external tools and data sources. CrawlPilot supports MCP, allowing AI agents to trigger web scraping jobs.

P

Pagination: The process of navigating through multiple pages of results to collect complete datasets. Scrapers must detect and follow pagination patterns (next-page buttons, URL parameters, API offsets).
Playwright: A cross-browser automation library from Microsoft that controls Chromium, Firefox, and WebKit (the engine behind Safari) through a single API. Used for scraping JavaScript-rendered pages, with broader browser-engine coverage than Puppeteer.
Polite Crawling: Following ethical scraping practices: respecting robots.txt, honoring crawl-delay directives, limiting request rates, and avoiding scraping during peak traffic hours.
Proxy: An intermediary server that routes requests on behalf of a client, masking the client's real IP address. Used in scraping to distribute requests and avoid IP bans.
Puppeteer: A Node.js library that provides a high-level API for controlling Chrome or Chromium, typically in headless mode. Widely used for scraping JavaScript-rendered pages and automating browser interactions.
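
The Pagination entry above describes following a next-page pattern until the results run out. The sketch below shows that loop against a fake paginated API. `FAKE_API`, `fetch_page`, and the `next_page` field are assumptions made for the example, not a real service's response format.

```python
# Hypothetical paginated API: page number -> response body.
FAKE_API = {
    1: {"items": ["a", "b"], "next_page": 2},
    2: {"items": ["c", "d"], "next_page": 3},
    3: {"items": ["e"], "next_page": None},
}


def fetch_page(page):
    """Stand-in for an HTTP request; returns one page of results."""
    return FAKE_API[page]


def collect_all(start_page=1, max_pages=100):
    """Follow next_page pointers until they run out or max_pages is reached."""
    items = []
    page = start_page
    fetched = 0
    while page is not None and fetched < max_pages:
        response = fetch_page(page)
        items.extend(response["items"])
        page = response["next_page"]
        fetched += 1
    return items


print(collect_all())
```

The `max_pages` cap is a safety net in case a site's next-page pointers never terminate.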

R

Rate Limiting: A restriction on the number of requests a server will accept from a single client per time period. Scrapers must implement delays and request throttling to stay within rate limits.
Residential Proxy: A proxy that routes traffic through real residential IP addresses (home internet connections). Harder for anti-bot systems to block because the IPs appear as genuine users.
robots.txt: A text file at the root of a website (e.g., /robots.txt) that tells crawlers which parts of the site they may and may not access. The file is advisory rather than technically enforced, so ethical scrapers should read and respect its directives.
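
The Rate Limiting and robots.txt entries above can be combined into one small sketch: parse the rules with Python's standard `urllib.robotparser`, then space out requests by the declared crawl delay. The robots.txt content, the `ExampleBot` name, the one-second fallback, and the `Throttle` and `polite_fetch_allowed` helpers are all assumptions for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "ExampleBot"  # hypothetical crawler name


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now


# Use the site's Crawl-delay if present; otherwise fall back to 1 second (an assumption).
delay = parser.crawl_delay(USER_AGENT) or 1
throttle = Throttle(delay)


def polite_fetch_allowed(url):
    """Return True if robots.txt permits the URL, after waiting out the delay."""
    if not parser.can_fetch(USER_AGENT, url):
        return False
    throttle.wait()
    return True
```

Passing `clock` and `sleep` into `Throttle` is a design choice that lets the timing logic be exercised without real waiting.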

S

Scraping: The automated extraction of data from websites by parsing HTML, executing JavaScript, and collecting structured information from page elements. Also called web scraping or data scraping.
Selector: A query pattern used to identify specific elements within an HTML document. CSS selectors and XPath are the two primary selector languages used in web scraping.
Selenium: A browser automation framework originally designed for web testing. Also used for scraping JavaScript-rendered pages, though many newer scraping projects favor Puppeteer or Playwright.
Session Management: Maintaining authentication state (cookies, tokens) across multiple scraping requests to access content that requires login, or to maintain consistent identity while navigating a site.
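Sitemap: An XML file listing the URLs a site owner wants crawlers to discover, optionally with metadata such as last-modified dates. Search engine crawlers use sitemaps to discover content. Scrapers can also start from a sitemap to enumerate a site's pages systematically, although a sitemap is not guaranteed to list every page.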
SPA: Single-Page Application — a web app that dynamically rewrites the current page rather than loading new pages. SPAs require headless browser execution or API interception for complete data extraction.
Stealth Mode: Techniques that make automated browsers appear more human-like: randomizing timing, spoofing browser fingerprints, managing cookies naturally, and avoiding detectable automation artifacts.
Structured Data: Data organized in a predefined, machine-readable format such as JSON, CSV, or XML. The goal of web scraping is typically to transform unstructured HTML content into structured data.
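
As an illustration of the Sitemap entry above, the sketch below extracts page URLs from a sitemap document with Python's standard `xml.etree.ElementTree`. The sitemap content and the `sitemap_urls` helper are invented for the example. The namespace URI is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; real sitemaps are commonly served at /sitemap.xml.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://example.com/products</loc></url>
</urlset>"""

# Sitemap elements live in this XML namespace, so queries must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(sitemap_urls(SITEMAP_XML))
```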

T

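TLS Fingerprint: A fingerprint derived from the parameters a client sends in a TLS (HTTPS) handshake, such as cipher suites, extensions, and elliptic curves. Different HTTP clients produce distinct TLS fingerprints, so a request made by a scripting library can be told apart from one made by a real browser even when its User-Agent header claims otherwise.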

U

User Agent: An HTTP header string that identifies the browser and operating system making a request. Anti-bot systems inspect User-Agent headers; scrapers often set realistic User-Agent strings to avoid detection.

W

Web Archive: A snapshot of web content preserved at a specific point in time. Services like the Wayback Machine store historical snapshots. Some scraping use cases access archives instead of live sites.
Webhook: An HTTP callback triggered by a specific event. CrawlPilot's Business plan supports webhooks, allowing scraped data to be pushed automatically to external systems when a job completes.

X

XPath: XML Path Language — a query language for selecting nodes in an XML or HTML document using path expressions. An alternative to CSS selectors for element targeting in web scraping.
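
A minimal sketch of XPath-style element targeting, using Python's standard `xml.etree.ElementTree`. Note that ElementTree supports only a limited subset of XPath, and the markup here is an invented, well-formed sample so that an XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (hypothetical product listing).
DOC = """<html><body>
  <div class="product"><h2>Keyboard</h2><span class="price">49.99</span></div>
  <div class="product"><h2>Mouse</h2><span class="price">19.99</span></div>
  <div class="ad"><h2>Sponsored</h2></div>
</body></html>"""

root = ET.fromstring(DOC)

# ".//div[@class='product']" selects every <div> descendant whose class
# attribute equals "product"; "/h2" then steps to its <h2> children.
names = [h2.text for h2 in root.findall(".//div[@class='product']/h2")]

# The attribute predicate also works directly on the <span> elements.
prices = [float(s.text) for s in root.findall(".//span[@class='price']")]

print(list(zip(names, prices)))
```

The attribute predicate is what keeps the "Sponsored" heading out of the results, since its parent `div` has a different class.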
