List Extractor

The List Extractor pulls structured data from pages that display repeated items — product grids, article feeds, job listings, directory tables, and anything with a consistent repeating pattern.

It supports single-page extraction as well as multi-page collection via infinite scroll, "Load More" buttons, and classic next-page pagination.

When to Use It

  • A page shows 50 products and you want all their names, prices, and URLs
  • A blog lists 200 articles with titles and dates across 10 pages
  • A job board shows listings with infinite scroll
  • A directory table has rows you want as spreadsheet data

The 3-Step Wizard

Step 1 — Pick the Container

The container is the parent element that wraps all your list items.

  1. Click Pick Container.
  2. Hover over the page; elements highlight as you move.
  3. Find the element that surrounds all items (the grid wrapper, the <ul>, the <div class="results">). When you see all items grouped with a green outline, you're at the right level.
  4. Click to confirm.

CrawlPilot shows the detected CSS selector and the number of items found on the current page.

[!TIP] Use the ↑ Up and ↓ Down arrows in the picker to navigate up or down the DOM tree. Siblings highlight green to confirm how many items are captured at each level.

Step 2 — Pick One Item

  1. Click Pick Item.
  2. Hover over a single repeating unit (one product card, one article row, one table row).
  3. Click to confirm.

CrawlPilot detects all sibling items matching the same pattern and shows the count.
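
To make the idea of "siblings matching the same pattern" concrete, here is a minimal sketch in Python. It is not CrawlPilot's actual matching logic, which this page does not describe; it assumes, purely for illustration, that two elements match when they share a tag name and class list. The `signature` and `count_matching_siblings` helpers and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def signature(el):
    """Hypothetical pattern key: tag name plus the sorted class list."""
    classes = tuple(sorted(el.get("class", "").split()))
    return (el.tag, classes)

def count_matching_siblings(container, picked):
    """Count direct children of the container that share the picked item's signature."""
    target = signature(picked)
    return sum(1 for child in container if signature(child) == target)

# Sample markup (well-formed so the stdlib XML parser can read it).
markup = """
<div class="results">
  <div class="card">A</div>
  <div class="card">B</div>
  <div class="card">C</div>
  <p class="ad">Sponsored</p>
</div>
"""
container = ET.fromstring(markup)
picked = container[0]  # the item the user clicked
print(count_matching_siblings(container, picked))  # 3
```

The `<p class="ad">` element is excluded because its signature differs, which is the behavior you would want when a grid contains interleaved ads or separators.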

Step 3 — Review Schema

CrawlPilot analyzes the selected item and auto-generates columns for the fields it detects. Common auto-detected types:

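| Type  | Detected from                                   | Example value      |
|-------|-------------------------------------------------|--------------------|
| Title | Heading elements, bold text                     | "Running Shoes v2" |
| Price | Elements containing $ or other currency patterns | "$49.99"           |
| Image | <img> src attributes                            | product-image.jpg  |
| URL   | <a> href attributes                             | /products/shoes    |
| Text  | Any text node                                   | "In stock"         |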
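
The detection rules in the table can be approximated with a few simple checks. The sketch below is an illustration only, not CrawlPilot's real classifier: the `guess_type` function, the order of the checks, and the currency regex (including which symbols it accepts) are all assumptions.

```python
import re

# Assumed currency pattern: a currency symbol followed by a number.
PRICE_RE = re.compile(r"[$€£¥]\s?\d[\d,]*(?:\.\d+)?")

HEADING_OR_BOLD = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "strong"}

def guess_type(tag: str, attrs: dict, text: str) -> str:
    """Guess a column type from an element's tag, attributes, and text."""
    if tag == "img" and "src" in attrs:
        return "Image"
    if tag == "a" and "href" in attrs:
        return "URL"
    if PRICE_RE.search(text):
        return "Price"
    if tag in HEADING_OR_BOLD:
        return "Title"
    return "Text"  # fallback: any other text node

print(guess_type("h3", {}, "Running Shoes v2"))    # Title
print(guess_type("span", {}, "$49.99"))            # Price
print(guess_type("span", {}, "In stock"))          # Text
```

When auto-detection guesses wrong, renaming the column or supplying a custom selector is the intended fix.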

You can:

  • Rename any column by clicking its label
  • Delete columns you don't need
  • Add a custom column with your own CSS selector

Pagination Settings

Choose how CrawlPilot collects data beyond the first page:

| Mode                        | Use when                                                                        |
|-----------------------------|---------------------------------------------------------------------------------|
| None                        | Single page only                                                                |
| Auto-scroll — Infinite Feed | Twitter, LinkedIn, Instagram-style feeds that load dynamically as you scroll    |
| Auto-scroll — Static List   | Lists where items appear in index order as you scroll                           |
| Pagination — Next Button    | Classic "Next ›" or "Page 2" navigation                                         |
| Load More Button            | A "Show More" or "Load More" button that expands the list in place              |

For button-based modes, CrawlPilot asks you to click the button once on the page so it can identify it.

Speed: Controls the delay between scroll cycles or page clicks. Slower speeds are more reliable on JavaScript-heavy sites, where content takes longer to render.

Max Pages: Maximum number of pages or scroll cycles to process. Set to 0 for unlimited (use with caution on very large sites).
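
The interaction between Speed and Max Pages can be pictured as a simple loop. The following is a sketch under stated assumptions, not CrawlPilot's implementation: `collect_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, and the "site" is simulated in memory. It shows how a limit of 0 means "keep going until there is no next page".

```python
import time
from typing import Callable, List, Optional

def collect_pages(fetch_page: Callable[[int], Optional[List[dict]]],
                  max_pages: int = 0,
                  delay_seconds: float = 0.0) -> List[dict]:
    """Collect rows page by page.

    fetch_page(n) returns the rows on page n, or None when there is no such page.
    max_pages == 0 means unlimited; delay_seconds plays the role of the Speed setting.
    """
    rows: List[dict] = []
    page = 1
    while max_pages == 0 or page <= max_pages:
        batch = fetch_page(page)
        if batch is None:  # no next page: stop even when unlimited
            break
        rows.extend(batch)
        page += 1
        time.sleep(delay_seconds)  # wait between page clicks or scroll cycles
    return rows

def fake_fetch(page: int) -> Optional[List[dict]]:
    """Simulated site with 5 pages of 10 items each."""
    if page > 5:
        return None
    return [{"id": (page - 1) * 10 + i} for i in range(10)]

print(len(collect_pages(fake_fetch, max_pages=3)))  # 30
print(len(collect_pages(fake_fetch, max_pages=0)))  # 50
```

On a feed that never runs out of content, the unlimited setting has no natural stopping point, which is why a finite Max Pages is the safer default on very large sites.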

Running the Extraction

Click Start Extraction. The panel shows:

  • A live item count updating as data is collected
  • A progress bar for paginated extractions
  • A Stop button to halt early — data collected so far is saved

After Extraction

When complete:

  • Click View Data to open the full data table in a new tab
  • Go to History to manage this job, re-run it, or delete it

Handling Duplicate Rows

CrawlPilot automatically deduplicates rows at two levels:

  1. In-memory: During the scroll session, duplicate rows are dropped before they reach storage.
  2. Database: A hash-based unique constraint ensures no true duplicates are ever stored.
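
The two levels can be illustrated with a short Python sketch. CrawlPilot's storage engine, hash function, and schema are not documented here, so everything below is an assumption: SQLite stands in for the database, SHA-256 over canonical JSON stands in for the row hash, and the `items` table, `row_hash`, and `store_rows` are hypothetical names.

```python
import hashlib
import json
import sqlite3

def row_hash(row: dict) -> str:
    """Hash a row's canonical JSON form so identical rows get identical keys."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def store_rows(conn: sqlite3.Connection, rows: list, seen: set) -> int:
    """Store rows with two-level deduplication; return the total stored row count."""
    for row in rows:
        key = row_hash(row)
        if key in seen:  # level 1: in-memory check for the current session
            continue
        seen.add(key)
        # Level 2: the primary key on row_hash makes the database skip repeats.
        conn.execute(
            "INSERT OR IGNORE INTO items (row_hash, data) VALUES (?, ?)",
            (key, json.dumps(row, sort_keys=True)),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (row_hash TEXT PRIMARY KEY, data TEXT NOT NULL)")

# Session 1: the repeated "A" row is dropped in memory.
first = store_rows(conn, [{"title": "A"}, {"title": "B"}, {"title": "A"}], seen=set())
# Session 2 (fresh in-memory set): "B" passes level 1 but is ignored by the database.
second = store_rows(conn, [{"title": "B"}, {"title": "C"}], seen=set())
print(first, second)  # 2 3
```

The in-memory set avoids needless writes during a long scroll session, while the database constraint catches repeats across separate runs.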

Example: Scraping an E-commerce Product Grid

Goal: Collect title, author, price, and product URL for 200 books from an online bookstore.

  1. Open the bookstore's category page.
  2. Open CrawlPilot → List Extractor.
  3. Pick Container: hover over <div class="book-grid"> (all books highlight together) and click to confirm.
  4. Pick Item: hover over one <div class="book-card"> and click to confirm.
  5. Schema auto-detects: Title (h3), Author (span.author), Price (span.price), URL (a href).
  6. Set pagination to Pagination — Next Button, then click the "Next ›" button on the page so CrawlPilot can identify it.
  7. Set Max Pages to 20, Speed to Medium.
  8. Click Start Extraction and watch the count climb to 200.
  9. Click View Data → Export CSV.

Selector Editor

For advanced users, every auto-detected selector can be manually edited:

  1. Click the pencil icon next to any column in the schema table.
  2. Enter a CSS selector or XPath expression.
  3. CrawlPilot validates the selector against the current page and shows a match count.

XPath example for a specific attribute:

//div[@class='price']/@data-price
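
To see what this expression selects, the sketch below evaluates the equivalent query with Python's standard library. This is an illustration outside CrawlPilot, and the sample markup is made up. Note that `xml.etree.ElementTree` supports only a subset of XPath and cannot return attributes directly (the trailing `/@data-price` step), so the code selects the matching elements first and then reads the attribute.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup, well-formed so the stdlib XML parser can read it.
markup = """
<div>
  <div class="price" data-price="49.99">$49.99</div>
  <div class="price" data-price="12.50">$12.50</div>
  <div class="title">Running Shoes v2</div>
</div>
"""
root = ET.fromstring(markup)

# Step 1: elements matching //div[@class='price'].
# Step 2: read data-price from each, mirroring the /@data-price step.
prices = [el.get("data-price") for el in root.findall(".//div[@class='price']")]
print(prices)  # ['49.99', '12.50']
```

The predicate `@class='price'` is an exact string comparison, so an element with `class="price sale"` would not match. In XPath 1.0, `contains(@class, 'price')` is a looser alternative, and a CSS selector such as `div.price` matches on individual class names.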