Skip to content

Scraper Module

Public Summary

Automated web scraping pipeline using Puppeteer with concurrent tabs, market discovery, price collection, geocoding, and post-scrape embedding generation. Runs on a twice-weekly cron schedule.

Internal Details

Files

FileRole
scraper.service.jsOrchestrator: launches browser, dispatches scrapers
scraper.cron.jsCron schedule definition
scraper.registry.jsPlugin registry for market scrapers
geocoder.service.jsNominatim geocoding with caching
vero.scraper.jsVero market scraper
ramstore.scraper.jsRamstore scraper
stokomak.scraper.jsStokomak scraper
kam.scraper.jsKAM scraper
superkitgo.scraper.jsSuperKitGo scraper
kipper.scraper.jsKipper scraper
utils/Shared scraping utilities

Scripts

CommandDescription
npm run scrapeRun all scrapers
npm run scrape:<chain>Run a single chain scraper
npm run scrape:db:wipeWipe all scrape-related data
npm run scrape:db:wipe:<chain>Wipe a single chain's data (markets, products, embeddings)

Cron Schedule

  • When: Monday and Thursday at 03:00
  • Concurrency: 2 tabs in production, 4 in development

Pipeline

Strategy + Registry Pattern

Each market scraper implements a common interface and registers itself in the scraper registry. The orchestrator iterates the registry without knowing scraper internals.

js
// Each scraper exports: { name, scrape(page, deps) }
registry.register(veroScraper);
registry.register(ramstoreScraper);
// ...

Geocoding

  • Primary: Nominatim API lookup by market address.
  • Fallback: City center coordinates if address lookup fails.
  • Caching: Coordinates are cached to avoid redundant API calls.

Performance Optimizations

  • Request interception: blocks images, CSS, fonts during scraping.
  • Navigation timeout: 90 seconds per page.
  • Batch processing: concurrent tab pool limits resource usage.

Dependencies

  • Product, Market, MarketProduct modules (upsert data)
  • Image module (market images)
  • Search module (post-scrape embedding generation)
  • Feature Flag module (conditional behavior)
  • Geocoder service (Nominatim)

Chain-Specific Wipe

The wipe script (wipe-db-scrape-data.js) supports two modes:

  • Full wipe (npm run scrape:db:wipe): deletes all chains, markets, market_products, products, and product_embeddings.
  • Per-chain wipe (npm run scrape:db:wipe:<chain>): deletes only the target chain's markets and market_products, removes orphaned products and embeddings not referenced by other chains, then deletes the chain record.

Source Anchors

PathRelevance
apps/server/src/modules/scraper/Service, cron, registry, geocoder, market scrapers
apps/server/src/scripts/wipe-db-scrape-data.jsFull and per-chain data wipe

Failure Modes

FailureBehavior
Scraper page timeoutSkip market, log error, continue
Geocoding failureUse city center fallback
Embedding generation failureProducts saved without embeddings
Browser crashCron retries on next scheduled run

Obrok engineering documentation.