Scraper Module

Public Summary

Automated web scraping pipeline using Puppeteer with concurrent tabs, market discovery, price collection, geocoding, and post-scrape embedding generation. Runs on a twice-weekly cron schedule.

Internal Details

Files

| File | Role |
| --- | --- |
| scraper.service.js | Orchestrator: launches browser, dispatches scrapers |
| scraper.cron.js | Cron schedule definition |
| scraper.registry.js | Plugin registry for market scrapers |
| geocoder.service.js | Nominatim geocoding with caching |
| vero.scraper.js | Vero market scraper |
| ramstore.scraper.js | Ramstore scraper |
| stokomak.scraper.js | Stokomak scraper |
| kam.scraper.js | KAM scraper |
| superkitgo.scraper.js | SuperKitGo scraper |
| kipper.scraper.js | Kipper scraper |
| utils/ | Shared scraping utilities |

Cron Schedule

  • When: Monday and Thursday at 03:00
  • Concurrency: 2 tabs in production, 4 in development
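The schedule and tab counts above can be written down as a small configuration sketch. The constant and helper names here (`SCRAPE_CRON`, `tabConcurrency`, `isScheduledRun`) are hypothetical and not taken from `scraper.cron.js`. The sketch also assumes that every non-production environment uses the development value and that the schedule is evaluated in the server's local time zone, neither of which is stated above.

```javascript
// Standard 5-field cron expression: minute hour day-of-month month day-of-week.
// "0 3 * * 1,4" fires at 03:00 on Monday (1) and Thursday (4).
const SCRAPE_CRON = '0 3 * * 1,4';

// Hypothetical helper: number of concurrent tabs for an environment name.
// Assumption: anything other than "production" is treated as development.
function tabConcurrency(nodeEnv) {
  return nodeEnv === 'production' ? 2 : 4;
}

// Hypothetical helper: does a Date (in local time) fall on a scheduled minute?
// Handy for checking the cron expression by hand.
function isScheduledRun(date) {
  const day = date.getDay(); // 0 = Sunday, 1 = Monday, ..., 4 = Thursday
  const isScheduledDay = day === 1 || day === 4;
  return isScheduledDay && date.getHours() === 3 && date.getMinutes() === 0;
}
```

Keeping the expression in a named constant makes the schedule easy to cross-check against this page when either one changes.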

Pipeline

Strategy + Registry Pattern

Each market scraper implements a common interface and registers itself in the scraper registry. The orchestrator iterates the registry without knowing scraper internals.

```js
// Each scraper exports: { name, scrape(page, deps) }
registry.register(veroScraper);
registry.register(ramstoreScraper);
// ...
```
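For readers unfamiliar with the pattern, a registry of this kind might look roughly like the following. This is a minimal sketch, not the contents of `scraper.registry.js`: `createRegistry`, `list`, and `runAll` are hypothetical names, and only the `{ name, scrape(page, deps) }` shape comes from the snippet above.

```javascript
// Hypothetical registry: stores scrapers by name and hands them back in
// registration order.
function createRegistry() {
  const scrapers = new Map();
  return {
    register(scraper) {
      const valid = scraper
        && typeof scraper.name === 'string'
        && typeof scraper.scrape === 'function';
      if (!valid) {
        throw new TypeError('A scraper must export { name, scrape(page, deps) }');
      }
      if (scrapers.has(scraper.name)) {
        throw new Error(`Scraper "${scraper.name}" is already registered`);
      }
      scrapers.set(scraper.name, scraper);
    },
    list() {
      return [...scrapers.values()];
    },
  };
}

// Hypothetical orchestrator loop: it only relies on the shared interface,
// never on what an individual scraper does internally.
async function runAll(registry, page, deps) {
  const results = {};
  for (const scraper of registry.list()) {
    results[scraper.name] = await scraper.scrape(page, deps);
  }
  return results;
}
```

Under this design, adding a new market means writing one scraper file and one `register` call; the orchestrator itself does not change.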

Geocoding

  • Primary: Nominatim API lookup by market address.
  • Fallback: City center coordinates if address lookup fails.
  • Caching: Coordinates are cached to avoid redundant API calls.
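The lookup, fallback, and caching steps above can be sketched as follows. This is an illustration under assumptions, not the code in `geocoder.service.js`: `createGeocoder`, `nominatimLookup`, the `cityCenters` map, the in-memory `Map` cache, and the `User-Agent` value are all hypothetical. The sketch caches only successful lookups so that a transient Nominatim outage does not pin a market to the city center; the actual service may behave differently. `nominatimLookup` relies on the global `fetch` available in Node 18 and later.

```javascript
// Hypothetical geocoder factory. `lookup` resolves a query string to
// { lat, lng } or null; `cityCenters` maps a city name to fallback coordinates.
function createGeocoder({ lookup, cityCenters }) {
  const cache = new Map(); // assumption: in-memory cache keyed by address + city

  return async function geocode(address, city) {
    const key = `${address}|${city}`.toLowerCase();
    if (cache.has(key)) return cache.get(key);

    let coords = null;
    try {
      coords = await lookup(`${address}, ${city}`);
    } catch (err) {
      coords = null; // treat network or API errors like "not found"
    }

    if (coords) {
      const result = { ...coords, source: 'nominatim' };
      cache.set(key, result); // only successful lookups are cached
      return result;
    }

    const fallback = cityCenters[city];
    return fallback ? { ...fallback, source: 'city-center' } : null;
  };
}

// Lookup against the public Nominatim search endpoint, which returns a JSON
// array whose entries carry `lat` and `lon` as strings.
async function nominatimLookup(query) {
  const url = 'https://nominatim.openstreetmap.org/search'
    + `?format=json&limit=1&q=${encodeURIComponent(query)}`;
  const res = await fetch(url, {
    headers: { 'User-Agent': 'scraper-docs-example' }, // placeholder identifier
  });
  if (!res.ok) throw new Error(`Nominatim responded with ${res.status}`);
  const [hit] = await res.json();
  return hit ? { lat: Number(hit.lat), lng: Number(hit.lon) } : null;
}
```

Passing `lookup` in as a parameter keeps the cache and fallback logic testable without network access; in production it would be wired as `createGeocoder({ lookup: nominatimLookup, cityCenters })`.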

Performance Optimizations

  • Request interception: blocks images, CSS, fonts during scraping.
  • Navigation timeout: 90 seconds per page.
  • Batch processing: concurrent tab pool limits resource usage.
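The three optimizations above can be sketched as two helpers. The Puppeteer calls used (`page.setDefaultNavigationTimeout`, `page.setRequestInterception`, `request.resourceType`, `request.abort`, `request.continue`) are part of Puppeteer's public API, but the helper names `configurePage` and `runWithConcurrency`, and the choice of a worker-style pool, are assumptions for illustration rather than the module's actual code.

```javascript
// Resource types to block; Puppeteer reports CSS as "stylesheet".
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet', 'font']);
const NAVIGATION_TIMEOUT_MS = 90_000; // 90 seconds per page

// Hypothetical helper: apply the timeout and request blocking to one tab.
async function configurePage(page) {
  page.setDefaultNavigationTimeout(NAVIGATION_TIMEOUT_MS);
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}

// Hypothetical pool: runs async task functions with at most `limit` in
// flight, returning results in the original task order.
async function runWithConcurrency(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next task before awaiting
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Each worker here would correspond to one browser tab, so `limit` maps directly onto the tab counts listed under Cron Schedule. Note that a rejected task rejects the whole pool in this sketch; per-market error handling would wrap each task.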

Dependencies

  • Product, Market, MarketProduct modules (upsert data)
  • Image module (market images)
  • Search module (post-scrape embedding generation)
  • Feature Flag module (conditional behavior)
  • Geocoder service (Nominatim)

Source Anchors

| Path | Relevance |
| --- | --- |
| apps/server/src/modules/scraper/ | Service, cron, registry, geocoder, market scrapers |

Failure Modes

| Failure | Behavior |
| --- | --- |
| Scraper page timeout | Skip market, log error, continue |
| Geocoding failure | Use city center fallback |
| Embedding generation failure | Products saved without embeddings |
| Browser crash | Cron retries on next scheduled run |
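The "skip, log, continue" and "saved without embeddings" behaviors listed above amount to isolating each step in its own `try`/`catch`. The sketch below illustrates that idea only; `scrapeAllMarkets`, `deps.saveProducts`, `deps.generateEmbeddings`, and the shape of the returned summary are hypothetical and do not describe `scraper.service.js`.

```javascript
// Hypothetical orchestration loop showing per-market failure isolation.
async function scrapeAllMarkets(scrapers, deps, logger = console) {
  const summary = { succeeded: [], skipped: [], embeddingsGenerated: true };

  for (const scraper of scrapers) {
    try {
      const products = await scraper.scrape(deps.page, deps);
      await deps.saveProducts(scraper.name, products);
      summary.succeeded.push(scraper.name);
    } catch (err) {
      // A timeout or any other scraper error skips this market only.
      logger.error(`[scraper] ${scraper.name} failed, skipping: ${err.message}`);
      summary.skipped.push(scraper.name);
    }
  }

  try {
    await deps.generateEmbeddings();
  } catch (err) {
    // Products are already saved at this point; only embeddings are missing.
    logger.error(`[scraper] embedding generation failed: ${err.message}`);
    summary.embeddingsGenerated = false;
  }

  return summary;
}
```

A browser crash is deliberately not handled here: per the table, recovery in that case is left to the next scheduled cron run.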

Student Obrok engineering documentation.