One failed login attempt, somewhere in an extraction run spanning a thousand e-commerce sites, triggers a retry. The retry hits a rate limit. The rate limit triggers a proxy rotation. The new proxy gets a different session state. The session state mismatch triggers another retry. Within ninety seconds, you've gone from one authentication failure to three hundred cascading errors across unrelated extraction jobs.
This is the operational reality that Jiří Moravčík, a senior engineer at Apify, has been solving through Crawlee—an open-source web scraping library that reached Python v1.0 in September 2025. His work reveals something most people miss about web extraction at scale: the hard part lives in cascade effects that turn single failures into systemic breakdowns when you're running thousands of simultaneous jobs.
Infrastructure That Survives Production
Moravčík's background at Apify, where the platform handles extraction across millions of web pages daily, shaped Crawlee's defaults in specific ways. At Prague Crawl 2025, he presented on datasets for training AI models—work that requires extraction infrastructure to handle schema drift and authentication complexity as first-class operational problems, not edge cases you'll address later.
The interesting signal in Crawlee isn't what it can do. It's what ships as defaults:
- Browser fingerprint rotation
- Request queuing that persists to disk
- Automatic proxy rotation
- Session management that ties proxies to browser contexts
These defaults exist because they're the infrastructure you need from day one if you want extraction jobs to survive production. Not features you enable when you're ready to scale.
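Wiring those defaults up takes very little code. The sketch below shows the rough shape of a Crawlee for Python crawler with proxy rotation configured; the proxy URLs are placeholders, and exact import paths and parameter names can differ between Crawlee versions, so read it as an illustration of how little you configure by hand rather than a drop-in script.

```python
import asyncio

# Import paths follow the Crawlee for Python docs; they may differ by version.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Placeholder proxies; Crawlee rotates through them and ties them to sessions.
    proxies = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.example.com:8000',
            'http://proxy-2.example.com:8000',
        ],
    )

    crawler = PlaywrightCrawler(
        proxy_configuration=proxies,
        max_requests_per_crawl=100,  # safety cap for the example
    )

    @crawler.router.default_handler
    async def handle(context: PlaywrightCrawlingContext) -> None:
        # Extracted items land in the default dataset; pending requests live in
        # a request queue persisted to local storage, so a crash is resumable.
        await context.push_data({
            'url': context.request.url,
            'title': await context.page.title(),
        })
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

Note what is absent: no retry loop, no proxy bookkeeping, no queue-persistence code. Those are the defaults the list above is pointing at.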
Request queuing that persists to disk solves a specific problem. At scale, you're seeing authentication failures on 2-3% of requests across thousands of sites—dozens of errors every minute. When a job crashes, you need to resume exactly where you left off, with session state intact, without corrupting downstream data pipelines. This determines whether your system runs overnight or requires an on-call engineer.
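To make the resume requirement concrete, here is a small self-contained sketch of a disk-backed request queue using SQLite. It is not Crawlee's implementation, and the table layout and function names are invented for illustration; the property it demonstrates is the one that matters, namely that a crashed process can restart and pick up exactly the requests that were never marked handled.

```python
import sqlite3

# Hypothetical, simplified disk-backed request queue. Crawlee's real storage
# layer is more involved; this only illustrates the crash-and-resume property.

DB_PATH = 'request_queue.db'


def open_queue(path: str = DB_PATH) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS requests ('
        '  url TEXT PRIMARY KEY,'
        '  session_id TEXT,'      # session/proxy binding survives restarts too
        '  handled INTEGER DEFAULT 0'
        ')'
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str, session_id: str) -> None:
    # INSERT OR IGNORE keeps the queue idempotent across retries and re-runs.
    conn.execute(
        'INSERT OR IGNORE INTO requests (url, session_id) VALUES (?, ?)',
        (url, session_id),
    )
    conn.commit()


def next_pending(conn: sqlite3.Connection) -> tuple[str, str] | None:
    # Anything not yet marked handled is still pending, including work that
    # was in flight when a previous run crashed.
    return conn.execute(
        'SELECT url, session_id FROM requests WHERE handled = 0 LIMIT 1'
    ).fetchone()


def mark_handled(conn: sqlite3.Connection, url: str) -> None:
    # Flip the flag only after downstream writes succeed, so a crash in
    # between re-runs the request instead of silently dropping it.
    conn.execute('UPDATE requests SET handled = 1 WHERE url = ?', (url,))
    conn.commit()
```

On restart, the worker loop simply calls next_pending() until it returns None: nothing already handled is re-fetched, and nothing that was in flight is lost.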
We've encountered this exact challenge at TinyFish when building web agent infrastructure: authentication failures don't fail gracefully. They multiply across concurrent sessions, turning one timeout into hundreds of blocked requests.
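One common containment pattern is a per-site circuit breaker: once a site's recent failure rate crosses a threshold, stop issuing requests to it for a cool-down period instead of letting every concurrent worker rediscover the breakage through its own retries. The sketch below is a generic illustration of that pattern, not something Crawlee exposes under these names, and the window, threshold, and cool-down values are arbitrary.

```python
import time
from collections import defaultdict, deque

# Illustrative per-site circuit breaker; names and thresholds are arbitrary.

WINDOW = 50                # look at the last N request outcomes per site
FAILURE_THRESHOLD = 0.5    # open the breaker above 50% failures
COOL_DOWN_SECONDS = 300    # park the site for 5 minutes


class SiteBreaker:
    def __init__(self) -> None:
        self._outcomes: dict[str, deque[bool]] = defaultdict(lambda: deque(maxlen=WINDOW))
        self._open_until: dict[str, float] = {}

    def allow(self, site: str) -> bool:
        """Return False while the site's breaker is open (still cooling down)."""
        return time.monotonic() >= self._open_until.get(site, 0.0)

    def record(self, site: str, ok: bool) -> None:
        """Record one request outcome and open the breaker if failures spike."""
        outcomes = self._outcomes[site]
        outcomes.append(ok)
        failures = outcomes.count(False)
        if len(outcomes) >= WINDOW and failures / len(outcomes) >= FAILURE_THRESHOLD:
            # One site's broken login stops consuming the fleet's retry budget.
            self._open_until[site] = time.monotonic() + COOL_DOWN_SECONDS
            outcomes.clear()
```

Workers check allow() before dequeuing a request for a site, so a site whose logins started failing gets parked for the cool-down instead of generating hundreds of blocked retries.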
Selector changes carry the same multiplying risk: a single change doesn't just break one site's extraction, it can silently corrupt stored data if validation doesn't catch it.
When One Failure Becomes Three Hundred
The session management architecture in Crawlee—tying proxies to browser contexts so sites see consistent "users"—solves a problem that keeps infrastructure engineers up at night. When session state breaks mid-extraction, sites see the same "user" with different IPs and browser fingerprints. This triggers fraud detection that blocks legitimate extraction jobs, creates phantom users in analytics, and makes it impossible to extract personalized content reliably.
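The invariant is easy to state: a logical session always presents the same proxy and fingerprint, and when any part of that identity breaks, the whole session is retired rather than patched. Here is a hypothetical sketch of that binding; it is not Crawlee's internal model, and the class and field names are made up.

```python
import itertools
import uuid
from dataclasses import dataclass, field

# Hypothetical session model: a session is a fixed (proxy, fingerprint, cookies)
# identity. Retries reuse it wholesale; a broken session is retired, never
# partially rebuilt with a fresh proxy or fingerprint.


@dataclass
class Session:
    proxy_url: str
    fingerprint: str                     # e.g. a serialized browser profile
    cookies: dict[str, str] = field(default_factory=dict)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    blocked: bool = False


class SessionPool:
    def __init__(self, proxy_urls: list[str], fingerprints: list[str]) -> None:
        self._identities = itertools.cycle(zip(proxy_urls, fingerprints))
        self._by_site: dict[str, Session] = {}

    def session_for(self, site: str) -> Session:
        """Return the site's bound session so retries present the same 'user'."""
        session = self._by_site.get(site)
        if session is None or session.blocked:
            proxy_url, fingerprint = next(self._identities)
            session = Session(proxy_url=proxy_url, fingerprint=fingerprint)
            self._by_site[site] = session
        return session

    def retire(self, session: Session) -> None:
        """On a block or a state mismatch, drop the whole identity at once."""
        session.blocked = True
```

The design choice worth copying is the retirement rule: swapping only the proxy or only the fingerprint is exactly what produces the "same user, different IP" signature that trips fraud detection.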
At production scale, infrastructure decisions compound: failure modes that are invisible on a handful of sites become constant, overlapping incidents across thousands.
Without automated validation—schema checks, null detection, periodic diff-based QA—a single selector change can corrupt terabytes of stored data. But validation only works if your extraction infrastructure can isolate failures, maintain session state across retries, and resume jobs without data loss.
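A workable baseline needs no heavy tooling. Below is a hedged sketch of the three checks named above, schema, null rate, and a diff against the previous run; the field names and thresholds are arbitrary placeholders, and in practice they would be tuned per site.

```python
# Illustrative post-extraction validation: schema check, null-rate check, and
# a drift check against the previous run. Field names and thresholds are
# arbitrary placeholders.

REQUIRED_FIELDS = {'url': str, 'title': str, 'price': float}
MAX_NULL_RATE = 0.05   # flag a field if more than 5% of records are missing it
MAX_DRIFT = 0.30       # flag if row count swings more than 30% vs. the last run


def check_schema(records: list[dict]) -> list[str]:
    errors = []
    for i, record in enumerate(records):
        for name, expected in REQUIRED_FIELDS.items():
            value = record.get(name)
            if value is not None and not isinstance(value, expected):
                errors.append(
                    f'record {i}: {name} is {type(value).__name__}, expected {expected.__name__}'
                )
    return errors


def check_null_rates(records: list[dict]) -> list[str]:
    errors = []
    total = len(records) or 1
    for name in REQUIRED_FIELDS:
        nulls = sum(1 for r in records if r.get(name) is None)
        if nulls / total > MAX_NULL_RATE:
            # A silently broken selector usually shows up here first.
            errors.append(f'{name}: {nulls}/{total} records missing')
    return errors


def check_drift(records: list[dict], previous_count: int) -> list[str]:
    if previous_count == 0:
        return []
    drift = abs(len(records) - previous_count) / previous_count
    return [f'row count drifted {drift:.0%} vs. previous run'] if drift > MAX_DRIFT else []
```

Whatever these checks flag should quarantine the batch before it reaches the warehouse; the point is that validation gates the write path rather than merely logging.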
Why This Matters
The infrastructure choices Moravčík has made in Crawlee—what gets built into the core versus what's left as extensions—reveal what actually breaks when you're running extraction across thousands of sites simultaneously. Parsing logic and AI-powered selectors matter less than systems that contain cascade failures, maintain enough operational context to resume gracefully, and keep downstream pipelines intact.
When authentication breaks at 3 AM, you need infrastructure that isolates the cascade, maintains session state across retries, and keeps your data pipelines running until morning. That's the operational reality Moravčík has encoded into Crawlee's defaults—and what enterprise teams miss when they're evaluating extraction infrastructure in demos rather than production.
Things to follow up on...
- OxyCon 2025 presentations: Fred de Villamil from NielsenIQ Digital Shelf presented at OxyCon 2025 on scaling e-commerce data extraction to handle over 10 billion products per day, revealing infrastructure approaches for massive-scale product data operations.
- Scrapy-Playwright integration work: Core Scrapy maintainers developed scrapy-playwright to provide production-grade browser automation that handles JavaScript-heavy pages while maintaining Scrapy's workflow patterns for request scheduling and item processing.
- Schema drift detection systems: PromptCloud's analysis shows that automated validation with schema checks and periodic diff-based QA is essential because a single selector change can silently corrupt terabytes of stored data at production scale.
- Cost management at scale: Zyte's 2025 industry report emphasizes that LLM-powered extraction tools require careful cost optimization as infrastructure expenses remain a significant variable whether using cloud providers or self-hosting models.

