ZERO HARD-CODED SELECTORS! Complete dynamic pattern detection system for universal e-commerce extraction.
"Never hard-code selectors. Always detect patterns dynamically from DOM structure."
┌───────────────────────────────────────────────────────┐
│ EXTRACTION PIPELINE (Cascade)                         │
│                                                       │
│ 1. LLM-Guided Extractor                               │
│    ├─ LLM analyzes DOM samples                        │
│    ├─ Proposes container selector                     │
│    └─ ❌ Often fails (generic suggestions)            │
│                                                       │
│ 2. Dynamic Detector (Python)                          │
│    ├─ Finds 100 "signal" elements (prices)            │
│    ├─ Analyzes parent structures                      │
│    ├─ Forms clusters by similarity                    │
│    └─ ⚠️ May filter everything if too strict          │
│                                                       │
│ 3. Iterative Extractor (JavaScript) ✅ MAIN WORKER    │
│    ├─ Quick page check (prices, links, structure)     │
│    ├─ Dynamic container detection:                    │
│    │   • Find all elements with prices                │
│    │   • Analyze parents 1-3 levels up                │
│    │   • Count repeating patterns                     │
│    │   • Filter valid CSS class names                 │
│    │   • Score candidates (see below)                 │
│    ├─ Field detection (name, price, url)              │
│    ├─ Data extraction                                 │
│    ├─ Price filtering                                 │
│    └─ ✅ Most reliable extractor                      │
│                                                       │
│ 4. BQL Orchestrator                                   │
│    ├─ Query-based extraction                          │
│    └─ Uses LLM for DOM analysis                       │
│                                                       │
│ 5. Extraction Orchestrator                            │
│    ├─ LLM-guided navigation & extraction              │
│    └─ Form filling if needed                          │
│                                                       │
│ 6. Standard Planner (Fallback)                        │
│    ├─ Multi-step navigation                           │
│    ├─ Scrolling & clicking                            │
│    └─ Calls products.extract repeatedly               │
└───────────────────────────────────────────────────────┘
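The cascade in the diagram can be read as "try each extractor in order and stop at the first one that returns products". The sketch below illustrates that control flow only. The names `run_cascade`, `Extractor`, and the three stand-in stages are hypothetical and are not part of `curllm_core`; how the real pipeline decides that a stage has failed is an assumption here.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical type: a stage takes the instruction and returns a list of products.
Extractor = Callable[[str], Awaitable[list]]


async def run_cascade(
    instruction: str, stages: list[tuple[str, Extractor]]
) -> tuple[Optional[str], list]:
    """Try each stage in order and return the first non-empty result."""
    for name, extractor in stages:
        try:
            products = await extractor(instruction)
        except Exception:
            # Assumption: a crashing stage should not abort the whole cascade.
            continue
        if products:
            return name, products
    return None, []


# Stand-in stages for demonstration only.
async def llm_guided(instruction: str) -> list:
    return []  # simulates a generic, unusable selector suggestion


async def dynamic_detector(instruction: str) -> list:
    raise RuntimeError("filters were too strict")


async def iterative(instruction: str) -> list:
    return [{"name": "Laptop", "price": 899.0}]


STAGES = [
    ("llm_guided", llm_guided),
    ("dynamic_detector", dynamic_detector),
    ("iterative", iterative),
]


def demo() -> tuple[Optional[str], list]:
    return asyncio.run(run_cascade("Find products", STAGES))
```

Calling `demo()` falls through the first two stages and returns the result of the third, which mirrors the role the diagram gives to the iterative extractor.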
// 1. FIND SIGNAL ELEMENTS (prices)
const priceRegex = /\d+[\s,.]?\d*[\s,.]?\d{2}\s*(?:zł|PLN|€|\$)/i;
const signalElements = [...document.querySelectorAll('*')]
  .filter(el => priceRegex.test(el.innerText || ''));

// 2. ANALYZE PARENTS
const candidates = [];
for (const signal of signalElements) {
  let parent = signal;
  for (let depth = 0; depth < 3; depth++) {
    parent = parent.parentElement; // Climb one level per iteration
    if (!parent) break;
    // className is an SVGAnimatedString on SVG elements, not a plain string
    const classNameStr = typeof parent.className === 'string'
      ? parent.className
      : (parent.className?.baseVal || '');
    // Extract valid CSS classes only
    const classes = classNameStr
      .split(/\s+/)
      .filter(c => /^[a-zA-Z][a-zA-Z0-9_-]*$/.test(c)); // NO invalid chars!
    if (classes.length === 0) continue; // Skip elements without usable classes
    const selector = parent.tagName.toLowerCase() + '.' + classes[0];
    const count = document.querySelectorAll(selector).length;
    if (count >= 5) { // Must repeat
      candidates.push({
        selector,
        count,
        className: classes[0],          // Kept for the layout-class check below
        text: parent.innerText || '',   // Kept for the text-quality check below
        specificity: classes.length,
        has_price: true, // Guaranteed (we started from price)
        has_link: !!parent.querySelector('a[href]'),
        has_image: !!parent.querySelector('img')
      });
    }
  }
}

// 3. SCORE CANDIDATES
// Both lists are abbreviated here.
const utilityClasses = ['container', 'row', 'col', 'wrapper', 'inner', 'main' /* , ... */];
const tailwindPrefixes = ['mt-', 'mb-', 'p-', 'flex', 'grid', 'border-' /* , ... */];
for (const c of candidates) {
  let score = 0;
  // SPECIFICITY (most important!)
  if (c.specificity >= 3) score += 50;
  else if (c.specificity >= 2) score += 35;
  else if (c.specificity >= 1) score += 20;
  // PENALTY for generic layout classes
  // (matching rule assumed: exact utility name or Tailwind-style prefix)
  const isLayoutClass = utilityClasses.includes(c.className)
    || tailwindPrefixes.some(prefix => c.className.startsWith(prefix));
  if (isLayoutClass) score -= 30; // Heavy penalty!
  // SIZE (reduced importance)
  score += Math.min(c.count / 50, 1) * 15;
  // STRUCTURE
  score += 25; // Has price (guaranteed)
  score += c.has_link ? 20 : 0;
  score += c.has_image ? 15 : 0;
  // TEXT QUALITY
  const hasProductKeywords = /laptop|phone|notebook/.test(c.text);
  const hasSpecs = /\d+GB|\d+GHz|Core|Ryzen/.test(c.text);
  const hasMarketing = /okazja|promocja|rabat/.test(c.text);
  if (hasProductKeywords) score += 15;
  if (hasSpecs) score += 20;
  if (hasMarketing) score -= 15; // Penalty!
  // COMPLETE STRUCTURE bonus
  if (c.has_price && c.has_link && c.has_image) score += 10;
  c.score = score;
}

// 4. SELECT WINNER
candidates.sort((a, b) => b.score - a.score);
return candidates[0]; // Highest score wins!
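To make the scoring rules easier to check by hand, here is a minimal Python port of the scoring step as a pure function. It is a sketch, not the project's implementation: the `Candidate` dataclass and the `score` function are hypothetical names, the two class lists contain only the entries visible above, and the layout check (exact utility name or prefix match on the first class) is an assumption.

```python
import re
from dataclasses import dataclass

# Only the entries shown in the document; the real lists are longer.
UTILITY_CLASSES = {"container", "row", "col", "wrapper", "inner", "main"}
TAILWIND_PREFIXES = ("mt-", "mb-", "p-", "flex", "grid", "border-")


@dataclass
class Candidate:
    selector: str
    class_name: str   # first valid class of the container
    count: int        # how many elements match the selector
    specificity: int  # number of valid classes on the container
    has_link: bool
    has_image: bool
    text: str


def score(c: Candidate) -> float:
    s = 0.0
    # Specificity
    if c.specificity >= 3:
        s += 50
    elif c.specificity >= 2:
        s += 35
    elif c.specificity >= 1:
        s += 20
    # Layout penalty (assumed matching rule)
    if c.class_name in UTILITY_CLASSES or c.class_name.startswith(TAILWIND_PREFIXES):
        s -= 30
    # Size, capped at 15 points once 50 repetitions are reached
    s += min(c.count / 50, 1) * 15
    # Structure: the price is guaranteed, link and image are optional
    s += 25
    s += 20 if c.has_link else 0
    s += 15 if c.has_image else 0
    # Text quality
    if re.search(r"laptop|phone|notebook", c.text):
        s += 15
    if re.search(r"\d+GB|\d+GHz|Core|Ryzen", c.text):
        s += 20
    if re.search(r"okazja|promocja|rabat", c.text):
        s -= 15
    # Complete structure bonus
    if c.has_link and c.has_image:
        s += 10
    return s


product = Candidate("li.product", "product", 49, 13, True, True, "laptop 16GB")
layout = Candidate("div.container", "container", 20, 2, True, True, "promocja")
```

Under these assumptions `score(product)` comes out at about 169.7 and `score(layout)` at 66, so a specific product container beats a generic layout wrapper by roughly 100 points even when both contain a price, a link, and an image.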
Good Product Container:
li.product (e-commerce)
+ specificity(13 classes): 50
- layout penalty: 0
+ count(49): 15
+ structure: 60
+ keywords("laptop"): 15
+ specs("16GB"): 20
+ complete: 10
────────────────
= 170 points
Generic Layout Container:
div.container (layout)
+ specificity(2 classes): 35
- layout penalty: -30 ← KEY!
+ count(20): 6
+ structure: 60
- marketing: -15
+ complete: 10
────────────────
= 66 points
// Only allow valid CSS class names
/^[a-zA-Z][a-zA-Z0-9_-]*$/
// ❌ REJECT:
// !mt-4, xl:grid, @apply, #id, $var, [attr]
// ✅ ACCEPT:
// product-box, cat-prod-row, item123
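The same validation rule can be exercised outside the browser. The sketch below applies the pattern above to a raw `class` attribute string in Python; `valid_classes` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as above: a letter first, then letters, digits, "_" or "-".
VALID_CLASS = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]*")


def valid_classes(class_attr: str) -> list[str]:
    """Return only the class tokens that are safe to use in a CSS selector."""
    return [c for c in class_attr.split() if VALID_CLASS.fullmatch(c)]


sample = "!mt-4 xl:grid product-box @apply cat-prod-row #id $var [attr] item123"
kept = valid_classes(sample)
```

Here `kept` contains only `product-box`, `cat-prod-row`, and `item123`. Tokens such as `xl:grid` are dropped because using them in a selector without escaping would produce invalid CSS.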
// Skip if no classes and generic tag
if (classes.length === 0 && ['div', 'span', 'article', 'section', 'li'].includes(tag))
continue;
if (count < 5) continue; // Must repeat at least 5 times
if (!has_price) continue; // Products must have prices!
// Handle SVGAnimatedString (not plain string)
const classNameStr = typeof parent.className === 'string'
? parent.className
: (parent.className?.baseVal || '');
| Site | Container Found | Products | Specificity | Score |
|---|---|---|---|---|
| Komputronik | div.border-transparent | 15 | 5 | 115.1 |
| Skapiec | div.product-box-wide-d | 3 | 1 | 90.0 |
| Ceneo | div.cat-prod-row | 14 | 3 | 96.0 |
| Balta | li.product | 32 | 13 | 169.7 |
| Lidl | div.odsc-tile | 10 | 4 | 124.5 |
curllm_core/
├── extraction_registry.py          ← NEW: Transparency & tracking
├── iterative_extractor.py          ← MAIN: Dynamic detection
├── dynamic_detector.py             ← Python-based pattern detection
├── llm_guided_extractor.py         ← LLM-based container selection
├── extraction_orchestrator.py      ← High-level orchestration
├── bql_extraction_orchestrator.py
└── extraction.py                   ← Legacy (redirects to new system)
from curllm_core.iterative_extractor import IterativeExtractor
extractor = IterativeExtractor(page, run_logger)
result = await extractor.extract(
    instruction="Find products under 950zł",
page_type="product_listing" # or "single_product" or None (auto-detect)
)
print(f"Found: {len(result['products'])} products")
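An instruction such as "Find products under 950zł" implies a price filter after extraction. The sketch below shows one way that step could work. It is an illustration under stated assumptions, not `curllm_core`'s actual logic: `parse_max_price` and `filter_by_price` are hypothetical helpers, products are assumed to be dicts with a numeric `price` key, and only the keywords "under" and "below" are recognised.

```python
import re
from typing import Optional

# Matches e.g. "under 950zł", "below 1200 PLN", "under 99,99 €".
_LIMIT = re.compile(
    r"(?:under|below)\s+(\d+(?:[.,]\d+)?)\s*(?:zł|PLN|€|\$)?",
    re.IGNORECASE,
)


def parse_max_price(instruction: str) -> Optional[float]:
    """Return the price ceiling mentioned in the instruction, if any."""
    match = _LIMIT.search(instruction)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


def filter_by_price(products: list[dict], max_price: Optional[float]) -> list[dict]:
    """Keep products strictly cheaper than max_price; keep all if no limit."""
    if max_price is None:
        return list(products)
    return [
        p for p in products
        if p.get("price") is not None and p["price"] < max_price
    ]


products = [
    {"name": "A", "price": 899.0},
    {"name": "B", "price": 950.0},
    {"name": "C", "price": 1200.0},
    {"name": "D", "price": None},
]
limit = parse_max_price("Find products under 950zł")
cheap = filter_by_price(products, limit)
```

With this input `limit` is `950.0` and `cheap` contains only product A, since "under" is treated as a strict comparison and products without a parsed price are dropped.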
from curllm_core.extraction_registry import ExtractionPipeline, ExtractorStatus, ExtractorType
pipeline = ExtractionPipeline("Find products under 950zł", page.url)
# Try extractor
attempt = pipeline.start_attempt(ExtractorType.ITERATIVE)
# ... run extraction ...
attempt.add_detected_selector("div.product-box", 95.0, 3, 50, {...})
attempt.set_chosen_selector("div.product-box", "Highest score", {...})
attempt.set_result(ExtractorStatus.SUCCESS, products_found=50)
# Generate report
report = pipeline.get_transparency_report()
pipeline.print_transparency_log()
products = await quick_extract_products(
page,
    container_selector=".product-box",  # ❌ NEVER DO THIS!
name_selector="h3",
price_selector=".price"
)
extractor = IterativeExtractor(page)
result = await extractor.extract(instruction="Find products")
products = result['products']
# Selectors detected automatically!
Built with ❤️ for universal web extraction without configuration.