DOM Tree: 100KB → LLM (7-10s) → Decision
└─ If products.heuristics returns 0, you don't know why
Problems:
┌─────────────────────────────────────────────┐
│ Step 1: Quick Page Check (~100ms) │
│ Fast JS: Has prices? Product links? Count? │
│ Decision: Continue or skip │
└──────────────┬──────────────────────────────┘
│ ✅ Product page
v
┌─────────────────────────────────────────────┐
│ Step 2: Container Detection (~200ms) │
│ Find pattern: .product-box? article? │
│ Return: Best selector + count │
└──────────────┬──────────────────────────────┘
│ ✅ Found containers
v
┌─────────────────────────────────────────────┐
│ Step 3: Field Location Detection (~150ms) │
│ Analyze FIRST container: Where is name? │
│ Where is price? Where is URL? │
└──────────────┬──────────────────────────────┘
│ ✅ Fields mapped
v
┌─────────────────────────────────────────────┐
│ Step 4: Data Extraction (~300ms) │
│ Extract using discovered strategy │
│ Return: Clean product data │
└─────────────────────────────────────────────┘
Total: ~750ms (vs 7-12s!)
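The staged flow in the diagram can be sketched in Python as a list of steps where any step may stop the run early. This is only an illustration of the control flow: `run_pipeline`, the stage names, and the stub lambdas are hypothetical and are not taken from `curllm_core`; the real steps run JavaScript in the page.

```python
from typing import Any, Callable, Optional

# A stage receives the context gathered so far and returns new findings,
# or None to signal "stop here". All names below are hypothetical.
Stage = Callable[[dict[str, Any]], Optional[dict[str, Any]]]


def run_pipeline(stages: list[tuple[str, Stage]]) -> dict[str, Any]:
    """Run stages in order and exit at the first one that returns None."""
    context: dict[str, Any] = {}
    decisions: list[str] = []
    for name, stage in stages:
        findings = stage(context)
        if findings is None:
            decisions.append(f"{name}: stop")
            return {"products": [], "reason": f"Stopped at {name}", "decisions": decisions}
        decisions.append(f"{name}: continue")
        context.update(findings)
    return {"products": context.get("products", []), "reason": "Success", "decisions": decisions}


# Stub stages standing in for the four in-page checks.
stages = [
    ("page_check", lambda ctx: {"page_type": "product_listing"}),
    ("container_detection", lambda ctx: {"selector": ".product-box"}),
    ("field_detection", lambda ctx: {"fields": {"name": "h3", "price": "span.price"}}),
    ("extraction", lambda ctx: {"products": [{"name": "Odkurzacz ABC", "price": 149.99}]}),
]
result = run_pipeline(stages)
```

Because each stage only sees a small context dictionary, a failed run records exactly which step stopped it.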
# Iterative Extractor is enabled by default
curllm --stealth "https://ceneo.pl/..." -d "Find products under 150zł"
from curllm_core.iterative_extractor import iterative_extract
result = await iterative_extract(
instruction="Find products under 150zł",
page=page,
llm=llm,
run_logger=logger
)
# Result structure:
{
"products": [
{"name": "...", "price": 149.99, "url": "..."},
...
],
"count": 10,
"reason": "Success",
"metadata": {
"checks_performed": [...],
"decisions": [...],
"extraction_strategy": {
"container_selector": ".product-box",
"fields": {...}
}
}
}
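A result shaped like the structure above can be post-processed in plain Python. The sketch below filters products by price, matching the "under 150zł" instruction; `products_under` and `sample_result` are hypothetical names used only for illustration, and the sketch assumes `price` is already numeric as in the structure shown.

```python
def products_under(result: dict, max_price: float) -> list[dict]:
    """Return products whose numeric price is strictly below max_price."""
    return [
        product
        for product in result.get("products", [])
        if isinstance(product.get("price"), (int, float)) and product["price"] < max_price
    ]


# Sample data in the documented shape (values are made up).
sample_result = {
    "products": [
        {"name": "Odkurzacz ABC", "price": 149.99, "url": "https://example.com/a"},
        {"name": "Robot DEF", "price": 199.00, "url": "https://example.com/b"},
        {"name": "No price", "price": None, "url": "https://example.com/c"},
    ],
    "count": 3,
    "reason": "Success",
}
cheap = products_under(sample_result, 150)
```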
Goal: Quickly determine whether the page contains products
JavaScript (~100ms):
{
has_prices: true,
price_count: 45,
has_product_links: true,
product_link_count: 38,
page_type: 'product_listing'
}
Decision: Continue? (YES/NO)
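The exact rule behind the Step 1 decision is not spelled out here, so the following is only a guess at its shape: continue when the page looks like a product listing and shows both prices and product links. `should_continue` is a hypothetical helper, and the second reason string is invented for the example.

```python
def should_continue(indicators: dict) -> tuple[bool, str]:
    """Decide whether to proceed to container detection (assumed rule)."""
    if indicators.get("page_type") != "product_listing":
        return False, "Page type not suitable"
    if not indicators.get("has_prices") or not indicators.get("has_product_links"):
        # Hypothetical reason text, not taken from the real logs.
        return False, "No prices or product links found"
    return True, "Product page"


listing = {
    "has_prices": True,
    "price_count": 45,
    "has_product_links": True,
    "product_link_count": 38,
    "page_type": "product_listing",
}
homepage = {"has_prices": False, "price_count": 0, "page_type": "other"}
```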
Goal: Find the product container pattern
JavaScript (~200ms):
{
best: {
selector: ".product-box",
count: 38,
has_link: true,
has_price: true,
has_image: true
}
}
Decision: Which selector to use for extraction
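One plausible way to choose the "best" selector is to rank candidates by how many product features they contain and break ties by how often the pattern repeats. The scoring below is an assumption made for illustration; `pick_best_container` is a hypothetical name and the real heuristic may weigh things differently.

```python
from typing import Optional


def pick_best_container(candidates: list[dict]) -> Optional[dict]:
    """Pick the candidate with the most product features, then the highest count."""

    def score(candidate: dict) -> tuple[int, int]:
        features = sum(
            bool(candidate.get(key)) for key in ("has_link", "has_price", "has_image")
        )
        return features, candidate.get("count", 0)

    # A pattern that occurs only once is not a repeated container (assumed cutoff).
    viable = [c for c in candidates if c.get("count", 0) > 1]
    return max(viable, key=score) if viable else None


candidates = [
    {"selector": "article", "count": 60, "has_link": True, "has_price": False, "has_image": False},
    {"selector": ".product-box", "count": 38, "has_link": True, "has_price": True, "has_image": True},
]
best = pick_best_container(candidates)
```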
Goal: Map where the data sits inside a container
JavaScript (~150ms) - analyzes ONLY the first container:
{
fields: {
name: {
selector: "h3.product-name",
sample: "Odkurzacz ABC"
},
price: {
selector: "span.price",
sample: "149.99 zł",
value: 149.99
},
url: {
selector: "a[href]",
sample: "https://ceneo.pl/12345"
}
},
completeness: 1.0 // 100% of fields found
}
Decision: Do we have enough data? (completeness >= 0.5)
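The completeness score reads as the fraction of expected fields that were located. A minimal sketch, assuming the expected fields are `name`, `price`, and `url` with equal weight (the real field list and weighting are not stated here):

```python
EXPECTED_FIELDS = ("name", "price", "url")  # assumed field set
MIN_COMPLETENESS = 0.5  # threshold from the decision rule


def completeness(fields: dict) -> float:
    """Fraction of expected fields for which a selector was found."""
    found = sum(1 for key in EXPECTED_FIELDS if fields.get(key))
    return found / len(EXPECTED_FIELDS)


all_fields = {
    "name": {"selector": "h3.product-name"},
    "price": {"selector": "span.price"},
    "url": {"selector": "a[href]"},
}
only_name = {"name": {"selector": "h3.product-name"}}
```

Under these assumptions, finding two of three fields (about 0.67) passes the threshold, while finding one (about 0.33) does not.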
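Goal: Extract the data using the discovered strategy
JavaScript (~300ms) - uses the strategy from Step 3:
// containers: the elements matched by the Step 2 selector
const products = [];
containers.forEach(container => {
  // Optional chaining avoids a TypeError when a field is missing
  const name = container.querySelector("h3.product-name")?.innerText.trim();
  const price = parseFloat(container.querySelector("span.price")?.innerText.replace(",", "."));
  const url = container.querySelector("a[href]")?.href;
  // Skip containers without a name or a parsable price
  if (name && !Number.isNaN(price)) products.push({name, price, url});
});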
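Polish shops often print prices as `149,99 zł` or `1 299,00 zł`, which a plain float conversion does not handle. If prices are normalized on the Python side, a parser could look like the sketch below. `parse_price` is a hypothetical helper, not part of `curllm_core`, and it does not handle a dot used as a thousands separator (`1.299,00`).

```python
import re
from typing import Optional

# Digits, optionally grouped by whitespace, with an optional 1-2 digit decimal part.
# In Python 3, \s also matches the non-breaking space often used in prices.
_PRICE_RE = re.compile(r"\d[\d\s]*(?:[.,]\d{1,2})?")


def parse_price(text: str) -> Optional[float]:
    """Extract the first price-like number from text, or None if there is none."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    digits = re.sub(r"\s", "", match.group(0)).replace(",", ".")
    return float(digits)
```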
Every step logs its details:
🔄 ═══ ITERATIVE EXTRACTOR ═══
🔍 Step 1: Quick Page Check
Running fast indicators check...
{
"has_prices": true,
"price_count": 45,
"has_product_links": true,
"product_link_count": 38,
"page_type": "product_listing"
}
🔍 Step 2: Container Structure Detection
Looking for product_listing containers...
{
"found": true,
"best": {
"selector": ".product-box",
"count": 38,
"has_link": true,
"has_price": true
}
}
🔍 Step 3: Field Location Detection
Analyzing fields in .product-box...
{
"found": true,
"fields": {
"name": {"selector": "h3.product-name", "sample": "Odkurzacz ABC"},
"price": {"selector": "span.price", "value": 149.99},
"url": {"selector": "a[href]", "sample": "https://ceneo.pl/12345"}
},
"completeness": 1.0
}
🔍 Step 4: Data Extraction
Extracting up to 50 items using strategy...
{
"count": 38,
"sample": [
{"name": "Odkurzacz ABC", "price": 149.99, "url": "..."},
{"name": "Mop XYZ", "price": 139.00, "url": "..."},
{"name": "Robot DEF", "price": 145.50, "url": "..."}
]
}
✅ Iterative Extractor succeeded - found 38 items
| Metric | Old (Full DOM) | New (Iterative) | Improvement |
|---|---|---|---|
| Time | 7-12s | 0.5-1s | 10-20x ⚡ |
| Prompt size | 100KB | 1-2KB | 50-100x 📉 |
| Debugging | ❌ None | ✅ Full | ∞ 🔍 |
| Early exit | ❌ No | ✅ Yes | Smart 🧠 |
| LLM calls | 1 large | 0 (pure JS!) | 0 cost 💰 |
Key difference: Iterative Extractor uses pure JavaScript - the LLM is NOT used!
# Enable/disable
CURLLM_ITERATIVE_EXTRACTOR=true # Default: true
# Max items to extract
CURLLM_ITERATIVE_EXTRACTOR_MAX_ITEMS=50 # Default: 50
# Fast atomic extraction (enabled by default)
CURLLM_ITERATIVE_EXTRACTOR=true
CURLLM_ITERATIVE_EXTRACTOR_MAX_ITEMS=50
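For scripts that need to honor the same settings, the two variables can be read with the documented defaults as in the sketch below. `load_extractor_settings` is a hypothetical helper, and the set of spellings accepted as "true" is an assumption; how curllm itself parses these values is not shown here.

```python
import os
from typing import Mapping

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed accepted spellings


def load_extractor_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the Iterative Extractor settings, falling back to the documented defaults."""
    enabled_raw = env.get("CURLLM_ITERATIVE_EXTRACTOR", "true")
    max_items_raw = env.get("CURLLM_ITERATIVE_EXTRACTOR_MAX_ITEMS", "50")
    return {
        "enabled": enabled_raw.strip().lower() in _TRUE_VALUES,
        "max_items": int(max_items_raw),
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to test.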
Diagnosis: Step 2 found no containers
Solutions:
await page.wait_for_timeout(3000)
Diagnosis: Step 3 found no fields (completeness < 0.5)
Solutions:
Diagnosis: Step 4 extraction failed
Solutions:
The system tries, in order:
1. Iterative Extractor (fastest) ← NEW!
└─ Success? → Return
└─ Fail? ↓
2. BQL Orchestrator (structured)
└─ Success? → Return
└─ Fail? ↓
3. Extraction Orchestrator (LLM-guided)
└─ Success? → Return
└─ Fail? ↓
4. Standard Planner (full context)
└─ Last resort
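The fallback order above amounts to trying each extractor in turn and returning the first non-empty result. A simplified synchronous sketch follows; `extract_with_fallback` and the stub lambdas are hypothetical, and the real orchestration is likely asynchronous.

```python
from typing import Callable, Optional

Extractor = Callable[[], Optional[dict]]


def extract_with_fallback(
    extractors: list[tuple[str, Extractor]],
) -> tuple[Optional[str], Optional[dict]]:
    """Return (name, result) from the first extractor that yields products."""
    for name, extractor in extractors:
        try:
            result = extractor()
        except Exception:
            # A crashing extractor counts as a failure; move on to the next one.
            continue
        if result and result.get("products"):
            return name, result
    return None, None


# Stubs standing in for the four strategies, in the documented order.
chain = [
    ("iterative_extractor", lambda: {"products": []}),
    ("bql_orchestrator", lambda: None),
    ("extraction_orchestrator", lambda: {"products": [{"name": "Mop XYZ", "price": 139.0}]}),
    ("standard_planner", lambda: {"products": [{"name": "unused"}]}),
]
winner, data = extract_with_fallback(chain)
```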
# See exactly what happened
cat logs/run-*.md | grep "Iterative Extractor"
# Check execution times
grep "fn:.*_ms" logs/run-*.md
result = await iterative_extract(...)
# Find out which strategy it used
print(result["metadata"]["extraction_strategy"])
# Ceneo
curllm --stealth "https://ceneo.pl/..." -d "Find products"
# Allegro
curllm --stealth "https://allegro.pl/..." -d "Find products"
# Custom
curllm --stealth "https://your-site.com/..." -d "Find products"
curllm --stealth "https://www.ceneo.pl/Telefony_komorkowe" -d "Find all smartphones under 2000zł"
Log Output:
🔄 Iterative Extractor enabled - trying atomic DOM queries
🔍 Step 1: Quick Page Check
page_type: product_listing, price_count: 89
🔍 Step 2: Container Detection
Found 89 containers with .product-box
🔍 Step 3: Field Detection
completeness: 1.0 (all fields found)
🔍 Step 4: Data Extraction
Extracted 89 products
✅ Iterative Extractor succeeded - found 89 items
curllm --stealth "https://www.ceneo.pl/" -d "Find products"
Log Output:
🔄 Iterative Extractor enabled
🔍 Step 1: Quick Page Check
page_type: other, price_count: 0
⚠️ Iterative Extractor returned no data: Page type not suitable
Result: A fast exit with no time wasted!
Iterative Extractor is a game changer for product extraction:
Enabled by default - just use curllm and enjoy the speed! 🚀