curllm

LLM-DSL URL Resolution Architecture

Status: 🔄 IN PROGRESS | Last Updated: 2025-12-08

Overview

This document describes the LLM-DSL architecture for dynamic URL resolution and element finding. The system replaces hardcoded keywords with LLM-driven semantic analysis.

Architecture Diagram

┌──────────────────────────────────────────────────────────────────────────┐
│                    LLM-DSL URL RESOLUTION                                │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User Instruction: "Znajdź formularz kontaktowy" (Find the contact form) │
│                          │                                               │
│                          ▼                                               │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │  1. GoalDetector (LLM-first)                                     │    │
│  │     ├── LLM semantic analysis → TaskGoal.FIND_CONTACT_FORM       │    │
│  │     └── Statistical fallback (NO hardcoded keywords)             │    │
│  └──────────────────────────────────────────────────────────────────┘    │
│                          │                                               │
│                          ▼                                               │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │  2. UrlResolver.find_url_for_goal()                              │    │
│  │     ├── _find_url_with_llm()     ← LLM semantic                  │    │
│  │     ├── dom_helpers.find_link()  ← Statistical word-overlap      │    │
│  │     └── _legacy_fallback()       ← Pattern matching (deprecated) │    │
│  └──────────────────────────────────────────────────────────────────┘    │
│                          │                                               │
│                          ▼                                               │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │  3. LLMElementFinder / LLMSelectorGenerator                      │    │
│  │     ├── find_element(purpose="contact form")                     │    │
│  │     ├── generate_field_selector(purpose="email input")           │    │
│  │     └── generate_consent_selector()                              │    │
│  └──────────────────────────────────────────────────────────────────┘    │
│                          │                                               │
│                          ▼                                               │
│  Result: ResolvedUrl(url="/kontakt", method="llm", confidence=0.9)       │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

Key Components

1. Goal Detection (goal_detector_llm/)

BEFORE (Hardcoded):

# ❌ Hardcoded keyword lists
if 'kontakt' in instruction or 'contact' in instruction:
    return TaskGoal.FIND_CONTACT_FORM

AFTER (LLM-DSL):

# ✅ LLM semantic analysis
result = await llm.aquery(f"""
Analyze this instruction and determine the user's goal:
"{instruction}"

Goals: FIND_CART, FIND_LOGIN, FIND_CONTACT_FORM, FIND_RETURNS, ...
Return only the goal name.
""")

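The LLM reply is free-form text, so it has to be mapped back onto the enum before the resolver can use it. The following is a minimal sketch of that step, not curllm's actual code: the `TaskGoal` members are the four named in the prompt above, while the `UNKNOWN` member and the `parse_goal` helper are hypothetical names introduced for illustration.

```python
import re
from enum import Enum


class TaskGoal(Enum):
    # Members taken from the prompt above; UNKNOWN is a hypothetical sentinel.
    FIND_CART = "FIND_CART"
    FIND_LOGIN = "FIND_LOGIN"
    FIND_CONTACT_FORM = "FIND_CONTACT_FORM"
    FIND_RETURNS = "FIND_RETURNS"
    UNKNOWN = "UNKNOWN"


def parse_goal(reply: str) -> TaskGoal:
    """Map a raw LLM reply onto TaskGoal, tolerating extra text and casing."""
    # Scan identifier-like tokens so replies such as
    # "Goal: find_contact_form." still resolve to an enum member.
    for token in re.findall(r"[A-Za-z_]+", reply.upper()):
        if token in TaskGoal.__members__:
            return TaskGoal[token]
    # Anything unrecognized is reported explicitly instead of guessed.
    return TaskGoal.UNKNOWN
```

Returning an explicit `UNKNOWN` lets the caller decide whether to retry the LLM or drop to the statistical fallback.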
2. URL Resolution (url_resolution/resolver.py)

Strategy Hierarchy:

  1. LLM Analysis - Semantic understanding of page links
  2. Statistical Analysis - Word overlap scoring (no hardcoded keywords)
  3. Structural Analysis - DOM patterns (nav, footer, header)
  4. Legacy Fallback - Pattern matching (being deprecated)
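
The hierarchy above amounts to a fallback chain: try each strategy in priority order and stop at the first usable result. A minimal sketch under stated assumptions follows. `ResolvedUrl` uses the three fields shown in the diagram; `resolve_with_fallback`, the `min_confidence` threshold, and the two stand-in strategies are hypothetical and only illustrate the control flow.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Sequence


@dataclass
class ResolvedUrl:
    url: str
    method: str
    confidence: float


# A strategy takes the goal name and returns a result, or None if it found nothing.
Strategy = Callable[[str], Awaitable[Optional[ResolvedUrl]]]


async def resolve_with_fallback(
    goal: str,
    strategies: Sequence[Strategy],
    min_confidence: float = 0.5,  # hypothetical threshold
) -> Optional[ResolvedUrl]:
    """Try each strategy in priority order; return the first confident hit."""
    for strategy in strategies:
        try:
            result = await strategy(goal)
        except Exception:
            # A failing strategy (e.g. an LLM timeout) falls through to the next one.
            continue
        if result is not None and result.confidence >= min_confidence:
            return result
    return None


async def llm_strategy(goal: str) -> Optional[ResolvedUrl]:
    raise TimeoutError("LLM unavailable")  # stand-in for a failed LLM call


async def statistical_strategy(goal: str) -> Optional[ResolvedUrl]:
    return ResolvedUrl(url="/kontakt", method="statistical", confidence=0.7)


result = asyncio.run(
    resolve_with_fallback("FIND_CONTACT_FORM", [llm_strategy, statistical_strategy])
)
```

Here the LLM stand-in raises, so the chain returns the statistical result with `method="statistical"`.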

3. Element Finding (llm_dsl/element_finder.py)

# Find element by PURPOSE, not selector
finder = LLMElementFinder(page=page, llm=llm)
result = await finder.find_element(purpose="contact form submit button")

# Result:
# ElementMatch(
#     found=True,
#     selector="form.contact button[type='submit']",
#     confidence=0.9,
#     method="llm"
# )

4. Selector Generation (llm_dsl/selector_generator.py)

# Generate selector dynamically
generator = LLMSelectorGenerator(llm=llm)
result = await generator.generate_field_selector(
    page=page,
    purpose="email input field"
)

# Result:
# GeneratedSelector(
#     selector="input[type='email'], input[name*='mail']",
#     confidence=0.85,
#     method="llm"
# )

Single Source of Truth

All semantic concepts are defined in ONE location:

curllm_core/
├── url_types.py              # TaskGoal enum (goals only)
├── llm_dsl/
│   ├── concepts.py           # FIELD_CONCEPTS (NEW - single source)
│   ├── selector_generator.py # Uses concepts from above
│   └── element_finder.py     # Uses concepts from above
└── form_fill/
    └── js_scripts.py         # generate_field_concepts_with_llm()

Migration from Hardcoded

Keywords → LLM Analysis

| Component      | Before                      | After                                      |
|----------------|-----------------------------|--------------------------------------------|
| Goal detection | if 'kontakt' in text        | llm.analyze_intent(text)                   |
| URL finding    | url_patterns = ['/kontakt'] | llm.find_url_for_purpose(purpose)          |
| Selector       | '#email, .email-input'      | generator.generate_field_selector('email') |
| Form fields    | ['email', 'mail', 'adres']  | generate_field_concepts_with_llm(page)     |

Statistical Fallback (NO Hardcoded Keywords)

When the LLM is unavailable, the resolver falls back to statistical analysis:

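from typing import Optional

async def _find_link_statistical(page, goal: str) -> Optional[LinkInfo]:
    """
    Statistical word-overlap scoring.
    NO HARDCODED KEYWORDS - derives keywords from the goal name itself.
    """
    # Tokenize the goal name, e.g. "FIND_CONTACT_FORM" -> {"find", "contact", "form"}
    goal_words = set(goal.replace('_', ' ').lower().split())
    if not goal_words:
        return None  # nothing to match against; also avoids division by zero

    links = await page.evaluate("() => [...document.querySelectorAll('a')]...")

    for link in links:
        # Score = fraction of goal words found in the link text or URL path.
        # Assumes each link is a dict with 'text' and 'href' keys.
        link_words = set(link['text'].lower().split() + link['href'].lower().split('/'))
        score = len(goal_words & link_words) / len(goal_words)
        ...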
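
The scoring step can be tried without a browser. The sketch below is a self-contained illustration of word-overlap scoring, not the project's implementation: `tokenize`, `overlap_score`, `best_link`, the `threshold` default, and the dict shape of each link (`text` and `href` keys) are all assumptions made for this example.

```python
import re
from typing import Optional


def tokenize(text: str) -> set[str]:
    """Lowercase and split on any non-word character or underscore."""
    return {t for t in re.split(r"[\W_]+", text.lower()) if t}


def overlap_score(goal: str, link_text: str, href: str) -> float:
    """Fraction of goal words that appear in the link text or URL."""
    goal_words = tokenize(goal)
    if not goal_words:
        return 0.0  # avoid division by zero on an empty goal
    link_words = tokenize(link_text) | tokenize(href)
    return len(goal_words & link_words) / len(goal_words)


def best_link(goal: str, links: list[dict], threshold: float = 0.3) -> Optional[dict]:
    """Return the highest-scoring link, or None if nothing reaches the threshold."""
    scored = [(overlap_score(goal, l["text"], l["href"]), l) for l in links]
    score, link = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return link if score >= threshold else None


links = [
    {"text": "Cart", "href": "/cart"},
    {"text": "Contact us", "href": "/contact"},
]
match = best_link("FIND_CONTACT_FORM", links)
```

With these inputs, `match` is the `/contact` link, since one of the three goal words overlaps. Pure word overlap only works when the goal name and the link share vocabulary: `FIND_CONTACT_FORM` scores 0 against a link labeled "Kontakt", so cross-language cases would have to rely on the LLM or structural strategies.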

Files to Refactor

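| File                           | Issue              | Solution                        |
|--------------------------------|--------------------|---------------------------------|
| goal_detector_hybrid.py        | GOAL_KEYWORDS dict | Use LLM analysis, remove dict   |
| resolver.py::_extract_keywords | FILTER_WORDS set   | Use LLM to extract search terms |
| _find_link_keyword_fallback.py | goal_config dict   | Use statistical scoring         |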

Testing

# Run URL resolver examples
cd examples/url_resolver
python run_all.py

# Expected results after migration:
# - Goal detection: 90%+ (LLM semantic)
# - URL finding: 80%+ (LLM + statistical)
# - Form filling: 85%+ (LLM selectors)

References