# Revelys Seed Engine

Local CLI that:

1. Searches for official company websites for each `(industry, city)` pair.
2. Crawls each site to extract the BCE number, contact email, address, and images.
3. Calls a local LLM (Ollama) to structure "Revelys matching" fields.
4. Generates `out/seed_companies.sql`, ready to import through the Supabase SQL editor.

The functional spec is in `CDC_Revelys_Seed_Engine_ShadowPower_v1_3_2.md`.

## Setup

Requirements:

- Node.js 18+ (20+ recommended)
- Ollama running locally
- Supabase access if you enable `UPLOAD_IMAGES=1` or `VERIFY_BCE_WITH_KBO=1`
- Optional but recommended: a BCE CSV snapshot in `./bce` (`enterprise.csv`, `denomination.csv`, `contact.csv`, `code.csv`, `meta.csv`)

Install:

```bash
npm install
```

Configure:

- Copy `.env.example` to `.env` and fill values
- Edit `config/cities.json` and `config/industries.json`
- Put BCE snapshot files in `./bce` (or set `BCE_DIR`) to enable local legal enrichment and BCE cross-checks
- Optional: if you use `LOCATION_MATCH_MODE=province`, adjust `config/city_provinces.json`
- Recommended: set `SEARCH_PROVIDER=serper`, fill `SERPER_API_KEY`, and keep `BRAVE_API_KEY` as fallback

Pull the recommended Ollama model:

```bash
ollama pull qwen2.5:14b-instruct-q4_K_M
```

## Run

```bash
npm run seed
```

Outputs:

- `out/seed_companies.sql`
- `out/run_log.ndjson`
- `out/stats.json`

Notes:

- Change output folder with `OUT_DIR` (example: `OUT_DIR=out-test`).
- Resume mode (`RESUME=1`) reloads previous progress from the existing seed SQL and run logs.
- Use `LOG_LEVEL=debug` to see internal crawl pages.

## Search Pagination

Search is paginated and continues while a pair is incomplete:

- Candidates are loaded lazily (on demand), page by page, per query term.
- If the target is not reached, the engine fetches the next pages (`+10` style pagination for Serper/Brave) and keeps crawling.
- The search stops when the target is reached, no more eligible candidates exist, or `MAX_SITES_TO_TRY` is reached.

This works with slash-separated industries too (example: `Plombier / debouchage`): both core terms are queried and paginated.
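
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: `splitIndustryTerms`, `candidates`, `fillPair`, and the `searchPage` / `crawl` callbacks are hypothetical names, and the query format (`term city`) is an assumption.

```javascript
// Hypothetical helper: split "Plombier / debouchage" into its core query terms.
function splitIndustryTerms(industry) {
  return industry
    .split("/")
    .map((term) => term.trim())
    .filter((term) => term.length > 0);
}

// Lazily yield candidate URLs, page by page, for each query term.
// `searchPage(query, offset)` is an assumed async function that returns
// an array of result URLs (empty when the provider has no more results).
async function* candidates(industry, city, searchPage, pageSize = 10) {
  const seen = new Set();
  for (const term of splitIndustryTerms(industry)) {
    for (let offset = 0; ; offset += pageSize) {
      const urls = await searchPage(`${term} ${city}`, offset);
      if (urls.length === 0) break; // this term is exhausted
      for (const url of urls) {
        if (seen.has(url)) continue; // same site returned for another term
        seen.add(url);
        yield url;
      }
    }
  }
}

// Crawl candidates until the target is reached, candidates run out,
// or the attempt budget (MAX_SITES_TO_TRY in this README) is spent.
async function fillPair({ industry, city, target, maxSitesToTry, searchPage, crawl }) {
  const accepted = [];
  let tried = 0;
  if (target <= 0 || maxSitesToTry <= 0) return { accepted, tried };
  for await (const url of candidates(industry, city, searchPage)) {
    tried += 1;
    const company = await crawl(url); // assumed to return null when a site is rejected
    if (company) accepted.push(company);
    // Checking here avoids fetching another page once the pair is complete.
    if (accepted.length >= target || tried >= maxSitesToTry) break;
  }
  return { accepted, tried };
}
```

Because `candidates` is an async generator, a search page is requested only when the crawl loop asks for another URL, which is what keeps search API usage proportional to the number of sites actually tried.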

## Priority Re-crawl

To re-optimize previously found companies while minimizing new search API calls:

- Enable `PRIORITY_FROM_LOG=1`.
- Optionally set `PRIORITY_LOG_PATH` to one or more older `run_log.ndjson` paths (comma-separated).
- The engine queues historical `status=ok` sites first (per `(industry, city)`).
- Each priority site is revalidated with the same quality guards (non-official detection, BCE, geo, email, images, LLM checks).
- Only when the priority queue is exhausted does it continue with paginated search.
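
As an illustration of how such a priority queue could be built from older logs, here is a minimal sketch. The log field names (`status`, `industry`, `city`, `url`) and both function names are assumptions made for the example; the real `run_log.ndjson` layout may differ.

```javascript
// PRIORITY_LOG_PATH may hold several comma-separated paths.
function parseLogPaths(value) {
  return (value || "")
    .split(",")
    .map((p) => p.trim())
    .filter(Boolean);
}

// Build a per-(industry, city) queue of previously successful sites
// from the text content of one or more NDJSON logs.
// Field names (`status`, `industry`, `city`, `url`) are assumptions.
function buildPriorityQueues(ndjsonTexts) {
  const queues = new Map(); // key "industry|city" -> array of unique URLs
  for (const text of ndjsonTexts) {
    for (const line of text.split("\n")) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      let entry;
      try {
        entry = JSON.parse(trimmed);
      } catch {
        continue; // skip truncated or malformed lines
      }
      if (!entry || entry.status !== "ok" || !entry.url) continue;
      const key = `${entry.industry}|${entry.city}`;
      if (!queues.has(key)) queues.set(key, []);
      const queue = queues.get(key);
      if (!queue.includes(entry.url)) queue.push(entry.url);
    }
  }
  return queues;
}
```

In this sketch, each queued URL would still go through the same validation as a fresh candidate; the queue only changes the order in which sites are tried.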

## Schema Notes

- Generated SQL includes `services` (`text[]`).
- Optional pricing fields: `pricing_model`, `budget_level`, `price_indication`, `devis_gratuit`.
- Optional public/contact fields: `contact_name`, `contact_phone`, `public_phone`, `facebook`, `instagram`, `linkedin`, `tiktok`, `founder_name`, `founder_role`, `founder_photo_url`, `ideal_zone`, `languages`, `availability`, `opening_hours`, `google_rating`, `google_reviews_count`, `google_reviews`, `company_faq`.
- BCE fields: `bce_number`, `bce_status`, `bce_legal_name`, `founded_on`, `bce_type_of_enterprise`, `bce_juridical_form`, `bce_juridical_situation`, `bce_source`, `bce_verified_at`, `bce_last_checked_at`, `bce_source_update_date`.
- SEO fields: `seo_title`, `seo_description`, `og_title`, `og_description`, `seo_jsonld`, `seo_generated_at`, `seo_ai_used`, `seo_version`, `seo_last_inputs_hash`.
- SQL upsert: `ON CONFLICT (slug) DO UPDATE ... WHERE companies.is_claimed=false`.
- Inserts are guarded against duplicate `bce_number` values with `WHERE NOT EXISTS`.
- If you use a materialized public view (`companies_public`), refresh it after import.
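
To make the upsert shape concrete, the sketch below builds one such statement as a string. It is an assumption about how the two guards might combine, not the generator's actual output: `sqlLiteral` and `buildUpsert` are hypothetical helpers, only string and null values are handled, and the `slug <>` condition in the `NOT EXISTS` guard is a guess so that re-importing the same slug still reaches the `ON CONFLICT` branch.

```javascript
// Quote a value as a SQL string literal; null/undefined become NULL.
// Only strings are handled in this sketch (no arrays, numbers, or JSON).
function sqlLiteral(value) {
  if (value === null || value === undefined) return "NULL";
  return `'${String(value).replace(/'/g, "''")}'`;
}

// Build an INSERT ... ON CONFLICT statement for one company object.
// Column names come straight from the object's keys (assumed trusted).
function buildUpsert(company) {
  const columns = Object.keys(company);
  const values = columns.map((c) => sqlLiteral(company[c]));
  const updates = columns
    .filter((c) => c !== "slug")
    .map((c) => `${c} = EXCLUDED.${c}`);
  return [
    `INSERT INTO companies (${columns.join(", ")})`,
    `SELECT ${values.join(", ")}`,
    // Skip the insert when another row (different slug) already holds this BCE number.
    `WHERE NOT EXISTS (SELECT 1 FROM companies WHERE bce_number = ${sqlLiteral(company.bce_number)} AND slug <> ${sqlLiteral(company.slug)})`,
    `ON CONFLICT (slug) DO UPDATE SET ${updates.join(", ")}`,
    // Never overwrite a profile that its owner has claimed.
    `WHERE companies.is_claimed = false;`,
  ].join("\n");
}
```

For example, `buildUpsert({ slug: "acme-liege", name: "O'Neil", bce_number: "0123.456.789" })` yields a statement in which the name is escaped as `'O''Neil'`.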

## Quality Notes

- Cover and gallery images are deduplicated more strictly (URL canonicalization, plus image hash when uploads are enabled), so the cover image should not reappear in the gallery.
- BCE dedup is enforced during run and reinforced on resume by reloading BCE numbers from existing seed SQL.
- Non-official domain filtering and the blacklist were expanded to reduce crawl waste on directories and job boards.
- Social profile URLs are normalized and filtered (`share/privacy/help/login/checkpoint` paths are ignored).
- One-liner quality is stricter (template detection + company-name presence check).
- A publish gate can enforce premium profile constraints (`STRICT_PUBLISH_QUALITY_GATE=1`): valid address + credible contact + non-template one-liner.
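
The social profile filtering mentioned above could look roughly like this. It is a simplified sketch under stated assumptions: `normalizeSocialUrl` is a hypothetical name, the host list is inferred from the social fields in the schema, and dropping the query string is a simplification that would lose ID-based profile URLs.

```javascript
// Path segments named in this README as non-profile pages.
const IGNORED_SEGMENTS = new Set(["share", "privacy", "help", "login", "checkpoint"]);
// Hosts are an assumption, taken from the social fields in the schema.
const SOCIAL_HOSTS = new Set(["facebook.com", "instagram.com", "linkedin.com", "tiktok.com"]);

// Return a canonical profile URL, or null when the link is not a usable profile.
function normalizeSocialUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return null; // not an absolute URL
  }
  const host = url.hostname.toLowerCase().replace(/^(www|m)\./, "");
  if (!SOCIAL_HOSTS.has(host)) return null;
  const segments = url.pathname.split("/").filter(Boolean);
  if (segments.length === 0) return null; // bare homepage, not a profile
  if (segments.some((s) => IGNORED_SEGMENTS.has(s.toLowerCase()))) return null;
  // Drop query string, fragment, and trailing slash so equal profiles compare equal.
  return `https://${host}/${segments.join("/")}`;
}
```

Returning `null` for rejected links lets the caller filter a crawled link list with a single `map` followed by `filter(Boolean)`.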

## Testing

- `npm run seed:test` is a smoke run (same pipeline, reduced scope: 1 city, 1 industry, 1 target).
- It is not a full assertion test suite yet.
- Syntax checks:

```bash
node --check seed-revelys.js
node --check seed-revelys-test.js
```
