crawler/README.md
Edwin Eames 1592e6d685 refactor: replace fuel_scraper with newenglandoil + cheapestoil scrapers
- Add newenglandoil/ package as the primary scraper (replaces fuel_scraper)
- Add cheapestoil/ package as a secondary market price scraper
- Add app.py entry point for direct execution
- Update run.py: new scrape_cheapest(), migrate command, --state filter,
  --refresh-metadata flag for overwriting existing phone/URL data
- Update models.py with latest schema fields
- Update requirements.txt dependencies
- Update Dockerfile and docker-compose.yml for new structure
- Remove deprecated fuel_scraper module, test.py, and log file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-06 11:34:21 -05:00

# NewEnglandBio Fuel Price Crawler
Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).
## Tech Stack
- **Language:** Python 3.9+
- **HTTP:** requests + BeautifulSoup4
- **Database:** SQLAlchemy + psycopg2 (PostgreSQL)
- **Deployment:** Docker
## Project Structure
```
crawler/
├── run.py # CLI entry point (initdb / scrape)
├── database.py # SQLAlchemy engine and session config
├── models.py # ORM models (OilPrice, County, Company)
├── fuel_scraper.py # Legacy monolithic scraper (deprecated)
├── fuel_scraper/ # Modular package (use this)
│ ├── __init__.py # Exports main()
│ ├── config.py # Site configs, zone-to-county mappings, logging
│ ├── http_client.py # HTTP requests with browser User-Agent
│ ├── parsers.py # HTML table parsing for price extraction
│ ├── scraper.py # Main orchestrator
│ └── db_operations.py # Upsert logic for oil_prices table
├── test.py # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```
## URLs Scraped
The crawler hits these external websites to collect price data:
### NewEnglandOil.com (5 states)
**URL pattern:** `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Connecticut | zone1–zone10 | `https://www.newenglandoil.com/connecticut/zone1.asp?type=0` |
| Massachusetts | zone1–zone15 | `https://www.newenglandoil.com/massachusetts/zone1.asp?type=0` |
| New Hampshire | zone1–zone6 | `https://www.newenglandoil.com/newhampshire/zone1.asp?type=0` |
| Rhode Island | zone1–zone4 | `https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0` |
| Vermont | zone1–zone4 | `https://www.newenglandoil.com/vermont/zone1.asp?type=0` |
### MaineOil.com (1 state)
**URL pattern:** `https://www.maineoil.com/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Maine | zone1–zone7 | `https://www.maineoil.com/zone1.asp?type=0` |
**Total: 46 pages scraped per run** (10 + 15 + 6 + 4 + 4 + 7 zones).
Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.
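As a sketch, the fetch-and-parse step might look like the following. The zone counts come from the tables above and the browser User-Agent mirrors what `http_client.py` is described as doing; the table parser is a minimal guess at a three-column layout, and the live sites' actual HTML may require different selectors:

```python
import requests
from bs4 import BeautifulSoup

# Zone counts per state, taken from the tables above.
NEO_ZONES = {"connecticut": 10, "massachusetts": 15, "newhampshire": 6,
             "rhodeisland": 4, "vermont": 4}
HEADERS = {"User-Agent": "Mozilla/5.0"}  # browser UA, as http_client.py uses

def zone_urls():
    """Yield (state, zone, url) for all 46 pages."""
    for state, count in NEO_ZONES.items():
        for z in range(1, count + 1):
            yield state, z, f"https://www.newenglandoil.com/{state}/zone{z}.asp?type=0"
    for z in range(1, 8):  # Maine is served from its own domain
        yield "maine", z, f"https://www.maineoil.com/zone{z}.asp?type=0"

def parse_prices(html):
    """Extract (company, price, date) from rows whose second cell looks like a price."""
    rows = []
    for tr in BeautifulSoup(html, "html.parser").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3 and cells[1].startswith("$"):
            rows.append((cells[0], float(cells[1].lstrip("$")), cells[2]))
    return rows

def fetch(url):
    return requests.get(url, headers=HEADERS, timeout=30).text
```

Header rows (`<th>` cells) and non-price rows fall through the `startswith("$")` guard, so only data rows are collected.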
## How to Run
### CLI Usage
```bash
# Initialize database tables
python3 run.py initdb
# Run the scraper
python3 run.py scrape
```
### Docker
```bash
# Build
docker-compose build
# Run scraper (default command)
docker-compose run app
# Initialize database via Docker
docker-compose run app python3 run.py initdb
# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```
### Curl the Scraped Data
The crawler itself does **not** serve HTTP endpoints. After scraping, the data is available through the **Rust API** (port 9552):
```bash
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1
# Get oil prices for Suffolk County (MA) — find county_id first
curl http://localhost:9552/state/MA
# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```
**Response format:**
```json
[
  {
    "id": 1234,
    "state": "Massachusetts",
    "zone": 1,
    "name": "ABC Fuel Co",
    "price": 3.29,
    "date": "01/15/2026",
    "scrapetimestamp": "2026-01-15T14:30:00Z",
    "county_id": 5
  }
]
```
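The same data can be consumed from Python; a small sketch, with illustrative function names and the endpoint and field names taken from the sample response above:

```python
import requests

def fetch_county_prices(county_id, base="http://localhost:9552"):
    """GET the Rust API's per-county price list."""
    resp = requests.get(f"{base}/oil-prices/county/{county_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()

def cheapest(prices):
    """Pick the lowest-priced record from a response payload."""
    return min(prices, key=lambda p: p["price"]) if prices else None
```

For example, `cheapest(fetch_county_prices(5))` would return the single cheapest vendor record for county 5.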
### Query the Database Directly
```bash
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"
# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT MAX(scrapetimestamp) FROM oil_prices;"
# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT c.name AS county, o.name AS company, o.price
FROM oil_prices o JOIN county c ON o.county_id = c.id
WHERE c.state='MA' ORDER BY o.price;"
```
## Environment
Create `.env`:
```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```
## Zone-to-County Mapping
Each scraping zone maps to one or more counties:
**Connecticut (10 zones):**
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven
**Massachusetts (15 zones):**
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket
**New Hampshire (6 zones):**
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham
**Rhode Island (4 zones):**
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport
**Maine (7 zones):**
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook
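In code, this table presumably lives as a dict in `config.py`. A sketch using the New Hampshire rows above (the constant's name is assumed; multi-county zones map to lists):

```python
# Illustrative fragment of the zone-to-county table (New Hampshire only);
# config.py's actual name and shape may differ.
ZONE_TO_COUNTIES = {
    ("newhampshire", 1): ["Coos", "Grafton"],
    ("newhampshire", 2): ["Carroll", "Belknap"],
    ("newhampshire", 3): ["Sullivan", "Merrimack"],
    ("newhampshire", 4): ["Strafford", "Cheshire"],
    ("newhampshire", 5): ["Hillsborough"],
    ("newhampshire", 6): ["Rockingham"],
}

def counties_for(state, zone):
    """Counties a scraped zone's prices should be attached to."""
    return ZONE_TO_COUNTIES.get((state, zone), [])
```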
## Upsert Logic
When storing scraped data, the crawler:
1. Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
2. **Skips** records where `company_id IS NOT NULL` (vendor-managed prices take priority)
3. **Updates** if the price or county_id has changed
4. **Inserts** a new record if no match exists
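The four steps can be sketched as a pure function over an in-memory index. This is simplified — it matches on the `(name, state, county_id)` key only, while `db_operations.py` works against SQLAlchemy sessions and also falls back to the zone-based key:

```python
def upsert(existing, scraped):
    """existing: {(name, state, county_id): row-dict}; scraped: list of row-dicts."""
    counts = {"skipped": 0, "updated": 0, "inserted": 0, "unchanged": 0}
    for row in scraped:
        key = (row["name"], row["state"], row["county_id"])
        current = existing.get(key)
        if current is None:                            # step 4: no match -> insert
            existing[key] = dict(row)
            counts["inserted"] += 1
        elif current.get("company_id") is not None:    # step 2: vendor-managed -> skip
            counts["skipped"] += 1
        elif current["price"] != row["price"]:         # step 3: changed -> update
            current["price"] = row["price"]
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
    return counts
```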
## Scheduling
The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:
```bash
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```
## Logging
Logs are written to `oil_scraper.log` in the working directory at level INFO. Sample output:
```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```