refactor: replace fuel_scraper with newenglandoil + cheapestoil scrapers

- Add newenglandoil/ package as the primary scraper (replaces fuel_scraper)
- Add cheapestoil/ package as a secondary market price scraper
- Add app.py entry point for direct execution
- Update run.py: new scrape_cheapest(), migrate command, --state filter,
  --refresh-metadata flag for overwriting existing phone/URL data
- Update models.py with latest schema fields
- Update requirements.txt dependencies
- Update Dockerfile and docker-compose.yml for new structure
- Remove deprecated fuel_scraper module, test.py, and log file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Date: 2026-03-06 11:34:21 -05:00
Parent: 8f45f4c209
Commit: 1592e6d685
26 changed files with 3221 additions and 1468 deletions

README.md (new file, +203 lines)
# NewEnglandBio Fuel Price Crawler
Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).
## Tech Stack
- **Language:** Python 3.9+
- **HTTP:** requests + BeautifulSoup4
- **Database:** SQLAlchemy + psycopg2 (PostgreSQL)
- **Deployment:** Docker
## Project Structure
```
crawler/
├── run.py # CLI entry point (initdb / scrape)
├── database.py # SQLAlchemy engine and session config
├── models.py # ORM models (OilPrice, County, Company)
├── fuel_scraper.py # Legacy monolithic scraper (deprecated)
├── fuel_scraper/ # Modular package (use this)
│ ├── __init__.py # Exports main()
│ ├── config.py # Site configs, zone-to-county mappings, logging
│ ├── http_client.py # HTTP requests with browser User-Agent
│ ├── parsers.py # HTML table parsing for price extraction
│ ├── scraper.py # Main orchestrator
│ └── db_operations.py # Upsert logic for oil_prices table
├── test.py # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```
## URLs Scraped
The crawler hits these external websites to collect price data:
### NewEnglandOil.com (5 states)
**URL pattern:** `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Connecticut | zone1–zone10 | `https://www.newenglandoil.com/connecticut/zone1.asp?type=0` |
| Massachusetts | zone1–zone15 | `https://www.newenglandoil.com/massachusetts/zone1.asp?type=0` |
| New Hampshire | zone1–zone6 | `https://www.newenglandoil.com/newhampshire/zone1.asp?type=0` |
| Rhode Island | zone1–zone4 | `https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0` |
| Vermont | zone1–zone4 | `https://www.newenglandoil.com/vermont/zone1.asp?type=0` |
### MaineOil.com (1 state)
**URL pattern:** `https://www.maineoil.com/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Maine | zone1–zone7 | `https://www.maineoil.com/zone1.asp?type=0` |
**Total: ~46 pages scraped per run.**
Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.
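The row-to-record step can be sketched as follows. Assume the `(Company Name, Price, Date)` cells have already been pulled out of the table by BeautifulSoup; `parse_price_rows` and its signature are illustrative, not the actual `parsers.py` API:

```python
def parse_price_rows(rows, state, zone, zone_to_county):
    """Turn raw table rows into price records (sketch).

    rows: list of (company, price_text, date_text) tuples already
    extracted from the HTML table. zone_to_county maps (state, zone)
    to a county_id, per the config described below.
    """
    records = []
    for company, price_text, date_text in rows:
        try:
            # Prices appear on the site as e.g. "$3.29"
            price = float(price_text.strip().lstrip("$"))
        except ValueError:
            continue  # skip header rows and malformed prices
        records.append({
            "state": state,
            "zone": zone,
            "name": company.strip(),
            "price": price,
            "date": date_text.strip(),
            "county_id": zone_to_county.get((state, zone)),
        })
    return records
```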
## How to Run
### CLI Usage
```bash
# Initialize database tables
python3 run.py initdb
# Run the scraper
python3 run.py scrape
```
### Docker
```bash
# Build
docker-compose build
# Run scraper (default command)
docker-compose run app
# Initialize database via Docker
docker-compose run app python3 run.py initdb
# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```
### Curl the Scraped Data
The crawler itself does **not** serve HTTP endpoints. After scraping, the data is available through the **Rust API** (port 9552):
```bash
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1
# Get oil prices for Suffolk County (MA) — find county_id first
curl http://localhost:9552/state/MA
# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```
**Response format:**
```json
[
{
"id": 1234,
"state": "Massachusetts",
"zone": 1,
"name": "ABC Fuel Co",
"price": 3.29,
"date": "01/15/2026",
"scrapetimestamp": "2026-01-15T14:30:00Z",
"county_id": 5
}
]
```
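If you prefer Python over curl, a minimal stdlib client for the endpoints above looks like this (the host and port are taken from the `localhost:9552` examples; adjust for your deployment):

```python
import json
import urllib.request


def county_prices_url(county_id, base="http://localhost:9552"):
    """Build the oil-prices-by-county URL shown in the curl examples."""
    return f"{base}/oil-prices/county/{county_id}"


def get_county_prices(county_id):
    """Fetch and decode the JSON price list for one county."""
    with urllib.request.urlopen(county_prices_url(county_id)) as resp:
        return json.loads(resp.read())
```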
### Query the Database Directly
```bash
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"
# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT MAX(scrapetimestamp) FROM oil_prices;"
# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT c.name AS county, o.name AS company, o.price
FROM oil_prices o JOIN county c ON o.county_id = c.id
WHERE c.state='MA' ORDER BY o.price;"
```
## Environment
Create `.env`:
```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```
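docker-compose loads `.env` into the container environment automatically; for direct `python3 run.py` runs you need the variable exported yourself, or a small loader along these lines (a sketch — `load_dotenv` here is a hypothetical helper, not necessarily what `database.py` does):

```python
import os


def load_dotenv(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ.

    Existing environment variables win (setdefault), matching the
    usual dotenv convention. Comments and blank lines are skipped.
    """
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```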
## Zone-to-County Mapping
Each scraping zone maps to one or more counties:
**Connecticut (10 zones):**
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven
**Massachusetts (15 zones):**
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket
**New Hampshire (6 zones):**
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham
**Rhode Island (4 zones):**
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport
**Maine (7 zones):**
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook
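In code, a mapping like the tables above is most naturally a dict keyed by `(state, zone)`. The shape below is illustrative (the real structure lives in `config.py` and its names may differ); the Rhode Island entries are copied from the table:

```python
# Illustrative zone-to-county config; counties come from the tables above.
ZONE_COUNTIES = {
    ("Rhode Island", "zone1"): ["Providence"],
    ("Rhode Island", "zone2"): ["Kent", "Bristol"],
    ("Rhode Island", "zone3"): ["Washington"],
    ("Rhode Island", "zone4"): ["Newport"],
    # ... remaining states follow the same pattern
}


def counties_for(state, zone):
    """Look up the counties served by a scraping zone."""
    return ZONE_COUNTIES.get((state, zone), [])
```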
## Upsert Logic
When storing scraped data, the crawler:
1. Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
2. **Skips** records where `company_id IS NOT NULL` (vendor-managed prices take priority)
3. **Updates** if the price or county_id has changed
4. **Inserts** a new record if no match exists
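The four rules can be sketched on plain dicts like this (the real implementation in `db_operations.py` works against the `oil_prices` table via SQLAlchemy; this is only a model of the decision logic):

```python
def upsert_price(existing, record):
    """Apply the upsert rules above to an in-memory list of rows.

    Returns "skipped", "unchanged", "updated", or "inserted".
    """
    for row in existing:
        same_company = (row["name"] == record["name"]
                        and row["state"] == record["state"])
        if same_company and (row.get("county_id") == record.get("county_id")
                             or row.get("zone") == record.get("zone")):
            # Rule 2: vendor-managed rows are never overwritten
            if row.get("company_id") is not None:
                return "skipped"
            # Rule 3: update only when price or county_id changed
            if (row["price"] != record["price"]
                    or row.get("county_id") != record.get("county_id")):
                row["price"] = record["price"]
                row["county_id"] = record.get("county_id")
                return "updated"
            return "unchanged"
    # Rule 4: no match, insert a new record
    existing.append(dict(record))
    return "inserted"
```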
## Scheduling
The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:
```bash
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```
## Logging
Logs to `oil_scraper.log` in the working directory. Level: INFO.
```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```