refactor: replace fuel_scraper with newenglandoil + cheapestoil scrapers

- Add newenglandoil/ package as the primary scraper (replaces fuel_scraper)
- Add cheapestoil/ package as a secondary market price scraper
- Add app.py entry point for direct execution
- Update run.py: new scrape_cheapest(), migrate command, --state filter,
  --refresh-metadata flag for overwriting existing phone/URL data
- Update models.py with latest schema fields
- Update requirements.txt dependencies
- Update Dockerfile and docker-compose.yml for new structure
- Remove deprecated fuel_scraper module, test.py, and log file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Date: 2026-03-06 11:34:21 -05:00
Parent: 8f45f4c209
Commit: 1592e6d685
26 changed files with 3221 additions and 1468 deletions

README.md (new file, +203 lines)
# NewEnglandBio Fuel Price Crawler
Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).
## Tech Stack
- **Language:** Python 3.9+
- **HTTP:** requests + BeautifulSoup4
- **Database:** SQLAlchemy + psycopg2 (PostgreSQL)
- **Deployment:** Docker
## Project Structure
```
crawler/
├── run.py # CLI entry point (initdb / scrape)
├── database.py # SQLAlchemy engine and session config
├── models.py # ORM models (OilPrice, County, Company)
├── fuel_scraper.py # Legacy monolithic scraper (deprecated)
├── fuel_scraper/ # Modular package (use this)
│ ├── __init__.py # Exports main()
│ ├── config.py # Site configs, zone-to-county mappings, logging
│ ├── http_client.py # HTTP requests with browser User-Agent
│ ├── parsers.py # HTML table parsing for price extraction
│ ├── scraper.py # Main orchestrator
│ └── db_operations.py # Upsert logic for oil_prices table
├── test.py # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```
## URLs Scraped
The crawler hits these external websites to collect price data:
### NewEnglandOil.com (5 states)
**URL pattern:** `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Connecticut | zone1–zone10 | `https://www.newenglandoil.com/connecticut/zone1.asp?type=0` |
| Massachusetts | zone1–zone15 | `https://www.newenglandoil.com/massachusetts/zone1.asp?type=0` |
| New Hampshire | zone1–zone6 | `https://www.newenglandoil.com/newhampshire/zone1.asp?type=0` |
| Rhode Island | zone1–zone4 | `https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0` |
| Vermont | zone1–zone4 | `https://www.newenglandoil.com/vermont/zone1.asp?type=0` |
### MaineOil.com (1 state)
**URL pattern:** `https://www.maineoil.com/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Maine | zone1–zone7 | `https://www.maineoil.com/zone1.asp?type=0` |
**Total: ~46 pages scraped per run.**
Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.
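The row-to-record step can be sketched as follows. Assume the `(Company Name, Price, Date)` cells have already been pulled out of the table by BeautifulSoup; `parse_price_rows` and its signature are illustrative, not the actual `parsers.py` API:

```python
def parse_price_rows(rows, state, zone, zone_to_county):
    """Turn raw table rows into price records (sketch).

    rows: list of (company, price_text, date_text) tuples already
    extracted from the HTML table. zone_to_county maps (state, zone)
    to a county_id, per the config described below.
    """
    records = []
    for company, price_text, date_text in rows:
        try:
            # Prices appear on the site as e.g. "$3.29"
            price = float(price_text.strip().lstrip("$"))
        except ValueError:
            continue  # skip header rows and malformed prices
        records.append({
            "state": state,
            "zone": zone,
            "name": company.strip(),
            "price": price,
            "date": date_text.strip(),
            "county_id": zone_to_county.get((state, zone)),
        })
    return records
```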
## How to Run
### CLI Usage
```bash
# Initialize database tables
python3 run.py initdb
# Run the scraper
python3 run.py scrape
```
### Docker
```bash
# Build
docker-compose build
# Run scraper (default command)
docker-compose run app
# Initialize database via Docker
docker-compose run app python3 run.py initdb
# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```
### Curl the Scraped Data
The crawler itself does **not** serve HTTP endpoints. After scraping, the data is available through the **Rust API** (port 9552):
```bash
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1
# Get oil prices for Suffolk County (MA) — find county_id first
curl http://localhost:9552/state/MA
# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```
**Response format:**
```json
[
{
"id": 1234,
"state": "Massachusetts",
"zone": 1,
"name": "ABC Fuel Co",
"price": 3.29,
"date": "01/15/2026",
"scrapetimestamp": "2026-01-15T14:30:00Z",
"county_id": 5
}
]
```
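If you prefer Python over curl, a minimal stdlib client for the endpoints above looks like this (the host and port are taken from the `localhost:9552` examples; adjust for your deployment):

```python
import json
import urllib.request


def county_prices_url(county_id, base="http://localhost:9552"):
    """Build the oil-prices-by-county URL shown in the curl examples."""
    return f"{base}/oil-prices/county/{county_id}"


def get_county_prices(county_id):
    """Fetch and decode the JSON price list for one county."""
    with urllib.request.urlopen(county_prices_url(county_id)) as resp:
        return json.loads(resp.read())
```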
### Query the Database Directly
```bash
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"
# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT MAX(scrapetimestamp) FROM oil_prices;"
# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT c.name AS county, o.name AS company, o.price
FROM oil_prices o JOIN county c ON o.county_id = c.id
WHERE c.state='MA' ORDER BY o.price;"
```
## Environment
Create `.env`:
```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```
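docker-compose loads `.env` into the container environment automatically; for direct `python3 run.py` runs you need the variable exported yourself, or a small loader along these lines (a sketch — `load_dotenv` here is a hypothetical helper, not necessarily what `database.py` does):

```python
import os


def load_dotenv(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ.

    Existing environment variables win (setdefault), matching the
    usual dotenv convention. Comments and blank lines are skipped.
    """
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```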
## Zone-to-County Mapping
Each scraping zone maps to one or more counties:
**Connecticut (10 zones):**
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven
**Massachusetts (15 zones):**
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket
**New Hampshire (6 zones):**
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham
**Rhode Island (4 zones):**
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport
**Maine (7 zones):**
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook
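In code, a mapping like the tables above is most naturally a dict keyed by `(state, zone)`. The shape below is illustrative (the real structure lives in `config.py` and its names may differ); the Rhode Island entries are copied from the table:

```python
# Illustrative zone-to-county config; counties come from the tables above.
ZONE_COUNTIES = {
    ("Rhode Island", "zone1"): ["Providence"],
    ("Rhode Island", "zone2"): ["Kent", "Bristol"],
    ("Rhode Island", "zone3"): ["Washington"],
    ("Rhode Island", "zone4"): ["Newport"],
    # ... remaining states follow the same pattern
}


def counties_for(state, zone):
    """Look up the counties served by a scraping zone."""
    return ZONE_COUNTIES.get((state, zone), [])
```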
## Upsert Logic
When storing scraped data, the crawler:
1. Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
2. **Skips** records where `company_id IS NOT NULL` (vendor-managed prices take priority)
3. **Updates** if the price or county_id has changed
4. **Inserts** a new record if no match exists
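The four rules can be sketched on plain dicts like this (the real implementation in `db_operations.py` works against the `oil_prices` table via SQLAlchemy; this is only a model of the decision logic):

```python
def upsert_price(existing, record):
    """Apply the upsert rules above to an in-memory list of rows.

    Returns "skipped", "unchanged", "updated", or "inserted".
    """
    for row in existing:
        same_company = (row["name"] == record["name"]
                        and row["state"] == record["state"])
        if same_company and (row.get("county_id") == record.get("county_id")
                             or row.get("zone") == record.get("zone")):
            # Rule 2: vendor-managed rows are never overwritten
            if row.get("company_id") is not None:
                return "skipped"
            # Rule 3: update only when price or county_id changed
            if (row["price"] != record["price"]
                    or row.get("county_id") != record.get("county_id")):
                row["price"] = record["price"]
                row["county_id"] = record.get("county_id")
                return "updated"
            return "unchanged"
    # Rule 4: no match, insert a new record
    existing.append(dict(record))
    return "inserted"
```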
## Scheduling
The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:
```bash
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```
## Logging
Logs to `oil_scraper.log` in the working directory. Level: INFO.
```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```