# NewEnglandBio Fuel Price Crawler
Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).
## Tech Stack
- Language: Python 3.9+
- HTTP: requests + BeautifulSoup4
- Database: SQLAlchemy + psycopg2 (PostgreSQL)
- Deployment: Docker
## Project Structure

```
crawler/
├── run.py              # CLI entry point (initdb / scrape)
├── database.py         # SQLAlchemy engine and session config
├── models.py           # ORM models (OilPrice, County, Company)
├── fuel_scraper.py     # Legacy monolithic scraper (deprecated)
├── fuel_scraper/       # Modular package (use this)
│   ├── __init__.py     # Exports main()
│   ├── config.py       # Site configs, zone-to-county mappings, logging
│   ├── http_client.py  # HTTP requests with browser User-Agent
│   ├── parsers.py      # HTML table parsing for price extraction
│   ├── scraper.py      # Main orchestrator
│   └── db_operations.py  # Upsert logic for oil_prices table
├── test.py             # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```
## URLs Scraped
The crawler hits these external websites to collect price data:
### NewEnglandOil.com (5 states)

URL pattern: `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`
| State | Zones | Example URL |
|---|---|---|
| Connecticut | zone1–zone10 | https://www.newenglandoil.com/connecticut/zone1.asp?type=0 |
| Massachusetts | zone1–zone15 | https://www.newenglandoil.com/massachusetts/zone1.asp?type=0 |
| New Hampshire | zone1–zone6 | https://www.newenglandoil.com/newhampshire/zone1.asp?type=0 |
| Rhode Island | zone1–zone4 | https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0 |
| Vermont | zone1–zone4 | https://www.newenglandoil.com/vermont/zone1.asp?type=0 |
### MaineOil.com (1 state)

URL pattern: `https://www.maineoil.com/{zone}.asp?type=0`
| State | Zones | Example URL |
|---|---|---|
| Maine | zone1–zone7 | https://www.maineoil.com/zone1.asp?type=0 |
Total: 46 pages scraped per run (10 + 15 + 6 + 4 + 4 + 7).
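The full URL list follows mechanically from the per-state zone counts. A sketch (zone counts taken from the tables above; the dict names and `build_urls` helper are illustrative, not the actual config.py structure):

```python
NEO_ZONES = {          # newenglandoil.com: state path segment -> zone count
    "connecticut": 10,
    "massachusetts": 15,
    "newhampshire": 6,
    "rhodeisland": 4,
    "vermont": 4,
}
MAINE_ZONES = 7        # maineoil.com URLs have no state path segment

def build_urls():
    """Return every page URL visited in one scrape run."""
    urls = [
        f"https://www.newenglandoil.com/{state}/zone{z}.asp?type=0"
        for state, count in NEO_ZONES.items()
        for z in range(1, count + 1)
    ]
    urls += [
        f"https://www.maineoil.com/zone{z}.asp?type=0"
        for z in range(1, MAINE_ZONES + 1)
    ]
    return urls
```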
Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.
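A minimal version of that table parser, assuming a plain three-column table. The sample HTML here is illustrative; the real pages may use different markup, so treat this as a sketch rather than the actual parsers.py code:

```python
from bs4 import BeautifulSoup

# Hypothetical sample of a zone page's price table.
SAMPLE_HTML = """
<table>
  <tr><th>Company Name</th><th>Price</th><th>Date</th></tr>
  <tr><td>ABC Fuel Co</td><td>$3.29</td><td>01/15/2026</td></tr>
</table>
"""

def parse_prices(html):
    """Extract (name, price, date) tuples from a zone page's table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 3:
            continue  # header row (th cells) or malformed row
        name, price, date = cells
        rows.append((name, float(price.lstrip("$")), date))
    return rows
```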
## How to Run

### CLI Usage

```shell
# Initialize database tables
python3 run.py initdb

# Run the scraper
python3 run.py scrape
```
### Docker

```shell
# Build
docker-compose build

# Run scraper (default command)
docker-compose run app

# Initialize database via Docker
docker-compose run app python3 run.py initdb

# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```
## Curl the Scraped Data

The crawler itself does not serve HTTP endpoints. After scraping, the data is available through the Rust API (port 9552):

```shell
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1

# Get oil prices for Suffolk County (MA) — find the county_id first
curl http://localhost:9552/state/MA

# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```
Response format:
[
{
"id": 1234,
"state": "Massachusetts",
"zone": 1,
"name": "ABC Fuel Co",
"price": 3.29,
"date": "01/15/2026",
"scrapetimestamp": "2026-01-15T14:30:00Z",
"county_id": 5
}
]
## Query the Database Directly

```shell
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"

# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT MAX(scrapetimestamp) FROM oil_prices;"

# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT c.name AS county, o.name AS company, o.price
      FROM oil_prices o JOIN county c ON o.county_id = c.id
      WHERE c.state='MA' ORDER BY o.price;"
```
## Environment

Create `.env`:

```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```
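When running under docker-compose, the variable is typically injected into the container's environment. For direct `python3 run.py` runs, a minimal stdlib `.env` loader could look like the following sketch (`load_dotenv` here is a hypothetical helper; the python-dotenv package provides the same behavior):

```python
import os
from pathlib import Path

def load_dotenv(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Existing environment variables win (setdefault), matching the
    usual dotenv convention. Blank lines and # comments are skipped.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```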
## Zone-to-County Mapping
Each scraping zone maps to one or more counties:
Connecticut (10 zones):
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven
Massachusetts (15 zones):
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket
New Hampshire (6 zones):
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham
Rhode Island (4 zones):
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport
Maine (7 zones):
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook
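In config.py this mapping presumably lives as a lookup table; an illustrative fragment (the dict name, key shape, and `counties_for` helper are assumptions, and only a few zones are shown):

```python
# Maps (state, zone number) to the counties that zone covers.
# Fragment only; the real table covers every zone in all six states.
ZONE_COUNTIES = {
    ("New Hampshire", 1): ["Coos", "Grafton"],
    ("New Hampshire", 2): ["Carroll", "Belknap"],
    ("Rhode Island", 2): ["Kent", "Bristol"],
    ("Maine", 7): ["Hancock", "Washington", "Waldo", "Aroostook"],
}

def counties_for(state, zone):
    """Return the counties for a scraped zone, or [] if unmapped."""
    return ZONE_COUNTIES.get((state, zone), [])
```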
## Upsert Logic
When storing scraped data, the crawler:
- Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
- Skips records where `company_id IS NOT NULL` (vendor-managed prices take priority)
- Updates if the price or county_id has changed
- Inserts a new record if no match exists
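The rules above can be sketched as a pure decision function, separate from the ORM plumbing. The function name and dict-based rows are illustrative; the real db_operations.py works on SQLAlchemy rows:

```python
def upsert_decision(existing, scraped):
    """Classify one scraped record against its matched row, if any.

    `existing` is the row found by the (name, state, county_id) /
    (name, state, zone) match, or None. Returns one of "insert",
    "skip", "update", or "noop" per the rules above.
    """
    if existing is None:
        return "insert"                      # no match -> new record
    if existing.get("company_id") is not None:
        return "skip"                        # vendor-managed price wins
    if (existing["price"] != scraped["price"]
            or existing["county_id"] != scraped["county_id"]):
        return "update"
    return "noop"
```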
## Scheduling

The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:

```shell
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```
## Logging

Logs to `oil_scraper.log` in the working directory. Level: INFO.

```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```
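A logging setup that produces lines in this shape could look like the following (the exact format string and function name in config.py are assumptions inferred from the sample output):

```python
import logging

# Matches the sample lines: timestamp - LEVEL - [file:line] - message
LOG_FORMAT = "%(asctime)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s"

def setup_logging(path="oil_scraper.log"):
    """Configure root logging to a file at INFO level."""
    logging.basicConfig(
        filename=path,
        level=logging.INFO,
        format=LOG_FORMAT,
        datefmt="%Y-%m-%d %H:%M:%S",
    )
```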