# NewEnglandBio Fuel Price Crawler

Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).

## Tech Stack

- **Language:** Python 3.9+
- **HTTP:** requests + BeautifulSoup4
- **Database:** SQLAlchemy + psycopg2 (PostgreSQL)
- **Deployment:** Docker

## Project Structure

```
crawler/
├── run.py               # CLI entry point (initdb / scrape)
├── database.py          # SQLAlchemy engine and session config
├── models.py            # ORM models (OilPrice, County, Company)
├── fuel_scraper.py      # Legacy monolithic scraper (deprecated)
├── fuel_scraper/        # Modular package (use this)
│   ├── __init__.py      # Exports main()
│   ├── config.py        # Site configs, zone-to-county mappings, logging
│   ├── http_client.py   # HTTP requests with browser User-Agent
│   ├── parsers.py       # HTML table parsing for price extraction
│   ├── scraper.py       # Main orchestrator
│   └── db_operations.py # Upsert logic for oil_prices table
├── test.py              # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```

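The structure above lists `http_client.py` as issuing requests with a browser User-Agent. A minimal sketch of what such a helper might look like — the exact header string, function name, and timeout below are illustrative assumptions, not the project's actual values:

```python
import requests

# Illustrative desktop-browser User-Agent string; the real value lives in
# fuel_scraper/http_client.py (this exact string is an assumption).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

def fetch(url, timeout=30):
    """GET a page with a browser-like User-Agent; raise on HTTP errors."""
    resp = requests.get(url, headers={"User-Agent": BROWSER_UA}, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```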
## URLs Scraped

The crawler hits these external websites to collect price data:

### NewEnglandOil.com (5 states)

**URL pattern:** `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`

| State | Zones | Example URL |
|-------|-------|-------------|
| Connecticut | zone1–zone10 | `https://www.newenglandoil.com/connecticut/zone1.asp?type=0` |
| Massachusetts | zone1–zone15 | `https://www.newenglandoil.com/massachusetts/zone1.asp?type=0` |
| New Hampshire | zone1–zone6 | `https://www.newenglandoil.com/newhampshire/zone1.asp?type=0` |
| Rhode Island | zone1–zone4 | `https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0` |
| Vermont | zone1–zone4 | `https://www.newenglandoil.com/vermont/zone1.asp?type=0` |

### MaineOil.com (1 state)

**URL pattern:** `https://www.maineoil.com/{zone}.asp?type=0`

| State | Zones | Example URL |
|-------|-------|-------------|
| Maine | zone1–zone7 | `https://www.maineoil.com/zone1.asp?type=0` |

**Total: ~46 pages scraped per run.**

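The full URL list follows directly from the per-state zone counts in the tables above. A sketch (the zone counts come from this section; the `build_urls` helper itself is hypothetical, not a function in the codebase):

```python
# Zone counts per state, as listed in the tables above.
NEO_ZONES = {
    "connecticut": 10,
    "massachusetts": 15,
    "newhampshire": 6,
    "rhodeisland": 4,
    "vermont": 4,
}
MAINE_ZONES = 7

def build_urls():
    """Return every zone URL the crawler fetches (46 pages total)."""
    urls = [
        f"https://www.newenglandoil.com/{state}/zone{z}.asp?type=0"
        for state, count in NEO_ZONES.items()
        for z in range(1, count + 1)
    ]
    urls += [
        f"https://www.maineoil.com/zone{z}.asp?type=0"
        for z in range(1, MAINE_ZONES + 1)
    ]
    return urls
```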
Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.

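The real extraction lives in `fuel_scraper/parsers.py`; the following is a simplified sketch of pulling (company, price, date) rows out of a table like the one described, assuming one `<td>` per column — the live sites' actual markup may differ:

```python
from bs4 import BeautifulSoup

def parse_price_rows(html):
    """Extract (company, price, date) tuples from price-table rows.

    Simplified sketch: assumes rows of the form
    <tr><td>Company</td><td>$3.29</td><td>01/15/2026</td></tr>.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 3:
            continue  # skip header or malformed rows
        name, price_text, date = cells[0], cells[1], cells[2]
        try:
            price = float(price_text.lstrip("$"))
        except ValueError:
            continue  # second cell is not a price; skip the row
        rows.append((name, price, date))
    return rows
```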
## How to Run

### CLI Usage

```bash
# Initialize database tables
python3 run.py initdb

# Run the scraper
python3 run.py scrape
```

### Docker

```bash
# Build
docker-compose build

# Run scraper (default command)
docker-compose run app

# Initialize database via Docker
docker-compose run app python3 run.py initdb

# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```

### Curl the Scraped Data

The crawler itself does **not** serve HTTP endpoints. After scraping, the data is available through the **Rust API** (port 9552):

```bash
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1

# Get oil prices for Suffolk County (MA): find the county_id first
curl http://localhost:9552/state/MA
# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```

**Response format:**

```json
[
  {
    "id": 1234,
    "state": "Massachusetts",
    "zone": 1,
    "name": "ABC Fuel Co",
    "price": 3.29,
    "date": "01/15/2026",
    "scrapetimestamp": "2026-01-15T14:30:00Z",
    "county_id": 5
  }
]
```

### Query the Database Directly

```bash
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"

# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT MAX(scrapetimestamp) FROM oil_prices;"

# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT c.name AS county, o.name AS company, o.price
      FROM oil_prices o JOIN county c ON o.county_id = c.id
      WHERE c.state='MA' ORDER BY o.price;"
```

## Environment

Create `.env`:

```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```

## Zone-to-County Mapping

Each scraping zone maps to one or more counties:

**Connecticut (10 zones):**
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven

**Massachusetts (15 zones):**
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket

**New Hampshire (6 zones):**
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham

**Rhode Island (4 zones):**
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport

**Maine (7 zones):**
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook

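In code, a mapping like this is naturally a dict keyed by (state, zone). A sketch with a few entries taken from the lists above — the authoritative mapping lives in `fuel_scraper/config.py`, and this particular shape and the `counties_for` helper are assumptions:

```python
# Excerpt of a zone-to-county mapping in the shape the lists above suggest;
# only a few entries are shown (the full table is in fuel_scraper/config.py).
ZONE_COUNTIES = {
    ("Massachusetts", "zone9"): ["Suffolk"],
    ("New Hampshire", "zone1"): ["Coos", "Grafton"],
    ("Maine", "zone3"): ["Sagadahoc", "Lincoln", "Knox"],
}

def counties_for(state, zone):
    """Return the counties a (state, zone) pair maps to, or [] if unknown."""
    return ZONE_COUNTIES.get((state, zone), [])
```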
## Upsert Logic

When storing scraped data, the crawler:

1. Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
2. **Skips** records where `company_id IS NOT NULL` (vendor-managed prices take priority)
3. **Updates** if the price or county_id has changed
4. **Inserts** a new record if no match exists

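The four steps above can be sketched as a pure decision function. Record fields here are illustrative dict keys; the real logic lives in `fuel_scraper/db_operations.py` and operates on ORM rows:

```python
def upsert_action(existing, scraped):
    """Decide what to do with a scraped record, following the four steps above.

    `existing` is the matched row as a dict (or None if step 1 found no match);
    `scraped` is the freshly parsed record. Field names are illustrative.
    """
    if existing is None:
        return "insert"                      # step 4: no match exists
    if existing.get("company_id") is not None:
        return "skip"                        # step 2: vendor-managed price
    if (existing["price"] != scraped["price"]
            or existing["county_id"] != scraped["county_id"]):
        return "update"                      # step 3: price or county changed
    return "skip"                            # nothing changed
```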
## Scheduling

The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:

```bash
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```

## Logging

Logs to `oil_scraper.log` in the working directory. Level: INFO.

```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```
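A `logging` format string that reproduces these lines looks like the sketch below; where the crawler actually configures this (presumably `fuel_scraper/config.py`) and the `configure_logging` helper name are assumptions:

```python
import logging

# Format string matching the sample log lines above.
LOG_FORMAT = "%(asctime)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def configure_logging(path="oil_scraper.log"):
    """Log at INFO to the given file using the crawler's line format."""
    logging.basicConfig(
        filename=path,
        level=logging.INFO,
        format=LOG_FORMAT,
        datefmt=DATE_FORMAT,
    )
```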