crawler/README.md
Edwin Eames 1592e6d685 refactor: replace fuel_scraper with newenglandoil + cheapestoil scrapers
- Add newenglandoil/ package as the primary scraper (replaces fuel_scraper)
- Add cheapestoil/ package as a secondary market price scraper
- Add app.py entry point for direct execution
- Update run.py: new scrape_cheapest(), migrate command, --state filter,
  --refresh-metadata flag for overwriting existing phone/URL data
- Update models.py with latest schema fields
- Update requirements.txt dependencies
- Update Dockerfile and docker-compose.yml for new structure
- Remove deprecated fuel_scraper module, test.py, and log file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-06 11:34:21 -05:00

# NewEnglandBio Fuel Price Crawler
Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).
## Tech Stack
- **Language:** Python 3.9+
- **HTTP:** requests + BeautifulSoup4
- **Database:** SQLAlchemy + psycopg2 (PostgreSQL)
- **Deployment:** Docker
## Project Structure
```
crawler/
├── run.py # CLI entry point (initdb / scrape)
├── database.py # SQLAlchemy engine and session config
├── models.py # ORM models (OilPrice, County, Company)
├── fuel_scraper.py # Legacy monolithic scraper (deprecated)
├── fuel_scraper/ # Modular package (use this)
│ ├── __init__.py # Exports main()
│ ├── config.py # Site configs, zone-to-county mappings, logging
│ ├── http_client.py # HTTP requests with browser User-Agent
│ ├── parsers.py # HTML table parsing for price extraction
│ ├── scraper.py # Main orchestrator
│ └── db_operations.py # Upsert logic for oil_prices table
├── test.py # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```
## URLs Scraped
The crawler hits these external websites to collect price data:
### NewEnglandOil.com (5 states)
**URL pattern:** `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Connecticut | zone1–zone10 | `https://www.newenglandoil.com/connecticut/zone1.asp?type=0` |
| Massachusetts | zone1–zone15 | `https://www.newenglandoil.com/massachusetts/zone1.asp?type=0` |
| New Hampshire | zone1–zone6 | `https://www.newenglandoil.com/newhampshire/zone1.asp?type=0` |
| Rhode Island | zone1–zone4 | `https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0` |
| Vermont | zone1–zone4 | `https://www.newenglandoil.com/vermont/zone1.asp?type=0` |
### MaineOil.com (1 state)
**URL pattern:** `https://www.maineoil.com/{zone}.asp?type=0`
| State | Zones | Example URL |
|-------|-------|-------------|
| Maine | zone1–zone7 | `https://www.maineoil.com/zone1.asp?type=0` |
**Total: 46 pages scraped per run** (10 + 15 + 6 + 4 + 4 + 7 zones).
Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.
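As a sketch, the fetch-and-parse step might look like the following. The zone counts come from the tables above and the browser User-Agent mirrors what `http_client.py` is described as doing; the table parser is a minimal guess at a three-column layout, and the live sites' actual HTML may require different selectors:

```python
import requests
from bs4 import BeautifulSoup

# Zone counts per state, taken from the tables above.
NEO_ZONES = {"connecticut": 10, "massachusetts": 15, "newhampshire": 6,
             "rhodeisland": 4, "vermont": 4}
HEADERS = {"User-Agent": "Mozilla/5.0"}  # browser UA, as http_client.py uses

def zone_urls():
    """Yield (state, zone, url) for all 46 pages."""
    for state, count in NEO_ZONES.items():
        for z in range(1, count + 1):
            yield state, z, f"https://www.newenglandoil.com/{state}/zone{z}.asp?type=0"
    for z in range(1, 8):  # Maine is served from its own domain
        yield "maine", z, f"https://www.maineoil.com/zone{z}.asp?type=0"

def parse_prices(html):
    """Extract (company, price, date) from rows whose second cell looks like a price."""
    rows = []
    for tr in BeautifulSoup(html, "html.parser").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3 and cells[1].startswith("$"):
            rows.append((cells[0], float(cells[1].lstrip("$")), cells[2]))
    return rows

def fetch(url):
    return requests.get(url, headers=HEADERS, timeout=30).text
```

Header rows (`<th>` cells) and non-price rows fall through the `startswith("$")` guard, so only data rows are collected.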
## How to Run
### CLI Usage
```bash
# Initialize database tables
python3 run.py initdb
# Run the scraper
python3 run.py scrape
```
### Docker
```bash
# Build
docker-compose build
# Run scraper (default command)
docker-compose run app
# Initialize database via Docker
docker-compose run app python3 run.py initdb
# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```
### Curl the Scraped Data
The crawler itself does **not** serve HTTP endpoints. After scraping, the data is available through the **Rust API** (port 9552):
```bash
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1
# Get oil prices for Suffolk County (MA) — find county_id first
curl http://localhost:9552/state/MA
# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```
**Response format:**
```json
[
  {
    "id": 1234,
    "state": "Massachusetts",
    "zone": 1,
    "name": "ABC Fuel Co",
    "price": 3.29,
    "date": "01/15/2026",
    "scrapetimestamp": "2026-01-15T14:30:00Z",
    "county_id": 5
  }
]
```
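The same data can be consumed from Python; a small sketch, with illustrative function names and the endpoint and field names taken from the sample response above:

```python
import requests

def fetch_county_prices(county_id, base="http://localhost:9552"):
    """GET the Rust API's per-county price list."""
    resp = requests.get(f"{base}/oil-prices/county/{county_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()

def cheapest(prices):
    """Pick the lowest-priced record from a response payload."""
    return min(prices, key=lambda p: p["price"]) if prices else None
```

For example, `cheapest(fetch_county_prices(5))` would return the single cheapest vendor record for county 5.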
### Query the Database Directly
```bash
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"
# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT MAX(scrapetimestamp) FROM oil_prices;"
# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
-c "SELECT c.name AS county, o.name AS company, o.price
FROM oil_prices o JOIN county c ON o.county_id = c.id
WHERE c.state='MA' ORDER BY o.price;"
```
## Environment
Create `.env`:
```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```
## Zone-to-County Mapping
Each scraping zone maps to one or more counties:
**Connecticut (10 zones):**
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven
**Massachusetts (15 zones):**
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket
**New Hampshire (6 zones):**
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham
**Rhode Island (4 zones):**
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport
**Maine (7 zones):**
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook
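In code, this table presumably lives as a dict in `config.py`. A sketch using the New Hampshire rows above (the constant's name is assumed; multi-county zones map to lists):

```python
# Illustrative fragment of the zone-to-county table (New Hampshire only);
# config.py's actual name and shape may differ.
ZONE_TO_COUNTIES = {
    ("newhampshire", 1): ["Coos", "Grafton"],
    ("newhampshire", 2): ["Carroll", "Belknap"],
    ("newhampshire", 3): ["Sullivan", "Merrimack"],
    ("newhampshire", 4): ["Strafford", "Cheshire"],
    ("newhampshire", 5): ["Hillsborough"],
    ("newhampshire", 6): ["Rockingham"],
}

def counties_for(state, zone):
    """Counties a scraped zone's prices should be attached to."""
    return ZONE_TO_COUNTIES.get((state, zone), [])
```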
## Upsert Logic
When storing scraped data, the crawler:
1. Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
2. **Skips** records where `company_id IS NOT NULL` (vendor-managed prices take priority)
3. **Updates** if the price or county_id has changed
4. **Inserts** a new record if no match exists
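The four steps can be sketched as a pure function over an in-memory index. This is simplified — it matches on the `(name, state, county_id)` key only, while `db_operations.py` works against SQLAlchemy sessions and also falls back to the zone-based key:

```python
def upsert(existing, scraped):
    """existing: {(name, state, county_id): row-dict}; scraped: list of row-dicts."""
    counts = {"skipped": 0, "updated": 0, "inserted": 0, "unchanged": 0}
    for row in scraped:
        key = (row["name"], row["state"], row["county_id"])
        current = existing.get(key)
        if current is None:                            # step 4: no match -> insert
            existing[key] = dict(row)
            counts["inserted"] += 1
        elif current.get("company_id") is not None:    # step 2: vendor-managed -> skip
            counts["skipped"] += 1
        elif current["price"] != row["price"]:         # step 3: changed -> update
            current["price"] = row["price"]
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
    return counts
```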
## Scheduling
The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:
```bash
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```
## Logging
Logs are written to `oil_scraper.log` in the working directory at level INFO. Sample output:
```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```