# NewEnglandBio Fuel Price Crawler

Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).

## Tech Stack

- **Language:** Python 3.9+
- **HTTP:** requests + BeautifulSoup4
- **Database:** SQLAlchemy + psycopg2 (PostgreSQL)
- **Deployment:** Docker

## Project Structure

```
crawler/
├── run.py               # CLI entry point (initdb / scrape)
├── database.py          # SQLAlchemy engine and session config
├── models.py            # ORM models (OilPrice, County, Company)
├── fuel_scraper.py      # Legacy monolithic scraper (deprecated)
├── fuel_scraper/        # Modular package (use this)
│   ├── __init__.py      # Exports main()
│   ├── config.py        # Site configs, zone-to-county mappings, logging
│   ├── http_client.py   # HTTP requests with browser User-Agent
│   ├── parsers.py       # HTML table parsing for price extraction
│   ├── scraper.py       # Main orchestrator
│   └── db_operations.py # Upsert logic for oil_prices table
├── test.py              # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env
```

## URLs Scraped

The crawler hits these external websites to collect price data:

### NewEnglandOil.com (5 states)

**URL pattern:** `https://www.newenglandoil.com/{state}/{zone}.asp?type=0`

| State | Zones | Example URL |
|-------|-------|-------------|
| Connecticut | zone1–zone10 | `https://www.newenglandoil.com/connecticut/zone1.asp?type=0` |
| Massachusetts | zone1–zone15 | `https://www.newenglandoil.com/massachusetts/zone1.asp?type=0` |
| New Hampshire | zone1–zone6 | `https://www.newenglandoil.com/newhampshire/zone1.asp?type=0` |
| Rhode Island | zone1–zone4 | `https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0` |
| Vermont | zone1–zone4 | `https://www.newenglandoil.com/vermont/zone1.asp?type=0` |

### MaineOil.com (1 state)

**URL pattern:** `https://www.maineoil.com/{zone}.asp?type=0`

| State | Zones | Example URL |
|-------|-------|-------------|
| Maine | zone1–zone7 | `https://www.maineoil.com/zone1.asp?type=0` |

**Total: 46 pages scraped per run** (39 NewEnglandOil + 7 MaineOil). Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.

## How to Run

### CLI Usage

```bash
# Initialize database tables
python3 run.py initdb

# Run the scraper
python3 run.py scrape
```

### Docker

```bash
# Build
docker-compose build

# Run scraper (default command)
docker-compose run app

# Initialize database via Docker
docker-compose run app python3 run.py initdb

# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app
```

### Curl the Scraped Data

The crawler itself does **not** serve HTTP endpoints. After scraping, the data is available through the **Rust API** (port 9552):

```bash
# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1

# Get oil prices for Suffolk County (MA): find its county_id first
curl http://localhost:9552/state/MA

# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5
```

**Response format:**

```json
[
  {
    "id": 1234,
    "state": "Massachusetts",
    "zone": 1,
    "name": "ABC Fuel Co",
    "price": 3.29,
    "date": "01/15/2026",
    "scrapetimestamp": "2026-01-15T14:30:00Z",
    "county_id": 5
  }
]
```

### Query the Database Directly

```bash
# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"

# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT MAX(scrapetimestamp) FROM oil_prices;"

# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT c.name AS county, o.name AS company, o.price FROM oil_prices o JOIN county c ON o.county_id = c.id WHERE c.state='MA' ORDER BY o.price;"
```

## Environment

Create `.env`:

```
DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
```

## Zone-to-County Mapping

Each scraping zone maps to one or more counties:

**Connecticut (10 zones):**
- zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
- zone4 → New London | zone5 → Hartford | zone6 → Hartford
- zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
- zone10 → New Haven

**Massachusetts (15 zones):**
- zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
- zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
- zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
- zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
- zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket

**New Hampshire (6 zones):**
- zone1 → Coos, Grafton | zone2 → Carroll, Belknap
- zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
- zone5 → Hillsborough | zone6 → Rockingham

**Rhode Island (4 zones):**
- zone1 → Providence | zone2 → Kent, Bristol
- zone3 → Washington | zone4 → Newport

**Maine (7 zones):**
- zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
- zone4 → Androscoggin, Oxford, Franklin
- zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
- zone7 → Hancock, Washington, Waldo, Aroostook

## Upsert Logic

When storing scraped data, the crawler:

1. Matches existing records by `(name, state, county_id)` or `(name, state, zone)`
2. **Skips** records where `company_id IS NOT NULL` (vendor-managed prices take priority)
3. **Updates** if the price or county_id has changed
4. **Inserts** a new record if no match exists

## Scheduling

The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:

```bash
# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app
```

## Logging

Logs to `oil_scraper.log` in the working directory. Level: INFO.

```
2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
```
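
## Appendix: Illustrative Sketches

The zone counts and URL patterns documented above fully determine the 46 scraped URLs. The sketch below shows how they expand; `SITE_CONFIGS` and `build_urls` are illustrative names chosen for this example, not the actual contents of `config.py`:

```python
# Sketch: expand the documented URL patterns and zone counts into concrete URLs.
# SITE_CONFIGS / build_urls are hypothetical names, not the real config.py API.
SITE_CONFIGS = {
    "https://www.newenglandoil.com/{state}/zone{n}.asp?type=0": {
        "connecticut": 10,
        "massachusetts": 15,
        "newhampshire": 6,
        "rhodeisland": 4,
        "vermont": 4,
    },
    # MaineOil.com has no state path segment; {state} is simply unused here.
    "https://www.maineoil.com/zone{n}.asp?type=0": {
        "maine": 7,
    },
}

def build_urls() -> "list[str]":
    """Expand every (state, zone) pair into a concrete scrape URL."""
    urls = []
    for pattern, states in SITE_CONFIGS.items():
        for state, zone_count in states.items():
            for n in range(1, zone_count + 1):
                urls.append(pattern.format(state=state, n=n))
    return urls

print(len(build_urls()))  # 46: 39 NewEnglandOil pages + 7 MaineOil pages
```

Keeping the per-site zone counts in one config dict means adding a new zone or state is a one-line change, which matches the README's note that zone mappings live in `config.py`.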
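The four upsert rules can also be shown as plain Python. This is a minimal in-memory sketch of the decision flow only; the real `db_operations.py` works against the `oil_prices` table via SQLAlchemy, and `upsert_record` and the dict-based "table" are assumptions made for illustration:

```python
# Sketch: the documented upsert rules applied to an in-memory list of dicts.
# This is NOT the real db_operations.py; it only illustrates the decision order.
def upsert_record(table: list, scraped: dict) -> str:
    """Apply the upsert rules to one scraped row.

    Returns one of "inserted", "skipped", "updated", "unchanged".
    """
    # Rule 1: match by (name, state, county_id) or (name, state, zone).
    match = next(
        (r for r in table
         if r["name"] == scraped["name"] and r["state"] == scraped["state"]
         and (r.get("county_id") == scraped.get("county_id")
              or r.get("zone") == scraped.get("zone"))),
        None,
    )
    if match is None:
        table.append(dict(scraped))          # Rule 4: no match -> insert
        return "inserted"
    if match.get("company_id") is not None:  # Rule 2: vendor-managed -> skip
        return "skipped"
    if (match["price"] != scraped["price"]
            or match.get("county_id") != scraped.get("county_id")):
        match["price"] = scraped["price"]    # Rule 3: changed -> update
        match["county_id"] = scraped.get("county_id")
        return "updated"
    return "unchanged"
```

Note the ordering: the vendor-managed check (rule 2) runs before the change check (rule 3), so a crawler run never overwrites a price that a vendor manages directly.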