
NewEnglandBio Fuel Price Crawler

Python scraper that collects heating oil prices from NewEnglandOil.com and MaineOil.com and stores them in PostgreSQL. Runs as a batch job (no HTTP server).

Tech Stack

  • Language: Python 3.9+
  • HTTP: requests + BeautifulSoup4
  • Database: SQLAlchemy + psycopg2 (PostgreSQL)
  • Deployment: Docker

Project Structure

crawler/
├── run.py                      # CLI entry point (initdb / scrape)
├── database.py                 # SQLAlchemy engine and session config
├── models.py                   # ORM models (OilPrice, County, Company)
├── fuel_scraper.py             # Legacy monolithic scraper (deprecated)
├── fuel_scraper/               # Modular package (use this)
│   ├── __init__.py             # Exports main()
│   ├── config.py               # Site configs, zone-to-county mappings, logging
│   ├── http_client.py          # HTTP requests with browser User-Agent
│   ├── parsers.py              # HTML table parsing for price extraction
│   ├── scraper.py              # Main orchestrator
│   └── db_operations.py        # Upsert logic for oil_prices table
├── test.py                     # HTML parsing validation
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env

URLs Scraped

The crawler hits these external websites to collect price data:

NewEnglandOil.com (5 states)

URL pattern: https://www.newenglandoil.com/{state}/{zone}.asp?type=0

State           Zones          Example URL
Connecticut     zone1-zone10   https://www.newenglandoil.com/connecticut/zone1.asp?type=0
Massachusetts   zone1-zone15   https://www.newenglandoil.com/massachusetts/zone1.asp?type=0
New Hampshire   zone1-zone6    https://www.newenglandoil.com/newhampshire/zone1.asp?type=0
Rhode Island    zone1-zone4    https://www.newenglandoil.com/rhodeisland/zone1.asp?type=0
Vermont         zone1-zone4    https://www.newenglandoil.com/vermont/zone1.asp?type=0

MaineOil.com (1 state)

URL pattern: https://www.maineoil.com/{zone}.asp?type=0

State   Zones         Example URL
Maine   zone1-zone7   https://www.maineoil.com/zone1.asp?type=0

Total: 46 pages scraped per run (39 NewEnglandOil.com zones + 7 MaineOil.com zones).
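The full URL list follows mechanically from the zone counts above. A minimal sketch (zone counts taken from the tables above; the dict and function names are illustrative, not the actual config.py names):

```python
# Zone counts per state, from the tables above.
NEW_ENGLAND_OIL_ZONES = {
    "connecticut": 10,
    "massachusetts": 15,
    "newhampshire": 6,
    "rhodeisland": 4,
    "vermont": 4,
}
MAINE_OIL_ZONES = 7

def build_urls():
    """Build every page URL fetched in one scrape run."""
    urls = []
    for state, count in NEW_ENGLAND_OIL_ZONES.items():
        for zone in range(1, count + 1):
            urls.append(f"https://www.newenglandoil.com/{state}/zone{zone}.asp?type=0")
    for zone in range(1, MAINE_OIL_ZONES + 1):
        urls.append(f"https://www.maineoil.com/zone{zone}.asp?type=0")
    return urls
```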

Each page contains an HTML table with columns: Company Name, Price, Date. The parser extracts these and maps zones to counties using the config.
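A minimal sketch of that extraction step (the real selectors live in fuel_scraper/parsers.py; the three-column table shape below is an assumption, and the live site's markup may differ):

```python
from bs4 import BeautifulSoup

def parse_price_table(html):
    """Extract (company, price, date) tuples from a price-table page.

    Assumes a simple three-column table: Company Name, Price, Date.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 3:
            continue  # skip header and spacer rows
        name, price, date = cells
        try:
            rows.append((name, float(price.lstrip("$")), date))
        except ValueError:
            continue  # non-numeric price cell
    return rows
```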

How to Run

CLI Usage

# Initialize database tables
python3 run.py initdb

# Run the scraper
python3 run.py scrape

Docker

# Build
docker-compose build

# Run scraper (default command)
docker-compose run app

# Initialize database via Docker
docker-compose run app python3 run.py initdb

# Both in sequence
docker-compose run app python3 run.py initdb && docker-compose run app

Curl the Scraped Data

The crawler itself does not serve HTTP endpoints. After scraping, the data is available through the Rust API (port 9552):

# Get oil prices for a specific county
curl http://localhost:9552/oil-prices/county/1

# Get oil prices for Suffolk County (MA) — find county_id first
curl http://localhost:9552/state/MA
# Then use the county_id from the response
curl http://localhost:9552/oil-prices/county/5

Response format:

[
  {
    "id": 1234,
    "state": "Massachusetts",
    "zone": 1,
    "name": "ABC Fuel Co",
    "price": 3.29,
    "date": "01/15/2026",
    "scrapetimestamp": "2026-01-15T14:30:00Z",
    "county_id": 5
  }
]
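Picking the cheapest vendor out of that response is a one-liner; a hypothetical helper, assuming only the JSON shape shown above:

```python
def cheapest(prices):
    """Return the lowest-priced record from an /oil-prices/county response."""
    return min(prices, key=lambda p: p["price"])

records = [
    {"name": "ABC Fuel Co", "price": 3.29, "county_id": 5},
    {"name": "XYZ Oil", "price": 3.19, "county_id": 5},
]
# cheapest(records)["name"] -> "XYZ Oil"
```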

Query the Database Directly

# All prices for Massachusetts
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT name, price, date, county_id FROM oil_prices WHERE state='Massachusetts' ORDER BY price;"

# Latest scrape timestamp
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT MAX(scrapetimestamp) FROM oil_prices;"

# Prices by county with county name
psql postgresql://postgres:password@192.168.1.204:5432/fuelprices \
  -c "SELECT c.name AS county, o.name AS company, o.price
      FROM oil_prices o JOIN county c ON o.county_id = c.id
      WHERE c.state='MA' ORDER BY o.price;"

Environment

Create .env:

DATABASE_URL=postgresql://postgres:password@192.168.1.204:5432/fuelprices
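database.py presumably reads this variable from the environment; a hedged sketch of that lookup (the fallback default is illustrative, not necessarily what the actual code does):

```python
import os

# Fall back to a local default when .env / the environment is not set.
DATABASE_URL = os.environ.get(
    "DATABASE_URL",
    "postgresql://postgres:password@localhost:5432/fuelprices",
)
```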

Zone-to-County Mapping

Each scraping zone maps to one or more counties:

Connecticut (10 zones):

  • zone1 → Fairfield | zone2 → New Haven | zone3 → Middlesex
  • zone4 → New London | zone5 → Hartford | zone6 → Hartford
  • zone7 → Litchfield | zone8 → Tolland | zone9 → Windham
  • zone10 → New Haven

Massachusetts (15 zones):

  • zone1 → Berkshire | zone2 → Franklin | zone3 → Hampshire
  • zone4 → Hampden | zone5 → Worcester | zone6 → Worcester
  • zone7 → Middlesex | zone8 → Essex | zone9 → Suffolk
  • zone10 → Norfolk | zone11 → Plymouth | zone12 → Bristol
  • zone13 → Barnstable | zone14 → Dukes | zone15 → Nantucket

New Hampshire (6 zones):

  • zone1 → Coos, Grafton | zone2 → Carroll, Belknap
  • zone3 → Sullivan, Merrimack | zone4 → Strafford, Cheshire
  • zone5 → Hillsborough | zone6 → Rockingham

Rhode Island (4 zones):

  • zone1 → Providence | zone2 → Kent, Bristol
  • zone3 → Washington | zone4 → Newport

Maine (7 zones):

  • zone1 → Cumberland | zone2 → York | zone3 → Sagadahoc, Lincoln, Knox
  • zone4 → Androscoggin, Oxford, Franklin
  • zone5 → Kennebec, Somerset | zone6 → Penobscot, Piscataquis
  • zone7 → Hancock, Washington, Waldo, Aroostook
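In config.py this mapping can be represented as a plain dict keyed by (state, zone); a sketch using a few of the entries above (the structure and names are assumptions, only the county data comes from the lists):

```python
# (state, zone) -> list of counties; multi-county zones map to several.
ZONE_TO_COUNTY = {
    ("Connecticut", 1): ["Fairfield"],
    ("Connecticut", 2): ["New Haven"],
    ("New Hampshire", 1): ["Coos", "Grafton"],
    ("New Hampshire", 2): ["Carroll", "Belknap"],
    ("Maine", 3): ["Sagadahoc", "Lincoln", "Knox"],
    # ... remaining zones follow the lists above
}

def counties_for(state, zone):
    """Look up the counties for a zone; empty list if unmapped."""
    return ZONE_TO_COUNTY.get((state, zone), [])
```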

Upsert Logic

When storing scraped data, the crawler:

  1. Matches existing records by (name, state, county_id) or (name, state, zone)
  2. Skips records where company_id IS NOT NULL (vendor-managed prices take priority)
  3. Updates if the price or county_id has changed
  4. Inserts a new record if no match exists
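The four steps above can be sketched as a single decision function (simplified; the real logic lives in fuel_scraper/db_operations.py and works against SQLAlchemy sessions, not plain dicts):

```python
def upsert_price(existing_rows, scraped):
    """Apply the upsert rules to one scraped record.

    `existing_rows` stands in for the current oil_prices rows (as dicts).
    Returns the action taken: 'skip', 'update', or 'insert'.
    """
    match = next(
        (r for r in existing_rows
         if r["name"] == scraped["name"] and r["state"] == scraped["state"]
         and (r.get("county_id") == scraped.get("county_id")
              or r.get("zone") == scraped.get("zone"))),
        None,
    )
    if match is None:
        existing_rows.append(dict(scraped))   # step 4: no match -> insert
        return "insert"
    if match.get("company_id") is not None:
        return "skip"                         # step 2: vendor-managed price wins
    if (match["price"] != scraped["price"]
            or match.get("county_id") != scraped.get("county_id")):
        match["price"] = scraped["price"]     # step 3: changed -> update
        match["county_id"] = scraped.get("county_id")
        return "update"
    return "skip"                             # unchanged
```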

Scheduling

The crawler has no built-in scheduler. Run it via cron or Unraid's User Scripts:

# Cron: run daily at 2 AM
0 2 * * * cd /mnt/code/tradewar/crawler && docker-compose run app

Logging

Logs to oil_scraper.log in the working directory. Level: INFO.

2026-01-15 14:30:00 - INFO - [scraper.py:42] - Scraping Massachusetts zone1...
2026-01-15 14:30:01 - INFO - [db_operations.py:28] - Upserted 15 records for Massachusetts zone1
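The log lines above suggest a configuration along these lines (a sketch inferred from the output format; the actual setup lives in fuel_scraper/config.py):

```python
import logging

# Format matching the sample lines: timestamp - level - [file:line] - message
logging.basicConfig(
    filename="oil_scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s",
)
log = logging.getLogger(__name__)
```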