Analyzing Utah’s Air Quality
- Part 1: Connecting to the EPA’s AQS Data API
- Part 2: Cleaning and Transforming AQS Data
- Part 3: Using Shapefiles and Assigning AQI Stations in MapD
- Part 4: Building the Utah AQI Dashboard in MapD
- Part 5: Final Analysis: Air Quality Findings
In parts 1 and 2 of this series, Jason demonstrated the power of open-source tools (Python) and open data (the EPA Air Quality API) in allowing everyday citizens to answer questions of interest to them. Downloading data from a government API and calculating customized summary statistics using pandas are now so common that I budget time for this in a great majority of my analytics projects.
That said, the difference between reporting and analytics is often in combining different data sources, gaining insight that no one data source possesses individually. In this blog post, I’ll show how to use Postgres with the Postgis extension to calculate the closest air quality station to each parcel in Utah.
Combining Shapefiles Into Single Table
In order to analyze the economic impact of air pollution, we used the Land Information Record (LIR) data provided by the Utah state government. This data provides parcel boundaries as (multi-)polygons for over 1 million properties across 25 Utah counties, along with properties of the parcel such as land acreage, estimated value, number of buildings, etc.
Because each county updates their records on a different schedule, the data are provided as 25 shapefiles. For ease of manipulation, we combined the 25 shapefiles into a single table representing Utah as a whole. After unzipping the shapefiles and building a Postgres table to insert the data into (link to full code), we used shp2pgsql
to read the shapefiles and insert the data into Postgres:
for i in $(find utah_lir_shapefiles_unzipped/ -type f -name '*.shp'); do shp2pgsql -I -s 26912 -a $i utahlirparcels | psql -h localhost -U; done;
Inquisitive users might be asking “Why didn’t you do this in MapD?” Quite simply, the necessary geospatial features in MapD hadn’t been publicly released when I started the project! Postgis itself is a fantastic open-source tool from which I take a lot of inspiration and very complementary to MapD.
Calculating Closest Air Quality Station To Each Parcel
With the parcels data nicely formatted, we still have to solve for the main question of ‘What’s the air quality at a given parcel?’ The air quality data obtained from the EPA shows there are 58 air quality stations in Utah, but not all stations are equally spaced (most are around Salt Lake, Logan and St. George areas):
It’s also the case that not all stations measure the same pollutants. With that in mind, we calculated the closest station for each pollutant as follows:
1. Create a Station Primary Key
From the daily air quality data Jason pulled, we can can do a group by query to get a de-duplicated list of stations with their countycode
and stationkey
and coordinates. An additional column is added as a Primary Key using row_number()
in Postgres, so that countycode
and stationkey
don’t have to be used as the join keys:
create table aqistations as select innert.*, ST_SetSRID(ST_Point(longitude, latitude),4326) as aqi_pgis_point from (select row_number() over() as stationkey, countycode, sitenum, latitude, longitude, max(case when aqi_particulatematter0_10 is not null then 1 else 0 end) as particulatematter0_10, max(case when aqi_ozone is not null then 1 else 0 end) as ozone, max(case when aqi_carbonmonoxide is not null then 1 else 0 end) as carbonmonoxide, max(case when aqi_nitrogendioxide is not null then 1 else 0 end) as nitrogendioxide, max(case when aqi_particulatematter2pt5 is not null then 1 else 0 end) as particulatematter2pt5, max(case when aqi_sulferdioxide is not null then 1 else 0 end) as sulferdioxide from utahweathertransposed group by 2,3,4,5) as innert;
Also calculated are attributes for whether a given weather station has ever reported a given pollutant. While this won’t guarantee a pollutant value for every single day, it simplifies the attribution of a air quality station to a parcel. One alternative would’ve been to take the closest parcel for a day (i.e. possibly a different station each day), but there were concerns about how far a reported station might be away from the parcel.
2. Perform Distance Calculations With ST_DISTANCE and Window Functions
To do the distance calculations between the parcel data (polygons) and the air quality stations (points), we can use ST_DISTANCE
from Postgis. This function takes two geography
types in Postgis and calculates the minimum distance (in meters) between them. The minimum part is necessary, as the parcel data represents a solid area from which many points could be chosen.
create table closest_aqi_ozone as select * from (with parcels as (select gid, geom4326 from utahlirparcels) SELECT parcels.gid, points.stationkey, ST_Distance(parcels.geom4326::geography, points.aqi_pgis_point::geography)/1000 as distance_km, row_number() over (partition by gid order by ST_Distance(parcels.geom4326::geography, points.aqi_pgis_point::geography)) as station_rank FROM parcels inner join aqistations as points ON ST_DWithin(parcels.geom4326::geography, points.aqi_pgis_point::geography, 200000) where ozone = 1) innert where station_rank = 1;
This query might seem more complex than it really is. The inner query joins all the data parcels to all the air quality station points. The row_number()
function is a Postgres Window function, which orders the data by the calculated distance. The outer query picks the closest station (i.e. where row_number()
= 1).
The end result of this two-part process assigns a primary key to the air quality stations, then calculates the closest station, placing the key for that station in the parcels table. From here, we can import the data into MapD and start our analytics.
Export Postgres/Import to MapD
Exporting data from Postgres and importing to MapD can be done many ways; I chose to use plain-text files, as the data sizes are relatively small and export syntax is straightforward:
-- export data to csv from postgis PGPASSWORD=password psql -h localhost -U mapd gisdata -c "\Copy (select *, ST_ASTEXT(geom4326) as geo from utahlirparcels_with_keys) To STDOUT With CSV HEADER DELIMITER ',';" >> utahlirparcels_with_keys.tsv PGPASSWORD=password psql -h localhost -U mapd gisdata -c "\Copy (select * from utahdailyweather_withkeys) To STDOUT With CSV HEADER DELIMITER ',';" >> utahdailyweather_withkeys.tsv
To import the data into MapD, I used the data import utility within the Immerse editor, which auto-created the table definitions and loads the data, all through the browser.
Checking Data Quality
Although it’s tempting at this point to jump right in and start analyzing, it’s always good to do a data sanity check. I created a simple dashboard which plotted the parcels and air quality stations on a map, showed the timeseries’ of pollutant measurements and showed avg. ozone by month and year. All the data seems reasonable at first glance:
Next Step…Time For Analytics!
If you’ve made it this far, you’ve read through how Jason used Python to access the EPA Air Quality API and read through how I calculated the nearest air quality station to every parcel in Utah. The next two blog posts will show the creation of an interactive dashboard in MapD Immerse and how Jason and I interpret what we see.
[author] Randy Zwitch [author_image timthumb=’on’]https://33sticks.com/wp-content/uploads/2018/07/randy-150×150.jpeg[/author_image] [author_info]Randy is an accomplished Data Science and Analytics professional with broad industry experience in big data and data science. An open-source contributor in the R and Julia programming language communities with specialties Big Data, Data Visualization, Relational Databases, Predictive Modeling and Machine Learning, Data Engineering. Randy is the Senior Developer Advocate at MapD and is a highly sought after speaker and lecturer. [/author_info] [/author]
Leave a comment