Analyzing Utah’s Air Quality – Connecting to the EPA’s AQS Data API

Living in a mountain valley is kind of like living in a soup bowl: all the heavy stuff seems to collect at the bottom. I want to say that many of the valleys in Utah are referred to as horst and graben, although I’m sure some geologist could correct me. Regardless, being surrounded on all sides by mountains means that air pollution tends to collect and get stuck in the bottom of the bowl.

Utah Valley from Woodland Hills

As a lifelong Utahn, I began to wonder: how bad is the pollution? The news reporters seem to think it’s pretty bad. The politicians say it’s never been better. So how bad is it? What impact does it have on things like real estate values? How many people are impacted?

To help answer some of these questions, we have partnered with Randy Zwitch, Senior Developer Advocate at MapD. Through this partnership, we hope to better understand Utah’s air quality and its impact, and as we learn, we will share our process with you.

This partnership will result in a multi-part series of blog posts outlining our process and learnings.

Analyzing Utah’s Air Quality

Let’s get started…

Register for an Account on EPA.gov

We are grabbing our air quality data from the Environmental Protection Agency. The data is made freely available; the only requirement is to create an account, which we will use to access the air quality data API.

To create a new account, head over to https://aqs.epa.gov/signup. The only piece of information you need to supply is an email address. After submitting the form with your email address, you will be emailed a password.


Familiarize Yourself with API Parameters and Data

After you have received your API password, you can start querying the air quality data right away using a web-based query form.

EPA Air Quality Web-Based Query Tool

Using this web-based query tool is a quick way to familiarize yourself with the types of data available, the parameters you can use to select your desired data, and the overall data output format.


Determine the Data You Want for Your Analysis

There is a lot of air quality data available through the API; as you experiment with the web-based query tool, you can start to understand which dataset best fits your interests. For our analysis, we are using the following parameters:

  • AQI Pollutants: This dataset contains all of the pollutants used to calculate the Air Quality Index, which you may know better as the pollution color scale (e.g., “Today our air pollution is RED, please carpool.”).

  • Parameter Code: We aren’t supplying a parameter code because we want to evaluate all AQI-related pollutants. However, if you were only interested in, say, Ozone, you could limit your query by passing in the Ozone parameter code (44201 – Ozone).

  • State Code: For this analysis, we are interested in Utah (49 – Utah).

  • County Code: We want to retrieve air quality data for all counties in Utah; however, leaving this parameter blank causes the API call to fail, so we will need to request each county’s dataset individually. More on that in the next step.


Migrate from Web Form to Programmatic API Calls

Once you have gained an understanding of the data and have an idea of how you want to construct your query, you can move from the web-based form to the programming language of your choice to retrieve, munge, clean, and transfer your data. For this example, we will be using Python.

For a detailed set of documentation on how to interact with the API, refer to https://aqs.epa.gov/aqsweb/documents/ramltohtml.html

Example Python Script
Project Directory Structure

├── project_code.py
├── _out
│   └── output_files.csv
└── src
    ├── __init__.py
    ├── config.py
    └── county.py

# Import libraries
import pandas as pd
import io
import requests

# Running the query for an entire state exceeds API size limits so we need to query by county
# For the defined stateName, loop through counties and execute API call for each county
df = pd.DataFrame()

from src import county
from src import config

#Get requested year from user input
year = raw_input("Year to process? ")

# Define Start and End Dates
config.bDate = year + '0101'
config.eDate = year + '1231'

for key in county.counties[config.stateName]:
    #Get current countyCode from county.py
    config.countyCode = county.counties[config.stateName][key]
    
    #Build API request URL
    requestURL = config.apiURL + 'user=' + config.apiUser + '&pw=' + config.apiPassword + '&format=' + config.outputFormat \
    + '&pc=' + config.aqsClass + '&bdate=' + config.bDate + '&edate=' + config.eDate + '&state=' + config.stateCode \
    + '&county=' + config.countyCode

    #Make API Request
    apiResp = requests.get(requestURL)
    
    #Read API Response into a DataFrame
    aqs_df = pd.read_csv(io.StringIO(apiResp.content.decode('utf-8')))

    #Concatenate response DataFrame (aqs_df) into our master DataFrame (df)
    df = pd.concat([df,aqs_df])

outputName = '_out/' + year + '_raw.csv'
df.to_csv(outputName, sep=',', index=False) 

Let’s break down what is happening in this example:

Step 1: Import Python Libraries

# Import libraries
import pandas as pd
import io
import requests

pandas: As the data comes in from the API, we will make use of pandas to store the data in a DataFrame. Later on, we will make use of a few other pandas functions as we manipulate the data.

io: We will make use of the io library to wrap the decoded data returned from the API in a file-like object that pandas can read.

requests: The Requests library will be used to make the API request to the EPA.gov servers.

Step 2: Create a Pandas DataFrame

df = pd.DataFrame()

We will create an empty DataFrame to store all of the API responses.
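As an aside, calling pd.concat inside a loop re-copies the accumulated data on every iteration. For larger pulls, a common alternative pattern (a sketch of ours, not the original script) collects each county’s DataFrame in a list and concatenates once at the end:

#Alternative pattern: collect frames in a list, concatenate once
frames = []
for key in county.counties[config.stateName]:
    # ... fetch aqs_df for this county, as in the full script ...
    frames.append(aqs_df)

df = pd.concat(frames, ignore_index=True)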

Step 3: Import Configuration Data

from src import county
from src import config

As previously mentioned, we can’t request the data for an entire state, so we need an efficient way to request the data county by county. In order to make the code more scalable, we will make use of county.py to retrieve a list of counties to process. While we are looking only at Utah here, the code could easily be expanded to process any state.

county.py
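Since the county.py screenshot may be hard to read, here is a minimal sketch of the structure the main script expects: a counties dictionary keyed by state name, mapping each county name to its county FIPS code. The entries below are illustrative examples, not the full file.

#county.py -- minimal sketch; the real file lists every county for each state
counties = {
    'utah': {
        'Box Elder': '003',
        'Cache': '005',
        'Davis': '011',
        'Salt Lake': '035',
        'Utah': '049',
        'Weber': '057',
    }
}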

The basic configuration details that will be used to construct the API call are contained in a file called config.py. This file operates as a basic configuration file, and any details you want to abstract from your main project code can live here.

config.py
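Likewise, here is a minimal sketch of config.py, assuming the fields referenced by the main script; the values are placeholders modeled on the sample API string shown later in this post.

#config.py -- minimal sketch; all values are placeholders
apiURL = 'https://aqs.epa.gov/api/rawData?'  #base endpoint, ends with '?'
apiUser = 'you@example.com'                  #your registered email
apiPassword = 'XXXXXXXXXXXXX'                #password emailed to you by the EPA
outputFormat = 'DMCSV'                       #requested output format
aqsClass = 'AQI POLLUTANTS'                  #parameter class to pull
stateName = 'utah'                           #key into county.counties
stateCode = '49'                             #FIPS code for Utah
countyCode = ''                              #set per county by the main loop
bDate = '20170101'                           #overwritten from user input
eDate = '20171231'                           #overwritten from user input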

Step 4: Loop Through Each County for Defined State
Now we need to loop through each county for the state we are interested in analyzing.

for key in county.counties[config.stateName]:

This is how we will define our loop. Using the list of counties contained in county.py, we will step through each county name in the list for our state (as defined in config.py). For us, config.stateName = 'utah'.

Step 5: Build the API Call
Within our counties loop, we will construct an API call to retrieve the air quality data for a given State-County combo.

    #Get current countyCode from county.py
    config.countyCode = county.counties[config.stateName][key]

    #Build API request URL
    requestURL = config.apiURL + 'user=' + config.apiUser + '&pw=' + config.apiPassword + '&format=' + config.outputFormat \
    + '&pc=' + config.aqsClass + '&bdate=' + config.bDate + '&edate=' + config.eDate + '&state=' + config.stateCode \
    + '&county=' + config.countyCode


Here we are simply building a string that will then be used to execute the API call. The API connection details, such as apiURL and apiUser, are defined in config.py.

Sample Constructed API String

https://aqs.epa.gov/api/rawData?user=jason@33sticks.com&pw=XXXXXXXXXXXXX28&format=DMCSV&pc=AQI POLLUTANTS&bdate=20160101&edate=20161231&state=49&county=003
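Note that the pc value contains a literal space. An alternative that handles the URL encoding for you (a sketch of ours, not the original script) is to pass the query parameters as a dictionary and let requests build the URL:

#Alternative: let requests construct and URL-encode the query string
params = {
    'user': config.apiUser,
    'pw': config.apiPassword,
    'format': config.outputFormat,
    'pc': config.aqsClass,
    'bdate': config.bDate,
    'edate': config.eDate,
    'state': config.stateCode,
    'county': config.countyCode,
}
apiResp = requests.get('https://aqs.epa.gov/api/rawData', params=params)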


The basic query parameters, such as aqsClass and stateCode, are also defined in config.py.

countyCode is the code for the current county being processed by the loop.

The start (bdate) and end (edate) dates for your requested dataset can also be hard-coded into config.py if you wish, by simply adding two additional lines as shown:

#Date range using yyyymmdd format
bdate = '20180101'
edate = '20180102'

I wanted to make the script a little easier to run, requesting data for a given year without having to update the config file each time, so I simply coded a user input to grab the desired year, as shown:

#Get requested year from user input
year = raw_input("Year to process? ")

# Define Start and End Dates
config.bDate = year + '0101'
config.eDate = year + '1231'
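One caveat: raw_input is Python 2 syntax. If you are running the script under Python 3, the equivalent call is input:

#Python 3 equivalent of the raw_input call above
year = input("Year to process? ")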


Step 6: Make API Request and Handle Results
We will make use of the requests library to send the API request, using the string we constructed in the previous step.

    apiResp = requests.get(requestURL)

We will then store the response in a Pandas DataFrame, aqs_df.

    aqs_df = pd.read_csv(io.StringIO(apiResp.content.decode('utf-8')))
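As a small aside, requests can decode the response body for you via the text attribute, so an equivalent form of this line (assuming the response encoding is reported correctly) is:

    #Equivalent: requests decodes the response body for you
    aqs_df = pd.read_csv(io.StringIO(apiResp.text))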

We will then concatenate the response DataFrame into our master DataFrame. Remember, we are looping through each county for a given state, so we need to handle the results and build a DataFrame that contains all of our data, for every county in the state.

    df = pd.concat([df,aqs_df])
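The loop above assumes every request succeeds. As a defensive sketch (our addition, not part of the original script), you could check the HTTP status code and skip any county whose request fails:

    #Defensive sketch: skip counties whose request fails rather than crashing
    apiResp = requests.get(requestURL)
    if apiResp.status_code != 200:
        print('Request failed for county ' + config.countyCode)
        continue

    aqs_df = pd.read_csv(io.StringIO(apiResp.content.decode('utf-8')))
    df = pd.concat([df, aqs_df])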


Step 7: Output Full Set of Results
Finally, after we have made an API request for each county in the state and combined the responses into our master DataFrame, df, we can output the results to a CSV file.

While there is additional cleanup and work we will do in Python, we wanted to quickly get the output data into MapD to ensure the format would be ideal before completing any additional work (these additional calculation and cleanup steps will be featured in a future post).

outputName = '_out/' + year + '_raw.csv'
df.to_csv(outputName, sep=',', index=False) 
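One assumption here is that the _out directory already exists; to_csv will raise an error otherwise. A small guard (our addition, not in the original script) creates it first:

import os

#Create the output directory if it doesn't already exist
if not os.path.exists('_out'):
    os.makedirs('_out')

outputName = '_out/' + year + '_raw.csv'
df.to_csv(outputName, sep=',', index=False)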

The next post in this series will focus on cleaning the data from the API, using the data to calculate an Air Quality Index (AQI) score, and exporting the data for import into MapD, which we will use to further analyze the data and to create interactive data visualizations.

About Randy Zwitch and MapD

Randy Zwitch is an accomplished data science and analytics professional with broad industry experience in big data and data science. He is an open-source contributor in the R and Julia programming language communities, with specialties in big data, data visualization, relational databases, predictive modeling and machine learning, and data engineering.

Randy is the Senior Developer Advocate at MapD and is a highly sought after speaker and lecturer.
MapD is an extreme analytics platform, used in business and government to find insights in data beyond the limits of mainstream analytics tools. The MapD platform delivers zero-latency querying and visual exploration of big data, dramatically accelerating operational analytics, data science, and geospatial analytics.

Originating from research at MIT, MapD is a technology breakthrough, harnessing the massive parallel computing of GPUs for data analytics. MapD Technologies Inc. is headquartered in San Francisco, and the platform is available globally via open source and enterprise license options.


About Jason Thompson

Jason is the co-founder and CEO of 33 Sticks. In addition to being an amateur chef and bass player, he can also eat large amounts of sushi. As an analytics and optimization expert, he brings over 15 years of data experience, going back to being part of the original team at Omniture.

