Data exploration using Matplotlib and Seaborn
In this article, I’ll explore various environmental remediation sites located in New York and visualize the information contained in the dataset. The data is hosted by the State of New York and is available on Kaggle. A first look at the data shows the following:
- There are 70,324 sites in the dataset.
- Each site record has 42 columns, covering the site, the program, the wastes disposed and more, along with location details such as addresses and ZIP codes.
- Some columns, such as Address 2 and Waste Name, contain null values.
- The columns hold integer, float and object (string) values.
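The facts in this list come from a handful of pandas calls. Here is a minimal sketch of how they could be collected, assuming the CSV has already been loaded into a DataFrame. The `summarize` helper is my own illustration, and the tiny sample frame contains made-up rows (with partly guessed column names) rather than actual records:

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect row/column counts, columns with nulls, and the dtypes in use."""
    null_counts = df.isna().sum()
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "columns_with_nulls": sorted(null_counts[null_counts > 0].index),
        "dtypes": sorted({str(t) for t in df.dtypes}),
    }


# Made-up sample standing in for the real CSV (column names are assumptions).
sample = pd.DataFrame({
    "Program Type": ["HW", "BCP", "HW"],
    "Address 2": [None, "Suite 4", None],
    "Waste Name": ["lead", None, None],
    "Latitude": [42.6, 40.7, 43.1],
    "ZIP Code": [12207, 10001, 14604],
})

summary = summarize(sample)
print(summary)
```

On the real data, the same helper can be called on the frame returned by `pd.read_csv`, and `df.info()` prints a similar overview in one go.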
There are a total of 5 different program types in the dataset:
- HW — State Superfund Program
- BCP — Brownfield Cleanup Program
- VCP — Voluntary Cleanup Program
- ERP — Environmental Restoration Program
- RCRA — Hazardous Waste Management Program
The most common program type is the State Superfund Program. The least common is the Hazardous Waste Management Program, which accounts for a negligible share of the sites.
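A ranking like this can be read straight off a frequency count. The sketch below assumes the program codes live in a column called `Program Type` (a guess on my part) and uses a few made-up rows. The code-to-name mapping is taken from the list above:

```python
import pandas as pd

# Program codes and names as listed in the article.
PROGRAM_NAMES = {
    "HW": "State Superfund Program",
    "BCP": "Brownfield Cleanup Program",
    "VCP": "Voluntary Cleanup Program",
    "ERP": "Environmental Restoration Program",
    "RCRA": "Hazardous Waste Management Program",
}


def program_counts(df: pd.DataFrame, column: str = "Program Type") -> pd.Series:
    """Count records per program type, most common first, with readable labels."""
    counts = df[column].value_counts()
    return counts.rename(index=PROGRAM_NAMES)


# Made-up rows for illustration only; "Program Type" is an assumed column name.
sample = pd.DataFrame({"Program Type": ["HW", "HW", "HW", "BCP", "BCP", "RCRA"]})
counts = program_counts(sample)
print(counts)
```

With Seaborn, `sns.countplot(data=df, x="Program Type")` draws the same counts as bars directly from the raw column.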
The class/status of each site is identified using an alpha-numeric code as described below:
- 02 — The disposal of hazardous waste represents a significant threat to the environment or to health
- 03 — Contamination does not presently constitute a significant threat to the environment or to health
- 04 — The site has been properly closed but requires continued site management
- 05 — No further action required
- A — Active
- C — Completed
- P — Sites where preliminary information indicates that a site may have contamination
- PR — Sites that are, or have been, subject to the requirements of the RCRA
- N — No further action
While each class is clearly described, some appear redundant, such as using both 05 and N to indicate that nothing more needs to be done. However, let’s take a look at the plot and see if it reveals any information.
One thing clearly stands out in the distribution: many site projects have been completed.
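For reference, a bar chart of the class codes can be drawn with plain Matplotlib. This is only a sketch, not necessarily how the original plot was made. The column name `Site Class` and the sample rows are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_class_counts(df: pd.DataFrame, column: str = "Site Class"):
    """Draw one bar per class code and return the Matplotlib Axes."""
    # Cast to str so numeric-looking codes ("02") and letters ("A") sort together.
    counts = df[column].astype(str).value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(counts.index.tolist(), counts.to_numpy())
    ax.set_xlabel(column)
    ax.set_ylabel("Number of records")
    ax.set_title("Records per site class")
    return ax


# Made-up rows for illustration only; "Site Class" is an assumed column name.
sample = pd.DataFrame({"Site Class": ["C", "C", "C", "A", "02", "N"]})
ax = plot_class_counts(sample)
heights = [int(p.get_height()) for p in ax.patches]
```

Sorting by code rather than by count keeps related codes such as 05 and N easy to compare side by side.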
Project Completion Date
While proposed completion dates are also provided for all sites, I’m more interested in the sites that have already been closed and whether the data shows any trend.
As we can see from the line plot above, very few sites closed between 1985 and 2005, but the number grew significantly after that. The most closures occurred in 2015.
There may have been fewer closures in 1985–2005 simply because there were fewer sites in that period, with the number growing as waste production increased along with the human population.
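The yearly trend boils down to parsing the dates, dropping the ones that lie in the future (the proposed completions), and counting per year. A minimal sketch, assuming a `Project Completion Date` column with month-first date strings. The `as_of` cutoff parameter and the sample rows are my own additions for illustration:

```python
import pandas as pd


def closures_per_year(df, column="Project Completion Date", as_of=None):
    """Count completed projects per calendar year, ignoring future dates."""
    cutoff = pd.Timestamp(as_of) if as_of is not None else pd.Timestamp.today()
    # Unparseable or missing dates become NaT and are dropped by the filter.
    dates = pd.to_datetime(df[column], errors="coerce")
    completed = dates[dates <= cutoff]
    return completed.dt.year.value_counts().sort_index()


# Made-up rows for illustration only; the last date is a "proposed" completion.
sample = pd.DataFrame({
    "Project Completion Date": [
        "03/15/1990", "06/01/2015", "11/20/2015", None, "01/01/2100",
    ]
})
per_year = closures_per_year(sample, as_of="2020-01-01")
print(per_year)
```

Calling `per_year.plot()` on the result gives a line plot with the year on the x-axis, and `per_year.idxmax()` returns the year with the most closures.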
Rather than exploring the waste names, it’s more useful to know which contaminants each site has. This may allow similar solutions to be applied to sites that share the same contaminants. There are over 237 different contaminants to deal with.
Lead is the most common contaminant, present at more than 2,500 sites. The least common contaminants are pickle liquor, mineral/white spirits and calcium carbonate.
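Counting sites per contaminant needs a little care, because a site can appear in several rows. The sketch below assumes one contaminant per row in a `Contaminants` column and a site identifier in `Program Number`. Both names are guesses, and the rows are made up:

```python
import pandas as pd


def sites_per_contaminant(df, site_col="Program Number", contaminant_col="Contaminants"):
    """Count distinct sites for each contaminant, most common first."""
    rows = df[[site_col, contaminant_col]].dropna()
    # Normalize spelling so "Lead " and "lead" count as the same contaminant.
    names = rows[contaminant_col].str.strip().str.lower()
    return rows.groupby(names)[site_col].nunique().sort_values(ascending=False)


# Made-up rows for illustration only; site "A1" lists lead twice.
sample = pd.DataFrame({
    "Program Number": ["A1", "A1", "B2", "C3", "C3"],
    "Contaminants": ["Lead", "lead ", "lead", "arsenic", None],
})
by_contaminant = sites_per_contaminant(sample)
print(by_contaminant)
```

Using `nunique` on the site column instead of a plain row count keeps a site that lists the same contaminant twice from being counted twice.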
At the top level, control types are split into Institutional and Engineering controls, which are further divided into Deed Restriction, Decision Document, Environmental Notice, Environmental Easement and Other Controls.
Deed Restriction is the most common control type, accounting for approximately 45% of the dataset.
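A percentage share like this is a normalized frequency count. A short sketch, assuming the values sit in a `Control Type` column (an assumed name). The sample rows are made up, so the numbers will not match the real dataset:

```python
import pandas as pd


def control_type_share(df: pd.DataFrame, column: str = "Control Type") -> pd.Series:
    """Return each control type's share of the non-null records, in percent."""
    return df[column].value_counts(normalize=True).mul(100).round(1)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Control Type": [
        "Deed Restriction", "Deed Restriction",
        "Environmental Easement", "Decision Document",
    ]
})
shares = control_type_share(sample)
print(shares)
```

Note that `value_counts` skips null values by default, so the shares are relative to the records that have a control type at all.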
The data reveals a lot about the remediation sites, which makes it possible to group similar sites and handle them together.
Hope that you liked my work. Please feel free to share your thoughts, suggestions and ideas.
Exploring Environment Remediation Sites in New York was originally published in Towards Data Science on Medium.