The Open-FF data set is regenerated roughly every month to add new fracking disclosures and to incorporate any changes that were made to existing disclosures. The process is performed by the developers of Open-FF and is sponsored by the FracTracker Alliance.
The process has many steps, some automated and some manual. It is guided by a Jupyter notebook that includes instructions, code, and tests to validate the process at each step. The primary steps are:
- Download the materials needed:
  - the previous data repo
  - the external data sets used
  - a fresh FracFocus download
- Determine the disclosures that are new
- Search the fresh data for new CASNumbers; fetch authoritative data about them (SciFinder, CompTox)
- Search the fresh data for new IngredientNames; try to resolve each to an authoritative CASRN
- Assign a final bgCAS value to each new bgCAS:IngredientName pair
- Search for new company names and link them to other existing company names as appropriate
- Check geographic and location data - flag errors and curate any new counties
- Determine the carrier record(s) of every disclosure to facilitate mass calculations
- Search for duplicate disclosures and duplicate records; flag them
- Flag disclosures without chemicals
- Assemble chemical, disclosure, and company tables
- Apply external lists to chemical list
- Calculate mass where enough data is available
- Produce full data set
- Perform dataset-wide integrity tests
- Detect and flag the set of documented "FF_issues"
- Construct a full data repository
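
Three of the steps above (finding new disclosures, flagging duplicates, and calculating mass from the carrier record) can be illustrated with a minimal Python sketch. This is not the Open-FF code: the function names, the record layout, the duplicate key, and the 8.34 lb/gal water density are all assumptions made for illustration, and the mass calculation assumes the carrier is water whose volume and percentage of the job are both reported.

```python
# Illustrative sketch only; names and record layout are hypothetical,
# not taken from the Open-FF code base.

# Assumed density of a water carrier, in pounds per gallon.
WATER_LB_PER_GAL = 8.34


def new_disclosure_ids(previous_ids, fresh_ids):
    """Return IDs present in the fresh download but absent from the previous repo."""
    return set(fresh_ids) - set(previous_ids)


def flag_duplicates(records, key_fields):
    """Return copies of the records with an 'is_duplicate' flag.

    The first record with a given key is kept unflagged; every later
    record sharing that key is flagged rather than removed.
    """
    seen = set()
    flagged = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        flagged.append({**rec, "is_duplicate": key in seen})
        seen.add(key)
    return flagged


def ingredient_masses(water_gallons, carrier_percent, ingredient_percents):
    """Estimate ingredient masses in pounds from the carrier record.

    carrier_percent and the values of ingredient_percents are percentages
    of the total job mass. Returns None when there is not enough data.
    """
    if not water_gallons or not carrier_percent:
        return None
    carrier_mass = water_gallons * WATER_LB_PER_GAL
    total_mass = carrier_mass / (carrier_percent / 100.0)
    return {
        name: total_mass * pct / 100.0
        for name, pct in ingredient_percents.items()
    }
```

Flagging rather than deleting keeps every original record in the data set, so a user can decide later whether to filter on the flag. Returning `None` from `ingredient_masses` reflects the step's condition that mass is calculated only "where enough data is available".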
