The problem
Working on the Gradebook app, I realized education data is harder to come by than one might think. Since education ultimately falls to the states, the most valuable information about schools (test scores, graduation rates, school demographics, etc.) typically resides in portals like this. There are also additional data sets compiled by federal agencies and even local ones. Trying to combine these for analysis typically left me drowning in spreadsheets and a chaotic Downloads folder. I contacted a few schools to see how they compiled these data and found them in the same boat: downloading each file as needed. This is especially onerous for longitudinal or time series analysis. Studying the relationship between ACT scores and school size over 10 years involves downloading and combining at least 20 .csv files.
The solution
Like any good data migration, this was going to need four parts: selecting, extracting, loading, and transforming.
Selecting
The primary objective in data selection was diversity, particularly regional diversity. When I was a teacher, I remember being frustrated by the lack of comparison points to other areas. How my students performed on Wisconsin's annual summative assessment was helpful, but was that test rigorous? How did it compare to what kids in New York and California were learning?
A good place to start was the National Center for Education Statistics, which publishes a 'Common Core of Data' for all non-private schools nationwide. Each school is assigned an NCES School ID, and many state data sets use this ID as well. This NCES data set could lay the foundation for a national dataset, with state and local datasets adding context. The states I started out with were:
New York: I know the schools and cities fairly well, and the huge juxtapositions presented by different parts of the state (think Schenectady vs. Brooklyn vs. the Catskills) make for an intriguing analysis. Plus, their data is clean and accessible.
Wisconsin: Wisconsin is also intriguing, not just for the personal connection, but also because of the variety of public, private, charter, and voucher schools all tenuously cohabiting a state with some of the best and worst educational outcomes in the nation.
Louisiana: Louisiana was none of these things. I know next to nothing about it, can't pronounce any of the names, had to look up what a Parish was, and its data access and quality are abysmal. But including Louisiana prevents the solution from working only on states with good data quality like NY and WI.
Extracting
Extracting these data means being able to collect .csv and .xlsx files in disparate formats with nonstandard attributes, then stream their contents to a database.
Using PostgreSQL, I created a 'raw_data' schema with 'source_dm' and 'file_ft' tables. The source_dm table is pretty self-explanatory, but the file_ft table collects everything needed to stream a file: the file format, download URL, zip file instructions, the minimum necessary data transformations (like removing special characters), and a 'topic'.
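A minimal sketch of what those two tables might look like, created through psycopg2 (the column names here are illustrative assumptions, not the exact schema):

```python
import psycopg2

# Hypothetical sketch of the raw_data schema; columns are illustrative.
DDL = """
CREATE SCHEMA IF NOT EXISTS raw_data;

CREATE TABLE IF NOT EXISTS raw_data.source_dm (
    source_id   SERIAL PRIMARY KEY,
    source_name TEXT NOT NULL,           -- e.g. 'NCES', 'NY', 'WI', 'LA'
    source_url  TEXT
);

CREATE TABLE IF NOT EXISTS raw_data.file_ft (
    file_id         SERIAL PRIMARY KEY,
    source_id       INT REFERENCES raw_data.source_dm (source_id),
    file_format     TEXT NOT NULL,       -- 'csv', 'xls', or 'xlsx'
    download_url    TEXT NOT NULL,
    zip_member      TEXT,                -- file to pull out of a zip archive, if any
    transformations JSONB,               -- e.g. {"strip_special_chars": true}
    topic           TEXT,                -- e.g. 'enrollment', 'act_scores'
    school_year     TEXT
);
"""

with psycopg2.connect("dbname=gradebook") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```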
Equipped with everything needed to stream the file, we can just run extract.py, which uses the values from file_ft to download the desired file and hold it in memory. This function assumes each file is a .csv, .xls, or .xlsx. Having one script whose instructions are stored in a database is immensely helpful for a few reasons: all parameters are stored centrally for easy batch changes in the event of something like a URL change, it prevents the duplication inherent in a bespoke solution for each data source, and it makes selecting which data to populate as easy as a SQL query.
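Here is a simplified sketch of the kind of logic extract.py performs, assuming a file_row dict pulled from file_ft with hypothetical keys download_url, zip_member, and file_format; the real script differs in the details:

```python
import io
import zipfile

import pandas as pd
import requests


def extract(file_row: dict) -> pd.DataFrame:
    """Download one file described by a raw_data.file_ft row and return it as a DataFrame."""
    resp = requests.get(file_row["download_url"], timeout=60)
    resp.raise_for_status()
    payload = io.BytesIO(resp.content)  # hold the file in memory, never on disk

    # If the source wraps the file in a zip archive, pull out the member we want.
    if file_row.get("zip_member"):
        with zipfile.ZipFile(payload) as zf:
            payload = io.BytesIO(zf.read(file_row["zip_member"]))

    # The pipeline assumes every file is .csv, .xls, or .xlsx.
    if file_row["file_format"] == "csv":
        return pd.read_csv(payload)
    return pd.read_excel(payload)  # covers .xls and .xlsx, given the right engine
```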
Loading
Loading a bunch of sanitized flat data files from a stream is pretty straightforward. The difficult part is doing it fast enough. Writing a csv file to a relational database is surprisingly slow, even when batched. The solution is ditching Python-level inserts and going full Postgres by taking advantage of Postgres's COPY feature. Just create a temp table for your .csv data, copy the data using the copy_from method in the psycopg2 library, and the data loads almost instantaneously. After that you're a simple join away from having a fully populated table in a fraction of the time it would take an INSERT statement.
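A rough version of that load step using psycopg2's copy_from; the staging pattern and table names here are illustrative, not the exact implementation:

```python
import io

import pandas as pd
import psycopg2


def load_dataframe(df: pd.DataFrame, conn, target_table: str) -> None:
    """Stream a DataFrame into Postgres via COPY instead of row-by-row INSERTs."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False, header=False)
    buffer.seek(0)

    with conn.cursor() as cur:
        # Stage the rows in a temp table shaped like the target.
        cur.execute(
            f"CREATE TEMP TABLE staging (LIKE {target_table} INCLUDING DEFAULTS) ON COMMIT DROP;"
        )
        # COPY loads the whole buffer in one round trip.
        # Note: copy_from does not parse quoted CSV fields; for messy data,
        # copy_expert with "COPY ... FROM STDIN WITH CSV" is the safer option.
        cur.copy_from(buffer, "staging", sep=",", null="")
        # One INSERT ... SELECT moves it into the real table.
        cur.execute(f"INSERT INTO {target_table} SELECT * FROM staging;")
    conn.commit()
```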
Transforming
This is the fun part. Good data collection (especially for events data!) means collecting everything, then slowly cleaning it up in successively cleaner schemas until one reaches a business-ready data warehouse. This also means we get to design a schema 🤗.
The obvious candidate for our fact table is school_ft. From here one can join tables like student demographics, test scores, dropout rates, and school type to the school using (school_id, school_year). The additional beauty of the incremental cleaning approach is that this dataset is still very useful even if we haven't had the time to build out a full data warehouse.
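As a rough illustration, with hypothetical table and column names, the ACT-scores-versus-school-size question from the intro becomes a single query against the fact table and whichever cleaned tables join to it:

```python
import pandas as pd
import psycopg2

# Table and column names below are assumptions for illustration only.
QUERY = """
SELECT s.school_year,
       s.enrollment,
       a.act_composite_avg
FROM   school_ft s
JOIN   act_scores_ft a
       ON  a.school_id   = s.school_id
       AND a.school_year = s.school_year
WHERE  s.school_year BETWEEN '2009-10' AND '2018-19';
"""

with psycopg2.connect("dbname=gradebook") as conn:
    df = pd.read_sql(QUERY, conn)  # ready for a scatter plot or regression
```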
Uses for these data
As a data scientist, my first desire was to throw this in a notebook and get busy. I learned a lot about schools, students, and differences between states. You can check out some of the things I played around with in these notebooks. However, after talking with people in the world of education, these analyses seemed less valuable. Teachers and administrators are kept up at night with questions like "How can I increase enrollment?", "What modifications can I make tomorrow to make student x successful on the next test?", and "How can I know my students are prepared for college?" Obviously no dataset can answer all of these, but through answering similarly abstract questions about tech products and business strategies, I've learned that illuminating these problems requires extremely granular data. The data collected in this project are best combined with student-level data, data no school will (or should) give me access to.
Moreover, analyzing these data from a Postgres database in a Python notebook requires the desire to spend way too much time learning how these technologies work, time administrators and teachers don't have. One potential solution to make these data more actionable would be to migrate the data to Google BigQuery and build a Google Spreadsheet interface to make importing data easier and code-free. Stay tuned.