CPS
For Metis project #2, we were tasked with selecting a dataset appropriate for linear regression modeling. I chose to look at Chicago Public Schools high school graduation rates. I decided on CPS for several reasons. Firstly, having worked with CPS through DePaul, 826CHI, and Open Books Literacy Organization, I brought a certain level of domain knowledge to the project. Secondly, having spent my early childhood in Chicago, I have a personal attachment to the city and the education of its youth. Thirdly, I believe public education is a foundational pillar of a healthy society.
After identifying a question to be answered, the first big challenge of a data science investigation is locating good data. I used the school profile CSV files found at Chicago City Data Portal. My first instinct was to look at SQRP (School Quality Rating Policy) scores published here. Ultimately, I decided against SQRP for two reasons: the data was not continuous; and each SQRP score had subjective decisions baked into them that would complicate any conclusions. I chose an alternative, graduation rate, since it is both continuous and relatively objective. However, the graduation rate data was not ready out of the box. Linear regression modeling requires normally distributed target data, and the graduation rates were skewed left. By subtracting each graduation rate from 100, I reflected the data, creating a right skewed failed-to-graduate target, making it suitable for a log transformation. After the transform, the distribution approximated a normal distribution, and was ready for modeling.
The most interesting part of the project was feature selection and transformation. I pulled mean household incomes from the census, and matched it to each high school’s zip code. I created dummy variables representing access to El trains, ADA accessibility, dress code, whether or not the school was a charter, and gender of the head administrator. I took counts of languages and grade levels offered. Lastly, converted student demographic counts to percentages of the student body. In the end, I had 29 features in my data-frame ready for modeling.
The LASSO model is used to penalize model complexity and zero out less necessary features. After using it, I was able to narrow down my feature list to 18. I then followed the LARS path to order what was left by importance. I found the top five features affecting graduation rate positively were whether it were a charter and had a dress code, the total student count, mean zip income, and proximity to the blue line. The top five negative features affecting graduation rate were the count of special ed learners, African American student count, grades offered, gender of principal, and low-income student population.
While on the whole, the project was quite informative, my model needs improvement. It’s predictive performance averaged a root mean squared error over 17 percentage points for the testing data. Adding more data reaching back to earlier school years and fine-tuning the features will improve its performance. In the near future, I will add the 2015/16 school year, and reassess the results. I also hope to compare CPS high schools to Chicago private schools, suburban school networks, and public school systems in other large cities.