Project Gutenberg Book Recommender

27 Mar 2019

The fourth Metis project focused on natural language processing, or NLP for short. As a subset of data science, it is both incredibly exciting and deceptively frustrating. It’s not hard to explain its allure: transforming words from books or articles into multi-dimensional space is a beautiful idea. I used Project Gutenberg as the data source for my project. I created a Flask app which takes in a piece of text, then recommends a similar book to read from Project Gutenberg. Stay with me for a few more paragraphs to learn about the challenges and rewards of the project.

Three important tools for an NLP project are tokenization, vectorization, and topic modeling. Tokenizers take a document and break it up into smaller parts: sentences, words, or parts of words. Vectorization turns these tokens into counts, with each distinct word getting its own dimension. I used a tf-idf vectorizer, which weights how often a word appears in a document against how many documents in the collection contain it, so words that show up everywhere count for less. After tf-idf vectorization, topic modeling finds patterns among these weighted frequencies and winnows the number of dimensions down to a small number of topics. The larger concept is known as dimensionality reduction.
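To make the tf-idf idea concrete, here is a minimal sketch in plain Python. The three toy documents, the simple regex tokenizer, and the smoothed idf formula are assumptions chosen for illustration (the formula mirrors scikit-learn's default smoothing, without the length normalization it also applies); this is not the code from the project.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into word tokens.
    return re.findall(r"[a-z']+", text.lower())


def tfidf(documents):
    # Return one {word: weight} dictionary per document.
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term count scaled by a smoothed inverse document frequency.
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, not text from Project Gutenberg.
docs = [
    "the ship left the island",
    "the captain and the ship",
    "the law of nature",
]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "island" and "ship" each appear once, but "island" gets the larger weight because it occurs in only one of the three documents.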

Hopefully the flow of the last paragraph communicated both the beauty and frustration I alluded to at the beginning. Imagining words floating around in a spatial representation of their frequency is awesome. And the algorithms I implemented to reduce these frequencies to topic patterns work surprisingly well. When I analyzed the Top 100 ebooks downloaded on Project Gutenberg on a given day, the algorithms spit out pretty concise topics. When I set sklearn’s NMF modeler to 7 topics, it yielded the following self-labeled categories: seafaring adventure, slavery/race, city life, law/philosophy, family, manners, and terrestrial/space exploration. I defined these labels subjectively, yet it is hard to look at the groups and not see intriguing patterns. Here are the first five words relating to seafaring: ship, sea, captain, boat, island. Here are the first five for law/philosophy: power, opinion, nature, subject, law. The NMF modeler and its kin are kind of magical.

The major frustration comes in tandem with the magic. It is difficult to visualize how dimensionality reduction works. I found myself imagining the word frequencies in three-dimensional space. This is a fine place to start, but the leap from three dimensions to one dimension per word is much harder. There is no visual equivalent. The frustration entered when the initial awe of mapping a text into multiple dimensions gave way to the recognition that I couldn’t visually imagine what the algorithms were doing. The way to understand them is via linear algebra calculations, which are beautiful in their own way, but not as alluring as points floating in space.
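For readers who want a peek at the linear algebra, here is a minimal NumPy sketch of one classic way to compute a non-negative matrix factorization: the multiplicative-update rule, which approximates a documents-by-words matrix V as the product of a documents-by-topics matrix W and a topics-by-words matrix H. This is an illustration under assumptions, not what the project ran: the function name, the tiny matrix, and the iteration count are made up, and sklearn's NMF uses a different solver by default.

```python
import numpy as np


def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    # Approximate V (docs x words) as W (docs x topics) @ H (topics x words).
    rng = np.random.default_rng(seed)
    n_docs, n_words = V.shape
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_words))
    for _ in range(n_iter):
        # Each update rescales the entries and keeps them non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up 4 documents x 4 words matrix with two obvious word groups.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W1, H1 = nmf_multiplicative(V, n_topics=2, n_iter=1)
W, H = nmf_multiplicative(V, n_topics=2, n_iter=200)

error_after_1 = np.linalg.norm(V - W1 @ H1)
error_after_200 = np.linalg.norm(V - W @ H)
print(error_after_1, error_after_200)
```

Four word dimensions are squeezed into two topic dimensions here; the same arithmetic applies when there is one dimension per word in a whole library.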

As is often the case in data science, beauty and frustration give way to whether the end-product works. The Flask app performs fairly well. It makes interesting predictions, and points the user in a good direction. The app is not live yet. Until it is, check out my code on GitHub.
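For a rough idea of how a recommender like this could pick a book, here is a minimal sketch that compares the input text to each book using cosine similarity between tf-idf vectors. This is an assumption for illustration: the post does not say which similarity measure the app uses, and the `recommend` function, titles, and texts below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(query_text, titles, book_texts):
    # Hypothetical helper: return the title most similar to the query, with its score.
    vectorizer = TfidfVectorizer(stop_words="english")
    book_vectors = vectorizer.fit_transform(book_texts)
    # Project the query into the same tf-idf space as the books.
    query_vector = vectorizer.transform([query_text])
    scores = cosine_similarity(query_vector, book_vectors)[0]
    best = int(scores.argmax())
    return titles[best], float(scores[best])


# Made-up titles and one-sentence stand-ins for full book texts.
titles = ["A Sea Story", "A Treatise on Law"]
book_texts = [
    "The captain steered the ship toward the island as the sea grew rough",
    "Law and opinion shape the power of nature over every subject",
]

title, score = recommend("the captain sailed the ship across the sea", titles, book_texts)
print(title, round(score, 3))
```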

Prison Recidivism

16 Feb 2019

The American prison population is large, to say the least. If the two million people lived in one city, it would rank as America’s fifth most populous metropolis, just behind Houston.¹ Each year, approximately 600,000 people, a population approaching that of Portland, are released back into the community.² Recidivism, the term used for ex-offenders who are re-incarcerated after release, is an important factor in understanding how the prison population remains so large. Within three years of their release, two out of three prisoners are rearrested.³ My third project for the Metis Data Science Bootcamp, discussed at length below, attempts to use supervised learning algorithms to shed some light on the main factors contributing to recidivism.

[Image: Prison Cell]

For the project, I needed a dataset that included information broken down by offender. Luckily, the Iowa Department of Corrections offers a public dataset of over 26,000 records.⁴ The Iowa study tracked each released ex-offender over a three-year period, with reporting years from 2013 until 2018, and includes demographics such as race and age, release type, and several layers of detail about the crime each person was sentenced for. After dropping records with null values, I had 24,150 rows of data to analyze. The data was imbalanced: it included a ratio of about one recidivist to two non-recidivists. It also included a heavy racial imbalance, with over 16,000 Caucasian prisoners. The latter distribution is not representative of the racial breakdown of the general prison population of the US.⁵
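The cleaning and imbalance check can be sketched in a few lines of pandas. The column names and the six toy rows below are hypothetical stand-ins, not the actual Iowa data or its schema.

```python
import pandas as pd

# Hypothetical toy records; the real dataset has different columns and far more rows.
records = pd.DataFrame({
    "age_group": ["25-34", "35-44", None, "under 25", "25-34", "45-54"],
    "release_type": ["Parole", "Discharged", "Parole", None, "Parole", "Discharged"],
    "recidivist": [1, 0, 0, 1, 0, 0],
})

# Drop every row that has at least one missing value.
cleaned = records.dropna()

# Share of each class among the remaining rows, to check for imbalance.
class_share = cleaned["recidivist"].value_counts(normalize=True)
print(len(cleaned))
print(class_share)
```

A roughly one-to-two split like the one described above would show up here as shares near 0.33 and 0.67.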

[Figures: Age Distribution, Racial Distribution]

Prior to modeling, I broke the dataset down into handpicked feature subsets, one of which dropped the “year released”, “reporting year”, and “target population” features. While modeling, this subset consistently outperformed the others. The best-performing models were logistic regression with a C of approximately 1.21, a random forest classifier with a max depth of 8, and a Bernoulli naive Bayes classifier. They all returned a ROC AUC score of around 0.65. I lowered the classification threshold to 0.4 to reduce false negatives, i.e. released persons the model predicted would not be rearrested but who were. False negatives are important because, when considering resource allocation, they represent ex-offenders who may have benefited from more support.
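Here is a minimal sketch of the threshold adjustment, using logistic regression with the C value mentioned above. The data is synthetic (generated with `make_classification` at roughly a one-to-two class ratio), so the scores it prints say nothing about the Iowa results; the `false_negatives` helper is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: about one positive (recidivist) per two negatives.
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[2 / 3, 1 / 3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(C=1.21, max_iter=1000)
model.fit(X_train, y_train)

# Predicted probability of the positive class for each test row.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)


def false_negatives(y_true, probabilities, threshold):
    # Count positives that the model labels negative at this threshold.
    predicted = (probabilities >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, predicted, labels=[0, 1]).ravel()
    return int(fn)


fn_default = false_negatives(y_test, proba, 0.5)
fn_lowered = false_negatives(y_test, proba, 0.4)
print(auc, fn_default, fn_lowered)
```

Lowering the threshold can only move predictions from negative to positive, so false negatives cannot increase; the cost is usually more false positives.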

Footnotes
¹ The official number from the Bureau of Justice Statistics is 2,131,000 people incarcerated in prison or local jail as of 12/31/16.
² In 2014, the official number of released prisoners was 636,000: https://www.bjs.gov/content/pub/pdf/p14.pdf
³ https://www.nij.gov/topics/corrections/recidivism/pages/welcome.aspx
⁴ https://data.iowa.gov/Public-Safety/3-Year-Recidivism-for-Offenders-Released-from-Pris/mw8r-vqy4
⁵ “In 2016, blacks represented 12% of the U.S. adult population but 33% of the sentenced prison population. Whites accounted for 64% of adults but 30% of prisoners. And while Hispanics represented 16% of the adult population, they accounted for 23% of inmates.” (http://www.pewresearch.org/fact-tank/2018/01/12/shrinking-gap-between-number-of-blacks-and-whites-in-prison/)