Project Gutenberg Book Recommender
The fourth Metis project focused on natural language processing, or NLP for short. As a subset of data science, NLP is both incredibly exciting and deceptively frustrating. Its allure is not hard to explain: transforming the words of books or articles into points in a high-dimensional space is a beautiful idea. I used Project Gutenberg as the data source for my project, and I created a Flask app that takes in a piece of text and recommends a similar book to read from Project Gutenberg. Stay with me for a few more paragraphs to learn about the challenges and rewards of the project.
Three important tools for an NLP project are tokenizing, vectorization, and topic modeling. A tokenizer takes a document and breaks it into smaller parts: sentences, words, or parts of words. Vectorization then turns those tokens into numbers, with each unique word getting its own dimension. I used a tf-idf vectorizer, which weights how often a word appears in a document against how common that word is across all documents, so distinctive words count for more than ubiquitous ones. After tf-idf vectorization, topic modeling finds patterns among these weighted frequencies and winnows the thousands of word dimensions down to a handful of topics. The larger concept is known as dimensionality reduction.
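Here is a rough sketch of that pipeline in sklearn. The toy `documents` list, the stop-word setting, and the two-topic count are purely illustrative, not the project's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# A toy corpus; the real project used full Project Gutenberg texts.
documents = [
    "The captain sailed the ship across the sea to a distant island.",
    "The court weighed the law, the opinion, and the nature of power.",
    "The family walked through the city streets in the evening.",
]

# tf-idf: weight each word's frequency in a document against how
# common the word is across the whole corpus.
vectorizer = TfidfVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

# NMF winnows the word dimensions down to a small number of topics.
nmf = NMF(n_components=2, random_state=42)
doc_topic_matrix = nmf.fit_transform(doc_term_matrix)
```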
Hopefully the flow of the last paragraph communicated both the beauty and frustration I alluded to at the beginning. Imagining words floating around in a spatial representation of their frequency is awesome. And the algorithms I implemented to reduce these frequencies to topic patterns work surprisingly well. While analyzing the Top 100 ebooks downloaded from Project Gutenberg on a given day, the algorithms spat out pretty concise topics. When I set sklearn's NMF modeler to 7 topics, it yielded the following self-labeled categories: seafaring adventure, slavery/race, city life, law/philosophy, family, manners, and terrestrial/space exploration. I assigned those labels subjectively, yet it is hard to look at the groups and not see intriguing patterns. Here are the first five words for seafaring: ship, sea, captain, boat, island. And the first five for law/philosophy: power, opinion, nature, subject, law. The NMF modeler and its kin are kind of magical.
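Pulling those top words out of a fitted model takes only a few lines. A sketch, continuing from the `vectorizer` and `nmf` objects above; the five-word cutoff simply mirrors the lists in this paragraph:

```python
# Each row of nmf.components_ holds one topic's weight per vocabulary word.
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(nmf.components_):
    # Indices of the five highest-weighted words for this topic.
    top_indices = topic.argsort()[::-1][:5]
    print(f"Topic {topic_idx}:", ", ".join(feature_names[i] for i in top_indices))
```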
The major frustration comes in tandem with the magic. It is difficult to visualize how dimensionality reduction works. I found myself imagining the word frequencies in three-dimensional space. That is a fine place to start, but the jump from three dimensions to one dimension per word is much harder: there is no visual equivalent. The frustration set in when the initial awe of mapping a text into many dimensions gave way to the recognition that I couldn't visually picture what the algorithms were doing. The way to understand them is through linear algebra, which is beautiful in its own way, but not as alluring as points floating in space.
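For readers who want the linear-algebra view: NMF approximately factors the non-negative document-term matrix into two smaller non-negative matrices, one mapping documents to topics and one mapping topics to words. With d documents, w words, and t topics:

```latex
V \approx W H,
\qquad V \in \mathbb{R}_{\ge 0}^{d \times w},\quad
W \in \mathbb{R}_{\ge 0}^{d \times t},\quad
H \in \mathbb{R}_{\ge 0}^{t \times w}
```

Each row of W is a document's mix of topics, and each row of H is a topic's weighting over the vocabulary; the "topics" are simply whatever rank-t structure best reconstructs V.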
As is often the case in data science, beauty and frustration ultimately give way to the question of whether the end product works. The Flask app performs fairly well: it makes interesting predictions and points the user in a good direction. Under the hood, the recommendation step boils down to a nearest-neighbor lookup in topic space, sketched below. The app is not live yet. Until it is, check out my code on GitHub.
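A minimal sketch of that lookup, reusing the fitted `vectorizer`, `nmf`, and `doc_topic_matrix` from earlier. The `recommend` function and `titles` list are illustrative, and the cosine-similarity choice is an assumption that may not match the app's exact logic:

```python
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query_text, vectorizer, nmf, doc_topic_matrix, titles):
    """Return the title whose topic profile best matches the query text."""
    # Project the query into the same topic space as the corpus.
    query_topics = nmf.transform(vectorizer.transform([query_text]))
    # Compare against every book's topic vector.
    similarities = cosine_similarity(query_topics, doc_topic_matrix)[0]
    return titles[similarities.argmax()]
```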