The Art of Feature Engineering

Essentials for Machine Learning

by Pablo Duboue, PhD


The book has 10,000 lines of Python code in 5 different Jupyter notebooks, operating over 2.1Gb of compressed data. The code behind these case studies is intended as a communication tool for the ideas expressed in the book.

Preorder on Amazon. Expected availability: May 2020.

The Problem

The task tackled in the first four case study chapters is predicting the population of cities and small towns from different data sources. The task can be attacked with structural features, timestamped features, textual features and image features. For cities, this means predicting from their ontological properties (e.g., the title of a city's leader or its time zone), from their historical population and historical features (which involves time series analysis), from the textual description of the place (which involves text analysis, particularly as the text sometimes includes the population) and from a satellite image of the city (which involves image processing).
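To make the four kinds of features concrete, here is a minimal sketch of how one city could be turned into a single flat feature vector. It is not taken from the book's notebooks: the function name `featurize_city`, the dictionary keys and the particular feature choices (one-hot time zone, log of the latest historical population, largest number found in the text, mean pixel intensity) are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import re


def featurize_city(city, known_time_zones):
    """Build one flat feature vector from four kinds of raw data.

    All keys and feature choices are hypothetical, for illustration only.
    """
    features = []

    # Structural: one-hot encode a categorical property (the time zone).
    features.extend(
        1.0 if city["time_zone"] == tz else 0.0 for tz in known_time_zones
    )

    # Timestamped: log of the most recent historical population.
    last_population = city["historical_population"][-1]
    features.append(math.log10(last_population + 1))

    # Textual: the largest number mentioned in the description,
    # used as a rough cue because the text sometimes states the population.
    numbers = [
        int(match.replace(",", ""))
        for match in re.findall(r"\d[\d,]*", city["description"])
    ]
    features.append(math.log10(max(numbers) + 1) if numbers else 0.0)

    # Image: mean pixel intensity of a small grayscale box around the city.
    pixels = [pixel for row in city["image"] for pixel in row]
    features.append(sum(pixels) / len(pixels))

    return features


# Hypothetical example city; none of these values come from the datasets.
city = {
    "time_zone": "UTC-5",
    "historical_population": [9800, 11200, 12100],
    "description": "A town of 12,500 inhabitants, founded in 1850.",
    "image": [[0.1, 0.4], [0.3, 0.2]],
}
print(featurize_city(city, ["UTC-5", "UTC+1"]))
```

The populations are put on a log scale here because city sizes span several orders of magnitude. That is a common choice for this kind of target, not a claim about what the notebooks do.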

These case studies reflect the author's attempt to solve these problems through feature engineering alone, with the following constraints:

  • Python code understandable for non-Python developers with as few Python dependencies as possible;
  • running time under two days per notebook without a computer cluster or a GPU and using 8Gb of RAM or less;
  • source dataset below 2Gb for all case studies combined.

Note that there are two obvious casualties of these decisions: no deep learning framework (like TF) is used and no hyperparameter search is performed. The latter was a decision motivated by these constraints.

As mentioned in the GitHub README, this code is intended as a way of communicating ideas. It is as far from production code as source code can get.

The Data

DBpedia + GeoCities + Wikipedia + NASA tiles

Download 2.2Gb Zip compressed.

Download 1.7Gb Tar BZip2 compressed.


The data for each individual chapter is available for download below. Note that the Chapter 9 dataset contains the source tiles and is 3x larger than the files above. You will only need the source tiles if you want to try other types of box constructions around each city.


The Jupyter Notebooks

All code is available on GitHub, together with installation instructions, under the MIT License.

Chapter 6

Graphs

This notebook has 31 cells. It uses numpy, scikit-learn, matplotlib and graphviz.

GitHub | Rendered Notebook | Dataset (348 MiB)

Chapter 7

Timestamped data

This notebook has 35 cells. It uses numpy, scikit-learn, matplotlib and statsmodels.

GitHub | Rendered Notebook | Dataset (426 MiB)

Chapter 8

Text

This notebook has 16 cells. It uses numpy, scikit-learn, matplotlib and gensim.

GitHub | Rendered Notebook | Dataset (66 MiB)

Chapter 9

Images

This notebook has 21 cells. It uses numpy, scikit-learn, matplotlib and opencv.

The chapter dataset contains the full NASA tiles (only needed for experiments that change the box extraction algorithm in Cell #3). The full all-chapters dataset contains only the boxes around each city and is much smaller.
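As a rough idea of what extracting a box around a city involves, the sketch below crops a square window from a tile given the city's pixel position, clamping the window at the tile borders. It is not the algorithm from Cell #3: the function `extract_box`, its parameters and the list-of-lists tile representation are assumptions for illustration, and the notebook itself works with numpy and opencv.

```python
def extract_box(tile, center_row, center_col, half_size):
    """Crop a square box around (center_row, center_col) from a 2D tile.

    The box has side 2 * half_size + 1, except near the tile borders,
    where it is clipped so that it never reads outside the tile.
    Hypothetical helper, not the notebook's implementation.
    """
    n_rows = len(tile)
    n_cols = len(tile[0])

    # Clamp the window to the tile boundaries.
    top = max(0, center_row - half_size)
    bottom = min(n_rows, center_row + half_size + 1)
    left = max(0, center_col - half_size)
    right = min(n_cols, center_col + half_size + 1)

    return [row[left:right] for row in tile[top:bottom]]


# A tiny 5x5 stand-in for a tile, with each pixel holding its own index.
tile = [[r * 5 + c for c in range(5)] for r in range(5)]
center_box = extract_box(tile, 2, 2, 1)  # full 3x3 box
corner_box = extract_box(tile, 0, 0, 1)  # clipped to 2x2 at the corner
```

Other box constructions (a larger `half_size`, padding instead of clipping, or non-square windows) would change only this step, which is why the full tiles are needed solely for such experiments.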

GitHub | Rendered Notebook | Dataset

Chapter 10

Video, geographic information and preferences

This notebook has 19 cells. It uses numpy, scikit-learn, matplotlib, opencv and geopy.

GitHub | Rendered Notebook | Dataset