Chapter 9: Images

Accompanying code for the book The Art of Feature Engineering.

This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng

MIT License

Copyright 2019 Pablo Duboue

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Limitations

  • Simple python intented for people coming from other languages that plan to use the ideas described in the book outside of Python.
  • Many of these techniques are available as library calls. They are spelled out as for teaching purposes.
  • Resource limitations:
    • At most one day of running time per notebook.
    • No GPU required.
    • Minimal dependencies.
    • At most 8Gb of RAM.
  • Due to resource limitations, these notebooks do not undergo as much hyperparameter tuning as necessary. This is a shortcoming of these case studies, keep it in mind if you want to follow a similar path with your experiments.
  • To help readers try variants of some cells in isolation, the cells are easily executable without having to re-run the whole notebook. As such, most cells read everything they need from disk and write all their results back into disk, which is unnecessary with normal notebooks. The code for each cell might look long and somewhat unusual. In a sense, each cell tries to be a separate Python program.
  • I dislike Pandas so these notebooks are Pandas-free, which might seem unusual to some.

Get the GPS coordinates for all cities (Cell #1).

In [1]:
# CELL 1
from collections import OrderedDict

geo_id_to_uri = OrderedDict()
with open("ch6_cell1_cities1000_links.ttl") as f:
    for line in f:
        fields = line.split(' ')
        _id = fields[2]
        _id = _id.split('/')[-2]
        geo_id_to_uri[_id] = fields[0]
    
print("Linked cities: {:,}".format(len(geo_id_to_uri)))
geo_id_to_gps = dict()
found = 0
with open("cities1000.txt") as f:
    for line in f:
        fields = line.split('\t')
        _id = fields[0]
        lat = fields[4]
        lon = fields[5]
        if _id in geo_id_to_uri:
            geo_id_to_gps[_id] = (lat, lon)
            found += 1
                            
print("With coordinates: {:,} ({:%})".format(found, found/len(geo_id_to_uri)))
    
with open("ch9_cell1_cities1000_gps.tsv", "w") as w:
    w.write("name\tgeo_id\tlat\tlon\n")
    for _id in geo_id_to_uri:
        if _id in geo_id_to_gps:
            lat, lon = geo_id_to_gps[_id]
            w.write("{}\t{}\t{}\t{}\n".format(geo_id_to_uri[_id], _id, lat, lon))
Linked cities: 80,199
With coordinates: 80,199 (100.000000%)

Generate tile lists (Cell #2).

In [2]:
# CELL 2

PARAM_WIDTH_TILES = 2560
PARAM_HEIGHT_TILES = 1280

PARAM_TOP_LON = -180
PARAM_TOP_LAT = 90

PARAM_TILE_SIZE = 512
PARAM_BOX_SIZE = 128 # output tiles around a city

def cell2_to_tile_pixel(lat, lon):
    tile_row = (PARAM_TOP_LAT - lat) / 360 * PARAM_WIDTH_TILES
    tile_col = (lon - PARAM_TOP_LON) / 180 * PARAM_HEIGHT_TILES

    pixel_x = (tile_col - int(tile_col)) * PARAM_TILE_SIZE
    pixel_y = (tile_row - int(tile_row)) * PARAM_TILE_SIZE
    
    tile_row = int(tile_row)
    tile_col = int(tile_col)
    return tile_row, tile_col, int(pixel_x), int(pixel_y)

def tiles_to_fetch(lat, lon):
    tile_row, tile_col, pixel_x, pixel_y = cell2_to_tile_pixel(lat, lon)
    
    result = [ (tile_row, tile_col) ]
    tile_col_m1 = (tile_col - 1 + PARAM_HEIGHT_TILES) % PARAM_HEIGHT_TILES
    tile_col_p1 = (tile_col + 1) % PARAM_HEIGHT_TILES
    tile_row_m1 = (tile_row - 1 + PARAM_WIDTH_TILES) % PARAM_WIDTH_TILES
    tile_row_p1 = (tile_row + 1) % PARAM_WIDTH_TILES
    
    if pixel_x < PARAM_BOX_SIZE / 2:
        result.append( (tile_row_m1, tile_col))
    if pixel_y < PARAM_BOX_SIZE / 2:
        result.append( (tile_row, tile_col_m1))
    if pixel_x < PARAM_BOX_SIZE / 2 and pixel_y < PARAM_BOX_SIZE / 2:
        result.append( (tile_row_m1, tile_col_m1))
    if PARAM_TILE_SIZE - pixel_x < PARAM_BOX_SIZE / 2: 
        result.append( (tile_row_p1, tile_col))
    if PARAM_TILE_SIZE - pixel_y < PARAM_BOX_SIZE / 2: 
        result.append( (tile_row, tile_col_p1))
    if PARAM_TILE_SIZE - pixel_x < PARAM_BOX_SIZE /2 and PARAM_TILE_SIZE - pixel_y < PARAM_BOX_SIZE / 2: 
        result.append( (tile_row_p1, tile_col_p1))
    return result

tiles = set()
with open("ch9_cell1_cities1000_gps.tsv") as f:
    next(f) # header
    for line in f:
        _, _, lat, lon = line.split("\t")
        for row, col in tiles_to_fetch(float(lat), float(lon)):
            tiles.add("{}\t{}".format(row,col))
            
print("To fetch: {:,} tiles".format(len(tiles)))

with open("ch9_cell2_tiles_zoom11.tsv", "w") as w:
    w.write("\n".join(sorted(tiles)) + "\n")
To fetch: 68,304 tiles

The tiles are fetched through a script and available in the tiles/ folder. From there, we can extract boxes surrounding a given point. Cell #3 extracs boxes of size $64 \times 64$.

In [3]:
# CELL 3
import os
import cv2
import numpy as np

PARAM_TILE_SIZE = 512
PARAM_BOX_SIZE  = 64
PARAM_DATA_PATH   = './tiles'
PARAM_OUTPUT_PATH = './boxes'

if not os.path.exists(PARAM_OUTPUT_PATH):
    os.mkdir(PARAM_OUTPUT_PATH)

cities = list()
with open("ch9_cell1_cities1000_gps.tsv") as f:
    next(f) # header
    for line in f:
        uri, geoid, lat, lon = line.split("\t")
        cities.append( (uri, geoid, float(lat), float(lon)) )

target = np.zeros( (PARAM_BOX_SIZE, PARAM_BOX_SIZE, 3), dtype='uint8' )
large = np.zeros( (512 * 2, 512 * 2, 3), dtype='uint8')

missing_tile = set()
multi_tile = 0
written = 0
for uri, geoid, lat, lon in cities:
    tile_row, tile_col, pixel_x, pixel_y = cell2_to_tile_pixel(lat, lon)

    img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col)
    if not os.path.exists(img_file):
        missing_tile.add( (tile_row, tile_col) )
        continue
    img = cv2.imread(img_file)
    if img is None:
        missing_tile.add( (tile_row, tile_col) )
        continue        
    
    if pixel_x >= PARAM_BOX_SIZE / 2 and \
            pixel_y >= PARAM_BOX_SIZE / 2 and \
            PARAM_TILE_SIZE - pixel_x >= PARAM_BOX_SIZE / 2 and \
            PARAM_TILE_SIZE - pixel_y >= PARAM_BOX_SIZE / 2:            
        # single tile wonder
        target[:,:,:] = img[(pixel_x - PARAM_BOX_SIZE // 2):(pixel_x + PARAM_BOX_SIZE // 2),\
                          (pixel_y - PARAM_BOX_SIZE // 2):(pixel_y + PARAM_BOX_SIZE // 2),\
                          :]
    else:
        tile_col_m1 = (tile_col - 1 + PARAM_HEIGHT_TILES) % PARAM_HEIGHT_TILES
        tile_col_p1 = (tile_col + 1) % PARAM_HEIGHT_TILES
        tile_row_m1 = (tile_row - 1 + PARAM_WIDTH_TILES) % PARAM_WIDTH_TILES
        tile_row_p1 = (tile_row + 1) % PARAM_WIDTH_TILES
        
        large[:,:,:] = 0
        
        if pixel_x < PARAM_BOX_SIZE / 2:
            # lower part
            upper_img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row_m1, tile_col)
            if not os.path.exists(upper_img_file):
                missing_tile.add( (tile_row_m1, tile_col) )
                continue
            upper_img = cv2.imread(upper_img_file)
            if pixel_y < PARAM_BOX_SIZE / 2:
                # lower right
                large[PARAM_TILE_SIZE:,PARAM_TILE_SIZE:, :] = img
                pixel_x += PARAM_TILE_SIZE
                pixel_y += PARAM_TILE_SIZE
                large[0:PARAM_TILE_SIZE,PARAM_TILE_SIZE:, :] = upper_img
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col_m1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row, tile_col_m1) )
                    continue
                large[PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = cv2.imread(img_file)
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row_m1, tile_col_m1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row_m1, tile_col_m1) )
                    continue
                large[0:PARAM_TILE_SIZE,0:PARAM_TILE_SIZE, :] = cv2.imread(img_file)
            elif PARAM_TILE_SIZE - pixel_y < PARAM_BOX_SIZE / 2:
                # lower left
                large[PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = img
                pixel_x += PARAM_TILE_SIZE
                large[0:PARAM_TILE_SIZE,0:PARAM_TILE_SIZE, :] = upper_img
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col_p1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row, tile_col_p1) )
                    continue
                large[PARAM_TILE_SIZE:,PARAM_TILE_SIZE:, :] = cv2.imread(img_file)
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row_m1, tile_col_p1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row_m1, tile_col_p1) )
                    continue
                large[0:PARAM_TILE_SIZE:,PARAM_TILE_SIZE:, :] = cv2.imread(img_file)
            else: # lower, use left
                large[PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = img
                pixel_x += PARAM_TILE_SIZE
                large[0:PARAM_TILE_SIZE,0:PARAM_TILE_SIZE, :] = upper_img
        elif PARAM_TILE_SIZE - pixel_x < PARAM_BOX_SIZE / 2:
            # upper part
            lower_img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row_p1, tile_col)
            if not os.path.exists(lower_img_file):
                missing_tile.add( (tile_row_p1, tile_col) )
                continue
            lower_img = cv2.imread(lower_img_file)
            if pixel_y < PARAM_BOX_SIZE / 2:
                # upper right
                large[0:PARAM_TILE_SIZE,PARAM_TILE_SIZE:, :] = img
                pixel_y += PARAM_TILE_SIZE
                large[PARAM_TILE_SIZE:,PARAM_TILE_SIZE:, :] = lower_img
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col_m1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row, tile_col_m1) )
                    continue
                large[0:PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = cv2.imread(img_file)
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row_p1, tile_col_m1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row_p1, tile_col_m1) )
                    continue
                large[PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = cv2.imread(img_file)
            elif PARAM_TILE_SIZE - pixel_y < PARAM_BOX_SIZE / 2:
                # upper left
                large[0:PARAM_TILE_SIZE,0:PARAM_TILE_SIZE, :] = img
                large[PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = lower_img
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col_p1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row, tile_col_p1) )
                    continue
                large[0:PARAM_TILE_SIZE,PARAM_TILE_SIZE:, :] = cv2.imread(img_file)
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row_p1, tile_col_p1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row_p1, tile_col_p1) )
                    continue
                large[PARAM_TILE_SIZE:,PARAM_TILE_SIZE:, :] = cv2.imread(img_file)
            else: # upper, use left
                large[0:PARAM_TILE_SIZE,0:PARAM_TILE_SIZE, :] = img
                large[PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = lower_img
        else:
            if pixel_y < PARAM_BOX_SIZE / 2:
                # right, use upper
                large[0:PARAM_TILE_SIZE:,PARAM_TILE_SIZE:, :] = img
                pixel_y += PARAM_TILE_SIZE
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col_m1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row, tile_col_m1) )
                    continue
                large[0:PARAM_TILE_SIZE:,0:PARAM_TILE_SIZE, :] = cv2.imread(img_file)
            elif PARAM_TILE_SIZE - pixel_y < PARAM_BOX_SIZE / 2:
                # left, use upper
                large[0:PARAM_TILE_SIZE,0:PARAM_TILE_SIZE, :] = img
                img_file = "{}/{}-{}.jpeg".format(PARAM_DATA_PATH, tile_row, tile_col_p1)
                if not os.path.exists(img_file):
                    missing_tile.add( (tile_row, tile_col_p1) )
                    continue
                large[0:PARAM_TILE_SIZE,PARAM_TILE_SIZE:, :] = cv2.imread(img_file)
        target[:,:,:] = large[(pixel_x - PARAM_BOX_SIZE // 2):(pixel_x + PARAM_BOX_SIZE // 2),\
                  (pixel_y - PARAM_BOX_SIZE // 2):(pixel_y + PARAM_BOX_SIZE // 2),\
                  :]
        multi_tile += 1

    cv2.imwrite("{}/{}.png".format(PARAM_OUTPUT_PATH, geoid), target)
    written += 1

print("Missing tiles: {:,}".format(len(missing_tile)))
print("Multi-tiles:   {:,}".format(multi_tile))
print("Written:       {:,}".format(written))
if len(missing_tile) > 0:
    with open("missing_tiles.tsv", "w") as w:
        w.write("".join(map(lambda x:"{}\t{}\n".format(*x), missing_tile))+"\n")
Missing tiles: 0
Multi-tiles:   18,595
Written:       80,199

With these 80,199 we can do some EDA. Let us start by looking at some tiles and histograms. Cell #4 plots 10 cities at random on false colour.

In [4]:
# CELL 4
import random
import cv2
import numpy as np
PARAM_DATA_PATH='./boxes'

dev_uris = set()
pop = dict()
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    next(f) # header
    for line in f:
        fields = line.strip().split("\t")
        dev_uris.add(fields[0])
        pop[fields[0]] = fields[-1]

cities = list()
with open("ch9_cell1_cities1000_gps.tsv") as f:
    next(f) # header
    for line in f:
        uri, geoid, lat, lon = line.split("\t")
        if uri in dev_uris:
            cities.append( (uri, geoid, float(lat), float(lon)) )

rand = random.Random(42)
rand.shuffle(cities)

from matplotlib import pyplot as plt
%matplotlib inline


plt.rcParams['figure.figsize'] = [15, 10]

for idx, row in enumerate(cities[:4]):
    uri, geoid, lat, lon = row
    name = uri.split("/")[-1][:-1]
    plt.subplot(1,4, idx+1)
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, geoid))
    plt.imshow(img)
    plt.axis('off')
    plt.title("{}\n{:,}".format(name, int(10**float(pop[uri]))))
plt.savefig("ch9_cell4_eda1.pdf", bbox_inches='tight', dpi=300)

from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [15, 10]


for idx, row in enumerate(cities[:12]):
    uri, geoid, lat, lon = row
    name = uri.split("/")[-1][:-1]
    plt.subplot(3,4, idx+1)
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, geoid))
    img2 = cv2.applyColorMap(img, cv2.COLORMAP_JET)
    plt.imshow(img2)
    plt.axis('off')
    plt.title("{}\n{:,}".format(name, int(10**float(pop[uri]))))
plt.show()

We can see that places with more mountains have less population. Let's take a look at the histograms in Cell #5.

In [5]:
# CELL 5
import random
import cv2
import numpy as np
PARAM_DATA_PATH='./boxes'

dev_uris = set()
pop = dict()
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    next(f) # header
    for line in f:
        fields = line.strip().split("\t")
        dev_uris.add(fields[0])
        pop[fields[0]] = fields[-1]

cities = list()
with open("ch9_cell1_cities1000_gps.tsv") as f:
    next(f) # header
    for line in f:
        uri, geoid, lat, lon = line.split("\t")
        if uri in dev_uris:
            cities.append( (uri, geoid, float(lat), float(lon)) )

rand = random.Random(42)
rand.shuffle(cities)

from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 10]

for idx, row in enumerate(cities[:4]):
    uri, geoid, lat, lon = row
    name = uri.split("/")[-1][:-1]
    plt.subplot(2, 4, idx + 1)
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, geoid))
    plt.imshow(img)
    plt.axis('off')
    plt.title("{}\n{:,}".format(name, int(10**float(pop[uri]))))
    plt.subplot(2, 4, idx + 5)
    plt.plot(cv2.calcHist([img], [0],None,[16],[0,256]))
plt.savefig("ch9_cell5_eda_hist.pdf", bbox_inches='tight', dpi=300)

from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [15, 10]


for idx, row in enumerate(cities[:12]):
    uri, geoid, lat, lon = row
    name = uri.split("/")[-1][:-1]
    plt.subplot(4,6, 2*idx + 1)
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, geoid))
    img2 = cv2.applyColorMap(img, cv2.COLORMAP_JET)
    plt.imshow(img2)
    plt.axis('off')
    plt.title("{}\n{:,}".format(name, int(10**float(pop[uri]))))
    plt.subplot(4, 6, 2*idx+2)
    plt.plot(cv2.calcHist([img], [0],None,[16],[0,256]))

plt.show()

We are ready to try ML using all pixels as features.

First Featurization

Using each pixel as a feature for $32\times 32$ expands the feature vector in 1,024 features, which with the 98 original ends up in 1,122 (Cell #6).

In [6]:
# CELL 6
import math
import random
import cv2
import numpy as np
import time
from sklearn.ensemble import RandomForestRegressor
PARAM_DATA_PATH='./boxes'

header = None
data = list()
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    header = next(f)
    header = header.strip().split('\t')
    for line in f:
        fields = line.strip().split("\t")
        name = fields.pop(0)
        logpop = float(fields[-1])
        feats = list(map(float, fields[:-1]))
        row = (feats, logpop, name)
        data.append(row)

uri_to_geoid = dict()
with open("ch6_cell1_cities1000_links.ttl") as f:
    for line in f:
        fields = line.split(' ')
        _id = fields[2]
        _id = _id.split('/')[-2]
        uri_to_geoid[fields[0]] = _id

header = list(header[:-1]) + [ fname for rowlist in map(lambda x: map(lambda y:"p-{}-{}".format(x,y), 
                                                                      range(32)), range(32))
                              for fname in rowlist ] + [ header[-1] ]
for idx, row in enumerate(data):
    if idx % (len(data) // 5) == 0:
        print("Read {:,} out of {:,}".format(idx, len(data)))
    feats, _, name = row
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, uri_to_geoid[name]))
    r, _, _ = cv2.split(img)
    feats.extend(r[16:48,16:48].reshape(-1))
    data[idx] = (np.array(feats), row[1], row[2])    

# write full file
with open("ch9_cell6_dev_feat1.tsv",  "w") as f:
    f.write("\t".join(header) + '\n')
    for idx, row in enumerate(data):
        if idx % (len(data) // 5) == 0:
            print("Wrote {:,} out of {:,}".format(idx, len(data)))
        feats, logpop, name = row
        f.write("{}\t{}\t{}\n".format(name, "\t".join(map(lambda x: str(int(x)) if int(x) == x else str(x),
                                                          feats)), logpop))
        
train_data = list()
test_data = list()
rand = random.Random(42)
for row in data:
    if rand.random() < 0.2:
        test_data.append(row) 
    else:
        train_data.append(row)
data=None
               
train_names = list(map(lambda t:t[2], train_data))

test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data=None
test_data=None

# train RF
print("Training on {:,} cities\nusing {:,} features".format(*xtrain.shape))

start = time.time()
rfr = RandomForestRegressor(max_features=1.0, random_state=42, n_estimators=200, n_jobs=-1)
rfr.fit(xtrain[:,:98], ytrain)
print("Baseline training took {:,} seconds".format(time.time() - start))
ytest_pred = rfr.predict(xtest[:,:98])
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("Baseline RMSE", RMSE)

start = time.time()
rfr = RandomForestRegressor(max_features=1.0, random_state=42, n_estimators=200, n_jobs=-1)
rfr.fit(xtrain, ytrain)
print("Training took {:,} seconds".format(time.time() - start))
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch9_cell6_feat1_rf.npz", 
                    xtrain=xtrain, ytrain=ytrain,
                    xtest=xtest, ytest=ytest,
                    ytest_pred=ytest_pred)
import pickle
with open("ch9_cell6_model.pk", "wb") as pkl:
    pickle.dump(rfr, pkl)
with open("ch9_cell6_test_names.tsv", "w") as names:
    for idx, name in enumerate(test_names):
        names.write("{}\t{}\n".format(idx, name))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch9_cell6_rf_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Read 0 out of 44,959
Read 8,991 out of 44,959
Read 17,982 out of 44,959
Read 26,973 out of 44,959
Read 35,964 out of 44,959
Read 44,955 out of 44,959
Wrote 0 out of 44,959
Wrote 8,991 out of 44,959
Wrote 17,982 out of 44,959
Wrote 26,973 out of 44,959
Wrote 35,964 out of 44,959
Wrote 44,955 out of 44,959
Training on 35,971 cities
using 1,122 features
Baseline training took 19.819483757019043 seconds
Baseline RMSE 0.31895350002111916
Training took 1,123.1410493850708 seconds
RMSE 0.3434484570101145
Writing files for error analysis
Out[6]:
<matplotlib.legend.Legend at 0x7fbe8b7ab290>

That did not work very well, we're better off without the image features. Let's start by doing some visualization of the importance of the new features (Cell #7).

In [7]:
# CELL 7
import pickle
import math
import cv2

def cell7_model_to_image(model_file, feature_file):
    # get model
    rfr = None
    with open(model_file, "rb") as pkl:
        rfr = pickle.load(pkl)

    # get feature names
    header = None
    with open(feature_file) as f:
        header = next(f)
        header = header.strip().split('\t')
        header.pop() # population
        header.pop(0) # name

    first_pixel = list(filter(lambda x:x[1].startswith("p-"), enumerate(header)))[0][0]
    side = int(math.sqrt(len(header) - first_pixel))
    print("Maximum importance for non-pixel features", rfr.feature_importances_[:first_pixel].max())
    print("Mean importance    for non-pixel features", rfr.feature_importances_[:first_pixel].mean())
    print("Maximum importance for pixel     features", rfr.feature_importances_[first_pixel:].max())
    print("Mean importance    for pixel     features", rfr.feature_importances_[first_pixel:].mean())

    importances = rfr.feature_importances_[first_pixel:].copy()
    min_i = importances.min()
    max_i = importances.max()
    importances -= min_i
    importances /= (max_i - min_i)
    importances_copy = importances.copy()

    img = importances_copy.reshape( (side, side) )
    img *= 255
    img = img.astype('uint8')
    img2 = cv2.merge( [img, img, img] )
    return img2

img = cell7_model_to_image("ch9_cell6_model.pk", "ch9_cell6_dev_feat1.tsv")

from matplotlib import pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = [5, 5]
plt.imshow(img, aspect='equal')
plt.savefig("ch9_cell7_rf_feat_importance.pdf", bbox_inches='tight', dpi=300)
Maximum importance for non-pixel features 0.2270277171305718
Mean importance    for non-pixel features 0.007753290939281302
Maximum importance for pixel     features 0.00037305373851944003
Mean importance    for pixel     features 0.00023454832807659416

The feature importance shows the system is not caring much about the pixels nor any specific one neither. Let's try adding some Gaussian blur to smooth the data (Cell #8).

In [8]:
# CELL 8
import math
import random
import cv2
import numpy as np
import time
from sklearn.ensemble import RandomForestRegressor
PARAM_DATA_PATH='./boxes'

header = None
data = list()
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    header = next(f)
    header = header.strip().split('\t')
    for line in f:
        fields = line.strip().split("\t")
        name = fields.pop(0)
        logpop = float(fields[-1])
        feats = list(map(float, fields[:-1]))
        row = (feats, logpop, name)
        data.append(row)

uri_to_geoid = dict()
with open("ch6_cell1_cities1000_links.ttl") as f:
    for line in f:
        fields = line.split(' ')
        _id = fields[2]
        _id = _id.split('/')[-2]
        uri_to_geoid[fields[0]] = _id

header = list(header[:-1]) + [ fname for rowlist in map(lambda x: map(lambda y:"p-{}-{}".format(x,y), 
                                                                      range(32)), range(32))
                              for fname in rowlist ] + [ header[-1] ]
for idx, row in enumerate(data):
    if idx % (len(data) // 5) == 0:
        print("Read {:,} out of {:,}".format(idx, len(data)))
    feats, _, name = row
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, uri_to_geoid[name]))
    img = cv2.GaussianBlur(img, (3,3), 0)
    r, _, _ = cv2.split(img)
    feats.extend(r[16:48,16:48].reshape(-1))
    data[idx] = (np.array(feats), row[1], row[2])    

# write full file
with open("ch9_cell8_dev_feat1_gauss.tsv",  "w") as f:
    f.write("\t".join(header) + '\n')
    for idx, row in enumerate(data):
        if idx % (len(data) // 5) == 0:
            print("Wrote {:,} out of {:,}".format(idx, len(data)))
        feats, logpop, name = row
        f.write("{}\t{}\t{}\n".format(name, "\t".join(map(lambda x: str(int(x)) if int(x) == x else str(x),
                                                          feats)), logpop))
        
train_data = list()
test_data = list()
rand = random.Random(42)
for row in data:
    if rand.random() < 0.2:
        test_data.append(row) 
    else:
        train_data.append(row)
data=None
               
train_names = list(map(lambda t:t[2], train_data))

test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
train_names = list(map(lambda t:t[2], train_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data=None
test_data=None

# train RF
print("Training on {:,} cities\nusing {:,} features".format(*xtrain.shape))

start = time.time()
rfr = RandomForestRegressor(max_features=1.0, random_state=42, n_estimators=200, n_jobs=-1)
rfr.fit(xtrain[:,:98], ytrain)
print("Baseline training took {:,} seconds".format(time.time() - start))
ytest_pred = rfr.predict(xtest[:,:98])
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("Baseline RMSE", RMSE)

start = time.time()
rfr = RandomForestRegressor(max_features=1.0, random_state=42, n_estimators=200, n_jobs=-1)
rfr.fit(xtrain, ytrain)
print("Training took {:,} seconds".format(time.time() - start))
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch9_cell8_feat1_gauss_rf.npz", 
                    xtrain=xtrain, ytrain=ytrain,
                    xtest=xtest, ytest=ytest,
                    ytest_pred=ytest_pred)
import pickle
with open("ch9_cell8_model.pk", "wb") as pkl:
    pickle.dump(rfr, pkl)
with open("ch9_cell8_train_names.tsv", "w") as names:
    for idx, name in enumerate(train_names):
        names.write("{}\t{}\n".format(idx, name))
with open("ch9_cell8_test_names.tsv", "w") as names:
    for idx, name in enumerate(test_names):
        names.write("{}\t{}\n".format(idx, name))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch9_cell8_rf_feat1_gauss.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Read 0 out of 44,959
Read 8,991 out of 44,959
Read 17,982 out of 44,959
Read 26,973 out of 44,959
Read 35,964 out of 44,959
Read 44,955 out of 44,959
Wrote 0 out of 44,959
Wrote 8,991 out of 44,959
Wrote 17,982 out of 44,959
Wrote 26,973 out of 44,959
Wrote 35,964 out of 44,959
Wrote 44,955 out of 44,959
Training on 35,971 cities
using 1,122 features
Baseline training took 19.608769416809082 seconds
Baseline RMSE 0.31895350002111916
Training took 1,048.8585846424103 seconds
RMSE 0.340798195202612
Writing files for error analysis
Out[8]:
<matplotlib.legend.Legend at 0x7fbe5ca4d6d0>

There was a tiny improvement. Visualizing feature importance again Cell #9.

In [9]:
# CELL 9
img = cell7_model_to_image("ch9_cell8_model.pk", "ch9_cell8_dev_feat1_gauss.tsv")

from matplotlib import pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = [5, 5]
plt.imshow(img, aspect='equal')
plt.savefig("ch9_cell9_rf_feat_importance.pdf", bbox_inches='tight', dpi=300)
Maximum importance for non-pixel features 0.2271961323378289
Mean importance    for non-pixel features 0.007835081495580315
Maximum importance for pixel     features 0.00042086643504834554
Mean importance    for pixel     features 0.00022672071624329042

That did help a tiny bit and the weights look quite differente.

The other piece of advice is to whiten the data, let's try that (Cell #10).

In [10]:
# CELL 10
import math
import random
import cv2
import numpy as np
import time
from sklearn.ensemble import RandomForestRegressor
PARAM_DATA_PATH='./boxes'

header = None
data   = list()
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    header = next(f)
    header = header.strip().split('\t')
    for line in f:
        fields = line.strip().split("\t")
        name   = fields.pop(0)
        logpop = float(fields[-1])
        feats  = list(map(float, fields[:-1]))
        row    = (feats, logpop, name)
        data.append(row)

uri_to_geoid = dict()
with open("ch6_cell1_cities1000_links.ttl") as f:
    for line in f:
        fields = line.split(' ')
        _id = fields[2]
        _id = _id.split('/')[-2]
        uri_to_geoid[fields[0]] = _id

base_size = len(header) - 2
print("Base size", base_size)
header = list(header[:-1]) + [ fname for rowlist in map(lambda x: map(lambda y:"p-{}-{}".format(x,y), 
                                                                      range(32)), range(32))
                              for fname in rowlist ] + [ header[-1] ]
for idx, row in enumerate(data):
    if idx % (len(data) // 5) == 0:
        print("Read {:,} out of {:,}".format(idx, len(data)))
    feats, _, name = row
    img = cv2.imread("{}/{}.png".format(PARAM_DATA_PATH, uri_to_geoid[name]))
    img = cv2.GaussianBlur(img, (3,3), 0)
    r, _, _ = cv2.split(img)
    feats.extend(r[16:48,16:48].reshape(-1))
    data[idx] = (np.array(feats), row[1], row[2])    
       
train_data = list()
test_data  = list()
rand = random.Random(42)
for row in data:
    if rand.random() < 0.2:
        test_data.append(row) 
    else:
        train_data.append(row)
data = None
               
train_names = list(map(lambda t:t[2], train_data))

test_data   = sorted(test_data, key=lambda t:t[1])
test_names  = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

xtrain_orig = xtrain.copy()
xtest_orig  = xtest.copy()

# normalize
xtrain[:,base_size:] = xtrain[:,base_size:] / 256.0
xtest[:, base_size:] = xtest[:, base_size:] / 256.0

# centre images
means = np.mean(xtrain[:, base_size:], axis=0)
xtrain[:,base_size:] = xtrain[:,base_size:] - means
xtest[:, base_size:] = xtest[:, base_size:] - means

# compute whiten transform
covar   = np.dot(xtrain[:, base_size:].T, xtrain[:, base_size:]) / xtrain.shape[0]
E, D, _ = np.linalg.svd(covar) # PCA
#eps = 0.1 # Pal & Sudeep (2016)
eps   = 0.001
D_inv = 1. / np.sqrt(D[np.newaxis] + eps)
zca = (E * D_inv).dot(E.T)

#pca_whiten = np.dot(np.diag(1.0 / np.sqrt(D + eps)), E.T)
#zca = np.dot(E, pca_whiten)
#zca = np.dot(E, np.dot(np.diag(1.0 / np.sqrt(D + eps)), E.T))
whitened_train       = np.dot(xtrain[:, base_size:], zca)
xtrain[:,base_size:] = whitened_train

whitened_test        = np.dot(xtest[:, base_size:], zca)
xtest[:, base_size:] = whitened_test
    
# write full file
with open("ch9_cell10_dev_feat1_whiten.tsv",  "w") as f:
    f.write("\t".join(header) + '\n')
    for idx in range(xtrain.shape[0]):
        if idx % (xtrain.shape[0] // 5) == 0:
            print("Wrote {:,} out of {:,}".format(idx, xtrain.shape[0]))
        # different versions of python seem to print differently just using str
        f.write("{}\t{}\t{}\n".format(train_names[idx], 
                                      "\t".join(map(lambda x: str(int(x)) if int(x) == x else "{:0.12f}".format(x),
                                                          xtrain[idx,:])), ytrain[idx]))
    for idx in range(xtest.shape[0]):
        if idx % (xtest.shape[0] // 5) == 0:
            print("Wrote {:,} out of {:,}".format(idx, xtest.shape[0]))
        f.write("{}\t{}\t{}\n".format(test_names[idx], 
                                      "\t".join(map(lambda x: str(int(x)) if int(x) == x else "{:0.12f}".format(x),
                                                          xtest[idx,:])), ytest[idx]))

# train RF
print("Training on {:,} cities\nusing {:,} features".format(*xtrain.shape))

start = time.time()
rfr = RandomForestRegressor(max_features=1.0, random_state=42, n_estimators=200, n_jobs=-1)
rfr.fit(xtrain[:,:base_size], ytrain)
print("Baseline training took {:,} seconds".format(time.time() - start))
ytest_pred = rfr.predict(xtest[:,:base_size])
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("Baseline RMSE", RMSE)

start = time.time()
rfr = RandomForestRegressor(max_features=1.0, random_state=42, n_estimators=200, n_jobs=-1)
rfr.fit(xtrain, ytrain)
print("Training took {:,} seconds".format(time.time() - start))
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch9_cell10_feat1_whiten_rf.npz", 
                    xtrain=xtrain,           ytrain=ytrain,
                    xtrain_orig=xtrain_orig, xtest_orig=xtest_orig,
                    xtest=xtest,             ytest=ytest,
                    ytest_pred=ytest_pred)
import pickle
with open("ch9_cell10_model.pk", "wb") as pkl:
    pickle.dump(rfr, pkl)
with open("ch9_cell8_test_names.tsv", "w") as names:
    for idx, name in enumerate(test_names):
        names.write("{}\t{}\n".format(idx, name))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch9_cell10_rf_feat1_whiten.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Base size 98
Read 0 out of 44,959
Read 8,991 out of 44,959
Read 17,982 out of 44,959
Read 26,973 out of 44,959
Read 35,964 out of 44,959
Read 44,955 out of 44,959
Wrote 0 out of 35,971
Wrote 7,194 out of 35,971
Wrote 14,388 out of 35,971
Wrote 21,582 out of 35,971
Wrote 28,776 out of 35,971
Wrote 35,970 out of 35,971
Wrote 0 out of 8,988
Wrote 1,797 out of 8,988
Wrote 3,594 out of 8,988
Wrote 5,391 out of 8,988
Wrote 7,188 out of 8,988
Wrote 8,985 out of 8,988
Training on 35,971 cities
using 1,122 features
Baseline training took 19.62180995941162 seconds
Baseline RMSE 0.3189535000211191
Training took 1,843.5154190063477 seconds
RMSE 0.3476486854059412
Writing files for error analysis
Out[10]:
<matplotlib.legend.Legend at 0x7fbe5cac4e50>