Chapter 6: Graphs

Accompanying code for the book The Art of Feature Engineering.

This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng

MIT License

Copyright 2019 Pablo Duboue

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Limitations

  • Simple Python intended for people coming from other languages who plan to use the ideas described in the book outside of Python.
  • Many of these techniques are available as library calls. They are spelled out here for teaching purposes.
  • Resource limitations:
    • At most one day of running time per notebook.
    • No GPU required.
    • Minimal dependencies.
    • At most 8GB of RAM.
  • Due to resource limitations, these notebooks do not undergo as much hyperparameter tuning as would be ideal. This is a shortcoming of these case studies; keep it in mind if you want to follow a similar path with your experiments.
  • To help readers try variants of some cells in isolation, the cells are easily executable without having to re-run the whole notebook. As such, most cells read everything they need from disk and write all their results back to disk, which is unnecessary in normal notebooks. The code for each cell might therefore look long and somewhat unusual: in a sense, each cell tries to be a separate Python program.
  • I dislike Pandas so these notebooks are Pandas-free, which might seem unusual to some.

Chapter 6: Case Study on Graph Data

In this chapter I will describe the creation of the base WikiCities dataset employed in this chapter and the next three chapters.

WikiCities Dataset

It starts with the latest release of DBpedia [cite], from October 2016; DBpedia is a curated ontological resource obtained from the infoboxes at Wikipedia. Infoboxes are the tabular information appearing next to the page content on each Wikipedia page.

My intention with this dataset is to provide a task that can be attacked with regular features, time series features, textual features and image features. The task I found that allows for all these approaches is predicting the population of a city or town: based on its ontological properties (e.g., the name of its leader or its time zone); based on its historical population and other historical features (which involves time series analysis); based on the textual description of the place (which involves text analysis, particularly as the description many times includes the population itself); and based on a satellite image of the city (which involves image processing).

DBpedia is distributed as files of triples, where a triple consists of a subject, a verb and an object. The subject is always an entity in the ontology, prefixed by the namespace dbpedia.org. The verb is an ontological relation, of which there are many (type, leader-name, etc.). The object can be either another entity or a literal value (a string or a number). Besides the triple files, DBpedia also contains links to multiple other ontologies, like Freebase or Yago, plus links to the original Wikipedia articles that resulted in the extracted information. For these case studies we will be using the files "instance_types_en.ttl.bz2", "mappingbased_literals_en.ttl.bz2" and "mappingbased_objects_en.ttl.bz2".
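
To make the triple format concrete, here is a minimal sketch of parsing one N-Triples line with the same style of regular expression used throughout the cells below (the example line and its population value are made up):

```python
import re

# Same triple pattern used in the notebook cells: subject and verb are
# bracketed URIs, the object is a URI or a literal, and the line ends
# with " ."
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# A made-up line in the style of the DBpedia dumps (the population
# value here is hypothetical)
line = ('<http://dbpedia.org/resource/Psevdas> '
        '<http://dbpedia.org/ontology/populationTotal> '
        '"502"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .')

subj, verb, obj = triple_re.match(line).groups()
print(subj)  # <http://dbpedia.org/resource/Psevdas>
print(verb)  # <http://dbpedia.org/ontology/populationTotal>
print(obj)   # "502"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>
```

Note that literal objects carry a ^^<datatype> suffix that must be stripped before the value can be used as a number.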

The type information is derived from the infobox itself. The first challenge is to identify cities, as the "city" or "settlement" type is not thoroughly annotated in Wikipedia. Instead of relying on heuristics (such as "any entity with a location and a population", which would also pick up neighborhoods), I chose to use an external, more reliable source: the GeoNames project. The project distributes a list of "cities or settlements with at least 1,000 people", containing 128,000+ places (file "cities1000.txt"). To link them back to DBpedia we use the file "geonames_links.ttl.bz2" distributed by DBpedia (Cell #1).

In [1]:
# CELL 1
import re
import bz2

id_re = re.compile(r"^(\d+)")
url_re = re.compile(r".*geonames\.org\/(\d+)")

cities1000_ids = set()
with open("cities1000.txt") as cities_file:
    for line in cities_file:
        cities1000_ids.add(id_re.match(line).group(0))
print("Cities1000 total: {:,}".format(len(cities1000_ids)))

found = 0
with open("ch6_cell1_cities1000_links.ttl","w") as out:
    with bz2.BZ2File("geonames_links.ttl.bz2","r") as links:
        for byteline in links:
            if byteline[0] == ord('#'):
                continue
            line = byteline.decode("utf-8")
            match = url_re.match(line)
            if not match:
                print(byteline)
                continue
            if match.group(1) in cities1000_ids:
                out.write(line)
                found += 1            
print("In DBpedia found: {:,}".format(found))
cities1000_ids = None # free memory
Cities1000 total: 128,764
In DBpedia found: 80,251

Therefore, using the available links to DBpedia, we found 80,251 of them in the Wikipedia data set. These 80,251 cities plus DBpedia constitute the raw data for this problem (file "ch6_cell1_cities1000_links.ttl").

Following the methodology described in Chapter 1, I will set aside 20% of this data for final evaluation at the end of the feature engineering process (Cell #2).

In [2]:
# CELL 2
import random

rand = random.Random(42)
all_cities = open("ch6_cell1_cities1000_links.ttl").readlines()
rand.shuffle(all_cities)
pivot = int(len(all_cities) * 0.8)
devset = all_cities[:pivot]
heldout = all_cities[pivot:]
with open("ch6_cell2_cities1000_dev_links.ttl", "w") as dev:
    dev.write("".join(devset))
with open("ch6_cell2_cities1000_held_links.ttl", "w") as held:
    held.write("".join(heldout))
print("Devset size: {:,}".format(len(devset)))
print("Heldout size: {:,}".format(len(heldout)))
all_cities = devset = heldout = None # free memory
Devset size: 64,200
Heldout size: 16,051

Following the methodology in Chapter 1, I now want to do some exploratory data analysis over these 64,200 settlements to decide on a basic featurization and a model to employ.

The key issue to resolve here is to drill down on the specific relations we think will be useful to predict our target class. While DBpedia contains thousands of relations, many are seldom expressed (meaning there are few pages where a human editor felt the need to record information such as "distance-to-London").

While we want to count the number of relations in which our cities of interest partake, something to remember is that there might be value in relations that have a given city as their subject, but there might also be value in relations that have the city as their object (that is, the inverse of certain relations).

Because processing the large DBpedia files is computationally onerous, I'll start by pre-filtering the triples related to all the entities of interest, both forward and inverse, generating the file "ch6_cell3_cities1000_base.ttl" (Cell #3).

In [3]:
# CELL 3
city_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell1_cities1000_links.ttl").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

def compressed_triples_gen(filename):
    with bz2.BZ2File(filename,"r") as compressed:
        for byteline in compressed:
            if byteline[0] == ord('#'):
                continue
            line = byteline.decode("utf-8")
            s,v,o = triple_re.match(line).groups()
            yield line,s,v,o
            
written = 0
read = 0
with open("ch6_cell3_cities1000_base.ttl","w") as base:
    # types and literals only in subject position
    for line, subj, verb, obj in compressed_triples_gen("instance_types_en.ttl.bz2"):
        read += 1 
        if subj in city_uris:
            base.write(line)
            written += 1
    for line, subj, verb, obj in compressed_triples_gen("mappingbased_literals_en.ttl.bz2"):
        read += 1
        if subj in city_uris:
            base.write(line)
            written += 1
    for line, subj, verb, obj in compressed_triples_gen("mappingbased_objects_en.ttl.bz2"):
        read += 1
        if subj in city_uris:
            base.write(line)
            written += 1
        if obj in city_uris:
            base.write("{} {} {} .\n".format(obj, verb[0:-1] + "?inv>", subj))
            written += 1
print("Base {:,} out of {:,}".format(written, read))
city_uris = None # free memory
Base 2,014,559 out of 38,285,143

That code takes a while to run, but it will speed up the rest of the work quite a bit down the road, as we now only need to worry about 2 million triples instead of 38 million. The next step is to produce filtered versions just for the development and heldout sets, in the files "ch6_cell4_cities1000_dev_base.ttl" and "ch6_cell4_cities1000_heldout_base.ttl", respectively (Cell #4).

In [4]:
# CELL 4
devset_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell2_cities1000_dev_links.ttl").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

devset_size = 0
with open("ch6_cell3_cities1000_base.ttl") as base:
    with open("ch6_cell4_cities1000_dev_base.ttl", "w") as devset:
        with open("ch6_cell4_cities1000_heldout_base.ttl", "w") as heldout:
            for line in base:
                subj, verb, obj = triple_re.match(line).groups()
                if subj in devset_uris:
                    devset.write(line)
                    devset_size += 1
                else:
                    heldout.write(line)
print("Devset base size: {:,} triples".format(devset_size))
devset_uris = None # free memory
Devset base size: 1,625,988 triples

These close to 1.6 million triples are our working set.

EDA: Exploratory Data Analysis

I'll start the exploratory data analysis by visualizing 10 random entities using the Graphviz package (Cell #5). The cities are rendered to files ch6_cell5_city_*.pdf.

In [5]:
# CELL 5
import re
import graphviz
import random

PARAM_USE_SET_FOR_BOOK = True

devset_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell1_cities1000_links.ttl").readlines()))
rand = random.Random(42)
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# stable set for book
cities = [ "Doolittle,_Texas", "Belton,_Missouri", "Poiana_Sibiului", "Unadilla,_Georgia",
          "Sankeshwar,_Karnataka", "Psevdas", "Skudai", "Santa_Maria_da_Boa_Vista",
          "Newstead,_Tasmania", "Gombi" ]

if PARAM_USE_SET_FOR_BOOK:
    cities = list(map(lambda x: "<http://dbpedia.org/resource/" + x + ">", cities))
else:
    cities = set(rand.sample(sorted(devset_uris), 10))

nodes = dict()
dots = dict()

for city_uri in cities:
    dots[city_uri] = graphviz.Digraph(format='pdf', graph_attr={'rankdir':'BT'})
    nodes[city_uri] = dict()

def cell5_make_or_return_node(city_uri, name, shape=None):
    if name in nodes[city_uri]:
        return nodes[city_uri][name]
    ret = 'N' + str(len(nodes[city_uri]))
    nodes[city_uri][name] = ret
    if shape is None:
        dots[city_uri].node(ret, name)
    else:
        dots[city_uri].node(ret, name.replace('&', ' '), shape=shape)
    return ret

for city_uri in cities:
    cell5_make_or_return_node(city_uri, city_uri, 'star')
    
rel_count = dict()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in cities:
            city_uri = subj
            rel_count[verb] = rel_count.get(verb, 0) + 1
            if verb.endswith("?inv>"):
                subj, obj = obj, subj
                verb = verb[:-5] + ">"
            subj_name = cell5_make_or_return_node(city_uri, subj)
            obj_name = cell5_make_or_return_node(city_uri, obj)
            dots[city_uri].edge(subj_name, obj_name, verb)
for city_uri in cities:
    city = city_uri[:-1].split("/")[-1]
    dot = dots[city_uri]
    print("Rendering",city,":",len(dot.source.splitlines()), "dot lines")
    dot.render("ch6_cell5_city_" + city + ".dot")
with open("ch6_cell5_rel_counts.tsv", "w") as rels:
    for rel in sorted(rel_count.keys()):
        rels.write("{}\t{}\n".format(rel,rel_count[rel]))
print("\nTop relation counts:")
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>{}</td><td>{}</td></tr>".format(r[0][:-1].split("/")[-1],r[1]), 
                               sorted(rel_count.items(), key=lambda t:t[1], reverse=True)[:10]))) +
            "</table>"))
Rendering Doolittle,_Texas : 4 dot lines
Rendering Belton,_Missouri : 62 dot lines
Rendering Poiana_Sibiului : 37 dot lines
Rendering Unadilla,_Georgia : 54 dot lines
Rendering Sankeshwar,_Karnataka : 4 dot lines
Rendering Psevdas : 22 dot lines
Rendering Skudai : 41 dot lines
Rendering Santa_Maria_da_Boa_Vista : 4 dot lines
Rendering Newstead,_Tasmania : 18 dot lines
Rendering Gombi : 20 dot lines

Top relation counts:
Rel | Counts
utcOffset | 10
isPartOf | 10
timeZone | 8
22-rdf-syntax-ns#type | 7
name | 7
country | 7
populationTotal | 5
areaTotal | 4
leaderTitle | 4
elevation | 4

From here we can see that the randomly sampled cities are all fairly small, which suggests that most rows will have very little data while a few will contain the bulk of the triples. The amount of information available is already a great indicator of population.

From looking at the graphs we can see that certain places (like Skudai) have no population information. The next step is thus to filter cities for which the population is known (as we can only use those to build a regressor or classifier).

From the relation table in "ch6_cell5_rel_counts.tsv" we can see that some relations are very rare (like homepage or leaderParty) while others are defined for almost all cities (like rdf type or areaLand). The inverse relations are also quite sparse, but then again these places are all sparsely populated.

But first we need to know how population is expressed in DBpedia. From Psevdas we can see it has a relation "populationTotal". Is that the only way population is expressed? Let us see which relations contain the word "population" in their name (Cell #6).

In [6]:
# CELL 6
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rels_of_interest = dict()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if "population" in verb:
            rels_of_interest[verb] = rels_of_interest.get(verb, 0) + 1
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>{}</td><td>{}</td></tr>".format(r[0][:-1].split("/")[-1],r[1]), 
                               sorted(rels_of_interest.items(), key=lambda t:t[1], reverse=True)))) +
            "</table>"))
Rel | Counts
populationTotal | 49028
populationDensity | 19540
populationAsOf | 5410
populationPlace?inv | 2511
populationMetro | 1286
populationUrban | 1188
populationTotalRanking | 1057
populationMetroDensity | 142
populationUrbanDensity | 79
populationRural | 42

It seems "populationTotal" is the right relation to use, but we will need to filter out populationMetro, populationUrban and populationRural. Also, if populationTotal is not defined, any of those, if defined, could be used as a surrogate for our target class. At this stage we want to filter the cities with a defined population; I will put them into the file "ch6_cell7_dev_pop.tsv" and also collect the population value itself (which will require cleaning type information such as ^^http://www.w3.org/2001/XMLSchema#nonNegativeInteger). This number will be our target value for regression (or, discretized, our target class).
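
The literal cleaning mentioned above can be sketched as a small standalone helper (the function name is mine, not from the notebook's code):

```python
def parse_literal_number(obj):
    """Parse an N-Triples literal object such as
    "49028"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>
    by stripping the quotes and the ^^<datatype> suffix; returns the
    value as a float, or None if it is not numeric."""
    if obj.startswith('"'):
        # keep only the text between the first pair of double quotes
        obj = obj[1:obj.index('"', 1)]
    try:
        return float(obj)
    except ValueError:
        return None

print(parse_literal_number(
    '"49028"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>'))  # 49028.0
print(parse_literal_number('"not-a-number"'))  # None
```

Cell #7 below performs the same cleaning inline while also arbitrating between the different population relations.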

I apply this process to both the dev and heldout sets (the heldout populations go into file "ch6_cell7_heldout_pop.tsv") in Cell #7.

In [7]:
# CELL 7
import re
from collections import OrderedDict

cities_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell1_cities1000_links.ttl").readlines()))
devset_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell2_cities1000_dev_links.ttl").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

population = OrderedDict()
with open("ch6_cell3_cities1000_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        verb = verb[:-1].split("/")[-1]
        update = False
        conditional = True
        if verb == "populationTotal":
            update = True
            conditional = False
        elif verb in set(['populationMetro', 'populationUrban', 'populationRural']):
            if subj in population and population[subj][1] == 'populationTotal':
                pass # keep current
            else:
                update = True
                conditional = subj in population
        if update:
            # compute actual population
            if obj[0] == '"':
                obj = obj[1:obj[1:].index('"')+1]
            try:
                pop = float(obj)
            except ValueError:
                print("For city", subj, "cannot convert", line)
                continue # ignore
            if not conditional or population[subj][0] < pop:
                population[subj] = (pop, verb)
                
dev_cities_with_pop = 0
with open("ch6_cell7_dev_pop.tsv", "w") as dev:
    with open("ch6_cell7_heldout_pop.tsv", "w") as held:
        for city, pop in population.items():
            line = "{}\t{}\n".format(city, pop[0])
            if city in devset_uris:
                dev.write(line)
                dev_cities_with_pop += 1
            else:
                held.write(line)
print("Cities with known population in devset: {:,}".format(dev_cities_with_pop))
Cities with known population in devset: 48,868

Now we can proceed to repeat the plotting of 10 random cities in the devset with known population (Cell #8).

In [8]:
# CELL 8
import re
import graphviz
import random

PARAM_USE_SET_FOR_BOOK = True

devset_uris = set(map(lambda line:line.split('\t')[0], open("ch6_cell7_dev_pop.tsv").readlines()))
rand = random.Random(42)
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# stable set for book
cities = [ "Ovalle", "Hawaiian_Paradise_Park,_Hawaii", "Coupar_Angus",
          "Hilsenheim", "Esko,_Minnesota", "Hamlin,_New_York", "Falkenberg,_Märkisch-Oderland",
          "Wermelskirchen", "Berne,_Indiana", "Chennevières-sur-Marne" ]

if PARAM_USE_SET_FOR_BOOK:
    cities = list(map(lambda x: "<http://dbpedia.org/resource/" + x + ">", cities))
else:
    cities = set(rand.sample(sorted(devset_uris), 10))

nodes = dict()
dots = dict()

for city_uri in cities:
    dots[city_uri] = graphviz.Digraph(format='pdf', graph_attr={'rankdir':'BT'})
    nodes[city_uri] = dict()

for city_uri in cities:
    cell5_make_or_return_node(city_uri, city_uri, 'star')

rel_count = dict()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in cities:
            city_uri = subj
            rel_count[verb] = rel_count.get(verb, 0) + 1
            if verb.endswith("?inv>"):
                subj, obj = obj, subj
                verb = verb[:-5] + ">"
            subj_name = cell5_make_or_return_node(city_uri, subj)
            obj_name = cell5_make_or_return_node(city_uri, obj)
            dots[city_uri].edge(subj_name, obj_name, verb)
for city_uri in cities:
    city = city_uri[:-1].split("/")[-1]
    dot = dots[city_uri]
    print("Rendering",city,":",len(dot.source.splitlines()), "dot lines")
    dot.render("ch6_cell8_city_" + city + ".dot")
    
with open("ch6_cell8_rel_counts.tsv", "w") as rels:
    for rel in sorted(rel_count.keys()):
        rels.write("{}\t{}\n".format(rel,rel_count[rel]))
print("\nTop relation counts:")
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>{}</td><td>{}</td></tr>".format(r[0][:-1].split("/")[-1],r[1]), 
                               sorted(rel_count.items(), key=lambda t:t[1], reverse=True)[:10]))) +
            "</table>"))
Rendering Ovalle : 59 dot lines
Rendering Hawaiian_Paradise_Park,_Hawaii : 48 dot lines
Rendering Coupar_Angus : 4 dot lines
Rendering Hilsenheim : 28 dot lines
Rendering Esko,_Minnesota : 37 dot lines
Rendering Hamlin,_New_York : 46 dot lines
Rendering Falkenberg,_Märkisch-Oderland : 32 dot lines
Rendering Wermelskirchen : 45 dot lines
Rendering Berne,_Indiana : 67 dot lines
Rendering Chennevières-sur-Marne : 20 dot lines

Top relation counts:
Rel | Counts
isPartOf | 11
name | 10
areaCode | 10
22-rdf-syntax-ns#type | 9
elevation | 9
populationTotal | 9
country | 9
areaTotal | 8
postalCode | 8
birthPlace?inv | 8

These cities with defined population definitely have much more going on, including multiple areaCode relations (three in the case of Wermelskirchen) and a variety of inverse relations. At this stage of the EDA process we want to look into which of these relations might make for good features. Going back to Chapter 1, we want features that are simple, related to the target (population) and readily available. From the perspective of "readily available", we want to narrow the relations down to ones that are frequent enough to be useful to a ML algorithm (I will look into the other issues later). Therefore, their frequency over the whole set of entities is paramount. Let us sort the relations by the number of cities that have them defined (Cell #9).

In [9]:
# CELL 9
import re
from collections import OrderedDict

devset_uris = set(map(lambda line:line.split('\t')[0], open("ch6_cell7_dev_pop.tsv").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count = OrderedDict()
rel_subj_seen = set()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in devset_uris:
            rel_subj = "{}-{}".format(subj,verb)
            if rel_subj not in rel_subj_seen:
                rel_count[verb] = rel_count.get(verb, 0) + 1
                rel_subj_seen.add(rel_subj)
sorted_count = sorted(rel_count.items(), key=lambda t:t[1])

# drop long names
sorted_count = list(map(lambda t:(t[0][:-1].split("/")[-1], str(t[1])), sorted_count))
table = list()
step = int(len(sorted_count) / 10)
print("Total relations", len(sorted_count), "percentile size", step)
table.append( ("First", sorted_count[0][0], sorted_count[0][1], 
              "{}%".format(int(float(sorted_count[0][1]) / len(devset_uris) * 10000) / 100) ))
current = step
percentile = 1
while current < len(sorted_count):
    table.append( ("At {} percentile".format(percentile*10), sorted_count[current][0],
                 sorted_count[current][1],
                 "{}%".format(int(float(sorted_count[current][1]) / len(devset_uris) * 10000) / 100) ) )
    current += step
    percentile += 1
table.append( ("Last", sorted_count[-1][0], sorted_count[-1][1], 
              "{}%".format(int(float(sorted_count[-1][1]) / len(devset_uris) * 10000) / 100) ))


from IPython.display import HTML, display
display(HTML("<table><tr><th>Pos</th><th>Rel</th><th>Counts</th><th>Percentage of cities defined</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>" + "</td><td>".join(r) + "</td></tr>", table))) +
            "</table>"))
Total relations 347 percentile size 34
Pos | Rel | Counts | Percentage of cities defined
First | buildingEndDate | 1 | 0.0%
At 10 percentile | sportCountry?inv | 1 | 0.0%
At 20 percentile | lowestState | 3 | 0.0%
At 30 percentile | inflow?inv | 5 | 0.01%
At 40 percentile | board?inv | 9 | 0.01%
At 50 percentile | riverMouth?inv | 27 | 0.05%
At 60 percentile | county?inv | 78 | 0.15%
At 70 percentile | knownFor?inv | 209 | 0.42%
At 80 percentile | populationTotalRanking | 699 | 1.43%
At 90 percentile | headquarter?inv | 2769 | 5.66%
At 100 percentile | areaTotal | 29917 | 61.22%
Last | 22-rdf-syntax-ns#type | 48868 | 100.0%

Which relations we use is key to the success of the ML. At this stage we will over-select and later drill down to find a better subset. But we do not want to select too many, particularly ones that are not useful: anything appearing in less than 5% of the cities is unlikely to help, even if it looks very appealing (such as "populationTotalRanking", available for only 1.43% of the cities in the devset). The next stage is then to write down the list of final relations (file "ch6_cell10_rels.txt") and filter the devset to only those triples (file "ch6_cell10_cities1000_dev_filtered.ttl"). That will also further reduce computation down the road (Cell #10).

In [10]:
# CELL 10
import re
from collections import OrderedDict

devset_uris = set(map(lambda line:line.split('\t')[0], open("ch6_cell7_dev_pop.tsv").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count = OrderedDict()
rel_subj_seen = set()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in devset_uris:
            rel_subj = "{}-{}".format(subj,verb)
            if rel_subj not in rel_subj_seen:
                rel_count[verb] = rel_count.get(verb, 0) + 1
                rel_subj_seen.add(rel_subj)

# delete populations
del rel_count["<http://dbpedia.org/ontology/populationTotal>"]
del rel_count["<http://dbpedia.org/ontology/populationMetro>"]
del rel_count["<http://dbpedia.org/ontology/populationUrban>"]
del rel_count["<http://dbpedia.org/ontology/populationRural>"]
                
threshold = len(devset_uris) * 0.05
selected_rels = sorted(filter(lambda t:t[1] >= threshold, rel_count.items()), key=lambda t:t[0])

with open("ch6_cell10_rels.txt", "w") as rels:
    rels.write("\n".join(map(lambda t:t[0], selected_rels))+"\n")

selected_rels = set(map(lambda t:t[0], selected_rels))
written = 0
with open("ch6_cell10_cities1000_dev_filtered.ttl", "w") as filtered:
    with open("ch6_cell4_cities1000_dev_base.ttl") as base:
        for line in base:
            subj, verb, obj = triple_re.match(line).groups()
            if subj in devset_uris and verb in selected_rels:
                filtered.write(line)
                written += 1

print("Selected", len(selected_rels), "relations")
print("Filtered down to {:,} triples".format(written))
Selected 43 relations
Filtered down to 1,278,943 triples

From all this process we are now down to 43 relations and 1.3 million triples for about 48,000 instances. But we still do not have features, as these relations can have multiple values and those values are unordered. To deal with this situation, we can recall the techniques from Chapter 5 for dealing with sets, lists, graphs, etc. For the set-based (unordered) nature of the relation values, we can sort them, in this case by how frequent each value is; if a value is more frequent, it might be more informative to the ML. That leaves us with a list of values for each relation. To represent lists, a common technique is to truncate them to a maximum size. We thus need to see how many times each of these 43 relations is defined for a given entity (Cell #11).
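
The sort-then-truncate idea can be sketched as follows (a hypothetical helper with made-up counts, not the notebook's code):

```python
from collections import Counter

def to_fixed_features(values, global_counts, max_len, pad="N/A"):
    """Turn an unordered set of relation values into a fixed-length
    list: sort by how frequent each value is overall (most frequent
    first), truncate to max_len, and pad short lists with a sentinel."""
    ordered = sorted(values, key=lambda v: -global_counts[v])[:max_len]
    return ordered + [pad] * (max_len - len(ordered))

# made-up global frequency counts for the values of a timeZone-like relation
counts = Counter({"CET": 1000, "EET": 300, "MSK": 50})

print(to_fixed_features({"EET", "CET"}, counts, 2))  # ['CET', 'EET']
print(to_fixed_features(["MSK"], counts, 3))         # ['MSK', 'N/A', 'N/A']
```

The maximum length per relation will be chosen from the fan-out statistics computed below.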

In [11]:
# CELL 11
import re
from collections import OrderedDict

triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count_per_city = OrderedDict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if verb not in rel_count_per_city:
            rel_count_per_city[verb] = OrderedDict()
        rel_count_per_city[verb][subj] = rel_count_per_city[verb].get(subj, 0) + 1

# sort and compute min, max, average and median
table = list()
for rel in sorted(rel_count_per_city.keys()):
    city_counts = sorted(rel_count_per_city[rel].items(), key=lambda t:t[1])
    city_counts = list(map(lambda t:(t[0][:-1].split("/")[-1],t[1]), city_counts))
    table.append( (rel[:-1].split("/")[-1], # rel 
                   len(city_counts), # counts
          city_counts[-1], # max
          int(sum(map(lambda t:t[1], city_counts)) / len(city_counts) * 100) / 100, # avg
          city_counts[int(len(city_counts) / 2)], # median
          city_counts[0])  # min  
    )
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th><th>Max</th><th>Avg</th><th>Median</th><th>Min</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>"+"</td><td>".join(list(map(str,r)))+
                               "</td></tr>", table))) +"</table>"))
Rel | Counts | Max | Avg | Median | Min
area | 4803 | ('Saint-Germain-Nuelles', 1) | 1.0 | ('Labeuvrière', 1) | ('Austin,_Texas', 1)
areaCode | 25417 | ('Iserlohn', 8) | 1.14 | ('Martin,_Slovakia', 1) | ('Ankara', 1)
areaLand | 11669 | ('Spotswood,_New_Jersey', 2) | 1.69 | ('Reinbeck,_Iowa', 2) | ('Amsterdam', 1)
areaTotal | 29917 | ('Spotswood,_New_Jersey', 2) | 1.27 | ('Osann-Monzel', 1) | ('Ankara', 1)
areaWater | 10950 | ('Spotswood,_New_Jersey', 2) | 1.42 | ('Mebane,_North_Carolina', 1) | ('Amsterdam', 1)
birthPlace?inv | 21089 | ('London', 6258) | 13.31 | ('Itō,_Shizuoka', 3) | ('Kaysersberg', 1)
city?inv | 13551 | ('London', 709) | 3.9 | ('Ashburnham,_Massachusetts', 2) | ('Iquitos', 1)
country | 45977 | ('Sada,_Galicia', 8) | 1.04 | ('Talence', 1) | ('Ankara', 1)
deathPlace?inv | 10046 | ('London', 3059) | 8.36 | ('Longmeadow,_Massachusetts', 2) | ('Sauk_City,_Wisconsin', 1)
department | 2587 | ('Saint-Denis,_Seine-Saint-Denis', 1) | 1.0 | ('Trémuson', 1) | ('Ajaccio', 1)
district | 5842 | ('Waterloo,_Illinois', 6) | 1.01 | ('Müllrose', 1) | ('Aalen', 1)
elevation | 28669 | ('Córdoba,_Argentina', 4) | 1.33 | ('Villaquilambre', 1) | ('Ankara', 1)
foundingDate | 3003 | ('Tampa,_Florida', 5) | 1.14 | ('Rio_Branco,_Acre', 1) | ('Austin,_Texas', 1)
ground?inv | 3368 | ('Vienna', 134) | 1.94 | ('Nödinge-Nol', 1) | ('Kansas_City,_Kansas', 1)
headquarter?inv | 2769 | ('London', 509) | 5.63 | ('Cortona', 1) | ('Wigton', 1)
hometown?inv | 4144 | ('London', 1562) | 6.52 | ('West_Covina,_California', 1) | ('Aston', 1)
inseeCode | 3906 | ('Saint-Germain-Nuelles', 1) | 1.0 | ('Ploumagoar', 1) | ('Ajaccio', 1)
isPartOf | 32266 | ('Baima,_Baxoi_County', 10) | 2.02 | ('Omalur', 2) | ('Amsterdam', 1)
isPartOf?inv | 3786 | ('Shijiazhuang', 296) | 7.0 | ('Santa_Teresa_del_Tuy', 1) | ('Tuxtla_Gutiérrez', 1)
leaderName | 3192 | ('North_York', 24) | 1.35 | ('Pointe-Noire', 1) | ('Ankara', 1)
leaderTitle | 12322 | ('Canora,_Saskatchewan', 8) | 1.45 | ('Bad_Sachsa', 1) | ('Amsterdam', 1)
location?inv | 14236 | ('London', 2438) | 7.43 | ('Schiller_Park,_Illinois', 2) | ('Sèvres', 1)
locationCity?inv | 2822 | ('London', 684) | 4.33 | ('Fall_River,_Massachusetts', 1) | ('Orchard_Park_(town),_New_York', 1)
maximumElevation | 4159 | ('Zahlé', 2) | 1.0 | ('Gières', 1) | ('Aarhus', 1)
minimumElevation | 4093 | ('Zahlé', 2) | 1.0 | ('Raismes', 1) | ('Aarhus', 1)
motto | 2689 | ('Mérida,_Mérida', 4) | 1.08 | ('Moraine,_Ohio', 1) | ('Amsterdam', 1)
nearestCity?inv | 4136 | ('Middletown,_Orange_County,_New_York', 46) | 1.78 | ('Owatonna,_Minnesota', 1) | ('Copenhagen', 1)
populationAsOf | 5303 | ('Gilroy,_California', 2) | 1.0 | ('Ellrich', 1) | ('Aarhus', 1)
populationDensity | 13788 | ('Batavia,_New_York', 2) | 1.41 | ('Santiz', 1) | ('Austin,_Texas', 1)
postalCode | 31782 | ('Nijemci', 8) | 1.0 | ('Lingig,_Surigao_del_Sur', 1) | ('Ankara', 1)
region | 4711 | ('Centralia,_Illinois', 4) | 1.01 | ('La_Châtaigneraie', 1) | ('Aachen', 1)
residence?inv | 5283 | ('New_York_City', 703) | 4.4 | ('Chonburi_(city)', 1) | ('Kiyosu', 1)
routeEnd?inv | 4079 | ('Chicago', 33) | 1.46 | ('Nongstoin', 1) | ('Sleepy_Hollow,_New_York', 1)
routeJunction?inv | 5301 | ('Chicago', 23) | 1.94 | ('Smithville,_Tennessee', 1) | ('Sleepy_Hollow,_New_York', 1)
routeStart?inv | 4136 | ('Chicago', 16) | 1.53 | ('Hilliard,_Florida', 1) | ('Dana_Point,_California', 1)
timeZone | 28615 | ('Jerusalem', 4) | 1.35 | ('Bhikhi,_Punjab', 1) | ('Ames,_Iowa', 1)
type | 21497 | ('Kandi,_Benin', 3) | 1.02 | ('Jericho,_Vermont', 1) | ('Ankara', 1)
utcOffset | 27762 | ('Lőrinci', 3) | 1.74 | ('Marietta,_Ohio', 2) | ('Algiers', 1)
22-rdf-syntax-ns#type | 48868 | ('Okny', 1) | 1.0 | ('Labarthe-Rivière', 1) | ('Ankara', 1)
rdf-schema#seeAlso | 3426 | ('Dubai', 19) | 2.69 | ('Coventry,_Rhode_Island', 3) | ('Bishkek', 1)
homepage | 20319 | ('Pune', 4) | 1.0 | ('Nueva_Loja', 1) | ('Amsterdam', 1)
name | 45193 | ('Songtao_Miao_Autonomous_County', 5) | 1.19 | ('Castillejo_de_Martín_Viejo', 1) | ('Ankara', 1)
nick | 2764 | ('Davao_City', 10) | 1.1 | ('Rotterdam_(town),_New_York', 1) | ('Ankara', 1)

There are multiple things we can see here. First, no relation has a minimum greater than 1. That is good because it means no relation is inherently multiple. We can also see that the inverted relations have substantive maximums, which makes sense: cities such as London will be the birthplaces of thousands of Wikipedia-worthy people. Clearly the most important signal here is the number of such inbound relations rather than the actual entities being referred to, but we will leave that to be brought out by feature selection techniques such as the ones discussed in Chapter 4. Some of the averages are also pretty high, but they seem to be driven by a few megacities. The medians are much more informative, with only a few beyond one. I will thus focus on the medians and extract the file "ch6_cell12_rels_median_fan_outs.tsv" (Cell #12).
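As a quick aside on why the median is the more robust summary here, consider a toy relation (the numbers are made up for illustration, not taken from the data) where a single London-sized hub drags the mean far above the typical city while the median stays put:

```python
import statistics

# toy fan-out counts: 99 typical cities with one value each, one London-like hub
fan_outs = [1] * 99 + [1562]

print(statistics.mean(fan_outs))    # dragged up to 16.61 by the single hub
print(statistics.median(fan_outs))  # stays at 1, the typical fan-out
```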

In [12]:
# CELL 12
import re
from collections import OrderedDict

triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count_per_city = OrderedDict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if verb not in rel_count_per_city:
            rel_count_per_city[verb] = OrderedDict()
        rel_count_per_city[verb][subj] = rel_count_per_city[verb].get(subj, 0) + 1

# sort and compute min, max, average and median
with open("ch6_cell12_rels_median_fan_outs.tsv", "w") as medians:
    for rel in sorted(rel_count_per_city.keys()):
        city_counts = sorted(rel_count_per_city[rel].items(), key=lambda t:t[1])
        medians.write("{}\t{}\n".format(rel,city_counts[int(len(city_counts) / 2)][1]))

Base Featurization

We are now almost ready to produce a base featurization. All we are missing are global counts for all literals and entities that are targets of the relations, which we use to sort and select the top values according to the median fan-out computed in Cell #12. To these features I will also add the total count per relation and a grand total over all relevant relations. Missing features are marked with the string "NA". Adding the population produces the file "ch6_cell13_dev_feat0.tsv", the first full featurization of the dataset (Cell #13).
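The core selection step of that featurization, keeping each relation's globally most popular values up to its median fan-out and padding with "NA", can be sketched in isolation as follows (the popularity table and fan-out here are toy values, not from the data):

```python
def top_values(values, popularity, fan_out):
    """Keep the fan_out globally most popular values, padding with 'NA'."""
    kept = sorted(values, key=lambda v: popularity.get(v, 0), reverse=True)[:fan_out]
    return kept + ['NA'] * (fan_out - len(kept))

# toy global popularity counts
popularity = {'London': 900, 'Paris': 450, 'Aalen': 3}
print(top_values(['Aalen', 'London'], popularity, 3))
# ['London', 'Aalen', 'NA']
```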

In [13]:
# CELL 13
import re
from collections import OrderedDict

triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# compute counts for ordering values
popularity = dict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        popularity[obj] = popularity.get(obj, 0) + 1

# fan outs
fan_outs = dict()
with open("ch6_cell12_rels_median_fan_outs.tsv") as medians:
    for line in medians:
        rel, fan_out = line.strip().split("\t")
        fan_outs[rel] = int(fan_out)
rels = sorted(fan_outs.keys())
        
# population
pops = dict()
with open("ch6_cell7_dev_pop.tsv") as dev_pop:
    for line in dev_pop:
        city, pop = line.strip().split("\t")
        pops[city] = float(pop)
        
# full graph
graph = OrderedDict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if verb in fan_outs:
            if subj not in graph:
                graph[subj] = OrderedDict()
            if verb not in graph[subj]:
                graph[subj][verb] = list()
            graph[subj][verb].append(obj)

# putting it all together
rows_written = 0
with open("ch6_cell13_dev_feat0.tsv", "w") as feats:
    # header
    feats.write("name\trel#count")
    rel_start = dict()
    idx = 2
    for rel in rels:
        fan_out = fan_outs[rel]
        rel_start[rel] = idx
        rel = rel[1:-1]
        feats.write("\t{}#count".format(rel))
        for i in range(fan_out):
            feats.write("\t{}#{}".format(rel, i+1))
        idx += fan_out + 1
    feats.write("\tpopulation\n")
    rows_written += 1
    
    # data
    for city, data in graph.items():
        feats.write("{}\t{}".format(city, sum(map(len,data.values()))))
        for rel in rels:
            fan_out = fan_outs[rel]
            if rel not in data:
                feats.write("\tNA" * (fan_out + 1))
            else:
                feats.write("\t{}".format(len(data[rel])))
                values = sorted(data[rel], key=lambda o:popularity[o], reverse=True)[:fan_out]
                feats.write("\t" + "\t".join(values))
                written = len(values)
                while written < fan_out:
                    feats.write("\tNA")
                    written += 1
        feats.write("\t{}\n".format(pops[city]))
        rows_written += 1
print("Rows written (including header): {:,}".format(rows_written))
Rows written (including header): 48,869

We are now at the point where we can start thinking about the ML algorithm we will employ over this data. It contains almost 100 columns of different types and a numeric target. Let us start by plotting the target and its distribution (Cell #14).

In [14]:
# CELL 14
import matplotlib.pyplot as plt
%matplotlib inline

pops = list()
with open("ch6_cell13_dev_feat0.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        pops.append(float(line.strip().split("\t")[-1]) / 1e6)  # population in millions
pops = sorted(pops)

plt.plot(pops, label="population", color='gray')
plt.ylabel('population in millions')
plt.savefig("ch6_cell14_population.pdf", bbox_inches='tight', dpi=300)
plt
Out[14]:
<module 'matplotlib.pyplot' from '/vol/bulk/work/feateng/finaltest/chapter6/feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

That curve is quite steep at the end. To flatten it down, we can apply a logarithmic function, as it helps put such growth processes in perspective, as discussed in Chapter 3. Instead of the default logarithm in base $e$ I will use the more intuitive base 10, which tells us the number of digits in the population of a city. Cell #15 shows the result.
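This digit-counting intuition is easy to verify with a few populations (arbitrary example numbers, not notebook data):

```python
import math

# the integer part of log10 plus one equals the number of digits
for pop in (1_000, 55_013, 8_918_652):
    digits = int(math.log10(pop)) + 1
    print(pop, round(math.log10(pop), 2), digits, len(str(pop)))
```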

In [15]:
# CELL 15
import math
import matplotlib.pyplot as plt
%matplotlib inline

pops = list()
with open("ch6_cell13_dev_feat0.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        pop = float(line.strip().split("\t")[-1])
        if pop == 0:
            pop = 1
        pops.append(math.log(pop, 10))
pops = sorted(pops)

plt.plot(pops, label="log population", color='gray')
plt.ylabel('log population')
plt.savefig("ch6_cell15_log_population.pdf", bbox_inches='tight', dpi=300)
plt
Out[15]:
<module 'matplotlib.pyplot' from '/vol/bulk/work/feateng/finaltest/chapter6/feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

The first thing to notice in this figure is that there seem to be many cities in the data set with fewer than 1,000 inhabitants. Some even with only 1 person! Looking a little bit into the triples, I can see that some are actual mistakes in Wikipedia (a city with 1,000,000 people got typed as 1.000.000, etc.). Others are small towns and hamlets with fewer than 1,000 people (though they might have had more than 1,000 at the time GeoNames built their list). Following the discussion on outliers in Chapter 2, these are outliers that are worth removing from the data. I will still leave them in the held-out final test data, as their impact in production needs to be seen and evaluated.

We thus remove these outliers and produce a filtered file "ch6_cell16_dev_feat0_filtered.tsv" (Cell #16).

In [16]:
# CELL 16
written = 0
with open("ch6_cell16_dev_feat0_filtered.tsv", "w") as filtered:
    with open("ch6_cell13_dev_feat0.tsv") as feats:
        filtered.write(next(feats))
        for line in feats:
            pop = float(line.strip().split("\t")[-1])
            if pop >= 1000:
                filtered.write(line)
                written += 1
print("Rows written (excluding the header): {:,}".format(written))
Rows written (excluding the header): 44,959

So we did not throw away much data with this change, good.

Now we can try some learning for EDA. Let us try some regression models against the single feature "number of relations defined", starting with an SVR with an RBF kernel (Cell #17).

In [17]:
# CELL 17
import random
import math
from sklearn.svm import SVR
import numpy as np

rand = random.Random(42)
train_data = list()
test_data = list()
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        fields = line.strip().split("\t")
        pop = float(fields[-1])
        rels = float(fields[1])
        row = (math.log(rels, 10), math.log(pop, 10))
        # note: this split sends ~80% of the rows to test and trains on only ~20%,
        # to keep the single-feature SVR fast
        if rand.random() < 0.8:
            test_data.append(row)
        else:
            train_data.append(row)
               
test_data = sorted(test_data, key=lambda t:t[1])

train_data = np.array(train_data)
test_data = np.array(test_data)

xtrain = train_data[:,0]
ytrain = train_data[:,1]
xtest = test_data[:,0]
ytest = test_data[:,1]

# SVRs need scaling
xtrain_min = xtrain.min(); xtrain_max = xtrain.max()
xtrain_scaling = 1.0 / (xtrain_max - xtrain_min)
xtrain -= xtrain_min
xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min
ytrain *= ytrain_scaling

xtest -= xtrain_min
xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min
ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))
best_c = 100.0
best_epsilon = 0.1
PARAM_DO_GRID_SEARCH = False
      
if PARAM_DO_GRID_SEARCH:
    best_rmse = 1000
    for c in [0.01, 0.1, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]:
        svr_rbf = SVR(epsilon=0.05, C=c, gamma='auto')
        svr_rbf.fit(xtrain.reshape(-1,1), ytrain)
        ytest_pred = svr_rbf.predict(xtest.reshape(-1,1))
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("C", c, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_c = c
            best_rmse = RMSE

    print("Best C", best_c,"best RMSE", best_rmse)        

    best_rmse = 1000
    for epsilon in [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0, 10.0, 100.0]:
        svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
        svr_rbf.fit(xtrain.reshape(-1,1), ytrain)
        ytest_pred = svr_rbf.predict(xtest.reshape(-1,1))
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("Epsilon", epsilon, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_epsilon = epsilon
            best_rmse = RMSE

    print("Best epsilon", best_epsilon,"best RMSE", best_rmse)
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain.reshape(-1,1), ytrain)
ytest_pred = svr_rbf.predict(xtest.reshape(-1,1))
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('log population')
plt.savefig("ch6_cell17_svr_one_feat.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 8,963 cities
RMSE 0.617989060494947
Out[17]:
<matplotlib.legend.Legend at 0x7f92c6cdb5d0>

While the RMSE is not that bad for only one feature (less than one order of magnitude), and the flat bottom of the figure can be explained by the fact that the SVR has only a single support vector (from its single feature), the figure still does not seem to track the curve that well. To see whether this is an issue with the SVR or with the feature, I plotted a 20% sample of the total number of relations against the log of the population in Cell #18.

In [18]:
# CELL 18
import math
import random

pops = list()
num_rels = list()
rand = random.Random(42)
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        fields = line.strip().split("\t")
        pop = float(fields[-1])
        rels = float(fields[1])
        if rand.random() < 0.2:
            pops.append(math.log(pop, 10))
            num_rels.append(math.log(rels, 10))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(num_rels, pops, '.', color='gray')
plt.ylabel('log population')
plt.xlabel('log number of relations')
plt.savefig("ch6_cell18_logpop_vs_log_items.pdf", bbox_inches='tight', dpi=300)
plt
Out[18]:
<module 'matplotlib.pyplot' from '/vol/bulk/work/feateng/finaltest/chapter6/feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

It seems the correlation is there, but it is weak, and the SVR is extracting about as much information from it as there is. I thus settle on SVR, and we are ready for our first feature set.

First Featurization

As SVR only takes numeric features, the first step is to typify the raw columns we have and see which ones are quasi-numeric (that is, numeric 80% of the time) and whether the ones that are not have a small number (100 or fewer) of unique values. This typification is in Cell #19.
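The quasi-numeric test amounts to attempting a float conversion on each non-missing value and comparing the success rate against the 80% threshold. A minimal standalone sketch (the example columns are made up):

```python
def is_quasi_numeric(values, threshold=0.8):
    """True if more than `threshold` of the non-NA values parse as floats."""
    non_na = [v for v in values if v != 'NA']
    numeric = 0
    for v in non_na:
        try:
            float(v)
            numeric += 1
        except ValueError:
            pass
    return bool(non_na) and numeric > len(non_na) * threshold

print(is_quasi_numeric(['12.5', '7', '3.1', '9', '2', 'unknown']))  # 5 of 6 parse
print(is_quasi_numeric(['red', 'blue', '3']))                       # only 1 of 3
```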

In [19]:
# CELL 19
types = dict()
data = list()
header = None
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    header = next(feats).strip().split('\t')
    for line in feats:
        data.append( line.strip().split('\t'))

# numeric columns
num_cols = set()
for idx in range(len(header)):
    total_non_na = 0
    is_num = 0
    for row in data:
        if row[idx] == 'NA':
            continue
        total_non_na += 1
        row_val = row[idx]
        if '"^^' in row_val:
            row_val = row_val[1:row_val.index('"^^')]
        try:
            float(row_val)
            is_num += 1
        except ValueError:
            pass
    if is_num > total_non_na * 0.8:
        num_cols.add(idx)
        
print("Total cols", len(header))
print("Numeric cols", len(num_cols))

# differentiate categorical from free text, up to 100 categories
cat_cols = dict()
others = list()
for idx in range(len(header)):
    if idx in num_cols:
        continue
    values = set()
    for row in data:
        if row[idx] == 'NA':
            continue
        values.add(row[idx])
    if len(values) < 100:
        cat_cols[idx] = values
    else:
        others.append( (header[idx], len(values)) )

print("Categorical cols")
for col in sorted(cat_cols.keys()):
        print("   " + header[col])
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r: "<tr><td>{}</td><td>{}</td></tr>".format(r[0].split("/")[-1], r[1]), 
                               others))) +"</table>"))
Total cols 99
Numeric cols 54
Categorical cols
   http://dbpedia.org/ontology/utcOffset#1
   http://dbpedia.org/ontology/utcOffset#2
   http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1
Rel	Counts
name	44959
areaCode#1	5667
birthPlace?inv#1	20183
birthPlace?inv#2	13748
birthPlace?inv#3	10523
city?inv#1	11381
city?inv#2	6884
country#1	253
deathPlace?inv#1	9848
deathPlace?inv#2	5215
department#1	172
district#1	1182
foundingDate#1	2649
ground?inv#1	3284
headquarter?inv#1	2671
hometown?inv#1	4030
inseeCode#1	3855
isPartOf#1	2761
isPartOf#2	7640
isPartOf?inv#1	3331
leaderName#1	1839
leaderTitle#1	454
location?inv#1	13017
location?inv#2	7984
locationCity?inv#1	2704
motto#1	2587
nearestCity?inv#1	3534
populationAsOf#1	259
postalCode#1	25603
region#1	404
residence?inv#1	5033
routeEnd?inv#1	3968
routeJunction?inv#1	2714
routeStart?inv#1	4019
timeZone#1	178
type#1	582
rdf-schema#seeAlso#1	896
rdf-schema#seeAlso#2	692
rdf-schema#seeAlso#3	423
homepage#1	18912
name#1	40080
nick#1	2601

Therefore, of the 99 columns, 54 are numeric, but only 3 of the rest can be captured with fewer than 100 unique values. As we can expect more of them to be quasi-categorical, I employ a technique for category discretization discussed in Chapter 2, introducing an OTHER category to handle the rest of the values when they account for less than 20% of the total number of observations (Cell #20).

I also drop constant columns (which are uninformative), netting 50 numeric columns and 11 categorical ones. The remaining columns have too much variety to be discretized this way; they will require text processing, or they are just uninformative for this approach. The selected categorical relations are shown with their total number of categories. The types go to "ch6_cell20_feat0_types.tsv".
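The discretization itself, keeping the most frequent values until they cover 80% of the observations and lumping everything else into OTHER, can be sketched in a simplified standalone form (toy observations; Cell #20 additionally caps the number of categories at 100):

```python
from collections import Counter

def discretize(observations, coverage=0.8, max_cats=100):
    """Top values covering `coverage` of the observations, plus an OTHER bucket."""
    counts = Counter(observations)
    total = len(observations)
    selected, covered = [], 0
    for value, count in counts.most_common():
        if covered / total >= coverage or len(selected) >= max_cats:
            break
        selected.append(value)
        covered += count
    return selected + ['OTHER']

# toy time-zone-like column: two dominant values cover 80% of the observations
obs = ['CET'] * 50 + ['EST'] * 30 + ['PST'] * 15 + ['X', 'Y', 'Z', 'W', 'V']
print(discretize(obs))
# ['CET', 'EST', 'OTHER']
```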

In [20]:
# CELL 20
from collections import OrderedDict

types = dict()
data = list()
header = None
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    header = next(feats).strip().split('\t')
    for line in feats:
        data.append( line.strip().split('\t'))

# constant columns
const_cols = set()

for idx in range(len(header)):
    value = None
    constant = True
    for row in data:
        if row[idx] == 'NA':
            continue
        if value is None:
            value = row[idx]
        elif value != row[idx]:
            constant = False
            break
    if constant:
        const_cols.add(idx)

# integer columns
num_cols = set()
for idx in range(len(header)):
    if idx in const_cols:
        continue
    total_non_na = 0
    is_num = 0
    for row in data:
        if row[idx] == 'NA':
            continue
        total_non_na += 1
        row_val = row[idx]
        if '"^^' in row_val:
            row_val = row_val[1:row_val.index('"^^')]
        try:
            float(row_val)
            is_num += 1
        except ValueError:
            pass
    if is_num > total_non_na * 0.8:
        num_cols.add(idx)
        
print("Total cols", len(header))
print("Constant cols", len(const_cols))
print("Numeric cols", len(num_cols))

# differentiate categorical from free text, up to 100 categories, including "other"
cat_cols = dict()
others = list()
for idx in range(len(header)):
    if idx in num_cols or idx in const_cols:
        continue
    values = OrderedDict()
    total_obs = 0
    for row in data:
        if row[idx] == 'NA':
            continue
        values[row[idx]] = values.get(row[idx], 0) + 1
        total_obs += 1
    values = sorted(values.items(), key=lambda x:x[1])
    total_values = len(values)
    selected_values = list()
    selected_obs = 0
    while selected_obs * 1.0 / total_obs < 0.8 and len(selected_values) < 100:
        value, count = values.pop()
        selected_values.append(value)
        selected_obs += count
    if len(selected_values) < 100:
        selected_values.append('OTHER')
        cat_cols[idx] = selected_values
    else:
        others.append( (header[idx], total_values, int(selected_obs * 1.0 / total_obs * 10000)/100) )
        
with open("ch6_cell20_feat0_types.tsv", "w") as types:
    for idx in range(len(header)):
        types.write("{}\t{}".format(idx, header[idx]))
        if idx in num_cols:
            types.write("\tNUMERIC\n")
        elif idx in cat_cols:
            types.write("\t" + "\t".join(cat_cols[idx]) + "\n")
        else:
            types.write("\tIGNORE\n")
        
table1 = ("<table><tr><th>Rel</th><th>Categories</th></tr>" +
   "\n".join(list(map(lambda x: "<tr><td>{}</td><td>{}</td></tr>".format(header[x].split("/")[-1], 
                                                                         len(cat_cols[x])), cat_cols.keys()))) +
    "</table>")
table2 = ("<table><tr><th>Rel</th><th>Counts</th><th>% Covered @ 100</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(r[0].split("/")[-1], 
                                                                                   r[1], r[2]), 
                               others))) +"</table>")
from IPython.display import HTML, display
display(HTML("<h3>Categorical cols ({})</h3>".format(len(cat_cols)) + 
             table1 + "<h3>Remaining cols ({})</h3>".format(len(others)) + table2))
Total cols 99
Constant cols 4
Numeric cols 50

Categorical cols (11)

Rel	Categories
country#1	21
department#1	59
leaderTitle#1	4
populationAsOf#1	15
region#1	100
timeZone#1	14
type#1	29
utcOffset#1	7
utcOffset#2	6
22-rdf-syntax-ns#type#1	3
rdf-schema#seeAlso#3	73

Remaining cols (34)

Rel	Counts	% Covered @ 100
name	44959	0.22
areaCode#1	5667	26.65
birthPlace?inv#1	20183	2.07
birthPlace?inv#2	13748	1.0
birthPlace?inv#3	10523	1.0
city?inv#1	11381	5.1
city?inv#2	6884	3.32
deathPlace?inv#1	9848	1.52
deathPlace?inv#2	5215	1.99
district#1	1182	31.45
foundingDate#1	2649	10.33
ground?inv#1	3284	4.09
headquarter?inv#1	2671	6.47
hometown?inv#1	4030	4.35
inseeCode#1	3855	2.79
isPartOf#1	2761	56.1
isPartOf#2	7640	15.33
isPartOf?inv#1	3331	11.27
leaderName#1	1839	39.47
location?inv#1	13017	3.38
location?inv#2	7984	3.06
locationCity?inv#1	2704	7.36
motto#1	2587	6.18
nearestCity?inv#1	3534	11.07
postalCode#1	25603	3.27
residence?inv#1	5033	4.76
routeEnd?inv#1	3968	3.49
routeJunction?inv#1	2714	14.05
routeStart?inv#1	4019	3.09
rdf-schema#seeAlso#1	896	74.52
rdf-schema#seeAlso#2	692	76.44
homepage#1	18912	2.89
name#1	40080	1.1
nick#1	2601	7.98

I now proceed to perform a one-hot encoding of the categories (cf. Chapter 3), imputing missing values to OTHER for categorical columns (cf. Chapter 2) and to 0.0 for numeric ones, into the file "ch6_cell21_dev_feat1.tsv" (Cell #21). The resulting vector has 380 features plus the target variable.
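A one-hot column with OTHER imputation reduces to a small helper like the following sketch (the category list is a toy example, not from the types file):

```python
def one_hot(value, categories):
    """Encode `value` over `categories`; missing or unseen values map to OTHER."""
    if value == 'NA' or value not in categories:
        value = 'OTHER'
    return [1.0 if cat == value else 0.0 for cat in categories]

cats = ['"UTC+1"', '"UTC+2"', 'OTHER']
print(one_hot('"UTC+2"', cats))  # [0.0, 1.0, 0.0]
print(one_hot('NA', cats))       # [0.0, 0.0, 1.0]
```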

In [21]:
# CELL 21
from collections import OrderedDict

types = list(map(lambda line: line.strip().split("\t"), 
                 open("ch6_cell20_feat0_types.tsv").readlines()))
header = None
data = list()
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    header = next(feats).strip().split('\t')
    for line in feats:
        data.append( line.strip().split('\t'))

total_size = 0
cat_to_pos = dict()
for type_line in types:
    if type_line[2] == 'NUMERIC':
        total_size += 1
    elif type_line[2] != 'IGNORE':
        cat_to_pos[int(type_line[0])] = dict(zip(type_line[2:], range(len(type_line) - 2)))
        total_size += len(type_line) - 2
    
with open("ch6_cell21_dev_feat1.tsv", "w") as feat1:
    feat1.write('name') # add name for error analysis
    for type_line in types:
        if type_line[2] == 'NUMERIC':
            feat1.write('\t' + type_line[1])
        elif type_line[2] != 'IGNORE':
            for value in type_line[2:]:
                feat1.write('\t{}@{}'.format(type_line[1], value))
    feat1.write('\n')
    for row in data:
        v = ['NA',] * (total_size + 1)
        v[0] = row[0] # add name for error analysis
        current = 1
        for type_line in types:
            idx = int(type_line[0])
            value = row[idx]
            if type_line[2] == 'NUMERIC':
                if '"^^' in value:
                    value = value[1:value.index('"^^')]
                try:
                    v[current] = str(float(value))
                except ValueError:
                    v[current] = '0.0'
                current += 1
            elif type_line[2] != 'IGNORE':
                pos = None
                if value == 'NA' or value not in cat_to_pos[idx]:
                    pos = cat_to_pos[idx]['OTHER']
                else:
                    pos = cat_to_pos[idx][value]
                for idx2 in range(len(type_line) - 2):
                    if idx2 == pos:
                        v[current] = "1.0"
                    else:
                        v[current] = "0.0"
                    current += 1
        feat1.write('\t'.join(v) + "\n")        
print("Vector size", total_size)
Vector size 381

This data set is ready to be used with SVR. Doing a grid search for its parameters and saving data for error analysis (Cell #22), we obtain an RMSE of 0.41 and the curve below.

In [22]:
# CELL 22
import random
import math
from sklearn.svm import SVR
import numpy as np

rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch6_cell21_dev_feat1.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    for line in feats:
        fields = line.strip().split("\t")
        pop = float(fields[-1])
        name = fields[0]
        row_feats = list(map(float, fields[1:-1]))  # avoid shadowing the file handle
        row = (row_feats, math.log(pop, 10), name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)
               
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))

best_c = 100.0
best_epsilon = 0.05

PARAM_DO_GRID_SEARCH = False
if PARAM_DO_GRID_SEARCH:
    best_rmse = 1000
    for c in [0.01, 0.1, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]:
        svr_rbf = SVR(epsilon=0.05, C=c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("C", c, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_c = c
            best_rmse = RMSE

    print("Best C", best_c,"best RMSE", best_rmse)        

    best_rmse = 1000
    for epsilon in [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0, 10.0, 100.0]:
        svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("Epsilon", epsilon, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_epsilon = epsilon
            best_rmse = RMSE

    print("Best epsilon", best_epsilon,"best RMSE", best_rmse)
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch6_cell22_feat1_svr.npz", 
                    xtrain=xtrain, ytrain=ytrain, 
                    xtest=xtest, ytest=ytest,
                    ytest_orig=ytest_orig,
                    ytest_pred=ytest_pred,
                    ytrain_min=ytrain_min,
                    xtrain_scaling=xtrain_scaling,
                    ytrain_scaling=ytrain_scaling)
with open("ch6_cell22_test_names.tsv", "w") as names:
    for idx, name in enumerate(test_names):
        names.write("{}\t{}\n".format(idx, name))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual",    color='black')
plt.ylabel('log population')
plt.savefig("ch6_cell22_svr_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.41176199300264366
Writing files for error analysis
Out[22]:
<matplotlib.legend.Legend at 0x7f92c6c90bd0>

This figure and RMSE look better than those for the single feature that preceded it, but the curve has some strange spikes. Let us perform some error analysis on them.

Error Analysis

Using the training data and evaluation results from Cell #22, I can now do an error drilldown on the spike at the end by feature ablation in Cell #23.

In [23]:
# CELL 23
import numpy as np

arrays = np.load("ch6_cell22_feat1_svr.npz")
ytest_orig = arrays['ytest_orig']
ytest_pred = arrays['ytest_pred']
test_names = list(map(lambda line:line.strip().split('\t')[-1], 
                      open("ch6_cell22_test_names.tsv").readlines()))

# find the top 10 biggest contributors to the log error
sqe = (ytest_orig - ytest_pred)**2
contributors = list()
while len(contributors) < 10:
    idx = sqe.argmax()
    contributors.append( (idx, (ytest_pred - ytest_orig)[idx]) )
    sqe[idx] = 0.0

print("Top contributors to the error")
for idx, error in contributors:
    print("{:<4}  {:<21} {:>5}    {:>10,}   {:>17,}".
          format(idx, test_names[idx][:-1].split("/")[-1], int(error*100)/100, 
          int(10**ytest_orig[idx]), int(10**ytest_pred[idx])))                 
Top contributors to the error
8825  Dublin                  6.5       553,164   1,751,459,207,216
8726  Saint-Prex            -2.09       312,010               2,485
8705  Meads,_Kentucky       -2.08       288,648               2,351
7939  Nicosia                2.03        55,013           5,931,549
8921  Hofuf                 -1.94     1,500,000              17,145
8985  Mexico_City            1.93     8,918,652         761,989,297
8689  Montana_City,_Montana -1.88       271,529               3,513
8902  Pikine                -1.82     1,170,790              17,470
8791  Edinburgh              1.76       464,989          26,976,492
172   Isabela,_Basilan       1.74         1,081              60,349

From the table we can see that cities that have a rich history but are not necessarily that populated get overshot by the system. That makes sense.

Interestingly, Saint-Prex actually has about 5,000 people; the 300,000 number is an extraction mistake. The population of Meads, Kentucky is listed on Wikipedia as 280,000, but it is being flagged as an error because 280,000 is the size of the full region; the town itself definitely has fewer than 5,000 people.

Nicosia is a history-rich city that has been populated for 4,500 years. Sadly, it is a divided city, and the population of one of its divisions is about 55,000 people. The joint city would total about 300,000 inhabitants, but it certainly has the history of a city of 6,000,000.

Therefore we have history-rich, underpopulated cities and extraction errors. From here it makes sense to focus on the real (not logarithmic) error where the system is overpredicting (Cell #24).

In [24]:
# CELL 24
import numpy as np

arrays = np.load("ch6_cell22_feat1_svr.npz")
xtrain = arrays['xtrain']
ytrain = arrays['ytrain'] 
xtest = arrays['xtest']
ytest = arrays['ytest']
ytest_orig = arrays['ytest_orig']
ytest_pred = arrays['ytest_pred']
test_names = list(map(lambda line:line.strip().split('\t')[-1], 
                      open("ch6_cell22_test_names.tsv").readlines()))

# find the top 10 biggest contributors to the actual error
real_error = (10**ytest_pred - 10**ytest_orig)
contributors = list()
cumulative_error = 0
while len(contributors) < 10:
    idx = real_error.argmax()
    contributors.append( (idx, (ytest_pred - ytest_orig)[idx]) )
    cumulative_error += ytest_pred[idx] - ytest_orig[idx]
    real_error[idx] = 0.0

print("Top contributors to the error")
print("Cumulative error {:,}".format(cumulative_error))
for idx, error in contributors:
    print("{:<4}  {:<21} {:>5}    {:>10,}   {:>17,}".
          format(idx, test_names[idx][:-1].split("/")[-1], int(error*100)/100, 
          int(10**ytest_orig[idx]), int(10**ytest_pred[idx])))
Top contributors to the error
Cumulative error 17.634868409695972
8825  Dublin                  6.5       553,164   1,751,459,207,216
8985  Mexico_City            1.93     8,918,652         761,989,297
8791  Edinburgh              1.76       464,989          26,976,492
8959  Algiers                0.83     3,415,810          23,294,353
8910  Prague                 0.88     1,267,448           9,823,330
8915  Milan                   0.8     1,359,904           8,636,483
8873  Amsterdam              0.98       842,342           8,094,965
8823  Lisbon                 1.15       545,245           7,776,212
8924  Tabriz                 0.74     1,549,452           8,586,162
7939  Nicosia                2.03        55,013           5,931,549

This list looks quite good: these cities contribute a lot to the error and fall into the category of errors we have found the system to be making. Let's do an ablation study (wrapper) to see if we can find which features are hurting us here. In this study we'll throw away one feature at a time and see whether doing so improves the cumulative error for these 10 cities. The amount of computation is very big, so we'll work on a 5% sample of the original data. Also, because Dublin is so markedly overshot, I'll also keep track of which feature, when dropped, best improves each city. Cell #25.
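The wrapper loop itself is simple. As a minimal sketch of the idea on synthetic data (hypothetical features, not the city features; unlike the cell that follows, this version zeroes the feature in both train and test), ablating an informative feature hurts the held-out error the most, while ablating a useless one barely moves it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic regression: only the first two features carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=400)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

def heldout_rmse(Xtrain, Xtest):
    pred = LinearRegression().fit(Xtrain, ytr).predict(Xtest)
    return float(np.sqrt(np.mean((yte - pred) ** 2)))

baseline = heldout_rmse(Xtr, Xte)
ablated = []
for feat in range(X.shape[1]):
    Xtr_a, Xte_a = Xtr.copy(), Xte.copy()
    Xtr_a[:, feat] = 0.0  # nuke the feature everywhere
    Xte_a[:, feat] = 0.0
    ablated.append(heldout_rmse(Xtr_a, Xte_a))
print("baseline", round(baseline, 3), "ablated", [round(a, 3) for a in ablated])
```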

In [25]:
# CELL 25
import random
from sklearn.svm import SVR
import numpy as np

arrays = np.load("ch6_cell22_feat1_svr.npz")
xtrain = arrays['xtrain']
ytrain = arrays['ytrain']
ytrain_scaling = arrays['ytrain_scaling']
ytrain_min = arrays['ytrain_min']
xtest = arrays['xtest']
ytest = arrays['ytest']
ytest_orig = arrays['ytest_orig']
ytest_pred = arrays['ytest_pred']
test_names = list(map(lambda line:line.strip().split('\t')[-1], 
                      open("ch6_cell22_test_names.tsv").readlines()))
header = None
with open("ch6_cell21_dev_feat1.tsv") as feats:
    header = next(feats)
header = header.strip().split('\t')
header.pop(0) # drop name

# find the top 10 biggest contributors to the actual (real-space) error
real_error = (10**ytest_pred - 10**ytest_orig)
contributors = list()
cumulative_error = 0
xtest_contr = np.zeros( (10, xtest.shape[1]) )
ytest_contr = np.zeros( (10,) )
while len(contributors) < 10:
    idx = real_error.argmax()
    ytest_contr[len(contributors)] = ytest_orig[idx]
    xtest_contr[len(contributors),:] = xtest[idx,:]
    contributors.append( (idx, (ytest_pred - ytest_orig)[idx]) )
    cumulative_error += real_error[idx]
    real_error[idx] = 0.0

best_epsilon = 0.05
best_c = 100.0
# this is a lot of computation, work on sample of the original data
PARAM_SAMPLE_PERC=0.05
rand = random.Random(42)
selected = rand.sample(list(range(len(xtrain))), int(len(xtrain) * PARAM_SAMPLE_PERC))
xtrain_samp = xtrain[selected,:]
ytrain_samp = ytrain[selected]

# baseline run
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain_samp, ytrain_samp)
ytest_pred = svr_rbf.predict(xtest_contr)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
error = ytest_pred - ytest_contr # absolute log error
cumulative = sum(error)
baseline_cumulative = cumulative
per_city_baseline = error

best_cumulative = 100000.00
best_cumulative_ablate = None
per_city_best = [10000.00,] * 10
per_city_best_ablate = [None,] * 10

for feat in range(xtrain.shape[1]):
    if feat % int(xtrain.shape[1] / 10) == 0:
        print("{:>3} out of {:>3}".format(feat, xtrain.shape[1]))
    train_col = xtrain_samp[:,feat].copy()
    xtrain_samp[:,feat] = 0.0 # nuke feature
    svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
    svr_rbf.fit(xtrain_samp, ytrain_samp)
    ytest_pred = svr_rbf.predict(xtest_contr)
    ytest_pred *= 1.0/ytrain_scaling
    ytest_pred += ytrain_min
    error = ytest_pred - ytest_contr
    cumulative = sum(error)
    if best_cumulative_ablate is None or best_cumulative > cumulative:
        best_cumulative_ablate = feat
        best_cumulative = cumulative
    for idx in range(10):
        if per_city_best_ablate[idx] is None or per_city_best[idx] > error[idx]:
            per_city_best_ablate[idx] = feat
            per_city_best[idx] = error[idx]
    xtrain_samp[:,feat] = train_col # restore

print("Ablation study")
print("Best overall from {:3.2} to {:3.2} ({:.2%})".format(
    baseline_cumulative, best_cumulative,
    (baseline_cumulative-best_cumulative)/baseline_cumulative))
print("Feature to ablate ({}): {}\n\n".format(best_cumulative_ablate, header[best_cumulative_ablate]))
for idx in range(10):
    print("For city {} from {:3.2} to {:3.2} ({:.2%})".format(
        test_names[contributors[idx][0]][:-1].split("/")[-1],
        per_city_baseline[idx], per_city_best[idx],
        (per_city_baseline[idx]-per_city_best[idx])/per_city_baseline[idx]))
    print("Feature to ablate ({}): {}\n".format(per_city_best_ablate[idx], header[per_city_best_ablate[idx]]))
  0 out of 380
 38 out of 380
 76 out of 380
114 out of 380
152 out of 380
190 out of 380
228 out of 380
266 out of 380
304 out of 380
342 out of 380
Ablation study
Best overall from 0.28 to -2.4 (968.43%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count


For city Dublin from 1.2 to 0.48 (58.97%)
Feature to ablate (11): http://dbpedia.org/ontology/city?inv#count

For city Mexico_City from -0.061 to -0.4 (-561.55%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

For city Edinburgh from 0.43 to 0.19 (57.03%)
Feature to ablate (3): http://dbpedia.org/ontology/areaLand#count

For city Algiers from -0.92 to -1.1 (-22.87%)
Feature to ablate (378): http://xmlns.com/foaf/0.1/name#count

For city Prague from -0.18 to -0.82 (-341.93%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

For city Milan from 0.17 to -0.43 (347.88%)
Feature to ablate (307): http://www.w3.org/2000/01/rdf-schema#seeAlso#3@<http://dbpedia.org/resource/Italy>

For city Amsterdam from -0.53 to -0.7 (-31.85%)
Feature to ablate (98): http://dbpedia.org/ontology/ground?inv#count

For city Lisbon from 0.0019 to -0.16 (8666.74%)
Feature to ablate (323): http://www.w3.org/2000/01/rdf-schema#seeAlso#3@<http://dbpedia.org/resource/The_Republic_of_Ireland>

For city Tabriz from -0.2 to -0.97 (-376.48%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

For city Nicosia from 0.41 to 0.1 (75.64%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

From the above results I can see that counts are misleading the algorithm. Here I am presented with two options: I can either drop them or dampen their growth by applying a squashing function to them. As I believe there's value in them, I'll dampen them by applying a log function (Cell #26).
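As a quick illustration of the squashing effect (a sketch, not part of the cell that follows): under log10(x + 1), a count one million times larger only becomes about six times larger, so extreme counts stop dominating the kernel distances.

```python
import math

# log10(x + 1) maps heavy-tailed counts onto a narrow, comparable scale;
# the +1 keeps a zero count at exactly 0.0
for count in (0, 9, 999, 999_999):
    print(count, math.log(count + 1, 10))
```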

In [26]:
# CELL 26
import math
import random
from sklearn.svm import SVR
import numpy as np

header = None
with open("ch6_cell26_dev_feat1_dampen.tsv", "w") as feats_damp:
    with open("ch6_cell21_dev_feat1.tsv") as feats:
        header = next(feats)
        feats_damp.write(header)
        header = header.strip().split("\t")
        header.pop(0) # name
        for line in feats:
            fields = line.strip().split("\t")
            pop = float(fields[-1])
            name = fields[0]
            feats = list(map(float,fields[1:-1]))
            pop = math.log(pop, 10)
            # dampen counts
            for idx, featname in enumerate(header):
                if "#count" in featname:
                    feats[idx] = math.log(feats[idx] + 1, 10)
            feats_damp.write("\t".join(map(str, [name] + feats + [pop])) + "\n")

rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        row = (feats, logpop, name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)
               
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))

best_c = 100.0
best_epsilon = 0.05
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch6_cell26_svr_feat1_dampen.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3556252788046216
Out[26]:
<matplotlib.legend.Legend at 0x7f9287312f90>

The RMSE is better than before, at 0.36, and the spikes have been reduced.

That is a nice improvement over the previous one (note that feature engineering does not always succeed and you might encounter dead ends). Of course, taking the log of those counts could have been done directly, based on the advice in Chapter 2.

Instead of continuing down this route of wrapper ablation, I'll now move to feature utility metrics, for which we need a categorical target. I'll thus discretize the target into segments with equal numbers of points. Cell #27 does the discretization and measures the discretization error at different numbers of splits.
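Equal-frequency discretization can also be sketched directly with NumPy quantiles (a simplified alternative to the recursive splitting in the cell that follows; the synthetic log-populations here are hypothetical):

```python
import numpy as np

# equal-frequency discretization: cut the target at its quantiles
# so every bin holds roughly the same number of points
rng = np.random.default_rng(42)
logpop = rng.normal(4.0, 0.7, size=10_000)  # synthetic log10 populations

n_bins = 4
edges = np.quantile(logpop, np.linspace(0, 1, n_bins + 1))
bins = np.clip(np.searchsorted(edges, logpop, side='right') - 1, 0, n_bins - 1)

# representing each bin by its median gives the discretization error
medians = np.array([np.median(logpop[bins == b]) for b in range(n_bins)])
rmse = float(np.sqrt(np.mean((logpop - medians[bins]) ** 2)))
print("bin counts:", np.bincount(bins), "RMSE:", rmse)
```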

In [27]:
# CELL 27
import math
import random
import numpy as np

rand = random.Random(42)
train_data = list()
test_data = list()
with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
    header = next(feats)
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        if rand.random() < 0.2:
            test_data.append(logpop) 
        else:
            train_data.append(logpop)

train_data = np.array(sorted(train_data))
test_data  = np.array(test_data)

def cell27_split(data):
    pivot = int(len(data)/2)
    # slower, more accurate
    lower_center, lower_val = int(pivot/2), data[int(pivot/2)]
    upper_center, upper_val = int(pivot/2)+pivot, data[int(pivot/2)+pivot]
    return [ { "min": data[0], "max": data[pivot], "val": lower_val,
              "min_idx": 0, "max_idx": pivot, "center": lower_center },
             { "min": data[pivot+1], "max": data[-1], "val": upper_val,
              "min_idx": pivot+1, "max_idx": len(data) - 1, "center": upper_center } ]

def cell27_segment(train_data, order):
    segments_at = [ [ { "min": train_data[0], "max" : train_data[-1], 
                        "val": train_data[int(len(train_data)/2)],
                        "min_idx":0, "max_idx": len(train_data) - 1 } ] ]
    for exp in range(order):
        previous = segments_at[-1]
        new_segment = list()
        for old_segment in previous:
            splitted = cell27_split(train_data[old_segment['min_idx']:(old_segment['max_idx']+1)])
            for created in splitted:
                created['min_idx'] += old_segment['min_idx']
                created['max_idx'] += old_segment['min_idx']
                created['center']  += old_segment['min_idx']
                new_segment.append(created)
        segments_at.append(new_segment)
    return segments_at

def cell27_predict(data, segments):
    result = list()
    for pop in data:
        value = None
        if pop < segments[0]['min']:
            value = segments[0]['val']
        elif pop > segments[-1]['max']:
            value = segments[-1]['val']
        else:
            for seg_idx, segment in enumerate(segments):
                if segment['min'] <= pop and \
                        (seg_idx == len(segments) - 1 or pop < segments[seg_idx + 1]['min']):
                    value = segment['val']
                    break
        if value is None:
            raise Exception("Value not found: {}".format(10**pop))
        result.append(value)
    return np.array(result)

segments_at = cell27_segment(train_data, 5)

for idx, segments in enumerate(segments_at):
    for seg_idx, seg in enumerate(segments):
        print("\tSegment {:2}, min: {:10,} value: {:10,}, max: {:10,}"
              .format((seg_idx+1), int(round(10 ** seg['min'])), 
                      int(round(10 ** seg['val'])), int(round(10 ** seg['max']))))
    test_pred = cell27_predict(test_data, segments)
    print("Segmented in {}, RMSE: {}".format(2**idx, math.sqrt(sum((test_data - test_pred)**2)/len(test_data))))
    
import pickle
with open("ch6_cell27_splits.pk", "wb") as pkl:
    pickle.dump(segments_at, pkl)
	Segment  1, min:      1,000 value:      5,409, max: 24,300,000
Segmented in 1, RMSE: 0.703382635244624
	Segment  1, min:      1,000 value:      2,226, max:      5,409
	Segment  2, min:      5,410 value:     19,435, max: 24,300,000
Segmented in 2, RMSE: 0.44244189595902794
	Segment  1, min:      1,000 value:      1,545, max:      2,226
	Segment  2, min:      2,226 value:      3,363, max:      5,409
	Segment  3, min:      5,410 value:      9,734, max:     19,435
	Segment  4, min:     19,435 value:     49,331, max: 24,300,000
Segmented in 4, RMSE: 0.2912613798706353
	Segment  1, min:      1,000 value:      1,254, max:      1,545
	Segment  2, min:      1,545 value:      1,884, max:      2,226
	Segment  3, min:      2,226 value:      2,729, max:      3,363
	Segment  4, min:      3,364 value:      4,225, max:      5,409
	Segment  5, min:      5,410 value:      7,147, max:      9,734
	Segment  6, min:      9,735 value:     13,515, max:     19,435
	Segment  7, min:     19,435 value:     29,261, max:     49,339
	Segment  8, min:     49,371 value:    114,914, max: 24,300,000
Segmented in 8, RMSE: 0.1936688721250952
	Segment  1, min:      1,000 value:      1,126, max:      1,254
	Segment  2, min:      1,254 value:      1,393, max:      1,545
	Segment  3, min:      1,545 value:      1,710, max:      1,884
	Segment  4, min:      1,884 value:      2,028, max:      2,226
	Segment  5, min:      2,226 value:      2,461, max:      2,729
	Segment  6, min:      2,729 value:      3,021, max:      3,363
	Segment  7, min:      3,364 value:      3,766, max:      4,225
	Segment  8, min:      4,225 value:      4,800, max:      5,409
	Segment  9, min:      5,410 value:      6,176, max:      7,147
	Segment 10, min:      7,147 value:      8,300, max:      9,734
	Segment 11, min:      9,735 value:     11,441, max:     13,518
	Segment 12, min:     13,518 value:     16,117, max:     19,435
	Segment 13, min:     19,435 value:     23,699, max:     29,261
	Segment 14, min:     29,271 value:     37,076, max:     49,339
	Segment 15, min:     49,371 value:     69,696, max:    114,914
	Segment 16, min:    114,924 value:    264,091, max: 24,300,000
Segmented in 16, RMSE: 0.12719110586720506
	Segment  1, min:      1,000 value:      1,068, max:      1,126
	Segment  2, min:      1,126 value:      1,189, max:      1,254
	Segment  3, min:      1,254 value:      1,318, max:      1,393
	Segment  4, min:      1,393 value:      1,461, max:      1,545
	Segment  5, min:      1,545 value:      1,628, max:      1,710
	Segment  6, min:      1,710 value:      1,799, max:      1,884
	Segment  7, min:      1,884 value:      1,986, max:      2,028
	Segment  8, min:      2,028 value:      2,125, max:      2,226
	Segment  9, min:      2,226 value:      2,343, max:      2,461
	Segment 10, min:      2,461 value:      2,588, max:      2,729
	Segment 11, min:      2,729 value:      2,883, max:      3,021
	Segment 12, min:      3,022 value:      3,175, max:      3,363
	Segment 13, min:      3,364 value:      3,566, max:      3,766
	Segment 14, min:      3,766 value:      3,995, max:      4,225
	Segment 15, min:      4,225 value:      4,503, max:      4,800
	Segment 16, min:      4,800 value:      5,098, max:      5,409
	Segment 17, min:      5,410 value:      5,790, max:      6,176
	Segment 18, min:      6,177 value:      6,609, max:      7,147
	Segment 19, min:      7,147 value:      7,720, max:      8,300
	Segment 20, min:      8,300 value:      8,963, max:      9,734
	Segment 21, min:      9,735 value:     10,546, max:     11,441
	Segment 22, min:     11,442 value:     12,419, max:     13,518
	Segment 23, min:     13,518 value:     14,776, max:     16,117
	Segment 24, min:     16,119 value:     17,742, max:     19,435
	Segment 25, min:     19,435 value:     21,342, max:     23,699
	Segment 26, min:     23,700 value:     26,171, max:     29,261
	Segment 27, min:     29,271 value:     32,882, max:     37,080
	Segment 28, min:     37,095 value:     42,033, max:     49,339
	Segment 29, min:     49,371 value:     57,331, max:     69,713
	Segment 30, min:     69,769 value:     87,249, max:    114,914
	Segment 31, min:    114,924 value:    161,719, max:    264,700
	Segment 32, min:    264,716 value:    655,965, max: 24,300,000
Segmented in 32, RMSE: 0.07988951417940254

From the above, splitting into 4 segments produces a reasonable discretization error. I can now proceed to do some feature selection using mutual information (Chapter 4). I will also introduce 10 random features and drop all features whose mutual information falls below that of at least 3 random features. Cell #28.
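The random-probe cutoff can be sketched with scikit-learn's mutual information estimator (a simplified sketch on synthetic data; `mutual_info_classif` estimates MI differently from the binarized 2x2-table computation in the cell that follows, and the feature names here are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 2000
y = rng.integers(0, 4, size=n)              # 4 discretized population bins
informative = y + rng.normal(0, 0.5, n)     # correlated with the target
noise = rng.normal(size=(n, 3))             # uninformative real features
probes = rng.random((n, 10))                # 10 random probe features

X = np.column_stack([informative, noise, probes])
names = (['informative'] + ['noise_' + str(i) for i in range(3)] +
         ['random_' + str(i) for i in range(10)])

mi = mutual_info_classif(X, y, random_state=42)
order = np.argsort(-mi)  # features from most to least informative

# cut where the 3rd-best random probe appears; drop real features below it
seen_random, cut = 0, len(order)
for pos, feat in enumerate(order):
    if names[feat].startswith('random_'):
        seen_random += 1
        if seen_random == 3:
            cut = pos
            break
kept = [names[f] for f in order[:cut] if not names[f].startswith('random_')]
print("kept:", kept)
```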

In [28]:
# CELL 28
import math
import numpy as np
import pickle
import random
from collections import OrderedDict

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

data = list()
header = None
PARAM_RANDOM_FEATS = 10
rand = random.Random(42)
with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for idx in range(PARAM_RANDOM_FEATS):
        header.append('random_' + str(idx))
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        for _ in range(PARAM_RANDOM_FEATS):
            feats.append(rand.random())
        row = (feats, logpop, name)
        data.append(row)

xdata = np.array(list(map(lambda t:t[0], data)))
ydata = np.array(list(map(lambda t:t[1], data)))

def cell28_adjudicate(data, segments):
    result = list()
    for val in data:
        idx = None
        if val < segments[0]['min']:
            idx = 0
        elif val > segments[-1]['max']:
            idx = len(segments) - 1
        else:
            for idx2, segment in enumerate(segments):
                if segment['min'] <= val and val <= segment['max']:
                    idx = idx2
                    break
        result.append(idx)
    return np.array(result)

ydata = cell28_adjudicate(ydata, segments_at[2])

# discretize features into up to 4 bins
for feat in range(xdata.shape[1]):
    feat_segments = cell27_segment(sorted(xdata[:,feat]), 3)
    xdata[:,feat] = cell28_adjudicate(xdata[:,feat], feat_segments[2])
    
# compute confusion tables for each feature
confusion_tables = list()
for feat in range(xdata.shape[1]):
    table = OrderedDict()
    for row in range(xdata.shape[0]):
        feat_val   = int(xdata[row, feat])
        target_val = int(ydata[row])
        if feat_val not in table:
            table[feat_val] = OrderedDict()
        table[feat_val][target_val] = table[feat_val].get(target_val, 0) + 1
    confusion_tables.append(table)

# compute feature utility using mutual information
PARAM_USE_CHI = False
feature_utility = list()
if PARAM_USE_CHI:
    # pip install scipy
    from scipy.stats.mstats import chisquare
for feat in range(xdata.shape[1]):
    if feat % int(xdata.shape[1] / 10) == 0:
        print("{:>3} out of {:>3}".format(feat, xdata.shape[1]))
    table = confusion_tables[feat]
    feat_vals = set()
    for row in table.values():
        feat_vals.update(row.keys())
    cols = { val: sum(map(lambda x:x.get(val,0), table.values())) for val in feat_vals }
    full_table = sum(cols.values())
    
    best_utility = None
    for feat_val in table.keys():
        for target_val in table[feat_val].keys():
            # binarize
            n11 = table[feat_val][target_val]
            if n11 < 5:
                continue
            n10 = sum(table[feat_val].values()) - n11
            n01 = cols.get(target_val) - n11
            n00 = full_table - n11 - n10 - n01
            if n10 == 0 or n01 == 0 or n00 == 0:
                continue
            if PARAM_USE_CHI:
                utility, _ = chisquare([n11, n10, n01, n00])
            else:
                n1_ = n11 + n10
                n0_ = n01 + n00
                n_1 = n11 + n01
                n_0 = n10 + n00
                n = float(full_table)
                utility = n11/n * math.log(n*n11/(n1_*n_1),2) + \
                   n01 / n * math.log(n*n01/(n0_*n_1), 2) + \
                   n10 / n * math.log(n*n10/(n1_*n_0), 2) + \
                   n00 / n * math.log(n*n00/(n0_*n_0), 2)
            if best_utility is None or best_utility < utility:
                best_utility = utility
    if best_utility is not None:
        feature_utility.append( (feat, best_utility) )
    #else:
    #    print("No utility for feat ", feat)
    
feature_utility = sorted(feature_utility, key=lambda x:x[1], reverse=True)

# determine features to drop based on whether there are PARAM_BEST_RANDOM on top of them
PARAM_BEST_RANDOM = 3
cut_point = None
random_so_far = 0
for idx, row in enumerate(feature_utility):
    if header[row[0]].startswith("random_"):
        random_so_far += 1
        if random_so_far == PARAM_BEST_RANDOM:
            cut_point = idx
            break
            
to_drop = list()
for row in feature_utility[cut_point:]:
    if not header[row[0]].startswith("random_"):
        to_drop.append(row)
        
print("Dropped", len(to_drop), "features")

to_keep = set()
with open("ch6_cell28_features_dropped.txt", "w") as dropped:
    drop_set = set()
    for row in to_drop:
        dropped.write(header[row[0]] + "\n")
        drop_set.add(row[0])
    for idx in range(len(header) - PARAM_RANDOM_FEATS):
        if idx not in drop_set:
            to_keep.add(idx + 1) # accommodate 'name' column
    to_keep.add(0)
    to_keep.add(len(header) - PARAM_RANDOM_FEATS + 1) # logpop

print("New feature vector size", len(to_keep), "features, including name and target")
        
table1 = ("<table><tr><th>Position</th><th>Feat#</th><th>Feat</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0],
                        header[r[1][0]].replace('<','').replace('>',''),
                        r[1][1]), 
                               enumerate(feature_utility[:150])))) +"</table>")
table2 = ("<table><tr><th>Position</th><th>Feat#</th><th>Feat</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0],
                        header[r[1][0]].replace('<','').replace('>',''),
                        r[1][1]), 
                               enumerate(reversed(to_drop))))) +"</table>")

with open("ch6_cell28_dev_feat1_filtered.tsv", "w") as filtered:
    with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
        header = next(feats)
        header = header.strip().split("\t")
        new_header = list()
        for idx, header_col in enumerate(header):
            if idx in to_keep:
                new_header.append(header_col)
        filtered.write("\t".join(new_header) + "\n")
        for line in feats:
            fields = line.strip().split("\t")
            new_fields = list()
            for idx, value in enumerate(fields):
                if idx in to_keep:
                    new_fields.append(value)
            filtered.write("\t".join(new_fields) + "\n")

from IPython.display import HTML, display
display(HTML("<h3>Top 150 features by MI</h3>" + table1 + 
             "<h3>Features to drop</h3>"       + table2))
  0 out of 390
 39 out of 390
 78 out of 390
117 out of 390
156 out of 390
195 out of 390
234 out of 390
273 out of 390
312 out of 390
351 out of 390
Dropped 68 features
New feature vector size 314 features, including name and target

Top 150 features by MI

Position  Feat#  Feat  Utility
0  286  http://dbpedia.org/ontology/utcOffset#count  0.0976963935
1  10  http://dbpedia.org/ontology/birthPlace?inv#count  0.0773520859
2  33  http://dbpedia.org/ontology/country#1@OTHER  0.0715361164
3  7  http://dbpedia.org/ontology/areaTotal#1  0.0574203739
4  301  http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1@http://dbpedia.org/ontology/City  0.0446595875
5  100  http://dbpedia.org/ontology/hometown?inv#count  0.0444925574
6  104  http://dbpedia.org/ontology/leaderTitle#count  0.0418022897
7  98  http://dbpedia.org/ontology/ground?inv#count  0.0417288633
8  102  http://dbpedia.org/ontology/isPartOf?inv#count  0.0383227077
9  99  http://dbpedia.org/ontology/headquarter?inv#count  0.0381511672
10  11  http://dbpedia.org/ontology/city?inv#count  0.0375958312
11  110  http://dbpedia.org/ontology/locationCity?inv#count  0.0345756327
12  109  http://dbpedia.org/ontology/location?inv#count  0.0344053176
13  303  http://www.w3.org/2000/01/rdf-schema#seeAlso#count  0.0329376538
14  101  http://dbpedia.org/ontology/isPartOf#count  0.0310347434
15  0  rel#count  0.0305265290
16  103  http://dbpedia.org/ontology/leaderName#count  0.0300089032
17  237  http://dbpedia.org/ontology/residence?inv#count  0.0297368517
18  291  http://dbpedia.org/ontology/utcOffset#1@"+8"  0.0296566141
19  379  http://xmlns.com/foaf/0.1/nick#count  0.0294899528
20  378  http://xmlns.com/foaf/0.1/name#count  0.0294766008
21  34  http://dbpedia.org/ontology/deathPlace?inv#count  0.0290410902
22  289  http://dbpedia.org/ontology/utcOffset#1@"+5:30"  0.0258797043
23  246  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Indian_Standard_Time  0.0254364934
24  17  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/India  0.0245425720
25  299  http://dbpedia.org/ontology/utcOffset#2@OTHER  0.0223334458
26  300  http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1@http://dbpedia.org/ontology/Settlement  0.0219254334
27  105  http://dbpedia.org/ontology/leaderTitle#1@"Mayor"@en  0.0211852222
28  241  http://dbpedia.org/ontology/timeZone#count  0.0186070090
29  13  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/United_States  0.0175165178
30  253  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/China_Standard_Time  0.0170512916
31  29  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Switzerland  0.0158900843
32  94  http://dbpedia.org/ontology/district#count  0.0157178626
33  97  http://dbpedia.org/ontology/foundingDate#count  0.0148247411
34  267  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Cities_of_Japan  0.0141595020
35  268  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Switzerland  0.0136140699
36  260  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Communes_of_Romania  0.0134756551
37  14  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/France  0.0119833788
38  377  http://xmlns.com/foaf/0.1/homepage#count  0.0117127918
39  30  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Brazil  0.0114840226
40  19  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Romania  0.0109216127
41  262  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_the_Philippines  0.0108070888
42  249  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Philippine_Standard_Time  0.0105554191
43  243  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/North_American_Eastern_Time_Zone  0.0098569008
44  3  http://dbpedia.org/ontology/areaLand#count  0.0094568186
45  4  http://dbpedia.org/ontology/areaLand#1  0.0094568186
46  113  http://dbpedia.org/ontology/minimumElevation#count  0.0094335063
47  118  http://dbpedia.org/ontology/populationAsOf#1@"2006-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0094283539
48  261  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Village  0.0094071382
49  114  http://dbpedia.org/ontology/minimumElevation#1  0.0093485650
50  240  http://dbpedia.org/ontology/routeStart?inv#count  0.0092929410
51  8  http://dbpedia.org/ontology/areaWater#count  0.0092498928
52  258  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Census-designated_place  0.0089104877
53  238  http://dbpedia.org/ontology/routeEnd?inv#count  0.0089041982
54  111  http://dbpedia.org/ontology/maximumElevation#count  0.0087616361
55  112  http://dbpedia.org/ontology/maximumElevation#1  0.0087616361
56  287  http://dbpedia.org/ontology/utcOffset#1@"-5"  0.0082976925
57  5  http://dbpedia.org/ontology/areaLand#2  0.0080500379
58  133  http://dbpedia.org/ontology/populationDensity#count  0.0078484319
59  6  http://dbpedia.org/ontology/areaTotal#count  0.0074483968
60  304  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/List_of_twin_towns  0.0073464966
61  294  http://dbpedia.org/ontology/utcOffset#2@"-4"  0.0067065538
62  244  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/North_American_Central_Time_Zone  0.0066068794
63  23  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Austria  0.0062625474
64  116  http://dbpedia.org/ontology/nearestCity?inv#count  0.0059831857
65  288  http://dbpedia.org/ontology/utcOffset#1@"+2"  0.0056993889
66  297  http://dbpedia.org/ontology/utcOffset#2@"+3"  0.0053709065
67  245  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Eastern_European_Time  0.0053411230
68  95  http://dbpedia.org/ontology/elevation#count  0.0052806194
69  1  http://dbpedia.org/ontology/area#1  0.0051504748
70  2  http://dbpedia.org/ontology/areaCode#count  0.0048780440
71  96  http://dbpedia.org/ontology/elevation#1  0.0046697548
72  279  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Brazil  0.0044755608
73  15  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Germany  0.0043193264
74  296  http://dbpedia.org/ontology/utcOffset#2@"-6"  0.0042388718
75  259  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Town  0.0042273599
76  135  http://dbpedia.org/ontology/postalCode#count  0.0040125311
77  22  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Czech_Republic  0.0040068343
78  115  http://dbpedia.org/ontology/motto#count  0.0038502787
79  256  http://dbpedia.org/ontology/type#count  0.0037569686
80  239  http://dbpedia.org/ontology/routeJunction?inv#count  0.0037468734
81  254  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Iran_Standard_Time  0.0036248488
82  264  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Village_(United_States)  0.0036064034
83  284  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Cuba  0.0035736682
84  20  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Australia  0.0034511135
85  18  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Poland  0.0033868709
86  120  http://dbpedia.org/ontology/populationAsOf#1@"2010-04-01"^^http://www.w3.org/2001/XMLSchema#date  0.0033147498
87  295  http://dbpedia.org/ontology/utcOffset#2@"+1"  0.0031843944
88  242  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Central_European_Time  0.0028993576
89  285  http://dbpedia.org/ontology/type#1@OTHER  0.0026631391
90  272  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/City_(California)  0.0025227055
91  25  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Mexico  0.0024821676
92  146  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Kyushu  0.0024320431
93  107  http://dbpedia.org/ontology/leaderTitle#1@"Municipal President"@en  0.0022439285
94  119  http://dbpedia.org/ontology/populationAsOf#1@"2010-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0021322853
95  266  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipality  0.0019976217
96  9  http://dbpedia.org/ontology/areaWater#1  0.0019811294
97  148  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Chūbu_region  0.0019766421
98  26  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Hungary  0.0019426290
99  136  http://dbpedia.org/ontology/region#count  0.0019369811
100  134  http://dbpedia.org/ontology/populationDensity#1  0.0019254055
101  269  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Gmina  0.0019167945
102  31  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Ukraine  0.0019097990
103  28  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Sweden  0.0016357978
104  302  http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1@OTHER  0.0016116168
105  16  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/United_Kingdom  0.0015910670
106  163  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Kansai_region  0.0014567274
107122http://dbpedia.org/ontology/populationAsOf#1@"2009-06-30"^^http://www.w3.org/2001/XMLSchema#date0.0013583905
10812http://dbpedia.org/ontology/country#count0.0013441720
109293http://dbpedia.org/ontology/utcOffset#1@OTHER0.0012687612
110274http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Borough_(Pennsylvania)0.0011648049
111139http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Pays_de_la_Loire0.0010797894
112273http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/List_of_cities_in_Argentina0.0010213124
11332http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/United_States_of_America0.0010138343
11421http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Spain0.0009926035
115308http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Ukraine0.0009778459
116275http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Borough_(New_Jersey)0.0009481932
117250http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Mountain_Time_Zone0.0008733979
118307http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Italy0.0008481666
119137http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Nord-Pas-de-Calais0.0007872309
120280http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Colombia0.0007818404
12124http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Italy0.0007681960
122255http://dbpedia.org/ontology/timeZone#1@OTHER0.0007568887
123251http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Eastern_Standard_Time_Zone0.0007361581
124263http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/New_England_town0.0006881598
125270http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/List_of_towns_and_villages_in_Illinois0.0006643873
126282http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/City_(Quebec)0.0006639864
127314http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Africa0.0006526736
12835http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Nord_(French_department)0.0006488608
129140http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Rhône-Alpes0.0006098502
130117http://dbpedia.org/ontology/populationAsOf#count0.0005952472
131257http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/City0.0005771659
132141http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Aquitaine0.0005468276
13340http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Maine-et-Loire0.0005248101
13439http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Loire-Atlantique0.0005043381
135142http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Lombardy0.0005034930
136277http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Corregimientos_of_Panama0.0004998276
137188http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Hokkaido0.0004952871
13869http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Dordogne0.0004947210
139271http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Administrative_divisions_of_New_York0.0004838897
14041http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Haut-Rhin0.0004605897
141316http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Asia0.0004484091
14238http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Finistère0.0004417426
143106http://dbpedia.org/ontology/leaderTitle#1@"Alcalde"@en0.0004371334
144227http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Düsseldorf0.0004304264
145309http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/The_United_Kingdom0.0004189674
146305http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Hungary0.0004174443
147170http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Kassel0.0004139833
14837http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Pas-de-Calais0.0004096477
149248http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Pacific_Time_Zone0.0004072352

Features to drop

Position  Feat#  Feat  Utility
0  344  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Northern_Cyprus  0.0000040442
1  126  http://dbpedia.org/ontology/populationAsOf#1@"2008-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000080627
2  371  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/History_of_Hamburg  0.0000092315
3  207  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Chester_County,_Pennsylvania  0.0000104297
4  222  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/McHenry_County,_Illinois  0.0000151230
5  359  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Transport  0.0000181687
6  125  http://dbpedia.org/ontology/populationAsOf#1@"2007-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000185330
7  206  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Madison_County,_Illinois  0.0000202727
8  130  http://dbpedia.org/ontology/populationAsOf#1@"2011-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000206101
9  199  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Piedmont  0.0000232673
10  214  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Marche  0.0000247764
11  231  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Lackawanna_County,_Pennsylvania  0.0000255142
12  219  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Trenčín_Region  0.0000261496
13  83  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Loire_(département)  0.0000262310
14  131  http://dbpedia.org/ontology/populationAsOf#1@"2014-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000294894
15  196  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Will_County,_Illinois  0.0000323192
16  176  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Moravian-Silesian_Region  0.0000324838
17  224  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Capital_District  0.0000332389
18  252  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Central_Time_Zone_(North_America)  0.0000341756
19  50  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Seine-Maritime  0.0000341793
20  235  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Hudson_Valley  0.0000346144
21  232  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Darmstadt_(region)  0.0000347087
22  208  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/St._Clair_County,_Illinois  0.0000352008
23  362  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Banstead  0.0000365581
24  361  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Malta  0.0000365581
25  364  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Marion_County  0.0000369277
26  366  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Oceania  0.0000369800
27  363  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Montenegro  0.0000369800
28  332  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Slovenia  0.0000370154
29  194  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/South_East_England  0.0000371703
30  368  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/List_of_people_from_the_Pittsburgh_metropolitan_area  0.0000372525
31  36  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Yvelines  0.0000414433
32  77  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Val-d'Oise  0.0000417613
33  76  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Drôme  0.0000417613
34  221  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Jefferson_Parish,_Louisiana  0.0000482626
35  220  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/East_Midlands  0.0000482626
36  223  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Sangamon_County,_Illinois  0.0000496529
37  53  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Nord_(département)  0.0000502507
38  72  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Ain  0.0000510437
39  84  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Côte-d'Or  0.0000514225
40  187  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Champagne-Ardenne  0.0000514636
41  218  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Trnava_Region  0.0000517374
42  202  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Olomouc_Region  0.0000523276
43  67  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Alpes-Maritimes  0.0000526531
44  92  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Haute-Vienne  0.0000526967
45  90  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Marne_(département)  0.0000526967
46  228  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Žilina_Region  0.0000545172
47  320  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Lithuania  0.0000555288
48  91  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Allier  0.0000557335
49  230  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Madeira_Island  0.0000561838
50  197  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/North_West_England  0.0000585648
51  145  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Campania  0.0000592991
52  205  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Apulia  0.0000604698
53  129  http://dbpedia.org/ontology/populationAsOf#1@"2009-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000608310
54  281  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Mexico  0.0000621425
55  87  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Moselle_(département)  0.0000627847
56  167  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Sicily  0.0000630973
57  318  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Estonia  0.0000636066
58  73  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Var_(département)  0.0000646552
59  323  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/The_Republic_of_Ireland  0.0000689172
60  212  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Île-de-France  0.0000693302
61  192  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Karlovy_Vary_Region  0.0000695792
62  61  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Vaucluse  0.0000696101
63  233  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Abruzzo  0.0000698774
64  322  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Latvia  0.0000700137
65  355  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Media  0.0000705965
66  234  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Latium  0.0000707228
67  147  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Veneto  0.0000717788

The list of most informative features looks very good; in particular, the number of time zones and the actual time zone values indicate that geographic location is important. While the Wikipedia page for each city contains its GPS coordinates, that information is not available in the DBpedia files I chose, but it is available in the GeoNames file, so it would be useful to retrieve it. I'll look into that in the GIS case study in Chapter 10.

The counts of incoming links also seem very useful, together with areaTotal. The dropped features look very specific and most probably fire very rarely.
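As a minimal sketch of how utility scores like the ones listed above can be produced, here is one way to rank features by mutual information against the target using scikit-learn. The feature matrix, target, and feature names below are toy stand-ins, not the chapter's actual data:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# toy data: only the first feature carries signal about the target
rng = np.random.RandomState(42)
X = rng.rand(500, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.rand(500)
names = ["areaTotal#1", "motto#count", "region#count"]  # illustrative names

# estimate mutual information between each feature and the target
utilities = mutual_info_regression(X, y, random_state=42)

# rank features by decreasing utility, as in the tables above
ranked = sorted(zip(names, utilities), key=lambda t: -t[1])
for name, util in ranked:
    print("{}\t{:.4f}".format(name, util))
```

Features at the bottom of such a ranking are the candidates to drop.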

I will now turn to another technique for analyzing the data: decision trees, looking at the splits they choose. Cell #29

In [29]:
# CELL 29
import numpy as np
import pickle

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

data = list()
header = None
with open("ch6_cell28_dev_feat1_filtered.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # pop
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        row_feats = list(map(float, fields[1:-1])) # avoid shadowing the file handle
        row = (row_feats, logpop, name)
        data.append(row)

xdata = np.array(list(map(lambda t:t[0], data)))
ydata = np.array(list(map(lambda t:t[1], data)))

ydata = cell28_adjudicate(ydata, segments_at[2])
print_names = list()
for name in header:
    name = name.replace('"','')
    if '@' in name:
        pos = name.find('@')
        rel, val = name[0:pos], name[pos+2:]
        rel = rel.replace('<','').replace('>','').split("/")[-1]
        val = val.replace('<','').replace('>','').split("/")[-1]
        name = "{}@{}".format(rel, val)
    else:
        name = name.replace('<','').replace('>','').split("/")[-1]
    print_names.append(name)   

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(xdata, ydata)

import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, max_depth=4,
                               impurity=False, proportion=False,
                               feature_names=print_names)
graph = graphviz.Source(dot_data)
graph.render("ch6_cell29_tree.dot")
graph
Out[29]:
(Graphviz rendering of the decision tree, top levels only: the root splits on birthPlace?inv#count <= 0.739 over 44,959 samples; subsequent splits include country#1@OTHER, country#1@India, areaTotal#1, utcOffset#count, elevation#1, rel#count, postalCode#count and 22-rdf-syntax-ns#type#1@City.)

Second Featurization

From the previous analysis, I decided to add a computed feature using $area \times density$ and to perform out-of-fold target rate encoding in Cell #30, producing "ch6_cell30_dev_feats2_train.tsv" and "ch6_cell30_dev_feats2_test.tsv".

Because it is not possible to do TRE first and normalization afterwards, this new feature set is produced already normalized.
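The idea behind out-of-fold target rate encoding can be sketched in isolation: each training row's categorical value is replaced by a target statistic computed only on the *other* folds, so a row never leaks its own target into its encoding. The toy categories and targets below, and the use of the mean (Cell #30 uses the median), are illustrative assumptions:

```python
import random

def oof_target_encode(cats, targets, n_folds=5, seed=42):
    """Out-of-fold target encoding for a single categorical column."""
    rand = random.Random(seed)
    folds = [rand.randint(0, n_folds - 1) for _ in cats]
    global_mean = sum(targets) / len(targets)  # fallback when no evidence
    encoded = []
    for i, cat in enumerate(cats):
        # gather targets for this category, excluding row i's own fold
        others = [t for j, t in enumerate(targets)
                  if cats[j] == cat and folds[j] != folds[i]]
        encoded.append(sum(others) / len(others) if others else global_mean)
    return encoded

cats    = ["city", "town", "city", "town", "city", "town"]
targets = [6.0, 3.0, 6.2, 3.1, 5.8, 2.9]  # e.g. log-populations
print(oof_target_encode(cats, targets))
```

At test time there is no leakage concern, so the encoding computed on all folds together (the "full TRE" in Cell #30) can be applied directly.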

In [30]:
# CELL 30
import math
import random
from functools import reduce
from sklearn.svm import SVR
import numpy as np
from collections import OrderedDict

rand = random.Random(42)
train_data = list()
test_data  = list()
header = None
with open("ch6_cell28_dev_feat1_filtered.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        row_feats = list(map(float, fields[1:-1])) # avoid shadowing the file handle
        row = (row_feats, logpop, name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)
               
test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))

# add computed feature: area times density
approx_area_train = np.zeros( (xtrain.shape[0],) )
approx_area_test  = np.zeros( (xtest.shape[0],)  )
# BUG: "area#1" is not included here; it should be
for idx, name in enumerate(header):
    if "#count" not in name and ("areaLand#" in name or "areaTotal#" in name):
        approx_area_train = np.max( [ approx_area_train, xtrain[:,idx] ], axis=0)
        approx_area_test =  np.max( [ approx_area_test, xtest[:,idx] ], axis=0)
density_idx = next(filter(lambda t:"populationDensity#1" in t[1], 
                     enumerate(header)))[0]
computed_value_train = approx_area_train * xtrain[:,density_idx]
computed_value_test  = approx_area_test  * xtest[:, density_idx]
computed_defined_train = np.array( list(map(lambda x: 1.0 if x>0 else 0.0, 
                                            computed_value_train)) )
computed_defined_test  = np.array( list(map(lambda x: 1.0 if x>0 else 0.0, 
                                           computed_value_test)) )
computed_value_train = np.log(computed_value_train + 1)
computed_value_test  = np.log(computed_value_test  + 1)
header.append("computed#defined")
header.append("computed#value")
xtrain = np.append(xtrain, computed_defined_train.reshape( (xtrain.shape[0],1) ), axis=1)
xtrain = np.append(xtrain, computed_value_train.reshape(   (xtrain.shape[0],1) ), axis=1)
xtest  = np.append(xtest,  computed_defined_test.reshape(  (xtest.shape[0],1) ),  axis=1)
xtest  = np.append(xtest,  computed_value_test.reshape(    (xtest.shape[0],1) ),  axis=1)

# out-of-fold target encoding for tsv
xtrain_non_tre = xtrain.copy()
xtest_non_tre  = xtest.copy()

def doTRE(xtrain, ytrain, xtest, ytest):
    PARAM_FOLDS = 10
    PARAM_ADD_ORIGINAL_COLS = True
    PARAM_SINGLE_COL_TRE = True
    folds = list(map(lambda x:rand.randint(0, PARAM_FOLDS - 1), range(xtrain.shape[0])))
    # value to use when there is no evidence available
    imputed_tre = sorted(list(ytrain))[int(ytrain.shape[0]/2)]
    
    # gather the TRE cols
    tre_cols = list()
    tre_col_names = list()
    cols_to_tre_cols = dict()
    cats_for_attrib = OrderedDict() # name to list of idx
    for idx, name in enumerate(header):
        if '@' in name: # categorical value
            tre_idx = len(tre_cols)
            cols_to_tre_cols[idx] = tre_idx
            tre_cols.append(idx)
            tre_col_names.append(name + '+TRE')
            attrib = name[0:name.index('@')]
            if attrib not in cats_for_attrib:
                cats_for_attrib[attrib] = []
            cats_for_attrib[attrib].append( (idx, tre_idx) )
    tre_train = np.zeros( (xtrain.shape[0], len(tre_cols)) )
    tre_test  = np.zeros( (xtest.shape[0],  len(tre_cols)) )
    
    saved_full_TRE = OrderedDict()
    
    for idx, name in enumerate(header):
        if '@' in name: # categorical value
            tre_col = cols_to_tre_cols[idx]
            
            fold_val_pops = { fold: dict() for fold in range(PARAM_FOLDS) }
            # gather the pop values for the features in all folds
            for row in range(xtrain.shape[0]):
                value = xtrain[row, idx]
                fold = folds[row]
                fold_val_pops[fold][value] = fold_val_pops[fold].get(value, []) + [ytrain[row]]
            all_values = set(xtrain[:,idx]).union(set(xtest[:,idx]))
            # compute TREs
            fold_TREs = list()
            for fold in range(PARAM_FOLDS):
                val_pops = dict()
                for other in range(PARAM_FOLDS):
                    if other != fold:
                        for value in fold_val_pops[other].keys():
                            val_pops[value] = val_pops.get(value, []) + fold_val_pops[other][value]
                fold_TRE = OrderedDict()
                for value in all_values:
                    val_pops[value] = sorted(val_pops.get(value, []))
                    fold_TRE[value] = imputed_tre if len(val_pops[value]) == 0 else \
                      val_pops[value][int(len(val_pops[value])/2)]
                fold_TREs.append(fold_TRE)
            full_val_pops = dict()
            for value in all_values:
                full_val_pops[value] = sorted(reduce(lambda x, y: x + y, 
                                                  map(lambda f:f.get(value, []), fold_val_pops.values()), []))
            full_TRE = OrderedDict()
            for value in all_values:
                full_TRE[value] = imputed_tre if len(full_val_pops[value]) == 0 \
                   else full_val_pops[value][int(len(full_val_pops[value])/2)]

            # encode folds
            for row in range(xtrain.shape[0]):
                tre_train[row, tre_col] = fold_TREs[folds[row]][xtrain[row, idx]]
            # encode test (full TRE)
            for row in range(xtest.shape[0]):
                tre_test[row, tre_col] = full_TRE[xtest[row, idx]]
            saved_full_TRE[idx] = full_TRE
            
    if PARAM_SINGLE_COL_TRE:
        single_tre_train = np.zeros( (xtrain.shape[0], len(cats_for_attrib)) )
        single_tre_test  = np.zeros( (xtest.shape[0],  len(cats_for_attrib)) )
        single_tre_col_names = list()
        single_tre_count = 0
        for attrib, idxs in cats_for_attrib.items():
            single_tre_col_names.append(attrib + '#TRE')
            for row in range(xtrain.shape[0]):
                max_tre_idx = None
                max_val = None
                for idx, tre_idx in idxs:
                    if max_tre_idx is None or xtrain[row,idx] > max_val:
                        max_val = xtrain[row,idx]
                        max_tre_idx = tre_idx
                single_tre_train[row,single_tre_count] = tre_train[row,max_tre_idx]
            for row in range(xtest.shape[0]):
                max_tre_idx = None
                max_val = None
                for idx, tre_idx in idxs:
                    if max_tre_idx is None or xtest[row,idx] > max_val:
                        max_val = xtest[row,idx]
                        max_tre_idx = tre_idx
                single_tre_test[row,single_tre_count] = tre_test[row,max_tre_idx]
            single_tre_count += 1
        tre_train = single_tre_train
        tre_test  = single_tre_test
        tre_col_names = single_tre_col_names

    if not PARAM_ADD_ORIGINAL_COLS:
        # delete original columns
        xtrain = np.delete(xtrain, tre_cols, 1)
        xtest  = np.delete(xtest,  tre_cols, 1)
        for idx in reversed(tre_cols):
            del header[idx]
            
    # add tre columns
    header.extend(tre_col_names)
    xtrain = np.append(xtrain, tre_train, axis=1)
    xtest = np.append(xtest, tre_test, axis=1)
    
    return header, xtrain, ytrain, xtest, ytest, saved_full_TRE

PARAM_WRITE_UNSCALED = False
if PARAM_WRITE_UNSCALED:
    new_header, xtrain, ytrain, xtest, ytest, _ = doTRE(xtrain, ytrain, xtest, ytest)

    # write new data
    with open("ch6_cell30_dev_feats2_train_raw.tsv", "w") as feats:
        feats.write("\t".join(["name"] + new_header + ["population"])+"\n")
        arr, target, names = xtrain, ytrain, list(map(lambda t:t[2],train_data))
        for row in range(arr.shape[0]):
            feats.write("\t".join(list(map(lambda v:str(v),
                                       [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")
    with open("ch6_cell30_dev_feats2_test_raw.tsv", "w") as feats:
        feats.write("\t".join(["name"] + new_header + ["population"])+"\n")
        arr, target, names = xtest, ytest, test_names
        for row in range(arr.shape[0]):
            feats.write("\t".join(list(map(lambda v:str(v),
                                       [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")

# need to re-do tre after scaling
xtrain = xtrain_non_tre
xtest  = xtest_non_tre

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# redo TRE after scaling
header, xtrain, ytrain, xtest, ytest, saved_full_TRE = doTRE(xtrain, ytrain, xtest, ytest)

np.savez_compressed("ch6_cell30_scaled_full_TRE.npz", 
                    saved_full_TRE = saved_full_TRE,
                   xtrain_min      = xtrain_min,
                   xtrain_scaling  = xtrain_scaling,
                   ytrain_min      = ytrain_min,
                   ytrain_scaling  = ytrain_scaling)
with open("ch6_cell30_dev_feats2_train.tsv", "w") as feats:
    feats.write("\t".join(["name"] + header + ["population"])+"\n")
    arr, target, names = xtrain, ytrain, list(map(lambda t:t[2],train_data))
    for row in range(arr.shape[0]):
        feats.write("\t".join(list(map(lambda v:str(v),
                                   [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")
with open("ch6_cell30_dev_feats2_test.tsv", "w") as feats:
    feats.write("\t".join(["name"] + header + ["population"])+"\n")
    arr, target, names = xtest, ytest, test_names
    for row in range(arr.shape[0]):
        feats.write("\t".join(list(map(lambda v:str(v),
                                   [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")


# train
print("Training on {:,} cities".format(len(xtrain)))

best_c = 100.0
best_epsilon = 0.05
PARAM_DO_GRID_SEARCH = False
if PARAM_DO_GRID_SEARCH:
    best_rmse = 1000
    for c in [0.001, 0.01, 0.1, 1.0, 2.0, 5.0, 15.0, 100.0]:
        svr_rbf = SVR(epsilon=0.01, C=c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("C", c, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_c = c
            best_rmse = RMSE
    print("Best C", best_c,"best RMSE", best_rmse)        

    best_epsilon = 0.1
    best_rmse = 1000
    for epsilon in [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0, 10.0, 100.0]:
        svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("Epsilon", epsilon, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_epsilon = epsilon
            best_rmse = RMSE
    print("Best epsilon", best_epsilon,"best RMSE", best_rmse)

svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch6_cell30_feat2_svr.npz", 
                    xtrain=xtrain, ytrain=ytrain, 
                    xtest