Chapter 6: Graphs

Accompanying code for the book The Art of Feature Engineering.

This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng

MIT License

Copyright 2019 Pablo Duboue

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Limitations

  • Simple Python intended for people coming from other languages who plan to use the ideas described in the book outside of Python.
  • Many of these techniques are available as library calls. They are spelled out here for teaching purposes.
  • Resource limitations:
    • At most one day of running time per notebook.
    • No GPU required.
    • Minimal dependencies.
    • At most 8GB of RAM.
  • Due to resource limitations, these notebooks do not undergo as much hyperparameter tuning as would be ideal. This is a shortcoming of these case studies; keep it in mind if you want to follow a similar path with your experiments.
  • To help readers try variants of some cells in isolation, the cells are easily executable without having to re-run the whole notebook. As such, most cells read everything they need from disk and write all their results back to disk, which is unnecessary in normal notebooks. The code for each cell might therefore look long and somewhat unusual: in a sense, each cell tries to be a separate Python program.
  • I dislike Pandas so these notebooks are Pandas-free, which might seem unusual to some.

Chapter 6: Case Study on Graph Data

In this chapter I will describe the creation of the base WikiCities dataset employed in this chapter and the next three chapters.

WikiCities Dataset

It starts with the latest release of DBpedia [cite], from October 2016; DBpedia is a curated ontological resource obtained from the infoboxes at Wikipedia. Infoboxes are the tabular information appearing next to the page content on each Wikipedia page.

My intention with this dataset is to provide a task that can be attacked with regular features, time series features, textual features and image features. The task I found that allows for all these approaches is predicting the population of a city or town: based on its ontological properties (e.g., the name of its leader or its time zone); based on its historical population and other historical features (which involves time series analysis); based on the textual description of the place (which involves text analysis, particularly as the description many times includes the population itself); and based on a satellite image of the city (which involves image processing).

DBpedia is distributed as files of triples, where a triple consists of a subject, a verb and an object. The subject is always an entity in the ontology, prefixed by the namespace dbpedia.org. The verb is an ontological relation, of which there are many (type, leader-name, etc.). The object can be either another entity or a literal value (a string or a number). Besides the triple files, DBpedia also contains links to multiple other ontologies, like Freebase or Yago, plus links to the original Wikipedia articles that resulted in the extracted information. For these case studies we will be using the files "instance_types_en.ttl.bz2", "mappingbased_literals_en.ttl.bz2" and "mappingbased_objects_en.ttl.bz2".
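
To make the triple format concrete, here is a minimal sketch of parsing one N-Triples line with the same style of regular expression used throughout the cells below (the example line and its population value are made up):

```python
import re

# Same triple pattern used in the notebook cells: subject and verb are
# bracketed URIs, the object is a URI or a literal, and the line ends
# with " ."
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# A made-up line in the style of the DBpedia dumps (the population
# value here is hypothetical)
line = ('<http://dbpedia.org/resource/Psevdas> '
        '<http://dbpedia.org/ontology/populationTotal> '
        '"502"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .')

subj, verb, obj = triple_re.match(line).groups()
print(subj)  # <http://dbpedia.org/resource/Psevdas>
print(verb)  # <http://dbpedia.org/ontology/populationTotal>
print(obj)   # "502"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>
```

Note that literal objects carry a ^^<datatype> suffix that must be stripped before the value can be used as a number.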

The type information is derived from the infobox itself. The first challenge is to identify cities, as the "city" or "settlement" type is not thoroughly annotated in Wikipedia. Instead of relying on heuristics (such as "any entity with a location and a population", which would also pick up neighborhoods), I chose to use an external, more reliable source: the GeoNames project. The project distributes a list of "cities or settlements with at least 1,000 people", containing 128,000+ places (file "cities1000.txt"). To link them back to DBpedia we use the file "geonames_links.ttl.bz2" distributed by DBpedia (Cell #1).

In [1]:
# CELL 1
import re
import bz2

id_re = re.compile(r"^(\d+)")
url_re = re.compile(r".*geonames\.org\/(\d+)")

cities1000_ids = set()
with open("cities1000.txt") as cities_file:
    for line in cities_file:
        cities1000_ids.add(id_re.match(line).group(0))
print("Cities1000 total: {:,}".format(len(cities1000_ids)))

found = 0
with open("ch6_cell1_cities1000_links.ttl","w") as out:
    with bz2.BZ2File("geonames_links.ttl.bz2","r") as links:
        for byteline in links:
            if byteline[0] == ord('#'):
                continue
            line = byteline.decode("utf-8")
            match = url_re.match(line)
            if not match:
                print(byteline)
                continue
            if match.group(1) in cities1000_ids:
                out.write(line)
                found += 1            
print("In DBpedia found: {:,}".format(found))
cities1000_ids = None # free memory
Cities1000 total: 128,764
In DBpedia found: 80,251

Therefore, using the available links to DBpedia, we found 80,251 of them in the Wikipedia data set. These 80,251 cities plus DBpedia constitute the raw data for this problem (file "ch6_cell1_cities1000_links.ttl").

Following the methodology described in Chapter 1, I will set aside 20% of this data for final evaluation at the end of the feature engineering process (Cell #2).

In [2]:
# CELL 2
import random

rand = random.Random(42)
all_cities = open("ch6_cell1_cities1000_links.ttl").readlines()
rand.shuffle(all_cities)
pivot = int(len(all_cities) * 0.8)
devset = all_cities[:pivot]
heldout = all_cities[pivot:]
with open("ch6_cell2_cities1000_dev_links.ttl", "w") as dev:
    dev.write("".join(devset))
with open("ch6_cell2_cities1000_held_links.ttl", "w") as held:
    held.write("".join(heldout))
print("Devset size: {:,}".format(len(devset)))
print("Heldout size: {:,}".format(len(heldout)))
all_cities = devset = heldout = None # free memory
Devset size: 64,200
Heldout size: 16,051

Following the methodology in Chapter 1, I now want to do some exploratory data analysis over these 64,200 settlements to decide on a basic featurization and a model to employ.

The key issue to resolve here is to drill down on the specific relations we think will be useful to predict our target class. While DBpedia contains thousands of relations, many are seldom expressed (meaning there are few pages where a human editor felt the need to record information such as "distance-to-London").

While we want to count the number of relations in which our cities of interest partake, something to remember is that there might be value in relations that have a given city as their subject, but there might also be value in relations that have the city as their object (that is, the inverse of certain relations).

Because processing the large DBpedia files is computationally onerous, I'll start by pre-filtering the triples related to all the entities of interest, both forward and inverse, generating the file "ch6_cell3_cities1000_base.ttl" (Cell #3).

In [3]:
# CELL 3
city_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell1_cities1000_links.ttl").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

def compressed_triples_gen(filename):
    with bz2.BZ2File(filename,"r") as compressed:
        for byteline in compressed:
            if byteline[0] == ord('#'):
                continue
            line = byteline.decode("utf-8")
            s,v,o = triple_re.match(line).groups()
            yield line,s,v,o
            
written = 0
read = 0
with open("ch6_cell3_cities1000_base.ttl","w") as base:
    # types and literals only in subject position
    for line, subj, verb, obj in compressed_triples_gen("instance_types_en.ttl.bz2"):
        read += 1 
        if subj in city_uris:
            base.write(line)
            written += 1
    for line, subj, verb, obj in compressed_triples_gen("mappingbased_literals_en.ttl.bz2"):
        read += 1
        if subj in city_uris:
            base.write(line)
            written += 1
    for line, subj, verb, obj in compressed_triples_gen("mappingbased_objects_en.ttl.bz2"):
        read += 1
        if subj in city_uris:
            base.write(line)
            written += 1
        if obj in city_uris:
            base.write("{} {} {} .\n".format(obj, verb[0:-1] + "?inv>", subj))
            written += 1
print("Base {:,} out of {:,}".format(written, read))
city_uris = None # free memory
Base 2,014,559 out of 38,285,143

That code takes a while to run, but it will speed up the rest of the work quite a bit down the road, as we now only need to worry about 2 million triples instead of 38 million. The next step is to produce filtered versions just for the development and heldout sets, in the files "ch6_cell4_cities1000_dev_base.ttl" and "ch6_cell4_cities1000_heldout_base.ttl", respectively (Cell #4).

In [4]:
# CELL 4
devset_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell2_cities1000_dev_links.ttl").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

devset_size = 0
with open("ch6_cell3_cities1000_base.ttl") as base:
    with open("ch6_cell4_cities1000_dev_base.ttl", "w") as devset:
        with open("ch6_cell4_cities1000_heldout_base.ttl", "w") as heldout:
            for line in base:
                subj, verb, obj = triple_re.match(line).groups()
                if subj in devset_uris:
                    devset.write(line)
                    devset_size += 1
                else:
                    heldout.write(line)
print("Devset base size: {:,} triples".format(devset_size))
devset_uris = None # free memory
Devset base size: 1,625,988 triples

These close to 1.6 million triples are our working set.

EDA: Exploratory Data Analysis

I'll start the exploratory data analysis by visualizing 10 random entities using the Graphviz package (Cell #5). The cities are rendered to files ch6_cell5_city_*.pdf.

In [5]:
# CELL 5
import re
import graphviz
import random

PARAM_USE_SET_FOR_BOOK = True

devset_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell1_cities1000_links.ttl").readlines()))
rand = random.Random(42)
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# stable set for book
cities = [ "Doolittle,_Texas", "Belton,_Missouri", "Poiana_Sibiului", "Unadilla,_Georgia",
          "Sankeshwar,_Karnataka", "Psevdas", "Skudai", "Santa_Maria_da_Boa_Vista",
          "Newstead,_Tasmania", "Gombi" ]

if PARAM_USE_SET_FOR_BOOK:
    cities = list(map(lambda x: "<http://dbpedia.org/resource/" + x + ">", cities))
else:
    cities = set(rand.sample(sorted(devset_uris), 10))

nodes = dict()
dots = dict()

for city_uri in cities:
    dots[city_uri] = graphviz.Digraph(format='pdf', graph_attr={'rankdir':'BT'})
    nodes[city_uri] = dict()

def cell5_make_or_return_node(city_uri, name, shape=None):
    if name in nodes[city_uri]:
        return nodes[city_uri][name]
    ret = 'N' + str(len(nodes[city_uri]))
    nodes[city_uri][name] = ret
    if shape is None:
        dots[city_uri].node(ret, name)
    else:
        dots[city_uri].node(ret, name.replace('&', ' '), shape=shape)
    return ret

for city_uri in cities:
    cell5_make_or_return_node(city_uri, city_uri, 'star')
    
rel_count = dict()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in cities:
            city_uri = subj
            rel_count[verb] = rel_count.get(verb, 0) + 1
            if verb.endswith("?inv>"):
                subj, obj = obj, subj
                verb = verb[:-5] + ">"
            subj_name = cell5_make_or_return_node(city_uri, subj)
            obj_name = cell5_make_or_return_node(city_uri, obj)
            dots[city_uri].edge(subj_name, obj_name, verb)
for city_uri in cities:
    city = city_uri[:-1].split("/")[-1]
    dot = dots[city_uri]
    print("Rendering",city,":",len(dot.source.splitlines()), "dot lines")
    dot.render("ch6_cell5_city_" + city + ".dot")
with open("ch6_cell5_rel_counts.tsv", "w") as rels:
    for rel in sorted(rel_count.keys()):
        rels.write("{}\t{}\n".format(rel,rel_count[rel]))
print("\nTop relation counts:")
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>{}</td><td>{}</td></tr>".format(r[0][:-1].split("/")[-1],r[1]), 
                               sorted(rel_count.items(), key=lambda t:t[1], reverse=True)[:10]))) +
            "</table>"))
Rendering Doolittle,_Texas : 4 dot lines
Rendering Belton,_Missouri : 62 dot lines
Rendering Poiana_Sibiului : 37 dot lines
Rendering Unadilla,_Georgia : 54 dot lines
Rendering Sankeshwar,_Karnataka : 4 dot lines
Rendering Psevdas : 22 dot lines
Rendering Skudai : 41 dot lines
Rendering Santa_Maria_da_Boa_Vista : 4 dot lines
Rendering Newstead,_Tasmania : 18 dot lines
Rendering Gombi : 20 dot lines

Top relation counts:
Rel | Counts
utcOffset | 10
isPartOf | 10
timeZone | 8
22-rdf-syntax-ns#type | 7
name | 7
country | 7
populationTotal | 5
areaTotal | 4
leaderTitle | 4
elevation | 4

From here we can see that the randomly sampled cities are all fairly small, which suggests that most rows will have very little data while a few will contain the bulk of the triples. The amount of information available is already a great indicator of population.

From looking at the graphs we can see that certain places (like Skudai) have no population information. The next step is thus to filter cities for which the population is known (as we can only use those to build a regressor or classifier).

From the relation table in "ch6_cell5_rel_counts.tsv" we can see that some relations are very rare (like homepage or leaderParty) while others are defined for almost all cities (like rdf type or areaLand). The inverse relations are also quite sparse, but then again these places are all sparsely populated.

But first we need to know how population is expressed in DBpedia. From Psevdas we can see it has a relation "populationTotal". Is that the only way population is expressed? Let us see which relations contain the word "population" in their name (Cell #6).

In [6]:
# CELL 6
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rels_of_interest = dict()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if "population" in verb:
            rels_of_interest[verb] = rels_of_interest.get(verb, 0) + 1
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>{}</td><td>{}</td></tr>".format(r[0][:-1].split("/")[-1],r[1]), 
                               sorted(rels_of_interest.items(), key=lambda t:t[1], reverse=True)))) +
            "</table>"))
Rel | Counts
populationTotal | 49028
populationDensity | 19540
populationAsOf | 5410
populationPlace?inv | 2511
populationMetro | 1286
populationUrban | 1188
populationTotalRanking | 1057
populationMetroDensity | 142
populationUrbanDensity | 79
populationRural | 42

It seems "populationTotal" is the right relation to use, but we will need to filter out populationMetro, populationUrban and populationRural. Also, if populationTotal is not defined, any of those, if defined, could be used as a surrogate for our target class. At this stage we want to filter the cities with a defined population; I will put them into the file "ch6_cell7_dev_pop.tsv" and also collect the population value itself (which will require cleaning type information such as ^^http://www.w3.org/2001/XMLSchema#nonNegativeInteger). This number will be our target value for regression (or, discretized, our target class).
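
The literal cleaning mentioned above can be sketched as a small standalone helper (the function name is mine, not from the notebook's code):

```python
def parse_literal_number(obj):
    """Parse an N-Triples literal object such as
    "49028"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>
    by stripping the quotes and the ^^<datatype> suffix; returns the
    value as a float, or None if it is not numeric."""
    if obj.startswith('"'):
        # keep only the text between the first pair of double quotes
        obj = obj[1:obj.index('"', 1)]
    try:
        return float(obj)
    except ValueError:
        return None

print(parse_literal_number(
    '"49028"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>'))  # 49028.0
print(parse_literal_number('"not-a-number"'))  # None
```

Cell #7 below performs the same cleaning inline while also arbitrating between the different population relations.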

I apply this process to both the dev and heldout sets (the heldout populations go into file "ch6_cell7_heldout_pop.tsv") in Cell #7.

In [7]:
# CELL 7
import re
from collections import OrderedDict

cities_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell1_cities1000_links.ttl").readlines()))
devset_uris = set(map(lambda line:line.split(' ')[0], open("ch6_cell2_cities1000_dev_links.ttl").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

population = OrderedDict()
with open("ch6_cell3_cities1000_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        verb = verb[:-1].split("/")[-1]
        update = False
        conditional = True
        if verb == "populationTotal":
            update = True
            conditional = False
        elif verb in set(['populationMetro', 'populationUrban', 'populationRural']):
            if subj in population and population[subj][1] == 'populationTotal':
                pass # keep current
            else:
                update = True
                conditional = subj in population
        if update:
            # compute actual population
            if obj[0] == '"':
                obj = obj[1:obj[1:].index('"')+1]
            try:
                pop = float(obj)
            except ValueError:
                print("For city", subj, "cannot convert", line)
                continue # ignore
            if not conditional or population[subj][0] < pop:
                population[subj] = (pop, verb)
                
dev_cities_with_pop = 0
with open("ch6_cell7_dev_pop.tsv", "w") as dev:
    with open("ch6_cell7_heldout_pop.tsv", "w") as held:
        for city, pop in population.items():
            line = "{}\t{}\n".format(city, pop[0])
            if city in devset_uris:
                dev.write(line)
                dev_cities_with_pop += 1
            else:
                held.write(line)
print("Cities with known population in devset: {:,}".format(dev_cities_with_pop))
Cities with known population in devset: 48,868

Now we can proceed to repeat the plotting of 10 random cities in the devset with known population (Cell #8).

In [8]:
# CELL 8
import re
import graphviz
import random

PARAM_USE_SET_FOR_BOOK = True

devset_uris = set(map(lambda line:line.split('\t')[0], open("ch6_cell7_dev_pop.tsv").readlines()))
rand = random.Random(42)
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# stable set for book
cities = [ "Ovalle", "Hawaiian_Paradise_Park,_Hawaii", "Coupar_Angus",
          "Hilsenheim", "Esko,_Minnesota", "Hamlin,_New_York", "Falkenberg,_Märkisch-Oderland",
          "Wermelskirchen", "Berne,_Indiana", "Chennevières-sur-Marne" ]

if PARAM_USE_SET_FOR_BOOK:
    cities = list(map(lambda x: "<http://dbpedia.org/resource/" + x + ">", cities))
else:
    cities = set(rand.sample(sorted(devset_uris), 10))

nodes = dict()
dots = dict()

for city_uri in cities:
    dots[city_uri] = graphviz.Digraph(format='pdf', graph_attr={'rankdir':'BT'})
    nodes[city_uri] = dict()

for city_uri in cities:
    cell5_make_or_return_node(city_uri, city_uri, 'star')

rel_count = dict()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in cities:
            city_uri = subj
            rel_count[verb] = rel_count.get(verb, 0) + 1
            if verb.endswith("?inv>"):
                subj, obj = obj, subj
                verb = verb[:-5] + ">"
            subj_name = cell5_make_or_return_node(city_uri, subj)
            obj_name = cell5_make_or_return_node(city_uri, obj)
            dots[city_uri].edge(subj_name, obj_name, verb)
for city_uri in cities:
    city = city_uri[:-1].split("/")[-1]
    dot = dots[city_uri]
    print("Rendering",city,":",len(dot.source.splitlines()), "dot lines")
    dot.render("ch6_cell8_city_" + city + ".dot")
    
with open("ch6_cell8_rel_counts.tsv", "w") as rels:
    for rel in sorted(rel_count.keys()):
        rels.write("{}\t{}\n".format(rel,rel_count[rel]))
print("\nTop relation counts:")
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>{}</td><td>{}</td></tr>".format(r[0][:-1].split("/")[-1],r[1]), 
                               sorted(rel_count.items(), key=lambda t:t[1], reverse=True)[:10]))) +
            "</table>"))
Rendering Ovalle : 59 dot lines
Rendering Hawaiian_Paradise_Park,_Hawaii : 48 dot lines
Rendering Coupar_Angus : 4 dot lines
Rendering Hilsenheim : 28 dot lines
Rendering Esko,_Minnesota : 37 dot lines
Rendering Hamlin,_New_York : 46 dot lines
Rendering Falkenberg,_Märkisch-Oderland : 32 dot lines
Rendering Wermelskirchen : 45 dot lines
Rendering Berne,_Indiana : 67 dot lines
Rendering Chennevières-sur-Marne : 20 dot lines

Top relation counts:
Rel | Counts
isPartOf | 11
name | 10
areaCode | 10
22-rdf-syntax-ns#type | 9
elevation | 9
populationTotal | 9
country | 9
areaTotal | 8
postalCode | 8
birthPlace?inv | 8

These cities with defined population definitely have much more going on, including multiple areaCode relations (three in the case of Wermelskirchen) and a variety of inverse relations. At this stage of the EDA process we want to look into which of these relations might make for good features. Going back to Chapter 1, we want features that are simple, related to the target (population) and readily available. From the perspective of "readily available", we want to narrow the relations down to ones that are frequent enough to be useful to a ML algorithm (I will look into the other issues later). Therefore, their frequency over the whole set of entities is paramount. Let us sort the relations by the number of cities that have them defined (Cell #9).

In [9]:
# CELL 9
import re
from collections import OrderedDict

devset_uris = set(map(lambda line:line.split('\t')[0], open("ch6_cell7_dev_pop.tsv").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count = OrderedDict()
rel_subj_seen = set()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in devset_uris:
            rel_subj = "{}-{}".format(subj,verb)
            if rel_subj not in rel_subj_seen:
                rel_count[verb] = rel_count.get(verb, 0) + 1
                rel_subj_seen.add(rel_subj)
sorted_count = sorted(rel_count.items(), key=lambda t:t[1])

# drop long names
sorted_count = list(map(lambda t:(t[0][:-1].split("/")[-1], str(t[1])), sorted_count))
table = list()
step = int(len(sorted_count) / 10)
print("Total relations", len(sorted_count), "percentile size", step)
table.append( ("First", sorted_count[0][0], sorted_count[0][1], 
              "{}%".format(int(float(sorted_count[0][1]) / len(devset_uris) * 10000) / 100) ))
current = step
percentile = 1
while current < len(sorted_count):
    table.append( ("At {} percentile".format(percentile*10), sorted_count[current][0],
                 sorted_count[current][1],
                 "{}%".format(int(float(sorted_count[current][1]) / len(devset_uris) * 10000) / 100) ) )
    current += step
    percentile += 1
table.append( ("Last", sorted_count[-1][0], sorted_count[-1][1], 
              "{}%".format(int(float(sorted_count[-1][1]) / len(devset_uris) * 10000) / 100) ))


from IPython.display import HTML, display
display(HTML("<table><tr><th>Pos</th><th>Rel</th><th>Counts</th><th>Percentage of cities defined</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>" + "</td><td>".join(r) + "</td></tr>", table))) +
            "</table>"))
Total relations 347 percentile size 34
Pos | Rel | Counts | Percentage of cities defined
First | buildingEndDate | 1 | 0.0%
At 10 percentile | sportCountry?inv | 1 | 0.0%
At 20 percentile | lowestState | 3 | 0.0%
At 30 percentile | inflow?inv | 5 | 0.01%
At 40 percentile | board?inv | 9 | 0.01%
At 50 percentile | riverMouth?inv | 27 | 0.05%
At 60 percentile | county?inv | 78 | 0.15%
At 70 percentile | knownFor?inv | 209 | 0.42%
At 80 percentile | populationTotalRanking | 699 | 1.43%
At 90 percentile | headquarter?inv | 2769 | 5.66%
At 100 percentile | areaTotal | 29917 | 61.22%
Last | 22-rdf-syntax-ns#type | 48868 | 100.0%

Which relations we use is key to the success of the ML. At this stage we will over-select and later drill down to find a better subset. But we do not want to select too many, particularly ones that are not useful: anything appearing in less than 5% of the cities is unlikely to help, even if it looks very appealing (such as "populationTotalRanking", available for only 1.43% of the cities in the devset). The next stage is then to write down the list of final relations (file "ch6_cell10_rels.txt") and filter the devset to only those triples (file "ch6_cell10_cities1000_dev_filtered.ttl"). That will also further reduce computation down the road (Cell #10).

In [10]:
# CELL 10
import re
from collections import OrderedDict

devset_uris = set(map(lambda line:line.split('\t')[0], open("ch6_cell7_dev_pop.tsv").readlines()))
triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count = OrderedDict()
rel_subj_seen = set()
with open("ch6_cell4_cities1000_dev_base.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if subj in devset_uris:
            rel_subj = "{}-{}".format(subj,verb)
            if rel_subj not in rel_subj_seen:
                rel_count[verb] = rel_count.get(verb, 0) + 1
                rel_subj_seen.add(rel_subj)

# delete populations
del rel_count["<http://dbpedia.org/ontology/populationTotal>"]
del rel_count["<http://dbpedia.org/ontology/populationMetro>"]
del rel_count["<http://dbpedia.org/ontology/populationUrban>"]
del rel_count["<http://dbpedia.org/ontology/populationRural>"]
                
threshold = len(devset_uris) * 0.05
selected_rels = sorted(filter(lambda t:t[1] >= threshold, rel_count.items()), key=lambda t:t[0])

with open("ch6_cell10_rels.txt", "w") as rels:
    rels.write("\n".join(map(lambda t:t[0], selected_rels))+"\n")

selected_rels = set(map(lambda t:t[0], selected_rels))
written = 0
with open("ch6_cell10_cities1000_dev_filtered.ttl", "w") as filtered:
    with open("ch6_cell4_cities1000_dev_base.ttl") as base:
        for line in base:
            subj, verb, obj = triple_re.match(line).groups()
            if subj in devset_uris and verb in selected_rels:
                filtered.write(line)
                written += 1

print("Selected", len(selected_rels), "relations")
print("Filtered down to {:,} triples".format(written))
Selected 43 relations
Filtered down to 1,278,943 triples

From all this process we are now down to 43 relations and 1.3 million triples for about 48,000 instances. But we still do not have features, as these relations can have multiple values and those values are unordered. To deal with this situation, we can recall the techniques from Chapter 5 for dealing with sets, lists, graphs, etc. For the set-based (unordered) nature of the relation values, we can sort them, in this case by how frequent each value is; if a value is more frequent, it might be more informative to the ML. That leaves us with a list of values for each relation. To represent lists, a common technique is to truncate them to a maximum size. We thus need to see how many times each of these 43 relations is defined for a given entity (Cell #11).
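
The sort-then-truncate idea can be sketched as follows (a hypothetical helper with made-up counts, not the notebook's code):

```python
from collections import Counter

def to_fixed_features(values, global_counts, max_len, pad="N/A"):
    """Turn an unordered set of relation values into a fixed-length
    list: sort by how frequent each value is overall (most frequent
    first), truncate to max_len, and pad short lists with a sentinel."""
    ordered = sorted(values, key=lambda v: -global_counts[v])[:max_len]
    return ordered + [pad] * (max_len - len(ordered))

# made-up global frequency counts for the values of a timeZone-like relation
counts = Counter({"CET": 1000, "EET": 300, "MSK": 50})

print(to_fixed_features({"EET", "CET"}, counts, 2))  # ['CET', 'EET']
print(to_fixed_features(["MSK"], counts, 3))         # ['MSK', 'N/A', 'N/A']
```

The maximum length per relation will be chosen from the fan-out statistics computed below.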

In [11]:
# CELL 11
import re
from collections import OrderedDict

triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count_per_city = OrderedDict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if verb not in rel_count_per_city:
            rel_count_per_city[verb] = OrderedDict()
        rel_count_per_city[verb][subj] = rel_count_per_city[verb].get(subj, 0) + 1

# sort and compute min, max, average and median
table = list()
for rel in sorted(rel_count_per_city.keys()):
    city_counts = sorted(rel_count_per_city[rel].items(), key=lambda t:t[1])
    city_counts = list(map(lambda t:(t[0][:-1].split("/")[-1],t[1]), city_counts))
    table.append( (rel[:-1].split("/")[-1], # rel 
                   len(city_counts), # counts
          city_counts[-1], # max
          int(sum(map(lambda t:t[1], city_counts)) / len(city_counts) * 100) / 100, # avg
          city_counts[int(len(city_counts) / 2)], # median
          city_counts[0])  # min  
    )
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th><th>Max</th><th>Avg</th><th>Median</th><th>Min</th></tr>" +
            "\n".join(list(map(lambda r:"<tr><td>"+"</td><td>".join(list(map(str,r)))+
                               "</td></tr>", table))) +"</table>"))
Rel | Counts | Max | Avg | Median | Min
area | 4803 | ('Saint-Germain-Nuelles', 1) | 1.0 | ('Labeuvrière', 1) | ('Austin,_Texas', 1)
areaCode | 25417 | ('Iserlohn', 8) | 1.14 | ('Martin,_Slovakia', 1) | ('Ankara', 1)
areaLand | 11669 | ('Spotswood,_New_Jersey', 2) | 1.69 | ('Reinbeck,_Iowa', 2) | ('Amsterdam', 1)
areaTotal | 29917 | ('Spotswood,_New_Jersey', 2) | 1.27 | ('Osann-Monzel', 1) | ('Ankara', 1)
areaWater | 10950 | ('Spotswood,_New_Jersey', 2) | 1.42 | ('Mebane,_North_Carolina', 1) | ('Amsterdam', 1)
birthPlace?inv | 21089 | ('London', 6258) | 13.31 | ('Itō,_Shizuoka', 3) | ('Kaysersberg', 1)
city?inv | 13551 | ('London', 709) | 3.9 | ('Ashburnham,_Massachusetts', 2) | ('Iquitos', 1)
country | 45977 | ('Sada,_Galicia', 8) | 1.04 | ('Talence', 1) | ('Ankara', 1)
deathPlace?inv | 10046 | ('London', 3059) | 8.36 | ('Longmeadow,_Massachusetts', 2) | ('Sauk_City,_Wisconsin', 1)
department | 2587 | ('Saint-Denis,_Seine-Saint-Denis', 1) | 1.0 | ('Trémuson', 1) | ('Ajaccio', 1)
district | 5842 | ('Waterloo,_Illinois', 6) | 1.01 | ('Müllrose', 1) | ('Aalen', 1)
elevation | 28669 | ('Córdoba,_Argentina', 4) | 1.33 | ('Villaquilambre', 1) | ('Ankara', 1)
foundingDate | 3003 | ('Tampa,_Florida', 5) | 1.14 | ('Rio_Branco,_Acre', 1) | ('Austin,_Texas', 1)
ground?inv | 3368 | ('Vienna', 134) | 1.94 | ('Nödinge-Nol', 1) | ('Kansas_City,_Kansas', 1)
headquarter?inv | 2769 | ('London', 509) | 5.63 | ('Cortona', 1) | ('Wigton', 1)
hometown?inv | 4144 | ('London', 1562) | 6.52 | ('West_Covina,_California', 1) | ('Aston', 1)
inseeCode | 3906 | ('Saint-Germain-Nuelles', 1) | 1.0 | ('Ploumagoar', 1) | ('Ajaccio', 1)
isPartOf | 32266 | ('Baima,_Baxoi_County', 10) | 2.02 | ('Omalur', 2) | ('Amsterdam', 1)
isPartOf?inv | 3786 | ('Shijiazhuang', 296) | 7.0 | ('Santa_Teresa_del_Tuy', 1) | ('Tuxtla_Gutiérrez', 1)
leaderName | 3192 | ('North_York', 24) | 1.35 | ('Pointe-Noire', 1) | ('Ankara', 1)
leaderTitle | 12322 | ('Canora,_Saskatchewan', 8) | 1.45 | ('Bad_Sachsa', 1) | ('Amsterdam', 1)
location?inv | 14236 | ('London', 2438) | 7.43 | ('Schiller_Park,_Illinois', 2) | ('Sèvres', 1)
locationCity?inv | 2822 | ('London', 684) | 4.33 | ('Fall_River,_Massachusetts', 1) | ('Orchard_Park_(town),_New_York', 1)
maximumElevation | 4159 | ('Zahlé', 2) | 1.0 | ('Gières', 1) | ('Aarhus', 1)
minimumElevation | 4093 | ('Zahlé', 2) | 1.0 | ('Raismes', 1) | ('Aarhus', 1)
motto | 2689 | ('Mérida,_Mérida', 4) | 1.08 | ('Moraine,_Ohio', 1) | ('Amsterdam', 1)
nearestCity?inv | 4136 | ('Middletown,_Orange_County,_New_York', 46) | 1.78 | ('Owatonna,_Minnesota', 1) | ('Copenhagen', 1)
populationAsOf | 5303 | ('Gilroy,_California', 2) | 1.0 | ('Ellrich', 1) | ('Aarhus', 1)
populationDensity | 13788 | ('Batavia,_New_York', 2) | 1.41 | ('Santiz', 1) | ('Austin,_Texas', 1)
postalCode | 31782 | ('Nijemci', 8) | 1.0 | ('Lingig,_Surigao_del_Sur', 1) | ('Ankara', 1)
region | 4711 | ('Centralia,_Illinois', 4) | 1.01 | ('La_Châtaigneraie', 1) | ('Aachen', 1)
residence?inv | 5283 | ('New_York_City', 703) | 4.4 | ('Chonburi_(city)', 1) | ('Kiyosu', 1)
routeEnd?inv | 4079 | ('Chicago', 33) | 1.46 | ('Nongstoin', 1) | ('Sleepy_Hollow,_New_York', 1)
routeJunction?inv | 5301 | ('Chicago', 23) | 1.94 | ('Smithville,_Tennessee', 1) | ('Sleepy_Hollow,_New_York', 1)
routeStart?inv | 4136 | ('Chicago', 16) | 1.53 | ('Hilliard,_Florida', 1) | ('Dana_Point,_California', 1)
timeZone | 28615 | ('Jerusalem', 4) | 1.35 | ('Bhikhi,_Punjab', 1) | ('Ames,_Iowa', 1)
type | 21497 | ('Kandi,_Benin', 3) | 1.02 | ('Jericho,_Vermont', 1) | ('Ankara', 1)
utcOffset | 27762 | ('Lőrinci', 3) | 1.74 | ('Marietta,_Ohio', 2) | ('Algiers', 1)
22-rdf-syntax-ns#type | 48868 | ('Okny', 1) | 1.0 | ('Labarthe-Rivière', 1) | ('Ankara', 1)
rdf-schema#seeAlso | 3426 | ('Dubai', 19) | 2.69 | ('Coventry,_Rhode_Island', 3) | ('Bishkek', 1)
homepage | 20319 | ('Pune', 4) | 1.0 | ('Nueva_Loja', 1) | ('Amsterdam', 1)
name | 45193 | ('Songtao_Miao_Autonomous_County', 5) | 1.19 | ('Castillejo_de_Martín_Viejo', 1) | ('Ankara', 1)
nick | 2764 | ('Davao_City', 10) | 1.1 | ('Rotterdam_(town),_New_York', 1) | ('Ankara', 1)

There are multiple things we can see here. First, no relation has a minimum greater than 1. That is good because it means no relation is inherently multiple. We can also see that the inverted relations have substantive maximums, which makes sense: cities such as London will be the birthplaces of thousands of Wikipedia-worthy people. Clearly the most important signal here is the number of such inbound relations rather than the actual entities being referred to, but we will leave that to be brought out by feature selection techniques such as the ones discussed in Chapter 4. Some of the averages are also pretty high, but they seem to be driven by a few megacities. The medians are much more informative, with only a few beyond one. I will thus focus on the medians and extract the file "ch6_cell12_rels_median_fan_outs.tsv" (Cell #12).
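As a quick aside on why the median is the more robust summary here, consider a toy relation (the numbers are made up for illustration, not taken from the data) where a single London-sized hub drags the mean far above the typical city while the median stays put:

```python
import statistics

# toy fan-out counts: 99 typical cities with one value each, one London-like hub
fan_outs = [1] * 99 + [1562]

print(statistics.mean(fan_outs))    # dragged up to 16.61 by the single hub
print(statistics.median(fan_outs))  # stays at 1, the typical fan-out
```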

In [12]:
# CELL 12
import re
from collections import OrderedDict

triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

rel_count_per_city = OrderedDict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if verb not in rel_count_per_city:
            rel_count_per_city[verb] = OrderedDict()
        rel_count_per_city[verb][subj] = rel_count_per_city[verb].get(subj, 0) + 1

# sort and compute min, max, average and median
with open("ch6_cell12_rels_median_fan_outs.tsv", "w") as medians:
    for rel in sorted(rel_count_per_city.keys()):
        city_counts = sorted(rel_count_per_city[rel].items(), key=lambda t:t[1])
        medians.write("{}\t{}\n".format(rel,city_counts[int(len(city_counts) / 2)][1]))

Base Featurization

We are now almost ready to produce a base featurization. All we are missing are global counts for all literals and entities that are targets of the relations, which we use to sort and select the top values according to the median fan-out computed in Cell #12. To these features I will also add the total count per relation and a grand total over all relevant relations. Missing features are marked with the string "NA". Adding the population produces the file "ch6_cell13_dev_feat0.tsv", the first full featurization of the dataset (Cell #13).
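The core selection step of that featurization, keeping each relation's globally most popular values up to its median fan-out and padding with "NA", can be sketched in isolation as follows (the popularity table and fan-out here are toy values, not from the data):

```python
def top_values(values, popularity, fan_out):
    """Keep the fan_out globally most popular values, padding with 'NA'."""
    kept = sorted(values, key=lambda v: popularity.get(v, 0), reverse=True)[:fan_out]
    return kept + ['NA'] * (fan_out - len(kept))

# toy global popularity counts
popularity = {'London': 900, 'Paris': 450, 'Aalen': 3}
print(top_values(['Aalen', 'London'], popularity, 3))
# ['London', 'Aalen', 'NA']
```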

In [13]:
# CELL 13
import re
from collections import OrderedDict

triple_re = re.compile(r"(<[^>]+>)\s(<[^>]+>)\s(.*) \.")

# compute counts for ordering values
popularity = dict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        popularity[obj] = popularity.get(obj, 0) + 1

# fan outs
fan_outs = dict()
with open("ch6_cell12_rels_median_fan_outs.tsv") as medians:
    for line in medians:
        rel, fan_out = line.strip().split("\t")
        fan_outs[rel] = int(fan_out)
rels = sorted(fan_outs.keys())
        
# population
pops = dict()
with open("ch6_cell7_dev_pop.tsv") as dev_pop:
    for line in dev_pop:
        city, pop = line.strip().split("\t")
        pops[city] = float(pop)
        
# full graph
graph = OrderedDict()
with open("ch6_cell10_cities1000_dev_filtered.ttl") as base:
    for line in base:
        subj, verb, obj = triple_re.match(line).groups()
        if verb in fan_outs:
            if subj not in graph:
                graph[subj] = OrderedDict()
            if verb not in graph[subj]:
                graph[subj][verb] = list()
            graph[subj][verb].append(obj)

# putting it all together
rows_written = 0
with open("ch6_cell13_dev_feat0.tsv", "w") as feats:
    # header
    feats.write("name\trel#count")
    rel_start = dict()
    idx = 2
    for rel in rels:
        fan_out = fan_outs[rel]
        rel_start[rel] = idx
        rel = rel[1:-1]
        feats.write("\t{}#count".format(rel))
        for i in range(fan_out):
            feats.write("\t{}#{}".format(rel, i+1))
        idx += fan_out + 1
    feats.write("\tpopulation\n")
    rows_written += 1
    
    # data
    for city, data in graph.items():
        feats.write("{}\t{}".format(city, sum(map(len,data.values()))))
        for rel in rels:
            fan_out = fan_outs[rel]
            if rel not in data:
                feats.write("\tNA" * (fan_out + 1))
            else:
                feats.write("\t{}".format(len(data[rel])))
                values = sorted(data[rel], key=lambda o:popularity[o], reverse=True)[:fan_out]
                feats.write("\t" + "\t".join(values))
                written = len(values)
                while written < fan_out:
                    feats.write("\tNA")
                    written += 1
        feats.write("\t{}\n".format(pops[city]))
        rows_written += 1
print("Rows written (including header): {:,}".format(rows_written))
Rows written (including header): 48,869

We are now at the point where we can start thinking about the ML algorithm we will employ over this data. It contains almost 100 columns of different types and a numeric target. Let us start by plotting the target and its distribution (Cell #14).

In [14]:
# CELL 14
import matplotlib.pyplot as plt
%matplotlib inline

pops = list()
with open("ch6_cell13_dev_feat0.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        pops.append(float(line.strip().split("\t")[-1]) / 1e6)  # population in millions
pops = sorted(pops)

plt.plot(pops, label="population", color='gray')
plt.ylabel('population in millions')
plt.savefig("ch6_cell14_population.pdf", bbox_inches='tight', dpi=300)
plt
Out[14]:
<module 'matplotlib.pyplot' from '/vol/bulk/work/feateng/finaltest/chapter6/feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

That curve is quite steep at the end. To flatten it down, we can apply a logarithmic function, as it helps put such growth processes in perspective, as discussed in Chapter 3. Instead of the default logarithm in base $e$ I will use the more intuitive base 10, which tells us the number of digits in the population of a city. Cell #15 shows the result.
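This digit-counting intuition is easy to verify with a few populations (arbitrary example numbers, not notebook data):

```python
import math

# the integer part of log10 plus one equals the number of digits
for pop in (1_000, 55_013, 8_918_652):
    digits = int(math.log10(pop)) + 1
    print(pop, round(math.log10(pop), 2), digits, len(str(pop)))
```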

In [15]:
# CELL 15
import math
import matplotlib.pyplot as plt
%matplotlib inline

pops = list()
with open("ch6_cell13_dev_feat0.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        pop = float(line.strip().split("\t")[-1])
        if pop == 0:
            pop = 1
        pops.append(math.log(pop, 10))
pops = sorted(pops)

plt.plot(pops, label="log population", color='gray')
plt.ylabel('log population')
plt.savefig("ch6_cell15_log_population.pdf", bbox_inches='tight', dpi=300)
plt
Out[15]:
<module 'matplotlib.pyplot' from '/vol/bulk/work/feateng/finaltest/chapter6/feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

The first thing to notice in this figure is that there seem to be many cities in the data set with fewer than 1,000 inhabitants. Some even with only 1 person! Looking a little bit into the triples, I can see that some are actual mistakes in Wikipedia (a city with 1,000,000 people got typed as 1.000.000, etc.). Others are small towns and hamlets with fewer than 1,000 people (though they might have had more than 1,000 at the time GeoNames built their list). Following the discussion on outliers in Chapter 2, these are outliers that are worth removing from the data. I will still leave them in the held-out final test data, as their impact in production needs to be seen and evaluated.

We thus remove these outliers and produce a filtered file "ch6_cell16_dev_feat0_filtered.tsv" (Cell #16).

In [16]:
# CELL 16
written = 0
with open("ch6_cell16_dev_feat0_filtered.tsv", "w") as filtered:
    with open("ch6_cell13_dev_feat0.tsv") as feats:
        filtered.write(next(feats))
        for line in feats:
            pop = float(line.strip().split("\t")[-1])
            if pop >= 1000:
                filtered.write(line)
                written += 1
print("Rows written (excluding the header): {:,}".format(written))
Rows written (excluding the header): 44,959

So we did not throw away much data with this change, good.

Now we can try some learning for EDA. Let us try some regression models against the single feature "number of relations defined", starting with an SVR with an RBF kernel (Cell #17).

In [17]:
# CELL 17
import random
import math
from sklearn.svm import SVR
import numpy as np

rand = random.Random(42)
train_data = list()
test_data = list()
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        fields = line.strip().split("\t")
        pop = float(fields[-1])
        rels = float(fields[1])
        row = (math.log(rels, 10), math.log(pop, 10))
        # note: this split sends ~80% of the rows to test and trains on only ~20%,
        # to keep the single-feature SVR fast
        if rand.random() < 0.8:
            test_data.append(row)
        else:
            train_data.append(row)
               
test_data = sorted(test_data, key=lambda t:t[1])

train_data = np.array(train_data)
test_data = np.array(test_data)

xtrain = train_data[:,0]
ytrain = train_data[:,1]
xtest = test_data[:,0]
ytest = test_data[:,1]

# SVRs need scaling
xtrain_min = xtrain.min(); xtrain_max = xtrain.max()
xtrain_scaling = 1.0 / (xtrain_max - xtrain_min)
xtrain -= xtrain_min
xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min
ytrain *= ytrain_scaling

xtest -= xtrain_min
xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min
ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))
best_c = 100.0
best_epsilon = 0.1
PARAM_DO_GRID_SEARCH = False
      
if PARAM_DO_GRID_SEARCH:
    best_rmse = 1000
    for c in [0.01, 0.1, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]:
        svr_rbf = SVR(epsilon=0.05, C=c, gamma='auto')
        svr_rbf.fit(xtrain.reshape(-1,1), ytrain)
        ytest_pred = svr_rbf.predict(xtest.reshape(-1,1))
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("C", c, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_c = c
            best_rmse = RMSE

    print("Best C", best_c,"best RMSE", best_rmse)        

    best_rmse = 1000
    for epsilon in [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0, 10.0, 100.0]:
        svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
        svr_rbf.fit(xtrain.reshape(-1,1), ytrain)
        ytest_pred = svr_rbf.predict(xtest.reshape(-1,1))
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("Epsilon", epsilon, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_epsilon = epsilon
            best_rmse = RMSE

    print("Best epsilon", best_epsilon,"best RMSE", best_rmse)
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain.reshape(-1,1), ytrain)
ytest_pred = svr_rbf.predict(xtest.reshape(-1,1))
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('log population')
plt.savefig("ch6_cell17_svr_one_feat.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 8,963 cities
RMSE 0.617989060494947
Out[17]:
<matplotlib.legend.Legend at 0x7f92c6cdb5d0>

While the RMSE is not that bad for only one feature (less than one order of magnitude), and the flat bottom of the figure can be explained by the fact that the SVR has only a single support vector (from its single feature), the figure still does not seem to track the curve that well. To see whether this is an issue with the SVR or with the feature, I plotted a 20% sample of the total number of relations against the log of the population in Cell #18.

In [18]:
# CELL 18
import math
import random

pops = list()
num_rels = list()
rand = random.Random(42)
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    next(feats) # skip header
    for line in feats:
        fields = line.strip().split("\t")
        pop = float(fields[-1])
        rels = float(fields[1])
        if rand.random() < 0.2:
            pops.append(math.log(pop, 10))
            num_rels.append(math.log(rels, 10))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(num_rels, pops, '.', color='gray')
plt.ylabel('log population')
plt.xlabel('log number of relations')
plt.savefig("ch6_cell18_logpop_vs_log_items.pdf", bbox_inches='tight', dpi=300)
plt
Out[18]:
<module 'matplotlib.pyplot' from '/vol/bulk/work/feateng/finaltest/chapter6/feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

It seems the correlation is there, but it is weak, and the SVR is extracting about as much information from it as there is. I thus settle on SVR, and we are ready for our first feature set.

First Featurization

As SVR only takes numeric features, the first step is to typify the raw columns we have and see which ones are quasi-numeric (that is, numeric 80% of the time) and whether the ones that are not have a small number (100 or fewer) of unique values. This typification is in Cell #19.
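The quasi-numeric test amounts to attempting a float conversion on each non-missing value and comparing the success rate against the 80% threshold. A minimal standalone sketch (the example columns are made up):

```python
def is_quasi_numeric(values, threshold=0.8):
    """True if more than `threshold` of the non-NA values parse as floats."""
    non_na = [v for v in values if v != 'NA']
    numeric = 0
    for v in non_na:
        try:
            float(v)
            numeric += 1
        except ValueError:
            pass
    return bool(non_na) and numeric > len(non_na) * threshold

print(is_quasi_numeric(['12.5', '7', '3.1', '9', '2', 'unknown']))  # 5 of 6 parse
print(is_quasi_numeric(['red', 'blue', '3']))                       # only 1 of 3
```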

In [19]:
# CELL 19
types = dict()
data = list()
header = None
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    header = next(feats).strip().split('\t')
    for line in feats:
        data.append( line.strip().split('\t'))

# numeric columns
num_cols = set()
for idx in range(len(header)):
    total_non_na = 0
    is_num = 0
    for row in data:
        if row[idx] == 'NA':
            continue
        total_non_na += 1
        row_val = row[idx]
        if '"^^' in row_val:
            row_val = row_val[1:row_val.index('"^^')]
        try:
            float(row_val)
            is_num += 1
        except ValueError:
            pass
    if is_num > total_non_na * 0.8:
        num_cols.add(idx)
        
print("Total cols", len(header))
print("Numeric cols", len(num_cols))

# differentiate categorical from free text, up to 100 categories
cat_cols = dict()
others = list()
for idx in range(len(header)):
    if idx in num_cols:
        continue
    values = set()
    for row in data:
        if row[idx] == 'NA':
            continue
        values.add(row[idx])
    if len(values) < 100:
        cat_cols[idx] = values
    else:
        others.append( (header[idx], len(values)) )

print("Categorical cols")
for col in sorted(cat_cols.keys()):
        print("   " + header[col])
from IPython.display import HTML, display
display(HTML("<table><tr><th>Rel</th><th>Counts</th></tr>" +
            "\n".join(list(map(lambda r: "<tr><td>{}</td><td>{}</td></tr>".format(r[0].split("/")[-1], r[1]), 
                               others))) +"</table>"))
Total cols 99
Numeric cols 54
Categorical cols
   http://dbpedia.org/ontology/utcOffset#1
   http://dbpedia.org/ontology/utcOffset#2
   http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1
Rel	Counts
name	44959
areaCode#1	5667
birthPlace?inv#1	20183
birthPlace?inv#2	13748
birthPlace?inv#3	10523
city?inv#1	11381
city?inv#2	6884
country#1	253
deathPlace?inv#1	9848
deathPlace?inv#2	5215
department#1	172
district#1	1182
foundingDate#1	2649
ground?inv#1	3284
headquarter?inv#1	2671
hometown?inv#1	4030
inseeCode#1	3855
isPartOf#1	2761
isPartOf#2	7640
isPartOf?inv#1	3331
leaderName#1	1839
leaderTitle#1	454
location?inv#1	13017
location?inv#2	7984
locationCity?inv#1	2704
motto#1	2587
nearestCity?inv#1	3534
populationAsOf#1	259
postalCode#1	25603
region#1	404
residence?inv#1	5033
routeEnd?inv#1	3968
routeJunction?inv#1	2714
routeStart?inv#1	4019
timeZone#1	178
type#1	582
rdf-schema#seeAlso#1	896
rdf-schema#seeAlso#2	692
rdf-schema#seeAlso#3	423
homepage#1	18912
name#1	40080
nick#1	2601

Therefore, of the 99 columns, 54 are numeric, but only 3 of the rest can be captured with fewer than 100 unique values. As we can expect more of them to be quasi-categorical, I employ a technique for category discretization discussed in Chapter 2, introducing an OTHER category to handle the rest of the values when they account for less than 20% of the total number of observations (Cell #20).

I also drop constant columns (which are uninformative), netting 50 numeric columns and 11 categorical ones. The remaining columns have too much variety to be discretized this way; they will require text processing, or they are just uninformative for this approach. The selected categorical relations are shown with their total number of categories. The types go to "ch6_cell20_feat0_types.tsv".
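The discretization itself, keeping the most frequent values until they cover 80% of the observations and lumping everything else into OTHER, can be sketched in a simplified standalone form (toy observations; Cell #20 additionally caps the number of categories at 100):

```python
from collections import Counter

def discretize(observations, coverage=0.8, max_cats=100):
    """Top values covering `coverage` of the observations, plus an OTHER bucket."""
    counts = Counter(observations)
    total = len(observations)
    selected, covered = [], 0
    for value, count in counts.most_common():
        if covered / total >= coverage or len(selected) >= max_cats:
            break
        selected.append(value)
        covered += count
    return selected + ['OTHER']

# toy time-zone-like column: two dominant values cover 80% of the observations
obs = ['CET'] * 50 + ['EST'] * 30 + ['PST'] * 15 + ['X', 'Y', 'Z', 'W', 'V']
print(discretize(obs))
# ['CET', 'EST', 'OTHER']
```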

In [20]:
# CELL 20
from collections import OrderedDict

types = dict()
data = list()
header = None
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    header = next(feats).strip().split('\t')
    for line in feats:
        data.append( line.strip().split('\t'))

# constant columns
const_cols = set()

for idx in range(len(header)):
    value = None
    constant = True
    for row in data:
        if row[idx] == 'NA':
            continue
        if value is None:
            value = row[idx]
        elif value != row[idx]:
            constant = False
            break
    if constant:
        const_cols.add(idx)

# integer columns
num_cols = set()
for idx in range(len(header)):
    if idx in const_cols:
        continue
    total_non_na = 0
    is_num = 0
    for row in data:
        if row[idx] == 'NA':
            continue
        total_non_na += 1
        row_val = row[idx]
        if '"^^' in row_val:
            row_val = row_val[1:row_val.index('"^^')]
        try:
            float(row_val)
            is_num += 1
        except ValueError:
            pass
    if is_num > total_non_na * 0.8:
        num_cols.add(idx)
        
print("Total cols", len(header))
print("Constant cols", len(const_cols))
print("Numeric cols", len(num_cols))

# differentiate categorical from free text, up to 100 categories, including "other"
cat_cols = dict()
others = list()
for idx in range(len(header)):
    if idx in num_cols or idx in const_cols:
        continue
    values = OrderedDict()
    total_obs = 0
    for row in data:
        if row[idx] == 'NA':
            continue
        values[row[idx]] = values.get(row[idx], 0) + 1
        total_obs += 1
    values = sorted(values.items(), key=lambda x:x[1])
    total_values = len(values)
    selected_values = list()
    selected_obs = 0
    while selected_obs * 1.0 / total_obs < 0.8 and len(selected_values) < 100:
        value, count = values.pop()
        selected_values.append(value)
        selected_obs += count
    if len(selected_values) < 100:
        selected_values.append('OTHER')
        cat_cols[idx] = selected_values
    else:
        others.append( (header[idx], total_values, int(selected_obs * 1.0 / total_obs * 10000)/100) )
        
with open("ch6_cell20_feat0_types.tsv", "w") as types:
    for idx in range(len(header)):
        types.write("{}\t{}".format(idx, header[idx]))
        if idx in num_cols:
            types.write("\tNUMERIC\n")
        elif idx in cat_cols:
            types.write("\t" + "\t".join(cat_cols[idx]) + "\n")
        else:
            types.write("\tIGNORE\n")
        
table1 = ("<table><tr><th>Rel</th><th>Categories</th></tr>" +
   "\n".join(list(map(lambda x: "<tr><td>{}</td><td>{}</td></tr>".format(header[x].split("/")[-1], 
                                                                         len(cat_cols[x])), cat_cols.keys()))) +
    "</table>")
table2 = ("<table><tr><th>Rel</th><th>Counts</th><th>% Covered @ 100</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(r[0].split("/")[-1], 
                                                                                   r[1], r[2]), 
                               others))) +"</table>")
from IPython.display import HTML, display
display(HTML("<h3>Categorical cols ({})</h3>".format(len(cat_cols)) + 
             table1 + "<h3>Remaining cols ({})</h3>".format(len(others)) + table2))
Total cols 99
Constant cols 4
Numeric cols 50

Categorical cols (11)

Rel	Categories
country#1	21
department#1	59
leaderTitle#1	4
populationAsOf#1	15
region#1	100
timeZone#1	14
type#1	29
utcOffset#1	7
utcOffset#2	6
22-rdf-syntax-ns#type#1	3
rdf-schema#seeAlso#3	73

Remaining cols (34)

Rel	Counts	% Covered @ 100
name	44959	0.22
areaCode#1	5667	26.65
birthPlace?inv#1	20183	2.07
birthPlace?inv#2	13748	1.0
birthPlace?inv#3	10523	1.0
city?inv#1	11381	5.1
city?inv#2	6884	3.32
deathPlace?inv#1	9848	1.52
deathPlace?inv#2	5215	1.99
district#1	1182	31.45
foundingDate#1	2649	10.33
ground?inv#1	3284	4.09
headquarter?inv#1	2671	6.47
hometown?inv#1	4030	4.35
inseeCode#1	3855	2.79
isPartOf#1	2761	56.1
isPartOf#2	7640	15.33
isPartOf?inv#1	3331	11.27
leaderName#1	1839	39.47
location?inv#1	13017	3.38
location?inv#2	7984	3.06
locationCity?inv#1	2704	7.36
motto#1	2587	6.18
nearestCity?inv#1	3534	11.07
postalCode#1	25603	3.27
residence?inv#1	5033	4.76
routeEnd?inv#1	3968	3.49
routeJunction?inv#1	2714	14.05
routeStart?inv#1	4019	3.09
rdf-schema#seeAlso#1	896	74.52
rdf-schema#seeAlso#2	692	76.44
homepage#1	18912	2.89
name#1	40080	1.1
nick#1	2601	7.98

I now proceed to perform a one-hot encoding of the categories (cf. Chapter 3), imputing missing values to OTHER for categorical columns (cf. Chapter 2) and to 0.0 for numeric ones, into the file "ch6_cell21_dev_feat1.tsv" (Cell #21). The resulting vector has 380 features plus the target variable.
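A one-hot column with OTHER imputation reduces to a small helper like the following sketch (the category list is a toy example, not from the types file):

```python
def one_hot(value, categories):
    """Encode `value` over `categories`; missing or unseen values map to OTHER."""
    if value == 'NA' or value not in categories:
        value = 'OTHER'
    return [1.0 if cat == value else 0.0 for cat in categories]

cats = ['"UTC+1"', '"UTC+2"', 'OTHER']
print(one_hot('"UTC+2"', cats))  # [0.0, 1.0, 0.0]
print(one_hot('NA', cats))       # [0.0, 0.0, 1.0]
```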

In [21]:
# CELL 21
from collections import OrderedDict

types = list(map(lambda line: line.strip().split("\t"), 
                 open("ch6_cell20_feat0_types.tsv").readlines()))
header = None
data = list()
with open("ch6_cell16_dev_feat0_filtered.tsv") as feats:
    header = next(feats).strip().split('\t')
    for line in feats:
        data.append( line.strip().split('\t'))

total_size = 0
cat_to_pos = dict()
for type_line in types:
    if type_line[2] == 'NUMERIC':
        total_size += 1
    elif type_line[2] != 'IGNORE':
        cat_to_pos[int(type_line[0])] = dict(zip(type_line[2:], range(len(type_line) - 2)))
        total_size += len(type_line) - 2
    
with open("ch6_cell21_dev_feat1.tsv", "w") as feat1:
    feat1.write('name') # add name for error analysis
    for type_line in types:
        if type_line[2] == 'NUMERIC':
            feat1.write('\t' + type_line[1])
        elif type_line[2] != 'IGNORE':
            for value in type_line[2:]:
                feat1.write('\t{}@{}'.format(type_line[1], value))
    feat1.write('\n')
    for row in data:
        v = ['NA',] * (total_size + 1)
        v[0] = row[0] # add name for error analysis
        current = 1
        for type_line in types:
            idx = int(type_line[0])
            value = row[idx]
            if type_line[2] == 'NUMERIC':
                if '"^^' in value:
                    value = value[1:value.index('"^^')]
                try:
                    v[current] = str(float(value))
                except ValueError:
                    v[current] = '0.0'
                current += 1
            elif type_line[2] != 'IGNORE':
                pos = None
                if value == 'NA' or value not in cat_to_pos[idx]:
                    pos = cat_to_pos[idx]['OTHER']
                else:
                    pos = cat_to_pos[idx][value]
                for idx2 in range(len(type_line) - 2):
                    if idx2 == pos:
                        v[current] = "1.0"
                    else:
                        v[current] = "0.0"
                    current += 1
        feat1.write('\t'.join(v) + "\n")        
print("Vector size", total_size)
Vector size 381

This data set is ready to be used with SVR. Doing a grid search for its parameters and saving data for error analysis (Cell #22), we obtain an RMSE of 0.41 and the curve below.

In [22]:
# CELL 22
import random
import math
from sklearn.svm import SVR
import numpy as np

rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch6_cell21_dev_feat1.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    for line in feats:
        fields = line.strip().split("\t")
        pop = float(fields[-1])
        name = fields[0]
        row_feats = list(map(float, fields[1:-1]))  # avoid shadowing the file handle
        row = (row_feats, math.log(pop, 10), name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)
               
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))

best_c = 100.0
best_epsilon = 0.05

PARAM_DO_GRID_SEARCH = False
if PARAM_DO_GRID_SEARCH:
    best_rmse = 1000
    for c in [0.01, 0.1, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]:
        svr_rbf = SVR(epsilon=0.05, C=c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("C", c, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_c = c
            best_rmse = RMSE

    print("Best C", best_c,"best RMSE", best_rmse)        

    best_rmse = 1000
    for epsilon in [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0, 10.0, 100.0]:
        svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("Epsilon", epsilon, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_epsilon = epsilon
            best_rmse = RMSE

    print("Best epsilon", best_epsilon,"best RMSE", best_rmse)
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch6_cell22_feat1_svr.npz", 
                    xtrain=xtrain, ytrain=ytrain, 
                    xtest=xtest, ytest=ytest,
                    ytest_orig=ytest_orig,
                    ytest_pred=ytest_pred,
                    ytrain_min=ytrain_min,
                    xtrain_scaling=xtrain_scaling,
                    ytrain_scaling=ytrain_scaling)
with open("ch6_cell22_test_names.tsv", "w") as names:
    for idx, name in enumerate(test_names):
        names.write("{}\t{}\n".format(idx, name))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual",    color='black')
plt.ylabel('log population')
plt.savefig("ch6_cell22_svr_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.41176199300264366
Writing files for error analysis
Out[22]:
<matplotlib.legend.Legend at 0x7f92c6c90bd0>

This figure and RMSE look better than those for the single feature that preceded it, but the curve has some strange spikes. Let us perform some error analysis on them.

Error Analysis

Using the training data and evaluation results from Cell #22, I can now do an error drilldown on the spike at the end by feature ablation in Cell #23.

In [23]:
# CELL 23
import numpy as np

arrays = np.load("ch6_cell22_feat1_svr.npz")
ytest_orig = arrays['ytest_orig']
ytest_pred = arrays['ytest_pred']
test_names = list(map(lambda line:line.strip().split('\t')[-1], 
                      open("ch6_cell22_test_names.tsv").readlines()))

# find the top 10 biggest contributors to the log error
sqe = (ytest_orig - ytest_pred)**2
contributors = list()
while len(contributors) < 10:
    idx = sqe.argmax()
    contributors.append( (idx, (ytest_pred - ytest_orig)[idx]) )
    sqe[idx] = 0.0

print("Top contributors to the error")
for idx, error in contributors:
    print("{:<4}  {:<21} {:>5}    {:>10,}   {:>17,}".
          format(idx, test_names[idx][:-1].split("/")[-1], int(error*100)/100, 
          int(10**ytest_orig[idx]), int(10**ytest_pred[idx])))                 
Top contributors to the error
8825  Dublin                  6.5       553,164   1,751,459,207,216
8726  Saint-Prex            -2.09       312,010               2,485
8705  Meads,_Kentucky       -2.08       288,648               2,351
7939  Nicosia                2.03        55,013           5,931,549
8921  Hofuf                 -1.94     1,500,000              17,145
8985  Mexico_City            1.93     8,918,652         761,989,297
8689  Montana_City,_Montana -1.88       271,529               3,513
8902  Pikine                -1.82     1,170,790              17,470
8791  Edinburgh              1.76       464,989          26,976,492
172   Isabela,_Basilan       1.74         1,081              60,349

From the table we can see that cities that have a rich history but are not necessarily that populated get overshot by the system. That makes sense.

Interestingly, Saint-Prex actually has about 5,000 people; the 300,000 number is an extraction mistake. The population of Meads, Kentucky is listed on Wikipedia as 280,000, but it is being flagged as an error because 280,000 is the size of the full region; the town itself definitely has fewer than 5,000 people.

Nicosia is a history-rich city that has been populated for 4,500 years. Sadly, it is a divided city, and the population of one of its divisions is about 55,000 people. The joint city would total about 300,000 inhabitants, but it certainly has the history of a city of 6,000,000.

Therefore we have history-rich, underpopulated cities and extraction errors. From here it makes sense to focus on the real (not logarithmic) error where the system is overpredicting (Cell #24).

In [24]:
# CELL 24
import numpy as np

arrays = np.load("ch6_cell22_feat1_svr.npz")
xtrain = arrays['xtrain']
ytrain = arrays['ytrain'] 
xtest = arrays['xtest']
ytest = arrays['ytest']
ytest_orig = arrays['ytest_orig']
ytest_pred = arrays['ytest_pred']
test_names = list(map(lambda line:line.strip().split('\t')[-1], 
                      open("ch6_cell22_test_names.tsv").readlines()))

# find the top 10 biggest contributors to the actual error
real_error = (10**ytest_pred - 10**ytest_orig)
contributors = list()
cumulative_error = 0
while len(contributors) < 10:
    idx = real_error.argmax()
    contributors.append( (idx, (ytest_pred - ytest_orig)[idx]) )
    cumulative_error += ytest_pred[idx] - ytest_orig[idx]
    real_error[idx] = 0.0

print("Top contributors to the error")
print("Cumulative error {:,}".format(cumulative_error))
for idx, error in contributors:
    print("{:<4}  {:<21} {:>5}    {:>10,}   {:>17,}".
          format(idx, test_names[idx][:-1].split("/")[-1], int(error*100)/100, 
          int(10**ytest_orig[idx]), int(10**ytest_pred[idx])))
Top contributors to the error
Cumulative error 17.634868409695972
8825  Dublin                  6.5       553,164   1,751,459,207,216
8985  Mexico_City            1.93     8,918,652         761,989,297
8791  Edinburgh              1.76       464,989          26,976,492
8959  Algiers                0.83     3,415,810          23,294,353
8910  Prague                 0.88     1,267,448           9,823,330
8915  Milan                   0.8     1,359,904           8,636,483
8873  Amsterdam              0.98       842,342           8,094,965
8823  Lisbon                 1.15       545,245           7,776,212
8924  Tabriz                 0.74     1,549,452           8,586,162
7939  Nicosia                2.03        55,013           5,931,549

This list looks quite good: these cities contribute a lot to the error and fall into the category of errors we have found the system to be making. Let's do an ablation study (wrapper) to see if we can find which features are hurting us here. In this study we'll throw away one feature at a time and see whether doing so improves the cumulative error for these 10 cities. The amount of computation is very big, so we'll work on a 5% sample of the original data. Also, because Dublin is so markedly overshot, I'll also keep track of which feature, when dropped, best improves each city. Cell #25.
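The wrapper loop itself is simple. As a minimal sketch of the idea on synthetic data (hypothetical features, not the city features; unlike the cell that follows, this version zeroes the feature in both train and test), ablating an informative feature hurts the held-out error the most, while ablating a useless one barely moves it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic regression: only the first two features carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=400)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

def heldout_rmse(Xtrain, Xtest):
    pred = LinearRegression().fit(Xtrain, ytr).predict(Xtest)
    return float(np.sqrt(np.mean((yte - pred) ** 2)))

baseline = heldout_rmse(Xtr, Xte)
ablated = []
for feat in range(X.shape[1]):
    Xtr_a, Xte_a = Xtr.copy(), Xte.copy()
    Xtr_a[:, feat] = 0.0  # nuke the feature everywhere
    Xte_a[:, feat] = 0.0
    ablated.append(heldout_rmse(Xtr_a, Xte_a))
print("baseline", round(baseline, 3), "ablated", [round(a, 3) for a in ablated])
```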

In [25]:
# CELL 25
import random
from sklearn.svm import SVR
import numpy as np

arrays = np.load("ch6_cell22_feat1_svr.npz")
xtrain = arrays['xtrain']
ytrain = arrays['ytrain']
ytrain_scaling = arrays['ytrain_scaling']
ytrain_min = arrays['ytrain_min']
xtest = arrays['xtest']
ytest = arrays['ytest']
ytest_orig = arrays['ytest_orig']
ytest_pred = arrays['ytest_pred']
test_names = list(map(lambda line:line.strip().split('\t')[-1], 
                      open("ch6_cell22_test_names.tsv").readlines()))
header = None
with open("ch6_cell21_dev_feat1.tsv") as feats:
    header = next(feats)
header = header.strip().split('\t')
header.pop(0) # drop name

# find the top 10 biggest contributors to the actual (real-space) error
real_error = (10**ytest_pred - 10**ytest_orig)
contributors = list()
cumulative_error = 0
xtest_contr = np.zeros( (10, xtest.shape[1]) )
ytest_contr = np.zeros( (10,) )
while len(contributors) < 10:
    idx = real_error.argmax()
    ytest_contr[len(contributors)] = ytest_orig[idx]
    xtest_contr[len(contributors),:] = xtest[idx,:]
    contributors.append( (idx, (ytest_pred - ytest_orig)[idx]) )
    cumulative_error += real_error[idx]
    real_error[idx] = 0.0

best_epsilon = 0.05
best_c = 100.0
# this is a lot of computation, work on sample of the original data
PARAM_SAMPLE_PERC=0.05
rand = random.Random(42)
selected = rand.sample(list(range(len(xtrain))), int(len(xtrain) * PARAM_SAMPLE_PERC))
xtrain_samp = xtrain[selected,:]
ytrain_samp = ytrain[selected]

# baseline run
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain_samp, ytrain_samp)
ytest_pred = svr_rbf.predict(xtest_contr)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
error = ytest_pred - ytest_contr # absolute log error
cumulative = sum(error)
baseline_cumulative = cumulative
per_city_baseline = error

best_cumulative = 100000.00
best_cumulative_ablate = None
per_city_best = [10000.00,] * 10
per_city_best_ablate = [None,] * 10

for feat in range(xtrain.shape[1]):
    if feat % int(xtrain.shape[1] / 10) == 0:
        print("{:>3} out of {:>3}".format(feat, xtrain.shape[1]))
    train_col = xtrain_samp[:,feat].copy()
    xtrain_samp[:,feat] = 0.0 # nuke feature
    svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
    svr_rbf.fit(xtrain_samp, ytrain_samp)
    ytest_pred = svr_rbf.predict(xtest_contr)
    ytest_pred *= 1.0/ytrain_scaling
    ytest_pred += ytrain_min
    error = ytest_pred - ytest_contr
    cumulative = sum(error)
    if best_cumulative_ablate is None or best_cumulative > cumulative:
        best_cumulative_ablate = feat
        best_cumulative = cumulative
    for idx in range(10):
        if per_city_best_ablate[idx] is None or per_city_best[idx] > error[idx]:
            per_city_best_ablate[idx] = feat
            per_city_best[idx] = error[idx]
    xtrain_samp[:,feat] = train_col # restore

print("Ablation study")
print("Best overall from {:3.2} to {:3.2} ({:.2%})".format(
    baseline_cumulative, best_cumulative,
    (baseline_cumulative-best_cumulative)/baseline_cumulative))
print("Feature to ablate ({}): {}\n\n".format(best_cumulative_ablate, header[best_cumulative_ablate]))
for idx in range(10):
    print("For city {} from {:3.2} to {:3.2} ({:.2%})".format(
        test_names[contributors[idx][0]][:-1].split("/")[-1],
        per_city_baseline[idx], per_city_best[idx],
        (per_city_baseline[idx]-per_city_best[idx])/per_city_baseline[idx]))
    print("Feature to ablate ({}): {}\n".format(per_city_best_ablate[idx], header[per_city_best_ablate[idx]]))
  0 out of 380
 38 out of 380
 76 out of 380
114 out of 380
152 out of 380
190 out of 380
228 out of 380
266 out of 380
304 out of 380
342 out of 380
Ablation study
Best overall from 0.28 to -2.4 (968.43%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count


For city Dublin from 1.2 to 0.48 (58.97%)
Feature to ablate (11): http://dbpedia.org/ontology/city?inv#count

For city Mexico_City from -0.061 to -0.4 (-561.55%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

For city Edinburgh from 0.43 to 0.19 (57.03%)
Feature to ablate (3): http://dbpedia.org/ontology/areaLand#count

For city Algiers from -0.92 to -1.1 (-22.87%)
Feature to ablate (378): http://xmlns.com/foaf/0.1/name#count

For city Prague from -0.18 to -0.82 (-341.93%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

For city Milan from 0.17 to -0.43 (347.88%)
Feature to ablate (307): http://www.w3.org/2000/01/rdf-schema#seeAlso#3@<http://dbpedia.org/resource/Italy>

For city Amsterdam from -0.53 to -0.7 (-31.85%)
Feature to ablate (98): http://dbpedia.org/ontology/ground?inv#count

For city Lisbon from 0.0019 to -0.16 (8666.74%)
Feature to ablate (323): http://www.w3.org/2000/01/rdf-schema#seeAlso#3@<http://dbpedia.org/resource/The_Republic_of_Ireland>

For city Tabriz from -0.2 to -0.97 (-376.48%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

For city Nicosia from 0.41 to 0.1 (75.64%)
Feature to ablate (303): http://www.w3.org/2000/01/rdf-schema#seeAlso#count

From the above results I can see that counts are misleading the algorithm. Here I am presented with two options: I can either drop them or dampen their growth by applying a squashing function to them. As I believe there's value in them, I'll dampen them by applying a log function (Cell #26).
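As a quick illustration of the squashing effect (a sketch, not part of the cell that follows): under log10(x + 1), a count one million times larger only becomes about six times larger, so extreme counts stop dominating the kernel distances.

```python
import math

# log10(x + 1) maps heavy-tailed counts onto a narrow, comparable scale;
# the +1 keeps a zero count at exactly 0.0
for count in (0, 9, 999, 999_999):
    print(count, math.log(count + 1, 10))
```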

In [26]:
# CELL 26
import math
import random
from sklearn.svm import SVR
import numpy as np

header = None
with open("ch6_cell26_dev_feat1_dampen.tsv", "w") as feats_damp:
    with open("ch6_cell21_dev_feat1.tsv") as feats:
        header = next(feats)
        feats_damp.write(header)
        header = header.strip().split("\t")
        header.pop(0) # name
        for line in feats:
            fields = line.strip().split("\t")
            pop = float(fields[-1])
            name = fields[0]
            feats = list(map(float,fields[1:-1]))
            pop = math.log(pop, 10)
            # dampen counts
            for idx, featname in enumerate(header):
                if "#count" in featname:
                    feats[idx] = math.log(feats[idx] + 1, 10)
            feats_damp.write("\t".join(map(str, [name] + feats + [pop])) + "\n")

rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        row = (feats, logpop, name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)
               
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))

best_c = 100.0
best_epsilon = 0.05
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch6_cell26_svr_feat1_dampen.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3556252788046216
Out[26]:
<matplotlib.legend.Legend at 0x7f9287312f90>

The RMSE is better than before, at 0.36, and the spikes have been reduced.

That is a nice improvement over the previous one (note that feature engineering does not always succeed and you might encounter dead ends). Of course, taking the log of those counts could have been done directly, based on the advice in Chapter 2.

Instead of continuing down this route of wrapper ablation, I'll now move to feature utility metrics, for which we need a categorical target. I'll thus discretize the target into segments with equal numbers of points. Cell #27 does the discretization and measures the discretization error at different numbers of splits.
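Equal-frequency discretization can also be sketched directly with NumPy quantiles (a simplified alternative to the recursive splitting in the cell that follows; the synthetic log-populations here are hypothetical):

```python
import numpy as np

# equal-frequency discretization: cut the target at its quantiles
# so every bin holds roughly the same number of points
rng = np.random.default_rng(42)
logpop = rng.normal(4.0, 0.7, size=10_000)  # synthetic log10 populations

n_bins = 4
edges = np.quantile(logpop, np.linspace(0, 1, n_bins + 1))
bins = np.clip(np.searchsorted(edges, logpop, side='right') - 1, 0, n_bins - 1)

# representing each bin by its median gives the discretization error
medians = np.array([np.median(logpop[bins == b]) for b in range(n_bins)])
rmse = float(np.sqrt(np.mean((logpop - medians[bins]) ** 2)))
print("bin counts:", np.bincount(bins), "RMSE:", rmse)
```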

In [27]:
# CELL 27
import math
import random
import numpy as np

rand = random.Random(42)
train_data = list()
test_data = list()
with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
    header = next(feats)
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        if rand.random() < 0.2:
            test_data.append(logpop) 
        else:
            train_data.append(logpop)

train_data = np.array(sorted(train_data))
test_data  = np.array(test_data)

def cell27_split(data):
    pivot = int(len(data)/2)
    # slower, more accurate
    lower_center, lower_val = int(pivot/2), data[int(pivot/2)]
    upper_center, upper_val = int(pivot/2)+pivot, data[int(pivot/2)+pivot]
    return [ { "min": data[0], "max": data[pivot], "val": lower_val,
              "min_idx": 0, "max_idx": pivot, "center": lower_center },
             { "min": data[pivot+1], "max": data[-1], "val": upper_val,
              "min_idx": pivot+1, "max_idx": len(data) - 1, "center": upper_center } ]

def cell27_segment(train_data, order):
    segments_at = [ [ { "min": train_data[0], "max" : train_data[-1], 
                        "val": train_data[int(len(train_data)/2)],
                        "min_idx":0, "max_idx": len(train_data) - 1 } ] ]
    for exp in range(order):
        previous = segments_at[-1]
        new_segment = list()
        for old_segment in previous:
            splitted = cell27_split(train_data[old_segment['min_idx']:(old_segment['max_idx']+1)])
            for created in splitted:
                created['min_idx'] += old_segment['min_idx']
                created['max_idx'] += old_segment['min_idx']
                created['center']  += old_segment['min_idx']
                new_segment.append(created)
        segments_at.append(new_segment)
    return segments_at

def cell27_predict(data, segments):
    result = list()
    for pop in data:
        value = None
        if pop < segments[0]['min']:
            value = segments[0]['val']
        elif pop > segments[-1]['max']:
            value = segments[-1]['val']
        else:
            for seg_idx, segment in enumerate(segments):
                if segment['min'] <= pop and \
                        (seg_idx == len(segments) - 1 or pop < segments[seg_idx + 1]['min']):
                    value = segment['val']
                    break
        if value is None:
            raise Exception("Value not found: {}".format(10**pop))
        result.append(value)
    return np.array(result)

segments_at = cell27_segment(train_data, 5)

for idx, segments in enumerate(segments_at):
    for seg_idx, seg in enumerate(segments):
        print("\tSegment {:2}, min: {:10,} value: {:10,}, max: {:10,}"
              .format((seg_idx+1), int(round(10 ** seg['min'])), 
                      int(round(10 ** seg['val'])), int(round(10 ** seg['max']))))
    test_pred = cell27_predict(test_data, segments)
    print("Segmented in {}, RMSE: {}".format(2**idx, math.sqrt(sum((test_data - test_pred)**2)/len(test_data))))
    
import pickle
with open("ch6_cell27_splits.pk", "wb") as pkl:
    pickle.dump(segments_at, pkl)
	Segment  1, min:      1,000 value:      5,409, max: 24,300,000
Segmented in 1, RMSE: 0.703382635244624
	Segment  1, min:      1,000 value:      2,226, max:      5,409
	Segment  2, min:      5,410 value:     19,435, max: 24,300,000
Segmented in 2, RMSE: 0.44244189595902794
	Segment  1, min:      1,000 value:      1,545, max:      2,226
	Segment  2, min:      2,226 value:      3,363, max:      5,409
	Segment  3, min:      5,410 value:      9,734, max:     19,435
	Segment  4, min:     19,435 value:     49,331, max: 24,300,000
Segmented in 4, RMSE: 0.2912613798706353
	Segment  1, min:      1,000 value:      1,254, max:      1,545
	Segment  2, min:      1,545 value:      1,884, max:      2,226
	Segment  3, min:      2,226 value:      2,729, max:      3,363
	Segment  4, min:      3,364 value:      4,225, max:      5,409
	Segment  5, min:      5,410 value:      7,147, max:      9,734
	Segment  6, min:      9,735 value:     13,515, max:     19,435
	Segment  7, min:     19,435 value:     29,261, max:     49,339
	Segment  8, min:     49,371 value:    114,914, max: 24,300,000
Segmented in 8, RMSE: 0.1936688721250952
	Segment  1, min:      1,000 value:      1,126, max:      1,254
	Segment  2, min:      1,254 value:      1,393, max:      1,545
	Segment  3, min:      1,545 value:      1,710, max:      1,884
	Segment  4, min:      1,884 value:      2,028, max:      2,226
	Segment  5, min:      2,226 value:      2,461, max:      2,729
	Segment  6, min:      2,729 value:      3,021, max:      3,363
	Segment  7, min:      3,364 value:      3,766, max:      4,225
	Segment  8, min:      4,225 value:      4,800, max:      5,409
	Segment  9, min:      5,410 value:      6,176, max:      7,147
	Segment 10, min:      7,147 value:      8,300, max:      9,734
	Segment 11, min:      9,735 value:     11,441, max:     13,518
	Segment 12, min:     13,518 value:     16,117, max:     19,435
	Segment 13, min:     19,435 value:     23,699, max:     29,261
	Segment 14, min:     29,271 value:     37,076, max:     49,339
	Segment 15, min:     49,371 value:     69,696, max:    114,914
	Segment 16, min:    114,924 value:    264,091, max: 24,300,000
Segmented in 16, RMSE: 0.12719110586720506
	Segment  1, min:      1,000 value:      1,068, max:      1,126
	Segment  2, min:      1,126 value:      1,189, max:      1,254
	Segment  3, min:      1,254 value:      1,318, max:      1,393
	Segment  4, min:      1,393 value:      1,461, max:      1,545
	Segment  5, min:      1,545 value:      1,628, max:      1,710
	Segment  6, min:      1,710 value:      1,799, max:      1,884
	Segment  7, min:      1,884 value:      1,986, max:      2,028
	Segment  8, min:      2,028 value:      2,125, max:      2,226
	Segment  9, min:      2,226 value:      2,343, max:      2,461
	Segment 10, min:      2,461 value:      2,588, max:      2,729
	Segment 11, min:      2,729 value:      2,883, max:      3,021
	Segment 12, min:      3,022 value:      3,175, max:      3,363
	Segment 13, min:      3,364 value:      3,566, max:      3,766
	Segment 14, min:      3,766 value:      3,995, max:      4,225
	Segment 15, min:      4,225 value:      4,503, max:      4,800
	Segment 16, min:      4,800 value:      5,098, max:      5,409
	Segment 17, min:      5,410 value:      5,790, max:      6,176
	Segment 18, min:      6,177 value:      6,609, max:      7,147
	Segment 19, min:      7,147 value:      7,720, max:      8,300
	Segment 20, min:      8,300 value:      8,963, max:      9,734
	Segment 21, min:      9,735 value:     10,546, max:     11,441
	Segment 22, min:     11,442 value:     12,419, max:     13,518
	Segment 23, min:     13,518 value:     14,776, max:     16,117
	Segment 24, min:     16,119 value:     17,742, max:     19,435
	Segment 25, min:     19,435 value:     21,342, max:     23,699
	Segment 26, min:     23,700 value:     26,171, max:     29,261
	Segment 27, min:     29,271 value:     32,882, max:     37,080
	Segment 28, min:     37,095 value:     42,033, max:     49,339
	Segment 29, min:     49,371 value:     57,331, max:     69,713
	Segment 30, min:     69,769 value:     87,249, max:    114,914
	Segment 31, min:    114,924 value:    161,719, max:    264,700
	Segment 32, min:    264,716 value:    655,965, max: 24,300,000
Segmented in 32, RMSE: 0.07988951417940254

From the above, splitting into 4 segments produces a reasonable discretization error. I can now proceed to do some feature selection using mutual information (Chapter 4). I will also introduce 10 random features and drop all features whose mutual information falls below that of at least 3 random features. Cell #28.
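The random-probe cutoff can be sketched with scikit-learn's mutual information estimator (a simplified sketch on synthetic data; `mutual_info_classif` estimates MI differently from the binarized 2x2-table computation in the cell that follows, and the feature names here are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 2000
y = rng.integers(0, 4, size=n)              # 4 discretized population bins
informative = y + rng.normal(0, 0.5, n)     # correlated with the target
noise = rng.normal(size=(n, 3))             # uninformative real features
probes = rng.random((n, 10))                # 10 random probe features

X = np.column_stack([informative, noise, probes])
names = (['informative'] + ['noise_' + str(i) for i in range(3)] +
         ['random_' + str(i) for i in range(10)])

mi = mutual_info_classif(X, y, random_state=42)
order = np.argsort(-mi)  # features from most to least informative

# cut where the 3rd-best random probe appears; drop real features below it
seen_random, cut = 0, len(order)
for pos, feat in enumerate(order):
    if names[feat].startswith('random_'):
        seen_random += 1
        if seen_random == 3:
            cut = pos
            break
kept = [names[f] for f in order[:cut] if not names[f].startswith('random_')]
print("kept:", kept)
```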

In [28]:
# CELL 28
import math
import numpy as np
import pickle
import random
from collections import OrderedDict

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

data = list()
header = None
PARAM_RANDOM_FEATS = 10
rand = random.Random(42)
with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for idx in range(PARAM_RANDOM_FEATS):
        header.append('random_' + str(idx))
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        for _ in range(PARAM_RANDOM_FEATS):
            feats.append(rand.random())
        row = (feats, logpop, name)
        data.append(row)

xdata = np.array(list(map(lambda t:t[0], data)))
ydata = np.array(list(map(lambda t:t[1], data)))

def cell28_adjudicate(data, segments):
    result = list()
    for val in data:
        idx = None
        if val < segments[0]['min']:
            idx = 0
        elif val > segments[-1]['max']:
            idx = len(segments) - 1
        else:
            for idx2, segment in enumerate(segments):
                if segment['min'] <= val and val <= segment['max']:
                    idx = idx2
                    break
        result.append(idx)
    return np.array(result)

ydata = cell28_adjudicate(ydata, segments_at[2])

# discretize features into up to 4 bins
for feat in range(xdata.shape[1]):
    feat_segments = cell27_segment(sorted(xdata[:,feat]), 3)
    xdata[:,feat] = cell28_adjudicate(xdata[:,feat], feat_segments[2])
    
# compute confusion tables for each feature
confusion_tables = list()
for feat in range(xdata.shape[1]):
    table = OrderedDict()
    for row in range(xdata.shape[0]):
        feat_val   = int(xdata[row, feat])
        target_val = int(ydata[row])
        if feat_val not in table:
            table[feat_val] = OrderedDict()
        table[feat_val][target_val] = table[feat_val].get(target_val, 0) + 1
    confusion_tables.append(table)

# compute feature utility using mutual information
PARAM_USE_CHI = False
feature_utility = list()
if PARAM_USE_CHI:
    # pip install scipy
    from scipy.stats.mstats import chisquare
for feat in range(xdata.shape[1]):
    if feat % int(xdata.shape[1] / 10) == 0:
        print("{:>3} out of {:>3}".format(feat, xdata.shape[1]))
    table = confusion_tables[feat]
    feat_vals = set()
    for row in table.values():
        feat_vals.update(row.keys())
    cols = { val: sum(map(lambda x:x.get(val,0), table.values())) for val in feat_vals }
    full_table = sum(cols.values())
    
    best_utility = None
    for feat_val in table.keys():
        for target_val in table[feat_val].keys():
            # binarize
            n11 = table[feat_val][target_val]
            if n11 < 5:
                continue
            n10 = sum(table[feat_val].values()) - n11
            n01 = cols.get(target_val) - n11
            n00 = full_table - n11 - n10 - n01
            if n10 == 0 or n01 == 0 or n00 == 0:
                continue
            if PARAM_USE_CHI:
                utility, _ = chisquare([n11, n10, n01, n00])
            else:
                n1_ = n11 + n10
                n0_ = n01 + n00
                n_1 = n11 + n01
                n_0 = n10 + n00
                n = float(full_table)
                utility = n11/n * math.log(n*n11/(n1_*n_1),2) + \
                   n01 / n * math.log(n*n01/(n0_*n_1), 2) + \
                   n10 / n * math.log(n*n10/(n1_*n_0), 2) + \
                   n00 / n * math.log(n*n00/(n0_*n_0), 2)
            if best_utility is None or best_utility < utility:
                best_utility = utility
    if best_utility is not None:
        feature_utility.append( (feat, best_utility) )
    #else:
    #    print("No utility for feat ", feat)
    
feature_utility = sorted(feature_utility, key=lambda x:x[1], reverse=True)

# determine features to drop based on whether there are PARAM_BEST_RANDOM on top of them
PARAM_BEST_RANDOM = 3
cut_point = None
random_so_far = 0
for idx, row in enumerate(feature_utility):
    if header[row[0]].startswith("random_"):
        random_so_far += 1
        if random_so_far == PARAM_BEST_RANDOM:
            cut_point = idx
            break
            
to_drop = list()
for row in feature_utility[cut_point:]:
    if not header[row[0]].startswith("random_"):
        to_drop.append(row)
        
print("Dropped", len(to_drop), "features")

to_keep = set()
with open("ch6_cell28_features_dropped.txt", "w") as dropped:
    drop_set = set()
    for row in to_drop:
        dropped.write(header[row[0]] + "\n")
        drop_set.add(row[0])
    for idx in range(len(header) - PARAM_RANDOM_FEATS):
        if idx not in drop_set:
            to_keep.add(idx + 1) # accommodate 'name' column
    to_keep.add(0)
    to_keep.add(len(header) - PARAM_RANDOM_FEATS + 1) # logpop

print("New feature vector size", len(to_keep), "features, including name and target")
        
table1 = ("<table><tr><th>Position</th><th>Feat#</th><th>Feat</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0],
                        header[r[1][0]].replace('<','').replace('>',''),
                        r[1][1]), 
                               enumerate(feature_utility[:150])))) +"</table>")
table2 = ("<table><tr><th>Position</th><th>Feat#</th><th>Feat</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0],
                        header[r[1][0]].replace('<','').replace('>',''),
                        r[1][1]), 
                               enumerate(reversed(to_drop))))) +"</table>")

with open("ch6_cell28_dev_feat1_filtered.tsv", "w") as filtered:
    with open("ch6_cell26_dev_feat1_dampen.tsv") as feats:
        header = next(feats)
        header = header.strip().split("\t")
        new_header = list()
        for idx, header_col in enumerate(header):
            if idx in to_keep:
                new_header.append(header_col)
        filtered.write("\t".join(new_header) + "\n")
        for line in feats:
            fields = line.strip().split("\t")
            new_fields = list()
            for idx, value in enumerate(fields):
                if idx in to_keep:
                    new_fields.append(value)
            filtered.write("\t".join(new_fields) + "\n")

from IPython.display import HTML, display
display(HTML("<h3>Top 150 features by MI</h3>" + table1 + 
             "<h3>Features to drop</h3>"       + table2))
  0 out of 390
 39 out of 390
 78 out of 390
117 out of 390
156 out of 390
195 out of 390
234 out of 390
273 out of 390
312 out of 390
351 out of 390
Dropped 68 features
New feature vector size 314 features, including name and target

Top 150 features by MI

Position  Feat#  Feat  Utility
0  286  http://dbpedia.org/ontology/utcOffset#count  0.0976963935
1  10  http://dbpedia.org/ontology/birthPlace?inv#count  0.0773520859
2  33  http://dbpedia.org/ontology/country#1@OTHER  0.0715361164
3  7  http://dbpedia.org/ontology/areaTotal#1  0.0574203739
4  301  http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1@http://dbpedia.org/ontology/City  0.0446595875
5  100  http://dbpedia.org/ontology/hometown?inv#count  0.0444925574
6  104  http://dbpedia.org/ontology/leaderTitle#count  0.0418022897
7  98  http://dbpedia.org/ontology/ground?inv#count  0.0417288633
8  102  http://dbpedia.org/ontology/isPartOf?inv#count  0.0383227077
9  99  http://dbpedia.org/ontology/headquarter?inv#count  0.0381511672
10  11  http://dbpedia.org/ontology/city?inv#count  0.0375958312
11  110  http://dbpedia.org/ontology/locationCity?inv#count  0.0345756327
12  109  http://dbpedia.org/ontology/location?inv#count  0.0344053176
13  303  http://www.w3.org/2000/01/rdf-schema#seeAlso#count  0.0329376538
14  101  http://dbpedia.org/ontology/isPartOf#count  0.0310347434
15  0  rel#count  0.0305265290
16  103  http://dbpedia.org/ontology/leaderName#count  0.0300089032
17  237  http://dbpedia.org/ontology/residence?inv#count  0.0297368517
18  291  http://dbpedia.org/ontology/utcOffset#1@"+8"  0.0296566141
19  379  http://xmlns.com/foaf/0.1/nick#count  0.0294899528
20  378  http://xmlns.com/foaf/0.1/name#count  0.0294766008
21  34  http://dbpedia.org/ontology/deathPlace?inv#count  0.0290410902
22  289  http://dbpedia.org/ontology/utcOffset#1@"+5:30"  0.0258797043
23  246  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Indian_Standard_Time  0.0254364934
24  17  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/India  0.0245425720
25  299  http://dbpedia.org/ontology/utcOffset#2@OTHER  0.0223334458
26  300  http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1@http://dbpedia.org/ontology/Settlement  0.0219254334
27  105  http://dbpedia.org/ontology/leaderTitle#1@"Mayor"@en  0.0211852222
28  241  http://dbpedia.org/ontology/timeZone#count  0.0186070090
29  13  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/United_States  0.0175165178
30  253  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/China_Standard_Time  0.0170512916
31  29  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Switzerland  0.0158900843
32  94  http://dbpedia.org/ontology/district#count  0.0157178626
33  97  http://dbpedia.org/ontology/foundingDate#count  0.0148247411
34  267  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Cities_of_Japan  0.0141595020
35  268  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Switzerland  0.0136140699
36  260  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Communes_of_Romania  0.0134756551
37  14  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/France  0.0119833788
38  377  http://xmlns.com/foaf/0.1/homepage#count  0.0117127918
39  30  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Brazil  0.0114840226
40  19  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Romania  0.0109216127
41  262  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_the_Philippines  0.0108070888
42  249  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Philippine_Standard_Time  0.0105554191
43  243  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/North_American_Eastern_Time_Zone  0.0098569008
44  3  http://dbpedia.org/ontology/areaLand#count  0.0094568186
45  4  http://dbpedia.org/ontology/areaLand#1  0.0094568186
46  113  http://dbpedia.org/ontology/minimumElevation#count  0.0094335063
47  118  http://dbpedia.org/ontology/populationAsOf#1@"2006-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0094283539
48  261  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Village  0.0094071382
49  114  http://dbpedia.org/ontology/minimumElevation#1  0.0093485650
50  240  http://dbpedia.org/ontology/routeStart?inv#count  0.0092929410
51  8  http://dbpedia.org/ontology/areaWater#count  0.0092498928
52  258  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Census-designated_place  0.0089104877
53  238  http://dbpedia.org/ontology/routeEnd?inv#count  0.0089041982
54  111  http://dbpedia.org/ontology/maximumElevation#count  0.0087616361
55  112  http://dbpedia.org/ontology/maximumElevation#1  0.0087616361
56  287  http://dbpedia.org/ontology/utcOffset#1@"-5"  0.0082976925
57  5  http://dbpedia.org/ontology/areaLand#2  0.0080500379
58  133  http://dbpedia.org/ontology/populationDensity#count  0.0078484319
59  6  http://dbpedia.org/ontology/areaTotal#count  0.0074483968
60  304  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/List_of_twin_towns  0.0073464966
61  294  http://dbpedia.org/ontology/utcOffset#2@"-4"  0.0067065538
62  244  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/North_American_Central_Time_Zone  0.0066068794
63  23  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Austria  0.0062625474
64  116  http://dbpedia.org/ontology/nearestCity?inv#count  0.0059831857
65  288  http://dbpedia.org/ontology/utcOffset#1@"+2"  0.0056993889
66  297  http://dbpedia.org/ontology/utcOffset#2@"+3"  0.0053709065
67  245  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Eastern_European_Time  0.0053411230
68  95  http://dbpedia.org/ontology/elevation#count  0.0052806194
69  1  http://dbpedia.org/ontology/area#1  0.0051504748
70  2  http://dbpedia.org/ontology/areaCode#count  0.0048780440
71  96  http://dbpedia.org/ontology/elevation#1  0.0046697548
72  279  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Brazil  0.0044755608
73  15  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Germany  0.0043193264
74  296  http://dbpedia.org/ontology/utcOffset#2@"-6"  0.0042388718
75  259  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Town  0.0042273599
76  135  http://dbpedia.org/ontology/postalCode#count  0.0040125311
77  22  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Czech_Republic  0.0040068343
78  115  http://dbpedia.org/ontology/motto#count  0.0038502787
79  256  http://dbpedia.org/ontology/type#count  0.0037569686
80  239  http://dbpedia.org/ontology/routeJunction?inv#count  0.0037468734
81  254  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Iran_Standard_Time  0.0036248488
82  264  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Village_(United_States)  0.0036064034
83  284  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Cuba  0.0035736682
84  20  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Australia  0.0034511135
85  18  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Poland  0.0033868709
86  120  http://dbpedia.org/ontology/populationAsOf#1@"2010-04-01"^^http://www.w3.org/2001/XMLSchema#date  0.0033147498
87  295  http://dbpedia.org/ontology/utcOffset#2@"+1"  0.0031843944
88  242  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Central_European_Time  0.0028993576
89  285  http://dbpedia.org/ontology/type#1@OTHER  0.0026631391
90  272  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/City_(California)  0.0025227055
91  25  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Mexico  0.0024821676
92  146  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Kyushu  0.0024320431
93  107  http://dbpedia.org/ontology/leaderTitle#1@"Municipal President"@en  0.0022439285
94  119  http://dbpedia.org/ontology/populationAsOf#1@"2010-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0021322853
95  266  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipality  0.0019976217
96  9  http://dbpedia.org/ontology/areaWater#1  0.0019811294
97  148  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Chūbu_region  0.0019766421
98  26  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Hungary  0.0019426290
99  136  http://dbpedia.org/ontology/region#count  0.0019369811
100  134  http://dbpedia.org/ontology/populationDensity#1  0.0019254055
101  269  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Gmina  0.0019167945
102  31  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Ukraine  0.0019097990
103  28  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Sweden  0.0016357978
104  302  http://www.w3.org/1999/02/22-rdf-syntax-ns#type#1@OTHER  0.0016116168
105  16  http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/United_Kingdom  0.0015910670
106  163  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Kansai_region  0.0014567274
107122http://dbpedia.org/ontology/populationAsOf#1@"2009-06-30"^^http://www.w3.org/2001/XMLSchema#date0.0013583905
10812http://dbpedia.org/ontology/country#count0.0013441720
109293http://dbpedia.org/ontology/utcOffset#1@OTHER0.0012687612
110274http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Borough_(Pennsylvania)0.0011648049
111139http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Pays_de_la_Loire0.0010797894
112273http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/List_of_cities_in_Argentina0.0010213124
11332http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/United_States_of_America0.0010138343
11421http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Spain0.0009926035
115308http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Ukraine0.0009778459
116275http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Borough_(New_Jersey)0.0009481932
117250http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Mountain_Time_Zone0.0008733979
118307http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Italy0.0008481666
119137http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Nord-Pas-de-Calais0.0007872309
120280http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Colombia0.0007818404
12124http://dbpedia.org/ontology/country#1@http://dbpedia.org/resource/Italy0.0007681960
122255http://dbpedia.org/ontology/timeZone#1@OTHER0.0007568887
123251http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Eastern_Standard_Time_Zone0.0007361581
124263http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/New_England_town0.0006881598
125270http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/List_of_towns_and_villages_in_Illinois0.0006643873
126282http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/City_(Quebec)0.0006639864
127314http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Africa0.0006526736
12835http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Nord_(French_department)0.0006488608
129140http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Rhône-Alpes0.0006098502
130117http://dbpedia.org/ontology/populationAsOf#count0.0005952472
131257http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/City0.0005771659
132141http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Aquitaine0.0005468276
13340http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Maine-et-Loire0.0005248101
13439http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Loire-Atlantique0.0005043381
135142http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Lombardy0.0005034930
136277http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Corregimientos_of_Panama0.0004998276
137188http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Hokkaido0.0004952871
13869http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Dordogne0.0004947210
139271http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Administrative_divisions_of_New_York0.0004838897
14041http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Haut-Rhin0.0004605897
141316http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Asia0.0004484091
14238http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Finistère0.0004417426
143106http://dbpedia.org/ontology/leaderTitle#1@"Alcalde"@en0.0004371334
144227http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Düsseldorf0.0004304264
145309http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/The_United_Kingdom0.0004189674
146305http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Hungary0.0004174443
147170http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Kassel0.0004139833
14837http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Pas-de-Calais0.0004096477
149248http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Pacific_Time_Zone0.0004072352

Features to drop

Position  Feat#  Feat  Utility
0  344  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Northern_Cyprus  0.0000040442
1  126  http://dbpedia.org/ontology/populationAsOf#1@"2008-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000080627
2  371  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/History_of_Hamburg  0.0000092315
3  207  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Chester_County,_Pennsylvania  0.0000104297
4  222  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/McHenry_County,_Illinois  0.0000151230
5  359  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Transport  0.0000181687
6  125  http://dbpedia.org/ontology/populationAsOf#1@"2007-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000185330
7  206  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Madison_County,_Illinois  0.0000202727
8  130  http://dbpedia.org/ontology/populationAsOf#1@"2011-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000206101
9  199  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Piedmont  0.0000232673
10  214  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Marche  0.0000247764
11  231  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Lackawanna_County,_Pennsylvania  0.0000255142
12  219  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Trenčín_Region  0.0000261496
13  83  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Loire_(département)  0.0000262310
14  131  http://dbpedia.org/ontology/populationAsOf#1@"2014-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000294894
15  196  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Will_County,_Illinois  0.0000323192
16  176  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Moravian-Silesian_Region  0.0000324838
17  224  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Capital_District  0.0000332389
18  252  http://dbpedia.org/ontology/timeZone#1@http://dbpedia.org/resource/Central_Time_Zone_(North_America)  0.0000341756
19  50  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Seine-Maritime  0.0000341793
20  235  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Hudson_Valley  0.0000346144
21  232  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Darmstadt_(region)  0.0000347087
22  208  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/St._Clair_County,_Illinois  0.0000352008
23  362  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Banstead  0.0000365581
24  361  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Malta  0.0000365581
25  364  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Marion_County  0.0000369277
26  366  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Oceania  0.0000369800
27  363  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Montenegro  0.0000369800
28  332  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Slovenia  0.0000370154
29  194  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/South_East_England  0.0000371703
30  368  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/List_of_people_from_the_Pittsburgh_metropolitan_area  0.0000372525
31  36  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Yvelines  0.0000414433
32  77  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Val-d'Oise  0.0000417613
33  76  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Drôme  0.0000417613
34  221  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Jefferson_Parish,_Louisiana  0.0000482626
35  220  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/East_Midlands  0.0000482626
36  223  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Sangamon_County,_Illinois  0.0000496529
37  53  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Nord_(département)  0.0000502507
38  72  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Ain  0.0000510437
39  84  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Côte-d'Or  0.0000514225
40  187  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Champagne-Ardenne  0.0000514636
41  218  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Trnava_Region  0.0000517374
42  202  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Olomouc_Region  0.0000523276
43  67  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Alpes-Maritimes  0.0000526531
44  92  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Haute-Vienne  0.0000526967
45  90  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Marne_(département)  0.0000526967
46  228  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Žilina_Region  0.0000545172
47  320  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Lithuania  0.0000555288
48  91  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Allier  0.0000557335
49  230  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Madeira_Island  0.0000561838
50  197  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/North_West_England  0.0000585648
51  145  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Campania  0.0000592991
52  205  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Apulia  0.0000604698
53  129  http://dbpedia.org/ontology/populationAsOf#1@"2009-12-31"^^http://www.w3.org/2001/XMLSchema#date  0.0000608310
54  281  http://dbpedia.org/ontology/type#1@http://dbpedia.org/resource/Municipalities_of_Mexico  0.0000621425
55  87  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Moselle_(département)  0.0000627847
56  167  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Sicily  0.0000630973
57  318  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Estonia  0.0000636066
58  73  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Var_(département)  0.0000646552
59  323  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/The_Republic_of_Ireland  0.0000689172
60  212  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Île-de-France  0.0000693302
61  192  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Karlovy_Vary_Region  0.0000695792
62  61  http://dbpedia.org/ontology/department#1@http://dbpedia.org/resource/Vaucluse  0.0000696101
63  233  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Abruzzo  0.0000698774
64  322  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Latvia  0.0000700137
65  355  http://www.w3.org/2000/01/rdf-schema#seeAlso#3@http://dbpedia.org/resource/Media  0.0000705965
66  234  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Latium  0.0000707228
67  147  http://dbpedia.org/ontology/region#1@http://dbpedia.org/resource/Veneto  0.0000717788

The list of most informative features looks very good; in particular, the number of time zones and the actual time zone values indicate that geographic location is important. While the Wikipedia page for each city contains its GPS coordinates, that information is not available in the DBpedia files I chose, but it is available in the GeoNames file, so it would be useful to retrieve it. I'll look into that in the GIS case study in Chapter 10.

The counts of incoming links also seem very useful, together with areaTotal. The dropped features look very specific and most probably fire very rarely.
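As a minimal sketch of how utility scores like the ones listed above can be produced, here is one way to rank features by mutual information against the target using scikit-learn. The feature matrix, target, and feature names below are toy stand-ins, not the chapter's actual data:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# toy data: only the first feature carries signal about the target
rng = np.random.RandomState(42)
X = rng.rand(500, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.rand(500)
names = ["areaTotal#1", "motto#count", "region#count"]  # illustrative names

# estimate mutual information between each feature and the target
utilities = mutual_info_regression(X, y, random_state=42)

# rank features by decreasing utility, as in the tables above
ranked = sorted(zip(names, utilities), key=lambda t: -t[1])
for name, util in ranked:
    print("{}\t{:.4f}".format(name, util))
```

Features at the bottom of such a ranking are the candidates to drop.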

I will now turn to another technique for analyzing the data: decision trees, looking at the splits they choose. Cell #29

In [29]:
# CELL 29
import numpy as np
import pickle

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

data = list()
header = None
with open("ch6_cell28_dev_feat1_filtered.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # pop
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        row_feats = list(map(float, fields[1:-1])) # avoid shadowing the file handle
        row = (row_feats, logpop, name)
        data.append(row)

xdata = np.array(list(map(lambda t:t[0], data)))
ydata = np.array(list(map(lambda t:t[1], data)))

ydata = cell28_adjudicate(ydata, segments_at[2])
print_names = list()
for name in header:
    name = name.replace('"','')
    if '@' in name:
        pos = name.find('@')
        rel, val = name[0:pos], name[pos+2:]
        rel = rel.replace('<','').replace('>','').split("/")[-1]
        val = val.replace('<','').replace('>','').split("/")[-1]
        name = "{}@{}".format(rel, val)
    else:
        name = name.replace('<','').replace('>','').split("/")[-1]
    print_names.append(name)   

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(xdata, ydata)

import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, max_depth=4,
                               impurity=False, proportion=False,
                               feature_names=print_names)
graph = graphviz.Source(dot_data)
graph.render("ch6_cell29_tree.dot")
graph
Out[29]:
(Graphviz rendering of the decision tree, top levels only: the root splits on birthPlace?inv#count <= 0.739 over 44,959 samples; subsequent splits include country#1@OTHER, country#1@India, areaTotal#1, utcOffset#count, elevation#1, rel#count, postalCode#count and 22-rdf-syntax-ns#type#1@City.)

Second Featurization

From the previous analysis, I decided to add a computed feature using $area \times density$ and to perform out-of-fold target rate encoding in Cell #30, producing "ch6_cell30_dev_feats2_train.tsv" and "ch6_cell30_dev_feats2_test.tsv".

Because it is not possible to do TRE first and normalization afterwards, this new feature set is produced already normalized.
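The idea behind out-of-fold target rate encoding can be sketched in isolation: each training row's categorical value is replaced by a target statistic computed only on the *other* folds, so a row never leaks its own target into its encoding. The toy categories and targets below, and the use of the mean (Cell #30 uses the median), are illustrative assumptions:

```python
import random

def oof_target_encode(cats, targets, n_folds=5, seed=42):
    """Out-of-fold target encoding for a single categorical column."""
    rand = random.Random(seed)
    folds = [rand.randint(0, n_folds - 1) for _ in cats]
    global_mean = sum(targets) / len(targets)  # fallback when no evidence
    encoded = []
    for i, cat in enumerate(cats):
        # gather targets for this category, excluding row i's own fold
        others = [t for j, t in enumerate(targets)
                  if cats[j] == cat and folds[j] != folds[i]]
        encoded.append(sum(others) / len(others) if others else global_mean)
    return encoded

cats    = ["city", "town", "city", "town", "city", "town"]
targets = [6.0, 3.0, 6.2, 3.1, 5.8, 2.9]  # e.g. log-populations
print(oof_target_encode(cats, targets))
```

At test time there is no leakage concern, so the encoding computed on all folds together (the "full TRE" in Cell #30) can be applied directly.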

In [30]:
# CELL 30
import math
import random
from functools import reduce
from sklearn.svm import SVR
import numpy as np
from collections import OrderedDict

rand = random.Random(42)
train_data = list()
test_data  = list()
header = None
with open("ch6_cell28_dev_feat1_filtered.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        row_feats = list(map(float, fields[1:-1])) # avoid shadowing the file handle
        row = (row_feats, logpop, name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)
               
test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))

# add computed feature: area times density
approx_area_train = np.zeros( (xtrain.shape[0],) )
approx_area_test  = np.zeros( (xtest.shape[0],)  )
# BUG: "area#1" is not included here; it should be
for idx, name in enumerate(header):
    if "#count" not in name and ("areaLand#" in name or "areaTotal#" in name):
        approx_area_train = np.max( [ approx_area_train, xtrain[:,idx] ], axis=0)
        approx_area_test =  np.max( [ approx_area_test, xtest[:,idx] ], axis=0)
density_idx = next(filter(lambda t:"populationDensity#1" in t[1], 
                     enumerate(header)))[0]
computed_value_train = approx_area_train * xtrain[:,density_idx]
computed_value_test  = approx_area_test  * xtest[:, density_idx]
computed_defined_train = np.array( list(map(lambda x: 1.0 if x>0 else 0.0, 
                                            computed_value_train)) )
computed_defined_test  = np.array( list(map(lambda x: 1.0 if x>0 else 0.0, 
                                           computed_value_test)) )
computed_value_train = np.log(computed_value_train + 1)
computed_value_test  = np.log(computed_value_test  + 1)
header.append("computed#defined")
header.append("computed#value")
xtrain = np.append(xtrain, computed_defined_train.reshape( (xtrain.shape[0],1) ), axis=1)
xtrain = np.append(xtrain, computed_value_train.reshape(   (xtrain.shape[0],1) ), axis=1)
xtest  = np.append(xtest,  computed_defined_test.reshape(  (xtest.shape[0],1) ),  axis=1)
xtest  = np.append(xtest,  computed_value_test.reshape(    (xtest.shape[0],1) ),  axis=1)

# out-of-fold target encoding for tsv
xtrain_non_tre = xtrain.copy()
xtest_non_tre  = xtest.copy()

def doTRE(xtrain, ytrain, xtest, ytest):
    PARAM_FOLDS = 10
    PARAM_ADD_ORIGINAL_COLS = True
    PARAM_SINGLE_COL_TRE = True
    folds = list(map(lambda x:rand.randint(0, PARAM_FOLDS - 1), range(xtrain.shape[0])))
    # value to use when there is no evidence available
    imputed_tre = sorted(list(ytrain))[int(ytrain.shape[0]/2)]
    
    # gather the TRE cols
    tre_cols = list()
    tre_col_names = list()
    cols_to_tre_cols = dict()
    cats_for_attrib = OrderedDict() # name to list of idx
    for idx, name in enumerate(header):
        if '@' in name: # categorical value
            tre_idx = len(tre_cols)
            cols_to_tre_cols[idx] = tre_idx
            tre_cols.append(idx)
            tre_col_names.append(name + '+TRE')
            attrib = name[0:name.index('@')]
            if attrib not in cats_for_attrib:
                cats_for_attrib[attrib] = []
            cats_for_attrib[attrib].append( (idx, tre_idx) )
    tre_train = np.zeros( (xtrain.shape[0], len(tre_cols)) )
    tre_test  = np.zeros( (xtest.shape[0],  len(tre_cols)) )
    
    saved_full_TRE = OrderedDict()
    
    for idx, name in enumerate(header):
        if '@' in name: # categorical value
            tre_col = cols_to_tre_cols[idx]
            
            fold_val_pops = { fold: dict() for fold in range(PARAM_FOLDS) }
            # gather the pop values for the features in all folds
            for row in range(xtrain.shape[0]):
                value = xtrain[row, idx]
                fold = folds[row]
                fold_val_pops[fold][value] = fold_val_pops[fold].get(value, []) + [ytrain[row]]
            all_values = set(xtrain[:,idx]).union(set(xtest[:,idx]))
            # compute TREs
            fold_TREs = list()
            for fold in range(PARAM_FOLDS):
                val_pops = dict()
                for other in range(PARAM_FOLDS):
                    if other != fold:
                        for value in fold_val_pops[other].keys():
                            val_pops[value] = val_pops.get(value, []) + fold_val_pops[other][value]
                fold_TRE = OrderedDict()
                for value in all_values:
                    val_pops[value] = sorted(val_pops.get(value, []))
                    fold_TRE[value] = imputed_tre if len(val_pops[value]) == 0 else \
                      val_pops[value][int(len(val_pops[value])/2)]
                fold_TREs.append(fold_TRE)
            full_val_pops = dict()
            for value in all_values:
                full_val_pops[value] = sorted(reduce(lambda x, y: x + y, 
                                                  map(lambda f:f.get(value, []), fold_val_pops.values()), []))
            full_TRE = OrderedDict()
            for value in all_values:
                full_TRE[value] = imputed_tre if len(full_val_pops[value]) == 0 \
                   else full_val_pops[value][int(len(full_val_pops[value])/2)]

            # encode folds
            for row in range(xtrain.shape[0]):
                tre_train[row, tre_col] = fold_TREs[folds[row]][xtrain[row, idx]]
            # encode test (full TRE)
            for row in range(xtest.shape[0]):
                tre_test[row, tre_col] = full_TRE[xtest[row, idx]]
            saved_full_TRE[idx] = full_TRE
            
    if PARAM_SINGLE_COL_TRE:
        single_tre_train = np.zeros( (xtrain.shape[0], len(cats_for_attrib)) )
        single_tre_test  = np.zeros( (xtest.shape[0],  len(cats_for_attrib)) )
        single_tre_col_names = list()
        single_tre_count = 0
        for attrib, idxs in cats_for_attrib.items():
            single_tre_col_names.append(attrib + '#TRE')
            for row in range(xtrain.shape[0]):
                max_tre_idx = None
                max_val = None
                for idx, tre_idx in idxs:
                    if max_tre_idx is None or xtrain[row,idx] > max_val:
                        max_val = xtrain[row,idx]
                        max_tre_idx = tre_idx
                single_tre_train[row,single_tre_count] = tre_train[row,max_tre_idx]
            for row in range(xtest.shape[0]):
                max_tre_idx = None
                max_val = None
                for idx, tre_idx in idxs:
                    if max_tre_idx is None or xtest[row,idx] > max_val:
                        max_val = xtest[row,idx]
                        max_tre_idx = tre_idx
                single_tre_test[row,single_tre_count] = tre_test[row,max_tre_idx]
            single_tre_count += 1
        tre_train = single_tre_train
        tre_test  = single_tre_test
        tre_col_names = single_tre_col_names

    if not PARAM_ADD_ORIGINAL_COLS:
        # delete original columns
        xtrain = np.delete(xtrain, tre_cols, 1)
        xtest  = np.delete(xtest,  tre_cols, 1)
        for idx in reversed(tre_cols):
            del header[idx]
            
    # add tre columns
    header.extend(tre_col_names)
    xtrain = np.append(xtrain, tre_train, axis=1)
    xtest = np.append(xtest, tre_test, axis=1)
    
    return header, xtrain, ytrain, xtest, ytest, saved_full_TRE

PARAM_WRITE_UNSCALED = False
if PARAM_WRITE_UNSCALED:
    new_header, xtrain, ytrain, xtest, ytest, _ = doTRE(xtrain, ytrain, xtest, ytest)

    # write new data
    with open("ch6_cell30_dev_feats2_train_raw.tsv", "w") as feats:
        feats.write("\t".join(["name"] + new_header + ["population"])+"\n")
        arr, target, names = xtrain, ytrain, list(map(lambda t:t[2],train_data))
        for row in range(arr.shape[0]):
            feats.write("\t".join(list(map(lambda v:str(v),
                                       [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")
    with open("ch6_cell30_dev_feats2_test_raw.tsv", "w") as feats:
        feats.write("\t".join(["name"] + new_header + ["population"])+"\n")
        arr, target, names = xtest, ytest, test_names
        for row in range(arr.shape[0]):
            feats.write("\t".join(list(map(lambda v:str(v),
                                       [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")

# need to re-do tre after scaling
xtrain = xtrain_non_tre
xtest  = xtest_non_tre

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# redo TRE after scaling
header, xtrain, ytrain, xtest, ytest, saved_full_TRE = doTRE(xtrain, ytrain, xtest, ytest)

np.savez_compressed("ch6_cell30_scaled_full_TRE.npz", 
                    saved_full_TRE = saved_full_TRE,
                   xtrain_min      = xtrain_min,
                   xtrain_scaling  = xtrain_scaling,
                   ytrain_min      = ytrain_min,
                   ytrain_scaling  = ytrain_scaling)
with open("ch6_cell30_dev_feats2_train.tsv", "w") as feats:
    feats.write("\t".join(["name"] + header + ["population"])+"\n")
    arr, target, names = xtrain, ytrain, list(map(lambda t:t[2],train_data))
    for row in range(arr.shape[0]):
        feats.write("\t".join(list(map(lambda v:str(v),
                                   [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")
with open("ch6_cell30_dev_feats2_test.tsv", "w") as feats:
    feats.write("\t".join(["name"] + header + ["population"])+"\n")
    arr, target, names = xtest, ytest, test_names
    for row in range(arr.shape[0]):
        feats.write("\t".join(list(map(lambda v:str(v),
                                   [names[row]]+list(arr[row,:])+[target[row]]))) +"\n")


# train
print("Training on {:,} cities".format(len(xtrain)))

best_c = 100.0
best_epsilon = 0.05
PARAM_DO_GRID_SEARCH = False
if PARAM_DO_GRID_SEARCH:
    best_rmse = 1000
    for c in [0.001, 0.01, 0.1, 1.0, 2.0, 5.0, 15.0, 100.0]:
        svr_rbf = SVR(epsilon=0.01, C=c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("C", c, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_c = c
            best_rmse = RMSE
    print("Best C", best_c,"best RMSE", best_rmse)        

    best_epsilon = 0.1
    best_rmse = 1000
    for epsilon in [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0, 10.0, 100.0]:
        svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
        svr_rbf.fit(xtrain, ytrain)
        ytest_pred = svr_rbf.predict(xtest)
        ytest_pred *= 1.0/ytrain_scaling
        ytest_pred += ytrain_min
        RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
        print("Epsilon", epsilon, "RMSE", RMSE)
        if RMSE < best_rmse:
            best_epsilon = epsilon
            best_rmse = RMSE
    print("Best epsilon", best_epsilon,"best RMSE", best_rmse)

svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

print("Writing files for error analysis")
np.savez_compressed("ch6_cell30_feat2_svr.npz", 
                    xtrain=xtrain, ytrain=ytrain, 
                    xtest