Welcome to LGM-Interlinking’s documentation!


LGM-Interlinking

This Python library implements and evaluates the proposed LinkGeoML models for classification-based toponym interlinking.

In this setting, we consider the names of the toponyms as the only source of information that can be used to decide whether two toponyms refer to the same real-world entity. Specifically, we build a meta-similarity function, called LGM-Sim, that takes into account and incorporates within its processing steps the specificities of toponym names. Consequently, we derive training features from LGM-Sim that are used in various classification models. The proposed method and its derived features are robust enough to handle variations in the distribution of toponyms and demonstrate a significant increase in interlinking accuracy compared to baseline models widely used in the literature (see References). Indicatively, we achieve 85.4% accuracy with the Gradient Boosting Trees classifier, compared to the best baseline model, which achieves 77.3% accuracy with Random Forests.
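To give a rough feel for the meta-similarity idea, the toy function below normalizes the names and reorders their terms before applying a base string similarity. This is only an illustrative sketch, not the actual LGM-Sim implementation, which uses richer, tuned processing steps:

```python
from difflib import SequenceMatcher


def base_sim(a: str, b: str) -> float:
    """Plain character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def toy_meta_sim(a: str, b: str) -> float:
    """Illustrative meta-similarity (NOT LGM-Sim): lowercase the names and
    sort their terms alphabetically before comparing, so that mere term
    reordering does not penalize otherwise matching toponyms."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return max(base_sim(a, b), base_sim(norm(a), norm(b)))


# Term reordering lifts the score for swapped-order variants:
print(base_sim("Lake Placid", "Placid Lake"))      # well below 1.0
print(toy_meta_sim("Lake Placid", "Placid Lake"))  # 1.0
```

The key point is that toponym-specific normalization happens inside the similarity function itself, which is what makes it a meta-similarity rather than a plain string metric.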

The data folder contains the train datasets used to build the classifiers, as well as files of frequent terms extracted from them. For evaluation, we used the dataset from the Toponym-Matching work (see Setup procedure).

The source code was tested using Python 3 (>=3.6) and Scikit-Learn 0.22.2.post1 on a Linux server.

Setup procedure

Download the latest version from the GitHub repository, change to the main directory and run:

pip install -r pip_requirements.txt

It should install all the required libraries automatically (scikit-learn, numpy, pandas etc.).

Change to the data folder, download the test dataset and unzip it:

wget https://github.com/ruipds/Toponym-Matching/raw/master/dataset/dataset.zip
wget https://github.com/ruipds/Toponym-Matching/raw/master/dataset/dataset.z01

zip -FF dataset.zip  --out dataset.zip.fixed
unzip dataset.zip.fixed

Usage

The input dataset needs to be in CSV format. Specifically, a valid dataset should have at least the following fields/columns:

  • The names for each of the candidate toponym pairs.

  • The label, i.e., {True, False}, assigned to each toponym pair.
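A minimal sketch of such a dataset, built with the standard library's csv module. The column names (s1, s2, status) are placeholders chosen for illustration; the exact header names expected by the library may differ:

```python
import csv
import io

# Hypothetical header names -- the library's actual expected column
# names are not specified here.
rows = [
    ("s1", "s2", "status"),
    ("Thessaloniki", "Salonika", "True"),
    ("Athens", "Volos", "False"),
]

# Write the toy dataset to an in-memory CSV buffer.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Read it back: each record holds the two candidate names and the label.
records = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(records[0]["s1"], records[0]["status"])  # Thessaloniki True
```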

The library implements the following distinct processes:

  1. Features extraction

    The build function constructs a set of training features to use within classifiers for toponym interlinking.
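To illustrate what a feature vector for one toponym pair might look like, the sketch below computes a few cheap string-similarity scores with the standard library. These are stand-ins; the library's build function derives richer LGM-Sim-based features:

```python
from difflib import SequenceMatcher


def extract_features(name_a: str, name_b: str) -> list:
    """Illustrative feature vector for one toponym pair (stand-in for
    the library's actual feature set)."""
    a, b = name_a.lower(), name_b.lower()

    # Plain character-level similarity.
    ratio = SequenceMatcher(None, a, b).ratio()

    # Similarity on alphabetically sorted terms (order-insensitive).
    sa, sb = " ".join(sorted(a.split())), " ".join(sorted(b.split()))
    sorted_ratio = SequenceMatcher(None, sa, sb).ratio()

    # Jaccard overlap of the two term sets.
    ta, tb = set(a.split()), set(b.split())
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    return [ratio, sorted_ratio, jaccard]


print(extract_features("Lake Placid", "Placid Lake"))
```

Stacking such vectors for every pair in the train dataset yields the feature matrix that the classifiers consume.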

  2. Algorithm and model selection

    The functionality of the fineTuneClassifiers function is twofold. Firstly, it chooses, among a list of supported machine learning algorithms, the one that achieves the highest average accuracy score on the examined dataset. Secondly, it searches for the best model, i.e., the best hyper-parameters for the algorithm identified in the first step.
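This two-step selection can be sketched with scikit-learn's GridSearchCV. The candidate algorithms, parameter grids, and synthetic data below are purely illustrative, not the library's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy feature matrix standing in for the extracted pair features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical candidates: neither the algorithms nor the grids below
# reflect the library's real search space.
candidates = {
    "rf": (RandomForestClassifier(random_state=0), {"n_estimators": [10, 50]}),
    "svm": (SVC(), {"C": [0.1, 1.0]}),
}

# Step 1: pick the algorithm with the best cross-validated accuracy;
# Step 2: GridSearchCV simultaneously tunes its hyper-parameters.
best_name, best_search = None, None
for name, (est, grid) in candidates.items():
    search = GridSearchCV(est, grid, cv=3, scoring="accuracy").fit(X, y)
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

print(best_name, best_search.best_params_, round(best_search.best_score_, 3))
```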

  3. Model training

    The trainClassifier function trains the model selected in the previous process, i.e., the ML algorithm with tuned hyper-parameters that best fits the data, on the whole train dataset, without splitting it into folds.

  4. Model deployment

    The testClassifier function applies the trained model to new, unseen data.
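Steps 3 and 4 amount to the standard fit-then-predict pattern in scikit-learn. A minimal sketch on synthetic data follows; the estimator and data here are illustrative stand-ins, not the library's tuned model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data in place of the real train/test feature matrices.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: train the tuned model on the whole train split (no folds) ...
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Step 4: ... then apply it once to the untouched test split.
acc = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", round(acc, 3))
```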

A complete pipeline of the above processes, i.e., feature extraction followed by training and evaluation of state-of-the-art classifiers for toponym interlinking, can be executed with the following command:

$ python -m interlinking.cli hyperparameter_tuning --train_set <path/to/train-dataset>
--test_set <path/to/test-dataset>

Additionally, help is available on the command line interface (CLI). Enter the following to list all supported commands, each with a short description; appending -h to a command shows its options.

$ python -m interlinking.cli -h
Usage: cli.py [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  build                   build a candidate pairs of toponyms dataset for evaluation from Geonames
  evaluate                evaluate the effectiveness of the proposed methods
  extract_frequent_terms  create a file with ranked frequent terms found in corpus
  hyperparameter_tuning   tune various classifiers and select the best hyper-parameters on a train dataset
  learn_sim_params        learn parameters, i.e., weights/thresholds, on a train dataset for similarity metrics

Documentation

Source code documentation is available from linkgeoml.github.io.

Acknowledgements

The sim_measures.py file, which is used to generate the train/test datasets and to compute the string similarity measures, is a slightly modified version of the datasetcreator.py file used in Toponym-Matching work, which is under the MIT license.

References

  • Giorgos Giannopoulos, Vassilis Kaffes, Georgios Kostoulas: Learning Advanced Similarities and Training Features for Toponym Interlinking. ECIR (1) 2020: 111-125.

  • Santos, R., Murrieta-Flores, P. and Martins, B., 2018. Learning to combine multiple string similarity metrics for effective toponym matching. International journal of digital earth, 11(9), pp.913-938.

License

LGM-Interlinking is available under the MIT License.

API Reference

View the auto-generated API documentation for information on a specific function, class, or method.

This documentation was last updated on May 28, 2020.