processing

Features

class poi_interlinking.processing.features.Features

This class loads the dataset and the frequent terms, and builds the features used as input to the supported classification groups:

basic: similarity features based on basic similarity measures.

basic_sorted: similarity features based on sorted versions of the basic similarity measures used in the basic group.

lgm: similarity features based on variations of LGM-Sim similarity measures.

See also

compute_features()
Details on the metrics each classification group implements.

build()

Builds features according to the value of the classification_method parameter and returns the values (fX, y) as ndarrays of floats.

Returns

fX (ndarray) – The computed features that will be used as input to ML classifiers.

y (ndarray) – Binary labels {True, False} to train the classifiers.

compute_features(s1, s2, sorted=True, lgm_sims=True)

Depending on the group assigned to parameter classification_method, this method builds an ndarray of the following groups of features:

basic: various similarity measures, i.e., damerau_levenshtein(), jaro(), jaro_winkler() and the reversed one, sorted_winkler(), cosine(), jaccard(), strike_a_match(), monge_elkan(), soft_jaccard(), davies(), tuned_jaro_winkler() and the reversed one, skipgrams().

basic_sorted: sorted versions of the similarity measures used in the basic group, except for sorted_winkler().

lgm: LGM-Sim variations that use, as their internal measure, the similarity measures of the basic group, except for sorted_winkler().

Parameters

s1, s2 (str) – Input toponyms.

sorted (bool, optional) – If True, features are built for both the basic and basic_sorted groups; if False, only for the basic group.

lgm_sims (bool, optional) – Whether to build features for the lgm group.

Returns

A list (vector) of features.

Return type

list
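
As a rough illustration of how the basic and basic_sorted groups relate, the sketch below builds a tiny feature vector from a single stand-in measure. It does not call the package: toy_similarity, sort_terms and toy_features are hypothetical names, difflib's ratio stands in for the real measures, and alphabetical token sorting is only an assumption about what "sorted" means here.

```python
from difflib import SequenceMatcher


def toy_similarity(s1, s2):
    """Stand-in measure in [0, 1]; NOT one of the package's metrics."""
    return SequenceMatcher(None, s1, s2).ratio()


def sort_terms(s):
    """Assumed 'sorted' form: lower-cased tokens in alphabetical order."""
    return " ".join(sorted(s.lower().split()))


def toy_features(s1, s2, use_sorted=True):
    """Return [basic] or [basic, basic_sorted] scores for one toponym pair."""
    feats = [toy_similarity(s1.lower(), s2.lower())]
    if use_sorted:
        # The sorted variant compares the strings after reordering their tokens.
        feats.append(toy_similarity(sort_terms(s1), sort_terms(s2)))
    return feats
```

With this stand-in, toy_features("Hotel Athens", "Athens Hotel") gives a sorted score of 1.0 because both strings reduce to the same token order, while the unsorted score stays below 1.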

poi_interlinking.processing.spatial.matching.get_distance(p1, p2)

Finds the minimum distance between two points.

Parameters

p1 (shapely geometric object) – The first point

p2 (shapely geometric object) – The second point

Returns

The minimum distance. The value is expressed in the units of the geometric objects’ projection.

Return type

list

class poi_interlinking.processing.spatial.matching.Projection(src='4326', dest='3857')

Transforms the coordinates of a geometric object between the specified projections.

change_projection(lon, lat)

Transforms the coordinates of a geometric object to the new projection.

Parameters

lon (float) – The longitude of the geometric Point.

lat (float) – The latitude of the geometric Point.

Returns

A shapely Point on the new projection.

Return type

shapely Point
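
For the default pair of projections (EPSG:4326 to EPSG:3857), the conversion is the spherical Web Mercator formula. The sketch below is a standard-library illustration of that formula on plain tuples, not the class's implementation; to_web_mercator and planar_distance are hypothetical helper names.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by EPSG:3857


def to_web_mercator(lon, lat):
    """Convert EPSG:4326 degrees to EPSG:3857 metres (spherical Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def planar_distance(p1, p2):
    """Euclidean distance between two (x, y) tuples, in the coordinates' units."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])
```

A planar distance computed on the projected coordinates is in Web Mercator metres, which overstate ground distance increasingly with latitude; on unprojected EPSG:4326 coordinates the same calculation would be in degrees.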

Similarity measures

This module implements various similarity metrics used across different scenarios. Many of these functions were developed by Rui Santos and Alexandre Marinho for their work in Toponym-Matching.

poi_interlinking.processing.sim_measures.skipgram(str1, str2)

Implements the Jaccard-skipgram metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
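
One common reading of a Jaccard-skipgram score is the Jaccard coefficient over sets of character pairs that may skip intermediate characters. The sketch below follows that reading; the max_skip parameter and both function names are assumptions for illustration, and the package may define its skip-grams differently.

```python
def skip_bigrams(s, max_skip=2):
    """Set of character pairs s[i] + s[j] with at most `max_skip` characters between them."""
    pairs = set()
    for i in range(len(s)):
        for j in range(i + 1, min(i + 2 + max_skip, len(s))):
            pairs.add(s[i] + s[j])
    return pairs


def jaccard_skipgram(str1, str2, max_skip=2):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the two skip-bigram sets."""
    a, b = skip_bigrams(str1, max_skip), skip_bigrams(str2, max_skip)
    if not a and not b:
        # Strings too short to produce any pair: fall back to exact equality.
        return 1.0 if str1 == str2 else 0.0
    return len(a & b) / len(a | b)
```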

poi_interlinking.processing.sim_measures.davies(str1, str2)

Implements the Davies de Salles metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

poi_interlinking.processing.sim_measures.cosine(str1, str2)

Implements the Cosine N-Grams metric for n=[2,3].

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
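
To make the idea concrete, the sketch below computes a cosine similarity over character n-gram counts. Pooling the bigram and trigram counts into one vector is an assumption about how n=[2,3] is combined, and ngram_counts and cosine_ngrams are illustrative names rather than the package's code.

```python
import math
from collections import Counter


def ngram_counts(s, sizes=(2, 3)):
    """Count the character n-grams of `s` for every n in `sizes`."""
    counts = Counter()
    for n in sizes:
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def cosine_ngrams(str1, str2, sizes=(2, 3)):
    """Cosine of the angle between the two n-gram count vectors."""
    a, b = ngram_counts(str1, sizes), ngram_counts(str2, sizes)
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```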

poi_interlinking.processing.sim_measures.damerau_levenshtein(str1, str2)

Implements the Damerau-Levenshtein metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
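
An edit distance is unbounded, so it has to be rescaled to yield a score in [0,1]. A minimal sketch of one way to do this, assuming the restricted (optimal string alignment) variant of Damerau-Levenshtein and normalization by the longer string's length; the package may use a different variant or normalization.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # Deletion, insertion, substitution.
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def dl_similarity(str1, str2):
    """Map the distance to [0, 1]: 1.0 for identical strings, 0.0 for maximal distance."""
    longest = max(len(str1), len(str2))
    return 1.0 if longest == 0 else 1.0 - osa_distance(str1, str2) / longest
```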

poi_interlinking.processing.sim_measures.jaro(str1, str2)

Implements the Jaro metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

poi_interlinking.processing.sim_measures.jaro_winkler(str1, str2)

Implements the Jaro-Winkler metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
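
For reference, the textbook definitions of Jaro and Jaro-Winkler can be sketched as follows. This is a plain reimplementation for illustration, not the package's code; jaro_sketch and jaro_winkler_sketch are hypothetical names, and the prefix scale of 0.1 with a 4-character prefix cap are the conventional defaults rather than values taken from this library.

```python
def jaro_sketch(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched, equal character inside the sliding window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted by the length of the common prefix."""
    score = jaro_sketch(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

Some implementations apply the prefix boost only when the Jaro score exceeds a threshold; this sketch always applies it.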

poi_interlinking.processing.sim_measures.monge_elkan(str1, str2)

Implements the Monge-Elkan metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
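
Monge-Elkan averages, over the tokens of the first string, the best similarity each token achieves against any token of the second string. The sketch below shows that scheme; the inner measure (difflib's ratio) is a stand-in chosen for self-containment, not the internal measure the package uses.

```python
from difflib import SequenceMatcher


def inner_sim(t1, t2):
    """Stand-in token similarity in [0, 1]; an assumption, not the package's choice."""
    return SequenceMatcher(None, t1, t2).ratio()


def monge_elkan_sketch(str1, str2, sim=inner_sim):
    """Mean, over tokens of str1, of the best match found among tokens of str2."""
    tokens1, tokens2 = str1.split(), str2.split()
    if not tokens1 or not tokens2:
        return 0.0
    return sum(max(sim(t1, t2) for t2 in tokens2) for t1 in tokens1) / len(tokens1)
```

Note that the score is asymmetric: swapping the arguments can change the result, because only the first string's tokens are averaged over.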

poi_interlinking.processing.sim_measures.strike_a_match(str1, str2)

Implements the Dice Bi-Grams metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
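
The Dice bigram score is twice the number of shared adjacent-letter pairs divided by the total number of pairs in both strings. A minimal sketch of the well-known "strike a match" formulation, written independently of the package; letter_pairs and dice_bigrams are illustrative names, and upper-casing the input is an assumption.

```python
def letter_pairs(s):
    """Adjacent character pairs of each whitespace-separated word, upper-cased."""
    pairs = []
    for word in s.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def dice_bigrams(str1, str2):
    """Dice coefficient 2 * |shared pairs| / (|pairs1| + |pairs2|)."""
    pairs1, pairs2 = letter_pairs(str1), letter_pairs(str2)
    total = len(pairs1) + len(pairs2)
    if total == 0:
        return 0.0
    remaining = list(pairs2)
    shared = 0
    for pair in pairs1:
        if pair in remaining:
            shared += 1
            remaining.remove(pair)  # each pair may be matched only once
    return 2.0 * shared / total
```

For "France" and "French" the shared pairs are FR and NC out of five pairs each, so the score is 2 * 2 / 10 = 0.4.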

poi_interlinking.processing.sim_measures.jaccard(str1, str2)

Implements the Jaccard N-Grams metric for n=[2,3].

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

poi_interlinking.processing.sim_measures.soft_jaccard(str1, str2)

Implements the Soft-Jaccard metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
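
A soft Jaccard relaxes the exact-match requirement of the Jaccard coefficient: two tokens count as shared when they are similar enough. The sketch below is one possible formulation under stated assumptions (greedy one-to-one pairing, difflib's ratio as the token measure, a hypothetical threshold of 0.8); it is not taken from the package.

```python
from difflib import SequenceMatcher


def token_sim(t1, t2):
    """Stand-in token similarity in [0, 1]; not the package's internal measure."""
    return SequenceMatcher(None, t1, t2).ratio()


def soft_jaccard_sketch(str1, str2, threshold=0.8):
    """Jaccard over token sets where near-identical tokens count as shared."""
    tokens1 = sorted(set(str1.lower().split()))
    unmatched = sorted(set(str2.lower().split()))
    size2 = len(unmatched)
    if not tokens1 and not unmatched:
        return 1.0
    soft_common = 0
    for t1 in tokens1:
        # Greedily pair t1 with its most similar still-unmatched token.
        best = max(unmatched, key=lambda t2: token_sim(t1, t2), default=None)
        if best is not None and token_sim(t1, best) >= threshold:
            soft_common += 1
            unmatched.remove(best)
    return soft_common / (len(tokens1) + size2 - soft_common)
```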

poi_interlinking.processing.sim_measures.sorted_winkler(str1, str2)

Implements the Sorted Jaro-Winkler metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

poi_interlinking.processing.sim_measures.tuned_jaro_winkler(s1, s2, long_tolerance=False)

Implements the Tuned Jaro-Winkler metric.

Parameters

s1, s2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

poi_interlinking.processing.sim_measures.lgm_sim_split(s1, s2, split_thres)

Splits each toponym string, i.e., s1 and s2, into tokens, builds three distinct lists per toponym string, i.e., base, mismatch and frequent, and assigns the produced tokens to these lists. The base list contains the terms that are similar to one of the other toponym’s tokens, the mismatch list contains the terms that have no similar counterpart among the tokens of the other toponym, and the frequent list contains the terms that are common in the specified dataset of toponyms.

Parameters

s1, s2 (str) – Input values in unicode.

split_thres (float) – If the similarity score is above this threshold, the compared terms are identified as base terms, otherwise as mismatch ones.

Returns

Three lists of terms identified as base, mismatch or frequent respectively per toponym, i.e., a for s1 and b for s2.

Return type

tuple of (dict of list of str, dict of list of str, dict of list of str)
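
The splitting step described above can be sketched roughly as follows. Everything here is illustrative: toy_split is a hypothetical name, FREQUENT_TERMS is a made-up frequent-term list, difflib's ratio stands in for the internal metric, and assigning frequent terms before comparing the rest is an assumption about the order of operations.

```python
from difflib import SequenceMatcher

FREQUENT_TERMS = {"hotel", "cafe", "restaurant"}  # hypothetical frequent-term list


def toy_split(s1, s2, split_thres):
    """Return (base, mismatch, frequent) dicts keyed by 'a' (s1) and 'b' (s2)."""
    base = {"a": [], "b": []}
    mismatch = {"a": [], "b": []}
    frequent = {"a": [], "b": []}
    rest = {"a": [], "b": []}
    # First pass: pull out the dataset-wide frequent terms.
    for key, s in (("a", s1), ("b", s2)):
        for term in s.lower().split():
            if term in FREQUENT_TERMS:
                frequent[key].append(term)
            else:
                rest[key].append(term)
    # Second pass: a term is 'base' if some term of the other toponym is similar enough.
    for key, other in (("a", "b"), ("b", "a")):
        for term in rest[key]:
            best = max(
                (SequenceMatcher(None, term, t).ratio() for t in rest[other]),
                default=0.0,
            )
            if best > split_thres:
                base[key].append(term)
            else:
                mismatch[key].append(term)
    return base, mismatch, frequent
```

For example, splitting "Hotel Athens Plaza" and "Athens Hotel" with a threshold of 0.7 puts "athens" in both base lists, "plaza" in the mismatch list of the first toponym, and "hotel" in both frequent lists.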

poi_interlinking.processing.sim_measures.score_per_term(base_t, mis_t, special_t, metric)

Computes three distinct similarity scores, one for each list of terms.

Parameters

base_t, mis_t, special_t – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.

metric (str) – Indicates the metric used to calculate the similarity score by comparing the three lists individually.

Returns

A similarity score for every list of terms. Each score is normalized in range [0,1].

Return type

tuple of (float, float, float)

poi_interlinking.processing.sim_measures.weighted_sim(base_t, mis_t, special_t, metric, avg)

Re-calculates the significance weights for each list of terms, taking into account their lengths.

Parameters

base_t, mis_t, special_t – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.

metric (str) – Indicates the metric used to calculate the similarity score by comparing the three lists individually.

avg (bool) – If True, the three individual similarity scores (one per term list) are properly weighted; otherwise each term list’s score is of equal significance to the final score.

Returns

A similarity score normalized in range [0,1].

Return type

float
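
The difference between the two modes of avg can be illustrated with a small sketch. The weighting used here (each score weighted by the number of terms in its list) is purely an assumption made to show the idea of length-aware weights; the actual weights applied by weighted_sim are not documented in this section, and combine_scores is a hypothetical name.

```python
def combine_scores(scores, lengths, avg):
    """Combine (base, mismatch, frequent) scores into one value in [0, 1].

    scores  -- three similarity scores, each in [0, 1]
    lengths -- number of terms in the corresponding lists (assumed weighting basis)
    avg     -- True: length-weighted mean; False: plain mean of the three scores
    """
    if avg:
        total = sum(lengths)
        if total == 0:
            return 0.0
        return sum(s * n for s, n in zip(scores, lengths)) / total
    return sum(scores) / len(scores)
```

With scores (1.0, 0.0, 1.0) and list lengths (2, 1, 1), the length-weighted result is 0.75, whereas the equal-significance result is 2/3.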

poi_interlinking.processing.sim_measures.lgm_sim(str1, str2, metric='damerau_levenshtein', avg=False)

Implements the LGM-Sim metric.

Parameters

str1, str2 (str) – Input values in unicode.

metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the metrics listed above can be used. Default metric is damerau_levenshtein.

avg (bool, optional) – If True, the three individual similarity scores (one per term list) are properly weighted; otherwise each term list’s score is of equal significance to the final score. Default value is False.

Returns

A similarity score normalized in range [0,1].

Return type

float

poi_interlinking.processing.sim_measures.avg_lgm_sim(str1, str2, metric='damerau_levenshtein')

Implements the LGM-Sim metric with the avg flag set to True.

Parameters

str1, str2 (str) – Input values in unicode.

metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the metrics listed above can be used. Default metric is damerau_levenshtein.

Returns

A similarity score normalized in range [0,1].

Return type

float
