metrics

This module implements various similarity metrics used across different scenarios. Many of these functions were developed by Rui Santos and Alexandre Marinho for their work in Toponym-Matching.

interlinking.sim_measures.skipgram(str1, str2)

Implements Jaccard-skipgram metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
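
As a rough illustration of the idea, and not the module's own code, the sketch below computes a Jaccard coefficient over character skip-bigrams. The helper names `skip_bigrams` and `jaccard_skipgram`, the gap range of 0 to 2 characters, and the treatment of empty inputs are all assumptions made for this example.

```python
def skip_bigrams(text, max_gap=2):
    """Return the set of character pairs separated by 0..max_gap characters."""
    grams = set()
    for i in range(len(text)):
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(text):
                grams.add(text[i] + text[j])
    return grams


def jaccard_skipgram(str1, str2, max_gap=2):
    """Jaccard coefficient (intersection size / union size) over skip-bigram sets."""
    a, b = skip_bigrams(str1, max_gap), skip_bigrams(str2, max_gap)
    if not a and not b:
        return 1.0  # assumption: two inputs without any bigram count as identical
    return len(a & b) / len(a | b)
```

With `max_gap=1`, the string "abc" yields the pairs "ab", "ac" and "bc", so a single inserted or dropped character still leaves most pairs shared.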

interlinking.sim_measures.davies(str1, str2)

Implements Davies de Salles metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.cosine(str1, str2)

Implements Cosine N-Grams metric for n=[2,3].

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
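
A minimal sketch of cosine similarity over character n-gram counts follows. It assumes that bigrams and trigrams are pooled into a single count vector; how the module actually combines n=2 and n=3 is not stated here, and the names `char_ngrams` and `cosine_ngrams` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, sizes=(2, 3)):
    """Count the character n-grams of every size listed in `sizes`."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_ngrams(str1, str2, sizes=(2, 3)):
    """Cosine of the angle between the two n-gram count vectors."""
    a, b = char_ngrams(str1, sizes), char_ngrams(str2, sizes)
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    # Assumption: an input too short to produce any n-gram scores 0.
    return dot / norm if norm else 0.0
```

Because the counts are never negative, the result stays within [0,1].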

interlinking.sim_measures.damerau_levenshtein(str1, str2)

Implements Damerau-Levenshtein metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
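
For orientation, here is a sketch of one common way to turn an edit distance into a score in [0,1]. It uses the optimal string alignment variant of Damerau-Levenshtein and divides by the longer input's length; both choices, along with the names `osa_distance` and `normalized_similarity`, are assumptions and may differ from what the module does.

```python
def osa_distance(s1, s2):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def normalized_similarity(s1, s2):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - osa_distance(s1, s2) / longest
```

Swapping two adjacent characters costs a single edit, so "abcd" against "abdc" gives a distance of 1 and a score of 0.75.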

interlinking.sim_measures.jaro(str1, str2)

Implements Jaro metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
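
The textbook Jaro definition can be sketched as below. This is a standalone illustration under the usual formulation (matching window of half the longer length minus one); the function name `jaro_similarity` is hypothetical and the module's implementation may handle edge cases differently.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0
```

For the classic pair "MARTHA" and "MARHTA", all six characters match with one transposition, which gives (1 + 1 + 5/6) / 3.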

interlinking.sim_measures.jaro_winkler(str1, str2)

Implements Jaro-Winkler metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
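
Jaro-Winkler is commonly described as a Jaro score raised in proportion to the length of the common prefix. The sketch below shows only that adjustment, taking an already computed Jaro score as input. The name `winkler_boost`, the scaling factor of 0.1 and the prefix cap of 4 follow the usual textbook values and are assumptions about this module.

```python
def winkler_boost(jaro_score, s1, s2, scaling=0.1, max_prefix=4):
    """Raise a Jaro score according to the length of the shared prefix."""
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # The boost shrinks as the Jaro score approaches 1, so the result stays in [0, 1].
    return jaro_score + prefix * scaling * (1.0 - jaro_score)
```

Strings that share no prefix keep their Jaro score unchanged.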

interlinking.sim_measures.monge_elkan(str1, str2)

Implements Monge-Elkan metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
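
In its usual formulation, Monge-Elkan averages, over the tokens of the first string, the best similarity each token reaches against the tokens of the second string. The sketch below follows that formulation and uses `difflib.SequenceMatcher` from the standard library as a stand-in inner metric; the inner metric the module actually applies is not stated here, and the names `inner_sim` and `monge_elkan_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def inner_sim(a, b):
    """Stand-in token similarity; any metric returning a value in [0, 1] would do."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan_sketch(str1, str2, sim=inner_sim):
    """Average, over the tokens of str1, of the best match among the tokens of str2."""
    tokens1, tokens2 = str1.split(), str2.split()
    if not tokens1 or not tokens2:
        return 0.0  # assumption: an empty input matches nothing
    return sum(max(sim(t1, t2) for t2 in tokens2) for t1 in tokens1) / len(tokens1)
```

In this form the score is insensitive to token order but not symmetric: swapping the two arguments can change the result when the strings have different numbers of tokens.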

interlinking.sim_measures.strike_a_match(str1, str2)

Implements Dice Bi-Grams metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
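
A Dice coefficient over character bigrams can be sketched as follows. The example builds the bigrams word by word and ignores case, which is one common reading of the "strike a match" approach; those details and the helper names `word_bigrams` and `dice_bigrams` are assumptions.

```python
from collections import Counter


def word_bigrams(text):
    """Multiset of adjacent character pairs, taken word by word and upper-cased."""
    pairs = Counter()
    for word in text.upper().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def dice_bigrams(str1, str2):
    """Dice coefficient: twice the shared bigrams over the total number of bigrams."""
    a, b = word_bigrams(str1), word_bigrams(str2)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # assumption: inputs without any bigram score 0
    shared = sum((a & b).values())  # multiset intersection keeps the minimum counts
    return 2.0 * shared / total
```

"France" and "French" each produce five bigrams and share two of them (FR and NC), so the score is 2 * 2 / 10 = 0.4.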

interlinking.sim_measures.jaccard(str1, str2)

Implements Jaccard N-Grams metric for n=[2,3].

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.soft_jaccard(str1, str2)

Implements Soft-Jaccard metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.sorted_winkler(str1, str2)

Implements Sorted Jaro-Winkler metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
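
A plausible reading of the "sorted" variant is that the tokens of each string are put in alphabetical order before the comparison, so that word order no longer affects the score. The sketch below shows only that preprocessing step with a pluggable metric. The lower-casing, the whitespace tokenization and the names `sort_tokens` and `sorted_similarity` are assumptions, not a description of the module's code.

```python
def sort_tokens(text):
    """Lower-case, split on whitespace and re-join the tokens in alphabetical order."""
    return " ".join(sorted(text.lower().split()))


def sorted_similarity(str1, str2, metric):
    """Apply `metric` (any callable returning a float) to the token-sorted strings."""
    return metric(sort_tokens(str1), sort_tokens(str2))
```

After sorting, "Lake Geneva" and "geneva lake" become the same string, so any metric that scores identical inputs as 1 will report a perfect match.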

interlinking.sim_measures.tuned_jaro_winkler(s1, s2, long_tolerance=False)

Implements LGM Jaro-Winkler metric.

Parameters

s1, s2 (str) – Input values in unicode.

long_tolerance (bool, optional) – Default value is False.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.lgm_sim_split(s1, s2, split_thres)

Splits each toponym string, i.e., s1 and s2, into tokens, builds three distinct lists per toponym string, i.e., base, mismatch and frequent, and assigns the produced tokens to these lists. The base list contains the terms that are similar to one of the other toponym’s tokens, the mismatch list contains the terms that have no similar counterpart among the tokens of the other toponym, and the frequent list contains the terms that are common in the specified dataset of toponyms.

Parameters

s1, s2 (str) – Input values in unicode.

split_thres (float) – If the similarity score is above this threshold, the compared terms are identified as base terms, otherwise as mismatch ones.

Returns

Three dictionaries holding the terms identified as base, mismatch and frequent, respectively. Each dictionary keeps one list per toponym, i.e., a for s1 and b for s2.

Return type

tuple of (dict of list of str, dict of list of str, dict of list of str)
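
The splitting described above can be pictured with the simplified sketch below. It is not the module's algorithm: the frequent-term set `FREQUENT_TERMS`, the use of `difflib.SequenceMatcher` as the comparison metric, the lower-casing and the function name `split_terms` are all hypothetical choices made so that the example is self-contained.

```python
from difflib import SequenceMatcher

# Hypothetical set of terms that are frequent in a toponym dataset.
FREQUENT_TERMS = {"street", "saint", "lake"}


def split_terms(s1, s2, split_thres, frequent=FREQUENT_TERMS):
    """Assign the tokens of each toponym to base, mismatch and frequent lists."""
    base = {"a": [], "b": []}
    mismatch = {"a": [], "b": []}
    special = {"a": [], "b": []}
    tokens = {"a": s1.lower().split(), "b": s2.lower().split()}

    # Frequent terms go to their own list and take no part in the matching below.
    rest = {}
    for key, terms in tokens.items():
        special[key] = [t for t in terms if t in frequent]
        rest[key] = [t for t in terms if t not in frequent]

    for key, other in (("a", "b"), ("b", "a")):
        for term in rest[key]:
            best = max((SequenceMatcher(None, term, o).ratio() for o in rest[other]),
                       default=0.0)
            # Above the threshold: base term; otherwise: mismatch term.
            (base if best > split_thres else mismatch)[key].append(term)
    return base, mismatch, special
```

For "Lake Geneva North" and "Geneve Lake" with a threshold of 0.7, this sketch places "geneva" and "geneve" in the base lists, "north" in the mismatch list of the first toponym, and "lake" in both frequent lists.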

interlinking.sim_measures.score_per_term(base_t, mis_t, special_t, metric)

Computes a separate similarity score for each of the three lists of terms.

Parameters

base_t, mis_t, special_t (list of str) – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.

metric (str) – Indicates the metric to utilize in order to calculate the similarity score by comparing individually the three lists.

Returns

A similarity score for every list of terms. Each score is normalized in range [0,1].

Return type

tuple of (float, float, float)

interlinking.sim_measures.weighted_sim(base_t, mis_t, special_t, metric, avg)

Re-calculates the significance weights for each list of terms taking into account their lengths.

Parameters

base_t, mis_t, special_t (list of str) – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.

metric (str) – Indicates the metric to utilize in order to calculate the similarity score by comparing individually the three lists.

avg (bool) – If True, the three individual similarity scores (one per term list) are weighted; otherwise, each term list’s score contributes equally to the final score.

Returns

A similarity score normalized in range [0,1].

Return type

float
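
To make the role of the avg flag concrete, the sketch below combines three per-list scores either with length-proportional weights or with equal weights. The weighting scheme (each list's share of the total character length) and the name `combine_scores` are assumptions; the exact weights used by the module are not stated in this description.

```python
def combine_scores(scores, lengths, avg):
    """Combine per-list scores into one value in [0, 1].

    scores  -- (base, mismatch, frequent) similarity scores, each in [0, 1]
    lengths -- total character length of the terms in each list (assumed measure)
    avg     -- True: weight each score by its list's share of the total length;
               False: give the three scores equal weight
    """
    if avg and sum(lengths) > 0:
        total = float(sum(lengths))
        weights = [n / total for n in lengths]
    else:
        weights = [1.0 / len(scores)] * len(scores)
    # The weights sum to 1, so the combined score stays within [0, 1].
    return sum(w * s for w, s in zip(weights, scores))
```

With scores (1.0, 0.0, 0.5) and lengths (6, 2, 2), the weighted result is 0.6 * 1.0 + 0.2 * 0.0 + 0.2 * 0.5 = 0.7, while the equal-weight result is 0.5.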

interlinking.sim_measures.lgm_sim(str1, str2, metric='damerau_levenshtein', avg=False)

Implements LGM-Sim metric.

Parameters

str1, str2 (str) – Input values in unicode.

metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the metrics documented above can be used. Default metric is damerau_levenshtein.

avg (bool, optional) – If True, the three individual similarity scores (one per term list) are weighted; otherwise, each term list’s score contributes equally to the final score. Default value is False.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.avg_lgm_sim(str1, str2, metric='damerau_levenshtein')

Implements LGM-Sim metric where avg flag is True.

Parameters

str1, str2 (str) – Input values in unicode.

metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the metrics documented above can be used. Default metric is damerau_levenshtein.

Returns

A similarity score normalized in range [0,1].

Return type

float
