metrics

This module implements a collection of string similarity metrics. Many of these functions were developed by Rui Santos and Alexandre Marinho for their work on Toponym-Matching.

interlinking.sim_measures.skipgram(str1, str2)[source]

Implements Jaccard-skipgram metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
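As an illustration of the idea, a Jaccard score over skip-grams can be sketched as follows. This is a minimal sketch and not the module's implementation: the helper names skip_bigrams and jaccard_skipgram, the restriction to character pairs, and the max_skip parameter are all assumptions made for the example.

```python
def skip_bigrams(text, max_skip=2):
    """Return the set of character pairs with at most `max_skip` characters skipped between them."""
    grams = set()
    for i in range(len(text)):
        # gap == 1 is an ordinary bigram; larger gaps skip gap - 1 characters.
        for gap in range(1, max_skip + 2):
            if i + gap < len(text):
                grams.add(text[i] + text[i + gap])
    return grams


def jaccard_skipgram(a, b, max_skip=2):
    """Jaccard coefficient |A & B| / |A | B| over the two skip-bigram sets (hypothetical helper)."""
    grams_a = skip_bigrams(a, max_skip)
    grams_b = skip_bigrams(b, max_skip)
    if not grams_a and not grams_b:
        # Strings too short to produce any pair: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Because both operands are sets, the score always falls in [0,1], and identical strings of two or more characters score 1.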

interlinking.sim_measures.davies(str1, str2)[source]

Implements Davies de Salles metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.cosine(str1, str2)[source]

Implements Cosine N-Grams metric for n=[2,3].

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
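A cosine score over character n-grams can be sketched as below. This is a minimal sketch under stated assumptions, not the module's code: the names ngram_counts and cosine_ngrams are hypothetical, and the choice to pool 2-grams and 3-grams into a single count vector is one possible reading of "n=[2,3]".

```python
import math
from collections import Counter


def ngram_counts(text, sizes=(2, 3)):
    """Count every character n-gram of `text` for each n in `sizes`."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_ngrams(a, b, sizes=(2, 3)):
    """Cosine of the angle between the two n-gram count vectors (hypothetical helper)."""
    vec_a, vec_b = ngram_counts(a, sizes), ngram_counts(b, sizes)
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm = (math.sqrt(sum(c * c for c in vec_a.values()))
            * math.sqrt(sum(c * c for c in vec_b.values())))
    if norm == 0:
        return 0.0
    # Clamp to guard against floating-point results marginally above 1.
    return min(1.0, dot / norm)
```

Since all counts are non-negative, the cosine cannot be negative, which keeps the score in [0,1].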

interlinking.sim_measures.damerau_levenshtein(str1, str2)[source]

Implements Damerau-Levenshtein metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
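To show how an edit distance can be turned into a score in [0,1], the sketch below computes the restricted Damerau-Levenshtein distance (also known as optimal string alignment) and divides it by the length of the longer string. Both the choice of the restricted variant and this particular normalization are assumptions for illustration; the module may normalize differently. The names osa_distance and dl_similarity are hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def dl_similarity(a, b):
    """Map the distance to [0,1]: 1 means identical, 0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest
```

The distance can never exceed the length of the longer string, so the normalized value stays within [0,1].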

interlinking.sim_measures.jaro(str1, str2)[source]

Implements Jaro metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.jaro_winkler(str1, str2)[source]

Implements Jaro-Winkler metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
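The relationship between the Jaro and Jaro-Winkler scores can be sketched as follows. This is the textbook formulation written out for illustration, not the module's code. The function names jaro_sketch and jaro_winkler_sketch are hypothetical, and the prefix scale of 0.1 and maximum prefix length of 4 are the commonly used defaults, assumed here.

```python
def jaro_sketch(a, b):
    """Textbook Jaro similarity."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq_a = [c for c, m in zip(a, matched_a) if m]
    seq_b = [c for c, m in zip(b, matched_b) if m]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_sketch(a, b, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro_sketch(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

For the classic pair "MARTHA" and "MARHTA", the Jaro score is about 0.944 and the shared prefix "MAR" raises the Jaro-Winkler score to about 0.961.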

interlinking.sim_measures.monge_elkan(str1, str2)[source]

Implements Monge-Elkan metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
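The Monge-Elkan scheme compares strings token by token: each token of the first string is paired with its best-scoring token in the second, and the best scores are averaged. The sketch below illustrates this under stated assumptions. The name monge_elkan_sketch is hypothetical, tokens are split on whitespace, and difflib's ratio is used as a stand-in inner metric only to keep the example self-contained; it is not the inner metric used by the module.

```python
from difflib import SequenceMatcher


def ratio(x, y):
    """Stand-in inner metric (difflib ratio); an assumption, not the module's choice."""
    return SequenceMatcher(None, x, y).ratio()


def monge_elkan_sketch(a, b, inner=ratio):
    """Average, over the tokens of `a`, of the best inner score against any token of `b`."""
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(inner(x, y) for y in tokens_b) for x in tokens_a]
    return sum(best) / len(best)
```

In this basic form the measure is not symmetric: swapping the arguments changes which string's tokens are averaged over, so the two directions can give different scores.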

interlinking.sim_measures.strike_a_match(str1, str2)[source]

Implements Dice Bi-Grams metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
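A Dice coefficient over bigrams can be sketched as below. This is a minimal sketch, not the module's implementation. It assumes that letter pairs are collected per whitespace-separated word after upper-casing, and that repeated pairs are counted with multiplicity. The helper names letter_pairs and dice_bigrams are hypothetical.

```python
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs within each whitespace-separated word."""
    pairs = Counter()
    for word in text.upper().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def dice_bigrams(a, b):
    """Dice coefficient 2 * |A & B| / (|A| + |B|) over the two multisets of letter pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared pair.
    overlap = sum((pairs_a & pairs_b).values())
    return 2.0 * overlap / total
```

For "France" and "French", the shared pairs are FR and NC out of five pairs on each side, giving 2 * 2 / 10 = 0.4.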

interlinking.sim_measures.jaccard(str1, str2)[source]

Implements Jaccard N-Grams metric for n=[2,3].

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
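For illustration, a Jaccard score over character n-grams can be sketched as follows. This is a sketch under stated assumptions rather than the module's code: the names char_ngrams and jaccard_ngrams are hypothetical, and pooling the 2-grams and 3-grams into one set per string is one possible reading of "n=[2,3]".

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_ngrams(a, b, sizes=(2, 3)):
    """Jaccard coefficient |A & B| / |A | B| over the pooled n-gram sets (hypothetical helper)."""
    grams_a = set().union(*(char_ngrams(a, n) for n in sizes))
    grams_b = set().union(*(char_ngrams(b, n) for n in sizes))
    if not grams_a and not grams_b:
        # Strings too short to produce any n-gram: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

For "night" and "nacht" the only shared n-gram is "ht", out of 13 distinct n-grams overall, so the score is 1/13.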

interlinking.sim_measures.soft_jaccard(str1, str2)[source]

Implements Soft-Jaccard metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.sorted_winkler(str1, str2)[source]

Implements Sorted Jaro-Winkler metric.

Parameters

str1, str2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float
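The "sorted" idea can be sketched as a preprocessing step: tokens are put in a canonical order before the strings are compared, so that word order no longer affects the score. The sketch assumes that sorting means ordering whitespace-separated tokens alphabetically. The names sort_tokens and sorted_similarity are hypothetical, and difflib's ratio is only a stand-in for Jaro-Winkler so that the example stays self-contained.

```python
from difflib import SequenceMatcher


def sort_tokens(text):
    """Order the whitespace-separated tokens alphabetically and rejoin them."""
    return " ".join(sorted(text.split()))


def sorted_similarity(a, b, metric=None):
    """Apply `metric` to the token-sorted strings (difflib ratio is a stand-in default)."""
    if metric is None:
        metric = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return metric(sort_tokens(a), sort_tokens(b))
```

With this preprocessing, "new york" and "york new" become the same string and therefore receive the maximum score from any reasonable base metric.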

interlinking.sim_measures.tuned_jaro_winkler(s1, s2, long_tolerance=False)[source]

Implements LGM Jaro-Winkler metric.

Parameters

s1, s2 (str) – Input values in unicode.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.lgm_sim_split(s1, s2, split_thres)[source]

Splits each toponym string, i.e., s1 and s2, into tokens, builds three distinct lists per toponym string, i.e., base, mismatch and frequent, and assigns the produced tokens to these lists. The base list contains the terms that are similar to one of the other toponym’s tokens. The mismatch list contains the terms that have no similar counterpart among the tokens of the other toponym. The frequent list contains the terms that are common in the specified dataset of toponyms.

Parameters
  • s1, s2 (str) – Input values in unicode.

  • split_thres (float) – If the similarity score is above this threshold, the compared terms are identified as base terms, otherwise as mismatch ones.

Returns

Three collections of terms, identified as base, mismatch and frequent respectively. Each collection holds one list per toponym, i.e., a for s1 and b for s2.

Return type

tuple of (dict of list of str, dict of list of str, dict of list of str)
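The shape of the returned value can be illustrated with a simplified sketch. It is not the module's algorithm, and everything in it is an assumption made for the example: the function name split_terms, the FREQUENT_TERMS set (the real frequent terms come from the dataset), lower-casing, whitespace tokenization, and difflib's ratio as the stand-in similarity.

```python
from difflib import SequenceMatcher

# Hypothetical frequent terms; in practice these depend on the toponym dataset.
FREQUENT_TERMS = {"street", "saint", "river"}


def split_terms(s1, s2, split_thres, frequent=FREQUENT_TERMS):
    """Assign each token of s1 ('a') and s2 ('b') to a base, mismatch or frequent list."""
    def similar(x, y):
        # Stand-in similarity; an assumption, not the metric used by the module.
        return SequenceMatcher(None, x, y).ratio()

    tokens = {"a": s1.lower().split(), "b": s2.lower().split()}
    base = {"a": [], "b": []}
    mismatch = {"a": [], "b": []}
    special = {"a": [], "b": []}
    for side, other in (("a", "b"), ("b", "a")):
        candidates = [t for t in tokens[other] if t not in frequent]
        for term in tokens[side]:
            if term in frequent:
                special[side].append(term)
            elif any(similar(term, c) > split_thres for c in candidates):
                base[side].append(term)      # similar to a term of the other toponym
            else:
                mismatch[side].append(term)  # no similar counterpart found
    return base, mismatch, special
```

For example, split_terms("Baker Street", "Bakers Street North", 0.7) places "baker" and "bakers" in the base lists, "north" in the mismatch list of b, and "street" in both frequent lists.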

interlinking.sim_measures.score_per_term(base_t, mis_t, special_t, metric)[source]

Computes three distinct similarity scores, one for each list of terms.

Parameters
  • base_t, mis_t, special_t (list of str) – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.

  • metric (str) – The metric used to calculate the similarity score of each of the three lists individually.

Returns

A similarity score for every list of terms. Each score is normalized in range [0,1].

Return type

tuple of (float, float, float)

interlinking.sim_measures.weighted_sim(base_t, mis_t, special_t, metric, avg)[source]

Re-calculates the significance weights for each list of terms taking into account their lengths.

Parameters
  • base_t, mis_t, special_t (list of str) – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.

  • metric (str) – The metric used to calculate the similarity score of each of the three lists individually.

  • avg (bool) – If value is True, the three individual similarity scores (one per term list) are properly weighted, otherwise each term list’s score is of equal significance to the final score.

Returns

A similarity score normalized in range [0,1].

Return type

float
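One way to read the length-based weighting is sketched below. This is only an assumed interpretation for illustration, not the module's formula. The name combine_scores is hypothetical, and taking each list's weight to be the total number of characters of its terms is an assumption.

```python
def combine_scores(scores, term_lists, avg):
    """Merge the per-list scores into one value in [0,1].

    scores     -- (base, mismatch, frequent) similarity scores, each in [0,1]
    term_lists -- the three lists of terms the scores were computed from
    avg        -- True: weight each score by the length of its terms (assumed scheme);
                  False: give every score equal weight
    """
    if avg:
        weights = [sum(len(term) for term in terms) for terms in term_lists]
    else:
        weights = [1.0] * len(scores)
    total = sum(weights)
    if total == 0:
        return 0.0
    # A weighted mean of values in [0,1] is itself in [0,1].
    return sum(w * s for w, s in zip(weights, scores)) / total
```

Under this reading, a list that holds few or short terms contributes less to the final score when avg is True, while with avg set to False the result is the plain mean of the three scores.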

interlinking.sim_measures.lgm_sim(str1, str2, metric='damerau_levenshtein', avg=False)[source]

Implements LGM-Sim metric.

Parameters
  • str1, str2 (str) – Input values in unicode.

  • metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the metrics documented above can be used. Default metric is damerau_levenshtein.

  • avg (bool, optional) – If value is True, the three individual similarity scores (one per term list) are properly weighted, otherwise each term list’s score is of equal significance to the final score. Default value is False.

Returns

A similarity score normalized in range [0,1].

Return type

float

interlinking.sim_measures.avg_lgm_sim(str1, str2, metric='damerau_levenshtein')[source]

Implements LGM-Sim metric where avg flag is True.

Parameters
  • str1, str2 (str) – Input values in unicode.

  • metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the metrics documented above can be used. Default metric is damerau_levenshtein.

Returns

A similarity score normalized in range [0,1].

Return type

float
