metrics
This module implements various similarity metrics used across different scenarios. Many of these functions were developed by Rui Santos and Alexandre Marinho for their work in Toponym-Matching.
interlinking.sim_measures.skipgram(str1, str2)
Implements Jaccard-skipgram metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
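
As an illustration of the idea behind a Jaccard-skipgram score, here is a minimal standalone sketch. It is not the library's implementation: the helper names and the `max_skip` parameter are assumptions made for this example, and the gap sizes the library actually uses may differ.

```python
def skip_bigrams(text, max_skip=1):
    """Collect character pairs separated by at most `max_skip` characters.

    `max_skip` is a hypothetical parameter chosen for this sketch.
    """
    grams = set()
    for i in range(len(text)):
        for gap in range(max_skip + 1):
            j = i + gap + 1
            if j < len(text):
                grams.add(text[i] + text[j])
    return grams


def jaccard_skipgram(str1, str2, max_skip=1):
    """Jaccard coefficient over the two skip-bigram sets, in [0, 1]."""
    a = skip_bigrams(str1, max_skip)
    b = skip_bigrams(str2, max_skip)
    if not a and not b:
        # Neither string is long enough to produce a pair.
        return 1.0 if str1 == str2 else 0.0
    return len(a & b) / len(a | b)
```

With `max_skip=1`, the string "abc" yields the pairs "ab", "ac" and "bc", so it shares only "ab" with "abd".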
interlinking.sim_measures.davies(str1, str2)
Implements Davies de Salles metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
interlinking.sim_measures.cosine(str1, str2)
Implements Cosine N-Grams metric for n=[2,3].
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
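
A cosine score over character n-grams can be sketched as follows. This is an illustrative standalone version, not the library's code: it assumes that the 2-gram and 3-gram counts are pooled into a single vector, which may not match how the library combines the two sizes.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count the character n-grams of `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_ngrams(str1, str2, sizes=(2, 3)):
    """Cosine similarity of pooled n-gram count vectors (assumed pooling)."""
    v1, v2 = Counter(), Counter()
    for n in sizes:
        v1.update(char_ngrams(str1, n))
        v2.update(char_ngrams(str2, n))
    dot = sum(count * v2[gram] for gram, count in v1.items())
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    # An empty vector has no direction, so the score falls back to 0.
    return dot / norm if norm else 0.0
```

Because the counts are never negative, the result stays within [0, 1].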
interlinking.sim_measures.damerau_levenshtein(str1, str2)
Implements Damerau-Levenshtein metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
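
One common way to turn an edit distance into a score in [0,1] is to divide by the length of the longer string. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein together with that normalization. Both choices are assumptions for illustration, and the library may normalize differently.

```python
def osa_distance(a, b):
    """Edit distance with insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def damerau_levenshtein_sim(str1, str2):
    """Similarity in [0, 1]: 1 minus distance over the longer length (assumed)."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(str1, str2) / longest
```

For example, "abcd" and "abdc" differ by one transposition, which gives a distance of 1 and a score of 0.75.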
interlinking.sim_measures.jaro(str1, str2)
Implements Jaro metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
interlinking.sim_measures.jaro_winkler(str1, str2)
Implements Jaro-Winkler metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
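
To show how the Jaro and Jaro-Winkler scores relate, here is a standalone sketch of the textbook definitions. The function names, the prefix scale of 0.1 and the prefix cap of 4 are the conventional choices and are assumptions here; the library's own parameters are not documented on this page.

```python
def jaro_sim(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_sim(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted by the length of the common prefix."""
    score = jaro_sim(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

For the classic pair "MARTHA" and "MARHTA", the Jaro score is about 0.944 and the shared prefix "MAR" lifts the Jaro-Winkler score to about 0.961.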
interlinking.sim_measures.monge_elkan(str1, str2)
Implements Monge-Elkan metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
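
The Monge-Elkan scheme averages, over the tokens of the first string, the best score each token achieves against any token of the second string. The sketch below illustrates this with `difflib.SequenceMatcher` as the inner token metric. That inner metric is an assumption for this example and is not necessarily the one the library uses.

```python
from difflib import SequenceMatcher


def token_sim(a, b):
    """Inner token similarity (stand-in chosen for this sketch)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan_sim(str1, str2, inner=token_sim):
    """Average of the best inner score for each token of `str1`."""
    tokens1, tokens2 = str1.split(), str2.split()
    if not tokens1 or not tokens2:
        return 1.0 if tokens1 == tokens2 else 0.0
    best = [max(inner(t1, t2) for t2 in tokens2) for t1 in tokens1]
    return sum(best) / len(best)
```

Token order does not matter, so "new york" and "york new" score 1.0. Note that the basic scheme is not symmetric: swapping the arguments can change the result.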
interlinking.sim_measures.strike_a_match(str1, str2)
Implements Dice Bi-Grams metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
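
A Dice coefficient over character bigrams can be sketched as below. This standalone version assumes case-insensitive matching and bigrams taken within each whitespace-separated word; the library's preprocessing may differ.

```python
def letter_pairs(text):
    """Adjacent character pairs of every word, upper-cased."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def dice_bigrams(str1, str2):
    """Dice coefficient: twice the shared pairs over the total pair count."""
    pairs1, pairs2 = letter_pairs(str1), letter_pairs(str2)
    total = len(pairs1) + len(pairs2)
    if total == 0:
        return 0.0
    shared = 0
    for pair in pairs1:
        if pair in pairs2:
            # Remove the pair so that repeated bigrams are matched only once.
            pairs2.remove(pair)
            shared += 1
    return 2.0 * shared / total
```

For "France" and "French", the shared pairs are "FR" and "NC" out of ten pairs in total, so the score is 0.4.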
interlinking.sim_measures.jaccard(str1, str2)
Implements Jaccard N-Grams metric for n=[2,3].
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
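
For illustration, a Jaccard coefficient over character n-grams might look like the following. It is a standalone sketch rather than the library's code, and it assumes that the 2-grams and 3-grams are merged into one set per string.

```python
def char_ngram_set(text, sizes=(2, 3)):
    """Set of all character n-grams of `text` for every n in `sizes`."""
    return {text[i:i + n] for n in sizes for i in range(len(text) - n + 1)}


def jaccard_ngrams(str1, str2, sizes=(2, 3)):
    """Size of the intersection over size of the union, in [0, 1]."""
    a, b = char_ngram_set(str1, sizes), char_ngram_set(str2, sizes)
    if not a and not b:
        # Neither string is long enough to produce an n-gram.
        return 1.0 if str1 == str2 else 0.0
    return len(a & b) / len(a | b)
```

For "abc" and "abd", only "ab" is shared among five distinct n-grams, which gives 0.2.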
interlinking.sim_measures.soft_jaccard(str1, str2)
Implements Soft-Jaccard metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
interlinking.sim_measures.sorted_winkler(str1, str2)
Implements Sorted Jaro-Winkler metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
interlinking.sim_measures.tuned_jaro_winkler(s1, s2, long_tolerance=False)
Implements LGM Jaro-Winkler metric.
- Parameters
s1, s2 (str) – Input values in unicode.
long_tolerance (bool, optional) – Default value is False.
- Returns
A similarity score normalized in range [0,1].
- Return type
interlinking.sim_measures.lgm_sim_split(s1, s2, split_thres)
Splits each toponym string, i.e., s1 and s2, into tokens and assigns every token to one of three lists built per toponym string: base, mismatch and frequent. The base list contains the terms that are similar to one of the other toponym’s tokens, the mismatch list contains the terms that have no similar counterpart among the tokens of the other toponym, and the frequent list contains the terms that are common in the specified dataset of toponyms.
- Parameters
s1, s2 (str) – Input values in unicode.
split_thres (float) – If the similarity score is above this threshold, the compared terms are identified as base terms, otherwise as mismatch ones.
- Returns
Three collections of terms identified as base, mismatch or frequent respectively, each holding one list per toponym, i.e., a for s1 and b for s2.
- Return type
tuple of (dict of list of str, dict of list of str, dict of list of str)
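
The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FREQUENT_TERMS` is a made-up list (the library derives frequent terms from a toponym dataset), `difflib.SequenceMatcher` stands in for the internal metric, the greedy best-match pairing is one possible strategy, and the `'a'`/`'b'` keys follow the return description.

```python
from difflib import SequenceMatcher

# Hypothetical frequent terms, chosen only for this example.
FREQUENT_TERMS = {"street", "st", "avenue", "saint"}


def token_sim(a, b):
    """Stand-in for the internal similarity metric."""
    return SequenceMatcher(None, a, b).ratio()


def split_terms(s1, s2, split_thres, frequent=FREQUENT_TERMS, sim=token_sim):
    """Assign the tokens of both strings to base, mismatch and frequent lists."""
    base = {"a": [], "b": []}
    mismatch = {"a": [], "b": []}
    special = {"a": [], "b": []}

    # Frequent terms are set aside first; the rest are candidates for pairing.
    remaining = {"a": [], "b": []}
    for key, text in (("a", s1), ("b", s2)):
        for token in text.lower().split():
            target = special if token in frequent else remaining
            target[key].append(token)

    # Greedily pair each token of s1 with its most similar unused token of s2.
    unmatched_b = list(remaining["b"])
    for token in remaining["a"]:
        best = max(unmatched_b, key=lambda other: sim(token, other), default=None)
        if best is not None and sim(token, best) > split_thres:
            base["a"].append(token)
            base["b"].append(best)
            unmatched_b.remove(best)
        else:
            mismatch["a"].append(token)
    mismatch["b"].extend(unmatched_b)
    return base, mismatch, special
```

With a threshold of 0.7, comparing "Saint George Street" with "George St Church" places "george" in both base lists, "church" in the mismatch list of the second string, and "saint", "street" and "st" in the frequent lists.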
interlinking.sim_measures.score_per_term(base_t, mis_t, special_t, metric)
Computes three distinct similarity scores, one for each list of terms.
- Parameters
base_t, mis_t, special_t (list of str) – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.
metric (str) – The metric used to calculate the similarity score of each of the three lists individually.
- Returns
A similarity score for every list of terms. Each score is normalized in range [0,1].
- Return type
interlinking.sim_measures.weighted_sim(base_t, mis_t, special_t, metric, avg)
Re-calculates the significance weights for each list of terms taking into account their lengths.
- Parameters
base_t, mis_t, special_t (list of str) – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.
metric (str) – The metric used to calculate the similarity score of each of the three lists individually.
avg (bool) – If True, the three individual similarity scores (one per term list) are weighted; otherwise each term list’s score is of equal significance to the final score.
- Returns
A similarity score normalized in range [0,1].
- Return type
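
To make the effect of the avg flag concrete, the following sketch combines three per-list scores either with length-based weights or with equal weights. Everything here is an assumption for illustration: the function name, the use of character counts as weights, the `'a'`/`'b'` dictionary layout of each argument, and `difflib.SequenceMatcher` as the metric. The library's actual weighting scheme is not documented on this page.

```python
from difflib import SequenceMatcher


def joined_sim(a_terms, b_terms):
    """Compare two term lists by joining each into one string (stand-in metric)."""
    return SequenceMatcher(None, " ".join(a_terms), " ".join(b_terms)).ratio()


def combine_scores(base_t, mis_t, special_t, avg):
    """Blend the three per-list scores into one value in [0, 1]."""
    lists = (base_t, mis_t, special_t)
    scores = [joined_sim(t["a"], t["b"]) for t in lists]
    if avg:
        # Weight each list by the number of characters it holds.
        weights = [sum(len(term) for term in t["a"] + t["b"]) for t in lists]
    else:
        # Every non-empty list counts equally.
        weights = [1 if (t["a"] or t["b"]) else 0 for t in lists]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / total
```

With one matching base term ("george" on both sides) and one unmatched term ("church"), the equal-weight result is 0.5, while the length-weighted result is 12/18, because the base list holds twice as many characters as the mismatch list.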
interlinking.sim_measures.lgm_sim(str1, str2, metric='damerau_levenshtein', avg=False)
Implements LGM-Sim metric.
- Parameters
str1, str2 (str) – Input values in unicode.
metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain the base and mismatch terms respectively. Any of the metrics documented above can be used as input. Default metric is damerau_levenshtein.
avg (bool, optional) – If True, the three individual similarity scores (one per term list) are weighted; otherwise each term list’s score is of equal significance to the final score. Default value is False.
- Returns
A similarity score normalized in range [0,1].
- Return type
interlinking.sim_measures.avg_lgm_sim(str1, str2, metric='damerau_levenshtein')
Implements LGM-Sim metric where avg flag is True.
- Parameters
str1, str2 (str) – Input values in unicode.
metric (str, optional) – Similarity metric used internally to split the toponyms into the two distinct lists that contain the base and mismatch terms respectively. Any of the metrics documented above can be used as input. Default metric is damerau_levenshtein.
- Returns
A similarity score normalized in range [0,1].
- Return type