processing
Features
- class poi_interlinking.processing.features.Features
This class loads the dataset and the frequent terms, and builds the features that are used as input to the supported classification groups:
- basic: similarity features based on basic similarity measures.
- basic_sorted: similarity features based on sorted versions of the basic similarity measures used in the basic group.
- lgm: similarity features based on variations of the LGM-Sim similarity measures.
See also: compute_features(), for details on the metrics each classification group implements.
build()
Builds features according to the value of the parameter classification_method and returns the values (fX, y) as ndarrays of floats.
- Returns
fX (ndarray) – The computed features that will be used as input to ML classifiers.
y (ndarray) – Binary labels {True, False} to train the classifiers.
compute_features(s1, s2, sorted=True, lgm_sims=True)
Depending on the group assigned to the parameter classification_method, this method builds an ndarray of the following groups of features:
- basic: various similarity measures, i.e., damerau_levenshtein(), jaro(), jaro_winkler() and the reversed one, sorted_winkler(), cosine(), jaccard(), strike_a_match(), monge_elkan(), soft_jaccard(), davies(), tuned_jaro_winkler() and the reversed one, skipgrams().
- basic_sorted: sorted versions of the similarity measures used in the basic group, except for sorted_winkler().
- lgm: LGM-Sim variations that use, as internal measures, the similarity measures of the basic group, except for sorted_winkler().
- Parameters
s1, s2 (str) – Input toponyms.
sorted (bool, optional) – If True, build features for the groups basic and basic_sorted; if False, build features only for the basic group.
lgm_sims (bool, optional) – Whether to build features for the group lgm.
- Returns
It returns a list (vector) of features.
- Return type
list
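How the basic and basic_sorted groups might compose into one feature vector can be sketched as follows. This is a minimal illustration, not the library's implementation: the two measures are stdlib stand-ins for the much larger set listed above, and the names char_ratio, token_jaccard, MEASURES, sort_tokens and feature_vector are hypothetical.

```python
from difflib import SequenceMatcher


def char_ratio(a, b):
    """Stand-in character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    """Stand-in token-level similarity in [0, 1]."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical stand-ins for the measures of the basic group.
MEASURES = [char_ratio, token_jaccard]


def sort_tokens(text):
    """Reorder the whitespace-separated tokens alphabetically."""
    return " ".join(sorted(text.split()))


def feature_vector(s1, s2, sorted_variants=True):
    """One score per measure, optionally followed by the sorted variants."""
    features = [measure(s1, s2) for measure in MEASURES]
    if sorted_variants:
        a, b = sort_tokens(s1), sort_tokens(s2)
        features += [measure(a, b) for measure in MEASURES]
    return features
```

With sorted_variants=True the vector doubles in length, and a pair such as "saint paul" / "paul saint" scores 1.0 on every sorted feature even though the unsorted character-level score is lower.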
poi_interlinking.processing.spatial.matching.get_distance(p1, p2)
Finds the minimum distance between two points.
- Parameters
p1 (shapely geometric object) – The first point
p2 (shapely geometric object) – The second point
- Returns
The minimum distance, expressed in the units of the geometric objects' projection.
- Return type
float
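That the value follows the projection means the result is in the units of the input coordinates: degrees for unprojected longitude/latitude, metres for a metric projection. A minimal sketch of the planar case, assuming plain (x, y) tuples instead of shapely objects (the helper name is hypothetical and not part of the library):

```python
import math


def planar_distance(p1, p2):
    """Euclidean distance between two (x, y) tuples, in coordinate units."""
    return math.dist(p1, p2)


# Coordinates given in metres yield a distance in metres.
d = planar_distance((0.0, 0.0), (3.0, 4.0))
```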
Similarity measures
This module implements various similarity metrics used across different scenarios. Many of these functions were developed by Rui Santos and Alexandre Marinho for their work in Toponym-Matching.
poi_interlinking.processing.sim_measures.skipgram(str1, str2)
Implements the Jaccard-skipgram metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.davies(str1, str2)
Implements the Davies de Salles metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.cosine(str1, str2)
Implements the Cosine N-Grams metric for n=[2,3].
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
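As an illustration of the idea, cosine similarity over character n-grams treats each string as a vector of n-gram counts. The sketch below assumes bigrams and trigrams are pooled into one vector and that no padding is applied; the library may differ on both points, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, sizes=(2, 3)):
    """Count the character n-grams of the given sizes."""
    grams = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def cosine_ngrams(str1, str2, sizes=(2, 3)):
    """Cosine of the angle between the two n-gram count vectors."""
    a, b = char_ngrams(str1, sizes), char_ngrams(str2, sizes)
    if not a or not b:
        return 0.0
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    # Clamp to guard against floating-point results slightly above 1.
    return min(1.0, dot / (norm_a * norm_b))
```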
poi_interlinking.processing.sim_measures.damerau_levenshtein(str1, str2)
Implements the Damerau-Levenshtein metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
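An edit distance becomes a score in [0,1] once it is normalized. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein and divides by the longer string's length; both choices are assumptions for illustration, and the function names are hypothetical rather than the library's.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def dl_similarity(a, b):
    """Map the distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest
```

For example, "abcd" and "abdc" differ by one transposition, so the distance is 1 and the similarity is 0.75.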
poi_interlinking.processing.sim_measures.jaro(str1, str2)
Implements the Jaro metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.jaro_winkler(str1, str2)
Implements the Jaro-Winkler metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
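For readers unfamiliar with the two measures, the textbook definitions of Jaro and Jaro-Winkler can be sketched as below. The names jaro_sketch and jaro_winkler_sketch are hypothetical, and the prefix scale of 0.1 with a prefix cap of 4 are the conventional defaults, assumed here rather than taken from the library.

```python
def jaro_sketch(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_sketch(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted by the length of the common prefix."""
    base = jaro_sketch(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

On the classic pair "MARTHA" / "MARHTA", all six characters match with one transposition, giving a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961 thanks to the shared prefix "MAR".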
poi_interlinking.processing.sim_measures.monge_elkan(str1, str2)
Implements the Monge-Elkan metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
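Monge-Elkan is a token-level scheme: each token of the first string is paired with its best-matching token in the second, and the best scores are averaged. The sketch below assumes whitespace tokenization and uses difflib's ratio as a stand-in inner measure; the inner measure the library actually uses is not stated here, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def token_sim(a, b):
    """Stand-in inner similarity between two tokens, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan_sketch(str1, str2, inner=token_sim):
    """Average, over the tokens of str1, of the best match found in str2."""
    tokens1, tokens2 = str1.split(), str2.split()
    if not tokens1 or not tokens2:
        return 0.0
    best = [max(inner(t1, t2) for t2 in tokens2) for t1 in tokens1]
    return sum(best) / len(best)
```

Note that the measure is asymmetric in this form: the extra token in "a b" lowers the score when "a b" is the first argument, but not when it is the second.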
poi_interlinking.processing.sim_measures.strike_a_match(str1, str2)
Implements the Dice Bi-Grams metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
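The Dice coefficient over character bigrams is twice the number of shared bigrams divided by the total number of bigrams in both strings. A minimal sketch, assuming bigrams are taken per word after upper-casing and that duplicates are counted (the function names are hypothetical, not the library's):

```python
def letter_pairs(text):
    """Adjacent character pairs of every whitespace-separated word."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def dice_bigrams(str1, str2):
    """Dice coefficient: 2 * shared pairs / total pairs."""
    pairs1, pairs2 = letter_pairs(str1), letter_pairs(str2)
    total = len(pairs1) + len(pairs2)
    if total == 0:
        return 0.0
    shared = 0
    remaining = list(pairs2)
    for pair in pairs1:
        if pair in remaining:
            shared += 1
            remaining.remove(pair)  # each pair of str2 may be matched once
    return 2.0 * shared / total
```

For "France" and "French", the shared pairs are FR and NC out of ten pairs in total, which gives 0.4.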
poi_interlinking.processing.sim_measures.jaccard(str1, str2)
Implements the Jaccard N-Grams metric for n=[2,3].
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
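The Jaccard index compares two sets as the size of their intersection over the size of their union. The sketch below applies it to the pooled sets of character bigrams and trigrams; pooling and the absence of padding are assumptions, and the function names are hypothetical.

```python
def ngram_set(text, sizes=(2, 3)):
    """Set of distinct character n-grams of the given sizes."""
    return {text[i:i + n] for n in sizes for i in range(len(text) - n + 1)}


def jaccard_ngrams(str1, str2, sizes=(2, 3)):
    """Intersection over union of the two n-gram sets."""
    a, b = ngram_set(str1, sizes), ngram_set(str2, sizes)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

For "abcd" and "abce", three n-grams are shared (ab, bc, abc) out of seven distinct ones, so the score is 3/7.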
poi_interlinking.processing.sim_measures.soft_jaccard(str1, str2)
Implements the Soft-Jaccard metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.sorted_winkler(str1, str2)
Implements the Sorted Jaro-Winkler metric.
- Parameters
str1, str2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.tuned_jaro_winkler(s1, s2, long_tolerance=False)
Implements the Tuned Jaro-Winkler metric.
- Parameters
s1, s2 (str) – Input values in unicode.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.lgm_sim_split(s1, s2, split_thres)
Splits each toponym string, i.e., s1 and s2, into tokens, builds three distinct lists per toponym string, i.e., base, mismatch and frequent, and assigns the produced tokens to these lists. The base list contains the terms that are similar to one of the other toponym's tokens, the mismatch list contains the terms that have no similar counterpart among the tokens of the other toponym, and the frequent list contains the terms that are common in the specified dataset of toponyms.
- Parameters
s1, s2 (str) – Input values in unicode.
split_thres (float) – If the similarity score is above this threshold, the compared terms are identified as base terms, otherwise as mismatch ones.
- Returns
Three lists of terms identified as base, mismatch or frequent respectively per toponym, i.e., a for s1 and b for s2.
- Return type
tuple of (dict of list of str, dict of list of str, dict of list of str)
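The splitting step described above can be sketched roughly as follows. This is a simplified greedy illustration under several assumptions, not the library's algorithm: tokens are split on whitespace, difflib's ratio stands in for the internal metric, and the frequent terms are passed in explicitly through a hypothetical frequent_terms argument, whereas the documented signature takes only s1, s2 and split_thres.

```python
from difflib import SequenceMatcher


def split_terms(s1, s2, split_thres, frequent_terms=frozenset()):
    """Assign the tokens of s1 ('a') and s2 ('b') to base, mismatch or frequent."""
    base = {"a": [], "b": []}
    mismatch = {"a": [], "b": []}
    frequent = {"a": [], "b": []}

    # Frequent terms are set aside before any comparison takes place.
    rest = {"a": [], "b": []}
    for key, text in (("a", s1), ("b", s2)):
        for token in text.split():
            target = frequent if token in frequent_terms else rest
            target[key].append(token)

    # Greedily pair each remaining token of s1 with its best match in s2.
    unmatched_b = list(rest["b"])
    for token in rest["a"]:
        best, best_score = None, 0.0
        for candidate in unmatched_b:
            score = SequenceMatcher(None, token, candidate).ratio()
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score > split_thres:
            base["a"].append(token)
            base["b"].append(best)
            unmatched_b.remove(best)
        else:
            mismatch["a"].append(token)
    mismatch["b"].extend(unmatched_b)
    return base, mismatch, frequent
```

The returned value mirrors the documented shape: three dictionaries, each holding one list for s1 under the key a and one for s2 under the key b.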
poi_interlinking.processing.sim_measures.score_per_term(base_t, mis_t, special_t, metric)
Computes a distinct similarity score for each of the three lists of terms.
- Parameters
base_t, mis_t, special_t – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.
metric (str) – Indicates the metric used to calculate the similarity score by comparing the three lists individually.
- Returns
A similarity score for every list of terms. Each score is normalized in range [0,1].
- Return type
tuple of (float, float, float)
poi_interlinking.processing.sim_measures.weighted_sim(base_t, mis_t, special_t, metric, avg)
Re-calculates the significance weights for each list of terms, taking into account their lengths.
- Parameters
base_t, mis_t, special_t – Lists of toponym terms identified as base, mismatch or frequent (special) respectively.
metric (str) – Indicates the metric used to calculate the similarity score by comparing the three lists individually.
avg (bool) – If True, the three individual similarity scores (one per term list) are properly weighted; otherwise each term list's score is of equal significance to the final score.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
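The difference between the two settings of avg can be illustrated with a small sketch. It assumes that "properly weighted" means weights proportional to the length of each term list, which is only one plausible reading; the library's actual weighting scheme is not given here, and combine_scores is a hypothetical helper.

```python
def combine_scores(scores, lengths, avg=True):
    """Combine per-list scores into one value in [0, 1].

    scores  -- similarity of the base, mismatch and frequent lists
    lengths -- size of each list (assumed basis for the weights)
    """
    if avg:
        total = sum(lengths)
        if total == 0:
            return 0.0
        # Assumption: longer lists contribute proportionally more.
        weights = [n / total for n in lengths]
    else:
        # Every list has the same significance.
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))
```

With scores (1.0, 0.0, 0.5) and lengths (2, 1, 1), the length-weighted result is 0.625, while the equal-weight result is 0.5.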
poi_interlinking.processing.sim_measures.lgm_sim(str1, str2, metric='damerau_levenshtein', avg=False)
Implements the LGM-Sim metric.
- Parameters
str1, str2 (str) – Input values in unicode.
metric (str, optional) – Similarity metric used internally to split toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the supported metrics above can be used. The default metric is damerau_levenshtein.
avg (bool, optional) – If True, the three individual similarity scores (one per term list) are properly weighted; otherwise each term list's score is of equal significance to the final score. The default value is False.
- Returns
A similarity score normalized in range [0,1].
- Return type
float
poi_interlinking.processing.sim_measures.avg_lgm_sim(str1, str2, metric='damerau_levenshtein')
Implements the LGM-Sim metric with the avg flag set to True.
- Parameters
str1, str2 (str) – Input values in unicode.
metric (str, optional) – Similarity metric used internally to split toponyms into the two distinct lists that contain base and mismatch terms respectively. Any of the supported metrics above can be used. The default metric is damerau_levenshtein.
- Returns
A similarity score normalized in range [0,1].
- Return type
float