config

interlinking.config.default_data_path = 'data'

The folder name, relative to root path, that contains all required input files, e.g., train/test dataset, frequent terms etc.

Type

str

interlinking.config.freq_term_size = 400

The number of frequent terms to utilize.

Type

int

interlinking.config.fieldnames = ['s1', 's2', 'status', 'c1', 'c2', 'a1', 'a2', 'cc1', 'cc2']

A list of names assigned to the columns of the train/test dataset. If the dataset file already contains a header row, set this to None.

Type

list of str

interlinking.config.use_cols = {'label': 'status', 's1': 's1', 's2': 's2'}

A dictionary of useful column names.

Variables

s1 (str) – The name of the column that holds the first toponym of each pair.

s2 (str) – The name of the column that holds the second toponym of each pair.

label (str) – The name of the column that holds the true label of each toponym pair.

interlinking.config.delimiter = '\t'

The delimiter used to separate each column in CSV input files.

Type

char
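
As a rough illustration of how fieldnames, use_cols and delimiter fit together, the sketch below reads a header-less, tab-separated sample with the standard csv module. The sample rows, the label values and the read_pairs helper are assumptions made for this example; they are not part of the interlinking package.

```python
import csv
import io

# Values copied from the settings documented above.
fieldnames = ['s1', 's2', 'status', 'c1', 'c2', 'a1', 'a2', 'cc1', 'cc2']
use_cols = {'label': 'status', 's1': 's1', 's2': 's2'}
delimiter = '\t'

# Hypothetical two-row sample; real input files live under default_data_path.
sample = (
    "Athens\tAthina\tTRUE\t1\t2\tP\tP\tGR\tGR\n"
    "Paris\tLondon\tFALSE\t3\t4\tP\tP\tFR\tGB\n"
)


def read_pairs(handle):
    """Yield (s1, s2, label) tuples from a header-less delimited file."""
    # Passing fieldnames makes DictReader treat the first row as data.
    reader = csv.DictReader(handle, fieldnames=fieldnames, delimiter=delimiter)
    for row in reader:
        yield row[use_cols['s1']], row[use_cols['s2']], row[use_cols['label']]


pairs = list(read_pairs(io.StringIO(sample)))
print(pairs)
```

Only the three columns named in use_cols are kept; the remaining columns are parsed but ignored.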

interlinking.config.sort_thres = 0.55

Similarity threshold that determines whether the tokens of a toponym are sorted. Sorting is triggered when the similarity score falls below this threshold.

Type

float
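
The thresholding rule can be sketched as follows. This is a minimal sketch, not the framework's implementation: difflib.SequenceMatcher stands in for whichever similarity measure the framework applies, and the helper names similarity, sort_tokens and maybe_sorted are hypothetical.

```python
from difflib import SequenceMatcher

sort_thres = 0.55  # value documented above


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the framework's own measures may differ."""
    return SequenceMatcher(None, a, b).ratio()


def sort_tokens(name):
    """Return the name with its whitespace-separated tokens in alphabetical order."""
    return ' '.join(sorted(name.split()))


def maybe_sorted(a, b, thres=sort_thres):
    """Sort the tokens of both names only when the raw score is below thres."""
    if similarity(a, b) < thres:
        return sort_tokens(a), sort_tokens(b)
    return a, b


print(maybe_sorted("york new", "new york"))
```

Pairs that already score at or above the threshold are returned unchanged, so sorting only affects pairs that look dissimilar in their original token order.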

interlinking.config.seed_no = 13

Seed used by each of the random number generators.

Type

int

class interlinking.config.MLConf

This class initializes parameters that correspond to the machine learning part of the framework.

Variables

opt_values (dict of dicts) – The learned parameters of the LGM-Sim meta-similarity function: \(θ_{split}, w_b, w_m, w_f\).

clf_custom_params (dict of dicts) – Custom hyper-parameters to use for the specified classifiers. These parameters are used when the evaluate command is executed in the CLI.

These variables define the parameter grid for GridSearchCV:

Variables

SVM_hyperparameters (list) – Defines the search space for SVM.

MLP_hyperparameters (dict) – Defines the search space for MLP.

DecisionTree_hyperparameters (dict) – Defines the search space for Decision Trees.

RandomForest_hyperparameters (dict) – Defines the search space for Random Forests and Extra-Trees.

XGBoost_hyperparameters (dict) – Defines the search space for XGBoost.
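
To show what a dict-style parameter grid means in practice, the sketch below expands a grid into every combination that an exhaustive search would try. The grid values and the expand_grid helper are hypothetical; they are not the values shipped in MLConf, and GridSearchCV performs this enumeration internally.

```python
from itertools import product

# Hypothetical search space; not the values defined in MLConf.
decision_tree_grid = {
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 10],
}


def expand_grid(grid):
    """Return every parameter combination of a dict-style grid as a list of dicts."""
    keys = sorted(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


candidates = expand_grid(decision_tree_grid)
print(len(candidates))  # 3 * 2 = 6 combinations
```

A list-typed search space, such as the one documented for SVM, can be treated as several such grids, each expanded separately.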

These variables define the parameter distributions for RandomizedSearchCV, where continuous distributions are used for continuous parameters whenever this is feasible:

Variables

SVM_hyperparameters_dist (dict) – Defines the search space for SVM.

MLP_hyperparameters_dist (dict) – Defines the search space for MLP.

DecisionTree_hyperparameters_dist (dict) – Defines the search space for Decision Trees.

RandomForest_hyperparameters_dist (dict) – Defines the search space for Random Forests and Extra-Trees.

XGBoost_hyperparameters_dist (dict) – Defines the search space for XGBoost.
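
The difference from a fixed grid can be sketched with the standard library alone. In this illustration, lists are sampled uniformly and callables stand in for continuous distributions; scikit-learn itself expects distribution objects that expose an rvs method, so the callables, the space contents and the sample_params helper are all assumptions of this sketch.

```python
import random

# Hypothetical search space: lists are sampled uniformly, callables draw
# from a continuous (here log-uniform) distribution.
space = {
    'max_depth': [5, 10, 20],
    'learning_rate': lambda rng: 10 ** rng.uniform(-3, 0),
}


def sample_params(space, n_iter, seed=13):
    """Draw n_iter random parameter settings from the search space."""
    rng = random.Random(seed)  # seeded for reproducible draws
    samples = []
    for _ in range(n_iter):
        params = {}
        for name in sorted(space):
            spec = space[name]
            params[name] = spec(rng) if callable(spec) else rng.choice(spec)
        samples.append(params)
    return samples


draws = sample_params(space, n_iter=4)
for params in draws:
    print(params)
```

Unlike an exhaustive grid, the number of evaluated settings is fixed by n_iter regardless of how many values each parameter can take.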

kfold_no = 5

The number of outer folds into which the dataset is split for k-fold cross-validation.

Type

int
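
For intuition, a k-fold split with kfold_no folds can be sketched in plain Python as below. This is only an illustration of the splitting idea under the assumption that the data is shuffled with seed_no first; the framework relies on scikit-learn for the actual cross-validation, and kfold_indices is a hypothetical helper.

```python
import random

kfold_no = 5   # value documented above
seed_no = 13   # value of interlinking.config.seed_no


def kfold_indices(n_samples, k=kfold_no, seed=seed_no):
    """Yield (train, test) index lists for k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


splits = list(kfold_indices(23))
for train, test in splits:
    print(len(train), len(test))
```

Each sample lands in exactly one test fold, and the remaining folds form the corresponding training set.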

n_jobs = 4

The number of parallel jobs to run. A value of -1 uses all available processors.

Type

int

classification_method = 'lgm'

The group of classification features to use (basic | basic_sorted | lgm).

See also

Features
Details on available inputs.

Type

str

hyperparams_search_method = 'randomized'

The search method to use for finding the best hyper-parameters (randomized | grid).

See also

fineTuneClassifiers()
Details on available inputs.

Type

str

classifiers = ['RandomForest']

The classifiers to apply when the code is executed. Accepted values are:

SVM

DecisionTree

RandomForest

ExtraTrees

XGBoost

MLP

Type

list of str

score = 'accuracy'

The metric to optimize during hyper-parameter tuning. Valid values are those listed among the predefined scoring values of scikit-learn.

Type

str
