config

interlinking.config.default_data_path = 'data'

The name of the folder, relative to the root path, that contains all required input files, e.g., the train/test datasets and the frequent terms.

Type

str

interlinking.config.freq_term_size = 400

The number of frequent terms to utilize.

Type

int

interlinking.config.fieldnames = ['s1', 's2', 'status', 'c1', 'c2', 'a1', 'a2', 'cc1', 'cc2']

A list of names assigned to the columns of the train/test dataset. If the dataset already contains a header row, this should be set to None.

Type

list of str

interlinking.config.use_cols = {'label': 'status', 's1': 's1', 's2': 's2'}

A dictionary of useful column names.

Variables
  • s1 (str) – The name of the column that holds the first name of each toponym pair.

  • s2 (str) – The name of the column that holds the second name of each toponym pair.

  • label (str) – The name of the column that holds the true label of each toponym pair.

interlinking.config.delimiter = '\t'

The delimiter used to separate each column in CSV input files.

Type

char
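As a rough illustration of how fieldnames, use_cols and delimiter fit together, the sketch below reads a headerless, tab-separated sample with the standard csv module and keeps only the columns named in use_cols. The sample rows, the label values and the read_pairs helper are hypothetical; the framework's own loader may work differently.

```python
import csv
import io

# Values copied from the settings documented above.
fieldnames = ['s1', 's2', 'status', 'c1', 'c2', 'a1', 'a2', 'cc1', 'cc2']
use_cols = {'label': 'status', 's1': 's1', 's2': 's2'}
delimiter = '\t'

# Hypothetical two-row, headerless sample standing in for a real dataset file.
sample = (
    "Athens\tAthina\tTRUE\t1\t2\tx\ty\tGR\tGR\n"
    "Paris\tLondon\tFALSE\t3\t4\tx\ty\tFR\tGB\n"
)


def read_pairs(handle):
    """Yield (first name, second name, label) for every row of the input."""
    # Passing fieldnames makes DictReader treat the first line as data, not as a header.
    reader = csv.DictReader(handle, fieldnames=fieldnames, delimiter=delimiter)
    for row in reader:
        yield row[use_cols['s1']], row[use_cols['s2']], row[use_cols['label']]


pairs = list(read_pairs(io.StringIO(sample)))
```

For a file on disk, the same helper can be fed an object returned by open(path, newline='').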

interlinking.config.sort_thres = 0.55

The similarity threshold that decides whether the tokens of a toponym pair are sorted. Sorting is applied only when the similarity score falls below this threshold.

Type

float
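The thresholding rule can be sketched as follows. This is only an illustration: difflib.SequenceMatcher is used as a stand-in similarity measure, and the similarity, sort_tokens and maybe_sorted helpers are hypothetical names, not part of the framework.

```python
from difflib import SequenceMatcher

sort_thres = 0.55  # value documented above


def similarity(a, b):
    # Stand-in measure (assumption); the framework uses its own similarity functions.
    return SequenceMatcher(None, a, b).ratio()


def sort_tokens(name):
    # Split on whitespace and rejoin the tokens in alphabetical order.
    return ' '.join(sorted(name.split()))


def maybe_sorted(s1, s2, thres=sort_thres):
    """Sort the tokens of both names only when their similarity is below thres."""
    if similarity(s1, s2) < thres:
        return sort_tokens(s1), sort_tokens(s2)
    return s1, s2
```

With this stand-in measure, a pair whose tokens appear in a different order scores low on the raw strings, so sorting gives the comparison a second chance on aligned tokens.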

interlinking.config.seed_no = 13

Seed used by each of the random number generators.

Type

int
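A fixed seed makes randomized steps repeatable across runs. The minimal sketch below shows the effect with Python's standard random module; sample_rows is a hypothetical helper, and how the framework forwards seed_no to its libraries is not shown here.

```python
import random

seed_no = 13  # value documented above


def sample_rows(n_rows, k, seed=seed_no):
    """Pick k distinct row indices out of n_rows using a seeded generator."""
    rng = random.Random(seed)  # private generator, does not touch global state
    return rng.sample(range(n_rows), k)


# Two calls with the same seed return the same indices.
first = sample_rows(100, 5)
second = sample_rows(100, 5)
```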

class interlinking.config.MLConf[source]

This class initializes parameters that correspond to the machine learning part of the framework.

Variables
  • opt_values (dict of dicts) – The learned parameters of the LGM-Sim meta-similarity function: \(θ_{split}, w_b, w_m, w_f\).

  • clf_custom_params (dict of dicts) – The custom hyper-parameters to use for the specified classifiers. These parameters are used when the evaluate command is executed in the CLI.

These variables define the parameter grid for GridSearchCV:

Variables
  • SVM_hyperparameters (list) – Defines the search space for SVM.

  • MLP_hyperparameters (dict) – Defines the search space for MLP.

  • DecisionTree_hyperparameters (dict) – Defines the search space for Decision Trees.

  • RandomForest_hyperparameters (dict) – Defines the search space for Random Forests and Extra-Trees.

  • XGBoost_hyperparameters (dict) – Defines the search space for XGBoost.

These variables define the parameter distributions for RandomizedSearchCV; continuous parameters are sampled from continuous distributions whenever this is feasible:

Variables
  • SVM_hyperparameters_dist (dict) – Defines the search space for SVM.

  • MLP_hyperparameters_dist (dict) – Defines the search space for MLP.

  • DecisionTree_hyperparameters_dist (dict) – Defines the search space for Decision Trees.

  • RandomForest_hyperparameters_dist (dict) – Defines the search space for Random Forests and Extra-Trees.

  • XGBoost_hyperparameters_dist (dict) – Defines the search space for XGBoost.
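The difference between the two kinds of search space can be sketched with the standard library alone. A grid search evaluates every combination of the listed values, whereas a randomized search draws a fixed number of candidates from distributions. The parameter names and values below are hypothetical, and the callables only imitate the idea of a distribution; scikit-learn itself expects lists or distribution objects.

```python
import itertools
import random

# Hypothetical grid: every listed value is combined with every other one.
grid = {'max_depth': [5, 10, 20], 'n_estimators': [100, 250]}

# Hypothetical distributions: each entry draws one value from a seeded generator.
dist = {
    'max_depth': lambda rng: rng.randint(5, 20),
    'learning_rate': lambda rng: rng.uniform(0.01, 0.3),
}


def expand_grid(space):
    """Return all combinations of the grid as a list of parameter dicts."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def sample_randomized(space, n_iter, seed=13):
    """Draw n_iter candidate parameter dicts from the given distributions."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in space.items()} for _ in range(n_iter)]


grid_candidates = expand_grid(grid)                     # 3 * 2 combinations
random_candidates = sample_randomized(dist, n_iter=4)   # exactly 4 draws
```

The grid grows multiplicatively with every added parameter, while the randomized search keeps the number of candidates fixed at n_iter.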

kfold_no = 5

The number of outer folds into which the dataset is split for k-fold cross-validation.

Type

int
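To make the role of kfold_no concrete, the sketch below partitions sample indices into k contiguous folds, each serving once as the test set. It is a simplified, unshuffled illustration with a hypothetical helper name (kfold_indices), not the framework's implementation.

```python
kfold_no = 5  # value documented above


def kfold_indices(n_samples, k=kfold_no):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(12))
```

Every sample appears in exactly one test fold, so each one is used for evaluation once and for training k - 1 times.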

n_jobs = 4

The number of jobs to run in parallel. A value of -1 uses all available processors.

Type

int

classification_method = 'lgm'

The group of features to use for classification. Accepted values: basic | basic_sorted | lgm.

See also

Features

Details on available inputs.

Type

str

hyperparams_search_method = 'randomized'

The search method to use for finding the best hyper-parameters. Accepted values: randomized | grid.

See also

fineTuneClassifiers()

Details on available inputs.

Type

str

classifiers = ['RandomForest']

The classifiers to apply when the code is executed. Accepted values are:

  • SVM

  • DecisionTree

  • RandomForest

  • ExtraTrees

  • XGBoost

  • MLP

Type

list of str

score = 'accuracy'

The metric to optimize during hyper-parameter tuning. Valid values are the predefined scoring names of scikit-learn.

Type

str
