Commit 2c36520 — Merge pull request #27 from Nowosad/cpp_extend

adds new cpp funs

Authored: Aug 20, 2021
2 parents: 775fb08 + 228e67f

27 files changed: +1257 −191 lines

.gitignore (+7)

@@ -0,0 +1,7 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
+src/*.o
+src/*.so
+src/*.dll
DESCRIPTION (+2 −1)

@@ -22,7 +22,8 @@ URL: https://github.com/drostlab/philentropy
 Suggests:
     testthat,
     knitr,
-    markdown
+    rmarkdown,
+    microbenchmark
 VignetteBuilder: knitr
 BugReports: https://github.com/drostlab/philentropy/issues
 RoxygenNote: 7.1.1

NAMESPACE (+3)

@@ -17,6 +17,9 @@ export(cosine_dist)
 export(czekanowski)
 export(dice_dist)
 export(dist.diversity)
+export(dist_many_many)
+export(dist_one_many)
+export(dist_one_one)
 export(distance)
 export(divergence_sq)
 export(estimate.probability)

NEWS.md (+2)

@@ -5,8 +5,10 @@
 - `distance()` and all other individual information theory functions
 receive a new argument `epsilon` with default value `epsilon = 0.00001` to treat cases where individual distance or similarity computations
 yield `x / 0` or `0 / 0`. Instead of a hard-coded epsilon, users can now set `epsilon` according to their input vectors. (Many thanks to Joshua McNeill #26 for this great question.)
+- three new functions `dist_one_one()`, `dist_one_many()`, and `dist_many_many()` are added. They are fairly flexible intermediaries between `distance()` and the single distance functions. `dist_one_one()` expects two vectors (probability density functions) and returns a single value. `dist_one_many()` expects one vector (a probability density function) and one matrix (a set of probability density functions) and returns a vector of values. `dist_many_many()` expects two matrices (two sets of probability density functions) and returns a matrix of values.
 
 ### Updates
+
 - the `dplyr` package dependency was removed and replaced by `poorman`
 due to the heavy dependency burden of `dplyr`, since `philentropy`
 only used `dplyr::between()`, which is now `poorman::between()` (Many thanks to Patrice Kiener for this suggestion)

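The three call shapes described in the NEWS entry (one-vs-one, one-vs-many, many-vs-many) can be sketched outside of R. This is a minimal, hypothetical C++ sketch in which `std::vector` stands in for Rcpp vectors and matrices, only the euclidean method is implemented, and all `sketch_*` names are illustrative rather than part of the package:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rows of a Mat are probability density functions, as in the package's
// matrix inputs.
using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// one PDF vs one PDF -> single value (cf. dist_one_one)
double sketch_one_one(const Vec& P, const Vec& Q) {
    double sum = 0.0;
    for (std::size_t i = 0; i < P.size(); ++i)
        sum += (P[i] - Q[i]) * (P[i] - Q[i]);
    return std::sqrt(sum);
}

// one PDF vs a set of PDFs -> vector of values (cf. dist_one_many)
Vec sketch_one_many(const Vec& P, const Mat& dists) {
    Vec out(dists.size());
    for (std::size_t i = 0; i < dists.size(); ++i)
        out[i] = sketch_one_one(P, dists[i]);
    return out;
}

// two sets of PDFs -> matrix of values (cf. dist_many_many)
Mat sketch_many_many(const Mat& dists1, const Mat& dists2) {
    Mat out(dists1.size(), Vec(dists2.size()));
    for (std::size_t i = 0; i < dists1.size(); ++i)
        for (std::size_t j = 0; j < dists2.size(); ++j)
            out[i][j] = sketch_one_one(dists1[i], dists2[j]);
    return out;
}
```

The actual exported functions additionally take `method`, `p`, `testNA`, `unit`, and `epsilon` arguments; the sketch only shows how the three return shapes relate to each other.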
R/KL.R (+1 −1)

@@ -50,7 +50,7 @@
 #' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
 #' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
 #' return negative values which are not defined and only occur due to the
-#' technical issues of computing \code{x / 0} or \code(0 / 0) cases.
+#' technical issues of computing x / 0 or 0 / 0 cases.
 #' @return The Kullback-Leibler divergence of probability vectors.
 #' @author Hajk-Georg Drost
 #' @seealso

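The documentation above describes the `epsilon` rule: ratios of the form x / 0 or 0 / 0 are replaced by `epsilon` instead of producing Inf/NaN, which would otherwise let some distance metrics return undefined negative values. A minimal sketch of such a guard (the helper name `safe_ratio` is hypothetical, not a philentropy function):

```cpp
// Illustrates the documented epsilon rule: a zero denominator makes the
// whole ratio collapse to epsilon, covering both x / 0 and 0 / 0.
double safe_ratio(double x, double y, double epsilon = 0.00001) {
    if (y == 0.0) return epsilon;
    return x / y;
}
```

Per the docs, smaller `epsilon` suits large, similar vectors with many zeros; larger values (e.g. 0.01) suit small or very divergent vectors.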
R/RcppExports.R (+251 −16)

Large diff not rendered by default.

R/distance.R (+1 −1)

@@ -37,7 +37,7 @@
 #' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
 #' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
 #' return negative values which are not defined and only occur due to the
-#' technical issues of computing \code{x / 0} or \code(0 / 0) cases.
+#' technical issues of computing x / 0 or 0 / 0 cases.
 #' @param est.prob method to estimate probabilities from input count vectors such as non-probability vectors. Default: \code{est.prob = NULL}. Options are:
 #' \itemize{
 #' \item \code{est.prob = "empirical"}: The relative frequencies of each vector are computed internally. For example an input matrix \code{rbind(1:10, 11:20)} will be transformed to a probability vector \code{rbind(1:10 / sum(1:10), 11:20 / sum(11:20))}

Generated documentation files (not rendered by default):

man/KL.Rd (+1 −1)
man/bhattacharyya.Rd (+17 −7)
man/dist_many_many.Rd (+60)
man/dist_one_many.Rd (+60)
man/dist_one_one.Rd (+59)
man/distance.Rd (+1 −1)
man/jeffreys.Rd (+16 −1)
man/kulczynski_d.Rd (+16 −1)
man/kullback_leibler_distance.Rd (+16 −1)
man/kumar_johnson.Rd (+16 −1)
man/neyman_chi_sq.Rd (+16 −1)
man/pearson_chi_sq.Rd (+16 −1)
man/taneja.Rd (+16 −1)

src/RcppExports.cpp (+131)

@@ -8,6 +8,11 @@
 
 using namespace Rcpp;
 
+#ifdef RCPP_USE_GLOBAL_ROSTREAM
+Rcpp::Rostream<true>& Rcpp::Rcout = Rcpp::Rcpp_cout_get();
+Rcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();
+#endif
+
 // Ecpp
 double Ecpp(const Rcpp::NumericVector& P, Rcpp::String unit);
 RcppExport SEXP _philentropy_Ecpp(SEXP PSEXP, SEXP unitSEXP) {

@@ -165,6 +170,57 @@ BEGIN_RCPP
     return rcpp_result_gen;
 END_RCPP
 }
+// dist_one_one
+double dist_one_one(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, const Rcpp::String& method, const double& p, const bool& testNA, const Rcpp::String& unit, const double& epsilon);
+RcppExport SEXP _philentropy_dist_one_one(SEXP PSEXP, SEXP QSEXP, SEXP methodSEXP, SEXP pSEXP, SEXP testNASEXP, SEXP unitSEXP, SEXP epsilonSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< const Rcpp::NumericVector& >::type P(PSEXP);
+    Rcpp::traits::input_parameter< const Rcpp::NumericVector& >::type Q(QSEXP);
+    Rcpp::traits::input_parameter< const Rcpp::String& >::type method(methodSEXP);
+    Rcpp::traits::input_parameter< const double& >::type p(pSEXP);
+    Rcpp::traits::input_parameter< const bool& >::type testNA(testNASEXP);
+    Rcpp::traits::input_parameter< const Rcpp::String& >::type unit(unitSEXP);
+    Rcpp::traits::input_parameter< const double& >::type epsilon(epsilonSEXP);
+    rcpp_result_gen = Rcpp::wrap(dist_one_one(P, Q, method, p, testNA, unit, epsilon));
+    return rcpp_result_gen;
+END_RCPP
+}
+// dist_one_many
+Rcpp::NumericVector dist_one_many(const Rcpp::NumericVector& P, Rcpp::NumericMatrix dists, Rcpp::String method, double p, bool testNA, Rcpp::String unit, double epsilon);
+RcppExport SEXP _philentropy_dist_one_many(SEXP PSEXP, SEXP distsSEXP, SEXP methodSEXP, SEXP pSEXP, SEXP testNASEXP, SEXP unitSEXP, SEXP epsilonSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< const Rcpp::NumericVector& >::type P(PSEXP);
+    Rcpp::traits::input_parameter< Rcpp::NumericMatrix >::type dists(distsSEXP);
+    Rcpp::traits::input_parameter< Rcpp::String >::type method(methodSEXP);
+    Rcpp::traits::input_parameter< double >::type p(pSEXP);
+    Rcpp::traits::input_parameter< bool >::type testNA(testNASEXP);
+    Rcpp::traits::input_parameter< Rcpp::String >::type unit(unitSEXP);
+    Rcpp::traits::input_parameter< double >::type epsilon(epsilonSEXP);
+    rcpp_result_gen = Rcpp::wrap(dist_one_many(P, dists, method, p, testNA, unit, epsilon));
+    return rcpp_result_gen;
+END_RCPP
+}
+// dist_many_many
+Rcpp::NumericMatrix dist_many_many(Rcpp::NumericMatrix dists1, Rcpp::NumericMatrix dists2, Rcpp::String method, double p, bool testNA, Rcpp::String unit, double epsilon);
+RcppExport SEXP _philentropy_dist_many_many(SEXP dists1SEXP, SEXP dists2SEXP, SEXP methodSEXP, SEXP pSEXP, SEXP testNASEXP, SEXP unitSEXP, SEXP epsilonSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< Rcpp::NumericMatrix >::type dists1(dists1SEXP);
+    Rcpp::traits::input_parameter< Rcpp::NumericMatrix >::type dists2(dists2SEXP);
+    Rcpp::traits::input_parameter< Rcpp::String >::type method(methodSEXP);
+    Rcpp::traits::input_parameter< double >::type p(pSEXP);
+    Rcpp::traits::input_parameter< bool >::type testNA(testNASEXP);
+    Rcpp::traits::input_parameter< Rcpp::String >::type unit(unitSEXP);
+    Rcpp::traits::input_parameter< double >::type epsilon(epsilonSEXP);
+    rcpp_result_gen = Rcpp::wrap(dist_many_many(dists1, dists2, method, p, testNA, unit, epsilon));
+    return rcpp_result_gen;
+END_RCPP
+}
 // custom_log2
 double custom_log2(const double& x);
 RcppExport SEXP _philentropy_custom_log2(SEXP xSEXP) {

@@ -935,3 +991,78 @@ RcppExport SEXP _philentropy_RcppExport_registerCCallable() {
     R_RegisterCCallable("philentropy", "_philentropy_RcppExport_validate", (DL_FUNC)_philentropy_RcppExport_validate);
     return R_NilValue;
 }
+
+static const R_CallMethodDef CallEntries[] = {
+    {"_philentropy_Ecpp", (DL_FUNC) &_philentropy_Ecpp, 2},
+    {"_philentropy_JEcpp", (DL_FUNC) &_philentropy_JEcpp, 2},
+    {"_philentropy_CEcpp", (DL_FUNC) &_philentropy_CEcpp, 3},
+    {"_philentropy_MIcpp", (DL_FUNC) &_philentropy_MIcpp, 4},
+    {"_philentropy_pearson_corr_centred", (DL_FUNC) &_philentropy_pearson_corr_centred, 3},
+    {"_philentropy_pearson_corr_uncentred", (DL_FUNC) &_philentropy_pearson_corr_uncentred, 3},
+    {"_philentropy_squared_pearson_corr", (DL_FUNC) &_philentropy_squared_pearson_corr, 3},
+    {"_philentropy_DistMatrixWithoutUnitDF", (DL_FUNC) &_philentropy_DistMatrixWithoutUnitDF, 3},
+    {"_philentropy_DistMatrixMinkowskiMAT", (DL_FUNC) &_philentropy_DistMatrixMinkowskiMAT, 3},
+    {"_philentropy_DistMatrixWithoutUnitMAT", (DL_FUNC) &_philentropy_DistMatrixWithoutUnitMAT, 3},
+    {"_philentropy_DistMatrixWithUnitDF", (DL_FUNC) &_philentropy_DistMatrixWithUnitDF, 4},
+    {"_philentropy_DistMatrixWithUnitMAT", (DL_FUNC) &_philentropy_DistMatrixWithUnitMAT, 4},
+    {"_philentropy_dist_one_one", (DL_FUNC) &_philentropy_dist_one_one, 7},
+    {"_philentropy_dist_one_many", (DL_FUNC) &_philentropy_dist_one_many, 7},
+    {"_philentropy_dist_many_many", (DL_FUNC) &_philentropy_dist_many_many, 7},
+    {"_philentropy_custom_log2", (DL_FUNC) &_philentropy_custom_log2, 1},
+    {"_philentropy_custom_log10", (DL_FUNC) &_philentropy_custom_log10, 1},
+    {"_philentropy_euclidean", (DL_FUNC) &_philentropy_euclidean, 3},
+    {"_philentropy_manhattan", (DL_FUNC) &_philentropy_manhattan, 3},
+    {"_philentropy_minkowski", (DL_FUNC) &_philentropy_minkowski, 4},
+    {"_philentropy_chebyshev", (DL_FUNC) &_philentropy_chebyshev, 3},
+    {"_philentropy_sorensen", (DL_FUNC) &_philentropy_sorensen, 3},
+    {"_philentropy_gower", (DL_FUNC) &_philentropy_gower, 3},
+    {"_philentropy_soergel", (DL_FUNC) &_philentropy_soergel, 3},
+    {"_philentropy_kulczynski_d", (DL_FUNC) &_philentropy_kulczynski_d, 4},
+    {"_philentropy_canberra", (DL_FUNC) &_philentropy_canberra, 3},
+    {"_philentropy_lorentzian", (DL_FUNC) &_philentropy_lorentzian, 4},
+    {"_philentropy_intersection_dist", (DL_FUNC) &_philentropy_intersection_dist, 3},
+    {"_philentropy_wave_hedges", (DL_FUNC) &_philentropy_wave_hedges, 3},
+    {"_philentropy_czekanowski", (DL_FUNC) &_philentropy_czekanowski, 3},
+    {"_philentropy_motyka", (DL_FUNC) &_philentropy_motyka, 3},
+    {"_philentropy_tanimoto", (DL_FUNC) &_philentropy_tanimoto, 3},
+    {"_philentropy_ruzicka", (DL_FUNC) &_philentropy_ruzicka, 3},
+    {"_philentropy_inner_product", (DL_FUNC) &_philentropy_inner_product, 3},
+    {"_philentropy_harmonic_mean_dist", (DL_FUNC) &_philentropy_harmonic_mean_dist, 3},
+    {"_philentropy_cosine_dist", (DL_FUNC) &_philentropy_cosine_dist, 3},
+    {"_philentropy_kumar_hassebrook", (DL_FUNC) &_philentropy_kumar_hassebrook, 3},
+    {"_philentropy_jaccard", (DL_FUNC) &_philentropy_jaccard, 3},
+    {"_philentropy_dice_dist", (DL_FUNC) &_philentropy_dice_dist, 3},
+    {"_philentropy_fidelity", (DL_FUNC) &_philentropy_fidelity, 3},
+    {"_philentropy_bhattacharyya", (DL_FUNC) &_philentropy_bhattacharyya, 5},
+    {"_philentropy_hellinger", (DL_FUNC) &_philentropy_hellinger, 3},
+    {"_philentropy_matusita", (DL_FUNC) &_philentropy_matusita, 3},
+    {"_philentropy_squared_chord", (DL_FUNC) &_philentropy_squared_chord, 3},
+    {"_philentropy_squared_euclidean", (DL_FUNC) &_philentropy_squared_euclidean, 3},
+    {"_philentropy_pearson_chi_sq", (DL_FUNC) &_philentropy_pearson_chi_sq, 4},
+    {"_philentropy_neyman_chi_sq", (DL_FUNC) &_philentropy_neyman_chi_sq, 4},
+    {"_philentropy_squared_chi_sq", (DL_FUNC) &_philentropy_squared_chi_sq, 3},
+    {"_philentropy_prob_symm_chi_sq", (DL_FUNC) &_philentropy_prob_symm_chi_sq, 3},
+    {"_philentropy_divergence_sq", (DL_FUNC) &_philentropy_divergence_sq, 3},
+    {"_philentropy_clark_sq", (DL_FUNC) &_philentropy_clark_sq, 3},
+    {"_philentropy_additive_symm_chi_sq", (DL_FUNC) &_philentropy_additive_symm_chi_sq, 3},
+    {"_philentropy_kullback_leibler_distance", (DL_FUNC) &_philentropy_kullback_leibler_distance, 5},
+    {"_philentropy_jeffreys", (DL_FUNC) &_philentropy_jeffreys, 5},
+    {"_philentropy_k_divergence", (DL_FUNC) &_philentropy_k_divergence, 4},
+    {"_philentropy_topsoe", (DL_FUNC) &_philentropy_topsoe, 4},
+    {"_philentropy_jensen_shannon", (DL_FUNC) &_philentropy_jensen_shannon, 4},
+    {"_philentropy_jensen_difference", (DL_FUNC) &_philentropy_jensen_difference, 4},
+    {"_philentropy_taneja", (DL_FUNC) &_philentropy_taneja, 5},
+    {"_philentropy_kumar_johnson", (DL_FUNC) &_philentropy_kumar_johnson, 4},
+    {"_philentropy_avg", (DL_FUNC) &_philentropy_avg, 3},
+    {"_philentropy_as_matrix", (DL_FUNC) &_philentropy_as_matrix, 1},
+    {"_philentropy_as_data_frame", (DL_FUNC) &_philentropy_as_data_frame, 1},
+    {"_philentropy_sum_rcpp", (DL_FUNC) &_philentropy_sum_rcpp, 1},
+    {"_philentropy_est_prob_empirical", (DL_FUNC) &_philentropy_est_prob_empirical, 1},
+    {"_philentropy_RcppExport_registerCCallable", (DL_FUNC) &_philentropy_RcppExport_registerCCallable, 0},
+    {NULL, NULL, 0}
+};
+
+RcppExport void R_init_philentropy(DllInfo *dll) {
+    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
+    R_useDynamicSymbols(dll, FALSE);
+}

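The `CallEntries` table and `R_init_philentropy` added above register each native routine with R by name, address, and argument count, terminated by a `{NULL, NULL, 0}` sentinel row. The same table-plus-sentinel idea can be sketched outside of R (all names here are hypothetical stand-ins; R's real `R_CallMethodDef` stores `SEXP`-based functions):

```cpp
#include <cstring>

// Simplified stand-in for a registered routine's signature.
typedef double (*Routine)(double, double);

static double add(double a, double b) { return a + b; }
static double mul(double a, double b) { return a * b; }

// Name, address, and arity per routine, like R_CallMethodDef.
struct Entry { const char* name; Routine fn; int nargs; };

static const Entry entries[] = {
    {"add", add, 2},
    {"mul", mul, 2},
    {nullptr, nullptr, 0}   // sentinel, like the {NULL, NULL, 0} row
};

// Walk the table until the sentinel; return nullptr for unknown names.
Routine lookup(const char* name) {
    for (const Entry* e = entries; e->name != nullptr; ++e)
        if (std::strcmp(e->name, name) == 0) return e->fn;
    return nullptr;
}
```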
src/dist_matrix.cpp (+230)

@@ -79,6 +79,7 @@ Rcpp::NumericMatrix DistMatrixWithoutUnitMAT(Rcpp::NumericMatrix dists, Rcpp::Fu
     return dist_matrix;
 }
 
+
 // @export
 // [[Rcpp::export]]
 Rcpp::NumericMatrix DistMatrixWithUnitDF(Rcpp::DataFrame distsDF, Rcpp::Function DistFunc, bool testNA, Rcpp::String unit){

@@ -129,4 +130,233 @@ Rcpp::NumericMatrix DistMatrixWithUnitMAT(Rcpp::NumericMatrix dists, Rcpp::Funct
     return dist_matrix;
 }
 
+//' @title Distances and Similarities between Two Probability Density Functions
+//' @description This function computes the distance/dissimilarity between two probability density functions.
+//' @param P a numeric vector storing the first distribution.
+//' @param Q a numeric vector storing the second distribution.
+//' @param method a character string indicating the distance measure that should be computed.
+//' @param p power of the Minkowski distance.
+//' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
+//' @param unit type of \code{log} function. Options are
+//' \itemize{
+//' \item \code{unit = "log"}
+//' \item \code{unit = "log2"}
+//' \item \code{unit = "log10"}
+//' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions, and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions very divergent then
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and only occur due to the
+//' technical issues of computing x / 0 or 0 / 0 cases.
+//' @return A single distance value.
+//' @author Hajk-Georg Drost
+//' @examples
+//' P <- 1:10 / sum(1:10)
+//' Q <- 20:29 / sum(20:29)
+//' dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
+//' @export
+// [[Rcpp::export]]
+double dist_one_one(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, const Rcpp::String& method, const double& p = NA_REAL, const bool& testNA = true, const Rcpp::String& unit = "log", const double& epsilon = 0.00001){
+    double dist_value;
+    if (method == "euclidean"){
+        dist_value = euclidean(P, Q, testNA);
+    } else if (method == "manhattan") {
+        dist_value = manhattan(P, Q, testNA);
+    } else if (method == "minkowski") {
+        dist_value = minkowski(P, Q, p, testNA);
+    } else if (method == "chebyshev") {
+        dist_value = chebyshev(P, Q, testNA);
+    } else if (method == "sorensen") {
+        dist_value = sorensen(P, Q, testNA);
+    } else if (method == "gower") {
+        dist_value = gower(P, Q, testNA);
+    } else if (method == "soergel") {
+        dist_value = soergel(P, Q, testNA);
+    } else if (method == "kulczynski_d") {
+        dist_value = kulczynski_d(P, Q, testNA, epsilon);
+    } else if (method == "canberra") {
+        dist_value = canberra(P, Q, testNA);
+    } else if (method == "lorentzian") {
+        dist_value = lorentzian(P, Q, testNA, unit);
+    } else if (method == "intersection") {
+        dist_value = intersection_dist(P, Q, testNA);
+    } else if (method == "non-intersection") {
+        dist_value = 1.0 - intersection_dist(P, Q, testNA);
+    } else if (method == "wavehedges") {
+        dist_value = wave_hedges(P, Q, testNA);
+    } else if (method == "czekanowski") {
+        dist_value = czekanowski(P, Q, testNA);
+    } else if (method == "motyka") {
+        dist_value = motyka(P, Q, testNA);
+    } else if (method == "kulczynski_s") {
+        dist_value = 1.0 / kulczynski_d(P, Q, testNA, epsilon);
+    } else if (method == "tanimoto") {
+        dist_value = tanimoto(P, Q, testNA);
+    } else if (method == "ruzicka") {
+        dist_value = ruzicka(P, Q, testNA);
+    } else if (method == "inner_product") {
+        dist_value = inner_product(P, Q, testNA);
+    } else if (method == "harmonic_mean") {
+        dist_value = harmonic_mean_dist(P, Q, testNA);
+    } else if (method == "cosine") {
+        dist_value = cosine_dist(P, Q, testNA);
+    } else if (method == "hassebrook") {
+        dist_value = kumar_hassebrook(P, Q, testNA);
+    } else if (method == "jaccard") {
+        dist_value = jaccard(P, Q, testNA);
+    } else if (method == "dice") {
+        dist_value = dice_dist(P, Q, testNA);
+    } else if (method == "fidelity") {
+        dist_value = fidelity(P, Q, testNA);
+    } else if (method == "bhattacharyya") {
+        dist_value = bhattacharyya(P, Q, testNA, unit, epsilon);
+    } else if (method == "hellinger") {
+        dist_value = hellinger(P, Q, testNA);
+    } else if (method == "matusita") {
+        dist_value = matusita(P, Q, testNA);
+    } else if (method == "squared_chord") {
+        dist_value = squared_chord(P, Q, testNA);
+    } else if (method == "squared_euclidean") {
+        dist_value = squared_euclidean(P, Q, testNA);
+    } else if (method == "pearson") {
+        dist_value = pearson_chi_sq(P, Q, testNA, epsilon);
+    } else if (method == "neyman") {
+        dist_value = neyman_chi_sq(P, Q, testNA, epsilon);
+    } else if (method == "squared_chi") {
+        dist_value = squared_chi_sq(P, Q, testNA);
+    } else if (method == "prob_symm") {
+        dist_value = prob_symm_chi_sq(P, Q, testNA);
+    } else if (method == "divergence") {
+        dist_value = divergence_sq(P, Q, testNA);
+    } else if (method == "clark") {
+        dist_value = clark_sq(P, Q, testNA);
+    } else if (method == "additive_symm") {
+        dist_value = additive_symm_chi_sq(P, Q, testNA);
+    } else if (method == "kullback-leibler") {
+        dist_value = kullback_leibler_distance(P, Q, testNA, unit, epsilon);
+    } else if (method == "jeffreys") {
+        dist_value = jeffreys(P, Q, testNA, unit, epsilon);
+    } else if (method == "k_divergence") {
+        dist_value = k_divergence(P, Q, testNA, unit);
+    } else if (method == "topsoe") {
+        dist_value = topsoe(P, Q, testNA, unit);
+    } else if (method == "jensen_shannon"){
+        dist_value = jensen_shannon(P, Q, testNA, unit);
+    } else if (method == "jensen_difference") {
+        dist_value = jensen_difference(P, Q, testNA, unit);
+    } else if (method == "taneja") {
+        dist_value = taneja(P, Q, testNA, unit, epsilon);
+    } else if (method == "kumar-johnson") {
+        dist_value = kumar_johnson(P, Q, testNA, epsilon);
+    } else if (method == "avg") {
+        dist_value = avg(P, Q, testNA);
+    } else {
+        Rcpp::stop("Specified method is not implemented. Please consult getDistMethods().");
+    }
+    return dist_value;
+}
+
+//' @title Distances and Similarities between One and Many Probability Density Functions
+//' @description This function computes the distance/dissimilarity between one probability density function and a set of probability density functions.
+//' @param P a numeric vector storing the first distribution.
+//' @param dists a numeric matrix storing distributions in its rows.
+//' @param method a character string indicating the distance measure that should be computed.
+//' @param p power of the Minkowski distance.
+//' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
+//' @param unit type of \code{log} function. Options are
+//' \itemize{
+//' \item \code{unit = "log"}
+//' \item \code{unit = "log2"}
+//' \item \code{unit = "log10"}
+//' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions, and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions very divergent then
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and only occur due to the
+//' technical issues of computing x / 0 or 0 / 0 cases.
+//' @return A vector of distance values.
+//' @author Hajk-Georg Drost
+//' @examples
+//' set.seed(2020-08-20)
+//' P <- 1:10 / sum(1:10)
+//' M <- t(replicate(100, sample(1:10, size = 10) / 55))
+//' dist_one_many(P, M, method = "euclidean", testNA = FALSE)
+//' @export
+// [[Rcpp::export]]
+Rcpp::NumericVector dist_one_many(const Rcpp::NumericVector& P, Rcpp::NumericMatrix dists, Rcpp::String method, double p = NA_REAL, bool testNA = true, Rcpp::String unit = "log", double epsilon = 0.00001){
+    int nrows = dists.nrow();
+    Rcpp::NumericVector dist_values(nrows);
+    for (int i = 0; i < nrows; i++){
+        dist_values[i] = dist_one_one(P, dists(i, Rcpp::_), method, p, testNA, unit, epsilon);
+    }
+    return dist_values;
+}
+
+//' @title Distances and Similarities between Many Probability Density Functions
+//' @description This function computes the distance/dissimilarity between two sets of probability density functions.
+//' @param dists1 a numeric matrix storing distributions in its rows.
+//' @param dists2 a numeric matrix storing distributions in its rows.
+//' @param method a character string indicating the distance measure that should be computed.
+//' @param p power of the Minkowski distance.
+//' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
+//' @param unit type of \code{log} function. Options are
+//' \itemize{
+//' \item \code{unit = "log"}
+//' \item \code{unit = "log2"}
+//' \item \code{unit = "log10"}
+//' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions, and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions very divergent then
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and only occur due to the
+//' technical issues of computing x / 0 or 0 / 0 cases.
+//' @return A matrix of distance values.
+//' @author Hajk-Georg Drost
+//' @examples
+//' set.seed(2020-08-20)
+//' M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
+//' M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
+//' result <- dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
+//' @export
+// [[Rcpp::export]]
+Rcpp::NumericMatrix dist_many_many(Rcpp::NumericMatrix dists1, Rcpp::NumericMatrix dists2, Rcpp::String method, double p = NA_REAL, bool testNA = true, Rcpp::String unit = "log", double epsilon = 0.00001){
+    int nrows1 = dists1.nrow();
+    int nrows2 = dists2.nrow();
+    Rcpp::NumericMatrix dist_matrix(nrows1, nrows2);
+    for (int i = 0; i < nrows1; i++){
+        for (int j = 0; j < nrows2; j++){
+            dist_matrix(i, j) = dist_one_one(dists1(i, Rcpp::_), dists2(j, Rcpp::_), method, p, testNA, unit, epsilon);
+        }
+    }
+    return dist_matrix;
+}

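`dist_one_one` dispatches on the `method` string through a long if/else chain. An alternative design is a lookup table keyed by method name; the sketch below shows that pattern with plain `std::vector` in place of Rcpp types and only two methods implemented (the `dispatch` helper is hypothetical, not part of philentropy):

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

using Vec = std::vector<double>;
using DistFn = std::function<double(const Vec&, const Vec&)>;

// Table-based alternative to the if/else dispatch chain: each method
// name maps to its distance function, and unknown names raise an error
// (cf. Rcpp::stop("Specified method is not implemented. ...")).
double dispatch(const Vec& P, const Vec& Q, const std::string& method) {
    static const std::unordered_map<std::string, DistFn> table = {
        {"euclidean", [](const Vec& p, const Vec& q) {
            double s = 0.0;
            for (std::size_t i = 0; i < p.size(); ++i)
                s += (p[i] - q[i]) * (p[i] - q[i]);
            return std::sqrt(s);
        }},
        {"manhattan", [](const Vec& p, const Vec& q) {
            double s = 0.0;
            for (std::size_t i = 0; i < p.size(); ++i)
                s += std::fabs(p[i] - q[i]);
            return s;
        }},
    };
    auto it = table.find(method);
    if (it == table.end())
        throw std::invalid_argument("Specified method is not implemented.");
    return it->second(P, Q);
}
```

A table keeps each method definition self-contained and makes lookup cost independent of the number of methods, at the price of indirection through `std::function`; for a fixed, moderate set of methods the chain used in the commit is equally reasonable.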
‎src/distances.h

+133-8
Original file line numberDiff line numberDiff line change
@@ -366,9 +366,23 @@ double soergel(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool
366366
//' @param P a numeric vector storing the first distribution.
367367
//' @param Q a numeric vector storing the second distribution.
368368
//' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
369+
//' @param epsilon epsilon a small value to address cases in the distance computation where division by zero occurs. In
370+
//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
371+
//' However, we recommend to choose a custom \code{epsilon} value depending on the size of the input vectors,
372+
//' the expected similarity between compared probability density functions and
373+
//' whether or not many 0 values are present within the compared vectors.
374+
//' As a rough rule of thumb we suggest that when dealing with very large
375+
//' input vectors which are very similar and contain many \code{0} values,
376+
//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
377+
//' whereas when vector sizes are small or distributions very divergent then
378+
//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
379+
//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
380+
//' return negative values which are not defined and only occur due to the
381+
//' technical issues of computing x / 0 or 0 / 0 cases.
369382
//' @author Hajk-Georg Drost
370383
//' @examples
371-
//' kulczynski_d(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE)
384+
//' kulczynski_d(P = 1:10/sum(1:10), Q = 20:29/sum(20:29),
385+
//' testNA = FALSE, epsilon = 0.00001)
372386
//' @export
373387
// [[Rcpp::export]]
374388
double kulczynski_d(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, double epsilon){
@@ -1067,14 +1081,41 @@ double fidelity(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool
 //' @param Q a numeric vector storing the second distribution.
 //' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
 //' @param unit type of \code{log} function. Options are
 //' \itemize{
 //' \item \code{unit = "log"}
 //' \item \code{unit = "log2"}
 //' \item \code{unit = "log10"}
 //' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' bhattacharyya(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE, unit = "log2")
+//' bhattacharyya(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE,
+//'               unit = "log2", epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double bhattacharyya(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, const Rcpp::String unit, double epsilon){
@@ -1225,9 +1266,23 @@ double squared_euclidean(const Rcpp::NumericVector& P, const Rcpp::NumericVector
 //' @param P a numeric vector storing the first distribution.
 //' @param Q a numeric vector storing the second distribution.
 //' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' pearson_chi_sq(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE)
+//' pearson_chi_sq(P = 1:10/sum(1:10), Q = 20:29/sum(20:29),
+//'                testNA = FALSE, epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double pearson_chi_sq(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, double epsilon){
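To make the documented `epsilon` rule concrete, here is a hypothetical base-R sketch (an assumption about the behavior, not the package's actual C++ code): reading the docs as replacing a zero denominator with `epsilon` keeps a Pearson chi-squared style term `(P_i - Q_i)^2 / Q_i` finite when `Q_i = 0`.

```r
# Hypothetical sketch of the documented epsilon rule (assumed behavior,
# not philentropy's C++ implementation): replace a zero denominator by
# epsilon so x / 0 and 0 / 0 terms stay finite.
safe_ratio <- function(num, den, epsilon = 0.00001) {
  ifelse(den == 0, num / epsilon, num / den)
}

P <- c(0.5, 0.5, 0.0)
Q <- c(0.5, 0.0, 0.5)

# Pearson chi-squared style terms: (P_i - Q_i)^2 / Q_i
terms <- safe_ratio((P - Q)^2, Q)
d <- sum(terms)
is.finite(d)  # TRUE: the Q_i = 0 term no longer yields Inf/NaN
```

Without the guard, the second term would be `0.25 / 0 = Inf` and the sum would be infinite.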
@@ -1280,9 +1335,23 @@ double pearson_chi_sq(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q
 //' @param P a numeric vector storing the first distribution.
 //' @param Q a numeric vector storing the second distribution.
 //' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' neyman_chi_sq(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE)
+//' neyman_chi_sq(P = 1:10/sum(1:10), Q = 20:29/sum(20:29),
+//'               testNA = FALSE, epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double neyman_chi_sq(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, double epsilon){
@@ -1541,9 +1610,23 @@ double additive_symm_chi_sq(const Rcpp::NumericVector& P, const Rcpp::NumericVec
 //' \item \code{unit = "log2"}
 //' \item \code{unit = "log10"}
 //' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' kullback_leibler_distance(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE, unit = "log2")
+//' kullback_leibler_distance(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE,
+//'                           unit = "log2", epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double kullback_leibler_distance(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, const Rcpp::String unit, double epsilon){
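The `unit` argument only rescales log-based measures. A small base-R check (the textbook Kullback-Leibler formula on zero-free vectors, not the package's epsilon-guarded implementation) shows that values in nats (`unit = "log"`) and bits (`unit = "log2"`) differ by the constant factor `log(2)`:

```r
# Base-R sketch (assumes zero-free probability vectors, so no epsilon
# handling is needed): KL divergence in nats vs. bits.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
kl <- function(p, q, log_fun) sum(p * log_fun(p / q))

kl_nats <- kl(P, Q, log)    # corresponds to unit = "log"
kl_bits <- kl(P, Q, log2)   # corresponds to unit = "log2"
all.equal(kl_nats, kl_bits * log(2))  # TRUE
```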
@@ -1660,9 +1743,23 @@ double kullback_leibler_distance(const Rcpp::NumericVector& P, const Rcpp::Numer
 //' \item \code{unit = "log2"}
 //' \item \code{unit = "log10"}
 //' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' jeffreys(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE, unit = "log2")
+//' jeffreys(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE,
+//'          unit = "log2", epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double jeffreys(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, const Rcpp::String unit, double epsilon){
@@ -2344,9 +2441,23 @@ double jensen_difference(const Rcpp::NumericVector& P, const Rcpp::NumericVector
 //' \item \code{unit = "log2"}
 //' \item \code{unit = "log10"}
 //' }
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' taneja(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE, unit = "log2")
+//' taneja(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE,
+//'        unit = "log2", epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double taneja(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, const Rcpp::String unit, double epsilon){
@@ -2465,9 +2576,23 @@ double taneja(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool t
 //' @param P a numeric vector storing the first distribution.
 //' @param Q a numeric vector storing the second distribution.
 //' @param testNA a logical value indicating whether or not distributions shall be checked for \code{NA} values.
+//' @param epsilon a small value to address cases in the distance computation where division by zero occurs. In
+//' these cases, x / 0 or 0 / 0 will be replaced by \code{epsilon}. The default is \code{epsilon = 0.00001}.
+//' However, we recommend choosing a custom \code{epsilon} value depending on the size of the input vectors,
+//' the expected similarity between the compared probability density functions and
+//' whether or not many 0 values are present within the compared vectors.
+//' As a rough rule of thumb, we suggest that when dealing with very large
+//' input vectors which are very similar and contain many \code{0} values,
+//' the \code{epsilon} value should be set even smaller (e.g. \code{epsilon = 0.000000001}),
+//' whereas when vector sizes are small or distributions are very divergent,
+//' higher \code{epsilon} values may also be appropriate (e.g. \code{epsilon = 0.01}).
+//' Addressing this \code{epsilon} issue is important to avoid cases where distance metrics
+//' return negative values which are not defined and occur only due to the
+//' technical issue of computing x / 0 or 0 / 0 cases.
 //' @author Hajk-Georg Drost
 //' @examples
-//' kumar_johnson(P = 1:10/sum(1:10), Q = 20:29/sum(20:29), testNA = FALSE)
+//' kumar_johnson(P = 1:10/sum(1:10), Q = 20:29/sum(20:29),
+//'               testNA = FALSE, epsilon = 0.00001)
 //' @export
 // [[Rcpp::export]]
 double kumar_johnson(const Rcpp::NumericVector& P, const Rcpp::NumericVector& Q, bool testNA, double epsilon){

‎src/philentropy_init.c

-146
This file was deleted.

‎tests/testthat/test-dist-functions.R

+52
@@ -0,0 +1,52 @@
context("Test dist_one_one, dist_one_many, dist_many_many ...")

set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(20, sample(1:10, size = 10) / 55))

doo1 <- dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
dom1 <- dist_one_many(P, M1, method = "euclidean", testNA = FALSE)
dmm1 <- dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)

test_that("dist_one_one output structure is correct", {
  expect_type(doo1, "double")
  expect_length(doo1, 1)
})

test_that("dist_one_many output structure is correct", {
  expect_type(dom1, "double")
  expect_length(dom1, nrow(M1))
})

test_that("dist_many_many output structure is correct", {
  expect_type(dmm1, "double")
  expect_equal(dim(dmm1), c(nrow(M1), nrow(M2)))
})

doo2 <- euclidean(P, Q, FALSE)

test_that("dist_one_one output is correct", {
  expect_equal(doo1, doo2)
})

dom2 <- vector(length = nrow(M1))
for (i in seq_len(nrow(M1))){
  dom2[i] <- euclidean(P, M1[i, ], FALSE)
}

test_that("dist_one_many output is correct", {
  expect_equal(dom1, dom2)
})

dmm2 <- matrix(nrow = nrow(M1), ncol = nrow(M2))
for (i in seq_len(nrow(M1))){
  for (j in seq_len(nrow(M2))){
    dmm2[i, j] <- euclidean(M1[i, ], M2[j, ], FALSE)
  }
}

test_that("dist_many_many output is correct", {
  expect_equal(dmm1, dmm2)
})

‎vignettes/Distances.Rmd

+2-2
@@ -184,7 +184,7 @@ As you can see, although the `distance()` function is quite fast, the internal c
 
 The advantage of `distance()` is that it implements 46 distance measures based on base C++ functions that can be accessed individually by typing `philentropy::` and then `TAB`. In future versions of `philentropy` I will optimize the `distance()` function so that internal checks for data type correctness and correct input data will take less time than the base `dist()` function.
 
-
+<!--
 ## Detailed assessment of individual similarity and distance metrics
 
@@ -210,4 +210,4 @@ Euclid argued that the __shortest__ distance between two points is always a
 #### Chebyshev distance
 
 > $d = max | P_i - Q_i |$
-
+-->

‎vignettes/Many_Distances.Rmd

+132
@@ -0,0 +1,132 @@
---
title: "Comparing many probability density functions"
author: Jakub Nowosad
date: 2021-08-20
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Many_Distances}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

The **philentropy** package has several mechanisms to calculate distances between probability density functions.
The main one is the `distance()` function, which can compute 46 different distances/similarities between probability density functions (see `?philentropy::distance` and [a companion vignette](Distances.html) for details).
Alternatively, it is possible to call each distance/dissimilarity function directly.
For example, the `euclidean()` function computes the euclidean distance, while `jaccard()` computes the Jaccard distance.
The complete list of available distance measures can be obtained with the `philentropy::getDistMethods()` function.

Both of the above approaches have their pros and cons.
The `distance()` function is more flexible, as it allows users to use any distance measure and can return either a `matrix` or a `dist` object.
It also has several defensive programming checks implemented and is thus more appropriate for regular users.
Single distance functions, such as `euclidean()` or `jaccard()`, can on the other hand be slightly faster, as they directly call the underlying C++ code.

Now, we introduce three new low-level functions that are intermediaries between `distance()` and the single distance functions.
They are fairly flexible, allowing any implemented distance measure to be used, but are also usually faster than calling `distance()` (especially when many distances must be computed).
These functions are:

- `dist_one_one()` - expects two vectors (probability density functions), returns a single value
- `dist_one_many()` - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
- `dist_many_many()` - expects two matrices (two sets of probability density functions), returns a matrix of values
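The three call shapes can be sketched with base-R stand-ins (hypothetical helper names, euclidean case only, not the package's C++ implementations):

```r
# Base-R stand-ins that mimic the *shapes* of the three functions:
# scalar, vector, matrix.
one_one   <- function(p, q) sqrt(sum((p - q)^2))
one_many  <- function(p, m) apply(m, 1, function(row) one_one(p, row))
many_many <- function(ma, mb) t(apply(ma, 1, function(row) one_many(row, mb)))

set.seed(1)
P  <- 1:10 / sum(1:10)
M1 <- t(replicate(4, sample(1:10) / 55))
M2 <- t(replicate(6, sample(1:10) / 55))

length(one_one(P, M1[1, ]))  # 1: a single value
length(one_many(P, M1))      # 4: one value per row of M1
dim(many_many(M1, M2))       # 4 6: rows of M1 x rows of M2
```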
Let's start testing them by attaching the **philentropy** package.

```{r}
library(philentropy)
```

## `dist_one_one()`

`dist_one_one()` is a lower-level equivalent of `distance()`.
However, instead of accepting a numeric `data.frame` or `matrix`, it expects two vectors representing probability density functions.
In this example, we create two vectors, `P` and `Q`.

```{r}
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
```

To calculate the euclidean distance between them, we can use several approaches: (a) the built-in R `dist()` function, (b) `philentropy::distance()`, (c) `philentropy::euclidean()`, or (d) the new `dist_one_one()`.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
```

All of them return the same single value.
However, as you can see in the benchmark above, some are more flexible, and others are faster.

## `dist_one_many()`

The role of `dist_one_many()` is to calculate distances between one probability density function (in the form of a `vector`) and a set of probability density functions (as rows in a `matrix`).

First, let's create our example data.

```{r}
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))
```

`P` is our input vector and `M` is our input matrix.

Distances between the `P` vector and the probability density functions in `M` can be calculated using several approaches.
For example, we could write a `for` loop (adding new code) or just use the existing `distance()` function and extract only one row (or column) from the results.
`dist_one_many()` performs this calculation directly: it goes through each row in `M` and calculates a given distance measure between `P` and the values in that row.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
```

`dist_one_many()` returns a vector of values.
In this case, it is much faster than `distance()`, and visibly faster than `dist()`, while allowing more distance measures to be used.

## `dist_many_many()`

`dist_many_many()` calculates distances between two sets of probability density functions (as rows in two `matrix` objects).

Let's create two new `matrix` objects as example data.

```{r}
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
```

`M1` is our first input matrix and `M2` is our second input matrix.
I am not aware of any function built into base R that calculates distances between the rows of two matrices, so to solve this problem we can create our own, `many_dists()`...

```{r}
many_dists <- function(m1, m2){
  r <- matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))){
    for (j in seq_len(nrow(m2))){
      x <- rbind(m1[i, ], m2[j, ])
      r[i, j] <- distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}
```

... and compare it to `dist_many_many()`.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  many_dists(M1, M2),
  dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
```

Both `many_dists()` and `dist_many_many()` return a matrix.
The benchmark above shows that `dist_many_many()` is about 30 times faster than our custom `many_dists()` approach.
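A closing usage note (again a base-R sketch with a hypothetical `euclid()` helper, not package code): the many-many case with both arguments equal yields the full pairwise self-distance matrix, which for a metric such as the euclidean distance is square and symmetric with a zero diagonal.

```r
# Pairwise self-distances: the many-many case with both inputs equal.
set.seed(1)
M <- t(replicate(5, sample(1:10) / 55))
euclid <- function(p, q) sqrt(sum((p - q)^2))  # hypothetical helper

n <- nrow(M)
D <- outer(seq_len(n), seq_len(n),
           Vectorize(function(i, j) euclid(M[i, ], M[j, ])))

all(D == t(D))     # TRUE: symmetric
all(diag(D) == 0)  # TRUE: zero self-distance
```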
