{tidystringdist} is a package that extends the {stringdist} package with tidy data principles.
The idea is to perform string distance calculation and combine it with functions for data manipulation and visualisation from the tidyverse framework.
You can install the latest stable version from CRAN with:
Or the dev version from GitHub:
tidy_comb()
The tidy_comb() and tidy_comb_all() functions return all the possible combinations from a vector, from a data.frame and one of its columns, or from two vectors:
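As a rough illustration, the sketch below builds combinations from a single vector and from two vectors. The input values are made up for the example, and the argument order passed to tidy_comb() (the vector first, then the strings to pair it with) is an assumption rather than something stated above.

```r
library(tidystringdist)

# Every unordered pair drawn from a single vector of three elements
pairs_all <- tidy_comb_all(LETTERS[1:3])

# Pair the elements of one vector with a second vector
# (argument order is assumed: the data first, then the base strings)
pairs_base <- tidy_comb(c("London", "Berlin"), "Paris")
```

Both calls should return a data.frame of string pairs that can be passed on for distance calculation.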
Once you’ve got this data.frame, you can use tidy_stringdist() to compute string distances. This function takes a data.frame, the two columns containing the strings, and one or more stringdist methods.
library(tidystringdist)

# Build every pair of US state names, then compute the distances
comb <- tidy_comb_all(state.name)
tidy_stringdist(comb)
#> # A tibble: 1,225 × 12
#> V1 V2 osa lv dl hamming lcs qgram cosine jaccard jw
#> * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama Alaska 3 3 3 Inf 5 5 0.216 0.571 0.254
#> 2 Alabama Arizona 5 5 5 5 10 10 0.581 0.8 0.476
#> 3 Alabama Arkansas 6 6 6 Inf 9 9 0.440 0.778 0.399
#> 4 Alabama California 8 8 8 Inf 13 11 0.481 0.818 0.535
#> 5 Alabama Colorado 6 6 6 Inf 11 11 0.704 0.778 0.488
#> 6 Alabama Connectic… 11 11 11 Inf 18 18 1 1 1
#> 7 Alabama Delaware 5 5 5 Inf 9 9 0.440 0.778 0.399
#> 8 Alabama Florida 5 5 5 5 10 10 0.581 0.8 0.476
#> 9 Alabama Georgia 6 6 6 6 12 12 0.686 0.909 0.571
#> 10 Alabama Hawaii 5 5 5 Inf 9 9 0.474 0.875 0.460
#> # ℹ 1,215 more rows
#> # ℹ 1 more variable: soundex <dbl>
By default, tidy_stringdist() computes all the methods. You can select specific methods with the method argument:
comb <- tidy_comb_all(state.name)
tidy_stringdist(comb, method = c("osa","jw"))
#> # A tibble: 1,225 × 4
#> V1 V2 osa jw
#> * <chr> <chr> <dbl> <dbl>
#> 1 Alabama Alaska 3 0.254
#> 2 Alabama Arizona 5 0.476
#> 3 Alabama Arkansas 6 0.399
#> 4 Alabama California 8 0.535
#> 5 Alabama Colorado 6 0.488
#> 6 Alabama Connecticut 11 1
#> 7 Alabama Delaware 5 0.399
#> 8 Alabama Florida 5 0.476
#> 9 Alabama Georgia 6 0.571
#> 10 Alabama Hawaii 5 0.460
#> # ℹ 1,215 more rows
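Because the result is an ordinary data.frame, it can be passed straight to tidyverse verbs. The sketch below is a minimal example under two assumptions: that tidy_stringdist() accepts the two string columns through arguments named v1 and v2, and that a plain data.frame (not only the output of tidy_comb_all()) is a valid input. The pairs data.frame and its column names are hypothetical.

```r
library(tidystringdist)
library(dplyr)

# Hypothetical input: two string columns with non-default names
pairs <- data.frame(
  left  = c("Alabama", "Alabama", "Florida"),
  right = c("Alaska", "Arizona", "Georgia"),
  stringsAsFactors = FALSE
)

# Name the two string columns explicitly (v1 / v2 are assumed argument
# names) and request a single method
dists <- tidy_stringdist(pairs, v1 = left, v2 = right, method = "osa")

# Keep only the closest pairs and sort them with regular dplyr verbs
closest <- dists %>%
  filter(osa <= 3) %>%
  arrange(osa)
```

Requesting a single method keeps the output narrow, which is convenient when the distance is only an intermediate step in a longer pipeline.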