levenshtein.scm: Levenshtein Distance Metric in Scheme
levenshtein.scm: Levenshtein Distance Metric in Scheme
******************************************************
Version 0.3, 2005-07-09, `http://www.neilvandyke.org/levenshtein-scm/'
by Neil W. Van Dyke <[email protected]>
Copyright (C) 2004 - 2005 Neil W. Van Dyke. This program is Free
Software; you can redistribute it and/or modify it under the terms
of the GNU Lesser General Public License as published by the Free
Software Foundation; either version 2.1 of the License, or (at
your option) any later version. This program is distributed in
the hope that it will be useful, but without any warranty; without
even the implied warranty of merchantability or fitness for a
particular purpose. See <http://www.gnu.org/copyleft/lesser.html>
for details. For other license options and consulting, contact
the author.
Introduction
************
This is a Scheme implementation of the "Levenshtein Distance"
algorithm, which is an "edit distance" metric of string similarity, due
to Vladimir Levenshtein. The Levenshtein Distance is a function of two
strings that represents a count of single-character insertions,
deletions, and substitions that will change the first string to the
second. More information is available in NIST DADS
(http://www.nist.gov/dads/HTML/Levenshtein.html) and Michael Gilleland
article, "in Three Flavors."
This implementation is modeled after a space-efficient Perl
implementation (http://www.mgilleland.com/ld/ldperl2.htm) by Jorge Mas
Trullenque. It has been written in R5RS Scheme (with an `error'
procedure such as the one in SRFI-23), and extended to support
heterogeneous combinations of Scheme types (strings, lists, vectors),
user-supplied predicate functions, and optionally reusable scratch
vectors.
This version 0.1 is an early release that has been tested only
lightly.
Basic Comparisons
*****************
In the current implementation, all comparisons are done internally via
vectors.
> (vector-levenshtein/predicate/get-scratch a b pred)
get-scratch
Few, if any, programs will use this procedure directly. This is
like `vector-levenshtein/predicate', but allows GET-SCRATCH to be
specified. GET-SCRATCH is a procedure of one term, n, that yields
a vector of length n or greater, which is used for record-keeping
during execution of the Levenshtein algorithm. `make-vector' can
be used for GET-SCRATCH, although some programs comparing a large
size or quantity of vectors may wish to reuse a record-keeping
vector, rather than each time allocating a new one that will need
to be garbage-collected.
> (vector-levenshtein/predicate a b pred)
> (vector-levenshtein/eq a b)
> (vector-levenshtein/eqv a b)
> (vector-levenshtein/equal a b)
> (vector-levenshtein a b)
Calculate the Levenshtein Distance of vectors A and B. PRED is
the predicate procedure for determining if two elements are equal.
The `/eq', `/eqv', and `/equal' variants correspond to the
standard equivalence predicates, `eq?', `eqv?', and `equal?'.
`vector-levenshtein' is an alias for `vector-levenshtein/equal'.
(vector-levenshtein '#(6 6 6) '#(6 35 6 24 6 32)) => 3
> (list-levenshtein/predicate a b pred)
> (list-levenshtein/eq a b)
> (list-levenshtein/eqv a b)
> (list-levenshtein/equal a b)
> (list-levenshtein a b)
Calculate the Levenshtein Distance of lists A and B. PRED is the
predicate procedure for determining if two elements are equal.
The `/eq', `/eqv', and `/equal' variants correspond to the
standard equivalence predicates, `eq?', `eqv?', and `equal?'.
`list-levenshtein' is an alias for `list-levenshtein/equal'. Note
that comparison of lists is less efficient than comparison of
vectors.
(list-levenshtein/eq '(b c e x f y) '(a b c d e f)) => 4
> (string-levenshtein a b)
Calculate the Levenshtein Distance of strings A and B.
(string-levenshtein "adresse" "address") => 2
Type-Coercing Comparisons
*************************
Procedures `levenshtein' and `levenshtein/predicate' provide a
convenient interface for comparing a combination of vectors, lists, and
strings, the types of which might not be known until runtime.
> (levenshtein/predicate a b pred)
Calculates the Levenshtein Distance of two objects A and B, which
are vectors, lists, or strings. A and B need not be of the same
type. PRED is the element equivalence predicate used.
(levenshtein/predicate '#(#\A #\B #\C #\D)
"aBXcD"
char-ci=?)
=> 1
> (levenshtein a b)
Calculate the levenshtein distance of A and B, in a similar manner
as using `levenshtein/predicate' with `equal?' as the predicate.
(define g '#(#\g #\u #\m #\b #\o))
(levenshtein g "gambol") => 2
(levenshtein g "dumbo") => 1
(levenshtein g "umbrage") => 5
History
*******
Version 0.3 -- 2005-07-09
PLaneT release, and minor documentation changes.
Version 0.2 -- 2004-07-06
Documentation changes.
Version 0.1 -- 2004-05-13
First release. Tested only lightly, and today _is_ the 13th, so
caveat emptor.