24 Mar
Дупки
От известно време се каня да ругая по българската народна милиция. Да, именно милиция, защото според мен никак не са си променили манталитета от онова време… Но първо първите неща, както казват англоезичните ни братя.
24 Mar
От известно време се каня да ругая по българската народна милиция. Да, именно милиция, защото според мен никак не са си променили манталитета от онова време… Но първо първите неща, както казват англоезичните ни братя.
17 Mar
A friend of mine recently had a problem. He needed to compare how equal are two strings. We all know how easy is to tell whether two strings are equal, but if they are not exactly equal, then how different are they?
If you google for partial string matching there are tons of resources… but not for C#. We found that most applicable algorithm is the Levenstein distance (also called edit distance). It measures exactly how many edits (insertion, deletion or substitution of a single character) you have to do to get one string from another. You can read more about it at Wikipedia.
Since we are to compare strings, it seems rational to extend System.String to support a simple method for getting the edit distance.
public static int GetLevensteinDistance(this string firstString, string secondString)
{
if (firstString == null)
throw new ArgumentNullException("firstString");
if (secondString == null)
throw new ArgumentNullException("secondString");
if (firstString == secondString)
return 0;
int[,] matrix = new int[firstString.Length + 1, secondString.Length + 1];
for (int i = 0; i <= firstString.Length; i++)
matrix[i, 0] = i; // deletion
for (int j = 0; j <= secondString.Length; j++)
matrix[0, j] = j; // insertion
for (int i = 0; i < firstString.Length; i++)
for (int j = 0; j < secondString.Length; j++)
if (firstString[i] == secondString[j])
matrix[i + 1, j + 1] = matrix[i, j];
else
{
matrix[i + 1, j + 1] = Math.Min(matrix[i, j + 1] + 1, matrix[i + 1, j] + 1); //deletion or insertion
matrix[i + 1, j + 1] = Math.Min(matrix[i + 1, j + 1], matrix[i, j] + 1); //substitution
}
return matrix[firstString.Length, secondString.Length];
}
Now, I know it is not so readable, but if you really need an explanation of how it works, check Wikipedia page, it is pretty direct implementation.
Okay, now we can compute the edit distance, but the goal is still comparing two strings, so we need some percentage. In order to have constant result regarding whether we are comparing string1 to string2 or vice versa and still get result from 0 to 1, we will compare longer string to shorter.
public static double GetSimilarity(this string firstString, string secondString)
{
if (firstString == null)
throw new ArgumentNullException("firstString");
if (secondString == null)
throw new ArgumentNullException("secondString");
if (firstString == secondString)
return 1;
int longestLenght = Math.Max(firstString.Length, secondString.Length);
int distance = GetLevensteinDistance(firstString, secondString);
double percent = distance/(double) longestLenght;
return 1 - percent;
}
That’s it! I have to advise you though, that it’s a costly operation and running this against a large database with big text is not going to be very fast.
I have to thank Miro, for the support. He is an awesome guy!
And finally, I want to thank Velin for reading.
16 Mar
Моя добър приятел Велин, с другарска усмивка и благ тон ме подкани да си направя блог. Имало хора, които видите ли можело и да се зачетат в написаното. Един вид – сея мъдрост, а другите попиват. Идеята е хубава, да видим какво ще последва. Може пък и да почна да сея мъдрост. И все пак, познавам се другарю Велин – това е поредната дивотия, която ще ме мързи да продължа.
На всички, които четат това искам да кажа:
Лека нощ, Велин!