Original link: https://qq52o.me/2808.html

There is such a requirement: it is necessary to compare the similarity of the content titles published by the user. If the similarity between the previous content and the currently published content title reaches a certain threshold, it is forbidden to publish or perform some other operations.

Seeing this requirement, you may think that you need to use a certain algorithm to achieve it, such as: `TF-IDF`

, cosine algorithm based on space vector, longest common subsequence, minimum edit distance algorithm, Jaccard coefficient and so on.

The minimum edit distance algorithm has been implemented in PHP: levenshtein , which calculates the edit distance between two strings.

` ``levenshtein( string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1 ): int`

The edit distance refers to the minimum number of characters required to convert the string `string1`

into `string2`

by replacing, inserting, deleting and other operations between two strings.

The complexity of this algorithm is `O(m*n)`

, where n and m are the lengths of `string1`

and `string2`

, respectively.

Let’s demonstrate some nonsense literature:

` ``echo levenshtein('听君一席话', '听君一席话'); // 0 echo levenshtein('听君一席话', '如听一席话'); // 3 echo levenshtein('我不要你觉得', '我要我觉得'); // 6 echo levenshtein('今天的天气怎么样？', '你吃饭了吗？'); // 21`

When the edit distance is smaller, the similarity is higher.

In addition to edit distance, PHP also directly provides a function to calculate the similarity of two strings: similar_text .

` ``similar_text(string $string1, string $string2, float &$percent = null): int`

Returns the number of matching characters in both strings.

By passing a reference as the third argument, `similar_text()`

will calculate a similarity of 100 by dividing the result of `similar_text()`

by the average length of the given string, multiplied by a percentage.

` ``echo similar_text('听君一席话', '听君一席话', $percent); // 15 echo $percent; // 100 echo similar_text('听君一席话', '如听一席话', $percent); // 12 echo $percent; // 80 echo similar_text('我不要你觉得', '我要我觉得', $percent); // 12 echo $percent; // 72.727272727273 echo similar_text('今天的天气怎么样？', '你吃饭了吗？', $percent); // 6 echo $percent; // 26.666666666667`

The similarity calculation for this function is performed as described in Programming Classics: Implementing the World’s Best Algorithms by Oliver (ISBN 0-131-00413-1).

The implementation of this function uses recursive calls, so it may cause the whole process to be slower or faster. The complexity of the algorithm is `O(N**3)`

, where N is the length of the longest string.

When `$percent`

is larger, the similarity is higher.

The number of matching characters is calculated by finding the longest first common substring, then recursively doing this for prefixes and suffixes. Add the lengths of all found common substrings.

This article is reprinted from: https://qq52o.me/2808.html

This site is for inclusion only, and the copyright belongs to the original author.