PHP Tips You Don’t Know About Calculating Text Similarity

Original link:


There is such a requirement: it is necessary to compare the similarity of the content titles published by the user. If the similarity between the previous content and the currently published content title reaches a certain threshold, it is forbidden to publish or perform some other operations.

Seeing this requirement, you may think that you need to use a certain algorithm to achieve it, such as: TF-IDF , cosine algorithm based on space vector, longest common subsequence, minimum edit distance algorithm, Jaccard coefficient and so on.

The minimum edit distance algorithm has been implemented in PHP: levenshtein , which calculates the edit distance between two strings.

 levenshtein( string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1 ): int

The edit distance refers to the minimum number of characters required to convert the string string1 into string2 by replacing, inserting, deleting and other operations between two strings.

The complexity of this algorithm is O(m*n) , where n and m are the lengths of string1 and string2 , respectively.

Let’s demonstrate some nonsense literature:

 echo levenshtein('听君一席话', '听君一席话'); // 0 echo levenshtein('听君一席话', '如听一席话'); // 3 echo levenshtein('我不要你觉得', '我要我觉得'); // 6 echo levenshtein('今天的天气怎么样?', '你吃饭了吗?'); // 21

When the edit distance is smaller, the similarity is higher.

In addition to edit distance, PHP also directly provides a function to calculate the similarity of two strings: similar_text .

 similar_text(string $string1, string $string2, float &$percent = null): int

Returns the number of matching characters in both strings.

By passing a reference as the third argument, similar_text() will calculate a similarity of 100 by dividing the result of similar_text() by the average length of the given string, multiplied by a percentage.

 echo similar_text('听君一席话', '听君一席话', $percent); // 15 echo $percent; // 100 echo similar_text('听君一席话', '如听一席话', $percent); // 12 echo $percent; // 80 echo similar_text('我不要你觉得', '我要我觉得', $percent); // 12 echo $percent; // 72.727272727273 echo similar_text('今天的天气怎么样?', '你吃饭了吗?', $percent); // 6 echo $percent; // 26.666666666667

The similarity calculation for this function is performed as described in Programming Classics: Implementing the World’s Best Algorithms by Oliver (ISBN 0-131-00413-1).

The implementation of this function uses recursive calls, so it may cause the whole process to be slower or faster. The complexity of the algorithm is O(N**3) , where N is the length of the longest string.

When $percent is larger, the similarity is higher.

The number of matching characters is calculated by finding the longest first common substring, then recursively doing this for prefixes and suffixes. Add the lengths of all found common substrings.

This article is reprinted from:
This site is for inclusion only, and the copyright belongs to the original author.

Leave a Comment

Your email address will not be published.