The first part covered beam search, input segmentation, and some of the basic data structures libime provides. This part fills in the implementation details of UserLanguageModel and HistoryBigram that were left out last time.
Most of the prerequisite knowledge for this article can be found in an undergraduate natural language processing course, under the keywords “statistical language model” and “N-gram”. In short, the N-gram model is a crude approximation of real language: it ignores grammar and only cares about how often words occur next to each other. Imprecise as it is, it works well in practice for input methods, machine translation, speech recognition, and other fields.
As mentioned in the previous article, the core algorithms of our input method are N-gram and beam search. Input methods built on this approach usually take N to be 2 or 3, which strikes a balance between quality and memory usage. libime also inherits part of the spirit of Sunpinyin, since its initial data was the Open-Gram corpus that Sunpinyin uses. (In the latest version we retrained the language model on brand-new data, but it is still a trigram model, the same as Sunpinyin.)
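To make this concrete, here is a small sketch of how a trigram model scores a whole sentence. This is a generic illustration, not libime's API; logProbFn stands in for whatever trigram table the model actually provides.

```cpp
#include <functional>
#include <string>
#include <vector>

// Trigram approximation of a sentence probability, in log space:
//   log P(w1..wn) ≈ Σ log P(wi | wi-2, wi-1)
// `logProbFn` is a stand-in for the model's lookup: log10 P(word | w1, w2).
double sentenceLogProb(
    const std::vector<std::string> &words,
    const std::function<double(const std::string &, const std::string &,
                               const std::string &)> &logProbFn) {
    double logProb = 0;
    // "<s>" marks the sentence start so the first two words are handled uniformly.
    std::string w1 = "<s>", w2 = "<s>";
    for (const auto &word : words) {
        logProb += logProbFn(w1, w2, word);
        w1 = w2;
        w2 = word;
    }
    return logProb;
}
```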
HistoryBigram, as the name suggests, is a bigram that stores user input. What it does is very simple: it records the sentences the user has typed, word by word, and keeps them in memory in a DATrie. A natural question arises: a DATrie can be viewed, abstractly, as a mapping from a string to 4 bytes of data, so how does it store a table with a two-level key like a bigram?
The answer is simple: encode the key so that it can carry more information. Concretely, the two words are joined with a delimiter into a single string. The formulas for scoring the bigram and unigram also come from Sunpinyin. In the most straightforward form, the bigram probability would be P(B|A) = Freq(AB) / Freq(A), i.e. the number of occurrences of AB divided by the number of occurrences of A, and simply 0 if AB never occurs. That does not work well as a model, so language models apply smoothing to guarantee meaningful results even when the data is sparse (the user's input history is very small). Here the conditional probability is interpolated with how often B appears on its own, which reduces the damage caused by missing data in a small sample.
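To make the key encoding and the interpolation concrete, here is a minimal sketch. It is not libime's actual code: the hash map stands in for the DATrie, and the delimiter character and interpolation weight are made-up values.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the DATrie: a map from the encoded string key to a 4-byte count.
// libime's real store is a double-array trie, but the key encoding is the point.
using FreqTable = std::unordered_map<std::string, int32_t>;

// Hypothetical delimiter; any byte that cannot appear inside a word will do.
constexpr char kDelimiter = '|';

// Flatten the two-level bigram key (A, B) into the single key "A|B".
std::string bigramKey(const std::string &a, const std::string &b) {
    return a + kDelimiter + b;
}

// Record one sentence, updating unigram and bigram counts.
void addSentence(FreqTable &unigram, FreqTable &bigram,
                 const std::vector<std::string> &words) {
    for (std::size_t i = 0; i < words.size(); ++i) {
        unigram[words[i]] += 1;
        if (i + 1 < words.size()) {
            bigram[bigramKey(words[i], words[i + 1])] += 1;
        }
    }
}

// Interpolated ("smoothed") bigram probability:
//   P(B|A) ≈ w * Freq(AB) / Freq(A) + (1 - w) * Freq(B) / Total
// so an unseen pair AB still gets a non-zero score from the unigram term.
// The weight w here is a placeholder, not the value libime uses.
double bigramProb(const FreqTable &unigram, const FreqTable &bigram,
                  const std::string &a, const std::string &b, int32_t total,
                  double w = 0.8) {
    auto freq = [](const FreqTable &t, const std::string &k) -> int32_t {
        auto it = t.find(k);
        return it == t.end() ? 0 : it->second;
    };
    int32_t freqA = freq(unigram, a);
    double cond = freqA > 0 ? double(freq(bigram, bigramKey(a, b))) / freqA : 0.0;
    double uni = total > 0 ? double(freq(unigram, b)) / total : 0.0;
    return w * cond + (1 - w) * uni;
}
```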
There are still two problems standing between this and an actual input method: 1. how to mix the system model and the user model, and 2. how to make sure the user's input is quickly reflected in the ranking of candidates.
For the first problem, the answer, imprecise as it is, is to take a weighted average of the two probabilities. A small aside about sunpinyin's implementation is worthwhile here. In the initial implementation of libime, many concrete details tried to follow sunpinyin's practice, and in doing so I also found quite a few "strange designs" in sunpinyin. For example, to reduce floating-point operations, the common practice is to work with logarithmic probabilities, so a product of probabilities becomes a simple sum, which is much more efficient. But sunpinyin weights the models by taking an arithmetic mean of the logarithmic probabilities, which is very strange (and not very mathematical). libime instead takes the arithmetic mean of the original probabilities, which at least makes more sense. That leaves a small problem: given log(P_sys) and log(P_user), how do you compute log(w * P_sys + (1 - w) * P_user)?
Computing it directly by substituting the numbers gives log10(w * pow(10, log(P_sys)) + (1 - w) * pow(10, log(P_user))), which costs three math library calls. Can that be reduced? It can.
Suppose we need to calculate log10(exp10(a) + exp10(b)):

log10(exp10(a) + exp10(b))
= log10(exp10(b) * (1 + exp10(a - b)))
= b + log10(1 + exp10(a - b))
= b + log1p(exp10(a - b)) / log(10)
Why deliberately massage the formula into the log1p(exp10(a - b)) form? Because the standard library provides log1p, which computes log(1 + x) accurately in one step. To be precise, a - b or b - a is chosen according to which of a and b is larger, so the exponent never goes positive. Constants such as log(10) (base e) can be precomputed, which cuts the math library calls down from three to two: one exp10/pow and one log1p.
You may ask: where did the weights w and (1 - w) from the original formula go? That is also simple: just move them into log space first.
log10(w * exp10(log_sys) + (1 - w) * exp10(log_user))
= log10(exp10(log10(w)) * exp10(log_sys) + exp10(log10(1 - w)) * exp10(log_user))
= log10(exp10(log10(w) + log_sys) + exp10(log10(1 - w) + log_user))
This plugs straight into the previous formula. log10(w) and log10(1 - w) are constants as far as the input method is concerned and can be computed ahead of time.
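Putting the pieces together, here is a minimal sketch of the mixing step, assuming both models report base-10 log probabilities. The function names, the inline computation of the weight constants, and the weight value itself are mine, not libime's API.

```cpp
#include <cmath>
#include <utility>

// ln(10), precomputed so the runtime cost is only the pow and log1p calls.
constexpr double kLn10 = 2.302585092994046;

// log10(10^a + 10^b), without ever forming 10^a or 10^b directly.
// Factoring out the larger term keeps the exponent non-positive, so pow()
// cannot overflow and log1p() stays accurate when the smaller term is tiny.
double log10SumExp10(double a, double b) {
    if (a < b) {
        std::swap(a, b);
    }
    // Now a >= b, so b - a <= 0 and 10^(b - a) lies in (0, 1].
    return a + std::log1p(std::pow(10.0, b - a)) / kLn10;
}

// log10(w * P_sys + (1 - w) * P_user), given only the log10 probabilities.
// log10(w) and log10(1 - w) would normally be cached once when the weight is
// set; they are computed inline here only to keep the sketch short.
double mixLogProb(double logSys, double logUser, double w = 0.8) {
    const double logW = std::log10(w);
    const double logW1 = std::log10(1.0 - w);
    return log10SumExp10(logW + logSys, logW1 + logUser);
}
```

With the two weight constants cached, each candidate pays for one pow and one log1p plus a few additions, instead of the three transcendental calls of the direct formula.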
The other problem is how to make sure the user's "new" input quickly rises to the front of the results. I took some hints from sunpinyin's approach, but I don't think it is entirely reasonable, so I adapted it. sunpinyin keeps only 1024 words of history; when new words come in, old history is kicked out. This design keeps the user model from drifting too far away from the system model, but it also limits how much of the user's history survives. On top of that, the frequency of recently entered words is multiplied by a coefficient, which effectively raises the weight of new content. But since this window is quite small, the reordering effect of the boost can be forgotten rather quickly.
libime takes a similar approach but with some improvements of its own. First, user input is recorded as whole sentences rather than individual words. Second, the recorded sentences are split across three pools of different sizes (128, 8192, 65536), and each pool is given a different probability weight (the specific numbers are purely heuristic). This keeps far more history around (65536 sentences versus 1024 words) while still letting the influence of old input decay over time.
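The following sketch only illustrates the pool idea; the real HistoryBigram keeps its counts in DATries rather than scanning sentences, and the weights and eviction policy here are invented for the example.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Three bounded pools of sentences with decreasing weights (sizes from the
// article, weights made up). When a pool overflows, its oldest sentence is
// pushed down into the next, larger pool, so recent input lives in a small,
// highly weighted pool while older input gradually loses influence.
struct HistoryPools {
    struct Pool {
        std::size_t capacity;
        double weight;
        std::deque<std::vector<std::string>> sentences; // newest at the front
    };
    std::vector<Pool> pools{{128, 1.0, {}}, {8192, 0.5, {}}, {65536, 0.25, {}}};

    void add(std::vector<std::string> sentence) {
        pools[0].sentences.push_front(std::move(sentence));
        for (std::size_t i = 0; i + 1 < pools.size(); ++i) {
            if (pools[i].sentences.size() > pools[i].capacity) {
                // Demote the oldest sentence of this pool into the next one.
                pools[i + 1].sentences.push_front(
                    std::move(pools[i].sentences.back()));
                pools[i].sentences.pop_back();
            }
        }
        if (pools.back().sentences.size() > pools.back().capacity) {
            pools.back().sentences.pop_back(); // the oldest history is forgotten
        }
    }

    // Weighted count of how often `word` appears across the pools.
    double weightedFreq(const std::string &word) const {
        double freq = 0;
        for (const auto &pool : pools) {
            for (const auto &sentence : pool.sentences) {
                for (const auto &w : sentence) {
                    if (w == word) {
                        freq += pool.weight;
                    }
                }
            }
        }
        return freq;
    }
};
```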
Originally published at https://www.csslayer.info/wordpress/fcitx-dev/libime-intro-2/