Original link: https://blog.lilydjwg.me/posts/216461.html
This article comes from Evian’s Blog; please credit the source when reprinting.
Last year I wrote Luoxu, a tool for indexing Telegram groups, so I could finally search the Chinese messages in my groups. I later found, however, that group members send a lot of information as screenshots, which Luoxu cannot index. You can't really blame people for taking screenshots: many people are not good at describing things, and even copy-and-paste can introduce mistakes, whereas a screenshot is comparatively objective and reliable.
So Luoxu needed OCR. I know Baidu offers an OCR service, but I obviously won't use it for Luoxu. The OCR tool I usually use is tesseract, which many open source projects also use. Its English recognition is decent, and because the character set is customizable it does very well on things like IP addresses. Its Chinese recognition, however, is poor: as soon as the image is slightly blurry (for example, JPEG-compressed by Telegram) or distorted (for example, a photo rather than a screenshot), the output is a mess, not to mention its odd habit of inserting spaces between Chinese characters.
Later, friends told me that PaddleOCR recognizes Chinese very well. I tested it and it really is good; it also works completely offline and is open source. Open source or not, though, I am not able to review all of its code, and it has too few users to count on "enough eyes". As machine-learning software, it also inherits the complex and fragile build process typical of that field. It even depends on a library called "opencv-contrib-python", which bundles its own copies of ffmpeg, Qt5, OpenSSL, and XCB libraries and whose purpose I don't know. The installation also tried to compile an older version of numpy, which failed because that version is too old to support Python 3.10. So I decided to install it in a Debian chroot, where Python 3.9 is available and precompiled packages can be used directly. That raises the question: is it really safe to use so many binary libraries of unknown origin?
I don't know. What I do know is that it is relatively safe as long as it cannot reach the Internet. My main concern is privacy: pictures sent by group members must not leak to unknown third parties. Without network access, code that wants to DDoS someone or mine cryptocurrency gets nowhere either, since it can neither receive instructions nor send data out. All I need is for it to read an image from outside and return the recognition result to me.
A simple solution, then, is to use bwrap to give it a separate network namespace in which it can only reach itself, so that it cannot access the Internet. That is easier said than done. First, debootstrap has to run as root, and the resulting tree has to be chown-ed afterwards. To restrict permissions further I use subuid, but that complicates things: I have trouble accessing the files myself. After some fiddling, I found a way to get into this chroot environment:
#!/bin/bash -e

user="$(id -un)"
group="$(id -gn)"

# Create a new user namespace in the background with a dummy process just to
# keep it alive.
unshare -U sh -c "sleep 30" &
child_pid=$!

# Set {uid,gid}_map in new user namespace to max allowed range.
# Need to have appropriate entries for user in /etc/subuid and /etc/subgid.
# shellcheck disable=SC2046
newuidmap $child_pid 0 $(grep "^${user}:" /etc/subuid | cut -d : -f 2- | tr : ' ')
# shellcheck disable=SC2046
newgidmap $child_pid 0 $(grep "^${group}:" /etc/subgid | cut -d : -f 2- | tr : ' ')

# Tell Bubblewrap to use our user namespace through fd 5.
5< /proc/$child_pid/ns/user bwrap \
  --userns 5 \
  --cap-add ALL \
  --uid 0 \
  --gid 0 \
  --unshare-ipc --unshare-pid --unshare-uts --unshare-cgroup --share-net \
  --die-with-parent --bind ~/rootfs-debian / --tmpfs /sys --tmpfs /tmp --tmpfs /run --proc /proc --dev /dev \
  -- \
  /bin/bash -l
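The newuidmap and newgidmap calls in the script above take their arguments from /etc/subuid and /etc/subgid, whose lines have the form name:start:count. As a minimal sketch (the helper name subid_range is hypothetical and not part of the original script), the lookup could be wrapped so that it matches the name exactly and uses only the first entry, which avoids passing a malformed argument list when a user has several ranges:

```shell
# Hypothetical helper: print "START COUNT" for the first entry of a name
# in a subuid/subgid-format file (lines look like name:start:count).
# Returns non-zero if the name has no entry.
subid_range() {
    awk -F: -v name="$1" '
        $1 == name { print $2, $3; found = 1; exit }
        END { exit !found }
    ' "$2"
}

# Intended use in the script above (not executed here):
#   newuidmap "$child_pid" 0 $(subid_range "$(id -un)" /etc/subuid)
#   newgidmap "$child_pid" 0 $(subid_range "$(id -gn)" /etc/subgid)
```

Using only the first range is a design choice of this sketch; the original pipeline passes every matching line through unchanged.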
The bwrap invocation above uses --share-net because I need network access to install PaddleOCR. I did not install it after creating the chroot and before the chown, because installing untrusted software with real root privileges, even inside a chroot, seemed too risky. Once it is installed, I take any picture and run recognition in each language so that it downloads the models for those languages. From then on it gets no network access at all (so that malicious code cannot stash data and send it out later when a connection is available):
#!/bin/bash -e

# Quote "$2" so that paths containing spaces are handled correctly.
dir="$(dirname "$2")"
file="$(basename "$2")"

user="$(id -un)"
group="$(id -gn)"

# Create a new user namespace in the background with a dummy process just to
# keep it alive.
unshare -U sh -c "sleep 30" &
child_pid=$!

# Set {uid,gid}_map in new user namespace to max allowed range.
# Need to have appropriate entries for user in /etc/subuid and /etc/subgid.
# shellcheck disable=SC2046
newuidmap $child_pid 0 $(grep "^${user}:" /etc/subuid | cut -d : -f 2- | tr : ' ')
# shellcheck disable=SC2046
newgidmap $child_pid 0 $(grep "^${group}:" /etc/subgid | cut -d : -f 2- | tr : ' ')

# Tell Bubblewrap to use our user namespace through fd 5.
5< /proc/$child_pid/ns/user bwrap \
  --userns 5 \
  --uid 1000 \
  --gid 1000 \
  --unshare-ipc --unshare-pid --unshare-uts --unshare-cgroup --unshare-net \
  --die-with-parent --bind ~/rootfs-debian / --tmpfs /sys --tmpfs /tmp --tmpfs /run --proc /proc --dev /dev \
  --ro-bind "$dir" /workspace --chdir /workspace \
  --setenv PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  --setenv HOME /home/worker \
  -- \
  /home/worker/paddleocr/ocr.py "$1" "$file"

kill $child_pid
This script mounts the directory containing the specified file into the chroot read-only, then runs PaddleOCR on the file and returns the result. The ocr.py script that calls PaddleOCR is in my paddleocr-web project.
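One way to sanity-check the isolation, as a sketch: a fresh network namespace created with --unshare-net should contain only the loopback interface, and /proc/net/dev is visible inside the sandbox because the script mounts /proc. The helper below (the name non_loopback_ifaces is hypothetical) lists every interface other than lo from a file in that format; run inside the sandbox, it should print nothing.

```shell
# Hypothetical helper: list interfaces other than "lo" found in a file in
# /proc/net/dev format (two header lines, then "  name: counters...").
# Defaults to /proc/net/dev when no file is given.
non_loopback_ifaces() {
    awk -F: 'NR > 2 {
        gsub(/[ \t]/, "", $1)        # strip the padding around the name
        if ($1 != "lo") print $1
    }' "${1:-/proc/net/dev}"
}
```

Any output from this function inside the sandbox would mean that the process can see a real network interface and that the bwrap options need another look.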
But this is too complicated. Later I turned it into a systemd service, which is much simpler:
[Unit]
Description=PaddleOCR HTTP service

[Service]
Type=exec
RootDirectory=/var/lib/machines/lxc-debian/
ExecStart=/home/lilydjwg/PaddleOCR/paddleocr-http --loglevel=warn -j 2
Restart=on-failure
RestartSec=5s

User=1000
NoNewPrivileges=true
PrivateTmp=true
CapabilityBoundingSet=
IPAddressAllow=localhost
IPAddressDeny=any
SocketBindAllow=tcp:<port number>
SocketBindDeny=any
SystemCallArchitectures=native
SystemCallFilter=~connect

[Install]
WantedBy=multi-user.target
The "paddleocr-http" script here is server.py from paddleocr-web.
It is also less protective, though. First, the network is only restricted to localhost: for TCP, the service may bind only to the specified port and cannot call the connect system call, but it can still send UDP packets to the local machine. Second, the process runs as my own user, although once chrooted it should not be able to get out of the container. I should probably give it a different user, such as uid 1500, which should have an effect similar to subuid.
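The user change suggested above could be expressed as a systemd drop-in, sketched here under assumptions: the unit name paddleocr.service and the drop-in path are hypothetical, and uid 1500 is simply the example from the text. The files the service needs inside the RootDirectory would also have to be readable by that uid.

```ini
# /etc/systemd/system/paddleocr.service.d/user.conf  (hypothetical path)
[Service]
# Run as a dedicated numeric uid instead of the author's own account.
User=1500
```

After adding such a file, running systemctl daemon-reload and restarting the service applies the change.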
By the way, PaddleOCR claims to support many languages, but in practice only a few, such as Simplified Chinese, are well supported (Traditional Chinese is not very good). For other languages it even gets the language names and abbreviations wrong, and in its Vietnamese output the diacritics are almost entirely lost.