Play Genshin Impact with AI voice control: fight monsters with your mouth, and the code tutorial is open source

In 2016, miHoYo went all in on the game project "Genshin Impact" in pursuit of transformation. The launch of the fully open-world adventure game "Genshin Impact" in 2020 sparked phenomenon-level discussion in gaming circles, and its exquisite production and AAA-level graphics took the game to the top of the charts in 56 countries in 2021.


As a character-development game, Genshin Impact itself is fairly grindy. On top of that, the core gameplay is relatively simple and later versions update slowly. After playing for a long time, some players inevitably get bored, yet feel the game is "tasteless to keep playing, but a pity to abandon."

Your wish is its command: playing Genshin Impact by voice


Whenever people get idle and bored, there is always some big shot with unusual brain wiring cooking something up. Sure enough, "Schrödinger's Rainbow Cat," a hardcore-tech creator on Bilibili, used AI algorithms to control Genshin Impact by voice, directly turning the player into a Pokémon trainer. It has been suggested the game be renamed "Pokémon: Genshin."

For the specific combat effect, take a look at the animation below.


As the system prompts "Defeat 8 monsters within 360 seconds," 4 fire slimes close in.

The Genshin trainer calmly called out "Use Tactic 3 to attack the fire slime in the middle," and a green fighter-jet-style tracking box appeared on the screen.

Kamisato Ayaka ran up to the slime; the player then switched to Zhongli to cast "Dominus Lapidis," dealing AoE damage while setting up a shield; then Ayaka came back out and, with a "Kamisato Art" skill, dealt tons of elemental damage, ending the fight in the explosion of a fire slime.


The author also preset different tactics for different scenarios. When dealing with a Pyro Abyss Mage, the first sentence was "Attack the Pyro Abyss Mage in the middle," and the character began automatically searching for the monster.

On reaching the monster, Tactic 1 executes: Diona uses her "Icy Paws" skill to deal damage at lightning speed while backstepping to adjust position and raise a shield. After that, Kamisato Ayaka takes the field and presses forward with a combo of damage.


However, during the demonstration we also noticed that while the character's movements were smooth once a tactic was triggered, before the tactic kicked in the character just stood there blankly, a little gift [doge] from the AI.

So how does this AI, which lets you play the game with your mouth, makes the character obey your every word, and helps the player cultivate a silver tongue, actually work?

Three AI tools build an intelligent command system


In the video, "Schrödinger's Rainbow Cat" shares his method: AI voice control of Genshin Impact mainly involves three currently popular mainstream AI tools, X-VLM, WeNet, and STARK.

Seeing this, some readers may say: "Well said. I recognize each of these letters on its own, but together they mean nothing to me."

Don't worry; let's go through what each of these three tools does.

In the past, the logic for controlling a game character in melee combat was: 1. see the enemy target; 2. lock on and move toward it; 3. launch an attack.

To control the game by voice, the same three steps must be completed. Let's break down the author's in-game commands and analyze this AI's workflow.


As shown in the picture above, when the author says "use Tactic 3 to attack the Pyro Abyss Mage in the middle," the computer performs three steps: voice command recognition, image recognition of the target, and character action. The whole process is somewhat like building a custom voice assistant for the game, like "Hey Siri, open Genshin Impact."
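The three-step flow can be sketched as a pipeline of stub functions. This is a minimal illustration, not the author's code: every name here is made up, and in the real project the stubs are replaced by WeNet, X-VLM/STARK, and the key-press macro layer respectively.

```python
# Hypothetical three-stage pipeline: speech -> target box -> key macro.
# Each function is a stand-in for a real model in the author's project.

def recognize_speech(audio: bytes) -> str:
    """Stand-in for WeNet: audio in, command text out."""
    return "use tactic 3 to attack the fire slime in the middle"

def locate_target(frame, command: str):
    """Stand-in for X-VLM/STARK: match the command text to a bounding box."""
    return (480, 200, 560, 280)  # (x1, y1, x2, y2) in screen pixels

def execute_tactic(command: str, box) -> str:
    """Stand-in for the macro layer: move toward the box, run the key script."""
    return f"tracking {box}, running tactic 3"

def handle_voice_command(audio: bytes, frame) -> str:
    text = recognize_speech(audio)
    box = locate_target(frame, text)
    return execute_tactic(text, box)

print(handle_voice_command(b"...", None))
```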

Step 1: Voice Command Recognition


To make the device understand our instructions, we need a translator that converts what we say into something the machine can process. WeNet is that translator.

WeNet is a production-oriented end-to-end speech recognition toolkit that introduces a unified two-pass (U2) framework and a built-in runtime to handle streaming and non-streaming decoding in a single model. Its recognition accuracy, real-time factor, and latency are all strong, and it has been adopted in speech recognition projects at JD.com, NetEase, NVIDIA, Ximalaya, and other companies.

To use WeNet to recognize Genshin Impact voice commands, you need to go through the steps of preparing training data, extracting CMVN features (optional), generating the token dictionary, preparing data in WeNet format, training the neural network, recognizing wav files with the trained model, and exporting the model.


In plain language: prepare some audio files, label what each one says, then let the machine learn to recognize these audio files and produce the labels. Once training is complete, whenever we speak to the machine, WeNet can translate our words into text the machine understands.
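The "prepare audio and label it" step boils down to two Kaldi-style index files that WeNet's data prep expects: `wav.scp` maps an utterance ID to its audio path, and `text` maps the same ID to its transcript. A minimal sketch of generating them; the utterance IDs, clip paths, and commands below are made up:

```python
# Build the two Kaldi-style index files WeNet's data prep reads:
# "wav.scp" (utterance ID -> audio path) and "text" (utterance ID -> transcript).
import tempfile
from pathlib import Path

# Hypothetical recordings of the player's command phrases.
recordings = {
    "cmd_0001": ("clips/tactic1.wav", "execute tactic one"),
    "cmd_0002": ("clips/tactic3_mid.wav",
                 "use tactic three to attack the fire slime in the middle"),
}

data_dir = Path(tempfile.mkdtemp()) / "train"
data_dir.mkdir(parents=True, exist_ok=True)

with open(data_dir / "wav.scp", "w", encoding="utf-8") as scp, \
     open(data_dir / "text", "w", encoding="utf-8") as txt:
    for utt_id, (wav_path, transcript) in sorted(recordings.items()):
        scp.write(f"{utt_id} {wav_path}\n")
        txt.write(f"{utt_id} {transcript}\n")

print((data_dir / "text").read_text(encoding="utf-8"))
```

The toolkit's recipe scripts then take over from these files for feature extraction and training.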

Step 2: Analyze the features of the voice command

With WeNet's help, the machine now understands what we say, but it also has to match what it hears with what is on screen. Enter the second tool, X-VLM.

X-VLM is a multi-granularity vision-language model (VLM) consisting of an image encoder, a text encoder, and a cross-modal encoder. The cross-modal encoder performs cross-attention between visual features and language features to learn vision-language alignment. So how does this tool recognize objects?


The figure above shows X-VLM's workflow. The left side shows how the tool encodes visual concepts. The image encoder is based on the Vision Transformer: the input image is split into patches and encoded. Then, given an arbitrary bounding box, the global representation of that region is obtained flexibly by averaging all the patch representations inside the box. Finally, the global representation and all the patch representations in the box are arranged, in their original order, into a sequence that serves as the representation of the visual concept corresponding to the bounding box.

(I know every word, so why do they mean nothing to me together?)


Somehow the article has turned into reading comprehension; let's look at it more closely.


In layman's terms, the paragraph above means cutting the picture into squares and pre-assembling those squares into combinations, for example a picture of "a man with a backpack," or a picture of "a man with a backpack crossing the road."

All we have to do is tell the machine how these combinations correspond to words, and then let the device learn.

In this way, we obtain encodings for the picture itself and for the visual concepts in it (V1, V2, V3). The text corresponding to each visual concept, such as an image caption, a region description, or an object label, is encoded one by one through the text encoder.
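The region-encoding step described above (average the patch vectors inside the box, then prepend that average to the box's patch sequence) can be sketched with NumPy. The grid size and embedding dimension here are arbitrary toy values, not X-VLM's real ones:

```python
import numpy as np

def region_representation(patches: np.ndarray, box) -> np.ndarray:
    """patches: (rows, cols, dim) grid of patch embeddings from the image
    encoder; box: (r1, c1, r2, c2) patch coordinates, inclusive.
    Returns the sequence [global, p1, p2, ...] fed to the cross-modal encoder."""
    r1, c1, r2, c2 = box
    region = patches[r1:r2 + 1, c1:c2 + 1]         # patches inside the box
    flat = region.reshape(-1, patches.shape[-1])   # keep their original order
    global_repr = flat.mean(axis=0, keepdims=True) # averaged box feature
    return np.concatenate([global_repr, flat], axis=0)

# Toy 2x3 patch grid with 4-dimensional embeddings.
grid = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
seq = region_representation(grid, (0, 0, 1, 1))
print(seq.shape)  # (5, 4): 1 global vector + 4 patch vectors
```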


After all that, the editor was dizzy too. This tool works a bit like our eyes: when I see a "schoolbag," even if I have never seen that exact style, feature extraction tells me it is a schoolbag. X-VLM is that kind of tool.

After receiving the text output by WeNet, X-VLM can pick out the related objects in the image, linking language to vision. At this point, the computer knows what in the picture we are referring to.

Step 3: Track the Image

With X-VLM and WeNet, we have made the device understand what we are talking about; the next step is target tracking. Doesn't that sound cool, a bit like a fighter jet launching a tracking missile~


As many readers have probably guessed, the last tool, STARK, is the AI tool used for image tracking.

STARK is a recent SOTA tracking model that uses a Transformer to combine spatial and temporal information.

The model includes an encoder, a decoder, and a prediction head. The encoder receives three inputs: the current frame, the initial target, and a dynamically changing template image. Because the template image is continuously updated during tracking, the encoder can capture both the temporal and the spatial information of the target.

After obtaining the target information, the tool predicts heatmaps for the top-left and bottom-right corners to obtain an optimal bounding box in each frame, and it can run entirely on the GPU.
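Reading a box off two corner heatmaps can be sketched as a soft-argmax: normalize each heatmap into a probability map, then take the expected (x, y) as that corner's coordinate. This is a simplified illustration of the idea, not STARK's actual head, and the 5x5 maps below are toys:

```python
import numpy as np

def soft_argmax(heatmap: np.ndarray):
    """Expected (x, y) coordinate under the softmax of a corner heatmap."""
    prob = np.exp(heatmap - heatmap.max())  # subtract max for stability
    prob /= prob.sum()
    ys, xs = np.indices(heatmap.shape)      # row and column index grids
    return float((xs * prob).sum()), float((ys * prob).sum())

def box_from_heatmaps(tl_map: np.ndarray, br_map: np.ndarray):
    """Combine top-left and bottom-right corner maps into (x1, y1, x2, y2)."""
    x1, y1 = soft_argmax(tl_map)
    x2, y2 = soft_argmax(br_map)
    return x1, y1, x2, y2

# Toy 5x5 maps: top-left peak at (x=1, y=1), bottom-right peak at (x=4, y=3).
tl = np.zeros((5, 5)); tl[1, 1] = 50.0
br = np.zeros((5, 5)); br[3, 4] = 50.0
print(box_from_heatmaps(tl, br))
```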


Simply put, once X-VLM has determined the target to track, STARK, like Tony Stark's targeting system, records the object's appearance in both static and dynamic states and, after processing and analysis, tracks the moving object.

At this point, we have basically understood the principles behind the three core technologies. But how does the character actually move to execute a tactic?

In fact, making the character attack and release skills automatically is the easiest part of AI voice-controlled Genshin Impact; it can be done with macros or ordinary code. The editor took a look at the code file shared by the author; below is a portion of it.


This operation code is written in Python, and the logic is quite simple: it executes a series of preset key commands. The screenshot above should show the operations for Tactic 1; the numbers or letters after key and mouse correspond to switching characters and releasing skills.
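The idea of "a tactic is a preset key script" can be sketched as follows. The key bindings and timings here are made up, and the input backend is a dry-run stub so the sequence can be inspected without sending real key events; the author's actual script uses a real keyboard/mouse automation library instead:

```python
import time

# One "tactic" is just an ordered list of (input, delay) steps.
# Bindings are hypothetical: "2" switches to party slot 2, "e" casts a skill.
TACTIC_1 = [
    ("key 2", 0.3),       # switch to the second character
    ("key e", 1.0),       # release her elemental skill
    ("key 3", 0.3),       # switch to the third character
    ("mouse left", 0.1),  # normal attack
]

def run_tactic(steps, press=None, wait=time.sleep):
    """Execute a tactic. `press` is the input backend; by default the
    actions are just collected into a log (a dry run)."""
    log = []
    press = press or log.append
    for action, delay in steps:
        press(action)
        wait(delay)  # pause so the animation can play out
    return log

print(run_tactic(TACTIC_1, wait=lambda _: None))
```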


The code also explains why the character stands in a daze after executing a tactic: there is simply no follow-up instruction or input.

In general, if you want a quick try of AI voice-controlled Genshin Impact, you can download the code shared by the author and run it directly. As long as your party lineup and character order match the author's, you can reproduce the effect shown in the video.

Of course, if you want to do your own thing, you can also modify the operation code directly to implement different lineups and skill-release combos; just remember which tactic you changed.

And if you want the game to feel more anime, like this:


It's decided, it's you, Kamisato Ayaka. (switch characters)

Use the "Grape Step" after getting close to the enemy. (release skill)

Thank you for your hard work, Ayaka; come back. (switch characters)

The editor has also figured out which code to change for you: substitute the shortcut keys for switching characters and releasing skills into the operation code, and at the same time record voice samples for WeNet so it can learn what you are saying. (PS: pack as many actions as possible into one sentence, because AI execution takes time, which is why the author uses Tactics 1, 2, and 3.)
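Wiring a new phrase to a new tactic then comes down to one mapping from recognized text to a key script. A minimal sketch, with made-up phrases and key sequences rather than the author's real ones:

```python
# Map the text WeNet produces onto preset key scripts. Matching is a
# simple substring test, so one spoken sentence can carry several cues.
TACTICS = {
    "tactic one": ["key 1", "key e"],
    "tactic three": ["key 2", "key e", "key 1", "key q"],
    "it's you, ayaka": ["key 3"],  # a custom "switch character" phrase
}

def dispatch(recognized_text: str):
    """Return the key script for the first tactic phrase found in the text."""
    lowered = recognized_text.lower()
    for phrase, steps in TACTICS.items():
        if phrase in lowered:
            return steps
    return []  # no matching tactic: the character just stands there

print(dispatch("Use Tactic Three to attack the mage"))
```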


Of course, some expert viewers offered other suggestions in the video, such as adding a SLAM tool for 360° orientation detection, so the character could track enemies in any direction in the game, a genuine self-propelled cannon.

Eyes and gestures work too: other AI gaming poses

Besides AI voice gaming, many creators on Bilibili are experimenting with other ways to play.


[Image source: Bilibili uploader Jack-Cui]

Jack-Cui built his own AI, using an ordinary camera and a computer to play Street Fighter with body-motion control.


[Image source: Bilibili uploader Tongji Zihao]

Bilibili uploader Tongji Zihao demonstrated using WebGazer.js to implement an "eye-controlled mouse," playing games with his eyes: killing with a glance, literally.


With MediaPipe, you can play games with mid-air gestures. It feels like an Iron Man control panel!

AI technology finds different applications in different settings. The direct beneficiaries of technologies such as voice control and eye control are people living with physical disabilities.


[Image source: Bilibili uploader, psychological consultant Zhu Mingjun]

Earlier, a retired firefighter with high-level amputations shared a video of himself playing Genshin Impact by operating a phone with his mouth. Once AI voice gaming matures, he will be able to roam the world of Genshin far more easily by voice.

Later on, the author also plans to add AI operations for automatic farming, teleporting, killing monsters, and collecting rewards. We will see even more interesting scenes then; let's wait and see.

If you don't understand these algorithms, don't worry. The author has shared the source code on GitHub. After downloading and installing it, change the operation code as described above and experience playing Genshin Impact by voice.

Source code link:

