One Skill a Day | Extracting Information from a Bilibili Video Collection

Background: I need to study mathematical modeling. My daily study plan is measured in hours, so I need the duration of each episode in the video collection to schedule my progress.

Problem: There is no ready-made Bilibili video collection information extraction tool on the Internet

Idea: start from the page's HTML to obtain the video collection information.

JavaScript code implementation

Using the element inspector, you can see that the episode information lives in a ul tag whose parent has the class cur-list, so a for loop over its items solves the problem.

Use document.getElementsByClassName('list-box')[0] to get the list-box element. getElementsByClassName returns an array-like HTMLCollection rather than a single element, so [0] is needed to take the first match. The actual episode entries are inside the li tags; at this point only the outermost ul has been selected.

So call querySelectorAll on that element to select all the li tags, assign the result to a variable, and iterate over it with a for loop.

The complete code is as follows. Because the durations will be summed in Excel later, a value such as 12:34 needs .00 appended so that Excel does not misread it as 12 hours and 34 minutes. The fields are separated with \t so that Excel places them in separate columns.

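 var result = '';
 // Each li inside the first list-box element is one episode.
 var content = document.getElementsByClassName('list-box')[0].querySelectorAll('li');
 for (var i = 0; i < content.length; i++) {
     // innerText puts each field of the entry on its own line; the duration comes last.
     var temp = content[i].innerText.split('\n');
     // Durations without an hour part (such as 12:34) get ".00" appended for Excel.
     if (temp[temp.length - 1].length < 6) {
         temp[temp.length - 1] += '.00';
     }
     // Join the fields with tabs so that commas inside titles are left untouched.
     result += temp.join('\t') + '\n';
 }
 console.log(result);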
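
The formatting rule can also be tried outside the browser on sample data. The following is a minimal sketch, assuming that each entry's innerText ends with the duration on its last line; formatEpisodeRows and the sample titles are hypothetical and only serve as illustration.

```javascript
// Hypothetical helper: turns the innerText of each li into one tab-separated row.
function formatEpisodeRows(texts) {
  return texts
    .map(function (text) {
      var fields = text.split('\n');
      var last = fields.length - 1;
      // Durations without an hour part (shorter than h:mm:ss) get ".00" appended.
      if (fields[last].length < 6) {
        fields[last] += '.00';
      }
      return fields.join('\t');
    })
    .join('\n');
}

// Made-up sample data standing in for content[i].innerText.
var sample = ['1 Introduction\n12:34', '2 Linear programming\n1:02:03'];
console.log(formatEpisodeRows(sample));
```

Keeping the rule in a function of plain strings makes it easy to check edge cases, such as titles that contain commas, before running it against the real page.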

Run the code in the browser console, then copy the output to the clipboard.

Excel processing

Create a new Excel workbook and paste the contents of the clipboard. Since the titles vary in length, select the title column and click “Home → Cells → Format → AutoFit Column Width”.

Use an Excel formula to compute the running total of episode time in cell D2, then use the ➕ fill handle at the lower-right corner of D2 to autofill the totals for all rows.

However, Excel's default time format wraps at 24 hours, so the part of a total beyond 24 hours is not displayed. Select the cells in column D, open the cell format settings, enter the custom format [hh]:mm:ss, and apply it.
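
To see why a total must not wrap at 24 hours, the same running total can be computed in JavaScript. This is only a sketch under assumptions: toSeconds and formatTotal are hypothetical helpers, the durations are made up, and the output imitates what the [hh]:mm:ss format displays rather than reproducing Excel itself.

```javascript
// Hypothetical helper: converts "mm:ss" or "h:mm:ss" into a number of seconds.
function toSeconds(duration) {
  return duration.split(':').reduce(function (total, part) {
    return total * 60 + Number(part);
  }, 0);
}

// Hypothetical helper: formats seconds as hh:mm:ss without wrapping at 24 hours.
function formatTotal(seconds) {
  function pad(n) {
    return String(n).padStart(2, '0');
  }
  var h = Math.floor(seconds / 3600);
  var m = Math.floor((seconds % 3600) / 60);
  var s = seconds % 60;
  return pad(h) + ':' + pad(m) + ':' + pad(s);
}

// Print each made-up duration next to the running total so far.
var durations = ['12:34', '1:02:03', '23:59:59'];
var running = 0;
durations.forEach(function (d) {
  running += toSeconds(d);
  console.log(d + '\t' + formatTotal(running));
});
```

The last running total here is more than 24 hours, which is exactly the case where a format that wraps at 24 hours would show a misleadingly small value.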

With this format applied, column D shows the full running total of viewing time, including totals beyond 24 hours.

Tips

In the result, each title starts with an episode number. This is redundant and can be removed with any regular-expression tool. A relatively simple approach is a perl one-liner: copy the title column in Excel and run the command below in the terminal (pbpaste and pbcopy are macOS commands). When it finishes, the clipboard holds the processed titles; paste them back over the original title column.

 pbpaste|perl -pE 's/^\d+ //'|pbcopy
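
On systems without pbpaste and pbcopy, the same substitution could be applied in JavaScript before the data reaches Excel. A minimal sketch, assuming that titles start with digits followed by a single space as in the perl pattern; stripEpisodeNumber and the sample titles are hypothetical.

```javascript
// Hypothetical helper: removes a leading episode number and the space after it.
function stripEpisodeNumber(title) {
  return title.replace(/^\d+ /, '');
}

// Made-up titles; the last one has no leading number and stays unchanged.
var titles = ['1 Introduction', '12 Linear programming', 'No number here'];
console.log(titles.map(stripEpisodeNumber).join('\n'));
```

Because the pattern is anchored with ^, digits that appear later in a title are not affected.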

This article is reposted from: https://sspai.com/post/77326
This site is only for collection, and the copyright belongs to the original author.