get-article, a web article archiving tool, is released

Original link: https://blog.henix.info/blog/get-article-archiving-tool-release/

get-article archives a web article in full to your local disk, together with all the resources it references (images, formulas, audio, video, etc.). It comes as both a command-line program and a graphical desktop program.

My goal is to build the youtube-dl of web articles.

get-article desktop version

For usage instructions and downloads, see: https://lab.henix.info/get-article.html

Here I will focus on why I created this software.

You may have some questions when you see this for the first time:

  • Why do I need it?
  • How is it different from RSS?
  • Can’t I just use the browser’s Ctrl+S (Save As)?
  • Can’t I just use the web clipper of my note-taking software?

I made this software because I had collected several articles from WeChat public accounts and wanted to archive them offline as something similar to an e-book: a copy that can be moved anywhere and opened directly, so that I feel I truly “own” the article.

Article archives should have the following properties:

  1. You own all the data of the article, so you can read it without an internet connection. In other words, all of the article’s data is stored locally
  2. The saved format should be open, so that it can be opened on various operating systems and with various software

Readers of my past articles may know my view of the Internet: “follow” less, use RSS less, and read more, especially the classics (see “Talking about Classics”).

This raises a question: can online articles become “classics”? Or, to take a step back, can they at least be “high quality”?

After many years online, I think there are indeed some articles that I take out and reread from time to time. I would like to treat such articles by the standard of books, that is, to preserve them in their entirety. Anyone who spends time online knows that articles disappear without warning fairly often, and in recent years this seems to be happening more and more.

  • The article may have been censored, or the censorship standards may have changed
  • It may have been reported multiple times
  • The author may have deleted it
  • The site may have shut down

If you’ve been online long enough, you’ve probably seen many sites shut down or lose their content: blogcn, Netease Blog, Baidu Tieba posts before 2017, Baidu Space, ...

With the above background, we can discuss the similarities and differences between get-article, an online article archiving tool, and other similar tools:

RSS notifies you when something happens: it emphasizes reminders, notifications, and push. get-article emphasizes archiving. An RSS feed can carry the full text, but it generally does not save all the images, let alone the videos, used in the article.

This also explains why I don’t want to build a “save all articles in a column” feature and instead make the user save articles one by one: I don’t think every article in an entire column can be a “classic” or “high quality”. You don’t have that many articles worth saving. Use your judgment to filter out what really matters! The things that really matter in this world are rare; not everything can matter.

In other words, I don’t think all “conveniences” are worth pursuing, and sometimes we need a certain “inconvenience”. Because “inconvenience” forces us to discern what really matters.

Similar functionality for browsers:

Pressing Ctrl+S in Chrome lets you save a page as complete HTML or as MHT. Chrome can also print to PDF, which saves the web page in PDF format.

  • Problems with PDF: you cannot easily extract the images as separate files, and dynamic content is poorly supported

  • Problems with “Save page as”:

    1. Images that are loaded on demand, as on WeChat public accounts, may not be displayed in the saved page.
    2. A lot of useless JavaScript is saved with the page. It runs again when the page is opened locally, and when it fails the page layout breaks
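
The two “Save page as” problems above can both be addressed by rewriting the HTML before it is stored. The following is only a minimal sketch of that idea using Python’s standard library, not get-article’s actual implementation. It assumes that the lazy loader keeps the real image URL in a `data-src` attribute; the names `ArchiveCleaner` and `clean_html` are made up for this illustration.

```python
from html import escape
from html.parser import HTMLParser


class ArchiveCleaner(HTMLParser):
    """Re-emit HTML without <script> elements and with lazy images resolved.

    Comments and the doctype are dropped, because their handlers are not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _emit_tag(self, tag, attrs, closing=""):
        attrs = dict(attrs)
        # Assumption: the real image URL is stored in `data-src`.
        if tag == "img" and attrs.get("data-src"):
            attrs["src"] = attrs.pop("data-src")
        parts = [tag]
        for name, value in attrs.items():
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closing + ">")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing script is simply dropped.
        if tag != "script" and not self.in_script:
            self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")


def clean_html(html):
    """Return html with scripts removed and `data-src` promoted to `src`."""
    cleaner = ArchiveCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Calling `clean_html(page_source)` returns markup that no longer runs any script when opened locally and whose `<img>` tags point at the real image URLs, which a downloader could then fetch and rewrite to local paths.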

Problems with the web clippers of note-taking software (such as Evernote or Obsidian):

  1. Must be used with specific note-taking software
  2. Images may be saved to the note-taking software’s server or to third-party cloud storage, rather than to the user’s local disk
  3. Support for dynamic elements on the page, such as mathematical formulas and videos, may be incomplete

In addition, I often use the read-it-later service getpocket to save WeChat official account articles, but any article that contains mathematical formulas is basically unreadable there, because getpocket cannot display the formulas at all.

Overall, the browser’s print-to-PDF comes closest to my requirements, but it falls short in the details.

Future outlook

If this tool attracts enough users, I may consider adding the following features based on their feedback:

  • Save in EPUB format (currently the output is a folder)
  • Resumable downloads
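
For readers unfamiliar with resumable downloads, one common approach is the HTTP `Range` header: the client asks the server to send only the bytes it does not yet have. The sketch below shows that general technique with Python’s standard library. It is not a description of how get-article does or will implement the feature, and the names `range_header` and `resume_download` and the `.part` suffix are assumptions made for this example.

```python
import os
import urllib.request


def range_header(part_path):
    """Return the request headers needed to continue a partial download."""
    done = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    # Ask only for the bytes after those already on disk.
    return {"Range": f"bytes={done}-"} if done else {}


def resume_download(url, dest):
    """Download url to dest, continuing from dest + '.part' if it exists."""
    part = dest + ".part"
    request = urllib.request.Request(url, headers=range_header(part))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range, so append.
        # Any other success status means it sent the whole file, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(part, mode) as out:
            while chunk := response.read(64 * 1024):
                out.write(chunk)
    # Rename only after the transfer finished without an exception.
    os.replace(part, dest)
```

This sketch leaves out error handling. For example, a server answers with status 416 when the partial file is already complete, and `urlopen` raises that as an `HTTPError` the caller would need to handle.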
