How to Learn Introductory Data Science Systematically, for Free?

Title image: Photo by Myriam Jessier on Unsplash

Pain points

Readers often leave comments or send me private messages asking:

Mr. Wang, I have no background. If I want to get started with data science, is such-and-such training course (priced at XXXX yuan) reliable?

This question is really hard to answer. I have never attended such a training course, so I don’t know its quality and dare not make a rash judgment.

If you want to learn data science systematically as an outsider to the field, the bewildering range of training courses is not the only problem: books, materials, and online learning resources are hard to choose too. That is because data science knowledge and skills develop so rapidly. Results that others obtained with a certain method and model a little over a year ago could be published in top journals. Use the same method today, and you may be ruthlessly rejected. It’s not entirely your fault; better models have simply emerged in that short time. What you’ve learned, unfortunately, is out of date.

Then your next question should be:

Are there learning materials that save money and still keep up with the cutting edge?

This sounds greedy. If someone really told you such a good thing existed, you would wonder whether he was a con artist. But to be honest, such free learning materials really do exist, and many beginners overlook them, because they are found not on some well-known MOOC platform but in a corner of a data science competition website.

The stereotype is that you only enter a competition once you have enough knowledge and skills, so nobody expects a competition site to explain the basics in detail. That may be why many people miss it.

The competition website is called Kaggle. This set of courses is called Kaggle Courses.

Discovery

I first discovered this set of in-house courses from Kaggle in April 2018. I still know the exact time because I took notes back then.

You see, taking notes promptly really matters, right?

At that time, Kaggle’s in-house courses still went by the name “Kaggle Learn”. There were very few courses at first, only 6.

But even in 2018, I felt that the Kaggle courses were special, which is why I took notes. What was special about them? At least the following two points:

Take the data visualization material as an example. The Python visualization tool it introduced was Seaborn.

Seaborn learning materials are everywhere today and nothing unusual. But in early 2018, most data science tutorials still used lower-level Python plotting packages such as matplotlib, which are more complicated to use. With Seaborn, very simple statements can produce quite complex, publication-quality graphics. In my 2018 notes, I kept a few screenshots as examples.

Here is an ordinary distribution plot of a single variable:

This graph can simultaneously describe two variable distributions and their associations:

The above graphs can be easily made with Seaborn.

Moreover, even then the visualization course in Kaggle Learn was not limited to Seaborn; it also covered Plotly for interactive graphics. Here is a screenshot from the course:

It’s not hard to see that Plotly makes three-dimensional plots like this easy. You can also drag the plot to view it from different angles.

At the time, because there were so few courses, I did not give it as much attention as it deserved. But my record of working through those materials has been preserved intact in the Kaggle system to this day.

Changes

As the saying goes, “after three days apart, you should look at a scholar with new eyes”. The same goes for Kaggle’s Courses section.

A few days ago, I was looking for a free GPU cloud environment that my students could use in a machine learning class, so I opened the Kaggle website. I suddenly found that the original Kaggle Learn had been greatly expanded and had grown into a separate Courses section on Kaggle.

As you can see, the courses now cover a wide variety of topics.

This is an incomplete list of courses, including:

Sample

Here I take the tutorial on Pandas, the Python data frame library, as an example to show you the features of Kaggle Courses.

The Pandas course is divided into 6 parts:

As you can see, the content is detailed, systematic, and comprehensive. The advantage of this step-by-step approach is that beginners are not exposed to too many concepts at once, which would leave them scrambling, fixing one problem only to have another pop up. You solve basic problems first and then gradually move on to more complex parts, consistent with the idea of “gradual iteration + high-level repetition” in “New Concept English”.

Each module is divided into two parts: explanation and practice.

The explanations include text, pictures, code, and the corresponding running results.

In the exercises, thanks to the learntools package, the Kaggle platform can automatically give you hints and reference answers, and can even check whether the code you entered is correct.

When the code runs correctly, the message looks like this:

And if something goes wrong, Kaggle gives the specific reason for the error:

Note that this feedback is invaluable for beginners. With feedback and hints, you know which direction to take when revising, so you achieve twice the result with half the effort. To a large extent, it spares beginners the exhaustion of guessing and trial and error, and keeps them from going “from getting started to giving up”.
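
The idea behind this kind of automatic checking can be sketched in a few lines of plain Python. This is not the learntools API; the Exercise class, its fields, and the sample question are hypothetical, made up only to show how a checker can hold a hint and a reference answer and report a specific reason when an answer is wrong.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Exercise:
    """A hypothetical practice question with a hint, a reference answer, and a checker."""
    hint: str
    solution: str
    expected: Any

    def check(self, answer_fn: Callable[[], Any]) -> str:
        """Run the learner's code and describe what, if anything, went wrong."""
        try:
            result = answer_fn()
        except Exception as exc:
            # The code crashed: report the error type and message.
            return f"Incorrect: your code raised {type(exc).__name__}: {exc}"
        if result == self.expected:
            return "Correct"
        # The code ran but produced the wrong value: show both values.
        return f"Incorrect: expected {self.expected!r}, but got {result!r}"


prices = [4, 8, 15, 16, 23, 42]
q1 = Exercise(
    hint="Divide the sum of the prices by the number of prices.",
    solution="sum(prices) / len(prices)",
    expected=18.0,
)

print(q1.check(lambda: sum(prices) / len(prices)))  # a correct answer
print(q1.check(lambda: max(prices)))                # runs, but gives the wrong value
print(q1.check(lambda: sum(prices) / 0))            # raises an error
print("Hint:", q1.hint)
```

The point of the design is that a wrong answer never fails silently: the learner sees either the error that was raised or the gap between the expected and the actual value.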

Even if your answer runs correctly, you can compare it with the reference answer. There is often more than one correct answer, and answers can differ in execution efficiency. This comparison is also an important way to learn.

In my article “MOOC Teaching: What Matters Most?”, I mentioned that the most valuable thing in MOOC teaching is feedback. Course handouts and videos can be copied and distributed at scale and at low cost, but feedback needs to be personalized, and this is where course scale and course quality often come into conflict. For tasks like writing code, making full use of current automation technology can give beginners enough feedback. This is a unique advantage of learning data science and programming.

Other

Let’s take a look at the other featured sections of Kaggle Courses. Due to space limitations, I have selected a few here, including:

Let’s start with data visualization. You’ll find that Seaborn has real staying power: it is still the package Kaggle chooses for teaching visualization.

It’s just that the content is more detailed and varied than before. I plan to take the time to study it systematically and will come back to share what I learn.

In addition to Seaborn and plotly, “Geographic Information Visualization” has been added to the data visualization section. With just a few lines of code, you can overlay various data on top of the map layer, making it easy for the reader to see at a glance.

You can also make the map interactive, so that readers can zoom in and out as they like, like this.

In addition to visualization, I think time series analysis is worth talking about.

After all, besides panel data (e.g. shopping records, reviews), we often work with time series. For example, in “How to Use Python to Visualize Public Opinion Time Series?”, you can see how to make a time series visualization of sentiment indicators like this.

Processing time series data in the past was quite troublesome. Now, thanks to more mature software packages, you can clean and visualize time series with less code.

With time series, we often need to forecast trends. Forecasts can use traditional algorithms or machine learning. I gave an example in the article “How to Predict Severe Traffic Congestion with Python and Recurrent Neural Networks?”.

The figure below is an example of predicting flu data in Kaggle Courses.

How many lines of code do you think it would take to model, predict and visualize such data? 200? 500?

In fact, the core code is just this:

All the code comes with explanations. The course walks you through the intermediate results step by step, so that you have enough groundwork to master the material gradually.
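
The course’s own code is not reproduced here. As a stand-in, the following sketch shows the basic idea of fitting a trend and extrapolating it, using ordinary least squares on a time-step index in plain Python. The function names and the weekly numbers are made up for illustration; a real project would more likely use pandas and scikit-learn.

```python
from typing import List, Sequence, Tuple


def fit_linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of value = intercept + slope * time_step (steps 0, 1, 2, ...)."""
    n = len(values)
    steps = range(n)
    mean_t = sum(steps) / n
    mean_y = sum(values) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in zip(steps, values))
    variance = sum((t - mean_t) ** 2 for t in steps)
    slope = covariance / variance
    intercept = mean_y - slope * mean_t
    return intercept, slope


def forecast(values: Sequence[float], horizon: int) -> List[float]:
    """Extend the fitted trend line `horizon` steps beyond the last observation."""
    intercept, slope = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * t for t in range(n, n + horizon)]


# Hypothetical weekly case counts, invented for this sketch.
weekly_cases = [12, 15, 14, 18, 21, 20, 24, 27]
intercept, slope = fit_linear_trend(weekly_cases)
print(f"trend: {intercept:.2f} + {slope:.2f} * week")
print("next 3 weeks:", [round(v, 1) for v in forecast(weekly_cases, horizon=3)])
```

A straight line only captures the overall direction. Seasonal patterns or lagged effects, which flu data usually have, need additional features or a different model.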

After learning by example, I believe you will be full of confidence. At that point, you can follow the guidance and get hands-on in the exercise area.

The thing that excites me most about this Kaggle course is the dedicated AI ethics section.

Much of the AI news in recent years has made people gradually realize how important and serious AI ethics is. In short, if AI research is allowed to “fly free”, we will face bad consequences before long. For example, machine models might discriminate against us, or even look down on us, because of factors such as looks, genes, or socioeconomic status. The science fiction scenes of humans being bullied or even enslaved by machines would become reality.

A machine itself is neither good nor evil, because it is shaped by people. That is exactly why AI ethics matters so much for data science practitioners. If it is not cultivated at the learning stage, it is like a driving school that trains driving skills without teaching traffic rules. When such a driver takes to the road, what he holds is not a steering wheel but the trigger of a deadly weapon.

Original intention

I’m introducing this set of courses in such detail because it’s completely free and also offers certificates of completion. This spirit of sharing needs your sharing, too, to be passed on.

I kept wondering: developing these courses, writing the accompanying exercises, refining the answers, and constantly adjusting the content as the environment changes, does none of that cost anything? Isn’t Kaggle doing this at a loss? What is it after?

Besides, this isn’t the only “stupid thing” Kaggle has done. Don’t forget that I opened Kaggle this time because I wanted my students to use its free GPU time and its large collection of open data resources. All of these come at Kaggle’s expense.

Later, I think I figured it out. These seemingly “stupid” moves actually close a loop. As a data science competition website, Kaggle needs data, computing power, and problems, but it also needs people, that is, enough participants. Not every user who comes to the site has basic data science knowledge, but many of them have considerable potential waiting to be discovered.

Building a set of courses does cost a lot. But if the courses let beginners get started quickly and master the introductory content, their competition results become all the more promising. A rapid rise in the overall level of participants brings significant benefits to such a website and its community, at the level of the whole ecosystem.

Summary

Once I understood this, I felt I could recommend the Kaggle data science courses to you with more confidence. Because they are the product of practice, feedback, and iteration by many beginners, their quality is better assured.

I hope this recommendation helps you take fewer detours and gain a greater sense of accomplishment on the road into data science.

What do you think of these courses? Do you know any better introductory data science resources to share? Please leave a comment, and let’s discuss.

If you find this article useful, please “charge” it.

If this article might be helpful to your friends, please forward it to them.

You are welcome to follow my column “Scientific Research Tool” to receive follow-up updates promptly.

Further reading

To borrow or not to borrow: How to use Python and machine learning to help you decide?

Want to create a personalized and efficient workflow, but don’t know how to program?

Progress of Cloud Migration of Supporting Codes for “The Way of Numbers”

The world is very big, how do you go to see if your English is not good?

To practice “heavy equipment and light use”, what combination of tools do you use in your knowledge management process?

This article is reproduced from: https://sspai.com/post/74145
This site is for inclusion only, and the copyright belongs to the original author.