4 skills to focus on when learning Data Science

Python, R, Spark, TensorFlow, pandas, data visualization, etc., etc., etc. With so many technologies and concepts, the task of learning data science, especially for people who are learning on their own, can appear daunting. My goal in this article is to make your data science journey less overwhelming. I’m going to show you four skills I believe you should focus on when starting out, in order of importance.

Skill #1: Python

You will be writing code on a daily basis as a data scientist: code for extracting and parsing data, code for building and training machine learning models, and code for creating data visualizations to communicate your findings to stakeholders. You may also be asked to write modules that integrate into software products your company is building. Therefore, knowing how to write code, and good, clean code in particular, is important. Python is a great language to start with because its syntax is simple and fairly intuitive compared to most languages, and it is very popular for data science as well as general software development.

Why Python and not another language like R?

Even though R was designed specifically for statistical analysis and data visualization, I would encourage people starting out to learn Python because it’s a much more versatile language. R is a favorite among researchers and scholars in academia and works really well for quick exploratory data analysis and early modeling. Where R falls short is in its ability to scale in a production environment.

Most mainstream developers you will be collaborating with as a data scientist also don’t know R. Considering that your work may become part of a live software product, you’re better off starting with a more mainstream language like Python.

Skill #2: SQL

A common activity in many data science roles is retrieving data from a database. In order to do this, knowledge of SQL is essential.

As for which database management system to start with, choose any one you want. The core structure and syntax of SQL are largely the same across database management systems, save for a few minor differences such as function names.
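
To make the idea of retrieving data with SQL concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The orders table, its columns, and the rows in it are invented for illustration; the same SELECT ... GROUP BY pattern should carry over to other database systems with little or no change.

```python
import sqlite3

# An in-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, made up for this example.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

SQLite is a convenient place to practice because it needs no server setup, but any relational database would do for learning the query language itself.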

What about NoSQL?

A large majority of the data you will be working with as a data scientist will be structured and relational in nature. Relational databases and SQL are the go-to tools when working with this type of data. 

NoSQL technology is used when dealing with unstructured data such as documents, audio, and geospatial data. NoSQL may also be used when an application requires a level of efficiency and scalability that can’t be provided by a relational database management system.  

Many of the relational database concepts you will learn with SQL will help you better understand NoSQL. Therefore I would recommend you learn SQL first and then pick up NoSQL later when you really have a need to learn it.

Skill #3: Data Visualization 

As a data scientist, you will be expected to communicate the results of your findings to business stakeholders. Data visualization is one of your primary tools for doing this. There's a multitude of packages for data visualization available under Python. Two packages I'd recommend you start out with before exploring others are matplotlib and seaborn.
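
As a rough idea of what a first chart looks like, here is a small matplotlib sketch. The function name plot_sales and the monthly sales numbers are hypothetical, chosen only for illustration, and the example assumes matplotlib is installed.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt


def plot_sales(months, sales):
    """Draw a labeled bar chart and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(months, sales)
    ax.set_title("Monthly sales")
    ax.set_xlabel("Month")
    ax.set_ylabel("Units sold")
    fig.tight_layout()
    return fig, ax


# Made-up figures for illustration only.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]
fig, ax = plot_sales(months, sales)
```

From here, fig.savefig("monthly_sales.png") writes the chart to a file. A title and axis labels are a small touch, but they matter a lot when the audience is a stakeholder rather than another data scientist.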

Skill #4: Machine Learning 

Once you have a good grasp of probability, statistics, and linear algebra, it’s time to move on to machine learning. There are so many machine learning algorithms out there that it can be hard for those starting out to figure out which ones to learn first. Pick five or six algorithms to study and understand, and move on to more as you get comfortable with them. My recommendation would be:

  • Linear Regression
  • Logistic Regression
  • K-means
  • Decision Tree
  • K-nearest neighbors
  • Support Vector Machines
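
To give a feel for what studying one of these algorithms involves, here is a from-scratch sketch of the first item on the list, linear regression with a single feature, fitted by ordinary least squares. The function name fit_line and the data are made up for illustration; in practice you would more likely reach for a library, but writing it out once helps the idea stick.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```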

Along with learning the algorithms, understanding how to evaluate the predictions of the models is important as well.
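
As one example of what evaluation can look like, the sketch below computes accuracy, precision, and recall for a binary classifier in plain Python. The function name classification_metrics and the labels are hypothetical, and these three metrics are only a starting point, not a complete evaluation toolkit.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading when one class is much rarer than the other, which is why precision and recall are worth learning alongside it.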

Do you agree with my suggestions? What are your thoughts? Share in the comments section below.