Blog

Clustering Video Game Attributes
In a previous blog post, I walked through the creation of a simple recommender system that recommends video games to existing users on Steam. Since creating the recommender, my student and I have been exploring ways to improve it. One enhancement we’ve been looking at is reducing the recommender’s computation time by …

Building a Video Game Recommender System
As a code coach at theCoderSchool, I teach and guide young students in the development of software applications. Some apps are simple calculator apps, mad libs generators, and implementations of popular games like Tic-Tac-Toe and Connect Four. Other apps are as complex as web scrapers, networked multi-player space shooters, and Rubik’s cube solvers. Recently I’ve …

PCA and the Nuts and Bolts of It
Clustering high-dimensional datasets can be problematic. In my last blog post, I mentioned a few methods that can be used to cluster data with many dimensions. Today, we will be exploring one of these methods in more detail. By the end of this post you will gain an intuitive understanding of Principal Component Analysis …

High Dimensional Clustering 101
High-dimensional datasets are those containing a large number of attributes, usually more than a dozen. There are a few things you should be aware of when clustering datasets such as these. The goal of this post is to highlight a few strategies you can use when performing high-dimensional clustering. Setting the Stage To …

Let’s Get Fuzzy with C Means Clustering
Not every dataset you will encounter in the wild will have data points that can be neatly placed in a single cluster. You may come across data points that could potentially belong to two, three, or possibly more clusters. Consider this scenario. You work as a data analyst for a company that has developed a …

Scaling your data before clustering
You will often need to perform some form of preprocessing on your dataset before running a data clustering algorithm. In this post I will introduce one important preprocessing method that is often overlooked when clustering data. That method, my friends, is called data scaling. What is data scaling? As you may already know, clustering algorithms …

Weighing your k-means data clustering
In my previous post on the K-Prototypes algorithm, I introduced the concept of parameter weighting. With just a single parameter, you can control the amount of influence categorical attributes have on the data clustering. In this post, I’m going to expand on this concept and show how we can control the weightings of entire data …

Mixed Data Clustering using K-Prototypes
This blog has thus far covered a number of data clustering algorithms. A majority of the algorithms operate under the assumption that the dataset to be clustered is purely numerical. The k-Modes algorithm was the one method we’ve covered that works on purely categorical data. What if you want to cluster a dataset that contains …

Medoids Data Clustering with CLARANS
K-Medoids is a data clustering algorithm that is commonly used on datasets containing extreme values. In previous posts on the subject, I covered two implementations of the algorithm that can be used to find medoids in a dataset. The first method, called PAM, scans every data point in the dataset for k medoids. PAM can …