Clustering Video Game Attributes

In a previous blog post, I walked through the creation of a simple recommender system that recommends video games to existing users on Steam. Since creating the recommender, my student and I have been exploring ways to improve it. One enhancement we've been looking at is speeding up the computational time of the recommender by clustering video game attributes (tags, genres, and specs) into smaller, more manageable groups. In this post I will describe how we utilized the K-means algorithm to do this.

Improving the Recommender by Computing Probabilities

The current system uses 188 categorical attributes to recommend video games to existing users. The biggest disadvantage of this approach is the large amount of computational time required to build the tables the recommender requires. The numerous categorical features also makes adding numerical features (such as game price) to the system challenging; the influence of the categorical features will outweigh any impact the numerical features could have on the recommendations. To correct both issues, we will attempt to reduce the number of attributes the recommender using K-Means clustering.

The raw, unprocessed video game metadata contains 380+ attributes. We observed from looking at the data for several video games that some attributes tend to appear together with other attributes. This gave me the idea to try to group these attributes based on the likihood the attributes will appear together. We will do this by constructing a square matrix that contains the conditional probability of observing any two attributes in a video game. We use this matrix to cluster the game attributes.

The code snippet below shows the construction of the probability matrix. We iterate through the each game in the steam games dataset and construct a dictionary containing the sets of games that have each attribute.

https://gist.github.com/daryleserrant/383587ef6041c3d774889af3e847e247

We use the attributes dictionary to build the probability matrix.

https://gist.github.com/daryleserrant/1985c2997d936c5a0d9ea6c4eaea7df8

Applying Dimensionality Reduction

The probability matrix computed in the last section has 381 dimensions. As discussed in my high dimensional clustering post, clustering data with very high dimensions could be problematic. To avoid these problems, we're gonna apply dimensionality reduction.

The code snippet below uses the pca module provided by Sci-kit learn to perform Principal Component Analysis on the probability matrix. To determine the number of principal components to keep, we computed a cummulative sum of the explained variance ratios for each principal component. Check out my article on PCA for more details on how it works.

pca = PCA()
pca.fit(probability_matrix)

total = 0
for idx, r in enumerate(pca.explained_variance_ratio_):
  total += r
  print("{0} Components: {1}".format(idx, total))

We decided to keep just enough components to explain 70% of the variability in the data. That number happened to be 31. We will refit PCA to the dataset and reduce the dimensions.

# Refit PCA to the probability matrix and keep only the 31 principal components
pca = PCA(n_components=31)
pca.fit(probability_matrix)
feature_set = pca.transform(probability_matrix)

With the newly featurized dataset in place, we can now proceed with the data clustering.

Discovering Game Attribute Categories

We use the sihoulette method to find the optimial number of clusters for k-means. The code snippet below uses the Sci-kit learn implementation of k-means and silhouettee score to derive scores for different numbers of clusters. Check out this post for details on how the sihouette method works.

range_n_clusters = range(2, 28)

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=25)
    cluster_labels = clusterer.fit_predict(feature_set)

    silhouette_avg = silhouette_score(feature_set, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
    )

Using the snippet above we determined the optimal number of clusters to be 24. Last, but not least, we group the game attributes into 24 clusters and output the group assignments.

https://gist.github.com/daryleserrant/59d8f94ad6dc801aa23fb19f9ad2a6f4
Cluster 0: Moddable,Trading,City Builder,Building,Economy,Base Building,Sandbox,Management,Space,Political,Agriculture,Space Sim,Capitalism,Politics,Resource Management,God Game,Fishing,Mining
Cluster 1: Action,Indie,Simulation,Strategy,Single-player,RPG,Multi-player,Online Multi-Player,Cross-Platform Multiplayer,Steam Achievements,Steam Trading Cards,Stats,Adventure,Full controller support,Downloadable Content,Steam Cloud,Steam Leaderboards,Partial Controller Support,Early Access,Shared/Split Screen,Valve Anti-Cheat enabled,Steam Turn Notifications,Co-op,Violent,Commentary available,Steam Workshop,Includes level editor,Western,Flight,Tower Defense,Game demo,On-Rails Shooter,Soundtrack,Pinball
Cluster 2: 2D,Replay Value,Difficult,Pixel Graphics,Cute,Singleplayer,Great Soundtrack,Retro,Platformer,Side Scroller,Stylized,Arcade,Underground,Remake,Action-Adventure,Spectacle fighter,Character Action Game,Beat 'em up,Controller,Fast-Paced,2.5D,Ninja,Puzzle-Platformer,Time Attack,Colorful,3D Platformer,Psychedelic,Score Attack,1980s,Time Manipulation,Cartoon,Metroidvania,Blood,Runner,Cartoony,GameMaker
Cluster 3: FPS,Shooter,Third-Person Shooter,Sniper,Third Person,Survival,Classic,Gore,Sci-fi,Aliens,First-Person,Stealth,Assassin,Hunting,Futuristic,Cyberpunk,Destruction,Mechs,Robots,Lara Croft,Dinosaurs,Parkour,3D Vision,Zombies,Survival Horror,Bullet Time,Arena Shooter,Post-apocalyptic,Inventory Management,Star Wars,6DOF,Heist,Transhumanism,Gun Customization,Mars
Cluster 4: Mod,Mods,Mods (require HL2),Mods (require HL1)
Cluster 5: Design & Illustration,Tutorial,Education,Animation & Modeling,Animation & Modeling,Video Production,Utilities,Web Publishing,Game Development,Software Training,Design & Illustration,Audio Production,Photo Editing,Accounting
Cluster 6: HTC Vive,Oculus Rift,Tracked Motion Controllers,Room-Scale,VR,Seated,Standing,SteamVR Collectibles,Keyboard / Mouse,Gamepad,Windows Mixed Reality,360 Video
Cluster 7: Character Customization,Open World,Crafting,Swordplay,Hack and Slash,Action RPG,Medieval,Pirates,Dragons,Voxel,Sailing
Cluster 8: Card Game,Trading Card Game,Turn-Based,Board Game,Turn-Based Strategy,4X,Turn-Based Tactics,Warhammer 40K,Games Workshop,Hex Grid,Tactical RPG,Turn-Based Combat,Strategy RPG,Asynchronous Multiplayer,Chess
Cluster 9: Female Protagonist,Nudity,Anime,Choices Matter,Multiple Endings,Romance,Visual Novel,Sexual Content,Interactive Fiction,Dating Sim,RPGMaker,Choose Your Own Adventure,Text-Based,Otome
Cluster 10: Tactical,War,Rome,Historical,Wargame,Cold War,Real-Time with Pause,RTS,Diplomacy,World War II,Alternate History,Real-Time,Grand Strategy,Real Time Tactics,Military,Naval,Tanks,America,World War I,Modern
Cluster 11: 1990's,Story Rich,Atmospheric,Silent Protagonist,Linear,Mystery,Experience,Psychological Horror,Horror,Exploration,Point & Click,Underwater,Lovecraftian,Demons,Detective,Supernatural,Steampunk,Dystopian,Dark,Mature,Noir,Cinematic,FMV,Cult Classic,Based On A Novel,Surreal,Short,Walking Simulator,Psychological,Time Travel,Hand-drawn,Experimental,Quick-Time Events,Conspiracy,Narration,Dynamic Narration,Lore-Rich,Conversation,Nonlinear,Philisophical,Mystery Dungeon
Cluster 12: Realistic,Driving,Trains,TrackIR
Cluster 13: Captions available,Episodic,Crime,Benchmark,Movie,Thriller,Werewolves,Documentary,Martial Arts,Drama,Gaming,Foreign,Feature Film,Hardware,Faith
Cluster 14: Fantasy,Dark Fantasy,Gothic,Isometric,Vampire,Magic,Mythology,Villain Protagonist,CRPG,Dungeon Crawler,JRPG,Kickstarter,Investigation,Crowdfunded,Grid-Based Movement,Voice Control
Cluster 15: Casual,Physics,Science,Clicker,Puzzle,Music,Hidden Object,Match 3,Touch-Friendly,Family Friendly,Level Editor,Abstract,Relaxing,Mouse only,Music-Based Procedural Generation,Rhythm,Minimalist,Hacking,Lemmings,Sokoban,Typing,Programming,Artificial Intelligence,Word Game,Spelling,Steam Machine
Cluster 16: LEGO,Batman,Superhero,Comic Book
Cluster 17: Party-Based RPG,Software
Cluster 18: Sports,Racing,Golf,Horses,Offroad,Bowling,Mini Golf,Football,Soccer,Gambling,Basketball,Cycling,Pool,Wrestling
Cluster 19: Free to Play,PvP,Competitive,In-App Purchases,Multiplayer,Massively Multiplayer,MMO,Online Co-op,MMORPG,Online Co-Op,Team-Based,Includes Source SDK,Class-Based,PvE,MOBA,e-sports
Cluster 20: Local Co-op,Local Multi-Player,Co-op Campaign,Local Co-Op,Local Multiplayer,Fighting,Split Screen,2D Fighter,4 Player Local
Cluster 21: Top-Down,Top-Down Shooter,Loot,Shoot 'Em Up,Twin Stick Shooter,Bullet Hell,Rogue-like,Procedural Generation,Rogue-lite,Perma Death
Cluster 22: Funny,Comedy,Satire,Dark Humor,Memes,Parody,Illuminati,Intentionally Awkward Controls,NSFW,Dark Comedy
Cluster 23: Bikes

K-means does a pretty good job with grouping attributes into meaningful clusters. One downside you might notice that the cluster assignments will not be consistent when the snippet is run subsequent times. Because the initial centroids used by K-means are choosen at random, we will get different cluster outputs on the same dataset. We can work around this issue by performing the clustering several times and choosing the results that have the lowest sum of squared error.

That's all folks!

In my next post, we will use the attribute clusters to make a new version of the recommender. We will then explore methods for evaluating how good of a job our recommenders do with recommending video games to users. You can find the code for the entire solution here.