Building a Video Game Recommender System
As a code coach at theCoderSchool, I teach and guide young students in the development of software applications. Some apps are simple: calculators, mad libs generators, and implementations of popular games like Tic-Tac-Toe and Connect Four. Others are as complex as web scrapers, networked multi-player space shooters, and Rubik's cube solvers. Recently I've been working with one of my advanced students on a video game recommender system, and I immediately had the thought to document our process in a series of blog posts.
A recommender system helps users discover new products and services that they would otherwise not find on their own. Companies like Amazon and Netflix use recommender systems to suggest new products and movies for their users to buy and watch. Content-based recommender systems, the kind built in this post, work by examining how similar items are to the ones a user has used in the past.
In a series of blog posts I will guide you through the development of different recommender systems using the Steam Video Game Dataset. By the end of this post, you will have learned how to create a very basic content-based recommender system that recommends video games to existing users.
This post assumes that you have a solid understanding of Python and the open source pandas library. If your knowledge of Python and pandas is shaky, make sure to brush up on them before proceeding.
Checking out the Steam Video Game Dataset
The Steam Video Game Dataset provides several JSON files that contain information about reviews on the Steam platform, user and item metadata, and item bundles. For the recommender system we will be building in this post, we will use the User and Item Data and Item metadata JSON files.
The User and Item Data file contains information collected from over 5 million Steam users. Shown below is the JSON object structure for a single user.
{'items': [{'item_id': <int>,
            'item_name': <string>,
            'playtime_2weeks': <int>,
            'playtime_forever': <int>},
           …],
 'items_count': <int>,
 'steam_id': <string>,
 'user_id': <string>,
 'user_url': <string>}
The items element contains a list of video games played by a single user and the amount of time the user spent playing each game.
The items_count element provides the total number of games played by the user. The steam_id, user_id, and user_url elements are unique identifiers for the user within the Steam platform.
The Item metadata file provides data on 32,000 games on Steam. Below is the JSON object for one item in the dataset.
{'app_name': 'Lost Summoner Kitty',
'developer': 'Kotoshiro',
'discount_price': 4.49,
'early_access': False,
'genres': ['Action', 'Casual', 'Indie', 'Simulation', 'Strategy'],
'id': '761140',
'price': 4.99,
'publisher': 'Kotoshiro',
'release_date': '2018-01-04',
'reviews_url': ' http://steamcommunity.com/app/761140/reviews/?browsefilter=mostrecent&p=1',
'specs': ['Single-player'],
'tags': ['Strategy', 'Action', 'Indie', 'Casual', 'Simulation'],
'title': 'Lost Summoner Kitty',
'url': ' http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/'}
Now that you're acquainted with the data that we will be using, I'll provide an overview of how the recommender will be built.
How the Recommender Will Work
We will develop a simple algorithm that finds video games with characteristics similar to those of games the user has played in the past. The JSON data files covered in the last section will be used to build two tables.
The first table, which we will call the game_features table, contains the attributes of each game in the dataset.
game id | genre 1 | ... | genre n | tag 1 | ... | tag n | spec 1 | ... | spec n |
xxxxxxxx | 1 | ... | 0 | 1 | ... | 0 | 0 | ... | 0 |
xxxxxxxx | 0 | ... | 0 | 0 | ... | 1 | 1 | ... | 1 |
xxxxxxxx | 1 | ... | 0 | 0 | ... | 0 | 1 | ... | 0 |
xxxxxxxx | 0 | ... | 1 | 1 | ... | 1 | 0 | ... | 1 |
xxxxxxxx | 1 | ... | 0 | 0 | ... | 0 | 0 | ... | 0 |
We will use the genres, tags, and specs fields from the item metadata file to create a set of binary features for each game. Games that have the particular attribute will have the value 1, otherwise the value will be 0.
The second table will contain the preferred video game characteristics for each user. We'll call this table the user_features table.
user id | genre 1 | ... | genre n | tag 1 | ... | tag n | spec 1 | ... | spec n |
xxxxxxxx | 0 | ... | 1 | 1 | ... | 1 | 1 | ... | 0 |
xxxxxxxx | 1 | ... | 0 | 0 | ... | 0 | 1 | ... | 0 |
xxxxxxxx | 1 | ... | 0 | 0 | ... | 0 | 0 | ... | 0 |
xxxxxxxx | 0 | ... | 0 | 0 | ... | 1 | 0 | ... | 1 |
xxxxxxxx | 1 | ... | 0 | 0 | ... | 0 | 0 | ... | 1 |
This table has the same structure as the game_features table. Each binary feature indicates whether or not the user has played a game that has that particular attribute.
Given an existing user, the algorithm will recommend new games by performing the following actions:
- Filter out all the games in the game_features table that the user has already played.
- Compute a dissimilarity score between each remaining game in the game_features table and the user's preferred game characteristics. We will use the dissimilarity scoring method covered in my post on the K-modes algorithm.
- Return the top 10 games with the lowest dissimilarity score.
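To make the scoring step concrete, here is a minimal sketch. K-modes conventionally uses simple matching dissimilarity (the number of attributes on which two records disagree), and I am assuming that is the measure meant here. The function name and the toy vectors are made up for illustration.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count the positions where two equal-length feature vectors differ."""
    a = np.asarray(a)
    b = np.asarray(b)
    return int(np.sum(a != b))

# Hypothetical binary feature vectors (not taken from the dataset).
user = [1, 0, 1, 1, 0]
game_a = [1, 0, 1, 0, 0]   # differs from the user in one position
game_b = [0, 1, 0, 0, 1]   # differs from the user in every position

print(matching_dissimilarity(user, game_a))  # 1
print(matching_dissimilarity(user, game_b))  # 5
```

A lower score means the game looks more like what the user has already played, which is why the last step keeps the games with the lowest scores.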
Now that you know how the algorithm will work, let's start coding!
Loading the Steam Games Dataset
First things first. We will read the item metadata file, parse the JSON objects, and create a pandas dataframe containing the fields from each object. Because the item metadata file contains improperly formatted JSON, the pandas read_json() function cannot be used to create the dataframe. We will need to iterate through each JSON object independently and parse it using the Python ast module.
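A loader along these lines might look like the sketch below. It assumes the file holds one Python-style dict literal per line, and the function names and the file name in the commented-out call are placeholders rather than the original code. Because the later snippets call .split(",") on the genres, tags, and specs columns, I am also assuming those list fields were stored as comma-separated strings, which the second helper reproduces.

```python
import ast
import pandas as pd

def load_python_dict_lines(path):
    """Parse a file holding one Python-literal dict per line into a DataFrame."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                # literal_eval accepts single-quoted strings, True/False, etc.
                records.append(ast.literal_eval(line))
    return pd.DataFrame(records)

def join_list_columns(df, columns):
    """Return a copy where list-valued cells become comma-separated strings.

    Assumption: this mirrors how the later snippets expect the data,
    since they split these columns on commas.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(
            lambda v: ",".join(v) if isinstance(v, list) else None
        )
    return out

# Hypothetical file name; substitute the path of your downloaded copy.
# steam_games_df = load_python_dict_lines("steam_games.json")
# steam_games_df = join_list_columns(steam_games_df, ["genres", "tags", "specs"])
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval() for data you did not write yourself.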
Here's what some of the data from the resulting data frame looks like:

Building the Steam Game Features Table
With the item metadata loaded, we can now build the game_features table. But before we do that, let's examine the values of the genres, tags, and specs attributes. The code snippet below creates a pandas series for each attribute.
genres = []
tags = []
specs = []
for idx in range(steam_games_df.shape[0]):
    game_genre = steam_games_df.iloc[idx]['genres']
    game_tags = steam_games_df.iloc[idx]['tags']
    game_specs = steam_games_df.iloc[idx]['specs']
    if game_genre:
        genres.extend(game_genre.split(","))
    if game_tags:
        tags.extend(game_tags.split(","))
    if game_specs:
        specs.extend(game_specs.split(","))
genres_srs = pd.Series(genres)
tags_srs = pd.Series(tags)
specs_srs = pd.Series(specs)
Using the pandas series unique() method, we can obtain the unique values for each attribute.
genres_srs.unique()
>> array(['Action', 'Casual', 'Indie', 'Simulation', 'Strategy',
'Free to Play', 'RPG', 'Sports', 'Adventure', 'Racing',
'Early Access', 'Massively Multiplayer',
'Animation & Modeling', 'Video Production', 'Utilities',
'Web Publishing', 'Education', 'Software Training',
'Design & Illustration', 'Audio Production', 'Photo Editing',
'Accounting'], dtype=object)
tags_srs.unique()
>> array(['Strategy', 'Action', 'Indie', 'Casual', 'Simulation',
'Free to Play', 'RPG', 'Card Game', 'Trading Card Game',
'Turn-Based', 'Fantasy', 'Tactical', 'Dark Fantasy', 'Board Game',
'PvP', '2D', 'Competitive', 'Replay Value',
'Character Customization', 'Female Protagonist', 'Difficult',
'Design & Illustration', 'Sports', 'Multiplayer', 'Adventure',
'FPS', 'Shooter', 'Third-Person Shooter', 'Sniper', 'Third Person',
'Racing', 'Early Access', 'Survival', 'Pixel Graphics', 'Cute',
'Physics', 'Science', 'VR', 'Tutorial', 'Classic', 'Gore',
"1990's", 'Singleplayer', 'Sci-fi', 'Aliens', 'First-Person',
'Story Rich', 'Atmospheric', 'Silent Protagonist',
'Great Soundtrack', 'Moddable', 'Linear', 'Retro', 'Funny',
'Turn-Based Strategy', 'Platformer', 'Side Scroller',
'Massively Multiplayer', 'Clicker', 'Gothic', 'Isometric',
'Stealth', 'Mystery', 'Assassin', 'Comedy', 'Stylized', 'Co-op',
'War', 'Rome', 'Historical', 'Open World', 'Realistic', 'Crafting',
'Trading', 'MMORPG', 'Swordplay', 'Hunting', 'Violent',
'Experience', 'City Builder', 'Building', 'Economy',
'Base Building', 'Education', 'Golf', 'Wargame', 'Cold War',
'Real-Time with Pause', 'RTS', 'Diplomacy', 'Psychological Horror',
'Sandbox', 'Mod', 'Online Co-Op', 'Animation & Modeling', 'Puzzle',
'Horror', 'Management', 'Futuristic', 'Cyberpunk', 'Destruction',
'Music', 'Driving', 'Arcade', 'Mechs', 'Robots', 'Underground',
'Exploration', 'Point & Click', '4X', 'Trains', 'Top-Down',
'Underwater', 'Turn-Based Tactics', 'Lovecraftian', 'Lara Croft',
'Remake', 'Action-Adventure', 'Dinosaurs', 'Parkour', '3D Vision',
'Hack and Slash', 'Spectacle fighter', 'Character Action Game',
"Beat 'em up", 'Demons', 'Controller', 'Detective', 'Episodic',
'Zombies', 'Fast-Paced', '2.5D', 'World War II', 'Supernatural',
'Alternate History', 'Vampire', 'Space', 'Warhammer 40K',
'Games Workshop', 'Real-Time', 'Steampunk', 'Dystopian',
'Political', 'Dark', 'Action RPG', 'Grand Strategy',
'Real Time Tactics', 'Medieval', 'Hidden Object', 'Crime',
'Survival Horror', 'Mature', 'Noir', 'Bullet Time', 'Cinematic',
'Nudity', 'Co-op Campaign', 'FMV', 'Match 3', 'Anime',
'Touch-Friendly', 'Military', 'Western', 'Family Friendly',
'Ninja', 'Arena Shooter', 'Naval', 'Agriculture', 'Horses',
'Flight', 'TrackIR', 'Tanks', 'Cult Classic', 'Puzzle-Platformer',
'Post-apocalyptic', 'Inventory Management', 'Benchmark',
'Space Sim', 'Choices Matter', 'Based On A Novel',
'Multiple Endings', 'Magic', 'LEGO', 'Batman', 'Local Co-Op',
'Superhero', 'Comic Book', 'Local Multiplayer', 'Offroad',
'Satire', 'Surreal', 'Capitalism', 'Bowling', 'Dark Humor',
'Level Editor', 'Mythology', 'Time Attack', 'Colorful', 'Short',
'Tower Defense', 'Top-Down Shooter', 'Villain Protagonist',
'Fighting', 'Team-Based', 'Split Screen', 'Party-Based RPG',
'CRPG', 'Pirates', 'Walking Simulator', 'Psychological', 'Memes',
'3D Platformer', 'Psychedelic', 'Score Attack', 'Abstract',
'Hex Grid', 'Tactical RPG', 'Turn-Based Combat', 'America',
'2D Fighter', 'Star Wars', '1980s', 'Mini Golf',
'Time Manipulation', 'Time Travel', 'On-Rails Shooter',
'4 Player Local', 'Relaxing', 'Hand-drawn', 'Dungeon Crawler',
'Loot', 'Cartoon', 'Mouse only', 'Experimental', 'Dragons',
'Romance', 'Metroidvania', 'Parody', 'Quick-Time Events',
'World War I', "Shoot 'Em Up", 'Music-Based Procedural Generation',
'Twin Stick Shooter', 'Rhythm', 'Bullet Hell', '6DOF', 'Modern',
'Class-Based', 'PvE', 'Heist', 'Politics', 'Resource Management',
'Conspiracy', 'Minimalist', 'JRPG', 'Visual Novel', 'Hacking',
'Strategy RPG', 'Lemmings', 'Illuminati', 'Sexual Content',
'Movie', 'Blood', 'MOBA', 'Rogue-like', 'Runner', 'Narration',
'Asynchronous Multiplayer', 'Chess', 'God Game', 'Soundtrack',
'Procedural Generation', 'Rogue-lite', 'Perma Death',
'Kickstarter', 'Investigation', 'Thriller', 'Cartoony',
'Crowdfunded', 'Transhumanism', 'Interactive Fiction',
'Dating Sim', 'Werewolves', 'Documentary', 'RPGMaker',
'Gun Customization', 'Video Production', 'Software', 'e-sports',
'Martial Arts', 'Mars', 'GameMaker', 'Utilities', 'Web Publishing',
'Game Development', 'Choose Your Own Adventure', 'Text-Based',
'Football', 'Soccer', 'Intentionally Awkward Controls', 'Gambling',
'Software Training', 'Sokoban', 'Drama', 'NSFW',
'Dynamic Narration', 'Typing', 'Pinball', 'Voxel', 'Basketball',
'Fishing', 'Programming', 'Audio Production', 'Sailing', 'Mining',
'Dark Comedy', 'Grid-Based Movement', 'Otome', 'Voice Control',
'Artificial Intelligence', 'Cycling', 'Gaming', 'Photo Editing',
'Lore-Rich', 'Word Game', 'Pool', 'Conversation', 'Nonlinear',
'Spelling', 'Foreign', 'Feature Film', 'Hardware', 'Steam Machine',
'Philisophical', 'Mystery Dungeon', 'Wrestling', '360 Video',
'Faith', 'Bikes'], dtype=object)
specs_srs.unique()
>> array(['Single-player', 'Multi-player', 'Online Multi-Player',
'Cross-Platform Multiplayer', 'Steam Achievements',
'Steam Trading Cards', 'In-App Purchases', 'Stats',
'Full controller support', 'HTC Vive', 'Oculus Rift',
'Tracked Motion Controllers', 'Room-Scale', 'Downloadable Content',
'Steam Cloud', 'Steam Leaderboards', 'Partial Controller Support',
'Seated', 'Standing', 'Local Co-op', 'Shared/Split Screen',
'Valve Anti-Cheat enabled', 'Local Multi-Player',
'Steam Turn Notifications', 'MMO', 'Co-op', 'Online Co-op',
'Captions available', 'Commentary available', 'Steam Workshop',
'Includes level editor', 'Mods', 'Mods (require HL2)', 'Game demo',
'Includes Source SDK', 'SteamVR Collectibles', 'Keyboard / Mouse',
'Gamepad', 'Windows Mixed Reality', 'Mods (require HL1)'],
dtype=object)
As you can see, the attributes have an extremely high cardinality. To address this, we will group infrequently occurring values for each attribute into an 'other' category. In future posts, we will explore other methods for dealing with high-cardinality categorical data. The next step identifies the groupings for each attribute and then creates the column names that will be used when we build the game_features table.
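One way this grouping could be done is sketched below. The 1% default threshold, the helper names, and the attribute_value column naming are my own assumptions for illustration. The sketch also assumes each attribute is stored as a comma-separated string, as in the earlier snippet.

```python
import pandas as pd

def frequent_values(values, min_share=0.01):
    """Return the values whose share of all occurrences is at least min_share."""
    shares = pd.Series(values).value_counts(normalize=True)
    return set(shares[shares >= min_share].index)

def encode_attribute(cell, keep, prefix):
    """Turn one comma-separated cell into {column: 1} flags, pooling rare values."""
    flags = {}
    if isinstance(cell, str) and cell:
        for value in cell.split(","):
            name = value if value in keep else "other"
            flags[f"{prefix}_{name}"] = 1
    return flags

def build_game_features(games_df, keep_by_attr):
    """Build a binary feature table indexed by game id."""
    rows = []
    for _, game in games_df.iterrows():
        row = {}
        for attr, keep in keep_by_attr.items():
            row.update(encode_attribute(game[attr], keep, attr))
        rows.append(row)
    index = pd.Index(games_df["id"], name="id")
    # Missing flags mean the game lacks that attribute, so fill them with 0.
    return pd.DataFrame(rows, index=index).fillna(0).astype(int)

# Toy data (not from the dataset) to show the shape of the result.
toy = pd.DataFrame({
    "id": ["1", "2", "3"],
    "genres": ["Action,Indie", "Action", "Accounting"],
})
all_genres = [v for cell in toy["genres"] for v in cell.split(",")]
keep = {"genres": frequent_values(all_genres, min_share=0.3)}
game_features = build_game_features(toy, keep)
print(game_features)
```

In the toy data only 'Action' clears the 30% threshold, so 'Indie' and 'Accounting' both fall into the genres_other column.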
Here's what the dataframe looks like:

Building the User Features Table
We will now use the game_features table built in the last section to create the user_features table. We will load the user items JSON file and retrieve the list of games played by each user. Each game will be cross-referenced with the game_features table to determine the user's preferred game characteristics. We will also store each user's play history for later use, when it comes time to recommend new games to that user. Because the user items JSON file contains data for over 5 million users, this step takes a considerable amount of time to run. To speed things up, you can reduce the number of records that are parsed.
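The cross-referencing step could be sketched as follows. It assumes the user records have already been parsed into dicts with the structure shown earlier and that game_features is indexed by string game ids. The function name, the limit parameter, and the choice to skip users with no matching games are assumptions of this sketch.

```python
import itertools
import pandas as pd

def build_user_features(user_records, game_features, limit=None):
    """Return (user_features, play_history) for an iterable of parsed user dicts.

    limit caps how many user records are read (None reads them all).
    """
    known_games = set(game_features.index)
    feature_rows = {}
    play_history = {}
    for user in itertools.islice(user_records, limit):
        # item_id is an int in the user file but game ids are strings.
        played = [str(item["item_id"]) for item in user["items"]]
        play_history[user["user_id"]] = played
        matched = [g for g in played if g in known_games]
        if matched:
            # A user feature is 1 if any played game has that attribute.
            feature_rows[user["user_id"]] = game_features.loc[matched].max(axis=0)
        # Users with no games in game_features get no feature row.
    user_features = pd.DataFrame.from_dict(feature_rows, orient="index").astype(int)
    user_features.index.name = "user_id"
    return user_features, play_history

# Toy data (not from the dataset).
toy_games = pd.DataFrame(
    {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]},
    index=pd.Index(["10", "20", "30"], name="id"),
)
toy_users = [
    {"user_id": "u1", "items": [{"item_id": 10}, {"item_id": 20}]},
    {"user_id": "u2", "items": [{"item_id": 99}]},
]
user_features, play_history = build_user_features(toy_users, toy_games)
print(user_features)
```

Passing a small limit is one way to reduce the number of records that are parsed while you experiment.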
Here's what the resulting table will look like:

The Recommender Algorithm
With the game_features and user_features tables in place, we can now code the recommender algorithm.
The recommender consists of two functions. The dissimilarity_score function is the scoring function that determines how similar a game is to a user's preferred game characteristics. The recommend_games function uses the dissimilarity_score function to return the ids of the games that are most similar to a user's preferences.
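A minimal sketch of these two functions is shown below. The names dissimilarity_score and recommend_games come from this post, but the bodies and the extra parameters are my assumptions. Here the tables are passed in explicitly so the example is self-contained, whereas the call shown later in this post passes only the user id.

```python
import pandas as pd

def dissimilarity_score(game_row, user_row):
    """Number of features on which a game and a user profile disagree."""
    return int((game_row != user_row).sum())

def recommend_games(user_id, game_features, user_features, play_history, top_n=10):
    """Return the ids of the top_n unplayed games with the lowest scores."""
    user_row = user_features.loc[user_id]
    # Step 1: drop the games the user has already played.
    played = play_history.get(user_id, [])
    candidates = game_features.drop(index=played, errors="ignore")
    # Use the same column order as the user profile so rows line up.
    candidates = candidates[user_features.columns]
    # Step 2: score every remaining game against the user profile.
    scores = candidates.apply(dissimilarity_score, axis=1, user_row=user_row)
    # Step 3: keep the games with the lowest dissimilarity.
    return scores.nsmallest(top_n).index

# Toy tables (not from the dataset).
toy_game_features = pd.DataFrame(
    {"a": [1, 0, 1, 0], "b": [0, 1, 1, 0], "c": [0, 0, 0, 1]},
    index=pd.Index(["10", "20", "30", "40"], name="id"),
)
toy_user_features = pd.DataFrame({"a": [1], "b": [1], "c": [0]}, index=["u1"])
toy_play_history = {"u1": ["10", "20"]}

print(recommend_games("u1", toy_game_features, toy_user_features,
                      toy_play_history, top_n=2))
```

Scoring row by row with apply is easy to follow but slow on a table with tens of thousands of games. A vectorized comparison such as candidates.ne(user_row, axis=1).sum(axis=1) computes the same scores faster.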
Here's what we get when recommending new games for the Steam user 'evcentric':
recommended_games = recommend_games('evcentric')
recommended_games
>>Index(['451600', '302670', '346330', '204360', '238460', '512900', '351100',
'306460', '344890', '257750'],
dtype='object', name='id')
We can use the following code snippet to get the names of the titles:
for game in recommended_games:
    filtered_games = steam_games_df[steam_games_df.id == game]
    game_name = filtered_games.iloc[0]['app_name']
    print(game_name)
We get the following as output:
CounterAttack
Call to Arms
BrainBread 2
Castle Crashers®
BattleBlock Theater®
Streets of Rogue
Niffelheim
Unturned - Permanent Gold Upgrade
ARM PLANETARY PROSPECTORS Asteroid Resource Mining
Bloody Trapland
That's all, Folks!
We have created a basic video game recommender. There are plenty of enhancements we can make to it in order to get better results, and that is what we will cover in the next posts in this series. You can find the code for the entire solution here.