Getting Under the Hood of the Front Page of the Internet

Reddit API Basics in Python

George Ferre


Data scientists are responsible for reviewing, organizing, analyzing, and refining data (and so much more). Before any of that can begin, we need to actually have data at our disposal. In many cases, that may be set up for us: your company may already have a database ready for use, or perhaps your starting point is a CSV or Excel file. However, if you ever have to get your hands dirty and grab some data yourself, APIs are a powerful tool for the job.

What is an API?

APIs, or Application Programming Interfaces, allow two software systems to communicate. Perhaps the most recognizable example for end users is the ability to use credentials from one website as a login for another, such as using a Google account to log into Medium. For data science, APIs allow us to make requests to a website to get data. You can get song or artist information from Spotify, tweets and trending data from Twitter, and ranking and user information from various games such as League of Legends, DOTA 2, and CSGO. For this article, we will look at Reddit’s API to show how to get started using APIs and what they are capable of.

Steps Before Beginning With Code

Before getting into any of the code, we first need to make sure we have a few things in order. Websites typically require you to register in some way before using their APIs. If APIs are used irresponsibly, you may end up sending far too many requests and overloading a company’s servers, so there are usually rules you must agree to in order to get authorization. In Reddit’s case, you need to create an account and register your app. Once you create an account, you can go to https://www.reddit.com/prefs/apps and click the button that says “are you a developer? Create an app…”. You can then choose what kind of app you want to make. For our purposes, we are going to choose “script” since we will only be pulling information rather than creating a web or mobile app. The form also requires a redirect URL, which in OAuth is the address users are sent back to after they authorize an app. Our script never goes through that browser step, so for personal use you can put any valid URL in. Once you are done you should see something like this:

The strings I have put arrows next to, the client ID and the secret, will be important in a moment. THIS INFORMATION SHOULD NOT BE SHARED! This account and app were created for demonstration purposes and will be deleted before this blog is posted.

Let’s Get Coding

With all of that set up, we can finally get to coding. The first thing we can do is initialize our Client ID and secret key. To start, we will do this by putting the information in directly, but make sure if you do it this way that your code is not uploaded anywhere, otherwise your information may be compromised.

client_id = 'FRoiBg6DRJ5qaw'
secret_id = 'qQ5XPXxLn-Tr2zGrKjmh5n0eX9Ztmw'
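If you would rather not keep the credentials in the file at all, one option is to read them from environment variables. This is only a sketch: the helper name load_credentials and the variable names REDDIT_CLIENT_ID and REDDIT_SECRET_ID are my own choices for illustration, not something Reddit requires.

```python
import os


def load_credentials(env=os.environ):
    """Read the app credentials from environment variables.

    The variable names below are an assumption made for this example.
    """
    try:
        return env['REDDIT_CLIENT_ID'], env['REDDIT_SECRET_ID']
    except KeyError as missing:
        # Fail loudly so a missing variable is noticed right away
        raise RuntimeError(f'Missing environment variable: {missing}') from None
```

With both variables set in your shell, client_id, secret_id = load_credentials() gives the same two values as above without them ever appearing in your code.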

Next, we need to request an authorization token. First, we need to import the requests library. After doing that, we need to set up a request for an OAuth token from Reddit. In addition to the Client ID and Secret ID we have already put in, we need your Reddit account log in information.

import requests

# Information from reddit.com/prefs/apps
auth = requests.auth.HTTPBasicAuth(client_id, secret_id)
# Reddit account login information
data = {
    'grant_type': 'password',
    'username': 'api_demo2',
    'password': 'example_password'
}
# A short description of our app
headers = {'User-Agent': 'Reddit_API_bot/0.0.1'}

With all of that entered we can set up a request for our OAuth token. The token only lasts for a limited time (the expires_in value in the response gives its lifetime in seconds), so if you take a break and come back you may need to request a new token. When you check your request, you should get a 200 response, which normally means you have been granted an access token. If you get another response (such as a 401, meaning access was denied), double check that all of your credentials were entered correctly.

res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)
# Make sure we get a 200 response
res
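Instead of reading the status off the output by eye, the check can be written down explicitly. This is a minimal sketch with a hypothetical helper name, extract_token. It also assumes that a rejected request may come back as a JSON body with an 'error' field, so it looks for the token itself and not only the status code.

```python
def extract_token(status_code, payload):
    """Return the access token, or raise with a readable message."""
    if status_code != 200:
        raise RuntimeError(f'Token request failed with HTTP {status_code}')
    if 'access_token' not in payload:
        # Assumption: failures may arrive as a JSON body with an 'error' field
        raise RuntimeError(f"Token request rejected: {payload.get('error', payload)}")
    return payload['access_token']
```

With the request above, this would be called as extract_token(res.status_code, res.json()).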

You can see more information about your token request by appending ‘.json()’ to the end.

res.json()
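One useful field in that output is expires_in, the token lifetime in seconds. The sketch below turns it into an expiry time so a longer script can tell when to ask for a new token. The function names and the 60 second safety margin are assumptions I picked for illustration.

```python
import time


def token_expiry(payload, issued_at=None):
    """Return the Unix timestamp at which the token stops being valid."""
    if issued_at is None:
        issued_at = time.time()
    return issued_at + payload['expires_in']


def is_expired(expires_at, now=None, margin=60):
    """True once we are within `margin` seconds of the expiry time."""
    if now is None:
        now = time.time()
    return now >= expires_at - margin
```

You could store expires_at = token_expiry(res.json()) right after the request and check is_expired(expires_at) before each later call.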

We have one more step before we can start pulling information. We need to place the access token inside of our headers dictionary.

token = res.json()['access_token']
headers['Authorization'] = f'bearer {token}'
# Making sure OAuth added to account
headers

Now we are ready to pull some data. You can start by pulling information about yourself. You can see all of the information about your account with the following:

# See information from your own account
requests.get('https://oauth.reddit.com/api/v1/me', headers = headers).json()

For reference, you can use Reddit’s API Documentation to see more places where you can pull information. On the left hand side of the guide, you can see a bunch of links starting with ‘/api/v1/me’. Each of those is a path you can place at the end of the url in your request. So for example, if you wanted a list of friends on your account, you would enter ‘https://oauth.reddit.com/api/v1/me/friends’ for the url in your requests.get call.
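Since every request uses the same base address, a small helper can build the full url from a documentation path. The name endpoint is hypothetical and this is just a convenience sketch.

```python
BASE_URL = 'https://oauth.reddit.com'


def endpoint(path):
    """Join a documentation path such as '/api/v1/me/friends' onto the base URL."""
    # lstrip lets the path be written with or without a leading slash
    return f"{BASE_URL}/{path.lstrip('/')}"
```

For example, requests.get(endpoint('/api/v1/me/friends'), headers=headers) would make the friends request described above.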

Now that we have access and have played around with our account information a little bit, we should get into the bread and butter of API use for data science: getting information. There are plenty of functions in the requests library in Python, but for our purposes we are pretty much going to stick to the get function. We will start at the top and pull information about the top posts of all time on Reddit.

# 't' selects the time range; 'all' asks for the all-time top posts
all_res = requests.get("https://oauth.reddit.com/r/all/top",
                       headers=headers,
                       params={'t': 'all'})
print(all_res.json())

Here is the result:

And to be clear, this is the tip of the iceberg. There is plenty of room to scroll here. To put it bluntly, this is not helpful in its current form. We will need to clean this up a bit. We know this is simply a dictionary, so we can dig a bit. We can start by checking out the keys associated.

all_res.json().keys()

To skip a few steps, we can investigate the values of these keys until we get to what we want. After investigating a bit, we can see a list of dictionaries that have information that may be relevant to what we’re looking for.

# Pull the first entry on list to cut down on length of output
# making it easier to parse the information
all_res.json()['data']['children'][0]

There is still a lot here, but we can investigate a bit by trying to get keys for dictionaries in this list. We can pull the first entry in the list to get a list of keys.

all_res.json()['data']['children'][0]['data'].keys()
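The digging above can also be automated with a short recursive function that outlines the keys of a nested response. This is a sketch with a hypothetical name, describe, and the sample dictionary is made-up data shaped like the response we just explored, not real output.

```python
def describe(obj, depth=2, indent=0):
    """Return lines outlining the keys of nested dicts and the sizes of lists."""
    lines = []
    pad = ' ' * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f'{pad}{key}: {type(value).__name__}')
            if depth > 1:
                lines.extend(describe(value, depth - 1, indent + 2))
    elif isinstance(obj, list) and obj:
        # Only the first element is outlined to keep the output short
        lines.append(f'{pad}[{len(obj)} items, showing the first]')
        lines.extend(describe(obj[0], depth, indent + 2))
    return lines


# Made-up sample with the same nesting as the listing response
sample = {'kind': 'Listing',
          'data': {'after': 't3_abc',
                   'children': [{'kind': 't3', 'data': {'title': 'x'}}]}}
print('\n'.join(describe(sample, depth=3)))
```

On a real response you would call describe(all_res.json(), depth=3) and raise the depth as needed.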

There are quite a few keys available, and we still need to take all this data and turn it into something more human readable. We can take care of both by creating a pandas table using the keys we are interested in.

import pandas as pd

# Collect one dictionary per post, then build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for post in all_res.json()['data']['children']:
    rows.append({
        'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'score': post['data']['score'],
        'is_self': post['data']['is_self'],
        'over_18': post['data']['over_18'],
        'author': post['data']['author'],
        'num_comments': post['data']['num_comments'],
        'url': post['data']['url']
    })
df = pd.DataFrame(rows)
df
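Because we will want the same columns for other subreddits, the extraction step can be pulled into a reusable function. This is a sketch: listing_to_rows and FIELDS are names I made up, and using .get() is my own choice so that a post lacking one of the keys yields None instead of raising a KeyError.

```python
FIELDS = ['subreddit', 'title', 'upvote_ratio', 'score', 'is_self',
          'over_18', 'author', 'num_comments', 'url']


def listing_to_rows(listing, fields=FIELDS):
    """Turn a listing response into a list of flat dicts, one per post."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']
        # .get() leaves None for any field a post does not carry
        rows.append({field: post.get(field) for field in fields})
    return rows
```

The result can be passed straight to pandas, for example pd.DataFrame(listing_to_rows(all_res.json())).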

By default, you get 25 entries per request, but Reddit has a list of params that can help get more. To get 100 entries for the top posts in r/games of all time, I can add the ‘limit’ param and set it to 100. The limit param maxes out at 100, so if a number higher than 100 is entered, you will only receive 100 entries.

games_res = requests.get("https://oauth.reddit.com/r/games/top?t=all",
                         headers=headers,
                         params={'limit': 100})

# Collect one dictionary per post, then build the DataFrame in a single step
games_rows = []
for post in games_res.json()['data']['children']:
    games_rows.append({
        'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'score': post['data']['score'],
        'is_self': post['data']['is_self'],
        'over_18': post['data']['over_18'],
        'author': post['data']['author'],
        'num_comments': post['data']['num_comments'],
        'url': post['data']['url']
    })
games_df = pd.DataFrame(games_rows)
games_df
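To go past 100 entries you need more than one request. Each listing response carries an 'after' value inside its 'data' dictionary, and sending that value back as the 'after' param asks for the next page. The sketch below follows that cursor until enough posts are collected. The names fetch_posts and fetch_page are hypothetical, and fetch_page is passed in as a function so the loop can be tried without touching the network.

```python
def fetch_posts(fetch_page, total, page_size=100):
    """Collect up to `total` posts by following the listing's 'after' cursor.

    `fetch_page` is any callable that takes a params dict and returns
    the parsed JSON of one listing page.
    """
    posts, after = [], None
    while len(posts) < total:
        params = {'limit': min(page_size, total - len(posts))}
        if after:
            params['after'] = after
        data = fetch_page(params)['data']
        posts.extend(child['data'] for child in data['children'])
        after = data.get('after')
        if not after or not data['children']:
            break  # no further pages
    return posts


# Possible usage with the request from above (not run here):
# fetch = lambda params: requests.get(
#     "https://oauth.reddit.com/r/games/top",
#     headers=headers, params={'t': 'all', **params}).json()
# posts = fetch_posts(fetch, total=250)
```

If you pull many pages this way, consider pausing between requests (for example with time.sleep) so you do not flood Reddit’s servers.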

This should get you started for the time being. You can find more information from Reddit’s API Documentation. I also highly recommend this blog (and linked video) from James Briggs which helped me out a lot. Finally, once you feel like you have learned some basics of APIs and want to get deeper into Reddit information gathering, check out PRAW which is a Python wrapper for Reddit.


George Ferre

My name is George Ferre. I am currently working to become a data scientist. I hope to share insight into the process as I progress through bootcamp and onward.