Extract Facebook and Twitter data from any page

With Python - using official APIs

Posted on December 26, 2017

You may need data from social media such as Facebook and Twitter for a variety of reasons. I, for one, use it for statistical analysis: I collect the reactions to posts on a given page and put them into a spreadsheet.

To extract publicly available data with a Python script, you need to register as a developer and then get your app's access tokens.

Getting your access tokens

The provided APIs are no longer open to anonymous use; every request must be authenticated with an access token.

Facebook
  1. Create a Facebook Developer Account.
  2. Open the “My apps” drop-down in the top right corner and select “add a new app”. Choose a display name and a category, then click “Create App ID”.
  3. Go to your app dashboard from the side menu. There, you'll find your App ID and App Secret.
  4. To avoid security risks, always create a new app for the sole purpose of scraping, and never share your App ID or App Secret.
Twitter
  1. Create a new Twitter App with your login credentials.
  2. Fill out the required form information and accept the Developer Agreement at the bottom of the page, then click the button labeled "Create your Twitter application".
  3. After successfully creating your application, you will be redirected to your application's settings page. Before you create your application keys, you will need to first modify the access level permissions in order to allow your application to post on your behalf.
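  4. Click on the link labeled modify app permissions. You will then be able to choose which permissions to allow. Select Read and Write if you want the app to post on your behalf; read-only access is sufficient for the timeline requests made in this post.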
  5. After updating your application’s permissions to allow posting, click the tab labeled Keys and Access Tokens. This will take you to a page that lists your Consumer Key and Consumer Secret, and also will allow you to generate your Access Token and Access Token Secret.

The Python Script

Importing Python dependencies
          # This script targets Python 2 (urllib2 and the print statement are Python 2 only).
          import urllib2
          import json
          import datetime
          import csv
          import time
          import tweepy
          from tweepy import OAuthHandler
Access Token

Accessing Facebook page data requires an access token.

A user access token expires within an hour, so instead we build an app access token from the App ID and App Secret of the dummy app created above solely for scraping; neither of those expires.

          app_id = "your_facebook_app_id"
          app_secret = "your_facebook_app_secret" # DO NOT SHARE WITH ANYONE!

          access_token_fb = app_id + "|" + app_secret # NEVER EXPIRES

          consumer_key = 'your_twitter_consumer_key'
          consumer_secret = 'your_twitter_consumer_secret'
          access_token_tw = 'your_twitter_access_token'
          access_secret = 'your_twitter_access_secret'

          auth = OAuthHandler(consumer_key, consumer_secret)
          auth.set_access_token(access_token_tw, access_secret)

          api = tweepy.API(auth)
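
Hard-coding secrets in the script makes them easy to leak when the file is shared. As a minimal sketch, assuming the credentials are kept in environment variables instead, they could be loaded as shown here. The variable names and the load_credentials and facebook_app_token helpers are hypothetical and not part of the original script.

```python
import os

# Hypothetical environment variable names; pick whatever names suit your setup.
REQUIRED_VARS = (
    "FB_APP_ID",
    "FB_APP_SECRET",
    "TW_CONSUMER_KEY",
    "TW_CONSUMER_SECRET",
    "TW_ACCESS_TOKEN",
    "TW_ACCESS_SECRET",
)


def load_credentials(environ=None):
    """Return a dict of credentials, failing early if any variable is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


def facebook_app_token(creds):
    """Build the app access token in the same "id|secret" form used above."""
    return creds["FB_APP_ID"] + "|" + creds["FB_APP_SECRET"]
```

Failing early with a list of the missing names tends to be easier to debug than an authentication error from the API later on.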
Define Page ID

Now we can access public Facebook and Twitter data, subject to each API's rate limits. Let's do our analysis on the Manchester United Facebook and Twitter pages, which are popular enough to yield good data.

          fb_page = "manchesterunited"
          twitter_page = "@manutd"
Construct URL string (Facebook only)

Set num_statuses to the number of statuses you want to extract from the page.

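          num_statuses = 10  # number of statuses to request; change as needed
          base = "https://graph.facebook.com/v2.11"
          node = "/" + fb_page + "/posts"  # the posts edge returns the list of statuses under "data"
          parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token_fb)
          url = base + node + parameters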
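
Building the query string by hand works, but special characters such as the "|" in the app access token are then sent unescaped. As a sketch only, assuming the page's posts edge is the one being queried, the same URL could be assembled with the standard library's urlencode. The build_posts_url helper and the GRAPH_BASE and FIELDS names are hypothetical.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

GRAPH_BASE = "https://graph.facebook.com/v2.11"  # same API version as above

# The same fields requested in the hand-built URL above.
FIELDS = ",".join([
    "message", "link", "created_time", "type", "name", "id",
    "likes.limit(1).summary(true)",
    "comments.limit(1).summary(true)",
    "shares",
])


def build_posts_url(page, access_token, limit):
    """Return the Graph API URL for the latest `limit` posts of `page`."""
    # A list of pairs keeps the parameter order stable across Python versions.
    query = urlencode([
        ("fields", FIELDS),
        ("limit", limit),
        ("access_token", access_token),
    ])
    return "%s/%s/posts?%s" % (GRAPH_BASE, page, query)
```

urlencode percent-encodes the commas, parentheses and "|" in the values, so the result looks different from the hand-built string but carries the same parameters.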
Retry Function

When scraping large amounts of data from public APIs, there's a high probability that you'll hit an HTTP Error 500 (Internal Error) at some point. There is no way to avoid that on our end.

Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.

          def request_until_succeed(url):
              # Keep requesting the URL until the server answers with HTTP 200.
              req = urllib2.Request(url)
              success = False
              while success is False:
                  try:
                      response = urllib2.urlopen(req)
                      if response.getcode() == 200:
                          success = True
                  except Exception, e:
                      # Log the failure, then wait a few seconds before retrying.
                      print e
                      print "Error for URL %s: %s" % (url, datetime.datetime.now())
                      time.sleep(5)

              return response.read()
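
The helper above retries forever, which can hang the script if the error is permanent (an invalid token, for example). A possible variation, sketched here under the assumption that a capped number of attempts with growing delays is acceptable, takes the download step as a function. The retry name and its parameters are hypothetical.

```python
import time


def retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay seconds after the first failure and doubles the
    wait after each further failure.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as error:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * (2 ** attempt)
            print("Attempt %d failed (%s); retrying in %.1f s" % (attempt + 1, error, delay))
            sleep(delay)
```

With the Python 2 imports used in this post, the download could then be wrapped as `retry(lambda: urllib2.urlopen(url).read())`. Passing sleep as a parameter makes the function easy to try out without actually waiting.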
Extracting Facebook Status
          test_status = json.loads(request_until_succeed(url))["data"][0]
          print (json.dumps(test_status, indent=4, sort_keys=True))
Processing Facebook Status

The status is now a Python dictionary, so top-level items can be read directly by key.

Some items may not always exist, so we must check for them first.

          def processFacebookPageFeedStatus(status):

                status_id = status['id']
                status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')
                link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')
                status_type = status['type']
                status_link = '' if 'link' not in status.keys() else status['link']


                # Time needs special care since a) it's in UTC and
                # b) it's not easy to use in statistical programs.

                status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
                status_published = status_published + datetime.timedelta(hours=-5) # EST (fixed UTC-5 offset; ignores daylight saving time)
                status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs

                # Nested items require chaining dictionary keys.

                num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']
                num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']
                num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']

                # return a tuple of all processed data
                return (status_id, status_message, link_name, status_type, status_link,
                       status_published, num_likes, num_comments, num_shares)

          processed_test_status = processFacebookPageFeedStatus(test_status)
          print processed_test_status
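
Since the goal is a spreadsheet, the processed tuples still need to be written out. The following is a minimal sketch, written for Python 3 and assuming each row holds text and numbers in the same order as the tuple returned above. The write_statuses helper, the HEADER list and the sample row are made up for illustration.

```python
import csv
import io

# Column names in the same order as the tuple returned by the processing function.
HEADER = ["status_id", "status_message", "link_name", "status_type",
          "status_link", "status_published", "num_likes", "num_comments",
          "num_shares"]


def write_statuses(rows, out):
    """Write a header line followed by one CSV line per processed status."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Example with a made-up row; real rows would come from the function above.
sample = [("1_2", "Hello, world", "", "status", "", "2017-12-26 09:00:00", 10, 2, 1)]
buffer = io.StringIO()
write_statuses(sample, buffer)
print(buffer.getvalue())
```

To write a real file in Python 3, pass a file opened with `open(path, "w", newline="")` instead of the in-memory buffer. The csv module quotes fields that contain commas, so messages survive intact.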
Extracting Tweets
          for x in tweepy.Cursor(api.user_timeline, screen_name=twitter_page).items(1):
              tweet = x.text

              print (tweet)
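
The loop above only prints the tweet text. To get rows comparable to the Facebook ones, each tweet could be flattened into a tuple. This sketch assumes the tweet object exposes id_str, text, created_at, retweet_count and favorite_count as attributes, as tweepy does for the corresponding fields of the Twitter API response. The process_tweet helper, the FakeTweet stand-in and the example values are hypothetical.

```python
from collections import namedtuple
from datetime import datetime

# Stand-in for a tweepy Status object, used only so the sketch runs offline.
FakeTweet = namedtuple(
    "FakeTweet", "id_str text created_at retweet_count favorite_count")


def process_tweet(tweet):
    """Flatten one tweet into a tuple, mirroring the Facebook function above."""
    # created_at is assumed to be a datetime; format it for spreadsheet programs.
    published = tweet.created_at.strftime('%Y-%m-%d %H:%M:%S')
    return (tweet.id_str, tweet.text, published,
            tweet.retweet_count, tweet.favorite_count)


# Made-up example values, not real data.
example = FakeTweet("123", "Full time at Old Trafford",
                    datetime(2017, 12, 26, 17, 0, 0), 250, 900)
print(process_tweet(example))
```

In the real script, process_tweet would be called on each item yielded by the tweepy.Cursor loop instead of on the stand-in.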

Application

Analyzing post data can help quantify the growth and success of your own page or those of your competitors. Or, as you'll see in the next blog post, it can be used to build a WhatsApp bot.

The data is easy to get and is very useful.