
Twitter Location Analysis

With more than 500 million tweets sent each day, is it possible to track social movement patterns?



Skanda Vivek

2 years ago | 4 min read

Twitter is an impressive data source, and tweets have so much information about societal sentiments. In fact, some work has shown that aggregated tweet data can be used as a metric of societal happiness! With a short character limit, and wide prevalence, Twitter sounds like the perfect combination for a true reflection of society’s thoughts.

Apart from the text aspect, tweets can also contain important location information. Certain users agree to share their tweet locations with Twitter. This adds potentially another layer of information.

In the near future, adding multiple real-time information layers such as traffic flow, incidents, signals from smart city infrastructure, … could be extremely beneficial in keeping track of an increasingly complex society.

My initial goal in writing this article was to extract location information from tweets, and analyze how representative they are of diverse populations. After all, if tweets are an indicator of society or customers in business cases, it’s important that different populations from various socioeconomic backgrounds are appropriately sampled. I would have done this by comparing tweet distribution with census data.

I have only reached part of the way in achieving this simple goal. Turns out, there are some issues, due to a very small proportion of users agreeing to share their detailed locations, and Twitter assigning location coordinates based on tweet text.

Tweepy to Extract Tweet Locations

Tweepy is a Python wrapper for the Twitter API. First, let's load the packages in Python:

import sys
import os
import re
import tweepy
from tweepy import OAuthHandler
from textblob import TextBlob
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline

Next, we will use Tweepy to authenticate user credentials.

# Authenticate with app-only credentials
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
if not api:
    print("Can't Authenticate")
    sys.exit(-1)

Now, let's filter tweets within the location we want. Twitter offers a couple of ways of doing this: a bounding box with sides of at most 25 miles, or a point location with a radius of up to 25 miles, defining the area within which to search for tweets.

tweet_lst = []
geoc = "38.9072,-77.0369,1mi"
for tweet in tweepy.Cursor(api.search, geocode=geoc).items(1000):
    tweetDate = tweet.created_at.date()
    if tweet.coordinates is not None:
        # GeoJSON coordinates are ordered [longitude, latitude]
        tweet_lst.append([tweetDate, tweet.id,
                          tweet.coordinates['coordinates'][1],
                          tweet.coordinates['coordinates'][0],
                          tweet.user.screen_name,
                          tweet.user.name, tweet.text,
                          tweet.user._json['geo_enabled']])
tweet_df = pd.DataFrame(tweet_lst, columns=['tweet_dt', 'id', 'lat', 'long',
                                            'username', 'name', 'tweet', 'geo'])

I've used Tweepy's Cursor object, which handles Twitter's limit of at most 100 results per page through pagination. For this first trial, I restrict the search to 1000 tweets within 1 mile of the heart of Washington, DC, and keep only those with specific lat/long coordinate information.
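The geocode string is just `latitude,longitude,radius`; a small helper (my own, not part of Tweepy) makes it easy to vary the radius in later experiments:

```python
def geocode_str(lat, lon, radius_mi):
    """Build the 'lat,long,radius' string expected by the
    search API's geocode parameter, e.g. '38.9072,-77.0369,1mi'."""
    return f"{lat},{lon},{radius_mi}mi"

print(geocode_str(38.9072, -77.0369, 1))  # 38.9072,-77.0369,1mi
```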

Deeper Dive Into Locations

Out of 1000 results, 714 had distinct location information. But let's look a little deeper.

Histograms of Latitude, Longitude, and ID | Skanda Vivek

Interestingly, there are only 80 distinct latitude/longitude pairs, even though every tweet and user ID is distinct. Something strange is going on: it doesn't make sense for all those users to be in exactly the same location. Here's the data frame, filtered for the most common lat/long pair:
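Finding and filtering the most common lat/long pair is a one-liner with pandas; a sketch on toy data (values illustrative, not real tweets), with column names matching the data frame above:

```python
import pandas as pd

# toy stand-in for tweet_df
tweet_df = pd.DataFrame({
    'lat':   [38.9072, 38.9072, 38.8977],
    'long':  [-77.0369, -77.0369, -77.0366],
    'tweet': ['out @ Washington DC', 'lunch @ Washington DC', 'hi @ USA'],
})

# most frequent (lat, long) pair, then the rows at that location
top_lat, top_long = tweet_df.groupby(['lat', 'long']).size().idxmax()
at_top = tweet_df[(tweet_df['lat'] == top_lat) & (tweet_df['long'] == top_long)]
print(len(at_top))  # 2
```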

Most common tweet location from data set | Skanda Vivek

Notice how the text of many tweets contains "@ Washington DC". Twitter is defaulting to a central DC location for any tweet that has "@ Washington DC" in it.
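A quick way to flag tweets carrying such a place tag is a regex over the text column (a sketch; the pattern and sample strings are illustrative):

```python
import re

# match a place tag like "@ Washington DC" or "@ Washington, D.C."
tag_pat = re.compile(r'@ Washington,? D\.?C\.?')

texts = ['Sunset views @ Washington DC', 'just had coffee downtown']
tagged = [bool(tag_pat.search(t)) for t in texts]
print(tagged)  # [True, False]
```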

2nd most common tweet location from data set | Skanda Vivek

Interestingly, when I filter on the 2nd most common location, I no longer see this, but I do see texts containing @ USA! According to Twitter, USA is a location slightly south of the White House, in a park!

The location of @ USA | Skanda Vivek

Looking more closely at the Twitter documentation clarifies what is happening:

Twitter developer docs

While I want the coordinates of sampled users in order to do aggregate mobility analysis, Twitter is giving me information from location tags in tweet text. The most common tag in the DC area is @ Washington DC, the second most common is @ USA, and so on.

Another issue: when I extend my search to a radius of 25 miles, I get 0 tweets with coordinate information out of 1000. This could be because of the way Twitter assigns coordinates to tweets that reference popular locations in the heart of DC, but not to tweets sent miles away from it.
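To quantify this fall-off, one could rerun the search at several radii and count how many results carry point coordinates. The counting step itself is simple; here it is on dict stand-ins for the tweet JSON (the live loop over `api.search` is omitted, since it needs credentials):

```python
def count_with_coords(tweets):
    """Count tweet payloads that include explicit point coordinates."""
    return sum(1 for t in tweets if t.get('coordinates') is not None)

# dict stand-ins for tweet JSON: one geotagged, one not
sample = [
    {'coordinates': {'coordinates': [-77.0369, 38.9072]}},
    {'coordinates': None},
]
print(count_with_coords(sample))  # 1
```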

Here is the Google Colab notebook detailing the code:

In summary, Twitter offers tremendous potential for analyzing societal patterns, including sentiments, mobility, and anomalies. When it comes to geospatial analysis, however, there are several limitations due to the sparseness of the data and the ambiguity of locations when they are provided: do they represent the actual location of the device sending the tweet, or the places referenced in the tweet text?

While Twitter's developers can improve the API, real privacy concerns exist around mass sharing of sensitive location data. Personally, I would love to share my location data to enable large-scale analyses that benefit society. But when the Twitter app asks, I hesitate, due to the potential for unknown harm.

We are at a stage similar to nuclear power in its early days. Granular data has the potential for enormous societal benefits. At the same time, as a society we need continued discussions on how to achieve these benefits while defending against data breaches and misuse of detailed information.
