Contributed by Oamar Gianan. He enrolled in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which took place between September 23, 2016 and December 23, 2016. The original article can be found here.
Have you been followed on Twitter or Instagram by someone you don’t know? I get this a lot. And so to avoid being thought of as rude, I follow back. Eventually, I got tired of following back when I realized that some of these accounts don’t really do anything but collect followers. Now, why would anyone go through all the trouble of following people in the hopes of being followed back? Why would anyone waste so much time on the internet for this?
I eventually realized the answer when I saw that most of these accounts were not personal. A lot of the accounts I encountered were about food, some were about beach vacations, and on occasion there were accounts with risqué content.
Advertising has infiltrated the social network. It used to be just ads on banners, but now companies hire personalities on social media to spread the word about their product or event. Companies spend big bucks on celebrities in an effort to publicize their brand and attract a celebrity’s fan base. A sponsored tweet could net as much as $13,000, as was the case for Khloé Kardashian in 2013.
Celebrities have multitudes of followers and get paid big bucks by sponsors. So people may have thought that creating accounts and amassing followers would eventually get them sponsorship deals with advertisers. In this exercise, we see that sponsors might be looking for things other than the number of followers.
In a social network, a link could represent a relationship, as in Facebook, or the passing of a tweet, as in Twitter. These links determine the flow of information and are therefore a good indicator of a user’s influence. I will be presenting two methods of finding potential influencers in a network. One extracts a user’s influence measures; the other uses network graphs.
A large database was found on Followthehashtag.com. The database contained a stream of tweets related to NASDAQ 100 stocks extracted from Twitter over 79 days, from March 28, 2016 to June 15, 2016. It was selected because it has a good mix of accounts representing organizations and personalities. The database also contained information about how many times a tweet was passed along and who the original tweet came from. This act, more popularly known as retweeting, can be identified in the stream as tweets having ‘RT @user’ or ‘via @user’ at the beginning of the tweet. The stream also contained information about mentions. On Twitter, a mention is a public conversation between users. A user calls the attention of another user by mentioning them in a tweet. Mentions are identified as tweets beginning with ‘@user’.
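The identification rules above can be sketched with two regular expressions. This is a minimal illustration, not the code used for the article: the function name `classify_tweet` and the exact patterns are assumptions, and real tweet text may need looser matching.

```python
import re

# Patterns follow the conventions described above: a retweet starts with
# "RT @user" or "via @user"; a mention starts with "@user".
# "@\s*" tolerates a stray space after the @ sign (an assumption).
RETWEET_RE = re.compile(r"^\s*(?:RT|via)\s+@\s*(\w+)", re.IGNORECASE)
MENTION_RE = re.compile(r"^\s*@(\w+)")


def classify_tweet(text):
    """Return (kind, target_user); kind is 'retweet', 'mention' or 'original'."""
    match = RETWEET_RE.match(text)
    if match:
        return "retweet", match.group(1)
    match = MENTION_RE.match(text)
    if match:
        return "mention", match.group(1)
    return "original", None


# Made-up tweets, for illustration only.
print(classify_tweet("RT @WSJ: Apple shares slide"))    # ('retweet', 'WSJ')
print(classify_tweet("@jimcramer thoughts on $AAPL?"))  # ('mention', 'jimcramer')
print(classify_tweet("$AAPL is down today"))            # ('original', None)
```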
The influence measures extracted from the stream were the following: indegree, retweet, and mentions. These measures were selected because of how they affect the flow of information in the network. Indegree measures the user’s popularity. This was easily extracted from the database by the number of followers a user has. The number of followers shows us the size of the user’s audience base. Retweet influence represents a user’s ability to create content which other users find worthy of sharing. When a tweet is shared by another user, a bigger network of users is exposed to the tweet. From the stream, this was extracted by counting the number of retweeted messages for each user. The third measure, mention influence, was extracted by counting the number of mentions containing the user’s name. This influence measure indicates the ability of the user to engage others in a conversation. This represents the top-of-mind value of the user’s name.
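As a rough sketch of how the three measures could be tallied, the example below counts followers, retweets received, and mentions received per user. The record layout `(author, followers, kind, target)` and the sample data are hypothetical; the actual database columns are not shown in the article.

```python
from collections import Counter

# Hypothetical records: (author, author_followers, kind, target_user).
# For a retweet, target_user is the original author being retweeted;
# for a mention, target_user is the user being mentioned.
tweets = [
    ("alice", 120, "original", None),
    ("bob", 45, "retweet", "alice"),
    ("carol", 900, "retweet", "alice"),
    ("bob", 45, "mention", "carol"),
    ("alice", 120, "mention", "carol"),
    ("carol", 900, "retweet", "bob"),
]


def influence_measures(records):
    """Return {user: (indegree, retweets_received, mentions_received)}."""
    followers = {}
    retweets = Counter()
    mentions = Counter()
    for author, n_followers, kind, target in records:
        followers[author] = n_followers
        if kind == "retweet":
            retweets[target] += 1
        elif kind == "mention":
            mentions[target] += 1
    # Only users who tweeted in the stream have a known follower count.
    return {u: (followers[u], retweets[u], mentions[u]) for u in followers}


measures = influence_measures(tweets)
print(measures["alice"])  # (120, 2, 0)
print(measures["carol"])  # (900, 0, 2)
```

In this toy stream the most followed user (carol) is never retweeted, which is the kind of gap between popularity and engagement that the later sections examine.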
A total of 96,613 users tweeted about NASDAQ 100 stocks during the timeframe. Between them, over 680 thousand tweets were broadcast. A word cloud of the NASDAQ symbols most often mentioned shows that Apple, represented by AAPL, was the most tweeted stock among the group.
Figure 1. Stock symbol word cloud.
Users were most active on April 27, when they broadcast over 20,800 tweets. This coincides with the day AAPL stock slumped following speculation that iPhone sales may decline by as much as 60 million units compared to the same quarter a year ago. The slump in Apple shares dragged the tech-heavy NASDAQ into the red by the day’s end.
Figure 2. Frequency plot of tweets.
Activity on this day was concentrated in market trading hours, which run from 13:30 to 20:30 UTC.
Figure 3. Frequency plot of 27-April-2016.
Each user’s rank in the three influence categories was assigned using fractional ranking. For example, in assigning the indegree ranking, a rank of 1 was given to the user with the most followers. Users with the same number of followers receive the same rank, which is the mean of the ranks they would have received under ordinal ranking. Table 1 shows the top 30 users across the three influence measures. Notice that there is minimal overlap among the three rankings. The first user to show up across all three measures of influence was «WSJ».
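Fractional ranking as described above can be sketched in a few lines. The function name and the follower counts are made up for illustration; tied users share the mean of the ordinal positions they occupy.

```python
def fractional_rank(scores):
    """Rank users so the highest score gets rank 1; ties share the mean rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every user tied with the user at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        # Ordinal positions are 1-based, so the tied block spans i+1 .. j+1.
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = mean_rank
        i = j + 1
    return ranks


followers = {"a": 500, "b": 300, "c": 300, "d": 10}
print(fractional_rank(followers))
# {'a': 1.0, 'b': 2.5, 'c': 2.5, 'd': 4.0}
```

Users b and c would hold ordinal ranks 2 and 3, so both receive 2.5, and d keeps rank 4.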
Table 1. Top influentials based on indegree, retweets, and mentions
To see how much users overlap across the three categories, a Venn diagram of the top 100 users in each measure was derived. Figure 4 shows that among the 239 distinct users in these top-100 lists, only 10 appear across all three measures of influence.
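The counts behind such a Venn diagram reduce to set operations on the top-N lists. The sketch below uses hypothetical rank tables and N = 2 to keep the example small; the helper names are assumptions.

```python
def top_n(ranks, n):
    """Return the set of the n users with the best (smallest) rank."""
    return set(sorted(ranks, key=ranks.get)[:n])


def overlap_summary(indegree, retweet, mention, n):
    """Count distinct users in the three top-n lists and those in all three."""
    a, b, c = top_n(indegree, n), top_n(retweet, n), top_n(mention, n)
    return {
        "distinct_users": len(a | b | c),
        "in_all_three": len(a & b & c),
    }


# Hypothetical rank tables (user -> rank), for illustration only.
indegree = {"u1": 1, "u2": 2, "u3": 3, "u4": 4, "u5": 5}
retweet = {"u3": 1, "u1": 2, "u5": 3, "u2": 4, "u4": 5}
mention = {"u5": 1, "u1": 2, "u4": 3, "u3": 4, "u2": 5}

print(overlap_summary(indegree, retweet, mention, n=2))
# {'distinct_users': 4, 'in_all_three': 1}
```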
Figure 4. Venn diagram of top influentials across measures.
Figure 5 below shows a correlation matrix which represents how a user’s rank varies across the three different measures of influence. The correlation matrix represents the strength of the association between a pair of rankings. This matrix was derived by comparing the relative influence ranks of all 96,613 users in the database.
Figure 5. Correlation plot across all influence measures.
Users show a strong correlation between their retweet influence and mention influence. The low correlation of the indegree measure with the other two measures shows that indegree ranking may not be related to the other rankings.
A couple of conclusions can be derived from the correlation plot. First, we can say that in most cases, users who are retweeted often are also mentioned often, and vice versa. Second, the most followed user may not be the most engaging user in the group. A user’s popularity, therefore, is a weak representation of the ability to motivate the spread of information.
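The article does not say which correlation coefficient was used. One common choice for comparing rankings is Spearman's rank correlation, which is the Pearson correlation computed on the ranks themselves. The sketch below assumes that choice and uses made-up ranks for four users.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_correlation_matrix(rankings):
    """rankings maps a measure name to a list of ranks, aligned by user."""
    names = list(rankings)
    return {(p, q): pearson(rankings[p], rankings[q]) for p in names for q in names}


# Hypothetical ranks for four users, in the same user order in every list.
rankings = {
    "indegree": [1, 2, 3, 4],
    "retweet": [4, 3, 1, 2],
    "mention": [4, 3, 2, 1],
}
matrix = rank_correlation_matrix(rankings)
print(round(matrix[("retweet", "mention")], 3))   # 0.8
print(round(matrix[("indegree", "mention")], 3))  # -1.0
```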
Retweets and mentions have direction. A retweet is the path of an idea from User A to User B. User A broadcast a tweet which was read by User B. User B thought it was worth sharing and retweeted it. This retweet will eventually be seen by users not directly accessible to User A. When User A mentions User B, this is again a link from User A to User B. With this in mind, we have enough data to convert our Twitter stream into a directed network graph. Each user will be a node in our graph and each directed link will be an edge. The igraph library will be used to extract information from the resulting network graph.
A quick look at the resulting network graph for the whole stream shows that we were able to create a graph with 96,613 nodes and 168,519 edges. Because of this size, the graph will not be plotted: doing so would take considerable time and computational effort, and the result would most likely be a crowded mess of dots and lines anyway. However, we can still extract some information from the graph object.
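The conversion from interactions to a directed graph can be sketched as building a weighted edge list. This is an illustration under assumptions: the interaction tuples are made up, and collapsing repeated interactions into a single edge whose weight is the repeat count is one plausible reading of the `weight` attribute, not a documented step.

```python
from collections import Counter

# Hypothetical interactions as (source, target, type).
# A retweet of alice by bob is recorded as alice -> bob (the idea flows
# from alice to bob); a mention of carol by alice is alice -> carol.
interactions = [
    ("alice", "bob", "retweet"),
    ("alice", "bob", "retweet"),
    ("alice", "carol", "mention"),
    ("bob", "carol", "mention"),
]


def build_edges(records):
    """Return (sorted node list, {(source, target, type): weight})."""
    weights = Counter(records)
    nodes = sorted({user for s, t, _ in records for user in (s, t)})
    return nodes, dict(weights)


nodes, edges = build_edges(interactions)
print(nodes)  # ['alice', 'bob', 'carol']
print(edges[("alice", "bob", "retweet")])  # 2
```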
## IGRAPH DNW- 96613 168519 --
## + attr: name (v/c), Followers (v/n), type (e/c), weight (e/n)
## + edges (vertex names):
##  ______NGS______->BluegrassCap ______NGS______->BrattleStCap
##  ______NGS______->FatTailCapital ______NGS______->FatTailCapital
##  ______NGS______->Find_Me_Value ______NGS______->inner_scorecard
##  ______NGS______->KerrisdaleCap ______NGS______->Liberty8988
##  ______NGS______->LongShortTrader ______NGS______->MaglanCapital
##  ______NGS______->MarAzul_90 ______NGS______->max02050
##  ______NGS______->maxvision33 ______NGS______->MugatuCapital
##  ______NGS______->NickatFP ______NGS______->SkeleCap
## + ... omitted several edges
The density of a network is the proportion of edges present out of all possible edges in the network. Our present graph has a density of 2.799118e-05. Such a low density means that there is very little interaction between our users.
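For a directed graph without self-loops, the number of possible edges among n nodes is n(n - 1), so density is the edge count divided by that product. The sketch below uses this textbook definition on a toy graph; how the article's figure treats self-loops and repeated edges is not stated, so the function is not claimed to reproduce it.

```python
def directed_density(n_nodes, n_edges):
    """Share of possible directed edges (self-loops excluded) that are present."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))


# Toy graph: 4 users and 3 directed links out of 4 * 3 = 12 possible.
print(directed_density(4, 3))  # 0.25
```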
The diameter of a network graph is the length of the longest of the shortest paths (geodesics) between any pair of nodes. Considering the direction of the links, the diameter of our network is 14. This means the longest such path is an unbroken chain of 14 links across 15 users.
+ 15/96613 vertices, named:
 TRADERPIRATE ARENABURSATIL fffavela AlisaStrategy AntonioNaVi
 IBD_ECarson BarbarianCap OptionsHawk HumbleBioTrader DanaMattioli
 GillianTan devinbanerjee ej_fournier JCMcCracken athomson6
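To make the relationship between hops and users concrete, the sketch below computes the diameter of a small directed graph with breadth-first search, ignoring edge weights. The toy graph and function names are assumptions for illustration, and running one search per node is only practical for small graphs.

```python
from collections import deque


def bfs_farthest(adj, start):
    """Return (hops, path) to the node farthest from start along shortest paths."""
    parent = {start: None}
    dist = {start: 0}
    queue = deque([start])
    last = start
    while queue:
        node = queue.popleft()
        last = node  # BFS dequeues nodes in non-decreasing distance order
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = parent[node]
    return dist[last], path[::-1]


def diameter(adj):
    """Longest shortest path over all start nodes, following edge direction."""
    nodes = sorted(set(adj) | {t for targets in adj.values() for t in targets})
    return max((bfs_farthest(adj, n) for n in nodes), key=lambda r: r[0])


# Toy directed graph: a->b, a->c, b->c, c->d, d->e.
adj = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": ["e"]}
hops, path = diameter(adj)
print(hops, path)  # 3 ['a', 'c', 'd', 'e']
```

A path of 3 hops visits 4 nodes, just as the 14-hop diameter above spans 15 users.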
The hubs and authorities algorithm was developed by Jon Kleinberg to examine the relevance of a web page’s content. He categorized pages into hubs and authority pages. Hubs, which have more outgoing links, are the internet’s catalog. This is similar to the early days of Yahoo, when it touted itself as the internet’s yellow pages. Authority pages have more incoming links, presumably because of their high-quality content. Translated to Twitter activity, a hub would fit the description of a user with high retweet influence, and an authority would be similar to a Twitter user who has high mention influence.
The hub and authority scores of the network graph were derived using a simple igraph function call. The top hub score went to «markbspiegel», while the top authority score went to «Benzinga». This contrasts with the ranking tables, where the top retweet and mention ranks belong to «philstockworld» and «jimcramer», respectively.
To find out where the discrepancy came from, each node was investigated. Although «markbspiegel» had more unique edges than «philstockworld», summing the weights of those edges still puts «philstockworld» ahead of «markbspiegel». The same is observed when comparing the edges of «Benzinga» and «jimcramer». The discrepancy is consistent with how web pages are rated, where the number of links matters more than the number of times each link was activated. The hub and authority scores also do not take into account the weight attribute of the edges.
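The idea can be sketched as a simple power iteration: a node's authority score is the sum of the hub scores of nodes linking to it, and its hub score is the sum of the authority scores of the nodes it links to. The version below is an unweighted illustration on a toy edge list, not the igraph implementation; the iteration count and the Euclidean normalization are arbitrary choices.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) edges."""
    nodes = sorted({user for edge in edges for user in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub update: sum the new authority scores over outgoing links.
        hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        # Normalize so the scores do not grow without bound.
        auth_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / auth_norm for n, v in auth.items()}
        hub = {n: v / hub_norm for n, v in hub.items()}
    return hub, auth


# Toy graph: "a" links out to three users, "x" receives two links.
edges = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "x")]
hub, auth = hits(edges)
print(max(hub, key=hub.get))    # a
print(max(auth, key=auth.get))  # x
```

Because each unique edge counts once here, a user with many distinct links can outscore a user with fewer links that were each used many times.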
To see an actual network graph, we narrow down our selection to a Twitter stream of users tweeting about CA Technologies.
Table 2 shows us the resulting top influentials derived from our ranking method. The first user to cross the three influence categories is «Benzinga».
Table 2. Top influentials of the CA stream.
The resulting network graph of this smaller Twitter stream has 431 nodes and 181 edges.
## IGRAPH DNW- 431 181 --
## + attr: name (v/c), Followers (v/n), weight (e/n)
## + edges (vertex names):
##  _bagholder_ ->ppprophet 20trilliondeb ->ppprophet
##  7LadyQ ->eWhispers 7LadyQ ->OpenOutcrier
##  7LadyQ ->WrigleyTom AdaptToReality ->AdaptToReality
##  adelivania ->Benzinga AdvisorboxMedia->MorningstarInc
##  Alain_2012a ->Boursier_com alekskrug8 ->SleekMoneycom
##  AlertTrade ->AlertTrade allgringo ->ChinaInvest
##  AlphSt_Live ->Opinterest AltruistWealth ->eWhispers
##  aTGelstmM ->PersonsPlanet ATPFtrading ->gouluk1
## + ... omitted several edges
There is comparatively more interaction between users than in our initial network object, with the density clocking in at 0.0009550531. The diameter is shorter, with just 9 hops across 10 nodes.
+ 10/431 vertices, named:
 TachyonGlobalLL StakepoolCom LMTentarelli tamaraspen2 ppprophet
 diggingplatinum WrigleyTom nixonstocks ACInvestorBlog ProbabilityOne
The resulting hub and authority scores are more consistent with the ranking tables because the actual number of retweets and mentions was low. This time, the number of unique edges was not significantly lower than the total weight of the edges.
Figures 7 and 8 show the network graphs with the nodes adjusted based on the hub and authority score. The higher the score, the bigger the node size.
Figure 6. CA stream network graph showing the diameter path.
Figure 7. Closeup of network graph with node sizes adjusted based on hub score.
Figure 8. Closeup of network graph with node sizes adjusted based on authority score.
The fractional ranking method is found to be a more realistic measure of a Twitter user’s influence. The frequency of interactions between users must be considered in measuring influence, even when those interactions occur within the same regular audience. Repeated interactions simply mean that the user is consistent in producing high-quality content that has pass-along value.
For smaller networks, the network graph method may yield additional information that can’t be derived from fractional ranking. The key would be to check whether the ratio of the number of edges to the total edge weight is close to 1. The discrepancy between the ranking method and the network graph is expected to be greater when this ratio approaches zero.
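The check described above can be sketched as a one-line ratio. The edge dictionaries are hypothetical, and the sketch assumes each weight is the number of times that link was activated.

```python
def edge_weight_ratio(weighted_edges):
    """Number of unique edges divided by the total weight of those edges."""
    total_weight = sum(weighted_edges.values())
    return len(weighted_edges) / total_weight if total_weight else 0.0


# Mostly one-off interactions: ratio near 1, so the two methods should agree.
sparse = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 2}
print(edge_weight_ratio(sparse))  # 0.75

# A few links used many times: ratio near 0, so larger discrepancies are expected.
repeated = {("a", "b"): 9, ("a", "c"): 1}
print(edge_weight_ratio(repeated))  # 0.2
```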
Source: Data Science Central