A Precursor to Modern Social Network Analysis
Once upon a time, it took a lot of effort to collect social network data. Months were spent in the process of designing surveys, identifying populations, completing forms, collecting results, and entering data. Months had to be spent in the field gaining entrée, keeping meticulous notes about happenstance interactions, and then coding up reams of observations. Or, months had to be spent in drafty and dusty libraries poring over bureaucratic records of yore. The common theme is the “months” part before any actual analysis could begin. The idea that anything about this process of collecting, analyzing, and reporting data could be “real time” would have gotten you thrown in the funny house.
If you were a survey researcher interested in questions about relationships like friendship, sexual contacts, or communication patterns, you were really up a creek compared to your other survey colleagues. Obviously a relationship involves more than one person, so the survey researcher would start with one person, called an “ego,” and ask him to name his friends, sexual partners, or co-workers. If the researcher wanted to map out the network, she would then have to go to each of the ego’s friends, partners, and co-workers and ask each of these “alters” to name all their friends, sexual partners, and co-workers, and so on and so on. This method became known as a “snowball sample” because the number of people the researcher would have to track down figuratively snowballed out of control from the initial ego to hundreds or thousands of alters. Faced with such a problem, the researcher would stop collecting data at some cutoff, because relationships beyond a certain distance were assumed to no longer matter for processes like social influence or disease transmission. You don’t need a Ph.D. to know this is a pretty dumb assumption.
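Mechanically, a snowball sample is just a breadth-first traversal with a wave cutoff. Here is a minimal sketch using a made-up toy “survey” (the names and the `ALTERS` table are hypothetical, standing in for the answers real interviewees would give):

```python
from collections import deque  # not strictly needed; lists suffice for a sketch

# Toy "survey" responses: who each person names as a friend (hypothetical data).
ALTERS = {
    "ego":  ["ann", "bob"],
    "ann":  ["ego", "cara"],
    "bob":  ["ego", "dan", "cara"],
    "cara": ["ann", "bob"],
    "dan":  ["bob", "eve"],
    "eve":  ["dan"],
}

def snowball_sample(start, max_waves):
    """Interview `start`, then everyone named, wave by wave,
    stopping after `max_waves` waves -- the researcher's cutoff."""
    interviewed = {start}
    edges = set()
    frontier = [start]
    for _ in range(max_waves):
        next_frontier = []
        for person in frontier:
            for alter in ALTERS.get(person, []):
                # Record the tie regardless of whether alter is new.
                edges.add(tuple(sorted((person, alter))))
                if alter not in interviewed:
                    interviewed.add(alter)
                    next_frontier.append(alter)
        frontier = next_frontier
    return interviewed, edges

people, ties = snowball_sample("ego", max_waves=2)
# With a 2-wave cutoff, "eve" (3 steps out) never gets interviewed --
# exactly the "relationships past here don't matter" assumption.
```

The `max_waves` parameter is the dubious assumption made concrete: everything past the cutoff simply never enters the data.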
Other researchers tried to get around this problem by studying social systems with clear boundaries containing a few dozen members at most, like monasteries, fraternities, and tea clubs. The resulting data are actually really interesting, but put yourself in Samuel Sampson’s shoes, walking around a monastery for months in the mid-1960s keeping painstaking notes about which monk talks to which monk. Sure, these settings weren’t anything like the real world, but these data were the social Petri dishes that contributed to decades of scholarship that became the theoretical and methodological foundations of what we call network science today. It was fast to compute who was the most central actor, count the different structural building blocks, and create visualizations plotting the network structure because “big” networks had, at most, 100 nodes. My old TI-83 could’ve handled this.
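At that scale, “who is the most central actor” really is calculator-grade arithmetic. A minimal sketch of degree centrality (the simplest centrality measure) on a hypothetical five-monk sociogram; the monk names and ties are invented for illustration:

```python
# Toy sociogram (hypothetical): who talks to whom among five monks.
EDGES = [("amand", "basil"), ("amand", "cuth"), ("basil", "cuth"),
         ("cuth", "doran"), ("doran", "elric")]

def degree_centrality(edges):
    """Fraction of the other actors each actor is tied to --
    the simplest notion of 'most central'."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    n = len(degree)
    return {node: d / (n - 1) for node, d in degree.items()}

centrality = degree_centrality(EDGES)
most_central = max(centrality, key=centrality.get)
```

With 100 nodes this is a handful of arithmetic operations; the trouble starts when the same question is asked of millions of nodes.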
Then the Internet Happened and Data Got Big
Then the internet happened, but not in a “headshot to your business model” bad way like for newspapers. Lots of relational data immaculately recorded in database tables was suddenly available: webpages linking to other webpages, users chatting with other users, and then “poking.” But many of the methods could not keep up with analyzing data that jumped in size from a few classrooms to the size of a few European countries. Many reliable old algorithms had to be given the Old Yeller treatment after trying to tango with networks containing millions of nodes. Computer scientists and physicists came along with new cute puppies like PageRank that worked well with these bigger datasets.
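PageRank is a good example of why the new algorithms scaled: it is just repeated local message-passing, so each pass touches every link once rather than requiring global matrix inversions. A minimal power-iteration sketch over a hypothetical three-page mini-web (the page names and links are invented):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        # Everyone starts each round with the "random jump" share.
        new = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                # A page splits its rank evenly among pages it links to.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank over everyone.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical mini-web: "c" links to both others, "a" and "b" form a chain.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

Each iteration is linear in the number of links, which is what made this family of methods feasible on web-scale graphs where the older, denser matrix methods choked.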
Big Data - Social Network Algorithms Go Feral
If the volume of big data is the rabies that makes reliable social network algorithms go feral, the variety and velocity of “big data” punch the severity of the disease up a few more notches to somewhere between the Ebola and Rage viruses. That makes things much more complex.
The variety of data refers to the multitude of sensors and data sources collecting natural language and text, images and video, geo-spatial data, and time series. All of these contain potential relationships from person to person, from person to object, and from object to object. Moreover, relationships like “friendship” overlap with other interactions like “communication” so that one type of relationship may predict the other under some conditions. Even though there are network statistics and algorithms for these different types of relationships, extracting these relationships in the first place from the source data requires a whole other set of data mining methods. Once you let algorithms talk to other algorithms, you’re basically on the path to time-traveling robots trying to kill John Connor. But really, ensuring the validity of an analysis becomes a hard problem as more algorithms provide more opportunities for inaccuracies to take root. Why bother? Because making sense of this variety of data also provides opportunities to perform interventions such as making recommendations about people to friend or movies to watch based on similarities and other hidden connections.
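The friend-recommendation intervention mentioned above is often grounded in triadic closure: rank non-friends by how many mutual friends they share with you. A minimal sketch with hypothetical friendship lists (names and ties invented for illustration):

```python
from collections import Counter

# Hypothetical friendship lists.
FRIENDS = {
    "ann":  {"bob", "cara"},
    "bob":  {"ann", "dan"},
    "cara": {"ann", "dan"},
    "dan":  {"bob", "cara", "eve"},
    "eve":  {"dan"},
}

def suggest_friends(person):
    """'People you may know': rank non-friends by the number of
    mutual friends they share with `person` (triadic closure)."""
    counts = Counter()
    for friend in FRIENDS[person]:
        for fof in FRIENDS[friend]:  # friends of friends
            if fof != person and fof not in FRIENDS[person]:
                counts[fof] += 1
    return counts.most_common()

suggestions = suggest_friends("ann")
```

Real systems layer many more signals on top (shared workplaces, co-views, text similarity), which is exactly where the validity problem bites: each extra algorithmic layer is another place for inaccuracies to take root.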
Data Velocity, Deluge and Digging into Nodes
The velocity of data refers to the rate at which this information is created and collected as well as the speed at which it needs to be analyzed to make decisions. It is reasonable to assume that some relationships such as friendship or marriage are stable and do not change significantly month to month (your mileage may vary). In cases like this, once the data are collected you can feel confident that your findings will be no less valid tomorrow than they were yesterday. But other relationships like trending topics on Twitter or edits to Wikipedia articles change dramatically hour to hour. The result is that a network of relationships and interactions built on these data at one point in time looks very different from the networks built not long after. Network scientists and researchers have only begun to develop the methods to analyze the behavior of new nodes in a network, the stability of ties between these nodes, and changes in the properties of these nodes and ties over time, to say nothing about how these all influence each other. Assuming one can avoid drowning from this data deluge, these data streams and dynamics can provide insights about the effectiveness of new strategies as well as highlight problematic relationships much more quickly.
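One simple way to quantify how much a network has churned between two points in time is the Jaccard overlap of the edge sets of the two snapshots. A minimal sketch with hypothetical hourly snapshots (the nodes and ties are invented):

```python
def edge_stability(edges_t1, edges_t2):
    """Jaccard overlap of two edge sets -- a crude measure of how
    stable ties are between two snapshots of the same network."""
    e1, e2 = set(edges_t1), set(edges_t2)
    return len(e1 & e2) / len(e1 | e2)

# Hypothetical snapshots of an interaction network an hour apart.
hour1 = {("a", "b"), ("b", "c"), ("c", "d")}
hour2 = {("a", "b"), ("c", "d"), ("d", "e")}

stability = edge_stability(hour1, hour2)
```

A value near 1 suggests a slow-moving network like marriage ties; a value near 0 is the Twitter-trending-topics regime, where conclusions drawn from one snapshot expire almost immediately.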
While new technologies have drastically improved our capacity to collect data, no good deed goes unpunished. The volume, variety, and velocity of these new forms of data simply break traditional analytic approaches that can’t scale up to massive sizes, incorporate more complex data, or model rapidly changing data. Still, the richness of these data is also revealing new patterns between relationships that were previously overlooked. Changing your thinking from “how do I impact others?” to “how do others impact each other?” opens up a new realm of discovery.