Data Biases, Cognitive Biases

David Thiel:

This data bias is compounded by cognitive bias: the recency illusion, i.e. the perception that recently noticed things are more prevalent. For someone with no tendency to spend time searching hashtags of Chinese cities (or in Chinese-language Twitter in general), the volume of spam will seem sudden and anomalous, and quite possibly suspicious. And because gathering and analyzing data takes time, quickly drawn conclusions will often be based on small amounts of poor-quality data.

As such, the only way to truly compare current with historical activity is to consume it over long timeframes in realtime before it has been acted upon, with the terms defined ahead of time — which was not the case in any of the analyses of Chinese spamming activity that we are aware of. In retrospective research, historical Twitter data generally becomes “cleaner” — some amount of spam and inauthentic behavior will have been removed — as you go further back, but this is necessarily a less accurate representation of what actually occurred on the platform. Put simply:

In a retrospective sample of a moderated social media platform, ToS-violating or inauthentic content tends to appear most prevalent in the immediate past. We can call this Content Moderation Survivor Bias.
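
A toy model can make the mechanism concrete. The sketch below is not drawn from Thiel's analysis. It assumes, purely for illustration, that spam makes up a constant share of tweets posted each day, that each spam tweet is removed independently with a fixed daily probability, and that legitimate tweets are never removed. The function name and all parameter values are hypothetical.

```python
def observed_spam_share(age_days, true_spam_share=0.2, daily_removal_prob=0.1):
    """Expected share of spam among tweets still visible age_days after posting.

    Illustrative assumptions (not taken from the source):
      * spam is a constant fraction (true_spam_share) of what is posted each day
      * each spam tweet is removed independently with probability
        daily_removal_prob per day
      * legitimate tweets are never removed
    """
    # Fraction of the day's original volume that is spam and has survived so far.
    surviving_spam = true_spam_share * (1.0 - daily_removal_prob) ** age_days
    # Fraction of the day's original volume that is legitimate (all of it survives).
    legit = 1.0 - true_spam_share
    return surviving_spam / (legit + surviving_spam)


if __name__ == "__main__":
    # A retrospective sample taken today, looking back at older and older tweets.
    for age in (0, 1, 7, 30, 90):
        share = observed_spam_share(age)
        print(f"{age:>3} days old: {share:.2%} of visible tweets are spam")
```

Under these assumptions the underlying spam rate never changes, yet a sample collected after the fact shows the highest spam share in the most recent days, simply because moderation has had less time to act on them.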

To illustrate this effect as best we can with data gathered after the fact, let’s take a look at tweets containing the names of major Chinese cities.