
Twitter emerged in March 2006 and has since accumulated a massive user base, even if it appears smaller than that of its main competitors. With 126 million daily users (39%) and roughly 330 million monthly users, its scale has brought it into the focus of researchers and private enterprises alike, and its role and position in society are of clear interest to both. From the start, the public-by-default and accessible nature of the data that Twitter generates has set it apart from its main competitors. It provides a treasure trove of researchable text-based data that can be useful for investigating sentiment surrounding brands, or that can indeed be used to shape political and social discourse, as has happened in many high-profile cases in recent years.
These cases provide particular insight into how users understand the data they share on such platforms, and into how other actors believe they can use this data in research. This difference in understanding raises a number of ethical issues for researchers who use Twitter data. Understanding how such issues can have serious real-world implications is an important component of this reflection paper on ethics within text mining more generally.
In addition to the general privacy that users assume when communicating with their intended audience, perceptions of privacy on Twitter are also informed by the supposed ephemerality of tweets. Not only do Twitter users have a constructed or curated public, but, as discussed in depth by Zimmer and Proferes (2014a), Twitter executives’ own rhetoric about the platform and the features of the platform itself reinforce user expectations that tweets are short-lived. This perceived ephemerality raises the question of whether users would communicate differently if they knew that tweets could be stored, preserved, or remain accessible over time. It poses a particular ethical, and indeed more broadly methodological, challenge in using Twitter data: users may be tweeting in the heat of the moment, and any analysis may ascribe too much weight to these ‘in the moment’ thoughts and feelings, to the detriment of more considered and nuanced views. In particular, Twitter users may perceive that they are communicating with a much narrower audience for their status updates, and may therefore share information that they would not otherwise make public. This consideration is particularly relevant to vulnerable populations, given the particular ways of communicating that exist within many groups and how these may be perceived or understood by those outside them.
Indeed, the fact that text mining Twitter data may not in itself conflict with the legal guidelines outlined in Twitter’s terms of service has led many researchers to point out the need to reflect on what is ethical within social science research, as opposed to what is strictly legal (legality alone, some scholars note, would not be sufficient to pass many ethics boards at universities and research institutes in other instances). This is particularly visible around the issue of uninformed consent: many other studies within the social sciences would not be approved on this basis, yet the use of Twitter data in research typically relies almost exclusively on it. The cases of Awan (2014), Innes et al. (2016) and Roberts et al. (2017) illustrate the problem. The authors published highly sensitive Twitter content without any valid attempt to protect the privacy of users or to gain their informed consent, which poses a very clear ethical challenge to using such data.
Indeed, while such practices may not be at odds with Twitter’s terms of service, some scholars argue that they need to be interpreted through the lens of social science research methods, which imply a more reflexive ethical approach than is provided by strictly ‘legal’ accounts of the permissible use of these data sources.
This presents a very clear challenge to researchers’ understanding of their ethical obligations. Indeed, Twitter data is often treated with a reckless abandon that can endanger users who were not explicitly informed that they would be the subject of study. While this reflection largely focuses on the particular ethical issues of using Twitter data, it is particularly relevant to text mining more generally, as in many instances such data is used as the primary source. Despite this, there is a relative absence of clear ethical investigation into the use of Twitter data, which may contain personally identifiable information and often concerns polarised or highly charged topics. The methods available within text mining therefore present a challenge, insofar as they can be deployed on a large dataset either to extract socially beneficial information, such as identifying issues that do not receive sufficient attention, or to bring highly damaging and threatening information to the fore, such as identifying perceived ‘enemies’ or publicly matching users to personally identifiable information such as their home address or phone number. This latter practice is referred to as “doxing”. It is frequently deployed to personally identify opposition or critical voices, and is used to intimidate certain users and potentially make them the victims of crimes. That these tools are deployed in such a way is cause for concern amongst researchers, who, by failing to fully investigate the ethical ramifications of their research, can cause real-world harm to users who typically have not been informed that they are the subject of research.
This harm stems from a failure to fully understand what speaking to the public really means. A large body of research highlights that users misjudge how ephemeral their content actually is. This poses an ethical challenge to researchers, whose aim may simply be to investigate an issue, but whose deployment of text mining approaches and methodologies on large text-based datasets can nonetheless raise ethical issues.