On August 12, James Alex Fields Jr., described by his high school history teacher as “deeply into Adolf Hitler and white supremacy,” drove a car into a group of counterprotesters, murdering 32-year-old Heather Heyer and injuring 19 other people. Earlier in the event, Fields was seen marching with a Vanguard America shield. That same day, a group of six white men followed 20-year-old Black special ed assistant teacher DeAndre Harris into a parking garage and beat him with poles and metal pipes, an attack that was caught on film and posted to the internet. In response to the racist violence, Trump famously said that there were “very fine people on both sides.”
The Unite the Right rally, like much of the American fascist movement’s activism during the 2017–2021 Trump presidency, was largely organized online using Discord, a group chat platform designed for gamers. In Discord, users join servers, which are groups of chatrooms; each individual chatroom in a server is called a channel. Each channel covers a different topic. Fascists created Discord servers for their regional hate groups, as well as for projects like organizing Unite the Right.
An antifascist infiltrator gained access to the server used to organize Unite the Right, called Charlottesville 2.0, as well as many other servers used by fascists at the time. They then leaked the chat logs to Chris Schiano and Dan Feidt, journalists working with the independent nonprofit news collective Unicorn Riot. The leak took the form of screenshots from the Discord app, large JSON files containing thousands of messages, and audio recordings from voice meetings.
In this chapter, I describe how the JSON chat log files were structured and how I went about analyzing them, using techniques covered in Chapter 11. I’ll describe the custom app that I wrote to investigate this dataset and explain how I used it to investigate a Discord server called Pony Power, whose members doxed their political enemies. You’ll also learn the inside story of DiscordLeaks, Unicorn Riot’s public searchable archive based on my app, which contains millions of chat messages from far-right Discord servers. Finally, I discuss a major hack of the American neo-Nazi organization Patriot Front that took place four and a half years after the Charlottesville rally. This hack included chat logs from RocketChat, a self-hosted system that Unicorn Riot also hosts in DiscordLeaks.
Like my reporting on the AFLDS dataset, this case study is an example of journalism with real-world impact. My work, along with that of Unicorn Riot, antifascist infiltrators, and other anonymous developers, helped lead to a court settlement against the most notorious American white supremacist leaders and organizations, resulting in over $25 million worth of damages. I hope that this case study will inspire your own work on datasets of structured chat logs, should you obtain them in the future. With the rise of remote work and the increasing popularity of chat platforms like Discord, Slack, and RocketChat, this type of leak is only getting more common.
I’ll start with a brief description of how these chat logs were leaked.
How Antifascists Infiltrated Neo-Nazi Discord Servers
Unicorn Riot reporters covered the Unite the Right gathering on the ground in Charlottesville. In the following days, the collective announced that it had received anonymously leaked chat logs from the far-right groups that took part in the rally, and particularly from the Charlottesville 2.0 Discord server. It began publishing articles based on these leaks, showing evidence of premeditated plans for violence, memes about hitting protesters with cars, and posts made after the event celebrating Heather Heyer’s murder. It also published ZIP files containing thousands of screenshots from the infiltrated Discord servers. Researchers, both amateur and professional, immediately began correlating breadcrumbs from these chat logs with photos and videos of the event that were posted to social media to identify specific fascist activists.
Alongside Charlottesville 2.0, other leaked fascist Discord servers had names like Vibrant Diversity, Ethnoserver, Safe Space 3, and 4th Reich. Some servers only had a few dozen users, while others had over a thousand. The most active server at the time, Vibrant Diversity, included a channel called #problematic_oven, where users shared racist memes. The 4th Reich server included a #rare_hitlers channel, where users shared vintage propaganda from Nazi Germany.
Once the reporting of Unicorn Riot and others had made it clear to Discord that Nazis were relying on its service, the chat platform shut down many far-right chat servers and accounts. “Discord’s mission is to bring people together around gaming. We’re about positivity and inclusivity. Not hate. Not violence,” the company said in a statement. “We will continue to take action against white supremacy, nazi ideology, and all forms of hate.” Shutting down individual servers and accounts didn’t work, though; fascists simply created new accounts and set up new chat servers. Just as quickly, antifascists infiltrated those new servers and continued to leak chat logs to Unicorn Riot.
Fascists started spreading conspiracy theories that there were no infiltrators but that Discord itself was selling their chat logs to the Southern Poverty Law Center, a nonprofit that monitors hate groups. “The Charlottesville planning server was leaked, even though it was highly secure and no one could figure out who could have leaked it,” Andrew Anglin, founder of the notorious neo-Nazi website the Daily Stormer, wrote in an April 2018 blog post. “Since then, servers have been repeatedly leaked. People have been doxed without being able to figure out how they were doxed. Repeatedly and consistently, I have been given reason to believe that these are not Discord ‘leaks,’ but data being bought by our enemies.” This wasn’t true, of course. Anglin provided no evidence for the claim, Discord’s privacy policy promises that it doesn’t sell user data, and we know exactly how the data was leaked: antifascists were invited into the group by pretending to be racists.
A few weeks after Unite the Right, I got a hold of some of these chat logs myself and began to analyze them.
Analyzing Leaked Chat Logs
In late August of 2017, after Unicorn Riot had started publishing articles based on leaked chats, someone from the collective asked me if I’d like to cover the fascist chat logs for The Intercept. While journalism can be competitive, with each newsroom racing to publish breaking news first without getting scooped, the opposite is often true when it comes to complicated datasets. When it’s clear that there’s no way that a single newsroom has the resources to discover all of the revelations in a dataset, it only makes sense to bring in other newsrooms and share access to the data. This sort of collaboration helps everyone because different newsrooms have different audiences, and it makes real-world impact from the reporting more likely.
My Unicorn Riot contact sent me a ZIP file full of JSON files and screenshots of Discord chats that covered several Discord servers. The JSON files contained more complete logs of everything posted to these chatrooms, while the screenshots captured only specific conversations. While screenshots are initially simpler to use because you don’t need to write any code or use special tools to read them, having the chat logs in a structured data format like JSON is much more useful in the long run. The best way to peruse screenshots of chats is to open individual images, read them one at a time, take note of the filenames that contain interesting content, and refer back to them as needed. This quickly becomes unwieldy when you’re dealing with thousands of screenshots.
I started digging into the JSON files to see what I was dealing with.
Specifically, I used the handy command line tool jq to figure out exactly how this data was structured in order to find the lists of users and channels and read the messages in each channel.
NOTE Besides manually reading screenshots and taking notes, another option would have been to index the screenshots in software like Aleph, which you used in Chapter 5. Aleph would then perform OCR on the images, extracting their text and enabling me to search them for keywords. This might be helpful in locating specific messages, but in the end, it’s still not as useful as structured data. If I were dealing with this data today and only had screenshots without access to JSON data, I would definitely rely on Aleph.
Making JSON Files Readable
Each JSON file within the ZIP file sent by my source contained the entire archive of chat logs from a given Discord server. For example, one 29MB JSON file was called VibrantDiversityComplete-Sept5at327PM. For the purposes of this book, I’ve renamed it VibrantDiversity.json to make the following examples easier to read.
When I opened this file in a text editor, its contents looked like this:
{"meta":{"users":{"231148326249037824":{"name":"D'Marcus Liebowitz"},"232213403974893569":{"nam
e":"northern_confederate"},"279620004641767424":{"name":"☇Unlimited Power☇"},"23338059623405977
6":{"name":"OrwellHuxley"},"289851780521787392":{"name":"badtanman"},"337421867700715524":{"nam
e":"spadegunner"},"315936522656546818":{"name":"erz1871"},"122932975724789761":{"name":"Archer"
},"201547638129164290":{"name":"SLUG2_"},"288899711929286667":{"name":"million plus"},"25019824
--snip--
This block of data is not very human-readable. As you learned in Chapter 11, it’s much easier to read JSON data that’s been reformatted using line breaks, indentation, and syntax highlighting. Using the jq command, I formatted it and added syntax highlighting in my terminal like so:
micah@trapdoor Discord-JSON-Scrapes % cat VibrantDiversity.json | jq
{
  "meta": {
    "users": {
      "231148326249037824": {
        "name": "D'Marcus Liebowitz"
      },
      "232213403974893569": {
        "name": "northern_confederate"
      },
      "279620004641767424": {
        "name": "☇Unlimited Power☇"
      },
--snip--
Running this command added formatting and syntax highlighting to the file’s contents, but still resulted in 29MB of text madly scrolling through my terminal. To understand the data better, I needed to run more specific commands that would reveal its overall structure.
Exploring Objects, Keys, and Values with jq
I could tell by looking at the beginning of the JSON data that the whole file was one large JSON object, and one of that object’s keys was meta. I ran the following jq command to see what other keys there were:
cat VibrantDiversity.json | jq 'keys'
The output told me that the data for each Discord server includes two parts, data and meta:
[
"data",
"meta"
]
Guessing that meta included the metadata for the server, I ran the following command to determine the keys of the meta object:
cat VibrantDiversity.json | jq '.meta | keys'
This command piped the output of cat VibrantDiversity.json as input into the jq '.meta | keys' command. It looks like there’s a second shell pipe there, but there isn’t: because it sits inside the single quotes, the shell never sees it. The string '.meta | keys' is actually just a single argument into jq. Within jq, the pipe character is how you chain multiple filters together so that the output of one gets piped into the input of the next; in this case, .meta outputs the value of the meta key and pipes it into keys, which outputs the keys from that value.
The output showed me that the metadata included information about channels, servers, and users:
[
"channels",
"servers",
"userindex",
"users"
]
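The same exploration can be sketched in Python with the standard json module. This is only an illustration: the tiny sample string is a hypothetical stand-in modeled on the structure shown above, not the real file, and the keys() helper is a name invented for this sketch. Because jq’s keys filter returns keys in sorted order, the helper sorts them too.

```python
import json

# Hypothetical miniature of the leaked file's structure (not real data)
sample = '{"meta": {"users": {}, "userindex": [], "servers": [], "channels": {}}, "data": {}}'
log = json.loads(sample)


def keys(obj):
    """Rough equivalent of jq's `keys` filter for a JSON object."""
    return sorted(obj.keys())


print(keys(log))          # like: jq 'keys'
print(keys(log["meta"]))  # like: jq '.meta | keys'
```

For a first look at an unfamiliar file, jq is usually quicker; the Python version becomes useful once you want to build on what you find.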
So far, I had only looked at the keys of JSON objects. It was time to look at some of the content, starting with the servers. By running jq '.meta.servers', I could look at the value of the servers key inside the meta object:
cat VibrantDiversity.json | jq '.meta.servers'
The output in Listing 14-1 showed that VibrantDiversity.json lists a single server in its metadata section, Vibrant Diversity, just as I expected.
[
{
"name": "Vibrant Diversity",
"type": "SERVER"
}
]
Listing 14-1: The list of servers in VibrantDiversity.json
I could tell that this output was an array, since it was a list of items surrounded by brackets ([ and ]).
Next, I wanted to see what channels this server had, so I ran the following command to view the value of the channels key in the meta object:
cat VibrantDiversity.json | jq '.meta.channels'
Listing 14-2 shows the output of this command.
{
"274024266435919872": {
"server": 0,
"name": "rules"
},
"274262571367006208": {
"server": 0,
"name": "general"
},
"292812979555139589": {
"server": 0,
"name": "effortposting"
},
"288508006990348299": {
"server": 0,
"name": "problematic_oven"
},
"274055625988898816": {
"server": 0,
"name": "music"
},
"343979974241550337": {
"server": 0,
"name": "gun-posting-goes-here"
},
"328841016352440320": {
"server": 0,
"name": "food-posting"
},
"274025126641795074": {
"server": 0,
"name": "share_contact_info"
},
"288901961313550336": {
"server": 0,
"name": "recruiting"
}
}
Listing 14-2: The list of channels in the Vibrant Diversity server
Whereas the output in Listing 14-1 was an array, the output for .meta.channels was a JSON object, as indicated by the braces ({ and }) surrounding it. The keys for this object are long numbers, presumably the ID of the channel, and their values are objects that contain the server and name keys. For example, the channel with key 288508006990348299 has the value {"server": 0, "name": "problematic_oven"}. The server value for all of these channels is 0. I guessed that this was an index into the servers array from Listing 14-1. Since there was only one server in this JSON file, every channel points to the first item in that list, at index 0. The name value of the example channel was problematic_oven. When I later read the chats in this channel, it was full of antisemitic posts and Nazi memes, and the word oven was clearly a reference to the Holocaust. This was definitely a neo-Nazi chat server.
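To make the guess about the server value concrete, here is a minimal Python sketch that follows a channel’s server index into the servers array. The values are copied from Listings 14-1 and 14-2 (trimmed to two channels); the function name server_name_for is hypothetical.

```python
# Values copied from Listings 14-1 and 14-2 (trimmed to two channels)
meta = {
    "servers": [{"name": "Vibrant Diversity", "type": "SERVER"}],
    "channels": {
        "274262571367006208": {"server": 0, "name": "general"},
        "288508006990348299": {"server": 0, "name": "problematic_oven"},
    },
}


def server_name_for(meta, channel_id):
    """Follow a channel's `server` index into the `servers` array."""
    server_index = meta["channels"][channel_id]["server"]
    return meta["servers"][server_index]["name"]


for channel_id, channel in meta["channels"].items():
    print(f"#{channel['name']} -> {server_name_for(meta, channel_id)}")
```

If a file ever contained more than one server, the same lookup would presumably still work, since each channel carries its own index.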
I wanted to see a list of this server’s users, so I ran the following command to view the value of the users key in the meta object:
cat VibrantDiversity.json | jq '.meta.users'
Listing 14-3 shows my output.
{
"231148326249037824": {
"name": "D'Marcus Liebowitz"
},
"232213403974893569": {
"name": "northern_confederate"
},
"279620004641767424": {
"name": "☇Unlimited Power☇"
},
--snip--
Listing 14-3: The list of users in the Vibrant Diversity server
Just like the list of channels in Listing 14-2, the output for .meta.users in Listing 14-3 is a JSON object. The keys are long numbers, presumably the ID of the user, and the values are objects with just a single key, the user’s name.
So far, I had explored the metadata keys channels, servers, and users, but there was one left: the userindex key. I ran the following command to view the userindex key’s value:
cat VibrantDiversity.json | jq '.meta.userindex'
Listing 14-4 shows my output.
[
"231148326249037824",
"232213403974893569",
"279620004641767424",
--snip--
Listing 14-4: The list of user IDs for each user in the Vibrant Diversity server
The output for the .meta.userindex command was a JSON array rather than an object, and each item in the array was a string that looks like a Discord ID. Sure enough, the first item, 231148326249037824, turned out to be the ID of the first user from Listing 14-3, D’Marcus Liebowitz.
At this point I didn’t fully understand the purpose of userindex, but it soon became clear, as you’ll see later in this section.
Armed with a basic understanding of the server’s metadata, I ran the following command to find the keys for the data object:
cat VibrantDiversity.json | jq '.data | keys'
Listing 14-5 shows my output.
[
"274024266435919872",
"274025126641795074",
"274055625988898816",
"274262571367006208",
"288508006990348299",
"288901961313550336",
"292812979555139589",
"328841016352440320",
"343979974241550337"
]
Listing 14-5: The keys to the data object in the Vibrant Diversity server
These keys are the same channel IDs from Listing 14-2, so I guessed that the values of each key contained the actual messages in those chat channels. Because I needed to start somewhere, I decided to view the chat messages from the #problematic_oven channel, so I ran the following command:
cat VibrantDiversity.json | jq '.data."288508006990348299"'
The full argument for this jq command is surrounded by single quotes. The .data part of the filter looks in the key data, and the ."288508006990348299" part of the filter looks in the key 288508006990348299, which is the ID of the #problematic_oven channel. I put the ID in quotes so that jq would know that this key was a string and not a number.
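The same distinction matters if you load the file in Python: JSON object keys are always strings, so a channel ID has to be used as a string even though it looks like a number. The following sketch uses a hypothetical miniature log built from two channel IDs in Listing 14-2 (the message objects are left empty) and also checks the guess that the keys under data are the channel IDs from the metadata.

```python
import json

# Hypothetical miniature log using two channel IDs from Listing 14-2
log = json.loads("""
{
  "meta": {"channels": {"274024266435919872": {"server": 0, "name": "rules"},
                        "288508006990348299": {"server": 0, "name": "problematic_oven"}}},
  "data": {"274024266435919872": {}, "288508006990348299": {}}
}
""")

channel_id = "288508006990348299"
messages = log["data"][channel_id]              # works: the key is a string
int_key_found = int(channel_id) in log["data"]  # False: there are no integer keys

# Every key under `data` should also be a channel ID in the metadata
same_ids = set(log["data"]) == set(log["meta"]["channels"])
print(int_key_found, same_ids)
```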
As with the first time I used jq to read this JSON file, the output of this command scrolled through a large block of text, though considerably less than before. In this case, the output showed chat messages from only a single channel, rather than showing all of the data in the JSON file. Listing 14-6 shows just a few chat messages from the middle of the output.
micah@trapdoor Discord-JSON-Scrapes % cat VibrantDiversity.json | jq '.data."288508006990348299"'
{
--snip--
"352992491282366485": {
"u": 4,
"t": 1504230368205,
"m": "we need more white girls with nice asses"
},
"352992512752746496": {
"u": 4,
"t": 1504230373324,
"m": "no more gay jew shit"
},
"352992579949690890": {
"u": 1,
"t": 1504230389345,
"m": "you're not allowed to oogle anyone whiter than med"
},
"352992652687441920": {
"u": 1,
"t": 1504230406687,
"m": "if i catch you looking at anglo/celtic/nordic girls you're banned"
},
--snip--
Listing 14-6: Chat messages from the #problematic_oven channel in the Vibrant Diversity server
Just like the channels in Listing 14-2, this output is a JSON object with keys that contain long numbers. In this case, these keys appeared to be message IDs, and the values appeared to be details about that specific chat message. In each message, the u field represented the user and the m field contained the message content. The t field was a Unix timestamp, the number of seconds or sometimes milliseconds since January 1, 1970, a common way to represent specific dates and times in computer science. These particular timestamps were in milliseconds.
At this point, I knew that I was looking at a conversation between two neo-Nazis. The top two messages in Listing 14-6 are from a user with the ID of 4, and the bottom two messages are from a user with the ID of 1. Because the value of t gets bigger with each message, these appear to be displayed in chronological order. I decided to take a closer look at the message 352992512752746496, from user 4, with the timestamp 1504230373324.
Converting Timestamps
Unix timestamps are a useful way for computers to store an entire date—the year, month, day, and time of day—in a single number. I needed to convert the timestamp associated with that message into human-readable format to find out the date and time when the message was posted.
I used the following lines of code in the Python interpreter to convert the 1504230373324 timestamp into a more human-readable Python datetime object:
>>> from datetime import datetime
>>> timestamp = datetime.fromtimestamp(1504230373324 / 1000)
>>> print(timestamp)
The syntax in this code is similar to the code you used to import modules in Chapter 8. Rather than import module, this code uses the syntax from module import resource_name, loading a single datetime resource from the datetime module. Next, the code defines a variable called timestamp and sets its value to the return value of the datetime.fromtimestamp() function. This function takes the number of seconds since January 1, 1970, as an argument. Because the Discord logs are in milliseconds rather than seconds, this code first divides the Discord timestamp by 1,000 to convert it to seconds before passing it into the function. The function returns a Python datetime object.
When I displayed the datetime object with print(timestamp), I could see that this chat message was posted on August 31, 2017, at 6:46 PM in my computer’s local time zone:
2017-08-31 18:46:13.324000
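One caveat worth knowing: when called without a time zone, datetime.fromtimestamp() converts to the local time zone of the computer running the code, so the same timestamp can print differently on another machine. The sketch below makes the time zone explicit. The helper name from_discord_ms is hypothetical, and the fixed UTC-7 offset is an assumption standing in for Pacific Daylight Time, which is the offset that reproduces the time printed above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_discord_ms(ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    # Adding a timedelta avoids floating-point division
    return EPOCH + timedelta(milliseconds=ms)


utc_time = from_discord_ms(1504230373324)
# Assumption: a fixed UTC-7 offset standing in for Pacific Daylight Time
pacific = timezone(timedelta(hours=-7))

print(utc_time)                      # 2017-09-01 01:46:13.324000+00:00
print(utc_time.astimezone(pacific))  # 2017-08-31 18:46:13.324000-07:00
```

Working in UTC and converting only for display is a common way to keep timelines consistent when several people analyze the same logs.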
I now had an idea of the timeframe in which this chat exchange took place. Next, I wanted to see which users were involved.
Finding Usernames
I wanted to find the username for the person who’d posted the 352992512752746496 message in Listing 14-6. The u value for this message was 4, so I checked to see if 4 was a valid user ID from the output in Listing 14-3 but found that it wasn’t there; all of the user IDs in that JSON object are 18 digits long. I turned to the output in Listing 14-4 that shows the value of userindex in the meta object. The value of userindex is an array of strings, each an 18-digit user ID.
As described in Chapter 11, JSON arrays are lists of items in a specific order. Objects, on the other hand, don’t have any order. You select values from arrays using their numerical indices, starting from index 0 for the first item. Because objects don’t have numerical indices, there’s no concept of the first, second, or third item in the object; you could edit a JSON file to rearrange the object’s items, and it would still be the same object. For this reason, I guessed that the u value was actually an index into the userindex array.
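A short Python sketch illustrates the difference between ordered arrays and unordered objects. The two user IDs come from Listing 14-4; the small channel object is modeled on Listing 14-2.

```python
import json

# Same keys in a different order: still the same object
a = json.loads('{"name": "general", "server": 0}')
b = json.loads('{"server": 0, "name": "general"}')

# Same items in a different order: a different array
x = json.loads('["231148326249037824", "232213403974893569"]')
y = json.loads('["232213403974893569", "231148326249037824"]')

print(a == b)  # True
print(x == y)  # False
print(x[0])    # arrays support positional indexing
```

This is why a small integer like 4 can only be meaningful as a position in an array, not as a key in the users object.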
To determine which user ID corresponded to the user whose u value was 4, I looked for the value of userindex at index 4 by running the following command:
cat VibrantDiversity.json | jq '.meta.userindex[4]'
This command is similar to the one in Listing 14-4, but because it uses .meta.userindex[4], it selects the value at index 4 of the .meta.userindex array and just displays that result. My output showed that this value was the string 289851780521787392, an 18-digit user ID:
"289851780521787392"
Now that I had a user ID, I used it in the following command to find the matching username:
cat VibrantDiversity.json | jq '.meta.users."289851780521787392"'
Like the previous command, this command selects just one value to output. In this case, it selects the meta key, then the users key, then the 289851780521787392 key. The result is an object that includes a name key:
{
"name": "badtanman"
}
The name badtanman was the username I was looking for.
In the chat logs quoted in Listing 14-6, the user badtanman is talking to someone with the u value of 1. To find that person’s username, I ran the same commands, substituting the appropriate ID numbers:
micah@trapdoor Discord-JSON-Scrapes % cat VibrantDiversity.json | jq '.meta.userindex[1]'
"232213403974893569"
micah@trapdoor Discord-JSON-Scrapes % cat VibrantDiversity.json | jq '.meta.users."232213403974893569"'
{
"name": "northern_confederate"
}
I’d found that the snippet of chat messages in Listing 14-6 was a conversation between badtanman and northern_confederate on the night of August 31, 2017.
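The two-step lookup used above can be expressed as a small Python function. This is a minimal sketch rather than code from the project: the function name username_for is hypothetical, the IDs and names are copied from the jq output in this section, and the entry at index 3 is a placeholder because that ID does not appear in the listings.

```python
# IDs and names copied from the jq output in this section
meta = {
    "userindex": [
        "231148326249037824",
        "232213403974893569",
        "279620004641767424",
        "PLACEHOLDER",  # index 3 does not appear in the listings
        "289851780521787392",
    ],
    "users": {
        "232213403974893569": {"name": "northern_confederate"},
        "289851780521787392": {"name": "badtanman"},
    },
}


def username_for(meta, u):
    """Resolve a message's `u` value to a username in two steps."""
    user_id = meta["userindex"][u]         # jq: .meta.userindex[u]
    return meta["users"][user_id]["name"]  # jq: .meta.users."<user_id>"


print(username_for(meta, 4))  # badtanman
print(username_for(meta, 1))  # northern_confederate
```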
Running all of these jq commands, along with running code in the Python interpreter to convert timestamps, is tedious. If confronted with a large volume of chat logs, you don’t want to research every group of messages this way. But when you’re exploring an unfamiliar dataset for the first time, you need to manually explore it like this until you better understand how the data is structured. After doing this preliminary analysis, I could use my new understanding of the chat logs to write Python scripts or even a full custom app (like I ended up developing for this dataset) to aid my research.
Before I actually started writing Python code to more easily parse these chat logs, though, I noticed a file that I’d missed before in the Unicorn Riot ZIP file that might make researching this dataset a lot easier.
The Discord History Tracker
The ZIP file from my Unicorn Riot contact had dozens of files in it, most of them JSON files and PNG screenshots, along with a few folders containing other JSON files. I’d immediately zeroed in on the JSON files to analyze their data structure, but until now I hadn’t noticed the file logviewer.html. This was an HTML and JavaScript file that, when opened in a web browser, would allow me to load JSON chat log files and read through them.
After talking with my Unicorn Riot contact, I learned that this local HTML file is part of a piece of open source software called Discord History Tracker. This software, not affiliated with Discord, lets users save an offline copy of everything they have access to in a given Discord server in JSON format. Antifascist activists used this software to exfiltrate chat logs from Vibrant Diversity, Charlottesville 2.0, and other fascist-run Discord servers.
Discord History Tracker included two components. The main component was in charge of actually creating a backup of a Discord server. The user would load the Discord server in their web browser, open their developer tools, and copy and paste the Discord History Tracker JavaScript code into their browser’s console. This would then scrape all of the data in the Discord server and save a backup file in JSON format. The second component of Discord History Tracker was the logviewer.html file, which contained offline HTML software for viewing those backup files.
Figure 14-1 shows logviewer.html loaded in a web browser. In the screenshot, I’ve scrolled to the aforementioned messages between badtanman and northern_confederate from the #problematic_oven channel.
NOTE The screenshot in Figure 14-1 shows software from 2017. The Discord History Tracker interface has changed considerably since then. Among other changes, it now saves the data in SQLite databases, rather than as JSON files, and you can view the logs in a desktop app instead of using the logviewer.html file. You can learn more about the software at https://dht.chylex.com.
Figure 14-1: The August 31, 2017, chat between badtanman and northern_confederate, viewed in the Discord Offline History web app
This offline HTML viewer software made it considerably easier to navigate and read the contents of the JSON files. I could click through the channels on the left, and then read through a page of chats at a time. However, it also lacked some features that would be important for my ongoing investigation:
- There was no simple way to search for individual messages. For example, suppose I wanted to search for mentions of Berkeley, the city I lived in at the time. I would have to click a channel like #general, use my web browser’s search feature to search for Berkeley, and then find which messages appeared in the #general channel. I would also need to change the settings to display all messages per page so I could search them all at once, rather than displaying just 1,000 messages at a time, as shown in Figure 14-1. I would then have to replicate this search for every other channel in the server, and if I wanted to search other Discord servers as well, I’d have to replicate it for each channel in each server.
- The offline viewer only supported looking at one server at a time, but I wanted to be able to search multiple servers at once and also track a single user’s messages across different servers.
- There was no way to generate hyperlinks leading to individual messages. When you’re taking notes for a story based on chat logs like this, it’s helpful to track the messages of interest. Without links, you’ll regularly have to go back and search for specific messages all over again.
I decided to build my own web application to add these missing features. I already had all of the chat logs in a structured format, which is by far the biggest requirement to build a custom app, as you learned in Chapter 10’s discussion of BlueLeaks Explorer. If I’d had only screenshots of the Discord servers, a custom app with these features wouldn’t have been possible. Screenshots aren’t structured data, and there’s no easy way to write software that allows you to browse the chat messages they contain.
A Script to Search the JSON Files
As you’ve learned throughout this book, understanding how the data is structured is a prerequisite to writing code that works with it. Therefore, I decided to use the knowledge I’d gained from manually investigating the JSON files with jq to build a simple Python script that let me search one of the JSON files for keywords. Initially I thought I might be able to use this script to do all of the analysis I needed, but that turned out to be wrong; I ended up writing a complete custom app to investigate this dataset as well. Even so, this first (considerably simpler) script allowed me to use Python code to express the structure of the dataset that I’d already gleaned, which simplified the process of programming the full web app. In this section I go over exactly how my initial Discord JSON search script worked.
For example, I knew my script needed to be able to display chat messages based on what I searched for. Let’s say I wanted my code to display the following chat message from Listing 14-6:
"352992491282366485": {
"u": 4,
"t": 1504230368205,
"m": "we need more white girls with nice asses"
}
The value of the u key is 4, but now I knew how to find the actual username of the person who posted this message. First, my code needed to look in the JSON’s meta object and select the item at index 4 of the userindex array (the fifth item, since indices start at 0), which is the user ID 289851780521787392. My code then would look again in the JSON’s meta object, this time for the users key, and use that user ID as the key to get this user object:
{
"name": "badtanman"
}
My code would select the name string from that object to get the username of the message poster, badtanman, and then replicate the whole process to display the correct username for every message.
I opened my text editor and started writing a Python script, discord-json-search.py, to search one of the JSON files for keywords. Here’s my completed source code (you can also find it at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-14/discord-analysis/discord-json-search.py):
#!/usr/bin/python3
import sys
import json
import click
from datetime import datetime


def highlight(message, query):  # ❶
    # An empty query has nothing to underline (and would never advance the loop)
    if not query:
        return message

    new_message = ""
    index = 0
    while True:
        new_index = message.lower().find(query.lower(), index)
        # find() returns -1 when there is no match; 0 is a match at the start
        if new_index >= 0:
            # Found
            new_message += message[index:new_index]
            new_message += click.style(
                message[new_index : new_index + len(query)], underline=True
            )
            index = new_index + len(query)
        else:
            # Not found
            new_message += message[index:]
            break

    return new_message


def display(channel_name, server_name, user_name, timestamp, message, query):  # ❷
    click.echo(
        "{} {}".format(
            click.style("#{}".format(channel_name), fg="bright_magenta"),
            click.style("[server: {}]".format(server_name), fg="bright_black"),
        )
    )
    click.echo(
        "{} {}".format(
            click.style(user_name, bold=True),
            click.style(timestamp.strftime("%c"), fg="bright_black"),
        )
    )
    click.echo(highlight(message, query))
    click.echo("")


def search(data, query):  # ❸
    # Loop through each channel
    for channel_id in data["data"]:  # ❹
        # Get the channel name and server name
        channel_name = data["meta"]["channels"][channel_id]["name"]  # ❺
        server_name = data["meta"]["servers"][
            data["meta"]["channels"][channel_id]["server"]
        ]["name"]

        for message_id in data["data"][channel_id]:  # ❻
            # Pull the user data, timestamp, and message body from the message
            user_index = data["data"][channel_id][message_id]["u"]
            user_id = data["meta"]["userindex"][user_index]
            user_name = data["meta"]["users"][user_id]["name"]
            timestamp = datetime.fromtimestamp(
                data["data"][channel_id][message_id]["t"] / 1000
            )
            message = data["data"][channel_id][message_id]["m"]

            # Is the query in the message?
            if query.lower() in message.lower():  # ❼
                display(channel_name, server_name, user_name, timestamp, message, query)


@click.command()
@click.argument("filename", type=click.Path(exists=True))
@click.argument("query")
def main(filename, query):  # ❽
    # Load the JSON file
    try:
        with open(filename) as f:
            data = json.loads(f.read())
    except:
        print("Failed to load JSON file")
        sys.exit()

    # Search
    search(data, query)


if __name__ == "__main__":
    main()
It’s simplest to explain how this script worked from bottom to top, since that’s how it executed and also how I programmed it. The main() function ❽ is a Click command that takes two arguments: the filename for a JSON file with Discord chat logs called filename, and a search term called query.
The code opened the filename that was passed in and parsed it using
json.loads()
to turn it into
a JSON object. Then it called the search()
function, passing in the data from the
JSON file and the search query.
The search() function [❸] is where all the magic happened. I knew from my previous analysis that these Discord JSON objects had two keys: the data key, which contained the messages in each channel, and the meta key, which contained metadata about these messages. My script started by looping through every channel in data['data'] [❹], then using its channel_id to look up that channel’s name and server in the metadata [❺]. It then looped through every message in that channel [❻] and stored the message’s username, timestamp, and the message itself in variables.
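To make the chain of lookups in search() concrete, here is a minimal sketch that runs the same kind of dictionary lookups on a tiny made-up dataset. The server, channel, user, and message values are invented for illustration, and the find_messages() name is hypothetical; only the key layout (meta, data, userindex, and the u, t, and m message fields) follows the structure described above. Unlike the original script, this sketch returns its matches instead of displaying them and converts timestamps in UTC.

```python
from datetime import datetime, timezone

# Made-up sample data that mimics the layout of the leaked JSON files
data = {
    "meta": {
        "servers": [{"name": "Example Server"}],
        "channels": {"1001": {"server": 0, "name": "general"}},
        "users": {"2001": {"name": "alice"}},
        "userindex": ["2001"],
    },
    "data": {
        "1001": {
            "3001": {"u": 0, "t": 1504469951000, "m": "Hello from Berkeley"},
        }
    },
}


def find_messages(data, query):
    """Return (channel, server, user, timestamp, message) tuples matching query."""
    results = []
    for channel_id, messages in data["data"].items():
        channel = data["meta"]["channels"][channel_id]
        server_name = data["meta"]["servers"][channel["server"]]["name"]
        for message in messages.values():
            # "u" is an index into userindex, which holds the user's Discord ID
            user_id = data["meta"]["userindex"][message["u"]]
            user_name = data["meta"]["users"][user_id]["name"]
            # "t" is a Unix timestamp in milliseconds, so divide by 1000
            timestamp = datetime.fromtimestamp(message["t"] / 1000, tz=timezone.utc)
            if query.lower() in message["m"].lower():
                results.append(
                    (channel["name"], server_name, user_name, timestamp, message["m"])
                )
    return results


matches = find_messages(data, "berkeley")
for channel_name, server_name, user_name, timestamp, text in matches:
    print(f"#{channel_name} [{server_name}] {user_name} {timestamp:%Y-%m-%d}: {text}")
```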
The code then checked to see if the search query that was passed into the script as a CLI argument (stored in query) existed in the message (stored in message) [❼]. As described in Chapter 7, it converted both strings to lowercase using the lower() method to make the search case insensitive. If the lowercase version of the message contained the lowercase version of the search term, the script then passed all of the relevant variables into the display() function to display the message in the terminal.
The display() function [❷] took arguments for metadata about a message, the message text itself, and the search term and used those to display the message. This code used click.echo() instead of print() to display text to the terminal, and it used click.style() to apply different colors and formatting. (You could do all of this just with the print() function, but the click module makes it simpler to style terminal output.) After displaying two lines of metadata for the message, the script then displayed the output of the highlight() function, which returned the message itself in color with the search term underlined.
The highlight() function [❶] created an empty string called new_message and then made it a copy of message, the original message it displayed, except with all instances of the search term underlined using click.style(). It then returned new_message and displayed it to the terminal in the display() function.
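As another way to see the same idea, the sketch below underlines every case-insensitive match with a regular expression instead of a manual loop. It is not the code from discord-json-search.py: the highlight_matches() name and the raw ANSI escape codes (used here in place of click.style() so the example needs only the standard library) are assumptions made for illustration.

```python
import re

# ANSI escape codes that turn underlining on and off in most terminals
UNDERLINE_ON = "\033[4m"
UNDERLINE_OFF = "\033[24m"


def highlight_matches(message, query):
    """Return message with every case-insensitive match of query underlined."""
    if not query:
        # Nothing to search for, so return the message unchanged
        return message
    # re.escape() makes sure the query is treated as literal text, not a pattern
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return pattern.sub(
        lambda match: UNDERLINE_ON + match.group(0) + UNDERLINE_OFF, message
    )


print(highlight_matches("Berkeley, then berkeley again", "berkeley"))
```

Because the substitution keeps match.group(0), the original capitalization of each match is preserved in the output.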
For example, if I wanted to search VibrantDiversity.json for the term berkeley, I could run:
python3 discord-json-search.py ~/datasets/Discord-JSON-Scrapes/VibrantDiversity.json "berkeley"
The output listed over a hundred chat messages that mentioned Berkeley. Each message showed the name of the channel, the name of the Discord server, the user who posted it and when, and the content of the message. Here’s the first snippet of output, which highlighted the search term in the message with an underline:
#general [server: Vibrant Diversity]
Hector Sun Sep 3 20:19:11 2017
Look at how many antifa were at Boston and Berkeley. We need numbers. We can't have rallies with less than a thousand people now. Even that's a low number.
--snip--
The first message that mentioned Berkeley was a post from the user Hector in the #general channel on September 3, 2017. This user was complaining about the relatively small number of fascists that showed up to their rallies in Boston and Berkeley, compared to the “antifa” counterprotesters.
This script allowed me to search a full Discord server for keywords, but it still lacked several of the features that I wanted: it could work with only one Discord leak at a time, and there was no easy way to browse through and read the data sequentially or to save links to specific interesting messages. I started building out a web application to help me perform these missing tasks.
My Discord Analysis Code
I’ve found that after obtaining a large dataset full of structured data, building a custom web application to explore it, as I did with BlueLeaks Explorer, makes it much easier to find its hidden revelations. After writing discord-json-search.py, I spent about a week creating Discord Analysis, a custom web app to analyze leaked Discord chat logs.
Since I wanted to be able to search multiple Discord servers at once, I decided that the best solution would be to convert all of the data from JSON files into a SQL database. I used a Python tech stack that I was already familiar with, Flask (discussed briefly in Chapter 10), for the web app and SQLAlchemy for communicating with the SQL database.
SQLAlchemy is an object-relational mapping (ORM) Python module that’s useful for making code that works with SQL databases simpler to write and more secure. ORMs allow you to work with SQL databases in such a way that you don’t have to directly write any SQL code yourself. Because the ORM builds the queries and passes values to the database as parameters, your projects are far less exposed to SQL injection. This web app used Flask-SQLAlchemy, a Flask extension that adds SQLAlchemy support to Flask apps.
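To illustrate why it matters how values reach the database, here is a minimal sketch that uses Python’s built-in sqlite3 module rather than SQLAlchemy. The table and the malicious input are made up for this example. The first query pastes user input directly into the SQL string, while the second uses a placeholder, which is roughly what an ORM does for you behind the scenes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO user (name) VALUES (?)", [("alice",), ("bob",)])

# Hypothetical malicious input typed into a search box
user_input = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL code, so the OR clause matches every row
unsafe_sql = "SELECT name FROM user WHERE name = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the ? placeholder sends the input as data, so it matches no rows
safe_rows = conn.execute(
    "SELECT name FROM user WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```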
While developing my Discord Analysis web app, I was actively using it to research the leaked neo-Nazi chat logs. If I had new questions about the data (like what other messages a user posted) or found that I needed new features (like limiting my search to a single server), I would program them in as I went along. This is typically how I build research tools: I start using them long before they’re complete, and I let the direction of my research guide which features I add next.
In this section, I explain how I went about developing the different components of the app: designing a SQL database, importing chat logs from the Discord JSON files into that database, and building the web interface to research the chat logs. You’ll learn how I used SQLAlchemy to define database tables, insert rows into them, and select rows from them. You’ll also learn how I used Flask to build this web app, including how to make Jinja templates and how to define routes—skills you’ll need if you build your own Flask web apps in the future.
NOTE Fully explaining how to build a Flask and SQLAlchemy web app is outside the scope of this book. Instead, I go over how I went about building this app in broad strokes, which should still be useful if you ever decide to build a similar one yourself. The best way to learn how to make your own Flask app is by exploring Flask’s excellent documentation at https://flask.palletsprojects.com; that’s how I learned. The Flask documentation includes a tutorial that walks you through every step of developing a simple web app. The Python skills you’ve learned from Chapters 7 and 8 are more than enough for you to follow along with the tutorial. You can also find docs for SQLAlchemy at https://www.sqlalchemy.org and for Flask’s SQLAlchemy extension at https://flask-sqlalchemy.palletsprojects.com.
The code for Discord Analysis, which has quietly been public on my GitHub account for years, hasn’t been updated much since 2017, with the exception of some small changes I made when preparing it for this book. I don’t plan on maintaining it. Still, you should be able to get it running locally if you’d like to explore it further, and you can use it as inspiration for your own future projects that use a similar tech stack. Read through this section to see how it works, and then if you’re curious, try getting it running locally yourself.
As I explain the app, I’ll quote sections of the source code. It’s too long to include all of it here, but you can find the full code online in the book’s GitHub repository at https://github.com/micahflee/hacks-leaks-and-revelations/tree/main/chapter-14/discord-analysis. I recommend that you pull up the full source code for each file as I describe how it works.
Designing the SQL Database
I started my web app with a Python script called app.py. You can find the full source code for this file at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-14/discord-analysis/app.py. First, my code imported the appropriate Flask and SQLAlchemy modules, created a new Flask app object called app, and created a new Flask-SQLAlchemy object called db:
from flask import Flask, render_template, request, escape, flash, redirect
from flask_sqlalchemy import SQLAlchemy
app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///database.sqlite3"
app.config["DEBUG"] = True
db = SQLAlchemy(app)
I started by importing several items from the flask module, like Flask and render_template, that I knew I’d need later in the program. In the next line, I also imported SQLAlchemy from the flask_sqlalchemy module.
Using the newly imported Flask, I then created a Flask object called app. Every Flask web app includes such an object (and usually by that name) to define exactly how the app will work. I modified the app.config dict to set some configuration settings, telling it that I wanted to use a SQLite3 database stored in the file database.sqlite3, and I wanted to turn debug mode on, which is useful while you’re actively developing a web app. Finally, I created the SQLAlchemy object called db, passing in app.
For the next bit of code, I’ll introduce you to a new Python concept that I didn’t explicitly cover in Part III but that you’ve technically been using all along: classes. In Python, a class is a template for creating new objects that can store data (using variables called attributes) and perform actions (using functions called methods). For example, Python’s string type is technically a class. When you run the code s = "example", the variable s is an instance of the string class, the data it stores is the string example, and it has a bunch of methods you can call on it, such as s.upper(), which returns an uppercase version of the string. When you write SQLAlchemy code, you define a class for each database table. This way, you can write code that works with Python objects without needing to write the SQL queries yourself.
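As a quick illustration of these terms, here is a small made-up class that is unrelated to the Discord Analysis code. The Tally name, its count attribute, and its add() method are all invented for this example.

```python
class Tally:
    """A tiny class with one attribute (count) and one method (add)."""

    def __init__(self, start=0):
        # This special method runs when you create an instance
        self.count = start

    def add(self, amount=1):
        # Methods can read and change the object's attributes
        self.count += amount
        return self.count


tally = Tally()  # tally is an instance of the Tally class
tally.add()
tally.add(2)

s = "example"  # s is an instance of the built-in str class
print(tally.count, s.upper(), isinstance(s, str))
```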
I started writing code to define the SQL tables that would store Discord data for servers, users, channels, and messages. For example, the following code defines the Server class, which represents the SQL table to store data about servers:
class Server(db.Model):
    id = db.Column(db.Integer, autoincrement=True, primary_key=True)
    name = db.Column(db.String(128), unique=True, nullable=False)
    channels = db.relationship("Channel", back_populates="server")
    messages = db.relationship("Message", back_populates="server")

    def __init__(self, name):
        self.name = name
Using SQLAlchemy requires that you define your own classes. You can think of this Server class as a description of a new type of Python object that represents a row in the server SQL table. Because I defined it as Server(db.Model), this class inherited all of the functionality of the db.Model class, which is part of SQLAlchemy. Inside the class definition, I defined the table’s columns: id (an auto-incrementing number) and name (a string). Next, I defined this table’s relationships to other tables, in this case relating servers to channels and messages (both the Channel table and the Message table have a server_id column).
Finally, I defined the __init__() method. When you define a class, the first argument of every method represents the Python object itself and is, by convention, named self. You can optionally include other arguments, too. The __init__() method is a type of method called a constructor, which runs as soon as you create the object. This constructor sets the value of the object’s name attribute (which you access within the class as self.name) to the value of name, which is a variable passed into the __init__() method as an argument.
For example, to add a row to the Server table in the SQL database for the Vibrant Diversity Discord server, I could run the code in Listing 14-7. (My Discord Analysis app doesn’t actually use this code, since it loads the servers from the JSON data, but I’m including this example to help you understand how to use SQLAlchemy classes to interact with databases without needing to write SQL queries.)
server = Server("Vibrant Diversity")
db.session.add(server)
db.session.commit()
Listing 14-7: Using SQLAlchemy to insert data into a SQL database
The first line of code creates a Server object by running Server("Vibrant Diversity"). This would run the constructor method, passing in the string Vibrant Diversity as name. The constructor would then set the value of its name attribute to the name that was passed in. When the constructor finishes running, the code would save this newly created Python object in the server variable. The next two lines of code use the SQLAlchemy object db to run the INSERT query in the SQL database and insert this row. The db.session.add() method collects a list of SQL queries, and the db.session.commit() method runs those SQL queries on the database. In SQL, sometimes it’s more efficient to run several queries and then commit them all at once rather than one at a time.
In other words, the code in Listing 14-7 is basically the same as running the SQL query INSERT INTO server (name) VALUES ('Vibrant Diversity');, except this way all you need to do is interact with Python objects, not write any SQL yourself. After creating the server object, I could then access that object’s ID attribute with server.id or the object’s name attribute with server.name.
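For comparison, here is a rough sketch of what the same insert looks like when you write the SQL yourself with Python’s built-in sqlite3 module. The table definition is my approximation of what the Server class describes, not SQL copied from the app, and it uses an in-memory database instead of database.sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Approximation of the table that the Server class describes
conn.execute(
    """
    CREATE TABLE server (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name VARCHAR(128) NOT NULL UNIQUE
    )
    """
)

# Roughly what db.session.add(server) followed by db.session.commit() does
cursor = conn.execute("INSERT INTO server (name) VALUES (?)", ("Vibrant Diversity",))
conn.commit()

# Roughly what reading server.id and server.name does afterward
server_id = cursor.lastrowid
name = conn.execute("SELECT name FROM server WHERE id = ?", (server_id,)).fetchone()[0]
print(server_id, name)
```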
In addition to the Server table I just described, I also created the following tables, which you can view in detail in the app.py file at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-14/discord-analysis/app.py:
User
A Discord user. I included the columns id, discord_id, and name. The id column is an auto-incrementing number, and discord_id is the original ID that Discord itself used. This is useful for identifying the same user across servers.
Channel
A channel in a Discord server. The columns are id, discord_id, name, and server_id. The server_id column forms a relationship with the Server table, since each server has a set of channels. Every Discord server JSON file contains a list of channels. Adding this relationship means that the SQL database I was designing would match the data structure in the JSON files.
Message
A Discord message. The columns are id, discord_id, timestamp, message, attachments_json, user_id, channel_id, and server_id. The attachments_json column contains extra data from messages with attachments, like when someone posts an image to Discord. The user_id, channel_id, and server_id columns form relationships with the User, Channel, and Server tables. These also would match the structure found in the JSON files.
Figure 14-2 shows the relationship between these four tables. The Channel table includes a server_id column, so it’s related to the Server table. The Message table includes columns for channel_id, server_id, and user_id, so it’s related to the Channel, Server, and User tables.
Figure 14-2: Relationships between the SQL tables in the Discord Analysis app
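The relationships described above can also be sketched as plain SQL. The following is an assumed, simplified version of the four tables (the real definitions live in app.py and may differ in types and constraints), created in an in-memory SQLite database, followed by a JOIN that follows a message back to its user, channel, and server. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE server (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE user (id INTEGER PRIMARY KEY, discord_id TEXT UNIQUE, name TEXT);
    CREATE TABLE channel (
        id INTEGER PRIMARY KEY, discord_id TEXT UNIQUE, name TEXT,
        server_id INTEGER REFERENCES server(id)
    );
    CREATE TABLE message (
        id INTEGER PRIMARY KEY, discord_id TEXT UNIQUE, timestamp INTEGER,
        message TEXT, attachments_json TEXT,
        user_id INTEGER REFERENCES user(id),
        channel_id INTEGER REFERENCES channel(id),
        server_id INTEGER REFERENCES server(id)
    );
    INSERT INTO server (id, name) VALUES (1, 'Example Server');
    INSERT INTO user (id, discord_id, name) VALUES (1, '2001', 'alice');
    INSERT INTO channel (id, discord_id, name, server_id)
        VALUES (1, '1001', 'general', 1);
    INSERT INTO message (discord_id, timestamp, message, user_id, channel_id, server_id)
        VALUES ('3001', 1504469951000, 'hello', 1, 1, 1);
    """
)

# Follow the message's three foreign keys back to the rows they point at
row = conn.execute(
    """
    SELECT server.name, channel.name, user.name, message.message
    FROM message
    JOIN user ON message.user_id = user.id
    JOIN channel ON message.channel_id = channel.id
    JOIN server ON message.server_id = server.id
    """
).fetchone()
print(row)
```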
My goal for this web app would be to build an interface that allows me to explore the data stored in these SQL tables. I wanted to be able to search all of the messages at once, including from multiple servers, to see which users posted in multiple servers and to be able to generate links to individual messages that I could store in my notes. Before building the web interface, though, I needed to load the database with data from the JSON files.
Importing Chat Logs into the SQL Database
I wrote a separate script, admin.py, that I used to import data into the SQL database. This script took a command as its first argument. If I passed in create-db, it would use SQLAlchemy to create the SQL tables that I had defined in app.py. When I passed in import-json, followed by the filename of a JSON file, the code would import Discord data from that JSON file into the SQL database. I also eventually added the user-stats command, which displayed how many messages each user in the whole database posted, and on which servers.
This admin.py file is too long to include in this chapter in its entirety, but as with app.py, you can find a copy of the complete code in the book’s GitHub repo at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-14/discord-analysis/admin.py.
In this section, I’ll explain how I built the import-json command (specifically, the import_json() function, which is what gets called when you run import-json), the most interesting part of the script. This is the code that opens up the JSON files containing Discord server leaks, loops through all the data, and then inserts it into the SQL database. As with the discord-json-search.py script, I relied on my previous manual analysis of the Discord JSON files to write this code. Basically, this is the part that requires an understanding of the structure of the original data.
The import_json() function is too long to display it all here, so instead I’ll display snippets that explain the general idea of how it works. The function takes the filename for a JSON file containing Discord leaks as an argument. It opens this file, loads it into a variable called data, and then uses the information in data to add servers, users, channels, and messages to the SQL database. I’ll show the code that adds users, channels, and messages soon, but first, Listing 14-8 shows the code that adds servers.
print("Adding servers: ", end="", flush=True)
for item in data["meta"]["servers"]:
    name = item["name"]
    try:
        server = Server(name)
        db.session.add(server)
        db.session.commit()
        print("+", end="", flush=True)
    except sqlalchemy.exc.IntegrityError:
        db.session.rollback()
        print(".", end="", flush=True)
print("")
Listing 14-8: Code from admin.py to add servers to the database
This code looped through all of the servers it found in data["meta"]["servers"], adding a row to the database for each server that it found. For example, in Listing 14-1, I used jq to view this list of servers for VibrantDiversity.json and found that it contained only a single server. Listing 14-8 uses Python code to find that same list of servers from the same part of the target leaked JSON file.
For each server it found, the code stored the server’s name in the name variable, then tried to add that server to the database. This code used Python exception handling, which you learned about in Chapter 7. In the try block, the code created a new Server object (this represents a row in the Server table in SQLAlchemy), added that row to the database using db.session.add(server), and finally committed the database changes with db.session.commit(), just like in the SQLAlchemy code in Listing 14-7. After the server was successfully inserted into the database, the program displayed a plus sign (+) and moved on to the next loop.
When I defined the Server table in app.py, I specified that the name column should be unique, meaning that there could be no two rows with the same name column. If SQLAlchemy threw the sqlalchemy.exc.IntegrityError exception while the script was trying to add the row to the database, that meant a server with that name already existed in the database, and the except block should run instead. If this happened, then the code rolled back the change that it was about to make and displayed a dot (.) instead of a plus sign.
Why did I worry about catching these exceptions to begin with instead of just adding rows to the database? As with the programming exercises that you completed in previous chapters, I didn’t write the whole script perfectly the first time and then run it. Instead, I wrote small bits of code at a time and ran them to make sure my script was working so far. This exception handling allowed me to rerun an import on the same JSON file over and over while starting where I left off. If my script showed a plus sign, I knew it had added a new row to the database. If it showed a dot, that meant the row already existed and the script moved on.
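This rerunnable-import pattern can be sketched with the standard library’s sqlite3 module, whose sqlite3.IntegrityError plays the role that sqlalchemy.exc.IntegrityError plays in admin.py. The add_servers() helper and the sample server names are hypothetical, and the sketch returns the progress string instead of printing it so that the result is easy to inspect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE server (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")


def add_servers(conn, names):
    """Insert each name, returning '+' for new rows and '.' for existing ones."""
    progress = ""
    for name in names:
        try:
            conn.execute("INSERT INTO server (name) VALUES (?)", (name,))
            conn.commit()
            progress += "+"
        except sqlite3.IntegrityError:
            # The UNIQUE constraint rejected a duplicate name, so undo and move on
            conn.rollback()
            progress += "."
    return progress


first_run = add_servers(conn, ["Server A", "Server B"])
second_run = add_servers(conn, ["Server A", "Server B", "Server C"])
print(first_run, second_run)
```

On the second run, the two names that are already in the table produce dots and only the new name produces a plus sign, which is the behavior that makes repeated imports safe.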
You might also notice that the familiar print() function calls look odd in Listing 14-8: my code passed in the end="" and flush=True keyword arguments. By default, print() displays the string the user passes in as an argument, then adds a newline character (\n) to the end. The end argument replaces that newline with something else (in this case, an empty string). In other words, this is how I could print a string without moving on to the next line. The flush=True argument makes sure that the output gets displayed to the screen immediately; without it, the output would still get displayed, but not right after the function call. This allowed me to watch the progress of an import.
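Here is a small sketch of that progress-line pattern on its own. It captures the output with contextlib.redirect_stdout() only so the result can be inspected; in a real script you would let it print to the terminal. The three-item loop is a stand-in for the real import work.

```python
import io
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    print("Adding servers: ", end="", flush=True)
    for _ in range(3):
        # end="" keeps the cursor on the same line; flush=True shows it right away
        print("+", end="", flush=True)
    # A final print("") ends the line
    print("")

output = buffer.getvalue()
print(repr(output))
```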
After adding servers, the script added users, as shown in Listing 14-9.
print("Adding users: ", end="", flush=True)
for user_discord_id in data["meta"]["users"]:
    name = data["meta"]["users"][user_discord_id]["name"]
    try:
        user = User(user_discord_id, name)
        db.session.add(user)
        db.session.commit()
        print("+", end="", flush=True)
    except sqlalchemy.exc.IntegrityError:
        db.session.rollback()
        print(".", end="", flush=True)
print("")
Listing 14-9: Code from admin.py to add users to the database
This code is very similar to Listing 14-8, but instead of looping through the list data["meta"]["servers"], it looped through the dictionary data["meta"]["users"]. Listing 14-3 shows this JSON object of users from VibrantDiversity.json. As described in Chapter 8, when you loop through a dictionary, you’re actually looping through the dictionary’s keys. In this case, the script stored each key in the user_discord_id variable. Armed with the user’s Discord ID, it then looked up that user’s name in the metadata.
In the try block, the script then created a new User object, this time with both the user’s Discord ID and name, and tried adding it to the database. When I defined the User table in app.py, I specified that the discord_id column should be unique in order to prevent duplicate users. Like Listing 14-8, the code displayed a plus sign when adding the user to the database and a dot if it hit an error. This error-handling code would be important when I started importing multiple servers: if a Discord user was already in the database because they were a member of a previous server, the code wouldn’t create a duplicate user for them.
After adding servers and users, the script then added channels, using the code in Listing 14-10.
print("Adding channels: ", end="", flush=True)
for channel_discord_id in data["meta"]["channels"]:
    name = data["meta"]["channels"][channel_discord_id]["name"]
    server_id = data["meta"]["channels"][channel_discord_id]["server"]
    server = Server.query.filter_by( ❶
        name=data["meta"]["servers"][server_id]["name"]
    ).first()
    try:
        channel = Channel(server, channel_discord_id, name)
        db.session.add(channel)
        db.session.commit()
        print("+", end="", flush=True)
    except sqlalchemy.exc.IntegrityError:
        db.session.rollback()
        print(".", end="", flush=True)
print("")
Listing 14-10: Code from admin.py to add channels to the database
This code is also similar to Listings 14-8 and 14-9. This time, however, it looped through the keys of the data["meta"]["channels"] dictionary, storing each key as channel_discord_id. Listing 14-2 showed this JSON object of channels from VibrantDiversity.json, which you can revisit to remind yourself what this dictionary looks like. For each channel, the code in Listing 14-10 stored the name of the channel in name and that channel’s server index in server_id. It then queried the SQL database itself to get the server row ❶, which should have been added earlier by the code in Listing 14-8, and stored this value in server. The SQL query that the Server.query.filter_by() function call ran was similar to SELECT * FROM server WHERE name='name';, where name is the server name.
In the try block, the code then created a new Channel object, this time telling it the server, the channel’s Discord ID, and the channel name. As with the previous listings, it tried adding this channel to the database, displaying a plus sign on success and a dot if the channel already existed.
Finally, after adding servers, users, and channels, the code added all of the messages, as shown in Listing 14-11.
for channel_discord_id in data["data"]:
    # Get the channel
    channel = Channel.query.filter_by(discord_id=channel_discord_id).one() ❶
    # Loop through each message in this channel
    print(f"Adding messages from {channel.server.name}, #{channel.name}: ", end="", flush=True)
    for message_discord_id in data["data"][channel_discord_id]:
        try:
            timestamp = data["data"][channel_discord_id][message_discord_id]["t"]
            message = data["data"][channel_discord_id][message_discord_id]["m"]
            user_index = data["data"][channel_discord_id][message_discord_id]["u"]
            user_discord_id = data["meta"]["userindex"][user_index]
            user = User.query.filter_by(discord_id=user_discord_id).first() ❷
            if "a" in data["data"][channel_discord_id][message_discord_id]:
                attachments_json = json.dumps(
                    data["data"][channel_discord_id][message_discord_id]["a"]
                )
            else:
                attachments_json = None
            message = Message(
                channel.server,
                message_discord_id,
                timestamp,
                message,
                user,
                channel,
                attachments_json,
            )
            db.session.add(message)
            db.session.commit()
            print("+", end="", flush=True)
        except sqlalchemy.exc.IntegrityError:
            db.session.rollback()
            print(".", end="", flush=True)
    print("")
Listing 14-11: Code from admin.py to add messages to the database
This time, the code looped through all of the keys of the data["data"] dictionary. As you learned in Listing 14-5, this dictionary’s keys are the Discord IDs of channels. My code stored each ID in the variable channel_discord_id. I then used SQLAlchemy to query the database to load this actual channel row ❶ (the SQL query that this command ran was similar to SELECT * FROM channel WHERE discord_id='channel_discord_id';, where channel_discord_id is the channel ID). After learning what channel it was dealing with, the code then looped through all of that channel’s messages to add them to the database, storing each message’s Discord ID as message_discord_id.
The rest of the code in Listing 14-11 is also similar to Listings 14-8 through 14-10. In the try block, for each message, the code stored the timestamp and message in the timestamp and message variables. It then looked up the user Discord ID from the metadata and queried the SQL database for the User object ❷, and, if the message included an attachment, it also created a string called attachments_json. Finally, it created a Message object and inserted this message into the database. As before, the code displayed a plus sign if it successfully inserted a message or a dot if that message was already in the database.
Since exception handling ensured admin.py wouldn’t import duplicate rows, I could use this script to import newer versions of JSON files from the same Discord server. For example, if Unicorn Riot’s infiltrator used Discord History Tracker to save another offline copy of everything in Vibrant Diversity a month later, and I imported that new JSON file, it would import only the new messages.
Once this code was written, I used it to import all of the JSON Discord files that I had received from Unicorn Riot. To import data from the Vibrant Diversity server, I would run this command:
python3 admin.py import-json ~/datasets/Discord-JSON-Scrapes/VibrantDiversity.json
And here is the output:
Adding servers: +
Adding users: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--snip--
Adding channels: +++++++++
Adding messages from Vibrant Diversity, #rules: +++++++++++++++
Adding messages from Vibrant Diversity, #general: +++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--snip--
Adding messages from Vibrant Diversity, #recruiting: ++++++++++++++++++++++++++++++++++++++++++
+++++++++
Import complete
Each plus sign in this output represents a different row of data inserted into the database. The VibrantDiversity.json file added 1 server, 530 users, 9 channels, and a total of 255,349 messages, importing a message at a time.
I then used admin.py to import the rest of the Discord JSON files I had, including chat logs from Anticom, 4th Reich, Ethnoserver, and other leaked servers. For example, next I imported one of the smaller servers called Pony Power, which I’ll discuss further later in this chapter, like so:
python3 admin.py import-json ~/datasets/Discord-JSON-Scrapes/PonyPowerComplete-Sept5at155PM.txt
And here is the output from that command (in this case, I’d already imported the Vibrant Diversity data, and these two Discord servers had some overlapping users, so my script skipped importing some of the users):
Importing: /Users/micah/datasets/Discord-JSON-Scrapes/PonyPowerComplete-Sept5at155PM.txt
Adding servers: +
Adding users: .++++..+++++.+++++..++++.++.+++++++++++..+.+......
Adding channels: ++++
Adding messages from Pony Power, #general-chat: +++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--snip--
This JSON file included 50 users. The code skipped 17 of them (displaying dots instead of plus signs) because they were already in the database from Vibrant Diversity, and it added 33 new users.
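These totals can be double-checked by counting the characters in the progress line above. The sketch below copies the Adding users output into a string and tallies the plus signs and dots with collections.Counter.

```python
from collections import Counter

# The "Adding users" progress output from the Pony Power import
progress = ".++++..+++++.+++++..++++.++.+++++++++++..+.+......"

counts = Counter(progress)
added = counts["+"]    # new users inserted
skipped = counts["."]  # users that were already in the database
print(added, skipped, added + skipped)
```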
My database was now full of neo-Nazi chat logs, preparing me to build a web interface to explore them. When you’re building a web app to investigate data, you need some data to explore to make sure your app is actually working as intended. If I hadn’t imported the actual data first, I would have had to make up and import some test data so I’d have something to troubleshoot with while building the web app. But I decided to import the real data first because I knew I’d need to write that code eventually anyway.
Building the Web Interface
When you build web apps, it’s often useful to split your web pages into reusable components, like headers, footers, and sidebars. Individual pages may have their own reusable components, too. For example, the page that lists chat messages might repeat the same message component for each message on the page. You define these components in templates, HTML files that can contain variables and logic, like if statements or for loops. You can render a template (convert it into HTML) by passing the template file along with variables into a templating engine, or code that converts a template into HTML.
Flask comes with a popular templating engine called Jinja. To build the web interface to explore the chat logs I’d just imported, I started by creating the layout template in Jinja. In short, I wrote the HTML code that would make up the layout of all of the pages in my web app, but also included Python variables and loops. Listing 14-12 shows the code for layout.html, my layout template.
<!doctype html>
<html>
  <head>
    <title>Discord Analysis</title>
    <link rel=stylesheet type=text/css href="{{url_for('static', filename='style.css')}}"> ❶
  </head>
  <body>
    <div class="wrapper">
      <div class="sidebar">
        {% for server in servers %} ❷
        <div class="server">
          <p><strong>{{server.name}}</strong></p>
          <ul>
            {% for c in server.channels %} ❸
            <li{% if channel %}{% if c.id==channel.id %} class="active" {% endif %}{% endif %}><a
              href="{{c.permalink()}}">#{{c.name}}</a> <span class="message-count">[{{
              "{0:,}".format(c.message_count() | int)}}]</span></li>
            {% endfor %}
          </ul>
        </div>
        {% endfor %}
        <p><a href="/users">Users</a></p>
      </div>
      <div class="content">
        <div class="search">
          <form method="get" action="/search">
            <input type="text" name="q" class="q" placeholder="Search query" {% if q %}
              value="{{q}}" {% endif %} /> ❹
            <select name="s">
              <option value="">[all servers]</option>
              {% for server in servers %} ❺
              <option value="{{server.id}}" {% if server.id==s %} selected="selected" {% endif
                %}>
                {{server.name}}
              </option>
              {% endfor %}
            </select>
            <input type="submit" value="Search" />
          </form>
        </div>
        <div class="messages">
          {% for message in get_flashed_messages() %} ❻
          <div class=flash>{{message}}</div>
          {% endfor %}
        </div>
        {% block content %}{% endblock %} ❼
      </div>
    </div>
  </body>
</html>
Listing 14-12: The layout.html layout template
The code in Listing 14-12 looks like HTML at a glance, but if you look closely you’ll see that it’s actually a Jinja template. For example, look at the code that adds the CSS (Cascading Style Sheets) file—which defines the page’s style—to the page ❶. The HTML syntax for adding a stylesheet is
<link rel=stylesheet type=text/css href="style.css">
where style.css is the path or URL of a CSS file. Instead of an actual filename, the code in Listing 14-12 uses this:
{{url_for('static', filename='style.css')}}
In a Jinja template, putting a Python expression between {{ and }} means Python will evaluate this expression when the template is rendered. In this case, Listing 14-12 rendered that line as
<link rel=stylesheet type=text/css href="/static/style.css">
because the url_for() function, which is part of Flask, returned the /static/style.css string.
The template in Listing 14-12 also included some for loops. In Jinja, you start a for loop with the code {% for item in list %} and end it with {% endfor %}. In the left sidebar of the layout, the template listed all of the Discord servers in the database ❷, looping through the items in the servers list one at a time. (For this template to render properly, I’d need to make sure to pass servers into the template as a variable when I render it in the Flask code.) For each server, after displaying the server name, it looped through all of the channels in that server ❸, getting the list of channels from server.channels. For each channel, the code displayed a link to view messages in that channel followed by the number of messages it contains.
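To see the loop and expression syntax in isolation, here is a minimal sketch that renders a cut-down sidebar loop with the jinja2 package directly, outside of Flask. The server and channel data are invented for illustration, and the plain dictionaries (including the message_count key) are stand-ins for the database objects that the real app passes in.

```python
from jinja2 import Template

# A cut-down version of the sidebar loop; not the app's actual template.
template = Template(
    "{% for server in servers %}"
    "{{ server.name }}:"
    "{% for c in server.channels %}"
    " #{{ c.name }} [{{ '{0:,}'.format(c.message_count) }}]"
    "{% endfor %}\n"
    "{% endfor %}"
)

# Invented data: dictionaries stand in for the real database objects.
servers = [
    {
        "name": "Example Server",
        "channels": [
            {"name": "general", "message_count": 1234},
            {"name": "random", "message_count": 56},
        ],
    },
]

# render() evaluates every {{ ... }} expression and {% ... %} statement.
html = template.render(servers=servers)
print(html)
```

The format string adds thousands separators to each count, in the same way the layout template formats the number next to each channel name.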
The template also included a search bar at the top of the page ❹, as well as a drop-down menu with options to search a specific server or to search them all ❺. It also included a list of notification messages ❻ I could use if I wanted to display an error message—for example, if I tried loading a link to view messages in a channel that didn’t exist in the database. Finally, the template displayed the content block for that particular page ❼. While all pages shared this template, the content block differed for each page.
After starting on my templates, I wrote code for a handful of routes, which let the web app know which page the user’s web browser was trying to view. In web development, you can think of a route as a path for a web page, except it can include placeholders. For example, if the web app is hosted at http://localhost:5000, and the Python code defines the route /search for the search page, users can view that route with the URL http://localhost:5000/search.
The home page route (/), shown in Listing 14-13, was by far the simplest one in my web app. This page displayed the message “This is a web app that will let you research the alt-right chatroom leak, published by Unicorn Riot.”
@app.route("/")
def index():
    servers = Server.query.all()
    return render_template("index.html", servers=servers)
Listing 14-13: The home page route (/)
In Flask, each route is a function that returns the HTML for that web page. The index() function starts with the @app.route("/") decorator, which is how Flask knows that the / route should call this function. This function first runs a SQL query to get all of the servers in the database, stored in the variable servers. It then calls the render_template() function, rendering the index.html template, passing the servers variable into the template, and returning the HTML it receives.
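If you want to try the route mechanism without a database or template files, the following is a small self-contained sketch, not the app’s actual code. It assumes Flask is installed, uses a hardcoded SERVERS list in place of the Server.query.all() call, and uses Flask’s render_template_string() with an inline template instead of index.html.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Invented stand-in for the Server.query.all() database call.
SERVERS = ["Server A", "Server B"]


@app.route("/")
def index():
    # Flask calls this function whenever a browser requests the / route.
    return render_template_string(
        "<ul>{% for s in servers %}<li>{{ s }}</li>{% endfor %}</ul>",
        servers=SERVERS,
    )


# The test client requests the route without starting a web server.
response = app.test_client().get("/")
print(response.status_code, response.get_data(as_text=True))
```

Using the test client this way is also a convenient way to check a route while you are still building the templates around it.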
Listing 14-14 shows the code for the index.html Jinja template that was rendered.
{% extends "layout.html" %}
{% block content %}
<h2>Alt-right chatroom research</h2>
<p>This is a web app that will let you research the alt-right chatroom leak,
published by Unicorn Riot.</p>
<p>Click on channel names to browse them. Search for keywords. Viewing
individual messages will show you the whole conversation from an hour
before and after that message.</p>
{% endblock %}
Listing 14-14: The index.html template
The first line of code in this template means that Jinja should render the layout.html template but replace {% block content %}{% endblock %} with the content block defined here: some text that says, “Alt-right chatroom research” and a brief description of the web app. Also notice that in Listing 14-13, I passed the servers variable into the template; the layout.html template in Listing 14-12 used this variable to make the list of servers in the sidebar.
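The relationship between a layout template and a page template can be reproduced in a few lines. This sketch assumes the jinja2 package and uses two tiny in-memory templates that only borrow the names layout.html and index.html; their contents are simplified stand-ins, not the listings above.

```python
from jinja2 import DictLoader, Environment

# Simplified in-memory stand-ins for the layout and page templates.
templates = {
    "layout.html": (
        "<title>Discord Analysis</title>"
        "<ul>{% for server in servers %}<li>{{ server }}</li>{% endfor %}</ul>"
        "<main>{% block content %}{% endblock %}</main>"
    ),
    "index.html": (
        '{% extends "layout.html" %}'
        "{% block content %}<h2>Alt-right chatroom research</h2>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(templates))

# Rendering the child template pulls in the layout and fills its content block.
page = env.get_template("index.html").render(servers=["Server A", "Server B"])
print(page)
```

Note that the servers variable is passed when rendering index.html, but it is the layout that consumes it, which is why every route has to supply it.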
Figure 14-3 shows what the app’s home page looked like at this point, with the home page text as defined in index.html and with the servers on the left and the search bar at the top as defined in layout.html.
Figure 14-3: The home page of my Discord Analysis web app
Let’s look at one more route that does a bit more than the / route: the /search route, which will help explain how one of the web app’s core features, searching the chat logs, works. Here’s the Python code:
@app.route("/search")
def search():
    q = request.args.get("q")
    s = request.args.get("s", 0)
    if s == "":
        s = 0
    page, per_page = get_pagination_args()
    server = Server.query.filter_by(id=s).first()
    messages = Message.query
    if server:
        messages = messages.filter_by(server=server)
    pagination = (
        messages.filter(Message.message.like(f"%{q}%"))
        .order_by(Message.timestamp)
        .paginate(page=page, per_page=per_page)
    )
    if server:
        description = f"Search {server.name}: {q}"
    else:
        description = f"Search: {q}"
    servers = Server.query.all()
    pagination_link = f"/search?q={q}&s={s}"
    return render_template(
        "results.html",
        q=q,
        s=int(s),
        servers=servers,
        pagination=pagination,
        pagination_link=pagination_link,
        description=description,
    )
The search() function starts with the decorator @app.route("/search"), so Flask knows that the /search route should call this function. At the beginning of the function, I defined the q, s, page, and per_page variables as the values from the URL’s query string. For example, if the URL ends in /search?q=berkeley, then this code would set the value of q to berkeley.
I got this query string information from the Flask variable request.args, which is a dictionary containing all of the values after the ? in the URL. The code got the value of the q key in this dictionary by evaluating request.args.get("q"); request.args["q"] would return the same value when q is present, but it raises an error if the key is missing, whereas .get() returns None. When using the .get() method on dictionaries, you can choose default values, as I did in the following line. The expression request.args.get("s", 0) looks through request.args for the key s and returns it if it finds it. If the expression doesn’t find s, it returns 0.
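The difference between .get() and square brackets is easy to check with a plain dictionary. This sketch uses only the standard library to parse query strings, so the args dictionaries here are stand-ins for Flask’s request.args (in Flask itself, a missing key accessed with square brackets results in a 400 Bad Request response).

```python
from urllib.parse import parse_qsl, urlsplit

# A URL with only the q parameter.
url = "http://localhost:5000/search?q=berkeley"
args = dict(parse_qsl(urlsplit(url).query))

q = args.get("q")      # the key exists, so its value is returned
s = args.get("s", 0)   # the key is missing, so the default 0 is returned

try:
    args["s"]          # square brackets raise KeyError for a missing key
    missing_raises = False
except KeyError:
    missing_raises = True

# A URL where s is present but empty, as sent by the [all servers] option.
url_blank = "http://localhost:5000/search?q=berkeley&s="
args_blank = dict(parse_qsl(urlsplit(url_blank).query, keep_blank_values=True))
s_blank = args_blank.get("s", 0)  # the key exists, so the default is not used
if s_blank == "":
    s_blank = 0

print(q, s, missing_raises, s_blank)
```

The second half shows why a check like if s == "" is needed: the default passed to .get() applies only when the key is absent, not when its value is an empty string.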
On the search page, q is the search query and s is the ID of the server to search (if s is 0, this means I want to search all servers). The page and per_page variables are used for pagination, which determines how an app displays a limited number of results per page. The page variable is the page number, and per_page is the number of results per page.
Since three of the routes in my app used pagination (/search, /channel/<channel_id>, and /users/<user_id>), I wrote the code to find the page and per_page query strings in the function get_pagination_args(), which allowed me to just call that function instead of repeating the same code in multiple places.
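A helper of this kind might look roughly like the following. This is a hypothetical sketch, not the function from app.py: it takes the query-string values as a dictionary argument so it can run on its own, and the default page size, the upper limit, and the fallback behavior for bad input are all assumptions.

```python
def get_pagination_args(args, default_per_page=500, max_per_page=2000):
    """Read page and per_page from a dict of query-string values.

    Hypothetical sketch: the defaults and clamping are assumptions, and
    the real helper reads from Flask's request.args instead of a parameter.
    """

    def to_int(value, fallback):
        # Query-string values arrive as strings (or None when missing).
        try:
            return int(value)
        except (TypeError, ValueError):
            return fallback

    page = max(1, to_int(args.get("page"), 1))
    per_page = to_int(args.get("per_page"), default_per_page)
    per_page = min(max(1, per_page), max_per_page)
    return page, per_page


print(get_pagination_args({"page": "3", "per_page": "50"}))
print(get_pagination_args({"page": "abc"}))
```

Clamping the values in one place means that every paginated route rejects nonsense input, such as a negative page number, in the same way.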
I then queried the SQL database for the server with the ID stored in s, saving the result as server. The server variable is used to optionally search a single Discord server, rather than all of them. If the SQL database doesn’t have any servers with that ID, then server is set to None, which means the app should search all servers. I then started building the SQL query to search for all of the messages, storing the query in the variable messages. If this search was limited to a specific server (that is, if a server matching s was found), the code modified messages to filter just by messages from that server. Finally, I used the SQLAlchemy pagination feature to run the SQL query, making sure to select the correct page of results, storing the search results in the variable pagination. Part of the SQLAlchemy query included Message.message.like(f"%{q}%") to ultimately run a SQL query that used SQL’s LIKE operator, which did a case-insensitive search for any messages containing the string q, as described in Chapter 12.
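The effect of wrapping the search term in % characters can be tried directly against a database. This sketch assumes SQLite, accessed through Python’s built-in sqlite3 module rather than SQLAlchemy, with an invented three-row message table. SQLite’s LIKE ignores case for ASCII letters by default; in other database systems, case sensitivity depends on configuration.

```python
import sqlite3

# An in-memory database with a few invented messages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, message TEXT)")
conn.executemany(
    "INSERT INTO message (message) VALUES (?)",
    [("See you in Berkeley",), ("BERKELEY rally",), ("unrelated text",)],
)

q = "berkeley"

# The % wildcards on both sides match any text before and after the term.
rows = conn.execute(
    "SELECT message FROM message WHERE message LIKE ? ORDER BY id",
    (f"%{q}%",),
).fetchall()
matches = [row[0] for row in rows]
print(matches)
```

One thing to keep in mind with this approach is that any % or _ characters inside the search term itself are also treated as wildcards.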
In the following if statement, my code defined the description variable as a description of the search, showing either just the search query or both it and the name of the server being searched. It then loaded all of the servers with servers=Server.query.all(), which the layout.html layout template needs to render the sidebar. Finally, the code rendered the results.html Jinja template, passing in all of the appropriate variables, resulting in the search results page.
In addition to the home page route (/) and the search route (/search), I created these other routes for my web app:
/view/<message_id>
The hyperlink to a specific Discord message
/channel/<channel_id>
The hyperlink to a specific channel in a Discord server
/users
A page that listed all Discord users in the database, along with how many messages each has posted
/users/<user_id>
A page that listed the messages that a given Discord user has posted, spanning all servers and channels that they posted in
As you can see in Figure 14-3, the Discord servers that I imported while developing the app are all listed in the left sidebar, along with each server’s channels. To start my research, I could search for keywords (using the /search route), or I could click a channel name on the left and read its chat logs (using the /channel/<channel_id> route).
You can view the code for all of these routes in app.py at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-14/discord-analysis/app.py.
Now that you know how the Discord Analysis web app works, let’s look at how I went about using it to analyze the Discord leaks.
Using Discord Analysis to Find Revelations
After I had built enough of the Discord Analysis web app that I could start using it for actual research, I started by reading a cross section of all of the Discord leaks I had imported and taking notes on what might make good articles—all the while fixing bugs as I discovered them, and adding features as I felt I needed them. I went one Discord server at a time, trying to understand the gist of what was discussed in each channel. I searched for terms like WikiLeaks to see what the fascists were saying about it, since it had recently played a role in Trump’s 2016 election victory. I stumbled upon various conversations about digital security advice and which encrypted messaging apps to trust, all of it mixed up with numerous conspiracy theories, racist diatribes, and selfies of people holding guns.
Here’s how the process of using Discord Analysis on my computer actually worked. When I wanted to run my web app to test it during development or to start researching neo-Nazi chats, I’d run python3 app.py. It showed this output, which is the typical output you see every time you start a Flask web app:
* Serving Flask app 'app'
* Debug mode: on
WARNING: This is a development server. Do not use it in a production
deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
Press CTRL+C to quit
* Restarting with stat
* Debugger is active!
* Debugger PIN: 654-228-939
The output said that the Flask web server started and was running at the URL http://127.0.0.1:5000. The web server continued to run until I was ready to quit it by pressing CTRL-C. I loaded that URL in my web browser to view the web app. As I made web requests, my terminal output showed me web service logs. For example, when I loaded the home page, my app produced these logs:
127.0.0.1 - - [14/Jan/2023 11:58:30] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [14/Jan/2023 11:58:30] "GET /static/style.css HTTP/1.1" 200 -
127.0.0.1 - - [14/Jan/2023 11:58:30] "GET /favicon.ico HTTP/1.1" 404 -
The left column is the IP address (127.0.0.1) of the web browser that loaded each route; in this case, I loaded routes from my own computer. It also shows the timestamp the route was loaded, which route was loaded, and other information. The first route that I loaded was the home page (you can tell because the first log line says GET /), and it responded with the HTTP code 200, which means it loaded successfully. Immediately after that, my browser loaded the CSS stylesheet at /static/style.css, which successfully loaded too, and tried to load the favicon (the icon in the corner of a web browser tab) at /favicon.ico. However, the server replied with the HTTP code 404, “File not found,” because I hadn’t bothered creating a favicon for my app.
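If you ever need to pull these fields out of the log lines programmatically, a regular expression is enough. The following sketch parses the three log lines shown above; the pattern is an assumption based only on this sample output and may need adjusting for other log formats.

```python
import re

# Pattern written against the sample lines above; not an official format spec.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3}) '
)

lines = [
    '127.0.0.1 - - [14/Jan/2023 11:58:30] "GET / HTTP/1.1" 200 -',
    '127.0.0.1 - - [14/Jan/2023 11:58:30] "GET /static/style.css HTTP/1.1" 200 -',
    '127.0.0.1 - - [14/Jan/2023 11:58:30] "GET /favicon.ico HTTP/1.1" 404 -',
]

# Each match becomes a dictionary keyed by the named groups in the pattern.
parsed = [LOG_LINE.match(line).groupdict() for line in lines]
for entry in parsed:
    print(entry["ip"], entry["path"], entry["status"])
```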
At the top of each page in the web app was a search bar, next to which was the drop-down menu that let me choose to search all servers or just one. For example, I tried searching all Discord servers from which I had imported data for the string berkeley. Back in my terminal, I could see that my browser had loaded the /search?q=berkeley&s= route:
127.0.0.1 - - [14/Jan/2023 11:58:57] "GET /search?q=berkeley&s= HTTP/1.1" 200 -
127.0.0.1 - - [14/Jan/2023 11:58:57] "GET /static/style.css HTTP/1.1" 304 -
The search page loaded the CSS stylesheet at /static/style.css as well, but this time it returned with the HTTP code 304, which means that the stylesheet hadn’t been modified since the last time my browser made that request.
Figure 14-4 shows the Discord Analysis web app showing these search results. You can see that the page has the URL http://127.0.0.1:5000/search?q=berkeley&s= and lists search results from all servers for the string berkeley.
Figure 14-4: Searching for the string berkeley in my Discord Analysis web app
My search found 417 messages that contained the string berkeley, along with information on who posted each message, in what channel, in what server, at what time, and the content of the message, with the search term itself highlighted. If I clicked on the user’s name, which linked to the /users/<user_id> route, I’d see all of the posts from that user, including those on multiple Discord servers.
Each message also had a view link, which led to the /view/<message_id> route and pulled up a page displaying that individual message. This allowed me to store links to individual messages in my notes. When I clicked on a view link I’d saved, the web app would show me not only that message but also the 20 messages before and after it, so I could easily see the rest of the conversation.
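One way to fetch a message together with its neighbors is to run two small queries, one in each direction, from the message’s timestamp. This is a hypothetical sketch using sqlite3 and an invented table layout, not the query from app.py; it assumes timestamps within a channel are unique, so messages that share the selected message’s exact timestamp would be left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (id INTEGER PRIMARY KEY, channel_id INTEGER, "
    "timestamp INTEGER, message TEXT)"
)
# 100 invented messages in one channel, one per minute (IDs 1 through 100).
conn.executemany(
    "INSERT INTO message (channel_id, timestamp, message) VALUES (1, ?, ?)",
    [(i * 60, f"message {i}") for i in range(100)],
)


def conversation_around(conn, message_id, window=20):
    """Return IDs of the message plus up to `window` messages on each side."""
    channel_id, ts = conn.execute(
        "SELECT channel_id, timestamp FROM message WHERE id = ?", (message_id,)
    ).fetchone()
    before = conn.execute(
        "SELECT id FROM message WHERE channel_id = ? AND timestamp < ? "
        "ORDER BY timestamp DESC LIMIT ?",
        (channel_id, ts, window),
    ).fetchall()
    after = conn.execute(
        "SELECT id FROM message WHERE channel_id = ? AND timestamp > ? "
        "ORDER BY timestamp LIMIT ?",
        (channel_id, ts, window),
    ).fetchall()
    # The "before" rows come back newest first, so reverse them.
    return [r[0] for r in reversed(before)] + [message_id] + [r[0] for r in after]


ids = conversation_around(conn, 50)
print(len(ids), ids[0], ids[-1])
```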
The app also allowed me to explore the leaked chats by manually reading through each channel. I could select individual channels by clicking the links in the left sidebar. For example, Figure 14-5 shows the #general channel in the Pony Power server. In this case, the URL was http://127.0.0.1:5000/channel/10, meaning the channel_id in the /channel/<channel_id> route was 10. The ID field in the Channel table auto-increments, so the first row starts at 1, then 2, then 3, and so on. I imported the Vibrant Diversity JSON file first, which created channels with IDs 1 through 9, then imported the Pony Power JSON file, which created channels with IDs 10 through 13.
Figure 14-5: Viewing chat logs for the #general-chat channel in the Pony Power server in my Discord Analysis web app
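The way auto-incrementing IDs depend on import order can be reproduced with a throwaway table. This sketch assumes SQLite through the sqlite3 module; the channel names are placeholders, and only the counts (nine channels, then four) follow the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE channel (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "server TEXT, name TEXT)"
)


def import_channels(server, names):
    # The id column is omitted, so the database assigns the next free number.
    conn.executemany(
        "INSERT INTO channel (server, name) VALUES (?, ?)",
        [(server, name) for name in names],
    )


# Placeholder channel names: nine in the first import, four in the second.
import_channels("first import", [f"channel-{i}" for i in range(9)])
import_channels("second import", ["general", "a", "b", "c"])

second_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM channel WHERE server = ? ORDER BY id", ("second import",)
    )
]
print(second_ids)
```

Because the IDs reflect nothing more than insertion order, links saved in notes stay valid only as long as the database isn’t rebuilt with the files imported in a different order.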
With this case study as inspiration, I hope that you’ll feel confident building similar custom apps for your future investigations when you get your hands on large structured datasets like these.
After spending a few days splitting my time between writing code and reading some of the worst stuff on the internet, I ultimately decided to write about Pony Power, a server set up for the sole purpose of harassing and doxing people.
The Pony Power Discord Server
Pony Power was one of the smaller servers, with only 50 users and just over 1,000 messages posted over the course of just 10 days. More than any other server, it was full of PII for perceived members of antifa. I decided to focus my reporting on this server because this harassment campaign was clearly newsworthy, and because the server was small enough that I could read through all of the messages and write about the highlights. As a single reporter, it would have taken me considerably longer to do the same for larger servers, like Vibrant Diversity.
In the Pony Power chat logs, I found private data from over 50 people from 14 states across the country, from California to Florida. The information often included users’ photographs, social media profiles, home addresses, phone numbers, email addresses, dates of birth, driver’s license numbers, vehicle information, places of employment, and in one instance, a Social Security number. As I read through the Pony Power chat logs, from the beginning to the end, I built up a spreadsheet listing each person who was doxed to help me keep track of them, as well as Discord Analysis links to the messages where the doxing happened. The server’s #faces-of-rainbow-ponies channel contained nearly all of the PII.
The Pony Power fascists weren’t very selective about their targets. Anyone they considered to be a member of antifa or an antifa sympathizer was fair game, as were journalists they disagreed with, professors from liberal universities, or anyone who spoke out against racism.
Eight times in 2017, fascists traveled to Berkeley to hold protests. They came prepared with racist and antisemitic signs and armed with weapons for street fighting. One of these protests, a Say No to Marxism rally, was scheduled for late August. In response, antifascists began preparing a counterprotest. “So who is going to be there to stand up against Antifa? This is a good chance to dox them so we can have an idea who they are,” one of the Pony Power members posted in the chat. “We should go onto their Facebook page if they have an active one and dox all the ones who plan on being there and who liked the post.”
Another Pony Power user posted a link to a website for “white people striving to be allies in the fight for Black Liberation” and said, “These white allies need doxing.” Another wanted to dox members of the Democratic Socialists of America and the Southern Poverty Law Center. Some members of the group disagreed about the strategy of doxing everyone they didn’t like, though. “Fuck these random ass people to be honest,” another user posted. “We need to dox journalists and leadership of activist groups.” A person going by the name Klaus Albricht suggested, “It’s time we start mapping out the liberal teachers of universities.”
Albricht decided to dox a 22-year-old college student because her Facebook cover photo showed her wearing a shirt reading “Punch more Nazis,” a reference to Richard Spencer, a white supremacist best known for the viral video in which he is punched in the face while being interviewed. Albricht outlined a plan to trick her into clicking a malicious link so he could learn her IP address. He also said that he would dox people who liked her shirt. Less than 20 minutes later, he posted her home address, what she was studying at college, and links to all her social media accounts.
While writing my story, I reached out to the woman who was doxed. She told me, “I never clicked the link because it seemed hella sketch.” She also said that she hadn’t gone out to protest fascists and that she was annoyed that they had doxed her just because she hurt their feelings. She was “terrified” that they had her address because “it’s not just myself who’s at risk, but now also my parents who live here as well.”
In the 10 days’ worth of Pony Power chat logs I had at my disposal, I also found the fascists doxing Emily Gorcenski, an antifascist data scientist from Charlottesville who had witnessed Fields’s car plow into protesters. She’s a trans woman, and the fascists posted her deadname (the name she went by before she transitioned) and her home address. She has since moved to Germany.
Fascists also doxed 10 alleged members of an antifa group from Gainesville, Florida. A user who went by the name adolphus (not hitler) posted, “I lost my job because of these [homophobic slur]s,” later posting again that he lost his job because he attended the Unite the Right rally in Charlottesville, so “I’ve got some scores to settle with my local antifa.” I searched the internet for terms like Gainesville Charlottesville fired and quickly found news articles about a Gainesville man who was fired from his job after marching in Charlottesville with neo-Nazis. He was a member of the pro-slavery hate group League of the South, and he had gotten arrested in Charlottesville for carrying a concealed handgun. I tracked down a court document related to his arrest and found one that included his phone number. Because I decided to name him in the article, I called him to give him a chance to provide his side of the story, per the journalistic practices described in Chapter 1. To keep my actual phone number private from him and the League of the South, I used a new virtual phone number I had created just for this purpose (today, I have a public phone number that I use solely for communicating with sources like this). I left messages, but he never responded.
MENTAL HEALTH AND EXTREMISM RESEARCH
While building Discord Analysis and developing a story based on the chat logs from my Unicorn Riot contact, I spent several hours a day for two weeks reading racist, antisemitic, misogynistic, and outright genocidal rants by neo-Nazis speaking to one another online. Among these, one message in particular stuck out to me. It was written by a man, probably in his 50s, well past midnight, and it was a rant about the Jews who he believed were secretly controlling the media and the banks. It was clear to me that he was expressing deeply held beliefs rather than just trying to post something edgy, like a lot of the younger fascists seemed to be doing. While writing this chapter, I searched my Discord Analysis app for the term jew to see if I could find that specific message, but it came back with over 11,000 results, all of them full of hate. I decided it wasn’t worth tracking it down after all.
I knew about antisemitism, of course. I’d experienced antisemitic microaggressions myself. But reading through these neo-Nazi chat logs was the first time that I realized how many people—including thousands of Americans, many of whom lived in my city—really wished that we were all dead.
After reading through a massive amount of this content, I had many ideas for articles I wanted to write, but I ended up writing only a single one, about the Pony Power server. After publishing the first article, I didn’t want to spend more time reading these chats. I found it much better for my mental health to instead focus on writing code to improve my Discord search tool so that others could do the research. As I describe in the following section, this code eventually became a collaboration with Unicorn Riot called DiscordLeaks.
Reading through chat logs like this is an experience that no one should have to go through. But unfortunately, it’s necessary for extremism researchers. If you’re doing this sort of work, make sure to prioritize your own mental health. Take breaks and find people to talk to about the terrible things you’re seeing so you don’t keep it all inside. However you go about it, it’s important to have a plan for making this work sustainable, because it will definitely affect you.
Pony Power members also went after Michael Novick, at the time a 70-year-old retired teacher from Los Angeles who had been an antifascist activist for over 50 years. In the late 1980s, Novick helped found a group called Anti-Racist Action, and he’s been dealing with threats from neo-Nazis ever since. Because Novick’s name appeared on antiracist websites, Pony Power users decided that he must be an antifa leader. “Michael is behind what we know as the power structure,” Albricht posted. The Pony Power users then hit what they believed to be a gold mine: they discovered a video of Novick speaking at the 2011 Los Angeles Housing & Hunger Crisis Conference in which he said, “I’m of Jewish descent.” “HE ADMITS HE IS JEWISH! I KNEW IT!” Albricht exclaimed. “We have our link. Antifa is a Jewish organization!” He added, “Now let’s tear these [antisemitic slur]s apart!” and began inventing an antifa organization chart that placed Novick on top. “This man we know for a fact is the leader of Antifa. […] All other branches report to him.”
Novick told me it’s no secret that he’s Jewish. “My father came to the US in the early ’30s as a teenager from Poland, and most of his family (many aunts, uncles, and cousins) were wiped out by the Nazis either in Bialystok during a ghetto rebellion or in the camps,” he said. He also told me that there’s no antifa “command structure” or “organization chart.” He added, “Some antifa are Jewish. Hardly surprising, given the level of antisemitism displayed by the fascists and neo-Nazis.”
According to a story by Unicorn Riot reporter Chris Schiano, the Pony Power server was started by Dan Kleve. At the time, Kleve was a biochemistry major at the University of Nebraska–Lincoln and a member of the neo-Nazi group Vanguard America. After Kleve was outed as one of the fascists who marched in Charlottesville, people began calling the head of his department to demand that he be expelled. Schiano wrote that Kleve created the Pony Power server, in apparent retaliation against those demanding his expulsion, “to seek revenge by maliciously publishing the personal information of alleged antifascists and encouraging others to harass them and bring them harm.”
You can read my full reporting on the Pony Power Discord chat logs at https://theintercept.com/2017/09/06/how-right-wing-extremists-stalk-dox-and-harass-their-enemies/.
The Launch of DiscordLeaks
After publishing my Pony Power article, I was sure that there were many more revelations spread throughout the hundreds of thousands of messages in the leaked chat logs, but I decided I needed a break from Nazis. I wanted to make it possible for others to analyze the rest of the Discord servers, though, and I knew from my own experience with these datasets that there were technical challenges to analyzing them, which is why I developed Discord Analysis to begin with. I spoke with the journalists from Unicorn Riot and showed them the Discord Analysis web app I had used to write my article. We decided that Unicorn Riot would run a public version of this app for researchers, journalists, and members of the public to use. This is how DiscordLeaks was born.
DiscordLeaks (https://discordleaks.unicornriot.ninja) is a searchable public database designed to make it easy for anyone to access the massive corpus of fascist chat logs from hundreds of Discord servers infiltrated by antifascists. I and a small team of anonymous developers worked in our spare time to add new features to the app and handle the scaling issues that come with hosting a public website that gets lots of traffic. We kept the modified source code for DiscordLeaks private, but it’s based on the Discord Analysis source code that I just described. By late 2017, DiscordLeaks was live, and by early 2018 it was full of chat logs from several Discord servers uploaded by Unicorn Riot journalists, including the one used to organize Unite the Right. The only redactions to the chat logs on DiscordLeaks are the PII for victims of doxing and harassment by far-right extremists; the rest of the data is fully public.
Over the years, Unicorn Riot has obtained a steady stream of leaked Discord chat logs from fascist groups and continued to index them into DiscordLeaks. I eventually stopped contributing to the project myself. In the time I’ve been away, it’s matured: the infrastructure is now running in Docker containers, and the speed of search has greatly improved thanks to the addition of an Elasticsearch database (both technologies were discussed in Chapter 5). Today, DiscordLeaks contains millions of messages from nearly 300 Discord servers used by the far right, available for the public to research. It also contains chat logs from RocketChat servers, which I discuss in the next section.
The Aftermath
By 2019, I had stopped writing code for DiscordLeaks myself, but I still kept in touch with the developers and promoted the website. I was proud of my role in developing this important tool for extremism research, but at the time I still had no idea how much positive impact it would ultimately have.
In this section I’ll discuss two major developments in the DiscordLeaks project since I wrote the initial code back in 2017. In 2021, survivors of the Charlottesville terrorist attack won a verdict of more than $25 million against the organizers of Unite the Right in a lawsuit made possible, in part, by evidence published on DiscordLeaks. DiscordLeaks continues to be a vital tool for extremism researchers: in 2022, DiscordLeaks’ anonymous developers updated it to include another major leak of neo-Nazi chat logs, this time from the group Patriot Front.
The Lawsuit Against Unite the Right
In 1871, in response to the wave of racist terrorism against Black people that swept the South after the end of the Civil War, the US Congress passed the Ku Klux Klan Act. This law allows victims of racist violence to sue the perpetrators in civil court. If the victims can prove there was a conspiracy to deprive them of their civil rights, they can force the racists to pay monetary damages. This is exactly what nine survivors from Charlottesville did.
The plaintiffs in these cases were all Charlottesville residents, some of whom were severely injured that day—one suffered a fractured skull, another a broken leg and ankle. They filed the Sines v. Kessler lawsuit in October 2017 against 14 individuals and 10 organizations, with the goal of bankrupting the American fascist movement. The individual defendants included Jason Kessler, the primary organizer of Unite the Right; James Alex Fields Jr., the neo-Nazi terrorist serving a prison sentence for Heather Heyer’s murder; Richard Spencer; and leaders of the fascist groups that organized Unite the Right. Defendants also included fascist groups themselves like Vanguard America, Traditionalist Worker Party, various branches of the Ku Klux Klan, and the National Socialist Movement.
The Charlottesville survivors’ lawsuit was organized and funded by a legal nonprofit called Integrity First for America (IFA). The mission of the organization, founded in response to the violence of Unite the Right, was “defending democratic norms and ensuring equal rights for every American.” Using over 5TB of evidence in the form of phone records, text messages, videos from Unite the Right, email messages, social media posts, and private messages and chat logs, the plaintiffs successfully made their case. IFA made all of the evidence used in the lawsuit available to the public at https://www.integrityfirstforamerica.org/exhibits.
On its blog, IFA explained that while its lawyers did eventually get copies of the neo-Nazi chat logs directly from Discord as part of the lawsuit’s discovery process, DiscordLeaks provided “an immense amount” of detail before the lawsuit was filed. In the chat logs published by Unicorn Riot, Unite the Right attendees discussed whether they could hit protesters with cars and then claim self-defense, which is what happened. This evidence “provided crucial early information that made the speed and breadth of the initial complaint possible.”
In November 2021, the jury found the fascist organizers liable and ordered them to pay over $25 million in damages. In late 2022, IFA wound down its operations.
The Patriot Front Chat Logs
In the aftermath of the violent Unite the Right protests in Charlottesville, one of the neo-Nazi groups in attendance, Vanguard America, broke apart due to infighting. Out of the ashes of Vanguard America, a new fascist group called Patriot Front was born. Patriot Front, based out of Texas, is known for requiring members to do weekly “activism” involving vandalizing property with racist messages and posting Patriot Front propaganda, like stickers, all over the place. According to the Anti-Defamation League, Patriot Front was responsible for 82 percent of all reported incidents in 2021 involving the distribution of racist, antisemitic, and other hateful propaganda in the US.
In January 2022, someone hacked Patriot Front and leaked 400GB of data to Unicorn Riot, including thousands of messages posted to the group’s internal server running RocketChat, an open source chat platform that anyone can host themselves. Unicorn Riot collaborated with DDoSecrets to publish the 400GB Patriot Front dataset, which you can find at https://ddosecrets.com/wiki/Patriot_Front. In response to this leak, the DiscordLeaks developers also updated the app to include support for RocketChat, and they imported over 12,000 new messages into it from two Patriot Front chat servers. You can find Patriot Front’s chat logs at https://discordleaks.unicornriot.ninja/rocket-chat/.
Figure 14-6 shows a still from a video in the Patriot Front dataset of members reading their manifesto and chanting “Life, liberty, victory!” The video includes a few seconds at the end where one of the neo-Nazis, apparently thinking the recording was over, yells, “Sieg fucking Heil!”
Figure 14-6: Patriot Front members, from a video in the hacked dataset
Unfortunately, the American fascist movement has steadily grown since the election of Donald Trump in 2016. But there’s a wealth of public datasets about this movement, just waiting for researchers like you to dig in and expose it.
Summary
In this chapter, you’ve learned how antifascists infiltrated the Discord servers used by the American fascist movement, including organizers of the deadly Unite the Right rally in 2017, and leaked millions of chat messages to Unicorn Riot in JSON format. You saw how I went about analyzing these JSON files to understand their structure, how the custom Flask and SQLAlchemy web app I built worked under the hood, and how the app ultimately became DiscordLeaks. I also described my own investigation into the Pony Power server that fascists used to dox their enemies. Finally, you read about the amazing results from the Sines v. Kessler lawsuit and the continued success of DiscordLeaks.