Smartphones in hand, the pro-Trump, anti-democracy activists recorded the entire event. They posted their photos and videos online, many to the far-right social media site Parler. In this chapter, you’ll learn to work with the massive trove of video evidence collected from that day’s insurrection in a popular file format called JavaScript Object Notation (JSON). You’ll learn how JSON data is structured and write Python code to scour a million JSON files full of Parler video metadata to find specific videos. You’ll also learn about working with Global Positioning System (GPS) coordinates, including how to plot points on a map, since many of the videos include GPS coordinates in their metadata. All of these skills could serve you well in your future investigations.
Let’s start with a brief history of how the Parler dataset became available to the public.
The Origins of the Parler Dataset
The protesters at the US Capitol insurrection filmed themselves marching with Don’t Tread on Me, Fuck Biden, and Trump flags; tearing down fences; fighting with riot cops; smoking weed; smashing windows and then storming the Capitol building through them; throwing chairs at police; and threatening the lives of members of Congress and Vice President Mike Pence. They uploaded these videos to Parler in real time as they filmed them.
During the attack on the Capitol, pro-Trump rioters attacked police officers with baseball bats, flag poles, and pipes, injuring at least 138 of them. One officer, Brian Sicknick, was hospitalized and died the next day. In the weeks and months following the attack, four more officers who responded that day died by suicide. A Capitol Police officer shot and killed Ashli Babbitt, a rioter who attempted to breach the doors to the US Senate chamber where senators were sheltering. Three more Trump supporters died during the riot: one from being crushed to death in the crowd, one from a stroke, and one from a heart attack.
Days after the attack, citing Parler’s unwillingness to moderate content that encourages and incites violence, Apple and Google banned the Parler app from their app stores. Amazon Web Services (AWS), the major cloud hosting service that Parler had relied on, kicked the company off its service. It took Parler a month and a half to bring its site back up. Before it went down, though, a quick-thinking archivist downloaded over a million videos from the site. In this section, I’ll describe how she downloaded the videos and how they were used in Trump’s second impeachment trial.
How the Parler Videos Were Archived
On the Saturday after the January 6 attack, John Paczkowski and Ryan Mac published an email in BuzzFeed News from the Amazon AWS Trust & Safety Team to Parler. Amazon informed Parler that it “cannot provide services to a customer that is unable to effectively identify and remove content that encourages or incites violence against others,” and that “we plan to suspend Parler’s account effective Sunday, January 10th.” Less than 48 hours before Parler went dark, a hacker named @donk_enby, with the help of other archivists, raced to download a copy of all of the videos and images uploaded to the social network.
Parler, it turns out, lacked security measures that prevent automatic scraping of the site’s data. Web scraping is a method of automated data collection where you use code to load web pages, rather than manually loading them in a browser, and extract their data. This chapter won’t cover how to scrape the web like @donk_enby did, but if you’re curious, you can learn how in Appendix B.
Parler’s website didn’t have any rate limiting, a security feature that prevents users from accessing the site too frequently, so nothing stopped a single computer from making millions of web requests. The URLs of Parler posts appeared to have random IDs, but @donk_enby discovered that they also had hidden incremental IDs (1, 2, 3, and so on), so a script could easily loop through every ID, make a web request to download every post, and then find the URLs for every video and image to download. While Parler did strip metadata from videos uploaded by its users, they also left original copies of videos that contained this metadata at predictable URLs. @donk_enby downloaded versions of the videos that contained a wealth of hidden information, including, in many cases, the GPS coordinates of where the video was filmed.
When @donk_enby archived this data, she saved it to an AWS S3 bucket, an AWS service for hosting files that never runs out of disk space. (It’s ironic that, in response to AWS kicking Parler off its service, she saved copies of the videos to a different part of AWS.)
Because there’s no widely agreed-upon definition of hacking, whether or not Parler was “hacked” is a matter of perspective. Technically, @donk_enby scraped public content from a public website, which isn’t illegal and doesn’t require bypassing security—had Parler even had any that would have prevented this. The same thing is often true of illegal hacking, though; people break into systems that are barely protected or accidentally left open to the public.
By Sunday night, @donk_enby had managed to archive at least 32TB of videos. “I hope that it can be used to hold people accountable and to prevent more death,” she told Vice. She worked with DDoSecrets to make a copy of the data available to the public—the copy you’ll work with in this chapter.
The Dataset’s Impact on Trump’s Second Impeachment
On January 13, a week after the deadly riot at the Capitol and a week before Joe Biden’s inauguration as the new president, the US House of Representatives impeached Trump for “incitement of insurrection,” making Trump the first president in US history to be impeached twice.
During the impeachment trial in the US Senate, which took place in February at the beginning of Biden’s administration, the impeachment managers showed many videos of violent Trump supporters that @donk_enby had archived from Parler as evidence to support their case. “I had an efficient way to download it all. I knew what was there, but it seemed that nobody else could see the value,” she told CNN at the time. “I hope it inspires more people with similar skills to mine to use those skills for good.”
Ultimately, 57 percent of the Senate, including seven members of the Republican Party, found Trump guilty, while 43 percent—all of whom were Republicans—found him not guilty. The US Constitution requires a two-thirds majority of the Senate to convict, so Trump was acquitted. However, over 1,000 people were charged in connection to the January 6 insurrection. Two members of the far-right Oath Keepers militia, including its leader, Stewart Rhodes, and four members of the Proud Boys hate group, including its former leader, Enrique Tarrio, were convicted of seditious conspiracy. Several other members of these groups were also convicted of lesser crimes. Rhodes was sentenced to 18 years in prison in May 2023, and Tarrio was sentenced to 22 years in prison in September 2023.
Further investigating this dataset is obviously in the public interest. Let’s get started in Exercise 11-1.
Exercise 11-1: Download and Extract Parler Video Metadata
The Parler data is so large that it’s not practical, for the purposes of this chapter, to download it all. Instead, you’ll start with just the video metadata DDoSecrets has made available separately. The metadata contains useful information about each video, like its file format, when it was filmed, what type of phone or camera was used to film it, and in some cases the GPS coordinates describing where it was filmed. In this exercise, you’ll learn how to use the metadata to select and download individual videos to view.
NOTE If you’re using Windows, I recommend that you follow along with this chapter using your Ubuntu terminal instead of PowerShell and that you save this data in your WSL Linux filesystem (for example, in ~/datasets), instead of in your Windows-formatted USB disk (/mnt/c or /mnt/d). Because of disk performance issues with WSL, I found that working with this data in Linux rather than directly in Windows was significantly faster. If you’ve only used Python in Windows so far, install Python in Ubuntu with the command sudo apt install python3 python3-pip, then install the click Python module by running python3 -m pip install click. You’ll need the click module for the exercises in this chapter. Refer to Appendix A to learn more about solving performance issues in WSL if you run into any problems.
Download the Metadata
Since the Parler dataset takes up so much disk space, DDoSecrets couldn’t publish it using BitTorrent like it does with most of its other public releases. To seed that torrent, you would need a single server with 32TB of data, and no one would be able to connect to the swarm to download it because no one has 32TB of disk space lying around. Instead, DDoSecrets hosts the Parler data on its public data web server. If you know the filename of a Parler video, you can download it from https://data.ddosecrets.com/Parler/Videos/<filename>.
You can also download a full list of filenames, ddosecrets-parler-listing.txt.gz, and metadata for all of the video files, metadata.tar.gz. Files ending in .gz are compressed using a format called GZIP, so you can tell from the filename that ddosecrets-parler-listing.txt.gz is a compressed text file. Files ending in .tar, called tarballs, combine multiple files and folders together into a single file. Tar files aren’t compressed, though (they take up as much disk space as all of the files they contain), so it’s common to compress them with GZIP, resulting in .tar.gz files. The metadata.tar.gz file is a GZIP-compressed tarball.
Start by downloading ddosecrets-parler-listing.txt.gz and metadata.tar.gz using the wget command. This command is similar to curl, but it downloads a file and saves it to disk by default instead of displaying it in your terminal. Check if you already have wget installed by running which wget. If you don’t, install it on macOS with brew install wget, or on Linux or Windows with WSL with sudo apt install wget.
Open a terminal. Create a new folder for the Parler data you’ll download, and change to that folder. (If you’re using Windows with WSL, make sure you create it in your WSL Linux filesystem, such as at ~/datasets/Parler.) For example, here’s how I did it on my Mac, creating the folder on my datasets USB disk:
micah@trapdoor ~ % cd /Volumes/datasets
micah@trapdoor datasets % mkdir Parler
micah@trapdoor datasets % cd Parler
micah@trapdoor Parler %
Now use wget to download the list of filenames by running the following command:
micah@trapdoor Parler % wget https://data.ddosecrets.com/Parler/Videos/ddosecrets-parler-listing.txt.gz
--snip--
Resolving data.ddosecrets.com (data.ddosecrets.com)... 172.67.75.15, 104.26.3.199, 104.26.2.199
Connecting to data.ddosecrets.com (data.ddosecrets.com)|172.67.75.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17790173 (17M) [application/octet-stream]
Saving to: 'ddosecrets-parler-listing.txt.gz'
ddosecrets-parler-listin 100%[==================================>] 16.97M 29.1MB/s in 0.6s
... (29.1 MB/s) - 'ddosecrets-parler-listing.txt.gz' saved [17790173/17790173]
The output should show that you’ve downloaded the 17MB ddosecrets-parler-listing.txt.gz file. The wget program shows you a progress bar of your download in your terminal.
Next, download the video metadata by running the following command:
wget https://data.ddosecrets.com/Parler/Videos/metadata.tar.gz
Check to make sure you’ve successfully downloaded the files by running ls -lh. You should get the following output:
-rw-r--r-- 1 micah staff 17M Mar 28 2021 ddosecrets-parler-listing.txt.gz
-rw-r--r-- 1 micah staff 203M Mar 15 2021 metadata.tar.gz
The file containing the list of filenames should be 17MB, and the metadata file should be 203MB.
Uncompress and Download Individual Parler Videos
To uncompress GZIP files, you’ll use the gunzip command with the following syntax: gunzip filename.gz. Running gunzip on a gzipped file deletes the original file and leaves you with the uncompressed version without the .gz file extension.
Uncompress the ddosecrets-parler-listing.txt.gz file by running the following command:
gunzip ddosecrets-parler-listing.txt.gz
Your original 17MB file, ddosecrets-parler-listing.txt.gz, should be replaced with a 43MB text file called ddosecrets-parler-listing.txt, which contains over one million lines, one for each video that @donk_enby archived.
To make sure it worked, run ls -lh again. Your output should look something like this:
-rw-r--r-- 1 user staff 43M Mar 28 2021 ddosecrets-parler-listing.txt
-rw-r--r-- 1 user staff 203M Mar 15 2021 metadata.tar.gz
Count the number of files in ddosecrets-parler-listing.txt with the following command:
cat ddosecrets-parler-listing.txt | wc -l
As you learned in Chapter 4, the cat command displays the content of a file, and piping that command’s output into wc -l counts the number of lines in that file. The output should be 1031509, meaning there are 1,031,509 lines in ddosecrets-parler-listing.txt.
If you load the file in a text editor, it should look like this:
2021-01-12 18:31:54 77632730 0002bz1GNsUP
2021-01-12 18:37:33 14586730 0003lx5cSwSB
2021-01-12 18:37:33 822706 0004D2lOBGpr
2021-01-12 18:37:33 17354739 000EyiYpWZqg
2021-01-12 18:37:33 2318606 000SbGUM7vD4
2021-01-12 18:37:33 5894269 000oDvV6Bcfd
2021-01-12 18:37:36 20806361 0012uTuxv9qQ
2021-01-12 18:37:34 45821231 0015NlY0yUB5
--snip--
The first and second columns of text show the date and time that @donk_enby first uploaded each file to the S3 bucket, just after scraping it. The third column is the size of the file, in bytes, and the final column is the filename. All of the video files in the Parler dataset have similar random-looking names. These are the original IDs that Parler used for each video, and they don’t have file extensions.
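If you’d rather work with these columns in Python than read them by eye, the following is a minimal sketch that splits one line of the listing into its parts. The function name parse_listing_line and the dictionary keys are my own choices for illustration, and the sketch assumes every line has exactly four whitespace-separated columns, as in the sample above.

```python
from datetime import datetime


def parse_listing_line(line):
    """Split one line of the listing into its four whitespace-separated columns."""
    date_str, time_str, size_str, filename = line.split()
    return {
        # Combine the date and time columns into a single datetime object.
        "uploaded": datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S"),
        # The third column is the file size in bytes.
        "size_bytes": int(size_str),
        # The last column is the video's ID, which doubles as its filename.
        "filename": filename,
    }


sample = "2021-01-12 18:37:33 14586730 0003lx5cSwSB"
entry = parse_listing_line(sample)
print(entry["filename"], entry["size_bytes"], entry["uploaded"])
```

Converting the size to an integer and the timestamp to a datetime makes it possible to sort or filter the listing, for example to find the largest files.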
Now that you know the filenames of each Parler video, you can download individual files from https://data.ddosecrets.com/Parler/Videos/<filename>. Let’s try downloading one of the first videos listed in ddosecrets-parler-listing.txt. First, use the following commands to create a videos folder and switch to that folder:
micah@trapdoor Parler % mkdir videos
micah@trapdoor Parler % cd videos
Next, run the following command to download the Parler file 0003lx5cSwSB:
wget https://data.ddosecrets.com/Parler/Videos/0003lx5cSwSB
You can normally tell the format of a file based on its file extension, but since these Parler video filenames don’t have extensions, use the following file command to determine the format of 0003lx5cSwSB:
file 0003lx5cSwSB
The output, 0003lx5cSwSB: ISO Media, MP4 v2 [ISO 14496-14], shows that the file is an MP4 video.
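Programs like file work by inspecting the first bytes of a file rather than trusting its name. As a rough illustration of the idea, here is a sketch that checks for the ftyp marker that ISO media files such as MP4 and MOV usually carry at byte offset 4, followed by a four-character brand. The function name sniff_mp4_brand is hypothetical, the sample header is hand-built rather than taken from a real Parler video, and real detection tools check far more than this.

```python
def sniff_mp4_brand(first_bytes):
    """Return the major brand if the bytes look like an ISO media header, else None."""
    # Bytes 4-7 hold the box type; bytes 8-11 hold the major brand.
    if len(first_bytes) >= 12 and first_bytes[4:8] == b"ftyp":
        return first_bytes[8:12].decode("ascii", errors="replace")
    return None


# A hand-built 12-byte header standing in for the start of a real video file.
sample_header = b"\x00\x00\x00\x18" + b"ftyp" + b"mp42"
mp4_brand = sniff_mp4_brand(sample_header)

# The start of a PDF file has no ftyp marker, so the function returns None.
pdf_brand = sniff_mp4_brand(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3")

print(mp4_brand)
print(pdf_brand)
```

To try this on a downloaded video, you could open the file in binary mode, read its first 12 bytes, and pass them to the function.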
To make it easier to open in video-playing software, you’ll need to add the .mp4 extension to the filename. You can rename files using the command mv source_path dest_path, which moves a file from a source path to a destination path. To rename 0003lx5cSwSB to 0003lx5cSwSB.mp4, run the following command:
mv 0003lx5cSwSB 0003lx5cSwSB.mp4
You can now watch 0003lx5cSwSB.mp4 in software like VLC Media Player. Figure 11-1 shows a screenshot from this video, which features Trump battling the “fake news” media and calls him the “Savior of the Universe.”
Figure 11-1: A screenshot from a pro-Trump Parler video showing an altered image of Trump riding a motorcycle
In your terminal, run cd .. to change out of the videos folder you just created and back to the Parler dataset folder.
There are over a million videos in this dataset, and most likely, only a small fraction contain anything newsworthy. If you randomly pick individual videos to download and watch, chances are you’ll be wasting a lot of time. To more efficiently find interesting videos, let’s take a closer look at the metadata.
Extract Parler Metadata
To view the Parler metadata, you’ll need to extract the metadata.tar.gz tarball. In your terminal, uncompress and extract metadata.tar.gz using the tar command:
tar -xvf metadata.tar.gz
Because it’s so common to gzip tar archives, the tar command will automatically detect if it’s gzipped and uncompress it for you, so you don’t need to manually do the gunzip step yourself. In the -xvf argument, x tells tar to extract the files from metadata.tar.gz, v (meaning verbose) tells tar to display each filename it extracts in the terminal, and f means that the next argument is a filename for the tarball on which this command will run.
Your output should look like this:
x metadata/
x metadata/.aws/
x metadata/meta-00CnBY5xCdca.json
x metadata/meta-0003lx5cSwSB.json
x metadata/meta-0070HNolzi3z.json
x metadata/meta-00BIFOMnOyi1.json
x metadata/meta-0002bz1GNsUP.json
--snip--
The command might take 10 minutes or so to extract the over one million JSON files in metadata.tar.gz into a new folder called metadata, depending on the speed of your hard disk. (If you’re using Windows with WSL and this step is going very slowly, consult Appendix A for performance tips.)
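Python’s built-in tarfile module can read gzipped tarballs in much the same way as the tar command. The sketch below doesn’t touch the real metadata.tar.gz; instead, it builds a tiny stand-in archive in a temporary folder (the filename meta-EXAMPLE.json and its content are made up) and then reads it back, letting tarfile detect the compression on its own.

```python
import io
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.tar.gz"

    # Build a tiny gzipped tarball holding one made-up metadata file.
    payload = b'[{"FileType": "MP4"}]'
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="metadata/meta-EXAMPLE.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    # Mode "r:*" asks tarfile to work out the compression by itself.
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        extracted = tar.extractfile("metadata/meta-EXAMPLE.json").read()

print(names)
print(extracted)
```

Reading individual members this way can be handy when you want to peek inside an archive without writing a million files to disk.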
Feel free to run ls on the metadata folder or view it in a file browser, but beware that there are so many files that those simple tasks will take a long time (it took over five minutes for the ls command to finish running on my computer). Figure 11-2 shows the files in the metadata folder in Finder on macOS.
The files in this folder are all named meta-<ID>.json, where ID is the original video ID from Parler. For example, you can find the metadata for the file 0003lx5cSwSB, the video you downloaded in the previous section, at metadata/meta-0003lx5cSwSB.json. All of these metadata files are in the JSON file format, so let’s take a closer look at that now.
Figure 11-2: Some of the extracted Parler metadata files
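Because the metadata filenames and the download URLs both follow a predictable pattern built around the video ID, you can convert between them with a few lines of Python. This is a small sketch, and the helper names video_url, metadata_path, and video_id_from_metadata are hypothetical ones I’ve picked for illustration.

```python
from pathlib import Path

BASE_URL = "https://data.ddosecrets.com/Parler/Videos"


def video_url(video_id):
    """Build the download URL for a video ID."""
    return f"{BASE_URL}/{video_id}"


def metadata_path(video_id, metadata_dir="metadata"):
    """Build the path of the metadata file for a video ID."""
    return f"{metadata_dir}/meta-{video_id}.json"


def video_id_from_metadata(path):
    """Recover the video ID from a metadata path such as metadata/meta-<ID>.json."""
    name = Path(path).name
    if not (name.startswith("meta-") and name.endswith(".json")):
        raise ValueError(f"Not a metadata filename: {name}")
    # Strip the "meta-" prefix and the ".json" suffix.
    return name[len("meta-"):-len(".json")]


print(video_url("0003lx5cSwSB"))
print(metadata_path("0003lx5cSwSB"))
print(video_id_from_metadata("metadata/meta-0003lx5cSwSB.json"))
```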
The JSON File Format
JSON is a format used to store information in text strings. One of its main benefits is that it’s human-readable. Some file formats are designed for computers rather than humans to understand. If you run cat on a PDF file, for example, you’ll see random-looking output in your terminal. You need to open the PDF in a program like Adobe Reader to understand the information it contains. However, humans can easily read the JSON text format just by viewing it in a text editor or by using the cat command.
JSON is one of the most widely used data formats, and the one most APIs communicate with. Whenever you visit a website that does anything interactive, chances are your web browser and the website’s server are passing JSON data back and forth. This is one reason why hacked data, as well as data scraped from APIs, is often full of JSON files. Most of the data from the America’s Frontline Doctors dataset, covered in detail in Chapter 13, is in JSON format, as is much of the data hacked from Gab, the right-wing social network discussed in Appendix B.
In this section, you’ll learn more about JSON syntax and how to load JSON data into Python scripts.
Understanding JSON Syntax
JSON has JavaScript in its name because it was first derived from that programming language, but it’s a language-independent data format: you can work with JSON data in JavaScript, Python, or any other programming language. Using their own JSON libraries, programming languages can convert JSON text strings into structured data (such as Python’s dictionaries and lists) and also convert that structured data back into JSON text strings that can be loaded by code in any other programming language.
To get an idea of the structure of a JSON file, run the following command in your terminal to display the metadata for the Parler video with the filename 0003lx5cSwSB:
cat metadata/meta-0003lx5cSwSB.json
The output should look like Listing 11-1.
[{
  "SourceFile": "-",
  "ExifToolVersion": 12.00,
  "FileType": "MP4",
  "FileTypeExtension": "mp4",
  "MIMEType": "video/mp4",
  "MajorBrand": "MP4 v2 [ISO 14496-14]",
  "MinorVersion": "0.0.0",
  "CompatibleBrands": ["mp42","mp41","iso4"],
  "MovieHeaderVersion": 0,
  "CreateDate": "2020:10:15 09:35:29",
  "ModifyDate": "2020:10:15 09:35:29",
  "TimeScale": 48000,
  "Duration": "0:01:59",
--snip--
Listing 11-1: Video metadata for the file 0003lx5cSwSB
As you can see, FileType is MP4. The CreateDate is 2020:10:15 09:35:29, meaning that this video was filmed on October 15, 2020, at 9:35 AM, and the Duration is 0:01:59, or 1 minute and 59 seconds.
JSON syntax is extremely similar to Python syntax but uses different terminology to describe types of information:
Object
A set of key-value pairs. An object is essentially equivalent to a dictionary in Python and even uses the same syntax. In JSON, however, keys must be strings. Objects are defined between braces ({ and }), and keys and values are separated with colons; for example, {"first_name": "Frederick", "last_name": "Douglass"}. The JSON output in Listing 11-1 also includes a JSON object.
Array
An ordered list of items. An array is essentially equivalent to a list in Python and uses the same syntax. Arrays are defined between brackets ([ and ]), and items are separated by commas. The JSON output in Listing 11-1 has a few arrays, such as ["mp42","mp41","iso4"].
Boolean
A value of either true or false. These work the same as True and False in Python, but they’re lowercase in JSON.
Number
Any whole number or number with decimals in it, such as 2600 or 3.14. These are similar to numbers in Python, though while Python makes a distinction between integers (whole numbers) and floating points (numbers with decimals), JSON does not.
String
A sequence of text characters; for example, "videos have metadata?". This is exactly the same as a string in Python, except that JSON strings must be enclosed in double quotes ("), whereas Python also allows you to use single quotes (').
null
A keyword representing an empty value. This is very similar to Python’s None keyword.
All JSON data is made up of combinations of these types, so it’s important to understand their exact syntax. If you use any invalid syntax, such as surrounding a string with single quotes instead of double quotes or using the Boolean True instead of true, the JSON data won’t load properly.
Unlike in Python code, whitespace isn’t important in JSON data. For example, consider this JSON string:
{"abolitionists":[{"first_name":"Frederick","last_name":"Douglass"},{"first_name":"John","last_name":"Brown"},{"first_name":"Harriet","last_name":"Tubman"}]}
To write the same JSON string in a more human-readable format, you can split it into multiple lines and add indentation:
{
    "abolitionists": [
        {
            "first_name": "Frederick",
            "last_name": "Douglass"
        },
        {
            "first_name": "John",
            "last_name": "Brown"
        },
        {
            "first_name": "Harriet",
            "last_name": "Tubman"
        }
    ]
}
You might encounter JSON files in datasets that are formatted either way. I often open JSON files in VS Code and use the text editor’s built-in format feature to reformat the JSON for legibility. To format a document in VS Code, click View ▸ Command Palette, type Format Document, and press ENTER.
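You can also switch between the two layouts in code. The following sketch uses Python’s built-in json module, which the next section introduces, to turn the single-line abolitionists string into an indented version and back again. The variable names are my own, and the indent of four spaces is just one reasonable choice.

```python
import json

# The single-line JSON string, split across Python lines only for readability.
compact = (
    '{"abolitionists":[{"first_name":"Frederick","last_name":"Douglass"},'
    '{"first_name":"John","last_name":"Brown"},'
    '{"first_name":"Harriet","last_name":"Tubman"}]}'
)

# Parse the string into Python data, then write it back out with indentation.
data = json.loads(compact)
pretty = json.dumps(data, indent=4)
print(pretty)

# Writing the same data with no extra whitespace gives back the compact form.
recompact = json.dumps(data, separators=(",", ":"))
print(recompact == compact)
```

Both strings describe exactly the same data; only the whitespace between the values differs.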
Parsing JSON with Python
You can turn JSON data into Python dictionaries and lists using Python’s built-in json module. First, open a Python interpreter and import the module:
>>> import json
The function in this module that I use the most is json.loads(). This takes a string with JSON data as an argument, parses the string into a Python object like a dictionary or a list, and returns that object. For example, define a string called json_data and set its value to a JSON string with the following command:
>>> json_data = '{"first_name": "Frederick", "last_name": "Douglass"}'
The value you set json_data to looks similar to a dictionary, but since it’s surrounded by single quotes, it’s actually a string. In Python, the type() function tells you the type of a variable. You can confirm that json_data is a string with the following command:
>>> type(json_data)
<class 'str'>
This output shows that json_data is an instance of the str class (Chapter 14 will touch on classes), meaning it’s a string. Now define a variable called obj and set its value to the return value of the json.loads() function:
>>> obj = json.loads(json_data)
Here, json.loads() takes a string as input and, if the string contains valid JSON, converts it into structured data, in this case storing the resulting object in obj. Use the type() function on obj now to see what type of variable it is:
>>> type(obj)
<class 'dict'>
The output shows that you’ve parsed this JSON data into a Python dictionary (a dict), which you can now use like any other dictionary. For example, to put the value at the last_name key of this dictionary in an f-string and then display it, use the following command:
>>> print(f"Hello, Mr. {obj['last_name']}.")
Hello, Mr. Douglass.
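To see how each JSON type described earlier maps onto a Python type, you can parse a string that contains one of each and inspect the results. This is a minimal sketch; the keys in the sample string are arbitrary labels I’ve chosen, not fields from the Parler metadata.

```python
import json

text = (
    '{"object": {"k": "v"}, "array": [1, 2], "boolean": true, '
    '"integer": 2600, "decimal": 3.14, '
    '"string": "videos have metadata?", "nothing": null}'
)
data = json.loads(text)

# Show which Python type json.loads() chose for each JSON value.
for key, value in data.items():
    print(f"{key}: {type(value).__name__}")

# Strings that look like Python rather than JSON are rejected.
rejected = []
for bad in ("{'first_name': 'Frederick'}", '{"valid": True}'):
    try:
        json.loads(bad)
    except json.JSONDecodeError:
        rejected.append(bad)
print(rejected)
```

Notice that although JSON has a single number type, Python’s json module gives you an int for whole numbers and a float for numbers with decimals.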
To practice accessing structured data, in your terminal, change to your Parler dataset folder, and then open a Python interpreter. Run the following commands to load the metadata from a Parler video as structured data. I’ve chosen the file metadata/meta-HS34fpbzqg2b.json, but feel free to load whichever file you’d like:
>>> import json
>>> with open("metadata/meta-HS34fpbzqg2b.json") as f:
...     json_data = f.read()
...
>>> obj = json.loads(json_data)
You now have the video metadata in the variable obj. The simplest way to start inspecting it is to display it to the screen with the print() function:
>>> print(obj)
[{'SourceFile': '-', 'ExifToolVersion': 12.0, 'FileType': 'MOV', 'FileTypeExtension': 'mov',
'MIMEType': 'video/quicktime', 'MajorBrand': 'Apple QuickTime (.MOV/QT)', 'MinorVersion':
'0.0.0', 'CompatibleBrands': ['qt '], 'MediaDataSize': 139501464, 'MediaDataOffset': 36,
--snip--
This output looks a little like JSON, but it’s a Python object: in this case, a list with a nested dictionary. Use the len() function you learned about in Chapter 8 to count how many items are in this list:
>>> len(obj)
1
Since any given Parler video metadata file contains the metadata only for one video, there’s only one item in this list. In order to access that metadata, you need to select the first item in the list. To do that, use obj[0] (remember, 0 is the first index for any list) as follows:
>>> print(obj[0])
{'SourceFile': '-', 'ExifToolVersion': 12.0, 'FileType': 'MOV', 'FileTypeExtension': 'mov',
'MIMEType': 'video/quicktime', 'MajorBrand': 'Apple QuickTime (.MOV/QT)', 'MinorVersion':
'0.0.0', 'CompatibleBrands': ['qt '], 'MediaDataSize': 139501464, 'MediaDataOffset': 36,
--snip--
This time, the output starts with a brace, meaning the item is a dictionary. Now use a for loop to view all of the keys in this dictionary:
>>> for key in obj[0]:
...     print(key)
...
SourceFile
ExifToolVersion
FileType
--snip--
GPSLatitude
GPSLongitude
Rotation
GPSPosition
Each key listed in this output represents a different piece of video metadata from the JSON file. You can also select values from this dictionary using their keys. For example, try printing the values for the GPSLatitude and GPSLongitude keys:
>>> print(obj[0]["GPSLatitude"])
38 deg 53' 26.52" N
>>> print(obj[0]["GPSLongitude"])
77 deg 0' 28.44" W
These values represent the GPS coordinates for the location where this video was filmed.
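Coordinates written in degrees, minutes, and seconds are easy to read but awkward to calculate with or plot. As an illustration, here is a sketch that converts strings in the format shown above into decimal degrees. The function name dms_to_decimal and the regular expression are my own, and the sketch assumes each string holds a single coordinate in exactly this layout; anything else raises an error.

```python
import re

# Matches strings like: 38 deg 53' 26.52" N
DMS_PATTERN = re.compile(r"""(\d+) deg (\d+)' ([\d.]+)" ([NSEW])""")


def dms_to_decimal(dms):
    """Convert one degrees/minutes/seconds string into decimal degrees."""
    match = DMS_PATTERN.fullmatch(dms.strip())
    if match is None:
        raise ValueError(f"Unrecognized coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    # There are 60 minutes in a degree and 3,600 seconds in a degree.
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South latitudes and west longitudes are negative in decimal form.
    return -value if hemisphere in "SW" else value


latitude = dms_to_decimal("38 deg 53' 26.52\" N")
longitude = dms_to_decimal("77 deg 0' 28.44\" W")
print(round(latitude, 4), round(longitude, 4))
```

For the two values printed above, this works out to roughly 38.8907 and -77.0079, a pair of numbers you could paste into most online mapping tools.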
Since JSON makes it easy to convert structured data into strings and back, when creating BlueLeaks Explorer I used JSON files to store the structure of BlueLeaks sites, as described in the section “The Technology Behind BlueLeaks Explorer” in Chapter 10. When you create a structure for a BlueLeaks site, BlueLeaks Explorer stores all of the configuration for that site in a dictionary, then saves that information to a JSON file. If you quit BlueLeaks Explorer and then run it again later, it loads that JSON file back into a dictionary. Since the Parler metadata comes in JSON format, you can also write Python code that loads these JSON files to easily access that metadata, as you’ll do later in this chapter.
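The general pattern of saving a dictionary to a JSON file and loading it back takes only a few lines. The sketch below is not BlueLeaks Explorer’s actual code; the settings dictionary and the filename settings.json are invented for illustration. It uses json.dump() and json.load(), which work with file objects rather than strings.

```python
import json
import tempfile
from pathlib import Path

# A made-up configuration dictionary standing in for real settings.
settings = {"site": "example", "tables": ["Documents", "Users"], "hidden": False}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.json"

    # Save the dictionary to disk as indented JSON text.
    with open(path, "w") as f:
        json.dump(settings, f, indent=2)

    # Later, load the file back into a new dictionary.
    with open(path) as f:
        restored = json.load(f)

print(restored == settings)
```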
To learn more about the json module, you can find the documentation and plenty of example code at https://docs.python.org/3/library/json.html.
Handling Exceptions with JSON
The json.loads() function will throw an exception if you pass an invalid JSON string into it, like this:
>>> json.loads("this isn't valid json")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
A json.decoder.JSONDecodeError exception means that the string you passed in doesn’t contain valid JSON data. In this case, it’s telling you the error in the JSON string is at line 1, column 1, and character 0, meaning the error is located at the first character of the string. If you have a longer JSON string that’s mostly valid but just has a little syntax issue, this error message can help you determine which piece of your syntax is wrong.
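The line, column, and character position in that message are also available as attributes on the exception object, which is useful if you want your own code to point at the problem. Here is a small sketch using a multiline JSON string with one deliberate mistake, a value wrapped in single quotes. The variable names are my own.

```python
import json

# Valid JSON except for the single-quoted value on the third line.
broken = '{\n  "first_name": "Frederick",\n  "last_name": \'Douglass\'\n}'

details = None
try:
    json.loads(broken)
except json.JSONDecodeError as error:
    # Copy the attributes out, since error is unavailable after this block.
    details = (error.msg, error.lineno, error.colno, error.pos)

print(details)
# The character at the reported position is where parsing failed.
print(repr(broken[details[3]]))
```

The reported position lands on the opening single quote on line 3, which is exactly the character a JSON parser doesn’t know how to read.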
Validating JSON data is a common use for Python exception handling, which you learned about in “Exception Handling” in Chapter 7. For example, let’s say you have a string called json_data. The following code will catch exceptions in case this string contains invalid JSON data:
try:
    obj = json.loads(json_data)
    print("The JSON is valid")
    print(obj)
except json.decoder.JSONDecodeError:
    print("Invalid JSON")
This code uses try and except statements to catch the json.decoder.JSONDecodeError exception if it gets thrown. If json_data is a valid JSON string, it will display The JSON is valid, followed by the information in obj. If the JSON string is invalid, the script will display Invalid JSON and then continue running without crashing.
To load a JSON file in Python functions such as main(), you must first load the content of the file into a string like so:
with open("filename.json") as f:
    json_data = f.read()
Replace filename.json with whatever file you’re loading, such as metadata/meta-HS34fpbzqg2b.json to load the metadata for the HS34fpbzqg2b video file. As you learned in “Reading and Writing Files” in Chapter 8, this code opens the file as a file object f and then stores its content into a string called json_data.
Next, you’d run that string through json.loads() to convert it from a string into structured data, like this:
try:
    obj = json.loads(json_data)
except json.decoder.JSONDecodeError:
    print("Invalid JSON")
    return
When this code finishes running, if the JSON string was valid, obj will contain the JSON data. Otherwise, it will display Invalid JSON and then return early from the function. The remaining code in the function can access the data in obj.
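Putting the file-reading and exception-handling steps together, you could wrap them in a small helper that returns the metadata dictionary for one file, or None if anything goes wrong. The following is one possible sketch, not the approach the later exercises necessarily take; the function name load_metadata is hypothetical, and the demonstration uses two throwaway files in a temporary folder instead of real Parler metadata.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(path):
    """Return the metadata dictionary stored in one JSON file, or None on failure."""
    try:
        with open(path) as f:
            obj = json.loads(f.read())
    except (OSError, json.JSONDecodeError):
        # The file is missing, unreadable, or doesn't contain valid JSON.
        return None
    # Each metadata file is expected to hold a list with a single dictionary.
    if isinstance(obj, list) and obj and isinstance(obj[0], dict):
        return obj[0]
    return None


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "meta-GOOD.json"
    good.write_text('[{"FileType": "MOV"}]')
    bad = Path(tmp) / "meta-BAD.json"
    bad.write_text("this isn't valid json")

    good_result = load_metadata(good)
    bad_result = load_metadata(bad)

print(good_result)
print(bad_result)
```

Returning None instead of crashing lets a loop over a million files skip the occasional damaged one and keep going.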
To prepare for using this module to write Python scripts that parse the Parler metadata files, next we’ll look at how to access values like GPS coordinates from JSON files with several command line programs.
Tools for Exploring JSON Data
While we’ve been focusing primarily on working with JSON files using Python, sometimes writing a Python script is overkill if you just want to quickly search a large block of JSON text. In this section, you’ll learn to use our old friend grep, as well as a more powerful tool called jq, to search JSON files.
Counting Videos with GPS Coordinates Using grep
As you know from Chapter 4, the command line programs grep and wc are incredibly powerful tools to quickly assess datasets. In a single command, and without needing to write a Python script, you can use grep to efficiently search inside JSON files.
For example, let’s say you want to figure out how many Parler video metadata files include GPS coordinates. Open a terminal, switch to your Parler dataset folder, and run the following command to grep for the string GPSCoordinates:
micah@trapdoor Parler % grep -r GPSCoordinates metadata
The first argument, -r (short for --recursive), tells grep to look inside every file in the given folder. The next argument, GPSCoordinates, is the string to search for. The final argument, metadata, is the name of the folder to search.
When you run this command, your terminal should quickly fill with GPS coordinates:
metadata/meta-31VC1ufihFpa.json: "GPSCoordinates": "22 deg 8' 0.60\" S, 51 deg 22' 4.80\" W",
metadata/meta-ImUNiSXcoGKh.json: "GPSCoordinates": "0 deg 0' 0.00\" N, 0 deg 0' 0.00\" E",
metadata/meta-70Tv9tAQUKyL.json: "GPSCoordinates": "36 deg 10' 49.08\" N, 115 deg 26' 45.60\" W, 1922.566 m Above Sea Level",
metadata/meta-P2w4QOgv5n9U.json: "GPSCoordinates": "26 deg 14' 46.32\" N, 80 deg 5' 38.76\" W, 3.424 m Above Sea Level",
--snip--
However, you’re trying to find how many of these videos have GPS coordinates, not necessarily what those coordinates are. If coordinates are still loading in your terminal, press CTRL-C to cancel the command, then pipe the output of grep into wc -l to count how many lines get displayed:
micah@trapdoor Parler % grep -r GPSCoordinates metadata | wc -l
64088
Of the slightly more than one million videos, about 64,000 have GPS coordinates.
Programs like grep and wc can only take you so far in your attempts to efficiently search large quantities of data. For example, if the JSON files you’re searching are formatted on a single line, rather than split into multiple lines like the Parler files, grep will search the entire block of JSON data for your string rather than a line at a time. You can’t use grep to extract specific fields of data from JSON, either. For that, the best tool for the job is a program called jq.
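To illustrate the difference between matching lines and reading fields, here is a rough Python counterpart to the grep command. It counts files rather than lines by parsing each file and checking whether its first item has a given key, so it behaves the same whether a file is formatted on one line or many. The helper name count_files_with_key() is hypothetical, and this is only a sketch.

```python
import json
import os


def count_files_with_key(folder, key):
    """Count the JSON files in folder whose first array item contains key."""
    count = 0
    for filename in os.listdir(folder):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(folder, filename)) as f:
            try:
                data = json.load(f)
            except json.decoder.JSONDecodeError:
                continue  # skip files that aren't valid JSON
        # The Parler metadata files hold a list containing one dictionary
        if isinstance(data, list) and data and isinstance(data[0], dict):
            if key in data[0]:
                count += 1
    return count
```

Calling count_files_with_key("metadata", "GPSCoordinates") from the Parler dataset folder would parse every file, so expect it to be slower than grep.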
Formatting and Searching Data with the jq Command
The jq program allows you to take JSON data as input and select key information from it. In this section, you’ll learn how to use it to extract specific information from the Parler files.
First, you’ll need to install jq. Mac users can do so by running the brew install jq command. If you’re using Linux or Windows with WSL, run the sudo apt install jq command.
You can use the jq command to indent JSON data and show syntax highlighting in your terminal, making the data easier to read. For example, try running this command in your terminal:
cat metadata/meta-HS34fpbzqg2b.json | jq
The first part of the command, cat metadata/meta-HS34fpbzqg2b.json, outputs the content of that JSON file, which contains the metadata for a single Parler video. The second part, | jq, pipes that output as input into jq.
The output should look like this:
[
  {
    "SourceFile": "-",
    "ExifToolVersion": 12,
    "FileType": "MOV",
    "FileTypeExtension": "mov",
    "MIMEType": "video/quicktime",
    "MajorBrand": "Apple QuickTime (.MOV/QT)",
    "MinorVersion": "0.0.0",
    "CompatibleBrands": [
      "qt "
    ],
    --snip--
    "GPSLatitude": "38 deg 53' 26.52\" N",
    "GPSLongitude": "77 deg 0' 28.44\" W",
    "Rotation": 180,
    "GPSPosition": "38 deg 53' 26.52\" N, 77 deg 0' 28.44\" W"
  }
]
This version includes syntax highlighting (as in VS Code) and formats the JSON data so that the items in every array and object are listed on separate lines and indented.
You can also use jq to filter for details from inside the JSON data. For example, suppose you just want to know the GPS coordinates from this JSON file. In the preceding output, you can tell from the bracket character at the beginning that this JSON data is an array. The first value of the array is an object, since it starts with a brace character, and one of the keys of the object is GPSPosition.
To filter for GPSPosition, pass ".[0].GPSPosition" as an argument into the jq command, as follows:
micah@trapdoor Parler % cat metadata/meta-HS34fpbzqg2b.json | jq ".[0].GPSPosition"
"38 deg 53' 26.52\" N, 77 deg 0' 28.44\" W"
In this command, .[0] selects the first item of the array in the file named meta-HS34fpbzqg2b.json, and .GPSPosition selects the value with the key GPSPosition from the object. The output shows the value of the GPSPosition field, "38 deg 53' 26.52\" N, 77 deg 0' 28.44\" W".
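For comparison, the jq filter .[0].GPSPosition corresponds to indexing the parsed data in Python: [0] picks the first item of the list and ["GPSPosition"] looks up the key in the dictionary. The sketch below uses a shortened, hand-copied excerpt of the metadata shown above rather than the real file.

```python
import json

# A shortened excerpt of the metadata for the HS34fpbzqg2b video. The raw
# string (r"...") keeps the backslashes that escape the double quotes in JSON.
json_data = r"""
[
  {
    "GPSLatitude": "38 deg 53' 26.52\" N",
    "GPSLongitude": "77 deg 0' 28.44\" W",
    "GPSPosition": "38 deg 53' 26.52\" N, 77 deg 0' 28.44\" W"
  }
]
"""

obj = json.loads(json_data)

# Equivalent of the jq filter ".[0].GPSPosition"
gps_position = obj[0]["GPSPosition"]
print(gps_position)
```

Unlike jq, which prints the value as a JSON string with escaped quotes, Python prints the decoded string without the backslashes.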
If you’re interested in learning more about how to use jq, check out its website at https://stedolan.github.io/jq. You’ll also revisit it in Chapter 14, where I explain how I used it to understand the structure of leaked neo-Nazi chat logs.
Now that you have a foundational understanding of JSON, you’ll try your hand at writing Python code that works with it in Exercise 11-2.
Exercise 11-2: Write a Script to Filter for Videos with GPS from January 6, 2021
In this exercise, you’ll write a Python script that filters the Parler videos down to just those filmed on January 6, 2021, whose metadata includes GPS coordinates. You’ll do this by looping through all the JSON files in the dataset, converting them into Python objects, and inspecting their metadata to show you just the ones you’re looking for.
For a challenge, you can try programming your own script to meet the following requirements:
- Make this script accept an argument, parler_metadata_path, using Click. This will be the path to the metadata folder full of JSON files.
- Define a new variable called count that keeps track of the number of Parler videos that include GPS coordinates in their metadata, and set it to 0.
- Loop through all of the JSON files in the metadata folder. For each loop, your program should run the content of each JSON file through the json.loads() function to turn it into a Python object. As described in the “Parsing JSON with Python” section, each object is technically a list containing one element, a dictionary full of all of the video’s metadata.
- Check to see if that video’s metadata dictionary includes the key GPSCoordinates and if the date stored in the key CreateDate is January 6, 2021. If both of these are true, the script should display a message that this file includes GPS coordinates and is from January 6, 2021, and increment the count variable by 1.
- Have the program display a message after looping through all the metadata files that tells the user the total number of videos with GPS coordinates from January 6, 2021 (which should be stored in the count variable, now that you’re done counting).
Alternatively, follow along with the rest of this exercise and I’ll walk you through the programming process.
Accept the Parler Metadata Path as an Argument
Start with the usual Python script template:
def main():
    pass

if __name__ == "__main__":
    main()
Next, make the following modifications to your script so that it accepts the parler_metadata_path CLI argument. This way, when you run the script, you can pass in the path to the metadata folder as an argument, which the code will use to open all of the JSON files inside that folder:
import click

@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Filter Parler videos with GPS that were filmed Jan 6, 2021"""
    print(f"Parler metadata path: {parler_metadata_path}")

if __name__ == "__main__":
    main()
This code first imports the click module, then uses it to make the main() function accept the argument parler_metadata_path. It also adds a docstring to show what the script does when you run it with the --help argument. Finally, the print() function will print the value of parler_metadata_path to the screen.
Test your code to make sure it works so far, replacing the argument with the path to your own metadata folder:
micah@trapdoor chapter-11 % python3 exercise-11-2.py /Volumes/datasets/Parler/metadata
Parler metadata path: /Volumes/datasets/Parler/metadata
Sure enough, the code should display the same string, stored in parler_metadata_path, that you passed in as an argument.
Loop Through Parler Metadata Files
Next, add some code that will loop through all of the JSON files in the metadata folder and run json.loads() on their contents to convert them into structured data in Python. Modify your code as follows:
import click
import os
import json

@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Filter Parler videos with GPS that were filmed Jan 6, 2021"""
    for filename in os.listdir(parler_metadata_path):
        abs_filename = os.path.join(parler_metadata_path, filename)
        if os.path.isfile(abs_filename) and abs_filename.endswith(".json"):
            with open(abs_filename) as f:
                json_data = f.read()
            try:
                metadata = json.loads(json_data)
                print(f"Successfully loaded JSON: {filename}")
            except json.decoder.JSONDecodeError:
                print(f"Invalid JSON: {filename}")
                continue

if __name__ == "__main__":
    main()
The code imports the os and json modules at the top of the file so it can use the functions they contain later on. The program then loops through the return value of the os.listdir() function, which returns the list of files in the metadata folder, storing each filename in the variable filename.
Inside the for loop, the code defines a new variable called abs_filename to be the absolute path of the JSON file the code is working with each time it loops. It creates the absolute path by joining parler_metadata_path with filename using the os.path.join() function. Now that the code knows the full filename, it checks to make sure that this is actually a file, not a folder, and that it ends with .json.
If the code confirms the file is JSON, it loads all of the data from this file into the variable json_data and then converts that string into structured data, saved in the variable metadata, using try and except statements, as described in the “Handling Exceptions with JSON” section. If there are no syntax errors in an individual JSON file, the code displays a message to the screen saying that the file loaded successfully. Otherwise, it displays an error and moves on to the next file using the continue statement. In a for loop, a continue statement immediately ends the current iteration and moves on to the next one.
To summarize, at this point the code is looping through every file in the metadata folder, and for each JSON file it comes across, opening it and loading its content as a text string. It then converts this string into a Python object using the json.loads() function, storing the object in the metadata variable, and displays a message that it successfully loaded. If the file didn’t successfully load, the message says that the JSON was invalid, and the code continues on to the next JSON file.
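The same loop can also be written as a generator, which separates the work of finding and parsing files from whatever you do with each result. The following is an alternative sketch, not the approach the exercise uses: it swaps os.listdir() for pathlib's glob(), and the function name iter_metadata() is made up for illustration.

```python
import json
from pathlib import Path


def iter_metadata(parler_metadata_path):
    """Yield (filename, metadata) for each valid JSON file in the folder."""
    # glob("*.json") only returns names ending in .json; sorting makes the
    # order predictable, at the cost of listing the whole folder up front
    for path in sorted(Path(parler_metadata_path).glob("*.json")):
        if not path.is_file():
            continue  # skip folders that happen to end in .json
        try:
            metadata = json.loads(path.read_text())
        except json.decoder.JSONDecodeError:
            print(f"Invalid JSON: {path.name}")
            continue  # end this iteration and move on to the next file
        yield path.name, metadata
```

With this helper, the body of main() could shrink to a loop such as for filename, metadata in iter_metadata(parler_metadata_path), with the filtering logic inside it.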
Run the program again, replacing the argument with the path to your own metadata folder:
micah@trapdoor chapter-11 % python3 exercise-11-2.py /Volumes/datasets/Parler/metadata
Successfully loaded JSON: meta-gzK2iNatgLLr.json
Successfully loaded JSON: meta-31VC1ufihFpa.json
Successfully loaded JSON: meta-ZsZRse5JGx8j.json
--snip--
If your output shows many messages saying different JSON files loaded successfully, your code is working. Once you’ve determined that your output looks correct, you can press CTRL-C to cancel the script before it finishes running.
Filter for Videos with GPS Coordinates
Your code currently loops through all of the Parler metadata files, loads each file, and converts it into a Python object so you can work with it. Next, you need to filter for the videos that include GPS coordinates and count them. To do so, make the following modifications:
import click
import os
import json

@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Filter Parler videos with GPS that were filmed Jan 6, 2021"""
    count = 0
    for filename in os.listdir(parler_metadata_path):
        abs_filename = os.path.join(parler_metadata_path, filename)
        if os.path.isfile(abs_filename) and abs_filename.endswith(".json"):
            with open(abs_filename) as f:
                json_data = f.read()
            try:
                metadata = json.loads(json_data)
            except json.decoder.JSONDecodeError:
                print(f"Invalid JSON: {filename}")
                continue
            if "GPSCoordinates" in metadata[0]:
                print(f"Found GPS coordinates: {filename}")
                count += 1
    print(f"Total videos with GPS coordinates: {count:,}")

if __name__ == "__main__":
    main()
This code defines a new variable called count and starts its value out as 0. This will keep track of the number of videos with GPS coordinates. After each JSON file is loaded into the metadata variable, an if statement checks if the key GPSCoordinates exists inside this metadata dictionary. Remember from the previous section that metadata is a list with one item, making metadata[0] the actual dictionary your code is checking. If this video metadata does have the GPSCoordinates field, the control flow moves to the code block after the if statement. Otherwise, it moves on to the next iteration of the loop.
When the Python script comes across metadata that includes GPS coordinates, it displays the name of the file with print() and increments count by 1. This way, by the time this for loop is finished, count will contain the total number of videos that have GPS coordinates in their metadata. Finally, after the for loop completes, the code displays that total count with a second call to the print() function. As you learned in Chapter 8, the :, in the f-string will display larger numbers with comma separators.
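If you want to see the two key pieces of this step in isolation, the following sketch shows the in operator checking for a dictionary key and the :, format specifier adding comma separators. The metadata value here is a hand-built sample, shaped like what json.loads() returns for a Parler file, using one of the coordinate strings from the grep output earlier in this chapter.

```python
# A hand-built sample: a list holding one dictionary, like the parsed
# contents of a Parler metadata file
metadata = [{"GPSCoordinates": "22 deg 8' 0.60\" S, 51 deg 22' 4.80\" W"}]

has_gps = "GPSCoordinates" in metadata[0]      # True: the key exists
has_create_date = "CreateDate" in metadata[0]  # False: this sample lacks the key

count = 63983
total_message = f"Total videos with GPS coordinates: {count:,}"

print(has_gps, has_create_date)
print(total_message)  # Total videos with GPS coordinates: 63,983
```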
Run your program again:
micah@trapdoor chapter-11 % python3 exercise-11-2.py /Volumes/datasets/Parler/metadata
Found GPS coordinates: meta-31VC1ufihFpa.json
Found GPS coordinates: meta-ImUNiSXcoGKh.json
Found GPS coordinates: meta-70Tv9tAQUKyL.json
--snip--
Found GPS coordinates: meta-1FMyKoVq53TV.json
Found GPS coordinates: meta-Y0jO2wy1Z7RO.json
Found GPS coordinates: meta-aZlkDfPojhxW.json
Total videos with GPS coordinates: 63,983
Because this script loads the JSON data from over a million files, it might take a few minutes to finish running. In the end, your script should find 63,983 videos with GPS coordinates. There should also be 63,984 lines of output: one with the name of each metadata file that has GPS coordinates, and one at the end that lists the total.
Filter for Videos from January 6, 2021
Now you’ll whittle down that list of roughly 64,000 videos even further to find out which were filmed on January 6, 2021.
You can tell the date on which a video was filmed from the CreateDate field in its metadata, as shown earlier in Listing 11-1. The value of this field looks something like this:
"CreateDate": "2020:12:28 17:25:47",
To use the CreateDate field to filter the results further, make the following modifications to your code:
import click
import os
import json

@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Filter Parler videos with GPS that were filmed Jan 6, 2021"""
    count = 0
    for filename in os.listdir(parler_metadata_path):
        abs_filename = os.path.join(parler_metadata_path, filename)
        if os.path.isfile(abs_filename) and abs_filename.endswith(".json"):
            with open(abs_filename, "rb") as f:
                json_data = f.read()
            try:
                metadata = json.loads(json_data)
            except json.decoder.JSONDecodeError:
                print(f"Invalid JSON: {filename}")
                continue
            if (
                "GPSCoordinates" in metadata[0]
                and "CreateDate" in metadata[0]
                and metadata[0]["CreateDate"].startswith("2021:01:06")
            ):
                print(f"GPS + Jan 6: {filename}")
                count += 1
    print(f"Total videos with GPS coordinates, filmed Jan 6: {count:,}")

if __name__ == "__main__":
    main()
Rather than just checking for videos with GPS coordinates, now the code also checks for those that have a CreateDate that starts with 2021:01:06. Once the code determines that the metadata in the current loop has GPS coordinates and was created on January 6, 2021, it displays the filename with print(f"GPS + Jan 6: {filename}"). When the for loop is finished, it displays the total count.
The expression in this code’s if statement is surrounded by parentheses, and the three conditions inside those parentheses are indented. This is purely cosmetic; the code would work exactly the same if it were all on one line, but this formatting makes it slightly easier to read.
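The startswith() check treats CreateDate as plain text. As an alternative sketch, you could parse the value into a real date with datetime.strptime(), assuming every CreateDate follows the YYYY:MM:DD HH:MM:SS pattern shown above. The helper name was_created_on() is hypothetical, and this is not how the exercise script does it.

```python
from datetime import date, datetime


def was_created_on(metadata, target_date):
    """Return True if the CreateDate in metadata falls on target_date."""
    create_date = metadata[0].get("CreateDate")
    if create_date is None:
        return False
    try:
        # Assumed format, based on the sample value "2020:12:28 17:25:47"
        parsed = datetime.strptime(create_date, "%Y:%m:%d %H:%M:%S")
    except (TypeError, ValueError):
        # The value isn't a string or doesn't match the assumed format
        return False
    return parsed.date() == target_date


print(was_created_on([{"CreateDate": "2020:12:28 17:25:47"}], date(2021, 1, 6)))
```

The trade-off is strictness: a value that begins with 2021:01:06 but doesn't match the full pattern passes the startswith() check, while this version rejects it.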
You can find the final script in the book’s GitHub repo at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-11/exercise-11-2.py. Run the completed script like so:
micah@trapdoor chapter-11 % python3 exercise-11-2.py /Volumes/datasets/Parler/metadata
GPS + Jan 6: meta-xHkUeMHMFx3F.json
GPS + Jan 6: meta-eGqmDWzz0oSh.json
GPS + Jan 6: meta-WhQeLMyPWIrG.json
--snip--
GPS + Jan 6: meta-fhqU4rQ4ZFzO.json
GPS + Jan 6: meta-pTbZXLmXGyyn.json
GPS + Jan 6: meta-hL60MjItBhOW.json
Total videos with GPS coordinates, filmed Jan 6: 1,958
The script might still take a few minutes to run, but this time, there should be fewer results. Only 1,958 Parler videos have GPS coordinates and were filmed on January 6, 2021; this is about 3 percent of the videos with GPS coordinates, and less than 0.2 percent of all of the videos.
Watching almost 2,000 videos, while perhaps unpleasant, is at least feasible. We can still do better, though. In all likelihood, some of those January 6 videos weren’t actually filmed at the insurrection itself, but just happened to be filmed the same day in other locations. To prepare for filtering this list further in order to find videos filmed at the insurrection, you’ll need some background on working with GPS coordinates.
Working with GPS Coordinates
In this section, you’ll learn how latitude and longitude coordinates work and how to look them up on online map services like Google Maps. You’ll also learn how to convert between different GPS formats and measure the rough distance between two locations. I’ll introduce a few new Python features, including the split() and replace() methods for modifying strings and the float() function for converting a string into a decimal number.
Searching by Latitude and Longitude
You can define any location on Earth using two coordinates: latitude and longitude. These coordinates are measured in degrees, with each degree split into 60 minutes and each minute split into 60 seconds. Latitude goes from 90 degrees North, which is the North Pole, to 0 degrees at the equator, to 90 degrees South, which is the South Pole. Longitude goes from 180 degrees West, which is in the middle of the Pacific Ocean, to 0 degrees, which cuts through England, to 180 degrees East, back to that same location in the middle of the Pacific.
For example, if you look up the metadata for the Parler video with filename HS34fpbzqg2b (which shows Trump supporters removing barricades around the Capitol building while police officers stand by and watch), you’d find the following GPS coordinates:
Latitude: 38 deg 53′ 26.52″ N
Longitude: 77 deg 0′ 28.44″ W
That means this video was filmed at the latitude of 38 degrees, 53 minutes, 26.52 seconds North and the longitude of 77 degrees, 0 minutes, 28.44 seconds West.
You can use various online map services, like Google Maps, to search by GPS coordinates and see exactly where on Earth they point to. To search the coordinates contained in the Parler metadata, you’ll need to slightly modify them so that Google Maps will recognize them. Load https://www.google.com/maps and enter these coordinates as the string 38°53′26.52″, −77°0′28.44″. Try searching for those coordinates in Google Maps now. Figure 11-3 shows the exact location this video was filmed: just outside the US Capitol building, where police had set up barricades.
Figure 11-3: Pinpointing a location near the US Capitol building in Google Maps
You can also use Google Maps to discover the GPS coordinates of any given point. If you right-click anywhere on the map, a context menu should pop up showing you the GPS coordinates of that point. However, when you do this, the coordinates it shows you will look slightly different because they’ll be in decimal format.
In the next section, you’ll learn to convert from degrees, minutes, and seconds to decimals.
Converting Between GPS Coordinate Formats
GPS coordinates in decimal format show the number of degrees on the left side of the decimal point, and converted minutes and seconds values on the right side. For example, consider the GPS coordinates from the HS34fpbzqg2b video:
- The latitude is 38 degrees, 53 minutes, 26.52 seconds North, which is 38.8907 in decimal.
- The longitude is 77 degrees, 0 minutes, 28.44 seconds West, which is −77.0079 in decimal.
One degree is 60 minutes and one minute is 60 seconds, meaning there are 3,600 seconds in a degree. The formula to convert from degrees, minutes, and seconds to decimal format is degrees + (minutes / 60) + (seconds / 3,600). Latitudes are negative in the Southern Hemisphere but positive in the Northern Hemisphere, while longitudes are negative in the Western Hemisphere but positive in the Eastern Hemisphere. The latitude for the HS34fpbzqg2b video is positive, while the longitude is negative.
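As a quick check of the formula, the following sketch works through the arithmetic for the HS34fpbzqg2b video by hand, using the degrees, minutes, and seconds values given above. The variable names are only for illustration.

```python
# Latitude: 38 degrees, 53 minutes, 26.52 seconds North (positive)
latitude = 38 + (53 / 60) + (26.52 / 3600)

# Longitude: 77 degrees, 0 minutes, 28.44 seconds West (negative)
longitude = -(77 + (0 / 60) + (28.44 / 3600))

print(round(latitude, 4))   # 38.8907
print(round(longitude, 4))  # -77.0079
```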
Decimal numbers are simpler to work with in code. Since the GPS coordinates in the Parler metadata are formatted as degrees, minutes, and seconds, let’s use some Python code to convert them to decimal format. The gps_degrees_to_decimal() function in Listing 11-2 takes a GPS coordinate from the Parler metadata as an argument and returns the decimal version.
def gps_degrees_to_decimal(gps_coordinate):
    parts = gps_coordinate.split()
    degrees = float(parts[0])
    minutes = float(parts[2].replace("'", ""))
    seconds = float(parts[3].replace('"', ""))
    hemisphere = parts[4]
    gps_decimal = degrees + (minutes / 60) + (seconds / 3600)
    if hemisphere == "W" or hemisphere == "S":
        gps_decimal *= -1
    return gps_decimal
Listing 11-2: The gps_degrees_to_decimal() function
This function introduces some new Python features. First, the split() string method splits a string into a list of parts based on whitespace. For example, this method would convert the string '77 deg 0\' 28.44" W' into the list of strings ['77', 'deg', "0'", '28.44"', 'W']. The line parts = gps_coordinate.split() stores the return value of gps_coordinate.split() into the parts variable. If you passed that string into this function as gps_coordinate, this would mean the following:
- parts[0] is the string 77.
- parts[1] is the string deg.
- parts[2] is the string 0' (0 followed by a single quote).
- parts[3] is the string 28.44" (28.44 followed by a double quote).
- parts[4] is the string W.
Before you can do math with numbers stored as strings in Python, you must convert them into floating-point numbers (which are just numbers that can contain decimals) using the float() function. Listing 11-2 uses float() to set the value of degrees to the floating-point version of parts[0]. In this case, it converts the value of the string 77 in gps_coordinate to the floating-point number 77.0.
The next line of code similarly uses the replace() string method to convert the minutes value to a floating-point number. This method searches the string for the first argument and replaces it with the second argument. For example, "GPS is fun".replace("fun", "hard") returns the string GPS is hard. When you run parts[2].replace("'", ""), you’re replacing the single quote character (') with an empty string, in order to delete that character. This would convert the string 0' from gps_coordinate to 0 and then convert 0 to the floating-point number 0.0.
The next line uses replace() to delete the double quote character ("), converting the string 28.44" from gps_coordinate to 28.44, then converting that into the floating-point number 28.44 and saving it as seconds.
The rest of the function is more straightforward. It defines the variable gps_decimal as the decimal version of the GPS coordinates that are passed in as an argument, using the formula to convert the coordinates to decimal format using the numbers in degrees, minutes, and seconds. If the coordinates are in the Western or Southern Hemisphere, the code gps_decimal *= -1 makes gps_decimal a negative number. Finally, the function returns gps_decimal, the decimal version of the GPS coordinates.
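To watch these string methods work one step at a time, you could run the following sketch, which applies split(), replace(), and float() to the longitude of the HS34fpbzqg2b video outside of any function.

```python
gps_coordinate = "77 deg 0' 28.44\" W"

# split() breaks the string apart on whitespace
parts = gps_coordinate.split()
print(parts)  # ['77', 'deg', "0'", '28.44"', 'W']

# replace() deletes the quote characters so float() can parse the numbers
degrees = float(parts[0])                   # 77.0
minutes = float(parts[2].replace("'", ""))  # 0.0
seconds = float(parts[3].replace('"', ""))  # 28.44
hemisphere = parts[4]                       # 'W'

print(degrees, minutes, seconds, hemisphere)
```

Calling float() on a string that still contains a quote character, such as "0'", raises a ValueError, which is why the quotes have to be removed first.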
Since the GPS coordinates in the Parler data come in strings of degrees, minutes, and seconds, you’ll use the gps_degrees_to_decimal() function in the next exercise to convert them to decimal format. First, though, you’ll need to know how to calculate distances between two GPS coordinates.
Calculating GPS Distance in Python
To determine which Parler videos were filmed in Washington, DC, based on their GPS coordinates, you can begin by finding the coordinates for the center point of the city and then imagine a circle around that point. You can consider a video to have been filmed in the city if its metadata has both a longitude and latitude within that circle. This won’t tell you if the video was exactly filmed within the Washington, DC, city limits, but it’s close enough. In this section, I’ll review the simple math required to do this calculation.
The Earth isn’t flat, but for the purposes of this chapter, pretend that Washington, DC, is a flat plane. You can think of GPS coordinates as a 2D point on a Cartesian coordinate system, where longitude represents the x axis (East and West) and latitude represents the y axis (North and South). Since you can look up the coordinates of the center of Washington, DC, and you know the coordinates for where each video was filmed, you can use the distance formula to determine if it’s inside the circle.
The distance formula, as you might recall from geometry class, is used to calculate the distance between two points. It states that the distance between two points is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$, where $(x_1, y_1)$ is one point and $(x_2, y_2)$ is another point. As an example, Figure 11-4 shows the distance between the White House and the US Capitol, with the White House at point $(x_1, y_1)$ and the US Capitol at point $(x_2, y_2)$.
Figure 11-4: Using the distance formula to calculate the distance between the White House and the US Capitol
To determine if a given Parler video was filmed in Washington, DC, you’ll compare the city center with the GPS coordinates of a Parler video. The center point of DC is constant, and when you loop through the JSON files of Parler metadata, you can find all the relevant GPS coordinates. If you plug these points into the distance formula, you can determine whether the distance is close enough to the center to be considered inside the city.
NOTE Since the Earth isn’t actually flat, using the distance formula will only be relatively accurate for short distances, like 20 kilometers. It’s possible to calculate much more accurate distances between GPS coordinates using spherical geometry, but that requires using trigonometry functions like sine, cosine, and arctangent. Using the distance formula is much simpler and accurate enough for our purposes.
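For the curious, here is a sketch of the kind of spherical calculation the note alludes to, using the haversine formula. This chapter doesn't use it; it's included only to show what the trigonometry looks like. The function name haversine_km() is made up for illustration, and the Earth radius of 6,371 km is an approximate mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Return the approximate distance in kilometers between two points
    given as decimal latitude and longitude."""
    # Trigonometry functions in Python work in radians, not degrees
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    a = min(1.0, a)  # guard against tiny floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))
```

Unlike the flat-plane distance formula, this version returns kilometers rather than degrees, so any threshold you compare it against would need to be in kilometers too.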
Listing 11-3 shows a Python distance() function that implements the distance formula.
import math

def distance(x1, y1, x2, y2):
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
Listing 11-3: The distance() function
The distance formula requires you to calculate a square root, which you can do using Python’s math.sqrt() function. To access this function, first you import the math module at the top of the file. The distance() function takes the x1, y1, x2, and y2 arguments, then calculates the distance formula, returning the distance between the two points. (In Python, ** is the power operator, so we write x² as x**2.) If you call distance() and pass any two points into it as arguments, it will return the distance between them.
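As a usage sketch, the following calls distance() twice: once with a 3-4-5 right triangle, which makes the result easy to verify by hand, and once with real coordinates. The second call uses the decimal coordinates of the HS34fpbzqg2b video and the city-center point that the Exercise 11-3 script uses later in this chapter. Remember that longitude is the x value and latitude is the y value.

```python
import math


def distance(x1, y1, x2, y2):
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5
print(distance(0, 0, 3, 4))  # 5.0

# Distance, in degrees, between the city-center point and the video location
dc_x, dc_y = -77.0066, 38.8941
video_x, video_y = -77.0079, 38.8907
degrees_apart = distance(dc_x, dc_y, video_x, video_y)
print(degrees_apart)
```

The second result is a small fraction of a degree, since the video was filmed just outside the Capitol building.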
Finding the Center of Washington, DC
Now you’ll find the coordinates of the center of Washington, DC, so that you can use the distance formula to compare them against those from a Parler video. Load https://www.google.com/maps in your browser and search for Washington DC. Right-click the US Capitol building, which is approximately at the center of the city. Google Maps should show you the GPS coordinates of that point (see Figure 11-5); click them to copy them. Your GPS coordinates might be slightly different, depending on where exactly you clicked.
Figure 11-5: Using Google Maps to find the GPS coordinates of the center of Washington, DC
If the radius of the imaginary circle around Washington, DC, is about 20 kilometers, you can consider any videos filmed within 0.25 degrees to be inside the city. I decided on 0.25 degrees by checking the GPS coordinates on the outskirts of DC and comparing them to the coordinates in the city center.
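To get a feel for how degrees relate to kilometers, you can use the rule of thumb that one degree of latitude is roughly 111 kilometers, while one degree of longitude shrinks by the cosine of the latitude. The sketch below applies that approximation to the 0.25-degree radius; the function name degrees_to_km() and the 111.2 constant are assumptions for illustration, not values from this chapter.

```python
import math

# Approximate length of one degree of latitude, based on a mean Earth
# radius of about 6,371 km
KM_PER_DEGREE_LATITUDE = 111.2


def degrees_to_km(degrees, latitude):
    """Return (north_south_km, east_west_km) for a span of degrees
    measured at the given latitude."""
    north_south = degrees * KM_PER_DEGREE_LATITUDE
    east_west = north_south * math.cos(math.radians(latitude))
    return north_south, east_west


north_south, east_west = degrees_to_km(0.25, 38.8941)
print(round(north_south, 1), round(east_west, 1))
```

By this estimate, 0.25 degrees covers about 28 kilometers north to south and about 22 kilometers east to west at the latitude of Washington, DC, so the imaginary circle is really a slightly squashed oval. That is still close enough to the rough 20-kilometer figure for filtering videos.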
Armed with the gps_degrees_to_decimal() and distance() Python functions and the GPS coordinates for the center of Washington, DC, you’re ready to finish filtering the Parler videos to find the insurrection videos in Exercise 11-3.
Exercise 11-3: Update the Script to Filter for Insurrection Videos
In this exercise, you’ll filter the results of the Exercise 11-2 script even further, searching just for videos filmed in Washington, DC. First, make a copy of exercise-11-2.py and rename it exercise-11-3.py. Now modify exercise-11-3.py to match the following code:
import click
import os
import json
import math

def gps_degrees_to_decimal(gps_coordinate):
    parts = gps_coordinate.split()
    degrees = float(parts[0])
    minutes = float(parts[2].replace("'", ""))
    seconds = float(parts[3].replace('"', ""))
    hemisphere = parts[4]
    gps_decimal = degrees + (minutes / 60) + (seconds / 3600)
    if hemisphere == "W" or hemisphere == "S":
        gps_decimal *= -1
    return gps_decimal

def distance(x1, y1, x2, y2):
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

def was_video_filmed_in_dc(metadata):
    # Center of Washington, DC: longitude is x, latitude is y
    dc_x = -77.0066
    dc_y = 38.8941
    x = gps_degrees_to_decimal(metadata[0]["GPSLongitude"])
    y = gps_degrees_to_decimal(metadata[0]["GPSLatitude"])
    # Within 0.25 degrees of the center counts as inside the city
    return distance(dc_x, dc_y, x, y) <= 0.25

@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Filter Parler videos that were filmed in Washington DC and on Jan 6, 2021"""
    count = 0
    for filename in os.listdir(parler_metadata_path):
        abs_filename = os.path.join(parler_metadata_path, filename)
        if os.path.isfile(abs_filename) and abs_filename.endswith(".json"):
            with open(abs_filename, "rb") as f:
                json_data = f.read()
            try:
                metadata = json.loads(json_data)
            except json.decoder.JSONDecodeError:
                print(f"Invalid JSON: {filename}")
                continue
            if (
                "GPSLongitude" in metadata[0]
                and "GPSLatitude" in metadata[0]
                and "CreateDate" in metadata[0]
                and metadata[0]["CreateDate"].startswith("2021:01:06")
                and was_video_filmed_in_dc(metadata)
            ):
                print(f"Found an insurrection video: {filename}")
                count += 1
    print(f"Total videos filmed in Washington DC on January 6: {count:,}")

if __name__ == "__main__":
    main()
This code first defines the gps_degrees_to_decimal()
function from Listing
11-2 and the distance()
function from Listing 11-3, importing the required math
module at the top of the file. It will
later use gps_degrees_to _decimal()
to convert GPS coordinates from the
Parler video metadata into decimal format and distance()
to calculate the distance between that
GPS coordinate and the center of Washington, DC.
Next, the code defines the was_video_filmed_in_dc()
function. This function takes a single
argument, metadata
, which
contains the Parler video metadata loaded from its JSON file. It returns
True
if the GPS coordinates
in that metadata are located inside Washington, DC, but otherwise
returns False
.
The was_video_filmed_in_dc()
function first defines the x and y coordinates you found for the
city center in the variables dc_x
and dc_y
. Next, it defines the x and y
coordinates of the Parler video, storing those values in the variables
x
and y
. Since the GPS coordinates in the
GPSLongitude
and
GPSLatitude
metadata fields
aren’t in decimal format, it first passes those strings into the
gps_degrees_to_decimal()
function to convert them from degrees, minutes, and seconds into
decimals and then saves the return values into x
and y
.
Finally, was_video_filmed_in_dc()
calls the distance()
function to determine the distance
between these two points. The return value is this expression:
distance(dc_x, dc_y, x, y) <= 0.25
The distance() function returns a number representing the distance between the center of Washington, DC, and the location where the video was filmed. If that number is less than or equal to 0.25 (roughly 20 kilometers), the expression evaluates to True; otherwise, it evaluates to False. Thus, the was_video_filmed_in_dc() function returns a Boolean.
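To see the threshold check in action, here is a minimal sketch that runs the same conversion and comparison on one pair of coordinate strings. It assumes that distance() is a plain straight-line calculation on decimal degrees, and the DC_X and DC_Y values are approximate coordinates for Washington, DC, chosen for illustration; the values you found earlier in the chapter may differ. The within_dc_threshold() helper is hypothetical and isn't part of the exercise script.

```python
import math


def gps_degrees_to_decimal(gps_coordinate):
    # Convert a string like: 38 deg 53' 26.52" N  into a signed decimal
    parts = gps_coordinate.split()
    degrees = float(parts[0])
    minutes = float(parts[2].replace("'", ""))
    seconds = float(parts[3].replace('"', ""))
    hemisphere = parts[4]
    gps_decimal = degrees + (minutes / 60) + (seconds / 3600)
    if hemisphere in ("W", "S"):
        gps_decimal *= -1
    return gps_decimal


def distance(x1, y1, x2, y2):
    # Assumption: straight-line distance between two points, in degrees
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


# Assumption: approximate center of Washington, DC (illustrative values)
DC_X = -77.0369
DC_Y = 38.9072


def within_dc_threshold(longitude_string, latitude_string, threshold=0.25):
    # Hypothetical helper mirroring the check in was_video_filmed_in_dc()
    x = gps_degrees_to_decimal(longitude_string)
    y = gps_degrees_to_decimal(latitude_string)
    return distance(DC_X, DC_Y, x, y) <= threshold


# Coordinates in the format exiftool produces, near the US Capitol
print(within_dc_threshold("77 deg 0' 28.44\" W", "38 deg 53' 26.52\" N"))
# Coordinates in Los Angeles, far outside the threshold
print(within_dc_threshold("118 deg 14' 37.00\" W", "34 deg 3' 8.00\" N"))
```

The first call should report that the point falls inside the threshold, and the second should report that it doesn't.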
With these functions defined at the top of the file, the remaining changes to the script are minimal. The code updates the docstring, since our script’s purpose has changed. It also updates the if statement that checks whether or not an insurrection video was found. The version of this script from Exercise 11-2 just checked if the metadata included a GPSCoordinates field, but now it checks for the fields GPSLongitude and GPSLatitude instead. The videos with GPS coordinates contain all three of these fields. GPSCoordinates is just a single field that contains both longitude and latitude. However, since you need separate values for longitude and latitude, it’s simpler to use the metadata fields that are already separated. Finally, the if statement confirms that the video was filmed in Washington, DC, by calling was_video_filmed_in_dc(metadata).
If all of these conditions are true (the metadata contains GPSLongitude and GPSLatitude; the metadata contains CreateDate with a value matching January 6, 2021; and the GPS coordinates in the metadata show that the video was filmed in Washington, DC), then the code displays a message saying it found an insurrection video and increments count. Finally, after the script has finished looping through all of the Parler metadata files, it displays the total number of insurrection videos found.
You can find the final script in the book’s GitHub repo at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-11/exercise-11-3.py. Run your complete script now, making sure to pass in the correct path to your Parler metadata folder:
micah@trapdoor chapter-11 % python3 exercise-11-3.py /Volumes/datasets/Parler/metadata
Found an insurrection video: meta-QPsyYtwu4zJb.json
Found an insurrection video: meta-Hcv3lzEsnWaa.json
Found an insurrection video: meta-6dDTCsYzK3k3.json
--snip--
Found an insurrection video: meta-eLSgf3w5r4PI.json
Found an insurrection video: meta-goL0HLdYn3Pb.json
Found an insurrection video: meta-a7DW37R386K3.json
Total videos filmed in Washington DC on January 6: 1,202
The script should find 1,202 insurrection videos. This means that out of the 1,958 videos uploaded to Parler on January 6 that included GPS coordinates, at least 61 percent were videos of the insurrection itself. (It’s possible that more videos uploaded to Parler were also from the insurrection that day but just didn’t include GPS coordinates in their metadata.) Manually watching 1,202 Parler videos is still unpleasant, but at least it’s not as bad as watching 1,958.
PROPUBLICA’S PARLER DATABASE
Using the metadata as a starting point, as you’ve done so far in this chapter, 36 journalists at ProPublica did in fact watch thousands of insurrection videos from this dataset in late January 2021. The nonprofit newsroom published an interactive database of newsworthy Parler videos related to the January 6 attack. “ProPublica reviewed thousands of videos uploaded publicly to the service that were archived by a programmer before Parler was taken offline by its web host,” states the project’s website at https://projects.propublica.org/parler-capitol-videos/. The project included over 500 videos that “ProPublica determined were taken during the events of Jan. 6 and were relevant and newsworthy.” Readers could see what was happening during the insurrection at any point in time that day, and ProPublica organized the videos into the categories Around DC, Near Capitol, and Inside Capitol.
You now know which of the Parler videos were from the January 6 insurrection, but you can draw even more interesting conclusions from this dataset (and others that contain similar location data) when you visualize the data on a map.
Plotting GPS Coordinates on a Map with simplekml
Rather than just displaying a list of insurrection video filenames, you could plot the locations of those videos on a map, allowing you to easily choose which videos you’d like to watch first. You could also map all Parler videos that contain GPS coordinates around the world, in case there are other newsworthy videos in this dataset that don’t relate to the January 6 insurrection. In this section, you’ll learn to write Python code to create a file of Parler location data that you can then upload to an online map service to visualize it.
Google Earth (https://earth.google.com) allows you to upload a file in Keyhole Markup Language (KML), a file format designed to describe geographical features such as points on a map. KML was created in 2004 specifically for use with Google Earth, and it became a standard file format for describing geographic data in 2008.
Listing 11-4 shows an example KML file.
<?xml version="1.0" ?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Document id="1">
<Placemark id="3">
<name>New York City</name>
<description>The Big Apple</description>
<Point id="2">
<coordinates>-74.006393,40.714172,0.0</coordinates>
</Point>
</Placemark>
</Document>
</kml>
Listing 11-4: A file written in KML, example.kml
As you can see, the KML format is similar to HTML: both are markup languages built from nested tags in angle brackets. KML is based on XML, or Extensible Markup Language, so it follows XML’s rules. The first line, starting with <?xml, is called the XML prolog, and it defines some metadata about this file. The entire contents of the KML file are wrapped in a <kml> tag. Inside this is a <Document> tag, and inside this are one or more <Placemark> tags. Each <Placemark> represents a point on a map: its name, description, and GPS coordinates in decimal format. This example file describes a single point for New York City.
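Because KML is XML, you can also read it back with Python’s built-in xml.etree.ElementTree module. The following is a minimal sketch, not part of the book’s exercises, that parses the contents of Listing 11-4 and pulls out each placemark’s name, description, longitude, and latitude. The read_placemarks() function name is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# The KML namespace declared in the <kml> tag of Listing 11-4
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

EXAMPLE_KML = """<?xml version="1.0" ?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
    <Document id="1">
        <Placemark id="3">
            <name>New York City</name>
            <description>The Big Apple</description>
            <Point id="2">
                <coordinates>-74.006393,40.714172,0.0</coordinates>
            </Point>
        </Placemark>
    </Document>
</kml>"""


def read_placemarks(kml_text):
    # Hypothetical helper: return (name, description, lon, lat) tuples
    root = ET.fromstring(kml_text)
    placemarks = []
    for placemark in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        description = placemark.findtext(
            "kml:description", default="", namespaces=KML_NS
        )
        coordinates = placemark.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        )
        # KML stores coordinates as longitude,latitude,altitude
        lon, lat = [float(value) for value in coordinates.split(",")[:2]]
        placemarks.append((name, description, lon, lat))
    return placemarks


for name, description, lon, lat in read_placemarks(EXAMPLE_KML):
    print(f"{name} ({description}): {lon}, {lat}")
```

Note that KML lists longitude before latitude, which is the reverse of how coordinates are often written.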
To plot GPS coordinates on Google Earth, you must generate a KML file that contains these coordinates and then upload it to the service. The simplest way to create KML files is by using the simplekml Python module. You can use this module to create a new KML object, create a new point on it for each Parler video with GPS coordinates, and then save that KML object to a .kml file.
Install the simplekml module by running the following command:
python3 -m pip install simplekml
Now use the module in the Python interpreter to generate the example .kml file from Listing 11-4:
>>> import simplekml
>>> kml = simplekml.Kml()
>>> kml.newpoint(name="New York City", description="The Big Apple", coords=[(-74.006393, 40.714172)])
<simplekml.featgeom.Point object at 0x101241cc0>
>>> kml.save("example.kml")
After importing the simplekml module, this code defines the value of the kml variable as the output of simplekml.Kml(), which returns a KML object. It then uses the kml.newpoint() method to add GPS points to the KML file it’s creating. While this example just adds one point for New York City, with the description “The Big Apple,” you can add as many points as you want. Note that the value of the coords argument must be a list of tuples, with each tuple containing longitude and latitude coordinates in decimal format. Finally, after adding points, the code saves the KML file by calling kml.save() and passing it an output filename.
You can find further documentation for the simplekml Python module at https://simplekml.readthedocs.io.
ALTERNATIVES TO GOOGLE EARTH
There are many different ways to plot GPS points, including alternative online services like MapBox (https://www.mapbox.com), which allows you to upload a CSV of GPS coordinates to generate points on a map and even embed that map into articles on your website.
In future projects, you may need to visualize sensitive geographic data without sharing it with a third-party service like Google Earth or MapBox. The free and open source desktop software QGIS (https://qgis.org) allows you to create maps locally on your computer, though it’s pretty complicated to use. You can also write Python code that pulls data from OpenStreetMap (https://www.openstreetmap.org), a vast and completely free and open source mapping resource that allows you to create geographic images with GPS points on them. These options aren’t as simple as using online tools, and explaining how they work is beyond the scope of this book.
You don’t necessarily need GPS coordinates in your dataset to visualize location data on a map. If you have addresses, or even just city names or postal codes, you could convert that information to GPS coordinates and then plot those on a map. You could do the same with IP addresses, converting them to their rough GPS locations.
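If you’d rather hand a CSV of coordinates to a mapping tool, as mentioned above, Python’s built-in csv module can produce one. This is only a rough sketch: the write_points_csv() function and the name, longitude, and latitude column headers are assumptions for illustration, so check the documentation of whichever service you use for the column names it actually expects.

```python
import csv
import io


def write_points_csv(points, output):
    # Hypothetical helper: write (name, lon, lat) tuples as CSV rows.
    # The header names are assumptions, not a format any service requires.
    writer = csv.writer(output)
    writer.writerow(["name", "longitude", "latitude"])
    for name, lon, lat in points:
        writer.writerow([name, lon, lat])


points = [("New York City", -74.006393, 40.714172)]

# Write to an in-memory buffer here; pass a file opened with
# open("points.csv", "w", newline="") to write to disk instead
buffer = io.StringIO()
write_points_csv(points, buffer)
print(buffer.getvalue())
```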
You now know how to create KML files full of location data that can be mapped in Google Earth. As your final exercise in this chapter, you’ll generate KML files based on GPS coordinates in the Parler dataset.
Exercise 11-4: Create KML Files to Visualize Location Data
So far, we’ve focused on finding Parler videos filmed in Washington, DC, during the January 6 insurrection. While this is undoubtedly the most newsworthy part of this dataset, there could be other things we’re missing. Parler is a global far-right social network. What other far-right videos did people post to it? Does it contain any interesting data from other countries, such as Russia? In this exercise, you’ll write a script that creates two KML files full of GPS coordinates from the Parler dataset to visualize in Google Earth:
- A parler-videos-all.kml file containing all videos with GPS coordinates
- A parler-videos-january6.kml file containing videos with GPS coordinates filmed on January 6, 2021
This exercise will give you experience creating KML files and using Google Earth to visualize location data, a skill that will likely come in handy for any future dataset you come across that includes location data.
You’ll base your script for this exercise off the script you wrote in Exercise 11-3. For a challenge, you can try programming your own script to meet the following requirements:
- Make this script accept an argument, parler_metadata_path, using Click. This will be the path to the metadata folder full of JSON files.
- Import the simplekml module and create two KML objects (one for each KML file you’ll be creating). Loop through the Parler video metadata JSON files, and add different points to the appropriate KML objects depending on the metadata. Points for all videos should be added to parler-videos-all.kml, and points only for videos with the CreateDate of January 6, 2021, should be added to parler-videos-january6.kml.
- Give every point you add to a KML object a name, a description, and GPS coordinates in decimal format. The name should be the Parler video ID (for example, HS34fpbzqg2b), and the description should be a string containing the video’s download link (for example, https://data.ddosecrets.com/Parler/Videos/HS34fpbzqg2b) as well as important metadata fields such as CreateDate, FileTypeExtension, or others you’re interested in.
- Make your script loop through all of the metadata JSON files and filter them for videos that contain GPS coordinates.
Alternatively, follow along with the instructions in the rest of this exercise.
Create a KML File for All Videos with GPS Coordinates
You’ll begin by writing a script to loop through all of the Parler metadata JSON files and add any GPS coordinates it finds to a single KML file, parler-videos-all.kml, including only the video URL in the description, not any metadata. Make a copy of the exercise-11-3.py script and name it exercise-11-4.py, then make the following modifications:
import click
import os
import json
import simplekml
def json_filename_to_parler_id(json_filename):
    # "meta-31VC1ufihFpa.json" -> "31VC1ufihFpa"
    return json_filename.split("-")[1].split(".")[0]
def gps_degrees_to_decimal(gps_coordinate):
    parts = gps_coordinate.split()
    degrees = float(parts[0])
    # Strip the minute (') and second (") marks before converting to floats
    minutes = float(parts[2].replace("'", ""))
    seconds = float(parts[3].replace('"', ""))
    hemisphere = parts[4]
    gps_decimal = degrees + (minutes / 60) + (seconds / 3600)
    if hemisphere == "W" or hemisphere == "S":
        gps_decimal *= -1
    return gps_decimal
@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Create KML files of GPS coordinates from Parler metadata"""
    kml_all = simplekml.Kml()
    for filename in os.listdir(parler_metadata_path):
        abs_filename = os.path.join(parler_metadata_path, filename)
        if os.path.isfile(abs_filename) and abs_filename.endswith(".json"):
            with open(abs_filename) as f:
                json_data = f.read()
            try:
                metadata = json.loads(json_data)
            except json.decoder.JSONDecodeError:
                print(f"Invalid JSON: {filename}")
                continue
            # Only map videos whose GPS fields exist and aren't blank
            if (
                "GPSLongitude" in metadata[0]
                and "GPSLatitude" in metadata[0]
                and metadata[0]["GPSLongitude"] != ""
                and metadata[0]["GPSLatitude"] != ""
            ):
                name = json_filename_to_parler_id(filename)
                description = f"URL: https://data.ddosecrets.com/Parler/Videos/{name}"
                lon = gps_degrees_to_decimal(metadata[0]["GPSLongitude"])
                lat = gps_degrees_to_decimal(metadata[0]["GPSLatitude"])
                print(f"Adding point {name} to kml_all: {lon}, {lat}")
                kml_all.newpoint(name=name, description=description, coords=[(lon, lat)])
    kml_all.save("parler-videos-all.kml")
if __name__ == "__main__":
    main()
Since you’re going to be mapping this data, you don’t need the code that detects if a video is in Washington, DC, because you’ll be able to tell by zooming into Washington, DC. Therefore, this code deletes the distance() and was_video_filmed_in_dc() functions from the previous script, as well as the math import. The new code imports the simplekml module at the top of the file so that you can use it later in the script.
Next, the code defines the function json_filename_to_parler_id(). This function is only a single, complex line of code that takes the filename of a Parler metadata JSON file as an argument, then returns the Parler ID associated with that file. For example, say the value of json_filename is meta-31VC1ufihFpa.json. In this case, the expression json_filename.split("-") will evaluate to the list ['meta', '31VC1ufihFpa.json']. Since Python starts counting at zero, the code selects the second item in that list (the string 31VC1ufihFpa.json) by adding [1] to that expression, making it json_filename.split("-")[1]. Next, it splits that string on the period character with the expression json_filename.split("-")[1].split("."), which returns the list ['31VC1ufihFpa', 'json']. It then selects the first item in that list (the string 31VC1ufihFpa) by adding [0] to that expression, making it json_filename.split("-")[1].split(".")[0]. The json_filename_to_parler_id() function just returns the result of that expression, which is the Parler ID.
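You can watch each step of that expression evaluate by breaking it into pieces. The sketch below also includes a hypothetical variant, parler_id_from_filename(), which isn’t part of the exercise script; it assumes you might someday meet an ID that itself contains a hyphen or period, in which case the one-line version would cut the ID short.

```python
def json_filename_to_parler_id(json_filename):
    return json_filename.split("-")[1].split(".")[0]


json_filename = "meta-31VC1ufihFpa.json"

step1 = json_filename.split("-")   # ['meta', '31VC1ufihFpa.json']
step2 = step1[1]                   # '31VC1ufihFpa.json'
step3 = step2.split(".")           # ['31VC1ufihFpa', 'json']
step4 = step3[0]                   # '31VC1ufihFpa'
print(step1, step2, step3, step4)


def parler_id_from_filename(json_filename):
    # Hypothetical variant: split only on the last period and the first
    # hyphen, so hyphens or periods inside an ID would be preserved
    stem = json_filename.rsplit(".", 1)[0]
    return stem.split("-", 1)[1]


print(json_filename_to_parler_id(json_filename))
print(parler_id_from_filename(json_filename))
```

For the filenames shown in this chapter, both functions return the same ID.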
In the main() function, the code defines a new KML object called kml_all to contain all the GPS points found in the Parler metadata. The rest of this code should be familiar to you from Exercises 11-2 and 11-3. It loops through the Parler metadata folder looking for JSON files, loading the JSON data for each file it finds into the metadata variable. This time, the if statement ensures that the metadata dictionary contains the keys GPSLongitude and GPSLatitude and that those values aren’t blank.
When the code finds a Parler video that contains non-empty GPS fields, it sets up variables with the data it needs to add the point to the KML files: name, description, lon, and lat. It defines name as the return value of the json_filename_to_parler_id() function, meaning the name of the point will be the video’s Parler ID. It defines description as the video’s download URL. Using the gps_degrees_to_decimal() function, it defines lon and lat as the longitude and latitude, in decimal format, of the GPS coordinates found in the metadata.
After defining these variables, the code runs kml_all.newpoint() to add the GPS point to the KML object. It sets the point’s name to name, its description to description, and its coordinates to a list of points; in this case, the list has only one point, a tuple containing lon and lat. Finally, when the for loop is complete, the code calls the kml_all.save() function to save all of these GPS points into the file parler-videos-all.kml.
Run the final script, changing the path in the argument to the path to your Parler metadata folder:
micah@trapdoor chapter-11 % python3 exercise-11-4.py /Volumes/datasets/Parler/metadata
Adding point 2XpiJFsho2do to kml_all: -117.6683, 33.490500000000004
Adding point bcHZhpDOFnXd to kml_all: -1.3391, 52.04648888888889
--snip--
Since the Parler dataset contains about 64,000 videos with GPS coordinates, the script should return about 64,000 lines of output, each including a video’s Parler ID, longitude, and latitude. When the script finishes running, it should also create a 20MB KML file called parler-videos-all.kml in the same folder as the script.
Open parler-videos-all.kml in a text editor. The file’s contents should look like this:
<?xml version="1.0" ?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Document id="1">
<Placemark id="3">
<name>2XpiJFsho2do</name>
<description>URL:https://data.ddosecrets.com/Parler/Videos/2XpiJFsho2do</description>
<Point id="2">
<coordinates>-117.6683,33.490500000000004,0.0</coordinates>
</Point>
</Placemark>
<Placemark id="5">
<name>bcHZhpDOFnXd</name>
<description>URL:https://data.ddosecrets.com/Parler/Videos/bcHZhpDOFnXd</description>
<Point id="4">
<coordinates>-1.3391,52.04648888888889,0.0</coordinates>
</Point>
</Placemark>
--snip--
This file should contain about 64,000 <Placemark> tags, each representing a different Parler video with GPS coordinates.
Now that you’ve created a KML file that contains all of the Parler location data, you’ll modify your script further to create a KML file with just the videos from January 6, 2021.
Create KML Files for Videos from January 6, 2021
Your script so far has a KML object called kml_all, and the code adds all of the GPS points in the Parler metadata to it. Make the following changes to your code to create another KML object, kml_january6, and just add GPS points from videos filmed on January 6, 2021, to it. Since this script is getting long, I’ll quote just the main() function, the only part that is modified:
@click.command()
@click.argument("parler_metadata_path")
def main(parler_metadata_path):
    """Create KML files of GPS coordinates from Parler metadata"""
    kml_all = simplekml.Kml()
    kml_january6 = simplekml.Kml()
    for filename in os.listdir(parler_metadata_path):
        abs_filename = os.path.join(parler_metadata_path, filename)
        if os.path.isfile(abs_filename) and abs_filename.endswith(".json"):
            with open(abs_filename, "rb") as f:
                json_data = f.read()
            try:
                metadata = json.loads(json_data)
            except json.decoder.JSONDecodeError:
                print(f"Invalid JSON: {filename}")
                continue
            if (
                "GPSLongitude" in metadata[0]
                and "GPSLatitude" in metadata[0]
                and metadata[0]["GPSLongitude"] != ""
                and metadata[0]["GPSLatitude"] != ""
            ):
                name = json_filename_to_parler_id(filename)
                # Start with the download URL, then append selected metadata
                description = f"URL: https://data.ddosecrets.com/Parler/Videos/{name}<br>"
                for key in [
                    "CreateDate",
                    "FileTypeExtension",
                    "Duration",
                    "Make",
                    "Model",
                    "Software",
                ]:
                    if key in metadata[0]:
                        description += f"{key}: {metadata[0][key]}<br>"
                lon = gps_degrees_to_decimal(metadata[0]["GPSLongitude"])
                lat = gps_degrees_to_decimal(metadata[0]["GPSLatitude"])
                print(f"Adding point {name} to kml_all: {lon}, {lat}")
                kml_all.newpoint(name=name, description=description, coords=[(lon, lat)])
                # Videos created on January 6, 2021, also go in the second KML object
                if "CreateDate" in metadata[0] and metadata[0]["CreateDate"].startswith(
                    "2021:01:06"
                ):
                    print(f"Adding point {name} to kml_january6: {lon}, {lat}")
                    kml_january6.newpoint(
                        name=name, description=description, coords=[(lon, lat)]
                    )
    kml_all.save("parler-videos-all.kml")
    kml_january6.save("parler-videos-january6.kml")
At the top of the main() function, this script adds another KML object called kml_january6. The code will add only points from January 6, 2021, to this object. Next, the for loop will loop through each Parler metadata file, parse the JSON, and determine whether or not it has GPS coordinates. If so, the code will prepare variables so it can add the point to the KML objects. But this time, instead of the description variable containing just the video’s download URL, it will also include metadata. When defining description, the code adds <br> at the end, which is the HTML tag for a line break. This way, when you visualize this KML file, the description will show the URL on the first line, and the metadata will start on the next line. The code then loops through a list of metadata keys to add to the description, including CreateDate, FileTypeExtension, Duration, Make, Model, and Software. If there are any other pieces of metadata you’d like to include, feel free to add them to your script.
In each loop, the code checks to see if the metadata for the current video includes that key, and if so, adds its value to description, inserting a line break after each piece of metadata. For example, if the code is looking at the JSON file meta-g09yZZCplavI.json, description will appear as follows:
URL: https://data.ddosecrets.com/Parler/Videos/g09yZZCplavI
CreateDate: 2021:01:06 20:08:25
FileTypeExtension: mov
Duration: 25.24 s
Make: Apple
Model: iPhone XS Max
Software: 14.3
(The actual value of the description string will contain <br> for the line breaks, but this is how the description will look in Google Earth.)
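To see the raw string, you can run the description-building logic on its own. This sketch wraps the same loop in a hypothetical build_description() function (the exercise script does this inline in main()) and feeds it a dictionary holding the metadata values shown above for g09yZZCplavI.

```python
DESCRIPTION_KEYS = [
    "CreateDate",
    "FileTypeExtension",
    "Duration",
    "Make",
    "Model",
    "Software",
]


def build_description(parler_id, metadata_item):
    # Hypothetical helper: same logic as the loop in main(), pulled out
    # into a function so it can be run by itself
    description = f"URL: https://data.ddosecrets.com/Parler/Videos/{parler_id}<br>"
    for key in DESCRIPTION_KEYS:
        if key in metadata_item:
            description += f"{key}: {metadata_item[key]}<br>"
    return description


# The metadata values shown above, as they'd appear in metadata[0]
sample = {
    "CreateDate": "2021:01:06 20:08:25",
    "FileTypeExtension": "mov",
    "Duration": "25.24 s",
    "Make": "Apple",
    "Model": "iPhone XS Max",
    "Software": "14.3",
}

print(build_description("g09yZZCplavI", sample))
```

The printed string is a single line, with each field separated by a <br> tag rather than a newline.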
Next, the code uses another if statement to see if that video was created on January 6, 2021, and if so, adds that point to kml_january6. It does this by checking that the file has a CreateDate metadata field and that the date in that field is from January 6, 2021, just as you did in Exercise 11-2. Finally, when the script finishes looping through all of the Parler videos, after saving the points in kml_all to parler-videos-all.kml, it also saves the points in kml_january6 to parler-videos-january6.kml.
You can find the final script in the book’s GitHub repo at https://github.com/micahflee/hacks-leaks-and-revelations/blob/main/chapter-11/exercise-11-4.py. Run your complete script like so:
micah@trapdoor chapter-11 % python3 exercise-11-4.py /Volumes/datasets/Parler/metadata
Adding point 2XpiJFsho2do to kml_all: -117.6683, 33.490500000000004
Adding point bcHZhpDOFnXd to kml_all: -1.3391, 52.04648888888889
--snip--
Adding point VNYtKrEURiZs to kml_all: -97.0244, 33.1528
Adding point VNYtKrEURiZs to kml_january6: -97.0244, 33.1528
Adding point KptnQksS5Xr8 to kml_all: -77.0142, 38.8901
--snip--
When the script is finished running, it should have created two KML files: a 31MB file called parler-videos-all.kml (the file is bigger this time because the descriptions are longer) and a 929KB file called parler-videos-january6.kml.
Now that you’ve put in the hard work of generating KML files full of GPS coordinates, you can move on to the fun part: visualizing this data using Google Earth. This will allow you to scroll around the globe picking which videos you’d like to watch.
Visualizing Location Data with Google Earth
In this section, you’ll learn how to visualize location data in the KML files that you just created using Google Earth, marking each Parler video with a pin on a map. Not only will this let you visualize exactly where all of the videos with GPS coordinates were filmed, but this will also make it considerably simpler to download these videos to watch.
When you created those KML files, you set the description for each Parler video to include its download URL. Once you load a KML file into Google Earth and turn it into pins on a map, you can click on a video’s pin to see its description and then click the link in the description to download the video. In a web browser, load Google Earth at https://earth.google.com. (You don’t have to log in to a Google account, though doing so enables you to save your work and revisit it later.) In the menu bar on the left, choose Projects ▸ Open ▸ Import KML File from Computer. Browse for the parler-videos-all.kml file you created in the previous exercise and open it. When it’s done loading, click the pencil icon to edit the title of this project, name it All Parler Videos, and press ENTER. This should create a pin on the map for each Parler video in the entire dataset, labeled by its ID.
Repeat this process for parler-videos-january6.kml, and name this one Parler Videos from January 6, 2021. In the Projects panel on the left of the screen, you should see your two projects.
By clicking the eye icon, you can show and hide Google Earth projects to choose which KML files you want displayed. With the pins you want displayed, you can rotate the Earth and zoom in on whatever you’d like. You can double-click on the map, click the plus (+) button to zoom in, and click the minus (−) button to zoom out.
For example, to investigate just the insurrection videos, show that project and hide the others. Figure 11-6 shows Google Earth zoomed in on the US Capitol building in Washington, DC, with just the videos from the January 6 insurrection showing. The pins in the figure are all videos of the January 6 insurrection, and the pins located over the Capitol building itself are videos filmed by Trump supporters who were actively trespassing inside the US Capitol that day.
Figure 11-6: Google Earth, focused on the US Capitol building, with pins at the GPS points in
When you find a video you’re interested in, click its pin to view its description. You should see the URL to download the video, and you can watch it using software like VLC Media Player.
You can also use Google Earth to search for a location so you can see the individual pins there. For example, you could hide the Parler Videos from January 6, 2021, project and instead show pins for the All Parler Videos project, then search for Moscow. Figure 11-7 shows Google Earth zoomed in on the city of Moscow, Russia. As the figure indicates, only a handful of videos whose metadata included GPS coordinates were filmed there and uploaded to Parler.
Figure 11-7: Parler videos filmed in Moscow
Click the pin for the video labeled ykAXApWbiZuM. You should see the following description:
URL: https://data.ddosecrets.com/Parler/Videos/ykAXApWbiZuM
CreateDate: 2020:06:28 21:56:41
FileTypeExtension: mov
Duration: 0:06:51
Make: Apple
Model: iPhone 7 Plus
Software: 13.5.1
As you can see, this video was filmed on June 28, 2020 (during the Black Lives Matter uprising), with an iPhone 7 Plus running iOS 13.5.1. Right-click the link to see the option to download the video. This way, your web browser won’t try opening it directly in a new tab, where it might not display properly.
If you’re interested, you can open the video file using VLC Media Player to watch it. In the recording, a tattooed American white supremacist who runs a Confederate-themed barber shop in Moscow goes on a racist and homophobic rant, in part explaining why he moved to Russia. “I voted Trump in office in 2016,” he said. “But the fact is, nothing’s gonna change. The fact is, all these Trump supporters in America all the time can’t see the real problem. Your real problem is fucking Jews in America.” Figure 11-8 shows a screenshot from the video where he’s telling Parler users that he’s a real white supremacist and not a liberal troll, as people were accusing him of being.
Figure 11-8: A screenshot from a Parler video filmed by an American white supremacist in Moscow
He goes on to fantasize about mass shooting Black Lives Matter protesters. “I watch the news in America. I see all these fucking [N-word]s, antifa fucking scum. Ripping down the monuments. It angers me more than anything. What I don’t understand is where’s the fucking police to stop any of this?” he asks. “How come nobody’s shooting these motherfuckers? If I was in Los Angeles still, seeing all this rioting and looting going on, I’d be up on a motherfucking building with my AK-47 just spraying the fucking crowd.”
If you’re curious about the complete metadata from this video, you can check the original file at meta-ykAXApWbiZuM.json. If you wanted to see more videos posted by this Parler user, you could modify your script to filter for videos that were filmed on the same device by checking for the same Make, Model, and Software fields. You might find some other users’ videos, but chances are you’ll also find more videos from this poster.
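One way to sketch that device filter is shown below. The filmed_on_same_device() function is hypothetical, not part of the book’s scripts. It compares values as strings on the assumption that a version like 14.2 might be stored in the JSON as a number rather than as text.

```python
def filmed_on_same_device(metadata_item, make, model, software):
    # Hypothetical helper: True if all three device fields match.
    # Values are compared as strings in case a version is stored as a number.
    return (
        str(metadata_item.get("Make")) == str(make)
        and str(metadata_item.get("Model")) == str(model)
        and str(metadata_item.get("Software")) == str(software)
    )


# Device fields from the description of the video ykAXApWbiZuM
target = ("Apple", "iPhone 7 Plus", "13.5.1")

candidates = [
    {"Make": "Apple", "Model": "iPhone 7 Plus", "Software": "13.5.1"},
    {"Make": "Apple", "Model": "iPhone XR", "Software": 14.2},
    {"Model": "iPhone 7 Plus"},  # missing fields never match
]

for item in candidates:
    print(item, filmed_on_same_device(item, *target))
```

In your own script, you could call a function like this on metadata[0] inside the loop and add matching videos to a separate KML object.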
The media spent the bulk of its time focusing on Parler videos it knew were taken in Washington, DC, on the day of the insurrection. If you’re interested in further exploring this dataset, you might try to find videos from other far-right protests, or events with far-right counterprotesters. For example, you could create a KML file that includes only videos filmed during the date ranges of specific 2020 Black Lives Matter protests and explore those videos. You might find video evidence of other crimes.
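A date-range filter like that could be sketched as follows. The created_between() and parse_create_date() functions are hypothetical, and the start and end dates are placeholders you’d replace with the dates of the events you’re researching. The sketch assumes CreateDate values look like the ones shown in this chapter, such as 2020:06:28 21:56:41, and treats anything it can’t parse as not matching.

```python
from datetime import datetime


def parse_create_date(value):
    # Hypothetical helper: parse a value like "2020:06:28 21:56:41".
    # Returns None for missing or malformed values.
    try:
        return datetime.strptime(value[:19], "%Y:%m:%d %H:%M:%S")
    except (TypeError, ValueError):
        return None


def created_between(metadata_item, start, end):
    # True if the video's CreateDate falls within [start, end]
    created = parse_create_date(metadata_item.get("CreateDate"))
    return created is not None and start <= created <= end


# Placeholder date range; substitute the dates you're interested in
start = datetime(2020, 5, 26)
end = datetime(2020, 6, 30, 23, 59, 59)

print(created_between({"CreateDate": "2020:06:28 21:56:41"}, start, end))
print(created_between({"CreateDate": "2021:01:06 20:08:25"}, start, end))
```

Inside your script’s loop, you could use a check like this in place of the startswith("2021:01:06") test to decide which points go into a third KML object.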
Viewing Metadata with ExifTool
When @donk_enby downloaded the Parler videos and extracted metadata from them in JSON format, she used a command line program called exiftool. This program is one of the investigation tools I use most frequently, and this section explains how to use it.
If you run exiftool followed by a filepath, it will attempt to find metadata stored in that file and show it to you. It works on a variety of file formats, including Microsoft Office documents, PDFs, images, and videos. You can use it to find hidden information in the metadata of those documents, such as the author of a Word document, which type of phone or camera was used to take a photo, and much more.
You don’t need to run exiftool on the Parler videos since @donk_enby did it for you, but most of the time, you won’t be so lucky. If you want to search for hidden information in BlueLeaks documents, for example, you’d need to run exiftool on them yourself. In this subsection, to learn how exiftool works, you’ll use it to view the metadata on one of the Parler videos in JSON format.
Mac users, install exiftool by running the brew install exiftool command; users of Linux or Windows with WSL, install it with the sudo apt install libimage-exiftool-perl command. In your terminal, change to the videos folder in your Parler dataset folder and use wget to download the Parler video with the ID HS34fpbzqg2b:
wget https://data.ddosecrets.com/Parler/Videos/HS34fpbzqg2b
You can use exiftool to look at the metadata of a file by running exiftool filename. Run it on the HS34fpbzqg2b file that you just downloaded with the following command:
exiftool HS34fpbzqg2b
The output should show all the metadata for this video file:
--snip--
File Type Extension : mov
--snip--
Model : iPhone XR
Software : 14.2
Creation Date : 2021:01:06 13:57:49-05:00
--snip--
GPS Position : 38 deg 53' 26.52" N, 77 deg 0' 28.44" W
Along with other information, the metadata shows that this video’s file extension is .mov, it was recorded using an iPhone XR running iOS 14.2 on January 6, 2021, at 1:57 PM, and it was filmed at the GPS coordinates 38 deg 53′ 26.52″ N, 77 deg 0′ 28.44″ W.
Since the file extension for this video is .mov, rename it by running mv HS34fpbzqg2b HS34fpbzqg2b.mov. You can open HS34fpbzqg2b.mov in a program like VLC Media Player just to see what it contains: police officers stepping out of the way while Trump supporters remove barricades surrounding the Capitol building.
When @donk_enby used exiftool to extract the metadata from the Parler videos, she used the -json argument to extract it in JSON format. Here’s how you do that for HS34fpbzqg2b:
micah@trapdoor videos % exiftool HS34fpbzqg2b -json
[{
"SourceFile": "HS34fpbzqg2b",
"ExifToolVersion": 12.42,
"FileName": "HS34fpbzqg2b",
--snip--
"GPSLatitude": "38 deg 53' 26.52\" N",
"GPSLongitude": "77 deg 0' 28.44\" W",
"Rotation": 180,
"GPSPosition": "38 deg 53' 26.52\" N, 77 deg 0' 28.44\" W"
}]
The -json argument makes the output much easier to work with than exiftool’s default output.
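For instance, a Python script can run exiftool itself and load the result with the json module. The following is a minimal sketch under a few assumptions: exiftool is installed and on your PATH, and the read_metadata() and parse_exiftool_json() function names are made up for this example. The sample string is an abbreviated version of the output shown above, so the parsing step can be tried without the video file.

```python
import json
import subprocess


def parse_exiftool_json(output):
    # exiftool -json prints a list with one object per file;
    # return the object for the first (only) file
    return json.loads(output)[0]


def read_metadata(path):
    # Hypothetical helper: run exiftool on a file and parse its JSON output.
    # Assumes the exiftool command is installed and on your PATH.
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_exiftool_json(result.stdout)


# Abbreviated sample of the JSON output shown above
SAMPLE_OUTPUT = r'''[{
  "SourceFile": "HS34fpbzqg2b",
  "FileName": "HS34fpbzqg2b",
  "GPSLatitude": "38 deg 53' 26.52\" N",
  "GPSLongitude": "77 deg 0' 28.44\" W",
  "Rotation": 180
}]'''

metadata = parse_exiftool_json(SAMPLE_OUTPUT)
print(metadata["GPSLatitude"])
print(metadata["GPSLongitude"])
```

To read a real file, you’d call read_metadata("HS34fpbzqg2b") from the folder where you downloaded the video.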
Summary
In this chapter, you’ve learned about the secrets hidden in the metadata of over a million videos uploaded to Parler, many of them by insurrectionists filming themselves during the January 6 riot in Washington, DC. You’ve learned the syntax of the JSON file format and how to work with JSON data in your own Python scripts. You’ve written a series of scripts that filtered the list of a million videos down to just the ones that were, according to their metadata, filmed on January 6, 2021, in Washington, DC, during the attack on the US Capitol by supporters of Donald Trump. You now have the skills necessary to write code that analyzes JSON in your own investigations. Finally, you’ve seen how you can convert GPS coordinates from degrees to decimal and plot them on a map, an invaluable skill for future investigations that involve location data.
In the next chapter, you’ll explore one more technology that’s common in hacked and leaked datasets: SQL databases. You’ll use the SQL skills you learn to dig into the hacked databases of Epik, a hosting and domain name company that provides service to much of the American fascist movement.