GDELT & BigQuery: Understand the world

[MUSIC PLAYING] FELIPE HOFFA: Hi. I’m Felipe Hoffa. And today we are at NOAA–
the National Oceanic and Atmospheric Administration. Today we’re going to visit their
Science on a Sphere project, where they are able to visualize
all the information, everything that is happening around the
world in a giant 3D sphere. We are going to use this
sphere to take a look at GDELT, the Global Database of
Events, Language, and Tone– an initiative that collects,
classifies, and scores every piece of daily news
you can read from anywhere in the world. In this video, we are going to
see practical examples of how to explore GDELT
data with BigQuery. We are going to meet Amanda
Traud to see how she’s using R to explore the
ever-changing relationships between countries. We’ll also meet Louisa Koch,
the NOAA education director, to explore the amazing
possibilities that emerge from combining
GDELT with other data sets. But first, let’s go
back to NOAA to meet Kalev Leetaru, the
creator of GDELT, to learn more about his
project and how it works. So this is NOAA Science
on a Sphere Project, where we can visualize the
whole planet at once. Here, for example, we’re looking
at the Chilean 2014 tsunami, and how it’s expanding
all over the world. But then, can we look at GDELT? Let’s wait for a second.
Look, and here we have it. All the products
around the world. And Kalev, can you
tell us more about what we are looking at here? KALEV LEETARU: So the idea
of the GDELT project really is, how do we create a
dashboard, if you will, of human society? If you think about it
today, an earthquake occurs anywhere on the planet. And there’s this massive global
sensor grid that’s been set up, and groups like the USGS
can pick up the fact that this earthquake occurred
anywhere on the planet. They know where it is. What its depth is. Who’s going to be affected. All within seconds. But what about the
human earthquakes? If you think about
it for a moment, a protest that occurs
anywhere on the planet– the ability to create
a data set of that. And that’s really what the
GDELT Project is about, is creating essentially a
global catalog of human society. So imagine scooping up all the
world’s news media each day– web, print, broadcast,
over 100 languages– and processing all that by
computer, second by second. How do you interact with
that, visualize that, understand that? And that, to me, is what’s
so exciting about the NOAA Science on a Sphere
Project, is this ability to take this massive data set of
a quarter of a billion records and make it accessible. You think about visualization. You can’t do better than to
display essentially planet Earth on a physical
sphere of the planet. FELIPE HOFFA: While
visiting NOAA, Louisa Koch, the
Education Director, joined us to share more about
the mission and her thoughts on the GDELT Data Set. LOUISA KOCH: So NOAA
produces a lot of information about the ocean
and the atmosphere. We spend billions
of dollars a year collecting ocean and
atmospheric data. But it’s really important
for us to also connect it to what that means to people. And GDELT is an
amazing database. It’s so exciting to
see it on the sphere. Right now, most of the data
sets on Science on a Sphere are the physical
data sets that talk about how the
Earth is operating. GDELT talks about how social
systems are operating. And can show us a completely
new perspective on the planet that we call home. FELIPE HOFFA: So
GDELT is amazing. It’s monitoring the world. But what I want
to learn today is how we are getting all this
information into GDELT. KALEV LEETARU: Yeah. Yeah. So that’s one of
the tricks is trying to monitor the
entire world’s media. And of course, this
is a complex process. So it’s taking this
textual news article– this big, massive mess–
and translating that to a codified record that’s
recording what’s happening, where it’s happening, who’s
involved, the details of it. And then all that flows through
to the end, where it then makes it available as
these CSV files and sitting inside of BigQuery. And then eventually
at the end of each day when it processes it, it
generates a CSV file, makes it available
on the website, but then it also
sends it to BigQuery using the BigQuery
uploader tool. And a minute or so later,
it’s up in BigQuery. And it’s been added to that
quarter billion record, so now you can pull up
BigQuery on your computer and suddenly access
the latest material. FELIPE HOFFA: GDELT already
indexes more than a quarter billion rows from the last
35 years of worldwide events. And it grows by more
than 100,000 rows a day. I sat down with Kalev for
an interactive session of exploring GDELT
using BigQuery. Take a look at what we did. KALEV LEETARU: What
we’re going to do here is do a simple
experiment to show how do you really
work with data. So why don’t we start
off with Ukraine? Ukraine’s been in the
news, obviously, a lot. Let’s pull, for
example, let’s just count how many total
events that we’re getting from Ukraine right now. Specifically, let’s
look at conflict events, since there are a lot of Ukrainian events. So what we want to
do, in this case, we just care about
the number of events. We don’t care about the detail. We just want to know how
many events are occurring in Ukraine by
month, for example. FELIPE HOFFA: By month? For the last year? For the last 10 years? KALEV LEETARU: Let’s do
it for the last 30 years. Let’s go back to 1979. So we’ll start off with
a select count, as usual. And what we want to
do is we want to say, first off, we want to limit
it just to Ukraine, obviously. And so we’re going
to say Action. So what we’re going
to do is we’re going to say, where action. So for each event, it
records a whole bunch of different details. One of them is where
the event took place. So the physical location
where the event took place. So we’re going to say where
the country code of the action, where it actually
took place, is UP. And don’t ask me why UP is
the country code for Ukraine. And so if we just do
this, and we say, well, where are the events by Ukraine? But we want to break
it down by month. So we’re going to
come back here, and we’re going to
add in month-year. And so basically we’re asking
it, break this down by month. Count up the number of events. And again, group by month-year. And let’s order by month-year. And so if we run this, what
we should get back here is literally by month how many
events it counted in Ukraine. And actually, I just realized
this will give us everything. But this is a good example. This will give us everything,
not just conflict events. FELIPE HOFFA: Maybe
we need to– oh. There it is. So what do we have here? KALEV LEETARU: So
what we’re seeing here is by month is how many
events it actually recorded. So you’re seeing in January
1979, it recorded 23 events. In February, it recorded 11. And we scroll by. And let me hop all
the way to the end. So you notice these are
sitting in the 50 to 80. FELIPE HOFFA: 430
different events. KALEV LEETARU: Exactly. If we come to here, we
start seeing like 100,000. And if I download this– let me
download this as a CSV file– and show you what this
actually looks like. And let me go ahead
and graph this. I love visuals, so it
makes it easier to see. So let’s go ahead
and graph this. And we can see– wow. There’s not a whole
lot in Ukraine. And then right as– you
can see a certain number. Maybe 100 or so a month or so. And then of course the
fall of the Soviet Union, it starts increasing,
increasing. We start seeing
some of the others. Then of course, we
really see things skyrocketing in recent times. But you could say, well,
that kind of makes sense. I mean there’s a
lot of stuff that’s been happening in the last year. But the problem that that’s
facing is that at the same time the total volume of all
news media worldwide is increasing exponentially. And so what you
really want to do here is you actually want
to normalize this. I don’t want to
see a raw number. What I want to see
is I want to see that over the entire size
of GDELT on that day. FELIPE HOFFA: So in order to
normalize the number of events, we need to know two numbers. First, the total number
of worldwide events in GDELT for each month, which
we compute with a simple count. And then the number of events
related to the Ukraine. We can get those by adding an IF
condition to our count clause. Finally, we need to get rid
of this WHERE clause. Otherwise, our first
count would be incorrect, since only rows
related to the Ukraine will be taken into account. We could now get the
ratio between both columns in BigQuery. But instead, let’s export
them as a spreadsheet, and produce the monthly ratios
and visualizations there. KALEV LEETARU: And so now
let’s go ahead and just– so we mapped Ukraine
before, as we remember. If we look here, we see Ukraine. Let’s take a look at the total. So this is all events
found by GDELT. And we can see, obviously,
right around the birth of Google News, in particular,
which really drove a lot of
web-based news, we sort of see this
massive spiral forward. So actually looking
at this, this does suggest, indeed, that there
was a big surge in Ukraine. But the problem is that
this peak is really going to be heavily
impacted by the fact that there’s a lot more in GDELT
than there was a long time ago. So what we want to do is instead
of plotting the raw number, we want to plot it
as a percentage. So we basically
want to come in here and we simply want to
do a simple dividing. Number of Ukraine events divided
by total number of events, make it into a percentage point. Now if we graph
this, it probably will look somewhat similar. But actually, we
see a lot more here. So now we start seeing
some interesting things. And we can see, again,
sort of all events. So we can see, again, sort of
this rise and sort of interest in Ukraine. But you know, again,
one of the challenges here is we’re looking at
everything for Ukraine. And probably what we want is
we want to go more narrow, something like protests. Because again, looking at
everything to do with Ukraine is probably less interesting. So what we want to do is
we want to come in here and we want to say– we want
to limit it to protests. So if we want to
say protests, we want to say there’s something
here called Event Root Code. So remember that there’s
300 different categories of events in here. And so all the protest
ones begin with 14. So if we go back, remember
here, and we look at 14, we can see there’s all different
types of it– hunger strikes, policy, a protest for
a leadership change, a protest for a rally, a
protest for policy change. In our case, we just
want to see all of them. So if we say, with a
Root Code equals 14, that captures anything
that’s a protest. So now if we repeat
this, we should get something much
more interesting. This is going through
and finding everything that has to do with a
protest across all of GDELT, coming back, and
then for each month, recording how many
total protests were recorded in GDELT
for that month. And then how many were reported
in Ukraine in that month. FELIPE HOFFA: So
in January 1979– KALEV LEETARU: It’s
saying they found 400 protests around the world. And again, in 1979, there
was less electronic media. But it didn’t find any in
Ukraine and, obviously, with the Soviet
occupation, a lot less. So let’s say let’s download
that as a CSV file. And let’s once again
bring this into Excel. So this is total protests. This is Ukraine protests. And then once again,
percentage Ukraine. So what this is then,
is– so let’s say we want to divide like this. So what we’re going to do here–
so this column essentially will be what percentage
of all protest events matched in that
month were from Ukraine. And if we then examine
this, we plot this out, now we see something that
makes a lot more sense. Let me delete this to give us
a little bit more room here. You can see during the Soviet
occupation, very few protests. Obviously, the Soviets were
very restrictive on that. And then, of course,
the protests– the great European
revolutions that began at the fall
of the Soviet Union. Boom! Now all of a sudden you see
a lot of protest activity. You see spikes. Crimea. You see the Orange Revolution. And then you see, of course,
you see where we are today. But you see what’s interesting
is that protests have really been decreasing in
Ukraine as we’ve been moving towards present. And a lot of that,
obviously, is the fact that what’s occurring
in Ukraine right now has really shifted
from protest activity to full scale military conflict. And this raises some
interesting questions. So for example, let’s
go back to this query, and instead of protests,
let’s look at that. If we say, well, why
are protests going down? Isn’t Ukraine becoming
more unstable? Well, instead of protests, let’s
go back and look at conflict. So fighting, for example. Let’s change from 14 to 19. And so now let’s run this. So now what we’re
saying is, find me all of the active conflict
activities, the actual fighting that’s occurring
across the world. And for each month, count
up how many total we see, and then, again, how
many were in Ukraine. FELIPE HOFFA: We continue to
explore the basics of GDELT and how to look into the
timeline of each country. During this trip, I also met
Amanda Traud, a data scientist who’s been exploring
GDELT at her job. She works at L-3 Data Tactics,
a specialized provider of Big Data Analytics and
Cloud Solutions Services. They have been
using R to explore GDELT data and the
relationships among countries and how they change over time. For this project, she uses
Shiny, a web application framework that turns R code
into interactive web apps. Let’s learn more from her. AMANDA TRAUD: So
Shiny is actually a package in R that
allows you to create browser-based applications. And both of these apps
are browser-based, and anyone can use them
on our Shiny server. FELIPE HOFFA: That’s awesome. So you have a Shiny server
running R Shiny apps. AMANDA TRAUD: Right. FELIPE HOFFA: Can you show
me more about these apps? What have you done– AMANDA TRAUD: Absolutely. FELIPE HOFFA:
–with GDELT’s data. AMANDA TRAUD: So my favorite
app is the GDELT Country Network Dashboard. You can put in a date range. So say I wanted to know what the
connections between countries looked like in the last week. So I select October
15 and October 22. So again, I put
in my BigQuery ID. And then I query. FELIPE HOFFA: OK. So we have a beautiful
network here. AMANDA TRAUD: Yes. FELIPE HOFFA: Can you
show me behind the scenes? AMANDA TRAUD: Absolutely. FELIPE HOFFA: What
is powering this app? AMANDA TRAUD: So two
things– HTML and R. First I’ll show you the R code. So R has this amazing
library called Shiny that you can use to create
these browser-based apps. So behind the scenes, when I
get my data back from BigQuery, my R function, Shiny
server, is taking the table and creating an edge list. And in a network,
an edge is the line that connects the two dots. So this takes the table that
I get of all the events, pulls out actor
one and actor two, decides in the table
which ones are countries, and takes that subset. And then says that the
weight on that edge, how connected they are, is
based on the tone of the event. FELIPE HOFFA: I see. And it’s all open-sourced. AMANDA TRAUD: It is. It’s all on GitHub. So you can go and you can
download the code if you want. It’s on the Data Tactics GitHub. FELIPE HOFFA: That’s beautiful.
app available, too. AMANDA TRAUD: I do. The second app will give us a
table of all of the information that we could possibly
want with GDELT. So for the app, we need
to put in– I’ve already put in a date range. That is October 1st
through the 22nd. And put in my project
ID and client ID, just like the first one. And selected an actor. Now, I can select any
of the possible actors in the GDELT database. I’ve selected China. FELIPE HOFFA: So GDELT has
a lot of actors encoded. Can we take a look at– AMANDA TRAUD: Absolutely. It has– FELIPE HOFFA: Countries. It has– AMANDA TRAUD: So many. FELIPE HOFFA: Groups. So we’re going to
look at protests in China between October
1 and October 22. AMANDA TRAUD: Absolutely. FELIPE HOFFA: And this
app is querying BigQuery– AMANDA TRAUD: Through JavaScript
exactly like the other one. I hit Query, and it goes to
GDELT on the BigQuery database and brings me back a table. And I have a list of
the different columns that are available in the data. FELIPE HOFFA: Yeah. This is awesome. And this is data you
later pick up with R, and you’re able to network. AMANDA TRAUD: Absolutely. So this is the type of data that
I then take into the other app. In the other app, I use
actor one, actor two, and average tone. FELIPE HOFFA: To learn
more about these apps, visit the GDELT blog, and
explore the Open Source Code at the L-3 GitHub page. So how did you
discover BigQuery? And how did it change
GDELT’s project? KALEV LEETARU: BigQuery
has been incredibly amazing to this project. So really, for the first
sort of 3/4 of GDELT’s life, I could not access GDELT. I have a quarter of a billion
records sitting in my hands. But I couldn’t look at it. I couldn’t ask the most
basic of questions. A simple query
of, give me a list of all the events
involving two actors. That could take a day to compute
in a traditional database. So oftentimes people
would say, wow, could you give me an extract
of all the events in Syria that involve a certain actor. I say, sure. but it’s going
to take about a day and a half for that to extract. BigQuery really was essentially
taking the lens cap off of this incredible data set. BigQuery is really
changing that bar now, where the vast majority
of those ideas like, could we literally take a
period of time right now and check all of
world history, and see what periods in world
history are most similar? 2 and 1/2 million correlations,
2 and 1/2 minutes, one line of SQL. FELIPE HOFFA: And what is next? Where is BigQuery, GDELT,
the data world going? KALEV LEETARU: So that’s
the most exciting part is you think about GDELT today. It’s out there. It’s processing all
the world’s news media. It’s creating this
incredible data set. GDELT also is
expanding to things like academic literature. Processing all the world’s
academic literature on social cultural issues. And from that we
have a citation graph of who’s most cited in areas. Now imagine taking GDELT,
what’s happening in news today, combine that with the citation
data of who all the world’s experts are in an area,
then combining that with Google
Freebase, which gives all the structural information. So GDELT knows, for
example, that Barack Obama’s the President of
the United States. But Freebase will then tell us,
well, he also attended Harvard. And then it could tell us, well,
this other person over here, they also attended Harvard. So we start getting those
structural links that allow us to combine
with that, and then being able to correlate
against other data sets that are in BigQuery, things like
the Wikipedia page views. Being able to say, well, when
all of a sudden this particular person’s in the news,
and everyone’s now looking that person
up on Wikipedia, what are the other
topics they’re looking up at the
same time period? How is that person or
that political party, or other thing,
how are they being contextualized and understood? It’s just magic. You just put the data out there. It’s sort of like
the electricity. You walk into a room and
you flip the light switch. You think about how
simple that is, yet how incredible that
infrastructure is that makes that possible. But you don’t think about that. You flip your light
switch, and you really don’t worry anymore about
how expensive is that query? You just ask the question. And it’s possible. Whatever you can
imagine, it’s possible. FELIPE HOFFA: And
that’s it for today. Thank you for joining me
in this exploration journey into understanding the world. It was a pleasure to meet
Louisa Koch and Amanda Traud to explore how industry,
academia, government, and end users can benefit from this
extraordinary data set. And of course, Kalev Leetaru,
the man behind GDELT. To follow the
latest developments and interesting use
cases, visit their website. There’s a lot more to discover. Now it’s your turn to
dive into the data, write your own queries,
and share your results. If you enjoyed this video,
share it with your friends. For the Google Cloud
Platform, I’m Felipe Hoffa. I’ll soon be back with more
stories from the Big Data world. Stay curious. [MUSIC PLAYING]
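
The normalization walked through in the Ukraine session (a plain count for the worldwide total, a conditional count for one country, and no country filter in the WHERE clause) can be sketched as follows. This is a minimal illustration, not the query typed in the video: it runs against an in-memory SQLite table holding a handful of invented rows, and the column names MonthYear, ActionGeo_CountryCode, and EventRootCode are assumed to mirror the GDELT event table. The function name monthly_share is hypothetical.

```python
import sqlite3
from contextlib import closing

# Toy rows standing in for GDELT events:
# (MonthYear, ActionGeo_CountryCode, EventRootCode).
# The values are invented for illustration; they are not real GDELT counts.
SAMPLE_EVENTS = [
    (197901, "UP", "14"),
    (197901, "US", "14"),
    (197901, "US", "19"),
    (197901, "FR", "14"),
    (197902, "UP", "19"),
    (197902, "UP", "14"),
    (197902, "US", "14"),
    (197902, "US", "04"),
]

# The country test lives inside the conditional sum, not in WHERE,
# so COUNT(*) still counts events from every country.
QUERY = """
SELECT
  MonthYear,
  COUNT(*) AS total_events,
  SUM(CASE WHEN ActionGeo_CountryCode = ? THEN 1 ELSE 0 END) AS country_events
FROM events
WHERE EventRootCode = ?
GROUP BY MonthYear
ORDER BY MonthYear
"""


def monthly_share(rows, country_code, root_code):
    """Return (month, total, in_country, percent) tuples, one per month."""
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute(
            "CREATE TABLE events ("
            "MonthYear INTEGER, ActionGeo_CountryCode TEXT, EventRootCode TEXT)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        result = []
        for month, total, in_country in conn.execute(QUERY, (country_code, root_code)):
            # Percentage of that month's matching events located in the country.
            result.append((month, total, in_country, 100.0 * in_country / total))
        return result


if __name__ == "__main__":
    for row in monthly_share(SAMPLE_EVENTS, "UP", "14"):
        print(row)
```

Calling monthly_share(SAMPLE_EVENTS, "UP", "14") gives one row per month with the worldwide protest count, the Ukraine protest count, and the percentage; passing "19" instead of "14" switches from protests to fighting, as in the session.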

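The edge-list step Amanda describes (pull out actor one and actor two, keep only the pairs where both are countries, weight each edge by tone) could look roughly like this in Python. Her app is written in R and its code is not shown in the video, so everything here is an assumption made for illustration: the function name build_edge_list, the (actor1, actor2, avg_tone) row shape, treating edges as undirected, dropping self-pairs, and using the mean tone as the weight.

```python
from collections import defaultdict


def build_edge_list(events, country_codes):
    """Turn (actor1, actor2, avg_tone) rows into weighted country-to-country edges.

    Assumptions for this sketch: edges are undirected, an actor paired with
    itself is skipped, and the weight is the mean tone over all events that
    share the same pair of countries.
    """
    tones = defaultdict(list)
    for actor1, actor2, tone in events:
        # Keep only rows where both actors are countries.
        if actor1 not in country_codes or actor2 not in country_codes:
            continue
        if actor1 == actor2:
            continue
        # Sort the pair so (A, B) and (B, A) land on the same edge.
        edge = tuple(sorted((actor1, actor2)))
        tones[edge].append(tone)
    return {edge: sum(values) / len(values) for edge, values in sorted(tones.items())}


if __name__ == "__main__":
    # Invented sample rows; "GOV" stands in for a non-country actor.
    sample = [
        ("USA", "CHN", -2.0),
        ("CHN", "USA", 4.0),
        ("USA", "GOV", 1.0),
        ("RUS", "UKR", -5.0),
    ]
    print(build_edge_list(sample, {"USA", "CHN", "RUS", "UKR"}))
```

The returned dictionary maps each country pair to a single weight, which is the shape a network-drawing library typically expects for a weighted edge list.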

  7. Post
    Csala Dénes

    Check out my analysis of GDELT conflict and insurgency data

  8. Post
    Михаил Геннадьевич

    Back in the Union we used to joke a lot about the Americans building a computer to figure out the Russians… Now the time to laugh has come

  10. Post
    MarxNutz y

    Its ability to predict is only as good as the data it collects. NOAA data is pretty accurate because it comes from scientific instruments. However, so much of the reporting in the media all around the world that has gone into this system is fraught with error and plain lies. You must have accurate, unbiased information in order to make accurate predictions. Can you say GIGO?
