Geographic Knowledge in Freebase

Freebase-final-01_Map

Description

This map shows the global distribution of geo-located entities described in Freebase, a collaborative knowledge base that defines itself as “an open shared database of the world’s knowledge”.

Data

Freebase forms one of the key informational ingredients in Google’s Knowledge Graph. If you’ve ever looked at the side panel in Google’s search results page, which presents information about people, places, and events in response to a search query, then you’ve probably come into contact with data stored in Freebase.

The data that we collected from Freebase describe over 43 million entities, among which we identified 478 thousand place names. The content is stored as RDF triples, each of which expresses a statement in the form subject-verb-object. We surveyed the triples in the dataset and collected all entities associated with a latitude-longitude coordinate pair; that is, all subjects of triples whose verb refers to the concepts “has latitude” and “has longitude”.
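The selection step described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the map: the predicate names `has_latitude` and `has_longitude`, the entity identifiers, and the sample triples are all hypothetical stand-ins, and the real Freebase schema identifiers may differ.

```python
from collections import defaultdict

# Hypothetical predicate names; the real Freebase identifiers may differ.
LAT_PREDICATE = "has_latitude"
LON_PREDICATE = "has_longitude"


def geolocated_entities(triples):
    """Return {subject: (lat, lon)} for subjects that have both coordinates."""
    coords = defaultdict(dict)
    for subject, predicate, obj in triples:
        if predicate == LAT_PREDICATE:
            coords[subject]["lat"] = float(obj)
        elif predicate == LON_PREDICATE:
            coords[subject]["lon"] = float(obj)
    # Keep only subjects for which a complete coordinate pair was found.
    return {
        s: (c["lat"], c["lon"])
        for s, c in coords.items()
        if "lat" in c and "lon" in c
    }


# Made-up sample data in (subject, verb, object) form.
triples = [
    ("entity:a", "has_name", "Place A"),
    ("entity:a", "has_latitude", "51.75"),
    ("entity:a", "has_longitude", "-1.25"),
    ("entity:b", "has_name", "Person B"),
    ("entity:c", "has_latitude", "10.0"),  # no longitude, so it is skipped
]
places = geolocated_entities(triples)
print(places)
```

Only `entity:a` survives the filter here, because it is the only subject with both a latitude and a longitude triple.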

Findings

Geographic content in Freebase is largely clustered in certain regions of the world. The United States accounts for over 45% of the overall number of place names in the collection, despite covering about 2% of the Earth’s surface and less than 7% of its land area, and being home to less than 5% of the world’s population and about 10% of Internet users. This corresponds to a US density of one Freebase place name for every 1500 people, and to far more place names referring to Massachusetts than to China.

A third of all place names are geo-located in Europe. The United Kingdom is home to about 7% of place names, Poland has about 6%, and France has just over 5%. The United Kingdom accounts for one place name for every 2000 inhabitants, the same proportion as Luxembourg. Ukraine is the only European country described with fewer than one place name per 30,000 inhabitants, whereas Slovenia and Poland are described in exceptional detail, with about one place name for every 1000 inhabitants and one for every 1300 inhabitants, respectively.

This stands in contrast to countries like China, which accounts for less than 1% of the collection (with fewer than 4000 place names, and a density of only one place name for every 300,000 inhabitants). Most of Africa, Asia, Latin America and the Caribbean are similarly underrepresented. Nigeria represents barely 0.1% of the place names, and Venezuela accounts for only 0.05%. Outside Europe and North America, only four countries (Australia, China, India, and Japan) are represented with more content than Antarctica (in part because the database contains descriptions of hundreds of Antarctic mountains and ranges).

The largest cluster of under-represented countries is found in Sub-Saharan Africa, where only a handful of countries are described by more than one place name for every 100,000 inhabitants. South Africa is the notable exception, as it exhibits information counts comparable to most European countries. Other exceptions are Nepal and Bhutan in Asia, which score relatively highly compared to neighbouring countries. It is also worth pointing out that Indonesia is the country with the lowest information density in the world, with only one place name per 470,000 people.

Because Freebase is a core ingredient in the informational menu presented to us by the world’s most widely used search engine, these presences and absences could significantly affect how we understand, interact with, and create our world. Freebase may seem like a small corner of the Web, but the imbalances that we observe in it can reverberate through the broader information ecosystems accessed by billions of people.

Geographic intersections of languages in Wikipedia

Wikipedia_geotagged_articles_final-update (1)

Description

This graph illustrates the percentage of geo-referenced articles in the twenty editions of Wikipedia containing the largest number of geo-referenced articles.

Data

The Terra Incognita project by Tracemedia investigates how Wikipedia has evolved over the last decade, mapping geographic articles and their dates of creation for over 50 languages. The maps highlight geolinguistic biases, unexpected areas of focus, and overlaps between the spatial coverage of different languages.

The project was developed using geo-coded Wikipedia articles from the Wikimedia Toolserver Ghel project (Geohack External Links), and article metrics that were collated using Toolserver scripts. The Ghel data dumps date to July 2013.

Only articles with primary coordinates are used, that is “where the location should be considered the primary object(s) in the page […]. Generally this should be one per article, but may be more with current corner cases with source and outlet of lakes and rivers” (Ghel project).

As illustrated in the featured graphic above (see table, bar chart by the Terra Incognita project), the percentage of geocoded articles varies widely between Wikipedia editions, from a minimum of 2% (Hindi Wikipedia) to a maximum of 46% (Polish Wikipedia). The exception is the constructed language Volapük, in whose Wikipedia edition 79% of articles are geocoded. Most large editions in Germanic and Italic languages contain between 12% (Italian Wikipedia) and 20% (English Wikipedia) of geo-coded articles.

Findings

The primary goal of the illustrations presented in this piece is to visualise how divergent Wikipedia’s geographic coverage is across languages. The tool also allows us to look at the date on which every one of the 4.5 million geocoded articles in Wikipedia was created, thus enabling us to see how the focus of different linguistic communities has evolved.

Most geo-coded Wikipedia articles are located in the countries where the language is listed as an official one.

One of the most interesting patterns that we can see in the data is that over 70% of articles written in languages that are spoken predominantly in a single country (e.g. Czech or Italian) exist only in that language. This means, for instance, that there might be articles about thousands of Czech villages written in Czech, but not in English, French, German, or Japanese.

Furthermore, Terra Incognita studies how two or more languages intersect with each other, that is, where two distinct Wikipedia editions refer to the same location, and what proportion of each collection such articles make up. These linking points can be visualized by means of language intersection maps, which highlight locations referred to by more than one language.
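As a rough sketch of how such an intersection could be measured, each edition can be treated as a set of location identifiers and the shared locations counted against each collection. The identifiers and the `intersection_stats` helper below are hypothetical, and the choice of denominators (each edition separately, and the two combined) is an assumption; the source does not specify how Terra Incognita computes its percentages.

```python
def ratio(part, whole):
    """Return part / whole, or 0.0 when the denominator is empty."""
    return part / whole if whole else 0.0


def intersection_stats(edition_a, edition_b):
    """Compare two sets of location identifiers, one per Wikipedia edition."""
    shared = edition_a & edition_b
    combined = edition_a | edition_b
    return {
        "shared": len(shared),
        "share_of_a": ratio(len(shared), len(edition_a)),
        "share_of_b": ratio(len(shared), len(edition_b)),
        "share_of_combined": ratio(len(shared), len(combined)),
    }


# Made-up location identifiers for two editions.
czech = {"loc1", "loc2", "loc3", "loc4"}
english = {"loc3", "loc4", "loc5"}
stats = intersection_stats(czech, english)
print(stats)
```

In this toy example two of the four Czech locations also appear in the English set, so half of the Czech collection intersects with English.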

Some of the most interesting linguistic comparisons can be seen when comparing the geography of different languages in multilingual parts of the world, such as Spain. We can see a high density of articles in Galicia, the Basque Country, Catalonia and to a lesser extent Valencia in their respective languages. Spanish (Castilian) is more evenly represented across the whole country.

spain_languages

A similar approach can be taken to explore the distribution of Wikipedia articles in some of the main languages spoken in South Asia. The map below, for instance, includes Bishnupriya Manipuri, Hindi, Nepal Bhasa and Tamil.

india_tamil_bish_hindi_nepalbhasa

Regional variations are not as strongly pronounced as they were in the Spanish case (Tamil, which is concentrated in South India and Sri Lanka, is a notable exception). The overlap between the languages is consistently between 12% and 16%, with the exception of Bishnupriya Manipuri and Nepal Bhasa, where the majority (65.1%) of articles are shared. These shared articles are distributed across India, whereas the distinct articles are located in the languages’ native Nepal and Bangladesh.

One further case study is presented below, illustrating the interaction between Romanian, Bulgarian and Serbian Wikipedia subprojects.

romanian_bulgarian21

Romanian and Bulgarian Wikipedia articles are largely concentrated within the political boundaries of their respective countries. There is very little overlap in geographic content except in major cities.

bulgarian_serbian

Bulgaria and Serbia also share a border, and Bulgarian and Serbian are both Slavic languages (in contrast to Romanian, which is a Romance language). The percentage of intersecting articles is much higher between Bulgarian and Serbian than between Bulgarian and Romanian. For instance, a large number of intersecting articles appear in Macedonia, which shares a border with both Serbia and Bulgaria.

These maps, and the associated Terra Incognita tool, offer us an insight into not just patterns in Wikipedia, but also the geographic spheres of interest of different linguistic communities. As we work to better understand online geographies of knowledge, these maps allow us to ask important questions about who is representing whom, and who is being represented by whom.

Credits

The project was created by Gavin Baily and Sarah Bagshaw at TraceMedia, and was supported by funding from the Arts Council of England Grants for Arts and the National Lottery.

World-wide news web

GDELT_Worldwide_News_Web-top

Description

This map depicts mentions of multiple places in news articles between 1979 and 2013. Brighter lines indicate more connections between places.

Data

The map uses data from the Global Database of Events, Language, and Tone (GDELT), which is an initiative aiming to provide a “realtime social sciences earth observatory”, by creating a freely available catalog of events derived from news stories. The database is compiled from stories in media outlets from almost every country in the world. Any story can contain more than one event, and events are automatically parsed out of news stories using a text analysis program called Tabari and encoded using a schema called Cameo.

A large portion of these events (140 million out of 250 million listed events) contains both a location of where the event happened and locations of the two primary actors involved. The Tabari algorithm associates events that it has already picked out of an article with geographic locations mentioned in the same text (by looking at verb usage in surrounding sentences). You can read the introductory paper on GDELT (Leetaru and Schrodt, 2013) for more on the specific geocoding methods employed.

We exclude all events where the two actors are geo-coded as being located in the same place (about 91 million events, or 36 percent of the full dataset), and location pairs referred to by fewer than 10 events (about 7 million events). This left us with about 43 million events (17 percent) and 216,000 connections between location pairs to visualize in the map.
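The two filtering steps above can be sketched as follows. This is a minimal illustration on made-up data, not the pipeline used for the map: it assumes each event is reduced to a pair of actor locations, and it treats a connection as undirected (so A to B and B to A count towards the same pair), which the source does not state explicitly.

```python
from collections import Counter

MIN_EVENTS_PER_PAIR = 10  # threshold stated in the text


def connection_counts(events, min_events=MIN_EVENTS_PER_PAIR):
    """Count events per location pair, applying both exclusion rules."""
    counts = Counter()
    for loc_a, loc_b in events:
        if loc_a == loc_b:
            continue  # both actors geo-coded to the same place
        # Assumption: connections are undirected, so order is normalised.
        counts[tuple(sorted((loc_a, loc_b)))] += 1
    # Drop location pairs referred to by fewer than min_events events.
    return {pair: n for pair, n in counts.items() if n >= min_events}


# Made-up events as (actor 1 location, actor 2 location) pairs.
events = (
    [("Seoul", "Pyongyang")] * 7
    + [("Pyongyang", "Seoul")] * 5
    + [("Washington", "Washington")] * 20
    + [("Paris", "Berlin")] * 3
)
connections = connection_counts(events)
print(connections)
```

Here only the Seoul and Pyongyang pair remains: the Washington events share a single location, and the Paris and Berlin pair falls below the threshold of ten events.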

The first map illustrates all the connections between pairs of locations. The brightness of each line reflects the number of events connecting the two places. The second graphic focuses on international events, grouping the connections by country. Colour is used to map the world’s regions and the connections between them, with colour assigned to the ‘edges’ (i.e., connections) based on the colours of the two connected nodes. The thickness of the lines represents the number of events.

Note: in the second graphic below, “Countries, Dependencies, Areas of Special Sovereignty, and Their Principal Administrative Divisions” are labeled according to their classification in the GDELT database, using the FIPS 10-4 codes.

Findings

The map confirms the United States’ position as a core geographical focal point of the collection. Seven location pairs are each connected by over 100,000 events. Every one of these seven pairs has one location outside the United States and one inside the country. The brightest lines connect the United States (and Washington in particular) with Russia (twice), Iran, Iraq, Israel (twice), and China.

It is important to be aware of the scale at which this map should be interpreted. Many of the hotspots on the map are capital cities such as Washington or Moscow, but many locations also appear to be in relatively unpopulated places, such as the American Midwest or the middle of the Australian Great Victoria Desert. This occurs because many actors in the dataset are simply geocoded to a country rather than to a particular city or town. In those cases, the dataset locates them at the geometric centroid of countries. As such, this map is most useful to illustrate broad patterns of connections between regions and countries, rather than micro-connections between specific cities.

Russia, Iran, Iraq, Israel, and China are the countries most connected in general to the United States, along with Afghanistan, each one accounting for more than 500,000 events connecting a location in the United States to a location in one of those countries. The ‘special relationship’ between the United Kingdom and the United States accounts for over 450,000 events connecting two places on either side of the Atlantic.

The United States aside, the single most active connection between two cities is between Seoul and Pyongyang, with more than 98,000 events recorded in the database. At the country level, North and South Korea are connected by almost 250,000 events. The two most connected countries (excluding the United States) are Afghanistan and Pakistan, accounting for over 425,000 events, almost double the number of events connecting Pakistan and India (about 238,000 events).

The most active relationship in the Middle East and North Africa region involves Egypt and Israel, counting over 385,000 events connecting places in the two countries, followed by the relations between Israel and the West Bank (335,000 events), and between Israel and Lebanon (over 330,000 events). There are about the same number of events connecting Iran and Iraq as the number of events connecting the United States and Canada (about 315,000 events), and almost as many events connecting China and Japan as events connecting the United States and Mexico (about 270,000 events).

Aggregating data by country, we see that most of the events involving two distinct locations are international events, as only about 5 million events refer to two locations in the same country, whereas about 38 million events refer to locations in two different countries. The second graphic focuses on international events only.

Beyond the connections mentioned above, the second graphic highlights several inter-continental connections. Russia and the United Kingdom are among the most visible European countries, followed by Germany and France. Each one of these four European countries has strong connections with Asia, especially with China, Afghanistan, and Pakistan. A tight cluster is also visible in Asia, centered in China, and involving Hong Kong, Taiwan, South Korea, and North Korea.

Russia, the United Kingdom, Germany and France also have very visible connections with countries in the Middle East, in particular with Syria, Israel, Iran, and Iraq. The bright orange lines originating from Turkey also point to that country’s connections with a handful of Middle Eastern countries.

Sub-Saharan Africa is visibly the most disconnected of the seven regions. There are a few lines connecting Sub-Saharan African countries to the United States and the United Kingdom, and a few that link Sudan with its neighbour Egypt. Otherwise, we see very few connections. A similar pattern is evident in Latin America and the Caribbean, although the connections to the United States are stronger, especially those involving Mexico and Cuba.

The media inevitably present us with particular biases and objects of attention. This work is designed to show you both the locations and connections present in hundreds of millions of news stories from around the world.

GDELT_Worldwide_News_Web-bottom2-01

Mapping the Times Higher Education’s top-400 universities

MappingTimesHigherEducationstop-400universities_final1

Description

This map depicts the locations of the world’s top 400 universities as ranked by the Times Higher Education. It also illustrates the relative wealth of the country that hosts each university.

Data

The map uses data from the World University Rankings 2013-2014, published by the Times Higher Education, in collaboration with Thomson Reuters. Thirteen indicators that measure teaching, research, knowledge transfer and international outlook are taken into account in order to evaluate universities.

Each university is represented as a square, and shaded according to the World Bank income group that its country belongs to. The four World Bank income groups are high-income (GNI per capita of $12,616 or more), upper-middle income ($4,086 – $12,615), lower-middle income ($1,036 – $4,085), and low-income (<$1,036). We exclude the low-income category from this map because not one of the 400 universities is located in a low-income country.
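The income bands above translate into a simple lookup. The sketch below assumes that the high-income band starts at $12,616, so that it adjoins the upper-middle band ending at $12,615; the function name and the sample GNI values are illustrative only.

```python
def income_group(gni_per_capita_usd):
    """Classify a country by GNI per capita using the bands in the text."""
    if gni_per_capita_usd >= 12_616:
        return "high"
    if gni_per_capita_usd >= 4_086:
        return "upper-middle"
    if gni_per_capita_usd >= 1_036:
        return "lower-middle"
    return "low"


# Illustrative values, not actual country figures.
for gni in (50_000, 12_615, 1_500, 500):
    print(gni, income_group(gni))
```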

The universities are grouped by world region, and the equator is depicted as a red line towards the bottom of the map.

Some universities are further grouped into metropolitan region clusters. The clusters have been identified using the DBSCAN density-based clustering algorithm, applying a 50 km distance threshold, and a minimum cardinality of four universities. Because of the compact nature of many European cities, we further refined some clusters manually in order to achieve meaningful definitions of metropolitan regions.
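For readers unfamiliar with DBSCAN, the following is a compact, pure-Python sketch of the algorithm using great-circle distances, with the 50 km threshold and minimum of four points mentioned above. It is not the implementation used for the map: the coordinates are made up, the manual refinement step is not reproduced, and counting a point as its own neighbour when checking the minimum is an assumption borrowed from common implementations.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dbscan(points, eps_km=50.0, min_pts=4):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force neighbour lists; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


# Made-up coordinates: four points a few km apart, plus two isolated points.
points = [
    (51.50, -0.12), (51.52, -0.13), (51.53, -0.10), (51.49, -0.18),
    (55.95, -3.19), (48.85, 2.35),
]
labels = dbscan(points)
print(labels)
```

The four nearby points form one cluster, while the two isolated points are labelled as noise. The brute-force neighbour search is quadratic in the number of points, which is acceptable for a list of 400 universities.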

Findings

The primary finding is that most of the world’s top-ranked universities are located in the world’s wealthiest countries (a point also made by Benjamin Hennig and his cartograms of the Times Higher Education rankings). The Greater London cluster alone, which does not include Oxford and Cambridge, contains the same number of top-400 universities as all of Sub-Saharan Africa, the Middle East, and Latin America combined!

Not only are there no low-income countries represented in the ranking, but India is also the only lower-middle income country represented, being home to five of the top-400 ranked universities. Latin America and Sub-Saharan Africa are home to three universities each, all six being based in upper-middle-income countries (i.e., Brazil, Colombia, and South Africa). These eleven elite universities in India, Latin America, and Sub-Saharan Africa serve a population of over 2.7 billion people.

The ranking also includes ten universities in China, an upper-middle-income economy that is home to over 1.3 billion citizens, and seven other universities from the same income group: five in Turkey, one in Iran, and one in Thailand. The remaining 34 Asian universities included in the ranking are mostly concentrated in densely populated (and wealthy) cities like Hong Kong, Seoul, Taipei, Tokyo, and Singapore.

The Middle East and North Africa also reveals a relatively concentrated geography of elite universities. Of the six universities included from the region, three are in Israel, two in Saudi Arabia, and one in Iran.

Oceania is interestingly the largest world region (in terms of number of top universities) present below the equator. All the top-400 universities in this region are found either in Australia or New Zealand, with two large clusters in Melbourne and Sydney.

Almost half of the top-400 universities are located in Europe, and over a quarter are in the United States. Northern Europe and the US East Coast are home to some of the largest university clusters, most notably in Greater London and Boston.

It’s important to remember that there are tens of thousands of universities that aren’t represented on this map; what this graphic doesn’t do is visualize the potentials or practices of all higher education worldwide. However, what it does do is clearly illustrate the highly uneven geography of elite education. The universities in the top-400 list don’t just command an undue amount of power, resources, and influence, but also serve to actively produce and reproduce it in particular parts of the world.

The geographic focus of world media

The_geographic_focus_of_world_media-final

Description

This graphic illustrates the number of events listed in the Global Database of Events, Language, and Tone, from January 1979 until August 5th, 2013. The database is compiled from stories in media outlets from almost every country on Earth.

Data

The map uses data from the Global Database of Events, Language, and Tone (GDELT), which is an initiative aiming to provide a “realtime social sciences earth observatory”, by creating a freely available catalog of events derived from news stories. Any story can contain more than one event, and events are automatically parsed out of news stories from different sources using a text analysis program called Tabari and encoded using a schema called Cameo.

A large portion of these events (140 million out of the 250 million listed) contain both a location of where the event happened and locations of the two primary actors involved. The Tabari algorithm associates events that it has already picked out of an article with geographic locations mentioned in the same text, by looking at verb usage in surrounding sentences. You can read the introductory paper on GDELT (Leetaru and Schrodt, 2013) for more on the specific geocoding methods employed.

This graphic visualises those 140 million news events for which spatial data exists. In the map, each pie chart refers to a different world region. The shading of each country’s slice reflects the number of events recorded as happening in that country. The size of the pie represents the total number of events within that region, and the slices represent the relative percentage associated with each country. Also shown is a chart illustrating the number of events recorded in every year between 1979 and 2012.

Findings

News is one of the central ways that we learn about and understand our world. It therefore contributes massively to how we understand place.

Looking at the data, we see that most recorded news events for which we have geographic data are located in the northern hemisphere, with North America, Europe and Asia each individually accounting for more than 20% of the whole collection.

Most events reported in Europe are located in Russia or large western European countries (in particular, the United Kingdom, France, and Germany), whilst relatively little attention is devoted to other parts of Europe.

There are roughly the same number of events located in Asia as there are in North America, despite the fact that the population of Asia is about twelve times larger. China and India together only account for 23% of Asian events, despite being home to about two-thirds of Asia’s population. Smaller countries in conflict areas, such as Afghanistan and Pakistan, account for a similar amount of attention.

A similar situation can be seen in the Middle East and North Africa region. Long-standing conflict areas (Israel, Iraq, Iran, Syria, Egypt, and Libya) account for more than three-quarters of the news events referring to this region in the collection.

More neglected are most of the countries in Latin America. The region contains only a fifth as many stories as North America, despite having almost double the population. Countries in Sub-Saharan Africa are similarly home to a relatively small number of reported events, with most of what is written focused on just a handful of the region’s 47 countries. Note for instance that we see relatively little content about ongoing, and costly, conflicts in the Democratic Republic of the Congo, despite the heavy focus on conflicts in other parts of the world like Israel/Palestinian Territories and Afghanistan.

Also illustrated is the development of the GDELT dataset over time, from 1979 until 2012. We can see that the last five years account for the vast majority of the listed events.

A finer-grained analysis of the geographic information reported in the dataset shows that 80% of the events are located in a short list of only 816 specific locations scattered throughout the world. In other words, not only are the geographies of news highly uneven at the national scale, but even within those places, articles tend to focus on only a small subset of locations.
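A concentration figure of this kind can be derived by ranking locations by event count and finding how many of the busiest ones are needed to reach a given share of all events. The sketch below is one plausible way to do this, on made-up counts; the `locations_covering` helper is hypothetical and the source does not describe the exact method used.

```python
def locations_covering(event_counts, share=0.8):
    """Return how many of the busiest locations cover `share` of all events."""
    total = sum(event_counts.values())
    running = 0
    ranked = sorted(event_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= share * total:
            return rank
    return len(event_counts)


# Made-up event counts per location.
event_counts = {"A": 60, "B": 25, "C": 10, "D": 5}
top_n = locations_covering(event_counts)
print(top_n)
```

In this toy example the two busiest locations already hold 85 of the 100 events, so two locations are enough to cover 80% of the total.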

News stories necessarily reflect underlying events and processes that have distinct geographies. Some of the distributions we see in these charts are undoubtedly artefacts of the gazetteers that have been used to geocode news events. We already discussed the uneven geographies of gazetteers, and it stands to reason that employing them to geocode news events will simply reproduce some of their underlying biases.

But irrespective of biases derived from the use of gazetteers for geocoding, the question that we ultimately have to ask ourselves is whether our media are presenting us with something resembling an accurate reflection of newsworthy events. This graphic suggests that may not be the case.