Geographies of gazetteers

GeoNames_visualisation_8M_final Map.png

Gazetteers are playing a central role in the current data revolution, as data scientists apply natural language processing methodologies to texts (from social media data to literature) to identify their geographic component.

However, little research has been devoted to the geographies of gazetteers, their idiosyncrasies and skewness. In a recent short paper for Environment and Planning A (pre-print available on SSRN), Mark Graham and I explored the geographies of the GeoNames gazetteer.

The visualization above illustrates the density of named features (i.e., entries in the gazetteer that refer to a place or natural feature and record its name and related information) in GeoNames, the largest freely available gazetteer covering the globe. This map uses GeoNames gazetteer data from May 2013, and colour is used to represent the number of features per square kilometre.

Most strikingly, the visualization does not really resemble a map of population, as illustrated in the map below. Nor are the named features evenly distributed among regions and countries, as illustrated in the graph below. Instead, we see dense clusters of features in some parts of the world and a lack of geographic information in others. Interestingly, the information presences that we see are characterised by unusual patterns. Not only do we see the usual suspects of Western Europe and the United States with large amounts of geographic information, but we also see significant densities in places like Sri Lanka, Iran, and Nepal.

GeoNames_Population_Map

National and international policies clearly play a large role in the construction of this gazetteer. The United States is by far the most represented country in the dataset, accounting for a quarter of the total number of features. Nepal comes in 11th, apparently thanks to a project funded by the European Union in 2001, and counts more features than India and the UK put together.

GeoNames_Population_Scatterplot

These global differences are crucial in the context of the current data revolution. First, this can be seen as a technical problem, as analyses might be strongly influenced by the gazetteer chosen to process the data, and end-users would ‘see’ the gazetteer rather than the geographic phenomenon that they aim to study. Second, and possibly more importantly, this is an ethical issue, as there is a risk of perpetuating data-program-data cycles, which reinforce global, historical, and broader information inequalities, as Mark Graham, Matthew Zook, and I discuss in another recent paper published in Geo.

This is where human geography meets data science, and further studies will be crucial to understand the geographies of gazetteers, their origins, and their impact on applications.

Thanks to Elise Acheson and Ross Purves for the recent (and most fruitful) discussions on this topic.

Uneven Geographies of OpenStreetMap

OpenStreetMap_Satellite

Description

This series of maps shows the location of edited content in the world’s largest collaborative mapping project: OpenStreetMap.

Data

The maps use OpenStreetMap data downloaded from GeoFabrik.de on December 12th, 2013. Each sub-region extract has been parsed and for each node (i.e., elements used in OpenStreetMap to represent any point feature), the coordinates, version, and last update values have been selected.

The first map was created by counting the number of nodes in each cell of a grid of 0.1 degrees of latitude by 0.1 degrees of longitude. The second map instead focuses on edits by summing the version numbers of all nodes in a cell (as this number is increased by one each time a node is modified), resulting in a count of all edits for the whole history of OpenStreetMap. The third map focuses on the age of content, and so records, for each cell of the grid, the latest update made to any of its nodes.
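The three aggregations described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to produce the maps: the function names, the tuple layout of the input, and the use of ISO date strings for the last update are all assumptions made for the example.

```python
import math
from collections import defaultdict

CELL = 0.1  # grid resolution in degrees, as described in the text


def cell_index(lat, lon, cell=CELL):
    """Return the (row, column) of the grid cell containing a coordinate."""
    return (math.floor((lat + 90.0) / cell), math.floor((lon + 180.0) / cell))


def aggregate(nodes, cell=CELL):
    """Aggregate (lat, lon, version, last_update) tuples per grid cell.

    The tuple layout is hypothetical; last_update is assumed to be an
    ISO date string, so that string comparison orders dates correctly.
    """
    counts = defaultdict(int)  # first map: number of nodes per cell
    edits = defaultdict(int)   # second map: sum of version numbers per cell
    latest = {}                # third map: most recent update per cell
    for lat, lon, version, last_update in nodes:
        key = cell_index(lat, lon, cell)
        counts[key] += 1
        edits[key] += version
        if key not in latest or last_update > latest[key]:
            latest[key] = last_update
    return dict(counts), dict(edits), latest
```

Keeping the three results in separate dictionaries keyed by the same cell index makes it straightforward to render each map from the same pass over the data.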

Findings

The first map offers a revealing picture of the presence of thick layers of content that annotate a few parts of the world, and a relative absence of content over much of the rest of the planet. The glowing centres of content in parts of North America, Europe, Oceania, and Japan, in many ways, parallel the visual intensity of lights in NASA’s Earth City Lights series.

The United States accounts for the largest total amount of content, collecting 21% of all nodes present in OpenStreetMap (OSM), followed by France, Canada, Germany and Russia, all counting more than 100 million nodes. These five countries alone collect 58% of the content, and high-income OECD countries sum up to about 80% of OSM.

The Netherlands enjoys the highest density of content, with an average of over 1000 nodes per square kilometre, followed by Belgium with over 700 nodes per square kilometre, and Germany, the Czech Republic, Switzerland, and France, with about 400 nodes per square kilometre.

In contrast to the brightness of Europe, the southern hemisphere is barely visible, as the amount of content available on OSM about that part of the world is far lower than in the northern hemisphere, with Africa and Latin America represented by less than 5% of the content. California alone accounts for almost as much content as the entirety of Africa.

Turkey and the western part of the Middle East are visible, but already fading into a less intense color. The emerging powers of Brazil, India, and China appear to be suffering from a widespread content “blackout”, where only the largest urban centres are visible. Brazil accounts for fewer nodes than Switzerland, and China for even fewer. The same applies to most of the remaining parts of Africa, Asia, and Latin America. One of the oldest urbanized areas of the world, an amazing strip of lights that follows the course of the Nile, is barely visible. In fact, Egypt accounts for as many nodes as Iceland, despite being 10 times as big and having 250 times the population.

Interestingly, content in parts of North Korea lights up the map: an unusual situation for a country not renowned for even appearing in most indices of online participation. This is most likely thanks to work done in 2011 by the OSM developer community. We see a similar situation in Newfoundland and Labrador, with large swaths of sparsely populated land characterised by relatively dense amounts of content. The Canadian case is likely a result of a detailed physical geography dataset that was bulk-uploaded to OpenStreetMap.

Several studies have been conducted on the quality of OSM’s coverage in these areas (e.g., see the paper by Haklay et al, 2010) where high-quality data from government agencies are also available for comparison. However, it has to be noted that these are the same countries where Open Data policies have spread, allowing lots of data to be uploaded to OSM. In fact, the visible distribution of content is not too different from the map of the GeoNames gazetteer project we published some months ago.

The second map below illustrates the number of edits made to OpenStreetMap. Unsurprisingly, the most content-dense areas are also the most heavily edited, because each new node included means one more edit made within the related area. However, statistical analysis suggests that the United States and Germany account for far more edits than would be expected given the related content in OSM, whereas content from Italy and the Netherlands is far less edited than expected. In most parts of the rest of the world the number of edits is simply related to the number of objects in a given area.

EditingTheMap

The third and last map illustrates the most and least recently updated areas in OSM, similarly to a map included in Mapbox’s recent 2013 OpenStreetMap Data Report.

It is not surprising that most areas in Europe have seen at least one edit in the week before the data were collected. Similarly, it is evident how the most remote regions of the world have not been updated for years, from Siberia to the Australian Outback, from central Africa to the Amazon basin and northern Canada.

While most of the map shows a random mix of data, due to the volunteer-based nature of the project, there are some evident areas of plain colour, which might indicate bulk uploading of new data and datasets from government agencies or companies. Examples can be found in Iraq, where most of the country was updated between September and November 2013; in Australia, where large areas of South Australia have been recently updated, and the updates clearly follow the state borders with New South Wales and Victoria; and in Estonia, which has also received recent edits for most of its territory.

TheAgingMap

OSM will turn 10 years old in a few months. Combining the findings from these three maps, it is evident that it offers a very good geographical representation of the most developed countries and their urban environments. OSM also provides a large amount of information about non-urban areas, although this is not as up-to-date and detailed as the information about urban areas.

The quantity and the quality of the data make OSM one of the most powerful and exciting open-source projects that the Internet has facilitated in recent years, along with Linux and Wikipedia. Nonetheless, there is still a lot of work to do, and the development of the project in its second decade will probably depend on it attracting new volunteers among the new Internet users in Africa, Asia, Latin America, and the Middle East. Finally, OSM will be influenced by its relationships with the many companies that currently base their mapping services on it, as well as by the future spread of open data policies.

Geographic Knowledge in Freebase

Freebase-final-01_Map

Description

This map shows the global distribution of geo-located entities described in Freebase, a collaborative knowledge base that defines itself as “an open shared database of the world’s knowledge”.

Data

Freebase forms one of the key informational ingredients in Google’s Knowledge Graph. If you’ve ever looked at the side panel in Google’s search results page, which presents information about people, places, and events in response to a search query, then you’ve probably come into contact with data stored in Freebase.

The data that we collected from Freebase describe over 43 million entities, among which we identified 478 thousand place names. The content is stored as RDF triples, each of which specifies a statement in the form of subject-verb-object. We surveyed the triples in the dataset and collected all entities associated with a latitude-longitude coordinate pair; that is, all subjects of triples where the verb refers to the concept “has latitude” or “has longitude”.
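The selection step described above amounts to a single pass over the triples. The sketch below is illustrative only: the predicate names `has_latitude` and `has_longitude` are placeholders (the actual Freebase identifiers are not given here), and requiring both a latitude and a longitude for each entity is an assumption of the example.

```python
# Hypothetical predicate names; the actual Freebase identifiers are not
# given in the text.
HAS_LATITUDE = "has_latitude"
HAS_LONGITUDE = "has_longitude"


def geolocated_entities(triples):
    """Return {subject: (lat, lon)} for subjects that have both coordinates.

    `triples` is an iterable of (subject, verb, object) tuples.
    """
    lats, lons = {}, {}
    for subject, verb, obj in triples:
        if verb == HAS_LATITUDE:
            lats[subject] = float(obj)
        elif verb == HAS_LONGITUDE:
            lons[subject] = float(obj)
    # Keep only the subjects for which a complete coordinate pair was found.
    return {s: (lats[s], lons[s]) for s in lats if s in lons}
```

Entities without a complete coordinate pair, such as abstract concepts or people, simply never appear in the result.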

Findings

Geographic content in Freebase is largely clustered in certain regions of the world. The United States accounts for over 45% of the overall number of place names in the collection, despite covering about 2% of the Earth and less than 7% of the land surface, and being home to less than 5% of the world population and about 10% of Internet users. This results in a US density of one Freebase place name for every 1500 people, and far more place names referring to Massachusetts than referring to China.

A third of all place names are geo-located in Europe. The United Kingdom is home to about 7% of place names, Poland has about 6%, and France has just over 5%. The United Kingdom accounts for one place name for every 2000 inhabitants, the same proportion as Luxembourg. Ukraine is the only European country described with less than one place name per 30,000 inhabitants, whereas Slovenia and Poland are described in exceptional detail, with about one place name for every 1000 people and one place name for every 1300 inhabitants, respectively.

This stands in contrast to countries like China, which accounts for less than 1% of the collection (with fewer than 4000 place names, and a density of only one place name for every 300,000 inhabitants). Most of Africa, Asia, Latin America and the Caribbean are similarly underrepresented. Nigeria barely represents 0.1% of the place names, and Venezuela accounts for only 0.05%. Outside Europe and North America, only four countries (Australia, China, India, and Japan) are represented with more content than Antarctica (in part because the database contains descriptions of hundreds of Antarctic mountains and ranges).

The largest cluster of under-represented countries is found in Sub-Saharan Africa, where only a handful of countries are described by more than one place name for every 100,000 inhabitants. South Africa is the notable exception, as it exhibits information counts comparable to most European countries. Other exceptions are Nepal and Bhutan in Asia, which score relatively highly compared to neighbouring countries. It is also worth pointing out that Indonesia is the country with the lowest information density in the world, with only one place name per 470,000 people.

Because Freebase is a core ingredient in the informational menu presented to us by the world’s most widely used search engine, these presences and absences have the potential to have a significant impact on how we understand, interact with, and create our world. Freebase may seem like a small corner of the Web, but the imbalances that we observe in it can have large reverberations through the broader information ecosystems accessed by billions of people.

A world’s panorama

Density_Photographs_Panoramio-01

Description

This map represents the location of public photographs published on Panoramio, one of the largest photo-sharing services on the Web.

Data

The map uses data collected via the Panoramio Data API in December 2013. We used the API to retrieve the number of public photos tagged to locations in each of 259,200 bounding boxes into which we divided the world. It’s worth noting that because our boxes are sized to be a quarter of a degree of latitude tall and a quarter of a degree of longitude wide, the mapped cells are not of a consistent size globally. A cell in Edinburgh has about half the area of a cell in Nairobi. This means that locations near the equator are more likely to show up as bright concentrations of content, compared to locations with equivalent numbers of photographs in higher or lower latitudes, although the colour scale used should limit this effect.
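The effect of latitude on cell size can be checked with a short calculation. The sketch below assumes a spherical Earth with a mean radius of 6371 km and uses approximate latitudes for the two cities (about 55.95° N for Edinburgh and 1.29° S for Nairobi); the function name and cell boundaries are chosen for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def cell_area_km2(lat_south, size_deg=0.25):
    """Area of a size_deg x size_deg cell whose southern edge is at lat_south."""
    phi1 = math.radians(lat_south)
    phi2 = math.radians(lat_south + size_deg)
    dlon = math.radians(size_deg)
    # Area of a latitude-longitude rectangle on a sphere.
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(phi2) - math.sin(phi1))


edinburgh = cell_area_km2(55.75)  # cell spanning 55.75 N to 56.00 N
nairobi = cell_area_km2(-1.5)     # cell spanning 1.50 S to 1.25 S
ratio = edinburgh / nairobi       # a little over one half
```

Under these assumptions the ratio comes out slightly above one half, in line with the comparison made above.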

Findings

Building on our map of content in Flickr, this graphic tells a very similar story. Panoramio is smaller than Flickr, with about a tenth of its users, and only a fraction of its photos. Nonetheless, Panoramio plays an important role in online representations of places, as photographs on the site can be accessed as a layer in Google Maps and Google Earth.

The United States is layered with more than two million public photographs published on Panoramio. It is closely followed by Russia, China, Germany and Brazil, which are each covered with more than a million photos. These five countries account for about one-third of the entire public content on the platform.

However, it is the Netherlands that is covered by the densest layer of content, with over five pictures per square kilometer. The Netherlands is followed by Switzerland, Slovakia, Germany, and Belgium, which all have an average of three pictures per square kilometer.

In contrast, Africa in particular is characterised by very thin layers of digital content (Italy alone is covered by more photos than the whole continent). No African country has more than one picture per five square kilometers; the highest being Tunisia with 0.2 photos per square kilometer. Algeria is the country with the most photographs in Africa, but tiny Western Sahara has the fewest, representing just 0.016% of the content created about the United States.

Whilst Latin America and the Caribbean tend to score poorly on many other metrics of information production, they are represented by a non-trivial amount of content, with about as many photographs as the United States. In Asia, China accounts for the largest portion of pictures, followed by Turkey (with 800,000), and then Japan and India, each with about half a million pictures. The rest of Asia combined is described by about 1.8 million pictures.

These presences and absences all ultimately influence what we see, and where we see it, when using some of the web’s most popular platforms.

Geographic coverage of Wikivoyage

Wikivoyage_Circles_final

Description

This graphic depicts the geographic focus of four major languages of the Wikivoyage project, one of the world’s most popular crowd-sourced travel guides.

Data

This graphic uses data freely available from the Wikimedia Dumps website, collected in October 2013.

To determine the location of each article, we used Wikivoyage’s internal geographic hierarchy. The page on Blackburn, for instance, is nested within the categories of Lancashire and the United Kingdom. English and German have been included as they are the two largest sub-projects in Wikivoyage according to Wikimedia Statistics. We selected Italian and Spanish because they respectively represent good examples of geographically concentrated and dispersed languages.

Each ring represents one of the languages, and is sized in relation to the number of articles present in that language. Each section of a ring represents the number of articles in that language about a country. The visualisation excludes countries represented by fewer than three pages.
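The two steps described above, resolving each article to a country through the geographic hierarchy and then sizing the ring sections, could be sketched as follows. This is a hedged illustration rather than the code behind the graphic: the `parent` mapping, the `countries` set, and the choice to express each section as a share of the retained articles are all assumptions of the example.

```python
from collections import Counter

MIN_PAGES = 3  # countries with fewer pages are excluded, as in the text


def country_of(page, parent, countries):
    """Walk up a (hypothetical) page -> parent mapping to a country page."""
    seen = set()
    while page is not None and page not in seen:
        if page in countries:
            return page
        seen.add(page)  # guard against cycles in the hierarchy
        page = parent.get(page)
    return None


def ring_sections(article_countries, min_pages=MIN_PAGES):
    """Return {country: share of the ring} for one language edition."""
    counts = Counter(c for c in article_countries if c is not None)
    kept = {c: n for c, n in counts.items() if n >= min_pages}
    total = sum(kept.values())
    return {c: n / total for c, n in kept.items()} if total else {}
```

For instance, with `parent = {"Blackburn": "Lancashire", "Lancashire": "United Kingdom"}` and `countries = {"United Kingdom"}`, the page on Blackburn resolves to the United Kingdom.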

Findings

The visualisation shows us that, in all four languages, extensive coverage exists of countries in which those languages are spoken. Wikivoyage — one of the world’s most used travel guides — therefore presents us with a very selective picture of the world.

The United States accounts for a large portion of the content included in the English edition of Wikivoyage, and the comparison with the other languages is striking. The same applies to Germany in German, and Spain in Spanish. English-speaking countries account for about half of the pages written in English, and Spanish-speaking countries account for about half of the pages written in Spanish. However, German-speaking countries account for only about one third of German Wikivoyage, and the Italian edition dedicates an even smaller percentage of pages to Italy (just above 18%).

In other words, despite the fact that Wikivoyage is by its nature a project designed to facilitate writing about distant parts of the world that people might travel to, people aren’t actually writing that much content about places in which the language that they speak isn’t widely spoken (notable exceptions being content about Egypt in German, and about Greece in Italian, which account for more than 4% of the respective guides).

Low-income countries are particularly under-represented by the English, German, and Italian projects, with only about one third of articles in those languages dedicated to countries outside Europe, North America, Australia, and New Zealand. The Spanish Wikivoyage, in contrast, devotes almost 40% of its content to the Latin America and Caribbean region (as Spanish is widely spoken in that region, and it is possible that a significant number of editors are writing from the region). Sub-Saharan Africa, in contrast, is heavily under-represented in the Spanish Wikivoyage, comprising only 0.1% of the collection.

As ever more people use online travel guides, it will be important to understand whether these inequalities in information begin to actively shape where and how people move around the world.

Geographic intersections of languages in Wikipedia

Wikipedia_geotagged_articles_final-update (1)

Description

This graph illustrates the percentage of geo-referenced articles in the twenty editions of Wikipedia containing the largest number of geo-referenced articles.

Data

The Terra Incognita project by Tracemedia investigates how Wikipedia has evolved over the last decade, mapping geographic articles, and date of creation, for over 50 languages. The maps highlight geolinguistic biases, unexpected areas of focus, and overlaps between the spatial coverage of different languages.

The project was developed using geo-coded Wikipedia articles from the Wikimedia Toolserver Ghel project (Geohack External Links), and article metrics that were collated using Toolserver scripts. The Ghel data dumps date to July 2013.

Only articles with primary coordinates are used, that is “where the location should be considered the primary object(s) in the page […]. Generally this should be one per article, but may be more with current corner cases with source and outlet of lakes and rivers” (Ghel project).

As illustrated in the featured graphic above (see table, bar chart by the Terra Incognita project), the percentages of geocoded articles in Wikipedia editions vary widely, from a minimum of 2% (Hindi Wikipedia) to a maximum of 46% (Polish Wikipedia), with the exception of the constructed language Volapük, whose Wikipedia edition includes 79% geocoded articles. Most large editions in Germanic and Italic languages contain between 12% (Italian Wikipedia) and 20% (English Wikipedia) of geo-coded articles.

Findings

The primary goal of the illustrations presented in this piece is to visualise how Wikipedia has very divergent geographic coverage in different languages. The tool also allows us to look at the date on which every one of the 4.5 million geocoded articles in Wikipedia was created, thus enabling us to see how the focus of different linguistic communities has evolved.

Most geo-coded Wikipedia articles are located in the countries where the language is listed as an official one.

One of the most interesting patterns that we can see in the data is that over 70% of articles written in languages that are spoken predominantly in a single country (e.g. Czech or Italian) exist only in that language. This means, for instance, that there might be articles about thousands of Czech villages written in Czech, but not in English, French, German, or even Japanese.

Furthermore, Terra Incognita studies how two or more languages intersect with each other, that is, where two distinct Wikipedia editions refer to the same location, and what proportion of each collection such articles represent. These linking points can be visualized by means of language intersection maps, which highlight locations referred to in more than one language.

Some of the most interesting linguistic comparisons can be seen when comparing the geography of different languages in multilingual parts of the world, such as Spain. We can see a high density of articles in Galicia, the Basque Country, Catalonia and to a lesser extent Valencia in their respective languages. Spanish (Castilian) is more evenly represented across the whole country.

spain_languages

A similar approach can be taken to explore the distribution of Wikipedia articles in some of the main languages spoken in South Asia. The map below, for instance, includes Bishnupriya Manipuri, Hindi, Nepal Bhasa and Tamil.

india_tamil_bish_hindi_nepalbhasa

Regional variations are not as strongly pronounced as they were in the Spanish case (Tamil, which is concentrated in South India and Sri Lanka, is a notable exception). The overlap of the languages with each other is consistently between 12% and 16%, with the exception of Bishnupriya Manipuri and Nepal Bhasa, where the majority (65.1%) of articles are shared. These shared articles are distributed across India, and the distinct articles are in the native Nepal and Bangladesh.

One further case study is presented below, illustrating the interaction between Romanian, Bulgarian and Serbian Wikipedia subprojects.

romanian_bulgarian21

Romanian and Bulgarian Wikipedia articles are largely concentrated within the political boundaries of their respective countries. There is very limited overlapping of geographic content except in major cities.

bulgarian_serbian

Bulgaria and Serbia also share a border, and Bulgarian and Serbian are both Slavic languages (in contrast to Romanian, which is a Romance language). There is a much higher percentage of language intersections for articles between Bulgarian and Serbian than between Bulgarian and Romanian. For instance, a large number of intersected articles appear in Macedonia, which shares a border with Serbia and Bulgaria.

These maps, and the associated Terra Incognita tool, offer us an insight into not just patterns in Wikipedia, but also the geographic spheres of interest to different linguistic communities. As we work to better understand online geographies of knowledge, these maps allow us to ask important questions about who is representing whom, and who is being represented by whom.

Credits

The project was created by Gavin Baily and Sarah Bagshaw at TraceMedia, and was supported by funding from the Arts Council of England Grants for Arts and the National Lottery.

World-wide news web

GDELT_Worldwide_News_Web-top

Description

This map depicts mentions of multiple places in news articles between 1979 and 2013. Brighter lines indicate more connections between places.

Data

The map uses data from the Global Database of Events, Language, and Tone (GDELT), which is an initiative aiming to provide a “realtime social sciences earth observatory”, by creating a freely available catalog of events derived from news stories. The database is compiled from stories in media outlets from almost every country in the world. Any story can contain more than one event, and events are automatically parsed out of news stories using a text analysis program called Tabari and encoded using a schema called Cameo.

A large portion of these events (140 million out of 250 million listed events) contains both a location of where the event happened and locations of the two primary actors involved. The Tabari algorithm associates events that it has already picked out of an article with geographic locations mentioned in the same text (by looking at verb usage in surrounding sentences). You can read the introductory paper on GDELT (Leetaru and Schrodt, 2013) for more on the specific geocoding methods employed.

We exclude all events where the two actors are geo-coded as being located in the same place (about 91 million events, or 36 percent of the full dataset), and location pairs referred to by fewer than 10 events (about 7 million events). This left us with about 43 million events (17 percent) and 216,000 connections between location pairs to visualize in the map.
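The two exclusion rules described above can be sketched in Python as follows. This is a minimal illustration, not the processing code used for the map: the input format (one tuple of two actor locations per event) and the decision to treat location pairs as unordered are assumptions made for the example.

```python
from collections import Counter

MIN_EVENTS = 10  # pairs referred to by fewer events are dropped, as in the text


def connections(events, min_events=MIN_EVENTS):
    """Count events per location pair, applying both exclusions.

    `events` is an iterable of (actor1_location, actor2_location) tuples.
    Pairs are treated as unordered, which is an assumption of this sketch.
    """
    pairs = Counter()
    for loc1, loc2 in events:
        if loc1 == loc2:
            continue  # both actors geo-coded to the same place
        pairs[tuple(sorted((loc1, loc2)))] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_events}
```

The resulting dictionary maps each retained pair of locations to its number of events, which is the quantity the line brightness in the first map is based on.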

The first map illustrates all the connections between pairs of locations. The brightness of each line reflects the number of events connecting the two places. The second graphic focuses on international events, grouping the connections by country. Colour is used to map the world’s regions and the connections between them, with colour assigned to the ‘edges’ (i.e., connections) based on the colours of the two connected nodes. The thickness of the lines represents the number of events.

Note: in the second graphic below, “Countries, Dependencies, Areas of Special Sovereignty, and Their Principal Administrative Divisions” are labeled according to their classification in the GDELT database, using the FIPS 10-4 codes.

Findings

The map restates the United States’ position as a core geographical focal point of the collection. There are seven location pairs that are characterized by over 100,000 events happening between them. Every one of these seven pairs has one location outside of the United States and one inside the country. The brightest lines connect the United States (and Washington in particular), with Russia (twice), Iran, Iraq, Israel (twice), and China.

It is important to be aware of the scale at which this map should be interpreted. Many of the hotspots on the map are capital cities such as Washington or Moscow, but many locations also appear to be in relatively unpopulated places, such as the American Midwest or the middle of the Australian Great Victoria Desert. This occurs because many actors in the dataset are simply geocoded to a country rather than to a particular city or town. In those cases, the dataset locates them at the geometric centroid of countries. As such, this map is most useful to illustrate broad patterns of connections between regions and countries, rather than micro-connections between specific cities.

Russia, Iran, Iraq, Israel, and China are the countries most connected in general to the United States, along with Afghanistan, each one accounting for more than 500,000 events connecting a location in the United States to a location in one of those countries. The ‘special relationship’ between the United Kingdom and the United States accounts for over 450,000 events connecting two places on either side of the Atlantic.

The United States aside, the single most active connection between two cities is between Seoul and Pyongyang, with more than 98,000 events recorded in the database. At the country level, North and South Korea are connected by almost 250,000 events. The two most connected countries (excluding the United States) are Afghanistan and Pakistan, accounting for over 425,000 events, almost double the number of events connecting Pakistan and India (about 238,000 events).

The most active relationship in the Middle East and North Africa region involves Egypt and Israel, counting over 385,000 events connecting places in the two countries, followed by the relations between Israel and the West Bank (335,000 events), and between Israel and Lebanon (over 330,000 events). There are about the same number of events connecting Iran and Iraq as the number of events connecting the United States and Canada (about 315,000 events), and almost as many events connecting China and Japan as events connecting the United States and Mexico (about 270,000 events).

Aggregating data by country, we see that most of the events involving two distinct locations are international events, as only about 5 million events refer to two locations in the same country, whereas about 38 million events refer to locations in two different countries. The second graphic focuses on international events only.

Beyond the connections mentioned above, the second graphic highlights several inter-continental connections. Russia and the United Kingdom are among the most visible European countries, followed by Germany and France. Each one of these four European countries has strong connections with Asia, especially with China, Afghanistan, and Pakistan. A tight cluster is also visible in Asia, centered in China, and involving Hong Kong, Taiwan, South Korea, and North Korea.

Russia, the United Kingdom, Germany and France also have very visible connections with countries in the Middle East, in particular with Syria, Israel, Iran, and Iraq. The bright orange lines originating from Turkey also point to that country’s connections with a handful of Middle Eastern countries.

Sub-Saharan Africa is visibly the most disconnected of the seven regions. There are a few lines connecting Sub-Saharan African countries to the United States and the United Kingdom, and a few that link Sudan with its neighbour Egypt. Otherwise, we see very few connections. A similar pattern is evident in Latin America and the Caribbean, although the connections to the United States are stronger, especially those involving Mexico and Cuba.

The media inevitably present us with particular biases and objects of attention. This work is designed to show you both the locations and connections present in hundreds of millions of news stories from around the world.

GDELT_Worldwide_News_Web-bottom2-01