Geographies of gazetteers



Gazetteers are playing a central role in the current data revolution, as data scientists apply natural language processing methodologies to texts (from social media data to literature) to identify their geographic component.

However, little research has been devoted to the geographies of gazetteers, their idiosyncrasies and skewness. In a recent short paper for Environment and Planning A (pre-print available on SSRN), Mark Graham and I explored the geographies of the GeoNames gazetteer.

The visualization above illustrates the density of named features (i.e., entries in the gazetteer that refer to a place or natural feature and record its name and related information) in GeoNames, the largest freely available gazetteer covering the globe. This map uses GeoNames gazetteer data from May 2013, and colour is used to represent the number of features per square kilometre.

Most strikingly, the visualizations do not really resemble a map of population, as illustrated in the map below. Nor are the named features evenly distributed among regions and countries, as illustrated in the graph below. Instead, we see dense clusters of features in some parts of the world and a lack of geographic information in others. Interestingly, the information presences that we see are characterised by unusual patterns. Not only do we see the usual suspects of Western Europe and the United States with large amounts of geographic information, but we also see significant densities in places like Sri Lanka, Iran, and Nepal.


It is clear that national and international policies play a large role in the construction of this gazetteer. The United States is by far the most represented country in the dataset, accounting for a quarter of the total number of features. Nepal comes in 11th, apparently thanks to a project funded by the European Union in 2001, and counts more features than India and the UK put together.


These global differences are crucial in the context of the current data revolution. First, this can be seen as a technical problem, as analyses might be strongly influenced by the gazetteer chosen to process the data, and end-users would ‘see’ the gazetteer rather than the geographic phenomenon that they aim to study. Second, and possibly more importantly, this is an ethical issue, as there is a risk of perpetuating data-program-data cycles, which reinforce global, historical, and broader information inequalities, as Mark Graham, Matthew Zook, and I discuss in another recent paper published in Geo.

This is where human geography meets data science, and further studies will be crucial to understand the geographies of gazetteers, their origins, and their impact on applications. 


 Thanks to Elise Acheson and Ross Purves for the recent (and most fruitful) discussions on this topic.


Vis-à-Wik: a visual analytics tool for Wikipedia analysis

Further to a short paper I wrote with Arzu, Kathryn, Scott, and Ralph (see Collaborative Visualizations for Wikipedia Critique and Activism), I started working on Vis-à-Wik, a simple online visual analytics tool for Wikipedia analysis. Vis-à-Wik retrieves data from the MediaWiki Wikipedia API, and uses D3.js to visualize the links between Wikipedia articles as a network diagram. This simple tool allows users to search for Wikipedia articles in a selected language edition, and visualize the articles selected by the user as a set of nodes, along with the related articles in a second language edition, and the links and language-links between them. The aim is to facilitate mid-scale analysis of Wikipedia content, that is, somewhere between a single-page analysis (that editors do routinely) and large-scale analyses (e.g., academic research projects).
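As a rough illustration of the kind of request involved, the sketch below builds a MediaWiki API URL asking for the links and language links of a single article. It is not necessarily the request Vis-à-Wik itself sends: the function name, the example article, and the chosen language pair are assumptions made for illustration, and the URL is only printed, not fetched.

```python
from urllib.parse import urlencode


def build_links_query(title, lang="en", target_lang="it", limit=50):
    """Build a MediaWiki API URL for the links and language links of one article.

    `lang` is the language edition to query; `target_lang` restricts the
    language links to a second edition. Both defaults are illustrative.
    """
    params = {
        "action": "query",          # standard MediaWiki query module
        "format": "json",
        "titles": title,
        "prop": "links|langlinks",  # page links and inter-language links
        "pllimit": limit,           # maximum number of page links returned
        "lllang": target_lang,      # only language links to this edition
        "lllimit": limit,           # maximum number of language links returned
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


url = build_links_query("Gazetteer", lang="en", target_lang="it")
print(url)
```

The JSON response to such a query could then be turned into nodes (articles) and edges (links and language links) for a network diagram.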

Vis-à-Wik is available for testing online, while the code is available on GitHub under the GPLv3 licence. This is not a collaborative visualization tool, and it currently implements only one of the visualization methods, but it is (hopefully) a first step in a larger endeavour.

The image below is a screenshot showing the same graph in the illustrative example presented in Collaborative Visualizations for Wikipedia Critique and Activism.



A “big” step

Digital Realism

(Image: “Pensiero” by Ilaria Parente)

A rising number of voices claim that we can now understand society, literature, and art using ‘big data’ analytics, fostering epochal perspectives (or ‘epochalistic’, according to Savage) on how the recent ‘data deluge’ will impact society and research, and possibly change our lives, as discussed by Mayer-Schonberger and Cukier in Big data.

Some of those claims follow the line set by Anderson, as he suggested “the end of theory”, minimizing possible issues of these methods, and subordinating them to a higher common good or economic advantage. Questions have been raised about how companies perform big data analytics and use it. Some authors are more cautious, others extremely skeptical. In The data revolution, Kitchin highlights how current applications of big data analytics raise issues of quantification, which are as old…


Mapping collaborative software




GitHub is one of the world’s biggest and best-known hosting services for software development projects. The shading of the map illustrates the number of users as a proportion of each country’s Internet population. The circular charts surrounding the two hemispheres depict the total number of GitHub users (left) and commits (right) per country. The uneven geographies of GitHub can possibly shed light on the ways in which different countries are being enrolled into a global knowledge economy.


The data in this map consist of all public events logged by GitHub in 2013. The data are freely available from the GitHub Archive.

We analysed over 65 million commits, made by about 1.1 million users active in 2013 (i.e., users that registered at least one “PushEvent”). Only 26% of users (accounting for over 44% of the commits) specified a location that we were able to match to an actual place. We employed a script based on the Unlock Places service to geolocate the locations in people’s profiles.
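The counting step described above can be sketched roughly as follows. This is a simplified illustration, not the script used for the map: the event records are reduced to three hypothetical fields (`type`, `actor`, `commits`), which do not reproduce the actual GitHub Archive schema, and the geolocation step is replaced by a hand-written lookup table.

```python
from collections import Counter


def commits_per_user(events):
    """Sum the commits pushed by each user, ignoring all other event types."""
    commits = Counter()
    for event in events:
        if event["type"] == "PushEvent":
            commits[event["actor"]] += event["commits"]
    return commits


# Hypothetical, simplified events (not the real GitHub Archive format).
events = [
    {"type": "PushEvent", "actor": "alice", "commits": 3},
    {"type": "WatchEvent", "actor": "bob", "commits": 0},
    {"type": "PushEvent", "actor": "alice", "commits": 2},
    {"type": "PushEvent", "actor": "carol", "commits": 1},
]

# Hypothetical result of geolocating profile locations: only some users match.
located = {"alice": "GB"}

totals = commits_per_user(events)
active_users = len(totals)           # users with at least one PushEvent
total_commits = sum(totals.values())

located_user_share = sum(1 for u in totals if u in located) / active_users
located_commit_share = sum(c for u, c in totals.items() if u in located) / total_commits

print(active_users, total_commits)
print(f"{located_user_share:.0%} of users, {located_commit_share:.0%} of commits located")
```

As in the figures reported above, the share of located users and the share of located commits need not coincide, because users who specify a location may be more or less active than average.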


GitHub has become one of the largest web-based hosting services for software development projects, and is used by 3.5 million users worldwide. Its global distribution is strongly correlated with the number of Internet users in a country.

North America and Europe each account for about one third of the total number of GitHub users. The platform is particularly popular in Northern Europe, where Iceland and Sweden each have more than 50 GitHub users for every 100,000 Internet users in the country, as well as in Eastern Europe. The United States, New Zealand and Australia are the countries where the service is most popular outside Europe (they have about 35 GitHub users for every 100,000 Internet users).

The remaining third of GitHub users are mostly located in Asia (17% of the total). Singapore (27 GitHub users per 100,000 Internet users), and Taiwan (10 GitHub users per 100,000 Internet users) are two of the biggest per capita users. A lot of usage comes from China, but on a per-capita basis the country isn’t a heavy user (fewer than 3 GitHub users for every 100,000 Internet users).

The Middle East and North Africa and Sub-Saharan Africa together represent less than 1% of GitHub users, and just about 1% of commits. Switzerland alone counts almost as many GitHub users as the Middle East and North Africa region, and more than Sub-Saharan Africa.

Not only are North America and Europe home to a majority of users, but those users make more contributions than their counterparts in the rest of the world. Each region is home to over 38% of commits to the platform. The United States, for instance, is home to 31% of users but over 35% of commits. Similarly, the Netherlands is home to 1.7% of the users but 2.4% of the commits, and Switzerland is home to 0.9% of the users but 1.4% of the commits.

We see the opposite dynamic in the rest of the world. India, for instance, accounts for 3.6% of users, but only 1.7% of commits.
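The two measures used in this comparison are simple ratios, sketched below. The percentage shares are the ones quoted in the text (with “over 35%” taken as 35, so the United States ratio is a lower bound); the function names and the per-100,000 example figures are made up for illustration.

```python
def per_100k(github_users, internet_users):
    """GitHub users for every 100,000 Internet users in a country."""
    return github_users * 100_000 / internet_users


def contribution_ratio(user_share, commit_share):
    """Share of commits divided by share of users.

    Values above 1 mean a country contributes more commits than its
    number of users alone would suggest; values below 1 mean fewer.
    """
    return commit_share / user_share


# Percentage shares of users and commits as quoted in the text.
ratios = {
    "United States": contribution_ratio(31.0, 35.0),
    "Netherlands": contribution_ratio(1.7, 2.4),
    "Switzerland": contribution_ratio(0.9, 1.4),
    "India": contribution_ratio(3.6, 1.7),
}

for country, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{country}: {ratio:.2f}")

# Made-up example: 500 GitHub users among 1,000,000 Internet users.
example_rate = per_100k(500, 1_000_000)
print(example_rate)
```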

In sum, the uneven geographies of collaborative software development likely tell us a lot about where our global knowledge economy is being performed. Africa and the Middle East, in particular, have far fewer people accessing open software tools than would be expected given their numbers of Internet users. Not only is a lot of the world not accessing software made available on GitHub, but they also aren’t contributing to it: a sign that this facet of our global knowledge economy remains heavily based in some of the world’s traditional hubs of codified knowledge.

The anonymous Internet



This cartogram illustrates users of Tor: one of the largest anonymous networks on the Internet.


The data are freely and openly available on the Tor Metrics Portal, which provides information about the number of users per country joining the network every day. The average number of users has been calculated over a one-year period prior to August 2013, when the Sefnit malware “took the Tor Network by storm”, starting to use Tor for its communications and thus disrupting Tor’s usage statistics.


Tor is an open-source project promoting online anonymity through free software and volunteer collaboration. The Tor network consists of more than five thousand nodes. Tor users can connect to the network and have their Internet data routed through it before reaching any server or webpage, so that the latter cannot distinguish between Tor users or locate them.

Tor is the most popular and well known network of its kind, and it is used world-wide by over 750,000 Internet users every day. This is about the size of a small country; half-way between the Internet populations of Luxembourg and Estonia.

Over half of Tor users are located in Europe, which is also the region with the highest penetration, as the service is used by an average of 80 per 100,000 European Internet users. Italy in particular accounts for over 76,000 users a day, which is about one fifth of the entire European Tor daily user base. Italy is second only to the United States in terms of average number of users, as over 126,000 people access the Internet through Tor every day from the United States. The service is popular throughout the whole European region, with a high penetration in Moldova, as well as in less populous states: about a hundred Internet users connect to Tor every day from each of San Marino, Monaco, Andorra, and Liechtenstein, despite their small Internet populations.

When looking at the number of Tor users as a percentage of the larger Internet population, the Middle East and North Africa has the second highest rate of usage, with an average of over 60 per 100,000 Internet users utilizing the service. Tor is particularly popular in Israel, which accounts for more Tor users than India, while having less than 4% of its Internet users. The service is also very popular in Iran, which accounts for the largest number of Tor users outside Europe and the United States, and counts 50% more users than the United Kingdom, despite having only one third of its Internet population.

The geography of Tor tells us much about potentials for anonymity on the Internet. As ever more governments seek to control and censor online activities, users face a choice to either perform their connected activities in ways that adhere to official policies, or to use anonymity to bring about a freer and more open Internet.

Uneven Geographies of OpenStreetMap



This series of maps shows the location of edited content in the world’s largest collaborative mapping project: OpenStreetMap.


The maps use OpenStreetMap data downloaded on December 12th, 2013. Each sub-region extract has been parsed and, for each node (i.e., the element used in OpenStreetMap to represent any point feature), the coordinates, version, and last update values have been selected.

The first map was created by counting the number of nodes for each cell in a grid of 0.1 degrees of latitude per 0.1 degrees of longitude. The second map instead focuses on edits by summing the version numbers of all nodes in a cell (as this number is increased by one each time a node is modified), resulting in a count of all edits for the whole history of OpenStreetMap. The third map focuses on the age of content, and so records the latest update made to a node for each cell of the grid.
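The three aggregations described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the maps: nodes are given as hypothetical `(lat, lon, version, timestamp)` tuples that have already been parsed from the extracts, and timestamps are ISO 8601 strings in UTC, so that comparing them as text gives the later date.

```python
import math
from collections import defaultdict

CELL = 0.1  # cell size in degrees of latitude and longitude, as in the text


def cell_of(lat, lon):
    """Return the integer (row, column) index of the grid cell containing a point."""
    return (math.floor(lat / CELL), math.floor(lon / CELL))


def summarise(nodes):
    """Aggregate nodes per grid cell.

    Returns three dictionaries keyed by cell:
    number of nodes, sum of version numbers (edits), and latest update.
    """
    node_count = defaultdict(int)
    edit_count = defaultdict(int)
    latest = {}
    for lat, lon, version, timestamp in nodes:
        key = cell_of(lat, lon)
        node_count[key] += 1        # first map: amount of content
        edit_count[key] += version  # second map: each version is one edit
        if key not in latest or timestamp > latest[key]:
            latest[key] = timestamp  # third map: most recent update
    return node_count, edit_count, latest


# Hypothetical nodes: two fall in the same cell, one in a different cell.
nodes = [
    (51.752, -1.258, 3, "2012-05-01T09:00:00Z"),
    (51.758, -1.251, 1, "2013-12-01T18:30:00Z"),
    (48.857, 2.352, 7, "2013-06-15T12:00:00Z"),
]

node_count, edit_count, latest = summarise(nodes)
shared_cell = cell_of(51.752, -1.258)
print(node_count[shared_cell], edit_count[shared_cell], latest[shared_cell])
```

Note that cells of 0.1 by 0.1 degrees do not have a constant area: they shrink towards the poles, which matters when comparing raw counts across latitudes.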


The first map offers a revealing picture of the presence of thick layers of content that annotate a few parts of the world, and a relative absence of content over much of the rest of the planet. The glowing centres of content in parts of North America, Europe, Oceania, and Japan in many ways parallel the visual intensity of lights in NASA’s Earth City Lights series.

The United States accounts for the largest total amount of content, with 21% of all nodes present in OpenStreetMap (OSM), followed by France, Canada, Germany and Russia, each counting more than 100 million nodes. These five countries alone account for 58% of the content, and high-income OECD countries add up to about 80% of OSM.

The Netherlands enjoys the highest density of content, with an average of over 1000 nodes per square kilometre, followed by Belgium with over 700 nodes per square kilometre, and Germany, the Czech Republic, Switzerland, and France, with about 400 nodes per square kilometre.

In contrast to the brightness of Europe, the southern hemisphere is barely visible, as the amount of content available on OSM about that part of the world is far lower than in the northern hemisphere, with Africa and Latin America represented by less than 5% of the content. California alone accounts for almost as much content as the entirety of Africa.

Turkey and the western part of the Middle East are visible, but already fading into a less intense color. The emerging powers of Brazil, India, and China appear to be suffering from a widespread content “blackout”, where only the largest urban centres are visible. Brazil accounts for fewer nodes than Switzerland, and China for even fewer. The same applies to most of the remaining parts of Africa, Asia, and Latin America. One of the oldest urbanized areas of the world, an amazing strip of lights that follows the course of the Nile, is barely visible. In fact, Egypt accounts for as many nodes as Iceland, despite being 10 times as big and accounting for 250 times the population.

Interestingly, content in parts of North Korea lights up the map: an unusual situation for a country not renowned for even appearing in most indices of online participation. This is most likely thanks to work done in 2011 by the OSM developer community. We see a similar situation in Newfoundland and Labrador, with large swaths of sparsely populated land characterised by relatively dense amounts of content. The Canadian case is likely a result of a detailed physical geography dataset that was bulk-uploaded to OpenStreetMap.

Several studies have been conducted on the quality of OSM’s coverage in these areas (e.g., see the paper by Haklay et al, 2010) where high-quality data from government agencies are also available for comparison. However, it has to be noted that these are the same countries where Open Data policies have spread, allowing lots of data to be uploaded to OSM. In fact, the visible distribution of content is not too different from the map of the GeoNames gazetteer project we published some months ago.

The second map below illustrates the number of edits made to OpenStreetMap. Unsurprisingly, the most content-dense areas are also the most heavily edited, because each new node included means one more edit made within the related area. However, statistical analysis suggests that the United States and Germany account for far more edits than would be expected given the related content in OSM, whereas content from Italy and the Netherlands is far less edited than expected. In most parts of the rest of the world the number of edits is simply related to the number of objects in a given area.


The third and last map illustrates the most and least recently updated areas in OSM, similarly to a map included in Mapbox’s recent 2013 OpenStreetMap Data Report.

It is not surprising that most areas in Europe have seen at least one edit in the week before the data were collected. Similarly, it is evident how the most remote regions of the world have not been updated for years, from Siberia to the Australian Outback, from central Africa to the Amazon basin and northern Canada.

While most of the map shows a random mix of data, due to the volunteer-based nature of the project, there are some evident areas of plain colour, which might indicate bulk uploading of new data and datasets from government agencies or companies. Examples can be found in Iraq, where most of the country was updated between September and November 2013; in Australia, where large areas in South Australia have been recently updated, and the updates clearly follow the state borders with New South Wales and Victoria; and in Estonia, which has also received recent edits for most of its territory.


OSM will turn 10 years old in a few months, and combining the findings obtained from these three maps, it is evident that it offers a very good geographical representation of the most developed countries and their urban environments. OSM also provides a large amount of information about non-urban areas, although this is not as up-to-date and detailed as that for urban areas.

The quantity and the quality of the data make OSM one of the most powerful and exciting open-source projects that the Internet has facilitated in recent years, along with Linux and Wikipedia. Nonetheless, there is still a lot of work to do, and the development of the project in its second decade will probably depend on it attracting new volunteers among the new Internet users in Africa, Asia, Latin America, and the Middle East. Finally, OSM will be influenced by its relationships with the many companies that currently base their mapping services on it, as well as by the future spread of open data policies.

A global division of microwork



This graphic illustrates the global division of microwork undertaken on the ODesk platform and reveals some of its locally divergent practices.


Microwork refers to a series of relatively small tasks that are carried out by a distributed workforce over the Internet. Practices of coordinated microwork therefore allow relatively large projects to be carried out quickly by workforces from around the world. ODesk is one of the largest job marketplaces for microworkers. This graphic uses openly available data from ODesk, describing the hourly working practices of microworkers (i.e., the number of active workers for each hour of the week) in each country across the globe.

In the first visualisation, each dot represents the average number of workers active in each country for every hour of the week. For countries that span more than one time zone, we use the local time in the capital city.

The second visualisation uses the same data, but makes two changes. First, dots are aligned according to local time, rather than Coordinated Universal Time (UTC). Second, the size of each dot is normalized by the Internet population in each country. These changes offer a sense of how prevalent online microwork is in each country, and allow working hours between places to be directly compared.

The representations do not account for the use of daylight saving time.
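The two transformations behind the second visualisation can be sketched as below. This is a minimal illustration, not the code used for the graphic: it assumes a week of hourly counts indexed from Monday 00:00 UTC, handles whole-hour offsets only (a country such as India, at UTC+5:30, would need half-hour bins or a rounding choice), ignores daylight saving time as noted above, and uses made-up worker and population figures.

```python
HOURS_PER_WEEK = 168


def to_local_time(utc_counts, utc_offset_hours):
    """Re-index a week of hourly counts from UTC to local time.

    utc_counts[h] is the number of active workers at UTC hour-of-week h;
    the same count is moved to local hour (h + offset) modulo 168.
    """
    local = [0] * HOURS_PER_WEEK
    for utc_hour, count in enumerate(utc_counts):
        local[(utc_hour + utc_offset_hours) % HOURS_PER_WEEK] = count
    return local


def normalise(counts, internet_population):
    """Express hourly counts as a percentage of a country's Internet population."""
    return [100 * count / internet_population for count in counts]


# Made-up example: 250 workers active at 04:00 UTC on the first day of the week,
# in a country at UTC+8 (e.g. the Philippines) with 1,000,000 Internet users.
utc_counts = [0] * HOURS_PER_WEEK
utc_counts[4] = 250

local_counts = to_local_time(utc_counts, utc_offset_hours=8)
local_share = normalise(local_counts, internet_population=1_000_000)

print(local_counts[12])  # the 04:00 UTC activity now sits at 12:00 local time
print(local_share[12])   # percentage of the Internet population active at noon
```

The modulo keeps the week circular, so activity late on Sunday UTC wraps around to early Monday local time for countries ahead of UTC, and the reverse for countries behind it.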


The first image shows that a large portion of the world’s microwork carried out through ODesk is performed in Asia, particularly in the Philippines, Bangladesh, India, and Pakistan. At noon (local time) on an average Tuesday, there are almost 35,000 active workers on the platform, roughly one third of whom are located in India, about one quarter in the Philippines, and about one tenth in the United States. Russia and Ukraine also each provide over five percent of the total. Despite the fact that ODesk is used in 58 countries that cover almost every time zone, 85% of the digitally mediated workers are located in the seven countries mentioned above. In other words, despite the potential for almost anyone with an Internet connection to become a microworker, we can see that microwork practices have very clustered geographies.

One interesting facet of these data is the significant difference between working patterns in the Philippines and most other countries. In most countries, it is easy to distinguish the difference between day and night by the sharp drop-off in work that happens at the end of the working day. However, when looking at the Philippines we only see a relatively minor change in working practices between day and night.

In many countries we also see a stark difference between weekdays and weekends. However, the Philippines again exhibit a relatively consistent temporal pattern with fewer people than elsewhere avoiding work on weekends. By 3am (Philippines time) on an average Sunday morning, the Philippines provide almost half of the active workers in ODesk.

Some of these patterns can be traced to the large US demand for microwork. Filipino microworkers are mostly employed to complete tasks related to data entry, writing, and a variety of personal assistance work (see ODesk Philippines Country Dashboard). We see an increase in the number of active Filipino workers when it is morning in the US (9am Eastern Standard Time, which is 10pm in the Philippines). Bangladesh exhibits a similar pattern to the Philippines. Bangladeshi microworkers are also largely employed for data entry, with the most common type of task performed in the country relating to search engine optimization (see ODesk Bangladesh Country Dashboard). This contrasts with the situation in India, where most microworkers are employed for tasks related to Web programming and design (see ODesk India Country Dashboard). In India, we see the number of active workers decline in the US morning (9am Eastern Standard Time, which is 7.30pm Indian time).

The second image weights the number of active microworkers from each country against that country’s Internet population. This gives us further insights into some of the country-specific differences in microwork practices. For instance, we can see that not only does ODesk have a large and around-the-clock workforce in the Philippines, but that the platform is also relatively popular in that country. On an average Tuesday at noon local time, ODesk employs 0.025% of the entire Filipino Internet population. This is almost ten times the global average. By way of comparison, the platform employs only 0.001% of the US Internet population.

Online microwork also appears to be relatively popular in Armenia and Moldova (in both countries over 0.01% of the Internet population are active on an average Tuesday at lunch time), mostly employing microworkers in the fields of Web programming and design. In South America, Uruguay and Bolivia also demonstrate relatively high rates of microwork activity; Bolivia is particularly interesting because it is the only country that exhibits a visible decline in the number of active workers in the middle of the working day.

These data offer a fascinating insight into new practices of work in our global knowledge economy. The ability to carve up large projects into small digital tasks that can be performed by a globally distributed labour force has meant that global demands for, and supply of, digital tasks can be easily matched. But it remains to be seen whether these new work practices are a useful employment opportunity for many of the two and a half billion connected people in the world, or whether they represent a new type of digital sweatshop in which the world’s poor are enrolled, as expendable and unorganized workers, into exploitative digital divisions of labour.