Gazetteers are playing a central role in the current data revolution, as data scientists apply natural language processing methodologies to texts (from social media data to literature) to identify their geographic component.
However, little research has been devoted to the geographies of gazetteers, their idiosyncrasies and skewness. In a recent short paper for Environment and Planning A (pre-print available on SSRN), Mark Graham and I explored the geographies of the GeoNames gazetteer.
The visualization above illustrates the density of named features (i.e., entries in the gazetteer that refer to a place or natural feature and record its name and related information) in GeoNames, the largest freely available gazetteer covering the globe. This map uses GeoNames gazetteer data from May 2013, and colour is used to represent the number of features per square kilometre.
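For readers who want to try a similar measure, here is a minimal Python sketch of how features per square kilometre could be computed on a regular latitude/longitude grid. It is not necessarily how the map above was produced: the one-degree cell size, the spherical-Earth approximation, and the names `cell_area_km2`, `feature_density`, and `points` are all assumptions made for illustration. The input `points` is a hypothetical list of (latitude, longitude) pairs, for example read from a gazetteer dump.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)


def cell_area_km2(lat_south, cell_deg):
    """Area in km^2 of a cell_deg x cell_deg cell whose southern edge is at lat_south."""
    phi1 = math.radians(lat_south)
    phi2 = math.radians(lat_south + cell_deg)
    dlon = math.radians(cell_deg)
    # Area of a latitude/longitude "rectangle" on a sphere.
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(phi2) - math.sin(phi1))


def feature_density(points, cell_deg=1.0):
    """Map each occupied cell (lat_south, lon_west) to its features per km^2."""
    counts = Counter()
    for lat, lon in points:
        lat_south = math.floor(lat / cell_deg) * cell_deg
        lon_west = math.floor(lon / cell_deg) * cell_deg
        # Keep a point lying exactly on the North Pole inside the last cell.
        lat_south = min(lat_south, 90.0 - cell_deg)
        counts[(lat_south, lon_west)] += 1
    return {
        cell: n / cell_area_km2(cell[0], cell_deg)
        for cell, n in counts.items()
    }


# Hypothetical sample: two features near the equator, one at 60 degrees north.
points = [(0.5, 0.5), (0.2, 0.7), (60.5, 10.5)]
density = feature_density(points)
```

Dividing by the cell area, rather than simply counting features per cell, matters because cells of equal angular size shrink towards the poles. Without that correction, high-latitude regions would look sparser than they are.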
Most strikingly, the visualization does not really resemble a map of population, as illustrated in the map below. Nor are the named features evenly distributed among regions and countries, as illustrated in the graph below. Instead, we see dense clusters of features in some parts of the world and a lack of geographic information in others. Interestingly, the information presences that we see are characterised by unusual patterns. Not only do we see the usual suspects of Western Europe and the United States with large amounts of geographic information, but we also see significant densities in places like Sri Lanka, Iran, and Nepal.
National and international policies clearly play a large role in the construction of this gazetteer. The United States is by far the most represented country in the dataset, accounting for a quarter of the total number of features. Nepal comes in 11th, apparently thanks to a project funded by the European Union in 2001, and counts more features than India and the UK put together.
These global differences are crucial in the context of the current data revolution. First, this can be seen as a technical problem, as analyses might be strongly influenced by the gazetteer chosen to process the data, and end-users would ‘see’ the gazetteer rather than the geographic phenomenon that they aim to study. Second, and possibly more importantly, this is an ethical issue, as there is a risk of perpetuating data-program-data cycles, which reinforce global, historical, and broader information inequalities, as Mark Graham, Matthew Zook, and I discuss in another recent paper published in Geo.
This is where human geography meets data science, and further studies will be crucial to understand the geographies of gazetteers, their origins, and their impact on applications.