This is part three of a series of posts describing a potential new API for dealing with countries, country subdivisions
and timezones in KI18n, following the previous one on country to timezone mapping,
covering how we can query the timezone and country or country subdivision information by geographic coordinates.
The API for this is fairly straightforward. Pass a geographic coordinate in, and get the respective feature at that location back:
KCountry::fromLocation(float latitude, float longitude)
KCountrySubdivision::fromLocation(float latitude, float longitude)
KTimeZone::fromLocation(float latitude, float longitude)
This doesn’t need overly precise coordinates; even GeoIP-based positions with an accuracy of a few
kilometers provide useful results in most areas.
The data sources for this are the same two we already used in the last post.
Both are based on OSM data.
Compact Storage and Indexing
The source data however is huge and slow to process, so we need to convert it into a compact form allowing efficient
storage. For this we reuse prior work from KItinerary
which contains a z-order curve based coordinate to timezone index.
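To illustrate the idea, here is a minimal sketch of how a coordinate can be mapped to a position on a z-order (Morton) curve. This is not the KItinerary or KI18n code. The function names are made up for this example, and the grid parameters (2¹¹ tiles per axis, latitude band from 60°S to 80°N) are the values mentioned further down in this post.

```python
from typing import Optional

LAT_MIN, LAT_MAX = -60.0, 80.0    # polar regions are cut off (values from this post)
LON_MIN, LON_MAX = -180.0, 180.0
Z_DEPTH = 11                      # 2**11 tiles per axis (value from this post)


def interleave_bits(x: int, y: int, bits: int = Z_DEPTH) -> int:
    """Interleave the low `bits` bits of x and y into one z-order code.

    Bits of x end up on even positions, bits of y on odd positions.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def tile_index(lat: float, lon: float) -> Optional[int]:
    """Return the z-order index of the tile containing the coordinate.

    Returns None for coordinates outside the covered latitude band.
    """
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    n = 1 << Z_DEPTH
    x = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * n)  # tile column
    y = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * n)  # tile row
    return interleave_bits(x, y)
```

The useful property of such a curve is that tiles which are close on the map tend to be close in the one-dimensional index as well, so large areas with the same content form long runs that can be stored compactly and searched with a simple binary search.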
There are a few improvements and extensions over the original code though. Most notably we can now represent multiple
features per location, while using the fact that there is only a small set of feature combinations actually occurring.
This allows us to look up not only timezones but also the country or the country subdivision by location, without
significantly increasing the needed storage size.
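The reason this stays small can be sketched as follows: instead of storing all features for every tile, each distinct combination is stored once in a table, and tiles only carry a small index into that table. The class and field names below are hypothetical and only meant to show the principle, not the actual data layout used in KI18n.

```python
from typing import Dict, List, NamedTuple, Optional


class Features(NamedTuple):
    """One combination of features that occurs at some location."""
    timezone: Optional[str]
    country: Optional[str]
    subdivision: Optional[str]


class FeatureTable:
    """Stores each distinct feature combination exactly once."""

    def __init__(self) -> None:
        self._entries: List[Features] = []
        self._ids: Dict[Features, int] = {}

    def intern(self, features: Features) -> int:
        """Return the table index for a combination, adding it if new."""
        if features not in self._ids:
            self._ids[features] = len(self._entries)
            self._entries.append(features)
        return self._ids[features]

    def lookup(self, index: int) -> Features:
        return self._entries[index]

    def __len__(self) -> int:
        return len(self._entries)


# Usage: three (made-up) tile numbers, but only two distinct combinations.
table = FeatureTable()
tiles = {
    100: table.intern(Features("Europe/Berlin", "DE", "DE-BE")),
    101: table.intern(Features("Europe/Berlin", "DE", "DE-BE")),
    102: table.intern(Features("Europe/Berlin", "DE", "DE-BB")),
}
```

Since the number of combinations that actually occur is small, the per-tile index needs only a few bits, no matter how many kinds of features are looked up.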
The QGIS Python script doing the processing also
got optimized a lot: the original version from KItinerary needed about eight hours, the new one only needs about 15 minutes
while producing a more detailed result. This makes it much more feasible to experiment with tweaking the various parameters
to get to optimal results.
Choosing Parameters and Conflict Resolution
Obviously we can’t just magically reduce the hundreds of megabytes of source data by two orders of magnitude without trading in
spatial resolution. How much depends on the parameters of the index generation script.
There are three values to keep an eye on:
- For how much of the earth’s surface do we return a result?
- For how much of the earth’s surface do we return the wrong result?
- The size of the index data.
To understand how we can influence those, it’s useful to have a quick look at what the index generation does conceptually:
- Split the earth’s surface into rectangular tiles (currently: 2¹¹ x 2¹¹). Cut off uninhabited polar regions to
have more tiles for inhabited areas (currently: 80°N and 60°S). For our current parameters that results in tiles roughly 10x20 km
at the equator, and increasingly smaller towards the poles. This controls how much surface area we can cover, and how large
features have to be in order to be visible at all.
- For each tile, check which of the features in that tile actually conflict. For example a tile overlapping the French/German border
would see two timezones, Europe/Berlin and Europe/Paris. Those two are (at least for the present and near future) equivalent,
so we just pick one of those and don’t have an actual conflict. For the country we obviously can’t do that, so there we won’t be
able to return a result.
- For each feature conflict, discard those features that only cover a small fraction of the tile (currently: 2%). This trades
correctness within a few hundred meters of a border for a larger coverage area.
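The last two steps can be sketched for a single tile as below. This is a simplified illustration rather than the actual QGIS script: the function name, the input format (feature id mapped to the fraction of the tile it covers) and the equivalence table are assumptions made for this example. Only the 2% threshold is taken from the text above.

```python
from typing import Dict, Optional

MIN_COVERAGE = 0.02  # features covering less than 2% of a tile are discarded

# Hypothetical equivalence table: timezones that behave identically
# for the present and near future are mapped to one representative.
EQUIVALENT = {"Europe/Paris": "Europe/Berlin"}


def resolve_tile(coverage: Dict[str, float],
                 equivalent: Dict[str, str] = EQUIVALENT) -> Optional[str]:
    """Return the single feature to store for a tile, or None.

    `coverage` maps a feature id to the fraction of the tile it covers.
    None means the tile is empty or a real conflict remains.
    """
    # Step 1: merge features that are equivalent, they don't conflict.
    merged: Dict[str, float] = {}
    for feature, fraction in coverage.items():
        canonical = equivalent.get(feature, feature)
        merged[canonical] = merged.get(canonical, 0.0) + fraction

    # Step 2: only if there still is a conflict, drop marginal features.
    if len(merged) > 1:
        merged = {f: a for f, a in merged.items() if a >= MIN_COVERAGE}

    return next(iter(merged)) if len(merged) == 1 else None
```

With this, a tile that is 60% Europe/Berlin and 40% Europe/Paris resolves to a single timezone, while a tile that is 60% Germany and 40% France yields no country at all. A tile with only 1% of France in it resolves to Germany, which is where the small error close to borders comes from.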
With the above-mentioned parameters, we get to an index size of about 950 kB, and cover 99% of the non-polar regions
for timezones and countries, as well as 98% for (first level) country subdivisions, and we shouldn’t get wrong results when
at least 300 meters away from a border.
This is a decent trade-off for many use-cases. Further reducing the tile size results in a rapidly growing index size for a decreasing
win in precision.
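A quick back-of-the-envelope calculation shows why. The sketch below computes tile count and approximate tile dimensions for a given subdivision depth, assuming a spherical earth with a circumference of about 40075 km and the 60°S to 80°N latitude band. The tile count is only a rough proxy for the index size, as the actual size depends on how well neighboring tiles can be merged.

```python
import math

EARTH_CIRCUMFERENCE_KM = 40075.0   # equatorial value, earth treated as a sphere
LAT_SPAN_DEG = 80.0 + 60.0         # covered band: 60°S to 80°N


def tile_stats(z_depth: int, lat: float = 0.0):
    """Return (tile count, tile width in km, tile height in km).

    The width is measured along the parallel at latitude `lat` and
    therefore shrinks towards the poles, the height is constant.
    """
    n = 1 << z_depth                                   # tiles per axis
    width_km = EARTH_CIRCUMFERENCE_KM * math.cos(math.radians(lat)) / n
    height_km = EARTH_CIRCUMFERENCE_KM * (LAT_SPAN_DEG / 360.0) / n
    return n * n, width_km, height_km


for depth in (10, 11, 12, 13):
    count, width, height = tile_stats(depth)
    print(f"depth {depth}: {count} tiles, {width:.1f} x {height:.1f} km at the equator")
```

Every additional level halves the tile size in both directions but quadruples the number of tiles, while the only tiles that benefit are those along a border.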
There are ways to break this of course: land-locked mini-countries shaped against the tile orientation, such as
Liechtenstein, can fall through the cracks entirely, and even more so their subdivisions. Similarly, very fine-grained country subdivisions can
also be missed, but in those locations we tend to at least get correct country information.
There are two more aspects remaining to be sorted out now:
- Human readable and translated timezone names. Unlike with countries there is no canonical form for this, and
applications tend to use different approaches to represent timezones. It’s still unclear which building
blocks for this can be offered by KI18n.
- Looking up the language of a country or country subdivision, as well as human readable and translated language names.
This needs a bit more thought as well, as code referring to languages often rather expects locales (area and/or script
variants used in a specific area), as well as the available translation catalogs on the system.
Feedback for all this is very welcome, on the implementation but also regarding use-cases and requirements
you have in your application. Check the corresponding Phabricator task
and the GitLab branch for this,
or find me in the
#kde-devel channel on Matrix, the weekly KF6 meetings (Monday 15:00 UTC)
or the kde-frameworks-devel mailing list.