Benchmarking geospatial database on Kubernetes cluster

Sharma, Bharti; Bansal, Poonam; Chugh, Mohak; Chauhan, Adisakshya; Anand, Prateek; Hua, Qiaozhi; Jain, Achin

doi:10.1186/s13634-021-00754-2

Research
Open access
Published: 19 July 2021

Bharti Sharma¹,
Poonam Bansal¹,
Mohak Chugh¹,
Adisakshya Chauhan¹,
Prateek Anand²,
Qiaozhi Hua ORCID: orcid.org/0000-0002-5999-4498³ &
…
Achin Jain⁴

EURASIP Journal on Advances in Signal Processing volume 2021, Article number: 43 (2021)

Abstract

Kubernetes is an open-source container orchestration system for automating container application operations and has been considered for deploying various kinds of container workloads. Traditional geo-databases face frequent scalability issues when dealing with dense and complex spatial data. Despite plenty of research comparing relational and NoSQL databases in handling geospatial data, little is known about the performance of geo-databases in a clustered environment such as Kubernetes. This paper presents a benchmark of PostgreSQL/PostGIS geospatial databases operating in a clustered environment against non-clustered environments. The benchmarking process considers the average execution times of geospatial structured query language (SQL) queries on multiple hardware configurations to compare how the environments handle computationally expensive queries involving SQL operations and PostGIS functions. The geospatial queries operate on data imported from OpenStreetMap into PostgreSQL/PostGIS. The clustered environment powered by Kubernetes demonstrated promising improvements in the average execution times of computationally expensive geospatial SQL queries on all considered hardware configurations compared to their average execution times in non-clustered environments.

1 Introduction

The use of geospatial data has increased in many applications, including traffic management, ride-hailing services, and the food sector. The volume of geospatial data is predicted to increase by 20% every year. This growth requires new architectures and systems to handle the data, which creates new challenges. At present, mainly two types of databases store geospatial information: relational databases and NoSQL databases. Relational databases are the most widely used and most mature database systems and have been used in industry for decades. Because relational databases lacked native support for geospatial data, some modern databases have changed their design specifically for spatial data to support various operations on geo-data [1]. Examples of relational databases for geographic information include PostGIS, WebGIS, Oracle 19c, and the Microsoft Azure SQL Database. Such databases can define spatial entities, support different spatial data entities (e.g., polygon relationships), and apply various optimizations to reduce query execution time. With the arrival of new cloud technologies, spatial information systems are also changing rapidly to handle operations on complex and very large data efficiently [2]. Storing, querying, and managing geospatial data effectively in such environments are problems that have been worked on for many years.

Running multiple machines as a cluster is one method of managing multiple containers. Docker is one technology that can run containerized applications on a wide range of machines. Containerization is the process of isolating applications from the host machine. It creates an environment similar to a separate operating system, even though other containers may be running on the same host, and it allows a single host machine to create, run, and manage multiple containers. Kubernetes is an open-source container orchestration tool that automates installing and managing a cluster of Docker containers. Docker images contain the desired application and service elements, and Kubernetes can be used to deploy and manage these components. Kubernetes automates the provisioning of containers, networking, load balancing, security, and scaling across all its nodes [3].

Clustering and orchestration of containers automatically allocate each client to the machine with the least resource usage. Database clustering and containerization take a different approach in order to maintain the atomicity, consistency, isolation, and durability (ACID) properties. In database cluster mode, every node is fully isolated and has its own methods of managing the data and the ACID properties. Since there is more than one server instance, consistency is difficult to maintain, and the concept of eventual consistency is used. In return, the additional instances offer an alternative in the event of a crash or a failure.
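The idea of sending work to the machine with the least resource usage can be illustrated with a minimal sketch. This is not the Kubernetes scheduler, which filters and scores nodes on many more criteria; the `Node` class, the node names, and the utilisation figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_used: float  # fraction of CPU in use, 0.0 to 1.0 (made-up values)
    mem_used: float  # fraction of memory in use, 0.0 to 1.0 (made-up values)


def pick_least_loaded(nodes):
    """Return the node whose busiest resource is least utilised."""
    return min(nodes, key=lambda n: max(n.cpu_used, n.mem_used))


nodes = [
    Node("node-a", cpu_used=0.70, mem_used=0.40),
    Node("node-b", cpu_used=0.20, mem_used=0.55),
    Node("node-c", cpu_used=0.35, mem_used=0.30),
]

chosen = pick_least_loaded(nodes)
print(chosen.name)  # node-c: its busiest resource sits at 35%
```

Ranking by the busier of the two resources is one possible design choice; a weighted sum of CPU and memory would be an equally valid assumption.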

Traditional data management technologies face frequent read-write and scalability problems when dealing with such dense and complex spatial data. Using a Geographic Information System (GIS) in a cluster environment can be an effective way to solve spatial data problems, because it brings the benefits of horizontal extension on low-cost computers: large and scalable storage, computing power, load balancing, high availability, and monitoring and automation. The structure and principle of containers in the cluster environment make the technology well suited to database workflows. One reason is that once a container has been built, it will run on any platform. The cluster environment can ensure availability and resource management, which simplifies reproducibility and deployment. Performing different kinds of operations on geospatial data is compute intensive, i.e., it needs high computational resources to run. Therefore, there is a need to evaluate the performance of compute-intensive operations on geospatial data running in a Kubernetes cluster environment.

2 Related work

There have been many studies on spatial data storage, prompted by the increasing scale of spatial data and its processing. This has driven the development of spatial data technologies in several aspects: the data model for storage, spatial indexes, and the processing of various types of query operations. Managing and processing spatial vector data is complex and needs dedicated storage models, mechanisms for processing and scanning, and specific systems for its use in various applications. A geographic information system (GIS) is used for gathering, accumulating, and processing geospatial vector data to support general or specific types of applications [1]. The fast-paced development of data systems, space technology, and sensor technology has led to a huge increase in the volume of geospatial data in several fields. Hence, spatial data services are often used with cloud technologies to respond faster [2]. Geospatial data represents information about the location, structure, and characteristics of entities and the dependencies between them [3]. New geospatial applications need versatile schemas, faster execution of query operations, and more scalability than conventional geospatial relational databases offer [4]. In fact, the bottlenecks observed in managing and processing spatial vector data have continuously driven the development of system designs, because current systems are limited in handling this type of large-scale information and the manipulations and computations on it [5]. Spatial information systems must also support the various spatial data services required in any information system.

Several experiments and studies have found that conventional relational databases are not efficient for big data storage and queries in large-scale industrial geospatial applications that access millions of data points at high speed [6, 7]. NoSQL databases are widely considered for storing big data because they can accumulate and manage data, support the creation of various types of indexes on data fields, and scale horizontally to serve a huge number of retrieval operations [8].

Subjective comparison of experiments has also shown that no fixed schema-based databases have faster execution or query processing times than flexible schema databases when operating on a huge volume of data [9, 10]. Creating spatial indexes is crucial for spatial databases to access and view data efficiently; it therefore affects the overall performance of spatial databases compared with using non-indexed spatial databases directly [10, 11]. These operations require not only large storage space but also comparatively more computation power. In subjective comparisons of widely used query operations across database systems, NoSQL databases have outperformed relational databases. Nevertheless, current NoSQL database designs used for industrial purposes cannot serve as a fully viable option for geospatial data. NoSQL databases have some advantages over traditional relational databases: they can easily be operated as a distributed system and do not require fixed structured data, which makes them easier to scale horizontally [12, 13].

Collecting open geospatial datasets in a traditional relational database management system (RDBMS) requires a lot of work on schema design and data import, where both attributes and geometries have to be mapped, translated, and converted [14, 15]. Relational databases also have advantages over NoSQL databases: they provide the standard ACID properties (atomicity, consistency, isolation, and durability), which maintain the integrity of the database system under concurrent operations [16, 17]. Conventional data exploration and analysis methods that use specific software to find crucial information for geospatial applications can be computationally expensive [18, 19, 20]. They may not be feasible in every case without special methods to support the processing of big geospatial data [6, 21].

NoSQL or document databases provide much more flexibility in retrieving and inserting geospatial data than key-value databases. Using the GeoJSON format, many document databases easily support geo-data management. Due to their flexible nature, NoSQL databases can be more efficient in performing geospatial data queries. One shortcoming of NoSQL databases is that they provide only basic spatial functions, fewer than relational databases. Relational databases, in contrast, offer benefits such as strong relational mappings, ACID properties, and strong foreign key constraints [22, 23]. Distinct characteristics of spatial data, such as high dimensionality and complex dependencies between entities (e.g., distance, direction, and geometrical relationships), lead to time-consuming operations and computationally exhausting algorithms. The cloud can provide a suitable and efficient computation architecture to support the processing of such huge data [24, 25].

3 Methodology

This section defines the experimental setup and execution of the benchmarking process for GeoDatabase deployed in a clustered and non-clustered environment. The steps are shown in Fig. 1.

3.1 Subject to be benchmarked

PostgreSQL has been chosen as the subject for the benchmarking process. It is an open-source object-relational database, and the PostGIS spatial database extender adds support for geographic objects to it. This allows location queries to be run in SQL. Its easy installation across platforms makes it a good fit as the GeoDatabase subject in both clustered and non-clustered environments.

A spatial query is a special type of database query that is supported by spatial geo-databases. These queries allow the use of geometry data types such as points, lines, and polygons. They also consider the spatial relationship between these geometries. The spatial queries’ execution time is used as a parameter for comparing performance in our benchmarking process.
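As an illustration of what such a spatial query can look like, the sketch below builds a parametrised PostGIS query in Python. It is not one of the benchmark queries from the paper. It assumes the default osm2pgsql schema (table `planet_osm_point`, geometry column `way` in EPSG:3857, columns `name` and `amenity`); the coordinates, radius, and amenity value are made up, and the `%(name)s` placeholders follow the style used by psycopg2, which is an assumed client library.

```python
def nearby_amenities_query(lon, lat, radius, amenity):
    """Build a query for named points of one amenity type near a location.

    Assumes the default osm2pgsql table and column names. The radius is in
    the units of EPSG:3857, which only approximate metres on the ground.
    """
    sql = (
        "SELECT name "
        "FROM planet_osm_point "
        "WHERE amenity = %(amenity)s "
        "AND ST_DWithin("
        "way, "
        "ST_Transform(ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326), 3857), "
        "%(radius)s)"
    )
    params = {"lon": lon, "lat": lat, "radius": radius, "amenity": amenity}
    return sql, params


# Hypothetical example: cafes within roughly 500 units of a point in Denver.
sql, params = nearby_amenities_query(-104.99, 39.74, 500, "cafe")
print(sql)
print(params)
```

With a database driver, the pair would typically be passed as `cursor.execute(sql, params)`, so that values are bound by the driver rather than formatted into the SQL string.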

3.2 Execution environments

The execution environments chosen for the selected geo-database are powered by Amazon Web Services (AWS) and are as follows:

1. PostgreSQL on Amazon Elastic Compute Cloud (AWS EC2)
2. PostgreSQL on Amazon Relational Database Service (AWS RDS)
3. PostgreSQL on Amazon Elastic Kubernetes Service (AWS EKS)

All the execution environments operate on two hardware configurations:

1. Hardware Configuration 1 (HC-1)
2. Hardware Configuration 2 (HC-2)

These hardware configurations differ in the allocated random-access memory (RAM) and virtual central processing units (vCPUs).

Uniform hardware configuration is the key ingredient to make a benchmarking process fair for all the execution environments. PgAdmin is used as the monitoring tool to get all the benchmarking results for the spatial queries.

The first execution environment (AWS EC2) depicts how a student or researcher would set up a project database. Setting up a virtual machine on-premise or on-cloud and running the database on it is the simplest of all the available options. However, there is an overhead of manually scaling the database according to the incoming requests.

The second execution environment (AWS RDS) depicts the scenario of how a startup or any organization in the software industry would like to set up and manage their databases for all their projects. Relying on third-party services such as RDS or any other database-service provider takes off the load of managing and maintaining the setup. As these options provide less flexibility in scaling options and limited architectural control, they do not prove to be cost-effective.

The third execution environment (AWS EKS) depicts the scenario of a database running in a clustered environment that provides flexible scaling options, full architectural control, and good fail-over support.

3.2.1 PostgreSQL on Amazon Elastic Compute Cloud (AWS EC2)

The base of the environment is an Amazon EC2 instance. This environment is used with two hardware configurations:

1. HC-1
   a. Instance type: t2.medium
   b. RAM: 4 GB
   c. vCPUs: 2
   d. Storage: 8 GB general-purpose solid-state drive (SSD)
2. HC-2
   a. Instance type: t3.large
   b. RAM: 8 GB
   c. vCPUs: 2
   d. Storage: 8 GB general-purpose SSD

Docker and docker-compose are installed on the EC2 instance. The Docker image “mdillon/postgis:9.5-alpine” is used to set up PostgreSQL with PostGIS, and “dpage/pgadmin4:latest” is used to set up PgAdmin; both are started with docker-compose on the Amazon EC2 instance.
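A setup of this kind could be described by a Compose file along the following lines. This is a minimal sketch, not the authors' actual configuration: only the two image names come from the text, while the service names, credentials, database name, and port mappings are placeholders. The file is emitted as JSON, which Compose can generally read because JSON is a subset of YAML.

```python
import json

# Only the image names are taken from the paper; everything else is assumed.
compose = {
    "version": "3",
    "services": {
        "postgis": {
            "image": "mdillon/postgis:9.5-alpine",
            "environment": {
                "POSTGRES_USER": "benchmark",      # placeholder
                "POSTGRES_PASSWORD": "change-me",  # placeholder
                "POSTGRES_DB": "gis",              # placeholder
            },
            "ports": ["5432:5432"],
        },
        "pgadmin": {
            "image": "dpage/pgadmin4:latest",
            "environment": {
                "PGADMIN_DEFAULT_EMAIL": "admin@example.com",  # placeholder
                "PGADMIN_DEFAULT_PASSWORD": "change-me",       # placeholder
            },
            "ports": ["8080:80"],  # PgAdmin serves on port 80 in the container
            "depends_on": ["postgis"],
        },
    },
}

compose_text = json.dumps(compose, indent=2)
print(compose_text)
```

Saving the printed text to a file and starting it with `docker-compose -f <file> up` would be the usual next step; persistent volumes are left out here for brevity.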

3.2.2 PostgreSQL on Amazon Relational Database Service (AWS RDS)

The base of the environment is an Amazon EC2 instance. This environment is used with two hardware configurations:

1. HC-1
   a. EC2 instance type: t2.medium
   b. EC2 vCPUs: 2
   c. EC2 storage: 8 GB general-purpose SSD
   d. Database instance type: db.m3.medium
   e. Database RAM: 4 GB
   f. Database vCPUs: 1
   g. Database capacity: 20 GB SSD
2. HC-2
   a. EC2 instance type: t3.large
   b. EC2 vCPUs: 2
   c. EC2 storage: 8 GB general-purpose SSD
   d. Database instance type: db.m5.large
   e. Database RAM: 8 GB
   f. Database vCPUs: 2
   g. Database capacity: 20 GB SSD

Docker and docker-compose are installed on the EC2 instance, and the Docker image “dpage/pgadmin4:latest” is used to set up PgAdmin with docker-compose. PgAdmin is then connected to the Amazon RDS instance.

3.2.3 PostgreSQL on Amazon Elastic Kubernetes Service (AWS EKS)

The base of the environment is an Amazon EKS cluster with a node group attached, used with two hardware configurations:

1. HC-1
   a. Node group instance type: t2.medium
   b. RAM: 4 GB
   c. vCPUs: 2
   d. Storage: 20 GB SSD
2. HC-2
   a. Node group instance type: t3.large
   b. RAM: 8 GB
   c. vCPUs: 2
   d. Storage: 20 GB SSD

On the EKS cluster, PostgreSQL and PgAdmin are deployed for benchmarking purposes using the Docker images “mdillon/postgis:9.5-alpine” and “dpage/pgadmin4:latest”.
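To give a rough idea of how the PostGIS image might be described to Kubernetes, the sketch below builds a Deployment manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The text does not state which resource kinds, replica counts, or storage settings were used, so the single-replica Deployment, the `app: postgis` label, and the inline password are all assumptions. A real setup would more likely use a Secret and persistent storage.

```python
import json

LABELS = {"app": "postgis"}  # hypothetical label

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "postgis", "labels": LABELS},
    "spec": {
        "replicas": 1,  # assumed; not stated in the paper
        # The selector must match the labels on the pod template.
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {
                "containers": [
                    {
                        "name": "postgis",
                        "image": "mdillon/postgis:9.5-alpine",
                        "ports": [{"containerPort": 5432}],
                        "env": [
                            # Placeholder only; prefer a Secret in practice.
                            {"name": "POSTGRES_PASSWORD", "value": "change-me"},
                        ],
                    }
                ]
            },
        },
    },
}

manifest_json = json.dumps(deployment, indent=2)
print(manifest_json)
```

A Service would also be needed for PgAdmin or other clients to reach the database pod; it is omitted to keep the sketch short.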

3.3 Custom Kubernetes setup

There are cases where spinning up an AWS EKS instance can be very costly and may not be useful for research or testing purposes. In that case, building your own Kubernetes cluster is recommended, which can be done either in the cloud or on local machine(s). This kind of setup can greatly reduce the cost and enable researchers and students to set up their own distributed environment quickly and easily. The authors aim to build an easy-to-set-up heterogeneous clustered environment that can connect different types of machines in different environments. Providing a methodology to set up a production-like environment quickly and easily can help greatly in validating conceptual architectures. This architecture needs to be cost-effective, flexible, and scalable at the same time. Kubernetes clusters can also provide certain benefits such as the following:

(i) Load balancing: a methodical and efficient distribution of network or application traffic across multiple servers in a server farm. Each load balancer sits between client devices and backend servers, receiving incoming requests and distributing them to any available server capable of fulfilling them.
(ii) Failover support: this ensures that a business intelligence system remains available if an application or hardware failure occurs. Clustering provides failover support in two ways: load redistribution and request recovery. The purpose of developing high-performance database clusters is to produce high-performing computer systems. They operate co-extending programs that are needed for time-exhaustive computations, and the scientific industries commonly prefer such clusters. The basic aim is to share the workload intelligently.
(iii) Monitoring and automation: clustering allows many database processes to be automated and allows rules to be set up that warn of potential issues.
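Load balancing and failover by load redistribution can be illustrated with a small round-robin sketch. It is a toy model under stated assumptions, not how Kubernetes or any particular load balancer is implemented; the class name and the backend names `pg-0`, `pg-1`, and `pg-2` are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # Failover: traffic is redistributed to the remaining backends.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        if not self.healthy:
            raise RuntimeError("no healthy backend available")
        # At most one full turn of the ring is needed to find a healthy one.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


lb = RoundRobinBalancer(["pg-0", "pg-1", "pg-2"])
before = [lb.next_backend() for _ in range(3)]
lb.mark_down("pg-1")  # simulate a failed server
after = [lb.next_backend() for _ in range(4)]
print(before)  # ['pg-0', 'pg-1', 'pg-2']
print(after)   # ['pg-0', 'pg-2', 'pg-0', 'pg-2']
```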

Installing a custom Kubernetes cluster involves considerable management overhead for the user but provides desirable performance, especially on lower-configuration systems. This setup enables small-scale use cases to deploy and validate conceptual architectures at a much lower cost than AWS EKS, with roughly comparable performance. The setup can be created either by using kind (Kubernetes in Docker) for test use cases or by using kubeadm to set up a master-agent configuration. One important point to note is that all the machines should be on the same network, or should be able to discover each other, in order to connect and operate as a cluster.

3.4 Data acquisition

Geospatial data is used for benchmarking because retrieving and fetching it is a resource-intensive task, and such tasks give a more accurate picture of how databases behave when deployed in the real world. PostgreSQL is chosen for benchmarking because one of its biggest benefits is that a cluster can be run in a primary-replica setup for high availability or for load balancing read-only queries. A primary-replica setup is not necessarily simple to deploy out of the box, but the process can be simplified by using modern containerization technology. PostgreSQL provides the flexibility and granular control to deploy the database in the desired and most effective configuration, along with great tooling and support.

In this context, geospatial data can be described by the atomic unit of a feature. A feature is a geographic shape (e.g., point, line string, or polygon) together with a list of accompanying key-value attributes. An example of a feature is a building footprint represented by a vector geometry describing a polygon, accompanied by attributes such as the address, the name of the owner, and the year it was built. We consider the map data of the US states of Colorado and Washington, provided by OpenStreetMap (OSM). The downloaded data is initially in the *.osm.pbf file format and is a few hundred megabytes in size. This file format cannot be imported directly into PostgreSQL, so it needs to be transformed first.
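The notion of a feature can be made concrete with a GeoJSON-style representation, sketched below in Python. The building, its coordinates, and its attribute values are invented for illustration and do not come from the OSM extract used in the paper. The helper `ring_is_closed` is likewise a hypothetical name; it checks the GeoJSON rule that a polygon ring has at least four positions and ends where it starts.

```python
import json

# A made-up building footprint: geometry plus key-value attributes.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One exterior ring of [longitude, latitude] positions.
        "coordinates": [[
            [-104.9903, 39.7392],
            [-104.9900, 39.7392],
            [-104.9900, 39.7394],
            [-104.9903, 39.7394],
            [-104.9903, 39.7392],
        ]],
    },
    "properties": {
        "address": "1 Example Street",  # hypothetical
        "owner": "Jane Doe",            # hypothetical
        "year_built": 1998,             # hypothetical
    },
}


def ring_is_closed(ring):
    """Return True if the ring has 4+ positions and first equals last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


exterior = feature["geometry"]["coordinates"][0]
print(ring_is_closed(exterior))
print(json.dumps(feature["properties"]))
```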

Osm2pgsql, available as a command-line package in the Ubuntu repository, is an open-source tool for importing *.osm.pbf files into a PostgreSQL database that already has the PostGIS extension installed. It is an essential part of many rendering toolchains. The process of importing OSM data into PostgreSQL has the following stages:

1. Reading the *.osm.pbf file using a PBF parser
2. Sorting the data and creating indexes

The time taken to import OSM data depends on the following:

1. Hardware specifications of the machine where Osm2pgsql is running
2. The network bandwidth, which determines how quickly transformed data can be shared with PostgreSQL
3. The target database specifications, which determine how quickly data is sorted and indexes are created

Thus, a separate Amazon Elastic Compute Cloud (EC2) instance within the same Virtual Private Cloud (VPC) as the desired execution environment was used. For a given hardware configuration, this instance has Osm2pgsql installed for importing OSM data into PostgreSQL in each of the three execution environments. With this, the import time depends solely on the database running in the desired execution environment.
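An import of this kind is typically started from the command line. The sketch below only assembles an osm2pgsql invocation as an argument list; it does not run it. The exact options used by the authors are not given, so the host, database, user, and file name are hypothetical, and only commonly available osm2pgsql options (`--create`, `--host`, `--port`, `--database`, `--username`) are used.

```python
import shlex


def build_osm2pgsql_command(pbf_path, host, database, user, port=5432):
    """Assemble an osm2pgsql command line as a list of arguments."""
    return [
        "osm2pgsql",
        "--create",              # load into fresh tables
        "--host", host,
        "--port", str(port),
        "--database", database,
        "--username", user,
        pbf_path,
    ]


# All values below are placeholders for illustration.
cmd = build_osm2pgsql_command(
    pbf_path="colorado-latest.osm.pbf",
    host="db.example.internal",
    database="gis",
    user="benchmark",
)
print(shlex.join(cmd))
```

The list could then be passed to `subprocess.run(cmd, check=True)`, with the password supplied through the `PGPASSWORD` environment variable rather than on the command line.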

Based on the OSM data, the 8 geospatial queries listed in Table 1 were used for the benchmarking process. All spatial queries are SELECT operations on the geo-database, and every query represents a real-world use case where clients want to perform read queries against a geospatial web service. Because updates to geo-data are less frequent than read operations, considering read queries enables benchmarking of the geo-database deployment performance in a real-world scenario.

Table 1 Benchmarking queries

\( {\mathrm{AET}}_A=\text{Average Execution Time in environment } A \)	(1)
\( {\mathrm{AET}}_B=\text{Average Execution Time in environment } B \)	(2)
\( \text{Percentage improvement in AET}={\mathrm{PI}}_{\mathrm{AB}}=\frac{{\mathrm{AET}}_B-{\mathrm{AET}}_A}{{\mathrm{AET}}_B}\times 100 \)	(3)
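The following sketch works through Eqs. (1) to (3) with invented numbers. The run times are not measurements from the paper; they are chosen only so the arithmetic is easy to follow, and the function names are hypothetical.

```python
from statistics import mean


def average_execution_time(times_ms):
    """AET: the mean of the execution times recorded for one query."""
    return mean(times_ms)


def percentage_improvement(aet_a, aet_b):
    """PI_AB = (AET_B - AET_A) / AET_B * 100, as in Eq. (3)."""
    if aet_b == 0:
        raise ValueError("AET_B must be non-zero")
    return (aet_b - aet_a) / aet_b * 100


# Made-up timings in milliseconds for the same query in two environments.
runs_a = [80.0, 90.0, 100.0]
runs_b = [110.0, 120.0, 130.0]

aet_a = average_execution_time(runs_a)  # 90.0
aet_b = average_execution_time(runs_b)  # 120.0
pi_ab = percentage_improvement(aet_a, aet_b)
print(pi_ab)  # (120 - 90) / 120 * 100 = 25.0
```

A positive value means environment A is faster than environment B on average; a negative value means it is slower.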
