Benchmarking geospatial database on Kubernetes cluster

Kubernetes is an open-source container orchestration system that automates the operation of containerized applications and has been used to deploy many kinds of container workloads. Traditional geo-databases face frequent scalability issues when dealing with dense and complex spatial data. Despite plenty of research comparing relational and NoSQL databases for handling geospatial data, little is known about the performance of geo-databases in a clustered environment such as Kubernetes. This paper benchmarks a PostgreSQL/PostGIS geospatial database operating in a clustered environment against non-clustered environments. The benchmarking process compares the environments using the average execution times of geospatial structured query language (SQL) queries on multiple hardware configurations, focusing on computationally expensive queries involving SQL operations and PostGIS functions. The geospatial queries operate on data imported from OpenStreetMap into PostgreSQL/PostGIS. The clustered environment powered by Kubernetes demonstrated promising improvements in the average execution times of computationally expensive geospatial SQL queries on all considered hardware configurations compared to the non-clustered environments.


Introduction
The use of geospatial data has increased in many applications, including traffic management, ride-hailing services, and the food sector. The volume of geospatial data is predicted to increase by 20% every year. This growth of geospatial information requires new architectures and systems to handle the data, creating new challenges. At present, two main types of databases store geospatial information: relational databases and NoSQL databases. Relational databases are the most widely used and most mature database systems, deployed in industry for decades. Because relational databases lack native support for geospatial data, some modern databases have adapted their design specifically for spatial data to support various operations on geo-data [1]. Examples of relational databases for geographic information include PostGIS, WebGIS, Oracle 19c, the Microsoft Azure SQL Database, and a few more. These databases can define spatial entities, support different kinds of spatial data entities (e.g., polygon relationships), and apply various optimizations to reduce query execution time. With the arrival of new cloud technologies, spatial information systems are also changing rapidly to handle operations on complex and colossal data efficiently [2]. Storing, managing, and querying data, and administering geospatial databases effectively, are problems that researchers have been trying to solve for many years.
Running multiple machines as a cluster is one method of managing multiple containers. Docker is a technology that allows containerized applications to run on virtually any computer. Containerization is the process of isolating applications from the host machine. It creates an environment similar to a separate operating system, even though other containers may be running on the same host. Containerization allows a single host machine to create, run, and manage multiple containers. Kubernetes is an open-source container orchestration tool that automates installing and managing a cluster of Docker containers. Docker images contain the desired application and service elements, and Kubernetes can be used to deploy and manage these components. Kubernetes allows us to automate the provisioning of containers, networking, load balancing, security, and scaling across all its nodes [3].
Clustering and orchestration of containers automatically allocate clients to the machine with the least resource usage. Database clustering and containerization take a different approach to maintaining the atomicity, consistency, isolation, and durability (ACID) properties. In database cluster mode, every node is fully isolated and has its own methods of managing data and the ACID properties. Since there is more than one server instance, strict consistency is difficult to maintain, and the concept of eventual consistency is used instead. In return, this design provides an alternative in the event of a crash or failure.
Traditional data management technologies face frequent read-write and scalability problems while dealing with such dense and complex spatial data. Using a Geographic Information System (GIS) in a cluster environment can be an effective way to solve spatial data problems, offering the benefits of horizontal scaling on low-cost computers: large and scalable storage, computing power, load balancing, high availability, and monitoring and automation. The structure and principle of containers make cluster environments prominent and efficient for database workflows; one reason is that once a container has been built, it will run on any platform. A cluster environment can ensure availability and resource management, which simplifies reproducibility and deployment. Performing different kinds of operations on geospatial data is compute intensive, i.e., it needs high computational resources. Therefore, there is a need to evaluate the performance of compute-intensive operations on geospatial data running in a Kubernetes cluster environment. There have been many studies on spatial data storage, driven by the increasing volume and processing scale of spatial data. This prompts the development of spatial data technologies in several aspects: data models for storage, spatial indexes, and the processing of various types of query operations. Management and processing of spatial vector data is complex and needs unique storage models, mechanisms for processing and scanning, and specific systems for its use in various applications. A geographic information system (GIS) is used for gathering, accumulating, and processing geospatial vector data to assist general or specific types of applications [1]. The fast-paced development of data systems, space technology, and sensor technology has led to a huge increase in the volume of geospatial data in several fields.
Hence, spatial data services are often combined with cloud technologies to respond faster [2]. Geospatial data represents information about the location, structure, and characteristics of entities and their dependencies on each other [3]. New geospatial applications need versatile schemas, faster execution of query operations, and more scalability than the existing conventional geospatial relational databases [4]. In fact, the bottlenecks observed in the management and processing of spatial vector data have continuously driven the development of new system designs, due to the limitations of current systems for handling this specific type of huge information and its manipulation and computation [5]. Spatial information systems must support the various spatial data services used in any information system. Several experiments and studies have shown that conventional relational databases are not efficient for big data storage and queries at industrial scale, where millions of data points are accessed at enormous speed in various geospatial applications [6,7]. NoSQL databases are widely considered for storing big data due to their capability to accumulate and manage data, support the creation of various types of indexes on data fields, and scale horizontally, providing the ability to serve a huge number of retrieval operations [8]. Comparative experiments have also shown that databases without a fixed schema have faster query processing times than fixed-schema databases when operating on a huge volume of data [9,10]. Creating spatial indexes is crucial for spatial databases to access and view data efficiently; indexing strongly affects the overall performance of spatial databases compared to querying non-indexed data directly [10,11].
All these operations not only require huge storage space but also comparatively more computation power. In comparisons of widely used query operations across database systems, NoSQL databases have outperformed relational databases. However, current NoSQL database designs used for industrial purposes cannot serve as a fully viable option for geospatial data. NoSQL databases have some advantages over traditional relational databases: they can easily be operated as distributed systems and do not impose a fixed data structure, which makes them easier to scale horizontally [12,13].
Collecting open geospatial datasets in a traditional relational database management system (RDBMS) requires a lot of work in schema design and data import, where both attributes and geometries have to be mapped, translated, and converted [14,15]. Relational databases also have some advantages over NoSQL databases: they provide the standard ACID properties (atomicity, consistency, isolation, and durability) that maintain the integrity of the database system when performing concurrent operations [16,17]. Conventional data exploration analyses and methods that use specific software to find crucial information for geospatial applications can be computationally expensive [18][19][20]. They are not feasible in every case without special methods to support the processing of big geospatial data [6,21].
NoSQL or document databases provide much more flexibility in retrieving and inserting geospatial data than key-value databases. Using the GeoJSON format, many document databases easily support geo-data management. Due to their flexible nature, NoSQL databases can be more efficient in performing geospatial data queries. One shortcoming of NoSQL databases is that they provide only basic spatial functions, fewer than relational databases offer. The relational approach, in contrast, retains the benefits of an RDBMS, such as strong relational mappings, ACID properties, and strong foreign key constraints [22,23]. Distinct characteristics of spatial data, such as high dimensionality and complex dependencies between entities (e.g., distance, direction, and geometrical relationships), lead to time-consuming operations and computationally exhausting algorithms. Hosting geospatial data in the cloud can provide an efficient computation architecture that supports the processing of such huge data [24,25].

Methodology
This section defines the experimental setup and execution of the benchmarking process for GeoDatabase deployed in a clustered and non-clustered environment. The steps are shown in Fig. 1.

Subject to be benchmarked
PostgreSQL has been chosen as the subject of the benchmarking process. It is an open-source software program that adds support for geographic objects to its object-relational database using the PostGIS spatial database extender, which allows location queries to be run in SQL. The easy installation process across platforms makes it a good fit for a geo-database that can be used as a subject in both clustered and non-clustered environments.
A spatial query is a special type of database query supported by spatial geo-databases. These queries allow the use of geometry data types such as points, lines, and polygons, and also consider the spatial relationships between these geometries. The spatial queries' execution time is used as the parameter for comparing performance in our benchmarking process.
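As an illustration, a minimal spatial query of this kind might look as follows. This is a hypothetical sketch, not one of the paper's benchmarking queries; the table and column names (planet_osm_point, way) follow the default osm2pgsql schema, and the coordinates are made up:

```sql
-- Hypothetical example: count OSM points of interest within roughly 1 km
-- of a given location, using PostGIS geometry functions.
SELECT COUNT(*)
FROM planet_osm_point
WHERE ST_DWithin(
        way,                                          -- geometry column created by osm2pgsql
        ST_Transform(ST_SetSRID(ST_MakePoint(-104.99, 39.74), 4326), 3857),
        1000                                          -- distance in projection units
      );
```

Such queries combine ordinary SQL (aggregation, filtering) with PostGIS functions that evaluate geometric relationships, which is what makes them comparatively expensive to execute.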

Execution environments
The execution environments chosen for the selected geo-database are powered by Amazon Web Services (AWS): Amazon EC2, Amazon RDS, and Amazon EKS. Each environment is tested with hardware configurations that differ in the allocated random-access memory (RAM) and virtual central processing units (vCPUs).
Uniform hardware configuration is the key ingredient in making the benchmarking process fair across all execution environments. PgAdmin is used as the monitoring tool to collect all benchmarking results for the spatial queries.
The first execution environment (AWS EC2) depicts how a student or researcher would set up a project database. Setting up a virtual machine on-premise or on-cloud and running the database on it is the simplest of all the available options. However, there is an overhead of manually scaling the database according to the incoming requests.
The second execution environment (AWS RDS) depicts how a startup or other software organization would set up and manage databases for its projects. Relying on third-party services such as RDS or another database-service provider takes off the load of managing and maintaining the setup. However, as these options provide less flexible scaling and limited architectural control, they do not prove to be cost-effective.
The third execution environment (AWS EKS) depicts the scenario of a database running in a clustered environment that provides flexible scaling options, full architectural control, and good fail-over support.

PostgreSQL on Amazon Elastic Compute Cloud (AWS EC2)
The base of this environment is an Amazon EC2 instance, used with two hardware configurations (HC-1 and HC-2). Docker and docker-compose are installed on the EC2 instance, and the docker images "mdillon/postgis:9.5-alpine" (to set up PostgreSQL with PostGIS) and "dpage/pgadmin4:latest" (to set up PgAdmin) are deployed using docker-compose.
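Such a setup might be expressed in a docker-compose file roughly as follows; the ports and credentials are illustrative placeholders, not values from the experiments:

```yaml
# Illustrative sketch of the EC2 setup: PostGIS-enabled PostgreSQL plus PgAdmin.
version: "3"
services:
  postgis:
    image: mdillon/postgis:9.5-alpine
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: example          # placeholder credential
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com
      PGADMIN_DEFAULT_PASSWORD: example   # placeholder credential
    ports:
      - "8080:80"
    depends_on:
      - postgis
```

Running `docker-compose up -d` on the EC2 instance then exposes the database on port 5432 and the PgAdmin web interface on port 8080.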

PostgreSQL on Amazon Relational Database Service (AWS RDS)
The base of this environment is an Amazon EC2 instance, used with two hardware configurations (HC-1 and HC-2). Docker and docker-compose are installed on the instance, and the docker image "dpage/pgadmin4:latest" is used to set up PgAdmin via docker-compose. PgAdmin is then connected to the Amazon RDS instance.

PostgreSQL on Amazon Elastic Kubernetes Service (AWS EKS)
The base of this environment is an Amazon EKS cluster with an attached node group, used with two hardware configurations (HC-1 and HC-2). On the EKS cluster, PostgreSQL and PgAdmin are deployed for benchmarking using the docker images "mdillon/postgis:9.5-alpine" and "dpage/pgadmin4:latest".

Custom Kubernetes setup
There are cases where spinning up an AWS EKS instance can be very costly and not useful for research or testing purposes. In that case, it is recommended to build your own Kubernetes cluster, either in the cloud or on local machine(s). This kind of setup can greatly reduce cost and enable researchers and students to set up their own distributed environment quickly and easily. The authors aim to build an easy-to-set-up heterogeneous clustered environment that can connect different types of machines in different environments. Providing a methodology to set up a production-like environment quickly and easily can help greatly in validating conceptual architectures. This architecture needs to be cost-effective, flexible, and scalable at the same time. Kubernetes clusters also provide benefits such as the following:
(i) Load balancing: a methodical and efficient distribution of network or application traffic across multiple servers in a server farm. Each load balancer sits between client devices and backend servers, receiving incoming requests and distributing them to any available server capable of fulfilling them.
(ii) Failover support: it ensures that a business intelligence system remains available for use if an application or hardware failure occurs. Clustering provides failover support in two ways: load redistribution and request recovery. High-performance database clusters are developed to produce high-performing computer systems that run cooperating programs needed for time-exhaustive computations, a variety of cluster commonly preferred in scientific industries. The basic aim is to share the workload intelligently.
(iii) Monitoring and automation: clustering allows much of the database's operation to be automated and permits setting up rules to warn of potential issues.
The installation process of a custom Kubernetes cluster carries a lot of management overhead from the user's perspective but provides desirable performance, especially on lower-configuration systems. This setup enables small-scale use cases to deploy and validate conceptual architectures at a much lower cost than AWS EKS, with broadly comparable performance. The cluster can be created either with KIND (Kubernetes in Docker) for test use cases or with Kubeadm to set up a master-agent configuration. One important point to note while setting up is that all the machines should be on the same network, or should be able to discover each other, in order to connect and operate as a cluster.
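As a rough sketch, the two options above can be bootstrapped as follows. This assumes Docker, kind, and kubeadm are already installed; the flags shown are commonly used defaults, not a verified recipe from the paper:

```shell
# Option 1: a throwaway test cluster with KIND
# (cluster nodes run as Docker containers on one machine)
kind create cluster --name geodb-bench

# Option 2: a master-agent cluster with kubeadm
# On the control-plane (master) machine:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# kubeadm init prints a 'kubeadm join <host>:<port> --token ...' command;
# run that printed command on each worker machine on the same network
# so the workers can discover the master and join the cluster.
```

After either option, PostgreSQL/PostGIS and PgAdmin can be deployed onto the cluster with the same docker images used in the other environments.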

Data acquisition
Geospatial data is used for benchmarking because retrieving and fetching it can be very resource intensive; such resource-intensive tasks give a more accurate picture of real-world database deployments. The choice of database for benchmarking is PostgreSQL, since one of its biggest benefits is the ability to run the cluster in a primary-replica setup for high availability or for load balancing read-only queries. Deploying a primary-replica setup is not necessarily simple out of the box, but the process can be simplified by using modern containerization technology. PostgreSQL provides the flexibility and granular control to deploy the database in the desired and most effective configuration while having great tooling and support.
In this context, the geospatial data can be described by the atomic unit of a feature. A feature is a geographic shape (e.g., point, line string, or polygon) together with a list of accompanying key-value attributes. An example of a feature is a building footprint represented by a vector geometry describing a polygon, accompanied by attributes such as the address, the name of the owner, and the year it was built. We consider the map data of the Colorado and Washington states of the USA, provided by OpenStreetMap (OSM). The downloaded data is in the *.osm.pbf file format and is a few hundred megabytes in size. This file format cannot be directly imported into PostgreSQL and hence needs to be transformed first.
The Osm2pgsql package, available as a CLI in the Ubuntu repository, is an open-source tool for importing a *.osm.pbf file into a PostgreSQL database that already has the PostGIS extension installed; it is an essential part of many rendering tool chains. The process of importing OSM data into PostgreSQL has the following stages:
1. Reading the *.osm.pbf file using the PBF parser
2. Sorting the data and creating indexes
The time taken to import OSM data depends on the following:
1. The hardware specifications of the machine where Osm2pgsql is running
2. The network bandwidth, to share transformed data with PostgreSQL
3. The target database specifications, to sort data and create indexes
Thus, a separate Amazon Elastic Compute Cloud (EC2) instance within the same Virtual Private Cloud (VPC) as the desired execution environment was used. This instance, for a given hardware configuration, has Osm2pgsql installed for importing OSM data into PostgreSQL for each of the three execution environments. With this, the import time depends solely on the database running in the desired execution environment.
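A typical invocation, run from the separate EC2 instance against the target database, might look like the following; the host name, cache size, and file name are illustrative placeholders, not the exact values used in the experiments:

```shell
sudo apt-get install osm2pgsql          # available in the Ubuntu repositories
# Import the OSM extract into a remote PostGIS-enabled database.
# --slim keeps the intermediate node/way/rel tables; -C sets the cache size in MB.
osm2pgsql --create --slim -C 2048 \
    -H database-host.example -P 5432 -U postgres -d gis \
    colorado-latest.osm.pbf
```

Because the tool streams the transformed data over the network to the target database, running it from a uniform, separate instance isolates the database's own performance in the measured import time.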
Based on the OSM data, the 8 geospatial queries listed in Table 1 were used for the benchmarking process. All spatial queries are SELECT operations on the geo-database, and every query represents a real-world use case in which clients perform read queries against a geospatial web service. As updates to geo-data are less frequent than read operations, considering read queries enables benchmarking of the geo-database deployment's performance in a real-world scenario.

Iterative benchmarking
The prerequisite for beginning benchmarking for the desired state is to have the infrastructure setup as described by the corresponding architecture diagrams (refer to Figs. 2, 3, and 4).
After infrastructure is set up, there will be a PostgreSQL database with PostGIS extension installed. Then, OSM data is imported in PostgreSQL running in the desired state. After the import is complete, Osm2pgsql gives the total time taken to import OSM data in PostgreSQL.
When the import process is complete, the next step is to execute the benchmarking queries. The benchmarking queries described in Table 1 are run using PgAdmin Query Tool, a robust, feature-rich environment that allows the execution of arbitrary SQL commands and retrieves the result set along with the execution time for each SQL query. Every benchmarking query is run in 10 iterations and execution time for each iteration is tabulated. After all iterations for all the benchmarking queries are done, the average execution time for every benchmarking query is calculated using the execution time for 10 iterations obtained. This average execution time obtained at the end of the process is considered the parameter of comparison of performance between the AWS EC2, AWS RDS, and AWS EKS.
The average execution time (AET) for each benchmarking query is calculated by averaging all of its iterations in a particular execution environment. The AET of all the benchmarking queries is then used to compare the performance of the execution environments under consideration. In the experiment, the AET in one environment is considered comparable with the AET in another environment only if the percentage difference in AET (positive or negative) between the two is significant, i.e., greater than 10%, because multiple factors not directly linked with the database, such as network latency and the bandwidth available to transfer data, can affect the statistics. If the percentage improvement in AET is less than 10%, the AETs of both environments are considered similar. We represent the percentage improvement in AET in environment A with respect to AET in environment B as PI_AB, calculated using Eqs. 1, 2, and 3. If PI_AB is positive, the AET in environment A has improved by PI_AB percent compared with the AET in environment B for a given benchmarking query. If PI_AB is negative, the AET in environment A has degraded by PI_AB percent compared with the AET in environment B for a given benchmarking query.
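The paper's Eqs. 1-3 are not reproduced here, but under the natural reading of the definitions above, AET and PI_AB can be computed as in the following sketch. The iteration times are made-up illustrative values, not measurements from the tables:

```python
def average_execution_time(times):
    """AET: mean of the per-iteration execution times (10 iterations in the paper)."""
    return sum(times) / len(times)

def percentage_improvement(aet_a, aet_b):
    """PI_AB: percentage improvement of environment A's AET relative to environment B's.

    Positive -> A improved over B; negative -> A degraded relative to B.
    """
    return (aet_b - aet_a) / aet_b * 100

# Hypothetical AETs (seconds) for one query in two environments.
aet_eks = average_execution_time([1.9, 2.1, 2.0, 2.0, 1.8, 2.2, 2.0, 1.9, 2.1, 2.0])
aet_ec2 = average_execution_time([2.4, 2.6, 2.5, 2.5, 2.3, 2.7, 2.5, 2.4, 2.6, 2.5])
pi = percentage_improvement(aet_eks, aet_ec2)
# Per the methodology, a |PI_AB| below the 10% threshold is treated as "similar".
significant = abs(pi) > 10
```

With these illustrative numbers, the EKS AET of 2.0 s against the EC2 AET of 2.5 s yields a 20% improvement, which would count as significant under the 10% threshold.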

Experiments and results analysis
Following the methodology, we compare the considered execution environments based on the total time taken to import the data into PostgreSQL/PostGIS and the average execution times of the benchmarking queries in each environment. While several factors might contribute to these results, we focus on the hardware configuration, resource usage, and the deployment architecture as the root causes.

Benchmarking on import time
When downloaded in the *.osm.pbf file format, the OpenStreetMap data of the Colorado and Washington states cannot be directly imported into PostgreSQL and hence is transformed using the Osm2pgsql CLI. Osm2pgsql also logs the timestamp corresponding to each stage of the import process described in Section 3.4. Import time is the sum of the time taken by each step in the import process.
The hardware specifications for the virtual machine where Osm2pgsql operates are the same for all the execution environments. The import time depends solely on the database's ability to run in the desired execution environment to create relations, insert, and index data.
Creation of tables, sorting, and indexing of geospatial data are computationally expensive operations. Table 2 shows the tabulated import times for the OSM data corresponding to the Colorado and Washington states in all execution environments for both considered hardware configurations. We observed that the import was quickest for databases operating in AWS EKS, because of its ability to scale up or down based on resource usage, while the database operating in AWS EC2 took longer to complete the import process as it had no scaling ability. Import time for AWS RDS varied greatly compared with the other two execution environments, because AWS RDS is not optimized to work with geospatial data and certain extensions for PostGIS support are not compatible with it. The custom Kubernetes setup took a similar time for the import process as AWS EKS, as it is also a clustered environment with the ability to scale on demand. Table 3 describes the geospatial queries that we considered for benchmarking. These queries are executed on the imported OpenStreetMap data in each execution environment running on both considered hardware configurations using PgAdmin.

Benchmarking on queries
In the above-mentioned queries, the attributes on which the geo-data is indexed in PostgreSQL are nodes, roads, ways, rels, point, line, and polygon. Tables 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, and 15 describe the execution time (in seconds) for 10 iterations (ET-1 to ET-10) of each query (Q1-Q8) in each of our execution environments. Here, "ET" refers to "execution time," "AET" refers to "average execution time," which is the average of ET-1 to ET-10, and "ET-i" refers to the "execution time for the i-th iteration of a given query."

Queries operating on indexed attributes
Q1 and Q2 are geospatial queries that operate on indexed attributes and retrieve fewer than 5 rows. From Tables 4, 5, 6, 7, 8, 9, 10, and 11, we observe that AWS EC2 and AWS EKS gave similar AETs for them: these attributes were indexed during the import process, so there is little processing overhead thanks to efficient retrieval of data from the geo-database based on these attributes. However, AWS RDS proved to be slower in operating on indexes because it is not fully compatible with PostGIS. AWS RDS gave slower AETs for HC-1 for both the Colorado and Washington states, but when compute resources were upgraded to HC-2, AWS RDS gave AETs comparable to AWS EKS for Q1 and Q2.

Queries operating on non-indexed attributes
Tables 4 and 5 describe the AETs observed by running the queries in AWS EC2 for HC-1, as shown in Fig. 2, for Colorado State and Washington State OSM data respectively. Q6 involves a HAVING clause along with a count and grouping operation on an indexed attribute. Q4 and Q6 look similar, but in Tables 4 and 5 their average execution times differ by a significant margin of 377.7 ms, with Q4 being the quickest to execute. This is because in Q6 additional computations are performed to produce the result corresponding to the HAVING clause, which requires an additional traversal of the resultant rows after the grouping operation. Q7 is a computationally expensive query calculating the length of all roads in the city using the ST_LENGTH PostGIS function; it retrieves 386224 rows, and we see that it took the maximum time to execute. From Table 5, similar observations can be made for Washington State OSM data. Q7 was again the slowest query to execute, and even slower than Q7 for Colorado State OSM data in Table 4, because Washington State has more roads than Colorado State; this can also be seen in Table 3, where Q7 fetches 494k rows for Washington versus 386k rows for Colorado. Queries with low computational overhead like Q1, Q2, Q3, and Q4 gave lower AETs for Washington compared to Colorado in Tables 4 and 5, because Washington State OSM data is smaller in size than Colorado State OSM data, making it easier to operate on. Tables 6 and 7 describe the AETs observed by running the queries in AWS EC2 for HC-2, as shown in Fig. 2, for Colorado State and Washington State OSM data respectively.
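The cost difference between a plain grouping query and one with a HAVING clause can be illustrated with a hypothetical pair of queries of the same shape (the actual Q4 and Q6 are listed in Table 1/Table 3; the table and column names below follow the default osm2pgsql schema):

```sql
-- Q4-style: plain grouping with a count.
SELECT highway, COUNT(*)
FROM planet_osm_roads
GROUP BY highway;

-- Q6-style: the same grouping, plus a HAVING filter that must be
-- evaluated against every group after aggregation completes.
SELECT highway, COUNT(*)
FROM planet_osm_roads
GROUP BY highway
HAVING COUNT(*) > 100;
```

The HAVING predicate cannot be pushed below the aggregation, so the second query pays for the full grouping pass and then an extra filtering pass over the grouped result.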
In Table 6, on upgrading the hardware configuration of AWS EC2 from HC-1 to HC-2, all the benchmarking queries saw improvement in AET compared to Table 4. The query with the highest computational overhead, Q7, saw an improvement of 2.2636 s in AET, while the queries involving moderate computational overhead, Q5 and Q6, saw improvements of 1.1233 s and 580 ms in AET respectively. Queries with low computational overhead like Q3, Q4, and Q8 saw improvements of 200 ms, 494 ms, and 500 ms in AET respectively. Table 7 shows that we recorded similar observations for Washington State OSM data: Q7 saw an improvement of 3.1241 s in AET, while queries with low computational overhead like Q3 and Q4 saw marginal improvements of less than 10 ms in AET. Tables 8 and 9 describe the AETs observed by running the queries in AWS RDS for HC-1, as shown in Fig. 3, for Colorado State and Washington State OSM data respectively.
Q7 was again observed to be the query requiring the maximum execution time. But in this case, we observe that the AETs for all queries increased when compared with the corresponding AETs in AWS EC2 and AWS EKS from Tables 4 and 12. There can be multiple reasons for this behavior: certain PostGIS extensions required for installation and operation in PostgreSQL are not compatible with AWS RDS, and AWS RDS is not optimized to deal with geospatial data.
From Table 9, it can be seen that similar observations were made for Washington State OSM data, with Q7 again the slowest to execute. All other queries, involving moderate and low computational overhead, gave better AETs than for Colorado OSM data in Table 8, because Washington State OSM data is smaller in size than Colorado State OSM data. Tables 10 and 11 describe the AETs observed by running the queries in AWS RDS for HC-2, as shown in Fig. 3, for Colorado State and Washington State OSM data respectively.
In Table 10, on upgrading the hardware configuration of AWS RDS from HC-1 to HC-2, all the benchmarking queries saw improvement in AET compared to Table 8, similar to what we saw for AWS EC2 in Tables 4 and 6. The query with the highest computational overhead, Q7, saw an improvement of 57.64%, i.e., 7.3464 s in AET, while the queries involving moderate computational overhead, Q5 and Q6, saw improvements of 220 ms and 403.9 ms in AET respectively. Queries with low computational overhead like Q3, Q4, and Q8 saw improvements of 343.9 ms, 296.6 ms, and 359.2 ms in AET respectively. Significant improvements were observed for RDS on upgrading the hardware configuration.
For the AETs of queries operating on Washington State OSM data, shown in Table 11, we recorded similar observations. Q7 saw an improvement of 57.18%, i.e., 8.88 s in AET, while queries with moderate and low computational overhead like Q3, Q4, Q5, Q6, and Q8 saw marginal improvements, i.e., less than 10%.
Tables 12 and 13 describe the AETs observed by running the queries in AWS EKS for HC-1, as shown in Fig. 4, for Colorado State and Washington State OSM data respectively.
Here also Q7 took the maximum time to execute, but this time it was quicker than the corresponding AETs for AWS EC2 and AWS RDS in Tables 4 and 8. We also observed that the absolute difference between the average execution times of Q4 and Q6 reduced to 11.1 ms, compared with 377.7 ms in Table 4 and 102.2 ms in Table 8; their AETs were also comparable, keeping the margin of error in mind. This improvement in performance results from the ability of AWS EKS to scale up and down on demand based on resource usage. For Washington State OSM data, similar results were observed, as shown in Table 13: the AETs for all benchmarking queries improved when compared to the other execution environments, AWS EC2 in Table 5 and AWS RDS in Table 9.
In Table 14, on upgrading the hardware configuration of AWS EKS from HC-1 to HC-2, all the benchmarking queries saw an improvement in AET compared to Table 12. The query with the highest computational overhead, Q7, saw an improvement of 18.87%, i.e., 1.0037 s, in AET. Queries with low computational overhead, Q3, Q4, and Q8, saw improvements of 411.1 ms, 327.2 ms, and 427.1 ms. Significant improvements in AET were observed for the query involving high computational overhead on upgrading the hardware configuration of EKS.
For Washington State OSM data, from Table 15, we recorded similar observations. Q7 saw an improvement of 15.39%, i.e., 8.88 s, in AET, while queries with moderate computational overhead, Q5 and Q6, saw improvements of 189.7 ms and 136.8 ms respectively.
Q3 operates on non-indexed attributes and uses aggregate functions on them to retrieve 168 rows; it can also be considered a standard SQL text query, and its retrieval has moderate computational overhead. From Tables 4, 5, 6, 7, 8, and 9, it can be observed that AWS EKS, because of its scaling ability, and AWS RDS, which is designed to work with such text queries, gave better AETs than standard PostgreSQL on AWS EC2. For queries with high computational overhead, such as Q5, which traverses all the data in the planet_osm_point table based on a non-indexed attribute, and Q8, which traverses the complete dataset to retrieve information about the objects and cardinalities generated after the import process, AWS EKS gave the best AETs because of its ability to scale based on resource usage. The line plot for Colorado OSM data in HC-1 shows that for the queries on indexed attributes, Q1 and Q2, AWS EKS and AWS EC2 gave similar AETs but AWS RDS deviated and gave slower AETs. For the standard SQL text query Q3, AWS RDS performed similarly to AWS EKS, and both outperformed standard PostgreSQL on AWS EC2. The absolute difference between the AETs of the query with the lowest computational overhead, Q4, and the query involving moderate overhead, Q6, was observed to be minimum for AWS EKS; thus, where AWS EC2 and AWS RDS deviated for moderate load, AWS EKS performed similarly for low to moderate load. For Q7, which uses a PostGIS function, AWS RDS and AWS EC2 deviated from AWS EKS by a significant margin. For queries involving traversal of the complete dataset, Q5 and Q8, AWS EKS and AWS RDS gave better AETs than AWS EC2. These observations are further explored in Section 4.4 using PI_AB.

Figure 6 is a line plot showing the AET (y-axis) in seconds for all benchmarking queries Q1-Q8 (x-axis) operated on Washington OSM data in all execution environments for HC-1. The plot is made using the AET values from Tables 5, 9, and 13.
It can be observed from the plot that, similar to the case of Colorado State, AWS EKS and AWS EC2 gave similar AETs for Q1 and Q2 while AWS RDS deviated to give slower AETs. Here also, AWS EKS gave similar AETs for low to moderate load in Q4 and Q6. AWS EKS again gave the best AET for Q7, which was the slowest query to execute because of its highest computational overhead.

Figure 7 is a line plot showing the AET (y-axis) in seconds for all benchmarking queries Q1-Q8 (x-axis) operated on Colorado OSM data in all execution environments for HC-2. The plot is made using the AET values from Tables 6, 10, and 14. On upgrading the hardware configuration, we observed that, for queries involving low computational overhead, the difference in AETs was within the margin of error when compared with the AETs for Colorado State in HC-1. For such queries, AWS EKS outperformed AWS EC2 and AWS RDS, but not by a significant margin. Greater improvement in AET can be observed for queries involving moderate and high computational overhead, for which AWS EKS outperformed AWS EC2 and AWS RDS by a significant margin. This points to the fact that increasing the compute resources improves the performance (AET) of the queries that actually need those resources.

Figure 8 is a line plot showing the AET (y-axis) in seconds for all benchmarking queries Q1-Q8 (x-axis) operated on Washington OSM data in all execution environments for HC-2. The plot is made using the AET values from Tables 7, 11, and 15. Similar to our observation from Fig. 7, the difference in AET between all execution environments became marginal for queries involving low computational overhead, but AWS EKS continued to yield better AETs for queries involving moderate to high computational overhead.
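The AET figures discussed above come from repeated executions of each benchmarking query. Since the measurement harness itself is not reproduced in this section, the following is only a minimal sketch of how such an average could be taken; the `run_query` callable, the `average_execution_time` helper name, and the repetition count are our assumptions, not the paper's setup:

```python
import time
from statistics import mean

def average_execution_time(run_query, repetitions=10):
    """Return the average wall-clock execution time (AET), in seconds,
    of a query callable over a number of repetitions."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # e.g. execute one of the benchmarking queries Q1-Q8
        samples.append(time.perf_counter() - start)
    return mean(samples)

# Stand-in workload; a real harness would run a geospatial SQL query
# against PostgreSQL/PostGIS here.
aet = average_execution_time(lambda: sum(range(100_000)), repetitions=5)
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic, high-resolution clock suited to measuring short intervals.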

Benchmarking on AET
Figs. 9, 10, 11, 12, 13, and 14 depict the improvements in AET for all benchmarking queries operating under both hardware configurations for a given execution environment. These plots help us understand the effect of hardware configuration on the AETs of the benchmarking queries in a given environment. The line plot for percentage improvement (PI_AB) in AET is shown in Fig. 15.

Figure 9 is a double-bar plot showing the AET (x-axis) in seconds for all benchmarking queries Q1-Q8 (y-axis) operated on Colorado State OSM data in AWS EC2 for both HC-1 and HC-2. The plot is made using the AET values from Tables 4 and 6. It can be observed that on increasing the resources, the AETs of the benchmarking queries decreased. Queries such as Q5 and Q7 showed more improvement since they are moderately and highly computationally intensive compared to the other queries and benefited from the upgraded compute resources.

Figure 10 is a double-bar plot showing the AET (x-axis) in seconds for all benchmarking queries Q1-Q8 (y-axis) operated on Washington OSM data in AWS EC2 for both HC-1 and HC-2. The plot is made using the AET values from Tables 5 and 7. The improvement in AETs for Q1, Q2, Q3, and Q4 is marginal because these queries are less resource intensive for Washington State. This trend arises because the Washington State OSM dataset is smaller than the Colorado State OSM dataset, which is why the improvements were larger for Colorado State than for Washington State.

Figure 11 is a double-bar plot showing the AET (x-axis) in seconds for all benchmarking queries Q1-Q8 (y-axis) operated on Colorado OSM data in AWS RDS for both HC-1 and HC-2. The plot is made using the AET values from Tables 8 and 10. We observed substantial improvements in AET for the benchmarking queries on upgrading the hardware configuration for AWS RDS. Hence, it can be said that AWS RDS requires more compute resources than the other execution environments to deliver comparable results.
Figure 12 is a double-bar plot showing the AET (x-axis) in seconds for all benchmarking queries Q1-Q8 (y-axis) operated on Washington OSM data in AWS RDS for both HC-1 and HC-2. The plot is made using the AET values from Tables 9 and 11. Similar to the observations from Fig. 11, the AET improved significantly for Q7, but for queries involving low to moderate computational overhead the improvement was marginal.

Figure 13 is a double-bar plot showing the AET (x-axis) in seconds for all benchmarking queries Q1-Q8 (y-axis) operated on Colorado OSM data in AWS EKS for both HC-1 and HC-2. The plot is made using the AET values from Tables 12 and 14. AWS EKS, when upgraded to HC-2, shows good improvement compared to AWS EKS in HC-1: the AET improved for all the benchmarking queries.

Figure 14 is a double-bar plot showing the AET (x-axis) in seconds for all benchmarking queries Q1-Q8 (y-axis) operated on Washington OSM data in AWS EKS for both HC-1 and HC-2. The plot is made using the AET values from Tables 13 and 15. The observations mirror those from Fig. 13, but Q1, Q2, Q3, and Q4 involve low computational overhead for Washington State; hence, their improvement was not significant.

Comparison of execution environments based on PI_AB
We know that the Colorado State OSM data is larger in size than the Washington State OSM data; therefore, we compared the performance of the benchmarking queries on Colorado State OSM data in all execution environments to find out which of them yields the best AETs for the lower hardware configuration (HC-1). Table 16 presents the percentage improvement in AET (PI_AB) among all the execution environments, calculated using Eq. 3. The AET of a benchmarking query in environment A is said to have improved relative to its AET in environment B if AET_A is less than AET_B, i.e., if PI_AB is positive.
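Eq. 3 itself is not reproduced in this section. Assuming it takes the usual relative-difference form of a percentage improvement, (AET_B - AET_A) / AET_B x 100, a minimal sketch (the helper name `pi_ab` is ours) is:

```python
def pi_ab(aet_a: float, aet_b: float) -> float:
    """Percentage improvement of environment A over environment B.

    Positive when AET_A < AET_B (A is faster than B), negative when A
    is slower; assumes Eq. 3 is the standard relative-difference form.
    """
    return (aet_b - aet_a) / aet_b * 100.0

# Hypothetical AETs (not values from the paper's tables): a query taking
# 5.4 s in environment A versus 12.7 s in environment B.
improvement = pi_ab(5.4, 12.7)  # positive, so A improved over B
```

Under this definition the sign convention matches the text: a positive PI_AB means environment A yielded the lower (better) AET.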
From Tables 10 and 11, we observed that AWS EKS, when compared to AWS EC2, gave similar average execution times for Q1 and Q2, which operate on indexed attributes. Figure 15 shows the line plot for percentage improvement (PI_AB) in AET. The two environments were again similar for Q4, because of the low computational overhead involved. But AWS EKS outperformed AWS EC2 when moderate computational overhead was introduced in Q6 and Q3, because of its ability to scale up and down based on resource usage. Q5, Q7, and Q8 involved high computational overhead, and AWS EKS continued to outperform AWS EC2 because of its ability to scale flexibly on demand. Certain PostgreSQL/PostGIS extensions are not compatible with AWS RDS, as a result of which AWS EKS outperformed it for Q1 and Q2, which operate on indexed attributes, and for Q7, which makes use of a PostGIS function. The two yielded similar AETs for Q3 and Q4: Q3 is similar to a standard SQL text query, which AWS RDS is designed to handle, and Q4 involves low computational overhead. In the case of Q6, the scaling ability of AWS EKS made it outperform AWS RDS.

Comparing AWS EKS with custom Kubernetes cluster
From Table 17, it can be seen that the custom Kubernetes cluster performed similarly to AWS EKS from Table 15, since it is a comparable clustered environment. The AETs for all the benchmarking queries are not significantly different from those on AWS EKS, though slightly higher. This slightly increased time can arise from network latencies in the load balancers created for the custom Kubernetes cluster. For Table 18, similarly to Table 17, we observed that the AETs for the custom Kubernetes cluster are comparable to AWS EKS from Table 14, but slightly higher due to the network latencies in the setup. Since the Washington State dataset is smaller than the Colorado State dataset, the AETs for Washington State data are lower than those for Colorado State data.
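The custom cluster's manifests are not shown here, so the following is only an illustrative sketch of the kind of load-balanced Service that would sit in front of the PostgreSQL/PostGIS pods; all names and labels are hypothetical. Each client request crossing such a balancer takes an extra network hop, which is consistent with the slightly higher AETs observed:

```yaml
# Hypothetical manifest: exposes PostgreSQL/PostGIS pods through a
# load balancer, the extra hop suspected of adding latency.
apiVersion: v1
kind: Service
metadata:
  name: postgis-lb            # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app: postgis              # must match the database pods' labels
  ports:
    - port: 5432              # default PostgreSQL port
      targetPort: 5432
```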
From Fig. 16, it can be observed that the AETs of the custom Kubernetes cluster are comparable to those of AWS EKS but slightly higher; this can be due to hidden latencies in the local load balancer.
From Fig. 17, it can be seen that for Washington State as well, the AETs for the custom Kubernetes cluster are slightly higher than for AWS EKS, for the same reason as observed for the Colorado dataset in Fig. 16.

Discussion
Deploying and managing software applications and databases in a clustered environment is not an easy task; however, containerized applications can meet the requirements of ease of migration, portability, scalability, and flexibility when operating in such an environment.
Plenty of studies have been carried out comparing relational (SQL) and NoSQL databases in handling geospatial data, but these traditional database management technologies face frequent scalability problems when dealing with geospatial data. This paper benchmarks the performance of operations on a geospatial database by comparing the execution times of geospatial queries in a clustered environment, Kubernetes, against non-clustered environments. Kubernetes demonstrated its advantages by scaling on demand based on resource usage and performing better than the non-clustered environments for computationally expensive operations. This ability is particularly important for mission-critical applications and geospatial databases, which tend to be compute intensive and can therefore benefit immensely from operating in a clustered environment. Setting up a custom local Kubernetes cluster proved to be a viable option for testing and validating conceptual architectures when the benefits of a clustered environment are wanted without incurring high costs.
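The mechanism behind this on-demand scaling is not spelled out above. In Kubernetes, resource-based scaling is commonly expressed with a HorizontalPodAutoscaler, sketched below with hypothetical names; note that scaling a stateful database this way additionally requires replication-aware tooling (e.g. an operator), which this fragment does not show:

```yaml
# Hypothetical sketch: scale the database workload when average CPU
# utilization exceeds 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: postgis-hpa           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: postgis             # hypothetical workload name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```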
A disadvantage of using a clustered environment with PostgreSQL, compared to a managed non-clustered environment like AWS RDS, is that we lose the benefits of a fully managed database. We have to set up, operate, manage, and maintain the database ourselves, which might not be very cost-efficient: we must manage our own backups, survive downtime in the case of a crash, and bear increased deployment cost; and in the case of local Kubernetes clusters, we additionally have to manage the availability of the clusters themselves.

Conclusions
This work aimed to benchmark geospatial databases in clustered and non-clustered environments. It was found that when processing geospatial queries that operate on indexed attributes and involve low computational overhead, clustered and non-clustered environments offered similar performance, keeping the margin of error in mind. The clustered environments performed better than the non-clustered environments in scenarios where a computationally expensive geospatial query was involved, or where the query operated on non-indexed attributes and retrieved large amounts of data from the geo-database. A clustered environment like AWS EKS could do this because of its ability to scale flexibly. Operating geo-databases in a clustered environment like AWS EKS (Kubernetes) can thus drastically improve performance, scale on demand, and automate administration and routine tasks, a notable improvement, especially when computationally expensive operations must be performed efficiently.