Spatial Database Concepts
1. Introduction to Spatial Databases
Spatial databases incorporate functionality that provides support for databases that keep track of objects in a multidimensional space. For example, cartographic data-bases that store maps include two-dimensional spatial descriptions of their objects—from countries and states to rivers, cities, roads, seas, and so on. The systems that manage geographic data and related applications are known as Geographical Information Systems (GIS), and they are used in areas such as environmental applications, transportation systems, emergency response systems, and battle management. Other databases, such as meteorological databases for weather information, are three-dimensional, since temperatures and other meteorological information are related to three-dimensional spatial points. In general, a spatial database stores objects that have spatial characteristics that describe them and that have spatial relationships among them. The spatial relationships among the objects are important, and they are often needed when querying the database. Although a spatial database can in general refer to an n-dimensional space for any n, we will limit our discussion to two dimensions as an illustration.
A spatial database is optimized to store and query data related to objects in space, including points, lines and polygons. Satellite images are a prominent example of spatial data. Queries posed on these spatial data, where predicates for selection deal with spatial parameters, are called spatial queries. For example, “What are the names of all bookstores within five miles of the College of Computing building at Georgia Tech?” is a spatial query. Whereas typical databases process numeric and character data, additional functionality needs to be added for databases to process spatial data types. A query such as “List all the customers located within twenty miles of company headquarters” will require the processing of spatial data types typically outside the scope of standard relational algebra and may involve consult-ing an external geographic database that maps the company headquarters and each customer to a 2-D map based on their address. Effectively, each customer will be associated to a <latitude, longitude> position. A traditional B+-tree index based on customers’ zip codes or other nonspatial attributes cannot be used to process this query since traditional indexes are not capable of ordering multidimensional coordinate data. Therefore, there is a special need for databases tailored for handling spatial data and spatial queries.
Table 26.1 shows the common analytical operations involved in processing geographic or spatial data. Measurement operations are used to measure some
Table 26.1 Common Types of Analysis for Spatial Data
Analysis Type : Type of Operations and Measurements
Measurements : Distance, perimeter, shape, adjacency, and direction
Spatial analysis/statistics : Pattern, autocorrelation, and indexes of similarity and topology using spatial and nonspatial data
Flow analysis : Connectivity and shortest path
Location analysis : Analysis of points and lines within a polygon
Terrain analysis : Slope/aspect, catchment area, drainage network
Search : Thematic search, search by region
global properties of single objects (such as the area, the relative size of an object’s parts, compactness, or symmetry), and to measure the relative position of different objects in terms of distance and direction. Spatial analysis operations, which often use statistical techniques, are used to uncover spatial relationships within and among mapped data layers. An example would be to create a map—known as a prediction map—that identifies the locations of likely customers for particular products based on the historical sales and demographic information. Flow analysis operations help in determining the shortest path between two points and also the connectivity among nodes or regions in a graph. Location analysis aims to find if the given set of points and lines lie within a given polygon (location). The process involves generating a buffer around existing geographic features and then identifying or selecting features based on whether they fall inside or outside the boundary of the buffer. Digital terrain analysis is used to build three-dimensional models, where the topography of a geographical location can be represented with an x, y, z data model known as Digital Terrain (or Elevation) Model (DTM/DEM). The x and y dimensions of a DTM represent the horizontal plane, and z represents spot heights for the respective x, y coordinates. Such models can be used for analysis of environmental data or during the design of engineering projects that require terrain information. Spatial search allows a user to search for objects within a particular spatial region. For example, thematic search allows us to search for objects related to a particular theme or class, such as “Find all water bodies within 25 miles of Atlanta” where the class is water.
There are also topological relationships among spatial objects. These are often used in Boolean predicates to select objects based on their spatial relationships. For example, if a city boundary is represented as a polygon and freeways are represented as multilines, a condition such as “Find all freeways that go through Arlington, Texas” would involve an intersects operation, to determine which freeways (lines) intersect the city boundary (polygon).
2. Spatial Data Types and Models
This section briefly describes the common data types and models for storing spatial data. Spatial data comes in three basic forms. These forms have become a de facto standard due to their wide use in commercial systems.
Map Data includes various geographic or spatial features of objects in a map, such as an object’s shape and the location of the object within the map. The three basic types of features are points, lines, and polygons (or areas). Points are used to represent spatial characteristics of objects whose locations correspond to a single 2-d coordinate (x, y, or longitude/latitude) in the scale of a particular application. Depending on the scale, some examples of point objects could be buildings, cellular towers, or stationary vehicles. Moving vehicles and other moving objects can be represented by a sequence of point locations that change over time. Lines represent objects having length, such as roads or rivers, whose spatial characteristics can be approximated by a sequence of connected lines. Polygons are used to represent spatial characteristics of objects that have a boundary, such as countries, states, lakes, or cities. Notice that some objects, such as buildings or cities, can be represented as either points or polygons, depending on the scale of detail.
Attribute data is the descriptive data that GIS systems associate with map features. For example, suppose that a map contains features that represent counties within a US state (such as Texas or Oregon). Attributes for each county feature (object) could include population, largest city/town, area in square miles, and so on. Other attribute data could be included for other features in the map, such as states, cities, congressional districts, census tracts, and so on.
Image data includes data such as satellite images and aerial photographs, which are typically created by cameras. Objects of interest, such as buildings and roads, can be identified and overlaid on these images. Images can also be attributes of map features. One can add images to other map features so that clicking on the feature would display the image. Aerial and satellite images are typical examples of raster data.
Models of spatial information are sometimes grouped into two broad categories: field and object. A spatial application (such as remote sensing or highway traffic con-trol) is modeled using either a field- or an object-based model, depending on the requirements and the traditional choice of model for the application. Field models are often used to model spatial data that is continuous in nature, such as terrain ele-vation, temperature data, and soil variation characteristics, whereas object models have traditionally been used for applications such as transportation networks, land parcels, buildings, and other objects that possess both spatial and non-spatial attributes.
3. Spatial Operators
Spatial operators are used to capture all the relevant geometric properties of objects embedded in the physical space and the relations between them, as well as to perform spatial analysis. Operators are classified into three broad categories.
Topological operators. Topological properties are invariant when topological transformations are applied. These properties do not change after trans-formations like rotation, translation, or scaling. Topological operators are hierarchically structured in several levels, where the base level offers opera-tors the ability to check for detailed topological relations between regions with a broad boundary, and the higher levels offer more abstract operators that allow users to query uncertain spatial data independent of the underlying geometric data model. Examples include open (region), close (region), and inside (point, loop).
Projective operators. Projective operators, such as convex hull, are used to express predicates about the concavity/convexity of objects as well as other spatial relations (for example, being inside the concavity of a given object).
Metric operators. Metric operators provide a more specific description of the object’s geometry. They are used to measure some global properties of single objects (such as the area, relative size of an object’s parts, compactness, and symmetry), and to measure the relative position of different objects in terms of distance and direction. Examples include length (arc) and distance (point, point).
Dynamic Spatial Operators. The operations performed by the operators mentioned above are static, in the sense that the operands are not affected by the application of the operation. For example, calculating the length of the curve has no effect on the curve itself. Dynamic operations alter the objects upon which the operations act. The three fundamental dynamic operations are create, destroy, and update. A representative example of dynamic operations would be updating a spatial object that can be subdivided into translate (shift position), rotate (change orientation), scale up or down, reflect (produce a mirror image), and shear (deform).
Spatial Queries. Spatial queries are requests for spatial data that require the use of spatial operations. The following categories illustrate three typical types of spatial queries:
Range query. Finds the objects of a particular type that are within a given spatial area or within a particular distance from a given location. (For exam-ple, find all hospitals within the Metropolitan Atlanta city area, or find all ambulances within five miles of an accident location.)
Nearest neighbor query. Finds an object of a particular type that is closest to a given location. (For example, find the police car that is closest to the location of crime.)
Spatial joins or overlays. Typically joins the objects of two types based on some spatial condition, such as the objects intersecting or overlapping spatially or being within a certain distance of one another. (For example, find all townships located on a major highway between two cities or find all homes that are within two miles of a lake.)
4. Spatial Data Indexing
A spatial index is used to organize objects into a set of buckets (which correspond to pages of secondary memory), so that objects in a particular spatial region can be easily located. Each bucket has a bucket region, a part of space containing all objects stored in the bucket. The bucket regions are usually rectangles; for point data structures, these regions are disjoint and they partition the space so that each point belongs to precisely one bucket. There are essentially two ways of providing a spatial index.
Specialized indexing structures that allow efficient search for data objects
based on spatial search operations are included in the database system. These indexing structures would play a similar role to that performed by B+-tree indexes in traditional database systems. Examples of these indexing structures are grid files and R-trees. Special types of spatial indexes, known as spatial join indexes, can be used to speed up spatial join operations.
Instead of creating brand new indexing structures, the two-dimensional
(2-d) spatial data is converted to single-dimensional (1-d) data, so that traditional indexing techniques (B+-tree) can be used. The algorithms for converting from 2-d to 1-d are known as space filling curves. We will not discuss these methods in detail (see the Selected Bibliography for further references).
We give an overview of some of the spatial indexing techniques next.
Grid Files. We introduced grid files for indexing of data on multiple attributes in Chapter 18. They can also be used for indexing 2-dimensional and higher n-dimensional spatial data. The fixed-grid method divides an n-dimensional hyper-space into equal size buckets. The data structure that implements the fixed grid is an n-dimensional array. The objects whose spatial locations lie within a cell (totally or partially) can be stored in a dynamic structure to handle overflows. This structure is useful for uniformly distributed data like satellite imagery. However, the fixed-grid structure is rigid, and its directory can be sparse and large.
R-Trees. The R-tree is a height-balanced tree, which is an extension of the B+-tree for k-dimensions, where k > 1. For two dimensions (2-d), spatial objects are approximated in the R-tree by their minimum bounding rectangle (MBR), which is the smallest rectangle, with sides parallel to the coordinate system (x and y) axis, that contains the object. R-trees are characterized by the following properties, which are similar to the properties for B+-trees (see Section 18.3) but are adapted to 2-d spatial objects. As in Section 18.3, we use M to indicate the maximum number of entries that can fit in an R-tree node.
The structure of each index entry (or index record) in a leaf node is (I, object-identifier), where I is the MBR for the spatial object whose identifier is object-identifier.
Every node except the root node must be at least half full. Thus, a leaf node that is not the root should contain m entries (I, object-identifier) where M/2 <= m <= M. Similarly, a non-leaf node that is not the root should contain m entries (I, child-pointer) where M/2 <= m <= M, and I is the MBR that contains the union of all the rectangles in the node pointed at by child-pointer.
All leaf nodes are at the same level, and the root node should have at least two pointers unless it is a leaf node.
All MBRs have their sides parallel to the axes of the global coordinate system.
Other spatial storage structures include quadtrees and their variations. Quadtrees generally divide each space or subspace into equally sized areas, and proceed with the subdivisions of each subspace to identify the positions of various objects. Recently, many newer spatial access structures have been proposed, and this area remains an active research area.
Spatial Join Index. A spatial join index precomputes a spatial join operation and stores the pointers to the related object in an index structure. Join indexes improve the performance of recurring join queries over tables that have low update rates. Spatial join conditions are used to answer queries such as “Create a list of highway-river combinations that cross.” The spatial join is used to identify and retrieve these pairs of objects that satisfy the cross spatial relationship. Because computing the results of spatial relationships is generally time consuming, the result can be com-puted once and stored in a table that has the pairs of object identifiers (or tuple ids) that satisfy the spatial relationship, which is essentially the join index.
A join index can be described by a bipartite graph G = (V1,V2,E), where V1 contains the tuple ids of relation R, and V2 contains the tuple ids of relation S. Edge set contains an edge (vr,vs) for vr in R and vs in S, if there is a tuple corresponding to (vr,vs) in the join index. The bipartite graph models all of the related tuples as connected vertices in the graphs. Spatial join indexes are used in operations (see Section 26.3.3) that involve computation of relationships among spatial objects.
5. Spatial Data Mining
Spatial data tends to be highly correlated. For example, people with similar characteristics, occupations, and backgrounds tend to cluster together in the same neighborhoods.
The three major spatial data mining techniques are spatial classification, spatial association, and spatial clustering.
Spatial classification. The goal of classification is to estimate the value of an attribute of a relation based on the value of the relation’s other attributes. An example of the spatial classification problem is determining the locations of nests in a wetland based on the value of other attributes (for example, vegetation durability and water depth); it is also called the location prediction problem. Similarly, where to expect hotspots in crime activity is also a location prediction problem.
Spatial association. Spatial association rules are defined in terms of spatial predicates rather than items. A spatial association rule is of the form
P1 ^ P2 ^ ... ^ Pn ⇒ Q1 ^ Q2 ^ ... ^ Qm,
where at least one of the Pi’s or Q j’s is a spatial predicate. For example, the rule
is_a(x, country) ^ touches(x, Mediterranean) ⇒ is_a (x, wine-exporter)
(that is, a country that is adjacent to the Mediterranean Sea is typically a wine exporter) is an example of an association rule, which will have a certain support s and confidence c.
Spatial colocation rules attempt to generalize association rules to point to collection data sets that are indexed by space. There are several crucial differences between spatial and nonspatial associations including:
The notion of a transaction is absent in spatial situations, since data is embedded in continuous space. Partitioning space into transactions would lead to an overestimate or an underestimate of interest measures, for exam-ple, support or confidence.
Size of item sets in spatial databases is small, that is, there are many fewer items in the item set in a spatial situation than in a nonspatial situation.
In most instances, spatial items are a discrete version of continuous variables. For example, in the United States income regions may be defined as regions where the mean yearly income is within certain ranges, such as, below $40,000, from $40,000 to $100,000, and above $100,000.
Spatial Clustering attempts to group database objects so that the most similar objects are in the same cluster, and objects in different clusters are as dis-similar as possible. One application of spatial clustering is to group together seismic events in order to determine earthquake faults. An example of a spatial clustering algorithm is density-based clustering, which tries to find clusters based on the density of data points in a region. These algorithms treat clusters as dense regions of objects in the data space. Two variations of these algorithms are density-based spatial clustering of applications with noise (DBSCAN) and density-based clustering (DENCLUE). DBSCAN is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes.
6. Applications of Spatial Data
Spatial data management is useful in many disciplines, including geography, remote sensing, urban planning, and natural resource management. Spatial database management is playing an important role in the solution of challenging scientific problems such as global climate change and genomics. Due to the spatial nature of genome data, GIS and spatial database management systems have a large role to play in the area of bioinformatics. Some of the typical applications include pattern recognition (for example, to check if the topology of a particular gene in the genome is found in any other sequence feature map in the database), genome browser development, and visualization maps. Another important application area of spatial data mining is the spatial outlier detection. A spatial outlier is a spatially referenced object whose nonspatial attribute values are significantly different from those of other spatially referenced objects in its spatial neighborhood. For example, if a neighborhood of older houses has just one brand-new house, that house would be an outlier based on the nonspatial attribute ‘house_age’. Detecting spatial outliers is useful in many applications of geographic information systems and spatial data-bases. These application domains include transportation, ecology, public safety, public health, climatology, and location-based services.