Parallel Database
- A parallel database is a system in which multiple processors execute queries simultaneously.
- A parallel database can use thousands of small processors.
Advantages of parallel databases
- Improved Performance: data is processed faster.
- Increased data-processing speed, achieved by using multiple resources such as CPUs and disks in parallel.
- High Availability: the same data is stored at multiple locations.
- Increased Reliability: even if one data site fails, execution can continue, because other copies of the data are available.
Distributed Database
Data are often generated and stored on different database systems, and there is a need to execute queries and update transactions across multiple databases. This need led to the development of distributed database systems.
A distributed database is a collection of multiple interconnected databases, which are spread physically across various locations that communicate via a computer network.
Advantages:
- Modular Development: to expand the system to new locations, we simply add nodes to the current network; the new nodes do not interrupt the network's existing functions.
- Increased Reliability: if one node fails, its work can be distributed among the other nodes on the network, so the failure of one node does not stop the system.
- Improved Performance: a small database is easier to manage than a large one, so a distributed database splits a large database into smaller databases across various locations, which are easier to handle and perform better.
- Increased Availability: the failure of one node does not affect data availability, since the data can be obtained from other nodes on the network.
- Faster Response: because the data is available locally, retrieval is efficient.
Database System Architectures
- Centralized and Client-Server Architectures
- Server System Architectures
- Parallel Systems
- Distributed Systems
Centralized Database Systems
- Centralized database systems are those that run on a single computer system.
- They are widely used for enterprise-scale applications.
- There are two ways in which computers are used: as single-user systems and as multiuser systems.
A typical single-user system is a system used by a single person, usually with only one processor (usually with multiple cores), and one or two disks.
A typical multiuser system, on the other hand, has multiple disks, a large amount of memory, and multiple processors. Such systems serve a large number of users who are connected to the system remotely, and they are called server systems.
Database systems designed for single-user systems usually do not provide many of the facilities that a multiuser database provides.
Such systems may not support SQL, and may instead provide an API for data access. Database systems of this kind are referred to as embedded databases, since they are usually designed to be linked to a single application program and are accessible only from that application.
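For instance, SQLite, used below through Python's built-in sqlite3 module, is a well-known embedded database: the engine is linked into the application and runs in-process, with no server involved (unlike the API-only systems just described, SQLite does accept SQL text; the file name here is made up).

```python
import sqlite3

# The database engine runs inside the application process; no separate server.
conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("greeting", "hello"))
conn.commit()
print(conn.execute("SELECT v FROM kv WHERE k = ?", ("greeting",)).fetchone())
conn.close()
```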
Multiuser database systems support the full transactional features. Such databases are usually designed as servers, which service requests received from application programs; the requests could be in the form of SQL queries or they could be requests for retrieving, storing, or updating data specified using an API.
Most general-purpose computer systems in use today have a few multicore processors (typically one to four), with each multicore processor having a few cores. Main memory is shared across all the processors in a general-purpose computer system. Parallelism with such a small number of cores, and with shared memory, is referred to as coarse-grained parallelism.
Databases running on coarse-grained parallel machines traditionally did not attempt to partition a single query among the processors; instead, they ran each query on a single processor, allowing multiple queries to run concurrently. Thus, such systems support a higher throughput; that is, they allow a greater number of transactions to run per second.
In contrast, machines with fine-grained parallelism have a large number of processors, and database systems running on such machines attempt to parallelize single tasks (queries, for example) submitted by users.
Client-Server System
Server systems satisfy requests generated by multiple client systems.
Database functionality can be broadly divided into two parts:
- Front end: the user-facing part, such as SQL user interfaces, forms, and report-generation tools.
- Back end: manages access structures, query evaluation and optimization, concurrency control, and recovery.
Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines:
- better functionality for the cost
- flexibility in locating resources and expanding facilities
- better user interfaces
- easier maintenance
Server System Architectures
Server systems can be broadly categorized as transaction servers and data servers.
# Transaction-server systems
- also called query-server systems, provide an interface to which clients can send requests to perform an action, in response to which they execute the action and send back results to the client.
Usually, client machines ship transactions to the server systems, where those transactions are executed, and results are shipped back to clients that are in charge of displaying the data.
Requests may be specified through the use of SQL or through a specialized application program interface.
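As a sketch of the SQL route, a client might ship a query to a transaction server through a DB-API driver such as psycopg2 (the host, database, and table names below are placeholders):

```python
import psycopg2  # PostgreSQL driver; any DB-API 2.0 driver looks much the same

# The client ships SQL text to the server; the query executes server-side,
# and only the result rows travel back for the client to display.
conn = psycopg2.connect(host="db.example.com", dbname="sales", user="app")
cur = conn.cursor()
cur.execute("SELECT id, total FROM orders WHERE total > %s", (100,))
for row in cur.fetchall():
    print(row)
conn.close()
```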
A transaction-server system consists of multiple processes accessing data in shared memory.
The processes that form part of the database system include:
- Server processes: These are processes that receive user queries (transactions), execute them, and send the results back. The queries may be submitted to the server processes from a user interface, or from a user process running embedded SQL, or via JDBC, ODBC, or similar protocols.
- Lock manager process: This process implements lock manager functionality, which includes lock grant, lock release, and deadlock detection.
- Database writer process: There are one or more processes that output modified buffer blocks back to disk continuously.
- Log writer process: This process outputs log records from the log record buffer to stable storage. Server processes simply add log records to the log record buffer in shared memory, and if a log force is required, they request the log writer process to output log records (recall that a log force causes the log contents in memory to be output to stable storage).
- Checkpoint process: This process performs periodic checkpoints.
- Process monitor process: This process monitors other processes, and if any of them fails, it takes recovery actions for the process, such as aborting any transaction being executed by the failed process and then restarting the process.
The shared memory contains all shared data, such as:
- Buffer pool.
- Lock table.
- Log buffer, containing log records waiting to be output to the log on stable storage.
- Cached query plans, which can be reused if the same query is submitted again.
To ensure that no two processes access the same data structure at the same time, database systems implement mutual exclusion using either of the following (a sketch follows the list):
- Operating system semaphores
- Atomic instructions, such as test-and-set or compare-and-swap, which are supported by the computer hardware
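A minimal sketch of this idea (Python's threading.Lock stands in for an OS semaphore; the lock-table layout is hypothetical):

```python
import threading

# A shared lock table of the kind a transaction server keeps in shared memory:
# data item -> id of the transaction currently holding the lock.
lock_table = {}
lock_table_mutex = threading.Lock()  # stands in for an OS semaphore

def try_lock(item, txn_id):
    # Mutual exclusion: only one process may modify the table at a time.
    with lock_table_mutex:
        if item not in lock_table:
            lock_table[item] = txn_id
            return True
        return False  # lock currently held by another transaction

def release_lock(item, txn_id):
    with lock_table_mutex:
        if lock_table.get(item) == txn_id:
            del lock_table[item]
```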
# Data-server systems
- allow clients to interact with the servers by making requests to read or update data, in units such as files, pages, or objects.
For example, file servers provide a file-system interface where clients can create, update, read, and delete files.
Data servers are used in local area networks where:
- There is a high-speed connection between the clients and the server.
- The client machines have comparatively more processing power.
- Tasks to be executed are computing-intensive.
In such an environment, the server machine ships data to the client machines, all processing is performed at the client machine, and the data is then shipped back to the server machine.
Data servers for database systems offer much more functionality; they support units of data, such as pages, tuples, or objects, that are smaller than a file. They provide indexing facilities for data, and they provide transaction facilities so that the data are never left in an inconsistent state if a client machine or process fails. We use the term data item to refer to tuples, objects, files, and documents, and we use the terms data server and data storage system interchangeably. Data servers support communication of entire data items.
Data servers in earlier generations of storage systems supported a concept called page shipping, where the unit of communication is a database page that may potentially contain multiple data items. Page shipping is not used today, since storage systems do not expose the underlying storage layout to clients.
Caching at Clients
The time cost of communication between a client application and a server is high compared to that of a local memory reference.
The following issues affect the time cost of communication between client and server:
- Data shipping (page shipping versus item shipping): the unit of communication for data can be a page or an item (a tuple or an object); data can be fetched by shipping a whole page or a single item.
- Locks: locks are usually granted by the server for the data items or pages that it ships to the client machine.
The time taken to send a message over a network and get a response back is called the network round-trip time, or network latency. As a result, applications running at clients adopt several optimization strategies to reduce the effects of network latency.
The optimization strategies include the following (a caching sketch follows the list):
- Prefetching: fetching items even before they are requested.
- Data caching: data shipped to a client on behalf of a transaction can be cached at the client within the scope of that transaction. Data can even be cached after the transaction completes, allowing successive transactions at the same client to make use of the cached data; however, cache coherency then becomes an issue.
- Lock caching: locks granted to a client can be cached at that client; if the same data item is requested again and its lock is still cached locally, no interaction with the server is needed (the server can call back a cached lock when another client needs it).
- Adaptive lock granularity: the granularity of locking (for example, item-level versus page-level) can be changed dynamically depending on the access pattern. Lock de-escalation is a way of adaptively decreasing the lock granularity when there is higher contention.
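A minimal client-side sketch of data caching and prefetching (the cache layout and fetch_from_server are hypothetical stand-ins for the client/server protocol):

```python
# Hypothetical client-side data cache with simple prefetching.
cache = {}

def fetch_from_server(key):
    return f"value-of-{key}"  # placeholder for a network round trip

def get(key, prefetch=()):
    # Serve from the local cache when possible, avoiding network latency;
    # optionally fetch related items before they are actually requested.
    if key not in cache:
        cache[key] = fetch_from_server(key)
    for k in prefetch:
        cache.setdefault(k, fetch_from_server(k))
    return cache[key]

print(get("order:1", prefetch=("order:2",)))
print(get("order:2"))  # already cached: no server round trip needed
```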
Parallel Systems
Parallel systems improve processing and I/O speeds by using a large number of computers in parallel.
In parallel processing, many operations are performed simultaneously, as opposed to serial processing, in which the computational steps are performed sequentially.
A coarse-grain parallel machine consists of a small number of powerful processors; a massively parallel or fine-grain parallel machine uses thousands of smaller processors.
Measures of Performance for Parallel Systems
There are two main measures of the performance of a database system:
(1) throughput, the number of tasks that can be completed in a given time interval.
(2) response time, the amount of time it takes to complete a single task from the time it is submitted.
A system that processes a large number of small transactions can improve throughput by processing many transactions in parallel. A system that processes large transactions can improve response time as well as throughput by performing subtasks of each transaction in parallel.
Two important issues in studying parallelism are speedup and scaleup. Running a given task in less time by increasing the degree of parallelism is called speedup. Handling larger tasks by increasing the degree of parallelism is called scaleup.
Speedup
Speedup is executing a fixed-size problem in less time by increasing the degree of parallelism: a problem that runs on a small system is given to a system N times larger. If T_S is the task's execution time on the smaller system and T_L its execution time on the larger system, then speedup = T_S / T_L, and the speedup is linear if it equals N.
Scaleup
Scaleup is handling a larger task in the same amount of time by increasing the degree of parallelism: both the problem and the system grow, with a system N times larger performing a job N times larger. If T_S is the time the small system takes on the small task and T_L the time the large system takes on the N-times-larger task, then scaleup = T_S / T_L. Scaleup is linear if this ratio equals 1.
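A small numeric illustration of the two ratios (the timings are made up):

```python
def speedup(t_small, t_large):
    # Same fixed-size task, timed on the small and the N-times-larger system.
    return t_small / t_large

def scaleup(t_small_task, t_large_task):
    # The task grows N times along with the system; linear scaleup keeps this at 1.
    return t_small_task / t_large_task

# A query takes 100 s on 1 node and 26 s on 4 nodes: speedup ~3.85 (linear = 4).
print(speedup(100, 26))
# A job takes 100 s on 1 node; a 4x-larger job takes 110 s on 4 nodes: scaleup ~0.91.
print(scaleup(100, 110))
```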
Two kinds of scaleup are relevant in parallel database systems, depending on how the size of the task is measured:
- In batch scaleup, the size of the database increases, and the tasks are large jobs whose runtime depends on the size of the database.
- In transaction scaleup, the rate at which transactions are submitted to the database increases, and the size of the database increases proportionally to the transaction rate. This kind of scaleup is what is relevant in transaction-processing systems where the transactions are small updates.
Scaleup is usually the more important metric for measuring the efficiency of parallel database systems. The goal of parallelism in database systems is usually to ensure that the database system can continue to perform at an acceptable speed, even as the size of the database and the number of transactions increase.
Several factors work against efficient parallel operation and can diminish both speedup and scaleup (see the sketch after this list):
- Sequential computation and start-up costs: some part of every task cannot be parallelized, and there is a start-up cost for initiating each of the many parallel processes.
- Interference: processes executing in parallel compete for shared resources (such as a system bus, shared disks, or locks) and slow one another down.
- Skew: it is difficult to divide a task into exactly equal-sized parts; the completion time of the whole task is determined by its slowest part.
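The cost of the sequential fraction is captured by Amdahl's law (a standard result, brought in here as an illustration rather than taken from the source): if a fraction s of a task is inherently sequential, the speedup on n processors is at most 1 / (s + (1 - s)/n).

```python
def amdahl_speedup(s, n):
    # Upper bound on speedup when a fraction s of the work is sequential.
    return 1.0 / (s + (1.0 - s) / n)

# Even with only 5% sequential work, 1000 processors give at most ~20x speedup.
print(amdahl_speedup(0.05, 1000))  # ~19.6
```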
Interconnection Networks
Parallel systems consist of a set of components (processors, memory, and disks) that can communicate with each other via an interconnection network.
Commonly used types of interconnection networks:
- Bus. All the system components can send data on and receive data from a single communication bus.
- Ring. The components are nodes arranged in a ring (circle), and each node is connected to its two adjacent nodes in the ring.
- Mesh. The components are nodes in a grid, and each component connects to all its adjacent components in the grid.
- Hypercube. The components are numbered in binary, and a component is connected to another if the binary representations of their numbers differ in exactly one bit. Thus, each of the n components is connected to log(n) other components. In a hypercube interconnection, a message from a component can reach any other component by going through at most log(n) links.
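A small sketch of the hypercube numbering (not from the source; n must be a power of two): the neighbors of component i are found by flipping each bit of i's binary representation in turn.

```python
def hypercube_neighbors(i, n):
    # Flip each of the log2(n) bits of component number i.
    return [i ^ (1 << b) for b in range(n.bit_length() - 1)]

# In an 8-component hypercube, component 0 (000) connects to 1 (001), 2 (010), 4 (100).
print(hypercube_neighbors(0, 8))  # [1, 2, 4]
print(hypercube_neighbors(5, 8))  # [4, 7, 1]
```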
Parallel Database Architectures
There are several architectural models for parallel machines.
- Shared memory. All the processors share a common memory.
- Shared disk. A set of nodes that share a common set of disks; each node has its own processor and memory. Shared-disk systems are sometimes called clusters.
- Shared nothing. A set of nodes that share neither a common memory nor common disk.
- Hierarchical. This model is a hybrid of the preceding three architectures. This model is the most widely used today.
Distributed Systems
In a distributed database system, the database is stored on nodes located at geographically separated sites.
The nodes in a distributed system communicate with one another through various communication media, such as high-speed private networks or the internet. They do not share the main memory or disks.
The main differences between shared-nothing parallel databases and distributed databases include the following:
- The nodes of a distributed database are typically geographically separated, whereas the nodes of a parallel system are usually housed in a single location.
- The nodes of a distributed database may be administered separately, whereas a parallel system is administered as a unit.
- Distributed databases communicate over slower wide-area networks, with far higher latency and lower bandwidth than the interconnects used in parallel systems.
In a distributed database system, we differentiate between local and global transactions.
A local transaction is one that accesses data only from nodes where the transaction was initiated.
A global transaction, on the other hand, is one that either accesses data in a node different from the one at which the transaction was initiated, or accesses data in several different nodes.
Distributed databases that are built by integrating existing database systems have somewhat different characteristics.
Sharing data. The major advantage in building a distributed database system is the provision of an environment where users at one site may be able to access the data residing at other sites.
Autonomy. The primary advantage of accomplishing data sharing by means of data distribution is that each site can retain a degree of control over the data stored locally. In a centralized system, the database administrator of the central site controls the database. In a distributed system, there is a global database administrator responsible for the entire system, and part of these responsibilities is delegated to the local database administrator of each site.
Types:
In a homogeneous distributed database system, nodes share a common global schema, all nodes run the same distributed database-management software, and the nodes actively cooperate in processing transactions and queries.
However, in many cases a distributed database has to be constructed by linking together multiple already-existing database systems, each with its own schema and possibly running different database-management software. The sites may not be aware of one another, and they may provide only limited facilities for cooperation in query and transaction processing. Such systems are sometimes called federated database systems or heterogeneous distributed database systems.
Nodes in a distributed database communicate over wide-area networks (WANs).
Disadvantages:
- Added complexity required to ensure proper coordination among sites.
- Software development cost.
- Greater potential for bugs.
- Increased processing overhead.
Parallel Databases
Nowadays organizations need to handle huge amounts of data at a high transfer rate. For such requirements, client-server or centralized systems are not efficient. The need to improve the efficiency of the system brought the concept of the parallel database into the picture: a parallel database system seeks to improve the performance of the system through parallelization.
Need:
Multiple resources like CPUs and disks are used in parallel, and operations are performed simultaneously, as opposed to serial processing. A parallel server can allow access to a single database by users on multiple machines. It also parallelizes many operations, such as data loading, query processing, index building, and query evaluation.
Advantages:
- Performance Improvement: by connecting multiple resources like CPUs and disks in parallel, we can significantly increase the performance of the system.
- High Availability: in a parallel database, nodes have little contact with each other, so the failure of one node does not cause the failure of the entire system. This amounts to significantly higher database availability.
- Proper Resource Utilization: due to parallel execution, the CPU is never idle, so resources are properly utilized.
- Increased Reliability: when one site fails, execution can continue with another available site that holds a copy of the data, making the system more reliable.
I/O Parallelism
I/O parallelism refers to reducing the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning is the most common form of data partitioning: the tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (a sketch of all three follows the list):
1. Round-robin: scan the relation in any order and send the i-th tuple inserted into the relation to disk i mod n.
2. Hash partitioning: choose one or more attributes as the partitioning attributes, and choose a hash function with range {0, ..., n - 1}. Each tuple of the original relation is hashed on the partitioning attributes; if the hash function returns i, the tuple is placed on disk D_i.
3. Range partitioning: this strategy distributes tuples by assigning contiguous attribute-value ranges to each disk.
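A minimal sketch of the three techniques (the relation, attribute names, and range boundaries are made up; n = 4 disks, numbered 0 to 3):

```python
N_DISKS = 4

def round_robin(tuples):
    # The i-th tuple goes to disk i mod n.
    return [i % N_DISKS for i, _ in enumerate(tuples)]

def hash_partition(tuples, attr):
    # Hash the partitioning attribute into the range {0, ..., n - 1}.
    return [hash(t[attr]) % N_DISKS for t in tuples]

def range_partition(tuples, attr, boundaries=(10, 20, 30)):
    # Contiguous value ranges: < 10 -> disk 0, 10-19 -> disk 1, and so on.
    def disk_for(v):
        for d, b in enumerate(boundaries):
            if v < b:
                return d
        return len(boundaries)
    return [disk_for(t[attr]) for t in tuples]

rows = [{"id": i, "qty": i * 7 % 40} for i in range(8)]
print(round_robin(rows))             # [0, 1, 2, 3, 0, 1, 2, 3]
print(hash_partition(rows, "id"))    # even spread for distinct keys
print(range_partition(rows, "qty"))  # depends on the value distribution
```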