- Distributed and cloud
computing systems are built over a large number of autonomous computer nodes.
- These node machines are interconnected by SANs, LANs, or WANs in a hierarchical
manner.
- With today’s networking technology, a
few LAN switches can easily connect hundreds of machines as a working cluster.
- A WAN can connect many local clusters to form a very large cluster of clusters.
In this sense, one can build a massive system with millions of computers connected
to edge networks.
Massive systems are considered highly scalable and can reach web-scale connectivity, either physically or logically.
Massive systems are classified into four groups: clusters, P2P networks, computing grids, and Internet clouds.
In terms of node number, these four system classes may involve hundreds, thousands, or even millions of computers as participating nodes. These machines work collectively, cooperatively, or collaboratively at various levels. The accompanying table characterizes these four system classes in various technical and application aspects.
Clusters of Cooperative Computers
A computing cluster consists
of interconnected stand-alone computers which work cooperatively as a single
integrated computing resource.
Cluster Architecture
The nodes of a cluster are interconnected by a physical network, which can be as simple as a SAN
(e.g., Myrinet) or a LAN (e.g., Ethernet). To build a larger cluster with more
nodes, the interconnection network can be built with multiple levels of Gigabit
Ethernet, Myrinet, or InfiniBand switches. Through hierarchical construction
using a SAN, LAN, or WAN, one can build scalable clusters with an increasing
number of nodes. The cluster is connected to the Internet via a virtual private
network (VPN) gateway, whose IP address locates the cluster. The system
image of a computer is decided by the way the OS manages the shared cluster
resources. Most clusters have loosely coupled node computers; all resources of
a server node are managed by its own OS. Thus, most clusters have multiple
system images as a result of having many autonomous nodes under different OS
control.
Single-System Image
An ideal cluster should merge multiple system images into a single-system image (SSI). Cluster designers desire a cluster operating system or some middleware to
support SSI at various levels, including the sharing of CPUs, memory, and I/O
across all cluster nodes. An SSI is an illusion created by software or hardware
that presents a collection of resources as one integrated, powerful resource.
SSI makes the cluster appear like a single machine to the user. A cluster with
multiple system images is nothing but a collection of independent computers.
Hardware, Software, and Middleware Support
Clusters exploiting
massive parallelism are commonly known as massively parallel processors (MPPs). Almost all HPC clusters in the
Top 500 list are also MPPs. The building blocks are computer nodes (PCs, workstations,
servers, or SMPs), special communication software such as PVM or MPI, and a
network interface card in each computer node. Most clusters run under the Linux
OS. The computer nodes are interconnected by a high-bandwidth network (such as
Gigabit Ethernet, Myrinet, or InfiniBand).
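To make the message-passing building block concrete, the following is a minimal sketch of an MPI-style program using the mpi4py binding of MPI. It assumes mpi4py is installed on the nodes and the job is launched with a launcher such as mpirun; the work values and message tags are illustrative only.
```python
# Minimal message-passing sketch using the mpi4py binding of MPI.
# Launched across cluster nodes with, e.g.: mpirun -np 4 python ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the cluster job
size = comm.Get_size()   # total number of cooperating processes

if rank == 0:
    # Node 0 hands out illustrative work items to the other nodes.
    for dest in range(1, size):
        comm.send({"task": dest * 100}, dest=dest, tag=0)
    results = [comm.recv(source=src, tag=1) for src in range(1, size)]
    print("collected:", results)
else:
    # Worker nodes receive a task and return a partial result.
    task = comm.recv(source=0, tag=0)
    comm.send(task["task"] + rank, dest=0, tag=1)
```
Every node runs the same program; the rank reported by the MPI runtime distinguishes the coordinating node from the workers.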
Special
cluster middleware support is needed to create SSI or high availability (HA). Both sequential and
parallel applications can run on the cluster, and special parallel environments
are needed to facilitate use of the cluster resources. For example, distributed
memory has multiple images; users may want all distributed memory to be shared
by all servers by forming distributed
shared memory (DSM). Many SSI features are expensive or difficult to achieve at
various cluster operational levels.
Instead of achieving SSI, many clusters are loosely coupled machines. Using
virtualization, one can build many virtual clusters dynamically, upon user
demand.
Major Cluster Design Issues
Unfortunately, a cluster-wide
OS for complete resource sharing is not available yet. Middleware or OS
extensions have been developed in user space to achieve SSI at selected
functional levels. Without this middleware, cluster nodes cannot work together
effectively to achieve cooperative computing. The software environments and
applications must rely on the middleware to achieve high performance.
Grid Computing Infrastructures
In the past 30 years, users
have experienced a natural growth path from Internet to web and grid computing
services. Internet services such as Telnet enable a local computer to connect
to a remote computer. A web service such as HTTP enables remote access to
web pages. Grid computing is envisioned to allow close interaction among
applications running on distant computers simultaneously. Forbes Magazine has projected the global growth of the
IT-based economy from $1 trillion in 2001 to $20 trillion by 2015. The
evolution from Internet to web and grid services is certainly playing a major
role in this growth.
Computational Grids
Like an electric utility power grid, a computing grid offers
an infrastructure that couples computers, software/middleware, special
instruments, people, and sensors together. The grid is often constructed
across LAN, WAN, or Internet backbone networks at a regional, national, or
global scale. Enterprises or organizations present grids as integrated
computing resources. They can also be viewed as virtual platforms to support virtual organizations. The computers used in a
grid are primarily workstations, servers, clusters, and supercomputers.
Personal computers, laptops, and PDAs can be used as access devices to a grid
system.
The resource sites offer complementary
computing resources, including workstations, large servers, a mesh of
processors, and Linux clusters to satisfy a chain of computational needs. The
grid is built across various IP broadband networks including LANs and WANs
already used by enterprises or organizations over the Internet. The grid is
presented to users as an integrated resource pool as shown in the upper half of
the figure.
Special instruments may be
involved, such as the radio telescope used in the SETI@Home search for life in the
galaxy and in Astrophysics@Swinburne for pulsars. At the server end, the
grid is a network. At the client end, we see wired or wireless terminal
devices. The grid integrates the computing, communication, contents, and
transactions as rented services. Enterprises and consumers form the user base,
which then defines the usage trends and service characteristics.
Grid Families
Grid technology demands new
distributed computing models, software/middleware support, network protocols,
and hardware infrastructures. National grid projects are followed by industrial
grid platform development by IBM, Microsoft, Sun, HP, Dell, Cisco, EMC,
Platform Computing, and others. New grid service providers (GSPs) and new grid
applications have emerged rapidly, similar to the growth of Internet and web
services in the past two decades.
Peer-to-Peer Network Families
An example of a well-established
distributed system is the client-server
architecture. In this scenario, client machines (PCs and workstations) are
connected to a central server for compute, e-mail, file access, and database
applications. The P2P
architecture offers a distributed model of networked systems. First, a P2P
network is client-oriented instead of server-oriented. In this section, P2P
systems are introduced at the physical level and overlay networks at the
logical level.
P2P Systems
In a P2P system, every node
acts as both a client and a server, providing part of the system resources.
Peer machines are simply client computers connected to the Internet. All client
machines act autonomously to join or leave the system freely. This implies that
no master-slave relationship exists among the peers. No central coordination or
central database is needed. In other words, no peer machine has a global view
of the entire P2P system. The system is self-organizing with distributed
control.
Initially, the peers are totally unrelated. Each peer machine joins or leaves
the P2P network voluntarily. Only the participating peers form the physical network at any time. Unlike the cluster or grid, a P2P
network does not use a dedicated interconnection network. The physical network
is simply an ad hoc network formed at various Internet domains randomly using
the TCP/IP and NAI protocols. Thus, the physical network varies in size and
topology dynamically due to the free membership in the P2P network.
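The dual client/server role of a peer can be sketched as follows. The port number, peer address, and message format are hypothetical, and peer discovery (bootstrapping) is omitted; the sketch only shows that each peer both accepts connections and initiates them.
```python
# Minimal sketch of a peer that plays both client and server roles.
import socket
import threading

def serve(port):
    """Server role: accept connections from other peers and acknowledge."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen()
    while True:
        conn, _addr = srv.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(b"ack:" + data)

def query_peer(host, port, message):
    """Client role: connect to another peer and exchange a message."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(message)
        return sock.recv(1024)

# Each peer starts its own server thread, then talks to known peers as a client.
threading.Thread(target=serve, args=(9001,), daemon=True).start()
# reply = query_peer("198.51.100.7", 9001, b"hello")   # hypothetical remote peer
```
Because every peer both accepts and initiates connections, machines can join or leave at will without any central coordinator, which is exactly the free-membership behavior described above.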
Overlay Networks
Data items or files are
distributed among the participating peers. Based on communication or file-sharing
needs, the peer IDs form an overlay
network at the logical level. This overlay is a virtual network formed by mapping each
physical machine with its ID, logically, through a virtual mapping as shown in
Figure 1.17. When a new peer joins the system, its peer ID is added as a node
in the overlay network. When an existing peer leaves the system, its peer ID is
removed from the overlay network automatically. Therefore, it is the P2P
overlay network that characterizes the logical connectivity among the peers.
There
are two types of overlay networks: unstructured and structured.
An unstructured overlay network is characterized by a random graph. There is no
fixed route to send messages or files among the nodes. Often, flooding is applied to
send a query to all nodes in an unstructured overlay, thus resulting in heavy
network traffic and nondeterministic search results.
Structured overlay networks follow a certain connectivity
topology and rules for inserting and removing nodes (peer IDs) from the overlay graph. Routing mechanisms are
developed to take advantage of the structured overlays.
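As an illustration of these two ideas, the sketch below builds a small random-graph overlay and floods a query with a time-to-live (TTL) bound. The graph construction, peer names, and TTL value are assumptions for the example, not part of any particular P2P protocol.
```python
# Sketch of an unstructured overlay: peer IDs are nodes in a random graph,
# and a query is flooded to neighbors with a TTL to bound the traffic.
import random

def build_overlay(peer_ids, degree=3):
    """Connect each joining peer to a few randomly chosen existing peers."""
    overlay = {pid: set() for pid in peer_ids}
    for i, pid in enumerate(peer_ids):
        for neighbor in random.sample(peer_ids[:i], min(degree, i)):
            overlay[pid].add(neighbor)
            overlay[neighbor].add(pid)
    return overlay

def flood_search(overlay, start, has_item, ttl=4):
    """Forward the query to all neighbors until the TTL expires; results are
    nondeterministic and traffic grows quickly, as noted in the text."""
    frontier, visited, hits = {start}, set(), []
    while frontier and ttl >= 0:
        visited |= frontier
        hits += [p for p in frontier if has_item(p)]
        frontier = {n for p in frontier for n in overlay[p]} - visited
        ttl -= 1
    return hits

peers = [f"peer-{i}" for i in range(50)]
overlay = build_overlay(peers)
print(flood_search(overlay, "peer-0", lambda p: p.endswith("7")))
```
A structured overlay such as a DHT would replace the random neighbor choice and flooding with deterministic, ID-based routing over a fixed topology.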
P2P Application Families
Based on application, P2P
networks are classified into four groups, as shown in Table 1.5. The first
family is for distributed file sharing of digital contents (music, videos,
etc.) on the P2P network. This includes many popular P2P networks such as
Gnutella, Napster, and BitTorrent, among others. Collaboration P2P networks
include MSN or Skype chatting, instant messaging, and collaborative design,
among others. The third family is for distributed P2P computing in specific
applications. For example, SETI@home provides 25 Tflops of distributed
computing power, collectively, over 3 million Internet host machines. Other P2P
platforms, such as JXTA, .NET, and FightingAID@home, support naming, discovery,
communication, security, and resource aggregation in some P2P applications.
P2P Computing Challenges
P2P computing faces three
types of heterogeneity problems in hardware, software, and network
requirements. There are too many hardware models and architectures to select
from; incompatibility exists between software and the OS; and different
network connections and protocols make it too complex to apply
in real applications. We need system scalability as the workload increases.
System scaling is directly related to performance and bandwidth. P2P networks
do have these properties. Data location also affects collective
performance. Data locality, network proximity, and interoperability are three
design objectives in distributed P2P applications.
P2P performance is affected by routing efficiency and
self-organization by participating peers. Fault tolerance, failure management,
and load balancing are other important issues in using overlay networks. Lack
of trust among peers poses another problem. Peers are strangers to one another.
Security, privacy, and copyright violations are major concerns for those in the
industry in terms of applying P2P technology in business applications [35]. In
a P2P network, all clients provide resources including computing power, storage
space, and I/O bandwidth. The distributed nature of P2P networks also increases
robustness, because limited peer failures do not form a single point of
failure.
By replicating data across multiple peers, the system can also recover data held by
failed nodes. On the other hand, disadvantages of P2P networks do exist.
Because the system is not centralized, managing it is difficult. In addition,
the system lacks security. Anyone can log on to the system and cause damage or
abuse. Further, all client computers connected to a P2P network cannot be
considered reliable or virus-free. In summary, P2P networks are reliable for a
small number of peer nodes. They are only useful for applications that require
a low level of security and have no concern for data sensitivity. We will
discuss P2P networks in Chapter 8, and extending P2P technology to social networking
in Chapter 9.
Cloud Computing over the Internet
Gordon Bell, Jim Gray, and
Alex Szalay [5] have advocated: “Computational science is changing to be
data-intensive. Supercomputers must be balanced systems, not just CPU farms but also petascale I/O and
networking arrays.” In the future, working with large data sets
will typically mean sending the computations (programs) to the data, rather
than copying the data to the workstations. This reflects the trend in IT of
moving computing and data from desktops to large data centers, where there is
on-demand provision of software, hardware, and data as a service. This data
explosion has promoted the idea of cloud computing.
Cloud computing has been defined differently by
many users and designers. For example, IBM, a major player in cloud computing,
has defined it as follows: “A cloud is a pool of
virtualized computer resources. A cloud can host a variety of
different workloads, including batch-style backend jobs and interactive and
user-facing applications.” Based on this definition, a cloud allows workloads to be deployed and scaled out quickly through rapid provisioning of
virtual or physical machines. The cloud supports redundant, self-recovering,
highly scalable programming models that allow workloads to recover from many
unavoidable hardware/software failures. Finally, the cloud system should be
able to monitor resource use in real time to enable rebalancing of allocations
when needed.
Internet Clouds
Cloud computing applies a
virtualized platform with elastic resources on demand by provisioning hardware,
software, and data sets dynamically (see Figure 1.18). The idea is to move
desktop computing to a service-oriented platform using server clusters and huge
databases at data centers. Cloud computing leverages its low cost and
simplicity to benefit both users and providers. Machine virtualization has
enabled such cost-effectiveness. Cloud computing intends to satisfy many user applications simultaneously.
The cloud ecosystem must be designed to be secure, trustworthy, and dependable.
Some computer users think of the cloud as a centralized resource pool. Others
consider the cloud to be a server cluster which practices distributed computing
over all the servers used.
The Cloud Landscape
Traditionally, a distributed
computing system tends to be owned and operated by an autonomous administrative
domain (e.g., a research laboratory or company) for on-premises computing
needs. However, these traditional systems have encountered several performance
bottlenecks: constant system maintenance, poor utilization, and increasing
costs associated with hardware/software upgrades. Cloud computing as an
on-demand computing paradigm resolves or relieves these problems.
Figure 1.19 depicts the cloud landscape and major cloud players, based on three
cloud service models. Chapters 4, 6, and 9 provide details regarding these
cloud service offerings. Chapter 3 covers the relevant virtualization tools.
Infrastructure as a Service (IaaS)
- This model puts together the infrastructure demanded by users, namely servers, storage, networks, and the data center fabric.
- The user can deploy and run multiple VMs running guest OSes for specific applications.
- The user does not manage or control the underlying cloud infrastructure, but can specify when to request and release the needed resources (see the sketch after this list).
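As a hedged illustration of this request/release pattern, the sketch below uses the AWS EC2 service through the boto3 SDK; the region, AMI ID, and instance type are placeholders, and any comparable IaaS API would follow the same pattern.
```python
# Sketch of the IaaS usage pattern: the user requests VMs on demand and
# releases them when done, without managing the underlying infrastructure.
# Assumes an AWS account configured for boto3; the AMI ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request (provision) a virtual machine from the provider's pool.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]

# ... deploy and run the user's application on the VM ...

# Release the resource when it is no longer needed.
ec2.terminate_instances(InstanceIds=[instance_id])
```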
Platform as a Service (PaaS)
- This model enables the user to deploy user-built applications onto a
virtualized cloud platform.
- PaaS includes middleware, databases, development
tools, and some runtime support such as Web 2.0 and Java.
- The platform includes
both hardware and software integrated with specific programming interfaces.
- The provider supplies the API and software tools (e.g., Java, Python, Web 2.0, .NET). The user is freed from managing the cloud infrastructure (see the sketch after this list).
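To illustrate what a user-built application means in this model, the sketch below is a minimal web application written with Flask, chosen here only as an example framework. On a PaaS the provider supplies the Python runtime, web front end, and scaling; the user deploys just this code plus a dependency list.
```python
# Minimal user-built web application of the kind deployed onto a PaaS.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from a platform-managed runtime"

if __name__ == "__main__":
    # Locally the app runs on Flask's built-in server; on a PaaS the
    # platform's front end routes requests to it instead.
    app.run(port=8080)
```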
Software as a Service (SaaS)
- This refers to browser-initiated application software delivered to thousands of paying cloud customers.
- The SaaS model applies to business processes, industry applications, customer relationship management (CRM), enterprise resource planning (ERP), human resources (HR), and collaborative applications.
- On the customer side, there is no upfront investment in servers or software licensing.
- On the provider side, costs are rather low, compared with conventional hosting of user applications.
Internet clouds offer four deployment modes: private, public, managed, and hybrid [11]. These modes have different security implications. The different SLAs imply that the security responsibility is shared among all the cloud providers, the cloud resource consumers, and the third-party cloud-enabled software providers. Advantages of cloud computing have been advocated by many IT experts, industry leaders, and computer science researchers. The following are reasons to adopt the cloud for upgraded Internet applications and web services:
1. Desired location in areas with protected space and higher energy efficiency
2. Sharing of peak-load capacity among a large pool of users, improving overall utilization
3. Separation of infrastructure maintenance duties from domain-specific application development
4. Significant reduction in cloud computing cost, compared with traditional computing paradigms
5. Cloud computing programming and application development
6. Service and data discovery and content/service distribution
7. Privacy, security, copyright, and reliability issues
8. Service agreements, business models, and pricing policies