DBMS Interview Questions

1. What is DBMS?
  • A Database Management System (DBMS) is software used to store, retrieve, and run queries on data.
  • A DBMS serves as an interface between end users and a database, allowing users to create, read, update, and delete data in the database.
  • In a DBMS, data is stored more securely than in a file-based system.
  • Using a DBMS, we can overcome many problems of file-based storage, such as data redundancy, data inconsistency, and poor access, and keep data more organized and understandable.
  • Some DBMS examples include MySQL, PostgreSQL, Microsoft Access, SQL Server, FileMaker, Oracle, dBASE, Clipper, and FoxPro.

2. What is a database?

  • A database is an organized collection of data stored and accessed electronically from a computer system.
  • Where databases are more complex, they are often developed using formal design and modeling techniques.
  • A database mostly consists of objects (tables), and tables consist of records and fields.
  • Fields are the basic units of data storage; each field contains information about a particular aspect or attribute of the entity described by the database.
  • A DBMS is used to extract data from the database in the form of queries.

The collection of a database and the DBMS software together is known as a database system.

3. What are the advantages of DBMS?

  • Improved data sharing
  • Improved data security
  • Improved data access
  • Improved decision making
  • Increased end-user productivity
  • Redundancy control
  • Restriction for unauthorized access
  • Provides multiple user interfaces
  • Provides backup and recovery
  • Enforces integrity constraints
  • Ensure data consistency
  • Easy data extraction and data processing due to the use of queries

4. What is a checkpoint in DBMS? When does a checkpoint occur in DBMS?

A checkpoint is a mechanism in which all the previous logs are removed from the system and stored permanently on the storage disk.

A checkpoint is the means by which the SQL Server Database Engine guarantees database consistency: no committed data is lost if the system crashes, because the dirty pages in memory that resulted from committed transactions are redone during recovery even if they had not yet been written to the database files.

5. What do you mean by transparent DBMS?

A transparent DBMS is a type of DBMS that keeps its physical structure hidden from users. The physical structure, or physical storage structure, refers to the memory manager of the DBMS, and it describes how the data is stored on disk.

6. What are the unary operations in Relational Algebra?

Unary operations are operations that take a single relation (operand) as input. The unary operations in relational algebra are SELECTION, PROJECTION, and RENAME.

7. What is RDBMS?

RDBMS stands for Relational Database Management System. An RDBMS is a type of Database Management System that stores data in the form of relations (tables). Relational databases are powerful because they require few assumptions about how data is related or how it will be extracted from the database. An RDBMS enables you to perform operations such as update, insert, delete, manipulate, and administer a relational database with minimal difficulty.

Examples of the most popular RDBMSs are MySQL, Oracle, IBM DB2, and Microsoft SQL Server.

8. How many types of database languages are there?

There are four types of database languages:

  • Data Definition Language (DDL): DDL defines the statements used to implement the database schema, e.g., CREATE, ALTER, DROP, TRUNCATE, RENAME. These commands define and modify the structure of the database, which is why they are known as Data Definition Language.
  • Data Manipulation Language (DML): DML is a set of statements that allows users to access and manipulate the data in the database, e.g., SELECT, UPDATE, INSERT, DELETE. These commands manipulate the stored data itself, which is why they are part of Data Manipulation Language.
  • Data Control Language (DCL): e.g., GRANT and REVOKE. These commands grant and revoke user access to the database, so they are part of Data Control Language.
  • Transaction Control Language (TCL): e.g., COMMIT, ROLLBACK, and SAVEPOINT. These commands manage transactions in the database; TCL is used to manage the changes made by DML statements.
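As a sketch, the DDL, DML, and TCL categories can be exercised with Python's built-in sqlite3 module (SQLite does not implement DCL commands such as GRANT/REVOKE, so that category is omitted; the table and data here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

# DML: insert and read data
cur.execute("INSERT INTO student (id, name) VALUES (1, 'Alice')")
row = cur.execute("SELECT name FROM student WHERE id = 1").fetchone()
print(row[0])  # Alice

# TCL: make the changes permanent
conn.commit()

# DDL again: drop the table
cur.execute("DROP TABLE student")
conn.close()
```

In a full RDBMS such as MySQL or SQL Server, DCL statements like GRANT SELECT ON student TO some_user would complete the picture.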

9. What do you understand by Data Model?

The data model is specified as a collection of conceptual tools for describing data, data relationships, data semantics, and constraints. These models are used to describe the relationships between entities and their attributes.

There are a number of data models:

  • Hierarchical data model
  • Network model
  • Relational model
  • Entity-Relationship model, and so on.

10. Define a Relation Schema and a Relation.

A relation schema is specified as a set of attributes. It is also known as a table schema, and it defines the name of the table. A relation schema is the blueprint that explains how the data is organized into tables. This blueprint contains no data.

A relation is specified as a set of tuples. A relation is a set of related attributes with identifying key attributes.

See this example:

Let r be a relation that contains the set of tuples (t1, t2, t3, ..., tn). Each tuple ti is an ordered list of n values, t = (v1, v2, ..., vn).

11. What is a degree of Relation?

The degree of a relation is the number of attributes in its relation schema. It should not be confused with cardinality: in ER modeling, cardinality is defined as the number of occurrences of one entity that are connected to the number of occurrences of another entity.

12. What is the Relationship?

A relationship is defined as an association among two or more entities. There are three types of relationships in DBMS:

One-to-one: One record of an object can be related to one record of another object.

One-to-many (many-to-one): One record of an object can be related to many records of another object, and vice versa.

Many-to-many: Many records of one object can be related to many records of another object.

13. What are the disadvantages of file processing systems?

  • Data redundancy and inconsistency.
  • Difficulty in accessing data.
  • Data isolation.
  • Data integrity problems.
  • Concurrent access is not possible.
  • Security problems.

14. What is data abstraction in DBMS?

Data abstraction in DBMS is the process of hiding irrelevant details from users. Because database systems are made up of complex data structures, abstraction makes user interaction with the database simpler and more accessible.

15. What are the three levels of data abstraction?

Following are three levels of data abstraction:

Physical level: It is the lowest level of abstraction. It describes how data are stored.

Logical level: It is the next higher level of abstraction. It describes what data are stored in the database and what the relationship among those data is.

View level: It is the highest level of data abstraction. It describes only part of the entire database.

16. What is Relational Algebra?

Relational algebra is a procedural query language containing a set of operations that take one or two relations as input and produce a new relation. Relational algebra is the basic set of operations for the relational model. The decisive point of relational algebra is that it is similar to ordinary algebra, which operates on numbers.

There are a few fundamental operations of relational algebra:

  • select
  • project
  • set difference
  • union
  • rename, etc.
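Two of these operations, select and project, can be sketched in Python over a relation represented as a set of tuples (the employee data and attribute positions are made up for illustration):

```python
# A relation as a set of tuples; positions stand in for attribute names.
# Hypothetical attributes: (id, name, dept).
employee = {
    (1, "Alice", "HR"),
    (2, "Bob", "IT"),
    (3, "Carol", "IT"),
}

def select(relation, predicate):
    """SELECT (sigma): keep only the tuples satisfying the predicate."""
    return {t for t in relation if predicate(t)}

def project(relation, indexes):
    """PROJECT (pi): keep only the listed columns, dropping duplicates."""
    return {tuple(t[i] for i in indexes) for t in relation}

it_staff = select(employee, lambda t: t[2] == "IT")
names = project(it_staff, [1])
print(sorted(names))  # [('Bob',), ('Carol',)]
```

Because relations are sets, project naturally eliminates duplicate tuples, just as in relational algebra.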

17. What do you understand by query optimization?

Query optimization is the process of choosing an efficient execution plan for evaluating a query, i.e., the plan with the least estimated cost.

18. What is normalization?

Normalization is the process of analysing the given relation schemas according to their functional dependencies. It is used to minimize redundancy and to minimize insertion, deletion, and update anomalies. Normalization is considered an essential process, as it is used to avoid data redundancy and insertion, updation, and deletion anomalies.

The most commonly used normal forms are:

  • First Normal Form(1NF)
  • Second Normal Form(2NF)
  • Third Normal Form(3NF)
  • Boyce & Codd Normal Form(BCNF)
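A minimal sketch of what normalization achieves: assuming a hypothetical functional dependency dept -> head, decomposing a redundant table stores each fact only once, so updating a department head cannot create inconsistency.

```python
# Redundant rows: the department head is repeated for every student.
# Hypothetical columns: (student, dept, head).
rows = [
    ("Alice", "CS", "Dr. Smith"),
    ("Bob",   "CS", "Dr. Smith"),
    ("Carol", "EE", "Dr. Jones"),
]

# Decompose on the FD dept -> head: store the head once per department.
student = {(name, dept) for name, dept, _ in rows}
department = {(dept, head) for _, dept, head in rows}

print(sorted(department))  # [('CS', 'Dr. Smith'), ('EE', 'Dr. Jones')]
```

Joining `student` and `department` on dept reconstructs the original rows, which is why such a decomposition is lossless.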

19. What is Denormalization?

Denormalization is the process of adding redundant data to a normalized database in order to boost performance. Denormalization is a database optimization technique, used to avoid complex and costly joins.

20. What is functional Dependency?

A functional dependency is a database constraint describing the relationship that occurs when one attribute in a relation uniquely determines another attribute. It is written as A -> B, which means B is functionally dependent on A.
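Whether a functional dependency holds in a given relation can be checked mechanically: tuples that agree on the left-hand side must agree on the right-hand side. A small Python sketch (the toy relation and attribute positions are assumptions):

```python
def holds(relation, lhs, rhs):
    """Return True if the FD lhs -> rhs holds: tuples with equal
    lhs column values must also have equal rhs column values."""
    seen = {}
    for t in relation:
        key = tuple(t[i] for i in lhs)
        val = tuple(t[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical attributes: (emp_id, name, dept)
r = {(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Alice", "IT")}
print(holds(r, [0], [1]))  # True: emp_id -> name
print(holds(r, [1], [2]))  # False: two Alices work in different depts
```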

21. What is the E-R model?

E-R model is a short name for the Entity-Relationship model. This model is based on the real world. It contains necessary objects (known as entities) and the relationship among these objects. Here the primary objects are the entity, attribute of that entity, relationship set, an attribute of that relationship set can be mapped in the form of E-R diagram.

In an E-R diagram, entities are represented by rectangles, relationships by diamonds, and attributes (the characteristics of entities) by ellipses; straight lines connect attributes to entities and entities to relationship sets.

An entity is described in the database by a set of attributes. An entity can be a real-world object which physically exists in this world.

 

22. What is Weak Entity set?

An entity set that doesn't have sufficient attributes to form a primary key is referred to as a weak entity set. A member of a weak entity set is known as a subordinate entity. A weak entity set does not have a primary key, but we need a means of differentiating among all the entries in the entity set that depend on one particular strong entity set.

23. What is an attribute?

An attribute refers to a database component that describes a property of an entity. An attribute can be defined as a characteristic of the entity. Entities can be uniquely identified using their attributes. In a table, the columns represent attributes, and each row holds the attribute values of one entity instance.

For example: If a student is an entity in the table then age will be the attribute of that student. 

24. What are the integrity rules in DBMS?

Data integrity is one significant aspect of maintaining a database. Data integrity is enforced in the database system by imposing a series of rules, known as the integrity rules.

There are two integrity rules in DBMS:

Entity integrity: It specifies that "a primary key cannot have a NULL value."

Referential integrity: It specifies that "a foreign key must either be NULL or match the primary key value of another relation."
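Both rules can be demonstrated with Python's sqlite3 module (the dept/emp tables are made up; note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE emp (
    emp_id  INTEGER PRIMARY KEY,              -- entity integrity: PK value
    dept_id INTEGER REFERENCES dept(dept_id)  -- referential integrity: FK or NULL
)""")

conn.execute("INSERT INTO dept VALUES (10, 'HR')")
conn.execute("INSERT INTO emp VALUES (1, 10)")    # OK: FK matches a dept row
conn.execute("INSERT INTO emp VALUES (2, NULL)")  # OK: FK may be NULL

violated = False
try:
    conn.execute("INSERT INTO emp VALUES (3, 99)")  # no dept 99 exists
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True: referential integrity rejected the row
```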

25. What is System R? What are its two major subsystems?

System R was designed and developed from 1974 to 1979 at the IBM San Jose Research Centre. System R is the first implementation of SQL, the standard relational data query language, and it was also the first system to demonstrate that an RDBMS could provide good transaction processing performance. It is a prototype that was built to show that it is possible to construct a relational system usable in a real-life environment to solve real-life problems.

Following are two major subsystems of System R:

  • Research Storage System
  • Relational Data System

SQL Introduction

Basics of Data

Data is a collection of a distinct small unit of information. For example, your name, age, height, weight, etc. are some data related to you. A picture, image, file, pdf, etc. can also be considered data.

Database

  • A database is an organized, systematic collection of data so that it can be easily accessed and managed.
  • They support electronic storage and manipulation of data. 
  • Databases make data management easy.
  • You can organize data into tables, rows, columns, and index it to make it easier to find relevant information.
  • The main purpose of a database is to handle a large amount of information by storing, retrieving, and managing data.

Example: An online telephone directory uses a database to store data of people, phone numbers, and other contact details. Your electricity service provider uses a database to manage billing, client-related issues, handle fault data, etc.


Example: Facebook. It needs to store, manipulate, and present data related to members, their friends, member activities, messages, advertisements, and a lot more. We can provide a countless number of examples for the usage of databases.


  • There are many databases available like MySQL, Sybase, Oracle, MongoDB, Informix, PostgreSQL, SQL Server, etc.
  • Modern databases are managed by the database management system (DBMS).
  • SQL, or Structured Query Language, is used to operate on the data stored in a database. SQL is based on relational algebra and tuple relational calculus.

SQL

  • SQL stands for Structured Query Language. 
  • It is used for storing and managing data in a relational database management system (RDBMS).
  • It is a standard language for Relational Database systems. 
  • It enables a user to create, read, update and delete relational databases and tables.
  • All the RDBMS like MySQL, Informix, Oracle, MS Access and SQL Server use SQL as their standard database language.
  • SQL allows users to query the database in several ways, using English-like statements.

Characteristics of SQL

  • SQL is easy to learn.
  • SQL is used to access data from relational database management systems.
  • SQL can execute queries against the database.
  • SQL is used to describe the data.
  • SQL is used to define the data in the database and manipulate it when needed.
  • SQL is used to create and drop the database and table.
  • SQL is used to create a view, stored procedure, function in a database.
  • SQL allows users to set permissions on tables, procedures, and views.
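For instance, the view-creation capability in the list above can be sketched with Python's sqlite3 module (the orders table and threshold are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 25.0), (3, 40.0)])

# A view: a stored query that behaves like a read-only virtual table.
conn.execute("CREATE VIEW big_orders AS SELECT * FROM orders WHERE amount > 20")
rows = conn.execute("SELECT id FROM big_orders ORDER BY id").fetchall()
print(rows)  # [(2,), (3,)]
```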

Advantages of SQL

High speed

Using the SQL queries, the user can quickly and efficiently retrieve a large number of records from a database.

No coding needed

In the standard SQL, it is very easy to manage the database system. It doesn't require a substantial amount of code to manage the database system.

Well defined standards

SQL databases use long-established standards that have been adopted by ISO and ANSI.

Portability

SQL can be used in laptops, PCs, servers and even some mobile phones.

Interactive language

SQL is a domain-specific language used to communicate with the database. It can also return answers to complex questions in seconds.

Multiple data views

Using the SQL language, the users can make different views of the database structure.


Firewalls & Types

 Most organizations rely on firewalls for their network security needs. A firewall can be defined as a network security system that allows cybersecurity professionals to monitor and control network traffic. In other words, a firewall sets the boundary between the internal and external networks. There are two main types of firewalls:

  • Network-based firewalls: They are often positioned on the LANs, intranets, or WANs of gateway computers (hardware firewalls).

  • Host-based firewalls: They are implemented on the network host itself to protect its network traffic. Host-based firewalls can be a part of the operating system or an agent application offering an additional layer of security.


Packet filtering firewall


  • Packet filtering firewall is a network security technique that is used to control data flow to and from a network using predefined rules.

  • The packet filtering firewall checks access control lists (ACLs) to separate packets depending upon the upper-layer protocol ID, source and destination port numbers, source and destination IP addresses, and packet transmission route.

  • The firewall checks the information contained in the IP, TCP, or UDP header against the ACL and then decides whether to accept or drop the packet.

  • Additionally, it has a default policy, set by the user, which decides whether packets that match no ACL rule are allowed to pass or dropped.

  • Packet filtering is usually an effective defence against attacks from computers outside a local area network (LAN). As most routing devices have integrated filtering capabilities, packet filtering is considered a standard and cost-effective means of security.


Working

The information passes through a network in the form of small pieces called packets, which travel independently across IP networks. These small packets travel through a node only if they match with predefined filtering rules otherwise get dropped. Hence, the filtering rules that are defined by the network layer firewalls in a packet filtering firewall prove to be highly efficient in providing security mechanisms. 


Packet filtering controls (allows or drops) packet or data transfer based on the following criteria:

  • The address the packet is coming from.

  • The address the packet is going to.

  • The application protocol or rules set for transferring the data.
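These checks can be sketched as a toy rule table in Python (the rule format, addresses, and default-deny policy are illustrative assumptions, not a real firewall syntax):

```python
# A minimal ACL-style packet filter: first matching rule wins,
# "*" is a wildcard, and unmatched packets fall through to the default.
ACL = [
    ("allow", "192.168.1.10", "*", 80),  # this host may reach any web server
    ("deny",  "*",            "*", 23),  # block telnet everywhere
]
DEFAULT = "deny"

def filter_packet(src, dst, port):
    for action, r_src, r_dst, r_port in ACL:
        if (r_src in ("*", src) and r_dst in ("*", dst)
                and r_port in ("*", port)):
            return action
    return DEFAULT

print(filter_packet("192.168.1.10", "93.184.216.34", 80))  # allow
print(filter_packet("192.168.1.99", "93.184.216.34", 23))  # deny
```

Note that the decision uses only header fields (addresses and ports), never packet contents, which is exactly the limitation discussed in the disadvantages below.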



A packet-filtering firewall can be distinguished into the following types based on the usage of rules:


  • Static packet filtering firewall: In this type of firewall, rules are established manually, and the connection between the internal and external networks is left open or closed at all times until changed manually.

  • Dynamic packet filtering firewall: This type of firewall is a more intelligent way of filtering as rules may be changed dynamically depending upon the conditions, and ports are open only for a specific time otherwise remain closed.

  • Stateful packet filtering firewall: It maintains a state table to track secure connections, and packets pass through in the sequence approved by the filter rules.


Advantages

  • Packet filtering is an efficient defence system against intrusions from computers or networks outside a local area network (LAN). 

  • It is also a standard and cost-effective means of protection as most routing devices themselves possess integrated filtering capabilities, so there is no need for setting a new firewall device.

  • Need only one router: The key advantage of using packet filtering is that it requires the use of only one screening router to protect an entire network. 

  • Highly efficient and fast: The packet filtering router works very quickly and effectively, accepting and rejecting packets based on the destination and source ports and addresses. Other firewall techniques are considerably more time-consuming.

  • Transparent to users: Packet filtering works independently without any need for user knowledge or cooperation. Users won’t get to know about the transmission of packets until there is something that got rejected. On the contrary, other firewalls require custom software, the configuration of client machines, or specific training or procedures for users.

  • Built-in packet filtering in routers: Packet filtering capacities are inbuilt in widely used hardware and software routing products. Additionally, now most websites possess packet filtering techniques available in their routers themselves, which also makes this technique the most inexpensive one. 


Disadvantages

  • Filtration based on IP address or Port Information: The biggest disadvantage of packet filtering is that it works on the authentication of IP address and port number and is not based on information like context or application.

  • Packet filtering is stateless: Another big disadvantage of packet filtering is that it does not remember any past invasions or filtered packets. It tests every packet in isolation and is stateless which allows hackers to break the firewall easily. 

  • No safety from address spoofing: The packet filtering does not protect from IP spoofing, in which hackers can insert fake IP addresses in packets to intrude the network.

  • Not a perfect option for all networks: Implementing packet filtering firewalls with highly complex filtering requirements becomes difficult or highly time-consuming, and managing and configuring ACLs can get difficult.


PACKET FILTERING FIREWALL EXAMPLE

Packet filters act on the source and destination IP and port addresses that are present in each TCP/IP packet. You can set rules allowing access to only familiar and established IP addresses and denying access to all unknown or unrecognized IP addresses. 


For example, if you set rules denying access to port 80 to outsiders, you would block off all outside access to the HTTP server as most HTTP servers run on port 80. Alternatively, you can set packet filtering firewall rules permitting packets designated for your mail or web server and rejecting all other packets.

Despite their weaknesses, packet filter firewalls are widely used because they are fast and inexpensive. A packet filter controls the movement of information/packets according to a set of rules defined by the user and protects the network from unwanted intrusions or attacks. Thus, it acts as a powerful security tool and provides a good level of security to the network.



Proxy Firewall (Application Level Gateways)


  • A proxy firewall is a network security system that protects network resources by filtering messages at the application layer. 

  • A proxy firewall is also a proxy server, but not all proxy servers are proxy firewalls. 

  • A proxy server acts as an intermediary between clients and servers. It can cache web pages to reduce bandwidth demands, compress data, filter traffic and detect viruses. A proxy server can also be used to hide user information or to connect to services that would be blocked. 

  • On the other hand, a proxy firewall inspects all network traffic to detect and protect against potential threats. It can also detect network intrusion and enforce security policies.


Proxy firewall vs. traditional firewall

A proxy firewall acts as a gateway between internal users and the internet. It can be installed on an organization's network or on a remote server that is accessible by the internal network. It provides security to the internal network by monitoring and blocking traffic that is transmitted to and from the internet.


In contrast, a traditional firewall acts as a gateway between two networks. By blocking unwanted external traffic, a traditional firewall protects the computers and networks behind it from unauthorized access and attacks.


Filtering at the application level

  • Proxy firewalls filter traffic at the application layer, which is Layer 7 of the Open Systems Interconnection model. 

  • The technology is similar to traditional packet filtering firewalls, but proxy firewalls add an extra level of protection. 

  • A proxy firewall has its own Internet Protocol (IP) address, so an outside network connection will never receive packets directly from the sending network.

  • Proxy firewalls are often used as the first layer of defence in a secure web or application infrastructure. In this regard, they protect the network from external threats, while ensuring that internal web services and applications can be used safely with no impact on performance.


How are proxy firewalls used?

Proxy firewalls protect critical systems from unauthorized access. They act as a barrier between authorized users and unauthorized users. Proxy firewalls can be deployed within a hardware device, such as a router or firewall. They can also help accomplish the following:

  • ensure that only authorized users have access to the resources of a computer network;

  • filter out unwanted messages and packets on an internet network; and

  • protect against network intrusion and espionage.


Proxy firewalls are also used to restrict access to sensitive sites or sites that are only relevant to specific users. For example, a proxy firewall can be used to prevent employees from accessing Facebook or Twitter during working hours, while still allowing them access to their accounts.

Proxies can be installed in the network itself -- between the internet and the internal network -- or on each computer.


Advantages

  • Security. Proxy firewalls are the most secure type of firewall because they prevent direct network contact with other systems. As previously mentioned, because it has its own IP address, the proxy firewall keeps external network connections from receiving network packets directly from the sending network.

  • Logging capabilities. Proxy firewalls can examine the entire network packet, rather than just the network address and port number. This capability provides extensive logging capabilities -- a valuable resource for security administrators when dealing with security incidents.

  • Threat assessment. Marcus J. Ranum is given credit for coming up with the idea of a proxy firewall. He described it as an easier way for programmers to assess the threat levels of application protocols and deploy error and attack detection and validity checking.

  • Control and granularity. Another advantage of proxy firewalls is they offer more control and granularity than other types of firewalls. This is because they can be configured to apply levels of security to individual users and groups and contain access logs for detailed reports on user activities.


Disadvantages


  • Challenging to use. The main problem with proxy firewalls is that they are difficult to use. Many users disable them when they become frustrated that the applications they are using are unable to access the internet.

  • Slow performance. Proxy firewalls can slow internet connections. Because they operate as a third party between the internet and the computer or device in use, they establish an additional connection for each outgoing and incoming packet. As a result, the firewall can become a bottleneck and slow performance.

  • Single point of failure. For the same reason that they slow performance, proxy firewalls can also become a single point of failure in the system.


EXAMPLES OF A PROXY FIREWALL’S WORK


All types of proxy firewalls monitor traffic for layer 7 protocols. This includes protocols like Simple Mail Transfer Protocol (SMTP), HyperText Transfer Protocol (HTTP), File Transfer Protocol (FTP), etc. The implementation of proxy servers is many times made in the form of dual-homed bastion hosts that run sets of proxy agents. A bastion host is a system that is expected to come under a direct network attack, probably and in most cases from the internet. More examples of protocols for which proxy firewalls work are the Internet Control Message Protocol (ICMP), Domain Name System (DNS), etc.


More often than not, there is only one computer in a proxy firewall network that has a direct connection to the internet. The other computers in this network setup have to access the internet by using that main computer in the form of a gateway. The rest of the process is pretty simple and has been explained before. A working proxy firewall example can be given at this stage. The process would be that the request of a client is received by the gateway inside the firewall, the request is then sent to the remote server outside the firewall. After this, the server’s response is read and sent back to the client on whether or not access has been granted.



Stateful Inspection Firewalls


  • Technology that controls the flow of traffic between two or more networks. 

  • Stateful inspection firewalls, in addition to verifying and keeping track of established connections, also perform packet inspection to provide better, more comprehensive security. 

  • They work by creating a state table with source IP, destination IP, source port and destination port once a connection is established. 

  • Based on this information, they create their own rules dynamically to allow expected incoming network traffic, instead of relying on a hardcoded set of rules.

  • They conveniently drop data packets that do not belong to a verified active connection.

  • Stateful inspection firewalls check for legitimate connections as well as source and destination IPs to determine which data packets can pass through. Although these extra checks provide advanced security, they consume a lot of system resources and can slow down traffic considerably. Hence, they are prone to DDoS (distributed denial-of-service attacks).

  • Stateful packet inspection is also known as dynamic packet filtering and it aims to provide an additional layer of network security. 


  • This is sometimes called session-level protection because these firewalls keep state information for each network session and make allow/deny decisions based on a session state table. Stateful inspection firewalls track the state of sessions and drop packets that are not part of a session allowed by a predefined security policy.
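The state-table idea can be sketched in a few lines of Python (a deliberately simplified model: real stateful firewalls also track protocols, TCP flags, sequence numbers, and session timeouts):

```python
# Outbound connections register a session; inbound packets pass
# only if they are replies to a session already in the table.
state_table = set()

def outbound(src, sport, dst, dport):
    """Record an established outbound session."""
    state_table.add((src, sport, dst, dport))

def inbound_allowed(src, sport, dst, dport):
    """An inbound packet is allowed only if it reverses a known session."""
    return (dst, dport, src, sport) in state_table

outbound("10.0.0.5", 50000, "93.184.216.34", 443)
print(inbound_allowed("93.184.216.34", 443, "10.0.0.5", 50000))  # True
print(inbound_allowed("203.0.113.7", 443, "10.0.0.5", 50000))    # False
```

A stateless packet filter would have to decide on the second packet using header fields alone; the state table is what lets the stateful firewall reject it as not belonging to any session.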


What is the benefit of implementing stateful inspection?

Before stateful inspection became mainstream, similar technology called static packet filtering was in use. This older alternative only checks the headers of the packets to determine whether they should be allowed through the firewall. As a result, a hacker can simply indicate “reply” in the header to extract information from the network. On the contrary, stateful inspection aims to carry out a more sophisticated investigation. That is why it analyzes the application layer of the packets. A dynamic packet filter like stateful inspection can offer a better security posture for networks through recording the session information like port numbers or IP addresses.

In other words, stateful inspection is better at keeping the intruders away from your network since it uses a more refined technology.



Hadoop Input/output formats & Setting up a Hadoop cluster

Hadoop InputFormat

  • InputFormat defines how the input files are split up and read in Hadoop.
  • The Hadoop InputFormat is the first component in MapReduce; it is responsible for creating the input splits and dividing them into records.
  • Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS.
  • Although the format of these files is arbitrary, line-based log files and binary formats can be used.
  • Using an InputFormat, we define how these input files are split and read.
  • The InputFormat class is one of the fundamental classes in the Hadoop MapReduce framework.

Types of InputFormat in MapReduce

1. FileInputFormat in Hadoop

It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies the input directory where the data files are located. When we start a Hadoop job, FileInputFormat is provided with a path containing the files to read. FileInputFormat reads all the files and divides them into one or more InputSplits.

2. TextInputFormat

It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files.

  • Key – It is the byte offset of the beginning of the line within the file, so it is unique when combined with the file name.
  • Value – It is the contents of the line, excluding line terminators.

3. KeyValueTextInputFormat

It is similar to TextInputFormat in that it also treats each line of input as a separate record. But while TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line into a key and a value at a tab character ('\t'). Here the key is everything up to the tab character, while the value is the remaining part of the line after the tab character.
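The tab-splitting behaviour can be sketched in Python (the sample line is made up; in Hadoop itself this happens in Java inside the record reader):

```python
def key_value_record(line, separator="\t"):
    """Split a line into (key, value) at the first separator, as
    KeyValueTextInputFormat does; with no separator the whole line
    becomes the key and the value is empty."""
    key, _, value = line.partition(separator)
    return (key, value)

print(key_value_record("line1\tThis is the value"))
# ('line1', 'This is the value')
```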

4. SequenceFileInputFormat

Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Here both key and value are user-defined.

5. SequenceFileAsTextInputFormat

Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which converts the sequence file's keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This InputFormat makes sequence files suitable input for streaming.

6. SequenceFileAsBinaryInputFormat

Hadoop SequenceFileAsBinaryInputFormat is a SequenceFileInputFormat using which we can extract the sequence file's keys and values as opaque binary objects.

7. NLineInputFormat

Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, depending on the size of the split and the length of the lines. If we want our mappers to receive a fixed number of lines of input, we use NLineInputFormat.
N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives exactly one line of input. If N=2, each split contains two lines: one mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.
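The fixed-size grouping can be sketched in Python (an illustration of the splitting behaviour, not the Hadoop class):

```python
# Group input lines into splits of exactly N lines each, so every
# mapper receives a fixed number of lines (the last split may be short).
def n_line_splits(lines, n=1):
    return [lines[i:i + n] for i in range(0, len(lines), n)]

lines = ["l1", "l2", "l3", "l4", "l5"]
print(n_line_splits(lines, n=2))
# → [['l1', 'l2'], ['l3', 'l4'], ['l5']]
```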

8. DBInputFormat

Hadoop DBInputFormat is an InputFormat that reads data from a relational database using JDBC. As it doesn't have partitioning capabilities, we need to be careful not to swamp the database by reading from it with too many mappers. So it is best for loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the key is a LongWritable while the value is a DBWritable.

Hadoop Output Format

As we know, the Reducer takes as input a set of intermediate key-value pairs produced by the mapper and runs a reducer function on them to generate output that is again zero or more key-value pairs.

RecordWriter writes these output key-value pairs from the Reducer phase to output files.

Types of Hadoop Output Formats

i. TextOutputFormat

The default Hadoop reducer Output Format in MapReduce is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, since TextOutputFormat turns them into strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be changed using the mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat can be used for reading these output text files back, since it breaks lines into key-value pairs based on a configurable separator.
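The write/read round trip can be sketched in Python (a simulation of the formats' behaviour; the helper names are illustrative, not Hadoop API):

```python
# TextOutputFormat-style writing: one "key<sep>value" line per pair.
def write_text_output(pairs, sep="\t"):
    return "".join(f"{k}{sep}{v}\n" for k, v in pairs)

# KeyValueTextInputFormat-style reading: split each line at the first
# occurrence of the separator.
def read_key_values(text, sep="\t"):
    return [tuple(line.split(sep, 1)) for line in text.splitlines()]

out = write_text_output([("apple", 3), ("pear", 5)])
assert read_key_values(out) == [("apple", "3"), ("pear", "5")]
```

Note that the values come back as strings: the text formats keep no type information, which is one reason sequence files are preferred between chained jobs.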

ii. SequenceFileOutputFormat

It is an Output Format which writes sequence files for its output. It is a useful intermediate format between MapReduce jobs, since it rapidly serializes arbitrary data types to the file, and the corresponding SequenceFileInputFormat will deserialize the file into the same types and present the data to the next mapper in the same manner as it was emitted by the previous reducer. Sequence files are compact and readily compressible; compression is controlled by the static methods on SequenceFileOutputFormat.

iii. SequenceFileAsBinaryOutputFormat

It is another form of SequenceFileOutputFormat which writes keys and values to a sequence file in binary format.

iv. MapFileOutputFormat

It is another form of FileOutputFormat in Hadoop, which is used to write output as map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.

v. MultipleOutputs

It allows writing data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.

vi. LazyOutputFormat

Sometimes FileOutputFormat will create output files even if they are empty. LazyOutputFormat is a wrapper OutputFormat which ensures that the output file is created only when a record is emitted for a given partition.

vii. DBOutputFormat

DBOutputFormat in Hadoop is an Output Format for writing to relational databases and to HBase. It sends the reduce output to a SQL table. It accepts key-value pairs, where the key has a type extending DBWritable. The returned RecordWriter writes only the key to the database with a batch SQL query.
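The "flush reduce output to a table with a batched SQL statement" idea can be illustrated with SQLite via Python's sqlite3 module (an analogy only; the table name and helper are hypothetical, and the real DBOutputFormat goes through JDBC):

```python
import sqlite3

# Write reduce output (key-value pairs) to a SQL table in one batch,
# mimicking what a DBOutputFormat-style RecordWriter does on close.
def write_to_db(conn, rows):
    conn.execute("CREATE TABLE IF NOT EXISTS word_count (word TEXT, n INTEGER)")
    conn.executemany("INSERT INTO word_count VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
write_to_db(conn, [("hadoop", 4), ("hive", 2)])
```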

 REF

https://data-flair.training/blogs/hadoop-inputformat/

https://data-flair.training/blogs/hadoop-outputformat/

 https://www.edureka.co/blog/hadoop-clusters

Artificial Bee Colony (ABC) Optimization

The artificial bee colony algorithm (ABC) is an optimization algorithm based on the intelligent foraging behaviour of honey bee swarm, proposed by Derviş Karaboğa (Erciyes University) in 2005.

Particle swarm optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. ABC belongs to the same family of swarm intelligence algorithms as PSO.

Behaviour Of Real Bees

  • Honey bees live in well-organized colonies and do not require hibernation. 
  • They are best known for their production of honey, which they store in wax combs inside nests. 
  • Honey bees are generally active during spring, when they go in search of plants from which to collect pollen and nectar. From these two ingredients, they create honey, which humans have harvested for hundreds of years.
  • Honey bees are social creatures and live within colonies with a queen, thousands of workers and a few male drones. 
  • Workers make these nests from wax, which they secrete from their abdominal glands. 
  • Honey bees are very adaptable. While honey bees forage for food in groups, a colony can survive without foraging for several years by living on food reserves and huddling in large, compacted masses during winter seasons. 
  • Like some insects, honey bees behave defensively when intruders are near, guarding the entrance to their nests. However, honey bees are able to sting only once. Because stingers contain barbs and are attached to the worker's intestines, they detach from the stinging bee's body after attacking a victim. While a honey bee will die soon after transferring its venom, pheromones secreted during the attack will alarm and stimulate other worker bees to attack, as well.

https://www.orkin.com/pests/stinging-pests/bees/honey-bees/honey-bee-behavior


ABC Algorithm

Working

  • ABC was first proposed by Karaboga to solve numerical optimization problems.
  • ABC consists of employed and unemployed foragers, and food sources. The artificial bees fall into three groups: employed bees, onlookers, and scouts. 
  • The employed bees comprise the first half of the colony, whereas the second half consists of the onlookers.
  • The employed bees exploit the food positions, while the onlooker bees wait for information from the employed bees about the nectar amounts of the food positions. 
  • The onlooker bees select food positions using the employed bees' information and exploit the selected food positions. 
  • Finally, the scout bees find new random food positions. 



A bee waiting on the dance area for making decision to choose a food source, is called an onlooker and a bee going to the food source visited by itself previously is named an employed bee. A bee carrying out random search is called a scout. In the ABC algorithm, first half of the colony consists of employed artificial bees and the second half constitutes the onlookers. For every food source, there is only one employed bee. In other words, the number of employed bees is equal to the number of food sources around the hive. The employed bee whose food source is exhausted by the employed and onlooker bees becomes a scout. The main steps of the algorithm are given below: 


• Initialize. 

• REPEAT. 

(a) Place the employed bees on the food sources in the memory; 

(b) Place the onlooker bees on the food sources in the memory; 

(c) Send the scouts to the search area for discovering new food sources. 

• UNTIL (requirements are met).


At the initialization stage, a set of food source positions are randomly selected by the bees and their nectar amounts are determined. Then, these bees come into the hive and share the nectar information of the sources with the bees waiting on the dance area within the hive.

At the second stage, after sharing the information, every employed bee goes to the food source area visited by herself at the previous cycle since that food source exists in her memory, and then chooses a new food source by means of visual information in the neighbourhood of the present one.

At the third stage, an onlooker prefers a food source area depending on the nectar information distributed by the employed bees on the dance area. As the nectar amount of a food source increases, the probability with which that food source is chosen by an onlooker increases, too. Hence, the dance of employed bees carrying higher nectar recruits the onlookers for the food source areas with higher nectar amount. After arriving at the selected area, she chooses a new food source in the neighbourhood of the one in the memory depending on visual information. Visual information is based on the comparison of food source positions.

When the nectar of a food source is abandoned by the bees, a new food source is randomly determined by a scout bee and replaced with the abandoned one. In our model, at each cycle at most one scout goes outside for searching a new food source, and the number of employed and onlooker bees were equal.


In the ABC algorithm, the position of a food source represents a possible solution of the optimization problem and the nectar amount of a food source corresponds to the quality (fitness) of the associated solution. The number of the employed bees or the onlooker bees is equal to the number of solutions in the population.


Phases of ABC

 It generally consists of four phases.

1) Initialization of ABC.

  • Determine the number of artificial bees: 50% are employed bees and 50% are onlooker bees.
  • Generate random initial candidate solutions for the employed bees.
  • Determine the limit value.

2) Employed bee phase

  • For each employed bee, generate a new candidate solution.
  • Calculate the fitness value of the new solution.
  • If the fitness of the new candidate solution is better than that of the existing solution, replace the older solution.
  • Calculate the selection probability for each individual.

3) Onlooker bee phase.

For all onlooker bees:

  • Select an employed bee using roulette wheel selection.
  • Produce a new candidate solution.
  • Compute the fitness of the individual.
  • If the fitness of the new candidate solution is better than that of the existing solution, replace the older solution.

4) Scout bee phase

If any food source is exhausted, replace it with a randomly generated solution found by a scout, and memorize the best solution so far.

Until (stopping criteria are met).
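The four phases above can be put together in a short Python sketch. This is a minimal illustrative implementation (parameter names and the sphere objective are assumptions, and solutions are not clamped to the bounds after initialization), not a reference one:

```python
import random

# Minimal ABC sketch: minimize f over `dim` dimensions with `sn` food
# sources, abandonment threshold `limit`, for a fixed number of cycles.
def abc_minimize(f, dim=2, sn=10, limit=20, cycles=200, lo=-5.0, hi=5.0):
    foods = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(sn)]
    trials = [0] * sn
    best = min(foods, key=f)

    def fitness(x):                      # equation (7)
        fx = f(x)
        return 1.0 / (1.0 + fx) if fx >= 0 else 1.0 + abs(fx)

    def neighbour(m):                    # equation (6)
        k = random.choice([j for j in range(sn) if j != m])
        i = random.randrange(dim)
        v = foods[m][:]
        v[i] += random.uniform(-1, 1) * (foods[m][i] - foods[k][i])
        return v

    def greedy(m, v):                    # keep the fitter of the two
        if fitness(v) > fitness(foods[m]):
            foods[m], trials[m] = v, 0
        else:
            trials[m] += 1

    for _ in range(cycles):
        for m in range(sn):              # employed bee phase
            greedy(m, neighbour(m))
        fits = [fitness(x) for x in foods]
        total = sum(fits)
        for _ in range(sn):              # onlooker bee phase (roulette wheel)
            r, acc = random.uniform(0, total), 0.0
            for m in range(sn):
                acc += fits[m]
                if acc >= r:
                    break
            greedy(m, neighbour(m))
        worst = max(range(sn), key=lambda m: trials[m])   # scout bee phase
        if trials[worst] > limit:
            foods[worst] = [random.uniform(lo, hi) for _ in range(dim)]
            trials[worst] = 0
        best = min(foods + [best], key=f)  # memorize the best solution
    return best

best = abc_minimize(lambda x: sum(v * v for v in x))
```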


Two variants of ABC, known as ABCgBest and ABCgBestDist, were proposed. In the ABCgBest variant, the global best equation from PSO was incorporated, whereas global best and distance-based reference selection were incorporated into ABCgBestDist. The positions of the artificial bees in both variants are updated and controlled based on the selection of reference locations.

The major advantages which ABC holds over other optimization algorithms include its:

• Simplicity, flexibility and robustness

• Use of fewer control parameters compared to many other search techniques

• Ease of hybridization with other optimization algorithms.

• Ability to handle objective costs with a stochastic nature

• Ease of implementation with basic mathematical and logical operations.


Onlooker bees are the bees which reside in the beehive and search the food sources on the basis of the information collected by the employed bees.

 

Employed Bees Phase

Employed bees search for new food sources (υ_m) having more nectar within the neighbourhood of the food source (x_m) in their memory. They find a neighbour food source and then evaluate its profitability (fitness). For example, they can determine a neighbour food source υ_m using the formula given by equation (6):

υ_mi = x_mi + ϕ_mi (x_mi - x_ki)        (6)


where x_k is a randomly selected food source, i is a randomly chosen parameter index, and ϕ_mi is a random number within the range [-a, a]. After producing the new food source υ_m, its fitness is calculated and a greedy selection is applied between υ_m and x_m.

The fitness value of the solution, fit_m(x_m), might be calculated for minimization problems using the following formula (7):

fit_m(x_m) = 1 / (1 + f_m(x_m))      if f_m(x_m) ≥ 0
fit_m(x_m) = 1 + |f_m(x_m)|          if f_m(x_m) < 0        (7)


where f_m(x_m) is the objective function value of solution x_m.
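A direct Python sketch of this piecewise fitness (taking the objective value f_m(x_m) as input):

```python
# Piecewise fitness from equation (7): small non-negative objective
# values map to fitness near 1; negative values map above 1.
def fitness(fx):
    return 1.0 / (1.0 + fx) if fx >= 0 else 1.0 + abs(fx)

assert fitness(0) == 1.0
assert fitness(3) == 0.25
assert fitness(-2) == 3.0
```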

Onlooker Bees Phase

Unemployed bees consist of two groups of bees: onlooker bees and scouts. Employed bees share their food source information with onlooker bees waiting in the hive and then onlooker bees probabilistically choose their food sources depending on this information. In ABC, an onlooker bee chooses a food source depending on the probability values calculated using the fitness values provided by employed bees. For this purpose, a fitness based selection technique can be used, such as the roulette wheel selection method (Goldberg, 1989).

The probability value p_m with which x_m is chosen by an onlooker bee can be calculated using the expression given in equation (8):

p_m = fit_m(x_m) / Σ_{m=1..SN} fit_m(x_m)        (8)


After a food source x_m for an onlooker bee is probabilistically chosen, a neighbourhood source υ_m is determined by using equation (6), and its fitness value is computed. As in the employed bees phase, a greedy selection is applied between υ_m and x_m. Hence, more onlookers are recruited to richer sources, and positive feedback behaviour appears.
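Equation (8) plus the roulette wheel selection it feeds can be sketched as follows (function names are illustrative):

```python
import random

# Normalize fitnesses into selection probabilities (equation (8)).
def probabilities(fits):
    total = sum(fits)
    return [f / total for f in fits]

# Roulette wheel: draw r in [0, total) and walk the cumulative sum,
# so fitter sources are chosen proportionally more often.
def roulette_select(fits):
    r, acc = random.uniform(0, sum(fits)), 0.0
    for m, f in enumerate(fits):
        acc += f
        if acc >= r:
            return m
    return len(fits) - 1

probs = probabilities([1.0, 3.0, 4.0])
# → [0.125, 0.375, 0.5]
```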

Scout Bees Phase

The unemployed bees who choose their food sources randomly are called scouts. Employed bees whose solutions cannot be improved through a predetermined number of trials, specified by the user of the ABC algorithm and called the "limit" or "abandonment criterion" herein, become scouts and their solutions are abandoned. Then, the converted scouts start to search for new solutions randomly. For instance, if solution x_m has been abandoned, the new solution discovered by the scout who was the employed bee of x_m can be defined by (5). Hence those sources which are initially poor or have been made poor by exploitation are abandoned, and negative feedback behaviour arises to balance the positive feedback.

 
Explain the ABC algorithm and its working. What is the fitness function used to improve the performance of the algorithm? 

Initialization Phase 
REPEAT
Employed Bees Phase
Onlooker Bees Phase
Scout Bees Phase
Memorize the best solution achieved so far
UNTIL(Cycle=Maximum Cycle Number or a Maximum CPU time)

[Figure: procedure and flow chart of the ABC algorithm]

Monk and Inversions

using System; public class Solution { public static void Main() { int T = Convert.ToInt32(Console.ReadLine...