CS.Lectures: Web databases | Semi-structured data/ Model (M4.1)

Web Database

A Web database is a database application designed to be managed and accessed through the Internet.
Website operators can manage this collection of data and present the analytical results based on the data in the Web database application.
Web databases enable collected data to be organized and catalogued thoroughly within hundreds of parameters.
The Web database does not require advanced computer skills, and many database software programs provide an easy "click-and-create" style with no complicated coding. Fill in the fields and save each record.

MySQL often comes up in discussions of Web databases. It is an open-source relational database management system that runs as a server and can manage many different Web databases; the "SQL" in its name stands for Structured Query Language, the language used to query it. MySQL is often included with Web hosting for managing either personal or business website databases. Because working with it means writing SQL queries, it is more difficult to use than a click-and-create Web database software program.

Types of Data

Structured data is often numerical and easy to analyze. It’s organized in a predefined format, such as a spreadsheet in Excel or Google Sheets, where data is added to standardized columns and rows relating to preset parameters. The framework is designed for easy data entry, search, comparison, and extraction.
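As a minimal sketch of this idea, the snippet below uses Python's standard-library sqlite3 module as a stand-in for a spreadsheet or relational table. The table name orders and its columns are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; the "orders" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# Every row has to fit the same predefined columns.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# Because the structure is fixed, search and comparison take a single query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The fixed columns are what make the aggregation straightforward: every row is guaranteed to have a customer and an amount.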

Semi-structured data is text-heavy data that is loosely organized into categories or “meta tags.” This information can be easily broken into its individual groups, but the data within these groups is, itself, unstructured. Email is a good example of this: you can search your email by Inbox, Sent, and Drafts, but the email text within each category has no preset structure.
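The email case can be illustrated with Python's standard-library email package. This is only a sketch: the message below is made up, and the header fields stand in for the "categories" while the body is the free text.

```python
from email import message_from_string

# A made-up message, used only for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch on Friday?\n"
    "Date: Mon, 01 Feb 2021 09:30:00 +0000\n"
    "\n"
    "Hi Bob, are you free around noon? There is a new place near the office.\n"
)

msg = message_from_string(raw)

# Structured part: named header fields that can be looked up directly.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# Unstructured part: free text with no predefined fields.
body = msg.get_payload()

print(headers["Subject"])
print(body)
```

Filtering by sender or date only needs the headers; answering "what is this email about?" requires analyzing the body text.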

Unstructured data is usually text-heavy or configured in a way that’s difficult to analyze. Social media posts, for example, might contain personal opinions, topics that are being discussed, and feature recommendations. However, this information is difficult to process in bulk. First, specific bits of information must be extracted and categorized, then analyzed to gain user insights.

[Figure: Table 1 from "How Do Procurement Networks Become Social? Design Principles Evaluation in a Heterogeneous Environment of Structured and Unstructured Interactions" (source: Semantic Scholar)]

Semi-structured data/ Model

Semi-structured data is data that can’t be organized in relational databases or doesn’t have a strict structural framework, but still has some structural properties or a loose organizational framework.
In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important.
Semi-structured data includes text that is organized by subject or topic, or that fits into a hierarchical markup language, yet the text within is open-ended and has no structure itself.
Semi-structured data is, essentially, a combination of the Structured and Unstructured data.
Photos and videos may contain meta tags that relate to the location, date, or by whom they were taken, but the information within has no structure. Or think of social media platforms like Facebook, which organize information by User, Friends, Groups, Marketplace, etc., while the comments and text contained in these categories are unstructured.
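The point about entities of the same class having different attributes, in no particular order, can be sketched with JSON objects in Python. The "person" records and their field names below are hypothetical.

```python
import json

# Two hypothetical "person" records of the same class with different attributes.
people = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Lin", "phone": "555-0100", "languages": ["zh", "en"]},
]

# Attribute order does not matter: these two objects are equal once parsed.
a = json.loads('{"name": "Ada", "email": "ada@example.com"}')
b = json.loads('{"email": "ada@example.com", "name": "Ada"}')
same = a == b

# Code reading such data has to tolerate missing attributes.
phones = [p.get("phone", "unknown") for p in people]
print(same, phones)
```

In a relational table, the same two records would need a shared set of columns, with empty values wherever an attribute does not apply.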

Examples of Semi-Structured Data

Email is probably the type of semi-structured data we’re all most familiar with because we use it daily. Email messages contain structured data like name, email address, recipient, date, time, etc., and they are also organized into folders, like Inbox, Sent, Trash, etc.
CSV, XML, and JSON are three major formats used to communicate or transmit data from a web server to a client.
HTML or “Hyper Text Markup Language” is a hierarchical language similar to XML, but while XML is used to transmit data, HTML is used to display data.

Types of semi-structured data

XML
JSON (JavaScript Object Notation)
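To make the two types concrete, the sketch below encodes one made-up "book" record in both XML and JSON and parses each with Python's standard library. The element and key names are assumptions for illustration only.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
xml_text = """
<book id="42">
  <title>Web Databases</title>
  <author>A. Writer</author>
  <author>B. Editor</author>
</book>
"""

json_text = """
{"id": "42", "title": "Web Databases", "authors": ["A. Writer", "B. Editor"]}
"""

# XML: tags and attributes label each value; repeated tags act as a list.
book = ET.fromstring(xml_text.strip())
from_xml = {
    "id": book.get("id"),
    "title": book.findtext("title"),
    "authors": [a.text for a in book.findall("author")],
}

# JSON: keys label each value; lists are a built-in construct.
from_json = json.loads(json_text)

print(from_xml == from_json)
```

In both formats the labels travel with the data, which is why neither needs a separately declared schema to be read.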

Advantages & Disadvantages of Semi-Structured Data

Semi-structured data is not constrained to a fixed architecture. So, a NoSQL database, for example, can store any format of data desired and can be easily scaled to store massive amounts of data. The downside, however, is that this makes it much more difficult to analyze this data – it must be manually processed (taking hundreds of human hours) or first be structured into a format that machines can understand.

The advantages of semi-structured data are the following:

Programmers persisting objects from their application to a database do not need to worry about object-relational impedance mismatch, and can often serialize objects via a lightweight library.
Support for nested or hierarchical data often simplifies data models representing complex relationships between entities.
Support for lists of objects simplifies data models by avoiding messy translations of lists into a relational data model.
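These three points can be sketched together in Python, using the standard-library dataclasses and json modules as the "lightweight library". The Order and LineItem classes are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LineItem:
    sku: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    # A nested list of objects, kept inside the parent record.
    items: List[LineItem] = field(default_factory=list)


order = Order(1, "alice", [LineItem("pen", 2), LineItem("notebook", 1)])

# asdict() recurses into nested dataclasses and lists, so the whole
# object graph becomes one JSON document.
document = json.dumps(asdict(order))

restored = json.loads(document)
print(restored["items"][1]["sku"])
```

In a relational model, the list of items would typically be moved to a separate table and linked back to the order by a foreign key.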

The disadvantages are the following:

The traditional relational data model has a popular and ready-made query language, SQL, which semi-structured data stores cannot always rely on.
Prone to "garbage in, garbage out": removing constraints from the data model means less forethought is needed to operate a data application, so poor-quality data gets in more easily.

Semi-structured data is much more storable and portable than completely unstructured data, but its storage cost is usually much higher than that of structured data. Semi-structured data is flexible, offering the ability to change schema, but the schema and data are often too tightly tied to each other, so you essentially have to already know the data you’re looking for when performing queries.

Semi-structured Data Search

How to Analyze Semi-Structured Data

Dealing with semi-structured data is easier than unstructured, but it still presents challenges. In previous years, humans would have to manually organize and analyze semi-structured data, but now, with the help of AI-guided machine learning technology, text analysis models can automatically break down and analyze semi-structured (and unstructured) text data for powerful insights.

Topic analysis, for example, is a machine learning technique that can automatically read through thousands of documents, emails, social media posts, customer support tickets, etc., and classify them by topic, subject, aspect, etc.

Adding other techniques, like sentiment analysis, allows you to automatically analyze these texts for opinion polarity (positive, negative, neutral, and beyond).

Semi-structured model

The semi-structured model is a database model where there is no separation between the data and the schema, and the amount of structure used depends on the purpose.

The advantages of this model are the following:

It can represent the information of some data sources that cannot be constrained by schema.
It provides a flexible format for data exchange between different types of databases.
It can be helpful to view structured data as semi-structured (for browsing purposes).
The schema can easily be changed.
The data transfer format may be portable.

The primary trade-off being made in using a semi-structured database model is that queries cannot be made as efficiently as in a more constrained structure, such as in the relational model.
Typically the records in a semi-structured database are stored with unique IDs that are referenced with pointers to their location on disk. This makes navigational or path-based queries quite efficient, but for doing searches over many records (as is typical in SQL), it is not as efficient because it has to seek around the disk following pointers.
The Object Exchange Model (OEM) is one standard for expressing semi-structured data; XML is another.
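The difference between a path-based query and a search over many records can be sketched as follows. This is an in-memory illustration only: a Python dictionary keyed by ID stands in for pointers to locations on disk, and the record layout is loosely in the spirit of OEM rather than its actual syntax. All IDs, labels, and values are made up.

```python
# Hypothetical object store: each record has a unique ID and refers
# to other records by ID.
store = {
    "o1": {"label": "library", "children": ["o2", "o3"]},
    "o2": {"label": "book", "children": ["o4"]},
    "o3": {"label": "book", "children": ["o5", "o6"]},
    "o4": {"label": "title", "value": "First Title"},
    "o5": {"label": "title", "value": "Second Title"},
    "o6": {"label": "year", "value": 2021},
}


def follow_path(store, root_id, labels):
    """Navigational query: from the root, follow child references
    whose label matches each step of the path in turn."""
    current = [root_id]
    for label in labels:
        current = [
            child_id
            for oid in current
            for child_id in store[oid].get("children", [])
            if store[child_id]["label"] == label
        ]
    return [store[oid].get("value") for oid in current]


def scan(store, label):
    """Search over all records: every object has to be visited."""
    return [
        obj["value"]
        for obj in store.values()
        if obj["label"] == label and "value" in obj
    ]


titles = follow_path(store, "o1", ["book", "title"])
years = scan(store, "year")
print(titles, years)
```

follow_path only touches the records reachable along the path, while scan has to visit every record in the store, which is the trade-off described above.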

Monday, February 1, 2021
