Data storage

Introduction to Databases

A database is a collection of data stored in some organized fashion.

The database software is the called the Database Management System (DBMS). For relational databases, it is called RDBMS. The database is the container created and manipulated via the DBMS.

Tables

Within a database, a table is a structured list of data of a specific type.

Usually with relational databases, the data stored in a table is one type of data or one list (e.g. one table for Customers, one table for Orders, etc.). Every table in a database has a name that identifies it. That name is always unique, meaning no other table in that database can have the same name.

Table have characteristics and properties that define how and what kind of data is stored in them. This set of information that describes a table is known as a schema. Schema are used to describe specific tables within a database.

Columns and Datatypes

Tables are made up of columns.

A column contains a particular piece of information within a table. A column is a single field in a table. All tables are made up of one or more columns. Each column in a database has an associated datatype. A datatype defines what type of data the column can contain (e.g. Integer, Character Array, etc.).

Datatypes restrict the type of data that can be stored in a column (e.g. preventing the entry of alphabetical characters into a numeric field). Datatypes help sort data correctly and play an important role in optimizing the disk usage.

Rows and Primary Keys

Data in a table is stored in rows; each record saved is stored in its own row. A row is basically a record. Every row in a table should have some column (or set of columns) that uniquely identifies it (but this is not mandatory). This column (or set of columns) that uniquely identifies each row in a table is called a Primary Key.

The primary key meets the following conditions:

  • No two rows can have the same unique primary key value.
  • Every row must have a primary key value (NULL are not allowed).
  • Values in the primary key columns should never be modified or updated.

Transactions

RDBMSs and SQL also support transactions. A key feature of transactions is that they execute virtually at first, allowing the programmer to undo (using rollback) any changes that may have gone awry during execution; if all has gone well, the transaction can be reliably committed.

A transaction is “a transformation of state” that has the ACID properties.

ACID Properties

ACID is an acronym for Atomic, Consistent, Isolated, Durable, which are the gauges we can use to assess that a transaction has executed properly and that it was successful.

Atomic means “all or nothing”; that is, when a statement is executed, every update within the transaction must succeed in order to be called successful.

Consistent means that data moves from one correct state to another correct state, with no possibility that readers could view different values that don’t make sense together.

Isolated means that transactions executing concurrently will not become entangled with each other; they each execute in their own space. That is, if two different transactions attempt to modify the same data at the same time, then one of them will have to wait for the other to complete.

Durable: once a transaction has succeeded, the changes will not be lost. This doesn’t imply another transaction won’t later modify the same data; it just means that writers can be confident that the changes are available for the next transaction to work with as necessary.

Schema

A database schema

  • is the skeleton structure that represents the logical view of the entire database.
  • defines how the data is organized and how the relations among them are associated.
  • formulates all the constraints that are to be applied on the data.
  • defines its entities and the relationship among them.
  • contains a descriptive detail of the database.

Bigtable: Genesis of NoSQL

In 2006, Google released a whitepaper that would outline a new paradigm for databases and data storage: “Bigtable: A Distributed Storage System for Structured Data.”

Bigtable was the distributed storage system for managing structured data at Google. However, unlike existing RDBMS platforms, Bigtable was designed to reliably scale to petabytes of data and thousands of machines to meet Google’s application workload demand.

Bigtable paved the way for several Apache projects, including HBase, Cassandra, MongoDB and others.

NoSQL, which stands for Not Only SQL, is an integral product of the big data movement, and the Google Bigtable paper was a major catalyst for the movement.

Characteristics of NoSQL systems

NoSQL DB are schemaless at design time and “schema-on-read” at runtime - This means they do not have predefined columns, but columns are created with each PUT (INSERT) operation, and each record, document, or data instance can have a different schema from the previous instance.

Data has no predefined relationship to any other object - This means there is no concept of foreign keys or referential integrity, declarative or otherwise. Relationships may exist between data objects or instances, but they are discovered or leveraged at runtime rather than prescribed at design time.

Joins are typically avoided - In most NoSQL implementations, joins are kept to an absolute minimum, if not avoided altogether. This is typically accomplished by denormalizing data, often with a trade-off of storing duplicate data.

NoSQL systems are typically distributed (like Apache Cassandra or HBase) and are designed for fast lookups.

Write operations are typically faster as well, as many of the overheads of traditional relational database systems are not used, like datatype or domain checks, atomic/blocking transactions, or management of transaction isolation levels.

NoSQL systems, in the majority of cases, are built for scale and scalability from petabytes of storage to queries bounded in terabytes, performance, and low friction (having the ability to adapt to changes).

Types of NoSQL systems

NoSQL systems come in several variants or categories:

Key value stores contain a set or sets of indexed keys and associated values. Values are typically uninterpreted byte arrays, but can represent complex objects such as nested maps, structs, or lists. The schema is not defined at design time. Examples of key value stores include Redis and Memcached.

Column stores or wide-column stores, which store data tables as columns rather than rows and have an ability to hold very large numbers of dynamic columns. Examples of Column stores include HBase, BigTable.

Document stores or document databases store documents, complex objects, such as JSON or BSON objects, or other complex, nested objects. The documents are assigned a key or document ID, and the contents would be the semi-structured document data. Examples of document stores include MongoDB and CouchDB .

Graph stores represent data in graph structures as nodes and edges, which are relationships between nodes. Examples of graph stores include Neo4J and GraphBase.

NoSQL databases and CAP theorem

The CAP theorem is sometimes called Brewer’s theorem after its author, Eric Brewer.

The theorem states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency:

  • Consistency: All database clients will read the same value for the same query, even given concurrent updates.
  • Availability: All database clients will always be able to read and write data.
  • Partition tolerance: The database can be split into multiple machines; it can continue functioning in the face of network segmentation breaks.

Brewer’s theorem is that in any given system, you can strongly support only two of the three. We have to choose between them because of this sliding mutual dependency.

In distributed systems, it is very likely that you will have network partitioning, and that at some point, machines will fail and cause others to become unreachable. Distributed systems need Partition Tolerance!

For NoSQL databases, it leaves with only two real options to compromise on: availability and consistency.

results matching ""

    No results matching ""