Database
What is a database?
A database is an organized collection of data, stored and accessed electronically, usually via a computer system. A database can be small or large, simple or complex depending on the situation. It can contain any type of information, and is a critical component of many computer systems.
What’s the purpose of a database?
A database is simply a set (for the purposes of this article, a digital set) of data stored in a coherent, accessible way. Collecting and storing data for future use is necessary for many apps and services to run (though the collection of user data into a database can present privacy and security concerns).
But collecting data is only the first step. For the data to be useful, it has to be consistently formatted (i.e. “structured”), and securely stored. A good database system provides quick and accurate access to the specific data needed, and presents the requested data in a usable format. Maintaining data integrity, allowing and logging changes to data, and providing backup and restoration capabilities are some other features of a well designed database.
How do you create a database?
There are three components to building a database:
- The design of the database contents
- The architecture of the system that hosts the data
- The management system used to maintain the data
To show these different components in action, in this article we’ll refer to a hypothetical database that records data about the users of a music app, and the songs they listen to.
Design
Database design considers what data elements are going to be stored and how the different elements relate to one another, and includes features to ensure data integrity (i.e. eliminate the chance for lost or unusable—aka “garbage”—data). The goal of good database design is to ensure data is correct, well-structured, and easily accessed by programmers, database admins, and engineers.
For our hypothetical music app, in the design phase the engineers would identify the data elements the business needs to store. For example, when a person signs up to use the app, the business might want to save that user’s name, email address, and screen handle. Separately, the business would need a database with the titles of every song available on the app. When a user listens to a song, admins would need to determine what additional data to capture, such as what song it was, what time of day they listened, and if they “liked” the song.
Architecture
Database architecture focuses on the hardware and software used to collect, house, and access the data. These decisions take into account the current and eventual size of the database, who will be accessing it, how, and what the data will be used for.
In the music app example, deciding on the hardware setup needed to store the data would depend on things like how many users the business expects to have, how many songs will be available, and how often users might listen. The business would also consider whether to directly manage the data servers, or outsource.
Management system
The database management system (DBMS) is the software system used to access and manage the database. The DBMS supports queries, updates, backups, and restorations (in case of a data loss or other major issue). The DBMS can provide information regarding the database structure, and track who has accessed and altered data. Strong security procedures are usually incorporated within the DBMS.
For the music app example, the DBMS would give the admin the ability to add new songs to the library, add or remove users, or query the data for info like most popular songs. The DBMS might allow lots of admins to have access to some data (e.g. what songs are played), but only a few to have access to user data or to overwrite/purge existing data.
What types of databases are there?
There are many types of databases. Sometimes a database is described by its data structure (how the data elements are organized or stored). Other times, a database might be described based on its physical location—where the computer storing the data actually “lives.” Some common database descriptors include:
Flat databases
A simple structure with all data in a single file or table. Examples of this type are a spreadsheet, or a text file with data elements separated by a specific character such as a comma or semicolon.
Relational databases
A collection of tables linked by keys. The tables break up the data into groupings that maximize efficiency and minimize repetitive information. One table might contain a single row for users (prefaced by a unique identifying number, or UID), while a second table might have rows for products (prefaced by a product ID), and a third might have rows capturing the interactions between the first two tables. By structuring this way, each table can remain a manageable size.
NoSQL databases
Short for Not SQL (where SQL is an abbreviation for Structured Query Language, a standardized language used to request data from a database). May also be called a document, graph, or key-value database. NoSQL databases have a more dynamic structure and can accommodate changes to their structure better than a SQL database. This makes NoSQL a popular choice for real-time apps, and new apps that might need to change quickly over time.
Columnar databases
In contrast to the typical database system (which stores data in rows, and a single row is stored as a set), a columnar database system stores data in columns (data in a single column is stored as a set). This type of database is often used in “big data” applications where the focus is analysis of data. When stored as a set, a column containing a single type of data can be quickly retrieved and analyzed.
Different database locations (or “hosting”)
There are several options for where a database “lives” (as in the physical servers it’s hosted on). These options include:
In-house
The computers or servers that store the data are physically located in the office or building of the database owner.
Cloud
The computers or servers that store the data are in a remote location like a server farm or data center. The servers storing the data are often maintained by a third party, who may or may not also manage the database itself. A company may opt to use the cloud and associated services because it is easier or less risky than managing themselves in-house.
Distributed
Data is stored on multiple computers, possibly in multiple locations. The performance advantage of this arrangement is improved response time. Another advantage of distributing data is decentralizing control of the Web and its content, reducing reliance on a few large tech companies. Decentralization is a core concept in the emergence of Web3 and blockchains.
Blockchain
The core technology of Web3, a blockchain is made up of many individual computers (even personal computers or devices) or servers that maintain the same copies of recorded data. This shared record is commonly known as a “ledger,” and functions much like a traditional ledger used in accounting. The data on these shared ledgers could be anything, but it’s most commonly a record of cryptocurrency transactions.
What is a query language?
A query language is a codified way for a user to access the data in a database. A query allows a user to retrieve one piece of data or whole sets of data that share one or more characteristics. The correct query language to use depends on the structure of the database. A relational database might use SQL, while a NoSQL database might use MongoQL. These are just two examples of many possible language options.
How is a database kept secure?
There are several layers of protection in database security. First is security of access. The data should only be accessible to people who need it. Further, not all users need to have the same permissions. Some users may have access to a limited amount of the data, while others may have access to view data but not alter it.
Then there is the security of the data itself, in the event someone gains unauthorized access. Best practices include encrypting sensitive data when stored, so that even if data is stolen it can’t be read.
Finally, it’s important to be able to restore data in the event it has been compromised. Backup and recovery procedures prevent data from being permanently lost or corrupted by creating periodic backups (e.g. every 24 hours) of the database and its data.
Can I protect my personal data in databases?
There’s not much you as an individual can do to protect your data once it’s in someone else’s database. Generally, you have to rely on the owner of the database to practice good cybersecurity. When you have the option, you can exercise care about where and when to supply your personal data:
- Check that the URL begins with “https://” (not “http://”). This indicates that your data is encrypted during transfer. While this isn’t a guarantee that your data is stored encrypted, https might improve the odds.
- Minimize the number of places your data is stored. The fewer places your data is stored, the less exposure you have to data theft. For instance, set up automatic payments through your bank’s bill pay system rather than at each account website. This reduces the number of databases your bank account information is stored on.
Fortunately, there has been recent legal activity aimed at protecting personal data, including the General Data Protection Regulation (GDPR) adopted in the EU, and the California Consumer Privacy Act (CCPA), which among other protections allow users to see what data is being stored, and request their data be corrected or deleted. As more of these laws are enacted, protection of an individual’s data will continue to improve.