System design interview basics
The system design round is one of the rounds in any technical interview. The idea of this round is to know your design skills, analytical and trade-off skills, to see if you can take a fuzzy problem, break it down it in small executable chunks and actually solve the problem. The beauty is that there are no correct or incorrect answers, it is more of a discussion.
To have a fruitful discussion, one must know the basics and should be able to prove his/her claims with facts. This article discusses what are all those basic technical concepts you should be aware of when you go into a system design round.
System Design concepts: Databases
Databases are essentially required in any application/system unless it is a non-authenticated, static, read-only and not frequently changing content only application. It is therefore important to know a few concepts around databases.
First of all, understand that these concepts are independent of what actual DB used, it can be open source like MySQL or proprietary like MS-SQL Server.
Imagine a simple application like which is set up like the figure below. What are the problems with setup?
There are three problems to start with among many others: First, what if the only DB server goes down or we have to take it down for maintenance? It impacts the system’s availability. Second, on the other instance, what if the server crashes and becomes unrecoverable, all application data in it is also gone, this affects the application’s data availability and persistence. Third, what if there are many reads and writes requests coming to the server. The only server will degrade in performance, simple read requests will take a lot of time waiting for complex writes to finish, impacting the performance of the overall system.
What can we do? Let’s set up a new server, which is nothing but a replica of the original server, and store everything which is on the original server. This process is called replication, the original server is called master and the copy server is called slave.
How does it solve the problems: First, having a replicated server gives us the power to immediately switch to a new server in case the master goes down. The system will keep running as if nothing happened and nobody except your database administrators needs to know anything.
Second, you have the exact copy of the data, so even you lose data on one server, you can always extract that from the other.
Third, since the servers contain the exact same data, you can read from one server and write on another. This improves the performance of your read requests (which are usually more in any system) and is not blocked on write.
So in essence, replicating your DB servers gives you system availability, protection against data loss, and performance gain .
Usually, in production systems, there are two masters, (one stand by) and many slaves attached to those masters. Also, you want to keep a server with replication delayed by a few hours (24 hours ideally). Just in case if data on master get corrupted ( developer run a wrong query, file system issues) and it gets replicated immediately to all the servers, all of the data is now corrupted. This delayed replicated server can give you sane data although you lose data of a day.
Where is the catch, where is the tradeoff here? We improved availability and performance, however, we have risked the consistency. Imagine a case, where data is written on the master. Due to some lag, data replication to slave did not happen quickly. If another service reads from the slave the same data, either it does not get the data or gets an obsolete piece of data.
Replication improves availability and performance, however, risks consistency.
A database is essentially a collection of tables that are nothing but rows and columns. The number of rows and columns in tables vary. Let’s have a very popular application and half of the earth’s population is registered to your application. In this case, your user table has too many rows. On the other hand, you have 10K employee and you want to store everything about those employees in one table called employee, here you have a table which has too many columns.
Before understanding partitioning, let’s understand why too many rows or columns in the table are problems for a system?
With too many rows, index size will be large and search performance will degrade, this will impact the overall performance of the service for a specific user. Let’s say a few thousands of users are accessing your service which queries this table, one user query starts degrading the other and your system goes into this spiral of degrading performance. If this big data gets corrupted, it impacts the availability of service to all the users.
With too many columns, you will be pulling a lot of data in every query even when you need just a few columns out of those. This will indirectly impact the performance and put a lot of unnecessary data on the network.
What is the solution to too many rows or columns? The answer is partitioning. There are two types of partitioning:
In horizontal partitioning, a table is divided into multiple partitions, each partition stores rows of the table. These partitions can be stored in multiple database clusters. One of the examples is that all the users from Germany, Netherlands, and France in a table called
EuropeUser whereas all the users in USA and Canada are stored in a table called
NorthAmericaUser and so on for Asia, Africa, etc. If you want to know all the users of your service, all you need to unite all these tables.
Gain is that now you have to search less number of rows to find a European user compared to earlier, which improves query performance and hence improves the response time of service. If table
EuropeUser is corrupted or deleted, your service is still available in North America and Asia.
In vertical partitioning, we divide the table into multiple partitions and each partition contains certain columns. For example, an employee table with id, name, email, picture, salary, organization can be partitioned into three one with columns: id, name, and email; second: id, picture; Third: id, org, and salary.
There are multiple ways you can partition a table:
Range partitioning – selects a partition by determining if the partitioning key is inside a certain range. It distributes tuples based on the value intervals (ranges) of some attributes.
List partitioning – a partition is assigned a list of values. If the partitioning key has one of these values, the partition is chosen. For example, all rows where the column Country is either Iceland, Norway, Sweden, Finland, or Denmark could build a partition for the Nordic countries.
Composite partitioning – allows for certain combinations of the above partitioning schemes, for example first applying a range partitioning and then a hash partitioning.
Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the keyspace to a size that can be listed.
Round-robin partitioning – the simplest strategy, it ensures uniform data distribution. With n partitions, the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables sequential access to a relation to be done in parallel.
Hash partitioning – applies a hash function to some attribute that yields the partition number. This strategy allows exact-match queries on the selection attribute to be processed by exactly one node and all other queries to be processed by all the nodes in parallel.
Partitioning improves your database availability and performance.
We already know that horizontal partitioning splits a big table into multiple partitions, however, these partitions are stored in the same schema and on the same server. Sharding takes it one step beyond, where partitions can be stored on multiple servers and in different schemas. This division of data and its storage on multiple servers gives the flexibility to store more data that does not fit on single servers and fast search responses on the data.
In algorithmic sharding, the client can determine a given partition’s database without any help. Usually, there is a partition key, we can derive cluster-id using that partition key. A simple hash function Hash(key) can be used as a sharding function. An example of such sharding implementation is Memcached.
In dynamic sharding, an external locator service determines the location of data. To read and write data, clients need to consult the locator service first.
Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate.
This is true because adding sharding actual may reduce performance if your data is not uniformly distributed by creating hotspots. Also, it adds lot of operational overheads, adding a new node in the system may require re-sharding and moving data between nodes.