What is caching? Caching is a mechanism to store and fetch data faster by applications without referring to the original source of data. Usually, applications store data which they think will be used again, instead of doing a round trip to the original source, applications use the stored data to serve requests. These are mostly software caches implemented using main memory.
A cache is a storage which stores a subset of the entire data based on current usage and probability of its future usage for faster execution of requests. This storage is typically in main memory which has must faster read time compared to disk.
In typical microprocessors, hardware cache exists which between CPU and main memory, these are very high-speed miniature memories which store instructions and data, which are processed by CPU, referred to as L1, L2 caches.
In distributed web systems, caching can be done at various places and levels which we will discuss in a while.
Caching eviction strategy
As mentioned above, cache stores a subset of the original data. Question is which subset and how does it work when data which is not in the cache is requested? There are different approaches to select which data stays in the cache, these are called cache eviction strategies.
Before understanding more on eviction strategies, a cache miss is an event where the application did not find data it requested in the cache. A cache hit is when the application did find what requested in the cache.
First in first out : In the event of a cache miss, the entity which came first into the cache is evicted first. Every entity has a timestamp associated with it and the oldest timestamp entity is removed first.
Least recently used: In the case of a cache miss, a page which was used least recently gets evicted from the cache.
Least frequently used: In case of a cache miss, the entity which is least frequently used among all the entities in the cache is thrown out of the cache.
There are other eviction strategies like minimum used eviction or heuristic-based eviction.
Cache hit ration is dependent on many parameters, first, the size of cache key-space, the more unique cache keys your application generates, the less chance you have to reuse any one of them. Always consider ways to reduce the number of possible cache keys. second, the number of items you can store in the cache, the more objects you can physically fit into your cache, the better your cache hit ratio. Third, longevity, how long each object can be stored in cache before expiring or being invalidated.
Caches can be applied and leveraged throughout various layers of technology including Operating Systems, Networking layers including Content Delivery Networks (CDN) and DNS, web applications, and Databases.
In a web application, most of the time, user requests need the same data to fulfill those request. If each request starts hitting your database, your servers will be overloaded and response time will slow. To avoid unnecessary load on servers and to decrease the response time, we place caches in between our databases and application. These caches can be on the same servers as the database, completely separate servers or at the application servers. Based on the metric and function you want to optimize, use appropriate caching level.
– Client caching
Caching which is done at the client-side like operating system and browser of the user. Typical examples are Address Resolution Protocol and static assets like HTML and CSS. Remember, you did nothing in this case, everything is done by browser and not your system.
– CDN caching
– Server caching
We could have a cache directly on the servers. Each time a request is made to the service, the server will quickly return local, cached data if it exists. If it is not in the cache, the requesting node will query the data by going to network storage such as a database.
How this solution will scale as we may grow to many nodes? If we decide to expand to multiple nodes, it’s still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses.
Another design pattern to handle the cache miss problems is to have a common cache for the entire system which all server write and read from. This pattern scales better and even if the requests in the same session go to multiple servers, the user gets the consistent experience without latency. Trouble is that now you cache layer has become a bottleneck.
In this caching approach, we write directly on to the DB. The cache reads the info from DB in case of a miss and then stores it till the eviction policy moves it out of cache. This approach can lead to higher read latency in case of applications which write and re-read the information quickly.
Where writes go through the cache and write is confirmed as success only if writes to database and the cache succeed. There is data consistency between cache and database. If your cache crashes due to power failures or other disruptions and restarts, nothing will be lost. However, write latency will be higher in this case as there are writes to two separate systems.
– Write-behind (write-back)
In this approach we write on cache and it is confirmed as soon as it is done on cache without writing it on to database. This write is asynchronously synced to database. It results in quick write latency and high write throughput for the write-intensive applications. However, you have a risk of loss of data incase the caching layer dies because the only single copy of the written data is in the cache. We can improve this by having more than one replica acknowledging the write in the cache.
Advantages of using cache in your system design
1. Improve Application Performance
Because memory is orders of magnitude faster than disk (magnetic or SSD), reading data from the in-memory cache is extremely fast (sub-millisecond). This significantly faster data access improves the overall performance of the application.
2.Reduce Database Cost
A single cache instance can provide hundreds of thousands of IOPS (Input/output operations per second), potentially replacing a number of database instances, thus driving the total cost down. This is especially significant if the primary database charges per throughput.
3. Reduce the Load on the Backend
By redirecting significant parts of the read load from the backend database to the in-memory layer, caching can reduce the load on your database, and protect it from slower performance under load, or even from crashing at times of spikes.
4. Eliminate Database Hotspots
In many applications, it is likely that a small subset of data, such as a celebrity profile or popular product, will be accessed more frequently than the rest. This can result in hot spots in your database and may require overprovisioning of database resources based on the throughput requirements for the most frequently used data. Storing common keys in an in-memory cache mitigates the need to overprovision while providing fast and predictable performance for the most commonly accessed data.
5.Increase Read Throughput (IOPS)
In addition to lower latency, in-memory systems also offer much higher request rates (IOPS) relative to a comparable disk-based database. A single instance used as a distributed side-cache can serve hundreds of thousands of requests per second.
The system design round is one of the rounds in any technical interview. The idea of this round is to know your design skills, analytical and trade-off skills, to see if you can take a fuzzy problem, break it down it in small executable chunks and actually solve the problem. The beauty is that there are no correct or incorrect answers, it is more of a discussion.
To have a fruitful discussion, one must know the basics and should be able to prove his/her claims with facts. This article discusses what are all those basic technical concepts you should be aware of when you go into a system design round.
System Design concepts: Databases
Databases are essentially required in any application/system unless it is a non-authenticated, static, read-only and not frequently changing content only application. It is therefore important to know a few concepts around databases.
First of all, understand that these concepts are independent of what actual DB used, it can be open source like MySQL or proprietary like MS-SQL Server.
Imagine a simple application like which is set up like the figure below. What are the problems with setup?
There are three problems to start with among many others: First, what if the only DB server goes down or we have to take it down for maintenance? It impacts the system’s availability. Second, on the other instance, what if the server crashes and becomes unrecoverable, all application data in it is also gone, this affects the application’s data availability and persistence. Third, what if there are many reads and writes requests coming to the server. The only server will degrade in performance, simple read requests will take a lot of time waiting for complex writes to finish, impacting the performance of the overall system.
What can we do? Let’s set up a new server, which is nothing but a replica of the original server, and store everything which is on the original server. This process is called replication, the original server is called master and the copy server is called slave.
How does it solve the problems: First, having a replicated server gives us the power to immediately switch to a new server in case the master goes down. The system will keep running as if nothing happened and nobody except your database administrators needs to know anything.
Second, you have the exact copy of the data, so even you lose data on one server, you can always extract that from the other.
Third, since the servers contain the exact same data, you can read from one server and write on another. This improves the performance of your read requests (which are usually more in any system) and is not blocked on write.
So in essence, replicating your DB servers gives you system availability, protection against data loss, and performance gain .
Usually, in production systems, there are two masters, (one stand by) and many slaves attached to those masters. Also, you want to keep a server with replication delayed by a few hours (24 hours ideally). Just in case if data on master get corrupted ( developer run a wrong query, file system issues) and it gets replicated immediately to all the servers, all of the data is now corrupted. This delayed replicated server can give you sane data although you lose data of a day.
Where is the catch, where is the tradeoff here? We improved availability and performance, however, we have risked the consistency. Imagine a case, where data is written on the master. Due to some lag, data replication to slave did not happen quickly. If another service reads from the slave the same data, either it does not get the data or gets an obsolete piece of data.
Replication improves availability and performance, however, risks consistency.
A database is essentially a collection of tables that are nothing but rows and columns. The number of rows and columns in tables vary. Let’s have a very popular application and half of the earth’s population is registered to your application. In this case, your user table has too many rows. On the other hand, you have 10K employee and you want to store everything about those employees in one table called employee, here you have a table which has too many columns.
Before understanding partitioning, let’s understand why too many rows or columns in the table are problems for a system?
With too many rows, index size will be large and search performance will degrade, this will impact the overall performance of the service for a specific user. Let’s say a few thousands of users are accessing your service which queries this table, one user query starts degrading the other and your system goes into this spiral of degrading performance. If this big data gets corrupted, it impacts the availability of service to all the users.
With too many columns, you will be pulling a lot of data in every query even when you need just a few columns out of those. This will indirectly impact the performance and put a lot of unnecessary data on the network.
What is the solution to too many rows or columns? The answer is partitioning. There are two types of partitioning: Horizontal partitioning
In horizontal partitioning, a table is divided into multiple partitions, each partition stores rows of the table. These partitions can be stored in multiple database clusters. One of the examples is that all the users from Germany, Netherlands, and France in a table called EuropeUser whereas all the users in USA and Canada are stored in a table called NorthAmericaUser and so on for Asia, Africa, etc. If you want to know all the users of your service, all you need to unite all these tables.
Gain is that now you have to search less number of rows to find a European user compared to earlier, which improves query performance and hence improves the response time of service. If table EuropeUser is corrupted or deleted, your service is still available in North America and Asia.
In vertical partitioning, we divide the table into multiple partitions and each partition contains certain columns. For example, an employee table with id, name, email, picture, salary, organization can be partitioned into three one with columns: id, name, and email; second: id, picture; Third: id, org, and salary.
There are multiple ways you can partition a table:
Range partitioning – selects a partition by determining if the partitioning key is inside a certain range. It distributes tuples based on the value intervals (ranges) of some attributes.
List partitioning – a partition is assigned a list of values. If the partitioning key has one of these values, the partition is chosen. For example, all rows where the column Country is either Iceland, Norway, Sweden, Finland, or Denmark could build a partition for the Nordic countries.
Composite partitioning – allows for certain combinations of the above partitioning schemes, for example first applying a range partitioning and then a hash partitioning. Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the keyspace to a size that can be listed.
Round-robin partitioning – the simplest strategy, it ensures uniform data distribution. With n partitions, the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables sequential access to a relation to be done in parallel.
Hash partitioning – applies a hash function to some attribute that yields the partition number. This strategy allows exact-match queries on the selection attribute to be processed by exactly one node and all other queries to be processed by all the nodes in parallel.
Partitioning improves your database availability and performance.
We already know that horizontal partitioning splits a big table into multiple partitions, however, these partitions are stored in the same schema and on the same server. Sharding takes it one step beyond, where partitions can be stored on multiple servers and in different schemas. This division of data and its storage on multiple servers gives the flexibility to store more data that does not fit on single servers and fast search responses on the data.
In algorithmic sharding, the client can determine a given partition’s database without any help. Usually, there is a partition key, we can derive cluster-id using that partition key. A simple hash function Hash(key) can be used as a sharding function. An example of such sharding implementation is Memcached.
In dynamic sharding, an external locator service determines the location of data. To read and write data, clients need to consult the locator service first.
Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate.
This is true because adding sharding actual may reduce performance if your data is not uniformly distributed by creating hotspots. Also, it adds lot of operational overheads, adding a new node in the system may require re-sharding and moving data between nodes.
A commonly asked system design interview question is to design a twitter timeline. We will discuss step by step method to answer this or any system design interview question in this post.
Step 1: Requirement gathering, 5-7 mins
Before you start designing any system, it is very important to know what will this system do? In real world, you will never told everything needed from a system at once. There different stakeholders and business area which can be impacted and benefited from the system you are building, not all of them are technical people. Hence the first step you do in real world, you should in the interview: Collect as many requirements as possible upfront. This is called scoping of the problem.
Why should you do it?
First of all, it shows that you do not jump into implementation as soon as you hear something. This behavior is closely observed in interviews.
Second, in the context of the interview, you cannot possibly design the system which is designed in years by many developers in 1 hour. You need to sacrifice quite a few features, and this is a good place to start dropping those features. It shows that you can clarify things when required.
Third, you have now a bit concrete requirement set which you can use to build further. This shows your prioritizing skills.
We should start listing the requirements which come to our mind, do not hold yourself in a box. Think as many as possible and ask the interviewer if you should design that feature or requirement. If he says yes, keep it else just mark them as out of scope.
Below are a few requirements I have listed for twitter timeline
When a user logs in, (s)he should see the timeline with tweets from people (s)he follows.
A user should be able to post a tweet
A user should be able to retweet a tweet and like a tweet.
A user should be able to follow another user
A user should be able to search his/her timelines
A user should be able to see his/her own tweet timelines.
For shake of this exercise, let’s design the system for all of the above requirements, but I am sure that end of this exercise, you are left with 2 to 3 most important requirements interviewer wants you to design for.
However, we are yet not done with requirements, we collected the functional requirements of the system, but what about non-functional requirements? Considering non-functional requirements during the design process puts you into a different bracket.
Following are some of the non-functional requirements you should always gather:
Latency: Collect the time expected for each of the major action users can do, in this example, we should ask what are the latency expectations for timeline display, posting a tweet, search. How much time should it take when a user posts a tweet to appear in his/her followers’ timelines? For this, we have low latency on reading: 1ms for p50 and 4ms for p95.
Availability: What are the availability requirements? Should the system be always available or it can have a downtime?
Consistency: What kind of consistency model this system is expected to follow? Can it be eventual consistency or it has to have hard constraints? In this example, one follower of a person can see the tweet in the timeline, but another one does not. However, eventually, everyone will see it.
The last thing if you have time, you should discuss is your Service Level Objectives, in terms of reliability, availability, and latency. Discuss something like, your system will be 99.99% of the time available and 99.5% of all timelines will load with 500 ms.
Step 2: Get the numbers, 5 min
One of the natural things we do in real life when designing systems is to understand the scale at which our system will be used. This information plays a crucial role in our storage selection, caching strategies, queue strategy, fault-tolerance mechanisms, machine hardware, network topology on the architecture level. At the same time, these numbers will influence design in terms of which patterns to follow (map-reduce, scatter and gather, request-response, etc), communication protocol (HTTP, GRPC, Queue), message exchange formats (Protobuff, Avro)
If this can influence so much of your design, how can you miss it in an interview? We focus on two major things and the rest all follow:
What is the number of requests per second and data is exchanged in requests and responses?
What is the user base and what is stored for each user?
These are the numbers for twitter
150M active users.
300K requests per second for a pull-based timeline query.
Up to 800 tweets could be displayed on the timeline.
allowing relatively high on write.
5K new tweets per second on average. (12K at peak).
Users can have up to tens of millions of followers.
Let’s start doing the Math, if we have 150 M active user, we will have them stored on the database, each user will have, user_id, user_name, which will be 104 bytes per user, for 150 * 104 MB, which is approx 15GB of data, we can ask a follow-up question, what is the rate of growth of users, but that not a focus of this question.
There are 5K tweets are tweeted per second, that means in 24 hours, there will be 5000 * 24 * 60 * 60 tweet which is 432 M tweets, each tweet containing the 140 characters, tweet_id,user _id and media, overall, each tweet costs us 10KB (media dominates the size of each tweet). Each tweet needs to be stored, so our storage is growing at the pace of 432 * 104 MB which is approx 62 GB per day. The media will be store in the object store we will see later.
Now that we know what will we store and how it will grow, let’s focus on the transient data on the network. There 300K requests per second for a timeline, each timeline containing up to 800 tweets that are 300K * 800 * 140 characters, alongside meta information of let’s say 20 bytes, overall, approx 39 Gigabytes of data will be flowing through our network.
Now we know what is size of storage we are looking at and how much data will be flowing through the network. Also, we know how many read and write requests will be coming to our service per second. Based on our server capacity, we can decide how many servers do we need to handle that traffic. Also, keep in mind that traffic is not evenly distributed, so you may have to provision more than the average calculations.
Step 3: Draw the high-level block diagram 15-20 mins
What are the bottlenecks in the diagram? Web servers are usually stateless, we can spawn as many of them as needed. Real bottleneck starts at back-end service, it handles also the requests, writes when user tweets and reads when the timeline is loaded or searches timeline. So my first idea would be to split these two services: read and write service APIs.
Now, we can scale write and read APIs independently of each other. However, still, both services are using directly the database for reading and writing data. For example, to load timeline, read API has to go to the database to find the user info, then people this user follows, then tweets from those followers and then actual tweets and their information like the number of retweets and likes. That’s a lot for a single service. Why do not divide it into different services with each service should a single thing? Single responsibility principle. Three services I can think of: timeline service (orchestrator service), userInfo service to get information about a user, and tweetInfo service to get information about a tweet or bunch of tweets.
What is the next bottleneck in the system? Reading from databases is very expensive operations because data is store on the hard disk. Yes, there is some level of caching, but still it expensive. Why don’t we use cache which is an in-memory store and much faster in reads and writes?
As soon as you speak of the cache, you should mention the impact it has on the consistency of the system.
Earlier, our timeline service used to find all the users current user is following and then go through each of these users to find if they have tweeted anything recently, sorted them by time and relevancy, and then go to tweetInfo service to find actual tweets and information about the tweet. That’s quite an expensive operation for each timeline load.
One of the quick thought comes to mind is what if store the people user follows in the cache for each user. That saves a request to DB, but does it save much? Because we still have to go to DB to find tweets of these people.
How about if we prepared the timeline for each of the users, a list of maximum 800 tweets, whenever, the user comes online, we can just fetch the prepare timeline from the cache and send it back. The next question, is how to fill this timeline and when to update it? Whenever, a user tweets, we go through all his/her follower’s timelines and store it there. Should we store it in DB first and then timelines? Ideally yes. Because caches are non-persistent and we do not want to lose the tweet. All of these things have to be done by the write APIs, they call another service fanout service which fans out the recent tweet from the user to all the followers.
The writing part also has to deal with the media in the tweets. One of the best ways to use the object store to store these media. Block diagram changes as follows:
What is still a bottleneck in the design? Obviously, it is the database. It is a single database used by both reading and writes and both of them unnecessary compete for the same resource. Also, the database is right now a single point of failure. Why reading and writing from the same database server is bad? A typical way to handle this bottleneck is to put a master-slave, leader-follower setup for the DB. In this, all the writes happen to the master DB, which then gets replicated to slaves or followers and then all the reads happen from slaves. As you can already see that we are introducing something which can compromise the consistency of the system. What if the replication from master to slave is super slow? What if the user publishes the tweet and reads it again immediately? One advantage apart from increasing in throughput, master-slave arrangement brings is redundancy, which helps in data recovery in case of master crashes completely and availability by making one of the slave masters in case master is down for some time.
Another problem which we figured while doing number math was the rate at which the database will grow. If the database is growing at 62 GB per day, we cannot have it on one master machine. We will have to partition our database into smaller chunks to store it. In this case, we will choose horizontal partitioning, where we will store tweets from certain users on certain partitions of DB.
Seems like we have cover the first two use cases, the timeline for a user and user can post tweet which is displayed to followers. There is another use case to design : a user can search for a tweet.
Current storage (cache and database) are not ideal for search functionality, because the search will not be on user id on which DB is partitioned on, it will be mostly a free text or based on tags. So, we must extend this system to incorporate the search. Also, it is not required that a tweet should show up in search results as soon as it tweeted, so we can actually do some processing before it is on the search. Which system is best for the text-based search? What if our fanout service, pushes every tweet into a queue, which is consumed on the other end, massage according to our search criteria and pushed into search cluster. This search cluster data is exposed through our search APIs. One thing to note is that we do not intend to add this data to each node in the search cluster, we are done as soon as data is store in one server and then replicated to 2 more for redundancy purposes. So, the major difference between the search pipeline and write pipeline. Write pipeline is deliberately kept O(n) to keep read O(1), whereas search is O(1) when indexing the search but O(n) when reading results.
Added advantage of adding a queue based communication is that we can use multiple consumer groups to process the same tweet and do different things. For example, a tweet can be put into search service, timeline cache, notification service, trend calculator service and analytics.
Whenever there are multiple services depend on the same input and real-time acknowledgment is not required, introduce queues.
It is interesting that we
It is very important in an interview to take your interviewer along with you, explain every block to him/her, communicate clearly why are you introducing that block and highlight the downsides of it.
Step 4: Explain technology choices, 10 mins
Now, that we have clear requirements, scale and broad architecture, it is good to make some concrete choices and showcase your breadth of technology stack awareness.
First of all, which database? Would you choose relational databases or would you go from non-relational? If non-relational, would you go for document-based storage or key value-based storage? In the current scenario, we are not reading a lot from the database or doing searches, a relational database may just suffice. Document-based storage will be an overkill. If I have to pick, I will pick MySQL or PostgreSQL, both are proven in production to scale well, they provide essential ACID properties we need for this system.
Next, what will be my choice for a cache? There are three main competitors: Memcached, Cassandra, and Redis. In this case, we have a specific case, where we write to cache a lot and read it using a very specific key, which is user_id, Cassandra fits the bill very beautifully because it is designed for fast writes, is eventually consistent, fault-tolerant and performance very well when we know the query parameters.
What will be our choice of queue or messaging system? There are a few like Kafka, RabbitMQ, ActiveMQ. We will go with Kafka, for two reasons: First, it scales very well, if you want another service to consume message from specific topics, add a consumer group and you are good to go. If you need to increase throughput, just increase the number of partitions and corresponding threads. Second, it provides durability. In case something happened on the consumer side, all the messages are preserved on Kafka for a configurable number of days. Apart from that, native connectivity to HDFS is also a big plus. Also, I will use Avro message protocol, because it compresses message in less number of bytes compared to Protofbuff, also fit well with schema registry pattern and supports multiple compatibility levels between producers and consumers.
For text-basedsearch, the candidates are Elastic search or Lucene, however, it is good to talk about retention time of indices and data. Access control to this data and how would you archive the search data. The approach would be scatter and gather, where we scatter the request to each node in the cluster and ask if they anything related to the search term, get the result, process them, rank them and send it to the user.
Communication between external clients and our dynamic load balancer can be on HTTP, however, internal systems can communicate on different protocols like gRPC for low latency, messaging queues for durability.
Problem of celebrities
Celebrities on Twitter are followed by millions of users. That becomes a problem for our fanout service, as that service is responsible for putting the tweets into followers’ timeline cache. This operation can take up to minutes based on the follower count. There is another problem, what if I got the tweet into my timeline inserted and decided to reply to it. As I have a small follower count, my tweet will be visible to my followers immediately while the celebrity tweets will not. That creates a very bad user experience.
How Twitter has solved it?
Interestingly enough, Twitter does not fan out celebrity tweets to all the followers. It adds these tweets on the read time. So whenever a user asks for timeline, service pulls the timeline from cache, looks at the follower list, see celebrities being followed, check if some tweet has to be inserted into timeline before returning it to front end. It saves a tens of percents of computing resources, however, adds extra cost to read service.
Wrap up and conclude
You must understand that you can not design a perfect system in 45 to 60 mins. It is completely OK to miss certain things as far as you know them. This is the section, where you conclude your design and explain what would have you done given more time, which bottlenecks to address next, what testing, bench-marking, monitoring, alerting, support process, etc.
Last of tip for any system design interview, there are not wrong or right answers, only bad, good and great designs. It is always a story of trade-offs. To do trade-offs, you must know you options and for that you have to read a lot around the technologies presently being used like MySQL, Redis, Cassandra, Memcached, Kafka, Rest APIs, Elastic Search, Lucene and many more. Understand the performance gain and consistency and durability loss between DB and cache. Understand read favored cached Vs write favored caches. Understand difference between Relational and NoSQL databases, when to use document storage?
Privacy & Cookies Policy
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.