Drafts of the Mind #2
Hello everyone, today I am going to write about scalable applications. For those who don't know, I started a new series yesterday about my life, my thoughts, opinions, learnings, and so on. At the same time, I have begun to read Designing Data-Intensive Applications, and in this post, I would like to share what I learned about scalability after reading the scalability section of the book's first chapter. Let's start with what scalability means.
What is scalability?
Scalability is the ability to cope with increasing load. In other words, when we say an application is scalable, we mean it can cope with increasing load. As you can see, any definition of scalability depends on what the load is.
What is load?
At first, it can be hard to define load, because, in my opinion, there is no single definition that fits every system. For example, for an elevator, the load can be the total weight of the people inside; for a restaurant, it can be the number of customers; for a typical web application, it can be the number of concurrent user requests; and for a database, it can be the number of concurrent queries or the amount of data stored.
So, as agreed, load depends on the system. When we say a typical web application is scalable, we basically mean it can cope with an increasing number of concurrent users. However, this is still not meaningful enough for experienced developers, because it doesn't express how many concurrent users can access the system reliably. A more precise statement would be: this application currently copes with 10,000 concurrent users, but we designed it to cope with 100,000. Yet even this feels incomplete when we dig into it, because it doesn't define what exactly "the ability to cope" means.
What is performance?
Like load, the definition of performance also depends on the context. For a car, it can be the top speed; for a typical web application, it can be latency or response time. So, when we say an application can be scaled to 100,000 concurrent users, we mean it can serve 100,000 concurrent users without any remarkable additional latency. At this point, I should also define latency and response time.
What is latency and response time?
The response time is what the client sees: besides the actual time to process the request, it includes the network delays and queuing delays. On the other hand, latency is the duration that a request is waiting to be handled.
According to these definitions, response time is always greater than latency, because response time equals latency plus network delays plus other overhead. This "other overhead" can be various things; for example, how quickly the client can process the response. As developers, most of the time we should care about response time rather than latency. Let's say we have a service that processes 100 GB of data and returns a result one second after processing; so the latency of this service is one second for that amount of data. But due to limitations of the network and some protocols, the client only sees the response after 10 hours. So, if you care about the user or client experience, you should measure response time. In this scenario, the network is the weakest link in the chain; in other words, it is the head-of-line blocker.
While for the HTTP protocol head-of-line blocking can come from the limited number of connections between the client and the server, for a microservice application it can come from the service with the highest latency.
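To make this concrete, here is a toy sketch with simulated delays and made-up numbers. It only illustrates the idea that the response time the client observes is larger than the processing time the service measures on its own; it is not a real benchmark.

```python
import time

# Toy sketch: the "server" reports only its own processing time, while the
# client measures the full response time, which also includes the simulated
# network delays. All durations here are made up for illustration.

def handle_request() -> float:
    start = time.monotonic()
    time.sleep(0.05)          # pretend the actual processing takes ~50ms
    return time.monotonic() - start

def client_call() -> tuple[float, float]:
    start = time.monotonic()
    time.sleep(0.02)          # simulated network delay on the way in
    processing_time = handle_request()
    time.sleep(0.03)          # simulated network delay on the way back
    response_time = time.monotonic() - start
    return processing_time, response_time

processing, response = client_call()
print(f"server processing time: {processing * 1000:.0f} ms")
print(f"client response time:   {response * 1000:.0f} ms")
```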
When we measure response time, we should also interpret it, right? Otherwise, it would be meaningless. While interpreting it, we can use the mean (aka average), median (aka p50), p95, p99, and so on. However, the average response time doesn't tell us how many users experienced a delayed response; percentiles give us better information. If you take your list of response times and sort it from fastest to slowest, the median (p50) is the halfway point. For example, if your application has a median response time of 200ms, that means half of the response times are lower than 200ms and half are higher. Based on the p50 (median) definition, we can explain p95, p99, and so on. For example, if your application has a p95 response time of 200ms, that means 95% of the response times are lower than 200ms and the remaining 5% are higher.
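As a small illustration, here is a rough sketch of computing these percentiles with the nearest-rank method. The measurements are made up, and in practice you would get these numbers from your monitoring tool or a statistics library instead.

```python
import math

def percentile(response_times_ms: list[int], p: float) -> int:
    """Return the response time below which roughly p percent of requests fall
    (nearest-rank method)."""
    ordered = sorted(response_times_ms)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank in the sorted list
    return ordered[max(rank, 1) - 1]

measurements = [120, 150, 180, 200, 210, 250, 300, 450, 800, 1500]  # ms, made up

print("p50:", percentile(measurements, 50))   # the halfway point (median)
print("p95:", percentile(measurements, 95))   # 95% of requests were faster than this
print("p99:", percentile(measurements, 99))
```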
Let's turn back to scalability. We now know that scalability is the ability to cope with increasing load. So, how can we scale our system? In other words, how can we increase our system's ability to cope with increasing load?
What is scale up and scale out?
I am not sure that I need a header to define scaling up and scaling out. However, I believe it looks cooler this way 😄
While scaling up (aka vertical scaling) means moving to a more powerful machine, scaling out (aka horizontal scaling) means distributing the load across multiple smaller machines. As in most cases, a hybrid approach usually makes more sense for scaling our system. But sometimes horizontal scaling isn't possible, especially when we have designed a stateful system.
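As a tiny illustration of the scale-out idea, here is a sketch that spreads incoming requests round-robin across several smaller servers. The server names are invented, and in a real system this routing is the job of a load balancer such as nginx or HAProxy rather than application code.

```python
import itertools

# Made-up pool of small machines standing in for one big server.
servers = ["app-1:8080", "app-2:8080", "app-3:8080"]
next_server = itertools.cycle(servers)

def route(request_id: int) -> str:
    # Round-robin: each request goes to the next machine in the pool.
    target = next(next_server)
    return f"request {request_id} -> {target}"

for i in range(6):
    print(route(i))
```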
As application developers, we should architect our systems properly to ensure scalability without depending on increased server capacity. To achieve this, first of all, we need to thoroughly understand the system's non-functional requirements.
One of the key non-functional requirements we should care about is performance, especially how the system handles read and write operations. In most real-world cases, applications deal with more reads than writes. For example, Twitter, YouTube, Reddit, and other social media platforms are read-heavy systems. So, obviously, we should optimize reads when we architect such a system. On the other hand, while optimizing read operations, we may have to compromise on write operations. This balance between read and write optimization is what ultimately determines how efficiently our system scales, even when the underlying hardware remains constant.
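To illustrate that trade-off, here is a toy sketch loosely inspired by the timeline example in the book's first chapter: every write does extra work to fan the new post out to each follower's precomputed feed, so reads become simple lookups. The data model is invented purely for illustration and is nothing like a real social platform.

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}   # author -> followers
feeds = defaultdict(list)                                    # reader -> materialized feed

def publish(author: str, post: str) -> None:
    # The write is more expensive: fan the post out to every follower's feed.
    for reader in followers.get(author, []):
        feeds[reader].append(f"{author}: {post}")

def read_feed(reader: str) -> list[str]:
    # The read is cheap: the feed was already materialized at write time.
    return feeds[reader]

publish("alice", "hello world")
publish("bob", "scalability notes")
print(read_feed("carol"))   # ['alice: hello world', 'bob: scalability notes']
```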
Thanks for reading. Next time, I will try to share learnings on maintainability. See you 👋.
This article was written while listening to Spotify’s This Is Chopin playlist in the background.