Using Binomial Distribution to Model Data Durability
Durability requirements influence the choice of data protection mechanisms, such as replication, erasure coding, and RAID parity configurations. Achieving higher durability involves trade-offs between redundancy, storage usage ratio, and computational complexity.
Replication achieves durability by creating multiple copies of data, which increases redundancy but reduces the storage usage ratio. In contrast, RAID levels like RAID 5 and RAID 6 use parity to provide fault tolerance, striking a balance between durability and storage efficiency. Similarly, erasure coding offers high durability with significantly lower storage overhead compared to replication and RAID parity configurations but introduces computational and rebuild complexity. Each method involves trade-offs in storage efficiency, cost, and performance, which must be carefully evaluated to meet specific durability needs.
This blog explores how different data protection mechanisms—replication, erasure coding, and RAID—achieve their number of 9s in durability. But first, let's define what data durability is.
Data durability refers to the ability of a storage system to preserve data over time without any loss or corruption. It quantifies the likelihood that data remains intact and recoverable, even in the face of hardware failures, power outages, or other disruptions. Durability is typically expressed in terms of the number of 9s, such as 99.999999999% (11 nines), which represents an extremely low probability of data loss.
For example, a storage system with 11 nines of durability suggests that if you store 10 million objects, you can on average expect to lose a single object once every 10,000 years. Achieving such high durability requires robust data protection strategies that can tolerate hardware failures and ensure long-term data integrity.
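To make that arithmetic concrete, here is a minimal Python sketch of the expected-loss estimate; the object count and durability figure are simply the example values above, not properties of any particular system.

```python
# Expected object loss for a system advertising 11 nines of annual durability.
annual_durability = 0.99999999999    # 11 nines: probability a given object survives one year
objects_stored = 10_000_000          # example fleet size

expected_losses_per_year = objects_stored * (1 - annual_durability)
years_per_lost_object = 1 / expected_losses_per_year

print(expected_losses_per_year)      # ~1e-4 objects lost per year
print(years_per_lost_object)         # ~10,000 years per lost object, on average
```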
Let's understand how data durability is calculated for each of the data protection schemes.
Erasure Coding
In erasure coding, data is divided into \( k \) data shards and \( m \) parity shards, collectively referred to as a \( (k,m) \) scheme. This configuration enables the reconstruction of the original data from any \( k \) out of the total \( k + m \) shards. In other words, the system can tolerate up to \( m \) shard failures without any data loss.
To give an example, in a \( (17,3) \) erasure coding scheme:
• A data object is split into 17 data blocks and 3 parity blocks.
• A total of 20 blocks are distributed across 20 disks.
This system can tolerate up to 3 disk failures without losing data, because any 17 blocks (out of 20 total) are sufficient to reconstruct the original data, and the storage overhead is only about 18% (3/17).
Let's now understand how data durability is calculated for an erasure coding scheme.
If the probability of failure of one shard over a year is given by its Annual Failure Rate (AFR), the probability that the shard fails during its Mean Time To Recover (MTTR) can be given as:
\( P(F) = AFR\times \frac{MTTR (days)}{365} \)
The probability of survival, \( P(S) \) of that shard can be given as \( 1-P(F) \).
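As a quick sketch, the per-shard probabilities translate to a couple of lines of Python; the AFR and MTTR values are the ones used for the worked examples later in this post:

```python
# Per-shard failure/survival probability over the repair window (MTTR).
AFR = 0.0041        # annual failure rate of a single shard (drive)
MTTR_DAYS = 6.5     # mean time to recover a failed shard, in days

p_fail = AFR * (MTTR_DAYS / 365)    # P(F): shard fails during the repair window
p_survive = 1 - p_fail              # P(S): shard survives the repair window

print(f"P(F) = {p_fail:.6e}")       # ~7.30e-05
print(f"P(S) = {p_survive:.8f}")
```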
Now, the probability of any outcome, i.e., failure or survival of that shard, can be expressed as:
\( P(F) + P(S) \)
Similarly, the probability of any outcome for \( n \) shards can be expressed as:
\( (P(F) + P(S))^{n} \)
Using the Binomial Theorem, this can be expanded as:
\( \sum_{k=0}^{n} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) P(F)^{k} P(S)^{n-k} \)
where \( k \) represents the number of failed shards and \( n-k \) the number of surviving shards. The binomial coefficient, \( \left(\! \begin{array}{c} n \\ k \end{array} \!\right) = \frac{n!}{k!(n-k)!} \), gives the number of ways to select \( k \) failed shards from a total of \( n \) shards.
Durability in erasure coding depends on the probability of losing more than \( m \) shards simultaneously. If the total number of shards is \( n = k + m \), and the probability of a single shard failing is \( p = P(F) \), the probability of losing more than \( m \) shards can be calculated using the binomial distribution as:
\( P(X>m) = \sum_{k=m+1}^{n} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) p^{k} (1-p)^{n-k} \)
\( X \) here is a random variable representing the number of failed shards out of \( n \) total shards in the system.
Using the complement rule, the durability \( P_{durability} = P(X \leq m) \) then is:
\( 1- P(X>m) \) or,
\( 1- \sum_{k=m+1}^{n} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) p^{k} (1-p)^{n-k} \) or,
\( \sum_{k=0}^{m} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) p^{k} (1-p)^{n-k} \)
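This sum maps directly to a small Python function. The name `ec_durability` is just an illustrative helper for this post; `math.comb` supplies the binomial coefficient:

```python
from math import comb

def ec_durability(n: int, m: int, p_fail: float) -> float:
    """P(X <= m): probability that at most m of the n shards fail during the
    repair window, under a binomial model with independent per-shard
    failure probability p_fail."""
    p_survive = 1.0 - p_fail
    return sum(comb(n, k) * p_fail**k * p_survive**(n - k) for k in range(m + 1))
```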
For an erasure coding (17, 3) system with 20 total shards, the durability can be expressed as:
\( P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) \) or,
\( \left(\! \begin{array}{c} 20 \\ 0 \end{array} \!\right) (P(F))^{0} (P(S))^{20} + \)
\( \left(\! \begin{array}{c} 20 \\ 1 \end{array} \!\right) (P(F))^{1} (P(S))^{19} + \)
\( \left(\! \begin{array}{c} 20 \\ 2 \end{array} \!\right) (P(F))^{2} (P(S))^{18} + \)
\( \left(\! \begin{array}{c} 20 \\ 3 \end{array} \!\right) (P(F))^{3} (P(S))^{17} \)
Calculating binomial coefficients, we get:
\( 1 \times (P(F))^{0} (P(S))^{20} + \)
\( 20 \times (P(F))^{1} (P(S))^{19} + \)
\( 190 \times (P(F))^{2} (P(S))^{18} + \)
\( 1140 \times (P(F))^{3} (P(S))^{17} \)
Substituting an AFR of 0.0041, an MTTR of 6.5 days, and the resulting values of \( P(F) \) and \( P(S) \), we get \( P_{durability} \) as:
\( 0.99999999999986243584 \) or,
\( \approx 0.999999999999 \)
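Putting the pieces together, a short sketch (reusing the illustrative `ec_durability` helper from above) reproduces this figure:

```python
from math import comb

def ec_durability(n, m, p_fail):
    """P(X <= m): at most m of n shards fail during the repair window."""
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k) for k in range(m + 1))

AFR, MTTR_DAYS = 0.0041, 6.5
p_fail = AFR * MTTR_DAYS / 365

# (17, 3) scheme: 20 shards total, up to 3 failures tolerated.
print(ec_durability(n=20, m=3, p_fail=p_fail))   # ~0.9999999999998624
```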
Conclusively, the closer \( P_{durability} \) is to 1 (or 100%), the more durable the system is, as it indicates a very low probability of data loss.
The distance of durability from 100% could also be expressed as:
\( y = 1 - P_{durability} \) or,
\( y = 1 - 0.999999999999 \) or,
\( y = 1E-12 \) or,
\( y = 10^{-12} \)
Taking the negative logarithm of both sides, we get:
\( -log_{10} y = -log_{10} (10^{-12}) \) or,
\( -log_{10} y = 12 \times log_{10} (10) \)
The number of 9s of durability for this system can then be expressed as:
\( -log_{10} y = 12 \)
In simpler terms, the smaller the distance of durability from 100%, i.e., \( y \), the larger the number of 9s of durability.
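In code, the same conversion from durability to a count of 9s is a one-liner; `nines` is an illustrative helper name, and floating-point precision limits it to roughly 15 nines:

```python
from math import floor, log10

def nines(durability: float) -> int:
    """Number of leading 9s in the durability figure, e.g. 0.999999999999 -> 12."""
    return floor(-log10(1 - durability))

print(nines(0.999999999999))   # 12
```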
The durability over a year, \( P_{durability.over.a.year} \) can be calculated as:
\( P_{durability.over.MTTR}^{(365/MTTR)} \) or,
\( 0.99999999999986243584^{(365/6.5)} \) or,
\( \approx 0.99999999999 \)
which is 11 nines of durability.
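The annualization step can be sketched the same way, under the same assumptions; the exponent is simply how many MTTR-sized repair windows fit into a year:

```python
from math import floor, log10

MTTR_DAYS = 6.5
repair_windows_per_year = 365 / MTTR_DAYS           # ~56.2 repair windows per year

durability_over_mttr = 0.99999999999986243584       # from the (17, 3) example above
durability_over_year = durability_over_mttr ** repair_windows_per_year

print(durability_over_year)                         # ~0.9999999999923
print(floor(-log10(1 - durability_over_year)))      # 11 nines
```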
Replication
Let's now understand how data durability is calculated for the simplest of all protection schemes: replication.
Suppose we replicate the data across 3 different disks to ensure sufficient durability, giving us a replication factor of \( 3 \). This means that the data will remain intact as long as all 3 disks do not fail simultaneously within the specified time period. In other words, for a replication factor of \( n \), the system can tolerate up to \( n-1 \) disk failures before data loss occurs.
The erasure coding \( (k,m) \) scheme gives us a storage overhead of only \( m/k \), which is significantly smaller than the \( n-1 \) storage overhead of replication, where \( n \) is the replication factor. That is a whopping 200% storage overhead for a replication factor of 3.
While the binomial distribution can be applied to calculate the durability of a replication-based data protection scheme and arrives at exactly the same result, the calculation simplifies here. Replication is concerned with the probability that all disks fail simultaneously, which can be directly expressed as \( P(F)^n \), where \( n \) is the replication factor.
In contrast, erasure coding asks a different question: “Do I still have enough pieces to reconstruct the whole data?” This makes the calculation reliant on the binomial distribution, as it involves evaluating multiple failure scenarios to determine whether the remaining shards are sufficient. Or in simpler terms: “If \( m \) shards have failed, have the remaining shards survived to reconstruct the data?”
The durability, \( P_{durability} \) can be expressed as:
\( 1 - P(F)^n \) or,
\( 1 - (AFR \times \frac{MTTR}{365})^3 \) or,
Substituting an AFR of 0.0041 and an MTTR of 6.5 days, we get:
\( 0.99999999999961076396 \) or,
\( \approx 0.999999999999 \)
The durability over a year, \( P_{durability.over.a.year} \) can be calculated as:
\( P_{durability.over.MTTR}^{(365/MTTR)} \) or,
\( 0.99999999999961076396^{(365/6.5)} \) or,
\( \approx 0.9999999999 \)
The number of 9s of durability can then be expressed as:
\( -log_{10} (1 - P_{durability}) \) or,
\( -log_{10} (1 - 0.9999999999) \) or,
\( -log_{10} (0.0000000001) \) or,
\( -log_{10} (10^{-10}) \) or,
\( 10 \times log_{10} (10) \) or,
\( 10 \)
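A corresponding sketch for 3-way replication, using the same assumed AFR and MTTR, reproduces these numbers:

```python
from math import floor, log10

AFR, MTTR_DAYS = 0.0041, 6.5
p_fail = AFR * MTTR_DAYS / 365
replication_factor = 3

# Data is lost only if every replica fails within the same repair window.
durability_over_mttr = 1 - p_fail ** replication_factor
durability_over_year = durability_over_mttr ** (365 / MTTR_DAYS)

print(durability_over_mttr)                          # ~0.9999999999996108
print(durability_over_year)                          # ~0.9999999999781
print(floor(-log10(1 - durability_over_year)))       # 10 nines
```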
If you notice, we borrowed the MTTR value from the EC scheme. However, EC is computationally more expensive because it involves calculating and reconstructing parities, unlike replication, which simply involves creating replicas. While the MTTR for replication may be shorter, resulting in slightly better durability than calculated here, EC still provides better overall durability and significantly better storage efficiency, making it the preferred choice for large-scale systems.
RAID Parity
The last data protection scheme that I would like us to delve into is RAID 6.
RAID 6 is a data protection mechanism that uses striping with double parity, allowing it to tolerate up to two simultaneous disk failures. The double parity ensures that the system remains operational even if two drives fail, making it more fault-tolerant than RAID 5.
Durability in RAID 6 depends on the probability of losing more than \( 2 \) shards simultaneously. If the total number of shards is \( n \), and the probability of a single shard failing is \( p = P(F) \), the probability of losing more than \( 2 \) shards can be calculated using the binomial distribution as:
\( P(X>2) = \sum_{k=3}^{n} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) p^{k} (1-p)^{n-k} \)
\( X \) here is a random variable representing the number of failed shards out of \( n \) total shards in the system.
Using the complement rule, the durability \( P_{durability} = P(X \leq 2) \) then is:
\( 1- P(X>2) \) or,
\( 1- \sum_{k=3}^{n} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) p^{k} (1-p)^{n-k} \) or,
\( \sum_{k=0}^{2} \left(\! \begin{array}{c} n \\ k \end{array} \!\right) p^{k} (1-p)^{n-k} \)
For a RAID 6 that has 8 data shards and 2 parity shards, the durability can be expressed as:
\( P(X = 0) + P(X = 1) + P(X = 2) \) or,
\( \left(\! \begin{array}{c} 10 \\ 0 \end{array} \!\right) (P(F))^{0} (P(S))^{10} + \)
\( \left(\! \begin{array}{c} 10 \\ 1 \end{array} \!\right) (P(F))^{1} (P(S))^{9} + \)
\( \left(\! \begin{array}{c} 10 \\ 2 \end{array} \!\right) (P(F))^{2} (P(S))^{8} \)
Calculating binomial coefficients, we get:
\( 1 \times (P(F))^{0} (P(S))^{10} + \)
\( 10 \times (P(F))^{1} (P(S))^{9} + \)
\( 45 \times (P(F))^{2} (P(S))^{8} \)
Substituting an AFR of 0.0041, an MTTR of 6.5 days, and the resulting values of \( P(F) \) and \( P(S) \), we get \( P_{durability} \) as:
\( 0.999999999953309576263 \) or,
\( \approx 0.9999999999 \)
The durability over a year, \( P_{durability.over.a.year} \) can be calculated as:
\( P_{durability.over.MTTR}^{(365/MTTR)} \) or,
\( 0.999999999953309576263^{(365/6.5)} \) or,
\( \approx 0.99999999 \)
The number of 9s of durability can then be expressed as:
\( -log_{10} (1 - P_{durability}) \) or,
\( -log_{10} (1 - 0.99999999) \) or,
\( -log_{10} (0.00000001) \) or,
\( -log_{10} (10^{-8}) \) or,
\( 8 \times log_{10} (10) \) or,
\( 8 \)
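The same binomial helper reproduces the RAID 6 figures under the assumed AFR and MTTR (`durability` here is the illustrative function from the erasure-coding section):

```python
from math import comb, floor, log10

def durability(n, m, p_fail):
    """P(X <= m): at most m of n shards fail during the repair window."""
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k) for k in range(m + 1))

AFR, MTTR_DAYS = 0.0041, 6.5
p_fail = AFR * MTTR_DAYS / 365

raid6_over_mttr = durability(n=10, m=2, p_fail=p_fail)   # 8 data + 2 parity shards
raid6_over_year = raid6_over_mttr ** (365 / MTTR_DAYS)

print(raid6_over_mttr)                            # ~0.9999999999533
print(raid6_over_year)                            # ~0.9999999974
print(floor(-log10(1 - raid6_over_year)))         # 8 nines
```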
If you notice, we again borrowed the MTTR value from the EC scheme. However, since RAID 6 relies on fixed parity calculations, it is less computationally expensive than EC. While a shorter MTTR may give RAID 6 improved durability for smaller datasets, its durability decreases rapidly as the number of data shards increases.
This trend is evident from the calculations below:
• For a total of 16 data shards, durability is 7 nines.
• For a total of 64 data shards, durability drops to 6 nines.
• For a total of 142 data shards, durability further decreases to 4 nines.
• For a total of 512 data shards, durability is reduced to just 3 nines.
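These figures can be reproduced approximately with the same sketch by holding the two parity shards fixed and growing the stripe width; the exact nine counts depend on the assumed AFR, MTTR, and rounding convention, but the downward trend is the same:

```python
from math import comb, log10

def durability(n, m, p_fail):
    """P(X <= m): at most m of n shards fail during the repair window."""
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k) for k in range(m + 1))

AFR, MTTR_DAYS = 0.0041, 6.5
p_fail = AFR * MTTR_DAYS / 365

# RAID 6 keeps 2 parity shards regardless of how many data shards it protects.
for data_shards in (8, 16, 64, 142, 512):
    over_year = durability(data_shards + 2, 2, p_fail) ** (365 / MTTR_DAYS)
    print(f"{data_shards:4d} data shards -> {-log10(1 - over_year):.1f} nines")
```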
Conclusively, EC provides better overall durability than RAID 6, with comparable if not better storage efficiency, making it the preferred choice for large-scale systems where high fault tolerance is critical.
Conclusion
Each data protection scheme comes with its own strengths and trade-offs—be it EC’s unmatched storage efficiency and high durability, replication’s simplicity at the cost of overhead, or RAID 6’s balance of parity and fault tolerance for smaller datasets.
The choice of a data protection mechanism should be guided by the specific requirements of your system: fault tolerance, storage efficiency, performance, and cost. While the calculations reveal the potential of each scheme, real-world scenarios involve additional complexities, such as hardware reliability, workload patterns, and repair mechanisms. Additionally, geographic redundancy, snapshots, and backups can play a crucial role in enhancing durability by protecting against large-scale failures and offering recovery options for accidental deletion or corruption.
Another important consideration is how drive failures are modeled. Drive failures unfold continuously over time rather than as a fixed number of discrete trials, and when one drive fails, the remaining drives (especially those installed at the same time) are more likely to fail soon after due to shared environmental or aging factors. In such cases, a Poisson distribution is often more suitable for modeling failures over time, as it describes the number of events occurring over a continuous time interval, unlike the binomial distribution, which treats failures as a fixed number of independent discrete trials.
Ultimately, designing for durability is about balancing these trade-offs to achieve a solution that meets your system’s unique needs. Understanding how durability is calculated not only demystifies those impressive “11 nines” but also empowers you to make informed decisions about your data’s safety.
Hope this post has been informative!
References
- Matt Explains: Binomial Coefficients
- Binomial Distribution
- Binomial Theorem
- Understanding Erasure Coding
- Backblaze Durability is 99.999999999% — And Why It Doesn't Matter
- Erasure Coding durability calculator
- Annual Failure Rate (AFR)
- How RAID-6 dual parity calculation works
- Poisson Distribution
- Maybe you shouldn’t be that concerned about data durability