The issue with BezVision's performance analysis is that it assumes a fault-free system. If any thing goes wrong e.g. the database goes down, the collectors will stop and no profiles would be created at all. This makes it impossible to find out the source of the blackout and to identify the workload that caused it.
The solution is to incorporate availability and reliability analysis and prediction into BezVision's toolset.
We can gather several metrics by analysing collection failures like MTTF, Probability, Up Time, Down time, Reliability etc.
We will need a heartbeat collector with timeout for DB, heartbeat queries for node (may be running out of memory), disk (may be full try checking its space if its below threshhold.), network etc.
We can then analyse to data and predict the availability of a system at various points in time.
We can also analyse the workload that caused the system (dbms/node/disk) failure.
Availability Analysis and Audits
This section draws heavily from the paper:
Stochastic Reward Nets for Reliability Prediction
This section draws heavily from the paper:
Exploiting Availability Prediction in Distributed Systems
We have to define pre-defined states of availability.
Class-1: Completely Offline
Class-16: Completely Online
Instead of queuing networks, we have to define predictor networks. But first lets briefly describe the types of predictors mentioned in the paper.
1. Right Now Predictor
Put simply. If a system is online now, it will be online for all periods in the future. And if its offline, it will be offline for all periods.
2. Saturating Counter Predictors
A saturating counter prediction uses n-bits of state and can assume 2^n states.
e.g. SatCount-2 predictor uses a 2-bit saturating counter.
Such a counter can assume four values (-2,-1,+1,and +2) which correspond to four uptime states (strongly offline, weakly offline, weakly online, and strongly online).
During each sampling period, the counter is incremented if the node is online, otherwise it is decremented; increments and decrements cannot move the counter beyond the saturated counts of -2 and +2. Predictions for all lookahead periods use the current value of the saturating counter, i.e., negative counter values produce "offline"predictions, whereas positive values result in "online" predictions.
Saturating Counter Predictiors are pretty tolerant to occasional deviations from long stretches of uniform uptime behavior. However, like the RightNow predictors, they are inaccurate for nodes with periodic uptimes.
3. State-Based Predictors History Predictor
+ve: To predict the behavior of nodes with periodic availabilities.
These predictors explicitly represent a node’s recent uptime states using a de Bruijn graph. A de Bruijn graph over k bits has a vertex for each binary number in [0, 2k-1]. Each vertex with binary label b1b2...bk has two outgoing edges, one to the node labeled b2b3...0 and the other to the node b2b3...1.
Suppose that we represent a node’s recent availability as a k-bit binary string, with bi equal to 0 if the node was offline during the ith most recent sampling period and 1 if it was online. A k-bit de Bruijn graph will represent each possible transition between availability states.
To assist uptime predictions, we attach a 2-bit saturating counter to each vertex. These counters represent the likelihood of traversing a particular outbound edge; negative counter values bias towards the 0 path, whereas positive values bias towards the 1 path. After each uptime sampling, the counter for the vertex representing the previous uptime state is incremented or decremented according to whether the new uptime sample represented an "online" edge or an "offline" edge.
To make an uptime prediction for n time steps into the future, we trace the most likely path of n edges starting from the vertex representing the current uptime state. If the last bit we shift in is a 1, we predict the node will be online in n time units, otherwise we predict that it will be offline.
4. Twiddle History Predictors
Suppose that a node has a fundamentally cyclical uptime pattern, but the pattern is “noisy.” For example, a machine might be online 80% of the time between midnight and noon and always offline at other times. If the punctuated downtime between midnight and noon is randomly scattered, the de Bruijn graph will accumulate infrequently visited vertices whose labels contain mostly 1's but differ in a small number of bit positions.
As the length of time that we observe the node grows, noisy downtime will generate increasingly more vertices whose labels are within a few bit-flips of each other. Probabilistically speaking, we should always predict that the node will be online from midnight to noon.
However, the many vertices representing this time interval are infrequently visited and thus infrequently updated. Their counters may have weak saturations (-1 or +1) that poorly capture the underlying cyclic availability.
For nodes like this, we can nudge predictions towards the probabilistically favored ones by considering superpositions of multiple uptime states. Given a vertex v representing the current uptime history, we make a prediction by considering v’s counter and the counters of all observed vertices whose labels differ from v's by at most d bits.
For example, suppose that k=3 and d=1, and thateach of the 2k = 8 possible vertices corresponds to an actually observed uptime history. To make a prediction for the next time step when the current vertex has the label 111, we average the counter values belonging to vertices 111, 110, 101, and 011. If the average is greater than 0, we predict "online" otherwise we predict "offline".
The TwiddledHistory strategy will perform worse than the regular History strategy when vertices within d bits of each other correspond to truly distinct uptime patterns. In these situations, superposition amalgamates state belonging to unrelated availability behavior, reducing prediction accuracy.
5. Linear Predictor
It uses a linear combination of the last k signal points to predict future points. The k coefficients are chosen to reduce the magnitude of an error signal, which is assumed to be uncorrelated with the underlying “pure” signal. To make availability predictions for t time steps into the future, we iteratively evaluate the linear combination using the k most recent availability samples, shifting out the oldest data point and shifting in the predicted data point. Linear prediction produces good estimates for signals that are stable in the short term but oscillatory in the medium to long term.
We would expect this technique to work well with nodes having diurnal uptimes, e.g., machines that are online during the work day and offline otherwise.
6. Hybrid Predictor
A machine can transition between multiple availability patterns during its lifetime.
To dynamically select the best predictor for a given uptime pattern and lookahead interval, we employ a Hybrid predictor.
For each lookahead period of interest, the Hybrid predictor maintains tournament counters. These saturating counters determine the best predictor to use for that lookahead period.
Negative counter values select the left input, whereas positive values select the right.
Consider a Hybrid predictor making forecasts for an n-sample lookahead period. At the beginning of each time unit, the Hybrid predictor samples the current uptime state of its node. Its five sub-predictors are updated with this state, and each sub-predictor makes a prediction for n time units into the future.
The final output of the Hybrid predictor is selected via tournaments as shown in Figure 2, and the individual sub-predictions are placed in a queue and timestamped with curr time + n. If the head of the queue contains an entry whose timestamp matches the current time, the entry is dequeued and the tournament counters are updated using the dequeued predictions.
A tournament counter remains unchanged if both of the relevant dequeued predictions match the current uptime state or both do not match.
Otherwise, one prediction was right and the other was wrong, and the tournament counter is incremented or decremented appropriately. In the last stage of the update, the curr time value is incremented.