# Find MKAverage

I recently encountered an interesting question on leetcode.com, the last question on the Weekly Contest 236. Let’s read the question description together first.

You are given two integers,mandk, and a stream of integers. You are tasked to implement a data structure that calculates theMKAveragefor the stream.

TheMKAveragecan be calculated using these steps:

1. If the number of the elements in the stream isless than myou should consider theMKAverageto be-1. Otherwise, copy the lastmelements of the stream to a separate container.

2. Remove the smallestkelements and the largestkelements from the container.

3. Calculate theaveragevalue for the rest of the elementsrounded down to the nearest integer.

Essentially, the question asks to find the *mean* of a sliding window consisting of *m ***sorted** numbers after trimming *k *outlier values on both sides. It imitates a real-life scenario of handling streaming data. Removing outliers decreases the variability of the data, thus increase statistical significance, and finding the mean of the data requires real-time analysis of the data. Coming up with a highly optimized algorithm with an efficient runtime complexity is the fundamental requirement of the problem.

# Sliding Window

First, implementing a sliding window is relatively straightforward. We need a data structure that can efficiently remember the order of visited numbers and remove the chronologically oldest number if the size of the data set is *greater than m*. The perfect candidate for these operations is a queue. A queue has First-In-First-Out (FIFO) access pattern, meaning new additions to a line made to the back of the queue while removing (or serving) happen in the front.

# Outliers

The nutshell of the question is to identify the smallest *k *elements and the largest *k *elements efficiently. I used three sets, for convenience, called *left*, *mid*, and *right*. Each set shares the following characteristics:

- It is easy to find the smallest and the largest values in the set.
- Looking up if a value is in the set can be done efficiently.
- The set can contain duplicates.

The *left* set contains *k* smallest values, and the *right* set *k* largest values. This way, the *mid* set has all the numbers without the outliers. Consequently, the largest element in the *left* set would be smaller or equal to the smallest element in the *mid* set, and the largest element in the *mid* set would be smaller or equal to the smallest element in the *right* set.

The* **Multiset* in C++ standard library satisfies all of the above characteristics. It is an *ordered* set that can contain *duplicate *values. You can query the smallest and the largest values using iterators. Checking if a number is in the set is quickly done in an associated set.

# Mean

The *mid* set is essentially a subset that excludes the outliers. One optimization is to maintain a running *sum* of all of the values in the *mid* set. The mid-set size is always *m − 2 × k *because *m* keeps the sliding window constant. The running sum changes its value when a number is added to or removed from the *mid* set.

Let’s go through a sample case together!

We have *m* as six and *k* as 2, which means that our sliding window will contain six elements, and the *mean* calculation will exclude the smallest two and the largest two elements.

We have a series of inputs coming in consist of [3, 1, 5, 4, 2, 2, 2, 3, 7]. Let’s see how the queue and the three sets change.

Initially, incoming numbers are added to the *left* set until the size of the queue reaches *m*. Then, sets are balanced. First, the largest values in the *left* set are transferred to the *mid* set until **the size of the left set becomes k**.

Then, the largest values in the *mid* set are transferred to the *right* set until **the size of the right set becomes k**.

Here is how the sets look when balanced.

Next, number two comes in. As the queue size is *greater than m*, the queue must de-queue the first element, three. The deletion process is simple: find which set contains the number, then remove it. Make sure also to subtract the number from the *sum* if it is in the *mid* set. In this case, since three was in the *mid* set, the running sum after the deletion becomes two.

A new element is always added to the *left* set, as previously discussed. Since the size of the *left* set is *greater than k*, transfer the largest element to the *mid* set.

Balanced again.

Let’s try the next shift. The *left* set adds three and removes one. The *sum* does not change as the deletion does not happen in the *mid* set. The size of the *left* set remains the same, so we do not transfer.

However, now the largest element in the *left* set is *greater than* the smallest element in the *mid* set. We need to trade these two to ensure that the *left* set is always smaller or equal to the *mid* set.

Balanced again.

This time, the left set adds seven. This time, five is found in the *right* set. The size of the left *set* is *greater than k*, so let’s perform the transfer from the *left* to the *mid* set.

Now the *right* set is *smaller than k*, so we need to transfer the largest element in the *mid* to the *right* set.

Let’s recap what’s going on here. Two operations occur when a new element is added: **deleting the de-queued element **and** balancing the sets**.

# Deletion

Deletion needs to check if which set contains the element. If the set happens to be the *mid* set, then the running sum is subtracted by the value.

# Balancing the sets

Balancing the sets is like how water flows down. As new elements are added to the left set first, we need to transfer from *left* to *mid*, then from *mid* to *right* set. However, we also need to confirm that the largest element in the previous set is equal to or smaller than the smallest element in the next set. Otherwise, trade the elements around to make them balanced. Here are the four steps when broken down:

- While the size of the left set is greater than k, transfer the largest element from the left to the mid set.
- Confirm that the largest value in the left is equal to or smaller than the smallest value in the mid set.
- While the size of the right set is smaller than k, transfer the largest element from the mid to the right set.
- Confirm that the largest value in the mid is equal to or smaller than the smallest value in the right set.

There you go! I know that this question is not easy and requires multiple data structures around, but it’s a great question that makes you more exposed to streaming data.

Here is the complete implementation. Thank you for reading, and please comment at the bottom for further clarification. This is my first post on Medium, so please be kind to me :)