I recently encountered an interesting question on leetcode.com, the last question on the Weekly Contest 236. Let’s read the question description together first.
You are given two integers, m and k, and a stream of integers. You are tasked to implement a data structure that calculates the MKAverage for the stream.
The MKAverage can be calculated using these steps:
1. If the number of the elements in the stream is less than m you should consider the MKAverage to be -1. Otherwise, copy the last m elements of the stream to a separate container.
2. Remove the smallest k elements and the largest k elements from the container.
3. Calculate the average value for the rest of the elements rounded down to the nearest integer.
Essentially, the question asks to find the mean of a sliding window consisting of m sorted numbers after trimming k outlier values on both sides. It imitates a real-life scenario of handling streaming data. Removing outliers decreases the variability of the data, thus increase statistical significance, and finding the mean of the data requires real-time analysis of the data. Coming up with a highly optimized algorithm with an efficient runtime complexity is the fundamental requirement of the problem.
First, implementing a sliding window is relatively straightforward. We need a data structure that can efficiently remember the order of visited numbers and remove the chronologically oldest number if the size of the data set is greater than m. The perfect candidate for these operations is a queue. A queue has First-In-First-Out (FIFO) access pattern, meaning new additions to a line made to the back of the queue while removing (or serving) happen in the front.
The nutshell of the question is to identify the smallest k elements and the largest k elements efficiently. I used three sets, for convenience, called left, mid, and right. Each set shares the following characteristics:
- It is easy to find the smallest and the largest values in the set.
- Looking up if a value is in the set can be done efficiently.
- The set can contain duplicates.
The left set contains k smallest values, and the right set k largest values. This way, the mid set has all the numbers without the outliers. Consequently, the largest element in the left set would be smaller or equal to the smallest element in the mid set, and the largest element in the mid set would be smaller or equal to the smallest element in the right set.
The Multiset in C++ standard library satisfies all of the above characteristics. It is an ordered set that can contain duplicate values. You can query the smallest and the largest values using iterators. Checking if a number is in the set is quickly done in an associated set.
The mid set is essentially a subset that excludes the outliers. One optimization is to maintain a running sum of all of the values in the mid set. The mid-set size is always m − 2 × k because m keeps the sliding window constant. The running sum changes its value when a number is added to or removed from the mid set.
Let’s go through a sample case together!
We have m as six and k as 2, which means that our sliding window will contain six elements, and the mean calculation will exclude the smallest two and the largest two elements.
We have a series of inputs coming in consist of [3, 1, 5, 4, 2, 2, 2, 3, 7]. Let’s see how the queue and the three sets change.
Initially, incoming numbers are added to the left set until the size of the queue reaches m. Then, sets are balanced. First, the largest values in the left set are transferred to the mid set until the size of the left set becomes k.
Then, the largest values in the mid set are transferred to the right set until the size of the right set becomes k.
Here is how the sets look when balanced.
Next, number two comes in. As the queue size is greater than m, the queue must de-queue the first element, three. The deletion process is simple: find which set contains the number, then remove it. Make sure also to subtract the number from the sum if it is in the mid set. In this case, since three was in the mid set, the running sum after the deletion becomes two.
A new element is always added to the left set, as previously discussed. Since the size of the left set is greater than k, transfer the largest element to the mid set.
Let’s try the next shift. The left set adds three and removes one. The sum does not change as the deletion does not happen in the mid set. The size of the left set remains the same, so we do not transfer.
However, now the largest element in the left set is greater than the smallest element in the mid set. We need to trade these two to ensure that the left set is always smaller or equal to the mid set.
This time, the left set adds seven. This time, five is found in the right set. The size of the left set is greater than k, so let’s perform the transfer from the left to the mid set.
Now the right set is smaller than k, so we need to transfer the largest element in the mid to the right set.
Let’s recap what’s going on here. Two operations occur when a new element is added: deleting the de-queued element and balancing the sets.
Deletion needs to check if which set contains the element. If the set happens to be the mid set, then the running sum is subtracted by the value.
Balancing the sets
Balancing the sets is like how water flows down. As new elements are added to the left set first, we need to transfer from left to mid, then from mid to right set. However, we also need to confirm that the largest element in the previous set is equal to or smaller than the smallest element in the next set. Otherwise, trade the elements around to make them balanced. Here are the four steps when broken down:
- While the size of the left set is greater than k, transfer the largest element from the left to the mid set.
- Confirm that the largest value in the left is equal to or smaller than the smallest value in the mid set.
- While the size of the right set is smaller than k, transfer the largest element from the mid to the right set.
- Confirm that the largest value in the mid is equal to or smaller than the smallest value in the right set.
There you go! I know that this question is not easy and requires multiple data structures around, but it’s a great question that makes you more exposed to streaming data.
Here is the complete implementation. Thank you for reading, and please comment at the bottom for further clarification. This is my first post on Medium, so please be kind to me :)