Top K Frequent Keywords

Given a list of reviews, a list of keywords and an integer k. Find the most frequent k keywords in order of most to least frequently mentioned.

  • The comparison of strings is case-insensitive.
  • Multiple occurrences of a keyword in a review should be considered as a single mention.
  • If keywords are mentioned an equal number of times in reviews, sort alphabetically.

Example

Input:
k = 2
keywords = ["services", "algorithms", "inteview"]
reviews = [
  "algorithms and Me provides the best services for the interview preparations",
  "Also, their mock interview services awesome services",
  "Best services provided by Algorithms and me, everyone should use their services",
]

Output:
["services", "algorithms"]

Explanation:
"services" is occurring in 3 different reviews, and "algorithms" occurs in 2 reviews. Note that even though "services" occur two times in the same review, it is counted only once.

Rule of thumb, if you have to find top k of anything, give priority queue a thought, and to understand priority queues, it is very important to understand the fundamentals of heaps.

This is a straight forward application of priority queues. All we have to do is count the frequency of each word in the keywords list in the given list of reviews. Once we have the frequency count, use priority queue to find the top K keywords.

Implementation notes
1. Convert the list of keywords into a set, to have an efficient lookup.
2. Split each review into words and check if the word is the keyword set.
3. If yes, increment the frequency count.
4. Put the map of words to frequency into the priority queue and pluck top K words. Adjust the comparator function to have order on the frequency and then alphabetically.

Top K frequent keywords in reviews

public List<String> findKWordsInReviews(int k,
                       List<String>keywords, List<String> reviews) {

        List<String> res = new ArrayList<>();

        //For fast look up. additional space O(n)
        Set<String> set = new HashSet<>(keywords);

        //To store the freq count of keywords
        Map<String, Integer> map = new HashMap<>();

        //Go through each review
        for(String review : reviews) {
            //Split the reviews in words
            String[] words = review .split("\\W");

            //So that we do not count word twice
            Set<String> added = new HashSet<>();
            for(String word : words) {
                //As it is case sensitive
                word = word.toLowerCase();
                //If word is keywords and not yet added
                if(set.contains(word) && !added.contains(word)) {
                    map.put(word, map.getOrDefault(word, 0) + 1);
                    added.add(word);
                }
            }
        }

        //Add the map created into the priority queue.
        Queue<Map.Entry<String, Integer>> maxHeap = new PriorityQueue<>(
                (a, b)->a.getValue() == b.getValue()
                        ? a.getKey().compareTo(b.getKey())
                        : b.getValue() - a.getValue());

        //Add everything to PQ
        maxHeap.addAll(map.entrySet());

        //Take out top k words
        while(!maxHeap.isEmpty() && k-- > 0) {
            res.add(maxHeap.poll().getKey());
        }
        return res;
    }

The complexity of this code depends on the following :
1. No of keywords let’s say is n
2. No of reviews, m
3. K

If we are using max heap approach, building a max heap with n keywords will take O(n), while taking out k words from it, will take O(klogn).

Splitting the reviews into words will take O(m * average length of reviews) with additional space depending on the number of words in reviews.

Priority Queue problems

A priority queue a data structure that keeps things in order of priority. The question is how is it different from the monotonic queue we discussed earlier. The difference is in the time complexity and applicability. The time complexity to find insert an element in the priority queue is O(logn), whereas in monotonic queues it is O(n). On the point of applicability, priority queue orders the entire input set, whereas monotonic queues maintain only partial input set in a particular order, and whenever an order is violated, we remove elements from the monotonic queues.

Before we do in detail of priority queues, please refresh fundamentals of heaps.

A priority queue is a collection in which items can be added at any time, but the only item that can be removed is the one with the highest priority.

It means that no matter what the insertion order of elements, removal will always give you the highest or lowest priority element. Internally, a priority queue is implemented as heaps. A heap is a data structure that as the following property:

A parent is greater than both its children (max heap) or a parent is less than both its children (min-heap)

How do I identify that it is a priority queue question? Almost all the time, the problem statement will have top-K, smallest-k, most frequent K in it. It means this problem has something to do with ranking and for ranking problems best data structure is a priority queue.

Priority queue in Java

If your choice of language in the interview is Java, you are in luck, because Java has this in build class called priority queue.

PriorityQueue pq = new PriorityQueue();

With the above instantiation, you can define a min priority queue, it means the smallest number will be polled first.

If you want to declare it in such a way that the largest number is polled first, all you have to do is pass a comparator. We can use an in-built function provided by Comparator.

PriorityQueue pq = new PriorityQueue(Comparator.reverseOrder());

What if you are intending to store objects in a priority queue, then this default will not work as they work on integers or long only. This is a point you must read on comparators in Java. You can pass a comparator to the constructor, it will be used by the queue to order objects accordingly.

 PriorityQueue<Map.Entry<Integer, Integer>> pq = new PriorityQueue<>(
            (a, b) -> a.getValue().compareTo(b.getValue()) //Comparator
        );

You will find problems where you have to first find the frequency or priority of elements, we use HashMap to map element to its priority.

Use Map.Entry as an object to be stored in the queue. It helps us avoid creating new objects

Enough of theory, let’s get to the business, how can I solve problems using priority queues?

Top K Frequent Elements

Given a non-empty array of integers, return the k most frequent elements. For example: nums = [1,1,1,2,2,3], k = 2 Output: [1,2]

From the problem statement, it is evident that it is a ranking problem and we know the data structure to use. All we have to do is collect the respective frequency of elements. However, we have two ways to add elements to the priority queue. The first way is to create a map of element and frequency and add the whole map into the priority queue in decreasing order of frequency (max heap). Take out K element from the priority queue and that will be your K top elements.
The time complexity of the solution is O(n) to add all the elements in the heap, and then O(klogn) to take K elements out. Remember, each take-out incurs O(logn) complexity due to heapification process.

The second way is to add only K elements in the priority queue in min heap order. We keep the size of the queue to constant K, if it goes more, we just take our the top, which will be the minimum element. Once we have processed all the elements, elements in the queue will be top K elements
The time complexity of the solution is O(nlogk). Remember, each take-out and insert incurs O(logn) complexity due to heapification process.
Depending on the relationship between n and k, you can chose which method to use.

 public List<Integer> topKFrequent(int[] nums, int k) {
        
        
        Map<Integer, Integer> map = new HashMap<>();
        for(int i : nums){
            map.put(i, map.getOrDefault(i,0) + 1);
        }
        
        PriorityQueue<Map.Entry<Integer, Integer>> pq = new PriorityQueue<>(
            (a, b) -> a.getValue().compareTo(b.getValue())
        );
        /* We chose the second method. apparently,
           it give better performance in leetcode
        */
        for(Map.Entry e : map.entrySet()){
            pq.offer(e);
            if(pq.size() > k){
                pq.poll();
            }
        }
        
        List<Integer> res = new ArrayList<>();
        pq.stream().forEach(e -> res.add(e.getKey()));
        
        return res;
        
    }

Top K Frequent Words

Given a non-empty list of words, return the k most frequent elements. The answer should be sorted by frequency from highest to lowest. If two words have the same frequency, then the word with the lower alphabetical order comes first.

It is more or less the same problem as above, the only change is we have to add comparator to take care of alphabetical order.

    public List<String> topKFrequent(String[] words, int k) {
        
        Map<String, Integer> map = new HashMap<>();
       
        PriorityQueue<Map.Entry<String, Integer>> pq 
            = new PriorityQueue<>(
                       (a,b) -> a.getValue()==b.getValue() 
            ? b.getKey().compareTo(a.getKey()) : a.getValue()-b.getValue()
        );
        
        for(int i=0; i<words.length; i++){
            map.put(words[i], map.getOrDefault(words[i], 0) + 1);
        }
        
        for(Map.Entry<String, Integer> entry: map.entrySet()){
            pq.offer(entry);
            if(pq.size() > k){
                pq.poll();
            }
        }
        
        List<String> res = new LinkedList<>();
        while(!pq.isEmpty())
            res.add(0, pq.poll().getKey());
        
        return res;
    }

Kth Largest Element in an Array

Find the kth largest element in an unsorted array. Note that it is the kth largest element in the sorted order, not the kth distinct element.

class Solution {
    public int findKthLargest(int[] nums, int k) {
        
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        int i = 0;
        for(; i<nums.length; i++){
            pq.add(nums[i]);
            if(pq.size() > k)
                pq.poll();
        }   
        
        return pq.peek();
    }
}

K Closest Points to Origin

We have a list of points on the plane. Find the K closest points to the origin (0, 0). (Here, the distance between two points on a plane is the Euclidean distance.)

    public int[][] kClosest(int[][] points, int K) {
        
        PriorityQueue<int[]> pq = new PriorityQueue<>(
                (a, b) -> ((a[0] - 0 ) * (a[0] - 0 )  + (a[1] - 0 ) * (a[1] - 0 )
                           - (b[0] - 0 ) * (b[0] - 0 ) - (b[1] - 0 ) * (b[1] - 0 ))
        );
            
        for(int i=0; i<points.length; i++){
            pq.add(new int[]{points[i][0], points[i][1]});
        }
            
        int [][] res = new int[K][2];
        for(int i=0; i<K && !pq.isEmpty(); i++){
            res[i] = pq.poll();
        }
            
        return res;
        
    }

Sort Characters By Frequency

Given a string, sort it in decreasing order based on the frequency of characters. For example, S = “tree”, the output should be “eert”

    public String frequencySort(String s) {
        
        PriorityQueue<Map.Entry<Character, Integer>> pq 
            = new PriorityQueue<>((a,b) -> b.getValue() - a.getValue());
        
        Map<Character, Integer> map = new HashMap<>();
       
        for(int i=0; i<s.length(); i++){
            map.put(s.charAt(i), 1 + map.getOrDefault(s.charAt(i), 0));
        }
        
        pq.addAll(map.entrySet());
       
        StringBuilder sb = new StringBuilder();
        while (!pq.isEmpty()) {
            Map.Entry<Character, Integer> e = pq.poll();
            int n = (int)e.getValue();
            for (int i = 0; i < n;  i++) 
                sb.append(e.getKey());
        }
        return sb.toString();
    }

Find median in a stream of integer

This problem is covered in detail here: Median in a stream of integers.

 PriorityQueue<Integer> pqMin;
    PriorityQueue<Integer> pqMax;
    
    /** initialize your data structure here. */
    public MedianFinder() {
         pqMax = new PriorityQueue<>(Collections.reverseOrder());
         pqMin = new PriorityQueue<>();
    }
    
    public void addNum(int num) {
        /* If size of max heap, i.e left bunch of numbers is 
           greater than the min heap, right bunch of numbers,
           add the number in left bunch and take the Max and 
           add it to right bunch */
        if (pqMax.size() > pqMin.size()) {
            pqMax.offer(num);
            pqMin.offer(pqMax.poll());
        } else {
            //Do the opposite.
            pqMin.offer(num);
            pqMax.offer(pqMin.poll());
        }
        
    }
    
    public double findMedian() {
        /*If the size of two heaps is equal, 
           even number of integers, take the average of two
           middle elements */
        if(pqMin.size() == pqMax.size()){
            return (double)(pqMin.peek() + pqMax.peek())/2.0;
        }
        
        //Else return from the left bunch
        return (double)(pqMax.peek() * 1.0);
    }
}

Find K-th Smallest Pair Distance

Given an integer array, return the kth smallest distance among all the pairs. The distance of a pair (A, B) is defined as the absolute difference between A and B.

  private int findKth(int [] a, int k){
        PriorityQueue<Integer> pq = 
                 new PriorityQueue<>(Comparator.reverseOrder());
        
        for(int i=0; i<a.length; i++){
            for(int j=i+1; j<a.length; j++){
                int diff =  Math.abs(a[i]-a[j]);
                pq.offer(diff);
                
                if(pq.size() > k){
                    pq.poll();
                }
            }
        }
        return pq.peek();
    }    

The above solution works but due to the constraints, it may run out of memory and time. A better solution is based on a binary search and trial and error methods.

Merge k Sorted Lists

Merge k sorted linked lists and return it as one sorted list. For example:
Input:
[
1->4->5,
1->3->4,
2->6
] Output: 1->1->2->3->4->4->5->6

private ListNode mergeList(ListNode[] lists){
        
        if(lists.length == 0) return null;
        
        PriorityQueue<ListNode> pq = new PriorityQueue<>(
            (a,b) -> ((Integer)a.val).compareTo(b.val)
        );
        
        for(int i=0; i<lists.length; i++){
            if(lists[i] != null)
                pq.offer(lists[i]);
        }
        
        ListNode head = pq.poll();
        ListNode tail =  head; 
        if(head != null && head.next != null)
            pq.add(head.next);
    
        while(!pq.isEmpty()){
            tail.next = pq.poll();
            tail = tail.next;
            if(tail.next != null)
                pq.offer(tail.next);
            
        }
        
        return head;
    }

As we see from the above code examples, it is easy to identify a priority queue problem and simple to solve. Please share your views and best of luck with your preparations.

Minimum cost of connecting n ropes

Given n ropes of different lengths, connect these ropes in one rope. The cost to connect two ropes is equal to the sum of their lengths. We have to connect these ropes with minimum cost.

For example,

Input:
5, 2, 3, 9.
Output:
34
Explanation:
We can connect the ropes in the following way: First, connect the ropes of lengths 2 and 3, the cost of this connection is the sum of lengths of ropes which is 2 + 3 = 5. We are left with three ropes with lengths5,5, and 9. Next, connect the ropes of lengths 5 and 5. The cost of connection is 10. Total cost till now is 5 + 10 = 15. We have two ropes left with lengths 10 and 9. Finally, connect the last two ropes and all ropes have connected, Total Cost would be 15 + 19 = 34.

Another way of connecting ropes would be: connect ropes with lengths 5 and 9 first (we get three ropes of 3, 2, and 14), then connect 14 and 3, which gives us two ropes of lengths 17 and 2. Finally, we connect 19 and 2. Total cost in this way is 14 + 17 + 21 = 52. which is much higher than the optimal cost we had earlier.

Connect n ropes with minimum cost: A priority queue problem

When we were doing calculations in examples, did you notice one thing? Lengths of the ropes connected first are added subsequently in all the connections. For example, we connected ropes with length 2 and 3 in the first example, it gets added to next connect as part of the rope with length 5, and again when we connect the ropes with lengths 15 and 9, 2 + 3 is already inside 15.

Read Huffman coding to understand how to solve this problem from this hint.

All we have to make sure that the most repeated added rope is the smallest, then the second smallest and so on. This gives the idea that if we sort the ropes by their sizes and add them, sort again the array again until there are no ropes to add. It will always give us the optimal solution to connect ropes.

What will be the complexity of sorting based implementation? The complexity will be dominated by the sorting algorithm, best we can achieve is O(nlogn) using quicksort or merge sort. Also, connecting two ropes we have to sort the arry again. So overall complexity of this method is O(n2logn)

Can we do better than this? Do we need the array sorted at all times? All we need is the two ropes with the least length. What data structure gives us the minimum element in the least time. Priority queue will be our data structure. If we create a min-heap with lengths of ropes, we can easily find the two ropes with the least length in O(1) complexity.

  1. Create a min heap (priority queue) from the array of rope lengths
  2. Fetch the root which will give us smallest rope
  3. Fetch the root again which will give us second smallest rope
  4. Add two ropes and put it back into heap (heapify)
  5. Go back to step 2

connect n ropes with minimum cost

Min cost to connect ropes java implementation

 
package com.company;

import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

/**
 * Created by sangar on 3.1.19.
 */
public class ConnectRopes {

    public int getMinimumCost(int[] ropeLength){

        PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();

        /*
        There is no shortcut for converting from 
        int[] to List<Integer> as Arrays.asList
        does not deal with boxing and will just create a List<int[]>
        which is not what you want.
         */
        List<Integer> list = Arrays.stream(ropeLength)
                            .boxed().collect(Collectors.toList());

        /*
        Javadoc seems to imply that addAll is 
        inherited from AbstractQueue where
        it is implemented as a sequence of adds.
        So complexity of this operation is O(nlogn)
         */
        minHeap.addAll(list);

        int totalLength = 0;

        while(minHeap.size() > 1){
            int len1 = (int)minHeap.remove();
            int len2 = (int)minHeap.remove();

            totalLength+=(len1 + len2);

            minHeap.add(len1+len2);
        }

        return totalLength;
    }
}
Test cases
package test;

import com.company.ConnectRopes;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
/**
 * Created by sangar on 23.9.18.
 */
public class ConnectRopeTest {

    ConnectRopes tester = new ConnectRopes();

    @Test
    public void minimumCostTest() {

        int[] a = {5,2,3,9};

        assertEquals(24, tester.getMinimumCost(a));
    }
    @Test
    public void minimumCostOneRopeTest() {

        int[] a = {5};

        assertEquals(0, tester.getMinimumCost(a));
    }
}

The complexity of this implementation is O(nlogn) (to create min heap out of an array in java Priority queue) + O(nlogn) (to fetch two minimum and re-heapify). However, initial complexity to build a heap from the array can be brought down to O(n) by using own implementation of min heap.

Please share if there is something wrong or missing. If you are preparing for interviews and want to receive a daily email with an interview question, please signup here

Most frequent words in file

Most frequent words in file

In last post:  find K smallest element in an array, we learned some concepts to find top or bottom element in a given set. Let’s apply those concept again to a different problem called most frequent words in a file.

Given a text file which contains strings/words, find n most frequently occurring words in it i.e. n words whose count is the maximum.

For example, if given file is like follows: one, two, two, three, three, three, four,four, four,four, five, five, five, five, five, six,six, six, six, six, six  and we are looking for three most frequent words, output should be :  six,five, four.

Line of thoughts

Brute force solution would be really easy, scan the whole file, get the count for each unique words in file. Then sort the output based on that count in descending order and then return first n words.

This problem has three parts to it. First, read all words from file, second created a map which store frequency count of each word on file. Third is to get top n words from that map.

Reading a file is pretty standard operation in any language.  Brush upon Java basics here. We will not focus on that and also that’s not the intention of this question.
Assuming we can read from the file, how can we store frequency count against words. Simple, use a hash map. We read word from file, store in hash if it is not already there with count 1. If words is already in hash map, update the count by 1.

After this step, we have a hash with count of occurrence of each word. Now comes the challenging part:  how to find top n or most frequent words from this hash map. Mind you that hash map does not store elements in sorted or any order. 

Rule of thumb is to find top or least from collection of items, heaps are best bet. To find top N most frequently occurring words, let’s try heap. 
Which heap would be best?  We need to get a limited set, if we have free entry in that set till n words(as we need n top words). All further words have to pass a test before they enter the set. 

If new word is less than the least frequent word in the set, what can be said about this new word? Well, it cannot be in top n words right? 
If new word has frequency more than word with least frequency in set, new word should enter the set and  word with least frequency should be moved out.
What we just explained is typical use of min heap, as it give least element at the root. To find top n most frequent words in file, below is the algorithm.

Most frequent words in file : algorithm

  1. Take first N words appearing in map along with their count and create a min heap with these N words.
  2. One by one read words from hash and check if frequency of new word is more than least frequent word on heap, i.e word at root of heap.
  3. If yes, remove root and add new word in min heap. If not, continue with next word.
  4. When done with all words in hash, min heap will contain N most frequently occurring words in file.

Implementation

package com.company;

import javafx.util.Pair;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;

/**
 * Created by sangar on 1.10.18.
 */
public class FrequentWords {

	public static HashMap<String, Integer> readFile(String fileName)
		throws IOException {
	HashMap<String, Integer> wordMap = new HashMap<>();

	Path path = Paths.get(fileName);
	try (Scanner scanner =  new Scanner(path)){
		while (scanner.hasNextLine()){
			String line = scanner.nextLine();
			if(wordMap.containsKey(line)) {
				wordMap.put(line, wordMap.get(line) + 1);
			}
			else{
				wordMap.put(line, 1);
			}
		}
	}
	return wordMap;
	}

	public static ArrayList<String> mostFrequentWords(String fileName, int n){
		ArrayList<String> topWords = new ArrayList<>();

		try {
			HashMap<String, Integer> wordMap = readFile(fileName);
			PriorityQueue<Pair<String, Integer>> pq =
				new PriorityQueue<>(n, (x,y) -> x.getValue().compareTo(y.getValue()));

			int i = 0;
			Iterator it = wordMap.entrySet().iterator();
			/*
				Get first n words on heap
			*/
			while(it.hasNext()){
				if(i == n) break;
				HashMap.Entry<String, Integer> entry =
					(HashMap.Entry<String, Integer>)it.next();
				pq.add(new Pair<>(entry.getKey(), entry.getValue()));
				it.remove();
				i++;
			}

			/*
				Check all other words, if anyone more than least
				remove the least and add the new word.
			*/
			for (String key : wordMap.keySet()){
				if(pq.peek().getValue() < wordMap.get(key)){
					pq.poll();
					pq.add(new Pair<>(key, wordMap.get(key)));
				}
			}
			while(!pq.isEmpty()){
				topWords.add(pq.poll().getKey());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
		return topWords;
	}
	
	public static void main(String[] args){  
		System.out.println(mostFrequentWords("/home/sangar/Documents/test.txt", 3));
	}
}

Reading M words from file will require O(m) time, where as creating N element heap will take O(n). Also, scanning through all words and inserting them on to heap has complexity of O((m-n) log n). Overall complexity to find top n most frequent words in fileis O(m log m).


Median of integers stream

Median of integers stream

We solve two problems which involved streams, first was to find first non repeated character in stream and second was LRU cache. Let’s discuss another problem which is to find median of integers stream. Problem statement is like this: Given continuous stream of integers, find median of integers stream received till given point of time. Median can be asked at multiple times.

To understand problem better, ask yourself, what is a median?

The median is the value separating the higher half from the lower half of a data sample. For a data set, it may be thought of as the “middle” value.

Wikipedia

For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the sample.

Median of sorted array of integers is element at middle index of array if size of array is odd and average of elements at mid and mid +1 elements if size of array is even.

Now, that we understood the definition of median. let’s go back to our problem and take an example to understand it further. Problem is that we get integers from a stream, one by one and at any given point of time, we have to return median of set of integers received till now. 
First, stream throws 12, then 7 and then 8. What will be the median now? It will be 8, because if we arrange 12,7,8 in sorted order, 8 is element at middle. What if we get 11 next? Well, now sorted order looks like 7,8,11,12. As size of set is even, we take average of mid and mid+1 element which is 9.5.

Median of integers stream : thoughts

What will be the brute force solution? As integers are processed from stream, store them in an array. Can we store element randomly? If yes, to find median, we have to sort array every time. Complexity of this method to find median in stream of integers will be O(n log n) dominated by the sorting algorithm.
How about we insert element in array in sorted order. This will make complexity of processing integer from stream O(n2), as we have to move n elements to right in worst case.
Another underlying problem in using array here is that we do not know how many integers will come out of stream, so it will be very difficult to pre-allocate memory for it. Linked list can solve that problem, however, it does not reduce complexity of processing, at the same increases the complexity of finding median to O(n) from O(1).

Think of this, do we need completely sorted set of  integers before we can calculate the median? Actually, we need kth smallest element of array if size of set is odd and average of kth and k+1th element if size of set is even, k will be n/2. 

However, we do not have pre-processed array with us. What is the property of the median? Median is greater than all elements on left of it and less than all elements on the right side of it, where the number of elements on both groups is equal or differs by 1.

Median of integers stream : Heaps

How about we split the incoming integers into two halves. Whenever median is asked, we can get the maximum of one half and return it as median, if the size of two halves differ by 1 or return of average of the max of one half and minimum of other halves if the size of two halves is equal.

What data structure is best to find min and max in constant time? Heap it is. In this case, we will need two heaps, one max and another min heap. Max heap will store all the elements on the left side of median and min heap will store all the elements on the right side of the median.

How to balance the size difference between the two heaps? Insert new processed integer into the max heap,  if the size of the max heap is 2 more than min heap, extract the maximum element from the max heap and put it in min heap.

Also, maintain the property that all the elements on the max heap should be less than elements on the min heap. So, whenever the root of the max heap greater than the root of min heap, it should be removed from the max heap and added to the min heap.

Let’s take an example and understand the method first and the make concrete algorithm out of it. We have the first number from the stream as 12, what should we do? We decided to put every number on the max heap to start with.

median of integer stream
Add the integer to max heap as both heaps are empty at this point of time

Now, comes the integer 7. First of all, we add a new integer to the max heap. This will create a difference in size of the min and max heap more than one. In that case, we will take out the max from the max heap and put it into the min heap.

median of integers stream
7 is added to the max heap, which makes size difference of more than 1.
So, the root of the max heap (12) is moved to min heap

Next integer is 18, what happens now. We add into the max heap. Difference between sizes is not more than 1, However, the root of the max heap (18) is greater than the root of min heap (12). In this case, too, we take the root of the max heap and move it to the min heap. At this point, if the median of integers stream is asked, return the root of min heap which is 12.

18 is added to max heap, however now the root of max heap is more than the root of the min heap, so it should be removed from the max heap
median of stream
18 is removed from the max heap and added to the min heap.

Come the integer 10, it goes into the max heap, does not create any size difference and the root of the max heap is less than the root of the min heap. At this point, the median of the stream of integers till now is 11 ((10+12)/2).

median of stream of integers
10 is added to the max heap.

.New integer from the stream is 11. As usual, add the new integer to the max heap, size difference remains less than 2 and 11 is less than the root of the min heap (12).
What should be the median now? At this point, the size of the max heap is more than the min heap, hence we will return the root of the max heap (11)

median of integer stream
11 is added to max heap

Median of a stream of integers: Algorithm

  1. Process integer from the stream and add it to the max heap.
  2. If the root of max heap greater than the root of the min heap:
    1. Delete the root from the max heap
    2. Add removed integer from the max heap to the min heap
  3. If the size difference between the two heaps is more than 2:
    1. Remove the root of the heap which has more elements.
    2. Add removed node to another heap.
  4. To calculate the median:
    1. If the size of both heaps equal, return average of their roots.
    2. Else, return the root of the heap with more elements.

Median of integers stream : Implementation

Implementation involves priority queue in Java, refer to Stack Overflow question on how to use priority queue as a max heap.

package com.company;

import java.util.Collections;
import java.util.PriorityQueue;

/**
 * Created by sangar on 18.10.18.
 */
public class MedianOfIntegerStream {
    private PriorityQueue maxHeap;
    private PriorityQueue minHeap;

    public MedianOfIntegerStream(){
        maxHeap = new PriorityQueue(Collections.reverseOrder());
        minHeap = new PriorityQueue();
    }

    public double getMedian(){
        if(maxHeap.size() == minHeap.size())
            return (double)((int)maxHeap.peek() + (int)minHeap.peek())/2;

        if(maxHeap.size() > minHeap.size())
            return (double)(int)maxHeap.peek();

        return (double)(int)minHeap.peek();

    }

    public void processInteger(int data){
        maxHeap.add(data);

        if(maxHeap.size() - minHeap.size() > 1
                || ( minHeap.size() > 0 
				&& (int)maxHeap.peek() > (int)minHeap.peek())){
            minHeap.add(maxHeap.poll());
        }

        if(minHeap.size() - maxHeap.size() > 1){
            maxHeap.add(minHeap.poll());
        }
    }
}

Test cases for median in integers stream

package test;

import com.company.MedianOfIntegerStream;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

/**
 * Created by sangar on 23.9.18.
 */
public class MedianOfIntegerStreamTest {

    MedianOfIntegerStream tester = new MedianOfIntegerStream();

    @Test
    public void baseTest() {

        tester.processInteger(12);
        tester.processInteger(7);

        assertEquals(9.5, tester.getMedian() );
    }

    @Test
    public void maxHeapWithMoreElementsTest() {

        tester.processInteger(12);
        tester.processInteger(7);
        tester.processInteger(9);

        assertEquals(9, tester.getMedian() );
    }

    @Test
    public void minHeapWithMoreElementsTest() {

        tester.processInteger(12);
        tester.processInteger(7);
        tester.processInteger(9);
        tester.processInteger(13);
        tester.processInteger(15);

        assertEquals(12, tester.getMedian() );
    }

    @Test
    public void minHeapSizeMoreThanTwoDifferenceTest() {

        tester.processInteger(12);
        tester.processInteger(7);
        tester.processInteger(9);
        tester.processInteger(13);
        tester.processInteger(15);
        tester.processInteger(17);
        tester.processInteger(19);

        assertEquals(13, tester.getMedian() );
    }

    @Test
    public void maxHeapGetsTheElementTest() {

        tester.processInteger(12);
        tester.processInteger(7);
        tester.processInteger(9);
        tester.processInteger(13);
        tester.processInteger(15);
        tester.processInteger(17);
        tester.processInteger(5);
        assertEquals(12, tester.getMedian() );
    }
}

Complexity of processing is O(log n) to insert an element into any heap. However, fetching median in stream of integers at any given time is O(1).

Please share if there is something wrong or missing. Please signup if you want to receive curated interview material for your preparation.

Merge k sorted arrays

Given k sorted arrays of varying lengths, merge these k sorted arrays into one sorted array.

For example, given 3 arrays:
merge k sorted arrays

The resulting array should be like

merge k sorted array

Companies this problem is asked in

Microsoft, Amazon, Facebook, Salesforce, Indeed

Merge k sorted arrays divide and conquer

Since all the input arrays are sorted, the first element in the output sorted array will be one of these first elements of input arrays. How can we find the minimum among all the elements plucked from the first index of each array? Easy, take those k elements (there are k arrays, so k first elements) and build a min-heap. The root of the min-heap will be the least element among each of the first elements of the given k sorted arrays, i.e.

result[0] = min(arr1[0], arr2[0], arr3[0]…arrK[0])

merging k sorted arrays

The initial root above will be the first element in the result array. Now the second element for the result array can be found from the set of first elements of all input arrays except the array from which the first element of result array was taken. For example, if arr1 had the least of all first elements while finding the initial root, then:

result[1] = min(arr1[1], arr2[0], arr3[0] … arrK[0])

merge k sorted arrays java

Next iteration, 5 will be picked and put into the result array and index of arr 3 will be increased.
merging k sorted arrays

After putting 5 in the result array, we will move the next element in array 3 to the min-heap.
merge n sorted arrays

In order to know which array gave the minimum element at a particular time, we will store additional information of about array and index at which minimum element was.

If i represents the array number, and j represents the index of the minimum number currently in the heap from the ith array, then we add (j+1)th element to the min-heap next and re-heapify.
If we have put all the element from the ith array in the heap then we need to reduce the size of min-heap to k-1.

Follow the procedure for (n-1)*k times. When all array elements are processed the result array will be the sorted array for all nk element.

Algorithm

  • Build min heap with the first element of all k arrays.
  • Pick the root of min element and put it in the result array.
  • If there are remaining elements in the array,  put next element at the root of min heap and heapify again
  • If all elements are already of an array are processed, reduce the size of min heap by 1.
  • Repeat step 2, 3 and 4 till min heap is empty.

Show me the implementation

package com.company;
import java.util.PriorityQueue;

/**
 * Created by sangar on 2.12.18.
 */
public class MergeKSortedArrays {
    private class HeapNode {
        public int arrayNum;
        public int index;
        public int value;

        public HeapNode(int arrayNum, int index, int value) {
            this.arrayNum = arrayNum;
            this.index = index;
            this.value = value;
        }
    }

    public int[] mergeKSortedArrays(int[][] arrays) {

        if (arrays == null) return null;

        PriorityQueue<HeapNode> minHeap =
                new PriorityQueue<>(arrays.length,
                        (HeapNode a, HeapNode b) -> a.value - b.value);

        int size = 0;
        for (int i = 0; i < arrays.length; i++) {
            size += arrays[i].length;
        }
        int[] result = new int[size]; // k * n

        //add first elements in the array to this heap
        for (int i = 0; i < arrays.length; i++) {
            minHeap.add(new HeapNode(i, 0, arrays[i][0]));
        }

        //Complexity O(n * k * log k)
        for (int i = 0; i < size; i++) {
            //Take the minimum value and put into result
            HeapNode node = minHeap.poll();

            if (node != null) {
                result[i] = node.value;
                if (node.index + 1 < arrays[node.arrayNum].length) {
                    //Complexity of O(log k)
                    minHeap.add(new HeapNode(node.arrayNum,
                            node.index + 1,
                            arrays[node.arrayNum][node.index + 1]));
                }
            }
        }
        return result;
    }
}

The complexity of the code to merge k sorted arrays is O(nklogk) along with space complexity of O(k).

Please share if there is something wrong or missing. If you are preparing for an interview, please sign up to receive interview preparation kit for free.

Heaps fundamentals

In this post will learn heap fundamentals, which are very important for solving priority queue problems. Heap is a kind of data structure based on the complete tree principle. By definition, a complete binary search tree of N levels has at least 2^N-1 nodes in it. There two properties heap holds :

1. Structural property: This means every level of the heap will be completely full except the last level. The last level will be filled from left to right order.
2. Heap property: It means parent node will be either greater or smaller than its children nodes. There are two kinds of heaps based on second property:

Max Heap
Max heap maintains the property that every parent node is greater than its children. The root of the max heap is the greatest element.

heap fundamentals

Min Heap
Min heap maintains the property that every parent node is less than its children. The root of the min-heap is the least element.

min heap

Implementation Notes
Usually, heaps are implemented with an array, which eases traversing from child to parent and parent to child; and viewed as a tree. The height of a heap with n elements will be O(logN). 

Children of a parent node i are 2i and 2i+1.
Parent of a node i will be [i/2].

    private int parent(int pos) {
        return pos / 2;
    }
    
    private int leftChild(int pos){
        return (2 * pos);
    }

    private int rightChild(int pos){
        return (2 * pos) + 1;
    }
    

Given that multiply and divide by 2 can be very efficiently implemented by left and right shifting, these operations are very efficient.

Maximum elements in a heap with height h will be 2h-1 while the minimum number of elements in heap with the same height will be 2h-1+1 (One more than nodes with height h-1)

Heap operations

1. Insert an element in heap
To insert a new element in a heap, insert it as the leftmost position available. This new element may violate the heap property, for example, in a min-heap, the newly added element is less than the parent node. We have to push this newly added node to its correct position, this process is called a heapification.
Let’s take an example and see how it works.
insert in heap

    public void insert(int element) {
        if (size >= maxsize) {
            return;
        }

        Heap[++size] = element;
        int current = size;

        while (Heap[current] < Heap[parent(current)]) {
            swap(current, parent(current));
            current = parent(current);
        }
    }

The complexity of this procedure is for O(logn).

2. Pop from heap
The deletion of node is usually done at the root. To delete a root, replace the root node with the last node in heap and then do downshift. As the new node replacing root may violate heap property, we need to check with its children. In Max heap, check if new node is greater than both its children. If not then swap it with the largest child and then again repeat it till node replacing root finds its valid place.

Algorithm (This is for max heapify)

  1. Let i be the index which we doubt that might violate heap property.
  2. left = 2i, right = 2i+1, largest = a[i]
  3. Check if left child is with in array limits, if yes, check if it is greater than the a[i].
  4. If left is greater than a[i], then largest till now is left, largest = left.
  5. Check if right is with in array limit, if yes, check if it is greater than largest, then change the largest, largest = right.
  6. Swap largest and a[i].
  7. Now, repeat step 1 to 6 with largest, till we see an element which does not violate heap property.

Let’s take an example:
pop in heap

   public int pop() {
        int popped = Heap[1];
        Heap[1] = Heap[size--];
        minHeapify(1);
        return popped;
    }
    private void swap(int i, int j) {
        int tmp = Heap[i];
        Heap[i] = Heap[j];
        Heap[j] = tmp;
    }

    // Function to heapify the node at pos
    private void minHeapify(int pos){

        if (!isLeaf(pos)) {
            if (Heap[pos] > Heap[leftChild(pos)]
                    || Heap[pos] > Heap[rightChild(pos)]) {

                if (Heap[leftChild(pos)] < Heap[rightChild(pos)]) {
                    swap(pos, leftChild(pos));
                    minHeapify(leftChild(pos));
                }
                else {
                    swap(pos, rightChild(pos));
                    minHeapify(rightChild(pos));
                }
            }
        }
    }

3. Building a heap from a given array of elements.
This operation can easily be done using the heapify methods explained above. 

  1. Start from the middle element of the array, let’s say i
  2. Heapify with given index.
  3. Decrease index by one. Repeat step 2 till we reach first element.
   public void minHeap() {
        for (int pos = (size / 2); pos >= 1; pos--) {
            minHeapify(pos);
        }
    }

The complexity of this procedure is O(n).

How this complexity becomes O(n) while for adding each node will need logn time to heapify and if there are n nodes, it should be O(nlogn). The complexity of the heapify method for a particular element depends on its position in the heap. It takes O(1) time when the node is a leaf node (which makes up at least half of the nodes) and O(log n) time when it’s at the root.
If the height of the heap is h, the number of nodes will be 2h. So 2h/2 = 2h-1 nodes never move. 2h-2 nodes move by one level and so on.

4. Heapsort
Heap sort combines property of both insertion sort (no extra space required) and merge sort (time complexity being O(nlogn)). It can be easily achieved by using the above two procedures.

  1. Build max heap from the given array, with complexity of O(n)
  2. Swap first and last element of the array. Last element is now at its proper position.
  3. Decrease the size of heap by 1 to be heapify.
  4. Heapify with first element of the array.
  5. Repeat step 2 , 3 and 4 until there are elements to be sorted.

The complexity of the above procedure is O(nlogn).

Problems based on heaps
Reference : http://www.amazon.com/Introduction-Algorithms-Thomas-H-Cormen/dp/0262033844

Find Kth smallest element in array

Given an array of integers which is non sorted, find kth smallest element in that array. For example: if input array is A = [3,5,1,2,6,9,7], 4th smallest element in array A is 5, because if you sort the array A, it looks like A = [1,2,3,5,6,7,9] and now you can easily see that 4th element is 5.

Companies asked in

This problem is commonly asked in Microsoft and Amazon interviews as it has multiple layers and there are some many things that can be tested with this one problem.

Kth smallest element : Line of thought

First of all, in any interview, try to come up with brute force solution. Brute force solution to find Kth smallest element in array of integers would be to sort the array and return A[k-1] element (K-1 as array is zero base indexed).

What is the complexity of brute force solution? It’s O(n2)? Well, we have sort algorithms like merge sort and heap sort which work in O(nlogn) complexity.

The problem with both searches is that they use additional space. Quick sort is another sorting algorithm. It has problem that it’s worst-case complexity will be O(n2), which happens when input is completely sorted.
In our case, the input is given as unsorted already, so we can expect that quicksort will function with O(n log n) complexity which is its average-case complexity. Advantage of using quicksort is that there is no additional space complexity.

Optimising quick sort

Let’s see how quicksort works and see if we can optimize solution further?
Idea behind quicksort is to find the correct place for the selected pivot. Once the pivot is at the correct position, all the elements on the left side of the pivot are smaller and on the right side of the pivot are greater than the pivot. This step is partitioning.

If after partitioning, pivot is at position j, can we say that pivot is actually jth smallest element of the array? What if j is equal to k? Well problem solved, we found the kth smallest element.

If j is less than k, left subarray is less than k, we need to include more elements from right subarray, therefore kth smallest element is in right subarray somewhere. We have already found j smallest elements, all we need to find is k-j elements from right subarray.

What if j is greater than k? In this case, we have to drop some elements from left subarray, so our search space would be left subarray after partition.

Theoretically, this algorithm still has the complexity of O(n log n), but practically, you do not need to sort the entire array before you find k smallest elements.

If you are preparing for a technical interview and need personal coaching along with mock interviews, book a free demo session with us

Algorithm to find Kth smallest element in array

  1. Select a pivot and partition the array with pivot at correct position j
  2. If position of pivot, j, is equal to k, return A[j].
  3. If j is less than k, discard array from start to j, and look for (k-j)th smallest element in right sub array, go to step 1.
  4. If j is greater than k, discard array from j to end and look for kth element in left subarray, go to step 1

Let’s take an example and see if this algorithm works? A =  [4, 2, 1, 7, 5, 3, 8, 10, 9, 6 ], and we have to find fifth smallest element in array A.

Kth smallest element in array

Start with pivot as first index of array, so pivot = 0, partition the array into two parts around pivot such that all elements on left side of pivot element, i.e. A[pivot] are smaller and all elements on right side are greater than A[pivot].

Start with pivot as first index of array, so pivot = 0, partition the array into two parts around pivot such that all elements on left side of pivot element, i.e. A[pivot] are smaller and all elements on right side are greater than A[pivot].

In our example, array A will look like below after pivot has found it’s the correct position.

kth smallest element
After partition, correct position of pivot is index 3

If pivot == k-1 (array is represented as zero base index), then A[pivot] is kth smallest element. Since pivot (3) is less than k-1 (4), look for kth smallest element on right side of the pivot.

k remains as it is as opposed to k-j mentioned in algorithm as pivot is given w.r.t entire array and not w.r.t subarray.

In second iteration, pivot = 4 (index and not element). After second execution of quick sort array A will be like

kth smallest element
After partition of right subarray, correct position of pivot is index 4

pivot(4) which is equal to k-1(5-1). 5th smallest element in array A is 5.

Implementation

package com.company;

/**
	* Created by sangar on 30.9.18.
*/
public class KthSmallest {
	private void swap(int[] a, int i, int j){
		int temp = a[i];
		a[i] = a[j];
		a[j] = temp;
	}
	private int partition(int[] a, int start, int end){
		int pivot = a[start];
		int i  = start+1;
		int j  = end;

		while(i <= j){
			while(a[i] < pivot) i++;
			while(a[j] > pivot) j--;

			if(i < j) {
				swap(a, i, j);
			}
		}
		swap(a, start, j);
		return j;
	}

	public int findKthSmallestElement(int a[], int start, 
				int end, int k){
		if(start <= end){
		int p = partition(a, start, end);
		if(p == k-1){
			return a[p];
		}
		if(p > k-1)
			return findKthSmallestElement(a, start, p, k);
		if(p < k-1)
			return findKthSmallestElement(a, p+1, end, k);
		}
		return -1;
	}
}

Test cases

package test;

import com.company.KthSmallest;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

/**
 * Created by sangar on 28.8.18.
 */
public class KthSmallestTest {

	KthSmallest tester = new KthSmallest();
	private int[] a = {4, 2, 1, 7, 5, 3, 8, 10, 9};
	@Test
	public void kthSmallest() {
		assertEquals(7, tester.findKthSmallestElement(a,0,8,6));
	}

	@Test
	public void firstSmallest() {
		assertEquals(1, tester.findKthSmallestElement(a,0,8,1));
	}

	@Test
	public void lastSmallest() {
		assertEquals(10, tester.findKthSmallestElement(a,0,8,9));
	}

	@Test
	public void kGreaterThanSize() {
		assertEquals(-1, tester.findKthSmallestElement(a,0,8,15));
	}
	@Test
	public void emptyArray() {
		int[] a = {};
		assertEquals(-1, tester.findKthSmallestElement(a,0,0,1));
	}

	@Test
	public void nullArray() {
		assertEquals(-1, tester.findKthSmallestElement(null,0,0,1));
	}
}

Complexity of using quicksort algorithm to find the kth smallest element in the array of integers is still O(n logn).

Kth smallest element using heaps

Before going into details of this problem, I strongly recommend reading heap fundamentals.

Imagine a case where there are a billion integers in the array and you have to find 5 smallest elements from that array. The complexity of O(n log n) is too costly for that use case. Above algorithm using quicksort does not take into consideration disparity between k and n.

We want top k elements, how about we chose those k elements randomly, call it set A and then go through all other n-k elements, call it set B, check if element from set B (n-k elements) can displace element in set A (k elements)?

What will be the condition for an element from set B to replace an element in set A? Well, if the new element is less than maximum in set A than the maximum in set A cannot be in the set of k smallest elements right?  Maximum element in set A would be replaced by the new element from set B.

Now, the problem is how to quickly find the maximum out of set A. Heap is the best data structure there. What kind of heap: min heap or max heap? Max heap as it store the maximum of the set at the root of it.

Let’s defined concrete steps to find k smallest elements using a max heap. 

  1. Create a max heap of size k from first k elements of array.
  2. Scan all elements in array one by one.
    1.  If current element is less than max on heap, add current element to heap and heapify.
    2. If not, then go to next element.
  3. At the end, max heap will contain k smallest elements of array and root will be kth smallest element.

Let’s take an example and see if this algorithm works? The input array is shown below and we have to find the 6th smallest element in this array.

kth smallest element
input array

Step 1 : Create a max heap with first 6 elements of array.

Create a max heap with set A

Step 2: Take the next element from set B and check if it is less than the root of max heap. In this case, yes it is. Remove the root and insert the new element into max heap.

kth largest element
Element from set B removes root from max heap and added to max heap

Step 2: It continues to 10, nothing happens as the new element is greater than the root of max heap. Same for 9.  At 6, again the root of max heap is greater than 6. Remove the root and add 6 to max heap.

nth smallest number in an integer array
Again, new element from set B is less than root of max heap. Root is removed and new element is added.

Array scan is finished, so just return the root of the max heap, 6 which is the sixth smallest element in given array.

Implementation

	public int findKthSmallestElementUsingHeap(int a[], int k){
	//https://stackoverflow.com/questions/11003155/change-priorityqueue-to-max-priorityqueue

	PriorityQueue<Integer>  maxHeap =
			new PriorityQueue<>(k, Collections.reverseOrder());

		if(a == null || k > a.length) return -1;
		//Create max with first k elements
		for(int i=0; i<k; i++){
			maxHeap.add(a[i]);
		}

		/*Keep updating max heap based on a new element
		If new element is less than root, 
		remove root and add new element
		*/

		for(int i=k; i<a.length; i++){
			if(maxHeap.peek() > a[i]){
				maxHeap.remove();
				maxHeap.add(a[i]);
			}
		}
		return maxHeap.peek();
	}

Can you calculate the complexity of above algorithm? heapify() has complexity of log(k) with k elements on heap. In worst case, we have to do heapify() for all elements in array, which is n, so overall complexity of algorithm becomes O(nlogk). Also, there is additional space complexity of O(k) to store heap.
When is very small as compared to n, this algorithm again depends on the size of array.

We want k smallest elements, if we pick first k elements from a min heap, will it solve the problem? I think so. Create a min heap of n elements in place from the given array, and then pick first k elements.
Creation of heap has complexity of O(n), do more reading on it. All we need to do is delete k times from this heap, each time there will be heapify(). It will have complexity of O(log n) for n element heap. So, overall complexity would be O(n + klogn).

Depending on what you want to optimize, select correct method to find kth smallest element in array.

Please share if there is something wrong or missing. If you are interested in taking coaching sessions from our experienced teachers, please reach out to us at [email protected]