Top K Frequent Keywords

Given a list of reviews, a list of keywords and an integer k. Find the most frequent k keywords in order of most to least frequently mentioned.

  • The comparison of strings is case-insensitive.
  • Multiple occurrences of a keyword in a review should be considered as a single mention.
  • If keywords are mentioned an equal number of times in reviews, sort alphabetically.

Example

Input:
k = 2
keywords = ["services", "algorithms", "inteview"]
reviews = [
  "algorithms and Me provides the best services for the interview preparations",
  "Also, their mock interview services awesome services",
  "Best services provided by Algorithms and me, everyone should use their services",
]

Output:
["services", "algorithms"]

Explanation:
"services" is occurring in 3 different reviews, and "algorithms" occurs in 2 reviews. Note that even though "services" occur two times in the same review, it is counted only once.

Rule of thumb, if you have to find top k of anything, give priority queue a thought, and to understand priority queues, it is very important to understand the fundamentals of heaps.

This is a straight forward application of priority queues. All we have to do is count the frequency of each word in the keywords list in the given list of reviews. Once we have the frequency count, use priority queue to find the top K keywords.

Implementation notes
1. Convert the list of keywords into a set, to have an efficient lookup.
2. Split each review into words and check if the word is the keyword set.
3. If yes, increment the frequency count.
4. Put the map of words to frequency into the priority queue and pluck top K words. Adjust the comparator function to have order on the frequency and then alphabetically.

Top K frequent keywords in reviews

public List<String> findKWordsInReviews(int k,
                       List<String>keywords, List<String> reviews) {

        List<String> res = new ArrayList<>();

        //For fast look up. additional space O(n)
        Set<String> set = new HashSet<>(keywords);

        //To store the freq count of keywords
        Map<String, Integer> map = new HashMap<>();

        //Go through each review
        for(String review : reviews) {
            //Split the reviews in words
            String[] words = review .split("\\W");

            //So that we do not count word twice
            Set<String> added = new HashSet<>();
            for(String word : words) {
                //As it is case sensitive
                word = word.toLowerCase();
                //If word is keywords and not yet added
                if(set.contains(word) && !added.contains(word)) {
                    map.put(word, map.getOrDefault(word, 0) + 1);
                    added.add(word);
                }
            }
        }

        //Add the map created into the priority queue.
        Queue<Map.Entry<String, Integer>> maxHeap = new PriorityQueue<>(
                (a, b)->a.getValue() == b.getValue()
                        ? a.getKey().compareTo(b.getKey())
                        : b.getValue() - a.getValue());

        //Add everything to PQ
        maxHeap.addAll(map.entrySet());

        //Take out top k words
        while(!maxHeap.isEmpty() && k-- > 0) {
            res.add(maxHeap.poll().getKey());
        }
        return res;
    }

The complexity of this code depends on the following :
1. No of keywords let’s say is n
2. No of reviews, m
3. K

If we are using max heap approach, building a max heap with n keywords will take O(n), while taking out k words from it, will take O(klogn).

Splitting the reviews into words will take O(m * average length of reviews) with additional space depending on the number of words in reviews.

Priority Queue problems

A priority queue a data structure that keeps things in order of priority. The question is how is it different from the monotonic queue we discussed earlier. The difference is in the time complexity and applicability. The time complexity to find insert an element in the priority queue is O(logn), whereas in monotonic queues it is O(n). On the point of applicability, priority queue orders the entire input set, whereas monotonic queues maintain only partial input set in a particular order, and whenever an order is violated, we remove elements from the monotonic queues.

Before we do in detail of priority queues, please refresh fundamentals of heaps.

A priority queue is a collection in which items can be added at any time, but the only item that can be removed is the one with the highest priority.

It means that no matter what the insertion order of elements, removal will always give you the highest or lowest priority element. Internally, a priority queue is implemented as heaps. A heap is a data structure that as the following property:

A parent is greater than both its children (max heap) or a parent is less than both its children (min-heap)

How do I identify that it is a priority queue question? Almost all the time, the problem statement will have top-K, smallest-k, most frequent K in it. It means this problem has something to do with ranking and for ranking problems best data structure is a priority queue.

Priority queue in Java

If your choice of language in the interview is Java, you are in luck, because Java has this in build class called priority queue.

PriorityQueue pq = new PriorityQueue();

With the above instantiation, you can define a min priority queue, it means the smallest number will be polled first.

If you want to declare it in such a way that the largest number is polled first, all you have to do is pass a comparator. We can use an in-built function provided by Comparator.

PriorityQueue pq = new PriorityQueue(Comparator.reverseOrder());

What if you are intending to store objects in a priority queue, then this default will not work as they work on integers or long only. This is a point you must read on comparators in Java. You can pass a comparator to the constructor, it will be used by the queue to order objects accordingly.

 PriorityQueue<Map.Entry<Integer, Integer>> pq = new PriorityQueue<>(
            (a, b) -> a.getValue().compareTo(b.getValue()) //Comparator
        );

You will find problems where you have to first find the frequency or priority of elements, we use HashMap to map element to its priority.

Use Map.Entry as an object to be stored in the queue. It helps us avoid creating new objects

Enough of theory, let’s get to the business, how can I solve problems using priority queues?

Top K Frequent Elements

Given a non-empty array of integers, return the k most frequent elements. For example: nums = [1,1,1,2,2,3], k = 2 Output: [1,2]

From the problem statement, it is evident that it is a ranking problem and we know the data structure to use. All we have to do is collect the respective frequency of elements. However, we have two ways to add elements to the priority queue. The first way is to create a map of element and frequency and add the whole map into the priority queue in decreasing order of frequency (max heap). Take out K element from the priority queue and that will be your K top elements.
The time complexity of the solution is O(n) to add all the elements in the heap, and then O(klogn) to take K elements out. Remember, each take-out incurs O(logn) complexity due to heapification process.

The second way is to add only K elements in the priority queue in min heap order. We keep the size of the queue to constant K, if it goes more, we just take our the top, which will be the minimum element. Once we have processed all the elements, elements in the queue will be top K elements
The time complexity of the solution is O(nlogk). Remember, each take-out and insert incurs O(logn) complexity due to heapification process.
Depending on the relationship between n and k, you can chose which method to use.

 public List<Integer> topKFrequent(int[] nums, int k) {
        
        
        Map<Integer, Integer> map = new HashMap<>();
        for(int i : nums){
            map.put(i, map.getOrDefault(i,0) + 1);
        }
        
        PriorityQueue<Map.Entry<Integer, Integer>> pq = new PriorityQueue<>(
            (a, b) -> a.getValue().compareTo(b.getValue())
        );
        /* We chose the second method. apparently,
           it give better performance in leetcode
        */
        for(Map.Entry e : map.entrySet()){
            pq.offer(e);
            if(pq.size() > k){
                pq.poll();
            }
        }
        
        List<Integer> res = new ArrayList<>();
        pq.stream().forEach(e -> res.add(e.getKey()));
        
        return res;
        
    }

Top K Frequent Words

Given a non-empty list of words, return the k most frequent elements. The answer should be sorted by frequency from highest to lowest. If two words have the same frequency, then the word with the lower alphabetical order comes first.

It is more or less the same problem as above, the only change is we have to add comparator to take care of alphabetical order.

    public List<String> topKFrequent(String[] words, int k) {
        
        Map<String, Integer> map = new HashMap<>();
       
        PriorityQueue<Map.Entry<String, Integer>> pq 
            = new PriorityQueue<>(
                       (a,b) -> a.getValue()==b.getValue() 
            ? b.getKey().compareTo(a.getKey()) : a.getValue()-b.getValue()
        );
        
        for(int i=0; i<words.length; i++){
            map.put(words[i], map.getOrDefault(words[i], 0) + 1);
        }
        
        for(Map.Entry<String, Integer> entry: map.entrySet()){
            pq.offer(entry);
            if(pq.size() > k){
                pq.poll();
            }
        }
        
        List<String> res = new LinkedList<>();
        while(!pq.isEmpty())
            res.add(0, pq.poll().getKey());
        
        return res;
    }

Kth Largest Element in an Array

Find the kth largest element in an unsorted array. Note that it is the kth largest element in the sorted order, not the kth distinct element.

class Solution {
    public int findKthLargest(int[] nums, int k) {
        
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        int i = 0;
        for(; i<nums.length; i++){
            pq.add(nums[i]);
            if(pq.size() > k)
                pq.poll();
        }   
        
        return pq.peek();
    }
}

K Closest Points to Origin

We have a list of points on the plane. Find the K closest points to the origin (0, 0). (Here, the distance between two points on a plane is the Euclidean distance.)

    public int[][] kClosest(int[][] points, int K) {
        
        PriorityQueue<int[]> pq = new PriorityQueue<>(
                (a, b) -> ((a[0] - 0 ) * (a[0] - 0 )  + (a[1] - 0 ) * (a[1] - 0 )
                           - (b[0] - 0 ) * (b[0] - 0 ) - (b[1] - 0 ) * (b[1] - 0 ))
        );
            
        for(int i=0; i<points.length; i++){
            pq.add(new int[]{points[i][0], points[i][1]});
        }
            
        int [][] res = new int[K][2];
        for(int i=0; i<K && !pq.isEmpty(); i++){
            res[i] = pq.poll();
        }
            
        return res;
        
    }

Sort Characters By Frequency

Given a string, sort it in decreasing order based on the frequency of characters. For example, S = “tree”, the output should be “eert”

    public String frequencySort(String s) {
        
        PriorityQueue<Map.Entry<Character, Integer>> pq 
            = new PriorityQueue<>((a,b) -> b.getValue() - a.getValue());
        
        Map<Character, Integer> map = new HashMap<>();
       
        for(int i=0; i<s.length(); i++){
            map.put(s.charAt(i), 1 + map.getOrDefault(s.charAt(i), 0));
        }
        
        pq.addAll(map.entrySet());
       
        StringBuilder sb = new StringBuilder();
        while (!pq.isEmpty()) {
            Map.Entry<Character, Integer> e = pq.poll();
            int n = (int)e.getValue();
            for (int i = 0; i < n;  i++) 
                sb.append(e.getKey());
        }
        return sb.toString();
    }

Find median in a stream of integer

This problem is covered in detail here: Median in a stream of integers.

 PriorityQueue<Integer> pqMin;
    PriorityQueue<Integer> pqMax;
    
    /** initialize your data structure here. */
    public MedianFinder() {
         pqMax = new PriorityQueue<>(Collections.reverseOrder());
         pqMin = new PriorityQueue<>();
    }
    
    public void addNum(int num) {
        /* If size of max heap, i.e left bunch of numbers is 
           greater than the min heap, right bunch of numbers,
           add the number in left bunch and take the Max and 
           add it to right bunch */
        if (pqMax.size() > pqMin.size()) {
            pqMax.offer(num);
            pqMin.offer(pqMax.poll());
        } else {
            //Do the opposite.
            pqMin.offer(num);
            pqMax.offer(pqMin.poll());
        }
        
    }
    
    public double findMedian() {
        /*If the size of two heaps is equal, 
           even number of integers, take the average of two
           middle elements */
        if(pqMin.size() == pqMax.size()){
            return (double)(pqMin.peek() + pqMax.peek())/2.0;
        }
        
        //Else return from the left bunch
        return (double)(pqMax.peek() * 1.0);
    }
}

Find K-th Smallest Pair Distance

Given an integer array, return the kth smallest distance among all the pairs. The distance of a pair (A, B) is defined as the absolute difference between A and B.

  private int findKth(int [] a, int k){
        PriorityQueue<Integer> pq = 
                 new PriorityQueue<>(Comparator.reverseOrder());
        
        for(int i=0; i<a.length; i++){
            for(int j=i+1; j<a.length; j++){
                int diff =  Math.abs(a[i]-a[j]);
                pq.offer(diff);
                
                if(pq.size() > k){
                    pq.poll();
                }
            }
        }
        return pq.peek();
    }    

The above solution works but due to the constraints, it may run out of memory and time. A better solution is based on a binary search and trial and error methods.

Merge k Sorted Lists

Merge k sorted linked lists and return it as one sorted list. For example:
Input:
[
1->4->5,
1->3->4,
2->6
] Output: 1->1->2->3->4->4->5->6

private ListNode mergeList(ListNode[] lists){
        
        if(lists.length == 0) return null;
        
        PriorityQueue<ListNode> pq = new PriorityQueue<>(
            (a,b) -> ((Integer)a.val).compareTo(b.val)
        );
        
        for(int i=0; i<lists.length; i++){
            if(lists[i] != null)
                pq.offer(lists[i]);
        }
        
        ListNode head = pq.poll();
        ListNode tail =  head; 
        if(head != null && head.next != null)
            pq.add(head.next);
    
        while(!pq.isEmpty()){
            tail.next = pq.poll();
            tail = tail.next;
            if(tail.next != null)
                pq.offer(tail.next);
            
        }
        
        return head;
    }

As we see from the above code examples, it is easy to identify a priority queue problem and simple to solve. Please share your views and best of luck with your preparations.

Heaps fundamentals

In this post will learn heap fundamentals, which are very important for solving priority queue problems. Heap is a kind of data structure based on the complete tree principle. By definition, a complete binary search tree of N levels has at least 2^N-1 nodes in it. There two properties heap holds :

1. Structural property: This means every level of the heap will be completely full except the last level. The last level will be filled from left to right order.
2. Heap property: It means parent node will be either greater or smaller than its children nodes. There are two kinds of heaps based on second property:

Max Heap
Max heap maintains the property that every parent node is greater than its children. The root of the max heap is the greatest element.

heap fundamentals

Min Heap
Min heap maintains the property that every parent node is less than its children. The root of the min-heap is the least element.

min heap

Implementation Notes
Usually, heaps are implemented with an array, which eases traversing from child to parent and parent to child; and viewed as a tree. The height of a heap with n elements will be O(logN). 

Children of a parent node i are 2i and 2i+1.
Parent of a node i will be [i/2].

    private int parent(int pos) {
        return pos / 2;
    }
    
    private int leftChild(int pos){
        return (2 * pos);
    }

    private int rightChild(int pos){
        return (2 * pos) + 1;
    }
    

Given that multiply and divide by 2 can be very efficiently implemented by left and right shifting, these operations are very efficient.

Maximum elements in a heap with height h will be 2h-1 while the minimum number of elements in heap with the same height will be 2h-1+1 (One more than nodes with height h-1)

Heap operations

1. Insert an element in heap
To insert a new element in a heap, insert it as the leftmost position available. This new element may violate the heap property, for example, in a min-heap, the newly added element is less than the parent node. We have to push this newly added node to its correct position, this process is called a heapification.
Let’s take an example and see how it works.
insert in heap

    public void insert(int element) {
        if (size >= maxsize) {
            return;
        }

        Heap[++size] = element;
        int current = size;

        while (Heap[current] < Heap[parent(current)]) {
            swap(current, parent(current));
            current = parent(current);
        }
    }

The complexity of this procedure is for O(logn).

2. Pop from heap
The deletion of node is usually done at the root. To delete a root, replace the root node with the last node in heap and then do downshift. As the new node replacing root may violate heap property, we need to check with its children. In Max heap, check if new node is greater than both its children. If not then swap it with the largest child and then again repeat it till node replacing root finds its valid place.

Algorithm (This is for max heapify)

  1. Let i be the index which we doubt that might violate heap property.
  2. left = 2i, right = 2i+1, largest = a[i]
  3. Check if left child is with in array limits, if yes, check if it is greater than the a[i].
  4. If left is greater than a[i], then largest till now is left, largest = left.
  5. Check if right is with in array limit, if yes, check if it is greater than largest, then change the largest, largest = right.
  6. Swap largest and a[i].
  7. Now, repeat step 1 to 6 with largest, till we see an element which does not violate heap property.

Let’s take an example:
pop in heap

   public int pop() {
        int popped = Heap[1];
        Heap[1] = Heap[size--];
        minHeapify(1);
        return popped;
    }
    private void swap(int i, int j) {
        int tmp = Heap[i];
        Heap[i] = Heap[j];
        Heap[j] = tmp;
    }

    // Function to heapify the node at pos
    private void minHeapify(int pos){

        if (!isLeaf(pos)) {
            if (Heap[pos] > Heap[leftChild(pos)]
                    || Heap[pos] > Heap[rightChild(pos)]) {

                if (Heap[leftChild(pos)] < Heap[rightChild(pos)]) {
                    swap(pos, leftChild(pos));
                    minHeapify(leftChild(pos));
                }
                else {
                    swap(pos, rightChild(pos));
                    minHeapify(rightChild(pos));
                }
            }
        }
    }

3. Building a heap from a given array of elements.
This operation can easily be done using the heapify methods explained above. 

  1. Start from the middle element of the array, let’s say i
  2. Heapify with given index.
  3. Decrease index by one. Repeat step 2 till we reach first element.
   public void minHeap() {
        for (int pos = (size / 2); pos >= 1; pos--) {
            minHeapify(pos);
        }
    }

The complexity of this procedure is O(n).

How this complexity becomes O(n) while for adding each node will need logn time to heapify and if there are n nodes, it should be O(nlogn). The complexity of the heapify method for a particular element depends on its position in the heap. It takes O(1) time when the node is a leaf node (which makes up at least half of the nodes) and O(log n) time when it’s at the root.
If the height of the heap is h, the number of nodes will be 2h. So 2h/2 = 2h-1 nodes never move. 2h-2 nodes move by one level and so on.

4. Heapsort
Heap sort combines property of both insertion sort (no extra space required) and merge sort (time complexity being O(nlogn)). It can be easily achieved by using the above two procedures.

  1. Build max heap from the given array, with complexity of O(n)
  2. Swap first and last element of the array. Last element is now at its proper position.
  3. Decrease the size of heap by 1 to be heapify.
  4. Heapify with first element of the array.
  5. Repeat step 2 , 3 and 4 until there are elements to be sorted.

The complexity of the above procedure is O(nlogn).

Problems based on heaps
Reference : http://www.amazon.com/Introduction-Algorithms-Thomas-H-Cormen/dp/0262033844