Top K Frequent Keywords

Tags: , , ,

Given a list of reviews, a list of keywords and an integer k. Find the most frequent k keywords in order of most to least frequently mentioned.

  • The comparison of strings is case-insensitive.
  • Multiple occurrences of a keyword in a review should be considered as a single mention.
  • If keywords are mentioned an equal number of times in reviews, sort alphabetically.

Example

Input:
k = 2
keywords = ["services", "algorithms", "inteview"]
reviews = [
  "algorithms and Me provides the best services for the interview preparations",
  "Also, their mock interview services awesome services",
  "Best services provided by Algorithms and me, everyone should use their services",
]

Output:
["services", "algorithms"]

Explanation:
"services" is occurring in 3 different reviews, and "algorithms" occurs in 2 reviews. Note that even though "services" occur two times in the same review, it is counted only once.

Rule of thumb, if you have to find top k of anything, give priority queue a thought, and to understand priority queues, it is very important to understand the fundamentals of heaps.

This is a straight forward application of priority queues. All we have to do is count the frequency of each word in the keywords list in the given list of reviews. Once we have the frequency count, use priority queue to find the top K keywords.

Implementation notes
1. Convert the list of keywords into a set, to have an efficient lookup.
2. Split each review into words and check if the word is the keyword set.
3. If yes, increment the frequency count.
4. Put the map of words to frequency into the priority queue and pluck top K words. Adjust the comparator function to have order on the frequency and then alphabetically.

Top K frequent keywords in reviews

public List<String> findKWordsInReviews(int k,
                       List<String>keywords, List<String> reviews) {

        List<String> res = new ArrayList<>();

        //For fast look up. additional space O(n)
        Set<String> set = new HashSet<>(keywords);

        //To store the freq count of keywords
        Map<String, Integer> map = new HashMap<>();

        //Go through each review
        for(String review : reviews) {
            //Split the reviews in words
            String[] words = review .split("\\W");

            //So that we do not count word twice
            Set<String> added = new HashSet<>();
            for(String word : words) {
                //As it is case sensitive
                word = word.toLowerCase();
                //If word is keywords and not yet added
                if(set.contains(word) && !added.contains(word)) {
                    map.put(word, map.getOrDefault(word, 0) + 1);
                    added.add(word);
                }
            }
        }

        //Add the map created into the priority queue.
        Queue<Map.Entry<String, Integer>> maxHeap = new PriorityQueue<>(
                (a, b)->a.getValue() == b.getValue()
                        ? a.getKey().compareTo(b.getKey())
                        : b.getValue() - a.getValue());

        //Add everything to PQ
        maxHeap.addAll(map.entrySet());

        //Take out top k words
        while(!maxHeap.isEmpty() && k-- > 0) {
            res.add(maxHeap.poll().getKey());
        }
        return res;
    }

The complexity of this code depends on the following :
1. No of keywords let’s say is n
2. No of reviews, m
3. K

If we are using max heap approach, building a max heap with n keywords will take O(n), while taking out k words from it, will take O(klogn).

Splitting the reviews into words will take O(m * average length of reviews) with additional space depending on the number of words in reviews.