Most frequent words in file


In the last post, find K smallest elements in an array, we learned some concepts to find the top or bottom elements of a given set. Let’s apply those concepts again to a different problem: finding the most frequent words in a file.

Given a text file which contains strings/words, find the n most frequently occurring words in it, i.e. the n words whose counts are the maximum.

For example, if the given file contains: one, two, two, three, three, three, four, four, four, four, five, five, five, five, five, six, six, six, six, six, six, and we are looking for the three most frequent words, the output should be: six, five, four.

Line of thoughts

A brute force solution would be really easy: scan the whole file and get the count for each unique word in the file. Then sort the words by count in descending order and return the first n.

This problem has three parts to it. First, read all words from the file; second, create a map which stores the frequency count of each word in the file; third, get the top n words from that map.

Reading a file is a pretty standard operation in any language. Brush up on the Java basics here. We will not focus on that, as it is not the intention of this question.
Assuming we can read from the file, how can we store frequency counts against words? Simple: use a hash map. We read a word from the file and store it in the hash map with count 1 if it is not already there. If the word is already in the hash map, we increase its count by 1.
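The counting step can be sketched on its own; here the words are hard-coded in an array purely for illustration, standing in for words read from the file:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
	public static void main(String[] args) {
		String[] words = {"one", "two", "two", "three", "three", "three"};
		Map<String, Integer> wordMap = new HashMap<>();
		for (String word : words) {
			// merge inserts 1 for a new word, otherwise adds 1 to the existing count
			wordMap.merge(word, 1, Integer::sum);
		}
		System.out.println(wordMap.get("three")); // 3
	}
}
```

`Map.merge` does the "insert with 1 or increment" logic in one call, which is equivalent to the containsKey/put sequence described above.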

After this step, we have a hash map with the count of occurrences of each word. Now comes the challenging part: how to find the top n, or most frequent, words from this hash map. Mind you that a hash map does not store elements in sorted or any other order.

The rule of thumb is: to find the top or least elements from a collection of items, heaps are the best bet. To find the top n most frequently occurring words, let’s try a heap.
Which heap would be best? We need to build a limited set, with free entry into that set until it holds n words (as we need the top n words). All further words have to pass a test before they enter the set.

If a new word is less frequent than the least frequent word in the set, what can be said about this new word? Well, it cannot be in the top n words, right?
If a new word has a frequency greater than the word with the least frequency in the set, the new word should enter the set and the word with the least frequency should be moved out.
What we just described is the typical use of a min heap, as it gives the least element at the root. To find the top n most frequent words in a file, below is the algorithm.
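To see why a min heap fits, note that Java’s PriorityQueue keeps the smallest element at the root; a small sketch with plain integers standing in for frequency counts:

```java
import java.util.PriorityQueue;

public class MinHeapDemo {
	public static void main(String[] args) {
		PriorityQueue<Integer> heap = new PriorityQueue<>();
		heap.add(5);
		heap.add(1);
		heap.add(3);
		// peek() returns the least element without removing it
		System.out.println(heap.peek()); // 1
		// a candidate greater than the root evicts the root
		if (4 > heap.peek()) {
			heap.poll();
			heap.add(4);
		}
		System.out.println(heap.peek()); // 3
	}
}
```

After the eviction, the heap holds {3, 4, 5}: the smallest survivor is again at the root, ready to be tested against the next candidate.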

Most frequent words in file : algorithm

  1. Take the first n words appearing in the map along with their counts and create a min heap with these n words.
  2. One by one, read words from the hash map and check if the frequency of the new word is more than that of the least frequent word on the heap, i.e. the word at the root.
  3. If yes, remove the root and add the new word to the min heap. If not, continue with the next word.
  4. When done with all words in the hash map, the min heap will contain the n most frequently occurring words in the file.

Implementation

package com.company;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;

/**
 * Created by sangar on 1.10.18.
 */
public class FrequentWords {

	public static HashMap<String, Integer> readFile(String fileName)
			throws IOException {
		HashMap<String, Integer> wordMap = new HashMap<>();

		Path path = Paths.get(fileName);
		try (Scanner scanner = new Scanner(path)) {
			// assumes one word per line; use scanner.next() instead
			// if words are whitespace-separated
			while (scanner.hasNextLine()) {
				String word = scanner.nextLine();
				wordMap.merge(word, 1, Integer::sum);
			}
		}
		return wordMap;
	}

	public static ArrayList<String> mostFrequentWords(String fileName, int n) {
		ArrayList<String> topWords = new ArrayList<>();

		try {
			HashMap<String, Integer> wordMap = readFile(fileName);
			// min heap ordered by frequency; Map.Entry replaces javafx.util.Pair,
			// which is not available on newer JDKs
			PriorityQueue<Map.Entry<String, Integer>> pq =
				new PriorityQueue<>(n, (x, y) -> x.getValue().compareTo(y.getValue()));

			int i = 0;
			Iterator<Map.Entry<String, Integer>> it = wordMap.entrySet().iterator();
			/*
				Put the first n words on the heap
			*/
			while (it.hasNext() && i < n) {
				Map.Entry<String, Integer> entry = it.next();
				pq.add(new AbstractMap.SimpleEntry<>(entry.getKey(), entry.getValue()));
				it.remove();
				i++;
			}

			/*
				Check all remaining words; if one is more frequent than the
				least frequent word on the heap, evict the least and add it.
			*/
			for (Map.Entry<String, Integer> entry : wordMap.entrySet()) {
				if (pq.peek().getValue() < entry.getValue()) {
					pq.poll();
					pq.add(new AbstractMap.SimpleEntry<>(entry.getKey(), entry.getValue()));
				}
			}
			while (!pq.isEmpty()) {
				topWords.add(pq.poll().getKey());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
		return topWords;
	}

	public static void main(String[] args) {
		System.out.println(mostFrequentWords("/home/sangar/Documents/test.txt", 3));
	}
}

Reading m words from the file requires O(m) time, whereas creating the initial n-element heap takes O(n). Scanning the remaining words and inserting them onto the heap has complexity O((m-n) log n). The overall complexity to find the top n most frequent words in a file is O(m log n).