
Intersection of two arrays

Given two unsorted arrays of integers, find the intersection of the two arrays, that is, the elements common to both. For example, if A = [1,4,3,2,5,6] and B = [3,2,1,5,6,7,8,10], the intersection of A and B is [1,3,2,5,6].

Sort array and then use binary search
As the given arrays are unsorted, sort one of the arrays, preferably the larger one. Then search for each element of the other array in the sorted array using binary search. If the element is present, put it into the intersection array.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class Solution {
    public int[] intersection(int[] nums1, int[] nums2) {

        int len2 = nums2.length;
        Set<Integer> result = new HashSet<>();

        //Sort the first array so that we can binary search in it
        Arrays.sort(nums1);

        for(int i=0; i<len2; i++){
            if(binarySearch(nums1, nums2[i]) != -1){
                result.add(nums2[i]);
            }
        }

        int i = 0;
        int[] resultArray = new int[result.size()];
        for(Integer num : result){
            resultArray[i++] = num;
        }

        return resultArray;
    }

    private int binarySearch(int[] a, int key) {

        int low = 0;
        int high = a.length - 1;

        while(low <= high){
            int mid = low + (high - low) / 2;
            if(a[mid] == key) return mid;
            if(a[mid] < key) low = mid + 1;
            else high = mid - 1;
        }

        return -1;
    }
}

The time complexity of the binary search method to find the intersection is O(nlogn) for sorting plus O(mlogn) for searching. The effective time complexity becomes O((n+m)logn), which is not optimal.

Sort and use merge to find common elements
Again, in this method, sort the two arrays first. Then use two pointers to scan both arrays simultaneously (please refer to the merge step of merge sort). The difference is that we output only the common elements instead of all of them, as in the sketch below.
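Here is a minimal Java sketch of the merge-based intersection (an illustration; the class and method names are mine). It sorts both input arrays in place and skips duplicates so that each common element appears only once.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IntersectionWithMerge {
    public List<Integer> findIntersection(int[] a, int[] b) {
        Arrays.sort(a);
        Arrays.sort(b);

        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                //Common element; skip duplicates so it is added only once
                if (result.isEmpty() || result.get(result.size() - 1) != a[i]) {
                    result.add(a[i]);
                }
                i++;
                j++;
            }
        }
        return result;
    }
}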

The time complexity of the merge method is O(nlogn) + O(mlogm) for sorting plus O(m+n) for scanning both arrays. Sorting dominates, so asymptotically this is no better than the binary search method.

Use hash
Create a hash with keys as the elements of the smaller array (this saves space). Then scan through the other array and check whether each element is present in the hash. If yes, put it into the intersection array; otherwise, skip it.

package AlgorithmsAndMe;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IntersectionTwoArrays {

    //Assumption: a is the smaller array, so its elements go into the hash
    public List<Integer> findIntersection(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        Map<Integer, Boolean> existingElements = new HashMap<>();

        for (int i = 0; i < a.length; i++) {
            existingElements.put(a[i], true);
        }

        for (int i = 0; i < b.length; i++) {
            if (existingElements.containsKey(b[i])) {
                result.add(b[i]);
            }
        }
        return result;
    }
}

Test case

package Test;

import AlgorithmsAndMe.IntersectionTwoArrays;

import java.util.List;

public class IntersectionTwoArraysTest {


    IntersectionTwoArrays intersectionTwoArrays
             = new IntersectionTwoArrays();

    @org.junit.Test
    public void testIntersectionTwoArrays() {
        int [] a = {1,6,3};
        int [] b = {1,2,3};
        List<Integer> result = intersectionTwoArrays.findIntersection(a,b);

        result.forEach(s -> System.out.println(s));
    }
}

This method has a time complexity of O(n), where n is the number of elements in the larger array, and an extra space complexity of O(m), where m is the number of elements in the smaller array.

These methods to find the intersection of two arrays do not work when there are duplicate elements in either array, as such elements will be part of the intersection array only once.

Please share if there is something wrong or missing. We would love to hear from you.


Minimizing maximum lateness : Greedy algorithm

Since we have chosen greed, let's continue with it for at least one more post. Today's problem is to minimize the maximum lateness of a set of tasks. Let me clarify the problem: we are given a processor which processes one process at a time and, as always, a list of processes to be scheduled on it, with the intention that the maximum lateness should be minimized. Contrary to previous problems, this time we are not given start and end times; instead, we are given the length of time ti that process i will run and the deadline di it has to meet, with fi being the actual finish time of the process.

Lateness of a process is defined as
li = max{0, fi − di}, i.e. the length of time past its deadline by which it finishes.
The goal is to schedule all processes so as to minimize the maximum lateness, L = max li. For example:

[Figure: minimize maximum lateness]

Minimizing maximum lateness : algorithm

Let's decide our optimization strategy. There are a few orders in which jobs could be scheduled: shortest job first, earliest deadline first, or least slack time first.

Let's see whether any of the above strategies yields an optimal solution. For shortest processing time first, consider the jobs, given as (processing time, deadline) pairs: P1 = (1,100), P2 = (10,10). If we schedule the shortest job first, in the order (P1, P2), P2 finishes at time 11 against a deadline of 10, so the maximum lateness is 1; if we take them as (P2, P1), the maximum lateness is 0. So, clearly, taking the shortest process first does not give us an optimal solution.

Check the smallest slack time approach yourself: see if you can come up with a counterexample showing that it does not work.

That leaves us with only one option: take the not-yet-scheduled process with the most pressing deadline, i.e., the smallest one. If you have noticed, the example given in the problem statement is solved using this method, so we know it works.

  1. Sort all jobs in ascending order of deadlines
  2. Start with time t = 0
  3. For each job in the list
    1. Schedule the job at time t
    2. Finish time = t + processing time of job
    3. t = finish time
  4. Return (start time, finish time) for each job

Minimizing maximum lateness : implementation

from operator import itemgetter

# Each job is (id, processing time, deadline)
jobs = [(1, 3, 6), (2, 2, 9), (3, 1, 8), (4, 4, 9),
        (5, 3, 14), (6, 2, 15)]

def get_minimum_lateness(jobs):
    schedule = []
    max_lateness = 0
    t = 0

    # Earliest deadline first
    sorted_jobs = sorted(jobs, key=itemgetter(2))

    for job in sorted_jobs:
        job_start_time = t
        job_finish_time = t + job[1]

        t = job_finish_time
        if job_finish_time > job[2]:
            max_lateness = max(max_lateness, job_finish_time - job[2])
        schedule.append((job_start_time, job_finish_time))

    return max_lateness, schedule

max_lateness, schedule = get_minimum_lateness(jobs)
print("Maximum lateness will be: " + str(max_lateness))
for start, finish in schedule:
    print(start, finish)

The complexity of the implementation is dominated by the sort function, which is O(nlogn); the rest of the processing takes O(n).

Please share your suggestions, or anything you find wrong, in the comments. We would love to hear what you have to say. If you find this post interesting, please feel free to share or like it.


Coin change problem : Greedy algorithm

Today, we will learn a very common problem which can be solved using a greedy algorithm. If you are not very familiar with greedy algorithms, here is the gist: at every step of the algorithm, you take the best available option and hope that everything turns out optimal at the end, which usually does. The problem at hand is the coin change problem, which goes like this: given coins of denominations 1, 5, 10, 25, and 100, find a way to give a customer an amount using the fewest coins. For example, if I ask you to return change for 30, there is more than one way to do so:

 
Amount: 30
Solutions: 3 X 10          (3 coins)
           6 X 5           (6 coins)
           1 X 25 + 5 X 1  (6 coins)
           1 X 25 + 1 X 5  (2 coins)

The last solution is the optimal one, as it gives change for the amount with only 2 coins, whereas all the other solutions use more.

The greedy solution to the coin change problem is very intuitive and is called the cashier's algorithm. The basic principle is: at every iteration, take the largest coin which fits into the amount remaining at that instant. At the end, you will have the optimal solution.

Coin change problem : Algorithm

1. Sort the n coin denominations in increasing order of value.
2. Initialize the set of coins as empty: S = {}
3. While the amount is not zero:
3.1 Let Ck be the largest coin such that amount ≥ Ck
3.1.1 If there is no such coin, return “no viable solution”
3.1.2 Else include the coin in the solution S.
3.1.3 Decrease the remaining amount: amount = amount − Ck

Coin change problem : implementation

#include <stdio.h>

int coins[] = { 1,5,10,25,100 }; // kept sorted in increasing order

int findMaxCoin(int amount, int size){
    for(int i=0; i<size; i++){
        if(amount < coins[i]) return i-1;
    }
    //Amount is at least as large as the biggest coin
    return size-1;
}

int findMinimumCoinsForAmount(int amount, int change[]){

    int numOfCoins = sizeof(coins)/sizeof(coins[0]);
    int count = 0;
    while(amount){
        int k = findMaxCoin(amount, numOfCoins);
        if(k == -1){
            printf("No viable solution");
            break;
        }
        amount -= coins[k];
        change[count++] = coins[k];
    }
    return count;
}

int main(void) {
    int change[10]; // This needs to be dynamic
    int amount = 34;
    int count = findMinimumCoinsForAmount(amount, change);

    printf("\n Number of coins for change of %d : %d", amount, count);
    printf("\n Coins : ");
    for(int i=0; i<count; i++){
        printf("%d ", change[i]);
    }
    return 0;
}

What will the time complexity of the implementation be? If the coins are not already sorted, sorting them takes O(nlogn). The while loop runs O(amount) times in the worst case, for example when all we have is the coin of denomination 1. The overall complexity of the coin change problem thus becomes O(nlogn) + O(amount).

Will this algorithm work for all sets of denominations? The answer is no. It will not give any solution if there is no coin with denomination 1, and for arbitrary denominations the greedy choice can be suboptimal: for denominations {1, 3, 4} and amount 6, greedy picks 4 + 1 + 1 (3 coins) while 3 + 3 (2 coins) is optimal. So be careful while applying this algorithm.

Please share if you have any suggestion or if you want me to write on a specific topic. If you liked the post, share it!


Find all duplicate numbers in array

Given an unsorted array of positive integers in the range 0 to N-1, find all duplicate numbers in the array. For example, in A = [2,4,3,2,1,5,4] the duplicate numbers are 2 and 4, whereas in A = [4,1,3,2,1,1,5,5] the duplicate numbers are 1 and 5.

The brute force solution would be to keep track of every number already visited. Which data structure is good for quick lookups like this? Of course, a map or hash.
The time complexity of this solution is O(n), but it has an additional space complexity of O(n).

To reduce the space requirement, a bit array can be used, where the ith bit is set whenever we encounter the number i in the given array. If the bit is already set, it is a duplicate number. This still takes O(n) extra space, but far less memory in practice than the hash, as only single bits are used. The time complexity remains O(n). A sketch of this approach follows.
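Purely as an illustration, here is a sketch of the bit array approach using java.util.BitSet; the class and method names are mine, and it assumes all elements lie in the range 0 to N-1 as stated in the problem.

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class DuplicatesWithBitSet {
    public Set<Integer> getAllDuplicates(int[] a) {
        Set<Integer> result = new HashSet<>();
        //One bit per possible value 0 to a.length - 1
        BitSet seen = new BitSet(a.length);

        for (int num : a) {
            if (seen.get(num)) {
                result.add(num); //bit already set, so num is a duplicate
            } else {
                seen.set(num);
            }
        }
        return result;
    }
}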

Find duplicate numbers in an array without additional space

Can we use the given array itself to keep track of the already visited numbers? How can we change a number in the array while still being able to recover the original number whenever needed? This is where reading the problem statement carefully pays off. Since the array contains only positive numbers, we can negate the number at the index equal to the number visited. If we ever find the number at an index to be negative, we have seen that index's number earlier, and hence it is a duplicate.

In short: make the number at index i of the array negative whenever we see the number i in the array; if the number at index i is already negative, the number i is a duplicate. The limitation of this method is that it will not work for negative numbers.

Duplicate numbers implementation

package AlgorithmsAndMe;

import java.util.HashSet;
import java.util.Set;

public class DuplicatesInArray {

    public Set<Integer> getAllDuplicates(int[] a ) 
                              throws IllegalArgumentException {

        Set<Integer> result = new HashSet<>();

        if(a == null) return result;

        for(int i=0; i<a.length; i++) {
            //In case input is wrong
            if(Math.abs(a[i]) >= a.length ){
               throw new IllegalArgumentException();
            }
            
            if (a[Math.abs(a[i])] < 0) {
                result.add(Math.abs(a[i]));
            } else {
                a[Math.abs(a[i])] = -a[Math.abs(a[i])];
            }
        }
        return result;
    }
}

Test cases

package Test;

import AlgorithmsAndMe.DuplicatesInArray;
import java.util.Set;

public class DuplicatesInArrayTest {

    DuplicatesInArray duplicatesInArray = new DuplicatesInArray();

    @org.junit.Test
    public void testDuplicatesInArray() {
        int [] a = { 1,2,3,4,2,5,4,3,3};
        Set<Integer> result = duplicatesInArray.getAllDuplicates(a);

        result.forEach(s -> System.out.println(s));
    }

    @org.junit.Test
    public void testDuplicatesInArrayWithNullArray() {
        Set<Integer> result = duplicatesInArray.getAllDuplicates(null);

        result.forEach(s -> System.out.println(s));
    }

    //This case should generate an exception, as 3 is not less than the array size.
    @org.junit.Test
    public void testDuplicatesInArrayWithInvalidInput() {
        int [] a = { 1,2,3};
        try{
             Set<Integer> result = duplicatesInArray.getAllDuplicates(a);
        } catch (IllegalArgumentException  e){
            System.out.println("invalid input provided");
        }
    }
}

The complexity of the algorithm to find duplicate elements in an array is O(n).


Linked list implementation in Linux kernel

We have learned a lot about linked lists and solved around 30-odd problems: Linked list problems. However, the actual implementation of a linked list in the Linux kernel is very different from what we have learned. Let us understand how a linked list is implemented in the Linux kernel and used in kernel code.

In a simple linked list, nodes contain data and point to the next node in the list. In other words, it is the list which contains nodes that are linked. A typical node of this kind of list has a structure like this:

class Node {
  private int data;
  private Node next;

  public Node (int data, Node next){
      this.data = data;
      this.next = next;
  }

  //Getters
}

With linked lists in the Linux kernel, however, it is the other way around: the linked list is contained inside the node. This means that there is no next pointer inside the node; instead, each node is effectively a head node, as in a circular linked list. Also, it is a doubly linked list. A lot of things in one sentence!

Linked list implementation in Kernel

Let's understand it in detail. As said above, the linked list is contained inside the node, so the structure of a node looks like this:

struct node {
 int data;
 struct list_head list; // list is inside the node
};

Here list_head is defined as:

struct list_head {
  struct list_head *next, *prev;
};

It has two pointers, essentially making any node which contains this structure part of a doubly linked list. The most interesting consequence of this kind of node definition is that the same node can be part of multiple lists without being reallocated for each list. For example, with traditional linked lists, if we needed two lists, one of odd numbers and the other of prime numbers, we would have to allocate a node in each list for a number which is both odd and prime. With the implementation provided in the Linux kernel, we can attach the same node to both lists, as shown below, where an odd number which is also prime is allocated only once.

struct numbers {
 int number;
 struct list_head odd_numbers; // links the odd numbers
 struct list_head prime_numbers; // links the prime numbers
};

How to access a node of a list in the Linux kernel

Now that we understand the node structure, how can we access a given node of the linked list? It was simple in a normal linked list, as the base address of the node was directly available. In the list implemented in the Linux kernel, we have a pointer to the list_head structure inside the next node, not a pointer to the next node itself, as shown below.

[Figure: linked list representation in the Linux kernel]

There is a beautiful trick in C which is used here to access the base address of the node whose list_head pointer is given. Once the base address of a node is known, accessing the node becomes similar to a normal linked list. The trick is: given a pointer to a list_head member inside a structure, find the offset at which that member is stored in the structure. Once we know the offset (how many bytes it lies from the base address), we just subtract that offset from the absolute address of the pointer (which is given) and we get the base address. The figure explains it:

[Figure: node representation in a linked list in the Linux kernel]

Let's take an example; we will use the structure numbers given above. To get the offset of the member number in it, the code is:

(unsigned long)(&((struct numbers *)0)->number)

Now that we have the offset of number and the absolute address of number, we can get the base address of struct numbers as:

((struct numbers *)((char *)(pos) - \
          (unsigned long)(&((struct numbers *)0)->number)))

ANSI C defines the offsetof() macro in <stddef.h>, which lets you compute the offset of field f in struct s as offsetof(struct s, f). If for some reason you have to code this sort of thing yourself, one possibility is:

#define offsetof(type, f) ((size_t) \
 ((char *)&((type *)0)->f - (char *)(type *)0))

The above code is not portable, and some compilers may have problems with it.

There are some macros defined in the Linux kernel which are useful when dealing with linked lists. Below are some examples:

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
	(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

Please refer to this file to understand various macros which are used in Linux kernel.

In the next post, we will use these constructs and see how we can create a list, access it, and delete it.

Please share if there is something wrong or if you have suggestions for improvement. Also, if you like the content, please share it.


Repeated number in an array

In the last post, Find missing number in array, we learned how to find a missing number in an array of integers with values in a given range. Today, we will learn how to find a repeated number in such an array. Here too the numbers are not sorted but are confined to a range: if the size of the array is N, then the range of numbers is 1 to N-1, as one number is repeated. Examples:

A = [1,2,3,3,4,5]. Repeated number is 3
Size of array : 6 Range : 1 to 5

Repeated number : Algorithm

As we learned while solving the missing number problem earlier, the XOR principle can be applied here too. Why? Because in this case the repeated number will be XORed in three times: twice from the array and once from the range. These are the properties of XOR we use:

A XOR A = 0
0 XOR A = A

When a number is XORed with itself, the result is zero, and when zero is XORed with a number, the result is the number itself. Extending this, if we XOR the same number thrice, or, without loss of generality, an odd number of times, the result is the number itself.

Using this odd-number-of-XORs principle, the algorithm to find the repeated number in an array is:

1. XOR all the actual numbers in the array. Call it aXOR.
2. XOR all the numbers in the range 1 to N-1. Call it eXOR.
3. XOR aXOR with eXOR. The result is the repeated number.

This is because all numbers except the repeated one are XORed an even number of times (once from the array and once from the range) and cancel each other. The repeated number is XORed thrice, so the final result is the repeated number itself. Let's take the example below and see if it works.

A = [1,2,2,3,4]

aXOR = 001 XOR 010 XOR 010 XOR 011 XOR 100 = 110
eXOR = 001 XOR 010 XOR 011 XOR 100 = 100

aXOR XOR eXOR = 110 XOR 100 = 010 = 2, the repeated number

Repeated number in array implementation

public int repeatedNumber(int[] nums) {

    int n = nums.length;

    //XOR of the expected range 1 to N-1
    int eXOR = 0;
    for(int i=1; i<n; i++){
        eXOR ^= i;
    }

    //XOR of the actual numbers in the array
    int aXOR = 0;
    for(int i=0; i<n; i++){
        aXOR ^= nums[i];
    }

    return aXOR ^ eXOR;
}

The time complexity of the XOR method to find a repeated number in an array is O(n).

Please share your thoughts through comments, if you see something is missing or wrong or not explained properly.


Missing number in an array

Given an array of N integers with values ranging from 1 to N+1, find the missing number in that array. It is easy to see that with N slots and N+1 possible values, there must be a missing number in the array. For example, for A = [1,2,5,4,6], N = 5 and the range is 1 to 6; the output is 3. For A = [1,5,3,4,7,8,9,2], N = 8 and the range is 1 to 9; the output is 6.

Methods to find a missing number

Using hash
Create a hash with a size equal to N+1. Scan through the elements of the array and mark each one as true in the hash. Then go through the hash and find the number which is still false; that is the missing number in the array.
The complexity of this method is O(n), with an additional O(n) space complexity.
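A minimal sketch of this method; since the range is known, a plain boolean array serves as the hash.

    public int missingNumber(int[] nums) {
        int n = nums.length;
        //seen[i] is true if the number i+1 is present in the array
        boolean[] seen = new boolean[n + 1]; //numbers 1 to N+1

        for (int num : nums) {
            seen[num - 1] = true;
        }

        for (int i = 0; i <= n; i++) {
            if (!seen[i]) return i + 1;
        }
        return -1; //no number is missing
    }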

Using mathematics
We know that the sum of the first N natural numbers is N*(N+1)/2. Here the numbers range from 1 to N+1 with one missing, so the missing number is the difference between the expected sum and the actual sum:

Missing num = (N+1) * (N+2) / 2 – Actual sum (the expected sum has N+1 terms, because the range of numbers is from 1 to N+1)
The complexity is O(n). However, there is a catch: the sum may overflow if N is big enough.
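A minimal sketch of this method, using long arithmetic to reduce the overflow risk mentioned above:

    public int missingNumber(int[] nums) {
        long n = nums.length;
        //Expected sum of the numbers 1 to N+1
        long expectedSum = (n + 1) * (n + 2) / 2;

        long actualSum = 0;
        for (int num : nums) {
            actualSum += num;
        }
        return (int) (expectedSum - actualSum);
    }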

Using XOR
There is a beautiful property of XOR: if we XOR a number with itself, the result is zero. How can this property help us find the missing number? In this problem, there are two sets of numbers: the first is the range 1 to N+1, and the second is the set of numbers actually present in the array. These two sets differ by only one number, and that is our missing number. Now, if we XOR the first set of numbers with the second set, all except the missing number will cancel each other, and the final result will be the missing number.

Algorithm to find a missing number using XOR

1. Scan through the entire array and XOR all elements. Call it aXOR.
2. Now XOR all numbers from 1 to N+1. Call it eXOR.
3. Now XOR aXOR and eXOR; the result is the missing number.

Let’s take an example and see if this works

A = [1,3,4,5]. Here N = 4 and the range is 1 to 5.

XORing the bit representations of the actual numbers:
aXOR = 001 XOR 011 XOR 100 XOR 101 = 011

XORing the bit representations of the expected numbers:
eXOR = 001 XOR 010 XOR 011 XOR 100 XOR 101 = 001

Now XOR aXOR and eXOR:
011 XOR 001 = 010 = 2, which is the missing number

Implementation

    public int missingNumber(int[] nums) {

        int n = nums.length;

        //XOR of the expected range 1 to N+1
        int eXOR = 0;
        for(int i=1; i<=n+1; i++){
            eXOR ^= i;
        }

        //XOR of the actual numbers in the array
        int aXOR = 0;
        for(int i=0; i<n; i++){
            aXOR ^= nums[i];
        }

        return aXOR ^ eXOR;
    }

The complexity of the XOR method to find a missing number in an array of integers is O(n) with no additional space complexity.

If you want to contribute to this blog in any way, please reach out to us: Contact. Also, please share if you find something wrong or missing. We would love to hear what you have to say.


Design a data structure with insert delete and getRandom in O(1)

The problem statement is to design a data structure which performs the following operations in O(1) time complexity:
1. Insert an element, insert(int value)
2. Remove an element, remove(int value)
3. Get random element, getRandom()

For example, insert 1 into the data structure insert(1): [1]
insert 2 into the data structure insert(2): [1,2]
insert 3 into the data structure insert(3): [1,2,3]

Remove 2 from it, remove(2). [1,3]
getRandom() should return 1 and 3 with equal probabilities.

These kinds of problems are easy and hard at the same time. The idea is to go step by step and solve each part. The first step is to define an interface for this data structure, which is easy given the definition of the problem.

public interface IRandomNumberGenerator {
    public boolean insert(int value);
    public boolean remove (int value);
    public int getRandom();
}

Now that the interface is ready, it is time to start implementing the class. First of all, we have to find a container to store all the elements. If we take an ArrayList, insert() is O(1), as we always add the new element at the end of the ArrayList. getRandom() is also O(1). However, there is a problem with remove(): to remove an element from an ArrayList, we have to scan the whole ArrayList, remove the element, and then move all the elements to the right of the deleted element one index left. This is an O(n) operation.

Insert delete and getRandom in O(1): selection of data structures

The problem with storing elements in an ArrayList is that on removal we have to scan the list to find the location of the element to be removed. What if we already knew the location? We can store the position of each element in the ArrayList in a HashMap which maps the value to its index in the ArrayList.

Now, insert() has to insert a value into two data structures: first into the ArrayList, and then the location of the value in the ArrayList into the HashMap. The remove operation can simply go to the location in the ArrayList and delete the element. Wait, we still have to move all the elements on the right one position left, which means the worst case complexity of remove() is still O(n).

We know one thing: if we remove the last element from the ArrayList, no shifting is required. So what if we copy the last value over the index of the element to be removed and then just remove the last element? Be careful: we also have to update the HashMap with the new index for the value that was at the last index of the ArrayList. This way, remove() is also O(1).

Insert, delete and getRandom in O(1): implementation

package AlgorithmsAndMe;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomNumberGenerator implements IRandomNumberGenerator {

    private ArrayList<Integer> list;
    private Map<Integer, Integer> loc;
    private Random random;

    //Initializing the class
    public RandomNumberGenerator(){
        list = new ArrayList<>();
        loc = new HashMap<>();
        random = new Random();
    }

    @Override
    public boolean insert(int value) {
        /*If hash already contains key then it is a duplicate key.
          So, we just return false.
         */
        if(loc.containsKey(value)) return false;

        //Insert into list
        list.add(value);

        //Save the location on hash map
        loc.put(value, list.size()-1);
        return true;
    }

    @Override
    public boolean remove(int value) {
        /* If there is no entry in hash, that means
        there is no element in ArrayList */
        if(!loc.containsKey(value)) return false;

        int location = loc.get(value);
        //Remove from hash
        loc.remove(value);

        if(location != list.size()-1){
            /*Copy the last value in the array
            list to the current location*/
            list.set(location, list.get(list.size()-1));

            //Update the location of last element in hash
            loc.put(list.get(location), location);
        }

        //remove the last location from ArrayList
        list.remove(list.size()-1);
 
        return true;
    }

    @Override
    public int getRandom() {
        return list.get(random.nextInt(list.size()));
    }
}

package AlgorithmsAndMe;

import static org.junit.Assert.*;

public class RandomNumberGeneratorTest {

    RandomNumberGenerator randomNumberGenerator =
           new RandomNumberGenerator();

    @org.junit.Test
    public void testInterface() {
        assertEquals(true, randomNumberGenerator.insert(4));
        assertEquals(true, randomNumberGenerator.insert(5));
        assertEquals(true, randomNumberGenerator.insert(3));
        assertEquals(true, randomNumberGenerator.insert(2));

        assertEquals(true, randomNumberGenerator.remove(4));

        int random = randomNumberGenerator.getRandom();
        System.out.println(random);
    }
}

The complexity of the whole data structure for insert, delete and getRandom is O(1).

Insert, delete and get random when duplicates are allowed

Let's make this problem a bit more complex by allowing duplicate elements in the list. The first problem with the existing implementation is that it stores the location of an element in the ArrayList in a HashMap. If the same element can appear multiple times in the list, which location should we store? We should store all of them. This changes the definition of our HashMap to:

Map<Integer, HashSet<Integer>> 

HashSet implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantee as to the iteration order of the set, which is fine for us; what we require is that insert and remove operations on this data structure be O(1), i.e., constant time.
To know more about the complexity of various data structures in Java, follow Runtime Complexity of Java Collections and read the reason why HashSet provides constant time insert and remove operations.
Everything else follows the same process. In insert(), we add the element's location to the HashSet kept in the hash table. While removing, we pick one location of the element, copy the last element of the ArrayList into that location, update the set of locations for the value at the last index of the ArrayList, and remove the last element from the ArrayList. Updating the location sets is an O(1) operation.

getRandom() implementation remains same.

package AlgorithmsAndMe;

import java.util.*;

public class RandomNumberGenerator implements IRandomNumberGenerator {

    private ArrayList<Integer> list;
    private Map<Integer, HashSet<Integer>> loc;
    private Random random;

    //Initializing the class
    public RandomNumberGenerator(){
        list = new ArrayList<>();
        loc = new HashMap<>();
        random = new Random();
    }

    @Override
    public boolean insert(int value) {

        if(!loc.containsKey(value)){
            loc.put(value, new HashSet<>());
        };

        //Insert into list
        list.add(value);

        //Save the location on hash map
        loc.get(value).add(list.size()-1);
        return true;
    }

    @Override
    public boolean remove(int value) {
        /* If there is no entry in hash, that means
        there is no element in ArrayList */
        if(!loc.containsKey(value)) return false;

        //Get one location of the element in the ArrayList
        HashSet<Integer> listLocations = loc.get(value);
        int location = listLocations.iterator().next();
        listLocations.remove(location);

        int lastIndex = list.size()-1;
        int lastElement = list.get(lastIndex);
        /* If we are not removing the last slot itself, copy the last
        value over the removed location and update its set of locations.
        This also works when lastElement equals value. */
        if( location != lastIndex ) {
            list.set(location, lastElement);
            loc.get(lastElement).remove(lastIndex);
            loc.get(lastElement).add(location);
        }
        //remove the last location from ArrayList
        list.remove(list.size()-1);

        if(listLocations.isEmpty()) loc.remove(value);
        return true;
    }

    @Override
    public int getRandom() {
        return list.get(random.nextInt(list.size()));
    }
}

Other problems which are very similar to this concept: design an LRU cache, find the first non-repeated character in a stream, etc.

Please share if there is anything wrong or missing. If you are preparing for an interview and need one to one personalized coaching, please reach out to us on communications@algorithmsandme.com

Packet routing: journey of a packet in the internet

I am digressing from my topics of algorithms and operating systems because this question is nowadays commonly asked in interviews, more so in interviews at networking companies like Cisco, Juniper, ALU, and Qualcomm. The question is: how does a packet travel from source to destination on the internet? It gauges whether you understand packet routing.
There are three parts to the question. First, what happens inside the source host when a packet is generated by an application. Second, how the packet travels from one host (the source) to the other host (the destination), with many routers sitting in between. Third, what goes on inside the destination host when it receives a packet from the network. We will discuss each one by one.

Processing the packet at the source machine:
The figure below explains the steps a packet goes through before leaving the host.

[Figure: packet processing in the network stack]
  1. An application generates a packet to be sent on the network and sends it to the layer below.
  2. The next layer is called the transport layer, which manages end-to-end communication between the two machines. The protocol used can be TCP or UDP; the difference between these two protocols is another subject altogether.
  3. Once the packet is formed at the transport layer, it is sent to the network layer, which adds the source and destination IPs to the packet. The most important field added at the IP or network layer is Time To Live (TTL), which is used by intermediate routers/switches to decide whether the packet needs to be forwarded or not. (How is the destination IP found?)
  4. After the network layer, the packet reaches the data link or MAC layer, where the source and destination MAC addresses of the machines are added. We will see how these fields change between every two neighbors. (How is the destination MAC found?)
  5. The data link layer pushes this packet to the physical layer, where it is sent as a stream of “0”s and “1”s on the available physical medium.
There are many more things done at each layer, like the MTU decision at the transport layer and fragmentation at the IP and data link layers, but for simplicity of explanation I have skipped them.
Now the packet has reached an intermediate router which sits between the source and the destination, as shown in the figure.

Processing a packet at a router:

A router takes the packet and performs three basic operations: routing, forwarding, and encapsulation.

Routing

When the router receives a packet, first of all it strips off the MAC layer header and looks into the IP header, which contains the destination IP address. Once the destination IP is known, the router looks into its database to find out where this packet should be forwarded to make it reach its destination. This database is known as the routing table.
There are three cases which may occur when the router looks up the destination IP in the routing table:
1. If there is an entry corresponding to the destination IP, we get the name of the interface the packet should be forwarded on.
2. If there is no direct entry, the IP is converted into a network IP using the mask and checked again. Note that the longest prefix match is used to find the best forwarding interface (a sketch follows this list).
3. If nothing matches, the router just forwards the packet to the configured default destination.
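Real routers implement this lookup with specialized structures (tries) and often in hardware; purely as an illustration of the longest prefix match idea, here is a Java sketch using a plain linear scan. The Route fields and the lookup signature are my own invention.

import java.util.List;

public class RoutingTable {
    //A simplified IPv4 route entry: network/prefix length -> egress interface
    static class Route {
        int network;       //network address as a 32-bit integer, already masked
        int prefixLength;  //number of leading one-bits in the mask
        String egressInterface;

        Route(int network, int prefixLength, String egressInterface) {
            this.network = network;
            this.prefixLength = prefixLength;
            this.egressInterface = egressInterface;
        }
    }

    public String lookup(List<Route> routes, int destinationIp, String defaultInterface) {
        Route best = null;
        for (Route r : routes) {
            int mask = r.prefixLength == 0 ? 0 : -1 << (32 - r.prefixLength);
            //Destination falls in this network; keep the longest prefix seen so far
            if ((destinationIp & mask) == r.network
                    && (best == null || r.prefixLength > best.prefixLength)) {
                best = r;
            }
        }
        return best != null ? best.egressInterface : defaultInterface;
    }
}

For example, if the table contains 10.0.0.0/8 and 10.1.0.0/16, a destination in 10.1.x.x matches both entries, and the /16 entry wins because it has the longer prefix.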

[Figure: packet routing - how a packet travels]

Forwarding

Once the routing process finishes, the packet is switched from the ingress interface to the egress interface, commonly known as forwarding. Process switching, fast switching, and CEF switching are three methods of forwarding.
Before the third step (encapsulation), the router decreases the TTL, recalculates the checksum of the packet, and puts it back.

Encapsulation
The third process is encapsulation. Keep in mind that the L3 (layer 3, network layer) destination IP address never changes along the path of an IP packet, except in cases like NAT or VPN. The only things which change are the source and destination MAC addresses at the data link layer. The router caches the MAC address of the next hop it needs to send the packet to, replaces the source and destination MAC addresses in the packet, and sends it to the physical layer. The figure below explains the packet transformation between the ingress and egress interfaces.

Processing the packet at the destination host

  1. The packet is received at the network card (physical layer), which raises an interrupt to the CPU, and the CPU reads the packet in.
  2. At the data link layer, the destination MAC address is checked to see whether the packet is destined for this machine. If yes, the packet is sent up to the network layer.
  3. At the IP layer, packet validation such as checksum verification is done, and the packet is then passed on to the relevant transport layer.
  4. The transport layer then passes it on to the appropriate port so that it reaches the correct application.
I have deliberately explained the process without going into too much detail, to make the essential flow easy to understand without worrying about the finer details. Please share your views if you have something to add or if I have missed something important. I would also love to learn about other good sources writing on similar topics; please share them in the comments.

A method to quantify quality of software

Measurement of quality

Today, the dependence of an enterprise on IT has increased manyfold compared with twenty years back. Business too is changing very fast, and to remain competitive, agility of the IT infrastructure and software is essential. Virtualization and cloud provide this much-needed agility in the infrastructure area, but when it comes to software, and custom software in particular, the solution is not as simple. It all boils down to how fast the software, specifically custom software, can be restructured to meet the ever-changing demands of the business. Among the many factors that influence this restructuring, the biggest that comes in the way of high agility is the quality of the code, and hence the measurement of quality.

There are many quality metrics in the software industry used to measure some aspect of the code. For example, cyclomatic complexity, which counts the number of linearly independent paths through a section of a program, gives a measure of the complexity of the corresponding section in some way. Is this a complete measure of complexity? Obviously the answer is no, since complexity depends on many other factors apart from the linearly independent paths. Some of the key measures are cyclomatic complexity, cohesion, coupling (for example Data Abstraction Coupling and Fan-out Coupling), NPath complexity, code coverage, program size, documentation, MOOD metrics, and adherence to standards.

Software quality measurement

The quantities obtained from these quality metrics are different, as they measure different aspects of the code. Simply applying some mathematical operation to these quantities and adding them will give us a measure, e.g., the Maintainability Index, but will it balance all the concerns of different stakeholders? A single approach to fit all needs would be too simplistic. Over the years, the Maintainability Index has been redefined many times. Following are some of its definitions:

  • The original formula:
    MI = 171 - 5.2 * ln(V) - 0.23 * G - 16.2 * ln(LOC)
  • The derivative used by SEI:
    MI = 171 - 5.2 * log2(V) - 0.23 * G - 16.2 * log2(LOC) + 50 * sin(sqrt(2.4 * CM))
  • The derivative used by Microsoft Visual Studio (since v2008):
    sMI = MAX(0, (171 - 5.2 * ln(V) - 0.23 * G - 16.2 * ln(LOC)) * 100 / 171)

The above formulations use V for Halstead Program Volume, G for Cyclomatic Complexity, LOC for Lines of Source Code, and CM for Comment Ratio (lines of comments to the total number of lines). These formulations for the Maintainability Index embody the experience and skills of the individuals and organizations where they were first conceived. This has long been an art, highly dependent on the skills of the individuals and the data they are working with. Note that only with experience have individuals and organizations been able to find constants or mathematical functions which give results matching their expectations of the code at hand.
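Purely as an illustration of how such an index is computed, here is the Visual Studio variant above transcribed into Java (a sketch; the method and parameter names are mine).

public class MaintainabilityIndex {
    //sMI = MAX(0, (171 - 5.2*ln(V) - 0.23*G - 16.2*ln(LOC)) * 100 / 171)
    //v: Halstead Volume, g: Cyclomatic Complexity, loc: Lines of Source Code
    public static double visualStudioMi(double v, double g, double loc) {
        double mi = 171 - 5.2 * Math.log(v) - 0.23 * g - 16.2 * Math.log(loc);
        return Math.max(0, mi * 100 / 171);
    }
}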

In my experience of developing and maintaining software for many of my organization's customers, I have seen the concerns change over time within the same engagement. An index such as the above still gives the same measure and therefore becomes less applicable. From engagement to engagement, since the priorities vary, reuse of the same index is again less applicable. Maintenance engagements, I have noticed, are more focused on maintainability; the same holds for products which are mature. Development engagements are more holistic, but then tend to focus on the maintainability aspect as the product being developed matures.

The quality metrics cited above are not the whole universe; there are bound to be changes to them and additions of newer and smarter metrics, which architects and managers will certainly want to use too.

A more mature method therefore needs to be developed, one which is independent of the quality metric in question and treats all quality metrics in a similar manner. With quantities produced by such a method, it would be easier to alter the index based on the concerns relevant at the time and still be able to correlate it with index values obtained historically.

Why should such a quantity exist?

To answer this question, I would like to consider the example of two applications, along with the quality metric 'cyclomatic complexity'. Let me call them A1 and A2, and let the two applications have a similar number of code artifacts. Application A1 has most of its code artifacts with cyclomatic complexity in the 1-3 range, while application A2 has most of its code artifacts with cyclomatic complexity in the 8-10 range. Note that the industry best practice for this quality measure is 10. So the question is: do the two applications in this scenario have the same code quality?

Obviously the answer is no. The artifacts in application A1 have lower cyclomatic complexity than those in application A2, which in turn means that the artifacts of application A1 are simpler than those of application A2. The graph of the two applications, when plotted in the manner shown above, makes this fact very obvious.

Notice that in the graph above I compared two applications. Let us for the moment assume that we have a mathematical formulation which can compare two applications in the manner shown in the graph and give us a quantity. What if we were to compare each application with a hypothetically perfect application of similar size? With the assumed mathematical formulation, we could obtain a quantity for both applications and use it to compare them.
Now, what is such a mathematical formulation? One would be tempted to use the average, but the average will not capture the essence present in the graph. If one digs further into statistical quantities, the one that captures the essence of the graph above is the general correlation coefficient. Here, the correlation is on the count of code artifacts having a particular range of values of the quality metric, against a hypothetical perfect application. Note that it is very simple to define a similar-sized perfect application: all its counts would be in the range considered best from the quality perspective for that metric. The formula I will use for correlation, derived from the general correlation coefficient, is as follows:

The scores ai are derived by subtracting the quality metric value of code artifact i from the value considered best for that quality metric. This is to be done for all artifacts that are not at the desirable levels (it should be ensured that these values are negative). For the artifacts that are at desirable levels, the value obtained for the quality metric itself is used. However, if the quality metric is grouped into k groups, with the ith group having the assigned score ai and the count of artifacts from the application lying in the ith group being ni (so that n1 + n2 + … + nk = n), the formula changes accordingly.

Now let us look at how this quantity behaves for a given quality metric. The table below shows scenarios for different kinds of applications, each with 50 code artifacts, where the counts for the quality metric are split into three groups: good (score 2), average (score -1), and bad (score -2).

Quality Metric Grouping | Perfect | Bad  | Bad  | Bad    | Below Average | Below Average | Average | On the edge | Good
Good (2)                | 50      | 0    | 0    | 0      | 25            | 25            | 25      | 35          | 40
Average (-1)            | 0       | 50   | 0    | 25     | 0             | 17            | 25      | 15          | 7
Bad (-2)                | 0       | 0    | 50   | 25     | 25            | 8             | 0       | 0           | 3
Expected quantity       | -       | <0.2 | <0.2 | <0.2   | <0.4          | <0.4          | <0.65   | <0.7        | >0.7
τ                       | -       | -1   | -1   | -0.948 | 0             | 0.197         | 0.316   | 0.625       | 0.709
(1+τ)/2                 | -       | 0    | 0    | 0.025  | 0.5           | 0.598         | 0.658   | 0.812       | 0.854
[(1+τ)/2]²              | -       | 0    | 0    | 0      | 0.25          | 0.358         | 0.433   | 0.659       | 0.729

Notice that the correlation values for bad applications spread from 0.5 down to -1, while for applications average or better the range of the correlation lies between 0.5 and 1. This leaves little scope for distinguishing good, on-the-edge, and average applications. Thankfully, since the number is between -1 and 1, squaring or cubing the number will stretch the range where we want it stretched. Squaring (and keeping the sign) reduces the range for bad applications, making it 0.25 to -1, while increasing the range for the other types of applications, making it 0.25 to 1. Also notice that the calculation (1 + τ)/2 just maps the whole range from [-1, 1] to [0, 1]. As the table shows, [(1 + τ)/2]² gives a value very close to what I was expecting for each type of application; for example, for the 'Average' column, τ = 0.316 gives (1 + τ)/2 = 0.658 and [(1 + τ)/2]² ≈ 0.433.

This method to quantify the quality of software provides a way to calculate a correlation value for different code quality metrics. Since the values obtained are all correlation values, a weighted addition can easily be done to arrive at an overall score. The weights can be chosen in line with the key concerns of the various stakeholders relevant to the software. Such a score is not dependent on the skills of an individual and therefore has greater significance and can be used in many ways.