Design a data structure with insert, delete and getRandom in O(1)

The problem statement is to design a data structure which performs the following operations in O(1) time complexity:
1. Insert an element, insert(int value)
2. Remove an element, remove(int value)
3. Get random element, getRandom()

For example, insert 1 into the data structure insert(1): [1]
insert 2 into the data structure insert(2): [1,2]
insert 3 into the data structure insert(3): [1,2,3]

Remove 2 from it, remove(2). [1,3]
getRandom() should return 1 and 3 with equal probabilities.

These kinds of problems are easy and hard at the same time. The idea is to go step by step and solve each part. The first step is to define an interface for this data structure, which is easy given the definition of the problem.

public interface IRandomNumberGenerator {
    public boolean insert(int value);
    public boolean remove(int value);
    public int getRandom();
}

Now that the interface is ready, it is time to implement the class. First of all, we have to find a container to store all the elements. If we take an ArrayList, insert() is O(1) as we always add the new element at the end of the ArrayList. getRandom() is also O(1). However, there is a problem with remove(). To remove an element from an ArrayList, we have to scan the whole ArrayList, remove the element, and then move all the elements on the right of the deleted element one index to the left. This is an O(n) operation.

Insert, delete and getRandom in O(1): selection of data structures

A problem with storing elements in an ArrayList is that, on removal, we have to scan the list to find the location of the element to be removed. What if we already knew the location of the element? We can store the position of each element of the ArrayList in a HashMap which maps the value to its index in the ArrayList.

Now, insert() has to insert a value into two data structures: first into the ArrayList, and then the location of the value in the ArrayList into the HashMap. The remove operation can simply go to the location in the ArrayList and delete the element. Wait, we still have to move all the elements on the right one position to the left, which means the worst-case complexity of remove() is still O(n).

We know one thing: if we remove the last element of the ArrayList, no shifting is required. What if we copy the last value into the index of the element to be removed and then just remove the last element? Be careful: we have to update the HashMap with the new index of the value that was at the last position of the ArrayList. This way, remove() is also O(1).
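
For example, suppose the list is [10, 20, 30, 40] and the map is {10→0, 20→1, 30→2, 40→3}. To remove(20), look up its index (1), copy the last value 40 into index 1, update the map entry of 40 to 1, and delete the last slot. The list becomes [10, 40, 30] and the map becomes {10→0, 40→1, 30→2}, with no shifting needed.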

Insert, delete and getRandom in O(1): implementation

package AlgorithmsAndMe;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomNumberGenerator implements IRandomNumberGenerator {

    private ArrayList<Integer> list;
    private Map<Integer, Integer> loc;
    private Random random;

    //Initializing the class
    public RandomNumberGenerator(){
        list = new ArrayList<>();
        loc = new HashMap<>();
        random = new Random();
    }

    @Override
    public boolean insert(int value) {
        /*If hash already contains key then it is a duplicate key.
          So, we just return false.
         */
        if(loc.containsKey(value)) return false;

        //Insert into list
        list.add(value);

        //Save the location on hash map
        loc.put(value, list.size()-1);
        return true;
    }

    @Override
    public boolean remove(int value) {
        /* If there is no entry in hash, that means
        there is no element in ArrayList */
        if(!loc.containsKey(value)) return false;

        int location = loc.get(value);
        //Remove from hash
        loc.remove(value);

        if(location != list.size()-1){
            /*Copy the last value in the array
            list to the current location*/
            list.set(location, list.get(list.size()-1));

            //Update the location of last element in hash
            loc.put(list.get(location), location);
        }

        //remove the last location from ArrayList
        list.remove(list.size()-1);
 
        return true;
    }

    @Override
    public int getRandom() {
        return list.get(random.nextInt(list.size()));
    }
}

package AlgorithmsAndMe;

import static org.junit.Assert.*;

public class RandomNumberGeneratorTest {

    RandomNumberGenerator randomNumberGenerator =
           new RandomNumberGenerator();

    @org.junit.Test
    public void testInterface() {
        assertEquals(true, randomNumberGenerator.insert(4));
        assertEquals(true, randomNumberGenerator.insert(5));
        assertEquals(true, randomNumberGenerator.insert(3));
        assertEquals(true, randomNumberGenerator.insert(2));

        assertEquals(true, randomNumberGenerator.remove(4));

        int random = randomNumberGenerator.getRandom();
        System.out.println(random);
    }
}

The complexity of the whole data structure for insert, delete and getRandom is O(1).

Insert, delete and get random when duplicates are allowed

Let's make this problem a bit more complex by allowing duplicate elements in the list. The first problem with the existing implementation is that it stores the location of an element of the ArrayList in a HashMap. If the same element can appear multiple times in the list, then which location should we store? We should store all the locations. This changes the definition of our HashMap to

Map<Integer, HashSet<Integer>> 

HashSet implements the Set interface, backed by a hash table (actually a HashMap instance). No guarantee is made as to the iteration order of the set, and we do not need any ordering; what we require is that insert and remove operations on this data structure are O(1), i.e. constant time.
To know more about the complexity of various data structures in Java, follow Runtime Complexity of Java Collections and read why HashSet provides constant-time insert and remove operations.
Everything else follows the same process. In insert(), we add the new location of the element to its HashSet in the hash table. While removing, we pick one location of the element, put the last element of the ArrayList in that location, and update the HashSet of locations corresponding to the value at the last index of the ArrayList. Finally, we remove the last element from the ArrayList.
Moving the last element's location in the hash is also an O(1) operation.

The getRandom() implementation remains the same.

package AlgorithmsAndMe;

import java.util.*;

public class RandomNumberGenerator implements IRandomNumberGenerator {

    private ArrayList<Integer> list;
    private Map<Integer, HashSet<Integer>> loc;
    private Random random;

    //Initializing the class
    public RandomNumberGenerator(){
        list = new ArrayList<>();
        loc = new HashMap<>();
        random = new Random();
    }

    @Override
    public boolean insert(int value) {

        if(!loc.containsKey(value)){
            loc.put(value, new HashSet<>());
        }

        //Insert into list
        list.add(value);

        //Save the location on hash map
        loc.get(value).add(list.size()-1);
        return true;
    }

    @Override
    public boolean remove(int value) {
        /* If there is no entry in hash, that means
        there is no element in ArrayList */
        if(!loc.containsKey(value)) return false;

        //Pick one location of the element in the ArrayList
        HashSet<Integer> listLocations = loc.get(value);
        int location = listLocations.iterator().next();
        listLocations.remove(location);

        int lastIndex = list.size()-1;
        int lastElement = list.get(lastIndex);
        if(location != lastIndex) {
            /*Copy the last value in the array
            list to the current location*/
            list.set(location, lastElement);

            //Update the location of the last element in hash
            loc.get(lastElement).remove(lastIndex);
            loc.get(lastElement).add(location);
        }
        //remove the last location from ArrayList
        list.remove(list.size()-1);

        if(listLocations.isEmpty()) loc.remove(value);
        return true;
    }

    @Override
    public int getRandom() {
        return list.get(random.nextInt(list.size()));
    }
}
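
As a quick check of the duplicate-friendly version above, here is a minimal JUnit sketch along the lines of the earlier test (the test class name is made up): two copies of the same value are inserted, one is removed, and getRandom() must return only values that are still present.

package AlgorithmsAndMe;

import static org.junit.Assert.*;

public class RandomNumberGeneratorWithDuplicatesTest {

    RandomNumberGenerator randomNumberGenerator =
            new RandomNumberGenerator();

    @org.junit.Test
    public void testDuplicates() {
        //Duplicates are allowed, so both inserts succeed
        assertEquals(true, randomNumberGenerator.insert(7));
        assertEquals(true, randomNumberGenerator.insert(7));
        assertEquals(true, randomNumberGenerator.insert(9));

        //Removes only one of the two copies of 7
        assertEquals(true, randomNumberGenerator.remove(7));

        //Whatever comes back must be 7 or 9
        int random = randomNumberGenerator.getRandom();
        assertTrue(random == 7 || random == 9);
    }
}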

Other problems which use a very similar idea are: design an LRU cache, find the first non-repeated character in a stream, etc.

Please share if there is anything wrong or missing. If you are preparing for an interview and need one to one personalized coaching, please reach out to us on communications@algorithmsandme.com

How a packet travels on the network from source to destination

Journey of a packet in the network

Packet routing: journey of a packet on the internet

I am digressing from my usual topics of algorithms and operating systems because this question is nowadays commonly asked in interviews, more so in interviews at networking companies like Cisco, Juniper, ALU and Qualcomm. The question is how a packet travels from source to destination on the internet. It is asked to gauge whether you understand packet routing.
There are three parts to the question. The first is what happens inside the computer or host when a packet is generated by an application. The second is how the packet travels from one host (the source) to another host (the destination) with many routers sitting in between. The third is what happens inside the destination host when it receives a packet from the network. We will discuss each one by one.

Processing a packet at the source machine:
The figure below explains the steps a packet goes through before going out of the host.

packet processing in network stack
  1. An application generates a packet to be sent on the network and hands it to the layer below.
  2. The next layer is the transport layer, which manages end-to-end communication between two machines. The protocol used can be TCP or UDP. The difference between these two protocols is another subject altogether.
  3. Once the packet is formed at the transport layer, it is sent to the network layer, which adds the source and destination IP addresses to the packet. The most important field added at the IP or network layer is Time To Live (TTL), which is used by intermediate routers/switches to decide whether the packet needs to be forwarded or not. (How is the destination IP found?)
  4. After the network layer, the packet reaches the data link or MAC layer, where the source and destination MAC addresses of the machines are added. We will see how these fields change between every two neighbors. (How is the destination MAC found?)
  5. The data link layer pushes this packet to the physical layer, where it is sent as a stream of 0s and 1s on the physical medium available.
There are a lot more things done at each layer, like the MTU decision at the transport layer and fragmentation at the IP and data link layers, but for simplicity of explanation I have skipped them.
Now the packet has reached an intermediate router which sits between the source and destination, as shown in the figure.

Processing a packet at router:

The router takes the packet and performs three basic operations: routing, forwarding, and encapsulation.

Routing

When the router receives a packet, it first strips off the MAC layer header and looks into the IP header, which contains the destination IP address. Once the destination IP is known, the router looks into its database to find out where this packet should be forwarded so that it reaches its destination. This database is known as the routing table.
There are three cases which may occur when the router looks into the routing table for the destination IP:
1. If there is an entry corresponding to the destination IP, we get the name of the interface the packet should be forwarded on.
2. If there is no direct entry, then the IP is converted into a network IP using the mask and checked again. Note that the longest prefix match is used to find the best forwarding interface (see the sketch below).
3. If nothing matches, then the router just forwards the packet to the configured default destination.
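
To make the longest-prefix-match idea concrete, here is a small, simplified Java sketch (the class, the routes and the interface names are made up, and real routers use far more efficient lookup structures such as tries or TCAM): it masks the destination IP with each route's prefix mask and, among all matching entries, picks the entry with the longest prefix; if nothing matches, the default route is used.

package AlgorithmsAndMe;

import java.util.ArrayList;
import java.util.List;

public class SimpleRoutingTable {

    //One route: network address plus prefix length -> outgoing interface
    static class Route {
        int network;       //network address, already masked, as a 32-bit int
        int prefixLength;  //number of leading 1-bits in the mask
        String egress;     //interface to forward on

        Route(int network, int prefixLength, String egress) {
            this.network = network;
            this.prefixLength = prefixLength;
            this.egress = egress;
        }
    }

    private final List<Route> routes = new ArrayList<>();
    private final String defaultInterface;

    public SimpleRoutingTable(String defaultInterface) {
        this.defaultInterface = defaultInterface;
    }

    public void addRoute(int network, int prefixLength, String egress) {
        routes.add(new Route(network, prefixLength, egress));
    }

    //Longest prefix match: the most specific matching route wins
    public String lookup(int destinationIp) {
        Route best = null;
        for (Route r : routes) {
            int mask = r.prefixLength == 0 ? 0 : -1 << (32 - r.prefixLength);
            if ((destinationIp & mask) == r.network) {
                if (best == null || r.prefixLength > best.prefixLength) {
                    best = r;
                }
            }
        }
        //Case 3: nothing matched, use the default route
        return best != null ? best.egress : defaultInterface;
    }

    public static void main(String[] args) {
        SimpleRoutingTable table = new SimpleRoutingTable("eth0");  //default route
        table.addRoute(0x0A000000, 8, "eth1");   //10.0.0.0/8
        table.addRoute(0x0A010000, 16, "eth2");  //10.1.0.0/16
        //10.1.2.3 matches both entries; the /16 is more specific, so eth2 wins
        System.out.println(table.lookup(0x0A010203));
    }
}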

Packet routing: how a packet travels



Forwarding

Once the routing process finishes, the packet is switched from the ingress interface to the egress interface, commonly known as forwarding. Process switching, fast switching and CEF switching are three methods of forwarding.
Before the third step, the router decreases the TTL, recalculates the checksum of the packet and puts it back.

Encapsulation
The third process is encapsulation. Please keep in mind that the L3 (layer 3, or network layer) destination IP address never changes throughout the path of an IP packet, except in cases like NAT or VPN.
The only things that change are the source and destination MAC addresses at the data link layer.
The router caches the MAC addresses of the next hops it needs to send packets to; it replaces the source and destination MAC addresses in the packet and sends it to the physical layer.
The figure below explains the packet transformation between the ingress and egress interfaces.

Processing a packet at the destination host

  1. The packet is received at the network card (physical layer), which generates an interrupt to the CPU, and the CPU reads the packet in.
  2. At the data link layer, the destination MAC address is checked to see if the packet is destined for this machine. If yes, the packet is sent up to the network layer.
  3. At the IP layer, packet validation such as checksum verification is done, and the packet is then passed on to the relevant transport layer.
  4. The transport layer then passes it on to the appropriate port so that it reaches the correct application.
I have deliberately explained the process without going into too much detail, to make the essential flow easy to understand without worrying about the finer details. Please share your views if you have something to add or if I have missed something important. I would also love to learn about other good sources writing on similar topics, please share them in the comments.

A method to quantify quality of software

Measurement of quality

Today, the dependence of an enterprise on IT has increased manyfold compared to twenty years back. Business too is changing very fast, and to remain competitive, the agility of the IT infrastructure and software is essential. Virtualization and cloud provide this much-needed agility to IT in the infrastructure area, but when it comes to software, and custom software in particular, the solution is not as simple. It all boils down to how fast the software, specifically custom software, can be restructured to meet the ever-changing demands of the business. Among the many factors that influence this restructuring, the biggest that comes in the way of high agility is the quality of the code, and hence the measurement of quality.

There are many quality metrics in the software industry that are used today to measure some aspect of the code. For example, cyclomatic complexity, which measures the number of linearly independent paths through a section of a program, gives a measure of the complexity of the corresponding section in some way. Is this a complete measure of complexity? Obviously the answer is no, since the complexity depends on many other factors apart from the linearly independent paths. Some of the key measures are cyclomatic complexity, cohesion, coupling (for example Data Abstraction Coupling and Fan-out Coupling), NPath complexity, code coverage, program size, documentation, MOOD metrics, and adherence to standards.

Software quality measurement

The quantities obtained from these quality metrics are different as they measure different aspects of the code. Simply applying a mathematical operation to some of these quantities and then adding them will give us a measure, e.g. the Maintainability Index, but will it balance all the concerns of different stakeholders? A single approach to fit all needs would be too simplistic. Over the years the Maintainability Index has been redefined many times. Following are some of its definitions:

  • The original formula:
    MI = 171 – 5.2 * ln(V) – 0.23 * (G) – 16.2 * ln(LOC)
  • The derivative used by SEI:
    MI = 171 – 5.2 * log2(V) – 0.23 * G – 16.2 * log2 (LOC) + 50 * sin (sqrt(2.4 * CM))
  • The derivative used by Microsoft Visual Studio (since v2008):
    sMI = MAX(0,(171 – 5.2 * ln(V) – 0.23 * (G) – 16.2 * ln(LOC))*100 / 171)

The above formulations use V for Halstead Program Volume, G for Cyclomatic Complexity, LOC for Lines of Source Code, and CM for Comment Ratio (lines of comments to the total number of lines). These formulations of the Maintainability Index used the experience and skills of the individuals and organizations where they were first conceived. This has long been an art, highly dependent on the skills of the individual and the data he or she is working with. Note that only with experience have individuals and organizations been able to find constants or mathematical functions which give results matching the expectations on the code at hand.
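
As a small illustration of how these definitions are applied, the sketch below computes the Visual Studio variant of the index; only the formula comes from the definition above, while the input values in main() are made-up examples.

package AlgorithmsAndMe;

public class MaintainabilityIndex {

    //sMI = MAX(0, (171 - 5.2*ln(V) - 0.23*G - 16.2*ln(LOC)) * 100 / 171)
    public static double visualStudioMI(double halsteadVolume,
                                        double cyclomaticComplexity,
                                        double linesOfCode) {
        double raw = 171
                - 5.2 * Math.log(halsteadVolume)
                - 0.23 * cyclomaticComplexity
                - 16.2 * Math.log(linesOfCode);
        return Math.max(0, raw * 100 / 171);
    }

    public static void main(String[] args) {
        //Illustrative values only: V = 1000, G = 10, LOC = 200
        System.out.println(visualStudioMI(1000, 10, 200));
    }
}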

In my experience developing and maintaining software for many of my organization's customers, I have seen the concerns change over time within the same engagement. An index such as the above still gives the same measure and therefore becomes less applicable. From engagement to engagement, since the priorities vary, the reuse of the same index is again less applicable. Maintenance engagements, I have noticed, are more focused towards maintainability; the same is true of products which are mature. Development engagements are more holistic, but then tend to focus on the maintainability aspect as the product being developed matures.

The quality metrics cited above are not the whole universe. There are bound to be changes in them and additions of newer and smarter metrics. Architects and managers would certainly want to use those too.

A more mature method therefore needs to be developed which is independent of the quality metric in question and treats all the quality metrics in a similar manner. With quantities produced from such a method, it would be easier to alter the index based on the concerns relevant at the time and still be able to correlate it with the index values obtained historically.

Why should such a quantity exist?

To answer this question, I would like to consider the example of two applications along with the quality metric 'cyclomatic complexity'. Let me call them A1 and A2. Let these two applications have a similar number of code artifacts. Application A1 has most of its code artifacts with cyclomatic complexity in the 1-3 range, while application A2 has most of its code artifacts with cyclomatic complexity in the 8-10 range. Note that the industry best practice for this quality measure is 10. So the question is: do the two applications considered in this scenario have the same code quality?

Obviously the answer is no. The artifacts in application A1 have lower cyclomatic complexity than those in application A2, which in turn means that the artifacts of application A1 are simpler than those of application A2. The graph of the two applications, when plotted in the manner shown above, makes this fact very obvious.

Notice that in the graph above I compared two applications. Let us for the moment assume that we have a mathematical formulation which can compare two applications in the manner shown in the graph and give us a quantity. What if we were to compare each application with a hypothetically perfect application of similar size? With the assumed mathematical formulation we can obtain such a quantity for both applications and use it to compare the two.
Now, what is such a mathematical formulation? One would be tempted to use the average, but the average will not capture the essence present in the graph. If one delves further into statistical quantities, the quantity that captures the essence of the graph above is the general correlation coefficient. Here, the correlation is between the counts of code artifacts having a particular range of values of the quality metric and a hypothetical perfect application. Note that it is very simple to qualify a similar-sized perfect application: all its counts would be in the range that is considered best from the quality perspective for that quality metric. The formula that I will use for the correlation, after deriving it from the general correlation coefficient, is as follows:

The scores ai are derived by subtracting the quality metric value of code artifact i from the value that is considered best for that quality metric. This is to be done for all artifacts that are not at the desirable levels (it should be ensured that these values are negative). For the artifacts that are at the desirable levels, the value obtained for the quality metric itself is to be used. However, if the quality metric is grouped into k groups, with the ith group having the assigned score ai and the count of artifacts from the application lying in the ith group being ni (given that n1 + n2 + … + nk = n), the above formula changes to

Now let us look at some good formulations of this quantity for a given quality metric. The table below shows some scenarios of different kinds of applications, where the counts for the quality metric are split into three groups, viz. good (2), average (-1) and bad (-2).

Artifact counts for each application classification:

Quality metric grouping | Perfect | Bad   | Bad   | Bad    | Below average | Below average | Average | On the edge | Good
Good (2)                | 50      | 0     | 0     | 0      | 25            | 25            | 25      | 35          | 40
Average (-1)            | 0       | 50    | 0     | 25     | 0             | 17            | 25      | 15          | 7
Bad (-2)                | 0       | 0     | 50    | 25     | 25            | 8             | 0       | 0           | 3
Expected quantity       |         | < 0.2 | < 0.2 | < 0.2  | < 0.4         | < 0.4         | < 0.65  | < 0.7       | > 0.7
τ                       |         | -1    | -1    | -0.948 | 0             | 0.197         | 0.316   | 0.625       | 0.709
(1+τ)/2                 |         | 0     | 0     | 0.025  | 0.5           | 0.598         | 0.658   | 0.812       | 0.854
[(1+τ)/2]²              |         | 0     | 0     | 0      | 0.25          | 0.358         | 0.433   | 0.659       | 0.729

Notice that the correlation values for bad applications lie between -1 and 0.5, while for applications average or better the correlation lies between 0.5 and 1. This leaves little scope for distinguishing good, on-the-edge and average applications. Thankfully, since the number is between -1 and 1, squaring or cubing it increases the range where we want it increased. Squaring (and keeping the sign) reduces the range for bad applications, making it from -1 to 0.25, while increasing the range for the rest of the applications, making it from 0.25 to 1. Also notice that the calculation (1 + τ)/2 just changes the whole range from [-1, 1] to [0, 1]. Thus [(1 + τ)/2]² gives a very good value in comparison to the value I was expecting for each type of application.
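
The rescaling and squaring applied in the last two rows of the table is purely mechanical. Here is a tiny sketch of just that transformation, fed with the τ values from the table (how τ itself is derived is the subject of the formulas above):

package AlgorithmsAndMe;

public class CorrelationRescaling {

    //Maps tau in [-1, 1] to [0, 1] and then squares it to widen
    //the range in which average-or-better applications fall
    public static double rescaled(double tau) {
        double shifted = (1 + tau) / 2;
        return shifted * shifted;
    }

    public static void main(String[] args) {
        double[] taus = {-1, -1, -0.948, 0, 0.197, 0.316, 0.625, 0.709};
        for (double tau : taus) {
            System.out.printf("tau = %6.3f -> [(1+tau)/2]^2 = %.3f%n", tau, rescaled(tau));
        }
    }
}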

This method to quantify the quality of software provides a way to calculate a correlation value for different code quality metrics. Since the values obtained are all correlation values, a weighted addition can easily be done to arrive at an overall score. The weights can be chosen to be in line with the key concerns of the various stakeholders of the software. Such a score is not dependent on the skills of an individual and therefore has greater significance and can be used in many ways.

Five reasons why not to use global variables?

Global variables are those variables which can be accessed and modified by any module. They are declared in the outermost scope so that they are visible to every inner module.
They are usually considered a bad practice because of their non-locality: they create mutual dependencies and increase the complexity of the program. However, there are some places where global variables are suitable. When one wants to avoid frequently passing parameters from one function to another, global variables can be of use. They are also used to pass information between modules that do not share a caller/called relationship, such as concurrent threads, signal handlers or interrupt routines.

Why should they be avoided when unnecessary?
As explained in the introduction, global variables induce interdependency between modules, increasing the complexity of the code and making it hard to analyze. Here are more reasons not to use a global variable unless absolutely necessary.

Non-locality
Global variables are declared in the outermost scope, so if a function uses one in an innermost scope, we have to keep track of whether that variable has been modified before it is used in the function. Functions are easy to analyze when their variables are localized and the influence of other modules over them is minimal. If your code is large enough, another problem called thrashing (related to paging in the OS) may occur while executing your code (only if there is no spare memory left), as the variable and the code may be on different pages.

Implicit coupling
Coupling between modules of the code should be as low as possible in order to make it maintainable in the future. When you use global variables in code, they induce coupling between modules.

Concurrency issues
This case arises when your application is multithreaded. If more than one thread tries to access the same variable at the same time, it might result in race conditions or deadlock situations. Special precautions have to be taken to avoid such situations, which results in extra code: we have to implement lock and unlock functions to ensure single-threaded access to the global variable. There is also the possibility that threads read stale values.
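
A tiny, contrived Java sketch of the problem: two threads increment a shared static field (the closest Java analogue of a global variable) without synchronization, so the final count usually ends up below the expected 200000 because the read-modify-write on the shared field is not atomic.

package AlgorithmsAndMe;

public class GlobalCounterRace {

    //A "global" in Java terms: a mutable static field visible everywhere
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++;   //not atomic: read, add, write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        //Usually prints something less than 200000
        System.out.println("counter = " + counter);
    }
}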

No access control or constraint checking
Global variables can be read or set in any part of the program without any rules governing access to them, because there are none. This greatly hinders security when you are using third-party tools.

Difficult to test
Code which uses global variables is difficult to test because there is no easy way to set up a clean environment between runs. It is very difficult to test modules in isolation, as unit testing suggests; there will be implicit effects of other modules on the module under test in the form of global variables, so it becomes difficult to pinpoint and debug code.

These are five reasons why you should avoid global variables as far as possible.

Four root causes why code goes haywire?

The code written by us is not final; it will be modified, maintained, and enhanced by many others. There are many reasons why code becomes unmanageable after some time. In this post, I will consider four of them.

Root cause 1: Creating libraries of business functions without considering their domain module.
It's a general tendency to club similar functions together in one library file. For example, while developing a banking application, we tend to write all the interest functions together without worrying about whether the interest is calculated on a loan or on a deposit. Loan and deposit are the domain modules of the application here.

Another example of a similar mistake is a reservation system; we tend to write the payment and reservation functions in the same file because they are linked, without considering that their domains are different.
It will be very difficult to enhance such systems, and even if you manage to make some modifications, the worst effect is seen while performing regression testing: your small modification will affect many unrelated functions.

Root cause 2: Mixing up the business logic with the presentation logic.
If the user is a borrower, screen 1 should appear; if the user is a depositor, screen 2 should appear. All of this is part of the presentation logic. These decisions should be handled at the presentation layer of the application without disturbing the business logic.
A loan calculation function that checks whether the call for interest calculation comes from batch processing or from an online request is a classic example of mixing business logic with presentation logic.

Root cause 3: No proper layering of functions and communication mechanism.
We should never keep utility functions and domain-specific functions at the same layer. For example, a utility function for date formatting should not be at the same level as a domain-specific function like interest calculation. There are general rules which should be followed:
• Modules residing on the same level should communicate with each other via predefined interfaces.
• Modules at an upper layer can access lower-level functions, but not vice versa.

This also enables us to put the common part of a function, which is used across domains, in the lower layer and the more specialized part for a specific domain in the upper layer.

Root cause 4: Not organizing code as per the functional domain.
Do not keep all your files at the same level of the hierarchy irrespective of their domain. Always make a separate folder for files which are related to one domain. This improves the maintainability of the system, as the code is well organized.

Unit testing techniques

A unit is an atomic part of a program which cannot be decomposed into smaller parts. Testing that part for its correct implementation is called unit testing. There are various techniques for conducting unit testing. These techniques are complementary to each other and find different sets of errors when applied to the unit, one after the other.

1) Specification-based testing techniques: These techniques generate test cases based on the program's behavior, input domain, and expected output. The following are specification-based techniques:

a) Equivalence partitioning: Under this technique, the input domain of the unit is partitioned into various classes. Each class contains input values which have a similar effect on the code. Test cases select input values at random from these classes.

b) Boundary value analysis: Test cases are generated so that they cover the boundaries of any condition in the code. Generally, it is performed after equivalence partitioning. One test case is designed for the inside of each class and as many as necessary to cover its limits (see the small example after this list).

c) Decision tables and cause-effect graphing: Inputs to the unit can occur in any combination. Those combinations are represented and analyzed with decision tables and cause-effect graphing. Test cases are then generated for all the possible input-output combinations.

d) Random testing: Test cases are generated at random according to the input domain and the expected output domain specified in the specifications.
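
To make techniques a) and b) concrete, here is a small made-up example: assume a unit isAdult(int age) that accepts ages 0 to 120 and returns true for ages 18 and above. The equivalence classes (invalid low, minor, adult, invalid high) and the boundary values fall out of that assumed specification.

package AlgorithmsAndMe;

import static org.junit.Assert.*;
import org.junit.Test;

public class AgeCheckTest {

    //The unit under test in this made-up example
    static boolean isAdult(int age) {
        if (age < 0 || age > 120) throw new IllegalArgumentException("invalid age");
        return age >= 18;
    }

    @Test
    public void equivalencePartitions() {
        //One representative value from each valid class
        assertFalse(isAdult(10));   //minor class (0..17)
        assertTrue(isAdult(40));    //adult class (18..120)
    }

    @Test
    public void boundaryValues() {
        //Values at and around the 17/18 boundary
        assertFalse(isAdult(17));
        assertTrue(isAdult(18));
        //Edges of the valid input domain
        assertFalse(isAdult(0));
        assertTrue(isAdult(120));
    }

    @Test(expected = IllegalArgumentException.class)
    public void invalidLowPartition() {
        isAdult(-1);   //invalid low class (< 0)
    }
}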

2) Code-based testing techniques: These testing techniques can again be classified into two categories:

a) Control flow based criteria: These vary by how well, and which parts of, the program control flow the test cases cover.
b) Data flow based criteria: Here, test cases cover the execution paths between where variables are defined and where they are used in the program flow.

Let's consider the control flow based criteria techniques:
a) Statement coverage: Test cases generated for this criterion must ensure that they execute each and every statement of the code.
b) Decision (branch) coverage: Code contains many decisions and corresponding branches for those decisions. To fulfill the decision or branch coverage criterion, test cases must ensure that every decision in the program takes both the values true and false and that the corresponding code is executed.
c) Condition (predicate) coverage: In practice, decisions are composed of many conditions, and the combination of values taken by these conditions decides the final outcome of the decision. To satisfy the condition coverage criterion, test cases must ensure that every condition present in a decision takes both the values true and false.

d) Modified condition/decision coverage: There may be a case where we have tested the conditions and the decisions in the program, but the individual effect of each condition on the decision is still not tested. There may also be cases of short-circuiting in logical expressions. To test the independent influence of every condition on the decision, modified condition/decision coverage is used.
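
A tiny made-up fragment makes the difference between these criteria visible. For the decision in canWithdraw() below, decision coverage only needs the whole expression to evaluate to true once and to false once; condition coverage needs each individual condition to take both values; modified condition/decision coverage additionally needs cases showing that flipping each condition on its own flips the decision.

package AlgorithmsAndMe;

public class CoverageExample {

    //One decision made of two conditions
    static boolean canWithdraw(boolean accountActive, boolean hasBalance) {
        if (accountActive && hasBalance) {   //decision with conditions a, b
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        //Decision coverage: (true,true) and (false,false) are enough
        //Condition coverage: a and b each seen true and false, e.g. (true,false) and (false,true)
        //  -- note the decision is false both times, so decisions are not fully covered
        //MC/DC: (true,true), (false,true), (true,false) show each condition
        //  independently changing the outcome; short-circuiting of && is also exercised
        System.out.println(canWithdraw(true, true));    //true
        System.out.println(canWithdraw(false, true));   //false (a alone flipped the outcome)
        System.out.println(canWithdraw(true, false));   //false (b alone flipped the outcome)
    }
}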

Confusing testing techniques

Types of testing in software

There are plenty of testing techniques, ranging from ad hoc monkey testing to more formal unit testing: white box testing, black box testing, grey box testing and so on. As an undergraduate in software engineering or a fresher in the software industry, you may get confused between one type of testing and another. So let's explore some of the most confusing and interrelated terms used in testing terminology:

Sanity testing and smoke testing

Sanity testing is a technique in which you take a specific functionality of the whole system, go deep into it, and try to find out whether that functionality is working correctly in every aspect. It follows a depth-first approach.
Smoke testing is a technique in which you just check broadly whether the system in question is worth testing at all. The scope of the test is the whole system, and it follows a breadth-first approach.

The term smoke testing comes from hardware testing: if a hardware component produces smoke as soon as it is put under test, the component is not yet ready for testing and requires more work. The same principle is applied to software.

Regression testing and retesting

When you fix a bug in the code and you want to be sure that the fix has not affected any other part of the program, you re-run all the existing test cases again. This is called regression testing.
Retesting is a process in which we write different or additional test cases for the same functionality in order to test it completely and satisfactorily, after it has already been tested with the existing test cases.

Load testing and stress testing

When you test a system for its robustness by giving inputs within the allowed range, this is called load testing. For example, if an application has to support 500 users, load testing is done by simulating 500 users and noting the performance degradation of the application.
When you test a system for its robustness by giving inputs which are out of bounds or beyond the scope of the system, it is called stress testing. For example, if your system supports only 100 users, you simulate 110 users and see how the performance of the system degrades.
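
A very rough Java sketch of the idea (the operation under test and the numbers are made up): the same harness acts as a load test when run with the supported number of simulated users, and as a stress test when run with more than that.

package AlgorithmsAndMe;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadVsStressHarness {

    //Stand-in for the operation under test
    static void handleRequest() throws InterruptedException {
        Thread.sleep(5);   //pretend work
    }

    //Runs 'users' concurrent simulated users, each issuing one request,
    //and reports how long the whole batch took
    static void simulate(int users) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        long start = System.nanoTime();
        for (int i = 0; i < users; i++) {
            pool.submit(() -> {
                try { handleRequest(); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(users + " users took " + millis + " ms");
    }

    public static void main(String[] args) throws InterruptedException {
        simulate(500);   //load test: within the supported limit
        simulate(550);   //stress test: beyond the supported limit
    }
}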

Integration testing and interface testing

When you combine two components that were independently developed, you need to test that these two components work correctly together and perform the task assigned to the integrated component in a seamless manner. To test this, you do integration testing.
When there is communication between two components, protocols and message formats are defined for that communication. Interface testing confirms that the data passing from one component to another is correct, is in the specified format, and follows the protocol decided upon. Whether the two components perform the correct task is not the concern of interface testing.
I hope that after reading this post your doubts about these slightly different and confusing testing techniques have vanished.

Nature of bugs found in code inspection

Code inspection is an important process in any mission- or safety-critical system development. It is a type of static testing in which code is not executed but critically analyzed in order to find subtle bugs which are not detectable, or require a lot of effort to detect, in the testing phase. The point of interest is what kinds of defects one can find in code inspection. This article analyses the classes of bugs found in the code inspection process.

1) Maintainability of the code: No code is perfect when it has just been written. It is bound to be changed, enhanced, and modified for adaptation to different platforms. So it is necessary that the code written is understandable and maintainable. Code inspection focuses on this goal. Under this, the following points are looked for in the code:
a) Whether the code-to-comment ratio is proper and whether comments explain the intent of the code appropriately and correctly.
b) Whether all the variable names and function names are meaningful and clearly state the intention.
c) Whether the coding standards or guidelines are adhered to.
d) Whether the code has a proper logical flow or not.

2) Design flaws: Code reflects the design of the software. What if the design itself contains flaws? Since software is tested against requirements and design, it is very difficult to detect these kinds of defects in the testing phase. In that case, code inspection provides an additional design review step. There are issues which will never fail the system but are very dangerous from a quality point of view. For example, when designing a system, one designs a structure to keep track of the status of some hardware channels, say whether they are empty or full. The thought behind the design was that the structure could be used bitwise and memory could be saved, which would not have been possible using arrays. But this thought did not consider that the system had functionality which required checking the status of all channels at the same time, such as averaging the data of all channels. Since the data structure used was a bitwise structure, one has to check all the fields of the structure one by one, resulting in very lengthy and unstructured code which would have been very simple, in a single loop, if an array had been used.

3) Structural flaws: A proper structure of the code is a must for understanding and testing the software. The following points are stressed in the code inspection process:

a) Does every function implement a single functionality, and is it testable?
b) Is the length of the function justifiable, or should it be broken into smaller ones?
c) Is any dead or extra code present in the code?
d) Is any unreachable code present in the code?
e) Are there multiple exits from functions, and can they be removed?
f) Does every variable that is assigned a value in a conditional block get initialized prior to usage? (A small sketch of this issue follows the list.)
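
A small made-up Java sketch of the kind of issue point f) is after: a field that only gets a meaningful value inside a conditional branch silently keeps its default (or a stale value from an earlier call) when the branch is not taken.

package AlgorithmsAndMe;

public class ConditionalInitialization {

    //Defaults to 0.0 and is only ever assigned inside one branch below
    private double discountRate;

    public double priceAfterDiscount(double price, boolean premiumCustomer) {
        if (premiumCustomer) {
            discountRate = 0.10;   //only assigned in this branch
        }
        //For a non-premium customer, discountRate is whatever it was before:
        //0.0 on the first call, or a stale 0.10 left over from an earlier call
        return price * (1 - discountRate);
    }
}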

4) Logical errors: These are errors which are introduced in the coding phase by the developer because of ambiguous design or unclear requirements. As there is no guard against them in the requirements and design, these kinds of bugs are subtle and not obvious in nature; they may occur once in many execution cycles, or only when a particular sequence of events occurs.
For example, let's say the software works in a sequence of steps: power on, operation on, application of automatic corrections, operation off, and then power off. In the operation phase, it first checks whether the given number of passes has already elapsed in the operation, and only then applies the automatic correction. The variable which keeps track of the number of passes is reinitialized at the power-on stage. Assume that the software goes to the operation-off state and then comes back to the operation-on state without going to the power-off state. Since this is a fresh operation, the automatic correction should be applied only after the given number of passes has elapsed in this operation; but as the variable keeping track of the number of passes is not reinitialized and hence contains its last value, the software does not wait for the given number of passes to elapse and starts performing the automatic correction. This error will not hamper the functioning of the software, but it will surely hamper the algorithm which calculates the corrections to be applied to the data. These kinds of errors are found in code inspection by hypothesis.

There is a subtle difference between code walkthrough and code inspection.
Code walkthrough is an informal process which is conducted by the developer team. Peer reviews are a kind of code walkthrough.
Code inspection is a more formal process with a formal team, roles, and meetings. Code inspection reports are asked for in project and process audits. Checklists are used for code inspection.

What is LCSAJ?

LCSAJ stands for Linear Code Sequence and Jump. Typically it contains three parts:
1) The initial step of the program code, or a step to which control may jump and continue to execute after that.
2) The program terminating step, or an instruction which leads to a jump.
3) The destination step of the jump.

Example:
Let's say there is a program which finds whether the input number is prime or not.

   printf("Enter the number:");
   scanf("%d", &num);
   bPrime = true;
   for(int i = 2; i <= num/2; i++){
       if(num % i == 0){
         printf(" %d is factor of %d", i, num);
         bPrime = false;
       }
   }
   if(bPrime == true){
        printf("Number is Prime");
   }

To find all LCSAJs in the code, find all the jumps in the program. Those are: (4->9, 8->5, 6->8 and 9->13). All those instructions to which control could jump from an instruction other than the preceding one are called starts of LCSAJs.
In the above code, the starts of LCSAJs are: (1 (entry of the program), 5, 8 and 13).

Now find the LCSAJs from each LCSAJ start. For the LCSAJ starting at 1, control goes from instruction 1 to 3 and checks the condition at step 4. If the condition is false, a jump occurs, giving the first LCSAJ (initial step, branching step, destination step): (1, 4, 9), assuming all other conditions following it are true. Now take the case where the condition at step 4 is true: control flows through step 5 and then step 6, which is a conditional statement. If that condition is false, a new LCSAJ is formed due to the jump, which is (1, 6, 8). Similarly, the LCSAJs (1, 8, 5) and (1, 9, 13) are also there.
In the same way, LCSAJs can be formed from every LCSAJ start.

100% LCSAJ coverage implies 100% decision coverage. But there is a possibility of LCSAJs being infeasible while testing, and the effort required for LCSAJ testing and coverage is greater than that for decision coverage.

What is regression testing?

Regression Testing

Regression testing is the process of finding defects in features of the software which were working perfectly prior to a bug fix or the inclusion of new features. Most of the test cases will already be present, and we just have to re-run them. If a new feature has been added, we might have to create new test cases.
Why do we need regression testing? The answer lies in the design process. However perfectly we design our system or software, there may be some interdependency among the various functions and modules of the software, called coupling. As a result, a small change in one function can have a catastrophic effect on another. So, to be assured that a change done to fix a bug in one module has not affected any other module, we use regression testing.

Qualifying a build for regression testing: When any build comes for regression testing, we have to check whether or not the build is worth testing. It may be that the build doesn't satisfy the major requirements, in which case there is no point putting effort into testing the various smaller requirements. So it is advised to always test the critical and difficult parts of the software first and then move on to the trivial and smaller parts. This process of qualifying software for detailed testing is called smoke testing.

These are the four types of tests one should run on the software before qualifying it for further testing.
1) Customer-based tests: Find out all the scenarios in which the software is going to be used and run test cases that test the software's functionality in those scenarios. Think at a higher level, i.e. the product level, and ask which test cases are critical for your customer; run them on priority. The operational profile of the software may be handy here.
2) Complex tests: Find out which test cases exercise the majority of the important features of the software.
3) Expected-failure tests: From past experience and intuition, one can find out the ways in which the system can fail, and test those things first before moving on to other features.
4) Big-picture tests: Test the performance of the software and any other quality attributes. If the software doesn't satisfy these attribute tests, then avoid testing other features.
Hence one can now understand that regression testing is not about blindly running already-created test cases; there are some other subtle issues to be taken care of.