How to write a REST service with Dropwizard

With the increasing complexity of software systems and the scale at which they operate, it is difficult to maintain and scale a monolithic implementation. Most internet giants have either moved to or are moving towards a microservice-based approach. What is a microservice-based approach? In short, it is an approach to software architecture where each piece of functionality in the system is a small, independent service with its own service level agreements (SLAs).

In this post, we will learn how to write a Java service using Dropwizard.

Why Dropwizard, you ask?

Dropwizard sits somewhere between a framework and a library and can be included in any project. Once you do so, you no longer need to worry about a lot of things like HTTP, REST, logging etc.

Instead of including all libraries needed to build a REST service separately and configuring each of them, Dropwizard does that for us. Here is the list of libraries that come with Dropwizard by default:

  • Jetty: You need an HTTP server for running a web application, so Dropwizard embeds the Jetty servlet container. Instead of deploying your applications to an application server or web server, Dropwizard defines a main method that invokes the Jetty server as a standalone process. As of now, Dropwizard recommends running the application only with Jetty; other servlet containers like Tomcat are not officially supported.
  • Jersey: Jersey follows the standard JAX-RS specification, and it’s the reference implementation for the JAX-RS specification. Dropwizard uses Jersey as the default framework for building RESTful web applications.
  • Jackson: Jackson is the de facto standard for JSON format handling.
  • Metrics: Dropwizard has its own metrics module for exposing the application metrics through HTTP endpoints.
  • Guava: In addition to highly optimized immutable data structures, Guava provides a growing number of classes to speed up development in Java.
  • Logback and Slf4j: These two are used for better logging mechanisms.
  • Freemarker and Mustache: Choosing a template engine for your application is one of the key decisions, as the chosen engine has to be flexible enough for writing good templates. Dropwizard supports the well-known and popular template engines Freemarker and Mustache for building user interfaces.

Apart from the above list, there are many other libraries like Joda Time, Liquibase, Apache HTTP Client, and Hibernate Validator used by Dropwizard for building REST services.

This post assumes that you are using Maven as the build tool. If you are using Gradle, no worries; the setup can easily be migrated. Let's get into the details.

There are four basic building blocks of any Dropwizard service.

  1. Configuration file. This is a YAML file which contains all the configuration for the application. As part of application bootstrap, this configuration file is parsed and deserialized into an application-specific configuration class which extends the Dropwizard Configuration class. We will see below what can be configured in this file.
  2. Application class. This class is a subclass of the Dropwizard Application class, typed with our application-specific configuration class.
  3. Representations. These are POJOs which are converted to and from JSON. These are the entities in your domain.
  4. Resources. These are the classes which respond to all REST requests.

Putting all these together leads to a simple RESTful service.

I am creating a service which will expose all events scheduled in the past or future. The events database will contain the date of the event, the location of the event, and other details related to it like tags, artists, descriptions etc. The project structure would be something like:

Dropwizard project

Configuration of maven project

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>Seller</groupId>
    <artifactId>Seller-Service</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <dropwizard.version>1.2.2</dropwizard.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.6</version>
        </dependency>

        <dependency>
            <groupId>io.dropwizard</groupId>
            <artifactId>dropwizard-jdbi</artifactId>
            <version>${dropwizard.version}</version>
        </dependency>

        <dependency>
            <groupId>io.dropwizard</groupId>
            <artifactId>dropwizard-core</artifactId>
            <version>${dropwizard.version}</version>
        </dependency>
    </dependencies>
</project>

Let me explain why each of these dependencies is required. dropwizard-core brings all the libraries mentioned above into the project. The dropwizard-jdbi module provides managed access to JDBI, a flexible and modular library for interacting with relational databases. The third dependency is MySQL Connector/J, which provides the JDBC driver for MySQL.

Now that all dependencies are added, let's jump to the next part of our service: application configuration.

Configuring dropwizard application

Dropwizard looks for a configuration file (in YAML format) in the current project root directory. So we will create a file called config.yml, which looks something like this:

server:
  rootPath: '/api/*'
database:
  driverClass: 'com.mysql.jdbc.Driver'
  user: 'root'
  password: 'root'
  url:  'jdbc:mysql://localhost:3306/events'

In this file, I have configured database properties and server properties. We can configure properties under the server, metrics or logging headers. We will create an EventConfiguration class which maps to this file. Dropwizard's Configuration class provides default implementations for the server, metrics and logging config; however, to configure databases, extra code has to be written in the configuration class so that parsing of the configuration succeeds.
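For illustration, logging could be tuned in the same config.yml. This fragment is a sketch assuming Dropwizard's default logging factory; the dao logger name is just an example from this project's packages:

```yaml
logging:
  level: INFO
  loggers:
    # Log our DAO package more verbosely during development.
    dao: DEBUG
```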

Now, let's look at EventConfiguration, which is a subclass of the Dropwizard Configuration class. To create a managed DBI instance, the configuration class needs a DataSourceFactory instance; we will see how we can get a database connection in the application class.

import com.fasterxml.jackson.annotation.JsonProperty;
import io.dropwizard.Configuration;
import io.dropwizard.db.DataSourceFactory;

import javax.validation.Valid;
import javax.validation.constraints.NotNull;

/**
 * Created by sangar on 17.12.17.
 */
public class EventConfiguration extends Configuration {

    @Valid
    @NotNull
    private DataSourceFactory database;

    @JsonProperty("database")
    public DataSourceFactory getDataSourceFactory() {
        return database;
    }
    @JsonProperty("database")
    public void setDatabase(DataSourceFactory database) {
        this.database = database;
    }
}

Create an application class

This is the entry point of your whole application. It extends the Application class of Dropwizard, with the type parameter being the configuration class created above. This class has to implement the run method of the base Application class. This is the place where we do system-wide setup, like database connections, registering resources etc. In this example, the event application has two classes: EventConfiguration, extending Configuration, and EventApplication, extending the Dropwizard Application class.

import dao.EventDao;
import io.dropwizard.Application;
import io.dropwizard.jdbi.DBIFactory;
import io.dropwizard.setup.Environment;
import org.skife.jdbi.v2.DBI;
import resources.EventResource;
/**
 * Created by sangar on 17.12.17.
 */
public class EventApplication extends Application<EventConfiguration> {
    public static void main(final String[] args) throws Exception {
        new EventApplication().run(args);
    }

    @Override
    public void run(EventConfiguration configuration, Environment e) {
        final DBIFactory factory = new DBIFactory();
        final DBI jdbi = factory.build(e, configuration.getDataSourceFactory(), "database");
        final EventDao eventDao = jdbi.onDemand(EventDao.class);

        e.jersey().register(new EventResource(eventDao));
    }
}

Creating Resources and Representation

In REST architecture, everything is a resource, and clients can access and modify those resources. A resource needs to support HTTP methods like GET, PUT, POST, DELETE etc. These resources can have different representations, like XML, JSON etc.

In our examples, and in Dropwizard in general, a resource is a class which responds to GET and POST methods and consumes and produces data in JSON format. Resources are identified by global IDs (which are typically URIs).

In our case, we have EventResource as our resource, which looks something like this. For details on how a resource can be created and the annotations supported, please go through: Dropwizard resource

package resources;

import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

import api.Event;
import dao.EventDao;

import java.util.List;

/**
 * Created by sangar on 17.12.17.
 */
@Path("/events")
@Produces(MediaType.APPLICATION_JSON)

public class EventResource {

    EventDao eventDao;

    public EventResource(EventDao eventDao) {
        this.eventDao = eventDao;
    }

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response getEvents() {
        List<Event> events = this.eventDao.getAllEvents();
        return Response.ok(events).build();
    }
}

Jersey is a framework for mapping various aspects of incoming HTTP requests to POJOs and mapping POJOs to outgoing HTTP responses. Since we will be working with JSON data, we need to map this data to and from our POJOs; these POJOs are called representations. In our application, Event will be the representation class.

package api;

import org.joda.time.DateTime;

/**
 * Created by sangar on 17.12.17.
 */
public class Event {

    private int id;
    private String name;
    private DateTime eventDate;
    private String city;

    public Event(int id, String name, String city, DateTime eventDate) {
        this.id = id;
        this.name = name;
        this.city = city;
        this.eventDate = eventDate;
    }

    public int getId() {
        return this.id;
    }
    public String getName() {
        return this.name;
    }
    public String getCity() {
        return this.city;
    }
    public DateTime getEventDate() {
        return this.eventDate;
    }
}

Below is the code for the DAO and the mapper for the JDBI result set.

package dao;

import api.Event;
import core.EventMapper;
import org.skife.jdbi.v2.sqlobject.Bind;
import org.skife.jdbi.v2.sqlobject.SqlQuery;
import org.skife.jdbi.v2.sqlobject.SqlUpdate;
import org.skife.jdbi.v2.sqlobject.customizers.RegisterMapper;

import java.util.List;

/**
 * Created by sangar on 17.12.17.
 */
@RegisterMapper(EventMapper.class)
public abstract class EventDao {

        @SqlQuery("SELECT * FROM Event")
        public abstract List<Event> getAllEvents();

}

package core;

/**
 * Created by sangar on 20.12.17.
 */

import api.Event;
import org.joda.time.DateTime;
import org.skife.jdbi.v2.StatementContext;
import org.skife.jdbi.v2.tweak.ResultSetMapper;

import java.sql.ResultSet;
import java.sql.SQLException;

public class EventMapper implements ResultSetMapper<Event>
{
    public Event map(int index, ResultSet resultSet, StatementContext statementContext) throws SQLException
    {
        return new Event(resultSet.getInt("id"), resultSet.getString("name"),
                resultSet.getString("city"),
                new DateTime(resultSet.getTimestamp("date"))
                );
    }
}

Putting all together

Now that we have written the configuration, application entry point, resources and representation, it is time to build the project. For this whole thing to work, we need to create a fat JAR. Add the following to your pom.xml file:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <createDependencyReducedPom>true</createDependencyReducedPom>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>EventApplication</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Once you do, build the project and run it with the server command (for example, java -jar target/Seller-Service-1.0-SNAPSHOT.jar server config.yml). You can see the APIs which were created and can be used.

INFO  [2017-12-19 10:26:51,948] org.eclipse.jetty.setuid.SetUIDListener: Opened admin@1687eb01{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
INFO  [2017-12-19 10:26:51,957] org.eclipse.jetty.server.Server: jetty-9.4.7.v20170914
INFO  [2017-12-19 10:26:54,563] io.dropwizard.jersey.DropwizardResourceConfig: The following paths were found for the configured resources:

    GET     /api/events (resources.EventResource)

Use Postman and hit the endpoint, and you will get all the events which are in the system.

Dropwizard-api

I have not touched upon a lot of things in detail, like connection pooling, exception handling, metrics, logging etc., in order to keep this post basic and minimal. I will add details on each of these topics in following posts.

Please share if there is something wrong or missing.

Object oriented programming : Classes

What is a class?

In object oriented programming parlance, a class is a blueprint used to create different instances, called objects, of a domain entity. Classes encapsulate the state and behavior of a domain entity. State is maintained using member variables and behavior using member functions.

For example, Bicycle is a class with two members: top speed and number of gears. Based on this blueprint, we can instantiate objects like a bike with 2 gears and a top speed of 15, another bike with 6 gears and a top speed of 20, and so on. Behaviors like changing the top speed or getting the number of gears are all encapsulated within the class and available in all instances of the class Bicycle.

/**
 * Created by sangar on 22.10.17.
 *
 */
package Examples;
public class Bicycle {
    
    private int gears;
    private int topSpeed;
    
    public Bicycle(int gears, int topSpeed){
        this.gears = gears;
        this.topSpeed = topSpeed;
    }
    
    public int getGears(){
        return this.gears;
    }
    
    public void service(){
        System.out.println("Servicing for bike with "
           + this.gears + " gears costs $10");
    }
}

Are all classes used to create objects? The answer is no: there are abstract classes and static classes, which cannot be instantiated like usual classes.

An abstract class contains one or more abstract methods. An abstract method has no implementation; it only has a declaration. An abstract class requires a subclass to provide implementations of its abstract methods before it can be instantiated.

Continuing the above example of the bicycle, it would be nice if we could define different types of bicycles, like city bikes, mountain bikes and racing bikes. All these bikes will have some common behavior, like service, but the implementation of that behavior would be different for each one of them. In this scenario, Bicycle becomes an abstract class and declares a member function without implementation. CityBike, MountainBike and RacingBike are subclasses of the bicycle abstract class and implement the service method.

package Examples;

/**
 * Created by sangar on 22.10.17.
 */
public abstract class AbstractBicycle {
    //member fields
    protected int gears;
    protected int topSpeed;
    protected int serviceCost;

    public int getGears(){
        return this.gears;
    }

    public abstract void service();
}

package Examples;

/**
 * Created by sangar on 22.10.17.
 */
public class MountainBike extends AbstractBicycle {

    public MountainBike(int gears,
                        int topSpeed, int serviceCost){
        this.gears = gears;
        this.topSpeed = topSpeed;
        this.serviceCost = serviceCost;
    }

    public void service() {
        System.out.println("Servicing Mountain Bike :");
        System.out.println("Oil the chain");
        System.out.println("Check tyre pressure");
        System.out.println("Oil brake");
    }
}

package Examples;

/**
 * Created by sangar on 22.10.17.
 */
public class CityBike extends AbstractBicycle {
    public CityBike(int gears,
                    int topSpeed, int serviceCost){
        this.gears = gears;
        this.topSpeed = topSpeed;
        this.serviceCost = serviceCost;
    }

    public void service() {
        System.out.println("Servicing City Bike :");
        System.out.println("Oil the chain");
        System.out.println("Fix the lock");
    }
}

package Examples;

/**
 * Created by sangar on 22.10.17.
 */
public class BicycleStore {

    public static void main(String[] args){
        MountainBike mountainBike = new MountainBike(3,15,7);
        mountainBike.service();

        CityBike cityBike = new CityBike(6,10,10);
        cityBike.service();
    }
}

On the other hand, a static class is a class which never needs to be instantiated. It contains members whose values do not differ across instances, and methods which do not depend on external context. In Java specifically, only nested classes can be declared static; there are no top-level static classes.
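As a small sketch (class and method names are mine, purely for illustration), a static nested class in Java can be used without ever instantiating the outer class:

```java
public class MathUtils {

    // Static nested class: it belongs to MathUtils itself, not to any
    // instance, so callers never need to create a MathUtils object.
    public static class Circle {
        public static double area(double radius) {
            return Math.PI * radius * radius;
        }
    }

    public static void main(String[] args) {
        // Accessed through the outer class name; no instantiation involved.
        System.out.println(MathUtils.Circle.area(1.0));
    }
}
```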

What does a class contain?

As mentioned above, a class contains member fields and functions. A class can also have one or more constructors.

Member fields can be public, private or protected. In short, public members can be accessed from outside the class, using objects. Private members are not visible outside the class and can only be accessed by member functions of the class itself. Protected members have different rules when it comes to inheritance; usually, they are accessible from within the package and from subclasses, but not to the outside world.

Access control specifier in Java

One can have setters and getters on the fields of a class. Setters set the values of fields and getters retrieve them.
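To make this concrete, here is a minimal sketch (the names are mine) showing a private field exposed only through a getter and a setter:

```java
public class Rider {

    // Private: not visible outside this class.
    private String name;

    // Getter retrieves the value of the private field.
    public String getName() {
        return this.name;
    }

    // Setter sets the value of the private field.
    public void setName(String name) {
        this.name = name;
    }

    public static void main(String[] args) {
        Rider rider = new Rider();
        rider.setName("Alice");     // rider.name = "Alice" would not compile
        System.out.println(rider.getName());
    }
}
```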

We will be discussing more advanced things in following posts, so please stay tuned. Also, if you find anything wrong or missing, please let us know. We would be more than happy to correct things.

 

Apache Spark : Introduction

Apache Spark

Before understanding what Apache Spark is and how it helps in data pipelines, let's understand the history behind big data and MapReduce. MapReduce introduced the ability to process vast amounts of data on commodity hardware. MapReduce implementations inherently had fault tolerance, locality-aware scheduling and load balancing. The programming model supported acyclic flow of data through its data model. This is good for most kinds of applications; however, there are two big classes of applications which were not completely and satisfactorily solved by existing programming models and implementations of MapReduce. Those application classes are:

  • Iterative applications

Applications where the output of a stage is fed as input to a previous stage to refine things. Machine learning applications are usually of this nature, where the same datasets are processed again and again to refine a feature. The available MapReduce implementation, Hadoop, was inefficient in handling this use case, because in Hadoop communication between two stages of a pipeline happens through the HDFS file system, that is, disk. A stage has to process data and then put the data on disk, where the next stage can pick it up again. This leads to delays in processing, and there is no optimization to support iterative applications.

Apache Spark introduction
Typical data processing in Hadoop
  • Interactive applications

These are applications which are ETL in nature, with the requirement of a console where data is to be presented. The user can change the input and then ask for a different set of data based on it. Again, in Hadoop, each such request is treated independently, and every request fetches data from disk. This leads to delays, and the application does not remain interactive anymore.

Comparison b/w Hadoop and Spark
Interactive system using Map Reduce

To solve the challenges of these applications, different tools were introduced, like Apache Storm for stream processing or Hive and Pig for interactive usage of Hadoop, but none of these tools solved both problems at the same time. Also, they lacked an abstraction that would let them be used for any general purpose.

Apache spark introduction
Specialized map reduce systems for solving specific problems

How does Apache Spark solve the challenges of these types of applications?

Apache Spark : In memory data storage and processing

From the above analysis, we understand that the basic problem with the Hadoop implementation of MapReduce is disk access. If data which is read from disk could instead be read from memory, the system would become efficient. Apache Spark does exactly that. All of the data is stored in memory as abstract entities called Resilient Distributed Datasets (RDDs). More on that later.

Two questions come to mind when we say data is in memory. First, how does fault tolerance work, as data may be lost if a node goes down? Second, what if the data is larger than the memory available on a node? Fault tolerance is solved by maintaining metadata about how the RDDs on a node were created (the steps that led to the creation of each RDD); if the node goes down, Spark can re-run those steps and recover the data. More on how RDDs are created and used in Spark later.

Second, if the data is larger than the available memory, it is written to disk. This affects performance, but it is an edge case and not a normal working condition.

This was a brief introduction to Apache Spark. We will continue this thread in a coming post. Stay tuned.

If you see something is missing or wrong, please share and we will fix it.

Counting permutations : Programming as an art

I am writing this blog on algorithms to lay emphasis not only on the solution to the problem, but also on some other aspects which get completely ignored when writing the implementation of algorithms. To start with, let's first look at the question:
Implement a program that finds the total number of ways of picking one number from each of N bags (each bag containing K numbers) such that the sum of the picked numbers is greater than a given threshold H, where K, N > 1. Assume that the program is invoked on the command line, with N, K, input-file-name and H as parameters, and outputs the result to its standard output.

Brute Force Algorithm

To start with, let us look at how we would solve this problem in the simplest manner. If we could somehow iterate through all the permutations, then we simply have to add the numbers selected in the current permutation, compare the sum with the threshold, and increment the count if it is greater.
The algorithm is simple and can be written as follows:
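Here is one way the brute-force enumeration could look in Java (a sketch; class and method names are mine, and the bags are taken as an in-memory array rather than read from the input file):

```java
public class BruteForceCounter {

    // Counts the picks (one number from each bag) whose sum exceeds h.
    // bags[i] holds the K numbers in bag i.
    public static long count(int[][] bags, int h) {
        return countFrom(bags, 0, 0, h);
    }

    // Recursively tries every number of the current bag, carrying the
    // running sum; at the last bag the sum is compared with the threshold.
    private static long countFrom(int[][] bags, int bag, int sum, int h) {
        if (bag == bags.length) {
            return sum > h ? 1 : 0;
        }
        long total = 0;
        for (int number : bags[bag]) {
            total += countFrom(bags, bag + 1, sum + number, h);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two bags of {1, 2}: the sums are 2, 3, 3, 4, so only one exceeds 3.
        System.out.println(count(new int[][]{{1, 2}, {1, 2}}, 3));
    }
}
```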

The complexity of this simple algorithm is O(N × K^N), since there are K^N permutations, and for each of them the sum of the N numbers in the permutation needs to be computed.

Think before you Program

I had given this as an exercise to some students. I am reproducing here, without prejudice, one of the answers, to highlight some of the aspects that were completely ignored.

The biggest problem in the above code is locating where the algorithm is implemented. The program is simple, so it is not difficult to find, and even though the implementation is recursive, it is in essence doing exactly what the algorithm written above does, with some optimization. However, it is spread across multiple places in the program: partly in the recursive method and partly in the main method.
Consider the situation where I have to use this algorithm for some analytical reasoning in a business application and I have asked for a POC. Using the above implementation would probably be difficult, since the program does not separate concerns by putting logically similar concerns together and keeping dissimilar ones loosely coupled.
Another big problem is the field 'count' being held in a class-scoped variable. This is very dangerous, considering that one may want to calculate the count multiple times in parallel in production.
One concern an architect would have is the ability to use different algorithms in different situations, which requires proper modeling of the problem. Since there is none, the architect would have to sit down and do the modeling afresh.

Art or Science

Now, the question is whether the aspects that were missed in the above solution are so difficult that they deserved to be completely ignored. So let's put some thought into it and list the high-level concerns in the problem (I think this is known as a CRC [class responsibility collaboration] card). At a very high level, I see the following concerns:

  1. Bag containing the numbers – I intend to model it via an interface Bag which is responsible for the following:
    1. Count of numbers in the bag
    2. Some kind of iterator through the numbers in the bag
    3. Min and max numbers in the bag
    4. Mechanism to hold the numbers in a particular order as desired by the algorithm
    5. A self-cloning, self-factory default implementation
  2. Algorithm to find the count – I intend to model it via an interface PermutationCounter which is responsible for the following:
    1. Name of the algorithm
    2. A method to calculate the count for a given array of bags and the threshold
  3. Taking the problem input from the screen/file and invoking the algorithm – this will be done by the main method

Simple, was it not? The advantage of this approach is that the responsibilities are clearly demarcated, and it is anybody's guess which parts can easily be reused: the Bag and PermutationCounter, along with all their implementations, can be reused. Now let's look at the adjacent diagram, which captures the class hierarchy and how it is to be used by the client (for the scope of this problem, the main method). In practice, one can experiment and decide on the strategy to follow for using the implementations.

Now let me return to the point where I left the iterative algorithm for this problem. How are the aspects present in the algorithm actually implemented? The current permutation can be held in an integer array of the same size as the number of bags; each element of the array represents the index of the picked number in the corresponding bag. To start with, this array is initialized, say for N = 4, to [0, 0, 0, -1], and then manipulated to produce the arrays (for K = 3) [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 2], [0, 0, 1, 0], and so on till [2, 2, 2, 2]. These two methods can be present in the IterativePermutationCounter class as private methods (implementation as below).
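The index-advancing helper described above could be sketched like this (my naming; the original private methods are not reproduced here):

```java
public class PermutationOdometer {

    // Advances the index array like an odometer, e.g. for K = 3:
    // [0,0,0,-1] -> [0,0,0,0] -> [0,0,0,1] -> [0,0,0,2] -> [0,0,1,0] ...
    // Returns false once every permutation up to [K-1,...,K-1] is exhausted.
    public static boolean next(int[] indices, int k) {
        for (int i = indices.length - 1; i >= 0; i--) {
            if (indices[i] < k - 1) {
                indices[i]++;
                return true;
            }
            indices[i] = 0; // carry over to the next position on the left
        }
        return false;
    }

    public static void main(String[] args) {
        int[] indices = {0, 0, -1}; // N = 3 bags, seeded before the first pick
        int permutations = 0;
        while (next(indices, 2)) {  // K = 2 numbers per bag
            permutations++;
        }
        System.out.println(permutations); // 2^3 = 8 permutations visited
    }
}
```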
Given the above two implementations, the implementation of the algorithm is as follows:
For the complete code and some interesting implementations for the problem, you can reach out to the author.

Suffix Array

What are suffix arrays and why are they useful? Before learning about suffix arrays, I would recommend you read about tries here. Question: you have a word, for example “blogger”, and you want to print all its suffixes in sorted order, or find the kth suffix in lexicographical order.
Naive solution:
In the naive solution you find all the suffixes, save them in an array of strings (this takes around O(n²) operations) and sort them in lexicographical order (this takes O(n² log n), so the total complexity is around O(n²(1 + log n)), which is quite large). Then you find the kth suffix in it.
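The naive approach sketched above is short in Java (the names are mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NaiveSuffixes {

    // Collects every suffix, then sorts lexicographically.
    // Building the n suffixes copies O(n^2) characters, and sorting them
    // costs O(n^2 log n) because each comparison may scan O(n) characters.
    public static List<String> sortedSuffixes(String s) {
        List<String> suffixes = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            suffixes.add(s.substring(i));
        }
        Collections.sort(suffixes);
        return suffixes;
    }

    public static void main(String[] args) {
        // Prints [blogger, er, ger, gger, logger, ogger, r]
        System.out.println(sortedSuffixes("blogger"));
    }
}
```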

All suffixes of the word “blogger”:
blogger
logger
ogger
gger
ger
er
r
Solution using tries:
In this method we create a trie of all the suffixes of the word, and to find the kth suffix in lexicographical order we can pre-process the trie and find the lexicographically kth suffix in O(n) (try to figure out how you can do it).
Even though the idea of a suffix trie is very pleasing, its simplest implementation, in which the suffixes of the string are inserted into the structure one at a time, leads to an O(n²) algorithm. There is a “suffix tree”, which can be built in O(n), but a suffix tree is quite a long piece of code to write; I will talk about suffix trees in future blogs. Here I will talk about another data structure, the suffix array, whose code is quite short to write. It can be built in around O(n log² n), and an implementation of a suffix array using O(n) memory needs 3 to 5 times less memory than a suffix tree.

Suffix Arrays
A suffix array is a data structure which stores the sorted order of the suffixes of a given string. It will be clearer from the diagram given below:
SUFFIX ARRAY

How to build a suffix array: as is clear from the diagram above, we have to sort the suffixes of the string in lexicographic order to create the suffix array. So the big question is: how do we do it?
We will first sort according to the first character of each suffix of the string and store the result in the suffix array. For example:

               b  l  o  g  g  e  r
Suffix array – 0  3  4  2  2  1  5

Now we will gradually increase the sorting criterion, i.e. from sorting according to a single character to two characters, then four characters, and so on in increasing powers of two, until we reach a length at which we can be sure all the suffixes of the string are themselves sorted (it will be clearer from the diagram below). So we have m = ceil(log₂(n)) steps to create the suffix array, where n is the length of the string. Let us denote each step by y, where 0 <= y <= m. At step y we sort the suffixes according to the first 2^y characters of each suffix; for example, at step 2 we sort the suffixes according to their first 4 (2^2) characters. How do we implement this optimally? I'll explain it step by step.
At step 0 (y = 0): just sort according to the first character of each suffix and give each suffix a sort index (sort index: position of the suffix in the sorted array); if the first character is the same for two or more suffixes, give them the same sort index. See the example below.
Step 0:     Sort-index
    b       0
    l       3
    o       4
    g       2
    g       2
    e       1
    r       5

At step 1 (y = 1): we want to sort the suffixes according to their first two characters. To create the sort index at step 1, take the help of the sort index created at step 0. Create a pair of indices in which the first value is the step-0 sort index of the respective suffix and the second value is the step-0 sort index of the suffix that starts 1 position later. If the length of the suffix is less than 2 characters, keep the second value as -1.
Now sort according to these pairs to get the new sort index. It will be more helpful if you relate it to the diagram given below:
Step 1 (0 to 1) (concatenating the sort-index at i with the one at i + 2^0):

                  Sort-index
    bl  (0,3)     0
    lo  (3,4)     4
    og  (4,2)     5
    gg  (2,2)     3
    ge  (2,1)     2
    er  (1,5)     1
    r$  (5,-1)    6

Here ‘i’ denotes index of the suffix starting at position ‘i’ in the string.

At step 2 (y = 2): the same work has to be done at step 2, but now we want to sort according to the first four characters. Form a pair whose first value is the current sort-index of the suffix and whose second value is the sort-index of the suffix starting 2 positions later; for example, concatenate bl (the suffix at position 0) with og (the suffix at position 2) to get the first four characters of that suffix. Do this for all other suffixes and compute the sort-index of each position.

Step 2 (1 to 2) (concatenating the sort-index at i with the one at i + 2^1):

                      Sort-index
    blog    (0,5)     0
    logg    (4,3)     4
    ogge    (5,2)     5
    gger    (3,1)     3
    ger$    (2,6)     2
    er$$    (1,-1)    1
    r$$$    (6,-1)    6

The same happens for step 3. To generalize: going from step y to step y + 1, we concatenate the sort-index of the suffix starting at position i with the sort-index at i + 2^y, covering a length of 2^(y+1), to create the sort-index at step y + 1.

Code for building Suffix Arrays:
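The original listing was not preserved, so here is a minimal C++ sketch of the prefix-doubling construction described above. The function name `buildSuffixArray` is mine; it sorts with `std::sort` at each doubling step, which gives the O(n * log^2 n) bound stated next.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Prefix-doubling construction: at step y the suffixes are ranked by their
// first 2^y characters, reusing the ranks of the previous step as pairs.
std::vector<int> buildSuffixArray(const std::string &s) {
    int n = s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    for (int i = 0; i < n; i++) { sa[i] = i; rank[i] = s[i]; }
    for (int k = 1; ; k <<= 1) {
        // The pair (rank[i], rank[i + k]) orders suffix i by its first 2k
        // characters; -1 stands in when the suffix is shorter than that.
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        // Re-index: equal pairs get equal sort-indexes, as in the step tables above.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all sort-indexes distinct: done
    }
    return sa;
}
```

For "blogger" this returns {0, 5, 4, 3, 1, 2, 6}, i.e. blogger, er, ger, gger, logger, ogger, r in sorted order.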

Time complexity: building the suffix array this way costs O(n * log^2 n).
Now you can find the kth suffix in lexicographical order directly from the suffix array (writing the suffix out takes O(n)). Suffix arrays are quite helpful in competitive programming because they are fast to code.

Application:
1) Compute the longest common prefix(LCP) using suffix arrays
2) Suffix Arrays are helpful in various string searching and string matching problems.

Questions to practice:
Easy:
http://www.spoj.com/problems/SARRAY/
http://www.codechef.com/problems/TASTR

Medium:
http://www.codechef.com/problems/SSTORY
http://www.spoj.com/problems/DISUBSTR/

Range Sum Using Segment Tree

What is and why Segment Tree?

Take an example array arr = {1, 4, 8, 11, 15, 22}, and the task is to find the sum of a range, say (2, 6). The naive approach runs a loop from 2 to 6, giving O(n) per query, so q queries cost O(n*q). We can answer each query in O(1) if we pre-compute prefix sums: sum[i] holds the total of the first i elements, which gives sum[] = {0, 1, 5, 13, 24, 39, 61}. A range-sum query like (3, 5) is then just sum[5] - sum[2], so q queries cost O(q) in total.

Now a slight modification to the question: if we can change any number in the array at index i, we have to rebuild the whole sum array from index i to n, so an alternating sequence of update and range-sum queries costs O(n*q). This is where the segment tree comes to the rescue: using a segment tree we can reduce the complexity from O(n*q) to O(q*log(n)), which is a big difference when n is large. That answers "why Segment Tree?".
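As a concrete baseline before the segment tree, here is a small C++ sketch of the prefix-sum idea described above (the function names are mine):

```cpp
#include <cstddef>
#include <vector>

// sum[i] holds the total of the first i elements, so sum[0] = 0 and a
// range query costs O(1) after O(n) preprocessing.
std::vector<long long> buildPrefixSums(const std::vector<int> &arr) {
    std::vector<long long> sum(arr.size() + 1, 0);
    for (std::size_t i = 0; i < arr.size(); i++)
        sum[i + 1] = sum[i] + arr[i];
    return sum;
}

// Total of arr[l..r] inclusive (0-based indexes).
long long rangeSum(const std::vector<long long> &sum, int l, int r) {
    return sum[r + 1] - sum[l];
}
```

For {1, 4, 8, 11, 15, 22} this builds exactly the sum[] array shown above; the drawback, as noted, is that a single element change forces an O(n) rebuild.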

Segment Tree(Divide and Conquer Solution)

Problem: We have an array arr[0 .. n-1] and we have to perform two types of queries:
                 1) Find the sum of elements from index l to r, where 0 <= l <= r <= n-1.
                 2) Change the value at a given index: arr[i] = x.
I'll talk about segment trees at three levels:

Structure
The structure of the segment tree follows two rules:
1) If the range contains a single element, that element itself will be in the node.
2) Otherwise divide the range into two smaller ranges, each half the size of the original, and store the sum of the two halves in the node.
So the recursive definition is: node(l, r) holds arr[l] if l == r, and node(l, m) + node(m+1, r) otherwise, where m = (l + r) / 2.
             
Representation:

Level 1: Construction

             For construction I'll take the recursive (top-down) approach: start with the root node, which makes recursive calls to its two children (see the recursive rules above); each child stores the sum of a range half the size of its parent's, until we reach a leaf node, which stores a single element of the array. As the recursion unwinds, each parent stores the sum of its two children, so every node holds the sum of its range. See the diagram below for more clarity. We will use an array named sum[] to store this tree: index i stores a parent's sum, and its children's sums are stored at 2*i and 2*i + 1. It is a full binary tree because we always split a segment into two parts, so we need a size of around 2 * 2^ceil(log2(n)) - 1 to store it. Since it is a binary tree, its height is O(log n).

I added the code for creating the segment tree:  
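The original code block did not survive, so here is a compact C++ sketch of the construction just described. The struct name `SegmentTree` is mine; the root lives at index 1 and the children of node i at 2*i and 2*i + 1, as in the text.

```cpp
#include <vector>

// Array-backed tree: sum[1] is the root and the children of node i live at
// 2*i and 2*i + 1. build(node, l, r) stores the total of arr[l..r] in sum[node].
struct SegmentTree {
    std::vector<long long> sum;
    std::vector<int> arr;
    explicit SegmentTree(const std::vector<int> &a) : sum(4 * a.size()), arr(a) {
        if (!arr.empty()) build(1, 0, (int)arr.size() - 1);
    }
    void build(int node, int l, int r) {
        if (l == r) { sum[node] = arr[l]; return; }      // leaf: the element itself
        int m = (l + r) / 2;                             // split the range in half
        build(2 * node, l, m);
        build(2 * node + 1, m + 1, r);
        sum[node] = sum[2 * node] + sum[2 * node + 1];   // parent = sum of children
    }
};
```

After building over {1, 4, 8, 11, 15, 22}, the root sum[1] holds the whole-array total, 61.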

Level 2: Query
Now we have created the tree. Since the query operation is more involved, I'll explain it with an example. Suppose we have to query f(0,4). There is no single node holding this range, but notice that f(0,4) = f(0,2) + f(3,4), and there are nodes in the segment tree containing those two ranges. So we can conclude there is a general expression f(x,y) = f(x,z) + f(z + 1,y).

When querying a segment tree, we select a subset of nodes whose ranges together cover exactly the interval whose sum we seek. To do so, we start at the root and recurse into nodes whose intervals have at least one element in common with the query interval. In our example of f(0,4), both the left and right subtrees contain elements of the query interval, so we recurse into both. The left child of the root serves as a base case (see the figure): its interval lies entirely within the query interval, so it is chosen. At the right child of the root, its left child has descendants in the query interval (see the figure) but its right child does not, so we recurse into the former and not the latter. The former is another base case, so it is chosen, and we return the sum over the given range.
Time complexity: since it is a binary tree, a query takes O(log n). You can check this yourself.

Level 3: Update
For an update, check which interval the index falls into and descend into that child recursively until you reach the corresponding leaf, then update that leaf. While recursing back up, update every node whose range [l, r] satisfies l <= index <= r (l and r are the node's range, and index is the position being updated). Below is the code for updating the tree.
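Since the original listings are missing, here is a self-contained C++ sketch covering both the query and the update operations described above (it repeats the build step so it stands alone; all names are mine):

```cpp
#include <vector>

// Self-contained sketch: build, range-sum query, and point update over the
// same layout (root at index 1, children of node i at 2*i and 2*i + 1).
struct SegTree {
    int n;
    std::vector<long long> sum;
    explicit SegTree(const std::vector<int> &a) : n((int)a.size()), sum(4 * a.size()) {
        if (n > 0) build(a, 1, 0, n - 1);
    }
    void build(const std::vector<int> &a, int node, int l, int r) {
        if (l == r) { sum[node] = a[l]; return; }
        int m = (l + r) / 2;
        build(a, 2 * node, l, m);
        build(a, 2 * node + 1, m + 1, r);
        sum[node] = sum[2 * node] + sum[2 * node + 1];
    }
    // f(x,y) = f(x,z) + f(z+1,y): recurse only into children overlapping [ql, qr].
    long long query(int node, int l, int r, int ql, int qr) {
        if (qr < l || r < ql) return 0;                  // no overlap with the query
        if (ql <= l && r <= qr) return sum[node];        // node fully inside the query
        int m = (l + r) / 2;
        return query(2 * node, l, m, ql, qr) + query(2 * node + 1, m + 1, r, ql, qr);
    }
    long long query(int ql, int qr) { return query(1, 0, n - 1, ql, qr); }
    // Descend to the leaf for index i, then fix the sums while recursing back up.
    void update(int node, int l, int r, int i, int value) {
        if (l == r) { sum[node] = value; return; }
        int m = (l + r) / 2;
        if (i <= m) update(2 * node, l, m, i, value);
        else update(2 * node + 1, m + 1, r, i, value);
        sum[node] = sum[2 * node] + sum[2 * node + 1];
    }
    void update(int i, int value) { update(1, 0, n - 1, i, value); }
};
```

Both operations touch one root-to-leaf path (plus siblings, for query), which is where the O(log n) per operation comes from.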
Time complexity: since the height of the tree is O(log n), the worst-case time for an update is O(log n).

Questions for practice:
Easy:
http://www.spoj.com/problems/GSS1/
http://www.spoj.com/problems/KGSS/
http://www.spoj.com/problems/HORRIBLE/

Medium:
http://www.codechef.com/problems/QSET
http://www.codechef.com/problems/FLIPCOIN

Knuth Morris Pratt algorithm for string searching

Knuth Morris Pratt (KMP) algorithm for string searching

Problem

You are given two strings – a Text and a Pattern. Determine whether the Pattern exists in the Text or not.
E.g. – Consider a Text ABCDGBCDLM and a Pattern BCD. Final output should be all the indexes where the Pattern exists, here it is 1 and 5 (0 based index).

Naive Method

The naive method is very basic and straightforward. For every position in the Text, consider it as the starting position and see if a match with the pattern is found.

Considering a text and a pattern of small length, this method would work properly giving O (N*M) complexity, where N and M are the lengths of text and pattern respectively.

The C++ implementation of the naive method is given below…
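The original C++ listing is missing, so here is a minimal sketch of the naive method (the function name `naiveSearch` is mine):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Try every starting position of the text: O(N*M) in the worst case.
std::vector<int> naiveSearch(const std::string &text, const std::string &pattern) {
    std::vector<int> matches;
    if (pattern.empty() || pattern.size() > text.size()) return matches;
    for (std::size_t i = 0; i + pattern.size() <= text.size(); i++) {
        std::size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j]) j++;
        if (j == pattern.size()) matches.push_back((int)i);  // full match starts at i
    }
    return matches;
}
```

For the example above, searching for BCD in ABCDGBCDLM yields the starting indexes 1 and 5.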

The Knuth-Morris-Pratt Algorithm

Suppose we could identify, in a single pass over the text, every position where a match of the pattern ends. That would solve our problem: since we know the length of the pattern, it is easy to identify the starting position of each match. The Knuth-Morris-Pratt algorithm works on this principle.

In this algorithm, we apply the concept of an automaton, a kind of abstract machine. At each step, some information is presented to the automaton, and using it the automaton moves from its current state to a new state. Whenever the automaton reaches its final state, we have found the end position of a match. The general idea behind the automaton is very simple; consider the pattern BBABBB and list all its prefixes:

1. Empty string
2. B
3. BB
4. BBA
5. BBAB
6. BBABB
7. BBABBB

Now for every prefix listed above, consider the longest suffix that is also a prefix of it. The suffix should be different from the string itself.

1. Empty string (No suffix for an empty string)
2. Empty string (The suffix is B but it is same as the string itself)
3. B     (BB) (Here B is the suffix that is also the prefix of string BB)
4. Empty string (No suffix is present which is also the prefix)
5. B     (BBAB) (Here B is the suffix that is also the prefix of string BBAB)
6. BB   (BBABB)(Here BB is the suffix that is also the prefix of string BBABB)
7. BB (BBABBB)(Here BB is the suffix that is also the prefix of string BBABBB)

Let us call a suffix that is also a prefix of a string a "partial match" of that string. We can see that if at some point the pattern matches till BBABB (which is a prefix of the pattern), then there is also a match till BB, which is the partial match of BBABB.
For example : 
For example, consider the text BBBABBABBBA. The pattern matches till BBABB if we check from the 2nd character of the text. Since BB is a partial match of BBABB, the text also matches BB (BB is the largest suffix of BBABB that is also a prefix of it). So if we cannot extend our match of BBABB to the full pattern, we can start matching the pattern again from the suffix BB.
         
    Indexes: 0 1 2 3 4 5 6 7 8 9 10
    Text:    B B B A B B A B B B A
    Pattern:   B B A B B B
Here we wanted the next character of the text to be B to complete our match, but it is A, so we can start checking for the next match from the 4th index of the text, because BB is also a partial match there.

So here two cases arise, 
Consider the pattern BBABBB and some text in which, so far, we have found a match till BBABB, as shown below.

    Text:    B B B A B B ....... (text continues)
    Pattern:   B B A B B         (match found till BBABB)

Case 1: If the next character of the text is 'B', a full match is found; we can then continue checking in a similar manner from the next character of the text to find further matches, if any.
Case 2: If the next character of the text is anything other than 'B', the match fails, so we fall back to the largest partial match of the string matched so far (which is BB for BBABB). If the next character of the text then matches, our match extends; if it does not, we fall back to the next partial match of BB, and so on until the partial match becomes empty, at which point we move past the current character of the text and start a fresh check.

Let us see the above operations of Knuth Morris Pratt algorithm with an example.
Text – BBBABBABBBA
Pattern – BBABBB
The automaton for the pattern will look like this…


Step 1: we start by matching the pattern from the first character of the text. Here we find a mismatch at the 3rd character, so we take the partial match of BB, which is B, and continue matching from the 2nd character of the text.


Step 2: when checking from the 2nd character, a mismatch is found at the 6th character of the text, so we start again from the partial match of BBABB, which is BB.


Step 3: this time we find a match, starting at the 5th character of the text.


Now, to find partial matches of the prefixes of the pattern (up to the point matched so far), we build an array of length one more than the pattern's length; call it F[pattern_length + 1]. F[i] (0 <= i <= pattern_length) contains the length of the longest partial match of the prefix of the pattern of length i.
So here: F[0] = 0
         F[1] = 0
         F[2] = 1
         F[3] = 0
         F[4] = 1
         F[5] = 2
         F[6] = 2
The array F not only contains the largest partial match till index i, but also all the partial matches of it, so F[i] is the largest partial match till index i, F[F[i]] is second largest partial match, F[F[F[i]]] is third largest and so on.
Let us call F the failure function of the pattern. Below is the function to build the array.
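The original listing is missing; here is a C++ sketch of building the failure function just described (the name `buildFailure` is mine, and F is sized pattern_length + 1, as in the text):

```cpp
#include <string>
#include <vector>

// F[i] is the length of the longest partial match (proper suffix that is
// also a prefix) of the first i characters of the pattern; F[0] = F[1] = 0.
std::vector<int> buildFailure(const std::string &pattern) {
    int m = (int)pattern.size();
    std::vector<int> F(m + 1, 0);
    for (int i = 2; i <= m; i++) {
        int k = F[i - 1];
        // Fall back through shorter partial matches until one can be extended.
        while (k > 0 && pattern[k] != pattern[i - 1]) k = F[k];
        if (pattern[k] == pattern[i - 1]) k++;
        F[i] = k;
    }
    return F;
}
```

For the pattern BBABBB this gives F = {0, 0, 1, 0, 1, 2, 2}, matching the table above.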

Now we use the failure function to find matches. Whenever we cannot extend the current partial match, we fall back to the next best partial match of the current one, and so on. If we cannot extend even the empty string, we just skip this character of the text and move to the next one.
Below is the code to do that.
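The search listing is also missing; here is a self-contained C++ sketch (it includes its own copy of the failure function so it stands alone; the names are mine):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Failure function, as built above: F[i] is the longest partial match of
// the first i characters of the pattern.
std::vector<int> kmpFailure(const std::string &p) {
    std::vector<int> F(p.size() + 1, 0);
    for (std::size_t i = 2; i <= p.size(); i++) {
        int k = F[i - 1];
        while (k > 0 && p[k] != p[i - 1]) k = F[k];
        F[i] = (p[k] == p[i - 1]) ? k + 1 : k;
    }
    return F;
}

// Single pass over the text: on a mismatch, fall back through partial
// matches; if even the empty match cannot be extended, move on in the text.
std::vector<int> kmpSearch(const std::string &text, const std::string &pattern) {
    std::vector<int> matches;
    if (pattern.empty()) return matches;
    std::vector<int> F = kmpFailure(pattern);
    int j = 0;  // length of the current partial match
    for (std::size_t i = 0; i < text.size(); i++) {
        while (j > 0 && text[i] != pattern[j]) j = F[j];
        if (text[i] == pattern[j]) j++;
        if (j == (int)pattern.size()) {                   // a match ends at i
            matches.push_back((int)(i - pattern.size() + 1));
            j = F[j];                                     // allow overlapping matches
        }
    }
    return matches;
}
```

On the running example, searching for BBABBB in BBBABBABBBA reports the single match starting at index 4, and the whole search is O(N + M).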


Removing duplicates from an array of integers

Removing duplicates from array

Given an unsorted array of integers, remove all duplicates from it. For example if input array is A = [2,3,4,4,1,2,3,3]
Final output should be A = [2,3,4,1], where the duplicates of 2, 3 and 4 are removed. Notice that duplicates are not necessarily adjacent in the array.

Analysis
The first option which comes to mind is to sort the given array using quicksort or merge sort in O(nlogn) and then remove all duplicates in linear time, since all duplicates will then sit together.
A detailed explanation of how duplicates can be removed from a sorted array is given in this post :

What if the constraint is not to sort the array first? Another way to put this constraint is to say that elements in the output array should preserve the order they had in the input array.

A basic algorithm to remove duplicate elements: for each element from first to last, check every element to its right; whenever one matches the current element, shift all the elements on its right one place to the left. Let's code it.
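The original listing is missing, so here is a C++ sketch of this basic shifting algorithm (the in-place function name is mine):

```cpp
#include <cstddef>
#include <vector>

// For each element, scan everything to its right; on a duplicate, shift the
// tail one step left and shrink the array. Worst case O(N^3).
void removeDuplicates(std::vector<int> &a) {
    for (std::size_t i = 0; i < a.size(); i++) {
        for (std::size_t j = i + 1; j < a.size(); ) {
            if (a[j] == a[i]) {
                for (std::size_t k = j; k + 1 < a.size(); k++)
                    a[k] = a[k + 1];       // shift left over the duplicate
                a.pop_back();
            } else {
                j++;
            }
        }
    }
}
```

Note that j is not advanced after a shift, since a new element has just moved into position j.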

Complexity of this algorithm to remove duplicates from array will be O(N^3)

Another method of solving this problem is to use a binary search tree. Construct a binary search tree from the elements of the array, discarding all duplicates. The tree will then contain only distinct values; however, the order of the original array is not maintained in this solution.
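Here is a C++ sketch of the BST approach (the names are mine, and node cleanup is omitted for brevity):

```cpp
#include <vector>

// Insert each element into a binary search tree, discarding values already
// present; an in-order walk then yields the distinct values in sorted order
// (the original order is lost).
struct Node {
    int value;
    Node *left = nullptr, *right = nullptr;
    explicit Node(int v) : value(v) {}
};

Node *insertUnique(Node *root, int v) {
    if (!root) return new Node(v);
    if (v < root->value) root->left = insertUnique(root->left, v);
    else if (v > root->value) root->right = insertUnique(root->right, v);
    // v == root->value: duplicate, discard it
    return root;
}

void inorder(const Node *root, std::vector<int> &out) {
    if (!root) return;
    inorder(root->left, out);
    out.push_back(root->value);
    inorder(root->right, out);
}

std::vector<int> uniqueSorted(const std::vector<int> &a) {
    Node *root = nullptr;
    for (int v : a) root = insertUnique(root, v);
    std::vector<int> out;
    inorder(root, out);
    return out;
}
```

For [2,3,4,4,1,2,3,3] this produces [1,2,3,4]: the duplicates are gone, but the original ordering is not preserved.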

The complexity of this approach is O(nlogn) on average: each node insertion takes O(logn) for a balanced tree, and there are n nodes to add.

The last and most efficient solution is to use merge sort and remove duplicates during the merge step. This solution requires extra bookkeeping in the merge step of merge sort. I have written the merge sort, but I still need to figure out the bookkeeping; if you have some hints, please comment or drop me a mail.

Find Euler path in a graph

Find Euler circuit and path in a graph using Fleury’s algorithm.

In the last post, Graphs: Find bridges in connected graphs, we discussed how to find bridges in an undirected graph. In this post, we will discuss an algorithm which uses bridges to find an Euler path in a graph, called Fleury's algorithm.

As per the definition, an Euler path in a graph is a path which traverses each edge of the graph exactly once between source and destination. If source and destination are the same, it becomes an Euler circuit. To understand when an Euler path exists in a graph, please follow this post: Graphs : Euler circuit and Euler path in graphs

Fleury's algorithm to find Euler path or circuit

  1. Start from the source node; call it the current node u. If there are nodes with odd degree (there can be at most two such nodes), start from one of them. Else start from any node in the graph.
  2. Traverse any edge (u, v) from the current node which is not a bridge edge.
  3. Set the current node to v and go to step 2.
  4. Only when no non-bridge edge is available from the current node, traverse a bridge edge.
  5. Stop when all edges are traversed.

Walk through Euler path step by step

1. The graph is as follows; nodes 1 and 6 have odd degrees, so our candidates for the start are either of these. Start from 6.

Initial graph with node 1 and 6 with odd degree

2. Traverse to 5, then to 1, then to 3, followed by 2, one by one, as these are not bridge edges.

3. At 2, there are three edges left: (2,1) and the two parallel edges (2,4). However, if we follow edge (2,1) we will burn the bridge and never be able to come back to the rest of the graph, hence we will not take that edge. So we move along an edge (2,4).

4. At 4 again, we have three edges: (4,2), (4,6) and (4,3). But if we go through (4,2) we will again burn a bridge, as (4,2) is a bridge edge. Hence, we traverse (4,3).

5. At 3, we are left with only one edge, (3,6), and we follow it, which is then followed by edge (6,4).
At 4, the bridge edge (4,2) now has to be followed because there is no other option, and the same goes for edge (2,1).

Hence, Euler path will be 6-5-1-3-2-4-3-6-4-2-1

Code for Fleury's algorithm
Please refer Graphs : Euler circuit and Euler path in graphs for detail graph creation and finding if there is an Euler path existing in graph.
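Since the original listing is not preserved, here is a compact C++ sketch of Fleury's algorithm. All the names are mine; the bridge check simply counts how many vertices are reachable from u before and after tentatively removing the edge, which is what produces the quadratic complexity discussed next.

```cpp
#include <vector>

// Sketch of Fleury's algorithm on an undirected multigraph stored as a
// matrix of edge counts. An edge (u, v) may be taken if it is the only edge
// left at u, or if removing it does not cut off part of the remaining graph
// (the "don't burn the bridge" rule from the steps above).
struct Fleury {
    int n;
    std::vector<std::vector<int>> adj;  // adj[u][v] = number of edges between u and v
    explicit Fleury(int n) : n(n), adj(n, std::vector<int>(n, 0)) {}
    void addEdge(int u, int v) { adj[u][v]++; adj[v][u]++; }

    int dfsCount(int u, std::vector<bool> &visited) {  // vertices reachable from u
        visited[u] = true;
        int count = 1;
        for (int v = 0; v < n; v++)
            if (adj[u][v] > 0 && !visited[v]) count += dfsCount(v, visited);
        return count;
    }

    bool isValidNextEdge(int u, int v) {
        int degree = 0;
        for (int w = 0; w < n; w++) degree += adj[u][w];
        if (degree == 1) return true;                  // forced: take the bridge
        std::vector<bool> visited(n, false);
        int before = dfsCount(u, visited);
        adj[u][v]--; adj[v][u]--;                      // tentatively remove the edge
        visited.assign(n, false);
        int after = dfsCount(u, visited);
        adj[u][v]++; adj[v][u]++;                      // put it back
        return before == after;                        // unchanged: not a bridge
    }

    void eulerPath(int u, std::vector<int> &path) {    // appends the vertices after u
        for (int v = 0; v < n; v++) {
            if (adj[u][v] > 0 && isValidNextEdge(u, v)) {
                adj[u][v]--; adj[v][u]--;
                path.push_back(v);
                eulerPath(v, path);
                return;
            }
        }
    }

    int edgesLeft() const {                            // for checking completeness
        int total = 0;
        for (int u = 0; u < n; u++)
            for (int v = u; v < n; v++) total += adj[u][v];
        return total;
    }
};
```

Seed the path with the start vertex (an odd-degree one if present) and call eulerPath from it; the algorithm then consumes every edge exactly once.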
The complexity of the above algorithm is O((V + E) * (V + E)): for each edge we traverse, we may need to check whether a candidate edge is a bridge, which takes O(V + E) time. Hence, the overall complexity is as given above.

Find bridges in connected graphs

Detect bridge edges in graph

Given a graph, detect all bridges in it. An edge is called a bridge edge if and only if, on removal of that edge, the graph becomes disconnected (if it was connected), or its number of connected components increases by one (if it was already disconnected). For example, in the graphs below, bridges are shown in red.

Bridges in graphs
The concept of detecting bridges in a graph is useful for solving the Euler path or tour problem.

Analysis

Depth first search of a graph can be used to see if the graph is connected or not. We can use the same idea: remove each edge one by one and check with DFS whether the graph is still connected. If yes, the edge is not a bridge; if not, it is. However, this method has quite a high complexity of O(E * (V + E)), where E is the number of edges and V the number of vertices.

Let's think of something better. Consider an edge (u,v) in the graph. Under what condition can we say that it is a bridge edge?
If we can somehow reach node u, or any ancestor of u, from any node which is a descendant of v, the graph stays connected and (u,v) is not a bridge edge. If that is not possible, then (u,v) is a bridge.

How can we determine that there is no edge from a descendant of v to u or its ancestors? For that we maintain the time when each node was discovered during depth first search; call it d[].
d[u] is the time when node u was discovered during DFS. If d[u] < d[v], u was discovered before v.
Below is graph with d[] filled for each node.

DFS traversal order of graph

Now, figure out the lowest d[x] which can be reached from each node. The reason to find it is to see whether there is a node x, reachable from the children of v, with d[x] less than d[u], i.e. x is an ancestor of u reachable from the children of v. Store the lowest discovery time reachable from node u in an array low[u].

So low[u] = min(low[u], low[v]) for each tree edge (u,v), and low[u] = min(low[u], d[x]) for each back edge (u,x).
The idea is that if (u,v) is an edge, then either there is no back edge from the subtree of v to u or an ancestor of u, in which case low[u] stays at its own discovery time, or there is a back edge to some x from the subtree of v, and then the minimum d[x] reached by a node in that subtree becomes low[u].
Below diagram shows calculation of low[] in graph.

Calculation of low[]

Finally, if low[v] > d[u], i.e. the discovery time of u is less than that of the least ancestor reachable from the subtree of v, we have a bridge, because there is no way to reach any ancestor of u once we disconnect the edge (u,v).

In the diagram above, if we remove edge (3,4), the graph is still connected, since low[4] = 2, which is less than d[3] = 3. Hence edge (3,4) is not a bridge.

Lots of theory; let's code it. We will modify the depth first search implementation to keep track of d[] and low[].

Code for finding bridges in graph
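The original listing is missing, so here is a C++ sketch of the d[]/low[] computation described above (the names are mine; the parent-edge skip assumes there are no parallel edges between the same pair of vertices):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// DFS tracking d[u] (discovery time) and low[u] (lowest discovery time
// reachable from the subtree of u). A tree edge (u, v) is a bridge
// iff low[v] > d[u].
struct BridgeFinder {
    int n, timer = 0;
    std::vector<std::vector<int>> adj;
    std::vector<int> d, low;
    std::vector<std::pair<int, int>> bridges;
    explicit BridgeFinder(int n) : n(n), adj(n), d(n, -1), low(n, 0) {}
    void addEdge(int u, int v) { adj[u].push_back(v); adj[v].push_back(u); }

    void dfs(int u, int parent) {
        d[u] = low[u] = timer++;
        for (int v : adj[u]) {
            if (v == parent) continue;                    // skip the tree edge back
            if (d[v] != -1) {
                low[u] = std::min(low[u], d[v]);          // back edge to an ancestor
            } else {
                dfs(v, u);
                low[u] = std::min(low[u], low[v]);        // propagate child's reach
                if (low[v] > d[u]) bridges.push_back({u, v});
            }
        }
    }

    std::vector<std::pair<int, int>> find() {
        for (int u = 0; u < n; u++)
            if (d[u] == -1) dfs(u, -1);
        return bridges;
    }
};
```

Every edge is examined a constant number of times during one DFS, which is where the O(V + E) bound below comes from.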

Complexity Analysis

The complexity of the bridge-finding code is O(V + E), where V is the number of vertices and E is the number of edges in the graph.