Strings : Toeknize a string separated by delimiter

Continuing the previous post, let’s look at the second problem.

Problem statement

Find all tokens in a string which are separated by given delimiter.

Analysis

We have to traverse the string and for each character, check if that character is present in delimiter string. If it is, then it is end of the token started after previous encounter of the delimiter. Here forward we should scan all characters till the time we first encounter a character which is not a part of given delimiter string, that would be start of the next token.

Let’s look at it step by step.

Step 1 : Assume we have string as s and delimiter string as delim.
Search in s the first character which is not present in delim.
If there is no such byte, we are done, return NULL.
If there is such byte, then that would be start of our first token.

Step 2: Scan through the found token.
From the character found in step 1, scan through s till we again encounter a character present in delim.
If there is no such character, we are done, we have only one token and return the start position of that token.

If we have such character, that would be end of the token.

Great we have found one token, How about next one? 

For next token to be generated, we need to keep track where to start looking from in the string as we would have scanned a part of it in the previous tokens. So we keep track of the pointer where to start looking from.

Step 3 : Once we have found the end of token in step 2, again scan out all subsequent characters which are part of delim. Take the index of first non-delimiter character. This would be the starting point of our next token search.

Implementation note : To distinguish between whether it is first call to the function or subsequent call, we pass NULL pointer to the source string parameter indicating that it is subsequent call and not initial call.

Similar approach can be used to remove all characters which are present in a string from another string.

Code

Test cases

1. String and delimiter with one characters
S = “aaab_dddcc”
delim = “_”

2. String and delimiter with multiple characters
S = “aaab_dd?dcc”
delim = “_?”

3. Empty string
S = “”
delim = “_”

4. Empty delimiter
S = “aaab_dddcc”
delim = “”

5. String starting with delimiters
S = ___aaaaabbb
s = “_”

6. String ending with delimiters
S =”aaaa___”
s = “_”

7. String with no delimiter character
S = “aaaaaaaa”
s = “_”

8. String with all character as delimiter characters
S =”________”
s =”_”

Complexity analysis

Complexity of above code is O(N * M), where N is length of string and M is length of delimiter string.