Saturday, November 8, 2014

Is strtok safe !?

We all know strtok as one of the most convenient ways to tokenize strings in C and C++.

Syntax:
        #include <string.h>
        char *strtok(char *str, const char *delim);
        where,
        str - string to be tokenized (NULL on later calls, to continue with the same string)
        delim - set of delimiters to be used while tokenizing

Example:

#include <iostream>
#include <string.h>
using namespace std;

int main()
{
    char paragraph[] = "Hello$Funny$Man"; // String where words are separated by $
    // We need to tokenize above string
    char* tok;
    tok = strtok(paragraph,"$"); //  extract the word using $ as a delimiter
    while(tok)
    {
        cout << "Extracted word is " << tok << endl;
        tok = strtok(NULL,"$");
    }
    return 0;
}


Output: 

Extracted word is Hello
Extracted word is Funny
Extracted word is Man


Now, is it the safest way to tokenize a string/char-array? Do we see any problem with this method?
What happens when we have to sub-tokenize the extracted token? Let us see with the example below.


#include <iostream>
#include <string.h>
#include <stdlib.h>
using namespace std;

int main()
{
    char paragraph[] = "Hello$Funny$Man How$are$you$doing?"; // String where sentences are separated by space and words by $
    // We need to tokenize above string
    char* tok;
    tok = strtok(paragraph," "); // Firstly extract the sentence using space as a delimiter
    int count = 0;
    while(tok)
    {
        char* subtok;
        char* sentence = strdup(tok);
        subtok = strtok(sentence,"$");
        count++;
        while(subtok)
        {
            cout << "Extracted word in sentence " << count << " is " << subtok << endl;
            subtok = strtok(NULL,"$");
        }
        free(sentence);
        tok = strtok(NULL," ");
    }
    return 0;
}


Output: 

Extracted word in sentence 1 is Hello
Extracted word in sentence 1 is Funny
Extracted word in sentence 1 is Man

We can observe that only the first sentence was extracted. To see why, look closely at how strtok is called: it takes the string to be tokenized only on the first call, and from the second call onwards it takes NULL as the first argument. This means strtok must keep the position it has reached in hidden static storage that is shared by every caller.

If we use strtok in both the outer and the inner loop, the outer loop runs only once, as we saw in the above example. There is only one shared pointer, and the inner strtok(sentence,"$") call overwrites it. By the time the inner loop finishes, that pointer refers to the exhausted inner string, so the outer loop's position in paragraph is lost. On typical implementations the next strtok(NULL," ") then finds nothing more to tokenize, and the outer loop exits as if it had processed the whole string.

How do we solve this problem? Well, we don't need to invent anything; it is an open secret documented in the man page of strtok.
On POSIX systems, string.h also provides another API called strtok_r, which takes an additional parameter to store the context of the string/char-array, i.e. the point up to which it has been tokenized.

Syntax:
        #include <string.h>
        char *strtok_r(char *str, const char *delim, char **saveptr);
        where,
        str - string to be tokenized
        delim - set of delimiters to be used while tokenizing
        saveptr - additional pointer to save the context

As per the man page, strtok_r() is a reentrant version of strtok(). The saveptr argument is a pointer to a char * variable that is used internally by strtok_r() in order to maintain context between successive calls that parse the same string.

Let us use strtok_r and see whether it solves the above problem.



#include <iostream>
#include <string.h>
#include <stdlib.h>
using namespace std;

int main()
{
    char paragraph[] = "Hello$Funny$Man How$are$you$doing?"; // String where sentences are separated by space and words by $
    // We need to tokenize above string
    char* tok;
    char *savePtr1, *savePtr2;
    tok = strtok_r(paragraph," ", &savePtr1); // Firstly extract the sentence using space as a delimiter
    int count = 0;
    while(tok)
    {
        char* subtok;
        char* sentence = strdup(tok);
        subtok = strtok_r(sentence,"$",&savePtr2);
        count++;
        while(subtok)
        {
            cout << "Extracted word in sentence " << count << " is " << subtok << endl;
            subtok = strtok_r(NULL,"$",&savePtr2);
        }
        free(sentence);
        tok = strtok_r(NULL," ", &savePtr1);
    }
    return 0;
}


Output:

Extracted word in sentence 1 is Hello
Extracted word in sentence 1 is Funny
Extracted word in sentence 1 is Man
Extracted word in sentence 2 is How
Extracted word in sentence 2 is are
Extracted word in sentence 2 is you
Extracted word in sentence 2 is doing?


From the above output we can observe that the problem is solved: both sentences are tokenized. So it is always best practice to use strtok_r when your program may tokenize strings in recursion, in threads or in nested loops. Put the other way around: use strtok only if you are absolutely sure that your program never tokenizes in recursion, threads or nested loops.

Happy Coding !!
