Saturday, November 8, 2014

Is strtok safe !?

We all know that strtok is the most convenient way we all use to tokenize strings.

Syntax:
        #include <string.h>
        char *strtok(char *str, const char *delim);
        where,
        string to be tokenized
        delim - set of delimiters to be used while tokenizing

Example:

#include <iostream>
#include <string.h>
using namespace std;

int main()
{
    char paragraph[] = "Hello$Funny$Man"; // String where words a seperated  by $ 
    // We need to tokenize above string
    char* tok;
    tok = strtok(paragraph,"$"); //  extract the word using $ as a delimiter
    while(tok)
    {
        cout << "Extracted word is " << tok << endl;
        tok = strtok(NULL,"$");
    }
    return 0;
}


Output: 

Extracted word is Hello
Extracted word is Funny
Extracted word is Man


Now is it the safest way to tokenize string/char-array? Do we see any problem with this method ?
What happens when we have to sub-tokenize the extracted token ? Let see with the below example.


#include <iostream>
#include <string.h>
#include <stdlib.h>
using namespace std;

int main()
{
    char paragraph[] = "Hello$Funny$Man How$are$you$doing?"; // String where sentence a seperated by space and words by $ 
    // We need to tokenize above string
    char* tok;
    tok = strtok(paragraph," "); // Firstly extract the sentence using space as a delimiter
    int count = 0;
    while(tok)
    {
        char* subtok;
        char* sentence = strdup(tok);
        subtok = strtok(sentence,"$");
        count++;
        while(subtok)
        {
            cout << "Extracted word in sentence " << count << " is " << subtok << endl;
            subtok = strtok(NULL,"$");
        }
        free(sentence);
        tok = strtok(NULL," ");
    }
    return 0;
}


Output: 

Extracted word in sentence 1 is Hello
Extracted word in sentence 1 is Funny
Extracted word in sentence 1 is Man

We can observe that only one sentence was extracted, this happens because, if we closely observe the usage of strtok, strtok takes source string to be tokenized only in its first call, from the second call on-wards it takes NULL as the first argument, which means strtok uses some global scope space to store the pointer(point to which it has tokenized)

If we use strtok both in outer loops and inner loops, it only runs the outer-loop once as we saw in above example. Since there is only one common global pointer and by the time inner loop is completely executed, this global pointer points to NULL, and hence the actual outer loop pointer(point to which it has tokenized) is lost.Therefore it exits out of outer loop assuming it has tokenized completely.

How do we solve this problem?? well we don't we to do anything, this is a open secret which is in the man page of strtok.
string.h also has an another API called strtok_r which takes in an additional parameter to store the context of the string/char-array i.e point to which it has tokenized.

Syntax:
        #include <string.h>
        char *strtok_r(char *str, const char *delim, char **saveptr);
       where,
        str - string to be tokenized
        delim - set of delimiters to be used while tokenizing
        saveptr - additional pointer to save the context

As per man page its strtok_r() function is a reentrant version strtok(). The saveptr argument is a pointer to a char * variable that is used internally by strtok_r() in order to maintain context between successive calls that parse the same string.

Lets us use strtok_r and see if it solves above problem or not.



#include <iostream>
#include <string.h>
#include <stdlib.h>
using namespace std;

int main()
{
    char paragraph[] = "Hello$Funny$Man How$are$you$doing?"; // String where sentence a seperated by space and words by $ 
    // We need to tokenize above string
    char* tok;
    char *savePtr1, *savePtr2;
    tok = strtok_r(paragraph," ", &savePtr1); // Firstly extract the sentence using space as a delimiter
    int count = 0;
    while(tok)
    {
        char* subtok;
        char* sentence = strdup(tok);
        subtok = strtok_r(sentence,"$",&savePtr2);
        count++;
        while(subtok)
        {
            cout << "Extracted word in sentence " << count << " is " << subtok << endl;
            subtok = strtok_r(NULL,"$",&savePtr2);
        }
        free(sentence);
        tok = strtok_r(NULL," ", &savePtr1);
    }
    return 0;
}


Output:

Extracted word in sentence 1 is Hello
Extracted word in sentence 1 is Funny
Extracted word in sentence 1 is Man
Extracted word in sentence 2 is How
Extracted word in sentence 2 is are
Extracted word in sentence 2 is you
Extracted word in sentence 2 is doing?


From the above output we can observe that the problem is solved as both the sentences are tokenized. So its always best practice to use strtok_r when you know that your program is going to use strtok in recursions or threads or nested-loops or to put the other way around would be use strtok only if you are damn sure that your program doesn't use strtok in resursions or threads or nested-loops.

Happy Coding !!

Monday, November 3, 2014

Scientific Notation - best way to print floating point numbers

Dealing with floating point numbers is one of the most complex thing which a programmer and computer deal with in a day-to-day programming.

Floating point numbers are real numbers which has a fractional part.
For Eg: 1.234, -2.345 .. etc

As we all know that computers are the integer based machines, [Basically it can understand only 2 voltages high (1) and low (0)], and there are complex ways which includes lots of assumptions and approximations in representing floating point numbers, In this post i will only focus on the best way to print floating point numbers.

There are few standards set by IEEE for using floating point numbers. So coming to printing of floating point numbers on the standard o/p (monitor) or in the file, which is the best practice to be followed ? Which types of notation is more precise and more consistent.... Answer to that would be scientific notation

In scientific notation (which is considered most standard way) any number can be represented in the form :
x*10y

For Eg : 0.2 --> 2*10-1
              200 --> 2*102


The following c++ code and its output could prove to be a simple proof to explain why is scientific method the best way to print floating point numbers.


// Non-scientific way
#include<iostream>  // To print to standard output
#include <iomanip> 
using namespace std;

int main () {

    double shift = 10616.200000000001;
    double num1 = 0;
    double num2 = 0.01;
    
    std::setprecision(15);
    // cout << scientific;

    cout << "(shift + num1) * 1e-3 = " << setw (12) << (shift + num1) * 1e-3 << endl;
    cout << "(shift + num2) * 1e-3 = " << setw (12) << (shift + num2) * 1e-3 << endl;

    return 0;
}

// Output
(shift + num1) * 1e-3 =  10.6162
(shift + num2) * 1e-3 =  10.6162

From above output we can observe that even tough we had set the precision to more than what we need, final result is same irrespective of num1(0) and num2(0.01) begin different.

Now let us try to print the same numbers in scientific way and see the difference. We can just use cout << scientific to print in scientific way. From the above code we need to just remove the comment line.

// Scientific way
#include<iostream>  // To print to standard output
#include <iomanip> 
using namespace std;

int main () {

    double shift = 10616.200000000001;
    double num1 = 0;
    double num2 = 0.01;
    
    std::setprecision(15);
    cout << scientific;

    cout << "(shift + num1) * 1e-3 = " << setw (12) << (shift + num1) * 1e-3 << endl;
    cout << "(shift + num2) * 1e-3 = " << setw (12) << (shift + num2) * 1e-3 << endl;

    return 0;
}

And its output looks like this.

//Output 
(shift + num1) * 1e-3 = 1.061620e+01
(shift + num2) * 1e-3 = 1.061621e+01

We can very clearly observe that the result is different while printing in scientific notation.


Hence scientific notation is the best way as it does not lose more information due to approximations and its safe way to deal with floating point numbers.

A very big thank you to http://hilite.me/ which helped me beautify the code.