Hash Function In C++

main

If you are a programmer, you must have heard the term “hash function”. In its most general form, a hash function projects a value from a set with many members to a value from a set with a fixed number of members. We basically convert the input into a different form by applying a transformation function. Hash functions map data of arbitrary length to data of a fixed length. The values returned by a hash function are called hash values. An interesting thing to note is that hash functions are not reversible. This means that you cannot recover the original data from hashed values. This property makes it one of the most useful data structures known to mankind. Hash functions are used extensively in internet security. In this post, we will talk about C++ STL and how to use hash functions with user defined classes.  

Why do we care about this?

Almost every single data structure is made available as part of C++ STL. The containers that implement hash tables are std::unordered_set and std::unordered_map. These containers usually work with predefined datatypes like int, float, string, etc. So if you want to use them with a user-defined classes, you need to define the following two things:

  • Hash function: It is basically a mathematical operation that defines how we transform the input. We need to specify the rule so that the compiler knows what to do. This must be a class that overrides operator() and calculates the hash value given an object of the key-type. The inbuilt hash function expects a predefined data type to be the input, so that it can hash the value. In our case, we have a custom class. So the compiler won’t know what to do. So we need to specialize the std::hash template for the user-defined class. We need to come up with a hash function that depends on the data inside the user-defined class.
  • Comparison function for equality: As we know, all hash tables need to know how to deal with collision efficiently. Wait a minute, what is collision exactly? A collision happens whenever two or more inputs are mapped to the same hash value. If we don’t take care of collision, then our input data might get overwritten. Now you might ask, how is it possible that multiple inputs might get mapped to the same value? Well, that depends on how good you hash function is. The hash cannot rely on the fact that the hash function will always provide a unique hash value for every distinct key, so it needs a way to compare two given keys for an exact match. Now if the input is int or float, it can just directly compare the values. But since we have a custom class, we need to tell it how to compare two instances. You can implement this either as a class that overrides operator(), or as a specialization of std::equal, or by simply overloading operator==() for your key type.

How to do it?

The difficulty with the hash function is that if your key type consists of several members. So now, we need to tell the hash function to calculate hash values for the individual members. We will then combine them into one hash value for the entire object. For good performance, i.e. fewer collisions, you should think carefully about how to combine the individual hash values to ensure you avoid getting the same output for different objects too often. Remember, we cannot recover original data from hashed values! So if we map multiple objects to the same output, it’s gone.

A fairly good starting point for a hash function is the one that uses bit shifting and bitwise XOR to combine the individual hash values. For example, let’s say we have a custom class like this:

class Line
{
private:
    float m;
    float c;
 
public:
    Line() {m = 0; c = 0;}
    Line(float mInput, float cInput) {m = mInput; c = cInput;}
    float getM {return m;}
    float getC {return c;}
    void setM(float mInput) {m = mInput;}
    void setC(float cInput) {c = cInput;}
 
    bool operator==(const Line &anotherLine) const
    { 
        return (m == anotherLine.m && c == anotherLine.c);
    }
};

Now we override the functionality of the hash function with our custom definition:

namespace std
{
    template <>
    struct hash<Line>
    {
        size_t operator()(const Line& k) const
        {
            // Compute individual hash values for two data members and combine them using XOR and bit shifting
            return ((hash<float>()(k.getM()) ^ (hash<float>()(k.getC()) << 1)) >> 1);
        }
    };
}

Now in the main function, we can use the hash function as given below:

int main()
{
    unordered_set<Line> t;

    Line line1 = Line(1.0,2.0);
    Line line2 = Line(3.0,4.0);
    Line line3 = Line(3.0,4.0);
    
    t.insert(line1);
    t.insert(line2);
    t.insert(line3);

    for(unordered_set<Line>::const_iterator it = t.begin(); it != t.end(); it++)
    {
        Line t = *it;
        cout << t.m << " " << t.c << "\n" ;
    }
    
    return 1;
}

It will automatically use std::hash<Line> as defined above for the hash value calculations, and the operator== defined as member function of Line for equality checks. If you see the print out on the command line, you will see that “line3″ is not printed because it’s the same as “line2″.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s