Friday, May 29, 2009

Performance testing of Dictionary, List and HashSet


Update June 4, 2009
The original post favored Dictionary as the fastest data structure for both Add and Contains. After some debate and some re-runs of my test code, I found that result to be wrong: HashSet is faster. The results were probably affected by the workload on my computer. I ran one List test, then a Dictionary test and finally a HashSet test - I should have run them multiple times and under the same workload (i.e. no other programs or processes running). Anyway, on to the article.

Still working with high-volume real-time data feeds, I'm struggling to understand where the bottleneck in my code is. It's not the database, it's not the network - it's somewhere in the code. Now that doesn't help a lot.

So I decided to have a look at the different data structures that could be usable for my needs. Right now I'm using a List to keep track of my orders, and each time I get a new order, I check that list for the given ID.

I started to wonder whether there is a performance issue with the List data structure itself, so I made a small test with the following code:

static void TestList()
{
    var watch = new Stopwatch();

    //Create List
    var theList = new List<int>();

    watch.Start();

    //Fill it
    for (int i = 0; i < noOfIterations; i++)
    {
        theList.Add(i);
    }

    watch.Stop();
    Console.WriteLine("Avg add time for List: {0}", (watch.Elapsed.TotalMilliseconds / noOfIterations));

    watch.Reset();
    watch.Start();

    //Test Contains
    for (int j = 0; j < noOfIterations; j++)
    {
        theList.Contains(j);
    }

    watch.Stop();
    Console.WriteLine("Avg Contains lookup time for List: {0}", (watch.Elapsed.TotalMilliseconds / noOfIterations));
}

I created similar test code for Dictionary and HashSet. In all of the tests I used a loop where the given key always existed in the data structure. I used Add instead of Insert, because tests I've seen online show that it is much faster. For example, this one.
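The Add vs Insert difference is easy to see for yourself. Here is a minimal sketch of my own (the class name and item count are not from the linked test): Add appends at the end of the internal array, while Insert(0, ...) has to shift every later element.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class AddVsInsert
{
    static void Main()
    {
        const int count = 20000;

        // Append at the end: amortized O(1) per item.
        var addList = new List<int>();
        var addWatch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) addList.Add(i);
        addWatch.Stop();

        // Insert at the front: O(n) per item, every later element is shifted.
        var insertList = new List<int>();
        var insertWatch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) insertList.Insert(0, i);
        insertWatch.Stop();

        Console.WriteLine("Add:    {0} ms", addWatch.Elapsed.TotalMilliseconds);
        Console.WriteLine("Insert: {0} ms", insertWatch.Elapsed.TotalMilliseconds);
    }
}
```

On my understanding of List's array-backed implementation, the Insert loop should come out far slower as count grows.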

For the first run, I used 1,000 for the noOfIterations variable and ran it three times.

The horizontal scale here is in milliseconds, so things are reasonably fast - an average of 0.2 ms per add corresponds to about 5,000 adds per second. As you can probably see without checking the numbers, Dictionary is faster. List is slower for lookup, but HashSet surprises a little with its slow Add. So what happens when we go to 100,000 items?

OK, List is still a lot slower for lookup. Add seems to compare OK though. Let's see how it compares if we remove the lookup time from the chart: This is wrong, see "The aftermath"

Now that's a result! It turns out List is the winner if fast adding is all you care about. If you want to look your values up later, though, Dictionary is the winner. I hope this will save some time for other developers in the same situation. Feel free to comment if you find different results, or a problem with my test!

By popular request: The rest of the code!

static void TestDictionary()
{
    var watch = new Stopwatch();

    //Create Dictionary
    var theDict = new Dictionary<int, int>();

    watch.Start();

    //Fill it
    for (int i = 0; i < noOfIterations; i++)
    {
        theDict.Add(i, i);
    }

    watch.Stop();
    Console.WriteLine("Avg add time for Dictionary: {0}", (watch.Elapsed.TotalMilliseconds / noOfIterations));

    watch.Reset();
    watch.Start();

    //Test ContainsKey
    for (int j = 0; j < noOfIterations; j++)
    {
        theDict.ContainsKey(j);
    }

    watch.Stop();
    Console.WriteLine("Avg Contains lookup time for Dictionary: {0}", (watch.Elapsed.TotalMilliseconds / noOfIterations));
}

static void TestHashSet()
{
    var watch = new Stopwatch();

    //Create HashSet
    var hashSet = new HashSet<int>();

    watch.Start();

    //Fill it
    for (int i = 0; i < noOfIterations; i++)
    {
        hashSet.Add(i);
    }

    watch.Stop();
    Console.WriteLine("Avg add time for HashSet: {0}", (watch.Elapsed.TotalMilliseconds / noOfIterations));

    watch.Reset();
    watch.Start();

    //Test Contains
    for (int j = 0; j < noOfIterations; j++)
    {
        hashSet.Contains(j);
    }

    watch.Stop();
    Console.WriteLine("Avg Contains lookup time for HashSet: {0}", (watch.Elapsed.TotalMilliseconds / noOfIterations));
}

The aftermath
Because of the controversy around HashSet being slower than Dictionary, despite HashSet storing only one value where Dictionary stores two, I re-ran the test on my computer and on a colleague's. Instead of running the tests one by one, I ran the HashSet test three times and the Dictionary test three times, and took the averages. The result you can see below: HashSet is faster than Dictionary for both the Add and Contains methods.
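The re-run approach can be sketched roughly like this (a minimal illustration of my own; the class name, helper, iteration count and run count are assumptions, not the exact code used). Interleaving the runs and averaging reduces the influence of background load on any one measurement:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class InterleavedBenchmark
{
    const int NoOfIterations = 100000;
    const int Runs = 3;

    static void Main()
    {
        double hashSetAddMs = 0, dictAddMs = 0;

        // Interleave the runs so both structures see a similar background load.
        for (int run = 0; run < Runs; run++)
        {
            var hashSet = new HashSet<int>();
            hashSetAddMs += Time(() =>
            {
                for (int i = 0; i < NoOfIterations; i++) hashSet.Add(i);
            });

            var dict = new Dictionary<int, int>();
            dictAddMs += Time(() =>
            {
                for (int i = 0; i < NoOfIterations; i++) dict.Add(i, i);
            });
        }

        Console.WriteLine("Avg add time per item, HashSet:    {0} ms",
            hashSetAddMs / (Runs * NoOfIterations));
        Console.WriteLine("Avg add time per item, Dictionary: {0} ms",
            dictAddMs / (Runs * NoOfIterations));
    }

    // Returns the elapsed time of the action in milliseconds.
    static double Time(Action action)
    {
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }
}
```

The first iteration of each run also acts as a rough warm-up for JIT compilation, which single one-off runs do not get.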


12 comments:

F Quednau said...

I am a bit disappointed at the HashSet lookup performance. I was actually expecting it to be on a par with the dictionary.

Grumpy Grandma said...

Send this to Vance Morrison so he can maybe make this faster for release. HashSet seems wrong in these benchmarks

tommyaquinas said...

1) A list is O(N) time to find an item. It is O(1) to insert an item.

2) A Dictionary and a HashSet are both Hashtables of course. Insert and retrieve time are both O(1), but the constant for inserting into a hashtable is much larger than a list due to having to compute the correct bucket, resize the table, etc.

3) "I created similar test code for Dictionary and HashSet." Don't say you used similar code. Post it! I just tried comparing adding 1 million items using Dictionary and HashSet and HashSet was FASTER than Dictionary doing inserts. There is no WAY that Hashset should be that much slower, but post your code, maybe you have some data that somehow causes it to be slower.

Hegi said...

This entry is rather sad.

I expect that ANY graduate from ANY CS related studies would know which data structure to use in any scenario.

To be honest, if someone came in for an interview and he (she) did not know the asymptotic time bounds for basic operations on a list, balanced tree, hashset etc., and did not know which structure is best suited for which scenario, it would be a disaster.

Anonymous said...

@Hegi:
your comment is rather sad.

I expect that ANY graduate from ANY CS related studies would know that if I bring a class and call it a List, and the API looks like it was a list, inside it can be anything! If something looks like a duck but walks like a dog, it cannot be a duck. You never know what is behind the facade, mr. smart.

Anonymous said...

Jopp, post the code for the HashSet and Dictionary, 'cause the results seem wrong

SanjayU said...

I'd love to see all your testing code as well...

Bart czernicki said...

Tests like this should always be taken "as is". You could write a book on comparing these data structures properly:
- how is the List initialized? A .NET list uses an array internally, so if you initialize it for 1,000,000 objects then it mitigates the whole "re-create the array" cost when we add objects (I forget the algorithm, but for 1,000,000 objects it would happen more than several times)
- thread-safety...how would applying Read/Write locks affect the collection
- when you say "lookup" what does that mean? I can write you a query with PLINQ (.NET 4.0) that uses multiple processor cores and could come back factors faster.
- speaking of LINQ, you can also override the where implementation to use an internal hash. So, in effect you can use a list for adding/removing and hashing for looking items up

Once you start getting into that kind of performance tuning it gets blurry as to which data structure is "better"
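Bart's point about the internal array can be sketched as follows (my own minimal illustration, not code from the post): List<int> has a constructor that takes an initial capacity, and pre-sizing avoids the repeated internal array re-creation while filling.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class CapacityDemo
{
    static void Main()
    {
        const int count = 1000000;

        // Default capacity: the internal array is grown repeatedly as items arrive.
        var growing = new List<int>();
        var w1 = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) growing.Add(i);
        w1.Stop();

        // Pre-sized: one allocation up front, no re-copying while filling.
        var presized = new List<int>(count);
        var w2 = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) presized.Add(i);
        w2.Stop();

        Console.WriteLine("Default capacity: {0} ms", w1.Elapsed.TotalMilliseconds);
        Console.WriteLine("Pre-sized:        {0} ms", w2.Elapsed.TotalMilliseconds);
    }
}
```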

Stian Sveen said...

Thanks for all your comments, I'll definitely be posting the rest of the test code. I agree with what you say, it does seem strange that HashSet would be outperformed by Dictionary. Hence this post...

Stian Sveen said...

@Bart Czernicki:
Thanks for your input! I think PLINQ is very interesting, and we've tried the Parallel FX CTP before. I did not think about taking it into this test, to be honest. I guess this is more of a vanilla .NET 3.5 test.

The LINQ override idea looks interesting, maybe I'll find time to look into that later.

In retrospect the whole test should have been better planned, and according to some, a list should not even be in here. But it is, and at least the material has potential, just look at the comments.

Anonymous said...

It would be handy to have the x axis as a logarithmic scale, so that we can see the small and large numbers on the same graphs. Nice article + comments, anyway!

Ganesh Raman said...

Very Nice...