Software Development - Java, Ruby, Perl - RoboProg's

RoboProg's / Software Development

Last Month

Apr 29, 2009

Get Your Thread Off of My Fork.

I've been considering the "c10k" problem lately: to have a server that must service 10,000 concurrent requests. I don't think I've solved it. However, I wanted to get a feel for how threads and the unix fork construct compare, performance wise.

Below is a table of several possible languages / implementations. I first used Perl for my benchmarks, because it is supposed to have native threads. It turns out, the default build on Ubuntu 8.04 is using green (simulated) threads. I also used Python (which I don't really know) because it is an interpreted high level language which supports both fork and native threads.

Language Fork Native Threads Green Threads Comments
Java N Y N Green threads are no longer used
Perl Y N Y Threading model is implementation dependent
Ruby Y N Y
Python Y Y N Note that the so-called GIL can degrade to effectively only utilizing a single CPU
C Y Y N Not considering third party libraries to do green threads

What the test programs do is to simulate overlapping requests and responses. While there is no actual socket connection to a client, each sub-process must generate a formatted timestamp, followed by about a half KB of pseudo-XML. There is a mutex around the time formatter in the threaded version in either implementation, as many resources in multithreaded solutions have to be guarded via access in non-reentrant critical sections. Here is the Perl source and the Python source.

My first set of trials is on a 3 GHz (1025 KB cache) Intel Pentium D, which has 2 logical CPUs and SMP enabled in the OS.

rob@ganymede$ /usr/bin/time ./ F 5000 > ignore.txt 
0.18user 1.34system 0:19.43elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
112inputs+0outputs (1major+70961minor)pagefaults 0swaps
rob@ganymede$ /usr/bin/time ./ T 5000 > ignore.txt 
A thread exited while 2 threads were running.
19.29user 0.72system 0:20.11elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+4632outputs (0major+2753minor)pagefaults 0swaps
rob@ganymede$ /usr/bin/time ./ F 5000 > ignore.txt 
0.16user 1.20system 0:30.14elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
192inputs+0outputs (1major+81574minor)pagefaults 0swaps
rob@ganymede$ /usr/bin/time ./ T 5000 > ignore.txt 
0.01user 0.00system 0:20.01elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+4672outputs (0major+976minor)pagefaults 0swaps

My second set of trials is on a 1.5 GHz (512 KB cache) Intel Atom, which has hyperthreading disabled and is thus running as a single CPU.

rob@charon$ /usr/bin/time ./ F 5000 > ignore.txt 
0.17user 0.78system 0:22.36elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+60747minor)pagefaults 0swaps
rob@charon$ /usr/bin/time ./ T 5000 > ignore.txt 
A thread exited while 2 threads were running.
27.63user 0.73system 0:29.66elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+4632outputs (0major+1601minor)pagefaults 0swaps
rob@charon$ /usr/bin/time ./ F 5000 > ignore.txt 
0.16user 0.58system 0:35.28elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+66512minor)pagefaults 0swaps
rob@charon$ /usr/bin/time ./ T 5000 > ignore.txt 
1.40user 0.24system 0:01.83elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+4672outputs (0major+793minor)pagefaults 0swaps

(note that I ran the trials multiple times, and the results were consistant with the samples shown here)

I have summarized the results below

Host Implementation Method Elapsed Time
Relative Speed
(Perl + Fork = 1,
0 = instant)
Pentium D Perl Fork 19.4 1
Pentium D Perl Thread 20.1 1.04
Pentium D Python Fork 30.1 1.55
Pentium D Python Thread 20.0 1.03

Atom Perl Fork 22.4 1
Atom Perl Thread 29.7 1.33
Atom Python Fork 35.3 1.56
Atom Python Thread 1.83 0.08

These were not exactly the results I expected: the Python thread implementation on the wimpy machine runs much faster than any other implementation on either host. I guess that's why they call it computer science, since sometimes you have to measure instead of believing dogma.

I will, however, speculate about the other results. No matter how you slice it, I think Python must have a pretty good thread implementation, relative to Perl. Now, on the dual core SMP host, the Perl fork implementation seemed to edge out the other implementations. I think this may be due, in part, to the garbage collection implementations: Perl uses a reference counting scheme, which while much disparaged in the press as inadequate, does keep memory references more localized during execution; Python uses the now industry standard mark and sweep garbage collection, which is more thorough, but causes several performance problems -- All of that sweeping causes much of the heap to be marched in and out of the CPU cache, destroying an locality benefit the code might have had, and, all of that marking causes any forked processes to have to copy memory pages that had no real application data changes in them. I suspect that Java shares performance characteristics with the Python thread implementation. Like they say, the Lisp programmer knows the value of everything, and the cost of nothing (in contrast to the cynic). What do you think?

Contact me:

Copyright 2009, Robin R Anderson