Sunday, November 30, 2008

Python - a really useful tool

Coming largely from the Java world, I hadn't noticed how widespread Python has become. It now seems to be a standard part of a Unix installation (well, at least on all the Unix boxes I have access to). It can be run in the same manner as a standard bash/bourne/... Unix script, but still gives you the power of a full-blown OO language with a fair number of functional features thrown in. In the future I'm certainly intending to start using it for more complicated scripting tasks, like extracting data from log files, if only to try and solve them in a more functional way.

Below are a few examples. These are largely to remind me of some key aspects that might be useful next time I have to write some scriptable task.

Line counting the functional way

#!/usr/bin/python
import sys

# Count the lines in the file named on the command line by folding over it
fp = open(sys.argv[1])
print reduce(lambda count, row: count + 1, fp, 0)

Executing Unix commands and piping the results

#!/usr/bin/python
from subprocess import Popen, PIPE

# Pipe the output of 'ls' into 'wc', equivalent to 'ls | wc' in the shell
ls_output = Popen("ls", stdout=PIPE).communicate()[0]
out = Popen("wc", stdin=PIPE, stdout=PIPE).communicate(ls_output)
print out[0]

The start of a find command

#!/usr/bin/python
from re import match
from os import walk
from os.path import join

# Walk the directory tree and print the path of every file ending in .py
for root, dirs, files in walk('.'):
    print "\n".join([join(root, filename) for filename in files if match(r'.*\.py$', filename)])

Monday, March 24, 2008

Getting Started with F#

I've been playing around with a few different functional programming languages and have settled on F# for the moment. It is essentially a port of Objective Caml onto the .Net platform. Coming from a Java background this kills two birds with one stone for me; it will give me a little intro to .Net and will allow me to study a functional programming language in more depth.

F# has been developed by Microsoft Research, which means the compiler and associated tools are already of a pretty good quality. There is also a plugin for Visual Studio providing code completion, syntax highlighting etc. Although I'm not a big fan of VS, it is better than a text editor. The great thing is that you can effectively get a free (non-evaluation) F# IDE by downloading the empty Visual Studio shell and then adding the F# plugin. First grab the Visual Studio shell (i.e. this doesn't contain C# or any other language) and then download F#. For more details read here.

Once you have installed this lot, you should see a new 'Microsoft Research F# ...' item on your Start menu. The best thing to do is play around with a few simple expressions using the F# Interactive Console (which can also be run from within Visual Studio). For example, type the following at the prompt:

printfn "Hello world";;
So what next? Most of the F# books cost quite a bit of money, so I'd recommend initially looking at some of the free OCaml ones. The core language is so similar that you should be able to make quite a bit of progress before having to spend any money. Try an Introduction to Objective Caml and Developing Applications with Objective Caml.

Finally a good resource is the F# wiki.

Saturday, February 23, 2008

Hibernate Performance Again

So Hibernate really is adding a significant amount of overhead compared to JDBC when querying the database. As shown in my previous blog entry, the cost of processing a row is quite high compared to JDBC (more than double).

I've done quite a bit of investigating trying to work out the root cause. Initially I thought it might be the connection pool, so I switched from using C3P0 to DBCP, but that made no noticeable difference. I also adjusted the JDBC connection to work from the connection pool and use a proper transaction for the query. This slowed the JDBC code down a bit, but there was still a big gap.

As best I can make out, having done some profiling, hydrating objects from the result set incurs a significant performance cost, and that cost is proportional to the number of rows retrieved. I guess this is to be expected given the complexity of mapping between a relational database and an object model.

One useful performance improvement was to mark the transaction as read-only when performing the queries. This seemed to save about 0.5 ms.
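
For illustration, marking the transaction as read-only looks roughly like this (a minimal sketch assuming Spring-managed transactions around the Hibernate session; the post doesn't show the exact mechanism, and the class and method names here are made up):

import org.hibernate.SessionFactory;
import org.springframework.transaction.annotation.Transactional;

// Minimal sketch, assuming Spring-managed transactions. A read-only
// transaction lets Hibernate skip flushing and dirty checking for the
// entities loaded by the query.
public class RunnerQueries {

    private final SessionFactory sessionFactory;

    public RunnerQueries(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    @Transactional(readOnly = true)
    public Runner findRunnerWithRuns(Long runnerId) {
        return (Runner) sessionFactory.getCurrentSession().get(Runner.class, runnerId);
    }
}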

Sunday, January 13, 2008

Hibernate vs JDBC Performance

For my first blog entry ever, I thought I'd look at the performance of Hibernate. I was interested to know how much of an overhead Hibernate adds over straight JDBC, hoping to allay fears about the impact on system performance.

I've run a number of tests (see description and results below) comparing the equivalent code in JDBC and Hibernate. For insertion and update operations, the overhead of Hibernate is not really significant. However, for reads/queries, Hibernate does account for a significant amount of the cost of retrieval. I'm not sure why yet and hope to do some profiling to determine the root cause.

I have to admit that the overhead of using Hibernate was larger than I was expecting (I thought it would be sub-millisecond). Still, for most applications I doubt this will prove a significant issue, with the increased development productivity more than making up for the drop in performance. Also, my PC is relatively dated now - a Pentium 1.73GHz with 1 GB of memory.

The test used a database schema for a very simple running log (yes, I've been known to do the odd bit of exercise). The details of a run (distance, heart rate, etc.) are recorded against a particular runner, giving a simple one-to-many mapping from a Runner entity to a Run entity. I've put the source code here if anybody wants to try and reproduce my findings. I tested using the Derby database, but it would obviously be easy to adapt to a different database.
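
As a rough sketch, the two entities look something like this (the field names are guesses from the description above; the actual mapping in the source may use hbm.xml files rather than annotations):

import java.util.HashSet;
import java.util.Set;
import javax.persistence.CascadeType;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

// Rough sketch of the two entities described above. Field names are
// illustrative and may differ from the real mapping in the linked source.
@Entity
class Runner {
    @Id
    @GeneratedValue
    private Long id;

    private String name;

    @OneToMany(mappedBy = "runner", cascade = CascadeType.ALL)
    private Set<Run> runs = new HashSet<Run>();

    // getters and setters omitted
}

@Entity
class Run {
    @Id
    @GeneratedValue
    private Long id;

    @ManyToOne
    private Runner runner;

    private double distance;
    private int heartRate;

    // getters and setters omitted
}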

Insertions: This test measures the time taken to insert a runner and their associated runs into the database as a single transaction. The graph below shows the insertion time against the number of runs (child entities) inserted per runner.

There really isn't much in it, but the performance of Hibernate does seem to improve as the number of rows inserted increases. I'd speculate that the cost of setting up the Hibernate session becomes proportionally less significant as more work is performed in each session.
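
For reference, one iteration of the insertion test boils down to something like the following (a simplified sketch with invented accessor names; the real code is in the linked source):

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

// Simplified sketch of one iteration of the insertion test. One session and
// one transaction per runner, so the fixed session set-up cost is amortised
// over the number of runs inserted.
public class InsertionTest {

    public void insertRunnerWithRuns(SessionFactory sessionFactory, int runsPerRunner) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();

        Runner runner = new Runner();
        session.save(runner);
        for (int i = 0; i < runsPerRunner; i++) {
            Run run = new Run();
            run.setRunner(runner);   // link the child to its parent
            session.save(run);
        }

        tx.commit();
        session.close();
    }
}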

I have also included the effect of switching on caching (using EhCache), and it can be seen that this, unsurprisingly, does have an impact on performance.
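
The cache configuration isn't shown here, but switching on the second-level cache with EhCache looks roughly like this (a sketch of the Hibernate 3 programmatic form; the same properties can equally live in hibernate.cfg.xml):

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

// Rough sketch of enabling the EhCache second-level cache in a Hibernate 3
// era setup. The cached entities also need a cache concurrency strategy
// declared in their mapping (e.g. <cache usage="read-write"/>).
public class CachedSessionFactoryBuilder {

    public static SessionFactory build() {
        Configuration cfg = new Configuration()
                .configure()   // reads hibernate.cfg.xml from the classpath
                .setProperty("hibernate.cache.use_second_level_cache", "true")
                .setProperty("hibernate.cache.provider_class",
                             "org.hibernate.cache.EhCacheProvider");
        return cfg.buildSessionFactory();
    }
}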

Updates: This test measures the time it takes to update a runner and all their associated runs. The graph below shows the update time against the number of runs updated. There is a bit of a mismatch between the semantics of JDBC and Hibernate here: typically Hibernate would only update the objects that have changed, but I'm forcing it to update all the entities so I can measure the overhead of mapping the objects to the database.
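
One way of forcing every entity to be updated is to touch a field on each attached object before committing, something like this sketch (accessor names invented; the real test code is in the linked source):

import org.hibernate.Session;
import org.hibernate.Transaction;

// One possible way of forcing an UPDATE for every row so the work matches the
// JDBC version. Touching a field on each attached entity makes Hibernate's
// dirty check emit an UPDATE for all of them at flush time.
public class UpdateTest {

    public void updateRunnerAndRuns(Session session, Long runnerId) {
        Transaction tx = session.beginTransaction();

        Runner runner = (Runner) session.get(Runner.class, runnerId);
        runner.setName(runner.getName() + "*");          // mark the parent as dirty
        for (Run run : runner.getRuns()) {
            run.setDistance(run.getDistance() + 1.0);    // mark each child as dirty
        }

        tx.commit();   // the flush issues one UPDATE per modified entity
    }
}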

Queries: This test measures the time it takes to retrieve a runner and all their associated runs.

You can see that when the cache is enabled, the performance is on a par with straight JDBC. I suspect that in a typical production system the straight JDBC performance would be worse than Hibernate with caching, but I happen to be running the database locally so there is little cost in going to the database.

Hibernate's performance for retrieval without a cache is quite bad and I'm at a loss to explain why this should be. I've tried to optimise the performance, using a connection pool with a prepared statement cache, but it still lags the straight JDBC performance. My only explanation is that this is the time it takes to construct the object graph from the database, which the straight JDBC code does not have to do, but it seems much more expensive than I expected.

One useful thing to note is that there is no difference in performance between an HQL query and using load. Hibernate is very good at caching the execution plan for HQL queries, so subsequent queries don't require any further Hibernate parsing/analysis.
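
For reference, the two retrieval styles being compared look like this (a sketch; the entity comes from the schema described above):

import org.hibernate.Session;

// The two retrieval styles compared above, side by side. After the first
// execution the parsed HQL query plan is cached, so later runs skip the
// parsing/analysis step.
public class QueryStyles {

    public Runner viaLoad(Session session, Long runnerId) {
        return (Runner) session.load(Runner.class, runnerId);
    }

    public Runner viaHql(Session session, Long runnerId) {
        return (Runner) session
                .createQuery("from Runner r where r.id = :id")
                .setParameter("id", runnerId)
                .uniqueResult();
    }
}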