Archive for May, 2009

Automatic Twitter

Steve on May 5th 2009

Like many people, I signed up for Twitter and don’t use it.  Although several people I know do, I just couldn’t be arsed enough to care about the sort of guff that people ‘tweet’.  It’s like Facebook status updates.  Anyone I know well enough to care about, I already know what they’re doing!

Having said that though, the amazing WordPress is constantly offering me new things to try.  “WP to Twitter” is a plugin that automatically issues a ‘tweet’ whenever I add a new blog post; there’s nothing else I need to do.  So, if you’d rather keep updated on Twitter you can follow me at www.twitter.com/LongSteve, although personally I prefer RSS!

Filed in General | No responses yet

How Google Works – MapReduce

Steve on May 5th 2009

I recently came across a paper by Jeffrey Dean and Sanjay Ghemawat of Google, MapReduce: Simplified Data Processing on Large Clusters.  It describes, very clearly, a programming model for operating on large data sets using parallel processing.  Specifically, parallel processing on large numbers of ‘commodity machines’, or run-of-the-mill PCs.

Before I joined iomo, I worked for IBM at their Hursley laboratory.  I was a programmer on their transaction processing system CICS, specifically the System/390 mainframe version.  I was very fortunate to join IBM as the Internet revolution was hitting the radar of the very clever technical people who worked there.  I went on to be involved in various technologies that interfaced the world of COBOL and business transactions with the internet, TCP/IP and Java.

IBM System/390

Back in the early ’90s, I imagine that anyone looking to process huge amounts of data would have looked towards IBM and its competitors in ‘big iron’ computing, maybe Amdahl, Fujitsu, Tandem or even Cray.  Amdahl was absorbed by Fujitsu, Tandem were bought by Compaq in 1997, Cray by SGI and then Tera Computer, but IBM went with the times and moved to a services-based business and are still going.  Their mainframe business is still strong from what I can gather too.

Had I not worked at IBM though, I doubt that now I would even consider their machines for large-scale computing.  Distributed systems linked either by the Internet or LANs have become the first place to look when needing massive processing power.  Companies like Amazon are now selling processing time on huge networks of PC-like machines.  Widely distributed systems have become known as ‘Cloud Computing’ and it’s a big buzzword at the moment.

Parallel programming has always been difficult though, even for the most gifted of programmers.  Moving from single-threaded to multi-threaded code is a jump that causes many mistakes and countless bugs.  It’s a whole new world of pain when more than one thread, process or even CPU can access your data at the same time!  A model like MapReduce really makes distributed processing usable by all though.  Here’s the abstract of the aforementioned Google paper:

MapReduce is a programming model and an associated
implementation for processing and generating large
data sets. Users specify a map function that processes a
key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key.
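
To make that a bit more concrete, here is a minimal sketch of the idea in Python, using word counting (the example the paper itself uses).  To be clear, this is my own toy illustration and not Google's implementation: it runs in a single process with no distribution or fault tolerance, and the names map_words, reduce_counts and run_mapreduce are made up for the example.

```python
from collections import defaultdict


def map_words(doc_name, text):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: merge all the values that share one intermediate key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy single-process driver (an assumption for illustration only).

    A real system would spread the map and reduce calls over many machines;
    here we simply group the intermediate values by key in memory.
    """
    groups = defaultdict(list)
    for key, value in inputs.items():
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


# Hypothetical input: document name -> document contents.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)
```

The appeal of the model is that map_words and reduce_counts never share any state, so in principle a framework is free to run them on as many machines as it likes without the programmer having to think about locking.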

I would encourage any programmer to read the paper and understand a little about how Google processes its web-crawled data.  I found it fascinating and really wish I had the time to experiment with programming in MapReduce myself.  When I do get a moment free, I will definitely be taking a look at Hadoop, an open-source, top-level Apache project that implements MapReduce in Java.

Not everyone thinks MapReduce is so great, of course.  Indeed, it might even be a step backwards if you think of it in terms of current database technology.  MapReduce isn’t a database though; it’s a programming model for distributed processing of large sets of data, such as might be trawled from the Internet by web crawlers, or generated by massive information-gathering exercises.  No doubt, as I type, the UK Government is looking keenly at such systems for trawling ISP logs.

Filed in General | No responses yet