Containers all the way through…

In this post I will attempt to cover fundamentals of Bare Metal Systems, Virtual Systems and Container Systems. And the purpose for doing so is to learn about these systems as they stand and also the differences between them, focusing on how they execute programs in their respective environments.

Bare metal systems

Let’s think of our Bare Metal Systems as desktops and laptops we use on a daily basis (or even servers in server rooms and data-centers), and we have the following components:

  • the hardware (outer physical layer)
  • the OS platform (running inside the hardware)
  • the programs running on the OS (as processes)

Programs are stored on the hard drive in the form of executable files (a format understandable by the OS) and loaded into memory via one or more processes. Programs interact with the kernel, which forms a core part of the OS architecture and the hardware. The OS coordinate communication between hardware i.e. CPU, I/O devices, Memory, etc… and the programs.

 

Bare Metal Systems

A more detailed explanation of what programs or executables are, how programs execute and where an Operating System come into play, can be found on this Stackoverflow page [2].

Virtual systems

On the other hand Virtual Systems, with the help of Virtual System controllers like, Virtual Box or VMWare or a hypervisor [1] run an operating system on a bare metal system. These systems emulate bare-metal hardware as software abstraction(s) inside which we run the real OS platform. Such systems can be made up of the following layers, and also referred to as a Virtual Machines (VM):

  • a software abstraction of the hardware (Virtual Machine)
  • the OS platform running inside the software abstraction (guest OS)
  • one or more programs running in the guest OS (processes)

It’s like running a computer (abstracted as software) inside another computer. And the rest of the fundamentals from the Bare Metal System applies to this abstraction layer as well. When a process is created inside the Virtual System, then the host OS which runs the Virtual System might also be spawning one or more processes.

Virtual Systems

Container systems

Now looking at Container Systems we can say the following:

  • they run on top of OS platforms running inside Bare Metal Systems or Virtual Systems
  • containers which allow isolating processes and sharing the kernel between each other (such isolation from other processes and resources are possible in some OSes like say Linux, due to OS kernel features like cgroups[3] and namespaces)[4]

A container creates an OS like environment, inside which one or more programs can be executed. Each of these executions could result in a one or more processes on the host OS. Container Systems are composed of these layers:

  • hardware (accessible via kernel features)
  • the OS platform (shared kernel)
  • one or more programs running inside the container (as processes)

Container Systems

Summary

Looking at these enclosures or rounded rectangles within each other, we can already see how it is containers all the way through.

Bare Metal Systems
Virtual SystemsContainer Systems

There is an increasing number of distinctions between Bare Metal Systems, Virtual Systems and Container Systems. While Virtual Systems encapsulate the Operating System inside a thick hardware virtualisation, Container Systems do something similar but with a much thinner virtualisation layer.

There are a number of pros and cons between these systems when we look at them individually, i.e. portability, performance, resource consumption, time to recreate such systems, maintenance, et al.

Word of thanks and stay in touch

Thank you for your time, feel free to send your queries and comments to @theNeomatrix369. Big thanks to my colleague, and  our DevOps craftsman  Robert Firek from Codurance for proof-reading my post and steering me in the right direction.

Resources

Adopt OpenJDK & Java community: how can you help Java !

Introduction

I want to take the opportunity to show what we have been doing in last year and also what we have done so far as members of the community. Unlike other years I have decided to keep this post less technical compare to the past years and compared to the other posts on Java Advent this year.

inthebeginning

This year marks the fourth year since the first OpenJDK hackday was held in London (supported by LJC and its members) and also when the Adopt OpenJDK program was started. Four years is a small number on the face of 20 years of Java, same goes to the size of the Adopt OpenJDK community which forms a small part of the Java community (9+ million users). Although the post is non-technical in nature, the message herein is fairly important for the future growth and progress of our community and the next generation developers.

Creations of the community

Creations from the community

Over the many months a number of members of our community contributed and passed on their good work to us. In no specific order I have enlisted these picking them from memory. I know there are more to name and you can help us by sharing those with us (we will enlist them here).  So here are some of those that we can talk about and be proud of, and thank those who were involved:

  • Getting Started page – created to enabled two way communication with the members of the community, these include a mailing list, an IRC channel, a weekly newsletter, a twitter handle, among other social media channels and collaboration tools.
  • Adopt OpenJDK project: jitwatch – a great tool created by Chris Newland, its one of its kind, ever growing with features and helping developers fine-tune the performance of your Java/JVM applications running on the JVM.
  • Adopt OpenJDK: GSK – a community effort gathering knowledge and experience from hackday attendees and OpenJDK developers on how to go about with OpenJDK from building it to creating your own version of the JDK. Many JUG members have been involved in the process, and this is now a e-book available in many languages (5 languages + 2 to 3 more languages in progress).
  • Adopt OpenJDK vagrant scripts – a collection of vagrant scripts initially created by John Patrick from the LJC, later improved by the community members by adding more scripts and refactoring existing ones. Theses scripts help build OpenJDK projects in a virtualised container i.e. VirtualBox, making building, and testing OpenJDK and also running and testing Java/JVM applications much easier, reliable and in an isolated environment.
  • Adopt OpenJDK docker scripts – a collection of docker scripts created with the help of the community, this is now also receiving contributions from a number of members like Richard Kolb (SA JUG). Just like the vagrant scripts mentioned above, the docker scripts have similar goals, and need your DevOps foo!
  • Adopt OpenJDK project: mjprof – mjprof is a Monadic jstack analysis tool set. It is a fancy way to say it analyzes jstack output using a series of simple composable building blocks (monads). Many thanks to Haim Yadid for donating it to the community.
  • Adopt OpenJDK project: jcountdown – built by the community that mimics the spirit of ie6countdown.net. That is, to encourage users to move to the latest and greatest Java! Many thanks to all those involved, you can already see from the commit history.
  • Adopt OpenJDK CloudBees Build Farm – thanks to the folks at CloudBees for helping us host our build farm on their CI/CD servers. This one was initially started by Martijn Verburg and later with the help of a number of JUG members have come to the point that major Java projects are built against different versions of the JDK. These projects include building the JDKs themselves (versions 1.7, 1.8, 1.9, Jigsaw and Shenandoah). This project has also helped support the Testing Java Early project and Quality  Outreach program.

These are just a handful of such creations and contributions from the members of the community, some of these projects would certainly need help from you. As a community one more thing we could do well is celebrate our victories and successes, and especially credit those that have been involved whether as individuals or a community. So that our next generation contributors feel inspired and encourage to do more good work and share it with us.

Contributions from the community

contribution_header-700x325In a recent tweet and posts to various Java / JVM and developer mailing lists, I requested the community to come forward and share their contribution stories or those from others with our community. The purpose was two-fold, one to share it with the community and the other to write this post (which in turn is shared with the community). I was happy to see a handful of messages sent to me and the mailing lists by a number of community members. I’ll share some of these with you (in the order I have received them).

 

Sebastian Daschner:

I don’t know if that counts as contribution but I’ve hacked on the
OpenJDK compiler for fun several times. For example I added a new
thought up ‘maybe’ keyword which produces randomly executed code:
https://blog.sebastian-daschner.com/entries/maybe_keyword_in_java

Thomas Modeneis:

Thanks for writing, I like your initiative, its really good to show how people are doing and what they have been focusing on. Great idea.
From my part, I can tell about the DevoxxMA last month, I did a talk on the Hacker Space about the Adopt the OpenJDK and it was really great. We had about 30 or more attendees, it was in a open space so everyone that was going to any talk was passing and being grabbed to have a look about the topic, it was really challenging because I had no mic. but I managed to speak out loud and be listen, and I got great feedback after the session. I’m going to work over the weekend to upload the presentation and the recorded video and I will be posting here as soon as I have it done! :)

Martijn Verburg:

Good initiative.  So the major items I participated in were Date and Time and Lambdas Hackdays (reporting several bugs), submitted some warnings cleanups for OpenJDK.  Gave ~10 pages of feedback for jshell and generally tried to encourage people more capable than me to contribute :-).

Andrii Rodionov:

Olena Syrota and Oleg Tsal-Tsalko from Ukraine JUG: Contributing to JSR 367 test code-base (https://github.com/olegts/jsonb-spec), promoting ‘Adopt a JSR’ and JSON-B spec at JUG UA meetings (http://jug.ua/2015/04/json-binding/) and also at JavaDay Lviv conference (http://www.slideshare.net/olegtsaltsalko9/jsonb-spec).

Contributors

contributorAs you have seen that from out of a community of 9+ million users, only a handful of them came forward to share their stories. While I can point you out to another list of contributors who have been paramount with their contributions to the Adopt OpenJDK GitBook, for example, take a look at the list of contributors and also the committers on the git-repo. They have not just contributed to the book but to Java and the OpenJDK community, especially those who have helped translate the book into multiple languages. And then there are a number of them who haven’t come forward to add their names to the list, even though they have made valuable contributions.

From this I can say contributors can be like unsung heroes, either due their shy or low-profile nature or they just don’t get noticed by us. So it would only be fair to encourage them to come forward or share with the community about their contributions, however simple or small those may be. In addition to the above list I would like to also add a number of them (again apologies if I have missed out your name or not mentioned about you or all your contributions). These names are in no particular order but as they come to my mind as their contributions have been invaluable:

  • Dalibor Topic (OpenJDK Project Lead) & the OpenJDK team
  • Mario Torre & the RedHat OpenJDK team
  • Tori Wieldt (Java Community manager) and her team
  • Heather Vancura & the JCP team
  • NightHacking, vJUG and RebelLabs (and the great people behind them)
  • Nicolaas & the team at Cloudbees
  • Chris Newland (JitWatch developer)
  • Lucy Carey, Ellie & Mark Hazell (Devoxx UK & Voxxed)
  • Richard Kolb (JUG South Africa)
  • Daniel Bryant, Richard Warburton, Ben Evans, and a number of others from LJC
  • Members of SouJava (Otavio, Thomas, Bruno, and others)
  • Members of Bulgarian JUG (Ivan, Martin, Mitri) and neighbours
  • Oti, Ludovic & Patrick Reinhart
  • and a number of other contributors who for some reason I can’t remember…

I have named them for their contributions to the community by helping organise Hackdays during the week and weekends, workshops and hands-on sessions at conferences, giving lightening talks, speaking at conferences, allowing us to host our CI and build farm servers, travelling to different parts of the world holding the Java community flag, writing books, giving Java and advance-level training, giving feedback on new technologies and features, and innumerable other activities that support and push forward the Java / JVM platform.

How you can make a difference ? And why ?

make_a_differenceYou can make a difference by doing something as simple as clicking the like button (on Twitter, LinkedIn, Facebook, etc…) or responding to a message on a mailing list by expressing your opinion about something you see or read about –as to why you think about it that way or how it could be different.

The answer to the question “And why ?” is simple, because you are part of a community and ‘you care’ and want to share your knowledge and experience with others — just like the others above who have spared free moments of their valuable time for us.

Is it hard to do it ? Where to start ? What needs most attention ?

important-checklist The answer is its not hard to do it, if so many have done it, you can do it as well. Where to start and what can you do ? I have written a page on this topic. And its worth reading it before going any further.

There is a dynamic list of topics that is worth considering when thinking of contributing to OpenJDK and Java. But recently I have filtered this list down to a few topics (in order of precedence):

We need you!

With that I would like to close by saying:

i_need_you_duke3

Not just “I”, but we as a community need you

This post have been re-blogged from the Java Advent Calendar 2015 site. Many thanks to its organisers and writers.

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Why not build #OpenJDK 9 using #Docker ? – Part 1 of 2

Introduction

I think I have joined the Docker [1] party a bit late but that means by now everyone knows what Docker is and all the other basic fundamentals which I can very well skip, but if you are still interested, please check these posts Get into Docker – A Guide for Total Newbies [2] and Docker for Total Newbies Part 2: Distribute Your Applications with Docker Images [3]. And if you still want to know more about this widely spoken topic, check out these Docker posts on Voxxed [4].

Why ?

Since everyone has been doing some sort of provisioning or spinning up of dev or pre-prod or test environments using Docker [1] I decided to do the same but with my favourite project i.e. OpenJDK [5].

So far you can natively build OpenJDK [6] across Linux, MacOs and Windows [7], or do the same things via virtual machines or vagrant instances, see more on then via these resources Virtual Machines, [8] Build your own OpenJDK [9] and this vagrant script [10]. All part of the Adopt OpenJDK initiative lead by London Java Community [14] and supported by JUGs all over the world.

Requirements

Most parts of post is for those using Linux distributions (this one was created on Ubuntu 14.04). Linux, MacOS and Windows users please refer to Docker‘s  Linux, MacOS and Windows instructions respectively.

Hints: MacOS and Windows users will need to install Boot2Docker and remember to run the below two commands (and check your Docker host environment variables):

$ boot2docker init
$ boot2docker up 
$ boot2docker shellinit 

For the MacOS, if the above throw FATA[…] error messages, please try the below:

$ sudo boot2docker init
$ sudo boot2docker up 
$ sudo boot2docker shellinit 

For rest of the details please refer to the links provided above. Once you have the above in place for the Windows or MacOS platform, by merely executing the Dockerfile using the docker build and docker run commands you can create / update a container and run it respectively.

*** Please refer to the above links and ensure Docker works for you for the above platforms – try out tutorials or steps proving that Docker run as expected before proceeding further. ***

Building OpenJDK 9 using Docker

Now I will show you how to do the same things as mentioned above using Docker.

So I read the first two resource I shared so far (and wrote the last ones). So lets get started, and I’m going to walk you through what the Dockerfile looks like, as I take you through each section of the Dockerfile code.

*** Please note the steps below are not meant to be executed on your command prompty, they form an integral part of the Dockerfile which you can download from here at the end of this post. ***

You have noticed unlike everyone else I have chosen a different OS image i.e. phusion/baseimage, why? Read YOUR DOCKER IMAGE MIGHT BE BROKEN without you knowing it [11], to learn more about it.

FROM phusion/baseimage:latest

Each of the RUN steps below when executed becomes a Docker layer in isolation and gets assigned a SHA like this i.e. 95e30b7f52b9.

RUN \
  apt-get update && \
  apt-get install -y \
    libxt-dev zip pkg-config libX11-dev libxext-dev \
    libxrender-dev libxtst-dev libasound2-dev libcups2-dev libfreetype6-dev && \
  rm -rf /var/lib/apt/lists/*

The base image is updated and a number of dependencies are installed i.e. Mercurial (hg) and build-essential.

RUN \
  apt-get update && \
  apt-get install -y mercurial ca-certificates-java build-essential

Clone the OpenJDK 9 sources and download the latest sources from mercurial. You will notice that each of these steps are prefixed by this line cd /tmp &&, this is because each instruction is run in its own layer, as if it does not remember where it was when the previous instruction was run. Nothing to worry about, all your changes are still intact in the container.

RUN \
  cd /tmp && \
  hg clone http://hg.openjdk.java.net/jdk9/jdk9 openjdk9 && \
  cd openjdk9 && \
  sh ./get_source.sh

Install only what you need when you need them, see below I downloaded wget and then the jdk binary. I also learnt how to use wget by passing the necessary params and headers to make the server give us the binary we request. Finally un-tar the file using the famous tar command.

RUN \
  apt-get install -y wget && \
  wget --no-check-certificate --header "Cookie: oraclelicense=accept-securebackup-cookie" \ 
http://download.oracle.com/otn-pub/java/jdk/8u45-b14/jdk-8u45-linux-x64.tar.gz

RUN \
  tar zxvf jdk-8u45-linux-x64.tar.gz -C /opt

Run configure with the famous –with-boot-jdk=/opt/jdk1.8.0_45 to set the bootstrap jdk to point to jdk1.8.0_45.

RUN \
  cd /tmp/openjdk9 && \
  bash ./configure --with-cacerts-file=/etc/ssl/certs/java/cacerts --with-boot-jdk=/opt/jdk1.8.0_45

Now run the most important command:

RUN \  
  cd /tmp/openjdk9 && \
  make clean images

Once the build is successful, the artefacts i.e. jdk and jre images are created in the build folder.

RUN \  
  cd /tmp/openjdk9 && \
  cp -a build/linux-x86_64-normal-server-release/images/jdk \
    /opt/openjdk9

Below are some chmod ceremonies across the files and directories in the openjdk9 folder.

RUN \  
  cd /tmp/openjdk9 && \
  find /opt/openjdk9 -type f -exec chmod a+r {} + && \
  find /opt/openjdk9 -type d -exec chmod a+rx {} +

Two environment variable i.e. PATH and JAVA_HOME are created with the respective values assigned to them.

ENV PATH /opt/openjdk9/bin:$PATH
ENV JAVA_HOME /opt/openjdk9

You can find the entire source for the entire Dockerfile on github [12].

…more of this in the next post, Why not build #OpenJDK 9 using #Docker ? – Part 2 of 2, we will use the docker build, docker run commands and some more docker stuff.

(Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

Without any further ado, lets get started with our next set of blogs and videos, chop…chop…! This time its Martin Thompson’s blog posts and talks. Martin’s first post on Java Garbage collection distilled basically distils the GC process and the underlying components including throwing light on a number of interesting GC flags (-XX:…). In his next talk he does his myth busting shaabang about mechanical sympathy, what people correctly believe in and the misconceptions. In the talk on performance testing, Martin takes its further and fuses Java, OS and the hardware to show how understanding of all these aspects can help write better programs.


Java Garbage Collection Distilled by Martin Thompson

There are too many flags to allow tuning the GC to achieve the throughput and latency your application requires. There’s plenty of documentation on the specifics of the bells and whistles around them but none to guide you through them.

(Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

I have been contemplating for a number of months about reviewing a cache of articles and videos on topics like Performance tuning, JVM, GC in Java, Mechanical Sympathy, etc… and finally took the time to do it – may be this was the point in my intellectual progress when was I required to do such a thing!

Thanks to Attila-Mihaly for giving me the opportunity to write a post for his yearly newsletter Java Advent Calendar, hence a review on various Java related topics fits the bill! The selection of videos and articles are purely random, and based on the order in which they came to my knowledge. My hidden agenda is to mainly go through them to understand and broaden my own knowledge at the same time share any insight with others along the way….read more (reblogged from the Java Advent Calendar)

SoCrates UK 2013 – my experiences!

Refactoring TDD habits

It’s about 15:21 on 19th September, a group of us from the London Software Craftsmanship Community gathered to leave for SoCrates UK 2013. We all gathered in the same and luckily found empty seats next to each other.

Amongst a lot of jokes, we suggested lets do a small dojo in the two hours we will be travelling. Wifi is free but terrible, on this train – nevertheless we tethered our phones and enabled Wifi on our laptops to make the most of the bandwidth.

This was also about the time when others sitting around me took notice of the desktop wallpaper, it looked something like this:

13 habits of good TDD programming

13 habits of good TDD programming

One thing lead to another, and before we knew, we were going through the above list of habits – some found the habits difficult to remember, others found duplicates and overlaps. There was this tendency that, this is a big list to remember when doing TDD!

So like developers we said we can “refactor” these habits into something more meaningful, maybe even organise them as per our thinking process.

From this conversation the below list was born:

12 habits to good TDD programming:
  1) Write the assertion first and work backwards
  2) Test should test one thing only
  3) See the test fail
  4) Write the simplest code to pass the test
  5) Refactor to remove duplications
  6) Don’t refactor with failing test(s)
  7) Write meaningful tests
  8) Triangulate
  9) Keep your test and model code separate (except when practising TDD-as-if-you-meant-it)
10) Isolate your tests
11) Organise your tests to reflect model code
12) Maintain your tests

You can notice already that the 13…habits shrunk to 12 habits, and their order changed in comparison to the original.

Then @Frankie mentions the difference between Test-first development and Test driven development, no one claimed to know it, but suggested the below:

Test-first development
– you start the development with tests, and change the tests if goal is not achieved via model (domain) code. – No guidelines about design, flexible approach.

Test driven development
– test drives the model code, you never change the test, only model code to reflect any changes that does not satisfy the tests! Guided by design.
Model code = production code or implementation, domain (better term)

Soon we arrived at Moreton-in-Marsh and our focussed changed to getting a taxi to the venue.

I continued contemplating with the idea of collaborating with other developers and ironing our this list – so it can be helpful at the least.

I met @gonsalo and @sandromancuso and ran the idea by them and they thought it was certainly an idea to present and get others involved to see what the final outcome could be.

Next morning everyone was proposing their presentations at the “Open Session”, and I took the chance of presenting – Refactoring TDD habits… in the Cheltham Loft room in the house called Coach at the Cotwold estate!

30 minutes into the conversation and we already attracted discussions between @sleepyfox and @sandromancuso, we did try to persuade them to avoid tangenting from our core topic of discussion.

@sandromancuso also shared with us one of the rules of simple design – as these have evolved over time, these sorta look like this on the flip chart he scribbled on:

4 rules of simple design

4 rules of simple design (thanks @racheldavies for taking this pic at the event, I have cropped it to fit it in here)

– passes all tests
– minimises duplication
– maximises clarity (clear, expressive, consistent)
– has fewer elements

The idea is that all your actions and practises could use these as guiding light to keep on track.

An hour and few minutes later and we have refactored and cleansed the original 13 habits down to 12 habits – reshuffled and rejuvenated. They may not be perfect but close (Kent Beck: use ‘perfect’ as a verb, not a noun – its a journey not a destination!

12 habits of good TDD programming

I sincerely hope that these list of habits do cover the essence of the principles, values and practises of TDD programming.

Prepare yourself with things you should or should not do, and then perform the Red-Green-Refactor actions to satisfy the TDD process.

The things some of the attendees not like is the way of the first list of habits were worded:

  • definition of the word ‘model’ or ‘domain’
  • use of the term duplication
  • the habits being assigned with numbers
  • use of the term assert – one assert per test or groups of assert per logical test

I also brought up the topic of the difference between Test-first development and Test driven development – and there were disagreements about it amongst the attendees, on the meanings of the definition itself. Please share with me your input!

Back to the final list of habits, @CarlosBle brought up a very valid point that some of the habits on the list might be time-based rather than relevant all the time – they were not linear but appear and disappear from the list depending on the current task in hand. We agreed we would sit together and work it out, but @sleepyfox was ahead of us and kindly drew this flow / state diagram on the flip chart that illustrated the list of habits but in a diagrammatic format:

Flow diagram of TDD habits

State/flow diagram of TDD habits (my apologies if its not clear, sometime down the line one of us could come up with a digital version of the state/flow diagram)

@CarlosBle – please feel free to come up with your time-based / non-linear list and share it with us when you get a chance.

Although we got a lot out of the session with discussions on various topics, we haven’t covered everything and finished discussing everything!

I’m more of the idea, that we could try to look at the above habits as trigger points (practical and pragmatic use) to make us do the right things when we are writing code with the intention of writing good code…clean code…quality code…whatever the terminology be in your environment.

At the end of the day, good practises become habitual only when practised repeatedly, with focus and intent.

Please do provide constructive criticism, if any, such feedback to the above are very welcome as they help improve the quality of the post!

Thanks to @sandromancuso & @sleepfox for helping and participating during my session. Big thanks to Socrates UK – its host, facilitators, organisors and sponsors for making this event happen!

My experience of learning R – from basic graphs to performance tuning

Background

R as some of you may know is a statistical and graphics programming language (see Wikipedia [1]) used by academia and recently by IT professionals of our ever growing software industry. There is a sudden demand for Data Scientists, Data Analysts and Statisticians with a background in R among other things data and development related subjects.

I have been fortunate to work with such a programming language, even though I haven’t had any prior experience working with such a programming language and moreover with Data Scientists. My interest in Mathematics and affinity for numbers drew me to learning it, and with further help of Herve Schnegg our in-house Senior Data Scientist, I was able to pick a fair bit of the subject.

 

R is a mix of a object-oriented programming, Clojure-like functional programming, Javascript-like style of writing code and a Smalltalk-like programming interface. And it offers REPL like many functional programming environments. The fundamental units of the data we manipulate are usually objects like lists, vectors, data-frames, tables, etc…
 
Initial baby-steps
 
I went through a few hours of tutoring by getting an understanding of the R environment, how to install it, and an overview of RStudio and how amazing it is! What fascinates me, is that you can load objects into memory and play with it and when you shutdown your environment your data is not cleared! Rather you can save it (into the .Rdata file) and it retains such information per project!
You are able to remove individual objects from memory, view them, modify them, and reload them from the command-line or by just executing single lines of code in your R script file (they have the obvious extension of .r) in an IDE like RStudio.
R gives developers access to a REPL (stands for Read–eval–print loop [2]) environment and thats how you are able to do the above actions seamlessly! A number of other popular languages have a similar environment i.e. Clojure, Haskell, Python, Ruby, Scala, and Smalltalk, and so forth.

More about R
The order of precedence with regards to declaring a function is important in R, you can’t just call a function unless it has been defined in the package/library you have loaded like:

library([name of library])

or included a resource using the source() function like:source(“./Utils.LoadAndVerify.r”)

or defined the function in the beginning of the script file before referring to it, at a later stage! I had to learn this by the trial-an-error-then-ask-the-experts-around-you method.

 

Contents of any object can be viewed by referring to the object at the REPL CLI, that’s kind of easy!

 

>  someObject <- “contents”
>  someObject [press enter]
[1] “contents” <==== output

 

I discovered another way to view the contents of an object especially when its a list, vector, data-frame, etc…, and is a bit cumbersome to read its output on the console. I learnt that the View() function displays the contents of the object in a tabular form in a separate floating window:

 

> View(table(someList))

 

(the object is displayed in a grid like table in a separate window, which could look like the below)

Plotting graphs from a set of numeric values contained in a list or vector in R is like doing 1..2..3…:

 > counts par(bg = "white");
 > barplot(counts, main="Car Distribution by Gears and VS",
   xlab="Number of Gears", col=c("darkblue","maroon"),
   legend = rownames(counts), beside=TRUE)

And voila, you get a nice simple looking bar graph!

Thanks to a helpful R blogger who has put together some resource for us: Using R to plot data [4].

We can do something more advance by running the below commands:

> x  y  f  z  par(bg = "white");
> persp(x,y,z,zlim=c(0,0.25), theta=50, phi=10);

…and we have the below nice looking 3D mesh (wireframe), from an angle:
Note: the par (bg=”white”) command sets the colour of the canvas for the entirety of your session.

 

Logging
I wrote my own suite of very simple logging functions that log messages to the console depending on the nature of the message, these messages can of course be piped into a text file at run-time.
log.INFO print(paste(date(), "[INFO]", message))
}

log.WARNING print(paste(date(), "[WARNING]", message))
}

log.DEBUG print(paste(date(), "[DEBUG]", message))
}

log.ERROR print(paste(date(), "[ERROR]", message))
}
Of course the above block of code could have been written like this:
MSG_TYPE_INFO <- "[INFO]"
MSG_TYPE_WARNING <- "[WARNING]"
MSG_TYPE_DEBUG <- "[DEBUG]"
MSG_TYPE_ERROR <- "[ERROR]"

log.ANY print(paste(date(), typeOfMessage, message))
}

log.INFO log.ANY(MSG_TYPE_INFO, message)
}

log.WARNING log.ANY(MSG_TYPE_WARNING, message)
}

log.DEBUG log.ANY(MSG_TYPE_DEBUG, message)
}

log.ERROR log.ANY(MSG_TYPE_ERROR, message)
}

As you will know, the way R is, it is wise to have logging functions to hand, to dump values of variables when running scripts. Just because sometimes the error messages thrown by R can be obscure, which has been my finding during my pursuits. Hence I resorted to the above functions and relieved myself from annoyances during exceptions.Later some passed me a link to an R Logging library (an implementation of log4j in R) [7].

What’s up!
At the moment I’m refactoring bits of code I wrote during the last two weeks and still have many blocks of code to go through to find suitable method functions to place them into – our purpose is to make the code more readable, scalable and maintainable.
Just now in the process of replacing the slow and verbose for-loop like constructs with their equivalent xapply() functions. By doing this we will gain in speed and compactness with regards to the lines of code.

 

R gives us a number of MapReduce like functions to play with, here’s a blog [3] that covers the topic on the xapply() functions.
 
Performance measurement and performance tuning
As R is an interpreted language, if you don’t write efficient functions, you could end up waiting a bit longer than expected, before any results are thrown back onto the console. It is not verbose and does not usually tell you what it is upto.
We spent most of our two weeks performing this action as we came across performance bottlenecks in our scripts and could do with using the xapply() like functions. Applying them improved the performance of certain tasks from several hours to a reasonable number of minutes per execution.

 

“Measure, don’t guess.” was the motto!

 

Thanks to the sequence of calls to the proc.time() function, which we used voraciously to measure performances of the different blocks of code we thought needed attention.

 

startTimer <- proc.time()
and
proc.time() – startTimer
 
This paid off at the end of the process as we were able to determine how much time it would take for the script to transform and validate the heaps of data we have been playing with.
At the end of each such iteration we saw the stats in the below format. It got us excited if it was a low number and dejected if it wasn’t to our liking:

 

   user  system elapsed
 87.085   0.694  87.877
 
We tried various methods to bring down the total elapsed time. Some of the things we did even before we came to a final resolution:
 – used for-loop to iterate through a list or vector and perform the same action repeatedly and accumulate results
  – we noticed the for-loop slowed down after a number of iterations and this was a standard pattern. To relieve that we split the for-loop into an inner and outer loop. The outer loop split the inner loop into batches of 40-50 iterations followed by a call to gc() at the end of the iteration. This wasn’t a decent solution from an algorithms or language point of view
finally we settled to refactoring the for-loop into a mapply() which looked like:
result &lt;- unlist(mapply(FUN=transposeColumnAsRow, rangeOfIndices, SIMPLIFY=TRUE))
 
The last action gave us a better grip over the performance and we were confident that if we had to run all the data we had through the script, we would be able to finish transposing it within several hours as opposed to a few days, previously.
Here’s the equation we used to benchmark our functions each time we improved it. It was more to find out for us if we would be able to meet our goals. If the action was acceptable, otherwise we needed to investigate further to find a better method:

 

nm = (ns / nr) * tnr / nsm

 

 nm – no. of minutes it would take to process the whole raw file
 ns – no. of seconds taken to process the batch of records
 nr – total number of records process in the batch
tnr – grand total of the number of records in raw file
nsm – number of seconds in a minute

 

The method we settled for gave us the below results, which was a great benchmark based on processing a sample of 100 records, and when extrapolated on 11200+* records gave the below – which was pretty acceptable at the time:

 

 9.315 / 100 * 11250 / 60 = 17.465 minutes per raw data file

* – each row was made up of 1300+ columns which added to the processing time


We had about 24 files in total to process, which compute to

17.465 * 24 / 60 = 6.986 hours if run one file per session

The tasks of processing each file was split into 3 to 4 sessions processing 1000 records per session.

But it wasn’t as easy as said, we had a number of sessions running doing the above on different pieces of raw data, but never got to committing the data into the database and wondered why? We thought it was hardware/software limitations on our systems. But after further investigations and experimentations found out that no system can handle writing mega-tons of data from memory into the database system without creating giga-tons of swap files. And these swap files are a catch-22, because now the OS needs resources to manage its own resources so our resource requirements would take a back seat!

After a couple discussions, and trials we finally decided to write data back into the database system, in smaller blocks at a time, which means we can still have multiple sessions running in the background and have each one of them write smaller blocks of data into the database.
Everyone is happy as processes can handle smaller blocks much better than bigger blocks – didn’t we already know this, maybe we re-learnt it by facing a bottleneck?   Our script learnt from it as well and got modified to be able to accept and handle processing smaller blocks of data by splitting the processes into smaller batches of records per execution.$ RScript IncrementalLoad.r [filename] [starting record no.] [ending record no.]

The verification script also imitated the same and elected to be run in batches:$ RScript VerifyData.r [filename] [starting record no.] [ending record no.]

Once data had been transformed and written to a database, we wrote a script to validate the data written into the database, we chose Postgres as a trial, and found it was a pretty good database system with an intuitive SQL language.

The verification process was run in the same manner in parallel which took similar amount of time, so at the end of the 7th hour we had both the data written into the database and verified.

 

R can write to such a database system easily. We were further helped with the primary, simple and compound indices that we created to facilitate the searching and selecting processes that our SQL statements would make it do. Postgres also has an efficient caching mechanism, which helps further speed things up.

 

Tweaking the R environment to get efficiency out of it
What I didn’t mention was that before we returned to the R script to tweak it and improve its performance, we thought it was the environment and the way R was, that made it slow – so we wanted to speed up our scripts using the below methods to get the maximum out of the R environment:
  • JIT compiling R scripts – thinking its not slow when interpreted anymore
  • Converting R scripts into C/C++ code and compiling and running it instead
  • Running R scripts using parallel processing (need some library for it)
  • Learning how to use GPUs via R to get that extra performance (need some library for it)
  • Investigating other methods of High Performance Computing in R
We have parked these ideas for now, but it will be a great experience to be able to explore them at a later date.
But once again it was techniques over technology that made our day. Rory Gibson, rightly said “Its not surprising to know how game developers produce some of the best pieces of work under restricted environments”. Such situations are a good nudge to everyone especially developers when faced with performance bottleneck – look at your code not your machine first!

 

At the end of it all, it feels we did what Hadoop or Cloudera would do to our jobs – split, slice, execute, verify, put together and bring back the results at an efficient speed.

 

Hurdles
 
The time I spend learning and applying R, I had to get familiar with its unique or rather say different from other programming language syntax. Like the use of the <- (arrow sign or indirection operator) instead of the usual = (equal to sign). How you point the arrow makes a difference in R, instead of assigning a value to a variable or function you might end up doing something else if you are not careful.

 

You need to define your functions at the top first, otherwise you can’t refer to it. And all entities are case-sensitive, please pay careful attention or else you will only be notified when you least expect it and in the middle of an execution of a block – remember its an interpreted language, no compile time warnings / error messages are available.
There was one more hurdle which put the spanners at work for us – we bumped into an encoding/decoding issue with reading data from the raw file. The ESS plugin [5] in Emacs was reading the data literally at a stage and not evaluating the escape codes. When we switched to RStudio or even the R repl, we immediately became free from the issue – this was also because both I and Herve were using different development environments. He used Emacs to develop in R while I used RStudio. Why this was happening is not known to us, we think there might be a bug in the plugin – at this stage its still a speculation, but more importantly we don’t have to investigate the issue anymore.Our raw file was written using the application called SPSS which writes data in a proprietary format. Such files can be read via a few ways, and using R is one way to achieve that. There is also a Java library [6] that facilitates reading such files, but remains unexplored at this time.
Test driven development in R
 
This is where I have still been hovering around with regards to R, I came across two libraries that enables writing unit tests in R, i.e. RUnit and svUnit. See below in the External Resources section for a number of links I have put together while searching for TDD methodologies in R.
It still needs to be investigated further but a promising start – since test-first driven development is a great way to start working on any piece of problem in any programming language of choice.

 

 

Refactoring
Another action which has been a continuous process since the start. We have applied it to generalise, and make the code base more compact and manageable.

 

Move away common function calls into another .r file and called it into our main script using the source() function. Make our work more maintainable and re-usable – basically keep our code-base clean and tidy.

 

I learnt that refactoring is a continuous effort – its a journey not a destination.

 

What we took away…
I and Herve both took away a lot of learning both technical and non-technical from the whole process  – pair-programming and pair-investigation of a problem space, as the old adage goes “Two minds, are better than one.” Also reveals another reason why pair-programming is encouraged as part of a development process.

 

One important point: we learnt that when we started working on this project, slicing it into simple smaller atomic chunks when solving a problem was effective and efficient, and learnt the hard way. Also when we had a solution to apply to a dataset,  we had already decided to only apply any experimental solution to a smaller subset of the dataset first, verify the results and then scale it incrementally till the dataset was exhausted. Both these working methods came to our rescue and reduced the combination and permutations of trial-and-error!
We both exchanged ideas that we were new to and very well incorporated many of them into our work methods and was able come up with a fine, and a re-usable solution.
I have taken this project further by documenting the work, continuing with refactoring the script files, writing this blog post, and tidying up the project space as a whole.This blog comes about as a documentation of our trial to check the viability of different platforms that could serve us as an ETL (Extract, Transform, Load) – of which we have made good use of RWhether is brilliant at it, or another tools serves betters is debatable. R stands good at what it does, and it does it well – but can be used to do light/medium weight ETL work.So what would be the next tool or platform of choice for our next ETL project. We can’t tell which one is better till we have tried a few and benchmarked them against their pro-s and con-s.


Thanks

Herve Schnegg – for a good partnership during our R session the last couple of weeks, and all the input and learning.

Rory Gibson – for lending his reviewer eyes, for reviewing our R work I and Herve did and also for reviewing this post.

External resources
This blog has also been published on the web’s popular R blogging site: http://www.R-bloggers.com.
During my quest, while learning and applying R, I came across the below links that could come useful to anyone who is interested in furthering their knowledge.
JIT for R
R to CPP
Parallels in R
High Performance Computing using R
GPU programming with R
Test driven development in R
Other useful topics
PSPP

Read more….