How is Java / JVM built ? Adopt OpenJDK is your answer!

Introduction & history
As some of you may already know, starting with Java 7, OpenJDK is the Reference Implementation (RI) to Java. The below time line gives you an idea about the history of OpenJDK:
OpenJDK history (2006 till date)
If you have wondered about the JDK or JRE binaries that you download from vendors like Oracle, Red Hat, etcetera, then the clue is that these all stem from OpenJDK. Each vendor then adds some extra artefacts that are not open source yet due to security, proprietary or other reasons.


What is OpenJDK made of ?
OpenJDK is made up of a number of repositories, namely corba, hotspot, jaxp, jaxws, jdk, langtools, and nashorn. Between OpenjJDK8 and OpenJDK9 there have been no new repositories introduced, but lots of new changes and restructuring, primarily due to Jigsaw – the modularisation of Java itself [2] [3] [4] [5].
repo composition, language breakdown (metrics are estimated)
Recent history
OpenJDK Build Benchmarks – build-infra (Nov 2011) by Fredrik Öhrström, ex-Oracle, OpenJDK hero!

Fredrik Öhrström visited the LJC [16] in November 2011 where he showed us how to build OpenJDK on the three major platforms, and also distributed a four page leaflet with the benchmarks of the various components and how long they took to build. The new build system and the new makefiles are a result of the build system being re-written (build-infra).


Below are screen-shots of the leaflets, a good reference to compare our journey:

How has Java the language and platform built over the years ?

Java is built by bootstrapping an older (previous) version of Java – i.e. Java is built using Java itself as its building block. Where older components are put together to create a new component which in the next phase becomes the building block. A good example of bootstrapping can be found at Scheme from Scratch [6] or even on Wikipedia [7].


OpenJDK8 [8] is compiled and built using JDK7, similarly OpenJDK9 [9] is compiled and built using JDK8. In theory OpenJDK8 can be compiled using the images created from OpenJDK8, similarly for OpenJDK9 using OpenJDK9. Using a process called bootcycle images – a JDK image of OpenJDK is created and then using the same image, OpenJDK is compiled again, which can be accomplished using a make command option:
$ make bootcycle-images       # Build images twice, second time with newly built JDK

make offers a number of options under OpenJDK8 and OpenJDK9, you can build individual components or modules by naming them, i.e.

$ make [component-name] | [module-name]
or even run multiple build processes in parallel, i.e.
$ make JOBS=<n>                 # Run <n> parallel make jobs
Finally install the built artefact using the install option, i.e.
$ make install


Some myths busted
OpenJDK or Hotspot to be more specific isn’t completely written in C/C++, a good part of the code-base is good ‘ole Java (see the composition figure above). So you don’t have to be a hard-core developer to contribute to OpenJDK. Even the underlying C/C++ code code-base isn’t scary or daunting to look at. For example here is an extract of a code snippet from vm/memory/universe.cpp in the HotSpot repo –
.
.
.
Universe::initialize_heap()

if (UseParallelGC) {
#ifndef SERIALGC
Universe::_collectedHeap = new ParallelScavengeHeap();
#else // SERIALGC
fatal("UseParallelGC not supported in this VM.");
#endif // SERIALGC

} else if (UseG1GC) {
#ifndef SERIALGC
G1CollectorPolicy* g1p = new G1CollectorPolicy();
G1CollectedHeap* g1h = new G1CollectedHeap(g1p);
Universe::_collectedHeap = g1h;
#else // SERIALGC
fatal("UseG1GC not supported in java kernel vm.");
#endif // SERIALGC

} else {
GenCollectorPolicy* gc_policy;

if (UseSerialGC) {
gc_policy = new MarkSweepPolicy();
} else if (UseConcMarkSweepGC) {
#ifndef SERIALGC
if (UseAdaptiveSizePolicy) {
gc_policy = new ASConcurrentMarkSweepPolicy();
} else {
gc_policy = new ConcurrentMarkSweepPolicy();
}
#else // SERIALGC
fatal("UseConcMarkSweepGC not supported in this VM.");
#endif // SERIALGC
} else { // default old generation
gc_policy = new MarkSweepPolicy();
}

Universe::_collectedHeap = new GenCollectedHeap(gc_policy);
}
. . .
(please note that the above code snippet might have changed since published here)
The things that appears clear from the above code-block are, we are looking at how pre-compiler notations are used to create Hotspot code that supports a certain type of GC i.e. Serial GC or Parallel GC. Also the type of GC policy is selected in the above code-block when one or more GC switches are toggled i.e. UseAdaptiveSizePolicy when enabled selects the Asynchronous Concurrent Mark and Sweep policy. In case of either Use Serial GC or Use Concurrent Mark Sweep GC are not selected, then the GC policy selected is Mark and Sweep policy. All of this and more is pretty clearly readable and verbose, and not just nicely formatted code that reads like English.


Further commentary can be found in the section called Deep dive Hotspot stuff in the Adopt OpenJDK Intermediate & Advance experiences [11] document.


Steps to build your own JDK or JRE
Earlier we mentioned about JDK and JRE images – these are no longer only available to the big players in the Java world, you and I can build such images very easily. The steps for the process have been simplified, and for a quick start see the Adopt OpenJDK Getting Started Kit [12] and Adopt OpenJDK Intermediate & Advance experiences [11] documents. For detailed version of the same steps, please see the Adopt OpenJDK home page [13]. Basically building a JDK image from the OpenJDK code-base boils down to the below commands:


(setup steps have been made brief and some commands omitted, see links above for exact steps)
 $ hg clone http://hg.openjdk.java.net/jdk8/jdk8 jdk8  (a)...OpenJDK8
or
$ hg clone http://hg.openjdk.java.net/jdk9/jdk9 jdk9  (a)...OpenJDK9
$ ./get_source.sh                                     (b)
$ bash configure                                      (c)
$ make clean images                                   (d)
(setup steps have been made brief and some commands omitted, see links above for exact steps)


To explain what is happening at each of the steps above:
(a) We clone the openjdk mercurial repo just like we would using git clone ….
(b) Once we have step (a) completed, we change into the folder created, and run the get_source.sh command, which is equivalent to a git fetch or a git pull, since the step (a) only brings down base files and not all of the files and folders.
(c) Here we run a script that checks for and creates the configuration needed to do the compile and build process
(d) Once step (c) is success we perform a complete compile, build and create JDK and JRE images from the built artefacts


As you can see these are dead-easy steps to follow to build an artefact or JDK/JRE images [step (a) needs to be run only once].


Benefits
– contribute to the evolution and improvement of the Java the language & platform
– learn about the internals of the language and platform
– learn about the OS platform and other technologies whilst doing the above
– get involved in F/OSS projects
– stay on top the latest changes in the Java / JVM sphere
– knowledge and experience that helps professionally but also these are not readily available from other sources (i.e. books, training, work-experience, university courses, etcetera).
– advancement in career
– personal development (soft skills and networking)


Contribute
Join the Adopt OpenJDK [13] and Betterrev [15] projects and contribute by giving us feedback about everything Java including these projects. Join the Adoption Discuss mailing list [14] and other OpenJDK related mailing lists to start with, these will keep you updated with latest progress and changes to OpenJDK. Fork any of the projects you see and submit changes via pull-requests.


Thanks and support

Adopt OpenJDK [13] and umbrella projects have been supported and progressed with help of JCP [21], the Openjdk team [22], JUGs like London Java Community [16], SouJava [17] and other JUGs in Brazil, a number of JUGs in Europe i.e. BGJUG (Bulgarian JUG) [18], BeJUG (Belgium JUG) [19], Macedonian JUG [20], and a number of other small JUGs. We hope in the coming time more JUGs and individuals would get involved. If you or your JUG wish to participate please get in touch.

—-

Credits
Special thanks to +Martijn Verburg (incepted Adopt OpenJDK),+Richard Warburton, +Oleg Shelajev, +Mite Mitreski, +Kaushik Chaubal and +Julius G for helping improve the content and quality of this post, and sharing their OpenJDK experience with us.

—-

How to get started ?
Join the Adoption Discuss mailing list [14], go to the Adopt OpenJDK home page [13] to get started, followed by referring to the Adopt OpenJDK Getting Started Kit [12] and Adopt OpenJDK Intermediate & Advance experiences [11] documents.


Please share your comments here or tweet at @theNeomatrix369.

Resources
[17] SouJava

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

(Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

In our first review, The Atlassian guide to GC tuning is an extensive post covering the methodology and things to keep in mind when tuning GC, practical examples are given and references to important resources are also made in the process. The next one How NOT to measure latency by Gil Tene, he discusses some common pitfalls encountered in measuring and characterizing latency, demonstrating and discussing some false assumptions and measurement techniques that lead to dramatically incorrect reporting results, and covers simple ways to sanity check and correct these situations.  Finally Kirk Pepperdine in his post Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! throws light on some JVM flags – he starts with some 700 flags and boils it down to merely 7 flags. Also cautions you to not just draw conclusions or to take action in a whim but consult and examine – i.e. measure don’t guess!

….read more (reblogged from the Java Advent Calendar)

(Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

Without any further ado, lets get started with our next set of blogs and videos, chop…chop…! This time its Martin Thompson’s blog posts and talks. Martin’s first post on Java Garbage collection distilled basically distils the GC process and the underlying components including throwing light on a number of interesting GC flags (-XX:…). In his next talk he does his myth busting shaabang about mechanical sympathy, what people correctly believe in and the misconceptions. In the talk on performance testing, Martin takes its further and fuses Java, OS and the hardware to show how understanding of all these aspects can help write better programs.


Java Garbage Collection Distilled by Martin Thompson

There are too many flags to allow tuning the GC to achieve the throughput and latency your application requires. There’s plenty of documentation on the specifics of the bells and whistles around them but none to guide you through them.

Using SonarQube ™ (formerly Sonar ™) installed on the Mac OS X Mountain Lion 10.8.4

Introduction (contd.)

Continuing from where we left in our previous blog post on Installing SonarQube™ (formerly Sonar™) on Mac OS X Mountain Lion 10.8.4 [01], we will cover how to use SonarQube given different situations.

This post might come across a bit more verbose than the previous one, namely with outputs of commands and screenshots illustrating how SonarQube responds to various user actions.

Running SonarQube to analyse projects

We will cover the two ways SonarQube can be used to analyse a project (written in one of the SonarQube supported programming languages [02]) either via maven or through sonar-runner (for non-maven projects) and also the different aspects of SonarQube which help as a static code analysis tool.
via maven
Go to the project folder containing the maven configuration file i.e. pom.xml and run one of the below commands depending on the end goal:

$ mvn clean install sonar:sonar
$ mvn install sonar:sonar
$ mvn sonar:sonar

$ mvn clean sonar:sonar -Dsonar.host.url=http://localhost:nnnn

(where nnnn is an alternative port number where SonarQube is listening)

outputs

Successful analysis of the project via the above commands would lead to the below output onto the console or log files:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29.923s
[INFO] Finished at: Fri Sep 13 18:07:01 BST 2013
[INFO] Final Memory: 62M/247M
[INFO] ------------------------------------------------------------------------
[INFO] [18:07:01.557] Execute org.apache.maven.plugins:maven-surefire-plugin:2.10:test done: 20372 ms
[INFO] [18:07:01.557] Execute maven plugin maven-surefire-plugin done: 20373 ms
.
.
.
[INFO] [18:07:09.526] ANALYSIS SUCCESSFUL, you can browse http://localhost:9000/dashboard/index/com.webapplication:sub-webapp
[INFO] [18:07:09.528] Executing post-job class org.sonar.issuesreport.ReportJob
[INFO] [18:07:09.529] Executing post-job class org.sonar.plugins.core.issue.notification.SendIssueNotificationsPostJob
[INFO] [18:07:09.529] Executing post-job class org.sonar.plugins.core.batch.IndexProjectPostJob
[INFO] [18:07:09.580] Executing post-job class org.sonar.plugins.dbcleaner.ProjectPurgePostJob
[INFO] [18:07:09.590] -&gt; Keep one snapshot per day between 2013-08-16 and 2013-09-12
[INFO] [18:07:09.591] -&gt; Keep one snapshot per week between 2012-09-14 and 2013-08-16
[INFO] [18:07:09.591] [INFO] [18:07:09.614]  Keep one snapshot per month between 2008-09-19 and 2012-09-14
[INFO] [18:07:09.627] -&gt; Delete data prior to: 2008-09-19
[INFO] [18:07:09.629] -&gt; Clean webapp [id=1]
[INFO] [18:07:09.631] [INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 38.345s
[INFO] Finished at: Fri Sep 13 18:07:09 BST 2013
[INFO] Final Memory: 28M/255M
[INFO] ------------------------------------------------------------------------
Here are a couple of links to sample pom.xml files which should help with creating new or amend existing configurations to integrate maven projects with SonarQube (including additional maven CLI switches) i.e. Analyzing with Maven [03] and SonarQube examples on Github [04].

via sonar-runner

Go to the project folder containing the sonar-project.properties configuration file and run the below command:

</div>
<div></div>
<div>$ sonar-runner</div>
<div></div>
<div>

outputs

Successful analysis of the project via the above command would lead to the below output onto the console or log files:
SonarQube Runner 2.3
Java 1.7.0_25 Oracle Corporation (64-bit)
Mac OS X 10.8.5 x86_64
INFO: Runner configuration file: /opt/sonar-runner-2.3/conf/sonar-runner.properties
INFO: Project configuration file: /Users/manisarkar/bn_projects/TimelineJS/sonar-project.properties
INFO: Default locale: &quot;en_US&quot;, source code encoding: &quot;UTF-8&quot;
INFO: Work directory: /Users/manisarkar/bn_projects/TimelineJS/.sonar
INFO: SonarQube Server 3.7
14:11:20.927 INFO - Load batch settings
.
.
.
14:11:38.290 INFO - ANALYSIS SUCCESSFUL, you can browse http://localhost:9000/dashboard/index/TimelineJS
14:11:38.292 INFO - Executing post-job class org.sonar.issuesreport.ReportJob
14:11:38.293 INFO - Executing post-job class org.sonar.plugins.core.issue.notification.SendIssueNotificationsPostJob
14:11:38.314 INFO - Executing post-job class org.sonar.plugins.core.batch.IndexProjectPostJob
14:11:38.356 INFO - Executing post-job class org.sonar.plugins.dbcleaner.ProjectPurgePostJob
14:11:38.365 INFO - -&gt; Keep one snapshot per day between 2013-08-19 and 2013-09-15
14:11:38.365 INFO - -&gt; Keep one snapshot per week between 2012-09-17 and 2013-08-19
14:11:38.365 INFO - -&gt; Keep one snapshot per month between 2008-09-22 and 2012-09-17
14:11:38.365 INFO - -&gt; Delete data prior to: 2008-09-22
14:11:38.368 INFO - -&gt; Clean TimelineJS [id=151]
14:11:38.372 INFO - INFO: ------------------------------------------------------------------------
INFO: EXECUTION SUCCESS
INFO: ------------------------------------------------------------------------
Total time: 19.099s
Final Memory: 14M/502M
INFO: ------------------------------------------------------------------------
Here are a couple of links to sample sonar-project.properties files to assisting in creating new ones i.e. Sonar setup for non-maven Java project [05] and Analyzing with SonarQube Runner [06].

Note: SonarQube Runner expects SonarQube to be running on the designated port, otherwise throws errors like i.e. ERROR: Sonar server http://localhost:9000 can not be reached. This of course can be changed via configuration files (see previous post [01]).

SonarQube Components

Once the build is complete and successful, the new or updated project can be found in the Dashboard. Drilling down into the project would bring up a screen like which is loaded with important metrics and analysis on various aspects of the project:

(above is a screen-shot of a sample application)

The prime important components that is of interest are Quality Index, Complexity Factor, Complexity (bottom left), Test coverage metrics (Unit Tests Coverage and Unit test success). Possibly Security violations. Package Tangle Index and Dependencies to cut, are definitely handy in order to keep clean packages and loosely coupled dependencies. On the same note, LCOM4 (Lack of Cohesion in Methods – lower the value the better)  and Complexity also throws light on how loosely coupled your classes, methods and functions are – it is also measured at the file level and an overall-level giving the full picture. All of these components are good indicators of Software Quality at the least if not Software Craftsmanship – how well is the underlying code written with care for quality in mind! Or it could be seen as-there-is-still-plenty-of-room-for-improvement-and-refactoring.

The Hotspot view now drills further into a few other important aspects of the analysis and highlights the areas that needs more attention or one ore more issue is near its culmination point – either have cross the maximum allowed limit or needs a bit more polishing to meet the requirements.

 

(above is a screenshot of a JDK7 as published on the nemo.sonarqube.org site)

I quite like the below Design component which gives a good break-down of the package dependencies across each other and highlights dependency-cycles. Its one of the more complex things to do on a medium to large project and can usually come in the way of modularisation.


(above is a screenshot of a sample application)

If you ever wanted to know the internal or external libraries a project uses, you might need to look at the contents of your project including the pom.xml file. No longer the case if you are using SonarQube, as Library is such a component that enlists components that your application depends on and is more reliable than searching for it manually.

(above is a screenshot of a sample application)

It is also possible to add any Widget on any Dashboard (Widgets are components that make up a Dashboard), like the one presented below.

(above is a screenshot of a sample application)

Issues Drilldown
Just being told that something is wrong and here’s the score on how much wrong or incorrect something is, does not help. A more constructive feedback is, here’s what’s wrong and this is what you can do to fix it.
The Issues Drilldown is one such dashboard where we can find such information or enough to know what’s wrong and where to go and how to fix it (sometimes). It also archives older and closed issues, and indicates how bad a problem it is by giving it various gradations of severity i.e. Blocker to Info.

(above is a screenshot of a sample application on nemo.sonarsource.org)

The Manage dashboards option at the top right corner of any of the Dashboard pages (as below) is used to create new dashboard pages into which widgets can be placed.
Similarly the Configure widgets link on every Dashboard page allows adding, removing or changing the position of the widgets anywhere on the Dashboard page.

(above is a screenshot of the Apache Commons Collection)

Tag or word Clouds is a very popular concept heavily used as a form of visualisation to convey metrics – as shown above, which is an illustration of the Apache Commons Collection library.

Commercial component – SQALE

SQALE is a proprietary component and not available in the community version, although a demo version is available via SonarQube’s Nemo project [07]. SQALE is a technical debt evaluation tool, more details of it can be found at [08].

(above is a screenshot of Apache Commons Collections)


Settings

Under the hood, this SonarQube instance relies on a number of default or customised configuration settings laid out as below.

(above is a screenshot of a sample application)

Configuration settings to individual components can be accessed and changed via this interface.



Update Centre

Many widgets that are populated in the various Dashboards seen so far can be enabled or disabled from the below page. It is also where updates and upgrades for all the widgets are available, including updates and upgrades for SonarQube itself.


(above is a screenshot of a sample application)

Upgrade process

Check out the upgrade process from [10], see also [11] to learn what should be done before and after the process.

Usually stopping and restarting SonarQube are common steps performed before and after applying an update or upgrade to one or more components or to SonarQube itself.


Conclusion

After assessing these features its clear that this product has a number of advantages over other solutions out there i.e. lots of free plugins, a plug-in based dashboard system, besides it being an open-source project, and a very good one to get started with. Having said that, there might be commercial products out there that have better quality assessment propositions but not necessarily useful unless yours is a big organisation.
Use SonarQube as a tool to create short-feedback loops, and apply improvements to your code base after assessing the rationale for the change suggested. In case the feedback is incorrect or is a false positive or false negative – one option would be to tweak the configuration settings behind the relevant component to see if the issue raised is applicable under a current conditions – basically either turning of the check or not taking the feedback literally.
Disclaimer: please do not take any of these metrics or the contents of the blog literally but rather as another point of view of  the quality of your code-base. Its important to know how to read the data and not read too much into the numbers and flags raised – there may still be false positives or false negatives. Some commercial products have taken this seriously and have invested to reduce the false positives and false negatives.

External resources

The below links have been used during the installation of SonarQube, and have been mentioned throughout the blog.

Notes

The terms Sonar and SonarQube have been used interchangeably in a number of places above. Some of it is due to the referred links not being updated, and others are due to the fact that scripts and program references have continued to be used with their original names to prevent issues with dependencies.

Do not take the settings, paths and file locations, url references, excetra mentioned in this blog literally, in some cases they would need to be adjusted to settings relevant to your environment.

Please note all the external links on this blog may or may not stay actual and is not feasible to be maintained as part of this blog post.

Please feel free to contribute to the above post in the form of constructive comments, useful links, additional information, excetra to improve the quality of the information provided. If something hasn’t worked for you and you have managed to make it work or have a work-around / alternative solution, please do share it with us!

(Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

I have been contemplating for a number of months about reviewing a cache of articles and videos on topics like Performance tuning, JVM, GC in Java, Mechanical Sympathy, etc… and finally took the time to do it – may be this was the point in my intellectual progress when was I required to do such a thing!

Thanks to Attila-Mihaly for giving me the opportunity to write a post for his yearly newsletter Java Advent Calendar, hence a review on various Java related topics fits the bill! The selection of videos and articles are purely random, and based on the order in which they came to my knowledge. My hidden agenda is to mainly go through them to understand and broaden my own knowledge at the same time share any insight with others along the way….read more (reblogged from the Java Advent Calendar)

SoCrates UK 2013 – my experiences!

Refactoring TDD habits

It’s about 15:21 on 19th September, a group of us from the London Software Craftsmanship Community gathered to leave for SoCrates UK 2013. We all gathered in the same and luckily found empty seats next to each other.

Amongst a lot of jokes, we suggested lets do a small dojo in the two hours we will be travelling. Wifi is free but terrible, on this train – nevertheless we tethered our phones and enabled Wifi on our laptops to make the most of the bandwidth.

This was also about the time when others sitting around me took notice of the desktop wallpaper, it looked something like this:

13 habits of good TDD programming

13 habits of good TDD programming

One thing lead to another, and before we knew, we were going through the above list of habits – some found the habits difficult to remember, others found duplicates and overlaps. There was this tendency that, this is a big list to remember when doing TDD!

So like developers we said we can “refactor” these habits into something more meaningful, maybe even organise them as per our thinking process.

From this conversation the below list was born:

12 habits to good TDD programming:
  1) Write the assertion first and work backwards
  2) Test should test one thing only
  3) See the test fail
  4) Write the simplest code to pass the test
  5) Refactor to remove duplications
  6) Don’t refactor with failing test(s)
  7) Write meaningful tests
  8) Triangulate
  9) Keep your test and model code separate (except when practising TDD-as-if-you-meant-it)
10) Isolate your tests
11) Organise your tests to reflect model code
12) Maintain your tests

You can notice already that the 13…habits shrunk to 12 habits, and their order changed in comparison to the original.

Then @Frankie mentions the difference between Test-first development and Test driven development, no one claimed to know it, but suggested the below:

Test-first development
– you start the development with tests, and change the tests if goal is not achieved via model (domain) code. – No guidelines about design, flexible approach.

Test driven development
– test drives the model code, you never change the test, only model code to reflect any changes that does not satisfy the tests! Guided by design.
Model code = production code or implementation, domain (better term)

Soon we arrived at Moreton-in-Marsh and our focussed changed to getting a taxi to the venue.

I continued contemplating with the idea of collaborating with other developers and ironing our this list – so it can be helpful at the least.

I met @gonsalo and @sandromancuso and ran the idea by them and they thought it was certainly an idea to present and get others involved to see what the final outcome could be.

Next morning everyone was proposing their presentations at the “Open Session”, and I took the chance of presenting – Refactoring TDD habits… in the Cheltham Loft room in the house called Coach at the Cotwold estate!

30 minutes into the conversation and we already attracted discussions between @sleepyfox and @sandromancuso, we did try to persuade them to avoid tangenting from our core topic of discussion.

@sandromancuso also shared with us one of the rules of simple design – as these have evolved over time, these sorta look like this on the flip chart he scribbled on:

4 rules of simple design

4 rules of simple design (thanks @racheldavies for taking this pic at the event, I have cropped it to fit it in here)

– passes all tests
– minimises duplication
– maximises clarity (clear, expressive, consistent)
– has fewer elements

The idea is that all your actions and practises could use these as guiding light to keep on track.

An hour and few minutes later and we have refactored and cleansed the original 13 habits down to 12 habits – reshuffled and rejuvenated. They may not be perfect but close (Kent Beck: use ‘perfect’ as a verb, not a noun – its a journey not a destination!

12 habits of good TDD programming

I sincerely hope that these list of habits do cover the essence of the principles, values and practises of TDD programming.

Prepare yourself with things you should or should not do, and then perform the Red-Green-Refactor actions to satisfy the TDD process.

The things some of the attendees not like is the way of the first list of habits were worded:

  • definition of the word ‘model’ or ‘domain’
  • use of the term duplication
  • the habits being assigned with numbers
  • use of the term assert – one assert per test or groups of assert per logical test

I also brought up the topic of the difference between Test-first development and Test driven development – and there were disagreements about it amongst the attendees, on the meanings of the definition itself. Please share with me your input!

Back to the final list of habits, @CarlosBle brought up a very valid point that some of the habits on the list might be time-based rather than relevant all the time – they were not linear but appear and disappear from the list depending on the current task in hand. We agreed we would sit together and work it out, but @sleepyfox was ahead of us and kindly drew this flow / state diagram on the flip chart that illustrated the list of habits but in a diagrammatic format:

Flow diagram of TDD habits

State/flow diagram of TDD habits (my apologies if its not clear, sometime down the line one of us could come up with a digital version of the state/flow diagram)

@CarlosBle – please feel free to come up with your time-based / non-linear list and share it with us when you get a chance.

Although we got a lot out of the session with discussions on various topics, we haven’t covered everything and finished discussing everything!

I’m more of the idea, that we could try to look at the above habits as trigger points (practical and pragmatic use) to make us do the right things when we are writing code with the intention of writing good code…clean code…quality code…whatever the terminology be in your environment.

At the end of the day, good practises become habitual only when practised repeatedly, with focus and intent.

Please do provide constructive criticism, if any, such feedback to the above are very welcome as they help improve the quality of the post!

Thanks to @sandromancuso & @sleepfox for helping and participating during my session. Big thanks to Socrates UK – its host, facilitators, organisors and sponsors for making this event happen!

My experience of learning R – from basic graphs to performance tuning

Background

R as some of you may know is a statistical and graphics programming language (see Wikipedia [1]) used by academia and recently by IT professionals of our ever growing software industry. There is a sudden demand for Data Scientists, Data Analysts and Statisticians with a background in R among other things data and development related subjects.

I have been fortunate to work with such a programming language, even though I haven’t had any prior experience working with such a programming language and moreover with Data Scientists. My interest in Mathematics and affinity for numbers drew me to learning it, and with further help of Herve Schnegg our in-house Senior Data Scientist, I was able to pick a fair bit of the subject.

 

R is a mix of a object-oriented programming, Clojure-like functional programming, Javascript-like style of writing code and a Smalltalk-like programming interface. And it offers REPL like many functional programming environments. The fundamental units of the data we manipulate are usually objects like lists, vectors, data-frames, tables, etc…
 
Initial baby-steps
 
I went through a few hours of tutoring by getting an understanding of the R environment, how to install it, and an overview of RStudio and how amazing it is! What fascinates me, is that you can load objects into memory and play with it and when you shutdown your environment your data is not cleared! Rather you can save it (into the .Rdata file) and it retains such information per project!
You are able to remove individual objects from memory, view them, modify them, and reload them from the command-line or by just executing single lines of code in your R script file (they have the obvious extension of .r) in an IDE like RStudio.
R gives developers access to a REPL (stands for Read–eval–print loop [2]) environment and thats how you are able to do the above actions seamlessly! A number of other popular languages have a similar environment i.e. Clojure, Haskell, Python, Ruby, Scala, and Smalltalk, and so forth.

More about R
The order of precedence with regards to declaring a function is important in R, you can’t just call a function unless it has been defined in the package/library you have loaded like:

library([name of library])

or included a resource using the source() function like:source(“./Utils.LoadAndVerify.r”)

or defined the function in the beginning of the script file before referring to it, at a later stage! I had to learn this by the trial-an-error-then-ask-the-experts-around-you method.

 

Contents of any object can be viewed by referring to the object at the REPL CLI, that’s kind of easy!

 

>  someObject <- “contents”
>  someObject [press enter]
[1] “contents” <==== output

 

I discovered another way to view the contents of an object especially when its a list, vector, data-frame, etc…, and is a bit cumbersome to read its output on the console. I learnt that the View() function displays the contents of the object in a tabular form in a separate floating window:

 

> View(table(someList))

 

(the object is displayed in a grid like table in a separate window, which could look like the below)

Plotting graphs from a set of numeric values contained in a list or vector in R is like doing 1..2..3…:

 > counts par(bg = "white");
 > barplot(counts, main="Car Distribution by Gears and VS",
   xlab="Number of Gears", col=c("darkblue","maroon"),
   legend = rownames(counts), beside=TRUE)

And voila, you get a nice simple looking bar graph!

Thanks to a helpful R blogger who has put together some resource for us: Using R to plot data [4].

We can do something more advance by running the below commands:

> x  y  f  z  par(bg = "white");
> persp(x,y,z,zlim=c(0,0.25), theta=50, phi=10);

…and we have the below nice looking 3D mesh (wireframe), from an angle:
Note: the par (bg=”white”) command sets the colour of the canvas for the entirety of your session.

 

Logging
I wrote my own suite of very simple logging functions that log messages to the console depending on the nature of the message, these messages can of course be piped into a text file at run-time.
log.INFO print(paste(date(), "[INFO]", message))
}

log.WARNING print(paste(date(), "[WARNING]", message))
}

log.DEBUG print(paste(date(), "[DEBUG]", message))
}

log.ERROR print(paste(date(), "[ERROR]", message))
}
Of course the above block of code could have been written like this:
MSG_TYPE_INFO <- "[INFO]"
MSG_TYPE_WARNING <- "[WARNING]"
MSG_TYPE_DEBUG <- "[DEBUG]"
MSG_TYPE_ERROR <- "[ERROR]"

log.ANY print(paste(date(), typeOfMessage, message))
}

log.INFO log.ANY(MSG_TYPE_INFO, message)
}

log.WARNING log.ANY(MSG_TYPE_WARNING, message)
}

log.DEBUG log.ANY(MSG_TYPE_DEBUG, message)
}

log.ERROR log.ANY(MSG_TYPE_ERROR, message)
}

As you will know, the way R is, it is wise to have logging functions to hand, to dump values of variables when running scripts. Just because sometimes the error messages thrown by R can be obscure, which has been my finding during my pursuits. Hence I resorted to the above functions and relieved myself from annoyances during exceptions.Later some passed me a link to an R Logging library (an implementation of log4j in R) [7].

What’s up!
At the moment I’m refactoring bits of code I wrote during the last two weeks and still have many blocks of code to go through to find suitable method functions to place them into – our purpose is to make the code more readable, scalable and maintainable.
Just now in the process of replacing the slow and verbose for-loop like constructs with their equivalent xapply() functions. By doing this we will gain in speed and compactness with regards to the lines of code.

 

R gives us a number of MapReduce like functions to play with, here’s a blog [3] that covers the topic on the xapply() functions.
 
Performance measurement and performance tuning
As R is an interpreted language, if you don’t write efficient functions, you could end up waiting a bit longer than expected, before any results are thrown back onto the console. It is not verbose and does not usually tell you what it is upto.
We spent most of our two weeks performing this action as we came across performance bottlenecks in our scripts and could do with using the xapply() like functions. Applying them improved the performance of certain tasks from several hours to a reasonable number of minutes per execution.

 

“Measure, don’t guess.” was the motto!

 

Thanks to the sequence of calls to the proc.time() function, which we used voraciously to measure performances of the different blocks of code we thought needed attention.

 

startTimer <- proc.time()
and
proc.time() – startTimer
 
This paid off at the end of the process as we were able to determine how much time it would take for the script to transform and validate the heaps of data we have been playing with.
At the end of each such iteration we saw the stats in the below format. It got us excited if it was a low number and dejected if it wasn’t to our liking:

 

   user  system elapsed
 87.085   0.694  87.877
 
We tried various methods to bring down the total elapsed time. Some of the things we did even before we came to a final resolution:
 – used for-loop to iterate through a list or vector and perform the same action repeatedly and accumulate results
  – we noticed the for-loop slowed down after a number of iterations and this was a standard pattern. To relieve that we split the for-loop into an inner and outer loop. The outer loop split the inner loop into batches of 40-50 iterations followed by a call to gc() at the end of the iteration. This wasn’t a decent solution from an algorithms or language point of view
finally we settled to refactoring the for-loop into a mapply() which looked like:
result &lt;- unlist(mapply(FUN=transposeColumnAsRow, rangeOfIndices, SIMPLIFY=TRUE))
 
The last action gave us a better grip over the performance and we were confident that if we had to run all the data we had through the script, we would be able to finish transposing it within several hours as opposed to a few days, previously.
Here’s the equation we used to benchmark our functions each time we improved it. It was more to find out for us if we would be able to meet our goals. If the action was acceptable, otherwise we needed to investigate further to find a better method:

 

nm = (ns / nr) * tnr / nsm

 

 nm – no. of minutes it would take to process the whole raw file
 ns – no. of seconds taken to process the batch of records
 nr – total number of records process in the batch
tnr – grand total of the number of records in raw file
nsm – number of seconds in a minute

 

The method we settled for gave us the below results, which was a great benchmark based on processing a sample of 100 records, and when extrapolated on 11200+* records gave the below – which was pretty acceptable at the time:

 

 9.315 / 100 * 11250 / 60 = 17.465 minutes per raw data file

* – each row was made up of 1300+ columns which added to the processing time


We had about 24 files in total to process, which compute to

17.465 * 24 / 60 = 6.986 hours if run one file per session

The tasks of processing each file was split into 3 to 4 sessions processing 1000 records per session.

But it wasn’t as easy as said, we had a number of sessions running doing the above on different pieces of raw data, but never got to committing the data into the database and wondered why? We thought it was hardware/software limitations on our systems. But after further investigations and experimentations found out that no system can handle writing mega-tons of data from memory into the database system without creating giga-tons of swap files. And these swap files are a catch-22, because now the OS needs resources to manage its own resources so our resource requirements would take a back seat!

After a couple discussions, and trials we finally decided to write data back into the database system, in smaller blocks at a time, which means we can still have multiple sessions running in the background and have each one of them write smaller blocks of data into the database.
Everyone is happy as processes can handle smaller blocks much better than bigger blocks – didn’t we already know this, maybe we re-learnt it by facing a bottleneck?   Our script learnt from it as well and got modified to be able to accept and handle processing smaller blocks of data by splitting the processes into smaller batches of records per execution.$ RScript IncrementalLoad.r [filename] [starting record no.] [ending record no.]

The verification script also imitated the same and elected to be run in batches:$ RScript VerifyData.r [filename] [starting record no.] [ending record no.]

Once data had been transformed and written to a database, we wrote a script to validate the data written into the database, we chose Postgres as a trial, and found it was a pretty good database system with an intuitive SQL language.

The verification process was run in the same manner in parallel which took similar amount of time, so at the end of the 7th hour we had both the data written into the database and verified.

 

R can write to such a database system easily. We were further helped with the primary, simple and compound indices that we created to facilitate the searching and selecting processes that our SQL statements would make it do. Postgres also has an efficient caching mechanism, which helps further speed things up.

 

Tweaking the R environment to get efficiency out of it
What I didn’t mention was that before we returned to the R script to tweak it and improve its performance, we thought it was the environment and the way R was, that made it slow – so we wanted to speed up our scripts using the below methods to get the maximum out of the R environment:
  • JIT compiling R scripts – thinking its not slow when interpreted anymore
  • Converting R scripts into C/C++ code and compiling and running it instead
  • Running R scripts using parallel processing (need some library for it)
  • Learning how to use GPUs via R to get that extra performance (need some library for it)
  • Investigating other methods of High Performance Computing in R
We have parked these ideas for now, but it will be a great experience to be able to explore them at a later date.
But once again it was techniques over technology that made our day. Rory Gibson, rightly said “Its not surprising to know how game developers produce some of the best pieces of work under restricted environments”. Such situations are a good nudge to everyone especially developers when faced with performance bottleneck – look at your code not your machine first!

 

At the end of it all, it feels we did what Hadoop or Cloudera would do to our jobs – split, slice, execute, verify, put together and bring back the results at an efficient speed.

 

Hurdles
 
The time I spend learning and applying R, I had to get familiar with its unique or rather say different from other programming language syntax. Like the use of the <- (arrow sign or indirection operator) instead of the usual = (equal to sign). How you point the arrow makes a difference in R, instead of assigning a value to a variable or function you might end up doing something else if you are not careful.

 

You need to define your functions at the top first, otherwise you can’t refer to it. And all entities are case-sensitive, please pay careful attention or else you will only be notified when you least expect it and in the middle of an execution of a block – remember its an interpreted language, no compile time warnings / error messages are available.
There was one more hurdle which put the spanners at work for us – we bumped into an encoding/decoding issue with reading data from the raw file. The ESS plugin [5] in Emacs was reading the data literally at a stage and not evaluating the escape codes. When we switched to RStudio or even the R repl, we immediately became free from the issue – this was also because both I and Herve were using different development environments. He used Emacs to develop in R while I used RStudio. Why this was happening is not known to us, we think there might be a bug in the plugin – at this stage its still a speculation, but more importantly we don’t have to investigate the issue anymore.Our raw file was written using the application called SPSS which writes data in a proprietary format. Such files can be read via a few ways, and using R is one way to achieve that. There is also a Java library [6] that facilitates reading such files, but remains unexplored at this time.
Test driven development in R
 
This is where I have still been hovering around with regards to R, I came across two libraries that enables writing unit tests in R, i.e. RUnit and svUnit. See below in the External Resources section for a number of links I have put together while searching for TDD methodologies in R.
It still needs to be investigated further but a promising start – since test-first driven development is a great way to start working on any piece of problem in any programming language of choice.

 

 

Refactoring
Another action which has been a continuous process since the start. We have applied it to generalise, and make the code base more compact and manageable.

 

Move away common function calls into another .r file and called it into our main script using the source() function. Make our work more maintainable and re-usable – basically keep our code-base clean and tidy.

 

I learnt that refactoring is a continuous effort – its a journey not a destination.

 

What we took away…
I and Herve both took away a lot of learning both technical and non-technical from the whole process  – pair-programming and pair-investigation of a problem space, as the old adage goes “Two minds, are better than one.” Also reveals another reason why pair-programming is encouraged as part of a development process.

 

One important point: we learnt that when we started working on this project, slicing it into simple smaller atomic chunks when solving a problem was effective and efficient, and learnt the hard way. Also when we had a solution to apply to a dataset,  we had already decided to only apply any experimental solution to a smaller subset of the dataset first, verify the results and then scale it incrementally till the dataset was exhausted. Both these working methods came to our rescue and reduced the combination and permutations of trial-and-error!
We both exchanged ideas that we were new to and very well incorporated many of them into our work methods and was able come up with a fine, and a re-usable solution.
I have taken this project further by documenting the work, continuing with refactoring the script files, writing this blog post, and tidying up the project space as a whole.This blog comes about as a documentation of our trial to check the viability of different platforms that could serve us as an ETL (Extract, Transform, Load) – of which we have made good use of RWhether is brilliant at it, or another tools serves betters is debatable. R stands good at what it does, and it does it well – but can be used to do light/medium weight ETL work.So what would be the next tool or platform of choice for our next ETL project. We can’t tell which one is better till we have tried a few and benchmarked them against their pro-s and con-s.


Thanks

Herve Schnegg – for a good partnership during our R session the last couple of weeks, and all the input and learning.

Rory Gibson – for lending his reviewer eyes, for reviewing our R work I and Herve did and also for reviewing this post.

External resources
This blog has also been published on the web’s popular R blogging site: http://www.R-bloggers.com.
During my quest, while learning and applying R, I came across the below links that could come useful to anyone who is interested in furthering their knowledge.
JIT for R
R to CPP
Parallels in R
High Performance Computing using R
GPU programming with R
Test driven development in R
Other useful topics
PSPP

Read more….