Planet Apache


Piergiorgio Lucidi: Packt Publishing 2000th title released

Mon, 2014-03-24 08:18

Packt Publishing has reached 2000 titles in their catalog and they are offering a free book for every purchased item.

Categories: FLOSS Project Planets

Adrian Sutton: Revert First, Ask Questions Later

Mon, 2014-03-24 06:02

The key to making continuous integration work well is keeping the build green, so the team always knows that if something doesn’t work, it was their change that broke it. However, in any environment with a build pipeline beyond a simple commit build – for example integration, acceptance or deployment tests – sometimes things will break.

When that happens, there is always a natural tendency to commit an additional change that will fix it. That’s almost always a mistake.

The correct approach is to revert first and ask questions later. No matter how small the fix might be, there’s a risk that it will fail or introduce some other problem and extend the period that the tests are broken. Since we know the last build worked correctly, reverting the change is guaranteed to return things to working order. The developers working on the problematic change can then take their time to develop and test the fix, then recommit everything together.

Reverting a commit isn’t a slight against its developers and doesn’t even suggest the changes being made are bad, merely that some detail hasn’t yet been completed and so it’s not yet ready to ship. Having a culture where that’s understood and accepted is an important step in delivering quality software.

Categories: FLOSS Project Planets

Adrian Sutton: Testing@LMAX – Test Results Database

Mon, 2014-03-24 06:01

One of the things we tend to take for granted a bit at LMAX is that we store the results of our acceptance test runs in a database to make them easy to analyse later.  We not only store whether each test passed or not for each revision, but also the failure message if it failed, when it ran, how long it took, which server it ran on and a bunch of other information.

Having this info somewhere that’s easy to query lets us perform some fairly advanced analysis on our test data. For example we can find tests that fail when run after 5pm New York (an important cutoff time in the world of finance) or around midnight (in various timezones). It has also allowed us to identify subtly faulty hardware based on the increased rate of test failures.
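As a sketch of the kind of query this enables, here is the "failures after 5pm New York" filter expressed over an in-memory list of results. All names and the record shape are hypothetical, not LMAX's actual schema:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;

public class TestResultQuery {
    // Hypothetical record of one acceptance-test run.
    static class TestResult {
        final String name;
        final boolean passed;
        final ZonedDateTime ranAt;
        final String server;

        TestResult(String name, boolean passed, ZonedDateTime ranAt, String server) {
            this.name = name;
            this.passed = passed;
            this.ranAt = ranAt;
            this.server = server;
        }
    }

    static final ZoneId NEW_YORK = ZoneId.of("America/New_York");

    // "Which tests fail at or after the 5pm New York cutoff?" as a simple filter.
    static List<String> failuresAfterNyCutoff(List<TestResult> results) {
        List<String> names = new ArrayList<>();
        for (TestResult r : results) {
            if (!r.passed && r.ranAt.withZoneSameInstant(NEW_YORK).getHour() >= 17) {
                names.add(r.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<TestResult> results = List.of(
                new TestResult("settlementTest", false,
                        ZonedDateTime.of(2014, 3, 24, 17, 30, 0, 0, NEW_YORK), "build-01"),
                new TestResult("loginTest", false,
                        ZonedDateTime.of(2014, 3, 24, 9, 0, 0, 0, NEW_YORK), "build-02"),
                new TestResult("orderTest", true,
                        ZonedDateTime.of(2014, 3, 24, 18, 0, 0, 0, NEW_YORK), "build-01"));
        System.out.println(failuresAfterNyCutoff(results)); // prints [settlementTest]
    }
}
```

In a real database this would be a `WHERE` clause over a results table; the point is that once time, outcome and host are stored per run, this class of question becomes a one-liner.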

In our case we have custom software that distributes our acceptance tests across the available hardware, so it records the results directly to the database; however, we have also parsed JUnit XML reports and imported them into the database that way.

However you get the data, having a historical record of test results in a form that’s easy to query is a surprisingly powerful tool and worth the relatively small investment to set up.

Categories: FLOSS Project Planets

Stefane Fermigier: Week-end readings #2

Mon, 2014-03-24 03:10

(A new series of posts dedicated to interesting stuff found on the web.)


Flask-User: an alternative to Flask-Security.


Scribe, a new in browser editor from the Guardian.

Data Modeling

How to model data that changes over time? Some pointers:

Temporal Patterns (Martin Fowler).

Five Tips for Managing Database Version Tables.

Temporal Tables & Asserted Versioning

Example using SQLAlchemy.

DBMS2 revisited: alternative data models
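The common thread in these links is valid-time versioning: instead of updating a row in place, each version of a fact carries the period during which it was true, and queries ask what held "as of" a date. A minimal sketch of that pattern (all names hypothetical):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class TemporalPrices {
    // One version of a price, valid over the half-open interval [validFrom, validTo).
    static class PriceVersion {
        final String sku;
        final long cents;
        final LocalDate validFrom;
        final LocalDate validTo;

        PriceVersion(String sku, long cents, LocalDate validFrom, LocalDate validTo) {
            this.sku = sku;
            this.cents = cents;
            this.validFrom = validFrom;
            this.validTo = validTo;
        }
    }

    private final List<PriceVersion> history = new ArrayList<>();

    // Instead of updating in place, every change appends a new version.
    void add(String sku, long cents, LocalDate from, LocalDate to) {
        history.add(new PriceVersion(sku, cents, from, to));
    }

    // "As-of" query: which price was in effect on a given date?
    Optional<Long> priceAsOf(String sku, LocalDate date) {
        for (PriceVersion v : history) {
            if (v.sku.equals(sku) && !date.isBefore(v.validFrom) && date.isBefore(v.validTo)) {
                return Optional.of(v.cents);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        TemporalPrices prices = new TemporalPrices();
        prices.add("book", 1000, LocalDate.of(2014, 1, 1), LocalDate.of(2014, 3, 1));
        prices.add("book", 1200, LocalDate.of(2014, 3, 1), LocalDate.MAX);
        System.out.println(prices.priceAsOf("book", LocalDate.of(2014, 2, 15))); // Optional[1000]
    }
}
```

The links above cover the harder parts this sketch skips: transaction time as a second dimension, overlap constraints, and how to express the same idea in SQL or SQLAlchemy.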

Categories: FLOSS Project Planets

Andrew Savory: Mastering the mobile app challenge at Adobe Summit

Sun, 2014-03-23 12:00

I’m presenting a 2 hour Building mobile apps with PhoneGap Enterprise lab at Adobe Summit in Salt Lake City on Tuesday, with my awesome colleague John Fait. Here’s a sneak preview of the blurb which will be appearing over on the Adobe Digital Marketing blog tomorrow. I’m posting it here as it may be interesting to the wider Apache Cordova community to see what Adobe are doing with a commercial version of the project…


Mobile apps are the next great challenge for marketing experts. Bruce Lefebvre sets the scene perfectly in So, You Want to Build an App. In his mobile app development and content management with AEM session at Adobe Summit, he’ll show you how Adobe Marketing Cloud solutions are providing amazing capabilities for delivering mobile apps. It’s a must-see session to learn about AEM and PhoneGap.

But what if you want to gain hands-on practical experience of AEM, PhoneGap, and mobile app development? If you want to roll up your sleeves and build a mobile app yourself, then we’ve got an awesome lab planned for you. In “Building mobile apps with PhoneGap Enterprise“, you’ll have the opportunity to create, build, and update a mobile application with Adobe Experience Manager. You’ll see how easy it is to deliver applications across multiple platforms. You’ll also learn how you can easily monitor app engagement through integration with Adobe Analytics and Adobe Mobile Services.

If you want to know how you can deliver more effective apps, leverage your investment in AEM, and bridge the gap between marketers and developers, then you need to attend this lab at Adobe Summit. Join us for this extended deep dive into the integration between AEM and PhoneGap. No previous experience is necessary – you don’t need to know how to code, and you don’t need to know any of the Adobe solutions, as we’ll explain it all as we go along. Some of you will also be able to leave the lab with the mobile app you wrote, so that you can show your friends and colleagues how you’re ready to continuously drive mobile app engagement and ROI, reduce your app time to market, and deliver a unified experience across channels and brands.

Are you ready to master the mobile app challenge?


All hyperbole aside, I think this is going to be a pretty interesting technology space to watch:

  • Being able to build a mobile app from within a CMS is incredibly powerful for certain classes of mobile app. Imagine people having the ability to build mobile apps with an easy drag-and-drop UI that requires no coding. Imagine being able to add workflows (editorial review, approvals etc) to your mobile app development.
  • No matter how much we might wish we didn’t have these app content silos, you can’t argue against the utility of content-based mobile apps, at least until the mobile web matures sufficiently that everyone can build offline mobile websites with ease. Together with over-the-air content updates, it’s really nice to have access to important content even when the network is not available.
  • Analytics and mobile metrics are providing really useful ways to understand how people are using websites and apps. Having all the SDKs embedded in your app automatically with no extra coding required means that everyone can take advantage of these metrics. Hopefully this will lead to a corresponding leap in app quality and usability.
  • By using Apache Cordova, we’re ensuring that these mobile app silos are at least built using open standards and open technologies (HTML, CSS, JS, with temporary native shims). So when mobile web runtimes are mature enough, it will be trivial to switch from front-end app to front-end website without retooling the entire back-end content management workflow.

Exciting times.

Categories: FLOSS Project Planets

Claus Ibsen: Apache Camel 2.13.0 released

Sat, 2014-03-22 02:43
We have just released Apache Camel 2.13.0, which you can download from Apache as well as from Maven Central. This release represents about 6 months of work since Camel 2.12.0 was released.

In this release the community has contributed a number of new components, such as integrations with Splunk, Apache Hadoop 2.x, Infinispan, JGroups, and a few others. We also have a component to integrate with Apache Kafka, but its documentation is still outstanding.

The Splunk component was developed and contributed by Preben Asmussen, who wrote a blog post about it in action.

A number of open source projects have migrated to the ASL2 license, which allows us to provide the Infinispan and JGroups components out of the box - yeah, the ASL2 license rocks!

There is also a new language that leverages JSonPath, which makes routing on JSON payloads easier.

We also improved support for Spring 4.x, so you should be able to use Spring 4 with this release. However, the release is built and tested against Spring 3.2.8. If you have any trouble let us know; we intend to upgrade to Spring 4.x for the next release.

This release comes with the usual hardening, improvements and bug fixes we do for each release.

You can find more information in the release notes, and make sure to read the sections at the bottom of the page when you upgrade.

Categories: FLOSS Project Planets

Justin Mason: Links for 2014-03-21

Fri, 2014-03-21 18:58
  • Microsoft “Scroogles” Itself

    ‘Microsoft went through a blogger’s private Hotmail account in order to trace the identity of a source who allegedly leaked trade secrets.’ Bear in mind that the alleged violation which MS allege allows them to read their email was a breach of the terms of service, which also include distribution of content which ‘incites, advocates, or expresses pornography, obscenity, vulgarity, [or] profanity’. So no dirty jokes on Hotmail!

    (tags: hotmail fail scroogled microsoft stupid tos law privacy data-protection trade-secrets ip)

  • Theresa May warns Yahoo that its move to Dublin is a security worry

    Y! is moving to Dublin to evade GCHQ spying on its users. And what is the UK response?

    “There are concerns in the Home Office about how Ripa will apply to Yahoo once it has moved its headquarters to Dublin,” said a Whitehall source. “The home secretary asked to see officials from Yahoo because in Dublin they don’t have equivalent laws to Ripa. This could particularly affect investigations led by Scotland Yard and the national crime agency. They regard this as a very serious issue.” There’s priorities for you!

    (tags: ripa gchq guardian uk privacy data-protection ireland dublin london spying surveillance yahoo)

  • A Look At Airbnb’s Irish Pub-Inspired Office In Dublin –

    Very nice, Airbnb!

    (tags: airbnb design offices work pubs ireland dublin)

  • Internet Tolls And The Case For Strong Net Neutrality

    Netflix CEO Reed Hastings blogs about the need for Net Neutrality:

    Interestingly, there is one special case where no-fee interconnection is embraced by the big ISPs — when they are connecting among themselves. They argue this is because roughly the same amount of data comes and goes between their networks. But when we ask them if we too would qualify for no-fee interconnect if we changed our service to upload as much data as we download** — thus filling their upstream networks and nearly doubling our total traffic — there is an uncomfortable silence. That’s because the ISP argument isn’t sensible. Big ISPs aren’t paying money to services like online backup that generate more upstream than downstream traffic. Data direction, in other words, has nothing to do with costs. ISPs around the world are investing in high-speed Internet and most already practice strong net neutrality. With strong net neutrality, new services requiring high-speed Internet can emerge and become popular, spurring even more demand for the lucrative high-speed packages ISPs offer. With strong net neutrality, everyone avoids the kind of brinkmanship over blackouts that plague the cable industry and harms consumers. As the Wall Street Journal chart shows, we’re already getting to the brownout stage. Consumers deserve better.

    (tags: consumer net-neutrality comcast netflix protectionism cartels isps us congestion capacity)

  • Micro jitter, busy waiting and binding CPUs

    pinning threads to CPUs to reduce jitter and latency. Lots of graphs and measurements from Peter Lawrey

    (tags: pinning threads performance latency jitter tuning)

  • The Day Today – Pool Supervisor – YouTube

    “in 1979, no-one died. in 1980, some one died. in 1981, no-one died. in 1982, no-one died. … I could go on”

    (tags: the-day-today no-one-died safety pool supervisor tricky-word-puzzles funny humour classic video)

  • The colossal arrogance of Newsweek’s Bitcoin “scoop” | Ars Technica

    Many aspects of the story already look like a caricature of journalism gone awry. The man Goodman fingered as being worth $400 million or more is just as modest as his house suggests. He’s had a stroke and struggles with other health issues. Unemployed since 2001, he strives to take care of basic needs for himself and his 93-year-old mother, according to a reddit post by his brother Arthur Nakamoto (whom Goodman quoted as calling his brother an “asshole”). If Goodman has mystery evidence supporting the Dorian Nakamoto theory, it should have been revealed days ago. Otherwise, Newsweek and Goodman are delaying an inevitable comeuppance and doubling down on past mistakes. Nakamoto’s multiple denials on the record have changed the dynamic of the story. Standing by the story, at this point, is an attack on him and his credibility. The Dorian Nakamoto story is a “Dewey beats Truman” moment for the Internet age, with all of the hubris and none of the humor. It shouldn’t be allowed to end in the mists of “he said, she said.” Whether or not a lawsuit gets filed, Nakamoto v. Newsweek faces an imminent verdict in the court of public opinion: either the man is lying or the magazine is wrong.

    (tags: dorian-nakamoto newsweek journalism bitcoin privacy satoshi-nakamoto)

  • Papa’s Maze | spoon & tamago

    While going through her papa’s old belongings, a young girl discovered something incredible – a mind-bogglingly intricate maze that her father had drawn by hand 30 years ago. While working as a school janitor it had taken him 7 years to produce the piece, only for it to be forgotten about… until now. 34″ x 24″ print, $40

    (tags: mazes art prints weird papas-maze japan)

  • Continuous Delivery with ETL Systems [video]

    Lonely Planet and Dr Foster Intelligence both make heavy use of ETL in their products, and both organisations have applied the principles of Continuous Delivery to their delivery process. Some of the Continuous Delivery norms need to be adapted in the context of ETL, and some interesting patterns emerge, such as running Continuous Integration against data, as well as code.

    (tags: etl video presentations lonely-planet dr-foster-intelligence continuous-delivery deployment pipelines)

  • The MtGox 500

    ‘On March 9th a group posted a data leak, which included the trading history of all MtGox users from April 2011 to November 2013. The graphs below explore the trade behaviors of the 500 highest volume MtGox users from the leaked data set. These are the Bitcoin barons, wealthy speculators, dueling algorithms, greater fools, and many more who took bitcoin to the moon.’

    (tags: dataviz stamen bitcoin data leaks mtgox greater-fools)

  • What We Know 2/5/14: The Mt. Chiliad Mystery

    hats off to Rockstar — GTA V has a great mystery mural with clues dotted throughout the game, and it’s as-yet unsolved

    (tags: mysteries gaming via:hilary_w games gta gta-v rockstar mount-chiliad ufos)

  • Make Your Own 3-D Printer Filament From Old Milk Jugs

    Creating your own 3-D printer filament from old used milk jugs is exponentially cheaper, and uses considerably less energy, than buying new filament, according to new research from Michigan Technological University. [...] The savings are really quite impressive — 99 cents on the dollar, in addition to the reduced use of energy. Interestingly (but again not surprisingly), the amount of energy used to ‘recycle’ the old milk jugs yourself is considerably less than that used in recycling such jugs conventionally.

    (tags: recycling 3d-printers printing tech plastic milk)

Categories: FLOSS Project Planets

Bryan Pendleton: Stuff I'm reading, March Madness edition

Fri, 2014-03-21 18:11

My bracket is a shambles ... Mercer?

But here's some madness of a different sort:

  • Brief History of Latency: Electric Telegraph. A brief look at the history of the electric telegraph - how it came to be, how it was used, and the early problems it encountered. Namely, network congestion, which was caused by routing and queuing delays!
  • Denial of Service Attacks. Over the last year, we have seen a large number and variety of denial of service attacks against various parts of the GitHub infrastructure. There are two broad types of attack that we think about when we're building our mitigation strategy: volumetric and complex.

    We have designed our DDoS mitigation capabilities to allow us to respond to both volumetric and complex attacks.

  • Go Concurrency Patterns: Pipelines and cancellation. Go's concurrency primitives make it easy to construct streaming data pipelines that make efficient use of I/O and multiple CPUs. This article presents examples of such pipelines, highlights subtleties that arise when operations fail, and introduces techniques for dealing with failures cleanly.
  • Marginally Useful. But while it may be wishful thinking to imagine Bitcoin as a true currency, it’s a highly functional and effective technology. Bitcoin’s “block chain protocol” is built atop well-understood, established cryptographic standards and allows perfect certainty about which transactions occurred when.
  • Hack. Hack is a programming language for HHVM that interoperates seamlessly with PHP. Hack reconciles the fast development cycle of PHP with the discipline provided by static typing, while adding many features commonly found in other modern programming languages.

    Hack provides instantaneous type checking via a local server that watches the filesystem. It typically runs in less than 200 milliseconds, making it easy to integrate into your development workflow without introducing a noticeable delay.

  • Facebook Introduces ‘Hack,’ the Programming Language of the Future. Facebook engineers Bryan O’Sullivan, Julien Verlaguet, and Alok Menghrajani spent the last few years building a programming language unlike any other.

    Working alongside a handful of others inside the social networking giant, they fashioned a language that lets programmers build complex websites and other software at great speed while still ensuring that their software code is precisely organized and relatively free of flaws — a combination that few of today’s languages even approach. In typical Facebook fashion, the new language is called Hack, and it already drives almost all of the company’s website — a site that serves more than 1.2 billion people across the globe.

  • Facebook VP of Engineering on Solving Hard Things Early. Building great products is hard enough without allowing the ease of early management to fool founders into thinking they invented something new. Build great management, train new managers, and introduce sustainable, scalable structure now. Do not wait until world-class managers and radical re-orgs are needed to fix poor information flow or productivity problems.
Categories: FLOSS Project Planets

Bryan Pendleton: Skyrim snort for a Friday afternoon

Fri, 2014-03-21 15:23

It's not just a game, it's an obsession: The Cheesing of Lydia.

Categories: FLOSS Project Planets

Edward J. Yoon: A full set of advanced analytics algorithms – let’s explore the Apache Hama examples!

Fri, 2014-03-21 02:51
The Apache Hama examples are packed with realistic, immediately usable advanced analytics algorithms. Let’s take a look.

The examples currently implemented are as follows:

  1. Bipartite-Matching 
  2. GradientDescent
  3. Kmeans
  4. NeuralNetwork
  5. PageRank
  6. SSSP
  7. Semi-Clustering 
At a glance you might think, “a handful of algorithms I vaguely remember from school – so what?” But that reaction usually just means we don’t know how these algorithms are woven into leading web services, or how to put them to use ourselves.
Skimming through: Bipartite Matching is, roughly, taking the preferences of male and female contestants on a matchmaking show and computing the maximum number of matches. It can be used in social network services, image search systems and the like (ah, Semi-Clustering too). Look at the impressive similar-image search in Google Images: behind services like that sit image processing, graph theory and machine learning. Likewise, Gmail’s spam filtering, and its sorting of mail into Primary, Social and so on, can be seen as machine learning at work.
K-Means and Neural Network can be used for cluster analysis and recommendation engines. Graph algorithms such as PageRank, SSSP and K-core can serve as tools for analysing social and complex networks, and beyond that are usefully applicable to bioinformatics and to analysing automotive and shipbuilding components.
To sum up: realistic analytics algorithms you can apply right now to big data analysis in all kinds of industries – shopping malls, information retrieval, image processing, social network services, manufacturing and more. Most big data analysis can probably be done with suitable preprocessing via MapReduce plus the Hama examples alone.
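As a flavour of what these examples compute, here is a tiny power-iteration PageRank in plain Java. Hama runs this kind of computation in bulk-synchronous parallel fashion across a cluster; this sequential sketch only shows the underlying idea:

```java
public class TinyPageRank {
    // links[i] lists the pages that page i links to (no dangling nodes here).
    static double[] pageRank(int[][] links, int iterations, double damping) {
        int n = links.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - damping) / n);
            // Each page distributes its rank evenly over its outgoing links.
            for (int i = 0; i < n; i++) {
                for (int target : links[i]) {
                    next[target] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, 2 -> 0: a simple cycle, so all ranks converge to 1/3.
        int[][] links = { {1}, {2}, {0} };
        double[] rank = pageRank(links, 50, 0.85);
        System.out.printf("%.3f %.3f %.3f%n", rank[0], rank[1], rank[2]); // prints 0.333 0.333 0.333
    }
}
```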

Collection and preprocessing are only intermediate steps – the results always come from the analysis!
Categories: FLOSS Project Planets

Adrian Sutton: Javassist & Java 8 – invalid constant type: 15

Thu, 2014-03-20 23:32

Here’s a fun one: we have some code generated using Javassist that looks a bit like this:

public void doStuff(byte actionId) {
    switch (actionId) {
    case 0: doOneThing(); break;
    case 1: doSomethingElse(); break;
    default:
        throw new IllegalArgumentException("Unknown action: " + actionId);
    }
}

This works perfectly fine on Java 6 and Java 7 but fails on Java 8.  It turns out the problematic part is "Unknown action: " + actionId. When run on Java 8 it throws "invalid constant type: 15" from javassist.bytecode.ConstPool.readOne.

The workaround is simple: just do the conversion from byte to String explicitly:

throw new IllegalArgumentException("Unknown action: " + String.valueOf(actionId));

There may be a more recent version of Javassist which fixes this; I haven’t looked, as the workaround is so trivial (and we only hit this problem in error handling code which should never be triggered anyway).

Categories: FLOSS Project Planets

Justin Mason: Links for 2014-03-20

Thu, 2014-03-20 18:58
Categories: FLOSS Project Planets

Tim Bish: Packt Publishing now has 2000 titles and a great sale.

Thu, 2014-03-20 11:26

Packt is celebrating its 2000th title and has a cool buy-one-get-one-free offer.  This is a great time to add to your personal library and start learning some new tech skills.  It's also a great time to grab a copy of my ActiveMQ book at Packt.

The sale ends 26-Mar-2014, so don't wait too long.
Categories: FLOSS Project Planets

Nick Kew: 2005 revisited

Thu, 2014-03-20 09:43

Yesterday’s budget sounded a note of optimism.  The economy is growing, the deficit is shrinking, and …

… hang on …

… the deficit?  We’re running a still-huge deficit when we’re in the cyclical boom?   Right, straight back to the bubble-economics that got us into trouble in the first place!

2005 was kind-of the opposite.  The economy was slowing, the credit bubble had grown beyond sustainable, house prices were stumbling, and we were staring at recession.  The government of the day spent its way out with a huge dose of Ballsian stimulus: add fuel to the fire, buy a couple more years feelgood at the price of turning that recession into the biggest slump for 30 years.

The chancellor of the day rationalised breaking his own rules by explaining that when he had talked of a balanced budget, he meant over the economic cycle.  So at the bottom of the cycle in 2005 he would spend more, and make it up when the economy recovered.  Ed Balls said much the same even as his idea collapsed in flames.  And now … today’s chancellor has made clear his own commitment to Osbrownomics: run a huge structural deficit and – currently – claim the credit when a cyclical boom takes a few quid off the headline figure.

Hmmm … not really so different to 2005 after all …

On the plus side, the mood music about savings is a real change, and a welcome one.  It will probably work for some time, propped up by “safe haven” status for the global super-rich.  But that of course is another bubble involving prostituting our economy most wantonly!  One day sentiment towards Sterling will change, and then what can survive a round of Weimar inflation in commodities including food and energy?

Categories: FLOSS Project Planets

Norman Maurer: The hidden performance costs of instantiating Throwables

Thu, 2014-03-20 09:33

Today it's time to make you aware of the performance penalty you may pay when using Throwable, Error or Exception, and to give you a better idea of how to avoid it. You may never have thought about it, but using these in the wrong fashion can affect the performance of your applications to a large degree.

Alright, let us start from scratch. You may have heard that you should only use Exception / Throwable / Error for exceptional situations (something that is not the norm and signals unexpected behaviour). This is actually good advice, but even if you follow it (which I really hope you do) there may be situations where you need to throw one.

Throwing a Throwable (or one of its subtypes) is not a big deal. Well, it's not free, but it is still not the main cause of performance issues. The real issue comes up when you create the object itself.


So why is creating a Throwable so expensive? Isn't it just a simple light-weight POJO? Simple yes, but certainly not light-weight!

It's because the constructor will usually call Throwable.fillInStackTrace(), which needs to walk down the stack and record it in the newly created Throwable. This can affect the performance of your application to a large degree if you create a lot of them.

But what to do about this?

There are a few techniques you can use to improve the performance. Let's have a deeper look into them now.

Lazily create a Throwable and reuse it

There are some situations where you would like to use the same Throwable multiple times. In this case you can create it lazily and then reuse it. This way you eliminate a lot of the initial overhead.

To make things clearer, let's have a look at a real-world example. In this example we assume we have a list of pending writes which all failed because the underlying Channel was closed.

The pending writes are represented by the PendingWrite interface as shown below.

public interface PendingWrite {
    void setSuccess();
    void setFailure(Throwable cause);
}

We have a Writer class which will need to fail all PendingWrite instances with a ClosedChannelException. You may be tempted to implement it like this:

public class Writer {
    ....
    private void failPendingWrites(PendingWrite... writes) {
        for (PendingWrite write: writes) {
            write.setFailure(new ClosedChannelException());
        }
    }
}

This works, but if this method is called often, and with a not-too-small array of PendingWrites, you are in serious trouble. It will need to fill in the stacktrace for every PendingWrite you are about to fail!

This is not only very wasteful but also easy to optimize, so let's get to it…

The key is to lazily create the ClosedChannelException and reuse it for each PendingWrite that needs to be failed. And doing so will even result in the correct stacktrace being filled in… jackpot!

So fixing this is as easy as rewriting the failPendingWrites(...) method as shown here:

public class Writer {
    ....
    private void failPendingWrites(PendingWrite... writes) {
        if (writes.length == 0) {
            return;
        }
        ClosedChannelException error = new ClosedChannelException();
        for (PendingWrite write: writes) {
            write.setFailure(error);
        }
    }
}

Notice we create the ClosedChannelException lazily, only if needed (if we have something to fail), and reuse the same instance for all the PendingWrites in the array. This will dramatically cut down the overhead, but you can reduce it even more with a tradeoff, which I will explain next…

Use a static Throwable with no stacktrace at all

Sometimes you may not need a stacktrace at all, as the Throwable itself is enough information about what's going on. In this case, you can just use a static Throwable and reuse it.

What you should remember in this case is to set the stacktrace to an empty array, so that no "wrong" stacktrace shows up and causes a lot of headaches when debugging.

Let us see how this fits into our Writer class:

public class Writer {
    private static final ClosedChannelException CLOSED_CHANNEL_EXCEPTION =
            new ClosedChannelException();

    static {
        CLOSED_CHANNEL_EXCEPTION.setStackTrace(new StackTraceElement[0]);
    }
    ....
    private void failPendingWrites(PendingWrite... writes) {
        for (PendingWrite write: writes) {
            write.setFailure(CLOSED_CHANNEL_EXCEPTION);
        }
    }
}

But where is this useful?

For example, in a network application a closed Channel is not a really exceptional state anyway, so this may be a good fit. In fact, we do something similar in Netty for exactly this case.

Caution: only do this if you are sure you know what you are doing!
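The effect of clearing the trace is easy to verify in isolation. A minimal standalone sketch, unrelated to any Netty class:

```java
public class EmptyTraceDemo {
    public static void main(String[] args) {
        // A normal exception records the full call stack at construction time.
        Exception normal = new Exception("with trace");

        // A shared, pre-built exception with its stacktrace wiped.
        Exception shared = new Exception("no trace");
        shared.setStackTrace(new StackTraceElement[0]);

        System.out.println(normal.getStackTrace().length > 0); // true
        System.out.println(shared.getStackTrace().length);     // 0
    }
}
```

Anyone who catches the shared instance and prints it will see the message but no frames, which is exactly the debugging tradeoff described above.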


Now, with all these claims made, it's time to actually prove them. For this I wrote a microbenchmark using JMH.

You can find the source code of the benchmark in the GitHub repository. As there is no JMH version in any public Maven repository yet, I just bundled a SNAPSHOT version of it in the repository. As this is just a SNAPSHOT it may get out of date at some point in time… Anyway, this is good enough for us to run a benchmark and should be quite simple to update if needed.

This benchmark was run with:

# git clone
# cd jmh-benchmarks
➜ jmh-benchmarks git:(master) ✗ mvn clean package
➜ jmh-benchmarks git:(master) ✗ java -jar target/microbenchmarks.jar -w 10 -wi 3 -i 3 -of csv -o output.csv -odr ".*ThrowableBenchmark.*"

This basically means:

  • Clone the code
  • Build the code
  • Run a warmup for 10 seconds
  • Run warmup 3 times
  • Run each benchmark 3 times
  • Generate output as csv

The benchmark result contains the ops/msec. Each op represents one call of failPendingWrites(...) with an array of 10000 PendingWrites.

Enough said, time to look at the outcome:

As you can see here, creating a new Throwable is by far the slowest way to handle it. Next is to lazily create a Throwable and reuse it for the whole method invocation. The winner is to reuse a static Throwable, with the drawback of not having any stacktrace. So I think it's fair to say that using a lazily created Throwable is the way to go in most cases. If you really need the last 1% of performance you could also make use of the static solution, but you will lose the stacktrace for debugging. So you see, it's always a tradeoff.


You should be aware of how expensive Throwable.fillInStackTrace() is, and so think hard about how and when you create new instances of Throwable. This is also true for subtypes, as those will call the super constructor.

To make it short, nothing is for free, so think about what you are doing before you run into performance problems later. Another good read on this topic is this blog post by John Rose.

Thanks again to Nitsan Wakart and Michael Nitschinger for the review!

Categories: FLOSS Project Planets

Norman Maurer: Lesser known concurrent classes - Atomic*FieldUpdater

Thu, 2014-03-20 09:33

Today I want to talk about one of the lesser known utility classes when it comes to atomic operations in Java. Everyone who has ever done some real work with the java.util.concurrent package should be aware of the Atomic* classes in there, which help you do atomic operations on references, longs, integers, booleans and more.

The classes in question are all located in the java.util.concurrent.atomic package. Like:

  • AtomicBoolean
  • AtomicInteger
  • AtomicReference
  • AtomicLong
  • ….

Using those is as easy as doing something like:

AtomicLong atomic = new AtomicLong(0);
atomic.compareAndSet(0, 1);

So what is the big deal with them? It's about memory usage… Wouldn't it be nice to be able to just use a volatile long, save an object allocation, and as a result use less memory?


This is exactly where the not widely known Atomic*FieldUpdater comes in. These allow you to do "atomic" operations on a volatile field, saving the space needed to hold the object you would otherwise create with something like AtomicLong. This works because the Atomic*FieldUpdater is held in a static field, so there is no need to create a new object every time.

Neat, isn't it?

So to replace the above usage of AtomicLong your code would look like:

private static final AtomicLongFieldUpdater<TheDeclaringClass> ATOMIC_UPDATER =
        AtomicLongFieldUpdater.newUpdater(TheDeclaringClass.class, "atomic");

private volatile long atomic;

public void yourMethod() {
    ATOMIC_UPDATER.compareAndSet(this, 0, 1);
    ...
}

This works with some reflection magic which is used when you create the AtomicLongFieldUpdater instance. The field name passed in as an argument (in this case atomic) is used to look up the declared volatile field, so you must be sure it matches. This is one of the weak points of Atomic*FieldUpdater: there is no way for the compiler to detect that the name matches, so you need to keep an eye on this yourself.
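To make this weakness concrete, here is a small sketch (class and field names are made up for illustration, not from the post): a correct field name works as expected, while a typo compiles fine and only fails at runtime when newUpdater() performs its reflective lookup.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Hypothetical example class; "counter" must match the volatile field below.
public class UpdaterLookup {

    private static final AtomicLongFieldUpdater<UpdaterLookup> COUNTER_UPDATER =
            AtomicLongFieldUpdater.newUpdater(UpdaterLookup.class, "counter");

    private volatile long counter;

    public long increment() {
        return COUNTER_UPDATER.incrementAndGet(this);
    }

    // A wrong field name compiles fine but blows up here at runtime,
    // as a RuntimeException wrapping the reflective lookup failure.
    public static boolean lookupFails(String fieldName) {
        try {
            AtomicLongFieldUpdater.newUpdater(UpdaterLookup.class, fieldName);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        UpdaterLookup instance = new UpdaterLookup();
        System.out.println(instance.increment());   // 1
        System.out.println(lookupFails("conter"));  // true: typo detected only at runtime
    }
}
```

A unit test that calls newUpdater() once per class is a cheap way to catch such typos early.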

You may ask yourself whether it is worth it at all. As always, it depends… If you only create a few thousand instances of the class that uses Atomic*, it may not be worth it. But there are situations where you need to create millions of them and keep them alive for a long time. In those situations it can have a big impact.

In the case of the Netty Project we used AtomicLong and AtomicReference in our Channel, DefaultChannelPipeline and DefaultChannelHandlerContext classes. A new instance of Channel and ChannelPipeline is created for each new connection that is accepted or established, and it is not unusual to have 10 (or more) DefaultChannelHandlerContext objects per DefaultChannelPipeline. For non-blocking servers it is not unusual to handle a large amount of concurrent connections, which in our case was creating many instances of the mentioned classes. Those stayed alive for a long time, as connections may be long-living. One of our users was testing 1M+ concurrent connections and saw a large amount of the heap space taken up by the AtomicLong and AtomicReference instances we were using. By replacing those with Atomic*FieldUpdater we were able to save about 500 MB of memory which, in combination with other changes, reduced the memory footprint by 3 GB.

For more details on the specific enhancements please have a look at those two issues: #920 and #995

One thing to note is that there is no AtomicBooleanFieldUpdater that you could use to replace an AtomicBoolean. This is not a problem: just use AtomicIntegerFieldUpdater with value 0 as false and 1 as true. Problem solved ;)
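As a quick sketch of that trick (the class and field names here are made up for illustration): a volatile int plus AtomicIntegerFieldUpdater gives you AtomicBoolean-style compareAndSet semantics, for example an "only the first caller wins" close flag.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical example: a close-once flag, 0 = open, 1 = closed.
public class OnceFlag {

    private static final AtomicIntegerFieldUpdater<OnceFlag> CLOSED_UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(OnceFlag.class, "closed");

    // Must be volatile and must match the name passed to newUpdater above.
    private volatile int closed;

    /** Returns true only for the first caller that closes the flag. */
    public boolean close() {
        return CLOSED_UPDATER.compareAndSet(this, 0, 1);
    }

    public boolean isClosed() {
        return closed == 1;
    }

    public static void main(String[] args) {
        OnceFlag flag = new OnceFlag();
        System.out.println(flag.close());    // true: first close wins
        System.out.println(flag.close());    // false: already closed
        System.out.println(flag.isClosed()); // true
    }
}
```
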

Gimme some numbers

Now with some theory behind us, let's prove our claim. Let us do a simple test: we create a class which contains 10 AtomicLong and 10 AtomicReference instances, and instantiate it 1M times. This resembles the pattern we saw within Netty.

Let us first have a look at the actual code:

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class AtomicExample {

    final AtomicLong atomic1 = new AtomicLong(0);
    final AtomicLong atomic2 = new AtomicLong(0);
    final AtomicLong atomic3 = new AtomicLong(0);
    final AtomicLong atomic4 = new AtomicLong(0);
    final AtomicLong atomic5 = new AtomicLong(0);
    final AtomicLong atomic6 = new AtomicLong(0);
    final AtomicLong atomic7 = new AtomicLong(0);
    final AtomicLong atomic8 = new AtomicLong(0);
    final AtomicLong atomic9 = new AtomicLong(0);
    final AtomicLong atomic10 = new AtomicLong(0);
    final AtomicReference<String> atomic11 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic12 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic13 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic14 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic15 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic16 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic17 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic18 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic19 = new AtomicReference<String>("String");
    final AtomicReference<String> atomic20 = new AtomicReference<String>("String");

    public static void main(String[] args) throws Exception {
        List<AtomicExample> list = new LinkedList<AtomicExample>();
        for (int i = 0; i < 1000000; i++) {
            list.add(new AtomicExample());
        }
        System.out.println("Created instances 1000000");
    }
}

You may think this is not very often the case in real world applications, but just think about it for a bit. It may not be in one class, but spread over many related classes, all of which are created for each new connection.

Now let us have a look at how much memory is retained by them. For this I used YourKit, but any other tool which can inspect heap dumps should work just fine.

As you can see, the AtomicLong and AtomicReference instances took about 400 MB of memory, while AtomicExample itself takes up 96 MB. This makes a total of ca. 500 MB of memory used by the 1M AtomicExample instances.

Now let's do a second version of this class, but replace AtomicLong with volatile long plus AtomicLongFieldUpdater. Besides this, we also replace AtomicReference with volatile String plus AtomicReferenceFieldUpdater.

The code looks like this now:

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLongFieldUpdater;
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

public class AtomicFieldExample {

    volatile long atomic1 = 0;
    volatile long atomic2 = 0;
    volatile long atomic3 = 0;
    volatile long atomic4 = 0;
    volatile long atomic5 = 0;
    volatile long atomic6 = 0;
    volatile long atomic7 = 0;
    volatile long atomic8 = 0;
    volatile long atomic9 = 0;
    volatile long atomic10 = 0;
    volatile String atomic11 = "String";
    volatile String atomic12 = "String";
    volatile String atomic13 = "String";
    volatile String atomic14 = "String";
    volatile String atomic15 = "String";
    volatile String atomic16 = "String";
    volatile String atomic17 = "String";
    volatile String atomic18 = "String";
    volatile String atomic19 = "String";
    volatile String atomic20 = "String";

    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC1_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic1");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC2_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic2");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC3_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic3");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC4_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic4");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC5_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic5");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC6_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic6");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC7_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic7");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC8_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic8");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC9_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic9");
    static final AtomicLongFieldUpdater<AtomicFieldExample> ATOMIC10_UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicFieldExample.class, "atomic10");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC11_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic11");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC12_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic12");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC13_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic13");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC14_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic14");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC15_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic15");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC16_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic16");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC17_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic17");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC18_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic18");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC19_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic19");
    static final AtomicReferenceFieldUpdater<AtomicFieldExample, String> ATOMIC20_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(AtomicFieldExample.class, String.class, "atomic20");

    public static void main(String[] args) throws Exception {
        List<AtomicFieldExample> list = new LinkedList<AtomicFieldExample>();
        for (int i = 0; i < 1000000; i++) {
            list.add(new AtomicFieldExample());
        }
        System.out.println("Created instances 1000000");
    }
}

As you can see, the code becomes a bit more bloated; hopefully it pays off. Again, let us take a look at the memory usage as before.

As you can see from the screenshot, the memory footprint is a lot smaller. In fact, it now needs no more than ca. 136 MB of memory for the 1M instances of AtomicFieldExample. This is a nice improvement compared to the baseline memory footprint. Now think about how much memory you can save if you have a few cases where you can replace Atomic* classes with volatile and Atomic*FieldUpdater in classes that are instantiated a lot.

You may ask yourself why an AtomicFieldExample instance is larger than an AtomicExample instance. This is caused by the extra memory needed to store the references + longs. AtomicFieldExample has 10 longs + 10 references. This gives us:

  • 10 * 8 bytes (for the longs)
  • 10 * 4 bytes (for the references)
  • 1 * 16 bytes (for itself)

AtomicExample has 20 references. This gives us:

  • 20 * 4 bytes (for the references)
  • 1 * 16 bytes (for itself)

So it only pays off because we save the extra memory overhead of the AtomicLong and AtomicReference objects themselves. To put it straight: every Object has a fixed overhead.

Besides the memory savings, there are some other nice effects here that we did not mention before:

  • Because we allocate fewer Objects, the Garbage Collector has less work to do, as it needs to keep track of every Object.
  • We save the tax of the built-in monitor which comes as part of each Object.

To summarize, it may pay off to replace Atomic* objects with the corresponding volatile + Atomic*FieldUpdater. How much memory you save varies depending on what you replace, but the savings can be huge, especially when we talk about small "Objects".

Let us do the maths again:

  • AtomicLong = 24 bytes + 4 bytes (for the reference to it) = 28 bytes
  • volatile long = 8 bytes

This gives us a saving of 20 bytes per field!


Special thanks go out to Nitsan Wakart, Michael Nitschinger and Benoît Sigoure for the review and feedback.

Categories: FLOSS Project Planets

Justin Mason: Links for 2014-03-19

Wed, 2014-03-19 18:58
  • No, Nate, brogrammers may not be macho, but that’s not all there is to it

    Great essay on sexism in tech, “brogrammer” culture, “clubhouse chemistry”, outsiders, weird nerds and exclusion:

    Every group, including the excluded and disadvantaged, create cultural capital and behave in ways that simultaneously create a sense of belonging for them in their existing social circle while also potentially denying them entry into another one, often at the expense of economic capital. It’s easy to see that wearing baggy, sagging pants to a job interview, or having large and visible tattoos in a corporate setting, might limit someone’s access. These are some of the markers of belonging used in social groups that are often denied opportunities. By embracing these markers, members of the group create real barriers to acceptance outside their circle even as they deepen their peer relationships. The group chooses to adopt values that are rejected by the society that’s rejecting them. And that’s what happens to “weird nerd” men as well—they create ways of being that allow for internal bonding against a largely exclusionary backdrop. (via Bryan O’Sullivan)

    (tags: nerds outsiders exclusion society nate-silver brogrammers sexism racism tech culture silicon-valley essays via:bos31337)

  • Impact of large primitive arrays (BLOBS) on JVM Garbage Collection

    some nice graphs and data on CMS performance, with/without -XX:ParGCCardsPerStrideChunk

    (tags: cms java jvm performance optimization tuning off-heap-storage memory)

  • Anatomical Collages by Travis Bedel

    these are fantastic

    (tags: collage anatomy art prints)

  • htcat/htcat

    a utility to perform parallel, pipelined execution of a single HTTP GET. htcat is intended for the purpose of incantations like: htcat | tar -zx It is tuned (and only really useful) for faster interconnects: [....] 109MB/s on a gigabit network, between an AWS EC2 instance and S3. This represents 91% use of the theoretical maximum of gigabit (119.2 MiB/s).

    (tags: go cli http file-transfer ops tools)

Categories: FLOSS Project Planets

Bryan Pendleton: Stuff I'm reading, midweek edition

Wed, 2014-03-19 12:49

The weather's great; the snowpack isn't. Meanwhile, I read.

  • Leslie Lamport Receives Turing Award

    “Leslie’s schemes for dealing with faults was a major area of investigation in distributed computing,” Levin says. “His work was fundamental there. Then it led to work on agreement protocols, a key part of the notion of getting processes to converge on a common answer. This is what has come to be known as Paxos.”

    Incidentally, that work was independently invented at about the same time by Barbara Liskov, Turing Award winner in 2008, and her student Brian Oki.

    “Lamport’s paper [The Part-Time Parliament],” Levin adds, “explained things through the use of a mythical Greek island and its legislative body. Partly because he chose this metaphor, it was not appreciated for quite some time what was really going on there.”

    Lamport has a less diplomatic assessment.

    “With the success of Byzantine Generals, I thought, ‘We need a story.’ I created a story, and that was an utter disaster.”

  • Surprise! Science!

    Chao-Lin Kuo surprises Andrei Linde and his wife with the news that gravitational waves were detected, proving Linde's theory of an inflationary universe.
  • Ernest Cline is the luckiest geek alive

    Cline spent eight years writing Ready Player One before it hit the presses in August 2011. Set in 2044, Cline’s book prophesied a future where everybody’s always plugged in to The Oasis, a virtual world hosting the Earth’s jobs, education, and secrets. Each Oasis user hooks in using a variety of haptic and visual inputs to simulate and stimulate the senses in a virtual world. Ready Player One is indebted to hundreds of other books, games, and films like The Matrix, and it wears them on its shirt-sleeve. Cline’s most original creation yet moved away from Star Wars into a world of his own, a world made out of his favorite things.
  • Why Dorian Nakamoto Probably Isn’t Satoshi

    To me, one of the weakest points in Newsweek’s argument is their assertion that Dorian had the skills and background to create Bitcoin. All they really have as evidence is that Dorian trained as a physicist, worked as an engineer, and is reputed to be very intelligent. But none of that indicates that Dorian understood cryptography or distributed algorithms well enough to devise Bitcoin and write the original Bitcoin paper.
  • Missed Alarms and 40 Million Stolen Credit Card Numbers: How Target Blew It

    On Saturday, Nov. 30, the hackers had set their traps and had just one thing to do before starting the attack: plan the data’s escape route. As they uploaded exfiltration malware to move stolen credit card numbers—first to staging points spread around the U.S. to cover their tracks, then into their computers in Russia—FireEye spotted them. Bangalore got an alert and flagged the security team in Minneapolis. And then …

    Nothing happened.

    For some reason, Minneapolis didn’t react to the sirens.

  • Worse

    This isn’t just an Amazon problem. In the last few years, Google, Apple, Amazon, Facebook, and Twitter have all made huge attempts to move into major parts of each others’ businesses, usually at the detriment of their customers or users.
  • managers are awesome / managers are cool when they’re part of your team

    This bothered me a bit when I heard it last summer, and it’s gotten increasingly more uncomfortable since. I’ve been paraphrasing this part of the talk as “management is a form of workplace violence,” and the still-evolving story of Julie Ann Horvath suggests that the removal of one form of workplace violence has resulted in the reintroduction of another, much worse form.
  • What Your Culture Really Says

    Culture is not about the furniture in your office. It is not about how much time you have to spend on feel-good projects. It is not about catered food, expensive social outings, internal chat tools, your ability to travel all over the world, or your never-ending self-congratulation.

    Culture is about power dynamics, unspoken priorities and beliefs, mythologies, conflicts, enforcement of social norms, creation of in/out groups and distribution of wealth and control inside companies. Culture is usually ugly. It is as much about the inevitable brokenness and dysfunction of teams as it is about their accomplishments. Culture is exceedingly difficult to talk about honestly. The critique of startup culture that came in large part from the agile movement has been replaced by sanitized, pompous, dishonest slogans.

  • Welcome to the Thirsty West

    Don’t get me wrong: This has always been an extreme environment. But over the thousands of years of human civilization in this corner of the world, people have adapted to little water. Problem is: The water supply/demand calculus has never changed this quickly before. About 100 years ago, the balance started to tip. Groundwater was invested for agricultural purposes. Massive civil engineering projects pulled more water from rivers. The human presence in the desert blossomed.
Categories: FLOSS Project Planets

Jean-Baptiste Onofré: Hadoop CDC and processes notification with Apache Falcon, Apache ActiveMQ, and Apache Camel

Wed, 2014-03-19 11:49

Some weeks (months? ;)) ago, I started to work on Apache Falcon. First of all, I would like to thank all the Falcon guys: they are really awesome and do a great job (special thanks to Srikanth, Venkatesh, Swetha).

This blog post is a preparation for a set of “recipes documentation” that I will propose in Apache Falcon.

Falcon is in incubation at Apache. The purpose is to provide a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.

An interesting feature provided by Falcon is notification of the activities in the Hadoop cluster “outside” of the cluster.
In this article, we will see how to get two kinds of notification in Camel routes “outside” of the Hadoop cluster:

  • a Camel route will be notified and triggered when a process is executed in the Hadoop cluster
  • a Camel route will be notified and triggered when an HDFS location changes (a first CDC feature)
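To give an idea of the consuming side, a Camel route reading Falcon's notifications can be as simple as the following sketch (Spring DSL). Note that the endpoint details are assumptions to adapt to your setup: FALCON.ENTITY.TOPIC is Falcon's default notification topic name, and the activemq component has to be configured against your broker.

```xml
<!-- Sketch of a Camel route consuming Falcon notifications from ActiveMQ;
     topic name and component configuration are assumptions. -->
<route id="falcon-notifications">
  <from uri="activemq:topic:FALCON.ENTITY.TOPIC"/>
  <to uri="log:falcon?showHeaders=true"/>
</route>
```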

If you already have your Hadoop cluster, or you know how to install/prepare it, you can skip this step.

In this section, I will create a “real fake” Hadoop cluster on one machine. It's not really a pseudo-distributed setup, as I will use multiple datanodes and tasktrackers, but all on one machine (of course, it doesn't make sense, but it's just for demo purposes ;)).

In addition to the common Hadoop components (HDFS namenode/datanodes and M/R jobtracker/tasktracker), Falcon requires Oozie (for scheduling) and ActiveMQ.

By default, Falcon embeds ActiveMQ, but for the demo (and provide screenshots to the ActiveMQ WebConsole), I will use a standalone ActiveMQ instance.

Hadoop “fake” cluster preparation

For the demo, I will “fake” three machines.

I create a demo folder on my machine, and I uncompress the hadoop-1.1.2-bin.tar.gz tarball into node1, node2, and node3 folders:

$ mkdir demo
$ cd demo
$ tar zxvf ~/hadoop-1.1.2-bin.tar.gz
$ cp -r hadoop-1.1.2 node1
$ cp -r hadoop-1.1.2 node2
$ cp -r hadoop-1.1.2 node3
$ mkdir storage

I also create a storage folder where I will put the nodes’ files. This folder is just for convenience, as it makes it easier to restart from scratch, just by deleting the storage folder content.

Node1 will host:

  • the HDFS namenode
  • a HDFS datanode
  • the M/R jobtracker
  • a M/R tasktracker

So, the node1/conf/core-site.xml file contains the location of the namenode:

<?xml version="1.0"> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node1 conf/core-site.xml --> <configuration> <property> <name></name> <value>hdfs://localhost</value> </property> </configuration>

In the node1/conf/hdfs-site.xml file, we define the storage location for the namenode and the datanode (in the storage folder), and the default replication:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node1 conf/hdfs-site.xml --> <configuration> <property> <name></name> <value>/home/jbonofre/demo/storage/node1/namenode</value> </property> <property> <name></name> <value>/home/jbonofre/demo/storage/node1/datanode</value> </property> <property> <name>dfs.replication</name> <value>3</value> </property> </configuration>

Finally, in node1/conf/mapred-site.xml file, we define the location of the job tracker:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node1 conf/mapred-site.xml --> <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:8021</value> </property> </configuration>

Node1 is now ready.

Node2 hosts a datanode and a tasktracker. As for node1, the node2/conf/core-site.xml file contains the location of the namenode:

<?xml version="1.0"> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node2 conf/core-site.xml --> <configuration> <property> <name></name> <value>hdfs://localhost</value> </property> </configuration>

The node2/conf/hdfs-site.xml file contains:

  • the storage location of the datanode
  • the network location of the namenode (from node1)
  • the port numbers used by the datanode (core, IPC, and HTTP in order to be able to start multiple datanodes on the same machine)
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsd"?> <!-- node2 conf/hdfs-site.xml --> <configuration> <property> <name></name> <value>/home/jbonofre/demo/storage/node2/datanode</value> </property> <property> <name>dfs.datanode.address</name> <value>localhost:50110</value> </property> <property> <name>dfs.datanode.ipc.address</name> <value>localhost:50120</value> </property> <property> <name>dfs.datanode.http.address</name> <value>localhost:50175</value> </property> </configuration>

The node2/conf/mapred-site.xml file contains the network location of the jobtracker, and the HTTP port number used by the tasktracker (in order to be able to run multiple tasktracker on the same machine):

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node2 conf/mapred-site.xml --> <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:8021</value> </property> <property> <name>mapred.task.tracker.http.address</name> <value>localhost:50160</value> </property> </configuration>

Node3 is very similar to node2: it hosts a datanode and a tasktracker. So the configuration is very similar to node2 (just the storage location, and the datanode and tasktracker port numbers are different).

Here’s the node3/conf/core-site.xml file:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node3 conf/core-site.xml --> <configuration> <property> <name></name> <value>hdfs://localhost</value> </property> </configuration>

Here’s the node3/conf/hdfs-site.xml:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node3 conf/hdfs-site.xml --> <configuration> <property> <name></name> <value>/home/jbonofre/demo/storage/node3/datanode</value> </property> <property> <name>dfs.datanode.address</name> <value>localhost:50210</value> </property> <property> <name>dfs.datanode.ipc.address</name> <value>localhost:50220</value> </property> <property> <name>dfs.datanode.http.address</name> <value>localhost:50275</value> </property> </configuration>

Here’s the node3/conf/mapred-site.xml file:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node2 conf/mapred-site.xml --> <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:8021</value> </property> <property> <name>mapred.task.tracker.http.address</name> <value>localhost:50260</value> </property> </configuration>

Our “fake” cluster configuration is now ready.

We can format the namenode on node1:

$ cd node1/bin
$ ./hadoop namenode -format
14/03/06 17:26:38 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = vostro/
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.2
STARTUP_MSG:   build = -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
14/03/06 17:26:39 INFO util.GSet: VM type = 64-bit
14/03/06 17:26:39 INFO util.GSet: 2% max memory = 17.78 MB
14/03/06 17:26:39 INFO util.GSet: capacity = 2^21 = 2097152 entries
14/03/06 17:26:39 INFO util.GSet: recommended=2097152, actual=2097152
14/03/06 17:26:39 INFO namenode.FSNamesystem: fsOwner=jbonofre
14/03/06 17:26:39 INFO namenode.FSNamesystem: supergroup=supergroup
14/03/06 17:26:39 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/03/06 17:26:39 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/03/06 17:26:39 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/03/06 17:26:39 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/03/06 17:26:40 INFO common.Storage: Image file of size 114 saved in 0 seconds.
14/03/06 17:26:40 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/jbonofre/demo/storage/node1/namenode/current/edits
14/03/06 17:26:40 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/jbonofre/demo/storage/node1/namenode/current/edits
14/03/06 17:26:40 INFO common.Storage: Storage directory /home/jbonofre/demo/storage/node1/namenode has been successfully formatted.
14/03/06 17:26:40 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at vostro/
************************************************************/

We are now ready to start the namenode on node1:

$ cd node1/bin
$ ./hadoop namenode &

We start the datanode on node1:

$ cd node1/bin
$ ./hadoop datanode &

We start the jobtracker on node1:

$ cd node1/bin
$ ./hadoop jobtracker &

We start the tasktracker on node1:

$ cd node1/bin
$ ./hadoop tasktracker &

Node1 is fully started with the namenode, a datanode, the jobtracker, and a tasktracker.

We start a datanode and a tasktracker on node2:

$ cd node2/bin
$ ./hadoop datanode &
$ ./hadoop tasktracker &

And finally, we start a datanode and a tasktracker on node3:

$ cd node3/bin
$ ./hadoop datanode &
$ ./hadoop tasktracker &

We access the HDFS web console (http://localhost:50070) to verify that the namenode is able to see the 3 live datanodes:

We also access the MapReduce web console (http://localhost:50030) to verify that the jobtracker is able to see the 3 live tasktrackers:


Falcon delegates scheduling of jobs (planning, re-execution, etc.) to Oozie.

Oozie is a workflow scheduler system to manage Hadoop jobs, using Quartz internally.

Falcon uses a “custom” Oozie distribution: it adds some additional EL extensions on top of a “regular” Oozie.

Falcon provides a script to create the Falcon custom Oozie distribution: we provide the Hadoop and Oozie versions that we need.

We can clone the Falcon sources from git and call the src/bin/ script with the target Hadoop and Oozie versions that we want:

$ git clone falcon
$ cd falcon
$ src/bin/ 1.1.2 4.0.0

The script creates target/oozie-4.0.0-distro.tar.gz in the Falcon sources folder.

In the demo folder, I uncompress the oozie-4.0.0-distro.tar.gz tarball:

$ cp ~/oozie-4.0.0-distro.tar.gz .
$ tar zxvf oozie-4.0.0-distro.tar.gz

We now have an oozie-4.0.0-falcon folder.

Oozie requires a special configuration on the namenode (so on node1). We have to update the node1/conf/core-site.xml file to define the system user “proxied” by Oozie:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- node1 conf/core-site.xml --> <configuration> <property> <name></name> <value>hdfs://localhost</value> </property> <property> <name>hadoop.proxyuser.jbonofre.hosts</name> <value>localhost</value> </property> <property> <name>hadoop.proxyuser.jbonofre.groups</name> <value>localhost</value> </property> </configuration>

NB: don’t forget to restart the namenode to include these changes.

Now we can prepare the Oozie web application. Due to license restrictions, it's up to you to add the ExtJS library for the Oozie web console. To enable it, we first create an oozie-4.0.0-falcon/libext folder and put the ExtJS archive in it:

$ cd oozie-4.0.0-falcon
$ mkdir libext
$ cd libext
$ wget ""

We have to populate the libext folder with different additional jar files:

  • the Hadoop jar files:

    $ cp node1/hadoop-core-1.1.2.jar oozie-4.0.0-falcon/libext
    $ cp node1/hadoop-client-1.1.2.jar oozie-4.0.0-falcon/libext
    $ cp node1/hadoop-tools-1.1.2.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-beanutils-1.7.0.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-beanutils-core-1.8.0.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-codec-1.4.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-collections-3.2.1.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-configuration-1.6.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-digester-1.8.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-el-1.0.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-io-2.1.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-lang-2.4.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-logging-api-1.0.4.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-logging-1.1.1.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-math-2.1.jar oozie-4.0.0-falcon/libext
    $ cp node1/lib/commons-net-3.1.jar oozie-4.0.0-falcon/libext
  • the Falcon Oozie extender:

    $ cp falcon-0.5-incubating-SNAPSHOT/oozie/libext/falcon-oozie-el-extension-0.5-incubating-SNAPSHOT.jar oozie-4.0.0-falcon/libext
  • the jar files required for HCatalog and Pig, from oozie-sharelib-4.0.0-falcon.tar.gz:

    $ cd oozie-4.0.0-falcon
    $ tar zxvf oozie-sharelib-4.0.0-falcon.tar.gz
    $ cp share/lib/hcatalog/hcatalog-core-0.5.0-incubating.jar libext
    $ cp share/lib/hcatalog/hive-* libext
    $ cp share/lib/pig/hsqldb- libext
    $ cp share/lib/pig/jackson-* libext
    $ cp share/lib/hcatalog/libfb303-0.7.0.jar libext
    $ cp share/lib/hive/log4j-1.2.16.jar libext
    $ cp libtools/oro-2.0.8.jar libext
    $ cp share/lib/hcatalog/webhcat-java-client-0.5.0-incubating.jar libext
    $ cp share/lib/pig/xmlenc-0.52.jar libext
    $ cp share/lib/pig/guava-11.0.2.jar libext
    $ cp share/lib/hcatalog/oozie-* libext

We are now ready to setup Oozie.

First, we “assemble” the Oozie web application (war):

$ cd oozie-4.0.0-falcon/bin
$ ./ prepare-war
setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
INFO: Adding extension: /home/jbonofre/demo/oozie-4.0.0-falcon/libext/ant-1.6.5.jar
New Oozie WAR file with added 'ExtJS library, JARs' at /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server/webapps/oozie.war
INFO: Oozie is ready to be started

Now, we “upload” the Oozie shared libraries to our HDFS, including the Falcon shared lib:

$ cd oozie-4.0.0/bin
$ ./ sharelib create -fs hdfs://localhost -locallib ../oozie-sharelib-4.0.0-falcon.tar.gz
setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
the destination path for sharelib is: /user/jbonofre/share/lib

If we browse the HDFS, we can see the folders created by Oozie.

Finally, we create the Oozie database (where Oozie stores job definitions, etc.):

$ cd oozie-4.0.0-falcon/bin
$ ./ db create -run
setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
Validate DB Connection
DONE
Check DB schema does not exist
DONE
Check OOZIE_SYS table does not exist
DONE
Create SQL schema
DONE
Create OOZIE_SYS table
DONE
Oozie DB has been created for Oozie version '4.0.0'
The SQL commands have been written to: /tmp/ooziedb-4527318150729236810.sql

The Oozie configuration is done; we start it:

$ cd oozie-4.0.0-falcon/bin
$ ./ start
Setting OOZIE_HOME: /home/jbonofre/demo/oozie-4.0.0-falcon
Setting OOZIE_CONFIG: /home/jbonofre/demo/oozie-4.0.0-falcon/conf
Sourcing: /home/jbonofre/demo/oozie-4.0.0-falcon/conf/
setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
Setting OOZIE_CONFIG_FILE: oozie-site.xml
Setting OOZIE_DATA: /home/jbonofre/demo/oozie-4.0.0-falcon/data
Setting OOZIE_LOG: /home/jbonofre/demo/oozie-4.0.0-falcon/logs
Setting OOZIE_LOG4J_FILE:
Setting OOZIE_LOG4J_RELOAD: 10
Setting OOZIE_HTTP_HOSTNAME:
Setting OOZIE_HTTP_PORT: 11000
Setting OOZIE_ADMIN_PORT: 11001
Setting OOZIE_HTTPS_PORT: 11443
Setting OOZIE_BASE_URL:
Setting CATALINA_BASE: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server
Setting OOZIE_HTTPS_KEYSTORE_FILE: /home/jbonofre/.keystore
Setting OOZIE_HTTPS_KEYSTORE_PASS: password
Setting CATALINA_OUT: /home/jbonofre/demo/oozie-4.0.0-falcon/logs/catalina.out
Setting CATALINA_PID: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server/temp/
Using CATALINA_OPTS: -Xmx1024m
Adding to CATALINA_OPTS: -Doozie.home.dir=/home/jbonofre/demo/oozie-4.0.0-falcon -Doozie.config.dir=/home/jbonofre/demo/oozie-4.0.0-falcon/conf -Doozie.log.dir=/home/jbonofre/demo/oozie-4.0.0-falcon/logs -Doozie.config.file=oozie-site.xml -Doozie.log4j.reload=10 -Doozie.admin.port=11001 -Doozie.http.port=11000 -Doozie.https.port=11443 -Doozie.base.url= -Doozie.https.keystore.file=/home/jbonofre/.keystore -Doozie.https.keystore.pass=password -Djava.library.path=
Using CATALINA_BASE: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server
Using CATALINA_HOME: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server
Using CATALINA_TMPDIR: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server/temp
Using JRE_HOME: /opt/jdk/1.7.0_51
Using CLASSPATH: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server/bin/bootstrap.jar
Using CATALINA_PID: /home/jbonofre/demo/oozie-4.0.0-falcon/oozie-server/temp/

We can now access the Oozie web console at http://localhost:11000/oozie/:


By default, Falcon embeds ActiveMQ, so generally speaking you don’t have to install ActiveMQ. However, for the demo, I would like to show how to use an external, standalone ActiveMQ.

I uncompress the apache-activemq-5.7.0-bin.tar.gz tarball in the demo folder:

$ cd demo
$ tar zxvf ~/apache-activemq-5.7.0-bin.tar.gz

The default ActiveMQ configuration is fine; we can just start the broker on the default port (61616):

$ cd demo/apache-activemq-5.7.0/bin
$ ./activemq console

All the Falcon prerequisites are now in place.

Falcon installation

Falcon can be deployed:

  • standalone: the “regular” deployment mode, used when you have only one Hadoop cluster. This is the mode I will use for this CDC demo.
  • distributed: the deployment mode to use when you have multiple Hadoop clusters, especially if you want to use the Falcon replication feature.

For the installation, we uncompress the falcon-0.5-incubating-SNAPSHOT-bin.tar.gz tarball in the demo folder:

$ cd demo
$ tar zxvf ~/falcon-0.5-incubating-SNAPSHOT-bin.tar.gz

Before starting Falcon, we disable the default embedded ActiveMQ broker in the conf/ file:

# conf/
...
export FALCON_OPTS="-Dfalcon.embeddedmq=false"
...

We start the falcon server:

$ cd falcon-0.5-incubating-SNAPSHOT/bin
$ ./falcon-start
Could not find installed hadoop and HADOOP_HOME is not set.
Using the default jars bundled in /home/jbonofre/demo/falcon-0.5-incubating-SNAPSHOT/hadooplibs/
/home/jbonofre/demo/falcon-0.5-incubating-SNAPSHOT/bin
falcon started using hadoop version: Hadoop 1.1.2

The Falcon server actually starts a Jetty container with Jersey to expose the Falcon REST API.

You can check that the Falcon server started correctly using bin/falcon-status or bin/falcon:

$ bin/falcon-status
Falcon server is running (on http://localhost:15000/)
$ bin/falcon admin -status
Falcon server is running (on http://localhost:15000/)
$ bin/falcon admin -version
Falcon server build version: {"properties":[{"key":"Version","value":"0.5-incubating-SNAPSHOT-r5445e109bc7fbfea9295f3411a994485b65d1477"},{"key":"Mode","value":"embedded"}]}

Falcon usage: the entities

In Falcon, the configuration is defined by entities. Falcon supports three types of entities:

  • cluster entity defines the hadoop cluster (location of the namenode, location of the jobtracker), related falcon module (Oozie, ActiveMQ), and the location of the Falcon working directories (on HDFS)
  • feed entity defines a location on HDFS
  • process entity defines a hadoop job scheduled by Oozie

An entity is described using XML. You can do different actions on an entity:

  • Submit: register an entity in Falcon. A submitted entity is not scheduled; it simply sits in the configuration store of Falcon.
  • List: provide the list of all entities registered in the configuration store of Falcon.
  • Dependency: provide the dependencies of an entity. For example, a feed would show the processes that depend on it and the clusters it depends on.
  • Schedule: feeds or processes that are already submitted and present in the configuration store can be scheduled. Upon schedule, Falcon system wraps the required repeatable action as a bundle of oozie coordinators and executes them on the Oozie scheduler.
  • Suspend: this action is applicable only to a scheduled entity. It triggers a suspend on the Oozie bundle that was scheduled earlier through the schedule action. No further instances are executed on a suspended process/feed.
  • Resume: put a suspended process/feed back to active, which in turn resumes applicable oozie bundle.
  • Status: to display the current status of an entity.
  • Definition: dump the entity definition from the configuration store.
  • Delete: remove an entity from the Falcon configuration store.
  • Update: update operation allows an already submitted/scheduled entity to be updated. Cluster update is currently not allowed. Feed update can cause cascading update to all the processes already scheduled. The following set of actions are performed in Oozie to realize an update.
    • Suspend the previously scheduled Oozie coordinator. This prevents any new action from being triggered.
    • Update the coordinator to set the end time to “now”
    • Resume the suspended coordinators
    • Schedule as per the new process/feed definition with the start time as “now”
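The update sequence above can be sketched as simple pseudo-logic. This is only an illustration of the four steps, not Falcon’s internal code; the coordinator is modelled as a plain dictionary:

```python
from datetime import datetime, timezone

# Illustrative sketch of the update sequence described above
# (not Falcon internals): suspend the old coordinator, close it
# at "now", resume it, then schedule the new definition from "now".

def update_entity(old_coord, new_definition, schedule):
    now = datetime.now(timezone.utc)
    old_coord["status"] = "SUSPENDED"      # 1. suspend the previously scheduled coordinator
    old_coord["end"] = now                 # 2. set its end time to "now"
    old_coord["status"] = "RUNNING"        # 3. resume it (it stops at its new end time)
    return schedule(new_definition, now)   # 4. schedule the new definition starting "now"

old = {"status": "RUNNING", "end": None}
new = update_entity(old, "process-v2", lambda defn, start: {"definition": defn, "start": start})
print(old["end"] is not None, new["definition"])  # True process-v2
```

The net effect is that no instance is lost: the old definition covers everything up to the update, and the new definition takes over from that point.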

The cluster entity defines the configuration of the hadoop cluster and components used by Falcon.

We will store the entity descriptors in the entity folder:

$ mkdir entity

For the cluster, we create entity/local.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<cluster colo="local" description="Local cluster" name="local" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hftp://localhost:50010" version="1.1.2"/>
    <interface type="write" endpoint="hdfs://localhost:8020" version="1.1.2"/>
    <interface type="execute" endpoint="localhost:8021" version="1.1.2"/>
    <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://localhost:61616" version="5.7.0"/>
  </interfaces>
  <locations>
    <location name="staging" path="/falcon/staging"/>
    <location name="temp" path="/falcon/temp"/>
    <location name="working" path="/falcon/working"/>
  </locations>
  <properties></properties>
</cluster>

A cluster contains different interfaces and locations used by Falcon. A cluster is referenced by feeds and processes entities (using the cluster name). A cluster can’t be scheduled (it doesn’t make sense).

The colo specifies a kind of cluster grouping. It’s used in the distributed deployment mode, so it is not useful in our demo (as we have only one cluster).
The readonly interface specifies Hadoop’s HFTP protocol; it is only used for feed replication between clusters (again, not used in our demo).
The write interface specifies the write access to HDFS (the value). Falcon uses this interface to write system data to HDFS, and feeds referencing this cluster are written to HDFS using this interface.
The execute interface specifies the location of the jobtracker (the mapred.job.tracker value). Falcon uses this interface to submit the processes as jobs to the jobtracker defined here.
The workflow interface specifies the interface of the workflow engine (the Oozie URL). Falcon uses this interface to schedule the processes referencing this cluster on the workflow engine defined here.
Optionally, you can have a registry interface (defining a Thrift URL) to specify a metadata catalog, such as the Hive Metastore (HCatalog). We don’t use it in our demo.
The messaging interface specifies the interface for sending feed availability messages: it’s the URL of the ActiveMQ broker.

A cluster has a list of locations, each with a name (working, temp, staging) and a path on HDFS. Falcon uses these locations for intermediate processing of entities on HDFS, so Falcon should have read/write/execute permissions on them.

Optionally, a cluster may have a list of properties: key-value pairs used by Falcon and propagated to the workflow engine. For instance, you can specify the JMS broker connection factory:

<property name="brokerImplClass" value="org.apache.activemq.ActiveMQConnectionFactory" />

Now that we have the XML description, we can register our cluster in Falcon. We use the Falcon command-line client to submit our cluster definition:

$ cd falcon-0.5-incubating-SNAPSHOT
$ bin/falcon entity -submit -type cluster -file ~/demo/local.xml
default/Submit successful (cluster) local

We can check that our local cluster is actually present in the Falcon configuration store:

$ bin/falcon entity -list -type cluster
(cluster) local(null)

We can see our cluster “local”, for now without any dependency (null).

If we take a look on hdfs, we can see that the falcon directory has been created:

$ cd node1
$ bin/hadoop fs -ls /
Found 3 items
drwxr-xr-x - jbonofre supergroup 0 2014-03-08 07:48 /falcon
drwxr-xr-x - jbonofre supergroup 0 2014-03-06 17:32 /tmp
drwxr-xr-x - jbonofre supergroup 0 2014-03-06 18:05 /user

Feed

A feed entity is a location on the cluster. It also defines additional attributes like frequency, late-arrival handling, and retention policies. A feed can be scheduled, meaning that Falcon will create processes to deal with retention and replication on the cluster.

Like any other entity, a feed is described using XML. We create the entity/output.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="RandomProcess output feed" name="output" xmlns="uri:falcon:feed:0.1">
  <group>output</group>
  <frequency>minutes(1)</frequency>
  <timezone>UTC</timezone>
  <late-arrival cut-off="minutes(5)"/>
  <clusters>
    <cluster name="local">
      <validity start="2012-07-20T03:00Z" end="2099-07-16T00:00Z"/>
      <retention limit="hours(10)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/output"/>
  </locations>
  <ACL owner="jbonofre" group="supergroup" permission="0x644"/>
  <schema location="none" provider="none"/>
</feed>

The locations element defines the feed storage: paths on HDFS or table names for Hive. A location is defined on a cluster, identified by name. In our example, we use the “local” cluster that we submitted before.

The group element defines a comma-separated list of groups. A group is a logical grouping of feeds. A group is said to be available if all the feeds belonging to it are available. All feeds belonging to the same group must have the same frequency.
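As an illustration of the group semantics (this is not Falcon code), assuming each feed reports its own frequency and availability:

```python
# Illustrative sketch (not Falcon code): a feed group is "available"
# only when every feed in it is available, and all member feeds must
# share the same frequency.

def group_available(feeds):
    """feeds: list of dicts like {"name": ..., "frequency": ..., "available": bool}"""
    frequencies = {f["frequency"] for f in feeds}
    if len(frequencies) > 1:
        raise ValueError("all feeds in a group must have the same frequency: %s" % frequencies)
    return all(f["available"] for f in feeds)

group = [
    {"name": "output",  "frequency": "minutes(1)", "available": True},
    {"name": "metrics", "frequency": "minutes(1)", "available": False},
]
print(group_available(group))  # False: one member feed is not available yet
```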

The frequency element specifies how often this feed is generated (for instance, every hour, every 5 minutes, daily, weekly, etc.). Falcon uses this frequency to check whether the feed has changed (i.e. its size has changed). In our example, we define a frequency of one minute. Falcon creates a job in Oozie to monitor the feed.
Falcon can handle late arrival of input data and appropriately re-trigger processing for the affected instance. From the late-handling perspective, two configuration parameters are central: the late-arrival cut-off and the late-inputs section in the feed and process entity definitions. These govern how and when late processing happens. In the current (Oozie-based) implementation, late handling is very simple and basic: Falcon looks at all dependent input feeds for a process and computes the maximum late cut-off period. It then uses a scheduled messaging framework, like the one available in Apache ActiveMQ, to schedule a message with that cut-off period. After the cut-off period, the message is dequeued and Falcon checks for changes in the feed data, which is recorded on HDFS in a late data file by Falcon’s “record-size” action. If it detects any changes, the workflow is rerun with the new set of feed data.

The retention element specifies how long the feed is retained on the cluster and the action to be taken after the retention period expires. In our example, we delete the feed after a retention period of 10 hours.
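As a rough illustration of the retention semantics (this is not Falcon’s actual eviction code): with limit="hours(10)" and action="delete", any feed instance older than 10 hours is a candidate for deletion.

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch of retention eviction (not Falcon's actual code):
# with <retention limit="hours(10)" action="delete"/>, any feed instance
# older than 10 hours is selected for deletion.

def instances_to_delete(instance_times, limit_hours=10, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=limit_hours)
    return [t for t in instance_times if t < cutoff]

now = datetime(2014, 3, 19, 12, 0, tzinfo=timezone.utc)
instances = [now - timedelta(hours=h) for h in (1, 5, 11, 24)]
expired = instances_to_delete(instances, limit_hours=10, now=now)
print(len(expired))  # 2: the instances that are 11h and 24h old
```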

The validity of a feed on a cluster specifies the duration for which the feed is valid on that cluster (i.e. considered for scheduling by Falcon).

The ACL defines the permission on the feed (owner/group/permission).

The schema element allows you to specify the “format” of the feed (for instance CSV). In our case, we don’t define any schema.

We can now submit the feed (register the feed) into Falcon:

$ cd falcon-0.5-incubating-SNAPSHOT
$ bin/falcon entity -submit -type feed -file ~/demo/entity/output.xml
default/Submit successful (feed) output

Process

A process entity defines a job in the cluster.

Like any other entity, a process is described with XML (entity/process.xml):

<?xml version="1.0" encoding="UTF-8"?>
<process name="my-process" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="local">
      <validity start="2013-11-15T00:05Z" end="2030-11-15T01:05Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>minutes(5)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <!-- In the workflow, the input paths will be available in a variable 'inpaths' -->
    <input name="inpaths" feed="input" start="now(0,-5)" end="now(0,-1)"/>
  </inputs>
  <outputs>
    <!-- In the workflow, the output path will be available in a variable 'outpath' -->
    <output name="outpath" feed="output" instance="now(0,0)"/>
  </outputs>
  <properties>
    <!-- In the workflow, these properties will be available with variable - key -->
    <property name="queueName" value="default"/>
    <!-- The schedule time available as a property in workflow -->
    <property name="time" value="${instanceTime()}"/>
  </properties>
  <workflow engine="oozie" path="/app/mr"/>
  <late-process policy="periodic" delay="minutes(1)">
    <late-input input="inpaths" workflow-path="/app/mr"/>
  </late-process>
</process>

The cluster element defines where the process will be executed. Each cluster has a validity period, telling the times between which the job should run on the cluster. For the demo, we set a large validity period.

The parallel element defines how many instances of the process can run concurrently. We set a value of 1 here to ensure that only one instance of the process can run at a time.

The order element defines the order in which the ready instances are picked up. The possible values are FIFO(First In First Out), LIFO(Last In First Out), and ONLYLAST(Last Only). It’s not really used in our case.

The frequency element defines how frequently the process should run. In our case, minutes(5) means that the job will run every 5 minutes.

The inputs element defines the input data for the process. The process job will start executing only after the schedule time and when all the inputs are available. There can be 0 or more inputs, and each input maps to a feed. The path and frequency of input data are picked up from the feed definition. Each input should also define start and end instances as EL expressions, and can optionally specify the specific partition of input that the process requires. The components in the partition should be a subset of the partitions defined in the feed.
For each input, Falcon creates a property with the input name that contains the comma-separated list of input paths. This property can be used in process actions like Pig scripts and so on.
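To make the EL expressions concrete, here is a rough illustration (not Falcon code) of how start="now(0,-5)" and end="now(0,-1)" could resolve to a comma-separated inpaths list for a feed with a minutes(1) frequency. The dated directory layout is an assumption for illustration; the demo feed actually uses a flat path:

```python
from datetime import datetime, timedelta

# Illustrative resolution of input instances (not Falcon code).
# now(h, m) offsets the nominal time; with a feed frequency of
# minutes(1), each minute between start and end is one instance.

def resolve_inpaths(nominal_time, base_path, start_min, end_min):
    paths = []
    m = start_min
    while m <= end_min:
        t = nominal_time + timedelta(minutes=m)
        paths.append("%s/%s" % (base_path, t.strftime("%Y-%m-%d-%H-%M")))
        m += 1
    return ",".join(paths)

nominal = datetime(2013, 11, 15, 6, 5)
print(resolve_inpaths(nominal, "/data/input", -5, -1))
# five comma-separated instance paths, from 06:00 up to 06:04
```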

The outputs element defines the output data that is generated by the process. A process can define 0 or more outputs. Each output is mapped to a feed and the output path is picked up from feed definition. The output instance that should be generated is specified in terms of EL expression.
For each output, Falcon creates a property with the output name that contains the path of the output data. This can be used in workflows to store data in that path.

The properties element contains key value pairs that are passed to the process. These properties are optional and can be used to parameterize the process.

The workflow element defines the workflow engine that should be used and the path to the workflow on HDFS. The workflow definition on HDFS contains the actual job that should run, and it should conform to the workflow specification of the engine specified. The libraries required by the workflow should be in a lib folder inside the workflow path.
The properties defined in the cluster entity and the cluster properties (nameNode and jobTracker) will also be available to the workflow.
Currently, Falcon supports three workflow engines:

  • oozie enables users to provide an Oozie workflow definition (in XML).
  • pig enables users to embed a Pig script as a process
  • hive enables users to embed a Hive script as a process. This would enable users to create materialized queries in a declarative way.

NB: I proposed to support a new workflow type, MapReduce, to be able to directly execute MapReduce jobs.

In this demo, we use the oozie workflow engine.

We create an Oozie workflow.xml:

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${outpath}"/>
      </prepare>
      <configuration>
        <property>
          <name></name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.mapred.lib.IdentityMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.mapred.lib.IdentityReducer</value>
        </property>
        <property>
          <name></name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inpaths}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outpath}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

This workflow is very simple: it uses IdentityMapper and IdentityReducer (provided in Hadoop core) to copy input data as output data.

We upload this workflow.xml on HDFS (in the location specified in the Falcon process workflow element):

$ cd node1
$ bin/hadoop fs -mkdir /app/mr
$ bin/hadoop fs -put ~/demo/workflow.xml /app/mr

The late-process element allows the process to react to input feed changes and trigger an action (here, we re-execute the Oozie workflow).

We are now ready to submit the process in Falcon:

$ cd falcon-*
$ bin/falcon entity -submit -type process -file ~/entity/process.xml

The process is ready to be scheduled.

Before scheduling the process, we create the input data. The input data is a simple file (containing a string) that we upload to HDFS:

$ cat > file1
This is a test file
$ node1/bin/hadoop fs -mkdir /data/input
$ node1/bin/hadoop fs -put file1 /data/input

We can now trigger the process:

$ cd falcon*
$ bin/falcon entity -schedule -type process -name my-process
default/my-process(process) scheduled successfully

We can see the different jobs in Oozie (accessing http://localhost:11000/oozie):

On the other hand, we see new topics and queues created in ActiveMQ:

Especially, in ActiveMQ, we have two topics:

  • Falcon publishes messages in the topic for each execution of the process
  • Falcon publishes messages in the FALCON.ENTITY.TOPIC topic for each change on the feeds

This is where our Camel routes will subscribe.

Camel routes in Karaf

Now that we have our Falcon platform ready, we just have to create the Camel routes (hosted in a Karaf container), subscribing to the Falcon topics in ActiveMQ.

We uncompress a Karaf container, and install the Camel features (camel-spring, activemq-camel):

$ tar zxvf apache-karaf-2.3.1.tar.gz
$ cd apache-karaf-2.3.1
$ bin/karaf
karaf@root> features:chooseurl camel
adding feature url mvn:org.apache.camel.karaf/apache-camel/LATEST/xml/features
karaf@root> features:install camel-spring
karaf@root> features:chooseurl activemq
karaf@root> features:install activemq-camel

We create a falcon-routes.xml file containing the Camel routes (using the Spring DSL):

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
  <camelContext id="camel" xmlns="http://camel.apache.org/schema/spring">
    <route id="process-listener">
      <from uri=""/>
      <to uri="log:process-listener"/>
    </route>
    <route id="feed-listener">
      <from uri="jms:topic:FALCON.ENTITY.TOPIC"/>
      <to uri="log:feed-listener"/>
    </route>
  </camelContext>
  <bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
    <property name="connectionFactory">
      <bean class="org.apache.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
      </bean>
    </property>
  </bean>
</beans>

In the Camel context, we create two routes, both connecting to the ActiveMQ broker and listening on the two topics.

We drop the falcon-routes.xml in the deploy folder, and we can see it active:

karaf@root> la|grep -i falcon
[ 114] [Active ] [ ] [Started] [ 80] falcon-routes.xml (0.0.0)
karaf@root> camel:route-list
 Context          Route              Status
 -------          -----              ------
 camel            feed-listener      Started
 camel            process-listener   Started

The routes subscribe to the topics and just send the messages to the log (it’s very, very simple).

So, we just have to take a look at the log (log:tail):

2014-03-19 11:25:43,273 | INFO |] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory,}]
2014-03-19 11:25:43,693 | INFO | ON.ENTITY.TOPIC] | feed-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=jbonofre, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/job-2013-11-15-06-05/, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.ENTITY.TOPIC}]

And we can see our notifications:

  • on the process-listener logger, we can see that my-process (entityName) has been executed with the SUCCEEDED status at 2014-03-19T10:24Z (timeStamp). We also have the location of the job execution log on HDFS.
  • on the feed-listener logger, we see much the same message. This message comes from the late-arrival handling, which means the input feed changed.
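Once a route receives such a message, a processor typically acts on a few fields of the map. A minimal sketch (in Python for readability, using the field names shown in the log above; this is illustrative, not a Camel bean):

```python
# Sketch of handling a Falcon notification payload (field names taken
# from the log output above; illustrative only, not a Camel processor).

def handle_notification(body):
    status = body.get("status")
    entity = body.get("entityName")
    if status != "SUCCEEDED":
        return "ALERT: %s finished with status %s" % (entity, status)
    return "%s produced %s at %s" % (
        entity, body.get("feedInstancePaths"), body.get("timeStamp"))

message = {
    "status": "SUCCEEDED",
    "entityName": "my-process",
    "feedInstancePaths": "hdfs://localhost:8020/data/output",
    "timeStamp": "2014-03-19T10:24Z",
}
print(handle_notification(message))
```

In a real route, the same logic could sit in a bean between the `from` and `to` endpoints, routing failures to an alerting endpoint instead of the log.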

For sure, the Camel routes are very simple for now (just a log), but there is no limit: you can bring together all the power of the ESB and Big Data worlds.
Once the Camel routes get the messages on ActiveMQ coming from Falcon, you can implement the integration process of your choice (sending e-mails, using Camel EIPs, calling beans, etc).

What’s next ?

I’m working on different enhancements on the late-arrival/CDC feature:

  1. The late-arrival messages in the FALCON.ENTITY.TOPIC should be improved: the message should identify the feed that changed, the location of the feed, and eventually the size gap.
  2. We should provide a more straightforward CDC feature that doesn’t require a process to monitor a feed: just scheduling a feed with a late cut-off should be enough.
  3. In addition to the oozie, pig, and hive workflow engines, we should provide a “pure” MapReduce jar workflow engine.
  4. The should be improved to provide a more “ready” to use Falcon Oozie custom distribution.

I’m working on these different enhancements and improvements.

On the other hand, I will propose a set of documentation improvements, especially some kind of “recipe” documentation like this one.

Stay tuned, I’m preparing a new blog about Falcon, this time about the replication between two Hadoop clusters.

Categories: FLOSS Project Planets

Sebastien Goasguen: Migrating from Publican to Sphinx and Read The Docs

Wed, 2014-03-19 10:26
Migration from Publican to Sphinx and Read The Docs

When we started with CloudStack we chose to use Publican for our documentation. I don't actually know why, except that Red Hat documentation is entirely based on Publican. Perhaps David Nalley's background with Fedora influenced us :) In any case Publican is a very nice documentation building system, based on the DocBook format, with great support for localization. However it can become difficult to read and organize lots of content, and builds may break for strange reasons. We also noticed that we were not getting many contributors to the documentation; in contrast, the translation effort via Transifex has had over 80 contributors. As more features got added to CloudStack the quality of the content started to suffer, and we also faced issues with publishing the translated documents. We needed to do something, mainly making it easier to contribute to our documentation. Enter reStructuredText (RST) and Read The Docs (RTD).
Choosing a new format

We started thinking about how to make our documentation easier to contribute to. Looking at DocBook, purely XML based, it is a powerful format but not very developer friendly. A lot of us are happy with a basic text editor, with some old farts like me mainly stuck with vi. Markdown has certainly helped a lot of folks write documentation and READMEs; just look at GitHub projects. I started writing in Markdown and my production in terms of documentation and tutorials skyrocketed; it is just a great way to write docs. reStructuredText is another alternative, not really Markdown, but pretty close. I got familiar with RST in the Apache Libcloud project and fell in love with it, or at least liked it way more than DocBook. RST is basically text with only a few markups to learn, and you're off.
Publishing Platform

A new format is one thing, but you then need to build documentation in multiple formats: HTML, PDF, EPUB, potentially more. How do you move from .rst to these formats for your projects? Enter Sphinx, pretty much an equivalent of Publican, originally aimed at Python documentation but now used for much more. Installing Sphinx is easy, for instance on Debian/Ubuntu:
apt-get install python-sphinx

You will then have the sphinx-quickstart command in your path. Use it to create your Sphinx project, add content in index.rst, and build the docs with make html. Below is a basic example for a sample project.
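A minimal index.rst for such a sample project might look like this (the project name and section file names are illustrative):

```rst
Sample Project
==============

A short description of the project.

.. toctree::
   :maxdepth: 2

   installation
   usage
```

Running make html from the project root then generates the HTML site (by default under _build/html).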

What really got me sold on reStructuredText and Sphinx was Read The Docs (RTD). It hosts documentation for open source projects. It automatically pulls your documentation from your revision control system and builds the docs. The killer feature for me was the integration with GitHub (not just git). Using hooks, RTD can trigger builds on every commit, and it also displays an "edit on GitHub" icon on each documentation page. Click on this icon, and the docs repository gets forked automatically on your GitHub account. This means that people can edit the docs straight up in the GitHub UI and submit pull requests as they read the docs and find issues.
Conversion

After [PROPOSAL] and [DISCUSS] threads on the CloudStack mailing list, we reached consensus and decided to make the move. This is still on-going, but we are getting close to going live with our new docs in RST, hosted by RTD. There were a couple of challenges:
  1. Converting the existing docbook based documentation to RST
  2. Setting up new repos, CNAMEs and Read The Docs projects
  3. Setting up the localization with transifex
The conversion was much easier than expected thanks to pandoc, one of those great command line utilities that saves your life.
pandoc -f docbook -t rst -o test.rst test.docbook

You get the gist of it: iterate through your DocBook files, generate the RST files, combine everything to reconstruct your chapters and books, and re-organize as you wish. There are of course a couple of gotchas: the table formatting may not be perfect, the notes and warnings may be a bit out of whack, and the heading levels should probably be checked. All of these are actually good things to check in a first pass through the docs to revamp the content and the way it is organized.
One thing that we decided to do before talking about changing the format was to move our docs to a separate repository. What we wanted was to be able to release docs on a different time frame than the code release, as well as make any doc bug fixes go live as fast as possible and not wait for a code release (that's a long discussion...). With a documentation-specific repo in place, we used Sphinx to create the proper directory structure and added the converted RST files. Then we created a project on Read The Docs and pointed it to the GitHub mirror of our Apache git repo. Pointing to the GitHub mirror allowed us to enable the nice GitHub interaction that RTD provides. The new doc site looks like this.

There is a bit more to it: we actually created several repositories and used an RTD feature called subprojects to make all the docs live under the same CNAME. This is still work in progress but on track for the 4.3 code release. I hope to be able to announce the new documentation sites shortly after 4.3 is announced.
The final hurdle is localization support. Sphinx provides utilities to generate POT files. These can then be uploaded to Transifex, and translation strings can be pulled to construct the translated docs. The big challenge we are facing is to not lose the existing translations that were done from the DocBook files. Strings may have changed. We are still investigating how to not lose all that work and get back on our feet to serve the translated docs. The Japanese translators have started to look at this.
Overall the migration was easy: reStructuredText is easy, Sphinx is also straightforward, and Read The Docs provides a great hosting platform well integrated with GitHub. Once we go live, we will see if our doc contributions increase significantly; we have already seen a few pull requests come in, which is very encouraging.
I will be talking about all of this at the Write The Docs conference in Budapest on March 31st and April 1st. If you are in the area, stop by :)
Categories: FLOSS Project Planets