At some point late last fall, I was anticipating having some time to spend with my Kindle, so I bought Patrick Rothfuss's The Name of the Wind.
As usual with me, I am about 10 years behind the times, as this book came out some time ago.
But I was looking for a page-turner (is it fair to say that, when you are reading an e-book on an e-reader?).
At any rate, I turned the pages.
And kept turning them (there are a lot of pages...).
And I turned them all the way until the end.
Which is not always the case with me, and a book. I have too little time and too many distractions, and many is the book that I nobly start yet do not finish.
Rothfuss's style appealed to me, because he knows how to take his time with his story. Sometimes books rush along, hurrying to force the tale to be told, cramming adventures and villains and escapades willy-nilly into every page.
But Rothfuss is trying to tell the story of a boy growing up (even though that boy may become a mighty wizard).
And, as every boy knows (and surely, every girl as well), growing up takes its own time, and proceeds on its own schedule.
So, the long and short of it is: I enjoyed The Name of the Wind, and felt it lived up to my expectations.
Rothfuss has written a sequel, and promises that he will complete his story.
When the time comes.
And, down the road, when I find that I again have some time with my Kindle, I expect that I will continue reading Rothfuss, moving on to The Wise Man's Fear.
And see how I like turning those pages.
Oxfam grabs a headline with a report telling us the richest 1% will own half the world’s wealth in 2016.
As with many reports coming from lobbying organisations, this one provokes scepticism. Not outright dismissal, but a “really?”, and a need to know what they’re actually measuring before I can treat it as meaningful. It also provokes mild curiosity: how rich do you have to be to be in that 1%? Not least because I have a sneaking suspicion it includes a great many people whom our chattering classes don’t consider rich at all.
The Oxfam report itself is a mere twelve pages and disappointingly light on data. If there’s any attempt to substantiate the headline claim then I missed it. But googling “World Wealth” finds this report, which tells me total world wealth is projected to be $64.3 trillion in 2016. OK, that’ll do for a ballpark calculation. $64.3 trillion between 7 billion people is an average of about $9k per head. If the top 1% own half of it, that’s $32.15 trillion between 70 million people: an average of $459k per head within that top 1%.
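The arithmetic above is easy to verify; a quick sketch using the figures quoted in this post:

```python
# Back-of-the-envelope check on the wealth figures quoted above.
total_wealth = 64.3e12   # projected world wealth in 2016, USD
population = 7e9         # world population

avg_per_head = total_wealth / population
print(f"average per head: ${avg_per_head:,.0f}")        # roughly $9k

top_share = 0.5 * total_wealth       # the 1% owning half of all wealth
top_people = 0.01 * population       # 70 million people
avg_top = top_share / top_people
print(f"average within top 1%: ${avg_top:,.0f}")        # roughly $459k
```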
That’s £300k. There must be millions in Blighty with that much in housing wealth alone (and others correspondingly locked out). Not to mention in other high-cost countries around Europe, America, Asia, and I expect even a few in the third world. All above the average of that fabled top 1%.
But of course housing isn’t our only asset. In Blighty and around the developed world, a big chunk of our wealth takes the form of Entitlements. One such in the UK is the Basic State Pension, which is worth £200k, and even the poorest Brit is entitled to it. It seems you can be in that top 1% without being rich enough to buy a house in Blighty!
Hmmm. Oh dear. Maybe Oxfam’s spin isn’t really very meaningful at all. Except perhaps to highlight how incredibly egalitarian we are within Blighty – and probably all developed countries – once you include the effect of government actions.
i’ve started reading a book on lisp. as Alan Perlis said:
A language that doesn’t affect the way you think about programming, is not worth knowing.
but this chapter title:
"Truth, Falsehood, and Equality" — sounds like a chapter from legend of korra— The Wrath of PB™ (@hirojin) January 19, 2015
made me think beyond programming. i’ve been contemplating this in terms of political systems & stories, and i’m thinking that there’s no chance to achieve radical equality:
societies change over generations, as do their stories. and while, as societies, we frown at those (ancient or contemporary) societies that use the murder of prisoners and slaves as entertainment, our stories are filled with such things.
the fight for power.
the struggle against corrupt power.
we even have to fight for love.
we have no need for equality, because the stories we are raised with neither prepare us for what such an equal society would look like, nor do they raise a desire to achieve it.
we are inching ourselves towards it. that’s societal change over generations. i’m starting to fear the only way we know how to radically change is to erase the past, and that would be profoundly dangerous.
even more dangerous than forgetting the (often recent) past and regressing into “good old” patterns.
A much better carbon-relay, written in C rather than Python. Linking as we’ve been using it in production for quite a while with no problems. The main reason to build a replacement is performance and configurability. Carbon is single-threaded, and sending metrics to multiple consistent-hash clusters requires chaining of relays. This project provides a multithreaded relay which can address multiple targets and clusters for each and every metric, based on pattern matches.
As noted in a previous blog post, I've started working on the 2.0 version of Telaen: a simple but powerful PHP-based Webmail system. Quite a bit has been changed, fixed and added under-the-covers, including baselining PHP 5.4, a more robust installation checker, and some significant performance increases.
Now I know, of course, that there are a number of other PHP webmail offerings out there, so some may be questioning the need for yet another. I can think of a few reasons:
- Telaen is designed to have as few dependencies as possible; the goal is that any typical PHP setup will be able to run Telaen.
- No external database is required.
- Extensive support for both IMAP and POP; to be honest, most webmail systems don't support POP at all, or are extremely limited in their support.
- Consistent functionality, no matter which IMAP/POP server is used; most webmail systems are simple "front ends" for IMAP servers, meaning the capability of the webmail system depends on what IMAP server is being used. Telaen puts that capability within the webmail system for a consistent feature set.
- Fast caching
- Designed to serve as someone's primary email client or as a supplemental one.
- Lots of what you need/want, and none of what you don't: Telaen is as simple as it can be, but no more so.
- A fast and secure upgrade path for all those people still using UebiMiau
- Open to ALL contributions!
The last point is important: we really want as many people as possible to use, contribute, drive and develop Telaen. It's a great project for someone just starting out as well as for more experienced developers. Or if your passion is documentation, we could definitely use your help! In fact, however you want to be involved, we want to welcome you to the project.
Our goal is to have a beta available within the next month or so. Stay tuned!
For certain XML documents, it is possible to modify the document and the streaming XML Signature verification code will not report an error when trying to validate the signature.
Please note that the "in-memory" (DOM) API for XML Signature is not affected by this issue, nor is the JSR-105 API. Web service stacks that use the streaming functionality of Apache Santuario (such as Apache CXF/WSS4J) are also not affected by this vulnerability. Apart from this issue, version 2.0.3 contains a significant performance improvement, and both releases contain minor bug fixes and dependency upgrades.
This post provides a summary of a recent benchmark effort for the Fortress RBAC Accelerator. The RBAC accelerator uses LDAPv3 extended operations to perform the following access control functions:
- Create Session – attempts to authenticate client; if successful, initiates an RBAC session by activating one or more user roles
- Check Access – determines if user has access rights for a given resource
- Add Active Role – attempts activation for a given role into user’s RBAC session
- Drop Active Role – deactivates a given role from user’s RBAC session
- Delete Session – deletes the given RBAC session from the server
- Session Roles – returns the active roles associated with the current session
The result of each of the above functions is persisted to LMDB for an audit trail.
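The session lifecycle implied by the operations above can be sketched as a toy in-memory model. This is purely illustrative: the real accelerator implements these as LDAPv3 extended operations against OpenLDAP, and none of these class or method names are the actual Fortress API.

```python
# Toy in-memory model of the RBAC accelerator's session operations.
# Illustrative only; the real system runs server-side in OpenLDAP
# and audits each call to LMDB.
class RbacSessions:
    def __init__(self, user_roles, role_perms):
        self.user_roles = user_roles   # user -> roles assigned to them
        self.role_perms = role_perms   # role -> resources it permits
        self.sessions = {}             # session id -> active role set

    def create_session(self, user, roles):
        # Authenticate/activate: only assigned roles may be activated.
        if not set(roles) <= self.user_roles.get(user, set()):
            raise PermissionError("role not assigned to user")
        sid = "sess-%d" % len(self.sessions)
        self.sessions[sid] = set(roles)
        return sid

    def check_access(self, sid, resource):
        # True if any active role in the session permits the resource.
        return any(resource in self.role_perms.get(r, set())
                   for r in self.sessions[sid])

    def add_active_role(self, sid, role):
        # (Assignment check elided for brevity.)
        self.sessions[sid].add(role)

    def drop_active_role(self, sid, role):
        self.sessions[sid].discard(role)

    def session_roles(self, sid):
        return set(self.sessions[sid])

    def delete_session(self, sid):
        del self.sessions[sid]
```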
Benchmarks were performed using a JMeter test client to drive load for CheckAccess (#2). The server hosts the OpenLDAP daemon with the RBAC accelerator overlay.
Client Machine
- operating system: ubuntu 13.04
- kernel: 3.8.0-32-generic
- processor: Intel® Core™ i7-4702MQ CPU @ 2.20GHz × 8
- memory: 16GB (doesn’t use anywhere close to that)
- Java version 7
Server Machine
- operating system: ubuntu 14.04
- kernel: 3.13.0-32-generic
- processor: Intel® Core™ i7-4980HQ CPU @ 2.80GHz × 4
- memory: 8GB
- OpenLDAP version: 2.4.39
Test Configuration
- 25 threads running on client
- each thread runs checkAccess 50,000 times
- 1,250,000 total operations
Results
- Client CPU load: approximately 50%
- Response time: 1 millisecond
- Throughput: 11,533 transactions per second
- Server CPU load: approximately 85%
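The reported figures hang together; a quick check on the totals and the implied run length:

```python
# Consistency check on the benchmark numbers reported above.
threads = 25
ops_per_thread = 50_000
total = threads * ops_per_thread     # 1,250,000 checkAccess calls
reported_tps = 11_533                # throughput reported

run_seconds = total / reported_tps
print(total, round(run_seconds, 1))  # the run lasted roughly 108 seconds
```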
We've been watching some very good TV recently.
A few that stuck out to me:
- Vera follows Detective Vera Stanhope, based on the books of Ann Cleeves. It's set in Newcastle, England, and it is wonderfully compelling. It's gritty yet human, and the sights and sounds of Newcastle fit the show perfectly.
- The Fall is a police procedural set in Belfast, Northern Ireland, as an English policewoman is brought in to take charge of a case that has gone cold. The astonishing thing about The Fall is its pace: it takes two full seasons, about 15 hours of watching, to tell a story that many other shows might spend 90 minutes on. By really slowing down and digging in, the series becomes riveting; you simply cannot stop watching it once you start.
- Jack Taylor is a strong show made from the books of Ken Bruen, set in Galway, Ireland. The character of Taylor is heart-breakingly self-destructive, but oh! the shows are so strong.
- Longmire is based on Craig Johnson's Sheriff Walt Longmire books. It's set in rural Wyoming, on the Wyoming / Montana border (though actually filmed in New Mexico, I believe), and although the lead character is good, what makes the show is the superb richness of the supporting characters and cast.
- In Plain Sight is sort of a one-woman show. Mary McCormack plays Marshal Mary Shannon, an inspector in the Witness Protection Program who is stationed in Albuquerque, New Mexico. Again, the wild west feel of the show is great, but we've also grown to love the supporting cast of this show, even through its rough edges.
- Continuum is a fascinating Sci-Fi Channel show that riffs upon the time travel concept with some great writing and an interesting plot. It's got its flaws, but we've certainly enjoyed it.
- Orphan Black is another fascinating science fiction show, with a completely different plot. Most of the attention gathered by Orphan Black is hard to reveal without spoiling it, but the reality of the show is completely timely and believable, leading to lots of interesting discussions while you watch.
- The League is a comedy about a group of friends who stay close by participating in a fantasy football league. But that gives such short shrift to a wonderfully funny and human show.
- And Community is simply the funniest show you've never heard of. At least 5 laugh-out-loud moments in every 30 minute episode; great writing combined with a cast who clearly are having a delightful time with the show.
A friend commented to me recently that he barely watched movies anymore, because the TV series quality has become so high.
Perhaps it's just a burst of activity, but it's nice to get such great entertainment at the touch of a button at the end of a long hard day.
The "personal" blog of Asankha Perera: How the UltraESB and AdroitLogic was born..
‘Broadly, they are satisfied with what we are doing’ versus: ‘We have deep concerns about the Eircode initiative… We want to state clearly that we are not at all ‘satisfied’ with the postcode that has been designed or the implementation proposals.’
Heh, nice trolling. Here are two helpful guidelines (for largely disjoint populations): If you are going to use a big data system for yourself, see if it is faster than your laptop. If you are going to build a big data system for others, see that it is faster than my laptop. [...] We think everyone should have to do this, because it leads to better systems and better research.
Give them the power, they’ll use that power. ‘A document obtained under Freedom of Information legislation confirms the BBC’s use of RIPA in Northern Ireland. It states: “The BBC may, in certain circumstances, authorise under the Regulation of Investigatory Powers Act 2000 and Regulation of Investigatory Powers (British Broadcasting Corporation) Order 2001 the lawful use of detection equipment to detect unlicensed use of television receivers… the BBC has used detection authorised under this legislation in Northern Ireland.”‘
- Configure passwordless ssh across all cluster containers.
- Download, install and configure Java.
- Download, install and configure Apache Yarn:
- Configure Namenode and Datanode connectivity.
- Enable dynamic Datanodes to connect to Namenode.
- Configure Network:
- Network connectivity.
- Expose Yarn ports required by Administration UI and Node communication.
# install dev tools
RUN yum install -y curl which tar sudo openssh-server openssh-clients rsync
RUN yum update -y libselinux
# passwordless ssh
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
RUN curl -LO 'http://download.oracle.com/otn-pub/java/jdk/7u71-b14/jdk-7u71-linux-x64.rpm' -H 'Cookie: oraclelicense=accept-securebackup-cookie'
RUN rpm -i jdk-7u71-linux-x64.rpm
RUN rm jdk-7u71-linux-x64.rpm
ENV JAVA_HOME /usr/java/default
ENV PATH $PATH:$JAVA_HOME/bin
RUN curl -s http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s ./hadoop-2.6.0 hadoop
ENV HADOOP_PREFIX /usr/local/hadoop
ENV HADOOP_COMMON_HOME /usr/local/hadoop
ENV HADOOP_HDFS_HOME /usr/local/hadoop
ENV HADOOP_MAPRED_HOME /usr/local/hadoop
ENV HADOOP_YARN_HOME /usr/local/hadoop
ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
RUN sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/java/default\nexport HADOOP_PREFIX=/usr/local/hadoop\nexport HADOOP_HOME=/usr/local/hadoop\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
RUN mkdir $HADOOP_PREFIX/input
RUN cp $HADOOP_PREFIX/etc/hadoop/*.xml $HADOOP_PREFIX/input
# pseudo distributed
ADD core-site.xml $HADOOP_PREFIX/etc/hadoop/core-site.xml
#RUN sed s/HOSTNAME/localhost/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml
ADD mapred-site.xml $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_PREFIX/etc/hadoop/yarn-site.xml
RUN $HADOOP_PREFIX/bin/hdfs namenode -format
# fixing the libhadoop.so like a boss
RUN rm /usr/local/hadoop/lib/native/*
RUN curl -Ls http://dl.bintray.com/sequenceiq/sequenceiq-bin/hadoop-native-64-2.6.0.tar | tar -x -C /usr/local/hadoop/lib/native/
ADD ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config
ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
ENV BOOTSTRAP /etc/bootstrap.sh
# working around docker.io build error
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
RUN chmod +x /usr/local/hadoop/etc/hadoop/*-env.sh
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
# fix the 254 error code
RUN sed -i "/^[^#]*UsePAM/ s/.*/#&/" /etc/ssh/sshd_config
RUN echo "UsePAM no" >> /etc/ssh/sshd_config
RUN echo "Port 2122" >> /etc/ssh/sshd_config
CMD ["/etc/bootstrap.sh", "-d"]
EXPOSE 50020 50090 50070 50010 50075 8031 8032 8033 8040 8042 49707 22 8088 8030
DIY - Building the Docker.io image
sudo docker build -t yarn-cluster .
Getting Started - Launching Yarn nodes
In order to simplify which processes to start when launching a NameNode/NodeManager versus a DataNode, a bootstrap shell script is used. It supports --namenode and --datanode parameters, which are used in conjunction with the docker run command to launch the Yarn node. When launching the NameNode/NodeManager, there is also a need to map the ports used by the Yarn UI administration applications so they can be accessed outside of the containers. Below is the command to launch a NameNode/NodeManager node. Note that we use -p to map the ports, and then we use bootstrap.sh --namenode to start the proper Yarn services.
sudo docker run -i -t -p 8088:8088 -p 50070:50070 -p 50075:50075 --name namenode -h namenode yarn-cluster /etc/bootstrap.sh -bash -namenode
Now that the master node is up and running, let's add some DataNodes to our cluster. A peculiarity of launching the DataNodes is that they need to be aware of the NameNode location; for this, Docker enables containers to be linked, which causes the local /etc/hosts to be updated with the address of the linked container. Below is the command to launch a DataNode. Note how the --link parameter links the DataNode container to the NameNode container, and also how bootstrap.sh --datanode now receives a different parameter so that only the Yarn DataNode-related services are started.
sudo docker run -i -t --link namenode:namenode --workdir /usr/local/hadoop yarn-cluster /etc/bootstrap.sh -bash -datanode
After launching a few images, the DataNode administration UI will then look like the one below.
Conclusion
Using Docker.io containers is a very good and lightweight option for building a Hadoop Yarn cluster, but in order to get it to the next level, there are a few other items that need to be thought through and solved, such as the ones described below:
- Managing machine resources available for each container : cpu, memory, etc.
- Strategy for non-transient persistent data.
- Rack-aware data replication when in a container environment.
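On the first point, the Docker CLI of this era already offered some basic per-container limits; a sketch, assuming the 1.x-era flag names (the values here are made up, not tuned recommendations):

```shell
# Illustrative only: cap a DataNode container at 2 GB of RAM and
# pin it to CPUs 0 and 1, using the Docker 1.x CLI flags.
sudo docker run -i -t -m 2g --cpuset="0,1" \
    --link namenode:namenode --workdir /usr/local/hadoop \
    yarn-cluster /etc/bootstrap.sh -bash -datanode
```

This only covers machine resources; persistent data and rack awareness still need solutions of their own.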
MAINTAINER Luciano Resende
# Enable EPEL repository for GIT, Node.js and npm
RUN rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# Install Git, Node.js and npm
RUN yum install -y git nodejs npm
# Checkout node-app-template from github
RUN git clone https://github.com/lresende/node-app-template /opt/node-app-template
# Install app dependencies
RUN ls /opt/node-app-template
RUN cd /opt/node-app-template; npm install
# Node app is running on port 3000
EXPOSE 3000
# Define what to run when container is started
CMD ["node", "/opt/node-app-template/bin/www"]
Now that we have the "recipe" for building the docker image, we can build it with
sudo docker build --rm --no-cache -t node-app .
To run the application, we need to start a docker container based on the image we have just created. Note that, when we start the container, we are redirecting the public port 8080 to the exposed internal port 3000.
sudo docker run -p 8080:3000 -d node-app
Now we are ready to access the application: just start your browser and point it to http://localhost:8080.
Hope this helps you get started with Docker containers. All the source code is also available in the github repository: node-app-container.
Belgian cities full of trigger-happy armed troops, with orders to shoot to kill, and a recent track record of doing so.
In reality, probably a lower risk than regular vehicular traffic, even for those of us with an ample beard and a big backpack. Though surely a far higher risk than the supposed terrorist threat. But that level of security theatre is hardly welcoming to visitors. Since I have the choice, I’m staying away, and withholding the support that might be inferred from my travelling to Brussels for a weekend in the near future.
That last sentence is a bit disingenuous, insofar as it suggests this is a big change of plan. In reality I hadn’t decided one way or the other. I’ve been doing that of late: I only got around to signing up for ApacheCon in Budapest the day before it started!
I drive a Toyota, and this is scary stuff. Critical software systems need to be coded with care, and this isn’t it — they don’t even have a bug tracking system!
Investigations into potential causes of Unintended Acceleration (UA) for Toyota vehicles have made news several times in the past few years. Some blame has been placed on floor mats and sticky throttle pedals. But, a jury trial verdict was based on expert opinions that defects in Toyota’s Electronic Throttle Control System (ETCS) software and safety architecture caused a fatal mishap. This talk will outline key events in the still-ongoing Toyota UA litigation process, and pull together the technical issues that were discovered by NASA and other experts. The results paint a picture that should inform future designers of safety critical software in automobiles and other systems.
I was offline for 95% of the xmas break, instead investing my keyboard time into: (a) the exercises in Structure and Interpretation of Computer Programs and (b) writing some stuff on the implications of the Sony debacle for my home network security architecture.
I'm going to start posting the latter articles in an out-of-order sequence, with this post: InfoSec risks of android travel applications
1. Airline check-in & travel apps demand so many privileges that you can't trust corporate calendar/contact data to stay on the devices. Nor, in the absence of audit logs, can you tell if the information has leaked.
2. Budget airline applications are the least invasive; "premium" airlines demand access to confidential calendar info.
3. Even train timetable apps like to know things like your contact list.
However hard you lock down your network infrastructure, mandate 26-digit high-unicode passwords rolled monthly, mandate encrypted phones and PIN-protected SIM cards: if those phones say "android" when they boot, you can't be confident that sensitive corporate data isn't leaking out of them if the users expect to be able to use their phones to check on buses, trains or airplanes.
Normally the fact that Android apps can ask and get near-unlimited data access is viewed as a privacy concern. It is for home users. Once you do any of the following, it becomes an InfoSec issue:
- Synchronise calendar with a work email service.
- Maintain a contact list which includes potentially confidential contacts/customers.
- Bond to a work Wifi network which offers network access to HTTP(S) sites without some form of auth.
- Do the same via VPN
Demands of Applications
Noticing that one application update wanted more information than I expected, I went through all the travel apps on my android phone and looked at what permissions they demanded. These weren't installed explicitly for the experiment; they're simply what I use to fly on airlines, plus some train and bus apps in the UK. I'm excluding TripIt on the basis that their web infrastructure requests (optional) access to your Google emails to auto-scan for trip plans, which is in a different league from these.
| Entity | Calendar | Contacts | Network | Location |
| --- | --- | --- | --- | --- |
| British Airways | confidential, participants | No | Yes | Precise |
| United Airlines | confidential, participants | No | Yes; view network connections | Precise |
| Easyjet | No | No | Yes | Precise |
| Ryanair | No | No | Yes | Precise |
| National Rail | Add, modify, participants | No | Yes | Precise |
| National Express Coach | No | Yes | Yes; view network connections & wifi | Precise |
| First Great Western trains | No | No | Yes | Precise |
| trainline | No | No | Yes; view network connections | Precise |
| First Bus | No | No | Yes; view network connections | Precise |
When you look at this list, it's appalling. Why does the company that I use to get a bus to LHR need to know my contact list? Why does BA need my confidential appointment data? Why does the UK National Rail app need to be able to enumerate the calendar and send emails to participants without the owner's knowledge?
British Airways: wants access to confidential calendar info and full network access. What could possibly go wrong?
United: wants to call numbers, take photos and access confidential calendar info
National Express Bus Service
This is a bus company. How can they justify reading my contact list, business as well as personal?
UK National Rail
Pretty much total phone control, though not confidential appointment info. Are event titles considered confidential though?
Google's business model is built on knowing everything about your personal life, but this isn't about privacy: it is about preventing data leakage from an organisation. If anyone connects to your email services from an android, your airline check-in apps get to see the title, body and participants of all calendar appointments, whether that is "team meeting" or "plans for takeover of Walmart" where the participants include Jim Bezos and Donald Trump (*).
What could be done?
- Log accesses. I can't see a way to do this today, yet it would seem a core feature IT security teams would like to know. Without it you can't tell what information apps have read.
- Track provenance of calendar events and restrict calendar access only to events created by the airline apps themselves. This would require the servers to add event metadata; as google own gmail they could add a new BigTable column with ease.
- Restrict network access to HTTPS sites on specific subdomains. Requiring HTTPS is good for general wifi security, and stops (most) organisations from playing DNS games to get behind the firewall.
In the absence of that feature, if you want to be able to check in on your android phone on a non-budget airline, you have to give up expectations of the security of your confidential calendar data and contact list.
And in a world of BYOD, where the IT dept doesn't have control of the apps on a phone, that means they can't stop sensitive calendar/contact data leaking at all.
(*) FYI, there are no appointments in my calendar discussing taking over Walmart that include both Jim Bezos and Donald Trump. I cannot confirm or deny any other meetings with these participants or plans for Walmart involving other participants. Ask British Airways or UAL if you don't believe me.
Open Data
Amazon S3 seems to be emerging as the de facto solution for sharing large datasets. In particular, AWS curates a variety of public data sets that can be accessed for free (from within AWS; there are egress charges otherwise). To take one example from genomics, the 1000 Genomes project hosts a 200TB dataset on S3.
Hadoop has long supported S3 as a filesystem, but recently there has been a lot of work to make it more robust and scalable. It’s natural to process S3-resident data in the cloud, and here there are many options for Hadoop. The recently released Cloudera Director, for example, makes it possible to run all the components of CDH in the cloud.
Notebooks
By "notebooks" I mean web-based, computational scientific notebooks, exemplified by the IPython Notebook. Notebooks have been around in the scientific community for a long time (they were added to IPython in 2011), but increasingly they seem to be reaching the larger data scientist and developer community. Notebooks combine prose and computation, which is great for exposition and interactivity. They are also easy to share, which helps foster collaboration and reproducibility of research.
It’s possible to run IPython against PySpark (notebooks are inherently interactive, so working with Spark is the natural Hadoop lead-in), but it requires a bit of manual setup. Hopefully that will get easier—ideally Hadoop distributions like CDH will come with packages to run an appropriately-configured IPython notebook server.
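For the curious, at the time of writing that manual setup amounted to a couple of environment variables honoured by the Spark 1.x pyspark launcher (the install path and port here are illustrative):

```shell
# Spark 1.x: tell the pyspark launcher to start IPython instead of the
# plain REPL, and to run it in notebook mode. Paths/ports are examples.
export IPYTHON=1
export IPYTHON_OPTS="notebook --ip=0.0.0.0 --port=8880"
/opt/spark/bin/pyspark   # opens a notebook server with a SparkContext as `sc`
```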
Distributed Data Frames
IPython supports many different languages and libraries. (Despite its name IPython is not restricted to Python; in fact, it is being refactored into more modular pieces as a part of the Jupyter project.) Most notebook users are data scientists, and the central abstraction that they work with is the data frame. Both R and pandas, for example, use data frames, although both systems were designed to work on a single machine.
The challenge is to make systems like R and pandas work with distributed data. Many of the solutions to date have addressed this problem by adding MapReduce user libraries. However, this is unsatisfactory for several reasons, but primarily because the user has to explicitly think about the distributed case and can’t use the existing libraries on distributed data. Instead, what’s needed is a deeper integration so that the same R and pandas libraries work on local and distributed data.
There are several projects and teams working on distributed data frames, including Sparkling Pandas (which has the best name), Adatao’s distributed data frame, and Blaze. All are at an early stage, but as they mature the experience of working with distributed data frames from R or Python will become practically seamless. Of course, Spark already provides machine learning libraries for Scala, Java, and Python, which is a different approach to getting existing libraries like R or Pandas running on Hadoop. Having multiple competing solutions is broadly a good thing, and something that we see a lot of in open source ecosystems.
Combining the Pieces
Imagine if you could share a large dataset and the notebooks containing your work in a form that makes it easy for anyone to run them—it’s a sort of holy grail for researchers.
To see what this might look like, have a look at the talk by Andy Petrella and Xavier Tordoir on Lightning fast genomics, where they used a Spark Notebook and the ADAM genomics processing engine to run a clustering algorithm over a part of the 1000 Genomes dataset. It combines all the topics above—open data, cloud computing, notebooks, and distributed data frames—into one.
There’s still work to be done to expand the tooling and to make the whole experience smoother; nevertheless, this demo shows that it's possible for scientists to analyse large amounts of data, on demand and in a way that is repeatable, using powerful high-level machine learning libraries. I'm optimistic that tools like this will become commonplace in the not-too-distant future.
I don't normally post LinkedIn approaches, especially from our competitors, but this one blew my "do your research" criterion so dramatically that it merits coverage.
FWIW my reply was: this is some kind of spoof, no?
On 01/16/15 2:56 AM, Jessica <surname omitted to avoid embarrassment> wrote:
I hope you are well?
We are currently hiring at Cloudera to expand our Customer Operations Engineering team.
We are looking to build this team significantly over the coming months and this is a rare opportunity to become involved in Cloudera's Engineering department.
The role is home based with very little travel required (just for training).
We are looking for people with strong Linux backgrounds and good experience with programming languages. It is not necessary to have experience of Hadoop - we will teach you !!
For the chance to be part of this team please send me your CV to <email omitted to avoid embarrassment>@cloudera.com alternatively we can organise a time to speak for me to tell you more about the role?
As an aside, I am always curious why recruiter emails always start with "I hope you are well?".
a) We both know that the recruiter doesn't care about my health as long as it doesn't impact my ability to work with colleagues, customers and, once my training in Hadoop is complete, maybe even learn to use things like DelegationTokenAuthenticatedURL —that being what I am staring at right now.(*)
b) We both know that she doesn't actually want details like "well, the DVLA consider my neurological issues under control enough for me to drive again —even down to places like the Alps, and the ripped up tendon off my left kneecap is manageable enough for me to do hill work when I get there"
(*) If anyone has got Jersey+SPNEGO hooked up to UserGroupInformation, I would love that code.
Anyone who was ever concerned with the level of surveillance in modern society by CCTV, helmet cams, the hacking of web-cams, and the use this can be put to by the nefarious activities of GCHQ and the US NSA, will be pleased that the headlong rush to turn us all into autonomous surveillance drones has paused for thought.
Let's hope Google use the pause to reflect on this.