Planet Apache
James Duncan: The pros and cons of image watermarks
Don Peters, who makes his living as a lawyer, gives his (not legal) opinion on whether or not it makes sense to watermark images online. Everyone’s experience is different, but as much as I’ve been infringed, I still think big watermarks are fugly.
Linked by James Duncan Davidson.
Justin Mason: Links for 2012-02-09
Blank Canvas Script Handler : ‘This extension lets you customize web sites by running bits of JavaScript on pages. It’s kind of an unofficial Greasemonkey for Chrome, and supports many of the GM_* functions used in most scripts.’
(tags: google-chrome chrome browsers javascript ui customization greasemonkey userscripts extensions via:mmeaney)
Isabel Drost: Note to self - Java heap analysis
I keep searching for these URLs over and over again, so I'm linking them here. When running into JVM heap issues (an out of memory exception is a pretty sure sign, and so is the program getting slower and slower over time) there are a few things you can do for analysis:
Start by telling the affected JVM process to output some statistics on heap layout as well as thread state by sending it a SIGQUIT (if you want to use the signal number instead, it’s 3; avoid typing 9 by mistake).
More detailed insight is available via jConsole. Remote setup can be a bit tricky but is quite doable and worth the effort, as it gives much more detail on what is running and what memory consumption really looks like.
For a detailed analysis take a heap dump with either jmap, jConsole or by starting the process with the JVM option -XX:+HeapDumpOnOutOfMemoryError. Look at it either with jhat or the IBM heap analyzer. NetBeans also offers nice support for searching for memory leaks.
On a more general note on diagnosing Java stuff, see Rainer Jung’s presentation on troubleshooting Java applications as well as Attila Szegedi’s presentation on JVM tuning.
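The heap and thread figures that SIGQUIT and jConsole expose can also be read from inside the JVM through the standard java.lang.management beans. The following is a minimal sketch, not part of the original notes: the class name HeapDiagnostics is made up for illustration, and dumpHeap relies on the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean, so it is assumed to run on a HotSpot-based JVM.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;

// Hypothetical helper class; the name is an assumption made for this sketch.
class HeapDiagnostics {

    /** Current heap usage as reported by the JVM's MemoryMXBean. */
    static MemoryUsage heapUsage() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Snapshot of all live threads, without lock details to keep it cheap. */
    static ThreadInfo[] threadDump() {
        return ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
    }

    /**
     * Writes a heap dump of live objects to the given file.
     * HotSpot-specific: the target file must not exist yet, and recent
     * JDKs expect the name to end in ".hprof".
     */
    static void dumpHeap(String file) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(file, true);
    }

    public static void main(String[] args) throws IOException {
        MemoryUsage heap = heapUsage();
        System.out.printf("heap used=%d committed=%d max=%d%n",
                heap.getUsed(), heap.getCommitted(), heap.getMax());

        for (ThreadInfo info : threadDump()) {
            System.out.printf("%s: %s%n", info.getThreadName(), info.getThreadState());
        }

        // Only write a dump when a file name is passed, since dumps can be large.
        if (args.length > 0) {
            dumpHeap(args[0]);
            System.out.println("heap dump written to " + args[0]);
        }
    }
}
```

The resulting dump file can be opened with the same analysis tools listed above.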
Jeroen Reijn: Get in control with Spring Insight!
The other day while commuting from home to work, I discovered the Spring Insight project. From what I've seen so far Spring Insight is a set of inspections (plugins) which are visually displayed in a web application. To get an idea of what Spring Insight can do for you, be sure to check out the introduction screencast.
Spring Insight comes with a default set of plugins/inspections for different kinds of frameworks/libraries like:
- Spring Web, Spring core
- JDBC
- Servlets
- Hibernate
- Grails
Writing your own Spring Insight plugin
Working with Hippo CMS driven web applications every day I had the idea of creating a Spring Insight plugin for the Hippo Site Toolkit (HST in short). The HST consists of a set of components that interact with the Hippo content repository. During a single request multiple components can be called and for each component there are multiple processing phases. So my initial idea for the Spring Insight plugin was to show:
- The amount of time taken for each processing phase of an HST component
- The time it takes to perform an HstQuery to the repository
Getting started
For this post we will focus on creating an inspection on performing HST queries. From the Insight web application view I would like to see the information of an HstQuery and the time it took to perform the actual query.
With AspectJ you can pick a join point and inspect, for instance, the execution of that join point. In our case I would like to inspect the HstQuery.execute() method. By putting the join point on the HstQuery interface, we've made sure that any object implementing HstQuery will be able to represent its data within the Insight web application.
Let's first take a look at what such an inspection looks like.
package com.jeroenreijn.insight.hst;

import com.springsource.insight.collection.AbstractOperationCollectionAspect;
import com.springsource.insight.intercept.operation.Operation;
import com.springsource.insight.intercept.operation.OperationType;
import org.aspectj.lang.JoinPoint;
import org.hippoecm.hst.content.beans.query.HstQuery;
import org.hippoecm.hst.content.beans.query.HstQueryResult;
import org.hippoecm.hst.content.beans.query.exceptions.QueryException;

/**
 * Aspect for collecting HstQuery executions.
 */
public aspect HstQueryOperationAspect extends AbstractOperationCollectionAspect {

    private static final OperationType TYPE = OperationType.valueOf("query_execute");

    public pointcut collectionPoint(): execution(HstQueryResult HstQuery.execute());

    public Operation createOperation(JoinPoint jp) {
        HstQuery query = (HstQuery) jp.getTarget();
        Operation op = new Operation()
                .type(TYPE)
                .label("HstQuery");
        op.sourceCodeLocation(getSourceCodeLocation(jp));
        try {
            op.put("query", query.getQueryAsString(false));
            op.put("limit", query.getLimit());
            op.put("offset", query.getOffset());
        } catch (QueryException e) {
            // ignore for now
        }
        return op;
    }
}
The more important part of the above collection aspect is the collectionPoint pointcut, where we define what kind of operation we would like to collect information from. In this case we define an inspection on the HstQuery.execute() method.
Next to the collection point you will also see the createOperation() method, which allows you to collect certain information from the current state of the collection point. In the above code snippet we collect the actual HstQuery object and get some information from it like the actual JCR XPath query, the limit set on the query and the offset. That's all for the information collection part of our plugin.
Now that we've created the aspect for the HstQuery, let's create a view for this inspection. You can create a FreeMarker template for each inspection if you want. For the HstQuery I've created the following template.
<#ftl strip_whitespace=true>
<#import "/insight-1.0.ftl" as insight />
<@insight.group label="HST Query">
    <@insight.entry name="Query" value=operation.query />
    <@insight.entry name="Limit" value=operation.limit />
    <@insight.entry name="Offset" value=operation.offset />
</@insight.group>
In the above template we define the values that we've put as attributes on our Operation object. All we have to do now is wire the operation and the view together inside the plugin configuration.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:insight="http://www.springframework.org/schema/insight-idk"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
       http://www.springframework.org/schema/insight-idk http://www.springframework.org/schema/insight-idk/insight-idk-1.0.xsd">
    <insight:plugin name="hst" version="${project.version}" publisher="Jeroen Reijn" />
    <insight:operation-view operation="query_execute" template="com/jeroenreijn/insight/hst/query.ftl" />
    <insight:operation-group group="Hippo" operation="query_execute" />
</beans>
So now that we've finished our plugin, we package it and drop it inside the collection-plugins folder of our Spring Insight instance. Next we fire up the VMware vFabric tc Server and do some requests on the web application that we would like to get some information from. Once that's done, switch the URL in the browser to '/insight' and there is the information collected by Spring Insight. The image below shows exactly the information that we tried to show.
In this example request you can see from the top of the call stack, the chain of filters that the request went through and all of the HST components. For each component you can now see the class, the window name (as you can also see in the CMS console) and the render path (the JSP or FreeMarker template) used for rendering the information of the component. You can also expand an HST component when it contains an HstQuery.
Having such a plugin might help us identify slow pages that have slow JCR queries or components that do extensive (unnecessary) processing.
Summary
Spring Insight is a very interesting project. Doing a quick scan for troublesome code is relatively fast, but for now it can only be done with the VMware vFabric tc Server, so you cannot run it in your personally preferred application container like Tomcat, Jetty or JBoss. I've personally added Spring Insight to my default set of tools for figuring out performance issues when I need to do a review of a project.
All of the above code and how to install this HST Spring Insight plugin can be found on the plugin project page on Github.
Sam Ruby: Dominoes
Alex Russell: @glazou being entirely reasonable in the face of vendor-driven CRAZY (implementing other people’s prefixes): glazman.org/weblog/dotclea… Via @phae.
Alex, I think you need to move up the food chain a little.
The root-cause is vendor-driven advocacy directed at content producers which encourages them to produce compelling content using experimental features. Everything else is consequences. If you believe that those consequences are CRAZY, then you must conclude that the root-cause is CRAZY.
Edward J. Yoon: Graph computing with Apache Hama
Here's the Hama version of the Single Source Shortest Path algorithm; it's the same as the one described in Google's Pregel paper:
public static class ShortestPathVertex extends Vertex<IntegerMessage> {

    public ShortestPathVertex() {
        super(IntegerMessage.class);
        this.setValue(Integer.MAX_VALUE);
    }

    public boolean isStartVertex() {
        String startVertex = getConf().get(START_VERTEX);
        return (this.getVertexID().equals(startVertex)) ? true : false;
    }

    @Override
    public void compute(Iterator<IntegerMessage> messages) throws IOException {
        int minDist = isStartVertex() ? 0 : Integer.MAX_VALUE;

        while (messages.hasNext()) {
            IntegerMessage msg = messages.next();
            if (msg.getData() < minDist) {
                minDist = msg.getData();
            }
        }

        if (minDist < (Integer) this.getValue()) {
            this.setValue(minDist);
            for (Edge e : this.getOutEdges()) {
                sendMessage(e.getTarget(), new IntegerMessage(e.getName(), minDist + e.getCost()));
            }
        }
    }
}
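To see how the compute() logic above converges, here is a small plain-Java simulation of the superstep loop. It is only a sketch and does not use the Hama API: the class SuperstepShortestPaths, its Edge type and the sample graph are all made up for illustration, and message delivery is modelled with simple in-memory lists instead of a distributed runtime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a vertex-centric runtime; not part of Hama.
class SuperstepShortestPaths {

    /** Directed, weighted edge to a target vertex. */
    static final class Edge {
        final int target;
        final int cost;

        Edge(int target, int cost) {
            this.target = target;
            this.cost = cost;
        }
    }

    /**
     * Runs supersteps until no vertex sends a message any more.
     * graph.get(v) holds the out-edges of vertex v. Unreachable
     * vertices keep Integer.MAX_VALUE as their distance.
     */
    static int[] shortestPaths(List<List<Edge>> graph, int start) {
        int n = graph.size();
        int[] value = new int[n];
        Arrays.fill(value, Integer.MAX_VALUE);

        List<List<Integer>> inbox = emptyBoxes(n);
        boolean messagesSent = true;

        while (messagesSent) {
            List<List<Integer>> outbox = emptyBoxes(n);
            messagesSent = false;

            for (int v = 0; v < n; v++) {
                // Same rule as compute(): the start vertex proposes 0,
                // every other vertex takes the smallest incoming distance.
                int minDist = (v == start) ? 0 : Integer.MAX_VALUE;
                for (int msg : inbox.get(v)) {
                    minDist = Math.min(minDist, msg);
                }
                // Only an improved distance is stored and propagated.
                if (minDist < value[v]) {
                    value[v] = minDist;
                    for (Edge e : graph.get(v)) {
                        outbox.get(e.target).add(minDist + e.cost);
                        messagesSent = true;
                    }
                }
            }
            // Messages sent in this superstep are delivered in the next one.
            inbox = outbox;
        }
        return value;
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    /** Five vertices; vertex 4 has no edges and stays unreachable. */
    static List<List<Edge>> sampleGraph() {
        List<List<Edge>> g = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            g.add(new ArrayList<>());
        }
        g.get(0).add(new Edge(1, 4));
        g.get(0).add(new Edge(2, 1));
        g.get(2).add(new Edge(1, 2));
        g.get(1).add(new Edge(3, 5));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(shortestPaths(sampleGraph(), 0)));
    }
}
```

On the sample graph, vertex 1 first receives the distance 4 directly from the start vertex and is later improved to 3 via vertex 2, which is why a vertex only forwards messages when its own value actually decreases.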
Bryan Pendleton: Six interesting papers
- Optimistic Replication Algorithms, by Yasushi Saito. This paper is more than a decade old, but I just came across it recently. It's a survey of overall replication strategies, with a categorization of the various approaches to help compare and contrast them.
Optimistic replication algorithms allow data presented to users to be stale but in a controlled way. A key feature that separates optimistic replication algorithms from pessimistic counterparts is the way object updates are handled: whereas pessimistic algorithms update all the replicas at once and possibly block read requests during the update application, optimistic algorithms propagate updates in background and allow any replica to be read directly most of the time.
According to the paper's taxonomy, the implementation that I've been working on at my day job is a single-master log-transfer system with eventual consistency.
- Btrfs: The Swiss Army Knife of Storage by Josef Bacik. This is a filesystems paper:
Btrfs is a new file system for Linux that has been under development for four years now and is based on Ohad Rodeh’s copy-on-write B-tree. Its aim is to bring more efficient storage management and better data integrity features to Linux. It has been designed to offer advanced features such as built-in RAID support, snapshotting, compression, and encryption. Btrfs also checksums all metadata and will checksum data with the option to turn off data checksumming.
This is a great example of the sort of paper that I often call a "principles and techniques" paper: rather than diving into low-level details, the paper describes the high-level principles that the btrfs implementation is using to build a production quality filesystem. It's extremely readable while also being very informative. One of the major differentiators of btrfs from other recent filesystems has to do with how they ensure consistency. For the past 10-20 years, the primary technique has been journaling, but btrfs has a new approach:
Traditional Linux file systems have used journals to ensure metadata consistency after crashes or power failures. In the case of ext this means all metadata is written twice, once to the journal and then to its final destination. In the case of XFS this usually means that a small record of what has changed is written to the journal, and eventually the changed block is written to disk. If the machine crashes or experiences a power failure, these journals have to be read on mount and re-run onto the file system to make sure nothing was lost. With Btrfs everything is copied on write. That means whenever we modify a block, we allocate a new location on disk for it, make our modification, write it to the new location, and then free the old location.
You either get the change or you don’t, so you don’t have to log the change or replay anything the next time you mount the file system after a failure—the file system will always be consistent .
- RemusDB: Transparent High Availability for Database Systems by Minhas, Rajagopalan, Cully, Aboulnaga, Salem, and Warfield. This paper presents a very interesting approach to providing a high-availability database, using the technology of modern cloud-computing-style server virtualization:
Two servers are used to provide HA for a DBMS. One server hosts the active VM, which handles all client requests during normal operation. As the active VM runs, its entire state including memory, disk, and active network connections are continuously checkpointed to a standby VM on a second physical server.
...
Remus’s checkpoints capture the entire state of the active VM, which includes disk, memory, CPU, and network device state. Thus, this captures both the state of the database and the internal execution state of the DBMS, e.g., the contents of the buffer pool, lock tables, and client connection state. After failover, the DBMS in the standby VM begins execution with a completely warmed up buffer pool, picking up exactly where the active VM was as of the most recent checkpoint, with all session state, TCP state, and transaction state intact. This fast failover to a warm backup and no loss of client connections is an important advantage of our approach.
- Fast Crash Recovery in RAMCloud, by Ongaro, Rumble, Stutsman, Ousterhout, and Rosenblum. This paper is also concerned with high availability and crash recovery, but this team takes a different approach:
RAMCloud keeps only a single copy of data in DRAM; redundant copies are kept on disk or flash, which is both cheaper and more durable than DRAM. However, this means that a server crash will leave some of the system’s data unavailable until it can be reconstructed from secondary storage.
RAMCloud’s solution to the availability problem is fast crash recovery: the system reconstructs the entire contents of a lost server’s memory (64 GB or more) from disk and resumes full service in 1-2 seconds. We believe this is fast enough to be considered “continuous availability” for most applications.
In order to accomplish their recovery-time objective, they have a massively-parallel recovery algorithm:
Each server scatters its backup data across all of the other servers, allowing thousands of disks to participate in recovery. Hundreds of recovery masters work together to avoid network and CPU bottlenecks while recovering data. RAMCloud uses both data parallelism and pipelining to speed up recovery.
- Modular Data Storage with Anvil, by Mammarella, Hovsepian, and Kohler. This paper discusses an experimental database system which is designed to allow researchers and system builders to experiment with various storage systems by showing how a database storage system can be constructed in a component-based fashion, based on a small set of carefully-designed storage system modules:
Two basic goals guided the design of Anvil. First, we want Anvil modules to be fine-grained and easy to write. Implementing behaviors optimized for specific workloads should be a matter of rearranging existing modules (or possibly writing new ones). Second, we want to use storage media effectively by minimizing seeks, instead aiming for large contiguous accesses. Anvil achieves these goals by explicitly separating read-only and write-mostly components, using stacked data storage modules to combine them into read/write stores. Although the Anvil design accommodates monolithic read/write stores, separating these functions makes the individual parts easier to write and easier to extend through module layering.
The basic Anvil component is an object called a dTable; the paper shows a number of details about dTables and also gives some good examples of how they can be composed, layered, and extended.
- Interval Tree Clocks: A Logical Clock for Dynamic Systems, by Almeida, Baquero, and Fonte. This paper discusses the nearly 40 year old problem of trying to reason about the ordering of events in distributed systems.
This paper addresses causality tracking in dynamic settings and introduces Interval Tree Clocks (ITC), a novel causality tracking mechanism that generalizes both Version Vectors and Vector Clocks. It does not require global ids but is able to create, retire and reuse them autonomously, with no need for global coordination; any entity can fork a new one and the number of participants can be reduced by joining arbitrary pairs of entities; stamps tend to grow or shrink adapting to the dynamic nature of the system. Contrary to some previous approaches, ITC is suitable for practical uses, as the space requirement scales well with the number of entities and grows modestly over time.
The paper is very carefully and clearly written, and I enjoyed this paper the most out of the whole set. The best thing about the paper is its use of examples and illustrations: the examples are carefully chosen to be complex enough to capture the power of their approach, while still being small enough to fit in a terse paper.
But more importantly, the paper uses a simply brilliant graphical notation to allow you to visualize the tree-manipulation techniques. The essence of the ITC approach is to use tree data structures to encode event histories, but standard notation for describing the technique is very hard to follow. The diagrammatic approach that the paper uses is beautiful and very elegantly conveys the essence of the technique.
By the way, it appears that the authors have subsequently contributed a running implementation of their approach in multiple languages as open source. Very nice!
Who knows if any of these papers are of interest to you. I found them all interesting, well-written, and worth the time. If the subject matter of any of them appeals to you, I think you won't be disappointed by studying them.
James Duncan: TED on iTunes U
TED has put together a curated collection of talks for students, educators, and life-long learners and made them available through iTunes U. With topic areas like Creative Problem Solving and Climate Change, it looks to be a great way to browse TEDTalks. The big photo on the iTunes U page is one I made of John Hunter during the closing session of TED 2011 in Long Beach. It’s from a remote D700 that I set up and triggered.
Amusingly enough (to me, anyway), I’m actually in this photo. It’s not obvious as I’m in the shadows, but if you look for a little red dot to the top right of the stage, that’s where I am. Here’s an enlargement of the screenshot above with me highlighted (apologies for the blurry, I’d go back and snag the original hi-resolution file, but today’s kinda busy):
How did I get into this photo from above while composing a photo of John from in front of the stage? Simple. I was holding a PocketWizard to trigger the remote camera against my lens with my left hand. It’s less clumsy than it sounds, really. When I wanted a photo from above, I’d just push the PocketWizard’s button with my left thumb. Easy peasy. The red light you see is the PocketWizard’s transmit indicator.
For a bit more of a behind-the-scenes look, Rachel Tobias documented how we set up the remote camera on the TED Blog last year.
Posted by James Duncan Davidson.
James Duncan: A Few D800 Conversation Points
Inevitably, there’s been a lot of chatter about the Nikon D800 today and it ranges all over the spectrum. In addition to showing up all over the web, it’s also invaded my inbox. “Dude! There’s so many megapixels!” is one common refrain. “Dammit, they didn’t make a stripped down D4 this time!” is another. Me, while I have a love/hate relationship talking about gear—in large part because so many people think it’s just about the gear—I have to say that I’m pretty stoked by the D800. I think a lot of good photographers are going to put this camera to excellent use.
Here are some snippets from conversations I’ve had about the new camera today:
I’m ready to make the jump to full frame. Should I get a D4 or a D800? First off, the price difference between the two cameras should answer this for a lot of people. If that doesn’t immediately answer it for you, the way I see it is that the D4 is a heavy, rugged, no-compromise camera that lets you work fast, get the image, and be able to use it for publication. The need for speed and grace in challenging-light conditions trumps the need for maximum resolution. The lighter D800, on the other hand, is geared for everyone else and trades off frame-rate and low-light prowess for resolution. Between the two, most people should get the D800.
But I wanted a reasonably sized full frame low-light champ and don’t need all that resolution! The ultimate noise characteristics of the D800 have yet to be determined and I won’t pass my own judgement till I see a lot of test results both on screen and in print. I’ll be surprised, however, if the resulting images don’t look as good as D700 images in the same light conditions when viewed at normal usage sizes from a noise and image-quality perspective.
That said, the D700 is still an amazing low-light camera and the introduction of the D800 doesn’t immediately stop it from making great images. If you can find a good deal on a gently used one, maybe you should consider one?
Those big files are going to stress our computers and storage, aren’t they? Yeah. They will. They’ll eat up space on disk faster and Aperture and Lightroom won’t move as fast as they do with 12 megapixel images. The sad fact is that while folks using word processors have already been taken care of for a while when it comes to computer horsepower, we photographers can still use a bit more help. Considering the impact to your workflow goes part and parcel with looking at a camera like this.
I downloaded a sample and at 1:1 it looks like… You really need to stop that thought right there. I know, the first thing many people do when we load up an image is zoom to actual pixels and peep. I confess, I do it too. But, do the math and you’ll find that when you zoom a D800 image to 1:1 on your screen, you’re looking at a small crop of a photograph that’s over 6 feet wide. The only person that is really going to ever see your images that way is you.
Furthermore, comparing 1:1 views of images from cameras with different resolutions will tell you different stories by definition. To really compare different cameras in terms of the kinds of images they make and how useful they are in different shooting conditions, you need to display or print at the sizes that you’ll use them at. Compare them at full screen or print ’em out at a decent size. That’s the only valid way to do it.
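For anyone who wants to check the "over 6 feet wide" figure, here is the rough arithmetic. It assumes the D800's 7360-pixel image width and a display of roughly 100 pixels per inch; the pixel density is an assumption, so a denser screen gives a smaller number.

```latex
\frac{7360\ \text{px}}{100\ \text{px/in}} = 73.6\ \text{in},
\qquad
\frac{73.6\ \text{in}}{12\ \text{in/ft}} \approx 6.1\ \text{ft}
```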
36 megapixels! That makes medium format digital obsolete! Not so fast. There are other aspects of medium format beyond simple resolution. The sensor size and the impact it has on the final image is the biggest. The great set of medium format lenses and their unique characteristics is another. Yes, the D800 stretches what a 35mm SLR format camera can do to a higher realm, but cameras from PhaseOne and Hasselblad will still have a place.
You’ll have to use impeccable technique to get good results, won’t you? Well, yes and no. It’s true you’ll notice the sins of your technique more when you zoom into a D800 image than one from a D700. Any camera shake blur will be much more noticeable at 1:1. But, again, 1:1 isn’t reality. Use equivalent technique on a D700 and a D800, print both images at 8x10, and you’ll see equivalent results. That’s not to say that technique won’t matter. It will demand very good shooting technique (not to mention incredible lenses) to get the maximum benefit from what the D800 sensor can give you. But shooting the same way as you always have, your images won’t suddenly get blurrier when you shoot with the D800. You’ll just have more headroom to improve your craft and achieve a sharper result than you’ve seen before.
Lenses? What about lenses? I’ll have to get… Stop right there. I know where you’re going with this. The same argument applies. While you’ll certainly be able to better see what your lenses are capable of when pixel peeping, the D800 won’t make your existing lenses any worse.
Look at it this way. There’s always going to be a limit in the equation. When digital photography first came on the scene in a big way, resolution was limited by what the sensor could provide. Over time, the gap has closed and now we’re at the point where lenses are quite often the limiting factor. With the D800, we’re probably going to see the limits of a lot of lenses. That’s not a bad thing. It’s just part of how everything interacts. It doesn’t require you to go out and replace all of your glass unless you want to win a competition of resolving resolution charts.
OMG, this camera is so much better than my 5D! Should I abandon my Canon kit right now? Probably not. The replacement for the 5D Mark II is certainly on its way and it’s certainly going to be in the same ballpark as the D800. I expect that while one will be better than the other in some ways, it’ll be close enough to be a wash. The only big question mark is whether or not Canon will finally stop putting a crippled AF system into the 5D.
Speaking as somebody who has made the expensive and time-consuming change over from Canon to Nikon, I gotta say that you need to have a clear and present need for it to make any kind of sense. If you have that kind of need, you know it. If you don’t know that there’s a specific reason to make a jump, then don’t ask the question.
This moiré stuff sounds scary. Should I get the D800E or stick to the regular one with the anti-aliasing filter? A lot of commentators are quick to point out the potential problem with moiré, especially with textiles, when there’s not an anti-alias filter in the mix. It’s true. Moiré can happen without an anti-alias filter. Talk to somebody with an M9 or an X100 if you want to hear from somebody’s experience about how often it happens.
If you don’t know much about this issue or don’t want to faff about with it, just get the regular D800. Seriously. Don’t worry about it. If, on the other hand, you’re well educated about what’s going on, then you know enough that you’re probably not asking the question.
But really, I just wanted a D700S with the D3S sensor! Ok, fine. Go talk to Nikon about it. Or, you could always wait and see what comes up next. There’s an obvious gap between the D7000 and the D800 that’s going to get filled. The question is whether that will be a DX or a FX camera. It’ll probably be a DX camera, but I could see Nikon start pushing FX further down the line.
Are you getting one? Why yes, I am. I put an order in for a D800E, if you must know. I adore the D3S, but many a time have I longed for a D700-sized body again for travel. Especially when I’m hiking up a hill. I also look forward to putting that resolution to work making big prints. And yes, I’ll be sure to let you know what I think of it when it arrives.
All of that said, remember that gear is just one part of the equation. The person who uses it and their skill set is by far more important. Just as a super awesome knife won’t make you a competent chef, a fancy-pants expensive camera won’t make you a competent photographer.
Posted by James Duncan Davidson.
Isabel Drost: Apache Mahout 0.6 released
As of Monday, February 6th a new Apache Mahout version was released. The new package features
Lots of performance improvements:
- A new LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation - try that out if you have been bothered by the way less than optimal performance of the old version.
- Improved Decision Tree performance and added support for regression problems
- Reduced runtime of dot product between vectors - many algorithms in Mahout rely on that, so these performance improvements will affect anyone using them.
- Reduced runtime of LanczosSolver tests - make modifications to Mahout more easily and have faster development cycles by faster testing.
- Increased efficiency of parallel ALS matrix factorization
- Performance improvements in RowSimilarityJob, TransposeJob - helpful for anyone trying to find similar items or running the Hadoop based recommender
New features:
- K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
- SSVD enhancements
Better integration:
- Added MongoDB and Cassandra DataModel support
- Added numerous clustering display examples
Many bug fixes, refactorings, and other small improvements. More information is available in the Release Notes.
Overall, great improvements towards better performance, better stability and integration. However, there are still quite a few outstanding issues and issues in need of review. Come join the project, help us improve existing patches, improve performance and in particular integration and streamlining of how to use the different parts of the project.
Ioan Eugen Stan: NIO Iterator over messages in mbox file
The idea is to provide an Iterator over all the messages in a mbox file and provide access to the raw data. You can then use mime4j to parse the message and do whatever. It's very good for use cases when you don't need the data to be processed and would like to do your own processing or just need access to the raw data.
The project uses Java NIO memory-mapped files to map the file into memory. This lets the OS manage the memory and file loading for you. You get a ByteBuffer that you can use to access the data. We are not using this ByteBuffer directly, because it holds bytes and we need characters, so we need to decode the bytes. We use a Charset to get a CharsetDecoder for the encoding we need. CharsetDecoder returns a CharBuffer instance, and because it implements CharSequence we can use a regex to determine boundaries between messages.
I was hoping that CharBuffer would share the memory with the MappedByteBuffer that we get initially but it seems that this is not the case. We have some memory copying here because the CharBuffer we get is an instance of java.nio.HeapCharBuffer. I was thinking we could have zero Java heap memory and use just the O.S. pages for holding the mbox data but as it turns out, the process of translating bytes to chars needs some memory to keep the chars.
It would have been nice to use asCharBuffer() and provide zero-copy access, but darn, maybe with a future Java version.
Back to the matter at hand:
When we find such a boundary, we return a slice()ed CharBuffer to that message. We also set position and limit so that we get just the message. This has the advantage of using the same memory as the CharBuffer we do matching on and avoids unnecessary memory copy operations.
I have tested it on a small mbox (135 KB) and it performs OK. I'm planning more tests with a 2 GB mbox. Using NIO memory mapped files we can map very large files and use the OS cache and buffer memory so we can avoid GC activity and unnecessary memory copy operations.
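The map, decode, match and slice steps described above can be sketched roughly as follows. This is not the code from the project: the class name MboxSplitSketch, the method name messages and the deliberately simple From_ regex are assumptions made for illustration, and the sketch maps the whole file in one go, so it only works for files below 2 GB.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and regex are assumptions, not the project's code.
class MboxSplitSketch {

    // Simplified boundary: any line that starts with "From " followed by a token.
    private static final Pattern FROM_LINE =
            Pattern.compile("^From \\S+.*$", Pattern.MULTILINE);

    /** Returns one CharBuffer view per message, all sharing the decoded chars. */
    static List<CharBuffer> messages(Path mbox, Charset charset) throws IOException {
        try (FileChannel channel = FileChannel.open(mbox, StandardOpenOption.READ)) {
            // Let the OS page the file in; a single mapping is limited to 2 GB.
            MappedByteBuffer bytes =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // Decoding copies into a heap CharBuffer, as noted in the post.
            CharBuffer chars = charset.newDecoder().decode(bytes);

            // CharBuffer is a CharSequence, so the regex can run on it directly.
            List<Integer> starts = new ArrayList<>();
            Matcher matcher = FROM_LINE.matcher(chars);
            while (matcher.find()) {
                starts.add(matcher.start());
            }

            List<CharBuffer> result = new ArrayList<>();
            for (int i = 0; i < starts.size(); i++) {
                int begin = starts.get(i);
                int end = (i + 1 < starts.size()) ? starts.get(i + 1) : chars.limit();

                // duplicate() shares the content; position/limit frame one message.
                CharBuffer view = chars.duplicate();
                view.limit(end);
                view.position(begin);
                result.add(view.slice());
            }
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: MboxSplitSketch <mbox file> <charset name>");
            return;
        }
        List<CharBuffer> found = messages(Paths.get(args[0]), Charset.forName(args[1]));
        System.out.println(found.size() + " messages found");
    }
}
```

Each returned buffer is only a window onto the one decoded CharBuffer, so no message text is copied again until a caller asks for a String.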
The following areas need improvements before the project is usable:
- testing for different types of mbox files
- better regex to match different mbox formats (mboxcl, mboxrd and the likes)
- better regex for matching From_ lines (now we miss some From_ lines that have MAILER_DAEMON instead of email address)
- support for mboxes larger than 2GB - use more ByteBuffers to map portions of the files, watch out for mails that spill over boundaries.
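For the last item, one possible approach is to map the file as a series of overlapping windows, each small enough for a single ByteBuffer. The sketch below is only an assumption about how that could look; ChunkedMapSketch, windows() and mapAll() are made-up names. A message that straddles a window boundary is only guaranteed to appear whole in some window if the overlap is at least as large as the largest message.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ChunkedMapSketch {
    // Compute {offset, length} pairs that cover [0, fileSize).
    // Consecutive windows overlap by 'overlap' bytes.
    static List<long[]> windows(long fileSize, long windowSize, long overlap) {
        if (overlap < 0 || windowSize <= overlap || windowSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 <= overlap < windowSize <= 2^31-1");
        }
        List<long[]> result = new ArrayList<>();
        long step = windowSize - overlap;
        for (long offset = 0; offset < fileSize; offset += step) {
            long length = Math.min(windowSize, fileSize - offset);
            result.add(new long[] {offset, length});
            if (offset + length >= fileSize) {
                break; // this window already reaches the end of the file
            }
        }
        return result;
    }

    // Map every window read-only. A mapping stays valid after the
    // channel that created it has been closed.
    static List<MappedByteBuffer> mapAll(Path file, long windowSize, long overlap)
            throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            List<MappedByteBuffer> maps = new ArrayList<>();
            for (long[] w : windows(channel.size(), windowSize, overlap)) {
                maps.add(channel.map(FileChannel.MapMode.READ_ONLY, w[0], w[1]));
            }
            return maps;
        }
    }
}
```

Messages found entirely inside an overlap region would show up in two windows, so a real implementation would also need to skip duplicates, for example by tracking absolute file offsets.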
I'm pretty new to NIO, so if you know how to do this better I'm open to suggestions and pull requests.
You will find a simple example that splits one mbox into individual messages in the project sources.
Happy hacking.
References and links:
[1] http://www.kdgregory.com/index.php?page=java.byteBuffer
[2] http://en.wikipedia.org/wiki/Mbox
[3] http://qmail.org/man/man5/mbox.html
[4] http://james.apache.org/mime4j/index.html
[5] https://github.com/ieugen/mbox-iterator
Andrew Savory: Ripple
Here’s my experience using the Ripple emulator for BlackBerry WebWorks.
There’s a bunch of awesome BlackBerry developers at the hackathon, but I’m determined to work this out without them walking me through it. After all, developers don’t normally have the opportunity to ask directly for help. And this way I get to discover all the dark corners of the BlackBerry developer experience.
Again, Ripple comes as a non-native installer. This time, the installation goes into /Applications/Research in Motion/ – I would prefer to have everything in /Developer/SDKs/Research In Motion/ so everything is in one consistent place. Or, since Ripple is an emulator for more than just WebWorks, just leave it in /Applications but drop the “Research in Motion” folder. And tidy up the app so the resources are all inside the app bundle. Basically, follow Mac best practice.
Launching the Ripple emulator application the first time results in a prompt in the middle of the screen, asking what platform you want to emulate:
Selecting “WebWorks” results in a huge emulator window with the device running off the bottom of the screen, and this is on my MacBook Pro running at 1680×1050. Are mobile screens really so big?
In discussion with some folks at the hackathon, it turns out the Windows version of Ripple has an option to scale the UI, but the Mac version does not.
I’ve got my packaged app from the previous exploration of creating a WebWorks app, but there doesn’t seem to be an obvious way to load it into the emulator.
Reading “Packaging your app with the BlackBerry WebWorks SDK” tells me about the different formats of files I discovered when creating my own app:
- .cod file for wireless distribution or distribution from a web page
- .alx file for distribution using BlackBerry Desktop Manager
- .jad file for distribution from a web page
- .cso file for application signing
- .csl file for application signing
Apparently there’s also a .bar file for a BlackBerry tablet. I can’t help but feel I’d like a single fat package for all eventualities.
There are instructions on running your application on a smartphone simulator, but the simulator is a VM and does not appear to be the Ripple emulator.
Reading Packaging your app in Ripple, you can package from within the emulator. You have to click the tiny wrench icon in the top-right corner of the emulator window. This should be much more prominent if this is a common task.
Unfortunately, clicking on the wrench prompts me for lots of configuration: SDK path, Project Root, archive name … all as text fields, and not file/folder pickers. There’s also no support for tab-completion of paths in the fields, so you’ll have to enter them long-hand:
Given RIM only recently acquired Ripple, I’ll cut them some slack. But I’d like to see for example a wrapper script that launches Ripple with all the correct configurations for SDK, project, etc.
The settings for smartphones are on the packaging page.
I’m guessing that my settings should be:
- SDK Path: /Developer/SDKs/Research In Motion/BlackBerry WebWorks SDK 2.3.0.9
- Project Root: /Users/savs/Downloads/blackberry-WebWorks-Samples-0a5693e/UIExamples
- Archive Name: UIExamples
- Output Folder: /tmp/Ripple
Trying with these settings, I got the familiar config.xml not found error:
Tweaking the settings,
- SDK Path: /Developer/SDKs/Research In Motion/BlackBerry WebWorks SDK 2.3.0.9
- Project Root: /Users/savs/Downloads/blackberry-WebWorks-Samples-0a5693e/ProjectRoot
- Archive Name: UIExamples
- Output Folder: /tmp/Ripple
That worked:
I ended up with UIExamples.zip inside /tmp/Ripple, and an “OTAInstall” folder and a “StandardInstall” folder.
The OTAInstall folder contains UIExamples .cod files, split into ten separate packages:
Apparently this is for backwards-compatibility reasons, with only packages of ~60k or less being allowed for an OTA install. This means that, when you deploy to a phone, you get to watch 10 different packages being installed before your app is ready for testing. Ouch.
Now I’ve built the packages, it’s not clear how to actually use the built application. The “Package and Launch” menu option is greyed-out.
Looking at the settings screen again, at the bottom beside Simulator it says “No simulators found”.
During the hackathon, the network failed. This results in some fairly unhelpful problems with Ripple, where you’ll see a blank loading screen for a long time followed by an error message:
This all went away when the network came back.
Reading the docs suggests another way to view your app in Ripple is to stick it on a web server and point Ripple at that. If you’ve got a local server, the benefit is a much quicker development cycle, without having to go through the packaging process first. Indeed, this did work and allowed me to see my app:
The downside on a Mac is that you can’t easily symlink your content from the web root to your development location (at least, not without making a ton of parent directories more widely accessible). See Creating a symbolic link in Sites directory on StackOverflow for more details.
Anyway, success of sorts: I got my app packaged, and I got to view the development files via HTTP.
Next up: signing.
Edward J. Yoon: Terminate AWS instances with Java SDK
Ian Boston: Access Control Lists in Solr/Lucene
This isn’t so much about access control lists in Solr or Lucene but more about access control lists in an inverted index in general. The problem is as follows. We have a large set of data that is access controlled. The access control is managed by users, and they can make individual items closed or open or anywhere in between. The access control lists on the content, which may be files or simply bundles of metadata, take the form of 2 bitmaps representing the permissions granted and denied, each pair of bitmaps being associated with a principal, and the set of principal/bitmap pairs being associated with each content item. A side complication is that the content is organised hierarchically, and permissions for any one user are inherited following the hierarchy back to the root of the tree. Users have many principals through membership of groups, through directly granted static principals and through dynamically acquired principals. All of this is implemented outside of Solr in a content system. It’s Solr’s task to index the content in such a way that a query on the content for an item is efficient and returns a dense result set that can have the one or two content items that the user can’t read filtered out before the user gets to see the list. I.e. we can tolerate a few items the user can’t read being found by a Solr query, but we can’t tolerate most being unreadable. In the ACL bitmaps, we are only interested in the read permission.
The approach I took to date was to look at each content item or set of metadata when it’s updated, calculate the set of all principals that can read the item and add those principals as a multivalued keyword property of the Solr document. The query performed by a user computes the principals that the user has at the time they are making the query, and builds a Solr query that gets any document matching the query and with a reading principal in that set. Where the use of principals is moderate, this works well and does not overload either the cardinality of the inverted index where the reader principals are stored in Solr or the size of the Solr query. In these cases the query can be serviced as any other low cardinality query would be, by looking up and accumulating the bitmap representing the full set of documents for each reader principal in turn. The query then requires n lookup and accumulate operations, where n is the number of principals the user has, to resolve the permissions part of the query.
However, and this is the reason for this post, where this fails is where the cardinality of the reader principals becomes too high, or the number of principals that a user has is too high. Unfortunately those two metrics are connected. The more principals there are in a system, the more a user will need to access information, and so the reader principal mechanism can begin to break down. The alternative is just as unpleasant, where the user only has a single principal, their own. In those scenarios active management of ACLs in the content system becomes unscalable both in compute and human terms, which is why principals representing groups were introduced in the first place. Even if there were no limits to the size of a Solr query, the cost of processing 1024 terms is prohibitive for anything other than offline processing.
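To make the query side of this concrete, here is a minimal sketch of how the permissions clause could be assembled from the user's principals. It only builds a string in standard Solr query syntax; the class name ReaderFilterSketch and the field name "readers" are assumptions for illustration, not names from the actual system.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only.
class ReaderFilterSketch {
    // Build a clause such as readers:("alice" OR "group:staff").
    // Each principal is quoted so that characters like ':' are taken
    // literally; backslashes and quotes inside a principal are escaped.
    static String readerFilter(String field, Collection<String> principals) {
        if (principals.isEmpty()) {
            throw new IllegalArgumentException("a user needs at least one principal");
        }
        StringJoiner clause = new StringJoiner(" OR ", field + ":(", ")");
        for (String principal : principals) {
            String escaped = principal.replace("\\", "\\\\").replace("\"", "\\\"");
            clause.add("\"" + escaped + "\"");
        }
        return clause.toString();
    }

    public static void main(String[] args) {
        System.out.println(readerFilter("readers", Arrays.asList("alice", "group:staff")));
    }
}
```

The clause grows by one term per principal, which is exactly the n lookups described above and the reason the approach degrades once a user holds hundreds of principals.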
One solution that has been suggested is to use a Bloom filter to represent the set of principals that the user has and test each indexed principal against this filter. If this is done as part of the query, as the result set is being created, there is no gain over scanning all documents since the inverted index would not be used. There could be some benefit in using this approach once a potential set of documents is generated and sorted, since the cost of performing sufficient hashes to fill the appropriate set of Bloom buckets is low enough that it could be used as a post-query filter. I think the term in Solr is a Collector. In this scenario we are already verifying the user can read a content item or its metadata before delivering to the user, hence it’s acceptable to have a less than perfect set of pointers being emitted from Solr, provided that the set of pointers we retrieve is dense. We can’t afford to have a situation where the set of pointers is sparse, say several million items, and the only item the user can read is the last one. In that scenario any permissions checking performed without the benefit of Solr would be super expensive.
So, the Bloom filter applied within Solr has the potential to filter most result sets rapidly enough to create a result set that is dense enough for more thorough checking. How dense does it need to be and how large does the Bloom filter need to be? That is an open-ended question; however if, on average, you had to read 20% more elements than you returned, that might not be excessive if result sets were generally limited to no more than the first 100 items. If that’s the case then 80% density is enough. Bloom provides a guarantee of no false negatives but a probability of a % of false positives, i.e. items that a Bloom filter indicates are present in a set but which are not. For classical Bloom filters 20% is an extremely high probability of false positives, certainly not acceptable for most applications. It has been reported in a number of places that the quality of the hash function used to fill the buckets of the filter is also of importance in the number of false positives, since an uneven distribution of hashes over the number space represented by the Bloom bitmap will result in inefficient usage of that bitmap. Rather than doing the math, which you will find on Wikipedia, knowing that all fast hashes are less than perfect, and being a pragmatist, I did an experiment. In Apache Hadoop there are several implementations of the Bloom filter with 2 evenly distributed and efficient hash functions, Jenkins and Murmur2, so I have used that implementation. What I am interested in is how big a filter I would need to get 80% density in a set of results, and how big that bitmap would need to be as the number of inputs to the Bloom filter (the user’s principals) rises. It turns out that very small bitmap lengths will give sufficient density where the number of input terms is small, even if the number of tested principal readers is high. So 32 bytes of Bloom filter is plenty large enough to test with < 20 principals.
Unfortunately however, the cardinality of these bitmaps is too high to be a keyword in an inverted index. For example, if the system contained 256K principals, and we expected users on average to have no more than 64 principals, we would need a Bloom filter of no more than 256 bits to generate, on average, 80% density. Since we are not attempting to index that Bloom filter, the cardinality of 2^256 is not an issue. Had we tried to, we would almost certainly have generated an unusable inverted index. Also, since that Bloom filter is constructed for each user’s query, we can dynamically scale it to suit the conditions at the time of the query (number of items in the system, and number of principals the user has). Real systems with real users have more principals, and sometimes users with more principals. A system with 1M principals that has on average 1024 principals per user will need a Bloom filter containing about 8Kbits. It’s certain that adding an 8Kbit token (or a 1Kbyte byte[]) as a single parameter to a Solr query circumvents the issue surrounding the number of terms we had previously, but it’s absolutely clear that the cardinality of 2^8192 is going to be well beyond indexing, which means that the only way this will work is to post-filter a potentially sparse set of results. That does avoid rebuilding the index.
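The post-filter idea can be sketched in a few lines of Java. This is not the Hadoop implementation used in the experiment: PrincipalBloomSketch is a made-up class, and the two hash values (String.hashCode() combined with an FNV-1a style mix) are simple stand-ins for Jenkins or Murmur2, chosen only to keep the example self-contained. The filter is built from the user's principals, and a document passes if any of its reader principals might be in the filter.

```java
import java.util.BitSet;
import java.util.Collection;

// Hypothetical helper, for illustration only.
class PrincipalBloomSketch {
    private final BitSet bits;
    private final int sizeInBits;
    private final int hashCount;

    PrincipalBloomSketch(int sizeInBits, int hashCount) {
        this.bits = new BitSet(sizeInBits);
        this.sizeInBits = sizeInBits;
        this.hashCount = hashCount;
    }

    // FNV-1a style mix over the chars of the string (stand-in hash).
    private static int fnv1a(String s) {
        int hash = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 16777619;
        }
        return hash;
    }

    // Derive the i-th bucket from two base hashes (double hashing).
    private int bucket(String principal, int i) {
        int h1 = principal.hashCode();
        int h2 = fnv1a(principal) | 1; // keep the step odd so it is never zero
        return Math.floorMod(h1 + i * h2, sizeInBits);
    }

    // Add one of the user's principals to the filter.
    void add(String principal) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(bucket(principal, i));
        }
    }

    // False means definitely absent; true means possibly present.
    boolean mightContain(String principal) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(bucket(principal, i))) {
                return false;
            }
        }
        return true;
    }

    // Post-filter check: keep a document if any of its reader
    // principals might be one of the user's principals.
    boolean mayRead(Collection<String> documentReaders) {
        for (String reader : documentReaders) {
            if (mightContain(reader)) {
                return true;
            }
        }
        return false;
    }

    // Standard approximation (1 - e^(-k*n/m))^k for the false positive
    // probability of a single lookup, assuming ideal hash functions.
    static double expectedFalsePositiveRate(int bitsM, int hashesK, int itemsN) {
        return Math.pow(1.0 - Math.exp(-(double) hashesK * itemsN / bitsM), hashesK);
    }
}
```

Because a document can carry many reader principals, the chance that an unreadable document slips through grows with the number of readers tested per document, so the per-lookup rate from expectedFalsePositiveRate() is only a lower bound on the per-document rate. Documents that pass still need the authoritative check in the content system.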
From this small experiment I have some questions unanswered:
- Will converting a potentially sparse set of results be quick enough, or will it just expose another DoS vector?
- What will be the practical cost of performing 100 (principals) x 20 (I was using 20 hashes) hash operations into an 8Kbit filter to filter out each returned doc item?
- Will the processing of queries this way present a DoS vector?
Justin Mason: Links for 2012-02-07
lrzip : ‘Lrzip uses an extended version of rzip which does a first pass long distance redundancy reduction. The lrzip modifications make it scale according to memory size. [...] The unique feature of lrzip is that it tries to make the most of the available ram in your system at all times for maximum benefit. It does this by default, choosing the largest sized window possible without running out of memory.’
(tags: zip compression via:dakami gzip bzip2 archiving benchmarks)
Emmanuel Lecharny: Clueless...
------------
Apache,
I'd like to add you to my professional network on LinkedIn.
- Chris
Chris XXX Lead XXXXXX Technical Recruiter at YYYYY
San Francisco Bay Area
Bryan Pendleton: Chrome is dropping CRL checking
Google's Adam Langley explains why, and this Ars Technica article adds some more context.
As Langley says:
So soft-fail revocation checks are like a seat-belt that snaps when you crash. Even though it works 99% of the time, it's worthless because it only works when you don't need it.
While the benefits of online revocation checking are hard to find, the costs are clear: online revocation checks are slow and compromise privacy. The median time for a successful OCSP check is ~300ms and the mean is nearly a second. This delays page loading and discourages sites from using HTTPS. They are also a privacy concern because the CA learns the IP address of users and which sites they're visiting.
Seems like pretty good reasoning to me.
Rich Bowen: Biashara Street
Biashara Street
February 5, 2012
From WeekendWordsmith.com
Step away from the
odour of bodies and exhaust into a
chutney of cardamom
cinnamon
ginger
garlic
Sacks of
cashews overflow onto
floors covered with boxes,
cartons,
and more heaps of
burlap bags
full of jasmine rice,
basmati rice,
long grain brown rice
from exotic places I
dream of going, some day.
In this quarter mile of
dusty street
are gathered all the spices of the world,
from Sri Lanka,
Singapore,
and far-away San Francisco.
Tea, coffee and
cocoa pods
lend their aroma to the
general cacophony of smells,
discordant, but, somehow
a symphony in a thousand voices.
Knowing that school uniforms
are only a street or two over,
I stand and breathe deeply
of the cloves,
curry powder,
and saffron.
For the Weekend Wordsmith - Chutney
Anton Tagunov: The challenge of a back door
Now here's a tough question I haven't been able to resolve.
I absolutely demand the universal freedom to know.
Yet I do not want the bad guys to take over computers.
I want to be able to hack my computer myself.
I want nobody else to be able to hack it.
How do I achieve both? Where's the balance?
I do not know. If you do know I'd like to hear from you.
Anton Tagunov: Need for an open-source mainstream capability-based OS
This situation paints for me the following picture: a tap is running, malware flowing like water into a sieve and onto the floor. The security industry is frantically mopping the floor, trying to stem the flow of malware. They are paid well for their trouble, but meanwhile the expensive rug that represents your business is getting awfully wet. It would be nice if someone could turn off the tap, or design an operating system that doesn’t leak like a sieve
Barrelfish?
A secure OS gotta be capability based.
It's gotta be performant on multi-cpu boxes.
Barrelfish might be both.
Warning: capability based OS can be really restrictive.
It can be very non free (RMS will hate it).
Remote attestation performed by a TPM chip is the issue.
BIOS tells the TPM chip the hashcode of the OS. The OS tells the chip the hashcode of your movie player. TPM chip signs the hashcode with a secret key. An MPAA member checks the signature against a database of all TPMs ever produced. If satisfied it provides you with a personal copy of a movie. To watch the movie you need a one-off key. This key is given to you. But the key is itself encrypted. Only your TPM can decrypt it. And your TPM will only decrypt it if correct hash-sums have been provided to it after the last machine restart. Unless BIOS has been broken a 3rd party can really verify what software you're running!
The only reason this is not happening now is that a myriad of drivers are running in kernel mode. It is not possible to check that your particular combination of drivers + OS comply with MPAA requirements.
But with a capability based OS there would be a very small OS core.
And it would be possible to sign it.
The 3rd parties would be able to check that it hasn't been hacked or deny useful services.
The TPM chips would lock us from our own computers!
We would no longer have the freedom to hack, the freedom to know.
Solution?
Programmers of good will should create a practical capability based OS
before commercial vendors do. They should make it so popular that nobody in
their right mind would want to replace it with a commercial alternative.
And it should be both secure and free.
Free as in GPL v3.
Free as in free to hack.
Both secure and free to hack. That's a challenge. More on this later.

