The dressing and arming of a warrior is a common set scene in epic poetry, e.g., Iliad 2:
He put on a soft khiton,
fine and newly made, and put around himself a great cloak.
Under his shining feet he fastened fine sandals
and around his shoulders he placed a silver-studded sword.
He took up the ancestral scepter which is always unwilting.
The structure and contents of such scenes have been well studied by scholars (e.g., Armstrong 1958) and even parodied, as in Pope’s mock epic The Rape of the Lock:
Now awful Beauty puts on all its Arms;
The Fair each moment rises in her Charms,
Repairs her Smiles, awakens ev’ry Grace,
And calls forth all the Wonders of her Face;
Sees by Degrees a purer Blush arise,
And keener Lightnings quicken in her Eyes.
The busy Sylphs surround their darling Care;
These set the Head, and those divide the Hair,
Some fold the Sleeve, while others plait the Gown;
And Betty’s prais’d for Labours not her own.
However, the dressing of the 21st-century casual American male appears to lack rigorous analysis, a deficiency I hope to remedy, at least by furthering understanding of the dependency constraints of this activity.
It is well known that underpants must be donned before pants. Despite the intriguing experimentation by Rowan Atkinson, no practical alternative has been found. Similarly, socks must be put on before shoes, pants before shoes, and both pants and shirt before the belt can be buckled.
Illustrating the topological ordering as a directed graph, we have the following:
Within these constraints many dress orderings are possible, some of the more common ones being:
- underwear, socks, pants, shirt, shoes, belt
- underwear, pants, shirt, belt, socks, shoes
- underwear, shirt, pants, socks, belt, shoes
Orderings like the above are familiar to most people. However, there are many other possibilities, some perhaps worthy of further exploration:
- socks, shirt, underwear, pants, shoes, belt
- shirt, socks, underwear, pants, belt, shoes
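The dependency constraints described above can be checked mechanically. The following is a minimal sketch rather than a rigorous treatment: the garment names and the `is_valid` helper are illustrative choices, and the two socks are treated as a single item.

```python
from itertools import permutations

# Each pair (a, b) means "a must be put on before b".
CONSTRAINTS = [
    ("underwear", "pants"),
    ("socks", "shoes"),
    ("pants", "shoes"),
    ("pants", "belt"),
    ("shirt", "belt"),
]
GARMENTS = ["underwear", "socks", "pants", "shirt", "shoes", "belt"]


def is_valid(ordering):
    """Return True if the ordering respects every constraint."""
    position = {garment: i for i, garment in enumerate(ordering)}
    return all(position[a] < position[b] for a, b in CONSTRAINTS)


# Brute force is fine here: six garments give only 720 permutations.
valid_orderings = [p for p in permutations(GARMENTS) if is_valid(p)]

print(is_valid(["socks", "shirt", "underwear", "pants", "shoes", "belt"]))
print(len(valid_orderings))
```

Under these five constraints the brute-force search finds 30 valid orderings out of 720 permutations, so the orderings listed here are only a sample.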
It will also be appreciated by those practiced in the art that the two socks need not be put on together. This permits extravagant orderings like:
- left sock, shirt, underwear, pants, belt, right sock, shoes
- right sock, underwear, pants, left sock, shoes, shirt, belt
There is also nothing that prevents a Towers of Hanoi approach for those with time to kill, where -X indicates that X is to be removed:
- pants, shoes, shirt, -shoes, socks, -pants, underwear, pants, shoes, belt
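Sequences with removals can be checked with a small simulation. The sketch below rests on an assumed reading of the rules, namely that "X before Y" means X is an inner layer that cannot be put on or taken off while Y is being worn; the `simulate` helper and the garment names are illustrative only.

```python
# Each pair (inner, outer): the inner garment cannot be put on or taken
# off while the outer one is being worn. This is an assumed reading of
# the "X before Y" rules, chosen for illustration.
LAYERS = [
    ("underwear", "pants"),
    ("socks", "shoes"),
    ("pants", "shoes"),
    ("pants", "belt"),
    ("shirt", "belt"),
]
ALL_GARMENTS = {"underwear", "socks", "pants", "shirt", "shoes", "belt"}


def simulate(steps):
    """Run a sequence of steps, where "-x" removes x. Return True if every
    step is legal and the wearer ends up fully dressed."""
    worn = set()
    for step in steps:
        removing = step.startswith("-")
        garment = step.lstrip("-")
        # No outer layer may be on while the inner garment is handled.
        if any(inner == garment and outer in worn for inner, outer in LAYERS):
            return False
        if removing:
            if garment not in worn:
                return False
            worn.remove(garment)
        else:
            if garment in worn:
                return False
            worn.add(garment)
    return worn == ALL_GARMENTS


hanoi = ["pants", "shoes", "shirt", "-shoes", "socks", "-pants",
         "underwear", "pants", "shoes", "belt"]
print(simulate(hanoi))
```

With this reading, the Towers of Hanoi sequence is accepted, while a sequence that tries to add underwear over pants is rejected.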
Hopefully the above gives ideas for further exploration and experimentation. Although we do not dress and arm ourselves to fight the Trojans, our morning ritual can be equally an epic experience!
The site, Immunicity.org, offers a proxy server and a proxy autoconfiguration file (PAC) to tell browsers to access various blocked sites (PirateBay, KickassTorrents et al) via the proxy. The Police Intellectual Property Crime Unit has arrested a 20-year-old man in Nottingham on suspicion of copyright infringement for running a proxy server providing access to other sites subject to legal blocking orders. Is operating a proxy server illegal? Interesting. Seems unlikely that this will go to court though. (Via TJ McIntyre)
Normally Ars Technica is one of my favorite web sites; they do strong, well-researched reporting and have good writers and editors.
But this recent article, which is getting a fair bit of attention, disappointed me greatly: How Microsoft dragged its development practices into the 21st century.
The author attempts to tell a story about changing development process approaches over the last few years, using Microsoft as his source of examples, and specifically uses the Visual Studio team in Microsoft as a focus of his discussion.
Avoiding any hard data, the article tries to use a narrative approach, spreading a wide net, touching on a vast universe of subjects, and dropping anecdotes spanning a 20-year time frame.
But the article fails on many levels.
It starts off poorly by making the classic mistake of trying to compare completely unrelated industrial processes, when clearly the author knows nothing about any of the industries and how they actually operate:
In industries such as manufacturing and construction, design must be done up front because things like cars and buildings are extremely hard to change once they've been built. In these fields, it's imperative to get the design as correct as possible right from the start. It's the only way to avoid the costs of recalling vehicles or tearing down buildings.
The design of cars, and the design of buildings, is in fact extremely iterative. Car designers and building architects use all sorts of tools (design sketches, scale models, 3D printers, computerized visualizations) to get all sorts of feedback during the design process.
Then he compounds his error by trying to compare completely unrelated types of software development, comparing the development of products like Windows, SQL Server, Exchange, or Microsoft Office with an entirely different sort of software:
For example, lots of companies develop in-house applications to automate various business processes. In the course of developing these applications, it's often discovered that the old process just isn't that great. Developers will discover that there are redundant steps, or that two processes should be merged into one, or that one should be split into two. Electronic forms that mirror paper forms in their layout and sequence can provide familiarity, but it's often the case that rearranging the forms can be more logical.
Developing a custom in-house application, with users who are part of the same organization that performs the development, to solve a specific problem for that organization, has almost nothing in common with developing a general-purpose piece of software like SQL Server or Windows, which is used in all sorts of different environments, by organizations that have nothing to do with Microsoft, for all kinds of different purposes, by users who have never been, nor ever will be, Microsoft employees.
And a process that works for a 3-person team building an internal website has no hope of scaling to a process that allows 1,500 or more individuals to collaborate on an operating system used on several billion computers across the planet.
And the article descends into tragedy when the author reveals that he has never been part of a team that tried to build truly reliable software to address a truly complex project:
the result was a two-year development process in which only about four months would be spent writing new code. Twice as long would be spent fixing that code.
I've worked on database servers, on web middleware, on networking protocols, on operating systems.
You start by carefully hiring the best possible developers you can find. You give them the best possible conditions you can provide (quiet environment, access to plenty of computer resources, meeting rooms to work out ideas cooperatively, excellent development tools to make them as efficient as possible).
You provide them with immense amounts of support: testing tools, hardware labs with test resources, testing experts, interaction designers, brilliant technical writers, and so on.
But even with all of this, building a product like SQL Server or Windows 8 is just HARD.
It's more than hard, it's nearly impossible.
If Microsoft manages to ONLY spend twice as long testing, stabilizing, and bullet-proofing their software as it took them to design and build it in the first place, I'm frankly astonished. I would have predicted 3-4 times as long.
And the fact that they can continue to roll out new releases of Office, of Windows, of SQL Server, every 2-3 years, on code bases that are nearly 3 decades old, with products that have to provide backward compatibility and upgrade paths for billions of users and existing installations, is a track record that nobody else in the industry can match.
Face it, software of this type is just fundamentally incredibly hard to build.
So while it's useful to look at alternate processes and techniques, and suggest things that might help improve the process, to walk up to a 25-year-old organization like Microsoft, who certainly have their faults, but who have figured out how to deliver an amazing stream of powerful and widely-accepted software, and then provide nothing but anecdotes and bad analogies, is just not what I expect of a publication like Ars Technica.
Maybe I totally missed the point of this article.
If so, drop me a line and tell me where I went wrong!
It’s been a while. Don’t worry. It’s not you, it’s me. In the process of moving to Berlin, selling my residence in Portland, starting to work on Wunderlist 3, and generally starting up a new life in a new country, my hopes to keep friends and family updated through blog posts, newsletters, and more fell flat. But, yah, things are good. I love living in Berlin, the new job is working out, and I’ve even managed to squeeze in a few trips to the Mediterranean for a few short bits of vacation. Katerina even arranged a lovely surprise two-day trip to Copenhagen for my birthday.
Now that I’ve settled into the pace of life here a bit more, it’s time to circle back and start blogging again. And, as always happens, I couldn’t help but mess with how I do things. Backing this site with Tumblr was fun while it lasted, but in the end, I want more control over the horizontal and vertical. More to the point, I want to continue to experiment, maybe blow things up in the process, and enjoy getting my hands a bit deeper in again. To that end, I’ve picked up Jekyll and have started hacking away. Watch out.
As to the old stuff, I put in redirects for the old Tumblr content, so links shouldn’t rot. That was easier to do than to migrate the content over and futz with formatting, redirect, and the like—at least for now. I’d rather write new stuff than worry too much about the previous round of stuff.
So, what else is new? Well, Wunderlist 3 launched last week. Luma Labs is bringing back the Loop—not quite yet announced, but here’s the sneak peek pre-order page. And, Katerina and I are finally moving into our own apartment in Berlin after spending the year so far in a temporary place.
Alles gut. More to come.
brilliant. a great threadless sub from Threadless user NickOG back in 2012
Excellent: ‘a Twitter-fueled link aggregator that favors new projects/sites over news/articles’ from Andy Baio.
Ah, I was waiting for this; rest-of-world-style carpooling on demand, in an app. Great stuff
I loved this recent essay from one of the top PostgreSQL developers: Memory Matters.
Excerpt:
Under any circumstances, reading from disk is vastly slower than reading from memory, but reading data from disk sequentially is 10 to 100 times faster than random I/O. Unfortunately, it's often the case that the task which is evicting data from memory is writing data sequentially while the underlying database workload is typically accessing some working set of pages in a more-or-less random fashion. The result is that data is removed from the cache at a vastly higher rate than it can be read back in. Even after the bulk operation terminates and the cache-purging ceases, it can take a painfully long time - sometimes many hours - for random I/O to bring all of the hot data back into memory.
I think it's possible that he meant "reading data sequentially," not "writing data sequentially," but that's a nit.
The whole essay is great. I probably give this advice about 3x a week at work, and never word it anywhere nearly so well, so it's nice to have a great reference to point to.
And I really love his two final points:
- If adding memory doesn't seem to help, it's possible that you just haven't added enough.
- Memory is different: because of the way operating system and PostgreSQL caching works, it's likely that substantially all of your memory will be in use all the time
librarian-puppet 1.3.1 and 1.0.8 [changelog] include two important changes. First, there is no longer any need to create a Puppetfile if you have a Modulefile or metadata.json; it will use them by default. Of course you can add a Puppetfile to bring in modules from git, a directory, or github tarballs.
The other change is that all the dependencies’ metadata.json files will be parsed now for transitive dependencies, so it works with the latest Puppet Labs modules and those migrated from the old Modulefile format going forward. That also means that the puppet gem is no longer needed if there is no Modulefile present in your tree of dependencies, which was a source of pain for some users.
The 1.0.x branch is kept updated to run in Ruby 1.8 while 1.1+ requires Ruby 1.9 and uses the Puppet Forge API v3.
This sounds like a nice way to do effective peer-driven team reviews without herculean effort, which were one of the most effective reviewing techniques (along with upwards reviewing of management) I encountered at Amazon. (Yes, the Amazon approach was very time-consuming and universally loathed.) The potential downside I can see is that it doesn’t give the reviewer enough time to revise any review comments they have second thoughts about, whereas written reviews do, but that would be an easy fix at the end of the process. Also, it’s worth noting that in most cases, a good review requires a bit of time to marshal thoughts and come up with a coherent review of a peer, so this doesn’t completely avoid the impact on effort. Still, a definite improvement I would say.
A Java-oriented practical intro to the MinHash duplicate-detection shingling algo
John Rauser, Velocity, June 2010. Good data on real-world web perf based on the limitations which TCP and the speed of light impose
Baroness Warsi resigns over a matter of principle. Good to know there’s still a government minister not entirely without principles. Oh .. erm .. hang on ….
But what took her so long? It’s not as if Gaza is the first foreign problem in which our government has behaved disgracefully on her watch. It’s not even as if this was one of the conflicts for which we bear the most substantial responsibility – at least not in our times. Not like those heavily provoked in the first place by western agents provocateurs (like Syria or Ukraine), or the legacy of actual military action (like Libya). Maybe she protests her principles just a tad too much?
How will history view her? I guess precedents like Robin Cook show that a resignation can do a lot to redeem a reputation, even if it comes long after your hands are covered in blood.
This serves as a public notice that my beloved Motorola Razr M phone died (no longer bootable; local data not recovered) a few weeks ago, immediately before flying to attend OSCON 2014. Since I was busy at the conference, and since I couldn’t decide on what to upgrade to, I took a while replacing my phone. It was a strange experience traveling without a cell phone, I must say!
Please note that I now have a working Moto X, which I love, and which I’m still working through setting up.
This is important for various two-factor authentication setups I have that used my old phone, which I now need to figure out how to re-create. Ugh.
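DataCrowds is a crowdsourcing service (datacrowds.com, hereafter DataCrowds) that solves companies' big data problems through collaboration with outside experts and the general public. It opened its beta service on July 3.
Crowdsourcing is a method by which companies or organizations draw on the collective intelligence of crowds to achieve a specific goal. Various global companies such as Facebook, Walmart, and Ford Motor are already solving big data problems through crowdsourcing.
Beatpacking Company (beatpacking.com, CEO Park Su-man) explained its reasons for running a competition through DataCrowds, saying that the biggest appeal is being able to obtain a wide range of excellent ideas and solutions at low cost through competition among many participants.
Anyone can enter this competition with no eligibility restrictions. In addition to generous prize money, each winner or winning team will be offered the opportunity to implement a recommendation engine together with Beatpacking Company in the future.
Yoon Jin-seok, CEO of DataSayer, said, “The response from companies was far more positive than expected, so we wrapped up the beta service in just one month and are tackling our first problem together with a great company like Beatpacking Company,” adding, “Through open innovation based on open source and through the DataCrowds service, we will keep working to redefine Korea's big data industry, which is degenerating into a technology- and equipment-centered SI industry, as a knowledge industry.”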
The blackberry season is firmly upon us. Indeed, it’s come exceptionally early: I’ve been getting some good pickings for two weeks in the garden.
In the wild, brambles tend to live alongside nettles. In my garden there are no nettles, but in their accustomed place is ivy climbing anything that’ll support it, including some of the brambles. It’s got some rather attractive white flowers right now!
As a gardener, the ivy can be a pain: if I try to trim the brambles (or other plants the ivy climbs) back I have two intertwined things to deal with, and they need very different treatment. But for picking the blackberries, I discovered today a bit of ivy can be a huge advantage. Something soft and thorn-free I can grab to pull the thorny bits out of the way and give comfortable access to the berries!
Maryn McKenna was feeling my confusion, and stepped in with a super article packed with information: Ebola in Africa and the U.S.: A Curation
Having said all that, here are a few pieces that I think would be worth your time to read.
- Tara Smith at Aetiology on how very over-hyped our image of Ebola is. (That explosive bleeding-out-everywhere thing? Mostly not.)
- Michael Osterholm of the University of Minnesota in the Washington Post, on what the world needs to do to control the West Africa outbreak.
- Laurie Garrett (who covered past Ebola outbreaks as a newspaper reporter) at CNN, on the African political instability that has made the epidemic so difficult to control.
- Declan Butler in Nature, on why the Ebola outbreak will remain a West Africa problem — but not a global one.
- David Kroll at Forbes, describing the protections in place at Emory to prevent Ebola spreading.
- Helen Branswell at National Geographic, on why there are so few treatments or vaccines for Ebola.
- Ren (a semi-anonymous public health worker) at Epidemiological, rendering appropriate disdain to people who said the aid workers should have been left in Africa.
Down the road, let's please have less coverage of Donald Trump, and more strong reporting like this.
Instead of answering the above questions again and again, we decided to explain our strategy for both JavaEE and the JavaEE WebProfile, the implementation details, and the expected timeline in a detailed white paper. We recently finished this paper and have published it on our website; here is the link. In summary, WSO2 AS will support the JavaEE WebProfile in its 6.0 version.
The following diagram gives you an idea of the JavaEE specifications supported in WSO2 AS 5.2.1, which is the latest released version. Though WSO2 AS 5.2.1 does not fully support the WebProfile, it supports a number of specifications defined under both the WebProfile and the Full Profile.
(Here all dark coloured specifications are supported in AS 5.2.1, and light coloured specifications are not.)
Link to the white paper
Additionally, WSO2 AS uses a number of certified and proven open source frameworks to support the JavaEE WebProfile, most of them from Apache.
It's been hard to find good reporting about the West Africa Ebola crisis, at least here in the states.
Most of the coverage has involved the Dr. Kent Brantly story, which is certainly a compelling story, but there is much more I'd like to know.
I've found a few good articles, though.
A detailed article carried by ABC News helps in explaining why it is so hard to combat the epidemic: Ebola Outbreak Feeds on Fear, Anger, Rumors
Many health care workers and aid workers have said one cause for the rapid spread of the Ebola virus is the public's general mistrust of the government. Among the rumors about this disease:
- Ebola does not exist and government workers are using it as an excuse to steal organs to sell on the black market.
- The government is pretending Liberia has Ebola so they'll have an opportunity to receive and then abuse donated funds.
- If a person goes to the hospital with a disease that has symptoms that mirror those of Ebola, such as malaria, that person will end up getting Ebola from the hospital.
- Medical staffers are so afraid to catch Ebola, they neglect patients in the quarantine unit and let them starve to death.
- Because of the noxious fumes that come from the solution workers use to spray affected areas, some people believe the spray is meant to kill them, and they don't want workers to come into their communities.
On the NPR site, transcripts of two interesting interviews:
- Fear, Caution As Doctors Fight Ebola On The Ground
  SAYAH: We need more people. We need more actors to be involved in educating the population, the communities, sensitizing them about the fact that the key to resolving this is that people come and get treated and not hide their sick and not have secret burials. We have a lot of work ahead. It's now in three countries in multiple sites. We've never seen it before. Doctors Without Borders has stretched its capacity to respond. We're doing the best we can but I think many more actors need to be mobilized.
- Sierra Leone, Struggling With Ebola, Passes On Africa Summit
  FOFANA: Well, it has been difficult because it's never been here before. And health workers were not prepared for Ebola when eventually it did emerge. And lots of the health workers have died in all three countries. About in the region of 100 health workers have contracted the virus and at least half that number have died. We have been told by some nurses that the personal protective gear that they have been given is not good enough. They say the clothing is very thin, and they do not feel very much secure in them.
- Living in the shadow of Ebola
  I spent an instructive couple of hours at the weekend with a woman from Finland. Eeva was once a midwife, but she's just finished a five-week stint with a Red Cross team that has been going door to door in Kailahun province, the border region where Ebola first arrived in Sierra Leone.
  She was on what's known as a sensitisation mission, explaining to people exactly how the virus spreads and how to avoid it.
  There are three simple rules, she told me.
- Ebola outbreak: US experts to head to West Africa
  Dr Thomas Frieden, director of the Centers for Disease Control and Prevention, announced the new US measures in an interview with ABC's This Week.
  "We do know how to stop Ebola. It's old-fashioned plain and simple public health: find the patients, make sure they get treated, find their contacts, track them, educate people, do infection control in hospitals."
It's been 14 months now since the Edward Snowden story broke.
During that time, there has been a conversation of sorts. I wish that more had participated; I wish that more had resulted.
But I'm pleased, at least, that the conversation continues.
Some have been focusing on the economic and commercial aspects of the conversation:
- Personal Privacy Is Only One of the Costs of NSA Surveillance
  The economic costs of NSA surveillance can be difficult to gauge, given that it can be hard to know when the erosion of a company’s business is due solely to anger over government spying. Sometimes, there is little more than anecdotal evidence to go on. But when the German government, for example, specifically cites NSA surveillance as the reason it canceled a lucrative network contract with Verizon, there is little doubt that U.S. spying policies are having a negative impact on business.
- Report Says Backlash From NSA's Surveillance Programs Will Cost Private Sector Billions Of Dollars
  Also directly affecting US companies is a future full of increased compliance costs as countries move towards data sovereignty. This means tech companies like Facebook and Google will need to build local data centers if they wish to keep citizens in affected countries as users. The European Parliament's new data protection law could easily result in massive fines for US companies.
Others have been looking at the changing relationship between the American scientific community and its most important patron, the U.S. Government:
- Mathematicians Discuss the Snowden Revelations
  The only reason I am putting these words down now is the feeling of intense betrayal I suffered when I learned how my government and the leadership of my intelligence community took the work I and many others did over many years, with a genuine desire to prevent another 9/11 attack, and subverted it in ways that run totally counter to the founding principles of the United States, that cause huge harm to the US economy, and that moreover almost certainly weaken our ability to defend ourselves.
- The Mathematical Community and the National Security Agency
  We face a variety of threats -- from car accidents, which take about as many lives each month as the 9/11 tragedy, to weather (ranging from sudden disasters, such as hurricanes Katrina and Sandy, to the dangers from climate change), to global avian flu pandemics. The moves taken in the name of fighting terrorism, including the intrusive NSA data collection that has recently come to light and more generally the militarization of our society, are not justified by the dangers we currently face from terrorism.
- NSA and the Snowden Issues
  NSA's intelligence activities stem from a foreign-intelligence requirement -- initiated by one or more Executive Branch intelligence consumers (the White House, Department of State, Department of Defense, etc.), vetted through the Justice Department as a valid need -- and run according to a process managed by the Office of the Director of National Intelligence.
- Why were CERT researchers attacking Tor?
  CERT was set up in the aftermath of the Morris Worm as a clearinghouse for vulnerability information. The purpose of CERT was to (1) prevent attacks by (2) channeling vulnerability information to vendors and eventually (3) informing the public. Yet here, CERT staff (1) carried out a large-scale, long-lasting attack while (2) withholding vulnerability information from the vendor, and now, even after the vulnerability has been fixed, (3) withholding the same information from the public.
- Cryptographer Adi Shamir Prevented from Attending NSA History Conference
  As a friend of the US I am deeply worried that if you continue to delay visas in such a way, the only thing you will achieve is to alienate many world-famous foreign scientists, forcing them to increase their cooperation with European or Chinese scientists whose countries roll the red carpet for such visits. Is this really in the US best interest?
  Best personal wishes, and apologies for not being able to meet you in person,
- US State Department: Let in cryptographers and other scientists
  I’ve learned from colleagues that, over the past year, foreign-born scientists have been having enormously more trouble getting visas to enter the US than they used to. The problem, I’m told, is particularly severe for cryptographers: embassy clerks are now instructed to ask specifically whether computer scientists seeking to enter the US work in cryptography. If an applicant answers “yes,” it triggers a special process.
- The ultimate goal of the NSA is total population control
  The lack of official oversight is one of Binney’s key concerns, particularly of the secret Foreign Intelligence Surveillance Court (Fisa), which is held out by NSA defenders as a sign of the surveillance scheme's constitutionality.
  “The Fisa court has only the government’s point of view”, he argued. “There are no other views for the judges to consider.
None of these topics are simple; none of these conversations are easy.
We must keep the discussion going.
From the Plym Valley trail. I’ve been meaning to photograph this sculpture for a while, so I took the opportunity when I passed it today.
In a previous post I explained a few security patterns that can be used with Java WebSocket applications and how to call them from client-side applications, including browser-based and rich-agent-based clients. In this post I explain how to secure server-side WebSocket endpoints easily. In fact, if you are already familiar with the security model defined by the Java Servlet specification there is nothing new: you can use the same security model for WebSocket server endpoints as well. Let's discuss this with an example; consider the following use case.
- Endpoint URL to secure - /securewebsocket
- Transport level security - HTTPS
- Allow roles - admin
- Authentication method - Basic
In this use case we want to secure a WebSocket endpoint deployed at the "/securewebsocket" URL. Only users with the "admin" role can establish a WebSocket connection, and they must use SSL for transport-level security; additionally, the server will use HTTP Basic authentication to authenticate users during the handshake.
We can fulfil the above security requirements easily by adding the following entries to the web.xml file.
<display-name>Secure WebSocket Endpoint</display-name>
<web-resource-name>Secure WebSocket Endpoint</web-resource-name>
Now let's look at the various options we could use for authentication and authorization.
1. BASIC This is the basic authentication scheme, where the client sends the user name and password as an encoded string in an HTTP header. With browser-based clients, the browser pops up a dialog to enter the user name and password.
2. FORM In form-based authentication, application developers create an HTML login page to send the user name and password. This approach is similar to "Basic" but allows a customized login page.
3. DIGEST More secure than the above two options; it applies a hash function to the password before sending it to the server.
4. CLIENT-CERT This is also a stronger authentication scheme, where the client is authenticated using the client's digital certificate.
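To make the BASIC scheme concrete, here is a minimal sketch of how a client could build the Authorization header that accompanies the WebSocket handshake request. The credentials are made-up values, and `basic_auth_header` is a hypothetical helper, not part of any WebSocket library.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the HTTP Authorization header for Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Hypothetical credentials for a user holding the "admin" role.
header = basic_auth_header("admin", "secret")
print(header)
```

Base64 is an encoding, not encryption, so anyone who can read the header can recover the password. That is why the use case pairs Basic authentication with HTTPS.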
The transport guarantee can take one of the following values:
1. NONE This indicates the server should accept any connection, including unprotected connections.
2. INTEGRAL This ensures that data is sent between client and server in such a way that it cannot be changed in transit.
3. CONFIDENTIAL This ensures that other entities cannot observe the contents of the transmission.
- In practice, web servers treat the CONFIDENTIAL and INTEGRAL transport guarantee values identically.
- With both the CONFIDENTIAL and INTEGRAL options, clients should use the secure WebSocket (wss://) protocol.
Sometimes you do care about the positions of the terms, and for such cases Lucene has various so-called proximity queries.
The simplest proximity query is PhraseQuery, to match a specific sequence of tokens such as "Barack Obama". Seen as a graph, a PhraseQuery is a simple linear chain:
By default the phrase must precisely match, but if you set a non-zero slop factor, a document can still match even when the tokens are not exactly in sequence, as long as the edit distance is within the specified slop. For example, "Barack Obama" with a slop factor of 1 will also match a document containing "Barack Hussein Obama" or "Barack H. Obama". It looks like this graph:
Now there are multiple paths through the graph, including an any (*) transition to match an arbitrary token. (Note: while the graph cannot properly express it, this query would also match a document that had the tokens Barack and Obama on top of one another, at the same position, which is a little bit strange!)
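The effect of slop on a two-term phrase can be sketched in a few lines. This is a simplified model for illustration only, not Lucene's implementation: it handles just two terms, assumes one token per position, and the function name `sloppy_match` is made up.

```python
def sloppy_match(tokens, first, second, slop=0):
    """Return True if `second` appears where the phrase "first second"
    expects it, allowing the position to be off by at most `slop`."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for p1 in first_positions:
        for p2 in second_positions:
            # An exact phrase has the second term at p1 + 1; the distance
            # from that expected position is how far the term had to move.
            if abs(p2 - (p1 + 1)) <= slop:
                return True
    return False


doc = "barack hussein obama".split()
print(sloppy_match(doc, "barack", "obama", slop=0))
print(sloppy_match(doc, "barack", "obama", slop=1))
```

In this model the reversed order "obama barack" only matches once the slop reaches 2, since the second term has to move two positions from where the exact phrase expects it.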
In general, proximity queries are more costly on both CPU and IO resources, since they must load, decode and visit another dimension (positions) for each potential document hit. That said, for exact (no slop) matches, using common-grams, shingles and ngrams to index additional "proximity terms" in the index can provide enormous performance improvements in some cases, at the expense of an increase in index size.
MultiPhraseQuery is another proximity query. It generalizes PhraseQuery by allowing more than one token at each position, for example:
This matches any document containing either domain name system or domain name service. MultiPhraseQuery also accepts a slop factor to allow for non-precise matches.
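The idea of allowing several tokens at one position can be sketched as follows. This is a plain-Python illustration of exact (no slop) matching, not the Lucene API; `multi_phrase_match` and the sample sentences are invented for the example.

```python
def multi_phrase_match(tokens, alternatives):
    """Return True if some window of consecutive tokens matches, where
    position i of the window may be any term in alternatives[i]."""
    width = len(alternatives)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tok in allowed for tok, allowed in zip(window, alternatives)):
            return True
    return False


# One set of allowed terms per phrase position.
query = [{"domain"}, {"name"}, {"system", "service"}]

print(multi_phrase_match("the domain name service is down".split(), query))
print(multi_phrase_match("the domain name server is down".split(), query))
```

The first sentence matches because "service" is one of the allowed terms at the third position; the second does not, because "server" is not.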
Finally, span queries (e.g. SpanNearQuery, SpanFirstQuery) go even further, allowing you to build up a complex compound query based on positions where each clause matched. What makes them unique is that you can arbitrarily nest them. For example, you could first build a SpanNearQuery matching Barack Obama with slop=1, then another one matching George Bush, and then make another SpanNearQuery, containing both of those as sub-clauses, matching if they appear within 10 terms of one another.
As of Lucene 4.10 there will be a new proximity query, TermAutomatonQuery, to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including "any" transitions to handle slop. Here's an example:
This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding "any" transitions (sorry, no QueryParser support yet, patches welcome!). Once that's done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.
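The idea of matching token positions against an automaton can be sketched without Lucene. The code below is an illustration under stated assumptions, not the query's actual implementation: it tracks the set of reachable states on the fly instead of determinizing up front, assumes one token per position, and every name in it (`ANY`, `automaton_matches`, the state numbers) is invented for the example.

```python
ANY = None  # label for a transition that accepts an arbitrary token


def automaton_matches(tokens, transitions, start, accept):
    """Return True if some run of consecutive tokens drives the automaton
    from `start` to a state in `accept`.

    `transitions` maps a state to a list of (label, next_state) pairs.
    """
    for begin in range(len(tokens)):
        states = {start}
        for token in tokens[begin:]:
            # Follow every transition whose label fits the current token.
            states = {
                nxt
                for state in states
                for label, nxt in transitions.get(state, [])
                if label is ANY or label == token
            }
            if not states:
                break
            if states & accept:
                return True
    return False


# "barack", then optionally one arbitrary token, then "obama".
transitions = {
    0: [("barack", 1)],
    1: [("obama", 3), (ANY, 2)],
    2: [("obama", 3)],
}

print(automaton_matches("barack hussein obama".split(), transitions, 0, {3}))
print(automaton_matches("barack h w obama".split(), transitions, 0, {3}))
```

Here the single "any" transition plays the role of a slop of one extra token between the two required terms; two intervening tokens are rejected.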
In addition to this new query there is also a simple utility class, TokenStreamToTermAutomatonQuery, that provides loss-less translation of any graph TokenStream into the equivalent TermAutomatonQuery. This is powerful because it means even arbitrary token stream graphs will be correctly represented at search time, preserving the PositionLengthAttribute that some tokenizers now set.
While this means you can finally correctly apply arbitrary token stream graph synonyms at query-time, because the index still does not store PositionLengthAttribute, index-time synonyms are still not fully correct. That said, it would be simple to build a TokenFilter that writes the position length into a payload, and then to extend the new TermAutomatonQuery to read from the payload and apply that length during matching (patches welcome!).
The query is likely quite slow, because it assumes every term is optional; in many cases it would be easy to determine required terms (e.g. Obama in the above example) and optimize such cases. In the case where the query was derived from a token stream, so that it has no cycles and does not use "any" transitions, it may be faster to enumerate all phrases accepted by the automaton (Lucene already has the getFiniteStrings API to do this for any automaton) and construct a boolean query from those phrase queries. This would match the same set of documents, also correctly preserving PositionLengthAttribute, but would assign different scores.
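The enumeration alternative can also be sketched in plain Python. This is only an illustration of the idea, assuming an acyclic automaton with no wildcard transitions; `finite_phrases`, `contains_phrase` and `matches` are hypothetical helpers and do not correspond to Lucene's getFiniteStrings signature.

```python
def finite_phrases(transitions, start, accept):
    """Yield every token sequence accepted by an acyclic automaton."""
    def walk(state, prefix):
        if state in accept:
            yield prefix
        for label, nxt in transitions.get(state, []):
            yield from walk(nxt, prefix + [label])
    yield from walk(start, [])


def contains_phrase(tokens, phrase):
    """Return True if `phrase` occurs as consecutive tokens."""
    width = len(phrase)
    return any(tokens[i:i + width] == phrase
               for i in range(len(tokens) - width + 1))


# A small acyclic automaton with two accepted phrases.
transitions = {
    0: [("domain", 1)],
    1: [("name", 2)],
    2: [("system", 3), ("service", 3)],
}
phrases = list(finite_phrases(transitions, 0, {3}))


def matches(tokens):
    """Boolean OR over exact matches of each enumerated phrase."""
    return any(contains_phrase(tokens, phrase) for phrase in phrases)


print(phrases)
print(matches("the domain name system".split()))
```

The number of enumerated phrases can grow quickly when many positions have alternatives, so this trade-off only pays off for small automata.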
The code is very new and there are surely some exciting bugs! But it should be a nice start for any application that needs precise control over where terms occur inside documents.