Planet Python

Planet Python - http://planet.python.org/
Updated: 3 hours 30 min ago

EmptysquarePython: Slides from my talk on asynchronous web frameworks, Python, and MongoDB

8 hours 45 min ago

Here are the slides for the talk I gave at the NYC Python Meetup tonight, on asynchronous web frameworks, Python, and MongoDB.

Categories: FLOSS Project Planets

Tarek Ziade: Defining a wsgi app deployment standard

Thu, 02/09/2012 - 18:14

Next month at Pycon, we’ll have a web summit and I’m invited there to talk about how I deploy web applications. This is not a new topic, as it was already discussed a bit last year; see Ian Bicking’s thoughts on the topic.

My presentation at the summit will be in two parts. I want to 1/ explain how I organized our Python deployments at Mozilla (using RPMs)  2/ make an initial proposal for a deployment standard that would work for the community at large – I intend to work on this during Pycon and later on the dedicated SIG.

Here’s an overview of the deployment standard idea…

How we deploy usually

If I want to roughly summarize how people deploy their web applications these days, from my knowledge I’d say that there are two main categories.

  1. Deployments that need to be done in the context of an existing packaging system — like RPM or DPKG
  2. Deployments that are done in no particular context, where we want it to just work, like a directory containing a virtualenv and all the dependencies needed.

In both cases, preparing a deployment usually consists of fetching Python packages from PyPI and maybe compiling some of them. These steps are usually done using tools like zc.buildout or virtualenv + pip, and in the case of Mozilla Services, a custom tool that transforms all dependencies into RPMs.

In one case we end up with a directory filled with everything needed to run the application, except the system dependencies, and in the other case with a collection of RPMs that can be deployed on the target system.

But in both cases, we end up using the same thing: a complete list of Python dependencies.

The trick with using tools like zc.buildout or pip is that from an initial list of dependencies, you end up pulling indirect dependencies. For instance, the Pyramid package will pull the Mako package and so on.  A good practice is to have them listed in a single place and to pin each package to a specific version before releasing the app. Both pip and zc.buildout have tools to do this.
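
As a rough illustration of the pinning practice, the sketch below checks that every entry in a requirements-style list carries an exact `==` version. The function name `unpinned`, the sample entries and the version number are made up for this example, and it only understands the simplest `name==version` lines, not the full pip requirements syntax.

```python
def unpinned(lines):
    """Return the requirement lines that are not pinned with '=='."""
    missing = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank or comment-only lines
        if "==" not in line:
            missing.append(line)
    return missing


# Hypothetical contents of a dependency list.
sample = [
    "pyramid==1.2.1",  # pinned (the version number is only an example)
    "Mako",            # pulled in indirectly, not pinned yet
    "# a comment",
    "",
]
print(unpinned(sample))
```

A check like this could run before a release so that an unpinned indirect dependency is noticed early.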

Deployment practices I have seen so far:

  • a collection of RPMs, Debian packages, etc. is built using tools like bdist_rpm
  • a virtualenv-based directory is created in place in production, or as a pre-built binary release that’s archived and copied to production
  • a zc.buildout-based directory is created in place in production, or as a pre-built binary release that’s archived and copied to production

The part that’s still fuzzy for everyone who is not using RPMs or Debian packages is how to list system-level dependencies. In PEP 345 we introduced the notion of a hint, where you can define system-level dependencies whose names may not be the actual names on the target system. So if you say you need libxml-dev, which is valid under Debian, people who deploy your system will know they’ll need libxml-devel under Fedora. Yeah, no magic here; it’s a tough issue. See Requires-External.

The Standard

The standard I have in mind is a very lightweight standard that could be useful in all our deployment practices – it’s a thin layer on the top of the WSGI standard.

A wsgi application is a directory containing:

  • a text file located in the directory at dependencies.txt, listing all dependencies, possibly reusing pip’s requirements format
  • a text file located in the directory at external-dependencies.txt, listing all system dependencies, possibly reusing the PEP 345 format
  • a Python script located in the directory at bin/wsgiapp with an “application” variable. The shebang line of the Python script might also point to a local Python interpreter (a virtualenv version)
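
To make the third item concrete, here is a minimal sketch of what such a script could contain. Only the “application” name comes from the proposal; the response body and the `call_app` helper are invented for this example, and the helper exists only to exercise the callable without starting a server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable exposed under the name the proposal expects."""
    body = b"Hello from wsgiapp\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app):
    """Call a WSGI app once with a fake request (hypothetical helper)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a plausible GET / request
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(application)
print(status, body)
```

Any WSGI server should be able to serve a callable like this, which is what keeps the layer on top of WSGI so thin.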

From there we have all kinds of possible scenarios where the application can be built and/or run with the usual set of tools.

Here’s one example of a deployment from scratch :

  • The repository of the project is cloned
  • A virtualenv is created in the repository clone
  • pip, which gets installed with virtualenv, is used to install all dependencies described in dependencies.txt
  • gunicorn is used to run the app locally using “cd bin; gunicorn wsgiapp:application”
  • the directory is zipped and sent to production
  • the directory is unzipped
  • virtualenv is run again in the directory
  • the app is hooked to Apache+mod_wsgi

Another scenario I’d use in our RPM environment:

  • The repository of the project is cloned
  • an RPM is built for each package in dependencies.txt
  • if possible, external-dependencies.txt is used to feed a spec file.
  • the app is deployed using the RPM collection

That’s the idea, roughly — a light standard to point a wsgi app and a list of dependencies.


Filed under: mozilla, python

Lightning Fast Shop: Release 0.6.6

Thu, 02/09/2012 - 15:45

We just released LFS 0.6.6. This is yet another bugfix release.

Changes
  • Bugfix: fixed url for Pages at breadcrumbs (Maciej Wisniowski)
  • Bugfix: display sale price at category products page (Maciej Wisniowski)
  • Bugfix: fix product pagination (Maciej Wisniowski)
  • Bugfix: added short_description to category management UI
  • Bugfix: display category descriptions
  • Bugfix: fixed template selection; issue #134
  • Improvement: allow easy modification of category/product templates (Maciej Wisniowski)
  • Updated Polish translations (Maciej Wisniowski)
News: Information

You can find more information and help on following locations:

 


Johan Dahlin: Writing a mixed Gtk / Javascript application

Thu, 02/09/2012 - 13:10

In my last blog post I mentioned the embedding of a javascript library inside Stoq. I got a couple of requests asking how this was accomplished; this blog post attempts to explain some of it.

Of course we need to use the great WebKitGtk library. Unfortunately we cannot use the introspection based bindings as this needs to work on Gtk+ 2.18 and PyGTK 2.17 which were shipped in the last Ubuntu LTS release.

WebView will do all the html/css/js parts. It’s almost as simple as a normal GtkTextView, add it to a scrolled window, load the content and off you go.

The first challenge comes when you want to open http:// links in your normal browser instead of handling them in your webkit view. To do that you need to listen to the navigation-policy-decision-requested signal and ignore certain requests. You don’t actually need to use http protocols; you can invent any URL which is parsable.

Next problem is AJAX, to write a proper asynchronous widget you don’t want to reload the whole page when something changes. Since we cannot implement our own protocols in the old libsoup bindings shipped for PyGTK we need to run our own http server. That is good for other reasons as well, we can do heavy IO such as database queries in there without actually blocking the user interaction.

When we need to run a script from the gtk side, we call web_view_execute_script(), which executes a piece of javascript. For instance,

view.execute_script("document.title = $('fc-header-title').text()") is an actual line in Stoq; it sets the window title based on the calendar header title from the DOM.

Going the other direction is a bit uglier. The only way of communication I found was opening new URLs, so I implemented an application-specific domain which opens a dialog or triggers some other action within the gtk application.
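
A sketch of how such application-specific URLs could be decoded on the Python side is shown below. The scheme name `customprotocol`, the `show-payment` action and the `parse_app_uri` function are placeholders for this example, not names from Stoq; a handler for the navigation signal could pass the requested URI to a function like this and ignore the navigation when it returns something.

```python
from urllib.parse import parse_qs, urlsplit

APP_SCHEME = "customprotocol"  # hypothetical scheme name


def parse_app_uri(uri):
    """Return (action, params) for an application URI, or None otherwise."""
    parts = urlsplit(uri)
    if parts.scheme != APP_SCHEME:
        return None  # ordinary link: leave it to the browser or the web view
    # Keep only the first value of each query parameter for simplicity.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, params


print(parse_app_uri("customprotocol://show-payment?id=42"))
print(parse_app_uri("http://www.example.com/"))
```

On the javascript side, setting window.location to a URL of this shape would then be enough to trigger the action.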

I know that some of these tricks are already outdated. In newer webkitgtk versions you should write your own libsoup handlers and use the gobject dom bindings for communication, but I didn’t have these options when writing this post.

TL;DR

  • Listen to the ::navigation-policy-decision-requested to implement your own uri handling if you can’t do it via libsoup.
  • Run a separate daemon process which will serve as an internal webserver, so AJAX calls work and won’t block on IO.
  • Use web_view_execute_script() for Gtk->Javascript communication
  • Use window.location = “customprotocol://”  for Javascript ->Gtk communication
  • Make the javascript parts work in a normal browser so the normal developer tools can be used
  • If possible, use a newer version of WebKitGtk and avoid all of this.

Xan and the other webkit hackers will probably look at me in disgust for telling you how to do all of these dirty hacks


John Cook: Mixing R, Python, and Perl in 14 lines of code

Thu, 02/09/2012 - 11:25

This is a continuation of my previous post, Running Python and R inside Emacs. That post shows how to execute independent code blocks in Emacs org-mode. This post illustrates calling one code block from another, each written in a different language.

The example below computes sin²(x) + cos²(x) by computing the sine function in R, the cosine function in Python, and summing their squares in Perl. As you’d hope, it returns 1. (Actually, it returns 0.99999999999985 on my machine.)

To execute the code, go to the #+call line and type C-c C-c.

#+name: sin_r(x=1)
#+begin_src R
sin(x)
#+end_src

#+name: cos_p(x=0)
#+begin_src python
import math
return math.cos(x)
#+end_src

#+name: sum_sq(a = 0, b = 0)
#+begin_src perl
$a*$a + $b*$b;
#+end_src

#+call: sum_sq(sin_r(1), cos_p(1))

Apparently each function argument has to have a default value. If that’s documented, I missed it. I gave the sine and cosine functions default values that would cause the call to sum_sq to return more than 1 if the defaults were used.
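
For comparison, the same arithmetic can be checked in a single language. This is only a sanity check written for this rewrite, not part of the org-mode example; it also shows why the chosen defaults (x=1 for the sine, x=0 for the cosine) would push the sum above 1.

```python
import math


def sum_sq(a=0.0, b=0.0):
    """Sum of squares, mirroring the Perl block."""
    return a * a + b * b


# Explicit arguments, as in the #+call line.
result = sum_sq(math.sin(1), math.cos(1))

# What the call would compute if the defaults of sin_r and cos_p were used.
with_defaults = sum_sq(math.sin(1), math.cos(0))

print(result, with_defaults)
```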


John Cook: Running Python and R inside Emacs

Thu, 02/09/2012 - 08:00

Emacs org-mode lets you manage blocks of source code inside a text file. You can execute these blocks and have the output display in your text file. Or you could export the file, say to HTML or PDF, and show the code and/or the results of executing the code.

Here I’ll show some of the most basic possibilities. For much more information, see orgmode.org. And for the use of org-mode in research, see A Multi-Language Computing Environment for Literate Programming and Reproducible Research.

Source code blocks go between lines of the form

#+begin_src
#+end_src

On the #+begin_src line, specify the programming language. Here I’ll demonstrate Python and R, but org-mode currently supports C++, Java, Perl, etc. for a total of 35 languages.

Suppose we want to compute √42 using R.

#+begin_src R
sqrt(42)
#+end_src

If we put the cursor somewhere in the code block and type C-c C-c, org-mode will add these lines:

#+results:
: 6.48074069840786

Now suppose we do the same with Python:

#+begin_src python
from math import sqrt
sqrt(42)
#+end_src

This time we get disappointing results:

#+results:
: None

What happened? The org-mode manual explains:

… code should be written as if it were the body of such a function. In particular, note that Python does not automatically return a value from a function unless a return statement is present, and so a ‘return’ statement will usually be required in Python.

If we change sqrt(42) to return sqrt(42) then we get the same result that we got when using R.
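
The behaviour the manual describes can be imitated in plain Python. The sketch below is an illustration of the idea only, not how org-mode is actually implemented; the `run_as_function_body` helper is invented for this example. It wraps a code block in a function definition and calls it, so a block without a return statement yields None.

```python
import textwrap


def run_as_function_body(body):
    """Wrap `body` in a function definition, call it, and return the result."""
    source = "def _block():\n" + textwrap.indent(body, "    ")
    namespace = {}
    exec(source, namespace)  # defines _block inside `namespace`
    return namespace["_block"]()


without_return = run_as_function_body("from math import sqrt\nsqrt(42)\n")
with_return = run_as_function_body("from math import sqrt\nreturn sqrt(42)\n")
print(without_return, with_return)
```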

By default, evaluating a block of code returns a single result. If you want to see the output as if you were interactively using Python from the REPL, you can add :results output :session following the language name.

#+begin_src python :results output :session
print "There are %d hours in a week." % (7*24)
2**10
#+end_src

This produces the lines

#+results:
: There are 168 hours in a week.
: 1024

Without the :session tag, the second line would not appear because there was no print statement.

I had to do a couple things before I could get the examples above to work. First, I had to upgrade org-mode. The version of org-mode that shipped with Emacs 23.3 was quite out of date. Second, the only language you can run by default is Emacs Lisp. You have to turn on support for other languages in your .emacs file. Here’s the code to turn on support for Python and R.

(org-babel-do-load-languages 'org-babel-load-languages '((python . t) (R . t)))

Update: My next post shows how to call code written in one language from code written in another language.

Related posts:

Personal organization software
Preventing an unpleasant Sweave surprise


Reinout van Rees: Upgrading your svn checkouts to 1.7 with checkoutmanager

Thu, 02/09/2012 - 05:24

I updated subversion yesterday (because I installed git-svn). Subversion version 1.7 has a new working copy format and requires you to upgrade all your existing checkouts:

$> svn up
svn: E155036: Please see the 'svn upgrade' command
svn: E155036: Working copy 'xyz' is too old (format 10, created by Subversion 1.6)

Calling svn upgrade by hand for all checkouts is boring. And I made checkoutmanager to make checkouts less boring. A simple checkoutmanager up in the morning does an svn up, git pull or hg pull -u in every one of my checkouts.

And my brother luckily added a hidden command upgrade to checkoutmanager. It is called hidden because it is only listed on the pypi page, not when you call checkoutmanager --help, because it is so rarely needed.

But anyway, a simple checkoutmanager upgrade call later and all my svn checkouts were upgraded! Nice.
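
For readers without checkoutmanager, a hand-rolled alternative could look roughly like the sketch below. It is not how checkoutmanager works internally; the function names are invented here, and the script only builds the `svn upgrade` commands instead of running them. The demo uses a throwaway directory tree, so no real checkouts are touched.

```python
import os
import tempfile


def find_svn_checkouts(root):
    """Yield directories under `root` that contain a `.svn` directory."""
    for dirpath, dirnames, _filenames in os.walk(root):
        if ".svn" in dirnames:
            yield dirpath
            dirnames[:] = []  # do not descend into a checkout we already found


def upgrade_commands(root):
    """Build (but do not run) one `svn upgrade` command per checkout."""
    return [["svn", "upgrade", path] for path in sorted(find_svn_checkouts(root))]


# Demo on a temporary tree: "a" and "c/d" look like checkouts, "b" does not.
root = tempfile.mkdtemp()
for relative in ("a/.svn", "a/sub/.svn", "b", "c/d/.svn"):
    os.makedirs(os.path.join(root, relative))

commands = upgrade_commands(root)
for command in commands:
    print(" ".join(command))
```

Each command list could then be handed to `subprocess.run` once the output looks right.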


Grig Gheorghiu: Handling date/time in Apache Pig

Thu, 02/09/2012 - 04:48
A common usage scenario for Apache Pig is to analyze log files. Most log files contain a timestamp of some sort -- hence the need to handle dates and times in your Pig scripts. I'll present here a few techniques you can use.

Mail server logs

The first example I have is a Pig script which analyzes the time it takes for a mail server to send a message. The script is available here as a gist.

We start by registering the piggybank jar and defining the functions we'll need. I ran this using Elastic MapReduce, and all these functions are available in the piggybank that ships with EMR.

REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();            
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
Since the mail log timestamps don't contain the year, we declare a variable called YEAR which by default is set to the current year via the Unix 'date' command. The variable can also be set when the Pig script is called by running "pig -p YEAR=2011 mypigscript.pig".
%default YEAR `date +%Y`;
We read in the mail logs and extract the lines containing the source of a given message ('from' lines). An example of such a line:

Dec  2 15:13:52 mailserver1 sendmail[1882]: pB2KCqu1001882: from=<info@example.com>, size=9544, class=0, nrcpts=1, msgid=<201112022012.pB2KCqu1001882@mailserver1.example.com>, proto=ESMTP, daemon=MTA, relay=relay1.example.com [10.0.20.6]

To split the line into its various elements, we use the EXTRACT function and a complicated regular expression. Note that in Pig the backslash needs to be escaped:
RAW_LOGS = LOAD '$INPUT' as (line:chararray);
SRC = FOREACH RAW_LOGS GENERATE                                                
FLATTEN(                                                                        
EXTRACT(line, '(\\S+)\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+sendmail\\[(\\d+)\\]:\\s+(\\w+):\\s+from=<([^>]+)>,\\s+size=(\\d+),\\s+class=(\\d+),\\s+nrcpts=(\\d+),\\s+msgid=<([^>]+)>.*relay=(\\S+)')
)
AS (
month: chararray,
day: chararray,
time: chararray,
mailserver: chararray,
pid: chararray,
sendmailid: chararray,
src: chararray,
size: chararray,
classnumber: chararray,
nrcpts: chararray,
msgid: chararray,
relay: chararray
);
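
To try the regular expression outside Pig, the same pattern can be run with Python's `re` module. This is a side sketch, not part of the original script; in a Python raw string the backslashes do not need to be doubled. The sample line is the one quoted above, and the field names follow the AS clause.

```python
import re

MAIL_FROM = re.compile(
    r"(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+sendmail\[(\d+)\]:\s+(\w+):\s+"
    r"from=<([^>]+)>,\s+size=(\d+),\s+class=(\d+),\s+nrcpts=(\d+),\s+"
    r"msgid=<([^>]+)>.*relay=(\S+)"
)

FIELDS = ("month", "day", "time", "mailserver", "pid", "sendmailid",
          "src", "size", "classnumber", "nrcpts", "msgid", "relay")

line = (
    "Dec  2 15:13:52 mailserver1 sendmail[1882]: pB2KCqu1001882: "
    "from=<info@example.com>, size=9544, class=0, nrcpts=1, "
    "msgid=<201112022012.pB2KCqu1001882@mailserver1.example.com>, "
    "proto=ESMTP, daemon=MTA, relay=relay1.example.com [10.0.20.6]"
)

match = MAIL_FROM.match(line)
# Lines that do not match (for example 'to' lines) give None.
record = dict(zip(FIELDS, match.groups())) if match else None
print(record)
```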
For this particular exercise we don't need all the fields of the SRC relation. We keep only a few:

T1 = FOREACH SRC GENERATE sendmailid, FORMAT('%s-%s-%s %s', $YEAR, month, day, time) as timestamp;
FILTER_T1 = FILTER T1 BY NOT sendmailid IS NULL;
DUMP FILTER_T1;

Note that we use the FORMAT function to generate a timestamp string out of the month, day and time fields, and we also add the YEAR variable. The FILTER_T1 relation contains tuples such as:

(pB2KDpaN007050,2011-Dec-2 15:13:52)
(pB2KDpaN007054,2011-Dec-2 15:13:53)
(pB2KDru1003569,2011-Dec-2 15:13:54)
We now use the DATE_TIME function which takes as input our generated timestamp and the date format string representing the timestamp ('yyyy-MMM-d HH:mm:ss'), and returns a DateTime string in Joda-Time / ISO 8601 format.

R1 = FOREACH FILTER_T1 GENERATE sendmailid, DATE_TIME(timestamp, 'yyyy-MMM-d HH:mm:ss') as dt;
DUMP R1;

The R1 relation contains tuples such as:
(pB2KDpaN007050,2011-12-02T15:13:52.000Z)
(pB2KDpaN007054,2011-12-02T15:13:53.000Z)
(pB2KDru1003569,2011-12-02T15:13:54.000Z)

Note that the timestamp string "2011-Dec-2 15:13:52" got converted into a canonical ISO 8601 DateTime string "2011-12-02T15:13:52.000Z".
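
The same conversion can be reproduced in Python to check a value by hand. This is a sketch under two assumptions: the timestamp is treated as UTC (which matches the trailing Z in the output above), and the month abbreviation is parsed with an English or C locale. The function name `to_iso_utc` is made up for this example.

```python
from datetime import datetime, timezone


def to_iso_utc(timestamp, fmt="%Y-%b-%d %H:%M:%S"):
    """Parse `timestamp` with `fmt`, treat it as UTC, and format it as ISO 8601."""
    parsed = datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


iso = to_iso_utc("2011-Dec-2 15:13:52")
print(iso)
```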

Now we can operate on the DateTime strings by using the ISOToUnix function, which takes a DateTime and returns the Unix epoch in milliseconds (which we divide by 1000 to obtain seconds):

-- ISOToUnix returns milliseconds, so we divide by 1000 to get seconds
toEpoch1 = FOREACH R1 GENERATE sendmailid, dt, ISOToUnix(dt) / 1000 as epoch:long;
DUMP toEpoch1;

The toEpoch1 relation contains tuples of the form:
(pB2KDpaN007050,2011-12-02T15:13:52.000Z,1322838832)
(pB2KDpaN007054,2011-12-02T15:13:53.000Z,1322838833)
(pB2KDru1003569,2011-12-02T15:13:54.000Z,1322838834)

We now perform similar operations on lines containing destination email addresses:

DEST = FOREACH RAW_LOGS GENERATE                                                
FLATTEN(                                                                        
EXTRACT(line, '(\\S+)\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+sendmail\\[(\\d+)\\]:\\s+(\\w+):\\s+to=<([^>]+)>,\\s+delay=([^,]+),\\s+xdelay=([^,]+),.*relay=(\\S+)\\s+\\[\\S+\\],\\s+dsn=\\S+,\\s+stat=(.*)')
)
AS (
month: chararray,
day: chararray,
time: chararray,
mailserver: chararray,
pid: chararray,
sendmailid: chararray,
dest: chararray,
delay: chararray,
xdelay: chararray,
relay: chararray,
stat: chararray
);


T2 = FOREACH DEST GENERATE sendmailid, FORMAT('%s-%s-%s %s', $YEAR, month, day, time) as timestamp, dest, stat;
FILTER_T2 = FILTER T2 BY NOT sendmailid IS NULL;

R2 = FOREACH FILTER_T2 GENERATE sendmailid, DATE_TIME(timestamp, 'yyyy-MMM-d HH:mm:ss') as dt, dest, stat;

-- ISOToUnix returns milliseconds, so we divide by 1000 to get seconds
toEpoch2 = FOREACH R2 GENERATE sendmailid, dt, ISOToUnix(dt) / 1000 AS epoch:long, dest, stat;
At this point we have 2 relations, toEpoch1 and toEpoch2, which we can join by sendmailid:
R3 = JOIN toEpoch1 BY sendmailid, toEpoch2 BY sendmailid;
The relation R3 will contain tuples of the form
(sendmailid, datetime1, epoch1, sendmailid, datetime2, epoch2, dest, stat)
We generate another relation by keeping the sendmailid, the delta epoch2 - epoch1, the destination email and the status of the delivery. We also order by the epoch delta:

R4 = FOREACH R3 GENERATE $0, $5 - $2, $6, $7;
R5 = ORDER R4 BY $1 DESC;

R5 contains tuples such as:
(pB2KDqo5007488,2,user1@earthlink.net,Sent (1rwzuwyl3Nl36v0 Message accepted for delivery))
(pB2KDru1003560,1,user2@yahoo.com,Sent (ok dirdel))
(pB2KCrvm030964,0,user3@hotmail.com,Sent ( <201112022012.pB2KCrvm030964> Queued mail for delivery))

At this point we can see which email deliveries took longest, and try to identify patterns (maybe certain mail domains make it harder to deliver messages, or maybe email addresses are misspelled, etc).
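
The join and the delta computation can be pictured with a few in-memory records. The sketch below is only an illustration in Python, not a replacement for the Pig script: the 'from' epochs are taken from the tuples shown earlier, while the 'to' rows (epochs, addresses, statuses, and the unmatched id) are made-up values. It assumes one 'from' line per sendmailid.

```python
# sendmailid -> epoch seconds of the 'from' line (from the toEpoch1 tuples above)
sent = {
    "pB2KDpaN007050": 1322838832,
    "pB2KDpaN007054": 1322838833,
}

# (sendmailid, epoch seconds of the 'to' line, destination, status); made-up values
delivered = [
    ("pB2KDpaN007050", 1322838835, "user1@example.com", "Sent"),
    ("pB2KDpaN007054", 1322838834, "user2@example.com", "Sent"),
    ("pB2KDxxx000000", 1322838840, "user3@example.com", "Sent"),  # no 'from' line
]


def delivery_times(sent, delivered):
    """Inner-join on sendmailid and sort by delivery time, longest first."""
    rows = [
        (msg_id, to_epoch - sent[msg_id], dest, stat)
        for msg_id, to_epoch, dest, stat in delivered
        if msg_id in sent  # keep only ids present on both sides
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for row in delivery_times(sent, delivered):
    print(row)
```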
Nginx logs

In the second example, I'll show how to do some date conversions on Nginx access log timestamps. The full Pig script is available here as a gist.

We parse the Nginx access log lines similarly to the mail log lines in the first example:

RAW_LOGS = LOAD '$INPUT' as (line:chararray);
LOGS_BASE = FOREACH RAW_LOGS GENERATE                                            
FLATTEN(                                                                        
EXTRACT(line, '(\\S+) - - \\[([^\\[]+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
)
AS (
ip: chararray,
timestamp: chararray,
url: chararray,
status: chararray,
bytes: chararray,
referrer: chararray,
useragent: chararray,
xfwd: chararray,
reqtime: chararray
);
DATE_URL = FOREACH LOGS_BASE GENERATE timestamp;
F = FILTER DATE_URL BY NOT timestamp IS NULL;

The timestamp is of the form "30/Sep/2011:00:10:02 -0700" so we use the appropriate DATE_TIME formatting string 'dd/MMM/yyyy:HH:mm:ss Z' to convert it to an ISO DateTime. Note that we need to specify the timezone with Z:

R1 = FOREACH F GENERATE timestamp, DATE_TIME(timestamp, 'dd/MMM/yyyy:HH:mm:ss Z') as dt;
DUMP R1;

R1 contains tuples of the form:
(30/Sep/2011:00:19:35 -0700,2011-09-30T00:19:35.000-07:00)
(30/Sep/2011:00:19:36 -0700,2011-09-30T00:19:36.000-07:00)
(30/Sep/2011:00:19:37 -0700,2011-09-30T00:19:37.000-07:00)

At this point, if we wanted to convert from DateTime to Unix epoch in seconds, we could use ISOToUnix like we did for the mail logs:
toEpoch = FOREACH R1 GENERATE dt, ISOToUnix(dt) / 1000 as epoch:long;

However, let's use another function called FORMAT_DT to convert from the above DateTime format to another format of the type 'MM/dd/yyyy HH:mm:ss Z'. The first argument to FORMAT_DT is the desired format for the date/time, and the second argument is the ISO DateTime value to convert:

FDT = FOREACH R1 GENERATE FORMAT_DT('MM/dd/yyyy HH:mm:ss Z', dt) as fdt;
DUMP FDT;

The FDT relation now contains tuples such as:

(09/30/2011 00:19:35 -0700)
(09/30/2011 00:19:36 -0700)
(09/30/2011 00:19:37 -0700)

We can now use a handy function called CustomFormatToISO to convert from any custom date/time format (such as the one we generated in FDT) back to a canonical ISO DateTime format:

toISO = FOREACH FDT GENERATE fdt, CustomFormatToISO(fdt, 'MM/dd/yyyy HH:mm:ss Z');
DUMP toISO;

(09/30/2011 00:19:35 -0700,2011-09-30T07:19:35.000Z)
(09/30/2011 00:19:36 -0700,2011-09-30T07:19:36.000Z)
(09/30/2011 00:19:37 -0700,2011-09-30T07:19:37.000Z)

Note how the custom DateTime string "09/30/2011 00:19:35 -0700" got transformed into the canonical ISO DateTime string "2011-09-30T07:19:35.000Z".
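
The two Nginx conversions can be cross-checked in Python, where `%z` plays the role of the Z pattern letter. This is a sketch for verification only; the function names are invented, and parsing the month abbreviation assumes an English or C locale.

```python
from datetime import datetime, timezone

NGINX_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def nginx_to_custom(timestamp):
    """Reformat an Nginx timestamp as MM/dd/yyyy HH:mm:ss Z, keeping its offset."""
    local = datetime.strptime(timestamp, NGINX_FORMAT)
    return local.strftime("%m/%d/%Y %H:%M:%S %z")


def nginx_to_iso_utc(timestamp):
    """Convert an Nginx timestamp to an ISO 8601 string in UTC."""
    local = datetime.strptime(timestamp, NGINX_FORMAT)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")


print(nginx_to_custom("30/Sep/2011:00:19:35 -0700"))
print(nginx_to_iso_utc("30/Sep/2011:00:19:35 -0700"))
```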

Converting Unix epoch to DateTime

Some log files have timestamps in Unix epoch format. If you want to transform them into DateTime, you can use the UnixToISO function:

DEFINE UnixToISO org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();

Here is an input file:

$ cat unixtime.txt
1320777563
1320777763
1320779563
1320787563

And here is a Pig script which converts the epoch into DateTime strings. Note that UnixToISO expects the epoch in milliseconds, and our input is in seconds, so we have to multiply each input value by 1000 to get to milliseconds:

UNIXTIMES = LOAD 's3://mybucket.com/unixtime.txt' as (unixtime:long);
D = FOREACH UNIXTIMES GENERATE UnixToISO(unixtime * 1000);
DUMP D;

(2011-11-08T18:39:23.000Z)
(2011-11-08T18:42:43.000Z)
(2011-11-08T19:12:43.000Z)
(2011-11-08T21:26:03.000Z)
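
The results above can be double-checked with a few lines of Python. This is a verification sketch, not part of the Pig script; note that `datetime.fromtimestamp` takes seconds, so no multiplication by 1000 is needed here. The function name `unix_to_iso` is made up.

```python
from datetime import datetime, timezone


def unix_to_iso(seconds):
    """Format a Unix epoch given in seconds as an ISO 8601 string in UTC."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.000Z")


for value in (1320777563, 1320777763, 1320779563, 1320787563):
    print(unix_to_iso(value))
```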

eGenix.com: eGenix mxODBC Zope DA 2.0.2 GA

Thu, 02/09/2012 - 04:00
eGenix is pleased to announce mxODBC Zope DA version 2.0.2, our new Zope/Plone ODBC Database Adapter, compatible with Plone 3.2 - 4.1, Zope 2.10 - 2.13 and Python 2.4 - 2.6 on all major platforms.

S. Lott: PDF Reading

Thu, 02/09/2012 - 03:00
PDF files aren't pleasant.

The good news is that they're documented (http://www.adobe.com/devnet/pdf/pdf_reference.html).

The bad news is that they're rather complex.

I found four Python packages for reading PDF files.
I elected to work with PDFMiner for two reasons.  (1) Pure Python, (2) Reasonably Complete.

This is not, however, much of an endorsement.  The implementation (while seemingly correct for my purposes) needs a fair amount of cleanup.
Here's one example of remarkably poor programming.
# Connect the parser and document objects.
parser.set_document(doc)
doc.set_parser(parser)

Only one of these two is needed; the other is trivially handled as part of the setter method.

Also, the package seems to rely on a huge volume of isinstance type checking.  It's not clear if proper polymorphism is even possible.  But some kind of filter that picked elements by type might be nicer than a lot of isinstance checks.
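
The kind of filter meant here could be as small as the generator below. It is a generic sketch, not part of PDFMiner; `of_type` and the two stand-in classes are invented for this example.

```python
def of_type(items, *types):
    """Yield only the items that are instances of one of `types`."""
    for item in items:
        if isinstance(item, types):
            yield item


# Stand-in classes playing the role of layout element types.
class TextBox:
    def __init__(self, text):
        self.text = text


class Figure:
    pass


layout = [TextBox("Title"), Figure(), TextBox("Body")]
texts = [box.text for box in of_type(layout, TextBox)]
print(texts)
```

The isinstance check still exists, but it lives in one place instead of being repeated at every call site.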

Annotation Extraction

While shabby, the good news is that PDFMiner seems to reliably extract the annotations on a PDF form.

In a couple of hours, I had this example of how to read a PDF document and collect the data filled into the form.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.psparser import PSLiteral
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdftypes import PDFObjRef
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

from collections import defaultdict, namedtuple

TextBlock= namedtuple("TextBlock", ["x", "y", "text"])

class Parser( object ):
    """Parse the PDF.

    1.  Get the annotations into the self.fields dictionary.

    2.  Get the text into a dictionary of text blocks.
        The key to the dictionary is page number (1-based).
        The value in the dictionary is a sequence of items in (-y, x) order.
        That is approximately top-to-bottom, left-to-right.
    """
    def __init__( self ):
        self.fields = {}
        self.text= {}

    def load( self, open_file ):
        self.fields = {}
        self.text= {}

        # Create a PDF parser object associated with the file object.
        parser = PDFParser(open_file)
        # Create a PDF document object that stores the document structure.
        doc = PDFDocument()
        # Connect the parser and document objects.
        parser.set_document(doc)
        doc.set_parser(parser)
        # Supply the password for initialization.
        # (If no password is set, give an empty string.)
        doc.initialize('')
        # Check if the document allows text extraction. If not, abort.
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Set parameters for analysis.
        laparams = LAParams()
        # Create a PDF page aggregator object.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Process each page contained in the document.
        for pgnum, page in enumerate( doc.get_pages() ):
            interpreter.process_page(page)
            if page.annots:
                self._build_annotations( page )
            txt= self._get_text( device )
            self.text[pgnum+1]= txt

    def _build_annotations( self, page ):
        for annot in page.annots.resolve():
            if isinstance( annot, PDFObjRef ):
                annot= annot.resolve()
                assert annot['Type'].name == "Annot", repr(annot)
                if annot['Subtype'].name == "Widget":
                    if annot['FT'].name == "Btn":
                        assert annot['T'] not in self.fields
                        self.fields[ annot['T'] ] = annot['V'].name
                    elif annot['FT'].name == "Tx":
                        assert annot['T'] not in self.fields
                        self.fields[ annot['T'] ] = annot['V']
                    elif annot['FT'].name == "Ch":
                        assert annot['T'] not in self.fields
                        self.fields[ annot['T'] ] = annot['V']
                        # Alternative choices in annot['Opt'] )
                    else:
                        raise Exception( "Unknown Widget" )
            else:
                raise Exception( "Unknown Annotation" )

    def _get_text( self, device ):
        text= []
        layout = device.get_result()
        for obj in layout:
            if isinstance( obj, LTTextBoxHorizontal ):
                if obj.get_text().strip():
                    text.append( TextBlock(obj.x0, obj.y1, obj.get_text().strip()) )
        text.sort( key=lambda row: (-row.y, row.x) )
        return text

    def is_recognized( self ):
        """Check for Copyright as well as Revision information on each page."""
        bottom_page_1 = self.text[1][-3:]
        bottom_page_2 = self.text[2][-3:]
        pg1_rev= "Rev 2011.01.17" == bottom_page_1[2].text
        pg2_rev= "Rev 2011.01.17" == bottom_page_2[0].text
        return pg1_rev and pg2_rev

This gives us a dictionary of field names and values.  Essentially transforming the PDF form into the same kind of data that comes from an HTML POST request.

An important part is that we don't want much of the background text.  Just enough to confirm the version of the form file itself.

The cryptic text.sort( key=lambda row: (-row.y, row.x) ) will sort the text blocks into order from top-to-bottom and left-to-right.  For the most part, a page footer will show up last.  This is not guaranteed, however.  In a multi-column layout, the footer can be so close to the bottom of a column that PDFMiner may put the two text blocks together.
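
A small stand-alone example may make the sort key less cryptic. The coordinates below are made up; they follow the PDF convention that y grows upwards from the bottom of the page, which is why the key negates y to get top-to-bottom order.

```python
from collections import namedtuple

TextBlock = namedtuple("TextBlock", ["x", "y", "text"])

# Invented positions: a title, two columns on the same line, and a footer.
blocks = [
    TextBlock(72, 40, "Rev 2011.01.17"),
    TextBlock(300, 700, "right column"),
    TextBlock(72, 700, "left column"),
    TextBlock(72, 750, "Title"),
]

# Larger y first (top of page), then smaller x first (left of page).
blocks.sort(key=lambda row: (-row.y, row.x))
ordered = [block.text for block in blocks]
print(ordered)
```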

The other unfortunate part is the extremely long (and opaque) setup required to get the data from the page.

Bit of Cheese: A couple of useful tools

Wed, 02/08/2012 - 22:04
logging_tree - introspect and display the logger tree inside the standard library's "logging" package. This could be an invaluable tool to discover what's really going on in your application's logging - and in particular perhaps why logging isn't working how you think it should.
hgtools - adds support for Mercurial in setuptools, both for the basics like listing the files under revision control (so find_packages and  include_package_data can do their work without needing explicit listings of files in MANIFEST.in) but also supporting pulling the version number from the repository tag so it doesn't have to be duplicated. The git equivalent appears to be setuptools-git (formerly known as gitlsfiles.)

Daniel Greenfeld: Python Web Summit Questions needed!

Wed, 02/08/2012 - 18:36
I've been fortunate enough to land a moderating spot at the invitation-only March 8th Python Web Summit that takes place at PyCon US 2012. The panel I'll be moderating covers the somewhat contentious issue of code reuse across different Python frameworks. I'm working on some questions, but I would love input from the entire Python Web Community.

If you've got questions to ask, or ideas to suggest, please post them on this google moderator link.

Johan Dahlin: Stoq 1.2

Wed, 02/08/2012 - 14:05

We released Stoq 1.2 last week. This release includes quite a few features:

Calendar application

It’s now possible to list payments, purchase orders and client calls in a graphical view:

 

It might look familiar: it uses the fantastic javascript library fullcalendar. We really wanted to use a normal GtkWidget for the calendar, but it would have been a lot more work to rip out half of evolution. If there are any other options that can match fullcalendar’s functionality, we’d be open to switching, as embedding WebKit, jQuery and fullcalendar in a Gtk+ application is not ideal.

Configurable keyboard shortcuts

This is something that has been requested many times over the years. It makes it easier to remap the keyboard bindings you use often to other keys, such as the function keys. There’s still an open task to redo all the existing keybindings that aren’t uniform enough.

Configurable form fields

Some companies do not use all the form fields (fax, anyone?) that we show by default, and Stoq now has a configuration interface where you can make fields non-mandatory and even hide them if you don’t wish to see them. Perfect for the first steps of localization.

New manual

One of our interns rewrote the old DocBook manual in Mallard, and it looks beautiful and is now well integrated into the application. You can find the online version here. It involved removing a lot of screenshots and text. It’ll be easier to update the manual in the future if there aren’t any screenshots. He also fixed the interface: there are now various help buttons in the application that go to a help section describing that part.

Localization support

It’s now possible to configure some of the fields that are specific to each region/country. The only thing that made it into this release was the company identification number (Brazil: CNPJ, Sweden: Organisationnr, US: Employer Identification Number), but the person identification number and the list of states have landed in the code repository since the release. We still need someone to step up and start doing the actual localization for this. Be the hero of the day: download Stoq and start localizing it!

Boleto Bancário (Bank invoice)

Brazilian banks support a kind of invoice with barcodes/numbers, called boleto bancário. It’s semi-standardized: most of the data is similar, but you need to special-case each bank that should be supported. There are two kinds, with and without cobrança (for eventually sending to a collection agency). There are a couple of hundred active banks and about 15 major ones. Stoq currently supports 7: Banco do Brasil, Banco Real, Banco Santander, Banco Bradesco, Caixa Econômica, Banrisul and Banco Itaú. All are without cobrança, though; support for that will come in a future release.

Call for volunteers

Stoq has initially been targeting the Brazilian market, since that’s what is close to the current development team. But there is no longer an excuse for not trying to use it. We can barely handle the legal part of Brazil, and we’d need volunteer help to make it possible to use in other countries. We’re very proud of the application, so we wouldn’t want to stop you just because you live outside of Brazil!

So why don’t you grab the code and get started? It’s all Python (and a tiny bit of JavaScript) and shouldn’t be hard to get into.

Don’t be discouraged that the web site and manual are only in Portuguese; we use gettext and Rosetta, and the code is modular and easy to understand.

We’ll need a lot of work to support localization in different countries, such as company/person formats, states, taxes and other things we don’t know about yet. Let us know and we’ll try to find a solution.

Just send me a mail or come in on our new shiny web chat: http://chat.stoq.com.br/ (aka #stoq on freenode)

Categories: FLOSS Project Planets

PyPy Development: Introductionary Article About RPython

Wed, 02/08/2012 - 13:45
Laurence Tratt from King's College London has written a long and detailed introduction to the goals and significance of RPython over on his blog. Laurie has been implementing his Converge Language in RPython in the last months. He is one of the first people external to the PyPy team who have pushed a sizeable RPython-based VM quite far, adding and tuning JIT hints. The post describes some of that work and his impressions of RPython and PyPy.

"RPython, to my mind, is an astonishing project. It has, almost single-handedly, opened up an entirely new approach to VM implementation. As my experience shows, creating a decent RPython VM is not a huge amount of work (despite some frustrations). In short: never again do new languages need come with unusably slow VMs. That the the PyPy / RPython team have shown that these ideas scale up to a fast implementation of a large, real-world language (Python) is another feather in their cap."
Categories: FLOSS Project Planets

Fabio Zadrozny: PyDev forums -> StackOverflow

Wed, 02/08/2012 - 11:33
The PyDev forums at SourceForge are now officially deprecated :)

So, anyone with a question about PyDev should now ask on StackOverflow and add a 'PyDev' tag.

I think this will be a real improvement over the current status quo... Some reasons I see for that are:

1. Many PyDev users follow StackOverflow and do answer things there, whereas in the PyDev forum, many questions were asked, but I was almost the only one answering... (I think the real plus here is the 'gaming' features that StackOverflow has, so more people are inclined to participate actively).

2. As people started asking there anyways, I really had to follow StackOverflow closely too, so, deprecating the PyDev forums means I'll be able to follow a single place again :)

3. Interacting with StackOverflow as a whole seems a nice improvement over the SourceForge forum (its editor is nicer, it accepts pictures, etc.)

And now on to something a bit unrelated... the PyDev homepage (http://pydev.org) is now being generated from a wiki (it's still a read-only wiki -- but hopefully that'll change soon -- but at least, I feel it'll be easier for me to edit things there and later have the homepage updated from it). So, if someone finds something strange in the homepage, please let me know :)


Categories: FLOSS Project Planets

Jonathan Street: Django and Scrapy

Wed, 02/08/2012 - 10:46

I'm currently working on a project which centres around pulling in data from an external website, "mashing" it up with some additional content, and then displaying it on a website.

The website is going to be interactive and reasonably complex, so I decided to use Django. There isn't a web service for acquiring the external data, so I'm stuck parsing HTML (and Excel spreadsheets, but that's a separate story). Scrapy seemed ideal for this, and although I wish I had used some approach other than XPath, it largely has been.

Having set up my database models in django and built my spider in scrapy the next step was putting the data from the spider in the database. There are plenty of posts detailing how to use the django ORM from outside a django project, even some specific to scrapy but they didn't seem to be working for me.

The issue was the way I handled development and production environment settings.

Read more . . .

Categories: FLOSS Project Planets

PyCon: PyCon US 2012: Getting the most out of PyCon (and a new Job Fair!)

Wed, 02/08/2012 - 09:44

PyCon 2012 will be the biggest PyCon yet. Amazing talks, tutorials, posters - robots - we are going to have it all for you. The volunteer team is working on welcoming committees, social events and many other things.

Each year there are quite a few new people, and with record attendance, we expect this year to be no different. So we thought that at this point it might be good to lay out the virtual welcome mat for everyone coming to PyCon and point out a few of the ways to make your PyCon unforgettable.

If we could point to just one thing that makes PyCon different, it is that at PyCon you come to contribute. If you want to have an extraordinary time and make PyCon your favorite conference all year, pick three of the items below, get involved, and contribute! Want to volunteer? Please sign up to pycon-organizers.

Stuff a Bag: For those who haven’t been to PyCon before, one of the most fun events takes place Wednesday evening.  Stand shoulder to shoulder with fifty or one hundred of your fellow Pythonistas to help stuff the attendee bags. Want to know who has the best swag? Want to see what people will be giving away in the Expo Hall? Want to just have fun? Come stuff bags.

Chair a Session: PyCon talks are arranged in groups of two or three, called sessions. (Look at the schedule to see what I mean). Session chairs help run the session, introduce speakers, call time, and help run the room for a short period of time. If you want to be in the front row at one particular talk, sign up to be session chair! There will be a sign-up board at PyCon.

Run a Race: Many Pythonistas are active runners. More are probably waiting for a kick in the pants to get up, get out the door, and start running. Well, here's your chance! Whether it's part of your regular training, a New Year's resolution, or whatever -- we hope you'll join us for the inaugural PyCon 5k.

Get a Job: A short while ago you may have seen a similar announcement for an online job board for our sponsors with open positions, located at https://us.pycon.org/2012/sponsors/jobs/. Sponsors have enjoyed this benefit and we think the community has as well. However, we’re taking this job fair one step further: into real life. On Sunday March 11 from 10:00 to 12:00, the expo hall will be running a job fair for all sponsors seeking to hire Python developers.

This job fair will run concurrently with the always excellent Poster Session, and will occur during the morning snack break. Grab a drink and a cookie and mingle with this year’s list of incredible sponsors, from small startups to big corporations, from the east coast to the west coast, local workers to telecommuters -- there are a lot of organizations to choose from. With 122 sponsors on board, we think you’d have trouble not finding a company that interests you.

Give a talk: One of PyCon’s traditions - one that we aren’t ashamed to admit that we picked up from the Perl community - is having Lightning Talks. Lightning Talks are five-minute, rapid-fire talks about something that interests you. Maybe you've never given a talk before, and you'd like to start small. For a Lightning Talk, you don't need to make slides, and if you do decide to make slides, you only need to make three. Sign up quickly, though - spots go fast.

Check out the Hallway Track: Many of the PyCon old-timers are most fond of the “hallway track” - the spontaneous meetings and discussions that occur when you bring together interesting, intelligent people (like all PyCon attendees!). There have been projects and businesses launched, friendships made, and problems solved in the hallways at PyCon.

Organize an Open Space: PyCon sets aside rooms for “Open Space” discussions and meetings. Anyone can lead an open space - just sign up for the room and the time slot and it is yours. Do you play an instrument? Each night at PyCon usually has a music jam open space. Want to work on a quick idea with someone? Follow up on a talk? Plan to take over the world? Open space.

Attend a BoF: Some of our open spaces have grown up into semi-regular Birds of a Feather (BoF) sessions. The best-known is probably the Testing in Python (TiP) BoF, but we usually also have Board Game BoFs, Science BoFs, Whiskey BoFs, Newbie BoFs, “Teach me” BoFs, and many more.

Sprint: If you are still making your traveling plans, one of the best ways to take advantage of PyCon is attending the sprints. Development sprints are a key part of PyCon, a chance for the contributors to open-source projects to get together face-to-face for up to four days of intensive learning, development and camaraderie. Newbies sit with gurus, go out for lunch and dinner together, and have a great time while advancing their project. Have you ever wanted to hack on Python-core? Twisted? Django? SciPy? The leaders of each project will be there during the sprints, and you will be able to contribute in a meaningful way.

Sponsor PyCon: Ok, we had to say it. There are over 120 companies sponsoring PyCon, the most yet. We have filled up the Expo Hall, but you can still show your support (and participate in the Job Fair) with your sponsorship. If you are still considering sponsoring PyCon - now is the time to reach out to us - jnoller@python.org!

Come contribute to PyCon. It will be your favorite conference all year.

Categories: FLOSS Project Planets

Matt Harrison: Utah Python Feb 2012

Wed, 02/08/2012 - 08:10

Utah Python will be meeting on Thursday, Feb 9th at 7pm. Amji will be doing a short presentation on a memoization decorator and Eric will be giving a preview of his PyCon talk “Interfaces and Python”. Cheers.

Categories: FLOSS Project Planets

John Cook: Example of not inverting a matrix: optimization

Wed, 02/08/2012 - 08:00

People are invariably surprised when they hear it’s hardly ever necessary to invert a matrix. It’s very often necessary to solve linear systems of the form Ax = b, but in practice you almost never do this by inverting A. This post will give an example of avoiding matrix inversion. I will explain how the Newton-Conjugate Gradient method works, implemented in SciPy by the function fmin_ncg.

If a matrix A is large and sparse, it may be possible to solve Ax = b but impossible to even store the matrix A⁻¹ because there isn’t enough memory to hold it. Sometimes it’s sufficient to be able to form matrix-vector products Ax. Notice that this doesn’t mean you have to store the matrix A; you have to produce the product Ax as if you had stored the matrix A and multiplied it by x.

Very often there are physical reasons why the matrix A is sparse, i.e. most of its entries are zero and there is an exploitable pattern to the non-zero entries. There may be plenty of memory to store the non-zero elements of A, even though there would not be enough memory to store the entire matrix. Also, it may be possible to compute Ax much faster than it would be if you were to march along the full matrix, multiplying and adding a lot of zeros.

Iterative methods of solving Ax = b, such as the conjugate gradient method, create a sequence of approximations that converge (in theory) to the exact solution. These methods require forming products Ax and updating x as a result. They can be very useful for a couple of reasons.

  1. You only have to form products of a sparse matrix and a vector.
  2. If you don’t need a very accurate solution, you may be able to stop very early.
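The two points above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the post: the tridiagonal test matrix (2 on the diagonal, -1 beside it) and the names laplacian_matvec and conjugate_gradient are assumptions chosen for the example. The solver never sees the matrix, only a function that returns Ax, and the max_iter argument lets you stop early when a rough answer is enough.

```python
def laplacian_matvec(x):
    """Return A*x for a tridiagonal A (2 on the diagonal, -1 beside it).

    The matrix is a hypothetical example and is never stored.
    """
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Approximately solve A x = b for symmetric positive definite A.

    Only products A*p are needed, supplied by the matvec callback.
    """
    n = len(b)
    if max_iter is None:
        max_iter = n
    x = [0.0] * n
    r = list(b)          # residual b - A x for the starting guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break        # residual is small enough, stop early
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


n = 50
b = [1.0] * n
x = conjugate_gradient(laplacian_matvec, b, max_iter=200)
residual = [bi - yi for bi, yi in zip(b, laplacian_matvec(x))]
print("residual norm:", dot(residual, residual) ** 0.5)
```

Passing a small max_iter gives a cruder solution for less work, which is exactly the trade-off an optimizer can exploit.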

In Newton’s optimization method, you have to solve a linear system in order to find a search direction. In practice this system is often large and sparse. The ultimate goal of Newton’s method is to minimize a function, not to find perfect search directions. So you can save time by finding only approximate solutions to the problem of finding search directions. Maybe an exact solution would in theory take 100,000 iterations, but you can stop after only 10 iterations! This is the idea behind the Newton-Conjugate Gradient optimization method.

The function scipy.optimize.fmin_ncg can take as an argument a function fhess that computes the Hessian matrix H of the objective function. But more importantly, it lets you provide instead a function fhess_p that computes the product of H with a vector. You don’t have to supply the actual Hessian matrix because the fmin_ncg method doesn’t need it. It only needs a way to compute matrix-vector products Hx to find approximate Newton search directions.
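As a rough sketch of what passing fhess_p looks like, the example below minimizes a made-up quadratic whose Hessian is tridiagonal. The objective, the helper hess_times and the problem size are assumptions for illustration only; the Hessian is never formed as a matrix, and fmin_ncg only ever asks for its product with a vector.

```python
import numpy as np
from scipy.optimize import fmin_ncg

n = 100
b = np.ones(n)


def hess_times(v):
    """Product H*v for a hypothetical tridiagonal H (2 on the diagonal, -1 beside it)."""
    out = 2.0 * v
    out[:-1] -= v[1:]
    out[1:] -= v[:-1]
    return out


def f(x):
    # Example objective: f(x) = 0.5 * x'Hx - b'x, minimized where Hx = b.
    return 0.5 * x.dot(hess_times(x)) - b.dot(x)


def fprime(x):
    # Gradient of f: Hx - b.
    return hess_times(x) - b


def fhess_p(x, p):
    # Hessian-vector product; for a quadratic it does not depend on x.
    return hess_times(p)


x_min = fmin_ncg(f, np.zeros(n), fprime, fhess_p=fhess_p,
                 avextol=1e-8, disp=0)
print("largest gradient entry:", np.abs(fprime(x_min)).max())
```

For a real problem the Hessian would change with x, but the shape of the call stays the same: supply f, its gradient, and a function that multiplies the Hessian by a vector.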

For more information, see the SciPy documentation for fmin_ncg.

Categories: FLOSS Project Planets

Mike C. Fletcher: Corner cases do crop up, don't they?

Tue, 02/07/2012 - 17:30
So I've been playing with cutting down OpenGLContext into something like a modern scenegraph engine.  The first step there is to eliminate the old tree-traversal rendering mechanism, as the "flat" rendering engine is both simpler and much more easily optimized.  No big problem, really.  A lot of OpenGLContext's demos/tests just ran with only minimal changes. A few of the very old ones were using the customization points (e.g. Background) that were dependent on the old rendering model, but they could generally be ported to the new model by moving a few lines of code into their "Render" method.  The surprising corner case was one of the most recent tutorials, namely shadows.  As I was writing that tutorial I took a shortcut by using the legacy visitor in the middle of the rendering process to traverse and output the geometry for each sub-pass, and the modifications to the flat renderer mean that the "Context" is now a rendering node... which means the flat renderer includes it in the set of things to render when I do a query on the scenegraph for what should be rendered... cue infinite recursion.  Oh well, gives me something to work on tomorrow.

Categories: FLOSS Project Planets