Python


Facebook development team has open-sourced their light-weight, cross-language development platform, Thrift.

I played around it with it for a while and it looks interesting enough, but I don’t have an immediate need for it.

Check it out if you have a cross-language environment Thrift currently supports C++, Java, Ruby, Python and PHP - a list that should satisfy most everyone.

A POSIX-compliant *NIX system is a requirement, but I’d be curious if it’s possible to get it up n’ running inside of Cygwin (as a development-only exercise, of course).

I’ve lost most of my respect for Aaron Swartz over this password debacle.

See more coverage here: Never store passwords in a database!

Earlier this morning, I was looking for a way to look up Python function definitions inside of the interpreter.

My work buddy (that would be you, gfast) came to the rescue by pointing out the (very handy) __doc__ built-in attribute, which returns Docstring of the function it’s attached to.

Sweet!

It’s always annoyed me that Javadoc is not compiled into the *.class files - it’s nice to see that Guido didn’t make the same mistake with Python.

>>> dir(string)
[’__builtins__’, ‘__doc__’, ‘__file__’, ‘__name__’, ‘_float’, ‘_idmap’, ‘_idmapL’, ‘_int’, ‘_long’, ‘ascii_letters’, ‘as
cii_lowercase’, ‘ascii_uppercase’, ‘atof’, ‘atof_error’, ‘atoi’, ‘atoi_error’, ‘atol’, ‘atol_error’, ‘capitalize’, ‘capw
ords’, ‘center’, ‘count’, ‘digits’, ‘expandtabs’, ‘find’, ‘hexdigits’, ‘index’, ‘index_error’, ‘join’, ‘joinfields’, ‘le
tters’, ‘ljust’, ‘lower’, ‘lowercase’, ‘lstrip’, ‘maketrans’, ‘octdigits’, ‘printable’, ‘punctuation’, ‘replace’, ‘rfind
‘, ‘rindex’, ‘rjust’, ‘rstrip’, ’split’, ’splitfields’, ’strip’, ’swapcase’, ‘translate’, ‘upper’, ‘uppercase’, ‘whitesp
ace’, ‘zfill’]

>>> string.upper.__doc__
‘upper(s) -> string\n\n Return a copy of the string s converted to uppercase.\n\n ‘

To enable __doc__ support in custom methods, you need to surround comments with “”"triple double quotes”"” and place them at the top of the methods.

def function(a, b):
“”"Do some funky shit and return a list.”"”

I found myself questioning why others are looping using xrange, while I’ve been using the boring, non-x range.

Gotta love the fact that I was able to use the built-in in “help” module in Python to answer this quickly.

>>> help(xrange)
Help on class xrange in module __builtin__:

class xrange(object)
| xrange([start,] stop[, step]) -> xrange object
|
| Like range(), but instead of returning a list, returns an object that
| generates the numbers in the range on demand. For looping, this is
| slightly faster than range() and more memory efficient.

Parsing a CSV file with a simple string.split() was going along swimmingly…

for line in open("samples/sample.csv"):
    title, year, director = line.split(",")
    print year, title

…until I ran into a double-quoted string in the data file I was parsing. Thanks to Python’s CVS module (which I should have been using all along) all parsing activities are back to normal.

import csv
reader = csv.reader(open("samples/sample.csv"))
for title, year, director in reader:
    print year, title

One of my Django views required a two-column display, which required simultaneous iteration over two lists.

I didn’t find a clean way to handle this in Django, so I ended up creating a compound list of pairs to back the display.

Here’s my Python code to merge N lists into a list of lists with N entries. There is probably a way to handle this much cleaner - let me know if see how.

'''
merge lists into a list of lists by index
'''
def __merge_lists(*lists):
    list_of_lists = []

    lengths = [len(list) for list in lists]
    for idx in range(max(lengths)):
        current_idx_list = []
        for list in lists:
            if (idx < len(list)):
                current_idx_list.append(list[idx])
            else:
                current_idx_list.append(None)
        list_of_lists.append(current_idx_list)

    return list_of_lists

One of the more unique features of Python is the ability to collect all excess parameters passed to a method in a tuple (positional arguments) or a dictionary (keyword arguments).

This last method parameter often follows the “**kwargs” convention in the code I’ve run into.

When a final formal parameter of the form **name is present, it receives a dictionary containing all keyword arguments except for those corresponding to a formal parameter. This may be combined with a formal parameter of the form *name (described in the next subsection) which receives a tuple containing the positional arguments beyond the formal parameter list. (*name must occur before **name.) For example, if we define a function like this:

def cheeseshop(kind, *arguments, **keywords):
print "-- Do you have any", kind, '?'
print "-- I'm sorry, we're all out of", kind
for arg in arguments: print arg
print '-'*40
keys = keywords.keys()
keys.sort()
for kw in keys: print kw, ':', keywords[kw]

It could be called like this:

cheeseshop('Limburger', "It's very runny, sir.",
"It's really very, VERY runny, sir.",
client='John Cleese',
shopkeeper='Michael Palin',
sketch='Cheese Shop Sketch')

and of course it would print:

– Do you have any Limburger ?
– I’m sorry, we’re all out of Limburger
It’s very runny, sir.
It’s really very, VERY runny, sir.
—————————————-
client : John Cleese
shopkeeper : Michael Palin

sketch : Cheese Shop Sketch

Looking for a decent Python parser for BBCode?

Luke Plant’s bbcode.py is in Zyon’s Subversion depot and seems do the trick.

BBCode is an abbreviation for Bulletin Board Code, the markup language used to format posts in many message boards. The available tags are usually indicated by rectangular brackets surrounding a keyword, and they are parsed by the message board system before being translated into a markup language the web browsers understands, usually HTML or XHTML.

One of the more unique Python idioms is the Decorate-Sort-Undecorate (DSU) sorting method used to optimize sorting performance.

PythonWiki describes it’s function as a combination of three steps:

  • First, the initial list is decorated with new values that control the sort order.
  • Second, the decorated list is sorted.
  • Finally, the decorations are removed, creating a list that contains only the initial values in the new order.

>>> words = “This is a test string from Andrew.”.split()
>>> deco = [ (word.lower(), i, word) for i, word in enumerate(words) ]
>>> deco.sort()
>>> new_words = [ word for _, _, word in deco ]
>>> print new_words
[’a', ‘Andrew.’, ‘from’, ‘is’, ’string’, ‘test’, ‘This’]

It would nice if the Python language found a way to support this type of efficiency gain in one-line syntax, but for now this is what you’ll have to do to keep your sorts optimized.

While building a feed aggregator into a project of mine, I ran into a usage of a HTTP header unfamiliar to me before today.

From the documentation of the (very excellent) python feedparser project:

ETags and Last-Modified headers are two ways that feed publishers can save bandwidth, but they only work if clients take advantage of them. Universal Feed Parser gives you the ability to take advantage of these features, but you must use them properly.

The basic concept is that a feed publisher may provide a special HTTP header, called an ETag, when it publishes a feed. You should send this ETag back to the server on subsequent requests. If the feed has not changed since the last time you requested it, the server will return a special HTTP status code (304) and no feed data.

Following warning (from the feedparser documentation) should drive the significance of these headers home:

If you do not support ETag and Last-Modified headers, you will repeatedly download feeds that have not changed. This wastes your bandwidth and the publisher’s bandwidth, and the publisher may ban you from accessing their server.

Django was kicking my ass tonight, as I struggled to figure out how to reference HttpRequest parameters in templates. It’s 2:00AM and I finally arrived at an answer - default TEMPLATE_CONTEXT_PROCESSORS in settings.py is missing “django.core.context_processors.request”. You can override the settings by explicitly setting it in settings.py:

TEMPLATE_CONTEXT_PROCESSORS = (
“django.core.context_processors.auth”,
“django.core.context_processors.debug”,
“django.core.context_processors.i18n”,
“django.core.context_processors.request”,
)

Above settings change enables HttpRequest by this function in the context_processor.py:

def request(request):
return {’request’: request}

One of my co-workers recently called TreeMaps “hashes”, which is imprecise at best, but can be easily excused as a by-product of his Pythonista upbringing.

java.util.TreeMap, in fact, is implemented as a Red-Black tree - a data structure that has not had a definition in my brain for years. As a refresher course, I decided to Wikipedia the B-Tree family and blog the highlights.

First, the BINARY TREE:

A binary tree is a tree data structure in which each node has at most two children.

  • A directed edge connects the parent to the child.
  • A node that has no children is called a leaf.
  • The depth of a node n is the length of the path from the root to the node. The set of all nodes at a given depth is sometimes called a level of the tree.
  • The height of a node n is the length of the path from the node n to its furthest leaf.
  • Nodes that share parents are called siblings.
  • If a path exists from node p to node q, then p is an ancestor of q and q is a descendant of p.
  • The size of a node is the number of descendants it has including itself.

Second, a BINARY SEARCH TREE:

In computer science, a binary search tree (BST) is a binary tree which has the following properties:

  • Each node has a value.
  • A total order is defined on these values.
  • The left subtree of a node contains only values less than or equal to the node’s value.
  • The right subtree of a node contains only values greater than or equal to the node’s value.

The major advantage of binary search trees is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient.

Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets, multisets, and associative arrays.

If a BST allows duplicate values, then it represents a multiset. This kind of tree uses non-strict inequalities, so everything in the left subtree of a node is less than or equal to the value of the node, and everything in the right subtree is greater than or equal to the value of the node.

If a BST doesn’t allow duplicate values, then the tree represents a set with unique values, like the mathematical set. Trees without duplicate values use strict inequalities, meaning that the left subtree of a node only contains nodes with values that are less than the value of the node, and the right subtree only contains values that are greater.

Third, a SELF-BALANCING BINARY SEARCH TREE:

In computer science, a self-balancing binary search tree or height-balanced binary search tree is a binary search tree that attempts to keep its height, or the number of levels of nodes beneath the root, as small as possible at all times, automatically. They are one of the most efficient ways of implementing associative arrays, sets, and other data structures.

Most operations on a binary search tree take time directly proportional to the tree’s height, so it is desirable to keep the height small. Ordinary binary search trees have the primary disadvantage that they can attain very large heights in rather ordinary situations, such as when the keys are inserted in order. The result is a data structure similar to a linked list, making all operations on the tree expensive. If we know all the data ahead of time, we can keep the height small on average by adding values in a random order, but we don’t always have this luxury, particularly in on-line algorithms.

Self-balancing binary trees instead solve this problem by performing transformations on the tree, such as tree rotations, at key times, in order to reduce the height. Although a certain overhead is involved, it is justified in the long run by drastically decreasing the time of later operations. Note that the height must always be at least the ceiling of log n, since there are at most 2k nodes on the kth level; a complete or full binary tree has exactly this many levels. Balanced BSTs are not always so precisely balanced, since it can be expensive to keep a tree at minimum height at all times; instead, they keep the height within a constant factor of this lower bound.

Finally, a RED-BLACK TREE:

A red-black tree is a binary search tree where each node has a color attribute, the value of which is either red or black. In addition to the ordinary requirements imposed on binary search trees, we make the following additional requirements of any valid red-black tree:

  1. A node is either red or black.
  2. The root is black.
  3. All leaves are black.
  4. Both children of every red node are black. (i.e. Every red node must have a black parent.)
  5. All paths from any given node to its leaf nodes contain the same number of black nodes.

These constraints enforce a critical property of red-black trees: that the longest possible path from the root to a leaf is no more than twice as long as the shortest possible path. The result is that the tree is roughly balanced. Since operations such as inserting, deleting, and finding values requires worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red-black trees to be efficient in the worst-case, unlike ordinary binary search trees.

To see why these properties guarantee this, it suffices to note that no path can have two red nodes in a row, due to property 4. The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes. Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than twice as long as any other path.

Next Page »