2015

This story should have been published on 13th September, the Day of the Programmer. But I always forget about this holiday. So I have done it again this year.

It happened about 15 years ago. I was a student and worked on a game with my friends. The game was a multiplayer sci-fi turn-based strategy. There were no graphics, just text mode. And its gameplay was endless. The game was written in Turbo Pascal and ran on a PC with an Intel 80386 processor. Golden times!

Once we had to implement a function that generates names for new weapons. Because the gameplay was endless, the upgrade process was endless too. So the function had to generate a new name on each call. The idea was to join a couple of random prefixes, a random string, and a random suffix. The prefixes and suffix were selected from predefined lists. The prefix list looked like this: hyper, mega, plasma, etc. And the suffix list was like this: gun, cannon, blaster, rifle, etc. The middle string was a random combination of consonant syllables.
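
The original was written in Turbo Pascal, of course, but a rough Python sketch of the idea could look like this (the word lists here are purely illustrative):

import random

# Purely illustrative lists; the real ones were longer
PREFIXES = ['hyper', 'mega', 'super', 'plasma', 'quantum']
SUFFIXES = ['gun', 'cannon', 'blaster', 'rifle', 'launcher']
SYLLABLES = ['ka', 'zor', 'dek', 'mo', 'rax', 'vi']


def weapon_name():
    # A couple of random prefixes...
    prefixes = [random.choice(PREFIXES) for _ in range(2)]
    # ...a random middle string built from consonant syllables...
    middle = ''.join(random.choice(SYLLABLES) for _ in range(random.randint(2, 4)))
    # ...and a random suffix
    parts = prefixes + [middle, random.choice(SUFFIXES)]
    return ' '.join(word.capitalize() for word in parts)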

So the function was done, and we ran a test. And in the first dozen names, it printed out the name of probably the most powerful weapon in the whole known universe: Super Megadick Launcher.

The test was passed. The game, unfortunately, was never finished.

If you develop applications that use untrusted input, you deal with validation, no matter which framework or library you are using. It is a common task. So I am going to share a recipe that neatly integrates the validation layer with the business logic one. It is not about what data to validate and how to validate it; it is mostly about how to make the code look better using Python decorators and magic methods.

Let’s say we have a class User with a method login.

class User(object):

    def login(self, username, password):
        ...

And we have a validation schema Credentials. I use Colander, but it does not matter. You can simply replace it with your favorite library:

import colander


class Credentials(colander.MappingSchema):

    username = colander.SchemaNode(
        colander.String(),
        validator=colander.Regex(r'^[a-z0-9\_\-\.]{1,20}$')
    )
    password = colander.SchemaNode(
        colander.String(),
        validator=colander.Length(min=1, max=100),
    )

Each time you call login with untrusted data, you have to validate the data using the Credentials schema:

user = User()
schema = Credentials()

trusted_data = schema.deserialize(untrusted_data)
user.login(**trusted_data)
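
By the way, if the input does not match the schema, deserialize raises colander.Invalid, so in real code the call is usually wrapped into a try block. A minimal sketch:

try:
    trusted_data = schema.deserialize(untrusted_data)
except colander.Invalid as e:
    # e.asdict() maps field names to error messages;
    # report them back to the caller here
    errors = e.asdict()
else:
    user.login(**trusted_data)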

The extra code is a trade-off for flexibility. Such methods can also be called with trusted data. So we can’t just put validation into the method itself. However, we can bind the schema to the method without losing flexibility.

First, create a validation package with the following structure:

myproject/
    __init__.py
    ...
    validation/
        __init__.py
        schema.py

Then add the following code to myproject/validation/__init__.py (again, the usage of cached_property is an inessential detail; you can use a similar decorator provided by your favorite framework):

from cached_property import cached_property

from . import schema


def set_schema(schema_class):
    def decorator(method):
        method.__schema_class__ = schema_class
        return method
    return decorator


class Mixin(object):

    class Proxy(object):

        def __init__(self, context):
            self.context = context

        def __getattr__(self, name):
            # Get the original method and its bound schema class
            method = getattr(self.context, name)
            schema = method.__schema_class__()

            def validated_method(params):
                # Validate the params first, then call the original method
                params = schema.deserialize(params)
                return method(**params)

            validated_method.__name__ = 'validated_' + name
            # Cache the wrapper, so ``__getattr__`` runs only once per method
            setattr(self, name, validated_method)
            return validated_method

    @cached_property
    def validated(self):
        return self.Proxy(self)

There are three public objects: the schema module, the set_schema decorator, and the Mixin class. The schema module is a container for all validation schemata; place the Credentials class into this module. The set_schema decorator simply adds the passed validation schema to the decorated method as the __schema_class__ attribute. The Mixin class adds the proxy object validated. The proxy provides access to the methods that have a __schema_class__ attribute and lazily creates their copies wrapped in a validation routine. This is how it works:

from myproject import validation


class User(validation.Mixin):

    @validation.set_schema(validation.schema.Credentials)
    def login(self, username, password):
        ...

Now, we can call the validated login method with a single line of code:

user = User()

user.validated.login(untrusted_data)

So here is what we get: the code is more compact; it is still flexible, i.e. we can still call the method without validation; and it is more readable and self-documenting.
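
For example, when the data comes from a trusted source, nothing prevents us from calling the method directly:

# Trusted data, e.g. produced by our own code: no validation overhead
user.login(username='john', password='secret')

# Untrusted data, e.g. coming from an HTTP request: validated first
user.validated.login(untrusted_data)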

I hate writing documentation, but I have to. Good, up-to-date documentation significantly reduces the effort of onboarding new teammates. And of course, nobody would use even perfect open source code without documentation. So I have written plenty of documents. Most of them are miserable. But I have tried to find a way to make them better. And it seems I have found the main mistake I was making.

Typical documentation consists of three parts:

  • The getting started guide describes the main features and principles.
  • The advanced usage guide describes each feature in detail.
  • The internals or API documentation describes low-level things, i.e. particular modules, classes, and functions. It is usually generated from the docstrings in the sources.

I used to write documentation in that order: the getting started tutorial, then the advanced section, and finally the internals. Don’t do that. If you want to write good documentation, you have to write it in the opposite order.

This is how it works. The most important thing in any documentation is cross-linking. When you describe a feature that consists of a number of smaller ones, you have to link each mention of a smaller feature to its full description. That is why the internal documentation generated from docstrings is your foundation. It is quite easy to document a particular function or class (lazy developers assume it is enough). So when you describe how the things work together, you can link each mention of a particular thing to its own documentation, instead of overburdening the entire description with details. The same works for the getting started tutorial: it must be concise, but there must be links to the full description of each feature it mentions.

There is no magic. This technique just makes the documentation writing process more productive and fun. Use it to make your documents better and your users happier.

PasteDeploy is a great tool for managing WSGI applications. Unfortunately, it has no support for configuration formats other than INI files. Montague is going to solve the problem, but its documentation is unfinished and says nothing useful. I hope this will change soon. But if you don’t want to wait, as I didn’t, the following recipe is for you.

Using ConfigTree on my current project, I stumbled upon a problem: how to serve Pyramid applications (I have three of them) from a custom configuration? Here is how it looks in YAML:

app:
    use: "egg:MyApp#main"
    # Application local settings go here
filters:
    -
        use: "egg:MyFilter#filter1"
        # Filter local settings go here
    -
        use: "egg:MyFilter#filter2"
server:
    use: "egg:MyServer#main"
    # Server local settings go here

The easy way is to build an INI file and use it. The hard way is to make my own loader. I chose the hard one.

PasteDeploy provides the public functions loadapp, loadfilter, and loadserver. However, these functions did not work for me, because they don’t accept local settings. Only the global configuration can be passed in:

app = loadapp('egg:MyApp#main', global_conf=config)

But most PasteDeploy-based applications simply ignore global_conf. For example, here is the paste factory of Waitress:

def serve_paste(app, global_conf, **kw):
    serve(app, **kw)        # global_conf? Who needs this shit?
    return 0

I dug around the sources of PasteDeploy and found the loadcontext function. It is a kind of low-level private function. But who cares? So here is the source of a loader that uses it:

from paste.deploy.loadwsgi import loadcontext, APP, FILTER, SERVER


def run(config):

    def load_object(object_type, conf):
        conf = conf.copy()
        spec = conf.pop('use')
        context = loadcontext(object_type, spec)    # Loading object
        context.local_conf = conf                   # Passing local settings
        return context.create()

    app = load_object(APP, config['app'])
    if 'filters' in config:
        for filter_conf in config['filters']:
            filter_app = load_object(FILTER, filter_conf)
            app = filter_app(app)
    server = load_object(SERVER, config['server'])
    server(app)
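
To try the loader out, it is enough to parse the YAML above into a plain mapping and pass it to run. For illustration I use PyYAML here and assume the configuration is stored in config.yml; in my case the mapping actually comes from ConfigTree:

import yaml

# Parse the YAML configuration shown above into a plain dict
with open('config.yml') as f:
    config = yaml.safe_load(f)

run(config)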

But that is not the end. Pyramid comes with its own command, pserve, which uses PasteDeploy to load and start up an application from an INI file. And there is an option of this command that makes development fun: I mean the --reload one. It starts a separate process with a file monitor that restarts your application when its sources are changed. The following code provides the same feature. It depends on Pyramid, because I don’t want to reinvent the wheel. But if you use another framework, it won’t be hard to write your own file monitor.

import sys
import os
import signal
from subprocess import Popen

from paste.deploy.loadwsgi import loadcontext, APP, FILTER, SERVER
from pyramid.scripts.pserve import install_reloader, kill


def run(config, with_reloader=False):

    def load_object(object_type, conf):
        conf = conf.copy()
        spec = conf.pop('use')
        context = loadcontext(object_type, spec)
        context.local_conf = conf
        return context.create()

    def run_server():
        app = load_object(APP, config['app'])
        if 'filters' in config:
            for filter_conf in config['filters']:
                filter_app = load_object(FILTER, filter_conf)
                app = filter_app(app)
        server = load_object(SERVER, config['server'])
        server(app)

    if not with_reloader:
        run_server()
    elif os.environ.get('master_process_is_running'):
        # Pass your configuration files here using ``extra_files`` argument
        install_reloader(extra_files=None)
        run_server()
    else:
        print("Starting subprocess with file monitor")
        environ = os.environ.copy()
        environ['master_process_is_running'] = 'true'
        childproc = None
        try:
            while True:
                try:
                    childproc = Popen(sys.argv, env=environ)
                    exitcode = childproc.wait()
                    childproc = None
                    if exitcode != 3:
                        return exitcode
                finally:
                    if childproc is not None:
                        try:
                            kill(childproc.pid, signal.SIGTERM)
                        except (OSError, IOError):
                            pass
        except KeyboardInterrupt:
            pass

That’s it. Wrap the code in a console script and don’t forget to initialize logging.
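
A minimal sketch of such a script might look as follows; load_config here is a hypothetical function standing for whatever builds your configuration mapping (ConfigTree in my case), and the module paths are made up too:

import logging
import sys

from myproject.config import load_config   # hypothetical: builds the config mapping
from myproject.serving import run          # the run() function shown above


def main():
    # Initialize logging before starting the server
    logging.basicConfig(level=logging.INFO)
    config = load_config()
    return run(config, with_reloader='--reload' in sys.argv)


if __name__ == '__main__':
    sys.exit(main())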

I have just released ConfigTree. It is my longest project: it took more than two and a half years from the first commit to the release. But the history of the project is much longer.

The idea came from the “My Health Experience” project. It was a great project I worked on; unfortunately, it is closed now. My team started with a small forum and ended up with a full-featured social network. We had a single server at the start and a couple of clusters at the end. A handful of configuration files grew into a directory with dozens of them, which described all subsystems in all possible environments. Each module of the project had dozens of calls to the configuration registry. And we developed a special tool to manage the settings.

This is how it worked. An environment name was a dot-separated string in the format group.subgroup.environment. For instance, prod.cluster-1.server-1 was the environment name of the first server from the first cluster of the production environment; and dev.kr41 was the name of my development environment. The configuration directory contained a tree of subdirectories, where each subdirectory was named after a part of some environment name. For example:

config/
    prod/
        cluster-1/
            server-1/
    dev/
        kr41/

The most common configuration options were defined at the root of the tree, the most specific ones at the leaves. For example, the config/prod directory contained files with common production settings; config/prod/cluster-1 contained common settings for all servers of the first cluster; and config/prod/cluster-1/server-1 contained the concrete settings for the first server. On startup the files were merged by a loader into a single mapping object using the passed environment name, so common settings were overridden by the concrete ones during the loading process, and we did not have to copy-paste in our configuration files. If an option was shared by a number of environments, it was defined in the group settings.

There was also post-loading validation, which helped us to use safe defaults. For instance, when each server had to use its own cryptographic key, the key was defined on the group level with an empty default value, which was required to be overridden. The validator raised an exception on startup when it found this empty value in the resulting configuration. Because of this we never deployed our application to production with unsafe settings.
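
To illustrate the idea, here is a simplified sketch of the merge for the environment name prod.cluster-1.server-1; the option names are made up, and each mapping stands for the files of the corresponding directory level:

# config/                         -- the most common settings
common = {'debug': False, 'secret_key': ''}
# config/prod/                    -- common production settings
prod = {'db_host': 'db.example.com'}
# config/prod/cluster-1/          -- settings shared by the cluster
cluster_1 = {'memcached_host': 'mc-1.example.com'}
# config/prod/cluster-1/server-1/ -- concrete per-server settings
server_1 = {'secret_key': 'per-server-secret'}

result = {}
for level in (common, prod, cluster_1, server_1):
    result.update(level)    # more specific settings override the common ones

# The empty ``secret_key`` default is overridden here; otherwise the
# post-loading validator would refuse to start the application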

The tool was so useful that when I started to use Python I tried to find something similar. Yep, “My Health Experience” was written in PHP, and it was the last PHP project I worked on. My search was unsuccessful, and I kept reinventing such a tool on each of my projects. So I eventually decided to rewrite and release it as an open-source project. And here it is.

I added some flexibility and extensibility to the original ideas. Each step of the configuration loading process can be customized or replaced by your own implementation. It also comes with a command line utility, which can be used to build the configuration as a single JSON file. So you can even use it within a non-Python project: a JSON parser is all you need. I hope the tool is able to solve a lot of problems and can be useful for different kinds of projects. Try it out and send me your feedback. As for me, I am going to integrate it into my current project right now.

The Requests library is a de facto standard for handling HTTP in Python. Each time I have to write a crawler or a REST API client, I know what to use. I have made a dozen of them during the last couple of years. And each time I stumbled upon one frustrating thing: requests.exceptions.ConnectionError, which is unexpectedly raised with the message error(111, 'Connection refused') after 3–5 hours of client uptime, when the remote service works well and stays available.

I don’t know for sure why it happens. I have a couple of theories, but essentially they are all about the imperfect world we live in. A connection may die or hang. A highly loaded web server may refuse a request. Packets may be lost. Long story short, shit happens. And when it happens, the default Requests settings will not be enough.

So if you are going to make a long-lived process which uses some services via Requests, you should change its default settings in this way:

from requests import Session
from requests.adapters import HTTPAdapter


session = Session()
session.mount('http://', HTTPAdapter(max_retries=5))
session.mount('https://', HTTPAdapter(max_retries=5))

HTTPAdapter performs only one try by default and raises ConnectionError on failure. I started with two tries, and empirically found that five gives 100% resistance against short-term downtimes.
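
After that the session is used as usual. Note that, according to the Requests documentation, max_retries applies only to failed DNS lookups, socket connections, and connection timeouts, not to requests where the data has already reached the server. For example (the URL is made up):

# Retries happen transparently on connection failures
response = session.get('https://api.example.com/endpoint', timeout=30)
response.raise_for_status()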

I am not sure whether it is a bug or a feature of Requests. But I have never seen these default settings changed in Requests-based libraries like Twitter or Facebook API clients. And I got such errors using these libraries too. So if you are using such a library, examine its code. Now you know how to fix it. Thanks to Python’s design, there are no true private members.
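
For example, if a client library keeps a requests.Session instance in a public-looking attribute, you can mount the adapters onto it from outside. A sketch, where client_library and its session attribute are hypothetical:

from requests.adapters import HTTPAdapter

import client_library   # hypothetical third-party API client


client = client_library.Client()
# ``session`` is a hypothetical attribute holding a ``requests.Session``
client.session.mount('http://', HTTPAdapter(max_retries=5))
client.session.mount('https://', HTTPAdapter(max_retries=5))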

Unfortunately, I cannot reproduce this bug (if it is a real bug) under laboratory conditions for now. So I will be grateful if somebody suggests to me how to do it.

I released the GreenRocket library in October 2012. It is a dead simple implementation of the Observer design pattern, which I use in almost all of my projects. I thought there was nothing to improve. But my recent project heavily uses the library, and I got tired of writing tests that check signals. This is how they looked:

from nose import tools
# I use Nose for testing my code

from myproject import MySignal, some_func
# ``MySignal`` inherits ``greenrocket.Signal``
# ``some_func`` must fire ``MySignal`` as its side-effect


def test_some_func():
    log = []                    # Create log for fired signals

    @MySignal.subscribe         # Subscribe a dummy handler
    def handler(signal):
        log.append(signal)      # Put fired signal to the log

    some_func()                 # Call the function to test

    # Test the fired signal from the log
    tools.eq_(len(log), 1)
    tools.eq_(log[0].x, 1)
    tools.eq_(log[0].y, 2)
    # ...and so on

There are four lines of utility code. And it is boring. So I added a helper class Watchman to the library to make it testing-friendly. This is how it works:

from greenrocket import Watchman

from myproject import MySignal, some_func


def test_some_code_that_fires_signal():
    watchman = Watchman(MySignal)           # Create a watchman for MySignal
    some_func()
    watchman.assert_fired_with(x=1, y=2)    # Test fired signal

Just one line of utility code and one line for the actual test! I have already rewritten all of my tests. So if you are using the library, it’s time to upgrade. If you are not, then try it out.