Wordloosed

Running Yahoo! Pipes on Google App Engine

2010-10-24

Yahoo! Pipes is an excellent tool for processing data. It provides a visual way to aggregate, manipulate, and mash up content from around the web. Working with it feels very much like plumbing with data, which is a great metaphor. I'm convinced that this approach is just the beginning, and look forward to connecting systems using pipes in a three-dimensional virtual environment with tactile and audio feedback... soon.

Tony Hirst, a prolific Yahoo! Pipes user, had the idea to translate the pipe definitions into code so that they could be run on your own computer, in case the Yahoo! Pipes server was unavailable. This sounded like an interesting challenge, so I developed pipe2py. The pipe2py package can compile a Yahoo! Pipe into pure Python source code, or it can interpret the pipe on the fly. It supports embedded pipes too. (Not all of the Yahoo! Pipes modules are available yet, but they're gradually being added: if you need one that's missing, please let me know, or better still, provide me with the code for the module.)

The design for the compiled pipes was based on David Beazley's work on building Python generators into pipelines, together with ideas from SQL query compilers and XProc pipelines. Each Yahoo! Pipes module is implemented as a Python generator which iterates over items provided by an input module and processes them to yield output results. Once these generators are connected together, iterating over the final one will initiate a cascading call to all earlier generators for them to iterate over their inputs and, in turn, yield their output. There are several benefits to this architecture:

- the compiled pipeline closely matches the original Yahoo! pipeline
- adding new modules is easy because they are loosely coupled
- each item is typically passed through the whole pipeline one at a time, so:
    1. memory usage is kept to a minimum
    2. no module is waiting on an earlier module to finish processing the whole data set
- by adding queues between the modules they could easily be made to run in parallel, each on a different CPU, to give great scalability
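To make the generator idea concrete, here is a minimal sketch of such a pipeline. It is not pipe2py's actual code: the stage names (`fetch`, `filter_by_title`, `rename`), the `log` list, and the sample records are all made up for illustration, and it is written for Python 3. Each stage is a generator that pulls items from the stage before it, and the log records the order in which stages touch each item.

```python
log = []  # hypothetical trace of which stage handled which item, in order


def fetch(records):
    """Source stage: yield the input records one at a time."""
    for record in records:
        log.append(("fetch", record["title"]))
        yield record


def filter_by_title(items, text):
    """Pass through only the items whose title contains `text`."""
    for item in items:
        if text in item["title"]:
            log.append(("filter", item["title"]))
            yield item


def rename(items, old, new):
    """Yield a copy of each item with the field `old` renamed to `new`."""
    for item in items:
        item = dict(item)  # copy, so the input records are left untouched
        item[new] = item.pop(old)
        log.append(("rename", item[new]))
        yield item


# Made-up sample data standing in for a fetched feed.
records = [
    {"title": "Bob Neill"},
    {"title": "Ann Other"},
    {"title": "Neill Smith"},
]

# Wiring the stages together does no work yet: generators are lazy.
pipeline = rename(filter_by_title(fetch(records), "Neill"), "title", "name")
nothing_ran_yet = (log == [])

# Iterating over the final stage pulls items through every earlier stage.
results = list(pipeline)
print(results)  # [{'name': 'Bob Neill'}, {'name': 'Neill Smith'}]
print(log[:3])  # the first item passes through all three stages first
```

The log shows the first record going through `fetch`, `filter` and `rename` before the second record is even fetched, which is the "one item at a time" behaviour described in the list above.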

Here's an example pipe2py session which compiles a pipe (Id UuvYtuMe3hGDsmRgPm7D0g) into Python and then runs it locally:


    $ python compile.py -p UuvYtuMe3hGDsmRgPm7D0g
    $ python pipe_UuvYtuMe3hGDsmRgPm7D0g.py
    Name (default=Lancaster) Neill
    {u'title': u'Bob Neill',
    ...
    u'TotalAllowancesClaimedIncTravel': u'157332'}

Since pipe2py can compile pipes into Python modules, it seemed a good idea to try to run them in Google's cloud via App Engine. So now there's pipes-engine, which uses pipe2py to run your Yahoo! Pipes on Google's servers.

[Image: pipe2py running Yahoo! Pipes on Google App Engine]

You'll need to log on with your Google account, and then you can take the Id of your Yahoo! Pipe (you can find it in the URL when editing a pipe) and add it to the list. pipes-engine will then compile it and store the Python version of it. Clicking the pipe Id will run it on App Engine. If you change the pipe in Yahoo! Pipes, you can reload it in pipes-engine to re-compile the latest version (although I hope to automate this step in future).

There's currently an App Engine timeout of 30 seconds, but Google have said that they are working on increasing that soon.

There were some tricky bits to developing this, like storing the generated Python source in the datastore and then importing it dynamically back from the datastore, and doing so recursively for any embedded pipe imports. Some Python PEP 302 magic helped here.
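For readers unfamiliar with import hooks, the following is a rough sketch of the general technique, not the pipes-engine implementation. It assumes a plain dictionary (`FAKE_DATASTORE`, a made-up name) in place of the App Engine datastore, and it uses the modern `importlib` finder/loader interface (`find_spec` and `exec_module`), which superseded the `find_module`/`load_module` methods originally defined by PEP 302. The module names and their source are invented for the example.

```python
import importlib
import importlib.abc
import importlib.util
import sys

# Hypothetical stand-in for the datastore: module name -> generated source.
# "pipe_outer" imports "pipe_inner", mimicking a pipe with an embedded pipe.
FAKE_DATASTORE = {
    "pipe_inner": (
        "def pipe_inner(items):\n"
        "    for item in items:\n"
        "        yield item.upper()\n"
    ),
    "pipe_outer": (
        "from pipe_inner import pipe_inner\n"
        "\n"
        "def pipe_outer(items):\n"
        "    for item in pipe_inner(items):\n"
        "        yield item + '!'\n"
    ),
}


class DatastoreImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader that serves module source from a dict-like store."""

    def __init__(self, store):
        self.store = store

    def find_spec(self, fullname, path=None, target=None):
        # Only claim modules we hold source for; otherwise let other finders try.
        if fullname not in self.store:
            return None
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # use Python's default module creation

    def exec_module(self, module):
        source = self.store[module.__name__]
        code = compile(source, "<datastore:%s>" % module.__name__, "exec")
        # Any import statements in the source go back through sys.meta_path,
        # so embedded pipes are resolved recursively by this same importer.
        exec(code, module.__dict__)


# Appended last, so the standard finders are consulted before this one.
sys.meta_path.append(DatastoreImporter(FAKE_DATASTORE))

pipe_outer = importlib.import_module("pipe_outer")
result = list(pipe_outer.pipe_outer(["a", "b"]))
print(result)  # ['A!', 'B!']
```

Importing `pipe_outer` triggers the import of `pipe_inner` through the same hook, which is the recursive behaviour needed for embedded pipes.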

The pipes-engine.appspot.com service is a proof of concept and needs some more work, not least to provide the output in formats other than JSON, but I think it proves the approach is feasible. Let me know what you think.

Tags: Python