Converting Microsoft Word documents to PDF programmatically – server side.

We couldn’t find one plain and simple example on the web so we’re posting it here in case it helps someone. Our requirements:

Converting DOC and DOCX to PDF from the command line, server side.
No X of course. (LibreOffice excluded).
Not just DOC (Antiword excluded).
Debian packages whenever possible (python-docx excluded).

Log in to your server:

$ sudo apt-get install abiword
$ abiword --to=pdf --to-name=your_document_name.pdf your_document_name.docx

That should give you “your_document_name.pdf”. We’re only interested in extracting text, so we can’t vouch for the formatting quality.
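If you need to run the conversion from a script rather than by hand, the command above can be wrapped in a few lines of Python. This is only a minimal sketch: the function names abiword_command and convert_to_pdf are our own, and it assumes abiword is installed and on the PATH. The only flags used are the two shown above.

```python
import os
import subprocess


def abiword_command(source_path):
    """Build the AbiWord command line that converts one document to PDF."""
    base, _extension = os.path.splitext(source_path)
    target_path = base + ".pdf"
    # Same flags as the shell command in the post: --to and --to-name.
    return ["abiword", "--to=pdf", "--to-name=" + target_path, source_path]


def convert_to_pdf(source_path):
    """Run AbiWord on one file; raises CalledProcessError if it fails."""
    command = abiword_command(source_path)
    subprocess.check_call(command)
    # Return the path of the PDF that AbiWord was asked to write.
    return command[2].split("=", 1)[1]
```

Calling convert_to_pdf("your_document_name.docx") should leave the PDF next to the original file, under the same assumptions as the manual command.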

Enjoy!


Successful partnership for Scandinavian Public Library Quarterly Website

A couple of months ago, we teamed up with the Stockholm-based software company Seibo Software Studios to bring to life the new version of the Scandinavian Public Library Quarterly website.
There were two major requirements: 1. Develop the new website using WordPress 2. Improve on the old website.
After meeting with Rickard Carlsson, we were able to identify most of the client’s needs and to plan a three-sprint project with a sprint duration of two weeks.
The challenge here was to provide Rickard with an admin interface to create articles and group them into issues. Each issue has a cover page, and both covers and articles can have multiple authors.
And I almost forgot about the biggest challenge: porting over 400 articles from the old site to the new one.

The first sprint was about installing WordPress and plugins, creating post types and adding custom fields. The second sprint covered the site’s look & feel and coding the WordPress widgets that show the issues index and the author index. The final sprint wrapped everything up. The task of porting the content started at the very beginning and wasn’t over until the very end. As usual, we made good use of Scrapy for screen-scraping the old site and dumping the content into WordPress import files. The old markup was such a tough enemy that we ended up editing more than 400 articles manually to make them look pretty.

At the end of the day we are very pleased to have executed the project exactly as planned (this doesn’t happen very often in the industry), to have enjoyed it and, above all, to have learned a lot about Scandinavian public libraries and the people who work together to make all this information available to everyone.


Python Access For JIRA Part II

I’m going to share with you how to build a class that communicates with JIRA in a very easy way, and you’ll be able to learn about web scraping during the process:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Description: JIRA web interface methods
# Authors: sophilabs
# Copyright: Copyright (c) 2011, sophilabs.

import httplib
import re
from BeautifulSoup import BeautifulSoup

def singleton(cls):
    instances = {}
    def instance():
        if cls not in instances:
            instances[cls] = cls()
        return instances[cls]
    return instance

@singleton
class JiraHTTPConnection:

    def init(self, domain, username, password):
        print domain
        self.__cookie = None
        self.__headers = None
        self.__connection = httplib.HTTPConnection(domain)
        response = self.get("/")
        data = response.read()
        self.__headers = response.getheaders()
        self.__cookie = [x[1] for x in self.__headers if x[0] == "set-cookie"][0]
        self.__login(username, password)

    def get(self, path, headers=None):
        if headers is None:
            headers = {"Cookie": self.__cookie}
        self.__connection.request("GET", path, "", headers)
        return self.__connection.getresponse()

    def post(self, path, post_data, headers=None):
        if not headers:
            headers = {"Cookie": self.__cookie}
        self.__connection.request("POST", path, post_data, headers)
        return self.__connection.getresponse()

    #private

    def __login(self, username, password):
        post_data = "os_username=%s&os_password=%s&os_destination=" % (username, password) + "%2Fsecure%2F"
        headers = { "Cookie" : self.__cookie, "Content-Type": "application/x-www-form-urlencoded", "Content-Length": str(len(post_data)) }
        path = '/login.jsp'
        response = self.post(path, post_data, headers)
        data = response.read()

class JiraWebConnection:

    def __init__(self, domain, username, password):
        JiraHTTPConnection().init(domain, username, password)

    def projects(self):
        path = "/secure/BrowseProject.jspa"
        response = JiraHTTPConnection().get(path)
        data = response.read()
        soup = BeautifulSoup(data)
        project_keys = [re.search("/browse/([a-zA-Z0-9_-]+)", unicode(x).encode("utf-8")).group(1) for x in soup("a") if "/browse/" in unicode(x).encode("utf-8")]
        return project_keys

We want the JiraHTTPConnection class to be a singleton. This can be achieved using a class decorator. The class is responsible for providing the request methods, taking care of cookies, and logging in to JIRA when the connection is first established.
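To see what the decorator buys us, here is a small self-contained sketch. The Connection class and the cookie value are made up for illustration; only the singleton function is the one from the listing above. Since the decorator calls cls() with no arguments, the real class takes its parameters through a separate init method instead of __init__.

```python
def singleton(cls):
    instances = {}

    def instance():
        # Create the object on the first call, reuse it afterwards.
        if cls not in instances:
            instances[cls] = cls()
        return instances[cls]

    return instance


@singleton
class Connection:
    """Hypothetical stand-in for JiraHTTPConnection."""

    def __init__(self):
        self.cookie = None


first = Connection()
first.cookie = "JSESSIONID=abc"  # made-up value, set on the shared instance
second = Connection()            # same object, so the cookie is still there
```

Every call to Connection() hands back the same object, which is why JiraWebConnection can call JiraHTTPConnection() from any method and still share one logged-in session.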

I’ve used Live HTTP Headers for Firefox to inspect the contents of the POST request during a normal login to JIRA. Then all we need is to replicate that request by creating a custom one.

    def __login(self, username, password):
        post_data = "os_username=%s&os_password=%s&os_destination=" % (username, password) + "%2Fsecure%2F"
        headers = { "Cookie" : self.__cookie, "Content-Type": "application/x-www-form-urlencoded", "Content-Length": str(len(post_data)) }
        path = '/login.jsp'
        response = self.post(path, post_data, headers)
        data = response.read()

The POST data carries the credentials, and the session cookie has to be added to the headers as well.
Don’t forget to read the response after the request even if you’re making no use of it.
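One caveat with building the POST data by string formatting: a password containing characters such as & or = would break the form body. A possible way around it, sketched here for Python 3 (where urlencode lives in urllib.parse), is to let the standard library do the escaping. The helper name build_login_body is ours; the field names are the ones from the request above.

```python
from urllib.parse import urlencode


def build_login_body(username, password, destination="/secure/"):
    """Return the form-encoded body for the login POST."""
    # A list of pairs keeps the fields in the same order as the original request.
    return urlencode([
        ("os_username", username),
        ("os_password", password),
        ("os_destination", destination),
    ])
```

The result can be passed as post_data, with Content-Length computed from it exactly as before.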

You can add this piece of code to the bottom of the file for testing:

if __name__ == "__main__":
    jira = JiraWebConnection(YOUR_JIRA_URL, USERNAME, PASSWORD)
    p = jira.projects()

This will return a list of the project keys associated with the current user.
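If you want to try the key extraction without a JIRA server at hand, the idea can be reproduced on a hand-written HTML snippet. The sketch below uses only the re module instead of BeautifulSoup, and unlike the method above it also drops repeated keys; the sample markup and the project keys in it are invented.

```python
import re

# Same character class as the pattern used in projects().
BROWSE_LINK = re.compile(r'href="[^"]*/browse/([a-zA-Z0-9_-]+)"')


def project_keys(html):
    """Collect the keys found in /browse/ links, keeping first-seen order."""
    keys = []
    for key in BROWSE_LINK.findall(html):
        if key not in keys:
            keys.append(key)
    return keys


# Made-up page fragment for illustration only.
sample = (
    '<a href="/browse/WEB">Website</a> '
    '<a href="/browse/OPS">Operations</a> '
    '<a href="/browse/WEB">Website again</a> '
    '<a href="/secure/Dashboard.jspa">Home</a>'
)
keys = project_keys(sample)
```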

Coming Up: many more JIRA interfacing methods.


Parsing csv file with Python and removing null characters

A customer asked us to parse a csv file. There was a lot of processing to do on the contents, but we weren’t able to apply any logic to it because the lines didn’t behave as expected.
After reading the file and printing it to the console we got something like this:

x000\x000\x00%\x00\t\x000\x00.\x000\x000\x00\t\x000\x00.\x000\x000\x00\t
\x000\x00.\x000\x00\t\x000\x00\t\x00B\x00r\x00o\x00a\x00d\x00\t
\x000\x00.\x000\x000\x00%\x00\t\x000\x00.\x000\x000\x00\t
\x00\t\x002\x00.\x000\x000\x00\t\x000\x00.\x000\

The solution was to remove null characters, using this function:

def nonull(stream):
    s = []
    for line in stream:
        s.append(line.replace('\x00', ''))
    return s

or, if you like list comprehensions:

def nonull(stream):
    return [line.replace('\x00', '') for line in stream]

If you are using readlines() to read your file, you can call this function to get rid of those annoying characters.
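Once the null characters are gone, the lines can go straight into the csv module. Here is a small sketch of that step; the sample line is made up to resemble the output above, and the tab delimiter is an assumption based on the \t separators visible in it.

```python
import csv


def nonull(stream):
    return [line.replace('\x00', '') for line in stream]


# One invented line in the same shape as the console output above.
raw_lines = ["\x000\x00.\x000\x000\x00%\x00\t\x00B\x00r\x00o\x00a\x00d\x00\n"]

# csv.reader accepts any iterable of strings, so the cleaned list works directly.
rows = list(csv.reader(nonull(raw_lines), delimiter="\t"))
```

A null byte after every character hints that the file might be UTF-16 encoded. We did not verify that, but if it is the case, opening the file with that encoding would be an alternative to stripping the nulls.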


Python Access for JIRA Part I

A while ago we were asked to add some functionality to a project. It was a portal where customers could log in and manage some assets. Our client wanted their customers to be able to create JIRA tickets so the support team could take action. Of course JIRA is too complex for the regular user, so we were required to build a simple interface to JIRA, and to add some fancy stuff like charts showing work progress and the number of tickets by priority.

The JIRA RPC service is very limited, so we decided to create our own interface to JIRA in Python.

This is what the class looks like:

import httplib

class JiraHTTPConnection:

    def init(self, domain, username, password):
        print domain
        self.__cookie = None
        self.__headers = None
        self.__connection = httplib.HTTPConnection(domain)
        response = self.get("/")
        data = response.read()
        self.__headers = response.getheaders()
        self.__cookie = [x[1] for x in self.__headers if x[0] == "set-cookie"][0]
        self.__login(username, password)

    def get(self, path, headers=None):
        if headers is None:
            headers = {"Cookie": self.__cookie}
        self.__connection.request("GET", path, "", headers)
        return self.__connection.getresponse()

    def post(self, path, post_data, headers=None):
        if not headers:
            headers = {"Cookie": self.__cookie}
        self.__connection.request("POST", path, post_data, headers)
        return self.__connection.getresponse()

These are our implemented methods:

projects
project_issues
filtered_issues
issues
issue
user
add_comment
watchers
add_watchers

I’ll post details about these methods in upcoming posts.


If your Django 1.3 development server is not reloading the changes on PyCharm

Unfortunately, one of the latest commits to the Django 1.3 release broke the ability to run Django applications from PyCharm when autoreload mode is used.

The only way to fix it is to apply a patch to your local Django installation. PyCharm 1.2.1 RC will automatically detect the broken Django version and enable the --noreload option if necessary.

Hopefully it will be fixed in the next release. Here is the changeset:

https://code.djangoproject.com/changeset/15911


Introducing Zinmoo Spain: Real Estate Searcher

After two years of work, today we are very proud to announce the release of Zinmoo, our real estate ads search engine, for Spain!

Zinmoo was built using Django, Scrapy, MySQL, Sphinx and a lot of other tools and Python packages.

What’s the mission of Zinmoo?

Zinmoo was conceived to organize real estate related web content in order to provide better and broader access to it.

For that reason Zinmoo seeks to bring together the largest number of properties and to provide the most agile and complete tools for searching that information.


Error 403 Forbidden: symbolic link not allowed

Have you ever had this problem with Apache?
Well, I have. I spent a lot of time trying to find where the problem was. Finally, I found the solution, and it was so simple that I thought posting it in ‘the cloud’ would be a good idea. Maybe there is some other unfortunate soul like me banging his head against the monitor every time this error arises.
The story
I usually configure my applications in Apache with the FollowSymLinks option, because they are in the development phase and I don’t want to move all my code to /var/www or wherever.
So, the idea is to create a symlink in /var/www (using ln -s) to my project folder (it is usually located in ~/workspace).
Then I add the FollowSymLinks option to the site configuration, as follows:

<Directory /var/www/project>
    Options FollowSymLinks
</Directory>

And when I load the page in my browser:
Error 403 Forbidden:
WTF!?

The solution
I ran:

sudo -i -u www-data

Then I tried navigating the symbolic link’s path, directory by directory. Luckily, I realized that one of the directories in the path’s chain was not accessible to the user www-data. I changed the permissions with chmod and voilà, the page loaded like a charm.
I know that this is a newbie’s problem, but who knows, no one is immune to it.
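If you would rather not walk the path by hand, the same check can be roughly automated. The sketch below is an approximation, not what I actually ran: it resolves the symlink and reports every directory in the chain that lacks the execute bit for "others", assuming www-data is neither the owner nor in the group of those directories. The function name blocked_directories is made up.

```python
import os
import stat


def blocked_directories(path):
    """Return the directories along `path` that 'others' cannot traverse."""
    blocked = []
    current = os.path.realpath(path)  # follow symlinks to the real location
    if not os.path.isdir(current):
        current = os.path.dirname(current)
    while True:
        mode = os.stat(current).st_mode
        # Without the execute bit, other users cannot enter the directory.
        if not mode & stat.S_IXOTH:
            blocked.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    # Report from the outermost directory down to the innermost one.
    return list(reversed(blocked))
```

Running blocked_directories("/var/www/project") should list the directories that need a chmod, under the assumption stated above.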