Relaying - Measure Twice

My latest Twisted adventure began with a comment I came across in relay.py:

class RelayerMixin:

    # XXX - This is -totally- bogus
    # It opens about a -hundred- -billion- files
    # and -leaves- them open!

This seemed like a worthy problem to investigate so that, at the very least, I could write a ticket to track the issue.

The first challenge was to set up a smart host configuration with Twisted. A smart host is a mail server that accepts mail for any address, looks up the mail exchange for the destination domain, and connects to it to relay the mail. Unlike an open relay, a smart host restricts who may send mail through it. While some accept mail only from authenticated senders, Twisted’s default is to relay any mail received over a Unix socket or from localhost.

It was easy enough to run a smart host on my development machine. I just had to invoke twistd mail with the relay option and specify a directory to hold messages to be relayed:

twistd -n mail --relay=/tmp/mail_queue

The smart host uses DNS to look up mail exchanges and contacts them via SMTP on port 25. Because my ISP does not allow outgoing traffic on port 25 and because I did not want to relay test messages to real mail servers, I needed to make some changes to the Twisted source so that the email messages would be relayed to a Twisted mail server that I ran on a second computer. I modified relaymanager.py to relay to port 8025 and to use a hosts file for DNS resolution.

class SmartHostSMTPRelayingManager:
    ...
    # PORT = 25
    PORT = 8025
    ...
    def _checkStateMX(self):
        ...
        if self.mxcalc is None:
            # self.mxcalc = MXCalculator()
            from twisted.names.client import createResolver
            resolver = createResolver(None, None, b"/tmp/hosts")
            self.mxcalc = MXCalculator(resolver)

The hosts file maps example.com and example.net to the IP address of the computer running the target mail server.

10.224.77.149 example.com
10.224.77.149 example.net

I configured that server to run on the default port, 8025, and accept mail for a few users on the domains example.com and example.net:

twistd -n mail -d example.com=/tmp/example.com -u jim=pwd -u nat=pwd \
    -d example.net=/tmp/example.net -u joe=pwd -u bob=pwd

When I used telnet on the development machine to send mail to the smart host running on the same machine and addressed it to one of the configured users on example.com or example.net, the smart host relayed it to the mail server on the second machine.

Now that I had a usable configuration, I wanted to explore the implications of the comment that RelayerMixin opened a large number of files and never closed them. RelayerMixin is a mixin: a relayer class inherits from it to gain a set of functions for relaying mail. On initialization, the relayer calls one of the RelayerMixin functions, loadMessages, with a list of the pathnames of the messages it is responsible for relaying. loadMessages opens each message file and stores the file object in a list. I hypothesized that if I sent a lot of messages to the smart host at once, its relayers would open files for all the messages and hit the operating system limit for open files.
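
The difference between holding every message file open and opening them one at a time can be sketched in a few lines. This is not Twisted's code: the names EagerMessages, LazyMessages and make_messages are hypothetical, and small temporary files stand in for queued messages.

```python
import os
import tempfile


class EagerMessages:
    """Opens every message file up front and keeps all of them open."""

    def __init__(self, paths):
        # One open file object per message, all held at the same time.
        self.files = [open(p, "rb") for p in paths]

    def open_count(self):
        return sum(1 for f in self.files if not f.closed)

    def close(self):
        for f in self.files:
            f.close()


class LazyMessages:
    """Keeps only the pathnames and opens one message at a time."""

    def __init__(self, paths):
        self.paths = list(paths)

    def read_all(self):
        bodies = []
        for p in self.paths:
            # The with block closes each file before the next one is opened.
            with open(p, "rb") as f:
                bodies.append(f.read())
        return bodies


def make_messages(directory, count):
    """Write `count` small files to stand in for queued messages."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, "msg-%d" % i)
        with open(path, "wb") as f:
            f.write(b"hi\n")
        paths.append(path)
    return paths
```

With the eager approach, the number of open files grows with the number of messages handed to the relayer. With the lazy approach, at most one file is open at any moment.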

I wrote a short program to send the SMTP commands for a series of messages to the smart host running on port 8025 of the same machine. Each message is addressed at random to one of four users, two on each of the two domains served by the mail server on the other machine.

from twisted.internet import protocol, reactor
from twisted.test.proto_helpers import LineSendingProtocol
from twisted.internet.defer import Deferred
from random import choice

NUM_MESSAGES = 250

addresses = ['joe@example.net', 'bob@example.net',
             'jim@example.com', 'nat@example.com']

# Start the SMTP session, then queue the commands for each message.
msgs = ['helo']

for i in range(NUM_MESSAGES):
    origin = 'foo@example.com'
    # Pick one of the four recipients at random.
    destination = choice(addresses)
    msgs.append('mail from: <{}>'.format(origin))
    msgs.append('rcpt to: <{}>'.format(destination))
    msgs.append('data')
    msgs.append('from {} to {}'.format(origin, destination))
    msgs.append('hi {}'.format(destination))
    # A line containing only a period ends the message data.
    msgs.append('.')

msgs.append('quit')
client = LineSendingProtocol(msgs)

done = Deferred()
f = protocol.ClientFactory()
f.protocol = lambda: client
f.clientConnectionLost = lambda *args: done.callback(None)

def finished(reason):
    reactor.stop()

done.addCallback(finished)

reactor.connectTCP('127.0.0.1', 8025, f)
reactor.run()

As I increased the number of messages sent, I expected to eventually see an exception when too many files were opened, but none occurred no matter how many messages I sent. From the server log, I observed that instead of opening one connection to the mail server for each domain and sending all the queued messages for that domain, the smart host was repeatedly connecting to the mail server and sending only a few messages at a time. That explained why the limit on open files was not being reached: the relayers were handed only a few messages at a time, so they never had many files open at once.

This strategy for allocating work to relayers did not seem very efficient, so I explored further. SmartHostSMTPRelayingManager, which implements the smart host functionality, has a function, checkState, which is called periodically to see whether there are messages waiting to be relayed and whether there is capacity to create new relayers. If so, it calls _checkStateMX to create relayers and allocate messages to them. It turns out that _checkStateMX contains a subtle bug that causes this allocation behavior.

def _checkStateMX(self):
    nextMessages = self.queue.getWaiting()
    nextMessages.reverse()

    exchanges = {}
    for msg in nextMessages:
        from_, to = self.queue.getEnvelope(msg)
        name, addr = rfc822.parseaddr(to)
        parts = addr.split('@', 1)
        if len(parts) != 2:
            log.err("Illegal message destination: " + to)
            continue
        domain = parts[1]

        self.queue.setRelaying(msg)
        exchanges.setdefault(domain, []).append(self.queue.getPath(msg))
        if len(exchanges) >= (self.maxConnections - len(self.managed)):
            break

_checkStateMX asks the relay queue for a list of waiting messages. Then it loops through the messages, grouping them by target domain. Eventually, each group will be handed off to a relayer. The problem is that _checkStateMX breaks out of the loop as soon as it has at least one message for the maximum number of domains it can contact concurrently. That limit is maxConnections, an optional parameter to SmartHostSMTPRelayingManager.__init__ with a default value of 2, minus the number of relayers already being managed.

As _checkStateMX loops through the waiting messages, it creates a list of messages for the first domain it sees and keeps adding messages for that domain to the list. When it sees a second domain, it creates another list for that domain but since it has hit the limit on connections, it breaks out of the loop. So, any other messages in the queue for either domain must wait to be sent even though they could be handled by the same relayers. Instead of breaking out of the loop when it reaches the connection limit, _checkStateMX should continue to add messages to the lists for the domains it has already seen and ignore messages for other domains.
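
The two allocation strategies can be compared with a simplified sketch. It is not the Twisted implementation or the fix I submitted: allocate_buggy and allocate_fixed are hypothetical names, the queue is a plain list of (message id, recipient) pairs, max_domains stands in for the remaining connection capacity, and the example.org address is made up to show a domain that exceeds the limit.

```python
def allocate_buggy(envelopes, max_domains):
    """Mimic the loop above: stop as soon as max_domains domains are seen."""
    exchanges = {}
    for msg_id, recipient in envelopes:
        domain = recipient.split("@", 1)[1]
        exchanges.setdefault(domain, []).append(msg_id)
        if len(exchanges) >= max_domains:
            break
    return exchanges


def allocate_fixed(envelopes, max_domains):
    """Keep filling groups for known domains; skip domains over the limit."""
    exchanges = {}
    for msg_id, recipient in envelopes:
        domain = recipient.split("@", 1)[1]
        if domain not in exchanges and len(exchanges) >= max_domains:
            # No spare connection for a new domain: leave the message waiting.
            continue
        exchanges.setdefault(domain, []).append(msg_id)
    return exchanges


# Hypothetical queue contents: (message id, recipient address).
queue = [
    (1, "jim@example.com"),
    (2, "nat@example.com"),
    (3, "joe@example.net"),
    (4, "bob@example.net"),
    (5, "jim@example.com"),
    (6, "ann@example.org"),
]

buggy = allocate_buggy(queue, 2)
fixed = allocate_fixed(queue, 2)
```

With a limit of two domains, the first function stops after message 3, so messages 4 and 5 wait for a later pass even though relayers for their domains are about to be created. The second function picks up messages 4 and 5 as well and leaves only message 6 waiting.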

With this understanding of how messages are allocated to relayers, I could now easily trigger an exception for too many open files by sending a large number of messages to one domain instead of splitting them between two.
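
The failure itself is easy to reproduce outside Twisted. The sketch below is an illustration for Unix systems only, since it relies on the standard library resource module. The function name open_until_failure is hypothetical. It temporarily lowers the soft limit on open files, opens the same file repeatedly without closing it, and reports the error number the operating system returns.

```python
import errno
import os
import resource
import tempfile


def open_until_failure(path, limit):
    """Lower the soft open-file limit to `limit`, then open `path`
    repeatedly until the operating system refuses. Returns the errno."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    handles = []
    try:
        while True:
            # Each handle is kept, so nothing is closed along the way.
            handles.append(open(path, "rb"))
    except OSError as e:
        return e.errno
    finally:
        # Release the files and restore the original limit.
        for h in handles:
            h.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


# A temporary file stands in for a queued message.
fd, sample_path = tempfile.mkstemp()
os.close(fd)
result = open_until_failure(sample_path, 64)
os.remove(sample_path)
```

On a typical Unix system the returned value is errno.EMFILE, the "too many open files" error. The limit of 64 is an arbitrary choice for the demonstration and must not exceed the hard limit of the process.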

As a result of this exploration, I filed two tickets and submitted fixes for both: a defect ticket for the handling of open files by RelayerMixin and an enhancement ticket to improve how messages are allocated to relayers.