Almost exactly two years ago, I began a little side project to create an email parser. I wanted to port an email sorting system I had previously written in Perl to Go. There was no particular reason to have my own email sorting system or to port it to Go except that I could and because I wanted to learn more practical skills in Go. However, in the process I was unable to find an email message parser which was able to meet all of my requirements. Those requirements included the following:

Round-tripping. My email sorting system needed to be able to rewrite message headers on message files stored on disk while making no alterations to the file in the bits I didn’t change. The system should be able to read any email, completely parse it into semantic bits, and then reverse the process and come up with a rebuilt message that is exactly identical to the original.
Logically Useful. I want to be able to see everything about the message. This is not totally necessary for an email sorting program, which generally just needs to be able to read the headers and maybe the message body, but if I want to read the message body properly, it really ought to be able to break a complex multipart message into parts, understand the message contents of each part, and be able to decode transfer encodings and so on.
Liberal Acceptance. This goal is absolutely key. The twisted universe that is email is full of horrible, badly formatted messages. We can’t reject a message just because it does not follow RFC 5322 or RFC 2822 or even RFC 822. It must parse every message in my inbox, from the gross messages sent by Fortune 500 companies using really terrible email writers like Microsoft Exchange to the badly formatted messages from Nigerian princes and everything in between. It must be that good.
Prefer Strict Output. When I make changes, my changes should strictly adhere to all relevant RFCs and standards by default. However, if I want to output garbage, it should let me do that too. If it’s useful to spammers and careless programmers on the web, it might be useful to me.
Resume-able Error Handling. I wanted to be able to recover when an email message was partially readable, but not totally readable. The parser should return everything that can be parsed, parsed as well as it can be, along with the error.

In order to achieve these goals, I decided to turn to the smartest authorities I know on dealing with the wacky and bizarre world that is email: Perl developers, particularly those like Ric Signes and Simon Cozens who were messing around with email decades before I started on my first mail sorting program and have each forgotten more about how email works than I ever learned. My original source material was the Email::Simple and Email::MIME packages. My first solution was a nearly direct port of these libraries into Go. That turned out to be not quite doable because Go doesn’t have direct analogs for things like inheritance, but I learned more about how the indirect analogs work and can be manipulated to work like inheritance in the process.

After implementing the original version of go-email, I then finished porting my email sorting tool from Perl as well and I’ve been using it pretty successfully ever since.

Problems with v1

However, over the past year or so I’ve had some problems:

First, the original solution is a memory hog. When my email sorter ran against especially large mail folders, it would sometimes consume all the available memory and start thrashing hard.
Second, I have another application now where I want to be able to produce email and so I have new requirements.
Third, I want my email sorting program to have the ability to forward messages sometimes. The existing library was pretty decent at parsing, but it wasn’t great at transforming messages.
Finally, I have continued to grow in knowledge of Golang over the past couple years and the existing solution just isn’t very go-ish. I wanted to make an attempt at building something that would feel familiar to a typical Golang dev.

Design and Implementation of v2

And so, around the middle of December, I started a major rewrite of the original library. I started from first principles, but tried to keep the implementation code from the other library where I could. I’m pretty happy with what I’ve managed to produce in the past four weeks.

The requirements of this project remain largely the same. Some are not as strongly emphasized as before and some different trade-offs have been made, obviously. For example, the code does not work quite as hard as it did before at being resume-able on error, but it still tries and this could be improved as time goes on.

I have also layered on a few new requirements:

Memory Should Be Managed. I want to know what the memory footprint of this library looks like and I want to provide modes where the memory usage can be limited.
Go-ish Interface. The original library made no attempt at going about the implementation in a way that is familiar to a typical Golang dev. This new implementation should focus on reusable components like io.Reader and io.Writer and otherwise do things according to Go standard practices where it makes sense to do so.
Building and transforming as well as parsing. Having a good parser is great, but composing email should be supported as well as being able to transform an existing message for tasks like replying and forwarding.

So let’s take a look at some ways to parse a message with the newly released github.com/zostay/go-email/v2/message module and then let’s take a look at how to produce new messages from scratch.

Parsing a Message

Email parsing is not simple even aside from the complexity of the input. Depending on your application, you may want more or less parsing. The v2 release of go-email attempts to accommodate this by providing a variety of levers that can be used to control how much parsing occurs, what limits there are on input, etc.

Let’s consider the most basic parsing first and then move on to more complex multipart parsing and transfer encoding.

Opaque Messages

At the heart of github.com/zostay/go-email/v2/message is the message.Opaque type. This object is very similar to the mail.Message type from net/mail. Similar enough that I could have reused it. It has a header part and a body part. That’s it. Those are the mandatory parts of all email messages. The body might be a simple text message or might be a complex multi-layer multipart message. The point is that message.Opaque doesn’t care; the body is opaque to it. It’s just a collection of bytes.
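To make the "header plus opaque body" idea concrete, here is a minimal sketch that uses only the standard library's net/mail package (the type compared against above), not go-email itself. The splitOpaque helper is a hypothetical name made up for this illustration: it parses the header and hands back the body as uninterpreted bytes.

```go
package main

import (
	"fmt"
	"io"
	"net/mail"
	"strings"
)

// splitOpaque is a hypothetical helper for illustration only. It parses the
// header with net/mail and returns the body as raw, uninterpreted bytes.
func splitOpaque(raw string) (mail.Header, []byte, error) {
	m, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return nil, nil, err
	}
	// The body is never inspected: multipart or not, it is just bytes here.
	body, err := io.ReadAll(m.Body)
	if err != nil {
		return nil, nil, err
	}
	return m.Header, body, nil
}

func main() {
	raw := "Subject: Hello\r\n" +
		"Keywords: todo, work\r\n" +
		"\r\n" +
		"Just some bytes.\r\n"

	hdr, body, err := splitOpaque(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("Subject: ", hdr.Get("Subject"))
	fmt.Println("Keywords:", hdr.Get("Keywords"))
	fmt.Printf("Body:     %d opaque bytes\n", len(body))
}
```

Note that net/mail may reject badly formed input and makes no promise about reproducing the original bytes, which is exactly where the liberal acceptance and round-tripping goals ask for more.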

To parse any email message to an opaque object, we do the following:

r, _ := os.Open("message.txt")
m, _ := message.Parse(r, message.WithoutMultipart())
opaqueMsg := m.(*message.Opaque)

The value in m will contain a fully decoded email header and can be used as a reader to get at the content within. The actual return value is an interface called message.Generic, but one of the semantics message.Generic guarantees is that it will always be implemented by either *message.Opaque or *message.Multipart (this latter type we will get to in a minute). And if you use the message.WithoutMultipart() option, the returned object is guaranteed to be a *message.Opaque, so the type assertion shown is safe at runtime (though we really ought to check the error returned by message.Parse() to make sure parsing worked).

I’ve run the code for message.Parse() across the quarter of a million messages in my mailbox and it will parse every one of them even though many of them are pretty badly formatted.

This level of parsing is the safest for round-tripping. Thus, if you just want to manipulate the Keywords header (which is my primary application), this basic level of parsing is all you’re likely to need.

Multipart Messages

However, your application may be more complicated. If, say, you want to save off every PDF in every message you’ve ever received, you’ll want a slightly different call to message.Parse():

r, _ := os.Open("message.txt")
m, _ := message.Parse(r, message.DecodeTransferEncoding())

This time, we cannot just convert the message m to *message.Multipart because it might not be that object. If the Content-type header of the original message is text/plain, then it will be returned as a *message.Opaque. If it is a multipart/mixed or something similar and all the other requirements for parsing a multipart message are met, then it will return a *message.Multipart. The behavior now depends on the message format. The *message.Multipart object does not provide a reader. Instead, it contains a list of parts, each of which implements the message.Part interface.

The message.Part and message.Generic interfaces are exactly identical (one is a copy of the other). However, while message.Generic has a documented contract that it will always be either a *message.Opaque or a *message.Multipart, the message.Part does not have any such guarantee. This allows the library to be extended by you or others. As long as you adhere to the contract documented for message.Part, then *message.Multipart will behave properly.

A message.Multipart object always represents a multipart MIME message. It cannot represent anything else.
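The opaque-versus-multipart decision described above can be sketched with the standard library alone. This is not how go-email implements it; describe is a hypothetical function written for this illustration, and it only mimics the rule that a message is treated as multipart when its Content-type says so and a usable boundary is present.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"mime/multipart"
	"net/mail"
	"strings"
)

// describe is a hypothetical helper for illustration only. It returns
// []string{"opaque"} for a single-body message, or the Content-Type of each
// part when the message is a usable multipart message.
func describe(raw string) ([]string, error) {
	m, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return nil, err
	}

	mediaType, params, err := mime.ParseMediaType(m.Header.Get("Content-Type"))
	if err != nil || !strings.HasPrefix(mediaType, "multipart/") || params["boundary"] == "" {
		// Not (usably) multipart: treat the body as opaque bytes.
		return []string{"opaque"}, nil
	}

	var parts []string
	mr := multipart.NewReader(m.Body, params["boundary"])
	for {
		p, err := mr.NextPart()
		if err == io.EOF {
			return parts, nil
		}
		if err != nil {
			return parts, err
		}
		parts = append(parts, p.Header.Get("Content-Type"))
	}
}

func main() {
	raw := "Content-Type: multipart/mixed; boundary=XYZ\r\n" +
		"\r\n" +
		"--XYZ\r\n" +
		"Content-Type: text/plain\r\n" +
		"\r\n" +
		"Hello\r\n" +
		"--XYZ\r\n" +
		"Content-Type: application/pdf\r\n" +
		"\r\n" +
		"%PDF-1.4\r\n" +
		"--XYZ--\r\n"

	parts, err := describe(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(parts)
}
```

A PDF-saving tool would walk the parts in the same way and keep only those whose media type is application/pdf.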

Transfer Encoding

In the example above, I use the message.DecodeTransferEncoding() option while parsing for multipart messages. This enables an additional step during parsing where the Content-transfer-encoding of each part is checked and decoded, when it can be. Email is a text format and allowing binary data to be stored in it can be problematic, even in 2023. As such, binary files are generally encoded using base64, and even text files in character encodings like latin1 will often have a transfer encoding like quoted-printable or base64 applied, either to guarantee that no 8-bit characters are present or to ensure that binary data in the email won’t break the required text formatting.

Now, why didn’t I make this the default? Mostly because round-tripping is that important to me. Decoding a transfer encoding and then reapplying it is very likely to come up with different results from the original. As such, I recommend using this option if you’re going to be examining content of individual parts, but leaving it off otherwise.
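Here is a small, standard-library-only sketch of why decoding and re-encoding can break round-tripping. It assumes a sender that wrapped its base64 at 60 columns while the re-encoder wraps at 76; both widths and the helper names (wrap, encodeWrapped, decodeBody) are made up for this illustration and are not part of go-email.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// wrap breaks s into lines of at most width characters joined by CRLF.
func wrap(s string, width int) string {
	var lines []string
	for len(s) > width {
		lines = append(lines, s[:width])
		s = s[width:]
	}
	lines = append(lines, s)
	return strings.Join(lines, "\r\n")
}

// encodeWrapped base64-encodes data and wraps the result at width columns.
func encodeWrapped(data []byte, width int) string {
	return wrap(base64.StdEncoding.EncodeToString(data), width)
}

// decodeBody strips line breaks and decodes the base64 payload.
func decodeBody(encoded string) ([]byte, error) {
	compact := strings.NewReplacer("\r", "", "\n", "").Replace(encoded)
	return base64.StdEncoding.DecodeString(compact)
}

func main() {
	payload := []byte(strings.Repeat("binary attachment data ", 8))

	// Pretend the sending software wrapped its base64 at 60 columns.
	original := encodeWrapped(payload, 60)

	decoded, err := decodeBody(original)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}

	// Re-encode with a different, equally valid line width.
	rebuilt := encodeWrapped(decoded, 76)

	fmt.Println("payload preserved:", string(decoded) == string(payload)) // true
	fmt.Println("bytes identical:  ", rebuilt == original)                // false
}
```

The content survives, but the bytes on disk do not match, so a message rewritten this way is no longer identical to the original.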

Building Messages

The final piece is that I wanted the ability to easily construct and transform complex messages. (We’ll skip discussing the transforming bits as I’m still thinking more about that.) I have provided a number of tools for building messages and may add some more. At the heart of message building is the type named message.Buffer. It is like a mirror of message.Generic. A message.Buffer has an email header, but instead of acting as a reader, it may be used as an io.Writer. That writer writes to the message body directly. Or, you can manipulate the message.Buffer to create a multipart message by calling the Add() method, which adds parts to a message.

When you’re done, you may transform the buffer into either a *message.Opaque or a *message.Multipart. If you call the Opaque() method, you will get a *message.Opaque object representing the email. This always works, regardless of whether you used the io.Writer interface or the Add() method. You may, instead, call the Multipart() method to get a *message.Multipart. This is guaranteed to work if you called Add(). If you used the io.Writer interface, it will attempt to parse the content and transform that into a *message.Multipart, and if the data you wrote is not in MIME multipart format, you’ll get an error. This is probably less useful, in general, but provided for cases where it might be.

Here’s an example from the documentation for building a complex message and then writing it out to stdout:

mm := &message.Buffer{}
mm.SetSubject("Fancy message")
mm.SetMediaType("multipart/mixed")

txtPart := &message.Buffer{}
txtPart.SetMediaType("text/plain")
_, _ = fmt.Fprintln(txtPart, "Hello *World*!")

htmlPart := &message.Buffer{}
htmlPart.SetMediaType("text/html")
_, _ = fmt.Fprintln(htmlPart, "Hello <b>World</b>!")

mm.Add(message.MultipartAlternative(txtPart.Opaque(), htmlPart.Opaque()))

imgAttach, _ := message.AttachmentFile(
	"image.jpg",
	"image/jpeg",
	transfer.Base64,
)
mm.Add(imgAttach)

_, _ = mm.Opaque().WriteTo(os.Stdout)

Reading this carefully, you’ll see some additional helpers for building messages. There’s a message.MultipartAlternative() constructor, which returns a *message.Multipart with the parts given and a single Content-type header set to multipart/alternative. (There’s also a function named message.MultipartMixed() that is identical, except it sets the header to multipart/mixed.) You will also note the constructor named message.AttachmentFile(), which will read the given filename from disk and create a message attachment.

This code automatically handles mundane details like selecting a boundary between parts and such.
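For a sense of what those mundane details involve, here is a rough standard-library sketch of the same text-plus-HTML alternative built by hand with mime/multipart. It is not go-email code: buildAlternative and countParts are hypothetical helpers written for this illustration, and the boundary is whatever random value multipart.NewWriter picks.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
	"mime/multipart"
	"net/textproto"
)

// buildAlternative is a hypothetical helper for illustration only. It writes
// a text part and an HTML part into one multipart/alternative body and
// returns the Content-Type value (with boundary) plus the encoded body.
func buildAlternative(text, html string) (string, []byte, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body) // chooses a random boundary for us

	parts := []struct{ mediaType, content string }{
		{"text/plain", text},
		{"text/html", html},
	}
	for _, p := range parts {
		h := textproto.MIMEHeader{}
		h.Set("Content-Type", p.mediaType)
		pw, err := w.CreatePart(h)
		if err != nil {
			return "", nil, err
		}
		if _, err := pw.Write([]byte(p.content)); err != nil {
			return "", nil, err
		}
	}
	// Close writes the final boundary line.
	if err := w.Close(); err != nil {
		return "", nil, err
	}
	contentType := "multipart/alternative; boundary=" + w.Boundary()
	return contentType, body.Bytes(), nil
}

// countParts parses the body back to check that the boundary works.
func countParts(contentType string, body []byte) (int, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return 0, err
	}
	r := multipart.NewReader(bytes.NewReader(body), params["boundary"])
	n := 0
	for {
		if _, err := r.NextPart(); err == io.EOF {
			return n, nil
		} else if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	ct, body, err := buildAlternative("Hello *World*!\n", "Hello <b>World</b>!\n")
	if err != nil {
		fmt.Println("build error:", err)
		return
	}
	n, err := countParts(ct, body)
	fmt.Println("Content-Type:", ct)
	fmt.Println("parts:", n, "error:", err)
}
```

Every step here (choosing the boundary, writing per-part headers, emitting the closing delimiter) is bookkeeping that a message builder is meant to hide.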

And so much more…

I could write much more regarding the low-level features of the library, which give you nuanced access to individual header fields and helpers for managing different field types, from address list fields to dates to keywords to Content-type and Content-disposition. However, I will let you read the online documentation for those details.

I could also talk about the walker, but I feel like that tooling is still a little on the unfinished side and also you can read more details in that documentation.

Finally, I could talk about the round trip test tools I built in to help ensure the email parsing works well against a large body of test messages, but perhaps I will leave that for another time.

I’ve tried to document everything. I know some of the last few features I added while working toward release don’t have test coverage yet, but test coverage is relatively high. I hope that it is a useful library and I have at least one production environment I intend to deploy it into in the next year. I also have some weird email test cases that I haven’t tried round-tripping yet, but such is any tool. The work is never quite done.

As with everything I do, I hope someone else will find it useful. However, whether someone does or not, I will.

Cheers.

Launching go-email v2