Launching go-email v2
Almost exactly two years ago, I began a little side project to create an email parser. I wanted to port an email sorting system I had previously written in Perl to Go. There was no particular reason to have my own email sorting system or to port it to Go except that I could and because I wanted to learn more practical skills in Go. However, in the process I was unable to find an email message parser which was able to meet all of my requirements. Those requirements included the following:
- Round-tripping. My email sorting system needed to be able to rewrite message headers on message files stored on disk while making no alterations to the file in the bits I didn’t change. The system should be able to read any email, completely parse it into semantic bits, and then reverse the process and come up with a rebuilt message that is exactly identical to the original.
- Logically Useful. I want to be able to see everything about the message. This is not totally necessary for an email sorting program, which generally just needs to be able to read the headers and maybe the message body, but if I want to read the message body properly, it really ought to be able to break a complex multipart message into parts, understand the message contents of each part, and be able to decode transfer encodings and so on.
- Liberal Acceptance. This goal is absolutely key. The twisted universe that is email is full of horrible, badly formatted messages. We can’t reject a message just because it does not follow RFC 5322 or RFC 2822 or even RFC 822. It must parse every message in my inbox filled with gross messages from Fortune 500 companies using really terrible email writers like Microsoft Exchange and those badly formatted messages from Nigerian princes and everything in between. It must be that good.
- Prefer Strict Output. When I make changes, my changes should strictly adhere to all relevant RFCs and standards by default. However, if I want to output garbage, it should let me do that too. If it’s useful to spammers and careless programmers on the web, it might be useful to me.
- Resume-able Error Handling. I wanted to be able to recover when an email message was partially, but not totally, readable. The parser should return everything that can be parsed, parsed as well as it can be, along with the error.
In order to achieve these goals, I decided to turn to the smartest authorities I know on dealing with the wacky and bizarre world that is email: Perl developers, particularly those like Ric Signes and Simon Cozens who were messing around with email decades before I started on my first mail sorting program and have each forgotten more about how email works than I ever learned. My original source material was the Email::Simple and Email::MIME packages. My first solution was a nearly direct port of these libraries into Go. That turned out to be not quite doable because Go doesn’t have direct analogs for things like inheritance, but I learned more about how the indirect analogs work and can be manipulated to work like inheritance in the process.
After implementing the original version of go-email, I then finished porting my email sorting tool from Perl as well and I’ve been using it pretty successfully ever since.
Problems with v1
However, over the past year or so I’ve had some problems:
- One is that the original solution is a memory hog. When my email sorter would run against especially large mail folders, it would sometimes consume all the available memory and start thrashing hard.
- Second, I have another application now where I want to be able to produce email and so I have new requirements.
- Third, I want my email sorting program to have the ability to forward messages sometimes. The existing library was pretty decent at parsing, but it wasn’t great at transforming messages.
- Finally, I have continued to grow in knowledge of Golang over the past couple years and the existing solution just isn’t very go-ish. I wanted to make an attempt at building something that would feel familiar to a typical Golang dev.
Design and Implementation of v2
And so, around the middle of December, I started a major rewrite of the original library. I started from first principles, but tried to keep the implementation code from the other library where I could. I’m pretty happy with what I’ve managed to produce in the past four weeks.
The requirements of this project remain largely the same. Some are not as strongly emphasized as before and some different trade-offs have been made, obviously. For example, the code does not work quite as hard as it did before at being resume-able on error, but it still tries and this could be improved as time goes on.
I have also layered on a few new requirements:
- Memory Should Be Managed. I want to know what the memory footprint of this library looks like and I want to provide modes where the memory usage can be limited.
- Go-ish Interface. The original library made no attempt at going about the implementation in a way that is familiar to a typical Golang dev. This new implementation should focus on reusable components like io.Reader and io.Writer and otherwise do things according to Go standard practices where it makes sense to do so.
- Building and transforming as well as parsing. Having a good parser is great, but composing email should be supported as well as being able to transform an existing message for tasks like replying and forwarding.
So let’s take a look at some ways to parse a message with the newly released github.com/zostay/go-email/v2/message module and then let’s take a look at how to produce new messages from scratch.
Parsing a Message
Email parsing is not simple even aside from the complexity of the input. Depending on your application, you may want more or less parsing. The v2 release of go-email attempts to accommodate this by providing a variety of levers that can be used to control how much parsing occurs, what limits there are on input, etc.
Let’s consider the most basic parsing first and then move on to more complex multipart parsing and transfer encoding.
Opaque Messages
At the heart of github.com/zostay/go-email/v2/message is the message.Opaque type. This object is very similar to the mail.Message type from net/mail. Similar enough that I could have reused it. It has a header part and a body part. That’s it. Those are the mandatory parts of all email messages. The body might be a simple text message or might be a complex multi-layer multipart message. The point is that message.Opaque doesn’t care; the body is opaque to it. It’s just a collection of bytes.
To parse any email message to an opaque object, we do the following:
r, _ := os.Open("message.txt")
m, _ := message.Parse(r, message.WithoutMultipart())
opaqueMsg := m.(*message.Opaque)
The value in m will contain a fully decoded email header and can be used as a reader to get at the content within. The actual return value is an interface called message.Generic, but one of the semantics message.Generic guarantees is that it will always be implemented by either *message.Opaque or *message.Multipart (this latter type we will get to in a minute). And if you use the message.WithoutMultipart() option, the returned object is guaranteed to be a *message.Opaque, so the type assertion shown is 100% runtime safe (though we really ought to check the error returned by message.Parse() to make sure parsing worked).
I’ve run the code for message.Parse() across the quarter of a million messages in my mailbox and it will parse every one of them even though many of them are pretty badly formatted.
This level of parsing is the safest for round-tripping. Thus, if you just want to manipulate the Keywords header (which is my primary application), this basic level of parsing is all you’re likely to need.
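To make the round-tripping idea concrete, here is a minimal sketch that adds a Keywords header while leaving every other byte of the message alone. It uses only the standard library and is not how go-email does it internally; splitMessage and appendHeader are hypothetical helpers, and the sketch ignores existing Keywords fields and header folding.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitMessage cuts raw at the first blank line. The header block keeps its
// final line break and the blank line stays at the start of body, so
// head+body is always byte-for-byte equal to raw.
func splitMessage(raw []byte) (head, body []byte) {
	cut := -1
	for _, sep := range []string{"\r\n\r\n", "\n\n"} {
		if i := bytes.Index(raw, []byte(sep)); i >= 0 {
			end := i + len(sep)/2
			if cut < 0 || end < cut {
				cut = end
			}
		}
	}
	if cut < 0 {
		return raw, nil // no blank line: the message is header only
	}
	return raw[:cut], raw[cut:]
}

// appendHeader adds one header line at the end of the header block, reusing
// the line ending found there, and copies the body through untouched.
func appendHeader(raw []byte, name, value string) []byte {
	head, body := splitMessage(raw)
	eol := "\n"
	if bytes.HasSuffix(head, []byte("\r\n")) {
		eol = "\r\n"
	}
	out := make([]byte, 0, len(raw)+len(name)+len(value)+6)
	out = append(out, head...)
	if len(head) > 0 && !bytes.HasSuffix(head, []byte("\n")) {
		out = append(out, eol...)
	}
	line := name + ": " + value + eol
	out = append(out, line...)
	return append(out, body...)
}

func main() {
	raw := []byte("Subject: hi\r\nX-Odd:   spacing kept  \r\n\r\nbody bytes\r\n")
	head, body := splitMessage(raw)
	rebuilt := append(append([]byte{}, head...), body...)
	fmt.Println("round trip identical:", bytes.Equal(rebuilt, raw))
	fmt.Printf("%q\n", appendHeader(raw, "Keywords", "receipts"))
}
```

The point of the design is that nothing is re-serialized: odd spacing, unusual line endings, and the entire body pass through as the original bytes.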
Multipart Messages
However, your application may be more complicated. If, say, you want to save off every PDF in every message you’ve ever received, you’ll want a slightly different call to message.Parse():
r, _ := os.Open("message.txt")
m, _ := message.Parse(r, message.DecodeTransferEncoding())
This time, we cannot just convert the message m to *message.Multipart because it might not be that object. If the Content-type header of the original message is text/plain, then it will be returned as a *message.Opaque. If it is a multipart/mixed or something similar and all the other requirements for parsing a multipart message are met, then it will return a *message.Multipart. The behavior now depends on the message format. The *message.Multipart object does not provide a reader. Instead, it contains a list of parts, each of which implements the message.Part interface.
The message.Part and message.Generic interfaces are exactly identical (one is a copy of the other). However, while message.Generic has a documented contract that it will always be either a *message.Opaque or a *message.Multipart, message.Part does not have any such guarantee. This allows the library to be extended by you or others. As long as you adhere to the contract documented for message.Part, then *message.Multipart will behave properly.
A message.Multipart object always represents a multipart MIME message. It cannot represent anything else.
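The difference between a closed contract and an open one is easiest to see in a type switch. The sketch below uses made-up local types that only borrow the names from this post; the real interfaces in go-email have more methods, and Attachment is an invented third-party type. Because Part is open, code walking a tree of parts needs a default case.

```go
package main

import "fmt"

// Part is a hypothetical, stripped-down stand-in for an open interface:
// anyone may implement it.
type Part interface {
	IsMultipart() bool
}

// Opaque and Multipart stand in for the two built-in implementations.
type Opaque struct{ Body []byte }
type Multipart struct{ Parts []Part }

func (*Opaque) IsMultipart() bool    { return false }
func (*Multipart) IsMultipart() bool { return true }

// Attachment is an invented third-party Part.
type Attachment struct{ Filename string }

func (*Attachment) IsMultipart() bool { return false }

// countLeaves walks a tree of parts. The default case covers Opaque as well
// as any implementation this code has never heard of.
func countLeaves(p Part) int {
	switch v := p.(type) {
	case *Multipart:
		n := 0
		for _, child := range v.Parts {
			n += countLeaves(child)
		}
		return n
	default:
		return 1
	}
}

func main() {
	tree := &Multipart{Parts: []Part{
		&Opaque{Body: []byte("text")},
		&Multipart{Parts: []Part{&Attachment{Filename: "a.pdf"}, &Opaque{}}},
	}}
	fmt.Println("leaf parts:", countLeaves(tree))
}
```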
Transfer Encoding
In the example above, I use the message.DecodeTransferEncoding() option while parsing for multipart messages. This enables an additional step during parsing where the Content-transfer-encoding of each part is checked and decoded, when it can be. Email is a text format and allowing binary data to be stored into it can be problematic, even in 2023. As such, binary files are generally encoded using base64 encoding and even text files in character encodings like latin1 will often have a transfer encoding applied like quoted-printable or base64 to guarantee that no 8-bit characters are present or to ensure that binary data in the email won’t break the text formatting required.
Now, why didn’t I make this the default? Mostly because round-tripping is that important to me. Decoding a transfer encoding and then reapplying it is very likely to come up with different results from the original. As such, I recommend using this option if you’re going to be examining content of individual parts, but leaving it off otherwise.
Building Messages
The final piece is that I wanted the ability to easily construct and transform complex messages. (We’ll skip discussing the transforming bits as I’m still thinking more about that.) I have provided a number of tools for building messages and may add some more. At the heart of message building is the type named message.Buffer.
It is like a mirror of message.Generic. A message.Buffer has an email header, but instead of a reader, it may be used as an io.Writer. That writer writes to the message body directly. Or, you can manipulate the message.Buffer to create a multipart message by calling the Add() method, which adds parts to a message.
When you’re done, you may transform the buffer into either a *message.Opaque or a *message.Multipart. If you call the Opaque() method, you will get a *message.Opaque object representing the email. This always works, regardless of whether you used the io.Writer interface or the Add() method. You may, instead, call the Multipart() method to get a *message.Multipart. This is guaranteed to work if you called Add(). If you used the io.Writer interface, it will attempt to parse the content and transform that into a *message.Multipart (and if the data you wrote to it is not in MIME multipart format, you’ll get an error). This is probably less useful, in general, but provided for cases where it might be.
Here’s an example from the documentation for building a complex message and then writing it out to stdout:
mm := &message.Buffer{}
mm.SetSubject("Fancy message")
mm.SetMediaType("multipart/mixed")
txtPart := &message.Buffer{}
txtPart.SetMediaType("text/plain")
_, _ = fmt.Fprintln(txtPart, "Hello *World*!")
htmlPart := &message.Buffer{}
htmlPart.SetMediaType("text/html")
_, _ = fmt.Fprintln(htmlPart, "Hello <b>World</b>!")
mm.Add(message.MultipartAlternative(txtPart.Opaque(), htmlPart.Opaque()))
imgAttach, _ := message.AttachmentFile(
	"image.jpg",
	"image/jpeg",
	transfer.Base64,
)
mm.Add(imgAttach)
_, _ = mm.Opaque().WriteTo(os.Stdout)
Reading this carefully, you’ll see some additional helpers for building messages. There’s a message.MultipartAlternative() constructor that returns a *message.Multipart with the parts given and a single Content-type header set to multipart/alternative. (There’s also a constructor named message.MultipartMixed() that is exactly identical, except it sets the header to multipart/mixed.) You will also note the constructor named message.AttachmentFile(), which will read the given filename from disk and create a message attachment.
This code automatically handles mundane details like selecting a boundary between parts and such.
And so much more…
I could write much more regarding the low-level features of the library, which give you nuanced access to individual header fields and helpers for managing different field types, from address list fields to dates to keywords to Content-type and Content-disposition. However, I will let you read the online documentation for those details.
I could also talk about the walker, but I feel like that tooling is still a little on the unfinished side and also you can read more details in that documentation.
Finally, I could talk about the round trip test tools I built in to help ensure the email parsing works well against a large body of test messages, but perhaps I will leave that for another time.
I’ve tried to document everything. I know some of the last few features I added while working toward release don’t have test coverage yet, but test coverage is relatively high. I hope that it is a useful library and I have at least one production environment I intend to deploy it into in the next year. I also have some weird email test cases that I haven’t tried round-tripping yet, but such is any tool. The work is never quite done.
As with everything I do, I hope someone else will find it useful. However, whether someone does or not, I will.
Cheers.