:mod:`email` Package Architecture
=================================

Overview
--------

The email package consists of three major components:

    Model
        An object structure that represents an email message, and provides an
        API for creating, querying, and modifying a message.

    Parser
        Takes a sequence of characters or bytes and produces a model of the
        email message represented by those characters or bytes.

    Generator
        Takes a model and turns it into a sequence of characters or bytes.  The
        sequence can either be intended for human consumption (a printable
        unicode string) or bytes suitable for transmission over the wire.  In
        the latter case all data is properly encoded using the content transfer
        encodings specified by the relevant RFCs.

Conceptually the package is organized around the model.  The model provides both
"external" APIs intended for use by application programs using the library,
and "internal" APIs intended for use by the Parser and Generator components.
This division is intentionally a bit fuzzy; everything described in this
documentation is public, stable API.  This allows an application
with special needs to implement its own parser and/or generator.

In addition to the three major functional components, there is a fourth key
component to the architecture:

    Policy
        An object that specifies various behavioral settings and carries
        implementations of various behavior-controlling methods.

The Policy framework provides a simple and convenient way to control the
behavior of the library, making it possible for the library to be used in a
very flexible fashion while leveraging the common code required to parse,
represent, and generate message-like objects.  For example, in addition to the
default :rfc:`5322` email message policy, we also have a policy that manages
HTTP headers in a fashion compliant with :rfc:`2616`.  Individual policy
controls, such as the maximum line length produced by the generator, can also
be set independently to meet specialized application requirements.
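
As a minimal sketch of adjusting a single policy control, the following
overrides only the maximum line length and compares the generator output
against the default.  The header value and the limit of 40 are arbitrary
choices made for illustration.

```python
from email import policy
from email.message import EmailMessage

# A long unstructured header value made of short words, so it can be folded.
msg = EmailMessage(policy=policy.default)
msg["Subject"] = " ".join(["word"] * 30)

# clone() returns a new policy object with the given controls overridden.
narrow = policy.default.clone(max_line_length=40)

text_default = msg.as_string()              # folded at the default length
text_narrow = msg.as_string(policy=narrow)  # folded at 40 characters

print(text_narrow)
```

The message itself is unchanged; only the policy handed to the generator
differs between the two calls.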


The Model
---------

The message model is implemented by the :class:`~email.message.Message` class.
The model divides a message into the two fundamental parts discussed by the
RFC: the header section and the body.  The `Message` object acts as a
pseudo-dictionary of named headers.  Its dictionary interface provides
convenient access to individual headers by name.  However, all headers are kept
internally in an ordered list, so that the information about the order of the
headers in the original message is preserved.

The `Message` object also has a `payload` that holds the body.  A `payload` can
be one of two things: data, or a list of `Message` objects.  The latter is used
to represent a multipart MIME message.  Lists can be nested arbitrarily deeply
in order to represent the message, with all terminal leaves having non-list
data payloads.
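
The structure described above can be sketched as follows.  The header names
and values are invented for illustration, and the example relies on the
default (backward compatible) policy of `Message`, under which assigning to
an existing header name appends another header instead of replacing it.

```python
from email.message import Message

outer = Message()
outer["Received"] = "first hop"
outer["Subject"] = "demo"
outer["Received"] = "second hop"   # appended; the first one is kept
outer["Content-Type"] = "multipart/mixed"

# A leaf part whose payload is plain data.
part = Message()
part["Content-Type"] = "text/plain"
part.set_payload("hello")

# Attaching a part turns the outer payload into a list of Message objects.
outer.attach(part)

header_order = outer.keys()                # original order is preserved
first_received = outer["received"]         # lookup is case-insensitive
all_received = outer.get_all("Received")   # every header with that name
leaf_payload = outer.get_payload()[0].get_payload()
```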


Message Lifecycle
-----------------

The general lifecycle of a message is:

    Creation
        A `Message` object can be created by a Parser, or it can be
        instantiated as an empty message by an application.

    Manipulation
        The application may examine one or more headers, and/or the
        payload, and it may modify one or more headers and/or
        the payload.  This may be done on the top level `Message`
        object, or on any sub-object.

    Finalization
        The Model is converted into a unicode or binary stream,
        or the model is discarded.
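
The three stages can be sketched with the convenience functions of the
package.  The message text is invented for illustration, and the choice of
``policy.default`` is an assumption; any policy could be used.

```python
import email
from email import policy

raw = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nBody text\n"

# Creation: a Parser builds the model from text.
msg = email.message_from_string(raw, policy=policy.default)

# Manipulation: examine a header, then replace it.
old_subject = str(msg["Subject"])
del msg["Subject"]
msg["Subject"] = "hello"

# Finalization: a Generator serializes the model to text or to bytes.
as_text = msg.as_string()
as_bytes = msg.as_bytes()
```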



Header Policy Control During Lifecycle
--------------------------------------

One of the major controls exerted by the Policy is the management of headers
during the `Message` lifecycle.  Most applications don't need to be aware of
this.

A header enters the model in one of two ways: via a Parser, or by being set to
a specific value by an application program after the Model already exists.
Similarly, a header exits the model in one of two ways: by being serialized by
a Generator, or by being retrieved from a Model by an application program.  The
Policy object provides hooks for all four of these pathways.

The model storage for headers is a list of (name, value) tuples.

The Parser identifies headers during parsing, and passes them to the
:meth:`~email.policy.Policy.header_source_parse` method of the Policy.  The
result of that method is the (name, value) tuple to be stored in the model.

When an application program supplies a header value (for example, through the
`Message` object `__setitem__` interface), the name and the value are passed to
the :meth:`~email.policy.Policy.header_store_parse` method of the Policy, which
returns the (name, value) tuple to be stored in the model.

When an application program retrieves a header (through any of the dict or list
interfaces of `Message`), the name and value are passed to the
:meth:`~email.policy.Policy.header_fetch_parse` method of the Policy to
obtain the value returned to the application.

When a Generator requests a header during serialization, the name and value are
passed to the :meth:`~email.policy.Policy.fold` method of the Policy, which
returns a string containing line breaks in the appropriate places.  The
:attr:`~email.policy.Policy.cte_type` Policy control determines whether or
not Content Transfer Encoding is performed on the data in the header.  There is
also a :meth:`~email.policy.Policy.fold_binary` method for use by generators
that produce binary output, which returns the folded header as binary data,
possibly folded at different places than the corresponding string would be.
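
To make the four pathways visible, the sketch below subclasses
``email.policy.EmailPolicy`` and records each hook as it is invoked.
``TracingPolicy`` and the ``calls`` list are hypothetical names introduced
for this illustration; they are not part of the package.

```python
import email
from email import policy

# Module-level log: policy instances do not accept new attributes.
calls = []


class TracingPolicy(policy.EmailPolicy):
    """Hypothetical policy that records which hook is invoked."""

    def header_source_parse(self, sourcelines):
        calls.append("header_source_parse")
        return super().header_source_parse(sourcelines)

    def header_store_parse(self, name, value):
        calls.append("header_store_parse")
        return super().header_store_parse(name, value)

    def header_fetch_parse(self, name, value):
        calls.append("header_fetch_parse")
        return super().header_fetch_parse(name, value)

    def fold(self, name, value):
        calls.append("fold")
        return super().fold(name, value)


tracing = TracingPolicy()
msg = email.message_from_string("Subject: hi\n\nbody\n", policy=tracing)
msg["X-Note"] = "set by the application"   # enters via the application
subject = msg["Subject"]                   # exits via the application
text = msg.as_string()                     # exits via a Generator
print(calls)
```

Each override delegates to the parent class, so the message behaves exactly
as it would under an unmodified ``EmailPolicy``.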


Handling Binary Data
--------------------

In an ideal world all message data would conform to the RFCs, meaning that the
parser could decode the message into the idealized unicode message that the
sender originally wrote.  In the real world, the email package must also be
able to deal with badly formatted messages, including messages containing
non-ASCII characters that either have no indicated character set or are not
valid characters in the indicated character set.

Since email messages are *primarily* text data, and operations on message data
are primarily text operations (except for binary payloads of course), the model
stores all text data as unicode strings.  Un-decodable binary inside text
data is handled by using the `surrogateescape` error handler of the ASCII
codec.  That error handler was originally introduced to handle binary
filenames; as in that case, it allows the email package to "carry" the binary
data received during parsing along until the output stage, at which time it is
regenerated in its original form.
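
A small sketch of the carrying mechanism, first with the codec alone and
then through the package.  The input bytes are invented for illustration:
a header containing a stray ``0xE9`` byte with no declared character set.

```python
import email

raw = b"Subject: caf\xe9\n\nbody\n"

# The ASCII codec with surrogateescape maps each undecodable byte to a
# lone surrogate code point (0xE9 becomes U+DCE9) instead of failing.
carried = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = carried.encode("ascii", errors="surrogateescape")

# The same idea through the package: parse bytes, then regenerate bytes.
msg = email.message_from_bytes(raw)
regenerated = msg.as_bytes()
```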

This carried binary data is almost entirely an implementation detail.  The one
place where it is visible in the API is in the "internal" API.  A Parser must
do the `surrogateescape` encoding of binary input data, and pass that data to
the appropriate Policy method.  The "internal" interface used by the Generator
to access header values preserves the `surrogateescaped` bytes.  All other
interfaces convert the binary data either back into bytes or into a safe form
(losing information in some cases).


Backward Compatibility
----------------------

The :class:`~email.policy.Compat32` Policy provides backward
compatibility with version 5.1 of the email package.  It does this via the
following implementation of the five Policy methods described above (the four
header hooks plus the binary folding method):

header_source_parse
    Splits the first line on the colon to obtain the name, discards any spaces
    after the colon, and joins the remainder of the line with all of the
    remaining lines, preserving the linesep characters to obtain the value.
    Trailing carriage return and/or linefeed characters are stripped from the
    resulting value string.

header_store_parse
    Returns the name and value exactly as received from the application.

header_fetch_parse
    If the value contains any `surrogateescaped` binary data, return the value
    as a :class:`~email.header.Header` object, using the character set
    `unknown-8bit`.  Otherwise just returns the value.

fold
    Uses :class:`~email.header.Header`'s folding to fold headers in the
    same way the email 5.1 generator did.

fold_binary
    Same as fold, but encodes to 'ascii'.
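
The behavior listed above can be observed by calling the methods directly on
the ``email.policy.compat32`` instance.  This is only a sketch; the header
names and values are invented for illustration.

```python
from email.header import Header
from email.policy import compat32

# header_source_parse: split on the first colon, drop the spaces after it,
# keep interior line breaks, strip the trailing line ending.
source = compat32.header_source_parse(["Subject: a long\n", " folded value\n"])

# header_store_parse: name and value are returned unchanged.
stored = compat32.header_store_parse("X-Note", "as given")

# header_fetch_parse: a plain value comes back as is, while a value that
# carries surrogateescaped bytes is wrapped in a Header object.
plain = compat32.header_fetch_parse("Subject", "hello")
wrapped = compat32.header_fetch_parse("Subject", "caf\udce9")

# fold returns text; fold_binary returns the same header as bytes.
folded = compat32.fold("Subject", "hello")
folded_bytes = compat32.fold_binary("Subject", "hello")
```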


New Algorithm
-------------

header_source_parse
    Same as legacy behavior.

header_store_parse
    Same as legacy behavior.

header_fetch_parse
    If the value is already a header object, returns it.  Otherwise, parses the
    value using the new parser, and returns the resulting object as the value.
    `surrogateescaped` bytes get turned into unicode unknown character code
    points.

fold
    Uses the new header folding algorithm, respecting the policy settings.
    `surrogateescaped` bytes are encoded using the ``unknown-8bit`` charset for
    ``cte_type=7bit`` or ``8bit``.  Returns a string.

    At some point there will also be a ``cte_type=unicode``, and for that
    policy fold will serialize the idealized unicode message with RFC-like
    folding, converting any `surrogateescaped` bytes into the unicode
    unknown character glyph.

fold_binary
    Uses the new header folding algorithm, respecting the policy settings.
    `surrogateescaped` bytes are encoded using the ``unknown-8bit`` charset for
    ``cte_type=7bit``, and get turned back into bytes for ``cte_type=8bit``.
    Returns bytes.

    At some point there will also be a ``cte_type=unicode``, and for that
    policy fold_binary will serialize the message according to :rfc:`5335`.
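
As a sketch of the new ``header_fetch_parse`` behavior, the following calls
the method directly on ``email.policy.default``.  The use of that particular
policy instance and the header value are assumptions made for illustration.

```python
from email import policy

# A stored value that still carries one surrogateescaped byte (0xE9).
stored_value = "caf\udce9"

# The value is parsed with the new parser and returned as a header object.
header = policy.default.header_fetch_parse("Subject", stored_value)

# The carried byte is shown as the unicode unknown character, U+FFFD.
shown = str(header)

# A value that is already a header object is returned unchanged.
same = policy.default.header_fetch_parse("Subject", header)
```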