RFC7578 (which obsoletes RFC2388) defines the multipart/form-data type that is usually transported over HTTP when users submit forms on your Web page. Nowadays, it tends to be replaced by JSON encoded payloads; nevertheless, it is still widely used.
While you can decode a JSON HTTP request body natively in Python, thanks to the json module, there is no such way to do that with multipart/form-data. That's hard to understand considering how old the format is.
There is a wide variety of ways to encode and decode this format. Libraries such as requests support it natively without you even noticing, and the same goes for the majority of Web server frameworks such as Django or Flask.
However, in certain circumstances, you might be on your own to encode or decode this format, and it might not be an option to pull (significant) dependencies.
Encoding
The multipart/form-data format is quite simple to understand and can be summarised as an easy way to encode a list of keys and values, i.e., a portable way of serializing a dictionary.
There's no dedicated function in Python to generate such an encoding. The format consists of each key and value pair surrounded by a random boundary delimiter. This delimiter must be passed as part of the Content-Type header, so that the decoder can decode the form data.
There's a simple implementation in urllib3 that does the job. It can be summarized as follows:
import binascii
import os


def encode_multipart_formdata(fields):
    # Random boundary separating the parts; it is also sent in Content-Type.
    boundary = binascii.hexlify(os.urandom(16)).decode('ascii')

    body = (
        "".join("--%s\r\n"
                "Content-Disposition: form-data; name=\"%s\"\r\n"
                "\r\n"
                "%s\r\n" % (boundary, field, value)
                for field, value in fields.items()) +
        "--%s--\r\n" % boundary
    )

    content_type = "multipart/form-data; boundary=%s" % boundary

    return body, content_type
You can use it by passing a dictionary where keys and values are strings. For example:
encode_multipart_formdata({"foo": "bar", "name": "jd"})
Which returns:
--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name="foo"

bar
--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name="name"

jd
--00252461d3ab8ff5c25834e0bffd6f70--

multipart/form-data; boundary=00252461d3ab8ff5c25834e0bffd6f70
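To make the link between the two returned values and an actual HTTP request more concrete, here is a minimal sketch using urllib.request from the standard library. The URL is a placeholder, the UTF-8 encoding of the body is an assumption, and the request is only built, not sent.

```python
import binascii
import os
import urllib.request


def encode_multipart_formdata(fields):
    # Same encoder as above, repeated so that this sketch is self-contained.
    boundary = binascii.hexlify(os.urandom(16)).decode("ascii")
    body = (
        "".join(
            "--%s\r\n"
            'Content-Disposition: form-data; name="%s"\r\n'
            "\r\n"
            "%s\r\n" % (boundary, field, value)
            for field, value in fields.items()
        )
        + "--%s--\r\n" % boundary
    )
    return body, "multipart/form-data; boundary=%s" % boundary


body, content_type = encode_multipart_formdata({"foo": "bar", "name": "jd"})

# The body goes into the request payload and the content type, which carries
# the boundary, goes into the headers.
request = urllib.request.Request(
    "http://example.com/submit",  # hypothetical endpoint, for illustration only
    data=body.encode("utf-8"),  # assumption: the server expects UTF-8
    headers={"Content-Type": content_type},
    method="POST",
)

# Calling urllib.request.urlopen(request) would actually send it.
```

Note that this simple encoder does not escape quotes or line breaks in field names, so it is only safe for keys you control.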
You can use the returned content type in the Content-Type header of your HTTP request. Note that this format is not only used for forms: it is also used by emails.
Emails, did you say?
Encoding with email
Right, emails are usually encoded using MIME, which is defined by yet another RFC, RFC2046. It turns out that multipart/form-data is just a particular MIME format, and that if you have code that implements MIME handling, it's easy to use it to implement this format.
Fortunately for us, the Python standard library comes with a module that handles exactly that: email.mime. I told you it was heavily used by email; I guess that's why they put that code in the email package.
Here's a piece of code that handles multipart/form-data in a few lines:
from email.mime import multipart
from email.mime import nonmultipart


class MIMEFormdata(nonmultipart.MIMENonMultipart):
    def __init__(self, keyname, *args, **kwargs):
        super(MIMEFormdata, self).__init__(*args, **kwargs)
        # Each part names its form field in a Content-Disposition header.
        self.add_header(
            "Content-Disposition", "form-data; name=\"%s\"" % keyname)


def encode_multipart_formdata(fields):
    m = multipart.MIMEMultipart("form-data")

    for field, value in fields.items():
        data = MIMEFormdata(field, "text", "plain")
        data.set_payload(value)
        m.attach(data)

    return m
Converting the returned message to a string (for example with its as_string() method) gives the following:
Content-Type: multipart/form-data; boundary="===============1107021068307284864=="
MIME-Version: 1.0

--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name="foo"

bar
--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name="name"

jd
--===============1107021068307284864==--
This method has several advantages over our first implementation:
- It handles Content-Type for each of the added MIME parts. We could add data types other than just text/plain, which is what the first version implicitly uses. We could also specify the charset (encoding) of the textual data.
- It's very likely more robust, since it leverages the widely tested Python standard library.
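As a sketch of the first point, the per-part content type and its parameters can be passed straight through to the MIME class. The encode_typed_formdata function and its tuple layout are made up for this illustration, and the payloads are kept to plain ASCII to stay clear of transfer-encoding questions.

```python
from email.mime import multipart
from email.mime import nonmultipart


class MIMEFormdata(nonmultipart.MIMENonMultipart):
    def __init__(self, keyname, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.add_header(
            "Content-Disposition", 'form-data; name="%s"' % keyname)


def encode_typed_formdata(fields):
    # Hypothetical variant: each value is a tuple of
    # (maintype, subtype, content_type_params, payload).
    m = multipart.MIMEMultipart("form-data")
    for name, (maintype, subtype, params, payload) in fields.items():
        # Extra keyword arguments end up as Content-Type parameters.
        part = MIMEFormdata(name, maintype, subtype, **params)
        part.set_payload(payload)
        m.attach(part)
    return m


message = encode_typed_formdata({
    "comment": ("text", "plain", {"charset": "utf-8"}, "hello"),
    "settings": ("application", "json", {}, '{"debug": true}'),
})
print(message.as_string())
```

Each part then carries its own Content-Type header, with the charset parameter on the first one.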
The main downside, in that case, is that the Content-Type header is included with the content. When handling HTTP, this is problematic, as it needs to be sent as part of the HTTP headers and not as part of the payload.
It should be possible to build a particular generator from email.generator that does this. I'll leave that as an exercise to you, reader.
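Short of writing a custom generator, one possible shortcut is to serialize the whole message and cut it at the first blank line, which is where the top-level headers end. This is only a sketch under a few assumptions: split_headers_and_body is a hypothetical helper name, the line separator is forced to CRLF, and the boundary is read back from the message after serialization, since that is when the standard library generates it.

```python
from email.mime import multipart
from email.mime import nonmultipart


class MIMEFormdata(nonmultipart.MIMENonMultipart):
    def __init__(self, keyname, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.add_header(
            "Content-Disposition", 'form-data; name="%s"' % keyname)


def encode_multipart_formdata(fields):
    m = multipart.MIMEMultipart("form-data")
    for field, value in fields.items():
        data = MIMEFormdata(field, "text", "plain")
        data.set_payload(value)
        m.attach(data)
    return m


def split_headers_and_body(message):
    # Hypothetical helper: serialize with CRLF line endings, then drop the
    # top-level headers, which end at the first blank line.
    raw = message.as_bytes(policy=message.policy.clone(linesep="\r\n"))
    _headers, _separator, body = raw.partition(b"\r\n\r\n")
    # The boundary is set on the message while it is being serialized.
    # It contains "=" characters, so it is quoted in the header value.
    content_type = 'multipart/form-data; boundary="%s"' % message.get_boundary()
    return body, content_type


body, content_type = split_headers_and_body(
    encode_multipart_formdata({"foo": "bar", "name": "jd"}))
```

The body can then be sent as the HTTP payload and content_type as the Content-Type header. Each part still carries its own Content-Type and MIME-Version headers, which should be harmless but is more verbose than the first implementation.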
Decoding
We should be able to use that same email package to decode our encoded data, right? It turns out that's the case, with a piece of code that looks like this:
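import email.parser

# my_multipart_data must be bytes that start with the Content-Type header
# (including the boundary), followed by a blank line and the encoded body.
msg = email.parser.BytesParser().parsebytes(my_multipart_data)

print({
    part.get_param('name', header='content-disposition'): part.get_payload(decode=True)
    for part in msg.get_payload()
})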
With the example data above, this returns:
{'foo': b'bar', 'name': b'jd'}
Amazing, right?
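In an HTTP context the Content-Type usually arrives as a header, separate from the body, so the two need to be glued back together before parsing. Here is a minimal sketch of that, assuming a hypothetical decode_multipart_formdata helper and a hand-written sample body with a made-up boundary:

```python
import email.parser


def decode_multipart_formdata(content_type, body):
    # Hypothetical helper: rebuild a MIME document by putting the HTTP
    # Content-Type header in front of the raw body, then parse it.
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    msg = email.parser.BytesParser().parsebytes(header + body)
    if not msg.is_multipart():
        raise ValueError("payload is not a multipart document")
    return {
        part.get_param("name", header="content-disposition"):
            part.get_payload(decode=True)
        for part in msg.get_payload()
    }


# Sample body, as it could be received in an HTTP request.
body = (
    b"--xyz\r\n"
    b'Content-Disposition: form-data; name="foo"\r\n'
    b"\r\n"
    b"bar\r\n"
    b"--xyz\r\n"
    b'Content-Disposition: form-data; name="name"\r\n'
    b"\r\n"
    b"jd\r\n"
    b"--xyz--\r\n"
)

fields = decode_multipart_formdata("multipart/form-data; boundary=xyz", body)
print(fields)
```

This sketch keeps only the last value when a field name appears several times, and it ignores file uploads and their filename parameter.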
The moral of this story is that you should never underestimate the power of the standard library. While it's easy to add a single line in your list of dependencies, it's not always required if you dig a bit into what Python provides for you!
from Planet Python