add option to allow unpacking invalid utf8 strings
resolves #2.
vsergeev committed Oct 19, 2016
1 parent 921ae15 commit 9ea90b1
Showing 3 changed files with 51 additions and 6 deletions.
24 changes: 20 additions & 4 deletions README.md
@@ -142,6 +142,20 @@ OrderedDict([('compact', True), ('schema', 0)])
>>>
```

### Invalid UTF-8 Strings

The unpacking functions provide an `allow_invalid_utf8` option to unpack MessagePack strings with invalid UTF-8 into the `umsgpack.InvalidString` type, instead of throwing an exception. The `umsgpack.InvalidString` type is a subclass of `bytes`, and can be used like any other `bytes` object.

``` python
>>> # Attempt to unpack invalid UTF-8 string
... umsgpack.unpackb(b'\xa4\x80\x01\x02\x03')
...
umsgpack.InvalidStringException: unpacked string is invalid utf-8
>>> umsgpack.unpackb(b'\xa4\x80\x01\x02\x03', allow_invalid_utf8=True)
b'\x80\x01\x02\x03'
>>>
```
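
As a rough illustration of how calling code might treat the two possible result types, the sketch below uses a local stand-in for `umsgpack.InvalidString` (a plain `bytes` subclass, as described above) so that it runs without the library installed. The helper name `to_text` and the choice of a lossy `errors="replace"` decode are assumptions made for illustration; they are not part of umsgpack.

```python
class InvalidString(bytes):
    """Local stand-in for umsgpack.InvalidString (a bytes subclass)."""


def to_text(obj):
    """Hypothetical helper: return a str for either unpack result type."""
    if isinstance(obj, InvalidString):
        # Invalid UTF-8: decode lossily, substituting U+FFFD for bad bytes.
        return bytes(obj).decode("utf-8", errors="replace")
    return obj


# Simulates what unpackb(..., allow_invalid_utf8=True) returns for bad data.
raw = InvalidString(b"\x80\x01\x02\x03")
text = to_text(raw)
```

Because the invalid case is a distinct subclass, an `isinstance` check is enough to tell it apart from ordinary `bytes` values unpacked from the msgpack `binary` format.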

### Compatibility Mode

The compatibility mode supports the "raw" bytes MessagePack type from the [old specification](https://github.com/msgpack/msgpack/blob/master/spec-old.md). When the module-wide `compatibility` option is enabled, both unicode strings and bytes will be serialized into the "raw" MessagePack type, and the "raw" MessagePack type will be deserialized into bytes.
@@ -216,13 +230,15 @@ If a non-byte-string argument is passed to `umsgpack.unpackb()`, it will raise a
```
* `InvalidStringException`: Invalid UTF-8 string encountered during unpacking.

-String bytes are strictly decoded with UTF-8. This exception is thrown if UTF-8 decoding of string bytes fails.
+String bytes are strictly decoded with UTF-8. This exception is thrown if
+UTF-8 decoding of string bytes fails. Use the `allow_invalid_utf8` option
+to unpack invalid MessagePack strings into byte strings.

``` python
-# Attempt to unpack the string b"\x80\x81"
+# Attempt to unpack invalid UTF-8 string
>>> umsgpack.unpackb(b"\xa2\x80\x81")
...
-umsgpack.InvalidStringException: unpacked string is not utf-8
+umsgpack.InvalidStringException: unpacked string is invalid utf-8
>>>
```

@@ -268,7 +284,7 @@ If a non-byte-string argument is passed to `umsgpack.unpackb()`, it will raise a
* Python 3
* `str` type objects are packed into, and unpacked from, the msgpack `string` format
* `bytes` type objects are packed into, and unpacked from, the msgpack `binary` format
-* The msgpack string format is strictly decoded with UTF-8 -- an exception is thrown if the string bytes cannot be decoded into a valid UTF-8 string
+* The msgpack string format is strictly decoded with UTF-8 -- an exception is thrown if the string bytes cannot be decoded into a valid UTF-8 string, unless the `allow_invalid_utf8` option is enabled
* The msgpack array format is unpacked into a Python list, unless it is the key of a map, in which case it is unpacked into a Python tuple
* Python tuples and lists are both packed into the msgpack array format
* Python float types are packed into the msgpack float32 or float64 format depending on the system's `sys.float_info`
9 changes: 9 additions & 0 deletions test_umsgpack.py
@@ -224,6 +224,7 @@
# These are the only global variables that should be exported by umsgpack
exported_vars_test_vector = [
"Ext",
"InvalidString",
"PackException",
"UnpackException",
"UnsupportedTypeException",
@@ -332,6 +333,14 @@ def test_unpack_compatibility(self):

umsgpack.compatibility = False

    def test_unpack_invalid_string(self):
        # Use last unpack exception test vector (an invalid string)
        (_, data, _) = unpack_exception_test_vectors[-1]

        obj = umsgpack.unpackb(data, allow_invalid_utf8=True)
        self.assertTrue(isinstance(obj, umsgpack.InvalidString))
        self.assertEqual(obj, b"\x80")

def test_unpack_ordered_dict(self):
# Use last composite test vector (a map)
(_, obj, data) = composite_test_vectors[-1]
24 changes: 22 additions & 2 deletions umsgpack.py
@@ -129,6 +129,10 @@ def __str__(self):
s += ")"
return s

class InvalidString(bytes):
    """Subclass of bytes to hold invalid UTF-8 strings."""
    pass

################################################################################
### Exceptions
################################################################################
@@ -551,10 +555,13 @@ def _unpack_string(code, fp, options):
     if compatibility:
         return _read_except(fp, length)

+    data = _read_except(fp, length)
     try:
-        return bytes.decode(_read_except(fp, length), 'utf-8')
+        return bytes.decode(data, 'utf-8')
     except UnicodeDecodeError:
-        raise InvalidStringException("unpacked string is not utf-8")
+        if options.get("allow_invalid_utf8"):
+            return InvalidString(data)
+        raise InvalidStringException("unpacked string is invalid utf-8")

def _unpack_binary(code, fp, options):
if code == b'\xc4':
@@ -655,6 +662,9 @@ def _unpack2(fp, **options):
Kwargs:
use_ordered_dict (bool): unpack maps into OrderedDict, instead of
unordered dict (default False)
allow_invalid_utf8 (bool): unpack invalid strings into instances of
InvalidString, for access to the bytes
(default False)
Returns:
A Python object.
@@ -690,6 +700,9 @@ def _unpack3(fp, **options):
Kwargs:
use_ordered_dict (bool): unpack maps into OrderedDict, instead of
unordered dict (default False)
allow_invalid_utf8 (bool): unpack invalid strings into instances of
InvalidString, for access to the bytes
(default False)
Returns:
A Python object.
@@ -726,6 +739,9 @@ def _unpackb2(s, **options):
Kwargs:
use_ordered_dict (bool): unpack maps into OrderedDict, instead of
unordered dict (default False)
allow_invalid_utf8 (bool): unpack invalid strings into instances of
InvalidString, for access to the bytes
(default False)
Returns:
A Python object.
@@ -765,6 +781,10 @@ def _unpackb3(s, **options):
Kwargs:
use_ordered_dict (bool): unpack maps into OrderedDict, instead of
unordered dict (default False)
allow_invalid_utf8 (bool): unpack invalid strings into instances of
InvalidString, for access to the bytes
(default False)
Returns:
A Python object.
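
The `_unpack_string` change in this commit can be sketched in isolation as follows. This is a minimal standalone approximation, not the module's code: `decode_string` is a hypothetical name, and the two classes are local stand-ins for `umsgpack.InvalidString` and `umsgpack.InvalidStringException` so the snippet runs without the library.

```python
class InvalidString(bytes):
    """Stand-in for the bytes subclass added in this commit."""


class InvalidStringException(Exception):
    """Stand-in for umsgpack.InvalidStringException."""


def decode_string(data, **options):
    """Decode string payload bytes, mirroring the fallback in _unpack_string."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if options.get("allow_invalid_utf8"):
            # Keep the raw bytes, tagged so callers can tell them apart.
            return InvalidString(data)
        raise InvalidStringException("unpacked string is invalid utf-8")


valid = decode_string(b"abc")
fallback = decode_string(b"\x80", allow_invalid_utf8=True)
```

Reading the payload once into `data` before the `try` block is what makes the fallback possible: the bytes have already been consumed from the stream, so they must be held in a variable to be returned when decoding fails.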
