Binaries and Strings in Elixir
- 5 minsBelow is the general presentation from the November 2016 Houston Elixir Meetup on strings and binaries. It synthesizes a lot of sources about strings on the BEAM, but provides a quick and dirty guide to how strings/binaries work on this platform.
The talk was given days after Halloween, so we indulged in spooky examples.
Binaries and strings
What are binaries?
Strings are UTF-8 Encoded binaries.
Binaries are a sequence of bytes, denoted between the <<
and >>
angle brackets.
A Binary is a bitstring where the number of bits is divisible by 8.
Binaries that are valid UTF - 8 will be represented as strings.
Strings have a data structure and a printable representation.
This example shows when you enter the codepoints into IEX, you receive a representation as a string.
You can also get a character’s codepoint by using the ?
operator.
Elixir also supports a shorthand for referring to unicode characters:
Graphemes vs. codepoints
The data structure is composed of codepoints, which can combine to create individual graphemes. In unicode, some codepoints are used to modify other characters, like é
.
Keep in mind that in order to know a string’s length (String.length/1
, Elixir will have to traverse this sequence of bytes to return the number of printable graphemes from your data (and O(N) operation). You can also use Kernel.byte_size/1
to know the size of a given binary, which is an O(1) operation.
There be dragons 🐉
Some characters can be represented as one codepoint or as a combination of multiple codepoints:
Charlists
Charlists are typically used for compatability reasons with Erlang. They are represented as single-quoted strings.
One advantage/feature of this is that you can iterate over the sequence with functions you would use to work with lists.
Large binary space
Large binaries (> 64 bytes) stored in separate area from a process’ heap.
This allows for fast message passing as only pointer sent between processes
Can save a lot of memory as well (no needless copying)
Can be long delay before being reclaimed by GC. The large binary has a counter of all of the other processes that are using it. In order to be garbage collected, all processes which have “seen” the binary must first do a garbage collection.
If garbage collection does not happen often enough, the large binaries can grow and crash system.
If your process is only referencing a tiny part of a large binary, it might be worthwhile to use :binary.copy
to explicitly copy the portion you need from a large binary so that the large binary can be garbage collected sooner. More info in erlang’s docs.
See also Evan Miller’s Template of Doom post.
IO Lists
Concatenating strings costs memory to compose items into a single string. Since our data is immutable, it can be cheaper and faster to work with lists of references that can easily be appended.
IO lists allow you to create sequences of binaries and other IO lists, which will ultimately be flattened when the list is written (to disk for example). This has advantages because that final string never requires memory within your application. It only exists in the destination. Additionally, IO lists work with lists have have layers of nesting.
Additional Sources / Further Reading
- Elixir and Unicode, Part 2: Working with Unicode Strings from Nathan Long / Big Nerd Ranch
- Hitchhiker’s Guide to the BEAM Lecture from Robert Virding, Includes information on the large binary space on the BEAM, and lots of other information about the VM.
- Erlang’s binary manpage
- String Module from Elixir.
- Template of Doom from Evan Miller