How To: Read and Write ASCII and Binary Files in C++

I’m working on my C++ fundamentals at the moment, so I took a little time to cover reading and writing objects to/from files in ASCII and binary formats. I’ve worked with Java’s serializable mechanism before and liked it, so I was interested to see how C++ handled serialisation. In its usual C++ way; it does so at a lower level than you might expect.

Reading Strings From Binary Files

One of the interesting things I found was that if you write a string to a binary file, you can’t get it back because there’s no way to find its length. As you might imagine, the solution to this is to add a null-terminator (i.e. ‘\0’) at the end of your string, so you can pull it back in char-by-char and stop when you hit the terminator. This isn’t going to be very fast as we’re reading in such tiny (1 Byte) chunks – but depending on your scenario it may be perfectly acceptable:

There’s other ways of reading strings back from binary files, like writing the size of the string first, and then the string itself. When reading, you’d then read the size first (for example, as a 4 Byte unsigned int), and then read however many Bytes of data as specified by the size you just read. Or, alternatively, you can simply not use strings – instead, have a fixed size char array so, for example, name is always 30 characters. You might not use all 30, but even if you waste like 20 chars or so on average – so what? It’s only 20 Bytes. Then, you can just go: read 30 bytes of data and store it in my name char array, no null-terminator required.

Why Use Binary?

BenderBinary

Another interesting thing is why you’d bother storing things in binary in the first place – once your data’s in binary it’s not human-readable anymore, you need a hex editor if you really want to to modify it, and just… why? The answer lies in size and efficiency.

If we want to store an integer value, then typically an int is 4 Bytes. So lets say we had an unsigned int and we wanted to store the maximum value we can fit in that 4 bytes, in that case we’d be be storing the value 4,294,967,295 (just under 4.3 billion). Now, if we wanted to store that as a char array to keep things human readable, how many chars would we need? Counting them up gives us 10 chars worth of data (we won’t count the commas – I just put those in for formatting, and we’ll leave off any kind of terminator for now).

So, at 1 Byte of data per char, that’s 10 Bytes. What about if we wanted to store it in binary format? Well, we already said that one unsigned byte takes 4 Bytes – so there’s our answer – which means we’ve saved 6 Bytes of bloat on this one value alone. When you start working with 3D models, which might have thousands or even millions or vertices, each of which is typically 3 floats (x/y/z location) and maybe another 3 floats on top of that per vertex (x/y/z surface normal), and we may also have colour values (r/g/b/a) and maybe texture coordinates (s/t/[also p if 3D textures]) — we’re looking at a huge amount of data in a single model (and this doesn’t even account for multiple animation frames of a model!).

When our files get this big we’re never going to manually go into the data and twiddle it, so we gain nothing by keeping it human-readable. In fact, we lose time in loading/processing/writing the data as it will be significantly larger in ASCII than pure binary. So even though it’s a little bit more work to set up the read/write in binary, we gain a lot down the line by doing so.

Those of you who’re sharp as a tack will have noticed that I picked a large unsigned integer value (10 chars) to store in our example, and you may well be wondering what if we stored a small value, like something between 0 and 9 instead? Well, in that specific case we’d actually save 3 Bytes by storing it as a char (1 Byte each) rather than an unsigned int (4 Bytes each). But… if you were dealing with such a small range, you can only get a range of between 0 and 9 as a char (1 Byte) while you can store a value in the range -128 to +127 when treating the same Byte as a signed numerical type (or 0 to 255 unsigned) for the exact same storage. In effect, you’re never going to gain anything by using chars over numeric types!

If you really wanted to maximise your storage you can go further by using a tightly packed bit field (cppreference.com, wikipedia) instead of a ‘wasteful’ numeric type with all its fancy-pants byte alignment =P (Although bit-scrimper beware: unpacking non-boundary aligned data can come with performance penalties!).

Anyways, that’s the theory. You can find some heavily commented source code for reading/writing objects (which include a string property) to files in ASCII and binary formats below. In this simple example we write and then read two objects to a file in ASCII mode, and then do the exact same thing in binary mode, which results in file sizes of: 156 Bytes (ASCII) Vs. 33 Bytes (Binary).

Cheers!

Source Code

Leave a Reply

Your email address will not be published.