Thursday, November 25, 2010


C++ Serialization Anyone?



Today I had one of the most amazing programming experiences of my entire exciting career. I'm still a bit stunned that it happened. Until now, I fully believed that what happened was completely impossible.

At work, we use a lot of different languages to create our software. It's not odd for us to be working on a project which somehow ends up using a dozen languages. Between server code, client code, databases, communication, markup, styling, pre-processing, dynamic code generation, and other common needs, it's rather easy to get there, actually.

Between all these different programming languages, quite often, we need some sort of data interchange format. There are many to choose from, ranging from something custom to something well known like XML. Using these formats, we can pass data from one segment of our application stack to another, even when they use two different programming languages. We can also save some data, and load it back up later.

When it comes to these things, soft typed functional languages are generally easier to work with than hard typed ones. Soft typed languages are very good at building objects from data on the fly, thanks to their ability to not care much about what types they're looking at. Is it a number or a string? Doesn't matter to the soft typed language, as it stores it all the same way.

When dealing with database access from hard typed languages, the popular method is to create some sort of catch-all or convert-to-anything type. For some, terms like "QVariant" or "boost::any" are always on their lips. The intent of these and similar constructs is to ease things when dealing with data of an unknown type. Such constructs, however, generally require building a switch block which checks some enumeration to figure out how to handle the data within the rest of the program. Such code is just downright annoying.
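To make the complaint concrete, here is a minimal sketch of that pattern. The `Variant` struct and `describe()` function are hypothetical stand-ins, not the actual QVariant or boost::any interfaces.

```cpp
#include <cassert>
#include <string>

// Hypothetical catch-all value, standing in for something like QVariant or
// boost::any. The names here are illustrative and not from either library.
struct Variant {
    enum Type { Int, Text } type;
    long long number;   // meaningful when type == Int
    std::string text;   // meaningful when type == Text
};

// Every consumer of the value ends up writing a switch like this one.
std::string describe(const Variant &v) {
    switch (v.type) {
        case Variant::Int:  return "int:" + std::to_string(v.number);
        case Variant::Text: return "text:" + v.text;
    }
    return "unknown";
}

// Small usage check.
void variant_demo() {
    assert(describe(Variant{Variant::Int, 42, ""}) == "int:42");
    assert(describe(Variant{Variant::Text, 0, "hello"}) == "text:hello");
}
```

Each new stored type means another enumerator and another case in every such switch, which is where the annoyance comes from.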

At work some time back, thanks to a lot of the new features C++-201x has been adding, we've been able to build a database access library which can handle data without any of these old kludges. Essentially, database access for us in C++ is now just as easy as it is in PHP (or perhaps even easier!).

Now database communication is great, but there's still the issue of data interchange between two programs which aren't using a database as an intermediary. Many soft typed or functional languages have a simple encode() or decode() function: pass it any object, and you get a nice string representation of it which can be sent off, or saved to a file for later. C++ and related languages always had the nightmare of needing to iterate manually over every data type, or over a hierarchy, to work with something like XML or similar data formats.

There are those that have created workarounds, of course, such as adding a serialize() function individually for every type you have to work with. Or creating some serializable objects that one copies data to or from, and which handle all the serialization work internally. Or, one of my personal favorites, writing a separate parser which can read a description of a format, and generate the C++ objects and code needed to serialize or deserialize it.

Well, today a coworker and I were putting our heads together on how to deal with a certain project. I wrote code some time back which can serialize/deserialize to and from an std::map which contains numbers, strings, or a mix thereof. We were using this data interchange format between two programs. However, now we need to deal with much more complex data, and a series of key/value pairs just won't cover it. One end of the equation is C++, the other end is a soft typed language which could pretty easily work with whatever we came up with.
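For a sense of what such flat key/value interchange can look like, here is a minimal sketch. The length-prefixed wire format and every name in it (`StringMap`, `encode_field`, `serialize_map`, and so on) are assumptions made for illustration, not the format we actually used. Numbers are assumed to be stored as their decimal text.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

typedef std::map<std::string, std::string> StringMap;

// Writes "<length>:<bytes>", so a field may contain any character.
std::string encode_field(const std::string &s) {
    return std::to_string(s.size()) + ":" + s;
}

// Reads one field starting at pos and advances pos past it.
std::string decode_field(const std::string &in, std::size_t &pos) {
    const std::size_t colon = in.find(':', pos);
    if (colon == std::string::npos) throw std::runtime_error("missing length prefix");
    const std::size_t len = std::stoul(in.substr(pos, colon - pos));
    if (len > in.size() - (colon + 1)) throw std::runtime_error("truncated field");
    pos = colon + 1 + len;
    return in.substr(colon + 1, len);
}

// Keys and values are written back to back, in the map's sorted key order.
std::string serialize_map(const StringMap &m) {
    std::string out;
    for (const auto &kv : m) out += encode_field(kv.first) + encode_field(kv.second);
    return out;
}

// Rebuilds the map; throws if the input is malformed.
StringMap deserialize_map(const std::string &in) {
    StringMap m;
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::string key = decode_field(in, pos);
        m[key] = decode_field(in, pos);
    }
    return m;
}
```

A flat format like this round-trips fine, but it has no way to express a value that is itself a list or another map, which is the limitation described above.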

We first thought about the option of using a classic method such as XML or JSON, using some kind of hierarchical writer from C++, and having the soft typed language just read it directly into an object with one of its built in language features. Then my friend had a brilliant realization. The hierarchy of language containers and their children is recursive, as is any serialization that can encode an arbitrary amount of data stacked in a hierarchy. Then we started discussing if we could make a serialize() function in C++ which could take any C++ type and work, even when not knowing everything about it in advance. It'd be easy for plain old data types, but gets more complicated once we start dealing with containers of those, and containers of containers.

Of course this is where most conversations along these lines end. But then I brought up template meta programming, and some new features C++ is now adding (and already in GCC), and this discussion went on much further than usual, till the point we were talking code. Well, we got into it, and two hours and two hundred lines of code later, we now have a function with the following prototype:

template<typename T>
std::string serialize(const T &t);


It is able to take any type that exists in C++, and well, serialize it. Have some type which contains some types which contains a few more types which contains some other types? It's all serializable with this function. No pre-processing, dynamic code generation, compiler hacks, or clumsy per program hierarchical parsing required. It just works™.
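As a rough illustration only: the following is not the code described in this post, just a sketch of how overloads sharing that prototype's name can recurse through nested standard containers. The JSON-like output format is an assumption, and the sketch only covers arithmetic types, std::string, std::vector, and std::map. User-defined structs would need something more.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <type_traits>
#include <vector>

// Leaf case: any arithmetic type (int, long, double, ...).
template<typename T>
typename std::enable_if<std::is_arithmetic<T>::value, std::string>::type
serialize(const T &t) {
    return std::to_string(t);
}

// Leaf case: strings are quoted, with quotes and backslashes escaped.
inline std::string serialize(const std::string &s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out + "\"";
}

// Declared up front so containers can nest in either order.
template<typename T> std::string serialize(const std::vector<T> &v);
template<typename K, typename V> std::string serialize(const std::map<K, V> &m);

// Container case: recurse into each element.
template<typename T>
std::string serialize(const std::vector<T> &v) {
    std::string out = "[";
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i > 0) out += ',';
        out += serialize(v[i]);
    }
    return out + "]";
}

// Container case: recurse into each key and each value.
template<typename K, typename V>
std::string serialize(const std::map<K, V> &m) {
    std::string out = "{";
    bool first = true;
    for (const auto &kv : m) {
        if (!first) out += ',';
        first = false;
        out += serialize(kv.first) + ":" + serialize(kv.second);
    }
    return out + "}";
}

// Small usage check with a container of containers.
void serialize_demo() {
    std::map<std::string, std::vector<int>> data;
    data["a"] = {1, 2};
    data["b"] = std::vector<int>();
    assert(serialize(data) == "{\"a\":[1,2],\"b\":[]}");
}
```

The compiler picks the right overload at every level of nesting, so a map of vectors of maps needs no extra code beyond what is shown.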

Now next week, we'll have to write the deserializer function to pull that magic in reverse. Using the same idea, it shouldn't be a problem. If the data matches the supplied structure, parse it in, otherwise, throw an error. But currently, our project is done, as we are now able to have our C++ applications send very complex data to our soft typed languages rather easily.
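The "parse it in if it matches, otherwise throw" idea could be sketched along the following lines. This is a guess at the shape, not our deserializer: the `Parser` struct, the `expect()` helper, and the bracketed list format are all assumptions, and only int and std::vector are covered.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Primary template left undefined: unsupported types fail to compile.
template<typename T> struct Parser;

// Consumes the character c at pos, or throws if the input has something else.
inline void expect(const std::string &in, std::size_t &pos, char c) {
    if (pos >= in.size() || in[pos] != c)
        throw std::runtime_error(std::string("expected '") + c + "'");
    ++pos;
}

// Leaf case: an optionally negative decimal integer.
template<> struct Parser<int> {
    static int parse(const std::string &in, std::size_t &pos) {
        const std::size_t start = pos;
        if (pos < in.size() && in[pos] == '-') ++pos;
        const std::size_t digits = pos;
        while (pos < in.size() && std::isdigit(static_cast<unsigned char>(in[pos]))) ++pos;
        if (pos == digits) throw std::runtime_error("expected an integer");
        return std::stoi(in.substr(start, pos - start));
    }
};

// Container case: "[elem,elem,...]", recursing through Parser<T> per element.
template<typename T> struct Parser<std::vector<T>> {
    static std::vector<T> parse(const std::string &in, std::size_t &pos) {
        expect(in, pos, '[');
        std::vector<T> out;
        if (pos < in.size() && in[pos] == ']') { ++pos; return out; }
        while (true) {
            out.push_back(Parser<T>::parse(in, pos));
            if (pos < in.size() && in[pos] == ',') { ++pos; continue; }
            expect(in, pos, ']');
            return out;
        }
    }
};

// The requested type T is the "supplied structure" the data must match.
template<typename T>
T deserialize(const std::string &in) {
    std::size_t pos = 0;
    T value = Parser<T>::parse(in, pos);
    if (pos != in.size()) throw std::runtime_error("trailing data");
    return value;
}

// Small usage check with nested lists.
void deserialize_demo() {
    const std::vector<std::vector<int>> expected = {{1, 2}, {}, {-3}};
    assert(deserialize<std::vector<std::vector<int>>>("[[1,2],[],[-3]]") == expected);
}
```

A struct with partial specializations is used here because function templates cannot be partially specialized, and the return type alone cannot drive overload resolution.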

Looking over the code with my coworker, it all seems extremely obvious. Why the heck didn't we think of this 20 years ago? Now am I getting all excited over something that has been done before? Anyone familiar with anything like this?

Question is, what to do now that we know this? File for a patent? (Yes, I'm evil!) Or perhaps ignore this, as no one cares about this topic anyway?

14 comments:

brokenn said...

Sounds very interesting indeed! Any clues as to which C++-201x features turned out to be useful for this? I've played around with similar serialisation problems a lot, but never moved over to C++0x for a solution.

Patuti said...

What to do with that?
Share it!

insane coder said...

Hi brokenn.

Aside from the various C++-201x features I already mentioned for this post, there's nothing really more I can say right now.

Roman Perepelitsa said...

Are you saying the serialize function can be used like this?

#include <string>
// Assuming it's in serialize.h.
#include "serialize.h"

// Not a POD.
struct Foo {
    std::string val;
};

int main() {
    Foo foo = {"hello"};
    serialize(foo);
}

I'm pretty sure there is no way to make this work even in C++0x.

Joe P. said...

Roman Perepelitsa, seems you're missing the crucial point here. Seems clear that it's possible. It might need a bit of help, but it looks like not nearly as much help as you seem to think. In fact, most C++ non-PODs can be serialized without extra help.

insane coder said...

Hi Joe P.

True, true. Although there is an important point not to forget. If a class restricts access to its members and provides no way to get them, nothing can be done.

Although you could make the serializer a friend function. A bit of extra work, but not much, as you described.
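A minimal sketch of the friend-function route mentioned above, assuming a hypothetical `Account` class and an assumed JSON-like output:

```cpp
#include <cassert>
#include <string>

class Account {
public:
    Account(const std::string &owner, int balance)
        : owner_(owner), balance_(balance) {}

private:
    std::string owner_;
    int balance_;

    // The one line of "extra work": grant the serializer access.
    friend std::string serialize(const Account &a);
};

// As a friend, this free function can read the private members directly.
std::string serialize(const Account &a) {
    return "{\"owner\":\"" + a.owner_ + "\",\"balance\":" +
           std::to_string(a.balance_) + "}";
}

// Small usage check.
void friend_demo() {
    assert(serialize(Account("ann", 5)) == "{\"owner\":\"ann\",\"balance\":5}");
}
```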

LB said...

Does it employ variadic templates?

Germán said...

Seriously, if this is like you describe, it's the most important discovery in C++ in years, and... I would like you to share it with all :-)
This would save lots of repetitive coding.
How do you access the members if you don't know where they are inside a class? I think that's impossible, but I'm not sure now.

Germán said...

You got me intrigued, but I'm sure it's impossible to know the structure of a c++ type without some help.
What I came up with is that you can use tuples to describe the structure of a type. But I don't know where the c++0x features fit. Pleeeeeaseeeee... tell us! :-)
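One way the tuple idea could look, as a sketch only: the `as_tuple()` member, the `field_to_string()` and `append_fields()` helpers, and the parenthesised output are all hypothetical, and nothing here is claimed to be the post author's method. The struct lists its members once, and a recursive template walks the tuple.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <type_traits>

// Leaf formatters for the field types used below.
inline std::string field_to_string(int v) { return std::to_string(v); }
inline std::string field_to_string(const std::string &v) { return "\"" + v + "\""; }

// Recursion end: index I has run past the last tuple element.
template<std::size_t I, typename Tuple>
typename std::enable_if<I == std::tuple_size<Tuple>::value>::type
append_fields(const Tuple &, std::string &) {}

// Recursion step: emit element I, then move on to I + 1.
template<std::size_t I, typename Tuple>
typename std::enable_if<(I < std::tuple_size<Tuple>::value)>::type
append_fields(const Tuple &t, std::string &out) {
    if (I > 0) out += ',';
    out += field_to_string(std::get<I>(t));
    append_fields<I + 1>(t, out);
}

struct Person {
    std::string name;
    int age;

    // The one piece of per-type help: describe the structure as a tuple.
    std::tuple<const std::string &, const int &> as_tuple() const {
        return std::tie(name, age);
    }
};

// Works for any type that provides as_tuple().
template<typename T>
std::string serialize_struct(const T &obj) {
    std::string out = "(";
    const auto fields = obj.as_tuple();
    append_fields<0>(fields, out);
    return out + ")";
}

// Small usage check.
void tuple_demo() {
    assert(serialize_struct(Person{"Ann", 30}) == "(\"Ann\",30)");
}
```

This still registers every field, just in a single line per struct, so it does not answer whether fully automatic member discovery is possible.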

蕭刻庭 said...

Interesting... O_O really

tapted said...

I was playing with this a while ago. See

http://code.apted.net/project/apted-pub/browser/sera/trunk

Serialising arbitrary structs was the main challenge. I didn't want to interfere with the class definitions. There were basically two options (check out ser_test.cpp). With,

struct XY {int x; int y;};

there is a macro SERIALISE_AS_POD(XY);, which would just serialise it as plain old data. Or a 'serialiser'. E.g. for:

class XS {
    SERIALISABLE;
    int x;
    std::string s;
    std::vector<std::string> v;
public:
    XS(int xx = 0) : x(xx), s(x, 'x'), v(3, s) {}
};

You would define:

SERIALISER(XS) {
    SERIALISE_OPT(x, &init_int);
    SERIALISE_OPT(s, "default string");
    SERIALISE(v); // empty vectors are perfectly valid!
}


SERIALISE_OPT makes the data members optional in the serialised format. And it supports maps of strings of vectors of structs of maps of strings, etc.

"serialise" then converts any serialisable data into a map of strings->byte[]

The SQL stuff was also a part of it. I basically wanted to convert a struct-"schema" defined this way into an SQL schema.. but it's still a work-in-progress (that hasn't progressed for 17 months)...

蕭刻庭 said...

Recursively serializing a structure is not a new idea; boost::serialization is a perfect example.

And serializing everything with only one function is not a new idea either. boost::fusion has an example that serializes everything with only one function. (Well, with a little help actually, but no tricky compiler hack, no ugly MACRO, only a typedef.)

zbigg said...

I also think that this guy just invented something that "lies" at the roots of boost serialization.

(Also this doesn't need C++0x, it works with C++98).

I've been doing this for 2 years now ... it works for XML, ASN.1/BER, JSON and GUI generation. Nothing particularly new.

The only things you need are:
* a serializer class with specialized member functions for each type (or type family like std::vector)
* a visitor function that calls the serializer for each member
* some ugly traits/macrology to distinguish between "aggregates" (structs) and leaf types

Well, example - not full here: test_structure_printer.cpp.
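The three pieces listed above might fit together roughly like this. It is a sketch under assumptions, not the contents of test_structure_printer.cpp: `TextSerializer`, `is_visitable`, `visit_members`, and the sample structs are hypothetical names, and explicit trait specializations stand in for the macros. It is written against C++11 for brevity, although the same structure is expressible in C++98.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

struct Point { int x; int y; };
struct Shape { std::string name; std::vector<Point> points; };

// Traits: mark which types are aggregates (structs) rather than leaves.
template<typename T> struct is_visitable : std::false_type {};
template<> struct is_visitable<Point> : std::true_type {};
template<> struct is_visitable<Shape> : std::true_type {};

// Visitor functions: hand each member to whatever visitor is passed in.
template<typename V> void visit_members(V &v, const Point &p) {
    v.member("x", p.x);
    v.member("y", p.y);
}
template<typename V> void visit_members(V &v, const Shape &s) {
    v.member("name", s.name);
    v.member("points", s.points);
}

// Serializer: one leaf() overload per leaf type or type family.
class TextSerializer {
public:
    std::string out;

    template<typename T> void member(const char *name, const T &value) {
        if (!out.empty() && out.back() != '{') out += ',';
        out += name;
        out += '=';
        write(value);
    }

    // Dispatch on the trait: aggregates get visited, leaves get written.
    template<typename T> void write(const T &value) {
        write_impl(value, typename is_visitable<T>::type());
    }

private:
    template<typename T> void write_impl(const T &value, std::true_type) {
        out += '{';
        visit_members(*this, value);
        out += '}';
    }
    template<typename T> void write_impl(const T &value, std::false_type) {
        leaf(value);
    }

    void leaf(int v) { out += std::to_string(v); }
    void leaf(const std::string &v) { out += '"' + v + '"'; }
    template<typename T> void leaf(const std::vector<T> &v) {
        out += '[';
        for (std::size_t i = 0; i < v.size(); ++i) {
            if (i > 0) out += ',';
            write(v[i]);
        }
        out += ']';
    }
};

// Small usage check.
void visitor_demo() {
    TextSerializer s;
    const Shape shape{"tri", {Point{0, 0}, Point{1, 2}}};
    s.write(shape);
    assert(s.out == "{name=\"tri\",points=[{x=0,y=0},{x=1,y=2}]}");
}
```

Because the visitor functions are templated on the visitor, a different serializer class (for XML, say) could reuse the same `visit_members` definitions.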

Germán said...

"It is able to take any type that exists in C++, and well, serialize it. Have some type which contains some types which contains a few more types which contains some other types? It's all serializable with this function. No pre-processing, dynamic code generation, compiler hacks, or clumsy per program hierarchical parsing required. It just works™."

Do you mean without registering any single struct field? Is that the way you did it?