C++ Morsels: Why does C++ distinguish between member and pointer-to-member?

In the bad old days, C was barely more sophisticated than Assembly with fancy macros.

Some would say little has changed. An unattributed quote reminds us that "C combines all the power of assembly language with all the ease of use of assembly language".

There are pages and pages of old C tricks that used to be required to make the dumb and naive compiler generate the code you wanted. Using * to dereference pointers instead of using array [] notation. Is ++i faster than i++? Part of it was performance-based, part of it was just that the compiler was dumb and you had to hold its hand. When C++ came along, the core C language got a big upgrade too, with the compiler being a lot smarter about what was really going on behind the scenes. Type-checking became much more robust when the compiler knew so much more about the context of your code. But one strange vestige has remained all along.

Suppose you have a structure (or object, in C++). The compiler recognizes the distinction between the actual object, and a pointer to the object. Let's see an example (using a C++ compiler):

struct foo
{
int i;
}; // struct foo

foo f; // create an 'auto' variable instance of a foo, called f
foo *p = &f; // create a pointer to a foo, called p, and assign it the address of f

"I see what you did there..."

So, a pointer is essentially an alias that 'points' to the original instance. Modifications made through p will affect the original instance. There are two different notations or operators used for doing this, known as "member" and "pointer to member" or "member by pointer". (Strictly speaking, the C++ standard reserves the name "pointer to member" for the .* and ->* operators; I'm using it loosely here for plain old ->.)

f.i = 12; // set i member to 12 via operator "member"
p->i = 12; // set i member to 12 via operator "member by pointer"

The two different operators were probably necessary in a time when the compiler didn't have full knowledge of the type of the objects f and p. Like an assembler, it just knew a memory location associated with each object. If it found the f.i "operator member" syntax, it would look up the offset to find the i member within the foo structure, add it to the address of the f object, and then write or read to the resulting offset address. If instead it encountered p->i (operator pointer-to-member), it would take the address of p, read the value found there, interpret THAT value as an address, offset it by the distance to find the i member, and read/write a value at that location. So, -> and . performed very different operations.

I believe (I have no legacy compilers to test this with, but perhaps one of my compu-archaeology friends can back me up) that you COULD perfectly legally try to do f->i or p.i. What would happen? Let's think about it like a macro assembler, ahem, compiler. I'm going to define for explanatory purposes that object f is at address 0x0f00 (hexadecimal, which is decimal 3840). We'll define p as being at location 0x1000 (4096 decimal). These numbers don't mean anything; I just want concrete numbers to work with later. For explanatory purposes, I'm going to stipulate that initially the value stored in f.i is 8192 decimal (aka 0x2000 in hex). Since we initialized p to point to the address of f, above, the value stored in p will be the address of f, or 3840 or 0x0f00.

First we'll dissect f->i. The operator pointer-to-member means "read the pointer-sized value located at the address of the object f, offset it by the 'i' distance within the structure, and use that value as a new address to modify the int-sized value found there".

So, we first look at the pointer-sized entity located at f's address (which is 0x0f00 if you recall). Well, f is a structure, and the first item in the structure is 'int i'. Old compilers typically ordered the members of a structure in the order they were listed. Modern compilers don't necessarily have to follow that rule. Is int i the same size as an address? This is unanswerable in the general form, since many 32-bit capable architectures might use one bit-size (like 16 bits) for an int and another (32 bits) for an address, whereas another 16-bit architecture would have them the same size. Let's assume for simplicity we're on a Motorola 680x0 architecture like the Amiga and a few Unix machines, with a 32-bit int and a 32-bit address/pointer size.

The computer will fetch the address-sized value at 0x0f00. What's there? That depends again on the arrangement of struct foo. The first member i is a 32-bit quantity, and we could somewhat coherently treat it like an address. So, the compiler reads it as an address. The value in i was 8192, so this is the address that -> uses as the "base" of the struct foo that it expects to find there. Now, since we're trying to access the i member of a structure AT that address, the compiler will need to add an offset to the address to account for where i falls within the structure. Here, our math is simple. Since i is the first member of the structure, its offset will be 0. So, the computer will now use address 8192/0x2000 as if there were an int i there. If we did f->i = 400 we'd be writing 400 to some random memory location -- bad. If we did somevariable = f->i we'd be committing the lesser sin of reading from a bogus location. Strong memory protection might protect us from both, but it's a bad situation.

What about trying to do the equally incorrect p.i = 31415?

Well, the "operator member" is going to treat the memory at the address of the object p (4096 or 0x1000) as if it were a struct foo itself, and write to an offset member there. The i member has an offset of 0 within the struct foo, so we'll end up writing 31415 straight into the 32 bits of memory at p, overwriting the 3840 / 0x0f00 put there originally. This by itself is not an immediately harmful operation, since the write operation is being done to legit, valid, writable memory that our program "owns" at p. This isn't a fault even with memory protection. However, if we then try to use p properly, by saying p->i = 0, we will now act as if there is a foo structure at the address pointed to by p, 31415 (0x7AB7 if you care) and will write a 0 into that structure. Blam. That was an illegal write and you'll go down for it.

Ok, so we've now established WHY legacy C compilers needed to know how to treat the object in order to perform the correct operation on it -- they didn't have enough back-of-the-napkin type data to know what each object was in order to automatically choose the proper (de)referencing operation. When C++ came along and stuck methods into structures (and called some of them objects), it preserved the exact same situation and restrictions. The first C++ "compilers" (like cfront) were really just pre-processors, so they didn't have the tightly-integrated type database that modern C++ compilers later integrated. Even at that stage, I think the C++ parser knew whether the object in question was a class or a pointer to a class, but I suspect so much effort was put into the functional evolution of the language that no one thought about simplifying this basic notation.

And what of now? Modern compilers know more about your code and its types than you do. They are often free to rearrange members of structures at will to improve either performance or memory utilization. They certainly know whether the object you're operating on is a pointer or a real instance of an object. So the compiler can easily adapt to you using . instead of -> except that it won't.

Why would you want to do away with the distinction? Mostly for code-reuse and re-purposing. Many will tell you about the Copy and Paste Anti-Pattern. But the fact remains that many programmers take example code from a variety of sources, copy it, paste it and modify it to their purposes. Having to tediously search and replace between . and -> through a piece of code when you change the allocation of the object instances is just counter-productive. This is the sort of thing computers, compilers and advanced development tools were designed to negate. To ascend to the summit of sophisticated development, we need to be able to delegate menial tasks to menial digital laborers freeing our minds to comprehend, aspire to and accomplish bigger and more difficult tasks.

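And C++ references? Well, references DO allow you to use the . operator on an "alias"-like entity that refers to the original object, much as a pointer does with ->. But references have their limitations too, some by design: a reference must be bound when it is created, cannot be null, and cannot later be re-pointed at a different object. You can't easily and safely just stick references into existing code.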

The real question at the end of the day is why a modern C++ spec like C++0x doesn't let the programmer stop mentally keeping track of the type of the base object (pointer versus real instance) and have the compiler take care of it instead.

Update: The always-sharp Jeff Frontz points out that the C/C++ compiler specification does guarantee that members of a structure will appear in the order defined. Various padding/packing options can affect the exact offsets within a structure, but they will be in the original order. Thanks Jeff!

Tip: There are two clever 'magic' values mentioned in this article. Guru points to anyone who points out the meaning behind the values I chose for demonstration purposes.
