CVu Journal Vol 8, #2 - Apr 1996 + Programming Topics

Title: String Theory

Author: Administrator

Date: Wed, 03 April 1996 13:15:27 +01:00

In the previous article I looked at the use of string literals for initialising arrays of char. These literals do not have any existence as detectable and measurable runtime entities. However, there are clearly times when literals do play a role as actual runtime objects, as demonstrated by the familiar

#include <stdio.h>

int main(void) 
{
  puts("hello world");
  return 0;
}

Here the literal is clearly being used as a runtime object. So how does this differ from, say

#include <stdio.h>

int main(void) 
{
  char message[] = "hello world";
  puts(message);
  return 0;
}

Welcome to the machine

To appreciate the subtleties of C strings, it helps to understand a little about the underlying C view of the world, in particular storage and duration.

When used as an aggregate initialiser, i.e. when initialising an array of char, the string literal is merely a convenient shorthand for the longer, non-text-oriented aggregate initialiser syntax. As I illustrated last time, the following

char message[] = "hello";

is equivalent to

char message[6] = {'h', 'e', 'l', 'l', 'o', '\0'};
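The equivalence can be checked directly. The following is a minimal sketch; the function name same_initialiser and the two local array names are purely illustrative and do not come from the original article.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the two initialiser forms
   produce arrays of the same size and content. */
int same_initialiser(void)
{
  char from_literal[] = "hello";                       /* size deduced as 6 */
  char from_list[6] = {'h', 'e', 'l', 'l', 'o', '\0'}; /* explicit form */

  return sizeof from_literal == sizeof from_list
      && memcmp(from_literal, from_list, sizeof from_list) == 0;
}
```

Note that the terminating '\0' counts towards the size, which is why five visible characters need six elements.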

The literal has no meaningful existence at runtime, and the message object is declared as having the default storage class for its scope. Within a function this means that it is an auto variable, i.e. it lives on the stack[1]. Every time that function is entered, enough space for the message variable and any other locals is pushed onto the stack and initialisation occurs. On exiting the function the space is popped off the stack as control returns to the caller: any pointers to this space are now invalid. For example, consider the following relatively common coding error:

const char *bool_name(int truth) 
{
  const char false_name[] = "false";
  const char true_name[] = "true";
  return truth ? true_name : false_name;
}

The two variables declared, and hence all their elements, are defined on the stack. The function returns a pointer to an area of the stack that is no longer valid. Technically, the behaviour resulting from any use of the return value is undefined. The oft-quoted Chinese curse "may you live in interesting times" is a good plain language translation of the standardese terminology!

The following code, however, is just what is intended:

const char *bool_name(int truth) 
{
  const char *false_name = "false";
  const char *true_name = "true";
  return truth ? true_name : false_name;
}

In this case the variables are pointers not arrays, and they point to the literals themselves. In answer to the question posed in the title of the article: "this" is an anonymous array of char with static duration.
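As a rough illustration of why the pointer version is safe, the sketch below repeats bool_name and adds a small check that the returned pointers remain usable after the function has returned. The name names_survive_return is hypothetical, introduced only for this example.

```c
#include <string.h>

const char *bool_name(int truth)
{
  const char *false_name = "false";
  const char *true_name = "true";
  return truth ? true_name : false_name;
}

/* Hypothetical check: both results are still readable after
   bool_name has returned, because they point at the literals
   rather than at locals on the stack. */
int names_survive_return(void)
{
  const char *yes = bool_name(1);
  const char *no = bool_name(0); /* a second call does not disturb yes */
  return strcmp(yes, "true") == 0 && strcmp(no, "false") == 0;
}
```

Only the two pointer variables are locals here; what they point to lives outside the function's stack frame.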

Static

The string literal survives for the duration of the program, like any other static variable. Also, like a variable declared static at file scope, it exports no name to the linker; strictly speaking it has no name and therefore no linkage at all. It is truly anonymous, so there is no way to refer to the storage used by name in a portable manner.

The compiler is entitled to optimise use of storage by providing only one definition for a string literal that is repeated within a single translation unit, i.e. a conventional C or C++ source file. For example, the repeated string "%s%s\n" in the following code need not imply separate implementation arrays for every occurrence of the literal:

#include <stdio.h>

void prefixed(const char *base, const char *prefix) 
{
  printf("%s%s\n", prefix, base);
}

void suffixed(const char *base, const char *suffix) 
{
  printf("%s%s\n", base, suffix);
}

Many compilers offer an option to merge duplicate strings. A compiler using merged strings can be detected easily using

int unique = "hello" != "hello";

This compares the pointers, not the contents; to compare contents, use the strcmp library function. There is not a lot you can, or should, do with this information, but it might be interesting to know.
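The distinction between the two kinds of comparison can be sketched as follows. The helper names same_text and same_storage are assumptions made for this illustration only.

```c
#include <string.h>

/* Content comparison: 1 whenever the two strings hold the same text,
   whether or not the compiler merged them. */
int same_text(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}

/* Pointer comparison: 1 only if both arguments refer to the
   same storage. */
int same_storage(const char *a, const char *b)
{
  return a == b;
}
```

Called as same_text("hello", "hello") the result is 1 on any implementation, whereas the result of same_storage("hello", "hello") depends on whether the compiler has merged the two literals.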

Given that the strings themselves effectively refer to storage, we can simplify our original example:

const char *bool_name(int truth) 
{
  return truth ? "true" : "false";
}

It is as if the compiler generated something like the following:

static const char __false[] = "false";
static const char __true[] = "true";
const char *bool_name(int truth) 
{
  return truth ? __true : __false;
}

Look but don't touch

Alas, history tends to always interfere with simplicity. Although a string literal is intuitively read-only, i.e. const, strings existed in C before the idea of const-ness [I looked at this issue in a little more detail in "Literally yours", Overload 11]. Thus there is a lot of legacy code of the form

char *message = "hello";

If string literals were made const overnight a remarkable amount of code would break, so the declared type remains non-const. However, just to confuse the issue, this does not mean that you are entitled to modify the string. Although the following will compile, the runtime behaviour is undefined:

"help"[3] = 'l';

It is as if the string is itself const, but the compiler casts away this const-ness during compilation. The compiler is entitled to place strings in write-protected memory (such as the program's code segment on some systems), and some do. Overwriting a string literal might cause your application to fall over.

Although that is an obvious incentive to always treat literals as const, we can see that it also makes sense intuitively. Given that literals are static, once changed they would not simply revert to their former content next time through a function. That duplicate strings might have been optimised together merely adds to the list of possible sources of bugs. The true read-only nature of literals should be honoured, and company coding standards as well as personal practice should enforce this:

char *no_no = "don't do this"; /* disallow */
const char *fine = "do this"; /* OK */
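Where a string genuinely needs to be modified, one approach consistent with the above is to initialise an array from the literal and change the array instead. This is a minimal sketch; the function name edit_copy and the variable name word are illustrative only.

```c
#include <string.h>

/* Hypothetical example: edits a writable copy rather than the literal.
   Returns 1 if the copy now reads "hell". */
int edit_copy(void)
{
  char word[] = "help"; /* a fresh array initialised from the literal */
  word[3] = 'l';        /* well defined: modifies the array, not the literal */
  return strcmp(word, "hell") == 0;
}
```

Because word is an auto array, it is initialised afresh on every call, so the modification never leaks into later calls or into other uses of the same literal.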


[1] Some compilers and compile-time checking tools will detect and warn you about non-const usage.
