Abstract Machine

2020-10-18

Michael Schupikov

The first presentation at CppCon 2020 is from Bob Steagall. He gave an overview of the abstract machine in terms of which C++ is defined. Here is my summary of the presentation, peppered with some complementary notes.

Definition

When we write code, we typically do not target a specific operating system or hardware. Instead, we target an abstraction of them that is described by the language specification. The corresponding abstract machine is defined in §4.1.2.

C++ Specification, §4.1.2/1.

The semantic descriptions in this document define a parameterized nondeterministic abstract machine. This document places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.

Implementations such as gcc or clang translate our code for a physical machine. The behavior of the abstract machine and the physical machine only has to match in what the specification calls observable behavior: accesses through volatile glvalues, the data written to files by program termination, and the input and output dynamics of interactive devices. Closely related are side effects, which consist of reading a volatile object, modifying an object, calling a library function that performs I/O, or calling a function that performs any of these operations.

The abstract machine is parameterized through implementation-defined behavior. It is nondeterministic due to unspecified behavior. A program is well-formed if it follows the syntax rules, the diagnosable semantic rules and the one definition rule (see note 1).

Behavior

Where behavior is not exactly specified, it is categorized as implementation-defined, unspecified or undefined. A program can also be ill-formed.

Implementation Defined

The behavior is not exactly specified by the language, but every implementation needs to document its choice. One example is sizeof(void*), the value of which depends on the platform. Another is the exact message returned by std::bad_alloc::what().
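As a small sketch of the two examples above, the following helpers expose the values an implementation picked. The names pointer_bits, bad_alloc_message and check_portable_guarantees are made up for this illustration and do not come from the talk. Only the asserted properties hold on every implementation; the concrete values differ between platforms.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <new>
#include <string>

// Width of an object pointer in bits. Both sizeof(void*) and CHAR_BIT are
// implementation-defined; the specification only guarantees CHAR_BIT >= 8.
constexpr std::size_t pointer_bits()
{
    return sizeof(void*) * CHAR_BIT;
}

// Message reported by std::bad_alloc. Its wording is chosen by the
// standard library implementation.
std::string bad_alloc_message()
{
    return std::bad_alloc{}.what();
}

// Hypothetical helper: asserts only what is true on every platform.
void check_portable_guarantees()
{
    assert(pointer_bits() >= 8);
    assert(pointer_bits() % CHAR_BIT == 0);
}
```

Printing pointer_bits() and bad_alloc_message() on two different toolchains is a quick way to see implementation-defined behavior in practice.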

Unspecified Behavior

Unspecified behavior depends on the implementation, which does not need to document its choice. One example is the evaluation order of function arguments. Another is whether identical string literals are stored in one place or individually. Finally, the order, contiguity and initial value of storage allocated by successive allocation requests are also unspecified.
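The evaluation order of function arguments can be made visible with a small experiment. This is a sketch with hypothetical names (call_order, first, second, add_logged, check_evaluation_order). It records which argument expression runs first and accepts both outcomes, since a correct program must not rely on either.

```cpp
#include <cassert>
#include <string>

std::string call_order;  // hypothetical log of the calls made

int first()  { call_order += 'a'; return 1; }
int second() { call_order += 'b'; return 2; }

int add(int x, int y) { return x + y; }

int add_logged()
{
    call_order.clear();
    // Whether first() or second() runs first is unspecified.
    return add(first(), second());
}

// Hypothetical helper: the result is fixed, the order is not.
void check_evaluation_order()
{
    assert(add_logged() == 3);
    assert(call_order == "ab" || call_order == "ba");
}
```

The sum is the same either way. Only the log differs, which is why code with side effects in several arguments of one call is fragile.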

Undefined Behavior

For undefined behavior, the specification imposes no requirements on what the program does. No diagnostic is required. Examples include dereferencing a null pointer and signed integer overflow.
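One common way to stay clear of signed integer overflow is to test the operands before adding them. The function below is a minimal sketch of that idea. The name checked_add is an assumption made for this example and is not taken from the talk or from the standard library.

```cpp
#include <limits>
#include <optional>

// Returns a + b, or std::nullopt if the sum does not fit into an int.
// The range check runs before the addition, so the overflowing
// addition is never evaluated.
constexpr std::optional<int> checked_add(int a, int b)
{
    constexpr int max = std::numeric_limits<int>::max();
    constexpr int min = std::numeric_limits<int>::min();

    if (b > 0 && a > max - b) return std::nullopt;  // would exceed max
    if (b < 0 && a < min - b) return std::nullopt;  // would fall below min
    return a + b;
}

static_assert(checked_add(1, 2) == 3);
static_assert(!checked_add(std::numeric_limits<int>::max(), 1));
```

Because the function is constexpr, the static_assert lines double as a compile-time test: a constant expression is not allowed to contain undefined behavior, so an overflow inside checked_add would be rejected by the compiler.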

Ill-Formed

A program that violates the syntax rules or a diagnosable semantic rule is ill-formed, and the implementation must issue a diagnostic. There is a sub-category called ill-formed, no diagnostic required (IFNDR). It covers semantic errors that cannot reasonably be diagnosed at compile time. One example is a constructor that directly or indirectly delegates to itself. Another is a mismatch of [[noreturn]] attributes on declarations of the same function in different translation units. Finally, some violations of the one definition rule fall into this sub-category.

Structure

The structure of the abstract machine roughly consists of its memory, threads and expressions.

Memory

The memory consists of a single, flat address space. According to the specification, all memory is always reachable. The abstract machine has no concept of a stack, registers or caches, although the specification does mention stack unwinding in the context of exceptions. External memory, such as on a GPU or a coprocessor, is not defined either.

Memory is composed of bytes and every byte has an address.

Objects

Memory is organized in objects. Each object has the following properties.

Type

The object’s type, such as int or a user-defined class.

Value

The object’s value, such as 0x01 or the values of the data members of a class.

Name

The name of the object. It is optional, as temporary objects do not have a name.

Location

The address of the object’s first byte. It is optional, as temporaries do not have an address.

Size

The value returned by sizeof().

Alignment

The value returned by alignof().
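The size, alignment and location properties can be inspected directly. The following sketch uses a hypothetical type Sample and hypothetical helpers is_properly_aligned and check_sample_location. It also assumes that the implementation provides std::uintptr_t and converts addresses to integers in the usual linear way, which the specification does not require.

```cpp
#include <cassert>
#include <cstdint>

struct Sample  // hypothetical type
{
    char tag;
    int  count;
};

// Size and alignment are compile-time properties of the type.
static_assert(sizeof(Sample) >= sizeof(char) + sizeof(int));
static_assert(alignof(Sample) >= alignof(int));
static_assert(sizeof(Sample) % alignof(Sample) == 0);

// The location is the address of the object's first byte. On mainstream
// platforms its integer value is a multiple of the type's alignment.
bool is_properly_aligned(const Sample& object)
{
    const auto address = reinterpret_cast<std::uintptr_t>(&object);
    return address % alignof(Sample) == 0;
}

// Hypothetical helper that checks one named object.
void check_sample_location()
{
    Sample sample{'x', 1};
    assert(is_properly_aligned(sample));
}
```

On many platforms sizeof(Sample) is larger than the sum of its members, because padding is inserted after tag to keep count aligned.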

Storage Duration

Automatic	Storage local to the current block. The block might be an anonymous code block or a function body.
Dynamic	Storage explicitly allocated using `new` and deallocated using `delete`.
Static	Storage for global objects. This also includes objects declared with `static` or `extern`. Only one object with a given name is allowed in the static storage. Static storage is allocated before `main()` and deallocated after its execution.
Thread	Storage bound to a thread via `thread_local`. Every thread has its own copy of the object.
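The four storage durations can be seen side by side in a short sketch. All names below (per_thread, next_id, doubled_on_heap, bump_on_new_thread) are invented for this illustration. Depending on the toolchain, linking code that uses std::thread may need an extra option such as -pthread.

```cpp
#include <cassert>
#include <thread>

thread_local int per_thread = 0;  // thread storage: one object per thread

int next_id()
{
    static int last_id = 0;  // static storage: survives between calls
    int step = 1;            // automatic storage: recreated on every call
    last_id += step;
    return last_id;
}

int doubled_on_heap(int value)
{
    int* slot = new int{value};  // dynamic storage: lives until delete
    const int result = *slot * 2;
    delete slot;
    return result;
}

// Increments per_thread on a new thread and returns the value seen there.
int bump_on_new_thread()
{
    int seen = 0;
    std::thread worker([&seen] {
        ++per_thread;             // the worker's own copy, which started at 0
        assert(per_thread == 1);
        seen = per_thread;
    });
    worker.join();
    return seen;
}
```

Calling next_id() twice yields 1 and then 2, because last_id keeps its value. bump_on_new_thread() returns 1 no matter what the calling thread stored in its own per_thread.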

Lifetime

The lifetime of an object begins once its storage has been obtained and its initialization is complete. It ends with the object’s destruction, or when its storage is released or reused by another object that is not nested within the original one.
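For automatic objects, the beginning and end of the lifetime can be observed with a counter. This is a minimal sketch. The counter alive, the type Tracked and the function alive_inside_block are hypothetical names introduced for the example.

```cpp
#include <cassert>

int alive = 0;  // hypothetical count of Tracked objects within their lifetime

struct Tracked
{
    Tracked()  { ++alive; }  // lifetime begins when initialization completes
    ~Tracked() { --alive; }  // lifetime ends when the destructor call starts
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
};

int alive_inside_block()
{
    Tracked outer;
    {
        Tracked inner;       // destroyed at the closing brace of this block
        assert(alive == 2);
    }
    return alive;            // only outer is still within its lifetime
}
```

The function returns 1, and once it has returned the counter is back at 0, because outer is destroyed when the function body is left.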

The specification allows a pointer to point one past the last element of an array. Such a pointer can be compared with other pointers into the same array. However, dereferencing it is not allowed.
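The classic use of a one-past-the-end pointer is as the end marker of a half-open range. The sketch below uses the hypothetical functions sum and sum_of_sample. The end pointer is only ever compared, never dereferenced.

```cpp
#include <cassert>

// Adds up the elements in the half-open range [first, last).
int sum(const int* first, const int* last)
{
    assert(first <= last);  // comparing pointers into the same array is fine

    int total = 0;
    for (const int* it = first; it != last; ++it)
        total += *it;       // it never equals last here, so this is safe
    return total;
}

int sum_of_sample()
{
    const int values[] = {1, 2, 3};
    const int* end = values + 3;  // one past the last element: valid to form
    return sum(values, end);
}
```

Forming values + 4 would already be undefined behavior, even without dereferencing it.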

Lifetime gap in the specification.

Note an interesting gap in the specification. If you receive an object as raw bytes, accessing it through a pointer obtained via reinterpret_cast<>() is undefined behavior, because no object of that type has ever been created in that storage. Another example is the dynamic construction of arrays.

To close the gap, std::start_lifetime_as<>() is expected in the next standard. Unfortunately, it did not make it into C++20 due to time constraints.
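Until such a facility is available, a widely used workaround is to copy the bytes into an object that already exists. This is a different technique from std::start_lifetime_as<>() and was not part of the talk summary. The type Header and the functions read_header and round_trip are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct Header  // hypothetical wire format
{
    std::uint16_t kind;
    std::uint16_t length;
};

// std::memcpy is only meaningful for trivially copyable types.
static_assert(std::is_trivially_copyable_v<Header>);

// Copies the raw bytes into an object whose lifetime has already begun,
// instead of pretending that an object lives inside the byte buffer.
Header read_header(const unsigned char* bytes)
{
    assert(bytes != nullptr);

    Header header{};
    std::memcpy(&header, bytes, sizeof header);
    return header;
}

// Round trip: write a header into a byte buffer and read it back.
Header round_trip(Header original)
{
    unsigned char buffer[sizeof(Header)];
    std::memcpy(buffer, &original, sizeof original);
    return read_header(buffer);
}
```

The copy costs little in practice, since compilers commonly optimize a small fixed-size std::memcpy into plain loads and stores.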

Threads

A thread is a single flow of execution. It starts with the invocation of its top-level function, which is executed by the new thread and not by the thread that created it. The thread then recursively includes every function called from the top-level one.

According to the abstract machine, every thread has access to all memory.

Every program has at least one thread with main() as its entry point. Implementations must accept at least the signatures int main() and int main(int, char**). main() cannot be overloaded and it cannot be a coroutine. The program cannot call it explicitly or define a global variable named main. main() cannot be deleted, static, inline or constexpr. Its linkage is implementation-defined, and it cannot be declared with a linkage specification.

Expression

An expression has a type and a value category. The categories can be represented as the following tree.

expression ──┬──> glvalue ──┬─> lvalue
             │              │
             │              └─> xvalue
             │              ┌─>  -"-
             │              │
             └──> rvalue ───┴─> prvalue

An rvalue has no name and its address cannot be taken. Temporary objects are one example. A glvalue is a generalized lvalue, that is, an expression whose evaluation determines the identity of an object or function. An lvalue has a name and its address can be retrieved. A prvalue initializes an object or computes the value of an operand, depending on its context. Examples are literals and calls to functions returning non-reference types. An xvalue is a glvalue that can be moved from. It can initialize an rvalue reference. An example is a call to a function returning an rvalue reference, such as std::move().

A rough way to tell whether something is an rvalue or an lvalue is to look for a name. If it has one, it is an lvalue. If it has none, it is an rvalue. To be sure, the type traits std::is_rvalue_reference<> and std::is_lvalue_reference<> can be used. The following example provides a short demonstration.

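#include <type_traits>

// T&& is a forwarding reference: T is deduced as S& for an lvalue
// argument and as S for an rvalue argument.
template <typename T>
constexpr std::is_lvalue_reference<T&&> is_lvalue(T&&) { return {}; }

template <typename T>
constexpr std::is_rvalue_reference<T&&> is_rvalue(T&&) { return {}; }

auto main() -> int
{
  struct S{} s;

  static_assert(is_lvalue(s));    // s has a name: lvalue
  static_assert(is_rvalue(S{}));  // temporary without a name: rvalue
}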
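To tell all three leaf categories of the tree apart, decltype with an extra pair of parentheses can be used: it yields T& for an lvalue, T&& for an xvalue and plain T for a prvalue. The sketch below builds on that rule. The names Category, category_of, VALUE_CATEGORY and sample_value are made up for this example.

```cpp
#include <utility>

enum class Category { lvalue, xvalue, prvalue };

// The primary template handles non-reference types; the partial
// specializations pick up the reference types produced by decltype((expr)).
template <typename T> constexpr Category category_of      = Category::prvalue;
template <typename T> constexpr Category category_of<T&>  = Category::lvalue;
template <typename T> constexpr Category category_of<T&&> = Category::xvalue;

// The doubled parentheses make decltype inspect the expression,
// not the declared type of a name.
#define VALUE_CATEGORY(expr) category_of<decltype((expr))>

int sample_value = 0;  // hypothetical variable to classify

static_assert(VALUE_CATEGORY(sample_value) == Category::lvalue);
static_assert(VALUE_CATEGORY(std::move(sample_value)) == Category::xvalue);
static_assert(VALUE_CATEGORY(sample_value + 1) == Category::prvalue);
static_assert(VALUE_CATEGORY(42) == Category::prvalue);
static_assert(VALUE_CATEGORY("text") == Category::lvalue);  // string literals are lvalues
```

The last line shows where the name-based rule of thumb breaks down: a string literal has no name, yet it is an lvalue.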

Conclusion

From my perspective, the presentation provides a great entry point into the language specification. I doubt it would make you a better developer right away. Your code would be as solid as before. Rather, it gives you some context to understand basic terminology such as objects and value categories. The target group consists of developers who are interested in language specifics.

1

The one definition rule is defined in §6.3 of the specification. According to it, no translation unit shall contain more than one definition of any variable, function, class, enumeration, template, or default argument of a function or template.

Some violations must be diagnosed by the compiler. Others, particularly those that span multiple translation units, do not require a diagnostic.