Ways C compilers break for objects larger than PTRDIFF_MAX bytes (trust-in-soft.com)
74 points by panic on May 20, 2016 | hide | past | favorite | 18 comments


Results from VC++ (VS2015 Update 2):

- 32-bit, demo 1: -268435456

- 32-bit, demo 2: 0

- 64-bit, demo 3:

    warning C4034: sizeof returns 0
    error C2036: 'arrayptr': unknown size


I always figured this (or an equivalent problem with mmap(3)) was what was behind Erlang's DETS filesize limit.


> Strange results are acceptable because the overflow in pointer subtractions is undefined behavior

The first example, with -2130706432, is not strange at all, and there doesn't appear to be any exploitation of UB: that is just 0x81000000 reinterpreted as a signed 32-bit number. It's the following examples that really show the oddness caused by the signedness of the type.

That 64-bit example, however, is IMHO a bug --- I doubt the system he ran it on has anywhere near the 14 exabytes of RAM he asked for, so malloc() succeeding in that case is ridiculous.


> That 64-bit example, however, is IMHO a bug --- I doubt the system he ran it on has anywhere near the 14 exabytes of RAM he asked for, so malloc() succeeding in that case is ridiculous.

Could you elaborate a bit on this?

As far as I understand, malloc will succeed if overcommitting is enabled (AFAIK it is on most systems by default, but don't take my word for it, since that might be just anecdotal). So, until he actually tries to access the memory (which he never actually does), it won't be allocated. Or is this bollocks?


(Note: the following is only applicable to Linux)

The default for overcommitting on Linux is heuristic; it doesn't always succeed: if you try to allocate several exabytes of RAM, allocation will definitely fail (in fact, trying to allocate e.g. 2GB when you only have 1GB free will usually fail just the same).

There is an option for "always overcommit" (incidentally, the one Redis recommends you use), in which case allocation will always succeed provided the kernel can represent what you're trying to allocate (which is what you're describing), but it's definitely not the default.

Reference: https://www.kernel.org/doc/Documentation/vm/overcommit-accou...


malloc() should fail, even with always-overcommit enabled, if you try to allocate more memory than there is virtual address space, although that needs testing to confirm. The alternative to failing would be hanging inside the kernel, which would be a DoS attack.


Yes, you're entirely correct; that's actually what I meant by "provided the kernel can represent [it]" (albeit perhaps not very clearly!).

Cheers,


There is also the case of the sum of the allocations of several malloc operations exceeding that threshold; a single malloc operation that large is just a special case of that. That is what I meant, but rereading what I wrote made me realize that I was not clear about it.


Thanks for clearing that up! :)


There have been a lot of arguments about not bothering to check the return of malloc(). I can say that in this case,

    char *large = malloc((size_t)-1);
    if (NULL == large) {
        perror("Well that failed");
    }
this compiles fine and outputs:

    Well that failed: Cannot allocate memory


In C, 0x81000000 is always a signed number, since that's a regular "int"-type literal (like "17"). Whether it's negative or not is of course implementation-dependent, on a 32-bit system with 32-bit "int" it will be since bit 31 is set.

To make it unsigned you'd have to append an 'u'.

I'm sure you know this, but people somehow sometimes seem to magically expect "certain" literals to be unsigned just because they're hex, and that's not how it works.

EDIT: It seems I forgot about the complexities pointed out by replies to this comment. Thanks. Now the only thing I'm certain about is that I'm confused.


Just a small remark, on any platform that has 32-bit integer types and 64-bit integer types and nothing in between, regardless of how these are mapped to int/long/long long, 0x81000000 is an unsigned integer.

You can see the C11 rules here and work through them for yourself: http://port70.net/~nsz/c/c11/n1570.html#6.4.4.1p5 The rules were identical in C99 and different in C90.

This is not very important for the discussion though. What is important for the discussion, and is perhaps left too implicit—but there is a maximum useful length for a blog post and this one is already close—is:

- Any C variant (C90 or C99/C11) will try hard to pick a type for the literal 0x81000000 in which the desired value is representable. C99/C11 will always succeed, because they specify long long, which must be able to represent this value. A C90 compiler will also succeed, because it guarantees an unsigned long type of at least 32 bits, which can represent this value.

- In ptr + offset, the particular integer type of "offset" does not matter. The expression is defined exactly the same whether offset is a signed char or an unsigned long long. The type of offset can be an integer type far wider than size_t/ptrdiff_t, and the expression still works as long as the result is in bounds for the array that ptr points to.


That's not true. Hexadecimal integer constants without a suffix may have a signed type, but they may also have an unsigned one. For example, 0x81000000 usually has type unsigned int.

http://port70.net/~nsz/c/c11/n1570.html#6.4.4.1p5


Passing %d to printf says to print an int. 0x81000000 is a negative number when processed as a 32-bit signed integer. The output is the two's complement interpretation of the value. To print an unsigned int, you should pass %u. This would be more obviously correct if %x were used to print the hexadecimal representation, which is 0x81000000.


> but people somehow sometimes seem to magically expect "certain" literals to be unsigned just because they're hex, and that's not how it works

Indeed, and in the most general sense, the signed/unsigned is really nothing more than how those 32 bits are interpreted by whatever operates on them. C attaches a particular type to literals which is used by the built-in operators, but things like printf() to a certain extent will do their own interpretation, as this shows:

    printf("%d %u %x", 0x81000000, 0x81000000, 0x81000000);
This is definitely a concept that seems quite unfamiliar to those used to higher-level languages.


If the site showed anything beyond red bouncing balls, I'd be so happy. And no, I'm not going to enable JS in my browser.


No HackerNews thread is complete without someone complaining about JS on the website.


I can see the article fine without JS, it's not dynamically generated and you can see the text there in the page if you view the source.



