I'm not familiar with C enough to know the answer, but I'm trying to think how anything goes from untrusted input -> trusted input safely.
To sanitize the data, you're putting the input into memory to perform logic on it, isn't that itself then an attack vector? I would think that any language would need to do this.
There are a lot of different issues that can come up, but in practice ~80% of those (my made up number) are out-of-bounds issues. So for example, say you're parsing a JSON string literal. What happens if the close-quote is missing from the end of the string? You might have a loop that iterates forward looking for the close-quote until it reaches the end of the input. What that code should do is then return an error like "unclosed string". If you write that check, your code will be fine in any language. What if you forget that check? In most languages you'll get an exception like "tried to read element X+1 in an array of length X". That's not a great error message, but it's invalid JSON anyway, so maybe we don't care super much. However in C, array accesses aren't bounds-checked, so your loop plows forward into random memory, and you get a CVE roughly like this one.
In short, the issue is that you forgot a check, and your code effectively "trusted" that the input would close all its strings. If you never make mistakes like that, you can validate input in C just like in any other language. But the consequences of making that mistake in C are really nasty.
The error you're describing is more likely to happen with an array of ints (or really any other type without a sentinel value).
Strings specifically are often processed in a `while(c != '\0')` loop (assume c is the character being examined), or something to that effect, which means you'll exit at the end of the string (non-string arrays don't have this).
The CVE in question seems to be the exact opposite of this. It's that someone didn't check the bounds on a write instead of a read.
`while(c != '\0')` is the same as `while(c != '"')`. An attacker-controlled string may very well be missing the 0 byte, which has been an extremely common attack vector (though it's probably not a realistic one for JSON parsers, to be fair).
> An attacker controlled string may very well be missing the 0 byte
Entirely possible, especially if the attacker is local. But when we're dealing with something coming in over the network, I think even the old arpa headers get you a null byte at the end, regardless of whether one was sent.
Unless we aren't dealing with tcp/ip, in which case I'm way out of my depth.
Just because something is in memory doesn’t mean that it is realistically executable. That’s why you can download a virus to look at the code without it installing itself.
You aren’t wrong that even downloading untrusted data is less secure than not downloading it. But to actually exploit a machine that is actively sanitizing unsafe data, you need either (A) an attack vector for executing code at an arbitrary location in memory, or (B) a known OOB bug in the code that you can exploit to read your malicious data, by ensuring your data is right after the data affected by the OOB bug.
>To sanitize the data, you're putting the input into memory to perform logic on it
Sure, but memory isn't normally executed.
One of the more common problems was not checking length. Many C functions assume sanitized data, so they don't check. There are functions that read input without checking length (gets is the most famous, but there are others), so if someone supplies more data than you have room for, the rest of the data just keeps going off the end of the buffer. It turns out that in many cases you can predict where "off the end" is, and then craft that data to be something the computer will run.
One common variation: C assumes that strings end with a null character. There are a number of ways to get a string that doesn't end with that null, and if the user can force that, those functions will read/write past the end of the data, which is sometimes something you can exploit.
So long as your C code carefully checks the length of everything, you are fine. One common variation of this bug is checking the length but miscounting by one character. It is very hard to get this right every single time, and if you mess it up just once you are open to some unknown exploit in the future.
(Note, there are also memory issues with malloc that I didn't cover, but that is something else C makes hard to get right).
Don't forget the hilariously dangerous strcpy, which they "fixed" with strncpy (which would happily create unterminated strings), so it was fixed again with strlcpy. At least std::string doesn't have these problems (it has its own issues, because the anemic API surface means you keep needing C APIs that require null termination).
Slapping strlcpy on everything, as some codebases/companies have taken to doing, is a poor fix. The proper fix is not quite shipping yet, but you can build your own out of memccpy if you'd like. (Of course, at the risk of doing it wrong…)
Is anyone able to explain this to me?