I can’t entirely tell what the article’s point is. It seems to be trying to say that many languages can mmap bytes, but:
> (as far as I'm aware) C is the only language that lets you specify a binary format and just use it.
I assume they mean:
struct foo { fields; };
struct foo *data = mmap(…);
And yes, C is one of relatively few languages that let you do this without complaint, because it’s a terrible idea. And C doesn’t even let you specify a binary format — it lets you write a struct that will correspond to a binary format in accordance with the C ABI on your particular system.
If you want to access a file containing a bunch of records using mmap, and you want a well defined format and good performance, then use something actually intended for the purpose. Cap’n Proto and FlatBuffers are fast but often produce rather large output; protobuf and its ilk are more space efficient and very widely supported; Parquet and Feather can have excellent performance and space efficiency if you use them for their intended purposes. And everything needs to deal with the fact that, if you carelessly access mmapped data that is modified while you read it in any C-like language, you get UB.
> correspond to a binary format in accordance with the C ABI on your particular system.
We're so deep in this hole that people are fixing this in silicon on the CPU.
The Graviton team made a little-endian version of ARM just to allow lazy code like this to migrate away from Intel chips without having to rewrite struct unpacking (as IBM also did with ppc64le).
Early in my career, I spent a lot of my time byte-swapping Java bytecode into little endian to match all the bytecode interpreter enums I had, and completely hating how 0xCAFEBABE would literally read BE BA FE CA (jokingly referred to as "be bull shit") in (gdb) x views.
I'm not sure how useful it is, though it was only added 10 years ago with GCC 6.1 (recent'ish in the world of arcane features like this, and only just about now something you could reasonably rely upon existing in all enterprise environments), so it seems some people thought it would still be useful.
ARM is usually bi-endian, and almost always runs in little endian mode. All Apple ARM is LE. Not sure about Android but I’d guess it’s the same. I don’t think I’ve ever seen BE ARM in the wild.
Big endian is as far as I know extinct for larger mainstream CPUs. Power still exists but is on life support. MIPS and Sparc are dead. M68k is dead.
X86 has always been LE. RISC-V is LE.
It’s not an arbitrary choice. Little endian is superior because you can cast between integer types without pointer arithmetic and because manually implemented math ops are faster on account of being linear in memory. It’s counter intuitive but everything is faster and simpler.
Network data and most serialization formats are big endian by convention, a legacy from the early net growing on chips like Sparc and M68k. If it were redone now everything would be LE everywhere.
Yes. In little-endian, the difference between short and long at a specific address is how many bytes you read from that address. In big-endian, to cast a long to a short, you have to jump forward 6 bytes to get to the 2 least-significant bytes.
Wow, I've been living life assuming that little endian was just the VHS of byte orders with no redeeming qualities whatsoever until today. This actually makes sense, thank you!
Network data and most serialization formats are big endian because it's easiest to shift bits in and out of a shift register onto a serial comm channel in that order. If you used little endian, the shifter on output would have to operate in reverse direction relative to the shifter on input, which just causes stupid inconsistency headaches.
Had the same thought. Also confused at the backhanded compliment that pickle got:
> Just look at Python's pickle: it's a completely insecure serialization format. Loading a file can cause code execution even if you just wanted some numbers... but still very widely used because it fits the mix-code-and-data model of python.
Like, are they saying it's bad? Are they saying it's good? I don't even get it. While I was reading the post, I was thinking about pickle the whole time (and how terrible that idea is, too).
A thing can be good and bad. Everything is a tradeoff. The reason why C is 'good' in this instance is the lack of safety, and everything else that makes C, C (see?) but that is also what makes C bad.
Yeah, and as you well put it, it isn't even some snowflake feature only possible in C.
The myth that it was a gift from Gods doing stuff nothing else can make it, persists.
And even in the languages that don't, it isn't as if a tiny Assembly thunk is the end of the world to write, but apparently at the sight of a plain mov people run to the hills nowadays.
> And even in the languages that don't, it isn't as if a tiny Assembly thunk is the end of the world to write, but apparently at the sight of a plain mov people run to the hills nowadays.
Use the right tool for the job. I've always felt it's often the most efficient thing to write a bit of code in assembler, if that's simpler and clearer than doing anything else.
It's hard to write obfuscated assembler because it's all sitting opened up in front of you. It's as simple as it gets and it hasn't got any secrets.
It's not a terrible idea. It has its uses. You just have to know when to use it and when not to use it.
For example, to have fast load times and zero temp memory overhead I've used that for several games. Other than changing a few offsets to pointers the data is used directly. I don't have to worry about incompatibilities. Either I'm shipping for a single platform or there's a different build for each platform, including the data. There's a version in the first few bytes just so during dev we don't try to load old format files with new struct defs. But otherwise, it's great for getting fast load times.
To support your point, it's also used in basically every shared library / DLL system. While usually used "for code", a "shared pure data library" has many applications. There are also 3rd party tools to make this convenient from many PLangs like HDF5, https://github.com/c-blake/nio with its FileArray for Nim, Apache Arrow, etc.
Unmentioned so far is that defaults for max live memory maps are usually much higher than defaults for max open files. So, if you are careful about closing files after mapping, you can usually get more "range" before having to move from OS/distro defaults. (E.g. for `program foo*`-style work where you want to keep the foo open for some reason, like binding them to many read-only NumPy array variables.)
The program you used to leave your comment, and the libraries it used, were loaded into memory via mmap(2) prior to execution. To use protobuf or whatever, you use mmap.
The only reason mmap isn’t more generally useful is the dearth of general-use binary on-disk formats such as ELF. We could build more memory-mapped applications if we had better library support for them. But we don’t, which I suppose was the point of TFA.
Entire libraries are a weird sort of exception. They fundamentally target a specific architecture, and all the nonportable or version dependent data structures are self describing in the sense that the code that accesses them are shipped along with the data.
And if you load library A that references library B’s data and you change B’s data format but forget to update A, you crash horribly. Similarly, if you modify a shared library while it’s in use (your OS and/or your linker may try to avoid this), you can easily crash any process that has it mapped.
I've written a lot of code using that method, and never had any portability issues. You use types with number of bits in them.
Hell, I've slung C structs across the network between 3 CPU architectures. And I didn't even use htons!
Maybe it's not portable to some ancient architecture, but none that I have experienced.
If there is undefined behavior, it's certainly never been a problem either.
And I've seen a lot of talk about TLB shootdown, so I tried to reproduce those problems but even with over 32 threads, mmap was still faster than fread into memory in the tests I ran.
Look, obviously there are use cases for libraries like that, but a lot of the time you just need something simple, and writing some structs to disk can go a long way.
As proven by many languages without native support for plain old goto, it isn't really required when proper structured programming constructs are available, even if it happens to be a goto under the hood, managed by the compiler.
My point is it's bad debating style. 'Everyone knows C is bad for all kinds of reasons ergo, even when someone presents their own actual experience, I can respond with a refrain that sounds good'
Not using goto because you've heard it's always bad is the same kind of thing. Yes it has issues, but that isn't a reason to brush anyone off that have actual valid uses for it.
C allows most of this, whereas C++ doesn't allow pointer aliasing without compiler flags, tricks, and problems.
I agree you can certainly just use bytes of the correct sizes, but often, to get the coverage you need for the data structure, you end up writing some form of wrapper or fixup code. That is still easier and gives you more control than most of the protobuf-like stuff, which introduces a lot of complexity and tons of code.
Check your generated code. Most compilers assume that packed also means unaligned and will generate unaligned load and store sequences, which are large, slow, and may lose whatever atomicity properties they might have had.
Modern versions of standard C aren't very portable either, unless you plan to stick to the original version of K&R C you have to pick and choose which implementations you plan to support.
I disagree. Modern C with C17 and C23 make this less of an issue. Sure, some vendors suck and some people take shortcuts with embedded systems, but the standard is there and adopted by GCC, Clang and even MSVC has shaped up a bit.
Well, if that is the standard for portability then may_alias might as well be standard. GCC and Clang support it and MSVC doesn't implement the affected optimization as far as I can find.
Within the context of this discussion portability was mentioned as key feature of the standard. If C23 adoption is as limited as the, possibly outdated, tables on cppreference and your comments about gcc, clang and msvc suggest then the functionality provided by the gcc attribute would be more portable than C23 conformant code. You could call it a de facto standard, as opposed to C23 which is a standard in the sense someone said so.
That seems highly unlikely. Let's assume that all compilers use the exact same padding in C structs, that all architectures use the same alignment, and that endianness is made up, that types are the same size across 64 and 32 bit platforms, and also pretend that pointers inside a struct will work fine when sent across the network; the question remains still: Why? Is THIS your bottleneck? Will a couple memcpy() operations that are likely no-op if your structs happen to line up kill your perf?
I guess to not have to set up protobuf or asn1. Those preconditions of both platforms using the same padding and endianness aren't that hard to satisfy if you own it all.
But do you really have such a complex struct where everything inside is fixed-size? I wouldn't be surprised if it happens, but this isn't so general-purpose like the article suggests.
No defined binary encoding, no guarantee about concurrent modifications, performance trade-offs (mmap is NOT always faster than sequential reads!) and more.
C has had fixed size int types since C99. And you've always been able to define struct layouts with perfect precision (struct padding is well defined and deterministic, and you can always use __attribute__((packed)) and bit fields for manual padding).
Endianness might kill your portability in theory, but in practice nobody uses big endian anymore. Unless you're shipping software for an IBM mainframe, little endian is portable.
You just define the structures in terms of some e.g. uint32_le etc types for which you provide conversion functions to native endianness. On a little endian platform the conversion is a no-op.
It can be made to work (as you point out), and the core idea is great, but the implementation is terrible. You have to stop and think about struct layout rules rather than declaring your intent and having the compiler check for errors. As usual C is a giant pile of exquisitely crafted footguns.
A "sane" version of the feature would provide for marking a struct as intended for ser/des at which point you'd be required to spell out every last alignment, endianness, and bit width detail. (You'd still have to remember to mark any structs used in conjunction with mmap but C wouldn't be any fun if it was safe.)
mmap is not a C feature, but POSIX. There are C platforms that don't provide mmap, and on those that do you can use mmap from other languages (there's an mmap module in Python's standard library, for example).
I think this is sort of missing the point, though. Yes, mmap() is in POSIX[1] in the sense of "where is it specified".
But mmap() was implemented in C because C is the natural language for exposing Unix system calls and mmap() is a syscall provided by the OS. And this is true up and down the stack. Best language for integrating with low level kernel networking (sockopts, routing, etc...)? C. Best language for async I/O primitives? C. Best language for SIMD integration? C. And it goes on and on.
Obviously you can do this stuff (including mmap()) in all sorts of runtimes. But it always appears first in C and gets ported elsewhere. Because no matter how much you think your language is better, if you have to go into the kernel to plumb out hooks for your new feature, you're going to integrate and test it using a C rig before you get the other ports.
[1] Given that the pedantry bottle was opened already, it's worth pointing out that you'd have gotten more points by noting that it appeared in 4.2BSD.
If we're going to be pedantic, mmap is a syscall. It happens that the C version is standardized by POSIX.
The underlying syscall doesn't use the C ABI, you need to wrap it to use it from C in the same way you need to wrap it to use it from any language, which is exactly what glibc and friends do.
Moral of the story is mmap belongs to the platform, not the language.
No, that's too far down the pedantry rabbit hole. "mmap()" is quite literally a C function in the 4.2BSD libc. It happens to wrap a system call of the same name, but to claim that they are different when they arrived in the same software and were written by the same author at the same time is straining the argument past the breaking point. You now have a "C Erasure Polemic" and not a clarifying comment.
If you take a kernel written in C and implement a VM system for it in C and expose a new API for it to be used by userspace processes written in C, it doesn't magically become "not C" just because there's a hardware trap in the middle somewhere.
> C is the natural language for exposing Unix system calls
No, C is the language _designed_ to write UNIX. Unix is older than C, C was designed to write it and that's why all UNIX APIs follow C conventions. It's obvious that when you design something for a system it will have its best APIs in the language the system is written in.
C has also multiple weird and quirky APIs that suck, especially in the ISO C libc.
>> C is the natural language for exposing Unix system calls
> No, C is the language _designed_ to write UNIX. [...]
This is one of those hilarious situations where internet discussion goes off the rails. Everything you wrote, to the last word, would carry the same meaning and the same benefit to the discussion had you written "Yes" instead of "No" as the first word.
Literally you're agreeing with me, but phrasing it as a disagreement only because you feel I left something out.
If I write an OS in Basic, surely the 'natural' language for exposing the system calls is Basic?
Yes, Unix predates C. But at this point in time, 50+ years down the road, where the majority of *nix users don't use anything that ever contained that code, and the minority use a *nix that has been thoroughly ship-of-Theseus'd, Unix is to all intents and purposes a C operating system.
> If I write an OS in Basic, surely the 'natural' language for exposing the system calls is Basic?
For that specific OS, that would probably be the case? I think every API is bound to reflect the specific constraints of the language it has been written in. What I was trying to clarify was that UNIX and C are intertwined in an especially deep way, more than basically any other OS that doesn't have a UNIX API, because both were born and written alongside each other. Some Unix APIs rely on C-specific behaviour and quirks, and some C features were born and designed around the same historical context UNIX was born in.
Agree to disagree there. For casual "I need to vectorize this code" tasks, modern compilers are almost magic. I mean, have you looked at the generated code for array-based numerics processing? It's like, you start "vectorizing" the algorithm and realize the compiler already did 80% of it for you.
Using mmap means that you need to be able to handle memory access exceptions when a disk read or write fails. Examples of disk access that fails includes reading from a file on a Wifi network drive, a USB device with a cable that suddenly loses its connection when the cable is jiggled, or even a removable USB drive where all disk reads fail after it sees one bad sector. If you're not prepared to handle a memory access exception when you access the mapped file, don't use mmap.
Do these really ever result in access failures instead of just hangs? How are they surfaced to processes?
In my experience, all these things just cause whatever process is memory mapping to freeze up horribly and make me regret ever using a network file system or external hard drive.
If that's your program design, then fread is not a substitute, because you would need to pass the FILE* pointer into all those calls.
And what are you hoping to do in those call stacks when you find an error? Can any of that logic hope to do anything useful if it can't access this data? Let the OS handle this. crash your program and restart.
You can even mmap a socket on some systems (iOS and macOS via GCD). But doing that is super fragile. Socket errors are swallowed.
My interpretation has always been that mmap should only be used for immutable, local files. You may still run into issues with those types of files, but it's very unlikely.
It’s also great for when you have a lot of data on local storage, and a lot of different processes that need to access the same subset of that data concurrently.
Without mmap, every process ends up caching its own private copy of that data in memory (think fopen, fread, etc). With mmap, every process accesses the same cached copy of that data directly from the FS cache.
Granted this is a rather specific use case, but for this case it makes a huge difference.
C doesn't have exceptions, do you mean signals? If not, I don't see how that is that any different from having to handle I/O errors from write() and/or open() calls.
It's very different, since at random points of your program your signal handler is called asynchronously, you can only do a very limited set of async-signal-safe things there, and the flow of control in your I/O, logic etc. code has no idea it's happening.
Well at least in this case the timing won't be arbitrary. Execution will have blocked waiting on the read and you will (AFAIK) receive the signal promptly. Since the code in question was doing IO that you knew could fail, handling the situation can be as simple as setting a flag from within the signal handler.
I'm unclear what would happen in the event you had configured the mask to force SIGBUS to a different thread. Presumably undefined behavior.
> If multiple standard signals are pending for a process, the order in which the signals are delivered is unspecified.
That could create the mother of all edgecases if a different signal handler assumed the variable you just failed to read into was in a valid state. More fun footguns I guess.
> Since the code in question was doing IO that you knew could fail handling the situation can be as simple as setting a flag from within the signal handler.
If you are using mmap like malloc (as the article does) you don't necessarily know that you are "reading" from disk. You may have passed the disk-backed pointers to other code. The fact that malloc and mmap return the same type of values is what makes mmap in C so powerful AND so prone to issues.
Yes, and for writing (the example is read-write) it's of course yet another kettle of fish. The error might never get reported at all. Or you might get a SIGBUS (at least with sparse files).
I think the C# standard library is better. You can do the same unsafe code as in C with the SafeBuffer.AcquirePointer method, then directly access the memory. Or you can be safer and slightly slower by calling the Read or Write methods of MemoryMappedViewAccessor.
All these methods are in the standard library, i.e. they work on all platforms. The C code is specific to POSIX; Windows supports memory mapped files too but the APIs are quite different.
Indeed, but these normal APIs have runtime costs for bounds checking. For some use cases, unsafe can be better. For instance, the last time I used a memory-mapped file was for a large immutable Bloom filter. I knew the file should be exactly 4GB, validated that in the constructor, and then, when testing 12 bits from random locations of the mapped file on each query, I opted for unsafe code.
It is a matter of the deployment scenario; in the days when people ship Electron and deploy production code in CPython, those bounds checks don't hurt at all.
When they do, thankfully there is unsafe if needed.
Aside from what https://news.ycombinator.com/item?id=47210893 said, mmap() is a low-level design that makes it easier to work with files that don't fit in memory and fundamentally represent a single homogeneous array of some structure. But it turns out that files commonly do fit in memory (nowadays you commonly have on the order of ~100x as much disk as memory, but millions of files); and you very often want to read them in order, because that's the easiest way to make sense of them (and tape is not at all the only storage medium historically that had a much easier time with linear access than random access); and you need to parse them because they don't represent any such array.
When I was first taught C formally, they definitely walked us through all the standard FILE* manipulators and didn't mention mmap() at all. And when I first heard about mmap() I couldn't imagine personally having a reason to use it.
> But it turns out that files commonly do fit in memory
The difference between slurping a file into malloc'd memory and just mmap'ing it is that the latter doesn't use up anonymous memory. Under memory pressure, the mmap'd file can just be evicted and transparently reloaded later, whereas if it was copied into anonymous memory it either needs to be copied out to swap or, if there's not enough swap (e.g. if swap is disabled), the OOM killer will be invoked to shoot down some (often innocent) process.
If you need an entire file loaded into your address space, and you don't have to worry about the file being modified (e.g. have to deal with SIGBUS if the file is truncated), then mmap'ing the file is being a good citizen in terms of wisely using system resources. On a system like Linux that aggressively buffers file data, there likely won't be a performance difference if your system memory usage assumptions are correct, though you can use madvise & friends to hint to the kernel. If your assumptions are wrong, then you get graceful performance degradation (back pressure, effectively) rather than breaking things.
Are you tired of bloated software slowing your systems to a crawl because most developers and application processes think they're special snowflakes that will have a machine all to themselves? Be part of the solution, not part of the problem.
On BSD, read() was already implemented in the kernel by page-faulting in the desired pages of the file, to then be copied into the user-supplied buffer. So from the first time mmap was ever implemented, it was always the fastest input mechanism. (First deployed implementation was in SunOS btw, 4.2BSD specified and documented it but didn't implement it.) Anyway there's no magic to get data off a device into memory faster, io_uring just lets you hide the delay in some other thread's time.
I'm not sure what the author really wants to say. mmap is available in many languages (e.g. Python) on Linux (and many other *nix I suppose). C provides you with raw memory access, so using mmap is sort-of-convenient for this use case.
But if you use Python then, yes, you'll need a bytearray, because Python doesn't give you raw access to such memory - and I'm not sure you'd want to mmap a PyObject anyway?
Then, writing and reading this kind of raw memory can be kind of dangerous and non-portable - I'm not really sure that the pickle analogy even makes sense. I very much suppose (I've never tried) that if you mmap-read malicious data in C, a vulnerability would be _quite_ easy to exploit.
And if you want to go farther back, even if it wasn't called "mmap" or a specific function you had to invoke -- there were operating systems that used a "single-level store" (notably MULTICS and IBM's AS/400..err OS/400... err i5 OS... err today IBM i [seriously, IBM, pick a name and stick with it]) where the interface to disk storage on the platform is that the entire disk storage/filesystem is always mapped into the same address space as the rest of your process's memory. Memory-mapped files were basically the only interface there was, and the operating system "magically" persisted certain areas of your memory to permanent storage.
I guess the author didn't use that many other programming languages or OSes. You can do the same even in garbage collected languages like Java and C# and on Windows too.
I'd be careful though, as they all have quirks due to how tricky it is handling mmap faults. The Java API mentions both unique garbage collection behavior and throwing unspecified exceptions at unspecified times.
What a bizarre conclusion to draw! It's like saying that cars are the best means of transportation because you can travel to the Grand Canyon in them and the Grand Canyon is the best landscape in the world, and yes you could use other means to get there, but cars are what everybody's using.
If the real goal of TFA was to praise C's ability to reinterpret a chunk of memory (possibly mapped to a file) as another datatype, it would have been more effective to do so using C functions and not OS-specific system calls. For example:
This is way more cumbersome than mmap if you need to out-of-core process the file in non-sequential patterns. Way way more cumbersome, since you need to deal with intermediate staging buffers, and reuse them if you actually want to be fast. mmap, on the other hand, is absolutely trivial to use, like any regular buffer pointer. And at least on windows, the mmap counterpart can be faster when processing the file with multiple threads, compared to fread.
But I agree that it's a bizarre article, since mmap is not part of standard C and relies on platform-dependent operating system APIs.
> However, in most other languages, you have to read() in tiny chunks, parse, process, serialize and finally write() back to the disk. This works, but is verbose and needlessly limited
C has those too, and I'm glad it does. This is what allows one to do other things while the buffer gets filled, without the need for multithreading.
Yes easier standardized portable async interfaces would have been nice, not sure how well supported they are.
Wouldn’t we need to implement all of that extra stuff if we really wanted to work with text from files? I have a use case where I do need extra fast text input/output from files. If anyone has thoughts on this, I’d love to hear them.
The standard way is to use libraries like libevent, libuv that wraps system calls such as epoll, kqueue etc.
The other palatable way is to register consumer coroutines on a system-provided event loop. In C one does so with macro magic, or using stack switching with the help of a tiny bit of inline assembly.
Take a look at Simon Tatham's page on coroutines in C.
To get really fast you may need to bypass the kernel. Or have more control on the event loop / scheduler. Database implementations would be the place to look.
mmap() is useful for some narrow use-cases, I think, but error-handling is a huge pain. I don't want to have to deal with SIGBUS in my application.
I agree that the model of mmap() is amazing, though: being able to treat a file as just a range of bytes in memory, while letting the OS handle the fussy details, is incredibly useful. (It's just that the OS doesn't handle all of the fussy details, and that's a pain.)
It has the best API for the author, that's for sure. One size does not fit all: believe it or not, different files have different uses. One does not mmap a pipe or /dev/urandom.
I think that I open files in very few cases in my job. I read and write PDF, xlsx, csv, yaml and I write docx. Those have their own formats and we use them to communicate with other apps or with users. Everything else goes in a PostgreSQL database or in sqlite3 because of many reasons and among them because of interoperability with other apps and ease of human inspection. A custom file format could be OK for data that only that app must use or for performance reasons, if you know how to make data retrieval performant.
Nobody is mentioning the "file system as a NoSQL database" comment. I found the most friction when using bash/unix style tools when I tried to put everything in structured files that needed parsing. Once you see folders/files as the structure, these tools work great.
It's also interesting to me that many NoSQL systems start from the assumption that relations are too complex, and that trees are preferred.
NoSQL was more of a convenience thing during that trend. You might have some bag of nested attributes that maps perfectly fine onto relations, but it's cumbersome for someone who just wants to load it all into an object, edit, then resave. People used to use ORMs to get around that, then NoSQL became popular, and now you can just use jsonb in SQL while still maintaining relations for other things.
This doesn't say much about the horizontal scaling that NoSQL systems were really designed for, but most people getting on that train didn't need that kind of scale.
At first glance, it's a quite weird article. But at the bottom:
> This simply isn't true on memory constrained systems — and with 100 GB files — every system is memory constrained.
I suppose the author might have a point in the context of making apps that constantly need to process 100GB files? I personally never have to deal with 100GB files so I am no one to judge if the rest of the article makes sense.
Yeah, the article makes more sense when you assume it's for that use case, but still not really sure when this case comes up. I've dealt with plenty of 100GB files, but they were coming from outside as CSVs or sqlite DBs or something. This one is, your C program is going to generate a 100GB file to use later, and you also don't need a DB.
It may have a tidy mmap API, but Smalltalk has a much better file API through its Streams hierarchy IMHO. You can create a stream on a disk file, on a byteArray, on standard Unix streams, on anything where "next" makes sense.
After reading the comments here it boils down to: but my language is better than yours. mmap is not a feature of C. Some more modern languages try to prevent people from shooting themselves in the foot and only allow byte-wise access to such mmapped regions. They have a point in doing this, but on the other hand the C users also have a valid point. Safety and speed are two factors you have to weigh when picking your tools. From a hardware point of view C might be more direct, but it also lets you make "stupid" errors fast. More modern languages prevent the "stupid" errors but make you copy or transform the data more. As Scotty from the Enterprise once said: always use the fitting tool.
If mmap-style file access is this powerful, why do most higher-level languages avoid exposing typed, struct-level mappings directly instead of just byte buffers?
C's API does not include mmap, nor any API to deal with file paths, nor any support for opening a file picker. Paired with C's bad string support, this makes it one of the worst file APIs.
Also, using mmap is not as simple as the article lays out. For example, what happens when another process modifies the file and your process's mapped memory now consists of parts of two different versions of the file at the same time? You also need to build a way to grow the mapping if you run out of room, and you want to be able to handle failures to read or write. This means you pretty much need to reimplement fread and fwrite, going back to the approach the author didn't like: "This works, but is verbose and needlessly limited to sequential access." So it turns out "It ends up being just a nicer way to call read() and write()" is only true if you ignore the edge cases.
technically yes, because there's a failure path for every single failure that an OS knows about. And most others aren't so resilient. However, mmap bypasses a lot of that....
In my experience, having worked with a large system that used almost exclusively mmap for I/O, you don’t. The process segfaults and is restarted. In practice it almost never happened.
In Go, you can do mmap with the help of an external library :) You can mmap a file - https://github.com/edsrzf/mmap-go - and then unsafe-retype that to a slice of objects, and then read/write to it. It's very handy sometimes!
It's unsafe though.
You also need to be careful not to have any pointers in the struct (so also no slices or maps); or if you have pointers, they must be nil. Once you start unsafe-retyping random bytes to pointers, things explode very quickly.
mmap is nice. But, I find sqlite is a better filesystem API [1]. If you are going to use mmap why not take it further and use LMDB? Both have bindings for most languages.
When a developer that usually consumes the language starts critiquing the language.
I could go on as to why it's a bad signal, psychologically, but let's just say that empirically it usually doesn't come from a good place, it's more like a developer raising the stakes of their application failing and blaming the language.
Sure one out of a thousand devs might be the next Linus Torvalds and develop the next Rust. But the odds aren't great.
mmap is not a language feature. it is also full of its own pitfalls that you need to be aware of. recommended reading: https://db.cs.cmu.edu/mmap-cidr2022/
No it doesn't. If you have a file that's 2^36 bytes and your address space is only 2^32, it won't work.
On a related digression, I've seen so many cases of programs that could've handled infinitely long input in constant space instead implemented as some form of "read the whole input into memory", which unnecessarily puts a limit on the input length.
The point the article makes is that a 32GB file can be mmapped even if you only have 8GB of memory available - it wasn't talking about the address space. So the response is irrelevant even if technically correct
> What they said is correct regardless of that though?
I don't think so.
Their post is basically:
>> It still works if the file doesn't fit in RAM
> No it doesn't.
Which is incorrect: it actually does work for files that don't fit in RAM. It doesn't work only for files that don't fit in the address space, which is not what the author claimed.
All memory map APIs support moveable “windows” or views into files that are much larger than either physical memory or the virtual address space.
I’ve seen otherwise competent developers use compile time flags to bypass memmap on 32-bit systems even though this always worked! I dealt with database engines in the 1990s that used memmap for files tens of gigabytes in size.
> (as far as I'm aware) C is the only language that lets you specify a binary format and just use it.
I assume they mean:
And yes, C is one of relatively few languages that let you do this without complaint, because it’s a terrible idea. And C doesn’t even let you specify a binary format — it lets you write a struct that will correspond to a binary format in accordance with the C ABI on your particular system.

If you want to access a file containing a bunch of records using mmap, and you want a well defined format and good performance, then use something actually intended for the purpose. Cap’n Proto and FlatBuffers are fast but often produce rather large output; protobuf and its ilk are more space efficient and very widely supported; Parquet and Feather can have excellent performance and space efficiency if you use them for their intended purposes. And everything needs to deal with the fact that, if you carelessly access mmapped data that is modified while you read it in any C-like language, you get UB.