manticore, formerly sphinx search, has been rock solid for us for the past 16 years, now serving searches across nearly 300M short documents. we're using it in the old mode, where the full index is re-created every 24h.
it's great to see that the project is alive and adding embeddings-related functions needed for semantic search.
aspect worth noting: to my knowledge, HE's tunnel will work only if your ISP assigns you a public IPv4 address. if you're behind carrier-grade NAT - too bad, you'll need another solution to get IPv6 to your home.
Go Fiber (Shentel) is one such ISP, and they will gladly switch you to a public IP for no cost if you contact their support. Sadly they don’t support IPv6 yet.
environment: KVM VMs running on physical hardware managed by us.
we have a belt & suspenders approach:
* backups of selected files / database dumps [ via dedicated tools like mysqldump or pg_dumpall ] from within VMs
* backups of whole VMs
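for the file-level dumps, a minimal sketch of what such an in-VM job could look like - paths and the backup directory are hypothetical, not taken from our actual setup:

```shell
#!/bin/sh
# hypothetical nightly dump job run inside a VM
set -eu
STAMP=$(date +%F)
# MySQL: --single-transaction gives a consistent InnoDB dump without locking tables
mysqldump --single-transaction --all-databases | gzip > "/backup/mysql-$STAMP.sql.gz"
# PostgreSQL: pg_dumpall also captures globals (roles, tablespaces), not just databases
su - postgres -c pg_dumpall | gzip > "/backup/pg-$STAMP.sql.gz"
```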
backups of whole VMs are done by creating snapshot files via virsh snapshot-create-as, rsync, followed by virsh blockcommit. this provides crash-consistent images of the whole virtual disks. we zero-fill virtual disks before each backup.
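a sketch of one such backup cycle - VM name, disk target and paths are hypothetical examples:

```shell
#!/bin/sh
# sketch of the snapshot -> rsync -> blockcommit cycle described above
set -eu
VM=myvm
DISK=vda
# 1. external, disk-only snapshot: new writes go to an overlay file,
#    leaving the base qcow2 image quiescent for copying
virsh snapshot-create-as "$VM" backup --disk-only --atomic --no-metadata
# 2. copy the now effectively read-only base image off the host
#    (--sparse keeps zero-filled regions small on the target)
rsync -a --sparse "/var/lib/libvirt/images/$VM.qcow2" /backup/vms/
# 3. merge the overlay back into the base image and pivot the VM onto it
virsh blockcommit "$VM" "$DISK" --active --pivot
```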
all types of backups later go to a borg backup[1] deduplicated, compressed repository [ kopia[2] would be fine as well ].
this approach is wasteful in terms of bandwidth and backup size, but gives us peace of mind - even if we forget to take a file-level backup of some folder, we'll still have it in the VM-level backups.
Allegro has amazing metadata allowing you to precisely filter the results. Search experience on amazon is an utter abomination compared to Allegro.
We're using BTRFS to host PostgreSQL and MySQL replication slaves. We're snapshotting the drives holding data for both every 15 minutes, 1h, 8h and 12h, and keep a few snapshots for each frequency.
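A minimal sketch of one snapshot tick for a single frequency - the subvolume paths and retention count below are hypothetical, not our actual layout:

```shell
#!/bin/sh
# sketch: take one read-only snapshot and prune old ones for the 15-minute tier
set -eu
SRC=/srv/pgdata            # btrfs subvolume holding the replica's data directory
DST=/srv/.snapshots/15min  # where snapshots for this frequency live
KEEP=8                     # how many snapshots of this tier to retain
# read-only snapshot named by timestamp; cheap and instant thanks to COW
btrfs subvolume snapshot -r "$SRC" "$DST/$(date +%Y%m%d-%H%M)"
# drop the oldest snapshots beyond the retention count
ls -1 "$DST" | head -n -"$KEEP" | while read -r s; do
    btrfs subvolume delete "$DST/$s"
done
```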
Those replicas are not used for any workload, besides nightly consistency checks for MySQL via pt-table-checksum to ensure we don't have data drift.
Snapshots are crash consistent. Once in a while they give us the ability to very quickly inspect what data looked like a few minutes or hours ago. This can be a life-saver in case of fat-fingering production data, and has saved us from lengthy grepping of backups when we needed to recover a few records from a specific table.
Yes, I know soft deletes, audit logs - all of those could help and we do have them, but sometimes that's not enough or not feasible.
Due to its COW nature, BTRFS is far from perfect for data that changes all the time [ databases busy with writes, images of VMs with plenty of disk write activity ]. There's plenty of write amplification, but that can be solved by throwing NVMe drives at the problem.
How do you avoid heavy fragmentation caused by random writes? Do you disable COW (sounds like "no", given you snapshot)? Or autodefrag (how's performance)?
Also - because of that, FIDO2 does not seem to be usable with Microsoft's MS365 services [ Teams, Outlook, Excel etc ] on Android or iOS. there's no way to provide a PIN for the security key, regardless of whether it's plugged in via the USB port or used via NFC.
coincidentally - we've been using first sphinx and then manticore for over 15 years as well. in our case it's fed each night with XML generated by Java code from data stored in MySQL databases. we index over 294M pseudo-documents.
borg is great. we've been using it for the past 3 years to archive hundreds of file-level backups of servers, database dumps and VM images. the average size of each borg repo is a few GB, but there are a few outliers up to a few hundred GB. most backups are done daily, with 7-24 past days preserved in the borg archive. borg repos are verified, copied to external disks, verified again and rotated offline each week.
borg replaced https://rdiff-backup.net/ for us and gave:
* nice speedup of backups/backup tests,
* decent saving in the disk space thanks to compression and deduplication,
* decreased backup replication time [ a borg repo tends to have far fewer, larger files compared to rdiff, which has in its repo at least as many files as your source data; rsync likes that ].
to finish backups in a reasonable time we had to parallelize backup gathering [ each server / vm goes to a separate borg repo; this limits the failure domain in case of a corrupted repo, but denies us the benefit of deduplication at larger scale - across servers ] and borg archiving. without that - we would be limited by single-core CPU performance [ borg is not multithreaded yet ].
it's worth testing the backups - we're doing it each day by using borg's repo self test and by extracting a few key files and checking their checksums and content... just in case.
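a sketch of such a daily create-and-verify pass - repo path, archive name and the file being spot-checked are hypothetical examples, not our actual config:

```shell
#!/bin/sh
# sketch: daily borg archive plus the two verification steps mentioned above
set -eu
export BORG_REPO=/backup/borg/server1
ARCHIVE="server1-$(date +%F)"
borg create --compression zstd "::$ARCHIVE" /etc /var/www
# repo self test: --verify-data also re-checksums the stored chunks
borg check --verify-data
# restore test: extract a known file into a scratch dir and compare its content
cd "$(mktemp -d)"
borg extract "::$ARCHIVE" etc/hostname
cmp etc/hostname /etc/hostname
```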
echoing other comments - https://kopia.io/ looks interesting but we have not tried it yet.
as far as i understand, the apparent death of sphinx and the demand for continued development/support from its big users led to the creation of manticore.