My plan was to just see if anyone wanted to actually use it first: if I couldn't give it away, I wouldn't invest the time in selling or open sourcing it.
I'd sort of designed it for my own needs first and hadn't thought too far beyond that.
Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.
But yeah, how it picks the frame is the weak point here. Scene detection would definitely help - this is #1 on the roadmap.
Could you share how your scene-detection picks the frames?
---
For the vector search, I made the trade-off of skipping it and keeping things simple with plain Markdown files, for more portability. The knowledge travels with the files when an SSD moves, there's no index to keep in sync, and plain text outlives the tool. But the other path you mentioned is interesting to explore as well.
I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded, the app extracts a bunch of frames across the entire video, then calculates an image histogram for each one and compares its similarity to the previous frame's. There's some configuration so it doesn't send too many to the LLM, but it still gets a good cross-section of frames to send.
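Roughly, the histogram comparison described above could look like the sketch below. This is not the app's actual code: the function names, the 16-bin grayscale histogram, histogram intersection as the similarity measure, the 0.7 threshold, and the `max_frames` cap are all assumptions picked for illustration. Frames are plain lists of 8-bit grayscale pixel values so the example runs without any image library.

```python
from typing import List, Sequence


def histogram(pixels: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized histogram of 8-bit grayscale pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]


def similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pick_frames(
    frames: Sequence[Sequence[int]],
    threshold: float = 0.7,   # assumed value; tune per footage
    max_frames: int = 20,     # assumed cap on frames sent to the LLM
) -> List[int]:
    """Return indices of frames that differ enough from the frame before them."""
    if not frames:
        return []
    picked = [0]  # always keep the first frame
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if similarity(prev, cur) < threshold:
            picked.append(i)
        prev = cur
    if len(picked) > max_frames:
        # Thin the selection evenly so the cross-section still spans the video.
        step = len(picked) / max_frames
        picked = [picked[int(k * step)] for k in range(max_frames)]
    return picked
```

With real footage you would feed in decoded frames (for example downscaled grayscale arrays) and tune the threshold until the picks look like actual scene changes.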
You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.
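For the FFmpeg route, a minimal sketch using its `select` filter with the `scene` score is below. The helper name, the file paths, and the 0.4 threshold are assumptions for illustration; `-vsync vfr` is what older builds expect, while newer FFmpeg releases prefer `-fps_mode vfr`.

```python
from typing import List


def scene_detect_command(
    video_path: str,
    out_pattern: str,
    threshold: float = 0.4,  # assumed value; higher means fewer, harder cuts
) -> List[str]:
    """Build an ffmpeg command that writes one image per detected scene change."""
    # The single quotes are filtergraph quoting, so the comma inside gt(...)
    # is not treated as a filter separator.
    select = f"select='gt(scene,{threshold})'"
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", select,
        "-vsync", "vfr",  # emit only the selected frames, no duplicates
        out_pattern,
    ]


if __name__ == "__main__":
    import shlex

    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually extract frames.
    cmd = scene_detect_command("clip.mp4", "scene_%03d.png")
    print(shlex.join(cmd))
```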
Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.
Author here. I totally hear you. I wasn't expecting this to do well on HN for exactly this reason.
But as I've mentioned elsewhere, if it wasn't for all the AI assistance, I would've put off documenting everything I did and never even gotten to the writing part.
But yeah, I'll be working on the workflow to make the next write-up better, more humanized.
You can make AI-generated content without it being slop. Slop, to me at least, is content that's wrong, padded, or generic.
I see the cadence / short-sentence issues but if there's something else beyond those, I'd actually want to know what made it feel bad.
I would've put off documenting what I did over the weekend but instead, I did document everything, spent quite some time (several iterations) and effort to make sure it does not hallucinate and writes in my own tone and voice. I'm sure it could be better but the content is not made-up.
At a time when most of us software engineers have changed our workflows to let AI write 80+% of our code using agents, I feel writing is heading the same way. It then becomes a matter of taste, whether it's done well or not.
If you're looking for clues and signs of whether a piece of content has used AI, you're going to be disappointed over the next 12 months.
If it feels jarring right now, I'll work harder on the workflow so it feels more natural next time (someone shared this project with me - https://github.com/blader/humanizer).
But this clearly allows me to make content which I wouldn't have done earlier.
I'm not philosophically against AI or anything, but I think this needed some heavy editing.
When I first saw this style, I didn't even think it was AI-written, because I associate AI-written text with fluffiness. This staccato instead looks like the model was told to be terse and informal. I think the informality doesn't help either -- it's not that you can't have a well-written colloquial text, but I think it's harder to pull off.
Here is an example:
> Gemma returned people_count: "many" instead of an integer. My vision prompt literally said integer or the string "many" if >10. Gemma followed instructions correctly; the bug was schema design. The fix was a stricter prompt (integer 0-99 with explicit guidance to estimate) plus a coercion in the parser for the legacy "many" responses. Don't union-type schema fields. Pick always-int or always-string, never "int or this one specific string," because every downstream consumer pays for the choice.
> The first half is a constant flood of footage from the iPhone, the DJI Pocket, the drone, the Nikon Z8, and lately the Ray-Ban Metas too. There's always something being recorded. Every photographer or videographer I know is sitting on the same problem: an archive that grows faster than they can edit it. The second half is why mine never gets touched.
This is your second paragraph, but it reads awkwardly. You mention two halves in the previous paragraph, so I kind of try to map those two halves to the halves in this paragraph. But I don't understand what the second half is in this second paragraph.
> Three months ago the lodge's social channels went dark. Not for lack of content; the lodge has years of raw footage across multiple SSDs. The bottleneck was editing time, and my time disappeared. Claude Code with Opus 4.5 (and then 4.6) hit the point in February where you could leave agents running for hours and come back to merged PRs. KaribuKit was going live with its first paying property in the same window. I stopped sleeping properly, started running three or four agents in parallel in the background, and the months when I would have cut reels turned into months when I shipped software instead.
I don't fully understand this paragraph either. Your time disappeared? Into what? Was it the lack of sleep? I don't know what KaribuKit is.
> I asked it out loud: how does the agent know what's in each clip?
Did you? Really?
> Four bugs, four lessons
I've noticed that AI tends to rathole into random things when summarizing a piece of work, so I'm skeptical that these were actually the four most interesting bugs you could have shared.
I would recommend you just remove this section or take the time to actually think about some learnings you had from this project. Syntax errors or missed CLI params are mildly interesting but what makes these four bugs interesting to your readers?
> The actual take
The same criticism here applies. Are these your real takes, or did Claude make these up too?
Some obvious tells to me of things that AI likes to write that humans rarely ever say:
> Both real, both consuming attention.
> Four constraints set the shape:
There's way more than just this (the writing style of nearly the entire post screams Opus 4.7), but that's just what jumped out at me when I started reading your post.
I don't mind that you used AI to write this, but in the future when you write using AI, take the time to read the entire article and consider what you wanted it to say and whether the AI achieved that. Take out what doesn't belong and make sure that what you have left says things in your voice.
Thanks! Video is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.
You know what I REALLY want? Just point this beast at the folders and have it tell me which 150 shots are good to process from these 1,500 images. That's the dream!
Although the technology is getting there, it's still a very difficult problem to solve. Taste and art are subjective. Also, as a photographer, I'll always be concerned - "what if my best shot was in one of these rejected shots".
But yeah, I think I'll try to do some more of these experiments soon.
There are a lot of open models out there… I told Claude to do a weighted score across several models and deduplicate by CLIP similarity for an expedition; it should be easy to replicate (see below). Sure, it doesn’t select the absolute best pics from an emotional impact perspective, but it was pretty damn good at saving me from wading through the bottom 80% of mediocre shots and dupes!
---
“Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.
Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”
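To make that recipe concrete, here is a minimal sketch of the weighting and dedupe step. The weights and the 0.85 cutoff come from the quoted summary; everything else is an assumption for illustration: the function names, the photo dict layout, that each model's score was already normalized to the same 0-1 range, and that "CLIP similarity" means cosine similarity between CLIP embeddings. The actual model inference is left out.

```python
import math
from typing import Dict, List, Sequence

# Weights from the quoted summary: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ.
WEIGHTS = {"nima": 0.4, "laion": 0.3, "musiq": 0.3}


def combined_score(scores: Dict[str, float]) -> float:
    """Weighted sum of per-model scores (assumed pre-normalized to 0-1)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_and_dedupe(photos: List[dict], sim_threshold: float = 0.85) -> List[str]:
    """Rank by combined score, then drop photos too similar to a better one."""
    ranked = sorted(photos, key=lambda p: combined_score(p["scores"]), reverse=True)
    kept: List[dict] = []
    for photo in ranked:
        # Keep a photo only if it is not a near-duplicate of any higher-ranked keeper.
        if all(cosine(photo["embedding"], k["embedding"]) < sim_threshold for k in kept):
            kept.append(photo)
    return [p["name"] for p in kept]
```

Deduping greedily after ranking means the best-scoring shot of each near-identical burst survives. The pairwise comparison is quadratic in the number of kept photos, which is fine for a few thousand images but would need an index for much larger archives.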
I thought about that in this video case and it's true. I thought the parent comment was making a broader statement about local models in general. But even with video, if it was stored in private cloud storage near the LLM, could this still have worked efficiently? What would be the most painful parts of this whole setup / work environment if everything was in the cloud?
Oh yes, if everything is cloud, then this is a non-issue.
The few other points of consideration would be:
1) Cost - I was considering using Sonnet for this but there's always the concern of reaching limits OR the API cost if you're using the API.
The feeling of knowing you have a capable model in your hands without any limits is actually pretty awesome. Your mind starts running at what else can I throw at it to do grunt work.
2) Privacy issues - same as with moving to cloud.
3) Reliability issues - I know from experience Claude uptime has been pretty bad the past few months.
4) Restrictions - Claude has been pretty heavy-handed with their restrictions lately; anything which remotely triggers their flags gets an instant denial (or worse, an account ban). Often these are false positives.
I love the value I get from Claude but there's a different kind of freedom you get with local, capable models.
Tbh, I did spend a lot of time trying to ground it and de-slopify it - verified nothing was hallucinated and went through 10 iterations to get to this. It's almost like wrestling with Claude and I knew it would be tough on HN.
But because of the fear of non-perfection, I used to put away things like creating this article or even posting it anywhere. And I do think the article has real value that HN would appreciate (I am myself an HN-enthusiast).
Also a side note, the blog is posted on my self-created Slopit.io platform which is purely meant for your personal agents (working along with you) to post content - I recommend trying it out.
https://blog.slopit.io/this-blog-post-is-slop/
I know, things are getting difficult with all the slop around, but my personal opinion is that as the agents get better at writing, the "annoying-ness" factor goes down and pieces of substance will still be appreciated, even if they were written by agents.
This and the fact that agents aren't going away.
If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.
Definitely agree with this. Here, Claude and I did that research by brainstorming together, plus some trial and error to get to this.
But I can tell it's only a matter of time before agents become smart enough to let my non-tech friends be able to just say "Make sense of all these videos in my folder" and it just does it.
Are you planning to open source it? Or maintain it in the future?