ranyume's comments | Hacker News

Careful with that benchmark. It's LLMs grading other LLMs.

Well, if LMSYS showed anything, it's that human judges are measurably worse. Then you have your run-of-the-mill multiple-choice tests that grade models on unrealistic single-token outputs. What does that leave us with?

> What does that leave us with?

At the start, with no benchmark. Because LLMs can't reason at this time, because we don't have a reliable way of grading LLM reasoning, and because people stubbornly insist that LLMs are actually reasoning, we're back at the start. When you ask an LLM "2 + 2 = ", it doesn't add the numbers together; it just looks up one of the stories it memorized and returns what happens next. Probably in some such stories 2 + 2 = fish.

Similarly, when you ask an LLM to grade another LLM, it's just looking up what happens next in its stories, not following instructions; "following" instructions requires thinking. But you can say you're commanding the LLM, or programming the LLM, so you have full responsibility for what the LLM produces, and the LLM has no authorship. Put another way, the LLM cannot make something you yourself can't... at least not at this point, while it can't reason.


You have an outmoded understanding of how LLMs work (flawed in ways that are "not even wrong"), a poor ontological understanding of what reasoning even is, and too much certainty that your answers to open questions are the right ones.

My understanding is based on first-hand experimentation trying to make LLMs work on the impossible task of tastefully simulating an adventure game.

That's kind of nonsense, since if I ask you what's five times six, you don't do the math in your head, you spit out the value of the multiplication table you memorized in primary school. Doing the math on paper is tool use, which models can easily do too if you give them the option, writing ad hoc Python scripts to run the math you ask them about, with exact results. There is definitely a lot of generalization going on beyond just pattern matching; otherwise practically nothing of what everyone does with LLMs daily would ever work. Although it's true that the patterns drive an extremely strong bias.
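A minimal sketch of what that tool loop looks like, in Python, assuming a hypothetical call_llm stand-in for whatever chat API you'd use (stubbed here so the snippet actually runs):

    import subprocess
    import sys

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real chat-completion API call.
        # Stubbed so the sketch runs end to end; a real version would
        # send `prompt` to a model and return the script it writes back.
        return "print(5 * 6)"

    def solve_with_tool(question: str) -> str:
        # Ask the model for a script instead of a direct answer.
        script = call_llm(
            "Write a Python script that prints only the numeric answer to: "
            + question
        )
        # The interpreter, not the model, does the math, so the result is exact.
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=10,
        )
        return result.stdout.strip()

    print(solve_with_tool("What is 5 * 6?"))  # -> 30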

Arguably, if you're grading LLM output, which by your definition cannot be novel, then it doesn't need to be graded by something that can be. The gist of this grading approach is just giving a judge two outputs and asking which is better, so it's fairly arbitrary, but the grades will be somewhat consistent, and running it with different LLM judges and averaging the results should help at least a little. Human judges are completely inconsistent.
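A rough sketch of that judging setup, with hypothetical judge names and an ask_judge stub standing in for the real judge-model calls:

    import random
    from statistics import mean

    JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical names

    def ask_judge(judge: str, prompt: str, first: str, second: str) -> int:
        # Hypothetical LLM-judge call, stubbed with a coin flip so the sketch
        # runs. A real version would show `judge` both answers and parse a
        # "1" or "2" verdict out of its reply.
        return random.choice([1, 2])

    def pairwise_score(prompt: str, answer_a: str, answer_b: str) -> float:
        votes = []
        for judge in JUDGES:
            # Swap presentation order half the time to dampen position bias.
            if random.random() < 0.5:
                votes.append(1.0 if ask_judge(judge, prompt, answer_a, answer_b) == 1 else 0.0)
            else:
                votes.append(1.0 if ask_judge(judge, prompt, answer_b, answer_a) == 2 else 0.0)
        # Fraction of judges preferring answer_a; averaging smooths single-judge
        # noise, but does nothing about biases the judges share.
        return mean(votes)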


On the other hand, if you ask me what five times six in base eight is, I can spend a second and reply thirty-six. Is there an LLM able to do that yet?
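For the record, the arithmetic checks out: 5 × 6 = 30 in decimal, and 30 = 3 × 8 + 6, i.e. 36 in base eight. A one-liner to verify:

    # 5 * 6 = 30 in decimal; oct() renders the same value in base eight.
    print(oct(5 * 6))  # prints "0o36", i.e. 36 in base 8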

> if I ask you what's five times six, you don't do the math in your head, you spit out the value of the multiplication table you memorized in primary school

Memorization is one ability people have, but it's not the only one. In the case of LLMs, it's the only ability they have.

Moreover, let's make this clear: LLMs do not memorize the same way people do, they don't memorize the same concepts people do, and they don't memorize the same content people do. This is why LLMs "have hallucinations", "don't follow instructions", "are censored", and "make common-sense mistakes" (these are words people use to characterize LLMs).

> nothing of what everyone does with LLMs daily would ever work

It "works" in the sense that the LLM's output serves a purpose designated by the people. LLMs "work" for certain tasks and don't "work" for others. "Working" doesn't require reasoning from an LLM, any tool can "work" well for certain tasks when used by the people.

> averaging the results should help at least a little

Averaging the LLM grading just exacerbates the illusion of LLM reasoning. It only confuses people. Would you ask your hammer to grade how well scissors cut paper? You could do that, and the hammer would say they get the job done but don't cut well, because the hammer needs to smash the paper instead of cutting it; your hammer's just talking in a different language. It's the same here. The LLM's output doesn't necessarily measure what the instructions in the prompt say.

> Human judges are completely inconsistent.

Humans can be inconsistent, but how well the LLM adapts to humans is itself a metric of success.


Seems like a foreshock of AGI if the average human is no longer good enough to give feedback directly and the nets instead have to do recursive self-improvement themselves.

No, we're just really vain and like models that suck up to us more than those that disagree, even when the model is correct and the user is wrong. People also prefer confident, well-formatted wrong responses to plain correct ones, because we have deep narrow knowledge in our own field but know basically nothing outside of it, so we can't gauge the correctness of arbitrary topics.

OpenAI letting RLHF go wild with direct feedback is the reason for the sycophancy and the emoji-and-bullet-point pandemic that's infected most models that use GPTs as a source of synthetic data. It's why "you're absolutely right" is the default response to any disagreement.


Then, when you're not in the era you're supposed to be in, it's called a "regression" or "skipping stages". People are very stubborn about classifying development in terms of age or time.


You'd be surprised.


I'd call this 3DAssetGen. It's not a world model and doesn't generate a world at all. Standard sweat-and-blood-powered world building puts this to shame; even low-effort world building with canned assets does (see RPG Maker games).


It's not really a world, no. It generates only a small square, by the looks of it. And a world built out of squares will be annoying.

Still, it's a first effort. I do think AI can really help with world creation, which I think is one of the biggest barriers to the metaverse. When you see how much time and money it costs to create a small island world called GTA...


Last time I checked, the metaverse was all about people collaborating in the making of a shared world, and we already have this. Examples include Minecraft and VRChat, both of which are very popular metaverses. I don't see how the lack of bot content generation is a barrier.

Then, let's say people are allowed to participate in a metaverse in which they have the ability to generate content with prompts. Does this mean they're only able to build things the model allows or supports? That seems very limiting for a metaverse.


I don't mean for content creation to only be AI! I mean it could be a tool, especially for people who don't understand 3D design so well.

Minecraft makes it easy by using big blocks, but you can't have detail that way and it's very Lego-like. VRChat requires very detailed Unity knowledge; you really need to be a developer for that.

Horizons has its own in-world builder, but it's kinda boring because it's limited. I think this is where AI can come in: to realise people's vision where they lack the skills to develop it themselves. As a helper tool, not the only means of generation.


I guess that doesn't matter in games where the world ultimately doesn't matter; there it will just be better procedural generation. But personally I adore games where the developers actually put effort into designing a world that is interesting to explore, where things are deliberately placed for story or gameplay-mechanics reasons.

But I suppose AI could in theory reach the point where it understands the story/theme and gameplay of a game while designing a world.

But when anyone can generate a huge open world, who really cares? It's the same as it is now: you've got to make something that sticks out from the crowd, something notable.


It's the human guidance that makes it special. Low-effort, single-sentence prompt creation like Meta does here is super boring, of course.

But it can be a tool for people with great imagination but not the technical skills to make it real.

Every time we talk about AI, people think it will be used only as an easy-mode A-to-Z creator. That's possible, but it creates boring output. I view it more as a tool to assist with the difficult and tedious parts of content creation, so the designer can focus on the experience and not on tweaking the little things.


Nowhere on the page does it state that it's a world model.


It's called world gen.

I know nothing about games and game development, but comments insta-sticking up for BigCo are increasingly hilarious to me.


World generation is different from world modeling. It's like Java versus JavaScript. I'm not sure why I bother with technical discussion on Hacker News anymore.


My comment was too snarky. I take your point. Based on the discussion, this capability is closer to a really cool automated asset pack than to "building 3D worlds". My understanding of world modeling is that it points towards AGI, and you're saying nobody implied this is world modeling.

You're right. But the criticism is that it's closer to 2D asset packs than it is to 3D worlds, and you're being overly charitable to Meta and not charitable enough to the community response.

edit: this is just me oversharing why I downvoted you. I didn't intend for you to feel dismissed.


I used quick research and it was pretty cool. A couple of caveats to keep in mind:

1. It answers using only the crawled sites. You can't make it crawl a new page.

2. It doesn't use a page's search function automatically.

This is expected, but it doesn't hurt to keep in mind. I think it'd be pretty useful: you ask for recent papers on a site, the engine could use Hacker News' search function, and then Kagi would crawl the page.


What exactly do you mean by "you can't make it crawl a new page"? It has the ability to read webpages, if that's what you're referring to.


This query's results are wrong:

"""

site:https://hn.algolia.com/?dateRange=pastYear&page=0&prefix=fal... recent developments in ai?

"""

Also, when testing: if you know a piece of information exists on a website but it doesn't show up when you run the query, you don't have the tools to steer the engine to work more effectively. In a real scenario you don't know what the engine missed, but it'd be cool to steer the engine in different ways to see how that changes the end result. For example, if you're planning a trip to Japan, maybe you want the AI to only be shown a certain percentage of categories (say nature, nightlife, or specific places), alongside controlling how much time you want to spend crawling, maybe finding more niche information or more related information.
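Something like a hypothetical steering config, just to make the idea concrete (none of these knobs exist in Kagi today):

    # Hypothetical steering knobs for a research engine; nothing like this
    # exists in Kagi's product today, it's just to make the idea concrete.
    steering = {
        "category_weights": {"nature": 0.5, "nightlife": 0.3, "places": 0.2},
        "crawl_budget_seconds": 120,  # cap on time spent crawling
        "niche_bias": 0.7,            # 0 = mainstream sources, 1 = niche sources
    }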


Yup, pasting in a URL will cause it to fetch the page.


Their agents can natively read files and webpages from URLs. It's so convenient that I've implemented an identical feature for our product at work.


This happens in my workplace, but in a bad way. There's a push to use AI as much as we can to "boost productivity", and the one thing people don't want to do is write documentation. So what ends up happening is that we end up with a bunch of AI documentation that other AIs consume but that humans have a harder time following, because of the volume of fluff and AI-isms. Shitty documentation still exists, and it can be worse than before...


Other than humans getting apoplectic at the word "delve" and — em-dashes, can you explain, give some examples, or say more about how AI-isms hurt readability?


Having encountered this spread across our org's greenfield codebases, which made heavy use of AI in the last 90 days: restating the same information in slightly different formats, with slightly different levels of detail, in several places, in a way that is unnecessary, like a "get up and running quickly" guide in the documentation that has far more detail than the section it's supposed to be summarizing. Jarringly inconsistent ways of providing information within a given section (a list of endpoints and their purposes, followed by a table of other endpoints, followed by another list of endpoints).

Unnecessary bulleted lists all over the place that would read more clearly as single sentences or a short paragraph. Disembodied documentation files nested in the repos that restate the contents of the README, but in a slightly different format/voice. Thousands of single-line code comments that just restate what is already clear to the reader from the line they're commenting on. That's before getting into any code quality issues themselves.


I've noticed AI-generated docs frequently contain bulleted or numbered lists of trivialities, like file names - AI loves describing "architecture" by listing files with a 5-word summary of what each does, which is probably not much more informative than the file name. Superficially it looks like it might be useful, but it doesn't contribute any actually useful context and has very low information density.


A piece of information, or the answer to a question, may exist in the documentation but not in a format that's easily readable by humans. You ask the AI to add certain information, and it responds with "I already added it". But the AI doesn't "read" documents the way humans do.

For instance, say you need urgent action from other teams. To that end you order an AI to write a document and you give it the information. The AI produces a document following its own standard document format, with the characteristic AI fluff. But this won't work well, because upon seeing the urgent call for action the teams will rush to understand what they need to do, and they will be greeted by a corporate-PR-sounding document that does not address their urgent needs first and foremost.

Yes, you could tell the AI how to make the document little by little... but at that point you might as well write it manually.


> Restrictions can both help and hinder innovation

I'm not sure innovation is really impacted by restricting the private sector. Traditionally, innovation happens in public (e.g., universities) or military spaces.


This is extremely dubious. There are hundreds (thousands?) of examples of innovation happening in the private sector - I could name the blue LED off the top of my head, and found personal computers, search engines, smartphones, cloud computing, and integrated circuits with less than a minute of searching.


> why shouldn’t Netflix have the right to choose who they distribute content to?

power asymmetry


There are dozens of sources of online streaming entertainment, and it's not exactly a vital good.


Sure, Netflix may not be as important as, say, housing, food, or whatever else, but I think there is something to be said about the cultural importance of [at the very least some] film and television.

There's a lot of media worth studying, analyzing, and preserving. And in that sense, between the constant churn of catalog items, exclusive content, and the egregious DRM, I think these sorts of streaming services are, unfortunately, kind of harmful.


Doesn't your second paragraph run against the grain of your first? If streaming services like Netflix are harmful then we should avoid using them. Thus it should not be important for our freedom-preserving computers to be able to access Netflix.

Now, if you want to do an in-depth study of film and television material as a whole, you're actually better off avoiding Netflix and making use of archives such as public libraries, university libraries, and the Internet Archive.


I mean, I agree that you should be able to avoid things like Netflix and make use of libraries and other archives, but that's sort of the point: there is a ton of media that never even gets a physical release anymore. Once one of these platforms goes under, or something enters licensing hell or otherwise gets removed, all you can do is hope someone out there with both the know-how and the access went out of their way to illegally download a copy, illegally decrypt it, and illegally upload it somewhere.

I say "know-how" and "access" because, while I'd still argue decrypting, say, Widevine L3 is not exactly super common knowledge, decrypting things like 4K Netflix content, among other things, generally requires you to have something like a Widevine L1 CDM from one of the Netflix-approved devices, which typically sits in those hardware trusted execution environments, so you need an active valuable exploit or insider leaks from someone at one of the manufacturers.

On top of all of that, you need to hope other people have kept the upload alive by the time you decide to access it, and you often need access to various semi-elitist private trackers to consistently even find some of this stuff.

The legal issues with DRM here are hardly exclusive to Netflix and other streaming services, but at least in the case of things like Blu-rays or whatever — even if it is technically illegal in most countries to actually make use of virtually any backed-up disc due to AACS — you usually don't have the same time-pressure problem or the significant technical-expertise barrier.

> If streaming services like Netflix are harmful then we should avoid using them. Thus it should not be important for our freedom-preserving computers to be able to access Netflix.

I generally do avoid them whenever possible, though, yes. And I've explicitly disabled DRM support in Firefox on my computer. But I am just one person and I don't think my behavior reflects the average person, for better or for worse.


> decrypting things like 4K Netflix content, among other things, generally requires you to have something like a Widevine L1 CDM from one of the Netflix-approved devices, which typically sits in those hardware trusted execution environments, so you need an active valuable exploit or insider leaks from someone at one of the manufacturers.

Or just use a cheap Chinese HDMI splitter that strips HDCP 2.2 and record the 4K video with a simple HDMI capture device.

But if you are talking about preserving media or making media accessible, then it's not like we NEED 4K.


Yeah, there are a lot of torrent sites! Netflix doesn't want my business anymore, so I don't really care.


There exist dozens of online services where you can store your photos, doesn't mean companies should be allowed to do whatever they want with your photos...


TBH I don't care if Netflix wants to abuse such an asymmetry. I don't need Netflix in my life, so I'll just cancel my subscription (already have). I honestly don't want my lawmakers to spend even a second thinking about Netflix when we have so many larger issues in the world right now. If we were talking about something like financial services, where I have to engage, I would be more sympathetic.


Capital doesn't really care what you want; it will exert control regardless. So in this case Netflix will continue to be part of the capital that normalizes the need for DRM to access videos, writes IP law, and generally forces you into either accepting the world it wants or becoming a hermit.

Edit: I mean to say this is true whether or not you've even heard of the company.


Well, then I will get mad when that actually happens. Until then, I don't care.


The whole notion of DRM and penalties if you circumvent it comes from the entertainment industry, and it's written into law/official treaties. This already affects everything from secure boot to HDMI standards.


Which part of what I said do you think hasn't already happened and metastasized?


> Capital doesn't really care what you want; it will exert control regardless.

Working as intended. The market doesn't care what capital wants either.

> So in this case Netflix will continue to be part of the capital that normalizes the need for DRM to access videos

I can access video without DRM. If you want to access Netflix's service, that's on you.

> writes IP law

Netflix does not write IP law; our politicians do. Vote better.

> generally forces you into either accepting the world it wants or becoming a hermit

I don't accept their world, and I'm not a hermit.


...and it will be too late.


You did call out.


If you read my comment closely, I didn't deny calling anyone out.


Keeping local communities habitable is each individual's responsibility towards the community. This much should be ingrained in everyone. If you treat yourself as, and act like, a lone individual, you will never accomplish anything.

I don't intend this comment to be a "you're wrong" comment. I'm only saying that OP's POV rests on an assumption that can be damaging.

