I don’t understand why the agents requesting HTML can’t extract text from HTML themselves. You don’t have to feed the entire HTML document to your LLM. If that’s wasteful, why not have a little bit of glue that does some conversion?
Through my work simplifying Firecrawl[2], I learned that the golang CLI[1] is the best. However, in this case I used one available through npmjs so that it would work with `npx` for the CF worker builds.
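For the build step the idea is just to emit a markdown sibling next to every generated HTML file. A rough sketch of that (the turndown dependency and the `dist` directory are assumptions for illustration, not the actual package I used):

```ts
// Build-time sketch: walk the build output and write a .md sibling for every
// .html file, so a proxy can later serve markdown for the same paths.
// turndown and the "dist" directory are assumptions, purely for illustration.
import { readdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";
import TurndownService from "turndown";

const turndown = new TurndownService();

async function convertDir(dir: string): Promise<void> {
  for (const entry of await readdir(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) {
      await convertDir(path);
    } else if (entry.name.endsWith(".html")) {
      const html = await readFile(path, "utf8");
      await writeFile(path.replace(/\.html$/, ".md"), turndown.turndown(html));
    }
  }
}

await convertDir("dist");
```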
It's always better for the agent to have fewer tools, and this approach means you get to avoid adding a "convert HTML to markdown" tool, which improves efficiency.
Also, I doubt most large-scale scrapers are running in agent loops with tool calls, so this is probably necessary for those at a minimum.
This does not make any sense to me. Can you elaborate on this?
It seems “obvious” to me that if you have a tool which can request a web page, you can make it so that this tool extracts the main content from the page’s HTML. Maybe there is something I’m missing here that makes this more difficult for LLMs, because before we had LLMs, this was considered an easy problem. It is surprising to me that the addition of LLMs has made this previously easy, efficient solution somehow unviable or inefficient.
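For example, main-content extraction is still a handful of lines with off-the-shelf libraries. A sketch (using jsdom and Mozilla's Readability purely as an illustration of how such a tool could work, not how any particular agent does it):

```ts
// Sketch of a "fetch a page, return only the main content" tool.
// jsdom + @mozilla/readability are illustrative choices; no LLM involved.
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

export async function fetchMainContent(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  // Fall back to the raw body text if Readability can't find an article.
  return article?.textContent ?? dom.window.document.body.textContent ?? "";
}
```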
I think we should also assume here that the website is designed to be scraped this way; if you don’t assume that, then “Accept: text/markdown” won’t work.
If you have a website and you're optimizing it for GEO, you can't assume that the agents are going to have the glue. So as the person maintaining the website you implement as much of the glue as possible.
That sounds completely backwards. It seems, again, obvious to me that it would be easier to add HTML->markdown converters to agents, given that there are orders of magnitude more websites out there than agents.
If your agent sucks so bad that it isn’t capable of consuming HTML without tokenizing the whole damn thing, wouldn’t you just use an agent that isn’t such a mess?
This whole thing kinda sounds crazy inefficient to me.
Well that's what I implemented. There are markdown docs for every HTML file and the proxy decides to serve either markdown or HTML based on the Accept header.
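A minimal sketch of that kind of Accept-header negotiation as a Cloudflare Worker (illustrative only; the path mapping and fallback behavior here are assumptions, not the exact implementation):

```ts
// Sketch: serve a pre-generated .md sibling when the client asks for markdown,
// otherwise fall through to the normal HTML response.
export default {
  async fetch(request: Request): Promise<Response> {
    const accept = request.headers.get("Accept") ?? "";
    const url = new URL(request.url);

    if (accept.includes("text/markdown")) {
      // Map /guide, /guide.html or /guide/ to the markdown doc for that page.
      let path = url.pathname.replace(/\.html?$/, "");
      if (path.endsWith("/")) path += "index";
      const mdUrl = new URL(url);
      mdUrl.pathname = path + ".md";

      const mdResponse = await fetch(new Request(mdUrl.toString(), request));
      if (mdResponse.ok) {
        return new Response(mdResponse.body, {
          status: 200,
          headers: { "Content-Type": "text/markdown; charset=utf-8" },
        });
      }
      // No markdown doc for this path: fall back to HTML below.
    }
    return fetch(request);
  },
};
```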
I think GP meant on the client, i.e. agent side. As in, you could deploy this kind of proxy as a forward proxy (rather than a reverse one) inside the agent system, so the LLM always gets markdown, regardless of what the site supports.
There is no real reason to pass HTML with tags and all to the LLM - you can just strip the tags beforehand.
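A sketch of that client-side glue, assuming the turndown npm package (any HTML-to-markdown converter or plain tag stripper would do), so the model only ever sees markdown regardless of what the site returns:

```ts
// Agent-side glue: ask for markdown, and if the site only speaks HTML,
// convert it locally before anything reaches the model.
// turndown is an assumption here; any HTML-to-markdown library would do.
import TurndownService from "turndown";

const turndown = new TurndownService();

export async function fetchForLLM(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: { Accept: "text/markdown, text/html;q=0.8" },
  });
  const body = await response.text();
  const contentType = response.headers.get("Content-Type") ?? "";

  // Site already served markdown: pass it through untouched.
  if (contentType.includes("text/markdown")) return body;

  // Otherwise strip the tags by converting the HTML to markdown locally.
  return turndown.turndown(body);
}
```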