Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc. – and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics here most comparable to Swift, the difference being that with Swift you get a ton of safety and speed in exchange for the awkwardness.
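To illustrate the kind of low-level cleanup I mean, here's a contrived Python sketch (toy data, nothing site-specific):

```python
# Typical scraping cleanup: strip whitespace, drop blank rows,
# slice off the header row, split and trim cells.
rows = ["  Name | Price  ", "", " Widget | $9.99 ", " Gadget | $19.99 "]
cleaned = [r.strip() for r in rows if r.strip()]
records = [[c.strip() for c in row.split("|")] for row in cleaned[1:]]
print(records)  # [['Widget', '$9.99'], ['Gadget', '$19.99']]
```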
If your whole stack is JS and you need a little bit of web scraping, this makes sense. If you're starting a new scraping project from scratch, I think you'll get far further, faster, with Python or Ruby.
Actually, if you're scraping at any scale above a hobby project, most of your web-scraping hours will be spent avoiding bot detection, reverse-engineering APIs, and trying to make plain HTTP requests work where it seems only a browser will do. The time spent "working with strings" isn't even noticeable to me.
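As a rough illustration of the "make plain HTTP requests look like a browser" part – real sites need far more than this (cookies, TLS fingerprints, etc.), and the URL and header values here are just examples:

```python
import requests

# Browser-like headers; many sites reject the default python-requests
# user agent outright.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}
resp = requests.get("https://example.com/listings", headers=headers, timeout=10)
print(resp.status_code)
```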
I scrape for a living and I work with JS because, currently, it has the better tools.
I'm currently working to turn my hobby scraper into something profitable. "Working with strings" is already the least of my concerns. I've spent most of my time finding an architecture / file structure that lets me
- easily handle markup changes on source pages, and
- quickly integrate new sources.
I feared it would be impossible to handle unexpected structural changes across a multitude of sources. Turns out that rarely happens – like, once every x years per source's page type. (A rough sketch of the structure is below.)
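For what it's worth, the shape I landed on is roughly this – all names hypothetical, just a sketch: one module per source behind a shared interface, so a markup change touches exactly one file, and a new source is one new module plus a registry entry.

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    url: str

class ExampleSource:
    """All example.com-specific selectors live here and nowhere else."""
    name = "example.com"

    def extract(self, html: str) -> list[Item]:
        ...  # site-specific XPath/CSS logic goes here

SOURCES = [ExampleSource()]  # adding a source = one module + one entry here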
>Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc. – and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics here most comparable to Swift, the difference being that with Swift you get a ton of safety and speed in exchange for the awkwardness.
I think from ES6 and up this is handled pretty well.
It has made things better, but slice syntax is still missing (and it helps a lot), the Set/Map types aren't that great to use and aren't used much in practice, and there are still lots of sharp edges for newcomers, even with simple things like iteration. That's without mentioning Python's itertools and collections modules, which provide rich types that come in handy.
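To be concrete, here's the sort of thing that's a one-liner in Python and still awkward in JS (toy data):

```python
from collections import Counter
from itertools import groupby

prices = [3, 3, 5, 8, 8, 8, 13]
print(prices[1:-1:2])                  # slice syntax: [3, 8, 8]
print(Counter(prices).most_common(1))  # most frequent value: [(8, 3)]
# run-length encode consecutive values: [(3, 2), (5, 1), (8, 3), (13, 1)]
print([(k, sum(1 for _ in g)) for k, g in groupby(prices)])
```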
Do demonstrate – those are pretty basic operations, and all scripting languages handle them equally well. I don't see the benefit of a typed language here. Most parsing, for me, has been a bunch of XPaths and then (named) regexes on the resulting text nodes. I've never needed anything more than those two.
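For reference, that xpath-then-named-regex pattern looks roughly like this with lxml (made-up markup):

```python
import re
from lxml import html

doc = html.fromstring("<ul><li>Widget - $9.99</li><li>Gadget - $19.99</li></ul>")
pattern = re.compile(r"(?P<name>.+?) - \$(?P<price>[\d.]+)")
for text in doc.xpath("//li/text()"):  # step 1: xpath to the text nodes
    m = pattern.match(text)            # step 2: named regex on the text
    if m:
        print(m.group("name"), m.group("price"))
```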
I agree (Ruby + Selenium is a great combination), but in situations where I was forced to use Puppeteer, switching to the browser context and executing native JavaScript was quite easy, since you can write the code in the browser console first.
I would recommend lxml over Beautiful Soup: it has a much bigger API (which means more options for parsing awkward bits), it copes well with badly formatted markup, and it's considerably faster.
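A quick illustration of the recovery behavior (trivial example):

```python
from lxml import html

# libxml2's HTML parser recovers from markup like unclosed <p> tags
tree = html.fromstring("<div><p>First<p>Second")
print(tree.xpath("//p/text()"))  # ['First', 'Second']
```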