
Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc. – and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics most comparable to Swift, the difference being that with Swift you get a ton of safety and speed in exchange for the trade-off.
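
For example, a quick (untested) sketch, with the rough Python equivalents in the comments:

    const text = " a,b \n\n c,d ";
    // Python: rows = [line.strip().split(",") for line in text.splitlines() if line.strip()]
    const rows = text
      .split("\n")
      .map(line => line.trim())
      .filter(line => line.length > 0)
      .map(line => line.split(","));

    // Python: last_three = items[-3:]
    const items = [1, 2, 3, 4, 5];
    const lastThree = items.slice(-3);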

If your whole stack is JS and you need a little bit of web scraping, this makes sense. If you're starting a new scraping project from scratch, I think you'll get far further, faster, with Python or Ruby.



Actually, if you're scraping at any scale above a hobby project, most of your web scraping hours will be spent avoiding bot detection, reverse-engineering APIs, and trying to make HTTP requests work where it seems only a browser will do. The time spent "working with strings" is not even noticeable to me.

I scrape for a living and I work with JS because, currently, it has the better tools.
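
To give a flavour, the "look like a browser" part often starts with something like this (a minimal sketch for Node 18+; the URL and header values are placeholders, and serious bot detection needs far more than this, usually a real headless browser):

    // Plain HTTP request dressed up with browser-like headers.
    // Run inside an async function or an ES module.
    const res = await fetch("https://example.com/api/items", {
      headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",
      },
    });
    const data = await res.json();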


I can echo this.

I'm currently working to turn my hobby scraper into something profitable. "Working with strings" is already the least of my concerns. I've spent most of my time finding an architecture / file structure that allows me to

- easily handle markup changes on source-pages and

- quickly integrate new sources

I had feared it would be impossible to handle unexpected structural changes from a multitude of sources. Turns out that rarely happens. Like, once every x years per source's page type.


I'm not sure if Puppeteer/Playwright might be a better option for some of those instances, with real Chrome.

I prefer JS + jsdom/cheerio as it's closer to the in-browser experience for scraping.
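
For example, with cheerio you keep the familiar CSS-selector workflow (a minimal sketch; the markup and selector are made up):

    // npm install cheerio
    const cheerio = require("cheerio");

    const html = '<ul><li class="item">First</li><li class="item">Second</li></ul>';
    const $ = cheerio.load(html);

    // Same selectors you'd use in the browser console.
    const items = $("li.item").map((i, el) => $(el).text()).get();
    console.log(items); // ["First", "Second"]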


>Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc. – and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics most comparable to Swift, the difference being that with Swift you get a ton of safety and speed in exchange for the trade-off.

I think from ES6 and up this is handled pretty well.


It has made things better, but slice operators are still missing (and they help a lot), the Set/Map types aren't that great to use and aren't used much in practice, and there are still lots of sharp edges for newcomers, even with simple things like iteration. And that's not mentioning things like the itertools/collections modules in Python, which provide some rich types that come in handy.
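
To make that concrete, a rough sketch of the kind of thing I mean (Python equivalents in the comments):

    // Python: s[2:-2] -> JS has no slice syntax, only the method:
    const inner = "__hello__".slice(2, -2); // "hello"

    // Python: a & b -> JS Sets have no built-in intersection (short of the recent set-methods proposal):
    const a = new Set([1, 2, 3]);
    const b = new Set([2, 3, 4]);
    const both = new Set([...a].filter(x => b.has(x))); // Set {2, 3}

    // Python: collections.Counter(words) -> hand-rolled with a Map:
    const counts = new Map();
    for (const w of ["a", "b", "a"]) counts.set(w, (counts.get(w) ?? 0) + 1);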


It’s certainly possible to make itertools-like stuff in JavaScript.

https://observablehq.com/@jrus/itertools
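
For example, generators get you surprisingly far (a toy sketch, not the notebook's code):

    // Lazy range() and chain(), in the spirit of Python's itertools.
    function* range(start, stop, step = 1) {
      for (let i = start; i < stop; i += step) yield i;
    }

    function* chain(...iterables) {
      for (const it of iterables) yield* it;
    }

    console.log([...chain(range(0, 3), "ab")]); // [0, 1, 2, "a", "b"]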


Seems like the slice operator is mostly syntactic sugar for substring?


I'm not the author of the comment you're replying to, but doesn't that fall under the worse-ergonomics argument?


I am the author, and yes, this is exactly the sort of thing I mean.

Python/Ruby are far more expressive for these sorts of data manipulation tasks.


Yeah, there's a reason that I reach for Python for web scraping.

String manipulation and collections in Python are not an afterthought; the syntax and API make them convenient and easy to use.


Why manipulate strings at all?

Text nodes are DOM nodes. If I were making a business of this, I would automate the shit out of it by:

1) Gather all text nodes directly

2) Eliminate all text nodes that contain only whitespace

3) Add context. Since text nodes are DOM nodes you can get information about the containing element directly from the node itself.

Hands down, walking the DOM will be faster to write and faster to execute than anything else you can come up with in any language.

Here is some tiny code that does just that: https://github.com/prettydiff/semanticText
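
The core of the approach looks roughly like this in the browser or a jsdom window (a minimal sketch of the idea, not the linked library's code):

    // 1) Gather text nodes, 2) drop whitespace-only ones, 3) keep context.
    const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
    const results = [];
    let node;
    while ((node = walker.nextNode())) {
      if (!node.nodeValue.trim()) continue; // whitespace-only
      results.push({
        text: node.nodeValue.trim(),
        // Context comes free: the text node knows its containing element.
        tag: node.parentElement.tagName,
        className: node.parentElement.className,
      });
    }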


Do demonstrate: those are pretty basic operations, and all scripting languages handle them equally well. I don't see the benefit of a typed language here. Most parsing for me has been a bunch of XPaths, then (named) regexes on the resulting text nodes. Never needed anything more than those two.
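
For example, in the browser (jsdom also implements document.evaluate), with a made-up page structure and regex:

    // XPath to find the nodes, then a named-group regex on the text.
    const snapshot = document.evaluate(
      "//div[@class='listing']//span[@class='price']/text()",
      document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
    );
    const priceRe = /(?<currency>[$€£])(?<amount>[\d,.]+)/;
    for (let i = 0; i < snapshot.snapshotLength; i++) {
      const m = snapshot.snapshotItem(i).nodeValue.match(priceRe);
      if (m) console.log(m.groups.currency, m.groups.amount);
    }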


I agree (Ruby + Selenium is a great combination), but in situations where I was forced to use Puppeteer, switching to the browser context and executing native JavaScript was quite easy, since you can write the code first in the browser console.
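
That workflow translates almost directly; a minimal sketch (the URL and selector are placeholders):

    // npm install puppeteer
    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://example.com");
      // Code written in the browser console runs as-is inside evaluate():
      const titles = await page.evaluate(() =>
        [...document.querySelectorAll("h2")].map(el => el.textContent.trim())
      );
      console.log(titles);
      await browser.close();
    })();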


Where would you get started with Python web scraping?



I would recommend lxml over Beautiful Soup: it has a much bigger API, which means more options for parsing awkward bits; it copes well with badly formatted markup; and it's considerably faster.



