
Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc. – and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics most comparable to Swift, the difference being that with Swift you get a ton of safety and speed in exchange for the trade-off.
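
For example, a quick (untested) sketch, with the rough Python equivalents in the comments:

    const text = " a,b \n\n c,d ";
    // Python: rows = [line.strip().split(",") for line in text.splitlines() if line.strip()]
    const rows = text
      .split("\n")
      .map(line => line.trim())
      .filter(line => line.length > 0)
      .map(line => line.split(","));

    // Python: last_three = items[-3:]
    const items = [1, 2, 3, 4, 5];
    const lastThree = items.slice(-3);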

If your whole stack is JS and you need a little bit of web scraping, this makes sense. If you're starting a new scraping project from scratch, I think you'll get far further, faster, with Python or Ruby.



Actually, if you're scraping at any scale above a hobby project, most of your web scraping hours will be spent avoiding bot detection, reverse-engineering APIs, and trying to make HTTP requests work where it seems only a browser will do. The time spent "working with strings" is not even noticeable to me.

I scrape for a living and I work with JS because, currently, it has the better tools.
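
To give a flavour, the "look like a browser" part often starts with something like this (a minimal sketch for Node 18+; the URL and header values are placeholders, and serious bot detection needs far more than this, usually a real headless browser):

    // Plain HTTP request dressed up with browser-like headers.
    // Run inside an async function or an ES module.
    const res = await fetch("https://example.com/api/items", {
      headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",
      },
    });
    const data = await res.json();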


I can echo this.

I'm currently working to turn my hobby scraper into something profitable. "Working with strings" is already the least of my concerns. I've spent most of my time finding an architecture / file structure that allows me to

- easily handle markup changes on source-pages and

- quickly integrate new sources

I had feared it would be impossible to handle unexpected structural changes from a multitude of sources. Turns out that rarely happens. Like, once every x years per source's page type.


I'm not sure if Puppeteer/Playwright might be a better option for some of those instances, with real Chrome.

I prefer JS + jsdom/cheerio as it's closer to the in-browser experience for scraping.
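
For example, with cheerio you keep the familiar CSS-selector workflow (a minimal sketch; the markup and selector are made up):

    // npm install cheerio
    const cheerio = require("cheerio");

    const html = '<ul><li class="item">First</li><li class="item">Second</li></ul>';
    const $ = cheerio.load(html);

    // Same selectors you'd use in the browser console.
    const items = $("li.item").map((i, el) => $(el).text()).get();
    console.log(items); // ["First", "Second"]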


>Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc. – and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics most comparable to Swift, the difference being that with Swift you get a ton of safety and speed in exchange for the trade-off.

I think from ES6 and up this is handled pretty well.


It has made things better, but slice operators are still missing (and they help a lot), the Set/Map types aren't that great to use and aren't used much in practice, and there are still lots of sharp edges for newcomers, even with simple things like iteration. And that's not mentioning things like the itertools/collections modules in Python, which provide some rich types that come in handy.
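
To make that concrete, a rough sketch of the kind of thing I mean (Python equivalents in the comments):

    // Python: s[2:-2] -> JS has no slice syntax, only the method:
    const inner = "__hello__".slice(2, -2); // "hello"

    // Python: a & b -> JS Sets have no built-in intersection (short of the recent set-methods proposal):
    const a = new Set([1, 2, 3]);
    const b = new Set([2, 3, 4]);
    const both = new Set([...a].filter(x => b.has(x))); // Set {2, 3}

    // Python: collections.Counter(words) -> hand-rolled with a Map:
    const counts = new Map();
    for (const w of ["a", "b", "a"]) counts.set(w, (counts.get(w) ?? 0) + 1);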


It’s certainly possible to make itertools-like stuff in JavaScript.

https://observablehq.com/@jrus/itertools
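
For example, generators get you surprisingly far (a toy sketch, not the notebook's code):

    // Lazy range() and chain(), in the spirit of Python's itertools.
    function* range(start, stop, step = 1) {
      for (let i = start; i < stop; i += step) yield i;
    }

    function* chain(...iterables) {
      for (const it of iterables) yield* it;
    }

    console.log([...chain(range(0, 3), "ab")]); // [0, 1, 2, "a", "b"]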


Seems like the slice operator is mostly syntactic sugar for substring?


I'm not the author of the comment you're replying to, but doesn't that fall under the worse-ergonomics argument?


I am the author, and yes, this is exactly the sort of thing I mean.

Python/Ruby are far more expressive for these sorts of data manipulation tasks.


Yeah, there's a reason that I reach for Python for web scraping.

String manipulation and collections in Python are not an afterthought; the syntax and API make them convenient and easy to use.


Why manipulate strings at all?

Text nodes are DOM nodes. If I were making a business of this, I would automate the shit out of it by:

1) Gather all text nodes directly

2) Eliminate all text nodes that contain only whitespace

3) Add context. Since text nodes are DOM nodes you can get information about the containing element directly from the node itself.

Hands down, walking the DOM will be faster to write and faster to execute than anything else you can come up with in any language.

Here is some tiny code that does just that: https://github.com/prettydiff/semanticText
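
The core of the approach looks roughly like this in the browser or a jsdom window (a minimal sketch of the idea, not the linked library's code):

    // 1) Gather text nodes, 2) drop whitespace-only ones, 3) keep context.
    const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
    const results = [];
    let node;
    while ((node = walker.nextNode())) {
      if (!node.nodeValue.trim()) continue; // whitespace-only
      results.push({
        text: node.nodeValue.trim(),
        // Context comes free: the text node knows its containing element.
        tag: node.parentElement.tagName,
        className: node.parentElement.className,
      });
    }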


Do demonstrate: those are pretty basic operations, and all scripting languages handle them equally well. I don't see the benefit of a typed language here. Most parsing for me has been a bunch of XPaths, then (named) regexes on the resulting text nodes. Never needed anything more than those two.
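
For example, in the browser (jsdom also implements document.evaluate), with a made-up page structure and regex:

    // XPath to find the nodes, then a named-group regex on the text.
    const snapshot = document.evaluate(
      "//div[@class='listing']//span[@class='price']/text()",
      document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
    );
    const priceRe = /(?<currency>[$€£])(?<amount>[\d,.]+)/;
    for (let i = 0; i < snapshot.snapshotLength; i++) {
      const m = snapshot.snapshotItem(i).nodeValue.match(priceRe);
      if (m) console.log(m.groups.currency, m.groups.amount);
    }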


I agree (Ruby + Selenium is a great combination), but in situations where I was forced to use Puppeteer, switching to the browser context and executing native JavaScript was quite easy, since you can write the code first in the browser console.
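
That workflow translates almost directly; a minimal sketch (the URL and selector are placeholders):

    // npm install puppeteer
    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://example.com");
      // Code written in the browser console runs as-is inside evaluate():
      const titles = await page.evaluate(() =>
        [...document.querySelectorAll("h2")].map(el => el.textContent.trim())
      );
      console.log(titles);
      await browser.close();
    })();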


Where would you get started with Python web scraping?



I would recommend lxml over Beautiful Soup: it has a much bigger API, which means more options for parsing awkward bits; it copes well with badly formatted markup; and it's considerably faster.



