> Is there a better way to surf the web, retrieve the source code of the pages and extract data from them?
Yes, of course! To get the source code of a web site you don't need a browser and all its complexity. It makes me so sad how far we have come in terms of unnecessary complexity for simple tasks.
If you want to extract data from web pages without requiring hundreds of megabytes for something like Electron, there are lots of scraping libraries out there. There are for example at least two good Python implementations: Scrapy[1] and BeautifulSoup[2].
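To make that concrete, here is a minimal BeautifulSoup sketch that extracts links from static HTML (the markup is a made-up example; for real pages you'd fetch the HTML first with something like `urllib` or `requests`):

```python
from bs4 import BeautifulSoup

# A stand-in for HTML you'd normally download from a site.
html = """
<html><body>
  <h1>Scraping libraries</h1>
  <ul class="links">
    <li><a href="https://scrapy.org/">Scrapy</a></li>
    <li><a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out exactly the elements you care about.
links = {a.get_text(): a["href"] for a in soup.select("ul.links a")}
print(links)
```

No browser, no JS engine, just an HTML parser over the raw source, which is the whole appeal for server-rendered pages.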
This sounds nice, but many modern web pages use extensive client-side rendering. Sure, you can work around that without needing a full JS environment, but doing so is ad hoc and you wind up having to write complex code on a per-site basis.
I do a bunch of web scraping for hobby shit, and I'd love to not have to shell out to Chromium for some sites, but unfortunately the modern web basically means you're stuck with it.
Correct me if I’m wrong, but neither one supports JavaScript-rendered pages?
You’re right about the overhead though; I’d stay miles away from Electron for scraping, but you’ll need more than a cURL wrapper to properly fetch data in all shapes and sizes :) Headless Chromium does do the trick in that regard.
With web scraping you typically don’t want the visuals anyway. JS rendered applications are usually easier to scrape because they have data in a more raw or canonical format available somewhere to do that rendering.
Plenty of websites will only render the content fully after some JavaScript runs, so to properly scrape them you do indeed need a browser to process the JS. This includes text content.
JavaScript-rendered pages load JS which in turn calls some REST API to get data and uses that to render the content. So the web scraper stops scraping the HTML and instead calls and scrapes the REST API endpoint directly.
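As a sketch of that pattern: once you've found the API endpoint (typically via the browser's network tab), you get clean JSON instead of rendered HTML. The payload below is entirely made up for illustration; a real scraper would fetch it with `requests` or `urllib` from the site's actual endpoint:

```python
import json

def extract_titles(raw_json: str) -> list[str]:
    """Pull the fields we want out of an API response body."""
    data = json.loads(raw_json)
    return [item["title"] for item in data["items"]]

# A stand-in for the JSON the page's own JS would fetch and render.
api_response = json.dumps({
    "items": [
        {"title": "First post", "score": 42},
        {"title": "Second post", "score": 17},
    ]
})

print(extract_titles(api_response))
```

Parsing JSON like this is usually far more robust than scraping the HTML the JS would have rendered from it, since the structure is explicit and rarely changes with the page layout.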
Sure, but I meant to build a portable app, for end users who are not coders, with a GUI, and for a dedicated purpose, like for example navigating Facebook.
So I will edit the question to this: Is there a better way to code a portable application with a graphical user interface to scrape a given site?
Look up robotic process automation and visual web scraping. Web scraping without having to write code is a well-established field. Just not very popular with the HN crowd, for obvious reasons.
Some examples would be Scrapinghub's Portia system and the Kantu startup. There are also established players like UiPath and Visual Web Ripper.
As the saying goes: "two wrongs don't make a right." Facebook's ToS is still a ToS. If you want to scrape the data that they've collected, either risk your account due to it being against the ToS or collect the data yourself.
Good luck with that :) Any modern website requires a JavaScript interpreter on the client side, so unless you provide some sort of JavaScript interpretation (which can be messy), you'll only be able to scrape simple content with Scrapy/BS.
[1]: https://scrapy.org/
[2]: https://www.crummy.com/software/BeautifulSoup/