How to scrape the web gently with Node.js

Yesterday I was scraping a few thousand links from a single domain just a little too aggressively, and kept getting ECONNRESET and socket hang up errors instead of that sweet, sweet data I longed for. As usual, it took quite a bit of searching, testing and experimenting before I finally found a solution that worked.

If you (yes, you) are getting request error: read ECONNRESET and/or request error: socket hang up errors in Node when scraping websites for data, chances are it’s because you’re essentially bombarding the server with thousands of requests. The server at some point stops responding to your suspicious behavior and all those requests you launched start coming back empty-handed.

The correct way to address this issue depends on your code, of course. In my case, for example, limiting the concurrency of the promise array I was mapping over was no use, but for you that might be the easier solution. I’ll quickly go over two alternative solutions which may work for you before saying a few words about the one that worked for me.
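
If limiting promise concurrency does turn out to be the easier fix for you, here is a minimal sketch of the idea. It is not the code from my scraper: mapWithLimit is a hypothetical helper name, and the worker function you pass in (whatever fetches a single page) is assumed to return a promise.

```javascript
// Hypothetical helper (not from any library): run `worker` over `items`,
// with at most `limit` calls in flight at any one time.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner keeps claiming the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start `limit` runners (or fewer if there are not that many items).
  const runners = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) {
    runners.push(runner());
  }

  // Resolves once every runner has finished; rejects on the first error.
  await Promise.all(runners);
  return results;
}
```

Something like mapWithLimit(urls, 5, fetchPage) would then keep at most five requests open at once, where urls and fetchPage stand in for your own list of links and your own request function. Results come back in the same order as the input.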


Asynchronous tests in Mocha using before and after blocks

While writing my Hexo plugin, I resolved to switch my coding style for my next project from ‘whatever works’ to BDD/TDD. I’d wasted a bit too much time pushing buggy code and fixing it after the fact. Unit testing was the obvious solution.

Honestly, I still don’t understand what exactly the difference is between Test-Driven and Behavior-Driven Development, other than the different syntax (TDD uses assert and BDD uses expect or should, or at least this seems to be the case). I’ll look into it more later, but for now I just wanted to get my test on.

Writing tests with Mocha and Chai before writing any code for my new Node project was a bit of a culture-shock, but it was relatively easy and painless. That is, until I ran into asynchronicity issues. Suddenly, my tests started timing out and it took quite some time (despite great documentation) to finally get them right.

Mocha tests screenshot

I suppose others out there could benefit from reading how I structured my async test cases. I’m obviously not saying I’m an expert, since I only just started to seriously use Mocha myself. But this works. If you are pulling your hair out in despair over failing async unit tests and, by some stroke of chance (or through some adept googling), you arrived on this website, this just might be your ticket to sanity.


The new-on-GitHub blues and my new project: command line Arch wiki

Since I got back from Belgium last week, I’ve been stuck in a bit of a rut, GitHub-wise. I finished my Hexo plugin, so I had to start looking for something else to do while I finish up my reading.

First I started building a Reddit bot, but after a few hours I saw that someone else had already done exactly what I’d wanted to do. I didn’t feel like continuing to write Python anyway (and the Node helper libraries aren’t as well developed), so I let the project go.

Then I figured I’d just fix some Hexo issues. It looked like I was making some headway solving what I thought I had identified as a glaring bug in the code.

Sadly, there turned out to be a very good reason why the code looked the way it did, and my ‘solutions’ ended up causing a lot more problems down the line. So after two whole days of tinkering I had to write that off too as an epic fail.


Richard Stallman on Piracy

Just got back from a short trip to Belgium. Visited my parents and some old friends, but also went to hear Richard Stallman speak at a Free Software Foundation event in Gent.

GNU logo

I can’t say I agreed with everything he said, but it was a fantastic experience to see and hear him in person. There’s no doubt the man is a living legend.

One of the many things he touched on in his two-hour speech was piracy.
