Enriching our Research Bibliography with Wikipedia URLs

At the beginning of February I spotted this dataset, which maps PubMed IDs and DOIs onto the IDs of the Wikipedia articles that cite them. Of course I had to see how many DOIs from our research bibliography appear in this list. My first iteration ran the check against our Solr instance in a single thread, which took about an hour and a half.
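The single-threaded pass might look roughly like this sketch. The Solr host, core name, and the `doi` field are assumptions; adjust them to your own instance.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical Solr endpoint -- replace host and core name with your own.
SOLR_URL = "http://localhost:8983/solr/bibliography/select"

def doi_query_url(doi):
    """Build a Solr select URL that asks only for the hit count of a DOI."""
    params = urlencode({"q": 'doi:"%s"' % doi, "rows": "0", "wt": "json"})
    return SOLR_URL + "?" + params

def doi_in_index(doi):
    """Return True if the DOI occurs in the index (numFound > 0)."""
    with urlopen(doi_query_url(doi)) as resp:
        return json.load(resp)["response"]["numFound"] > 0
```

Requesting `rows=0` keeps the responses small, since only the `numFound` count is needed to decide whether a DOI is present.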

After that I refactored my code to send four concurrent requests, but ran into a problem: Jetty was intermittently dropping connections. After some searching I found the operating system was to blame; it was running out of ephemeral ports because closed connections linger in the TIME_WAIT state. To remedy this I widened the local port range and enabled fast recycling of TIME_WAIT sockets:
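The concurrent version can be sketched with a fixed-size thread pool; the function below is an illustration, not the author's actual code, and `check` stands in for whatever per-identifier lookup is used:

```python
from concurrent.futures import ThreadPoolExecutor

def check_all(identifiers, check, workers=4):
    """Run `check` over all identifiers with `workers` threads and
    return the identifiers for which it returned True."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, identifiers))
    return [ident for ident, hit in zip(identifiers, results) if hit]
```

Reusing one HTTP connection per worker thread (e.g. a keep-alive session) would also reduce the number of short-lived sockets and so soften the port-exhaustion problem described below.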

sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w net.ipv4.tcp_tw_recycle=1
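These `sysctl -w` changes do not survive a reboot; to make them permanent they can be added to /etc/sysctl.conf (a config fragment, paths and syntax as on a typical Linux system). Note that tcp_tw_recycle is known to break connections behind NAT and was removed entirely in Linux 4.12.

```shell
# /etc/sysctl.conf -- persist the tuning across reboots
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_recycle = 1
```

Running `sudo sysctl -p` applies the file without rebooting.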

Now it takes a mere four and a half minutes to check those half a million identifiers, roughly a twentyfold speedup. Mission accomplished...
