Journal Articles

CVu Journal Vol 31, #6 - January 2020 + Internet Topics

Title: How to Stay Out of a Webmaster’s Bad Books

Author: Bob Schmidt

Date: Sat, 04 January 2020 17:26:18 +00:00

Summary: Silas S. Brown demonstrates how not all online resources are created equal.

Body: 

I’ve made a Chinese-English ‘dictionary supplement’ of about 40,000 names and other things that aren’t normally defined in dictionaries but are nevertheless useful to be recognised by text analysis tools.

Being public-spirited, I thought it was a good idea to put the resulting collection on my home page as a downloadable text file that you can import into the software of your choice.

That was fine until one month the file got over 70 gigabytes of traffic from China, which the Apache server logs show came from about 1,600 machines in various provinces, each downloading the whole file an average of 30 times. The worst offender was a single machine downloading the file 40 times per day.

The logs showed that all of the offending machines were claiming to be Chrome 51 (a 3-year-old browser), all the timestamps were during the day in China time, and in some cases the download was stopped part-way through the file. So my guess is somebody tried to write a search tool, used a fetch library claiming to be Chrome 51, and set it to read from my server until the search string was found or the end of the file was reached. How would we critique such code?

Rule 1: Cache locally if possible

You wrote a search tool? Great! Now, do you think it’s possible somebody might want to use it more than once? After all, how often have you searched for a word, found the results not to your satisfaction, and thought of another word you could try instead? Might your users behave the same way? If you’re downloading all 3 million bytes of my text file for every single search, that could multiply up pretty fast. Obviously you should try to store that file locally if you can, and check for updates only occasionally. (And when you do check for updates, there’s an HTTP header called If-Modified-Since you can use to avoid a download altogether in the case of no change. The real Chrome 51 does this, which is part of the reason why I don’t think your tool is right to pretend to be Chrome 51. But if it’s too difficult to send an If-Modified-Since header, I won’t mind an unconditional download as long as its frequency is low enough, say once a week at the most.)
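As a rough illustration of this rule, here is a minimal Python sketch using only the standard library. The URL, the file name and the function names are placeholders invented for the example, not anything from the tool in question. It keeps a local copy, contacts the server at most once a week, and sends If-Modified-Since when it does. For simplicity it uses the local file's modification time both as the 'last checked' time and as the If-Modified-Since value, which assumes the two machines' clocks roughly agree; saving the server's own Last-Modified header would be more robust.

```python
import os
import time
import urllib.error
import urllib.request
from email.utils import formatdate

# Placeholder values for illustration only.
SOURCE_URL = "https://example.org/dictionary-supplement.txt"
CACHE_PATH = "dictionary-supplement.txt"
MIN_CHECK_INTERVAL = 7 * 24 * 60 * 60  # seconds: check at most once a week


def needs_check(cache_path, now=None, min_interval=MIN_CHECK_INTERVAL):
    """True if there is no local copy, or it was last checked too long ago."""
    if not os.path.exists(cache_path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(cache_path) >= min_interval


def build_request(url, cache_path):
    """Build a GET request, adding If-Modified-Since if a local copy exists."""
    request = urllib.request.Request(url)
    if os.path.exists(cache_path):
        stamp = formatdate(os.path.getmtime(cache_path), usegmt=True)
        request.add_header("If-Modified-Since", stamp)
    return request


def refresh(url=SOURCE_URL, cache_path=CACHE_PATH):
    """Update the local copy if a check is due; return the path to search."""
    if not needs_check(cache_path):
        return cache_path  # recent enough: no network traffic at all
    try:
        with urllib.request.urlopen(build_request(url, cache_path)) as response:
            data = response.read()
        with open(cache_path, "wb") as f:
            f.write(data)
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise
        # 304 Not Modified: keep the copy and restart the weekly clock.
        os.utime(cache_path)
    return cache_path
```

Every search then reads the file returned by refresh(), so a thousand searches in a week cost the server at most one request.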

But let’s say you are a beginner at coding and you don’t know how to store a file locally. (But you’re still going to have 1,600 active users across China.) Then what? Well, how about this idea: put the file on your own server! Or if you don’t have your own server, find a service that will host it for you, but you must be the responsible person who owns the account on that service, not me. When I say my file is free (liberally licensed or public domain), I mean please copy it, and I say ‘copy’ precisely because I do not want all your traffic to come back to my original server. Yes, I would appreciate people being up-to-date with the original (I don’t really like seeing years-old outdated copies of my stuff floating around), but I shan’t mind a reasonable delay before your mirror updates itself, and I’d much prefer that to having to foot the bill for your excessive traffic. Free is free copying-wise, but if your app might be big, you should at least do the right thing and help with the back-end infrastructure. Fair enough?

Rule 2: Identify yourself

When you fetch a resource from a Web server, you send it a ‘User-Agent’ string to identify your browser. Some servers have a list of ‘acceptable’ browsers and block everything else, which annoys me, especially when by doing this they inadvertently block tools used by blind people. Consequently, some downloading tools are in the habit of choosing a string from an ‘acceptable’ browser (for example, Chrome 51 from 2016) and pretending to be that. But this is bad practice, because it prevents webmasters from finding you if there is a problem. So I suggest:

  1. Try setting your User-Agent to the actual name of your app, preferably with a URL, or at least something I can type into Google to find you. That way, if it starts misbehaving, the webmaster can actually start talking with you and working out a mutually acceptable solution, instead of just coming up with ways to block your app. True, some webmasters might say “I don’t have time for this” and block you anyway, but it’s surprising how many of us are in fact nice enough to try contacting you first if it’s obvious how to do so.
  2. Only if (1) turns out not to work because a misguided webmaster has set a list of ‘acceptable’ browsers, should you then start thinking about claiming to be an ‘acceptable’ browser to get past that misguided test. Even then, it’s still a good idea to try identifying your app as an ‘extension’ to the acceptable browser, i.e. try sending the Chrome string with ‘(actually XYZ app)’ added to the end.

Sending the Chrome string with nothing else added should be an absolute last resort. It may force the server administrator to block that particular version of Chrome (as I did) when they can’t find out how to open a dialogue with you.
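The two suggestions above might look something like this in Python's standard library. This is only a sketch: the app name, version, URL and function names are hypothetical, and the exact layout of the string is a common convention, not a requirement.

```python
import urllib.request

# Hypothetical identity for illustration; use your app's real name and a
# page where a webmaster can find out how to contact you.
APP_NAME = "ExampleDictSearch"
APP_VERSION = "1.0"
APP_URL = "https://example.org/dictsearch"


def honest_user_agent():
    """Suggestion 1: the app's real name, with a URL for contact."""
    return f"{APP_NAME}/{APP_VERSION} (+{APP_URL})"


def fallback_user_agent(browser_string):
    """Suggestion 2: only for servers that reject unknown agents.

    Keeps the 'acceptable' browser string but still says who we are.
    """
    return f"{browser_string} (actually {APP_NAME}; +{APP_URL})"


def identified_request(url, user_agent=None):
    """Build a request that carries an identifying User-Agent header."""
    agent = user_agent or honest_user_agent()
    return urllib.request.Request(url, headers={"User-Agent": agent})
```

A request built with identified_request(url) can be passed to urllib.request.urlopen as usual; the fallback string should only be tried after the honest one has actually been refused.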

But let’s suppose you’re such a beginner that you can’t figure out how to change the User-Agent string in your library, nor can you figure out how to change to a better library (I don’t know which library it is that defaults to pretending to be Chrome 51, but I don’t expect it to be a good one). Now what? How about this: see if the webmaster has published an email address (as I have), and send a courtesy email saying, “Hi, I wrote this app that downloads stuff from your server, what do you think?” They might be able to help you make the app better. At the very least, they’ll know they can contact you if something goes wrong, so you might be able to have a conversation instead of being blocked.

Unfortunately, there are still too many people out there who seem to think of websites as being like motorways. You can send a whole bunch of cars down a motorway without having to think about ‘will the presence of these cars cause unacceptable congestion’, unless it’s an enterprise fleet so big that you already know how to manage the problems. Coders need to realise that some of our servers are more like small private roads with a guard at the gatehouse keeping an eye on the situation, who lets individuals pass but might start putting up barriers if you bring in a tour group without talking about it first. I did set the block message to something that makes sense if a human reads it, but I don’t know if the code’s author will, and I doubt I’ll ever know the name of their program, which is a pity because I would have reached out to them and helped them fix it if it had identified itself.

Silas S. Brown

Silas is a partially-sighted Computer Science post-doc in Cambridge who currently works in part-time assistant tuition and part-time for Oracle. He has been an ACCU member since 1994.
