Title: Francis' Scribbles

Date: Tue, 07 October 2003 13:16:00 +01:00

Body:

Spam

If you are reading this you are probably a programmer and that means that you are both a computer user and, almost certainly, a user of the Internet. Please read to the end of this column and think carefully about possible solutions. I will be offering some thoughts but only by way of getting the ball rolling.

Spam can be roughly grouped into three main categories:

Sales material
Virus/Trojan distribution
Well-intentioned, if thoughtless, sharing

Well-intentioned

The last of these can be tackled by better education of casual users. We have to educate our friends, colleagues and relatives that forwarding email should be limited to those that have made it clear that they are happy to accept it. Just because it is easy to forward 'The Ten Worst Jokes' to all your friends does not make that a good idea. If you had to post those you would not do so.

I think it is our duty as, hopefully more enlightened, users of email to politely and kindly instruct those that are in the habit of forwarding that funny picture to all and sundry that this is not acceptable behaviour. We need to explain why that is the case. The possibility that a forwarded unsolicited email might contain a virus or Trojan is not the main issue, though it is an important one.

Virus Propagation

The second category was, until recently, the major concern because though it made up a relatively small volume of email it had a potential for extreme damage. The activity is already illegal in most civilised countries but identification and prosecution of the perpetrators can be a problem.

We need to focus on better ways to reduce the threat, because the damage has been done long before the guilty have been arrested. To find solutions to this threat we need to look at the field of epidemiology. Reducing susceptibility via inoculation is part of the answer. Recognising that monocultures are particularly vulnerable is another. Slowing transmission rates and early identification are also part of the solution.

ISPs have a part to play though it will not be anywhere near a complete solution. One idea that is worth considering at all levels is the use of trap email addresses. By that I mean addresses that are present in address books, placed out on the Internet for 'harvesters' etc. but which are never intended for use. Any email addressed to such will act as an alert. It should not be beyond competent programmers to write software that can act reasonably on such an alert. Exactly what reasonably means depends on where the trap is triggered.

For example a trap triggered at ISP level might cause immediate suspension of the source address for a period of time (it need only be relatively short while checks can be made) and trigger a delay of all emails with a similar 'signature.'
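A minimal sketch of the ISP-level reaction just described, under loud assumptions: the class `TrapAlert` and its members are hypothetical names, times are plain seconds, and addresses are compared as exact strings. It covers only the temporary suspension of the source; matching on a message 'signature' is left out.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical ISP-side sketch, not a real mail server API.
class TrapAlert {
public:
    explicit TrapAlert(long suspend_seconds)
        : suspend_seconds_(suspend_seconds) {}

    // Register an address that no legitimate sender should ever use.
    void add_trap(const std::string& address) { traps_.insert(address); }

    // Called for each message seen. If the recipient is a trap address,
    // the sender is suspended for a fixed period starting at `now`.
    void observe(const std::string& sender, const std::string& recipient,
                 long now) {
        if (traps_.count(recipient) != 0) {
            suspended_until_[sender] = now + suspend_seconds_;
        }
    }

    // True while the sender's suspension period has not yet expired.
    bool is_suspended(const std::string& sender, long now) const {
        auto it = suspended_until_.find(sender);
        return it != suspended_until_.end() && now < it->second;
    }

private:
    long suspend_seconds_;
    std::set<std::string> traps_;
    std::map<std::string, long> suspended_until_;
};
```

With a ten minute suspension (`TrapAlert isp(600)`), a sender that hits a trap at time 100 is reported as suspended until time 700 and is free again after that, which gives the short window for checks mentioned above.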

More controversially, I think it is worth making it illegal to transmit a virus-containing email knowingly or through professional incompetence. I am thinking primarily of ISPs here. Before anyone starts screaming about personal liberty, remember that transmission of hazardous material via other mechanisms such as a postal service is already subject to the law. Few people would object to the use of chemical sniffers in post offices, so why should we object to the electronic equivalent?

We should never have to worry about old viruses being delivered by email because our ISP should have detected them and quarantined them, advised us and left the decision as to subsequent action to the end recipient. We need laws assigning responsibility to persuade ISPs that they must act rather than simply leaving it to their customers.

You might wonder if we should simply encourage ISPs to add this as a paid-for service. I think that is not enough: we need to stop virus propagation as early as possible, and that is effectively at the first place where it can be detected.

This proposal would not stop a new virus, which is why I think we need things like trap addresses, used intelligently, to slow propagation until the detection software can be updated.

Sales Spamming

A large amount of spam until recently (more in a moment) was concerned with trying to sell something. Very often the product was something that we would have preferred our children not to be exposed to even if the sale was legal.

I have been told that most such spam is the product of a tiny number of specialists who make money out of being able to propagate such material.

At first sight this activity would seem to be part of the normal commercial exploitation of technology. However I think we have to look deeper. Have you noticed how those canvassing for work (jobbing gardeners etc.) use very small fliers which they hand deliver and usually ask that they be returned? Even the marginal cost of such fliers makes their reuse worth the effort. The problem with spam promotion is that it costs the advertiser only whatever the spammer charges, while the real costs are paid by the entire community.

Do you think the above is too strong a statement? Forget the cost of implementing filters: think what it has done to communication. Not so many years ago it was easy to contact someone because they would tell you how on their home page or in their signatures on posts to newsgroups. Now we are progressively isolating ourselves. The Internet is rapidly losing its early promise of a technology that would bring people together and becoming a place where suspicion of strangers is getting ever worse.

Now the latest round of sales oriented spam comes disguised as messages that report such things as delivery failure. In the past you would always check those because you would want to know which of your messages had failed to reach its intended destination. One or two false ones were bad but we could live with them; when a couple become dozens they get filtered out. So your spam filter rejects my innocent email because of something in the subject line and my spam filter rejects your system's rejection. We become isolated.

From my perspective this is a fundamental attack on our sense of community and needs to be taken as a serious issue. ISPs claim they cannot filter such email. OK, so let us take that as true (though I have my doubts) and attack the cause of the problem by making it illegal to tout for trade via any form of unsolicited contact. Yes, I know that some companies will consider that an attack on what they consider to be their basic rights, but it is past time that we grasped the nettle. Perhaps instead of making it illegal we should require companies that try to drum up trade with unsolicited contact (by email, phone or any other automated mechanism) to pay a levy (tax or licence fee) based on the number of attempted contacts.

Of course this will not eliminate the problem; companies based on bad and/or illegal business practices will always be with us, but it would restore the balance a little.

The next point of attack could be on the use of harvesters to construct databases of email addresses. Clearly an email address is personal information so I suspect that in Europe it is already illegal to create a database of email addresses without registering under your country's data protection act. If we could get the USA to implement similar legislation we would have another lever with which to control the commercial spammers.

Note that commercial spam is nearly always a matter of sending an identical message to a very large number of addresses. At the customer end that leaves another possible way to filter out some spam. Couple a widely visible trap address with suitable software and any duplicate messages that go both to your trap address and to one or more of your normal addresses could be filtered. How easy would it be to write such software? Perhaps you are someone who could do so.
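As one hedged answer to how easy such software might be, here is a minimal sketch of the customer-end idea. The class `DuplicateFilter` and its `accept` function are hypothetical names, and the sketch assumes that spam copies are byte-for-byte identical and that messages arrive as plain strings.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical customer-side sketch: message bodies that reach the trap
// address are remembered, and identical bodies are rejected elsewhere.
class DuplicateFilter {
public:
    explicit DuplicateFilter(const std::string& trap) : trap_(trap) {}

    // Returns true if the message should be delivered to the user.
    // Comparison is exact, so spam that varies its text per recipient
    // would slip through.
    bool accept(const std::string& recipient, const std::string& body) {
        if (recipient == trap_) {
            seen_at_trap_.insert(body);
            return false;  // nothing sent to the trap is ever delivered
        }
        return seen_at_trap_.count(body) == 0;
    }

private:
    std::string trap_;
    std::unordered_set<std::string> seen_at_trap_;
};
```

The sketch is order dependent: a copy that reaches a normal address before the trap copy arrives is still delivered. A practical version would probably hold incoming mail briefly, or compare a digest of the body rather than the full text.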

A Phase Change

When I set out to write this column things were bad enough, but between the draft and this copy something else happened. I do not know what, but sometime last Thursday my email system moved from having an irritating amount of spam to a complete disaster. From 100-150 spams a day it moved to 2000 plus almost instantly; there was no ramp-up to the new level. Most of these extra messages purport to originate from or be connected to Microsoft. Of course this is not the case. Microsoft are an innocent victim. However I have no recourse but to reject all email with 'Microsoft' or 'MS' somewhere in its headers unless it comes from a specifically whitelisted source.

This is a reverse form of denial of service: I cannot contact Microsoft on any issue and get a reply unless I allow literally thousands of false emails through.

There is another issue with this level of spam: I have to access the headers in order to reject it. On my home system with a broadband connection that is just a nuisance. However in a couple of weeks' time I will be working on site and my only email access will be through webmail, and experience tells me that I simply will not have the bandwidth to even do the filtering. That means that in all likelihood my mailbox will accumulate over ten thousand messages during my five days of absence.

Conclusion

We often hear claims about the right to freedom of speech. Well, it seems to me that what is currently happening is seriously prejudicing my freedom. It is time that all concerned do whatever they can to deal with this problem. It is already way out of hand.

If we care, which I think most of us do, we must attack the problem of unsolicited email from every direction available. We need to persuade our governments that it is a threat to continued commerce and an invasion of our privacy. We need to persuade our service providers to take the issue seriously and act not only to remove customers who abuse the system but also to stem the flood of incoming but unwanted mail. Filtering at the recipient's end is now too late; it needs to be done earlier.

We need to address the use of email (and cold phone calling, text messaging etc.) for 'advertising'. The decision to email a thousand people should cost, and the decision to email a million should cost much more. Holding a database of email addresses of people that have never contacted you should be illegal. I know that sounds draconian but a large part of the problem is the creation of, currently legal, databases of email addresses via harvesters.

We should not forget to educate our friends, colleagues and relatives so that they both understand the issues and avoid making things worse by forwarding unsolicited messages.

No one person can solve or even mitigate the problems we are facing but by acting constructively together we should be able to reduce them. Doing nothing is not an option.

By the way, if I reject an email from you, or you do not get a response, resend it with 'syzygy' somewhere in the subject header and it will be accepted, at least for now.

Another Matter for Concern

Academic libraries have long contributed to the society that funds them (at least in part) by making materials available to those who present themselves in person. For example I can go to the Oxford University Libraries (Bodleian, Radcliffe Science etc.) and as long as I have a reader's card I can research materials there even if I am not a member of the University.

For many years this has been the only reasonable way for ordinary citizens to get access to many academic and technical journals because the cost of subscribing is extremely high. However these costs have become very high for academic institutions as well with the result that many now only subscribe to blocks of journals in electronic form. The problem is that the suppliers often prohibit access to those who are not members of the Institution in question, even if the access is through on line facilities in the Library reading room. So now we fund, through our taxes, academic journals and yet can no longer read them. I do not think this is good enough, do you?

Problem 11

The following is a template function to extract a value from an input stream.

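template<typename in_type>
in_type read(std::istream & in) {
  in_type temp;  // holds the extracted value
  in >> temp;
  // failure before the end of input means the data was malformed
  if(in.fail() and not in.eof())
    throw fgw::bad_input("Corrupted data");
  if(not in.eof()) return temp;
  else ???  // end of input reached: what should happen here?
}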

There are two possible situations where input may fail. In the first the kind of data being read may not meet the requirements for the value being sought. In this case there is little that we can do other than throw an exception.

The second case of failure is one that is not unexpected: we have reached the end of the input stream (e.g. we have already read up to the end of the file). It seems to me that we should not handle this instance by throwing an exception. What should we do?

Commentary on Problem 10

int foo(bool read_all) {
  if(read_all) {
    string line;
    getline(cin, line);
    return atoi(line);
  }
  else {
    int i;
    cin >> i; 
    return i;
  }
}

There are quite a few things wrong with the above function definition. Assume that all the appropriate headers have been included; what particular feature of C++ input makes it completely unusable?

The fundamental problem is that after running such a function the programmer cannot know the state of the input. If read_all is true it will have read the whole of a line from the standard input stream. If read_all is false it will leave at the very least a carriage return in the input stream.
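The leftover end-of-line character is easy to demonstrate. The sketch below uses a `std::istringstream` as a stand-in for `cin`, and `line_after_int` is a hypothetical helper written only for this illustration. It reads an `int` with `>>` and then calls `getline()` on the same stream.

```cpp
#include <sstream>
#include <string>

// Reads an int with >> and then a line with getline from the same text,
// returning whatever getline produced.
std::string line_after_int(const std::string& text) {
    std::istringstream in(text);
    int i = 0;
    in >> i;                 // stops at the first character after the number
    std::string line;
    std::getline(in, line);  // reads only what is left of the first line
    return line;
}
```

Given the text "42", a newline, and then "second line", the function returns an empty string rather than "second line", because `getline()` consumes only the newline that `>>` left behind.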

Now there is no standard conforming way that you can restore the input stream to a state where you can use getline() in a predictable fashion. If you do not believe me, try it. There are all kinds of things that seem to offer some hope but you will find that each one falls by the wayside.

After months of trying to solve this problem for readers of my book (due out in the first week of December) I eventually gave up and provided functions that sidestep the problem. I provided a getdata() function that reads to the end of the first input line in which there is at least one character that is not whitespace.

I also made my read<> templates handle the problem by extracting terminating space up to and including a carriage return. They stop early if a non-whitespace character is encountered after the successful acquisition of the required value.

Should we be requiring inexperienced and incidental programmers to jump through such hoops to handle input adequately?

Cryptic Clues for Prizes

Last time I set you the following little problem:

If I gave you 'That foolish day' as a clue you might quite reasonably think of 1st April. But what might 'An English programmer gets pieces of eight for the day of fools.' give as a two digit number? As an added clue an American would get a three-digit answer.

Was it too difficult? Or did you all assume that it was so easy that I would be inundated with answers? I had thought that as my readers are programmers they would find it fairly easy. The first of April becomes 0104 (well, on my side of the Atlantic; on the other side it is 0401). The mention of programmer and pieces of eight should have reinforced using octal. That results in 68. I had to eliminate the US possibility because that would give 257.
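The arithmetic can be checked mechanically. In C and C++ an integer literal with a leading zero is octal, so `0104` and `0401` already have the values given above. The small function below, whose name `octal_value` is chosen only for this illustration, does the same conversion for a string of digits and assumes every character is in the range 0 to 7.

```cpp
#include <string>

// Interprets a string of digits as an octal number, e.g. a date written
// as ddmm. Assumes every character is a digit from 0 to 7.
int octal_value(const std::string& digits) {
    int value = 0;
    for (char d : digits) value = value * 8 + (d - '0');
    return value;
}
```

For "0104" the sum is 1 x 64 + 0 x 8 + 4 = 68, and for "0401" it is 4 x 64 + 0 x 8 + 1 = 257.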

Note that it is an interesting characteristic of clues based on dates that they often leave an option to interpret the number as an octal one.

Time and date clues are often only valid in the context of when they are written. For example 'When next we leap.' might be part of a clue with an answer of 2004 if written this year but it will not be that after 29/02/2004.

You may know that the ancient Greek method for representing numbers was based on using their alphabet (actually they threw in a few extra symbols). The basic idea was that the first 10 letters were treated as the values one to ten, the next nine covered twenty to one hundred and the final nine covered two hundred to one thousand. For many purposes being able to represent all values from one to a thousand was quite sufficient. They added extra features to deal with bigger numbers when they came to need them. Note that this method does not require some specific order for the letters. If we use the English alphabet we have to stop at 800 because we have not got enough symbols unless we add a couple of extras. I choose not to do so in order to keep things simple.

We have A-J representing 1 to 10, J-S for 10 to 100, S-Z for 100 to 800. Your task this time, should you accept the challenge, is to produce a Greek style clue for 261. You have plenty of scope for creativity because the letters required spell three English words, and the creative might find they could use one of the other orders by finding a suitable TLA (three letter acronym). And that is before you play with headless, leaderless, gutted and tailless versions.
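For anyone who wants to check candidate words, the letter values above can be sketched as follows. The function names `letter_value` and `word_value` are hypothetical, and the code assumes upper case letters in a character set where A to Z are contiguous, as in ASCII.

```cpp
#include <string>

// Letter values for the scheme described above: A-J are 1 to 10,
// K-S are 20 to 100 and T-Z are 200 to 800. Returns 0 for anything else.
int letter_value(char c) {
    if (c >= 'A' && c <= 'J') return c - 'A' + 1;
    if (c >= 'K' && c <= 'S') return (c - 'K' + 2) * 10;
    if (c >= 'T' && c <= 'Z') return (c - 'T' + 2) * 100;
    return 0;
}

// Sums the letter values of a word; letter order does not matter.
int word_value(const std::string& word) {
    int total = 0;
    for (char c : word) total += letter_value(c);
    return total;
}
```

For example, `word_value("CAB")` is 3 + 1 + 2 = 6, and any rearrangement of the same letters gives the same total.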

Try warming up with this one because my Christmas competition will be for a prize with a little more value than those I have been offering so far. The prize for this time (apart from fame) will depend on which computer language or problem domain interests you.
