Regular expressions seem to come up whenever someone mentions validating email addresses. Curiously, and just as incorrectly in my opinion, regexes also get mentioned in the same breath as parsing HTML. Note the word "parsing", but that's not the topic of this post.
I've presented on regular expressions a number of times at user groups, and I always put up a slide showing this regex. It's the regex used by the Perl module Mail::RFC822::Address, and it's nasty.
My problem with validating email addresses is that even if an address conforms to the spec, that doesn't mean it's an active, working address, and worse, it may not even belong to the user.
So it seems we'd want to:
1. Catch simple mistakes to make a better user experience.
2. Have a valid email address that can be used.
3. Make sure the email belongs to the user.
So how do we do that?
1. We could use a simple regex that's not too restrictive to make sure the input generally looks like an email address (something, an @, then something else, probably with a dot in it). We could also make the user type their email address twice, the same way we make them confirm a password. If the two entries match, one of three things happened: they didn't make a mistake; they consistently make the same mistake, in which case this hasn't helped us; or they copied and pasted it.
2. We could use a really nasty regex. We could spawn an external process that tries to verify the email address through the mail server of its domain. We could send a URL containing a hash and have the user confirm their email. An ongoing part of this solution might also be to comb through bounced mail from the server and invalidate any address that bounces.
3. I really don't know how else to do this besides mailing the user something and requiring them to act on the contents of the email. This is the option from #2 above, with the email containing a URL and a hash. Do this every time you get a new email address and you can be fairly confident that you can send email to the user.
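To make the list above a bit more concrete, here is a rough Python sketch of the two cheap pieces: a deliberately loose "looks like an email" check, and a confirmation URL carrying a hash that only the server can produce. Everything named here is my own assumption for illustration: the pattern, the SECRET_KEY, the function names, and the example.com/confirm URL are all placeholders, and an HMAC is just one reasonable way to build the hash. Actually sending the mail and handling bounces is left out.

```python
import hashlib
import hmac
import re
import secrets
from urllib.parse import urlencode

# Deliberately loose: something, an @, something, a dot, something.
# This only catches obvious typos; it does not implement the RFC grammar.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Hypothetical server-side secret. A real application would load a
# persistent value from configuration instead of generating one at startup.
SECRET_KEY = secrets.token_bytes(32)


def looks_like_email(address: str) -> bool:
    """Return True if the whole string has the rough shape x@y.z."""
    return LOOSE_EMAIL.fullmatch(address) is not None


def confirmation_token(address: str) -> str:
    """Derive a hash for this address that cannot be forged without the secret."""
    digest = hmac.new(SECRET_KEY, address.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def confirmation_url(address: str) -> str:
    """Build the link to mail to the user (host and path are placeholders)."""
    query = urlencode({"email": address, "token": confirmation_token(address)})
    return f"https://example.com/confirm?{query}"


def is_confirmed(address: str, token: str) -> bool:
    """Check the token from a clicked link, using a constant-time comparison."""
    expected = confirmation_token(address).encode("utf-8")
    return hmac.compare_digest(expected, token.encode("utf-8"))
```

The loose check only improves the user experience (point 1). It is the round trip through the user's inbox, ending in a call like is_confirmed(address, token), that tells you the address works and belongs to them (points 2 and 3).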
My thought on this whole thing is: why collect the data if you're not going to use it, and why just guess at its validity when you could confirm it through user action?




