"How to cope with incorrect HTML" was the title of my Cand. Scient. (master's equivalent) thesis in computer science.
Among other things, it deals with how to parse both valid and invalid HTML. It also contains what is, to my knowledge, the only large-scale validation effort on the web: 2.4 million HTML documents were validated, and the results are very interesting. The thesis is available in the following formats:
The thesis has also been referenced in quite a few articles on the internet. Note: quite a few of these state the percentage of valid pages incorrectly. The correct figure is 0.71%; 0.007 is that value expressed as a fraction, not a percentage. Here is a list of the ones I've seen so far: