According to Vanessa Fox, speaking back when she worked for Google, Google basically discards most, if not all, of the HTML code before it ever indexes a page.
(Source: http://videos.webpronews.com/2006/12...ogle-sitemaps/ She makes her comments in the course of talking about code-to-text ratio, but to my mind it doesn't really matter whether we're negating the idea of code-to-text ratio or debunking the "valid HTML" myth. Either way, the bottom line is: the code gets discarded. And if the code is discarded before indexing, it can't possibly have any effect on rankings or SEO.)
Your point 2 and point 3 are pretty much the same thing. Crawlers don't actually "read" the source code in the sense of trying to parse it the way a browser would. They simply collect what they find and carry it back, intact, to Google for further processing. Since they aren't trying to read the code, the only issue is if there is some kind of error condition that prevents them from retrieving all the page source.
And since the spiders are written to be very forgiving, it would have to be some abysmally bad code indeed before it would be enough to stop one of those little guys.
Any code that messed up would almost certainly be unparsable by current browser software, and so would not display correctly for human visitors, either.
And I totally agree that code that messed up is not a sign of a good quality site. Especially if the code stays that messed up long enough for the spiders to find it in that condition more than once.
So, yeah, if a site's HTML is so FUBAR the spiders can't retrieve all the source, and browsers can't render it, then I would say visiting that site isn't going to be a good visitor experience, and Google and the others would be totally justified in ranking such a site very low (or excluding it from their results entirely).
But just because code can be parsed and rendered by a browser doesn't mean it's "valid." There are loads of "invalid" websites that still display perfectly well across a whole range of browsers, and are perfectly spiderable.
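To make that concrete, here is a rough sketch of my own (not anything Google has published, and certainly not their actual pipeline). It uses the tolerant parser in Python's standard library to throw away the markup and keep only the text. The names TextExtractor and visible_text, and both sample pages, are made up for the illustration. One page is tidy; the other has an unquoted attribute, a mis-nested font tag, and a paragraph that is never closed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are dropped.
        self.chunks.append(data)


def visible_text(html):
    """Return the page text with the markup discarded and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Made-up sample pages, for illustration only.
valid_page = "<html><body><p>Blue <b>widgets</b> for sale.</p></body></html>"
invalid_page = "<html><body><P>Blue <b><font color=blue>widgets</b> for sale."

print(visible_text(valid_page))
print(visible_text(invalid_page))
```

Both calls give back the same sentence, which is the point: once the tags are gone, there is nothing left to tell the tidy page from the sloppy one.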
And the question is whether "valid HTML" makes a difference for SEO.
My answer to that is, IMO, no. I've seen no evidence that validation is even checked by the search engines, much less that it's used as a ranking factor.
Anyone who doubts this just needs to run a validator against the top-ranking websites for most search queries.
(According to the W3C validator, Google's own home page throws up 65 errors and 8 warnings. It would be very odd for them to insist on "valid HTML" from others when they don't seem to see the need for it on their own home page.)
My
--Torka