Tidy - Store Original Text

TIDY_STORE_ORIGINAL_TEXT

2009-07-12: For a long time, since before 2006 at least, Tidy has had a compile switch, TIDY_STORE_ORIGINAL_TEXT. Back in 2006, when doing some fixes on the 'accessibility' testing code, I thought of using this 'original text store' to separate some access WARNINGS, but eventually an alternative was found. That old, messy effort can still be found here, but as stated it was abandoned.

But since then, there seem to have been some 'improvements' in the current CVS code that may allow a programmer to enable this compile time switch and use the buffer stored with each 'node', node->otext... This topic came up again, on an unrelated issue, and I decided to try enabling this compile switch again, and testing it... With a few small patches, I got it working well ;=))

Discovery

Those with a delicate tummy should skip all this gore, and go to the downloads ;=))

In looking through the current CVS source, it seems some, if not ALL, the problem items I had seen before, specifically related to -
#ifdef TIDY_STORE_ORIGINAL_TEXT
have found their way into CVS ;=)) In fact, on careful inspection, there seem to have been some changes and IMPROVEMENTS done ;=))

Maybe there remain some cases, especially where characters are 'put back', in which the original store is NOT fully updated... Let me add some debug code to my special TidyDev, and do some tests...

The first thing I found was that the compile switch alone is not enough; you ALSO have to switch the storing on in the document, like :-

#ifdef TIDY_STORE_ORIGINAL_TEXT
        doc->storeText = yes;
#endif

This could be done in tidyDocCreate(), or somewhere else... I chose tidyDocCreate().

The original store does NOT handle "put backs" in some cases, so you get output like the following, where the ORIGINAL store text is shown between the square brackets :-

OTEXT:[
 <HEAD><]

So here you can see the final '<' char is from the NEXT token, but the original line ending is there before the first '<'... And

OTEXT:[TITLE>1]

So here you can see the missing '<' char, AND the '1' that belongs to the next text token... But this under-run, and over-run does NOT happen in all cases, like -

OTEXT:[</TITLE>]
OTEXT:[</HEAD>]
OTEXT:[ <BODY> ]
OTEXT:[ <P> ]

So, maybe I can fix the earlier case? Hmmm, in lexer.c, the GetTokenFromStream(), on finding a '<' + letter, does -

TY_(UngetChar)(c, doc->docIn); /* push back letter */ 
TY_(UngetChar)('<', doc->docIn);
lexer->lexsize -= 2; /* discard "<" + letter */

So here it is putting these 2 characters back in the stream, and dumping 2 characters off the lexsize, discarding the "<" plus letter, BUT does nothing about backing up the original text store... which it should!

It then deals with the previous token, and when it comes back to continue with the next token, the first 2 characters are from the 'push back' store, which are NOT added to original text... That means 'UngetChar' should back up the original text as well, and chars got from this 'UngetChar' store should be added to the now new in->otext... Let me see if I can do that much, at least...

Well, that is better ;=)) Added a new service :-

/* FIX20090712 - new service to unget original store */
void TY_(UngetCharOriginalText) (StreamIn *in)
{
   if (in->otextlen) {
      in->otextlen--;
      in->otextbuf[in->otextlen] = 0; /* zero terminate for string */
   }
}

which I call from - UngetChar()

#ifdef TIDY_STORE_ORIGINAL_TEXT 
      TY_(UngetCharOriginalText) (in);
#endif

And then in static uint PopChar( StreamIn *in ) added -

#ifdef TIDY_STORE_ORIGINAL_TEXT
      TY_(AddCharToOriginalText)(in, (tchar)c);
#endif
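
To make the idea easier to follow, here is a small stand-alone sketch of those two changes in miniature. It is NOT Tidy code: DemoStream and the demo_* functions are names invented for this illustration, and the fixed-size buffers are a simplification. It only shows the principle that an 'unget' must also back up the original text store, and that a character read back from the push-back store must be added to the original text again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX 64

/* Hypothetical miniature stream: a push-back stack plus an
   'original text' store. Illustration only, not Tidy's StreamIn. */
typedef struct {
    const char *src;        /* raw input string */
    size_t pos;             /* next unread offset in src */
    char pushed[DEMO_MAX];  /* push-back stack */
    size_t npushed;
    char otext[DEMO_MAX];   /* original text store, zero terminated */
    size_t otextlen;
} DemoStream;

static void demo_init(DemoStream *in, const char *src)
{
    memset(in, 0, sizeof *in);
    in->src = src;
}

/* Append one char to the original text store. */
static void demo_add_otext(DemoStream *in, char c)
{
    assert(in->otextlen + 1 < DEMO_MAX);
    in->otext[in->otextlen++] = c;
    in->otext[in->otextlen] = 0;
}

/* Read one char, taking pushed-back chars first; returns 0 at end.
   Either way the char is recorded in the original text store. */
static char demo_read(DemoStream *in)
{
    char c;
    if (in->npushed > 0)
        c = in->pushed[--in->npushed];
    else if (in->src[in->pos] != 0)
        c = in->src[in->pos++];
    else
        return 0;
    demo_add_otext(in, c);
    return c;
}

/* Push a char back AND drop it from the original text store. */
static void demo_unget(DemoStream *in, char c)
{
    assert(in->npushed < DEMO_MAX);
    in->pushed[in->npushed++] = c;
    if (in->otextlen > 0)
        in->otext[--in->otextlen] = 0;
}

/* Hand the stored text to a token buffer and reset the store. */
static void demo_take_otext(DemoStream *in, char *out)
{
    strcpy(out, in->otext);
    in->otextlen = 0;
    in->otext[0] = 0;
}
```

With the input "ab<T>", reading four characters, then ungetting 'T' and '<' (in that order, as the lexer does), leaves "ab" in the store for the text token. The following reads then collect "<T>" for the next token, with no under-run or over-run.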

Still NOT perfect, but a LOT better ;=)) - remember, the ORIGINAL store text is output between the square brackets, [...] :-

OTEXT:[<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">]
OTEXT:[
<HTML>]
OTEXT:[
<HEAD>]
OTEXT:[<TITLE>]
OTEXT:[131630]
OTEXT:[7</TITLE>]
OTEXT:[</HEAD>]
OTEXT:[
<BODY>]
OTEXT:[
<P>]
...etc...

Note the '7' got tacked onto TITLE? And that is happening with EACH subsequent TEXT node...

OTEXT:[Row 1 - Col ]
OTEXT:[1</TD>]

YUCK! Now what is wrong?

Ah-ha, now the meaning of 'count' in StoreOriginalTextInToken( doc, token, count ) starts to become clear... This is an attempt to account for some character back-ups that are not actually done as 'put backs'... Another small patch, with explanation :-

#ifdef TIDY_STORE_ORIGINAL_TEXT /* FIX20090712 - reduce to 2,
                                   since now UngetChar did 1! */
           StoreOriginalTextInToken(doc, lexer->token, 2); /* was 3 */
#endif

This got it RIGHT ;=)) -

OTEXT:[1316307]
OTEXT:[</TITLE>]
... and later
OTEXT:[Row 1 - Col 1]
OTEXT:[</TD>]
OTEXT:[
 <TD>]
OTEXT:[Row 1 - Col 2]
OTEXT:[</TD>]
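
For anyone wondering how a 'count' of 3 versus 2 produces exactly the outputs shown above, here is a stand-alone sketch of my reading of that parameter: the last 'count' characters in the store are held back for the NEXT token, and everything before them goes to the current token. The function demo_split_otext is a name made up for this illustration, and it uses plain malloc rather than Tidy's allocator, so treat it as an assumption about the behaviour, not the actual StoreOriginalTextInToken source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, illustration only: give all but the last
   'count' chars of 'store' to the current token, and keep the tail
   in 'store' for the next token. Returns a malloc'd string the
   caller must free, or NULL if nothing belongs to this token. */
static char *demo_split_otext(char *store, size_t *storelen, size_t count)
{
    size_t len = *storelen;
    size_t keep;
    char *tok;

    if (count >= len)
        return NULL;              /* whole store belongs to the next token */

    keep = len - count;           /* chars owned by the current token */
    tok = (char *)malloc(keep + 1);
    if (tok == NULL)
        return NULL;
    memcpy(tok, store, keep);
    tok[keep] = 0;

    /* slide the held-back tail to the front of the store */
    memmove(store, store + keep, count);
    store[count] = 0;
    *storelen = count;
    return tok;
}
```

At the point the lexer sees "</" plus a letter after the title text, and the letter has already been removed by the unget, the store holds "1316307</". Splitting with a count of 2 gives the token "1316307" and leaves "</" behind, while the old count of 3 gives "131630" and leaves "7</", which is the stray '7' seen earlier.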

Of course, this TIDY_STORE_ORIGINAL_TEXT compiler switch is ONLY available to programmers. So even if it is enabled when compiling the code, there is no API to obtain this 'original text', but that would not be hard to add.

Also this is NOT exactly the 'raw' original text, since every character has first been put through Tidy's internal character translation service. Much more would need to be done if it were to truly be a store containing the 'raw' original bytes from the file, sans any translation.

LOOKS GOOD! Now to apply those changes to a copy of CVS, and do a full diff - all changes have been marked with
/* FIX20090712 */
There may still be some error cases, exposed from more comprehensive testing, but this is looking positively spiffy ;=))

Downloads

It is not worth enabling it, and supplying a binary executable, since there would be no visible change in the running of Tidy. As stated, at present this is only available through programming, so no binary is included in the downloads. The only item available is the 'patch' file to apply to the CVS source of 2009-07-12...

download	description	MD5	size
tidystore-01.txt	patch file for 12 July, 2009 CVS source	text file only	2,677

PS: This patch file also contains a small TAB related patch to streamio.c, to remove a bogus in->curcol++ increment when doing a tab to space expansion. That increment causes an errant column offset in the lexer error and warning messages, and when the tidyNodeLine(node) and tidyNodeColumn(node) API services are used.

This small patch certainly 'improves' the situation, but the column offset can still be wrong if say there are two or more tabs in the line. But this patch puts the column indicator much closer to the mark. With this patch, for example, regression test test/input/in_1707836.html now reports the CORRECT column number, 2 instead of 9, while test/input/in_1316307.html is still 1 out because there are two(2) tabs. It reports 2 in place of 9, but should report 3!

Obviously more needs to be done to fully clear up this TAB situation, but this patch really improves things.

Geoff.
2009-07-12
