New Problems of TMyDump of 6.00.0.3 (updated)

Discussion of open issues, suggestions and bugs regarding MyDAC (Data Access Components for MySQL) for Delphi, C++Builder, Lazarus (and FPC)
Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

New Problems of TMyDump of 6.00.0.3 (updated)

Post by Justmade » Fri 10 Dec 2010 10:27

This sums up the other post, whose display format is ruined by long lines:
http://www.devart.com/forums/viewtopic.php?t=19735


Thanks for the new version fixing the Unicode issue. However, there are new problems and shortcomings:
1. Blob field backup is problematic whether HexBlob is enabled or not.
2. The file size is doubled because the current encoding is UTF-16.

I skipped the demo code from that post, as the problem is partially solved (see below) and the long lines there ruin the formatting.



Regarding blob field backup, I think I have found a fix for HexBlob mode. The problem is that the code still uses a PAnsiChar typecast, which should be PWideChar under VER12P.

Code: Select all

MyService.TCustomMyDumpProcessor.BackupObjects
Sub-procedure ProcessField, line 1641
change
PAnsiChar(Integer(SValue) + sbOffset), Piece.Used);
to
// VER12P marks Delphi 2009 and later, where string is UnicodeString
{$IFDEF VER12P}PWideChar{$ELSE}PAnsiChar{$ENDIF}(Integer(SValue) + sbOffset), Piece.Used);

and the generated result is OK.

I still don't know how to fix the problem when HexBlob mode is false.

Also, many of my blob fields actually contain strings, and I see that this procedure converts those strings to hex too, which makes the file even larger. It is not critical, though, as at least it works.


Regarding using UTF-8 to reduce file size, I was able to make it work. The resulting file is slightly over 50% of the size of the UTF-16 version.

For my modification, the backup part is easy and straightforward:

Code: Select all

MyService.TCustomMyDumpProcessor.Backup
Change
s := #$FF#$FE;      // UTF-16 LE byte order mark
to
s := #$EF#$BB#$BF;  // UTF-8 byte order mark

MyService.TCustomMyDumpProcessor.Add
Change
buf := Encoding.Unicode.GetBytes(WideString(Line + #$D#$A));
to
buf := Encoding.UTF8.GetBytes(UTF8Encode(Line + #$D#$A));

For restore, I initially made the following change, and it works for restoring the backup.

Code: Select all

DAScript.TDAScriptProcessor.CreateParser
Line 400 Change
enc := Encoding.Default;  // the local ANSI codepage
to
enc := Encoding.UTF8;

However, it causes an index-out-of-range error after the successful restore. On closer look, TParser takes the stream size as TextLength, but multi-byte characters count as length 1 toward FBlockSize while actually occupying more than one byte. So at the end TextLength is bigger than FBlockSize, the parser calls ReadNextBlock (reading nothing), and finally reads from the empty buffer, which causes the index-out-of-range error.

So I make the following change :

Code: Select all

CRParser.TParser.ReadNextBlock
Line 1076 Add
if Size > FBlockSize then // i.e. variable-length chars included
  TextLength := TextLength - (Size - FBlockSize);
I actually don't know whether this code is appropriate, as I am not good at stream management and parsing. However, it works well enough in my own case and prevents the out-of-range error. I hope your team can make a better modification for this issue, as saving almost 50% of the file size is a big plus.
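The character-count vs byte-count mismatch described above can be illustrated with a short standalone program (a minimal sketch for Delphi 2009+, not part of the MyDAC source):

```pascal
program ByteCount;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  S: string;
  Bytes: TBytes;
begin
  S := 'café';                          // 4 characters
  Bytes := TEncoding.UTF8.GetBytes(S);  // 5 bytes: 'é' takes 2 bytes in UTF-8
  Writeln(Length(S), ' chars, ', Length(Bytes), ' bytes');
end.
```

This is why using the stream's byte size as TextLength over-counts for a UTF-8 file: each multi-byte character adds extra bytes that never become characters in the buffer.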

Dimon
Devart Team
Posts: 2910
Joined: Mon 05 Mar 2007 16:32

Post by Dimon » Fri 10 Dec 2010 12:42

Why do you want to store your backup file in UTF-8 mode, and not in Unicode? What problems do you encounter when using our implementation of Unicode files?

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Fri 10 Dec 2010 13:15

Your Unicode file uses UTF-16 encoding, which means 2 bytes per character.

On the other hand, UTF-8 uses 1 byte for most characters within the ASCII range.

As 90%+ (sometimes even 100%) of the content is within the ASCII range, saving as UTF-8 gives a much smaller file, which also implies a slight read/write performance bonus.

I made a test and backed up one of my databases. Your Unicode format produces a file of around 500 MB, whereas dbForge and SQLyog produce files of about 260 MB. After I modified your code to use UTF-8, the generated file is also about 260 MB.

So it is not a necessity, but definitely an advantage, as there seems to be no drawback.
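The size difference is easy to demonstrate with TEncoding (a minimal sketch for Delphi 2009+, not taken from the MyDAC source):

```pascal
program EncodingSize;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  Line: string;
begin
  // A typical mostly-ASCII dump line
  Line := 'INSERT INTO mytable VALUES (1, ''abc'');';
  // UTF-16 stores 2 bytes per ASCII character
  Writeln('UTF-16: ', Length(TEncoding.Unicode.GetBytes(Line)), ' bytes');
  // UTF-8 stores 1 byte per ASCII character
  Writeln('UTF-8:  ', Length(TEncoding.UTF8.GetBytes(Line)), ' bytes');
end.
```

For a pure-ASCII line the UTF-16 output is exactly twice the UTF-8 output, which matches the roughly 500 MB vs 260 MB dump files observed above.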

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Fri 10 Dec 2010 13:23

One reminder: my suggested UTF-8 modification is backward compatible with your UTF-16/ANSI versions for restoring, because your restore code detects ANSI, UTF-16, and UTF-8 encodings. I only fixed the UTF-8 parts and did not affect the ANSI/UTF-16 parts.
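For reference, BOM-based detection of the three encodings can be sketched like this (a hypothetical illustration, not MyDAC's actual detection code):

```pascal
// Requires SysUtils (TEncoding, TBytes); Delphi 2009+.
// Hypothetical sketch of detecting a dump file's encoding from its
// leading byte order mark; MyDAC's real detection may differ.
function DetectDumpEncoding(const Header: TBytes): TEncoding;
begin
  if (Length(Header) >= 3) and (Header[0] = $EF) and
     (Header[1] = $BB) and (Header[2] = $BF) then
    Result := TEncoding.UTF8      // UTF-8 BOM
  else if (Length(Header) >= 2) and (Header[0] = $FF) and
          (Header[1] = $FE) then
    Result := TEncoding.Unicode   // UTF-16 LE BOM
  else
    Result := TEncoding.Default;  // no BOM: assume ANSI
end;
```

Because each encoding writes a distinct BOM, a UTF-8 dump added alongside the existing UTF-16 and ANSI formats can be told apart without any extra file-format changes.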

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Sat 11 Dec 2010 04:02

When restoring a larger dump, there is still an index-out-of-range error, and I see that there are many more places in your code that expect a fixed-length charset (and therefore a fixed stored-block size); there is no simple fix for that.

I understand the effort involved, so you are the one who knows the cost and the benefit (half the file size) and whether it is worthwhile to do.


On the other hand, I would like to request adding a setting statement when dumping.

Some auto-increment columns have a manually inserted 0 value with a special meaning. However, when restoring from a dump, the 0 value automatically becomes 1, which then causes a duplicate-key error when the record whose value really is 1 is inserted.

I see that adding the following setting statements can prevent this:

Code: Select all

MyServices.TCustomMyDumpProcessor.AddSettings
Add
Add('/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE=''NO_AUTO_VALUE_ON_ZERO'' */;');

MyServices.TCustomMyDumpProcessor.RestoreSettings
Add
Add('/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;');

I hope these can be added to your code so that I do not need to modify it after each update.

Thanks again for your great product.

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Mon 13 Dec 2010 02:08

It turns out that the change needed to support UTF-8 file encoding is quite minimal:

Only two more tiny changes are needed in CRParser.TParser.ReadNextBlock:

1. In the place where I suggested checking Size against FBlockSize, modify the StreamLength variable as well:

Code: Select all

    if Size > FBlockSize then // i.e. variable-length chars included
    begin
      StreamLength := StreamLength - (Size - FBlockSize);
      TextLength := TextLength - (Size - FBlockSize);
    end;
2. A little above that place (line 1060):

Code: Select all

Change
FBlockOffset := GetCharSize(FStream.Position - FStartOffset) + 1 - Offset;
to
FBlockOffset := FBlockOffset + FBlockSize; // FBlockSize is the last block's real length
These changes let my last tested database back up and restore successfully using UTF-8. As said before, the file size went from around 500 MB as UTF-16 to about 260 MB as UTF-8.

Dimon
Devart Team
Posts: 2910
Joined: Mon 05 Mar 2007 16:32

Post by Dimon » Tue 14 Dec 2010 09:30

To solve the problem, try setting the TMyConnection.Options.UseUnicode property to False and TMyConnection.Options.Charset to 'utf8'. Check whether these settings solve the problem.

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Tue 14 Dec 2010 11:24

Sorry, but do you think your code, without anything to generate a UTF-8 file or to read back a variable-length charset, can solve the problem and suddenly produce a UTF-8 dump file and read it back correctly?

Setting UseUnicode to False only falls back to AnsiString/AnsiChar and generates an ANSI file in the local locale, which is fine for a single language but not for multi-language data.

There are reasons why almost everyone creating Unicode files uses UTF-8 (much smaller file size without data loss, plus multi-language support) rather than UTF-16. I think everyone who uses or will use TMyDump and needs Unicode support will prefer UTF-8 files to UTF-16.

Also, my tests show that the modification needed is minimal (see the code changes suggested above).

Of course, the decision is yours.

Dimon
Devart Team
Posts: 2910
Joined: Mon 05 Mar 2007 16:32

Post by Dimon » Wed 15 Dec 2010 11:46

Ok, we will change this behaviour in the next MyDAC build.

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Wed 15 Dec 2010 14:52

Thank you for your kind help.

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Mon 07 Feb 2011 10:14

It is disappointing that you changed the encoding to UTF-8 (which is very good for saving space) but neither implemented my full suggestion nor reviewed it yourselves to produce a better implementation.

As UTF-8 is a variable-length encoding and your existing code does fixed-length calculations, I had already mentioned that we will encounter an index-out-of-range error when reading back Unicode strings whose characters take more than one byte.

I do encounter this error after installing the new update.

Please read my initial post and the follow-ups for details. I think I need to modify the code and switch to manually compiling the package again, and I hope that I, as well as everyone who needs Unicode rather than an ANSI charset, will not need to do this again on each update.

Justmade
Posts: 108
Joined: Sat 16 Aug 2008 03:51

Post by Justmade » Mon 07 Feb 2011 15:07

Also, the read-back encoding in DAScript has not been modified either, so TMyDump reads the file back using

enc := Encoding.Default;

which is obviously not UTF-8.

I applied my adjustments to both DAScript and CRParser, and everything is working fine now.

AndreyZ

Post by AndreyZ » Wed 09 Feb 2011 09:20

Thank you for the information. We will certainly include this fix in the next MyDAC build. Please excuse us for the inconvenience.
