Topic: Work In Progress (Read 135027 times)

Roman · « **Reply #20 on:** 09 August 2011, 20:19 »

Well, actually I already tried that in the past and it complained that the file needs either a BOM or the encoding attribute...but fine...if the XML standard says that a missing encoding attribute falls back to utf8, I can easily add that...
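
A minimal Python sketch of that fallback rule, not cmpro's actual loader: look for a BOM first, then for an encoding attribute in the XML declaration, and otherwise assume utf8. The function name `sniff_xml_encoding` is made up for illustration, and UTF-32 BOMs are deliberately ignored to keep it short.

```python
import codecs
import re

# Matches an XML declaration that carries an encoding attribute.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes.

    Hypothetical helper: BOM wins, then the declared encoding,
    then the utf8 default for files that specify nothing.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding attribute: fall back to utf8.
    return "utf-8"
```

A datfile that starts with a plain `<?xml version="1.0"?>` and has no BOM would come out as utf8 here.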

Roman · « **Reply #21 on:** 10 August 2011, 19:36 »

ok...some news...somehow related to utf8

- made stats.ini a utf8 file with BOM (to show the middle point at the beginning correctly)
- fixed some name conversion for 7z files
- fixed a typo which caused the batcher to not correctly set up rompaths
- xml files are now loaded as utf8, no matter what encoding you specify

Generally you might face issues when you have existing archives whose names are stored in a local code page encoding...they then get read as utf8 within cmpro...so you run into some wrong conversions...but I hope over time people get rid of such issues...

Roman · « **Reply #22 on:** 11 August 2011, 07:50 »

well well well...Looks like hard times are ahead

while testing I commonly see users stumble over wrong character encoding when working with zipfiles which were created outside of cmpro...

So...let's start by saying that utf8 and zip files are two worlds...there is no real standard for how filenames are stored utf8 encoded.

Winzip15 allows storing the filename as utf8 with no extra information. Simply store the utf8 hex bytes and you're done...fine...cmpro handles that..

If you use another packer version or a different product, the name is most likely stored in the local code page encoding....so what will happen: cmpro reads the zipfile in and converts the filename on the assumption that it's utf8 encoded...which ends with output you did not expect. If this happens, well, tough luck, let cmpro recreate the file for you....
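
To make the mismatch concrete, here is a small Python sketch (not cmpro code). It assumes cp1250 as the "local code page" and borrows a filename reported later in this thread; both choices are only for illustration.

```python
original = "Diaboł.bin"

# The same name as two different byte strings inside a zip header.
raw_local = original.encode("cp1250")  # assumed local code page
raw_utf8 = original.encode("utf-8")    # what a utf8-only reader expects


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as utf8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The code-page bytes are not valid utf8, so a utf8-only reader
# cannot recover the name from them.
print(is_valid_utf8(raw_local), is_valid_utf8(raw_utf8))

# The reverse mistake "works" but silently yields the wrong characters.
print(raw_utf8.decode("cp1250"))
```

Either way the stored bytes alone do not say which encoding was used, which is why recreating the archive with known utf8 names is the reliable fix.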

I'm currently checking if Winrar zipfiles have similar issues or if they are using the 2nd way to store utf8 filenames in zips (by the way, officially winrar supports this for zips since version 3.80) which is by using the zip extra field for some encoding information. If so, I have to update my zipreader a bit to use the extrafield information...
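
A rough Python sketch of what reading such an extra field could look like. It assumes the Info-ZIP "Unicode Path" extra field (header ID 0x7075) as documented in PKWARE's APPNOTE; whether WinZip or WinRAR write exactly this field is not confirmed here, and the helper name `unicode_path_from_extra` is hypothetical.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field ("up")


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the utf8 name stored in the extra field, or None.

    `extra` is the raw extra-field block of one zip entry and
    `raw_name` the undecoded filename bytes from the same header.
    """
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(body) < 5:
            continue
        # Layout: 1 byte version, 4 bytes CRC32 of the header filename,
        # then the utf8 name.
        version, name_crc = struct.unpack_from("<BI", body, 0)
        # The CRC guards against a stale field left behind by a rename.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return body[5:].decode("utf-8")
    return None
```

If the function returns None, the reader is back to interpreting the plain header name on its own.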

There were comments that the encoding goes nuts after torrentzipping the file..well..then the torrentzip guys should double check their way of encoding names (I assume they do local code page encoding) as well...but I will check if they maybe use the extra-field method already.

rar and 7z archives seem to have no issues at all anymore since the last build....and neither do Winzip created zips or cmpro created zips.

...so the next step will be to double check winrar zips...

Keep in mind that the goal is to support utf8 encoding, not to support any local page encoding...

oxyandy · « **Reply #23 on:** 11 August 2011, 09:23 »

Quote from Roman:

rar and 7z archives seem to have no issues at all anymore since the last build....and Winzip created zips and cmpro created zips either.

I tested the reading of "Winzip v15.5 created zips" (and other zip creating programs)
and Dir2Dat still creates the same problem as "WinRar created zips"
No file-name is outputted.
It still stands, the only type of zip CMP can read the internal file name from, is a "CMP created zip"

<game name="哈哈哈 (ZIP MADE BY WinZip 15.5)">
<description>哈哈哈 (ZIP MADE BY WinZip 15.5)</description>
<rom name=".txt" size="12" crc="c6302903"/>
</game>

Anyway, enjoy your day at work.
We will discuss more later.

Roman · « **Reply #24 on:** 11 August 2011, 09:29 »

Check Winzip15's options to see if you have the "store filenames as utf8" checkbox ticked....

And for another test: drop the zipfile which gives you an empty filename in dir2dat into the about window...does it list an empty filename there, too?

oxyandy · « **Reply #25 on:** 11 August 2011, 09:37 »

Ok, did that.

This doesn't say exactly "store filenames as utf8"
But it does say "Store Unicode filenames in zip files"
By default that was already checked.

Roman · « **Reply #26 on:** 11 August 2011, 09:43 »

yeah yeah..that's the option I was talking about...good to see it's checked...

Now please drag'n drop the file from your dir2dat source (the one which creates the empty filename) into the about window...
a window should open, listing all entries of the zip...I wonder if the name is listed there (correctly).
If it is listed there, then dir2dat somehow cuts off the name....if it's not listed there, then it's related to the encoding / zip reader.

oxyandy · « **Reply #27 on:** 11 August 2011, 09:44 »

Ok did the drag drop test too..
Same, name missing

Code:

Path:	G:\Desktop\New Folder (10)\哈哈哈 (ZIP MADE BY WinZip 15.5).zip
Name:	.txt
Size:	12
CRC32:	c6302903
MD5:	9f8c1945784842810f22262b7d3aef0f
SHA1:	cfe308fd34664b17cd752537fb04cb59d0bb5070

Roman · « **Reply #28 on:** 11 August 2011, 09:56 »

thanks! good to know...so it seems it can't convert the chars...so the issue is not in dir2dat but in the reader

I will check if you can specify a default char (like a ?) for any char that can't be converted...If you haven't mailed me that "\哈哈哈 (ZIP MADE BY WinZip 15.5).zip" yet, please do so...I bet it uses the extrafield for additional information
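
As an illustration of the difference a default char makes, here is a Python sketch (again not the cmpro reader). It assumes the name was stored in GBK, which is only a guess for this example; the actual bytes in the WinZip file are unknown at this point.

```python
# Hypothetical stored name: code-page bytes that are not valid utf8.
raw = "哈哈哈.txt".encode("gbk")

# Dropping every unconvertible byte leaves only the ASCII tail.
dropped = raw.decode("utf-8", errors="ignore")

# Substituting a marker (U+FFFD here, instead of '?') keeps the damage visible.
marked = raw.decode("utf-8", errors="replace")

print(repr(dropped))
print(repr(marked))
```

With bytes like these, silently dropping characters gives a bare ".txt", the same shape as the empty name seen above, while a replacement char at least shows that something was there.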

oxyandy · « **Reply #29 on:** 11 August 2011, 10:06 »

Well,
I haven't emailed the "WinZip v15.5 created zip" yet,
but now you mention it, I will.

I just hope you will humour me,
remove the "forced UTF-8 encoding" in source, compile and get me a copy.

If a build like this reads and stores the name in the dat,
then I bet a beer, it will also output a zip with the filename too.

Then problem over, without you even looking into the internals of the zips.

Quote from Roman:

so the issue is not in dir2dat but in the reader

Cause don't forget it's not just the ZIP reading, but also the ZIP writing which is quirky!

Roman · « **Reply #30 on:** 11 August 2011, 10:15 »

Again...non-utf8 encoding is not an option here. It will then simply use local code page encoding which may work on your side but nowhere else. The goal is to use utf8 stored names and not give a s**t about local code page encoding.

Zip writing is not a problem. It uses the ziparchive lib for saving (unless you use 'no recompress' in the rebuilder, then the name is converted internally and not in the lib!!!) and that stores the filename utf8 encoded, and winzip, winrar and 7z show the name correctly. It uses the method to store this without EXTRA field usage, which all 3 major programs handle fine.
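
For comparison only, this is how a utf8 name without an extra field looks when written by Python's standard `zipfile` module, which is a different library from the ziparchive lib used in cmpro. Python marks such entries with general purpose flag bit 11 (0x800); that the ziparchive lib sets the same flag is an assumption, not something verified here.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name and comment are utf8


def write_and_inspect(name: str, payload: bytes):
    """Write one entry to an in-memory zip and read its header back."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, payload)
    with zipfile.ZipFile(buffer) as archive:
        info = archive.infolist()[0]
        # Report the decoded name and whether the utf8 flag was set.
        return info.filename, bool(info.flag_bits & UTF8_FLAG)


print(write_and_inspect("哈哈哈.txt", b"hello world!"))
print(write_and_inspect("plain.txt", b"hello world!"))
```

The flag is only set when the name contains non-ASCII characters, so plain ASCII names look identical in both schemes.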

Cassiel · « **Reply #31 on:** 11 August 2011, 16:31 »

FINALLY got around to doing some testing....

Couple of small things so far:
- dir2dat: creates DAT OK using a mix of files with some VERY exotic file names. When opening DAT in Notepad++, the file "06 'and'.smd" appears as "06 &apos;and&apos;.smd". Shouldn't the ACTUAL character now be shown rather than the code? Doesn't seem to be an issue with other 'weirder' characters.
- scanner: file "02 Diaboł.bin" appears incorrect (visually at least) in the scanner window, i.e. the "ł" is replaced with a black rectangle. Fixdat is correct however. When rebuilt it is also correct in the archive and scanner recognises the file as correct once rebuilt. (packer = rar).

Not experienced any issues with the actual rebuilder.

Using:
CMP (email on 20110810) (x64)
Windows 7 Ultimate (x64)
WinRAR v4.01 (x64)
Notepad++ v5.9.2 (x86, unicode)

Cassiel · « **Reply #32 on:** 11 August 2011, 16:45 »

Repeated steps using packer = zip (internal zip lib, NOT utilizing WinZip/WinRAR/7z) and experienced no issues with rebuild/scan. Performed exactly the same as with RAR (i.e. the issue with "Diaboł" remains, but rebuilt correctly).

Roman · « **Reply #33 on:** 11 August 2011, 19:25 »

ok....I won't write apos anymore for an apostrophe...

now back to the general problem: it seems that tools like Winzip do use the extra fields for utf8 filenames. Funnily enough I can use Winzip without doing that and ziparchive lib I'm using can use this mode fine...
So cmpro created files (forget about no-recompress at the moment) work fine in Winzip and within cmpro.
Problematic are files which existed before you ran cmpro on them...

Either:
- they have a local code page encoding, then cmpro reads them in, converts them as utf8 (which is wrong) and you get wrong characters. Rezipping them with cmpro will fix this.
- or they use the extra field to store additional utf8 information...which my current parser does not handle....I'll try to update it...or switch to the ziparchive readers completely...

Thanks for testing...

Roman · « **Reply #34 on:** 11 August 2011, 20:49 »

well, I quickly wrote a zip parser based on the already used ziparchive lib and it seems to solve the remaining issues with extra-field utf8 etc....(I need some confirmation from oxyandy though)

Problem is....I need to get rid of all the other internal in-place-rename and no-recompress copy routines...The good news is that the library already has such functions, however I need to adapt them...which takes time...which I currently don't really have...

I will have a week at the end of August....and some hours here and there till then....

oxyandy · « **Reply #35 on:** 12 August 2011, 02:25 »

Ok,
a really bad night's sleep, my daughter had a terrible fever.

Awake now, first test shows.

1. The old output from rebuilder which was torrentzipped, now shows the incorrect name.
(great cause it is)

2. Any zip I have tested, now shows the correct internal contents !

So Dir2Dat is now producing a dat which matches the ZIP files 100%

Torrentzip handles 'special characters' badly,
clearly would need to be re-written to cope with Unicode names.

EDIT:
Tested everything else I had issues with previously,
Faultless so far.
Time to get imaginative with 'testing'

"the "ł" is replaced with a black rectangle."
There are 1000's of Unicode characters which will show as black rectangles.
Unless the CMP output window uses a Unicode font. ??
This Font would then need to be supplied with CMP download package as many OS's are likely not to have them.
And then, ah what a nightmare, there are Unicode characters which belong to specific font sets.

Even Notepad++ needs setting to a Unicode Font, otherwise, of course
it won't show the characters of that font set correctly.

So you might make a dat, which has outputted perfectly, (and it really seems to do this perfect now)
but because you haven't set the correct font in Notepad++
will only show as rectangles too...

Hmm, Roman, would it be possible to have a 'selectable font' for CMP's output windows..

While I'm suggesting things, compressor settings..
Could CMP take for granted that everyone has 7z and WinRar already installed, at their 'default' locations, please.
So they work, 'out of the box' ?

Roman · « **Reply #36 on:** 12 August 2011, 07:23 »

As a father myself I know how these nights are....so all the best for your daughter...

Good to hear that the new reader works fine (again, don't use your special build for any real scanning/fixing. Actually fixing names or no-recompress will use the old routines which will create wrong chars...or even worse since they rely on some attributes which the new reader doesn't fill in yet).

cmpro fonts....hmm...not in the first planned release

torrentzip....hmm...I did already mention that this is not my scope

compressor settings...well...when it comes to zip, for now I would go with the standard ones (level 9, deflate) which the ziparchive lib brings....for 7z/rar you can modify the settings/compressor/7z (rar) edit fields with your own commandline params....yes...I expect winrar and 7z to be installed and located in your %PATH% environment. Otherwise you need to adjust the location in the compressor settings.
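
A small Python sketch of that lookup order, just to illustrate the idea: use an explicitly configured location if there is one, otherwise search %PATH%. The function name and the executable names in the usage line are assumptions, not cmpro's actual settings.

```python
import shutil


def find_packer(candidates, override=None):
    """Return the path of the first packer found, or None.

    `override` stands in for a location set by hand in the
    compressor settings; `candidates` are executable names to
    look up on the PATH.
    """
    if override:
        return override
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


# Hypothetical executable names; adjust to the actual install.
print(find_packer(["7z", "7za"]))
print(find_packer(["rar"]))
```

When nothing is found on the PATH the function returns None, which corresponds to the case where the location has to be adjusted manually.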

So next steps: replace old custom zip routines with ziparchive class calls.....

Cassiel · « **Reply #37 on:** 12 August 2011, 12:37 »

Re output window (big slap for me), didn't occur to me it would be simply due to the font used. Obvious.

Selectable font option (i.e. Arial is included in Windows, Arial Unicode is installed by default by Office) would be helpful (and avoid such issues/questions).

Or maybe just use something like:
http://en.wikipedia.org/wiki/GNU_Unifont
by default?

Roman · « **Reply #38 on:** 12 August 2011, 12:55 »

as I said...."later"

oxyandy · « **Reply #39 on:** 12 August 2011, 14:13 »

I was thinking old school when I mentioned a 'download' for a Unicode font.
Vista onwards, should have some Unicode Fonts as standard.

EDIT: Even XP has "Arial Unicode MS Font" released as an update at one stage.
