Rebuilder V0.07

Little tool to rebuild MAME (https://www.mamedev.org/) machine sets.

(c) 2022 - 2023 Roman Scherzer

This software is freeware.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For (c) and licence information of the used 3rd party tools, please refer to the bottom of this document.

The story so far

Another rebuilder?

Well, actually you might know that I started clrmame(pro) back in 1997. Back then the MAME world was different, and so was the C++ world. clrmamepro grew over decades while MAME added things like crc32, md5 and sha1 hashes, merge modes, CHDs, devices, XML output and so on. And then there were other projects which also wanted to use clrmamepro for their purposes, so lots of requests were added.

The downside of this: maintaining the code is pretty tough. I'm pretty bored looking at that old code base, so it is time for something new. It feels a bit like 1997 again: code something for me, for fun. It's fun again to see that 'modern' C++ and 3rd party tools make life way easier. Fewer options, faster, no MS Windows specific coding, completely new ideas on how to keep data in memory or how to handle different merge modes... so it ended (for now) with a commandline utility which does the rebuild job. Maybe some people will enjoy it even if it is -or especially because it is- commandline based.

Time will tell if this little project turns into a new clrmame in total (with a scanner, a UI, a profiler, but definitely no merger). For now it's more or less an experiment.

So, if you're not bored yet, read on.

How the rebuilder works

The rebuilder analyzes each file (or file within an archive) in the input folder and tries to match each file's hash/size against the loaded database. Each found match (there can be multiple, think e.g. of a file which is shared by 10 machines) will be added to the output folder under the correct file name and machine name. Optionally, rebuilt files (and empty folders) are removed from the input folder. If your input folder holds an archive with another archive or a CHD inside, this inner archive is unpacked to your temporary folder. Such temporary files are removed at the end of a rebuilding process. By default the system's temporary folder is used. You can change the temporary folder by modifying the "Rebuilder/TempFolder" entry in settings.xml. This can be very useful if your system's temporary folder is on a slower disk; selecting a custom temporary folder on a faster one can speed up rebuilding.

By default, the rebuilder matches files by crc32/sha1 and size. There is an option (-s, --sha1) which lets you choose between no sha1 checks, input, output and both. Enabling a sha1 mode is of course slower since it needs additional time for decompressing the archived data and for the actual hash calculation, but it is more accurate. If the datfile specifies different sha1 values for one and the same crc32 and you have disabled the input sha1 check, the sha1 value is taken into account anyway.

This rebuilder is also able to rebuild CHD files (CHD format version 3 onwards). For such files, the sha1 information from the CHD header is used for a match.

If an archive/folder already exists in the destination, the matched file is only added if it does not exist there yet. If there is an existing file which does not match the right hash, this existing file is moved to the backup folder and the found match replaces it in the rebuilder destination. The backup folder is generated inside the output folder and is called 'backup'.

This rebuilder can also identify source archives which already match an output machine completely. If the destination does not exist yet, such archives are copied directly.

The rebuilder runs through various phases. First it loads the datfile, checks source/destination paths for existence and builds merge mode views. Then it runs through the input folder. Be patient, this can take some time, especially if you scan lots of files. After that the output is checked. If no output exists yet or it only contains a few files, this should be very fast. An optimize phase is next and -if possible- the rebuilder then copies archives which can be copied directly. Finally, the actual rebuild is done, followed by a cleanup step at the end.

The commandline options

The program allows the following commandline parameters:

-x, --xml: Here you specify the xml file which holds your database. This option is mandatory. See below for the supported types of XML files.

-i, --input: The rebuilder input folder. This folder is checked for matching data. This option is mandatory and the folder has to exist.

-o, --output: The rebuilder output folder. The matched data is copied/moved to this folder. This option is mandatory. If the folder does not exist, it will be created. Note: When using -r, --recursive, the rebuilder output path can't be a subfolder of the rebuilder input path. A 'backup' folder, which holds replaced files, is also generated in this folder, but only when a file actually gets replaced. If it exists but is empty, it is removed at the end.

-m, --mode: Your preferred output merge mode. See below for what the three modes are. This is an optional setting; the default value is split. You can use split, full or standalone.

-p, --pattern: With this option you can specify an output pattern, basically path information which is inserted between the output folder and the machine name. See below for examples.

-c, --compress: This defines your preferred output compression method. This is an optional setting. Default is zip, which keeps your files in zip archives. You can use zip, rezip, 7z or none; the latter keeps your machines decompressed. rezip always recompresses the destination files, so a direct copy of archives is not performed.

-f, --filter: Here you can specify a regular expression on the machine name. Only matching entries from the loaded xml will then be taken into account during rebuild. This is optional; by default no filter is applied.

-d, --delete: With this option, rebuilt input files are removed. Be warned, they are gone! If the last file from an input archive or folder is removed, this archive/folder is also removed. Deletion is optional and disabled by default.

-s, --sha1: This turns additional sha1 matching of input and/or output files on or off. Enabling it is slower but more accurate. Default is input. You can use none to fall back to simple crc32/size checks, input to do sha1 checks on the input file only, output to do sha1 checks on a possibly existing output file only, and both which combines input and output. -s none is the fastest mode, but keep in mind that, depending on the files you're scanning, you may run into crc32 matches where in fact the sha1 would not match.

-r, --recursive: Turn this on if you want to run through your input folder and all of its subfolders.

-l, --loglevel: Specify the detail level of the output. By default this is set to info. You can use err, warn, info or trace, where the latter additionally lists source file and rebuild information. info shows you a little progress bar here and there and gives you some updates when reading folders. If you redirect your output, progress bars and updating file counts are not visible.

-u, --uselinks: Default is none; hard or sym are possible other values. Turn this on if you want to generate a filesystem hard or symbolic link instead of doing a file copy operation. This takes place when copying archives 1:1, copying CHDs or copying single unpacked files from a source to the target. Keep in mind that there are general and type-specific limitations to the use of links, such as volume restrictions and access rights.
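
As an illustration of how these options combine (the paths are just placeholders, reusing the ones from the examples further below), a call could look like this:

rebuild.exe -x e:\MAME\244.xml -i f:\roms -o f:\mame\roms -r -m full -s both -l trace -u hard

This would rebuild f:\roms and all its subfolders into fully merged sets, check sha1 on both input and output files, print detailed trace output and use hard links where a plain copy would otherwise be done.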

Supported XML file types (-x, --xml)

Currently three types of xml files are supported:

You can create an xml file by running MAME from the commandline interpreter and redirecting its output, e.g. mame.exe -listxml >245.xml.

When loading an input file you might see some warnings. For a standard MAME -listxml you will e.g. see sample-specific warnings. These are mainly about sample relationships from machines to a sample parent machine which is not available in the XML. Such sample-only sets are generated automatically so that the assignment is correct again. Similar warnings exist for the use of samples which aren't available in the sample parent set. This is also fixed internally.

XML, Input, Output

The three things you need to specify are

Examples:

Load a MAME software list xml and rebuild from C:\Users\FooBar\Downloads to f:\softwarelists\a2600_cass: rebuild.exe -x e:\MAME\hash\a2600_cass.xml -i C:\Users\FooBar\Downloads -o f:\softwarelists\a2600_cass

Load a MAME -listxml xml and rebuild from f:\roms and all its subfolders to f:\mame\roms: rebuild.exe -x e:\MAME\244.xml -r -i f:\roms -o f:\mame\roms

Load a MAME -listxml xml and rebuild from f:\roms and all its subfolders to f:\mame\roms and remove all rebuilt files: rebuild.exe -x e:\MAME\244.xml -r -i f:\roms -o f:\mame\roms -d

Rebuilding is more or less a copy operation of files, and when it comes to CHDs we are even talking about huge files. If your input and output folders are on the same ssd/hd, you will create quite a lot of I/O traffic. Ever tried copying (not moving) a multi-GB file on one and the same hd? It usually crawls. So be aware of this before you try to rebuild a complete MAME collection from one folder to another.

Modifying the output (-p, --pattern, -c, --compress)

Generally you can define how the output should be stored: either compressed (as zip or 7z archives) or decompressed as files and folders.

Create decompressed output: -c none

Create zip archives: -c zip

Create 7z archives: -c 7z

Always recompress zip archives: -c rezip

There is another option to modify the output by defining some patterns which can be used to add additional folders to the output: -p sub1/sub2/sub3

This will add three levels of subfolders to the given rebuilder output root. Assuming you specified e:/temp as output folder, your machine sets will then be placed in e:/temp/sub1/sub2/sub3. While folder separator characters (/) are allowed, . or .. cannot be used here.
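
For example, to place everything below e:/temp/sub1/sub2/sub3 and store the sets as 7z archives, a call could look like this (paths are placeholders again):

rebuild.exe -x e:\MAME\244.xml -i f:\roms -o e:/temp -p sub1/sub2/sub3 -c 7z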

Way more interesting are predefined patterns which can be used with the -p command. You can use:

If you're loading a software list collection datfile, you automatically have a pattern of #SOFTLIST# active internally as top level.

Examples: If you want to split up your collection by manufacturer and by year: -p #MANUFACTURER#/#YEAR#. If you want to split it up by something which is nearly identical to clrmamepro's system default paths: -p #BIOSSPLIT#
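
A full commandline for the manufacturer/year split above could look like this (paths are placeholders, reusing the ones from the earlier examples):

rebuild.exe -x e:\MAME\244.xml -i f:\roms -o f:\mame\roms -p #MANUFACTURER#/#YEAR#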

Destination machine folders and/or archives get the current timestamp. Files inside a set folder or inside an archive keep their original timestamp.

If you're not happy with the prefix names (e.g. #device), you can alter them in settings.xml (see below).

Merge Modes (-m, --mode)

A merge mode defines how your stored machines are bundled. Some machines share a parent/clone relationship which is specified in the underlying datfile. Depending on the chosen mode, such machines can be merged together.

We differ between:

Limit your output (-f, --filter)

With the -f, --filter option you can filter the loaded XML down to a subset of machines. You define a regular expression which is matched against the machine name. So for example, if you only want to rebuild "pacman", you can simply add: -f pacman

If you want to rebuild all machines which start with 'pac', you can write: -f pac.*

For filtering only pacman and outrun, you'd write: -f pacman|outrun
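
A complete call for the pacman/outrun filter could look like this; note that depending on your shell you may have to quote the expression so that the | character is not treated as a pipe:

rebuild.exe -x e:\MAME\244.xml -i f:\roms -o f:\mame\roms -f "pacman|outrun"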

Settings

settings.xml is created/loaded on startup and allows you to alter some settings. Currently you can only change the default pre-strings for the -p command's #TYPE# and #BIOSSPLIT# values. So e.g. you can change #default to 'StandardSets' or similar. Only valid path characters are allowed; illegal values won't be accepted and the defaults are used again. You can also specify a temporary folder here which is used for temporary decompression purposes (e.g. when an archive is inside an archive) or when data needs to get recompressed.
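
A minimal sketch of what the "Rebuilder/TempFolder" entry mentioned earlier could look like in settings.xml. This is only an illustration; the generated file contains more entries and its exact layout may differ, so best edit the file the program creates for you instead of writing it from scratch:

<Rebuilder>
  <!-- d:\fasttemp is just a made-up example path on a faster disk -->
  <TempFolder>d:\fasttemp</TempFolder>
</Rebuilder>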

History

2023-11-28 V0.07 released

2023-05-04 V0.06 released

2023-04-14 V0.05 released

2023-03-12 V0.04 released

2022-10-05 V0.03 released

2022-08-16 V0.02 released

2022-07-13 V0.01 released

Since there is no scanner yet, can I scan my rebuilt sets with clrmamepro?

Yes, but be aware of the following:

What are the benefits over clrmamepro?

Besides the -in my opinion- way better code, the following is worth mentioning:

On the other hand I of course understand if users start to moan "but it does not have feature x and y", "it does not support datfile type z" and so on. Yes, this is the case but currently I don't want to implement requests which might be used by 1% of the users. Time will tell what comes next.

Future Plans / Source Availability

At this stage of the project it is closed source, mainly due to the use of the full-version licence of ZipArchive.

There are definitely plans to go open source in the future. Currently there are discussions about some licences (e.g. the free version of ZipArchive is currently GPL).

Things I'm interested in:

Things I'm not interested in:

Bug Reporting / Donation

If you found something spooky or have problems, feel free to use the clrmamepro forum: https://www.emulab.it/forum/index.php?board=6.0

If you're totally happy with it, feel free to donate ;-) https://mamedev.emulab.it/clrmamepro/#donate

Third party licence information

Zip Handling: ZipArchive

ZipArchive Library 4.6.9 Copyright (c) Tadeusz Dracz

Currently the 'full version' licence is used, which makes the product closed source for now.

7z/Rar Handling: Bit7z

Bit7z v4.0.4 Copyright (c) 2014 - 2023 Riccardo Ostani

https://github.com/rikyoz/bit7z/blob/master/LICENSE

MPLv2 License

You can obtain a copy of the MPLv2 License here https://mozilla.org/MPL/2.0/

7z.dll is used which is part of the 7-Zip program. 7-Zip is licensed under the GNU LGPL license. You can find 7-zip including source code at https://www.7-zip.org

CLI Parser: CLI11

CLI11 2.3.1 Copyright (c) 2017-2023 University of Cincinnati, developed by Henry Schreiner under NSF AWARD 1414736. All rights reserved.

https://github.com/CLIUtils/CLI11/blob/main/LICENSE

Redistribution and use in source and binary forms of CLI11, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

XML Parser: PugiXml

pugixml 1.14 Copyright (c) 2006-2023 Arseny Kapoulkine

https://github.com/zeux/pugixml/blob/master/LICENSE.md

MIT License (MIT) (see below)

Logging: SpdLog

SpdLog 1.12.0 Copyright (c) 2016 Gabi Melman.

https://github.com/gabime/spdlog/blob/v1.x/LICENSE

MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

SHA1 calculation: CSHA1

CSHA1 2.1

100% free public domain implementation of the SHA-1 algorithm by Dominik Reichl dominik.reichl@t-online.de