Tag Archives: MKVToolNix

Converting a C++ code base from boost::filesystem to std::filesystem

My project MKVToolNix has used the boost::filesystem library for a long time. It allows for easy path manipulation (e.g. “what’s the parent directory”, “does the named entry exist & is a file” or “create all directories for this path that don’t exist yet”). The central class of the library is boost::filesystem::path which encodes an absolute or relative path of an entry in the file system and offers various functions for modifying, querying or formatting it. All utility functions such as the aforementioned “create all directories” take path instances as arguments.

All of this worked mostly fine.

After release v53 of MKVToolNix a user reported an issue with using Emojis in file names. Turns out this wasn’t trivial to debug and is likely a bug in boost::filesystem or in the way I handled character encoding (more on that later).

This motivated me to look into converting MKVToolNix from boost::filesystem to std::filesystem which had been added to the C++ Standard Library with C++17. Its design is based heavily on boost::filesystem: a lot of the classes & functions are named identically within the two namespaces and most of their semantics match. As I’m eager to reduce the number of third-party libraries by using C++ standard libraries where it makes sense, converting my code base seemed the obvious step.

I mean, how hard could it be? Right?

Easy changes

A lot of changes were trivial: simply replacing boost::filesystem:: with std::filesystem:: got me pretty far.

Another easy change was to replace deprecated function names with the current ones, e.g. branch_path() with parent_path(): look up the deprecated function name in the documentation for boost::filesystem, take its replacement & swap the namespace. Done.
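
To give an idea of how mechanical this step is, here is a minimal sketch of the three operations mentioned at the beginning, written against std::filesystem. The function names (parent_of, is_existing_file, ensure_directories) are made up for this illustration and are not taken from MKVToolNix.

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem; // before the conversion: boost::filesystem

// "What's the parent directory?" (old Boost code may still call branch_path())
fs::path parent_of(const fs::path &p) {
  return p.parent_path();
}

// "Does the named entry exist & is a file?"
bool is_existing_file(const fs::path &p) {
  return fs::is_regular_file(p);
}

// "Create all directories for this path that don't exist yet."
// Throws fs::filesystem_error if a directory cannot be created.
void ensure_directories(const fs::path &dir) {
  fs::create_directories(dir);
  assert(fs::is_directory(dir));
}
```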

Encodings & character set conversions

The most glaring difference in my opinion between the two libraries is how they handle encodings/character set conversions for the path classes. In both libraries the path classes offer constructors for instances of std::string and std::wstring. Internally the classes use either std::string or std::wstring for storage depending on the operating system they’re compiled for: wide strings (std::wstring) for Windows and narrow ones (std::string) everywhere else.

This means that the constructors taking the string type that doesn’t match the storage type (std::string on Windows, std::wstring everywhere else) must do character set conversion, either from narrow to wide (Windows) or wide to narrow (Linux). This is where things get really dicey.

boost::filesystem solves this by letting the user imbue the whole boost::filesystem::path class with a C++ locale object. That locale object is used for the conversion. The programmer has to set this up once per program invocation, and that’s it. Easy.

MKVToolNix’s internal string handling uses UTF-8 encoded narrow strings everywhere. For file names, too[1]. Therefore the logical choice was to imbue boost::filesystem::path with a UTF-8 character set conversion locale. It would convert between UTF-8 encoded narrow strings and wide strings.

std::filesystem does not do it that way. At all. For the constructor taking char-based narrow strings (pointer-to-char or std::string) conversion from the native narrow encoding is done. On POSIX systems this means no conversion which matches how MKVToolNix is coded: std::filesystem::path receives a UTF-8 encoded narrow string & stores it as-is internally.

On Windows, though, this means that conversion is done from a character set such as Windows-1252 on a German Windows. Oooh boy, this is bad, as that isn’t at all how MKVToolNix encodes narrow strings. So what happens when a UTF-8 encoded narrow string containing non-ASCII characters is passed to std::filesystem::path? Mangled non-ASCII characters, of course. German Umlaute, French accents, Asian characters, Emojis: doesn’t matter. Encodings don’t match, conversion does bad things, file names are messed up.

In effect this means that I must never, ever use the narrow string constructor of std::filesystem::path on Windows except for when I know the narrow string only contains ASCII characters.

Implicit conversion

This thing wouldn’t be so bad for me if the constructors of std::filesystem::path were explicit. They’re not, though, meaning string-to-path conversions can hide in various places and are hard to spot. For example:

void do_stuff(const std::filesystem::path &p) {
  // do something with p
}

void chunky_bacon(const std::string &file_name) {
  // do chunky things
  do_stuff(file_name);
  // do bacon things
}

Here’s one of those conversions that absolutely must not happen on Windows, due to the implicit constructor.

Another example:

std::filesystem::path dir;
// set dir somehow
auto full_file_name = dir / sub_dir / "index.txt";

Is this OK? Well, depends on the type of sub_dir, actually. If it is a path already, it’s OK; if it’s a wide string, it’s OK, too, but all narrow string types (char arrays, std::string)? Not so much. And this is really hard to spot or grep for.[2]

I’d really appreciate if my compiler was able to help me out here. Having the constructors explicit would mean neither of these examples would compile and I’d need to add an explicit conversion myself. But the C++ committee didn’t make them explicit, probably for a reason.
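
For individual functions, some of that compiler help can be had by placing deleted overloads for the narrow string types next to the overload taking a path. This is only a sketch of a general C++ technique, not something MKVToolNix is known to do. do_stuff is the function from the first example; the can_call helper is hypothetical and exists solely to show the effect at compile time.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

void do_stuff(const std::filesystem::path &) {
  // do something with the path
}

// For narrow strings these overloads are exact matches, whereas the path
// overload needs a user-defined conversion. Overload resolution therefore
// picks the deleted functions, and the call fails to compile.
void do_stuff(const std::string &) = delete;
void do_stuff(const char *) = delete;

// Compile-time probe: is do_stuff(T) a valid call?
template <typename T, typename = void>
struct can_call : std::false_type {};

template <typename T>
struct can_call<T, std::void_t<decltype(do_stuff(std::declval<T>()))>>
  : std::true_type {};

static_assert(can_call<std::filesystem::path>::value, "paths are accepted");
static_assert(can_call<std::wstring>::value, "wide strings still convert");
static_assert(!can_call<std::string>::value, "std::string is rejected");
static_assert(!can_call<const char *>::value, "char pointers are rejected");
```

This has to be repeated for every function taking a path, and other narrow types such as std::string_view would still convert silently, so it’s a partial remedy at best. It also doesn’t help with operator/ from the second example.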

Solutions

My solution was to ban the use of the constructors[3]. Instead I introduced helper functions to_path() that take both narrow & wide strings & assume narrow strings are UTF-8 encoded. These helper functions work differently depending on whether I’m compiling for Windows or other systems.

Let’s take the second example from above. I’d convert it as follows:

auto full_file_name = dir / mtx::fs::to_path(sub_dir) / "index.txt";

Ugly as sin.
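
For illustration, such helpers could be sketched as follows. This is not MKVToolNix’s actual implementation: the namespace sketch is made up, and instead of hand-written platform-specific code it leans on std::filesystem::u8path(), which C++17 provides for exactly the “narrow string is UTF-8” case and which does the platform-dependent part internally. Note that u8path() is deprecated since C++20 in favour of char8_t, so newer compilers may emit a warning.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace sketch {

// Narrow strings are always treated as UTF-8, never as the native narrow
// encoding. On POSIX systems the bytes are stored as-is; on Windows they're
// converted from UTF-8 to wide characters.
inline std::filesystem::path to_path(const std::string &utf8) {
  return std::filesystem::u8path(utf8);
}

inline std::filesystem::path to_path(const char *utf8) {
  return std::filesystem::u8path(utf8);
}

// Wide strings are handed to the path constructor unchanged; on Windows
// they already are the native storage type.
inline std::filesystem::path to_path(const std::wstring &wide) {
  return std::filesystem::path{wide};
}

} // namespace sketch
```

With these in place the only remaining implicit conversions are the ones from ASCII-only string literals.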

Bugs in the C++ Standard Library

std::filesystem is rather new, having been added in C++17. I was prepared for bugs to rear their ugly heads. But I wasn’t prepared for how broken certain parts of the gcc standard C++ library are on Windows with respect to UNC paths[4]. A UNC path might look like this: \\server\share\sub_dir\file.txt

In no particular order, here are a couple of issues I ran into (this is with gcc 10.2.0, the latest release at the time of writing):

  1. On Windows the std::filesystem::file_size function was implemented by calling 32-bit functions that use signed 32-bit variables for storing the file size. Needless to say, this doesn’t work too well with files larger than 2 GB. The corresponding gcc bug 95749 has already been fixed upstream, but it isn’t part of a release yet. My workaround: I cherry-picked the commit fixing the issue into my own copy of gcc I’m using for building MKVToolNix.
  2. std::filesystem::exists doesn’t work correctly with all UNC paths. It works with e.g. \\server\share\sub_dir\file.txt but not the share \\server\share itself. I’ve reported this as gcc bug 99311. My workaround: I don’t use std::filesystem::exists; instead I test for the type I expect (e.g. std::filesystem::is_directory or std::filesystem::is_regular_file) as those functions do work correctly. Even with UNC paths.
  3. As a consequence std::filesystem::create_directories doesn’t work with UNC paths either. It thinks that neither \\server nor \\server\share exist and tries to create them. That fails, obviously, and create_directories aborts at the first failure. My workaround: I NIH’ed my own create_directories function that uses std::filesystem::is_directory for testing for existence. Only on Windows, of course. Reported in the same bug as the one above.
  4. std::filesystem::absolute and std::filesystem::path::is_absolute don’t work on UNC paths. They think those paths aren’t absolute and will do funky things such as return C:\server\share\file.txt when asked for the absolute path to \\server\share\file.txt. Of course I reported this, too; it’s gcc bug 99333 where I was told that it was actually a bug in the std::filesystem::path::has_root_name function. My workaround: I NIH’ed some more functions that treat paths starting with \\ or // as absolute.
  5. UNC paths using forward slashes are only supported if the program is compiled with the cygwin gcc compiler. My workaround: in my NIH’ed functions I normalize forward slashes to backslashes.
  6. UNC device paths starting with \\?\… such as \\?\C:\Path\File.mkv or \\?\UNC\server\share\file.opus don’t work at all. Functions such as std::filesystem::exists() don’t work, and even my workaround from above (using std::filesystem::is_directory() or std::filesystem::is_regular_file()) doesn’t work with these paths. I’ve filed gcc bug 99430 for this. My workaround: at the moment I haven’t implemented one. I might replace \\?\… with \\.\… in my conversion functions as those seem to fare better.
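
The workarounds for items 2 to 5 could look roughly like the following. This is a sketch under assumptions, not the code from MKVToolNix: all function names are invented, the existence test only cares about directories and regular files, and the string helpers are meant for Windows builds where wide strings are the native type.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Item 2: test for the expected types instead of calling fs::exists().
bool entry_exists(const fs::path &p) {
  std::error_code ec;
  return fs::is_directory(p, ec) || fs::is_regular_file(p, ec);
}

// Item 3: create missing directories from the top down, using
// fs::is_directory() as the existence test. Returns true if dir is a
// directory afterwards.
bool create_missing_directories(const fs::path &dir) {
  std::error_code ec;
  if (dir.empty() || fs::is_directory(dir, ec))
    return true;

  auto parent = dir.parent_path();
  if ((parent != dir) && !create_missing_directories(parent))
    return false;

  fs::create_directory(dir, ec);
  return fs::is_directory(dir, ec);
}

// Item 5: normalize forward slashes to backslashes before building a path.
std::wstring normalize_slashes(std::wstring name) {
  std::replace(name.begin(), name.end(), L'/', L'\\');
  return name;
}

// Item 4: treat everything starting with \\ or // as absolute.
bool is_absolute_or_unc(const std::wstring &name) {
  bool unc = (name.size() >= 2) && (name[0] == name[1])
          && ((name[0] == L'\\') || (name[0] == L'/'));
  return unc || fs::path{name}.is_absolute();
}
```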

As I wrote above, this is gcc 10.2.0. Your compiler might fare better, of course. I suspect that Microsoft’s Visual C++ has fewer of those Windows-specific issues, simply due to the amount of experience its developers have with Windows’s peculiarities.

Bugs in MKVToolNix

Most bugs that were actually bugs in MKVToolNix instead of the standard C++ library turned out to be due to using the implicit conversion of narrow strings to path objects that I’ve talked about at length above.

There was one bug that was due to changed semantics between the two libraries: what happens if your path is at the root already (e.g. C:\ on Windows) and you call parent_path() on it? boost::filesystem will return an empty path object whereas std::filesystem will return the same path again. This means that loops checking the file system from a point to root have to have their conditions changed. Instead of e.g. while (!path.empty()) you’ll have to do something like while (path.parent_path() != path). It gets even nastier with respect to UNC paths; see this chart for details. I definitely forgot to fix a couple of these cases, leading to endless loops.
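
A minimal sketch of such a loop with std::filesystem semantics might look like this. The function name path_and_parents is made up; the sketch keeps the empty() test for relative paths and adds the “root is its own parent” test for absolute ones.

```cpp
#include <cassert>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Collects p and all of its parents. Absolute paths end at the root,
// relative paths at their first component.
std::vector<fs::path> path_and_parents(fs::path p) {
  std::vector<fs::path> result;

  while (!p.empty()) {
    result.push_back(p);

    auto parent = p.parent_path();
    if (parent == p) // std::filesystem: the root's parent is the root itself
      break;

    p = parent;
  }

  return result;
}
```

Without the parent == p test the loop would never terminate for absolute paths, which is exactly the kind of endless loop described above.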

macOS woes & older Linux versions

One problem on macOS is that different macOS versions have different levels of support for C++17. Until this change I used to build for macOS 10.14 “Mojave” and newer, as those supported all C++17 features I used. However, std::filesystem, while being a C++17 feature, is only supported on macOS 10.15 “Catalina” and the current 11.0 “Big Sur”. One of the drawbacks of using current technology.

This is a bit different to older Linux distributions. Let’s take CentOS 7, a rather old distribution. I can still compile current MKVToolNix releases there by installing the latest developer toolset which comes with a current gcc & libstdc++. For macOS the compiler version doesn’t suffice, though: even if I install a recent Xcode version on macOS 10.14, I cannot build code using std::filesystem for it as the standard C++ library on macOS 10.14 doesn’t contain that functionality. I haven’t bothered investigating if it’s possible to include a newer standard C++ library itself in the MKVToolNix disk image.

And on Linux? Was it worth it?

Well… I haven’t had a single issue so far with the conversion on Linux.

And as to the question whether or not it was worth it: hmm… meh. I really like to reduce my number of external dependencies. That is a tangible gain. It simplifies the build process & reduces build times in various situations where Boost isn’t available as I’m now down to solely using header-only libraries from Boost. And I actively take part in improving the gcc standard C++ library implementation. This is still Open Source, after all; you’re expected to give back & help out. This is how all of our projects grow.

Footnotes

1 Yes, I know that’s bad as file systems on Unix aren’t guaranteed to have any type of encoding; file names might contain something that isn’t valid according to any encoding etc. etc.
2 The "index.txt" argument is fine. Even though it’s a narrow string, it solely consists of ASCII characters that’ll convert properly to wide strings no matter what Windows’ native narrow encoding is. I hope.
3 C++20 will improve this situation with the introduction of char8_t. The constructor taking a char8_t-based narrow string will always convert from UTF-8, avoiding this mess.
4 I use a gcc/mingw cross-compiler from the MXE project for cross-compiling from Linux to Windows. The same issues would happen compiling with mingw on Windows itself.

MKVToolNix v54.0.0 released

Heya,

again I’m releasing a bit early, not even four weeks after the previous one. This release, however, does pack quite a bit more of a punch than the previous ones, both in terms of enhancements and bug fixes. On top of that one of the libraries used (libEBML) has just been released fixing several heap overflow bugs, and I didn’t want to wait too long to get those fixes into a new MKVToolNix release.

There have been several changes concerning package maintainers. Please refer to the NEWS below for details.

You can download the source code or one of the binaries. The Windows and macOS binaries as well as the Linux AppImage are available already. The other Linux binaries are still being built and will be available over the course of the next couple of hours.

Here are the NEWS since the previous release:

New features and enhancements

  • mkvmerge: added support for using ISO 639-3 language codes in IETF BCP 47 language tags. Part of the implementation of #3007.
  • mkvmerge: AC-3 parser: added support for byte-swapped AC-3 data. Implements #3022.
  • mkvmerge: Matroska reader: for audio tracks that have the bit depth track header set mkvmerge will now keep that header even for codecs that don’t require it for decoding. Implements #3009.
  • mkvmerge: MPEG transport stream reader, PCM audio tracks: mkvmerge will now re-order the channels for 5.1, 7.0 and 7.1 channel tracks from the Blu-ray layout to the WAVEFORMATEXTENSIBLE layout expected in Matroska. Patch by Tom Yan. Implements #2988.
  • mkvmerge, mkvinfo, mkvpropedit, MKVToolNix GUI: added support for the following new track header elements: "hearing impaired" flag, "visual impaired" flag, "text descriptions" flag, "original" flag, "commentary" flag. Implements #3011.
  • MKVToolNix GUI: added support for using ISO 639-3 language codes in IETF BCP 47 language tags. As there are several thousand of them, they’re deactivated by default and must be activated in the preferences ("GUI" → "Often used selections" → "Languages"). Part of the implementation of #3007.
  • MKVToolNix GUI: multiplexer: when adding Blu-rays the user can select multiple playlists to add simultaneously in the "select playlist to add" dialog. Implements #2961.
  • MKVToolNix GUI: multiplexer: the file name extensions "eb3" and "ec3" were added for Dolby Digital Plus & "mlp" for Dolby TrueHD in the file dialogs. Part of the implementation of #3027.
  • MKVToolNix GUI: multiplexer: when adding multiple files the dialog asking the user what to do with them has gained a new checkbox. If enabled, all files containing at least one video track will always be placed in newly created multiplex settings. Implements #2966.
  • MKVToolNix GUI: multiplexer: added a menu entry in the "Multiplexer" menu for adding all files that are currently in the clipboard. Implements #3006.

Bug fixes

  • all: Windows: fixed compatibility with gettext 0.21 and newer on mingw.
  • all: Windows: fixed several of the programs having problems with certain Unicode characters (primarily emojis) in file names (e.g. mkvextract wrongfully complaining about an "invalid mode" or the GUI not being able to find parts of Blu-ray file structures).
  • mkvextract: AAC: fixed wrong channel mask field in the ADTS headers for 7.1 channel layouts. Fix by Tom Yan. Fixes #2636.
  • mkvextract: h.265/HEVC extraction: if the first frame starts with the parameter sets (SPS, PPS & VPS), the ones from CodecPrivate aren’t written and the ones from the first frame are kept. Fixes #3031.
  • mkvmerge: fixed the calculation of chapter timestamps read from NTSC DVDs. Fix by Tom Yan.
  • MKVToolNix GUI: IETF BCP 47 language widget: the language combo box will now always contain the language code the user enters in the free-form field, even if it isn’t in the list of often-used languages the user configured in the preferences.
  • MKVToolNix GUI: multiplexer: when browsing for the destination file name the default directory is now chosen according to the preferences regarding how the destination file name should be formed. For example, if the policy is set to "fixed output directory" then that output directory will be the one initially set when the directory selection dialog is opened. Fixes #3021.
  • MKVToolNix GUI: multiplexer: fixed the removal of appended source files if the "delete source files" end-of-job action is enabled. Fixes #3029.
  • MKVToolNix GUI: chapter editor: when importing chapters from DVDs the IETF BCP 47 language elements will be set, too, not just the legacy language elements.

Build system changes

  • libEBML v1.4.2 and libMatroska v1.6.3 are now required. The optional, bundled copies of both libraries have been updated to those versions. This bump in requirements fixes several heap overflow bugs in libEBML.
  • MKVToolNix is now using the C++17 library feature "file system library" instead of Boost’s "file system" and "system" libraries. For the GNU Compiler Collection (gcc) libstdc++ this means v8 or newer is required; for clang’s libc++ it means v7 or newer. For macOS this means that the provided disk image will only run on 10.15 "Catalina" or newer.

Have fun :)

MKVToolNix v53.0.0 released

Heydiho,

trying to keep up my usual schedule in 2021 of about four to five weeks between releases. This time it’s even a bit less than four weeks. Several bug fixes and small enhancements were made.

There have been no changes for package maintainers.

You can download the source code or one of the binaries. The Windows and macOS binaries as well as the Linux AppImage are available already. The other Linux binaries are still being built and will be available over the course of the next couple of hours.

Here are the NEWS since the previous release:

New features and enhancements

  • mkvmerge: AVI reader: added support for reading the video aspect ratio from the video properties header (vprp chunk) if present and setting the display dimensions accordingly. Implements #2993.
  • mkvmerge: MP4 reader: for h.264/AVC tracks that don’t have an AVCConfigurationBox (avcC atom) in their sample description (stsd) atom or whose avcC atom contains no content mkvmerge will now re-derive the AVCConfigurationBox from the bitstream. Implements #2995.
  • mkvextract: mkvextract will now check if any of the destination file names is the same as the source file name and abort with an error if that’s the case. Implements #3001.
  • MKVToolNix GUI: when querying the user for a file name for saving things (e.g. multiplexer settings or an attachment in the header editor), the automatically suggested file name will now be based on the situation-specific file names (e.g. the destination file name for multiplexer settings or the attachment’s name when saving an attachment in the header editor) instead of the directory’s name. Implements #3012.
  • MKVToolNix GUI: multiplexer: when deriving track languages from file names the GUI will now select the right-most match instead of the left-most one. For example, "La.vie.en.rose.(fr).srt" will now be detected as French (fr) instead of English (en). Implements #3013.
  • MKVToolNix GUI: preferences: the items in the "pre-defined …" lists can now be renamed by double-clicking with the mouse or pressing the F2 key.
  • Windows installer: the bluray_dump command-line utility will be installed into the tools sub-directory. bluray_dump can read & dump certain file types used on Blu-rays: .mpls playlists, .clpi clip information databases, .bdmv index files, bdmt_….xml disc library databases and tnmt_….xml track & chapter name databases.

Bug fixes

  • mkvmerge: stretching chapter timestamps with --chapter-sync now works correctly with floating point values including fractions of floating point numbers (e.g. 12.3/45.67). The tooltips in the GUI have been adjusted accordingly. Fixes #3002.
  • mkvmerge: MPEG 1/2 video handling: the "default duration" header field was often half the value it actually should be, resulting in all video frames having an explicit block duration with the correct value. This has been fixed with a patch by Tom Yan.
  • mkvmerge: MPEG 1/2 video handling: the data stored in Codec private and Codec state doesn’t contain extensions other than sequence & sequence display extensions anymore. Fix by Tom Yan.
  • mkvmerge: tag handling: when remuxing a Matroska file with the --no-track-tags option, existing SOURCE_ID track tags are now skipped, too.
  • MKVToolNix GUI: multiplexer: the drop-down boxes with pre-defined track names now follow the order set in the preferences instead of sorting the entries alphabetically. Fixes #2999.

Have fun :)

MKVToolNix v52.0.0 released

After three months of not being motivated to spend a significant amount of time on coding outside of work, I finally found some of that motivation again. The result is release v52.0.0 of MKVToolNix. It contains a couple of enhancements to the GUI and one bug fix in libEBML that’s security-sensitive.

You can download the source code or one of the binaries. The Windows and macOS binaries as well as the Linux AppImage are available already. The other Linux binaries are still being built and will be available over the course of the next couple of hours.

Here are the NEWS since the previous release:

New features and enhancements

  • MKVToolNix GUI: job queue: the maximum number of jobs to run concurrently can now be increased in the preferences. The default remains at 1. Implements #2984.
  • MKVToolNix GUI: the GUI will now add a context-specific default extension to file names selected for saving on platforms that don’t add one itself (e.g. GNOME). For example, when saving multiplexer settings the extension .mtxcfg will be added. Implements #2983.
  • MKVToolNix GUI: added an option to the preferences for the window to stay on top of other windows. Implements #2967.

Bug fixes

  • mkvextract: h.265/HEVC extraction: the code for skipping extraction of prefix SEI NALUs in the first frame was skipping two bytes too few, resulting in broken processing of all following bytes. Patch by Mike Chen.
  • libEBML: the optional, bundled version of libEBML was updated to v1.4.1.

Build system changes

  • libEBML v1.4.1 is now required due to a bug in libEBML that caused pointers to just-freed memory to be returned to the caller under certain invalid data constellations, causing use-after-free errors in all of MKVToolNix’s programs. Fixes #2989.

Have fun :)