Skip to content

Latest commit

 

History

History
237 lines (168 loc) · 12.5 KB

README.adoc

File metadata and controls

237 lines (168 loc) · 12.5 KB

License Apache%202.0 blue Maven Build

MAT File Library

Introduction

The MAT File Library (MFL) is a Java library for reading and writing MAT Files that are compatible with MATLAB’s MAT-File Format.

It’s overall design goals are; 1) to provide a user-friendly API that adheres to MATLAB’s semantic behavior, 2) to support working with large amounts of data in heap constrained or allocation limited environments, 3) to allow users to serialize custom data classes without having to convert to temporary objects.

This library currently supports the Level 5 MAT-File Format which has been the default for .mat and .fig files since MATLAB 5.0 (R8) in 1996. This includes save flags -v6 and -v7, but not -v4 or -v7.3. See MAT-File Versions for more info.

This library is free, written in 100% Java, and has been released under an Apache v2.0 license. It works with all Java versions from Java 6 to at least Java 21.

Maven Central

MFL is in the Maven central repository and can easily be added to Maven, Gradle, and similar project managers.

<dependency>
    <groupId>us.hebi.matlab.mat</groupId>
    <artifactId>mfl-core</artifactId>
    <version>0.5.15</version>
</dependency>

In case you are working with EJML matrix types, you may also want to include the mfl-ejml extension.

<dependency>
    <groupId>us.hebi.matlab.mat</groupId>
    <artifactId>mfl-ejml</artifactId>
    <version>0.5.15</version>
</dependency>

Basic Usage

The Mat5 class contains static factory methods that serve as the starting points for the public API. The basic types (e.g. Struct, Cell, Sparse, Char) follow their corresponding MATLAB semantics (e.g. a struct can’t be logical) and provide a fluent API for the most common use cases. All of the numerical types (e.g. double, uint8, int16, logical, …​) are represented by the Matrix type which offers a unified interface for handling numeric conversions.

Similarly to MATLAB, all types are internally represented as a multi-dimensional Array. For example, a scalar struct is really a 1x1 array of structs. Most types offer convenience overloads for scalar, linear, 2-dimensional, and N-dimensional access.

Below are some example snippets on how to use the API. For more examples, please refer to Mat5Examples and the various unit tests.

Creating and Writing MAT Files

A MatFile is a data structure that contains a collection of named Array variables.

// Create MAT file with a scalar in a nested struct
MatFile matFile = Mat5.newMatFile()
    .addArray("var1", Mat5.newString("Test"))
    .addArray("var2", Mat5.newScalar(7))
    .addArray("var3", Mat5.newStruct().set("x", Mat5.newScalar(42)));

The simplest way to write a MatFile to disk is to use the built-in convenience functions.

// Serialize to disk using default configurations
Mat5.writeToFile(matFile, "data.mat");

More complex use cases can leverage the Sink interface to write the MatFile to arbitrary data outputs. Sinks for various outputs (e.g. buffer, stream, or file) can be created via the Sinks factory or by extending AbstractSink.

// Serialize to a streaming file sink using the default configuration
try(Sink sink = Sinks.newStreamingFile("data.mat")){
    matFile.writeTo(sink);
}

Buffers and other sinks that need to be pre-allocated ahead of time can use MatFile::getUncompressedSerializedSize() to calculate the maximum expected size beforehand. Note that compressed data should always be smaller, but there are some corner cases where very small entries such as scalars can actually increase the result by a few bytes.

The Writer API provides further degrees of customization for writing the data. For example, the MatFile can be written to a memory-mapped file using a custom Deflate (compression algorithm) level. We can initialize the file with the maximum expected size, and then automatically truncate it once the Sink is closed.

// Serialize to memory-mapped file w/ custom deflate level
int safetySize = 256; // some (small) arrays may become larger when compressed
long maxExpectedSize = matFile.getUncompressedSerializedSize() + safetySize;
try(Sink sink = Sinks.newMappedFile(new File("data.mat"), Casts.sint32(maxExpectedSize))){
    Mat5.newWriter(sink)
        .setDeflateLevel(Deflater.BEST_SPEED)
        .writeMat(matFile);
}

Reading MAT Files

There are convenience wrappers for reading from files as well. The read API is setup such that users can navigate through known file structures without requiring casts or temporary variables.

// Read scalar from nested struct
double value = Mat5.readFromFile("data.mat")
    .getStruct("var3")
    .getMatrix("x")
    .getDouble(0);

A MatFile can also be read from a Source. Similar to Sink, a Source can represent various types of data inputs. MatFile::getEntries() can be used to iterate all variables inside a mat file.

// Iterate over all entries in the mat file
try(Source source = Sources.openFile("data.mat")){
    MatFile mat = Mat5.newReader(source).readMat();
    for (MatFile.Entry entry : mat.getEntries()) {
        System.out.println(entry);
    }
}

Advanced Filtering

In cases where users are only interested in reading a subset of entries, unwanted entries can be ignored by specifying an EntryFilter.

// Filter arrays that follow some criteria based on the name, type, dimension, or global/logical flags
try(Source source = Sources.openFile("data.mat")){
    MatFile mat = Mat5.newReader(source)
        .setEntryFilter(header -> header.isGlobal())
        .readMat();
}

The filter gets applied only at the root level, so arrays inside a struct or cell array won’t be filtered separately.

Concurrent Compression

Almost all of the CPU time spent on reading or writing MAT files is related to compression. Fortunately, root entries are compressed independently from one another, so it’s possible to do the work multi-threaded.

Users can enable concurrent reading by passing an executor service into the reader. In order to activate, the Source must also support sub-views (slices) on the underlying data (i.e. byte buffers or memory mapped files).

// Concurrent Decompression
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
try(Source source = Sources.openFile("data.mat")){
    MatFile mat = Mat5.newReader(source)
        .enableConcurrentDecompression(executor)
        .readMat();
} finally {
    executor.shutdown();
}

Concurrent writing unfortunately requires a temporary buffer for each root entry due to the size not being known ahead of time. The buffer allocation can be customized in case users want to use buffer-pools or memory-mapped buffers.

// Concurrent Compression
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
try(Sink sink = Sinks.newStreamingFile("data.mat")){
    Mat5.newWriter(sink)
        .enableConcurrentCompression(executor)
        .setDeflateLevel(Deflater.BEST_SPEED)
        .writeMat(mat);
} finally {
    executor.shutdown();
}

The table below shows a rough performance comparison of working with one of our production data logs.

Compression Size Threads Write Time Read Time

BEST_COMPRESSION

144 MB

1

280 sec

3.5 sec

BEST_COMPRESSION

144 MB

8

47 sec

0.8 sec

BEST_SPEED

156 MB

1

7.2 sec

3.6 sec

BEST_SPEED

156 MB

8

1.5 sec

0.8 sec

NO_COMPRESSION

422 MB

1

0.07 sec

0.2 sec

The data set was very multi-threading friendly (33x [95946x18] double matrices on the root level) and first loaded into memory to avoid disk access bottlenecks. The tests were done on a quad core with hyper-threading (Intel NUC6i7kyk).

Serializing Custom Classes

We often encountered cases where we needed to serialize data from an existing math library. Rather than having to convert the data into an API class, we added the ability to create light-weight wrapper classes that serialize the desired data directly.

In order for a class to be serializable, it needs to implement the Array interface (easiest way is to extend AbstractArray) as well as the Mat5Serializable interface. For examples, please take a look at the mfl-ejml module or the link(s) below:

Efficient Java Matrix Library (EJML)

EJML is a popular linear algebra library for Java. The mfl-ejml module has preliminary support for converting between MAT files and EJML data types.

The serialization wrappers are very light and serialize the contained data into the MAT File Format directly without requiring additional memory for storing any intermediate data.

// Add single EJML matrix to root level
MatFile mat = Mat5.newMatFile();
mat.addArray("DMatrix", Mat5Ejml.asArray(new DMatrixRMaj(rows, cols)));

// Add multiple EJML matrices to sub-structure
MatFile mat = Mat5.newMatFile().addArray("struct", Mat5.newStruct()
        .set("FMatrix", Mat5Ejml.asArray(new FMatrixRMaj(rows, cols)))
        .set("CMatrix", Mat5Ejml.asArray(new CMatrixRMaj(rows, cols)))
        .set("ZMatrix", Mat5Ejml.asArray(new ZMatrixRMaj(rows, cols))));

After reading a MAT File the contained Matrix types can be converted to a user supplied EJML matrix via output = Mat5Ejml.convert(matrix, output). The output matrix will be reshaped as needed.

// Convert Matrix to EJML Type
MatFile mat = Mat5.newMatFile();
DMatrixRMaj dMatrix = Mat5Ejml.convert(mat.getArray("DMatrix"), new DMatrixRMaj(0, 0));

General Notes

Memory Efficient Serialization

The MAT 5 format stores all data fields with a header tag that contains the number of bytes and how they should be interpreted. Rather than writing into temporary buffers to determine the serialized size, we added ways to pre-compute all deterministic sizes beforehand.

The only non-deterministic case is compressing data at the root level, which we can work around by writing a dummy size and overwriting it once the final size is known. Thus, enabling compression requires the root level sink to support position seeking (i.e. in-memory buffers, memory mapped files, or random access files).

Support for Undocumented Features

Unfortunately, MAT 5 files have several features that aren’t covered in the official documentation. This includes most of the recently added types (table, timeseries, string, …​), handle classes, function handles, .fig files, Simulink outputs, etc.

Our current implementation supports reading all of the .mat and .fig files we were able to generate. It also supports editing and saving of the loaded MAT files, e.g., adding entries, changing matrices, or using a different compression level. However, changes to the undocumented parts, such as setting a property on a handle class, will not be saved.

Building Sources

The created sources include unit tests that make use of Java 7 and 8 syntax, so the project needs to be compiled with at least JDK 8.

mvn package

For more information, please check the CI build-script Jenkinsfile

Acknowledgements

MatFileRW (active fork of JMatIO maintained by DiffPlug) served as an inspiration for parts of the implementation as well as a source for test data. We ended up porting and supporting all of their unit tests with the exception of Base64 MDL decoding (which we couldn’t figure out the use case for).

The implementation for reading the undocumented MCOS (MATLAB Class Object System) data is based on Matt Bauman's reverse engineering efforts as well as MatFileRW’s implementation by Matthew Dawson.

Preconditions was copied from Guava.