Handle loads of data with MATLAB

by Markus Lauterbacher

Ongoing digitalization and integration produce an ever-increasing amount of data. Engineers suffer when these treasure troves gather dust. In addition, our own simulation-based data sources generate more and more data. Besides MathWorks®' recommendable overview on coping with Big Data, I would like to share some experiences in managing and analysing mounds of data.

The Basics

Use a 64-bit MATLAB and OS! Only nostalgia and legacy code justify a 32-bit MATLAB nowadays. A 2 GB memory limit per MATLAB process? Unthinkable. R2015b was the last 32-bit release anyway, and it also introduced the improved Execution Engine!
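If you are unsure what you are actually running, a quick sanity check (note: the memory() call is Windows-only; the rest is standard MATLAB):

    % Report architecture and release; on Windows, also the usable memory
    arch = computer('arch');                 % e.g. 'win64', 'glnxa64'
    fprintf('MATLAB R%s on %s\n', version('-release'), arch);
    if ispc
        userMem = memory;                    % memory() exists on Windows only
        fprintf('Largest possible array: %.1f GB\n', ...
            userMem.MaxPossibleArrayBytes/2^30);
    end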

Cure the discomfort of slow HDD seeks with a swift SSD. Whether workstation or notebook, even 512 GB are cheap today. Faster program start-ups and higher throughput on lots of small, scattered files come with the additional advantage of superfast swap space in case of memory overflow.

The effect of an SSD on *.mat files, however, is marginal. CPU time spent on compression and decompression seems to be the limiting factor:

[Figure: SSD vs. HDD - timed load() and save() with different matrix sizes]
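You can reproduce this kind of measurement yourself; a minimal sketch, assuming targetDir points at the drive under test:

    % Time save()/load() for growing random matrices on a given drive
    targetDir = tempdir;                    % point this at the SSD or HDD
    f = fullfile(targetDir, 'bench.mat');
    for n = [1000 2000 4000]                % matrix edge lengths
        A = rand(n);                        % 8*n^2 bytes as double
        tic; save(f, 'A'); tSave = toc;
        tic; S = load(f);  tLoad = toc;
        fprintf('n = %4d: save %.2f s, load %.2f s\n', n, tSave, tLoad);
    end
    delete(f);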

Speaking of CPU power: we use quad-core Intel processors across the board. Equipped with MATLAB's Parallel Computing Toolbox and at least 16 GB of memory, we have four MATLAB workers crunching numbers simultaneously. The compute-intensive, homogeneous tasks in our tools are all ready for this parallelization.
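The pattern is as simple as it sounds; a sketch with a hypothetical expensiveTask() standing in for the real workload:

    % Spread a homogeneous workload over one worker per physical core
    if isempty(gcp('nocreate'))             % reuse an existing pool if any
        parpool('local', 4);
    end
    results = zeros(1, 100);
    parfor k = 1:100
        results(k) = expensiveTask(k);      % hypothetical task function
    end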

Data vs. Information

We often cannot distinguish relevant from irrelevant data a priori. Every bit might be important. But we can at least store the data smartly.

Numeric values in MATLAB are doubles (8 bytes) by default. In many cases single precision suffices instead of that conservative approach: pi == 3.141592653589793 becomes pi == 3.1415927, and memory usage is cut in half. Besides, we do not claim our models to be so accurate that they would need 15 decimal places ;)
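The conversion is a one-liner; a quick illustration:

    % Halve the memory footprint of a channel by casting to single
    x  = rand(1e6, 1);                      % double: 8 MB
    xs = single(x);                         % single: 4 MB
    whos x xs                               % compare the Bytes column
    fprintf('%.15f vs. %.7f\n', pi, single(pi));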

We compress the time-series data of test vehicles (with more than 900 channels) to a third of its size by recoding the information. Most channels are sampled at a high rate but carry little information. By transforming the data into non-equidistant time series (e.g. with MATLAB timeseries objects), there is no loss of information.
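A minimal sketch of the idea, assuming a piecewise-constant channel (e.g. a gear or status signal) where only value changes matter:

    % Recode an oversampled channel as a non-equidistant timeseries by
    % dropping every sample whose value did not change
    t = (0:0.001:10)';                      % 1 kHz sampling
    v = floor(t);                           % toy channel, changes once per second
    keep = [true; diff(v) ~= 0];            % first sample plus every change
    ts = timeseries(v(keep), t(keep), 'Name', 'gear');
    fprintf('%d of %d samples kept\n', nnz(keep), numel(v));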

ANDATA Toolboxes

Finally, some hints for our toolboxes:

  • When managing thousands of requirements in Stipulator, we recommend increasing the maximum Java heap space (e.g. to 512 MB; see the check after this list)
  • Install the toolboxes locally, not on a network share
  • Signal processing and export in Stipulator are parallelized where possible, as are Brainer trainings and model evaluations
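To see your current Java heap limit (it is set under Preferences > General > Java Heap Memory), a quick check from the MATLAB prompt:

    % Query the Java heap limit via MATLAB's Java bridge
    rt = java.lang.Runtime.getRuntime();
    fprintf('Java heap max: %.0f MB\n', rt.maxMemory()/2^20);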

PS: More methodological aspects of handling loads of data, especially with Brainer, will follow in a separate blog entry.
