Handling loads of data with MATLAB
by Markus Lauterbacher
The ongoing digitalization and integration produces an ever-increasing amount of data. It pains engineers to see these treasure troves gather dust. On top of that, our own simulation-based data sources produce more and more data. Besides MathWorks'® recommendable overview of coping with Big Data, I would like to share some experiences in managing and analysing mounds of data.
Get a 64-bit MATLAB and OS! Only nostalgia and legacy code justify a 32-bit MATLAB nowadays. A 2 GB memory limit per MATLAB? Unthinkable. MATLAB R2015b was the last 32-bit release anyway. As a bonus, R2015b also introduced the improved Execution Engine!
Treat your discomfort with slow HDD seeks with a swift SSD. Whether workstation or notebook, even 512 GB is cheap today. Faster program startups and higher throughput on lots of small, scattered files come with the additional advantage of superfast swap storage in case of memory overflow.
The effect of an SSD on *.mat files, however, is marginal. Computing power for compression and decompression seems to be the limiting factor.
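A rough timing sketch of this effect (file names are made up, and the '-nocompression' flag requires a release that supports it for -v7.3 files):

```matlab
% Compressed vs. uncompressed MAT-file writes. On an SSD, the
% uncompressed save is often faster, because the CPU no longer has
% to compress the data before writing it out.
data = rand(5e3);                                           % ~200 MB of doubles

tic; save('compressed.mat', 'data', '-v7.3');                   t1 = toc;
tic; save('plain.mat', 'data', '-v7.3', '-nocompression');      t2 = toc;
fprintf('compressed: %.1f s, uncompressed: %.1f s\n', t1, t2);
```

The uncompressed file is larger, of course, so this trades disk space for CPU time.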
Speaking of CPU power: we use 4-physical-core Intel processors across the board. Equipped with MATLAB's Parallel Computing Toolbox and at least 16 GB of memory, we have four MATLABs crunching numbers simultaneously. Computing-intensive, homogeneous tasks in our tools are all ready for this parallelization.
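A minimal sketch of such a parallelized task (the workload is a placeholder; parpool assumes the Parallel Computing Toolbox):

```matlab
% Start a pool of four workers, one per physical core, if none is running.
if isempty(gcp('nocreate'))
    parpool(4);
end

results = zeros(1, 100);
parfor k = 1:100
    % Stand-in for a computing-intensive, homogeneous task:
    results(k) = sum(rand(1e6, 1));
end
```

The iterations must be independent of each other; parfor then distributes them across the workers automatically.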
Data vs. Information
We often cannot distinguish relevant from irrelevant data a priori. Every bit might be important. But we can at least store the data smartly.
Numeric values in MATLAB are doubles (8 bytes) by default. In many cases single precision suffices over that conservative default: pi==3.141592653589793 becomes pi==3.1415927, and memory usage is cut in half. Besides, we do not claim our models to be so valid as to need 15 decimal places ;)
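The trade-off in one snippet:

```matlab
x  = rand(1e6, 1);        % double: 8 bytes per element, ~8 MB
xs = single(x);           % single: 4 bytes per element, ~4 MB
whos x xs                 % compare the memory footprint

fprintf('%.15f\n', pi);          % 3.141592653589793
fprintf('%.7f\n', single(pi));   % 3.1415927
```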
We compress the timeseries data of our test vehicles (with more than 900 channels) to a third of its size by recoding the information. Most channels have a high sampling rate but little information content. Transforming the data into non-equidistant timeseries (e.g. with MATLAB timeseries objects) loses no information.
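A sketch of the idea on a made-up channel: keep only the samples at which the value actually changes, then store them as a non-equidistant timeseries.

```matlab
t = (0:0.1:99.9)';                    % 10 Hz sampling, 1000 samples
v = repelem(randi(5, 10, 1), 100);    % signal changes at most every 10 s

keep = [true; diff(v) ~= 0];          % first sample plus every change
ts = timeseries(v(keep), t(keep));    % non-equidistant, lossless for
ts.Name = 'GearSelected';             % piecewise-constant channels

% A zero-order hold on ts.Time/ts.Data reconstructs the original signal.
```

For a slowly changing discrete channel like this one, well under a tenth of the samples survive, with bit-exact reconstruction.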
Finally, some hints for our toolboxes:
- When managing thousands of requirements in Stipulator, we recommend increasing the (maximum) Java heap space (512 MB)
- Install the toolboxes locally, not on a network share
- Signal processing and export in Stipulator are parallelized where possible, as are Brainer trainings and model evaluations
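For the Java heap, newer MATLAB releases expose the setting under Preferences > General > Java Heap Memory; alternatively, a java.opts file in the MATLAB startup folder passes options straight to the JVM (the 1024 MB below is just an example value):

```
-Xmx1024m
```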
PS: More methodical aspects of handling loads of data, especially with Brainer, will follow in a separate blog entry.