Handle loads of data with MATLAB
by Markus Lauterbacher
Ongoing digitalization and integration produce an ever-increasing amount of data, and engineers suffer when these treasure troves gather dust. On top of that, our own simulation-based data sources generate more and more data. Besides MathWorks'® recommendable overview on coping with Big Data, I would like to share some experiences in managing and analysing mounds of data.
The Basics
Use a 64-bit MATLAB and OS! Only nostalgia and legacy code justify a 32-bit MATLAB nowadays. A 2 GB memory limit per MATLAB process? Unthinkable. MATLAB R2015b was the last 32-bit release anyway, and it also introduced the improved execution engine.
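If you are unsure what you are running, MATLAB can report its own architecture (a minimal check; the `memory` function is Windows-only):

```matlab
% Report the platform MATLAB was built for:
arch = computer('arch')   % e.g. 'win64', 'glnxa64', 'maci64' -> 64-bit

% On Windows, memory() additionally shows the address space available
% to the MATLAB process (~2 GB on 32-bit, vastly more on 64-bit):
% [~, sys] = memory;
% sys.VirtualAddressSpace.Total
```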
Treat your discomfort with slow HDD seeks by switching to a swift SSD. Whether in a workstation or a notebook, even 512 GB is cheap today. Faster program startups and higher throughput on lots of small, scattered files come with the additional benefit of a superfast swap space in case of memory overflow.
The effect of an SSD on *.mat files, however, is marginal. The computing power spent on compression and decompression seems to be the limiting factor.
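You can observe this CPU cost yourself by timing a save with and without compression (a sketch; the `-nocompression` flag requires the v7.3 format and MATLAB R2017a or later, and the array size should be adjusted to your memory):

```matlab
% Compare compressed vs. uncompressed saving of a large array.
data = rand(5000, 5000);   % ~200 MB of doubles

tic; save('compressed.mat',   'data', '-v7.3');                   tCompressed   = toc;
tic; save('uncompressed.mat', 'data', '-v7.3', '-nocompression'); tUncompressed = toc;

fprintf('compressed: %.1f s, uncompressed: %.1f s\n', tCompressed, tUncompressed);
```

On random data (which barely compresses) the compressed save typically takes noticeably longer, regardless of how fast the disk is.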

Speaking of CPU power: we use quad-core Intel processors across the board. Equipped with MATLAB's Parallel Computing Toolbox and at least 16 GB of memory, we have four MATLAB workers crunching numbers simultaneously. Computing-intensive, homogeneous tasks in our tools are all prepared for this parallelization.
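The basic pattern is a `parfor` loop over independent iterations, which the Parallel Computing Toolbox distributes over the worker processes (a toy sketch with made-up work per iteration):

```matlab
% With a pool of 4 workers, the iterations of this parfor loop
% are distributed over 4 separate MATLAB worker processes.
parpool(4);                             % or gcp to reuse an existing pool

n = 8;
results = zeros(1, n);
parfor k = 1:n
    % Some computing-intensive, independent work per iteration:
    results(k) = sum(svd(rand(500)));
end

delete(gcp('nocreate'));                % shut the pool down again
```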
Data vs. Information
We often cannot distinguish relevant from irrelevant data a priori. Every bit might be important. But we can at least store the data smartly.
Numerics in MATLAB are doubles (8 bytes) by default. In many cases single precision suffices over that conservative approach: pi==3.141592653589793 becomes pi==3.1415927, and memory usage is cut in half. Furthermore, we do not claim our models to be so valid as to need 15 decimal places ;)
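The halving is easy to verify in the workspace:

```matlab
x_double = pi;            % 8 bytes, 3.141592653589793
x_single = single(pi);    % 4 bytes, 3.1415927
whos x_double x_single    % shows the memory usage cut in half

% The same holds for large arrays:
a = rand(1e6, 1);         % ~8 MB
b = single(a);            % ~4 MB
```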
We compress time-series data from test vehicles (with more than 900 channels) to a third of their size by recoding the information. Most channels have a high sampling rate but little information content. By transforming the data to non-equidistant time series (e.g. with MATLAB timeseries objects), there is no loss of information.
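One simple recoding of this kind (a sketch on a made-up step-like channel, not our production code) is to keep only the samples where the value actually changes, plus their predecessors, so that linear interpolation restores the original signal:

```matlab
% A mostly-constant channel sampled at 1 kHz:
t = (0:0.001:10)';
x = zeros(size(t));
x(t >= 5) = 1;                       % a single step at t = 5 s

% Keep samples where the value changes, their predecessors,
% and the endpoints of the record:
keep = [true; diff(x) ~= 0];
keep = keep | [keep(2:end); true];

ts = timeseries(x(keep), t(keep));   % non-equidistant, only a handful of samples

% Resampling onto the original grid reconstructs the signal:
xr = resample(ts, t);
max(abs(xr.Data - x))                % should be 0, i.e. lossless
```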
ANDATA Toolboxes
Finally, some hints for our toolboxes:
- When managing thousands of requirements in Stipulator, we recommend increasing the maximum Java heap space (default: 512 MB)
- Do install the toolboxes locally (not on a network share)
- Signal processing and export in Stipulator are parallelized if possible, so are Brainer-trainings and model evaluation
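Regarding the Java heap: you can inspect the current limit from within MATLAB, while the limit itself is raised under Preferences > General > Java Heap Memory (a restart is required):

```matlab
% Query the maximum Java heap available to this MATLAB session:
rt = java.lang.Runtime.getRuntime;
fprintf('max Java heap: %.0f MB\n', rt.maxMemory / 2^20);
```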
PS: More methodical aspects of handling loads of data, especially with Brainer, will follow in a separate blog entry.