Software Blog
On the golden rule of data science: never touch the raw data
by Andreas Kuhn
One of the golden rules in data science says: "Never touch and change any raw data!"
This remains valid even if data errors have to be compensated. The reasons are simple:
- If the data compensation itself contains an error, that error can no longer be found or corrected once the raw data has been overwritten.
- Every kind of data preparation may cause a loss of information.
- Different applications may need different approaches to data correction and error compensation. Each application should therefore run its own data preparation procedures.
- Data errors may themselves be the subject of analysis (e.g. for the development of anomaly and incident detection systems). Once the raw data has been altered, such analyses become infeasible or at least very difficult.
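The last two points can be illustrated with a minimal Python sketch. It is not taken from Stipulator; the readings, the plausibility limit and the function names are assumptions made up for illustration. Two applications read the same untouched raw data: one compensates an implausible reading, the other keeps it and reports its position.

```python
from statistics import median

# Hypothetical raw sensor readings; a tuple, so they cannot be changed in place.
RAW = (20.1, 20.3, 999.0, 20.2, 20.4)


def prepare_for_trend(raw, limit=100.0):
    """Return a new list in which implausible readings are replaced
    by the median of the plausible ones (assumed compensation rule)."""
    plausible = [x for x in raw if abs(x) <= limit]
    fill = median(plausible)
    return [x if abs(x) <= limit else fill for x in raw]


def prepare_for_anomaly_detection(raw, limit=100.0):
    """Keep every reading and return the indices of the implausible ones."""
    return [i for i, x in enumerate(raw) if abs(x) > limit]


trend_input = prepare_for_trend(RAW)
anomaly_indices = prepare_for_anomaly_detection(RAW)
```

Had the implausible reading been overwritten in the raw data, the second function would have nothing left to find.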
Stipulator therefore follows the philosophy that every data preparation is based on the raw data and always runs the complete sequence of preparation steps, starting from the raw data. In addition to the general signal processing sequence in Stipulator, one can define an individual data processing sequence for each data set (vulgo Stipulator case), which can be used for data manipulation and error compensation specific to that data set.
That way the data preparation stays reproducible, transparent and documented from the raw data onward. Errors found later in the analysis can be corrected at any time, after which the prepared data is again in the best available condition. New methods for data compensation and cleansing can also be introduced at any time, even late in the analysis and data preparation process.
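As a rough sketch of this idea in Python (not Stipulator's actual interface; all names, the sentinel value and the order of the two sequences are assumptions for illustration), each case has its own correction steps, which are run together with a general sequence on a copy of the raw data every time:

```python
def mark_sentinel(sentinel):
    """Build a step that turns a known error code into a missing value."""
    def step(values):
        return [None if v == sentinel else v for v in values]
    return step


def drop_missing(values):
    """General step: remove missing values."""
    return [v for v in values if v is not None]


# Hypothetical raw data per case; tuples, so they are never modified.
RAW_CASES = {
    "case_a": (4.0, -999.0, 8.0),   # logger wrote -999.0 on failure (assumed)
    "case_b": (1.0, None, 3.0),
}

# Case-specific error compensation, defined per data set.
CASE_SEQUENCES = {
    "case_a": [mark_sentinel(-999.0)],
    "case_b": [],
}

# General sequence shared by all cases.
GENERAL_SEQUENCE = [drop_missing]


def prepare(case):
    """Run the full preparation from the raw data, case steps first."""
    data = list(RAW_CASES[case])  # always start from a fresh copy of the raw data
    for step in CASE_SEQUENCES[case] + GENERAL_SEQUENCE:
        data = step(data)
    return data


prepared_a = prepare("case_a")
prepared_b = prepare("case_b")
```

Because the step lists are the only thing that changes, a correction discovered later is added to the relevant sequence and the preparation is simply run again from the raw data.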
Further details and practical hints can be found in the Stipulator user manuals or requested from support@andata.at.