The previous article introduced some basic concepts of data quality. Data quality control, as a basic link in the data warehouse, is the foundation that safeguards upper-layer data applications. Data quality assurance consists of three parts: data profiling (Data Profiling), data auditing (Data Auditing) and data correction (Data Correcting). The previous article covered Data Profiling, which yields summary statistics about the data. The next step is to use those statistics to review the quality of the data and check whether it contains dirty data, so this article mainly covers data auditing (Data Auditing).
The basic elements of data quality
First of all, how do we assess the quality of data, or in other words, what kind of data meets the requirements? This can be considered from 4 aspects, which together constitute the 4 basic elements of data quality.
Integrity: whether the data records and information are complete, and whether anything is missing.
Missing data takes two main forms: missing records, and missing information in a field within a record. Both make statistical results inaccurate, so integrity is the most basic guarantee of data quality, and it is also relatively easy to assess.
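The two kinds of missing data described above can be checked with very little code. The following is only a minimal sketch: the record layout (a list of dicts with hypothetical `day`, `pv` and `uv` fields) and the assumption that exactly one record is expected per day are mine, not something the article specifies.

```python
from datetime import date, timedelta

def missing_field_ratio(records, fields):
    """Share of records in which each field is absent, None, or blank."""
    total = len(records)
    ratios = {}
    for field in fields:
        missing = sum(
            1 for r in records
            if r.get(field) is None or str(r.get(field)).strip() == ""
        )
        ratios[field] = missing / total if total else 0.0
    return ratios

def missing_days(records, start, end, date_field="day"):
    """Days in [start, end] for which no record exists at all."""
    seen = {r[date_field] for r in records if r.get(date_field) is not None}
    span = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(span))
    return [d for d in all_days if d not in seen]

# Hypothetical daily summary rows, one expected per day.
records = [
    {"day": date(2024, 1, 1), "pv": 120, "uv": 80},
    {"day": date(2024, 1, 2), "pv": None, "uv": 75},  # missing field value
    {"day": date(2024, 1, 4), "pv": 150, "uv": 90},   # Jan 3 has no record
]

field_report = missing_field_ratio(records, ["pv", "uv"])
gap_report = missing_days(records, date(2024, 1, 1), date(2024, 1, 4))
```

Here `field_report` shows that one third of the records lack a `pv` value, and `gap_report` lists January 3 as a day with no record at all.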
Consistency: whether the data records conform to specifications, and whether they are consistent with the preceding and following data and with other data sets.
Data consistency mainly covers the specification of data records and the consistency of data logic. Record specification is mainly a matter of data encoding and format: for example, a website's user ID is a 15-digit number, a commodity ID is a 10-digit number, commodities fall into 20 categories, and an IP address must consist of 4 numbers from 0 to 255 separated by ".". It also covers defined data constraints, such as the non-empty and uniqueness integrity constraints. Data logic mainly concerns the consistency of metric statistics and calculations, for example PV >= UV, or the proportion of new users lying between 0 and 1. Consistency auditing is an important and complicated part of data quality auditing.
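The format rules, constraints and logic checks listed above translate fairly directly into code. The sketch below is one possible way to express them. The field names (`user_id`, `product_id`, `ip`, `order_id`, `new_user_ratio`) and the sample rows are hypothetical, and the 20-category rule is left out because the article does not say how categories are encoded.

```python
import re
from collections import Counter

def is_ipv4(text):
    """Four dot-separated decimal numbers, each in 0-255."""
    parts = str(text).split(".")
    return len(parts) == 4 and all(
        re.fullmatch(r"[0-9]{1,3}", p) is not None and int(p) <= 255
        for p in parts
    )

# Field name -> format predicate (field names are assumptions).
FORMAT_RULES = {
    "user_id": lambda v: re.fullmatch(r"[0-9]{15}", str(v)) is not None,
    "product_id": lambda v: re.fullmatch(r"[0-9]{10}", str(v)) is not None,
    "ip": is_ipv4,
}

def format_violations(record):
    """Names of fields that are empty (non-empty constraint) or malformed."""
    bad = []
    for field, rule in FORMAT_RULES.items():
        value = record.get(field)
        if value is None or value == "" or not rule(value):
            bad.append(field)
    return bad

def duplicate_keys(records, key):
    """Values of `key` that occur more than once (uniqueness constraint)."""
    counts = Counter(r.get(key) for r in records)
    return sorted(v for v, n in counts.items() if n > 1 and v is not None)

def logic_violations(metrics):
    """Cross-metric checks: PV >= UV and new-user ratio within [0, 1]."""
    bad = []
    if metrics["pv"] < metrics["uv"]:
        bad.append("pv < uv")
    if not 0 <= metrics["new_user_ratio"] <= 1:
        bad.append("new_user_ratio outside [0, 1]")
    return bad

orders = [
    {"order_id": "A1", "user_id": "123456789012345",
     "product_id": "0000000042", "ip": "192.168.0.1"},
    {"order_id": "A2", "user_id": "12345",
     "product_id": "0000000043", "ip": "256.1.1.1"},
    {"order_id": "A1", "user_id": "123456789012346",
     "product_id": "", "ip": "10.0.0.7"},
]

format_report = [format_violations(r) for r in orders]
duplicate_report = duplicate_keys(orders, "order_id")
logic_report = logic_violations({"pv": 90, "uv": 120, "new_user_ratio": 0.4})
```

Keeping the rules in a table such as `FORMAT_RULES` means new field specifications can be added without touching the audit loop, which helps given how many rules a consistency audit tends to accumulate.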
Accuracy: whether the information and data recorded are accurate, and whether there is any abnormal or wrong information.
Consistency problems may be caused by inconsistent recording rules, without the data necessarily being wrong; accuracy, by contrast, is concerned with errors in the recorded data. Garbled character data, for example, also falls within the scope of accuracy assessment, as do abnormal values: abnormally large or abnormally small numbers, and numbers that do not meet validity requirements.
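As a rough illustration of these accuracy checks, the sketch below flags text that looks garbled and numbers that are abnormally large or small. The article does not name a detection method, so both heuristics are assumptions of mine: garbling is detected by the Unicode replacement character or stray control characters, and abnormal values by the common 1.5 × IQR rule.

```python
import statistics

def looks_garbled(text):
    """Heuristic: contains U+FFFD or control characters other than tab/newline."""
    return any(
        ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r")
        for ch in text
    )

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical daily visit counts with one suspicious spike.
daily_visits = [102, 98, 110, 95, 105, 99, 101, 5000]
suspicious = iqr_outliers(daily_visits)

garbled_flags = [looks_garbled(s) for s in ["café", "caf\ufffd"]]
```

An outlier flagged this way is only a candidate for review: a genuine traffic spike and a logging error look the same to the rule, so a person or a downstream correction step still has to decide.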