
Mod 5 Data Provenance

By CamMG
  • Master Data Source BRFSS

    The Behavioral Risk Factor Surveillance System (BRFSS) conducts more than 400,000 telephone surveys annually.
  • About BRFSS Data

    • BRFSS data includes health-related risk behaviors, chronic health conditions, and use of preventive services.
    • A weighting process is applied to the data in an attempt to remove bias from the sample. The LLCP2017 data set uses the weight variable _LLCPWT.
    • Data sets are available in either SAS or ASCII format.
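To illustrate why the weight variable matters, here is a minimal sketch of a weighted proportion. The respondents, their weights, and the `no_coverage` flag are made up for illustration; only the name `_LLCPWT` comes from the data set. It is not the official BRFSS estimation procedure.

```python
def weighted_proportion(records, flag, weight="_LLCPWT"):
    """Return the weighted share of records where `flag` is true."""
    total = sum(r[weight] for r in records)
    if total == 0:
        raise ValueError("weights sum to zero")
    hits = sum(r[weight] for r in records if r[flag])
    return hits / total


# Made-up respondents: the weights and the `no_coverage` flag are illustrative only.
sample = [
    {"no_coverage": True, "_LLCPWT": 100.0},
    {"no_coverage": False, "_LLCPWT": 300.0},
    {"no_coverage": True, "_LLCPWT": 100.0},
]

unweighted = sum(r["no_coverage"] for r in sample) / len(sample)
weighted = weighted_proportion(sample, "no_coverage")
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

In this toy sample the unweighted share is two out of three, while the weighted share is 0.4, because the respondent without the flag stands in for more of the population.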
  • Master Data Retrieval

    Data is retrieved from the BRFSS database and stored in CSV format on a secure hard drive within a secured room.
    - BRFSS data sets are available in ASCII or SAS file format, so the data had to be converted to CSV before it was stored as the secured master copy. Data integrity can be compromised during conversions.
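The conversion step can be sketched as follows, assuming the ASCII file is fixed-width. The column positions in `LAYOUT` and the variable name `RISK1` are hypothetical; real positions would come from the codebook. Values are kept as text so that details such as leading zeros survive the conversion.

```python
import csv
import io

# Hypothetical layout: (column name, start, end) as 0-based slice positions.
# Real positions would be taken from the codebook.
LAYOUT = [("_STATE", 0, 2), ("RISK1", 2, 3), ("_LLCPWT", 3, 13)]


def fixed_width_to_csv(lines, layout):
    """Convert fixed-width lines to CSV text; return (csv_text, rows_written)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _, _ in layout])
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines rather than emit empty records
        # Keep every value as text so nothing is silently reinterpreted.
        writer.writerow([line[start:end].strip() for _, start, end in layout])
        count += 1
    return out.getvalue(), count


source_lines = ["011    123.45\n", "012     98.70\n"]
csv_text, rows_written = fixed_width_to_csv(source_lines, LAYOUT)
converted = list(csv.reader(io.StringIO(csv_text)))
print(rows_written, converted)
```

Comparing `rows_written` with the number of source lines is a cheap first check that the conversion did not drop records.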
  • About Heart Matters Data

    • The data set is a subset of a larger data set. It contains only the data points for Alabama and a sample of the columns that survey risk factors.
    • The supporting documentation is contained within a codebook.
    • The data set contains records of variables. Each variable has a unique name that represents a survey question or information about the respondent. Coded values carry labels describing what each value means. Finally, each variable's data type is defined (number, character, etc.).
  • Initial Analysis

    An initial analysis was conducted to determine if lack of medical care is a problem in Alabama and why. The master data was accessed to retrieve data points for Alabama. The following concerns arise from this analysis:
    - How was the data manipulated or changed, or was there any data loss during this analysis?
    - Was the analysis performed with a copy of the CSV file, or was the initial CSV file overwritten?
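One way to keep both concerns answerable is to write the Alabama subset to a new file and record how many rows were read and kept. The sketch below assumes a `_STATE` column in which Alabama is coded as `1`, and the file names and `RISK1` column are hypothetical; none of this is the procedure the original analyst used.

```python
import csv
import tempfile
from pathlib import Path


def extract_subset(master_path, subset_path, column, value):
    """Copy rows where row[column] == value into subset_path.

    The master file is only ever opened for reading.
    Returns (rows_read, rows_kept).
    """
    if Path(master_path).resolve() == Path(subset_path).resolve():
        raise ValueError("subset would overwrite the master file")
    rows_read = rows_kept = 0
    with open(master_path, newline="", encoding="utf-8") as src, \
            open(subset_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            rows_read += 1
            if row[column] == value:
                writer.writerow(row)
                rows_kept += 1
    return rows_read, rows_kept


# Demo on a throwaway file; the contents are made up.
MASTER_TEXT = "_STATE,RISK1\n1,2\n6,1\n1,1\n"
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master.csv"
    master.write_text(MASTER_TEXT, encoding="utf-8")
    rows_read, rows_kept = extract_subset(
        master, Path(tmp) / "alabama.csv", "_STATE", "1"
    )
    master_after = master.read_text(encoding="utf-8")
print(rows_read, rows_kept)
```

Logging the two counts alongside the analysis gives a later reviewer a concrete answer to "was anything lost?".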
  • Relocation and Conversion

    During this time, the new location received funding for an SQL database. The IT system administrator was instructed to transfer and transform the data set.
    - The data set and codebook were copied to a flash drive and physically moved to the new location. Although rugged, flash drives can experience file corruption through mishandling, viruses, and malware.
    - The data set underwent another conversion when it was imported into the SQL database system. Data integrity can be compromised during conversions.
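Corruption during a physical transfer can be detected by comparing checksums before and after the copy. The following is a minimal sketch using SHA-256. The file names are hypothetical, and nothing in the source indicates that such a check was actually performed.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src, dst):
    """Copy src to dst and raise if the copy's digest differs from the source's."""
    expected = file_digest(src)
    shutil.copy2(src, dst)
    actual = file_digest(dst)
    if actual != expected:
        raise OSError(f"checksum mismatch after copy: {dst}")
    return expected


# Demo on a throwaway file; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "master.csv")
    dst = os.path.join(tmp, "copy.csv")
    with open(src, "wb") as handle:
        handle.write(b"_STATE,_LLCPWT\n1,123.45\n")
    digest = verified_copy(src, dst)
    copy_matches = file_digest(dst) == digest
print(digest, copy_matches)
```

Recording the digest in the hand-off notes lets the receiving site recompute it after the drive arrives.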
  • Data Integrity Discovery

    Heart Matters hired a new data analyst for the Alabama location and tasked them with re-reviewing the master data set. The analyst discovered:
    - Unreadable characters appeared in the CSV file during data migration.
    - Some data rows failed to import, resulting in data loss.
    - Row counts differed between the original CSV and the data imported into SQL.
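Findings like these can be located before an import with a quick scan of the raw file. The sketch below flags lines that do not decode as UTF-8 and rows whose field count differs from the header. It assumes one record per line (no quoted fields spanning lines) and UTF-8 as the intended encoding; both are assumptions, not facts about the Heart Matters file.

```python
import csv


def scan_csv(raw, encoding="utf-8"):
    """Return line numbers with undecodable bytes or an unexpected field count.

    Assumes one record per physical line.
    """
    bad_encoding, bad_width = [], []
    expected = None
    for lineno, raw_line in enumerate(raw.splitlines(), start=1):
        try:
            text = raw_line.decode(encoding)
        except UnicodeDecodeError:
            bad_encoding.append(lineno)
            continue
        fields = next(csv.reader([text]), [])
        if expected is None:
            expected = len(fields)  # the first decodable line sets the width
        elif len(fields) != expected:
            bad_width.append(lineno)
    return {"bad_encoding": bad_encoding, "bad_width": bad_width}


# Made-up bytes: line 3 is missing a field, line 4 contains an invalid byte.
raw = b"a,b,c\n1,2,3\n4,5\n\xff,8,9\n7,8,9\n"
report = scan_csv(raw)
print(report)
```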
  • Best Practices for Data Integrity

    The recent discovery regarding the data integrity issue warrants a directive for data integrity assurance.
    - The initial BRFSS data retrieval should be imported directly into SQL and compared against the original file for accuracy.
    - Causes of data loss on import, such as grouped variable fields, fixed-width fields, or blank columns, must be identified.
    - Once the data set has been verified, it must be stored with access levels applied to ensure the database is not changed unless a change is specifically intended.
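The import-and-verify practice can be sketched with SQLite standing in for the unnamed SQL database. The table name `brfss`, the file names, and the sample rows are hypothetical. A production system would enforce access levels through its own roles and permissions; the read-only connection here only illustrates the idea.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def import_csv(csv_path, db_path, table="brfss"):
    """Load a CSV into SQLite as text columns; return (rows_in_file, rows_in_table)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    marks = ", ".join("?" for _ in header)
    con = sqlite3.connect(db_path)
    try:
        with con:  # commit on success, roll back on error
            con.execute(f'CREATE TABLE "{table}" ({cols})')
            # A row with the wrong number of fields raises instead of being dropped.
            con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
        (count,) = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    finally:
        con.close()
    return len(rows), count


def open_read_only(db_path):
    """Open the database so that any write attempt fails."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Demo on throwaway files; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "master.csv"
    db_path = Path(tmp) / "master.db"
    csv_path.write_text("_STATE,_LLCPWT\n1,123.45\n1,98.70\n", encoding="utf-8")
    in_file, in_table = import_csv(csv_path, db_path)
    ro = open_read_only(db_path)
    try:
        ro.execute('DELETE FROM "brfss"')
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True
    finally:
        ro.close()
print(in_file, in_table, write_blocked)
```

Matching counts do not prove that every value survived intact, so a fuller check would also compare a sample of rows against the original file.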