# VAERS Complete - Enhanced Data Processing Script

## Overview

`vaers_complete.py` is a Python script for processing VAERS (Vaccine Adverse Event Reporting System) data. It adds multi-core parallel processing, memory-efficient chunked data handling, and change tracking across CDC data releases.

**Original Author**: Gary Hawkins - http://univaers.com/download/

**Enhanced Version**: 2025 by Jason Page

## Features

- ✓ Multi-core parallel processing for faster execution
- ✓ Memory-efficient chunked data handling for large datasets
- ✓ Command-line dataset selection (COVID-19 era or full historical data)
- ✓ Progress bars for all major operations
- ✓ Comprehensive error tracking and reporting
- ✓ Repaired statistics reporting (written to `stats.csv`)
- ✓ Change detection and tracking across data releases
- ✓ Deduplication and data consolidation
- ✓ Complete audit trail of modifications to VAERS reports

## Requirements

### Python Dependencies

```bash
pip install pandas numpy tqdm zipfile-deflate64
```

- **pandas**: Data manipulation and analysis
- **numpy**: Numerical operations
- **tqdm**: Progress bars (optional but recommended)
- **zipfile-deflate64**: Enhanced ZIP file handling (optional, falls back to standard zipfile)
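
Optional dependencies like these are usually handled with guarded imports. The sketch below shows one way to do it, and is not necessarily how `vaers_complete.py` does it. The pass-through `tqdm` stand-in is an assumption made for illustration.

```python
# Prefer zipfile-deflate64 (imported as zipfile_deflate64) when installed;
# otherwise fall back to the standard-library zipfile module.
try:
    import zipfile_deflate64 as zipfile
except ImportError:
    import zipfile

# Prefer tqdm for progress bars; otherwise use a pass-through stand-in.
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        """Hypothetical fallback: return the iterable unchanged, ignoring options."""
        return iterable

for name in tqdm(["a.zip", "b.zip"], desc="Scanning"):
    pass  # real code would open each archive with zipfile.ZipFile(name)
```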

### System Requirements

- Python 3.x
- Multi-core CPU recommended for parallel processing
- Sufficient RAM for large dataset processing (16GB+ recommended for full dataset)

## Command-Line Options

### Basic Syntax

```bash
python vaers_complete.py [OPTIONS]
```

### Options Reference

#### `--dataset {covid,full}`
**Default**: `covid`

Selects which dataset to process:
- `covid`: Process COVID-19 era data only (from 2020-12-13 onwards by default)
- `full`: Process full historical VAERS dataset (from 1990-01-01 onwards by default)

**Examples**:
```bash
python vaers_complete.py --dataset covid
python vaers_complete.py --dataset full
```

#### `--cores NUMBER`
**Default**: Number of CPU cores available on system

Specifies the number of CPU cores to use for parallel processing.

**Examples**:
```bash
python vaers_complete.py --cores 8
python vaers_complete.py --cores 16
python vaers_complete.py --dataset full --cores 4
```
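
As a rough illustration of how a core count can drive parallel work, the sketch below clamps the requested value and runs tasks through a process pool. It runs in-process when only one core is requested. The helper names (`resolve_cores`, `count_lines`, `run_tasks`) and the clamping behavior are assumptions, not taken from the script.

```python
import os
from multiprocessing import Pool

def resolve_cores(requested=None):
    """Clamp the requested worker count to the range 1..available cores."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))

def count_lines(text):
    """Placeholder per-file task; real work would parse one data release."""
    return len(text.splitlines())

def run_tasks(items, cores):
    """Run count_lines over items, in-process when only one core is requested."""
    if cores <= 1:
        return [count_lines(item) for item in items]
    with Pool(processes=cores) as pool:
        return pool.map(count_lines, items)

if __name__ == "__main__":
    print(run_tasks(["a\nb", "c"], cores=1))  # [2, 1]
```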

#### `--chunk-size NUMBER`
**Default**: `50000`

Sets the chunk size for processing large datasets. Larger chunks use more memory but may be faster. Smaller chunks are more memory-efficient.

**Examples**:
```bash
python vaers_complete.py --chunk-size 100000
python vaers_complete.py --chunk-size 25000
```
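
The memory trade-off comes from holding only one chunk of rows at a time. The dependency-free sketch below shows the idea with the standard `csv` module. In pandas, `read_csv(..., chunksize=N)` gives a similar iterator of DataFrames. The `iter_chunks` helper is hypothetical and not part of the script.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Tiny in-memory CSV standing in for a large VAERS file.
sample = io.StringIO("VAERS_ID,STATE\n1,CA\n2,NY\n3,TX\n4,WA\n5,FL\n")
reader = csv.DictReader(sample)

sizes = [len(chunk) for chunk in iter_chunks(reader, chunk_size=2)]
print(sizes)  # [2, 2, 1]
```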

#### `--date-floor DATE`
**Default**: `2020-12-13` for COVID dataset, `1990-01-01` for full dataset

Sets the earliest date to process (format: YYYY-MM-DD). Records before this date will be excluded.

**Examples**:
```bash
python vaers_complete.py --date-floor 2021-01-01
python vaers_complete.py --dataset full --date-floor 2000-01-01
```

#### `--date-ceiling DATE`
**Default**: `2025-01-01`

Sets the latest date to process (format: YYYY-MM-DD). Records after this date will be excluded.

**Examples**:
```bash
python vaers_complete.py --date-ceiling 2024-12-31
python vaers_complete.py --date-floor 2020-01-01 --date-ceiling 2023-12-31
```
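
Taken together, the floor and ceiling describe an inclusive date window. The sketch below applies such a window to a few sample records. The README does not say which date column the script filters on, so the use of `RECVDATE` is an assumption, as is the `in_range` helper.

```python
from datetime import date

def in_range(value, floor, ceiling):
    """True when floor <= value <= ceiling; all arguments are YYYY-MM-DD strings."""
    return (
        date.fromisoformat(floor)
        <= date.fromisoformat(value)
        <= date.fromisoformat(ceiling)
    )

# Sample records; RECVDATE is assumed to be the filtered column.
records = [
    {"VAERS_ID": "1", "RECVDATE": "2020-12-31"},
    {"VAERS_ID": "2", "RECVDATE": "2021-01-01"},
    {"VAERS_ID": "3", "RECVDATE": "2023-12-31"},
    {"VAERS_ID": "4", "RECVDATE": "2024-01-01"},
]

kept = [
    r["VAERS_ID"]
    for r in records
    if in_range(r["RECVDATE"], "2021-01-01", "2023-12-31")
]
print(kept)  # ['2', '3']
```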

#### `--test`
**Default**: Not set

Uses the test-case directory (`z_test_cases/`) instead of the main working directory. Useful for development and testing.

**Example**:
```bash
python vaers_complete.py --test
```

#### `--no-progress`
**Default**: Not set

Disables progress bars. Useful for logging output to files or when running in environments without terminal support.

**Example**:
```bash
python vaers_complete.py --no-progress > output.log
```

#### `--merge-only`
**Default**: Not set

Skips all processing and only creates the final merged file from existing processed data. Useful when you want to regenerate the final output without reprocessing everything.

**Example**:
```bash
python vaers_complete.py --merge-only
```
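
For reference, the options above map naturally onto `argparse`. The sketch below mirrors the documented flags and defaults. It is a reconstruction from this README, not a copy of the script's parser, and the `build_parser` and `parse_options` names are hypothetical.

```python
import argparse
import os

# Default date floors per dataset, as documented in this README.
DATE_FLOORS = {"covid": "2020-12-13", "full": "1990-01-01"}

def build_parser():
    parser = argparse.ArgumentParser(prog="vaers_complete.py")
    parser.add_argument("--dataset", choices=["covid", "full"], default="covid")
    parser.add_argument("--cores", type=int, default=os.cpu_count() or 1)
    parser.add_argument("--chunk-size", type=int, default=50000)
    parser.add_argument("--date-floor", default=None)
    parser.add_argument("--date-ceiling", default="2025-01-01")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--no-progress", action="store_true")
    parser.add_argument("--merge-only", action="store_true")
    return parser

def parse_options(argv=None):
    """Parse argv and fill in the dataset-dependent date floor when omitted."""
    args = build_parser().parse_args(argv)
    if args.date_floor is None:
        args.date_floor = DATE_FLOORS[args.dataset]
    return args

args = parse_options(["--dataset", "full", "--cores", "4"])
print(args.dataset, args.cores, args.chunk_size, args.date_floor)
# full 4 50000 1990-01-01
```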

## Usage Examples

### Process COVID-19 data with 8 cores
```bash
python vaers_complete.py --dataset covid --cores 8
```

### Process full historical dataset with 16 cores and larger chunks
```bash
python vaers_complete.py --dataset full --cores 16 --chunk-size 100000
```

### Process COVID data from a specific start date
```bash
python vaers_complete.py --dataset covid --date-floor 2021-01-01
```

### Process data for a specific date range
```bash
python vaers_complete.py --date-floor 2021-01-01 --date-ceiling 2023-12-31 --cores 8
```

### Process with smaller chunks for memory-constrained systems
```bash
python vaers_complete.py --dataset covid --chunk-size 25000 --cores 4
```

### Create final merged file only
```bash
python vaers_complete.py --merge-only
```

### Run with test data
```bash
python vaers_complete.py --test --cores 4
```

### Process without progress bars (for logging)
```bash
python vaers_complete.py --dataset covid --no-progress > processing.log 2>&1
```

## Directory Structure

The script expects and creates the following directory structure:

```
.
├── 0_VAERS_Downloads/          # Input: Raw VAERS ZIP files from CDC
├── 1_vaers_working/            # Intermediate: Extracted CSV files
├── 1_vaers_consolidated/       # Intermediate: Consolidated data files
├── 2_vaers_full_compared/      # Output: Comparison results with change tracking
├── 3_vaers_flattened/          # Intermediate: Flattened data (one row per VAERS_ID)
├── stats.csv                   # Output: Processing statistics
├── never_published_any.txt     # Output: VAERS IDs never published
├── ever_published_any.txt      # Output: All VAERS IDs ever published
├── ever_published_covid.txt    # Output: COVID-related VAERS IDs
├── writeups_deduped.txt        # Output: Deduplicated symptom descriptions
└── VAERS_FINAL_MERGED.csv      # Final output: Complete merged dataset
```

### Test Mode Directory Structure

When using the `--test` flag:

```
z_test_cases/
├── drops/                      # Input: Test VAERS data
├── 1_vaers_working/
├── 1_vaers_consolidated/
├── 2_vaers_full_compared/
├── 3_vaers_flattened/
└── [output files]
```
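
The two layouts differ only in their base directory and input folder. The sketch below shows one way to resolve them and create the intermediate directories. The `resolve_layout` helper is hypothetical, and whether the script creates these folders itself in exactly this way is an assumption.

```python
import tempfile
from pathlib import Path

WORK_DIRS = [
    "1_vaers_working",
    "1_vaers_consolidated",
    "2_vaers_full_compared",
    "3_vaers_flattened",
]

def resolve_layout(root, test=False):
    """Return (base, input_dir) for normal or --test runs, creating work dirs."""
    base = Path(root) / "z_test_cases" if test else Path(root)
    input_dir = base / ("drops" if test else "0_VAERS_Downloads")
    for name in WORK_DIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base, input_dir

# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    base, input_dir = resolve_layout(tmp, test=True)
    created = sorted(p.name for p in base.iterdir())
    print(base.name, input_dir.name)  # z_test_cases drops
```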

## Processing Workflow

The script performs the following main steps:

### 1. **Consolidation**
Combines the three VAERS data files for each data release:
- `*VAERSDATA.csv` - Main report data
- `*VAERSVAX.csv` - Vaccination details
- `*VAERSSYMPTOMS.csv` - Symptom entries

Output: Consolidated files in `1_vaers_consolidated/`
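
Before the three files can be combined, they have to be matched up per release. The sketch below groups file names by the prefix in front of `VAERSDATA`, `VAERSVAX`, and `VAERSSYMPTOMS`. The helper names and the rule for skipping incomplete sets are assumptions, not the script's actual logic.

```python
import re

# Matches e.g. "2021VAERSDATA.csv" -> prefix "2021", kind "DATA".
PATTERN = re.compile(r"^(?P<prefix>.*)VAERS(?P<kind>DATA|VAX|SYMPTOMS)\.csv$")

def group_release_files(names):
    """Map each file-name prefix to the DATA/VAX/SYMPTOMS files found for it."""
    groups = {}
    for name in names:
        match = PATTERN.match(name)
        if match:
            groups.setdefault(match["prefix"], {})[match["kind"]] = name
    return groups

def complete_sets(groups):
    """Keep only prefixes that have all three files."""
    wanted = {"DATA", "VAX", "SYMPTOMS"}
    return {prefix: files for prefix, files in groups.items() if set(files) == wanted}

names = [
    "2021VAERSDATA.csv",
    "2021VAERSVAX.csv",
    "2021VAERSSYMPTOMS.csv",
    "2020VAERSDATA.csv",  # incomplete set, skipped below
    "readme.txt",         # not a VAERS file, ignored
]
ready = complete_sets(group_release_files(names))
print(sorted(ready))  # ['2021']
```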

### 2. **Flattening**
Aggregates multiple vaccine entries per report into single rows:
- Groups vaccine records by VAERS_ID
- Merges all related data into one row per report
- Joins symptom entries

Output: Flattened files in `3_vaers_flattened/`
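
A minimal sketch of the flattening idea: rows that share a `VAERS_ID` are collapsed into one row, with each column's values joined in order. The `|` separator and the `flatten_vax_rows` helper are assumptions, and the script may aggregate differently.

```python
from collections import defaultdict

def flatten_vax_rows(rows, key="VAERS_ID", sep="|"):
    """Collapse rows sharing a key into one row, joining values per column in order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, value in row.items():
            if column != key:
                grouped[row[key]][column].append(value)
    return [
        {key: report_id, **{col: sep.join(vals) for col, vals in columns.items()}}
        for report_id, columns in grouped.items()
    ]

vax_rows = [
    {"VAERS_ID": "100", "VAX_TYPE": "COVID19", "VAX_MANU": "MODERNA"},
    {"VAERS_ID": "100", "VAX_TYPE": "FLU4", "VAX_MANU": "SANOFI PASTEUR"},
    {"VAERS_ID": "101", "VAX_TYPE": "COVID19", "VAX_MANU": "JANSSEN"},
]
flat = flatten_vax_rows(vax_rows)
print(flat[0]["VAX_TYPE"])  # COVID19|FLU4
```

Joining every value, rather than only distinct ones, keeps the positions in `VAX_TYPE` and `VAX_MANU` aligned with each other.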

### 3. **Comparison**
Compares current data release with previous releases to detect changes:
- Identifies new reports
- Detects modifications to existing reports
- Tracks deletions
- Records all changes in the `changes` column
- Counts cell edits

Output: Comparison files in `2_vaers_full_compared/`
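
The comparison can be pictured as a keyed diff between two releases. The sketch below classifies each `VAERS_ID` and counts differing cells. The `compare_releases` function and the extra `unchanged` status are assumptions for illustration, and the script's own rules may differ.

```python
def compare_releases(previous, current):
    """Classify each VAERS_ID and count changed cells between two releases.

    Both arguments map VAERS_ID -> {column: value}.
    """
    result = {}
    for report_id, row in current.items():
        old = previous.get(report_id)
        if old is None:
            result[report_id] = {"status": "new", "cell_edits": 0}
            continue
        columns = set(old) | set(row)
        edits = sum(1 for c in columns if old.get(c, "") != row.get(c, ""))
        status = "modified" if edits else "unchanged"  # "unchanged" is an assumed label
        result[report_id] = {"status": status, "cell_edits": edits}
    # Reports present before but missing now are treated as deleted.
    for report_id in previous.keys() - current.keys():
        result[report_id] = {"status": "deleted", "cell_edits": 0}
    return result

previous = {
    "100": {"AGE_YRS": "45", "STATE": "CA"},
    "101": {"AGE_YRS": "30", "STATE": "NY"},
    "102": {"AGE_YRS": "61", "STATE": "TX"},
}
current = {
    "100": {"AGE_YRS": "46", "STATE": "CA"},
    "101": {"AGE_YRS": "30", "STATE": "NY"},
    "103": {"AGE_YRS": "52", "STATE": "WA"},
}
report = compare_releases(previous, current)
print(report["100"])  # {'status': 'modified', 'cell_edits': 1}
```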

### 4. **Final Merge**
Creates the final consolidated output file containing:
- All reports with complete change history
- Cell edit counts
- Status indicators (new, modified, deleted)
- Complete audit trail

Output: `VAERS_FINAL_MERGED.csv`

## Output Files

### Primary Output

**`VAERS_FINAL_MERGED.csv`**
- Complete dataset with all VAERS reports
- Includes all historical changes tracked across data releases
- Contains columns: `cell_edits`, `status`, `changes`
- One row per VAERS_ID with complete information

### Statistics and Tracking Files

**`stats.csv`**
- Processing statistics for each data release
- Counts of new reports, modifications, deletions
- Date ranges and record counts

**`never_published_any.txt`**
- VAERS IDs that were never published in any release
- Identifies gaps in the VAERS ID sequence
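
Finding gaps in an ID sequence amounts to a set difference against the full range. The sketch below assumes integer IDs and checks between the smallest and largest published ID. The script's actual bounds are not stated in this README.

```python
def find_unpublished_ids(published_ids):
    """Return the IDs missing between the smallest and largest published ID."""
    seen = set(published_ids)
    if not seen:
        return []
    return [i for i in range(min(seen), max(seen) + 1) if i not in seen]

print(find_unpublished_ids([896636, 896637, 896640]))  # [896638, 896639]
```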

**`ever_published_any.txt`**
- Complete list of all VAERS IDs ever published
- Includes all vaccine types

**`ever_published_covid.txt`**
- List of COVID-19 vaccine-related VAERS IDs
- Filtered to reports whose `VAX_TYPE` contains 'covid'

**`writeups_deduped.txt`**
- Deduplicated symptom text descriptions
- Useful for analysis of unique symptom patterns
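
Deduplicating free-text write-ups can be sketched as keeping the first occurrence of each normalized text. The normalization used here (collapsing whitespace and ignoring case) is an assumption, and the script may compare texts more strictly.

```python
def dedupe_writeups(texts):
    """Keep the first occurrence of each write-up, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split()).casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(text.strip())
    return unique

writeups = ["Headache and fever", "headache  and fever ", "", "Rash on arm"]
print(dedupe_writeups(writeups))  # ['Headache and fever', 'Rash on arm']
```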

## Key Columns in Output

The final merged file contains the standard VAERS columns plus the tracking columns added by the script.

### Standard VAERS Columns
- `VAERS_ID` - Unique report identifier
- `AGE_YRS`, `SEX`, `STATE` - Demographic information
- `DIED`, `L_THREAT`, `ER_VISIT`, `HOSPITAL`, `DISABLE` - Serious outcomes
- `VAX_TYPE`, `VAX_MANU`, `VAX_LOT` - Vaccine information
- `VAX_DATE`, `ONSET_DATE`, `RPT_DATE` - Date information
- `SYMPTOM_TEXT` - Symptom description
- And many more...

### Enhanced Tracking Columns
- `cell_edits` - Count of cells modified across all releases
- `status` - Report status (new, modified, deleted)
- `changes` - Detailed log of all changes made to the report
- `symptom_entries` - Aggregated symptom entries

## Performance Tuning

### For Fast Processing (High RAM)
```bash
python vaers_complete.py --dataset covid --cores 16 --chunk-size 100000
```

### For Memory-Constrained Systems
```bash
python vaers_complete.py --dataset covid --cores 4 --chunk-size 25000
```

### For Very Large Full Dataset
```bash
python vaers_complete.py --dataset full --cores 16 --chunk-size 50000
```

## Error Handling

The script includes comprehensive error handling:
- All errors are collected and displayed at the end of processing
- Errors include timestamps for tracking
- Processing continues when possible, skipping problematic files
- Final error summary shows total errors encountered
- Exit code 0 = success, 1 = errors occurred
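
The collect-and-continue pattern described above could look roughly like the following. The `ErrorLog` class is hypothetical and only illustrates timestamped collection, an end-of-run summary, and the exit code convention.

```python
from datetime import datetime

class ErrorLog:
    """Collect errors with timestamps and report them at the end of a run."""

    def __init__(self):
        self.entries = []

    def record(self, message):
        stamp = datetime.now().isoformat(timespec="seconds")
        self.entries.append((stamp, message))

    def summary(self):
        lines = [f"[{stamp}] {message}" for stamp, message in self.entries]
        lines.append(f"Total errors: {len(self.entries)}")
        return "\n".join(lines)

    def exit_code(self):
        return 1 if self.entries else 0

errors = ErrorLog()
for name in ["good.csv", "bad.csv"]:
    try:
        if name.startswith("bad"):
            raise ValueError(f"could not parse {name}")
    except ValueError as exc:
        errors.record(str(exc))  # skip the problem file and keep going

print(errors.summary())
# A real script would finish with sys.exit(errors.exit_code()).
```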

## Data Filtering

### COVID Dataset Mode
By default, filters to COVID-19 era data:
- Automatically detects the earliest COVID VAERS_ID
- Removes all reports prior to the first COVID vaccine report
- Typically starts from VAERS_ID ~896636 (first trial report)
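
One way to picture the COVID cutoff: find the smallest `VAERS_ID` whose `VAX_TYPE` mentions COVID, then keep every report at or above it. The helper names and the case-insensitive match are assumptions, not the script's exact rule.

```python
def earliest_covid_id(rows):
    """Return the smallest VAERS_ID whose VAX_TYPE mentions 'covid', or None."""
    ids = [int(r["VAERS_ID"]) for r in rows if "covid" in r["VAX_TYPE"].lower()]
    return min(ids) if ids else None

def keep_covid_era(rows):
    """Drop every report with an ID below the first COVID report."""
    cutoff = earliest_covid_id(rows)
    if cutoff is None:
        return []
    return [r for r in rows if int(r["VAERS_ID"]) >= cutoff]

rows = [
    {"VAERS_ID": "896000", "VAX_TYPE": "FLU4"},
    {"VAERS_ID": "896636", "VAX_TYPE": "COVID19"},
    {"VAERS_ID": "896700", "VAX_TYPE": "FLU4"},
]
print([r["VAERS_ID"] for r in keep_covid_era(rows)])  # ['896636', '896700']
```

Note that later non-COVID reports are kept in this sketch. Only reports that precede the first COVID report are removed.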

### Full Dataset Mode
Processes complete historical VAERS data:
- Includes all vaccine types from 1990 onwards (or specified date-floor)
- Requires significantly more processing time and storage

## Change Tracking

The script tracks modifications to VAERS reports across CDC data releases:
- **New reports**: First appearance in a data release
- **Modifications**: Changes to any field in existing reports
- **Deletions**: Reports removed from later releases
- **Cell edits**: Count of individual cell changes
- **Change log**: Detailed description of what changed

Example change tracking entry:
```
2023-01-15: AGE_YRS changed from "45" to "46"
2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"
```
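
Log lines in this style can be generated from a before/after pair of rows. The sketch below is an assumption about how such entries might be built. It treats a new value that starts with the old one as an append, and the `describe_changes` helper is hypothetical.

```python
def describe_changes(release_date, old, new):
    """Build one log line per column whose value differs between old and new."""
    lines = []
    for column in sorted(set(old) | set(new)):
        before, after = old.get(column, ""), new.get(column, "")
        if before == after:
            continue
        if before and after.startswith(before):
            added = after[len(before):].strip()
            lines.append(f'{release_date}: {column} appended with "{added}"')
        else:
            lines.append(f'{release_date}: {column} changed from "{before}" to "{after}"')
    return lines

old = {"AGE_YRS": "45", "SYMPTOM_TEXT": "Fever."}
new = {"AGE_YRS": "46", "SYMPTOM_TEXT": "Fever. Patient recovered"}
for line in describe_changes("2023-01-15", old, new):
    print(line)
# 2023-01-15: AGE_YRS changed from "45" to "46"
# 2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"
```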

## Troubleshooting

### Out of Memory Errors
- Reduce `--chunk-size` to 25000 or lower
- Reduce `--cores` to use fewer parallel processes
- Process smaller date ranges using `--date-floor` and `--date-ceiling`

### Progress Bars Not Showing
- Install tqdm: `pip install tqdm`
- Or disable with `--no-progress` if not needed

### ZIP File Errors
- Install zipfile-deflate64: `pip install zipfile-deflate64`
- Script falls back to standard zipfile if not available

### Missing Input Files
- Ensure VAERS data files are in `0_VAERS_Downloads/` directory
- Check that files are in correct ZIP format from CDC

## License and Attribution

Original script by Gary Hawkins (http://univaers.com/download/).

Enhanced version with performance improvements and additional features by Jason Page.

## Notes

- The script automatically normalizes mixed date formats, converting MM/DD/YYYY to YYYY-MM-DD
- Duplicate records are automatically identified and removed
- String type handling is optimized for memory efficiency
- All CSV files use UTF-8-sig encoding for compatibility
- Progress tracking can be disabled for automated/batch processing
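
The date normalization mentioned in the notes can be sketched with `datetime.strptime`. The `normalize_date` helper and its handling of unparseable values (returning an empty string) are assumptions.

```python
from datetime import datetime

def normalize_date(value):
    """Convert MM/DD/YYYY to YYYY-MM-DD; pass ISO dates through; '' if unparseable."""
    value = value.strip()
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""

print(normalize_date("12/13/2020"))  # 2020-12-13
print(normalize_date("2021-01-05"))  # 2021-01-05
```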

## Support

For issues, questions, or contributions, refer to the original source or the repository where this script is maintained.
