Methods And Audit Trail
This page lists the public data path for the Evidence dashboard. Longer codebooks and release notes live in the checked-in docs.
Public Reproducibility
The public release is built from:
- Public parquets and staged aggregate artifacts.
- Public crosswalk snapshots.
- Evidence source SQL under sources/usajobs and sources/nlx.
- Checked-in ETL and audit code.
The publication companion is the staged reproducibility bundle described in docs/reproducibility-bundle.md. See Agreement And Audit Trail for the compact dashboard-facing checklist of bundle contents, agreement scripts, public-safe limits, and documentation. See What Postings And Taxonomies Can And Cannot Say for the demoted taxonomy-coverage caveat page.
Table 1. Public corpus size and task coverage
Monthly Active Jobs
USAJOBS monthly charts use active-month denominators. A posting with a valid date range contributes to each month between its start and end dates.
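The active-month rule above can be sketched as follows. This is a minimal illustration and not the project's ETL code: the function names `active_months` and `monthly_active_counts` are hypothetical, and the sketch assumes that a "valid" range means both dates are present with the end date on or after the start date, and that the start and end months are both counted.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Optional, Tuple


def active_months(start: Optional[date], end: Optional[date]) -> List[str]:
    """Return the YYYY-MM labels a posting is active in.

    Assumption: a range is valid only when both dates exist and
    end >= start. Invalid ranges contribute to no month.
    """
    if start is None or end is None or end < start:
        return []
    months = []
    year, month = start.year, start.month
    # Walk month by month until we pass the end date's month.
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def monthly_active_counts(
    postings: Iterable[Tuple[Optional[date], Optional[date]]]
) -> Counter:
    """Count active postings per month (the active-month denominator)."""
    counts: Counter = Counter()
    for start, end in postings:
        counts.update(active_months(start, end))
    return counts


postings = [
    (date(2024, 11, 15), date(2025, 1, 3)),   # active in three months
    (date(2024, 12, 1), date(2024, 12, 31)),  # active in one month
    (date(2025, 2, 1), date(2025, 1, 1)),     # end before start: skipped
]
print(monthly_active_counts(postings))
```

Under these assumptions, a posting open from mid-November to early January counts once in each of November, December, and January, so a long-running posting raises the denominator of every month it spans.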
Table 2. USAJOBS active-month construction
Table 3. Monthly source windows by corpus
Revealed Comparative Advantage (RCA)
RCA compares an entity's share of a task or Work Activity with that same item's share in the full corpus. Values above 1 indicate over-representation in the selected occupation, agency, or industry.
RCA = (entity item mentions / all entity item mentions) / (corpus item mentions / all corpus item mentions)
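As a worked illustration of the ratio above, the following sketch computes RCA from four mention counts. The function name `rca` and the example counts are hypothetical and are not taken from the dashboard data. Returning `None` when a denominator is zero is an assumption of this sketch, not a documented behavior of the pipeline.

```python
from typing import Optional


def rca(
    entity_item_mentions: int,
    entity_total_mentions: int,
    corpus_item_mentions: int,
    corpus_total_mentions: int,
) -> Optional[float]:
    """Entity share of an item divided by the corpus share of that item."""
    # Assumption: undefined ratios are reported as None rather than raising.
    if entity_total_mentions <= 0 or corpus_total_mentions <= 0:
        return None
    if corpus_item_mentions <= 0:
        return None
    entity_share = entity_item_mentions / entity_total_mentions
    corpus_share = corpus_item_mentions / corpus_total_mentions
    return entity_share / corpus_share


# Hypothetical counts: the item is 25% of the entity's mentions
# but 12.5% of the corpus's mentions.
print(rca(50, 200, 1250, 10000))  # 2.0
```

In this example the value 2.0 means the item appears at twice its corpus-wide share within the selected entity, which reads as over-representation under the "above 1" rule.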
Job-Ad-Derived Occupation-Task Pair Two-Model Agreement Procedure
Distinct job-ad-derived occupation-task pairs are reviewed under the publication two-model agreement rule, which separates likely cross-occupational task signal from extraction noise. This procedure is one of several LLM agreement procedures in the broader project. Canonical O*NET-listed pairs are retained automatically. Unlisted pairs are retained only when Claude Sonnet 4.6 and GPT-4o both mark the pair plausible. Gemini v2 outputs are diagnostic only and are excluded from the publication rule.
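The retention rule described above reduces to a small piece of boolean logic. The sketch below is illustrative only: the `PairReview` record and its field names are hypothetical and do not reflect the project's actual schema. The Gemini field is carried along only to show that it plays no part in the decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairReview:
    """One occupation-task pair with its judge outputs (hypothetical schema)."""
    onet_listed: bool                        # canonical O*NET-listed pair
    sonnet_plausible: bool                   # Claude Sonnet 4.6 verdict
    gpt4o_plausible: bool                    # GPT-4o verdict
    gemini_plausible: Optional[bool] = None  # diagnostic only, never consulted


def is_retained(review: PairReview) -> bool:
    """Apply the two-model agreement rule to a single pair."""
    if review.onet_listed:
        # Canonical pairs are retained automatically.
        return True
    # Unlisted pairs need both publication judges to agree on plausibility.
    return review.sonnet_plausible and review.gpt4o_plausible


reviews = [
    PairReview(onet_listed=True, sonnet_plausible=False, gpt4o_plausible=False),
    PairReview(onet_listed=False, sonnet_plausible=True, gpt4o_plausible=True),
    PairReview(onet_listed=False, sonnet_plausible=True, gpt4o_plausible=False,
               gemini_plausible=True),
]
print([is_retained(r) for r in reviews])  # [True, True, False]
```

The third record shows the effect of excluding the diagnostic judge: a plausible verdict from Gemini does not rescue a pair on which the two publication judges disagree.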
The review now includes three batches with the same prompt and judge panel:
| Sample | Purpose | Main reported result |
|---|---|---|
| Top RCA pairs | Tests whether the most distinctive occupation-task pairs include real cross-occupational task signal | Pass rate by O*NET migration distance; retained vs. excluded pairs |
| High-volume pairs | Tests whether the most frequently mentioned task pairs show the same agreement pattern | Pass rate and exclusion rate compared with the RCA sample |
| Random pairs | Estimates the baseline plausibility rate outside top-ranked dashboard records | Overall pass rate and false-positive/false-negative boundary checks |
The samples remain separate in reporting. The RCA sample supports the Finding 1 headline; the high-volume and random samples are robustness checks and denominator context.
Table 4. LLM agreement inclusion summary by review batch
Table 5. Two-model reliability by review batch
Data Quality Checks
Table 6. Agreement filter effect on task-derived rows
O*NET AI Term Coverage
Table 7. O*NET task statements containing AI-related terms
AI Label Provenance
Public AI code labels are taken from crosswalks/ai_a6_5_redacted_final2.csv. The dashboard does not use older AI label snapshots for public display.
Gated Material
Sentence-level AIMatch examples are held until a public-safe sentence artifact is available.
Release Documentation
Reference files in the public repository:
- docs/reproducibility-bundle.md
- docs/data-audit/public-dashboard-audit.md
- docs/data-audit/public-artifact-manifest.md
- docs/data-audit/public-codebook.md
- docs/findings-ledger.md
Local-only symlinks and generated build folders are not public release artifacts. Required generated parquets and crosswalk snapshots must be staged and documented before publication.