Methods And Audit Trail

This page lists the public data path for the Evidence dashboard. Longer codebooks and release notes live in the checked-in docs.

Public Reproducibility

The public release is built from:

  • Public parquets and staged aggregate artifacts.
  • Public crosswalk snapshots.
  • Evidence source SQL under sources/usajobs and sources/nlx.
  • Checked-in ETL and audit code.

The publication companion is the staged reproducibility bundle described in docs/reproducibility-bundle.md. See Agreement And Audit Trail for the compact dashboard-facing checklist of bundle contents, agreement scripts, public-safe limits, and documentation. See What Postings And Taxonomies Can And Cannot Say for the demoted taxonomy-coverage caveat page.

Table 1. Public corpus size and task coverage

No Results

Monthly Active Jobs

USAJOBS monthly charts use active-month denominators. A posting contributes one row to each calendar month from the month of its start date to the month of its end date, provided that date range is valid.
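The active-month expansion can be sketched as follows. This is a minimal illustration, not the checked-in ETL code: the function names are hypothetical, and it assumes that a date range is valid when both dates are present and the start date is not after the end date, and that both endpoint months count as active.

```python
from datetime import date


def active_months(start, end):
    """Return (year, month) for each calendar month from start through end.

    Assumption: a range is valid when both dates exist and start <= end.
    Invalid ranges contribute no posting-month rows.
    """
    if start is None or end is None or start > end:
        return []
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def monthly_active_jobs(postings):
    """Sum posting-month rows per calendar month."""
    counts = {}
    for start, end in postings:
        for ym in active_months(start, end):
            counts[ym] = counts.get(ym, 0) + 1
    return counts


# Hypothetical postings as (start date, end date) pairs.
postings = [
    (date(2024, 1, 20), date(2024, 3, 5)),   # active in Jan, Feb, Mar
    (date(2024, 2, 1), date(2024, 2, 28)),   # active in Feb only
    (date(2024, 5, 1), date(2024, 4, 1)),    # end before start: skipped
]
counts = monthly_active_jobs(postings)
```

In this sketch the first posting is counted in three separate months, which is why monthly active jobs can exceed the number of distinct postings.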

Table 2. USAJOBS active-month construction

No Results

Table 3. Monthly source windows by corpus

Terms. Monthly active jobs = summed active posting-month rows. A posting can contribute to more than one month when its active date range spans months.
No Results

Revealed Comparative Advantage (RCA)

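RCA compares the share of an entity's mentions that go to a given task or Work Activity with that same item's share of mentions in the full corpus. Values above 1 indicate that the item is over-represented in the selected occupation, agency, or industry; values below 1 indicate under-representation.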

RCA = (entity item mentions / all entity item mentions) / (corpus item mentions / all corpus item mentions)
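As a worked illustration of the ratio above, the sketch below computes RCA from two mention-count dictionaries. The function name, the dictionary layout, and the choice to return None when a denominator is zero are assumptions made for this example, not the dashboard's implementation.

```python
def rca(entity_counts, corpus_counts, item):
    """Revealed comparative advantage of `item` for one entity.

    entity_counts / corpus_counts map item name -> mention count.
    Returns None when the ratio is undefined (assumed convention).
    """
    entity_total = sum(entity_counts.values())
    corpus_total = sum(corpus_counts.values())
    corpus_item = corpus_counts.get(item, 0)
    if entity_total == 0 or corpus_total == 0 or corpus_item == 0:
        return None
    entity_share = entity_counts.get(item, 0) / entity_total
    corpus_share = corpus_item / corpus_total
    return entity_share / corpus_share


# Hypothetical counts: one agency versus the full corpus.
entity = {"analyze data": 50, "write reports": 50}
corpus = {"analyze data": 250, "write reports": 750}

# "analyze data" is half of the entity's mentions but a quarter of the
# corpus, so it is over-represented in the entity.
print(rca(entity, corpus, "analyze data"))   # 2.0
print(rca(entity, corpus, "write reports"))  # below 1: under-represented
```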

Job-Ad-Derived Occupation-Task Pair Two-Model Agreement Procedure

Distinct occupation-task pairs are reviewed under the publication two-model agreement rule, which separates likely cross-occupational task signal from extraction noise. This procedure is one of several LLM agreement procedures in the broader project. Canonical O*NET-listed pairs are retained automatically. Unlisted pairs are retained only when Claude Sonnet 4.6 and GPT-4o both mark the pair plausible. Gemini v2 outputs are diagnostic only and do not enter the publication rule.
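The retention rule can be expressed as a small predicate. The sketch below is illustrative only: the record field names (onet_listed, sonnet_plausible, gpt4o_plausible, gemini_plausible) are hypothetical and do not come from the project's agreement scripts.

```python
def retain_pair(onet_listed, sonnet_plausible, gpt4o_plausible):
    """Apply the two-model agreement rule to one occupation-task pair.

    O*NET-listed pairs are kept automatically. Unlisted pairs are kept
    only when both judges mark the pair plausible.
    """
    if onet_listed:
        return True
    return bool(sonnet_plausible) and bool(gpt4o_plausible)


# Hypothetical review records. The Gemini field is carried along for
# diagnostics but is never read by the rule.
records = [
    {"pair": "A", "onet_listed": True, "sonnet_plausible": False,
     "gpt4o_plausible": False, "gemini_plausible": False},
    {"pair": "B", "onet_listed": False, "sonnet_plausible": True,
     "gpt4o_plausible": True, "gemini_plausible": False},
    {"pair": "C", "onet_listed": False, "sonnet_plausible": True,
     "gpt4o_plausible": False, "gemini_plausible": True},
]

retained = [
    r["pair"]
    for r in records
    if retain_pair(r["onet_listed"], r["sonnet_plausible"], r["gpt4o_plausible"])
]
print(retained)  # ['A', 'B']
```

Pair C is excluded even though the diagnostic model marks it plausible, because one of the two publication judges does not.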

The review now includes three batches with the same prompt and judge panel:

| Sample | Purpose | Main reported result |
|---|---|---|
| Top RCA pairs | Tests whether the most distinctive occupation-task pairs include real cross-occupational task signal | Pass rate by O*NET migration distance; retained vs. excluded pairs |
| High-volume pairs | Tests whether the most frequently mentioned task pairs show the same agreement pattern | Pass rate and exclusion rate compared with the RCA sample |
| Random pairs | Estimates the baseline plausibility rate outside top-ranked dashboard records | Overall pass rate and false-positive/false-negative boundary checks |

The samples remain separate in reporting. The RCA sample supports the Finding 1 headline; the high-volume and random samples are robustness checks and denominator context.

Table 4. LLM agreement inclusion summary by review batch

No Results

Table 5. Two-model reliability by review batch

No Results

Data Quality Checks

Table 6. Agreement filter effect on task-derived rows

Terms. Wage and date rows are USAJOBS source-quality diagnostics. Green task rows are the raw O*NET green-task crosswalk size, not the LLM-validated green signal.
No Results

O*NET AI Term Coverage

Table 7. O*NET task statements containing AI-related terms

No Results

AI Label Provenance

Public AI code labels are taken from crosswalks/ai_a6_5_redacted_final2.csv. The dashboard does not use older AI label snapshots for public display.

Gated Material

Sentence-level AIMatch examples are held until a public-safe sentence artifact is available.

Release Documentation

Reference files in the public repository:

  • docs/reproducibility-bundle.md
  • docs/data-audit/public-dashboard-audit.md
  • docs/data-audit/public-artifact-manifest.md
  • docs/data-audit/public-codebook.md
  • docs/findings-ledger.md

Local-only symlinks and generated build folders are not public release artifacts. Required generated parquets and crosswalk snapshots must be staged and documented before publication.