C. Finding Emerging Skills
When standard taxonomies do not contain task or skill items for emerging technologies, labor market signals are missed. This finding asks whether postings can supply emerging task and skill statements missing from standard taxonomies, and whether those statements can support informative measures of the technological frontier.
AIMatch starts from posting sentences, places them in a task and skill architecture, and scores each statement for AI relatedness. The current experimental taxonomy contains 19,660 base-level statements. It is not presented as a finished ontology. It is a provisional map: a structured set of posting-derived statements that can be validated, revised, and used to measure where AI-related work is appearing.
This construction uses the TaxonomyBuilder approach for organizing posting-derived task and skill sentences into inspectable maps. See the TaxonomyBuilder article.
From Statements To Maps
The same measurement problem appears outside AI. For STEM, the standard taxonomy gives an occupational starting point; the dashboard then uses LLM-as-judge review to identify a bundle of task statements that should be treated as STEM-related. Once those task statements are identified, postings can show where they appear across the occupation structure. The dynamic examples are shown on Dynamic Maps.
Related-occupation maps use a second step. For each occupation j and task or skill statement s, define an RCA-screened evidence weight:
RCA(j,s) = (mentions of s in j / postings in j) / (mentions of s in all postings / all postings)
e(j,s) = mentions of s in j / postings in j, if RCA(j,s) >= 1 and mentions(j,s) >= 50
= 0, otherwise
rho(j,j') = sum_s e(j,s) * e(j',s) / max(sum_s e(j,s), sum_s e(j',s)) The denominator makes the measure conservative: a smaller occupation does not become highly related to a larger one merely because its few statements are a subset of the larger occupation's profile. The displayed relatedness movies keep the top partners for each occupation. This frees the evidence from a single canonical O*NET task home and lets postings reveal many-to-many occupational structure found in the data.
Posting pairs judged
Consensus accuracy, gap >=0.10
Judge agreement, kappa
BTOS rank correlation, strict
What Was Tested
For each statement, AI class-vector scores are built from cosine similarity to AI keyword classes and averaged across three embedding models. Strict and lenient AI classes then use agreement across models to separate clearer AI statements from AI-adjacent language.
The validation compares randomly selected pairs of USAJOBS posting text inside two strata:
- Strict AI: postings with at least one strict AI code.
- Lenient / AI-adjacent: postings with at least three lenient codes and no strict codes.
Within each stratum, pairs are grouped by AI-score gap: large gaps (>=0.10), medium gaps (0.05 to 0.10), and very small gaps below 0.03. Claude Sonnet 4.6 and OpenAI GPT-4o ranked which posting was more AI-related, or in the lenient stratum, which was closer to the AI / ML frontier.
Validation Results
The sign tests reject random ranking for large and medium score gaps, and fail to reject for very small gaps. That is the useful measurement claim: the AI score distinguishes postings that differ by about 0.05 or more, but not reliably below about 0.03.
The p-values in the summary are exact one-sided binomial sign-test p-values. They are not identical in the source files; very small values can look similar if rounded too aggressively. The audit tables below keep scientific notation and expose the full gatekeeping tests.
Show consensus tests and kappa
Aggregation Reliability
The pairwise validation is about document-level resolution. For group comparisons, reliability improves with sample size. The aggregation checks compare groups of postings by occupation code and by agency, not individual occupations or agencies. The estimates differ between occupation and agency groups because the same posting-level score has different within-group variation, between-group variation, and typical group size under each grouping. Occupation groups separate AI intensity more strongly than agencies for the strict-share measure, so the occupation-group reliability estimate is higher.
The compact formula is: reliability at group size n = n * ICC / (1 + (n - 1) * ICC), where ICC = between variance / (between variance + within variance). The minimum detectable two-group difference is 1.96 * within SD * sqrt(2 / n). With 300 observations in each group, the strict-share detectable difference is about 0.014, or 1.4 percentage points, for both occupation and agency comparisons. The displayed reliability and MDD use duplicate-content-adjusted effective n.
Show aggregation formula details
External Check
As a convergent benchmark, industry-level AIMatch signals are compared with Census Business Trends and Outlook Survey AI-use rates. The rank correlation is positive for mean AI scores and stronger for strict AI counts.
What This Supports
The evidence supports four conservative uses. First, posting sentences can reveal emerging task and skill language that standard taxonomies miss. Second, validated AI scores can distinguish meaningful posting-level differences above the observed resolution floor. Third, group means can support meaningful occupation, agency, or industry comparisons when each group has enough postings, especially above a few hundred observations. Fourth, the same statement-level map can be used as a substrate for future human-reviewed taxonomy development, where statements are retained, merged, relabeled, or rejected as the frontier changes.
Evidence Path
The validation tables use the public JAAT AI-intensity aggregate sources shown on this page. Release checks and public artifact paths are listed in Agreement And Audit Trail; broader limits of posting-derived taxonomies are summarized in What Postings And Taxonomies Can And Cannot Say.
Audit Downloads
The blocks below show the page-level SQL used for each table or figure on this page. They are collapsed so the main evidence stays readable.
Download STEM movie data and code
- STEM task-map movie edge CSV: rolling-window occupation pairs used in the STEM task-map movie, with SOC cousin-distance labels.
- STEM related-occupation movie edge CSV: rolling-window occupation pairs and relatedness
rhoscores. - RCA and relatedness movie renderer: frame-rendering script for RCA and relatedness movies.
- Movie encoder: script that encodes monthly PNG frames into MP4 movies.
- Relatedness formula source: code defining the RCA-screened evidence matrix and
rhomeasure. - Timelapse frame helpers: rolling-window data collection and rendering helpers.
- Manifest: one-line description of each dynamic-map audit file.
Show SQL for headline measures
Show SQL for pairwise validation results
Show SQL for sequential gatekeeping tests
Show SQL for consensus tests and kappa
Show SQL for aggregation reliability
Show SQL for external BTOS correlation check
Show SQL for BTOS industry rows
Show judge prompt templates
Strict-AI stratum:
You will be given two job postings, each identified by a unique integer ID. Rank them by how AI-related the work is - the degree to which the role involves artificial intelligence, machine learning, data science, advanced analytics, or intelligent automation, including the software and data engineering such systems are built on, and soft skills specifically required for AI/ML roles. Do NOT count general use of computers or office software, or operating or repairing technical or electronic equipment that has no AI or machine-learning component. This is always a relative judgment: even if neither posting is strongly AI-related, decide which one is closer to AI/ML work, based on the posting text. Answer with a comma-separated Python list of the two posting IDs, most AI-related first. Output only that list.
Lenient / AI-adjacent stratum:
You will be given two job postings, each identified by a unique integer ID. Both involve technology to some degree. Rank them by how close the work is to the frontier of artificial intelligence and machine learning - i.e., which posting's tasks, skills, and requirements are nearer to building, applying, or supporting AI/ML, data science, advanced analytics, or intelligent automation, as opposed to more general or conventional technology work. This is always a relative judgment: pick the posting whose work is closer to AI/ML, even if neither is strongly AI. Base your decision on the posting text. Answer with a comma-separated Python list of the two posting IDs, closest to the AI frontier first. Output only that list.
Download AI validation audit CSVs
- Pair-level reviewed results: scores, strata, judge picks, consensus status, and whether consensus selected the higher AI-score posting.
- Judge responses: raw model response, parsed ranking, prompt version/hash, token and cost fields, and error flags.
- Gatekeeping tests: fixed-sequence tests by stratum, judge, and score-gap bin.
- Provider sign tests: per-judge exact sign tests for large-gap, 0.05+ gap, and overall sets.
- Consensus sign tests: consensus-only exact sign tests plus Cohen's kappa.
- Aggregation variance: within/between variance, ICC, reliability, duplicate-content adjustment, and MDD.
- Posting-level reliability fit: probit and lapse-adjusted fit estimates from judge-pair observations.
- Prompt-row audit: prompt construction metadata for large and medium gap pairs.
- Very-small-gap prompt-row audit: prompt construction metadata for the below-0.03 gap stage.
- Download manifest: one-line description of each file.