C. Finding Emerging Skills

When standard taxonomies do not contain task or skill items for emerging technologies, labor market signals are missed. This finding asks whether postings can supply emerging task and skill statements missing from standard taxonomies, and whether those statements can support informative measures of the technological frontier.

AIMatch starts from posting sentences, places them in a task and skill architecture, and scores each statement for AI relatedness. The current experimental taxonomy contains 19,660 base-level statements. It is not presented as a finished ontology. It is a provisional map: a structured set of posting-derived statements that can be validated, revised, and used to measure where AI-related work is appearing.

This construction uses the TaxonomyBuilder approach for organizing posting-derived task and skill sentences into inspectable maps. See the TaxonomyBuilder article.

Provisional status. The statement taxonomy is evidence for building adaptive task and skill maps, not a final replacement for O*NET or ESCO. The validation below supports relative AI intensity measurement, especially above the observed resolution floor.
Taxonomy builder seeded with AI-related task and skill sentences from postings
Taxonomy builder view. The workflow starts with task and skill sentences found in postings with AI keywords, then organizes those statements into a provisional taxonomy for review, scoring, and validation. See the TaxonomyBuilder article.

From Statements To Maps

The same measurement problem appears outside AI. For STEM, the standard taxonomy gives an occupational starting point; the dashboard then uses LLM-as-judge review to identify a bundle of task statements that should be treated as STEM-related. Once those task statements are identified, postings can show where they appear across the occupation structure. The dynamic examples are shown on Dynamic Maps.

Related-occupation maps use a second step. For each occupation j and task or skill statement s, define an RCA-screened evidence weight:

RCA(j,s) = (mentions of s in j / postings in j) / (mentions of s in all postings / all postings)

e(j,s) = mentions of s in j / postings in j, if RCA(j,s) >= 1 and mentions(j,s) >= 50
       = 0, otherwise

rho(j,j') = sum_s e(j,s) * e(j',s) / max(sum_s e(j,s), sum_s e(j',s))

The denominator makes the measure conservative: a smaller occupation does not become highly related to a larger one merely because its few statements are a subset of the larger occupation's profile. The displayed relatedness movies keep the top partners for each occupation. This frees the evidence from a single canonical O*NET task home and lets postings reveal many-to-many occupational structure found in the data.

Posting pairs judged

220

Consensus accuracy, gap >=0.10

88.9%

Judge agreement, kappa

0.88

BTOS rank correlation, strict

0.75

What Was Tested

For each statement, AI class-vector scores are built from cosine similarity to AI keyword classes and averaged across three embedding models. Strict and lenient AI classes then use agreement across models to separate clearer AI statements from AI-adjacent language.

The validation compares randomly selected pairs of USAJOBS posting text inside two strata:

  • Strict AI: postings with at least one strict AI code.
  • Lenient / AI-adjacent: postings with at least three lenient codes and no strict codes.

Within each stratum, pairs are grouped by AI-score gap: large gaps (>=0.10), medium gaps (0.05 to 0.10), and very small gaps below 0.03. Claude Sonnet 4.6 and OpenAI GPT-4o ranked which posting was more AI-related, or in the lenient stratum, which was closer to the AI / ML frontier.

Validation Results

The sign tests reject random ranking for large and medium score gaps, and fail to reject for very small gaps. That is the useful measurement claim: the AI score distinguishes postings that differ by about 0.05 or more, but not reliably below about 0.03.

The p-values in the summary are exact one-sided binomial sign-test p-values. They are not identical in the source files; very small values can look similar if rounded too aggressively. The audit tables below keep scientific notation and expose the full gatekeeping tests.

No Results
Terms. Accuracy = judge selected the posting with the higher AI score. p-value = exact one-sided binomial sign test against 50/50 ranking. Reject random = p <= 0.05.
No Results
Show consensus tests and kappa
Terms. Consensus rows include only pairs where Claude and GPT-4o selected the same posting. Cohen's kappa summarizes inter-judge agreement across the full validation set.
No Results

Aggregation Reliability

The pairwise validation is about document-level resolution. For group comparisons, reliability improves with sample size. The aggregation checks compare groups of postings by occupation code and by agency, not individual occupations or agencies. The estimates differ between occupation and agency groups because the same posting-level score has different within-group variation, between-group variation, and typical group size under each grouping. Occupation groups separate AI intensity more strongly than agencies for the strict-share measure, so the occupation-group reliability estimate is higher.

The compact formula is: reliability at group size n = n * ICC / (1 + (n - 1) * ICC), where ICC = between variance / (between variance + within variance). The minimum detectable two-group difference is 1.96 * within SD * sqrt(2 / n). With 300 observations in each group, the strict-share detectable difference is about 0.014, or 1.4 percentage points, for both occupation and agency comparisons. The displayed reliability and MDD use duplicate-content-adjusted effective n.

Terms. Compared groups = occupation-code or agency grouping used in the check. Units = groups compared. Median N = postings per group. Reliability = estimated reliability of group means. MDD = minimum detectable two-group difference.
No Results
Show aggregation formula details
How the numbers are produced. For each grouping, the script estimates a one-way random-effects decomposition: pooled within-group variance, between-group variance, and ICC. Duplicate template content reduces the effective median N. Reliability is then computed at that effective N.
No Results

External Check

As a convergent benchmark, industry-level AIMatch signals are compared with Census Business Trends and Outlook Survey AI-use rates. The rank correlation is positive for mean AI scores and stronger for strict AI counts.

No Results
No Results

What This Supports

The evidence supports four conservative uses. First, posting sentences can reveal emerging task and skill language that standard taxonomies miss. Second, validated AI scores can distinguish meaningful posting-level differences above the observed resolution floor. Third, group means can support meaningful occupation, agency, or industry comparisons when each group has enough postings, especially above a few hundred observations. Fourth, the same statement-level map can be used as a substrate for future human-reviewed taxonomy development, where statements are retained, merged, relabeled, or rejected as the frontier changes.

Evidence Path

The validation tables use the public JAAT AI-intensity aggregate sources shown on this page. Release checks and public artifact paths are listed in Agreement And Audit Trail; broader limits of posting-derived taxonomies are summarized in What Postings And Taxonomies Can And Cannot Say.

Audit Downloads

The blocks below show the page-level SQL used for each table or figure on this page. They are collapsed so the main evidence stays readable.

Download STEM movie data and code
Show SQL for headline measures
Show SQL for pairwise validation results
Show SQL for sequential gatekeeping tests
Show SQL for consensus tests and kappa
Show SQL for aggregation reliability
Show SQL for external BTOS correlation check
Show SQL for BTOS industry rows
Show judge prompt templates

Strict-AI stratum:

You will be given two job postings, each identified by a unique integer ID.

Rank them by how AI-related the work is - the degree to which the role involves
artificial intelligence, machine learning, data science, advanced analytics, or
intelligent automation, including the software and data engineering such systems
are built on, and soft skills specifically required for AI/ML roles. Do NOT count
general use of computers or office software, or operating or repairing technical
or electronic equipment that has no AI or machine-learning component.

This is always a relative judgment: even if neither posting is strongly
AI-related, decide which one is closer to AI/ML work, based on the posting text.

Answer with a comma-separated Python list of the two posting IDs, most AI-related
first. Output only that list.

Lenient / AI-adjacent stratum:

You will be given two job postings, each identified by a unique integer ID.

Both involve technology to some degree. Rank them by how close the work is to the
frontier of artificial intelligence and machine learning - i.e., which posting's
tasks, skills, and requirements are nearer to building, applying, or supporting
AI/ML, data science, advanced analytics, or intelligent automation, as opposed to
more general or conventional technology work.

This is always a relative judgment: pick the posting whose work is closer to AI/ML,
even if neither is strongly AI. Base your decision on the posting text.

Answer with a comma-separated Python list of the two posting IDs, closest to the
AI frontier first. Output only that list.
Download AI validation audit CSVs