NOSIBLE World

Methodology

NOSIBLE World classifies every event across 15 complementary ontologies because no single taxonomy captures the texture of news. IPTC tells you the topic; GICS tells you the sector; PLOVER tells you the political verb; ATT&CK tells you the cyber technique; Ekman tells you the emotion in the coverage; C2PA tells you whether the attached media is provenance-signed. Each ontology answers a different question. Together they let analysts slice the day from any angle, pivot between coordinates, and find the events that no single-axis system would surface.

How NOSIBLE World builds events

NOSIBLE's search engine ingests millions of articles every day, embeds them, and writes pre-built daily snapshots (codes_v2.ipc and tokens.ndjson) for each slice of the corpus. make_events.py consumes those snapshots and turns them into the structured event records the front end renders. The pipeline is a single linear flow from top to bottom: nine stages, no back-and-forth between modules, and every stage persisted for replay.

Stage 1: Ingest
Stage 2: Cluster
Stage 3: Same-incident dedup
Stage 4: Entity-aware split
Stage 5: Cross-lingual merge
Stage 6: Cross-day linking
Stage 7: LLM enrichment
Stage 8: Multi-ontology classification
Stage 9: Emit

Stage 1: Ingest

Ingest the day's slice

NOSIBLE's search engine continuously ingests millions of articles from every major language across the open web, embeds them, and writes a pre-built daily snapshot to disk: codes_v2.ipc (one row per article with dense + sparse signatures, IAB labels, geography, language) and tokens.ndjson (the lexical token stream used for re-ranking). Stage 1 loads that snapshot, drops slices with fewer than 100 qualified anchors, and selects the rows that will seed the day's graph.

Stage 2: Cluster

Build the event graph and cluster with Leiden

For every anchor we build a multi-channel kNN graph: 3,072-bit MinHash sketches for lexical similarity, Hamming-distance neighbours over the 1,536-bit dense codes for semantic similarity, and a final lexical re-rank pass on the candidate edges. Edges land in tiered gates (semantic min, lexical min, max degree) so dense topics keep their structure without exploding hub nodes. Leiden community detection (CPM resolution 0.001, three refinement passes) produces the day's raw events, with classifier votes from central members propagating IAB tiers and geography to the cluster.
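The semantic channel described above can be illustrated with a brute-force sketch. It is not the production index: the function names, the `max_distance` gate and the toy 8-bit codes are assumptions made for illustration, and a real 1,536-bit corpus would need an approximate index rather than an all-pairs scan.

```python
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def semantic_neighbours(
    codes: Dict[str, int], k: int, max_distance: int
) -> Dict[str, List[Tuple[str, int]]]:
    """For each document, keep at most k neighbours within max_distance bits.

    Hypothetical helper: k plays the role of a max-degree gate and
    max_distance the role of a semantic-minimum gate.
    """
    graph: Dict[str, List[Tuple[str, int]]] = {}
    for doc_id, code in codes.items():
        scored = [
            (hamming(code, other_code), other_id)
            for other_id, other_code in codes.items()
            if other_id != doc_id
        ]
        # Gate first, then sort by distance so the closest neighbours win.
        kept = sorted(pair for pair in scored if pair[0] <= max_distance)
        graph[doc_id] = [(other_id, dist) for dist, other_id in kept[:k]]
    return graph


# Toy 8-bit codes standing in for the 1,536-bit dense codes.
toy_codes = {"a": 0b11110000, "b": 0b11110001, "c": 0b00001111}
toy_graph = semantic_neighbours(toy_codes, k=2, max_distance=2)
```

Capping the neighbour list per node is what keeps a dense topic from turning a single article into a hub.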

Stage 3: Same-incident dedup

Same-incident netloc-Jaccard merge

Two clusters about the same incident often survive Stage 2 because their language and framing differ enough to keep edges sparse. Stage 3 merges them when their netloc (source-domain) sets agree: two events collapse into one when they share at least 3 netlocs and their netloc Jaccard is >= 0.25. The smaller cluster is absorbed into the larger; canonical event ids are renumbered by rank.
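A minimal sketch of the merge test, assuming each cluster is reduced to the set of netlocs of its member URLs. The helper names are hypothetical; only the two thresholds come from the text above.

```python
from typing import Iterable, Set
from urllib.parse import urlparse


def netlocs(urls: Iterable[str]) -> Set[str]:
    """Reduce article URLs to a set of lower-cased network locations."""
    return {urlparse(url).netloc.lower() for url in urls}


def should_merge(
    a: Set[str], b: Set[str], min_shared: int = 3, min_jaccard: float = 0.25
) -> bool:
    """Apply the same-incident rule: enough shared netlocs AND enough overlap."""
    shared = len(a & b)
    union = len(a | b)
    if union == 0:
        return False
    return shared >= min_shared and shared / union >= min_jaccard


cluster_a = {"a.com", "b.com", "c.com", "d.com"}
cluster_b = {"a.com", "b.com", "c.com", "e.com"}
merge_ab = should_merge(cluster_a, cluster_b)  # 3 shared, Jaccard 3/5
```

Requiring both conditions matters: the absolute count stops two tiny clusters from merging on a single shared wire service, and the Jaccard floor stops a very large cluster from absorbing everything that happens to share three outlets with it.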

Stage 4: Entity-aware split

Entity-aware cluster splitting (spaCy NER)

Large clusters (>= 200 docs) get a representative subset run through a per-language spaCy NER model (en/de/fr/es/it/zh/ja/... with xx_ent_wiki_sm as the multilingual fallback). The persisted PERSON / ORG / GPE / LOC entity sets drive a KEEP / SPLIT / AMBIGUOUS rule: clusters with one dominant entity (>= 70% share) stay intact; clusters where two candidate entities each clear 30% share AND their doc-set Jaccard is below 0.20 get split. The same entities_tagged map is also handed to Stage 7 as a constraint on the LLM's key_entities and primary_location output.
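The rule can be sketched as a small decision function. Two points are assumptions rather than statements from the text: the KEEP test is checked before the SPLIT test, and every cluster that satisfies neither is labelled AMBIGUOUS. The function name and input shape (entity name mapped to the set of document ids mentioning it) are also hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def split_decision(
    entity_docs: Dict[str, Set[int]],
    n_docs: int,
    keep_share: float = 0.70,
    split_share: float = 0.30,
    max_jaccard: float = 0.20,
) -> str:
    """Return KEEP, SPLIT or AMBIGUOUS for one cluster."""
    if n_docs == 0 or not entity_docs:
        return "AMBIGUOUS"
    shares = {name: len(docs) / n_docs for name, docs in entity_docs.items()}
    # One dominant entity: the cluster is about a single thing.
    if max(shares.values()) >= keep_share:
        return "KEEP"
    # Two strong entities whose documents barely overlap: two stories.
    candidates = [name for name, share in shares.items() if share >= split_share]
    for first, second in combinations(candidates, 2):
        docs_a, docs_b = entity_docs[first], entity_docs[second]
        jaccard = len(docs_a & docs_b) / len(docs_a | docs_b)
        if jaccard < max_jaccard:
            return "SPLIT"
    return "AMBIGUOUS"  # assumption: the fall-through label
```

For example, a ten-document cluster where "Apple" appears in documents 0 to 3 and "Tesla" in documents 5 to 8 has two 40% entities with no shared documents, so it is split.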

Stage 5: Cross-lingual merge

Cross-lingual event merging

Two events with the same IAB tier and the same primary country, but different languages, are very often the same incident covered by different press traditions. Stage 5 bridges them when shared netlocs >= 2 and netloc Jaccard >= 0.20. The surviving event records every language it now spans in a languages_covered list, so the front end can show "covered in 7 languages" without re-deriving it.
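A sketch of the bridge test and the resulting languages_covered list follows. The event dictionaries and their field names other than languages_covered are hypothetical, and "different languages" is interpreted here as "the two language sets are not identical", which is an assumption.

```python
from typing import Any, Dict


def can_bridge(
    a: Dict[str, Any], b: Dict[str, Any],
    min_shared: int = 2, min_jaccard: float = 0.20,
) -> bool:
    """Same IAB tier, same country, different languages, overlapping sources."""
    if a["iab_tier"] != b["iab_tier"]:
        return False
    if a["primary_country"] != b["primary_country"]:
        return False
    if set(a["languages_covered"]) == set(b["languages_covered"]):
        return False
    shared = a["netlocs"] & b["netlocs"]
    union = a["netlocs"] | b["netlocs"]
    # The count check short-circuits before the division when union is empty.
    return len(shared) >= min_shared and len(shared) / len(union) >= min_jaccard


def bridge(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Merge b into a and record every language the survivor now spans."""
    merged = dict(a)
    merged["netlocs"] = a["netlocs"] | b["netlocs"]
    merged["languages_covered"] = sorted(
        set(a["languages_covered"]) | set(b["languages_covered"])
    )
    return merged


english = {"iab_tier": "News", "primary_country": "DE", "languages_covered": ["en"],
           "netlocs": {"reuters.com", "apnews.com", "bbc.co.uk"}}
german = {"iab_tier": "News", "primary_country": "DE", "languages_covered": ["de"],
          "netlocs": {"reuters.com", "apnews.com", "spiegel.de"}}
survivor = bridge(english, german) if can_bridge(english, german) else english
```

Storing the sorted union on the surviving record is what lets the front end read the language count directly.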

Stage 6: Cross-day linking

Cross-day story chains

A 7-day sliding window over yesterday's and previous days' events drives transitive story-chain linking. We score combined similarity (0.7 dense Hamming + 0.3 sparse Bloom Jaccard) against every predecessor and inherit the chain when it clears 0.65. Each event ends up with novelty_score (1 minus the best predecessor similarity), story_id (inherited from the chain head, or freshly minted), story_first_seen, story_span_days, story_position (1, 2, 3, ... down the chain) and story_predecessor_id. This is what the front end uses to mark events as new vs. ongoing, and to render multi-day timelines.
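The scoring and inheritance step can be sketched as follows. Several details are assumptions made for illustration: dense Hamming similarity is taken as 1 minus the fraction of differing bits, "clears 0.65" is read as >= 0.65, story_span_days counts days inclusively, a fresh story_id is derived from the event id, and the record layout is hypothetical. Only the 0.7 / 0.3 weights, the threshold and the output field names come from the text.

```python
from datetime import date
from typing import Any, Dict, List, Set


def dense_similarity(a: int, b: int, n_bits: int) -> float:
    """1.0 for identical codes, 0.0 when every bit differs (assumed form)."""
    return 1.0 - bin(a ^ b).count("1") / n_bits


def bloom_jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap of the set bit positions of two sparse signatures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(ev: Dict[str, Any], pred: Dict[str, Any], n_bits: int) -> float:
    return (0.7 * dense_similarity(ev["dense"], pred["dense"], n_bits)
            + 0.3 * bloom_jaccard(ev["bloom"], pred["bloom"]))


def link_to_chain(event: Dict[str, Any], predecessors: List[Dict[str, Any]],
                  n_bits: int, threshold: float = 0.65) -> Dict[str, Any]:
    """Attach story-chain fields to an event given the window's predecessors."""
    best, best_sim = None, 0.0
    for pred in predecessors:
        sim = combined_similarity(event, pred, n_bits)
        if sim > best_sim:
            best, best_sim = pred, sim
    linked = dict(event)
    linked["novelty_score"] = 1.0 - best_sim
    if best is not None and best_sim >= threshold:
        # Inherit the chain from the best-matching predecessor.
        linked["story_id"] = best["story_id"]
        linked["story_first_seen"] = best["story_first_seen"]
        linked["story_position"] = best["story_position"] + 1
        linked["story_predecessor_id"] = best["event_id"]
    else:
        # No predecessor is close enough: this event heads a new chain.
        linked["story_id"] = f"story-{event['event_id']}"
        linked["story_first_seen"] = event["day"]
        linked["story_position"] = 1
        linked["story_predecessor_id"] = None
    linked["story_span_days"] = (event["day"] - linked["story_first_seen"]).days + 1
    return linked
```

Because each event inherits story_id from its predecessor, which in turn inherited it from its own, the linking is transitive without ever comparing an event against the chain head directly.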

Stage 7: LLM enrichment

Single-call LLM enrichment with entity constraints

Each event gets exactly one Gemini call. The payload is kept as clean as possible: top representative documents (capped at ~1,200 chars each), the spaCy entities_tagged map baked in as constraints, and structured numeric metadata. The LLM's job is cleanup and structuring, not generation: it returns title, description, IAB labels, sentiment, materiality, time_horizon, sectors, key_entities (must be drawn from entities_tagged), primary_location (must be in the GPE bucket), and secondary_locations. Lat / lng resolution is the LLM's job. Pydantic validation rejects any output that drifts from the entity constraint, with up to three retry-with-feedback rounds.
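The constraint check and the retry-with-feedback loop can be sketched in plain Python. This stands in for the Pydantic validator so the example stays dependency-free; it is not the production schema. The `call_llm` callable, the function names and the reading of "up to three retry rounds" as one initial attempt plus three retries are all assumptions.

```python
from typing import Any, Callable, Dict, List, Set


def constraint_errors(
    output: Dict[str, Any], entities_tagged: Dict[str, Set[str]]
) -> List[str]:
    """List every way the output drifts from the entity constraint."""
    allowed = set().union(*entities_tagged.values())
    errors = []
    for entity in output.get("key_entities", []):
        if entity not in allowed:
            errors.append(f"key_entities: {entity!r} is not in entities_tagged")
    location = output.get("primary_location")
    if location not in entities_tagged.get("GPE", set()):
        errors.append(f"primary_location: {location!r} is not in the GPE bucket")
    return errors


def enrich_with_retries(
    call_llm: Callable[[Dict[str, Any], List[str]], Dict[str, Any]],
    payload: Dict[str, Any],
    entities_tagged: Dict[str, Set[str]],
    max_retries: int = 3,
) -> Dict[str, Any]:
    """Call the model, feeding validation errors back on each retry."""
    feedback: List[str] = []
    for _ in range(1 + max_retries):
        output = call_llm(payload, feedback)  # hypothetical model wrapper
        feedback = constraint_errors(output, entities_tagged)
        if not feedback:
            return output
    raise ValueError("enrichment failed validation: " + "; ".join(feedback))
```

Passing the concrete error strings back to the model, rather than simply re-asking, gives it the specific entity or location to correct.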

Stage 8: Multi-ontology classification

Multi-ontology classification (OpenAI embedding)

Stage 8 classifies every enriched event against all 13 v1 taxonomies using one OpenAI text-embedding-3-large vector over the event title and description. The vector scores against precomputed taxonomy caches via cosine similarity, and the highest-scoring label is written into the ontology block. There is no LLM in the classification path.
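The scoring step reduces to a cosine comparison between one event vector and a cache of label vectors. The sketch below uses tiny hand-written vectors in place of real text-embedding-3-large output, and the function names and cache layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_label(
    event_vec: List[float], label_cache: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Score the event against every cached label and keep the top one."""
    scores = {label: cosine(event_vec, vec) for label, vec in label_cache.items()}
    top = max(scores, key=scores.get)
    return top, scores[top]


# Toy 2-d vectors; a real cache would hold one embedding per taxonomy label.
toy_cache = {"election": [0.9, 0.1], "flood": [0.0, 1.0]}
label, score = best_label([1.0, 0.0], toy_cache)
```

Because the event is embedded once and the label vectors are precomputed, adding a taxonomy costs one more pass of dot products rather than another model call.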

Stage 9: Emit

Atomic emit to artefacts and markdown tree

Every artefact is written to a .tmp file and atomically renamed into place: events.ipc (loc -> event_id + score), events.ndjson (the full record with LLM enrichment + story chain + multi-ontology, and the single source of truth for the day), and events_repr.ipc (per-event dense + Bloom representations cached for the next day's Stage 6). The NOSIBLE World frontend reads events.ndjson directly and generates all markdown downloads on the fly.
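The write-then-rename pattern can be sketched with the standard library. The helper name and the record contents are hypothetical; `os.replace` is the standard-library call that performs the rename, which is atomic on POSIX when the temporary file and the target are on the same filesystem.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


def atomic_write_ndjson(path: str, records: Iterable[Dict[str, Any]]) -> None:
    """Write records to path + '.tmp', then rename over the final path."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        # Push the bytes to disk before the rename makes them visible.
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


# Usage: write one toy record into a scratch directory and read it back.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "events.ndjson")
    atomic_write_ndjson(target, [{"event_id": 1}])
    leftover_tmp = os.path.exists(target + ".tmp")
    with open(target, encoding="utf-8") as handle:
        written_lines = handle.read().splitlines()
```

A reader that opens events.ndjson therefore sees either the previous complete file or the new complete file, never a half-written one.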

The 15 ontologies

Every event flows through Stage 8 against thirteen taxonomies in v1 today; two more (Wikidata QIDs and C2PA Content Credentials) are staged for v1.1. Each ontology answers a question the others cannot.

Topical spine

IPTC Media Topics

Authority: International Press Telecommunications Council
Scale: ~1,200 codes, 5 levels deep
Examples: medtop:20000638 (election), medtop:20000553 (monetary policy)

The newsroom-grade hierarchy that desks have used for fifty years. Gives every event a stable topical address ("politics > election > campaign finance") that survives renaming and translation.

Document type

IPTC Genre

Authority: International Press Telecommunications Council
Scale: ~30 newsroom output types
Examples: Actuality, From the Scene, Press Release, Analysis, Obituary

Distinguishes what kind of newsroom output we are looking at. A press release, a live blog, an obituary, and an analysis piece all describe an event differently; topical taxonomies cannot capture that texture.

Entities

Planned for v1.1

Wikidata QIDs

Authority: Wikimedia Foundation
Scale: ~100M entities (people, orgs, places, products)
Examples: Q312 (Apple Inc.), Q22686 (Donald Trump), Q148 (China)

Anchors event entities to a global, cross-lingual identifier graph. Lets analysts pivot from "Apple Inc." (Q312) to every other event that names the same entity, regardless of language or surface form.

Sectors

GICS Sub-Industry

Authority: MSCI / S&P Dow Jones Indices
Scale: 11 sectors, 158 sub-industries
Examples: Semiconductors, Pharmaceuticals, Integrated Oil & Gas

The sector taxonomy every equity desk in the world already uses. Lets the platform talk to portfolio managers in the language of their book without a translation layer.

Corporate events

NOSIBLE Corporate Events

Authority: NOSIBLE in-house taxonomy
Scale: ~80 action types
Examples: Earnings beat, Layoff announcement, CEO departure, Buyback

GICS tells you which company; this tells you what happened to it. Earnings beats, layoffs, acquisitions, CEO changes, product launches, buybacks, settlements, regulatory probes - the verb of the story.

Political events

PLOVER Event Types

Authority: Open Event Data Alliance
Scale: ~40 political action types
Examples: CONSULT, ASSAULT, PROTEST, ACCUSE, NEGOTIATE

The academic standard for coding political events at scale. Built for conflict analysts and political scientists. Captures the action verb of geopolitical news in a way IPTC cannot.

Disaster events

EM-DAT Disasters

Authority: Centre for Research on the Epidemiology of Disasters (UCLouvain)
Scale: 5 disaster groups, ~25 sub-types
Examples: Hydrological/Flood, Geophysical/Earthquake, Climatological/Drought

The reference taxonomy used by every humanitarian agency. Distinguishes natural-disaster types in a way reinsurers, NGOs and governments all already index against.

Cyber events

MITRE ATT&CK

Authority: MITRE Corporation
Scale: 14 tactics, ~200 techniques
Examples: TA0001 (Initial Access), T1566 (Phishing), TA0010 (Exfiltration)

The industry-standard cyber adversary framework. When a breach hits the news, every SOC reads it through ATT&CK. Lets us tag cyber stories in the vocabulary defenders actually use to triage.

Health events

ICD-11 Chapters

Authority: World Health Organization
Scale: ~17,000 entities (we ship chapter + named-block, ~150)
Examples: 01 (Infectious diseases), 02 (Neoplasms), 06 (Mental health)

The global standard for classifying disease and health events. Bridges health-news coverage to the same vocabulary that ministries of health, the WHO, and clinical research already use.

Sports events

IPTC SportsML

Authority: International Press Telecommunications Council
Scale: ~100 sport / event-format combinations
Examples: Soccer/Match, Tennis/Grand Slam, Athletics/Doping

Sports stories are distinct enough that lumping them into IPTC Media Topics buries the structure. SportsML separates the discipline from the event-type (final, qualifier, transfer, suspension).

Cultural and entertainment events

schema.org Event

Authority: schema.org / W3C
Scale: ~30 event types
Examples: MusicEvent, FoodEvent, BusinessEvent, ScreeningEvent

The web's default event vocabulary. Makes cultural, entertainment and civic events (concerts, festivals, screenings, parades, summits) addressable in the same vocabulary search engines and calendars use.

Framing

Media Frames

Authority: Boydstun et al., Policy Frames Codebook
Scale: 15 generic policy frames
Examples: Economic, Morality, Public sentiment, Security and defense

Two outlets can cover the same event under different frames - one as an "economic consequences" story, another as a "morality" story. Topic taxonomies miss this. Frames make it surfaceable.

Emotion

Ekman 6 Emotions

Authority: Paul Ekman (cross-cultural emotion research)
Scale: 6 universal categories
Examples: Anger, Disgust, Fear, Joy, Sadness, Surprise

Sentiment alone (positive/negative/neutral) is too thin. Ekman's six capture the emotional texture of coverage at a level that survives across languages and cultures.

Provenance

Planned for v1.1

C2PA Content Credentials

Authority: Coalition for Content Provenance and Authenticity
Scale: Standards-based (assertion + claim signatures)
Examples: c2pa.signed, c2pa.actions:edited, c2pa.actions:generated

A signal of media authenticity, not of subject. Tracks whether the images and video attached to a story carry verifiable provenance, which matters more every cycle as synthetic media gets cheaper.

Ad / brand-safety adjunct

IAB Content Taxonomy

Authority: Interactive Advertising Bureau (Tech Lab)
Scale: ~700 codes, 4 tiers
Examples: IAB1-1 (Books & Literature), IAB13-9 (Personal Investing)

Not for editorial use - for ad-tech and brand-safety alignment. Lets buyers and agencies map NOSIBLE coverage to the same content codes their DSPs and brand-safety vendors already speak.

Why all 15

A single-ontology system forces every event into one frame. A 15-ontology system surfaces the orthogonal dimensions of every event so analysts can pivot between them. The same Apple earnings call is an Information Technology / Technology Hardware, Storage & Peripherals story (GICS), an Earnings beat (NOSIBLE corp events), a press-release-driven Analysis (IPTC Genre), an Economic-frame piece (Media Frames), and, depending on the day, a Joy or Surprise tilt in tone (Ekman). Every coordinate is independent. Every coordinate is queryable.

The classifier achieves this without paying an LLM per event per taxonomy. Every classification resolves via deterministic cosine matching: one OpenAI title-plus-description embedding scores the event against taxonomy label caches, required taxonomies take the best match, and nullable taxonomies emit null when the best cosine falls below the configured floor. There is no LLM in the classification path.
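The required-versus-nullable behaviour can be sketched on top of already-computed cosine scores. The function names, the taxonomy keys, the choice of which taxonomy is nullable and the 0.35 floor are hypothetical values chosen for illustration; the text only states that nullable taxonomies emit null below a configured floor.

```python
from typing import Dict, Optional


def resolve(
    scores: Dict[str, float], floor: Optional[float]
) -> Optional[Dict[str, object]]:
    """Pick the best label; a nullable taxonomy (floor set) may return None."""
    if not scores:
        return None
    top = max(scores, key=scores.get)
    if floor is not None and scores[top] < floor:
        return None  # nullable taxonomy, best match too weak
    return {"label": top, "cosine": scores[top]}


def classify_all(
    scores_by_taxonomy: Dict[str, Dict[str, float]],
    nullable_floors: Dict[str, float],
) -> Dict[str, Optional[Dict[str, object]]]:
    """Build the ontology block; taxonomies without a floor are required."""
    return {
        taxonomy: resolve(scores, nullable_floors.get(taxonomy))
        for taxonomy, scores in scores_by_taxonomy.items()
    }


ontology_block = classify_all(
    {
        "iptc_media_topics": {"election": 0.62, "economy": 0.40},
        "emdat": {"Hydrological/Flood": 0.21, "Geophysical/Earthquake": 0.18},
    },
    nullable_floors={"emdat": 0.35},  # hypothetical floor
)
```

In this toy input the election story keeps its topical label with its cosine attached, while the disaster taxonomy resolves to null instead of forcing a weak flood match onto it.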

Every match carries its cosine score so the front end can show the classification strength transparently. No black box.
