NOSIBLE World classifies every event across 15 complementary ontologies (13 live in v1, two more staged for v1.1) because no single taxonomy captures the texture of news. IPTC tells you the topic; GICS tells you the sector; PLOVER tells you the political verb; ATT&CK tells you the cyber technique; Ekman tells you the emotion in the coverage; C2PA tells you whether the attached media is provenance-signed. Each ontology answers a different question. Together they let analysts slice the day from any angle, pivot between coordinates, and find the events that no single-axis system would surface.
NOSIBLE's search engine ingests millions of articles every day, embeds them, and writes pre-built daily snapshots (codes_v2.ipc and tokens.ndjson) for each slice of the corpus. make_events.py consumes those snapshots and turns them into the structured event records the front end renders. The pipeline is a single linear top-to-bottom flow: nine stages, no inter-module dance, every stage persisted for replay.
NOSIBLE's search engine continuously ingests millions of articles from every major language across the open web, embeds them, and writes a pre-built daily snapshot to disk: codes_v2.ipc (one row per article with dense + sparse signatures, IAB labels, geography, language) and tokens.ndjson (the lexical token stream used for re-ranking). Stage 1 loads that snapshot, drops slices with fewer than 100 qualified anchors, and selects the rows that will seed the day's graph.
For every anchor we build a multi-channel kNN graph: 3,072-bit MinHash sketches for lexical similarity, Hamming-distance neighbours over the 1,536-bit dense codes for semantic similarity, and a final lexical re-rank pass on the candidate edges. Edges pass through tiered gates (semantic minimum, lexical minimum, maximum degree) so dense topics keep their structure without producing oversized hub nodes. Leiden community detection (CPM resolution 0.001, three refinement passes) produces the day's raw events, with classifier votes from central members propagating IAB tiers and geography to the cluster.
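As a rough illustration of the semantic channel, the sketch below runs a brute-force Hamming-distance neighbour search over binary codes stored as Python integers. The function names, the toy 8-bit codes and the brute-force scan are assumptions made for readability; the text does not say how the 1,536-bit codes are stored or indexed.

```python
import heapq

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes held as ints."""
    return bin(a ^ b).count("1")

def hamming_knn(anchor: int, candidates: dict, k: int) -> list:
    """Return the k closest candidates as (doc_id, distance) pairs.

    Brute-force scan for clarity (an assumption, not the production index).
    """
    scored = ((doc_id, hamming(anchor, code)) for doc_id, code in candidates.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy 8-bit codes standing in for the 1,536-bit dense codes.
codes = {"a": 0b11110000, "b": 0b11110001, "c": 0b00001111}
neighbours = hamming_knn(0b11110000, codes, k=2)
```

A smaller distance means a closer semantic match, so the gate on "semantic minimum" would translate into a maximum allowed Hamming distance in this representation.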
Two clusters about the same incident often survive Stage 2 because their language and framing differ enough to keep edges sparse. Stage 3a merges them when their netloc (publisher hostname) sets agree: a shared netloc count >= 3 together with a netloc Jaccard >= 0.25 collapses the two events into one. The smaller cluster is absorbed into the larger; canonical event ids are renumbered by rank.
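The merge test can be written in a few lines. This is a minimal sketch using the two thresholds quoted above; the function names and the example hostnames are illustrative assumptions.

```python
def netloc_overlap(a: set, b: set) -> tuple:
    """Return (shared count, Jaccard index) for two sets of hostnames."""
    shared = len(a & b)
    union = len(a | b)
    return shared, (shared / union if union else 0.0)

def should_merge(a: set, b: set, min_shared: int = 3, min_jaccard: float = 0.25) -> bool:
    """Both conditions must hold: enough shared outlets AND enough overlap."""
    shared, jaccard = netloc_overlap(a, b)
    return shared >= min_shared and jaccard >= min_jaccard

big = {"bbc.co.uk", "reuters.com", "apnews.com", "cnn.com"}
small = {"bbc.co.uk", "reuters.com", "apnews.com", "lemonde.fr", "spiegel.de"}
tiny = {"bbc.co.uk", "reuters.com"}

merge_big_small = should_merge(big, small)  # 3 shared, Jaccard 3/6
merge_big_tiny = should_merge(big, tiny)    # only 2 shared
```

Requiring both an absolute count and a ratio keeps two small clusters from merging on a single coincidental outlet, and keeps two very large clusters from merging on a handful of wire services.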
Large clusters (>= 200 docs) get a representative subset run through a per-language spaCy NER model (en/de/fr/es/it/zh/ja/... with xx_ent_wiki_sm as the multilingual fallback). The persisted PERSON / ORG / GPE / LOC entity sets drive a KEEP / SPLIT / AMBIGUOUS rule: clusters with one dominant entity (>= 70% share) stay intact; clusters where two candidate entities each clear 30% share AND their doc-set Jaccard is below 0.20 get split. The same entities_tagged map is also handed to Stage 5 as a constraint on the LLM's key_entities and primary_location output.
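A hedged sketch of that decision rule follows. It assumes "share" means the fraction of sampled documents that mention an entity, that the KEEP check runs before the SPLIT check, and that anything matching neither rule is AMBIGUOUS; none of those three details is spelled out in the text.

```python
from itertools import combinations

def decide(entity_docs: dict, n_docs: int) -> str:
    """entity_docs maps an entity name to the set of doc ids that mention it."""
    shares = {name: len(docs) / n_docs for name, docs in entity_docs.items()}

    # One dominant entity (>= 70% share): leave the cluster intact.
    if any(share >= 0.70 for share in shares.values()):
        return "KEEP"

    # Two candidates (>= 30% each) whose doc sets barely overlap: split.
    candidates = [name for name, share in shares.items() if share >= 0.30]
    for a, b in combinations(candidates, 2):
        docs_a, docs_b = entity_docs[a], entity_docs[b]
        jaccard = len(docs_a & docs_b) / len(docs_a | docs_b)
        if jaccard < 0.20:
            return "SPLIT"

    return "AMBIGUOUS"  # assumed fallback when neither rule fires

keep = decide({"Apple": set(range(8))}, n_docs=10)
split = decide({"Apple": {0, 1, 2, 3, 4}, "Boeing": {5, 6, 7, 8, 9}}, n_docs=10)
ambiguous = decide({"Apple": {0, 1, 2, 3, 4}, "Tim Cook": {2, 3, 4, 5, 6}}, n_docs=10)
```

In the last case both entities clear 30%, but they co-occur in three of seven documents (Jaccard about 0.43), so the cluster is most likely one story about both.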
Two events with the same IAB tier and the same primary country, but different languages, are very often the same incident covered by different press traditions. Stage 3c bridges them when shared netlocs >= 2 and netloc Jaccard >= 0.20. The surviving event records every language it now spans in a languages_covered list, so the front end can show "covered in 7 languages" without re-deriving it.
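The bridge differs from the Stage 3a merge in its preconditions and in what it records. The sketch below is one possible reading: the `Event` fields, the way "different languages" is tested, and the sorted `languages_covered` list are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Event:
    iab_tier: str
    country: str
    netlocs: set
    languages_covered: list

def can_bridge(a: Event, b: Event, min_shared: int = 2, min_jaccard: float = 0.20) -> bool:
    """Same IAB tier, same country, different languages, overlapping outlets."""
    if a.iab_tier != b.iab_tier or a.country != b.country:
        return False
    if set(a.languages_covered) == set(b.languages_covered):
        return False  # assumed reading of "different languages"
    shared = len(a.netlocs & b.netlocs)
    union = len(a.netlocs | b.netlocs)
    return shared >= min_shared and union > 0 and shared / union >= min_jaccard

def bridge(a: Event, b: Event) -> Event:
    """Merge two events and record every language the survivor now spans."""
    languages = sorted(set(a.languages_covered) | set(b.languages_covered))
    return Event(a.iab_tier, a.country, a.netlocs | b.netlocs, languages)

en = Event("News", "FR", {"reuters.com", "afp.com", "bbc.co.uk"}, ["en"])
fr = Event("News", "FR", {"reuters.com", "afp.com", "lemonde.fr"}, ["fr"])
merged = bridge(en, fr) if can_bridge(en, fr) else None
```

Because `languages_covered` is stored on the surviving record, a count such as "covered in 7 languages" is just the length of that list.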
A 7-day sliding window over yesterday's and previous days' events drives transitive story-chain linking. We score combined similarity (0.7 dense Hamming + 0.3 sparse Bloom Jaccard) against every predecessor and inherit the chain when it clears 0.65. Each event ends up with novelty_score (1 minus the best predecessor similarity), story_id (inherited from the chain head, or freshly minted), story_first_seen, story_span_days, story_position (1, 2, 3, ... down the chain) and story_predecessor_id. This is what the front end uses to mark events as new vs. ongoing, and to render multi-day timelines.
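The scoring and inheritance step might look like the following sketch. It assumes dense similarity is 1 minus the normalised Hamming distance and that the Bloom Jaccard is taken over set bits; the record layout, the `story-<event_id>` minting scheme and the toy 8-bit codes are hypothetical.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def dense_similarity(a: int, b: int, bits: int) -> float:
    """1 minus normalised Hamming distance (assumed definition)."""
    return 1.0 - popcount(a ^ b) / bits

def bloom_jaccard(a: int, b: int) -> float:
    """Jaccard index over the set bits of two Bloom signatures."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def combined_similarity(event: dict, pred: dict, bits: int) -> float:
    return (0.7 * dense_similarity(event["dense"], pred["dense"], bits)
            + 0.3 * bloom_jaccard(event["bloom"], pred["bloom"]))

def link_story(event: dict, predecessors: list, bits: int, threshold: float = 0.65) -> dict:
    best, best_sim = None, 0.0
    for pred in predecessors:
        sim = combined_similarity(event, pred, bits)
        if sim > best_sim:
            best, best_sim = pred, sim
    linked = best is not None and best_sim >= threshold
    return {
        "novelty_score": 1.0 - best_sim,
        # Inheriting the predecessor's story_id is what makes chains transitive.
        "story_id": best["story_id"] if linked else "story-" + event["event_id"],
        "story_position": best["story_position"] + 1 if linked else 1,
        "story_predecessor_id": best["event_id"] if linked else None,
    }

today = {"event_id": "e3", "dense": 0b11110000, "bloom": 0b00111100}
window = [
    {"event_id": "e1", "story_id": "story-e0", "story_position": 2,
     "dense": 0b11110001, "bloom": 0b00111100},
    {"event_id": "e2", "story_id": "story-e2", "story_position": 1,
     "dense": 0b00001111, "bloom": 0b11000011},
]
linked = link_story(today, window, bits=8)
fresh = link_story(today, [], bits=8)
```

Here `e3` matches `e1` with a combined similarity of 0.9125, so it joins `story-e0` at position 3; with an empty window it would start a new story with a novelty score of 1.0.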
Stage 5 makes one Gemini call per event. The payload is kept deliberately compact: top representative documents (capped at ~1,200 chars each), the spaCy entities_tagged map supplied as constraints, and structured numeric metadata. The LLM's job is cleanup and structuring, not generation: it returns title, description, IAB labels, sentiment, materiality, time_horizon, sectors, key_entities (must be drawn from entities_tagged), primary_location (must be in the GPE bucket), and secondary_locations. Lat / lng resolution is the LLM's job. Pydantic validation rejects any output that drifts from the entity constraint, with up to three retry-with-feedback rounds.
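The retry-with-feedback loop can be illustrated without any dependencies. The sketch below uses plain-Python checks as a stand-in for the Pydantic validators, and a canned `fake_llm` in place of the Gemini call; both substitutions, the function names, and the reading of "three rounds" as one initial attempt plus three retries are assumptions.

```python
def constraint_errors(output: dict, entities_tagged: dict) -> list:
    """Return human-readable violations of the entity constraints."""
    allowed = set().union(*entities_tagged.values())
    errors = [f"key_entity {name!r} is not in entities_tagged"
              for name in output.get("key_entities", []) if name not in allowed]
    location = output.get("primary_location")
    if location not in entities_tagged.get("GPE", set()):
        errors.append(f"primary_location {location!r} is not in the GPE bucket")
    return errors

def enrich_with_retries(call_llm, payload: dict, entities_tagged: dict, max_retries: int = 3) -> dict:
    feedback = []
    for _ in range(1 + max_retries):
        output = call_llm(payload, feedback)  # feedback is empty on the first call
        feedback = constraint_errors(output, entities_tagged)
        if not feedback:
            return output
    raise ValueError(f"output still violates constraints: {feedback}")

entities_tagged = {"ORG": {"Apple"}, "PERSON": {"Tim Cook"}, "GPE": {"Cupertino"}}
canned = [
    {"key_entities": ["Apple", "Microsoft"], "primary_location": "Cupertino"},
    {"key_entities": ["Apple"], "primary_location": "Cupertino"},
]
calls = []

def fake_llm(payload, feedback):
    """Hypothetical stand-in for the model: replays canned responses."""
    calls.append(list(feedback))
    return canned[len(calls) - 1]

result = enrich_with_retries(fake_llm, {"docs": []}, entities_tagged)
```

The first canned response names an entity that NER never saw, so it is rejected and the error text is passed back on the second call, which then succeeds.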
Stage 6 classifies every enriched event against all 13 v1 taxonomies using one OpenAI text-embedding-3-large vector over the event title and description. The vector scores against precomputed taxonomy caches via cosine similarity, and the highest-scoring label is written into the ontology block. There is no LLM in the classification path.
Every artefact is written to a .tmp file and atomically renamed into place: events.ipc (loc -> event_id + score), events.ndjson (the full record with LLM enrichment + story chain + multi-ontology classification; the single source of truth for the day), and events_repr.ipc (per-event dense + Bloom representations cached for the next day's Stage 4). The NOSIBLE World frontend reads events.ndjson directly and generates all markdown downloads on the fly.
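The write-then-rename pattern is short enough to show in full. This is a minimal sketch for the NDJSON artefact only; the function name and the `fsync` before the rename are assumptions about how one might implement it, not a description of make_events.py.

```python
import json
import os
import tempfile

def write_ndjson_atomic(path: str, records: list) -> None:
    """Write records to path + '.tmp', then rename over the final path."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before renaming
    # os.replace overwrites the destination; the rename is atomic on POSIX
    # when source and destination are on the same filesystem.
    os.replace(tmp_path, path)

out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "events.ndjson")
write_ndjson_atomic(target, [{"event_id": 1, "title": "example"}])
```

A reader that opens events.ndjson therefore sees either the previous complete file or the new complete file, never a half-written one.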
Every event flows through Stage 6 against thirteen taxonomies in v1 today; two more (Wikidata QIDs and C2PA Content Credentials) are staged for v1.1. Each ontology answers a question the others cannot.
The newsroom-grade hierarchy that desks have used for fifty years. Gives every event a stable topical address ("politics > election > campaign finance") that survives renaming and translation.
Distinguishes what kind of newsroom output we are looking at. A press release, a live blog, an obituary, and an analysis piece all describe an event differently; topical taxonomies cannot capture that texture.
Anchors event entities to a global, cross-lingual identifier graph. Lets analysts pivot from "Apple Inc." (Q312) to every other event that names the same entity, regardless of language or surface form.
The sector taxonomy every equity desk in the world already uses. Lets the platform talk to portfolio managers in the language of their book without a translation layer.
GICS tells you which sector a company sits in; this tells you what happened to it. Earnings beats, layoffs, acquisitions, CEO changes, product launches, buybacks, settlements, regulatory probes - the verb of the story.
The academic standard for coding political events at scale. Built for conflict analysts and political scientists. Captures the action verb of geopolitical news in a way IPTC cannot.
The reference taxonomy used by every humanitarian agency. Distinguishes natural-disaster types in a way reinsurers, NGOs and governments all already index against.
The industry-standard cyber adversary framework. When a breach hits the news, every SOC reads it through ATT&CK. Lets us tag cyber stories in the vocabulary defenders actually use to triage.
The global standard for classifying disease and health events. Bridges health-news coverage to the same vocabulary that ministries of health, the WHO, and clinical research already use.
Sports stories are distinct enough that lumping them into IPTC Media Topics buries the structure. SportsML separates the discipline from the event-type (final, qualifier, transfer, suspension).
The web's default event vocabulary. Makes cultural, entertainment and civic events (concerts, festivals, screenings, parades, summits) addressable in the same vocabulary search engines and calendars use.
Two outlets can cover the same event under different frames - one as an "economic consequences" story, another as a "morality" story. Topic taxonomies miss this. Frames make it surfaceable.
Sentiment alone (positive/negative/neutral) is too thin. Ekman's six capture the emotional texture of coverage at a level that survives across languages and cultures.
A signal of media authenticity, not of subject. Tracks whether the images and video attached to a story carry verifiable provenance, which matters more every cycle as synthetic media gets cheaper.
Not for editorial use - for ad-tech and brand-safety alignment. Lets buyers and agencies map NOSIBLE coverage to the same content codes their DSPs and brand-safety vendors already speak.
A single-ontology system forces every event into one frame. A 15-ontology system surfaces the orthogonal dimensions of every event so analysts can pivot between them. The same Apple earnings call is an Information Technology / Technology Hardware, Storage & Peripherals story (GICS), an Earnings beat (NOSIBLE corp events), a press-release-driven Analysis (IPTC Genre), an Economic-frame piece (Media Frames), and depending on the day, a Joy or Surprise tilt in tone (Ekman). Every coordinate is independent. Every coordinate is queryable.
The classifier achieves this without paying an LLM per event per taxonomy. Every classification resolves via deterministic cosine matching: one OpenAI title-plus-description embedding scores the event against taxonomy label caches, required taxonomies take the best match, and nullable taxonomies emit null when the best cosine falls below the configured floor. There is no LLM in the classification path.
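The matching rule can be sketched in pure Python. The toy three-dimensional vectors, the label names and the 0.6 floor below are made up for illustration; in the pipeline the vectors would come from the embedding model and the precomputed taxonomy caches.

```python
import math

def cosine(u: list, v: list) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def classify(event_vec: list, label_cache: dict, floor=None) -> dict:
    """Pick the best-scoring label; nullable taxonomies pass a floor."""
    label, score = max(
        ((name, cosine(event_vec, vec)) for name, vec in label_cache.items()),
        key=lambda pair: pair[1],
    )
    if floor is not None and score < floor:
        label = None  # nullable taxonomy: best match is too weak to report
    return {"label": label, "score": score}  # the score is always kept

cache = {"earnings": [0.9, 0.1, 0.0], "layoffs": [0.0, 1.0, 0.0]}
event_vec = [0.5, 0.5, 0.7]

required = classify(event_vec, cache)             # always returns a label
nullable = classify(event_vec, cache, floor=0.6)  # best cosine is about 0.555
```

One embedding per event is reused against every taxonomy cache, so adding a taxonomy adds cosine comparisons rather than model calls.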
Every match carries its cosine score so the front end can show the classification strength transparently. No black box.