Drug Dictionary Design for West African Markets

Marcus Aurelius wrote that you have power over your mind, not outside events. In data engineering, I have come to understand this principle through a very specific lens: you cannot control how a pharmacist in Kano records a drug name. But you can build a system that handles whatever they write.

That is what a drug dictionary does. Not just list medicines. Decide what things are called, how they relate to each other, and what happens when the same product arrives under twenty different names across forty different data sources. In most pharmaceutical markets, someone else has already made those decisions. There are national drug codes, standardised INN databases, GS1 barcodes, WHO ATC classifications. Analysts build on top of these.

In Nigeria and Ghana, that foundation is incomplete. The drug dictionary is not a secondary problem you solve before the real work begins. It is the real work. Get it wrong, and everything built on top of it is unreliable. Get it right, and you have infrastructure that the entire market can use for years.

Understanding why the problem exists

The challenges in West African drug data are not random quality failures. They are predictable structural features of how pharmaceutical products move through these markets, and building a dictionary that works requires understanding them clearly.

Parallel importation means the same product can enter Nigeria through a licensed importer, through a parallel import channel, and through a free trade zone, each time acquiring different batch numbers and sometimes different names on secondary packaging. A dictionary that does not account for this treats three records of the same product as three distinct products, silently undermining every analysis that depends on it.

Registry gaps mean that the NAFDAC registry in Nigeria and the Food and Drugs Authority (FDA) registry in Ghana, both essential resources I have used extensively, do not offer the complete, machine-readable coverage that serious analytics requires. You cannot build a reliable drug dictionary by joining to the registry and calling the work done. The registry is where you start, not where you finish.

Name proliferation means a single active ingredient can appear as its INN, as a brand name that has become the common term, as a distributor-specific shorthand, as an abbreviation, and in combinations with strength and dosage form packed into one field. I have seen the same molecule appear in over a hundred distinct surface forms within a dataset of 70,000 records. The dictionary has to know they are all the same thing.
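To make the "everything packed into one field" problem concrete, here is a minimal sketch of splitting a raw name field into name, strength, and dosage form. The function name `split_name_field`, the unit list, and the small form vocabulary are illustrative assumptions, not a complete parser; combination strengths such as "500mg/125mg" would need more work.

```python
import re

# Assumed, deliberately small vocabularies for illustration only.
STRENGTH_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|iu)\b")
FORM_WORDS = {
    "tab": "tablet", "tabs": "tablet", "tablet": "tablet", "tablets": "tablet",
    "cap": "capsule", "caps": "capsule", "capsule": "capsule", "capsules": "capsule",
    "syr": "syrup", "syrup": "syrup",
}


def split_name_field(raw):
    """Split a packed name field into name, strength, and dosage form."""
    text = raw.lower()
    strength = None
    found = STRENGTH_RE.search(text)
    if found:
        # Keep only the first strength found; remove it from the remaining text.
        strength = f"{found.group(1)} {found.group(2)}"
        text = text[:found.start()] + " " + text[found.end():]
    tokens = re.findall(r"[a-z]+", text)
    form = next((FORM_WORDS[t] for t in tokens if t in FORM_WORDS), None)
    name = " ".join(t for t in tokens if t not in FORM_WORDS)
    return {"name": name, "strength": strength, "dosage_form": form}


print(split_name_field("PARACETAMOL 500MG TABS"))
print(split_name_field("Amoxicillin caps. 250 mg"))
```

Both inputs reduce to a bare ingredient name plus structured strength and form, which is the shape the rest of the dictionary needs.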

The dictionary must be built for the market as it actually operates, not for the market as it would operate in ideal regulatory conditions.

The architecture that works

After working through this problem across multiple datasets and markets, I think about drug dictionary design in four layers, each handling a different aspect of the harmonisation challenge.

The canonical record is the foundation. One stable entry for each distinct product, carrying a unique identifier, the active constituent INN, the ATC classification, the dosage form, the strength, and the manufacturer. This is what everything else maps to. It must be versioned, source-agnostic, and never edited lightly.
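As a rough illustration, the canonical record could be modelled as an immutable data class whose fields follow the list above. The class name `CanonicalProduct`, the `revise` helper, the identifier format, and the manufacturer name are hypothetical; the point is only that a change produces a new version rather than an in-place edit.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CanonicalProduct:
    """One stable entry per distinct product (field names are assumptions)."""
    product_id: str    # unique identifier, independent of any data source
    inn: str           # INN of the active constituent
    atc_code: str      # WHO ATC classification
    dosage_form: str
    strength: str
    manufacturer: str
    version: int = 1


def revise(product, **changes):
    """Return a new version of the record instead of mutating the old one."""
    return replace(product, version=product.version + 1, **changes)


original = CanonicalProduct(
    product_id="P0001",
    inn="paracetamol",
    atc_code="N02BE01",
    dosage_form="tablet",
    strength="500 mg",
    manufacturer="Example Pharma Ltd",  # hypothetical manufacturer
)
updated = revise(original, manufacturer="Example Pharma Plc")
print(original.version, updated.version)
```

Freezing the class means an accidental assignment raises an error, which is one simple way to enforce "never edited lightly".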

The alias table is the mechanism. Every name variant that legitimately refers to a canonical product lives here: brand names, abbreviations, strength-embedded names, distributor labels. When incoming records arrive with whatever naming convention they carry, the alias table resolves them to the canonical product. It is never finished. It grows as new variants are encountered, and that is by design.
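A minimal sketch of the alias table as a mapping from name variants to canonical identifiers follows. The class `AliasTable`, its methods, and the sample aliases are assumptions for illustration; a production version would more likely sit in a database table with its own version history.

```python
class AliasTable:
    """Maps every known name variant to one canonical product identifier."""

    def __init__(self):
        self._aliases = {}

    def add(self, alias, product_id):
        """Register a variant; refuse to silently repoint an existing alias."""
        existing = self._aliases.get(alias)
        if existing is not None and existing != product_id:
            raise ValueError(
                f"{alias!r} already maps to {existing}, not {product_id}"
            )
        self._aliases[alias] = product_id

    def resolve(self, raw_name):
        """Return the canonical identifier, or None if the variant is unknown."""
        return self._aliases.get(raw_name)

    def __len__(self):
        return len(self._aliases)


table = AliasTable()
table.add("PCM 500MG TAB", "P0001")         # hypothetical distributor shorthand
table.add("Paracetamol tabs 500", "P0001")  # hypothetical strength-embedded name
print(table.resolve("PCM 500MG TAB"))
print(table.resolve("something never seen before"))
```

Returning `None` for unknown variants, rather than guessing, leaves the decision to the later matching stages and keeps the table itself a record of confirmed mappings only.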

The matching pipeline handles what a direct alias lookup cannot. Every incoming record runs through a cascade in sequence: exact match against the alias table first, then normalised string match with punctuation stripped and case unified, then fuzzy matching using token similarity, then phonetic matching, then a manual review queue for what remains. Each stage resolves what it can and passes the rest forward.
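The cascade can be sketched with the standard library alone. Everything here is an assumption made for illustration: the function names, the Jaccard token similarity, the 0.6 threshold, and especially the phonetic key, which is a crude Soundex-style grouping of consonants rather than a full Soundex or Metaphone implementation.

```python
import re

# Soundex-style consonant groups (a crude stand-in for a real phonetic algorithm).
_SOUND_GROUPS = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556")


def normalise(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def token_similarity(a, b):
    """Jaccard similarity over the token sets of two normalised names."""
    ta, tb = set(normalise(a).split()), set(normalise(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def phonetic_key(name):
    """First letter plus collapsed consonant-group codes of the rest."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    if not letters:
        return ""
    digits = re.sub(r"[^1-6]", "", letters[1:].translate(_SOUND_GROUPS))
    return letters[0] + re.sub(r"(\d)\1+", r"\1", digits)


def match(raw, aliases, fuzzy_threshold=0.6):
    """Return (product_id, stage); unresolved names go to manual review."""
    if raw in aliases:                                   # stage 1: exact
        return aliases[raw], "exact"
    by_norm = {normalise(a): pid for a, pid in aliases.items()}
    if normalise(raw) in by_norm:                        # stage 2: normalised
        return by_norm[normalise(raw)], "normalised"
    best = max(aliases, key=lambda a: token_similarity(raw, a), default=None)
    if best is not None and token_similarity(raw, best) >= fuzzy_threshold:
        return aliases[best], "fuzzy"                    # stage 3: token similarity
    key = phonetic_key(raw)
    for alias, pid in aliases.items():                   # stage 4: phonetic
        if key and phonetic_key(alias) == key:
            return pid, "phonetic"
    return None, "manual_review"                         # stage 5: human queue


aliases = {"Paracetamol 500mg Tablet": "P0001"}  # hypothetical alias table
for raw in ["Paracetamol 500mg Tablet", "PARACETAMOL-500MG, TABLET",
            "Tablet Paracetamol 500mg BP", "Parasetamol 500 mg tablet",
            "Amoxicillin 250mg capsule"]:
    print(raw, "->", match(raw, aliases))
```

The later stages are progressively looser, so a phonetic hit deserves less trust than a normalised one. That ordering is why each result carries the name of the stage that produced it.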

Confidence scoring and audit records complete the system. Every match carries a score and a record of how it was resolved, which stage processed it, what the matched string was, and when the decision was made. In a system built incrementally from imperfect source data, the ability to review, correct, and retrace matching decisions is what separates a trustworthy dictionary from an opaque one.
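One possible shape for the audit record is sketched below. The per-stage confidence values and the 0.7 review threshold are placeholder assumptions, not calibrated figures, and the names `MatchAudit`, `record_match`, and `needs_review` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Placeholder scores per resolving stage; real values should be calibrated
# against manually verified matches.
STAGE_CONFIDENCE = {
    "exact": 1.0,
    "normalised": 0.95,
    "fuzzy": 0.75,
    "phonetic": 0.6,
    "manual_review": 0.0,
}


@dataclass(frozen=True)
class MatchAudit:
    """How one incoming name was resolved, kept for later review."""
    raw_name: str
    product_id: Optional[str]
    stage: str
    matched_string: Optional[str]
    confidence: float
    decided_at: datetime


def record_match(raw_name, product_id, stage, matched_string):
    """Build an audit record stamped with the current UTC time."""
    return MatchAudit(
        raw_name=raw_name,
        product_id=product_id,
        stage=stage,
        matched_string=matched_string,
        confidence=STAGE_CONFIDENCE[stage],
        decided_at=datetime.now(timezone.utc),
    )


def needs_review(audit, threshold=0.7):
    """Flag low-confidence decisions for a data steward to check."""
    return audit.confidence < threshold


audit = record_match("Parasetamol 500 mg tablet", "P0001", "phonetic",
                     "Paracetamol 500mg Tablet")
print(audit.stage, audit.confidence, needs_review(audit))
```

Storing the matched string alongside the stage means a reviewer can see exactly which alias was responsible if a mapping later turns out to be wrong.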

One principle worth holding firmly

The alias table should be the single place where name variation is resolved. Never resolve names ad hoc inside analysis scripts. That creates inconsistencies that compound quietly across studies until something breaks in a way that is very difficult to diagnose. Centralise the logic, version it, and let every downstream pipeline inherit it.

Barcodes: the long game worth playing

GS1 barcodes offer the cleanest possible product identifier. A barcode is unambiguous in a way that a product name can never be. If every pharmaceutical product in Nigeria and Ghana were consistently assigned a barcode and scanned at point of sale, most of the name harmonisation problem would disappear at the source.

Barcode adoption in West African pharmaceutical retail is partial today. Building toward barcode-anchored product identification is the right long-term direction, and I believe it will come. In the short and medium term, the architecture above has to carry the load. The two are not in conflict. The dictionary built now becomes easier to maintain once barcodes are more consistently present.
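A sketch of how the two approaches can coexist: use the barcode when a valid one is present and known, and fall back to name resolution otherwise. The check uses the standard GS1 mod-10 check digit. The record layout, the index names, and the sample GTIN (a synthetic number, not a real product) are assumptions.

```python
def is_valid_gtin(code):
    """Check the length (8, 12, 13 or 14 digits) and the GS1 mod-10 check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    # Counting from the right, the check digit has weight 1 and the
    # weights then alternate 3, 1, 3, ... A valid code sums to a multiple of 10.
    total = sum(
        int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(reversed(code))
    )
    return total % 10 == 0


def resolve(record, gtin_index, alias_index):
    """Prefer the barcode; fall back to the alias table when it is missing."""
    gtin = (record.get("gtin") or "").strip()
    if is_valid_gtin(gtin) and gtin in gtin_index:
        return gtin_index[gtin], "barcode"
    name = record.get("name", "")
    if name in alias_index:
        return alias_index[name], "alias"
    return None, "unresolved"


gtin_index = {"1234567890128": "P0001"}   # synthetic GTIN for illustration
alias_index = {"PCM 500MG TAB": "P0001"}  # hypothetical alias

print(resolve({"gtin": "1234567890128", "name": "???"}, gtin_index, alias_index))
print(resolve({"gtin": None, "name": "PCM 500MG TAB"}, gtin_index, alias_index))
print(resolve({"gtin": "1234567890123", "name": "unknown"}, gtin_index, alias_index))
```

Because both paths return the same canonical identifier, records captured before and after barcode adoption remain comparable.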

Maintenance is the real commitment

A drug dictionary is not a project with a completion date. New products enter the market. Formulations change. Manufacturers are acquired. Brand names are retired. The alias table keeps growing. The canonical records need updating.

This means dictionary maintenance must be treated as an operational function with ownership, a defined update process, and tooling that allows pharmacists or data stewards to validate new mappings before they are committed to the canonical layer. The organisations that invest in this will see the return compound over time. Every new dataset that joins the pipeline becomes more useful because the dictionary already knows how to read it.
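The validation step could be as simple as a proposal that stays out of the live alias table until a reviewer approves it. This is a minimal sketch; `ProposedAlias`, `approve`, `reject`, and the status values are hypothetical names, and a real tool would also persist the review history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedAlias:
    """A suggested mapping waiting for a pharmacist or data steward."""
    alias: str
    product_id: str
    proposed_by: str
    status: str = "pending"
    reviewed_by: Optional[str] = None


def approve(proposal, reviewer, alias_index):
    """Commit a pending proposal to the live alias index."""
    if proposal.status != "pending":
        raise ValueError(f"proposal is already {proposal.status}")
    alias_index[proposal.alias] = proposal.product_id
    proposal.status = "approved"
    proposal.reviewed_by = reviewer


def reject(proposal, reviewer):
    """Close a pending proposal without touching the alias index."""
    if proposal.status != "pending":
        raise ValueError(f"proposal is already {proposal.status}")
    proposal.status = "rejected"
    proposal.reviewed_by = reviewer


live_aliases = {}
good = ProposedAlias("PCM 500MG TAB", "P0001", proposed_by="ingest-job")
bad = ProposedAlias("PCM", "P0002", proposed_by="ingest-job")

approve(good, "pharmacist_a", live_aliases)
reject(bad, "pharmacist_a")
print(live_aliases, good.status, bad.status)
```

Only approved proposals ever reach the live index, so downstream pipelines never see a mapping that no one has checked.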

Learning is not a pastime but a strategic imperative. That belief drives how I approach this problem. The drug dictionary challenge in West Africa is hard and specific to our context. It will not be solved by importing a framework designed for a different market. It will be solved by practitioners who are willing to build honestly from where we are, not from where we wish we were.

The infrastructure built today will determine what questions can be asked about African pharmaceutical markets for years to come. I am committed to getting it right.

Olayinka Akerekan

Pharmacist and data engineer working at the intersection of pharmaceutical science and analytics across sub-Saharan Africa. B.Pharm, University of Ibadan. Based in Lagos, Nigeria.
