Address matching and standardization: a practical guide
Addresses are the hardest field to match. Abbreviations, unit numbers, directionals, and international formats make exact matching useless. Here's how to handle them.
Addresses look simple. They’re not. They’re the single hardest field type to match reliably, and they appear in nearly every record matching project.
The core problem: there are dozens of valid ways to write the same physical address, and they differ at every level — abbreviations, punctuation, component ordering, unit designation, directional prefixes, and formatting conventions. Two addresses can look completely different as strings and refer to the same building, the same floor, the same mailbox.
Why addresses are uniquely difficult
Consider this address: 123 North Main Street, Suite 200, San Francisco, CA 94102
Here are eleven ways it might appear in real datasets:
- 123 N Main St Ste 200 San Francisco CA 94102
- 123 N. Main St., Ste. 200, San Francisco, CA 94102
- 123 North Main Street #200, SF, CA 94102
- 123 N Main, Suite 200, San Francisco, California 94102
- 123 N. MAIN STREET STE 200 SAN FRANCISCO CA 94102-3456
- 123 N Main St, Unit 200, San Francisco CA
- 123 No. Main St., S. 200, San Fran., CA 94102
- 123 North Main, 2nd Floor, San Francisco, CA
- Suite 200, 123 N Main St, San Francisco, CA 94102
- 123 N Main Street Suite 200
- 123 N Main St Apt 200 San Francisco CA 94102
Every single one of these is a valid representation of the same location. An exact string match between any pair returns false. Even sophisticated fuzzy matching struggles because the strings can differ by 30-40% of their characters while being semantically identical.
The abbreviation problem
The USPS maintains a list of standard abbreviations for street suffixes, directional prefixes, and secondary unit designators. Most data sources don’t follow them consistently.
| Component type | Standard form | Common variations in data |
|---|---|---|
| Street suffix | ST | Street, Str, Str., Strt |
| Street suffix | AVE | Avenue, Av, Av., Aven |
| Street suffix | BLVD | Boulevard, Blv, Boul, Bvd |
| Street suffix | DR | Drive, Driv, Dr. |
| Street suffix | LN | Lane, La, Ln. |
| Street suffix | CT | Court, Crt, Ct. |
| Directional | N | North, No, No., Nor |
| Directional | NW | Northwest, North West, N.W. |
| Unit designator | STE | Suite, Ste., S., St (ambiguous) |
| Unit designator | APT | Apartment, Apt., Ap, Unit |
| Unit designator | # | No., Num, Number, Unit |
| City | ST LOUIS | Saint Louis, St. Louis |
The USPS Publication 28 lists over 200 standard street suffix abbreviations.
The ambiguity runs deep. ST is both the abbreviation for Street and for Saint. DR is both Drive and Doctor. N can mean North or be part of a name (N Street in Washington, DC is a real street, not North Street). Context determines meaning, and automated parsers don’t always get it right.
Practical normalization steps
You don’t need USPS CASS certification for most matching jobs. A practical normalization pipeline handles 80-90% of address variation.
Step 1: Case and punctuation normalization
Lowercase everything. Remove periods, commas, and hyphens that aren’t part of unit numbers. This alone resolves N. vs N vs n. and St. vs St vs st.
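A minimal sketch of this step in Python, assuming US-style addresses. The digit-aware hyphen rule is one way to keep ZIP+4 codes and hyphenated unit numbers intact while stripping hyphens elsewhere:

```python
import re

def normalize_text(address: str) -> str:
    """Lowercase and strip punctuation that doesn't carry meaning.

    Hyphens between digits (e.g. 94102-3456) are kept; all other
    periods, commas, and hyphens are removed.
    """
    s = address.lower()
    s = re.sub(r"[.,]", "", s)                # drop periods and commas
    s = re.sub(r"(?<!\d)-|-(?!\d)", " ", s)   # drop hyphens not between digits
    s = re.sub(r"\s+", " ", s).strip()        # collapse runs of whitespace
    return s
```

After this pass, `N.`, `N`, and `n` all compare equal, as do `St.`, `St`, and `st`.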
Step 2: Abbreviation expansion
Expand all abbreviations to their full form. St becomes street. Ave becomes avenue. N becomes north. Ste becomes suite. Apt becomes apartment.
Expansion is safer than contraction. There’s less ambiguity in the full word. And once everything is expanded, you have consistent strings to compare.
Handle the St/Saint ambiguity explicitly: if St appears before the street name (e.g., St Louis), it’s Saint. If it appears after (e.g., 123 Main St), it’s Street.
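A sketch of expansion with that St/Saint rule applied, using a deliberately tiny sample of the USPS mappings (the full table has 200+ suffix entries). It assumes the input has already been through the case and punctuation pass above:

```python
# Small sample of USPS-style mappings; a production table is much larger.
EXPANSIONS = {
    "ave": "avenue", "blvd": "boulevard", "dr": "drive",
    "ln": "lane", "ct": "court",
    "n": "north", "e": "east", "w": "west",
    "ste": "suite", "apt": "apartment",
}

def expand_abbreviations(address: str) -> str:
    """Expand abbreviations token by token on a normalized address string."""
    tokens = address.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok == "st":
            # Heuristic from the rule above: "st" after a word ("123 main st")
            # is Street; "st" before the name ("st louis") is Saint.
            prev_is_word = i > 0 and tokens[i - 1].isalpha()
            out.append("street" if prev_is_word else "saint")
        else:
            out.append(EXPANSIONS.get(tok, tok))
    return " ".join(out)
```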
Step 3: Parse into components
Split the address into structured fields:
- Street number: 123
- Directional: north
- Street name: main
- Street suffix: street
- Unit designator: suite
- Unit number: 200
- City: san francisco
- State: ca
- ZIP: 94102
Parsing is where most address normalization goes from “pretty good” to “reliable.” Once parsed, you can compare each component independently rather than comparing the full address string.
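One way to sketch the parsing step is a regular expression over already-expanded strings. This pattern only covers the common US layout shown above; real parsers (for example, the `usaddress` library) use probabilistic models to handle the long tail:

```python
import re

# Covers "number [directional] name suffix [unit-designator unit] city state zip"
# on lowercased, expanded input. A sketch, not a general-purpose parser.
PATTERN = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?P<directional>north|south|east|west)?\s*"
    r"(?P<name>.+?)\s+"
    r"(?P<suffix>street|avenue|boulevard|drive|lane|court)"
    r"(?:\s+(?P<designator>suite|apartment|unit|#)\s*(?P<unit>\w+))?"
    r"\s+(?P<city>.+?)\s+(?P<state>[a-z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?$"
)

def parse_address(expanded):
    """Return a dict of components, or None if the layout doesn't match."""
    m = PATTERN.match(expanded)
    return m.groupdict() if m else None
```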
Step 4: Component-level comparison
With parsed addresses, match using rules tailored to each component:
- Street number: Exact match. `123` must equal `123`.
- Street name: Fuzzy match. Handles remaining variation (`Martin Luther King Jr` vs `MLK`).
- Unit number: Exact match after normalization. `200` must equal `200`.
- City: Fuzzy match. `San Francisco` vs `SF` vs `San Fran`, or use a lookup table for known city abbreviations.
- State: Exact match after expanding to abbreviation. `California` = `CA`.
- ZIP: Exact match on first 5 digits. Ignore ZIP+4 variation.
This component-level approach handles the fundamental problem: two addresses that look 40% different as strings are actually identical in every component.
Common edge cases
Even with good normalization, certain address patterns cause persistent problems.
Honorific street names. Martin Luther King Jr Boulevard appears as MLK Blvd, M L King Jr Blvd, ML King Boulevard, Martin Luther King Drive (is it the same street with the wrong suffix, or a different street?). For these, a lookup table of known aliases is more reliable than fuzzy matching.
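The alias approach is just a dictionary lookup before comparison. The table below is hypothetical; in practice it would be built from local street data or accumulated match-review decisions:

```python
# Hypothetical alias table mapping known variants to a canonical name.
STREET_ALIASES = {
    "mlk": "martin luther king jr",
    "m l king jr": "martin luther king jr",
    "ml king": "martin luther king jr",
}

def canonical_street_name(name):
    """Resolve a known alias to its canonical street name; pass through otherwise."""
    key = name.lower().strip()
    return STREET_ALIASES.get(key, key)
```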
Numbered streets. 1st Street, First Street, 1 Street, and 1st St all appear in data. Ordinal-to-cardinal conversion (1st to 1, Second to 2) standardizes these, but introduces ambiguity with numbered avenues in grid cities where 1st Street and 1st Avenue are different locations.
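Ordinal-to-cardinal conversion can be sketched as follows (covering spelled-out ordinals up to tenth). Note it normalizes only the street name and leaves the suffix alone, so `1st Street` and `1st Avenue` remain distinct:

```python
import re

ORDINAL_WORDS = {"first": "1", "second": "2", "third": "3", "fourth": "4",
                 "fifth": "5", "sixth": "6", "seventh": "7", "eighth": "8",
                 "ninth": "9", "tenth": "10"}

def normalize_ordinals(name):
    """Map '1st'/'First' style street-name tokens to a bare number."""
    tokens = []
    for tok in name.lower().split():
        tok = re.sub(r"^(\d+)(st|nd|rd|th)$", r"\1", tok)  # 1st -> 1
        tokens.append(ORDINAL_WORDS.get(tok, tok))          # first -> 1
    return " ".join(tokens)
```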
Building names vs. addresses. One World Trade Center and 1 WTC and 285 Fulton Street all refer to the same building. Building name resolution requires a lookup table — no string matching algorithm will connect these.
PO Boxes and rural routes. PO Box 142 has no geographic specificity without a ZIP code. RR 3 Box 42 (rural route) follows a different format entirely. These can only be matched on their own terms — PO Box number + ZIP, or route + box number.
Apartment vs. suite vs. unit. These mean different things in different contexts, but in practice they’re used interchangeably in data. Apt 200, Suite 200, Unit 200, and #200 should all match.
When string matching is enough
For many matching projects, normalized string comparison handles addresses well enough. Specifically:
Same-source data. If both datasets come from the same upstream system (e.g., two exports from the same CRM at different times), formatting is consistent and normalization + exact matching works.
High-quality data. If both datasets are commercially standardized (e.g., CASS-certified mailing lists), abbreviations are already consistent and the main variation is unit number formatting.
Addresses are not the primary matching field. If you’re matching on name + email + address, and the address is a secondary confirmation field rather than the primary discriminator, approximate string matching is fine. You don’t need perfect address matching when two other fields already establish high confidence.
When you need more than string matching
Cross-source matching. Two datasets from different vendors, different countries, or different eras of data entry. Formatting conventions differ systematically.
Address is the primary identifier. Property records, delivery logistics, real estate — where the address is the entity being matched, not a supporting field. Here, component-level parsing is necessary.
International addresses. Address formats vary dramatically by country. Japanese addresses use block numbers instead of street names. German addresses put the house number after the street name. UK postcodes follow a different format than US ZIP codes. For international matching, you need format-aware parsing for each country.
Geocoding. When two addresses refer to the same location but look nothing alike (building name vs. street address, old street name vs. new street name), geocoding converts both to latitude/longitude coordinates and compares the coordinates. If two addresses are within 50 meters of each other, they’re probably the same building regardless of what the strings say.
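Once both addresses are geocoded, the comparison reduces to a great-circle distance check. A sketch using the standard haversine formula and the 50-meter tolerance mentioned above:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def same_building(coord_a, coord_b, tolerance_m=50):
    """True if two geocoded points fall within the distance tolerance."""
    return haversine_m(*coord_a, *coord_b) <= tolerance_m
```

The tolerance is a tuning knob: too tight and rooftop-vs-entrance geocodes of the same building fail to match; too loose and adjacent buildings merge.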
The jump from raw strings (54%) to normalized strings (71%) is the cheapest improvement — basic text processing with no external dependencies. Component-based matching adds another 15 points by eliminating format-dependent comparison failures. Geocoding handles the remaining edge cases where the strings are genuinely different representations of the same location.
For most projects, component-based matching (86%) is the practical ceiling without investing in geocoding infrastructure. That’s good enough for the majority of matching use cases.
USPS CASS certification
CASS (Coding Accuracy Support System) is the USPS standard for address validation and standardization. CASS-certified tools validate addresses against the USPS database, correct errors, and output standardized addresses with ZIP+4 codes and delivery point barcodes.
When it’s worth it. Mass mailings (required for postal discounts), address-as-primary-key matching, and any application where you need to verify that an address actually exists and is deliverable.
When it’s overkill. Record matching where the address is one of several comparison fields and you just need to determine if two addresses are “probably the same.” CASS certification costs per address and requires a licensed tool. For matching purposes, normalization and component parsing are usually sufficient.
Putting it together with Match Data Studio
Match Data Studio’s transformation pipeline handles address normalization automatically. When the AI assistant detects address fields, it configures:
- Case and punctuation normalization
- Abbreviation expansion using USPS standard mappings
- Component parsing where field structure permits
- Fuzzy comparison for the street name component
- Exact comparison for street number, unit number, state, and ZIP
For addresses that remain unmatched after string-based comparison, the embedding similarity layer captures semantic equivalence that string operations miss — MLK Boulevard and Martin Luther King Jr Blvd embed to similar vectors even though their string similarity is low.
Address matching is a solvable problem. It just requires treating addresses as structured data, not opaque strings. Upload your datasets to Match Data Studio and see how the pipeline handles your address fields.
Keep reading
- Data cleaning before matching — general prep steps that apply to every dataset
- Fuzzy matching algorithms explained — the string-distance methods behind address comparison
- Reconciling property owner names for HOA and tax billing — a real-world case where address matching is critical