TypeThoughts Blog by Pratyush

Fingerprinting names

We source many public documents that contain names. The problem is that a simple name like MAPL Industries Limited can be written in many different ways:

- M A P L Industries Limited
- MAPL Industries Ltd
- M.A.P.L Industries Ltd
- MAPL Ind. Ltd.

All of these should be treated as the same entity. We created a fingerprinting algorithm that generates the same string for all of the above inputs. The fingerprint makes it easy to store, search and match these names in a database.

It handles:
- abbreviations such as Ind, dev and corp
- common prefixes such as Mr, Mrs and Ms, and common suffixes such as Ltd, Pvt and Co
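
To make the idea concrete, here is a minimal sketch that collapses the MAPL spellings from the list above into one key. The function name `toy_fingerprint` and its tiny rule set (two suffixes, one abbreviation) are assumptions made for illustration only; they are not the rule set used in this post's actual code.

```python
import re

# Hypothetical, deliberately tiny rule set: just enough for the MAPL example.
DROP = re.compile(r"\b(?:limited|ltd)\b")
ABBREVIATE = re.compile(r"\bindustries\b")


def toy_fingerprint(name):
    """Collapse spelling variants of one name into a single lookup key."""
    key = re.sub(r"[\W_]+", " ", name.lower())  # punctuation -> single spaces
    key = DROP.sub("", key)                     # drop legal-form words
    key = ABBREVIATE.sub("ind", key)            # long form -> abbreviation
    return key.replace(" ", "")                 # join whatever is left


variants = [
    "MAPL Industries Limited",
    "M A P L Industries Limited",
    "MAPL Industries Ltd",
    "M.A.P.L Industries Ltd",
    "MAPL Ind. Ltd.",
]
fingerprints = {toy_fingerprint(v) for v in variants}
print(fingerprints)  # {'maplind'}
```

Because every variant maps to the same string, matching two names becomes a plain equality check rather than a similarity score.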

The full code is given at the end of this post.

Update
There are some good algorithms for fuzzy comparison. However, they compute similarity at query time, and we needed a way to do such searches at the database level. Santhosh's soundex library for Indian languages is a wonderful way to do that. We didn't use it because we needed exact matches and didn't want any false positives.

import re


# \W does not match underscores, so "_" is listed explicitly.
NON_WORD = re.compile(r"[\W_]+")


def get_fingerprint(name):
    """
    Strips non-alphanumeric characters and common prefixes and suffixes,
    and abbreviates common words, so that variant spellings share one key:
    Motilal Oswal Services -> motilaloswalservices
    Motilal Oswal -> motilaloswal
    """
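    # Lowercase, then turn every run of punctuation/whitespace into one space.
    # Strip afterwards so anchored patterns such as "^the " still match names
    # that start with a quote or bracket.
    name = NON_WORD.sub(" ", name.lower()).strip()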

    removals = [
        r"^the ",
        r" and ",
        r"^mr ",
        r"^mrs ",
        r"^ms ",
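        # public private limited company
        # Whole words are spelled out: a character class such as
        # "co[mpany]+" also matches unrelated words like "copy".
        r"\bp ltd\b",
        r"\blimited\b",
        r"\bltd\b",
        r"\bpvt\b",
        r"\bprivate\b",
        r"\bpublic\b",
        r"\bco\b",
        r"\bcompany\b",
        r"\bplc\b",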
    ]
    for removal in removals:
        name = re.sub(removal, "", name)

    # Long forms are spelled out in full: a character class such as
    # "int[ernational]+" also rewrites unrelated words like "intel".
    replacements = {
        r"\bcorporation\b": "corp",
        r"\bdevelopments?\b": "dev",
        r"\bdevelopers?\b": "dev",
        r"\binvestments?\b": "inv",
        r"\bindia\b": "ind",
        r"\bindustries\b": "ind",
        r"\bindustrial\b": "ind",
        r"\binternational\b": "intl",
    }
    for pattern, replacement in replacements.items():
        name = re.sub(pattern, replacement, name)

    # join everything
    name = name.replace(" ", "")
    return name
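
The database-level matching mentioned in the update can be sketched roughly as follows, using Python's built-in `sqlite3` module. The table name `companies`, its columns, the index name and the helper `find_by_name` are all hypothetical, and `stand_in_fingerprint` is a reduced stand-in for `get_fingerprint` so that the block runs on its own; treat it as one possible setup, not a description of our production schema.

```python
import re
import sqlite3


def stand_in_fingerprint(name):
    # Hypothetical stand-in for get_fingerprint: it only lowercases,
    # drops "ltd"/"limited" and removes non-alphanumeric characters.
    name = re.sub(r"[\W_]+", " ", name.lower())
    name = re.sub(r"\b(?:limited|ltd)\b", "", name)
    return name.replace(" ", "")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, fingerprint TEXT)")
# An ordinary index is enough because lookups are exact string matches.
conn.execute("CREATE INDEX idx_companies_fingerprint ON companies (fingerprint)")

for name in ["Motilal Oswal Services Ltd", "MAPL Industries Limited"]:
    conn.execute(
        "INSERT INTO companies (name, fingerprint) VALUES (?, ?)",
        (name, stand_in_fingerprint(name)),
    )


def find_by_name(query):
    """Return stored names whose fingerprint equals the query's fingerprint."""
    rows = conn.execute(
        "SELECT name FROM companies WHERE fingerprint = ?",
        (stand_in_fingerprint(query),),
    )
    return [row[0] for row in rows]


print(find_by_name("M.A.P.L. Industries Ltd"))  # ['MAPL Industries Limited']
print(find_by_name("MAPL Infra Ltd"))           # []
```

The fingerprint is computed once when a row is written and once per query, so the database only ever compares strings for equality. That is what rules out false positives, at the cost of missing spellings the rules do not cover.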