TypeThoughts Blog by Pratyush

Fingerprinting names

We source many public documents with “names”. The problem with names is that we can write a simple name like MAPL Industries Limited in many different ways:

- M A P L Industries Limited
- MAPL Industries Ltd
- M.A.P.L Industries Ltd
- MAPL Ind. Ltd.

These should all be treated as the same entity. We created a fingerprinting algorithm that generates the same string for all of the above inputs. This fingerprint makes it easy to store, search and match these names in a database.

It handles:
- abbreviations such as Ind, dev and corp
- common prefixes such as Mr, Mrs and Ms, and suffixes such as Ltd and Pvt

Code is given below.

Update
There are some good algorithms for fuzzy comparison. However, these compute the similarity at query time, one pair of names at a time. We needed a way to run such searches at the database level. Santhosh's soundex library for Indian languages is a wonderful option, but we didn't use it because we needed exact matches and didn't want any false positives.

import re


NON_WORD = re.compile(r"\W+")


def get_fingerprint(name):
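    """
    Return a normalised key for a person or company name.

    Lowercases the name, strips non-alphanumeric characters, drops common
    prefixes and suffixes, and collapses known abbreviations.
    Motilal Oswal Services -> motilaloswalservices
    Motilal Oswal -> motilaloswal
    """
    name = name.replace("\n", " ").strip().lower()
    # Turn every run of punctuation or whitespace into a single space.
    # Trim the left side so the "^"-anchored prefixes below still match
    # names that begin with punctuation, e.g. "(The) ...".
    name = NON_WORD.sub(" ", name).lstrip()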

    removals = [
        r"^the ",
        r" and ",
        r"^mr ",
        r"^mrs ",
        r"^ms ",
        # public private limited company
        r"\bp ltd\b",
        r"\blim[ited]+\b",
        r"\bltd\b",
        r"\bpvt\b",
        r"\bprivate\b",
        r"\bpublic\b",
        r"\bco\b",
        r"\bco[mpany]+\b",
        r"\bplc\b",
    ]
    for removal in removals:
        name = re.sub(removal, "", name)

    replacements = {
        r"\bcorp[oration]+\b": "corp",
        r"\bdev[elopment]+\b": "dev",
        r"\bdev[lopers]+\b": "dev",
        r"\binv[estments]+\b": "inv",
        r"\bind[ia]+\b": "ind",
        r"\bind[ustries]+\b": "ind",
        r"\bind[ustrial]+\b": "ind",
        r"\bint[ernational]+\b": "intl",
    }
    for pattern, replacement in replacements.items():
        name = re.sub(pattern, replacement, name)

    # join everything
    name = name.replace(" ", "")
    return name
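
To show what "searching at the database level" can look like, here is a minimal sketch using SQLite. It is not our production setup: the table layout, the `find_company` helper and the cut-down `simple_fingerprint` function (a stand-in for `get_fingerprint` that covers only a few of its rules) are all assumptions made for the example.

```python
import re
import sqlite3

NON_WORD = re.compile(r"\W+")


def simple_fingerprint(name):
    """Cut-down stand-in for get_fingerprint, covering only a few rules."""
    name = NON_WORD.sub(" ", name.lower())
    # Drop a handful of company suffixes.
    name = re.sub(r"\b(limited|ltd|pvt|private)\b", "", name)
    # Collapse one abbreviation family.
    name = re.sub(r"\bindustries\b", "ind", name)
    return name.replace(" ", "")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (display_name TEXT, fingerprint TEXT)")
# An index on the fingerprint column lets lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_companies_fingerprint ON companies (fingerprint)")

# Store the fingerprint next to the name when the row is written.
name = "MAPL Industries Limited"
conn.execute(
    "INSERT INTO companies (display_name, fingerprint) VALUES (?, ?)",
    (name, simple_fingerprint(name)),
)


def find_company(conn, query):
    """Exact-match lookup on the stored fingerprint (hypothetical helper)."""
    row = conn.execute(
        "SELECT display_name FROM companies WHERE fingerprint = ?",
        (simple_fingerprint(query),),
    ).fetchone()
    return row[0] if row else None


variants = [
    "M A P L Industries Limited",
    "MAPL Industries Ltd",
    "M.A.P.L Industries Ltd",
    "MAPL Ind. Ltd.",
]
# Every spelling resolves to the same stored row.
matches = [find_company(conn, v) for v in variants]
```

Because the comparison is a plain equality on an indexed column, the database does the matching and there is no similarity score to tune. A name that differs in substance produces a different key and simply returns nothing.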

Using cloud-init to set up servers quickly

We recently migrated the servers of our website Screener.in. One of the things that helped us complete the migration in 2 hours was the ease of setting up the server. We used cloud-init to automate the setup.

Cloud-init runs a user-supplied script or configuration when a virtual machine is first created. We use it to automate these things:
- set up a new user and add SSH keys
- set the timezone of the machine
- install postfix
- install docker and docker-compose
- set up directories which deploy the project on "git push"
- customise login messages
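
As a rough illustration of the list above (this is not our actual script), the sketch below builds a `#cloud-config` document in Python. The `build_user_data` helper, the `deploy` user, the placeholder SSH key, the Ubuntu package names and the repository path are all assumptions for the example. It relies on cloud-init reading the config as YAML and on JSON being parseable as YAML, so the standard library is enough to serialise it. The post-receive hook that does the real deploy is left out.

```python
import json


def build_user_data(username, ssh_keys, timezone, packages, motd):
    """Return cloud-init user data as a '#cloud-config' document."""
    config = {
        # Create a login user with the given public keys.
        "users": [
            {
                "name": username,
                "groups": "sudo",
                "shell": "/bin/bash",
                "ssh_authorized_keys": ssh_keys,
            }
        ],
        "timezone": timezone,
        # Refresh the package index, then install the listed packages.
        "package_update": True,
        "packages": packages,
        # Custom login message.
        "write_files": [{"path": "/etc/motd", "content": motd}],
        # Bare repository, owned by the new user, that "git push" can target.
        "runcmd": [
            ["sudo", "-u", username, "git", "init", "--bare",
             f"/home/{username}/app.git"]
        ],
    }
    # The first line must be "#cloud-config" for cloud-init to treat the
    # rest as a config document rather than a shell script.
    return "#cloud-config\n" + json.dumps(config, indent=2) + "\n"


user_data = build_user_data(
    username="deploy",  # hypothetical user name
    ssh_keys=["ssh-ed25519 AAAA... deploy@example"],  # placeholder key
    timezone="Asia/Kolkata",
    packages=["postfix", "docker.io", "docker-compose"],  # assumed package names
    motd="Provisioned by cloud-init\n",
)
print(user_data)
```

The printed text is what you would paste into the "user data" field when creating the machine. Keeping it in a small function makes it easy to reuse the same setup for the next project with a different user or package list.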

The script comes in very handy whenever we start a new project. It also comes in handy when we want to restart a project on a new droplet.

Cloud-init ships with Ubuntu's cloud server images. It is supported on almost all hosting platforms, including AWS, DigitalOcean and Vultr.

You can find the complete script here.

A fresh start

I have been blogging on Fully-Faltoo.com. It was more a scrapbook than a blog.

Over the years I have tried various blogging platforms, from WordPress to Tumblr to Pelican. Now I am trying out my own simple blog in Django.

## Why start my own?
  1. Pelican takes a long time to go from draft to publish.
  2. Uploading pictures is tough.
  3. Upgrading requires a lot of time spent on APIs.
  4. Bookmarklet posting rocks.
  5. Ease of creating custom links and pages.
  6. Own commenting system.
  7. Django is easy to set up and manage.
  8. Start fresh.

Let's hope this works!