Build your own Resume Parser Using Python and NLP

Author: Umut Gokbayrak
24 October 2020

Let’s start by making one thing clear: a resume is a brief summary of your skills and experience, usually one or two pages long, while a CV is a longer, more detailed account of what the applicant is capable of doing. With that distinction out of the way, let’s dive into building a parser tool using Python and basic natural language processing techniques.

Resume

Resumes are a great example of unstructured data. Since there is no widely accepted resume layout, each resume has its own style of formatting; text blocks and even category titles vary a lot. I don't even need to mention how big a challenge it is to parse multilingual resumes.

One common misconception about building a resume parser is that it is an easy task. It is not.

Let’s just talk about predicting the applicant’s name from the resume.

There are millions of person names around the world, varying from Björk Guðmundsdóttir to 毛泽东, from Наина Иосифовна to Nguyễn Tấn Dũng. Many cultures are accustomed to using middle initials such as Maria J. Sampson while some cultures use prefixes extensively such as Ms. Maria Brown. Trying to build a database of names is a desperate effort because you’ll never keep up with it.

So, with an understanding of how complex this task is, shall we start building our own resume parser?

Understanding the tools

We’ll use Python 3 for its wide range of readily available libraries and its general acceptance in the data science field.

We’ll also be using nltk for NLP (natural language processing) tasks such as stop word filtering and tokenization, and docx2txt and pdfminer.six for extracting text from MS Word and PDF formats.

We assume you already have Python 3 and pip3 on your system, and are possibly using the marvels of virtualenv; we won’t get into the details of installing those. We also assume you are running on a POSIX-based system such as Linux (Debian-based) or macOS.
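For completeness, here is one way to set up an isolated environment with Python’s built-in venv module (the directory name .venv is just my own convention, not something the article requires):

```shell
# create an isolated environment for this project
python3 -m venv .venv

# activate it (bash/zsh syntax)
. .venv/bin/activate

# from this point on, pip installs packages into .venv only
```

Every pip command in the rest of the article can then be run inside this environment.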

Converting resumes into plain text

Extracting text from PDF files

Let’s start by extracting text from PDF files with pdfminer.six. You can install it using the pip3 (Python Package Installer) utility or compile it from source (not recommended). Using pip, it is as simple as running the following at the command prompt.

pip install pdfminer.six

Using pdfminer, you can easily extract text from PDF files with the following code.

# example_01.py

from pdfminer.high_level import extract_text


def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)


if __name__ == '__main__':
    print(extract_text_from_pdf('./resume.pdf'))  # noqa: T001

Pretty simple, right? PDF is very popular among resumes, but some people prefer the docx and doc formats. Let’s move on to extracting text from those formats as well.

Extracting text from docx files

In order to extract text from docx files, the procedure is pretty similar to what we’ve done for PDF files. Let’s install the required dependency (docx2txt) using pip and then write some code to do the actual work.

pip install docx2txt

And the code is as follows:

# example_02.py

import docx2txt


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None


if __name__ == '__main__':
    print(extract_text_from_docx('./resume.docx'))  # noqa: T001

So simple. But a problem arises when we try to extract text from old-fashioned doc files, which the docx2txt package does not handle correctly, so we’ll use a trick to extract text from them. Read on.

Extracting text from doc files

In order to extract text from doc files, we’ll use the neat but powerful catdoc command line tool.

catdoc reads an MS Word file and prints readable ASCII text to stdout, just like the Unix cat command. We’ll install it using the apt tool, since we’re running on Ubuntu Linux; you can use your favorite package manager instead, or build the utility from source.

apt-get update
apt-get install -y catdoc

When it’s ready, we can write the code, which spawns a subprocess, captures its stdout into a string variable, and returns it, as we did for the pdf and docx file formats.

# example_03.py

import subprocess  # noqa: S404
import sys


def doc_to_text_catdoc(file_path):
    try:
        process = subprocess.Popen(  # noqa: S607,S603
            ['catdoc', '-w', file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()

    return (stdout.strip(), stderr.strip())


if __name__ == '__main__':
    text, err = doc_to_text_catdoc('./resume-word.doc')

    if err:
        print(err)  # noqa: T001
        sys.exit(2)

    print(text)  # noqa: T001

Now that we’ve got the resume in text format, we can start extracting specific fields from it.

Extracting fields from resumes

Extracting names from resumes

This may seem easy, but in reality one of the most challenging tasks of resume parsing is extracting the person’s name. There are millions of names around the world and, living in a globalized world, we may receive a resume from anywhere.

This is where natural language processing comes into play. Let’s start with installing a new library called nltk (Natural Language Toolkit), which is quite popular for such tasks.

pip install nltk
pip install numpy # (also required by nltk, for running the following code)

Now it’s time to write some code to test what nltk’s “named entity recognition” (NER) functionality is capable of doing.

# example_04.py

import docx2txt
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None


def extract_names(txt):
    person_names = []

    for sent in nltk.sent_tokenize(txt):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                person_names.append(
                    ' '.join(leaf[0] for leaf in chunk.leaves())
                )

    return person_names


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    names = extract_names(text)

    if names:
        print(names[0])  # noqa: T001

Admittedly, nltk’s person name detection algorithm is far from perfect. Run this code and see whether it works for you. If it doesn’t, you may try the NER model from Stanford University; a detailed tutorial is available at Listendata.
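If you want a cheap fallback when NER misfires, one heuristic (my own addition, not part of nltk) exploits the fact that most resumes open with the applicant’s name: scan only the first few lines for two or more consecutive capitalized words. Since the regex below assumes Latin-script names, it inherits exactly the limitations discussed earlier.

```python
# a naive header-based name heuristic (a sketch, not the article's method)
import re

# two or more capitalized words, optionally with a middle initial
NAME_LINE_REG = re.compile(r'^([A-Z][a-z]+(?:\s+[A-Z]\.?)?(?:\s+[A-Z][a-z]+)+)')


def guess_name_from_header(resume_text, max_lines=5):
    # resumes usually start with the applicant's name, so only
    # look at the first few lines of the document
    for line in resume_text.splitlines()[:max_lines]:
        match = NAME_LINE_REG.match(line.strip())
        if match:
            return match.group(1)
    return None


if __name__ == '__main__':
    sample = 'Maria J. Sampson\nSoftware Engineer\nBoston, MA'
    print(guess_name_from_header(sample))  # noqa: T001
```

Such a heuristic can serve as a tie-breaker when nltk returns several PERSON candidates, or as a last resort when it returns none.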

Extracting phone numbers from resumes

Unlike extracting person names from resumes, phone numbers are much easier to deal with. Generally speaking, using a simple regex will do fine for most cases. Try the following code to extract phone numbers from a resume. You may choose to modify the regex to your taste.

# example_05.py

import re
import subprocess  # noqa: S404

PHONE_REG = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')


def doc_to_text_catdoc(file_path):
    try:
        process = subprocess.Popen(  # noqa: S607,S603
            ['catdoc', '-w', file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()

    return (stdout.strip(), stderr.strip())


def extract_phone_number(resume_text):
    phone = re.findall(PHONE_REG, resume_text)

    if phone:
        number = ''.join(phone[0])

        if resume_text.find(number) >= 0 and len(number) < 16:
            return number
    return None


if __name__ == '__main__':
    text, _err = doc_to_text_catdoc('./resume-word.doc')
    phone_number = extract_phone_number(text)

    print(phone_number)  # noqa: T001
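As a small follow-up (my own addition, not part of the example above), the separators can be stripped from the matched number so that the stored value is uniform no matter how the applicant formatted it:

```python
import re


def normalize_phone_number(raw_number):
    """Strip spaces, dots, dashes and parentheses from an extracted number."""
    digits = re.sub(r'\D', '', raw_number)
    # preserve the leading '+' used for international prefixes
    if raw_number.strip().startswith('+'):
        return '+' + digits
    return digits


if __name__ == '__main__':
    print(normalize_phone_number('+1 (541) 754-3010'))  # noqa: T001
```

Storing numbers in this canonical form makes deduplication and lookups across resumes much simpler.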

Extracting email addresses from resumes

Similar to phone number extraction, this is also pretty straightforward. Just fire up a regular expression and extract the email addresses from the resume. The first one to appear is generally the applicant’s actual email address, since people tend to place their contact details in the header section of their resumes.

# example_06.py

import re

from pdfminer.high_level import extract_text

EMAIL_REG = re.compile(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+')


def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)


def extract_emails(resume_text):
    return re.findall(EMAIL_REG, resume_text)


if __name__ == '__main__':
    text = extract_text_from_pdf('resume.pdf')
    emails = extract_emails(text)

    if emails:
        print(emails[0])  # noqa: T001

Extracting skills from the resumes

Well, so far so good. This is the section where things get trickier. Extracting skills from text is a very challenging task, and to increase accuracy you’ll need a database or an API to verify whether a piece of text is a skill or not.

Check out the following code. It first uses the nltk library to filter out the stop words and generate tokens.

# example_07.py

import docx2txt
import nltk

nltk.download('stopwords')

# you may read the database from a csv file or some other database
SKILLS_DB = [
    'machine learning',
    'data science',
    'python',
    'word',
    'excel',
    'english',
]


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None


def extract_skills(input_text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(input_text)

    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]

    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]

    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))

    # we create a set to keep the results in.
    found_skills = set()

    # we search for each token in our skills database
    for token in filtered_tokens:
        if token.lower() in SKILLS_DB:
            found_skills.add(token)

    # we search for each bigram and trigram in our skills database
    for ngram in bigrams_trigrams:
        if ngram.lower() in SKILLS_DB:
            found_skills.add(ngram)

    return found_skills


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    skills = extract_skills(text)

    print(skills)  # noqa: T001

It’s not easy to maintain an up-to-date database of skills for each and every industry. You may wish to have a look at the Skills API, which gives you an easy and affordable alternative to maintaining your own skills database. The Skills API features 70,000+ skills, well organized and updated often. Just check the following code and see how easy extracting skills from a resume becomes when the Skills API is used.

Before proceeding, you’ll first need to install a new dependency called requests, again using the pip tool.

pip install requests

Below is the source code using the Skills API.

# example_08.py

import docx2txt
import nltk
import requests

nltk.download('stopwords')


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None


def skill_exists(skill):
    url = f'https://api.promptapi.com/skills?q={skill}&count=1'
    headers = {'apikey': 'YOUR API KEY'}
    response = requests.request('GET', url, headers=headers)
    result = response.json()

    if response.status_code == 200:
        return len(result) > 0 and result[0].lower() == skill.lower()
    raise Exception(result.get('message'))


def extract_skills(input_text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(input_text)

    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]

    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]

    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))

    # we create a set to keep the results in.
    found_skills = set()

    # we search for each token in our skills database
    for token in filtered_tokens:
        if skill_exists(token.lower()):
            found_skills.add(token)

    # we search for each bigram and trigram in our skills database
    for ngram in bigrams_trigrams:
        if skill_exists(ngram.lower()):
            found_skills.add(ngram)

    return found_skills


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    skills = extract_skills(text)

    print(skills)  # noqa: T001

Extracting education and schools from resumes

If you’ve understood the principles of skills extraction, you’ll be more comfortable with the education and school extraction topic.

Not surprisingly, there are many ways of doing it.

First, you can use a database that contains (nearly) all the school names from around the world; no such list is ever truly complete.

Here is a database we’ve collected of the names of 25,673 schools from around the world. Feel free to use it as you wish: https://assets.promptapi.com/blog/resume_parser/schools.csv
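If you go this route, a minimal sketch of loading such a list into a set for fast lookups might look like the following. The column layout is an assumption on my part (one school name in the first column), and a tiny inline sample stands in for the downloaded file, so adjust the reader to the real CSV.

```python
import csv
import io

# inline stand-in for schools.csv; in practice, open the downloaded file
SAMPLE_CSV = (
    'Massachusetts Institute of Technology\n'
    'University of Oxford\n'
)


def load_school_names(csv_file):
    reader = csv.reader(csv_file)
    # assumes the school name sits in the first column of each row
    return {row[0].strip().lower() for row in reader if row}


if __name__ == '__main__':
    schools = load_school_names(io.StringIO(SAMPLE_CSV))
    print(len(schools))  # noqa: T001
```

Lowercasing at load time lets you match case-insensitively against text extracted from resumes.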

You may use our school names database to train your own NER model using spaCy or any other NLP framework, but we’ll follow a much simpler way: in the following code, we look for words such as “university”, “college”, etc. in named entities labeled as organizations. Believe it or not, this performs quite well for most cases. You may enrich the RESERVED_WORDS list in the code as you wish.

Similar to person name extraction, we’ll run nltk’s named entity chunker over the text, store all the entities labeled as organizations in a list, and then check whether each one contains a reserved word.

Check the following code, as it speaks for itself.

# example_09.py

import docx2txt
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


RESERVED_WORDS = [
    'school',
    'college',
    'univers',
    'academy',
    'faculty',
    'institute',
    'faculdades',
    'Schola',
    'schule',
    'lise',
    'lyceum',
    'lycee',
    'polytechnic',
    'kolej',
    'ünivers',
    'okul',
]


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None


def extract_education(input_text):
    organizations = []

    # first get all the organization names using nltk
    for sent in nltk.sent_tokenize(input_text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'ORGANIZATION':
                organizations.append(' '.join(c[0] for c in chunk.leaves()))

    # we search for each bigram and trigram for reserved words
    # (college, university etc...)
    education = set()
    for org in organizations:
        for word in RESERVED_WORDS:
            if org.lower().find(word) >= 0:
                education.add(org)

    return education


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    education_information = extract_education(text)

    print(education_information)  # noqa: T001
Last words

Resume parsing is tricky; there are hundreds of ways of doing it. We’ve covered just one easy approach, so don’t expect miracles: it may work well for some layouts and poorly for others.

If you need a professional solution, have a look at our hosted Resume Parser API. It is well maintained and supported by its API provider, which also maintains the Skills API. It is pre-trained on thousands of different resume layouts and is the most affordable solution on the market.

A free tier is available, and no credit card is asked for during registration. Just check whether it suits your needs. Feel free to leave comments below; any contribution is appreciated.
