Text can be beautiful

How visualisation can uncover hidden patterns in text data

Josh Taylor, Apr 15

Modern Slavery Act filings: We will be building an interactive visualisation to uncover key trends and insights from a series of text documents. Words at the top right are common across the two industries analysed. Words at the bottom left are uncommon across the industries.

It is the words at the top left and bottom right which show the key differences between the two industries in their approach to combating Modern Slavery.

The internet is littered with articles on Natural Language Processing. Many of these describe how to build classification systems, perform topic modelling or answer questions. Far fewer address more general analysis of text corpora, and fewer still show how to build outputs in a visual, interactive and easy-to-understand format.

In the real world, finding ways to understand how and why text differs across a data set is often far more useful. Take customer reviews: whilst it may be useful to predict a review rating based on its content, it is often far more valuable to understand why customers are providing positive or negative reviews for a particular product or service.

This can be used to create more effective marketing, inform future product development and even design customer support processes tailored to issues which are being identified through product reviews. Can a predictive model do all of this? No, it can’t.

In this post we will analyse thousands of companies’ Modern Slavery returns to understand what commitments businesses are making to prevent Modern Slavery. We will then use advanced visualisation techniques to determine how different industries are responding to the risk of Modern Slavery, both internally and in their supply chains.


An important digression: Modern Slavery

Shellfish Gathering is just one of many industries with a high inherent risk of Modern Slavery

If you have read my previous posts on Medium, you may have picked up on the underlying data being used. This relates to Modern Slavery returns submitted by several thousand UK and international companies since 2015. Why use this data? What follows is an important, if sobering, digression into Modern Slavery and the measures put in place to prevent it.

Modern Slavery is a huge issue. It is estimated that it costs the taxpayer £4.3bn a year in the UK alone. As a crime, it has been placed second only to homicide in terms of harm to its victims and society.

“The exploitation and enslaving of men, women and children across the world and within the UK is one of the most shocking crimes and one of the most profitable.”

Baroness Butler-Sloss, Lord Justice of Appeal until 2004

It is estimated that 136,000 individuals in the UK (1 in 500) are victims of Modern Slavery and that there has been a 10-fold increase in levels since 2013. Crimes relating to Modern Slavery are abhorrent. Examples include labour exploitation, sexual exploitation, domestic servitude, organ harvesting and criminal exploitation.

Victims are controlled by force, threats, coercion, abduction, fraud and deception.

What is being done to prevent it?

The Modern Slavery Act 2015 was introduced to combat Modern Slavery in the UK. Part of the Act requires companies with a turnover of more than £36m to publish what they are doing to prevent Modern Slavery in their business and within their supply chains. As of April 2019, 8,700 published statements had been identified.

But what are companies actually doing? What are they committing to? What processes are they implementing? How are different industries approaching the issue?


Identifying company commitments with SpaCy

In order to understand what companies are actively doing and committing to doing, we will need to create an intelligent way of identifying such commitments in each Modern Slavery return.

A typical return will include a lot of non-relevant information such as background on the company and the Modern Slavery Act. Thankfully, using the SpaCy NLP library, we can filter this out with its powerful matching features.

The problem with text matching is that it can quickly become burdensome, even when using techniques like regular expressions. The issue is the number of different phrase combinations you need to consider when looking for even a simple pattern. For example, we are interested in identifying phrases which contain statements like:

"We are committed to..."

However, the phrases below would be of interest as well. How can we include these in our analysis without having to write code for every example?

"We promise to"
"We have committed to"
"We will continue to"
"[COMPANY NAME] has committed to"
"[COMPANY NAME] has implemented"
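
To make the maintenance problem concrete, here is a minimal sketch of the literal-phrase approach using Python's built-in re module. The sample sentences and the company name "Acme Ltd" are invented for illustration and do not come from the returns data.

```python
import re

# One alternation per phrasing we have thought of so far.
# Every new variant ("we pledge to", "[COMPANY NAME] has ...") needs another branch.
COMMITMENT_RE = re.compile(
    r"\bwe\s+(?:are\s+committed\s+to"
    r"|have\s+committed\s+to"
    r"|promise\s+to"
    r"|will\s+continue\s+to)\b",
    re.IGNORECASE,
)

# Hypothetical sentences, written for this example only
sentences = [
    "We are committed to paying a living wage.",
    "We will continue to audit our suppliers.",
    "Acme Ltd has implemented a whistleblowing hotline.",
]

# Keep only the sentences the hand-written pattern recognises
flagged = [s for s in sentences if COMMITMENT_RE.search(s)]
missed = [s for s in sentences if s not in flagged]

print(flagged)
print(missed)
```

The third sentence is a commitment too, but it slips through because the subject is a company name rather than "we". Covering it means yet another hand-written branch, which is the burden described above.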

Part of Speech matching

The matching engine in SpaCy allows you to use Part of Speech (POS) tags to match phrases to a specific pattern. For example, rather than searching for specific words, we could filter for a sequence of POS tags:

PRON, VERB, VERB

This matching identifies the following phrase from a snippet from a Modern Slavery return:

Even using a very simple POS filter we can identify phrases which denote commitments made by businesses in their Modern Slavery returns. The match here is highlighted in yellow.
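
The idea behind a POS-sequence filter can be sketched without SpaCy at all. The snippet below is a simplified stand-in, not SpaCy's matcher: the function name find_pos_sequences is hypothetical, and the tokens are tagged by hand (treating the auxiliary "have" as a VERB, as in the pattern above), where a real pipeline would supply the tags.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, coarse POS tag)


def find_pos_sequences(tokens: List[Token], pattern: List[str]) -> List[str]:
    """Return the text of every run of tokens whose POS tags equal `pattern`."""
    size = len(pattern)
    hits = []
    # Slide a window of len(pattern) tokens across the sentence
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            hits.append(" ".join(text for text, _ in window))
    return hits


# Hand-tagged example sentence (an assumption; a tagger would produce these tags)
tagged = [
    ("We", "PRON"), ("have", "VERB"), ("committed", "VERB"), ("to", "PART"),
    ("auditing", "VERB"), ("all", "DET"), ("suppliers", "NOUN"), (".", "PUNCT"),
]

matches = find_pos_sequences(tagged, ["PRON", "VERB", "VERB"])
print(matches)
```

Because the pattern looks at tags rather than words, "We promise to" or "We will continue" would match the same three-tag sequence without any extra rules.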

SpaCy even provides an online tool for helping to build and review the results of different rules:

SpaCy’s rule based matcher

It does not take long at all to create a set of rules which produce good results. The below code implements these rules and returns the whole sentence where a result has been identified:

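import spacy
from spacy.matcher import Matcher

# Assumption: any English pipeline with a tagger and parser will do;
# the model name here is illustrative, not taken from the original post.
nlp = spacy.load("en_core_web_sm")
matched_sents = []  # Filled in by the callback below


def collect_sents(matcher, doc, i, matches):
    """Store the sentence containing match i, with the match marked displaCy-style."""
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append mock entity for match in displaCy style to matched_sents.
    # Get the match span by offsetting the start and end of the span with the
    # start of the sentence in the doc.
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})


matcher = Matcher(nlp.vocab)
# The IN / NOT_IN pattern operators require SpaCy 2.1 or later.
# Rule 1: subject, verb (excluding hedging modals), then a verb or determiner
pattern_1 = [{'POS': {'IN': ['PROPN', 'PRON']}, 'LOWER': {'NOT_IN': ['they', 'who', 'you', 'it', 'us']}},
             {'POS': 'VERB', 'LOWER': {'NOT_IN': ['may', 'might', 'could']}},
             {'POS': {'IN': ['VERB', 'DET']}, 'LOWER': {'NOT_IN': ['a']}}]
# Rule 2: subject, verb, adjective, adposition (e.g. "We are committed to")
pattern_2 = [{'POS': {'IN': ['PROPN', 'PRON']}, 'LOWER': {'NOT_IN': ['they', 'who', 'you', 'it', 'us']}},
             {'POS': 'VERB', 'LOWER': {'NOT_IN': ['may', 'might', 'could']}},
             {'POS': 'ADJ'},
             {'POS': 'ADP'}]
# SpaCy 3.x signature. On SpaCy 2.x, call
# matcher.add("commit", collect_sents, pattern) once per pattern instead.
matcher.add("commit", [pattern_1, pattern_2], on_match=collect_sents)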
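
The offset arithmetic inside the callback is easy to get wrong, so here is a small worked example in plain Python. It is a sketch under stated assumptions: the helper name to_sentence_offsets and the sample text are hypothetical, and the character positions are found with str.index rather than coming from a SpaCy span.

```python
def to_sentence_offsets(span_start, span_end, sent_start):
    """Convert document-level character offsets to sentence-level offsets."""
    return {
        "start": span_start - sent_start,
        "end": span_end - sent_start,
        "label": "MATCH",
    }


# Hypothetical document containing three sentences
doc_text = "About us. This year we have committed to fair pay. Contact us."
sentence = "This year we have committed to fair pay."
match_text = "we have committed"

# Document-level positions (SpaCy exposes these as start_char / end_char)
sent_start = doc_text.index(sentence)
span_start = doc_text.index(match_text)
span_end = span_start + len(match_text)

ent = to_sentence_offsets(span_start, span_end, sent_start)

# Slicing the sentence with the shifted offsets recovers the match
highlighted = sentence[ent["start"]:ent["end"]]
print(highlighted)
```

Subtracting the sentence start is what lets each stored sentence be rendered on its own, with the highlight in the right place, rather than needing the whole document.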

How are different industries tackling Modern Slavery?

Now that we have a set of statements filtered for commitments and actions from company submissions, what can this tell us about how different industries are responding to Modern Slavery?

For this analysis we will use the fantastic ScatterText library developed by Jason Kessler.

This uses a simple, yet powerful approach to find key words and phrases which separate two categories of text. The results can then be output easily into an interactive visualisation.

The below code filters our modern slavery returns to two high-risk industries: Construction and Retail. It then creates a corpus of text for use in the scatter text visualisation:

import scattertext as st

# Assumes df is a pandas DataFrame of returns with 'Industry', 'Company'
# and 'clean text' columns, and nlp is a loaded SpaCy pipeline.
# Select industries to compare:
ind1 = 'Specialty Retail'
ind2 = 'Construction & Engineering'
# Filter into a new df with 3 columns: industry, company and the text
ftr = df['Industry'].isin([ind1, ind2])
df_corp = df.loc[ftr, ['Industry', 'Company', 'clean text']]
# Create a scattertext corpus from the df:
corpus = st.CorpusFromPandas(df_corp,
                             category_col='Industry',
                             text_col='clean text',
                             nlp=nlp).build()

Once this is complete, we can run the below to create an interactive scattertext output:

from IPython.display import HTML

html = st.produce_scattertext_explorer(corpus,
                                       category=ind2,
                                       category_name=ind2,
                                       not_category_name=ind1,
                                       width_in_pixels=1600)
# Save the visualisation to disk, then display it inline in the notebook
with open("MS-Visualization.html", 'wb') as f:
    f.write(html.encode('utf-8'))
HTML(html)

This produces the following output:

The output from comparing two industries' Modern Slavery returns in ScatterText

This plots the distribution of words by the two categories (in this case the Retail and Construction industries). Words at the top right are common across both categories; words at the bottom left are uncommon across both.

It is the words at the top left and bottom right which show the key differences between the two industries in their approach to combating Modern Slavery. Clicking on a word reveals where it has been used within the corpus. This is useful to find the context and the reasons why certain words and phrases occur within one industry and not another. The full output is available to download at the end of this article.
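
How a word ends up in each region of the chart can be illustrated with a few lines of plain Python. This is a deliberately simplified sketch and not ScatterText's actual scoring: it uses raw counts and a fixed threshold, and the function names, the min_count parameter and the toy sentences are all assumptions made for the example.

```python
from collections import Counter


def count_words(docs):
    """Count lower-cased, whitespace-separated tokens across a list of documents."""
    return Counter(word for doc in docs for word in doc.lower().split())


def chart_regions(docs_a, docs_b, min_count=2):
    """Label each word by the region of the chart it would roughly fall into."""
    counts_a, counts_b = count_words(docs_a), count_words(docs_b)
    regions = {}
    for word in set(counts_a) | set(counts_b):
        common_a = counts_a[word] >= min_count
        common_b = counts_b[word] >= min_count
        if common_a and common_b:
            regions[word] = "top right: common in both"
        elif not common_a and not common_b:
            regions[word] = "bottom left: uncommon in both"
        elif common_a:
            regions[word] = "off-diagonal: distinctive of A"
        else:
            regions[word] = "off-diagonal: distinctive of B"
    return regions


# Toy statements, invented for illustration (A = construction, B = retail)
construction = ["subcontractors sign the policy",
                "subcontractors attend policy training"]
retail = ["suppliers accept the policy audits",
          "suppliers policy audits continue"]

regions = chart_regions(construction, retail)
for word in ("policy", "subcontractors", "audits", "the"):
    print(word, "->", regions[word])
```

The off-diagonal words are the interesting ones: they are frequent in one industry's statements and rare in the other's, which is what the top-left and bottom-right corners of the chart surface.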

After just a few minutes of analysis, it is easy to find significant differences in the way the two industries are approaching the issue of Modern Slavery (items in bold represent the words on the chart which have been analysed):

Construction

  • The construction industry already has regulation in place regarding quality management (ISO 9001) and environmental management systems (ISO 14001). Companies are leveraging processes put in place by these standards to help combat modern slavery risks.
  • The industry is aware that subcontractors pose a risk, but little is currently being done to implement checks or controls on subcontractors.
  • It places greater emphasis on its internal workforce. Responsibility is placed with the HR department and line managers to put processes in place to reduce risk.

Retail

  • The retail industry is more externally facing in its approach, placing importance on audits performed with suppliers at high-risk locations (with India, China and Turkey often being categorised as high-risk countries).
  • In retail, more focus is placed on the supply chain and mapping beyond direct suppliers to understand what lies below the first tier of the supply network. It is clear that some companies have made more progress in this area than others.

Closing thoughts:

The value of being able to look across thousands of documents and instantly understand trends across industries is huge. It can be used to:

  • highlight best practice;
  • help to bring innovation from one industry to others, and;
  • identify where not enough is being done to prevent Modern Slavery.

Hopefully this article has been helpful in showing that some simple but powerful NLP and visualisation techniques can unlock insights that are otherwise locked within unstructured data.


Further reading

The Colab notebook below contains all the code used in this post:
Modern Slavery Analysis (Google Colaboratory, colab.research.google.com)

To view the interactive scattertext output, please see the file below:
MS-ScatterText.html, ScatterText output (drive.google.com)

Source: Towards Data Science
