In any organisation, master data management consists of the standards and tools that record and manage its critical data to provide a single point of reference.
In Atheon's terms, the organisation is a grocery retail supplier such as Johnson & Johnson or Nestle, and the critical data for these suppliers is the information about their products.
These suppliers refer to their products differently from the retailers. For example, if Nestle described one of its products as “Nescafe premium gold 50gm”, Retailer X could call it “Nescafe gold 50 grams – small” and Retailer Y could refer to the same product as “Nescafe gold coffee premiem”. The problem is that, apart from there being no single way to describe a product, there is also a human element, which means there will be mistakes, and these usually have a knock-on effect. For example, when a product has been wrongly matched, or not matched at all, and the supplier compiles a monthly report, the reported sales figures will be incorrect.
Another problem the suppliers face is that as the total number of products they supply increases, it becomes harder to manage these descriptions across retailers. The larger suppliers have to maintain over 500 products across 10 or more retailers, which means a lot of man-power goes into sorting out the mapping between these products; we believe most of this time would be better spent understanding where the actual problems in the supply chain are.
Product matching is the master data management of these product descriptions, and by providing a simple-to-use tool that manages the product descriptions for the suppliers we save them time and money.
We decided to tackle this problem by letting the suppliers manage their product master data themselves, and since the volume of products would make this task tedious for larger suppliers, we automated some of the matching process.
I’ll go into a bit of detail here about the techniques used to automate the matches, but I want to add a disclaimer: this process works better for some retailers than for others, because descriptions are sometimes shortened and mangled together, which makes them harder to expand back into the original product description.
Intro to the Data
First, let me describe what I found while looking at the supplier and retailer versions of the product descriptions.
Supplier products are generally well described but here are some peculiarities that I found:
- the brand name is shortened, especially when products have longer descriptions
- the size of the product is usually in the description and consistent
- common words are shortened but mostly still readable eg: crm for cream
- shorthand is used while shortening descriptions so special characters are present eg: p/nut for peanut
- some have pack size in the description
Retailer product descriptions vary by retailer, but have a general pattern:
- some retailers split the description into main product, flavours, pack size etc
- the brand name is usually in the original long format rather than shorthand, but there are always exceptions
- in some cases the flavours are shortened and joined together eg: Mango and apple to mngapl
- the size of the product is usually in the description and consistent
But in both versions of the description there is some really hard-to-interpret shorthand, which is best left to the supplier to identify and match.
Techniques I used: what they are, why I used them and how I used them
Soundex is a hashing system that represents English words phonetically. It converts words into alphanumeric representations, with the result that similar-sounding words have similar codes.
When humans shorten words they usually remove the vowels or shorten the word so it sounds similar to the original, which makes Soundex ideal for dealing with some parts of the product description.
soundex("creamy") => C650
soundex("crm") => C650
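As a rough illustration, here is a minimal Soundex sketch in pure Python; it follows the standard letter-to-digit mapping and is not a full reference implementation:

```python
def soundex(word):
    """Minimal Soundex sketch: keep the first letter, map the rest to
    digit codes, drop vowels, and collapse adjacent repeated codes."""
    codes = {c: d for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
        for c in letters}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0])
    for ch in word[1:]:
        code = codes.get(ch)
        if code and code != prev:
            result += str(code)
        if ch not in "hw":  # h and w do not separate same-coded letters
            prev = code
    return (result + "000")[:4]

print(soundex("creamy"), soundex("crm"))    # both C650
print(soundex("peanut"), soundex("p/nut"))  # both P530
```

Note how even the shorthand "p/nut" from earlier hashes to the same code as "peanut", since the slash simply acts as a separator.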
An n-gram is a contiguous sequence of n items in a string, eg: the 3-character n-grams of “Nescafe coffee” would be ‘nes’, ‘esc’, ‘sca’, ‘caf’, ‘afe’, ‘fe ‘, ‘e c’, ‘ co’, ‘cof’, ‘off’, ‘ffe’, ‘fee’
n-grams are useful where words are combined together or misspelled. When a word such as “premium” is misspelled as “premiem”, the n-grams give you a better match than comparing the whole words to each other.
As you can see below, 3 out of 5 n-grams match, meaning that if we used just 3-grams to compare “premium” and “premiem” we would get a 60% match; if we used 1-grams instead we would have 6 out of 7 matches.
premium => ['pre', 'rem', 'emi', 'miu', 'ium']
premiem => ['pre', 'rem', 'emi', 'mie', 'iem']
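Generating character n-grams is a one-liner; a simple sketch of the comparison above:

```python
def ngrams(text, n=3):
    """Contiguous character n-grams of a string (spaces included)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

a, b = set(ngrams("premium")), set(ngrams("premiem"))
overlap = len(a & b) / len(a)  # 3 of 5 n-grams shared => 0.6
print(overlap)
```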
Vector space model
The vector space model is a representation of text documents as vectors in a common vector space. It is generally used in information retrieval for querying documents. Another term for this would be a frequency-distribution matrix.
This model gives us a single space for representing both the supplier and retailer versions of the product descriptions, which makes it easier to compare their vectors.
To represent the products in the vector space, we first collect all the unique words from all the descriptions combined; these are the features. A product description is then represented as a combination of features, which gives us a vector. The closer two vectors are in the vector space, the better the match between their product descriptions.
If Supplier X has 3 products:
- X oven fresh cookies butter
- X oven fresh butter croissants
- X chocolate tart
If Retailer A has these products described in their systems as:
- X ovnfrsh cookies btr
- X ovnfrsh croissants
- X choclat tart
To represent these 6 product descriptions in a single vector space, I mark a word as 1 if it occurs in a description and 0 if it doesn't. The vector space for the two sets combined looks like this:

| description | x | oven | fresh | cookies | butter | croissants | chocolate | tart | ovnfrsh | btr | choclat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X oven fresh cookies butter | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| X oven fresh butter croissants | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| X chocolate tart | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| X ovnfrsh cookies btr | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| X ovnfrsh croissants | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| X choclat tart | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
Now each product is described in a uniform way, and we can easily see that if a few words were expanded out, the vectors would be closer matches to each other. This representation is one way to vectorise data; sklearn provides several very useful ways to represent the data above.
A combination of soundex and n-grams to transform the descriptions would give us better features, along with a vectorizer that can give us how important each feature is.
TF-IDF weighting of features
TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a measure we use to identify the relative importance of each word in a set of documents.
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
I used tf-idf to reduce the score of commonly occurring words like brand names without removing them, because even though they are repetitive they are not irrelevant.
I used TfidfVectorizer from sklearn to get the term-document matrix, which is similar to the table above but with a decimal score between 0 and 1 instead of just 0s and 1s. With this we get a multidimensional representation of all the products in the vector space.
Cosine similarity measure
Cosine similarity is used to measure the similarity between two non-zero vectors. The inner product of the normalised vectors gives us the cosine of the angle between them, and this angle represents how close the vectors are to each other; the cosine of 0° is 1, and it is less than 1 for any other angle.
This is a convenient measure that will give a score between 0 and 1 for any pair of vectors, effectively giving us a ratio/percentage of how well they match.
The cosine similarity metric from sklearn provides an easy way of comparing the entire set of retailer products against the entire set of supplier descriptions. I apply this metric, get back a matrix of similarities between the retailer and supplier descriptions, and can easily find the top match for each description.
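Putting the two previous steps together, a sketch of the whole comparison on the example data (fitting the vectoriser on both sets so supplier and retailer descriptions share one vector space):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

supplier = ["X oven fresh cookies butter",
            "X oven fresh butter croissants",
            "X chocolate tart"]
retailer = ["X ovnfrsh cookies btr",
            "X ovnfrsh croissants",
            "X choclat tart"]

tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf.fit(supplier + retailer)  # shared vocabulary for both sets

# rows: retailer products, columns: supplier products
similarity = cosine_similarity(tfidf.transform(retailer),
                               tfidf.transform(supplier))
best = similarity.argmax(axis=1)  # top supplier match per retailer product
print(best)  # each retailer product pairs with its supplier counterpart
```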
Precision and Recall
Precision and Recall are measures used to assess the relevance of retrieved instances.
Wikipedia defines them as,
In pattern recognition, information retrieval and binary classification, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.
We use Precision and Recall to measure the relevance of the matches made by the algorithm using the combined techniques outlined above. These measures helped me compare methods against one another until I found the best-suited parameters across all retailers.
Recall is the measure of how well the algorithm is doing in finding relevant matches and Precision is the measure of how well the algorithm is doing in rejecting non-relevant matches.
For the context of product matching, the relevant elements are the manually matched products already existing in our systems.
We assume that all the existing matches are correct, as they have been verified by the supplier. The True Positives (TP) are the matches that both the algorithm and the supplier picked; the False Negatives (FN) are the matches made by the supplier but missed by the algorithm; and the False Positives (FP) are the matches that the algorithm assumes are correct but that the supplier has not made.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
TP = matches made manually and found by the algorithm
FN = matches made manually not found by the algorithm
FP = matches made by the algorithm not done manually
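Treating both sets of matches as pairs of product identifiers, the definitions above can be sketched directly (the identifiers below are hypothetical):

```python
def precision_recall(algorithm_matches, manual_matches):
    """Precision and Recall over sets of (retailer_product, supplier_product) pairs."""
    tp = len(algorithm_matches & manual_matches)  # found by both
    fp = len(algorithm_matches - manual_matches)  # algorithm only
    fn = len(manual_matches - algorithm_matches)  # supplier only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

manual = {("r1", "s1"), ("r2", "s2"), ("r3", "s3")}  # verified by the supplier
auto = {("r1", "s1"), ("r2", "s9")}                  # made by the algorithm
precision, recall = precision_recall(auto, manual)   # 0.5 and 1/3
```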
We needed to pick the algorithm that gave us the best Recall without compromising the Precision. As the right trade-off is subjective, we decided to set a threshold and show only matches that sit above it.
To put product matching into development I needed a few questions answered:
- Do we set the threshold or let the user play with the threshold?
- If we set the threshold how high or low do we go?
- How many matches per product do we show the user?
Question 1 was an easy one: in the first version of the application, where we introduce automated matching to the user, we should give them the best matches possible and, in turn, a reason to come back. So we define the threshold, not the user.
Questions 2 and 3 needed more analysis using Precision and Recall, detailed below.
Threshold and Top matches
The metric I am using to determine the threshold is the cosine similarity score between each pair of descriptions. As a general rule, the higher this score the better the match, though there are some exceptions.
We should also keep in mind that Recall and Precision are two measures that can help us choose the best threshold: we want Recall as high as possible, so the user gets the best matches the algorithm can find, and a relatively high Precision, so the user doesn’t get too many matches marked as the best match when they are not.
I have two images below composed of 6 charts each: the top charts show Precision and Recall versus threshold, and the bottom charts show the number of True Positives versus threshold. The three blocks from left to right correspond to the number of top matches: 3, 5 or 7.
The x-axis below shows the threshold values from 0.3 to 0.9. The top chart shows the Precision and Recall calculated at each threshold value; the orange line is Recall, the blue line is Precision, and the y-axis runs roughly from 0% to 100%. The bottom chart shows grey bars giving the number of True Positive matches at each threshold value tested.
We can see from these charts that as the threshold increases, the number of matches decreases while the Precision and Recall increase. This is more visible in the matching between Retailer B and Supplier X: the jump in Precision as the threshold increases is more drastic, which shows that automated matching using the above techniques works better for some retailers than others. The ideal value for the threshold looks like it should be at least 0.7.
The next thing we need to look at is the number of top matches. As stated above, the three blocks (from left to right) in each image show the changes as we increase the number of top matches: when a product has been matched and we look at the results, we get either 3, 5 or 7 candidates, and from these top choices one product description matches the retailer description.
We can see above that as we increase the number of top matches from 3 to 5 to 7, the number of True Positives returned increases slightly, but this also has the unwanted effect of increasing the amount of work the user has to do. If you had to decide the best match between 3 long strings or 7, which would you find harder?
The number of additional matches we gain by showing more top matches is not impressive enough for us to pick a higher number here. So the ideal value for the top matches looks like 3.
Now that we have decided to show the supplier the top 3 matches above a threshold of 0.7, the user makes the decision on a much smaller set of products. By doing so we have optimised for the end user using the threshold and top matches, thereby reducing the number of product descriptions a user has to look at when matching a product.
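The shortlisting step itself is simple once the similarity matrix exists; a sketch with a hypothetical similarity matrix and our chosen parameters:

```python
import numpy as np

def shortlist(similarity, threshold=0.7, top_n=3):
    """For each retailer product (row), return up to top_n supplier
    indices whose similarity score sits above the threshold."""
    results = []
    for row in similarity:
        order = np.argsort(row)[::-1][:top_n]  # best scores first
        results.append([int(i) for i in order if row[i] >= threshold])
    return results

# Hypothetical 3 retailer x 3 supplier similarity scores
similarity = np.array([[0.95, 0.40, 0.10],
                       [0.75, 0.72, 0.68],
                       [0.20, 0.30, 0.10]])
print(shortlist(similarity))  # [[0], [0, 1], []] - last product shows no match
```

The last retailer product gets an empty shortlist, which is exactly the behaviour we want: a weak match is hidden rather than shown as a poor suggestion.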
Now for some observations. I applied the values we chose for top matches and threshold to the data and got back a set of matches. In cases where the algorithm recovered less than 100% of the manual matches, I looked for any obvious reason why the human was better at matching these products.
I wanted answers to these two questions:
Q) What kind of matches did the algorithm fail to detect (False Negatives)?
A) These are the matches that were made by the supplier but not the algorithm. I found that most of these occur because of words missing from either the supplier’s or the retailer’s descriptions. These are easier for a human with additional context (such as a person working for the supplier) to understand and match, whereas the algorithm fails without that additional information. There were also some algorithm matches that were better than the ones made by the supplier, so I was able to show some cases of human error.
Q) What kind of matches did the algorithm detect but the supplier did not (False Positives)?
A) In general, about 50% of these are actually True Positives that the supplier has not matched yet. The other 50% are mismatches for various reasons: one is different product sizes, and another is missing words, the same reason as for the False Negatives above.
As explained above, matching works better for some retailers than others, and there are challenges to improving the Precision. One way we can improve the Precision is to gather more information from the user to help create a pre-processing dictionary by:
- getting the user to confirm matches of unmatched words that are not retailer- or supplier-specific, such as measures. Eg: have a human confirm whether G and GM are the same
- getting the user to match supplier features to retailer features to help identify words that need to be treated as the same for that client. Eg: match P/Nut to Peanut
- getting the user to confirm whether words marked as similar are correct. This can help us eliminate words that should not be treated the same for the client. Eg: whether HNZ and Heinz are the same or not.
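Once confirmed by users, such a dictionary could be applied as a normalisation pass before vectorising; a small sketch with hypothetical entries:

```python
# Hypothetical pre-processing dictionary built from user confirmations:
# keys are shorthand tokens, values are the agreed canonical form
synonyms = {"gm": "g", "p/nut": "peanut", "hnz": "heinz"}

def normalise(description):
    """Replace confirmed shorthand with its canonical form before vectorising."""
    return " ".join(synonyms.get(token, token)
                    for token in description.lower().split())

print(normalise("HNZ Beans 400 GM"))  # "heinz beans 400 g"
```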
This is just one solution with a few interesting techniques, but not necessarily the best ones. There is definite potential for further tuning based on user feedback, as well as for being able to “train” and then build a product matcher, but that will come after we collect enough product-matching information from usage of the application.
For now this is the first version of Automated Product Matching used at Atheon.