This article describes the methodology behind Zero-shot, FastText, and Lemma search and explains how these three approaches differ in our platform.
- Lemma search - this is the simplest of our methods for identifying patterns in text and can also be referred to as keyword search or rule-based search.
Lemma refers to the base or ‘dictionary’ form of a word (i.e., the form you would look up in a dictionary). Lemma search looks for the base form of a word, which helps in finding variations of the same word (inflections, tenses, etc.). Example: searching for "run" would also return results for "running" and "ran".
Lemma search is essentially a way to find matches for a specific word (lemma) in a given data source, or to find the most common lemmas in that data source. It is currently used in MINE Queries for filtering, word clouds, correlation and co-occurrence charts, and Synonym Dictionaries. Lemma search is commonly combined with the Query Label functionality.
For example: if Text contains employee, agent, operator, or staff, then Query Label = “Staff”.
Lemma search is also the building block of the conditional operators “Equal any Lemmas” and “Not equal any Lemmas” available in Recoded Variables.
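The matching logic described above can be sketched in a few lines of Python. This is a minimal illustration only, not the platform's implementation: the `LEMMAS` lookup table and the function names `lemmatize`, `equals_any_lemma`, and `query_label` are hypothetical, and a real system would rely on a full lemmatizer rather than a hand-written table.

```python
import re

# Hypothetical, hand-written lemma table; a real system would use a full
# lemmatizer instead of this tiny lookup.
LEMMAS = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "employees": "employee",
    "agents": "agent",
    "operators": "operator",
}


def lemmatize(text):
    """Lowercase the text, split it into words, and map each word to its lemma."""
    words = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(word, word) for word in words]


def equals_any_lemma(text, targets):
    """Return True if any lemma found in the text is one of the target lemmas."""
    return any(lemma in targets for lemma in lemmatize(text))


def query_label(text):
    """Return the label "Staff" when a staff-related lemma is present, else None."""
    staff_lemmas = {"employee", "agent", "operator", "staff"}
    return "Staff" if equals_any_lemma(text, staff_lemmas) else None
```

With this sketch, `equals_any_lemma("She ran to the shop", {"run"})` is true because "ran" maps to the lemma "run", and `query_label("The agents were friendly")` returns "Staff". A text with none of the listed lemmas gets no label.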
- FastText - is a service used to perform supervised classification of open-text feedback, meaning the model learns from pre-labelled examples. It follows a train, (optionally) validate, and predict methodology. This is the approach used in sandsiv+’s CLASSIFY module for creating Sentiment and Topic classification models. Although this approach demands more effort from the user, it is also much more flexible: the model can be adjusted, optimized, and custom trained for increased accuracy, which can be measured.
FastText is a library for efficient learning of word representations and sentence classification.
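To make the train and validate steps more concrete, the sketch below prepares pre-labelled feedback in the plain-text format that the open-source fastText library reads in supervised mode, where each line starts with `__label__<label>` followed by the text. How the CLASSIFY module prepares data internally is not described here, so the sample texts, the 80/20 split, and the helper names `to_fasttext_line` and `train_validation_split` are assumptions made for illustration.

```python
import random


def to_fasttext_line(text, label):
    """Format one labelled example as a single '__label__<label> <text>' line."""
    clean_text = " ".join(text.split())  # collapse whitespace and line breaks
    clean_label = label.strip().replace(" ", "_")  # labels cannot contain spaces
    return f"__label__{clean_label} {clean_text}"


def train_validation_split(examples, validation_share=0.2, seed=0):
    """Shuffle labelled examples and split them into training and validation lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_share))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical pre-labelled feedback: (text, sentiment label).
examples = [
    ("The staff were very helpful", "positive"),
    ("I waited far too long", "negative"),
    ("Great store, will come again", "positive"),
    ("The checkout was\nbroken", "negative"),
    ("Friendly and quick service", "positive"),
]

train, validation = train_validation_split(examples)
train_lines = [to_fasttext_line(text, label) for text, label in train]
validation_lines = [to_fasttext_line(text, label) for text, label in validation]
```

If the lines are written to files, the fastText Python package can train on the training file with `fasttext.train_supervised`, score the held-out file with the model's `test` method, and label new text with `predict`. Keeping a validation set apart from the training data is what makes the accuracy measurable.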
- Zero-shot (TopicAI) - is a service used to automatically assign a set of labels (categories) to open text. Zero-shot is a widely used machine learning paradigm in which the underlying model has already been trained on vast amounts of data available on the internet, so it can assign labels it was never explicitly trained on. This means that the user does not have to label any data to train a model. Zero-shot is one half of what sits behind our new TopicAI module and the Data Labeling part of MINE. The other half is the concept of Topics, which generates the Topic suggestions related to a dataset.
In contrast to the FastText classification approach, the labels assigned using TopicAI are fixed, i.e., they cannot be optimized, and their accuracy cannot be automatically measured in the platform. Basically, what you see is what you get. If the user requires a custom approach to continuously improve and train the model, the TopicAI model can be used as a basis to create a FastText model, which offers the flexibility and measurable accuracy of the supervised approach.
For more information about what TopicAI and Topic Sets look like and how they work, please check this link.
In summary, lemma search finds text based on the ‘dictionary’ form of words, zero-shot learning uses large pre-trained models to label text without any labeling work from the user, and FastText gives the user full control over the pre-labelled data used for training and validation. Each approach serves a different purpose and is applied in different contexts, and all of them are offered by sandsiv+.
The following example shows how the different approaches behave on the same data:
- If I use a datasource in a MINE query and select a specific Lemma (e.g. "Store"), I get 200 results.
- If I use the same datasource in TopicAI and select the same topic (e.g. "Store"), I get only 100 results.
- This is because Lemma search and TopicAI (Zero-shot) work in different ways. Lemma search looks for the exact lemma that was selected, whereas Zero-shot looks at the context of the whole phrase. A customer could mention "Store" in a sentence such as "No one was at the store today, so I had to leave", and TopicAI will probably identify it as "Staff", given that the context is about the missing staff at the store.
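The difference in result counts can be reproduced on a toy scale. In the sketch below, the feedback texts are invented and the `topic` values are written by hand as a stand-in for what a context-aware model might assign; they are not real TopicAI output. The word match is also a simplification of lemma search, since it only checks for the exact word.

```python
import re

# Hypothetical feedback. Each "topic" is hand-assigned for illustration and
# stands in for a context-based label; it is not real TopicAI output.
feedback = [
    {"text": "The store was clean and well organised", "topic": "Store"},
    {"text": "No one was at the store today, so I had to leave", "topic": "Staff"},
    {"text": "Parking near the store is impossible", "topic": "Parking"},
    {"text": "Lovely shop layout, easy to find things", "topic": "Store"},
]


def lemma_matches(rows, lemma):
    """Rows whose text contains the given word (case-insensitive, whole word)."""
    return [
        row for row in rows
        if lemma in re.findall(r"[a-z]+", row["text"].lower())
    ]


def topic_matches(rows, topic):
    """Rows whose assigned topic equals the requested topic."""
    return [row for row in rows if row["topic"] == topic]


by_lemma = lemma_matches(feedback, "store")
by_topic = topic_matches(feedback, "Store")
```

Here the word match returns three texts, including the one about missing staff and the one about parking, while the topic filter returns two texts, one of which never uses the word "store". The two methods answer different questions, so their counts are not expected to agree.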