For your AIs only – how we test our NLU

*Read as: for your ‘eyes’ only.

Attention: this blogpost is specifically for AI experts & enthusiasts.

Big tech companies have a huge advantage when it comes to Machine Learning & AI. They have massive amounts of data, data centers and the budget to hire the most talented software engineers in the world. So how realistic is it to compete with these tech giants in the AI and Machine Learning game?

We are always focused on delivering the most accurate and cutting-edge Natural Language Understanding (NLU) models. NLU models are used by some of the largest and most innovative companies in the world for conversational interfaces such as chat, voice and e-mail. To underpin the claim that our models are best-in-class, we decided to benchmark the platform for intent classification against giants of the industry: LUIS from Microsoft, DialogFlow from Google and Watson Conversation from IBM.

For this benchmark, we have chosen French & Dutch. Our clients build chatbots and voicebots throughout Europe, where English is not necessarily the dominant language, and privacy and language-specific nuances are of great importance to them. The platform mainly uses Deep Learning models for its NLU, treating language as a sequence of words (utterances and sentences) connected to each other and processed with a Recurrent Neural Network architecture.

We focused this benchmarking effort on several key aspects. First, we wanted to understand how the platform classifies intents for clean Dutch expressions. Secondly, we benchmarked the platform against real-world chatbot and voicebot expressions used in production systems. Finally, we compared the accuracy of the platform for two languages, namely French and Dutch. This gave us a clear insight into the language-specific accuracy of the models.

Experimental Setting

Here we define the steps necessary to reproduce our results with respect to LUIS, DialogFlow and IBM Watson. We start with data pre-processing and cleaning routines, which can be reduced to the following set of actions.

For expressions:

  1. Strip the utterance (remove leading and trailing whitespace).
  2. For every utterance, replace all EOF, tab and newline characters with whitespace.
  3. If the total number of characters is >500: split the expression on whitespace and re-join words until the 500-character limit is reached (LUIS API requirement).
  4. Convert all characters in the expression to lowercase.
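As a rough illustration, the four expression-cleaning steps above could be sketched in Python as follows. The function name and the exact character class for "EOF" are our own assumptions, not part of any vendor API:

```python
import re

MAX_CHARS = 500  # LUIS API limit on utterance length

def clean_expression(text: str) -> str:
    # 1. Strip leading and trailing whitespace.
    text = text.strip()
    # 2. Replace EOF-style control characters, tabs and newlines with spaces
    #    (we assume NUL and SUB cover "EOF" here).
    text = re.sub(r"[\x00\x1a\t\r\n]", " ", text)
    # 3. If over the limit, split on whitespace and re-join words
    #    until the 500-character limit is reached.
    if len(text) > MAX_CHARS:
        out = ""
        for word in text.split():
            candidate = (out + " " + word).strip()
            if len(candidate) > MAX_CHARS:
                break
            out = candidate
        text = out
    # 4. Lowercase.
    return text.lower()
```

Truncating on word boundaries rather than mid-word keeps the last utterance token intact, which matters for intent classification.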

For intent labels:

  1. Strip the intent (remove leading and trailing whitespace).
  2. For every intent, replace all non-alphanumeric characters with underscores (Watson API requirement).
  3. Take only the first 128 characters of every intent name (Watson API requirement).
  4. Convert all characters in the intent name to lowercase.
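The intent-label normalization can be sketched in the same spirit (again, the helper name is illustrative):

```python
import re

def clean_intent(label: str) -> str:
    label = label.strip()                         # 1. strip whitespace
    label = re.sub(r"[^0-9a-zA-Z]", "_", label)   # 2. non-alphanumeric -> underscore (Watson)
    label = label[:128]                           # 3. first 128 characters (Watson)
    return label.lower()                          # 4. lowercase
```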

After these necessary steps were performed on the raw expression data we proceeded with some additional post-processing steps to ensure the integrity of the input data:

  1. Remove all duplicates.
  2. Take only non-empty expressions into account.
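These two integrity checks amount to a simple order-preserving filter over (expression, intent) pairs; a minimal sketch:

```python
def postprocess(pairs):
    """Drop duplicate pairs and empty expressions, keeping first occurrences."""
    seen, out = set(), []
    for expr, intent in pairs:
        if expr and (expr, intent) not in seen:
            seen.add((expr, intent))
            out.append((expr, intent))
    return out
```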

Training and test setup

To ensure the reproducibility of the training/test routines we define here some of the tools, techniques and methodologies used to split and evaluate the models.

  1. We used only Python and the Scikit-Learn framework to split the data and evaluate the models.
  2. We performed a stratified random 5-fold test-train split (data was shuffled) for the chatbot expressions.
  3. Where possible (e.g. every model run) we set the random seed to 123.
  4. We ran all the models with default parameters and confidence thresholds.
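Using Scikit-Learn, the split described above might look like the following sketch (the wrapper function is ours):

```python
from sklearn.model_selection import StratifiedKFold

def five_fold_splits(expressions, intents, seed=123):
    # Stratified, shuffled 5-fold test-train split with a fixed random seed,
    # so every player is evaluated on identical folds.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    yield from skf.split(expressions, intents)
```

Stratification keeps the per-intent class proportions roughly equal across folds, which matters because many intents have few expressions.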

All test predictions coming from the different test-train splits (separately for clean expressions and for production chatbot and voicebot expressions) were consolidated into one CSV file in the end. For LUIS and IBM Watson we kept polling the server until the training procedure ended. All other LUIS- and Watson-specific API requirements were met as well.
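The polling step can be sketched generically; `get_status` below is a stand-in for whatever training-status call the LUIS or Watson client exposes, and the interval and timeout values are arbitrary:

```python
import time

def wait_until_trained(get_status, poll_interval=2.0, timeout=600.0):
    """Poll a status callable until it reports that training has finished."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "trained":
            return True
        time.sleep(poll_interval)
    raise TimeoutError("training did not finish in time")
```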


We start with the analysis of clean Dutch expressions which are ideal to quickly verify the predictive power of all players. Then we proceed to the results for chatbot and voicebot expressions and their corresponding intent classification.

We present two main performance metrics: classification accuracy and F1-score. All results are aggregated separately according to either the intent name or the chatbot name. We show boxplots of these metrics, displaying the mean, median, quartiles and outliers.

The displayed information can be summarized as follows:

  1. The X-axis represents the NLU of the different players, with accompanying scores denoting:
    • the weighted average across all scores (weighting is done w.r.t. the number of expressions per intent or chatbot) – the first score in the brackets.
    • the median of all scores – the second score in the brackets and the orange line in the boxplot.
  2. The Y-axis represents the score dimension in the range [0..1].
  3. In addition to the median, quartiles and fences, the mean score is indicated with a green triangle.
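The weighted average and median described above could be computed, for example, as follows (per-intent accuracy here is the fraction of each intent's test expressions that were classified correctly; the helper name is ours):

```python
import statistics
from collections import defaultdict

def per_intent_scores(y_true, y_pred):
    """Per-intent accuracy, plus the expression-weighted average and the median."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    acc = {intent: correct[intent] / total[intent] for intent in total}
    # Weighted average w.r.t. the number of expressions per intent.
    weighted = sum(acc[i] * total[i] for i in acc) / sum(total.values())
    median = statistics.median(acc.values())
    return acc, weighted, median
```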

Figures for clean Dutch expressions

The boxplots below represent the classification accuracies and F1-scores per intent for clean Dutch expressions. We can easily see that the platform outperforms LUIS by a large margin and edges out IBM Watson in terms of both overall statistics and variance of results.

(last updated on 24/08/2018)

Figures for real-world chatbot expressions

Next we switch to the results for chatbot and voicebot expressions, which are less clean and therefore more realistic examples of the different players in action. We can still observe a clear win for the platform over LUIS, with IBM Watson a close runner-up. The first figure shows the results obtained from the Dutch expressions. The second figure represents French chatbot and voicebot expressions and the corresponding classification metrics.

Notice that all players have problems with the classification of some particular intents, which end up being completely misclassified (zero accuracies and F1-scores). These intents do not have enough expressions in the training data. Next we aggregate the performance metrics per chatbot and outline these statistics in the boxplot below. As before, the first figure shows our findings for the Dutch expressions, while the second one represents the French chatbots.

We observe better results with the platform in comparison to LUIS and IBM Watson. This is well aligned with our day-to-day production classification performance findings.

Discussion and conclusion

In this short blog post we discussed NLU performance for languages that are not often (if ever) taken into consideration for benchmarking: Dutch and French. We have demonstrated that a mixture of good Machine/Deep Learning, Natural Language Understanding and domain-specific expertise can lead to a significant boost in performance.

Our empirical findings and performance evaluations confirm that the platform outperforms the Big Tech players. Keeping customer needs in mind, and having the right blend of technical and domain expertise in the hands of a highly devoted team of engineers and data scientists, ensures that we keep challenging the status quo.