Thursday, July 19, 2018

Anikiev_Sukretna Machine Learning approach for Natural Language Processing (NLP) text classification problem


Download: Please download the full article (47 pages) PDF from here: https://1drv.ms/b/s!AvejO2r1DmacgYdbnfzLrJZ-SOxIsA

The purpose of this document is to illustrate the application of a Machine Learning approach to the Natural Language Processing (NLP) text classification problem. We would like to detail the mathematical apparatus behind NLP and get the reader comfortable with the numbers. We are also strong believers in visuals, which is why the document includes diagrams for ease of comprehension, analysis and comparison. This document may help a person passionate about Machine Learning and Document Classification get started quickly.

About authors:
  • Alex Anikiev (LinkedIn) holds a Master’s degree in Computer Science and a PhD in Applied Mathematics from the National University of Ukraine “KPI”. Alex is interested in and passionate about Artificial Intelligence and Machine Learning, works as a Software Architect and lives in Redmond, WA.
  • Alena Sukretna (LinkedIn) holds a Master’s degree in Computer Science from the National University of Ukraine “KPI” and a Nanodegree in Data Science and Data Analysis from Udacity. Alena is interested in and passionate about Artificial Intelligence and Machine Learning, currently works as a Freelance Data Scientist and lives in Redmond, WA.

Business problem: Say our customer has (or has access to) a large volume of information (data, documents, etc.) and would like to sort it into categories, structure it better, make better sense of it and draw meaningful insights from it. This is much like what you see on the Bing News web page, where Bing suggests categories of information that might interest you, for example “FIFA World Cup”, “Wimbledon”, “U.S.” and “World”.

Problem domain: This business problem is likely to fall into the Natural Language Processing (NLP) domain. Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. You can find more information about Natural Language Processing (NLP) in general on Wikipedia here: https://en.wikipedia.org/wiki/Natural_language_processing

Types of problems: There are numerous known problems in the Natural Language Processing (NLP) domain. For the purposes of this document we’ll mention just a few, namely Search, Extraction and Classification. The right approach depends on the business problem at hand, so analyzing the business problem before tackling it is almost always a good idea.

Often the real business problem our client is attempting to solve is a Search problem, where the goal is to query and filter the information effectively and efficiently. Once the problem has been clarified (and re-identified) we may leverage appropriate tools, for example Azure Cognitive Services (perhaps Bing Custom Search), Azure Search or the open-source Elasticsearch.

If the real problem is an Extraction problem, where data needs to be extracted and relationships between data elements need to be identified and visualized, then extracting triples (subject-predicate-object), storing them in a Graph data store and visualizing them as a Graph structure for exploration and analysis may yield very impressive practical results. For these purposes Azure Cosmos DB as a NoSQL data store, along with Resource Description Framework (RDF) or Property Graph, may be a very good fit.

Now if we look at the classic Machine Learning problems (Regression, Classification and Clustering), we can project them into the Text Analytics space. An example of a Supervised Learning Classification task is Document classification based on labelled data. An example of an Unsupervised Learning Clustering task is Topic modeling. There are also other popular Text Analytics tasks, such as Named Entity Recognition (NER), Keyword extraction and Document summarization. Please note that these tasks may be solved in multiple ways, for example with the help of Azure Cloud services or with specialized Python libraries.

Some notable tools that help tackle Text Analytics problems include Azure Cognitive Services, Azure ML Studio, specialized Python libraries (scikit-learn, NLTK, Gensim, spaCy), Azure ML Workbench, Jupyter notebooks and the Azure Text Analytics Toolkit.
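
To make the triple-based Extraction idea a little more concrete, here is a minimal in-memory sketch in Python. The triples, the `build_graph` and `objects_of` helpers and the predicate names are all hypothetical illustrations. A real solution would store the triples in a graph data store rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples, as an extraction step might emit them.
TRIPLES = [
    ("Wimbledon", "is_a", "tennis tournament"),
    ("Wimbledon", "held_in", "London"),
    ("FIFA World Cup", "is_a", "football tournament"),
    ("London", "located_in", "United Kingdom"),
]

def build_graph(triples):
    """Index triples by subject: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]

graph = build_graph(TRIPLES)
print(objects_of(graph, "Wimbledon", "held_in"))
```

Indexing by subject keeps "what do we know about X?" lookups cheap, which is the typical access pattern when exploring relationships.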

Focus problem: For the purposes of this document we will focus on the Text Classification problem, specifically Document Classification. Document Classification is a Supervised Learning problem, which requires a labelled data set for training. In other words, if we expect to categorize documents into N different categories, we’ll need to provide the system with enough example documents from each category for it to learn from and make reliable predictions for new documents.
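
As a rough illustration of what a labelled data set looks like, the sketch below pairs each document with its category and holds out part of every category for evaluation. The toy corpus, the `stratified_split` helper and the 25% test share are assumptions made for this example only.

```python
import random
from collections import Counter

# Hypothetical toy corpus: (document text, category label) pairs.
LABELLED_DOCS = [
    ("the striker scored a late goal in the final", "sports"),
    ("the champion won the tennis match in straight sets", "sports"),
    ("the team lost the cup game after extra time", "sports"),
    ("the coach praised the players after the match", "sports"),
    ("parliament passed the new budget law", "politics"),
    ("the president signed a trade agreement", "politics"),
    ("the senate will vote on the tax bill", "politics"),
    ("the minister announced an election date", "politics"),
]

def stratified_split(docs, test_fraction=0.25, seed=0):
    """Split labelled documents so every category contributes to the test set."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in docs:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        # Hold out at least one document per category.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

train_docs, test_docs = stratified_split(LABELLED_DOCS)
print(Counter(label for _, label in train_docs))
print(Counter(label for _, label in test_docs))
```

Splitting per category keeps every one of the N categories represented in both the training and the test portion, which matters when some categories have few examples.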

Types of approaches: Different approaches can be used to solve the Document Classification problem. One is Machine Learning (ML), which is generally more suitable for small and medium-size data sets. Another is Deep Learning (DL) using Neural Nets (NN), which is generally more suitable for medium and large data sets.

Focus approach: For the purposes of this document we will focus on the Machine Learning (ML) approach to Document Classification. A good place to start is the Naïve Bayes and Support Vector Machines (SVM) algorithms. These algorithms apply to single-label classification tasks, and their parameters may be fine-tuned to achieve the best results. There are other algorithms that may be applicable to the task; in future articles we may consider them, as well as the multi-label classification task, in which multiple labels may be assigned to a document at the same time (this requires specific algorithms).
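
To show the numbers behind Naïve Bayes, here is a small from-scratch sketch of the multinomial variant with add-alpha (Laplace) smoothing. It scores each category c as log P(c) plus the sum of log P(w|c) over the words w of the document, and picks the highest score. The `TinyMultinomialNB` class, the whitespace tokenizer and the four training sentences are assumptions for illustration, not the implementation used in the full article. In practice a library such as scikit-learn (`MultinomialNB`) would normally be used instead.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; a tunable parameter

    def fit(self, texts, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return log P(c) + sum of log P(w|c) for every class c."""
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / self.total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Hypothetical training data: one label per document (single-label task).
texts = [
    "goal scored in the match",
    "team won the cup final",
    "senate passed the tax bill",
    "president signed the new law",
]
labels = ["sports", "sports", "politics", "politics"]

model = TinyMultinomialNB(alpha=1.0).fit(texts, labels)
print(model.predict("the team scored a late goal"))
print(model.log_scores("the senate passed a new law"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and `alpha` is an example of a parameter that can be tuned on held-out data.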

Solution architecture (E2E): For the purposes of this document we would like to illustrate the end-to-end solution for the Document Classification problem, which includes both Research & Development (R&D) and Operationalization aspects. You may want to develop and test your models locally first in your Experimentation workspace using, for example, Azure ML Workbench, Jupyter notebooks and appropriate Python libraries. When you are comfortable with the performance of your model, you may want to export its definition and wrap the model in a Docker Container for ease of deployment into the Azure Cloud. Once moved to the Cloud, your pre-trained model may be reused and invoked on demand via a Web Service from within the container. Azure Cloud allows you to manage container images and instances via Azure Container Registry (ACR) and Azure Container Service (ACS). If you need to orchestrate a number of containers, you may leverage Azure Kubernetes Service (AKS). After the model has been deployed and is in use, at some point you may want to re-train it with additional data and use the knowledge obtained from the new data to improve classification quality.
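
The export-and-invoke step can be sketched as follows. This is a simplified illustration under several assumptions: the "model" is a hypothetical keyword-overlap stand-in rather than a trained classifier, its definition is exported as JSON, and `score` plays the role of the handler a web service inside the container would call for each request. None of these names come from an Azure API.

```python
import json

def export_model(keywords_by_label):
    """Serialize the (hypothetical) model definition to a JSON string."""
    payload = {
        "version": 1,
        "keywords": {label: sorted(words) for label, words in keywords_by_label.items()},
    }
    return json.dumps(payload)

def load_model(serialized):
    """Rebuild the in-memory model from its exported definition."""
    payload = json.loads(serialized)
    return {label: set(words) for label, words in payload["keywords"].items()}

def score(request_body, model):
    """Handle one scoring request: JSON in, JSON out."""
    text = json.loads(request_body)["text"]
    tokens = set(text.lower().split())
    overlaps = {label: len(tokens & words) for label, words in model.items()}
    # Ties are broken alphabetically so the result is deterministic.
    best_label = max(sorted(overlaps), key=overlaps.get)
    return json.dumps({"label": best_label, "overlap": overlaps[best_label]})

# R&D side: produce and export the model definition.
exported = export_model({
    "sports": {"goal", "match", "team"},
    "politics": {"senate", "law", "vote"},
})

# Operationalization side: load once at container start-up, then score on demand.
model = load_model(exported)
response = score('{"text": "The senate will vote today"}', model)
print(response)
```

Separating `export_model` from `load_model` mirrors the split between the experimentation workspace and the deployed container: re-training only has to produce a new exported definition, and the scoring code stays unchanged.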



Disclaimer: This material is presented As Is with no warranties provided by the authors. This article is also available on our blog here: http://anikiev.blogspot.com/. Please note that the content of the article can be updated over time to better explain the topic

Tags: Microsoft, Azure, Cloud, Machine Learning, ML, Natural Language Processing, NLP, Document Classification, Python, Scikit Learn, Naïve Bayes, NB, Support Vector Machine, SVM, Stochastic Gradient Descent, SGD