Thursday, July 19, 2018

Anikiev_Sukretna Machine Learning approach for Natural Language Processing (NLP) text classification problem

Download: The full article (47 pages) is available as a PDF here: https://1drv.ms/b/s!AvejO2r1DmacgYdbnfzLrJZ-SOxIsA

The purpose of this document is to illustrate how a Machine Learning approach can be applied to the Natural Language Processing (NLP) text classification problem. We detail the mathematical apparatus behind NLP and aim to get the reader comfortable with the numbers. We are also strong believers in visuals, which is why the document includes diagrams for ease of comprehension, analysis and comparison. This document may help anyone passionate about Machine Learning and Document Classification to get started quickly.

About authors:
  • Alex Anikiev (LinkedIn) holds a Master’s degree in Computer Science and a PhD in Applied Mathematics from the National University of Ukraine “KPI”. Alex is passionate about Artificial Intelligence and Machine Learning, works as a Software Architect and lives in Redmond, WA.
  • Alena Sukretna (LinkedIn) holds a Master’s degree in Computer Science from the National University of Ukraine “KPI” and a Nanodegree in Data Science and Data Analysis from Udacity. Alena is passionate about Artificial Intelligence and Machine Learning, currently works as a Freelance Data Scientist and lives in Redmond, WA.

Business problem: Say that our customer has (or has access to) a large volume of information (data, documents, etc.) and would like to sort it into categories, structure it better, make better sense of it and draw meaningful insights from it. This is much like the Bing News web page, where Bing suggests categories of information that might interest you, for example “FIFA World Cup”, “Wimbledon”, “U.S.” and “World”.

Problem domain: This business problem is likely to fall into the Natural Language Processing (NLP) domain. NLP is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. More information about NLP in general is available on Wikipedia: https://en.wikipedia.org/wiki/Natural_language_processing

Types of problems: Numerous problems have been formulated in the Natural Language Processing (NLP) domain. For the purposes of this document we mention just three: Search, Extraction and Classification. The right approach depends on the business problem at hand, so analyzing the business problem before tackling it is always a good idea.

Often the real business problem our client is attempting to solve is a Search problem, where the goal is to query and filter the information effectively and efficiently. Once the problem has been clarified (and re-identified), we may leverage appropriate tools, for example Azure Cognitive Services (perhaps Bing Custom Search), Azure Search or the open-source Elasticsearch.

If the real problem is an Extraction problem, where data needs to be extracted and the relationships between data elements need to be identified and visualized, then extracting triples (subject-predicate-object), storing them in a graph data store and visualizing them as a graph for exploration and analysis may yield very impressive practical results. For these purposes Azure Cosmos DB as a NoSQL data store, along with the Resource Description Framework (RDF) or a Property Graph model, may be a good choice.

Now, if we look at the classic Machine Learning problems (Regression, Classification and Clustering), we can project them into the Text Analytics space. An example of a Supervised Learning Classification task is Document Classification based on labelled data. An example of an Unsupervised Learning Clustering task is Topic Modeling. There are also other popular Text Analytics tasks, such as Named Entity Recognition (NER), Keyword Extraction and Document Summarization. Please note that these tasks may be solved in multiple ways, for example with the help of Azure Cloud services or with specialized Python libraries.

Notable tools that help tackle Text Analytics problems include Azure Cognitive Services, Azure ML Studio, specialized Python libraries (scikit-learn, NLTK, Gensim, spaCy), Azure ML Workbench, Jupyter notebooks and the Azure Text Analytics Toolkit.
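To make the triple-based Extraction approach mentioned above more concrete, here is a minimal in-memory sketch in plain Python. The sample triples and the helper names `build_graph` and `find` are illustrative assumptions, not part of any Azure or RDF API; a real solution would load the triples into a graph data store.

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) triples; in practice these would
# come from an extraction step run over the documents.
triples = [
    ("Wimbledon", "is_a", "tennis tournament"),
    ("Wimbledon", "held_in", "London"),
    ("London", "capital_of", "United Kingdom"),
]


def build_graph(triples):
    """Index triples by subject so outgoing edges are easy to explore."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def find(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching every field that is not None."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


graph = build_graph(triples)
print(graph["Wimbledon"])                   # outgoing edges of one node
print(find(triples, predicate="held_in"))   # query by relationship type
```

Indexing by subject keeps the sketch short; a graph store would typically index all three positions so that queries by predicate or object are equally cheap.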

Focus problem: For the purposes of this document we focus on the Text Classification problem, specifically Document Classification. Document Classification is a Supervised Learning problem, which requires a labelled data set for training. In other words, if we expect to categorize documents into N different categories, we need to provide the system with enough example documents from each category for it to learn from and to make reliable predictions for new documents.
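As a rough illustration of what "learning from labelled examples" means, the sketch below trains a tiny multinomial Naïve Bayes classifier with add-one smoothing in plain Python. The four training documents, the two categories and the function names are made up for illustration; this is not the implementation from the full article.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def train(labelled_docs):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: the share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled data set: (document text, category).
labelled_docs = [
    ("the team won the football match", "sports"),
    ("tennis player wins the final match", "sports"),
    ("parliament votes on the new trade law", "world"),
    ("the president signs the trade agreement", "world"),
]

model = train(labelled_docs)
print(predict(model, "football final"))
print(predict(model, "trade law"))
```

With so few examples the model only recognizes words it has already seen, which is why a realistic data set needs many documents per category.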

Types of approaches: Different approaches can be used to solve the Document Classification problem. One is classical Machine Learning (ML), which tends to be more suitable for small and medium-size data sets. Another is Deep Learning (DL) using Neural Networks (NN), which tends to be more suitable for medium and large data sets.

Focus approach: For the purposes of this document we focus on the Machine Learning (ML) approach to Document Classification. A good place to start is the Naïve Bayes and Support Vector Machine (SVM) algorithms. These algorithms apply to single-label classification tasks, and their parameters may be fine-tuned to achieve the best results. Other algorithms may also be applicable to the task; in future articles we may consider them, as well as the multi-label classification task, in which multiple labels may be assigned to a document at the same time (this requires specific algorithms).
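One possible way to try both algorithms and tune their parameters is sketched below with scikit-learn. The toy documents, the TF-IDF features and the parameter grids are assumptions chosen for illustration, not the settings used in the full article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical single-label data set: one category per document.
docs = [
    "the team won the football match",
    "tennis player wins the final match",
    "the striker scored two goals",
    "the coach praised the goalkeeper",
    "parliament votes on the new trade law",
    "the president signs the trade agreement",
    "ministers discuss the border treaty",
    "the senate debates the budget law",
]
labels = ["sports"] * 4 + ["world"] * 4

# Each candidate pairs a classifier with the parameter values to try:
# alpha is the Naive Bayes smoothing strength, C the SVM regularization.
candidates = {
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
    "linear_svm": (LinearSVC(), {"clf__C": [0.1, 1.0, 10.0]}),
}

results = {}
for name, (classifier, grid) in candidates.items():
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])
    # 2-fold cross-validation only because the toy data set is tiny.
    search = GridSearchCV(pipeline, grid, cv=2)
    search.fit(docs, labels)
    results[name] = search
    print(name, search.best_params_, round(search.best_score_, 2))

# The best estimator of each search is refit on all data and can predict.
print(results["linear_svm"].predict(["the goalkeeper saved the match"]))
```

Putting the vectorizer inside the pipeline means the vocabulary is learned only from the training folds during cross-validation, which keeps the reported scores honest.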

Solution architecture (E2E): For the purposes of this document we illustrate an end-to-end solution for the Document Classification problem that covers both the Research & Development (R&D) and the Operationalization aspects. You may want to develop and test your models locally first, in your Experimentation workspace, using, for example, Azure ML Workbench, Jupyter notebooks and appropriate Python libraries. When you are comfortable with the performance of your model, you may export its definition and wrap the model in a Docker container for ease of deployment to the Azure Cloud. Once in the Cloud, your pre-trained model may be reused and invoked on demand via a Web Service from within the container. Azure Cloud allows you to manage container images and instances via Azure Container Registry (ACR) and Azure Container Service (ACS). If you need to orchestrate a number of containers, you may leverage Azure Kubernetes Service (AKS). After the model has been deployed and is in use, at some point you may want to re-train it with additional data and use the knowledge gained from that data to improve classification quality.
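The export, invoke and re-train steps described above can be sketched locally as follows. This is a minimal sketch that assumes a scikit-learn pipeline serialized with Python's `pickle` module; the `score` function stands in for the Web Service entry point, and the file name, JSON shape and sample data are hypothetical. Container packaging and Azure deployment are outside the scope of the sketch.

```python
import json
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train(docs, labels):
    """Fit a TF-IDF + Naive Bayes pipeline on labelled documents."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    return model


def export_model(model, path):
    """Serialize the trained model so it can be shipped with a container."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously exported model (only from a trusted source)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def score(model, request_body):
    """Hypothetical service entry point: JSON documents in, JSON labels out."""
    payload = json.loads(request_body)
    predictions = model.predict(payload["documents"])
    return json.dumps({"labels": [str(label) for label in predictions]})


# R&D: train locally on a hypothetical data set and export the model.
docs = [
    "the team won the football match",
    "tennis player wins the final match",
    "parliament votes on the new trade law",
    "the president signs the trade agreement",
]
labels = ["sports", "sports", "world", "world"]
model_path = Path(tempfile.mkdtemp()) / "doc_classifier.pkl"
export_model(train(docs, labels), model_path)

# Operationalization: the service loads the model once and scores requests.
served_model = load_model(model_path)
request = json.dumps({"documents": ["the president visits parliament"]})
response = score(served_model, request)
print(response)

# Re-training: fit again on the old and new data, then overwrite the export.
new_docs = ["the striker scored a late goal"]
new_labels = ["sports"]
export_model(train(docs + new_docs, labels + new_labels), model_path)
```

Re-training from scratch on the combined data keeps the sketch simple; whether to retrain fully or update incrementally depends on the algorithm and the data volume.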


Disclaimer: This material is presented as is, with no warranties provided by the authors. This article is also available on our blog: http://anikiev.blogspot.com/. Please note that the content of the article may be updated over time to better explain the topic.

Tags: Microsoft, Azure, Cloud, Machine Learning, ML, Natural Language Processing, NLP, Document Classification, Python, Scikit Learn, Naïve Bayes, NB, Support Vector Machine, SVM, Stochastic Gradient Descent, SGD

