Deep and Machine Learning methods for document clustering and classification
Dr. Alexei I. Streltsov (Senior Data Scientist, SAP SE, Germany) and HybriLIT heterogeneous computation team
Held as part of the XXIII International Scientific Conference of Young Scientists and Specialists (AYSS-2019), using the ecosystem developed for ML/DL.
In this tutorial, we consider the complete workflow of a typical Data Science project dealing with text documents. We define a problem, generate and analyze data, and explore relevant features: we discuss several ways to extract and describe semantic information, and show how to augment it with additional non-semantic information (which might help to improve the results). Next, we construct and apply several standard Machine Learning (ML) models to describe our data, casting the task as both a classification and a regression problem. Then, we analyze the efficiency of the ML methods as well as the role, impact and relevance of our semantic and non-semantic features. Next, we show how to apply Deep Learning (DL) methods to attack the same problem, considering simple DNN (Deep Neural Network) and CNN (Convolutional Neural Network) models. Finally, we contrast our ML and DL results and discuss their pluses and minuses: efficiency, required computational resources, and possible ways to improve them.
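To make the feature-extraction and classification steps above more concrete, here is a minimal sketch using only the Python standard library. It is not the tutorial's notebook code: the toy corpus, the labels, the `__length__` feature name and the nearest-centroid classifier are all assumptions chosen for illustration, standing in for the standard ML models mentioned above. The sketch builds TF-IDF weights as the semantic features, appends a scaled document length as a non-semantic feature, and assigns a new document to the class whose centroid is most similar.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (text, label) pairs; not the tutorial's data.
TRAIN = [
    ("the gpu cluster runs neural network training jobs", "computing"),
    ("python packages are installed on the cluster", "computing"),
    ("the detector measures particle collisions", "physics"),
    ("neutron scattering experiments probe particle structure", "physics"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately naive)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def featurize(text, idf):
    """L2-normalized TF-IDF weights (semantic) plus a length feature (non-semantic)."""
    tokens = tokenize(text)
    tf = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    vec = {tok: w / norm for tok, w in vec.items()}
    # Non-semantic feature: token count, capped at 50 and scaled to [0, 1].
    vec["__length__"] = min(len(tokens), 50) / 50
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fit_centroids(samples, idf):
    """Average the feature vectors of each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for text, label in samples:
        counts[label] += 1
        for k, w in featurize(text, idf).items():
            sums[label][k] += w
    return {
        label: {k: w / counts[label] for k, w in acc.items()}
        for label, acc in sums.items()
    }


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the document."""
    vec = featurize(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


IDF = fit_idf([text for text, _ in TRAIN])
CENTROIDS = fit_centroids(TRAIN, IDF)

if __name__ == "__main__":
    for query in ("particle collisions in the detector",
                  "training a neural network on the gpu"):
        print(query, "->", predict(query, IDF, CENTROIDS))
```

In a real project the hand-written pieces would normally be replaced by library components, and the relevance of the non-semantic feature could be checked by comparing results with and without it.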
The tutorial supports both active and passive participation. I will use a live Jupyter Notebook presentation to describe, discuss and execute each and every block of the Python code required for the above workflow. The corresponding blocks will be shared on a dedicated Slack channel (HybriLIT subscription required: https://web-stc.jinr.ru). If you have a valid account on the HybriLIT cluster, you will be able to copy and paste them from the Slack channel and re-execute them online in your own Notebook via the GitLab (https://jhub.jinr.ru/) service. No extra work is needed on your side to install, tune or support the required Python packages: JHub has already done it for you.
Registration is available at: https://indico-hlit.jinr.ru/event/146/
Senior Data Scientist,
ML Deep Learning COE WDF
SAP SE, Dietmar-Hopp-Allee 16, 69190
Walldorf, Germany