Deep Learning for Entity Extraction
1. Deep Learning for Domain-Specific Entity Extraction from Unstructured Text
Mohamed AbdelHady, Microsoft AI Platform
Zoran Dzunic, Microsoft AI Platform
#DL1SAIS
2. Goals
• What is entity extraction?
• When to train a custom entity extraction model?
• What are word embeddings?
• How to train a custom word embedding model on a Spark cluster?
• How to train a custom deep neural network for entity extraction?
3-4. Entity Extraction
• Subtask of information extraction
• Also known as named-entity recognition (NER), entity chunking, and entity identification
• Find phrases in text that refer to real-world entities of specific types
Example: "Zoran and Mohamed are at Spark+AI Summit in San Francisco."
Zoran : PERSON
Mohamed : PERSON
Spark+AI Summit : ORG
San Francisco : LOC
5. Biomedical Entity Extraction
• Entity types: drug/chemical, disease, protein, DNA, etc.
• Critical step for complex biomedical NLP tasks:
  – Extraction of diseases and symptoms from electronic medical or health records
  – Understanding the interactions between different entity types, such as drug-drug interaction, drug-disease relationship, and gene-protein relationship, e.g.,
    • Drug A cures Disease B.
    • Drug A causes Disease B.
• Similar for other domains (e.g., legal, finance)
6. Biomedical Entity Extraction (figure)
7. Demo
https://medicalentitydemo.azurewebsites.net
8. Approach
1. Feature Extraction Phase: Domain-Specific Features
   Use a large unlabeled domain-specific corpus, such as MEDLINE PubMed abstracts, to train a neural word embedding model.
2. Model Training Phase: Domain-Specific Model
   The output embeddings serve as automatically generated features for training a neural entity extractor on a small, reasonable amount of labeled data.
9. Word Embedding
A continuous semantic representation of words
11. Input: Words
Words:  Naloxone   reverses  the
Labels: B-Chemical O         O
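The B-Chemical and O labels above follow the BIO tagging scheme: a token is tagged B-&lt;TYPE&gt; if it begins an entity, I-&lt;TYPE&gt; if it continues one, and O otherwise. A minimal sketch of deriving token-level BIO labels from entity spans (the function name and the span format are illustrative, not from the talk):

```python
def bio_tags(tokens, entities):
    """entities: list of (start, end_exclusive, type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # continuation tokens
    return tags

# The example tokens from the slide, with "Naloxone" marked as a chemical.
print(bio_tags(["Naloxone", "reverses", "the"], [(0, 1, "Chemical")]))
# → ['B-Chemical', 'O', 'O']
```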
12. Features: Word Embeddings
Labels:     B-Chemical        O                 O
Embeddings: [0.3, 0.2, 0.9 …] [0.8, 0.8, 0.1 …] [0.5, 0.1, 0.5 …]
dim is small (e.g., 50 or 200)
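Turning tokens into features then amounts to a per-token lookup in the embedding table, with a shared fallback vector for out-of-vocabulary words. A toy sketch (the vocabulary, random vectors, dim=4, and the OOV word "clonidine" are all illustrative stand-ins for a trained model):

```python
import random

random.seed(0)
dim = 4  # illustrative; the talk uses small dims such as 50 or 200
# Hypothetical toy vocabulary; real vectors come from a trained word2vec model.
vocab = {w: [round(random.uniform(-1, 1), 2) for _ in range(dim)]
         for w in ["naloxone", "reverses", "the"]}
unk = [0.0] * dim  # shared vector for out-of-vocabulary tokens

def embed(tokens):
    return [vocab.get(t.lower(), unk) for t in tokens]

features = embed(["Naloxone", "reverses", "clonidine"])
print(len(features), len(features[0]))  # 3 tokens, each a dim-4 vector
```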
13. Embeddings
14. Custom Word Embeddings
• Publicly available pre-trained models exist, such as Google News
• Can we do better on a specific domain?
• We trained a word embedding model for the biomedical domain on 27 million PubMed abstracts (22 GB)
• Azure HDInsight Spark cluster, 11 worker nodes
• Spark MLlib Word2Vec
• Trained in ~30 minutes
15. DNNs for Entity Extraction
17. Why Deep Learning?
18. DNN Architecture
• Keras with TensorFlow
• GPU-enabled Azure Data Science VM (DSVM), NC6 Standard (56 GB, NVIDIA Tesla K80), or Deep Learning VM (DLVM)
• Parameters
  – number of recurrent units = 150
  – dropout rate = 0.2
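A minimal Keras sketch of a sequence tagger using the parameters on the slide (150 recurrent units, dropout 0.2). The bidirectional wrapper, vocabulary size, sequence length, and tag count are assumptions for illustration; in the talk's pipeline the embedding weights would be loaded from the trained word2vec model rather than left at their defaults:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

# Illustrative sizes; only units=150 and dropout=0.2 come from the slide.
vocab_size, embed_dim, max_len, n_tags = 5000, 50, 60, 7

model = Sequential([
    # In practice, initialize with the word2vec matrix and freeze it.
    Embedding(vocab_size, embed_dim),
    Bidirectional(LSTM(150, return_sequences=True, dropout=0.2)),
    # One softmax over the BIO tag set per token.
    TimeDistributed(Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Shape check on dummy input: (batch, max_len) -> (batch, max_len, n_tags)
out = model.predict(np.zeros((2, max_len), dtype="int32"))
```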
19. Results
20. Datasets
• Proteins, cell line, cell type, DNA, and RNA detection: Bio-Entity Recognition Task at BioNLP/NLPBA 2004 - http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html
• Chemicals and diseases detection: BioCreative V CDR task corpus - http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
• Drugs detection: SemEval 2013, Task 9.1 (Drug Recognition) - https://www.cs.york.ac.uk/semeval-2013/task9/
21. Dataset Description (table not recoverable from the export)
http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
22. Experimental Setup
• Azure ML Python Package for Text Analytics
  https://docs.microsoft.com/en-us/python/api/overview/azure-machine-learning/textanalytics
  https://aka.ms/aml-packages/text/download
23. Conditional Random Fields (CRF)
CRFsuite:
• Extract traditional features
• Train a CRF model
24. Results (exact match)
Algorithm + Features                 Recall  Precision  F-score
Dictionary Lookup                    64%     74%        68%
CRF: Traditional Features            61%     81%        70%
CRF: PubMed Embedding                40%     61%        48%
CRF: Traditional + PubMed Embedding  65%     80%        71%
LSTM: PubMed Embedding               76%     77%        76%
LSTM: Generic Embeddings             74%     63%        67%
25. Embedding Comparison
26. Embedding Comparison
27. Takeaways
• Recipe for building a custom entity extraction pipeline:
  – Get a large amount of in-domain unlabeled data
  – Train a word2vec model on the unlabeled data on Spark
  – Get as much labeled data as possible
  – Train an LSTM-based neural network on a GPU-enabled machine
• Word embeddings are powerful features
  – They convey word semantics
  – They perform better than traditional features
  – No feature engineering is required
• An LSTM neural network is a more powerful model than a traditional CRF
28. Questions