Chapter 1 Getting started taming text
1.1 Why taming text is important
1.2 Preview: a fact-based question answering system
1.2.1 Hello, Dr. Frankenstein
1.3 Understanding text is hard
1.4 Text, tamed
1.5 Text and the intelligent app: search and beyond
1.5.1 Searching and matching
1.5.2 Extracting information
1.5.3 Grouping information
1.5.4 An intelligent application
1.6 Summary
1.7 Resources
Chapter 2 Foundations of taming text
2.1 Foundations of language
2.1.1 Words and their categories
2.1.2 Phrases and clauses
2.1.3 Morphology
2.2 Common tools for text processing
2.2.1 String manipulation tools
2.2.2 Tokens and tokenization
2.2.3 Part-of-speech tagging
2.2.4 Stemming
2.2.5 Sentence detection
2.2.6 Parsing and grammar
2.2.7 Sequence modeling
2.3 Extracting and preprocessing content from common file formats
2.3.1 The importance of preprocessing
2.3.2 Extracting content using Apache Tika
2.4 Summary
2.5 Resources
Chapter 3 Searching
3.1 Search and faceting example: Amazon.com
3.2 Introduction to search concepts
3.2.1 Indexing content
3.2.2 User input
3.2.3 Ranking documents with the vector space model
3.2.4 Results display
3.3 Introducing the Apache Solr search server
3.3.1 Running Solr for the first time
3.3.2 Understanding Solr concepts
3.4 Indexing content with Apache Solr
3.4.1 Indexing using XML
3.4.2 Extracting and indexing content using Solr and Apache Tika
3.5 Searching content with Apache Solr
3.5.1 Solr query input parameters
3.5.2 Faceting on extracted content
3.6 Understanding search performance factors
3.6.1 Judging quality
3.6.2 Judging quantity
3.7 Improving search performance
3.7.1 Hardware improvements
3.7.2 Analysis improvements
3.7.3 Query performance improvements
3.7.4 Alternative scoring models
3.7.5 Techniques for improving Solr performance
3.8 Search alternatives
3.9 Summary
3.10 Resources
Chapter 4 Fuzzy string matching
4.1 Approaches to fuzzy string matching
4.1.1 Character overlap measures
4.1.2 Edit distance
4.1.3 N-gram edit distance
4.2 Finding fuzzy string matches
4.2.1 Using prefixes for matching with Solr
4.2.2 Using a trie for prefix matching
4.2.3 Using n-grams for matching
4.3 Building fuzzy string matching applications
4.3.1 Adding type-ahead to search
4.3.2 Query spell-checking for search
4.3.3 Record matching
4.4 Summary
4.5 Resources
Chapter 5 Named-entity recognition
5.1 Approaches to named-entity recognition
5.1.1 Rule-based entity recognition
5.1.2 Entity recognition with statistical classifiers
5.2 Basic entity recognition with OpenNLP
5.2.1 Finding person names with OpenNLP
5.2.2 Interpreting entities identified by OpenNLP
5.2.3 Filtering entities based on probability
5.3 In-depth named-entity recognition with OpenNLP
5.3.1 Identifying multiple entity types with OpenNLP
5.3.2 Under the hood: how OpenNLP identifies entities
5.4 Performance of OpenNLP
5.4.1 Quality of results
5.4.2 Runtime performance
5.4.3 Memory usage in OpenNLP
5.5 Customizing OpenNLP entity recognition for a new domain
5.5.1 The whys and hows of training a model
5.5.2 Training an OpenNLP model
5.5.3 Altering modeling inputs
5.5.4 A new way to model entities
5.6 Summary
5.7 Further reading
Chapter 6 Clustering text
6.1 Document clustering in Google News
6.2 Clustering foundations
6.2.1 Three types of text to cluster
6.2.2 Choosing a clustering algorithm
6.2.3 Determining similarity
6.2.4 Labeling the clustering results
6.2.5 Evaluating clustering results
6.3 Setting up a simple clustering application
6.4 Clustering search results using Carrot2
6.4.1 Using the Carrot2 API
6.4.2 Clustering Solr search results using Carrot2
6.5 Clustering document collections with Apache Mahout
6.5.1 Preparing the data for clustering
6.5.2 K-means clustering
6.6 Topic modeling using Apache Mahout
6.7 Examining clustering performance
6.7.1 Feature selection and reduction
6.7.2 Carrot2 performance and quality
6.7.3 Mahout clustering benchmarks
6.8 Acknowledgments
6.9 Summary
6.10 References
Chapter 7 Classification and tagging
7.1 Introduction to classification and categorization
7.2 The classification process
7.2.1 Choosing a classification scheme
7.2.2 Identifying features for text categorization
7.2.3 The importance of training data
7.2.4 Evaluating classifier performance
7.2.5 Deploying a classifier into production
7.3 Building document categorizers using Apache Lucene
7.3.1 Categorizing text with Lucene
7.3.2 Preparing the training data for the MoreLikeThis categorizer
7.3.3 Training the MoreLikeThis categorizer
7.3.4 Categorizing documents with the MoreLikeThis categorizer
7.3.5 Testing the MoreLikeThis categorizer
7.3.6 MoreLikeThis in production
7.4 Training a naive Bayes classifier using Apache Mahout
7.4.1 Categorizing text using naive Bayes classification
7.4.2 Preparing the training data
7.4.3 Withholding test data
7.4.4 Training the classifier
7.4.5 Testing the classifier
7.4.6 Improving the bootstrapping process
7.4.7 Integrating the Mahout Bayes classifier with Solr
7.5 Categorizing documents with OpenNLP
7.5.1 Regression models and maximum entropy document categorization
7.5.2 Preparing training data for the maximum entropy document categorizer
7.5.3 Training the maximum entropy document categorizer
7.5.4 Testing the maximum entropy document categorizer
7.5.5 Maximum entropy document categorization in production
7.6 Building a tag recommender using Apache Solr
7.6.1 Collecting training data for tag recommendations
7.6.2 Preparing the training data
7.6.3 Training the Solr tag recommender
7.6.4 Creating tag recommendations
7.6.5 Evaluating the tag recommender
7.7 Summary
7.8 References
Chapter 8 Building an example question answering system
8.1 Basics of a question answering system
8.2 Installing and running the QA code
8.3 A sample question answering architecture
8.4 Understanding questions and producing answers
8.4.1 Training the answer type classifier
8.4.2 Chunking the query
8.4.3 Computing the answer type
8.4.4 Generating the query
8.4.5 Ranking candidate passages
8.5 Steps to improve the system
8.6 Summary
8.7 Resources
Chapter 9 Untamed text: exploring the next frontier
9.1 Semantics, discourse, and pragmatics: exploring higher levels of NLP
9.1.1 Semantics
9.1.2 Discourse
9.1.3 Pragmatics
9.2 Document and collection summarization
9.3 Relationship extraction
9.3.1 Overview of approaches to relationship extraction
9.3.2 Evaluation
9.3.3 Tools for relationship extraction
9.4 Identifying important content and people
9.4.1 Global importance and authoritativeness
9.4.2 Personal importance
9.4.3 Resources and pointers on importance
9.5 Detecting emotions via sentiment analysis
9.5.1 History and review
9.5.2 Tooling and data needs
9.5.3 A basic polarity algorithm
9.5.4 Advanced topics
9.5.5 Open source libraries for sentiment analysis
9.6 Cross-language information retrieval
9.7 Summary
9.8 Resources