Using pre-trained word embeddings in a Keras model

Article information

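In this tutorial you will learn how to solve a text classification problem using pre-trained word embeddings and a convolutional neural network. The code for this article is available on GitHub.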

Original article: http://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Author: Francois Chollet

What are word embeddings?

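"Word embeddings" are a family of natural language processing techniques that map the meaning of words into a vector space. Each word is represented by a specific vector, and the distance between two vectors (for example the L2 distance or, more commonly, the cosine distance) reflects to some degree the semantic relationship between the corresponding words. The geometric space formed by these vectors is called an embedding space.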

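For example, "coconut" and "polar bear" are semantically very different words, so in a reasonable embedding space their vectors would lie far apart. "Kitchen" and "dinner", on the other hand, are related words, so the distance between their vectors should be comparatively small.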
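The two distances mentioned above can be sketched in a few lines of plain Python. The three-dimensional vectors below are invented purely for illustration (real embeddings, such as the ones used later, have 100 dimensions), and the function names are hypothetical.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 minus the cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from any trained embedding.
kitchen = [0.9, 0.8, 0.1]
dinner = [0.8, 0.9, 0.2]
polar_bear = [0.1, 0.0, 0.9]

print(cosine_distance(kitchen, dinner))      # small: related words
print(cosine_distance(kitchen, polar_bear))  # large: unrelated words
```

Cosine distance ignores vector length and compares direction only, which is one reason it is often preferred for word vectors.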

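Ideally, in a good embedding space, the "path" vector from the "dinner" vector to the "kitchen" vector would precisely capture the semantic relationship between these two concepts. In this case the relationship is "where x occurs", so you would expect the vector "kitchen" - "dinner" (the difference between the two word vectors) to capture that relationship. In other words, we should have the vector identity dinner + (where x occurs) = kitchen, at least approximately. If that is the case, we can use such a relationship vector to answer questions. For example, applying it to a new vector such as "work" should give a meaningful identity, work + (where x occurs) = office, which answers the question "where does work occur?".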
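The vector arithmetic described above can be illustrated with a toy example. The two-dimensional vectors here are made up so that the relationship holds by construction; they are not taken from any trained model, and the helper names are hypothetical.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def nearest(query, candidates):
    """Return the word whose vector has the smallest squared L2 distance to query."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(query, candidates[word]))
    return min(candidates, key=sq_dist)

# Hypothetical toy embedding space, built so that the second axis
# roughly encodes "the place where x occurs".
vectors = {
    "dinner": [1.0, 0.0],
    "kitchen": [1.0, 1.0],
    "work": [3.0, 0.0],
    "office": [3.0, 1.1],
    "coconut": [-2.0, -3.0],
}

# Relationship vector: kitchen - dinner
where_it_occurs = sub(vectors["kitchen"], vectors["dinner"])

# Apply the relationship to "work" and look up the closest other word.
query = add(vectors["work"], where_it_occurs)
candidates = {w: v for w, v in vectors.items() if w != "work"}
answer = nearest(query, candidates)
print(answer)
```

In a real embedding space the identity only holds approximately, which is why the answer is taken to be the nearest neighbour of the query vector rather than an exact match.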

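Word embeddings are computed by applying dimensionality reduction techniques to the co-occurrence statistics of words in a text corpus. This can be done with neural networks (the "word2vec" technique) or with matrix factorization.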
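To make "co-occurrence statistics" concrete, here is a minimal sketch that counts how often two words appear within a fixed window of each other. The function name, the window size and the two sentences are assumptions chosen for illustration; real methods such as GloVe use weighted counts over very large corpora and then factorize the resulting matrix.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs that appear at most `window` tokens apart."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical two-sentence corpus.
corpus = [
    "we eat dinner in the kitchen",
    "we cook dinner in the kitchen",
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("dinner", "in")])       # adjacent in both sentences
print(counts[("dinner", "kitchen")])  # three tokens apart, outside the window
```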

GloVe word embeddings

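This article uses GloVe embeddings. GloVe stands for "Global Vectors for Word Representation", a kind of word embedding based on factorizing a co-occurrence matrix. The GloVe embeddings used here were trained on a 2014 dump of English Wikipedia; they cover 400k distinct words, each represented by a 100-dimensional vector. The embeddings are available for download (note that the file is about 822 MB).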

20 Newsgroup dataset

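The dataset used in this article is the well-known "20 Newsgroup dataset". It contains news posts from 20 categories, and our task is to classify each post into its category. A description of the dataset and the download are available from the dataset's home page.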

Posts from different categories use very different vocabulary and differ greatly in meaning. Some of the categories are listed below:

comp.sys.ibm.pc.hardware
comp.graphics
comp.os.ms-windows.misc
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

Approach

Here are the steps we will follow to solve the classification problem:

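Convert all news samples into sequences of word indices. A "word index" is simply an integer ID assigned to each word. Across all the texts we keep only the 20,000 most common words, and each text is limited to at most 1000 words.
Build an embedding matrix. Row i holds the embedding vector of the word whose index is i.
Load this embedding matrix into a Keras Embedding layer and set the layer's weights to be non-trainable (that is, the word vectors will not change during the subsequent training of the network).
Stack a 1D convolutional network on top of the Keras Embedding layer, ending in a softmax fully connected layer that outputs the news category.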

Preparing the data

We first iterate over the folders of the corpus directory to collect the news texts of each category together with their class labels:

import os

texts = []         # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []        # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                # latin-1 can decode any byte, so reading never fails under Python 3
                with open(fpath, encoding='latin-1') as f:
                    texts.append(f.read())
                labels.append(label_id)

print('Found %s texts.' % len(texts))

Next, we convert the news samples into tensors that can be fed to a neural network. For this we use the Keras utilities keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences:

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# values stated in the text: 20,000 words, 1000 words per text, 4:1 split
MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
VALIDATION_SPLIT = 0.2

# num_words was called nb_words in Keras 1
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
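To give an intuition for what the tokenizer and the padding step do, here is a simplified pure-Python sketch. It is not the Keras implementation: it splits on whitespace only, and all function names are hypothetical. It assumes the behaviour of the Keras defaults, namely that indices start at 1 in order of decreasing frequency, that only words with an index below num_words are kept, and that sequences are padded and truncated at the front.

```python
from collections import Counter

def build_word_index(texts):
    """Assign integer ids starting at 1, most frequent word first (0 is padding)."""
    counts = Counter(w for text in texts for w in text.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its id, dropping words whose id is >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for w in text.lower().split():
            i = word_index.get(w)
            if i is not None and i < num_words:
                seq.append(i)
        sequences.append(seq)
    return sequences

def pad_sequence(seq, maxlen):
    """Keep the last maxlen ids and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Hypothetical miniature corpus.
toy_texts = ["the cat sat on the mat", "the dog"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=5)
toy_data = [pad_sequence(s, maxlen=4) for s in toy_sequences]
print(toy_data)
```

After this step every sample has the same length, so the whole dataset can be stored as a single 2D integer array.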

Preparing the Embedding layer

Next, we parse the GloVe file to extract each word and its corresponding embedding vector, and store them in a dictionary:

embeddings_index = {}
# each line of the file holds a word followed by its coefficients
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
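The file format parsed above is plain text: one word per line, followed by its coefficients separated by spaces. The sketch below applies the same parsing logic to a tiny in-memory sample so it can be tried without downloading the real file. The sample words and numbers are invented, and the function name is hypothetical.

```python
import io

def load_vectors(lines):
    """Parse 'word c1 c2 ... cn' lines into a dict mapping word to a list of floats."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        index[values[0]] = [float(x) for x in values[1:]]
    return index

# Hypothetical 3-dimensional sample in the same layout as the GloVe text file.
sample_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.4 0.5 -0.6\n"
)
sample_index = load_vectors(sample_file)
print('Found %s word vectors.' % len(sample_index))
```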

With this dictionary we can now build the embedding matrix defined earlier:

EMBEDDING_DIM = 100  # dimensionality of the GloVe vectors used here

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

We now load this embedding matrix into an Embedding layer. Note that we set trainable=False so that the weights of this layer are not updated during training.

from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

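An Embedding layer should be fed sequences of integers, that is, a 2D input of shape (samples, indices): a matrix with one row per sample and one column per index. The sequences within a batch should be padded to the same length (although an Embedding layer is able to process sequences of varying length if you do not pass an explicit input_length argument). Every integer in a sequence is replaced by the corresponding row of the embedding matrix (its word vector); for example, the sequence [1, 2] becomes [embeddings[1], embeddings[2]]. So a 2D input tensor yields a 3D output tensor of shape (samples, sequence_length, embedding_dim).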
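The lookup performed by the Embedding layer can be mimicked with nested Python lists. This is only a sketch of the idea under toy assumptions (a three-word vocabulary, 3-dimensional vectors, hypothetical names), not how Keras implements the layer. It also shows that a word missing from the pre-trained dictionary keeps an all-zero row.

```python
TOY_DIM = 3

# Hypothetical vocabulary and pre-trained vectors; "zzz" has no pre-trained vector.
toy_word_index = {"the": 1, "cat": 2, "zzz": 3}
toy_embeddings_index = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}

# Row 0 is reserved for padding, hence len(toy_word_index) + 1 rows.
toy_matrix = [[0.0] * TOY_DIM for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)

def embed(batch, matrix):
    """Replace every integer id in a 2D batch by the matching matrix row."""
    return [[matrix[i] for i in sequence] for sequence in batch]

batch = [[0, 1, 2], [1, 3, 0]]  # shape (samples=2, sequence_length=3)
output = embed(batch, toy_matrix)
output_shape = (len(output), len(output[0]), len(output[0][0]))
print(output_shape)  # (samples, sequence_length, embedding_dim)
```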

Training a 1D convnet

Finally, we can use a small 1D convolutional network to solve the news classification problem.

from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

# happy learning!
# epochs was called nb_epoch in Keras 1
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=2, batch_size=128)
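The comment "global max pooling" on MaxPooling1D(35) can be checked with a little arithmetic. The sketch below tracks the sequence length through the network, assuming the Keras defaults of 'valid' padding and stride 1 for Conv1D, and a stride equal to the pool size for MaxPooling1D. The helper names are hypothetical.

```python
def conv1d_length(n, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return n - kernel_size + 1

def max_pool1d_length(n, pool_size):
    """Output length of max pooling with stride == pool_size and 'valid' padding."""
    return (n - pool_size) // pool_size + 1

lengths = [1000]  # MAX_SEQUENCE_LENGTH as stated in the text
for layer, size in [("conv", 5), ("pool", 5),
                    ("conv", 5), ("pool", 5),
                    ("conv", 5), ("pool", 35)]:
    previous = lengths[-1]
    if layer == "conv":
        lengths.append(conv1d_length(previous, size))
    else:
        lengths.append(max_pool1d_length(previous, size))

print(lengths)
```

The last convolution leaves 35 time steps, so a pool of size 35 reduces the sequence to a single step, and Flatten then yields one 128-dimensional feature vector per sample.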

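After two epochs, this model reaches a classification accuracy of 0.95 (with a 4:1 split between training and validation data). You could probably get higher accuracy by adding regularization (such as dropout) or by fine-tuning the Embedding layer.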

As a comparison, we can also skip the GloVe vectors and let a Keras Embedding layer learn the word embeddings from scratch:

embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH)

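After two epochs this gives an accuracy of 0.9, so using pre-trained embeddings as features is clearly effective here. In general, for natural language processing tasks where very little training data is available, pre-trained word embeddings are a viable choice (in fact, they bring in external semantic information, which often helps the model considerably).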

The following section was added by the translator

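In China, Rachel-Zhang has run experiments on the same dataset using traditional machine learning algorithms in sklearn. In the paper that introduced GloVe, Richard Socher and his co-authors report that GloVe performs better than word2vec [1]. Later work suggests that word2vec and GloVe each have their own strengths: for example, Schnabel et al. proposed a set of evaluation measures for word embeddings, and under those measures word2vec outperforms GloVe and C&W embeddings on most of them [2]. The implementation in this article could be reused for a further comparison with the word2vec embeddings trained on Google News.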

References

[1]: Pennington J, Socher R, Manning C D. Glove: Global Vectors for Word Representation[C]//EMNLP. 2014, 14: 1532-1543

[2]: Schnabel T, Labutov I, Mimno D, et al. Evaluation methods for unsupervised word embeddings[C]//Proc. of EMNLP. 2015
