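Word2Vec is a model for computing word vectors, proposed by Mikolov et al. at Google.
Input: a large corpus of tokenized text
Output: a dense vector representing each word
The significance of word vectors is that they turn natural language into vectors a computer can work with. Compared with models such as bag-of-words and TF-IDF, word vectors capture a word's context and semantics and can measure the similarity between words, which makes them important in many natural language processing tasks such as text classification and sentiment analysis.
The classic word-vector example: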
$$\vec{man}-\vec{woman}\approx\vec{king}-\vec{queen}$$
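The relation above can be illustrated with a minimal sketch. The four 2-dimensional vectors below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not trained embeddings, and real word vectors satisfy the relation only approximately.

```python
# Hand-made toy vectors (an assumption for illustration, not trained values).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
vectors = {
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
}

def subtract(a, b):
    """Element-wise difference of two equally long vectors."""
    return [x - y for x, y in zip(a, b)]

gender_offset = subtract(vectors["man"], vectors["woman"])
royal_offset = subtract(vectors["king"], vectors["queen"])

# In this toy setup both differences are the same "gender" direction.
print(gender_offset, royal_offset)
```

With trained vectors the two differences would be compared with cosine similarity rather than tested for equality.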
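gensim already wraps a word2vec implementation in Python, so if you have a corpus you can train directly; see the Word2Vec experiments on the Chinese and English Wikipedia corpora.
Being able to train word vectors with gensim does not mean you have really mastered word2vec; it only means you can read documentation and call an API.
In short, the detailed implementation of word2vec is a three-layer neural network. To understand it, the prerequisites are neural networks and Logistic Regression.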
Neural network structure
Figure: word2vec schematic
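The figure above is a simplified flowchart of word2vec. First assume the vocabulary holds 10000 words and the word vectors have length 300 (according to Stanford CS224d, word vectors generally have 25 to 1000 dimensions, and 300 is a good choice). The following walks through each part in turn for a single training sample.
1. Input layer: the input is the one-hot vector of a word, of length 10000. Suppose the word is ants and its ID in the vocabulary is i; then the i-th component of the input vector is 1 and the rest are 0: [0, 0, ..., 0, 0, 1, 0, 0, ..., 0, 0]
2. Hidden layer: the number of hidden neurons equals the word-vector length. The hidden layer's parameters form a [10000, 300] matrix. This parameter matrix is in fact the word vectors. Recall matrix multiplication: multiplying a one-hot row vector by a matrix yields the i-th row of the matrix. Passing through the hidden layer therefore maps the 10000-dimensional one-hot vector to the 300-dimensional word vector we ultimately want.
Figure: matrix multiplication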
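The row-selection argument can be checked with a minimal sketch in plain Python. It assumes a toy vocabulary of 5 words and 3-dimensional vectors instead of 10000 and 300; the matrix values and the helper name `vec_mat_mul` are made up for illustration.

```python
# Toy sizes (assumed for illustration): 5 words, 3-dimensional vectors.
vocab_size, embedding_dim = 5, 3

# Hidden-layer weight matrix W with shape [vocab_size, embedding_dim].
# The values are arbitrary; each row just needs to be distinguishable.
W = [[r + 0.1 * (c + 1) for c in range(embedding_dim)] for r in range(vocab_size)]

def vec_mat_mul(x, M):
    """Multiply a row vector x (length n) by a matrix M with n rows."""
    cols = len(M[0])
    return [sum(x[r] * M[r][c] for r in range(len(M))) for c in range(cols)]

i = 3  # hypothetical vocabulary ID of the input word
one_hot = [1.0 if r == i else 0.0 for r in range(vocab_size)]

hidden = vec_mat_mul(one_hot, W)  # full matrix product
lookup = W[i]                     # direct row lookup

# Both routes give the same vector: the product simply selects row i.
print(hidden)
print(lookup)
```

This is why practical implementations replace the multiplication with a table lookup, as the `tf.nn.embedding_lookup` call in the code further down does.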
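3. Output layer: the output layer has 10000 neurons, one per word in the vocabulary, and its parameter matrix has shape [300, 10000]. The word vector is multiplied by this matrix and then normalized with softmax, giving a 10000-dimensional vector again; each dimension is the probability that the corresponding vocabulary word co-occurs with the input word (here, ants) in a context.
Figure: output layer
The figure above computes the probability that car co-occurs with ants; the 300-dimensional column vector for car is one column of the output layer's parameter matrix. Since that matrix is [300, 10000], the layer computes the co-occurrence probability with ants for every word in the vocabulary. The output layer's parameter matrix is not used after training.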
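As a rough sketch of the output layer, the snippet below multiplies a word vector by an output matrix and applies softmax. The sizes (3-dimensional vector, 4-word vocabulary) and all numeric values are assumptions chosen to keep the example readable.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Word vector of the input word (made-up values).
hidden = [0.5, -1.0, 2.0]

# Output matrix with shape [embedding_dim, vocab_size]; column j belongs to word j.
W_out = [
    [0.2, -0.4, 0.1, 0.0],
    [0.7, 0.3, -0.2, 0.5],
    [-0.1, 0.6, 0.4, -0.3],
]

vocab_size = len(W_out[0])
# One score per vocabulary word: dot product of hidden with that word's column.
scores = [sum(hidden[k] * W_out[k][j] for k in range(len(hidden)))
          for j in range(vocab_size)]
probs = softmax(scores)

print(probs)
```

Each entry of `probs` plays the role of one co-occurrence probability; with a real vocabulary the list would have 10000 entries.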
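4. Training: a training sample (x, y) has both an input and an output. We know which word actually co-occurs with ants, so y is also a 10000-dimensional (one-hot) vector. The loss function is similar to that of Logistic Regression: it is the cross-entropy between the network's final output vector and y. The parameters are then fitted with stochastic gradient descent.
Figure: cross-entropy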
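Because y is one-hot, the cross-entropy sum over all words collapses to the negative log of the probability assigned to the true word. A minimal sketch, with a made-up 4-word probability vector standing in for the softmax output:

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy between a predicted distribution and a one-hot label.

    With a one-hot label, -sum(y_j * log(p_j)) reduces to -log(p_target).
    """
    return -math.log(probs[target_index])

# Hypothetical softmax output over a 4-word vocabulary.
probs = [0.1, 0.6, 0.2, 0.1]

loss_good = cross_entropy(probs, 1)  # true word received probability 0.6
loss_bad = cross_entropy(probs, 0)   # true word received probability 0.1

print(loss_good, loss_bad)
```

The loss is small when the true word gets high probability and grows as that probability shrinks, which is what gradient descent pushes against.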
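The steps above cover the case of one word as input and one context word as output, but the real situation is clearly more complex. What is the context? Do we use one word to predict the surrounding words, or many surrounding words to predict one word? This is where the two models used in actual training come in: skip-gram and CBOW.
skip-gram: the core idea is to predict the surrounding words from the center word. Suppose the center word is cat and the window length is 2; then cat is used to predict the two words to its left and the two words to its right. Here cat is the network's input and each predicted word is a label. The figure below shows an example:
Figure: skip-gram
Here the window length is 2 and the center word moves one position at a time across the entire text. Each move of the center word produces at most 4 training pairs (input, label).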
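The pair generation can be sketched as follows. The function name `skipgram_pairs` and the example sentence are assumptions for illustration; unlike the TensorFlow batch generator later in this article, this sketch emits every pair in the window rather than a random subset.

```python
def skipgram_pairs(tokens, window):
    """Return (input, label) pairs: each center word paired with every
    word at most `window` positions to its left or right."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        start = max(0, center_pos - window)
        stop = min(len(tokens), center_pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != center_pos:
                pairs.append((center, tokens[ctx_pos]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)

# Words near the edges of the text yield fewer than 4 pairs.
for pair in pairs[:5]:
    print(pair)
```

A center word in the middle of the text, such as brown, contributes exactly 4 pairs; the first word contributes only 2.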
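CBOW (continuous bag-of-words): once skip-gram is understood, CBOW is simply the reverse: all the surrounding words are used to predict the center word. In this case each move of the center word produces only one training sample. Using the same example as above, CBOW produces the following 4 training samples:
([quick, brown], the)
([the, brown, fox], quick)
([the, quick, fox, jumps], brown)
([quick, brown, jumps, over], fox)
Now the input may well be 4 words while the label is a single word. What to do? The answer is simple: take the average. After the hidden layer, the 4 input words have been mapped to four 300-dimensional vectors; average these 4 vectors and use the result as the input to the next layer.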
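A minimal sketch of CBOW sample generation and of the averaging step. The helper names and the 2-dimensional embedding values are made up for illustration; real vectors would have 300 dimensions.

```python
def cbow_samples(tokens, window):
    """Return (context_words, center_word) samples, one per center position."""
    samples = []
    for pos, center in enumerate(tokens):
        left = tokens[max(0, pos - window):pos]
        right = tokens[pos + 1:pos + 1 + window]
        samples.append((left + right, center))
    return samples

def average_vectors(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

tokens = "the quick brown fox jumps over".split()
samples = cbow_samples(tokens, window=2)
for context, center in samples[:4]:
    print(context, center)

# Hypothetical 2-dimensional embeddings for the context words of "brown".
embeddings = {
    "the": [1.0, 0.0],
    "quick": [0.0, 1.0],
    "fox": [2.0, 2.0],
    "jumps": [1.0, 3.0],
}
context, center = samples[2]
hidden = average_vectors([embeddings[w] for w in context])
print(center, hidden)
```

The averaged vector `hidden` then feeds the output layer exactly as a single word vector would in skip-gram.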
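Comparing the two, skip-gram generates more training samples and captures more semantic detail between words; under ideal conditions with a sufficiently large and clean corpus, skip-gram outperforms CBOW. With a smaller corpus, where it is hard to capture enough detail between words, CBOW's averaging may actually work better.
In actual training, still assuming a 10000-word vocabulary and 300-dimensional vectors, each layer of the network has 3 million parameters, and the output layer amounts to a multi-class classification problem with 10000 possible classes. As you can imagine, the computational cost is very large.
Mikolov et al. proposed a number of optimizations; here we focus on negative sampling.
The idea of negative sampling is extremely simple. The network's softmax outputs a vector in which only one entry, ideally the one with the highest probability, corresponds to the correct word; all the others are called negative samples. If we select only 5 negative samples, the output becomes just a 6-dimensional vector, and the output-layer parameters to consider drop from 3 million to 6 × 300 = 1800. This looks lazy, but it works well in practice and greatly improves computational efficiency.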
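The parameter counts quoted above follow from simple arithmetic, spelled out here as a small sketch (the variable names are chosen for illustration):

```python
vocab_size = 10000
embedding_dim = 300
num_negative = 5

# Full softmax: every output column (one per vocabulary word) takes part.
full_output_params = vocab_size * embedding_dim

# Negative sampling: only the true word plus the sampled negatives take part.
sampled_output_params = (1 + num_negative) * embedding_dim

print(full_output_params)     # 3000000
print(sampled_output_params)  # 1800
print(sampled_output_params / full_output_params)
```

Per training sample, the output layer therefore touches well under one tenth of a percent of its weights.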
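Training a neural network makes a small adjustment to its parameters at each step. In word2vec, a single training sample does not modify all the parameters. Suppose the input word is cat: the hidden layer has 3 million parameters, but this step only modifies the 300 parameters belonging to cat, because the hidden layer's output depends only on those 300 parameters.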
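Why only one row is updated can be seen from the gradient itself. For hidden = x · W, the gradient with respect to W[r][c] is x[r] times the gradient arriving at hidden[c]; since x is one-hot, every row except the input word's row gets a zero gradient. A minimal sketch with toy sizes and a made-up upstream gradient:

```python
# Toy sizes (assumed): 5 words, 3-dimensional vectors.
vocab_size, embedding_dim = 5, 3

i = 2  # hypothetical vocabulary ID of the input word
one_hot = [1.0 if r == i else 0.0 for r in range(vocab_size)]

# Made-up gradient of the loss with respect to the hidden layer output.
upstream = [0.3, -0.1, 0.7]

# dL/dW[r][c] = x[r] * dL/dhidden[c]
grad_W = [[one_hot[r] * upstream[c] for c in range(embedding_dim)]
          for r in range(vocab_size)]

# Rows with at least one non-zero gradient entry.
touched_rows = [r for r, row in enumerate(grad_W) if any(v != 0.0 for v in row)]
print(touched_rows)
```

Only the row of the input word receives an update, which corresponds to the 300 parameters mentioned above.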
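Negative sampling is effective, and we do not need many negative samples. Mikolov et al. state in the paper that 5 to 20 negative samples work for small datasets, and 2 to 5 for large datasets.
So how exactly are the negative-sample words chosen? The paper gives the following formula:
$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j} f(w_j)^{3/4}}$$
where f(w) is the word frequency. As you can see, the choice of negative samples depends only on word frequency: the more frequent a word, the more likely it is to be selected.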
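A minimal sketch of frequency-based negative sampling, assuming the 3/4-power unigram distribution described in the cited Mikolov et al. paper. The word counts and the function name `negative_sampling_probs` are made up for illustration.

```python
import random

def negative_sampling_probs(counts, power=0.75):
    """Raise each word count to `power` and renormalize into probabilities."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

# Made-up word counts.
counts = {"the": 1000, "cat": 100, "ants": 10}
probs = negative_sampling_probs(counts)
print(probs)

# Draw 5 negative samples according to these probabilities.
rng = random.Random(0)  # fixed seed so the draw is repeatable
words = list(probs)
negatives = rng.choices(words, weights=[probs[w] for w in words], k=5)
print(negatives)
```

Frequent words are still picked most often, but the 3/4 power flattens the distribution slightly, so rare words are sampled a little more than their raw frequency would suggest.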
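Finally, let's get hands-on with tensorflow, following an assignment from Udacity Deep Learning.
Here we only train 128-dimensional word vectors and visualize them with TSNE. It is good for practice and for understanding word2vec in depth; for real-world use, gensim is still recommended.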
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
Download the data from the source website if necessary.
url = 'http://mattmahoney.net/dc/'
def maybe_download(filename, expected_bytes):
    '''Download a file if not present, and make sure it's the right size.'''
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename
filename = maybe_download('text8.zip', 31344016)
Output: Found and verified text8.zip
Read the data into a string.
def read_data(filename):
    '''Extract the first file enclosed in a zip file as a list of words'''
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data
words = read_data(filename)
print('Data size %d' % len(words))
Output: Data size 17005207
Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000
def build_dataset(words):
    # count[0] is a placeholder for the UNK token; its count is filled in below.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # Map each word to an integer ID; more frequent words get smaller IDs.
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.
Output:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
Function to generate a training batch for the skip-gram model.
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            # Pick a context position that has not been used for this center word.
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        # Slide the window one word to the right.
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
print('data:', [reverse_dictionary[di] for di in data[:8]])
for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
Output:
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']

with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
    labels: ['anarchism', 'as', 'originated', 'a', 'as', 'term', 'a', 'of']

with num_skips = 4 and skip_window = 2:
    batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a']
    labels: ['originated', 'term', 'anarchism', 'a', 'of', 'as', 'originated', 'term']
Train a skip-gram model.
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
#######important#########
num_sampled = 64  # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    # Model.
    # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                                   labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
    # Optimizer.
    # Note: The optimizer will optimize the softmax_weights AND the embeddings.
    # This is because the embeddings are defined as a variable quantity and the
    # optimizer's `minimize` method will by default modify all variable quantities
    # that contribute to the tensor it is passed.
    # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                # Index 0 of the sorted result is the word itself, so skip it.
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
num_points = 400
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
# Skip index 0 (the UNK token) and project the next num_points words to 2-D.
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])
def plot(embeddings, labels):
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    pylab.figure(figsize=(15, 15))  # in inches
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()
words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)
Figure: skip-gram visualization
data_index_cbow = 0
def get_cbow_batch(batch_size, num_skips, skip_window):
    global data_index_cbow
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    # First build skip-gram style (center, context) pairs.
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index_cbow])
        data_index_cbow = (data_index_cbow + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index_cbow])
        data_index_cbow = (data_index_cbow + 1) % len(data)
    # Then swap roles: context words become the input, the center word the label.
    # The regrouping below assumes num_skips == 2 * skip_window.
    cbow_batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    cbow_labels = np.ndarray(shape=(batch_size // (skip_window * 2), 1), dtype=np.int32)
    for i in range(batch_size):
        cbow_batch[i] = labels[i, 0]
    cbow_batch = np.reshape(cbow_batch, [batch_size // (skip_window * 2), skip_window * 2])
    for i in range(batch_size // (skip_window * 2)):
        # center word
        cbow_labels[i] = batch[2 * skip_window * i]
    return cbow_batch, cbow_labels
# actual batch_size = batch_size // (2 * skip_window)
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
#######important#########
num_sampled = 64  # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), skip_window * 2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    # Model.
    # Look up embeddings for inputs.
    # Result shape: [examples, 2 * skip_window, embedding_size].
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # Average over the context-word axis (axis 1) to get one vector per example.
    # The original reshaped to put the context axis first and averaged over
    # axis 0; a reshape is not a transpose, so that mixed different examples.
    embed = tf.reduce_mean(embed, 1)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                                   labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
    # Optimizer.
    # Note: The optimizer will optimize the softmax_weights AND the embeddings.
    # This is because the embeddings are defined as a variable quantity and the
    # optimizer's `minimize` method will by default modify all variable quantities
    # that contribute to the tensor it is passed.
    # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = get_cbow_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                # Index 0 of the sorted result is the word itself, so skip it.
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
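num_points = 400
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])
# Labels must cover the same word IDs (1..num_points) as the embeddings above;
# the original started the range at 200, which mislabeled the points.
words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)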
Figure: CBOW visualization
1. Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[J]. 2014, 4:II-1188.
2. Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26:3111-3119.
3. Word2Vec Tutorial - The Skip-Gram Model
4. Udacity Deep Learning
5. Stanford CS224d Lecture 2, 3
Original article: https://www.jianshu.com/p/b779f8219f74