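Articles in this series
✓ Word vectors
✗ Adam, SGD
✗ Vanishing and exploding gradients
✗ Initialization methods
✗ Overfitting & underfitting
✗ Evaluation metrics & loss functions
✗ Deep learning models and common tasks
✗ Time complexity of RNNs
✗ The neo4j graph database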
Tokenization and word vectors
[Keyword extraction: TF-IDF (Part 1)]
TfidfVectorizer
Basic introduction
# Python 3.8
# scikit-learn 0.23.1
# 1. CountVectorizer converts a collection of text documents into a sparse matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
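# Show the vocabulary; each term's position in the list is its column index
# (get_feature_names() was removed in scikit-learn 1.2; on 1.0 and later use get_feature_names_out())
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# Show the count matrix (one row per document, one column per term)
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]

# 2. TfidfTransformer: converts a count matrix into tf-idf weights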
from sklearn.feature_extraction.text import TfidfTransformer
transform = TfidfTransformer()
Y = transform.fit_transform(X)
print(Y.toarray())  # print the tf-idf values
# [[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
#  [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
#  [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
#  [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]

# 3. TfidfVectorizer: equivalent to CountVectorizer followed by TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()  # build a vectorizer that computes tf-idf weights directly from raw text
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.shape)
# (4, 9)
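To see where the tf-idf values printed above come from, the following is a minimal sketch in plain Python that recomputes them from the count matrix. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), for which the documented weighting is idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The helper name `tfidf_rows` is made up for this illustration and is not part of scikit-learn.

```python
import math

# Count matrix copied from the CountVectorizer output above.
# Columns: and, document, first, is, one, second, the, third, this
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]


def tfidf_rows(counts):
    """Hypothetical helper: smoothed idf plus L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0])

    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    # Smoothed idf, assuming the smooth_idf=True default.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    result = []
    for row in counts:
        # Raw term count times idf.
        weighted = [tf * w for tf, w in zip(row, idf)]
        # Scale the row to unit Euclidean length (norm='l2').
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


tfidf = tfidf_rows(counts)
for row in tfidf:
    print([round(v, 8) for v in row])
```

Each printed row should agree with the corresponding row of `Y.toarray()` above. Because every row is normalized to unit length, a term that occurs in all four documents (such as "is" or "the") still receives a nonzero weight, just a smaller one than rarer terms such as "first" or "second".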