TensorFlow 2.0 Tutorial: Text Classification with an RNN

Author: Doit | 2019-05-07

This is part of a comprehensive, continuously updated TensorFlow 2.0 beginner tutorial series. For the complete tensorflow2.0 tutorial code, see https://github.com/czy36mengfei/tensorflow2_tutorials_chinese (stars welcome).

This tutorial is compiled from my personal study notes reproducing the official TensorFlow 2.0 tutorials, as a walkthrough for readers who prefer a Chinese-language version. The official tutorials are at https://www.tensorflow.org.

1. Building the input data with tensorflow_datasets

```python
# Install TensorFlow Datasets (one-time setup)
!pip install -q tensorflow_datasets

import tensorflow as tf
import tensorflow_datasets as tfds

# Load the IMDB reviews dataset, pre-encoded with an ~8k subword vocabulary
dataset, info = tfds.load('imdb_reviews/subwords8k',
                          with_info=True,
                          as_supervised=True)
```

Get the training and test sets:

```python
train_dataset, test_dataset = dataset['train'], dataset['test']
```
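With `as_supervised=True`, each element of these datasets is an (encoded review, label) pair. As a quick sanity check (my own addition, not in the original tutorial):

```python
# Peek at one training example: the text is already a vector of subword ids.
for text, label in train_dataset.take(1):
    print(text[:10])  # first ten subword ids of this review
    print(label)      # 0 = negative, 1 = positive
```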

Get the tokenizer object, which handles text-to-id conversion (text is first split into subwords, and the subwords are then mapped to ids):

```python
tokenizer = info.features['text'].encoder
print('vocabulary size: ', tokenizer.vocab_size)
```

```
vocabulary size:  8185
```
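The encoder's subword vocabulary can also be inspected directly (my own check; `subwords` is an attribute of tfds' SubwordTextEncoder):

```python
# Look at the first few entries of the subword vocabulary.
# Id 0 is reserved for padding, so subword ids start at 1.
print(tokenizer.subwords[:5])
```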

Test the tokenizer object:

```python
sample_string = 'Hello word , Tensorflow'

tokenized_string = tokenizer.encode(sample_string)
print('tokened id: ', tokenized_string)

# Decode back to the original string
src_string = tokenizer.decode(tokenized_string)
print('original string: ', src_string)
```

```
tokened id:  [4025, 222, 2621, 1199, 6307, 2327, 2934]
original string:  Hello word , Tensorflow
```

Decode each subword individually:

```python
for t in tokenized_string:
    print(str(t) + '->[' + tokenizer.decode([t]) + ']')
```

```
4025->[Hell]
222->[o ]
2621->[word ]
1199->[, ]
6307->[Ten]
2327->[sor]
2934->[flow]
```

Build the batched training and test sets:

```python
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
# padded_batch pads every sequence in a batch to that batch's longest sequence
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
```
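To see what padded_batch actually produces, here is a quick check (my own sketch, not in the original): each batch is a dense tensor whose second dimension is the length of that batch's longest review, so different batches can have different lengths.

```python
# Inspect the shape of one padded batch.
for texts, labels in train_dataset.take(1):
    print(texts.shape)   # (64, L): L = length of the longest review in this batch
    print(labels.shape)  # (64,)
```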

Model construction

Because the sentences here are variable-length, we build the model with the Sequential API; the commented-out functional-API version below fixes its input length to 1240, which does not fit variable-length batches. (The functional API can in fact also handle variable lengths if the sequence dimension is left unspecified; see the sketch after the commented code.)

```python
# def get_model():
#     inputs = tf.keras.Input((1240,))
#     emb = tf.keras.layers.Embedding(tokenizer.vocab_size, 64)(inputs)
#     h1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(emb)
#     h1 = tf.keras.layers.Dense(64, activation='relu')(h1)
#     outputs = tf.keras.layers.Dense(1, activation='sigmoid')(h1)
#     model = tf.keras.Model(inputs, outputs)
#     return model
```
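For reference, a functional-API variant that accepts variable-length input (my own sketch, using the same layers as the Sequential model below) just leaves the length dimension as None:

```python
# Hypothetical functional-API equivalent with variable-length input.
def get_model_functional():
    inputs = tf.keras.Input(shape=(None,))  # None = any sequence length
    emb = tf.keras.layers.Embedding(tokenizer.vocab_size, 64)(inputs)
    h1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(emb)
    h1 = tf.keras.layers.Dense(64, activation='relu')(h1)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(h1)
    return tf.keras.Model(inputs, outputs)
```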

```python
def get_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    return model

model = get_model()
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
```

Model training

```python
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset)
```

```
Epoch 1/10
391/391 [==============================] - 827s 2s/step - loss: 0.5606 - accuracy: 0.7068 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
...
Epoch 10/10
391/391 [==============================] - 791s 2s/step - loss: 0.1333 - accuracy: 0.9548 - val_loss: 0.6117 - val_accuracy: 0.8199
```

```python
# View the training process
import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel('epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, 'accuracy')
```

[Figure: training and validation accuracy curves]

```python
plot_graphs(history, 'loss')
```

[Figure: training and validation loss curves]

Evaluate on the test set:

```python
test_loss, test_acc = model.evaluate(test_dataset)
print('test loss: ', test_loss)
print('test acc: ', test_acc)
```

```
391/Unknown - 68s 174ms/step - loss: 0.6117 - accuracy: 0.8199
test loss:  0.6117385012262008
test acc:  0.81988
```

The model above does not mask out sequence padding. Since it is trained on padded sequences, its predictions on unpadded sequences may be skewed, as the two sample predictions below show (a mask-aware variant is sketched after them).

```python
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sentence, pad=False):
    tokened_sent = tokenizer.encode(sentence)
    if pad:
        tokened_sent = pad_to_size(tokened_sent, 64)
    pred = model.predict(tf.expand_dims(tokened_sent, 0))
    return pred
```

```python
# Without padding
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
```

```
[[0.2938048]]
```

```python
# With padding
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
```

```
[[0.42541984]]
```
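One common remedy (my own addition, not part of the original tutorial) is to let the Embedding layer emit a mask so that the recurrent layer ignores the zero padding; a minimal sketch:

```python
# Hypothetical mask-aware variant: mask_zero=True makes downstream layers
# skip timesteps whose input id is 0, which is the padding value here.
masked_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```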

Stacking more LSTM layers

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(tokenizer.vocab_size, 64),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
```
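Note the return_sequences=True on the first LSTM: it makes that layer output a hidden state at every timestep rather than only the last one, which is what the second recurrent layer needs as input. A standalone shape check (my own sketch):

```python
# Bidirectional with return_sequences=True keeps the time dimension;
# forward and backward outputs are concatenated: 2 * 64 = 128 units.
bilstm = layers.Bidirectional(layers.LSTM(64, return_sequences=True))
x = tf.random.normal((2, 10, 8))  # (batch, timesteps, features)
print(bilstm(x).shape)            # (2, 10, 128)
```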

```python
model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])
```

```python
history = model.fit(train_dataset, epochs=6,
                    validation_data=test_dataset)
```

```
Epoch 1/6
391/391 [==============================] - 1646s 4s/step - loss: 0.5270 - accuracy: 0.7414 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
...
Epoch 6/6
391/391 [==============================] - 1622s 4s/step - loss: 0.1619 - accuracy: 0.9430 - val_loss: 0.5484 - val_accuracy: 0.7808
```

```python
plot_graphs(history, 'accuracy')
```

[Figure: training and validation accuracy curves for the stacked model]

```python
plot_graphs(history, 'loss')
```

[Figure: training and validation loss curves for the stacked model]

```python
res = model.evaluate(test_dataset)
print(res)
```

```
391/Unknown - 125s 320ms/step - loss: 0.5484 - accuracy: 0.7808
[0.5484032468570162, 0.78084]
```
