Classifying Fake News from Real News

In this blog post, I will build a fake news detector using a TensorFlow word embedding and visualize the learned embedding using Plotly.

§1. Setup

First, we need to import all the necessary packages:

import numpy as np
import pandas as pd
import tensorflow as tf
import re
import string

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
from tensorflow import keras

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# for stopwords
from nltk.corpus import stopwords

# for embedding viz
import plotly.express as px 
import plotly.io as pio
pio.templates.default = "plotly_white"
import matplotlib.pyplot as plt

Acquiring Training Data

Next, let’s go ahead and read in our training data and take a look at the dataset we have been given.

train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv(train_url)
df.head()
Unnamed: 0 title text fake
0 17366 Merkel: Strong result for Austria's FPO 'big c... German Chancellor Angela Merkel said on Monday... 0
1 5634 Trump says Pence will lead voter fraud panel WEST PALM BEACH, Fla.President Donald Trump sa... 0
2 17487 JUST IN: SUSPECTED LEAKER and “Close Confidant... On December 5, 2017, Circa s Sara Carter warne... 1
3 12217 Thyssenkrupp has offered help to Argentina ove... Germany s Thyssenkrupp, has offered assistance... 0
4 5535 Trump say appeals court decision on travel ban... President Donald Trump on Thursday called the ... 0

We see that each row of the data corresponds to an article. The title column gives the title of the article, while the text column gives the full article text. The final column, called fake, is 0 if the article is true and 1 if the article contains fake news, as labeled by the creators of the dataset.
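
As an optional sanity check (not part of the original pipeline), we can confirm the size of the dataset and verify that no values are missing before moving on:

# quick look at the shape of the data and any missing values
print(df.shape)          # (number of articles, number of columns)
print(df.isna().sum())   # count of missing entries in each column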

§2. Make the Dataset

Our next step is to remove the stopwords from the title and text columns. A stopword is a word that is usually considered to be uninformative, such as “the,” “and,” or “but.” To do this we first run the following code:

import nltk
nltk.download('stopwords')
stop = stopwords.words('english')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Now let’s construct and return a tf.data.Dataset with two inputs and one output. The input will be of the form (title, text), and the output will consist only of the fake column. We will also batch our data to help increase the speed of training.

def make_dataset(df):
  '''
  removes stopwords from df and then converts
  the dataframe to a tf.data.dataset.
  '''
  # remove stopwords
  df['title_without_stopwords'] = df['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
  df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

  # specify inputs and outputs of tf.data.dataset
  my_data_set = tf.data.Dataset.from_tensor_slices(
      ( # dictionary for input data/features
          {
              "title" : df[["title_without_stopwords"]], 
              "text" : df[["text_without_stopwords"]]
          }, 
           # dictionary for output data/labels
          {
              "fake" : df[["fake"]]
          }
      )
  )
  my_data_set = my_data_set.batch(100)
  return my_data_set

Split the Dataset

Now, let’s call our function and construct our dataset. Then we split it into training, validation, and test sets: roughly 70% of the batches for training, 20% for validation, and 10% for testing.

data = make_dataset(df) 

# 70% train, 20% validation, 10% test
train_size = int(0.7*len(data)) 
val_size = int(0.2*len(data))

# get validation and training and test data
train = data.take(train_size) # data[:train_size]
val = data.skip(train_size).take(val_size) # data[train_size : train_size + val_size]
test = data.skip(train_size+val_size) #  data[train_size + val_size:]
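
Because we batched the data before splitting, take() and skip() operate on batches of 100 articles rather than on individual rows. As a quick optional check, we can count the batches in each split:

# len() of a batched tf.data.Dataset counts batches, not rows
print(len(train), "training batches")    # roughly 70% of the batches
print(len(val), "validation batches")    # roughly 20%
print(len(test), "test batches")         # the remaining ~10%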

Base Rate

The base rate refers to the accuracy of a model that always makes the same guess (for example, such a model might always say “fake news!”). Let’s determine the base rate for this dataset by examining the labels on the training set.

#iterate through the labels on the training data 
labels_iterator = train.unbatch().map(lambda text, fake: fake).as_numpy_iterator()

true = 0
fake = 0

for labels in labels_iterator:
  #if label = 0, add to true
  if labels['fake'] == 0:
      true += 1
  #if label = 1, add to fake
  else:
      fake += 1

print(str("Articles labeled as true:"), true)
print(str("Articles labeled as fake:"), fake)

#how often will the model identify an article as true
base_rate = fake / (true + fake)
base_rate = str(round(base_rate*100, 2))
print(str("The base model will predict 'fake'"), base_rate, str("% of the time."))
Articles labeled as true: 7483
Articles labeled as fake: 8217
The base model will predict 'fake' 52.34 % of the time.

There are 7,483 articles labeled as true and 8,217 articles labeled as fake in the training split. Thus, a base model that always guesses “fake” would be correct about 52.34% of the time, and any useful model needs to beat that accuracy.
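
As a cross-check, essentially the same number can be computed directly from the label column of the original dataframe (this uses all of the data rather than just the training split, so it may differ slightly):

# fraction of all articles labeled as fake, straight from the dataframe
print(df["fake"].mean())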

§3. Create Models

Model 1

In the first model, we will use only the article title as an input.

We first add the following code to standardize the titles (lowercasing them and stripping punctuation) and to build a TextVectorization layer, which we then adapt to the titles in our training data.

#preparing a text vectorization layer for tf model
size_vocabulary = 2000

def standardization(input_data):
    lowercase = tf.strings.lower(input_data)   # convert into lowercase
    no_punctuation = tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),'')
    # remove punctuation and some other elements
    return no_punctuation 

title_vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500) 

title_vectorize_layer.adapt(train.map(lambda x, y: x["title"]))
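
To see what the vectorization layer does, we can run it on a single made-up headline (the headline below is purely for illustration). Each word is mapped to an integer index in the learned vocabulary, out-of-vocabulary words map to 1, and the sequence is padded with zeros out to length 500:

# illustrative example: vectorize one hypothetical headline
example_title = tf.constant([["trump says something about the election"]])
print(title_vectorize_layer(example_title)[0, :10])  # first 10 token ids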

Our first model uses only the title of each article to predict whether it is fake. We pass title_input through the vectorization layer, an embedding, dropout, global average pooling, and a pair of dense layers, and then produce the output:

title_input = keras.Input(
    shape = (1,), 
    name = "title",
    dtype = "string"
)

title_features = title_vectorize_layer(title_input) # apply the vectorization layer to title_input
title_features = layers.Embedding(size_vocabulary, output_dim = 3, name="embedding1")(title_features)
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.GlobalAveragePooling1D()(title_features)
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.Dense(32, activation='relu')(title_features)
title_features = layers.Dense(32, activation='relu')(title_features)
output = layers.Dense(2, name = "fake")(title_features)

model1 = keras.Model(
    inputs = title_input,
    outputs = output
)

Now we have to fit our model to our training set.

model1.compile(optimizer="adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

history = model1.fit(train, 
                    validation_data=val,
                    epochs = 50, 
                    verbose = 1)
    Epoch 1/50


    /usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['text'] which did not match any model input. They will be ignored by the model.
      inputs = self._flatten_to_reference_inputs(inputs)


    157/157 [==============================] - 2s 7ms/step - loss: 0.6924 - accuracy: 0.5181 - val_loss: 0.6928 - val_accuracy: 0.5164
    Epoch 2/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.6919 - accuracy: 0.5234 - val_loss: 0.6917 - val_accuracy: 0.5164
    Epoch 3/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.6334 - accuracy: 0.6839 - val_loss: 0.4596 - val_accuracy: 0.9216
    Epoch 4/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.2796 - accuracy: 0.9297 - val_loss: 0.1559 - val_accuracy: 0.9533
    Epoch 5/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.1474 - accuracy: 0.9535 - val_loss: 0.1025 - val_accuracy: 0.9667
    Epoch 6/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.1153 - accuracy: 0.9611 - val_loss: 0.0838 - val_accuracy: 0.9720
    Epoch 7/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0964 - accuracy: 0.9652 - val_loss: 0.0727 - val_accuracy: 0.9731
    Epoch 8/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0884 - accuracy: 0.9694 - val_loss: 0.0677 - val_accuracy: 0.9764
    Epoch 9/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0809 - accuracy: 0.9710 - val_loss: 0.0617 - val_accuracy: 0.9773
    Epoch 10/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0743 - accuracy: 0.9728 - val_loss: 0.0584 - val_accuracy: 0.9782
    Epoch 11/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0710 - accuracy: 0.9745 - val_loss: 0.0622 - val_accuracy: 0.9773
    Epoch 12/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0686 - accuracy: 0.9750 - val_loss: 0.0543 - val_accuracy: 0.9800
    Epoch 13/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0638 - accuracy: 0.9768 - val_loss: 0.0528 - val_accuracy: 0.9804
    Epoch 14/50
    157/157 [==============================] - 1s 9ms/step - loss: 0.0610 - accuracy: 0.9780 - val_loss: 0.0513 - val_accuracy: 0.9811
    Epoch 15/50
    157/157 [==============================] - 2s 10ms/step - loss: 0.0560 - accuracy: 0.9784 - val_loss: 0.0501 - val_accuracy: 0.9813
    Epoch 16/50
    157/157 [==============================] - 1s 10ms/step - loss: 0.0562 - accuracy: 0.9790 - val_loss: 0.0500 - val_accuracy: 0.9824
    Epoch 17/50
    157/157 [==============================] - 1s 7ms/step - loss: 0.0514 - accuracy: 0.9818 - val_loss: 0.0488 - val_accuracy: 0.9824
    Epoch 18/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0534 - accuracy: 0.9803 - val_loss: 0.0490 - val_accuracy: 0.9824
    Epoch 19/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0502 - accuracy: 0.9811 - val_loss: 0.0491 - val_accuracy: 0.9827
    Epoch 20/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0504 - accuracy: 0.9803 - val_loss: 0.0538 - val_accuracy: 0.9813
    Epoch 21/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0479 - accuracy: 0.9814 - val_loss: 0.0474 - val_accuracy: 0.9836
    Epoch 22/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0456 - accuracy: 0.9828 - val_loss: 0.0474 - val_accuracy: 0.9833
    Epoch 23/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0447 - accuracy: 0.9829 - val_loss: 0.0485 - val_accuracy: 0.9822
    Epoch 24/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0440 - accuracy: 0.9824 - val_loss: 0.0473 - val_accuracy: 0.9831
    Epoch 25/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0447 - accuracy: 0.9832 - val_loss: 0.0521 - val_accuracy: 0.9822
    Epoch 26/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0448 - accuracy: 0.9829 - val_loss: 0.0473 - val_accuracy: 0.9833
    Epoch 27/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0460 - accuracy: 0.9833 - val_loss: 0.0471 - val_accuracy: 0.9824
    Epoch 28/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0447 - accuracy: 0.9823 - val_loss: 0.0486 - val_accuracy: 0.9833
    Epoch 29/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0416 - accuracy: 0.9829 - val_loss: 0.0481 - val_accuracy: 0.9822
    Epoch 30/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0413 - accuracy: 0.9846 - val_loss: 0.0492 - val_accuracy: 0.9822
    Epoch 31/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0395 - accuracy: 0.9835 - val_loss: 0.0512 - val_accuracy: 0.9833
    Epoch 32/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0423 - accuracy: 0.9829 - val_loss: 0.0501 - val_accuracy: 0.9831
    Epoch 33/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0367 - accuracy: 0.9854 - val_loss: 0.0504 - val_accuracy: 0.9831
    Epoch 34/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0374 - accuracy: 0.9860 - val_loss: 0.0610 - val_accuracy: 0.9811
    Epoch 35/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0404 - accuracy: 0.9844 - val_loss: 0.0489 - val_accuracy: 0.9829
    Epoch 36/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0327 - accuracy: 0.9873 - val_loss: 0.0511 - val_accuracy: 0.9836
    Epoch 37/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0374 - accuracy: 0.9857 - val_loss: 0.0496 - val_accuracy: 0.9829
    Epoch 38/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0382 - accuracy: 0.9846 - val_loss: 0.0539 - val_accuracy: 0.9833
    Epoch 39/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0363 - accuracy: 0.9866 - val_loss: 0.0619 - val_accuracy: 0.9811
    Epoch 40/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0384 - accuracy: 0.9852 - val_loss: 0.0503 - val_accuracy: 0.9829
    Epoch 41/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0364 - accuracy: 0.9857 - val_loss: 0.0799 - val_accuracy: 0.9764
    Epoch 42/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0359 - accuracy: 0.9859 - val_loss: 0.0505 - val_accuracy: 0.9833
    Epoch 43/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0371 - accuracy: 0.9850 - val_loss: 0.0630 - val_accuracy: 0.9804
    Epoch 44/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0354 - accuracy: 0.9861 - val_loss: 0.0509 - val_accuracy: 0.9831
    Epoch 45/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0362 - accuracy: 0.9853 - val_loss: 0.0591 - val_accuracy: 0.9809
    Epoch 46/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0346 - accuracy: 0.9873 - val_loss: 0.0551 - val_accuracy: 0.9811
    Epoch 47/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0332 - accuracy: 0.9861 - val_loss: 0.0679 - val_accuracy: 0.9793
    Epoch 48/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0316 - accuracy: 0.9876 - val_loss: 0.0606 - val_accuracy: 0.9813
    Epoch 49/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0335 - accuracy: 0.9864 - val_loss: 0.0740 - val_accuracy: 0.9787
    Epoch 50/50
    157/157 [==============================] - 1s 6ms/step - loss: 0.0334 - accuracy: 0.9873 - val_loss: 0.0539 - val_accuracy: 0.9822

Let’s look at the flow of our model:

from tensorflow.keras import utils
utils.plot_model(model1)

[Figure: plot_model diagram of model1, from the title input through the embedding and dense layers to the fake output]

Finally, let’s plot its performance on the validation set.

def history_plot():
  plt.plot(history.history["accuracy"], label = "training")
  plt.plot(history.history["val_accuracy"], label = "validation")
  plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
  plt.legend()

history_plot()

[Figure: training and validation accuracy per epoch for model1]

Looks like our model settles at a consistent accuracy of around 98% on the validation set! Now let’s repeat the same process, this time using only the text of the articles.

Model 2

In the second model, we will use only the article text as an input.

Now we prepare a second TextVectorization layer in the same way, this time adapting it to the article text in the training data.

text_vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500) 

text_vectorize_layer.adapt(train.map(lambda x, y: x["text"]))

Our second model uses only the text of the articles to predict whether an article is fake or not.

text_input = keras.Input(
    shape = (1,), 
    name = "text",
    dtype = "string"
)

text_features = text_vectorize_layer(text_input) # apply the vectorization layer to text_input
text_features = layers.Embedding(size_vocabulary, output_dim = 10, name="embedding2")(text_features)
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.GlobalAveragePooling1D()(text_features)
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.Dense(32, activation='relu')(text_features)
text_features = layers.Dense(32, activation='relu')(text_features)
output = layers.Dense(2, name = "fake")(text_features)

model2 = keras.Model(
    inputs = text_input,
    outputs = output
)

Now we have to fit our model to our training set.

model2.compile(optimizer="adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

history = model2.fit(train, 
                    validation_data=val,
                    epochs = 50, 
                    verbose = 1)
Epoch 1/50


    /usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['title'] which did not match any model input. They will be ignored by the model.
      inputs = self._flatten_to_reference_inputs(inputs)


    157/157 [==============================] - 2s 12ms/step - loss: 0.5829 - accuracy: 0.7111 - val_loss: 0.2748 - val_accuracy: 0.9389
    Epoch 2/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.1767 - accuracy: 0.9497 - val_loss: 0.1298 - val_accuracy: 0.9618
    Epoch 3/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.1096 - accuracy: 0.9697 - val_loss: 0.1011 - val_accuracy: 0.9689
    Epoch 4/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0836 - accuracy: 0.9775 - val_loss: 0.0890 - val_accuracy: 0.9720
    Epoch 5/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0669 - accuracy: 0.9824 - val_loss: 0.0840 - val_accuracy: 0.9744
    Epoch 6/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0559 - accuracy: 0.9848 - val_loss: 0.0831 - val_accuracy: 0.9756
    Epoch 7/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0455 - accuracy: 0.9888 - val_loss: 0.0854 - val_accuracy: 0.9744
    Epoch 8/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0406 - accuracy: 0.9899 - val_loss: 0.0846 - val_accuracy: 0.9742
    Epoch 9/50
    157/157 [==============================] - 2s 15ms/step - loss: 0.0340 - accuracy: 0.9917 - val_loss: 0.0859 - val_accuracy: 0.9769
    Epoch 10/50
    157/157 [==============================] - 3s 19ms/step - loss: 0.0300 - accuracy: 0.9920 - val_loss: 0.0927 - val_accuracy: 0.9740
    Epoch 11/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0277 - accuracy: 0.9938 - val_loss: 0.0939 - val_accuracy: 0.9769
    Epoch 12/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0222 - accuracy: 0.9947 - val_loss: 0.1022 - val_accuracy: 0.9751
    Epoch 13/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0223 - accuracy: 0.9943 - val_loss: 0.1067 - val_accuracy: 0.9731
    Epoch 14/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0199 - accuracy: 0.9948 - val_loss: 0.1080 - val_accuracy: 0.9749
    Epoch 15/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0184 - accuracy: 0.9947 - val_loss: 0.1170 - val_accuracy: 0.9727
    Epoch 16/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0192 - accuracy: 0.9939 - val_loss: 0.1215 - val_accuracy: 0.9731
    Epoch 17/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0196 - accuracy: 0.9939 - val_loss: 0.1173 - val_accuracy: 0.9740
    Epoch 18/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0148 - accuracy: 0.9959 - val_loss: 0.1176 - val_accuracy: 0.9747
    Epoch 19/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0139 - accuracy: 0.9961 - val_loss: 0.1293 - val_accuracy: 0.9731
    Epoch 20/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0134 - accuracy: 0.9957 - val_loss: 0.1301 - val_accuracy: 0.9742
    Epoch 21/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0136 - accuracy: 0.9959 - val_loss: 0.1257 - val_accuracy: 0.9769
    Epoch 22/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0092 - accuracy: 0.9975 - val_loss: 0.1325 - val_accuracy: 0.9760
    Epoch 23/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0090 - accuracy: 0.9975 - val_loss: 0.1381 - val_accuracy: 0.9742
    Epoch 24/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0087 - accuracy: 0.9973 - val_loss: 0.1426 - val_accuracy: 0.9751
    Epoch 25/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0075 - accuracy: 0.9976 - val_loss: 0.1434 - val_accuracy: 0.9767
    Epoch 26/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0072 - accuracy: 0.9978 - val_loss: 0.1552 - val_accuracy: 0.9751
    Epoch 27/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0077 - accuracy: 0.9976 - val_loss: 0.1545 - val_accuracy: 0.9751
    Epoch 28/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0066 - accuracy: 0.9980 - val_loss: 0.1583 - val_accuracy: 0.9751
    Epoch 29/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0078 - accuracy: 0.9976 - val_loss: 0.1636 - val_accuracy: 0.9744
    Epoch 30/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0060 - accuracy: 0.9978 - val_loss: 0.1676 - val_accuracy: 0.9749
    Epoch 31/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0056 - accuracy: 0.9985 - val_loss: 0.1894 - val_accuracy: 0.9713
    Epoch 32/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0134 - accuracy: 0.9949 - val_loss: 0.1658 - val_accuracy: 0.9753
    Epoch 33/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0068 - accuracy: 0.9976 - val_loss: 0.1687 - val_accuracy: 0.9747
    Epoch 34/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0076 - accuracy: 0.9976 - val_loss: 0.1684 - val_accuracy: 0.9744
    Epoch 35/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0056 - accuracy: 0.9981 - val_loss: 0.1744 - val_accuracy: 0.9749
    Epoch 36/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0061 - accuracy: 0.9978 - val_loss: 0.1769 - val_accuracy: 0.9736
    Epoch 37/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0103 - accuracy: 0.9964 - val_loss: 0.1820 - val_accuracy: 0.9729
    Epoch 38/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0078 - accuracy: 0.9978 - val_loss: 0.1898 - val_accuracy: 0.9718
    Epoch 39/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0032 - accuracy: 0.9991 - val_loss: 0.1719 - val_accuracy: 0.9760
    Epoch 40/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0102 - accuracy: 0.9962 - val_loss: 0.1995 - val_accuracy: 0.9711
    Epoch 41/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0058 - accuracy: 0.9980 - val_loss: 0.1784 - val_accuracy: 0.9744
    Epoch 42/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0046 - accuracy: 0.9985 - val_loss: 0.1738 - val_accuracy: 0.9767
    Epoch 43/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0055 - accuracy: 0.9980 - val_loss: 0.1806 - val_accuracy: 0.9758
    Epoch 44/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0031 - accuracy: 0.9991 - val_loss: 0.1852 - val_accuracy: 0.9744
    Epoch 45/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0040 - accuracy: 0.9988 - val_loss: 0.1853 - val_accuracy: 0.9744
    Epoch 46/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0030 - accuracy: 0.9990 - val_loss: 0.1950 - val_accuracy: 0.9738
    Epoch 47/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 0.1996 - val_accuracy: 0.9742
    Epoch 48/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0152 - accuracy: 0.9952 - val_loss: 0.1923 - val_accuracy: 0.9733
    Epoch 49/50
    157/157 [==============================] - 2s 12ms/step - loss: 0.0067 - accuracy: 0.9973 - val_loss: 0.1744 - val_accuracy: 0.9744
    Epoch 50/50
    157/157 [==============================] - 2s 11ms/step - loss: 0.0039 - accuracy: 0.9987 - val_loss: 0.1931 - val_accuracy: 0.9731

Let’s look at the flow of our model:

utils.plot_model(model2)

[Figure: plot_model diagram of model2, from the text input through the embedding and dense layers to the fake output]

Finally, let’s plot its performance on the validation set.

history_plot()

[Figure: training and validation accuracy per epoch for model2]

Once again, looks pretty good! Validation accuracy holds steady at about 97%. For our last model, we will take both the titles and the text into account, and we predict this will be our most accurate model yet.

Model 3

In the third model, we will use both the article title and the article text as input. We can reuse the title and text feature pipelines defined above and combine them using layers.concatenate.

# reuse the title and text feature pipelines (and their already-trained layers) from models 1 and 2,
# and combine them into a single feature vector
main = layers.concatenate([title_features, text_features], axis = 1)

# output
main = layers.Dense(32, activation='relu')(main)
output = layers.Dense(2, name = "fake")(main)

model3 = keras.Model(
    inputs = [title_input, text_input],
    outputs = output
)

Now we have to fit our model to our training set.

model3.compile(optimizer="adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

history = model3.fit(train, 
                    validation_data=val,
                    epochs = 50, 
                    verbose = 1)
Epoch 1/50
    157/157 [==============================] - 6s 26ms/step - loss: 7.4767e-04 - accuracy: 0.9998 - val_loss: 0.0537 - val_accuracy: 0.9911
    Epoch 2/50
    157/157 [==============================] - 2s 14ms/step - loss: 4.1076e-05 - accuracy: 1.0000 - val_loss: 0.0554 - val_accuracy: 0.9911
    Epoch 3/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0034 - accuracy: 0.9990 - val_loss: 0.0511 - val_accuracy: 0.9884
    Epoch 4/50
    157/157 [==============================] - 2s 14ms/step - loss: 7.2481e-04 - accuracy: 0.9998 - val_loss: 0.0735 - val_accuracy: 0.9873
    Epoch 5/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0012 - accuracy: 0.9997 - val_loss: 0.0553 - val_accuracy: 0.9889
    Epoch 6/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0023 - accuracy: 0.9992 - val_loss: 0.0312 - val_accuracy: 0.9944
    Epoch 7/50
    157/157 [==============================] - 2s 13ms/step - loss: 5.4093e-04 - accuracy: 0.9998 - val_loss: 0.0846 - val_accuracy: 0.9851
    Epoch 8/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0012 - accuracy: 0.9996 - val_loss: 0.0483 - val_accuracy: 0.9916
    Epoch 9/50
    157/157 [==============================] - 2s 13ms/step - loss: 5.4437e-04 - accuracy: 0.9999 - val_loss: 0.0882 - val_accuracy: 0.9844
    Epoch 10/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0029 - accuracy: 0.9991 - val_loss: 0.0457 - val_accuracy: 0.9900
    Epoch 11/50
    157/157 [==============================] - 2s 14ms/step - loss: 3.9874e-04 - accuracy: 0.9999 - val_loss: 0.0358 - val_accuracy: 0.9933
    Epoch 12/50
    157/157 [==============================] - 2s 13ms/step - loss: 3.0234e-04 - accuracy: 0.9999 - val_loss: 0.0313 - val_accuracy: 0.9944
    Epoch 13/50
    157/157 [==============================] - 2s 13ms/step - loss: 6.8990e-04 - accuracy: 0.9997 - val_loss: 0.1000 - val_accuracy: 0.9820
    Epoch 14/50
    157/157 [==============================] - 2s 13ms/step - loss: 8.4658e-04 - accuracy: 0.9997 - val_loss: 0.0707 - val_accuracy: 0.9860
    Epoch 15/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0469 - val_accuracy: 0.9904
    Epoch 16/50
    157/157 [==============================] - 2s 13ms/step - loss: 6.7228e-04 - accuracy: 0.9998 - val_loss: 0.0715 - val_accuracy: 0.9858
    Epoch 17/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0028 - accuracy: 0.9989 - val_loss: 0.0495 - val_accuracy: 0.9889
    Epoch 18/50
    157/157 [==============================] - 2s 14ms/step - loss: 2.8878e-04 - accuracy: 0.9999 - val_loss: 0.0526 - val_accuracy: 0.9887
    Epoch 19/50
    157/157 [==============================] - 2s 13ms/step - loss: 1.4654e-04 - accuracy: 1.0000 - val_loss: 0.0458 - val_accuracy: 0.9909
    Epoch 20/50
    157/157 [==============================] - 2s 13ms/step - loss: 4.1262e-04 - accuracy: 0.9998 - val_loss: 0.0576 - val_accuracy: 0.9891
    Epoch 21/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0014 - accuracy: 0.9996 - val_loss: 0.0475 - val_accuracy: 0.9907
    Epoch 22/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0028 - accuracy: 0.9988 - val_loss: 0.0741 - val_accuracy: 0.9840
    Epoch 23/50
    157/157 [==============================] - 2s 14ms/step - loss: 8.8363e-04 - accuracy: 0.9996 - val_loss: 0.0386 - val_accuracy: 0.9909
    Epoch 24/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0013 - accuracy: 0.9997 - val_loss: 0.0514 - val_accuracy: 0.9898
    Epoch 25/50
    157/157 [==============================] - 2s 14ms/step - loss: 3.1105e-04 - accuracy: 0.9998 - val_loss: 0.0728 - val_accuracy: 0.9851
    Epoch 26/50
    157/157 [==============================] - 2s 14ms/step - loss: 2.5246e-04 - accuracy: 1.0000 - val_loss: 0.0543 - val_accuracy: 0.9893
    Epoch 27/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0013 - accuracy: 0.9994 - val_loss: 0.0801 - val_accuracy: 0.9820
    Epoch 28/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0013 - accuracy: 0.9996 - val_loss: 0.0427 - val_accuracy: 0.9907
    Epoch 29/50
    157/157 [==============================] - 2s 14ms/step - loss: 8.0247e-04 - accuracy: 0.9997 - val_loss: 0.0476 - val_accuracy: 0.9900
    Epoch 30/50
    157/157 [==============================] - 2s 13ms/step - loss: 4.1942e-04 - accuracy: 0.9998 - val_loss: 0.0633 - val_accuracy: 0.9873
    Epoch 31/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0018 - accuracy: 0.9995 - val_loss: 0.0324 - val_accuracy: 0.9929
    Epoch 32/50
    157/157 [==============================] - 2s 14ms/step - loss: 1.8973e-04 - accuracy: 1.0000 - val_loss: 0.0326 - val_accuracy: 0.9938
    Epoch 33/50
    157/157 [==============================] - 2s 14ms/step - loss: 9.3997e-04 - accuracy: 0.9996 - val_loss: 0.0384 - val_accuracy: 0.9929
    Epoch 34/50
    157/157 [==============================] - 2s 14ms/step - loss: 3.5687e-04 - accuracy: 0.9999 - val_loss: 0.0792 - val_accuracy: 0.9851
    Epoch 35/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.0371 - val_accuracy: 0.9931
    Epoch 36/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0020 - accuracy: 0.9994 - val_loss: 0.0430 - val_accuracy: 0.9918
    Epoch 37/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0021 - accuracy: 0.9994 - val_loss: 0.0463 - val_accuracy: 0.9922
    Epoch 38/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0011 - accuracy: 0.9997 - val_loss: 0.0756 - val_accuracy: 0.9847
    Epoch 39/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0011 - accuracy: 0.9996 - val_loss: 0.0576 - val_accuracy: 0.9884
    Epoch 40/50
    157/157 [==============================] - 2s 13ms/step - loss: 5.3687e-04 - accuracy: 0.9999 - val_loss: 0.1249 - val_accuracy: 0.9804
    Epoch 41/50
    157/157 [==============================] - 2s 14ms/step - loss: 3.8168e-04 - accuracy: 0.9999 - val_loss: 0.0524 - val_accuracy: 0.9922
    Epoch 42/50
    157/157 [==============================] - 2s 13ms/step - loss: 8.7413e-04 - accuracy: 0.9997 - val_loss: 0.1130 - val_accuracy: 0.9836
    Epoch 43/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0013 - accuracy: 0.9996 - val_loss: 0.0359 - val_accuracy: 0.9933
    Epoch 44/50
    157/157 [==============================] - 2s 13ms/step - loss: 0.0015 - accuracy: 0.9994 - val_loss: 0.0876 - val_accuracy: 0.9851
    Epoch 45/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0011 - accuracy: 0.9998 - val_loss: 0.0893 - val_accuracy: 0.9833
    Epoch 46/50
    157/157 [==============================] - 2s 13ms/step - loss: 4.5296e-04 - accuracy: 0.9998 - val_loss: 0.1328 - val_accuracy: 0.9771
    Epoch 47/50
    157/157 [==============================] - 2s 13ms/step - loss: 4.0214e-04 - accuracy: 0.9999 - val_loss: 0.0384 - val_accuracy: 0.9933
    Epoch 48/50
    157/157 [==============================] - 2s 13ms/step - loss: 4.2899e-04 - accuracy: 0.9999 - val_loss: 0.0517 - val_accuracy: 0.9918
    Epoch 49/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0026 - accuracy: 0.9990 - val_loss: 0.0392 - val_accuracy: 0.9922
    Epoch 50/50
    157/157 [==============================] - 2s 14ms/step - loss: 0.0018 - accuracy: 0.9994 - val_loss: 0.0429 - val_accuracy: 0.9927

Let’s look at the flow of our model:

utils.plot_model(model3)

[Figure: plot_model diagram of model3, showing the title and text branches merging into the fake output]

Finally, let’s plot its performance on the validation set.

history_plot()

[Figure: training and validation accuracy per epoch for model3]

WOW! This is our best model, with a consistent validation accuracy of around 99%! Note that the accuracy is high from the very first epoch because model3 reuses the embedding and dense layers that were already trained in models 1 and 2. Now let’s test our models on the test data.

§4. Model Evaluation

Now let’s test our models’ performance on unseen test data.

test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true" #import test data
test_pd = pd.read_csv(test_url, index_col = 0)
test_data = make_dataset(test_pd)

loss1, accuracy1 = model1.evaluate(test_data)
print('Test accuracy for Model 1 :', accuracy1)

loss2, accuracy2 = model2.evaluate(test_data)
print('Test accuracy for Model 2 :', accuracy2)

loss3, accuracy3 = model3.evaluate(test_data)
print('Test accuracy for Model 3 :', accuracy3)
225/225 [==============================] - 1s 3ms/step - loss: 0.0479 - accuracy: 0.9881
Test accuracy for Model 1 : 0.9880618453025818
225/225 [==============================] - 2s 8ms/step - loss: 0.1199 - accuracy: 0.9760
Test accuracy for Model 2 : 0.9759899973869324
225/225 [==============================] - 2s 9ms/step - loss: 0.0126 - accuracy: 0.9968
Test accuracy for Model 3 : 0.9968372583389282

We see that the test accuracy for Model 3 is 99.68%, which is expected since it was also our best model on the validation data.
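
If we wanted to score a brand-new article, we could pass a title/text pair directly to model3.predict. The headline and body below are made up purely for illustration; in practice the stopwords should be removed first, just as make_dataset does, and since the output layer produces logits we apply a softmax to read them as probabilities:

# hypothetical article, for illustration only
new_title = np.array([["BREAKING: celebrity reveals shocking secret"]])
new_text  = np.array([["the article body would go here, stopwords removed"]])

# the model outputs logits for (true, fake); softmax turns them into probabilities
logits = model3.predict({"title": new_title, "text": new_text})
probs = tf.nn.softmax(logits, axis=1).numpy()
print("P(true) =", probs[0, 0], " P(fake) =", probs[0, 1])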

§5. Embedding Visualization

Our final step is to visualize the title embedding learned by our model. We use PCA to reduce the embedding weights to 2 dimensions and then take a closer look at some of the words most associated with fake or real news articles.

weights = model3.get_layer('embedding1').get_weights()[0] # get the weights from the embedding layer
vocab = title_vectorize_layer.get_vocabulary()                # get the vocabulary from our data prep for later

from sklearn.decomposition import PCA
pca = PCA(n_components=2)     # Convert our data into 2 dimensions
weights = pca.fit_transform(weights)

# visualizing the title embedding
embedding_df = pd.DataFrame({
    'word' : vocab, 
    'x0'   : weights[:,0],
    'x1'   : weights[:,1]
})

import plotly.express as px 
fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = [2]*len(embedding_df),
                # size_max = 2,
                 hover_name = "word")

fig.show()

Words that are far from the origin suggest a strong association with either fake or true news. On the negative x-axis side we have “spokesman” and “factbox,” while “breaking,” “just,” and “hollywood” sit on the positive x-axis side. It seems that the negative x-axis holds words that indicate a true article, while the positive x-axis holds words that indicate a fake one.
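
One optional way to back up this reading is to sort the embedding dataframe by its first principal component and print the most extreme words at each end:

# words at the extremes of the first principal component
print(embedding_df.sort_values("x0").head(10))   # most negative x0 (true-leaning words)
print(embedding_df.sort_values("x0").tail(10))   # most positive x0 (fake-leaning words)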

Written on May 18, 2022