How I Built a Simple Fake News Detector on Amazon SageMaker

Andrea Guidi
Published in TDS Archive
Apr 18, 2020 · 5 min read

Recently, I decided to enroll in a Udacity Nanodegree, a thought that had been hovering around my head for a long time.

Over the last two months, every day after dinner and on weekends, I followed the Machine Learning Engineer Nanodegree courses, and that is where I came across Amazon SageMaker.

(Image credits: aws.amazon.com)

Amazon SageMaker is a fully managed service that allows data scientists and developers to build, train, and deploy machine learning models at scale.

The great thing is that you can carry out the whole end-to-end data science pipeline on a single platform.
With Amazon SageMaker you can create Jupyter Notebook instances on a range of machines that differ in compute (CPU/GPU), RAM, and networking capabilities.

You can go from importing, exploring, and cleaning your data all the way to training a model and rapidly putting it into a production environment.

The common workflow with SageMaker (at least from what I have learned in my limited experience) is the following:

DATA INTEGRATION AND PROCESSING

  • Integrate your dataset from whatever source;
  • Explore it, making visualizations and summary statistics to understand the data;
  • Clean what has to be cleaned, then preprocess and engineer your features, if necessary;
  • Save your processed data to an S3 bucket, which can be the default SageMaker bucket or another bucket of your choice (see the sketch after this list).

MODEL BUILDING & DEPLOYMENT

  • Amazon SageMaker comes with a set of built-in supervised and unsupervised models, but you can also provide a custom model built with a framework of your choice (Scikit-learn, TensorFlow, MXNet…) together with a training script;
  • Train the model on one or more compute instances with the data you saved on S3;
  • Deploy the estimator to a SageMaker endpoint to make inferences.
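
As a taste of the data-saving step above, here is a minimal sketch of uploading processed data to S3 with the SageMaker Python SDK; the local path and key prefix are placeholders, not the project's actual names:

import sagemaker

# Upload a local folder of processed files to the default SageMaker S3 bucket.
# 'data/processed' and 'fake-news' are placeholder names.
session = sagemaker.Session()
s3_input_path = session.upload_data(path='data/processed',
                                    key_prefix='fake-news')
print(s3_input_path)  # s3://<default-bucket>/fake-news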

I find SageMaker a really valuable choice for data science projects. From here on, I'll share my most recent experience with SageMaker: my Udacity Capstone Project.

This project deals with fake/true news detection, which places it squarely in the context of Natural Language Processing problems.
While browsing Kaggle, I found this interesting dataset.
The dataset is made of 2 CSV files (true news, fake news) that store the title, article text, date, and subject of each article.

Problem Statement

So, the problem can be stated in the following way: given the text of an article, I want the algorithm to predict whether it is True or Fake news. In particular, I structured the solution as follows (two short sketches follow the list):
• The data, which comes from different sources (the two CSVs), will be labelled and stacked;
• After stacking, the text features such as “title” and “article” will be processed to generate a meaningful vocabulary (no hashtags, URLs, weird punctuation, or stopwords).
From here, two roads can be followed, depending on the choice of algorithm:
• If a classical machine learning algorithm is used, it is necessary to create a Bag of Words representation of the texts, using word counts, one-hot encoding, or term frequency–inverse document frequency (TF-IDF), which can then be combined with other features (extracted from the date, for example) to train the model;
• If instead a deep learning model is chosen, such as a Recurrent Neural Network, one can directly use the text sequences, padded to the same length and mapped through a word-to-integer vocabulary.
The neural network can then be trained to solve a binary classification problem with a binary cross-entropy loss.
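
To make the first bullet concrete, here is a minimal sketch of the labelling-and-stacking step with pandas; the file names True.csv and Fake.csv are assumptions about how the downloaded Kaggle files are named:

import pandas as pd

# Placeholder file names; adjust to the downloaded Kaggle files
true_df = pd.read_csv('True.csv')
fake_df = pd.read_csv('Fake.csv')

# Label each source (1 = true, 0 = fake), then stack and shuffle
true_df['label'] = 1
fake_df['label'] = 0
df = pd.concat([true_df, fake_df], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

And for the classical machine learning road, a TF-IDF representation could be built with scikit-learn (again a sketch, not the project's exact code; train_texts and test_texts are hypothetical variables):

from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights on the training texts only
vectorizer = TfidfVectorizer(max_features=80000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(train_texts)
X_test_tfidf = vectorizer.transform(test_texts)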

Since the full report is 10 pages long, I will cover only the main steps:

Preprocessing
Regarding the preprocessing steps for the LSTM model:

  • I considered only the article text as a feature, filtering out texts shorter than 20 words or longer than 500 words to avoid empty or overly long sequences. The texts had already been stripped of stopwords and weird punctuation and transformed to lowercase.
  • I split the data into Train, Validation and Test sets with train_test_split from Sklearn (sketched after the code below).
  • I fit a Tokenizer from Keras on the Training set only, then used it to transform the Validation and Test sets as well (to avoid data leakage), and padded all sequences to a max length of 500:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

# Fit the tokenizer on the training articles only (no data leakage)
text_tokenizer = Tokenizer(num_words=80000)
text_tokenizer.fit_on_texts(X_train['article'].astype(str))

# Map each article to a sequence of integer word indices
X_train = text_tokenizer.texts_to_sequences(X_train['article'].astype(str))
X_val = text_tokenizer.texts_to_sequences(X_val['article'].astype(str))
X_test = text_tokenizer.texts_to_sequences(X_test['article'].astype(str))

# Pad every sequence to a fixed length of 500
X_train = sequence.pad_sequences(X_train, maxlen=500, padding='post')
X_val = sequence.pad_sequences(X_val, maxlen=500, padding='post')
X_test = sequence.pad_sequences(X_test, maxlen=500, padding='post')
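
For completeness, the split mentioned in the second bullet can be done with two calls to train_test_split; the 80/10/10 proportions here are illustrative assumptions, not necessarily the ones I used:

from sklearn.model_selection import train_test_split

# First carve out a 10% test set, then split the rest into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(
    df[['article']], df['label'], test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=1/9, random_state=42)  # 1/9 of 90% ≈ 10%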

The Model

The training script (remember that I am on SageMaker) is where I read the environment variables, define the model structure, fit it, and save its artifacts to S3. This is the structure of the network I used (Keras):

from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Activation
from tensorflow.keras.models import Sequential

def RNN():
    model = Sequential()
    # Map the 80,000-word vocabulary to 128-dimensional embeddings
    model.add(Embedding(80000, 128, input_length=500))
    # A bidirectional LSTM reads each sequence forwards and backwards
    model.add(Bidirectional(LSTM(128)))
    model.add(Dense(128))
    model.add(Activation('relu'))
    # Single sigmoid unit for the binary (true/fake) output
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    return model

Adding bidirectionality to the LSTM layer improved accuracy by over 15%.

Then you add code to compile, fit, and save the model; SageMaker calls this code during the training job. (The compile step below is added for completeness: the loss is the binary cross-entropy mentioned earlier, while the optimizer choice is illustrative.)

import os

model = RNN()
# Loss from the problem statement; the optimizer choice is illustrative
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X,
          train_y,
          batch_size=256,
          epochs=args.n_epochs,
          validation_data=(val_X, val_y))

# SageMaker uploads whatever is saved under /opt/ml/model to S3
model_path = '/opt/ml/model/'
model.save(os.path.join(model_path, 'bi_lstm/1'), save_format='tf')
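
The args.n_epochs above comes from the script's argument parser: SageMaker passes the hyperparameters you set on the estimator as command-line arguments, and exposes data and model locations as SM_* environment variables. A minimal sketch of that boilerplate:

import argparse
import os

parser = argparse.ArgumentParser()
# Hyperparameters set on the estimator arrive as command-line arguments
parser.add_argument('--n_epochs', type=int, default=3)
# SageMaker exposes channel and model locations as environment variables
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
args, _ = parser.parse_known_args()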

On the notebook instance side, instead, I instantiated a TensorFlow estimator object, where I set the path to the training script, the number and type of instances I want, the IAM role, and the hyperparameters:

from sagemaker.tensorflow import TensorFlow

# Channel names map to the S3 locations of the processed datasets
input_channels = {"train": train_data,
                  "validation": val_data}

estimator = TensorFlow(entry_point='source_train/train_keras_lstm.py',
                       train_instance_type='ml.p2.xlarge',
                       train_instance_count=1,
                       role=role,
                       framework_version='2.1.0',
                       py_version='py3',
                       hyperparameters={"n_epochs": 3})

# Launch the training job on the S3 input channels
estimator.fit(input_channels)

As you can see, I chose an ‘ml.p2.xlarge’ instance, which is Amazon's entry-level machine with GPU access.

With the same strategy, I deployed my model after training it:

predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.c4.xlarge')

And performed inference on the test set (this can be done with the predict() API, or by creating a Batch Transform job for larger datasets; see the sketch further below):

from sklearn.metrics import accuracy_score

preds_df = pd.DataFrame(predictor.predict(X_test))
target_preds = pd.concat([y_test, preds_df], axis=1)
target_preds.columns = ['targetClass', 'preds']
# Round the sigmoid outputs to 0/1 before scoring
print(accuracy_score(target_preds['targetClass'],
                     target_preds['preds'].round()))
# 0.986639753940792

And I got about 98.7% accuracy on the test set.
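
For larger datasets, the Batch Transform route mentioned earlier would look roughly like this (a sketch; the S3 URI and content type are placeholder assumptions):

# Create an offline batch transformer from the trained estimator
transformer = estimator.transformer(instance_count=1,
                                    instance_type='ml.c4.xlarge')

# Run the prediction job over data already staged in S3 (placeholder URI)
transformer.transform('s3://my-bucket/test-data/',
                      content_type='application/json')
transformer.wait()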

Apart from the model itself, I hope I've drawn your attention to SageMaker's capabilities.

If you want to see all the steps, read the report, or take a look at the training script, head over to the GitHub repo of the project.

Until next time, goodbye and thanks for reading!
