While definitions vary, in this work unreliable news was used as an umbrella term for articles with unverified or false content, regardless of whether they intentionally aim to misinform. Identifying the intentions of an article's authors was beyond the scope of this work. Unreliable news is not a new phenomenon, but the growing use of the internet and social media has accelerated its spread. Manual classification requires expertise and time, and cannot keep up with the number of articles posted each day. Augmenting these efforts with automated unreliable news classifiers is a growing body of research.
Datasets
There are multiple organizations that perform article-level fact-checking, and these became the source of early datasets for this task. Those datasets, however, only contained hundreds of samples. Fortunately, these organizations also publish source-level scores, analyzing online news sources as a whole and rating their reliability. With these source-level scores, datasets can be created by scraping articles from websites and giving each article the score of its source. This labelling is considered weaker, since the source score may not be accurate for every individual article, but the process enables researchers to build datasets with tens to hundreds of thousands of entries.
The dataset used in this project is a subset of the NELA-GT-2018 dataset. The original dataset contains upwards of 700,000 English news articles scraped from 194 websites during 2018. The news sites were scored by different fact-checking agencies, and these scores were aggregated into an overall reliability score. This project only used sites that scored either very high or very low on that reliability score.
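As a rough illustration, the source-level labelling and thresholding steps might look like the following pandas sketch. The file names, column names (source, reliability_score), and threshold values are hypothetical and only meant to show the idea, not the actual NELA-GT-2018 schema.

import pandas as pd

# Hypothetical file and column names, for illustration only.
articles = pd.read_json('articles.jsonl', lines=True)        # one row per scraped article
sources = pd.read_csv('source_reliability_scores.csv')       # one row per news site

# Each article inherits the aggregated reliability score of its source (weak labelling).
df = articles.merge(sources[['source', 'reliability_score']], on='source')

# Keep only sites that scored very high or very low; drop the ambiguous middle.
reliable = df[df['reliability_score'] >= 0.8].assign(label=0)
unreliable = df[df['reliability_score'] <= 0.2].assign(label=1)
dataset = pd.concat([reliable, unreliable], ignore_index=True)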
Additionally, instead of randomly splitting the news articles into training, validation, and testing sets, they were split by news source. For example, if one set contains articles from the BBC, then no other set has articles from the BBC. This means the resulting model was tested on news articles from sources it wasn't trained on. A paper from 2021 showed that, with random splits, the model tends to learn the site-level labels and might not be robust on new articles from sources not seen during training.
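Continuing the sketch above, a source-disjoint split can be expressed with scikit-learn's GroupShuffleSplit; the exact procedure and proportions used in the project may differ.

from sklearn.model_selection import GroupShuffleSplit

# Split by news source so that no source appears in more than one set.
outer = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
train_idx, rest_idx = next(outer.split(dataset, groups=dataset['source']))
train_df, rest = dataset.iloc[train_idx], dataset.iloc[rest_idx]

# Divide the remaining sources evenly into validation and test sets.
inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
val_idx, test_idx = next(inner.split(rest, groups=rest['source']))
val_df, test_df = rest.iloc[val_idx], rest.iloc[test_idx]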
The thresholding and balanced splitting reduced the working dataset to roughly 150,000 articles: about 70k for training and about 40k each for validation and testing. The figures below show the top 10 news sites for each split.



Unreliable News Classifier
In this project, the Hugging Face Transformers implementations of BERT and DistilBERT were used. Starting from pre-trained weights and then fine-tuning on a specific task has been a huge boost for individual researchers. The base models were trained on a very large English corpus for multiple days on multiple GPUs and achieved state-of-the-art performance on language tasks when released. The distilled versions retain most of the larger model's performance while offering a significant speed boost. For this project, BERT (12 layers) was compared to DistilBERT with 6 layers and with 4 layers.
Dataset/Dataloader
The working dataset was stored in Google Drive in JSON Lines (jsonl) format. The PyTorch dataset class loads the data as a pandas data frame and then feeds each article's title and content to the appropriate pre-trained tokenizer. The tokenizer outputs numerical representations of words/sub-words (token IDs) that the models use as inputs. The dataset class also one-hot encodes the labels from one column into two columns. One limitation of these models is their 512-token limit, which means significant portions of longer articles were never seen by the model. The DataLoader class primarily handles batching of the input data. Due to hardware constraints, training and testing were limited to a batch size of 8.
# Tokenize the title/content pair; the content is cut off if the 512-token limit is exceeded.
encoding = self.tokenizer.encode_plus(
    data_row.title,
    ' [SEP] ' + data_row.content,
    add_special_tokens=True,            # add [CLS] and [SEP] tokens
    max_length=self.max_token_len,      # 512 for BERT/DistilBERT
    return_token_type_ids=True,
    padding='max_length',               # pad every sample to the same length
    truncation='only_second',           # truncate the content, never the title
    return_attention_mask=True,
    return_tensors='pt'                 # return PyTorch tensors
)
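For context, here is a condensed sketch of how the dataset class and DataLoader could fit together; the class name, fields, and structure are illustrative rather than the project's actual code.

import torch
from torch.utils.data import Dataset, DataLoader

class NewsDataset(Dataset):                     # hypothetical name
    def __init__(self, df, tokenizer, max_token_len=512):
        self.df, self.tokenizer, self.max_token_len = df, tokenizer, max_token_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        data_row = self.df.iloc[idx]
        encoding = self.tokenizer.encode_plus(
            data_row.title, ' [SEP] ' + data_row.content,
            add_special_tokens=True, max_length=self.max_token_len,
            padding='max_length', truncation='only_second',
            return_attention_mask=True, return_tensors='pt')
        # One-hot encode the single label column into two columns.
        labels = torch.zeros(2)
        labels[data_row.label] = 1.0
        return dict(input_ids=encoding['input_ids'].flatten(),
                    attention_mask=encoding['attention_mask'].flatten(),
                    labels=labels)

# Batch size limited to 8 due to hardware constraints.
train_loader = DataLoader(NewsDataset(train_df, tokenizer), batch_size=8, shuffle=True)
val_loader = DataLoader(NewsDataset(val_df, tokenizer), batch_size=8)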
Model
The BertForSequenceClassification and DistilBertForSequenceClassification classes from the Transformers library attach an untrained classification head to the outputs of the pre-trained models. In this case, the output size was two (one-hot encoded). The model class outputs both the scores (logits) and the computed loss that is back-propagated. To get the actual predictions, the logits were passed through a softmax layer to convert the scores into probabilities, and the class with the larger probability was taken as the prediction.
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, dropout=dropout, num_labels=2, n_layers=n_layers)
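To make the prediction step described above concrete, here is a small sketch of a forward pass followed by softmax and argmax on one batch (the variable names are illustrative):

import torch

model.eval()
with torch.no_grad():
    outputs = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'])
    probs = torch.softmax(outputs.logits, dim=-1)   # convert logits to probabilities
    preds = torch.argmax(probs, dim=-1)             # class with the larger probability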
Training/Evaluation
Basic training and evaluation loops were written in PyTorch. This project used a linearly decaying learning rate with warmup. Training was initially set to run for 10 epochs, but early stopping was implemented: after each training epoch, the model was evaluated on the validation set, and the model with the highest validation accuracy was saved as a checkpoint. If the validation accuracy did not improve for three consecutive epochs, training was stopped. After training, the checkpoint was loaded and tested against the held-out testing set, and the final test accuracy and mean inference time per batch were recorded.
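A condensed sketch of such a loop, using Transformers' get_linear_schedule_with_warmup; the learning rate, warmup length, patience value, and the evaluate helper are illustrative assumptions rather than the project's exact settings.

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

epochs, patience = 10, 3
optimizer = AdamW(model.parameters(), lr=2e-5)                 # assumed learning rate
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

best_val_acc, epochs_without_improvement = 0.0, 0
for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])                # model returns loss and logits
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()                                       # linear decay with warmup

    val_acc = evaluate(model, val_loader)                      # hypothetical helper returning accuracy
    if val_acc > best_val_acc:
        best_val_acc, epochs_without_improvement = val_acc, 0
        torch.save(model.state_dict(), 'checkpoint.pt')        # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                              # stop after 3 epochs without improvement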
Results

Weights & Biases was used to log metrics during training. After only a few epochs, the models were already reaching near-perfect accuracy on the training set, while the validation accuracy improved very little and sometimes decreased. On the testing set, the BERT model achieved the highest accuracy, with the DistilBERT versions not far behind while offering a significant speed boost.
A simple demo application using the DistilBERT 4-layer version was hosted on Hugging Face Spaces. You can try it with random samples from the test dataset, or input your own title and article. Finally, the code for this project is available on GitHub.
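For reference, if the demo were built with Gradio (a common choice for Spaces apps; the actual app and its label mapping may differ), its core could look roughly like this:

import gradio as gr
import torch

def classify(title, content):
    enc = tokenizer(title, ' [SEP] ' + content, truncation='only_second',
                    padding='max_length', max_length=512, return_tensors='pt')
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    # Assumed label order: index 0 = reliable, index 1 = unreliable.
    return {'reliable': float(probs[0]), 'unreliable': float(probs[1])}

gr.Interface(fn=classify,
             inputs=[gr.Textbox(label='Title'), gr.Textbox(label='Article', lines=10)],
             outputs=gr.Label()).launch()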
Future Work
- Develop models that can handle more tokens (e.g. Longformer).
- Add newer versions of the NELA-GT dataset to the working dataset.
- Test the model on articles published after those in its training data.