\documentclass{article}
\usepackage{multicol}
\usepackage{hyperref}
\title{Data Mining CS 571}
\author{Matt Jensen}
\date{2023-04-25}
\begin{document}
\maketitle
\section*{Abstract}
News organizations have been repeatedly accused of being partisan.
Additionally, they have been accused of polarizing discussion to drive up revenue and engagement.
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
A secondary goal is to investigate whether news organizations have been uniformly polarized, or if one pole has been 'moving' more rapidly away from the 'middle'.
This analysis will probe the degree to which the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
Noam Chomsky's hypothesis of manufactured consent is beyond the scope of this paper, so we will restrict our analysis to the presence of an agenda rather than its cause.
\begin{multicols}{2}
\section{Data Preparation}
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://mememorandum.com}{Memeorandum} for news stories from 2006 to 2022.
Each news article has a title, author, description, publisher, publish date, URL and related discussions.
The site also has a concept of references, where a main, popular story may be covered by other sources.
This link association might be used to support one or more of the hypotheses of the main analysis.
After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
What remains after these cleaning steps is approximately 6,400 days of material: 300,000 distinct headlines from 21,000 publishers and 34,000 authors, all of which are used in the study.
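As a minimal sketch of the deduplication and normalization step, assume each scraped record is a dictionary with \texttt{title} and \texttt{url} keys.
These key names, the function names and the sample records are hypothetical, since the scraper's actual schema is not described here.
Titles are compared after collapsing case and whitespace, and the first occurrence of each pair is kept:
\begin{verbatim}
```python
import re


def normalize(text):
    """Lowercase, trim, and collapse whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records):
    """Keep the first record per (title, url) pair.

    The "title" and "url" keys are assumed
    for illustration.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (normalize(rec["title"]),
               rec["url"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up records, not taken from the dataset.
records = [
    {"title": "Senate  passes bill",
     "url": "https://a.example/1"},
    {"title": "senate passes bill ",
     "url": "https://a.example/1"},
    {"title": "Markets rally",
     "url": "https://b.example/2"},
]
unique = deduplicate(records)
```
\end{verbatim}
On the three sample records this keeps two, because the first two differ only in case and spacing.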
\section{Missing Data Policy}
The largest data issue that will have to be dealt with is news organizations that share the same parent company but have slightly different names.
For example, Wall Street Journal news is drastically different from its opinion section.
Other organizations appear under slightly different names for the same outlet, which is a product of the aggregation service and not due to any real difference.
Luckily, most of the analysis operates on the content of the news headlines, which does not suffer from this data impurity.
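One way to reconcile such naming variants is a hand-maintained alias table applied after basic string cleanup.
The sketch below is illustrative only: the alias entries and the function name are hypothetical, not taken from the dataset, and the table deliberately leaves news and opinion sections as separate publishers.
\begin{verbatim}
```python
import re

# Hypothetical alias table; real entries would
# come from inspecting the scraped names.
ALIASES = {
    "wsj": "wall street journal",
    "wall st. journal": "wall street journal",
}


def canonical_publisher(name):
    """Map a raw publisher name to one form."""
    cleaned = re.sub(r"\s+", " ", name)
    cleaned = cleaned.strip().lower()
    # Drop a leading article such as "The ".
    cleaned = re.sub(r"^the ", "", cleaned)
    return ALIASES.get(cleaned, cleaned)
```
\end{verbatim}
Names that match no alias pass through unchanged after cleanup, so an unexpected variant is never silently merged into another outlet.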
\section{Classification Task}
The classification of news titles into emotional categories was accomplished by using a pretrained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
The class for each article will be derived by tokenizing the title, running the model over the tokens, and taking the class with the largest probability from the output.
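In symbols, using notation introduced here for illustration, if the model assigns probability $p_k(t)$ to emotion class $k$ for a title $t$, the assigned label is
\[
  \hat{y}(t) = \arg\max_{k \in \{1, \dots, 28\}} p_k(t).
\]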
The data has been discretized into years.
Additionally, the publishers will be discretized based on either principal component analysis of link similarity or the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{All Sides}.
Given that the features of the dataset are sparse, no attributes are expected to be useless, unless the original hypothesis of a temporal trend proves to be false.
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
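To make the yearly discretization concrete, the sketch below buckets labeled headlines by publication year and reports the share in each year whose top class is not \texttt{neutral}.
Treating the non-neutral share as a proxy for how emotionally charged a year is an assumption made here for illustration, as are the function name, the input layout and the sample rows.
\begin{verbatim}
```python
from collections import Counter


def charged_share_by_year(rows):
    """rows: (year, label) pairs (assumed layout).

    Returns {year: share not labeled "neutral"}.
    """
    total = Counter()
    charged = Counter()
    for year, label in rows:
        total[year] += 1
        if label != "neutral":
            charged[year] += 1
    return {y: charged[y] / total[y]
            for y in sorted(total)}


# Made-up rows, not results from the study.
rows = [
    (2006, "neutral"),
    (2006, "anger"),
    (2022, "fear"),
    (2022, "anger"),
]
shares = charged_share_by_year(rows)
```
\end{verbatim}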
\section{Experiments}
No computational experiments have been done yet.
Generating the tokenized text, the word embedding and the emotional sentiment analysis have made up the bulk of the work thus far.
The bias ratings do not cover all publishers in the dataset, so the number of articles without a bias rating from their publisher will have to be calculated.
If it is less than 30\% of the articles, it might not make sense to use the bias ratings.
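The unrated share can be computed directly from the publisher of each article.
The following is a sketch under assumptions: the function name, the list-of-publishers input and the sample values are hypothetical, and the ratings are represented as a plain dictionary keyed by publisher name.
\begin{verbatim}
```python
def unrated_share(article_publishers, ratings):
    """Share of articles with an unrated publisher."""
    if not article_publishers:
        return 0.0
    missing = sum(
        1 for pub in article_publishers
        if pub not in ratings
    )
    return missing / len(article_publishers)


# Made-up inputs for illustration only.
ratings = {"outlet a": "left",
           "outlet b": "center"}
publishers = ["outlet a", "outlet c",
              "outlet b", "outlet c"]
share = unrated_share(publishers, ratings)
```
\end{verbatim}
Here two of the four sample articles come from a publisher with no rating, so the unrated share is one half.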
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationship between related publishers.
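As a sketch of that reduction, with notation introduced here rather than taken from the study, let $A$ be an $n \times n$ publisher link matrix whose entry $A_{ij}$ counts, for example, how often publisher $i$ appears as a reference to a story from publisher $j$, and let $\mu$ be the vector of column means of $A$.
Principal component analysis centers the columns and takes a singular value decomposition,
\[
  \tilde{A} = A - \mathbf{1}\mu^{T}, \qquad \tilde{A} = U \Sigma V^{T},
\]
and the $k$-dimensional coordinates of the publishers are the rows of
\[
  Z = \tilde{A} V_k = U_k \Sigma_k,
\]
where $V_k$ and $U_k$ hold the first $k$ columns of $V$ and $U$, and $\Sigma_k$ is the leading $k \times k$ block of $\Sigma$.
Taking $k = 2$ would place each publisher at a point in the plane for plotting.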
\section{Results}
\textbf{TODO.}
\end{multicols}
\end{document}