finish final notecard.
This commit is contained in:
parent
38c49616a2
commit
a21ed7a7d9
Binary file not shown.
Binary file not shown.
|
@ -1,213 +0,0 @@
|
|||
\documentclass[sigconf,authorversion,nonacm]{acmart}
|
||||
|
||||
\begin{document}
|
||||
|
||||
\title{Political Polarization In Media Headlines}
|
||||
\subtitle{CSCI 577 - Data Mining}
|
||||
|
||||
\author{Matt Jensen}
|
||||
\email{contact@publicmatt.com}
|
||||
\affiliation{%
|
||||
\institution{Western Washington University}
|
||||
\streetaddress{516 High St.}
|
||||
\city{Bellingham}
|
||||
\state{Washington}
|
||||
\country{USA}
|
||||
\postcode{98225}
|
||||
}
|
||||
|
||||
\renewcommand{\shortauthors}{Jensen, et al.}
|
||||
|
||||
\begin{abstract}
|
||||
Political polarization in the United States has increased in recent years according to studies \cite{stewart_polarization_2020}.
|
||||
A number of polling methods and data sources have been used to track this phenomenon \cite{prior_media_2013}.
|
||||
A causal link between polarization and partisanship in elections and the community has been hard to establish.
|
||||
One possible cause is the media diet of the average American.
|
||||
In particular, the medium of consumption has shifted online and the range of sources has widened considerably.
|
||||
In an effort to quantify the range of online media, a study of online news article headlines was conducted.
|
||||
It found that titles with emotionally neutral wording have decreased as a share of all articles over time.
|
||||
A model was built to classify titles using BERT-style word embeddings and a simple classifier.
|
||||
|
||||
\end{abstract}
|
||||
|
||||
\keywords{data mining, datasets, classification, clustering, neural networks}
|
||||
|
||||
\received{4 April 2023}
|
||||
\received[revised]{9 June 2023}
|
||||
|
||||
\maketitle
|
||||
|
||||
\section{Background}
|
||||
|
||||
Media and news publishers have been accused of polarizing discussion to drive up revenue and engagement.
|
||||
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
|
||||
A secondary goal is to investigate whether news organizations have been uniformly polarized, or if one pole has been 'moving' more rapidly away from the 'middle'.
|
||||
This analysis will probe to what degree the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
|
||||
Noam Chomsky's hypothesis about manufactured consent is beyond the scope of this paper, so we restrict our analysis to the presence of an agenda rather than its cause.
|
||||
|
||||
There is evidence supporting an increase in political polarization in the United States over the past 16 years.
|
||||
A number of studies have attempted to measure and explain this phenomenon \cite{flaxman_filter_2016}.
|
||||
|
||||
These studies attempt to link increased media options and a decrease in the proportion of less engaged and less partisan voters.
|
||||
This drop in less engaged voters might explain the increased partisanship in elections.
|
||||
However, the evidence regarding a direct causal relationship between partisan media messages and changes in attitudes or behaviors is inconclusive.
|
||||
Directly measuring the causal relationship between media messages and behavior is difficult.
|
||||
There is currently no solid evidence to support the claim that partisan media outlets are causing average Americans to become more partisan.
|
||||
|
||||
The number of media publishers has increased over time, including in this particular data set.
|
||||
|
||||
These studies rest on the assumption that media outlets are becoming more partisan.
|
||||
We study this assumption in detail.
|
||||
|
||||
Party Sorting: Over the past few decades, there has been a significant increase in party sorting, where Democrats have become more ideologically liberal, and Republicans have become more ideologically conservative.
|
||||
This trend indicates a growing gap between the two major political parties.
|
||||
A study published in the journal American Political Science Review in 2018 found that party sorting increased significantly between 2004 and 2016.
|
||||
|
||||
Congressional Polarization: There has been a substantial increase in polarization among members of the U.S. Congress. Studies analyzing voting patterns and ideological positions of legislators have consistently shown a widening gap between Democrats and Republicans.
|
||||
The Pew Research Center reported that the median Democrat and the median Republican in Congress have become further apart ideologically between 2004 and 2017.
|
||||
|
||||
Public Opinion: Surveys and polls also provide evidence of increasing political polarization among the American public.
|
||||
According to a study conducted by Pew Research Center in 2017, the gap between Republicans and Democrats on key policy issues, such as immigration, the environment, and social issues, has widened significantly since 1994.
|
||||
|
||||
Media Fragmentation: The rise of social media and digital media platforms has contributed to the fragmentation of media consumption, leading to the creation of ideological echo chambers.
|
||||
Individuals are more likely to consume news and information that aligns with their pre-existing beliefs, reinforcing and intensifying polarization.
|
||||
|
||||
Increased Negative Attitudes: Studies have shown that Americans' attitudes towards members of the opposing political party have become increasingly negative. The Pew Research Center reported in 2016 that negative feelings towards the opposing party have doubled since the late 1990s, indicating a deepening divide.
|
||||
|
||||
\begin{itemize}
\item Memeorandum: \textbf{stories}
\item AllSides: \textbf{bias}
\item HuggingFace: \textbf{sentiment}
\item ChatGPT: \textbf{election dates}
\end{itemize}
|
||||
\section{Data Sources}
|
||||
|
||||
All data was collected over the course of 2023.
|
||||
|
||||
\begin{table}
|
||||
\label{tab:freq}
|
||||
\caption{News Dataset Sources}
|
||||
\label{tab:sources}
|
||||
\begin{tabular}{ll}
|
||||
\toprule
|
||||
Source & Description \\
|
||||
\midrule
|
||||
Memeorandum & News aggregation service. \\
|
||||
AllSides & Bias evaluator. \\
|
||||
MediaBiasFactCheck & Bias evaluator. \\
|
||||
HuggingFace & Classification model repository. \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
\section{Data Preparation}
|
||||
|
||||
\subsection{Memeorandum}
|
||||
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://mememorandum.com}{Memeorandum} for news stories from 2006 to 2022.
|
||||
Each news article has a title, author, description, publisher, publish date and url.
|
||||
All of these are non-numeric, except for the publication date which is ordinal.
|
||||
The site also has a concept of references, where a main, popular story may be covered by other sources.
|
||||
Using an archive of the website, each day's headlines were downloaded and parsed using Python, then normalized and stored in SQLite database tables \cite{jensen_data_2023-1}.
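The normalize-and-store step can be illustrated with a minimal sketch. The table names, column names, and the \texttt{store\_story} helper below are hypothetical and are not taken from the study's code; the sketch only assumes that publishers are split into their own table and that a story's URL identifies it uniquely.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the study's actual tables.
SCHEMA = """
CREATE TABLE publishers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE stories (
    id INTEGER PRIMARY KEY,
    publisher_id INTEGER NOT NULL REFERENCES publishers(id),
    title TEXT NOT NULL,
    author TEXT,
    url TEXT NOT NULL UNIQUE,
    published_at TEXT NOT NULL
);
"""


def store_story(conn, publisher, title, author, url, published_at):
    """Insert one parsed headline, reusing the publisher row if it exists."""
    conn.execute("INSERT OR IGNORE INTO publishers (name) VALUES (?)", (publisher,))
    (publisher_id,) = conn.execute(
        "SELECT id FROM publishers WHERE name = ?", (publisher,)
    ).fetchone()
    # The UNIQUE constraint on url drops a story that was already stored.
    conn.execute(
        "INSERT OR IGNORE INTO stories (publisher_id, title, author, url, published_at)"
        " VALUES (?, ?, ?, ?, ?)",
        (publisher_id, title, author, url, published_at),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Toy rows: the same story seen on two consecutive days, plus a second story.
store_story(conn, "Example Times", "Headline A", "A. Writer", "https://example.com/a", "2006-01-02")
store_story(conn, "Example Times", "Headline A", "A. Writer", "https://example.com/a", "2006-01-03")
store_story(conn, "Example Times", "Headline B", None, "https://example.com/b", "2006-01-03")
n_publishers = conn.execute("SELECT COUNT(*) FROM publishers").fetchone()[0]
n_stories = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
```

Keeping publishers in a separate table is one way to avoid repeating the publisher name on every story row; other layouts would work equally well.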
|
||||
|
||||
\subsection{AllSides and MediaBiasFactCheck}
|
||||
|
||||
|
||||
|
||||
What remains after cleaning is approximately 240,000 headlines from 1,700 publishers and 34,000 authors over about 6,400 days (Table~\ref{tab:1}).
|
||||
|
||||
\begin{table}
|
||||
\label{tab:freq}
|
||||
\caption{News Dataset Statistics After Cleaning}
|
||||
\label{tab:1}
|
||||
\begin{tabular}{ll}
|
||||
\toprule
|
||||
stat & value \\
|
||||
\midrule
|
||||
publishers & 1,735 \\
|
||||
stories & 242,343 \\
|
||||
authors & 34,346 \\
|
||||
children & 808,628 \\
|
||||
date range & 2006-2022 \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
\subsection{Missing Data Policy}
|
||||
|
||||
The only news headlines used in this study were those with an associated bias rating from either AllSides or MediaBiasFactCheck.
|
||||
This eliminated about 5,300 publishers and 50,000 headlines, which came from outlets that published less than one story per year.
|
||||
Another consideration was the relationship between the opinion and news sections of organizations.
|
||||
MediaBiasFactCheck makes a distinction between, for example, the Wall Street Journal's news organization, which it rates as 'Least Bias', and the Wall Street Journal's opinion organization, which it rates as 'Right'.
|
||||
Due to the nature of the Memeorandum dataset, and the way that organizations design their URL structure, this study was not able to parse the headlines into news, opinion, blogs or other sub-categories recognized by the bias datasets.
|
||||
As such, news and opinion were combined under the same bias rating, and the rating with the most articles published was taken as the default value.
|
||||
This might cause organizations with large newsrooms to skew toward the center in the dataset.
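The "rating with the most articles" rule described above can be sketched as follows. The input format (one publisher, rating, article-count triple per rated section) and the counts are assumptions made for illustration; the source does not say how the per-rating article counts were obtained.

```python
from collections import Counter, defaultdict


def default_bias(section_counts):
    """Pick, for each publisher, the rating that covers the most articles.

    section_counts: iterable of (publisher, rating, n_articles) triples,
    one per rated section of an outlet (hypothetical input format).
    """
    totals = defaultdict(Counter)
    for publisher, rating, n_articles in section_counts:
        totals[publisher][rating] += n_articles
    # most_common(1) returns the single (rating, count) pair with the top count.
    return {pub: counts.most_common(1)[0][0] for pub, counts in totals.items()}


# Toy counts, invented for the example.
sections = [
    ("Wall Street Journal", "Least Bias", 900),
    ("Wall Street Journal", "Right", 300),
    ("Example Blog", "Left", 40),
]
ratings = default_bias(sections)
```

With these toy counts the larger news section wins, which is the centering effect the paragraph above warns about.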
|
||||
|
||||
|
||||
\section{Experiments}
|
||||
|
||||
\subsection{Link Similarity Clustering and Classification}
|
||||
|
||||
\subsection{Title Sentiment Classification}
|
||||
|
||||
For every title: tokenize, then classify.
|
||||
|
||||
The classification of news titles into emotional categories was accomplished by using a pre-trained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
|
||||
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
|
||||
The class for each article is derived by tokenizing the title, running the model over the tokens, and taking the class with the largest probability from the output.
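The final step, selecting the largest-probability class, can be sketched without loading the model. The label list below is a hypothetical four-label subset of the 28 emotions, and the logits are made-up numbers standing in for the model's output for one title.

```python
import math

# Hypothetical subset of labels; the model described above distinguishes 28.
LABELS = ["neutral", "anger", "joy", "fear"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_class(logits, labels=LABELS):
    """Return the label with the largest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for a single title.
label, prob = top_class([0.2, 2.5, -1.0, 0.3])
```

Keeping the probability alongside the label would allow low-confidence titles to be filtered later, although the source does not state that this was done.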
|
||||
|
||||
The data has been discretized into years.
|
||||
Additionally, the publishers will have been discretized based on either principal component analysis on link similarity or the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{AllSides}.
|
||||
Given that the features of the dataset are sparse, no attribute is expected to be useless, unless the original hypothesis of a temporal trend proves to be false.
|
||||
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
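Once titles carry an emotion label and dates are discretized into years, the share of neutral titles per year reduces to a pair of counters. The sketch below assumes rows of (ISO date string, label) pairs; the rows are toy data, not results from the study.

```python
from collections import Counter


def neutral_share_by_year(rows, neutral_label="neutral"):
    """Return {year: fraction of that year's titles labelled neutral}.

    rows: iterable of (publish_date, label) with dates as 'YYYY-MM-DD'.
    """
    totals, neutral = Counter(), Counter()
    for publish_date, label in rows:
        year = int(publish_date[:4])  # discretize the date into its year
        totals[year] += 1
        if label == neutral_label:
            neutral[year] += 1
    return {year: neutral[year] / totals[year] for year in sorted(totals)}


# Toy rows, invented for the example.
rows = [
    ("2006-03-01", "neutral"),
    ("2006-07-15", "neutral"),
    ("2006-11-30", "anger"),
    ("2022-01-10", "neutral"),
    ("2022-05-05", "fear"),
]
shares = neutral_share_by_year(rows)
```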
|
||||
|
||||
No computational experiments have been done yet.
|
||||
Generating the tokenized text, the word embedding and the emotional sentiment analysis have made up the bulk of the work thus far.
|
||||
The bias ratings do not cover all publishers in the dataset, so the number of articles without a bias rating from their publisher will have to be calculated.
|
||||
If fewer than 30\% of the articles come from a rated publisher, it might not make sense to use the bias ratings.
|
||||
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationship between related publishers.
|
||||
|
||||
|
||||
\section{Results}
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/articles_per_year.png}
|
||||
\caption{Articles per year.}
|
||||
\Description{descriptive statistics on the news data source}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/bias_vs_sentiment_over_time.png}
|
||||
\caption{Sentiment vs. bias over time}
|
||||
\Description{Time series classification of news title sentiment and bias}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/link_pca_clusters_onehot.png}
|
||||
\caption{kNN confusion matrix of related links adjacency matrix}
|
||||
\Description{}
|
||||
\end{figure}
|
||||
|
||||
|
||||
% \section{Math Equations}
|
||||
|
||||
% \begin{equation}
|
||||
% \sum_{i=0}^{\infty}x_i=\int_{0}^{\pi+2} f
|
||||
% \end{equation}
|
||||
|
||||
|
||||
\begin{acks}
|
||||
To Dr. Hearne, for the instruction on clustering and classification techniques, and to Pax Newman for the discussion on word embeddings.
|
||||
\end{acks}
|
||||
|
||||
\bibliographystyle{ACM-Reference-Format}
|
||||
\bibliography{data_mining_577}
|
||||
|
||||
\appendix
|
||||
|
||||
\section{Online Resources}
|
||||
|
||||
The source code for the study is available on GitHub \cite{jensen_data_2023}.
|
||||
|
||||
\end{document}
|
||||
\endinput
|
|
@ -348,7 +348,31 @@ Frequency with which the items in S occur together in the database.
|
|||
where n is the number of transactions in the database.
|
||||
|
||||
|
||||
\subsubsection{Uses}
|
||||
\subsection{APRIORI}
|
||||
|
||||
\paragraph{Pseudo-code}
|
||||
\begin{enumerate}
|
||||
\item Create $L_{1} =$ set of supported itemsets of cardinality one.
|
||||
\item Set $k = 2$.
|
||||
\item While $L_{k-1} \neq \emptyset$:
|
||||
\begin{enumerate}
|
||||
\item Create $C_{k}$ from $L_{k-1}$.
|
||||
\item Prune all the itemsets in $C_{k}$ that are not supported, to create $L_{k}$.
|
||||
\item Increase $k$ by $1$.
|
||||
\end{enumerate}
|
||||
\item The set of all supported itemsets is $L_1 \cup L_2 \cup \cdots \cup L_k$.
|
||||
\end{enumerate}
|
||||
|
||||
To start the process we construct $C_1$.
|
||||
\begin{enumerate}
|
||||
\item Take the set of all itemsets comprising just a single item.
|
||||
\item Make a pass through the database counting the number of transactions that match each of these itemsets.
|
||||
\item Divide these counts by the number of transactions in the database to obtain each itemset's support.
|
||||
\item Check each single-element itemset against minsup.
|
||||
\item Discard all those with $\text{support} < \text{minsup}$ to yield $L_1$.
|
||||
\item Continue with $k = 2, 3, \dots$ until $L_k$ is empty.
|
||||
\end{enumerate}
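The pseudo-code and the construction of $L_1$ above can be sketched in Python as follows. This is a minimal, unoptimized illustration: the function name, the join used to build $C_k$ (unions of two members of $L_{k-1}$ that have exactly $k$ items), and the toy transactions are choices made for the example rather than part of the notes.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    def keep_supported(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= minsup}

    # L_1: supported itemsets of cardinality one.
    items = {item for t in transactions for item in t}
    level = keep_supported(frozenset([item]) for item in items)
    supported = dict(level)

    k = 2
    while level:  # loop while L_{k-1} is non-empty
        prev = set(level)
        # C_k: join members of L_{k-1}, keeping unions with exactly k items ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... whose (k-1)-subsets are all supported.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = keep_supported(candidates)  # L_k
        supported.update(level)
        k += 1
    return supported


# Toy database of five transactions.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(db, minsup=0.5)
```

In this toy database every single item and every pair is supported, while the triple appears in only two of the five transactions and is pruned.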
|
||||
|
||||
|
||||
\subsection{Confidence}
|
||||
|
||||
|
@ -417,16 +441,22 @@ In some cases a rule with higher support and lower lift can be more interesting
|
|||
|
||||
\subsection{Rules Possible}
|
||||
|
||||
The number of ways of selecting $i$ items from the $k$ in a supported itemset of cardinality $k$ for the right-hand side of a rule is given by:
|
||||
\begin{displaymath}
|
||||
\sideset{_k}{_i}C
|
||||
\sideset{_i}{_k}C
|
||||
\end{displaymath}
|
||||
|
||||
or
|
||||
Total number of rules:
|
||||
|
||||
\begin{displaymath}
|
||||
\sideset{_k}{_{k-1}}C
|
||||
\end{displaymath}
|
||||
|
||||
\begin{displaymath}
|
||||
2^{k} - 2
|
||||
\end{displaymath}
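
The total above follows from the binomial theorem, as this short derivation shows. It assumes that the right-hand side of a rule may contain between $1$ and $k-1$ of the $k$ items, so that neither side of the rule is empty:

\begin{displaymath}
\sum_{i=1}^{k-1} \binom{k}{i} = \sum_{i=0}^{k} \binom{k}{i} - \binom{k}{0} - \binom{k}{k} = 2^{k} - 2
\end{displaymath}

For example, a supported itemset with $k = 3$ items yields $2^{3} - 2 = 6$ possible rules.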
|
||||
|
||||
For example, if $L_1$ contains 800 supported itemsets, then $C_2$ contains $800 \times \frac{799}{2} = 319{,}600$ candidate itemsets.
|
||||
|
||||
\end{document}
|
||||
\endinput
|
||||
|
|
BIN
docs/paper.pdf
Binary file not shown.
216
docs/paper.tex
|
@ -1,129 +1,213 @@
|
|||
\documentclass{article}
|
||||
\usepackage{multicol,caption}
|
||||
\usepackage{hyperref}
|
||||
\usepackage{caption}
|
||||
\usepackage{subcaption}
|
||||
\usepackage{graphicx}
|
||||
\usepackage{fancyvrb}
|
||||
\usepackage[utf8]{inputenc}
|
||||
\bibliographystyle{acm}
|
||||
|
||||
\newenvironment{Figure}
|
||||
{\par\medskip\noindent\minipage{\linewidth}}
|
||||
{\endminipage\par\medskip}
|
||||
|
||||
\title{Data Mining CS 571}
|
||||
\author{Matt Jensen}
|
||||
\date{2023-04-25}
|
||||
\documentclass[sigconf,authorversion,nonacm]{acmart}
|
||||
|
||||
\begin{document}
|
||||
|
||||
\title{Political Polarization In Media Headlines}
|
||||
\subtitle{CSCI 577 - Data Mining}
|
||||
|
||||
\author{Matt Jensen}
|
||||
\email{contact@publicmatt.com}
|
||||
\affiliation{%
|
||||
\institution{Western Washington University}
|
||||
\streetaddress{516 High St.}
|
||||
\city{Bellingham}
|
||||
\state{Washington}
|
||||
\country{USA}
|
||||
\postcode{98225}
|
||||
}
|
||||
|
||||
\renewcommand{\shortauthors}{Jensen, et al.}
|
||||
|
||||
\begin{abstract}
|
||||
Political polarization in the United States has increased in recent years according to studies \cite{stewart_polarization_2020}.
|
||||
A number of polling methods and data sources have been used to track this phenomenon \cite{prior_media_2013}.
|
||||
A causal link between polarization and partisanship in elections and the community has been hard to establish.
|
||||
One possible cause is the media diet of the average American.
|
||||
In particular, the medium of consumption has shifted online and the range of sources has widened considerably.
|
||||
In an effort to quantify the range of online media, a study of online news article headlines was conducted.
|
||||
It found that titles with emotionally neutral wording have decreased as a share of all articles over time.
|
||||
A model was built to classify titles using BERT-style word embeddings and a simple classifier.
|
||||
|
||||
\end{abstract}
|
||||
|
||||
\keywords{data mining, datasets, classification, clustering, neural networks}
|
||||
|
||||
\received{4 April 2023}
|
||||
\received[revised]{9 June 2023}
|
||||
|
||||
\maketitle
|
||||
|
||||
\section*{Abstract}
|
||||
\section{Background}
|
||||
|
||||
News organizations have been repeatedly accused of being partisan.
|
||||
Additionally, they have been accused of polarizing discussion to drive up revenue and engagement.
|
||||
Media and news publishers have been accused of polarizing discussion to drive up revenue and engagement.
|
||||
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
|
||||
A secondary goal is to investigate whether news organizations have been uniformly polarized, or if one pole has been 'moving' more rapidly away from the 'middle'.
|
||||
This analysis will probe to what degree the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
|
||||
Noam Chomsky's hypothesis about manufactured consent is beyond the scope of this paper, so we restrict our analysis to the presence of an agenda rather than its cause.
|
||||
|
||||
|
||||
\begin{multicols}{2}
|
||||
|
||||
\section{Background}
|
||||
|
||||
There is evidence supporting an increase in political polarization in the United States over the past 16 years.
|
||||
There have been a number of studies conducted in an attempt to measure and explain this phenomenon. \cite{stewart_polarization_2020} \cite{flaxman_filter_2016}
|
||||
A number of studies have attempted to measure and explain this phenomenon \cite{flaxman_filter_2016}.
|
||||
|
||||
These studies attempt to link increased media options and a decrease in the proportion of less engaged and less partisan voters.
|
||||
This drop in less engaged voters might explain the increased partisanship in elections.
|
||||
However, the evidence regarding a direct causal relationship between partisan media messages and changes in attitudes or behaviors is inconclusive.
|
||||
Directly measuring the causal relationship between media messages and behavior is difficult \cite{prior_media_2013}.
|
||||
Directly measuring the causal relationship between media messages and behavior is difficult.
|
||||
There is currently no solid evidence to support the claim that partisan media outlets are causing average Americans to become more partisan.
|
||||
|
||||
The number of media publishers has increased and in this particular data set:
|
||||
|
||||
\begin{Figure}
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/distinct_publishers.png}
|
||||
\captionof{figure}{Publishers Per Year}
|
||||
\end{Figure}
|
||||
|
||||
These studies rest on the assumption that media outlets are becoming more partisan.
|
||||
We study this assumption in detail.
|
||||
|
||||
Party Sorting: Over the past few decades, there has been a significant increase in party sorting, where Democrats have become more ideologically liberal, and Republicans have become more ideologically conservative. This trend indicates a growing gap between the two major political parties. A study published in the journal American Political Science Review in 2018 found that party sorting increased significantly between 2004 and 2016.
|
||||
Party Sorting: Over the past few decades, there has been a significant increase in party sorting, where Democrats have become more ideologically liberal, and Republicans have become more ideologically conservative.
|
||||
This trend indicates a growing gap between the two major political parties.
|
||||
A study published in the journal American Political Science Review in 2018 found that party sorting increased significantly between 2004 and 2016.
|
||||
|
||||
Congressional Polarization: There has been a substantial increase in polarization among members of the U.S. Congress. Studies analyzing voting patterns and ideological positions of legislators have consistently shown a widening gap between Democrats and Republicans. The Pew Research Center reported that the median Democrat and the median Republican in Congress have become further apart ideologically between 2004 and 2017.
|
||||
Congressional Polarization: There has been a substantial increase in polarization among members of the U.S. Congress. Studies analyzing voting patterns and ideological positions of legislators have consistently shown a widening gap between Democrats and Republicans.
|
||||
The Pew Research Center reported that the median Democrat and the median Republican in Congress have become further apart ideologically between 2004 and 2017.
|
||||
|
||||
Public Opinion: Surveys and polls also provide evidence of increasing political polarization among the American public. According to a study conducted by Pew Research Center in 2017, the gap between Republicans and Democrats on key policy issues, such as immigration, the environment, and social issues, has widened significantly since 1994.
|
||||
Public Opinion: Surveys and polls also provide evidence of increasing political polarization among the American public.
|
||||
According to a study conducted by Pew Research Center in 2017, the gap between Republicans and Democrats on key policy issues, such as immigration, the environment, and social issues, has widened significantly since 1994.
|
||||
|
||||
Media Fragmentation: The rise of social media and digital media platforms has contributed to the fragmentation of media consumption, leading to the creation of ideological echo chambers. Individuals are more likely to consume news and information that aligns with their pre-existing beliefs, reinforcing and intensifying polarization.
|
||||
Media Fragmentation: The rise of social media and digital media platforms has contributed to the fragmentation of media consumption, leading to the creation of ideological echo chambers.
|
||||
Individuals are more likely to consume news and information that aligns with their pre-existing beliefs, reinforcing and intensifying polarization.
|
||||
|
||||
Increased Negative Attitudes: Studies have shown that Americans' attitudes towards members of the opposing political party have become increasingly negative. The Pew Research Center reported in 2016 that negative feelings towards the opposing party have doubled since the late 1990s, indicating a deepening divide.
|
||||
|
||||
\section{Data Preparation}
|
||||
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://mememorandum.com}{Memeorandum} for news stories from 2006 to 2022.
|
||||
Each news article has a title, author, description, publisher, publish date, url and related discussions \ref{tab:1}.
|
||||
The site also has a concept of references, where a main, popular story may be covered by other sources.
|
||||
This link association might be used to support one or more of the hypotheses of the main analysis.
|
||||
After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
|
||||
\begin{itemize}
\item Memeorandum: \textbf{stories}
\item AllSides: \textbf{bias}
\item HuggingFace: \textbf{sentiment}
\item ChatGPT: \textbf{election dates}
\end{itemize}
|
||||
\section{Data Sources}
|
||||
|
||||
What remains after these cleaning steps is approximately 6,400 days of material, 300,000 distinct headlines from 21,000 publishers and 34,000 authors used in the study.
|
||||
All data was collected over the course of 2023.
|
||||
|
||||
\begin{center}
|
||||
\begin{table}
|
||||
\label{tab:freq}
|
||||
\caption{News Dataset Sources}
|
||||
\label{tab:sources}
|
||||
\begin{tabular}{ll}
|
||||
\toprule
|
||||
Source & Description \\
|
||||
\midrule
|
||||
Memeorandum & News aggregation service. \\
|
||||
AllSides & Bias evaluator. \\
|
||||
MediaBiasFactCheck & Bias evaluator. \\
|
||||
HuggingFace & Classification model repository. \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
\section{Data Preparation}
|
||||
|
||||
\subsection{Memeorandum}
|
||||
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://mememorandum.com}{Memeorandum} for news stories from 2006 to 2022.
|
||||
Each news article has a title, author, description, publisher, publish date and url.
|
||||
All of these are non-numeric, except for the publication date which is ordinal.
|
||||
The site also has a concept of references, where a main, popular story may be covered by other sources.
|
||||
Using an archive of the website, each day's headlines were downloaded and parsed using Python, then normalized and stored in SQLite database tables \cite{jensen_data_2023-1}.
|
||||
|
||||
\subsection{AllSides and MediaBiasFactCheck}
|
||||
|
||||
|
||||
|
||||
What remains after cleaning is approximately 240,000 headlines from 1,700 publishers and 34,000 authors over about 6,400 days (Table~\ref{tab:1}).
|
||||
|
||||
\begin{table}
|
||||
\label{tab:freq}
|
||||
\caption{News Dataset Statistics After Cleaning}
|
||||
\label{tab:1}
|
||||
\begin{tabular}{ll}
|
||||
\toprule
|
||||
stat & value \\
|
||||
\midrule
|
||||
publishers & 1,735 \\
|
||||
stories & 242,343 \\
|
||||
authors & 34,346 \\
|
||||
children & 808,628 \\
|
||||
date range & 2006-2022
|
||||
date range & 2006-2022 \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\captionof{table}{dataset statistics}
|
||||
\label{tab:1}
|
||||
\end{center}
|
||||
\end{table}
|
||||
|
||||
\subsection{Missing Data Policy}
|
||||
|
||||
The only news headlines used in this study were those with an associated bias rating from either AllSides or MediaBiasFactCheck.
|
||||
This eliminated about 5,300 publishers and 50,000 headlines, which came from outlets that published less than one story per year.
|
||||
Another consideration was the relationship between the opinion and news sections of organizations.
|
||||
MediaBiasFactCheck makes a distinction between, for example, the Wall Street Journal's news organization, which it rates as 'Least Bias', and the Wall Street Journal's opinion organization, which it rates as 'Right'.
|
||||
Due to the nature of the Memeorandum dataset, and the way that organizations design their URL structure, this study was not able to parse the headlines into news, opinion, blogs or other sub-categories recognized by the bias datasets.
|
||||
As such, news and opinion were combined under the same bias rating, and the rating with the most articles published was taken as the default value.
|
||||
This might cause organizations with large newsrooms to skew toward the center in the dataset.
|
||||
|
||||
|
||||
\section{Missing Data Policy}
|
||||
\section{Experiments}
|
||||
|
||||
The largest data policy that will have to be dealt with is news organizations that share the same parent company, but might have slightly different names.
|
||||
Wall Street Journal news is drastically different than their opinion section.
|
||||
Other organizations appear under slightly different names for the same outlet, which is a product of the aggregation service and not due to any real difference.
|
||||
Luckily, most of the analysis operates on the content of the news headlines, which do not suffer from this data impurity.
|
||||
\subsection{Link Similarity Clustering and Classification}
|
||||
|
||||
\section{Classification Task}
|
||||
\subsection{Title Sentiment Classification}
|
||||
|
||||
The classification of news titles into emotional categories was accomplished by using a pretrained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
|
||||
For every title: tokenize, then classify.
|
||||
|
||||
The classification of news titles into emotional categories was accomplished by using a pre-trained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
|
||||
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
|
||||
The classes for each article will be derived by tokenizing the title and running the model over the tokens, then grabbing the largest probability class from the output.
|
||||
The class for each article is derived by tokenizing the title, running the model over the tokens, and taking the class with the largest probability from the output.
|
||||
|
||||
The data has been discretized into years.
|
||||
Additionally, the publishers will have been discretized based on either principal component analysis on link similarity or the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{AllSides}.
|
||||
Given that the features of the dataset are sparse, no attribute is expected to be useless, unless the original hypothesis of a temporal trend proves to be false.
|
||||
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
|
||||
|
||||
\section{Experiments}
|
||||
|
||||
No computational experiments have been done yet.
|
||||
Generating the tokenized text, the word embedding and the emotional sentiment analysis have made up the bulk of the work thus far.
|
||||
The bias ratings do not cover all publishers in the dataset, so the number of articles without a bias rating from their publisher will have to be calculated.
|
||||
If fewer than 30\% of the articles come from a rated publisher, it might not make sense to use the bias ratings.
|
||||
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationship between related publishers.
|
||||
|
||||
|
||||
\section{Results}
|
||||
|
||||
\begin{Figure}
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/articles_per_year.png}
|
||||
\captionof{figure}{Three simple graphs}
|
||||
\end{Figure}
|
||||
\caption{Articles per year.}
|
||||
\Description{descriptive statistics on the news data source}
|
||||
\end{figure}
|
||||
|
||||
test
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/bias_vs_sentiment_over_time.png}
|
||||
\caption{Sentiment vs. bias over time}
|
||||
\Description{Time series classification of news title sentiment and bias}
|
||||
\end{figure}
|
||||
|
||||
\end{multicols}
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/link_pca_clusters_onehot.png}
|
||||
\caption{kNN confusion matrix of related links adjacency matrix}
|
||||
\Description{}
|
||||
\end{figure}
|
||||
|
||||
\newpage
|
||||
|
||||
\bibliography{data_mining_577.bib}
|
||||
% \section{Math Equations}
|
||||
|
||||
% \begin{equation}
|
||||
% \sum_{i=0}^{\infty}x_i=\int_{0}^{\pi+2} f
|
||||
% \end{equation}
|
||||
|
||||
|
||||
\begin{acks}
|
||||
To Dr. Hearne, for the instruction on clustering and classification techniques, and to Pax Newman for the discussion on word embeddings.
|
||||
\end{acks}
|
||||
|
||||
\bibliographystyle{ACM-Reference-Format}
|
||||
\bibliography{data_mining_577}
|
||||
|
||||
\appendix
|
||||
|
||||
\section{Online Resources}
|
||||
|
||||
The source code for the study is available on GitHub \cite{jensen_data_2023}.
|
||||
|
||||
\end{document}
|
||||
\endinput
|
||||
|
|