v1.0 of presentation.

matt
2023-05-17 13:38:07 -07:00
parent 4d93cf7adb
commit 74c2d8afa2
37 changed files with 1959 additions and 144 deletions

3
docs/Makefile Normal file

@@ -0,0 +1,3 @@
paper.pdf: paper.tex
	pdflatex $^
	evince $@

BIN
docs/figures/common_tld.png Normal file (binary image, 24 KiB; several other binary figure images added but not shown)

BIN
docs/paper.pdf Normal file (binary file, not shown)

61
docs/paper.tex Normal file

@@ -0,0 +1,61 @@
\documentclass{article}
\usepackage{multicol}
\usepackage{hyperref}
\title{Data Mining CS 571}
\author{Matt Jensen}
\date{2023-04-25}
\begin{document}
\maketitle
\section*{Abstract}
News organizations have been repeatedly accused of being partisan.
Additionally, they have been accused of polarizing discussion to drive up revenue and engagement.
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
A secondary goal is to investigate whether news organizations have been uniformly polarized, or whether one pole has been 'moving' more rapidly away from the 'middle'.
This analysis will probe to what degree the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
Noam Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of an agenda rather than its cause.
\begin{multicols}{2}
\section{Data Preparation}
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://memeorandum.com}{Memeorandum}, covering stories from 2006 to 2022.
Each news article has a title, author, description, publisher, publish date, url and related discussions.
The site also has a concept of references, where a main, popular story may be covered by other sources.
This link association might be used to support one or more of the hypotheses of the main analysis.
After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
What remains after these cleaning steps is approximately 6,400 days of material: 300,000 distinct headlines from 21,000 publishers and 34,000 authors.
\section{Missing Data Policy}
The largest missing-data issue to be dealt with is news organizations that share the same parent company but appear under slightly different names.
The Wall Street Journal's news coverage, for example, is drastically different from its opinion section.
Other organizations appear under slightly different names for the same outlet, a product of the aggregation service rather than any real difference.
Luckily, most of the analysis operates on the content of the news headlines, which do not suffer from this data impurity.
\section{Classification Task}
The classification of news titles into emotional categories was accomplished by using a pretrained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
The classes for each article will be derived by tokenizing the title, running the model over the tokens, and taking the largest-probability class from the output.
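Formally, for a headline $t$ the assigned class is $\hat{c} = \arg\max_{c \in C} p(c \mid t)$, where $C$ is the set of 28 emotion classes and $p(c \mid t)$ is the model's output probability for class $c$.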
The data has been discretized into years.
Additionally, the publishers will be discretized based either on principal component analysis of link similarity or on the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{AllSides}.
Given that the features of the dataset are sparse, no useless attributes are expected unless the original hypothesis of a temporal trend proves to be false.
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
\section{Experiments}
No computational experiments have been run yet.
Generating the tokenized text, the word embeddings and the emotional sentiment analysis has made up the bulk of the work thus far.
The bias ratings do not cover all publishers in the dataset, so the number of articles whose publisher lacks a bias rating will have to be calculated.
If it is less than 30\% of the articles, it might not make sense to use the bias ratings.
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationships between related publishers.
\section{Results}
\textbf{TODO.}
\end{multicols}
\end{document}

552
docs/presentation.md Normal file

@@ -0,0 +1,552 @@
_model: slides
---
title: CSCI 577 - Data Mining
---
body:
# Political Polarization
Matt Jensen
===
# Hypothesis
Political polarization is rising, and news articles are a proxy measure.
==
# Is this reasonable?
==
# Why is polarization rising?
Not my job, but there's research<sup>[ref](#references)</sup> to support it
==
# Sub-hypothesis
- The polarization increases near elections. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
==
# Sub-sub-hypothesis
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
===
# Data Source(s)
memeorandum.com <!-- .element: class="fragment" -->
allsides.com <!-- .element: class="fragment" -->
huggingface.co <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
===
# memeorandum.com
- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->
==
# Why Memeorandum?
- Behavioral: Sometimes I only read the titles (doom scrolling). <!-- .element: class="fragment" -->
- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element: class="fragment" -->
- Convenient: most publishers block bots. <!-- .element: class="fragment" -->
- Convenient: dead simple html to parse. <!-- .element: class="fragment" -->
- Archival: all headlines from 2006 forward. <!-- .element: class="fragment" -->
- Archival: automated, not editorialized. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
===
# AllSides.com
- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
  - blind bias surveys.
  - editorial reviews.
  - third party research.
  - community voting.
- Originally scraped the website, but eventually got direct access. <!-- .element: class="fragment" -->
==
# Why AllSides?
- Behavioral: One of the first Google results for bias APIs. <!-- .element: class="fragment" -->
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Convenient: Easy format. <!-- .element: class="fragment" -->
- Archival: Covers 1400 publishers. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
===
# HuggingFace.co
- Deep Learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
==
# Why HuggingFace?
- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
- Behavioral: The dataset needed more features.<!-- .element: class="fragment" -->
- Convenient: Literally 5 lines of python.<!-- .element: class="fragment" -->
- Convenient: Testing different model performance was easy.<!-- .element: class="fragment" -->
- Archival: Lots of pretrained classification tasks.<!-- .element: class="fragment" -->
===
# Data Structures
Stories
- Top level stories. <!-- .element: class="fragment" -->
  - title.
  - publisher.
  - author.
- Related discussion. <!-- .element: class="fragment" -->
  - publisher.
  - uses 'parent' story as a source.
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
==
# Data Structures
Bias
- Per publisher. <!-- .element: class="fragment" -->
  - name.
  - label.
  - agree/disagree vote by community.
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
==
# Data Structures
Embeddings
- Per story title. <!-- .element: class="fragment" -->
  - sentence embedding (n, 384).
  - sentiment classification (n, 1).
  - emotional classification (n, 1).
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
===
# Data Collection
==
# Data Collection
Story Scraper (simplified)
```python
from datetime import date, timedelta
from pathlib import Path
import requests

output_dir = Path('data/memeorandum')  # assumed output location
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    cur = cur + day
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
```
==
# Data Collection
Bias Scraper (hard)
```python
from lxml import etree
...
bias_html = DATA_DIR / 'allsides.html'  # DATA_DIR defined elsewhere
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')
ratings = []
for row in rows:
    rating = dict()
    ...
```
==
# Data Collection
Bias Scraper (easy)
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
==
# Data Collection
Embeddings (easy)
```python
from transformers import AutoTokenizer, AutoModel

# table = ...  (batches of story titles)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
for chunk in table:
    tokens = tokenizer(chunk, add_special_tokens=True, truncation=True, padding="max_length", max_length=92, return_attention_mask=True, return_tensors="pt")
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()  # (batch, tokens, hidden)
    ...
```
==
# Data Collection
Classification Embeddings (medium)
```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
===
# Data Selection
==
# Data Selection
Stories
- Clip the partial first and last years of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
==
# Data Selection
Publishers
- Combine subdomains of stories. <!-- .element: class="fragment" -->
  - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
  - This could be bad. For example: opinion.wsj.com != wsj.com.
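==
# Data Selection
Publishers (grouping sketch)
A minimal sketch of the subdomain collapsing, assuming the `tldextract` package (not necessarily what was used here):
```python
import tldextract

# blog.washingtonpost.com and washingtonpost.com collapse to the same key
print(tldextract.extract("blog.washingtonpost.com").registered_domain)  # washingtonpost.com
print(tldextract.extract("washingtonpost.com").registered_domain)       # washingtonpost.com
```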
==
# Data Selection
Links
- Select only stories from publishers that have had a 'parent' story at least once ('original publishers'). <!-- .element: class="fragment" -->
  - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
  - Eliminates siloed publications.
  - The link matrix becomes square and relatively low-dimensional.
==
# Data Selection
Bias
- Keep all ratings, even ones with a low agree/disagree ratio.
- Join datasets on publisher name.
  - Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
  - Started with Jaro-Winkler similarity, then matched manually from there.
- Use numeric values.
  - [left: -2, left-center: -1, ...]
===
# Descriptive Stats
Raw
| metric | value |
|:------------------|--------:|
| total stories | 299714 |
| total related | 960111 |
| publishers | 7031 |
| authors | 34346 |
| max year | 2023 |
| min year | 2005 |
| top level domains | 7063 |
==
# Descriptive Stats
Stories Per Publisher
![stories per publisher](https://studentweb.cs.wwu.edu/~jensen33/static/577/stories_per_publisher.png)
==
# Descriptive Stats
Top Publishers
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
==
# Descriptive Stats
Articles Per Year
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
==
# Descriptive Stats
Common TLDs
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
==
# Descriptive Stats
Post Process
| key | value |
|:------------------|--------:|
| total stories | 251553 |
| total related | 815183 |
| publishers | 223 |
| authors | 23809 |
| max year | 2022 |
| min year | 2006 |
| top level domains | 234 |
===
# Experiments
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
2. **classification** on link similarity. <!-- .element: class="fragment" -->
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
===
# Experiment 1
Setup
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- a linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
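==
# Experiment 1
Clustering sketch
A minimal sketch of the pipeline, assuming scikit-learn; the random `links` matrix and `n_clusters=8` are placeholders for the real one-hot link encoding and the chosen cluster count:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
links = (rng.random((223, 223)) < 0.05).astype(float)  # placeholder one-hot link encoding

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(links)
coords = PCA(n_components=2).fit_transform(links)       # 2-D projection for plotting
# scatter `coords` colored by `kmeans.labels_` to inspect the clusters
```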
==
# Experiment 1
One Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 1| 1| 1| ...|
| wsj | 1| 1| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 11| 1| 141| ...|
| wsj | 1| 31| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
Normalized n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 0| 0.4| 0.2| ...|
| wsj | 0.2| 0| 0.4| ...|
| newsweek | 0.0| 0.0| 0.0| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
Elbow criterion
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
Note:
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
The percentage of variance explained is the ratio of the between-group variance to the total variance.
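==
# Experiment 1
Elbow sweep (sketch)
A rough sketch of the elbow computation, using the same placeholder `links` matrix as the clustering sketch (scikit-learn assumed):
```python
import numpy as np
from sklearn.cluster import KMeans

links = (np.random.default_rng(0).random((223, 223)) < 0.05).astype(float)  # placeholder
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(links).inertia_
            for k in range(2, 16)]
# plot k vs. inertia and pick the k where adding clusters stops helping much
```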
==
# Experiment 1
Link Magnitude
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
==
# Experiment 1
Normalized
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
==
# Experiment 1
Onehot
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
==
# Experiment 1
Discussion
- Best encoding: one-hot. <!-- .element: class="fragment" -->
  - Otherwise clusters form based on total link counts.
- Clusters form, but without an obvious explanation.
- Limitation: the link encoding is needed in order to cluster.
  - Smaller publishers might not link very much.
===
# Experiment 2
Setup
- Create features: <!-- .element: class="fragment" -->
  - Publisher frequency.
  - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
  - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->
Note:
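==
# Experiment 2
Classifier sketch
A rough sketch under assumed names: logistic regression stands in for the actual classifier, and the random `X`/`y` are placeholders for the real link features and joined AllSides labels:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1582, 10))     # placeholder per-publisher link/PCA features
y = rng.integers(-2, 3, size=1582)  # placeholder numeric bias labels in [-2, 2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```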
==
# Experiment 2
Descriptive stats
| metric | value |
|:------------|:----------|
| publishers | 1582 |
| labels | 6 |
| left | 482 |
| center | 711 |
| right | 369 |
| agree range | [0.0-1.0] |
==
# Experiment 2
PCA + Labels
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
==
# Experiment 2
Discussion
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
  - Labels are (sort of) separated and clustered.
  - Creating them for smaller publishers is trivial.
==
# Experiment 2
Limitations
- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings not available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
===
# Experiment 3
Setup
==
# Limitations
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
- Publishers use syndication. <!-- .element: class="fragment" -->
- Bias ratings are biased. <!-- .element: class="fragment" -->
===
# Questions
===
<!-- .section: id="references" -->
# References
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
Note: