|
|
|
_model: slides
|
|
|
|
---
|
|
|
|
|
|
|
|
title: CSCI 577 - Data Mining
|
|
|
|
|
|
|
|
---
|
|
|
|
body:
|
|
|
|
|
|
|
|
# Political Polarization
|
|
|
|
|
|
|
|
Matt Jensen
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Hypothesis
|
|
|
|
|
|
|
|
Political polarization is rising, and news articles are a proxy measure.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Is this reasonable?
|
|
|
|
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Why is polarization rising?
|
|
|
|
|
|
|
|
Not my job, but there's research<sup>[ref](#references)</sup> to support it.
|
|
|
|
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Sub-hypothesis
|
|
|
|
|
|
|
|
- The polarization increases near elections. <!-- .element: class="fragment" -->
|
|
|
|
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
|
|
|
|
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Sub-sub-hypothesis
|
|
|
|
|
|
|
|
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
|
|
|
|
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
|
|
|
|
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Data Source(s)
|
|
|
|
|
|
|
|
memeorandum.com <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
allsides.com <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
huggingface.co <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
Note:
|
|
|
|
Let's get a handle on the shape of the data.
|
|
|
|
|
|
|
|
The sources, size, and features of the data.
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# memeorandum.com
|
|
|
|
|
|
|
|
- News aggregation site. <!-- .element: class="fragment" -->
|
|
|
|
- Was really famous before Google News. <!-- .element: class="fragment" -->
|
|
|
|
- Still aggregates sites today. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Why Memeorandum?
|
|
|
|
|
|
|
|
- Behavioral: I often read only the titles (doom scrolling). <!-- .element class="fragment" -->
|
|
|
|
- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element class="fragment" -->
|
|
|
|
- Convenient: most publishers block bots. <!-- .element class="fragment" -->
|
|
|
|
- Convenient: dead simple html to parse. <!-- .element class="fragment" -->
|
|
|
|
- Archival: all headlines from 2006 forward. <!-- .element class="fragment" -->
|
|
|
|
- Archival: automated, not editorialized. <!-- .element class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# AllSides.com
|
|
|
|
|
|
|
|
- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
|
|
|
|
- Ratings combine: <!-- .element: class="fragment" -->
|
|
|
|
- blind bias surveys.
|
|
|
|
- editorial reviews.
|
|
|
|
- third party research.
|
|
|
|
- community voting.
|
|
|
|
- Originally scraped the website, but eventually got direct access. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Why AllSides?
|
|
|
|
|
|
|
|
- Behavioral: One of the first Google results for bias APIs. <!-- .element class="fragment" -->
|
|
|
|
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element class="fragment" -->
|
|
|
|
- Convenient: Easy format. <!-- .element class="fragment" -->
|
|
|
|
- Archival: Covers 1400 publishers. <!-- .element class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# HuggingFace.co
|
|
|
|
|
|
|
|
- Deep Learning library. <!-- .element: class="fragment" -->
|
|
|
|
- Lots of pretrained models. <!-- .element: class="fragment" -->
|
|
|
|
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Why HuggingFace?
|
|
|
|
|
|
|
|
- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
|
|
|
|
- Behavioral: The dataset needed more features.<!-- .element: class="fragment" -->
|
|
|
|
- Convenient: Literally 5 lines of python.<!-- .element: class="fragment" -->
|
|
|
|
- Convenient: Testing different model performance was easy.<!-- .element: class="fragment" -->
|
|
|
|
- Archival: Lots of pretrained classification tasks.<!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Data Structures
|
|
|
|
## Stories
|
|
|
|
|
|
|
|
- Top level stories. <!-- .element: class="fragment" -->
|
|
|
|
- title.
|
|
|
|
- publisher.
|
|
|
|
- author.
|
|
|
|
- Related discussion. <!-- .element: class="fragment" -->
|
|
|
|
- publisher.
|
|
|
|
- uses 'parent' story as a source.
|
|
|
|
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Structures
|
|
|
|
## Bias
|
|
|
|
|
|
|
|
- Per publisher. <!-- .element: class="fragment" -->
|
|
|
|
- name.
|
|
|
|
- label.
|
|
|
|
- agree/disagree vote by community.
|
|
|
|
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Structures
|
|
|
|
## Embeddings
|
|
|
|
|
|
|
|
- Per story title. <!-- .element: class="fragment" -->
|
|
|
|
- sentence embedding (n, 384).
|
|
|
|
- sentiment classification (n, 1).
|
|
|
|
- emotional classification (n, 1).
|
|
|
|
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Data Collection
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Collection
|
|
|
|
|
|
|
|
## Story Scraper (simplified)
|
|
|
|
|
|
|
|
```python
from datetime import date, timedelta

import requests

# output_dir: pathlib.Path, defined elsewhere
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()

while cur <= end:
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day
```
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Collection
|
|
|
|
|
|
|
|
## Bias Scraper (hard)
|
|
|
|
|
|
|
|
```python
from lxml import etree

...
bias_html = DATA_DIR / 'allsides.html'

parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    ...
```
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Collection
|
|
|
|
|
|
|
|
## Bias Scraper (easy)
|
|
|
|
|
|
|
|
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Collection
|
|
|
|
|
|
|
|
## Embeddings (easy)
|
|
|
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModel

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    tokens = tokenizer(
        chunk,
        add_special_tokens=True,
        truncation=True,
        padding="max_length",
        max_length=92,
        return_attention_mask=True,
        return_tensors="pt",
    )
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Collection
|
|
|
|
|
|
|
|
## Classification Embeddings (medium)
|
|
|
|
|
|
|
|
```python
import numpy as np

...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # sigmoid
class_ids = np.argmax(scores, axis=1)

for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Data Selection
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Selection
|
|
|
|
|
|
|
|
## Stories
|
|
|
|
|
|
|
|
- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
|
|
|
|
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
# Data Selection
|
|
|
|
|
|
|
|
## Publishers
|
|
|
|
|
|
|
|
- Combine subdomains of stories. <!-- .element: class="fragment" -->
|
|
|
|
- blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
|
|
|
|
- This could be bad. For example: opinion.wsj.com != wsj.com.
|
|
|
|
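The subdomain merge described above can be sketched in a few lines. `canonical_publisher` is a hypothetical helper, and the keep-the-last-two-labels rule is a naive assumption: multi-part TLDs like `.co.uk` need a public-suffix list (e.g. via the `tldextract` package).

```python
def canonical_publisher(hostname: str) -> str:
    """Collapse subdomains to the registered domain.

    Naive sketch: keep the last two dot-separated labels, so
    blog.washingtonpost.com and washingtonpost.com merge into one
    publisher. Real multi-part TLDs need a public-suffix list.
    """
    parts = hostname.lower().strip('.').split('.')
    return '.'.join(parts[-2:])
```

Under this rule `opinion.wsj.com` also collapses into `wsj.com`, which is exactly the potentially bad case noted above.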
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Selection
|
|
|
|
|
|
|
|
## Links
|
|
|
|
|
|
|
|
- Keep only stories from publishers that have had a 'parent' story ('original publishers'). <!-- .element: class="fragment" -->
|
|
|
|
- Eliminates small blogs and non-original news.
|
|
|
|
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
|
|
|
|
- Eliminates siloed publications.
|
|
|
|
- Link matrix is square and low-ish dimensional.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Data Selection
|
|
|
|
|
|
|
|
## Bias
|
|
|
|
|
|
|
|
- Keep all ratings, even ones with low agree/disagree ratio.
|
|
|
|
- Join datasets on publisher name.
|
|
|
|
- Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
|
|
|
|
- Started with Jaro-Winkler similarity, then matched manually from there.
|
|
|
|
- Use numeric values:
|
|
|
|
- [left: -2, left-center: -1, ...]
|
|
|
|
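A hedged sketch of the fuzzy name join and the numeric mapping: the project used Jaro-Winkler similarity, but stdlib `difflib.SequenceMatcher` stands in here, and `best_match`, the `threshold`, and the exact label-to-number table are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Illustrative mapping from AllSides labels to ordinal values.
BIAS_VALUES = {
    'left': -2, 'left-center': -1, 'center': 0,
    'right-center': 1, 'right': 2,
}

def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate publisher name, or None.

    Stand-in for the Jaro-Winkler pass; matches below the
    threshold were resolved manually.
    """
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None
```

Anything `best_match` leaves as `None` goes into the manual-review pile, mirroring the semi-automatic workflow described above.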
|
|
|
|
===
|
|
|
|
|
|
|
|
# Descriptive Stats
|
|
|
|
|
|
|
|
## Raw
|
|
|
|
|
|
|
|
| metric | value |
|
|
|
|
|:------------------|--------:|
|
|
|
|
| total stories | 299714 |
|
|
|
|
| total related | 960111 |
|
|
|
|
| publishers | 7031 |
|
|
|
|
| authors | 34346 |
|
|
|
|
| max year | 2023 |
|
|
|
|
| min year | 2005 |
|
|
|
|
| top level domains | 7063 |
|
|
|
|
|
|
|
|
==
|
|
|
|
# Descriptive Stats
|
|
|
|
|
|
|
|
## Stories Per Publisher
|
|
|
|
|
|
|
|
![stories per publisher](/static/577/stories_per_publisher.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Descriptive Stats
|
|
|
|
|
|
|
|
## Top Publishers
|
|
|
|
|
|
|
|
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Descriptive Stats
|
|
|
|
|
|
|
|
## Articles Per Year
|
|
|
|
|
|
|
|
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Descriptive Stats
|
|
|
|
|
|
|
|
## Common TLDs
|
|
|
|
|
|
|
|
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Descriptive Stats
|
|
|
|
|
|
|
|
## Post Process
|
|
|
|
|
|
|
|
| metric | value |
|
|
|
|
|:------------------|--------:|
|
|
|
|
| total stories | 251553 |
|
|
|
|
| total related | 815183 |
|
|
|
|
| publishers | 223 |
|
|
|
|
| authors | 23809 |
|
|
|
|
| max year | 2022 |
|
|
|
|
| min year | 2006 |
|
|
|
|
| top level domains | 234 |
|
|
|
|
|
|
|
|
===
|
|
|
|
# Experiments
|
|
|
|
|
|
|
|
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
|
|
|
|
2. **classification** on link similarity. <!-- .element: class="fragment" -->
|
|
|
|
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
|
|
|
|
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
|
|
|
|
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
**clustering** on link similarity.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## Setup
|
|
|
|
|
|
|
|
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
|
|
|
|
- Cluster the encoding. <!-- .element: class="fragment" -->
|
|
|
|
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
|
|
|
|
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
Note:
|
|
|
|
Principal Component Analysis:
|
|
|
|
- a statistical technique for reducing the dimensionality of a dataset.
|
|
|
|
- linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
|
|
|
|
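The note above can be made concrete with a minimal NumPy sketch of PCA via SVD; the actual analysis presumably used a library implementation (e.g. scikit-learn), so this is illustrative only.

```python
import numpy as np

def pca(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA assumes mean-centered data
    # Right singular vectors of the centered data are the principal
    # axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T
```

Projecting the link encodings to two components is what makes the cluster scatter plots later in this section possible.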
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## One Hot Encoding
|
|
|
|
|
|
|
|
| publisher | nytimes| wsj| newsweek| ...|
|
|
|
|
|:----------|--------:|----:|--------:|----:|
|
|
|
|
| nytimes | 1| 1| 1| ...|
|
|
|
|
| wsj | 1| 1| 0| ...|
|
|
|
|
| newsweek | 0| 0| 1| ...|
|
|
|
|
| ... | ...| ...| ...| ...|
|
|
|
|
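A table like the one above can be built from raw (source, target) link pairs. `one_hot_links` is a hypothetical helper; the diagonal of 1s (each publisher trivially 'links' to itself) follows the example table.

```python
import numpy as np

def one_hot_links(pairs, publishers):
    """Binary publisher-by-publisher link matrix.

    pairs: iterable of (source, target) publisher names.
    """
    index = {p: i for i, p in enumerate(publishers)}
    m = np.eye(len(publishers), dtype=int)  # self-links on the diagonal
    for src, dst in pairs:
        m[index[src], index[dst]] = 1
    return m
```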
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## n-Hot Encoding
|
|
|
|
|
|
|
|
| publisher | nytimes| wsj| newsweek| ...|
|
|
|
|
|:----------|--------:|----:|--------:|----:|
|
|
|
|
| nytimes | 11| 1| 141| ...|
|
|
|
|
| wsj | 1| 31| 0| ...|
|
|
|
|
| newsweek | 0| 0| 1| ...|
|
|
|
|
| ... | ...| ...| ...| ...|
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## Normalized n-Hot Encoding
|
|
|
|
|
|
|
|
| publisher | nytimes| wsj| newsweek| ...|
|
|
|
|
|:----------|--------:|----:|--------:|----:|
|
|
|
|
| nytimes | 0| 0.4| 0.2| ...|
|
|
|
|
| wsj | 0.2| 0| 0.4| ...|
|
|
|
|
| newsweek | 0.0| 0.0| 0.0| ...|
|
|
|
|
| ... | ...| ...| ...| ...|
|
|
|
|
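One plausible reading of the normalized encoding (an assumption, based on the zeroed diagonal in the table): drop self-links and divide each row by its outbound total, so each row holds the share of links sent to every other publisher. `normalize_links` is an illustrative name.

```python
import numpy as np

def normalize_links(counts: np.ndarray) -> np.ndarray:
    """Row-normalize a link-count matrix into outbound proportions."""
    c = counts.astype(float).copy()
    np.fill_diagonal(c, 0.0)  # drop self-links, as in the table
    totals = c.sum(axis=1, keepdims=True)
    # Rows with no outbound links stay all-zero instead of dividing by 0.
    return np.divide(c, totals, out=np.zeros_like(c), where=totals > 0)
```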
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## Elbow criterion
|
|
|
|
|
|
|
|
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
|
|
|
|
|
|
|
|
Note:
|
|
|
|
|
|
|
|
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
|
|
|
|
|
|
|
|
One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
|
|
|
|
|
|
|
|
Percentage of variance explained is the ratio of the between-group variance to the total variance.
|
|
|
|
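The elbow criterion in code, assuming scikit-learn's `KMeans` (its `inertia_` attribute is the within-cluster sum of squares; plotting it against k and picking the bend is the procedure described above):

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(X, k_range):
    """Within-cluster sum of squares for each candidate k."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_range]
```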
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## Link Magnitude
|
|
|
|
|
|
|
|
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## Normalized
|
|
|
|
|
|
|
|
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## One Hot
|
|
|
|
|
|
|
|
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 1
|
|
|
|
|
|
|
|
## Discussion
|
|
|
|
|
|
|
|
- Best encoding: One hot. <!-- .element: class="fragment" -->
|
|
|
|
- Clusters, but no explanation. <!-- .element: class="fragment" -->
|
|
|
|
- Limitation: need the link encoding to cluster. <!-- .element: class="fragment" -->
|
|
|
|
- Smaller publishers might not link very much.
|
|
|
|
- TODO: Association Rule Mining. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Experiment 2
|
|
|
|
|
|
|
|
**classification** on link similarity.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 2
|
|
|
|
|
|
|
|
## Setup
|
|
|
|
- **classification**. <!-- .element: class="fragment" -->
|
|
|
|
- Create features: <!-- .element: class="fragment" -->
|
|
|
|
- Publisher frequency.
|
|
|
|
- Reuse link encodings.
|
|
|
|
- Create classes: <!-- .element: class="fragment" -->
|
|
|
|
- Join bias classifications.
|
|
|
|
- Train classifier. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
Note:
|
|
|
|
|
|
|
|
==
|
|
|
|
# Experiment 2
|
|
|
|
|
|
|
|
## Descriptive stats
|
|
|
|
|
|
|
|
| metric | value |
|
|
|
|
|:------------|:----------|
|
|
|
|
| publishers | 1582 |
|
|
|
|
| labels | 6 |
|
|
|
|
| left | 482 |
|
|
|
|
| center | 711 |
|
|
|
|
| right | 369 |
|
|
|
|
| agree range | [0.0-1.0] |
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 2
|
|
|
|
|
|
|
|
## PCA + Labels
|
|
|
|
|
|
|
|
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 2
|
|
|
|
|
|
|
|
## Discussion
|
|
|
|
|
|
|
|
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
|
|
|
|
- Labels are (sort of) separated and clustered.
|
|
|
|
- Creating them for smaller publishers is trivial.
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 2
|
|
|
|
|
|
|
|
## Limitations
|
|
|
|
|
|
|
|
- Dependent on accurate rating. <!-- .element: class="fragment" -->
|
|
|
|
- Ordinal ratings not available. <!-- .element: class="fragment" -->
|
|
|
|
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
|
|
|
|
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
|
|
|
|
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
**classification** on sentence embedding.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Setup
|
|
|
|
|
|
|
|
|
|
|
|
- **classification**. <!-- .element: class="fragment" -->
|
|
|
|
- Generate sentence embedding for each title. <!-- .element: class="fragment" -->
|
|
|
|
- Rerun PCA analysis on title embeddings. <!-- .element: class="fragment" -->
|
|
|
|
- Use kNN classifier to map embedding features to bias rating. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Sentence Embeddings
|
|
|
|
|
|
|
|
1. Extract titles.
|
|
|
|
2. Tokenize titles.
|
|
|
|
3. Pick pretrained Language Model.
|
|
|
|
4. Generate embeddings from tokens.
|
|
|
|
|
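The steps above end with per-token vectors. One common way to collapse them into a single sentence embedding (an assumption here — the slides don't name the pooling method) is a mask-weighted mean over the non-padding tokens:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors into one sentence vector, ignoring [PAD].

    token_embeddings: (tokens, dim); attention_mask: (tokens,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()
```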
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Tokens
|
|
|
|
|
|
|
|
**The sentence:**
|
|
|
|
|
|
|
|
"Spain, Land of 10 P.M. Dinners, Asks if It's Time to Reset Clock"
|
|
|
|
|
|
|
|
**Tokenizes to:**
|
|
|
|
|
|
|
|
```
|
|
|
|
['[CLS]', 'spain', ',', 'land', 'of', '10', 'p', '.', 'm', '.',
|
|
|
|
'dinners', ',', 'asks', 'if', 'it', "'", 's', 'time', 'to',
|
|
|
|
'reset', 'clock', '[SEP]']
|
|
|
|
```
|
|
|
|
|
|
|
|
Note:
|
|
|
|
[CLS] is unique to BERT models and stands for classification.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Tokens
|
|
|
|
|
|
|
|
**The sentence:**
|
|
|
|
|
|
|
|
"NPR/PBS NewsHour/Marist Poll Results and Analysis"
|
|
|
|
|
|
|
|
**Tokenizes to:**
|
|
|
|
|
|
|
|
```
|
|
|
|
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris',
|
|
|
|
'##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]',
|
|
|
|
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
|
|
|
|
```
|
|
|
|
|
|
|
|
Note:
|
|
|
|
The padding is there to make all tokenized vectors equal length.
|
|
|
|
|
|
|
|
The tokenizer also outputs a mask vector that the language model uses to ignore the padding.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Embeddings
|
|
|
|
|
|
|
|
- Using a BERT (Bidirectional Encoder Representations from Transformers) based model.
|
|
|
|
- Input: tokens.
|
|
|
|
- Output: dense vectors representing 'semantic meaning' of tokens.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Embeddings
|
|
|
|
|
|
|
|
**The tokens:**
|
|
|
|
|
|
|
|
```
|
|
|
|
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris',
|
|
|
|
'##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]',
|
|
|
|
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
|
|
|
|
```
|
|
|
|
|
|
|
|
**Embeds to a matrix of token vectors (tokens, 384):**
|
|
|
|
|
|
|
|
```
|
|
|
|
array([[ 0.12444635, -0.05962477, -0.00127911, ..., 0.13943022,
|
|
|
|
-0.2552534 , -0.00238779],
|
|
|
|
[ 0.01535596, -0.05933844, -0.0099495 , ..., 0.48110735,
|
|
|
|
0.1370568 , 0.3285091 ],
|
|
|
|
[ 0.2831368 , -0.4200529 , 0.10879617, ..., 0.15663117,
|
|
|
|
-0.29782432, 0.4289513 ],
|
|
|
|
...,
|
|
|
|
```
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![pca vs. classes](https://studentweb.cs.wwu.edu/~jensen33/static/577/embedding_sentence_pca.png)
|
|
|
|
|
|
|
|
Note:
|
|
|
|
Not a lot of information in PCA this time.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![pca vs. avg embedding](https://studentweb.cs.wwu.edu/~jensen33/static/577/avg_embedding_sentence_pca.png) <!-- .element: class="r-stretch" -->
|
|
|
|
|
|
|
|
|
|
|
|
Note:
|
|
|
|
What about average publisher embedding?
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![knn embedding confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentence_confusion.png)
|
|
|
|
|
|
|
|
Note:
|
|
|
|
Trained a kNN from sklearn.
|
|
|
|
|
|
|
|
Set aside 20% of the data as a test set.
|
|
|
|
|
|
|
|
Once trained, compared the predictions with the true labels on the test set.
|
|
|
|
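The procedure from the note — 20% hold-out, kNN fit, confusion matrix — as a scikit-learn sketch; `knn_eval` and `k=5` are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def knn_eval(X, y, k=5):
    """Hold out 20%, fit kNN, return test accuracy and confusion matrix."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return (pred == y_te).mean(), confusion_matrix(y_te, pred)
```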
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 3
|
|
|
|
|
|
|
|
## Discussion
|
|
|
|
|
|
|
|
- Embedding space is hard to condense with PCA. <!-- .element: class="fragment" -->
|
|
|
|
- Maybe the classifier is learning to guess 'left-ish'? <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Experiment 4
|
|
|
|
|
|
|
|
**classification** on sentiment analysis.
|
|
|
|
|
|
|
|
==
|
|
|
|
# Experiment 4
|
|
|
|
|
|
|
|
## Setup
|
|
|
|
|
|
|
|
- Use pretrained Language Classifier. <!-- .element: class="fragment" -->
|
|
|
|
- Previously: Mapped Twitter posts to tokens, to embedding, to ['positive', 'negative'] labels. <!-- .element: class="fragment" -->
|
|
|
|
- Predict: rate of neutral titles decreasing over time.
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 4
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentiment_over_time.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
# Experiment 4
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![bias vs. sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_sentiment_over_time.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 4
|
|
|
|
|
|
|
|
## Discussion
|
|
|
|
|
|
|
|
-
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Experiment 5
|
|
|
|
|
|
|
|
**regression** on emotional classification over time and publication.
|
|
|
|
|
|
|
|
==
|
|
|
|
# Experiment 5
|
|
|
|
|
|
|
|
## Setup
|
|
|
|
|
|
|
|
- Use pretrained language classifier. <!-- .element: class="fragment" -->
|
|
|
|
- Previously: Mapped Reddit posts to tokens, to embedding, to emotion labels. <!-- .element: class="fragment" -->
|
|
|
|
- Predict: rate of neutral titles decreasing over time.
|
|
|
|
- Classify:
|
|
|
|
- features: emotional labels
|
|
|
|
- labels: bias
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 5
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![emotion over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_over_time.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 5
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
|
![emotion regression time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_regression.png)
|
|
|
|
|
|
|
|
==
|
|
|
|
|
|
|
|
# Experiment 5
|
|
|
|
|
|
|
|
## Discussion
|
|
|
|
|
|
|
|
- Neutral story titles dominate the dataset. <!-- .element: class="fragment" -->
|
|
|
|
- Increase in stories published might explain most of the trend. <!-- .element: class="fragment" -->
|
|
|
|
- Far-right and far-left both became less neutral. <!-- .element: class="fragment" -->
|
|
|
|
- Left-center and right-center became more emotional, but also more neutral. <!-- .element: class="fragment" -->
|
|
|
|
- Not a lot of movement overall. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Experiment 6 (**TODO**)
|
|
|
|
|
|
|
|
## Setup
|
|
|
|
|
|
|
|
- Have a lot of features now. <!-- .element: class="fragment" -->
|
|
|
|
- Link PCA components.
|
|
|
|
- Embedding PCA components.
|
|
|
|
- Sentiment.
|
|
|
|
- Emotion.
|
|
|
|
- Can we predict with all of them: Bias. <!-- .element: class="fragment" -->
|
|
|
|
- End user: Is that useful? Where will I get all that at inference time? <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Overall Limitations
|
|
|
|
|
|
|
|
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
|
|
|
|
- Publishers use syndication. <!-- .element: class="fragment" -->
|
|
|
|
- Bias ratings are biased. <!-- .element: class="fragment" -->
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
# Questions
|
|
|
|
|
|
|
|
===
|
|
|
|
|
|
|
|
<!-- .slide: id="references" -->
|
|
|
|
|
|
|
|
# References
|
|
|
|
|
|
|
|
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI: https://doi.org/10.1126/sciadv.abd4201.
|
|
|
|
|
|
|
|
Note:
|