2023-05-17 13:38:07 -07:00
_model: slides
---
title: CSCI 577 - Data Mining
---
body:
<!-- .slide: class="center" -->
# Political Polarization
## CSCI 577
**Matt Jensen**
*May 18, 2023*
==
# Outline
- Hypothesis
- Sources
- Data Workup
- Experiments
- Remaining Work
- Questions
===
<!-- .slide: class="center" -->
# Hypothesis
==
# Hypothesis
Political polarization is rising, and news articles are a proxy measure.
==
# Why might we expect this?
Mostly anecdotal experience. <!-- .element: class="fragment" -->
<p class="fragment">
Evidence is mixed in the literature
<sup><a href="#references">1</a>,<a href="#references">2</a>,<a href="#references">3</a></sup>.
</p>
Our goal is to test whether, not explain why. <!-- .element: class="fragment" -->
Note:
> Proliferation of media choices lowered the share of less interested, less partisan
> voters and thereby made elections more partisan. But evidence for a causal
> link between more partisan messages and changing attitudes or behaviors is
> mixed at best. Measurement problems hold back research on partisan selective
> exposure and its consequences. Ideologically one-sided news exposure
> may be largely confined to a small, but highly involved and influential, segment
> of the population. There is no firm evidence that partisan media are
> making ordinary Americans more partisan.
==
# Sub-hypothesis
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
- The polarization increases near elections. <!-- .element: class="fragment" -->
==
# Sub-sub-hypothesis
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
Note:
- Publication longevity is not currently covered.
- Mainstream media dominates the dataset.
===
<!-- .slide: class="center" -->
# Data Sources
==
# Data Sources
- Memeorandum: **stories** <!-- .element: class="fragment" -->
- AllSides: **bias** <!-- .element: class="fragment" -->
- HuggingFace: **sentiment** <!-- .element: class="fragment" -->
- ChatGPT: **election dates** <!-- .element: class="fragment" -->
Note:
Let's get a handle on the shape of the data.
- sources
- size
- features
===
<!-- .slide: class="center" -->
# Memeorandum
==
<!-- .slide: data-background-iframe="https://www.memeorandum.com" data-background-interactive -->
==
# Memeorandum
- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->
==
# Memeorandum
- I still use it. <!-- .element: class="fragment" -->
- I like to read titles. <!-- .element: class="fragment" -->
- Publishers block bots. <!-- .element: class="fragment" -->
- Simple HTML to parse. <!-- .element: class="fragment" -->
- Headlines from 2006 forward. <!-- .element: class="fragment" -->
- Automated, not editorialized. <!-- .element: class="fragment" -->
Note:
- It limits doom scrolling.
===
<!-- .slide: class="center" -->
# AllSides
==
<!-- .slide: data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive -->
==
# AllSides
- Rates publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
  - blind bias surveys.
  - editorial reviews.
  - third party research.
  - community voting.
Note:
Originally scraped the website, but eventually got direct access.
==
# AllSides
- One of the only bias APIs. <!-- .element: class="fragment" -->
- Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Covers 1400 publishers + some blogs and authors. <!-- .element: class="fragment" -->
- Easy format and semi-complete data. <!-- .element: class="fragment" -->
===
<!-- .slide: class="center" -->
# HuggingFace
==
<!-- .slide: data-background-iframe="https://huggingface.co/models" data-background-interactive -->
==
# HuggingFace
- Deep learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
==
# HuggingFace
- Language models are **HOT**. <!-- .element: class="fragment" -->
- Literally 5 lines of Python. <!-- .element: class="fragment" -->
- The dataset needed more features. <!-- .element: class="fragment" -->
- Testing different model performance was easy. <!-- .element: class="fragment" -->
- Lots of pretrained classification tasks. <!-- .element: class="fragment" -->
===
<!-- .slide: class="center" -->
# Data Collection
==
# Data Collection
## Stories
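```python
from datetime import date, timedelta
from pathlib import Path

import requests

output_dir = Path("data/memeorandum")  # assumed download directory
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    # one archive page per day, saved as yy-mm-dd.html
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day  # advance after saving so the start date is included
```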
Note:
grab every page from 2005 forward.
later: parse it into csv/database.
==
# Data Collection
## Bias **hard**
```python
from lxml import etree
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
# one table row per rated publisher
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')
ratings = []
for row in rows:
    rating = dict()
    ...
```
Note:
grab entire index
later parse it into csv/database
==
# Data Collection
## Bias **easy**
![allsides request ](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png )
Note:
json format, including authors and blogs.
==
# Data Collection
## Embeddings
```python
from transformers import AutoModel, AutoTokenizer

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
for chunk in table:
    # pad/truncate every title to the same length and return PyTorch tensors
    tokens = tokenizer(
        chunk, add_special_tokens=True, truncation=True,
        padding="max_length", max_length=92,
        return_attention_mask=True, return_tensors="pt",
    )
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```
Note:
for every title, tokenize then embed.
hidden state is last linear layer before training tasks.
==
# Data Collection
## Classification Embeddings
```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
Note:
for every title, tokenize, classify.
~ 1 hour
===
<!-- .slide: class="center" -->
# Data Structures
## Stories
Note:
Great, we have the data, now what does it look like?
==
# Data Structures
## Stories
- Top level stories. <!-- .element: class="fragment" -->
  - title, author, publisher, url, date.
- Related discussion. <!-- .element: class="fragment" -->
  - publisher, url.
  - uses 'parent' story as a source.
- Story stream changes constantly (dedup. required). <!-- .element: class="fragment" -->
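Note:
A minimal sketch of the dedup step, assuming each scraped row is a dict with `url` and `published_at` keys. Those field names are hypothetical, not the real schema. It keeps the earliest snapshot of each story URL.

```python
def dedupe_stories(rows):
    """Keep the earliest row for each story URL."""
    seen = set()
    unique = []
    # ISO date strings sort chronologically, so a plain sort is enough here
    for row in sorted(rows, key=lambda r: r["published_at"]):
        if row["url"] in seen:
            continue
        seen.add(row["url"])
        unique.append(row)
    return unique


rows = [
    {"url": "https://example.com/a", "published_at": "2020-01-02"},
    {"url": "https://example.com/a", "published_at": "2020-01-01"},
    {"url": "https://example.com/b", "published_at": "2020-01-01"},
]
unique = dedupe_stories(rows)
```

Keying on the URL alone is an assumption; a story whose URL changes between snapshots would still be counted twice.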
==
# Data Structures
## Stories
![raw story table ](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_stories_table.png )
==
# Data Structures
## Stories
![raw related table ](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_related_table.png )
==
# Data Structures
## Stories
| metric | value |
|:------------------|--------:|
| total stories | 299714 |
| total related | 960111 |
| publishers | 7031 |
| authors | 34346 |
| max year | 2023 |
| min year | 2005 |
| top level domains | 7063 |
==
# Data Selection
## Stories
- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
- Convert urls to tld to link to publishers. <!-- .element: class="fragment" -->
Note:
tld: top level domain.
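
A rough sketch of the url-to-domain step, assuming that "keep the last two host labels" is good enough. `publisher_domain` is a hypothetical helper, not the project's code, and the rule is naive: it breaks on suffixes like `co.uk` and on raw IP addresses.

```python
from urllib.parse import urlparse


def publisher_domain(url):
    """Naive publisher key: the last two labels of the hostname."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


blog_domain = publisher_domain("https://blog.washingtonpost.com/the-fix/")
opinion_domain = publisher_domain("https://opinion.wsj.com/some-column")
```

Here `blog_domain` is `washingtonpost.com` and `opinion_domain` is `wsj.com`, which is exactly the subdomain merge that may or may not be wanted.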
==
# Data Selection
## Publishers
- Combine subdomains of stories. <!-- .element: class="fragment" -->
  - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
  - This could be bad. For example: opinion.wsj.com != wsj.com.
- Find common name of publisher. <!-- .element: class="fragment" -->
Note:
Sometimes the author's name is used as the publisher name.
==
# Data Selection
## Related
- Select only stories with publishers whose story had been a 'parent' ('original publishers'). <!-- .element: class="fragment" -->
  - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
  - Eliminate siloed publications.
  - Link matrix is square and low-ish dimensional.
Note:
Going to build a data structure of the related links, so I have to be judicious about which ones to include.
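
A sketch of the first filter only, assuming stories are dicts with `id` and `publisher`, and related links are dicts with `parent_id` and `publisher` (all hypothetical field names). Links are counted only when both ends are 'original publishers', which is what keeps the matrix square.

```python
from collections import Counter


def square_link_counts(stories, related):
    """Count related -> parent links between 'original' publishers only."""
    originals = sorted({s["publisher"] for s in stories})
    keep = set(originals)
    parent_publisher = {s["id"]: s["publisher"] for s in stories}
    counts = Counter()
    for link in related:
        src = link["publisher"]                         # who wrote the related post
        dst = parent_publisher.get(link["parent_id"])   # whose story it discusses
        if src in keep and dst in keep:
            counts[(src, dst)] += 1
    matrix = [[counts[(a, b)] for b in originals] for a in originals]
    return originals, matrix


stories = [{"id": 1, "publisher": "nytimes"}, {"id": 2, "publisher": "wsj"}]
related = [
    {"parent_id": 1, "publisher": "wsj"},
    {"parent_id": 1, "publisher": "smallblog"},  # dropped: never a parent
    {"parent_id": 2, "publisher": "nytimes"},
    {"parent_id": 1, "publisher": "wsj"},
]
publishers, matrix = square_link_counts(stories, related)
```

Dropping publishers whose row ends up all zeros (the second filter) would be a separate pass over `matrix`.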
==
# Data Selection
## Post Process
| metric | value |
|:------------------|--------:|
| total stories | 251553 |
| total related | 815183 |
| publishers | 223 |
| authors | 23809 |
| max year | 2022 |
| min year | 2006 |
| top level domains | 234 |
Note:
Far fewer publishers, but the story count is about the same: mainstream publishers are heavily represented.
==
# Descriptive Stats
## Stories Per Publisher
![stories per publisher ](https://studentweb.cs.wwu.edu/~jensen33/static/577/stories_per_publisher.png )
Note:
Power law in effect.
==
# Descriptive Stats
## Top Publishers
![top publishers ](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png )
Note:
Some publishers come and go.
Some publishers change their domains.
==
# Descriptive Stats
## Articles Per Year
![articles per year ](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png )
Note:
Shape of total articles per year dominates some of the analysis.
==
# Descriptive Stats
## Common TLDs
![common tlds ](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png )
Note:
Just for fun.
Lots of IP addresses and spammy looking ones.
===
<!-- .slide: class="center" -->
# Data Structures
## Bias
==
# Data Structures
## Bias
- Per publisher. <!-- .element: class="fragment" -->
  - name.
  - label/ordinal value.
  - agree/disagree vote by community.
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
==
# Data Structures
## Bias
![raw bias table ](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_bias_table.png )
Note:
Later, media type and explicit ordinal values were added via api access.
==
# Data Selection
## Bias
- Keep all ratings. <!-- .element: class="fragment" -->
- Join datasets on publisher name. <!-- .element: class="fragment" -->
  - Started with Jaro-Winkler similarity, then matched manually from there (look up Named Entity Recognition).
- Use numeric values. <!-- .element: class="fragment" -->
  - [left: -2, left-center: -1, ...].
  - Possibly scale ordinal based on agree/disagree ratio.
Note:
Lots of agrees on the ends of the spectrum implies they're very left or very right.
Lots of agrees in the middle implies very neutral?
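
A sketch of the semi-automatic name join. To stay in the standard library it swaps in `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler, so the scores are not the ones used in the project. `best_match` and the `0.8` threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing clears the threshold."""
    def score(other):
        return SequenceMatcher(None, name.lower(), other.lower()).ratio()

    best = max(candidates, key=score)
    return best if score(best) >= threshold else None


bias_names = ["New York Times", "Wall Street Journal", "Fox News"]
matched = best_match("The New York Times", bias_names)
unmatched = best_match("Some Tiny Blog", bias_names)
```

Anything that comes back as `None` would go to the manual matching pile.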
==
# Data
## Bias
![bias hist ](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_hist.png )
==
# Data
## Bias
![selected bias ](https://studentweb.cs.wwu.edu/~jensen33/static/577/selected_bias_table.png )
Note:
much smaller dataset.
TODO: manually add more joins to story source.
===
<!-- .slide: class="center" -->
# Data Structures
## Embeddings
==
# Data Structures
## Embeddings
- Per story title. <!-- .element: class="fragment" -->
  - sentence embedding (n, 384) - **BERT**.
  - sentiment classification (n, 1) - **RoBERTa base**.
  - emotional classification (n, 1) - **RoBERTa Go-Emotions**.
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
Note:
RoBERTa - pretrained with the masked language modeling (MLM) objective. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words.
SST - Stanford Sentiment Treebank: 11,855 single sentences extracted from movie reviews, annotated by 3 human judges.
==
# Data Selection
## Embeddings
- Word embeddings were too complicated. <!-- .element: class="fragment" -->
- Kept argmax of classification prediction ([0.82, 0.18] -> LABEL_0). <!-- .element: class="fragment" -->
- For publisher based analysis, averaged sentence embeddings for all stories. <!-- .element: class="fragment" -->
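Note:
A small sketch of the two reductions above, in plain Python so the shapes are easy to see. `argmax_label` and `publisher_embeddings` are hypothetical helpers, and the vectors are toy values rather than real 384-dimensional embeddings.

```python
from collections import defaultdict


def argmax_label(scores, id2label):
    """Keep only the highest-scoring class, e.g. [0.82, 0.18] -> LABEL_0."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return id2label[best]


def publisher_embeddings(rows):
    """Average the title embeddings of each publisher, column by column."""
    grouped = defaultdict(list)
    for publisher, vector in rows:
        grouped[publisher].append(vector)
    return {
        publisher: [sum(col) / len(vectors) for col in zip(*vectors)]
        for publisher, vectors in grouped.items()
    }


label = argmax_label([0.82, 0.18], {0: "LABEL_0", 1: "LABEL_1"})
averaged = publisher_embeddings([
    ("nytimes", [1.0, 2.0]),
    ("nytimes", [3.0, 4.0]),
    ("wsj", [0.5, 0.5]),
])
```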
==
# Data
## Embeddings
| label | stories | publishers |
|:---------|----------:|-------------:|
| positive | 87830 | 223 |
| negative | 163723 | 223 |
Note:
There was a model with a neutral label as well, but I opted out.
==
# Data
## Embeddings
| label | stories | publishers |
|:---------|----------:|-------------:|
| neutral | 124257 | 223 |
| anger | 34124 | 223 |
| fear | 36756 | 223 |
| sadness | 27449 | 223 |
| disgust | 17939 | 222 |
| surprise | 5710 | 216 |
| joy | 5318 | 214 |
===
<!-- .slide: class="center" -->
# Experiments
==
# Experiments
1. **clustering** on link similarity.
2. **classification** on link similarity.
3. **classification** on sentence embedding.
4. **classification** on sentiment analysis.
5. **regression** on emotional classification over time and publication.
Note:
5 main experiments.
Lots of tinkering and 'agile development'.
Use source control.
===
<!-- .slide: class="center" -->
# Experiment 1
**clustering** on link similarity.
==
# Experiment 1
## Setup
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
- I use it a lot to map from high dimensional space (link adjacency and embeddings) to a lower, most significant space.
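
For intuition only, a dependency-free sketch that finds the first principal component by power iteration on the covariance matrix. This is not how the project computed it (a library routine such as scikit-learn's `PCA` is the practical choice), and `first_principal_component` is a hypothetical name.

```python
def first_principal_component(rows, iterations=200):
    """Return the top PCA direction and the 1-D projection of each row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # sample covariance matrix, d x d
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        # repeated multiplication pulls v toward the dominant eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


# toy data lying exactly on the line y = 2x
direction, projected = first_principal_component(
    [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
)
```

For the toy data the direction comes out proportional to (1, 2), so one component carries all of the variation.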
==
<!-- .slide: class="center" -->
# Experiment 1
## Encoding schemes
==
# Experiment 1
## One-hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 1| 1| 1| ...|
| wsj | 1| 1| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
## n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 11| 1| 141| ...|
| wsj | 1| 31| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
## Normalized n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 0| 0.4| 0.2| ...|
| wsj | 0.2| 0| 0.4| ...|
| newsweek | 0.0| 0.0| 0.0| ...|
| ... | ...| ...| ...| ...|
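Note:
A sketch of how the three encodings could be derived from a list of (source, target) link pairs. Row normalization (each count divided by its row total) is an assumption here; the slides do not say which normalization was used, and `link_encodings` is a hypothetical helper.

```python
from collections import Counter


def link_encodings(links, publishers):
    """Build one-hot, n-hot (count) and row-normalized link matrices."""
    counts = Counter(links)
    n_hot = [[counts[(a, b)] for b in publishers] for a in publishers]
    one_hot = [[1 if c > 0 else 0 for c in row] for row in n_hot]
    normalized = []
    for row in n_hot:
        total = sum(row)
        # publishers with no outgoing links stay an all-zero row
        normalized.append([c / total if total else 0.0 for c in row])
    return one_hot, n_hot, normalized


publishers = ["nytimes", "wsj", "newsweek"]
links = [
    ("nytimes", "wsj"),
    ("nytimes", "newsweek"),
    ("nytimes", "newsweek"),
    ("nytimes", "newsweek"),
    ("wsj", "nytimes"),
]
one_hot, n_hot, normalized = link_encodings(links, publishers)
```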
==
# Experiment 1
## Elbow criterion
![elbow ](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png )
Note:
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
Percentage of variance explained is the ratio of the between-group variance to the total variance
sklearn eliminated 2 cluster groups??
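
A sketch of the quantity described above, assuming cluster labels are already known: between-group sum of squares divided by total sum of squares. `explained_variance_ratio` is a hypothetical helper; the elbow plot would come from evaluating it for each candidate number of clusters.

```python
def explained_variance_ratio(points, labels):
    """Between-group variation as a fraction of total variation."""
    d = len(points[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    total = sum(sq_dist(p, overall) for p in points)
    between = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        between += len(members) * sq_dist(mean(members), overall)
    return between / total


points = [[0.0], [1.0], [10.0], [11.0]]
two_clusters = explained_variance_ratio(points, [0, 0, 1, 1])
one_cluster = explained_variance_ratio(points, [0, 0, 0, 0])
```

With two obvious groups the ratio is already close to 1, so a third cluster would add almost nothing: that flattening is the elbow.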
==
<!-- .slide: class="center" -->
# Experiment 1
## Comparing encoding schemes
Note:
They all have good clusters.
==
# Experiment 1
## Link Magnitude
![link magnitude cluster ](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png )
Note:
link frequency dominates one component.
more interested in bias between publishers, not difference between mainstream and outliers.
==
# Experiment 1
## Normalized
![link normalized cluster ](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png )
Note:
a few outliers still, but better.
==
# Experiment 1
## One-Hot
![link onehot cluster ](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png )
Note:
really dispersed.
==
# Experiment 1
## Discussion
- One-hot seems to reflect the right features. <!-- .element: class="fragment" -->
- Found clusters, but meaning is arbitrary. <!-- .element: class="fragment" -->
  - map to PCA results nicely.
- Limitation: need the link encoding to cluster. <!-- .element: class="fragment" -->
  - Smaller publishers might not link very much.
- TODO: Association Rule Mining. <!-- .element: class="fragment" -->
  - 'Basket of goods' analysis to group publishers.
===
<!-- .slide: class="center" -->
# Experiment 2
**classification** on link similarity.
==
# Experiment 2
## Setup
- Create features: <!-- .element: class="fragment" -->
  - Publisher frequency.
  - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
  - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->
Note:
==
# Experiment 2
## Descriptive stats
| metric | value |
|:------------|:----------|
| publishers | 1582 |
| labels | 6 |
| left | 482 |
| center | 711 |
| right | 369 |
| agree range | [0.0-1.0] |
Note:
rehash of what bias data is available.
==
# Experiment 2
## Results
![pca vs. bias labels ](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png )
Note:
pca maps to bias labels well, left on one end, right on the other.
if you squint.
==
# Experiment 2
## Results
![link confusion ](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_confusion.png )
Note:
hot diagonal is good.
all data.
train test split only had 20 or so samples in it?
overlap between link choices and bias ratings is slim.
==
# Experiment 2
## Discussion
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
  - Labels are (sort of) separated and clustered.
  - Creating them for smaller publishers is trivial.
- Hot diagonal confusion matrix is good. <!-- .element: class="fragment" -->
- Need to link more publisher data to get good test data. <!-- .element: class="fragment" -->
Note:
==
# Experiment 2
## Limitations
- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings weren't available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
===
<!-- .slide: class="center" -->
# Experiment 3
**classification** on sentence embedding.
==
# Experiment 3
## Setup
- Generate sentence embedding for each title. <!-- .element: class="fragment" -->
- Rerun PCA analysis on title embeddings. <!-- .element: class="fragment" -->
- Use kNN classifier to map embedding features to bias rating. <!-- .element: class="fragment" -->
==
<!-- .slide: class="center" -->
# Experiment 3
## Embeddings Primer
==
# Experiment 3
## Embedding Steps
1. Extract titles.
2. Tokenize titles.
3. Pick pretrained language model.
4. Generate embeddings from tokens using model.
==
# Experiment 3
## Tokens
**The sentence:**
"Spain, Land of 10 P.M. Dinners, Asks if It's Time to Reset Clock"
**Tokenizes to:**
```
['[CLS]', 'spain', ',', 'land', 'of', '10', 'p', '.', 'm', '.',
'dinners', ',', 'asks', 'if', 'it', "'", 's', 'time', 'to',
'reset', 'clock', '[SEP]']
```
Note:
[CLS] is unique to BERT models and stands for classification.
==
# Experiment 3
## Tokens
**The sentence:**
"NPR/PBS NewsHour/Marist Poll Results and Analysis"
**Tokenizes to:**
```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris',
'##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```
Note:
The padding is there to make all tokenized vectors equal length.
The tokenizer also outputs a mask vector that the language model uses to ignore the padding.
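
A toy sketch of what padding and the mask look like, written by hand rather than calling the real tokenizer. `pad_tokens` is hypothetical, and the truncation here is cruder than a real tokenizer's (it can cut off `[SEP]`).

```python
def pad_tokens(tokens, max_length, pad="[PAD]"):
    """Pad (or naively truncate) to max_length and build the attention mask."""
    tokens = tokens[:max_length]
    n_pad = max_length - len(tokens)
    mask = [1] * len(tokens) + [0] * n_pad  # 1 = real token, 0 = padding
    return tokens + [pad] * n_pad, mask


padded, mask = pad_tokens(["[CLS]", "npr", "poll", "[SEP]"], max_length=6)
```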
==
# Experiment 3
## Embeddings
- Using a BERT (Bidirectional Encoder Representations from Transformers) based model.
- Input: tokens.
- Output: dense vectors representing 'semantic meaning' of tokens.
==
# Experiment 3
## Embeddings
**The tokens:**
```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris',
'##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```
**Embeds to a vector (1, 384):**
```
array([[ 0.12444635, -0.05962477, -0.00127911, ..., 0.13943022,
-0.2552534 , -0.00238779],
[ 0.01535596, -0.05933844, -0.0099495 , ..., 0.48110735,
0.1370568 , 0.3285091 ],
[ 0.2831368 , -0.4200529 , 0.10879617, ..., 0.15663117,
-0.29782432, 0.4289513 ],
...,
```
Note:
attention masks allow the model to ignore padding so all vectors are same length.
embedding space has semantic meaning.
can do vector math on them:
king - man = monarch
monarch + dance = happy?
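
A toy sketch of the vector math idea using cosine similarity. The 2-D vectors are made up so the arithmetic is easy to follow; they are not real embeddings, and `nearest` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, vocab):
    """Word whose vector points in the most similar direction to target."""
    return max(vocab, key=lambda word: cosine(target, vocab[word]))


# made-up axes: (royalty, maleness)
vocab = {"king": [1.0, 1.0], "man": [0.0, 1.0], "monarch": [1.0, 0.0]}
difference = [k - m for k, m in zip(vocab["king"], vocab["man"])]
closest = nearest(difference, vocab)
```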
==
# Experiment 3
## Results
![pca vs. classes ](https://studentweb.cs.wwu.edu/~jensen33/static/577/embedding_sentence_pca.png )
Note:
pca on the sentence embeddings of the titles.
not a lot of information in PCA this time.
==
# Experiment 3
## Results
![pca vs. avg embedding ](https://studentweb.cs.wwu.edu/~jensen33/static/577/avg_embedding_sentence_pca.png ) <!-- .element: class="r-stretch" -->
Note:
What about average publisher embedding?
centers are pushed outside?
sorry about the color palette.
==
# Experiment 3
## Results
![knn embedding confusion ](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentence_confusion.png )
Note:
Trained a kNN from sklearn.
Set aside 20% of the data as a test set.
Once trained, compared the predictions with the true on the test set.
not bad.
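
A dependency-free sketch of the same recipe (hold out 20%, fit kNN, score on the held-out part). The experiment used sklearn; `holdout_split`, `knn_predict` and the toy 2-D points below are stand-ins for illustration, not the real embeddings.

```python
import random
from collections import Counter


def holdout_split(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle indices and set aside a fraction of the rows for testing."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    n_test = max(1, round(len(order) * test_fraction))
    train, test = order[:-n_test], order[-n_test:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])


def knn_predict(train_x, train_y, x, k=3):
    """Majority label among the k training points closest to x."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# two well-separated toy clusters standing in for embedding features
xs = [[i * 0.01, 0.0] for i in range(10)] + [[5.0 + i * 0.01, 5.0] for i in range(10)]
ys = ["left"] * 10 + ["right"] * 10
train_x, train_y, test_x, test_y = holdout_split(xs, ys)
hits = sum(knn_predict(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
accuracy = hits / len(test_y)
```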
==
# Experiment 3
## Discussion
- Embedding space is hard to condense with PCA. <!-- .element: class="fragment" -->
- Maybe the classifier is learning to guess 'left-ish'? <!-- .element: class="fragment" -->
- Does DL work better on sparse inputs? <!-- .element: class="fragment" -->
===
<!-- .slide: class="center" -->
# Experiment 4
**classification** on sentiment analysis.
==
# Experiment 4
## Setup
- Use pretrained language classifier. <!-- .element: class="fragment" -->
- Previously: Mapped twitter posts to tokens, to embedding, to ['positive', 'negative'] labels. <!-- .element: class="fragment" -->
- Predict: rate of neutral titles decreasing over time. <!-- .element: class="fragment" -->
==
# Experiment 4
## Results
![sentiment over time ](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentiment_over_time.png )
Note:
maybe there's something there.
less positive after 2008?
low around 2016?
increase around 202?
overall still lower.
==
# Experiment 4
## Results
![bias vs. sentiment over time ](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_sentiment_over_time.png )
Note:
right has not a lot of data.
all trend down over time.
people loved Obama at the beginning.
==
# Experiment 4
## Results
![sentiment vs. election recency ](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_recent_winner.png )
Note:
assumption: national elections drive news sentiment.
expected a taller band in the middle than at the edges.
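
A sketch of an election-recency feature, assuming only US presidential general elections matter (midterms are left out). The dates are hard-coded here rather than taken from the project's election table, and `days_from_nearest_election` is a hypothetical name.

```python
from datetime import date

# US presidential general election days inside the dataset's range
ELECTIONS = [date(2008, 11, 4), date(2012, 11, 6), date(2016, 11, 8), date(2020, 11, 3)]


def days_from_nearest_election(day, elections=ELECTIONS):
    """Signed distance in days: negative before the election, positive after."""
    nearest = min(elections, key=lambda e: abs((day - e).days))
    return (day - nearest).days


week_before = days_from_nearest_election(date(2016, 11, 1))
day_after = days_from_nearest_election(date(2012, 11, 7))
```

Plotting sentiment against this signed distance is one way to look for the expected bulge around zero.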
==
# Experiment 4
## Discussion
- Bump post Obama election for left and center. <!-- .element: class="fragment" -->
- Dip pre Trump election for left and center. <!-- .element: class="fragment" -->
- Right is all over the place - not enough data? <!-- .element: class="fragment" -->
- Recency of election not a clear factor. <!-- .element: class="fragment" -->
===
<!-- .slide: class="center" -->
# Experiment 5
**regression** on title emotional expression.
==
# Experiment 5
## Setup
- Use pretrained language classifier. <!-- .element: class="fragment" -->
- Previously: Mapped reddit posts to tokens, to embedding, to emotion labels. <!-- .element: class="fragment" -->
- Predict: rate of neutral titles decreasing over time. <!-- .element: class="fragment" -->
- Classify: <!-- .element: class="fragment" -->
  - features: emotional labels
  - labels: bias
==
# Experiment 5
## Results
![emotion over time ](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_over_time.png )
Note:
neutrality between Obama and Trump
emotional titles all increased - shape of the underlying data.
TODO: normalize relative expression.
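
One possible shape for that TODO: convert raw label counts into per-year shares so the growth in total articles drops out. `yearly_label_shares` is a hypothetical helper and the rows are made up.

```python
from collections import Counter, defaultdict


def yearly_label_shares(rows):
    """rows: iterable of (year, label). Returns {year: {label: share}}."""
    counts = defaultdict(Counter)
    for year, label in rows:
        counts[year][label] += 1
    return {
        year: {label: n / sum(c.values()) for label, n in c.items()}
        for year, c in counts.items()
    }


shares = yearly_label_shares([
    (2006, "neutral"), (2006, "anger"),
    (2007, "neutral"), (2007, "neutral"), (2007, "fear"), (2007, "anger"),
])
```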
==
# Experiment 5
## Results
![emotion regression time ](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_regression.png )
Note:
left and right got less neutral over time.
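
A sketch of the trend line itself: ordinary least squares of neutral share on year. The shares below are invented to show the mechanics, not results from the dataset, and `least_squares` is a hypothetical helper.

```python
def least_squares(xs, ys):
    """Slope and intercept of the best-fit line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


years = [2006, 2007, 2008, 2009]
neutral_share = [0.60, 0.58, 0.56, 0.54]  # made-up values
slope, intercept = least_squares(years, neutral_share)
```

A negative slope means titles are getting less neutral over time; fitting one line per bias group gives the comparison in the plot.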
==
# Experiment 5
## Discussion
- Neutral story titles dominate the dataset. <!-- .element: class="fragment" -->
- Increase in stories published might explain most of the trend. <!-- .element: class="fragment" -->
- Far-right and far-left both became less neutral. <!-- .element: class="fragment" -->
- Left-Center and right-center became more emotional, but also neutral. <!-- .element: class="fragment" -->
- Not a lot of movement overall. <!-- .element: class="fragment" -->
===
<!-- .slide: class="center" -->
# Conclusion
==
# Hypothesis
- The polarization is not evenly distributed across publishers. **unproven**
- The polarization is not evenly distributed across the political spectrum. **unproven**
- The polarization increases near elections. **false**
- Similarly polarized publishers link to each other. **sorta**
- 'Mainstream' media uses more neutral titles. **true**
- Highly polarized publications don't last as long. **untested**
==
# Conclusion
- Article titles do not have a lot of predictive power. <!-- .element: class="fragment" -->
- Mainstream, neutral publications dominate the dataset. <!-- .element: class="fragment" -->
- Link frequency, sentence embeddings, and sentiments are useful features. <!-- .element: class="fragment" -->
- A few questions remain. <!-- .element: class="fragment" -->
Note:
Experiment 6 (**TODO**)
- Have a lot of features now.
  - Link PCA components.
  - Embedding PCA components.
  - Sentiment.
  - Emotion.
- Can we predict with all of them: Bias.

Limitations:
- Many different authors under the same publisher.
- Publishers use syndication.
- Bias ratings are biased and not linked automatically.
- National news is generally designed to be neutral sounding.
- End user: Is that useful? Where will I get all that at inference time?
==
<!-- .slide: class="center" -->
# Questions
==
<!-- .slide: id="references" -->
# References
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
Note: