
_model: slides
---

title: CSCI 577 - Data Mining

---
body:

<!-- .slide: class="center" -->


# Political Polarization 

## CSCI 577

**Matt Jensen**

*May 18, 2023*

==

# Outline

- Hypothesis
- Sources
- Data Workup
- Experiments
- Remaining Work
- Questions

===

<!-- .slide: class="center" -->

# Hypothesis 

==

# Hypothesis 

Political polarization is rising, and news articles are a proxy measure.

==

# Why might we expect this?

Mostly anecdotal experience. <!-- .element: class="fragment" -->


<p class="fragment">
    Evidence is mixed in the literature
    <sup><a href="#references">1</a>,<a href="#references">2</a>,<a href="#references">3</a></sup>.
</p>

Our goal is to test whether, not to explain why. <!-- .element: class="fragment" -->

Note:
> Proliferation of media choices lowered the share of less interested, less partisan
> voters and thereby made elections more partisan. But evidence for a causal
> link between more partisan messages and changing attitudes or behaviors is
> mixed at best. Measurement problems hold back research on partisan selective
> exposure and its consequences. Ideologically one-sided news exposure
> may be largely confined to a small, but highly involved and influential, segment
> of the population. There is no firm evidence that partisan media are
> making ordinary Americans more partisan.

==

# Sub-hypothesis 

- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
- The polarization increases near elections. <!-- .element: class="fragment" -->

==

# Sub-sub-hypothesis 

- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->

Note:

- Publication longevity is not covered currently.
- Mainstream media dominates the dataset.

===

<!-- .slide: class="center" -->

# Data Sources

==

# Data Sources

- Memeorandum: **stories** <!-- .element: class="fragment" -->
- AllSides: **bias** <!-- .element: class="fragment" -->
- HuggingFace: **sentiment** <!-- .element: class="fragment" -->
- ChatGPT: **election dates** <!-- .element: class="fragment" -->

Note:

Let's get a handle on the shape of the data.

- sources
- size
- features


===

<!-- .slide: class="center" -->

# Memeorandum

==

<!-- .slide: data-background-iframe="https://www.memeorandum.com" data-background-interactive -->

==

# Memeorandum

- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->

==

# Memeorandum 

- I still use it. <!-- .element: class="fragment" -->
- I like to read titles. <!-- .element: class="fragment" -->
- Publishers block bots. <!-- .element: class="fragment" -->
- Simple html to parse. <!-- .element: class="fragment" -->
- Headlines from 2006 forward. <!-- .element: class="fragment" -->
- Automated, not editorialized. <!-- .element: class="fragment" -->

Note:

- It limits doom scrolling.

===

<!-- .slide: class="center" -->

# AllSides

==

<!-- .slide: data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive -->

==

# AllSides

- Rates publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
    - blind bias surveys.
    - editorial reviews.
    - third party research.
    - community voting.

Note:
Originally scraped the website, but eventually got direct API access.

==

# AllSides

- One of the only bias APIs. <!-- .element: class="fragment" -->
- Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Covers 1400 publishers + some blogs and authors. <!-- .element: class="fragment" -->
- Easy format and semi-complete data. <!-- .element: class="fragment" -->

===

<!-- .slide: class="center" -->

# HuggingFace

==

<!-- .slide: data-background-iframe="https://huggingface.co/models" data-background-interactive -->

==

# HuggingFace

- Deep learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->

==

# HuggingFace

- Language models are **HOT**. <!-- .element: class="fragment" -->
- Literally 5 lines of Python. <!-- .element: class="fragment" -->
- The dataset needed more features. <!-- .element: class="fragment" -->
- Testing different model performance was easy. <!-- .element: class="fragment" -->
- Lots of pretrained classification tasks. <!-- .element: class="fragment" -->

===

<!-- .slide: class="center" -->

# Data Collection

==

# Data Collection

## Stories

```python
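from datetime import date, timedelta

import requests

# output_dir: a pathlib.Path for the raw pages, defined elsewhere
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day  # advance last so the first day is not skipped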
```

Note:

grab every page from 2005 forward.

later: parse it into csv/database.

==

# Data Collection

## Bias **hard**

```python
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    ...
```

Note:

grab entire index

later parse it into csv/database

==

# Data Collection

## Bias **easy**

![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)


Note:

json format, including authors and blogs.

==

# Data Collection

## Embeddings

```python
from transformers import AutoModel, AutoTokenizer

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    tokens = tokenizer(
        chunk, add_special_tokens=True, truncation=True,
        padding="max_length", max_length=92,
        return_attention_mask=True, return_tensors="pt",
    )
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```

Note:

for every title, tokenize then embed.

last_hidden_state is the output of the final encoder layer, before any task-specific head.
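
One common way to collapse per-token hidden states into a single vector per title is attention-masked mean pooling. The sketch below assumes that approach (these notes do not say which pooling was used) and uses a toy array in place of real model output; `mean_pool` is a hypothetical helper.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors, ignoring padded positions.

    hidden: (batch, seq_len, dim) array, e.g. last_hidden_state
    mask:   (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = mask[:, :, None].astype(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1, None)    # avoid division by zero
    return summed / counts

# Toy stand-in for model output: 2 titles, 4 tokens each, 3 dimensions.
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)                   # shape (2, 3)
```

Padded positions contribute nothing to the average, so short and long titles stay comparable.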

==

# Data Collection

## Classification Embeddings

```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label" : model.config.id2label[class_id]})
...
```

Note:

for every title, tokenize, classify.

~ 1 hour

===

<!-- .slide: class="center" -->
# Data Structures

## Stories

Note:

Great, we have the data, now what does it look like?

==

# Data Structures

## Stories

- Top level stories. <!-- .element: class="fragment" -->
    - title, author, publisher, url, date.
- Related discussion. <!-- .element: class="fragment" -->
    - publisher, url.
    - uses 'parent' story as a source.
- Story stream changes constantly (dedup. required). <!-- .element: class="fragment" -->

==

# Data Structures

## Stories

![raw story table](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_stories_table.png)

==

# Data Structures

## Stories

![raw related table](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_related_table.png)

==

# Data Structures

## Stories

| metric            |   value |
|:------------------|--------:|
| total stories     |  299714 |
| total related     |  960111 |
| publishers        |    7031 |
| authors           |   34346 |
| max year          |    2023 |
| min year          |    2005 |
| top level domains |    7063 |

==

# Data Selection

## Stories

- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
- Convert urls to tld to link to publishers. <!-- .element: class="fragment" -->

Note:

tld: top level domain.

==

# Data Selection

## Publishers

- Combine subdomains of stories. <!-- .element: class="fragment" -->
    - blog.washingtonpost.com and washingtonpost.com are considered the same publisher. 
    - This could be bad. For example: opinion.wsj.com != wsj.com. 
- Find common name of publisher. <!-- .element: class="fragment" -->

Note:

Sometimes the author is listed as the publisher name.
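
A minimal sketch of the subdomain merge, assuming a naive rule that keeps only the last two labels of the host name. `publisher_domain` is a hypothetical helper, not the project's code, and the rule is wrong for hosts like `bbc.co.uk` or raw IP addresses.

```python
from urllib.parse import urlparse

def publisher_domain(url):
    """Reduce a story URL to a naive publisher domain (last two host labels)."""
    host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
    labels = host.split(".")
    return ".".join(labels[-2:])

print(publisher_domain("https://blog.washingtonpost.com/some/story"))
```

This is exactly the rule that folds opinion.wsj.com into wsj.com, so a hand-made exception list may be needed.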

==

# Data Selection

## Related

- Select only stories with publishers whose story had been a 'parent' ('original publishers'). <!-- .element: class="fragment" -->
    - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
    - Eliminate silo'ed publications. 
    - Link matrix is square and low'ish dimensional. 

Note:

Going to build a data structure of the related links, so I have to be judicious about which ones to include.
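
The selection above can be sketched with plain sets. The records are hypothetical toy data, and here 'original publishers' are simply those that appear as a parent at least once.

```python
# Hypothetical toy records: (parent_publisher, related_publisher)
related = [
    ("nytimes.com", "wsj.com"),
    ("nytimes.com", "smallblog.example"),
    ("wsj.com", "nytimes.com"),
    ("newsweek.com", "nytimes.com"),
]

# 'Original publishers' are those that appear as a parent at least once.
originals = {parent for parent, _ in related}

# Keep only links whose related story also comes from an original publisher.
kept = [(p, r) for p, r in related if r in originals]

# Publishers still present after filtering; rows and columns of the link
# matrix both range over this list, which is why the matrix is square.
publishers = sorted({name for link in kept for name in link})
```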

==

# Data Selection

## Post Process

| metric            |   value |
|:------------------|--------:|
| total stories     |  251553 |
| total related     |  815183 |
| publishers        |     223 |
| authors           |   23809 |
| max year          |    2022 |
| min year          |    2006 |
| top level domains |     234 |

Note:

far fewer publishers, but count(stories) is about the same: mainstream publishers dominate.

==

# Descriptive Stats

## Stories Per Publisher

![stories per publisher](https://studentweb.cs.wwu.edu/~jensen33/static/577/stories_per_publisher.png)

Note:

Power law in effect.

==

# Descriptive Stats

## Top Publishers

![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)

Note:

Some publishers come and go.

Some publishers change their domains.

==

# Descriptive Stats

## Articles Per Year

![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)

Note:

Shape of total articles per year dominates some of the analysis.

==

# Descriptive Stats

## Common TLDs

![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)

Note:

just for fun.

Lots of IP addresses and spammy looking ones.

===

<!-- .slide: class="center" -->

# Data Structures

## Bias

==

# Data Structures

## Bias

- Per publisher. <!-- .element: class="fragment" -->
    - name, 
    - label/ordinal value.
    - agree/disagree vote by community. 
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->

==

# Data Structures

## Bias

![raw bias table](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_bias_table.png)

Note:

Later, media type and explicit ordinal values were added via api access.

==

# Data Selection

## Bias

- Keep all ratings. <!-- .element: class="fragment" -->
- Join datasets on publisher name.  <!-- .element: class="fragment" -->
    - Started with Jaro-Winkler similarity, then matched the rest manually (TODO: look into Named Entity Recognition).
- Use numeric values. <!-- .element: class="fragment" -->
    - [left: -2, left-center: -1, ...].
    - Possibly scale ordinal based on agree/disagree ratio.

Note:

Lots of agrees on the ends of the spectrum implies they're very left or very right.

Lots of agrees in the middle implies very neutral?
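
A sketch of the semi-automatic name join. It swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for Jaro-Winkler, so the scores will differ from the real run. The names, the 0.8 threshold, and `best_match` are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate name, or None below the threshold."""
    scored = [
        (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored)
    return match if score >= threshold else None

# Hypothetical publisher names from the two datasets.
bias_names = ["New York Times", "Wall Street Journal", "Newsweek"]
print(best_match("The New York Times", bias_names))
print(best_match("Some Small Blog", bias_names))
```

Anything that falls below the threshold returns `None` and is left for manual matching.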

==

# Data

## Bias

![bias hist](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_hist.png)

==

# Data

## Bias

![selected bias](https://studentweb.cs.wwu.edu/~jensen33/static/577/selected_bias_table.png)

Note:

much smaller dataset.

TODO: manually add more joins to story source.

===

<!-- .slide: class="center" -->

# Data Structures

## Embeddings

==

# Data Structures

## Embeddings

- Per story title. <!-- .element: class="fragment" -->
    - sentence embedding (n, 384) - **BERT**.
    - sentiment classification (n, 1) - **RoBERTa base**.
    - emotional classification (n, 1) - **RoBERTa Go-Emotions**.
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->

Note:

RoBERTa - pretrained with the masked language modeling (MLM) objective. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words.

SST - Stanford Sentiment Treebank: 11,855 single sentences extracted from movie reviews, annotated by 3 human judges.

==

# Data Selection

## Embeddings

- Word embeddings were too complicated. <!-- .element: class="fragment" -->
- Kept argmax of classification prediction ([0.82, 0.18] -> LABEL_0). <!-- .element: class="fragment" -->
- For publisher based analysis, averaged sentence embeddings for all stories. <!-- .element: class="fragment" -->
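
Note:

A small sketch of the last two bullets, using made-up scores and embeddings: keep the argmax label per title, then average title embeddings per publisher. All values and names here are hypothetical.

```python
import numpy as np

# Hypothetical classifier scores for three titles (two classes each).
scores = np.array([[0.82, 0.18], [0.30, 0.70], [0.55, 0.45]])
labels = [f"LABEL_{i}" for i in scores.argmax(axis=1)]

# Hypothetical sentence embeddings (3 titles x 4 dims) and their publishers.
embeddings = np.array([[1.0, 0.0, 2.0, 0.0],
                       [3.0, 2.0, 0.0, 0.0],
                       [0.0, 4.0, 4.0, 2.0]])
publishers = np.array(["nytimes", "nytimes", "wsj"])

# Average the title embeddings for each publisher.
publisher_embedding = {
    name: embeddings[publishers == name].mean(axis=0)
    for name in sorted(set(publishers.tolist()))
}
```

Keeping only the argmax throws away the confidence, so a 0.55 and a 0.99 prediction count the same.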

==

# Data

## Embeddings

| label    |   stories |   publishers |
|:---------|----------:|-------------:|
| positive |     87830 |          223 |
| negative |    163723 |          223 |

Note:

There was a model with a neutral label as well, but I opted out.

==

# Data

## Embeddings

| label    |   stories |   publishers |
|:---------|----------:|-------------:|
| neutral  |    124257 |          223 |
| anger    |     34124 |          223 |
| fear     |     36756 |          223 |
| sadness  |     27449 |          223 |
| disgust  |     17939 |          222 |
| surprise |      5710 |          216 |
| joy      |      5318 |          214 |

===

<!-- .slide: class="center" -->

# Experiments

==

# Experiments

1. **clustering** on link similarity.
2. **classification** on link similarity.
3. **classification** on sentence embedding.
4. **classification** on sentiment analysis.
5. **regression** on emotional classification over time and publication.

Note:

5 main experiments.

Lots of tinkering and 'agile development'.

Use source control.

===

<!-- .slide: class="center" -->

# Experiment 1

**clustering** on link similarity.

==

# Experiment 1

## Setup

- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->

Note:
Principal Component Analysis: 
- a statistical technique for reducing the dimensionality of a dataset.
- linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
- I use it a lot to map from high dimensional space (links adj. and embeddings) to lower, most significant space.
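
A minimal PCA sketch using NumPy's SVD on toy data. It is a stand-in to show the mechanics, not the code used for the plots; `pca` is a hypothetical helper.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components."""
    centered = X - X.mean(axis=0)                 # PCA requires centered data
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()         # variance ratio per component
    return centered @ vt[:n_components].T, explained[:n_components]

# Toy data: 5 points that vary almost entirely along one diagonal.
X = np.array([[0, 0, 1], [1, 1, 1], [2, 2, 1], [3, 3, 1], [4, 4, 1.1]])
projected, ratio = pca(X)
```

Here the first component should carry nearly all of the variance, which is the situation where a 2D plot is a fair summary.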

==

<!-- .slide: class="center" -->

# Experiment 1

## Encoding schemes

==

# Experiment 1

## One-hot Encoding

| publisher |  nytimes|  wsj| newsweek|  ...|
|:----------|--------:|----:|--------:|----:|
| nytimes   |        1|    1|        1|  ...|
| wsj       |        1|    1|        0|  ...|
| newsweek  |        0|    0|        1|  ...|
| ...       |      ...|  ...|      ...|  ...|

==

# Experiment 1

## n-Hot Encoding

| publisher |  nytimes|  wsj| newsweek|  ...|
|:----------|--------:|----:|--------:|----:|
| nytimes   |       11|    1|      141|  ...|
| wsj       |        1|   31|        0|  ...|
| newsweek  |        0|    0|        1|  ...|
| ...       |      ...|  ...|      ...|  ...|

==

# Experiment 1

## Normalized n-Hot Encoding

| publisher |  nytimes|  wsj| newsweek|  ...|
|:----------|--------:|----:|--------:|----:|
| nytimes   |        0|  0.4|      0.2|  ...|
| wsj       |      0.2|    0|      0.4|  ...|
| newsweek  |      0.0|  0.0|      0.0|  ...|
| ...       |      ...|  ...|      ...|  ...|
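
Note:

A sketch of how the three encodings relate, built from hypothetical link records. Dividing each row by its total is an assumption; the normalization behind the table may differ.

```python
import numpy as np

publishers = ["nytimes", "wsj", "newsweek"]
# Hypothetical (source, target) link records between publishers.
links = [("nytimes", "wsj"), ("nytimes", "newsweek"), ("nytimes", "newsweek"),
         ("wsj", "nytimes")]

index = {name: i for i, name in enumerate(publishers)}
n_hot = np.zeros((len(publishers), len(publishers)))
for source, target in links:
    n_hot[index[source], index[target]] += 1      # count every link

one_hot = (n_hot > 0).astype(int)                 # 1 if any link exists
row_sums = n_hot.sum(axis=1, keepdims=True)
# Divide each row by its total; rows with no links stay all zero.
normalized = np.divide(n_hot, row_sums, out=np.zeros_like(n_hot),
                       where=row_sums > 0)
```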

==

# Experiment 1

## Elbow criterion

![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)

Note:

The elbow method looks at the percentage of explained variance as a function of the number of clusters: 

One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.

Percentage of variance explained is the ratio of the between-group variance to the total variance

sklearn eliminated 2 cluster groups??
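
The ratio described above can be computed directly from cluster labels. This is a sketch on toy data; `explained_variance` is a hypothetical helper and the labels are assigned by hand rather than by a clustering run.

```python
import numpy as np

def explained_variance(X, labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    overall = X.mean(axis=0)
    total = ((X - overall) ** 2).sum()
    between = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        between += len(members) * ((members.mean(axis=0) - overall) ** 2).sum()
    return between / total

# Toy data: two tight, well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
one_cluster = explained_variance(X, np.array([0, 0, 0, 0]))
two_clusters = explained_variance(X, np.array([0, 0, 1, 1]))
```

Going from one cluster to two explains almost everything here; the elbow is where further clusters stop adding much.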

==

<!-- .slide: class="center" -->

# Experiment 1

## Comparing encoding schemes

Note:

They all have good clusters.


==

# Experiment 1

## Link Magnitude

![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)

Note:

link frequency dominates one component.

more interested in bias between publishers, not difference between mainstream and outliers.

==

# Experiment 1

## Normalized

![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)

Note:

a few outliers still, but better.

==

# Experiment 1

## One-Hot

![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)

Note: 

really dispersed

==

# Experiment 1

## Discussion

- One-hot seems to reflect the right features. <!-- .element: class="fragment" -->
- Found clusters, but meaning is arbitrary. <!-- .element: class="fragment" -->
    - map to PCA results nicely.
- Limitation: need the link encoding to cluster. <!-- .element: class="fragment" -->
    - Smaller publishers might not link very much.
- TODO: Association Rule Mining. <!-- .element: class="fragment" -->
    - 'Basket of goods' analysis to group publishers.

===

<!-- .slide: class="center" -->

# Experiment 2

**classification** on link similarity.

==

# Experiment 2

## Setup

- Create features: <!-- .element: class="fragment" -->
    - Publisher frequency.
    - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
    - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->

Note:

==
# Experiment 2

## Descriptive stats

| metric      | value     |
|:------------|:----------|
| publishers  | 1582      |
| labels      | 6         |
| left        | 482       |
| center      | 711       |
| right       | 369       |
| agree range | [0.0-1.0] |

Note:

rehash of what bias data is available.

==

# Experiment 2

## Results

![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)

Note:

pca maps to bias labels well, left on one end, right on the other.

if you squint.

==

# Experiment 2

## Results

![link confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_confusion.png)

Note:

hot diagonal is good.

all data.

train test split only had 20 or so samples in it?

overlap between link choices and bias ratings is slim.

==

# Experiment 2

## Discussion

- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
    - Labels are (sort of) separated and clustered.
    - Creating them for smaller publishers is trivial.
- Hot diagonal confusion matrix is good. <!-- .element: class="fragment" -->
- Need to link more publisher data to get good test data. <!-- .element: class="fragment" -->

Note:


==

# Experiment 2

## Limitations

- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings weren't available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->

===

<!-- .slide: class="center" -->

# Experiment 3 

**classification** on sentence embedding.

==

# Experiment 3 

## Setup

- Generate sentence embedding for each title. <!-- .element: class="fragment" -->
- Rerun PCA analysis on title embeddings. <!-- .element: class="fragment" -->
- Use kNN classifier to map embedding features to bias rating. <!-- .element: class="fragment" -->

==

<!-- .slide: class="center" -->

# Experiment 3

## Embeddings Primer

==

# Experiment 3

## Embedding Steps

1. Extract titles.
2. Tokenize titles.
3. Pick pretrained language model.
4. Generate embeddings from tokens using model.

==

# Experiment 3

## Tokens

**The sentence:**

"Spain, Land of 10 P.M. Dinners, Asks if It's Time to Reset Clock"

**Tokenizes to:**

```
['[CLS]', 'spain', ',', 'land', 'of', '10', 'p', '.', 'm', '.', 
    'dinners', ',', 'asks', 'if', 'it', "'", 's', 'time', 'to', 
    'reset', 'clock', '[SEP]']
```

Note:
[CLS] is unique to BERT models and stands for classification.

==

# Experiment 3

## Tokens

**The sentence:**

"NPR/PBS NewsHour/Marist Poll Results and Analysis"

**Tokenizes to:**

```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris', 
    '##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]', 
    '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```

Note:
The padding is there to make all tokenized vectors equal length.

The tokenizer also outputs a mask vector that the language model uses to ignore the padding.
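
A toy sketch of padding and the matching mask, in plain Python. `pad_batch` is a hypothetical helper that only mimics what the tokenizer does internally.

```python
def pad_batch(token_lists, pad_token="[PAD]"):
    """Pad token lists to equal length and build matching attention masks."""
    longest = max(len(tokens) for tokens in token_lists)
    padded, masks = [], []
    for tokens in token_lists:
        n_pad = longest - len(tokens)
        padded.append(tokens + [pad_token] * n_pad)
        masks.append([1] * len(tokens) + [0] * n_pad)   # 1 = real, 0 = padding
    return padded, masks

batch = [
    ["[CLS]", "poll", "results", "and", "analysis", "[SEP]"],
    ["[CLS]", "spain", "[SEP]"],
]
padded, masks = pad_batch(batch)
```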

==

# Experiment 3

## Embeddings

- Using a BERT (Bidirectional Encoder Representations from Transformers) based model.
- Input: tokens.
- Output: dense vectors representing 'semantic meaning' of tokens.

==

# Experiment 3

## Embeddings

**The tokens:**

```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris', 
    '##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]', 
    '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```

**Embeds to a vector (1, 384):**

```
array([[ 0.12444635, -0.05962477, -0.00127911, ...,  0.13943022,
        -0.2552534 , -0.00238779],
       [ 0.01535596, -0.05933844, -0.0099495 , ...,  0.48110735,
         0.1370568 ,  0.3285091 ],
       [ 0.2831368 , -0.4200529 ,  0.10879617, ...,  0.15663117,
        -0.29782432,  0.4289513 ],
       ...,
```

Note:

attention masks allow the model to ignore padding so all vectors are same length.

embedding space has semantic meaning.

can do vector math on them:

king - man = monarch

monarch + dance = happy?

==

# Experiment 3

## Results

![pca vs. classes](https://studentweb.cs.wwu.edu/~jensen33/static/577/embedding_sentence_pca.png)

Note:

pca on the sentence embeddings of the titles.

not a lot of information in PCA this time.

==

# Experiment 3

## Results

![pca vs. avg embedding](https://studentweb.cs.wwu.edu/~jensen33/static/577/avg_embedding_sentence_pca.png) <!-- .element: class="r-stretch" -->


Note:

What about average publisher embedding?

centers are pushed outside?

sorry about the color palette.

==

# Experiment 3

## Results

![knn embedding confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentence_confusion.png)

Note:
Trained a kNN from sklearn.

Set aside 20% of the data as a test set.

Once trained, compared the predictions with the true on the test set.

not bad.
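
A sketch of the same train/test procedure with a hand-rolled kNN in NumPy, standing in for the sklearn classifier. The data is a synthetic toy with two well-separated blobs, not the title embeddings, and `knn_predict` is a hypothetical helper.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Predict each test label by majority vote of the k nearest training rows."""
    predictions = []
    for row in test_X:
        distances = np.linalg.norm(train_X - row, axis=1)
        nearest = train_y[np.argsort(distances)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        predictions.append(values[counts.argmax()])
    return np.array(predictions)

# Toy stand-in for embeddings: two separated blobs with labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 4)), rng.normal(5, 0.5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

order = rng.permutation(len(X))                   # shuffle before splitting
split = int(0.8 * len(X))                         # 80% train, 20% test
train, test = order[:split], order[split:]
accuracy = (knn_predict(X[train], y[train], X[test]) == y[test]).mean()
```

Real embeddings overlap far more than these blobs, so the confusion matrix is the more honest summary than a single accuracy number.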

==

# Experiment 3

## Discussion

- Embedding space is hard to condense with PCA. <!-- .element: class="fragment" -->
- Maybe the classifier is learning to guess 'left-ish'? <!-- .element: class="fragment" -->
- Does DL work better on sparse inputs? <!-- .element: class="fragment" -->

===

<!-- .slide: class="center" -->

# Experiment 4

**classification** on sentiment analysis.

==
# Experiment 4

## Setup

- Use pretrained language classifier. <!-- .element: class="fragment" -->
- Previously: Mapped Twitter posts to tokens, to embedding, to ['positive', 'negative'] labels. <!-- .element: class="fragment" -->
- Predict: rate of neutral titles decreasing over time. <!-- .element: class="fragment" -->

==

# Experiment 4

## Results

![sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentiment_over_time.png)

Note:

maybe there's something there.

less positive after 2008?

low around 2016?

increase around 202?

overall still lower.
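
Since article volume changes a lot by year, a share per year is easier to read than raw counts. A sketch with made-up records:

```python
from collections import Counter

# Hypothetical (year, label) pairs, one per story title.
stories = [(2008, "positive"), (2008, "negative"), (2008, "positive"),
           (2016, "negative"), (2016, "negative"), (2016, "positive"),
           (2016, "negative")]

totals = Counter(year for year, _ in stories)
positives = Counter(year for year, label in stories if label == "positive")

# Share of positive titles per year, which removes the effect of yearly volume.
positive_share = {year: positives[year] / totals[year] for year in sorted(totals)}
```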

==

# Experiment 4

## Results

![bias vs. sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_sentiment_over_time.png)

Note:

right has not a lot of data.

all trend down over time.

people loved Obama at the beginning.

==

# Experiment 4

## Results

![sentiment vs. election recency](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_recent_winner.png)

Note:

assumption: national elections drive news sentiment.

expected a taller band in the middle than the edges.

==

# Experiment 4

## Discussion

- Bump post Obama election for left and center. <!-- .element: class="fragment" -->
- Dip pre Trump election for left and center. <!-- .element: class="fragment" -->
- Right is all over the place - not enough data? <!-- .element: class="fragment" -->
- Recency of election not a clear factor. <!-- .element: class="fragment" -->

===

<!-- .slide: class="center" -->

# Experiment 5

**regression** on title emotional expression.

==

# Experiment 5

## Setup

- Use pretrained language classifier. <!-- .element: class="fragment" -->
- Previously: Mapped Reddit posts to tokens, to embedding, to emotion labels. <!-- .element: class="fragment" -->
- Predict: rate of neutral titles decreasing over time. <!-- .element: class="fragment" -->
- Classify:  <!-- .element: class="fragment" -->
    - features: emotional labels
    - labels: bias

==

# Experiment 5

## Results

![emotion over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_over_time.png)

Note: 

neutrality between Obama and Trump

emotional titles all increased - shape of the underlying data.

TODO: normalize relative expression.

==

# Experiment 5

## Results

![emotion regression time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_regression.png)

Note:

left and right got less neutral over time.
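
A sketch of the trend fit for one bias group, assuming a straight-line least-squares fit of neutral share against year. The shares below are invented for illustration only.

```python
import numpy as np

# Hypothetical yearly share of neutral titles for one bias group.
years = np.array([2006, 2010, 2014, 2018, 2022])
neutral_share = np.array([0.56, 0.54, 0.52, 0.50, 0.48])

# Fit share = slope * (year - first year) + intercept by least squares.
slope, intercept = np.polyfit(years - years[0], neutral_share, deg=1)
```

A negative slope means titles in that group got less neutral over time; subtracting the first year keeps the intercept readable.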

==

# Experiment 5

## Discussion

- Neutral story titles dominate the dataset. <!-- .element: class="fragment" -->
- Increase in stories published might explain most of the trend. <!-- .element: class="fragment" -->
- Far-right and far-left both became less neutral. <!-- .element: class="fragment" -->
- Left-center and right-center became more emotional, but also neutral. <!-- .element: class="fragment" -->
- Not a lot of movement overall. <!-- .element: class="fragment" -->

===

<!-- .slide: class="center" -->

# Conclusion

==

# Hypothesis

- The polarization is not evenly distributed across publishers. **unproven**
- The polarization is not evenly distributed across the political spectrum. **unproven**
- The polarization increases near elections. **false**
- Similarly polarized publishers link to each other. **sorta**
- 'Mainstream' media uses more neutral titles. **true**
- Highly polarized publications don't last as long. **untested**

==

# Conclusion

- Article titles do not have a lot of predictive power. <!-- .element: class="fragment" -->
- Mainstream, neutral publications dominate the dataset. <!-- .element: class="fragment" -->
- Link frequency, sentence embeddings, and sentiments are useful features. <!-- .element: class="fragment" -->
- A few questions remain. <!-- .element: class="fragment" -->

Note:

Experiment 6 (**TODO**)

- Have a lot of features now.
    - Link PCA components.
    - Embedding PCA components.
    - Sentiment.
    - Emotion.
- Can we predict with all of them: Bias.

limitations

- Many different authors under the same publisher.
- Publishers use syndication.
- Bias ratings are biased and not linked automatically.
- National news is generally designed to be neutral sounding.
- End user: Is that useful? Where will I get all that at inference time?


==

<!-- .slide: class="center" -->

# Questions

==

<!-- .slide: id="references" -->

# References

[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
