
_model: slides
---

title: CSCI 577 - Data Mining

---
body:

# Political Polarization 

Matt Jensen

===

# Hypothesis 

Political polarization is rising, and news articles are a proxy measure.

==

# Is this reasonable? 


==

# Why is polarization rising? 

Not my job, but there's research<sup>[ref](#references)</sup> to support it.


==

# Sub-hypothesis 

- The polarization increases near elections. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->

==

# Sub-sub-hypothesis 

- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->

===

# Data Source(s) 

memeorandum.com <!-- .element: class="fragment" -->

allsides.com <!-- .element: class="fragment" -->

huggingface.co <!-- .element: class="fragment" -->

===

<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>

===

# memeorandum.com

- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->

==

# Why Memeorandum? 

- Behavioral: I sometimes read only the titles (doomscrolling). <!-- .element: class="fragment" -->
- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element: class="fragment" -->
- Convenient: most publishers block bots. <!-- .element: class="fragment" -->
- Convenient: dead simple HTML to parse. <!-- .element: class="fragment" -->
- Archival: all headlines from 2006 forward. <!-- .element: class="fragment" -->
- Archival: automated, not editorialized. <!-- .element: class="fragment" -->

===

<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>

===

# AllSides.com

- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
    - blind bias surveys.
    - editorial reviews.
    - third party research.
    - community voting.
- Originally scraped the website, but eventually got direct access. <!-- .element: class="fragment" -->


==

# Why AllSides? 

- Behavioral: One of the first Google results for bias APIs. <!-- .element: class="fragment" -->
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Convenient: Easy format. <!-- .element: class="fragment" -->
- Archival: Covers 1400 publishers. <!-- .element: class="fragment" -->

===

<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>

===

# HuggingFace.co

- Deep Learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off-the-shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->

==

# Why HuggingFace? 

- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
- Behavioral: The dataset needed more features. <!-- .element: class="fragment" -->
- Convenient: Literally 5 lines of Python. <!-- .element: class="fragment" -->
- Convenient: Testing different model performance was easy. <!-- .element: class="fragment" -->
- Archival: Lots of pretrained classification tasks. <!-- .element: class="fragment" -->

===

# Data Structures
Stories

- Top level stories. <!-- .element: class="fragment" -->
    - title.
    - publisher.
    - author.
- Related discussion. <!-- .element: class="fragment" -->
    - publisher.
    - uses 'parent' story as a source.
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->

==

# Data Structures
Bias

- Per publisher. <!-- .element: class="fragment" -->
    - name.
    - label.
    - agree/disagree vote by community. 
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->

==

# Data Structures
Embeddings

- Per story title. <!-- .element: class="fragment" -->
    - sentence embedding (n, 384).
    - sentiment classification (n, 1).
    - emotional classification (n, 1).
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->

===

# Data Collection

==

# Data Collection

Story Scraper (simplified)

```python
from datetime import date, timedelta

import requests

# output_dir = ... (a pathlib.Path)
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day  # advance last so the first day is not skipped
```

==

# Data Collection
Bias Scraper (hard)

```python
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    ...
```

==

# Data Collection
Bias Scraper (easy)

![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)

==

# Data Collection
Embeddings (easy)

```python
from transformers import AutoModel, AutoTokenizer

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    # Pad or truncate every title to 92 tokens so each batch has a fixed shape.
    tokens = tokenizer(
        chunk, add_special_tokens=True, truncation=True,
        padding="max_length", max_length=92,
        return_attention_mask=True, return_tensors="pt",
    )
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```

==

# Data Collection
Classification Embeddings (medium) 

```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid (monotonic, so argmax matches the raw logits)
class_ids = np.argmax(scores, axis=1)
for story_id, class_id in zip(ids, class_ids):
    results.append({"story_id": story_id, "label": model.config.id2label[int(class_id)]})
...
```

===

# Data Selection

==

# Data Selection
Stories

- Keep only full years: drop the partial first (2005) and last (2023) years. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
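
==

# Data Selection

Stories (sketch)

A minimal sketch of the two steps above, not the project's actual code. It assumes each story is a dict with hypothetical `title`, `publisher`, and `published` keys, and that a duplicate means the same title from the same publisher. The 2006 and 2022 bounds come from the post-process stats.

```python
from datetime import date


def select_stories(stories, first_year=2006, last_year=2022):
    """Keep stories from full years only and drop repeats of the same story."""
    seen = set()
    kept = []
    # Sort by date so the earliest appearance of a story is the one kept.
    for story in sorted(stories, key=lambda s: s["published"]):
        if not first_year <= story["published"].year <= last_year:
            continue
        key = (story["publisher"], story["title"])  # assumed duplicate key
        if key in seen:
            continue
        seen.add(key)
        kept.append(story)
    return kept


stories = [
    {"title": "A", "publisher": "nytimes.com", "published": date(2005, 12, 31)},
    {"title": "B", "publisher": "nytimes.com", "published": date(2010, 1, 1)},
    {"title": "B", "publisher": "nytimes.com", "published": date(2010, 1, 2)},
]
print(len(select_stories(stories)))
```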

==
# Data Selection

Publishers

- Combine subdomains of stories. <!-- .element: class="fragment" -->
    - blog.washingtonpost.com and washingtonpost.com are considered the same publisher. 
    - This could be bad. For example: opinion.wsj.com != wsj.com. 
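
==

# Data Selection

Publishers (sketch)

One way the subdomain merge could look, as a rough sketch rather than the code actually used. The function name is hypothetical, and keeping only the last two host labels is a simplifying assumption: it merges `opinion.wsj.com` into `wsj.com` (the risk noted above) and mishandles suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def publisher_domain(url):
    """Collapse a story URL to its registered-looking domain (naive)."""
    host = urlparse(url).hostname or ""
    parts = host.lower().split(".")
    # Assumption: the publisher is identified by the last two labels.
    return ".".join(parts[-2:])


for url in (
    "https://blog.washingtonpost.com/some-story",
    "https://www.washingtonpost.com/politics/",
    "https://opinion.wsj.com/column",
):
    print(url, "->", publisher_domain(url))
```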

==

# Data Selection

Links

- Keep only stories from publishers that have had at least one 'parent' story ('original publishers'). <!-- .element: class="fragment" -->
    - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
    - Eliminate siloed publications.
    - Link matrix is square and low-ish dimensional.

==

# Data Selection

Bias

- Keep all ratings, even ones with low agree/disagree ratio.
- Join datasets on publisher name. 
    - Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
    - Started with Jaro-Winkler similarity, then matched the rest manually.
- Use numeric values
    - [left: -2, left-center: -1, ...]
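
==

# Data Selection

Bias (sketch)

A sketch of how a Jaro-Winkler pass might pre-match publisher names before the manual cleanup. The slides do not show the real join, so the helper names, the lowercasing, and the 0.9 threshold are all assumptions for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match only if they sit within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, used in zip(s1, used1) if used]
    seq2 = [c for c, used in zip(s2, used2) if used]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Boost the Jaro score for strings sharing a prefix of up to 4 chars."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def best_match(name, candidates, threshold=0.9):
    """Return (candidate, score), or None to flag the name for manual review."""
    scored = [(jaro_winkler(name.lower(), c.lower()), c) for c in candidates]
    score, candidate = max(scored)
    return (candidate, score) if score >= threshold else None


print(best_match("washington post", ["Washington Post", "Washington Times"]))
```

Names that come back as `None` are the ones left for the manual step.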

===

# Descriptive Stats

Raw

| metric            |   value |
|:------------------|--------:|
| total stories     |  299714 |
| total related     |  960111 |
| publishers        |    7031 |
| authors           |   34346 |
| max year          |    2023 |
| min year          |    2005 |
| top level domains |    7063 |

==
# Descriptive Stats

Stories Per Publisher

![stories per publisher](/static/577/stories_per_publisher.png)

==

# Descriptive Stats

Top Publishers

![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)

==

# Descriptive Stats

Articles Per Year

![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)

==

# Descriptive Stats

Common TLDs

![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)

==

# Descriptive Stats

Post Process

| metric            |   value |
|:------------------|--------:|
| total stories     |  251553 |
| total related     |  815183 |
| publishers        |     223 |
| authors           |   23809 |
| max year          |    2022 |
| min year          |    2006 |
| top level domains |     234 |

===
# Experiments

1. **clustering** on link similarity. <!-- .element: class="fragment" -->
2. **classification** on link similarity. <!-- .element: class="fragment" -->
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->

===
# Experiment 1

Setup

- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->

Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
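
==

# Experiment 1

PCA (sketch)

A rough illustration of the idea in plain Python: center the data, build the covariance matrix, and use power iteration to find the single direction of greatest variance. This is a stand-in for illustration only; the slides do not say which PCA implementation was used, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, projections) for the top component.

    Assumes at least two rows; power iteration can stall if the start
    vector happens to be orthogonal to the top component.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    vec = [1.0 / (d + 1) for d in range(dims)]  # arbitrary start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Coordinates of each row along the component (the reduced 1-D data).
    projections = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, projections


direction, reduced = first_principal_component([(0, 0), (1, 1), (2, 2), (3, 3)])
print(direction, reduced)
```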

==

# Experiment 1

One Hot Encoding

| publisher |  nytimes|  wsj| newsweek|  ...|
|:----------|--------:|----:|--------:|----:|
| nytimes   |        1|    1|        1|  ...|
| wsj       |        1|    1|        0|  ...|
| newsweek  |        0|    0|        1|  ...|
| ...       |      ...|  ...|      ...|  ...|

==

# Experiment 1

n-Hot Encoding

| publisher |  nytimes|  wsj| newsweek|  ...|
|:----------|--------:|----:|--------:|----:|
| nytimes   |       11|    1|      141|  ...|
| wsj       |        1|   31|        0|  ...|
| newsweek  |        0|    0|        1|  ...|
| ...       |      ...|  ...|      ...|  ...|

==

# Experiment 1

Normalized n-Hot Encoding

| publisher |  nytimes|  wsj| newsweek|  ...|
|:----------|--------:|----:|--------:|----:|
| nytimes   |        0|  0.4|      0.2|  ...|
| wsj       |      0.2|    0|      0.4|  ...|
| newsweek  |      0.0|  0.0|      0.0|  ...|
| ...       |      ...|  ...|      ...|  ...|
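
==

# Experiment 1

Encodings (sketch)

The three encodings can be sketched from a list of links in a few lines. This assumes links are `(linking publisher, parent publisher)` pairs and that "normalized" means dividing each row by its total, which the slides do not define; self-links are counted like any other link. The function name is hypothetical.

```python
def link_encodings(links, publishers):
    """Build one-hot, n-hot (count), and row-normalized link matrices."""
    index = {name: i for i, name in enumerate(publishers)}
    n = len(publishers)
    counts = [[0] * n for _ in range(n)]
    for src, dst in links:
        counts[index[src]][index[dst]] += 1
    # One-hot: did this publisher ever link to that one?
    onehot = [[1 if c > 0 else 0 for c in row] for row in counts]
    # Normalized: share of a publisher's links going to each target (assumed).
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return onehot, counts, normalized


publishers = ["nytimes", "wsj", "newsweek"]
links = [
    ("nytimes", "wsj"),
    ("nytimes", "wsj"),
    ("nytimes", "newsweek"),
    ("wsj", "nytimes"),
]
onehot, nhot, normalized = link_encodings(links, publishers)
for name, row in zip(publishers, normalized):
    print(name, row)
```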

==

# Experiment 1

Elbow criterion

![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)

Note:

The elbow method looks at the percentage of explained variance as a function of the number of clusters: 

One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.

Percentage of variance explained is the ratio of the between-group variance to the total variance.
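
==

# Experiment 1

Elbow criterion (sketch)

The ratio from the note, between-group variance over total variance, can be computed directly for any cluster assignment. A small sketch with a hypothetical function name and made-up points; plotting this value against the number of clusters gives the elbow curve.

```python
def explained_variance(points, labels):
    """Between-group sum of squares divided by total sum of squares."""
    n, dims = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    total = sum(
        sum((p[d] - grand[d]) ** 2 for d in range(dims)) for p in points
    )
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    between = 0.0
    for members in groups.values():
        centroid = [
            sum(p[d] for p in members) / len(members) for d in range(dims)
        ]
        # Each cluster contributes its size times its centroid's distance
        # from the grand mean.
        between += len(members) * sum(
            (centroid[d] - grand[d]) ** 2 for d in range(dims)
        )
    return between / total if total else 0.0


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
for labels in ([0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]):
    print(len(set(labels)), "clusters:", explained_variance(points, labels))
```

Here two clusters already explain most of the variance, so adding more gains little: that flattening is the elbow.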

==

# Experiment 1

Link Magnitude

![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)

==

# Experiment 1

Normalized

![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)

==

# Experiment 1

One-hot

![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)

==

# Experiment 1

Discussion

- Best encoding: One hot. <!-- .element: class="fragment" -->
    - Clusters based on total links otherwise.
- Clusters form, but with no obvious explanation.
- Limitation: need the link encoding to cluster.
    - Smaller publishers might not link very much.

===

# Experiment 2

Setup

- Create features: <!-- .element: class="fragment" -->
    - Publisher frequency.
    - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
    - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->

Note:

==
# Experiment 2
Descriptive stats

| metric      | value     |
|:------------|:----------|
| publishers  | 1582      |
| labels      | 6         |
| left        | 482       |
| center      | 711       |
| right       | 369       |
| agree range | [0.0-1.0] |

==

# Experiment 2

PCA + Labels

![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)

==

# Experiment 2

Discussion

- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
    - Labels are (sort of) separated and clustered.
    - Creating them for smaller publishers is trivial.

==

# Experiment 2

Limitations

- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings not available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->

===

# Experiment 3

Setup

==

# Limitations

- Many different authors under the same publisher. <!-- .element: class="fragment" -->
- Publishers use syndication. <!-- .element: class="fragment" -->
- Bias ratings are biased. <!-- .element: class="fragment" -->

===

# Questions

===

<!-- .section: id="references" -->

# References

[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.

Note:
v1.0 of presentation. 2023-05-17 13:38:07 -07:00