2023-05-17 13:38:07 -07:00
_model: slides
---
title: CSCI 577 - Data Mining
---
body:
# Political Polarization
Matt Jensen
===
# Hypothesis
Political polarization is rising, and news articles are a proxy measure.
==
# Is this reasonable?
==
# Why is polarization rising?
Not my job, but there's research<sup>[ref](#references)</sup> to support it
==
# Sub-hypothesis
- The polarization increases near elections. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
==
# Sub-sub-hypothesis
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
===
# Data Source(s)
memeorandum.com <!-- .element: class="fragment" -->
allsides.com <!-- .element: class="fragment" -->
huggingface.co <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
===
# memeorandum.com
- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->
==
# Why Memeorandum?
- Behavioral: Sometimes I only read titles (doom scrolling). <!-- .element: class="fragment" -->
- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element: class="fragment" -->
- Convenient: Most publishers block bots. <!-- .element: class="fragment" -->
- Convenient: Dead simple HTML to parse. <!-- .element: class="fragment" -->
- Archival: All headlines from 2006 forward. <!-- .element: class="fragment" -->
- Archival: Automated, not editorialized. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
===
# AllSides.com
- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
    - blind bias surveys.
    - editorial reviews.
    - third party research.
    - community voting.
- Originally scraped the website; eventually got direct access. <!-- .element: class="fragment" -->
==
# Why AllSides?
- Behavioral: One of the first Google results for bias APIs. <!-- .element: class="fragment" -->
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Convenient: Easy format. <!-- .element: class="fragment" -->
- Archival: Covers 1400 publishers. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
===
# HuggingFace.co
- Deep Learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
==
# Why HuggingFace?
- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
- Behavioral: The dataset needed more features. <!-- .element: class="fragment" -->
- Convenient: Literally 5 lines of Python. <!-- .element: class="fragment" -->
- Convenient: Testing different model performance was easy. <!-- .element: class="fragment" -->
- Archival: Lots of pretrained classification tasks. <!-- .element: class="fragment" -->
===
# Data Structures
Stories
- Top level stories. <!-- .element: class="fragment" -->
    - title.
    - publisher.
    - author.
- Related discussion. <!-- .element: class="fragment" -->
    - publisher.
    - uses 'parent' story as a source.
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
==
# Data Structures
Bias
- Per publisher. <!-- .element: class="fragment" -->
    - name.
    - label.
    - agree/disagree vote by community.
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
==
# Data Structures
Embeddings
- Per story title. <!-- .element: class="fragment" -->
    - sentence embedding (n, 384).
    - sentiment classification (n, 1).
    - emotional classification (n, 1).
- ~1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
===
# Data Collection
==
# Data Collection
Story Scraper (simplified)
```python
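from datetime import date, timedelta
from pathlib import Path

import requests

# Assumption: output_dir is any writable directory; this path is a placeholder.
output_dir = Path("data/memeorandum")
output_dir.mkdir(parents=True, exist_ok=True)

day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    # One archived snapshot per day, saved as YY-MM-DD.html.
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day  # advance last, so the first day is not skipped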
```
==
# Data Collection
Bias Scraper (hard)
```python
from lxml import etree
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
# Each row of the ratings table describes one publisher.
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')
ratings = []
for row in rows:
    rating = dict()
    ...
```
==
# Data Collection
Bias Scraper (easy)
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
==
# Data Collection
Embeddings (easy)
```python
from transformers import AutoModel, AutoTokenizer

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
for chunk in table:
    tokens = tokenizer(chunk, add_special_tokens=True, truncation=True,
                       padding="max_length", max_length=92,
                       return_attention_mask=True, return_tensors="pt")
    outputs = model(**tokens)
    # Token-level hidden states for every title in the chunk.
    embeddings = outputs.last_hidden_state.detach().numpy()
...
```
==
# Data Collection
Classification Embeddings (medium)
```python
import numpy as np
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)  # most likely class per story
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
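Note:
The scoring step above can be tried without a model. This is only a minimal sketch: the logits and the `id2label` mapping below are made up for illustration (the real mapping comes from `model.config.id2label`). Since the sigmoid is monotonic, the argmax over scores picks the same class as the argmax over raw logits.

```python
import math

# Hypothetical label mapping; the real one comes from model.config.id2label.
id2label = {0: "negative", 1: "neutral", 2: "positive"}


def sigmoid(x):
    """Logistic function mapping a logit to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(logit_rows, ids):
    """Return one {"story_id", "label"} record per row of logits."""
    results = []
    for story_id, logits in zip(ids, logit_rows):
        scores = [sigmoid(v) for v in logits]
        # Index of the highest score, i.e. the predicted class id.
        class_id = max(range(len(scores)), key=scores.__getitem__)
        results.append({"story_id": story_id, "label": id2label[class_id]})
    return results


# Made-up logits for two stories.
records = classify([[2.0, -1.0, 0.5], [-3.0, 0.1, 1.2]], ids=[101, 102])
```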
===
# Data Selection
==
# Data Selection
Stories
- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
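Note:
The two selection steps above could look roughly like this. It is a sketch under assumptions: the `url` and `date` field names are hypothetical, dates are ISO strings, duplicates are identified by URL, and "clipping" is read as dropping the partial first and last years.

```python
def clip_edge_years(stories):
    """Drop stories from the earliest and latest (partial) years."""
    years = [int(s["date"][:4]) for s in stories]
    first, last = min(years), max(years)
    return [s for s in stories if first < int(s["date"][:4]) < last]


def drop_duplicate_stories(stories):
    """Keep the earliest occurrence of each story, keyed on its URL."""
    seen = set()
    unique = []
    for story in sorted(stories, key=lambda s: s["date"]):
        if story["url"] in seen:
            continue
        seen.add(story["url"])
        unique.append(story)
    return unique


# Made-up records; a big story shows up on two consecutive days.
stories = [
    {"url": "https://example.com/a", "date": "2005-10-01"},
    {"url": "https://example.com/b", "date": "2010-01-03"},
    {"url": "https://example.com/b", "date": "2010-01-02"},
    {"url": "https://example.com/c", "date": "2023-05-01"},
]
selected = drop_duplicate_stories(clip_edge_years(stories))
```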
==
# Data Selection
Publishers
- Combine subdomains of stories. <!-- .element: class="fragment" -->
    - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
    - This could be bad. For example: opinion.wsj.com != wsj.com.
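Note:
One naive way to combine subdomains is to keep only the last two labels of the host name. This is an illustrative sketch, not necessarily the rule used in the project; `publisher_domain` is a hypothetical helper, and it would mishandle multi-part suffixes such as `bbc.co.uk`.

```python
from urllib.parse import urlparse


def publisher_domain(url):
    """Collapse a story URL to the last two labels of its host name."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


same = publisher_domain("https://blog.washingtonpost.com/story") == \
    publisher_domain("https://washingtonpost.com/story")
```

The same rule also merges opinion.wsj.com into wsj.com, which is the trade-off noted on the slide.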
==
# Data Selection
Links
- Select only stories with publishers whose story had been a 'parent' ('original publishers'). <!-- .element: class="fragment" -->
    - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
    - Eliminate silo'ed publications.
    - Link matrix is square and low'ish dimensional.
==
# Data Selection
Bias
- Keep all ratings, even ones with low agree/disagree ratio.
- Join datasets on publisher name.
    - Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
    - Started with 'jaro winkler similarity' then manually from there.
- Use numeric values.
    - [left: -2, left-center: -1, ...]
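Note:
For reference, Jaro-Winkler similarity can be written in plain Python. This is a textbook-style sketch rather than the code used for the join; `best_match` is a hypothetical helper, and the prefix weight `p=0.1` with a 4-character prefix cap is the commonly used default.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def best_match(name, candidates):
    """Return the candidate most similar to `name` (case-insensitive)."""
    return max(candidates, key=lambda c: jaro_winkler(name.lower(), c.lower()))
```

The prefix boost can favor names that merely start the same way, which is one reason the join still needed manual review.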
===
# Descriptive Stats
Raw
| metric | value |
|:------------------|--------:|
| total stories | 299714 |
| total related | 960111 |
| publishers | 7031 |
| authors | 34346 |
| max year | 2023 |
| min year | 2005 |
| top level domains | 7063 |
==
# Descriptive Stats
Stories Per Publisher
![stories per publisher](/static/577/stories_per_publisher.png)
==
# Descriptive Stats
Top Publishers
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
==
# Descriptive Stats
Articles Per Year
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
==
# Descriptive Stats
Common TLDs
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
==
# Descriptive Stats
Post Process
| metric | value |
|:------------------|--------:|
| total stories | 251553 |
| total related | 815183 |
| publishers | 223 |
| authors | 23809 |
| max year | 2022 |
| min year | 2006 |
| top level domains | 234 |
===
# Experiments
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
2. **classification** on link similarity. <!-- .element: class="fragment" -->
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
===
# Experiment 1
Setup
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- a linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
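
As a compact sketch of that definition (the symbols $X$, $C$, $V_k$ and $\lambda_i$ are notation introduced here, not names from the project code): take the $n$ publishers as rows of a mean-centered matrix $X \in \mathbb{R}^{n \times d}$, then

$$
C = \frac{1}{n-1} X^\top X, \qquad C = V \Lambda V^\top, \qquad Z = X V_k
$$

where the columns of $V_k$ are the $k$ eigenvectors of $C$ with the largest eigenvalues. The fraction of variance kept by the projection $Z$ is $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{d} \lambda_i$.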
==
# Experiment 1
One Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 1| 1| 1| ...|
| wsj | 1| 1| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 11| 1| 141| ...|
| wsj | 1| 31| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
Normalized n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 0| 0.4| 0.2| ...|
| wsj | 0.2| 0| 0.4| ...|
| newsweek | 0.0| 0.0| 0.0| ...|
| ... | ...| ...| ...| ...|
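Note:
The three encodings shown on these slides can be derived from one list of links. The sketch below assumes links arrive as hypothetical `(source, target)` publisher pairs and that "normalized" means dividing each row by its total link count; the project may have normalized differently.

```python
from collections import Counter


def link_encodings(links, publishers):
    """Build n-hot, one-hot and row-normalized link matrices.

    links: iterable of (source, target) publisher pairs.
    publishers: ordered list fixing the row and column order.
    """
    counts = Counter(links)
    # n-hot: how many times each source links to each target.
    n_hot = [[counts[(src, dst)] for dst in publishers] for src in publishers]
    # one-hot: whether the source links to the target at all.
    one_hot = [[1 if c > 0 else 0 for c in row] for row in n_hot]
    normalized = []
    for row in n_hot:
        total = sum(row)
        # Assumption: normalize each row by its total number of links.
        normalized.append([c / total if total else 0.0 for c in row])
    return n_hot, one_hot, normalized


# Made-up links for illustration.
publishers = ["nytimes", "wsj", "newsweek"]
links = [("nytimes", "wsj"), ("nytimes", "newsweek"),
         ("nytimes", "newsweek"), ("wsj", "nytimes")]
n_hot, one_hot, normalized = link_encodings(links, publishers)
```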
==
# Experiment 1
Elbow criterion
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
Note:
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
Percentage of variance explained is the ratio of the between-group variance to the total variance.
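
Written out, with notation introduced here for illustration ($x_i$ are the encoded publishers, $\bar{x}$ their overall mean, $C_j$ the clusters and $\mu_j$ their centroids):

$$
SS_T = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2, \qquad
SS_W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad
\text{explained}(k) = \frac{SS_T - SS_W(k)}{SS_T}
$$

The numerator $SS_T - SS_W(k)$ is the between-group sum of squares; the elbow is the $k$ after which this ratio stops improving noticeably.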
==
# Experiment 1
Link Magnitude
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
==
# Experiment 1
Normalized
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
==
# Experiment 1
Onehot
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
==
# Experiment 1
Discussion
- Best encoding: One hot. <!-- .element: class="fragment" -->
    - Clusters based on total links otherwise.
- Clusters, but no explanation.
- Limitation: need the link encoding to cluster.
    - Smaller publishers might not link very much.
===
# Experiment 2
Setup
- Create features: <!-- .element: class="fragment" -->
    - Publisher frequency.
    - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
    - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->
Note:
==
# Experiment 2
Descriptive stats
| metric | value |
|:------------|:----------|
| publishers | 1582 |
| labels | 6 |
| left | 482 |
| center | 711 |
| right | 369 |
| agree range | [0.0-1.0] |
==
# Experiment 2
PCA + Labels
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
==
# Experiment 2
Discussion
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
    - Labels are (sort of) separated and clustered.
    - Creating them for smaller publishers is trivial.
==
# Experiment 2
Limitations
- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings not available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
===
# Experiment 3
Setup
==
# Limitations
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
- Publishers use syndication. <!-- .element: class="fragment" -->
- Bias ratings are biased. <!-- .element: class="fragment" -->
===
# Questions
===
<!-- .section: id="references" -->
# References
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
Note: