add exp 4/5

This commit is contained in:
matt
2023-05-17 21:38:21 -07:00
parent 74c2d8afa2
commit 3f7b3ad467
16 changed files with 905 additions and 59 deletions

Binary file not shown.

After

Width:  |  Height:  |  Size: 51 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 275 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 44 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 30 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 44 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 128 KiB

View File

@@ -54,6 +54,11 @@ allsides.com <!-- .element: class="fragment" -->
huggingface.com <!-- .element: class="fragment" -->
Note:
Let's get a handle on the shape of the data.
The sources, size, and features of the data.
===
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
@@ -128,7 +133,7 @@ huggingface.com <!-- .element: class="fragment" -->
===
# Data Structures
Stories
## Stories
- Top level stories. <!-- .element: class="fragment" -->
- title.
@@ -142,7 +147,7 @@ Stories
==
# Data Structures
Bias
## Bias
- Per publisher. <!-- .element: class="fragment" -->
- name.
@@ -153,7 +158,7 @@ Bias
==
# Data Structures
Embeddings
## Embeddings
- Per story title. <!-- .element: class="fragment" -->
- sentence embedding (n, 384).
@@ -169,7 +174,7 @@ Embeddings
# Data Collection
Story Scraper (simplified)
## Story Scraper (simplified)
```python
day = timedelta(days=1)
@@ -187,7 +192,8 @@ while cur <= end:
==
# Data Collection
Bias Scraper (hard)
## Bias Scraper (hard)
```python
...
@@ -206,14 +212,16 @@ for row in rows:
==
# Data Collection
Bias Scraper (easy)
## Bias Scraper (easy)
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
==
# Data Collection
Embeddings (easy)
## Embeddings (easy)
```python
# table = ...
@@ -230,7 +238,8 @@ for chunk in table:
==
# Data Collection
Classification Embeddings (medium)
## Classification Embeddings (medium)
```python
...
@@ -249,7 +258,8 @@ for i, class_id in enumerate(class_ids):
==
# Data Selection
Stories
## Stories
- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
@@ -257,7 +267,7 @@ Stories
==
# Data Selection
Publishers
## Publishers
- Combine subdomains of stories. <!-- .element: class="fragment" -->
- blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
@@ -267,7 +277,7 @@ Publishers
# Data Selection
Links
## Links
- Select only stories with publishers whose story had been a 'parent' ('original publishers'). <!-- .element: class="fragment" -->
- Eliminates small blogs and non-original news.
@@ -279,7 +289,7 @@ Links
# Data Selection
Bias
## Bias
- Keep all ratings, even ones with low agree/disagree ratio.
- Join datasets on publisher name.
@@ -292,7 +302,7 @@ Bias
# Descriptive Stats
Raw
## Raw
| metric | value |
|:------------------|--------:|
@@ -307,7 +317,7 @@ Raw
==
# Descriptive Stats
Stories Per Publisher
## Stories Per Publisher
![stories per publisher](/static/577/stories_per_publisher.png)
@@ -315,7 +325,7 @@ Stories Per Publisher
# Descriptive Stats
Top Publishers
## Top Publishers
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
@@ -323,7 +333,7 @@ Top Publishers
# Descriptive Stats
Articles Per Year
## Articles Per Year
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
@@ -331,7 +341,7 @@ Articles Per Year
# Descriptive Stats
Common TLDs
## Common TLDs
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
@@ -339,9 +349,9 @@ Common TLDs
# Descriptive Stats
Post Process
## Post Process
| key | value |
| metric | value |
|:------------------|--------:|
| total stories | 251553 |
| total related | 815183 |
@@ -352,6 +362,7 @@ Post Process
| top level domains | 234 |
===
# Experiments
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
@@ -361,9 +372,16 @@ Post Process
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
===
# Experiment 1
Setup
**clustering** on link similarity.
==
# Experiment 1
## Setup
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
@@ -379,7 +397,7 @@ Principle Component Analysis:
# Experiment 1
One Hot Encoding
## One Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
@@ -392,7 +410,7 @@ One Hot Encoding
# Experiment 1
n-Hot Encoding
## n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
@@ -405,7 +423,7 @@ n-Hot Encoding
# Experiment 1
Normalized n-Hot Encoding
## Normalized n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
@@ -418,7 +436,7 @@ Normalized n-Hot Encoding
# Experiment 1
Elbow criterion
## Elbow criterion
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
@@ -434,7 +452,7 @@ Percentage of variance explained is the ratio of the between-group variance to t
# Experiment 1
Link Magnitude
## Link Magnitude
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
@@ -442,7 +460,7 @@ Link Magnitude
# Experiment 1
Normalized
## Normalized
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
@@ -450,7 +468,7 @@ Normalized
# Experiment 1
Onehot
## One Hot
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
@@ -458,20 +476,27 @@ Onehot
# Experiment 1
Discussion
## Discussion
- Best encoding: One hot. <!-- .element: class="fragment" -->
- Clusters based on total links otherwise.
- Clusters, but no explanation
- Limitation: need the link encoding to cluster.
- Clusters, but no explanation. <!-- .element: class="fragment" -->
- Limitation: need the link encoding to cluster. <!-- .element: class="fragment" -->
- Smaller publishers might not link very much.
- TODO: Association Rule Mining. <!-- .element: class="fragment" -->
===
# Experiment 2
Setup
**classification** on link similarity.
==
# Experiment 2
## Setup
- **clustering**. <!-- .element: class="fragment" -->
- Create features. <!-- .element: class="fragment" -->:
- Publisher frequency.
- Reuse link encodings.
@@ -483,7 +508,8 @@ Note:
==
# Experiment 2
Descriptive stats
## Descriptive stats
| metric | value |
|:------------|:----------|
@@ -498,7 +524,7 @@ Descriptive stats
# Experiment 2
PCA + Labels
## PCA + Labels
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
@@ -506,7 +532,7 @@ PCA + Labels
# Experiment 2
Discussion
## Discussion
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
- Labels are (sort of) separated and clustered.
@@ -515,7 +541,7 @@ Discussion
# Experiment 2
Limitations
## Limitations
- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings not available. <!-- .element: class="fragment" -->
@@ -525,13 +551,260 @@ Limitations
===
# Experiment 3
# Experiment 3
Setup
**classification** on sentence embedding.
==
# Limitations
# Experiment 3
## Setup
- **classification**. <!-- .element: class="fragment" -->
- Generate sentence embedding for each title. <!-- .element: class="fragment" -->
- Rerun PCA analysis on title embeddings. <!-- .element: class="fragment" -->
- Use kNN classifier to map embedding features to bias rating. <!-- .element: class="fragment" -->
==
# Experiment 3
## Sentence Embeddings
1. Extract titles.
2. Tokenize titles.
3. Pick pretrained Language Model.
4. Generate embeddings from tokens.
==
# Experiment 3
## Tokens
**The sentence:**
"Spain, Land of 10 P.M. Dinners, Asks if It's Time to Reset Clock"
**Tokenizes to:**
```
['[CLS]', 'spain', ',', 'land', 'of', '10', 'p', '.', 'm', '.',
'dinners', ',', 'asks', 'if', 'it', "'", 's', 'time', 'to',
'reset', 'clock', '[SEP]']
```
Note:
[CLS] is unique to BERT models and stands for classification.
==
# Experiment 3
## Tokens
**The sentence:**
"NPR/PBS NewsHour/Marist Poll Results and Analysis"
**Tokenizes to:**
```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris',
'##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```
Note:
The padding is there to make all tokenized vectors equal length.
The tokenizer also outputs a mask vector that the language model uses to ignore the padding.
==
# Experiment 3
## Embeddings
- Using a BERT (Bidirectional Encoder Representations from Transformers) based model.
- Input: tokens.
- Output: dense vectors representing 'semantic meaning' of tokens.
==
# Experiment 3
## Embeddings
**The tokens:**
```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris',
'##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```
**Embeds to a vector (1, 384):**
```
array([[ 0.12444635, -0.05962477, -0.00127911, ..., 0.13943022,
-0.2552534 , -0.00238779],
[ 0.01535596, -0.05933844, -0.0099495 , ..., 0.48110735,
0.1370568 , 0.3285091 ],
[ 0.2831368 , -0.4200529 , 0.10879617, ..., 0.15663117,
-0.29782432, 0.4289513 ],
...,
```
==
# Experiment 3
## Results
![pca vs. classes](https://studentweb.cs.wwu.edu/~jensen33/static/577/embedding_sentence_pca.png)
Note:
Not a lot of information in PCA this time.
==
# Experiment 3
## Results
![pca vs. avg embedding](https://studentweb.cs.wwu.edu/~jensen33/static/577/avg_embedding_sentence_pca.png) <!-- .element: class="r-stretch" -->
Note:
What about average publisher embedding?
==
# Experiment 3
## Results
![knn embedding confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentence_confusion.png)
Note:
Trained a kNN from sklearn.
Set aside 20% of the data as a test set.
Once trained, compared the predictions with the true on the test set.
==
# Experiment 3
## Discussion
- Embedding space is hard to condense with PCA. <!-- .element: class="fragment" -->
- Maybe the classifier is learning to guess 'left-ish'? <!-- .element: class="fragment" -->
===
# Experiment 4
**classification** on sentiment analysis.
==
# Experiment 4
## Setup
- Use pretrained Language Classifier. <!-- .element: class="fragment" -->
- Previously: Mapped twitter posts to tokens, to embedding, to ['positive', 'negative'] labels. <!-- .element: class="fragment" -->
- Predict: rate of neutral titles decreasing over time.
==
# Experiment 4
## Results
![sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentiment_over_time.png)
==
# Experiment 4
## Results
![bias vs. sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_sentiment_over_time.png)
==
# Experiment 4
## Discussion
-
===
# Experiment 5
**regression** on emotional classification over time and publication.
==
# Experiment 5
## Setup
- Use pretrained language classifier. <!-- .element: class="fragment" -->
- Previously: Mapped reddit posts to tokens, to embedding, to emotion labels. <!-- .element: class="fragment" -->
- Predict: rate of neutral titles decreasing over time.
- Classify:
- features: emotional labels
- labels: bias
==
# Experiment 5
## Results
![emotion over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_over_time.png)
==
# Experiment 5
## Results
![emotion regression time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_regression.png)
==
# Experiment 5
## Discussion
- Neutral story titles dominate the dataset. <!-- .element: class="fragment" -->
- Increase in stories published might explain most of the trend. <!-- .element: class="fragment" -->
- Far-right and far-left both became less neutral. <!-- .element: class="fragment" -->
- Left-Center and right-center became more emotional, but also neutral. <!-- .element: class="fragment" -->
- Not a lot of movement overall. <!-- .element: class="fragment" -->
===
# Experiment 6 (**TODO**)
## Setup
- Have a lot of features now. <!-- .element: class="fragment" -->
- Link PCA components.
- Embedding PCA components.
- Sentiment.
- Emotion.
- Can we predict with all of them: Bias. <!-- .element: class="fragment" -->
- End user: Is that useful? Where will I get all that at inference time? <!-- .element: class="fragment" -->
===
# Overall Limitations
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
- Publishers use syndication. <!-- .element: class="fragment" -->