v1.0 of presentation.

matt
2023-05-17 13:38:07 -07:00
parent 4d93cf7adb
commit 74c2d8afa2
37 changed files with 1959 additions and 144 deletions

3
docs/Makefile Normal file

@@ -0,0 +1,3 @@
paper.pdf: paper.tex
	pdflatex $^
	evince $@

BIN
docs/figures/common_tld.png Normal file (binary image, 24 KiB; several other binary figure images added but not shown)

BIN
docs/paper.pdf Normal file (binary file, not shown)

61
docs/paper.tex Normal file

@@ -0,0 +1,61 @@
\documentclass{article}
\usepackage{multicol}
\usepackage{hyperref}
\title{Data Mining CS 571}
\author{Matt Jensen}
\date{2023-04-25}
\begin{document}
\maketitle
\section*{Abstract}
News organizations have been repeatedly accused of being partisan.
Additionally, they have been accused of polarizing discussion to drive up revenue and engagement.
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
A secondary goal is to investigate whether news organizations have been uniformly polarized, or whether one pole has been 'moving' more rapidly away from the 'middle'.
This analysis will probe to what degree the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
Noam Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of an agenda rather than its cause.
\begin{multicols}{2}
\section{Data Preparation}
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://memeorandum.com}{Memeorandum}, covering stories from 2006 to 2022.
Each news article has a title, author, description, publisher, publish date, url and related discussions.
The site also has a concept of references, where a main, popular story may be covered by other sources.
This link association might be used to support one or more of the hypotheses of the main analysis.
After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
What remains after these cleaning steps is approximately 6,400 days of material: 300,000 distinct headlines from 21,000 publishers and 34,000 authors.
\section{Missing Data Policy}
The largest missing-data issue to be dealt with is news organizations that share the same parent company but appear under slightly different names.
The Wall Street Journal's news coverage, for example, is drastically different from its opinion section.
Other organizations appear under slightly different names for the same outlet, a product of the aggregation service rather than any real difference.
Luckily, most of the analysis operates on the content of the news headlines, which do not suffer from this data impurity.
\section{Classification Task}
The classification of news titles into emotional categories was accomplished by using a pretrained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
The classes for each article will be derived by tokenizing the title, running the model over the tokens, and taking the largest-probability class from the output.
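Formally, for a headline $t$ the assigned class is $\hat{c} = \arg\max_{c \in C} p(c \mid t)$, where $C$ is the set of 28 emotion classes and $p(c \mid t)$ is the model's output probability for class $c$.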
The data has been discretized into years.
Additionally, the publishers will be discretized based either on principal component analysis of link similarity or on the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{AllSides}.
Given that the features of the dataset are sparse, no useless attributes are expected unless the original hypothesis of a temporal trend proves to be false.
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
\section{Experiments}
No computational experiments have been run yet.
Generating the tokenized text, the word embeddings and the emotional sentiment analysis has made up the bulk of the work thus far.
The bias ratings do not cover all publishers in the dataset, so the number of articles whose publisher lacks a bias rating will have to be calculated.
If it is less than 30\% of the articles, it might not make sense to use the bias ratings.
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationships between related publishers.
\section{Results}
\textbf{TODO.}
\end{multicols}
\end{document}

552
docs/presentation.md Normal file

@@ -0,0 +1,552 @@
_model: slides
---
title: CSCI 577 - Data Mining
---
body:
# Political Polarization
Matt Jensen
===
# Hypothesis
Political polarization is rising, and news articles are a proxy measure.
==
# Is this reasonable?
==
# Why is polarization rising?
Not my job, but there's research<sup>[ref](#references)</sup> to support it
==
# Sub-hypothesis
- The polarization increases near elections. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
==
# Sub-sub-hypothesis
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
===
# Data Source(s)
memeorandum.com <!-- .element: class="fragment" -->
allsides.com <!-- .element: class="fragment" -->
huggingface.co <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
===
# memeorandum.com
- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->
==
# Why Memeorandum?
- Behavioral: Sometimes I only read the titles (doom scrolling). <!-- .element: class="fragment" -->
- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element: class="fragment" -->
- Convenient: most publishers block bots. <!-- .element: class="fragment" -->
- Convenient: dead simple html to parse. <!-- .element: class="fragment" -->
- Archival: all headlines from 2006 forward. <!-- .element: class="fragment" -->
- Archival: automated, not editorialized. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
===
# AllSides.com
- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
  - blind bias surveys.
  - editorial reviews.
  - third party research.
  - community voting.
- Originally scraped the website, but eventually got direct access. <!-- .element: class="fragment" -->
==
# Why AllSides?
- Behavioral: One of the first Google results for bias APIs. <!-- .element: class="fragment" -->
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Convenient: Easy format. <!-- .element: class="fragment" -->
- Archival: Covers 1400 publishers. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
===
# HuggingFace.co
- Deep Learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
==
# Why HuggingFace?
- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
- Behavioral: The dataset needed more features.<!-- .element: class="fragment" -->
- Convenient: Literally 5 lines of python.<!-- .element: class="fragment" -->
- Convenient: Testing different model performance was easy.<!-- .element: class="fragment" -->
- Archival: Lots of pretrained classification tasks.<!-- .element: class="fragment" -->
===
# Data Structures
Stories
- Top level stories. <!-- .element: class="fragment" -->
  - title.
  - publisher.
  - author.
- Related discussion. <!-- .element: class="fragment" -->
  - publisher.
  - uses 'parent' story as a source.
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
==
# Data Structures
Bias
- Per publisher. <!-- .element: class="fragment" -->
  - name.
  - label.
  - agree/disagree vote by community.
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
==
# Data Structures
Embeddings
- Per story title. <!-- .element: class="fragment" -->
  - sentence embedding (n, 384).
  - sentiment classification (n, 1).
  - emotional classification (n, 1).
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
===
# Data Collection
==
# Data Collection
Story Scraper (simplified)
```python
from datetime import date, timedelta
from pathlib import Path
import requests

output_dir = Path('data/memeorandum')  # assumed output location
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    cur = cur + day
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
```
==
# Data Collection
Bias Scraper (hard)
```python
from lxml import etree
...
bias_html = DATA_DIR / 'allsides.html'  # DATA_DIR defined elsewhere
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')
ratings = []
for row in rows:
    rating = dict()
    ...
```
==
# Data Collection
Bias Scraper (easy)
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
==
# Data Collection
Embeddings (easy)
```python
from transformers import AutoTokenizer, AutoModel

# table = ...  (batches of story titles)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
for chunk in table:
    tokens = tokenizer(chunk, add_special_tokens=True, truncation=True, padding="max_length", max_length=92, return_attention_mask=True, return_tensors="pt")
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()  # (batch, tokens, hidden)
    ...
```
==
# Data Collection
Classification Embeddings (medium)
```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
===
# Data Selection
==
# Data Selection
Stories
- Clip the partial first and last years of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
==
# Data Selection
Publishers
- Combine subdomains of stories. <!-- .element: class="fragment" -->
  - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
  - This could be bad. For example: opinion.wsj.com != wsj.com.
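==
# Data Selection
Publishers (grouping sketch)
A minimal sketch of the subdomain collapsing, assuming the `tldextract` package (not necessarily what was used here):
```python
import tldextract

# blog.washingtonpost.com and washingtonpost.com collapse to the same key
print(tldextract.extract("blog.washingtonpost.com").registered_domain)  # washingtonpost.com
print(tldextract.extract("washingtonpost.com").registered_domain)       # washingtonpost.com
```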
==
# Data Selection
Links
- Select only stories from publishers that have had a 'parent' story at least once ('original publishers'). <!-- .element: class="fragment" -->
  - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
  - Eliminates siloed publications.
  - The link matrix becomes square and relatively low-dimensional.
==
# Data Selection
Bias
- Keep all ratings, even ones with a low agree/disagree ratio.
- Join datasets on publisher name.
  - Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
  - Started with Jaro-Winkler similarity, then matched manually from there.
- Use numeric values.
  - [left: -2, left-center: -1, ...]
===
# Descriptive Stats
Raw
| metric | value |
|:------------------|--------:|
| total stories | 299714 |
| total related | 960111 |
| publishers | 7031 |
| authors | 34346 |
| max year | 2023 |
| min year | 2005 |
| top level domains | 7063 |
==
# Descriptive Stats
Stories Per Publisher
![stories per publisher](https://studentweb.cs.wwu.edu/~jensen33/static/577/stories_per_publisher.png)
==
# Descriptive Stats
Top Publishers
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
==
# Descriptive Stats
Articles Per Year
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
==
# Descriptive Stats
Common TLDs
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
==
# Descriptive Stats
Post Process
| key | value |
|:------------------|--------:|
| total stories | 251553 |
| total related | 815183 |
| publishers | 223 |
| authors | 23809 |
| max year | 2022 |
| min year | 2006 |
| top level domains | 234 |
===
# Experiments
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
2. **classification** on link similarity. <!-- .element: class="fragment" -->
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
===
# Experiment 1
Setup
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- a linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
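==
# Experiment 1
Clustering sketch
A minimal sketch of the pipeline, assuming scikit-learn; the random `links` matrix and `n_clusters=8` are placeholders for the real one-hot link encoding and the chosen cluster count:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
links = (rng.random((223, 223)) < 0.05).astype(float)  # placeholder one-hot link encoding

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(links)
coords = PCA(n_components=2).fit_transform(links)       # 2-D projection for plotting
# scatter `coords` colored by `kmeans.labels_` to inspect the clusters
```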
==
# Experiment 1
One Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 1| 1| 1| ...|
| wsj | 1| 1| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 11| 1| 141| ...|
| wsj | 1| 31| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
Normalized n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 0| 0.4| 0.2| ...|
| wsj | 0.2| 0| 0.4| ...|
| newsweek | 0.0| 0.0| 0.0| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
Elbow criterion
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
Note:
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
The percentage of variance explained is the ratio of the between-group variance to the total variance.
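==
# Experiment 1
Elbow sweep (sketch)
A rough sketch of the elbow computation, using the same placeholder `links` matrix as the clustering sketch (scikit-learn assumed):
```python
import numpy as np
from sklearn.cluster import KMeans

links = (np.random.default_rng(0).random((223, 223)) < 0.05).astype(float)  # placeholder
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(links).inertia_
            for k in range(2, 16)]
# plot k vs. inertia and pick the k where adding clusters stops helping much
```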
==
# Experiment 1
Link Magnitude
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
==
# Experiment 1
Normalized
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
==
# Experiment 1
Onehot
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
==
# Experiment 1
Discussion
- Best encoding: one-hot. <!-- .element: class="fragment" -->
  - Otherwise clusters form based on total link counts.
- Clusters form, but without an obvious explanation.
- Limitation: the link encoding is needed in order to cluster.
  - Smaller publishers might not link very much.
===
# Experiment 2
Setup
- Create features: <!-- .element: class="fragment" -->
  - Publisher frequency.
  - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
  - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->
Note:
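==
# Experiment 2
Classifier sketch
A rough sketch under assumed names: logistic regression stands in for the actual classifier, and the random `X`/`y` are placeholders for the real link features and joined AllSides labels:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1582, 10))     # placeholder per-publisher link/PCA features
y = rng.integers(-2, 3, size=1582)  # placeholder numeric bias labels in [-2, 2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```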
==
# Experiment 2
Descriptive stats
| metric | value |
|:------------|:----------|
| publishers | 1582 |
| labels | 6 |
| left | 482 |
| center | 711 |
| right | 369 |
| agree range | [0.0-1.0] |
==
# Experiment 2
PCA + Labels
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
==
# Experiment 2
Discussion
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
  - Labels are (sort of) separated and clustered.
  - Creating them for smaller publishers is trivial.
==
# Experiment 2
Limitations
- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings not available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
===
# Experiment 3
Setup
==
# Limitations
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
- Publishers use syndication. <!-- .element: class="fragment" -->
- Bias ratings are biased. <!-- .element: class="fragment" -->
===
# Questions
===
<!-- .section: id="references" -->
# References
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
Note: