_model: slides
---
title: CSCI 577 - Data Mining
---
body:

# Political Polarization

Matt Jensen

===

# Hypothesis

Political polarization is rising, and news articles are a proxy measure.

==

# Is this reasonable?

==

# Why is polarization rising?

Not my job, but there's research[ref](#references) to support it.

==

# Sub-hypotheses

- Polarization increases near elections.
- Polarization is not evenly distributed across publishers.
- Polarization is not evenly distributed across the political spectrum.

==

# Sub-sub-hypotheses

- Similarly polarized publishers link to each other.
- 'Mainstream' media uses more neutral titles.
- Highly polarized publications don't last as long.

===

# Data Source(s)

- memeorandum.com
- allsides.com
- huggingface.co

Note:
Let's get a handle on the shape of the data: the sources, size, and features of the data.
===

# memeorandum.com

- News aggregation site.
- Was really famous before Google News.
- Still aggregates sites today.

==

# Why Memeorandum?

- Behavioral: sometimes I only read the titles (doom scrolling).
- Behavioral: it's my source of news (along with sister site TechMeme.com).
- Convenient: most publishers block bots.
- Convenient: dead simple HTML to parse (sketch on the next slide).
- Archival: all headlines from 2006 forward.
- Archival: automated, not editorialized.
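==

# Why Memeorandum?
## Parsing Sketch

A minimal sketch of what "dead simple to parse" looks like on a saved day page. The XPath here is a hypothetical stand-in for the real page structure, not memeorandum's actual markup.

```python
# Sketch: pull headline anchors out of a saved memeorandum day page.
# The XPath is a hypothetical stand-in for the real page structure.
from pathlib import Path

from lxml import etree

page = Path('data/06-10-01.html')
tree = etree.parse(str(page), etree.HTMLParser())

stories = []
for a in tree.getroot().xpath('//strong/a'):  # hypothetical headline selector
    stories.append({'title': (a.text or '').strip(), 'url': a.get('href')})
```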
===

# AllSides.com

- Rates news publications as left, center or right.
- Ratings combine:
  - blind bias surveys.
  - editorial reviews.
  - third party research.
  - community voting.
- Originally scraped the website, but eventually got direct access.

==

# Why AllSides?

- Behavioral: one of the first Google results for bias APIs.
- Convenient: ordinal ratings [-2: very left, 2: very right] (mapping sketch on the next slide).
- Convenient: easy format.
- Archival: covers 1400 publishers.
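==

# Why AllSides?
## Ordinal Mapping Sketch

A minimal sketch of turning the ratings into the numeric scale used later; the exact label spellings are assumptions.

```python
# Map AllSides-style labels onto the ordinal [-2, 2] scale used later.
# The label spellings are assumptions; adjust to the real export.
BIAS_SCALE = {
    'left':         -2,
    'left-center':  -1,
    'center':        0,
    'right-center':  1,
    'right':         2,
}

def bias_to_ordinal(label: str) -> int:
    """e.g. bias_to_ordinal('Left-Center') -> -1"""
    return BIAS_SCALE[label.strip().lower()]
```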
===

# HuggingFace.co

- Deep Learning library.
- Lots of pretrained models.
- Easy, off the shelf word/sentence embeddings and text classification models.

==

# Why HuggingFace?

- Behavioral: Language Models are HOT right now.
- Behavioral: the dataset needed more features.
- Convenient: literally five lines of Python.
- Convenient: testing different models' performance was easy.
- Archival: lots of pretrained classification tasks.

===

# Data Structures
## Stories

- Top level stories.
  - title.
  - publisher.
  - author.
- Related discussion.
  - publisher.
  - uses 'parent' story as a source.
- Stream of stories (changes constantly).

==

# Data Structures
## Bias

- Per publisher.
  - name.
  - label.
  - agree/disagree vote by community.
- Name could be semi-automatically joined to stories.

==

# Data Structures
## Embeddings

- Per story title.
  - sentence embedding (n, 384).
  - sentiment classification (n, 1).
  - emotional classification (n, 1).
- ~1 hour of inference time to map story titles and descriptions.

===

# Data Collection

==

# Data Collection
## Story Scraper (simplified)

```python
from datetime import date, timedelta
from pathlib import Path

import requests

output_dir = Path('data')  # where the raw day pages land
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()

while cur <= end:
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    # Each day's headlines live at a predictable, date-stamped URL.
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day  # increment last so the first day isn't skipped
```

==

# Data Collection
## Bias Scraper (hard)

```python
...
bias_html = DATA_DIR / 'allsides.html'

parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    ...
```

==

# Data Collection
## Bias Scraper (easy)

![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)

==

# Data Collection
## Embeddings (easy)

```python
from transformers import AutoTokenizer, AutoModel

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    tokens = tokenizer(chunk,
                       add_special_tokens=True,
                       truncation=True,
                       padding="max_length",
                       max_length=92,
                       return_attention_mask=True,
                       return_tensors="pt")
    outputs = model(**tokens)
    # One dense vector per token, per title.
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```

==

# Data Collection
## Classification Embeddings (medium)

```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # sigmoid over the logits
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i],
                    "label": model.config.id2label[class_id]})
...
```

===

# Data Selection

==

# Data Selection
## Stories

- Clip to full years only (drop the partial first and last years).
- Remove duplicate stories (big stories span multiple days).

==

# Data Selection
## Publishers

- Combine subdomains of stories.
  - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
  - This could be bad. For example: opinion.wsj.com != wsj.com.

==

# Data Selection
## Links

- Keep only stories from publishers that have had a story be a 'parent' ('original publishers').
  - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers.
  - Eliminates siloed publications.
- Result: the link matrix is square and low-ish dimensional.

==

# Data Selection
## Bias

- Keep all ratings, even ones with a low agree/disagree ratio.
- Join datasets on publisher name.
  - Not automatic (look up Named Entity Recognition).
  - Started with Jaro-Winkler similarity, then matched manually from there (sketch on the next slide).
- Use numeric values: [left: -2, left-center: -1, ...].
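==

# Data Selection
## Bias Join Sketch

A minimal sketch of the fuzzy name join, assuming the `jellyfish` library for Jaro-Winkler similarity; the 0.9 threshold and the example names are illustrative, not the actual pipeline.

```python
# Sketch: propose a bias-rating match for each publisher name and
# leave low-similarity pairs for manual review.
import jellyfish

def best_match(publisher: str, rated_names: list[str],
               threshold: float = 0.9):  # threshold is illustrative
    scored = [(jellyfish.jaro_winkler_similarity(publisher.lower(), name.lower()), name)
              for name in rated_names]
    score, name = max(scored)
    return name if score >= threshold else None  # None -> manual review

best_match('washingtonpost.com', ['Washington Post', 'Washington Times'])
```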
===

# Descriptive Stats
## Raw

| metric            |   value |
|:------------------|--------:|
| total stories     |  299714 |
| total related     |  960111 |
| publishers        |    7031 |
| authors           |   34346 |
| max year          |    2023 |
| min year          |    2005 |
| top level domains |    7063 |

==

# Descriptive Stats
## Stories Per Publisher

![stories per publisher](/static/577/stories_per_publisher.png)

==

# Descriptive Stats
## Top Publishers

![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)

==

# Descriptive Stats
## Articles Per Year

![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)

==

# Descriptive Stats
## Common TLDs

![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)

==

# Descriptive Stats
## Post Process

| metric            |   value |
|:------------------|--------:|
| total stories     |  251553 |
| total related     |  815183 |
| publishers        |     223 |
| authors           |   23809 |
| max year          |    2022 |
| min year          |    2006 |
| top level domains |     234 |

===

# Experiments

1. **clustering** on link similarity.
2. **classification** on link similarity.
3. **classification** on sentence embedding.
4. **classification** on sentiment analysis.
5. **regression** on emotional classification over time and publication.

===

# Experiment 1

**clustering** on link similarity.

==

# Experiment 1
## Setup

- Create one-hot encoding of links between publishers.
- Cluster the encoding.
- Expect similar publications in the same cluster.
- Use PCA to visualize clusters.

Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- a linear transformation into a new coordinate system where most of the variation in the data can be described with fewer dimensions than the initial data.

==

# Experiment 1
## One-Hot Encoding

| publisher | nytimes | wsj | newsweek | ... |
|:----------|--------:|----:|---------:|----:|
| nytimes   |       1 |   1 |        1 | ... |
| wsj       |       1 |   1 |        0 | ... |
| newsweek  |       0 |   0 |        1 | ... |
| ...       |     ... | ... |      ... | ... |

==

# Experiment 1
## n-Hot Encoding

| publisher | nytimes | wsj | newsweek | ... |
|:----------|--------:|----:|---------:|----:|
| nytimes   |      11 |   1 |      141 | ... |
| wsj       |       1 |  31 |        0 | ... |
| newsweek  |       0 |   0 |        1 | ... |
| ...       |     ... | ... |      ... | ... |

==

# Experiment 1
## Normalized n-Hot Encoding

| publisher | nytimes | wsj | newsweek | ... |
|:----------|--------:|----:|---------:|----:|
| nytimes   |       0 | 0.4 |      0.2 | ... |
| wsj       |     0.2 |   0 |      0.4 | ... |
| newsweek  |     0.0 | 0.0 |      0.0 | ... |
| ...       |     ... | ... |      ... | ... |

==

# Experiment 1
## Elbow Criterion

![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)

Note:
The elbow method looks at the percentage of explained variance as a function of the number of clusters: one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. Percentage of variance explained is the ratio of the between-group variance to the total variance.

==

# Experiment 1
## Link Magnitude

![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)

==

# Experiment 1
## Normalized

![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)

==

# Experiment 1
## One Hot

![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)

==

# Experiment 1
## Discussion

- Best encoding: one-hot (pipeline sketch on the next slide).
- Clusters, but no explanation.
- Limitation: need the link encoding to cluster.
- Smaller publishers might not link very much.
- TODO: Association Rule Mining.
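==

# Experiment 1
## Pipeline Sketch

A minimal sketch of the pipeline described above, assuming `links` is the publisher-by-publisher link-count matrix (the n-hot encoding); the cluster-count range and the chosen `k` are illustrative, not the actual values.

```python
# Sketch: one-hot the link counts, pick k with the elbow criterion,
# then project to 2D with PCA for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

onehot = (links > 0).astype(float)  # one-hot: did A ever link to B?

inertias = []
for k in range(2, 20):              # elbow: look for diminishing returns
    km = KMeans(n_clusters=k, n_init=10).fit(onehot)
    inertias.append(km.inertia_)

best_k = 4                          # read off the elbow plot (illustrative)
labels = KMeans(n_clusters=best_k, n_init=10).fit_predict(onehot)
coords = PCA(n_components=2).fit_transform(onehot)  # 2D view of the clusters
```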
===

# Experiment 2

**classification** on link similarity.

==

# Experiment 2
## Setup

- **classification**.
- Create features:
  - publisher frequency.
  - reuse link encodings.
- Create classes:
  - join bias classifications.
- Train classifier.

==

# Experiment 2
## Descriptive stats

| metric      | value     |
|:------------|:----------|
| publishers  | 1582      |
| labels      | 6         |
| left        | 482       |
| center      | 711       |
| right       | 369       |
| agree range | [0.0-1.0] |

==

# Experiment 2
## PCA + Labels

![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)

==

# Experiment 2
## Discussion

- Link encodings (and their PCA) are useful.
- Labels are (sort of) separated and clustered.
- Creating them for smaller publishers is trivial.

==

# Experiment 2
## Limitations

- Dependent on accurate ratings.
- Ordinal ratings not available.
- Dependent on accurate joining across datasets.
- Entire publication is rated, not individual authors.
- Don't know what to do with the community rating.

===

# Experiment 3

**classification** on sentence embedding.

==

# Experiment 3
## Setup

- **classification**.
- Generate a sentence embedding for each title.
- Rerun PCA analysis on title embeddings.
- Use a kNN classifier to map embedding features to bias ratings (sketch before the results).

==

# Experiment 3
## Sentence Embeddings

1. Extract titles.
2. Tokenize titles.
3. Pick pretrained Language Model.
4. Generate embeddings from tokens.

==

# Experiment 3
## Tokens

**The sentence:**

"Spain, Land of 10 P.M. Dinners, Asks if It's Time to Reset Clock"

**Tokenizes to:**

```
['[CLS]', 'spain', ',', 'land', 'of', '10', 'p', '.', 'm', '.', 'dinners', ',', 'asks', 'if', 'it', "'", 's', 'time', 'to', 'reset', 'clock', '[SEP]']
```

Note:
[CLS] is unique to BERT models and stands for classification.

==

# Experiment 3
## Tokens

**The sentence:**

"NPR/PBS NewsHour/Marist Poll Results and Analysis"

**Tokenizes to:**

```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris', '##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```

Note:
The padding is there to make all tokenized vectors equal length. The tokenizer also outputs a mask vector that the language model uses to ignore the padding.

==

# Experiment 3
## Embeddings

- Using a BERT (Bidirectional Encoder Representations from Transformers) based model.
- Input: tokens.
- Output: dense vectors representing the 'semantic meaning' of tokens.

==

# Experiment 3
## Embeddings

**The tokens:**

```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris', '##t', 'poll', 'results', 'and', 'analysis', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```

**Embed to a matrix of per-token vectors (22, 384):**

```
array([[ 0.12444635, -0.05962477, -0.00127911, ...,  0.13943022, -0.2552534 , -0.00238779],
       [ 0.01535596, -0.05933844, -0.0099495 , ...,  0.48110735,  0.1370568 ,  0.3285091 ],
       [ 0.2831368 , -0.4200529 ,  0.10879617, ...,  0.15663117, -0.29782432,  0.4289513 ],
       ...,
```

==

# Experiment 3
## Results

![pca vs. classes](https://studentweb.cs.wwu.edu/~jensen33/static/577/embedding_sentence_pca.png)

Note:
Not a lot of information in PCA this time.

==

# Experiment 3
## Results

![pca vs. avg embedding](https://studentweb.cs.wwu.edu/~jensen33/static/577/avg_embedding_sentence_pca.png)

Note:
What about the average publisher embedding?
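==

# Experiment 3
## kNN Sketch

A minimal sketch of the classifier step described in the notes (kNN from sklearn, 20% test split), assuming `X` holds the per-title embeddings and `y` the joined bias labels; the `n_neighbors` value is illustrative.

```python
# Sketch: hold out 20% of the data, fit a kNN on the embeddings, and
# compare predictions against the true labels on the test set.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors is illustrative
knn.fit(X_train, y_train)

preds = knn.predict(X_test)
cm = confusion_matrix(y_test, preds)       # the matrix on the next slide
```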
==

# Experiment 3
## Results

![knn embedding confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentence_confusion.png)

Note:
Trained a kNN from sklearn. Set aside 20% of the data as a test set. Once trained, compared the predictions with the true labels on the test set.

==

# Experiment 3
## Discussion

- Embedding space is hard to condense with PCA.
- Maybe the classifier is just learning to guess 'left-ish'?

===

# Experiment 4

**classification** on sentiment analysis.

==

# Experiment 4
## Setup

- Use a pretrained language classifier.
- Previously: mapped Twitter posts to tokens, to embeddings, to ['positive', 'negative'] labels.
- Prediction: the rate of neutral titles decreases over time.

==

# Experiment 4
## Results

![sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentiment_over_time.png)

==

# Experiment 4
## Results

![bias vs. sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_sentiment_over_time.png)

==

# Experiment 4
## Discussion

-

===

# Experiment 5

**regression** on emotional classification over time and publication.

==

# Experiment 5
## Setup

- Use a pretrained language classifier.
- Previously: mapped Reddit posts to tokens, to embeddings, to emotion labels.
- Prediction: the rate of neutral titles decreases over time.
- Classify:
  - features: emotional labels.
  - labels: bias.

==

# Experiment 5
## Results

![emotion over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_over_time.png)

==

# Experiment 5
## Results

![emotion regression time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_regression.png)

==

# Experiment 5
## Discussion

- Neutral story titles dominate the dataset.
- The increase in stories published might explain most of the trend.
- Far-right and far-left both became less neutral.
- Left-center and right-center became more emotional, but also more neutral.
- Not a lot of movement overall.

===

# Experiment 6 (**TODO**)
## Setup

- Have a lot of features now:
  - link PCA components.
  - embedding PCA components.
  - sentiment.
  - emotion.
- Can we predict bias with all of them?
- End user: is that useful? Where will I get all of that at inference time?

===

# Overall Limitations

- Many different authors under the same publisher.
- Publishers use syndication.
- Bias ratings are biased.

===

# Questions

===

# References

[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI: https://doi.org/10.1126/sciadv.abd4201.