_model: slides
---
title: CSCI 577 - Data Mining
---
body:

# Political Polarization

## CSCI 577

**Matt Jensen**

*May 18, 2023*

==

# Outline

- Hypothesis
- Sources
- Data Workup
- Experiments
- Remaining Work
- Questions

===

# Hypothesis

==

# Hypothesis

Political polarization is rising, and news articles are a proxy measure.

==

# Why might we expect this?

Mostly anecdotal experience.
Our goal is whether, not why.

Note:

> Proliferation of media choices lowered the share of less interested, less partisan
> voters and thereby made elections more partisan. But evidence for a causal
> link between more partisan messages and changing attitudes or behaviors is
> mixed at best. Measurement problems hold back research on partisan selective
> exposure and its consequences. Ideologically one-sided news exposure
> may be largely confined to a small, but highly involved and influential, segment
> of the population. There is no firm evidence that partisan media are
> making ordinary Americans more partisan.

==

# Sub-hypothesis

- The polarization is not evenly distributed across publishers.
- The polarization is not evenly distributed across the political spectrum.
- The polarization increases near elections.

==

# Sub-sub-hypothesis

- Similarly polarized publishers link to each other.
- 'Mainstream' media uses more neutral titles.
- Highly polarized publications don't last as long.

Note:

- Publication longevity is not covered currently.
- Mainstream media dominates the dataset.

===

# Data Sources

==

# Data Sources

- Memeorandum: **stories**
- AllSides: **bias**
- HuggingFace: **sentiment**
- ChatGPT: **election dates**

Note:

Let's get a handle on the shape of the data:

- sources
- size
- features

===

# Memeorandum

==

# Memeorandum

- News aggregation site.
- Was really famous before Google News.
- Still aggregates sites today.

==

# Memeorandum

- I still use it.
- I like to read titles.
- Publishers block bots.
- Simple html to parse.
- Headlines from 2006 forward.
- Automated, not editorialized.

Note:

- It limits doom scrolling.

===

# AllSides

==

# AllSides

- Rates publications as left, center or right.
- Ratings combine:
  - blind bias surveys.
  - editorial reviews.
  - third party research.
  - community voting.

Note:

Originally scraped the website, but gained direct API access eventually.

==

# AllSides

- One of the only bias APIs.
- Ordinal ratings [-2: very left, 2: very right].
- Covers 1400 publishers plus some blogs and authors.
- Easy format and semi-complete data.

===

# HuggingFace

==

# HuggingFace

- Deep learning library.
- Lots of pretrained models.
- Easy, off-the-shelf word/sentence embeddings and text classification models.

==

# HuggingFace

- Language models are **HOT**.
- Literally 5 lines of python.
- The dataset needed more features.
- Testing different model performance was easy.
- Lots of pretrained classification tasks.

===

# Data Collection

==

# Data Collection

## Stories

```python
from datetime import date, timedelta
from pathlib import Path
import requests

output_dir = Path('data/memeorandum')  # wherever the raw pages land

day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    # one archive page per day, e.g. .../051001/h2000
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day
```

Note:

Grab every page from 2005 forward.

Later: parse it into a csv/database.

==

# Data Collection

## Bias

**hard**

```python
from lxml import etree
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    ...
```

Note:

Grab the entire index.

Later: parse it into a csv/database.

==

# Data Collection

## Bias

**easy**

![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)

Note:

JSON format, including authors and blogs.

==

# Data Collection

## Embeddings

```python
from transformers import AutoModel, AutoTokenizer

# table = ...
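# `table` is a hypothetical name from this slide: assumed to be an iterable of
# batches (lists) of story title strings; the code that builds it is not shown.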
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    tokens = tokenizer(chunk,
                       add_special_tokens=True,
                       truncation=True,
                       padding="max_length",
                       max_length=92,
                       return_attention_mask=True,
                       return_tensors="pt")
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```

Note:

For every title, tokenize then embed.

The hidden state is the last linear layer before the training tasks.

==

# Data Collection

## Classification Embeddings

```python
import numpy as np
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```

Note:

For every title, tokenize, then classify.

~1 hour.

===

# Data Structures

## Stories

Note:

Great, we have the data, now what does it look like?

==

# Data Structures

## Stories

- Top level stories.
  - title, author, publisher, url, date.
- Related discussion.
  - publisher, url.
  - uses the 'parent' story as a source.
- Story stream changes constantly (dedup. required).

==

# Data Structures

## Stories

![raw story table](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_stories_table.png)

==

# Data Structures

## Stories

![raw related table](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_related_table.png)

==

# Data Structures

## Stories

| metric            |   value |
|:------------------|--------:|
| total stories     |  299714 |
| total related     |  960111 |
| publishers        |    7031 |
| authors           |   34346 |
| max year          |    2023 |
| min year          |    2005 |
| top level domains |    7063 |

==

# Data Selection

## Stories

- Clip the partial first and last years of stories.
- Remove duplicate stories (big stories span multiple days).
- Convert urls to TLDs to link stories to publishers (sketched just before the Common TLDs slide).

Note:

tld: top level domain.

==

# Data Selection

## Publishers

- Combine subdomains of stories.
  - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
  - This could be bad. For example: opinion.wsj.com != wsj.com.
- Find the common name of each publisher.

Note:

Sometimes authors are listed as the publisher name.

==

# Data Selection

## Related

- Select only stories from publishers that have been a 'parent' story at least once ('original publishers').
  - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers.
  - Eliminates silo'ed publications.
- The link matrix is square and low'ish dimensional.

Note:

Going to build a data structure of the related links, so I have to be judicious about which ones to include.

==

# Data Selection

## Post Process

| metric            |   value |
|:------------------|--------:|
| total stories     |  251553 |
| total related     |  815183 |
| publishers        |     223 |
| authors           |   23809 |
| max year          |    2022 |
| min year          |    2006 |
| top level domains |     234 |

Note:

Far fewer publishers, but the story count stays about the same: mainstream publishers are well represented.

==

# Descriptive Stats

## Stories Per Publisher

![stories per publisher](https://studentweb.cs.wwu.edu/~jensen33/static/577/stories_per_publisher.png)

Note:

A power law is in effect.

==

# Descriptive Stats

## Top Publishers

![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)

Note:

Some publishers come and go.

Some publishers change their domains.

==

# Descriptive Stats

## Articles Per Year

![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)

Note:

The shape of total articles per year dominates some of the analysis.
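==

# Data Selection

## URL to Publisher (sketch)

The next slide counts top-level domains; this is a minimal sketch of how a story URL might be collapsed to a publisher domain. It is hedged: `to_publisher_domain` is a hypothetical helper, not the project's actual code, and the two-label heuristic ignores multi-part suffixes (a library like `tldextract` handles those).

```python
from urllib.parse import urlparse


def to_publisher_domain(url: str) -> str:
    """Collapse a story URL to its publisher domain (naive sketch).

    'https://blog.washingtonpost.com/post' -> 'washingtonpost.com'
    """
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    parts = host.split(".")
    # keep the last two labels; suffixes like .co.uk need a real suffix list
    return ".".join(parts[-2:]) if len(parts) >= 2 else host
```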
==

# Descriptive Stats

## Common TLDs

![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)

Note:

Just for fun. Lots of IP addresses and spammy-looking domains.

===

# Data Structures

## Bias

==

# Data Structures

## Bias

- Per publisher:
  - name,
  - label/ordinal value,
  - agree/disagree vote by community.
- Name could be semi-automatically joined to stories.

==

# Data Structures

## Bias

![raw bias table](https://studentweb.cs.wwu.edu/~jensen33/static/577/raw_bias_table.png)

Note:

Later, media type and explicit ordinal values were added via API access.

==

# Data Selection

## Bias

- Keep all ratings.
- Join datasets on publisher name.
  - Started with 'jaro winkler similarity', then manually from there (look up Named Entity Recognition).
- Use numeric values.
  - [left: -2, left-center: -1, ...].
  - Possibly scale the ordinal based on the agree/disagree ratio.

Note:

Lots of agrees on the ends of the spectrum implies they're very left or very right.

Lots of agrees in the middle implies very neutral?

==

# Data

## Bias

![bias hist](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_hist.png)

==

# Data

## Bias

![selected bias](https://studentweb.cs.wwu.edu/~jensen33/static/577/selected_bias_table.png)

Note:

Much smaller dataset.

TODO: manually add more joins to story sources.

===

# Data Structures

## Embeddings

==

# Data Structures

## Embeddings

- Per story title:
  - sentence embedding (n, 384) - **BERT**.
  - sentiment classification (n, 1) - **RoBERTa base**.
  - emotional classification (n, 1) - **RoBERTa Go-Emotions**.
- ~1 hour of inference time to map story titles and descriptions.

Note:

RoBERTa is pretrained with the Masked Language Modeling (MLM) objective: taking a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words.

SST - Stanford Sentiment Treebank: 11,855 single sentences extracted from movie reviews, annotated by 3 human judges.

==

# Data Selection

## Embeddings

- Word embeddings were too complicated.
- Kept the argmax of the classification prediction ([0.82, 0.18] -> LABEL_0).
- For publisher-based analysis, averaged the sentence embeddings of all stories.

==

# Data

## Embeddings

| label    |   stories |   publishers |
|:---------|----------:|-------------:|
| positive |     87830 |          223 |
| negative |    163723 |          223 |

Note:

There was a model with a neutral label as well, but I opted not to use it.

==

# Data

## Embeddings

| label    |   stories |   publishers |
|:---------|----------:|-------------:|
| neutral  |    124257 |          223 |
| anger    |     34124 |          223 |
| fear     |     36756 |          223 |
| sadness  |     27449 |          223 |
| disgust  |     17939 |          222 |
| surprise |      5710 |          216 |
| joy      |      5318 |          214 |

===

# Experiments

==

# Experiments

1. **clustering** on link similarity.
2. **classification** on link similarity.
3. **classification** on sentence embedding.
4. **classification** on sentiment analysis.
5. **regression** on emotional classification over time and publication.

Note:

5 main experiments.

Lots of tinkering and 'agile development'. Use source control.

===

# Experiment 1

**clustering** on link similarity.

==

# Experiment 1

## Setup

- Create a one-hot encoding of links between publishers.
- Cluster the encoding (sketched on the next slide).
- Expect similar publications in the same cluster.
- Use PCA to visualize the clusters.

Note:

Principal Component Analysis:

- A statistical technique for reducing the dimensionality of a dataset.
- A linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than in the initial data.
- I use it a lot to map from a high-dimensional space (link adjacency and embeddings) to a lower, most significant space.
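==

# Experiment 1

## Setup (sketch)

A minimal sketch of the setup above, under assumptions: `links.csv` and its column names are hypothetical stand-ins for the related-story table, and the cluster count is illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# one row per related-story edge: which publisher linked to which
links = pd.read_csv("links.csv")  # hypothetical columns: publisher, linked_publisher

# n-hot encoding: link counts between each pair of publishers
counts = pd.crosstab(links["publisher"], links["linked_publisher"])
one_hot = (counts > 0).astype(int)                   # one-hot: did A ever link to B?
normalized = counts.div(counts.sum(axis=1), axis=0)  # row-normalized link frequencies

# elbow criterion: inertia for a range of cluster counts
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(one_hot).inertia_
            for k in range(2, 16)}

# cluster the one-hot encoding, then project to 2D for plotting
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(one_hot)
coords = PCA(n_components=2).fit_transform(one_hot)
```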
==

# Experiment 1

## Encoding schemes

==

# Experiment 1

## One-hot Encoding

| publisher | nytimes | wsj | newsweek | ... |
|:----------|--------:|----:|---------:|----:|
| nytimes   |       1 |   1 |        1 | ... |
| wsj       |       1 |   1 |        0 | ... |
| newsweek  |       0 |   0 |        1 | ... |
| ...       |     ... | ... |      ... | ... |

==

# Experiment 1

## n-Hot Encoding

| publisher | nytimes | wsj | newsweek | ... |
|:----------|--------:|----:|---------:|----:|
| nytimes   |      11 |   1 |      141 | ... |
| wsj       |       1 |  31 |        0 | ... |
| newsweek  |       0 |   0 |        1 | ... |
| ...       |     ... | ... |      ... | ... |

==

# Experiment 1

## Normalized n-Hot Encoding

| publisher | nytimes | wsj | newsweek | ... |
|:----------|--------:|----:|---------:|----:|
| nytimes   |       0 | 0.4 |      0.2 | ... |
| wsj       |     0.2 |   0 |      0.4 | ... |
| newsweek  |     0.0 | 0.0 |      0.0 | ... |
| ...       |     ... | ... |      ... | ... |

==

# Experiment 1

## Elbow criterion

![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)

Note:

The elbow method looks at the percentage of explained variance as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn't give much better modeling of the data.

Percentage of variance explained is the ratio of the between-group variance to the total variance.

sklearn eliminated two cluster groups?

==

# Experiment 1

## Comparing encoding schemes

Note:

They all have good clusters.

==

# Experiment 1

## Link Magnitude

![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)

Note:

Link frequency dominates one component.

More interested in bias between publishers, not the difference between mainstream and outliers.

==

# Experiment 1

## Normalized

![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)

Note:

A few outliers still, but better.

==

# Experiment 1

## One-Hot

![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)

Note:

Really dispersed.

==

# Experiment 1

## Discussion

- One-hot seems to reflect the right features.
- Found clusters, but their meaning is arbitrary.
  - They map to the PCA results nicely.
- Limitation: need the link encoding to cluster.
  - Smaller publishers might not link very much.
- TODO: Association Rule Mining.
  - 'Basket of goods' analysis to group publishers.

===

# Experiment 2

**classification** on link similarity.

==

# Experiment 2

## Setup

- Create features:
  - Publisher frequency.
  - Reuse link encodings.
- Create classes:
  - Join bias classifications.
- Train a classifier (sketched after the results).

==

# Experiment 2

## Descriptive stats

| metric      | value     |
|:------------|:----------|
| publishers  | 1582      |
| labels      | 6         |
| left        | 482       |
| center      | 711       |
| right       | 369       |
| agree range | [0.0-1.0] |

Note:

A rehash of what bias data is available.

==

# Experiment 2

## Results

![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)

Note:

PCA maps to the bias labels well: left on one end, right on the other. If you squint.

==

# Experiment 2

## Results

![link confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_confusion.png)

Note:

A hot diagonal is good. This is all the data; the train/test split only had 20 or so samples in it?

The overlap between link choices and bias ratings is slim.
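==

# Experiment 2

## Pipeline (sketch)

A minimal sketch of the kind of pipeline behind the confusion matrix above, under assumptions: `one_hot` is the link encoding from Experiment 1 (indexed by publisher), `bias` is a hypothetical Series of bias labels indexed by publisher name, and the k-nearest-neighbors model is a stand-in since the actual classifier isn't shown on these slides.

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# keep only the publishers present in both the link encoding and the bias ratings
labeled = one_hot.join(bias.rename("bias"), how="inner")
X = labeled.drop(columns="bias")
y = labeled["bias"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_))
```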
==

# Experiment 2

## Discussion

- Link encodings (and their PCA) are useful.
- Labels are (sort of) separated and clustered.
- Creating them for smaller publishers is trivial.
- A hot diagonal confusion matrix is good.
- Need to link more publisher data to get good test data.

==

# Experiment 2

## Limitations

- Dependent on accurate ratings.
- Ordinal ratings weren't available.
- Dependent on accurate joining across datasets.
- The entire publication is rated, not individual authors.
- Don't know what to do with the community rating.

===

# Experiment 3

**classification** on sentence embedding.

==

# Experiment 3

## Setup

- Generate a sentence embedding for each title.
- Rerun the PCA analysis on the title embeddings.
- Use a kNN classifier to map embedding features to a bias rating (sketched after the results).

==

# Experiment 3

## Embeddings Primer

==

# Experiment 3

## Embedding Steps

1. Extract titles.
2. Tokenize titles.
3. Pick a pretrained language model.
4. Generate embeddings from tokens using the model.

==

# Experiment 3

## Tokens

**The sentence:**

"Spain, Land of 10 P.M. Dinners, Asks if It's Time to Reset Clock"

**Tokenizes to:**

```
['[CLS]', 'spain', ',', 'land', 'of', '10', 'p', '.', 'm', '.', 'dinners', ',',
 'asks', 'if', 'it', "'", 's', 'time', 'to', 'reset', 'clock', '[SEP]']
```

Note:

[CLS] is unique to BERT models and stands for classification.

==

# Experiment 3

## Tokens

**The sentence:**

"NPR/PBS NewsHour/Marist Poll Results and Analysis"

**Tokenizes to:**

```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris', '##t', 'poll',
 'results', 'and', 'analysis', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',
 '[PAD]', '[PAD]', '[PAD]']
```

Note:

The padding is there to make all tokenized vectors equal length.

The tokenizer also outputs a mask vector that the language model uses to ignore the padding.

==

# Experiment 3

## Embeddings

- Using a BERT (Bidirectional Encoder Representations from Transformers) based model.
- Input: tokens.
- Output: dense vectors representing the 'semantic meaning' of tokens.

==

# Experiment 3

## Embeddings

**The tokens:**

```
['[CLS]', 'npr', '/', 'pbs', 'news', '##ho', '##ur', '/', 'maris', '##t', 'poll',
 'results', 'and', 'analysis', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',
 '[PAD]', '[PAD]', '[PAD]']
```

**Embeds to a vector (1, 384):**

```
array([[ 0.12444635, -0.05962477, -0.00127911, ...,  0.13943022, -0.2552534 , -0.00238779],
       [ 0.01535596, -0.05933844, -0.0099495 , ...,  0.48110735,  0.1370568 ,  0.3285091 ],
       [ 0.2831368 , -0.4200529 ,  0.10879617, ...,  0.15663117, -0.29782432,  0.4289513 ],
       ...,
```

Note:

Attention masks allow the model to ignore padding so all vectors are the same length.

The embedding space has semantic meaning, so you can do vector math on the vectors:

- king - man = monarch
- monarch + dance = happy?

==

# Experiment 3

## Results

![pca vs. classes](https://studentweb.cs.wwu.edu/~jensen33/static/577/embedding_sentence_pca.png)

Note:

PCA on the sentence embeddings of the titles.

Not a lot of information in the PCA this time.

==

# Experiment 3

## Results

![pca vs. avg embedding](https://studentweb.cs.wwu.edu/~jensen33/static/577/avg_embedding_sentence_pca.png)

Note:

What about the average publisher embedding? The centers are pushed outside?

Sorry about the color palette.

==

# Experiment 3

## Results

![knn embedding confusion](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentence_confusion.png)

Note:

Trained a kNN from sklearn. Set aside 20% of the data as a test set. Once trained, compared the predictions with the true labels on the test set.

Not bad.
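==

# Experiment 3

## Pipeline (sketch)

A minimal sketch of the embedding-plus-kNN pipeline described above, under assumptions: `titles` and `bias_labels` are hypothetical lists (one story title and its publisher's bias label each), and the 384-dimensional MiniLM sentence-embedding checkpoint is a stand-in since the exact model isn't pinned down on these slides.

```python
import torch
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)


def embed(batch):
    """Mean-pool the last hidden state into one (384,) vector per title."""
    tokens = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state  # (batch, seq, 384)
    mask = tokens["attention_mask"].unsqueeze(-1)   # zero out the padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()


X = embed(titles)
X_train, X_test, y_train, y_test = train_test_split(X, bias_labels, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))
```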
==

# Experiment 3

## Discussion

- Embedding space is hard to condense with PCA.
- Maybe the classifier is learning to guess 'left-ish'?
- Does DL work better on sparse inputs?

===

# Experiment 4

**classification** on sentiment analysis.

==

# Experiment 4

## Setup

- Use a pretrained language classifier.
- Previously: mapped twitter posts to tokens, to embeddings, to ['positive', 'negative'] labels.
- Predict: the rate of neutral titles decreases over time.

==

# Experiment 4

## Results

![sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/sentiment_over_time.png)

Note:

Maybe there's something there. Less positive after 2008? Low around 2016? Increase around 2020? Overall still lower.

==

# Experiment 4

## Results

![bias vs. sentiment over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_sentiment_over_time.png)

Note:

The right doesn't have a lot of data.

All trend down over time.

People loved Obama at the beginning.

==

# Experiment 4

## Results

![sentiment vs. election recency](https://studentweb.cs.wwu.edu/~jensen33/static/577/bias_vs_recent_winner.png)

Note:

Assumption: national elections drive news sentiment.

Expected a taller band in the middle than at the edges.

==

# Experiment 4

## Discussion

- Bump post Obama election for left and center.
- Dip pre Trump election for left and center.
- Right is all over the place - not enough data?
- Recency of election is not a clear factor.

===

# Experiment 5

**regression** on title emotional expression.

==

# Experiment 5

## Setup

- Use a pretrained language classifier.
- Previously: mapped reddit posts to tokens, to embeddings, to emotion labels.
- Predict: the rate of neutral titles decreases over time (regression sketched after the discussion).
- Classify:
  - features: emotional labels
  - labels: bias

==

# Experiment 5

## Results

![emotion over time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_over_time.png)

Note:

Neutrality between Obama and Trump.

Emotional titles all increased: the shape of the underlying data.

TODO: normalize relative expression.

==

# Experiment 5

## Results

![emotion regression time](https://studentweb.cs.wwu.edu/~jensen33/static/577/emotion_regression.png)

Note:

Left and right got less neutral over time.

==

# Experiment 5

## Discussion

- Neutral story titles dominate the dataset.
- The increase in stories published might explain most of the trend.
- Far-right and far-left both became less neutral.
- Left-center and right-center became more emotional, but also neutral.
- Not a lot of movement overall.
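==

# Experiment 5

## Regression (sketch)

A minimal sketch of the trend regression behind these results, under assumptions: `stories` is a hypothetical DataFrame with one row per story and columns `year`, `bias`, and `emotion` (the argmax Go-Emotions label); the real analysis may differ.

```python
import numpy as np

# share of 'neutral' titles per (bias group, year)
neutral_share = (
    stories.assign(neutral=stories["emotion"].eq("neutral"))
           .groupby(["bias", "year"])["neutral"]
           .mean()
           .reset_index()
)

# fit a simple linear trend per bias group: neutral share ~ year
for bias, group in neutral_share.groupby("bias"):
    slope, intercept = np.polyfit(group["year"], group["neutral"], deg=1)
    print(f"{bias}: {slope:+.4f} change in neutral share per year")
```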
===

# Conclusion

==

# Hypothesis

- The polarization is not evenly distributed across publishers. **unproven**
- The polarization is not evenly distributed across the political spectrum. **unproven**
- The polarization increases near elections. **false**
- Similarly polarized publishers link to each other. **sorta**
- 'Mainstream' media uses more neutral titles. **true**
- Highly polarized publications don't last as long. **untested**

==

# Conclusion

- Article titles do not have a lot of predictive power.
- Mainstream, neutral publications dominate the dataset.
- Link frequency, sentence embeddings, and sentiments are useful features.
- A few questions remain.

Note:

Experiment 6 (**TODO**):

- Have a lot of features now:
  - Link PCA components.
  - Embedding PCA components.
  - Sentiment.
  - Emotion.
- Can we predict bias with all of them?

Limitations:

- Many different authors under the same publisher.
- Publishers use syndication.
- Bias ratings are biased and not linked automatically.
- National news is generally designed to be neutral sounding.
- End user: is that useful? Where will I get all of that at inference time?

==

# Questions

==

# References

[1] Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances 6, 50 (Dec. 2020), eabd4201. DOI: https://doi.org/10.1126/sciadv.abd4201