title: CSCI 577 - Data Mining


Political Polarization

Matt Jensen



Political polarization is rising, and news articles are a proxy measure.


Is this reasonable?


Why is polarization rising?

Not my job, but there's research [1] to support it.



  • The polarization increases near elections.
  • The polarization is not evenly distributed across publishers.
  • The polarization is not evenly distributed across the political spectrum.



  • Similarly polarized publishers link to each other.
  • 'Mainstream' media uses more neutral titles.
  • Highly polarized publications don't last as long.


Data Source(s)







  • News aggregation site.
  • Was really famous before Google News.
  • Still aggregates sites today.


Why Memeorandum?

  • Behavioral: I sometimes only read the titles (doom scrolling).
  • Behavioral: It's my source of news (with sister site TechMeme.com).
  • Convenient: most publishers block bots.
  • Convenient: dead simple html to parse.
  • Archival: all headlines from 2006 forward.
  • Archival: automated, not editorialized.




  • Rates news publications as left, center or right.
  • Ratings combine:
    • blind bias surveys.
    • editorial reviews.
    • third party research.
    • community voting.
  • Originally scraped the website, but eventually got direct access.


Why AllSides?

  • Behavioral: One of the first Google results for bias APIs.
  • Convenient: Ordinal ratings [-2: very left, 2: very right].
  • Convenient: Easy format.
  • Archival: Covers 1400 publishers.




  • Deep Learning library.
  • Lots of pretrained models.
  • Easy, off the shelf word/sentence embeddings and text classification models.


Why HuggingFace?

  • Behavioral: Language Models are HOT right now.
  • Behavioral: The dataset needed more features.
  • Convenient: Literally 5 lines of Python.
  • Convenient: Testing different model performance was easy.
  • Archival: Lots of pretrained classification tasks.


Data Structures


  • Top level stories.
    • title.
    • publisher.
    • author.
  • Related discussion.
    • publisher.
    • uses 'parent' story as a source.
  • Stream of stories (changes constantly).


Data Structures


  • Per publisher.
    • name.
    • label.
    • agree/disagree vote by community.
  • Name could be semi-automatically joined to stories.


Data Structures


  • Per story title.
    • sentence embedding (n, 384).
    • sentiment classification (n, 1).
    • emotional classification (n, 1).
  • ~ 1 hour of inference time to map story titles and descriptions.


Data Collection


Data Collection

Story Scraper (simplified)

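from datetime import date, timedelta
import requests

# output_dir is a pathlib.Path defined elsewhere.
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    # Save the raw page so parsing can happen offline later.
    with open(save_as, 'w') as f:
        f.write(r.text)
    # Advance after saving so the first day is not skipped.
    cur = cur + day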

Data Collection

Bias Scraper (hard)

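from lxml import etree

# DATA_DIR is a pathlib.Path defined elsewhere.
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
# One <tr> per rated publisher in the ratings table.
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    # ... (per-row field extraction is cut off on the slide)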


Data Collection

Bias Scraper (easy)

allsides request


Data Collection

Embeddings (easy)

from transformers import AutoModel, AutoTokenizer

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    # Pad or truncate every title to 92 tokens so each batch has the same shape.
    tokens = tokenizer(chunk, add_special_tokens=True, truncation=True,
                       padding="max_length", max_length=92,
                       return_attention_mask=True, return_tensors="pt")
    outputs = model(**tokens)
    # Token-level embeddings with shape (batch, 92, hidden_size).
    embeddings = outputs.last_hidden_state.detach().numpy()


Data Collection

Classification Embeddings (medium)

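import numpy as np

# model, tokens, ids and results come from the surrounding inference loop.
outputs = model(**tokens)[0].detach().numpy()  # raw logits, one row per title
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)  # highest-scoring class per title
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})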
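
The scoring step can be checked in isolation. This is a minimal sketch, assuming made-up logits, story ids and a hypothetical `id2label` mapping standing in for `model.config.id2label`. Because the sigmoid is monotonic, the argmax over the scores matches the argmax over the raw logits.

```python
import numpy as np

# Hypothetical stand-ins for the model outputs and label map (not from the talk).
logits = np.array([[2.0, -1.0, 0.5],
                   [-0.3, 0.1, 1.7]])
id2label = {0: "negative", 1: "neutral", 2: "positive"}
ids = [101, 102]

scores = 1 / (1 + np.exp(-logits))      # element-wise sigmoid
class_ids = np.argmax(scores, axis=1)   # highest-scoring class per row

results = [
    {"story_id": story_id, "label": id2label[int(class_id)]}
    for story_id, class_id in zip(ids, class_ids)
]
```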


Data Selection


Data Selection


  • Clip to full years: drop the partial first and last years of stories.
  • Remove duplicate stories (big stories span multiple days).


Data Selection


  • Combine subdomains of stories.
    • blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
    • This could be bad. For example: opinion.wsj.com != wsj.com.
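
One way to sketch the subdomain merge is to keep only the last two labels of the host name. The helper name `publisher_domain` and the example URLs are hypothetical, and this is not necessarily how the project did it. The heuristic is naive: it also merges opinion.wsj.com into wsj.com, and it mishandles multi-part suffixes such as bbc.co.uk.

```python
from urllib.parse import urlparse


def publisher_domain(url: str) -> str:
    """Collapse a story URL to its last two host labels (a naive heuristic)."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


# Hypothetical story URLs.
urls = [
    "https://blog.washingtonpost.com/some-story",
    "https://www.washingtonpost.com/politics/other-story",
    "https://opinion.wsj.com/column",
]
domains = [publisher_domain(u) for u in urls]
```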


Data Selection


  • Select only stories with publishers whose story had been a 'parent' ('original publishers').
    • Eliminates small blogs and non-original news.
  • Eliminate publishers without links to original publishers.
    • Eliminate siloed publications.
    • Link matrix is square and fairly low-dimensional.
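
The two filters above might look like the following sketch. The `links` pairs are made up, and the reading of "links to original publishers" as "links to at least one other original publisher" is an assumption.

```python
# Hypothetical (linking publisher, parent-story publisher) pairs.
links = [
    ("smallblog.com", "nytimes.com"),
    ("nytimes.com", "wsj.com"),
    ("wsj.com", "nytimes.com"),
    ("silo.example", "silo.example"),
]

# 'Original publishers' are those whose story was a parent at least once.
originals = {parent for _, parent in links}

# Keep original publishers that link to at least one OTHER original publisher.
kept = sorted(
    p for p in originals
    if any(src == p and dst in originals and dst != p for src, dst in links)
)
```

Because rows and columns are drawn from the same `kept` list, a link matrix built over it is square.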


Data Selection


  • Keep all ratings, even ones with low agree/disagree ratio.
  • Join datasets on publisher name.
    • Not automatic (look up Named Entity Recognition).
    • Started with Jaro-Winkler similarity, then matched manually from there.
  • Use numeric values.
    • [left: -2, left-center: -1, ...]


Descriptive Stats


metric             value
total stories      299714
total related      960111
publishers         7031
authors            34346
max year           2023
min year           2005
top level domains  7063


Descriptive Stats

Stories Per Publisher

stories per publisher


Descriptive Stats

Top Publishers

top publishers


Descriptive Stats

Articles Per Year

articles per year


Descriptive Stats

Common TLDs

common tlds


Descriptive Stats

Post Process

metric             value
total stories      251553
total related      815183
publishers         223
authors            23809
max year           2022
min year           2006
top level domains  234



  1. clustering on link similarity.
  2. classification on link similarity.
  3. classification on sentence embedding.
  4. classification on sentiment analysis.
  5. regression on emotional classification over time and publication.


Experiment 1


  • Create one-hot encoding of links between publishers.
  • Cluster the encoding.
  • Expect similar publications in same cluster.
  • Use PCA to visualize clusters.

Note: Principal Component Analysis (PCA):

  • a statistical technique for reducing the dimensionality of a dataset.
  • a linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
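
As a rough illustration of that transformation, PCA can be written with a singular value decomposition in NumPy. The `pca` helper and the toy data are assumptions for this sketch, not the code behind the plots in this talk.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int = 2):
    """Project rows of X onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ Vt[:n_components].T, explained[:n_components]


# Toy data: 3-D points that vary almost entirely along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = t @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(100, 3))
projected, explained = pca(X, n_components=2)
```

Here the first component should capture nearly all of the variance, so the 3-D points can be drawn in one or two dimensions with little loss.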


Experiment 1

One Hot Encoding

publisher  nytimes  wsj  newsweek  ...
nytimes    1        1    1         ...
wsj        1        1    0         ...
newsweek   0        0    1         ...
...        ...      ...  ...       ...


Experiment 1

n-Hot Encoding

publisher  nytimes  wsj  newsweek  ...
nytimes    11       1    141       ...
wsj        1        31   0         ...
newsweek   0        0    1         ...
...        ...      ...  ...       ...


Experiment 1

Normalized n-Hot Encoding

publisher  nytimes  wsj  newsweek  ...
nytimes    0        0.4  0.2       ...
wsj        0.2      0    0.4       ...
newsweek   0.0      0.0  0.0       ...
...        ...      ...  ...       ...
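
The three encodings can be derived from a single link-count matrix. This sketch reuses the example counts from the n-hot table. Dividing each row by its total is an assumed normalization and does not reproduce the illustrative values in the normalized table above.

```python
import numpy as np

publishers = ["nytimes", "wsj", "newsweek"]

# n-hot: counts[i, j] = number of links from publisher i to publisher j.
counts = np.array([[11, 1, 141],
                   [1, 31, 0],
                   [0, 0, 1]], dtype=float)

# One-hot: 1 wherever at least one link exists.
one_hot = (counts > 0).astype(int)

# Assumed normalization: divide each row by its total so rows sum to 1.
row_totals = counts.sum(axis=1, keepdims=True)
normalized = np.divide(counts, row_totals,
                       out=np.zeros_like(counts), where=row_totals > 0)
```

The `where` guard leaves all-zero rows at zero instead of dividing by zero, which matters for publishers with no outgoing links.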


Experiment 1

Elbow criterion



The elbow method looks at the percentage of explained variance as a function of the number of clusters:

One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.

Percentage of variance explained is the ratio of the between-group variance to the total variance.
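
The ratio described above can be computed directly from a cluster assignment. This is a minimal sketch with a hypothetical `explained_variance` helper and toy points. It takes the labels as given, so any clustering algorithm could supply them.

```python
import numpy as np


def explained_variance(X: np.ndarray, labels: np.ndarray) -> float:
    """Between-group sum of squares divided by total sum of squares."""
    overall_mean = X.mean(axis=0)
    total = np.sum((X - overall_mean) ** 2)
    between = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        # Each group contributes its size times its squared distance from the overall mean.
        between += len(members) * np.sum((members.mean(axis=0) - overall_mean) ** 2)
    return between / total


# Two well-separated toy groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
one_cluster = explained_variance(X, np.array([0, 0, 0, 0]))
two_clusters = explained_variance(X, np.array([0, 0, 1, 1]))
four_clusters = explained_variance(X, np.array([0, 1, 2, 3]))
```

Going from one cluster to two explains almost all of the variance, while going from two to four adds very little. That flattening is the elbow.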


Experiment 1

Link Magnitude

link magnitude cluster


Experiment 1


link normalized cluster


Experiment 1


link onehot cluster


Experiment 1


  • Best encoding: One hot.
    • Clusters based on total links otherwise.
  • Clusters form, but there is no obvious explanation for them.
  • Limitation: need the link encoding to cluster.
    • Smaller publishers might not link very much.


Experiment 2


  • Create features:
    • Publisher frequency.
    • Reuse link encodings.
  • Create classes:
    • Join bias classifications.
  • Train classifier.



Experiment 2

Descriptive stats

metric       value
publishers   1582
labels       6
left         482
center       711
right        369
agree range  [0.0-1.0]


Experiment 2

PCA + Labels

pca vs. bias labels


Experiment 2


  • Link encodings (and their PCA) are useful.
    • Labels are (sort of) separated and clustered.
    • Creating them for smaller publishers is trivial.

Experiment 2


  • Dependent on accurate rating.
  • Ordinal ratings not available.
  • Dependent on accurate joining across datasets.
  • Entire publication is rated, not authors.
  • Don't know what to do with community rating.


Experiment 3




  • Many different authors under the same publisher.
  • Publishers use syndication.
  • Bias ratings are biased.





[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
