v1.0 of presentation.
@ -2,3 +2,8 @@
*.swp
__pycache__
tmp.py
.env
*.aux
*.log
*.out
tmp.*
@ -0,0 +1,11 @@
.PHONY:to_wwu

all: to_wwu

to_wwu:
	rsync -avz ~/577/repo/docs/figures/ linux-04:/home/jensen33/Dev/studentweb/assets/static/577/
	scp ~/577/repo/docs/presentation.md linux-04:/home/jensen33/Dev/studentweb/content/577/contents.lr
	scp ~/Dev/www.publicmatt.com/models/slides.ini linux-04:/home/jensen33/Dev/studentweb/models/
	scp ~/Dev/www.publicmatt.com/templates/slides.html linux-04:/home/jensen33/Dev/studentweb/templates/
	rsync -avz ~/Dev/www.publicmatt.com/assets/static/revealjs linux-04:/home/jensen33/Dev/studentweb/assets/static/
	ssh linux-04 cd /home/jensen33/Dev/studentweb \; make
@ -0,0 +1,3 @@
paper.pdf: paper.tex
	pdflatex $<
	evince $@
(11 binary image files added, 20-61 KiB each)
@ -0,0 +1,61 @@
\documentclass{article}
\usepackage{multicol}
\usepackage{hyperref}
\title{Data Mining CS 571}
\author{Matt Jensen}
\date{2023-04-25}

\begin{document}
\maketitle

\section*{Abstract}

News organizations have been repeatedly accused of being partisan.
Additionally, they have been accused of polarizing discussion to drive up revenue and engagement.
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
A secondary goal is to investigate whether news organizations have been uniformly polarized, or if one pole has been 'moving' more rapidly away from the 'middle'.
This analysis will probe to what degree the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
Noam Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of an agenda rather than its cause.

\begin{multicols}{2}

\section{Data Preparation}

The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://www.memeorandum.com}{Memeorandum} for news stories from 2006 to 2022.
Each news article has a title, author, description, publisher, publish date, url and related discussions.
The site also has a concept of references, where a main, popular story may be covered by other sources.
This link association might be used to support one or more of the hypotheses of the main analysis.
After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
What remains after these cleaning steps is approximately 6,400 days of material: 300,000 distinct headlines from 21,000 publishers and 34,000 authors used in the study.

\section{Missing Data Policy}

The largest data-quality issue that will have to be dealt with is news organizations that share the same parent company but have slightly different names.
The Wall Street Journal's news section, for example, is drastically different from its opinion section.
Other organizations appear under slightly different names for the same outlet, a product of the aggregation service rather than any real difference.
Luckily, most of the analysis operates on the content of the news headlines, which do not suffer from this data impurity.

\section{Classification Task}

The classification of news titles into emotional categories was accomplished by using a pretrained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
The class for each article will be derived by tokenizing the title, running the model over the tokens, and taking the largest-probability class from the output.
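
As a minimal sketch of that step (assuming the EmoRoBERTa checkpoint loads through the standard HuggingFace \texttt{transformers} classification API; \texttt{title} stands for a single headline string):

\begin{verbatim}
from transformers import (AutoTokenizer,
    AutoModelForSequenceClassification)

name = "arpanghoshal/EmoRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

tokens = tokenizer(title, return_tensors="pt", truncation=True)
logits = model(**tokens).logits
label = model.config.id2label[int(logits.argmax(dim=1))]
\end{verbatim}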

The data has been discretized into years.
Additionally, the publishers will be discretized based on either principal component analysis of link similarity or the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{All Sides}.
Given that the features of the dataset are sparse, it is not expected to have any useless attributes, unless the original hypothesis of a temporal trend proves to be false.
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.

\section{Experiments}

No computational experiments have been done yet.
Generating the tokenized text, the word embeddings and the emotional sentiment analysis has made up the bulk of the work thus far.
The bias ratings do not cover all publishers in the dataset, so the number of articles whose publisher lacks a bias rating will have to be calculated.
If more than 30\% of the articles lack a rating, it might not make sense to use the bias ratings.
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationships between related publishers.

\section{Results}

\textbf{TODO.}

\end{multicols}

\end{document}
@ -0,0 +1,552 @@
|
||||||
|
_model: slides
|
||||||
|
---
|
||||||
|
|
||||||
|
title: CSCI 577 - Data Mining
|
||||||
|
|
||||||
|
---
|
||||||
|
body:
|
||||||
|
|
||||||
|
# Political Polarization
|
||||||
|
|
||||||
|
Matt Jensen
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Hypothesis
|
||||||
|
|
||||||
|
Political polarization is rising, and news articles are a proxy measure.
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Is this reasonable?
|
||||||
|
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Why is polarization rising?
|
||||||
|
|
||||||
|
Not my job, but there's research<sup>[ref](#references)</sup> to support it
|
||||||
|
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Sub-hypothesis
|
||||||
|
|
||||||
|
- The polarization increases near elections. <!-- .element: class="fragment" -->
|
||||||
|
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
|
||||||
|
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Sub-sub-hypothesis
|
||||||
|
|
||||||
|
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
|
||||||
|
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
|
||||||
|
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Data Source(s)
|
||||||
|
|
||||||
|
memeorandum.com <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
allsides.com <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
huggingface.com <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# memeorandum.com
|
||||||
|
|
||||||
|
- News aggregation site. <!-- .element: class="fragment" -->
|
||||||
|
- Was really famous before Google News. <!-- .element: class="fragment" -->
|
||||||
|
- Still aggregates sites today. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Why Memeorandum?
|
||||||
|
|
||||||
|
- Behavioral: sometimes I only read the titles (doom scrolling). <!-- .element class="fragment" -->
|
||||||
|
- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element class="fragment" -->
|
||||||
|
- Convenient: most publishers block bots. <!-- .element class="fragment" -->
|
||||||
|
- Convenient: dead simple html to parse. <!-- .element class="fragment" -->
|
||||||
|
- Archival: all headlines from 2006 forward. <!-- .element class="fragment" -->
|
||||||
|
- Archival: automated, not editorialized. <!-- .element class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# AllSides.com
|
||||||
|
|
||||||
|
- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
|
||||||
|
- Ratings combine: <!-- .element: class="fragment" -->
|
||||||
|
- blind bias surveys.
|
||||||
|
- editorial reviews.
|
||||||
|
- third party research.
|
||||||
|
- community voting.
|
||||||
|
- Originally scraped the website, but eventually got direct access to the data. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Why AllSides?
|
||||||
|
|
||||||
|
- Behavioral: One of the first google results on bias apis. <!-- .element class="fragment" -->
|
||||||
|
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element class="fragment" -->
|
||||||
|
- Convenient: Easy format. <!-- .element class="fragment" -->
|
||||||
|
- Archival: Covers 1400 publishers. <!-- .element class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# HuggingFace.com
|
||||||
|
|
||||||
|
- Deep Learning library. <!-- .element: class="fragment" -->
|
||||||
|
- Lots of pretrained models. <!-- .element: class="fragment" -->
|
||||||
|
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Why HuggingFace?
|
||||||
|
|
||||||
|
- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
|
||||||
|
- Behavioral: The dataset needed more features.<!-- .element: class="fragment" -->
|
||||||
|
- Convenient: Literally 5 lines of python (see sketch below).<!-- .element: class="fragment" -->
|
||||||
|
- Convenient: Testing different model performance was easy.<!-- .element: class="fragment" -->
|
||||||
|
- Archival: Lots of pretrained classification tasks.<!-- .element: class="fragment" -->
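
A minimal sketch of what those few lines can look like (the model name here is only an example, not necessarily one used in this project):

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Senate passes budget bill after heated debate"))
# -> a list with a label (e.g. POSITIVE/NEGATIVE) and a confidence score
```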
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Data Structures
|
||||||
|
Stories
|
||||||
|
|
||||||
|
- Top level stories. <!-- .element: class="fragment" -->
|
||||||
|
- title.
|
||||||
|
- publisher.
|
||||||
|
- author.
|
||||||
|
- Related discussion. <!-- .element: class="fragment" -->
|
||||||
|
- publisher.
|
||||||
|
- uses 'parent' story as a source.
|
||||||
|
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Structures
|
||||||
|
Bias
|
||||||
|
|
||||||
|
- Per publisher. <!-- .element: class="fragment" -->
|
||||||
|
- name.
|
||||||
|
- label.
|
||||||
|
- agree/disagree vote by community.
|
||||||
|
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Structures
|
||||||
|
Embeddings
|
||||||
|
|
||||||
|
- Per story title. <!-- .element: class="fragment" -->
|
||||||
|
- sentence embedding (n, 384) (sketch below).
|
||||||
|
- sentiment classification (n, 1).
|
||||||
|
- emotional classification (n, 1).
|
||||||
|
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
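
A minimal sketch of how (n, 384) sentence embeddings can be produced; the specific encoder below is an assumption (any 384-dimensional sentence encoder matches the shape above):

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # returns 384-dim vectors
titles = ["Senate passes budget bill", "Markets rally after jobs report"]
embeddings = encoder.encode(titles)                # shape: (2, 384)
```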
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Data Collection
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Collection
|
||||||
|
|
||||||
|
Story Scraper (simplified)
|
||||||
|
|
||||||
|
```python
from datetime import date, timedelta
import requests

# output_dir is a pathlib.Path set elsewhere in the real script
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    cur = cur + day
    # one archived front page per day, e.g. .../230425/h2000
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
```
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Collection
|
||||||
|
Bias Scraper (hard)
|
||||||
|
|
||||||
|
```python
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')

ratings = []
for row in rows:
    rating = dict()
    ...
```
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Collection
|
||||||
|
Bias Scraper (easy)
|
||||||
|
|
||||||
|
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Collection
|
||||||
|
Embeddings (easy)
|
||||||
|
|
||||||
|
```python
from transformers import AutoTokenizer, AutoModel

# table = ...  (an iterable of batches of title strings)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

for chunk in table:
    tokens = tokenizer(chunk, add_special_tokens=True, truncation=True,
                       padding="max_length", max_length=92,
                       return_attention_mask=True, return_tensors="pt")
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Collection
|
||||||
|
Classification Embeddings (medium)
|
||||||
|
|
||||||
|
```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Data Selection
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Selection
|
||||||
|
Stories
|
||||||
|
|
||||||
|
- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
|
||||||
|
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
==
|
||||||
|
# Data Selection
|
||||||
|
|
||||||
|
Publishers
|
||||||
|
|
||||||
|
- Combine subdomains of stories. <!-- .element: class="fragment" -->
|
||||||
|
- blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
|
||||||
|
- This could be bad. For example: opinion.wsj.com != wsj.com.
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Selection
|
||||||
|
|
||||||
|
Links
|
||||||
|
|
||||||
|
- Select only stories from publishers that have had at least one 'parent' story ('original publishers'). <!-- .element: class="fragment" -->
|
||||||
|
- Eliminates small blogs and non-original news.
|
||||||
|
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
|
||||||
|
- Eliminate silo'ed publications.
|
||||||
|
- Link matrix is square and low'ish dimensional.
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Data Selection
|
||||||
|
|
||||||
|
Bias
|
||||||
|
|
||||||
|
- Keep all ratings, even ones with low agree/disagree ratio.
|
||||||
|
- Join datasets on publisher name.
|
||||||
|
- Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
|
||||||
|
- Started with 'Jaro-Winkler similarity', then matched manually from there (sketch below).
|
||||||
|
- Use numeric values
|
||||||
|
- [left: -2, left-center: -1, ...]
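
A rough sketch of the fuzzy-name join (DuckDB is what the project uses elsewhere; the 0.9 threshold and column names here are illustrative):

```python
from data.main import connect  # project helper that opens the DuckDB database

DB = connect()
candidates = DB.query("""
    SELECT p.name, b.publisher,
           jaro_winkler_similarity(lower(p.name), lower(b.publisher)) AS similarity
    FROM publishers p
    JOIN bias_ratings b
      ON jaro_winkler_similarity(lower(p.name), lower(b.publisher)) > 0.9
    ORDER BY similarity DESC
""").df()
DB.close()
```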
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Descriptive Stats
|
||||||
|
|
||||||
|
Raw
|
||||||
|
|
||||||
|
| metric | value |
|
||||||
|
|:------------------|--------:|
|
||||||
|
| total stories | 299714 |
|
||||||
|
| total related | 960111 |
|
||||||
|
| publishers | 7031 |
|
||||||
|
| authors | 34346 |
|
||||||
|
| max year | 2023 |
|
||||||
|
| min year | 2005 |
|
||||||
|
| top level domains | 7063 |
|
||||||
|
|
||||||
|
==
|
||||||
|
# Descriptive Stats
|
||||||
|
|
||||||
|
Stories Per Publisher
|
||||||
|
|
||||||
|
![stories per publisher](/static/577/stories_per_publisher.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Descriptive Stats
|
||||||
|
|
||||||
|
Top Publishers
|
||||||
|
|
||||||
|
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Descriptive Stats
|
||||||
|
|
||||||
|
Articles Per Year
|
||||||
|
|
||||||
|
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Descriptive Stats
|
||||||
|
|
||||||
|
Common TLDs
|
||||||
|
|
||||||
|
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Descriptive Stats
|
||||||
|
|
||||||
|
Post Process
|
||||||
|
|
||||||
|
| key | value |
|
||||||
|
|:------------------|--------:|
|
||||||
|
| total stories | 251553 |
|
||||||
|
| total related | 815183 |
|
||||||
|
| publishers | 223 |
|
||||||
|
| authors | 23809 |
|
||||||
|
| max year | 2022 |
|
||||||
|
| min year | 2006 |
|
||||||
|
| top level domains | 234 |
|
||||||
|
|
||||||
|
===
|
||||||
|
# Experiments
|
||||||
|
|
||||||
|
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
|
||||||
|
2. **classification** on link similarity. <!-- .element: class="fragment" -->
|
||||||
|
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
|
||||||
|
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
|
||||||
|
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Setup
|
||||||
|
|
||||||
|
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
|
||||||
|
- Cluster the encoding. <!-- .element: class="fragment" -->
|
||||||
|
- Expect similar publications in the same cluster. <!-- .element: class="fragment" -->
|
||||||
|
- Use PCA to visualize clusters (sketch below). <!-- .element: class="fragment" -->
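
A condensed sketch of that pipeline (assuming `edges` is the publisher-to-publisher link table that `links.py` builds, with a 0/1 `onehot` column):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# rows = linking publisher, columns = linked publisher
pivot = edges.pivot(index='parent_id', columns='child_id', values='onehot').fillna(0)

kmeans = KMeans(n_clusters=8, n_init="auto")
labels = kmeans.fit_predict(pivot)      # cluster assignment per publisher

pca = PCA(n_components=2)
coords = pca.fit_transform(pivot)       # 2-D coordinates for plotting, colored by cluster
```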
|
||||||
|
|
||||||
|
Note:
|
||||||
|
Principal Component Analysis:
|
||||||
|
- a statistical technique for reducing the dimensionality of a dataset.
|
||||||
|
- a linear transformation into a new coordinate system where most of the variation in the data can be described with fewer dimensions than the initial data.
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
One Hot Encoding
|
||||||
|
|
||||||
|
| publisher | nytimes| wsj| newsweek| ...|
|
||||||
|
|:----------|--------:|----:|--------:|----:|
|
||||||
|
| nytimes | 1| 1| 1| ...|
|
||||||
|
| wsj | 1| 1| 0| ...|
|
||||||
|
| newsweek | 0| 0| 1| ...|
|
||||||
|
| ... | ...| ...| ...| ...|
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
n-Hot Encoding
|
||||||
|
|
||||||
|
| publisher | nytimes| wsj| newsweek| ...|
|
||||||
|
|:----------|--------:|----:|--------:|----:|
|
||||||
|
| nytimes | 11| 1| 141| ...|
|
||||||
|
| wsj | 1| 31| 0| ...|
|
||||||
|
| newsweek | 0| 0| 1| ...|
|
||||||
|
| ... | ...| ...| ...| ...|
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Normalized n-Hot Encoding
|
||||||
|
|
||||||
|
| publisher | nytimes| wsj| newsweek| ...|
|
||||||
|
|:----------|--------:|----:|--------:|----:|
|
||||||
|
| nytimes | 0| 0.4| 0.2| ...|
|
||||||
|
| wsj | 0.2| 0| 0.4| ...|
|
||||||
|
| newsweek | 0.0| 0.0| 0.0| ...|
|
||||||
|
| ... | ...| ...| ...| ...|
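
A minimal sketch of the row-normalization step behind this table (the raw counts below are the ones from the n-hot table above; the normalized values shown in the slide are themselves illustrative):

```python
import pandas as pd

# raw link counts: rows = linking publisher, columns = linked publisher
links = pd.DataFrame(
    {"nytimes": [11, 1, 0], "wsj": [1, 31, 0], "newsweek": [141, 0, 1]},
    index=["nytimes", "wsj", "newsweek"],
)
normalized = links.div(links.sum(axis=1), axis=0)  # each row now sums to 1
```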
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Elbow criterion
|
||||||
|
|
||||||
|
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
|
||||||
|
|
||||||
|
Note:
|
||||||
|
|
||||||
|
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
|
||||||
|
|
||||||
|
One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
|
||||||
|
|
||||||
|
Percentage of variance explained is the ratio of the between-group variance to the total variance.
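
In practice this can be a short loop over candidate k values; a common variant tracks the within-cluster sum of squares (inertia) rather than explained variance directly (`pivot` is the encoded link matrix from the clustering step):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 20)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init="auto").fit(pivot)
    inertias.append(km.inertia_)       # within-cluster sum of squares

plt.plot(ks, inertias, marker="o")     # pick k where the curve stops dropping sharply
plt.xlabel("number of clusters")
plt.ylabel("inertia")
plt.show()
```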
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Link Magnitude
|
||||||
|
|
||||||
|
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Normalized
|
||||||
|
|
||||||
|
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Onehot
|
||||||
|
|
||||||
|
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 1
|
||||||
|
|
||||||
|
Discussion
|
||||||
|
|
||||||
|
- Best encoding: One hot. <!-- .element: class="fragment" -->
|
||||||
|
- Clusters based on total links otherwise.
|
||||||
|
- Clusters, but no explanation.
|
||||||
|
- Limitation: need the link encoding to cluster.
|
||||||
|
- Smaller publishers might not link very much.
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Experiment 2
|
||||||
|
|
||||||
|
Setup
|
||||||
|
|
||||||
|
- Create features: <!-- .element: class="fragment" -->
|
||||||
|
- Publisher frequency.
|
||||||
|
- Reuse link encodings.
|
||||||
|
- Create classes: <!-- .element: class="fragment" -->
|
||||||
|
- Join bias classifications.
|
||||||
|
- Train a classifier (sketch below). <!-- .element: class="fragment" -->
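
A minimal sketch of that setup (the classifier choice is an assumption; `data` is assumed to hold one row per publisher with the PCA link features and the joined AllSides label):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = data[["first", "second"]]   # PCA components of the link encoding
y = data["bias"]                # left / center / right labels from AllSides
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
```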
|
||||||
|
|
||||||
|
Note:
|
||||||
|
|
||||||
|
==
|
||||||
|
# Experiment 2
|
||||||
|
Descriptive stats
|
||||||
|
|
||||||
|
| metric | value |
|
||||||
|
|:------------|:----------|
|
||||||
|
| publishers | 1582 |
|
||||||
|
| labels | 6 |
|
||||||
|
| left | 482 |
|
||||||
|
| center | 711 |
|
||||||
|
| right | 369 |
|
||||||
|
| agree range | [0.0-1.0] |
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 2
|
||||||
|
|
||||||
|
PCA + Labels
|
||||||
|
|
||||||
|
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 2
|
||||||
|
|
||||||
|
Discussion
|
||||||
|
|
||||||
|
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
|
||||||
|
- Labels are (sort of) separated and clustered.
|
||||||
|
- Creating them for smaller publishers is trivial.
|
||||||
|
==
|
||||||
|
|
||||||
|
# Experiment 2
|
||||||
|
|
||||||
|
Limitations
|
||||||
|
|
||||||
|
- Dependent on accurate rating. <!-- .element: class="fragment" -->
|
||||||
|
- Ordinal ratings not available. <!-- .element: class="fragment" -->
|
||||||
|
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
|
||||||
|
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
|
||||||
|
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Experiment 3
|
||||||
|
|
||||||
|
Setup
|
||||||
|
|
||||||
|
==
|
||||||
|
|
||||||
|
# Limitations
|
||||||
|
|
||||||
|
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
|
||||||
|
- Publishers use syndication. <!-- .element: class="fragment" -->
|
||||||
|
- Bias ratings are biased. <!-- .element: class="fragment" -->
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
# Questions
|
||||||
|
|
||||||
|
===
|
||||||
|
|
||||||
|
<!-- .section: id="references" -->
|
||||||
|
|
||||||
|
# References
|
||||||
|
|
||||||
|
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
|
||||||
|
|
||||||
|
Note:
|
218
src/bias.py
|
@ -1,12 +1,12 @@
|
||||||
import click
|
import click
|
||||||
from data import connect
|
from data.main import connect
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
from lxml import etree
|
from lxml import etree
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import os
|
import os
|
||||||
import csv
|
import csv
|
||||||
|
|
||||||
def map(rating:str) -> int:
|
def label_to_int(rating:str) -> int:
|
||||||
|
|
||||||
mapping = {
|
mapping = {
|
||||||
'left' : 0,
|
'left' : 0,
|
||||||
|
@ -19,20 +19,18 @@ def map(rating:str) -> int:
|
||||||
|
|
||||||
return mapping[rating]
|
return mapping[rating]
|
||||||
|
|
||||||
|
def int_to_label(class_id: int) -> str:
|
||||||
|
mapping = {
|
||||||
|
0 : 'left',
|
||||||
|
1 : 'left-center',
|
||||||
|
2 : 'center',
|
||||||
|
3 : 'right-center',
|
||||||
|
4 : 'right',
|
||||||
|
-1 : 'allsides',
|
||||||
|
}
|
||||||
|
return mapping[class_id]
|
||||||
|
|
||||||
@click.command(name="bias:load")
|
|
||||||
def load() -> None:
|
|
||||||
DB = connect()
|
|
||||||
DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
|
|
||||||
f = str(DATA_DIR / "bias_ratings.csv")
|
|
||||||
|
|
||||||
DB.sql(f"""
|
|
||||||
create table bias_ratings as
|
|
||||||
select
|
|
||||||
row_number() over(order by b.publisher) as id
|
|
||||||
,b.*
|
|
||||||
from read_csv_auto('{f}') b
|
|
||||||
""")
|
|
||||||
@click.command(name="bias:normalize")
|
@click.command(name="bias:normalize")
|
||||||
def normalize() -> None:
|
def normalize() -> None:
|
||||||
DB = connect()
|
DB = connect()
|
||||||
|
@ -41,133 +39,48 @@ def normalize() -> None:
|
||||||
CREATE OR REPLACE TABLE publisher_bias AS
|
CREATE OR REPLACE TABLE publisher_bias AS
|
||||||
WITH cte AS (
|
WITH cte AS (
|
||||||
SELECT
|
SELECT
|
||||||
p.id
|
p.id as publisher_id
|
||||||
|
,b.id as bias_id
|
||||||
,b.bias as label
|
,b.bias as label
|
||||||
,JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) as similarity
|
,JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) as similarity
|
||||||
FROM bias_ratings b
|
FROM bias_ratings b
|
||||||
JOIN publishers p
|
JOIN top.publishers p
|
||||||
ON JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) > 0.95
|
ON JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) > 0.95
|
||||||
),ranked AS (
|
),ranked AS (
|
||||||
SELECT
|
SELECT
|
||||||
id
|
publisher_id
|
||||||
|
,bias_id
|
||||||
,label
|
,label
|
||||||
,similarity
|
,similarity
|
||||||
,ROW_NUMBER() OVER(PARTITION BY id ORDER BY similarity DESC) AS rn
|
,ROW_NUMBER() OVER(PARTITION BY publisher_id ORDER BY similarity DESC) AS rn
|
||||||
FROM cte
|
FROM cte
|
||||||
)
|
)
|
||||||
SELECT
|
SELECT
|
||||||
id
|
publisher_id
|
||||||
,label
|
,label
|
||||||
|
,bias_id
|
||||||
FROM ranked
|
FROM ranked
|
||||||
WHERE ranked.rn = 1
|
WHERE ranked.rn = 1
|
||||||
""")
|
""")
|
||||||
|
|
||||||
|
mapping = [
|
||||||
|
{'label' :'left' , 'ordinal': -2},
|
||||||
|
{'label' :'left-center' , 'ordinal': -1},
|
||||||
|
{'label' :'center' , 'ordinal': 0},
|
||||||
|
{'label' :'right-center' , 'ordinal': 1},
|
||||||
|
{'label' :'right' , 'ordinal': 2},
|
||||||
|
]
|
||||||
|
mapping = pd.DataFrame(mapping)
|
||||||
|
|
||||||
DB.sql("""
|
DB.query("alter table bias_ratings add column ordinal int")
|
||||||
with cte as (
|
|
||||||
select
|
DB.query("""
|
||||||
s.publisher_id
|
update bias_ratings b
|
||||||
,count(1) as stories
|
set ordinal = o.ordinal
|
||||||
from stories s
|
FROM mapping o
|
||||||
group by s.publisher_id
|
WHERE o.label = b.bias
|
||||||
)
|
|
||||||
select
|
|
||||||
s.publisher
|
|
||||||
,s.stories
|
|
||||||
,b.publisher
|
|
||||||
,b.bias
|
|
||||||
from bias_ratings b
|
|
||||||
join cte s
|
|
||||||
on s.publisher = b.publisher
|
|
||||||
order by
|
|
||||||
stories desc
|
|
||||||
limit 15
|
|
||||||
""")
|
""")
|
||||||
|
|
||||||
DB.sql("""
|
|
||||||
with cte as (
|
|
||||||
select
|
|
||||||
s.publisher
|
|
||||||
,count(1) as stories
|
|
||||||
from stories s
|
|
||||||
group by s.publisher
|
|
||||||
)
|
|
||||||
select
|
|
||||||
sum(stories)
|
|
||||||
,avg(agree / disagree)
|
|
||||||
from bias_ratings b
|
|
||||||
join cte s
|
|
||||||
on s.publisher = b.publisher
|
|
||||||
""")
|
|
||||||
|
|
||||||
DB.sql("""
|
|
||||||
with cte as (
|
|
||||||
select
|
|
||||||
s.publisher
|
|
||||||
,count(1) as stories
|
|
||||||
from stories s
|
|
||||||
group by s.publisher
|
|
||||||
)
|
|
||||||
select
|
|
||||||
sum(s.stories) filter(where b.publisher is not null) as matched
|
|
||||||
,sum(s.stories) filter(where b.publisher is null) as unmatched
|
|
||||||
,cast(sum(s.stories) filter(where b.publisher is not null) as numeric)
|
|
||||||
/ sum(s.stories) filter(where b.publisher is null) as percent_matched
|
|
||||||
from bias_ratings b
|
|
||||||
right join cte s
|
|
||||||
on s.publisher = b.publisher
|
|
||||||
""")
|
|
||||||
|
|
||||||
DB.sql("""
|
|
||||||
select
|
|
||||||
*
|
|
||||||
from bias_ratings
|
|
||||||
where publisher ilike '%CNN%'
|
|
||||||
""")
|
|
||||||
|
|
||||||
@click.command(name='bias:debug')
|
|
||||||
def debug() -> None:
|
|
||||||
DB = connect()
|
|
||||||
DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
|
|
||||||
f = str(DATA_DIR / "bias_ratings.csv")
|
|
||||||
|
|
||||||
DB.sql("""
|
|
||||||
with cte as (
|
|
||||||
select
|
|
||||||
outlet
|
|
||||||
,count(1) as stories
|
|
||||||
from stories
|
|
||||||
group by outlet
|
|
||||||
)
|
|
||||||
,total as (
|
|
||||||
select
|
|
||||||
sum(stories) as total
|
|
||||||
from cte
|
|
||||||
)
|
|
||||||
select
|
|
||||||
cte.outlet
|
|
||||||
,cte.stories
|
|
||||||
,bias.outlet
|
|
||||||
,bias.lean
|
|
||||||
,sum(100 * (cte.stories / cast(total.total as float))) over() as rep
|
|
||||||
,total.total
|
|
||||||
from cte
|
|
||||||
join bias
|
|
||||||
on jaro_winkler_similarity(bias.outlet, cte.outlet) > 0.9
|
|
||||||
cross join total.total
|
|
||||||
""")
|
|
||||||
|
|
||||||
DB.sql("""
|
|
||||||
select
|
|
||||||
outlet
|
|
||||||
,count(1) as stories
|
|
||||||
from stories
|
|
||||||
group by outlet
|
|
||||||
order by count(1) desc
|
|
||||||
limit 50
|
|
||||||
""")
|
|
||||||
|
|
||||||
outlets
|
|
||||||
|
|
||||||
@click.command(name='bias:parse')
|
@click.command(name='bias:parse')
|
||||||
def parse() -> None:
|
def parse() -> None:
|
||||||
|
@ -199,3 +112,64 @@ def parse() -> None:
|
||||||
ratings.append(rating)
|
ratings.append(rating)
|
||||||
df = pd.DataFrame(ratings)
|
df = pd.DataFrame(ratings)
|
||||||
df.to_csv(DATA_DIR / 'bias_ratings.csv', sep="|", index=False, quoting=csv.QUOTE_NONNUMERIC)
|
df.to_csv(DATA_DIR / 'bias_ratings.csv', sep="|", index=False, quoting=csv.QUOTE_NONNUMERIC)
|
||||||
|
|
||||||
|
@click.command(name="bias:load")
|
||||||
|
def load() -> None:
|
||||||
|
DB = connect()
|
||||||
|
DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
|
||||||
|
f = str(DATA_DIR / "bias_ratings.csv")
|
||||||
|
|
||||||
|
DB.sql(f"""
|
||||||
|
CREATE TABLE bias_ratings as
|
||||||
|
select
|
||||||
|
row_number() over(order by b.publisher) as id
|
||||||
|
,b.*
|
||||||
|
from read_csv_auto('{f}') b
|
||||||
|
""")
|
||||||
|
|
||||||
|
@click.command('bias:export')
|
||||||
|
def export():
|
||||||
|
data_path = Path(os.environ['DATA_MINING_DATA_DIR'])
|
||||||
|
|
||||||
|
DB = connect()
|
||||||
|
all_bias = DB.query("""
|
||||||
|
SELECT
|
||||||
|
id as bias_id
|
||||||
|
,publisher as name
|
||||||
|
,bias as label
|
||||||
|
FROM bias_ratings
|
||||||
|
ORDER by agree desc
|
||||||
|
""")
|
||||||
|
all_bias.df().to_csv(data_path / 'TMP_publisher_bias.csv', sep="|", index=False)
|
||||||
|
mapped_bias = DB.query("""
|
||||||
|
SELECT
|
||||||
|
p.id as publisher_id
|
||||||
|
,p.name as name
|
||||||
|
,p.tld as tld
|
||||||
|
,b.label as bias
|
||||||
|
,b.bias_id as bias_id
|
||||||
|
FROM top.publishers p
|
||||||
|
LEFT JOIN publisher_bias b
|
||||||
|
ON b.publisher_id = p.id
|
||||||
|
""")
|
||||||
|
mapped_bias.df().to_csv(data_path / 'TMP_publisher_bias_to_load.csv', sep="|", index=False)
|
||||||
|
DB.close()
|
||||||
|
|
||||||
|
@click.command('bias:import-mapped')
|
||||||
|
def import_mapped():
|
||||||
|
data_path = Path(os.environ['DATA_MINING_DATA_DIR'])
|
||||||
|
table_name = "top.publisher_bias"
|
||||||
|
|
||||||
|
DB = connect()
|
||||||
|
df = pd.read_csv(data_path / 'TMP_publisher_bias_to_load.csv', sep="|")
|
||||||
|
|
||||||
|
DB.query(f"""
|
||||||
|
CREATE OR REPLACE TABLE {table_name} AS
|
||||||
|
SELECT
|
||||||
|
publisher_id AS publisher_id
|
||||||
|
,cast(bias_id AS int) as bias_id
|
||||||
|
FROM df
|
||||||
|
WHERE bias_id IS NOT NULL
|
||||||
|
""")
|
||||||
|
print(f"created table: {table_name}")
|
||||||
|
|
||||||
|
|
24
src/cli.py
|
@ -7,7 +7,7 @@ def cli():
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
load_dotenv()
|
load_dotenv()
|
||||||
import scrape
|
from data import scrape
|
||||||
cli.add_command(scrape.download)
|
cli.add_command(scrape.download)
|
||||||
cli.add_command(scrape.parse)
|
cli.add_command(scrape.parse)
|
||||||
cli.add_command(scrape.load)
|
cli.add_command(scrape.load)
|
||||||
|
@ -32,4 +32,26 @@ if __name__ == "__main__":
|
||||||
cli.add_command(emotion.create_table)
|
cli.add_command(emotion.create_table)
|
||||||
import sentence
|
import sentence
|
||||||
cli.add_command(sentence.embed)
|
cli.add_command(sentence.embed)
|
||||||
|
from train import main as train_main
|
||||||
|
cli.add_command(train_main.main)
|
||||||
|
|
||||||
|
import plots.descriptive as plotd
|
||||||
|
cli.add_command(plotd.articles_per_year)
|
||||||
|
cli.add_command(plotd.distinct_publishers)
|
||||||
|
cli.add_command(plotd.stories_per_publisher)
|
||||||
|
cli.add_command(plotd.top_publishers)
|
||||||
|
cli.add_command(plotd.common_tld)
|
||||||
|
|
||||||
|
import links as linkcli
|
||||||
|
cli.add_command(linkcli.create_table)
|
||||||
|
cli.add_command(linkcli.create_pca)
|
||||||
|
cli.add_command(linkcli.create_clusters)
|
||||||
|
|
||||||
|
import plots.links as plotl
|
||||||
|
cli.add_command(plotl.elbow)
|
||||||
|
cli.add_command(plotl.link_pca_clusters)
|
||||||
|
|
||||||
|
import plots.classifier as plotc
|
||||||
|
cli.add_command(plotc.pca_with_classes)
|
||||||
|
|
||||||
cli()
|
cli()
|
||||||
|
|
|
@ -0,0 +1,6 @@
|
||||||
|
import data.main
|
||||||
|
import data.scrape
|
||||||
|
__all__ = [
|
||||||
|
'main'
|
||||||
|
,'scrape'
|
||||||
|
]
|
|
@ -4,10 +4,12 @@ import requests
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import click
|
import click
|
||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
from data import data_dir, connect
|
from data.main import data_dir, connect
|
||||||
from lxml import etree
|
from lxml import etree
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
from urllib.parse import urlparse
|
from urllib.parse import urlparse
|
||||||
|
from tld import get_tld
|
||||||
|
from tld.utils import update_tld_names
|
||||||
|
|
||||||
@click.command(name='scrape:load')
|
@click.command(name='scrape:load')
|
||||||
@click.option('--directory', type=Path, default=data_dir(), show_default=True)
|
@click.option('--directory', type=Path, default=data_dir(), show_default=True)
|
||||||
|
@ -61,6 +63,7 @@ def download(output_dir):
|
||||||
@click.option('-o', '--output_dir', type=Path, default=data_dir(), show_default=True)
|
@click.option('-o', '--output_dir', type=Path, default=data_dir(), show_default=True)
|
||||||
def parse(directory, output_dir):
|
def parse(directory, output_dir):
|
||||||
"""parse the html files on disk into a structured csv format."""
|
"""parse the html files on disk into a structured csv format."""
|
||||||
|
update_tld_names()
|
||||||
directory = data_dir() / "memeorandum"
|
directory = data_dir() / "memeorandum"
|
||||||
parser = etree.HTMLParser()
|
parser = etree.HTMLParser()
|
||||||
pages = [f for f in directory.glob("*.html")]
|
pages = [f for f in directory.glob("*.html")]
|
||||||
|
@ -104,8 +107,7 @@ def parse(directory, output_dir):
|
||||||
|
|
||||||
url = item.xpath('.//strong/a')[0].get('href')
|
url = item.xpath('.//strong/a')[0].get('href')
|
||||||
out['url'] = url
|
out['url'] = url
|
||||||
out['publisher_url_domain'] = urlparse(publisher_url).netloc
|
out['tld'] = get_tld(publisher_url)
|
||||||
out['domain'] = urlparse(url).netloc
|
|
||||||
|
|
||||||
item_id = hash((page.stem, url))
|
item_id = hash((page.stem, url))
|
||||||
out['id'] = item_id
|
out['id'] = item_id
|
||||||
|
@ -225,3 +227,111 @@ def normalize():
|
||||||
alter table related_stories drop publisher_domain;
|
alter table related_stories drop publisher_domain;
|
||||||
""")
|
""")
|
||||||
|
|
||||||
|
|
||||||
|
def another_norm():
|
||||||
|
sv2 = pd.read_csv(data_dir / 'stories.csv', sep="|")
|
||||||
|
related = pd.read_csv(data_dir / 'related.csv', sep="|")
|
||||||
|
|
||||||
|
related['tld'] = related.url.apply(lambda x: map_tld(x))
|
||||||
|
|
||||||
|
DB.query("""
|
||||||
|
update related_stories
|
||||||
|
set publisher_id = p.id
|
||||||
|
from publishers p
|
||||||
|
join related r
|
||||||
|
on r.tld = p.tld
|
||||||
|
where r.url = related_stories.url
|
||||||
|
""")
|
||||||
|
|
||||||
|
|
||||||
|
DB.query("""alter table stories add column tld text""")
|
||||||
|
|
||||||
|
s_url = DB.query("""
|
||||||
|
select
|
||||||
|
id
|
||||||
|
,url
|
||||||
|
from stories
|
||||||
|
""").df()
|
||||||
|
|
||||||
|
|
||||||
|
s_url['tld'] = s_url.url.apply(lambda x: map_tld(x))
|
||||||
|
|
||||||
|
DB.query("""
|
||||||
|
update stories
|
||||||
|
set tld = s_url.tld
|
||||||
|
from s_url
|
||||||
|
where s_url.id = stories.id
|
||||||
|
""")
|
||||||
|
|
||||||
|
DB.query("""
|
||||||
|
update stories
|
||||||
|
set publisher_id = p.id
|
||||||
|
from publishers p
|
||||||
|
where p.tld = stories.tld
|
||||||
|
""")
|
||||||
|
|
||||||
|
|
||||||
|
select
|
||||||
|
DB.query("""
|
||||||
|
update stories
|
||||||
|
set stories.publisher_id = p.id
|
||||||
|
from new_pub
|
||||||
|
""")
|
||||||
|
sv2['tld'] = sv2.publisher_url.apply(lambda x: map_tld(x))
|
||||||
|
|
||||||
|
|
||||||
|
new_pub = DB.query("""
|
||||||
|
with cte as (
|
||||||
|
select
|
||||||
|
tld
|
||||||
|
,publisher
|
||||||
|
,count(1) filter(where year(published_at) = 2022) as recent_ctn
|
||||||
|
,count(1) as ctn
|
||||||
|
from sv2
|
||||||
|
group by
|
||||||
|
tld
|
||||||
|
,publisher
|
||||||
|
)
|
||||||
|
,r as (
|
||||||
|
select
|
||||||
|
tld
|
||||||
|
,publisher
|
||||||
|
,ctn
|
||||||
|
,row_number() over(partition by tld order by recent_ctn desc) as rn
|
||||||
|
from cte
|
||||||
|
)
|
||||||
|
select
|
||||||
|
row_number() over() as id
|
||||||
|
,publisher as name
|
||||||
|
,tld
|
||||||
|
from r
|
||||||
|
where rn = 1
|
||||||
|
order by ctn desc
|
||||||
|
""").df()
|
||||||
|
|
||||||
|
DB.query("""
|
||||||
|
CREATE OR REPLACE TABLE publishers AS
|
||||||
|
SELECT
|
||||||
|
id
|
||||||
|
,name
|
||||||
|
,tld
|
||||||
|
FROM new_pub
|
||||||
|
""")
|
||||||
|
|
||||||
|
|
||||||
|
def map_tld(x):
|
||||||
|
try:
|
||||||
|
res = get_tld(x, as_object=True)
|
||||||
|
return res.fld
|
||||||
|
except:
|
||||||
|
return None
|
||||||
|
|
||||||
|
DB.sql("""
|
||||||
|
SELECT
|
||||||
|
s.id
|
||||||
|
,sv2.publisher_url
|
||||||
|
FROM stories s
|
||||||
|
JOIN sv2
|
||||||
|
on sv2.id = s.id
|
||||||
|
limit 5
|
||||||
|
""")
|
|
@ -6,7 +6,7 @@ import numpy as np
|
||||||
|
|
||||||
from transformers import BertTokenizer
|
from transformers import BertTokenizer
|
||||||
from model import BertForMultiLabelClassification
|
from model import BertForMultiLabelClassification
|
||||||
from data import connect, data_dir
|
from data.main import connect, data_dir
|
||||||
import seaborn as sns
|
import seaborn as sns
|
||||||
import matplotlib.pyplot as plt
|
import matplotlib.pyplot as plt
|
||||||
from matplotlib.dates import DateFormatter
|
from matplotlib.dates import DateFormatter
|
||||||
|
@ -376,3 +376,99 @@ def debug():
|
||||||
DB.close()
|
DB.close()
|
||||||
|
|
||||||
out.to_csv(data_dir() / 'emotions.csv', sep="|")
|
out.to_csv(data_dir() / 'emotions.csv', sep="|")
|
||||||
|
|
||||||
|
def another():
|
||||||
|
DB = connect()
|
||||||
|
DB.sql("""
|
||||||
|
select
|
||||||
|
*
|
||||||
|
from emotions
|
||||||
|
""")
|
||||||
|
|
||||||
|
emotions = DB.sql("""
|
||||||
|
select
|
||||||
|
year(s.published_at) as year
|
||||||
|
,se.label as emotion
|
||||||
|
,count(1) as stories
|
||||||
|
from stories s
|
||||||
|
join story_emotions se
|
||||||
|
on s.id = se.story_id
|
||||||
|
group by
|
||||||
|
year(s.published_at)
|
||||||
|
,se.label
|
||||||
|
""").df()
|
||||||
|
|
||||||
|
sns.scatterplot(x=emotions['year'], y=emotions['stories'], hue=emotions['emotion'])
|
||||||
|
plt.show()
|
||||||
|
|
||||||
|
pivot = emotions.pivot(index='year', columns='emotion', values='stories')
|
||||||
|
pivot.reset_index(inplace=True)
|
||||||
|
from sklearn.linear_model import LinearRegression
|
||||||
|
reg = LinearRegression()
|
||||||
|
|
||||||
|
for emotion in pivot.keys()[1:].tolist():
|
||||||
|
_ = reg.fit(pivot['year'].to_numpy().reshape(-1, 1), pivot[emotion])
|
||||||
|
print(f"{emotion}: {reg.coef_[0]}")
|
||||||
|
|
||||||
|
fig, ax = plt.subplots()
|
||||||
|
#sns.lineplot(x=pivot['anger'], y=pivot['joy'])
|
||||||
|
#sns.lineplot(x=pivot['anger'], y=pivot['surprise'], ax=ax)
|
||||||
|
sns.lineplot(x=pivot['anger'], y=pivot['fear'], ax=ax)
|
||||||
|
sns.lineplot(x=pivot[''], y=pivot['fear'], ax=ax)
|
||||||
|
plt.show()
|
||||||
|
|
||||||
|
DB.close()
|
||||||
|
|
||||||
|
normalized = DB.sql("""
|
||||||
|
with cte as (
|
||||||
|
select
|
||||||
|
year(s.published_at) as year
|
||||||
|
,se.label as emotion
|
||||||
|
,b.label as bias
|
||||||
|
from stories s
|
||||||
|
join story_emotions se
|
||||||
|
on s.id = se.story_id
|
||||||
|
join publisher_bias b
|
||||||
|
on b.id = s.publisher_id
|
||||||
|
where b.label != 'allsides'
|
||||||
|
and se.label != 'neutral'
|
||||||
|
)
|
||||||
|
select
|
||||||
|
distinct
|
||||||
|
year
|
||||||
|
,emotion
|
||||||
|
,bias
|
||||||
|
,cast(count(1) over(partition by year, bias, emotion) as float) / count(1) over(partition by year, bias) as group_count
|
||||||
|
from cte
|
||||||
|
""").df()
|
||||||
|
|
||||||
|
DB.sql("""
|
||||||
|
select
|
||||||
|
b.label as bias
|
||||||
|
,count(1) as stories
|
||||||
|
from stories s
|
||||||
|
join story_emotions se
|
||||||
|
on s.id = se.story_id
|
||||||
|
join publisher_bias b
|
||||||
|
on b.id = s.publisher_id
|
||||||
|
group by
|
||||||
|
b.label
|
||||||
|
""").df()
|
||||||
|
|
||||||
|
another_pivot = emotional_bias.pivot(index=['bias', 'year'], columns='emotion', values='stories')
|
||||||
|
another_pivot.reset_index(inplace=True)
|
||||||
|
|
||||||
|
sns.lineplot(data=normalized, x='year', y='group_count', hue='bias', style='emotion')
|
||||||
|
plt.show()
|
||||||
|
|
||||||
|
sns.relplot(
|
||||||
|
data=normalized, x="year", y="group_count", hue="emotion", col='bias', kind="line"
|
||||||
|
#data=normalized, x="year", y="group_count", hue="emotion", col='bias', kind="line", facet_kws=dict(sharey=False)
|
||||||
|
)
|
||||||
|
plt.show()
|
||||||
|
|
||||||
|
DB.sql("""
|
||||||
|
select
|
||||||
|
*
|
||||||
|
from another_pivot
|
||||||
|
""")
|
||||||
|
|
|
@ -1,8 +0,0 @@
|
||||||
import sklearn
|
|
||||||
import polars as pl
|
|
||||||
import toml
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
config = toml.load('/home/user/577/repo/config.toml')
|
|
||||||
app_dir = Path(config.get('app').get('path'))
|
|
||||||
df = pl.read_csv(app_dir / "data/articles.csv")
|
|
158
src/links.py
|
@ -1,12 +1,148 @@
|
||||||
from data import connect
|
import click
|
||||||
|
from data.main import connect
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
import numpy as np
|
import numpy as np
|
||||||
from sklearn.decomposition import PCA, TruncatedSVD
|
|
||||||
from sklearn.cluster import MiniBatchKMeans
|
|
||||||
import seaborn as sns
|
import seaborn as sns
|
||||||
import matplotlib.pyplot as plt
|
import matplotlib.pyplot as plt
|
||||||
|
|
||||||
|
|
||||||
|
@click.command('links:create-table')
|
||||||
|
def create_table():
|
||||||
|
|
||||||
|
table_name = "top.link_edges"
|
||||||
|
DB = connect()
|
||||||
|
DB.query(f"""
|
||||||
|
CREATE OR REPLACE TABLE {table_name} AS
|
||||||
|
with cte as(
|
||||||
|
SELECT
|
||||||
|
s.publisher_id as parent_id
|
||||||
|
,r.publisher_id as child_id
|
||||||
|
,count(1) as links
|
||||||
|
FROM top.stories s
|
||||||
|
JOIN top.related_stories r
|
||||||
|
ON s.id = r.parent_id
|
||||||
|
group by
|
||||||
|
s.publisher_id
|
||||||
|
,r.publisher_id
|
||||||
|
)
|
||||||
|
SELECT
|
||||||
|
cte.parent_id
|
||||||
|
,cte.child_id
|
||||||
|
,cte.links as links
|
||||||
|
,cast(cte.links as float) / sum(cte.links) over(partition by cte.parent_id) as normalized
|
||||||
|
,case when cte.links > 0 then 1 else 0 end as onehot
|
||||||
|
FROM cte
|
||||||
|
WHERE cte.child_id in (
|
||||||
|
SELECT
|
||||||
|
distinct parent_id
|
||||||
|
FROM cte
|
||||||
|
)
|
||||||
|
AND cte.parent_id in (
|
||||||
|
SELECT
|
||||||
|
distinct child_id
|
||||||
|
FROM cte
|
||||||
|
)
|
||||||
|
""")
|
||||||
|
DB.close()
|
||||||
|
|
||||||
|
DB = connect()
|
||||||
|
DB.query("""
|
||||||
|
SELECT
|
||||||
|
*
|
||||||
|
,-log10(links)
|
||||||
|
--distinct parent_id
|
||||||
|
FROM top.link_edges e
|
||||||
|
WHERE e.parent_id = 238
|
||||||
|
""")
|
||||||
|
DB.close()
|
||||||
|
print(f"created {table_name}")
|
||||||
|
|
||||||
|
@click.command('links:create-pca')
|
||||||
|
@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
|
||||||
|
def create_pca(source):
|
||||||
|
"""create 2D pca labels"""
|
||||||
|
|
||||||
|
from sklearn.decomposition import PCA
|
||||||
|
|
||||||
|
table_name = f"top.publisher_pca_{source}"
|
||||||
|
DB = connect()
|
||||||
|
pub = DB.query("""
|
||||||
|
SELECT
|
||||||
|
*
|
||||||
|
FROM top.publishers
|
||||||
|
""").df()
|
||||||
|
df = DB.query(f"""
|
||||||
|
SELECT
|
||||||
|
parent_id
|
||||||
|
,child_id
|
||||||
|
,{source} as links
|
||||||
|
FROM top.link_edges
|
||||||
|
""").df()
|
||||||
|
DB.close()
|
||||||
|
pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
|
||||||
|
|
||||||
|
svd = PCA(n_components=2)
|
||||||
|
svd_out = svd.fit_transform(pivot)
|
||||||
|
|
||||||
|
out = pivot.reset_index()[['parent_id']]
|
||||||
|
out['first'] = svd_out[:, 0]
|
||||||
|
out['second'] = svd_out[:, 1]
|
||||||
|
out = pd.merge(out, pub, left_on='parent_id', right_on='id')
|
||||||
|
|
||||||
|
DB = connect()
|
||||||
|
DB.query(f"""
|
||||||
|
CREATE OR REPLACE TABLE {table_name} AS
|
||||||
|
SELECT
|
||||||
|
out.id as publisher_id
|
||||||
|
,out.first as first
|
||||||
|
,out.second as second
|
||||||
|
FROM out
|
||||||
|
""")
|
||||||
|
DB.close()
|
||||||
|
print(f"created {table_name}")
|
||||||
|
|
||||||
|
|
||||||
|
@click.command('links:create-clusters')
|
||||||
|
@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
|
||||||
|
def create_clusters(source):
|
||||||
|
from sklearn.cluster import KMeans
|
||||||
|
|
||||||
|
table_name = f"top.publisher_clusters_{source}"
|
||||||
|
DB = connect()
|
||||||
|
df = DB.query(f"""
|
||||||
|
SELECT
|
||||||
|
parent_id
|
||||||
|
,child_id
|
||||||
|
,{source} as links
|
||||||
|
FROM top.link_edges
|
||||||
|
""").df()
|
||||||
|
pub = DB.query("""
|
||||||
|
SELECT
|
||||||
|
*
|
||||||
|
FROM top.publishers
|
||||||
|
""").df()
|
||||||
|
DB.close()
|
||||||
|
pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
|
||||||
|
|
||||||
|
|
||||||
|
k = 8
|
||||||
|
kmeans = KMeans(n_clusters=k, n_init="auto")
|
||||||
|
pred = kmeans.fit_predict(pivot)
|
||||||
|
out = pivot.reset_index()[['parent_id']]
|
||||||
|
out['label'] = pred
|
||||||
|
out = pd.merge(out, pub, left_on='parent_id', right_on='id')
|
||||||
|
new_table = out[['id', 'label']]
|
||||||
|
|
||||||
|
DB = connect()
|
||||||
|
DB.query(f"""
|
||||||
|
CREATE OR REPLACE TABLE {table_name} AS
|
||||||
|
SELECT
|
||||||
|
n.id as publisher_id
|
||||||
|
,n.label as label
|
||||||
|
FROM new_table n
|
||||||
|
""")
|
||||||
|
DB.close()
|
||||||
|
print(f"created {table_name}")
|
||||||
|
|
||||||
def to_matrix():
|
def to_matrix():
|
||||||
"""returns an adjacency matrix of publishers to publisher link frequency"""
|
"""returns an adjacency matrix of publishers to publisher link frequency"""
|
||||||
|
@ -21,6 +157,7 @@ def to_matrix():
|
||||||
{'label' :'right', 'value' : 4},
|
{'label' :'right', 'value' : 4},
|
||||||
{'label' :'allsides', 'value' : -1},
|
{'label' :'allsides', 'value' : -1},
|
||||||
])
|
])
|
||||||
|
|
||||||
bias = DB.sql("""
|
bias = DB.sql("""
|
||||||
SELECT
|
SELECT
|
||||||
b.id
|
b.id
|
||||||
|
@ -37,11 +174,7 @@ def to_matrix():
|
||||||
p.id
|
p.id
|
||||||
,p.name
|
,p.name
|
||||||
,p.url
|
,p.url
|
||||||
,b.label
|
|
||||||
,b.value
|
|
||||||
from publishers p
|
from publishers p
|
||||||
left join bias b
|
|
||||||
on b.id = p.id
|
|
||||||
""").df()
|
""").df()
|
||||||
|
|
||||||
edges = DB.sql("""
|
edges = DB.sql("""
|
||||||
|
@ -81,12 +214,23 @@ def to_matrix():
|
||||||
ON p.id = cte.parent_id
|
ON p.id = cte.parent_id
|
||||||
""").df()
|
""").df()
|
||||||
|
|
||||||
|
# only keep values that have more than 1 link
|
||||||
|
test = edges[edges['links'] > 2].pivot(index='parent_id', columns='child_id', values='links').fillna(0).reset_index()
|
||||||
|
edges.dropna().pivot(index='parent_id', columns='child_id', values='links').fillna(0)
|
||||||
|
pd.merge(adj, pub, how='left', left_on='parent_id', right_on='id')
|
||||||
adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
|
adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
|
||||||
|
adj.values.shape
|
||||||
|
|
||||||
|
|
||||||
out = pd.DataFrame(adj.index.values, columns=['id'])
|
out = pd.DataFrame(adj.index.values, columns=['id'])
|
||||||
out = pd.merge(out, pub, how='left', on='id')
|
out = pd.merge(out, pub, how='left', on='id')
|
||||||
|
return out
|
||||||
|
|
||||||
|
@click.command('links:analysis')
|
||||||
|
def analysis():
|
||||||
|
from sklearn.decomposition import PCA, TruncatedSVD
|
||||||
|
from sklearn.cluster import MiniBatchKMeans
|
||||||
|
adj = to_matrix()
|
||||||
pca = PCA(n_components=4)
|
pca = PCA(n_components=4)
|
||||||
pca_out = pca.fit_transform(adj)
|
pca_out = pca.fit_transform(adj)
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
from data import data_dir, connect
|
from data.main import data_dir, connect
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import sklearn
|
import sklearn
|
||||||
from sklearn.cluster import MiniBatchKMeans
|
from sklearn.cluster import MiniBatchKMeans
|
||||||
|
|
|
@ -0,0 +1,34 @@
|
||||||
|
import click
|
||||||
|
from data.main import connect
|
||||||
|
import os
|
||||||
|
import seaborn as sns
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'
|
||||||
|
|
||||||
|
@click.command('plot:pca-with-classes')
|
||||||
|
def pca_with_classes():
|
||||||
|
filename = "pca_with_classes.png"
|
||||||
|
|
||||||
|
DB = connect()
|
||||||
|
data = DB.query(f"""
|
||||||
|
SELECT
|
||||||
|
p.tld
|
||||||
|
,b.bias
|
||||||
|
,c.first
|
||||||
|
,c.second
|
||||||
|
,round(cast(b.agree as float) / (b.agree + b.disagree), 2) ratio
|
||||||
|
FROM top.publishers p
|
||||||
|
JOIN top.publisher_bias pb
|
||||||
|
ON p.id = pb.publisher_id
|
||||||
|
JOIN bias_ratings b
|
||||||
|
ON b.id = pb.bias_id
|
||||||
|
JOIN top.publisher_pca_normalized c
|
||||||
|
ON c.publisher_id = p.id
|
||||||
|
""").df()
|
||||||
|
DB.close()
|
||||||
|
ax = sns.scatterplot(x=data['first'], y=data['second'], hue=data['bias'], s=100)
|
||||||
|
ax.set(title="pca components vs. bias labels", xlabel="first pca component", ylabel="second pca component")
|
||||||
|
plt.savefig(out_dir / filename)
|
||||||
|
print(f"saved: {filename}")
|
|
@ -0,0 +1,302 @@
import click
from data.main import connect
import os
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np

out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'


@click.command('plot:articles-per-year')
def articles_per_year():
    filename = 'articles_per_year.png'

    DB = connect()
    data = DB.query("""
        select
            year(published_at) as year
            ,count(1) as stories
        from stories
        group by
            year(published_at)
    """).df()
    DB.close()

    ax = sns.barplot(x=data.year, y=data.stories, color='tab:blue')
    ax.tick_params(axis='x', rotation=90)
    ax.set(title="count of articles per year", ylabel="count of stories (#)")
    plt.tight_layout()
    plt.savefig(out_dir / filename)


@click.command('plot:distinct-publishers')
def distinct_publishers():
    filename = 'distinct_publishers.png'

    DB = connect()
    data = DB.query("""
        select
            year(published_at) as year
            ,count(distinct publisher_id) as publishers
        from stories
        group by
            year(published_at)
    """).df()
    DB.close()

    ax = sns.barplot(x=data.year, y=data.publishers, color='tab:blue')
    ax.tick_params(axis='x', rotation=90)
    ax.set(title="count of publishers per year", ylabel="count of publishers (#)")
    plt.tight_layout()
    plt.savefig(out_dir / filename)
    plt.close()


@click.command('plot:stories-per-publisher')
def stories_per_publisher():
    filename = 'stories_per_publisher.png'

    DB = connect()
    data = DB.query("""
        with cte as (
            select
                publisher_id
                ,year(published_at) as year
                ,count(1) as stories
            from stories
            group by
                publisher_id
                ,year(published_at)
        ) , agg as (
            select
                publisher_id
                ,avg(stories) as stories_per_year
                ,case
                    when avg(stories) < 2 then 2
                    when avg(stories) < 4 then 4
                    when avg(stories) < 8 then 8
                    when avg(stories) < 16 then 16
                    when avg(stories) < 32 then 32
                    when avg(stories) < 64 then 64
                    when avg(stories) < 128 then 128
                    else 129
                end as max_avg
            from cte
            group by
                publisher_id
        )
        select
            max_avg
            ,count(1) as publishers
        from agg
        group by
            max_avg
    """).df()
    DB.close()

    ax = sns.barplot(x=data.max_avg, y=data.publishers, color='tab:blue')
    ax.set(title="histogram of publisher stories per year", ylabel="count of publishers (#)", xlabel="max average stories / year")
    plt.tight_layout()
    plt.savefig(out_dir / filename)
    plt.close()


@click.command('plot:top-publishers')
def top_publishers():
    """plot top publishers over time"""

    filename = 'top_publishers.png'

    DB = connect()
    data = DB.query("""
        select
            p.tld
            ,year(published_at) as year
            ,count(1) as stories
        from (
            select
                p.tld
                ,p.id
            from top.publishers p
            join top.stories s
                on s.publisher_id = p.id
            group by
                p.tld
                ,p.id
            order by count(1) desc
            limit 20
        ) p
        join top.stories s
            on s.publisher_id = p.id
        group by
            p.tld
            ,year(published_at)
        order by count(distinct s.id) desc
    """).df()
    DB.close()

    pivot = data.pivot(columns='year', index='tld', values='stories')
    ax = sns.heatmap(pivot, cmap="crest")
    ax.set(title="top 20 publishers (by tld)", ylabel="tld", xlabel="stories / year (#)")
    plt.tight_layout()
    plt.savefig(out_dir / filename)
    plt.close()


@click.command('plot:common_tld')
def common_tld():
    import dataframe_image as dfi
    filename = 'common_tld.png'

    DB = connect()
    data = DB.query("""
        select
            split_part(url, '.', -1) as tld
            ,count(1) as publishers
            ,case when count(1) < 20
                then string_agg(distinct url, '\t')
                else NULL
            end as urls
        from publishers
        group by
            split_part(url, '.', -1)
        order by
            count(1) desc
    """).df()
    DB.close()
    data[:15][['tld', 'publishers']].style.hide(axis="index").export_png(out_dir / filename, table_conversion='matplotlib')


def stats():
    DB = connect()

    # raw
    DB.query("""
        SELECT
            'total stories' as key
            ,COUNT(1) as value
        FROM stories
        UNION
        SELECT
            'total related' as key
            ,COUNT(1) as value
        FROM related_stories
        UNION
        SELECT
            'top level domains' as key
            ,COUNT(distinct tld) as value
        FROM stories
        UNION
        SELECT
            'publishers' as key
            ,COUNT(1) as value
        FROM publishers
        UNION
        SELECT
            'authors' as key
            ,COUNT(distinct author) as value
        FROM stories
        UNION
        SELECT
            'min year' as key
            ,min(year(published_at)) as value
        FROM stories
        UNION
        SELECT
            'max year' as key
            ,max(year(published_at)) as value
        FROM stories
    """).df().to_markdown(index=False)

    # selected
    DB.query("""
        SELECT
            'total stories' as key
            ,COUNT(1) as value
        FROM top.stories
        UNION
        SELECT
            'total related' as key
            ,COUNT(1) as value
        FROM top.related_stories
        UNION
        SELECT
            'top level domains' as key
            ,COUNT(distinct tld) as value
        FROM top.stories
        UNION
        SELECT
            'publishers' as key
            ,COUNT(1) as value
        FROM top.publishers
        UNION
        SELECT
            'authors' as key
            ,COUNT(distinct author) as value
        FROM top.stories
        UNION
        SELECT
            'min year' as key
            ,min(year(published_at)) as value
        FROM top.stories
        UNION
        SELECT
            'max year' as key
            ,max(year(published_at)) as value
        FROM top.stories
    """).df().to_markdown(index=False)


@click.command('plot:bias-stats')
def bias_stats():
    import dataframe_image as dfi
    filename = 'bias_stats.png'

    DB = connect()

    df = DB.query("""
        SELECT
            string_agg(distinct bias)
        FROM bias_ratings
    """).df()

    df.keys()

    df = DB.query("""
        SELECT
            'publishers' as metric
            ,count(1) as value
        FROM bias_ratings
        UNION
        SELECT
            'labels' as metric
            ,count(distinct bias) as value
        FROM bias_ratings
        UNION
        SELECT
            'right' as metric
            ,count(1) as value
        FROM bias_ratings
        WHERE bias in ('right', 'right-center')
        UNION
        SELECT
            'left' as metric
            ,count(1) as value
        FROM bias_ratings
        WHERE bias in ('left', 'left-center')
        UNION
        SELECT
            'center' as metric
            ,count(1) as value
        FROM bias_ratings
        WHERE bias in ('center')
        UNION
        SELECT
            'agree_range' as metric
            ,'['
                || min(cast(agree as float) / (agree + disagree))
                || '-'
                || max(cast(agree as float) / (agree + disagree))
                || ']'
            as value
        FROM bias_ratings
        WHERE bias in ('center')
    """).df()
    DB.close()
    print(df.to_markdown(index=False))
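Note: the click group that exposes these `plot:*` commands is not part of this diff. A minimal sketch of that wiring, assuming a hypothetical cli.py entry point and that this file is importable as plots, might look like:

    # cli.py -- hypothetical entry point, not part of this diff
    import click
    from plots import (articles_per_year, distinct_publishers, stories_per_publisher,
                       top_publishers, common_tld, bias_stats)

    cli = click.Group('577')  # group name is an assumption
    for cmd in (articles_per_year, distinct_publishers, stories_per_publisher,
                top_publishers, common_tld, bias_stats):
        cli.add_command(cmd)

    if __name__ == '__main__':
        cli()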
@ -0,0 +1,114 @@
import click
from data.main import connect
from links import to_matrix
import os
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
from sklearn.metrics import silhouette_score
import pandas as pd

out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'


@click.command('plot:link-elbow')
def elbow():
    from sklearn.cluster import KMeans

    filename = 'link_cluster_elbow.png'

    DB = connect()
    df = DB.query("""
        SELECT
            *
        FROM link_edges
    """).df()
    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)

    to_plot = []
    for k in range(2, 15):
        kmeans = KMeans(n_clusters=k, n_init="auto")
        kmeans.fit(pivot)
        label = kmeans.labels_
        coeff = silhouette_score(pivot, label, metric='euclidean')
        to_plot.append({'k': k, 'inertia' : kmeans.inertia_, 'coeff': coeff})
    to_plot = pd.DataFrame(to_plot)

    ax = sns.lineplot(x=to_plot.k, y=to_plot.inertia)
    ax.set(title="elbow criterion plot of clusters", xlabel="bin size (k)", ylabel="sum of squared distances between centroids/points")
    plt.savefig(out_dir / filename)
    plt.close()


# randomly pick 8

@click.command('plot:link-pca-clusters')
@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
def link_pca_clusters(source):
    filename = f"link_pca_clusters_{source}.png"

    DB = connect()
    df = DB.query(f"""
        SELECT
            c.label as cluster
            ,p.tld
            --,b.label as bias
            ,pca.first
            ,pca.second
            ,s.cnt as stories
        FROM top.publisher_clusters_{source} c
        JOIN top.publishers p
            ON c.publisher_id = p.id
        JOIN
        (
            select
                s.publisher_id
                ,count(1) as cnt
            FROM top.stories s
            GROUP BY
                s.publisher_id
        ) s
            ON s.publisher_id = p.id
        JOIN top.publisher_pca_{source} pca
            ON pca.publisher_id = p.id
    """).df()
    DB.close()

    ax = sns.scatterplot(x=df['first'], y=df['second'], hue=df['cluster'])
    ax.set(title=f"pca components vs. clusters ({source})", xlabel="first pca component", ylabel="second pca component")
    plt.savefig(out_dir / filename)

    # .df().groupby(['cluster', 'bias']).describe()


def test():
    data_dir = Path(os.getenv('DATA_MINING_DATA_DIR'))

    DB = connect()
    DB.query("""
        SELECT
            p.id as publisher_id
            ,p.name
            ,p.tld
            ,cast(b.bias_id as int) as bias_id
            ,count(1) as stories
        FROM publishers p
        JOIN stories s
            ON s.publisher_id = p.id
        JOIN publisher_clusters c
            ON c.publisher_id = p.id
        LEFT JOIN publisher_bias b
            ON b.publisher_id = p.id
        where bias_id is null
        group by
            p.id
            ,p.name
            ,p.tld
            ,b.bias_id
        ORDER BY count(1) desc
    """)

    # .df().to_csv(data_dir / 'cluster_publishers.csv', sep="|", index=False)
    DB.close()

@ -0,0 +1,48 @@
from data.main import connect
import pandas as pd
import numpy as np

DB = connect()
edges = DB.query("""
    select
        *
    from link_edges
""").df()
DB.close()

edges

adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
select_publishers = pd.DataFrame(adj.index.tolist(), columns=['publisher_id'])

DB = connect()
DB.query("create schema top")

DB.query("""
    CREATE OR REPLACE TABLE top.publishers AS
    SELECT
        p.*
    FROM publishers p
    JOIN select_publishers s
        ON s.publisher_id = p.id
""")

DB.query("""
    CREATE OR REPLACE TABLE top.stories AS
    SELECT
        s.*
    FROM stories s
    JOIN top.publishers p
        ON s.publisher_id = p.id
    WHERE year(s.published_at) >= 2006
    AND year(s.published_at) < 2023
""")

DB.query("""
    CREATE OR REPLACE TABLE top.related_stories AS
    SELECT
        r.*
    FROM top.stories s
    JOIN related_stories r
        ON s.id = r.parent_id
""")

@ -0,0 +1,138 @@
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from data.main import connect, data_dir
import os
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm
import click


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


@click.option('-c', '--chunks', type=int, default=500, show_default=True)
@click.command("sentence:embed")
def embed(chunks):

    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

    # load data
    DB = connect()
    table = DB.sql("""
        select
            id
            ,title
        from stories
        order by id desc
    """).df()
    DB.close()

    # normalize text
    table['title'] = table['title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

    chunked = np.array_split(table, chunks)

    # generate embeddings from list of titles
    iterator = tqdm(chunked, 'embedding')
    embeddings = []
    embedding_ids = []
    for _, chunk in enumerate(iterator):
        sentences = chunk['title'].tolist()
        ids = chunk['id'].tolist()
        # Tokenize sentences
        encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input)
        # Perform pooling
        output = mean_pooling(model_output, encoded_input['attention_mask'])
        # Normalize embeddings
        output = F.normalize(output, p=2, dim=1)
        embeddings.append(output)
        embedding_ids.append(ids)

    embeddings = np.concatenate(embeddings)
    ids = np.concatenate(embedding_ids)

    # save embeddings
    save_to = data_dir() / 'embeddings.npy'
    np.save(save_to, embeddings)
    print(f"embeddings saved: {save_to}")

    # save ids
    save_to = data_dir() / 'embedding_ids.npy'
    np.save(save_to, ids)
    print(f"ids saved: {save_to}")


@click.command('sentence:create-pca-table')
def create_table():
    from sklearn import linear_model
    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))

    embeddings = np.load(data_path / 'embeddings.npy')
    embedding_ids = np.load(data_path / 'embedding_ids.npy')
    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()
    DB = connect()

    data = DB.query("""
        SELECT
            ids.index
            ,s.id
            ,b.ordinal
        FROM ids
        JOIN top.stories s
            ON ids.story_id = s.id
        JOIN top.publisher_bias pb
            ON pb.publisher_id = s.publisher_id
        JOIN bias_ratings b
            ON b.id = pb.bias_id
    """).df()

    x = embeddings[data['index']]
    y = data['ordinal'].to_numpy().reshape(-1, 1)

    reg = linear_model.LinearRegression()

    reg.fit(x, y)

    reg.coef_.shape


@click.command('sentence:create-svm-table')
def create_svm_table():
    from sklearn import svm
    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))

    embeddings = np.load(data_path / 'embeddings.npy')
    embedding_ids = np.load(data_path / 'embedding_ids.npy')
    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()

    DB = connect()
    data = DB.query("""
        SELECT
            ids.index
            ,s.id
            ,b.ordinal
        FROM ids
        JOIN top.stories s
            ON ids.story_id = s.id
        JOIN top.publisher_bias pb
            ON pb.publisher_id = s.publisher_id
        JOIN bias_ratings b
            ON b.id = pb.bias_id
    """).df()

    x = embeddings[data['index']]
    # y = data['ordinal'].to_numpy().reshape(-1, 1)
    y = data['ordinal']

    clf = svm.SVC()
    pred = clf.fit(x, y)

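The two np.save calls in sentence:embed are what the create-pca-table and create-svm-table commands read back. A minimal sketch of reloading and aligning those arrays, assuming the same DATA_MINING_DATA_DIR as above:

    import os
    import numpy as np
    import pandas as pd
    from pathlib import Path

    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))
    embeddings = np.load(data_path / 'embeddings.npy')  # one 384-dim MiniLM vector per headline
    ids = pd.DataFrame(np.load(data_path / 'embedding_ids.npy'), columns=['story_id']).reset_index()
    assert len(ids) == len(embeddings)  # row i of embeddings belongs to ids.story_id[i]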
@ -0,0 +1,5 @@
import train.main

__all__ = [
    'main'
]

@ -0,0 +1,38 @@
from torch.utils.data import Dataset
from data.main import connect, data_dir
from bias import label_to_int
import numpy as np
import pandas as pd


class NewsDataset(Dataset):
    def __init__(self):
        self.embeddings = np.load(data_dir() / 'embeddings.npy')
        embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()

        DB = connect()
        query = """
            SELECT
                s.id
                ,b.label
                ,count(1) over (partition by publisher_id) as stories
            FROM stories s
            JOIN publisher_bias b
                ON b.id = s.publisher_id
            WHERE b.label != 'allsides'
        """
        data = DB.sql(query).df()
        DB.close()

        data['label'] = data['label'].apply(lambda x: label_to_int(x))
        data = data.merge(embedding_ids)
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        y = row['label']
        # x = np.concatenate((self.embeddings[row['index']], [row['stories']])).astype(np.float32)
        x = self.embeddings[row['index']]
        return x, y

@ -0,0 +1,132 @@
import click
from tqdm import tqdm
from enum import Enum, auto
from dotenv import load_dotenv
import os

import torch
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from accelerate import Accelerator

from train.dataset import NewsDataset
from train.model import Classifier
# from model.linear import LinearClassifier


class Stage(Enum):
    TRAIN = auto()
    DEV = auto()


@click.command('train:main')
def main():
    dev_after = 20
    visible_devices = None
    lr = 1e-4
    epochs = 10
    debug = False
    torch.manual_seed(0)
    num_workers = 0

    embedding_length = int(os.getenv('EMBEDDING_LENGTH', 384))

    dataset = NewsDataset()
    trainset, devset = torch.utils.data.random_split(dataset, [0.8, 0.2])
    batch_size = 512
    trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers, drop_last=True)
    devloader = DataLoader(devset, shuffle=False, num_workers=num_workers)
    accelerator = Accelerator()
    model = Classifier(embedding_length=embedding_length, classes=5)

    # it's possible to control which GPUs the process can see using an environment variable
    if visible_devices:
        os.environ['CUDA_VISIBLE_DEVICES'] = visible_devices
    if debug:
        os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
        # accelerator.log({"message" :"debug enabled"})

    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # wrap objects with accelerate
    model, optimizer, trainloader, devloader = accelerator.prepare(model, optimizer, trainloader, devloader)

    def run():
        """runner for training and validating"""
        running_loss = 0.0
        # set the model to train mode
        model.train() if stage == Stage.TRAIN else model.eval()
        dataloader = trainloader if stage == Stage.TRAIN else devloader
        desc = 'train epoch' if stage == Stage.TRAIN else 'dev epoch'
        if debug:
            ...

        # Make sure there are no leftover gradients before starting training an epoch
        optimizer.zero_grad()

        for batch, (x, y) in enumerate(tqdm(dataloader, desc=desc)):
            pred_y = model(x)  # Forward pass through model
            loss = criterion(pred_y, y)
            running_loss += loss  # Increment running loss
            # Only update model weights on training
            if stage == Stage.TRAIN:
                accelerator.backward(loss)  # Increment gradients within model by sending loss backwards
                optimizer.step()  # Update model weights
                optimizer.zero_grad()  # Reset gradients to 0
        return running_loss / len(dataloader)

    for epoch in range(epochs):
        if (epoch - 1) % dev_after == 0:
            stage = Stage.DEV
            log = run()
            print(f"dev loss: {log}")
        else:
            stage = Stage.TRAIN
            log = run()
            print(f"train loss: {log}")

    breakpoint()
    from data.main import data_dir, connect
    import numpy as np
    import pandas as pd
    from bias import int_to_label

    embeddings = dataset.embeddings
    embedding_ids = dataset.data

    DB = connect()
    query = """
        SELECT
            s.id
            ,title
            ,p.name
            ,count(1) over (partition by publisher_id) as stories
        FROM stories s
        JOIN publishers p
            on p.id = s.publisher_id
        WHERE s.publisher_id NOT IN (
            SELECT
                id
            FROM publisher_bias b
        )
    """
    data = DB.sql(query).df()
    embeddings = np.load(data_dir() / 'embeddings.npy')
    embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()

    for i in range(10):
        embedding = embeddings[embedding_ids[embedding_ids['id'] == data.iloc[i]['id']]['index']]
        title = data.iloc[i]['title']
        publisher = data.iloc[i]['name']
        class_pred = nn.functional.softmax(model(torch.tensor(embedding))).detach()
        class_id = int(torch.argmax(nn.functional.softmax(model(torch.tensor(embedding))).detach()))
        print(f"{publisher}: {int_to_label(class_id)} - \"{title}\"")

    embedding_ids['id'] == data.iloc[0]['id']
    embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]
    embedding = embeddings[embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]['index']]
    title
    publisher

    model().get_last_layer(torch.tensor(embedding))

@ -0,0 +1,28 @@
from torch import nn


class Classifier(nn.Module):
    def __init__(self, embedding_length: int, classes: int):
        super().__init__()
        out_len = 16
        self.stack = nn.Sequential(
            nn.Linear(embedding_length, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, out_len),
            nn.ReLU(),
        )
        self.logits = nn.Linear(out_len, classes)

    def forward(self, x):
        x = self.stack(x)
        self.last_hidden_layer = x.detach()
        return self.logits(x)

    def get_last_layer(self, x):
        x = self.stack(x)
        return x

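A quick shape check for this classifier, assuming the 384-dim sentence embeddings and 5 bias classes used by train:main:

    import torch
    from train.model import Classifier

    model = Classifier(embedding_length=384, classes=5)
    x = torch.randn(512, 384)           # a dummy batch of sentence embeddings
    logits = model(x)                   # torch.Size([512, 5])
    hidden = model.get_last_layer(x)    # torch.Size([512, 16]), the penultimate features
    print(logits.shape, hidden.shape)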
@ -1,7 +1,7 @@
 import click
 from transformers import AutoTokenizer, RobertaModel
 import numpy as np
-from data import Data, from_db, connect, data_dir
+from data.main import Data, from_db, connect, data_dir
 from tqdm import tqdm
 import torch
 from pathlib import Path