v1.0 of presentation.

2023-05-17 13:38:07 -07:00 · 2023-05-17 13:38:07 -07:00 · 74c2d8afa2
parent 4d93cf7adb
commit 74c2d8afa2
37 changed files with 1959 additions and 144 deletions
--- a/.gitignore
+++ b/.gitignore
@ -2,3 +2,8 @@
 *.swp
 __pycache__
 tmp.py
+.env
+*.aux
+*.log
+*.out
+tmp.*
--- a/11
+++ b/11
@ -0,0 +1,11 @@
+.PHONY:to_wwu
+
+all: to_wwu
+
+to_wwu:
+	rsync -avz ~/577/repo/docs/figures/ linux-04:/home/jensen33/Dev/studentweb/assets/static/577/
+	scp ~/577/repo/docs/presentation.md linux-04:/home/jensen33/Dev/studentweb/content/577/contents.lr
+	scp ~/Dev/www.publicmatt.com/models/slides.ini linux-04:/home/jensen33/Dev/studentweb/models/
+	scp ~/Dev/www.publicmatt.com/templates/slides.html linux-04:/home/jensen33/Dev/studentweb/templates/
+	rsync -avz ~/Dev/www.publicmatt.com/assets/static/revealjs linux-04:/home/jensen33/Dev/studentweb/assets/static/
+	ssh linux-04 cd /home/jensen33/Dev/studentweb \; make
--- a/docs/Makefile
+++ b/docs/Makefile
@ -0,0 +1,3 @@
+paper.pdf: paper.tex
+	pdflatex $^ -o $@
+	evince $@
--- a/docs/figures/allsides_request.png
+++ b/docs/figures/allsides_request.png
--- a/docs/figures/articles_per_year.png
+++ b/docs/figures/articles_per_year.png
--- a/docs/figures/common_tld.png
+++ b/docs/figures/common_tld.png
--- a/docs/figures/distinct_publishers.png
+++ b/docs/figures/distinct_publishers.png
--- a/docs/figures/link_cluster_elbow.png
+++ b/docs/figures/link_cluster_elbow.png
--- a/docs/figures/link_pca_clusters_links.png
+++ b/docs/figures/link_pca_clusters_links.png
--- a/docs/figures/link_pca_clusters_normalized.png
+++ b/docs/figures/link_pca_clusters_normalized.png
--- a/docs/figures/link_pca_clusters_onehot.png
+++ b/docs/figures/link_pca_clusters_onehot.png
--- a/docs/figures/pca_with_classes.png
+++ b/docs/figures/pca_with_classes.png
--- a/docs/figures/stories_per_publisher.png
+++ b/docs/figures/stories_per_publisher.png
--- a/docs/figures/top_publishers.png
+++ b/docs/figures/top_publishers.png
--- a/docs/paper.pdf
+++ b/docs/paper.pdf
--- a/docs/paper.tex
+++ b/docs/paper.tex
@ -0,0 +1,61 @@
+\documentclass{article}
+\usepackage{multicol}
+\usepackage{hyperref}
+\title{Data Mining CS 571}
+\author{Matt Jensen}
+\date{2023-04-25}
+
+\begin{document}
+\maketitle
+
+\section*{Abstract}
+
+News organizations have been repeatedly accused of being partisan.
+Additionally, they have been accused of polarizing dicussion to drive up revenue and engagement.
+This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged of time.
+A secondary goal is the investigate whether news organization have been uniformly polarized, or if one pole has been 'moving' more rapidly away from the 'middle'.
+This analysis will probe to what degree has the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
+Naom Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of agenda instead of the cause of it.
+
+\begin{multicols}{2}
+
+\section{Data Preparation}
+The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://mememorandum.com}{Memeorandum} for news stories from 2006 to 2022.
+Each news article has a title, author, description, publisher, publish date, url and related discussions. 
+The site also has a concept of references, where a main, popular story may be covered by other sources.
+This link association might be used to support one or more of the hypothesis of the main analysis.
+After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
+What remains after these cleaning steps is approximitely 6,400 days of material, 300,000 distinct headlines from 21,000 publishers and 34,000 authors used in the study.
+
+\section{Missing Data Policy}
+
+The largest data policy that will have to be dealt with is news organizations that share the same parent company, but might have slightly different names.
+Wall Street Journal news is drastically different than their opinion section. 
+Other organizations have slightly different names for the same thing and a product of the aggregation service and not due to any real difference.
+Luckily, most of the anaylsis is operating on the content of the news headlines, which do not suffer from this data impurity.
+
+\section{Classification Task}
+
+The classification of news titles into emotional categories was accomplished by using a pretrained large langauge model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
+This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} which manually classified a collection of 58,000 comments into 28 emotions.
+The classes for each article will be derived by tokenizing the title and running the model over the tokens, then grabbing the largest probabilty class from the output.
+
+The data has been discretized into years.
+    Additionally, the publishers will have been discretized based of either principle component analysis on link similarity or based on the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{All Sides}.
+Given that the features of the dataset are sparse, it is not expected to have any useless attributes, unless the original hypothesis of a temporal trend proving to be false.
+Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
+
+\section{Experiments}
+
+No computational experiment have been done yet.
+Generating the tokenized text, the word embedding and the emotional sentiment analysis have made up the bulk of the work thus far.
+The bias ratings do not cover all publisher in the dataset, so the number of articles without a bias rating from their publisher will have to be calculated.
+If it is less than 30\% of the articles, it might not make sense to use the bias ratings.
+The creation and reduction of the link graph with principle component analysis will need to be done to visualize the relationship between related publishers.
+
+\section{Results}
+\textbf{TODO.}
+
+\end{multicols}
+
+\end{document}
--- a/docs/presentation.md
+++ b/docs/presentation.md
@ -0,0 +1,552 @@
+_model: slides
+---
+
+title: CSCI 577 - Data Mining
+
+---
+body:
+
+# Political Polarization 
+
+Matt Jensen
+
+===
+
+# Hypothesis 
+
+Political polarization is rising, and news articles are a proxy measure.
+
+==
+
+# Is this reasonable? 
+
+
+==
+
+# Why is polarization rising? 
+
+Not my job, but there's research<sup>[ref](#references)</sup> to support it
+
+
+==
+
+# Sub-hypothesis 
+
+- The polarization increases near elections. <!-- .element: class="fragment" -->
+- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
+- The polarization is not evenly distributed across political specturm. <!-- .element: class="fragment" -->
+
+==
+
+# Sub-sub-hypothesis 
+
+- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
+- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
+- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
+
+===
+
+# Data Source(s) 
+
+memeorandum.com <!-- .element: class="fragment" -->
+
+allsides.com <!-- .element: class="fragment" -->
+
+huggingface.com <!-- .element: class="fragment" -->
+
+===
+
+<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
+
+===
+
+# memeorandum.com
+
+- News aggregation site. <!-- .element: class="fragment" -->
+- Was really famous before Google News. <!-- .element: class="fragment" -->
+- Still aggregates sites today. <!-- .element: class="fragment" -->
+
+==
+
+# Why Memeorandum? 
+
+- Behavioral: I only read titles sometimes. (doom scrolling). <!-- .element class="fragment" -->
+- Behavioral: It's my source of news (with sister site TechMeme.com). <!-- .element class="fragment" -->
+- Convenient: most publishers block bots. <!-- .element class="fragment" -->
+- Convenient: dead simple html to parse. <!-- .element class="fragment" -->
+- Archival: all headlines from 2006 forward. <!-- .element class="fragment" -->
+- Archival: automated, not editorialized. <!-- .element class="fragment" -->
+
+===
+
+<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
+
+===
+
+# AllSides.com
+
+- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
+- Ratings combine: <!-- .element: class="fragment" -->
+    - blind bias surveys.
+    - editorial reviews.
+    - third party research.
+    - community voting.
+- Originally scraped website, but direct access eventually. <!-- .element: class="fragment" -->
+
+
+==
+
+# Why AllSides? 
+
+- Behavioral: One of the first google results on bias apis. <!-- .element class="fragment" -->
+- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element class="fragment" -->
+- Convenient: Easy format. <!-- .element class="fragment" -->
+- Archival: Covers 1400 publishers. <!-- .element class="fragment" -->
+
+===
+
+<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
+
+===
+
+# HuggingFace.com
+
+- Deep Learning library. <!-- .element: class="fragment" -->
+- Lots of pretrained models. <!-- .element: class="fragment" -->
+- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
+
+==
+
+# Why HuggingFace? 
+
+- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
+- Behavioral: The dataset needed more features.<!-- .element: class="fragment" -->
+- Convenient: Literally 5 lines of python.<!-- .element: class="fragment" -->
+- Convenient: Testing different model performance was easy.<!-- .element: class="fragment" -->
+- Archival: Lots of pretrained classification tasks.<!-- .element: class="fragment" -->
+
+===
+
+# Data Structures
+Stories
+
+- Top level stories. <!-- .element: class="fragment" -->
+    - title.
+    - publisher.
+    - author.
+- Related discussion. <!-- .element: class="fragment" -->
+    - publisher.
+    - uses 'parent' story as a source.
+- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
+
+==
+
+# Data Structures
+Bias
+
+- Per publisher. <!-- .element: class="fragment" -->
+    - name.
+    - label.
+    - agree/disagree vote by community. 
+- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
+
+==
+
+# Data Structures
+Embeddings
+
+- Per story title. <!-- .element: class="fragment" -->
+    - sentence embedding (n, 384).
+    - sentiment classification (n, 1).
+    - emotional classification (n, 1).
+- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
+
+===
+
+# Data Collection
+
+==
+
+# Data Collection
+
+Story Scraper (simplified)
+
+```python
+day = timedelta(days=1)
+cur = date(2005, 10, 1)
+end = date.today()
+while cur <= end:
+    cur = cur + day
+    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
+    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
+    r = requests.get(url)
+    with open(save_as, 'w') as f:
+        f.write(r.text)
+```
+
+==
+
+# Data Collection
+Bias Scraper (hard)
+
+```python
+...
+bias_html = DATA_DIR / 'allsides.html'
+parser = etree.HTMLParser()
+tree = etree.parse(str(bias_html), parser)
+root = tree.getroot()
+rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')
+
+ratings = []
+for row in rows:
+    rating = dict()
+    ...
+```
+
+==
+
+# Data Collection
+Bias Scraper (easy)
+
+![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
+
+==
+
+# Data Collection
+Embeddings (easy)
+
+```python
+# table = ...
+tokenizer = AutoTokenizer.from_pretrained("roberta-base")
+model = AutoModel.from_pretrained("roberta-base")
+
+for chunk in table:
+    tokens = tokenizer(chunk, add_special_tokens = True, truncation = True, padding = "max_length", max_length=92, return_attention_mask = True, return_tensors = "pt")
+    outputs = model(**tokens)
+    embeddings = outputs.last_hidden_state.detach().numpy()
+    ...
+```
+
+==
+
+# Data Collection
+Classification Embeddings (medium) 
+
+```python
+...
+outputs = model(**tokens)[0].detach().numpy()
+scores = 1 / (1 + np.exp(-outputs))  # Sigmoid
+class_ids = np.argmax(scores, axis=1)
+for i, class_id in enumerate(class_ids):
+    results.append({"story_id": ids[i], "label" : model.config.id2label[class_id]})
+...
+```
+
+===
+
+# Data Selection
+
+==
+
+# Data Selection
+Stories
+
+- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
+- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
+
+==
+# Data Selection
+
+Publishers
+
+- Combine subdomains of stories. <!-- .element: class="fragment" -->
+    - blog.washingtonpost.com and washingtonpost.com are considered the same publisher. 
+    - This could be bad. For example: opinion.wsj.com != wsj.com. 
+
+==
+
+# Data Selection
+
+Links
+
+- Select only stories with publishers whose story had been a 'parent' ('original publishers'). <!-- .element: class="fragment" -->
+    - Eliminates small blogs and non-original news.
+- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
+    - Eliminate silo'ed publications. 
+    - Link matrix is square and low'ish dimensional. 
+
+==
+
+# Data Selection
+
+Bias
+
+- Keep all ratings, even ones with low agree/disagree ratio.
+- Join datasets on publisher name. 
+    - Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
+    - Started with 'jaro winkler similarity' then manually from there.
+- Use numeric values
+    - [left: -2, left-center: -1, ...]
+
+===
+
+# Descriptive Stats
+
+Raw
+
+| metric            |   value |
+|:------------------|--------:|
+| total stories     |  299714 |
+| total related     |  960111 |
+| publishers        |    7031 |
+| authors           |   34346 |
+| max year          |    2023 |
+| min year          |    2005 |
+| top level domains |    7063 |
+
+==
+# Descriptive Stats
+
+Stories Per Publisher
+
+![stories per publisher](/static/577/stories_per_publisher.png)
+
+==
+
+# Descriptive Stats
+
+Top Publishers
+
+![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
+
+==
+
+# Descriptive Stats
+
+Articles Per Year
+
+![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
+
+==
+
+# Descriptive Stats
+
+Common TLDs
+
+![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
+
+==
+
+# Descriptive Stats
+
+Post Process
+
+| key               |   value |
+|:------------------|--------:|
+| total stories     |  251553 |
+| total related     |  815183 |
+| publishers        |     223 |
+| authors           |   23809 |
+| max year          |    2022 |
+| min year          |    2006 |
+| top level domains |     234 |
+
+===
+# Experiments
+
+1. **clustering** on link similarity. <!-- .element: class="fragment" -->
+2. **classification** on link similarity. <!-- .element: class="fragment" -->
+3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
+4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
+5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
+
+===
+# Experiment 1
+
+Setup
+
+- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
+- Cluster the encoding. <!-- .element: class="fragment" -->
+- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
+- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
+
+Note:
+Principle Component Analysis: 
+- a statistical technique for reducing the dimensionality of a dataset.
+- linear transformation into a new coordinate system where (most of) the variation data can be described with fewer dimensions than the initial data.
+
+==
+
+# Experiment 1
+
+One Hot Encoding
+
+| publisher |  nytimes|  wsj| newsweek|  ...|
+|:----------|--------:|----:|--------:|----:|
+| nytimes   |        1|    1|        1|  ...|
+| wsj       |        1|    1|        0|  ...|
+| newsweek  |        0|    0|        1|  ...|
+| ...       |      ...|  ...|      ...|  ...|
+
+==
+
+# Experiment 1
+
+n-Hot Encoding
+
+| publisher |  nytimes|  wsj| newsweek|  ...|
+|:----------|--------:|----:|--------:|----:|
+| nytimes   |       11|    1|      141|  ...|
+| wsj       |        1|   31|        0|  ...|
+| newsweek  |        0|    0|        1|  ...|
+| ...       |      ...|  ...|      ...|  ...|
+
+==
+
+# Experiment 1
+
+Normalized n-Hot Encoding
+
+| publisher |  nytimes|  wsj| newsweek|  ...|
+|:----------|--------:|----:|--------:|----:|
+| nytimes   |        0|  0.4|      0.2|  ...|
+| wsj       |      0.2|    0|      0.4|  ...|
+| newsweek  |      0.0|  0.0|      0.0|  ...|
+| ...       |      ...|  ...|      ...|  ...|
+
+==
+
+# Experiment 1
+
+Elbow criterion
+
+![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
+
+Note:
+
+The elbow method looks at the percentage of explained variance as a function of the number of clusters: 
+
+One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
+
+Percentage of variance explained is the ratio of the between-group variance to the total variance,
+
+==
+
+# Experiment 1
+
+Link Magnitude
+
+![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
+
+==
+
+# Experiment 1
+
+Normalized
+
+![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
+
+==
+
+# Experiment 1
+
+Onehot
+
+![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
+
+==
+
+# Experiment 1
+
+Discussion
+
+- Best encoding: One hot. <!-- .element: class="fragment" -->
+    - Clusters based on total links otherwise.
+- Clusters, but no explanation
+- Limitation: need the link encoding to cluster.
+    - Smaller publishers might not link very much.
+
+===
+
+# Experiment 2
+
+Setup
+
+- Create features. <!-- .element: class="fragment" -->:
+    - Publisher frequency.
+    - Reuse link encodings.
+- Create classes: <!-- .element: class="fragment" -->
+    - Join bias classifications.
+- Train classifier. <!-- .element: class="fragment" -->
+
+Note:
+
+==
+# Experiment 2
+Descriptive stats
+
+| metric      | value     |
+|:------------|:----------|
+| publishers  | 1582      |
+| labels      | 6         |
+| left        | 482       |
+| center      | 711       |
+| right       | 369       |
+| agree range | [0.0-1.0] |
+
+==
+
+# Experiment 2
+
+PCA + Labels
+
+![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
+
+==
+
+# Experiment 2
+
+Discussion
+
+- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
+    - Labels are (sort of) separated and clustered.
+    - Creating them for smaller publishers is trivial.
+==
+
+# Experiment 2
+
+Limitations
+
+- Dependent on accurate rating. <!-- .element: class="fragment" -->
+- Ordinal ratings not available. <!-- .element: class="fragment" -->
+- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
+- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
+- Don't know what to do with community rating. <!-- .element: class="fragment" -->
+
+===
+
+# Experiment 3
+
+Setup
+
+==
+
+# Limitations
+
+- Many different authors under the same publisher. <!-- .element: class="fragment" -->
+- Publishers use syndication. <!-- .element: class="fragment" -->
+- Bias ratings are biased. <!-- .element: class="fragment" -->
+
+===
+
+# Questions
+
+===
+
+<!-- .section: id="references" -->
+
+# References
+
+[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
+
+Note:
--- a/src/bias.py
+++ b/src/bias.py
@ -1,12 +1,12 @@
 import click
-from data import connect
+from data.main import connect
 import pandas as pd
 from lxml import etree
 from pathlib import Path
 import os
 import csv

-def map(rating:str) -> int:
+def label_to_int(rating:str) -> int:

    mapping = {
        'left' : 0,
@ -19,20 +19,18 @@ def map(rating:str) -> int:

    return mapping[rating]

+def int_to_label(class_id: int) -> str:
+    mapping = {
+            0 : 'left',
+            1 : 'left-center',
+            2 : 'center',
+            3 : 'right-center',
+            4 : 'right',
+            -1 : 'allsides',
+    }
+    return mapping[class_id]

-@click.command(name="bias:load")
-def load() -> None:
-    DB = connect()
-    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
-    f = str(DATA_DIR / "bias_ratings.csv")

-    DB.sql(f"""
-        create table bias_ratings as 
-        select 
-            row_number() over(order by b.publisher) as id
-            ,b.*
-        from read_csv_auto('{f}') b
-    """)
@click.command(name="bias:normalize")
 def normalize() -> None:
    DB = connect()
@ -41,133 +39,48 @@ def normalize() -> None:
        CREATE OR REPLACE TABLE publisher_bias AS
        WITH cte AS (
            SELECT
-                p.id
+                p.id as publisher_id
+                ,b.id as bias_id
                ,b.bias as label
                ,JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) as similarity
            FROM bias_ratings b
-            JOIN publishers p
+            JOIN top.publishers p
            ON JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) > 0.95
        ),ranked AS (
            SELECT
-                id
+                publisher_id
+                ,bias_id
                ,label
                ,similarity
-                ,ROW_NUMBER() OVER(PARTITION BY id ORDER BY similarity DESC) AS rn
+                ,ROW_NUMBER() OVER(PARTITION BY publisher_id ORDER BY similarity DESC) AS rn
            FROM cte
        )
        SELECT
-            id
+            publisher_id
            ,label
+            ,bias_id
        FROM ranked
        WHERE ranked.rn = 1
    """)

+    mapping = [
+            {'label' :'left' , 'ordinal': -2},
+            {'label' :'left-center' , 'ordinal': -1},
+            {'label' :'center' , 'ordinal': 0},
+            {'label' :'right-center' , 'ordinal': 1},
+            {'label' :'right' , 'ordinal': 2},
+    ]
+    mapping = pd.DataFrame(mapping)

-    DB.sql("""
-        with cte as (
-            select 
-                s.publisher_id
-                ,count(1) as stories
-            from stories s
-            group by s.publisher_id
-        )
-        select
-            s.publisher
-            ,s.stories
-            ,b.publisher
-            ,b.bias
-        from bias_ratings b
-        join cte s
-        on s.publisher = b.publisher
-        order by
-        stories desc
-        limit 15
+    DB.query("alter table bias_ratings add column ordinal int")
+
+    DB.query("""
+        update bias_ratings b
+        set ordinal = o.ordinal
+        FROM mapping o
+        WHERE o.label = b.bias
    """)

-    DB.sql("""
-        with cte as (
-            select 
-                s.publisher
-                ,count(1) as stories
-            from stories s
-            group by s.publisher
-        )
-        select
-            sum(stories)
-            ,avg(agree / disagree)
-        from bias_ratings b
-        join cte s
-        on s.publisher = b.publisher
-    """)
-
-    DB.sql("""
-        with cte as (
-            select 
-                s.publisher
-                ,count(1) as stories
-            from stories s
-            group by s.publisher
-        )
-        select
-            sum(s.stories) filter(where b.publisher is not null) as matched
-            ,sum(s.stories) filter(where b.publisher is null) as unmatched
-            ,cast(sum(s.stories) filter(where b.publisher is not null) as numeric)
-            / sum(s.stories) filter(where b.publisher is null) as precent_matched
-        from bias_ratings b
-        right join cte s
-        on s.publisher = b.publisher
-    """)
-
-    DB.sql("""
-        select
-        *
-        from bias_ratings
-        where publisher ilike '%CNN%'
-        """)
-
-@click.command(name='bias:debug')
-def debug() -> None:
-    DB = connect()
-    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
-    f = str(DATA_DIR / "bias_ratings.csv")
-
-    DB.sql("""
-        with cte as (
-            select 
-                outlet 
-                ,count(1) as stories
-                from stories 
-                group by outlet
-        ) 
-        ,total as (
-            select
-                sum(stories) as total
-            from cte
-        )
-        select
-            cte.outlet
-            ,cte.stories
-            ,bias.outlet
-            ,bias.lean
-            ,sum(100 * (cte.stories / cast(total.total as float))) over() as rep
-            ,total.total
-        from cte
-        join bias 
-        on jaro_winkler_similarity(bias.outlet, cte.outlet) > 0.9
-        cross join total.total
-    """)
-
-    DB.sql("""
-        select
-            outlet
-            ,count(1) as stories
-        from stories
-        group by outlet
-        order by count(1) desc
-        limit 50
-    """)
-
-    outlets

@click.command(name='bias:parse')
 def parse() -> None:
@ -199,3 +112,64 @@ def parse() -> None:
        ratings.append(rating)
    df = pd.DataFrame(ratings)
    df.to_csv(DATA_DIR / 'bias_ratings.csv', sep="|", index=False, quoting=csv.QUOTE_NONNUMERIC)
+
+@click.command(name="bias:load")
+def load() -> None:
+    DB = connect()
+    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
+    f = str(DATA_DIR / "bias_ratings.csv")
+
+    DB.sql(f"""
+        CREATE TABLE bias_ratings as 
+        select 
+            row_number() over(order by b.publisher) as id
+            ,b.*
+        from read_csv_auto('{f}') b
+    """)
+
+@click.command('bias:export')
+def export():
+    data_path = Path(os.environ['DATA_MINING_DATA_DIR'])
+
+    DB = connect()
+    all_bias = DB.query("""
+        SELECT
+            id as bias_id
+            ,publisher as name
+            ,bias as label
+        FROM bias_ratings
+        ORDER by agree desc
+    """)
+    all_bias.df().to_csv(data_path / 'TMP_publisher_bias.csv', sep="|", index=False)
+    mapped_bias = DB.query("""
+        SELECT
+            p.id as publisher_id
+            ,p.name as name
+            ,p.tld as tld
+            ,b.label as bias
+            ,b.bias_id as bias_id
+        FROM top.publishers p
+        LEFT JOIN publisher_bias b
+        ON b.publisher_id = p.id
+    """)
+    mapped_bias.df().to_csv(data_path / 'TMP_publisher_bias_to_load.csv', sep="|", index=False)
+    DB.close()
+
+@click.command('bias:import-mapped')
+def import_mapped():
+    data_path = Path(os.environ['DATA_MINING_DATA_DIR'])
+    table_name = "top.publisher_bias"
+
+    DB = connect()
+    df = pd.read_csv(data_path / 'TMP_publisher_bias_to_load.csv', sep="|")
+
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        SELECT
+            publisher_id AS publisher_id
+            ,cast(bias_id AS int) as bias_id
+        FROM df
+        WHERE bias_id IS NOT NULL
+    """)
+    print(f"created table: {table_name}")
+
--- a/src/cli.py
+++ b/src/cli.py
@ -7,7 +7,7 @@ def cli():

 if __name__ == "__main__":
    load_dotenv()
-    import scrape
+    from data import scrape
    cli.add_command(scrape.download)
    cli.add_command(scrape.parse)
    cli.add_command(scrape.load)
@ -32,4 +32,26 @@ if __name__ == "__main__":
    cli.add_command(emotion.create_table)
    import sentence
    cli.add_command(sentence.embed)
+    from train import main as train_main
+    cli.add_command(train_main.main)
+
+    import plots.descriptive as plotd
+    cli.add_command(plotd.articles_per_year)
+    cli.add_command(plotd.distinct_publishers)
+    cli.add_command(plotd.stories_per_publisher)
+    cli.add_command(plotd.top_publishers)
+    cli.add_command(plotd.common_tld)
+
+    import  links as linkcli
+    cli.add_command(linkcli.create_table)
+    cli.add_command(linkcli.create_pca)
+    cli.add_command(linkcli.create_clusters)
+
+    import  plots.links as plotl
+    cli.add_command(plotl.elbow)
+    cli.add_command(plotl.link_pca_clusters)
+
+    import  plots.classifier as plotc
+    cli.add_command(plotc.pca_with_classes)
+
    cli()
--- a/src/data/init.py
+++ b/src/data/init.py
@ -0,0 +1,6 @@
+import data.main
+import data.scrape
+__all__ = [
+    'main'
+    ,'scrape'
+]
--- a/src/data/main.py
+++ b/src/data/main.py
--- a/src/data/scrape.py
+++ b/src/data/scrape.py
@ -4,10 +4,12 @@ import requests
 from pathlib import Path
 import click
 from tqdm import tqdm
-from data import data_dir, connect
+from data.main import data_dir, connect
 from lxml import etree
 import pandas as pd
 from urllib.parse import urlparse
+from tld import get_tld
+from tld.utils import update_tld_names

@click.command(name='scrape:load')
@click.option('--directory', type=Path, default=data_dir(), show_default=True)
@ -61,6 +63,7 @@ def download(output_dir):
@click.option('-o', '--output_dir', type=Path, default=data_dir(), show_default=True)
 def parse(directory, output_dir):
    """parse the html files on disk into a structured csv format."""
+    update_tld_names()
    directory = data_dir() / "memeorandum"
    parser = etree.HTMLParser()
    pages = [f for f in directory.glob("*.html")]
@ -104,8 +107,7 @@ def parse(directory, output_dir):

            url = item.xpath('.//strong/a')[0].get('href')
            out['url'] = url
-            out['publisher_url_domain'] = urlparse(publisher_url).netloc
-            out['domain'] = urlparse(url).netloc
+            out['tld'] = get_tld(publisher_url)

            item_id = hash((page.stem, url))
            out['id'] = item_id
@ -225,3 +227,111 @@ def normalize():
        alter table related_stories drop publisher_domain;
    """)

+
+def another_norm():
+    sv2 = pd.read_csv(data_dir / 'stories.csv', sep="|")
+    related = pd.read_csv(data_dir / 'related.csv', sep="|")
+
+    related['tld'] = related.url.apply(lambda x: map_tld(x))
+
+    DB.query("""
+        update related_stories
+        set publisher_id = p.id
+        from publishers p
+        join related r
+        on r.tld = p.tld
+        where r.url = related_stories.url
+    """)
+
+
+    DB.query("""alter table stories add column tld text""")
+
+    s_url = DB.query("""
+    select
+        id
+        ,url
+        from stories
+    """).df()
+
+
+    s_url['tld'] = s_url.url.apply(lambda x: map_tld(x))
+
+    DB.query("""
+        update stories
+        set tld = s_url.tld
+        from s_url
+        where s_url.id = stories.id
+    """)
+
+    DB.query("""
+        update stories
+        set publisher_id = p.id
+        from publishers p
+        where p.tld = stories.tld
+    """)
+
+
+    select
+    DB.query("""
+        update stories
+        set stories.publisher_id = p.id
+        from new_pub
+    """)
+    sv2['tld'] = sv2.publisher_url.apply(lambda x: map_tld(x))
+
+
+    new_pub = DB.query("""
+        with cte as (
+            select
+                tld
+                ,publisher
+                ,count(1) filter(where year(published_at) = 2022) as recent_ctn
+                ,count(1) as ctn
+            from sv2
+            group by
+                tld
+                ,publisher
+        )
+        ,r as (
+        select
+            tld
+            ,publisher
+            ,ctn
+            ,row_number() over(partition by tld order by recent_ctn desc) as rn
+        from cte
+        )
+        select
+            row_number() over() as id
+            ,publisher as name
+            ,tld
+        from r
+        where rn = 1
+        order by ctn desc
+    """).df()
+
+    DB.query("""
+        CREATE OR REPLACE TABLE publishers AS
+        SELECT
+            id
+            ,name
+            ,tld
+        FROM new_pub
+    """)
+
+
+    def map_tld(x):
+        try:
+            res = get_tld(x, as_object=True)
+            return res.fld
+        except:
+            return None
+
+    DB.sql("""
+        SELECT
+            s.id
+            ,sv2.publisher_url
+        FROM stories s
+        JOIN sv2
+        on sv2.id = s.id
+        limit 5
+    """)
--- a/src/emotion.py
+++ b/src/emotion.py
@ -6,7 +6,7 @@ import numpy as np

 from transformers import BertTokenizer
 from model import BertForMultiLabelClassification
-from data import connect, data_dir
+from data.main import connect, data_dir
 import seaborn as sns
 import matplotlib.pyplot as plt
 from matplotlib.dates import DateFormatter
@ -376,3 +376,99 @@ def debug():
    DB.close()

    out.to_csv(data_dir() / 'emotions.csv', sep="|")
+
+def another():
+    DB = connect()
+    DB.sql("""
+        select
+            *
+        from emotions
+    """)
+
+    emotions = DB.sql("""
+        select
+            year(s.published_at) as year
+            ,se.label as emotion
+            ,count(1) as stories
+        from stories s
+        join story_emotions se
+        on s.id = se.story_id
+        group by
+            year(s.published_at)
+            ,se.label
+    """).df()
+
+    sns.scatterplot(x=emotions['year'], y=emotions['stories'], hue=emotions['emotion'])
+    plt.show()
+
+    pivot = emotions.pivot(index='year', columns='emotion', values='stories')
+    pivot.reset_index(inplace=True)
+    from sklearn.linear_model import LinearRegression
+    reg = LinearRegression()
+
+    for emotion in pivot.keys()[1:].tolist():
+        _ = reg.fit(pivot['year'].to_numpy().reshape(-1, 1), pivot[emotion])
+        print(f"{emotion}: {reg.coef_[0]}")
+
+    fig, ax = plt.subplots()
+    #sns.lineplot(x=pivot['anger'], y=pivot['joy'])
+    #sns.lineplot(x=pivot['anger'], y=pivot['surprise'], ax=ax)
+    sns.lineplot(x=pivot['anger'], y=pivot['fear'], ax=ax)
+    sns.lineplot(x=pivot[''], y=pivot['fear'], ax=ax)
+    plt.show()
+
+    DB.close()
+
+    normalized = DB.sql("""
+        with cte as (
+            select
+                year(s.published_at) as year
+                ,se.label as emotion
+                ,b.label as bias
+            from stories s
+            join story_emotions se
+            on s.id = se.story_id
+            join publisher_bias b
+            on b.id = s.publisher_id
+            where b.label != 'allsides'
+            and se.label != 'neutral'
+        )
+        select
+            distinct
+            year
+            ,emotion
+            ,bias
+            ,cast(count(1) over(partition by year, bias, emotion) as float) / count(1) over(partition by year, bias) as group_count
+        from cte
+    """).df()
+
+    DB.sql("""
+        select
+            b.label as bias
+            ,count(1) as stories
+        from stories s
+        join story_emotions se
+        on s.id = se.story_id
+        join publisher_bias b
+        on b.id = s.publisher_id
+        group by
+            b.label
+    """).df()
+
+    another_pivot = emotional_bias.pivot(index=['bias', 'year'], columns='emotion', values='stories')
+    another_pivot.reset_index(inplace=True)
+
+    sns.lineplot(data=normalized, x='year', y='group_count', hue='bias', style='emotion')
+    plt.show()
+
+    sns.relplot(
+        data=normalized, x="year", y="group_count", hue="emotion", col='bias', kind="line"
+        #data=normalized, x="year", y="group_count", hue="emotion", col='bias', kind="line", facet_kws=dict(sharey=False)
+    )
+    plt.show()
+
+    DB.sql("""
+    select
+    *
+    from another_pivot
+    """)
--- a/src/lib.py
+++ b/src/lib.py
@ -1,8 +0,0 @@
-import sklearn
-import polars as pl
-import toml
-from pathlib import Path
-
-config = toml.load('/home/user/577/repo/config.toml')
-app_dir = Path(config.get('app').get('path'))
-df = pl.read_csv(app_dir / "data/articles.csv")
--- a/src/links.py
+++ b/src/links.py
@ -1,12 +1,148 @@
-from data import connect
+import click
+from data.main import connect
 import pandas as pd
 import numpy as np
-from sklearn.decomposition import PCA, TruncatedSVD
-from sklearn.cluster import MiniBatchKMeans
 import seaborn as sns
 import matplotlib.pyplot as plt


+@click.command('links:create-table')
+def create_table():
+
+    table_name = "top.link_edges"
+    DB = connect()
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        with cte as(
+            SELECT 
+                s.publisher_id as parent_id
+                ,r.publisher_id as child_id
+                ,count(1) as links
+            FROM top.stories s
+            JOIN top.related_stories r
+            ON s.id = r.parent_id
+            group by 
+                s.publisher_id
+                ,r.publisher_id
+        )
+        SELECT
+            cte.parent_id
+            ,cte.child_id
+            ,cte.links as links
+            ,cast(cte.links as float) / sum(cte.links) over(partition by cte.parent_id) as normalized
+            ,case when cte.links > 0 then 1 else 0 end as onehot
+        FROM cte
+        WHERE cte.child_id in (
+            SELECT
+                distinct parent_id
+            FROM cte
+        )
+        AND cte.parent_id in (
+            SELECT
+                distinct child_id
+            FROM cte
+        )
+    """)
+    DB.close()
+
+    DB = connect()
+    DB.query("""
+        SELECT
+        *
+        ,-log10(links)
+        --distinct parent_id
+        FROM top.link_edges e
+        WHERE e.parent_id = 238
+    """)
+    DB.close()
+    print(f"created {table_name}")
+
+@click.command('links:create-pca')
+@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
+def create_pca(source):
+    """create 2D pca labels"""
+
+    from sklearn.decomposition import PCA
+
+    table_name = f"top.publisher_pca_{source}"
+    DB = connect()
+    pub = DB.query("""
+        SELECT
+            *
+        FROM top.publishers
+    """).df()
+    df = DB.query(f"""
+        SELECT
+            parent_id
+            ,child_id
+            ,{source} as links
+        FROM top.link_edges
+    """).df()
+    DB.close()
+    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+
+    svd = PCA(n_components=2)
+    svd_out = svd.fit_transform(pivot)
+
+    out = pivot.reset_index()[['parent_id']]
+    out['first'] = svd_out[:, 0]
+    out['second'] = svd_out[:, 1]
+    out = pd.merge(out, pub, left_on='parent_id', right_on='id')
+
+    DB = connect()
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        SELECT
+            out.id as publisher_id
+            ,out.first as first
+            ,out.second as second
+        FROM out
+    """)
+    DB.close()
+    print(f"created {table_name}")
+
+
+@click.command('links:create-clusters')
+@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
+def create_clusters(source):
+    from sklearn.cluster import KMeans
+
+    table_name = f"top.publisher_clusters_{source}"
+    DB = connect()
+    df = DB.query(f"""
+        SELECT
+            parent_id
+            ,child_id
+            ,{source} as links
+        FROM top.link_edges
+    """).df()
+    pub = DB.query("""
+        SELECT
+            *
+        FROM top.publishers
+    """).df()
+    DB.close()
+    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+
+
+    k = 8
+    kmeans = KMeans(n_clusters=k, n_init="auto")
+    pred = kmeans.fit_predict(pivot)
+    out = pivot.reset_index()[['parent_id']]
+    out['label'] = pred
+    out = pd.merge(out, pub, left_on='parent_id', right_on='id')
+    new_table = out[['id', 'label']]
+
+    DB = connect()
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        SELECT
+            n.id as publisher_id
+            ,n.label as label
+        FROM new_table n
+    """)
+    DB.close()
+    print(f"created {table_name}")

 def to_matrix():
    """returns an adjacency matrix of publishers to publisher link frequency"""
@ -21,6 +157,7 @@ def to_matrix():
        {'label' :'right', 'value' : 4},
        {'label' :'allsides', 'value' : -1},
    ])
+
    bias = DB.sql("""
        SELECT
            b.id
@ -37,11 +174,7 @@ def to_matrix():
                p.id
                ,p.name
                ,p.url
-                ,b.label
-                ,b.value
            from publishers p
-            left join bias b
-            on b.id = p.id
    """).df()

    edges = DB.sql("""
@ -81,12 +214,23 @@ def to_matrix():
        ON p.id = cte.parent_id
    """).df()

+    # only keep values that have more than 1 link
+    test = edges[edges['links'] > 2].pivot(index='parent_id', columns='child_id', values='links').fillna(0).reset_index()
+    edges.dropna().pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+    pd.merge(adj, pub, how='left', left_on='parent_id', right_on='id')
    adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+    adj.values.shape

    
    out = pd.DataFrame(adj.index.values, columns=['id'])
    out = pd.merge(out, pub, how='left', on='id')
+    return out

+@click.command('links:analysis')
+def analysis():
+    from sklearn.decomposition import PCA, TruncatedSVD
+    from sklearn.cluster import MiniBatchKMeans
+    adj = to_matrix()
    pca = PCA(n_components=4)
    pca_out = pca.fit_transform(adj)

--- a/src/mine.py
+++ b/src/mine.py
@ -1,4 +1,4 @@
-from data import data_dir, connect
+from data.main import data_dir, connect
 import numpy as np
 import sklearn
 from sklearn.cluster import MiniBatchKMeans
--- a/src/plots/init.py
+++ b/src/plots/init.py
--- a/src/plots/classifier.py
+++ b/src/plots/classifier.py
@ -0,0 +1,34 @@
+import click
+from data.main import connect
+import os
+import seaborn as sns
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'
+
+@click.command('plot:pca-with-classes')
+def pca_with_classes():
+    filename = "pca_with_classes.png"
+
+    DB = connect()
+    data = DB.query(f"""
+        SELECT
+            p.tld
+            ,b.bias
+            ,c.first
+            ,c.second
+            ,round(cast(b.agree as float) / (b.agree + b.disagree), 2) ratio
+        FROM top.publishers p
+        JOIN top.publisher_bias pb
+        ON p.id = pb.publisher_id
+        JOIN bias_ratings b
+        ON b.id = pb.bias_id
+        JOIN top.publisher_pca_normalized c
+        ON c.publisher_id = p.id
+    """).df()
+    DB.close()
+    ax = sns.scatterplot(x=data['first'], y=data['second'], hue=data['bias'], s=100)
+    ax.set(title="pca components vs. bias labels", xlabel="first pca component", ylabel="second pca component")
+    plt.savefig(out_dir / filename)
+    print(f"saved: {filename}")
--- a/src/plots/descriptive.py
+++ b/src/plots/descriptive.py
@ -0,0 +1,302 @@
+import click
+from data.main import connect
+import os
+import seaborn as sns
+import matplotlib.pyplot as plt
+from pathlib import Path
+import numpy as np
+
+out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'
+
+@click.command('plot:articles-per-year')
+def articles_per_year():
+    filename = 'articles_per_year.png'
+
+    DB = connect()
+    data = DB.query("""
+        select
+            year(published_at) as year
+            ,count(1) as stories
+        from stories
+        group by 
+            year(published_at)
+    """).df()
+    DB.close()
+
+    ax = sns.barplot(x=data.year, y=data.stories, color='tab:blue')
+    ax.tick_params(axis='x', rotation=90)
+    ax.set(title="count of articles per year", ylabel="count of stories (#)")
+    plt.tight_layout()
+    plt.savefig(out_dir / filename)
+
+@click.command('plot:distinct-publishers')
+def distinct_publishers():
+    filename = 'distinct_publishers.png'
+
+    DB = connect()
+    data = DB.query("""
+        select
+            year(published_at) as year
+            ,count(distinct publisher_id) as publishers
+        from stories
+        group by 
+            year(published_at)
+    """).df()
+    DB.close()
+
+    ax = sns.barplot(x=data.year, y=data.publishers, color='tab:blue')
+    ax.tick_params(axis='x', rotation=90)
+    ax.set(title="count of publishers per year", ylabel="count of publishers (#)")
+    plt.tight_layout()
+    plt.savefig(out_dir / filename)
+    plt.close()
+
+@click.command('plot:stories-per-publisher')
+def stories_per_publisher():
+    filename = 'stories_per_publisher.png'
+
+    DB = connect()
+    data = DB.query("""
+        with cte as (
+        select
+            publisher_id
+            ,year(published_at) as year
+            ,count(1) as stories
+        from stories
+        group by 
+            publisher_id
+            ,year(published_at)
+        ) , agg as (
+            select
+                publisher_id
+                ,avg(stories) as stories_per_year
+                ,case 
+                    when avg(stories) < 2 then 2
+                    when avg(stories) < 4 then 4
+                    when avg(stories) < 8 then 8
+                    when avg(stories) < 16 then 16
+                    when avg(stories) < 32 then 32
+                    when avg(stories) < 64 then 64
+                    when avg(stories) < 128 then 128
+                    else 129
+                end as max_avg
+            from cte
+            group by 
+                publisher_id
+        )
+        select
+            max_avg
+            ,count(1) as publishers
+        from agg
+        group by
+            max_avg
+    """).df()
+    DB.close()
+
+    ax = sns.barplot(x=data.max_avg, y=data.publishers, color='tab:blue')
+    ax.set(title="histogram of publisher stories per year", ylabel="count of publishers (#)", xlabel="max average stories / year")
+    plt.tight_layout()
+    plt.savefig(out_dir / filename)
+    plt.close()
+
+
+@click.command('plot:top-publishers')
+def top_publishers():
+    """plot top publishers over time"""
+
+    filename = 'top_publishers.png'
+
+    DB = connect()
+    data = DB.query("""
+        select
+            p.tld
+            ,year(published_at) as year
+            ,count(1) as stories
+        from (
+            select
+                p.tld
+                ,p.id
+            from top.publishers p
+            join top.stories s
+            on s.publisher_id = p.id
+            group by
+                p.tld
+                ,p.id
+            order by count(1) desc
+            limit 20
+        ) p
+        join top.stories s
+        on s.publisher_id = p.id
+        group by 
+            p.tld
+            ,year(published_at)
+        order by count(distinct s.id) desc
+    """).df()
+    DB.close()
+    
+    pivot = data.pivot(columns='year', index='tld', values='stories')
+    ax = sns.heatmap(pivot, cmap="crest")
+    ax.set(title="top 20 publishers (by tld)", ylabel="tld", xlabel="stories / year (#)")
+    plt.tight_layout()
+    plt.savefig(out_dir / filename)
+    plt.close()
+
+
+@click.command('plot:common_tld')
+def common_tld():
+    import dataframe_image as dfi
+    filename = 'common_tld.png'
+
+    DB = connect()
+    data = DB.query("""
+        select
+            split_part(url, '.', -1) as tld
+            ,count(1) as publishers
+            ,case when count(1) < 20
+                then string_agg(distinct url, '\t')
+                else NULL
+                end as urls
+        from publishers
+        group by
+            split_part(url, '.', -1)
+        order by 
+            count(1) desc
+    """).df()
+    DB.close()
+    data[:15][['tld', 'publishers']].style.hide(axis="index").export_png(out_dir / filename, table_conversion='matplotlib')
+
+def stats():
+
+    # raw
+    DB.query("""
+        SELECT
+            'total stories' as key
+            ,COUNT(1) as value
+        FROM stories
+        UNION
+        SELECT
+            'total related' as key
+            ,COUNT(1) as value
+        FROM related_stories
+        UNION
+        SELECT
+            'top level domains' as key
+            ,COUNT(distinct tld) as value
+        FROM stories
+        UNION
+        SELECT
+            'publishers' as key
+            ,COUNT(1) as value
+        FROM publishers
+        UNION
+        SELECT
+            'authors' as key
+            ,COUNT(distinct author) as value
+        FROM stories
+        UNION
+        SELECT
+            'min year' as key
+            ,min(year(published_at)) as value
+        FROM stories
+        UNION
+        SELECT
+            'max year' as key
+            ,max(year(published_at)) as value
+        FROM stories
+    """).df().to_markdown(index=False)
+
+    # selected
+    DB.query("""
+        SELECT
+            'total stories' as key
+            ,COUNT(1) as value
+        FROM top.stories
+        UNION
+        SELECT
+            'total related' as key
+            ,COUNT(1) as value
+        FROM top.related_stories
+        UNION
+        SELECT
+            'top level domains' as key
+            ,COUNT(distinct tld) as value
+        FROM top.stories
+        UNION
+        SELECT
+            'publishers' as key
+            ,COUNT(1) as value
+        FROM top.publishers
+        UNION
+        SELECT
+            'authors' as key
+            ,COUNT(distinct author) as value
+        FROM top.stories
+        UNION
+        SELECT
+            'min year' as key
+            ,min(year(published_at)) as value
+        FROM top.stories
+        UNION
+        SELECT
+            'max year' as key
+            ,max(year(published_at)) as value
+        FROM top.stories
+    """).df().to_markdown(index=False)
+
+@click.command('plot:bias-stats')
+def bias_stats():
+    import dataframe_image as dfi
+    filename = 'bias_stats.png'
+
+    DB = connect()
+
+    df = DB.query("""
+        SELECT
+            string_agg(distinct bias)
+        FROM bias_ratings
+    """).df()
+
+    df.keys()
+
+    df = DB.query("""
+        SELECT
+            'publishers' as metric
+            ,count(1) as value
+        FROM bias_ratings
+        UNION
+        SELECT
+            'labels' as metric
+            ,count(distinct bias) as value
+        FROM bias_ratings
+        UNION
+        SELECT
+            'right' as metric
+            ,count(1) as value
+        FROM bias_ratings
+        WHERE bias in ('right', 'right-center')
+        UNION
+        SELECT
+            'left' as metric
+            ,count(1) as value
+        FROM bias_ratings
+        WHERE bias in ('left', 'left-center')
+        UNION
+        SELECT
+            'center' as metric
+            ,count(1) as value
+        FROM bias_ratings
+        WHERE bias in ('center')
+        UNION
+        SELECT
+            'agree_range' as metric
+            ,'[' 
+                || min(cast(agree as float) / (agree + disagree)) 
+                || '-' 
+                || max(cast(agree as float) / (agree + disagree)) 
+                || ']'
+                as value
+        FROM bias_ratings
+        WHERE bias in ('center')
+    """).df()
+    DB.close()
+    print(df.to_markdown(index=False))
--- a/src/plots/links.py
+++ b/src/plots/links.py
@ -0,0 +1,114 @@
+import click
+from data.main import connect
+from links import to_matrix
+import os
+import seaborn as sns
+import matplotlib.pyplot as plt
+from pathlib import Path
+import numpy as np
+from sklearn.metrics import silhouette_score
+import pandas as pd
+
+out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'
+
+@click.command('plot:link-elbow')
+def elbow():
+    from sklearn.cluster import KMeans
+
+    filename = 'link_cluster_elbow.png'
+
+    DB = connect()
+    df = DB.query("""
+        SELECT
+            *
+        FROM link_edges
+    """).df()
+    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+
+    to_plot = []
+    for k in range(2, 15):
+        kmeans = KMeans(n_clusters=k, n_init="auto")
+        kmeans.fit(pivot)
+        label = kmeans.labels_
+        coeff = silhouette_score(pivot, label, metric='euclidean')
+        to_plot.append({'k': k, 'inertia' : kmeans.inertia_, 'coeff': coeff})
+    to_plot = pd.DataFrame(to_plot)
+
+    ax = sns.lineplot(x=to_plot.k, y=to_plot.inertia)
+    ax.set(title="elbow criterion plot of clusters", xlabel="bin size (k)", ylabel="sum of squared distances between centroids/points")
+    plt.savefig(out_dir / filename)
+    plt.close()
+
+    # randomly pick 8
+
+@click.command('plot:link-pca-clusters')
+@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
+def link_pca_clusters(source):
+
+    filename = f"link_pca_clusters_{source}.png"
+
+    DB = connect()
+    df = DB.query(f"""
+        SELECT
+            c.label as cluster
+            ,p.tld
+            --,b.label as bias
+            ,pca.first
+            ,pca.second
+            ,s.cnt as stories
+        FROM top.publisher_clusters_{source} c
+        JOIN top.publishers p
+        ON c.publisher_id = p.id
+        JOIN 
+        (
+            select
+                s.publisher_id
+                ,count(1) as cnt
+            FROM top.stories s
+            GROUP BY
+                s.publisher_id
+        ) s
+        ON s.publisher_id = p.id
+        JOIN top.publisher_pca_{source} pca
+        ON pca.publisher_id = p.id
+    """).df()
+    DB.close()
+
+    ax = sns.scatterplot(x=df['first'], y=df['second'], hue=df['cluster'])
+    ax.set(title=f"pca components vs. clusters ({source})", xlabel="first pca component", ylabel="second pca component")
+    plt.savefig(out_dir / filename)
+
+    # .df().groupby(['cluster', 'bias']).describe()
+
+
+
+
+def test():
+    data_dir = Path(os.getenv('DATA_MINING_DATA_DIR'))
+
+    DB.query("""
+        SELECT
+            p.id as publisher_id
+            ,p.name
+            ,p.tld
+            ,cast(b.bias_id as int) as bias_id
+            ,count(1) as stories
+        FROM publishers p
+        JOIN stories s
+        ON s.publisher_id = p.id
+        JOIN publisher_clusters c
+        ON c.publisher_id = p.id
+        LEFT JOIN publisher_bias b
+        ON b.publisher_id = p.id
+        where bias_id is null
+        group by
+            p.id
+            ,p.name
+            ,p.tld
+            ,b.bias_id
+        ORDER BY count(1) desc
+    """)
+
+    # .df().to_csv(data_dir / 'cluster_publishers.csv', sep="|", index=False)
+    DB.close()
+
--- a/src/selection.py
+++ b/src/selection.py
@ -0,0 +1,48 @@
+from data.main import connect
+import pandas as pd
+import numpy as np
+
+DB = connect()
+edges = DB.query("""
+    select
+    *
+    from link_edges
+""").df()
+DB.close()
+
+edges
+
+adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+select_publishers = pd.DataFrame(adj.index.tolist(), columns=['publisher_id'])
+
+DB = connect()
+DB.query("create schema top")
+
+DB.query("""
+    CREATE OR REPLACE TABLE top.publishers AS
+    SELECT
+        p.*
+    FROM publishers p
+    JOIN select_publishers s
+    ON s.publisher_id = p.id
+""")
+
+DB.query("""
+    CREATE OR REPLACE TABLE top.stories AS
+    SELECT
+        s.*
+    FROM stories s
+    JOIN top.publishers p
+    ON s.publisher_id = p.id
+    WHERE year(s.published_at) >= 2006
+    AND year(s.published_at) < 2023
+""")
+
+DB.query("""
+    CREATE OR REPLACE TABLE top.related_stories AS
+    SELECT
+        r.*
+    FROM top.stories s
+    JOIN related_stories r
+    ON s.id = r.parent_id
+""")
--- a/src/sentence.py
+++ b/src/sentence.py
@ -0,0 +1,138 @@
+from transformers import AutoTokenizer, AutoModel
+import torch
+import torch.nn.functional as F
+from data.main import connect, data_dir
+import os
+from pathlib import Path
+import numpy as np
+import pandas as pd
+from tqdm import tqdm
+import click
+
+#Mean Pooling - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+@click.option('-c', '--chunks', type=int, default=500, show_default=True)
+@click.command("sentence:embed")
+def embed(chunks):
+
+    # Load model from HuggingFace Hub
+    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+
+    # load data
+    DB = connect()
+    table = DB.sql("""
+        select
+        id
+        ,title
+        from stories
+        order by id desc
+    """).df()
+    DB.close()
+
+    # normalize text
+    table['title'] = table['title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
+
+
+    chunked = np.array_split(table, chunks)
+    # generate embeddings from list of titles
+    iterator = tqdm(chunked, 'embedding')
+    embeddings = []
+    embedding_ids = []
+    for _, chunk in enumerate(iterator):
+        sentences = chunk['title'].tolist()
+        ids = chunk['id'].tolist()
+        # Tokenize sentences
+        encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+        # Compute token embeddings
+        with torch.no_grad():
+            model_output = model(**encoded_input)
+        # Perform pooling
+        output = mean_pooling(model_output, encoded_input['attention_mask'])
+        # Normalize embeddings
+        output = F.normalize(output, p=2, dim=1)
+        embeddings.append(output)
+        embedding_ids.append(ids)
+
+    embeddings = np.concatenate(embeddings)
+    ids = np.concatenate(embedding_ids)
+
+    # save embeddings
+    save_to = data_dir() / 'embeddings.npy'
+    np.save(save_to, embeddings)
+    print(f"embeddings saved: {save_to}")
+
+    # save ids
+    save_to = data_dir() / 'embedding_ids.npy'
+    np.save(save_to, ids)
+    print(f"ids saved: {save_to}")
+
+
+@click.command('sentence:create-pca-table')
+def create_table():
+    from sklearn import linear_model
+    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))
+
+    embeddings = np.load(data_path / 'embeddings.npy')
+    embedding_ids = np.load(data_path / 'embedding_ids.npy')
+    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()
+    DB = connect()
+
+    data = DB.query("""
+        SELECT
+            ids.index
+            ,s.id
+            ,b.ordinal
+        FROM ids
+        JOIN top.stories s
+        ON ids.story_id = s.id
+        JOIN top.publisher_bias pb
+        ON pb.publisher_id = s.publisher_id
+        JOIN bias_ratings b
+        ON b.id = pb.bias_id
+    """).df()
+
+    x = embeddings[data['index']]
+    y = data['ordinal'].to_numpy().reshape(-1, 1)
+
+    reg = linear_model.LinearRegression()
+
+    reg.fit(x, y)
+
+    reg.coef_.shape
+
+@click.command('sentence:create-svm-table')
+def create_svm_table():
+    from sklearn import svm
+    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))
+
+    embeddings = np.load(data_path / 'embeddings.npy')
+    embedding_ids = np.load(data_path / 'embedding_ids.npy')
+    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()
+
+    DB = connect()
+    data = DB.query("""
+        SELECT
+            ids.index
+            ,s.id
+            ,b.ordinal
+        FROM ids
+        JOIN top.stories s
+        ON ids.story_id = s.id
+        JOIN top.publisher_bias pb
+        ON pb.publisher_id = s.publisher_id
+        JOIN bias_ratings b
+        ON b.id = pb.bias_id
+    """).df()
+
+    x = embeddings[data['index']]
+    #y = data['ordinal'].to_numpy().reshape(-1, 1)
+    y = data['ordinal']
+
+    clf = svm.SVC()
+    pred = clf.fit(x, y)
+
--- a/src/train/init.py
+++ b/src/train/init.py
@ -0,0 +1,5 @@
+import train.main
+
+__all__ = [
+    'main'
+]
--- a/src/train/dataset.py
+++ b/src/train/dataset.py
@ -0,0 +1,38 @@
+from torch.utils.data import Dataset
+from data.main import connect, data_dir
+from bias import label_to_int
+import numpy as np
+import pandas as pd
+
+class NewsDataset(Dataset):
+    def __init__(self):
+        self.embeddings = np.load(data_dir() / 'embeddings.npy')
+        embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()
+
+        DB = connect()
+        query = """
+            SELECT
+                s.id
+                ,b.label
+                ,count(1) over (partition by publisher_id) as stories
+            FROM stories s
+            JOIN publisher_bias b
+            ON b.id = s.publisher_id
+            WHERE b.label != 'allsides'
+        """
+        data = DB.sql(query).df()
+        DB.close()
+
+        data['label'] = data['label'].apply(lambda x: label_to_int(x))
+        data = data.merge(embedding_ids)
+        self.data = data
+
+    def __len__(self):
+        return len(self.data)
+
+    def __getitem__(self, idx):
+        row = self.data.iloc[idx]
+        y = row['label']
+        # x = np.concatenate((self.embeddings[row['index']], [row['stories']])).astype(np.float32)
+        x = self.embeddings[row['index']]
+        return x, y
--- a/src/train/main.py
+++ b/src/train/main.py
@ -0,0 +1,132 @@
+import click
+from tqdm import tqdm
+from enum import Enum, auto
+from dotenv import load_dotenv
+import os
+
+import torch
+from torch import nn
+from torch import optim
+from torch.utils.data import DataLoader
+from accelerate import Accelerator
+
+from train.dataset import NewsDataset
+from train.model import Classifier
+#from model.linear import LinearClassifier
+
+class Stage(Enum):
+    TRAIN = auto()
+    DEV = auto()
+
+@click.command('train:main')
+def main():
+    dev_after = 20
+    visible_devices = None
+    lr = 1e-4
+    epochs = 10
+    debug = False
+    torch.manual_seed(0)
+    num_workers = 0
+
+    embedding_length = int(os.getenv('EMBEDDING_LENGTH', 384))
+
+    dataset = NewsDataset()
+    trainset, devset = torch.utils.data.random_split(dataset, [0.8, 0.2])
+    batch_size = 512
+    trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers, drop_last=True)
+    devloader = DataLoader(devset, shuffle=False, num_workers=num_workers)
+    accelerator = Accelerator()
+    model = Classifier(embedding_length=embedding_length, classes=5)
+
+    # it's possible to control which GPUs the process can see using an environmental variable
+    if visible_devices:
+        os.environ['CUDA_VISIBLE_DEVICES'] = visible_devices
+    if debug:
+        os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
+        #accelerator.log({"message" :"debug enabled"})
+
+    criterion = torch.nn.CrossEntropyLoss()
+    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
+
+    # wrap objects with accelerate
+    model, optimizer, trainloader, devloader = accelerator.prepare(model, optimizer, trainloader, devloader)
+    def run():
+        """runner for training and valdating"""
+        running_loss = 0.0
+        # set the model to train model
+        model.train() if stage == Stage.TRAIN else model.eval()
+        dataloader = trainloader if stage == Stage.TRAIN else devloader
+        desc = 'train epoch' if stage == Stage.TRAIN else 'dev epoch'
+        if debug:
+            ...
+
+        # Make sure there are no leftover gradients before starting training an epoch
+        optimizer.zero_grad()
+
+        for batch, (x, y) in enumerate(tqdm(dataloader, desc=desc)):
+            pred_y = model(x)  # Forward pass through model
+            loss = criterion(pred_y, y)
+            running_loss += loss                        # Increment running loss
+            # Only update model weights on training
+            if stage == Stage.TRAIN:
+                accelerator.backward(loss)             # Increment gradients within model by sending loss backwards
+                optimizer.step()                       # Update model weights
+                optimizer.zero_grad()                  # Reset gradients to 0
+        return running_loss / len(dataloader)
+
+
+    for epoch in range(epochs):
+        if (epoch - 1) % dev_after == 0:
+            stage = Stage.DEV
+            log = run()
+            print(f"dev loss: {log}")
+        else:
+            stage = Stage.TRAIN
+            log = run()
+            print(f"train loss: {log}")
+
+    breakpoint()
+    from data.main import data_dir, connect
+    import numpy as np
+    import pandas as pd
+    from bias import int_to_label
+
+    embeddings = dataset.embeddings
+    embedding_ids = dataset.data
+
+    DB = connect()
+    query = """
+        SELECT
+            s.id
+            ,title
+            ,p.name
+            ,count(1) over (partition by publisher_id) as stories
+        FROM stories s
+        JOIN publishers p
+        on p.id = s.publisher_id
+        WHERE s.publisher_id NOT IN (
+            SELECT
+                id 
+            FROM publisher_bias b
+        )
+    """
+    data = DB.sql(query).df()
+    embeddings = np.load(data_dir() / 'embeddings.npy')
+    embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()
+
+
+    for i in range(10):
+        embedding =  embeddings[embedding_ids[embedding_ids['id'] == data.iloc[i]['id']]['index']]
+        title = data.iloc[i]['title']
+        publisher = data.iloc[i]['name']
+        class_pred = nn.functional.softmax( model(torch.tensor(embedding))).detach()
+        class_id = int(torch.argmax(nn.functional.softmax( model(torch.tensor(embedding))).detach()))
+        print(f"{publisher}: {int_to_label(class_id)} - \"{title}\"")
+
+    embedding_ids['id'] == data.iloc[0]['id']
+    embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]
+    embedding =  embeddings[embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]['index']]
+    title
+    publisher
+
+    model().get_last_layer(torch.tensor(embedding))
--- a/src/train/model.py
+++ b/src/train/model.py
@ -0,0 +1,28 @@
+from torch import nn
+
+class Classifier(nn.Module):
+    def __init__(self, embedding_length: int, classes: int):
+        super().__init__()
+        out_len = 16
+        self.stack = nn.Sequential(
+            nn.Linear(embedding_length, 256),
+            nn.ReLU(),
+            nn.Linear(256, 256),
+            nn.ReLU(),
+            nn.Linear(256, 64),
+            nn.ReLU(),
+            nn.Linear(64, 64),
+            nn.ReLU(),
+            nn.Linear(64, out_len),
+            nn.ReLU(),
+        )
+        self.logits = nn.Linear(out_len, classes)
+
+    def forward(self, x):
+        x = self.stack(x)
+        self.last_hidden_layer = x.detach()
+        return self.logits(x)
+
+    def get_last_layer(self, x):
+        x = self.stack(x)
+        return x
--- a/src/word.py
+++ b/src/word.py
@ -1,7 +1,7 @@
 import click
 from transformers import AutoTokenizer, RobertaModel
 import numpy as np
-from data import Data, from_db, connect, data_dir
+from data.main import Data, from_db, connect, data_dir
 from tqdm import tqdm
 import torch
 from pathlib import Path