v1.0 of presentation.

matt 2023-05-17 13:38:07 -07:00
parent 4d93cf7adb
commit 74c2d8afa2
37 changed files with 1959 additions and 144 deletions

.gitignore (vendored, +5)

@@ -2,3 +2,8 @@
 *.swp
 __pycache__
 tmp.py
+.env
+*.aux
+*.log
+*.out
+tmp.*

Makefile (new file, +11)

@@ -0,0 +1,11 @@
.PHONY: to_wwu
all: to_wwu
to_wwu:
	rsync -avz ~/577/repo/docs/figures/ linux-04:/home/jensen33/Dev/studentweb/assets/static/577/
	scp ~/577/repo/docs/presentation.md linux-04:/home/jensen33/Dev/studentweb/content/577/contents.lr
	scp ~/Dev/www.publicmatt.com/models/slides.ini linux-04:/home/jensen33/Dev/studentweb/models/
	scp ~/Dev/www.publicmatt.com/templates/slides.html linux-04:/home/jensen33/Dev/studentweb/templates/
	rsync -avz ~/Dev/www.publicmatt.com/assets/static/revealjs linux-04:/home/jensen33/Dev/studentweb/assets/static/
	ssh linux-04 cd /home/jensen33/Dev/studentweb \; make

docs/Makefile (new file, +3)

@@ -0,0 +1,3 @@
paper.pdf: paper.tex
	pdflatex $<
	evince $@

(binary image added, 47 KiB; not shown)

(binary image added, 22 KiB; not shown)

docs/figures/common_tld.png (new binary file, 24 KiB; not shown)

(binary image added, 20 KiB; not shown)

(binary image added, 33 KiB; not shown)

(binary image added, 29 KiB; not shown)

(binary image added, 48 KiB; not shown)

(binary image added, 61 KiB; not shown)

(binary image added, 51 KiB; not shown)

(binary image added, 22 KiB; not shown)

(binary image added, 54 KiB; not shown)

docs/paper.pdf (new binary file; not shown)

docs/paper.tex (new file, +61)

@@ -0,0 +1,61 @@
\documentclass{article}
\usepackage{multicol}
\usepackage{hyperref}
\title{Data Mining CSCI 577}
\author{Matt Jensen}
\date{2023-04-25}
\begin{document}
\maketitle
\section*{Abstract}
News organizations have been repeatedly accused of being partisan.
Additionally, they have been accused of polarizing discussion to drive up revenue and engagement.
This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
A secondary goal is to investigate whether news organizations have been polarized uniformly, or whether one pole has been `moving' away from the `middle' more rapidly.
This analysis will probe to what degree the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton window} has shifted in the media.
Noam Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of an agenda rather than its cause.
\begin{multicols}{2}
\section{Data Preparation}
The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://memeorandum.com}{Memeorandum} for news stories from 2006 to 2022.
Each news article has a title, author, description, publisher, publish date, url and related discussions.
The site also has a concept of references, where a main, popular story may be covered by other sources.
This link association might be used to support one or more of the hypotheses of the main analysis.
After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
What remains after these cleaning steps is approximately 6,400 days of material: 300,000 distinct headlines from 21,000 publishers and 34,000 authors.
\section{Missing Data Policy}
The largest data-quality issue to deal with is news organizations that share the same parent company but appear under slightly different names.
The Wall Street Journal's news section, for example, is drastically different from its opinion section.
Other organizations have slightly different names for the same outlet, a product of the aggregation service rather than of any real difference.
Luckily, most of the analysis operates on the content of the news headlines, which does not suffer from this data impurity.
\section{Classification Task}
The classification of news titles into emotional categories was accomplished by using a pretrained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google}, in which a collection of 58,000 comments was manually classified into 28 emotions.
The class for each article will be derived by tokenizing the title, running the model over the tokens, and taking the largest-probability class from the output.
The data has been discretized into years.
Additionally, the publishers will be discretized based either on principal component analysis of link similarity or on the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{AllSides}.
Given that the features of the dataset are sparse, it is not expected to have any useless attributes, unless the original hypothesis of a temporal trend proves to be false.
Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
\section{Experiments}
No computational experiments have been done yet.
Generating the tokenized text, the word embeddings and the emotional sentiment analysis has made up the bulk of the work thus far.
The bias ratings do not cover all publishers in the dataset, so the share of articles whose publisher lacks a bias rating will have to be calculated.
If fewer than 30\% of the articles have a rated publisher, it might not make sense to use the bias ratings.
The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationships between related publishers.
\section{Results}
\textbf{TODO.}
\end{multicols}
\end{document}

docs/presentation.md (new file, +552)

@@ -0,0 +1,552 @@
_model: slides
---
title: CSCI 577 - Data Mining
---
body:
# Political Polarization
Matt Jensen
===
# Hypothesis
Political polarization is rising, and news articles are a proxy measure.
==
# Is this reasonable?
==
# Why is polarization rising?
Not my job, but there's research<sup>[ref](#references)</sup> to support it.
==
# Sub-hypothesis
- The polarization increases near elections. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across publishers. <!-- .element: class="fragment" -->
- The polarization is not evenly distributed across the political spectrum. <!-- .element: class="fragment" -->
==
# Sub-sub-hypothesis
- Similarly polarized publishers link to each other. <!-- .element: class="fragment" -->
- 'Mainstream' media uses more neutral titles. <!-- .element: class="fragment" -->
- Highly polarized publications don't last as long. <!-- .element: class="fragment" -->
===
# Data Source(s)
memeorandum.com <!-- .element: class="fragment" -->
allsides.com <!-- .element: class="fragment" -->
huggingface.co <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.memeorandum.com" data-background-interactive></section>
===
# memeorandum.com
- News aggregation site. <!-- .element: class="fragment" -->
- Was really famous before Google News. <!-- .element: class="fragment" -->
- Still aggregates sites today. <!-- .element: class="fragment" -->
==
# Why Memeorandum?
- Behavioral: sometimes I only read the titles (doom scrolling). <!-- .element: class="fragment" -->
- Behavioral: it's my source of news (with sister site TechMeme.com). <!-- .element: class="fragment" -->
- Convenient: most publishers block bots. <!-- .element: class="fragment" -->
- Convenient: dead simple html to parse. <!-- .element: class="fragment" -->
- Archival: all headlines from 2006 forward. <!-- .element: class="fragment" -->
- Archival: automated, not editorialized. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://www.allsides.com/media-bias/ratings" data-background-interactive></section>
===
# AllSides.com
- Rates news publications as left, center or right. <!-- .element: class="fragment" -->
- Ratings combine: <!-- .element: class="fragment" -->
  - blind bias surveys.
  - editorial reviews.
  - third party research.
  - community voting.
- Originally scraped the website, but eventually got direct access. <!-- .element: class="fragment" -->
==
# Why AllSides?
- Behavioral: One of the first Google results for bias APIs. <!-- .element: class="fragment" -->
- Convenient: Ordinal ratings [-2: very left, 2: very right]. <!-- .element: class="fragment" -->
- Convenient: Easy format. <!-- .element: class="fragment" -->
- Archival: Covers 1400 publishers. <!-- .element: class="fragment" -->
===
<section data-background-iframe="https://huggingface.co/models" data-background-interactive></section>
===
# huggingface.co
- Deep Learning library. <!-- .element: class="fragment" -->
- Lots of pretrained models. <!-- .element: class="fragment" -->
- Easy, off the shelf word/sentence embeddings and text classification models. <!-- .element: class="fragment" -->
==
# Why HuggingFace?
- Behavioral: Language Models are HOT right now. <!-- .element: class="fragment" -->
- Behavioral: The dataset needed more features.<!-- .element: class="fragment" -->
- Convenient: Literally 5 lines of python.<!-- .element: class="fragment" -->
- Convenient: Testing different model performance was easy.<!-- .element: class="fragment" -->
- Archival: Lots of pretrained classification tasks.<!-- .element: class="fragment" -->
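A hedged sketch of those five lines (the model id is the EmoRoBERTa checkpoint cited in the paper; the exact pipeline arguments are assumptions):

```python
from transformers import pipeline

# emotion classification with a pretrained checkpoint (illustrative usage)
classify = pipeline("text-classification", model="arpanghoshal/EmoRoBERTa")
print(classify("Senate passes budget bill after marathon session"))
```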
===
# Data Structures
Stories
- Top level stories. <!-- .element: class="fragment" -->
  - title.
  - publisher.
  - author.
- Related discussion. <!-- .element: class="fragment" -->
  - publisher.
  - uses 'parent' story as a source.
- Stream of stories (changes constantly). <!-- .element: class="fragment" -->
==
# Data Structures
Bias
- Per publisher. <!-- .element: class="fragment" -->
  - name.
  - label.
  - agree/disagree vote by community.
- Name could be semi-automatically joined to stories. <!-- .element: class="fragment" -->
==
# Data Structures
Embeddings
- Per story title. <!-- .element: class="fragment" -->
  - sentence embedding (n, 384).
  - sentiment classification (n, 1).
  - emotional classification (n, 1).
- ~ 1 hour of inference time to map story titles and descriptions. <!-- .element: class="fragment" -->
===
# Data Collection
==
# Data Collection
Story Scraper (simplified)
```python
from datetime import date, timedelta
from pathlib import Path
import requests

output_dir = Path('data/memeorandum')  # destination directory (illustrative)
day = timedelta(days=1)
cur = date(2005, 10, 1)
end = date.today()
while cur <= end:
    save_as = output_dir / f"{cur.strftime('%y-%m-%d')}.html"
    url = f"https://www.memeorandum.com/{cur.strftime('%y%m%d')}/h2000"
    r = requests.get(url)
    with open(save_as, 'w') as f:
        f.write(r.text)
    cur = cur + day  # advance at the end so the first day is not skipped
```
==
# Data Collection
Bias Scraper (hard)
```python
...
bias_html = DATA_DIR / 'allsides.html'
parser = etree.HTMLParser()
tree = etree.parse(str(bias_html), parser)
root = tree.getroot()
rows = root.xpath('//table[contains(@class,"views-table")]/tbody/tr')
ratings = []
for row in rows:
    rating = dict()
    ...
```
==
# Data Collection
Bias Scraper (easy)
![allsides request](https://studentweb.cs.wwu.edu/~jensen33/static/577/allsides_request.png)
==
# Data Collection
Embeddings (easy)
```python
from transformers import AutoTokenizer, AutoModel

# table = ...
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
for chunk in table:
    tokens = tokenizer(chunk, add_special_tokens=True, truncation=True, padding="max_length", max_length=92, return_attention_mask=True, return_tensors="pt")
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.detach().numpy()
    ...
```
==
# Data Collection
Classification Embeddings (medium)
```python
...
outputs = model(**tokens)[0].detach().numpy()
scores = 1 / (1 + np.exp(-outputs))  # sigmoid
class_ids = np.argmax(scores, axis=1)
for i, class_id in enumerate(class_ids):
    results.append({"story_id": ids[i], "label": model.config.id2label[class_id]})
...
```
===
# Data Selection
==
# Data Selection
Stories
- Clip the first and last full year of stories. <!-- .element: class="fragment" -->
- Remove duplicate stories (big stories span multiple days). <!-- .element: class="fragment" -->
==
# Data Selection
Publishers
- Combine subdomains of stories. <!-- .element: class="fragment" -->
  - blog.washingtonpost.com and washingtonpost.com are considered the same publisher.
  - This could be bad. For example: opinion.wsj.com != wsj.com.
==
# Data Selection
Links
- Select only stories from publishers whose stories have been a 'parent' ('original publishers'). <!-- .element: class="fragment" -->
  - Eliminates small blogs and non-original news.
- Eliminate publishers without links to original publishers. <!-- .element: class="fragment" -->
  - Eliminates siloed publications.
  - Link matrix is square and low-ish dimensional.
==
# Data Selection
Bias
- Keep all ratings, even ones with a low agree/disagree ratio.
- Join datasets on publisher name.
  - Not automatic (look up Named Entity Recognition). <!-- .element: class="fragment" -->
  - Started with Jaro-Winkler similarity, then matched manually from there.
- Use numeric values.
  - [left: -2, left-center: -1, ...]
===
# Descriptive Stats
Raw
| metric | value |
|:------------------|--------:|
| total stories | 299714 |
| total related | 960111 |
| publishers | 7031 |
| authors | 34346 |
| max year | 2023 |
| min year | 2005 |
| top level domains | 7063 |
==
# Descriptive Stats
Stories Per Publisher
![stories per publisher](/static/577/stories_per_publisher.png)
==
# Descriptive Stats
Top Publishers
![top publishers](https://studentweb.cs.wwu.edu/~jensen33/static/577/top_publishers.png)
==
# Descriptive Stats
Articles Per Year
![articles per year](https://studentweb.cs.wwu.edu/~jensen33/static/577/articles_per_year.png)
==
# Descriptive Stats
Common TLDs
![common tlds](https://studentweb.cs.wwu.edu/~jensen33/static/577/common_tld.png)
==
# Descriptive Stats
Post Process
| key | value |
|:------------------|--------:|
| total stories | 251553 |
| total related | 815183 |
| publishers | 223 |
| authors | 23809 |
| max year | 2022 |
| min year | 2006 |
| top level domains | 234 |
===
# Experiments
1. **clustering** on link similarity. <!-- .element: class="fragment" -->
2. **classification** on link similarity. <!-- .element: class="fragment" -->
3. **classification** on sentence embedding. <!-- .element: class="fragment" -->
4. **classification** on sentiment analysis. <!-- .element: class="fragment" -->
5. **regression** on emotional classification over time and publication. <!-- .element: class="fragment" -->
===
# Experiment 1
Setup
- Create one-hot encoding of links between publishers. <!-- .element: class="fragment" -->
- Cluster the encoding. <!-- .element: class="fragment" -->
- Expect similar publications in same cluster. <!-- .element: class="fragment" -->
- Use PCA to visualize clusters. <!-- .element: class="fragment" -->
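A minimal sketch of this setup, using a toy edge list in place of the real link table (all ids and counts are illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# toy edge list: one row per (parent, child) publisher pair with a link count
edges = pd.DataFrame({
    'parent_id': [1, 1, 2, 2, 3, 3],
    'child_id':  [2, 3, 1, 3, 1, 2],
    'links':     [11, 141, 1, 31, 4, 2],
})

# square publisher-by-publisher matrix; 0 where no links were observed
adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
onehot = (adj > 0).astype(int)  # one-hot encoding: did A ever link to B?

labels = KMeans(n_clusters=2, n_init='auto').fit_predict(onehot)  # cluster publishers
coords = PCA(n_components=2).fit_transform(onehot)                # 2D projection to plot
```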
Note:
Principal Component Analysis:
- a statistical technique for reducing the dimensionality of a dataset.
- a linear transformation into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
==
# Experiment 1
One Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 1| 1| 1| ...|
| wsj | 1| 1| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 11| 1| 141| ...|
| wsj | 1| 31| 0| ...|
| newsweek | 0| 0| 1| ...|
| ... | ...| ...| ...| ...|
==
# Experiment 1
Normalized n-Hot Encoding
| publisher | nytimes| wsj| newsweek| ...|
|:----------|--------:|----:|--------:|----:|
| nytimes | 0| 0.4| 0.2| ...|
| wsj | 0.2| 0| 0.4| ...|
| newsweek | 0.0| 0.0| 0.0| ...|
| ... | ...| ...| ...| ...|
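All three encodings can be derived from the same count matrix; a minimal sketch (the counts reuse the n-hot table above, and the row-wise normalization shown is one plausible scheme):

```python
import pandas as pd

# n-hot: raw link counts between publishers (values from the table above)
adj = pd.DataFrame(
    [[11, 1, 141], [1, 31, 0], [0, 0, 1]],
    index=['nytimes', 'wsj', 'newsweek'],
    columns=['nytimes', 'wsj', 'newsweek'],
)
onehot = (adj > 0).astype(int)                 # 1 if any link exists
normalized = adj.div(adj.sum(axis=1), axis=0)  # each row sums to 1
```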
==
# Experiment 1
Elbow criterion
![elbow](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_cluster_elbow.png)
Note:
The elbow method looks at the percentage of explained variance as a function of the number of clusters:
One should choose a number of clusters such that adding another cluster doesn't give much better modeling of the data.
The percentage of variance explained is the ratio of the between-group variance to the total variance.
==
# Experiment 1
Link Magnitude
![link magnitude cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_links.png)
==
# Experiment 1
Normalized
![link normalized cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_normalized.png)
==
# Experiment 1
Onehot
![link onehot cluster](https://studentweb.cs.wwu.edu/~jensen33/static/577/link_pca_clusters_onehot.png)
==
# Experiment 1
Discussion
- Best encoding: one-hot. <!-- .element: class="fragment" -->
  - Otherwise clusters form around total link counts.
  - Clusters form, but without an explanation.
- Limitation: the link encoding is needed in order to cluster.
  - Smaller publishers might not link very much.
===
# Experiment 2
Setup
- Create features: <!-- .element: class="fragment" -->
  - Publisher frequency.
  - Reuse link encodings.
- Create classes: <!-- .element: class="fragment" -->
  - Join bias classifications.
- Train classifier. <!-- .element: class="fragment" -->
Note:
==
# Experiment 2
Descriptive stats
| metric | value |
|:------------|:----------|
| publishers | 1582 |
| labels | 6 |
| left | 482 |
| center | 711 |
| right | 369 |
| agree range | [0.0-1.0] |
==
# Experiment 2
PCA + Labels
![pca vs. bias labels](https://studentweb.cs.wwu.edu/~jensen33/static/577/pca_with_classes.png)
==
# Experiment 2
Discussion
- Link encodings (and their PCA) are useful. <!-- .element: class="fragment" -->
  - Labels are (sort of) separated and clustered.
  - Creating them for smaller publishers is trivial.
==
# Experiment 2
Limitations
- Dependent on accurate rating. <!-- .element: class="fragment" -->
- Ordinal ratings not available. <!-- .element: class="fragment" -->
- Dependent on accurate joining across datasets. <!-- .element: class="fragment" -->
- Entire publication is rated, not authors. <!-- .element: class="fragment" -->
- Don't know what to do with community rating. <!-- .element: class="fragment" -->
===
# Experiment 3
Setup
==
# Limitations
- Many different authors under the same publisher. <!-- .element: class="fragment" -->
- Publishers use syndication. <!-- .element: class="fragment" -->
- Bias ratings are biased. <!-- .element: class="fragment" -->
===
# Questions
===
<!-- .section: id="references" -->
# References
[1]: Stewart, A.J. et al. 2020. Polarization under rising inequality and economic decline. Science Advances. 6, 50 (Dec. 2020), eabd4201. DOI:https://doi.org/10.1126/sciadv.abd4201.
Note:

(modified file, name not shown)

@@ -1,12 +1,12 @@
 import click
-from data import connect
+from data.main import connect
 import pandas as pd
 from lxml import etree
 from pathlib import Path
 import os
 import csv

-def map(rating:str) -> int:
+def label_to_int(rating:str) -> int:
     mapping = {
         'left' : 0,
@@ -19,20 +19,18 @@ def map(rating:str) -> int:
     return mapping[rating]

+def int_to_label(class_id: int) -> str:
+    mapping = {
+        0 : 'left',
+        1 : 'left-center',
+        2 : 'center',
+        3 : 'right-center',
+        4 : 'right',
+        -1 : 'allsides',
+    }
+    return mapping[class_id]

-@click.command(name="bias:load")
-def load() -> None:
-    DB = connect()
-    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
-    f = str(DATA_DIR / "bias_ratings.csv")
-    DB.sql(f"""
-        create table bias_ratings as
-        select
-            row_number() over(order by b.publisher) as id
-            ,b.*
-        from read_csv_auto('{f}') b
-    """)

 @click.command(name="bias:normalize")
 def normalize() -> None:
     DB = connect()
@@ -41,133 +39,48 @@ def normalize() -> None:
     CREATE OR REPLACE TABLE publisher_bias AS
     WITH cte AS (
         SELECT
-            p.id
+            p.id as publisher_id
+            ,b.id as bias_id
             ,b.bias as label
             ,JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) as similarity
         FROM bias_ratings b
-        JOIN publishers p
+        JOIN top.publishers p
             ON JARO_WINKLER_SIMILARITY(LOWER(p.name), LOWER(b.publisher)) > 0.95
     ),ranked AS (
         SELECT
-            id
+            publisher_id
+            ,bias_id
             ,label
             ,similarity
-            ,ROW_NUMBER() OVER(PARTITION BY id ORDER BY similarity DESC) AS rn
+            ,ROW_NUMBER() OVER(PARTITION BY publisher_id ORDER BY similarity DESC) AS rn
         FROM cte
     )
     SELECT
-        id
+        publisher_id
         ,label
+        ,bias_id
     FROM ranked
     WHERE ranked.rn = 1
     """)

+    mapping = [
+        {'label' :'left' , 'ordinal': -2},
+        {'label' :'left-center' , 'ordinal': -1},
+        {'label' :'center' , 'ordinal': 0},
+        {'label' :'right-center' , 'ordinal': 1},
+        {'label' :'right' , 'ordinal': 2},
+    ]
+    mapping = pd.DataFrame(mapping)
+    DB.query("alter table bias_ratings add column ordinal int")
+    DB.query("""
+        update bias_ratings b
+        set ordinal = o.ordinal
+        FROM mapping o
+        WHERE o.label = b.bias
+    """)

-    DB.sql("""
-        with cte as (
-            select
-                s.publisher_id
-                ,count(1) as stories
-            from stories s
-            group by s.publisher_id
-        )
-        select
-            s.publisher
-            ,s.stories
-            ,b.publisher
-            ,b.bias
-        from bias_ratings b
-        join cte s
-            on s.publisher = b.publisher
-        order by
-            stories desc
-        limit 15
-    """)
-    DB.sql("""
-        with cte as (
-            select
-                s.publisher
-                ,count(1) as stories
-            from stories s
-            group by s.publisher
-        )
-        select
-            sum(stories)
-            ,avg(agree / disagree)
-        from bias_ratings b
-        join cte s
-            on s.publisher = b.publisher
-    """)
-    DB.sql("""
-        with cte as (
-            select
-                s.publisher
-                ,count(1) as stories
-            from stories s
-            group by s.publisher
-        )
-        select
-            sum(s.stories) filter(where b.publisher is not null) as matched
-            ,sum(s.stories) filter(where b.publisher is null) as unmatched
-            ,cast(sum(s.stories) filter(where b.publisher is not null) as numeric)
-                / sum(s.stories) filter(where b.publisher is null) as precent_matched
-        from bias_ratings b
-        right join cte s
-            on s.publisher = b.publisher
-    """)
-    DB.sql("""
-        select
-            *
-        from bias_ratings
-        where publisher ilike '%CNN%'
-    """)

-@click.command(name='bias:debug')
-def debug() -> None:
-    DB = connect()
-    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
-    f = str(DATA_DIR / "bias_ratings.csv")
-    DB.sql("""
-        with cte as (
-            select
-                outlet
-                ,count(1) as stories
-            from stories
-            group by outlet
-        )
-        ,total as (
-            select
-                sum(stories) as total
-            from cte
-        )
-        select
-            cte.outlet
-            ,cte.stories
-            ,bias.outlet
-            ,bias.lean
-            ,sum(100 * (cte.stories / cast(total.total as float))) over() as rep
-            ,total.total
-        from cte
-        join bias
-            on jaro_winkler_similarity(bias.outlet, cte.outlet) > 0.9
-        cross join total.total
-    """)
-    DB.sql("""
-        select
-            outlet
-            ,count(1) as stories
-        from stories
-        group by outlet
-        order by count(1) desc
-        limit 50
-    """)
-    outlets

 @click.command(name='bias:parse')
 def parse() -> None:
@@ -199,3 +112,64 @@ def parse() -> None:
         ratings.append(rating)
     df = pd.DataFrame(ratings)
     df.to_csv(DATA_DIR / 'bias_ratings.csv', sep="|", index=False, quoting=csv.QUOTE_NONNUMERIC)

+@click.command(name="bias:load")
+def load() -> None:
+    DB = connect()
+    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
+    f = str(DATA_DIR / "bias_ratings.csv")
+    DB.sql(f"""
+        CREATE TABLE bias_ratings as
+        select
+            row_number() over(order by b.publisher) as id
+            ,b.*
+        from read_csv_auto('{f}') b
+    """)
+
+@click.command('bias:export')
+def export():
+    data_path = Path(os.environ['DATA_MINING_DATA_DIR'])
+    DB = connect()
+    all_bias = DB.query("""
+        SELECT
+            id as bias_id
+            ,publisher as name
+            ,bias as label
+        FROM bias_ratings
+        ORDER by agree desc
+    """)
+    all_bias.df().to_csv(data_path / 'TMP_publisher_bias.csv', sep="|", index=False)
+    mapped_bias = DB.query("""
+        SELECT
+            p.id as publisher_id
+            ,p.name as name
+            ,p.tld as tld
+            ,b.label as bias
+            ,b.bias_id as bias_id
+        FROM top.publishers p
+        LEFT JOIN publisher_bias b
+            ON b.publisher_id = p.id
+    """)
+    mapped_bias.df().to_csv(data_path / 'TMP_publisher_bias_to_load.csv', sep="|", index=False)
+    DB.close()
+
+@click.command('bias:import-mapped')
+def import_mapped():
+    data_path = Path(os.environ['DATA_MINING_DATA_DIR'])
+    table_name = "top.publisher_bias"
+    DB = connect()
+    df = pd.read_csv(data_path / 'TMP_publisher_bias_to_load.csv', sep="|")
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        SELECT
+            publisher_id AS publisher_id
+            ,cast(bias_id AS int) as bias_id
+        FROM df
+        WHERE bias_id IS NOT NULL
+    """)
+    print(f"created table: {table_name}")

(modified file, name not shown)

@@ -7,7 +7,7 @@ def cli():
 if __name__ == "__main__":
     load_dotenv()
-    import scrape
+    from data import scrape
     cli.add_command(scrape.download)
     cli.add_command(scrape.parse)
     cli.add_command(scrape.load)
@@ -32,4 +32,26 @@ if __name__ == "__main__":
     cli.add_command(emotion.create_table)
     import sentence
     cli.add_command(sentence.embed)
+    from train import main as train_main
+    cli.add_command(train_main.main)
+
+    import plots.descriptive as plotd
+    cli.add_command(plotd.articles_per_year)
+    cli.add_command(plotd.distinct_publishers)
+    cli.add_command(plotd.stories_per_publisher)
+    cli.add_command(plotd.top_publishers)
+    cli.add_command(plotd.common_tld)
+
+    import links as linkcli
+    cli.add_command(linkcli.create_table)
+    cli.add_command(linkcli.create_pca)
+    cli.add_command(linkcli.create_clusters)
+
+    import plots.links as plotl
+    cli.add_command(plotl.elbow)
+    cli.add_command(plotl.link_pca_clusters)
+
+    import plots.classifier as plotc
+    cli.add_command(plotc.pca_with_classes)
+
     cli()

src/data/__init__.py (new file, +6)

@@ -0,0 +1,6 @@
import data.main
import data.scrape

__all__ = [
    'main'
    ,'scrape'
]

(modified file, name not shown)

@@ -4,10 +4,12 @@ import requests
 from pathlib import Path
 import click
 from tqdm import tqdm
-from data import data_dir, connect
+from data.main import data_dir, connect
 from lxml import etree
 import pandas as pd
 from urllib.parse import urlparse
+from tld import get_tld
+from tld.utils import update_tld_names

 @click.command(name='scrape:load')
 @click.option('--directory', type=Path, default=data_dir(), show_default=True)
@@ -61,6 +63,7 @@ def download(output_dir):
 @click.option('-o', '--output_dir', type=Path, default=data_dir(), show_default=True)
 def parse(directory, output_dir):
     """parse the html files on disk into a structured csv format."""
+    update_tld_names()
     directory = data_dir() / "memeorandum"
     parser = etree.HTMLParser()
     pages = [f for f in directory.glob("*.html")]
@@ -104,8 +107,7 @@ def parse(directory, output_dir):
         url = item.xpath('.//strong/a')[0].get('href')
         out['url'] = url
-        out['publisher_url_domain'] = urlparse(publisher_url).netloc
-        out['domain'] = urlparse(url).netloc
+        out['tld'] = get_tld(publisher_url)

         item_id = hash((page.stem, url))
         out['id'] = item_id
@@ -225,3 +227,111 @@ def normalize():
         alter table related_stories drop publisher_domain;
     """)

+def another_norm():
+    sv2 = pd.read_csv(data_dir / 'stories.csv', sep="|")
+    related = pd.read_csv(data_dir / 'related.csv', sep="|")
+    related['tld'] = related.url.apply(lambda x: map_tld(x))
+    DB.query("""
+        update related_stories
+        set publisher_id = p.id
+        from publishers p
+        join related r
+            on r.tld = p.tld
+        where r.url = related_stories.url
+    """)
+    DB.query("""alter table stories add column tld text""")
+    s_url = DB.query("""
+        select
+            id
+            ,url
+        from stories
+    """).df()
+    s_url['tld'] = s_url.url.apply(lambda x: map_tld(x))
+    DB.query("""
+        update stories
+        set tld = s_url.tld
+        from s_url
+        where s_url.id = stories.id
+    """)
+    DB.query("""
+        update stories
+        set publisher_id = p.id
+        from publishers p
+        where p.tld = stories.tld
+    """)
+    select
+    DB.query("""
+        update stories
+        set stories.publisher_id = p.id
+        from new_pub
+    """)
+    sv2['tld'] = sv2.publisher_url.apply(lambda x: map_tld(x))
+    new_pub = DB.query("""
+        with cte as (
+            select
+                tld
+                ,publisher
+                ,count(1) filter(where year(published_at) = 2022) as recent_ctn
+                ,count(1) as ctn
+            from sv2
+            group by
+                tld
+                ,publisher
+        )
+        ,r as (
+            select
+                tld
+                ,publisher
+                ,ctn
+                ,row_number() over(partition by tld order by recent_ctn desc) as rn
+            from cte
+        )
+        select
+            row_number() over() as id
+            ,publisher as name
+            ,tld
+        from r
+        where rn = 1
+        order by ctn desc
+    """).df()
+    DB.query("""
+        CREATE OR REPLACE TABLE publishers AS
+        SELECT
+            id
+            ,name
+            ,tld
+        FROM new_pub
+    """)
+
+def map_tld(x):
+    try:
+        res = get_tld(x, as_object=True)
+        return res.fld
+    except:
+        return None
+
+DB.sql("""
+    SELECT
+        s.id
+        ,sv2.publisher_url
+    FROM stories s
+    JOIN sv2
+        on sv2.id = s.id
+    limit 5
+""")

(modified file, name not shown)

@@ -6,7 +6,7 @@ import numpy as np
 from transformers import BertTokenizer
 from model import BertForMultiLabelClassification
-from data import connect, data_dir
+from data.main import connect, data_dir
 import seaborn as sns
 import matplotlib.pyplot as plt
 from matplotlib.dates import DateFormatter
@@ -376,3 +376,99 @@ def debug():
     DB.close()
     out.to_csv(data_dir() / 'emotions.csv', sep="|")

+def another():
+    DB = connect()
+    DB.sql("""
+        select
+            *
+        from emotions
+    """)
+    emotions = DB.sql("""
+        select
+            year(s.published_at) as year
+            ,se.label as emotion
+            ,count(1) as stories
+        from stories s
+        join story_emotions se
+            on s.id = se.story_id
+        group by
+            year(s.published_at)
+            ,se.label
+    """).df()
+    sns.scatterplot(x=emotions['year'], y=emotions['stories'], hue=emotions['emotion'])
+    plt.show()
+
+    pivot = emotions.pivot(index='year', columns='emotion', values='stories')
+    pivot.reset_index(inplace=True)
+
+    from sklearn.linear_model import LinearRegression
+    reg = LinearRegression()
+    for emotion in pivot.keys()[1:].tolist():
+        _ = reg.fit(pivot['year'].to_numpy().reshape(-1, 1), pivot[emotion])
+        print(f"{emotion}: {reg.coef_[0]}")
+
+    fig, ax = plt.subplots()
+    #sns.lineplot(x=pivot['anger'], y=pivot['joy'])
+    #sns.lineplot(x=pivot['anger'], y=pivot['surprise'], ax=ax)
+    sns.lineplot(x=pivot['anger'], y=pivot['fear'], ax=ax)
+    sns.lineplot(x=pivot[''], y=pivot['fear'], ax=ax)
+    plt.show()
+    DB.close()
+
+    normalized = DB.sql("""
+        with cte as (
+            select
+                year(s.published_at) as year
+                ,se.label as emotion
+                ,b.label as bias
+            from stories s
+            join story_emotions se
+                on s.id = se.story_id
+            join publisher_bias b
+                on b.id = s.publisher_id
+            where b.label != 'allsides'
+            and se.label != 'neutral'
+        )
+        select
+            distinct
+            year
+            ,emotion
+            ,bias
+            ,cast(count(1) over(partition by year, bias, emotion) as float) / count(1) over(partition by year, bias) as group_count
+        from cte
+    """).df()
+
+    DB.sql("""
+        select
+            b.label as bias
+            ,count(1) as stories
+        from stories s
+        join story_emotions se
+            on s.id = se.story_id
+        join publisher_bias b
+            on b.id = s.publisher_id
+        group by
+            b.label
+    """).df()
+
+    another_pivot = emotional_bias.pivot(index=['bias', 'year'], columns='emotion', values='stories')
+    another_pivot.reset_index(inplace=True)
+
+    sns.lineplot(data=normalized, x='year', y='group_count', hue='bias', style='emotion')
+    plt.show()
+
+    sns.relplot(
+        data=normalized, x="year", y="group_count", hue="emotion", col='bias', kind="line"
+        #data=normalized, x="year", y="group_count", hue="emotion", col='bias', kind="line", facet_kws=dict(sharey=False)
+    )
+    plt.show()
+
+    DB.sql("""
+        select
+            *
+        from another_pivot
+    """)

(deleted file, name not shown)

@@ -1,8 +0,0 @@
-import sklearn
-import polars as pl
-import toml
-from pathlib import Path
-
-config = toml.load('/home/user/577/repo/config.toml')
-app_dir = Path(config.get('app').get('path'))
-df = pl.read_csv(app_dir / "data/articles.csv")

(modified file, name not shown)

@@ -1,12 +1,148 @@
-from data import connect
+import click
+from data.main import connect
 import pandas as pd
 import numpy as np
+from sklearn.decomposition import PCA, TruncatedSVD
+from sklearn.cluster import MiniBatchKMeans
 import seaborn as sns
 import matplotlib.pyplot as plt

+@click.command('links:create-table')
+def create_table():
+    table_name = "top.link_edges"
+    DB = connect()
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        with cte as(
+            SELECT
+                s.publisher_id as parent_id
+                ,r.publisher_id as child_id
+                ,count(1) as links
+            FROM top.stories s
+            JOIN top.related_stories r
+                ON s.id = r.parent_id
+            group by
+                s.publisher_id
+                ,r.publisher_id
+        )
+        SELECT
+            cte.parent_id
+            ,cte.child_id
+            ,cte.links as links
+            ,cast(cte.links as float) / sum(cte.links) over(partition by cte.parent_id) as normalized
+            ,case when cte.links > 0 then 1 else 0 end as onehot
+        FROM cte
+        WHERE cte.child_id in (
+            SELECT
+                distinct parent_id
+            FROM cte
+        )
+        AND cte.parent_id in (
+            SELECT
+                distinct child_id
+            FROM cte
+        )
+    """)
+    DB.close()
+
+    DB = connect()
+    DB.query("""
+        SELECT
+            *
+            ,-log10(links)
+            --distinct parent_id
+        FROM top.link_edges e
+        WHERE e.parent_id = 238
+    """)
+    DB.close()
+    print(f"created {table_name}")
+
+@click.command('links:create-pca')
+@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
+def create_pca(source):
+    """create 2D pca labels"""
+    from sklearn.decomposition import PCA
+    table_name = f"top.publisher_pca_{source}"
+    DB = connect()
+    pub = DB.query("""
+        SELECT
+            *
+        FROM top.publishers
+    """).df()
+    df = DB.query(f"""
+        SELECT
+            parent_id
+            ,child_id
+            ,{source} as links
+        FROM top.link_edges
+    """).df()
+    DB.close()
+    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+    svd = PCA(n_components=2)
+    svd_out = svd.fit_transform(pivot)
+    out = pivot.reset_index()[['parent_id']]
+    out['first'] = svd_out[:, 0]
+    out['second'] = svd_out[:, 1]
+    out = pd.merge(out, pub, left_on='parent_id', right_on='id')
+    DB = connect()
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        SELECT
+            out.id as publisher_id
+            ,out.first as first
+            ,out.second as second
+        FROM out
+    """)
+    DB.close()
+    print(f"created {table_name}")
+
+@click.command('links:create-clusters')
+@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
+def create_clusters(source):
+    from sklearn.cluster import KMeans
+    table_name = f"top.publisher_clusters_{source}"
+    DB = connect()
+    df = DB.query(f"""
+        SELECT
+            parent_id
+            ,child_id
+            ,{source} as links
+        FROM top.link_edges
+    """).df()
+    pub = DB.query("""
+        SELECT
+            *
+        FROM top.publishers
+    """).df()
+    DB.close()
+    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
+    k = 8
+    kmeans = KMeans(n_clusters=k, n_init="auto")
+    pred = kmeans.fit_predict(pivot)
+    out = pivot.reset_index()[['parent_id']]
+    out['label'] = pred
+    out = pd.merge(out, pub, left_on='parent_id', right_on='id')
+    new_table = out[['id', 'label']]
+    DB = connect()
+    DB.query(f"""
+        CREATE OR REPLACE TABLE {table_name} AS
+        SELECT
+            n.id as publisher_id
+            ,n.label as label
+        FROM new_table n
+    """)
+    DB.close()
+    print(f"created {table_name}")

 def to_matrix():
     """returns an adjacency matrix of publishers to publisher link frequency"""
@@ -21,6 +157,7 @@ def to_matrix():
         {'label' :'right', 'value' : 4},
         {'label' :'allsides', 'value' : -1},
     ])
+
     bias = DB.sql("""
         SELECT
             b.id
@@ -37,11 +174,7 @@ def to_matrix():
             p.id
             ,p.name
             ,p.url
-            ,b.label
-            ,b.value
         from publishers p
-        left join bias b
-            on b.id = p.id
     """).df()

     edges = DB.sql("""
@@ -81,12 +214,23 @@ def to_matrix():
             ON p.id = cte.parent_id
     """).df()

-    # only keep values that have more than 1 link
-    test = edges[edges['links'] > 2].pivot(index='parent_id', columns='child_id', values='links').fillna(0).reset_index()
-    edges.dropna().pivot(index='parent_id', columns='child_id', values='links').fillna(0)
-    pd.merge(adj, pub, how='left', left_on='parent_id', right_on='id')
     adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
-    adj.values.shape
     out = pd.DataFrame(adj.index.values, columns=['id'])
     out = pd.merge(out, pub, how='left', on='id')
+    return out
+
+@click.command('links:analysis')
+def analysis():
+    from sklearn.decomposition import PCA, TruncatedSVD
+    from sklearn.cluster import MiniBatchKMeans
+    adj = to_matrix()
     pca = PCA(n_components=4)
     pca_out = pca.fit_transform(adj)

(modified file, name not shown)

@@ -1,4 +1,4 @@
-from data import data_dir, connect
+from data.main import data_dir, connect
 import numpy as np
 import sklearn
 from sklearn.cluster import MiniBatchKMeans

src/plots/__init__.py (new empty file)

src/plots/classifier.py (new file, +34)

@@ -0,0 +1,34 @@
import click
from data.main import connect
import os
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'

@click.command('plot:pca-with-classes')
def pca_with_classes():
    filename = "pca_with_classes.png"
    DB = connect()
    data = DB.query(f"""
        SELECT
            p.tld
            ,b.bias
            ,c.first
            ,c.second
            ,round(cast(b.agree as float) / (b.agree + b.disagree), 2) ratio
        FROM top.publishers p
        JOIN top.publisher_bias pb
            ON p.id = pb.publisher_id
        JOIN bias_ratings b
            ON b.id = pb.bias_id
        JOIN top.publisher_pca_normalized c
            ON c.publisher_id = p.id
    """).df()
    DB.close()
    ax = sns.scatterplot(x=data['first'], y=data['second'], hue=data['bias'], s=100)
    ax.set(title="pca components vs. bias labels", xlabel="first pca component", ylabel="second pca component")
    plt.savefig(out_dir / filename)
    print(f"saved: {filename}")

src/plots/descriptive.py (new file, +302)

@@ -0,0 +1,302 @@
import click
from data.main import connect
import os
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np

out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'

@click.command('plot:articles-per-year')
def articles_per_year():
    filename = 'articles_per_year.png'
    DB = connect()
    data = DB.query("""
        select
            year(published_at) as year
            ,count(1) as stories
        from stories
        group by
            year(published_at)
    """).df()
    DB.close()
    ax = sns.barplot(x=data.year, y=data.stories, color='tab:blue')
    ax.tick_params(axis='x', rotation=90)
    ax.set(title="count of articles per year", ylabel="count of stories (#)")
    plt.tight_layout()
    plt.savefig(out_dir / filename)

@click.command('plot:distinct-publishers')
def distinct_publishers():
    filename = 'distinct_publishers.png'
    DB = connect()
    data = DB.query("""
        select
            year(published_at) as year
            ,count(distinct publisher_id) as publishers
        from stories
        group by
            year(published_at)
    """).df()
    DB.close()
    ax = sns.barplot(x=data.year, y=data.publishers, color='tab:blue')
    ax.tick_params(axis='x', rotation=90)
    ax.set(title="count of publishers per year", ylabel="count of publishers (#)")
    plt.tight_layout()
    plt.savefig(out_dir / filename)
    plt.close()

@click.command('plot:stories-per-publisher')
def stories_per_publisher():
    filename = 'stories_per_publisher.png'
    DB = connect()
    data = DB.query("""
        with cte as (
            select
                publisher_id
                ,year(published_at) as year
                ,count(1) as stories
            from stories
            group by
                publisher_id
                ,year(published_at)
        ) , agg as (
            select
                publisher_id
                ,avg(stories) as stories_per_year
                ,case
                    when avg(stories) < 2 then 2
                    when avg(stories) < 4 then 4
                    when avg(stories) < 8 then 8
                    when avg(stories) < 16 then 16
                    when avg(stories) < 32 then 32
                    when avg(stories) < 64 then 64
                    when avg(stories) < 128 then 128
                    else 129
                end as max_avg
            from cte
            group by
                publisher_id
        )
        select
            max_avg
            ,count(1) as publishers
        from agg
        group by
            max_avg
    """).df()
    DB.close()
    ax = sns.barplot(x=data.max_avg, y=data.publishers, color='tab:blue')
    ax.set(title="histogram of publisher stories per year", ylabel="count of publishers (#)", xlabel="max average stories / year")
    plt.tight_layout()
    plt.savefig(out_dir / filename)
    plt.close()

@click.command('plot:top-publishers')
def top_publishers():
    """plot top publishers over time"""
    filename = 'top_publishers.png'
    DB = connect()
    data = DB.query("""
        select
            p.tld
            ,year(published_at) as year
            ,count(1) as stories
        from (
            select
                p.tld
                ,p.id
            from top.publishers p
            join top.stories s
                on s.publisher_id = p.id
            group by
                p.tld
                ,p.id
            order by count(1) desc
            limit 20
        ) p
        join top.stories s
            on s.publisher_id = p.id
        group by
            p.tld
            ,year(published_at)
        order by count(distinct s.id) desc
    """).df()
    DB.close()
    pivot = data.pivot(columns='year', index='tld', values='stories')
    ax = sns.heatmap(pivot, cmap="crest")
    ax.set(title="top 20 publishers (by tld)", ylabel="tld", xlabel="stories / year (#)")
    plt.tight_layout()
    plt.savefig(out_dir / filename)
    plt.close()

@click.command('plot:common_tld')
def common_tld():
    import dataframe_image as dfi
    filename = 'common_tld.png'
    DB = connect()
    data = DB.query("""
        select
            split_part(url, '.', -1) as tld
            ,count(1) as publishers
            ,case when count(1) < 20
                then string_agg(distinct url, '\t')
                else NULL
            end as urls
        from publishers
        group by
            split_part(url, '.', -1)
        order by
            count(1) desc
    """).df()
    DB.close()
    data[:15][['tld', 'publishers']].style.hide(axis="index").export_png(out_dir / filename, table_conversion='matplotlib')

def stats():
    # raw
    DB.query("""
        SELECT
            'total stories' as key
            ,COUNT(1) as value
        FROM stories
        UNION
        SELECT
            'total related' as key
            ,COUNT(1) as value
        FROM related_stories
        UNION
        SELECT
            'top level domains' as key
            ,COUNT(distinct tld) as value
        FROM stories
        UNION
        SELECT
            'publishers' as key
            ,COUNT(1) as value
        FROM publishers
        UNION
        SELECT
            'authors' as key
            ,COUNT(distinct author) as value
        FROM stories
        UNION
        SELECT
            'min year' as key
            ,min(year(published_at)) as value
        FROM stories
        UNION
        SELECT
            'max year' as key
            ,max(year(published_at)) as value
        FROM stories
    """).df().to_markdown(index=False)

    # selected
    DB.query("""
        SELECT
            'total stories' as key
            ,COUNT(1) as value
        FROM top.stories
        UNION
        SELECT
            'total related' as key
            ,COUNT(1) as value
        FROM top.related_stories
        UNION
        SELECT
            'top level domains' as key
            ,COUNT(distinct tld) as value
        FROM top.stories
        UNION
        SELECT
            'publishers' as key
            ,COUNT(1) as value
        FROM top.publishers
        UNION
        SELECT
            'authors' as key
            ,COUNT(distinct author) as value
        FROM top.stories
        UNION
        SELECT
            'min year' as key
            ,min(year(published_at)) as value
        FROM top.stories
        UNION
        SELECT
            'max year' as key
            ,max(year(published_at)) as value
        FROM top.stories
    """).df().to_markdown(index=False)

@click.command('plot:bias-stats')
def bias_stats():
    import dataframe_image as dfi
    filename = 'bias_stats.png'
    DB = connect()
    df = DB.query("""
        SELECT
            string_agg(distinct bias)
        FROM bias_ratings
    """).df()
    df.keys()
    df = DB.query("""
        SELECT
            'publishers' as metric
            ,count(1) as value
        FROM bias_ratings
        UNION
        SELECT
            'labels' as metric
            ,count(distinct bias) as value
        FROM bias_ratings
        UNION
        SELECT
            'right' as metric
            ,count(1) as value
        FROM bias_ratings
        WHERE bias in ('right', 'right-center')
        UNION
        SELECT
            'left' as metric
            ,count(1) as value
        FROM bias_ratings
        WHERE bias in ('left', 'left-center')
        UNION
        SELECT
            'center' as metric
            ,count(1) as value
        FROM bias_ratings
        WHERE bias in ('center')
        UNION
        SELECT
            'agree_range' as metric
            ,'['
                || min(cast(agree as float) / (agree + disagree))
                || '-'
                || max(cast(agree as float) / (agree + disagree))
                || ']'
            as value
        FROM bias_ratings
        WHERE bias in ('center')
    """).df()
    DB.close()
    print(df.to_markdown(index=False))

src/plots/links.py (new file, +114)

@@ -0,0 +1,114 @@
import click
from data.main import connect
from links import to_matrix
import os
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
from sklearn.metrics import silhouette_score
import pandas as pd

out_dir = Path(os.getenv('DATA_MINING_DOC_DIR')) / 'figures'

@click.command('plot:link-elbow')
def elbow():
    from sklearn.cluster import KMeans
    filename = 'link_cluster_elbow.png'
    DB = connect()
    df = DB.query("""
        SELECT
            *
        FROM link_edges
    """).df()
    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
    to_plot = []
    for k in range(2, 15):
        kmeans = KMeans(n_clusters=k, n_init="auto")
        kmeans.fit(pivot)
        label = kmeans.labels_
        coeff = silhouette_score(pivot, label, metric='euclidean')
        to_plot.append({'k': k, 'inertia' : kmeans.inertia_, 'coeff': coeff})
    to_plot = pd.DataFrame(to_plot)
    ax = sns.lineplot(x=to_plot.k, y=to_plot.inertia)
    ax.set(title="elbow criterion plot of clusters", xlabel="bin size (k)", ylabel="sum of squared distances between centroids/points")
    plt.savefig(out_dir / filename)
    plt.close()

# randomly pick 8
@click.command('plot:link-pca-clusters')
@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
def link_pca_clusters(source):
    filename = f"link_pca_clusters_{source}.png"
    DB = connect()
    df = DB.query(f"""
        SELECT
            c.label as cluster
            ,p.tld
            --,b.label as bias
            ,pca.first
            ,pca.second
            ,s.cnt as stories
        FROM top.publisher_clusters_{source} c
        JOIN top.publishers p
            ON c.publisher_id = p.id
        JOIN
        (
            select
                s.publisher_id
                ,count(1) as cnt
            FROM top.stories s
            GROUP BY
                s.publisher_id
        ) s
            ON s.publisher_id = p.id
        JOIN top.publisher_pca_{source} pca
            ON pca.publisher_id = p.id
    """).df()
    DB.close()
    ax = sns.scatterplot(x=df['first'], y=df['second'], hue=df['cluster'])
    ax.set(title=f"pca components vs. clusters ({source})", xlabel="first pca component", ylabel="second pca component")
    plt.savefig(out_dir / filename)

# .df().groupby(['cluster', 'bias']).describe()

def test():
    data_dir = Path(os.getenv('DATA_MINING_DATA_DIR'))
    DB.query("""
        SELECT
            p.id as publisher_id
            ,p.name
            ,p.tld
            ,cast(b.bias_id as int) as bias_id
            ,count(1) as stories
        FROM publishers p
        JOIN stories s
            ON s.publisher_id = p.id
        JOIN publisher_clusters c
            ON c.publisher_id = p.id
        LEFT JOIN publisher_bias b
            ON b.publisher_id = p.id
        where bias_id is null
        group by
            p.id
            ,p.name
            ,p.tld
            ,b.bias_id
        ORDER BY count(1) desc
    """)
    # .df().to_csv(data_dir / 'cluster_publishers.csv', sep="|", index=False)
    DB.close()

src/selection.py (new file, +48)

@@ -0,0 +1,48 @@
from data.main import connect
import pandas as pd
import numpy as np

DB = connect()
edges = DB.query("""
    select
        *
    from link_edges
""").df()
DB.close()
edges

adj = edges.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
select_publishers = pd.DataFrame(adj.index.tolist(), columns=['publisher_id'])

DB = connect()
DB.query("create schema top")
DB.query("""
    CREATE OR REPLACE TABLE top.publishers AS
    SELECT
        p.*
    FROM publishers p
    JOIN select_publishers s
        ON s.publisher_id = p.id
""")
DB.query("""
    CREATE OR REPLACE TABLE top.stories AS
    SELECT
        s.*
    FROM stories s
    JOIN top.publishers p
        ON s.publisher_id = p.id
    WHERE year(s.published_at) >= 2006
    AND year(s.published_at) < 2023
""")
DB.query("""
    CREATE OR REPLACE TABLE top.related_stories AS
    SELECT
        r.*
    FROM top.stories s
    JOIN related_stories r
        ON s.id = r.parent_id
""")

src/sentence.py (new file, +138)

@@ -0,0 +1,138 @@
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from data.main import connect, data_dir
import os
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm
import click

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

@click.option('-c', '--chunks', type=int, default=500, show_default=True)
@click.command("sentence:embed")
def embed(chunks):
    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

    # load data
    DB = connect()
    table = DB.sql("""
        select
            id
            ,title
        from stories
        order by id desc
    """).df()
    DB.close()

    # normalize text
    table['title'] = table['title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    chunked = np.array_split(table, chunks)

    # generate embeddings from list of titles
    iterator = tqdm(chunked, 'embedding')
    embeddings = []
    embedding_ids = []
    for _, chunk in enumerate(iterator):
        sentences = chunk['title'].tolist()
        ids = chunk['id'].tolist()
        # Tokenize sentences
        encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input)
        # Perform pooling
        output = mean_pooling(model_output, encoded_input['attention_mask'])
        # Normalize embeddings
        output = F.normalize(output, p=2, dim=1)
        embeddings.append(output)
        embedding_ids.append(ids)
    embeddings = np.concatenate(embeddings)
    ids = np.concatenate(embedding_ids)

    # save embeddings
    save_to = data_dir() / 'embeddings.npy'
    np.save(save_to, embeddings)
    print(f"embeddings saved: {save_to}")

    # save ids
    save_to = data_dir() / 'embedding_ids.npy'
    np.save(save_to, ids)
    print(f"ids saved: {save_to}")

@click.command('sentence:create-pca-table')
def create_table():
    from sklearn import linear_model
    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))
    embeddings = np.load(data_path / 'embeddings.npy')
    embedding_ids = np.load(data_path / 'embedding_ids.npy')
    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()
    DB = connect()
    data = DB.query("""
        SELECT
            ids.index
            ,s.id
            ,b.ordinal
        FROM ids
        JOIN top.stories s
            ON ids.story_id = s.id
        JOIN top.publisher_bias pb
            ON pb.publisher_id = s.publisher_id
        JOIN bias_ratings b
            ON b.id = pb.bias_id
    """).df()
    x = embeddings[data['index']]
    y = data['ordinal'].to_numpy().reshape(-1, 1)
    reg = linear_model.LinearRegression()
    reg.fit(x, y)
    reg.coef_.shape

@click.command('sentence:create-svm-table')
def create_svm_table():
    from sklearn import svm
    data_path = Path(os.getenv('DATA_MINING_DATA_DIR'))
    embeddings = np.load(data_path / 'embeddings.npy')
    embedding_ids = np.load(data_path / 'embedding_ids.npy')
    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()
    DB = connect()
    data = DB.query("""
        SELECT
            ids.index
            ,s.id
            ,b.ordinal
        FROM ids
        JOIN top.stories s
            ON ids.story_id = s.id
        JOIN top.publisher_bias pb
            ON pb.publisher_id = s.publisher_id
        JOIN bias_ratings b
            ON b.id = pb.bias_id
    """).df()
    x = embeddings[data['index']]
    #y = data['ordinal'].to_numpy().reshape(-1, 1)
    y = data['ordinal']
    clf = svm.SVC()
    pred = clf.fit(x, y)

src/train/__init__.py (new file, +5)

@@ -0,0 +1,5 @@
import train.main

__all__ = [
    'main'
]

src/train/dataset.py (new file, +38)

@@ -0,0 +1,38 @@
from torch.utils.data import Dataset
from data.main import connect, data_dir
from bias import label_to_int
import numpy as np
import pandas as pd

class NewsDataset(Dataset):
    def __init__(self):
        self.embeddings = np.load(data_dir() / 'embeddings.npy')
        embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()
        DB = connect()
        query = """
            SELECT
                s.id
                ,b.label
                ,count(1) over (partition by publisher_id) as stories
            FROM stories s
            JOIN publisher_bias b
                ON b.id = s.publisher_id
            WHERE b.label != 'allsides'
        """
        data = DB.sql(query).df()
        DB.close()
        data['label'] = data['label'].apply(lambda x: label_to_int(x))
        data = data.merge(embedding_ids)
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        y = row['label']
        # x = np.concatenate((self.embeddings[row['index']], [row['stories']])).astype(np.float32)
        x = self.embeddings[row['index']]
        return x, y

src/train/main.py (new file, +132)

@@ -0,0 +1,132 @@
import click
from tqdm import tqdm
from enum import Enum, auto
from dotenv import load_dotenv
import os
import torch
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from accelerate import Accelerator
from train.dataset import NewsDataset
from train.model import Classifier
#from model.linear import LinearClassifier

class Stage(Enum):
    TRAIN = auto()
    DEV = auto()

@click.command('train:main')
def main():
    dev_after = 20
    visible_devices = None
    lr = 1e-4
    epochs = 10
    debug = False
    torch.manual_seed(0)
    num_workers = 0
    embedding_length = int(os.getenv('EMBEDDING_LENGTH', 384))

    dataset = NewsDataset()
    trainset, devset = torch.utils.data.random_split(dataset, [0.8, 0.2])
    batch_size = 512
    trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers, drop_last=True)
    devloader = DataLoader(devset, shuffle=False, num_workers=num_workers)

    accelerator = Accelerator()
    model = Classifier(embedding_length=embedding_length, classes=5)

    # it's possible to control which GPUs the process can see using an environmental variable
    if visible_devices:
        os.environ['CUDA_VISIBLE_DEVICES'] = visible_devices
    if debug:
        os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
        #accelerator.log({"message" :"debug enabled"})

    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # wrap objects with accelerate
    model, optimizer, trainloader, devloader = accelerator.prepare(model, optimizer, trainloader, devloader)

    def run():
        """runner for training and validating"""
        running_loss = 0.0
        # set the model to train mode
        model.train() if stage == Stage.TRAIN else model.eval()
        dataloader = trainloader if stage == Stage.TRAIN else devloader
        desc = 'train epoch' if stage == Stage.TRAIN else 'dev epoch'
        if debug:
            ...
        # Make sure there are no leftover gradients before starting training an epoch
        optimizer.zero_grad()
        for batch, (x, y) in enumerate(tqdm(dataloader, desc=desc)):
            pred_y = model(x)  # Forward pass through model
            loss = criterion(pred_y, y)
            running_loss += loss  # Increment running loss
            # Only update model weights on training
            if stage == Stage.TRAIN:
                accelerator.backward(loss)  # Increment gradients within model by sending loss backwards
                optimizer.step()  # Update model weights
                optimizer.zero_grad()  # Reset gradients to 0
        return running_loss / len(dataloader)

    for epoch in range(epochs):
        if (epoch - 1) % dev_after == 0:
            stage = Stage.DEV
            log = run()
            print(f"dev loss: {log}")
        else:
            stage = Stage.TRAIN
            log = run()
            print(f"train loss: {log}")

    breakpoint()

    from data.main import data_dir, connect
    import numpy as np
    import pandas as pd
    from bias import int_to_label
    embeddings = dataset.embeddings
    embedding_ids = dataset.data
    DB = connect()
    query = """
        SELECT
            s.id
            ,title
            ,p.name
            ,count(1) over (partition by publisher_id) as stories
        FROM stories s
        JOIN publishers p
            on p.id = s.publisher_id
        WHERE s.publisher_id NOT IN (
            SELECT
                id
            FROM publisher_bias b
        )
    """
    data = DB.sql(query).df()
    embeddings = np.load(data_dir() / 'embeddings.npy')
    embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()
    for i in range(10):
        embedding = embeddings[embedding_ids[embedding_ids['id'] == data.iloc[i]['id']]['index']]
        title = data.iloc[i]['title']
        publisher = data.iloc[i]['name']
        class_pred = nn.functional.softmax(model(torch.tensor(embedding))).detach()
        class_id = int(torch.argmax(nn.functional.softmax(model(torch.tensor(embedding))).detach()))
        print(f"{publisher}: {int_to_label(class_id)} - \"{title}\"")
    embedding_ids['id'] == data.iloc[0]['id']
    embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]
    embedding = embeddings[embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]['index']]
    title
    publisher
    model().get_last_layer(torch.tensor(embedding))

src/train/model.py (new file, +28)

@@ -0,0 +1,28 @@
from torch import nn

class Classifier(nn.Module):
    def __init__(self, embedding_length: int, classes: int):
        super().__init__()
        out_len = 16
        self.stack = nn.Sequential(
            nn.Linear(embedding_length, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, out_len),
            nn.ReLU(),
        )
        self.logits = nn.Linear(out_len, classes)

    def forward(self, x):
        x = self.stack(x)
        self.last_hidden_layer = x.detach()
        return self.logits(x)

    def get_last_layer(self, x):
        x = self.stack(x)
        return x

(modified file, name not shown)

@@ -1,7 +1,7 @@
 import click
 from transformers import AutoTokenizer, RobertaModel
 import numpy as np
-from data import Data, from_db, connect, data_dir
+from data.main import Data, from_db, connect, data_dir
 from tqdm import tqdm
 import torch
 from pathlib import Path