Merge branch 'feature_paper'

2023-06-07 20:53:53 -07:00 · 2023-06-07 20:53:53 -07:00 · d06b8d5e23
parent e37d39bc4a 7edb8543a7
commit d06b8d5e23
41 changed files with 1192 additions and 390 deletions
--- a/.gitignore
+++ b/.gitignore
@ -7,3 +7,6 @@ tmp.py
 *.log
 *.out
 tmp.*
+*.bbl
+*.blg
+*.dvi
--- a/dist/hw1.pdf
+++ b/dist/hw1.pdf
--- a/dist/jensen_577_hw1.pdf
+++ b/dist/jensen_577_hw1.pdf
--- a/dist/jensen_577_paper.pdf
+++ b/dist/jensen_577_paper.pdf
--- a/dist/jensen_577_presentation.pdf
+++ b/dist/jensen_577_presentation.pdf
--- a/docs/acm_template.tex
+++ b/docs/acm_template.tex
@ -0,0 +1,213 @@
+\documentclass[sigconf,authorversion,nonacm]{acmart}
+
+\begin{document}
+
+\title{Political Polarization In Media Headlines}
+\subtitle{CSCI 577 - Data Mining}
+
+\author{Matt Jensen}
+\email{contact@publicmatt.com}
+\affiliation{%
+  \institution{Western Washington University}
+  \streetaddress{516 High St.}
+  \city{Bellingham}
+  \state{Washington}
+  \country{USA}
+  \postcode{98225}
+}
+
+\renewcommand{\shortauthors}{Jensen}
+
+\begin{abstract}
+Political polarization in the United States has increased in recent years according to studies \cite{stewart_polarization_2020}.
+A number of polling methods and data sources have been used to track this phenomenon \cite{prior_media_2013}.
+A causal link between polarization and partisanship in elections and the community has been hard to establish.
+One possible cause is the media diet of the average American.
+In particular, the medium of consumption has shifted online and the range of sources has widened considerably.
+In an effort to quantify the range of online media, a study of online news article headlines was conducted.
+It found that titles with emotionally neutral wording have decreased as a share of all articles over time.
+A model was built to classify titles using BERT-style word embeddings and a simple classifier.
+
+\end{abstract}
+
+\keywords{data mining, datasets, classification, clustering, neural networks}
+
+\received{4 April 2023}
+\received[revised]{9 June 2023}
+
+\maketitle
+
+\section{Background}
+
+Media and news publishers have been accused of polarizing discussion to drive up revenue and engagement.
+This paper seeks to quantify those claims by classifying the degree to which news headlines have become more emotionally charged over time.
+A secondary goal is to investigate whether news organizations have been uniformly polarized, or if one pole has been `moving' more rapidly away from the `middle'.
+This analysis will probe the degree to which the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
+Noam Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of an agenda instead of its cause.
+
+There is evidence supporting an increase in political polarization in the United States over the past 16 years.
+A number of studies have attempted to measure and explain this phenomenon \cite{flaxman_filter_2016}.
+
+These studies attempt to link increased media options to a decrease in the proportion of less engaged and less partisan voters.
+This drop in less engaged voters might explain the increased partisanship in elections.
+However, the evidence regarding a direct causal relationship between partisan media messages and changes in attitudes or behaviors is inconclusive.
+Directly measuring the causal relationship between media messages and behavior is difficult.
+There is currently no solid evidence to support the claim that partisan media outlets are causing average Americans to become more partisan.
+
+The number of media publishers has increased, including in this particular data set.
+
+These studies rest on the assumption that media outlets are becoming more partisan.
+We study this assumption in detail.
+
+Party Sorting: Over the past few decades, there has been a significant increase in party sorting, where Democrats have become more ideologically liberal, and Republicans have become more ideologically conservative.
+This trend indicates a growing gap between the two major political parties.
+A study published in the journal American Political Science Review in 2018 found that party sorting increased significantly between 2004 and 2016.
+
+Congressional Polarization: There has been a substantial increase in polarization among members of the U.S. Congress. Studies analyzing voting patterns and ideological positions of legislators have consistently shown a widening gap between Democrats and Republicans.
+The Pew Research Center reported that the median Democrat and the median Republican in Congress have become further apart ideologically between 2004 and 2017.
+
+Public Opinion: Surveys and polls also provide evidence of increasing political polarization among the American public.
+According to a study conducted by Pew Research Center in 2017, the gap between Republicans and Democrats on key policy issues, such as immigration, the environment, and social issues, has widened significantly since 1994.
+
+Media Fragmentation: The rise of social media and digital media platforms has contributed to the fragmentation of media consumption, leading to the creation of ideological echo chambers.
+Individuals are more likely to consume news and information that aligns with their pre-existing beliefs, reinforcing and intensifying polarization.
+
+Increased Negative Attitudes: Studies have shown that Americans' attitudes towards members of the opposing political party have become increasingly negative. The Pew Research Center reported in 2016 that negative feelings towards the opposing party have doubled since the late 1990s, indicating a deepening divide.
+
+\begin{itemize}
+    \item Memeorandum: \textbf{stories}
+    \item AllSides: \textbf{bias}
+    \item HuggingFace: \textbf{sentiment}
+    \item ChatGPT: \textbf{election dates}
+\end{itemize}
+\section{Data Sources}
+
+All data was collected over the course of 2023.
+
+\begin{table}
+    \caption{News Dataset Sources}
+    \label{tab:sources}
+    \begin{tabular}{ll}
+        \toprule
+        Source     & Description \\
+        \midrule
+        Memeorandum & News aggregation service.     \\
+        AllSides    & Bias evaluator.   \\
+        MediaBiasFactCheck   & Bias evaluator.   \\
+        HuggingFace & Classification model repository. \\
+        \bottomrule
+    \end{tabular}
+\end{table}
+
+\section{Data Preparation}
+
+\subsection{Memeorandum}
+The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://www.memeorandum.com}{Memeorandum} for news stories from 2006 to 2022.
+Each news article has a title, author, description, publisher, publication date, and URL.
+All of these are non-numeric, except for the publication date, which is ordinal.
+The site also has a concept of references, where a main, popular story may be covered by other sources.
+Using an archive of the website, each day's headlines were downloaded and parsed using Python, then normalized and stored in SQLite database tables \cite{jensen_data_2023-1}.
+
+\subsection{AllSides and MediaBiasFactCheck}
+
+
+
+What remains after cleaning is approximately 240,000 headlines from 1,700 publishers and 34,000 authors over about 6,400 days (Table~\ref{tab:1}).
+
+\begin{table}
+    \caption{News Dataset Statistics After Cleaning}
+    \label{tab:1}
+    \begin{tabular}{ll}
+        \toprule
+        stat       & value     \\
+        \midrule
+        publishers & 1,735     \\
+        stories    & 242,343   \\
+        authors    & 34,346    \\
+        children   & 808,628   \\
+        date range & 2006-2022 \\
+        \bottomrule
+    \end{tabular}
+\end{table}
+
+\subsection{Missing Data Policy}
+
+The only news headlines used in this study were those with an associated bias rating from either AllSides or MediaBiasFactCheck.
+This eliminated about 5,300 publishers and 50,000 headlines, which came from outlets publishing fewer than one story per year.
+Another consideration was the relationship between the opinion and news sections of organizations.
+MediaBiasFactCheck makes a distinction between things like the Wall Street Journal's news organization, which it rates as `Least Biased', and the Wall Street Journal's opinion organization, which it rates as `Right'.
+Due to the nature of the Memeorandum dataset, and the way that organizations design their URL structure, this study was not able to parse the headlines into news, opinion, blogs or other sub-categories recognized by the bias datasets.
+As such, news and opinion were combined under the same bias rating, and the rating with the most articles published was taken as the default value.
+This might cause organizations with large newsrooms to skew toward the center in the dataset.
+
+
+\section{Experiments}
+
+\subsection{Link Similarity Clustering and Classification}
+
+\subsection{Title Sentiment Classification}
+
+The classification of news titles into emotional categories was accomplished by using a pre-trained large language model from \href{https://huggingface.co/arpanghoshal/EmoRoBERTa}{HuggingFace}.
+This model was trained on \href{https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html}{a dataset curated and published by Google} in which a collection of 58,000 comments was manually classified into 28 emotions.
+The class for each article is derived by tokenizing the title, running the model over the tokens, and taking the class with the largest probability from the output.
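+
+As a sketch of this final step, assume the model emits one score (logit) $z_k(t)$ for each of the $K = 28$ emotion classes given a tokenized title $t$; the symbols $z_k$, $p_k$, and $\hat{y}$ are illustrative notation introduced here, not taken from the model's documentation.
+Assuming the scores are normalized with a softmax, the predicted class is
+\begin{equation}
+    \hat{y}(t) = \arg\max_{k \in \{1, \dots, K\}} p_k(t), \qquad p_k(t) = \frac{\exp(z_k(t))}{\sum_{j=1}^{K} \exp(z_j(t))}.
+\end{equation}
+Because the softmax is monotonic in each score, the same class is selected by taking the largest $z_k(t)$ directly.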
+
+The data has been discretized into years.
+Additionally, the publishers have been discretized based on either principal component analysis of link similarity or the bias ratings of \href{https://www.allsides.com/media-bias/ratings}{AllSides}.
+Given that the features of the dataset are sparse, no attributes are expected to be useless, unless the original hypothesis of a temporal trend proves to be false.
+Of the features used in the analysis, there are enough data points that null or missing values can safely be excluded.
+
+No computational experiments have been done yet.
+Generating the tokenized text, the word embeddings, and the emotional sentiment analysis has made up the bulk of the work thus far.
+The bias ratings do not cover all publishers in the dataset, so the number of articles without a bias rating from their publisher will have to be calculated.
+If it is less than 30\% of the articles, it might not make sense to use the bias ratings.
+The creation and reduction of the link graph with principal component analysis will need to be done to visualize the relationship between related publishers.
+
+
+\section{Results}
+
+\begin{figure}[h]
+    \centering
+    \includegraphics[width=\linewidth]{figures/articles_per_year.png}
+    \caption{Articles per year.}
+    \Description{descriptive statistics on the news data source}
+\end{figure}
+
+\begin{figure}[h]
+    \centering
+    \includegraphics[width=\linewidth]{figures/bias_vs_sentiment_over_time.png}
+    \caption{Sentiment vs. bias over time}
+    \Description{Time series classification of news title sentiment and bias}
+\end{figure}
+
+\begin{figure}[h]
+    \centering
+    \includegraphics[width=\linewidth]{figures/link_pca_clusters_onehot.png}
+    \caption{kNN confusion matrix of related links adjacency matrix}
+    \Description{}
+\end{figure}
+
+
+
+
+\begin{acks}
+To Dr. Hearne, for the instruction on clustering and classification techniques, and to Pax Newman for the discussion on word embeddings.
+\end{acks}
+
+\bibliographystyle{ACM-Reference-Format}
+\bibliography{data_mining_577}
+
+\appendix
+
+\section{Online Resources}
+
+The source code for the study is available on GitHub \cite{jensen_data_2023}.
+
+\end{document}
+\endinput
--- a/docs/data_mining_577.bib
+++ b/docs/data_mining_577.bib
@ -0,0 +1,144 @@
+
+@article{stewart_polarization_2020,
+	title = {Polarization under rising inequality and economic decline},
+	volume = {6},
+	issn = {2375-2548},
+	url = {https://www.science.org/doi/10.1126/sciadv.abd4201},
+	doi = {10.1126/sciadv.abd4201},
+	abstract = {Polarization can spread and become entrenched when inequality creates subpopulations that cannot afford risks. Social and political polarization is an important source of conflict in many societies. Understanding its causes has become a priority of scholars across disciplines. We demonstrate that shifts in socialization strategies analogous to political polarization can arise as a locally beneficial response to both rising wealth inequality and economic decline. In many contexts, interaction with diverse out-groups confers benefits from innovation and exploration greater than those that arise from interacting exclusively with a homogeneous in-group. However, when the economic environment favors risk aversion, a strategy of seeking lower-risk in-group interactions can be important to maintaining individual solvency. Our model shows that under conditions of economic decline or increasing inequality, some members of the population benefit from adopting a risk-averse, in-group favoring strategy. Moreover, we show that such in-group polarization can spread rapidly to the whole population and persist even when the conditions that produced it have reversed.},
+	language = {en},
+	number = {50},
+	urldate = {2023-05-16},
+	journal = {Science Advances},
+	author = {Stewart, Alexander J. and McCarty, Nolan and Bryson, Joanna J.},
+	month = dec,
+	year = {2020},
+	pages = {eabd4201},
+}
+
+@article{prior_media_2013,
+	title = {Media and {Political} {Polarization}},
+	volume = {16},
+	issn = {1094-2939, 1545-1577},
+	url = {https://www.annualreviews.org/doi/10.1146/annurev-polisci-100711-135242},
+	doi = {10.1146/annurev-polisci-100711-135242},
+	abstract = {This article examines if the emergence of more partisan media has contributed to political polarization and led Americans to support more partisan policies and candidates. Congress and some newer media outlets have added more partisan messages to a continuing supply of mostly centrist news. Although political attitudes of most Americans have remained fairly moderate, evidence points to some polarization among the politically involved. Proliferation of media choices lowered the share of less interested, less partisan voters and thereby made elections more partisan. But evidence for a causal link between more partisan messages and changing attitudes or behaviors is mixed at best. Measurement problems hold back research on partisan selective exposure and its consequences. Ideologically one-sided news exposure may be largely confined to a small, but highly involved and influential, segment of the population. There is no firm evidence that partisan media are making ordinary Americans more partisan.},
+	language = {en},
+	number = {1},
+	urldate = {2023-06-06},
+	journal = {Annual Review of Political Science},
+	author = {Prior, Markus},
+	month = may,
+	year = {2013},
+	pages = {101--127},
+}
+
+@article{allcott_social_2017,
+	title = {Social {Media} and {Fake} {News} in the 2016 {Election}},
+	volume = {31},
+	issn = {0895-3309},
+	url = {https://pubs.aeaweb.org/doi/10.1257/jep.31.2.211},
+	doi = {10.1257/jep.31.2.211},
+	abstract = {Following the 2016 US presidential election, many have expressed concern about the effects of false stories (“fake news”), circulated largely through social media. We discuss the economics of fake news and present new data on its consumption prior to the election. Drawing on web browsing data, archives of fact-checking websites, and results from a new online survey, we find: 1) social media was an important but not dominant source of election news, with 14 percent of Americans calling social media their “most important” source; 2) of the known false news stories that appeared in the three months before the election, those favoring Trump were shared a total of 30 million times on Facebook, while those favoring Clinton were shared 8 million times; 3) the average American adult saw on the order of one or perhaps several fake news stories in the months around the election, with just over half of those who recalled seeing them believing them; and 4) people are much more likely to believe stories that favor their preferred candidate, especially if they have ideologically segregated social media networks.},
+	language = {en},
+	number = {2},
+	urldate = {2023-06-06},
+	journal = {Journal of Economic Perspectives},
+	author = {Allcott, Hunt and Gentzkow, Matthew},
+	month = may,
+	year = {2017},
+	pages = {211--236},
+}
+
+@article{allcott_polarization_2020,
+	title = {Polarization and public health: {Partisan} differences in social distancing during the coronavirus pandemic},
+	volume = {191},
+	issn = {00472727},
+	shorttitle = {Polarization and public health},
+	url = {https://linkinghub.elsevier.com/retrieve/pii/S0047272720301183},
+	doi = {10.1016/j.jpubeco.2020.104254},
+	abstract = {We study partisan differences in Americans' response to the COVID-19 pandemic. Political leaders and media outlets on the right and left have sent divergent messages about the severity of the crisis, which could impact the extent to which Republicans and Democrats engage in social distancing and other efforts to reduce disease transmission. We develop a simple model of a pandemic response with heterogeneous agents that clarifies the causes and consequences of heterogeneous responses. We use location data from a large sample of smartphones to show that areas with more Republicans engaged in less social distancing, controlling for other factors including public policies, population density, and local COVID cases and deaths. We then present new survey evidence of significant gaps at the individual level between Republicans and Democrats in self-reported social distancing, beliefs about personal COVID risk, and beliefs about the future severity of the pandemic.},
+	language = {en},
+	urldate = {2023-06-06},
+	journal = {Journal of Public Economics},
+	author = {Allcott, Hunt and Boxell, Levi and Conway, Jacob and Gentzkow, Matthew and Thaler, Michael and Yang, David},
+	month = nov,
+	year = {2020},
+	pages = {104254},
+}
+
+@article{flaxman_filter_2016,
+	title = {Filter {Bubbles}, {Echo} {Chambers}, and {Online} {News} {Consumption}},
+	volume = {80},
+	issn = {0033-362X, 1537-5331},
+	url = {https://academic.oup.com/poq/article-lookup/doi/10.1093/poq/nfw006},
+	doi = {10.1093/poq/nfw006},
+	abstract = {Online publishing, social networks, and web search have dramatically lowered the costs of producing, distributing, and discovering news articles. Some scholars argue that such technological changes increase exposure to diverse perspectives, while others worry that they increase ideological segregation. We address the issue by examining web-browsing histories for 50,000 US-located users who regularly read online news. We find that social networks and search engines are associated with an increase in the mean ideological distance between individuals. However, somewhat counterintuitively, these same channels also are associated with an increase in an individual’s exposure to material from his or her less preferred side of the political spectrum. Finally, the vast majority of online news consumption is accounted for by individuals simply visiting the home pages of their favorite, typically mainstream, news outlets, tempering the consequences—both positive and negative—of recent technological changes. We thus uncover evidence for both sides of the debate, while also finding that the magnitude of the effects is relatively modest.},
+	language = {en},
+	number = {S1},
+	urldate = {2023-06-06},
+	journal = {Public Opinion Quarterly},
+	author = {Flaxman, Seth and Goel, Sharad and Rao, Justin M.},
+	year = {2016},
+	pages = {298--320},
+}
+
+@article{guess_almost_2021,
+	title = {({Almost}) {Everything} in {Moderation}: {New} {Evidence} on {Americans}' {Online} {Media} {Diets}},
+	volume = {65},
+	issn = {0092-5853, 1540-5907},
+	shorttitle = {({Almost}) {Everything} in {Moderation}},
+	url = {https://onlinelibrary.wiley.com/doi/10.1111/ajps.12589},
+	doi = {10.1111/ajps.12589},
+	abstract = {Does the internet facilitate selective exposure to politically congenial content? To answer this question, I introduce and validate large-N behavioral data on Americans’ online media consumption in both 2015 and 2016. I then construct a simple measure of media diet slant and use machine classification to identify individual articles related to news about politics. I find that most people across the political spectrum have relatively moderate media diets, about a quarter of which consist of mainstream news websites and portals. Quantifying the similarity of Democrats’ and Republicans’ media diets, I find nearly 65\% overlap in the two groups’ distributions in 2015 and roughly 50\% in 2016. An exception to this picture is a small group of partisans who drive a disproportionate amount of traffic to ideologically slanted websites. If online “echo chambers” exist, they are a reality for relatively few people who may nonetheless exert disproportionate influence and visibility.},
+	language = {en},
+	number = {4},
+	urldate = {2023-06-06},
+	journal = {American Journal of Political Science},
+	author = {Guess, Andrew M.},
+	month = oct,
+	year = {2021},
+	pages = {1007--1022},
+}
+
+@article{autor_importing_2020,
+	title = {Importing {Political} {Polarization}? {The} {Electoral} {Consequences} of {Rising} {Trade} {Exposure}},
+	volume = {110},
+	issn = {0002-8282},
+	shorttitle = {Importing {Political} {Polarization}?},
+	url = {https://pubs.aeaweb.org/doi/10.1257/aer.20170011},
+	doi = {10.1257/aer.20170011},
+	abstract = {Has rising import competition contributed to the polarization of US politics? Analyzing multiple measures of political expression and results of congressional and presidential elections spanning the period 2000 through 2016, we find strong though not definitive evidence of an ideological realignment in trade-exposed local labor markets that commences prior to the divisive 2016 US presidential election. Exploiting the exogenous component of rising import competition by China, we find that trade exposed electoral districts simultaneously exhibit growing ideological polarization in some domains, meaning expanding support for both strong-left and strong-right views, and pure rightward shifts in others. Specifically, trade-impacted commuting zones or districts saw an increasing market share for the Fox News channel (a rightward shift), stronger ideological polarization in campaign contributions (a polarized shift), and a relative rise in the likelihood of electing a Republican to Congress (a rightward shift). Trade-exposed counties with an initial majority White population became more likely to elect a GOP conservative, while trade-exposed counties with an initial majority-minority population became more likely to elect a liberal Democrat, where in both sets of counties, these gains came at the expense of moderate Democrats (a polarized shift). In presidential elections, counties with greater trade exposure shifted toward the Republican candidate (a rightward shift). These results broadly support an emerging political economy literature that connects adverse economic shocks to sharp ideological realignments that cleave along racial and ethnic lines and induce discrete shifts in political preferences and economic policy. (JEL D72, F14, J15, L82, R23)},
+	language = {en},
+	number = {10},
+	urldate = {2023-06-06},
+	journal = {American Economic Review},
+	author = {Autor, David and Dorn, David and Hanson, Gordon and Majlesi, Kaveh},
+	month = oct,
+	year = {2020},
+	pages = {3139--3183},
+}
+
+@misc{jensen_data_2023,
+	title = {Data {Mining} 577: {Political} {Polarization} {Source} {Code}},
+	url = {https://github.com/publicmatt/data_mining_577},
+	publisher = {https://github.com/publicmatt/data\_mining\_577},
+	author = {Jensen, Matt},
+	year = {2023},
+}
+
+@misc{jensen_data_2023-1,
+	title = {Data {Mining} 577: {Political} {Polarization} {Data}},
+	url = {https://data.publicmatt.com/national_news/stories},
+	author = {Jensen, Matt},
+	year = {2023},
+}
--- a/docs/figures/bias_vs_sentiment_over_time.png
+++ b/docs/figures/bias_vs_sentiment_over_time.png
--- a/docs/figures/distinct_publishers.png
+++ b/docs/figures/distinct_publishers.png
--- a/docs/figures/link_pca_clusters_onehot.png
+++ b/docs/figures/link_pca_clusters_onehot.png
--- a/docs/figures/publisher_avg_sentiment_vs_bias_over_time.png
+++ b/docs/figures/publisher_avg_sentiment_vs_bias_over_time.png
--- a/docs/hw2.md
+++ b/docs/hw2.md
@ -11,5 +11,3 @@
 > A formal proof is not necessary.
 > Just an intelligent discussion.
 > You will not lose points by giving the wrong answer, only by not being intelligent in your discussion.
-
-
--- a/docs/paper.tex
+++ b/docs/paper.tex
@ -1,6 +1,17 @@
 \documentclass{article}
-\usepackage{multicol}
+\usepackage{multicol,caption}
 \usepackage{hyperref}
+\usepackage{caption}
+\usepackage{subcaption}
+\usepackage{graphicx}
+\usepackage{fancyvrb}
+\usepackage[utf8]{inputenc}
+\bibliographystyle{acm}
+
+\newenvironment{Figure}
+    {\par\medskip\noindent\minipage{\linewidth}}
+    {\endminipage\par\medskip}
+
 \title{Data Mining CS 571}
 \author{Matt Jensen}
 \date{2023-04-25}
@ -17,16 +28,62 @@ A secondary goal is the investigate whether news organization have been uniforml
 This analysis will probe to what degree has the \href{https://en.wikipedia.org/wiki/Overton_window}{Overton Window} has shifted in the media.
 Naom Chomsky had a hypothesis about manufactured consent that is beyond the scope of this paper, so we will restrict our analysis to the presence of agenda instead of the cause of it.

+
 \begin{multicols}{2}

+\section{Background}
+
+There is evidence supporting an increase in political polarization in the United States over the past 16 years.
+A number of studies have attempted to measure and explain this phenomenon \cite{stewart_polarization_2020, flaxman_filter_2016}.
+
+These studies attempt to link increased media options to a decrease in the proportion of less engaged and less partisan voters.
+This drop in less engaged voters might explain the increased partisanship in elections.
+However, the evidence regarding a direct causal relationship between partisan media messages and changes in attitudes or behaviors is inconclusive.
+Directly measuring the causal relationship between media messages and behavior is difficult \cite{prior_media_2013}.
+There is currently no solid evidence to support the claim that partisan media outlets are causing average Americans to become more partisan.
+
+The number of media publishers has increased and in this particular data set:
+    
+\begin{Figure}
+    \centering
+    \includegraphics[width=\linewidth]{figures/distinct_publishers.png}
+    \captionof{figure}{Publishers Per Year}
+\end{Figure}
+
+These studies rest on the assumption that media outlets are becoming more partisan.
+We study this assumption in detail.
+
+Party Sorting: Over the past few decades, there has been a significant increase in party sorting, where Democrats have become more ideologically liberal, and Republicans have become more ideologically conservative. This trend indicates a growing gap between the two major political parties. A study published in the journal American Political Science Review in 2018 found that party sorting increased significantly between 2004 and 2016.
+
+Congressional Polarization: There has been a substantial increase in polarization among members of the U.S. Congress. Studies analyzing voting patterns and ideological positions of legislators have consistently shown a widening gap between Democrats and Republicans. The Pew Research Center reported that the median Democrat and the median Republican in Congress have become further apart ideologically between 2004 and 2017.
+
+Public Opinion: Surveys and polls also provide evidence of increasing political polarization among the American public. According to a study conducted by Pew Research Center in 2017, the gap between Republicans and Democrats on key policy issues, such as immigration, the environment, and social issues, has widened significantly since 1994.
+
+Media Fragmentation: The rise of social media and digital media platforms has contributed to the fragmentation of media consumption, leading to the creation of ideological echo chambers. Individuals are more likely to consume news and information that aligns with their pre-existing beliefs, reinforcing and intensifying polarization.
+
+Increased Negative Attitudes: Studies have shown that Americans' attitudes towards members of the opposing political party have become increasingly negative. The Pew Research Center reported in 2016 that negative feelings towards the opposing party have doubled since the late 1990s, indicating a deepening divide.
+
 \section{Data Preparation}
 The subject of analysis is a set of news article headlines scraped from the news aggregation site \href{https://mememorandum.com}{Memeorandum} for news stories from 2006 to 2022.
-Each news article has a title, author, description, publisher, publish date, url and related discussions. 
+Each news article has a title, author, description, publisher, publication date, URL, and related discussions (Table~\ref{tab:1}).
 The site also has a concept of references, where a main, popular story may be covered by other sources.
 This link association might be used to support one or more of the hypothesis of the main analysis.
 After scraping the site, the data will need to be deduplicated and normalized to minimize storage costs and processing errors.
+
 What remains after these cleaning steps is approximitely 6,400 days of material, 300,000 distinct headlines from 21,000 publishers and 34,000 authors used in the study.

+\begin{center}
+    \begin{tabular}{ll}
+        publishers & 1,735     \\
+        stories    & 242,343   \\
+        children   & 808,628   \\
+        date range & 2006-2022
+    \end{tabular}
+\captionof{table}{dataset statistics}
+\label{tab:1}
+\end{center}
+
+
 \section{Missing Data Policy}

 The largest data policy that will have to be dealt with is news organizations that share the same parent company, but might have slightly different names.
@ -54,8 +111,19 @@ If it is less than 30\% of the articles, it might not make sense to use the bias
 The creation and reduction of the link graph with principle component analysis will need to be done to visualize the relationship between related publishers.

 \section{Results}
-\textbf{TODO.}
+
+\begin{Figure}
+    \centering
+    \includegraphics[width=\linewidth]{figures/articles_per_year.png}
+    \captionof{figure}{Three simple graphs}
+\end{Figure}
+

 \end{multicols}

+\newpage
+
+\bibliography{data_mining_577.bib}
+
 \end{document}
--- a/docs/paper_guidelines.md
+++ b/docs/paper_guidelines.md
@ -0,0 +1,24 @@
+In the next progress report, in addition to significant progress, I hope to see the following
+deficiencies corrected. If you have none, don't worry about it, but most progress reports
+exhibited one or more of them.
+
+1. A full draft of the abstract. I know that there is a lot of provisional thinking about what
+you may be able to accomplish, but by now, you should be able to articulate a goal well
+enough to come up with a proper abstract.
+
+2. Remove irrelevant text. Many submissions had text leftover from the original template;
+this is now a distraction.
+
+3. Use the template originally provided, or a near equivalent. Don’t change formatting of
+margins, the type point, statement of authorship, etc.
+
+4. Attend to mechanics: spelling, capitalization, grammar, etc.
+
+5. Document all external sources, including data sets, software and sources of algorithmic
+ideas.
+
+6. Be careful of common errors in writing, like unnecessarily saying the same thing more
+than once. We will discuss this more in class.
+
+7. Be careful about overmuch etiological narrative, describing your progress and
+failures. We will also give this further attention in class.
--- a/src/cli.py
+++ b/src/cli.py
@ -2,70 +2,85 @@ import click
 from dotenv import load_dotenv
 import data
 import plots
+import mining
+import train

@click.group()
 def cli():
    ...

+@cli.group(name="data")
+def data_subcommand():
+    """data subcommands"""
+    ...
+
+@cli.group(name="mining")
+def mining_subcommand():
+    """mining subcommands"""
+    ...
+
+@cli.group(name="plot")
+def plot_subcommand():
+    """plotting subcommands"""
+    ...
+
+@cli.group(name="train")
+def train_subcommand():
+    """train subcommands"""
+    ...
+
 if __name__ == "__main__":
    load_dotenv()

    # original bias ratings
-    cli.add_command(data.scrape.download)
-    cli.add_command(data.scrape.parse)
-    cli.add_command(data.scrape.load)
-    cli.add_command(data.scrape.normalize)
-    cli.add_command(data.scrape.create_elections_table)
+    data_subcommand.add_command(data.scrape.download)
+    data_subcommand.add_command(data.scrape.parse)
+    data_subcommand.add_command(data.scrape.load)
+    data_subcommand.add_command(data.scrape.normalize)
+    data_subcommand.add_command(data.scrape.create_elections_table)

-    cli.add_command(data.factcheck.parse_index)
-    cli.add_command(data.factcheck.scrape)
+    data_subcommand.add_command(data.factcheck.parse_index)
+    data_subcommand.add_command(data.factcheck.scrape)

-    cli.add_command(data.links.create_table)
-    cli.add_command(data.links.create_pca)
-    cli.add_command(data.links.create_clusters)
+    data_subcommand.add_command(data.links.create_table)
+    data_subcommand.add_command(data.links.create_pca)
+    data_subcommand.add_command(data.links.create_clusters)

-    import word
-    # cli.add_command(word.distance)
-    # cli.add_command(word.train)
-    cli.add_command(word.embed)
-    cli.add_command(word.max_sequence)
-    import bias
-    cli.add_command(bias.parse)
-    cli.add_command(bias.load)
-    cli.add_command(bias.normalize)
+    data_subcommand.add_command(data.bias.parse)
+    data_subcommand.add_command(data.bias.load)
+    data_subcommand.add_command(data.bias.normalize)

-    import mine
-    cli.add_command(mine.embeddings)
-    cli.add_command(mine.cluster)
-    cli.add_command(mine.plot)
+    data_subcommand.add_command(data.emotion.extract)
+    data_subcommand.add_command(data.emotion.normalize)
+    data_subcommand.add_command(data.emotion.analyze)
+    data_subcommand.add_command(data.emotion.create_table)

-    import emotion
-    cli.add_command(emotion.extract)
-    cli.add_command(emotion.normalize)
-    cli.add_command(emotion.analyze)
-    cli.add_command(emotion.create_table)
+    data_subcommand.add_command(data.word.embed)
+    data_subcommand.add_command(data.word.max_sequence)
+    data_subcommand.add_command(data.sentence.embed)
+    data_subcommand.add_command(data.sentence.create_avg_pca_table)

-    import sentence
-    cli.add_command(sentence.embed)
-    cli.add_command(sentence.create_avg_pca_table)
+    mining_subcommand.add_command(mining.main.embeddings)
+    mining_subcommand.add_command(mining.main.cluster)
+    mining_subcommand.add_command(mining.main.plot)

-    from train import main as train_main
-    cli.add_command(train_main.main)
+    plot_subcommand.add_command(plots.descriptive.articles_per_year)
+    plot_subcommand.add_command(plots.descriptive.distinct_publishers)
+    plot_subcommand.add_command(plots.descriptive.stories_per_publisher)
+    plot_subcommand.add_command(plots.descriptive.top_publishers)
+    plot_subcommand.add_command(plots.descriptive.common_tld)
+    plot_subcommand.add_command(plots.sentence.sentence_pca)
+    plot_subcommand.add_command(plots.sentence.avg_sentence_pca)
+    plot_subcommand.add_command(plots.emotion.emotion_over_time)
+    plot_subcommand.add_command(plots.emotion.emotion_regression)
+    plot_subcommand.add_command(plots.sentiment.over_time)
+    plot_subcommand.add_command(plots.sentiment.bias_over_time)
+    plot_subcommand.add_command(plots.sentiment.bias_vs_recent_winner)
+    plot_subcommand.add_command(plots.links.elbow)
+    plot_subcommand.add_command(plots.links.link_pca_clusters)
+    plot_subcommand.add_command(plots.classifier.pca_with_classes)

-    cli.add_command(plots.descriptive.articles_per_year)
-    cli.add_command(plots.descriptive.distinct_publishers)
-    cli.add_command(plots.descriptive.stories_per_publisher)
-    cli.add_command(plots.descriptive.top_publishers)
-    cli.add_command(plots.descriptive.common_tld)
-    cli.add_command(plots.sentence.sentence_pca)
-    cli.add_command(plots.sentence.avg_sentence_pca)
-    cli.add_command(plots.emotion.emotion_over_time)
-    cli.add_command(plots.emotion.emotion_regression)
-    cli.add_command(plots.sentiment.over_time)
-    cli.add_command(plots.sentiment.bias_over_time)
-    cli.add_command(plots.sentiment.bias_vs_recent_winner)
-    cli.add_command(plots.links.elbow)
-    cli.add_command(plots.links.link_pca_clusters)
-    cli.add_command(plots.classifier.pca_with_classes)
+    train_subcommand.add_command(train.main.main)
+    train_subcommand.add_command(train.main.validate)

    cli()
--- a/src/data/init.py
+++ b/src/data/init.py
@ -2,9 +2,24 @@ import data.main
 import data.scrape
 import data.factcheck
 import data.links
+import data.bias
+import data.emotion
+import data.broken_links
+import data.selection
+import data.sentence
+import data.sentiment
+import data.word
+
 __all__ = [
    'main'
    ,'scrape'
    ,'factcheck'
    ,'links'
+    ,'bias'
+    ,'emotion'
+    ,'broken_links'
+    ,'selection'
+    ,'sentence'
+    ,'sentiment'
+    ,'word'
 ]
--- a/src/data/bias.py
+++ b/src/data/bias.py
--- a/src/data/broken_links.py
+++ b/src/data/broken_links.py
@ -3,23 +3,20 @@ import seaborn as sns
 import matplotlib.pyplot as plt
 import click

-from data import connect
+from data.main import connect

@click.command(name="broken:crawl")
 def crawl():
    """crawl story urls checking for link rot or redirects."""
-    DB = connect()
-
-    urls = DB.query("""
-        select 
-            id
-            ,url
-        from stories 
-        order by published_at asc
-        limit 5
-    """).fetchall()
-
-    DB.close()
+    with connect() as db:
+        urls = db.query("""
+            select 
+                id
+                ,url
+            from stories 
+            order by published_at asc
+            limit 5
+        """).fetchall()

    story_id, url = urls[1]
    # url
--- a/src/data/emotion.py
+++ b/src/data/emotion.py
@ -5,7 +5,7 @@ import pandas as pd
 import numpy as np

 from transformers import BertTokenizer
-from model import BertForMultiLabelClassification
+from train.model import BertForMultiLabelClassification
 from data.main import connect, data_dir
 import seaborn as sns
 import matplotlib.pyplot as plt
--- a/src/data/factcheck.py
+++ b/src/data/factcheck.py
@ -8,7 +8,7 @@ from pathlib import Path
 import os
 import sys
 import click
-from data.main import connect, map_tld, paths
+from data.main import connect, map_tld, paths, reporting_label_to_int
 from random import randint
 from time import sleep
 from tqdm import tqdm
@ -155,7 +155,7 @@ def create_tables():
            FROM stories s
        """).df()

-    stories['tld'] = stories.url.apply(map_tld)
+    raw_stories['tld'] = raw_stories.url.apply(map_tld)
    
    with connect() as db:
        db.sql("""
@ -167,5 +167,25 @@ def create_tables():
            JOIN mbfc.publishers p
            ON p.tld = s.tld
        """)
+    with connect() as db:
+        data = db.sql("""
+            select
+                id,
+                reporting
+            from mbfc.publishers p
+        """).df()

+    with connect() as db:
+        db.sql("""
+            alter table mbfc.publishers add column reporting_ordinal int
+        """)

+    data['ordinal'] = data.reporting.apply(reporting_label_to_int)
+
+    with connect() as db:
+        db.sql("""
+            update mbfc.publishers
+            set reporting_ordinal = data.ordinal
+            from data
+            where data.id = publishers.id
+        """)
--- a/src/data/main.py
+++ b/src/data/main.py
@ -22,6 +22,8 @@ def paths(name='app'):
        return Path(os.environ['DATA_MINING_DOCS_DIR'])
    if 'figure' in name:
        return Path(os.environ['DATA_MINING_DOCS_DIR']) / 'figures'
+    if 'model' in name:
+        return Path(os.environ['DATA_MINING_DATA_DIR']) / 'models'

 def connect():
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
@ -105,3 +107,32 @@ def bias_int_to_label(class_id: int, source: str = 'mbfc') -> str:
    except:
        print(f"no mapping for {class_id}", file=sys.stderr)
        return -1
+
+def reporting_label_to_int(label):
+    mapping = {
+        'Very Low': 0,
+        'Low': 1,
+        'Mixed': 2,
+        'Mostly Factual': 3,
+        'High': 4,
+        'Very High': 5
+    }
+    try:
+        return mapping[label]
+    except (KeyError, TypeError):
+        return -1
+
+def save_model(model, name):
+    import pickle
+    save_to = paths('models') / name
+    with open(save_to, 'wb') as file:
+        pickle.dump(model, file)
+    print(f"saved model: {save_to}")
+
+def load_model(name):
+    import pickle
+    open_from = paths('models') / name
+    print(f"loading model: {open_from}")
+    with open(open_from, 'rb') as file:
+        model = pickle.load(file)
+    return model
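
The `save_model` / `load_model` pair added above can be exercised without the project environment. The following is a minimal sketch, assuming a hypothetical `model_dir` argument in place of `paths('models')` and a plain dict standing in for a fitted classifier. It also creates the target directory first, which the hunk above does not do.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, name, model_dir):
    """Pickle `model` to model_dir / name and return the path written."""
    save_to = Path(model_dir) / name
    # Assumption: the models directory may not exist yet, so create it.
    save_to.parent.mkdir(parents=True, exist_ok=True)
    with open(save_to, "wb") as file:
        pickle.dump(model, file)
    return save_to


def load_model(name, model_dir):
    """Load a pickled object from model_dir / name."""
    with open(Path(model_dir) / name, "rb") as file:
        return pickle.load(file)


def round_trip_demo():
    """Save a stand-in model to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        model = {"coef": [0.1, -0.2], "classes": [0, 1, 2]}
        save_model(model, "demo.pkl", Path(tmp) / "models")
        return load_model("demo.pkl", Path(tmp) / "models")
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.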
--- a/src/data/sentence.py
+++ b/src/data/sentence.py
@ -1,13 +1,11 @@
+import click
 from transformers import AutoTokenizer, AutoModel
 import torch
 import torch.nn.functional as F
-from data.main import connect, paths
-import os
-from pathlib import Path
+from data.main import connect, paths, save_model, load_model, ticklabels
 import numpy as np
 import pandas as pd
 from tqdm import tqdm
-import click

 #Mean Pooling - Take attention mask into account for correct averaging
 def mean_pooling(model_output, attention_mask):
@ -24,15 +22,14 @@ def embed(chunks):
    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

    # load data
-    DB = connect()
-    table = DB.sql("""
-        select
-        id
-        ,title
-        from stories
-        order by id desc
-    """).df()
-    DB.close()
+    with connect() as db:
+        table = db.sql("""
+            select
+            id
+            ,title
+            from stories
+            order by id desc
+        """).df()

    # normalize text
    table['title'] = table['title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
@ -67,7 +64,7 @@ def embed(chunks):
    print(f"embeddings saved: {save_to}")

    # save ids
-    save_to = data_dir() / 'embedding_ids.npy'
+    save_to = paths('data') / 'embedding_ids.npy'
    np.save(save_to, ids)
    print(f"ids saved: {save_to}")

@ -133,25 +130,28 @@ def create_pca_table():

     embeddings = np.load(paths('data') / 'embeddings.npy')
     embedding_ids = np.load(paths('data') / 'embedding_ids.npy')
+    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()

    with connect() as db:
        data = db.query("""
            SELECT
                ids.index
                ,s.id
-                ,b.ordinal
+                ,p.bias
+                ,p.ordinal
            FROM ids
-            JOIN top.stories s
+            JOIN stories s
            ON ids.story_id = s.id
-            JOIN top.publisher_bias pb
-            ON pb.publisher_id = s.publisher_id
-            JOIN bias_ratings b
-            ON b.id = pb.bias_id
+            JOIN mbfc.publisher_stories ps
+            ON s.id = ps.story_id
+            JOIN mbfc.publishers p
+            ON p.id = ps.publisher_id
+            WHERE p.ordinal != -1
        """).df()
        pub = db.query("""
            SELECT
                *
-            FROM top.publishers
+            FROM mbfc.publishers
        """).df()

    x = embeddings[data['index']]
@ -161,8 +161,7 @@ def create_pca_table():
    data['first'] = pred[:, 0]
    data['second'] = pred[:, 1]

-    table_name = f"top.story_embeddings_pca"
-
+    table_name = f"story_embeddings_pca"
    with connect() as db:
        db.query(f"""
            CREATE OR REPLACE TABLE {table_name} AS
@ -172,11 +171,12 @@ def create_pca_table():
                ,data.second as second
            FROM data
        """)
-
    print(f"created {table_name}")

@click.command('sentence:create-svm-table')
 def create_svm_table():
+    """sentence to classifier"""
+
    from sklearn import svm
    from sklearn.linear_model import SGDClassifier

@ -189,22 +189,99 @@ def create_svm_table():
            SELECT
                ids.index
                ,s.id
-                ,b.ordinal
+                ,p.ordinal
+                ,p.bias
            FROM ids
-            JOIN top.stories s
+            JOIN stories s
            ON ids.story_id = s.id
-            JOIN top.publisher_bias pb
-            ON pb.publisher_id = s.publisher_id
-            JOIN bias_ratings b
-            ON b.id = pb.bias_id
+            JOIN mbfc.publisher_stories ps
+            ON s.id = ps.story_id
+            JOIN mbfc.publishers p
+            ON p.id = ps.publisher_id
+            WHERE p.ordinal != -1
        """).df()

    x = embeddings[data['index']]
-    #y = data['ordinal'].to_numpy().reshape(-1, 1)
    y = data['ordinal']

    model = SGDClassifier()
-    pred = model.fit(x, y)
-    data['pred'] = pred.predict(x)
-    data
+    model = model.fit(x, y)
+    # data['pred'] = pred.predict(x)
+    save_model(model, 'sgdclassifier.pkl')

+def inference():
+
+    with connect() as db:
+        bias = db.query("""
+            SELECT
+                p.bias
+                ,p.ordinal
+            FROM mbfc.publishers p
+            WHERE p.ordinal != -1
+            GROUP BY
+                p.bias
+                ,p.ordinal
+            ORDER BY
+                p.ordinal
+        """).df()
+
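+    sgd = load_model('sgdclassifier.pkl')
+    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+    tokens = tokenizer(["hello, i hate woke culture.", "trump is winning"], padding=True, truncation=True, return_tensors='pt')
+
+    with torch.no_grad():
+        output = model(**tokens)
+
+    output = mean_pooling(output, tokens['attention_mask'])
+
+    output = F.normalize(output, p=2, dim=1)
+    pred = sgd.predict(output.numpy())
+
+    # predicted bias ordinals, one per input sentence
+    return pred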
+
+def validation():
+
+    from sklearn.model_selection import train_test_split
+    from sklearn.svm import LinearSVC
+    from sklearn.metrics import ConfusionMatrixDisplay
+    import matplotlib.pyplot as plt
+
+    embeddings = np.load(paths('data') / 'embeddings.npy')
+    embedding_ids = np.load(paths('data') / 'embedding_ids.npy')
+    ids = pd.DataFrame(embedding_ids, columns=['story_id']).reset_index()
+
+    with connect() as db:
+        data = db.query("""
+            SELECT
+                ids.index
+                ,s.id
+                ,p.ordinal
+                ,p.bias
+            FROM ids
+            JOIN stories s
+            ON ids.story_id = s.id
+            JOIN mbfc.publisher_stories ps
+            ON s.id = ps.story_id
+            JOIN mbfc.publishers p
+            ON p.id = ps.publisher_id
+            WHERE p.ordinal != -1
+        """).df()
+
+    x = embeddings[data['index']]
+    y = data['ordinal']
+
+
+    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
+
+    clf = LinearSVC()
+    clf.fit(x_train, y_train)
+
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    ConfusionMatrixDisplay.from_predictions(y_test, clf.predict(x_test), ax=ax)
+    ax.set(title="confusion matrix for LinearSVC classifier on test data.", xticklabels=ticklabels(), yticklabels=ticklabels())
+    save_to = paths('figures') / 'confusion_matrix.png'  # filename is an assumption
+    plt.savefig(save_to)
+    plt.show()
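
The `mean_pooling` helper referenced in this file averages token vectors while skipping padded positions, and the result is then L2-normalized. Its body is not shown in the hunks above, so the following is only a pure-Python sketch of the idea, assuming nested lists in place of torch tensors. It is not the project's implementation.

```python
import math


def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: [batch][tokens][dim] nested lists of floats
    attention_mask:   [batch][tokens] nested lists of 0/1
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        total = [0.0] * len(vectors[0])
        for vector, keep in zip(vectors, mask):
            if keep:
                for i, value in enumerate(vector):
                    total[i] += value
        count = max(sum(mask), 1)  # guard against an all-padding row
        pooled.append([value / count for value in total])
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)
```

With a mask of `[1, 1, 0]`, the third token contributes nothing to the average, which is why padding does not affect the sentence embedding.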
--- a/src/data/sentiment.py
+++ b/src/data/sentiment.py
@ -20,15 +20,14 @@ def extract(chunks):


    # load data
-    DB = connect()
-    table = DB.sql("""
-        select
-        id
-        ,title
-        from stories
-        order by id desc
-    """).df()
-    DB.close()
+    with connect() as db:
+        table = db.sql("""
+            select
+            id
+            ,title
+            from stories
+            order by id desc
+        """).df()

    # normalize text
    table['title'] = table['title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
@ -56,12 +55,12 @@ def extract(chunks):
    story_ids = np.concatenate(story_ids)

    # save embeddings
-    save_to = data_dir() / 'sentiment.npy'
+    save_to = paths('data') / 'sentiment.npy'
    np.save(save_to, sentiments)
    print(f"sentiments saved: {save_to}")

    # save ids
-    save_to = data_dir() / 'sentiment_ids.npy'
+    save_to = paths('data') / 'sentiment_ids.npy'
    np.save(save_to, story_ids)
    print(f"ids saved: {save_to}")
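
The "normalize text" step in this file chains NFKD normalization with an ASCII encode that ignores unmappable characters. A minimal standard-library sketch of the same transformation on a single string follows; the function name `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(title):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", errors="ignore").decode("utf-8")
```

Accented letters keep their base letter (`é` becomes `e`), but characters with no ASCII decomposition, such as curly quotation marks, are removed outright rather than mapped to a plain equivalent. That may matter for headlines that quote speech.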

--- a/src/data/word.py
+++ b/src/data/word.py
@ -1,7 +1,7 @@
 import click
 from transformers import AutoTokenizer, RobertaModel
 import numpy as np
-from data.main import Data, from_db, connect, data_dir
+from data.main import connect, paths
 from tqdm import tqdm
 import torch
 from pathlib import Path
@ -9,30 +9,23 @@ from pathlib import Path
@click.command(name="word:max-sequence")
 def max_sequence():
    """calculate the maximum token length given the story titles"""
-    db = connect()
-    longest = db.sql("""
-        select
-            title
-        from stories
-        order by length(title) desc
-        limit 5000
-    """).df()
-    db.close()
+    with connect() as db:
+        longest = db.sql("""
+            select
+                title
+            from stories
+            order by length(title) desc
+            limit 5000
+        """).df()

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    tokens = tokenizer(longest['title'].to_list())
    print(f"{max([len(x) for x in tokens['input_ids']])}")

-@click.command(name="word:train")
-def train():
-    """TODO"""
-    table = from_db(Data.Titles)
-    n_classes = 10
-
@click.command(name="word:embed")
@click.option('-c', '--chunks', type=int, default=5000, show_default=True)
-@click.option('--embedding_dir', help="path to save embeddings as np array", type=Path, default=Path(data_dir() / 'embeddings'), show_default=True)
-@click.option('--token_dir', help="path to save tokens as np array", type=Path, default=Path(data_dir() / 'tokens'), show_default=True)
+@click.option('--embedding_dir', help="path to save embeddings as np array", type=Path, default=Path(paths('data') / 'embeddings'), show_default=True)
+@click.option('--token_dir', help="path to save tokens as np array", type=Path, default=Path(paths('data') / 'tokens'), show_default=True)
@click.option('--device', help="device to process data on", type=str, default="cuda:0", show_default=True)
 def embed(chunks, embedding_dir, token_dir, device):
    """ given titles, generate tokens and word embeddings and saves to disk """
@ -44,14 +37,13 @@ def embed(chunks, embedding_dir, token_dir, device):
    model.to(device)

    # load data
-    db = connect()
-    table = db.sql("""
-        select
-        title
-        from stories
-        order by id desc
-    """).df()
-    db.close()
+    with connect() as db:
+        table = db.sql("""
+            select
+            title
+            from stories
+            order by id desc
+        """).df()

    # normalize text
    table['title'] = table['title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
@ -82,7 +74,7 @@ def distance():
    closest = np.unravel_index(min_index, distances.shape)
    distances.flatten().shape

-# path = data_dir() / 'embeddings'
+# path = paths('data') / 'embeddings'
 # chunks = [x for x in path.iterdir() if x.match('*.npy')]
 # chunks = sorted(chunks, key=lambda x: int(x.stem.split('_')[1]))
 # 
@ -98,4 +90,4 @@ def distance():
 # 
 # data.shape
 # 
-# np.save(data, data_dir() / 'embeddings.npy')
+# np.save(data, paths('data') / 'embeddings.npy')
--- a/src/hw/assignment1.py
+++ b/src/hw/assignment1.py
@ -0,0 +1,62 @@
+import click
+from dotenv import load_dotenv
+from data.main import paths, connect
+import pandas as pd
+import math
+
+@click.group()
+def cli():
+    ...
+
+@cli.command('hw1:simple')
+def simple_mean():
+    data = pd.read_csv(paths('data') / 'hw' / 'q1.csv', sep="|").sort_values('salary').reset_index(drop=True)
+    mean = sum(data.salary) / len(data.salary)
+    print(f"mean: {mean:.1f}")
+
+    count = data.groupby('salary')['salary'].count()
+    weighted_mean = sum([a * b for a, b in zip(list(count.index), list(count))]) / len(data)
+    print(f"weighted: {weighted_mean:.1f}")
+
+    total = 1
+    for i in data.salary:
+        total *= i
+    geometric = total ** (1 / len(data))
+    print(f"geometric: {geometric:.1f}")
+
+    median = data.iloc[len(data) // 2]['salary']
+    print(f"median: {median}")
+
+    counts = dict(zip(list(count.index), list(count)))
+    mode = max(counts, key=counts.get)
+    print(f"mode: {mode}")
+
+    variance = sum(((data - mean) ** 2)['salary']) / len(data)
+    print(f"variance: {variance:.1f}")
+
+    std = math.sqrt(variance)
+    print(f"std: {std:.2f}")
+
+    z_scores = round((data - mean) / std, 2)
+    z_scores = list(zip(data.salary, z_scores.salary))
+    print(f"z_scores: {z_scores}")
+
+
+    coeff_v = std / mean * 100
+    print(f"coeff. of var.: {coeff_v:.2f}%")
+
+    q_1 = data.iloc[len(data) // 4]['salary']
+    print(f"first quartile: {q_1}")
+
+    q_3 = data.iloc[(len(data) // 4) * 3]['salary']
+    print(f"third quartile: {q_3}")
+
+
+    data = pd.read_csv(paths('data') / 'hw' / 'a1_q2.csv', sep="|")
+    mode = (3 *data['median']) - (2 * data['mean'])
+    print(f"mode: {mode.values}")
+
+
+if __name__ == "__main__":
+    load_dotenv()
+    cli()
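
The hand-rolled statistics in `simple_mean` can be cross-checked against Python's standard `statistics` module. The sketch below assumes a plain list of numbers rather than the `q1.csv` data, and the names `describe` and `pearson_mode` are hypothetical. It uses the population variance (dividing by n), as the hunk above does.

```python
import statistics


def describe(values):
    """Summary statistics mirroring the quantities printed by simple_mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (divide by n)
    return {
        "mean": mean,
        "geometric": statistics.geometric_mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "std": std,
        "coeff_var_pct": std / mean * 100,
        "z_scores": [round((v - mean) / std, 2) for v in values],
    }


def pearson_mode(mean, median):
    """Empirical estimate of the mode: 3 * median - 2 * mean."""
    return 3 * median - 2 * mean
```

One difference to keep in mind: `statistics.median` averages the two middle values for an even-length sample, whereas `data.iloc[len(data) // 2]` picks the upper of the two, so the results can differ when the sample size is even.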
--- a/src/mining/init.py
+++ b/src/mining/init.py
@ -0,0 +1,9 @@
+import mining.main
+import mining.apriori
+import mining.bias
+
+__all__ = [
+    'main'
+    ,'apriori'
+    ,'bias'
+]
--- a/src/mining/apriori.py
+++ b/src/mining/apriori.py
@ -1,3 +1,5 @@
+import click
+
 from efficient_apriori import apriori
 from data.main import connect

--- a/src/mining/main.py
+++ b/src/mining/main.py
--- a/src/model.py
+++ b/src/model.py
@ -15,49 +15,3 @@ class Model(nn.Module):
        outs = self.act(self.linear(outs.last_hidden_state))
        return outs

-import torch.nn as nn
-from transformers import BertPreTrainedModel, BertModel
-
-
-class BertForMultiLabelClassification(BertPreTrainedModel):
-    def __init__(self, config):
-        super().__init__(config)
-        self.num_labels = config.num_labels
-
-        self.bert = BertModel(config)
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
-        self.loss_fct = nn.BCEWithLogitsLoss()
-
-        self.init_weights()
-
-    def forward(
-        self,
-        input_ids=None,
-        attention_mask=None,
-        token_type_ids=None,
-        position_ids=None,
-        head_mask=None,
-        inputs_embeds=None,
-        labels=None,
-    ):
-        outputs = self.bert(
-            input_ids,
-            attention_mask=attention_mask,
-            token_type_ids=token_type_ids,
-            position_ids=position_ids,
-            head_mask=head_mask,
-            inputs_embeds=inputs_embeds,
-        )
-        pooled_output = outputs[1]
-
-        pooled_output = self.dropout(pooled_output)
-        logits = self.classifier(pooled_output)
-
-        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
-
-        if labels is not None:
-            loss = self.loss_fct(logits, labels)
-            outputs = (loss,) + outputs
-
-        return outputs  # (loss), logits, (hidden_states), (attentions)
--- a/src/nearest_neighbor.py
+++ b/src/nearest_neighbor.py
@ -1,5 +0,0 @@
-import pandas as pd
-import math
-
-df = pd.read_csv('/tmp/attr.csv')
-((((df.left - 9.1) ** 2) + ((df.right - 11.0) ** 2)) ** 0.5).sort_values()
--- a/src/plots/init.py
+++ b/src/plots/init.py
@ -3,6 +3,7 @@ import plots.emotion
 import plots.sentiment
 import plots.links
 import plots.classifier
+import plots.descriptive

 __all__ = [
    'sentence'
@ -10,4 +11,5 @@ __all__ = [
    'sentiment',
    'links',
    'classifier',
+    'descriptive',
 ]
--- a/src/plots/classifier.py
+++ b/src/plots/classifier.py
@ -5,7 +5,7 @@ import seaborn as sns
 import matplotlib.pyplot as plt
 from pathlib import Path

-@click.command('plot:pca-with-classes')
+@click.command('classifier:pca-with-classes')
@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
 def pca_with_classes(source):

--- a/src/plots/descriptive.py
+++ b/src/plots/descriptive.py
@ -6,7 +6,7 @@ import matplotlib.pyplot as plt
 from pathlib import Path
 import numpy as np

-@click.command('plot:articles-per-year')
+@click.command('descriptive:articles-per-year')
 def articles_per_year():
    save_to = paths('figures') / 'articles_per_year.png'

@ -27,29 +27,34 @@ def articles_per_year():
    plt.savefig(save_to)
    print(f"saved: {save_to}")

-@click.command('plot:distinct-publishers')
+@click.command('descriptive:distinct-publishers')
 def distinct_publishers():
    save_to = paths('figures') / 'distinct_publishers.png'

    with connect() as db:
-        data = DB.query("""
+        data = db.query("""
            select
-                year(published_at) as year
-                ,count(distinct publisher_id) as publishers
-            from stories
+                count(distinct p.id) as publishers
+                ,date_trunc('year', s.published_at) as date
+            from stories s
+            join mbfc.publisher_stories ps
+            on s.id = ps.story_id
+            join mbfc.publishers p
+            on ps.publisher_id = p.id
+            and year(s.published_at) not in (2005, 2023)
            group by
-                year(published_at)
+                date_trunc('year', s.published_at)
        """).df()

-    ax = sns.barplot(x=data.year, y=data.publishers, color='tab:blue')
-    ax.tick_params(axis='x', rotation=90)
-    ax.set(title="count of publishers per year", ylabel="count of publishers (#)")
+    ax = sns.barplot(x=data.date.dt.year, y=data.publishers, color='tab:blue')
+    ax.tick_params(axis='x', rotation=45)
+    ax.set(ylabel="count of publishers (#)", xlabel="year")
    plt.tight_layout()
    plt.savefig(save_to)
    plt.close()
    print(f"saved: {save_to}")

-@click.command('plot:stories-per-publisher')
+@click.command('descriptive:stories-per-publisher')
 def stories_per_publisher():
    save_to = paths('figures') / 'stories_per_publisher.png'

@ -100,7 +105,7 @@ def stories_per_publisher():
    print(f"saved: {save_to}")


-@click.command('plot:top-publishers')
+@click.command('descriptive:top-publishers')
 def top_publishers():
    """plot top publishers over time"""

@ -164,7 +169,7 @@ def top_publishers():
    print(f"saved: {save_to}")


-@click.command('plot:common_tld')
+@click.command('descriptive:common_tld')
 def common_tld():
    import dataframe_image as dfi
    save_to = paths('figures') / 'common_tld.png'
@ -189,42 +194,71 @@ def common_tld():
 def stats():

    # raw
-    DB.query("""
-        SELECT
-            'total stories' as key
-            ,COUNT(1) as value
-        FROM stories
-        UNION
-        SELECT
-            'total related' as key
-            ,COUNT(1) as value
-        FROM related_stories
-        UNION
-        SELECT
-            'top level domains' as key
-            ,COUNT(distinct tld) as value
-        FROM stories
-        UNION
-        SELECT
-            'publishers' as key
-            ,COUNT(1) as value
-        FROM publishers
-        UNION
-        SELECT
-            'authors' as key
-            ,COUNT(distinct author) as value
-        FROM stories
-        UNION
-        SELECT
-            'min year' as key
-            ,min(year(published_at)) as value
-        FROM stories
-        UNION
-        SELECT
-            'max year' as key
-            ,max(year(published_at)) as value
-        FROM stories
-    """).df().to_markdown(index=False)
+    with connect() as db:
+        db.query("""
+            SELECT
+                'total stories' as key
+                ,COUNT(1) as value
+            FROM stories
+            UNION
+            SELECT
+                'total related' as key
+                ,COUNT(1) as value
+            FROM related_stories
+            UNION
+            SELECT
+                'top level domains' as key
+                ,COUNT(distinct tld) as value
+            FROM stories
+            UNION
+            SELECT
+                'publishers' as key
+                ,COUNT(1) as value
+            FROM mbfc.publishers
+            UNION
+            SELECT
+                'authors' as key
+                ,COUNT(distinct author) as value
+            FROM stories
+            UNION
+            SELECT
+                'years' as key
+                ,min(year(published_at)) || '-' || max(year(published_at)) as value
+            FROM stories
+            UNION
+            SELECT
+                'max year' as key
+                ,max(year(published_at)) as value
+            FROM stories
+            UNION
+            SELECT
+                'publishers with ratings' as key
+                ,count(distinct ps.publisher_id) 
+            FROM mbfc.publisher_stories ps
+            UNION
+            SELECT
+                'publishers without ratings' as key
+                ,count(distinct s.publisher_id) 
+            from stories s
+            left join mbfc.publisher_stories ps
+            on ps.story_id = s.id
+            where ps.publisher_id is null
+            UNION
+            SELECT
+                'stories with ratings' as key
+                ,count(distinct ps.story_id) 
+            FROM mbfc.publisher_stories ps
+            UNION
+            SELECT
+                'stories without ratings' as key
+                ,count(distinct s.id) 
+            from stories s
+            left join mbfc.publisher_stories ps
+            on ps.story_id = s.id
+            where ps.publisher_id is null
+        """)
+
+        #.df().to_markdown(index=False)

    # selected
    DB.query("""
@ -264,7 +298,7 @@ def stats():
        FROM top.stories
    """).df().to_markdown(index=False)

-@click.command('plot:bias-stats')
+@click.command('descriptive:bias-stats')
 def bias_stats():
    import dataframe_image as dfi
    save_to = paths('figures') / 'bias_stats.png'
@ -322,7 +356,7 @@ def bias_stats():
    DB.close()
    print(df.to_markdown(index=False))

-@click.command('plot:bias-over-time')
+@click.command('descriptive:bias-over-time')
 def bias_over_time():
    """plot bias labels over time"""

--- a/src/plots/emotion.py
+++ b/src/plots/emotion.py
@ -6,7 +6,7 @@ import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd

-@click.command('plot:emotion-over-time')
+@click.command('emotion:over-time')
 def emotion_over_time():

    filename = "emotion_over_time.png"
@ -35,7 +35,7 @@ def emotion_over_time():
    print(f"saved: {save_to}")
    os.system(f'xdg-open {save_to}')

-@click.command('plot:emotion-regression')
+@click.command('emotion:regression')
 def emotion_regression():
    """plot emotion over time as regression"""

@ -114,7 +114,7 @@ def emotion_regression():
    plt.close()
    print(f"saved: {save_to}")

-@click.command('plot:emotion-hist')
+@click.command('emotion:hist')
 def emotion_hist():

    filename = "emotion_hist.png"
--- a/src/plots/links.py
+++ b/src/plots/links.py
@ -1,16 +1,13 @@
 import click
-from data.main import connect
-from links import to_matrix
-import os
+from data.main import connect, ticklabels, paths
 import seaborn as sns
 import matplotlib.pyplot as plt
-from pathlib import Path
 import numpy as np
 from sklearn.metrics import silhouette_score
 import pandas as pd


-@click.command('plot:link-elbow')
+@click.command('links:elbow')
 def elbow():
    from sklearn.cluster import KMeans

@ -42,7 +39,7 @@ def elbow():

    # randomly pick 8

-@click.command('plot:link-pca-clusters')
+@click.command('links:pca-clusters')
@click.option('--source', type=click.Choice(['links', 'normalized', 'onehot']), default='links')
 def link_pca_clusters(source):

@ -57,20 +54,22 @@ def link_pca_clusters(source):
                ,pca.first
                ,pca.second
                ,s.cnt as stories
-            FROM top.publisher_clusters_{source} c
-            JOIN top.publishers p
-            ON c.publisher_id = p.id
+            FROM publisher_clusters_{source} c
+            JOIN mbfc.publisher_stories ps
+            ON ps.publisher_id = c.publisher_id
+            JOIN mbfc.publishers p
+            ON ps.publisher_id = p.id
            JOIN 
            (
                select
-                    s.publisher_id
+                    p.id as publisher_id
                    ,count(1) as cnt
-                FROM top.stories s
+                FROM mbfc.publishers p
                GROUP BY
-                    s.publisher_id
+                    p.id
            ) s
            ON s.publisher_id = p.id
-            JOIN top.publisher_pca_{source} pca
+            JOIN publisher_pca_{source} pca
            ON pca.publisher_id = p.id
        """).df()

@ -107,7 +106,7 @@ def test():
        """)


-@click.command('plot:link-confusion')
+@click.command('links:confusion')
 def link_confusion():
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
@ -166,7 +165,7 @@ def link_confusion():
    plt.close()
    print(f"saved plot: {save_to}")

-@click.command('plot:link-classifier')
+@click.command('links:classifier')
 def link_confusion():
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
@ -178,18 +177,16 @@ def link_confusion():
        bias = db.query("""
            SELECT
                p.id as publisher_id
-                ,b.ordinal
-            FROM top.publishers p
-            JOIN top.publisher_bias pb
-            ON pb.publisher_id = p.id
-            JOIN bias_ratings b
-            ON b.id = pb.bias_id
+                ,p.ordinal
+            FROM mbfc.publishers p
+            where ordinal != -1
        """).df()

+    with connect() as db:
        df = db.query("""
            SELECT
                *
-            FROM top.link_edges
+            FROM link_edges
            WHERE parent_id in (
                select
                    publisher_id
@ -203,36 +200,22 @@ def link_confusion():
        """).df()

    pivot = df.pivot(index='parent_id', columns='child_id', values='links').fillna(0)
-
-    x = pivot.values
-    y = bias.sort_values('publisher_id').ordinal
-
-    with connect() as db:
-        data = db.query(f"""
-            SELECT
-                p.id as publisher_id
-                ,pca.first
-                ,pca.second
-            FROM top.publisher_pca_onehot pca
-            JOIN top.publishers p
-            ON pca.publisher_id = p.id
-        """).df()
-
-
+    publisher_matrix = pd.merge(pivot, bias, left_on='parent_id', right_on='publisher_id')
+    x = publisher_matrix.loc[:, ~publisher_matrix.columns.isin(['publisher_id', 'ordinal'])].values
+    y = publisher_matrix['ordinal']

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(x, y)
    y_pred = model.predict(x)
-
-    plot = bias.sort_values('publisher_id')
-    plot['pred'] = y_pred
-    data = pd.merge(plot, data)
+    publisher_matrix['pred'] = y_pred
+    publisher_matrix


-    fig, ax = plt.subplots(figsize=(10, 5))
-    ConfusionMatrixDisplay.from_predictions(data['ordinal'], data['pred'], ax=ax)
-    ticklabels = ['left', 'left-center', 'center', 'right-center', 'right']
-    ax.set(title="confusion matrix for link matrix kNN classifier", xticklabels=ticklabels, yticklabels=ticklabels)
+    fig, ax = plt.subplots(figsize=(5, 5))
+    ConfusionMatrixDisplay.from_predictions(publisher_matrix['ordinal'], publisher_matrix['pred'], ax=ax)
+    ax.set(xticklabels=ticklabels(), yticklabels=ticklabels())
+    plt.xticks(rotation=45)
+    plt.tight_layout()
    plt.savefig(save_to)
    plt.close()
    print(f"saved plot: {save_to}")
--- a/src/plots/sentence.py
+++ b/src/plots/sentence.py
@ -7,7 +7,7 @@ import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd

-@click.command('plot:sentence-pca')
+@click.command('sentence:pca')
 def sentence_pca():
    save_to = paths('figures') / "embedding_sentence_pca.png"

@ -30,7 +30,7 @@ def sentence_pca():
    ax.set(title="pca components vs. bias label", xlabel="first component", ylabel="second component")
    plt.savefig(save_to)

-@click.command('plot:avg-sentence-pca')
+@click.command('sentence:avg-pca')
 def avg_sentence_pca():
    save_to = paths('figures') / "avg_embedding_sentence_pca.png"

@ -54,7 +54,7 @@ def avg_sentence_pca():
    ax.set(title="avg. publisher embedding pca components vs. bias label", xlabel="first component", ylabel="second component")
    plt.savefig(save_to)

-@click.command('plot:sentence-confusion')
+@click.command('sentence:confusion')
 def sentence_confusion():
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
--- a/src/plots/sentiment.py
+++ b/src/plots/sentiment.py
@ -3,7 +3,7 @@ from data.main import connect, paths, ticklabels
 import seaborn as sns
 import matplotlib.pyplot as plt

-@click.command('plot:sentiment-over-time')
+@click.command('sentiment:over-time')
 def over_time():

    filename = "sentiment_over_time.png"
@ -28,7 +28,74 @@ def over_time():
    plt.close()
    print(f"saved: {save_to}")

-@click.command('plot:bias-vs-sentiment-over-time')
+@click.command('sentiment:bias-over-time')
+def bias_over_time():
+    """plot sentiment/bias vs. time"""
+
+    filename = "publisher_avg_sentiment_vs_bias_over_time.png"
+    save_to = paths('figures') / filename
+
+    with connect() as db:
+        data = db.sql("""
+                with cte as (
+                SELECT
+                    avg(sent.class_id) as sentiment
+                    ,date_trunc('yearweek', s.published_at) as date
+                    ,p.id
+                    ,p.bias
+                FROM story_sentiments sent
+                JOIN stories s
+                ON s.id = sent.story_id
+                JOIN mbfc.publisher_stories ps
+                ON ps.story_id = s.id
+                JOIN mbfc.publishers p
+                ON p.id = ps.publisher_id
+                WHERE p.ordinal != -1
+                and year(date) not in (2005, 2023)
+                GROUP BY
+                    date_trunc('yearweek', s.published_at)
+                    ,p.id
+                    ,p.bias
+                ) ,b as (
+                    select
+                        avg(sentiment) as sentiment
+                        ,median(sentiment) as median_sentiment
+                        ,bias
+                        ,date
+                    from cte
+                    group by
+                        bias
+                        ,date
+                )
+                select
+                    median(sentiment) OVER (PARTITION BY bias ORDER BY date DESC ROWS BETWEEN 0 PRECEDING AND 7 FOLLOWING) as sentiment
+                    ,bias
+                    ,date
+                from b
+        """).df()
+
+    ax = sns.lineplot(data, x='date', y='sentiment', palette='rainbow', hue='bias', hue_order=ticklabels())
+    plt.axhline(y=0.5, color='black', linestyle='--', label='neutral') 
+    ax.set(ylabel='8 week rolling median sentiment', xlabel='date', ylim=[0,1])
+    plt.tight_layout()
+    plt.savefig(save_to)
+    plt.close()
+    print(f"saved: {save_to}")
+
+#    from scipy.stats import pearsonr
+#    pivot = data.pivot(index=['date'], columns=['bias'], values='sentiment')
+#
+#
+#    for left in pivot.keys():
+#        for right in pivot.keys():
+#            if left != right:
+#                result = pearsonr(pivot[left], pivot[right])
+#                print(f"{left:<15}/{right:<15} | p: {result.pvalue:.2e} | coef: {result.statistic:.3f}")
+#
+#    pivot
+
+
+@click.command('sentiment:bias-over-time')
 def bias_over_time():
    """plot sentiment/bias vs. time"""

@ -62,16 +129,15 @@ def bias_over_time():
            WHERE year(date) not in (2005, 2023)
        """).df()

-    #ax = sns.relplot(data, x='date', y='sentiment', col='bias', palette='rainbow', hue='bias', col_order=ticklabels())
    ax = sns.lineplot(data, x='date', y='sentiment', palette='rainbow', hue='bias', hue_order=ticklabels())
    plt.axhline(y=0.5, color='black', linestyle='--', label='neutral') 
-    ax.set(title='sentiment and bias vs. time', ylabel='8 week rolling avg. sentiment', xlabel='date')
+    ax.set(ylabel='8 week rolling avg. sentiment', xlabel='date', ylim=[0,1])
    plt.tight_layout()
    plt.savefig(save_to)
    plt.close()
    print(f"saved: {save_to}")

-@click.command('plot:sentiment-recent-winner')
+@click.command('sentiment:recent-winner')
 def bias_vs_recent_winner():
    """plot bias vs. distance to election"""

@ -106,7 +172,7 @@ def bias_vs_recent_winner():
    plt.close()
    print(f"saved: {save_to}")

-@click.command('plot:sentiment-hist')
+@click.command('sentiment:hist')
 def sentiment_hist():

    filename = "sentiment_hist.png"
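
The window clause added in `sentiment:bias-over-time` (`ORDER BY date DESC ROWS BETWEEN 0 PRECEDING AND 7 FOLLOWING`) takes each week together with up to seven earlier weeks. A minimal sketch of the same trailing window in plain Python follows; `trailing_median` is a hypothetical helper, not part of the repository, and it assumes the input is already sorted by date ascending.

```python
from statistics import median


def trailing_median(values, window=8):
    """Median of each value together with up to window-1 preceding values.

    Hypothetical helper: `values` must be sorted by date ascending.
    Early positions use a shorter window, as a truncated SQL frame would.
    """
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out.append(median(values[start:i + 1]))
    return out


# Illustrative data only, not taken from the project database.
print(trailing_median([1, 3, 2, 5, 4], window=3))  # [1, 2.0, 2, 3, 4]
```

Because the first rows see fewer than `window` values, the earliest points of the plotted line are noisier than the rest.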
--- a/src/train/__init__.py
+++ b/src/train/__init__.py
@ -1,5 +1,7 @@
 import train.main
+import train.model

 __all__ = [
    'main'
+    ,'model'
 ]
--- a/src/train/dataset.py
+++ b/src/train/dataset.py
@ -1,38 +1,104 @@
 from torch.utils.data import Dataset
-from data.main import connect, data_dir
-from bias import label_to_int
+from data.main import connect, paths
 import numpy as np
 import pandas as pd
+import os

 class NewsDataset(Dataset):
    def __init__(self):
-        self.embeddings = np.load(data_dir() / 'embeddings.npy')
-        embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()
+        self.embeddings = np.load(paths('data') / 'embeddings.npy')
+        self.embedding_ids = pd.DataFrame(np.load(paths('data') / 'embedding_ids.npy'), columns=['id']).reset_index()

-        DB = connect()
-        query = """
-            SELECT
-                s.id
-                ,b.label
-                ,count(1) over (partition by publisher_id) as stories
-            FROM stories s
-            JOIN publisher_bias b
-            ON b.id = s.publisher_id
-            WHERE b.label != 'allsides'
-        """
-        data = DB.sql(query).df()
-        DB.close()
-
-        data['label'] = data['label'].apply(lambda x: label_to_int(x))
-        data = data.merge(embedding_ids)
-        self.data = data
+        with connect() as db:
+            self.data = db.sql("""
+                WITH cte AS (
+                    SELECT
+                        s.id
+                        ,p.ordinal
+                        ,date_part('epoch', s.published_at) as epoch
+                        ,count(1) over(partition by p.id) as publisher_stories
+                        ,row_number() over(partition by p.ordinal) as label_row
+                    FROM stories s
+                    JOIN mbfc.publisher_stories ps
+                    ON ps.story_id = s.id
+                    JOIN mbfc.publishers p
+                    ON ps.publisher_id = p.id
+                    WHERE p.ordinal != -1
+                )
+                SELECT
+                    id
+                    ,epoch
+                    ,publisher_stories
+                    ,ordinal
+                FROM cte
+                WHERE label_row < 40000
+            """).df()
+        self.data = self.data.merge(self.embedding_ids)
+        self.data['epoch_norm'] = (self.data['epoch'] - self.data['epoch'].min())/(self.data['epoch'].max()-self.data['epoch'].min())
+        self.data['publisher_stories_norm'] = (self.data['publisher_stories'] - self.data['publisher_stories'].min())/(self.data['publisher_stories'].max()-self.data['publisher_stories'].min())

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
-        y = row['label']
-        # x = np.concatenate((self.embeddings[row['index']], [row['stories']])).astype(np.float32)
-        x = self.embeddings[row['index']]
+        y = int(row['ordinal'])
+        x = self.embeddings[int(row['index'])]
+        x = np.append(x, row[['epoch_norm', 'publisher_stories_norm']].values).astype(np.float32)
        return x, y
+
+    def normalized_epoch(self, idx):
+        epoch = self.data['epoch']
+        return (epoch.iloc[idx]-epoch.min())/(epoch.max()-epoch.min())
+
+    def normalized_stories(self, idx):
+        count = self.data['publisher_stories']
+        return (count.iloc[idx]-count.min())/(count.max()-count.min())
+
+    def get_in_out_size(self):
+        return int(os.getenv('EMBEDDING_LENGTH', 384)), int(os.getenv('CLASSES', 5))
+
+class PublisherDataset(Dataset):
+    def __init__(self):
+        embeddings = np.load(paths('data') / 'embeddings.npy')
+        embedding_ids = pd.DataFrame(np.load(paths('data') / 'embedding_ids.npy'), columns=['id']).reset_index()
+
+        with connect() as db:
+            data = db.sql("""
+                WITH cte AS (
+                    SELECT
+                        s.id
+                        ,p.id as publisher_id
+                        ,p.ordinal
+                        ,row_number() over(partition by p.ordinal) as label_row
+                    FROM stories s
+                    JOIN mbfc.publisher_stories ps
+                    ON ps.story_id = s.id
+                    JOIN mbfc.publishers p
+                    ON ps.publisher_id = p.id
+                    WHERE p.ordinal != -1
+                )
+                SELECT
+                    id
+                    ,ordinal
+                    ,publisher_id
+                FROM cte
+                WHERE label_row < 40000
+            """).df()
+
+        data = data.merge(embedding_ids)
+        self.x = []
+        self.y = []
+        for (publisher_id, ordinal), group in data.groupby(['publisher_id', 'ordinal'])[['ordinal', 'index']]:
+            self.x.append(embeddings[group['index']].mean(axis=0))
+            self.y.append(ordinal)
+
+    def __len__(self):
+        return len(self.x)
+
+    def __getitem__(self, idx):
+        return self.x[idx], self.y[idx]
+
+    def get_in_out_size(self):
+        return int(os.getenv('EMBEDDING_LENGTH', 384)), int(os.getenv('CLASSES', 5))
+
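
`NewsDataset` rescales `epoch` and `publisher_stories` to the range [0, 1] with min-max normalization before appending them to each embedding. The sketch below shows that arithmetic on a plain list. `min_max` is a hypothetical helper, and the guard for a constant column is an assumption added here; the dataset code above has no such guard and would divide by zero in that case.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0.

    Hypothetical helper mirroring the (x - min) / (max - min) expression
    used for `epoch_norm` and `publisher_stories_norm`.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant column to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative values only.
print(min_max([1000, 1500, 2000]))  # [0.0, 0.5, 1.0]
```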
--- a/src/train/main.py
+++ b/src/train/main.py
@ -5,34 +5,32 @@ from dotenv import load_dotenv
 import os

 import torch
-from torch import nn
-from torch import optim
 from torch.utils.data import DataLoader
 from accelerate import Accelerator

 from train.dataset import NewsDataset
 from train.model import Classifier
-#from model.linear import LinearClassifier
+from data.main import paths, connect, ticklabels
+import numpy as np
+import pandas as pd

 class Stage(Enum):
    TRAIN = auto()
    DEV = auto()

-@click.command('train:main')
-def main():
-    dev_after = 20
+@click.command('main')
+@click.option('--epochs', default=10, type=int)
+def main(epochs):
+    dev_after = 5
    visible_devices = None
    lr = 1e-4
-    epochs = 10
    debug = False
    torch.manual_seed(0)
-    num_workers = 0
-
+    num_workers = int(os.getenv('NUMBER_OF_WORKERS', 0))
    embedding_length = int(os.getenv('EMBEDDING_LENGTH', 384))
-
    dataset = NewsDataset()
    trainset, devset = torch.utils.data.random_split(dataset, [0.8, 0.2])
-    batch_size = 512
+    batch_size = int(os.getenv('BATCH_SIZE', 512))
    trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers, drop_last=True)
    devloader = DataLoader(devset, shuffle=False, num_workers=num_workers)
    accelerator = Accelerator()
@ -46,7 +44,7 @@ def main():
        #accelerator.log({"message" :"debug enabled"})

    criterion = torch.nn.CrossEntropyLoss()
-    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
+    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # wrap objects with accelerate
    model, optimizer, trainloader, devloader = accelerator.prepare(model, optimizer, trainloader, devloader)
@ -76,57 +74,45 @@ def main():


    for epoch in range(epochs):
-        if (epoch - 1) % dev_after == 0:
+        if (epoch + 1) % dev_after == 0:
            stage = Stage.DEV
            log = run()
-            print(f"dev loss: {log}")
-        else:
-            stage = Stage.TRAIN
-            log = run()
-            print(f"train loss: {log}")
+            print(f"dev loss: {log:.3f}")
+        stage = Stage.TRAIN
+        log = run()
+        print(f"train loss: {log:.3f}")
+    torch.save(model.state_dict(), paths('model') / 'torch_clf.pth')

+@click.command('validate')
+def validate():
+    from sklearn.metrics import ConfusionMatrixDisplay
+    import matplotlib.pyplot as plt
+    import seaborn as sns
+
+    embeddings = np.load(paths('data') / 'embeddings.npy')
+    embedding_ids = pd.DataFrame(np.load(paths('data') / 'embedding_ids.npy'), columns=['id']).reset_index()
+
+    embedding_length = int(os.getenv('EMBEDDING_LENGTH', 384))
+    model = Classifier(embedding_length=embedding_length, classes=5)
+    model.load_state_dict(torch.load(paths('model') / 'torch_clf.pth'))
+    model.eval()
+
+    dataset = NewsDataset()
+
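+    # NewsDataset.__getitem__ handles one integer index at a time, so collect rows individually
+    x, y = zip(*(dataset[i] for i in range(len(dataset))))
+    with torch.no_grad():
+        out = model(torch.tensor(np.stack(x)))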
+
+    sns.histplot(pd.DataFrame(out).melt(), x='value', hue='variable', palette='rainbow')
+    out_path = (paths('data') / 'runs')
+    out_path.mkdir(exist_ok=True)
+    plt.savefig(out_path / 'label_hist.png')
+    plt.close()
+
+    y_pred = out.argmax(axis=1)
+    fig, ax = plt.subplots(figsize=(10, 5))
+    ConfusionMatrixDisplay.from_predictions(y, y_pred, ax=ax)
+    ax.set(title="confusion matrix for neural network classifier on the full dataset.", xticklabels=ticklabels(), yticklabels=ticklabels())
+    plt.savefig(out_path / 'confusion_matrix.png')
+    plt.close()
    breakpoint()
-    from data.main import data_dir, connect
-    import numpy as np
-    import pandas as pd
-    from bias import int_to_label
-
-    embeddings = dataset.embeddings
-    embedding_ids = dataset.data
-
-    DB = connect()
-    query = """
-        SELECT
-            s.id
-            ,title
-            ,p.name
-            ,count(1) over (partition by publisher_id) as stories
-        FROM stories s
-        JOIN publishers p
-        on p.id = s.publisher_id
-        WHERE s.publisher_id NOT IN (
-            SELECT
-                id 
-            FROM publisher_bias b
-        )
-    """
-    data = DB.sql(query).df()
-    embeddings = np.load(data_dir() / 'embeddings.npy')
-    embedding_ids = pd.DataFrame(np.load(data_dir() / 'embedding_ids.npy'), columns=['id']).reset_index()
-
-
-    for i in range(10):
-        embedding =  embeddings[embedding_ids[embedding_ids['id'] == data.iloc[i]['id']]['index']]
-        title = data.iloc[i]['title']
-        publisher = data.iloc[i]['name']
-        class_pred = nn.functional.softmax( model(torch.tensor(embedding))).detach()
-        class_id = int(torch.argmax(nn.functional.softmax( model(torch.tensor(embedding))).detach()))
-        print(f"{publisher}: {int_to_label(class_id)} - \"{title}\"")
-
-    embedding_ids['id'] == data.iloc[0]['id']
-    embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]
-    embedding =  embeddings[embedding_ids[embedding_ids['id'] == data.iloc[0]['id']]['index']]
-    title
-    publisher
-
-    model().get_last_layer(torch.tensor(embedding))
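
The training loop change replaces `(epoch - 1) % dev_after == 0` with `(epoch + 1) % dev_after == 0` and now trains on every epoch instead of skipping training when the dev pass runs. A small sketch of which epochs trigger the dev pass under each condition follows; the function names are hypothetical and exist only for this comparison.

```python
def dev_epochs_new(epochs, dev_after):
    """Epoch indices that run a dev pass under the new condition."""
    return [epoch for epoch in range(epochs) if (epoch + 1) % dev_after == 0]


def dev_epochs_old(epochs, dev_after):
    """Epoch indices that ran a dev pass under the previous condition."""
    return [epoch for epoch in range(epochs) if (epoch - 1) % dev_after == 0]


# New defaults from the diff: epochs=10, dev_after=5.
print(dev_epochs_new(10, 5))   # [4, 9]
# Previous values from the diff: epochs=10, dev_after=20.
print(dev_epochs_old(10, 20))  # [1]
```

With the new condition the dev pass runs at the end of every `dev_after`-th epoch, so a run whose `--epochs` is a multiple of `dev_after` includes a dev pass in its final epoch.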
--- a/src/train/model.py
+++ b/src/train/model.py
@ -1,4 +1,5 @@
 from torch import nn
+from transformers import BertPreTrainedModel, BertModel

 class Classifier(nn.Module):
    def __init__(self, embedding_length: int, classes: int):
@ -26,3 +27,47 @@ class Classifier(nn.Module):
    def get_last_layer(self, x):
        x = self.stack(x)
        return x
+
+
+class BertForMultiLabelClassification(BertPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.bert = BertModel(config)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
+        self.loss_fct = nn.BCEWithLogitsLoss()
+
+        self.init_weights()
+
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        labels=None,
+    ):
+        outputs = self.bert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+        )
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+
+        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
+
+        if labels is not None:
+            loss = self.loss_fct(logits, labels)
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), logits, (hidden_states), (attentions)
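
`BertForMultiLabelClassification` scores each label independently with `nn.BCEWithLogitsLoss`, unlike the `CrossEntropyLoss` used for the single-label `Classifier` in `train/main.py`. The sketch below writes out the standard numerically stable form of binary cross-entropy on raw logits in plain Python, averaged over elements as in PyTorch's default `mean` reduction. `bce_with_logits` is a hypothetical name used only for illustration.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits.

    Hypothetical helper; each (logit, target) pair is treated as its own
    binary decision, which is what makes the loss multi-label.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# A logit of 0 means probability 0.5, so each label costs log(2).
print(bce_with_logits([0.0, 0.0], [1.0, 0.0]))
```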