add progress and better scraping.

2023-04-22 13:00:24 -07:00
parent 297aeec32d
commit d43ed4658a
7 changed files with 287 additions and 73 deletions
--- a/docs/progress.md
+++ b/docs/progress.md
@@ -1,31 +1,39 @@
 # Data Mining - CSCI 577

-# Project Status Report I
+# Project Status Report III

-*2023-04-04*
+*2023-04-18*

 ## Participants

 Matt Jensen

-## Overarching Purpose
+Computer Science 477/577
+Project Status Report III
+Due: Tuesday, April 18

-I hope to use a dataset of news articles to track the polarization of news over time.
-I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
-I think there is a connection to be made to other statistics, like voting polarity in Congress, or income inequality, or consolidation of media into the hands of a few.
+## Tools

-## Data Source
+> The third project progress report should include a preliminary account of the existing software tools you will be using.
+> Ideally, you obtain the software you will (probably) need and run it on sample files (or your real files), so make sure that you understand how they work.
+> Do not wait to verify that there are no hidden complications.
+> There are many plausible sources for such software, including the following:

-To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
-I will grab the title, author, publisher, published date, URL, and related discussions and store them in a .csv.
-The site also has a concept of references, where a main, popular story may be covered by other sources.
-So there is a concept of link similarity that could be explored in this analysis too.
+I will use the following suite of Python tools to conduct my research:

-## Techniques
+- python
+- pytorch
+- scikit-learn
+- duckdb
+- requests
+- pandas
+- matplotlib
+- seaborn

-I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
-I think there is a way to test how many clusters should exist to minimize the error.
-This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
+## Purpose
+
+> This progress report should also provide a definitive description of your purpose and how you intend to conduct it.
+> This should take the form of a detailed outline of the procedures you will undertake in exploring your dataset(s) and maximizing the knowledge that can be extracted from it.

 \newpage 

@@ -103,3 +111,31 @@ Another goal is to look at the political alignment over time.
 I will train a classifier to predict political bias based on the word embeddings as well.
 There is a concept of the [Overton Window](https://en.wikipedia.org/wiki/Overton_window) and I would be curious to know if the titles of news articles could be a proxy for the location of the Overton window over time.

+\newpage 
+
+# Project Status Report I
+
+*2023-04-04*
+
+## Participants
+
+Matt Jensen
+
+## Overarching Purpose
+
+I hope to use a dataset of news articles to track the polarization of news over time.
+I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
+I think there is a connection to be made to other statistics, like voting polarity in Congress, or income inequality, or consolidation of media into the hands of a few.
+
+## Data Source
+
+To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
+I will grab the title, author, publisher, published date, URL, and related discussions and store them in a .csv.
+The site also has a concept of references, where a main, popular story may be covered by other sources.
+So there is a concept of link similarity that could be explored in this analysis too.
+
+## Techniques
+
+I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
+I think there is a way to test how many clusters should exist to minimize the error.
+This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
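
The Techniques paragraph above leaves open how the number of clusters would be tested. The sketch below shows one possible approach, not the report's chosen method: run k-means for several candidate values of k and keep the k with the highest mean silhouette score. The silhouette score is swapped in here for raw clustering error, because k-means error always shrinks as k grows. Everything in the block is an assumption made for illustration: the helper names `kmeans` and `silhouette` are hypothetical, and the points are synthetic stand-ins for article features. It uses only the standard library so it runs anywhere; in the project itself, scikit-learn's `KMeans` and `silhouette_score` would likely be the natural choice.

```python
import math
import random


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign():
        # Label each point with the index of its nearest center.
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    for _ in range(iterations):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign()


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic stand-in for article features: three well-separated groups.
rng = random.Random(0)
true_centers = [(-10.0, -10.0), (0.0, 10.0), (10.0, -10.0)]
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in true_centers
    for _ in range(30)
]

# Score each candidate k and keep the one with the highest silhouette.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 2) for k, s in scores.items()})
```

On this synthetic data the score should peak at three clusters, matching how the points were generated. With real title embeddings the peak may be far less clear-cut, so the selected k is better read as a rough indicator of how many 'viewpoints' appear than as a definitive count.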
--- a/docs/progress_spec_3.docx
+++ b/docs/progress_spec_3.docx