add progress and better scraping.

matt
2023-04-22 13:00:24 -07:00
parent 297aeec32d
commit d43ed4658a
7 changed files with 287 additions and 73 deletions


@@ -1,31 +1,39 @@
# Data Mining - CSCI 577
# Project Status Report I
# Project Status Report III
*2023-04-04*
*2023-04-18*
## Participants
Matt Jensen
## Overarching Purpose
Computer Science 477/577
Project Status Report III
Due: Tuesday, April 18
I hope to use a dataset of news articles to track the polarization of news over time.
I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
I think there is a connection to be made to other statistics, like voting polarity in Congress, income inequality, or the consolidation of media into the hands of the few.
## Tools
## Data Source
> The third project progress report should include a preliminary account of the existing software tools you will be using.
> Ideally, you obtain the software you will (probably) need and run it on sample files (or your real files), so make sure that you understand how they work.
> Do not wait; verify that there are no hidden complications.
> There are many plausible sources for such software, including the following:
To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
I will grab the title, author, publisher, published date, URL and related discussions for each story and store them in a .csv file.
The site also has a concept of references, where a main, popular story may be covered by other sources.
So there is a concept of link similarity that could be explored in this analysis too.
I will use the following suite of Python tools to conduct my research:
## Techniques
- python
- pytorch
- scikit-learn
- duckdb
- requests
- pandas
- matplotlib
- seaborn
I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
I think there is a way to test how many clusters should exist to minimize the error.
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
## Purpose
> This progress should also provide a definitive description of your purpose and how you intend to conduct it.
> This should take the form of a detailed outline of the procedures you will undertake in exploring your dataset(s) and maximizing the knowledge that can be extracted from it.
\newpage
@@ -103,3 +111,31 @@ Another goal is to look at the political alignment over time.
I will train a classifier to predict political bias based on the word embeddings as well.
There is a concept of the [Overton Window](https://en.wikipedia.org/wiki/Overton_window) and I would be curious to know if the titles of news articles could be a proxy for the location of the Overton window over time.
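
As a rough illustration of what "a classifier on word embeddings" could look like, here is a minimal nearest-centroid sketch in plain Python. It is only a stand-in: the three-dimensional vectors, the `left`/`right` labels, and the helper names `fit` and `predict` are all made up for the example, and the real project may well use a different model and real title embeddings.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def fit(embeddings, labels):
    """Compute one centroid per bias label (hypothetical helper)."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}

def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Toy "embeddings" and labels, invented for illustration only.
train_x = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]
train_y = ["left", "left", "right", "right"]

model = fit(train_x, train_y)
prediction = predict(model, [0.7, 0.3, 0.0])
```

A nearest-centroid rule is about the simplest possible baseline; its main use here would be as a sanity check before training anything heavier.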
\newpage
# Project Status Report I
*2023-04-04*
## Participants
Matt Jensen
## Overarching Purpose
I hope to use a dataset of news articles to track the polarization of news over time.
I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
I think there is a connection to be made to other statistics, like voting polarity in Congress, income inequality, or the consolidation of media into the hands of the few.
## Data Source
To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
I will grab the title, author, publisher, published date, URL and related discussions for each story and store them in a .csv file.
The site also has a concept of references, where a main, popular story may be covered by other sources.
So there is a concept of link similarity that could be explored in this analysis too.
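
The storage step could be sketched roughly as follows, using only the standard library. The column names, the `;` separator for related links, and the example record are assumptions made for illustration; none of them come from the actual scraper, and the record is not real scraped data.

```python
import csv
import io

# Column names mirror the fields listed above; the exact names are assumed.
FIELDS = ["title", "author", "publisher", "published_at", "url", "related_urls"]

def write_stories(stories, handle):
    """Write an iterable of story dicts to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for story in stories:
        row = dict(story)
        # Flatten the list of related discussion links into a single cell.
        row["related_urls"] = ";".join(story.get("related_urls", []))
        writer.writerow(row)

def read_stories(handle):
    """Read the CSV back, restoring related_urls to a list."""
    stories = []
    for row in csv.DictReader(handle):
        row["related_urls"] = [u for u in row["related_urls"].split(";") if u]
        stories.append(row)
    return stories

# Made-up example record, for illustration only.
example = [{
    "title": "Example headline",
    "author": "A. Writer",
    "publisher": "Example Times",
    "published_at": "2006-01-01",
    "url": "https://example.com/story",
    "related_urls": ["https://example.org/a", "https://example.net/b"],
}]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_stories(example, buffer)
buffer.seek(0)
round_trip = read_stories(buffer)
```

Keeping the related links in one delimited cell keeps the file flat; if link similarity becomes central to the analysis, a separate table of (story, related link) pairs may be easier to query.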
## Techniques
I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
I think there is a way to test how many clusters should exist to minimize the error.
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
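
One possible way to test the number of clusters is to run k-means for several values of k and compare the mean silhouette coefficient of each result. The sketch below is a plain-Python illustration of that idea under stated assumptions: the data is three synthetic 2-D blobs rather than real article features, the silhouette score is my choice of criterion (an elbow plot of within-cluster error would be another), and the helper names are invented. In practice scikit-learn's `KMeans` and `silhouette_score` would likely replace the hand-written functions.

```python
import math
import random

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=25):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Synthetic data: three well-separated blobs, standing in for article features.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blob_centers for _ in range(30)]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
```

On this toy data the score should peak at three clusters, matching how the blobs were generated. Real headline data will be far noisier, so the peak may be much less pronounced.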

BIN
docs/progress_spec_3.docx Normal file

Binary file not shown.