add progress and better scraping.
@@ -1,31 +1,39 @@
|
||||
# Data Mining - CSCI 577
|
||||
|
||||
# Project Status Report I
|
||||
# Project Status Report III
|
||||
|
||||
*2023-04-04*
|
||||
*2023-04-18*
|
||||
|
||||
## Participants
|
||||
|
||||
Matt Jensen
|
||||
|
||||
## Overarching Purpose
|
||||
Computer Science 477/577
|
||||
Project Status Report III
|
||||
Due: Tuesday, April 18
|
||||
|
||||
I hope to use a dataset of news articles to track the polarization of news over time.
|
||||
My hypothesis is that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
|
||||
I think there is a connection to be made to other statistics, such as voting polarity in Congress, income inequality, or the consolidation of media into the hands of a few.
|
||||
## Tools
|
||||
|
||||
## Data Source
|
||||
> The third project progress report should include a preliminary account of the existing software tools you will be using.
|
||||
> Ideally, you obtain the software you will (probably) need and run it on sample files (or your real files), so make sure that you understand how they work.
|
||||
> Do not wait to verify that there are no hidden complications.
|
||||
> There are many plausible sources for such software, including the following:
|
||||
|
||||
To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
|
||||
I will grab the title, author, publisher, published date, URL, and related discussions for each story and store them in a .csv file.
|
||||
The site also has a concept of references, where a main, popular story may be covered by other sources.
|
||||
So there is a concept of link similarity that could be explored in this analysis too.
|
||||
I will use the following suite of Python tools to conduct my research:
|
||||
|
||||
## Techniques
|
||||
- python
|
||||
- pytorch
|
||||
- scikit-learn
|
||||
- duckdb
|
||||
- requests
|
||||
- pandas
|
||||
- matplotlib
|
||||
- seaborn
|
||||
|
||||
I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
|
||||
I think there is a way to test for the ideal number of clusters, i.e., the number that minimizes the clustering error.
|
||||
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
|
||||
## Purpose
|
||||
|
||||
> This progress report should also provide a definitive description of your purpose and how you intend to conduct it.
|
||||
> This should take the form of a detailed outline of the procedures you will undertake in exploring your dataset(s) and maximizing the knowledge that can be extracted from it.
|
||||
|
||||
\newpage
|
||||
|
||||
@@ -103,3 +111,31 @@ Another goal is to look at the political alignment over time.
|
||||
I will train a classifier to predict political bias based on the word embeddings as well.
|
||||
There is a concept called the [Overton Window](https://en.wikipedia.org/wiki/Overton_window), and I would be curious to know whether the titles of news articles could be a proxy for the location of the Overton window over time.
|
||||
|
||||
\newpage
|
||||
|
||||
# Project Status Report I
|
||||
|
||||
*2023-04-04*
|
||||
|
||||
## Participants
|
||||
|
||||
Matt Jensen
|
||||
|
||||
## Overarching Purpose
|
||||
|
||||
I hope to use a dataset of news articles to track the polarization of news over time.
|
||||
My hypothesis is that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
|
||||
I think there is a connection to be made to other statistics, such as voting polarity in Congress, income inequality, or the consolidation of media into the hands of a few.
|
||||
|
||||
## Data Source
|
||||
|
||||
To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
|
||||
I will grab the title, author, publisher, published date, URL, and related discussions for each story and store them in a .csv file.
|
||||
The site also has a concept of references, where a main, popular story may be covered by other sources.
|
||||
So there is a concept of link similarity that could be explored in this analysis too.
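
One way to make "link similarity" concrete is the Jaccard similarity between the sets of outlets that reference two stories. The sketch below is only an illustration of that idea, not a settled method for this project; the story IDs, outlet names, and the `references` mapping are all hypothetical placeholders rather than scraped data.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection size over union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: story id -> outlets that referenced the story.
references = {
    "story_a": {"outlet_1", "outlet_2", "outlet_3"},
    "story_b": {"outlet_2", "outlet_3", "outlet_4"},
    "story_c": {"outlet_5"},
}

# Stories covered by overlapping outlets score higher than unrelated ones.
sim_ab = jaccard(references["story_a"], references["story_b"])
sim_ac = jaccard(references["story_a"], references["story_c"])
```

A score near 1 would mean two stories are covered by nearly the same outlets, while a score of 0 would mean no outlet covered both.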
|
||||
|
||||
## Techniques
|
||||
|
||||
I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
|
||||
I think there is a way to test for the ideal number of clusters, i.e., the number that minimizes the clustering error.
|
||||
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
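
One caveat when picking the number of clusters: the raw within-cluster error keeps shrinking as clusters are added, so a normalized score is usually compared across values of k instead. A common choice is the silhouette score. The sketch below is a dependency-free illustration of that idea under stated assumptions: the 2-D `points` are synthetic stand-ins for article embeddings, and the helper names (`farthest_first_centers`, `kmeans`, `silhouette`) are hypothetical. In practice, scikit-learn's `KMeans` and `silhouette_score` would likely replace the hand-written parts.

```python
import math


def farthest_first_centers(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point that is farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lbl in zip(points, labels):
        clusters.setdefault(lbl, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lbl in zip(points, labels):
        own = clusters[lbl]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lbl, other in clusters.items()
            if other_lbl != lbl
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Synthetic stand-in for embeddings: three tight, well-separated groups.
offsets = [(0.0, 0.0), (-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

# Score each candidate k and keep the one with the highest silhouette.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

On real article data the peak is unlikely to be this clean, so the full curve of scores across k would be more informative than the single best value.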
|
||||
|
||||
BIN
docs/progress_spec_3.docx
Normal file
Binary file not shown.