add progress report.

add scraping for downloading and parsing.
add joining of bias dataset.
add broken links checker.
matt 2023-04-11 21:42:05 -07:00
parent b9c63414a0
commit feb3a4b8ed
9 changed files with 276 additions and 0 deletions

1
.gitignore vendored

@ -1,2 +1,3 @@
*.csv
*.swp
__pycache__

BIN
dist/jensen_577_progress_report.pdf vendored Normal file

Binary file not shown.

105
docs/progress.md Normal file

@ -0,0 +1,105 @@
# Data Mining - CSCI 577
# Project Status Report I
*2023-04-04*
## Participants
Matt Jensen
## Overarching Purpose
I hope to use a dataset of news articles to track the polarization of news over time.
I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
I think there is a connection to be made to other statistics, like voting polarity in Congress, or income inequality, or consolidation of media into the hands of a few.
## Data Source
To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
I will grab the title, author, publisher, published date, url and related discussions of each story and store them in a .csv file.
The site also has a concept of references, where a main, popular story may be covered by other sources.
So there is a concept of link similarity that could be explored in this analysis too.
## Techniques
I am unsure of which technique specifically will work best, but I believe an unsupervised clustering algorithm will serve me well.
I think there is a way to test how many clusters should exist to minimize the error.
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
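
One common way to test how many clusters should exist is the elbow heuristic: run k-means for several values of k and look for the point where the error (inertia) stops dropping sharply. The sketch below is only an illustration on toy 2-D points, not the project's pipeline. The function names, the farthest-point initialization, and the data are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point start (an assumption, chosen for repeatability)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=25):
    """Plain Lloyd's algorithm; returns (centroids, inertia)."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


def pick_elbow(inertias):
    """Return the k with the sharpest bend; inertias[i] belongs to k = i + 1."""
    bends = {
        i + 1: (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        for i in range(1, len(inertias) - 1)
    }
    return max(bends, key=bends.get)


# Toy data: three tight groups standing in for three 'viewpoints'.
points = [(x + dx, y + dy) for x, y in [(0, 0), (10, 0), (0, 10)]
          for dx in (0, 1) for dy in (0, 1)]
inertias = [kmeans(points, k)[1] for k in range(1, 6)]
best_k = pick_elbow(inertias)
print(best_k)
```

On real title embeddings the bend is rarely this clean, so a second measure such as the silhouette score would probably be worth checking alongside it.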
\newpage
# Project Status Report II
*2023-04-11*
## Participants
Matt Jensen
## Dataset Description
The dataset I will be using for my analysis has the following attributes:
- title
    - a text description of the news item.
    - discrete, nominal.
    - ~800k distinct titles.
- url
    - a text description and unique identifier for the news item.
    - discrete, nominal.
    - ~700k distinct urls.
- author
    - a text name.
    - discrete, nominal.
    - ~42k distinct authors.
- publisher
    - a text name.
    - discrete, nominal.
    - ~13k distinct outlets.
- related links
    - an adjacency matrix with the number of common links between two publishers.
    - continuous, ratio.
    - counts are less than total number of stories, obviously.
- published date
    - the date the article was published.
    - continuous, interval.
    - ~5.5k distinct dates.
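
As a rough illustration of the related-links attribute, the sketch below counts how many stories each pair of publishers is linked under, which is a sparse form of the adjacency matrix described above. The `(parent_id, publisher)` row shape and the outlet names are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def common_link_counts(rows):
    """Count, for each unordered publisher pair, the stories both are linked under.

    `rows` is an iterable of (parent_id, publisher) tuples, a hypothetical
    stand-in for the related-links table.
    """
    by_story = defaultdict(set)
    for parent_id, publisher in rows:
        by_story[parent_id].add(publisher)
    counts = Counter()
    for publishers in by_story.values():
        # Sort so that (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(publishers), 2):
            counts[(a, b)] += 1
    return counts


rows = [
    (1, "Outlet A"), (1, "Outlet B"), (1, "Outlet C"),
    (2, "Outlet A"), (2, "Outlet B"),
    (3, "Outlet C"),
]
counts = common_link_counts(rows)
print(counts[("Outlet A", "Outlet B")])
```

Storing only the non-zero pairs keeps the structure small, since most of the ~13k outlets never appear under the same story.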
In addition, I will augment the data with the following attributes:
- title word embedding
    - a vectorized form of the title from the output of an LLM or BERT model which embeds semantic meaning into the sentence.
    - continuous, nominal.
    - 800k vectors, of 768 values.
- political bias of the publisher
    - a measure of how voters feel the political leanings of the publisher map to the political parties (Democrat/Republican).
    - continuous, ordinal.
    - ~30% of the publishers are labelled in [allsides.com](https://www.allsides.com/media-bias/ratings) ratings.
- estimated viewership of the publisher
    - an estimate of the size of the audience that consumes the publisher's media.
    - continuous, ratio.
    - I still need to parse [The Future of Media Project](https://projects.iq.harvard.edu/futureofmedia/index-us-mainstream-media-ownership) data to get a good idea of this number.
- number of broken links
    - I will navigate all the links and count the number of 200, 301 and 404 status codes returned.
    - discrete, nominal.
    - size of this dataset is still unknown.
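
The broken-link count could be gathered along these lines. This is a minimal sketch using only the standard library, not the code in this commit. `head_status` and `count_status_codes` are hypothetical names, and the fetcher is passed in as a parameter so the tally can be tried without touching the network.

```python
from collections import Counter
from http.client import HTTPConnection, HTTPException, HTTPSConnection
from urllib.parse import urlsplit


def head_status(url, timeout=10.0):
    """Send one HEAD request without following redirects; return the status code."""
    parts = urlsplit(url)
    conn_cls = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        return conn.getresponse().status
    finally:
        conn.close()


def count_status_codes(urls, fetch=head_status):
    """Tally status codes; network failures are counted under 'error'."""
    tally = Counter()
    for url in urls:
        try:
            tally[fetch(url)] += 1
        except (OSError, ValueError, HTTPException):
            tally["error"] += 1
    return tally


# Offline demonstration with a fake fetcher (made-up URLs and codes).
fake = {
    "https://a.example/1": 200,
    "https://a.example/2": 404,
    "https://b.example/": 301,
}
tally = count_status_codes(fake, fetch=fake.__getitem__)
print(tally)
```

Redirects are deliberately not followed here, because a client that follows them would report the final 200 and hide the 301.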
## Purpose
I want to analyze data from the news aggregation site [memeorandum.com](https://www.memeorandum.com/) and combine it with media bias measurements from [allsides.com](https://www.allsides.com/media-bias/ratings).
My goal for the project is to cluster the data based on the word embeddings of the titles.
I will tokenize each title and use a BERT-style model to generate embeddings from the tokens.
The embedding output of a language model encodes the semantic meaning of a sentence.
Specifically, BERT models output embeddings in a 768-dimensional space.
Clustering these vectors will map from this semantic space to a lower-dimensional cluster space.
My understanding of clustering leads me to believe that this lower-dimensional space encodes meaning such as similarity.
In this way, I hope to find outlets that tend to publish similar stories and group them together.
I would guess that this lower dimensional space will reflect story quantity and political leanings.
I would expect news outlets with a similar quantity of stories and similar political leanings to be grouped together.
Another goal is to look at the political alignment over time.
I will train a classifier to predict political bias based on the word embeddings as well.
There is a concept of the [Overton Window](https://en.wikipedia.org/wiki/Overton_window), and I would be curious to know if the titles of news articles could be a proxy for the location of the Overton window over time.
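
To make the idea of grouping outlets by what they publish concrete, here is a small sketch that averages the title embeddings of each outlet and compares the averages with cosine similarity. It is a simpler stand-in for the clustering step, not the planned method. The 3-dimensional vectors are made-up placeholders for real 768-dimensional embeddings, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def outlet_profiles(titles):
    """Average the title embeddings of each outlet into one profile vector."""
    grouped = defaultdict(list)
    for outlet, embedding in titles:
        grouped[outlet].append(embedding)
    return {outlet: mean_vector(vecs) for outlet, vecs in grouped.items()}


# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
titles = [
    ("Outlet A", [1.0, 0.0, 0.0]),
    ("Outlet A", [0.8, 0.2, 0.0]),
    ("Outlet B", [0.9, 0.1, 0.0]),
    ("Outlet C", [0.0, 0.1, 0.9]),
]
profiles = outlet_profiles(titles)
sim_ab = cosine(profiles["Outlet A"], profiles["Outlet B"])
sim_ac = cosine(profiles["Outlet A"], profiles["Outlet C"])
print(sim_ab, sim_ac)
```

In this toy data, outlets A and B publish near-identical titles and score close to 1, while outlet C scores close to 0.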

BIN
docs/progress_spec_1.docx Normal file

Binary file not shown.

BIN
docs/progress_spec_2.docx Normal file

Binary file not shown.

23
src/broken_links.py Normal file

@ -0,0 +1,23 @@
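import requests
import click
from data import connect
import seaborn as sns
import matplotlib.pyplot as plt

DB = connect()

# Ratio of rows that appear once to rows that are duplicated,
# where a story is identified by its (url, outlet) pair.
DB.sql("""
    with cte as (
        select
            count(1) as cnt
        from stories
        group by url, outlet
    )
    select
        cast(sum(cnt) filter (where cnt = 1) as float)
        / sum(cnt) filter (where cnt > 1) as dups
    from cte
""").show()

# `hist` was never defined in the original script; this assumes it was
# meant to hold the per-(url, outlet) counts used in the query above.
hist = DB.sql("""
    select
        count(1) as cnt
    from stories
    group by url, outlet
""").df()

sns.histplot(x=hist['cnt'])
plt.show()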


@ -6,6 +6,15 @@ from enum import Enum
class Data(str, Enum):
    Titles = 'titles'
def data_dir():
    return Path(os.environ['DATA_MINING_DATA_DIR'])
def connect():
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
    # APP_DIR = Path(os.environ['DATA_MINING_APP_DIR'])
    DB = duckdb.connect(str(DATA_DIR / 'project.duckdb'))
    return DB
def from_db(t: Data):
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
    # APP_DIR = Path(os.environ['DATA_MINING_APP_DIR'])

46
src/join_bias.py Normal file

@ -0,0 +1,46 @@
import os
from pathlib import Path

import click
import duckdb
from data import connect
import polars as ps

DB = connect()
DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
# Recent polars releases name this keyword `separator` (formerly `sep`).
bias = ps.read_csv(DATA_DIR / 'allsides_bias.csv', separator="|")

# Fuzzy-join story counts per outlet to the bias ratings by outlet name.
# DuckDB resolves `bias` from the polars DataFrame defined above.
DB.sql("""
    with cte as (
        select
            outlet
            ,count(1) as stories
        from stories
        group by outlet
    )
    ,total as (
        select
            sum(stories) as total
        from cte
    )
    select
        cte.outlet
        ,cte.stories
        ,bias.outlet
        ,bias.lean
        ,sum(100 * (cte.stories / cast(total.total as float))) over() as rep
        ,total.total
    from cte
    join bias
        on jaro_winkler_similarity(bias.outlet, cte.outlet) > 0.9
    cross join total
""").show()

# The 50 outlets with the most stories.
outlets = DB.sql("""
    select
        outlet
        ,count(1) as stories
    from stories
    group by outlet
    order by count(1) desc
    limit 50
""")
outlets.show()

92
src/scrape.py Normal file

@ -0,0 +1,92 @@
from datetime import date, timedelta
import datetime
import requests
from pathlib import Path
import click
from tqdm import tqdm
from data import data_dir
from lxml import etree
import pandas as pd
@click.group()
def cli():
    ...


@cli.command()
@click.option('-o', 'output_dir', type=Path, default=data_dir() / "memeorandum")
def download(output_dir):
    # Create the target directory so the first write does not fail.
    output_dir.mkdir(parents=True, exist_ok=True)
    day = timedelta(days=1)
    cur = date(2005, 10, 1)
    end = date.today()
    dates = []
    while cur <= end:
        dates.append(cur)
        cur = cur + day
    date_iter = tqdm(dates, postfix="test")
    for i in date_iter:
        date_iter.set_postfix_str(f"{i}")
        save_as = output_dir / f"{i.strftime('%y-%m-%d')}.html"
        # Skip days that are already on disk so the crawl can be resumed.
        if save_as.exists():
            continue
        url = f"https://www.memeorandum.com/{i.strftime('%y%m%d')}/h2000"
        r = requests.get(url)
        with open(save_as, 'w') as f:
            f.write(r.text)


@cli.command()
@click.option('-d', '--directory', type=Path, default=data_dir() / "memeorandum")
@click.option('-o', '--output_dir', type=Path, default=data_dir())
def parse(directory, output_dir):
    # The original reassigned `directory` here, which silently ignored -d.
    parser = etree.HTMLParser()
    pages = [f for f in directory.glob("*.html")]
    published = []
    others = []
    # page = pages[0]
    page_iter = tqdm(pages, postfix="starting")
    for page in page_iter:
        page_iter.set_postfix_str(f"{page}")
        # File names follow the '%y-%m-%d' pattern written by `download`.
        date = datetime.datetime.strptime(page.stem, '%y-%m-%d')
        tree = etree.parse(str(page), parser)
        root = tree.getroot()
        items = root.xpath("//div[contains(@class, 'item')]")
        for item in items:
            out = dict()
            citation = item.xpath('./cite')
            if not citation:
                continue
            # The cite text looks like "author / "; keep the part before the slash.
            author = citation[0]
            if author.text:
                author = ''.join(author.text.split('/')[:-1]).strip()
            else:
                author = ''
            out['author'] = author
            url = citation[0].getchildren()[0].get('href')
            publisher = citation[0].getchildren()[0].text
            out['publisher'] = publisher
            out['publisher_url'] = url
            title = item.xpath('.//strong/a')[0].text
            out['title'] = title
            # Note: Python salts str hashes per process, so these ids are
            # only consistent within a single run of this command.
            item_id = hash((title, page.stem, url))
            out['id'] = item_id
            published.append(out)
            related = item.xpath(".//span[contains(@class, 'mls')]/a")
            # relation = related[0]
            for relation in related:
                another = dict()
                another['url'] = relation.get('href')
                another['publisher'] = relation.text
                another['parent_id'] = item_id
                others.append(another)
    df = pd.DataFrame(published)
    df.to_csv(output_dir / 'stories.csv', sep='|', index=False)
    df = pd.DataFrame(others)
    df.to_csv(output_dir / 'related.csv', sep='|', index=False)


if __name__ == "__main__":
    cli()