add progress report.
add scraping for downloading and parsing. add joining of bias dataset. add broken links checker.
This commit is contained in:
parent b9c63414a0
commit feb3a4b8ed

@ -1,2 +1,3 @@
*.csv
*.swp
__pycache__
Binary file not shown.

@ -0,0 +1,105 @@

# Data Mining - CSCI 577

# Project Status Report I

*2023-04-04*

## Participants

Matt Jensen

## Overarching Purpose

I hope to use a dataset of news articles to track the polarization of news over time.
I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
I think there is a connection to be made to other statistics, like voting polarity in Congress, income inequality, or the consolidation of media into the hands of a few owners.

## Data Source

To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
I will grab the title, author, publisher, published date, url and related discussions for each story and store them in a .csv.
The site also has a concept of references, where a main, popular story may be covered by other sources.
So there is a concept of link similarity that could be explored in this analysis too.

## Techniques

I am unsure which technique will work best, but I believe an unsupervised clustering algorithm will serve me well.
I think there is a way to test how many clusters should exist in order to minimize the error.
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
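
One way to pick the number of clusters is to sweep over candidate values and score each fit; below is a minimal sketch of that idea, assuming the title embeddings are already loaded as an array (the KMeans/silhouette combination is just one option, not a settled choice):

```python
# Minimal sketch: sweep k for KMeans and score each clustering with the
# silhouette coefficient; `embeddings` stands in for the real title vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(1000, 768)  # placeholder for the (n_titles, 768) title embeddings

scores = {}
for k in range(2, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    scores[k] = silhouette_score(embeddings, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```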

\newpage

# Project Status Report II

*2023-04-11*

## Participants

Matt Jensen

## Dataset Description

The dataset I will be using for my analysis has the following attributes:

- title
  - a text description of the news item.
  - discrete, nominal.
  - ~800k distinct titles.
- url
  - a text description and unique identifier for the news item.
  - discrete, nominal.
  - ~700k distinct urls.
- author
  - a text name.
  - discrete, nominal.
  - ~42k distinct authors.
- publisher
  - a text name.
  - discrete, nominal.
  - ~13k distinct outlets.
- related links
  - an adjacency matrix with the number of common links between two publishers (see the sketch after this list).
  - continuous, ratio.
  - counts are less than the total number of stories, obviously.
- published date
  - the date the article was published.
  - continuous, interval.
  - ~5.5k distinct dates.
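
As an illustration of the related links attribute, here is a minimal sketch that builds a publisher-by-publisher co-link count matrix from the stories.csv and related.csv files produced by the parser later in this commit; the join is only one possible definition of "common links":

```python
# Minimal sketch: count how often a publisher shows up as a related source on
# another publisher's story, using the files written by the parse command below.
import pandas as pd

stories = pd.read_csv('stories.csv', sep='|')   # has columns: id, publisher, ...
related = pd.read_csv('related.csv', sep='|')   # has columns: parent_id, publisher, url

# pair each story's main publisher with every related publisher
pairs = related.merge(
    stories[['id', 'publisher']].rename(columns={'publisher': 'main_publisher'}),
    left_on='parent_id', right_on='id',
)

# rows = main publisher, columns = related publisher, values = co-link counts
adjacency = pd.crosstab(pairs['main_publisher'], pairs['publisher'])
print(adjacency.shape)
```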

In addition, I will augment the data with the following attributes:

- title word embedding
  - a vectorized form of the title from the output of an LLM or BERT model, which embeds the semantic meaning of the sentence.
  - continuous, nominal.
  - 800k vectors of 768 values each.
- political bias of the publisher
  - a measure of how voters feel the political leanings of the publisher map to the political parties (Democrat/Republican).
  - continuous, ordinal.
  - ~30% of the publishers are labelled in [allsides.com](https://www.allsides.com/media-bias/ratings) ratings.
- estimated viewership of the publisher
  - an estimate of the size of the audience that consumes the publisher's media.
  - continuous, ratio.
  - I still need to parse [The Future of Media Project](https://projects.iq.harvard.edu/futureofmedia/index-us-mainstream-media-ownership) data to get a good idea of this number.
- number of broken links
  - I will navigate all the links and count the number of 200, 301 and 404 status codes returned (see the sketch after this list).
  - discrete, nominal.
  - the size of this dataset is still unknown.
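
The broken links checker mentioned in the commit message could be as simple as the sketch below; the stories.csv path and publisher_url column follow the parser later in this commit, and issuing HEAD requests without following redirects is just one way to surface the 200/301/404 counts:

```python
# Minimal sketch: request each story url once and tally the returned status codes.
from collections import Counter
import pandas as pd
import requests

stories = pd.read_csv('stories.csv', sep='|')
counts = Counter()
for url in stories['publisher_url'].dropna().unique():
    try:
        r = requests.head(url, allow_redirects=False, timeout=10)
        counts[r.status_code] += 1
    except requests.RequestException:
        counts['error'] += 1
print(counts)
```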

## Purpose

I want to analyze data from the news aggregation site [memeorandum.com](https://www.memeorandum.com/) and combine it with media bias measurements from [allsides.com](https://www.allsides.com/media-bias/ratings).
My goal for the project is to cluster the data based on the word embeddings of the titles.
I will tokenize each title and use a BERT-style model to generate word embeddings from the tokens.

Word embeddings output by language models encode the semantic meaning of sentences.
Specifically, BERT models output embeddings in a 768-dimensional space.
Clustering these vectors will map from this semantic space to a lower-dimensional cluster space.
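
A minimal sketch of the embedding step, using the sentence-transformers library and a generic pretrained encoder as a stand-in for whichever model I end up choosing:

```python
# Minimal sketch: embed story titles with a pretrained BERT-style sentence encoder.
import pandas as pd
from sentence_transformers import SentenceTransformer

stories = pd.read_csv('stories.csv', sep='|')      # written by the parse command below
model = SentenceTransformer('all-mpnet-base-v2')   # any 768-dimensional sentence encoder works here
embeddings = model.encode(stories['title'].fillna('').tolist(), show_progress_bar=True)
print(embeddings.shape)  # (n_titles, 768)
```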

My understanding of clustering leads me to believe that this lower-dimensional space encodes notions like similarity.
In this way, I hope to find outlets that tend to publish similar stories and group them together.
I would guess that this lower-dimensional space will reflect story quantity and political leanings.
I would expect news outlets with a similar quantity of stories and similar political leanings to be grouped together.
Another goal is to look at political alignment over time.
I will train a classifier to predict political bias based on the word embeddings as well.
There is a concept of the [Overton Window](https://en.wikipedia.org/wiki/Overton_window), and I would be curious to know whether the titles of news articles could be a proxy for the location of that window over time.
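
A minimal sketch of that classifier, with placeholder arrays standing in for the title embeddings and the per-story allsides lean labels; logistic regression is only a baseline choice:

```python
# Minimal sketch: predict a story's publisher lean from its title embedding.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# placeholders: real inputs would be the embeddings from the encoding step
# and a lean label (e.g. left / center / right) joined in from allsides ratings
embeddings = np.random.rand(1000, 768)
lean = np.random.choice(['left', 'center', 'right'], size=1000)

X_train, X_test, y_train, y_test = train_test_split(embeddings, lean, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```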
Binary file not shown.
Binary file not shown.

@ -0,0 +1,23 @@
import requests
import click
from data import connect
import seaborn as sns
import matplotlib.pyplot as plt

DB = connect()

# rows with a unique url/outlet pair relative to rows that share one with another story
DB.sql("""
    with cte as (
        select
            count(1) as cnt
        from stories
        group by url, outlet
    )
    select
        cast(sum(cnt) filter (where cnt = 1) as float)
        / sum(cnt) filter (where cnt > 1) as dups
    from cte
""").show()

# distribution of how many times each url/outlet pair appears
hist = DB.sql("""
    select
        count(1) as cnt
    from stories
    group by url, outlet
""").df()

sns.histplot(x=hist['cnt'])
plt.show()
@ -6,6 +6,15 @@ from enum import Enum
class Data(str, Enum):
    Titles = 'titles'

def data_dir():
    return Path(os.environ['DATA_MINING_DATA_DIR'])

def connect():
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
    # APP_DIR = Path(os.environ['DATA_MINING_APP_DIR'])
    DB = duckdb.connect(str(DATA_DIR / 'project.duckdb'))
    return DB

def from_db(t: Data):
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
    # APP_DIR = Path(os.environ['DATA_MINING_APP_DIR'])

@ -0,0 +1,46 @@

import os
from pathlib import Path

import click
import duckdb
from data import connect
import polars as ps

DB = connect()
DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
bias = ps.read_csv(DATA_DIR / 'allsides_bias.csv', sep="|")

# fuzzy-join story counts per outlet to the allsides bias table
DB.sql("""
    with cte as (
        select
            outlet
            ,count(1) as stories
        from stories
        group by outlet
    )
    ,total as (
        select
            sum(stories) as total
        from cte
    )
    select
        cte.outlet
        ,cte.stories
        ,bias.outlet
        ,bias.lean
        ,sum(100 * (cte.stories / cast(total.total as float))) over() as rep
        ,total.total
    from cte
    join bias
        on jaro_winkler_similarity(bias.outlet, cte.outlet) > 0.9
    cross join total
""").show()

# top 50 outlets by story count
DB.sql("""
    select
        outlet
        ,count(1) as stories
    from stories
    group by outlet
    order by count(1) desc
    limit 50
""").show()

@ -0,0 +1,92 @@

from datetime import date, timedelta
import datetime
import requests
from pathlib import Path
import click
from tqdm import tqdm
from data import data_dir
from lxml import etree
import pandas as pd


@click.group()
def cli():
    ...


@cli.command()
@click.option('-o', 'output_dir', type=Path, default=data_dir() / "memeorandum")
def download(output_dir):
    # fetch one archive page per day from 2005-10-01 through today
    output_dir.mkdir(parents=True, exist_ok=True)
    day = timedelta(days=1)
    cur = date(2005, 10, 1)
    end = date.today()
    dates = []
    while cur <= end:
        dates.append(cur)
        cur = cur + day
    date_iter = tqdm(dates, postfix="test")
    for i in date_iter:
        date_iter.set_postfix_str(f"{i}")
        save_as = output_dir / f"{i.strftime('%y-%m-%d')}.html"
        if save_as.exists():
            continue
        url = f"https://www.memeorandum.com/{i.strftime('%y%m%d')}/h2000"
        r = requests.get(url)
        with open(save_as, 'w') as f:
            f.write(r.text)


@cli.command()
@click.option('-d', '--directory', type=Path, default=data_dir() / "memeorandum")
@click.option('-o', '--output_dir', type=Path, default=data_dir())
def parse(directory, output_dir):
    # extract each day's stories and their related links from the downloaded pages
    parser = etree.HTMLParser()
    pages = [f for f in directory.glob("*.html")]
    published = []
    others = []
    # page = pages[0]
    page_iter = tqdm(pages, postfix="starting")
    for page in page_iter:
        page_iter.set_postfix_str(f"{page}")
        date = datetime.datetime.strptime(page.stem, '%y-%m-%d')
        tree = etree.parse(str(page), parser)
        root = tree.getroot()
        items = root.xpath("//div[contains(@class, 'item')]")

        for item in items:
            out = dict()
            citation = item.xpath('./cite')
            if not citation:
                continue
            author = citation[0]
            if author.text:
                author = ''.join(author.text.split('/')[:-1]).strip()
            else:
                author = ''
            out['author'] = author
            url = citation[0].getchildren()[0].get('href')
            publisher = citation[0].getchildren()[0].text
            out['publisher'] = publisher
            out['publisher_url'] = url
            title = item.xpath('.//strong/a')[0].text
            out['title'] = title
            out['published_date'] = date  # date parsed from the page filename
            item_id = hash((title, page.stem, url))
            out['id'] = item_id
            published.append(out)

            related = item.xpath(".//span[contains(@class, 'mls')]/a")
            # relation = related[0]
            for relation in related:
                another = dict()
                another['url'] = relation.get('href')
                another['publisher'] = relation.text
                another['parent_id'] = item_id
                others.append(another)
    df = pd.DataFrame(published)
    df.to_csv(output_dir / 'stories.csv', sep='|', index=False)
    df = pd.DataFrame(others)
    df.to_csv(output_dir / 'related.csv', sep='|', index=False)


if __name__ == "__main__":
    cli()