add progress report.
add scraping for downloading and parsing. add joining of bias dataset. add broken links checker.
This commit is contained in:
parent b9c63414a0
commit feb3a4b8ed

@ -1,2 +1,3 @@
*.csv
*.swp
__pycache__
Binary file not shown.

@ -0,0 +1,105 @@

# Data Mining - CSCI 577

# Project Status Report I

*2023-04-04*

## Participants

Matt Jensen

## Overarching Purpose

I hope to use a dataset of news articles to track the polarization of news over time.
I have a hypothesis that news has become more polarized superficially, but has actually converged into only two dominant viewpoints.
I think there is a connection to be made to other statistics, like voting polarity in Congress, income inequality, or the consolidation of media into the hands of a few owners.

## Data Source

To test this thesis, I will crawl the archives of [memeorandum.com](https://www.memeorandum.com/) for news stories from 2006 onward.
I will grab the title, author, publisher, published date, url and related discussions for each story and store them in a .csv.
The site also has a concept of references, where a main, popular story may be covered by other sources.
So there is a concept of link similarity that could be explored in this analysis too.

## Techniques

I am unsure which technique will work best, but I believe an unsupervised clustering algorithm will serve me well.
I think there is a way to test how many clusters should exist in order to minimize the error.
This could be a good proxy for how many 'viewpoints' are allowed in 'mainstream' news media.
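
One way to pick the number of clusters is to sweep over candidate values and score each fit; below is a minimal sketch of that idea, assuming the title embeddings are already loaded as an array (the KMeans/silhouette combination is just one option, not a settled choice):

```python
# Minimal sketch: sweep k for KMeans and score each clustering with the
# silhouette coefficient; `embeddings` stands in for the real title vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(1000, 768)  # placeholder for the (n_titles, 768) title embeddings

scores = {}
for k in range(2, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    scores[k] = silhouette_score(embeddings, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```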

\newpage

# Project Status Report II

*2023-04-11*

## Participants

Matt Jensen

## Dataset Description

The dataset I will be using for my analysis has the following attributes:

- title
  - a text description of the news item.
  - discrete, nominal.
  - ~800k distinct titles.
- url
  - a text description and unique identifier for the news item.
  - discrete, nominal.
  - ~700k distinct urls.
- author
  - a text name.
  - discrete, nominal.
  - ~42k distinct authors.
- publisher
  - a text name.
  - discrete, nominal.
  - ~13k distinct outlets.
- related links
  - an adjacency matrix with the number of common links between two publishers (see the sketch after this list).
  - continuous, ratio.
  - counts are less than the total number of stories, obviously.
- published date
  - the date the article was published.
  - continuous, interval.
  - ~5.5k distinct dates.
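
As an illustration of the related links attribute, here is a minimal sketch that builds a publisher-by-publisher co-link count matrix from the stories.csv and related.csv files produced by the parser later in this commit; the join is only one possible definition of "common links":

```python
# Minimal sketch: count how often a publisher shows up as a related source on
# another publisher's story, using the files written by the parse command below.
import pandas as pd

stories = pd.read_csv('stories.csv', sep='|')   # has columns: id, publisher, ...
related = pd.read_csv('related.csv', sep='|')   # has columns: parent_id, publisher, url

# pair each story's main publisher with every related publisher
pairs = related.merge(
    stories[['id', 'publisher']].rename(columns={'publisher': 'main_publisher'}),
    left_on='parent_id', right_on='id',
)

# rows = main publisher, columns = related publisher, values = co-link counts
adjacency = pd.crosstab(pairs['main_publisher'], pairs['publisher'])
print(adjacency.shape)
```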

In addition, I will augment the data with the following attributes:

- title word embedding
  - a vectorized form of the title from the output of an LLM or BERT model, which embeds the semantic meaning of the sentence.
  - continuous, nominal.
  - 800k vectors of 768 values each.
- political bias of the publisher
  - a measure of how voters feel the political leanings of the publisher map to the political parties (Democrat/Republican).
  - continuous, ordinal.
  - ~30% of the publishers are labelled in [allsides.com](https://www.allsides.com/media-bias/ratings) ratings.
- estimated viewership of the publisher
  - an estimate of the size of the audience that consumes the publisher's media.
  - continuous, ratio.
  - I still need to parse [The Future of Media Project](https://projects.iq.harvard.edu/futureofmedia/index-us-mainstream-media-ownership) data to get a good idea of this number.
- number of broken links
  - I will navigate all the links and count the number of 200, 301 and 404 status codes returned (see the sketch after this list).
  - discrete, nominal.
  - the size of this dataset is still unknown.
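
The broken links checker mentioned in the commit message could be as simple as the sketch below; the stories.csv path and publisher_url column follow the parser later in this commit, and issuing HEAD requests without following redirects is just one way to surface the 200/301/404 counts:

```python
# Minimal sketch: request each story url once and tally the returned status codes.
from collections import Counter
import pandas as pd
import requests

stories = pd.read_csv('stories.csv', sep='|')
counts = Counter()
for url in stories['publisher_url'].dropna().unique():
    try:
        r = requests.head(url, allow_redirects=False, timeout=10)
        counts[r.status_code] += 1
    except requests.RequestException:
        counts['error'] += 1
print(counts)
```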

## Purpose

I want to analyze data from the news aggregation site [memeorandum.com](https://www.memeorandum.com/) and combine it with media bias measurements from [allsides.com](https://www.allsides.com/media-bias/ratings).
My goal for the project is to cluster the data based on the word embeddings of the titles.
I will tokenize each title and use a BERT-style model to generate word embeddings from the tokens.

Word embeddings output by language models encode the semantic meaning of sentences.
Specifically, BERT models output embeddings in a 768-dimensional space.
Clustering these vectors will map from this semantic space to a lower-dimensional cluster space.
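
A minimal sketch of the embedding step, using the sentence-transformers library and a generic pretrained encoder as a stand-in for whichever model I end up choosing:

```python
# Minimal sketch: embed story titles with a pretrained BERT-style sentence encoder.
import pandas as pd
from sentence_transformers import SentenceTransformer

stories = pd.read_csv('stories.csv', sep='|')      # written by the parse command below
model = SentenceTransformer('all-mpnet-base-v2')   # any 768-dimensional sentence encoder works here
embeddings = model.encode(stories['title'].fillna('').tolist(), show_progress_bar=True)
print(embeddings.shape)  # (n_titles, 768)
```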

My understanding of clustering leads me to believe that this lower-dimensional space encodes notions like similarity.
In this way, I hope to find outlets that tend to publish similar stories and group them together.
I would guess that this lower-dimensional space will reflect story quantity and political leanings.
I would expect news outlets with a similar quantity of stories and similar political leanings to be grouped together.
Another goal is to look at political alignment over time.
I will train a classifier to predict political bias based on the word embeddings as well.
There is a concept of the [Overton Window](https://en.wikipedia.org/wiki/Overton_window), and I would be curious to know whether the titles of news articles could be a proxy for the location of that window over time.
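
A minimal sketch of that classifier, with placeholder arrays standing in for the title embeddings and the per-story allsides lean labels; logistic regression is only a baseline choice:

```python
# Minimal sketch: predict a story's publisher lean from its title embedding.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# placeholders: real inputs would be the embeddings from the encoding step
# and a lean label (e.g. left / center / right) joined in from allsides ratings
embeddings = np.random.rand(1000, 768)
lean = np.random.choice(['left', 'center', 'right'], size=1000)

X_train, X_test, y_train, y_test = train_test_split(embeddings, lean, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```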
Binary file not shown.
Binary file not shown.

@ -0,0 +1,23 @@
import requests
import click
from data import connect
import seaborn as sns
import matplotlib.pyplot as plt

DB = connect()

# rows with a unique url/outlet pair relative to rows that share one with another story
DB.sql("""
    with cte as (
        select
            count(1) as cnt
        from stories
        group by url, outlet
    )
    select
        cast(sum(cnt) filter (where cnt = 1) as float)
        / sum(cnt) filter (where cnt > 1) as dups
    from cte
""").show()

# distribution of how many times each url/outlet pair appears
hist = DB.sql("""
    select
        count(1) as cnt
    from stories
    group by url, outlet
""").df()

sns.histplot(x=hist['cnt'])
plt.show()
@ -6,6 +6,15 @@ from enum import Enum
class Data(str, Enum):
    Titles = 'titles'

def data_dir():
    return Path(os.environ['DATA_MINING_DATA_DIR'])

def connect():
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
    # APP_DIR = Path(os.environ['DATA_MINING_APP_DIR'])
    DB = duckdb.connect(str(DATA_DIR / 'project.duckdb'))
    return DB

def from_db(t: Data):
    DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
    # APP_DIR = Path(os.environ['DATA_MINING_APP_DIR'])

@ -0,0 +1,46 @@

import os
from pathlib import Path

import click
import duckdb
from data import connect
import polars as ps

DB = connect()
DATA_DIR = Path(os.environ['DATA_MINING_DATA_DIR'])
bias = ps.read_csv(DATA_DIR / 'allsides_bias.csv', sep="|")

# fuzzy-join story counts per outlet to the allsides bias table
DB.sql("""
    with cte as (
        select
            outlet
            ,count(1) as stories
        from stories
        group by outlet
    )
    ,total as (
        select
            sum(stories) as total
        from cte
    )
    select
        cte.outlet
        ,cte.stories
        ,bias.outlet
        ,bias.lean
        ,sum(100 * (cte.stories / cast(total.total as float))) over() as rep
        ,total.total
    from cte
    join bias
        on jaro_winkler_similarity(bias.outlet, cte.outlet) > 0.9
    cross join total
""").show()

# top 50 outlets by story count
DB.sql("""
    select
        outlet
        ,count(1) as stories
    from stories
    group by outlet
    order by count(1) desc
    limit 50
""").show()

@ -0,0 +1,92 @@

from datetime import date, timedelta
import datetime
import requests
from pathlib import Path
import click
from tqdm import tqdm
from data import data_dir
from lxml import etree
import pandas as pd


@click.group()
def cli():
    ...


@cli.command()
@click.option('-o', 'output_dir', type=Path, default=data_dir() / "memeorandum")
def download(output_dir):
    # fetch one archive page per day from 2005-10-01 through today
    output_dir.mkdir(parents=True, exist_ok=True)
    day = timedelta(days=1)
    cur = date(2005, 10, 1)
    end = date.today()
    dates = []
    while cur <= end:
        dates.append(cur)
        cur = cur + day
    date_iter = tqdm(dates, postfix="test")
    for i in date_iter:
        date_iter.set_postfix_str(f"{i}")
        save_as = output_dir / f"{i.strftime('%y-%m-%d')}.html"
        if save_as.exists():
            continue
        url = f"https://www.memeorandum.com/{i.strftime('%y%m%d')}/h2000"
        r = requests.get(url)
        with open(save_as, 'w') as f:
            f.write(r.text)


@cli.command()
@click.option('-d', '--directory', type=Path, default=data_dir() / "memeorandum")
@click.option('-o', '--output_dir', type=Path, default=data_dir())
def parse(directory, output_dir):
    # extract each day's stories and their related links from the downloaded pages
    parser = etree.HTMLParser()
    pages = [f for f in directory.glob("*.html")]
    published = []
    others = []
    # page = pages[0]
    page_iter = tqdm(pages, postfix="starting")
    for page in page_iter:
        page_iter.set_postfix_str(f"{page}")
        date = datetime.datetime.strptime(page.stem, '%y-%m-%d')
        tree = etree.parse(str(page), parser)
        root = tree.getroot()
        items = root.xpath("//div[contains(@class, 'item')]")

        for item in items:
            out = dict()
            citation = item.xpath('./cite')
            if not citation:
                continue
            author = citation[0]
            if author.text:
                author = ''.join(author.text.split('/')[:-1]).strip()
            else:
                author = ''
            out['author'] = author
            url = citation[0].getchildren()[0].get('href')
            publisher = citation[0].getchildren()[0].text
            out['publisher'] = publisher
            out['publisher_url'] = url
            title = item.xpath('.//strong/a')[0].text
            out['title'] = title
            out['published_date'] = date  # date parsed from the page filename
            item_id = hash((title, page.stem, url))
            out['id'] = item_id
            published.append(out)

            related = item.xpath(".//span[contains(@class, 'mls')]/a")
            # relation = related[0]
            for relation in related:
                another = dict()
                another['url'] = relation.get('href')
                another['publisher'] = relation.text
                another['parent_id'] = item_id
                others.append(another)
    df = pd.DataFrame(published)
    df.to_csv(output_dir / 'stories.csv', sep='|', index=False)
    df = pd.DataFrame(others)
    df.to_csv(output_dir / 'related.csv', sep='|', index=False)


if __name__ == "__main__":
    cli()