add final notecard
\documentclass[sigconf,authorversion,nonacm]{acmart}

\usepackage{bbold}
\usepackage{geometry}
\geometry{margin=0.3in}

\nonstopmode

\begin{document}

\title{CSCI577 Final}

\maketitle

\section{Classification}
\paragraph{Definition}
Constructing a method of classifying new instances using the information in a training set.

\begin{itemize}
\item Naive Bayes (conditional probabilities)
\item Decision Trees
\item Logistic Regression
\item Neural Networks
\end{itemize}

\subsection{TDIDT}
\paragraph{Definition}
Top-Down Induction of Decision Trees

\subsection{Adequacy}
\paragraph{Definition}
No two instances with the same values of all the attributes may belong to different classes.

Naive Bayes can still be used when this condition does not hold, since it still produces a probability for each class.

kNN can still be used, since multiple data points at the same location in Euclidean space still behave as expected (they simply all contribute to the vote).

\paragraph{Application}
The TDIDT algorithm is negatively affected, since it relies on the adequacy condition.

\paragraph{Algorithm}
Until no more splitting is possible (a Python sketch of the recursion follows the list):
\begin{itemize}
\item IF all the instances in the training set belong to the same class, THEN return the value of the class.
\item ELSE
\begin{enumerate}
\item Select an attribute $A$ to split on.
\item Sort the instances in the training set into subsets, one for each value of attribute $A$.
\item Return a tree with one branch for each non-empty subset.
\begin{itemize}
\item Each branch has a descendant subtree or a class value produced by applying the algorithm recursively.
\end{itemize}
\end{enumerate}
\end{itemize}

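A minimal sketch of this recursion in Python (mine, not from the course materials), assuming categorical attributes, instances given as (attribute-dict, label) pairs, and a caller-supplied \texttt{select\_attribute} splitting criterion such as entropy or Gini:

\begin{verbatim}
from collections import Counter

def tdidt(instances, attributes, select_attribute):
    """instances: list of (attr_dict, label) pairs."""
    labels = [label for _, label in instances]
    # All instances belong to the same class: return that class value.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        # Clash set: fall back to majority voting (see Clashes below).
        return Counter(labels).most_common(1)[0][0]
    # Select an attribute A to split on.
    A = select_attribute(instances, attributes)
    # Sort the instances into subsets, one for each value of A.
    subsets = {}
    for attrs, label in instances:
        subsets.setdefault(attrs[A], []).append((attrs, label))
    # One branch per non-empty subset, built recursively.
    remaining = [a for a in attributes if a != A]
    return {(A, value): tdidt(subset, remaining, select_attribute)
            for value, subset in subsets.items()}
\end{verbatim}
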
\subsection{Overfitting}
Understand the concept of overfitting and be able to tell when a classification system has overfit.
\paragraph{Definition}
A classifier overfits if it generates a decision tree (or other model) too well adapted to the training set.

It performs well on the training set but not well on other data.

Some overfitting is inevitable.

Remedies:
\begin{itemize}
\item Adjust a decision tree while it is being generated: pre-pruning
\item Modify the tree after creation: post-pruning
\end{itemize}

\subsubsection{Clashes}
Two (or more) instances of a training set have identical attribute values but different classifications.

This is especially a problem for TDIDT's adequacy condition.

\paragraph{Stems from}
\begin{itemize}
\item The classification was incorrectly recorded.
\item The recorded attributes are insufficient; more attributes would be needed, which is normally impossible.
\end{itemize}

\paragraph{Solutions}
\begin{itemize}
\item Discard the branch to the clashing node from the node above.
\item Assign the clashing instances the majority class (majority voting).
\end{itemize}

\subsubsection{Prepruning}
When pre-pruning, accuracy on the training set may be reduced, but the pruned classifier may do better on test (and subsequent) data than an unpruned one.
\begin{enumerate}
\item Test whether a termination condition applies.
\begin{itemize}
\item If so, the current subset is treated as a `clash set'.
\item Resolve it by `delete branch', `majority voting', etc.
\end{itemize}
\item Two common termination conditions (sketched in code below):
\begin{itemize}
\item Size cutoff: prune if the subset has fewer than $X$ instances.
\item Maximum depth: prune if the length of the branch exceeds $Y$.
\end{itemize}
\end{enumerate}

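A minimal sketch of these two termination tests in Python; the threshold values stand in for $X$ and $Y$ and are hypothetical:

\begin{verbatim}
def should_preprune(subset, depth, min_size=10, max_depth=5):
    """Pre-pruning test: treat the subset as a clash set if either fires."""
    size_cutoff = len(subset) < min_size   # fewer than X instances
    depth_cutoff = depth > max_depth       # branch length exceeds Y
    return size_cutoff or depth_cutoff
\end{verbatim}
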
\subsubsection{PostPruning}
\begin{enumerate}
\item Look for non-leaf nodes whose descendant branches all have length 1 (all of their children are leaves).
\item Only such nodes (e.g. nodes G and D in the example tree from lecture) are candidates for pruning (consolidation).
\end{enumerate}

\subsection{Discretizing}
\subsubsection{Equal Width Intervals}
\subsubsection{Pseudo Attributes}
\subsubsection{Processing Sorted Instance Table}
\subsubsection{ChiMerge}

\paragraph{Rationale}
Initially, each distinct value of a numerical attribute $A$ is considered to be one interval.

$\chi^{2}$ tests are performed for every pair of adjacent intervals.

The pair of adjacent intervals with the lowest $\chi^{2}$ value is merged, because a low $\chi^{2}$ value for a pair indicates similar class distributions.

This merging process proceeds recursively until a predefined stopping criterion is met.

For two adjacent intervals, if the $\chi^{2}$ test concludes that the class is independent of the interval, the intervals should be merged.

If the $\chi^{2}$ test concludes that they are not independent, i.e. the difference in relative class frequencies is statistically significant, the two intervals should remain separate.

\paragraph{Calculation}
To calculate the expected value $E$ for any combination of row and class:
\begin{enumerate}
\item Take the product of the corresponding row sum and column sum.
\item Divide by the grand total of the observed values for the two rows.
\end{enumerate}

Then:
\begin{enumerate}
\item Using the observed and expected values, calculate, for each of the cells,
\begin{math}
\frac{(O - E)^{2}}{E}
\end{math}
\item Sum the contributions of all the cells to obtain the $\chi^{2}$ value.
\end{enumerate}

When $\chi^{2}$ exceeds the threshold, the hypothesis of independence is rejected.

A small value of $\chi^{2}$ supports the hypothesis.

Important adjustment: when $E < 0.5$, replace it with $0.5$. (A small worked example follows.)

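As a worked example with purely hypothetical counts: if a cell's row sum is 10, its column sum is 16, and the two adjacent rows contain 40 observations in total, then
\begin{displaymath}
E = \frac{10 \times 16}{40} = 4,
\qquad
\frac{(O - E)^{2}}{E} = \frac{(6 - 4)^{2}}{4} = 1
\end{displaymath}
for an observed count of $O = 6$ in that cell.
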
\begin{enumerate}
\item Select the smallest $\chi^{2}$ value.
\item Compare it to the threshold.
\item If it falls below the threshold, merge the row with the row immediately below it.
\item Recalculate $\chi^{2}$; this only needs to be done for rows adjacent to the recently merged one (see the sketch below).
\end{enumerate}

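A minimal Python sketch of one merging pass (mine, not course code), assuming each interval is represented as a list of per-class observed counts; the interval boundaries and the choice of threshold are left out:

\begin{verbatim}
def chi2_pair(row_a, row_b):
    """Chi-square statistic for two adjacent intervals (rows of class counts)."""
    col_sums = [a + b for a, b in zip(row_a, row_b)]
    total = sum(col_sums)
    chi2 = 0.0
    for row in (row_a, row_b):
        row_sum = sum(row)
        for observed, col_sum in zip(row, col_sums):
            expected = max(row_sum * col_sum / total, 0.5)  # E < 0.5 adjustment
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def chimerge_pass(rows, threshold):
    """Merge the adjacent pair with the smallest chi-square, if below threshold."""
    if len(rows) < 2:
        return rows
    scores = [chi2_pair(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    i = min(range(len(scores)), key=scores.__getitem__)
    if scores[i] < threshold:
        merged = [a + b for a, b in zip(rows[i], rows[i + 1])]
        rows = rows[:i] + [merged] + rows[i + 2:]
    return rows
\end{verbatim}
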
A large number of intervals does little to solve the problem of discretization.

Just one interval cannot contribute to a decision-making process.

Possible adjustments: modify the significance level that the hypothesis of independence must pass before an interval merge is triggered, or set a minimum and a maximum number of intervals.

\subsection{Entropy}
\paragraph{Definition}
Entropy measures the uncertainty that arises when there is more than one possible classification.

It is used for selecting splitting attributes in decision trees.

Splitting on entropy tends to minimize the complexity (number of branches) of the decision tree.

There is no guarantee that using entropy will always lead to a small decision tree.

It is also used for feature reduction: calculate the information gain for each attribute in the original dataset, discard all attributes that do not meet a specified criterion, and pass the revised dataset to the preferred classification algorithm.

Entropy has a bias towards selecting attributes with a large number of values.

\paragraph{Calculation}
To decide whether to split on an attribute:
\begin{enumerate}
\item Find the entropy of the data in each of the branches after the split.
\item Take the weighted average of those entropies and use it to find the information gain.
\item Select the attribute split with the highest information gain (lowest new entropy).
\end{enumerate}

\begin{itemize}
\item Entropy is always positive or zero.
\item Entropy is zero when some $p_{i} = 1$, i.e. when all instances have the same class.
\item Entropy is at its maximum value ($\log_{2} K$ for $K$ classes) when all classes are evenly distributed.
\end{itemize}

If there are $K$ classes, we can denote the proportion of instances with classification $i$ by $p_{i}$ for $i = 1, \dots, K$:
$p_{i} = \frac{\text{number of instances of class } i}{\text{total number of instances}}$

\begin{displaymath}
\text{Entropy} = E = -\sum_{i=1}^{K} p_{i} \log_{2} p_{i}
\end{displaymath}

where $K$ is the number of non-empty classes and $p_{i} = \frac{\lvert i \rvert}{N}$, the number of instances in class $i$ over the total number of instances $N$.

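A direct Python transcription of this formula, as a sketch (the function name is mine):

\begin{verbatim}
import math

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: two equally likely classes give 1 bit.
# entropy(["a", "a", "b", "b"]) == 1.0
\end{verbatim}
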
\subsection{GINI}

\paragraph{Calculation}
Starting from the frequency table for the split, with one column per attribute value and the class counts in the body of the table (a code sketch follows the list):
\begin{enumerate}
\item For each non-empty column, form the sum of the squares of the values in the body of the table and divide by the column sum.
\item Add the values obtained for all the columns and divide by $N$ (the number of instances).
\item Subtract the total from 1.
\end{enumerate}

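The quantity computed this way is the weighted average of the Gini index $1 - \sum_{i} p_{i}^{2}$ over the subsets produced by the split. A minimal Python sketch (mine), assuming \texttt{table[i][j]} holds the count of class $i$ within attribute value $j$:

\begin{verbatim}
def gini_new(table):
    """Weighted Gini index after a split, from a class-by-value count table."""
    n = sum(sum(row) for row in table)
    total = 0.0
    for j in range(len(table[0])):
        col = [row[j] for row in table]
        col_sum = sum(col)
        if col_sum == 0:          # skip empty columns
            continue
        total += sum(v * v for v in col) / col_sum
    return 1 - total / n
\end{verbatim}
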
\subsection{Information Gain}
\paragraph{Definition}
The difference between the entropy before and after splitting on a given attribute in a decision tree.

Maximizing information gain is the same as minimizing $E_{\text{new}}$.
\paragraph{Calculation}

\begin{displaymath}
\text{Information Gain} = E_{\text{start}} - E_{\text{new}}
\end{displaymath}

Starting node:

\begin{eqnarray}
E_{\text{start}} & = & -\frac{4}{24}\log_{2}\frac{4}{24} \nonumber \\
& & -\frac{5}{24}\log_{2}\frac{5}{24} \nonumber \\
& & -\frac{15}{24}\log_{2}\frac{15}{24}
\end{eqnarray}

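Evaluating this expression numerically, term by term:
\begin{displaymath}
E_{\text{start}} \approx 0.431 + 0.471 + 0.424 \approx 1.326 \text{ bits}
\end{displaymath}
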
After splitting on the attribute (three values, each covering 8 of the 24 instances):

\begin{eqnarray}
E_{\text{new}} & = & \frac{8}{24}E_{1} \nonumber \\
& & + \frac{8}{24}E_{2} \nonumber \\
& & + \frac{8}{24}E_{3}
\end{eqnarray}

\paragraph{Uses}

\section{Clustering}
\paragraph{Definition}
Grouping data into separate groups.

Uses a distance metric between pairs of data points.

Groups should be distinct from one another and composed of items that are similar to each other and different from items in other groups.

\subsection{Naïve Bayes}
\begin{displaymath}
P(c_{i} \mid v) \propto P(c_{i}) \prod_{j=1}^{n} P(a_{j} = v_{j} \mid \text{class} = c_{i})
\end{displaymath}

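A minimal Python sketch of this scoring rule (mine, with no smoothing), assuming categorical attributes stored in dicts:

\begin{verbatim}
from collections import Counter

def naive_bayes_scores(train, x):
    """train: list of (attr_dict, label); x: attr_dict to classify.
    Returns an unnormalized score per class, as in the product above."""
    labels = [label for _, label in train]
    class_counts = Counter(labels)
    n = len(train)
    scores = {}
    for c, count_c in class_counts.items():
        score = count_c / n                      # P(c_i)
        for j, v in x.items():
            matches = sum(1 for attrs, label in train
                          if label == c and attrs.get(j) == v)
            score *= matches / count_c           # P(a_j = v_j | class = c_i)
        scores[c] = score
    return scores
\end{verbatim}

The class with the highest score is predicted.
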
\subsection{Nearest Neighbors}
Mainly used when all attribute values are continuous.

General strategy (sketched in code below):
\begin{enumerate}
\item Find the $k$ training instances that are closest to the unseen instance.
\item Take the most commonly occurring classification among those instances.
\end{enumerate}

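A minimal Python sketch of this strategy (mine), using Euclidean distance over continuous attribute vectors:

\begin{verbatim}
from collections import Counter
import math

def knn_classify(train, x, k=3):
    """train: list of (vector, label) pairs; x: vector to classify."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # k closest training instances, then majority vote over their labels.
    neighbors = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
\end{verbatim}
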
Clustering algorithms covered:
\begin{itemize}
\item KMeans
\item DBSCAN
\end{itemize}

\section{Sequence Mining}
\textbf{TODO}
\paragraph{Definition}
Finding meaningful, recurring sequences of events.

\section{Association Rule Analysis}
\paragraph{Definition}
Given a collection of collections (a database of transactions, e.g. baskets of food items), find items with high co-occurrence.

Let $m$ be the number of possible items that can be bought.

Let $I$ denote the set of all possible items.

Number of possible non-empty itemsets: $2^{\lvert I \rvert} - 1 = 2^{m} - 1$.

An itemset $S$ matches a transaction $T$ (itself an itemset) if $S \subseteq T$.

\subsection{Support}
\paragraph{Definition}
$\text{support}(S)$: the proportion of transactions matched by $S$.

The proportion of transactions that contain all the items in $S$.

The frequency with which the items in $S$ occur together in the database.

\paragraph{Calculation}
\begin{displaymath}
\text{support}(S) = \frac{\text{count}(S)}{n}
\end{displaymath}
where $n$ is the number of transactions in the database.

\subsubsection{Uses}
\subsection{Confidence}
\paragraph{Calculation}
The confidence of a rule can be calculated either as
\begin{displaymath}
\text{confidence}(L \rightarrow R) = \frac{\text{count}(L \cup R)}{\text{count}(L)}
\end{displaymath}
or as
\begin{displaymath}
\text{confidence}(L \rightarrow R) = \frac{\text{support}(L \cup R)}{\text{support}(L)}
\end{displaymath}

Reject rules where
\begin{displaymath}
\text{support} < \text{minsup} \approx 0.01 = 1\%
\end{displaymath}
An itemset meeting the minimum support threshold is also called a frequent, large, or supported itemset.

Reject rules where
\begin{displaymath}
\text{confidence} < \text{minconf} \approx 0.8 = 80\%
\end{displaymath}

\paragraph{Uses}
\subsection{Lift}
\paragraph{Definition}
Lift measures how many more times the items in $L$ and $R$ occur together than would be expected if they were statistically independent.
\paragraph{Calculation}

\begin{eqnarray}
\text{Lift}(L \rightarrow R)
& = & \frac{\text{count}(L \cup R)}{\text{count}(L) \times \text{support}(R)} \nonumber \\
& = & \frac{\text{support}(L \cup R)}{\text{support}(L) \times \text{support}(R)} \nonumber \\
& = & \frac{\text{confidence}(L \rightarrow R)}{\text{support}(R)} \nonumber \\
& = & \frac{N \times \text{confidence}(L \rightarrow R)}{\text{count}(R)} \nonumber \\
& = & \frac{N \times \text{confidence}(R \rightarrow L)}{\text{count}(L)} \nonumber \\
& = & \text{Lift}(R \rightarrow L)
\end{eqnarray}

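A minimal Python sketch (mine, not course code) computing support, confidence, and lift directly from transactions represented as Python sets:

\begin{verbatim}
def support(itemset, transactions):
    """Proportion of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(L, R, transactions):
    return support(L | R, transactions) / support(L, transactions)

def lift(L, R, transactions):
    return confidence(L, R, transactions) / support(R, transactions)

# Hypothetical transactions, for illustration only.
transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"},
                {"bread", "eggs"}, {"milk", "eggs"}]
print(lift({"milk"}, {"bread"}, transactions))
\end{verbatim}
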
\paragraph{Uses}
\subsection{Frequent Itemsets}
To find the supported (frequent) itemsets of size $k$ (a sketch of the candidate-generation step follows the list):
\begin{enumerate}
\item Find candidate itemsets of size $k$ made from two supported itemsets of size $k-1$.
\item For each new candidate itemset:
\begin{enumerate}
\item Check whether every sub-itemset of size $k-1$ also exists in the supported itemsets of size $k-1$.
\item If not every sub-itemset does, prune the candidate.
\end{enumerate}
\item With the final candidates, determine which have minimum support.
\item To determine association rules, find which rules have at least minimum confidence.
\end{enumerate}

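A minimal Python sketch of the join-and-prune candidate-generation step (mine), assuming the supported $(k-1)$-itemsets are given as frozensets; counting support against the database is left out:

\begin{verbatim}
from itertools import combinations

def next_candidates(frequent_prev, k):
    """Join supported (k-1)-itemsets, then prune by the subset condition."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            # Keep only candidates whose every (k-1)-subset is supported.
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates
\end{verbatim}
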
\subsection{Rules Possible}
For a supported itemset with $k$ items, the number of possible rules $L \rightarrow R$ is
\begin{displaymath}
\sum_{i=1}^{k-1} {}_{k}C_{i} = 2^{k} - 2
\end{displaymath}
since each rule corresponds to choosing a non-empty proper subset of the itemset as the right-hand side $R$ (the remaining items form $L$).

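As a quick check with $k = 3$ and a hypothetical itemset $\{a, b, c\}$: $2^{3} - 2 = 6$ rules are possible, namely $a \rightarrow bc$, $b \rightarrow ac$, $c \rightarrow ab$, $ab \rightarrow c$, $ac \rightarrow b$, and $bc \rightarrow a$.
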
\end{document}
\endinput