Probability of a class given the attributes is proportional to the overall probability of that class multiplied by the product of the probabilities of each individual attribute value given the class:
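In symbols (a standard statement of this rule, with $C_{i}$ a class and $a_{1}, \dots, a_{n}$ the observed attribute values):
\begin{displaymath}
P(C_{i} \mid a_{1}, \dots, a_{n}) \propto P(C_{i}) \prod_{j=1}^{n} P(a_{j} \mid C_{i})
\end{displaymath}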
Understand the concept of overfitting and be able to tell how you would know that a classification system has overfit.
\paragraph{Definition}
A classifier overfits if it generates a decision tree (or other mechanism) too well adapted to the training set:
it performs well on the training set, but not well on other data.
Some overfitting is inevitable.
Remedies:
\begin{itemize}
\item Adjust the decision tree while it is being generated: pre-pruning
\item Modify the tree after creation: post-pruning
\end{itemize}
\subsubsection{Clashes}
Two (or more) instances in the training set have identical attribute values but different classifications.
Especially a problem for TDIDT's 'Adequacy condition'.
\paragraph{Stems from}
\begin{itemize}
\item Classification incorrectly recorded
\item Recorded attributes insufficient: more attributes would be needed, which is normally impossible
\end{itemize}
\paragraph{Solutions}
\begin{itemize}
\item Discard the branch to the clashing node from the node above
\item Assign all clashing instances the majority (most frequent) label (sketched below)
\end{itemize}
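A minimal sketch of the majority-label resolution, assuming the clashing instances are given simply as a list of class labels (the function name and example labels are illustrative, not from the notes):
\begin{verbatim}
from collections import Counter

def resolve_clash(labels):
    """Return the most frequent class label among clashing instances."""
    # most_common(1) gives a list containing the (label, count) pair
    # for the majority class.
    return Counter(labels).most_common(1)[0][0]

# Example: two clashing instances labelled 'yes', one labelled 'no'.
print(resolve_clash(["yes", "no", "yes"]))  # -> yes
\end{verbatim}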
\subsubsection{Prepruning}
Pre-pruning may reduce accuracy on the training set, but the pruned classifier may perform better on test (and subsequent) data than an unpruned one.
\begin{enumerate}
\item Test whether a termination condition applies.
\begin{itemize}
\item If so, the current subset is treated as a 'clash set'
\item Resolve by 'delete branch,' 'majority voting,' etc.
\end{itemize}
\item Two methods (sketched after this list):
\begin{itemize}
\item Size cutoff: prune if the subset has fewer than X instances
\item Maximum depth: prune if the length of the branch exceeds Y
\end{itemize}
\end{enumerate}
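A minimal sketch of these termination tests, assuming a recursive TDIDT-style tree builder; the thresholds and function names below are illustrative assumptions, not part of the notes:
\begin{verbatim}
MAX_DEPTH = 5        # illustrative maximum branch length (Y)
MIN_INSTANCES = 10   # illustrative size cutoff (X)

def should_preprune(subset, depth):
    """Pre-pruning termination test: stop splitting when the subset
    is too small (size cutoff) or the branch is too long (max depth)."""
    return len(subset) < MIN_INSTANCES or depth >= MAX_DEPTH

# Inside the tree builder, a node whose subset satisfies this test is
# treated as a clash set and resolved (e.g. by majority voting)
# instead of being split further.
\end{verbatim}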
\subsubsection{Post-pruning}
\begin{enumerate}
\item Look for non-leaf nodes whose descendant branches all have length 1, i.e. whose children are all leaf nodes (see the sketch after this list).
\item In the example tree, only nodes G and D are candidates for pruning (consolidation).
\end{enumerate}
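A minimal sketch of finding such candidates, assuming the tree is represented as nested dictionaries whose leaves are plain class labels (this representation, and the example tree, are assumptions for illustration):
\begin{verbatim}
def is_leaf(node):
    """Leaves are plain class labels; internal nodes are dictionaries
    mapping branch labels to subtrees."""
    return not isinstance(node, dict)

def pruning_candidates(node, name="root"):
    """Yield names of non-leaf nodes whose children are all leaves."""
    if is_leaf(node):
        return
    if all(is_leaf(child) for child in node.values()):
        yield name                      # candidate for consolidation
    for branch, child in node.items():
        yield from pruning_candidates(child, branch)

# Example: node 'D' has only leaf children, so it is the only candidate.
tree = {"A": {"D": {"yes": "c1", "no": "c2"}, "E": "c3"}}
print(list(pruning_candidates(tree)))   # -> ['D']
\end{verbatim}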
\subsection{Discretizing}
\subsubsection{Equal Width Intervals}
\subsubsection{Pseudo Attributes}
\subsubsection{Processing Sorted Instance Table}
\subsubsection{ChiMerge}
\paragraph{Rationale}
Initially, each distinct value of a numerical attribute $A$ is considered to be one interval.
$\chi^{2}$ tests are performed for every pair of adjacent intervals.
Adjacent intervals with the smallest $\chi^{2}$ value are merged, because a low $\chi^{2}$ value for a pair indicates similar class distributions.
This merging process proceeds recursively until a predefined stopping criterion is met.
For two adjacent intervals, if the $\chi^{2}$ test concludes that the class is independent of the interval, the intervals should be merged.
If the $\chi^{2}$ test concludes that they are not independent, i.e. the difference in relative class frequencies is statistically significant, the two intervals should remain separate.
\paragraph{Calculation}
To calculate the expected value for any combination of row (interval) and class (see the formula after this list):
\begin{enumerate}
\item Take the product of the corresponding row sum and column sum
\item Divide by the grand total of the observed values for the two rows.
\end{enumerate}
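In symbols, with $R_{i}$ the sum of row $i$, $C_{j}$ the sum of class column $j$, and $N$ the grand total of observed values for the two rows (notation chosen here for illustration):
\begin{displaymath}
E_{ij} = \frac{R_{i} \times C_{j}}{N}
\end{displaymath}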
Then:
\begin{enumerate}
\item Using the observed and expected values, calculate, for each of the cells:
\begin{math}
\frac{(O - E)^{2}}{E}
\end{math}
\item Sum these values over all cells to obtain $\chi^{2}$ (see the formula after this list)
\end{enumerate}
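That is, using $O_{ij}$ and $E_{ij}$ for the observed and expected values of the cell in row $i$ and class column $j$:
\begin{displaymath}
\chi^{2} = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}
\end{displaymath}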
When $\chi^{2}$ exceeds the threshold, the hypothesis of independence is rejected.
A small value supports the hypothesis.
Important adjustment: when $E < 0.5$, replace it with $0.5$.
\begin{enumerate}
\item Select the smallest $\chi^{2}$ value
\item Compare it to the threshold
\item If it falls below the threshold, merge the corresponding interval with the row immediately below it
\item Recalculate $\chi^{2}$; this only needs to be done for rows adjacent to the recently merged one (the loop is sketched after this list)
\end{enumerate}
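A minimal sketch of the ChiMerge merging loop, assuming each interval (row) is represented as a dictionary of observed class frequencies and that a $\chi^{2}$ threshold for the chosen significance level is supplied; all names are illustrative, not from the notes:
\begin{verbatim}
def chi2(row_a, row_b, classes):
    """Chi-squared statistic for a pair of adjacent intervals (rows)."""
    total = sum(row_a.values()) + sum(row_b.values())
    value = 0.0
    for row in (row_a, row_b):
        row_sum = sum(row.values())
        for c in classes:
            col_sum = row_a.get(c, 0) + row_b.get(c, 0)
            expected = max(row_sum * col_sum / total, 0.5)  # E < 0.5 -> 0.5
            observed = row.get(c, 0)
            value += (observed - expected) ** 2 / expected
    return value

def chimerge(intervals, classes, threshold):
    """Merge adjacent intervals while the smallest chi-squared value
    falls below the threshold (class distributions look similar)."""
    intervals = [dict(row) for row in intervals]   # work on a copy
    while len(intervals) > 1:
        # For clarity all pairs are recomputed; only pairs adjacent to
        # the most recently merged row actually change between passes.
        scores = [chi2(intervals[i], intervals[i + 1], classes)
                  for i in range(len(intervals) - 1)]
        best = scores.index(min(scores))
        if scores[best] >= threshold:   # every pair significantly different
            break
        merged = {c: intervals[best].get(c, 0) + intervals[best + 1].get(c, 0)
                  for c in classes}
        intervals[best:best + 2] = [merged]
    return intervals
\end{verbatim}
For example, two adjacent rows with class counts \{yes: 4, no: 1\} and \{yes: 5, no: 1\} give a very small $\chi^{2}$ value and would typically be merged, while rows with very different class distributions remain separate.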
A large number of intervals does little to solve the problem of discretization.
Just one interval cannot contribute to the decision-making process.
Remedies: adjust the significance level that the hypothesis of independence must pass before an interval merge is triggered; set a minimum and a maximum number of intervals.
\subsection{Entropy}
\paragraph{Definition}
Entropy is a measure of the uncertainty arising from there being more than one possible classification.
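For a training set with $K$ classes, in which a proportion $p_{i}$ of the instances belong to class $i$, the standard definition is
\begin{displaymath}
E = -\sum_{i=1}^{K} p_{i} \log_{2} p_{i}
\end{displaymath}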
Used for selecting splitting attributes in decision trees.
Minimizing entropy reduces the complexity (number of branches) of the decision tree.
There is no guarantee that using entropy will always lead to a small decision tree.
Used for feature reduction: calculate the information gain for each attribute in the original dataset and discard all attributes that do not meet a specified criterion.
Pass the revised dataset to the preferred classification algorithm.
Entropy has a bias towards selecting attributes with a large number of values.
\paragraph{Calculation}
To decide which attribute to split on:
\begin{enumerate}
\item Find the entropy of the data in each of the branches after the split.
\item Take the weighted average of those branch entropies and use it to find the information gain (as shown below).
\item The attribute whose split gives the highest information gain (lowest weighted average entropy) is selected.
\end{enumerate}
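In symbols (standard notation, not from the notes: $E_{\mathrm{start}}$ is the entropy before the split, and branch $j$ receives $N_{j}$ of the $N$ instances and has entropy $E_{j}$):
\begin{displaymath}
E_{\mathrm{new}} = \sum_{j} \frac{N_{j}}{N} E_{j}, \qquad \mathrm{Information\ Gain} = E_{\mathrm{start}} - E_{\mathrm{new}}
\end{displaymath}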
\begin{itemize}
\item Entropy is always positive or zero
\item Entropy is zero when $p_{i}=1$, i.e. when all instances have the same class
\end{itemize}
Although lift is a useful measure of interestingness, it is not always the best one to use.
In some cases a rule with higher support and lower lift can be more interesting than one with lower support and higher lift, because it applies to more cases.