# – ITS632_Chapter9PPT1.ppt

FIND A SOLUTION AT Academic Writers Bay

Data Mining

Anomaly Detection

Lecture Notes for Chapter 9

Introduction to Data Mining

by

Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *

Anomaly/Outlier Detection

• What are anomalies/outliers?
• The set of data points that are considerably different than the remainder of the data
• Variants of Anomaly/Outlier Detection Problems
• Given a database D, find all the data points x  D with anomaly scores greater than some threshold t
• Given a database D, find all the data points x  D having the top-n largest anomaly scores f(x)
• Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications:
• Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection

Importance of Anomaly Detection

Ozone Depletion History

• In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Anomaly Detection

• Challenges
• How many outliers are there in the data?
• Method is unsupervised
• Validation can be quite challenging (just like for clustering)
• Finding needle in a haystack
• Working assumption:
• There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes

• General Steps
• Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
• Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics

differ significantly from the normal profile
• Types of anomaly detection

schemes
• Graphical & Statistical-based
• Distance-based
• Model-based

Statistical Approaches

• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
• Data distribution
• Parameter of distribution (e.g., mean, variance)
• Number of expected outliers (confidence limit)

Grubbs’ Test

• Detect outliers in univariate data
• Assume data comes from normal distribution
• Detects one outlier at a time, remove the outlier, and repeat
• H0: There is no outlier in data
• HA: There is at least one outlier
• Grubbs’ test statistic:
• Reject H0 if:

Statistical-based – Likelihood Approach

• Assume the data set D contains samples from a mixture of two probability distributions:
• M (majority distribution)
• A (anomalous distribution)
• General Approach:
• Initially, assume all the data points belong to M
• Let Lt(D) be the log likelihood of D at time t
• For each point xt that belongs to M, move it to A
• Let Lt+1 (D) be the new log likelihood.
• Compute the difference,  = Lt(D) – Lt+1 (D)
• If  > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A

Limitations of Statistical Approaches

• Most of the tests are for a single attribute
• In many cases, data distribution may not be known
• For high dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches

• Data is represented as a vector of features
• Three major approaches
• Nearest-neighbor based
• Density based
• Clustering based

Nearest-Neighbor Based Approach

• Approach:
• Compute the distance between every pair of data points
• There are various ways to define outliers:
• Data points for which there are fewer than p neighboring points within a distance D
• The top n data points whose distance to the kth nearest neighbor is greatest
• The top n data points whose average distance to the k nearest neighbors is greatest

Outliers in Lower Dimensional Projection

• In high-dimensional space, data is sparse and notion of proximity becomes meaningless
• Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projection methods
• A point is an outlier if in some lower dimensional projection, it is present in a local region of abnormally low density

Clustering-Based

• Basic idea:
• Cluster the data into groups of different density
• Choose points in small cluster as candidate outliers
• Compute the distance between candidate points and non-candidate clusters.
• If candidate points are far from all other non-candidate points, they are outliers

Base Rate Fallacy in Intrusion Detection

• I: intrusive behavior,

I: non-intrusive behavior

A: alarm

A: no alarm
• Detection rate (true positive rate): P(A|I)
• False alarm rate: P(A|I)
• Goal is to maximize both
• Bayesian detection rate, P(I|A)
• P(I|A)

QUALITY: 100% ORIGINAL PAPER NO PLAGIARISM – CUSTOM PAPER

## Why Choose Us?

• 100% non-plagiarized Papers
• 24/7 /365 Service Available
• Affordable Prices
• Any Paper, Urgency, and Subject
• Will complete your papers in 6 hours
• On-time Delivery
• Money-back and Privacy guarantees
• Unlimited Amendments upon request
• Satisfaction guarantee

## How It Works

• Click on the “Place Your Order” tab at the top menu or “Order Now” icon at the bottom and a new page will appear with an order form to be filled.
• Fill in your paper’s requirements in the “PAPER DETAILS” section.
• Fill in your paper’s academic level, deadline, and the required number of pages from the drop-down menus.
• Click “CREATE ACCOUNT & SIGN IN” to enter your registration details and get an account with us for record-keeping and then, click on “PROCEED TO CHECKOUT” at the bottom of the page.
• From there, the payment sections will show, follow the guided payment process and your order will be available for our writing team to work on it.