# Data Stream Mining

## Simulating a Data Stream Scenario

In data streams, data arrives continuously; if a record is not processed immediately or stored, it is lost forever. In this exercise, we will read data from a CSV file as a stream instead of loading it all at once with `pd.read_csv`. We will use the oil price dataset from Kaggle.
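One way to simulate this is with a generator that parses and yields one record at a time, so no record is kept after it has been processed. A minimal sketch (the `stream_rows` name is ours, and the in-memory `io.StringIO` stands in for opening the real CSV file):

```python
import csv
import io

def stream_rows(lines):
    """Yield parsed CSV records one at a time; nothing is retained after the yield."""
    reader = csv.DictReader(lines)
    for row in reader:
        yield row

# Stand-in for the real file; with the actual dataset you would pass
# open("oil.csv", newline="") instead (the filename is an assumption).
sample = io.StringIO("date,dcoilwtico\n2013-01-01,\n2013-01-02,93.14\n")
for row in stream_rows(sample):
    print(row["date"], row["dcoilwtico"])
```

Because `stream_rows` is a generator, memory usage stays constant no matter how long the stream runs.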

## Reservoir Sampling

Reservoir sampling is a randomized algorithm for selecting \(k\) samples from a stream of \(n\) items, where \(k\) is usually much smaller than \(n\). In data streams, the total number of elements is unknown in advance, and \(n\) is typically so large that the data cannot fit into active memory.
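The classic single-pass version (often called Algorithm R) keeps the first \(k\) items, then replaces a random reservoir slot with the \(i\)-th item with probability \(k/i\). A short sketch, with our own function name:

```python
import random

def reservoir_sample(stream, k):
    """Return a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i+1 replaces a random slot with probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(10**6), 5)` returns five items, each chosen with equal probability, while only ever holding five items in memory.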

## Windowing Models

Another approach to processing data streams is to look at a fixed-length window of the last \(k\) elements of the stream. We will combine the windowing model with estimating the probability density function of the values in order to learn the data distribution, using kernel density estimation to estimate the density of the oil price. More information about the kernel density estimator can be found in the scikit-learn documentation on 1D Kernel Density Estimation.
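A sliding window is easy to maintain with a bounded `collections.deque`, refitting the estimator as new values arrive. A sketch under the assumption that scikit-learn is installed; the window size, bandwidth, and function name are illustrative choices, not prescribed by the exercise:

```python
from collections import deque

import numpy as np
from sklearn.neighbors import KernelDensity

WINDOW_SIZE = 100  # window length k, chosen here for illustration
window = deque(maxlen=WINDOW_SIZE)  # the deque drops the oldest element automatically

def update_and_estimate(value, bandwidth=1.0):
    """Append a new value to the window and refit a Gaussian KDE on its contents."""
    window.append(value)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(np.asarray(window).reshape(-1, 1))
    return kde
```

After an update, `kde.score_samples(np.array([[95.0]]))` gives the estimated log-density of the price 95.0 under the current window.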

## Hashing Data Streams

Hashing is an important technique for filtering data streams. In this exercise, we will work on simple techniques for hashing data streams. **Again**, we will not use `pd.read_csv` to read the CSV file of the oil dataset. Instead, we will read it line by line, parse each line, convert the values of `dcoilwtico` from strings to floats, and process the records one by one. We will ignore records that have an empty string in the `dcoilwtico` attribute.
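The per-line processing described above might look like the sketch below, which also hashes each non-empty price into a small bitmap of buckets (a simplified, single-hash version of a Bloom-style filter). The function names, bucket count, and choice of MD5 are our own illustrative assumptions:

```python
import hashlib

NUM_BUCKETS = 8
bitmap = [False] * NUM_BUCKETS  # one flag per bucket, as in a simple Bloom-style filter

def bucket(value):
    """Deterministically hash a string into one of NUM_BUCKETS buckets."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def process_line(line, header):
    """Parse one CSV line; return the price as a float, or None for empty values."""
    fields = dict(zip(header, line.rstrip("\n").split(",")))
    raw = fields.get("dcoilwtico", "")
    if raw == "":
        return None  # ignore records with an empty dcoilwtico value
    bitmap[bucket(raw)] = True  # mark the bucket this record hashes to
    return float(raw)
```

Checking `bitmap[bucket(x)]` then answers "have we possibly seen value `x`?" with no false negatives, at the cost of occasional false positives when two values share a bucket.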