Data Stream Mining

Simulating a data stream scenario

In data streams, data arrives continuously. If an element is not processed immediately or stored, it is lost forever. In this exercise, we will read data from a csv file as a stream instead of loading it with pd.read_csv. We will use the oil price dataset from Kaggle.

In the first task, you need to read the csv line-by-line, parse each line (split it on the comma), and store the records in a pandas DataFrame.

import pandas as pd
TODO: read the oil.csv line-by-line and split each line on the comma
to obtain a new record each time. Add the records one-by-one to the DataFrame.
Ignore the records that have no value for the dcoilwtico attribute.
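One possible sketch of this task. The file name oil_sample.csv and the inline sample rows are stand-ins created so the snippet is self-contained; the real oil.csv from Kaggle has the same two columns, date and dcoilwtico:

```python
import pandas as pd

# Stand-in for oil.csv: same two columns (date, dcoilwtico),
# including one row with a missing price.
with open("oil_sample.csv", "w") as f:
    f.write("date,dcoilwtico\n"
            "2013-01-01,\n"
            "2013-01-02,93.14\n"
            "2013-01-03,92.97\n")

records = []
with open("oil_sample.csv") as f:
    header = f.readline().strip().split(",")   # consume the header line
    for line in f:                             # stream: one line at a time
        date, price = line.strip().split(",")
        if price == "":                        # skip missing dcoilwtico values
            continue
        records.append({"date": date, "dcoilwtico": float(price)})

# Collect the rows in a list and build the DataFrame once; appending to the
# DataFrame one row at a time also works but is much slower.
df = pd.DataFrame(records, columns=["date", "dcoilwtico"])
print(df)
```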

Reservoir Sampling

Reservoir sampling is a randomized algorithm for selecting \(k\) samples from a list of \(n\) items. The number of selected values \(k\) is usually much smaller than the total number of elements \(n\). In data streams, the total number of elements is unknown. Typically, \(n\) is large enough that the data cannot fit into active memory.

We will treat the oil prices as a stream of \(n\) records from which we need to select \(k\) records, where \(1 \le k \le n\). We will again read the records from the csv file one-by-one and ignore the lines with no value for the dcoilwtico attribute.

TODO: read the oil.csv line-by-line and split each line on the comma
to obtain a new record each time. Create a reservoir of size k = 10
that stores only 10 elements at any given moment. After processing 100
elements, display the content of the reservoir.
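The classic way to implement this is Algorithm R: keep the first k elements, then replace a uniformly chosen slot with decreasing probability. A minimal sketch, where the range(100) stream stands in for the parsed oil prices:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

random.seed(0)
sample = reservoir_sample(range(100), 10)   # stand-in stream of 100 "prices"
print(sample)
```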

Windowing Models

Another approach for processing data streams is to look at a fixed-length window of the last \(k\) elements of the stream. We will combine the windowing model with estimating the probability density function of the values to learn the data distribution. We will use kernel density estimation to estimate the density of the oil price. More information about using the kernel density estimator can be found in the scikit-learn documentation about 1D Kernel Density Estimation.

TODO: read the oil.csv line-by-line and split each line on the comma
to have a new record each time. Define a 'step' as an integer variable
such that 0 < step <= window_size. After receiving 'step' samples from the stream,
estimate the density and plot the figure.

  step = 1 ==> Sliding Window (would be expensive unless using an online update of the density function)
  1 < step < window_size ==> Hopping Window
  step = window_size ==> Tumbling Window

If you use step < 100, don't plot the density at every step (there would be a lot of figures).

For the bandwidth: don't use a fixed value such as 0.2 or 0.5.
Instead, you can use the normal (Silverman's) rule for computing the bandwidth. That is:
If you are using kernel = "gaussian", then bandwidth = 1.06 * s * len(data)**(-0.2),
where s is the standard deviation of the data.
If kernel = "epanechnikov", then bandwidth = 2.345 * s * len(data)**(-0.2).

You may also test 1.06 with the "epanechnikov" kernel and look at the differences in the figure.
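A sketch of the density-estimation step for one window, using scikit-learn's KernelDensity and the bandwidth rule above. The synthetic window stands in for the last window_size prices from the stream, and plotting is replaced by a print:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def silverman_bandwidth(window, kernel="gaussian"):
    # Normal (Silverman's) rule: factor * s * n^(-1/5)
    s = np.std(window)
    factor = 1.06 if kernel == "gaussian" else 2.345   # 2.345 for epanechnikov
    return factor * s * len(window) ** (-0.2)

rng = np.random.default_rng(42)
window = rng.normal(loc=50.0, scale=5.0, size=100)   # stand-in price window

bw = silverman_bandwidth(window, kernel="gaussian")
kde = KernelDensity(kernel="gaussian", bandwidth=bw)
kde.fit(window.reshape(-1, 1))                       # KernelDensity expects a 2D array

grid = np.linspace(window.min(), window.max(), 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))            # score_samples returns log-density
print(round(bw, 3))                                  # plt.plot(grid, density) would show the curve
```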

Hashing Data Streams

Hashing is an important technique for filtering data streams. In this exercise, we will work on simple techniques for hashing data streams. Again, we will not use pd.read_csv to read the csv file of the oil dataset. Instead, we will read it line-by-line, parse each line, convert the values of dcoilwtico from strings to floats, and process the records one-by-one. We will ignore the records that have an empty string in the dcoilwtico attribute.

First, we will create a numpy array of size (array_size = 4096) and type boolean to hash the values in the column dcoilwtico. We will initialize the boolean array to false (each element will be initialized to False). When an element is hashed to an index \(i\), we store True at that index of the array.

Second, we will use the hashing function \(h(x) = int(x*100) \%\ array\_size\) to hash the elements of dcoilwtico, where \(\%\) represents the modulus operator in Python.

Finally, we will keep track of the number of collisions, which are the cases where an element is hashed to an index that has already been set to True. Compute the ratio of the number of collisions to the total number of elements in the attribute dcoilwtico.

Why do we have a large ratio of collisions?

  1. create a boolean array of size 4096 (also test the case of 1024)
  2. create a dictionary to keep the list of elements that are assigned to each hash key.
  3. keep track of the total number of records with a value in the dcoilwtico attribute
  4. keep track of the number of collisions
  5. display the number and ratio of collisions
  6. record the time for performing this task.
TODO: check the lists in the dictionary used for storing the
elements that are assigned to each hash key.
Find two different elements that have been assigned to the same index and
apply the hashing function to these elements to confirm that they are hashed
to the same key.
We found that the elements [73.7, 32.74] are assigned to the same key.
Let us hash them again to confirm.
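A sketch of the Boolean-array hashing with collision counting. The four-value stream is a stand-in for the real dcoilwtico values; it includes the pair [73.7, 32.74] reported above so their keys can be compared:

```python
import numpy as np

array_size = 4096
table = np.zeros(array_size, dtype=bool)     # all slots start as False
buckets = {}                                 # hash key -> elements mapped to it
n_records = n_collisions = 0

def h(x):
    return int(x * 100) % array_size         # the hashing function from the text

for price in [93.14, 73.7, 32.74, 93.14]:    # stand-in stream values
    n_records += 1
    key = h(price)
    if table[key]:                           # slot already occupied -> collision
        n_collisions += 1
    table[key] = True
    buckets.setdefault(key, []).append(price)

print(n_collisions, "/", n_records)
print(h(73.7), h(32.74))                     # compare the keys of the reported pair
```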
A simpler way to keep track of the distinct elements in a data stream is to define
a set to store those elements. Upon the arrival of a new element from the stream,
check whether it is in the set. If not, add the element to the set.
However, as discussed in the lecture, this is an impractical solution, as the set
might become extremely large. We can test this solution on the oil dataset
since it is a small dataset. We will again read the data in a streaming way
and record the run time to compare it with the time when using the Boolean array
for hashing.
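A sketch of the set-based approach with timing. The synthetic stream of repeated prices is a stand-in for the parsed dcoilwtico values:

```python
import time

# Stand-in stream: 10,000 readings cycling through 500 distinct prices.
stream = [round(20 + (i % 500) * 0.1, 2) for i in range(10_000)]

start = time.perf_counter()
seen = set()
for price in stream:
    if price not in seen:    # average O(1) membership test, no hash collisions to manage
        seen.add(price)
elapsed = time.perf_counter() - start

print(len(seen), "distinct values in", f"{elapsed:.4f}s")
```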

We can see that using the set structure is faster than using the Boolean array. Moreover, with the set we do not have the collision problem. However, the set might become extremely large, which makes it an impractical solution. We can use a bitarray to speed up the hashing process and reduce the required space: instead of using one byte for each Boolean value, we will use only one bit. You will need to install the bitarray library. Check this website for more information about using bitarray.

%pip install bitarray
TODO: apply the same steps for hashing the stream as with the Boolean array, but use a bitarray instead.
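A possible sketch of the bitarray variant, assuming the bitarray package is installed. It reuses the same hash function and the same stand-in stream values as the Boolean-array version:

```python
from bitarray import bitarray

array_size = 4096
bits = bitarray(array_size)   # one bit per slot instead of one byte
bits.setall(False)

def h(x):
    return int(x * 100) % array_size

n_records = n_collisions = 0
for price in [93.14, 73.7, 32.74, 93.14]:   # stand-in stream values
    n_records += 1
    key = h(price)
    if bits[key]:                           # bit already set -> collision
        n_collisions += 1
    bits[key] = True

print(n_collisions, "/", n_records)
```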

Classifying Data Streams

Online processing of data streams is important because the data cannot be stored in memory, so we need to process one element at a time. Moreover, we need to process each element only once. For machine learning, this means training the models using one sample at a time. This is the complete opposite of the traditional way of training ML models, where a model is trained on the whole training set at once.

An online model is therefore a stateful, dynamic object: it keeps learning and doesn't have to revisit past data. For this exercise, you may use the river library and run some of its online examples, such as Binary classification and HoeffdingTreeClassifier. You may check the set of performance measures that river defines and use them (e.g. ROCAUC, Accuracy).

!pip install river