Monday, November 26, 2018

TensorFlow


TensorFlow !!

One more cool yet powerful framework/library for developing ML programs.

The underlying basis for this is a directed graph. The data (tensors) flows along the edges of the graph, through the nodes, which represent the operations.

TensorFlow normally works in a lazy way, i.e., it builds the graph first and execution happens later in the program, although there is an 'eager' mode to execute operations immediately at runtime. And TensorFlow programming typically happens at the "estimator" API level. Or there is another high-level API called 'keras', which can be used to define models and execute them with ease.
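As a hedged sketch of the Keras way, a minimal model definition might look like this (the toy data, layer sizes and training settings are illustrative assumptions, not from any particular problem):

```python
import numpy as np
from tensorflow import keras

# toy data: 100 samples with 4 features and a binary label
x = np.random.rand(100, 4)
y = (x.sum(axis=1) > 2).astype(int)

# a small model defined with the high-level Keras API
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fitting executes the graph; in eager mode the ops run immediately
model.fit(x, y, epochs=5, verbose=0)
print(model.predict(x[:3], verbose=0))
```

Keras builds the underlying graph from the layer list, so you never have to wire the nodes and edges yourself.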

This is how it can be installed..

conda - the package manager - can be used for this. Update conda, and then use conda to upgrade all the packages if you are not sure which ones are dependent packages (like numpy etc.).
And then install the python packages, using pip3 to install/upgrade tensorflow and keras, e.g.:

conda update conda
conda update --all
pip3 install --upgrade tensorflow keras


Sunday, November 11, 2018

ML how to..



How to approach an ML problem.. here is what I think


1. Take a look at the data and decide what you want to do, i.e., what to predict - explore the data

2. Decide what type of ML problem it is - unsupervised / supervised (classification / regression)

3. Clean up the data: remove unwanted data and noise, decide on and create features, normalize feature values - data cleanup and feature engineering

4. Create data sets - training/validation/test data sets

5. Train the model - supply the training and validation sets; choose the model parameters, loss function, number of iterations etc. - create and train a model

6. Evaluate the error on the training set and the validation set. Tune the parameters and re-train until the errors on both sets are close - model validation and tuning

7. Check the model's predictions on the test data set, i.e., the data whose output labels the model never saw in the above phases - final check

8. All good? Release to live data. If not, repeat the above by revisiting the input features, the model and the model parameters to get a better fit - going live
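The steps above can be sketched end to end with scikit-learn (the iris dataset and the logistic-regression model here are illustrative assumptions, not a prescription):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1-2. explore the data; labels exist, so this is supervised classification
X, y = load_iris(return_X_y=True)

# 4. create training/validation/test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# 5. create and train a model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 6. compare training and validation scores; tune and re-train if they diverge
print(model.score(X_train, y_train), model.score(X_val, y_val))

# 7. final check on the test set the model has never seen
print(model.score(X_test, y_test))
```

Step 8 (going live) would then be a matter of saving this model and serving it against real traffic.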




Saturday, November 10, 2018

Machine Learning




Machine Learning - as the name suggests, it is about teaching a machine to learn about something. The crux is how to teach a machine.. the better you teach, the better it learns.

How do we teach a machine? Well, remember the word 'maths'? It is everywhere.. most logical problems are solved through maths, and that is how machines work. So we teach the same way: collect 'enough' data sets and feed them to a machine, asking it to fit a mathematical equation. By doing so, the machine learns about the data and its dynamics.. once that is done, the machine is able to predict the outcome for a future data point.

The first step in doing so is to identify what kind of ML problem it is. Generally, an ML problem falls into one of two categories - or sometimes both! It depends on what you want..

                                            ML Problem
                                           /          \
                                          /            \
                                 supervised            unsupervised
                                 /        \                 |
                                /          \                |
                      classification   regression    grouping/clustering


Supervised problems are the cases where the data sets contain both inputs (features) and outputs (labels). The teaching process is guided by the output, so the ML algorithm knows how well it is doing and how to evaluate the model so that it better fits the data.

This is further categorized into two types. 'Classification' is the type of problem where the output falls into two or a few unique values, and the goal is to fit a model so that a future data point is classified into one of the given classes - like which digit a given picture contains..

Regression is the second type of supervised problem, where the output variable is a continuous value instead of a limited/discrete one - like by how many points your favorite team will win a match against a given opponent.


Unsupervised problems are the cases where the outcome is not known in advance. So the goal is to fit a model that identifies the clusters in the data sets.
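For instance, a minimal clustering sketch with scikit-learn's KMeans (the toy points and the cluster count are assumed purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious groups of points; no output labels are given to the model
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                 [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]])

# ask for two clusters and let the algorithm discover the grouping
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.labels_)
```

The model assigns each point a cluster id; the first three points land in one cluster and the last three in the other, without ever being told so.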

There are lots of ML algorithms, libraries and frameworks already available in the market, both as open source and as commercial products..




Friday, January 12, 2018

Web Application's Navigation Paths and Page Performance


I have been trying to find a better way to visualize and understand how a web application is accessed by its end users: what the end-user journey looks like, which pages have indirectly become close groups based on how the users are distributed and their interest in the application's features, and finally how smooth the path between pages is..

And I stumbled on a wonderful visualization tool called Gephi. Below is how a sample website looks, depicting the user navigation paths, the page grouping, and the page performance, which is characterized by the weights of the paths between the pages. In graph terminology, the pages are nodes and the paths between pages are edges, attributed with response times as weights.

Wednesday, October 4, 2017

A Sample Splunk Dashboard



A sample dashboard built using Splunk.. showing some meaningful stats from some pseudo data.
Have to say wow! It was so quick to build, and the features are so responsive.. and it's fun learning this stuff..



Thursday, May 25, 2017

Applying machine learning to study and do capacity planning


ML algorithms and libraries have evolved, and the scope for applying these techniques has also grown a lot. Here I am trying to apply one such technique to study performance metrics and then estimate the required capacity at a future load.

Let's take a simple set like pageviews and heap usage
Example:
pv,  heapusage(mb)
105,637
110,638
115,640
120,642
125,644
130,646
135,648
140,650
145,652


Now, since this is not a classification-type dataset, regression models can be used. There are a few regression models in one of the coolest ML libraries - scikit-learn (sklearn).

Let's take a LinearRegression model and try to fit it on the set.

The logic to implement the model is:
- declare the data set as arrays (no feature transformations like squaring were applied)
- create training and validation sets, with 20% used for validation
- choose LinearRegression as the model
- fit the model on the training set so it learns the trends
- get the score to verify how well it has fit the dataset
- get the coefficients and intercept used to fit the model
- predict on the validation set and compare with the actual values
- if it all looks good, then predict the required capacity for a future, increased load
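That logic can be sketched with scikit-learn on the sample pv/heap data above (this code is a minimal reconstruction of the listed steps, not the exact original implementation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# declare the data set as arrays (pageviews -> heap usage in MB)
pv = np.array([105, 110, 115, 120, 125, 130, 135, 140, 145]).reshape(-1, 1)
heap = np.array([637, 638, 640, 642, 644, 646, 648, 650, 652])

# create training and validation sets, with 20% held out for validation
X_train, X_val, y_train, y_val = train_test_split(pv, heap, test_size=0.2, random_state=0)

# choose and fit a LinearRegression model
model = LinearRegression().fit(X_train, y_train)

# score of the fit, plus the coefficients and intercept
print(model.score(X_train, y_train), model.coef_, model.intercept_)

# predict the validation set and compare with the actual values
print(model.predict(X_val), y_val)

# predict the heap needed at a future, higher load (e.g., 200 pageviews)
print(model.predict([[200]]))
```

Because the sample data is almost perfectly linear, the score comes out near 1.0 and the extrapolated heap at 200 pageviews lands around the low 670s (MB).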

Once this was done, I built wrappers and a simple UI.

UI Page 1: To simply upload the feature file i.e. the metrics file


UI Page 2: This shows the results: the correlation between the metrics, the validation set with how close the predicted values are, and finally the predicted output, i.e., the required heap memory for the given page views.



Although this is a simple demonstration, more features can be added..




Sunday, April 23, 2017

some useful linux commands


Below are some of the useful linux commands to diagnose issues..

To know about processes, their parent/call hierarchy, and their resource usage etc., we can use pstree, ps, top -H etc.

Example:
pstree

init─┬─Xvnc
     ├─crond
     ├─firefox───9*[{firefox}]
     ├─gnome-terminal─┬─bash
     │                ├─gnome-pty-helpe
     │                └─{gnome-terminal}
     ├─gnome-terminal─┬─bash───su───bash───startWebLogic.s───java───103*[{java}]


ps -e f

132 ?        Sl     0:01 gnome-terminal
137 ?        S      0:00  \_ gnome-pty-helper
138 pts/1    Ss     0:00  \_ bash
353 pts/1    S      0:00  |   \_ su
357 pts/1    S+     0:00  |       \_ bash
460 pts/1    S      0:00  |           \_ /bin/sh ./startWebLogic.sh
510 pts/1    Sl    26:44  |               \_ /...../jdk/bin/java -server -Xms256m -Xmx1024m -Dwe
993 pts/0    Ss+    0:00  \_ bash

while ps -ef can give you the full command line of the process.
And it will also show the parent process id, the process CPU, RSS size etc.




To list all the files that a process has accessed, lsof is the command..

/usr/sbin/lsof -p 510

So the WebLogic server, which is a JVM, has accessed 2K+ files. An example is below..

COMMAND PID USER FD  TYPE DEVICE SIZE/OFF NODE    NAME
java    510 root cwd DIR  202,2  4096     5657417 .../DefaultDomain

FD - is the file descriptor type, like below
           cwd  current working directory;
           mem  memory-mapped file;
           mmap memory-mapped device;
           pd   parent directory;
           rtd  root directory;
           tr   kernel trace file

TYPE is the file type, like REG (regular file) or DIR (directory)..
NODE is the inode number
DEVICE indicates the device type and partition numbers


There are a couple of other useful commands to debug issues: top -H -p <pid> gives the thread-level processor utilization, which is useful when debugging high CPU consumption issues in a JVM..

strace is another useful command to know what a process is doing at OS level like socket connections, reads etc.,

netstat is another command; options like -nap give information on the connections and their recv/send queues, which can indicate whether the network or a program is slow or blocked, and which connections are in which state etc.

Disk space check commands like df -h and du -sh . are useful to verify sizes and free space on disks and network shares

free -g is another useful command to know how much memory has been consumed in physical, swap and cache/buffer areas.

SAR is another great collection of metrics, ranging from processes, memory, swap activity, CPU, load average, disk, network etc.

To flush out the cache and buffers, drop_caches is what needs to be written to. An example is shown below

free -m
             total       used       free     shared    buffers     cached
Mem:         15500      15330        169          0        249      11778
-/+ buffers/cache:       3302      12197
Swap:        10047       1704       8342

echo 1 > /proc/sys/vm/drop_caches   --- this will flush out the cache and buffers.

free -m
             total       used       free     shared    buffers     cached
Mem:         15500       3367      12132          0          0        635
-/+ buffers/cache:       2731      12768
Swap:        10047       1704       8342

The above is useful if you want something to be loaded into memory afresh after some runtime changes, and also to test out something like disk speed


To check disk speed: dd - a linux command - is a very useful one.

Example Write test:
 date; time dd if=ip/test.txt of=op/test.txt bs=1024k count=1000
Sun Apr 23 06:46:17 PDT 2017
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.15715 seconds, 332 MB/s

real    0m3.164s
user    0m0.000s
sys     0m1.104s

-- date prints the date, and time times the command; dd is copying a file from the ip dir to the op dir with a block size of 1 MB (1024k) repeated 1000 times, i.e., a total of 1 GB. This took 3.16s in total ('real' means the elapsed time), with 1.1s of system-mode CPU time and almost nothing in user mode. So that is basically the speed, i.e., 332 MB/s..

But to test the read and write speeds in isolation, one can use /dev/zero - a special device file that returns as many null bytes as are read from it and discards whatever is written to it..

date; time dd of=/dev/zero if=test.txt bs=1024k count=1000   --- read speed
date; time dd if=/dev/zero of=test.txt bs=1024k count=1000   --- write speed

some other commands like ping and traceroute can be used to test the network speed and network paths..