Saturday, August 29, 2015

Add Color to your SQL

Topic: this post is about some simple techniques to add color to SQL scripts and their terminal output using ANSI escape codes.

Colors can be used to improve the output of command line tools. This is common practice, for example with the (bash) shell. Colors can also be very useful to improve the quality of the output of SQL scripts. In my experience this is not done frequently, probably because of the need for specialized techniques and also because the results depend on the terminal emulator (or tool) used to display the output. In this post you will find some pointers and examples that you can use to add color to your SQL output.

Notable previous work

A couple of years ago Tanel Poder published a very cool colored logo for his great snapper v4 script, using SQL and color-rich terminal output.
Sayan Malakshinov has published a blog article and a script color.sql that provide some ready-to-use and simple techniques for coloring the output of Oracle SQL*plus scripts.
ANSI escape codes are the main underlying technique to add color to the terminal output, see more details on how this works at: Wikipedia article on ANSI escape codes.
PuTTY is a widely used terminal emulator that supports ANSI escape codes. If you are a Windows user, note that CMD.EXE does not support ANSI escape codes, so it is not suitable for running the scripts described in this post.
Heat map visualization is a powerful technique to explore 3D data, by representing the third dimension as color. I have integrated heat map visualization and command line/terminal output with two tools, OraLatencyMap and PyLatencyMap, aimed at the study of I/O latency. In the next paragraphs I will share some of the tips and lessons learned from developing those tools regarding the use of color in terminal output.

Color palettes by example

Color palettes are very useful for heat map visualization. I have identified two simple palettes for displaying I/O latency histograms on terminal output: the first one is composed of shades of blue, the other is yellow-to-red. See an example of their usage as heat maps at this link. The scripts Color_palette_blue.sql and Color_palette_yellow-red.sql show two basic examples of how to generate color palettes using ANSI escape codes. The SQL, also pasted here below, works simply by changing the character background color, printing a white space and finally resetting the background back to normal:

define ANSICODE_PREFIX="chr(27)||'[48;5;'"
define ANSICODE_BACKTONORMAL="chr(27)||'[0m'"

select 0 ID, &ANSICODE_PREFIX|| '0m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL  -- Black
select 1 ID, &ANSICODE_PREFIX|| '15m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL  -- White
select 2 ID, &ANSICODE_PREFIX|| '51m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL  -- Light blue
select 3 ID, &ANSICODE_PREFIX|| '45m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL 
select 4 ID, &ANSICODE_PREFIX|| '39m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL 
select 5 ID, &ANSICODE_PREFIX|| '33m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL 
select 6 ID, &ANSICODE_PREFIX|| '27m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL  -- Dark blue 
select 7 ID, &ANSICODE_PREFIX|| '21m '|| &ANSICODE_BACKTONORMAL COLOR from dual; 

Generating the ANSI escape codes is straightforward and can be ported with little effort to many other languages of interest. Here is an example in PL/SQL (from OraLatencyMap):

create or replace function asciiescape_color (p_token pls_integer, p_palette_type varchar2) 
return varchar2 
   is
      type t_palette is varray(7) of pls_integer;        -- a palette of 7 colors
      v_palette_blue t_palette := t_palette(15,51,45,39,33,27,21);      -- shades of blue
      v_palette_red t_palette := t_palette(15,226,220,214,208,202,196); -- white-yellow-red
      v_colornum pls_integer;
   begin
      if ((p_token < 0) or (p_token > 6)) then
          raise_application_error(-20001,'The color palette has 7 colors, 0<=p_token<=6, found instead:'||to_char(p_token));
      end if; 

      if (p_palette_type = 'blue') then
          v_colornum := v_palette_blue(p_token+1);                               
      else
          v_colornum := v_palette_red(p_token+1);                                 
      end if;
      return(chr(27)||'[48;5;'||to_char(v_colornum)||'m');   
      --return the ascii escape sequence to change background color 
end;
/
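
A usage sketch of the function above (illustrative only, assuming the function has been created in the schema you are connected to): the query prints one colored cell per palette token, appending the reset sequence after each cell.

-- print one colored cell for each token of the blue palette (tokens 0 to 6)
select level-1 as token,
       asciiescape_color(level-1, 'blue') || '  ' || chr(27)||'[0m' as color_cell
from dual
connect by level <= 7;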

An example in Python (from PyLatencyMap):

def asciiescape_color(token, palette):
    blue_palette = {0:15, 1:51, 2:45, 3:39, 4:33, 5:27, 6:21}        # palette, shades of blue 
    red_palette  = {0:15, 1:226, 2:220, 3:214, 4:208, 5:202, 6:196}  # white-yellow-red palette
    if palette == 'blue':
       color_asciival = blue_palette[token]
    elif palette == 'red':
       color_asciival = red_palette[token]
    else:
       raise Exception('Wrong or missing palette name.')
       exit(1)
    return(chr(27) + '[48;5;' + str(color_asciival) + 'm')


Other ANSI escape codes of interest from OraLatencyMap and PyLatencyMap are the codes to reset the background color back to normal, to clear the screen and to move the cursor to the top of the screen. Here is an example from PyLatencyMap (Python):

#reset the background color back to normal
asciiescape_backtonormal = chr(27) + '[0m'

# clear screen and move cursor to top line
line += chr(27) + '[0m' + chr(27) + '[2J' + chr(27) + '[H' 

An example of colorizing SQL output

You will see in this paragraph an example that I hope is both instructive and fun: how to add colors to a script that computes an image of the Mandelbrot set. The starting script is quite interesting by itself as it uses just SQL for computation and output display. The code is not original: I have ported it to Oracle from code on the PostgreSQL wiki, with some minor modifications. Mandelbrot_SQL_Oracle_text.sql is the "black and white" starting script before adding color.
Adding colors to the output for this example is just a matter of combining the original "black and white" script with the SQL scripts for generating color palettes. The results are the following two scripts (see also the figure below for an example of their output):

The figure illustrates how to add color to some classes of SQL output. The examples use a test script which generates an image of the Mandelbrot set using SQL. Note that adding color to the plain text version is achieved here just by adding an extra table join (with the PALETTE common table expression). Other methods to add color to Oracle scripts include using PL/SQL functions or other languages, such as Python, as discussed in the text of this post.
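To illustrate the idea of the extra join, here is a minimal sketch (simplified, with illustrative names, not the actual Mandelbrot script): a PALETTE common table expression maps an integer value to an ANSI escape sequence that prints a colored blank, and the join replaces each computed value with its colored cell.

-- minimal sketch: colorize a row of numeric values by joining them to a palette CTE
with palette as (
        select 0 id, chr(27)||'[48;5;0m '||chr(27)||'[0m' color from dual union all
        select 1,    chr(27)||'[48;5;51m '||chr(27)||'[0m'      from dual union all
        select 2,    chr(27)||'[48;5;39m '||chr(27)||'[0m'      from dual union all
        select 3,    chr(27)||'[48;5;21m '||chr(27)||'[0m'      from dual
     ),
     test_data as (  -- stand-in for the computed values (e.g. Mandelbrot iteration counts)
        select level x, mod(level, 4) val from dual connect by level <= 16
     )
select listagg(p.color, '') within group (order by d.x) colored_row
from test_data d
join palette p on p.id = d.val;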

A version of the script for PostgreSQL that uses the ideas discussed in this post for colorizing SQL is: Mandelbrot_SQL_PostgreSQL_color_blue.sql

Conclusions

Colors can improve the effectiveness of command line scripts and terminal output, including SQL scripts. ANSI escape codes provide a powerful tool for many terminal operations. Heat map visualization is a powerful data visualization technique that can be implemented also on the terminal output using ANSI escape codes. In this post you can find simple tips on how to add colors to the terminal output both for SQL and other languages, notably Python. Adding color to "black and white" script output can be fun and useful at the same time, as illustrated with the Mandelbrot SQL example. Happy coloring!

Download: The tools discussed in this post can be downloaded from this webpage and from Github

Additional links and references

Tanel's colored fish logo for the snapper v4 script
Sayan Malakshinov's blog article and script color.sql
Wikipedia on ANSI escape codes
Latency heat maps for I/O latency measurements on the CLI with OraLatencyMap and PyLatencyMap
Fun SQL snippets from the PostgreSQL wiki
Additional examples of recursive common table expression (recursive subquery factoring) and non-standard uses of SQL: how to find numeric solutions to basic physics equations using SQL



Wednesday, July 29, 2015

Diagnose High-Latency I/O Operations Using SystemTap

Topic: this post is about some simple tools and techniques that can be used to drill down high-latency I/O events using SystemTap probes.

The problem: Operations with high latency on a filesystem and/or a storage volume can sometimes be attributed to just a few disks 'misbehaving', possibly because they are suffering mechanical failures and/or are going to break completely in the near future.
I/Os of high latency on just a few disks can then appear as latency outliers when accessing volumes built on a large number of disks and overall affect the performance of the entire storage system. I write this having in mind the example of a storage system built with (SATA) JBODs using Oracle ASM as volume manager/DB filesystem. However the main ideas and tools described in this post apply to many other volume managers and file systems, including HDFS.

The standard tools: One way to find out whether one or more disks are serving I/O requests with high latency is to use standard Linux tools such as iostat, sar or collectl. Typically you would use those tools to spot anomalous values of average service time, average wait time and also of queue size.
A structured approach on how to do this is described in Brendan Gregg's USE method and the tools that can be used in Linux to implement it.

SystemTap scripts: In this post we focus on a technique and a couple of simple scripts that can be used to identify disks serving I/O with high latency, using SystemTap probes to investigate the I/O latency of the block devices.
The script blockio_latency_outliers_per_device.stp provides a measurement of some basic latency statistics per block device, including the number of I/Os, average and maximum latency. The script also provides details of all the I/Os where the latency is above a certain programmable threshold (set at 500 milliseconds, that is 500000 microseconds, in the example below).
An example of its use (output edited for clarity) is here below. Note the latency warning messages and, overall, the very large maximum value measured for the latency of the /dev/sdy block device:

[root@myhost] # stap -v blockio_latency_outliers_per_device.stp 10

Measuring block I/O latency and statistics
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop

...
latency warning, >500000 microsec: device=sdy, sector=166891231, latency=714984
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
....

I/O latency basic statistics per device, measurement time: 10 seconds
Latency measured in microseconds
Disk name          Num I/Os    Min latency    Avg latency    Max latency
....
sdu                     219            106           6217          27117
sdz                     200            123           5995          27205
sdq                     211             71           6553          31120
sdh                     256            103           6643          22663
sds                     224            101           6610          29743
sdm                     238             92           7550          35571
sde                     243             90           8652          52029
sdt                     200            105           5997          25180
sdk                     200             94           5696          35057
sdi                     206             99           7849          30636
sdg                     269             74           6746          36746
sdy                     197            102          98298        1167977
sdr                     200             89           6559          27873
sdl                     200            140           8789          31996
sdw                     210             99           7009          37118
sdd                     217             94           7440          56071
sdn                     205             99           6628          41339
....


When candidate disks for high latency have been identified, the second step is to further drill down using the script blockio_rq_issue_filter_latencyhistogram.stp. This script gathers and displays I/O latency histograms for a subset of block devices that can be specified using filters in the script header. The default filters are:

# SystemTap variables used to define filters, edit as needed
global IO_size = 8192    # this will be used as a filter for the I/O request size
                         # the value 8192 targets 8KB operations for Oracle single-block I/O
                         # use the value -1 to disable this filter
global IO_operation = 0  # this will be used as a filter: only read operations
                         # a value of 0 considers only read operations (the value 1 is for write)
                         # use the value -1 to disable this filter
global IO_devmaj = -1    # this will be used as a filter: device major number (-1 means no filter)
                         # example: use the value 253 for device mapper block devices
global IO_devmin = -1    # device minor number (or -1 if no filter)

You can use blockio_rq_issue_filter_latencyhistogram.stp to drill down on the latency histogram for those disks that have shown high latency I/O and also compare them with "good" disks. In the example above the candidate "trouble disk" is /dev/sdy, with major number 65 and minor number 128 (you can find the major and minor device numbers for "sdy" simply by using ls -l /dev/sdy).
Example:

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 128 (-1=disable filter)

Block I/O latency histogram, measurement time: 10 seconds, I/O count: 199
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
  value |-------------------------------------------------- count
     16 |                                                    0
     32 |                                                    0
     64 |@                                                   2
    128 |@@@@@@@                                            14
    256 |@@                                                  4
    512 |@                                                   2
   1024 |@@@                                                 7
   2048 |@@@@@@@@@                                          19
   4096 |@@@@@@@@@@@@@@@@@@                                 37
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      59
  16384 |@@@@@@@                                            14
  32768 |                                                    0
  65536 |@@@@@          HIGH                                10
 131072 |@@@@@          LATENCY                             11
 262144 |@@@@@          I/O                                 10
 524288 |@@@            OPERATIONS                           6
1048576 |@@                                                  4
2097152 |                                                    0


Note the presence of several I/O operations at high latency together with more normal I/O latency, for example I/Os served from SATA disks at around 8 ms and also I/Os served from the controller cache at sub-millisecond latency.
For comparison, here below is the latency histogram measured on another disk while the same workload was running. We can see that in this case the high latency points found in the previous example are no longer present. Most of the I/O operations are at around 8 ms latency and some operations are served from cache at sub-millisecond latency. The I/Os reported for the /dev/sdy disk with latency of 64 ms and above are not present in this case.

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 48 (-1=disable filter)


Block I/O latency histogram, measurement time: 10 seconds, I/O count: 196
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
   32 |                                                    0
   64 |                                                    0
  128 |@@@@@@@@@                                          19
  256 |@@@@@@@@                                           16
  512 |                                                    1
 1024 |@@@@                                                9
 2048 |@@@@@@@@@@@@@@                                     28
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       56
 8192 |@@@@@@@@@@@@@@@@@@@@@@@@                           49
16384 |@@@@@@@@@                                          18
32768 |                                                    0
65536 |                                                    0

The tests reported here have been performed on an old system running RHEL 5 (kernel 2.6.18-371) and SystemTap v1.8; however, the scripts have also been tested on OL and CentOS 7.1 (kernel 3.10.0-229) and SystemTap 2.8. If you are using SystemTap 2.6 or above you can use the scripts blockio_rq_issue_latencyhistogram_new.stp and blockio_rq_issue_filter_latencyhistogram_new.stp instead.

Conclusion: In this post we have shown a simple technique to diagnose I/O latency outliers and in particular how to drill down on I/Os served at high latency because of one or more misbehaving disks in the storage volume/filesystem. The investigation has been done using SystemTap scripts, used first to discover which disks were serving some of their I/Os with high latency and then to drill down on specific devices with the use of latency histograms. Identifying and later replacing the badly performing disks can be beneficial for the performance of the entire storage system.

Download: The tools discussed in this post can be downloaded from this webpage and from Github.

Acknowledgements and additional links: Brendan Gregg has published an extensive set of articles and tools on performance tuning, including storage latency investigations.
For more information and examples on how to use SystemTap see the SystemTap wiki
For additional tools for Oracle ASM I/O investigations see also Bertrand Drouvot's asm_metrics.pl
Kevin Closson's SLOB was used as workload generator for the examples discussed here.
Additional links on tools and techniques on Oracle I/O troubleshooting in this blog: Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Event Histogram Metric and Oracle 12c and Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Tuesday, July 14, 2015

Heat Map Visualization of Latency Histograms for NetApp C-Mode

Topic: This post is about collecting and visualizing I/O latency histograms for NetApp filers in C-mode.

Motivations: The drill down of I/O latency is an important technique for troubleshooting and benchmarking storage. Average latency values can hide details of what is happening on the storage. Think for example of storage systems with flash and spindles, each serving I/O at different latency. Moreover averaging the measured values over time can hide details in case of varying workload or with issues that appear sporadically.

Latency heat maps: Representing latency histograms over time is a 3-D visualization problem and can be nicely solved using heat maps. This has been discussed in detail by Brendan Gregg in ...

PyLatencyMap is a simple visualization tool that works on the command line: it takes latency histograms as input and outputs two latency heat maps using ANSI codes for colors.
The latency heat map on the top of the screen uses a blue palette and represents the number of operations per second as a function of the latency bucket and time. The second map in yellow-red represents the time waited as a function of the latency bucket and time.

How to collect and display I/O latency histograms for NetApp filers

For a quick start you can use Example10_NetApp_Cmode_reads.sh. Before running it you will need to edit the parameters: the password to connect to the storage, the junction point you want to measure, the cluster management node and optionally the time interval between data points. Then run the example script. Similarly, Example10b_NetApp_Cmode_writes.sh will capture and display heat maps for write operations if you prefer.

This is an example of the output

Figure 1: Example output of PyLatencyMap measuring and displaying I/O read latency histograms using performance counters from the NetApp filer (C-mode). The top heat map (blue color) represents I/O operations latency over time. The lower heat map (yellow-red color) represents the time waited per bucket.

Additional details

The 'glue' between NetApp and PyLatencyMap is provided by two Perl scripts written by Ruben Gaspar. The scripts are: (1) NetApp_histogram_Cmode.pl connects to the filer and gets the latency histograms; (2) NetApp_latency_connector.pl transforms the histograms from the format of the NetApp filer into a format that is understood by PyLatencyMap (i.e. power-of-two histograms).

Note that data aggregation in buckets is done by NetApp_latency_connector.pl using the same conventions as Oracle's wait event histograms (v$event_histogram and v$event_histogram_micro), that is a value of N for bucket B means that there have been N I/O operations with latency between B/2 and B. For those of you who use DTrace 'quantize' or SystemTap 'hist_log', please note the difference, as in those tools the convention is that a value of N for bucket B means N operations with latency between B and 2*B.
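For reference, the Oracle convention can be seen directly from SQL (an illustrative query, the event name is just an example): a WAIT_COUNT of N for WAIT_TIME_MILLI = B means N waits with latency between B/2 and B milliseconds.

-- illustrative query showing Oracle's latency bucket convention
select wait_time_milli, wait_count
from   v$event_histogram
where  event = 'db file sequential read'
order  by wait_time_milli;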
Note also the following dependencies: the scripts discussed here for the integration between NetApp performance counters and PyLatencyMap have been developed for NetApp Clustered Data ONTAP and in particular have been tested on versions 8.2 and 8.3 (notably they will not work against ONTAP version 7). Perl is also needed, including the modules Net::OpenSSH and IO::Pty. PyLatencyMap is written in Python.

Conclusions and additional links

PyLatencyMap is a tool for collecting and displaying latency heat maps on the command line, intended as an aid for performance troubleshooting and benchmarking activities. PyLatencyMap has now been integrated with NetApp performance counters, which adds to the existing data sources of Oracle wait events, SystemTap and DTrace probes, among others.

Download PyLatencyMap from:
https://github.com/LucaCanali/PyLatencyMap or http://canali.web.cern.ch/canali/resources.htm 

Previous blog entries on PyLatencyMap:
http://externaltable.blogspot.com/2015/03/heat-map-visualization-for-systemtap.html
http://externaltable.blogspot.com/2013/09/getting-started-with-pylatencymap.html
http://externaltable.blogspot.com/2013/08/pylatencymap-performance-tool-to-drill.html

Monday, June 15, 2015

Event Histogram Metric and Oracle 12c

Topic: event histogram metric, a script and some related discussions on collecting and displaying wait event latency histograms for Oracle performance troubleshooting. Also discussed is the new 12c feature of event histograms with microsecond buckets.

Why: Latency histograms (and by extension wait event histograms) provide very useful information when troubleshooting performance for systems exhibiting response time with multi-mode distribution. In such cases average wait values are often not sufficient to understand the behavior of the system under study and histograms provide a finer level of details. A common case is with I/O performance investigations for many modern storage systems, which  can serve I/O from HDD or flash, with completely different latency characteristics.

What: Event histogram metric is a command-line script for "near real-time" monitoring, based on the simple idea of taking two snapshots of the relevant V$ views and printing the wait event histogram of the delta values, see also Figure 1 for an example. You can find an example of a case where event histogram metric has helped with performance troubleshooting in a previous blog entry at this link.

What's new: The latest version of event histogram metric (v1.1, June 2015) adds support for a small but very useful new feature of Oracle version 12.1.0.2, where histogram buckets go down to the microsecond level with V$EVENT_HISTOGRAM_MICRO. Another feature I have added in event histogram metric v1.1 is the calculation of the estimated time waited per bucket. This gives an indication of the relative weight/importance of the time waited in each latency bucket and allows, for example, to compare the relative weight of many low-latency waits vs. a few events with long latency.

Download the tool from http://cern.ch/canali/resources.htm or from https://github.com/lucacanali The relevant scripts for this post are:
  • ehm.sql - a script to collect and visualize wait event histograms for Oracle RAC, 11g and 12c.
  • ehm_local.sql - same as ehm.sql but collects data only for the local instance (V$ view data), 11g and 12c.
  • ehm_micro.sql - collects and displays wait event histograms with microsecond buckets for RAC, requires Oracle 12.1.0.2.
  • ehm_micro_local.sql - same as ehm_micro.sql but for the local instance (V$ view data), requires Oracle 12.1.0.2.

Example of usage

Figure 1: Example of usage of the event histogram metric script on Oracle 12c. You can see low-latency and high-latency I/O modes. Sub-millisecond I/Os come from the storage SSD/cache, high-latency I/Os come from reads served by SATA disks. Note also that the average latency value, reported at the bottom of the figure, does not capture the details of the multiple modes and therefore does not represent the additional complexity of this workload.


Wait events of interest

Wait event 'db file sequential read' is used by the Oracle engine to provide a measure of the time spent on single-block read operations. This is an important source of data when troubleshooting the performance of many OLTP workloads, as it accounts for the time spent on storage access for random I/O (additional details and comments on the limitations of this approach are detailed in the next paragraph). This is the typical data access path when using indexes, for operations like primary key look-ups, index range scans, nested loops joins, etc., that are quite common in many transactional workloads. A short investigation of the internals of Oracle and its interaction with the OS (see also this blog post for details) shows that in most cases the time accounted by 'db file sequential read' is essentially the time taken to complete the underlying system call to read data from the storage. Therefore this event provides a good measure of the storage response time as seen from the database server. Moreover in RAC it is easy to aggregate wait event data for all instances in one histogram using GV$ views and this provides an easy source of data for the shared storage.
Note: in Exadata deployments instead of 'db file sequential read' you should expect to see the wait event 'cell single block physical read'.
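As a minimal sketch of the GV$ aggregation mentioned above (illustrative only, not the full ehm.sql script, which works on the delta values between two snapshots rather than on the cumulative counters shown here):

-- aggregate the 'db file sequential read' latency histogram over all RAC instances
select wait_time_milli, sum(wait_count) wait_count_all_instances
from   gv$event_histogram
where  event = 'db file sequential read'
group  by wait_time_milli
order  by wait_time_milli;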

Another important wait event for troubleshooting the performance of many Oracle databases with significant transactional workloads is 'log file sync'. This wait event measures the time a session waits on commit operations. There are several components to the wait time accounted in log file sync (see the paragraph below for some additional hints and known pitfalls); one of the most important for the scope of this article is the time necessary to perform I/O on the redo log files. Studying the latency of this event can be a very good source of information for performance troubleshooting, as it may allow you to understand if writes to redo log files are performed at low latency, so most likely into the storage DRAM or SSD cache, or rather with high latency, which may indicate the lack of caching, the cache going into write-through mode or other issues worth further investigation.

Limitations and additional comments

When investigating latency and storage response time for random I/O, the wait event 'db file sequential read' is a very important source of data, however not all random I/O performed by Oracle is accounted under this wait event. When Oracle sees the opportunity of optimizing storage access by using prefetching or batching I/O, it will use asynchronous I/O (when available). Asynchronous I/O for random reads is accounted under the wait event 'db file parallel read'. Asynchronous I/O improves the utilization of the I/O subsystem by the Oracle processes, as it allows submitting and reaping many I/O operations in groups, as opposed, for example, to reading blocks one by one as in the case of pread system calls. However this makes the task of correlating the wait event latency with storage response time much more difficult. This is the case for 'db file parallel read' but also for the wait events associated with asynchronous I/O used for sequential I/O operations, such as 'direct path read'. More on this topic and the pitfalls when using the wait event interface for asynchronous I/O are detailed in this other blog post (see also Oracle Support Doc ID 7448407.8 "Bug 7448407 Resource Accounting and Tracing change for Direct Path Waits").

One known pitfall when interpreting the measurements of the 'log file sync' wait event is that the wait time also includes what can be a significant CPU workload. This has been discussed in detail by Kevin Closson. There are also many other details, including a new adaptive mechanism for the communication between the server process and the log writer introduced in 11.2.0.3 (see: adaptive log file sync) and more changes introduced in 12c with redo writer worker processes. See also Oracle Support Doc ID 34592.1 "WAITEVENT: log file sync Reference Note" for a list of known issues and bugs in this area.

As an addendum to the discussion: this is the formula used in the script event histogram metric to compute the estimated value of the wait time in each bucket: T_wait = 0.75 * bucket_max_val * wait_count. As an example of the usage of the formula, consider 100 wait events in the Oracle bucket 2 ms. This means that the wait time has been between 1 ms * 100 events = 100 ms and 2 ms * 100 events = 200 ms. According to the formula above the estimated value is 0.75 * 2 ms * 100, that is the mean value between the maximum wait of 2 ms * 100 and the minimum wait of 1 ms * 100.
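As an illustrative sketch of how the formula can be applied (this query uses the 12.1.0.2 microsecond view and the cumulative counters; the ehm scripts apply the same formula to the delta values between two snapshots):

-- estimated time waited per latency bucket: T_wait = 0.75 * bucket_max_val * wait_count
select wait_time_micro,
       wait_count,
       round(0.75 * wait_time_micro * wait_count / 1000, 1) estimated_time_waited_ms
from   v$event_histogram_micro
where  event = 'db file sequential read'
order  by wait_time_micro;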

Related work and links

The study of latency histograms as a function of time leads to a 3D representation problem. A natural solution for it is to represent latency values in a heat map. If you are interested in more details see this blog post on OraLatencyMap, a tool to collect and display heat maps of wait events using SQL*plus and this series of posts on PyLatencyMap, a python-based tool integrated with many data sources, including Oracle wait events, SystemTap and DTrace.
Event histogram metric is part of a group of short command line scripts I use for Oracle troubleshooting, including scripts for near-real time measurement of Oracle performance. The scripts are available for download, you can read more details following this link.
When troubleshooting Oracle performance and investigating I/O latency details it can be useful to get graphs of historical trends from the AWR repository. PerfSheet4 is a tool that I use to extract and visualize AWR data.

Conclusions

Latency histograms of Oracle wait events are a powerful drill-down tool for performance investigations. Histograms are necessary when average values do not provide an accurate picture, for example when studying I/O latency of storage with multi-mode response time. Two wait events are of particular interest for many Oracle workloads: db file sequential read and log file sync. In Oracle 12.1.0.2 the new view V$EVENT_HISTOGRAM_MICRO provides latency details with microsecond buckets, which is very useful when investigating SSDs and low-latency storage in general. A set of simple scripts for near real-time data collection and display of the histograms has been discussed in this post and is available for download.

Wednesday, March 18, 2015

Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap

Topic: PyLatencyMap v1.2, a tool for collecting and visualizing I/O latency data, and its integration with SystemTap.

Introduction: When studying storage performance, latency drill-down can be very important. Measuring the average I/O latency is often not enough; latency histograms have proven to be more suitable for investigating modern storage systems. This is because the storage service time can present multiple "modes". A common case is with storage systems where I/O requests can be served by low-latency SSD or by spindles with much higher service time. For such systems latency histograms provide an additional dimension to the performance investigation.
Latency heat maps come into play when we want to visualize a series of latency histograms over time. This is a 3-D visualization problem and heat maps provide a way to represent the third dimension as color (see as an example Figure 1 later in this article).
The main ideas behind the use of latency histograms for investigating storage performance and their visualization with heat maps have been previously detailed in the excellent paper by Brendan Gregg on ACM Queue and also in blog articles and presentations.

PyLatencyMap is a tool and a set of scripts for gathering and displaying heat maps using the command line interface (supporting ANSI terminal graphics). The main use for which PyLatencyMap has been developed is to measure and display storage latency data. PyLatencyMap can also be used to display heat maps from a generic source of latency data.
The latest update to PyLatencyMap, version 1.2, provides integration with SystemTap and a few example scripts for getting started. SystemTap can be used to collect storage latency data and PyLatencyMap to produce the associated heat maps. Many more data sources have already been integrated with PyLatencyMap and examples are available with the tool: Oracle wait event histograms, DTrace histograms and custom aggregations from trace files.
PyLatencyMap has a modular architecture, which follows the typical use of Unix/Linux CLI tools. Data is generated by a collector script (for example a SystemTap script), then it is piped through an optional filter and finally piped into the visualization engine:

latency data_source | <optional connector script> | python LatencyMap.py <options>

A series of examples and get-started scripts are provided with the tool. In this article we will focus on the scripts based on SystemTap. For discussions on how to use PyLatencyMap with SQL*plus and DTrace see also this other blog entry. Just to get started, the following command line runs data collection with SystemTap and visualization with PyLatencyMap:

stap -v SystemTap/blockio_rq_issue_basic_pylatencymap.stp |python SystemTap/systemtap_connector.py |python LatencyMap.py

The script blockio_rq_issue_basic_pylatencymap.stp measures the latency of block I/O operations and can be used to collect latency histograms and display them as heat maps. The probes hook onto kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete").
There are three probes in the script: one that gathers information at each block I/O operation, another that collects stats at the end of the I/O and a third probe that periodically prints the latency histogram. Note the use of the aggregation operator '<<<' that SystemTap uses to populate the histograms, which are later printed with the use of '@hist_log'.

probe kernel.trace("block_rq_issue") {
        requestTime[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
   t = gettimeofday_us()
   s = requestTime[$rq]
   if (s > 0) {
       latencyTimes <<< (t-s)
       delete requestTime[$rq]
   }
}

probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n",gettimeofday_us(),tz_ctime(gettimeofday_s()))
   printf("label, Latency of block I/O requests measured with SystemTap\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")

}

SystemTap scripts for PyLatencyMap:

For production usage there are more advanced scripts than the one just discussed. The script Example9_SystemTap_blockIO_req.sh is based on blockio_rq_issue_pylatencymap.stp and provides tracing based on block_rq_issue and block_rq_complete as in the example above, but contains additional filters to remove 'false positives'. For ease of use, the data collection and visualization steps are packaged together into a single launcher script.

Another example script, Example9b_SystemTap_blockIO_filter_read_8KB.sh, is based on blockio_rq_issue_filter_pylatencymap.stp, which makes it possible to further filter on the block I/O being issued; in particular the default filters select 8KB reads, which are of interest when testing OLTP applications in Oracle.

Example9c_SystemTap_pread.sh is a script that packages a SystemTap probe on pread calls. These are typically the underlying OS calls for single-block reads that show up as 'db file sequential read' wait events in Oracle. Note however that Oracle can also perform single-block reads using asynchronous I/O, therefore the probes discussed above on block I/O trace points are more general than this one.

Examples 9e, 9f: the probes in these scripts measure block I/O by hooking on different probe points and are provided for reference. Note that these scripts require kernel debuginfo symbols to be installed.

oracle_event_latency.stp: this script measures the latency of Oracle wait events and produces histograms by hooking directly into the Oracle executable. More details on how this works can be found at the blog entry "SystemTap into Oracle".

Example:

Figure 1 here below is an example of the output from PyLatencyMap run using the script Example9b_SystemTap_blockIO_filter_read_8KB.sh. The measured workload has been produced with SLOB (at increasing load). The top heat map represents the number of I/O operations per latency bucket as a function of time. On the horizontal axis you can find the time, on the vertical axis the latency buckets. The tool assumes a power-of-2 scale. Color is used to represent the number of I/O operations per second. A blue palette is used for the frequency heat map.
The heat map on the lower half of the figure uses a yellow-red palette to represent the amount of time waited in each bucket. I refer to it as an intensity/importance heat map. It highlights the relative weight of latency time spent on each bucket and helps in comparing the impact of a large number of low-latency operations vs. a smaller number of high-latency operations.


Figure 1:  The output of PyLatencyMap contains the latency histogram of the latest measured point and  two heat maps: on top the heat map of the IOPS, on the lower graph the heat map of the waited time.

Note: for SystemTap (and DTrace) data sources, the latency bucket has to be interpreted as follows: a value of N in bucket B, means N I/O operations with latency between B and 2*B.
Other histogram latency data sources (notably Oracle) use a different convention: with the same example, a value of N in a latency bucket B for Oracle histograms means N waits of latency between B/2 and B. PyLatencyMap takes care of this difference when calculating the time waited histograms (i.e. the data displayed in the lower, yellow-red heat map).
Note: The scripts discussed here for block I/O tracing use the trace point block_rq_issue; it is also possible to use the trace point block_rq_insert to measure I/O latency from the point of insertion into the queue instead.

References and download links:

PyLatencyMap can be downloaded from this link or can be forked from Github. Additional information on PyLatencyMap is also available at this blog entry and also at this other blog entry.

Conclusions and acknowledgements

SystemTap is a very powerful tool for measuring I/O and producing latency histograms for storage performance investigations. PyLatencyMap provides an easy way to visualize histograms produced by SystemTap as latency heat maps. PyLatencyMap collects data and displays heat maps using the command line interface on character terminals with ANSI escape codes. This provides the combined benefit of a graphical output and the power of the command line. PyLatencyMap can also be used to visualize heat maps from latency histograms produced with DTrace, Oracle's SQL*plus and/or from latency data aggregated from text and trace files.

Many thanks go to Brendan Gregg for sharing in his blog many awesome ideas and tools on the topic of heat map visualization and on using DTrace and SystemTap that have provided a great source of inspiration for this work.

Note: this entry was originally posted in March 2015, with major updates in July 2015.

Wednesday, February 18, 2015

Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization

Topic: This post is about the latest updates to PerfSheet4 v3.7 (February 2015). PerfSheet4 is a tool aimed at DBAs and Oracle performance analysts. It provides a simplified interface to extract and visualize AWR time series data using Excel pivot charts.


Why PerfSheet4: PerfSheet4 is aimed at querying and displaying time-series data from AWR repository tables. This is a very rich source of information to analyze database workloads and trends in the context of performance analysis or capacity planning. The tool automates several time-consuming tasks and provides a point-and-click user interface:
  • PerfSheet4 comes with a set of pre-defined queries to extract data from some of the most common AWR tables and in particular to compute deltas for the performance values where relevant.
  • PerfSheet4 can query the database directly or work in an offline mode (you run the SQL script from sqlplus and later load the csv file into PerfSheet4).
  • PerfSheet4 uses Excel Pivot Charts. Pivot Charts provide a powerful interface to display and navigate through AWR data.
  • PerfSheet4 comes with a set of pre-defined graphs for each of the pre-defined queries. This makes preparing some of the most common workload graphs a matter of a few clicks and also provides a good starting point into more advanced usages of the Pivot Charts.

What's new: If you have used previous versions of PerfSheet4 you will find in v3.7 (February 2015 update) a larger collection of pre-defined graphs. This is aimed at making PerfSheet4 easier to use for many of the standard tasks. Typically 4 graphs will be presented for each of the pre-defined queries. Another new feature is the choice of using Oracle ODBC drivers, as opposed to using the default Microsoft ODBC driver for Oracle. This feature has been introduced especially for Windows 8/8.1 where Microsoft ODBC driver for Oracle does not seem to work correctly.


PerfSheet4 is a free tool and is available for download at this link. The tool has been developed and tested for Oracle 11gR2 and 12cR1. On the client side it has been tested for Excel 2010 and 2013 for Windows (7 and 8/8.1). The figures here below show two examples of the user interface.

Figure 1: The start page and user interface of PerfSheet4. This is where you can run the query of interest to extract data from AWR, or load csv data into Excel. From this page you can then launch the Pivot Chart plot.


Figure 2: A Pivot Chart generated by PerfSheet4. The pre-defined graph is just a starting point. Customize the chart from there by selecting the data of interest and/or changing any other feature of the graph. 


Additional details: Here below is the list of AWR queries and charts available in PerfSheet4 v3.7 (February 2015). The tool is open source and therefore fully configurable, in case you want to modify its default behavior. There are two versions packaged in the zip file of the tool: the main version is called PerfSheet4_v3.7.xlsm. A second version is available, called PerfSheet4_v3.7_noActivex_controls.xlsm, which avoids the use of ActiveX controls and should be better suited for those cases where ActiveX controls don't work well and/or for older versions of Excel.

Query: Workload data

  • Extracts data from dba_hist_sysmetric_summary. System metrics provide details on the usage of system resources over time.
  • Pre-defined charts: "Average active sessions", "CPU usage and database time", "Physical reads and writes IOPS", "Physical reads and writes throughput".

Query: System statistics  

  • Extracts data from dba_hist_sysstat. Hundreds of system statistics provide valuable details on the workload. The query computes delta values between snapshots and rates (that is delta values divided over delta time); see the sketch after this list for the basic idea.
  • Pre-defined charts: "Physical read and write IOPS", "Physical read and write throughput", "CPU and wait time", "CPU used per instance".
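
As a hedged sketch of what "delta values and rates" means here (simplified: the actual PerfSheet4 queries handle multiple statistics, RAC instances and instance restarts; the statistic name below is just an example):

-- for each AWR snapshot: delta of the cumulative counter vs. the previous snapshot
-- and the corresponding rate per second over the snapshot interval
select sn.snap_id,
       sn.instance_number,
       ss.stat_name,
       ss.value - lag(ss.value) over (partition by ss.dbid, ss.instance_number, ss.stat_name
                                      order by sn.snap_id) delta_value,
       round( (ss.value - lag(ss.value) over (partition by ss.dbid, ss.instance_number, ss.stat_name
                                              order by sn.snap_id))
              / ((cast(sn.end_interval_time as date) - cast(sn.begin_interval_time as date)) * 86400)
            , 1) rate_per_sec
from   dba_hist_sysstat ss
join   dba_hist_snapshot sn
       on  sn.snap_id = ss.snap_id
       and sn.dbid = ss.dbid
       and sn.instance_number = ss.instance_number
where  ss.stat_name = 'physical read total IO requests'
order  by sn.snap_id, sn.instance_number;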

Query: Wait events  

  • Extracts data from dba_hist_system_event. This provides information on all the wait events. The query computes delta values between snapshots and rates (i.e. delta values divided over delta time).
  • Pre-defined charts: "db file sequential read - wait event for single block reads", "db file scattered read - wait event for multiblock reads into the buffer cache", "log file sync - wait event for commit time".

Query: Wait events per class  

  • Extracts data from dba_hist_system_event.  This provides information on all the wait events, computes delta values and rates (delta value over delta time) aggregating over wait class.
  • Pre-defined chart: "Wait time grouped per class".

Query: IO stats details  

  • Extracts data from dba_hist_iostat_detail. This provides details on the I/O operations performed by Oracle, computes delta values and rates (delta value over delta time) aggregating over function and file type.
  • Pre-defined charts: "Small read IOPS", "Large read throughput", "Small write IOPS", "Large write throughput".

Query: IO wait histograms  

  • Extracts data from dba_hist_event_histogram for I/O related events. This provides drill-down information on latency. The query computes delta values between snapshots and rates (i.e. delta values divided over delta time).
  • Pre-defined charts: "db file sequential read latency histogram - number of wait events", "db file sequential read latency histogram - time waited (millisec/sec)", "commit time wait latency histogram - number of wait events per sec", "commit time wait latency histogram - time waited (millisec/sec)".

Query: Stats per service  

  • Extracts data from dba_hist_service_stat, computes delta values between snapshots and rates (i.e. delta values divided over delta time) aggregating by service name.
  • Pre-defined charts: "DB time", "CPU time", "User I/O wait time".


Query: Top 3 wait events and CPU  

  • Extracts data from dba_hist_system_event and dba_hist_sysstat. This query selects the top 3 non-idle wait events and the CPU usage for each instance. It computes delta values and rates (delta value over delta time).
  • Pre-defined charts: "Instance N.1 - top 3 wait and CPU", "Instance N.2 - top 3 wait and CPU", "All instances - top 3 wait and CPU".


Previous work and acknowledgments: PerfSheet4 is a spin-off of previous original work by Tanel Poder, rewritten and integrated with additional functionality for AWR analysis and with important changes to the user interface, with the goal of making it more 'point-and-click'. A previous post on PerfSheet4 can be found at this link. Many thanks go to Tanel for his original work on PerfSheet and for his support to this development.


Getting started video: http://youtu.be/sdvx4zB-fvo



Wednesday, December 3, 2014

Talks at UKOUG TECH 2014 with CERN Speakers

This is a short blog post with a list of the talks at the UKOUG TECH 2014 conference in Liverpool with CERN speakers. Come and say hi, we hope to see many of you there!


(*) slides can be downloaded from: http://cern.ch/canali