Wednesday, March 18, 2015

Heat Map Visualization for SystemTap

Topic: PyLatencyMap v1.2 and how it can be used to produce heat map visualization of SystemTap histograms.

Introduction: When studying storage performance, latency drill-down is a very important data source. Measuring the average I/O latency is often not enough: latency histograms have proven to be more suitable for investigating modern storage systems. This is because for many storage systems the response time has multiple modes: think for example of the common case of storage where I/O requests can be served either by low-latency SSD or by spindles with a much higher response time. For such systems latency histograms provide a much-needed additional dimension to the performance investigation.
Latency heat maps come into play when we want to visualize a series of latency histograms over time. This is a 3-D visualization problem and heat maps provide a way to represent the third dimension as color (see as an example Figure 1 later in this article).
The main ideas behind the use of latency histograms for investigating storage performance and their visualization as heat maps have been previously detailed in the excellent paper by Brendan Gregg in ACM Queue and in subsequent blog articles and presentations.

PyLatencyMap is a tool and set of scripts to gather latency data and display heat maps at the command line using ANSI terminal graphics. The original use case for which PyLatencyMap was developed is measuring and displaying storage latency data, but it can be extended to display heat maps from a generic source of latency data.
The latest update to PyLatencyMap, version 1.2, provides integration with SystemTap and a few example scripts for getting started with SystemTap to collect storage latency data and PyLatencyMap to produce the associated heat maps. Several other data sources are also integrated with PyLatencyMap: Oracle wait event histograms, DTrace histograms and custom aggregations from trace files.
PyLatencyMap has a modular architecture, which follows the typical usage pattern of Unix/Linux CLI tools. Data is generated by a collector script (for example a SystemTap script), then piped through an optional connector/filter and finally into the visualization engine:

latency data_source | <optional connector script> | python LatencyMap.py <options>

A series of examples and get-started scripts are provided with the tool. The following example uses SystemTap to collect data (Example 9 in the tool zip bundle):

stap -v SystemTap/blockio_ioscheduler_latency.stp | python SystemTap/systemtap_connector.py | python LatencyMap.py

Data gathering is done by hooking into the Linux kernel, in particular into the I/O scheduler (elevator) interface, to measure I/O latency:

#!/usr/bin/stap
#
# blockio_ioscheduler_latency.stp
#
# This is a SystemTap script to gather block I/O latency from the Linux kernel block I/O interface
# and print I/O latency histograms to be consumed by PyLatencyMap for heatmap visualization
#
# By Luca.Canali@cern.ch, March 2015
#

# Note this script requires kernel debuginfo
#

global latencyTimes, requestTime

probe ioscheduler.elv_next_request {
   requestTime[$q] = gettimeofday_us()
}


probe ioscheduler.elv_completed_request {
  t = gettimeofday_us()
  s = requestTime[$q]
  delete requestTime[$q]
  if (s > 0)
     latencyTimes <<< (t-s)
}


probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n",gettimeofday_us(),tz_ctime(gettimeofday_s()))
   printf("label, Latency of block I/O requests measured with SystemTap and ioscheduler provider\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")
}



The script blockio_ioscheduler_latency.stp measures the latency of block I/O operations and can be used to collect latency histograms and display them as heat maps. See also the figure below for an example of the typical output of PyLatencyMap:

Figure 1: Two heat maps are displayed. The top one is a frequency heat map: the horizontal axis represents time, the vertical axis the latency bucket (the tool assumes a logarithmic bucket scale) and color represents the number of I/O operations per second, using a blue palette. The heat map on the lower half of the figure uses a yellow-red palette to represent the amount of time waited in each bucket; I refer to it as an intensity/importance heat map. It can be used to understand the relative weight of the latency time spent, for example, on a large number of low-latency operations vs. a smaller number of high-latency operations.

Another script that can be used to measure I/O latency is blockio_rq_latency.stp. This SystemTap script hooks into kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete"). The main advantage compared to the script blockio_ioscheduler_latency.stp of Example 9 above is that this script does not need the kernel debuginfo package to be installed (which can be quite a large package). The disadvantage is that in certain circumstances blockio_rq_latency.stp also measures queuing time at the Linux level (I'll postpone the details to another blog post), making this probe less accurate when measuring storage I/O latency:

#!/usr/bin/stap
#
# blockio_rq_latency.stp
#
# This is a SystemTap script to gather block I/O latency from the Linux kernel block I/O interface
# and print I/O latency histograms to be consumed by PyLatencyMap for heatmap visualization
#
# By Luca.Canali@cern.ch, March 2015
#
# This script is based on original ideas of biolatency-nd.stp of systemtap-lwtools
# by Brendan Gregg
#
# Note the use of kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete") instead of
# ioblock.request and ioblock.end; this allows the probe to work without the need for kernel debuginfo.

#
global latencyTimes, requestTime

probe kernel.trace("block_rq_issue") {
        requestTime[$rq] = gettimeofday_us()

}

probe kernel.trace("block_rq_complete") {
   t = gettimeofday_us()
   s = requestTime[$rq]
   if (s > 0) {
       latencyTimes <<< (t-s)
       delete requestTime[$rq]
   }
}


probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n",gettimeofday_us(),tz_ctime(gettimeofday_s()))
   printf("label, Latency of block I/O requests measured with SystemTap\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")

}
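
To feed this data source into PyLatencyMap, a pipeline analogous to Example 9 can be used; the SystemTap/ path below reflects the layout of the tool bundle and may need to be adjusted:

stap -v SystemTap/blockio_rq_latency.stp | python SystemTap/systemtap_connector.py | python LatencyMap.py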

More SystemTap-based data sources are provided together with PyLatencyMap (see the download link below):
blockio_rq_latency_8k_read.stp: similarly to Example 9b, this script uses the block I/O tracepoints, but adds filters so that only read operations of 8 KB are measured (aimed at studying Oracle single-block read workloads).
pread_latency.stp: this script measures pread calls and produces latency histograms as output (a minimal sketch of this approach is shown after this list).
oracle_event_latency.stp: this script measures the latency of Oracle wait events and produces histograms by hooking directly into the Oracle executable. More details on how this works can be found in the blog entry "SystemTap into Oracle".
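
For reference, here is a minimal sketch of the pread-based approach mentioned above (an illustration only; the distributed pread_latency.stp may differ in its details). It uses the standard syscall tapset probe points and emits the same record format as the other collectors:

#!/usr/bin/stap
#
# Sketch: measure pread system call latency per thread and print histograms
# in the record format consumed by PyLatencyMap.
# Note: on some SystemTap versions the probe points are named
# syscall.pread64 / syscall.pread64.return instead of syscall.pread.
#
global latencyTimes, requestTime

probe syscall.pread {
   requestTime[tid()] = gettimeofday_us()
}

probe syscall.pread.return {
   s = requestTime[tid()]
   delete requestTime[tid()]
   if (s > 0)
      latencyTimes <<< (gettimeofday_us() - s)
}

probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n", gettimeofday_us(), tz_ctime(gettimeofday_s()))
   printf("label, Latency of pread calls measured with SystemTap\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")
}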

References and download links
PyLatencyMap can be downloaded from this link or can be forked on GitHub. Additional information on PyLatencyMap is also available in this blog entry and in this other blog entry.

Conclusions and acknowledgements
SystemTap is a very powerful tool for collecting latency details and latency histograms for storage performance investigations, among others. PyLatencyMap provides an easy way to visualize the histograms produced by SystemTap as latency heat maps. PyLatencyMap collects data and displays heat maps at the command line on character terminals with ANSI escape codes, for speed and ease of use. PyLatencyMap can also be used to visualize heat maps from latency histograms produced with DTrace, Oracle sql*plus and/or from latency data aggregated from text and trace files.

Many thanks go to Brendan Gregg for sharing many awesome ideas and tools on the topic of heat map visualization and on using DTrace and SystemTap that have provided a great source of inspiration for this work.

Wednesday, February 18, 2015

Latest updates to PerfSheet4, a tool for Oracle AWR data mining and visualization

Topic: This post is about the latest updates to PerfSheet4 v3.7 (February 2015). PerfSheet4 is a tool aimed at DBAs and Oracle performance analysts. It provides a simplified interface to extract and visualize AWR time series data using Excel pivot charts.


Why PerfSheet4: PerfSheet4 is aimed at querying and displaying time-series data from the AWR repository tables. This is a very rich source of information for analyzing database workloads and trends in the context of performance analysis or capacity planning. The tool automates several time-consuming tasks and provides a point-and-click user interface:
  • PerfSheet4 comes with a set of pre-defined queries to extract data from some of the most common AWR tables and, in particular, to compute deltas for the performance values where relevant.
  • PerfSheet4 can query the database directly or work in an offline mode (you run the sql script from sqlplus and later load the csv file into PerfSheet4).
  • PerfSheet4 uses Excel Pivot Charts. Pivot Charts provide a powerful interface to display and navigate through AWR data.
  • PerfSheet4 comes with a set of pre-defined graphs for each of the pre-defined queries. This makes preparing some of the most common workload graphs a matter of a few clicks and also provides a good starting point into more advanced usages of the Pivot Charts.

What's new: If you have used previous versions of PerfSheet4, you will find in v3.7 (February 2015 update) a larger collection of pre-defined graphs. This is aimed at making PerfSheet4 easier to use for many of the standard tasks. Typically 4 graphs are provided for each of the pre-defined queries. Another new feature is the choice of using Oracle ODBC drivers, as opposed to the default Microsoft ODBC driver for Oracle. This feature has been introduced especially for Windows 8/8.1, where the Microsoft ODBC driver for Oracle does not seem to work correctly.


PerfSheet4 is a free tool and is available for download at this link. The tool has been developed and tested for Oracle 11gR2 and 12cR1. On the client side it has been tested for Excel 2010 and 2013 for Windows (7 and 8/8.1). The figures here below show two examples of the user interface.

Figure 1: The start page and user interface of PerfSheet4. This is where you can run the query of interest to extract data from AWR, or load csv data into Excel. From this page you can then launch the Pivot Chart plot.


Figure 2: A Pivot Chart generated by PerfSheet4. The pre-defined graph is just a starting point. Customize the chart from there by selecting the data of interest and/or changing any other feature of the graph. 


Additional details: Here below is the list of AWR queries and charts available in PerfSheet4, v3.7 (February 2015). The tool is open source and therefore fully configurable, in case you want to modify its default behavior. There are two versions packaged in the zip file of the tool: the main version is called PerfSheet4_v3.7.xlsm. A second version, called PerfSheet4_v3.7_noActivex_controls.xlsm, avoids the use of ActiveX controls and should be better suited to cases where ActiveX controls don't work well and/or to older versions of Excel.

Query: Workload data

  • Extracts data from dba_hist_sysmetric_summary. System metrics provide details on the usage of system resources over time.
  • Pre-defined charts: "Average active sessions", "CPU usage and database time", "Physical reads and writes IOPS", "Physical reads and writes throughput".

Query: System statistics  

  • Extracts data from dba_hist_sysstat. Hundreds of system statistics provide valuable details on the workload. The query computes delta values between snapshots and rates (that is delta values divided over delta time).
  • Pre-defined charts: "Physical read and write IOPS", "Physical read and write throughput", "CPU and wait time", "CPU used per instance".

Query: Wait events  

  • Extracts data from dba_hist_system_event. This provides information on all the wait events. The query computes delta values between snapshots and rates (i.e. delta values divided over delta time).
  • Pre-defined charts: "db file sequential read - wait event for single block reads", "db file scattered read - wait event for multiblock reads into the buffer cache", "log file sync - wait event for commit time".

Query: Wait events per class  

  • Extracts data from dba_hist_system_event.  This provides information on all the wait events, computes delta values and rates (delta value over delta time) aggregating over wait class.
  • Pre-defined chart: "Wait time grouped per class".

Query: IO stats details  

  • Extracts data from dba_hist_iostat_detail. This provides details on the I/O operations performed by Oracle, computes delta values and rates (delta value over delta time) aggregating over function and file type.
  • Pre-defined charts: "Small read IOPS", "Large read throughput", "Small write IOPS", "Large write throughput".

Query: IO wait histograms  

  • Extracts data from dba_hist_event_histogram for I/O related events. This provides drill-down information on latency. The query computes delta values between snapshots and rates (i.e. delta values divided over delta time).
  • Pre-defined charts: "db file sequential read latency histogram - number of wait events", "db file sequential read latency histogram - time waited (millisec/sec)", "commit time wait latency histogram - number of wait events per sec", "commit time wait latency histogram - time waited (millisec/sec)".

Query: Stats per service  

  • Extracts data from dba_hist_service_stat, computes delta values between snapshots and rates (i.e. delta values divided over delta time) aggregating by service name.
  • Pre-defined charts: "DB time", "CPU time", "User I/O wait time".


Query: Top 3 wait events and CPU  

  • Extracts data from dba_hist_system_event and dba_hist_sysstat. This query selects the top 3 non-idle wait events and the CPU usage for each instance. It computes delta values and rates (delta value over delta time).
  • Pre-defined charts: "Instance N.1 - top 3 wait and CPU", "Instance N.2 - top 3 wait and CPU", "All instances - top 3 wait and CPU".


Previous work and acknowledgments: PerfSheet4 is a spin off of previous original work by Tanel Poder, rewritten and integrated with additional functionality for AWR analysis and with important changes to the user interface with the goal of making it more 'point-and-click'. A previous post on PerfSheet4 can be found at this link. Many thanks go to Tanel for his original work on PerfSheet and for his support to this development.


Getting started video: http://youtu.be/sdvx4zB-fvo



Wednesday, December 3, 2014

Talks at UKOUG TECH 2014 with CERN speakers

This is a short blog post with a list of the talks at the UKOUG TECH 2014 conference in Liverpool with CERN speakers. Come and say hi, we hope to see many of you there!


(*) slides can be downloaded from: http://cern.ch/canali

Tuesday, November 4, 2014

Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Topic: This post is about tracing logical and physical reads in Oracle using systemtap. You will find a few examples illustrating the main mechanisms used by Oracle to do physical and logical I/O and you will learn how to build systemtap scripts to further explore and troubleshoot Oracle I/O.

Introduction: Performing input/output to and from the storage media, i.e. physical I/O, is one of the critical tasks underlying all database engines. When accessing data stored in RAM from the database cache we can talk of logical I/O. Physical I/O is performed by the database using system calls to the O/S kernel. The type of system calls used depends on database settings, the O/S and the type of storage. Logical I/O is performed by the database processes using Oracle kernel functions. Oracle instruments the timing of physical I/O calls with the wait event interface, while logical I/O time is generically accounted for as CPU time. Tracing I/O operations performed by Oracle processes together with the details provided by the Oracle wait event interface is a powerful technique to investigate Oracle I/O internals. With the help of a few real-life examples we will investigate what the mechanisms are that Oracle uses to access storage, how the I/O operations correlate to the wait event interface and what the meaning of the wait event time is for the most common wait events associated with I/O.

Building the tracing tool: Investigations of what Oracle and the O/S do when an I/O operation is performed require specialized tools. Systemtap is a dynamic tracing tool specific to Linux that allows tracing of both the O/S and the Oracle functions (Linux kernel functions, system calls and Oracle database kernel functions). In the following you will read a description of a systemtap script that can be used to trace Oracle logical and physical I/O: it is composed of probes that can attach to the function calls of interest and can read function parameters, memory and CPU registers.
Systemtap requires relatively recent kernels to be able to probe userspace: either the utrace or uprobe_events functionality needs to be available. This is the case with RHEL/OEL 6.5 and higher but not with RHEL5. Moreover for tracing system calls the kernel debuginfo package needs to be installed.
More details on this and on how to build a lab environment can be found in this article. The script that we will use in this blog post to trace Oracle logical I/O, physical I/O and the wait event interface is trace_oracle_logicalio_wait_events_physicalio_12102.stp (for Oracle 12.1.0.2). A version for 11.2.0.4 is also available: trace_oracle_logicalio_wait_events_physicalio_11204.stp. Before moving to the examples and studies of Oracle I/O I would like to review the three main parts of the script: tracing physical I/O, logical I/O and wait events.

Tracing physical I/O: Oracle will ultimately issue system calls to execute I/O operations. The type of calls depends on database settings, the O/S type and the type of storage. In this article we will restrict our examples to Oracle on Linux using ASM storage on block devices. Notably, direct NFS, local storage and ASM with asmlib will be left out and possibly addressed in future studies. System calls can also be traced in Linux using the strace(1) utility. However, in this article I use systemtap probes to gather information on the system calls of interest: pread, pwrite, io_submit and io_getevents. Additional useful systemtap probes for the underlying block I/O layer are: ioblock.request, ioblock_trace.request and ioblock.end (note, these are not system calls, but rather tracepoints for function calls inside the Linux kernel).

Probe name, purpose and selected parameters:

  • syscall.pread / syscall.pwrite: synchronous I/O, read/write from/to a file descriptor. Parameters: fd (file descriptor number), offset, count. Return value: bytes read/written.
  • syscall.io_submit: asynchronous I/O, submit blocks for processing. Parameters: nr (number of I/Os) and, for each I/O: file descriptor, offset, bytes, opcode (0 for read, 1 for write). Return value: number of I/O operations performed.
  • syscall.io_getevents: asynchronous I/O, read events from the completion queue. Parameters: min_nr (minimum number of I/Os to read) and the timeout for event waiting. Return value: number of I/Os reaped and, for each I/O: file descriptor, offset, bytes.
  • ioblock.request, ioblock_trace.request: I/O requests sent to the block device interface kernel layer. Parameters: devname, sector, size, rw, block_io structure address.
  • ioblock.end: return from the block device interface kernel layer. Parameters: devname, sector, rw, block_io structure address.
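
As a minimal illustration (not the full tracing script), here is how two of these probe points can be attached; the variable names used (fd, count, offset, devname, sector, size, rw) follow the SystemTap syscall and ioblock tapsets:

# Sketch: print the parameters of synchronous reads and of block I/O requests
probe syscall.pread {
   printf("pread: fd=%d, count=%d, offset=%d\n", fd, count, offset)
}

probe ioblock.request {
   printf("ioblock.request: devname=%s, sector=%d, size=%d, rw=%d\n", devname, sector, size, rw)
}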


Tracing logical I/O: This is about providing the history and details of the logical I/O operations, which is when Oracle reads a data block from memory (buffer cache). Logical I/O is a super-set of physical I/O: if Oracle does not find data in memory (buffer cache) it will issue a physical read. You can find a few examples of this mechanism in action later in this post.
Oracle instrumentation provides extensive information on logical reads in several V$ views, such as V$SYSSTAT, V$SESSTAT and V$SYSMETRIC. Logical read data is also available from sql*plus with "set autotrace on" and in tkprof reports of 10046 traces. Logical reads come in two main "flavors": consistent reads and current reads. Oracle uses consistent reads when reading data at a given system change number (SCN); current reads are used when performing DML operations on a block. Consistent reads can be traced in Oracle using events 10200 and 10201. More details can be found in the presentation by Julian Dyke on logical I/O.
In this article we will trace logical I/O by attaching systemtap probes to a set of Oracle kernel functions. The role of the functions kcbgtcr and kcbgcur for tracing consistent and current reads has already been discussed by Alexander Anokhin in his excellent blog articles on DTrace LIO. More recently I have investigated this topic and found it useful to also trace the functions kcbzib, kcbzgb and kcbzvb.
A summary of the main findings on these functions is in the table here below. Note: the expressions %rdi, %rsi, %r8 indicate the values of the corresponding CPU registers. With systemtap the register value can be read with the function register, for example: register("rdi"). The CPU registers are used to extract the function call arguments, as systemtap does not provide the function call arguments when no debug symbols are available for the executable. Notably, the x86-64 calling conventions documented at www.x86-64.org/documentation/abi.pdf state that the integer parameters of a function call are passed in the following registers (in order): %rdi, %rsi, %rdx, %rcx, %r8 and %r9.


Function name, purpose and selected parameters:

  • kcbgtcr: Kernel Cache Buffers Get Consistent Read, used for consistent reads.
    Selected parameters:
      tbs#=user_int32(%rdi)
      rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
      block#=user_int32(%rdi+4) & 0x003FFFFF
      data_object_id#=user_int32(%rdi+8)
      object_id#=user_int32(%rdi+12)
    Note: for bigfile tablespaces, block#=user_int32(%rdi+4)

  • kcbgcur: Kernel Cache Buffers Current, used for current reads.
    Selected parameters:
      tbs#=user_int32(%rdi)
      rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
      block#=user_int32(%rdi+4) & 0x003FFFFF
      data_object_id#=user_int32(%rdi+8)
      object_id#=user_int32(%rdi+12)
    Note: for bigfile tablespaces, block#=user_int32(%rdi+4)

  • kcbzib: kcbZIB should stand for: Z (kcbz.o is a module for physical IO helper functions), IB: Input Buffer. Oracle will perform physical read(s) into the buffer cache.

  • kcbzgb: the suffix GB in kcbZGB should stand for: Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations).
    Selected parameters:
      tbs n#=%rsi
      rel file n#=%rdx >> 22 & 0x003FFFFF
      block#=%rdx & 0x003FFFFF
      data_object_id#=%r8
      object_id#=%r9
    Note: for bigfile tablespaces, block#=%rdx

  • kcbzvb: invoked after Oracle has read a given block. Note: this function is used both for reads in the buffer cache and for direct reads.
    Selected parameters:
      tbs n#=%rsi
      rel file n#=%rdx >> 22 & 0x003FFFFF
      block#=%rdx & 0x003FFFFF
      data_object_id=user_int32(%rdi+24)
    Note: for bigfile tablespaces, block#=%rdx
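
For illustration, a minimal probe reading the registers as listed above could look like the following sketch for kcbgcur (an illustration only: it assumes an x86-64 build of the Oracle versions discussed here, and the offsets may differ elsewhere):

# Sketch: decode the current-read arguments from register rdi in kcbgcur
probe process("oracle").function("kcbgcur") {
   rdba = user_int32(register("rdi") + 4)   # relative data block address, at offset 4
   printf("kcbgcur: tbs#=%d, rfile#=%d, block#=%d, data_object_id#=%d\n",
          user_int32(register("rdi")),
          rdba >> 22 & 0x003FFFFF,
          rdba & 0x003FFFFF,
          user_int32(register("rdi") + 8))
}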

Note: as I write this only uprobes for kernel versions 3.10 or higher support return probes for userspace functions. Therefore this functionality is not available with RHEL or OEL 6, but it is available on RHEL 7.0 kernel.

Tracing wait events: The wait event interface is probably one of the best features in Oracle performance tuning as it provides a quantitative method to identify performance issues and bottlenecks. Cary Millsap has published inspiring work on the subject.
For our purposes it suffices to say that wait events are instrumentation points inside the Oracle kernel where selected actions are timed. In particular, Oracle times the start and end calls of most physical I/O operations, therefore providing a good way to understand how much time is spent on physical I/O (some limitations of this mechanism are discussed later in this post). There are also limitations to using the wait event interface to study I/O latency: as you will see in the example section, timing asynchronous I/O events poses some interesting challenges.
Wait event tracing is typically activated in Oracle using the 10046 event or by calling DBMS_MONITOR.SESSION_TRACE_ENABLE. In this article we will get the wait event details by attaching systemtap probes to the relevant Oracle kernel functions, in particular kskthbwt and kskthewt, to gather information respectively at the start and at the end of each wait event. It is important to realize that enabling wait event tracing using event 10046 at level 8 (or other methods) is a way to externalize the wait event timing information for analysis (together with other SQL processing details), not a way to enable wait timing itself. Wait timing is controlled by the database parameter TIMED_STATISTICS, which is (and should be) set to TRUE in modern Oracle databases. One key piece of information that the Oracle kernel functions kskthbwt and kskthewt give us is a pointer into the X$KSUSE fixed table (or rather into the underlying segmented array in the SGA), which is the table underlying V$SESSION; this opens the road to correlating the wait event with much useful information on the session executing the SQL. Resolving the pointer in register r13 to the base value of X$KSUSE and computing the offsets of the fields of interest requires additional work, described in a previous article (see also the script ksuse_find_offsets.sql).

Function name, purpose and selected parameters:

  • kskthbwt: Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event.
    Useful information when probing the function:
      register r13 -> points into X$KSUSE (V$SESSION) SGA segmented array
      register rsi -> timestamp of the beginning of the wait (in microseconds)
      register rdx -> wait event number

  • kskthewt: Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call, marking the end of a wait event. The suffix "ewt" most likely stands for "end wait".
    Useful information when probing the function:
      register r13 -> points into X$KSUSE (V$SESSION) SGA segmented array
      register rdi -> timestamp of the beginning of the wait (in microseconds)
      register rsi -> wait event number
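
A minimal sketch that simply externalizes these registers at the two probe points could look like the following (an illustration only; the full script also resolves X$KSUSE fields, and event numbers can be translated into names by piping the output through sed -f eventsname.sed, as described in the next section):

# Sketch: print the documented registers at wait event begin and end
probe process("oracle").function("kskthbwt") {
   printf("-> begin wait: event#=%d, begin timestamp=%d us, X$KSUSE ptr=0x%x\n",
          register("rdx"), register("rsi"), register("r13"))
}

probe process("oracle").function("kskthewt") {
   printf("<- end wait:   event#=%d, begin timestamp=%d us\n",
          register("rsi"), register("rdi"))
}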


Running the script:
Below you find a few examples of traces for specific real-life Oracle I/O access patterns. The traces have been obtained by running stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x <pid>, where <pid> is the process id of the session being traced. The output is piped into sed -f eventsname.sed, which takes care of translating the event numbers into the associated event names. The file eventsname.sed is generated by running the script eventsname.sql. It is important to re-generate the event names before testing, because the wait event numbers can change without warning, for example after migrations to newer versions but potentially also after applying a patch. The system used to collect the traces in the following examples is Oracle 12.1.0.2 running on OEL 7.0 with UEK3 under VirtualBox.
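
Putting this together, the invocation used for the examples looks like the following (with <pid> replaced by the server process id of the traced session):

stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x <pid> | sed -f eventsname.sed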


Example 1: single-block random reads

This example is about single-block (random) reads. This access path to data is very important for many OLTP workloads as it is used for table access via an index key. In Figure 1 here below you can see a snippet of a trace file of a query of the type: select <col_list> from table <tab> where <indexed_col>=<val>.

Figure 1: Oracle single-block random read and db file sequential read wait event

The key points are:
  • Oracle issues a logical read on one of the blocks needed to retrieve data for the SQL of this example
    • additional details are provided in the logical read call where we can see that Oracle wants to read block#=2505675 from tbs#=7 and belonging to obj#=32174 (which is the table segment in this example)
  • The block is not found in the cache, so a new block is allocated and a physical read is initiated
  • The database marks the start of a wait event for 'db file sequential read'
  • The physical read request for a block becomes a pread call to the O/S
    • The I/O request enters the kernel via a system call; any function encountered after the system call is a kernel function (such as ioblock.request), until the system call returns and execution gets back to userland, which in this case means the Oracle executable.
  • This read call is passed on to the block I/O interface to read 8192 bytes (1 Oracle block) from the block device (/dev/sdl in the example of Figure 1)
  • The block interface and then the OS call return the read data to the Oracle kernel
  • The Oracle wait interface marks the end of the wait and computes the elapsed time
    • the wait event is db file sequential read and its parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks read (1 block).
    • From the timestamp values in the trace you can see that the duration of the db file sequential read wait is mostly accounted for by the time it took to do the pread call, with a small overhead (note also that the overhead could have been more significant if we had high CPU utilization on the DB server, for example because of workload from other Oracle sessions)
  • We can conclude that in this example the measured wait time for db file sequential read corresponds to physical read latency.

Example 2: sequential I/O into the buffer cache with synchronous I/O

This example is about multi-block reads into the cache. This is an access path used, for example, when Oracle performs a full table scan. There are multiple ways in which Oracle can access the storage to perform full scans; in this example you will see the case where Oracle uses synchronous I/O (i.e. the pread call) to perform O/S reads, while Examples 3 and 4 will cover sequential reads with asynchronous I/O. In Figure 2 here below you can see a snippet from tracing the workload of a query of the type: select <col_list> from table <tab>, where a full scan of <tab> is performed by Oracle.

Figure 2: Oracle sequential I/O and pread O/S calls

The key points of the trace in Figure 2:
  • Oracle issues a consistent read for tbs#=7 block#=2621443 and finds it in the cache
  • Oracle issues a consistent read for the next block, does not find it in the cache, and so prepares to read a list of blocks from storage in one physical read. The list is 124 blocks long and is shortened in this example
  • The database marks the start of a wait event for "db file scattered read"
  • The physical read requests become a single pread call to the OS for 1015808 bytes (124 blocks)
  • This request moves down the chain to the block I/O interface, which splits it into 2 reads from the underlying block device, as on the system used for the tests the maximum I/O size is 524288 bytes.
    • As investigated by Frits Hoogland: the submitted I/O request lands in the I/O scheduler scatter/gather list, where it is broken down to the I/O size advertised by the block device.
  • The I/Os are completed at the block I/O interface
  • The pread call returns
  • The wait event interface times the end of the wait
    • The wait event is db file scattered read and its parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks.
    • From the timestamp values in the trace you can see that the duration of the db file scattered read wait is mostly accounted for by the time it took to do the pread call, with a small overhead 
  • We can conclude that in this example the wait time for db file scattered read corresponds to I/O latency.
  • From the calls to kcbzvb we get additional confirmation of the blocks that have been read

Example 3: sequential I/O into the buffer cache with asynchronous I/O O/S calls

In Figure 3 here below you can see a snippet from the trace of the SQL: select <col_list> from table <tab>, where Oracle performs a full scan of table <tab>. This is the same query that was used for Example 2 above; the difference is that the trace in Figure 3 has been taken at a later time in the execution of the query, when the Oracle process has switched from synchronous I/O with pread calls to the asynchronous I/O interface. Asynchronous I/O is implemented in Linux with io_submit and io_getevents calls, to send the I/O requests and then reap the results. Although Oracle uses the asynchronous interface to perform the read operations in this example, the final result is very similar to what is described in Example 2 (i.e. with synchronous I/O): the Oracle wait event is the same, "db file scattered read", the reads are also performed into the buffer cache, and the I/Os are reaped with blocking calls, therefore making the process effectively synchronous. A more aggressive use of the asynchronous I/O interface is found with Oracle direct reads, which are discussed in Example 4 below.

Figure 3: Oracle performing sequential I/O reads into the buffer cache using asynchronous I/O calls

The key points are:
  • Oracle issues a consistent read for tbs#=7 block#=2522760, it does not find it in the cache so prepares to read a list of blocks into the buffer cache. The list is 128-blocks long and shortened in this example
  • The database marks the start of a wait event for "db file scattered read"
  • The physical read requests are passed on to the OS as a call to io_submit with 2 requests for a total of 1 MB (that is, 128 blocks): the first request is for a read (opcode 0) of 65536 bytes, the second for a read of 983040 bytes (that is, 1 MB - 64 KB)
  • These requests are passed to the block I/O interface, which splits the largest request into 2 reads from the underlying block device (/dev/sdi in this example). The read of 64 KB is passed on unchanged in size to /dev/sdr
  • Oracle calls io_getevents to reap the results of the asynchronous I/O requests that have just been submitted, and waits. The timeout.tv_sec field is set to 600, which indicates a blocking wait: io_getevents waits up to 600 seconds (tv_sec) and 0 nanoseconds (tv_nsec) for min_nr=2 I/Os to finish.
  • The block device interface returns the data upstream.
  • The io_getevents call returns the 2 I/O reads.
  • The wait event interface marks the end of the wait. The wait event in this case is db file scattered read. 
    • The wait event parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks.
    • The wait event duration is basically the time it took to do the io_submit and io_getevents calls, plus a small overhead.
    • Note that although Oracle has used the asynchronous interface, it has also waited for all the I/Os to complete therefore making the I/O essentially synchronous
  • The wait time for db file scattered read in this example is not a precise measure of the I/O latency, as two reads on two different devices were submitted in parallel and the total time was reported as if it were a single read.
  • From the calls to kcbzvb we get additional confirmation of the blocks that have been read.

Example 4: full scans with direct reads and asynchronous I/O

In this paragraph you will see two examples of Oracle I/O for the case of direct reads against ASM. The same query that was used in Examples 2 and 3 above is used here too: a query that performs a full table scan. In this example, differently from Examples 2 and 3, Oracle uses serial direct reads to access data. Serial direct read is a feature introduced in 11gR2 to optimize full segment scan operations by bypassing the buffer cache layer, similarly to what happens with parallel queries. Oracle chooses when to use serial direct reads vs. cached reads based on a few parameters. To a first approximation, direct reads are preferred when the table is "big" compared to the size of the buffer cache; the actual mechanism is more complex and can potentially change between Oracle versions. You can read more on this topic in this blog post by Tanel Poder.
For the particular setup I used to run these tests I had to force the use of serial direct reads before running the query; this was done with: alter session set "_serial_direct_read"=always;
Note that alter session set "_serial_direct_read"=never; can be used to disable serial direct reads and force Oracle to use cached reads (the type of I/O you have seen in Examples 2 and 3 above). The default value of the underscore parameter "_serial_direct_read" is "auto".
For the investigations in this example I use a slightly different systemtap script from the one used in the previous examples. I have removed the tracing of logical I/O and block I/O to reduce clutter, and added a new probe on the userspace function io_getevents_0_4 of libaio (libaio is the Linux library that implements asynchronous I/O). This is because serial direct reads use non-blocking calls to reap I/Os, and these calls are best traced at the libaio interface, for the reasons described by Frits Hoogland in this blog post. The systemtap script used for Figures 4a and 4b is trace_oracle_wait_events_asyncio_libaio_12102.stp (a version for 11.2.0.4 is available as trace_oracle_wait_events_asyncio_libaio_11204.stp).
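
As a minimal illustration, a probe on the libaio function mentioned above could look like the following sketch (the library path is an assumption and may differ on your system):

# Sketch: flag non-blocking reap attempts made through libaio
probe process("/usr/lib64/libaio.so.1").function("io_getevents_0_4") {
   printf("io_getevents_0_4 (libaio) called, pid=%d\n", pid())
}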

Figure 4a: Trace of the wait event and I/O calls of Oracle during a full table scan with serial direct read and asynchronous I/O, this is what happens in the first phases of the full scan

The key points of Figure 4a are:
  • Oracle requests I/O in chunks of contiguous storage space of 1MB. This is to reduce the overhead of seek time and potentially make use of readahead when available. 
    • The value of 1MB for the read request comes from the parameter db_file_multiblock_read_count=128 multiplied by the block size of 8KB. 
  • Oracle submits one I/O of 1 MB with a call to io_submit and then calls io_getevents_0_4 in the libaio library in non-blocking mode to try to reap the I/O. This happens twice, with no results. Oracle then submits another I/O of 1 MB and tries 4 times to reap I/Os in non-blocking mode, again with no results (the I/O operations have not yet finished).
  • The wait event timing starts just before the calls to io_getevents, which are used to reap the I/O results this time in blocking mode (with a timeout of 600 seconds)
    • The wait event in this case is: "direct path read"
    • Oracle reaps the 2 I/O in two subsequent calls to io_getevents
  • After reaping the I/Os Oracle marks the end of wait for the event. 
  • The result is that 2 reads of 1 MB each have been performed 
  • The wait event interface does not report the correct amount of I/O performed:
    • the "direct path read" wait in this example reports one read of 128 blocks (1 MB), while 2 MB were read
  • The elapsed time for the wait event "direct path read" does not measure the I/O latency
    • Oracle is only timing the blocking calls to io_getevents, which start after the I/Os are submitted. 
    • Oracle is timing 2 reads together.
The use of serial direct reads in Oracle is more complex than the simple example of Figure 4a. Frits Hoogland has discussed the internals of direct reads in Oracle in more detail in this presentation. In particular, Oracle uses an adaptive mechanism for direct reads with the goal of driving the storage system to high throughput when possible. This is implemented by increasing the number of outstanding I/Os when calling the asynchronous I/O interface. Internally this makes use of I/O slots. Note: from a comment on Oracle diagnostic event 10353 we have additional indirect information on this topic: "Slots are a unit of I/O and this factor controls the number of outstanding I/Os".
Another finding of Frits' is that there are I/O operations performed for direct reads that are not timed by the wait event interface. This happens because Oracle can issue io_getevents with timeout=0, that is, reaping the I/Os without having to block. You can see these mechanisms in action in the trace of Figure 4b, which is for the same full scan as reported above in Figure 4a, just at a later time, when the mechanism of adaptive serial direct reads has tuned Oracle's I/O requests to be "more aggressive" in the use of storage resources.

Figure 4b: Trace of the wait event and I/O calls of Oracle during a full table scan with serial direct read and asynchronous I/O, this is an example of what happens when the adaptive mechanism increases the number of outstanding I/Os for optimizing the throughput

The key points of Figure 4b are:
  • This is the same full scan as shown in Figure 4a, the difference is that the trace in Figure 4b has been taken at a later time of the full scan operation (the table has 1 million blocks) when adaptive serial direct reads has increased the number of I/O slots (outstanding asynchronous I/O operations).
  • You can see that several io_submit operations are performed before the wait event timing is started. 
  • Oracle is trying to keep several outstanding I/Os in the queue: up to 9 outstanding asynchronous I/O operations can be seen in Figure 4b, while only 2 were present in the simpler case of Figure 4a, that is at the beginning of the full scan.
  • Several attempts to reap I/Os with io_getevents operations are run by the Oracle process before the wait event timing is started
    • 2 reads of 1MB each are reaped by one io_getevents call before the wait event timing has started, therefore these reads will not be accounted for in the wait event interface.
    • It is important to note that the timeout for this particular io_getevents call (as well as for the io_getevents_0_4 calls in libaio which have not returned any I/O) is set to 0 by the Oracle process, which means that Oracle does not wait for I/O results to be available, it will only reap them if they are available.
  • Between the moment at which Oracle starts timing the wait event "direct path read" and the end of the wait event, 3 io_getevents operations have been performed (with the timeout set to 600, therefore blocking until I/O reads are available) and a total of 9 reads of 1 MB each are performed
    • The wait event interface does not report all the I/O that has been performed:
      • In the wait event parameters we can only find trace that 1 MB has been read (128 blocks), while in reality 9 MB have been read.
      • This is in addition to the 2 MB read outside of the wait event instrumentation.
    • The elapsed time for "direct path read" does not accurately reflect I/O latency but rather the wait time for reaping I/Os, as seen in Figure 4b.

Example 5: asynchronous I/O for random reads

Oracle will optimize single-block reads when it can do prefetching and/or batching of I/O. This is an optimization for random I/O using the asynchronous I/O interface, as Oracle can group multiple I/O requests in a single io_submit call, as opposed to sending them one after the other as in the case of Example 1, where the OS call pread was used. The example from Figure 5 comes from running Kevin Closson's SLOB on a 12c database. In the execution plan please note step 3 where Oracle uses the operation "TABLE ACCESS BY INDEX ROWID BATCHED" to access table data.

Figure 5: execution plan for a SLOB query illustrating batch table access

Here below in Figure 6 you can see a snippet of the Oracle I/O tracing when running the SLOB workload. You can see that the asynchronous I/O interface is being used to perform random I/O. In the case of Figure 6, 126 requests are submitted in a single io_submit operation and subsequently reaped by io_getevents, waiting for all the requests to finish. The wait event in this case is "db file parallel read" and the calculated wait time is the time for the batch of 126 requests to finish. Therefore the wait time of the db file parallel read event does not provide a measurement of the I/O latency of single-block random reads.

Figure 6: Oracle performing I/O for random reads with asynchronous I/Os, for the case of batched reads. The trace has been edited replacing a long list of similar trace lines (typically 126 lines only varying for the block#) with "..."

Example 6: tracing DML

In Figure 7 you can see a snippet of tracing Oracle logical I/O and wait events when inserting a row into a table, followed by a commit. No physical I/O was performed in this operation (the table had been previously cached). You can see that most of the block reads have been done in current read mode, as expected for DML. Moreover, the blocks involved belong both to the table being inserted into (you can identify them in Figure 7 as belonging to tbs#=11) and to rollback segments (belonging to tbs#=2). Note that for simplicity the table in this example has no indexes.

Figure 7: Logical I/O trace of Oracle performing an insert and commit operation

Example 7: tracing log writer
In Figure 8 you can see the trace of the log writer during a commit operation. This is the same commit operation traced in Example 6 (see also Figure 7 above). The wait event is "log file parallel write" and its duration is timed by Oracle from the beginning to the end of the asynchronous I/O operations performed. Therefore the measured wait time of "log file parallel write" is not an accurate measure of block write latency, as its duration depends on the number of operations performed. In the case of Figure 8 you can see that the log writer writes to two distinct block devices, which is consistent with the use of 1 logfile member per redo log group allocated on an ASM diskgroup with normal redundancy: the log writer sends 2 write I/O requests simultaneously, in order to write the primary and mirror extents of the logfile, and waits for both of them to finish.

Figure 8: log writer tracing during a commit operation

Conclusions
Dynamic tracing, and in particular systemtap, provides the opportunity to investigate Oracle physical and logical I/O and opens the way to drill-down investigations of Oracle I/O across the various layers of the Oracle engine, down to the Linux kernel. In this post you have seen the methods, tools and examples of how these investigations can be performed.
The examples from real-world Oracle workloads highlight some of the most common cases of Oracle logical and physical I/O and the connections between the wait event interface and the O/S calls issued by Oracle processes. The Oracle wait event interface can provide useful data for measuring I/O latency and, in general, for troubleshooting I/O problems. This is true for random reads instrumented by db file sequential read. The information provided by wait events related to asynchronous I/O, however, requires additional attention: such events typically do not provide an accurate measure of the I/O latency. Moreover, we have illustrated examples where some of the asynchronous I/Os for Oracle direct reads are not instrumented at all by the wait event interface.

I would like to thank Tanel Poder and Frits Hoogland for original ideas that have inspired this work. Special thanks go to Frits Hoogland for providing comments and additional ideas from his deep expertise on the subject.

Monday, September 22, 2014

SystemTap into Oracle for Fun and Profit

Topic: This post is about using SystemTap for investigating and troubleshooting Oracle RDBMS. In particular you will learn how to probe Oracle processes and their userspace functions. These techniques aim to be useful, as well as fun to learn, for those keen on peeking under the hood of the technology and improving their effectiveness in troubleshooting and performance investigations.

Introduction


Userspace probing is a very powerful technique that can be used to complement the available Oracle performance and profiling data, such as V$ views and 'event 10046' trace files, for advanced troubleshooting and investigations into the Oracle kernel. These types of investigations have been available for several years already, mostly on Solaris with the introduction of DTrace (see for example Tanel's blog).
Dynamic tracing tools that allow kernel and userspace probing are also gaining momentum in the Linux world thanks to recent additions to the kernel. Several tools are available to provide the interface for collecting, aggregating and printing dynamic tracing data. SystemTap is one of the more mature tools, but more tools are now emerging, including two ports of DTrace. For a review of this topic see Brendan Gregg's presentation at LinuxCon 2014.
I have first learned about the possibility of using SystemTap for userspace tracing from Frits Hoogland and have since been quite excited by the possibilities that this opens for advanced troubleshooting, and in particular on how this can help investigating Oracle workloads for performance tuning and advanced troubleshooting.
I was also quite happy to discover that the techniques described in this article also work under RHEL 6.5; this is definitely very good news for me, as it means that I can start using SystemTap on many of the existing systems without waiting for upgrades to the very latest kernels.

SystemTap documentation has additional details on the implementation of userspace probes: with Linux kernels higher than 3.5 (such as when using RHEL 7.0, OL 7.0 or OL 6.5 with UEK3) SystemTap will use the UPROBES module for userspace tracing; with older kernels SystemTap can use the UTRACE kernel extensions if available. SystemTap is able to provide userspace probing both with UTRACE and with UPROBES, while most other dynamic tracing tools seem to work only with UPROBES and therefore only with more recent versions of Linux. As reported by Frits in the above-mentioned article, you will need SystemTap 2.5 or higher for userspace tracing. As I write this, SystemTap 2.6 has just come out.

This is how to check if UTRACE extensions are configured:
# grep CONFIG_UTRACE /boot/config-`uname -r`
CONFIG_UTRACE=y

This is how to check if UPROBES are available:
# grep CONFIG_UPROB /boot/config-`uname -r`
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENT=y

In the following you will find a few examples, ideas and test code to get started with SystemTap userspace probing of Oracle. The scripts discussed in this blog can be found at this link.


Deep Dive: SystemTap reads Oracle wait event interface data and X$ tables


You will see in this paragraph how to hook SystemTap into the Oracle wait event interface. This provides a way to link together the two types of instrumentation, SystemTap probe-based data and Oracle wait event interface data, and therefore opens the way for more advanced investigations.
The main implementation detail is to create a SystemTap probe into the Oracle kernel at the 'right point' and make it read the information we want.
As discussed in previous work I have published about DTrace and Oracle, "kskthewt" is an Oracle kernel function that can be used for this purpose: it is called by the Oracle kernel at the end of each wait event, presumably to update the relevant performance counters. Our first step is to create a probe that starts executing once the Oracle executable steps into the function kskthewt. The next step is to make SystemTap read the information we want. What is available are CPU registers and memory areas. Some CPU registers are used to pass arguments to the function, as detailed in the x86-64 calling conventions. In other cases the CPU registers can contain useful information 'left over' from the processing of parent function(s). Reading directly from memory, and in particular from the SGA, opens the way to most of Oracle's instrumentation data.
Additional facts that I have found recently, in collaboration with Frits Hoogland, and that help in building a probe on the Oracle function kskthewt are:
  • When execution jumps into the function kskthewt, the CPU register R13 turns out to point into the X$KSUSE memory structure (with an offset). R12 seems to point to the same address most of the time, but I have seen some cases where it didn't, so I prefer to use R13.
  • X$KSUSE is a base fixed table for V$SESSION and therefore carries a lot of very interesting information. X$KSUSE is allocated in memory as a segmented array in the SGA, see also Julian Dyke's presentation on SGA internals for more details on this interesting topic.
  • V$SESSION.SADDR, the base address of a given session's V$SESSION data and of X$KSUSE, can be derived from the value of R13 by subtracting an offset. The value of the offset depends on the Oracle version and port and can be found 'experimentally' by using SystemTap or gdb (see more details in the script directory: trace_oracle_events_debug.stp).
  • The fields of interest of X$KSUSE (i.e. V$SESSION column values) can be found from the base value of X$KSUSE with an offset calculated with a query from X$KQFCO and X$KQFTA (see the script: ksuse_find_offsets.sql).
  • The register RDI (which is mapped as arg1 in the SystemTap probe) is found to contain a timestamp value. This is the timestamp of the wait event, the same value that can be seen in a trace file for event 10046, if activated.
  • The register RSI (that is, arg2 in SystemTap probes) is found to contain the wait event number.

If you put all of this together you can write a SystemTap script to collect and print wait event and selected V$SESSION details. An example of such a script is demonstrated in the figure here below. The scripts can be downloaded from here as a zip file; direct links: trace_oracle_events_11204.stp and trace_oracle_events_12102.stp, tested with Oracle 11.2.0.4 on RHEL 6.5 and with Oracle 12.1.0.2 on OEL 7 respectively.

Figure 1: Script and example output of a SystemTap probe that collects wait event and V$SESSION details from Oracle kernel functions and SGA memory.

Note: In the example above you can see that SystemTap output is piped to sed -f eventsname.sed. This is to transform event id numbers into strings. The file eventsname.sed is generated by running the script eventsname.sql.


Systemtap aggregations and filtering of wait event data


One of the strengths of SystemTap, as with the rest of the dynamic tracing tools, is the ability to do on-the-fly aggregations and filtering. This helps reduce the amount of data collected, therefore minimizing the footprint of the tracing operations, while still producing meaningful reports and real-time tracing. Building on the results discussed above on how to write SystemTap probes hooking into the wait event interface, you can now add aggregations and filtering to compute wait event histograms with microsecond precision.
One simple and powerful result that you can reproduce easily is a script that effectively reproduces/back-ports the 12c feature of microsecond wait event histograms. This can be very useful when studying events of short duration, such as random disk reads from solid state devices. In addition to aggregation, with custom filtering you can aggregate wait event details for a given user or a given file, or any other custom filter condition. Overall, SystemTap allows much freedom in programming our probes, which is quite empowering.
Two example scripts can be downloaded from here; direct links: histograms_oracle_events_11204.stp and histograms_oracle_events_12102.stp

Figure 2: Script and example output of a SystemTap probe that computes microsecond-precision wait event histograms.

An alternative way to build a SystemTap probe for gathering (and optionally aggregating and filtering) Oracle wait event latency data is described in the following. For this you also need an additional piece of research into Oracle kernel function calls: the function kews_update_wait_time is called just before kskthewt with the CPU register RSI set to the wait time in microseconds (register RSI is also available in SystemTap probes as arg2). See also the details reported in this previous work. The result is that you can build a probe computing event histograms that is independent of the Oracle version. An example script is histograms_oracle_events_version_independent.stp; a minimal sketch of the idea is shown below.
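
Here is a minimal sketch of this version-independent approach, assuming only what is described above, namely that kews_update_wait_time receives the wait time in microseconds in register rsi; the distributed script adds per-event breakdown and filtering:

# Sketch: global wait-time histogram, independent of the Oracle version
global latencytimes

probe process("oracle").function("kews_update_wait_time") {
   latencytimes <<< register("rsi")   # wait time in microseconds
}

probe timer.sec(10) {
   if (@count(latencytimes) > 0)
      println(@hist_log(latencytimes))
   delete latencytimes
}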

Figure 2: Script and example output of a SystemTap probe that computes microsecond precision wait event histograms and is independent of the Oracle version.


Tracing Oracle logical I/O with SystemTap userspace probes


You can use SystemTap userspace probes to trace Oracle logical I/O too. The main research underlying this probe comes from Tanel Poder (search for the DTrace-based script qer_trace.sh) and the extensive work of Alexander Anokhin.
By hooking on the Oracle kernel function kcbgtcr (Kernel Cache Buffer Get Consistent Read) we can get the details of the logical reads, or rather of the cache consistent reads, as explained by Alexander Anokhin. The first argument of the function call points to a memory structure with the details of the tablespace, relative file number, block number and object number of the consistent read. We can use all this to build a profiler probe, trace_oracle_logical_io_basic.stp:

probe process("oracle").function("kcbgtcr") {
    printf("tbs#=%d, rfile=%d, block#=%d, obj#=%d\n",user_int32(u64_arg(1)), user_int32(u64_arg(1)+4) >> 22 & 0x003FFFFF, user_int32(u64_arg(1)+4) & 0x003FFFFF, user_int32(u64_arg(1)+8))
}

From there you can also build probes that perform aggregations and compute statistics of the logical read operations; see for example trace_oracle_logical_io_count.stp.
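
As an illustration of the idea (a simplified sketch, not the trace_oracle_logical_io_count.stp script itself), the following probe counts consistent reads per object number and prints the top consumers every 10 seconds:

global lio_count

probe process("oracle").function("kcbgtcr") {
   # count consistent reads per object number (obj# at offset 8, as in the basic probe above)
   lio_count[user_int32(u64_arg(1)+8)]++
}

probe timer.sec(10) {
   printf("\nConsistent reads per obj# over the last 10 seconds (top 10):\n")
   foreach (obj in lio_count- limit 10)
      printf("obj#=%d, count=%d\n", obj, lio_count[obj])
   delete lio_count
}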

Figure 3: Script and example output of a SystemTap probe to capture and aggregate details of logical I/O (for consistent read) in Oracle.


SystemTap can integrate Oracle wait event data with OS tracing


SystemTap's main strength and first use case is tracing system calls. You can use this to go deeper with Oracle tracing by combining Oracle wait event interface data (collected with SystemTap as shown above) with data coming from SystemTap OS-level probes.
In the following you can see an example based on Oracle using ASM, where storage is made visible to the OS as block devices (asmlib is not used in this example). The I/O calls issued by Oracle processes in this case will typically be pread and pwrite for synchronous single-block I/O, and io_submit and io_getevents for multi-block and/or asynchronous I/O. At a lower level, all I/O to block devices can be traced as block I/O (ioblock) requests.
You can start exploring the relationship between Oracle wait events, I/O system calls and block I/O with this starter script: trace_oracle_iocalls.stp
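
As an illustration (a simplified sketch, not the full trace_oracle_iocalls.stp script), the following combines OS-level probes on the I/O system calls and on block I/O requests with the Oracle wait event probe discussed above, filtering on processes whose executable name is oracle. Note that the syscall and ioblock probes require the kernel debuginfo packages described in the lab setup section below, and that syscall.pread* and syscall.pwrite* are wildcards that also match pread64/pwrite64 on 64-bit systems:

probe syscall.pread*, syscall.pwrite*, syscall.io_submit, syscall.io_getevents {
   if (execname() == "oracle")
      printf("%d: OS call %s(%s)\n", gettimeofday_us(), name, argstr)
}

probe ioblock.request {
   if (execname() == "oracle")
      printf("%d: block I/O request: dev=%s, sector=%d, size=%d\n",
             gettimeofday_us(), devname, sector, size)
}

probe process("oracle").function("kskthewt") {
   # end of an Oracle wait event, RSI/arg2 carries the event number
   printf("%d: end of wait, event#=%d\n", gettimeofday_us(), u64_arg(2))
}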

Figure 4: SystemTap probes for OS I/O tracing and Oracle wait event tracing. This makes it possible to get an integrated view of an Oracle I/O wait event and the corresponding OS actions.

In Figure 5 below you can see an example of tracing Oracle I/O where single-block (synchronous) I/O is performed via a pread call to the OS. This is a typical access path for Oracle random I/O. On the OS side the pread call is cascaded down to the block device stack as a block I/O request. On the Oracle wait event interface the operation is accounted for as a 'db file sequential read' wait, with an indication of the file number and block number (the p1 and p2 parameters).

Figure 5: Oracle performing single-block I/O, profiled using SystemTap probes. This provides end-to-end tracing of the I/O operation: we can see the OS call details and the corresponding Oracle wait event data.

The Linux asynchronous I/O interface is often used by Oracle for multi-block I/O. This is mainly for performance reasons and can only be done when the underlying filesystem allows it (which is the case in the examples reported here, as we are using ASM). A full study of Oracle async I/O would take us too far from our main subject, so we will limit the discussion to a few main observations and an example of how you can use SystemTap probes to investigate this topic further.
With AIO the kernel is actually doing the I/O: the user process just asks for the I/O to be done and later checks back to reap the results. When investigating async I/O you therefore need to trace at least two calls: one for submitting the I/O requests (io_submit) and one for reaping the results (io_getevents). I/O requests are then passed on to the block I/O layer, where they can be split into multiple operations of smaller size, as is the case in the example of Figure 6. See also Martin Bach's post on the topic of "Increasing the maximum I/O size in Linux". More details on tracing Oracle I/O can be found in Frits Hoogland's excellent work on profiling Oracle with gdb and his study on multiblock I/O.
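
To observe the splitting at the block I/O layer in a compact way, a sketch along these lines (an illustration, not one of the downloadable scripts) aggregates the size of block I/O requests issued by oracle processes into a histogram, which makes the maximum request size used by the block layer easy to spot:

global iosize, n_ioreq

probe ioblock.request {
   # aggregate the size (in bytes) of block I/O requests issued by oracle processes
   if (execname() == "oracle") {
      iosize <<< size
      n_ioreq++
   }
}

probe timer.sec(10) {
   if (n_ioreq > 0) {
      printf("\nBlock I/O request sizes (bytes) from oracle processes, cumulative:\n")
      print(@hist_log(iosize))
   }
}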


Figure 6: SystemTap probe example showing the profiling of Oracle I/O for a multi-block read where Oracle has chosen to perform asynchronous I/O. The I/O requests at the OS level are submitted using io_submit and reaped with io_getevents. SystemTap also allows us to see what happens at the block device level, where the I/Os are split into smaller chunks (of maximum size 512KB on the system used for this test).


Lab setup and documentation


Below you can find some pointers on how to set up a test environment to experiment with the techniques and scripts discussed above:

- Download and install Oracle VirtualBox
- Download and install OEL 7.0 (or OEL 6.5/6.6), for example from https://edelivery.oracle.com/linux
  - in particular, also install the kernel-devel rpm

If you want to use OS probes with SystemTap, download and install the debuginfo and debuginfo-common packages for the kernel you are using. The rpms can be found at https://oss.oracle.com/ol7/debuginfo/ and https://oss.oracle.com/el6/debuginfo/ for OEL 7 and OEL 6 respectively. Similarly, install libaio-debuginfo and libaio-devel if you want to trace OS calls for asynchronous I/O.
Please note that if you only want to run probes on the Oracle userspace functions, you can skip the installation of the kernel and libaio debuginfo packages.

The scripts discussed here require SystemTap 2.5 or higher. Consider uninstalling SystemTap first if an older version is already installed.
Download the latest version of SystemTap from https://sourceware.org/systemtap/ftp/releases/ (version 2.6 as I write this).
To install SystemTap from the downloaded source run:

./configure
# the output will tell you if additional OS packages are needed, in case install them and repeat
make
make install

As a post-installation check, run:
stap --help

A good starting point for the SystemTap documentation is the beginners guide; see also the language reference and the tapset reference manual.

Download and install Oracle version 12.1.0.2 or 11.2.0.4.
The software can be downloaded from https://edelivery.oracle.com/
Pre-installed VMs with an Oracle database are also available for download from OTN (for example at this link).
Installation guides, besides the Oracle documentation, are also available at http://www.oracle-base.com/articles/12c/articles-12c.php and in the racattack wiki book.


Conclusions


Dynamic tracing provides userspace probes into the Oracle kernel functions and probes at the Linux OS kernel level, which can be used to complement Oracle's performance data, notably the wait event interface and the statistics counters. These techniques can considerably extend the amount of data available when investigating performance issues and doing advanced troubleshooting, helping the troubleshooter to apply systematic approaches to problem solving and, overall, to be more effective and consistent.

You have seen in this short article a few pointers and get-started examples (download from this link). There is a lot of potential in applying and extending these techniques for investigating and troubleshooting Oracle and other complex platforms. I hope to cover more examples in future posts. For the time being, I hope you enjoyed going through the examples, and I wish all the readers who have been patient enough to make it this far fun and success in trying these techniques on their systems!


Acknowledgements


I would like to thank three authors whose original ideas have inspired much of the material discussed here: Brendan Gregg, Tanel Poder and Frits Hoogland. Additional thanks go to Frits Hoogland for reviewing this post.