Memory ballooning and virtio_balloon driver in qemu-kvm

Lots of people have heard the word ‘ballooning’ and wondered what it means. In the computer world, this word is mostly associated with virtualization. I will try to explain ‘memory ballooning’ in this blog.

In simple terms, a virtual environment is composed of two things: 1) the HOST and 2) the GUEST.

Let me explain these two terms in layman's language. Virtualization is used to run another operating system inside your computer. The operating system running on your hardware is called the ‘HOST’, and the other operating system running inside the HOST is called the ‘GUEST’.

Where does the guest's memory come from? It is derived from the host's memory. That much is obvious, and I am not going into more detail here only because I don't want to invite more confusion.

Coming to the word ‘balloon’: we all know that a balloon supports two actions, inflate and deflate. Inflating means you are filling something into it; deflating does the opposite. Now, how do these terms fit into virtualization?

The guest system includes a balloon (a balloon driver, for example virtio_balloon) which can be inflated or deflated. Why do we need that? As mentioned above, guest memory is part of the host memory, and there will be other applications running on the host. Sometimes the host runs out of memory or needs more of it. At that point, the host can use the balloon device to request memory from the guest, which causes an inflate operation of the balloon inside the guest. Once the guest has filled the balloon with memory pages, those pages can be given to the host. The reverse operation, deflate, works the same way in the other direction: the balloon shrinks and the guest gets the memory back.

As you can see, host <-> guest co-operation is required for memory ballooning to work smoothly. During an inflate operation, think of the ‘unused’ memory in the guest: this is the memory that is given back to the host. However, an inflate operation can increase memory pressure in the guest, and you may see heavy swapping in your guest.

The operation chain can look like this:

  • Request (from the host) to reclaim memory =====> ‘inflate’ operation inside the guest
  • The guest then hands this memory back to the host/hypervisor
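The bookkeeping behind this chain can be sketched with a small toy model. This is only an illustration of the accounting, not how virtio_balloon is implemented; the class name `Guest` and the method `set_target` are hypothetical names invented for this sketch, and all sizes are in KiB.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    """Toy model of a guest with a balloon (hypothetical, for illustration)."""
    max_kib: int          # maximum memory configured for the guest
    balloon_kib: int = 0  # memory currently held inside the balloon

    @property
    def usable_kib(self) -> int:
        # What the guest can actually use is the maximum minus the balloon.
        return self.max_kib - self.balloon_kib

    def set_target(self, target_kib: int) -> int:
        """Host asks the guest to use target_kib.

        Returns the KiB handed to the host: positive means the balloon
        inflated, negative means it deflated and the guest took memory back.
        """
        # The target can never exceed the configured maximum.
        target_kib = max(0, min(target_kib, self.max_kib))
        new_balloon = self.max_kib - target_kib
        delta = new_balloon - self.balloon_kib
        self.balloon_kib = new_balloon
        return delta


if __name__ == "__main__":
    guest = Guest(max_kib=524288)
    print("given to host:", guest.set_target(262144))   # inflate
    print("guest usable :", guest.usable_kib)
    print("given to host:", guest.set_target(524288))   # deflate
    print("guest usable :", guest.usable_kib)
```

Note how the model clamps the target to the maximum: a balloon can only return memory the guest already had, it cannot grow the guest beyond its configured limit.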

In a qemu-kvm setup, virtio_balloon is the balloon driver that is loaded in the guest system for the balloon device.

Inflate and deflate operations (drawn with my bad art skills 🙂):

1) Inflate operation of the virtio balloon driver: [figure not included]

2) Deflate operation of the virtio balloon driver: [figure not included]

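Now, there can be questions such as: is this the memory hot-add feature? No, it is not. Ballooning only moves memory between the host and the guest within the maximum memory already configured for the guest.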

Which commands can be used to give memory to, or take memory from, guests? In a libvirt/qemu-kvm setup you can use the ‘virsh setmem’ command for this purpose.

Demo:

[root@humbles-lap misc]# virsh dumpxml 1
<memory>524288</memory> ========== This is the maximum memory set for this guest (in KiB)
<currentMemory>524288</currentMemory> ========== currently allocated memory (in KiB)

<memballoon model='virtio'> ========== virtio driver is used as balloon driver. If you want to disable ballooning use 'none'.


Now, check the memory inside the guest.

You should see something close to ‘524288’ (KiB). It is usually slightly lower, because the guest kernel reserves some memory for itself.
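One way to check the memory from inside a Linux guest is to read the MemTotal line of /proc/meminfo. The snippet below is a minimal sketch of that; the helper name `meminfo_kib` is made up for this example, and it assumes the usual "Name:   value kB" layout of /proc/meminfo.

```python
def meminfo_kib(text: str, field: str = "MemTotal") -> int:
    """Return the value (in kB) of one field from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            # The value is the first token after the colon, e.g. "500000 kB".
            return int(rest.split()[0])
    raise KeyError(field)


if __name__ == "__main__":
    # Linux only: run this inside the guest before and after changing
    # the balloon target, and compare the two values.
    with open("/proc/meminfo") as f:
        print(meminfo_kib(f.read()), "kB")
```

Running it before and after a balloon change should show MemTotal moving with the new target.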

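Once you have confirmed that, try setting the memory as shown in the example below. ‘virsh setmem’ takes the domain (name or ID) followed by the new size, which is in KiB by default: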

#virsh setmem 24161

Now check the memory inside your guest again.

This is how virtio_balloon and other balloon drivers work!

I hope you enjoyed this article. Please feel free to ask your questions here or leave a comment.

What are jiffies/jiffies_64

First of all, jiffies is the counter of ticks that have happened since system boot. In kernel space, jiffies is a global variable. The 64-bit version of jiffies is called jiffies_64. jiffies is an unsigned long (a 32-bit value on 32-bit architectures), whereas jiffies_64 is a 64-bit variant of the …


Kernel module to print process pid, stack, etc.

While browsing through my filesystem, I came across a kernel module that I wrote a while back to print information about the current process, such as its stack and pid. The module also prints all the processes running in the system. So I loaded it into my running system and it was obedient 🙂

I had named it ‘process_informer.ko’.

[root@humbles-lap hello]# make
make -C /lib/modules/2.6.30.10-105.2.23.fc11.x86_64/build M=/misc/my_kernel_modules/hello modules
make[1]: Entering directory `/usr/src/kernels/2.6.30.10-105.2.23.fc11.x86_64′
CC [M]  /misc/my_kernel_modules/hello/process_informer.o
Building modules, stage 2.
MODPOST 1 modules
CC      /misc/my_kernel_modules/hello/process_informer.mod.o
LD [M]  /misc/my_kernel_modules/hello/process_informer.ko
make[1]: Leaving directory `/usr/src/kernels/2.6.30.10-105.2.23.fc11.x86_64′
[root@humbles-lap hello]#

[root@humbles-lap hello]# lsmod |grep process
[root@humbles-lap hello]# insmod process_informer.ko
[root@humbles-lap hello]# lsmod |grep process
process_informer       12427  0
[root@humbles-lap hello]#

[root@humbles-lap hello]# modinfo process_informer.ko
filename:       process_informer.ko
license:        GPL
description:    Module, for fun
version:        1.0.0
author:         Humble Chirammal
srcversion:     C62BE5C41BEC6E733271884
depends:        
vermagic:       2.6.30.10-105.2.23.fc11.x86_64 SMP mod_unload
[root@humbles-lap hello]#

[root@humbles-lap hello]# rmmod process_informer.ko
[root@humbles-lap hello]# lsmod |grep process

Jun  1 00:01:17 humbles-lap kernel: [ 6225.683529]  Hi World, somebody loaded me  ==> __init()
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683533]
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683534]  I am informed to track current process
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683538]
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683539]       Current process is : ‘insmod’
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683540]        pid : ‘8875’
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683541]       Kernel release : ‘2.6.30.10-105.2.23.fc11.x86_64’
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683544]      ‘RUNNABLE’ OR ‘UNRUNNABLE’
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683547]       PRIO : 120  STATE: 0
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683550]
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683551]       My thread_info located at : ffff88009d0dc000
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683552]
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683555]      Process 0 is  : swapper/0     pid :: 0
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683558]      Process 1 is  : systemd     pid :: 1
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683561]      Process 2 is  : kthreadd     pid :: 2
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683565]      Process 3 is  : ksoftirqd/0     pid :: 3
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683568]      Process 4 is  : migration/0     pid :: 6
Jun  1 00:01:17 humbles-lap kernel: [ 6225.683571]      Process 5 is  : watchdog/0     pid :: 7

************Truncated..

Jun  1 00:01:17 humbles-lap kernel: [ 6225.684065]      Process 157 is  : su     pid :: 7296
Jun  1 00:01:17 humbles-lap kernel: [ 6225.684068]      Process 158 is  : bash     pid :: 7306
Jun  1 00:01:17 humbles-lap kernel: [ 6225.684071]      Process 159 is  : bash     pid :: 7398
Jun  1 00:01:17 humbles-lap kernel: [ 6225.684074]      Process 160 is  : su     pid :: 7419

Jun  1 00:02:08 humbles-lap kernel: [ 6225.684117]      Process 174 is  : sleep     pid :: 8856

Jun  1 00:02:08 humbles-lap kernel: [ 6276.449168]   Bye Guys, he is kind enough to unload me      ==> __exit()

Hmmmm.. there are ‘174’ processes in my system …

Process states (TASK_RUNNING, TASK_INTERRUPTIBLE, etc.) in the linux kernel

One of the fields of ‘task_struct’ is ‘state’, which records the state of the process.

There are actually two state fields: ‘state’ and ‘exit_state’.

task->state is about ‘runnability’.

task->exit_state is about ‘task exiting’.

At any time one of the basic ‘state’ values is set, and these values are mutually exclusive.

task->state flags are:

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED          4
#define __TASK_TRACED           8
#define TASK_DEAD               64
#define TASK_WAKEKILL           128
#define TASK_WAKING             256

task->exit_state flags are:

#define EXIT_ZOMBIE             16
#define EXIT_DEAD               32
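To make these numbers easier to read, here is a small sketch that turns a numeric task->state / task->exit_state pair into flag names. The values are the ones listed above; the function name `describe` is made up for this example. It decodes bit by bit because TASK_WAKEKILL acts as a modifier that the kernel combines with another state (TASK_KILLABLE is TASK_WAKEKILL | TASK_UNINTERRUPTIBLE, i.e. 130).

```python
# Flag values copied from the #define lists above.
TASK_BITS = {
    1: "TASK_INTERRUPTIBLE",
    2: "TASK_UNINTERRUPTIBLE",
    4: "__TASK_STOPPED",
    8: "__TASK_TRACED",
    64: "TASK_DEAD",
    128: "TASK_WAKEKILL",
    256: "TASK_WAKING",
}
EXIT_BITS = {16: "EXIT_ZOMBIE", 32: "EXIT_DEAD"}


def describe(state: int, exit_state: int = 0) -> str:
    """Return a readable name for a task->state / task->exit_state pair."""
    if exit_state:
        # An exiting task is described by exit_state, not by state.
        return EXIT_BITS.get(exit_state, f"unknown exit_state {exit_state}")
    if state == 0:
        return "TASK_RUNNING"
    names = [name for bit, name in TASK_BITS.items() if state & bit]
    return " | ".join(names) if names else f"unknown state {state}"


if __name__ == "__main__":
    for value in (0, 1, 2, 4, 130):
        print(value, "->", describe(value))
    print("exit_state 16 ->", describe(0, 16))
```

For instance, a log line showing "STATE: 0" corresponds to TASK_RUNNING.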

About the task->state runnability flags:

*) TASK_RUNNING

A process in the ‘TASK_RUNNING’ state is runnable: it is either currently running or on a run queue waiting to run. This is the only possible state for a process executing in userspace; it can also apply to a process that is actively running in kernel space. In other words, the process is either executing on a CPU or waiting for a CPU to execute it.

*) TASK_INTERRUPTIBLE

The process is suspended (sleeping) until some condition becomes true. Events such as a hardware interrupt, the delivery of a signal, or the release of a system resource can wake up the process and change its state back to TASK_RUNNING. Processes that are idle (i.e. not performing any task) are normally in this state.

*) TASK_UNINTERRUPTIBLE

Like TASK_INTERRUPTIBLE, except that the delivery of a signal does not wake the process while it is in the TASK_UNINTERRUPTIBLE state. An example would be a process performing an atomic write operation.

TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE both indicate that the task is in a wait state. A task in the TASK_INTERRUPTIBLE state can be woken up by an interrupt or a signal, and then returns to TASK_RUNNING. A task in the TASK_UNINTERRUPTIBLE state must be explicitly woken up by the event it is waiting for, e.g. a task waiting for data to be transferred from a block device to a buffer.

TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE processes sit in wait queues; that is, the task_structs in either of these states can be linked into a wait queue.

struct __wait_queue_head {
        spinlock_t lock;
        struct list_head task_list;
};

typedef struct __wait_queue_head wait_queue_head_t;

These are the processes waiting for an event to finish, a system resource to be released, or a fixed interval of time to elapse.

*) TASK_STOPPED

Process execution has been stopped: the process received one of the ‘SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU’ signals.

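Signal     Value      Action   Comment
SIGSTOP    17,19,23   Stop     Stop process
SIGTSTP    18,20,24   Stop     Stop typed at tty
SIGTTIN    21,21,26   Stop     tty input for background process
SIGTTOU    22,22,27   Stop     tty output for background process

(The three values are the signal numbers on alpha/sparc, on x86/arm and most other architectures, and on mips, respectively.)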
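The stopped state is easy to observe from userspace. The sketch below (Linux only, since it reads /proc) starts a sleeping child, sends it SIGSTOP, and reads the one-letter state from /proc/<pid>/stat, where a stopped task shows up as ‘T’. The helper names `proc_state` and `stop_child_and_read_state` are made up for this example.

```python
import os
import signal
import subprocess
import sys


def proc_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so take the first field after the LAST closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]


def stop_child_and_read_state() -> str:
    """Stop a sleeping child with SIGSTOP and report the state it ends up in."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid return once the child has stopped,
        # without waiting for it to exit.
        os.waitpid(child.pid, os.WUNTRACED)
        return proc_state(child.pid)
    finally:
        child.kill()   # SIGKILL is delivered even to a stopped process
        child.wait()


if __name__ == "__main__":
    print("SIGSTOP is signal number", int(signal.SIGSTOP))
    print("state after SIGSTOP:", stop_child_and_read_state())
```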

*) TASK_TRACED

The process is being traced by a debugger, and its execution has been stopped by that debugger (e.g. gdb). While a process is being debugged, any signal may put it into this state.

There are two possible states in task->exit_state.

1) EXIT_ZOMBIE:

Zombie processes are also called ‘defunct’ processes. These are processes whose execution has completed, but whose parent is not yet aware of it or has not reaped them, so they still hold an entry in the process table. A zombie does not waste any other system resource; the only thing it occupies is that process table entry. It becomes an issue only when the total number of processes in the system reaches the maximum allowed number of processes; then we may have to worry about them. The main cause of zombies is a parent that never calls wait() for its child. In other words, crappy application code can generate zombie processes.

On the other hand, if the parent dies first, init (process 1) inherits the child and becomes its parent.
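A zombie can be produced on purpose to see this in action. The sketch below (Linux only) forks a child that exits immediately, waits for the exit with WNOWAIT so that the child is left unreaped, reads its state from /proc/<pid>/stat (where a zombie shows up as ‘Z’), and only then reaps it. The helper names are made up for this example.

```python
import os


def proc_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Take the first field after the last ')' that closes the command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie_and_reap():
    """Return (state while unreaped, whether /proc/<pid> vanished after wait)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so it stays in the process table as a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state = proc_state(pid)
    # Now the parent really waits: the process table entry is released.
    os.waitpid(pid, 0)
    gone = not os.path.exists(f"/proc/{pid}")
    return state, gone


if __name__ == "__main__":
    state, gone = make_zombie_and_reap()
    print("state before wait():", state)
    print("entry removed after wait():", gone)
```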

2) EXIT_DEAD:

This state comes into the picture during the transition out of EXIT_ZOMBIE: the process is being removed by the system because the parent called wait().

I hope it helps.

Retrieving current process/task_struct in linux kernel

As you know, processes are represented in the kernel by a structure called ‘task_struct’. Every process has its own kernel stack. There is also an important structure called ‘thread_info’, which is ‘52’ bytes in size. The ‘task’ field of the thread_info structure links thread_info to the process descriptor. The stack pointer (‘rsp’ or ‘esp’) can be used by the kernel to efficiently retrieve information about the process active on the CPU.

struct thread_info {
        struct task_struct *task; /* main task structure */

*****

‘thread_info’ and the ‘stack’ share a union, as shown below.

union thread_union {
        struct thread_info thread_info;
        unsigned long stack[THREAD_SIZE/sizeof(long)];
};


Here THREAD_SIZE expands (after macro substitution) to ((_AC(1,UL) << 12) << 1), which is 8192 (an 8k stack).

To get the ‘current’ process information, the kernel makes effective use of the assembly sequence below.

movl $0xffffe000,%ecx /* or 0xfffff000 for 4KB stacks */
andl %esp,%ecx
movl (%ecx),p

Line 2 performs the main operation here. The lowest 13 or 12 bits of the stack pointer ‘esp’ are masked off (for 8k and 4k stacks respectively) to get the address of the thread_info structure. Offset 0 of the thread_info structure holds a pointer to the ‘task_struct’, so the kernel can easily retrieve the process currently running on the CPU (line 3).

The “current” macro in kernel code refers to this task_struct.

#define get_current() (current_thread_info()->task)
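The masking step is plain integer arithmetic, so it can be tried outside the kernel. The sketch below only reproduces the address calculation; the stack pointer value is an arbitrary example and `thread_info_base` is a name made up for this illustration.

```python
def thread_info_base(sp: int, thread_size: int = 8192) -> int:
    """Round a stack pointer down to the start of its kernel stack area.

    thread_info lives at the lowest address of the thread_union, so clearing
    the low bits of the stack pointer yields its address.
    """
    if thread_size <= 0 or thread_size & (thread_size - 1):
        raise ValueError("thread_size must be a power of two")
    return sp & ~(thread_size - 1)


if __name__ == "__main__":
    esp = 0xC12F5E40  # arbitrary example value for a 32-bit stack pointer
    print(hex(0xFFFFFFFF & ~(8192 - 1)))        # mask used for 8k stacks
    print(hex(thread_info_base(esp, 8192)))     # low 13 bits cleared
    print(hex(thread_info_base(esp, 4096)))     # low 12 bits cleared
```

The same rounding explains why a thread_info address always has its low 13 bits zero on an 8k-stack kernel.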

Easy way to know which command/process is sleeping, and in which kernel function, using WCHAN

I quite often use the command below to find this out.

# ps -eopid,tt,user,fname,tmout,f,wchan

The important bits here are the “WCHAN” and “COMMAND” fields. “WCHAN” points to the kernel function in which the process is sleeping.

WCHAN : address of the kernel function where the process is sleeping (use wchan if you want the kernel function name). Running tasks will display a dash (‘-’) in this column.

[root@humbles-lap ~]$ ps -eopid,tt,user,fname,tmout,f,wchan
  PID TT       USER     COMMAND  TMOUT F WCHAN
1 ?        root     systemd      – 4 epoll_wait
2 ?        root     kthreadd     – 1 kthreadd
3 ?        root     ksoftirq     – 1 run_ksoftirqd
6 ?        root     migratio     – 1 cpu_stopper_thread
7 ?        root     watchdog     – 5 watchdog
21 ?        root     cpuset       – 1 rescuer_thread
22 ?        root     khelper      – 1 rescuer_thread
23 ?        root     kdevtmpf     – 5 devtmpfsd
24 ?        root     netns        – 1 rescuer_thread
25 ?        root     sync_sup     – 1 bdi_sync_supers
26 ?        root     bdi-defa     – 1 bdi_forker_thread
27 ?        root     kintegri     – 1 rescuer_thread
28 ?        root     kblockd      – 1 rescuer_thread
29 ?        root     ata_sff      – 1 rescuer_thread
30 ?        root     khubd        – 1 hub_thread
31 ?        root     md           – 1 rescuer_thread
55 ?        root     kswapd0      – 1 kswapd
56 ?        root     ksmd         – 1 ksm_scan_thread
57 ?        root     khugepag     – 1 khugepaged_alloc_sleep
58 ?        root     fsnotify     – 1 fsnotify_mark_destroy
59 ?        root     crypto       – 1 rescuer_thread
65 ?        root     kthrotld     – 1 rescuer_thread
66 ?        root     scsi_eh_     – 1 scsi_error_handler
67 ?        root     scsi_eh_     – 1 scsi_error_handler
68 ?        root     scsi_eh_     – 1 scsi_error_handler
69 ?        root     scsi_eh_     – 1 scsi_error_handler
70 ?        root     scsi_eh_     – 1 scsi_error_handler
71 ?        root     scsi_eh_     – 1 scsi_error_handler
79 ?        root     kpsmouse     – 1 rescuer_thread
130 ?        root     ttm_swap     – 1 rescuer_thread
311 ?        root     kdmflush     – 1 rescuer_thread
320 ?        root     jbd2/dm-     – 1 kjournald2
321 ?        root     ext4-dio     – 1 rescuer_thread
355 ?        root     kauditd      – 1 kauditd_thread
357 ?        root     udevd        – 4 poll_schedule_timeout
378 ?        root     systemd-     – 4 epoll_wait
391 ?        root     kvm-irqf     – 1 rescuer_thread
640 ?        root     ips-adju     – 1 ips_adjust
641 ?        root     ips-moni     – 1 ips_monitor
648 ?        root     cfg80211     – 1 rescuer_thread
661 ?        root     hci0         – 1 rescuer_thread
671 ?        root     iwlwifi      – 1 rescuer_thread
688 ?        root     hd-audio     – 1 rescuer_thread
736 ?        root     hd-audio     – 1 rescuer_thread
772 ?        root     flush-25     – 1 bdi_writeback_thread
775 ?        root     kdmflush     – 1 rescuer_thread
802 ?        root     kjournal     – 1 kjournald
811 ?        root     jbd2/sda     – 1 kjournald2
812 ?        root     ext4-dio     – 1 rescuer_thread
829 ?        root     bluetoot     – 4 poll_schedule_timeout
841 ?        avahi    avahi-da     – 4 poll_schedule_timeout
859 ?        root     abrtd        – 0 poll_schedule_timeout
862 ?        avahi    avahi-da     – 1 unix_stream_recvmsg
890 ?        root     gpm          – 5 hrtimer_nanosleep
893 ?        root     NetworkM     – 4 poll_schedule_timeout
895 ?        root     mcelog       – 4 poll_schedule_timeout

*********** Truncated
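On Linux the same information is usually exposed per process in /proc/<pid>/wchan, which can be read directly. The sketch below prints a pid/command/wchan list in the spirit of the ps output above. It assumes that file is available and that it contains "0" for a task that is not sleeping (shown here as a dash, like ps does); the helper names are made up for this example, and some kernels hide the value from unprivileged users.

```python
import os


def format_wchan(raw: str) -> str:
    """Map the raw content of /proc/<pid>/wchan to ps-style output."""
    raw = raw.strip()
    # Assumption: "0" (or nothing) means the task is not sleeping.
    return "-" if raw in ("", "0") else raw


def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(f"{'PID':>7} {'COMMAND':<16} WCHAN")
    for entry in sorted(filter(str.isdigit, os.listdir("/proc")), key=int):
        try:
            comm = read_file(f"/proc/{entry}/comm").strip()
            wchan = format_wchan(read_file(f"/proc/{entry}/wchan"))
        except OSError:
            # The process exited in the meantime, or access was denied.
            continue
        print(f"{entry:>7} {comm:<16} {wchan}")
```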

Hope this helps!!