Python Sorted Containers

In this article, I would like to talk about Python Sorted Containers. In python language, the build-in data collection is implemented differently from another language (e.g., C/C++). That makes python losing some ability or performance to access specific data. For example, in C++ language, std::map and std::set are implemented by a tree structure, that causes the data collections to have an ability to search upper or lower bound (upper_bound, lower_bound) of the selected value. In python, such abilities don’t exist in the built-in data collection. However, Python has a library to support that ability, which can install additionally. Sorted containers is an open-sourced library that allows us to insert or delete elements very efficiently and the elements maintaining a sorted order. The library contains three containers: SortedList, SortedDict, and SortedSet.

Installation: we can install sorted collection by pip command:

$sudo pip install sortedcontainers 

Python Sorted Container contains three sorted data structure:

  • SortedList 
  • SortedDict 
  • SortedSet

We can use these sorted containers in the same way as build-in python; List/Dict/Set. These are some examples of code to use the Sorted containers.

SortedList

from random import random
from sortedcontainers import SortedList

mainList = SortedList()
for i in range(100):
    mainList.add(random())

SortedDict

from sortedcontainers import SortedDict
from random import random

mainList = SortedDict()

for i in range(100):
    mainList[i]=(random())

SortedSet

from sortedcontainers import SortedSet
from random import random

mainList = SortedSet()

for i in range(100):
    mainList.add(random())

Since the data inside Python Sorted Containers are sorted, the containers can fetch the nearest value (upper bound/lower bound) like upper_bound, lower_bound in c++ language. In Python Sorted Containers, they are called  bisect_left(value) and bisect_right(value). 

bisect_left(value)  return an index to insert value in the sorted container. if the value already exists in the container, the function will return the existing value’s index.

bisect_left(value) return an index to insert value in the sorted container. if the value already exists in the container, the function will return the index after the existing value’s index.

Let’s see an example.

>>> sl = SortedList([10, 12, 14, 16, 18])
>>> sl.bisect_left(12)
1
>>> sl.bisect_left(13)
2
>>> sl.bisect_right(13)
2
>>> sl.bisect_right(14)
3

As we can see from the example, function bisect_left returns an index of the value, if the value exists in the list; sl.bisect_left(12) return 1 because the index of 12 is the 1st. On the other hand, if the value doesn’t exist, the function will return the index of an insertion point of the value; sl.bisect_left(13) return 2 because the index 2 is the new position of 13 if there is an insertion. In a similar way to bisect_right, if the value doesn’t exist, the function will return the index of an insertion position; sl.bisect_right(13) return 2, because it is the insertion position. However, if the value is already in the container, the function will return the index after the existing value; sl.bisect_right(14) returns 3 because 14 is at index 2.

The most advantage of these two functions is the performance. bisect_left and bisect_right can run with O(nlgn) running time, which is optimized for searching application.

reference: http://www.grantjenks.com/docs/sortedcontainers/

Run a job in the background on Linux instance

When we are working on a remote Linux instance, there are a lot of times that we need to run some intensive computation Job/Process on the instance, which are time-consuming. The Job/Process can be run for more than an hour or even a day, and the struggle connecting can ruin the whole process. I had that issue a few years ago that I need to run some experiments for more than a day and I always stuck with the message “Write Failed: broken pipe” by unstable connection. Therefore, in this post, I want to share a solution to that issue. On the one hand, you can increase the timeout of ssh to avoid the connection lost issue. However, that is not a good solution to the problem, power or the Internet outage can still ruin your process. There is a special command for this task.

On the Linux system, there is a command called “nohup“, which allows Job/Process to able to run in the background (even we are logged off). The usage of this command is very simple. We just need to run “nohup” with our regular command. 

$nohup ./testRunning "argv1" "argv2" ... &

Note that the last symbol (&) is required. This will set the system to run the process even we are logged off. However, we might want to keep the output (stdout and stderr) from the process. We can also save the result to output file by the following command

$nohup ./testRunning "argv1" "argv2" ... > out.txt 2>&1 &

The command will redirect the output from both stdout and stderr to a filename “out.txt”. The first “>” symbol pipeline stdout to filename out.txt and the second symbol “2>&1” append stderr (“2”) to stdout (“1”).

With nohup command, you don’t need to keep worrying about the connection to the instant and you don’t need to keep your machine login to the instant ( Safe some energy ( ͡❛ ͜ʖ ͡❛) ).

What is a Ph.D. Study?

Recently, I just obtained my Ph.D., and I think it is a good time to write about this topic. What is a Ph.D. study? What exactly are we doing? Why it take so long? Many people who are not in the research field might think Ph.D. is one kind of degree which you need to take a lot of class/exam and after you complete all requirement you will obtain that degree. Someone might think if you are smart enough, you can just go for it as a straight line and you will get a degree in well defined years. Thought, it is incorrect. A Ph.D. study doesn’t have a solid definition like other degrees that have a solid plan or flow chart of the study.

So what is a Ph.D. study? If you are lazy to read the whole article, the short answer is “Ph.D. study just one kind of training to become a researcher or an educator”. Yes, it is just training to fulfill skills that you need to become a researcher or an educator. During the training, you will also have to study something that very narrow or wide area. In this process, everyone will face different training depending on what skill you are lacking. As we can see, it is hard to determine how long it will take to finish. For example, some skills can be developed in a short amount of time, however, some others could take many years of practice. In some fields of study, it could take less than 3 years, but in some other areas, it could take more than 7 years. Then at the end of the training, you will be awarded the highest academic degree or Ph.D. Therefore, the next question will be how to know that a candidate is ready to obtain the degree? 

The answer is a candidate need to invent some new knowledge and present it to the world. By that, every university/institute has different criteria to qualify candidates. Not only that but most of the time the laboratory, which insides university/institute, also has additional criteria. For example, in my case, my requirement to allow me to graduate from the university are obtaining course work graduate credits hours, passing a qualifying exam, presenting a major area research work to a committee, and defending a dissertation to a committee member. However, I also have an additional requirement from my laboratory that I need to publish three scientific papers in a well know conference/journal in my field. As we can see by this requirement, not only a researcher or educator is the outcome of the training process, but also a brand new research topic that can push the humanity to a new frontier of knowledge.

In conclusion, Ph.D. is a unique degree that doesn’t have a solid timeline or flow chart. It is actually just training to become a researcher or educator. In addition to that, you will also study to become a specialist in some area. The outcomes of a Ph.D. study is not the thesis/Dissertation, but it is the candidate who will become the next generation of researcher or educator.

I also have a final note for any readers who are pursuing the degree as a Ph.D. Candidate, I encourage you to keep pushing, beyond the frontier. I know that you may found a challenging or failure in this path, but please remember one thing. You are smart and you are working with the smartest group of people on the planet. If you can’t solve this problem who will?

How to reset and clean git?

Recently, I need to run a command to rest and clean local git frequently, but I always forgot the command. Therefore in this post, I write it as a note on how to reset and clean our local git directory. The short answer should be:

$git reset --hard && git clean -d -x -f

The command should completely reset and clean your git folder. Any modified or untracked files will be removed. The command consists of two important commands line, “git reset” and “git clean”.

“git reset” reset modified files

“git clean” remove untracked files.

Parameters:
“–hard” means reseting the index and working tree.
“-d” means including directories.
“-x” means including files ignored by git.
“-f” means force to delete.

Segmentation Fault? Valgrind can detect that

Since this is the first post on my blog, I want to start with a simple thing, but it is an annoying problem in the software development process. An error knows as a segmentation fault or memory corruption. Recently, I had found a project that has memory corruption in linux system, and the software was randomly crash. The segmentation fault is not a difficult bug to solve if we can identify where it is. However, in my case, the crash was random, and the project was large (line by line checking is out of an option).

I did some research online and found an interesting tool on a Linux system, which is called “Valgrind” (https://valgrind.org). Valgrind is a dynamic analysis tool that can detect memory management and threading bugs. Valgrind can also generate deep profile programs for performance analysis. However, in this post, we will focus on the debugging part.

By simply call Valgrind with the program, the tool can identify the line of code that has illegal access to a memory address. Valgrind can detect all of the major memory access issues; e.g., out of range read, out of range write, uninitialized access memory.

Let’s look into an example of using Valgrind, here is the example of a c++ program that has memory corruption.

//Filename Test.cpp
#include <iostream>

int main() {
    int *x = new int[10];
    x[12] = 0; // out of range write
}

As we can see, the program contains out of range writing error. Then we compile the program with the “-g” flag and run the program by Valgrind. Note that the “-g” parameter will generate source-level debugging information.

$gcc -g test.cpp 
$valgrind ./a.out
==23101== Memcheck, a memory error detector
==23101== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==23101== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==23101== Command: ./a.out
==23101== 
==23101== Invalid write of size 4
==23101==    at 0x1087A8: main (test.cpp:6)
==23101==  Address 0x5b7dcb0 is 8 bytes after a block of size 40 alloc'd
==23101==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==23101==    by 0x10879B: main (test.cpp:5)
==23101== 
.
.
.

Valgrind will show the output of the analysis. As we can see from our example, Valgrind detects Invalid write of 4 bytes and at line number 6 (test.cpp:6).

Bonus point, Valgrind can also search for memory leak in the program. We just need to add an option of “—leak-chek=yes” as an example below.

$valgrind --leak-check=yes ./a.out
.
.
.
==23152== HEAP SUMMARY:
==23152==     in use at exit: 40 bytes in 1 blocks
==23152==   total heap usage: 2 allocs, 1 frees, 72,744 bytes allocated
==23152== 
==23152== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==23152==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==23152==    by 0x10879B: main (test.cpp:5)
==23152== 
==23152== LEAK SUMMARY:
==23152==    definitely lost: 40 bytes in 1 blocks
==23152==    indirectly lost: 0 bytes in 0 blocks
==23152==      possibly lost: 0 bytes in 0 blocks
==23152==    still reachable: 0 bytes in 0 blocks
==23152==         suppressed: 0 bytes in 0 blocks

As we can see Valgrind is a great tool for analyzing the memory and detect the segmentation fault. It can detect most of the memory errors/bugs with a little effort from the developer.