God's in his heaven.
All's right with the world.

0%

关于PageRank的基础知识简介请参见博文:《深入探讨PageRank(一):PageRank算法原理入门》

一、PageRank算法的简单举例

Google PageRank算法的思想精华在于:将一个网页级别/重要性的排序问题转化成了一个公共参与、以群体民主投票的方式求解的问题,网页之间的链接即被认为是投票行为。同时,各个站点投票的权重不同,重要的网站投票具有较大的分量,而该网站是否重要的标准还需要依照其PageRank值。这看似是一个矛盾的过程:即我们需要用PageRank值来计算PageRank值。

听起来有点不可思议,既像是递归,又像是迭代,似乎陷入了一个漩涡,Google的创始人佩奇和布林证明了这个过程最终收敛值与初始值无关。遗憾的是我一直都没有找到这个证明,甚至我把佩奇他们当年那篇论文找出来看也没有发现。

对于PageRank的收敛性,我们是可以找到反例的,这说明PageRank至少在某些情况下是不可能收敛的,或者说是收敛不完备的。在本文的第三部分,我们将PageRank的问题转化为了马尔可夫链的概率转移问题,其收敛性的证明也即转化为了马氏链的平稳分布是否存在的证明。我们先来看一个简单的例子:

阅读全文 »

一、PageRank简介

大名鼎鼎的PageRank算法是Google排名运算法则(排名公式)的一个非常重要的组成部分,其用于衡量一个网站好坏的标准。在揉合了诸如Title、Keywords标识等所有其它因素之后,Google利用PageRank来调整网页的排名,使得“等级/重要性”的网页会相对排在前面。简单来说,Google通过下述几个步骤来实现网页在其搜索结果页面中排名:

  1. 找到所有与搜索关键词匹配的网页
  2. 根据页面因素如标题、关键词密度等排列等级
  3. 计算导入链接的锚文本中关键词
  4. 通过PageRank得分调整网站排名结果

事实上,真正的网站的排名过程并非这么简单,我们会在后面进行详细深入阐述。

阅读全文 »

Introduction

If we use common synchronization primitives like mutexes and critical sections, then the following sequence of events occur between two threads that are looking to acquire a lock:

  1. Thread 1 acquires lock L and executes.
  2. T2 tries to acquire lock L, but it’s already held and therefore blocks incurring a context switch.
  3. T1 releases the lock L. This signals T2 and at lower level, this involves some sort of kernel transition.
  4. T2 wakes up and acquires the lock L incurring another context switch.

So there are always at least two context switches when primitive synchronization objects are used. A spin lock can get away with expensive context switches and kernel transition.

Most modern hardware supports atomic instructions and one of them is called ‘compare and swap’ (CAS). On Win32 systems, they are called interlocked operations. Using these interlocked functions, an application can compare and store a value in an atomic uninterruptible operation. With interlocked functions, it is possible to achieve lock freedom to save expensive context switches and kernel transitions which can be a bottleneck in a low latency application. On a multiprocessor machine, a spin lock (a kind of busy waiting) can avoid both of the above issues to save thousands of CPU cycles in context switches. However, the downside of using spin locks is that they become wasteful if held for a longer period of time, in which case they can prevent other threads from acquiring the lock and progressing. The implementation shown in this article is an effort to develop a general purpose spin lock.

Algorithm

A typical (or basic) spin lock acquire and release function would look something like below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// acquire the lock
class Lock
{
volatile int dest = 0;
int exchange = 100;
int compare = 0;
void acquire()
{
While(true)
{
if(interlockedCompareExchange(&dest, exchange, compare) == 0)
{
// lock acquired
break;
}
}
}
// release the lock
Void release()
{
// lock released
dest = 0;
}
};
.......

Here, thread T1 acquires the lock by calling the function acquire(). In this case, the value of dest would become 100. When thread T2 tries to acquire the lock, it will loop continuously (a.k.a. busy waiting) as the values of dest and compare are different and therefore the function InterlockedCompareExchange will fail. When T1 calls release(), it sets the value of dest to 0 and therefore allows T2 to acquire the lock. Because only those threads that acquire() will call release(), mutual exclusion is guaranteed.

Above is a simple implementation of a spin lock. However, this implementation alone is not production fit because spinning consumes CPU cycles without doing any useful work, meaning that the thread spinning will still be scheduled on the processor until it is pre-empted. Another downside of spinning is that it will continuously access memory to re-evaluate the value of dest in the function Interlockedxxx and this also puts the pressure on bus communication.

On a single processor machine, spin wait would be a total waste of CPU as another thread T2 wouldn’t even get scheduled until the spinning thread is switched by the kernel.

So far this implementation isn’t good enough. A general purpose spin lock requires a bit more work in terms of falling back to true waiting in a worst case scenario when it spins for a longer period. Here are some of the points which must be considered:

Yield Processor

The Win32 function YieldProcessor() emits a ‘no operation’ instruction on processors. This makes the processor aware that the code is currently performing spin waits and will make the processor available to other logical processors in a hyper threading enabled processor so that the other logical processors can make progress.

Switch to Another Thread

Sometimes it is useful to force a context switch when a spinning thread has already consumed enough time spinning equivalent to its thread time slice allocated by the kernel. Here, it makes good sense to allow another thread to do useful work instead. The function SwitchToThread() relinquishes the calling thread’s time slice and runs another thread in the ready state. It returns true when a switch occurs, otherwise false.

Sleeping

SwitchToThread() may not consider all threads on the system for execution, therefore it may be wise to sometimes call Sleep() or Sleepex(). Calling Sleep() with an argument of 0 is a good approach as it does not result in a context switch if there are no threads of equal priority in the ready state. Sleep(0) will result in a context switch if a higher priority thread is in ready state.

Other Considerations

A pure spin lock is only good enough when the lock is held for a very short period of time. Here the critical region may have not more than 10 instructions and practically even simple memory allocation or virtual calls or file I/O can take more than 10 instructions.

Secondly, as mentioned above, it would wasteful to use spin locks when an application runs on a single processor.

Sample Project and Implementation

The sample project in C++ consists of a spin lock implementation considering the points stated above. It also has an implementation of Stack, Queue, and a thin Producer-Consumer class. I’ll only focus on then Spin Lock implementation here as the rest of it is easy to follow.

The file SpinLock.h defines these constants:

  • YIELD_ITERATION set to 30 - What this means is that the thread spinning will spin for 30 iterations waiting for the lock to acquire before it calls sleep(0) to give an opportunity to other threads to progress.
  • MAX_SLEEP_ITERATION set to 40 - This means when the total iteration (or spin) count reaches 40, then it would force a context switch using the function SwitchToThread() in case another thread is in ready state.

The struct tSpinLock acts as a lock object which is declared in the class whose objects are being synchronized. This object is then passed in the constructor to the object of tScopedLock which initializes (references) the lock object passed to it. The tScopedLock() constructor locks the object using the member function of the class tSpinWait. The destructor ~tScopedLock() releases the lock.

The Lock() function in the class tSpinWait has got a nested while loop. This is done on purpose. So if a thread is spinning to acquire the lock, it wouldn’t call interlockedxxx() with every iteration, rather it would be looping in the inner while loop. This hack avoids the system memory bus being overly busy due to continuous calls to the interlockedxx function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
// spin wait to acquire
while(LockObj.dest != LockObj.compare) {
if(HasThreasholdReached())
{
if(m_iterations + YIELD_ITERATION >= MAX_SLEEP_ITERATION)
Sleep(0);
if(m_iterations >= YIELD_ITERATION && m_iterations < MAX_SLEEP_ITERATION)
SwitchToThread();
}
// Yield processor on multi-processor but if on single processor
// then give other thread the CPU
m_iterations++; if(Helper::GetNumberOfProcessors() > 1)
{
YieldProcessor(/*no op*/);
}
else { SwitchToThread(); }
}

The inner while loop just compares the value of dest and compare and if they are not equal, then it tries to acquire them using interlockedxxx. Depending on the iteration count, the thread is either put to sleep or switched. When the application is running on a single CPU, then it always forces a context switch.

Test Results

I tested the performance of this Spin Lock implementation by inserting 10000 integers into a queue from multiple threads (each thread inserting 10000 integers into the queue). I then replaced SpinLock with a Critical Section synchronization primitive in the code and ran the same tests. I ran all the tests on an Intel Core DUO CPU T9600 @ 2.80 GHz.

Results.jpg

The x-axis is the number of threads and y-axis is the time taken in milliseconds. Both synchronization methods (spinlock and CS) showed close performance when the number of threads were 2 and 4. As the number of threads increased, critical section locking took more than double the time as compared to spin locks. Spin lock seemed to have scaled a lot better when contention increased due to the high number of threads. The time taken is calculated using QueryPerformanceCounter Win32 methods. However, I would suggest performing your own testing on the platform you intend to use.

Here is the table with the results:

| No. of Threads | Time taken (Spin lock) | Time taken (Critical Section) | | --- | --- | --- | | 2 | | 6.6 | 7.14 | | 4 | | 12.81 | 14.09 | | 6 | | 16.01 | 46.37 | | 8 | | 23.32 | 54.34 | | 10 | | 26.21 | 74.76 | | 15 | | 41.17 | 89.05 | | 20 | | 47.63 | 116.82 | | 25 | | 62.25 | 147.68 | | 30 | | 64.37 | 169.17 | | 35 | | 88.02 | 210.07 | | 40 | | 93.99 | 296.32 |

Future Work

  • Profiling the code on different platforms.
  • Adding a couple more data structures to the project like associated arrays and hashtable.

Conclusion

This was an effort to develop a general purpose spin lock implementation. Pure spin locking isn’t a good option in all scenarios and therefore there is a need for an implementation which allows the spinning thread to be suspended by the kernel.

History

  • First draft.
  • Revision 1 - Fixed a couple of typos.
  • Revision 2 - Code is now re-entrant safe.
  • Revision 3 - Lock release now uses interlockedexchange.
  • Revision 4 - Added test results.

本文地址:http://xnerv.wang/spin-lock-in-cpp/
转载自:Spin Lock in C++

exe调用了cglover.dll中的一个导出类成员函数,是一个读取lua配置文件类,其中一个函数是读取int数组,并通知参数std::vector传递。
结果一出调用函数作用域,就挂调用,挂在std::vector的析构释放内存上~ 郁闷找不到原因,往床上一躺,突然想起关于很早以前就在游戏编程精粹上看到过exe释放dll分配的内存会有问题,但一直没明白为什么,也没有深究,应该是这个问题了。于是上网查了下,然后用windbg跟了下,总算整明白了。

阅读全文 »

Question

In Windows, for very demanding applications, a programmer may use HeapCreate, HeapAlloc in order to better manage and control the allocation of memory- speed it up (aka private allocators). What is the equivalent in Linux c++ programming?

Answer by psmears

If you want to use your own private allocator, then use mmap() to map an amount of memory into your process, then you can use that memory as you like. Open a file descriptor to /dev/zero, and then use that as the ‘fildes’ parameter to mmap(). See man mmap for full details of the parameters to pass. In this respect mmap() plays the same role as HeapCreate().


Article link: http://xnerv.wang/heapcreate-heapalloc-in-linux-private-allocator-for-linux/
Reprinted from: (StackOverflow) HeapCreate, HeapAlloc in Linux, private allocator for Linux

Question

I am trying to use the perfmon windows utility to debug memory leaks in a process.

This is how perfmon explains the terms:

Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process. If free memory in the computer is above a threshold, pages are left in the Working Set of a process even if they are not in use. When free memory falls below a threshold, pages are trimmed from Working Sets. If they are needed they will then be soft-faulted back into the Working Set before leaving main memory.

Virtual Bytes is the current size, in bytes, of the virtual address space the process is using. Use of virtual address space does not necessarily imply corresponding use of either disk or main memory pages. Virtual space is finite, and the process can limit its ability to load libraries.

Private Bytes is the current size, in bytes, of memory that this process has allocated that cannot be shared with other processes.

阅读全文 »

Question

I’ve always wondered this - why can’t you declare variables after a case label in a switch statement? In C++ you can declare variables pretty much anywhere (and declaring them close to first use is obviously a good thing) but the following still won’t work:

1
2
3
4
5
6
7
8
9
10
switch (val)
{
case VAL:
// This won't work
int newVal = 42;
break;
case ANOTHER_VAL:
...
break;
}

The above gives me the following error (MSC):

initialization of ‘newVal’ is skipped by ‘case’ label

This seems to be a limitation in other languages too. Why is this such a problem?

阅读全文 »

Question

I’m working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I’ve got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I’d like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy since mmap’d blocks need to lie on page sized boundaries (my understanding) and records could potentially like across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we’re not limited to reading blocks that lie on page sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

阅读全文 »

本贴涉及的硬件平台是X86,如果是其它平台,嘻嘻,不保证能一一对号入座,但是举一反三,我想是完全可行的。

概念

物理地址(physical address)

用于内存芯片级的单元寻址,与处理器和CPU连接的地址总线相对应。
——这个概念应该是这几个概念中最好理解的一个,但是值得一提的是,虽然可以直接把物理地址理解成插在机器上那根内存本身,把内存看成一个从0字节一直到最大空量逐字节的编号的大数组,然后把这个数组叫做物理地址,但是事实上,这只是一个硬件提供给软件的抽像,内存的寻址方式并不是这样。所以,说它是“与地址总线相对应”,是更贴切一些,不过抛开对物理内存寻址方式的考虑,直接把物理地址与物理的内存一一对应,也是可以接受的。也许错误的理解更利于形而上的抽像。

阅读全文 »