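An analysis of ext4 fsync performance and the nodelalloc mount option

Posted on 2011-06-24 | Category: Uncategorized

Sighing is one thing; once the venting is done, life still has to go on.

The past few days were so busy that my weekly report listed 11 items. Thinking back to how relaxed things were at company T, my current boss is a lucky man.

OK, back to the topic.

Our system uses ext4 as its file system. What is so good about ext4? Mainly that I have a good feeling about it (just kidding). I still remember the first time I used a freshly created ext4 file system (not one converted from ext3): the performance felt like magic.

On our Android system we use ext4 mainly because it mounts quickly, which shortens boot time. I remember shaving the boot time down bit by bit, from 40 seconds to under 30. Now, with Gingerbread (2.3), boot takes less than 15 seconds without any special optimization.

The problem we ran into is that one benchmark scored several times lower on our system than on a competitor's. Never too pleased with ourselves, we assumed we were simply slower. Stories like this usually end the same way: when it really starts to hurt, for example when a big customer challenges the numbers, someone has to dig in. This time that someone was me.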

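The first investigation tool that came to mind was strace. This is I/O, so it has to involve system calls, and strace should show what is going on; with strace's timestamps enabled you can also see roughly which operations are slow. Sure enough, strace showed that the fsync(3) calls took a lot of time, and the process was visibly scheduled out in the middle of them. As for write and read, I could not tell whether they were fast or slow, so I started with why fsync() is so expensive. In fact I had suspected fsync() from the start: on our previous kernel, 2.6.31, this benchmark scored high, and after the upgrade to 2.6.35 the score dropped to about 1/3 of that.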
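The cost that strace reported can also be measured directly in a small program. The following is a minimal sketch, not the benchmark itself: it writes a small buffer to a file and times one fsync() with clock_gettime(). The helper name time_fsync_ns and the file path passed to it are hypothetical, chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write `len` bytes of filler (capped at 4096) to `path`, then time a
 * single fsync(). Returns the elapsed time in nanoseconds, or -1 on any
 * error. Hypothetical helper for illustration only. */
int64_t time_fsync_ns(const char *path, size_t len)
{
    char buf[4096];
    struct timespec t0, t1;
    int64_t ns = -1;
    int fd;

    if (len > sizeof(buf))
        len = sizeof(buf);
    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) == (ssize_t)len) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) == 0) {
            clock_gettime(CLOCK_MONOTONIC, &t1);
            ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
               + (t1.tv_nsec - t0.tv_nsec);
        }
    }
    close(fd);
    return ns;
}
```

Calling it in a loop on a file that lives on the ext4 partition, once per mount option, gives a rough per-call fsync() cost to compare against the strace timestamps.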

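A similar problem is USB Mass Storage performance: writes were fast on 2.6.31 and painfully slow on 2.6.35. The UMS code in f_storage.c calls vfs_write() to write to the block device. Dumping the vfs_write() calls together with the mmc commands showed that on 2.6.31 a vfs_write() with the F_SYNC flag still reached the mmc layer out of order. On 2.6.35 every vfs_write() maps to a few mmc commands, and only after those commands complete is the next chunk of data fetched over USB, which makes the whole path dumb and slow. As for why writes were still reordered on 2.6.31 even with F_SYNC set, I think that was simply a bug that 2.6.35 fixed.

So the slow fsync() on ext4 may well be related to this. But an embedded device can lose power at any moment, so power-loss protection matters: we cannot turn every synchronous write into an asynchronous one for the sake of performance, because the resulting data loss would be serious.

After two days of Googling I found many discussions about fsync and ext4. I am leaving one here in case someone wants to read it [1].

With no leads, I went on to read the ext4 documentation in the kernel tree. When I reached the section on mount options, it occurred to me: would the benchmark improve with different mount options?

So I put the options that looked related to writes into a table.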

Ext4 with different option    nobarrier    nodelalloc    journal_async_commit

(no combine)                  558          1087          524
nobarrier                     NA           NA            522
nodelalloc                    1052         NA            1051
journal_async_commit & nobarrier & nodelalloc

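As the table shows, nodelalloc contributes a great deal here: it almost doubles the score.

Why does nodelalloc have this effect? This is how the documentation describes it: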

delalloc        (*)     Defer block allocation until just before ext4
                        writes out the block(s) in question. This
                        allows ext4 to better allocation decisions
                        more efficiently.

nodelalloc              Disable delayed allocation. Blocks are allocated
                        when the data is copied from userspace to the
                        page cache, either via the write(2) system call
                        or when an mmap'ed page which was previously
                        unallocated is written for the first time.
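Start with delalloc, the default: all block allocation is deferred until the data actually has to be written out, and a sync call is exactly such a moment.

With this default feature turned off, block numbers are allocated when the data enters the page cache. If that were the only difference, it would not explain why allocating block numbers costs so much time. And indeed, the bottleneck is not there.

Next, look at the fsync() system call. Its manual page says:

fsync() transfers ("flushes") all modified in-core data of (i.e., modified
       buffer cache pages for) the file referred to by the file descriptor
       fd to the disk device (or other permanent storage device) where
       that file resides.

So it only asks the file system to write the modifications of *that file* to disk.

Now look at how ext4 implements it.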

/*
 * akpm: A new design for ext4_sync_file().
 *
 * This is only called from sys_fsync(), sys_fdatasync() and sys_msync().
 * There cannot be a transaction open by this task.
 * Another task could have dirtied this inode.  Its data can be in any
 * state in the journalling system.
 *
 * What we do is just kick off a commit and wait on it.  This will snapshot the
 * inode to disk.
 *
 * i_mutex lock is held when entering and exiting this function
 */

int ext4_sync_file(struct file *file, int datasync)
{
        struct inode *inode = file->f_mapping->host;
        struct ext4_inode_info *ei = EXT4_I(inode);
        journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
        int ret;
        tid_t commit_tid;

        J_ASSERT(ext4_journal_current_handle() == NULL);

        trace_ext4_sync_file(file, datasync);

        if (inode->i_sb->s_flags & MS_RDONLY)
                return 0;

        ret = flush_completed_IO(inode);
        if (ret < 0)
                return ret;
        if (!journal) {
                ret = generic_file_fsync(file, datasync);
                if (!ret && !list_empty(&inode->i_dentry))
                        ext4_sync_parent(inode);
                return ret;
        }

        /*
         * data=writeback,ordered:
         *  The caller's filemap_fdatawrite()/wait will sync the data.
         *  Metadata is in the journal, we wait for proper transaction to
         *  commit here.
         *
         * data=journal:
         *  filemap_fdatawrite won't do anything (the buffers are clean).
         *  ext4_force_commit will write the file data into the journal and
         *  will wait on that.
         *  filemap_fdatawait() will encounter a ton of newly-dirtied pages
         *  (they were dirtied by commit).  But that's OK - the blocks are
         *  safe in-journal, which is all fsync() needs to ensure.
         */
        if (ext4_should_journal_data(inode))
                return ext4_force_commit(inode->i_sb);

        commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
        if (jbd2_log_start_commit(journal, commit_tid)) {
                /*
                 * When the journal is on a different device than the
                 * fs data disk, we need to issue the barrier in
                 * writeback mode.  (In ordered mode, the jbd2 layer
                 * will take care of issuing the barrier.  In
                 * data=journal, all of the data blocks are written to
                 * the journal device.)
                 */
                if (ext4_should_writeback_data(inode) &&
                    (journal->j_fs_dev != journal->j_dev) &&
                    (journal->j_flags & JBD2_BARRIER))
                        blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
                                        NULL, BLKDEV_IFL_WAIT);
                ret = jbd2_log_wait_commit(journal, commit_tid);
        } else if (journal->j_flags & JBD2_BARRIER)
                blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
                        BLKDEV_IFL_WAIT);
        return ret;
}

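Our file system is mounted in ordered mode, so the functions that get called are basically:

flush_completed_IO()
jbd2_log_start_commit()
jbd2_log_wait_commit()

So for a single fsync(), ext4 commits the entire journal, and that is what is really slow. It hurts applications that call fsync() frequently; SQLite is a typical example. I think this feature gains more than it loses on disk devices, but on flash devices it brings little advantage.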

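Later I also compared write performance under O_SYNC with delalloc on and off. In the chart, the Y axis is the difference: above 0 means delalloc performs better, below 0 means it performs worse. The X axis is the unit size of a single write operation, and the different colored lines represent different file sizes. All units are KB.

The chart shows that for a very large file (16M), delalloc performs better in almost every case. For files of 64K-512K, though, it performs much worse.

As for the write unit size, 256K is clearly a watershed. Close to 256K, delayed allocation performs much better. The reason, as I see it, is that our L2 cache is 256K: when the written data approaches 256K, delayed allocation does not have to allocate blocks, so most of the memory writes go to filling the page cache for the file. Once block-allocation data is mixed in, the cache is no longer aligned, so performance is much worse than with delayed allocation.

Another point is the L1 cache (ours is 32K). For writes smaller than L1, delayed allocation is much faster, for reasons similar to the above. Close to the L1 size, however, non-delayed allocation is somewhat faster, and I do not know how to explain that. One possibility is that the latency of L1 fetching data from L2 is relatively small.

Another interesting point: at a unit size of 512K, delalloc performance clearly reaches a peak. Why?

[Note] Something just occurred to me: why is 512K a special point? Because 512K is the default block size of the mmc device. But why is performance so much higher without delayed allocation? Very interesting.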

[1] The right thing, but it really affects performance: http://postgresql.1045698.n5.nabble.com/ext4-finally-doing-the-right-thing-td2076089.html

[2] This patch fixes it: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745

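Life is not easy, and a loser's life even less so

Posted on 2011-06-24 | Category: Uncategorized

A loser. I remember my take-home pay was this much two years ago, and it is still this much now. Meanwhile the cost of living is rising unbearably fast. Before tax, the number still lets me sleep at night; after tax, I do not even know how to comfort the person I love. All I can say is that there will be bread, there will be everything. For us, right now, such words are not much use. I cannot sleep: a sweet photo of us by West Lake a year ago, an unfinished bottle of Erguotou at my right hand, Yunnan Baiyao next to the computer.

Busy and tired, a pile of things at work, a pile of things at home, and then that pitiful salary. I think of dreams that are out of reach. How did it get this bleak? I took the road I insisted on, and the price of insisting on doing what I want is that the reward has not come. Maybe there simply has not been enough time yet. But persisting through this stretch is such a torment. It has been a long time since I felt happy.

I am starting to wonder whether I was wrong: should I have chosen the side with more money back then?

That is me: floating when things go well, lonely when they do not.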

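A 3G router performance optimization story

Posted on 2011-06-01 | Category: Uncategorized

This is a record of an optimization I went through at work; some real names are left out.

The background: a customer of ours wanted to build a 3G mobile Wi-Fi AP, a product like this one: http://www.360buy.com/product/273676.html. The CPU has an ARM9 core running at 454MHz, and the memory bus is only 16 bits wide. Devices like this are usually sensitive to cost and power, so the hardware is low-end, yet performance in one respect (throughput here) still has to be high.

The architecture is roughly this:

3G traffic comes in over USB and passes through Netfilter filtering (there are usually very few rules, but they are required), then goes over SDIO to the Wi-Fi chip, and from there wirelessly to the various devices.

Now for the target. The customer's goal was ambitious. The HW product listed above does 7Mbps uplink and 5Mbps downlink (bits, mind you). For this product the customer asked for 25Mbps/28Mbps, roughly 5 times higher. A long road ahead. These figures are downlink speeds, of course.

First I ran the unoptimized build and measured 13Mbps. That already looks much better than the product above, but it is still only half of what the customer wanted. So the optimization began, armed with ftrace, oprofile and similar tools to watch the statistics.

The first thing I looked at was /proc/cpu/alignment, which counts unaligned accesses, for example: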

bash-3.2# cat /proc/cpu/alignment
User:           0
System:         0
Skipped:        0
Half:           0
Word:           0
DWord:          0
Multi:          0
User faults:    4 (signal)

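One of these counters was abnormally high. On ARM9, an access that is not 4-byte aligned traps into the kernel to be fixed up, so it is very slow. Tracing where the unaligned accesses came from led to the TCP stack, and in the end to the FEC driver in our test environment: the Ethernet header is 14 bytes, so after it is stripped, everything the TCP/IP code touches is no longer 4-byte aligned. The fix was to leave 2 bytes of padding at the front of the buffer in the FEC driver, after which the TCP accesses were all 4-byte aligned.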
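The arithmetic behind the 2-byte padding can be checked with a few lines of C. This is only a sketch of the offset calculation, assuming the receive buffer itself starts on a 4-byte boundary and the frame carries a standard 14-byte Ethernet header; the function names are made up for illustration and do not come from the FEC driver.

```c
#include <stddef.h>

#define ETH_HEADER_LEN 14  /* 6-byte dst MAC + 6-byte src MAC + 2-byte EtherType */

/* Offset of the IP header from the start of the receive buffer when the
 * driver leaves `pad` bytes in front of the Ethernet frame. */
size_t ip_header_offset(size_t pad)
{
    return pad + ETH_HEADER_LEN;
}

/* Nonzero if `offset` falls on a 4-byte boundary. */
int is_word_aligned(size_t offset)
{
    return (offset % 4) == 0;
}
```

With no padding the IP header starts at offset 14, which is not a multiple of 4; with 2 bytes of padding it starts at offset 16, so the 32-bit fields in the IP and TCP headers land on word boundaries.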

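After this optimization, throughput reached 18Mbps.

Next came USB. ftrace showed that while SDIO was trying to move data, it was frequently interrupted by USB interrupts. The 3G modem uses interrupt mode for its USB transfers, at a rate of one interrupt per 1 ms, so the CPU was constantly servicing interrupts and the data in the FIFO could not be moved to the Wi-Fi side, because the Wi-Fi path does its copying in softirq context.

1. interrupt (USB) -> 2. softirq -> 3. list (WIFI) -> 4. two threads write SDIO (WIFI/task)

Because the IRQ has higher priority, the other tasks and softirqs never got a chance to run.

After changing the interval to 8 ms, throughput reached 23Mbps.

Good, the target was in sight.

Next came the generic methods: oprofile. It showed that memcpy was called very often, so cutting down on memcpy should buy a fair amount of performance.

The memcpy turned out to be in the Wi-Fi driver. Looking closer, the Wi-Fi driver found that the headroom in front of the buffer was too small to insert its own packet header, so it copied the packet. Since the packet comes from USB, the fix is to reserve more headroom for the Wi-Fi header when the USB side builds the packet; then the memcpy is no longer needed.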
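The headroom idea can be illustrated with a toy packet buffer. This is a user-space sketch under my own assumptions, not the Wi-Fi or USB driver code: struct pkt_buf, pkt_init() and pkt_push() are hypothetical names, loosely modeled on how network buffers keep spare space in front of the payload.

```c
#include <stddef.h>
#include <string.h>

/* A toy packet buffer: [ headroom | payload ] inside one allocation.
 * Hypothetical structure for illustration only. */
struct pkt_buf {
    unsigned char storage[256];
    size_t data_off;   /* offset of the first valid byte in storage */
    size_t len;        /* number of valid bytes starting at data_off */
};

/* Initialise the buffer, reserving `headroom` bytes before the payload.
 * Returns 0 on success, -1 if headroom plus payload does not fit. */
int pkt_init(struct pkt_buf *p, size_t headroom,
             const void *payload, size_t payload_len)
{
    if (headroom > sizeof(p->storage) ||
        payload_len > sizeof(p->storage) - headroom)
        return -1;
    p->data_off = headroom;
    p->len = payload_len;
    memcpy(p->storage + headroom, payload, payload_len);
    return 0;
}

/* Prepend a header in place. Returns a pointer to the new start of the
 * packet, or NULL when the headroom is too small (the case in which a
 * driver would have to reallocate and copy the whole packet). */
unsigned char *pkt_push(struct pkt_buf *p, const void *hdr, size_t hdr_len)
{
    if (hdr_len > p->data_off)
        return NULL;
    p->data_off -= hdr_len;
    p->len += hdr_len;
    memcpy(p->storage + p->data_off, hdr, hdr_len);
    return p->storage + p->data_off;
}
```

When the producer reserves enough headroom, prepending a header only writes the header bytes; the payload is never moved. When it reserves too little, pkt_push() fails, which corresponds to the copy path that oprofile flagged.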

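The SDIO DMA uses scatter-gather (sg-list) DMA, which also removes some memcpy calls. Without sg-list DMA the driver has to copy several small buffers into one large buffer and then start the DMA transfer. With sg-list DMA it builds a scatter-gather table and the DMA engine transfers the small buffers directly.

After these optimizations the 28Mbps requirement was met.

It really comes down to a few things: scheduling, memcpy, and aligned access.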

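Wrestling with my own character

Posted on 2011-05-22 | Category: Uncategorized

Over these years I focused too much on the technical side: reading, writing code, studying details. Looking back today, I suddenly find that I am not living happily and that I handle the non-technical problems of life badly. Someone who understood me used to tell me that my character was too weak, and I would not listen. Today I became a sublessor for the first time and made a small mistake in choosing a tenant. It went from a little regret and a little agonizing to a lot of regret and a lot of agonizing, then to consoling myself, and I do not know how much more regret is still to come.

Looking at myself in the mirror, I realize these are weaknesses of character that I have been unwilling to face: not persevering, lacking confidence, wavering. The way this small matter unfolded is roughly the same as my biggest failure in love.

1. First I make an unreasonable promise (my nature is optimistic).

2. When a seemingly small decision comes up, I agonize, but still do what I promised, even though I have already noticed that it is wrong.

3. Later, when things have gone beyond what optimism can cover, I start to regret not having chosen the other side of the dilemma.

4. I want to rescue the situation but find it is too late (although at that point there is still a small chance). I try for a short while and then give up. Things develop further, and I am back at step 3, in a loop.

This pattern has shown up in many of the major choices in my life. I must fix it; that is one goal for 2011.

Where can this pattern be broken?

First, at step 1: think it through before making a promise, and if it is not clear, wait. Do not let outside pressure force a hasty decision. Try talking through both sides with a doll, and if the situation does not allow that, talk it through in my own head.

Then step 2. This moment of agonizing matters, because there is still a chance to turn back. This is when the wrong direction is most likely to tempt and persuade you. But hold firm: if you already regret it now, do not go on, because later there will be more regret and more pain.

Between steps 3 and 4, the problem can probably no longer be solved perfectly. Something has to be given up. The sacrifice may be large, but it has to be made with resolve. Considering how much more agonizing between 3 and 4 lies ahead, the earlier this decision is made, the smaller the loss.

Reflection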

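Some experience debugging UART Bluetooth on Linux: DMA, FTP

Posted on 2011-04-01 | Category: Uncategorized

I have spent the past few days bringing up an Atheros 3001 Bluetooth chip. Today marks a milestone, so I am writing a summary here and starting a blogging habit. Blogging may be a bit out of fashion, but I will treat it as a habit of my own.

First, about the chip: the Ath3001 is a Bluetooth chip that connects to the host over UART, that is, a serial port. I am not sure what its distinguishing features are; the one obvious impression is that it is fast. It supports baud rates up to 3000000 (3M), so leaving the upper layers aside, data throughput should be good. Mainstream Bluetooth chips are probably all about the same, though.

My task is to get this chip running well on our platform, the Freescale i.MX53. To be honest, I do not much enjoy supporting third-party parts: communication can be difficult, and once it works it does not feel like much of an achievement. But since the job is mine, I will do it well.

In the end the driver reached a 3M baud rate. FTP reaches 140KB/s receiving and 75KB/s sending. Falling short of the theoretical maximum of 375K is probably down to upper-layer settings such as the MTU. As an aside, I just read that file transfer between iPhones tops out at 67KB/s. I have not measured the speed between two mx53 boards, but being in the same order of magnitude as Apple is already gratifying. Amen.

Along the way I ran into the following problems.

The first was CTS/RTS.

During the probe that follows a chip reset, the driver for this chip has to toggle the RTS pin by hand to kick the chip awake. At first our UART driver did not follow the Linux tty conventions properly and could not switch flow control on and off at runtime. Once that was added to the driver, the chip came up.

Baud rate

The first profile I tried was A2DP, the stereo music profile. After playing for a while the audio would stutter at regular intervals, and the log showed that writes were being blocked by something. One guess at the time was that the baud rate was too low; raising it to 56K helped somewhat, but there were still one or two stutters. This was only solved recently: the UART driver's TX handling had a problem.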

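DMA transfer

At this point the UART transfers ran in interrupt mode. The advantage is simplicity. The drawback is big: transferring music or files at a high baud rate, the poor CPU is interrupted constantly, which hurts system performance badly. The interrupt rate is simply the transfer rate divided by the FIFO size. For example, transferring a file at 30K/s with a 32-byte FIFO interrupts the CPU about 1000 times per second, which is unacceptable. Of course, with that many interrupts the transfer could never reach that speed :)

The only solution is to turn on DMA. Only by comparing DMA on and off can you appreciate how much better it is. With interrupts, A2DP playback put the CPU load at 90+%; with DMA it dropped to 3%. Quite a gap. The DMA buffer can be larger than the FIFO: mine is 128 bytes while the FIFO is only 32. The DMA controller keeps moving data out of the UART FIFO and raises an interrupt once the buffer is full, so you can process the data.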
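The interrupt-rate estimate is a one-line division, shown here as a small helper. This is a back-of-the-envelope sketch that assumes one interrupt per full FIFO (or per full DMA buffer) and 1 KB = 1024 bytes; the function name is hypothetical.

```c
/* Interrupts per second when one interrupt is raised for every
 * `bytes_per_irq` bytes transferred at `bytes_per_sec`.
 * Returns 0 for a zero-sized FIFO or buffer. */
unsigned long irqs_per_second(unsigned long bytes_per_sec,
                              unsigned long bytes_per_irq)
{
    if (bytes_per_irq == 0)
        return 0;
    return bytes_per_sec / bytes_per_irq;
}
```

At 30 KB/s, a 32-byte FIFO gives 960 interrupts per second (the "about 1000" above), while a 128-byte DMA buffer gives 240. The interrupt count alone does not account for the whole drop in CPU load, since with DMA the CPU also no longer copies the bytes out of the FIFO itself.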

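Turning on DMA had its own twists: with DMA enabled, the SoC needs the emi_fast clock, which is used for DDR access, to be on, and it took some effort to track that down. PS: the clocks on the 53 are really complicated.

Debugging A2DP was forgiving, because a wrong frame or two does not matter much when listening to music. File transfer is different: not a single error is allowed.

UART driver bugs

So while debugging FTP I found two more small bugs in our UART driver.

The first is on the transmit side. Transfers go through DMA, and when the DMA finishes a buffer it calls a callback. The original driver set up the new buffer address there and scheduled a tasklet to send the next buffer. I found that this asynchronous approach loses some data; calling the function directly to send the next buffer fixed it. I think the problem is this: the DMA read callback and the DMA write callback both modify one buffer id variable that records which buffer segment is in use, and if a tasklet sends the next segment, the following can happen:
dma_writecallback -> schedules the tasklet.
Because the tasklet runs in softirq context, there is a small delay.
In that window dma_readcallback arrives and runs before the tasklet, so it overwrites the buffer id.
The transfer then fails, or perhaps the wrong data is sent, the remote end does not recognize it and never gives the right response, and the Bluetooth driver reports an error.

At that point the FTP protocol breaks.

The other problem is on the read side. With DMA, 128 or more bytes arrive at the dma_read callback at a time. The original driver called tty_buffer_request_room() to request space and then passed as many bytes as it had been granted to tty_insert_flip_string() to hand them to the upper layer. That is wrong: requesting space is the job of tty_insert_flip_string(). Internally, tty_insert_flip_string_fixed_flag() loops over request_room, and each request is small enough not to fail. The UART driver instead requested too much at once, 128 bytes for example, the request failed, and the data was dropped.

Running md5sum on both ends showed that the transferred file contents differed.

Hardware baud rate

One more case came up during debugging: in the middle of a fast FTP transfer, data would be corrupted for no apparent reason. It turned out that the transceiver on our ATH3001 card only supports a few baud rates well, such as 1500000 and 3000000, and I was using 1152000. After switching the baud rate, the mysterious data errors disappeared.

Next up is HFP, probably the most useful feature Bluetooth has.

PS: I developed and debugged this driver mostly under Ubuntu on ARM, where the powerful tooling made things fairly smooth. With the low level sorted out, running it on Android went smoothly too.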

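Notes on debugging a wm8993 howling (feedback) problem

Posted on 2011-03-27 | Category: Uncategorized

I recently debugged a howling problem on the wm8993 and read some material on Linux ALSA ASoC along the way. I found a lot of good resources and am collecting them here; maybe they will help someone.
The symptom: while recording, playing a sound from the system triggers howling. The wm8993 has a frightening audio map (see page 27 of the wm8993 datasheet), so my first analysis was that somewhere in the playback routing a signal leaked into the recording path, and the recorded signal then went back out to the speaker. That forms an endless loop in which the sound keeps getting louder.

I know of two kinds of howling:

One is caused by internal routing, as in this case, so covering the mic (so that it cannot hear the speaker) does not help.
The other has an external cause: the mechanical design puts the mic in front of the speaker, like what you hear at karaoke.
To tell them apart, cover the mic while the howling is happening; if it continues, it is the internal kind.

I printed out the routing diagram (it has to be A3; on A4 it is unreadable) and checked it, and indeed a few paths formed such a loop. Turning them off in the configuration fixed the problem. One odd thing: these paths are off by default, so who turned them on? I still do not know. One guess is that alsa.conf is not configured properly, or that the ASoC machine code does not define these scenarios properly.

Some links:

DAI: Digital Audio Interfaces (the hardware interfaces of audio devices)

Alsa SoC Audio(part 1)

Alsa SoC Audio(part 2)

Neo1973 material, a complete phone audio design, very valuable as a reference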

http://wiki.openmoko.org/wiki/Neo1973_Audio_Subsystem
http://people.openmoko.org/joerg/ALSA/doc/WM8753_control_diag_gsmhandset_mic_std.png
http://people.openmoko.org/joerg/ALSA/doc/WM8753_control_diag__gsmhandset_tx+rx-processed.png
http://people.openmoko.org/joerg/ALSA/doc/WM8753_control_diag.png
http://wiki.openmoko.org/wiki/Neo_Freerunner_audio_subsystem
http://wiki.openmoko.org/wiki/Neo_1973_and_Neo_FreeRunner_gsm_modem#AT.25Nxxxx

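Notes on installing Ubuntu on a Dell T3500

Posted on 2011-03-25 | Category: Uncategorized

A few days ago our department bought a workstation, a Dell T3500: very nice, very powerful, with hardware RAID0/5 support. The plan was to set it up with 64-bit Ubuntu 10.10. I did not expect the process to be so painful; it cost me several days. I am writing it down in case someone else is suffering too. First, installation. Ubuntu 10.10 ships the 2.6.35 kernel by default, and that kernel has a bug with the T3500's AHCI controller that keeps it from detecting the hard disks.

After a lot of searching I found that a kernel command-line option in the installer solves it. In command-line install mode, enter install pci=nocrs; in the graphical installer, choose the advanced options and append pci=nocrs after the trailing "ro quiet" on the command line. Otherwise the installer's partitioning step only finds iSCSI disks.

Second, you may be installing by now, but do not get excited yet; the best is still to come. This workstation has hardware RAID, enabled by default. After installation finishes, the initrd cannot find the root file system; the message is roughly that UUID=xxx cannot be found, and you are dropped into a busybox shell. I investigated for a long time and even read some kernel documentation and code.

I finally found the cause: this Ubuntu release has poor support for dm (hardware RAID). The kernel supports it, but the initrd does not create the device nodes. Under /dev/mapper/ there are only two files, control and zxz...Volume0, and none of the partition nodes on the volume, which is why the system cannot boot. Probably the mdadm-style scripts are not well written. I expect the next release will fix this. But life has to go on: Ubuntu 11.04 is due in a month, yet the Alpha3 I downloaded could not be installed at all, so I had to change approach.

The final solution is software RAID. At boot you can see a RAID management utility; enter it and delete all the RAID volumes, then configure Software RAID when you reinstall. One more tip: do not put the /boot partition on any kind of RAID, or grub will not find its config.

I formatted sda1 as ext2 and mounted it as /boot, with one RAID0 for /, another RAID0 for /home, and a 16G swap. One trick: put the two member partitions of each RAID array on two different disks, which improves performance. Put them on the same disk and I expect performance would drop.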

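Distributed systems notes - applications of cryptography

Posted on 2010-12-15 | Category: Uncategorized

These notes cover one chapter of the book Distributed Systems. Its examples greatly helped me understand some of today's network security practice: shared keys, why public/private keys are used, what a certificate is, server login, and so on.

I am about to sell the book, so I copied these examples down. There are some formulas, so LaTeX was the better choice; converting to HTML produced a lot of images, so I am just putting the PDF here.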

Linux Kernel and Android Suspend/Resume

Posted on 2010-11-20 | Category: Uncategorized

Author: zhangjiejing <kzjeef#gmail.com> thinksrc.com

Abstract
I18N
Version
Introduction to suspend
Normal Linux Suspend
Android Suspend

Abstract

Suspend and resume is a major feature of the Linux kernel, and it is becoming more and more useful as mobile and quick-start requirements grow. This post gives the big picture of Linux suspend and resume and of how Android power management works.

I18N

English version: link
Chinese version: link

Version

Linux Kernel: v2.6.28
Android: v2.0

Introduction to suspend

Suspend has three major parts:

1. Freezing processes and tasks
2. Calling every driver's suspend callback
3. Suspending the CPU and core system devices

Freezing a process is like stopping it; on resume it continues executing as if it had never stopped. User-space processes and kernel tasks never notice the stop; they are as unaware as babies.

How does a user make Linux suspend? By reading and writing the sysfs file /sys/power/state, which exposes the kernel's power management (PM) service. For example:

# echo standby > /sys/power/state

makes the system suspend, and

# cat /sys/power/state

lists the PM states your kernel supports.
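A program that wants to check for a state before writing it can parse the text read from /sys/power/state. The sketch below only handles the parsing step, assuming the file content is a whitespace-separated list of state names; the function name is hypothetical, and the sample string in the usage note is an assumed example, not output captured from a real device.

```c
#include <string.h>

/* Return 1 if the whitespace-separated list `states` (the text read from
 * /sys/power/state) contains the exact word `name`, else 0. */
int pm_state_supported(const char *states, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = states;

    if (name_len == 0)
        return 0;

    while (*p) {
        size_t word_len;

        p += strspn(p, " \t\n");           /* skip separators */
        word_len = strcspn(p, " \t\n");    /* length of the next word */
        if (word_len == name_len && strncmp(p, name, name_len) == 0)
            return 1;
        p += word_len;
    }
    return 0;
}
```

For example, with the assumed content "standby mem disk\n", pm_state_supported() returns 1 for "mem" and 0 for "me", because only whole words match.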

Normal Linux Suspend

Files:

The paths below are relative to a standard Linux source tree:

linux_source/kernel/power/main.c
linux_source/arch/xxx/mach-xxx/pm.c

Let's see how this happens. The user-space interface /sys/power/state is handled by state_store() in main.c. You can write any of the strings defined in const char * const pm_states[], such as "mem" or "standby". In a normal Linux kernel, state_store() then calls enter_state() in main.c, which first validates the state and syncs the file systems. Below is the source code:

/**
 *      enter_state - Do common work of entering low-power state.
 *      @state:         pm_state structure for state we're entering.
 *
 *      Make sure we're the only ones trying to enter a sleep state. Fail
 *      if someone has beat us to it, since we don't want anything weird to
 *      happen when we wake up.
 *      Then, do the setup for suspend, enter the state, and cleaup (after
 *      we've woken up).
 */
static int enter_state(suspend_state_t state)
{
  int error;

  if (!valid_state(state))
    return -ENODEV;

  if (!mutex_trylock(&pm_mutex))
    return -EBUSY;

  printk(KERN_INFO "PM: Syncing filesystems ... ");
  sys_sync();
  printk("done.\n");

  pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
  error = suspend_prepare();
  if (error)
    goto Unlock;

  if (suspend_test(TEST_FREEZER))
    goto Finish;

  pr_debug("PM: Entering %s sleep\n", pm_states[state]);
  error = suspend_devices_and_enter(state);

 Finish:
  pr_debug("PM: Finishing wakeup.\n");
  suspend_finish();
 Unlock:
  mutex_unlock(&pm_mutex);
  return error;
}

Prepare, Freezing Process

enter_state() then calls suspend_prepare(). This function allocates a console for suspend, runs the suspend notifiers, disables the user mode helper, and calls suspend_freeze_processes() to freeze all processes, which makes every process save its current state. During the freeze stage a task or user-space process may refuse to freeze; in that case the freeze is aborted and all processes are thawed.

/**
 *      suspend_prepare - Do prep work before entering low-power state.
 *
 *      This is common code that is called for each state that we're entering.
 *      Run suspend notifiers, allocate a console and stop all processes.
 */
static int suspend_prepare(void)
{
  int error;
  unsigned int free_pages;
  if (!suspend_ops || !suspend_ops->enter)
    return -EPERM;

  pm_prepare_console();

  error = pm_notifier_call_chain(PM_SUSPEND_PREPARE);
  if (error)
    goto Finish;

  error = usermodehelper_disable();
  if (error)
    goto Finish;

  if (suspend_freeze_processes()) {
    error = -EAGAIN;
    goto Thaw;
  }

  free_pages = global_page_state(NR_FREE_PAGES);
  if (free_pages < FREE_PAGE_NUMBER) {
    pr_debug("PM: free some memory\n");
    shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
    if (nr_free_pages() < FREE_PAGE_NUMBER) {
      error = -ENOMEM;
      printk(KERN_ERR "PM: No enough memory\n");
    }
  }
  if (!error)
    return 0;

 Thaw:
  suspend_thaw_processes();
  usermodehelper_enable();
 Finish:
  pm_notifier_call_chain(PM_POST_SUSPEND);
  pm_restore_console();
  return error;
}

Suspend Devices

At this point all other processes (processes, workqueues, kthreads) are stopped. They may be holding semaphores, so waiting for one of them in a driver's suspend function is a deadlock. The kernel then frees some memory for later use.

Finally, enter_state() calls suspend_devices_and_enter() to suspend all devices. This function first calls suspend_ops->begin() if the machine provides it. Then device_suspend() in drivers/base/power/main.c is called; it calls dpm_suspend() to walk the device list and invoke each device's suspend() callback. After the devices are suspended, suspend_ops->prepare() lets the machine do any machine-specific preparation (it may be empty on some machines). Non-boot CPUs are then disabled to avoid race conditions, so from here on only one CPU is running. suspend_ops is the machine-specific set of PM operations, normally registered by arch/xxx/mach-xxx/pm.c.

Next, suspend_enter() is called. It disables the architecture's IRQs and calls device_power_down(), which invokes every suspend_late() callback, the last callback before the system halts. It then suspends all system devices (I guess this means everything under /sys/devices/system/*) and calls suspend_ops->enter() to put the CPU into a power-saving mode. The system stops here; that is, code execution stops at this point.

/**
 *      suspend_devices_and_enter - suspend devices and enter the desired system
 *                                  sleep state.
 *      @state:           state to enter
 */
int suspend_devices_and_enter(suspend_state_t state)
{
  int error, ftrace_save;

  if (!suspend_ops)
    return -ENOSYS;

  if (suspend_ops->begin) {
    error = suspend_ops->begin(state);
    if (error)
      goto Close;
  }
  suspend_console();
  ftrace_save = __ftrace_enabled_save();
  suspend_test_start();
  error = device_suspend(PMSG_SUSPEND);
  if (error) {
    printk(KERN_ERR "PM: Some devices failed to suspend\n");
    goto Recover_platform;
  }
  suspend_test_finish("suspend devices");
  if (suspend_test(TEST_DEVICES))
    goto Recover_platform;

  if (suspend_ops->prepare) {
    error = suspend_ops->prepare();
    if (error)
      goto Resume_devices;
  }

  if (suspend_test(TEST_PLATFORM))
    goto Finish;

  error = disable_nonboot_cpus();
  if (!error && !suspend_test(TEST_CPUS))
    suspend_enter(state);

  enable_nonboot_cpus();
 Finish:
  if (suspend_ops->finish)
    suspend_ops->finish();
 Resume_devices:
  suspend_test_start();
  device_resume(PMSG_RESUME);
  suspend_test_finish("resume devices");
  __ftrace_enabled_restore(ftrace_save);
  resume_console();
 Close:
  if (suspend_ops->end)
    suspend_ops->end();
  return error;

 Recover_platform:
  if (suspend_ops->recover)
    suspend_ops->recover();
  goto Resume_devices;
}

Resume

If the system is woken by an interrupt or another event, code execution continues. The first thing resume does is resume the devices under /sys/devices/system/ and re-enable IRQs. It then enables the non-boot CPUs and calls suspend_ops->finish() to tell the machine that resume is starting. suspend_devices_and_enter() then calls every device's resume() function, resumes the console, and finally calls suspend_ops->end().

Back in enter_state(): after suspend_devices_and_enter() returns, the devices are running but user-space processes and tasks are still frozen. enter_state() then calls suspend_finish(), which thaws the processes, re-enables the user mode helper, notifies all PM listeners that the system has left the suspend stage, and restores the console. This is the standard Linux suspend and resume sequence.

Android Suspend

In an Android-patched kernel, state_store() calls request_suspend_state() in kernel/power/earlysuspend.c instead, because Android adds the early suspend and wake lock features to the kernel. To understand this in detail, let's first introduce the new features Android brings in.

Files:

linux_source/kernel/power/main.c
linux_source/kernel/power/earlysuspend.c
linux_source/kernel/power/wakelock.c

Features

Early Suspend

Early suspend is a mechanism that Android introduced into the Linux kernel. It is a state between screen off and real suspend. After the screen turns off, devices such as the LCD backlight, g-sensor and touchscreen are stopped, both to save battery and for functional reasons.

Late Resume

Late resume is the mechanism paired with early suspend. It runs after the kernel and system resume have finished and resumes the devices that were suspended during early suspend.

Wake Lock

Wake locks are at the core of the Android power management system. A wake lock can be held by kernel code, system services and applications, with or without a timeout. An Android-patched Linux kernel (the Android kernel below) keeps statistics on how many locks are held and for how long. When no wake lock of the type that prevents suspend (WAKE_LOCK_SUSPEND) is held, the Android kernel calls the Linux suspend path (pm_suspend()) to suspend the entire system.

Android Suspend

When the user writes "mem" or "standby" to /sys/power/state, state_store() is called. It calls request_suspend_state(), which checks the state; if the request is to suspend, it queues early_suspend_work, which runs early_suspend().

void request_suspend_state(suspend_state_t new_state)
{
  unsigned long irqflags;
  int old_sleep;

  spin_lock_irqsave(&state_lock, irqflags);
  old_sleep = state & SUSPEND_REQUESTED;
  if (debug_mask & DEBUG_USER_STATE) {
    struct timespec ts;
    struct rtc_time tm;
    getnstimeofday(&ts);
    rtc_time_to_tm(ts.tv_sec, &tm);
    pr_info("request_suspend_state: %s (%d->%d) at %lld "
	    "(%d-%02d-%02d %02d:%02d:%02d.%09lu UTC)\n",
	    new_state != PM_SUSPEND_ON ? "sleep" : "wakeup",
	    requested_suspend_state, new_state,
	    ktime_to_ns(ktime_get()),
	    tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
	    tm.tm_hour, tm.tm_min, tm.tm_sec, ts.tv_nsec);
  }
  if (!old_sleep && new_state != PM_SUSPEND_ON) {
    state |= SUSPEND_REQUESTED;
    queue_work(suspend_work_queue, &early_suspend_work);
  } else if (old_sleep && new_state == PM_SUSPEND_ON) {
    state &= ~SUSPEND_REQUESTED;
    wake_lock(&main_wake_lock);
    queue_work(suspend_work_queue, &late_resume_work);
  }
  requested_suspend_state = new_state;
  spin_unlock_irqrestore(&state_lock, irqflags);
}

Early Suspend

early_suspend() first checks whether the state is still a suspend request (the request may have been canceled in the meantime); if not, the work item quits. Otherwise it walks the list of registered early suspend handlers and calls each handler's suspend(). It then syncs the file systems and, most importantly, releases main_wake_lock. This wake lock is used by the wake lock code itself and by early suspend. It has no timeout, so while it is held the system does not suspend even if no other wake lock is active. The system suspend itself is not called here: because early suspend has released main_wake_lock, the wake lock code can now decide whether to suspend the system.

static void early_suspend(struct work_struct *work)
{
  struct early_suspend *pos;
  unsigned long irqflags;
  int abort = 0;

  mutex_lock(&early_suspend_lock);
  spin_lock_irqsave(&state_lock, irqflags);
  if (state == SUSPEND_REQUESTED)
    state |= SUSPENDED;
  else
    abort = 1;
  spin_unlock_irqrestore(&state_lock, irqflags);

  if (abort) {
    if (debug_mask & DEBUG_SUSPEND)
      pr_info("early_suspend: abort, state %d\n", state);
    mutex_unlock(&early_suspend_lock);
    goto abort;
  }

  if (debug_mask & DEBUG_SUSPEND)
    pr_info("early_suspend: call handlers\n");
  list_for_each_entry(pos, &early_suspend_handlers, link) {
    if (pos->suspend != NULL)
      pos->suspend(pos);
  }
  mutex_unlock(&early_suspend_lock);

  if (debug_mask & DEBUG_SUSPEND)
    pr_info("early_suspend: sync\n");

  sys_sync();
 abort:
  spin_lock_irqsave(&state_lock, irqflags);
  if (state == SUSPEND_REQUESTED_AND_SUSPENDED)
    wake_unlock(&main_wake_lock);
  spin_unlock_irqrestore(&state_lock, irqflags);
}

Late Resume

After the kernel resume has finished, user-space processes and services are running again. The system may have been woken for these reasons:

Incoming call: the modem sends a command (RING) to rild, and rild sends a message to WindowManager and the application to handle the incoming call event. PowerManagerService also writes "on" to the interface to make the kernel execute late resume.
User key event: when the system is woken by a key event, such as the power key or menu key, the event is sent to WindowManager, which handles it. If the key is not one that may wake the system, such as the return key or home key, WindowManager drops the wake lock so the system suspends again. If the key is a wake key, WindowManager calls PowerManagerService over RPC to execute late resume.

Late resume calls the resume function of every device in the early suspend list.

static void late_resume(struct work_struct *work)
{
  struct early_suspend *pos;
  unsigned long irqflags;
  int abort = 0;

  mutex_lock(&early_suspend_lock);
  spin_lock_irqsave(&state_lock, irqflags);
  if (state == SUSPENDED)
    state &= ~SUSPENDED;
  else
    abort = 1;
  spin_unlock_irqrestore(&state_lock, irqflags);

  if (abort) {
    if (debug_mask & DEBUG_SUSPEND)
      pr_info("late_resume: abort, state %d\n", state);
    goto abort;
  }
  if (debug_mask & DEBUG_SUSPEND)
    pr_info("late_resume: call handlers\n");
  list_for_each_entry_reverse(pos, &early_suspend_handlers, link)
    if (pos->resume != NULL)
      pos->resume(pos);
  if (debug_mask & DEBUG_SUSPEND)
    pr_info("late_resume: done\n");
 abort:
  mutex_unlock(&early_suspend_lock);
}

Wake Lock

Let's see how the wake lock mechanism works, focusing on wakelock.c. A wake lock has two states, locked and unlocked. There are two ways to lock:

Unlimited lock: this kind of lock stays locked until someone calls unlock.
Wake lock with timeout: this kind of lock is taken with a timeout; when the time expires, it unlocks automatically.

There are also two types of lock:

WAKE_LOCK_SUSPEND: this type of lock prevents the system from suspending.
WAKE_LOCK_IDLE: this type of lock does not prevent the system from suspending, and it is not a lock that can wake the system. I can't figure out why this lock exists.

In the wake lock functions there are three entry points that can queue the suspend() work:
1. In wake_unlock(): if no wake lock remains after the unlock, suspend starts.
2. When a timeout timer expires, its callback checks whether any wake lock remains; if none, the system goes to suspend.
3. In wake_lock(): after the lock is added successfully, it checks whether there is any wake lock, and if there is none, it goes to suspend. I think this check is unnecessary; the better way is to make wake_lock() and wake_unlock() atomic, since even with this check an unlock can still be missed.
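The first entry point, "suspend when the last lock is released", can be modeled in a few lines of user-space C. This is a toy model under my own assumptions, not the kernel's wakelock.c: it ignores timeouts, lock types and locking, and every name in it is hypothetical.

```c
/* Toy user-space model of suspend-blocking wake locks.
 * All names are hypothetical; this is not the kernel implementation. */
struct toy_wake_locks {
    int active;            /* number of currently held locks */
    int suspend_requests;  /* how many times suspend would have been queued */
};

/* Take one lock; suspend is blocked while active > 0. */
void toy_wake_lock(struct toy_wake_locks *w)
{
    w->active++;
}

/* Release one lock; when the last one goes away, "queue" a suspend. */
void toy_wake_unlock(struct toy_wake_locks *w)
{
    if (w->active == 0)
        return;            /* unbalanced unlock: ignore */
    if (--w->active == 0)
        w->suspend_requests++;
}

/* Nonzero when nothing prevents suspend. */
int toy_can_suspend(const struct toy_wake_locks *w)
{
    return w->active == 0;
}
```

In the model, taking two locks and releasing one still blocks suspend; only the second release queues it. The real code additionally has to cope with timers expiring and with lock and unlock racing on different CPUs, which is where the atomicity concern in item 3 comes from.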
Wakelock debug

Wake lock debug information can be enabled at runtime as shown below. It prints every wake lock acquire and release to the console, which is very useful when debugging suspend/resume issues on Android.

echo 15 > /sys/module/wakelock/parameters/debug_mask

Suspend

When the wake lock code queues the suspend work, suspend() is called. It checks the wake locks, syncs the file systems, and then calls pm_suspend() -> enter_state() to enter the standard Linux suspend sequence.

static void suspend(struct work_struct *work)
{
	int ret;
	int entry_event_num;

	if (has_wake_lock(WAKE_LOCK_SUSPEND)) {
		if (debug_mask & DEBUG_SUSPEND)
			pr_info("suspend: abort suspend\n");
		return;
	}

	entry_event_num = current_event_num;
	sys_sync();
	if (debug_mask & DEBUG_SUSPEND)
		pr_info("suspend: enter suspend\n");
	ret = pm_suspend(requested_suspend_state);
	if (current_event_num == entry_event_num) {
		wake_lock_timeout(&unknown_wakeup, HZ / 2);
	}
}

Differences from Standard Linux Suspend

pm_suspend() calls enter_state() to enter the suspend state, but the sequence is not 100% the same as in a standard kernel:

When freezing processes, Android checks whether any wake lock is held; if so, the suspend sequence is interrupted.
The suspend_late callback performs a final wake lock check. If a driver or a frozen process holds a wake lock, it returns an error, which makes the system resume. This can be a problem in some situations, but the check is unavoidable, since callers of wake_lock() normally do not check the return value. So a process may start freezing without a wake lock and acquire one during the freeze (I'm not sure whether this actually happens).

If pm_suspend() succeeds, later log output is not seen until the system has resumed successfully. People sometimes say they cannot see the log printed during suspend; sometimes an error on resume means the log is never seen. That makes suspend errors hard to debug. The log during suspend can be printed to the console by adding "no_console_suspend" to the kernel command line (thanks, kasim).

For more detail on Linux suspend, see http://kerneltrap.org/node/14004

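UART: hardware flow control, the story of CTS/RTS

Posted on 2010-11-08 | Category: Uncategorized

These days I have been bringing up Bluetooth (BT below). The BT chip is connected to the SoC over UART, the well-known hci_uart setup. The chip is an AR3001. Getting it to work was a winding road, so here is a rough summary.

When I first got the hardware I did not know UART well; I could not even tell male connectors from female. At first I wired the UART on the dev board to the UART on the BT module with jumper wires, five of them: RX, TX, CTS, RTS and GND. That is receive, transmit, CTS (Clear to Send, the flow-control line driven by the BT chip), RTS (Request to Send, the flow-control line driven by the dev board), and ground. The first thing to get straight is which side drives which line, so a picture makes it clearer:

The arrows in the figure show the direction of the signal: RTS is driven by the CPU and CTS by the BT chip, and the two are cross-connected. RX and TX are pins 2 and 3 of RS232, and CTS and RTS are pins 7 and 8. For the exact pins, check your own schematic, and check it carefully; I lost a lot of time on this.

Another important point is hardware flow control. Which hardware? The UART controllers on both sides. Without flow control, a UART link is like a two-way pipe: each side stuffs data in whenever it has some, whether or not the other side has room. A UART has a FIFO, and it is usually not large; 32 bytes is common, since making it bigger brings its own problems. If you send too fast, the receiving UART overruns: it cannot take the data in time. What does the UART do then? It drops it, ignoring the data that arrives afterwards. You may have seen this when pasting data into a terminal too quickly and finding the pasted data garbled; that can be the result of flow control being off. Note that it is the receiver that needs flow control. The sender does not, because when it sees its FIFO is full it can simply keep the data in memory for now.

So how does flow control improve things?

The sender sends as fast as it can. When the receiver cannot take any more, it pulls its RTS high (RTS and CTS are both active low; there are only two levels, high and low, it is binary after all), telling the sender to stop. The sender obediently stops. When the receiver has caught up and can accept data again, it pulls RTS low and transmission continues. All of this is done by hardware; while it is enabled, software cannot drive RTS by hand.

Sometimes, though, we need to drive RTS and read CTS ourselves. Take the AR3001: to save power it sleeps a lot, and you want to wake it up, because you want to listen to music, transfer files, get online with your phone, and so on. The only line we control is RTS; a short pulse on it is treated as an interrupt on the BT chip side and wakes it up, and then operation can continue. To drive that line you first have to turn hardware flow control off. How? In the driver, go through the tty driver's termios interface and clear the CRTSCTS bit in c_cflag. That frees RTS for manual control. And how is RTS driven? Here is the code: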

static int ath_wakeup_ar3k(struct tty_struct *tty)
{
   struct termios settings;
   int status = tty->driver->ops->tiocmget(tty, NULL);

   if (status & TIOCM_CTS)
      return status;

   /* Disable Automatic RTSCTS */
   n_tty_ioctl_helper(tty, NULL, TCGETS, (unsigned long)&settings);
   settings.c_cflag &= ~CRTSCTS;
   n_tty_ioctl_helper(tty, NULL, TCSETS, (unsigned long)&settings);

   /* Clear RTS first */
   status = tty->driver->ops->tiocmget(tty, NULL);
   tty->driver->ops->tiocmset(tty, NULL, 0x00, TIOCM_RTS);
   mdelay(20);

   /* Set RTS, wake up board */
   status = tty->driver->ops->tiocmget(tty, NULL);
   tty->driver->ops->tiocmset(tty, NULL, TIOCM_RTS, 0x00);
   mdelay(20);

   status = tty->driver->ops->tiocmget(tty, NULL);

   n_tty_ioctl_helper(tty, NULL, TCGETS, (unsigned long)&settings);
   settings.c_cflag |= CRTSCTS;
   n_tty_ioctl_helper(tty, NULL, TCSETS, (unsigned long)&settings);

   return status;
}

When you are done, restore flow control, as this code does. The code is taken from http://www.linuxhq.com/kernel/v2.6/36-rc2/drivers/bluetooth/hci_ath.c, the driver for this AR3001.
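The same sequence can be tried from user space on an open serial port, which is handy for checking whether a UART driver really honors flow-control changes. This is a sketch under assumptions, not part of hci_ath.c: pulse_rts() is a hypothetical name, the 20 ms delays simply mirror the driver code, and it uses the standard termios calls plus the TIOCMBIC/TIOCMBIS modem-control ioctls.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for CRTSCTS and usleep() on glibc */
#endif
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

/* Pulse RTS on an open serial port: disable hardware flow control,
 * clear RTS, set it again, then restore the original settings.
 * Returns 0 on success, -1 on error. Hypothetical helper. */
int pulse_rts(int fd)
{
    struct termios saved, manual;
    int rts = TIOCM_RTS;
    int ret = -1;

    if (tcgetattr(fd, &saved) < 0)
        return -1;

    manual = saved;
    manual.c_cflag &= ~CRTSCTS;              /* take RTS away from the hardware */
    if (tcsetattr(fd, TCSANOW, &manual) < 0)
        return -1;

    if (ioctl(fd, TIOCMBIC, &rts) == 0) {    /* clear the RTS bit */
        usleep(20 * 1000);
        if (ioctl(fd, TIOCMBIS, &rts) == 0) {  /* set the RTS bit again */
            usleep(20 * 1000);
            ret = 0;
        }
    }

    /* restore the saved flow-control settings whatever happened above */
    if (tcsetattr(fd, TCSANOW, &saved) < 0)
        ret = -1;
    return ret;
}
```

If the UART driver does not implement switching CRTSCTS at runtime, as was the case with the imx driver described here, the ioctl calls may succeed while the pin never moves, so the result is best confirmed on a scope or by watching whether the chip wakes up.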

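You will also notice that this goes through the tty driver. If you hit problems, you need to dig into the tty driver as well and check its operations against the reference manual. The problem I hit was that the imx UART driver was not quite standard: enabling and disabling hardware flow control was not implemented. After a lot of fiddling, the symptom was a UART that worked on and off, like constipation.

Once I had written the code to enable and disable flow control, everything worked. It may be just a small UART driver, but if it is not standard it can cause a lot of trouble!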
