Theme Five: Overusing Technology
Test automation is based on a simple economic proposition:
· If a manual test costs $X to run the first time, it will cost just about $X to run each time thereafter, whereas:
· If an automated test costs $Y to create, it will cost almost nothing to run from then on.
$Y is bigger than $X. I've heard estimates ranging from 3 to 30 times as big, with the most commonly cited number seeming to be 10. Suppose 10 is correct for your application and your automation tools. Then you should automate any test that will be run more than 10 times.
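To make the arithmetic concrete, here is a minimal sketch of the break-even calculation (Python, with costs normalized so $X = 1, and the commonly cited 10x ratio assumed):

    # Break-even sketch: manual cost $X per run vs. automation cost $Y up
    # front. Values are illustrative, using the 10x ratio cited above.
    X = 1.0               # cost of one manual run
    Y = 10.0              # one-time cost to automate (Y = 10 * X)

    for runs in (5, 10, 30):
        manual_total = runs * X      # every manual run costs X again
        automated_total = Y          # reruns assumed to cost ~nothing
        winner = "automate" if automated_total < manual_total else "manual"
        print(f"{runs} runs: manual ${manual_total:.0f}, "
              f"automated ${automated_total:.0f} -> {winner}")

At 5 runs, manual testing is cheaper; at 10 the costs tie; beyond that, automation wins.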
A classic mistake is to ignore these economics, attempting to automate all tests, even those that won't be run often enough to justify it. What tests clearly justify automation?
· Stress or load tests may be impossible to implement manually. Would you have a tester execute and check a function 1000 times? Are you going to sit 100 people down at 100 terminals?
· Nightly builds are becoming increasingly common. (See [McConnell96] or [Cusumano95] for descriptions of the procedure.) If you build the product nightly, you must have an automated "smoke test suite". Smoke tests are those that are run after every build to check for grievous errors.
· Configuration tests may be run on dozens of configurations.
The other kinds of tests are less clear-cut. Think hard about whether you'd rather have automated tests that are run often or ten times as many manual tests, each run once. Beware of irrational, emotional reasons for automating, such as testers who find programming automated tests more fun, a perception that automated tests will lead to higher status (everything else is "monkey testing"), or a fear of not rerunning a test that would have found a bug (thus leading you to automate it, leaving you without enough time to write a test that would have found a different bug).
You will likely end up in a compromise position, where you have:
1. a set of automated tests that are run often.
2. a well-documented set of manual tests. Subsets of these can be rerun as necessary. For example, when a critical area of the system has been extensively changed, you might rerun its manual tests. You might run different samples of this suite after each major build.
3. a set of undocumented tests that were run once (including exploratory "bug bash" tests).
Beware of expecting to rerun all manual tests. You will become bogged down rerunning tests with low bug-finding value, leaving yourself no time to create new tests. You will waste time documenting tests that don't need to be documented.
You could automate more tests if you could lower the cost of creating them. That's the promise of using GUI capture/replay tools to reduce test creation cost. The notion is that you simply execute a manual test, and the tool records what you do. When you manually check the correctness of a value, the tool remembers that correct value. You can then later play back the recording, and the tool will check whether all checked values are the same as the remembered values.
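The mechanism can be sketched in a few lines. This is a toy illustration of the idea only, not any real tool's API; ToyApp and its deliver and read methods are all hypothetical:

    # A recording is a list of steps: input events to re-deliver, plus the
    # values the tester checked during capture (the "remembered" values).
    class ToyApp:
        """Stands in for the application under test."""
        def __init__(self):
            self.fields = {"balance": "1,000.00"}
        def deliver(self, kind, payload):
            pass  # a real tool would synthesize the mouse/keyboard event
        def read(self, field):
            return self.fields[field]

    recording = [
        ("event", "click", (120, 45)),     # a raw captured action
        ("event", "type", "12"),
        ("check", "balance", "1,000.00"),  # value judged correct at capture
    ]

    def replay(app, recording):
        for step in recording:
            if step[0] == "event":
                app.deliver(step[1], step[2])   # redo the recorded action
            else:
                _, field, expected = step
                actual = app.read(field)        # compare against memory
                status = "ok" if actual == expected else "FAIL"
                print(f"{status}: {field} = {actual!r}")

    replay(ToyApp(), recording)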
There are two variants of such tools. What I call the first generation tools capture raw mouse movements or keystrokes and take snapshots of the pixels on the screen. The second generation tools (often called "object oriented") reach into the program and manipulate underlying data structures (widgets or controls).
First generation tools produce unmaintainable tests. Whenever the screen layout changes in the slightest way, the tests break. Mouse clicks are delivered to the wrong place, and snapshots fail in irrelevant ways that nevertheless have to be checked. Because screen layout changes are common, the constant manual updating of tests becomes insupportable.
Second generation tools are applicable only to tests where the underlying data structures are useful. For example, they rarely apply to a photograph editing tool, where you need to look at an actual image - at the actual bitmap. They also tend not to work with custom controls. Heavy users of capture/replay tools seem to spend an inordinate amount of time trying to get the tool to deal with the special features of their program - which raises the cost of test automation.
Second generation tools do not guarantee maintainability either. Suppose a radio button is changed to a pulldown list. All of the tests that use the old controls will now be broken.
GUI interface changes are of course common, especially between releases. Consider carefully whether an automated test that must be recaptured after GUI changes is worth having. Keep in mind that it can be hard to figure out what a captured test is attempting to accomplish unless it is separately documented.
As a rule of thumb, it's dangerous to assume that an automated test will pay for itself this release, so your test must be able to survive a reasonable level of GUI change. I believe that capture/replay tests, of either generation, are rarely robust enough.
An alternative approach to capture/replay is scripting tests. (Most GUI capture/replay tools also allow scripting.) Some member of the testing team writes a "test API" (application programmer interface) that lets other members of the team express their tests in less GUI-dependent terms. Whereas a captured test might look like this:
· text $main.accountField "12"
click $main.OK
menu $operations
menu $withdraw
click $withdrawDialog.all
...
a script might look like this:
· select-account 12
withdraw all
...
The script commands are subroutines that perform the appropriate mouse clicks and key presses. If the API is well-designed, most GUI changes will require changes only to the implementation of functions like withdraw, not to all the tests that use them. Please note that well-designed test APIs are as hard to write as any other good API. That is, they're hard, and you shouldn't expect to get it right the first time.
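As a sketch of what such a test API might look like (Python for illustration; gui_click and gui_type are hypothetical stand-ins for whatever primitives your GUI driver actually provides):

    def gui_click(widget):
        print(f"click {widget}")          # placeholder for a real click

    def gui_type(widget, text):
        print(f"type {widget} {text!r}")  # placeholder for real keystrokes

    # The API hides the GUI details. If the Withdraw dialog is redesigned,
    # only withdraw() changes, not the tests that call it.
    def select_account(number):
        gui_type("$main.accountField", str(number))
        gui_click("$main.OK")

    def withdraw(amount):
        gui_click("$operations")          # menu navigation, as in the
        gui_click("$withdraw")            # captured script above
        if amount == "all":
            gui_click("$withdrawDialog.all")

    # The test itself now reads like the abstract script:
    select_account(12)
    withdraw("all")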
In a variant of this approach, the tests are data-driven. The tester provides a table describing key values. Some tool reads the table and converts it to the appropriate mouse clicks. The table is even less vulnerable to GUI changes because the sequence of operations has been abstracted away. It's also likely to be more understandable, especially to domain experts who are not programmers. See [Pettichord96] for an example of data-driven automated testing.
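Here is a minimal sketch of the data-driven idea; the column meanings are hypothetical, and the API functions are stubbed so the example stands alone:

    def select_account(number):
        print(f"select-account {number}")  # would call the test API above

    def withdraw(amount):
        print(f"withdraw {amount}")

    # The table is the test. Domain experts can add rows without knowing
    # anything about the GUI or the driver.
    table = [
        # (account, amount)
        (12, "all"),
        (31, "100"),
    ]

    for account, amount in table:          # the driver: row -> API calls
        select_account(account)
        withdraw(amount)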
Note that these more abstract tests (whether scripted or data-driven) do not necessarily test the user interface thoroughly. If the Withdraw dialog can be reached via several routes (toolbar, menu item, hotkey), you don't know whether each route has been tried. You need a separate (most likely manual) effort to ensure that all the GUI components are connected correctly.
Whatever approach you take, don't fall into the trap of expecting regression tests to find a high proportion of new bugs. Regression tests discover that new or changed code breaks what used to work. While that happens more often than any of us would like, most bugs are in the product's new or intentionally changed behavior. Those bugs have to be caught by new tests.
Code coverage
GUI capture/replay testing is appealing because it's a quick fix for a difficult problem. Another class of tool has the same kind of attraction.
The difficult problem is that it's so hard to know if you're doing a good job testing. You only really find out once the product has shipped. Understandably, this makes managers uncomfortable. Sometimes you find them embracing code coverage with the devotion that only simple numbers can inspire. Testers sometimes also become enamored of coverage, though their romance tends to be less fervent and ends sooner.
What is code coverage? It is any of a number of measures of how thoroughly code is exercised. One common measure counts how many statements have been executed by any test. The appeal of such coverage is twofold:
1. If you've never exercised a line of code, you surely can't have found any of its bugs. So you should design tests to exercise every line of code.
2. Test suites are often too big, so you should throw out any test that doesn't add value. A test that adds no new coverage adds no value.
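Before weighing those claims, it may help to see concretely what a statement-coverage measure collects. Here is a minimal sketch using Python's built-in trace hook; real coverage tools are more elaborate, but the bookkeeping is the same idea:

    import sys

    executed = set()                    # (filename, line) pairs seen

    def tracer(frame, event, arg):
        if event == "line":
            executed.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer

    def divide(a, b):
        if b == 0:
            return None
        return a / b

    sys.settrace(tracer)
    divide(10, 2)                       # one test, exercising one path
    sys.settrace(None)

    print(f"{len(executed)} statements executed")
    # A report comparing 'executed' against all statements would flag the
    # 'return None' line as unexercised, i.e. no test divides by zero.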
Only the first sentences in (1) and (2) are true. I'll illustrate with this picture, where the irregular splotches indicate bugs:
If you write only the tests needed to satisfy coverage, you'll find bugs. You're guaranteed to find the code that always fails, no matter how it's executed. But most bugs depend on how a line of code is executed. For example, code with an off-by-one error fails only when you exercise a boundary. Code with a divide-by-zero error fails only if you divide by zero. Coverage-adequate tests will find some of these bugs, by sheer dumb luck, but not enough of them. To find enough bugs, you have to write additional tests that "redundantly" execute the code.
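To make the off-by-one case concrete, here is a hypothetical function whose tests achieve complete statement coverage and still miss the bug:

    def item_at(items, i):
        """Return items[i], or None when i is out of range."""
        if 0 <= i <= len(items):       # bug: should be i < len(items)
            return items[i]
        return None

    # These two tests execute every statement, i.e. 100% statement coverage:
    assert item_at(["a", "b"], 1) == "b"    # in range
    assert item_at(["a", "b"], -1) is None  # below range
    # Yet the boundary is untried: item_at(["a", "b"], 2) raises IndexError
    # instead of returning None. Coverage was satisfied; the bug survived.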
For the same reason, removing tests from a regression test suite just because they don't add coverage is dangerous. The point is not to cover the code; it's to have tests that can discover enough of the bugs that are likely to be caused when the code is changed. Unless the tests are ineptly designed, removing tests will just remove power. If they are ineptly designed, using coverage converts a big and lousy test suite to a small and lousy test suite. That's progress, I suppose, but it's addressing the wrong problem.
A grave danger of code coverage is that it is concrete, objective, and easy to measure. Many managers today are using coverage as a performance goal for testers. Unfortunately, a cardinal rule of management applies here: "Tell me how a person is evaluated, and I'll tell you how he behaves." If a person is evaluated by how much coverage is achieved in a given time (or in how little time it takes to reach a particular coverage goal), that person will tend to write tests to achieve high coverage in the fastest way possible. Unfortunately, that means shortchanging careful test design that targets bugs, and it certainly means avoiding in-depth, repetitive testing of "already covered" code.
Using coverage as a test design technique works only when the testers are both designing poor tests and testing redundantly. They'd be better off at least targeting their poor tests at new areas of code. In more normal situations, coverage as a guide to design only decreases the value of the tests or puts testers under unproductive pressure to meet unhelpful goals.
Coverage does play a role in testing, not as a guide to test design, but as a rough evaluation of it. After you've run your tests, ask what their coverage is. If certain areas of the code have no or low coverage, you're sure to have tested them shallowly. If that wasn't intentional, you should improve the tests by rethinking their design. Coverage has told you where your tests are weak, but it's up to you to understand how.
You might not entirely ignore coverage. You might glance at the uncovered lines of code (possibly assisted by the programmer) to discover the kinds of tests you omitted. For example, you might scan the code to determine that you undertested a dialog box's error handling. Having done that, you step back and think of all the user errors the dialog box should handle, not how to provoke the error checks on lines 343, 354, and 399. By rethinking design, you'll not only execute those lines, you might also discover that several other error checks are entirely missing. (Coverage can't tell you how well you would have exercised needed code that was left out of the program.)
There are types of coverage that point more directly to design mistakes than statement coverage does (branch coverage, for example). However, none - and not all of them put together - are so accurate that they can be used as test design techniques.
One final note: Romances with coverage don't seem to end with the former devotee wanting to be "just good friends". When, at the end of a year's use of coverage, it has not solved the testing problem, I find testing groups abandoning coverage entirely. That's a shame. When I test, I spend somewhat less than 5% of my time looking at coverage results, rethinking my test design, and writing some new tests to correct my mistakes. It's time well spent.