Experiences with SAL/PREfast

Peter Gutmann, pgut001@cs.auckland.ac.nz
http://www.cs.auckland.ac.nz/~pgut001/pubs/sal.html

Code Analysis Tools

Code analysis tools (or more generally formal methods) have had a long and rather troubled history that may be summarised briefly by saying that they're so incredibly painful to use that almost no-one does. This has led to Eric's Razor (after Eric S. Raymond, author of “The Cathedral and the Bazaar”):

In order to gain user acceptance, a software assurance tool mustn't require that you re-invent your entire universe before allowing you to write 'Hello, world'.

PREfast is a (rare) example of such a tool. Another one is the open-source Deputy, which I'll compare it with later on. There are also various commercial tools available, like the Coverity static code analyzer and the Fortify security analyzer, that are extremely powerful but also extremely expensive. If you're interested in this sort of thing at a more technical level, Secure Programming with Static Analysis provides a general overview, although much of the book is devoted to general secure-programming issues rather than the juicy details of how a static analyser works.

Getting a Copy

At the moment PREfast is only included in the Deluxe Professional Enterprise Team High-def Surround-sound Extreme Plus Ultra edition of Visual Studio. Luckily though there's a backdoor method of getting it that works with all versions down to the free Express edition, and (at least as of late 2007) looks like it'll continue to be supported in the future. If you download the latest Windows SDK (which is currently the Windows SDK Update for Windows Vista, but use whatever version is current at the time you read this), you can get the latest version of the development tools to go with your existing Visual Studio setup. Download the installer for the SDK and run it. You'll be given a long list of things to be installed totaling what seems like about half a terabyte of data. Deselect everything and then select only “Vista headers + libraries” and “VC++ compilers” (but omitting the IA64 one). The total download size should now be about 50MB, or about 10-15MB less if you don't care about the x86-64 stuff (note that if you're using the free Express edition you definitely don't want to bother getting the x86-64 stuff since it won't work with that).

Go through the install and then fire up Visual Studio. Open up Tools | Options | Projects and Solutions | VC++ Directories, and in “Executable Files” add the path to the compiler tools in the SDK (which should be something like /Program Files/Microsoft SDKs/Windows/v6.0/VC/Bin, not the other 'Bin' directory /Program Files/Microsoft SDKs/Windows/v6.0/Bin, which lacks the VC in the path). Then in “Include Files” add the path to the headers. In this case there are two directories to add, first something like /Program Files/Microsoft SDKs/Windows/v6.0/VC/INCLUDE for the standard C headers and then something like /Program Files/Microsoft SDKs/Windows/v6.0/Include/ for the Win32 headers (if you don't add both of these then you won't get the SAL-annotated versions, since Visual Studio will fall back to the default non-annotated ones). Finally, add the library files in the same location. These should be the first entries in each list, before all of the existing entries.

Now you're ready to go, using the PREfast-enabled compiler from the SDK rather than the default that comes with the IDE. Open the project properties and click on “Configuration Manager”. In the “Active solution configuration” combobox select “<New…>”. Give the new configuration a name like “Analyse” (if you're from the US and spell things funny then you can use “Analyze” instead) and copy across the settings from some convenient existing configuration like “Debug”.

Finally, again in the project properties, go to Configuration Properties | C/C++ | Command Line, and in the “Additional Options” box add /analyze.

That's it, you're now set to run PREfast when you select the “Analyse” configuration.

Using SAL

So you've installed PREfast and fired up your project in Visual Studio. There's an immediate temptation to use the Big Bang approach and build everything in analysis mode to see what happens. If you try this you'll get about, oh, eight million warnings, and probably decide that it's not worth the trouble. This is not a good way to get started.

Instead, you're better off building your code file by file, annotating each one as you go. Open the first file and then in turn open the headers that it uses (not the system headers, Microsoft has already annotated those for you). Annotate your header files and the source file and then build just that one file. Fix any problems (if possible), or add the necessary annotations to guide PREfast if the problem is (genuinely) a false positive (more on those further on). Then do the same for the next file, until you've completed the entire project.

Unless you're an exceptionally talented programmer who's managed to write near-perfect code, this is not going to be a small task. For any reasonable-sized project, plan to spend at least a couple of weeks going through all of your code adding the necessary annotations and making code changes. On the other hand, the level of checking that you're getting by doing this will definitely pay off in the long term.

Documentation

Once we get down to the specifics of using SAL things get a bit complicated since the documentation is rather confusing. If you look at the documentation for VS 2008 you'll see that it actually talks about the low-level attribute notation used internally (or at least in very low-level headers) by the compiler, and not what you'd be using to annotate your code. The best source for proper documentation is either A Brief Introduction to the Standard Annotation Language (SAL) or the (somewhat device driver-specific) PREfast for Drivers. There's also an overview page in MSDN that doesn't use the attribute notation, but what's there doesn't correspond to what's in the SAL header file from the SDK or in the previous two linked articles. There is however an older version of the page available that does correspond to the header and the other articles. Finally, the only really complete source seems to be the code comments in the SAL header file itself, /Program Files/Microsoft SDKs/Windows/v6.0/VC/INCLUDE/sal.h.

If you look inside sal.h you'll see that there are vast numbers of possible annotations, most of which you'll probably never use. The reason for this huge list is that in order to avoid requiring users to write long strings of attributes to annotate their code, Microsoft have collected the most common combinations into a single name like “__deref_inout”, which expands to “__notnull __elem_readableTo(1) __pre __deref __valid __post __deref __valid __refparam” (phew!), each of which in turn expands further to a “__declspec(…)” in the older implementation or the “([…])” attribute notation in the newer one.

The resulting mass of annotations can be rather intimidating. Rather than trying to use SAL directly, I've found that it's easier to identify common code patterns and then create a subset of the SAL annotations that work for you, which I'll cover in the next section. This both helps manage the complexity and means that you can retarget your annotations for whatever format might be used in future versions of Visual Studio.

SAL Workflow

The first step in setting up a SAL workflow is to identify common patterns in your code that correspond to SAL annotations. For example one pattern that's used frequently is { buffer, length }, in which a function takes a buffer containing a set number of bytes and does something with it. An example of this is memcpy(), which might be annotated as:

memcpy( dest, __in_bcount( length ) src, length );

which tells PREfast that the number of bytes in src is given by length.

memcpy() also requires a second annotation to tell PREfast how many bytes of output are going to be produced. Since this is the same as the number of input bytes, the final annotation is:

memcpy( __out_bcount( length ) dest, __in_bcount( length ) src, length );

whose meaning should be fairly obvious. Using these annotations, PREfast can now check that the dest buffer is large enough to contain what's in src.

There are some other annotations that you can add as well, but some of the more obvious ones are automatic if you're reasonably careful with your function declarations. For example the length parameter would in theory need a __in annotation, but if you use const with the declaration then it's automatic.

Another common pattern is { buffer, maxLength, *length }, in which a function takes a buffer of a certain maximum length and deposits data in it, returning an indication of how many bytes were processed. Windows contains many instances of this pattern, an example being ReadFile, which might be annotated as:

ReadFile( HANDLE hFile,
          __out_bcount_part( nNumberOfBytesToRead, *lpNumberOfBytesRead ) \
          LPVOID lpBuffer,
          DWORD nNumberOfBytesToRead,
          LPDWORD lpNumberOfBytesRead,
          LPOVERLAPPED lpOverlapped );

which tells PREfast that lpBuffer has a maximum possible size of nNumberOfBytesToRead and the number of bytes actually read will be returned in *lpNumberOfBytesRead.

All in all there's a huge number of annotation combinations that are possible. You can get an idea of what's available in the MSDN summary. As you can see by cross-referencing this to sal.h, this can lead to a combinatorial explosion of annotation complexity.

In order to manage this complexity, the next step once you've identified useful patterns is to reduce the number of annotations to something that you can work with and (hopefully) remember. What I've found works best for me is to use the preprocessor to create names for the small subset of annotations I'll be using that are immediately meaningful to me. So instead of having to try and recall whether a parameter that returns a pointer is meant to be annotated with __out, __deref_out, __out_ecount, __deref_out_z, __deref_out_ecount, or several dozen other annotations, I can just say OUT_PTR and leave it at that. Isomorphic annotations like __deref_out_opt and __deref_opt_out are particularly entertaining, so having a single form that's meaningful for you helps a lot here.

Another advantage of using your own names is that you're not tied to the use of sal.h in all situations. In other words, by including code like the following in a global header:

#if defined( _MSC_VER ) && defined( _PREFAST_ )
  #include <sal.h>

  #define IN_BUFSIZE       __in_bcount
  […]
#else
  #define IN_BUFSIZE( size )
  […]
#endif

you can ensure that your code builds without problems whether code analysis is available or not. Here's an extract from the header file I use with SAL:

#if defined( _MSC_VER ) && defined( _PREFAST_ )
  […]
  #define IN_BUFSIZE       __in_bcount
  #define IN_BUFSIZE_OPT   __in_bcount_opt
  […]
  #define OUT_BUFSIZE      __out_bcount_part
  #define OUT_BUFSIZE_OPT  __out_bcount_part_opt
  #define OUT_PTR          __deref_out
  #define OUT_PTR_OPT      __deref_opt_out
#else
  […]
  #define IN_BUFSIZE( size )
  #define IN_BUFSIZE_OPT( size )
  […]
  #define OUT_BUFSIZE( max, size )
  #define OUT_BUFSIZE_OPT( max, size )
  #define OUT_PTR
  #define OUT_PTR_OPT
#endif /* VC++ with source analysis enabled */

This has reduced the 700-odd-line sal.h to just over a dozen annotations that I actively need and use, represented in a form that I find easy to remember.

A final advantage is that this extra level of indirection helps when the SAL annotation format changes. An earlier section mentioned the changes in the annotation format from Visual Studio 2005 to Visual Studio 2008; by using indirection you can say:

#if _MSC_VER <= 1400
  #define IN_BUFSIZE       __in_bcount
#else
  #define IN_BUFSIZE       _In_count_
#endif /* Different versions of SAL markup */

(Note that this uses the notation from the Visual Studio 2008 MSDN documentation, which doesn't correspond to any version of sal.h that I've been able to get my hands on so I haven't been able to check it).

Code Refactoring

One thing that you'll quickly notice when you start annotating your code with SAL is that (unless you're exceptionally disciplined) you've taken various shortcuts that need to be fixed. For example instead of:

int function( void *buffer, const int maxLength, int *length );

it's probably been much easier to write:

int function( void *buffer, const int maxLength );

with the length returned as the function return value and errors reported as negative values (this is a pattern carried over from the POSIX API, which really likes to do this). In other words the return value is being overloaded to perform two tasks at once. While PREfast can perform at least some checking on this sort of function, it can't perform the same level as you'd get with the function success status and the returned length information separated out (there is a __success annotation, but it seems intended to designate when by-reference parameters are modified rather than for this type of overloaded return-value handling).

This is why I mentioned earlier that you should plan to take at least a few weeks on this if you want to do the job properly. Using PREfast effectively isn't just a case of spraying annotations around your code, you actually have to put in a bit of effort to fix problem areas.

Typical Annotations

Here's a short list of typical code patterns and the annotations that you might use with them. (Note: This is taken from comments in sal.h and may not be 100% correct).

int writeData( const void *buffer, const int length );

This function accesses a block of memory of size length (without modifying it). The resulting annotation is pretty straightforward:

int writeData( __in_bcount( length ) const void *buffer, const int length );

A simple variation on this is the read/write version of the above:

int readData( void *buffer, const int maxLength, int *length );

This function takes a block of memory of up to maxLength bytes and returns the byte count in length:

int readData( __out_bcount_part( maxLength, *length ) void *buffer, const int maxLength, int *length );

The next function returns a pointer to a list element:

int getListPointer( void **listPtrPtr );

The annotation for this is:

int getListPointer( __deref_out void **listPtrPtr );

to indicate that the result is stored in *listPtrPtr, the dereference of listPtrPtr (I've mapped this to OUT_PTR when I use it because this is easier to remember than __deref_out).

The next function is a standard by-reference call:

int getInfo( struct thing *thingPtr );

The annotation is again fairly straightforward:

int getInfo( __inout struct thing *thingPtr );

A final example for standard null-terminated C strings:

int writeString( const char *string );

The annotation is again straightforward:

int writeString( __in_z const char *string );

Note that PREfast can perform much better checking if you provide an explicit length (although calling strlen() to fake out the analyser just before you call the function isn't going to help much). The TR 24731 safer-C functions also address this in some detail.

(If anyone has any further common patterns that they'd like to see listed here, please let me know).

False Positives

One problem that every code analysis tool has is false positives. Fortunately PREfast has relatively few of them, with one big exception: If a pointer is supposed to be non-NULL, it'll complain if you use it without checking it first. This is problematic in terms of false positives because passing complex data structures around by reference is standard practice and if you have several levels of nesting then a single check at the top level is sufficient — there's no need to keep checking for null pointers at every level of the call stack.

Similarly, if you're doing something like following a linked list using code like:

for( listCursor = listHead; listCursor != NULL; listCursor = listCursor->next )
  {
  do_function( listCursor );
  }

then you know that anything passed to do_function, and any functions that it in turn calls, will be non-NULL. However, PREfast can't easily detect this and warns about a potential NULL pointer dereference in do_function() and each function that it in turn passes the pointer to. The brute-force fix for this problem is:

#if defined( _MSC_VER ) && defined( _PREFAST_ )
  #pragma warning( disable: 6011 )
#endif /* VC++ with source analysis enabled */

but this may be overkill since it's turning off all warnings about potential NULL pointer dereferences and not just the flood of false positives. Luckily there's an alternative solution, the under-documented __analysis_assume annotation which you can use to guide the analyser. So in the above example you would use:

__analysis_assume( listCursor != NULL );

to remove the false positive. You need to be a bit careful in how you use __analysis_assume: it may look like the traditional C assert, but its function is quite different. To make sure I don't accidentally use it as if it were an assert, I've mapped it to ANALYSER_HINT in my code-analysis header file.

Some of the problem of false positives with pointers is caused by the fact that there's no way to tell PREfast that you specifically want things checked in the caller rather than the callee. In other words if you have a function like:

int function( __in_bcount( length ) const void *buffer, const int length );

then PREfast will warn when it processes the code in function() that an access at a disallowed point in buffer is possible if length has an inappropriate value. At this point you need to track down every possible use of function() in your code to make sure that length has an appropriate value. What would be more useful is an ability to tell PREfast that it should check all calls to function() but not function() itself:

int function( __in_bcount( length ) const void *buffer, __check_caller const int length );

This annotation would direct PREfast to assume that length has a valid value when it's passed to function() and instead check every call to function() to make sure that what's being passed is valid. Implementing this would require an ability to specify via SAL what values length can legally take, which is covered in the next section.

A second area in which PREfast occasionally has problems is when figuring out which variables are live or not. Consider the following code sequence:

int var1, var2, var3;

status = foo( &var1 );
if( status == SUCCESS )
  status = bar( &var2 );
if( status == SUCCESS )
  status = baz( &var3 );
if( status != SUCCESS )
  return( status );

result = var1 + var2 + var3;

In some cases PREfast has trouble following the code flow and emits a used-before-initialised warning. This only happens occasionally, and I'm not sure if there's a way to fix this (__analysis_assume doesn't address it because you can say __analysis_assume( ptr != NULL ); but there's no way to say __analysis_assume( value == __initialised__ );). This is probably one of those things that you just have to live with. Fortunately it's reasonably rare. In addition there's quite a bit of overlap between PREfast warnings of this type and compiler warning level 4 warnings, so if you eliminate all the warning level 4 warnings you'll get rid of most of the PREfast ones as well.

A variation of this is that PREfast has problems with the code flow in min() and max() and similar x ? y : z-style operations, so that limiting the byte count for an operation with min() still produces a warning about a potential buffer overrun. For example a code sequence:

__out_bcount_full( dest, maxLength );
[…]
length = min( maxLength, byteCount );
memcpy( dest, src, length );

will produce a warning that the buffer is maxLength bytes long but UINT_MAX bytes might be written.

In the case of the used-before-initialised warning, one possible workaround is to define a dummy initialiser:

#define DUMMY_INIT      0
#define DUMMY_INIT_PTR  NULL

and use that to perform a dummy initialisation of any variables for which PREfast returns a false positive (this also helps for other false positives such as when building with warning level 4 but without PREfast). One downside to this is that if you're telling others to do this rather than doing it yourself you're going to encounter remarkable levels of pushback from developers who refuse to insert totally superfluous initialisations into their code just because some stupid compiler can't figure out what is and isn't initialised. A few public executions pour encourager les autres may be required.

A second problem, which is a bit more hypothetical, arises when a new version of PREfast with improved analysis comes along: annotations that were added purely to suppress false positives in the old version may no longer be needed and, worse, may now be masking genuine problems that the improved analysis could otherwise have detected.

This is where the use of the preprocessor comes in. Remember the earlier discussion about using a custom mapping from whatever works for you to SAL? If you're using the preprocessor in this way, you can add different defines for genuine vs. fix-false-positive annotations and then when a new generation of PREfast arrives you can temporarily disable the fix-false-positive annotations to see if (a) they're still a problem and (b) there isn't something else there that wouldn't have been detected otherwise.

A final problem, which I'll cover a bit more in the next section, is that there's no way to tell PREfast about integer ranges. For example if you have code like:

int length;

get_data_length( &length );
memcpy( dest, src, length );

and you know that get_data_length() will only ever return a value from 1 to 10, there's no way to tell PREfast this. As a result it'll issue a warning saying that dest is only X bytes long but up to UINT_MAX bytes may be copied. This one goes into my wishlist in the next section.

Everyone's a Critic

The one big thing missing from PREfast is handling of integer ranges and range checks. The previous section already gave one example of this where the analyser has to assume that all integer values are a worst-case INT_MAX/UINT_MAX because there's no way to tell it otherwise. What would help here is the addition of an integer range notation, say __range, that allows you to specify an allowable range for a variable. In the above example it'd be:

get_data_length( __out __range( 1, 10 ) int *length );

This then leads on to a second wishlist item, the extension of this range-checking ability to cover integer values in general. Detecting problems with integer ranges is something that makes finding buffer overflows look like child's play by comparison (the recent universal-remote Windows kernel vulnerability may have been due to an integer-range error, for example, although this is based on third-party reverse-engineering comments since no technical details have yet been released). CERT's secure coding guidelines provide an overview of the magnitude of the problem, but its true size is only really apparent when you start going through some code trying to apply these rules to all uses of chars, ints, and longs.

To make use of PREfast for range checking, when declaring a variable you'd use:

int __range( 1, 10 ) foo;

and PREfast would warn when an operation on foo could potentially cause it to go outside this range. If you're familiar with Pascal and related languages you'll have encountered this before, although Pascal et al do their checking at runtime rather than compile time. It's also supported in the Cyclone safer-C variant, which uses the same mechanism for checking integer ranges as it does for array bounds checks:

typedef { int x : x >= 1 && x <= 10 } smallint;

A more typical use for restricted-subrange integers would be to declare a distinct type based on ranges:

typedef int __range( 1, 10 ) smallint;

and then wherever you used smallint you'd get warnings if it could potentially go outside the defined range. Even more so than buffer overflows, integer range checking is something that really needs automated assistance because once you get to more than one or two levels of arithmetic operations it's no longer possible for most humans to track where the numbers are going.

A final wishlist feature is something like SQL's EXPLAIN verb to provide more information on why PREfast thinks that a particular issue is a problem. Without knowing why a (confusing) warning is issued it's too easy to dismiss something as a false positive when it may really be a well-hidden genuine problem that only PREfast's analysis is capable of spotting.

Comparison with Other Tools

In the introduction I mentioned that a few other tools like PREfast exist. One that's been around for a while is splint, which evolved from the earlier LCLint, which in turn came from the even earlier Larch formal specification and verification project. Unfortunately splint is dangerously close to violating Eric's Razor: it's very labour-intensive, a large amount of annotation is required to get useful output, and the huge numbers of false positives produced require even further annotations to suppress them.

Here's a quick example of splint annotation. Remember the earlier comment about Microsoft reducing long strings of annotations into a single combined form in sal.h? With splint you don't have this facility, so you end up with annotations like:

void /*@alt char * @*/ strcpy( /*@unique@*/ /*@out@*/ /*@returned@*/ char *s1, char *s2 ) /*@modifies *s1@*/ /*@requires maxSet( s1 ) >= maxRead( s2 ) @*/ /*@ensures maxRead( s1 ) == maxRead( s2 ) @*/;

(yes, there's a function prototype buried under all that!). The SAL equivalent is something like:

void strcpy( __out_z char *s1, __in_z const char *s2 );

The closest open-source equivalent that I've found to PREfast is Deputy, which is quite SAL-like in its notation. For example the same function as above decorated for use by Deputy would be:

void strcpy( NTS char *s1, NTS const char *s2 );

(with NTS being short for “null-terminated string”). The nice thing about this is that because PREfast and Deputy use quite similar notation it should be possible to annotate code so that it can be checked by both tools, where one may find problems that the other can't detect (so far I've only checked this on a small amount of toy code; SAL is Windows-only and Deputy is anything-but-Windows-only, so it's a bit tricky setting up a full-blown test).