Constructed Regexes in a CLR Application

FINDING .NET REGEXES USING WINDBG

We are going to use WinDbg to peek into a sample application to see which regex patterns are being used.

Required Environment

  • WinDbg as a part of Debugging Tools For Windows, which can be found here.
  • Set symbol path using the .symfix command.
  • Network access to Microsoft Symbol Server.
  • Sample application that uses .NET regexes, which can be downloaded from the github repository krk/regextest.

Background

This article is not about introducing WinDbg, SOS or .NET flavored Regexes. Only short descriptions are provided.

What is WinDbg?

WinDbg is a powerful debugging tool from Microsoft that you get with a Windows license. WinDbg is a native debugger with extensions support.

What is SOS?

SOS or Son-of-Strike is a WinDbg extension that helps the debugger to traverse the CLR data structures in memory with ease to facilitate managed debugging. SOS.dll uses the Data Access Component in mscordacwks.dll to access the .NET data structures.

In order to debug managed code, exact version match of these dlls is needed between the debugging machine and the machine where the dump is taken. It is usually a good idea to pack SOS.dll and mscordacwks.dll together with the memory dump file.

What is a regular expression?

A regular expression is a pattern string that matches an input string when evaluated with respect to the regular expression grammar of the library or the host language, e.g. PERL, python, .NET Regex. In the context of .NET, a regex means an instance of the type System.Text.RegularExpressions.Regex.

Goal

WinDbg will be used to break on Regex class constructor invocations to log the pattern which is passed as a parameter to the Regex class.

Setting up your environment

Start WinDbg (X86) and load the sample executable called RegexTest.exe.

Press Ctrl+E to open the executable:

ModLoad: 00520000 00528000 RegexTest.exe
SHIMVIEW: ShimInfo(Complete)
(22a0.182c): Break instruction exception - code 80000003 (first chance)
eax=00000000 ebx=00000000 ecx=fa190000 edx=00000000 esi=7e797000 edi=00000000
eip=77283bed esp=006af664 ebp=006af690 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
ntdll!LdrpDoDebuggerBreak+0x2b:
77283bed cc int 3

Fix symbols if needed:

0:000> .symfix c:\symbols
0:000> .reload
Reloading current modules
......

Load the Son-of-Strike extension sos.dll. SOS must be loaded after CLR is loaded. For .NET 2.0, use mscorwks instead of clr

0:000> sxe ld clr.dll
0:000> g
ModLoad: 73f40000 745da000   C:\Windows\Microsoft.NET\Framework\v4.0.30319\clr.dll
eax=00000000 ebx=00800000 ecx=00000000 edx=00000000 esi=00000000 edi=7e79e000
eip=7720c6fc esp=006af2a0 ebp=006af2fc iopl=0         nv up ei pl nz ac pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216
ntdll!NtMapViewOfSection+0xc:
7720c6fc c22800          ret     28h
0:000> .loadby sos clr

Break on loading of System.ni.dll to set a breakpoint on the .NET entry method.

0:000> sxe ld System.ni.dll
0:000> g
(22a0.182c): Unknown exception - code 04242420 (first chance)
ModLoad: 722a0000 72c20000   C:\Windows\assembly\NativeImages_v4.0.30319_32\System\2482799f892ac9e635f6e1453f0dfdd5\System.ni.dll
eax=00000000 ebx=00800000 ecx=00000000 edx=00000000 esi=00000000 edi=7e79e000
eip=7720c6fc esp=006aa620 ebp=006aa67c iopl=0         nv up ei pl nz ac po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000212
ntdll!NtMapViewOfSection+0xc:
7720c6fc c22800          ret     28h

Find the entry method address which can be found with ILSpy or a similar IL decompiler. In our case, it is the Program.Main method.

0:000> !name2ee RegexTest.exe Program.Main
Module:      007a3fbc
Assembly:    RegexTest.exe
Token:       06000001
MethodDesc:  007a4cc4
Name:        RegexTest.Program.Main(System.String[])
Not JITTED yet. Use !bpmd -md 007a4cc4 to break on run. <- We use this address

Break on the Main method using the bpmd SOS extension.

0:000> !bpmd -md 00c54cc4
MethodDesc = 00c54cc4
Adding pending breakpoints...
0:000> g
(16ec.1dc4): CLR notification exception - code e0444143 (first chance)
JITTED RegexTest!RegexTest.Program.Main(System.String[])
Setting breakpoint: bp 00D9049C [RegexTest.Program.Main(System.String[])]
Breakpoint 0 hit
eax=00000000 ebx=00b5f4dc ecx=02992350 edx=00000000 esi=02992350 edi=00b5f424
eip=00d9049c esp=00b5f3a4 ebp=00b5f438 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
00d9049c 90              nop

We are safe to assume that all Regex constructors call the private Regex ctor which has 4 parameters. We list all Regex ctors and set a breakpoint on the private ctor.

0:000> !name2ee System.dll System.Text.RegularExpressions.Regex..ctor
Module:      722a1000
Assembly:    System.dll
Token:       06003c8b
MethodDesc:  722a37f8
Name:        System.Text.RegularExpressions.Regex..ctor()
JITTED Code Address: 7242a690
-----------------------
Token:       06003c8c
MethodDesc:  722a3800
Name:        System.Text.RegularExpressions.Regex..ctor(System.String)
JITTED Code Address: 72441804
-----------------------
Token:       06003c8d
MethodDesc:  722a380c
Name:        System.Text.RegularExpressions.Regex..ctor(System.String, System.Text.RegularExpressions.RegexOptions)
JITTED Code Address: 72439850
-----------------------
Token:       06003c8e
MethodDesc:  722a3818
Name:        System.Text.RegularExpressions.Regex..ctor(System.String, System.Text.RegularExpressions.RegexOptions, System.TimeSpan)
JITTED Code Address: 72851118
-----------------------
Token:       06003c8f
MethodDesc:  722a3824
Name:        System.Text.RegularExpressions.Regex..ctor(System.String, System.Text.RegularExpressions.RegexOptions, System.TimeSpan, Boolean) 
JITTED Code Address: 7243bd60 <- We need this constructor address.
Token:       06003c90
MethodDesc:  722a3838
Name:        System.Text.RegularExpressions.Regex..ctor(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
JITTED Code Address: 72851134

Set a breakpoint on the private constructor which is used by all public constructors.

0:000> bp 7243bd60
*** WARNING: Unable to verify checksum for C:\Windows\assembly\NativeImages_v4.0.30319_32\System\2482799f892ac9e635f6e1453f0dfdd5\System.ni.dll
0:000> g
Breakpoint 1 hit
eax=723d4680 ebx=00b5f4dc ecx=02992450 edx=02992438 esi=02992438 edi=02992450
eip=7243bd60 esp=00b5f380 ebp=00b5f39c iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
System_ni+0x19bd60:
7243bd60 55              push    ebp

JIT seems to use the fastcall convention [citation needed] and passes a pointer to the this instance as the first parameter which is in ecx so our string pattern parameter should be in edx.

0:000> !do edx
Name:        System.String
MethodTable: 730e0938
EEClass:     72d1dd18
Size:        22(0x16) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      aaa1 <- This is the first regex generated by the sample application.
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
730e27c0  4000243        4         System.Int32  1 instance        4 m_stringLength
730e137c  4000244        8          System.Char  1 instance       61 m_firstChar
730e0938  4000248       40        System.String  0   shared   static Empty
    >> Domain:Value  00e27888:NotInit  <<

Get the unicode string associated with the String object.

0:000> .printf "Regex: %mu\n", 8+edx
Regex: aaa1

Now we clear the previous breakpoints and automate the task and log all regexes to a file.

0:000> bc *
0:000> bp 7243bd60 ".printf \"Regex: %mu\\n\", 8+edx ;g"
breakpoint 1 redefined
0:000> .logopen /u c:\logs\regex.txt
Opened log file 'c:\logs\regex.txt'
0:000> g
Regex: aaa2
Regex: aaa3
Regex: bcd1
Regex: bcd2
Regex: bcd3
Regex: ceg1
Regex: ceg2
Regex: ceg3
Regex: dgj1
Regex: dgj2
…

When the application exits or you press Ctrl+Break, call .logclose and your regex log file will be ready to be consumed.

0:000> .logclose

 

GitHub repository for the sample application is https://github.com/krk/regextest

Project Euler Problem 53

Project Euler 53. Problem ve Çözümü

53. Problem’de 1 <= n <= 100 ve C(n,r) > 1000000 koşullarını sağlayan kaç adet C(n,r) şeklinde kombinasyon olduğunu bulmamız isteniyor.

Bu problemin çözümü için akla gelen pek çok yol içerisinden hızlı ve basit olan bir yol ile problemi çözmeye çalışalım. Problemimizde kombinasyonu hesaplayarak 1000000’dan büyük olup olmadığını sınamak yerine kombinasyonun ve kontrol değerinin logaritmasını alarak logaritmaları karşılatırıyoruz.  Öncelikle bazı kombinasyon formüllerini hatırlayalım:

Faktöriyel ve kombinasyonun formülü

ve bazı logaritma formüllerini:

Çarpımın logaritması

Bölümün logaritması

Yukarıdaki formülleri kullanarak faktoriyelin logaritmasını aşağıdaki gibi yazabiliriz:

Faktöriyelin logaritması

Bu fonksiyon project euler ile doğrudan ilgili olmasa da gammaln fonksiyonunun analogudur.

Böylece kombinasyonun logaritmasını da aşağıdaki şekilde yazabiliriz:

Kombinasyonun logaritması

Buraya kadar yazdıklarımızdan çıkarmak istediğimiz sonuç olan problem 53‘ün logaritmalar ile ifadesini aşağıda görebiliriz:Logarithm of one million

Euler Problem 53

Böylece logaritmanın da yardımı ile kendimizi big-integer çarpım, bölüm ve karşılaştırmalarından kurtararak işimizi double toplamlara ve karşılaştırmalara indirgemiş olduk. Aynı zamanda faktoriyelin logaritma fonksiyonunu tasarlarken memoization tekniğini de kullanabiliriz. Problemi çözen örnek C# kodu aşağıda verilmiştir:

// 2009-06-29
using System;
using System.Collections.Generic;
using System.Text;

namespace ProjectEuler53
{
    class Program
    {
        static void Main(string[] args)
        {
            // count will be the answer of the problem.
            int count = 0;

            for (int n = 1; n <= 100; n++) // upper limit imposed by the problem on 'n' is 100
                for (int r = 1; r <= n; r++) // n > r
                {
                    if (LogCombination(n, r) > 6) // log(1000000) = log(10^6) = 6 log10 = 6
                        count++;
                }

            Console.WriteLine(count);
            Console.ReadKey(true);
        }

        // Memoization dictionary for logarithm of factorials.
        static Dictionary<int, double> dictLogFactorial = new Dictionary<int, double>();

        // Returns logarithm of the combination C(n,r)
        static double LogCombination(int n, int r)
        {
            return LogFactorial(n) - LogFactorial(r) - LogFactorial(n - r);
        }

        /// Returns logarithm of n!, if called before value comes from a dictionary.
        static double LogFactorial(int n)
        {
            if (n == 0 || n == 1)
                return 1;

            if (dictLogFactorial.ContainsKey(n))
                return dictLogFactorial[n];

            double ret = 0;
            for (int i = 1; i <= n; i++)
            {
                ret += Math.Log10(i);
            }
            dictLogFactorial.Add(n, ret);
            return ret;
        }

    }
}