Showing posts with label Bug. Show all posts

Tuesday, October 8, 2024

Does Your Bug Report Annoy You and Fellow Testers?


I often read quotes and thoughts about how code should be written. For example: write the code for other programmers and not just for yourself, so that another programmer can pick it up with ease and work from there. You must have come across similar thoughts on code.

Have you ever come across thoughts that speak about how a bug report is written?

The bug report you write, is it for you alone? Or is it for the audience for whom you wrote it? Or is it for someone who picks it up and works on it later?

How well does the audience of your bug report fare on reading it? How did fellow testers feel reading your bug report? How easy was it for you to read your own bug report and work on it later? How smooth was it for another tester to understand your bug report and test the fix?

I experience this:
  • A bug report with precise and helpful technical details did not serve the audience and fellow testers
  • A bug report with no precise and helpful technical details did not serve the audience and fellow testers
  • A bug report with plain English and attachments did not serve

While I say that, I see this has helped most of the audience at times:
  • A bug report in plain English giving the context, little or no technical information and associated details

It has happened that I have rewritten my bug reports on reading them after an hour. And I have rewritten the bug reports of others as well after testing the fix. In both cases, I "prevent", to some extent, the pain which I and others go through. At least, I hope so!

To end, I recall this quote from Martin Fowler:

"Any fool can write code that a computer can understand. Good programmers write code that humans can understand."

I see this holds good for a bug report as well. Any of us can write a bug report. A skilled engineer [and test engineer] can write a bug report which does not bring unwanted pain to her or his audience.

Did you ever read your own bug report three months after writing it? How deep was the pain and annoyance of figuring out what it was all about? Give the same bug report to your fellow tester, programmer, or product owner, and ask what they understood from it.




Friday, July 26, 2024

My First Hand Analysis of CrowdStrike Falcon Update Incident


I attempted to analyze the process dump of CrowdStrike shared by my friend. He said there could be an attack leading to crashes of Windows OS globally. This made me curious to look into the dump and learn.

I did not have much context around it, but the test engineer in me did not sit quiet. I started to analyze the dump information. Here is my first-hand analysis, which I made on 19th July 2024 after 10:30 AM IST.


What I Saw?

  • It is a process dump from Windows OS.
  • Reading how the memory offsets appear in the dump, it looks like a C or C++ application.
  • It started to read a memory offset.
  • Then the process witnessed an exception.
    • Here the program could not read further
    • Why could it not read further from this offset?
      • My little experience of testing drivers on Windows OS for a card printer machine came back to me, and I recalled what I had witnessed when testing.


Scratching and Striking My Mind


I started to ask myself these questions while asking what could have gone wrong. I could not stop here, as I was curious about what led the Windows machines to crash. I referred to the web and learned there was an update by CrowdStrike, and then this incident.

Bugs exist in every software, no matter the level and depth of testing, automation, and engineering excellence. All software crashes, and an OS is no exception. But what made the update crash the Windows OS? Pointing at and blaming CrowdStrike or Microsoft is not the way for a practicing test engineer. If these two organizations are serving their huge customer bases, they have something working and reliable. Engineering does not eliminate problems.

By now, I had a thought that it is not an attack. It is a software bug! Where is the bug? What is the bug? Was it not experienced in the pipeline?


The Open Ended Questions


I had these questions as I analyzed and spoke to my friend.
  • What is Falcon?
  • What was this update to Falcon?
  • How frequently are the updates rolled out?
  • How are the updates rolled out globally?
  • What pipeline do they have in testing?
  • Who is impacted the most in business? Is it Microsoft or CrowdStrike? Impacted in what way?
  • What is CrowdStrike? What do they do? Who are the customers?
  • Where does CrowdStrike's Falcon sit in the OS and what does it do?
  • How does CrowdStrike work on the machines and what does it offer?
  • What does the dump say? Relook into it with different perspectives.
  • How could this have been prevented?
  • How will I prevent this if I join this team knowing this incident?

With these questions, I started to analyze the process dump which was shared.

I had more such questions, but these were the first few that I came across as I started.



Analysis of Process Dump


My interpretation tells me the below for today:
  1. Accept that it is an incident like any other incident which I witness in a production environment.
  2. Do not fall for the speculation happening around. Remain calm and focus on interpreting and understanding your exploration.
  3. I see it starts to read from an offset and then ends up at a non-existent or invalid offset. Is it a NULL pointer?
    • What is a NULL pointer?
      • A NULL pointer is a pointer that does NOT point to any memory location and hence does not hold the address of any variable.
      • If I do not initialize and assign a local pointer, it does not get NULL as its value automatically; its value is indeterminate.
      • For example, int *test;
        • When I want to access what the pointer test is pointing to (a location in memory), I will not be sure what is in the pointer when I read it.
          • I may or may not set it later.
          • In this case, the code cannot tell if the pointer is valid or pointing to garbage memory.
        • But, if I declare it like int *test = NULL;
          • I can check if it was set and initialized.
        • It is a better practice to assign a NULL value to a pointer during initialization so that we can check if it is NULL or has an address assigned to it.
      • This understanding of pointers makes me think: is it due to not initializing a pointer, and so the error code c0000005 on reading memory that is not valid?
      • When we assign a NULL value to a pointer, it is a null pointer in C++.
        • We assign a null value for testing and asserting
          • If the memory is allocated to a pointer or not
          • If it has a return address and is a valid one or not
          • If a pointer is not initialized, assigning null to it prevents problems to a certain extent
    • With this understanding, I also read that it started to read from an offset 0x9c and then failed.
      • What is 0x9c?
        • In Octal it is 234. In Decimal it is 156.
        • Can there be such an address in a computer's memory? I don't know.
        • If it is an access violation, then is it memory which is protected by the OS?
          • If so, the OS can terminate the program or process which is trying to access it.
          • Is this killing the process, aborting the operation of Falcon's IPC, and eventually bringing Windows to a BSOD?
      • This tells me it is not a NULL pointer in the first place, but a pointer that was not initialized to NULL.
        • I infer that if the pointer had been initialized to NULL, there could have been some hint in the state and event when accessing the memory.
          • This is my analysis; but I have not seen the test code, nor am I aware of the product. All this inference is based on the process dump and my experience of testing drivers.
      • Did it get something from the update (a config or pattern?) which it cannot find and read in the memory? Why?
        • This indicates to me that it could be a bug, that is, a logical problem. This is my hunch for today!
  4. Data in the dump
    • Exception Address
    • Read from Address 0x9c
    • Exception Code: c0000005 (Access violation)
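
The values read from the dump can be cross-checked with a few lines of Python. This is a minimal sketch and not part of the dump analysis itself: the helper name split_ntstatus is hypothetical, and the bit layout it assumes (2-bit severity, 1-bit customer flag, 12-bit facility, 16-bit code) is the published NTSTATUS layout as understood here.

```python
def split_ntstatus(status):
    """Split a 32-bit NTSTATUS value into its bit fields (assumed layout)."""
    return {
        "severity": (status >> 30) & 0x3,    # 3 is the error severity
        "customer": (status >> 29) & 0x1,    # 0 means a Microsoft-defined code
        "facility": (status >> 16) & 0xFFF,  # 0 means no specific facility
        "code": status & 0xFFFF,
    }

offset = int("9c", 16)
print(offset)        # decimal value of 0x9c
print(oct(offset))   # octal value of 0x9c
print(split_ntstatus(0xC0000005))
```

The conversion agrees with the octal and decimal values noted above, and the top two bits of c0000005 mark it as an error-severity status.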

Testing my Interpretations


When CrowdStrike as an org caters its SaaS to such a customer base, won't it have a testing pipeline?
  • It will have one, I have no doubt about it. They test and roll out the updates, I believe that.

Did they witness any such incidents earlier?
  • I searched the web for it and I did not find anything similar on Windows earlier.

Is this a NULL Pointer?  Are you sure?
  • No, I'm not sure. But there is something that is leading it to an address which does not exist or which is invalid. I will have to wait for their RCA to know technically what caused this. But this is my understanding from reading the dump.

How do you think it is a memory access problem?
  • The error code 0xc0000005 says that.
  • I referred to the Driver Easy website for the information, because my experience of testing drivers for Windows OS and experiencing such incidents led me there. This is what I read:
    • https://www.drivereasy.com/knowledge/solved-how-to-fix-0xc0000005-error/

Do you think the programmer would not have handled the obvious Pointer and NULL initialization?
  • I believe there will be a check for the pointer and what it is pointing to. But is it due to no initialization? Technically this has to be analyzed, which I cannot do. I will have to wait for the CrowdStrike team to share the technical details.

Is this a driver problem that killed the Windows kernel?
  • I don't know. But, as per my learning, the .sys file will not have the driver itself. It will have information about the drivers and any configurations.
  • This incident is a problem which impacted both CrowdStrike and Microsoft. Maybe both will have their areas to look at and fix, if they see it so. But in this context, CrowdStrike can fix it quicker and that is much better -- is what I understand.
  • I have been a Windows user for a long time. I see Windows has worked well in all my contexts so far. The engineers of Windows OS know better than me here. I'm not as well aware and informed as they are.
  • CrowdStrike's engineering team is skilled and they roll out updates several times a day. They must have a good pipeline to be doing this.
    • But, the question I have is, how did this happen?
    • No one lets such a problem into production when they are aware of it. Do you?
    • There is something that has not come to their observation and experience. What is that?
    • Knowing this will help to prevent this and similar incidents from happening in future.
      • I'm waiting to know what did not come to their experience and led to this incident.

What could be in the .sys file of CrowdStrike?
  • I don't know!  I want to learn that.
  • But, from my testing of .sys files and drivers on Windows OS, I learn there could be configuration details with a certain pattern or information to capture at run time, which help the installed software to run. This is my learning and awareness from my testing.
  • That said, testing at the OS level and testing antivirus engines are not obvious. Testing drivers is like walking through a risky minefield. What is sufficient and good enough in test coverage? It needs expertise at the level of OS internals.
  • With Windows OS having such fragmentation in its versions, updates, and patches, it is a battlefield and a minefield for engineers building such solutions, for sure!
  • I learn the Windows OS stopped when an application tried to access an invalid region or non-existent memory.
    • The update which was rolled out, did it have a configuration or a pattern that showed a logical problem when processing it?
    • I have such questions and thoughts striking my mind as I think and build a problem model for the same.

Is this a race condition incident?
  • I see it is not a race condition incident, as users across the globe experienced it.

Is this specific to a Falcon version, OS version and hardware?
  • Not all host machines would be on the latest version of Falcon, is my presumption.
  • At least the n-1 and n-2 versions should be on some of the host machines which experienced this behavior.
    • So it is not specific to a Falcon version, I see.
  • It looks to me that it is not specific to the Windows OS version and hardware configuration.
    • It is an application software problem which occurred at the driver level, is what I see.
      • This involves IPC between processes, is my understanding.
        • The driver can receive the IPC messages in continuous mode.
        • At times, these can get queued based on the application and what it does.


Where is the Problem?


Well, I'm looking at this and pulling from my visualization by relating it to my experience of testing drivers on Windows OS. I don't know the exact reason, nor am I close enough to tell what could have gone wrong.

Reading the process dump, it says a memory that does not exist or is corrupted was accessed. One high possibility is that the starting offset is seen but it is not helping when reading.
  • For example, Ravi has the address of India's Prime Minister's house.
    • But, he does not know from where to start despite having the address.
    • He is void and null in knowing where to start and what to do when he is not initialized with the start location to begin the travel to the Prime Minister's house.
    • In short, he does not know where the address is pointing to and what it has, though he is given an address to start.
      • Can he access the Prime Minister's house premises without access being granted and authorized?
      • If not, won't the police or other security forces stop and arrest him?

Do I Know the Precise Problem?


I don't know! I do not know the CrowdStrike product and platform. I'm waiting to read the technical details from CrowdStrike.

I see it comes down to the data, state, and event. I would focus on how to prevent it by learning which data, state, and event led to this behavior. I think of figuring out the Test Design and Strategy that can help me to identify such use cases. I focus here and see whether it can be brought into automation so that it gets exercised and regressed consistently.

If it is due to a memory access that had a problem, I did such tests when testing a driver for a hardware machine on Windows OS. I will share the tests that I did in an upcoming blog.

I wrote the technical analysis from the process dump to CrowdStrike and Microsoft. I did not get a response. Anyway, I'm sharing the overall information in a non-technical way so that it is consumable for most readers here.



Note: Here are other threads where I share my thoughts on the same:
1. https://x.com/testingGarage/status/1814215089525821763?t=XSFdx69ElL0ZmBOcEFrTjg&s=19
2. https://www.linkedin.com/posts/ravisuriya_%3F%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F-%3F-activity-7221156949445206017-oeRa




Saturday, November 5, 2022

Technically, What is a Bug?

 

I mentor Software Test Engineers. In one of the pair sessions with a mentee, we were discussing the technical aspects of one technology. We started to test the application and the mentee said she had found a bug.

She explained the bug. Further, she asked how I could explain this bug technically. And, going ahead, she asked, "Can you technically tell what a bug is?"


Technically, What is a Bug?


I have come across various definitions of a bug from other software testing practitioners.  If I have to tell technically what a bug is, I put it this way:

  • A bug is a logical incident experienced
  • It is logical because the programming instructions written are logical
  • Technically, the bug is a logical incident
 

Wednesday, August 18, 2021

I Can Test - Debugging an Inconsistent Behavior

 

Before I Say, I Can Test

In this post, I'm demonstrating how I approached my testing and debugged an inconsistent behavior that was reported in the Telegram space of The Test Chat (TTC). A contest hosted by The Test Chat is about to start. The title of the contest is - So You Think You Can Test? The registration is open. The details of how to register are shared on the Telegram chat and other social media spaces of TTC.

The QR Code is shared in Telegram, and members are asked to scan it and submit the registration. Here is what I observe reading through the messages in Telegram:

  1. Few could scan the QR Code and could register
  2. A couple of members could not see the registration form as the scanning of the QR Code failed  
  3. Requests made to share the URL of the registration form rather than the QR Code
  4. Requests made to share both -- the QR Code and URL of the registration form
  5. The reason -- why the QR Code is shared and not the URL
  6. And the URL of the registration form is also shared now

What made me curious is, a member had replied that on multiple attempts to scan the QR Code using a mobile app, it did not fetch the URL.  This member observed the same behavior on the web, that is, on uploading the QR Code image, it did not fetch the URL.

I now see a behavior to test, investigate, and debug to learn what's happening. With that, I have an opportunity to say I Can Test right here, on the registration procedure of the So You Think You Can Test? contest.


What I did and What I Observe

It is a QR Code!  

  • This QR Code shared by TTC:
    • Is not like the regular QR Codes I usually see
      • I see a black background and the data in a yellow foreground
      • I see the Finder Patterns, that is, the concentric squares, in an oval shape
        • These two observations are prominent in this QR Code
  • I installed a QR Code reader app on my phone and scanned the QR Code
  • It fetches me the registration form URL; I can open the registration form
  • Then I get the question -- why are those testers experiencing a problem?
  • I read through their messages and look for the clues they have left for me
  • I read the words -- drag and drop
    • Ah! The web browser is used as well
    • This is a very useful clue to me
    • I have no idea what desktop browsers, QR Code reader websites, mobile apps, and smartphones were used by these testers
  • I proceed now to use an online QR Code reader
    • I pick these two:
      • qrreader dot online
      • helloacm dot com/tools/qrcode-reader/
  • I uploaded the QR Code shared by TTC on these two websites
  • I see the same message and in the same format on these two websites
    • The message reads -- error decoding QR Code
    • Per me, this is a key observation
      • What makes these two web pages show the same message and in the same format?
  • I analyze the network when uploading the image and for the response I receive
    • In the request
      • I see the Data URL used
      • Protocol mentions data
      • No remote address, that is no server IP to which the request is to be sent
        • Another critical observation
      • I see the request initiator
      • The data (jpeg image) is sent in base64 format - a text encoding of binary data
      • I can see the preview of the QR Code
      • I see the request method as GET
        • This is interesting!
        • Why GET and not POST?
      • I see the HTTP Status Code 200
      • I see just the User-Agent in the Request Headers
    • In the Response
      • In the Response tab, I see the message -- This request has no response data available
      • I see Content-Type: image/jpeg in the Response Headers
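
The observations above (a data protocol, no remote address, base64 content) fit how a data URL works: the bytes travel inside the URL itself. Here is a minimal Python sketch of that idea. The helper names to_data_url and from_data_url and the four sample bytes are illustrative assumptions, not something taken from these websites.

```python
import base64

def to_data_url(payload, mime_type="image/jpeg"):
    """Embed raw bytes in a data URL as base64 text."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def from_data_url(data_url):
    """Recover the MIME type and the raw bytes from a base64 data URL."""
    header, encoded = data_url.split(",", 1)
    mime_type = header[len("data:"):].split(";")[0]
    return mime_type, base64.b64decode(encoded)

# Placeholder bytes standing in for an image file (hypothetical sample).
sample_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])
url = to_data_url(sample_bytes)
mime, decoded = from_data_url(url)
print(url)
print(mime, decoded == sample_bytes)
```

Nothing in this round trip talks to a server, which is why such a request can show up in the network panel without any remote address.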

As I see no remote address for this request, I turned off the network. I uploaded the QR Code image; I see the data URL fired. Further, I observe this request is exactly the same as the one I see with the network on. So I learn,
  1. Fetching the URL from the QR Code is being done within the browser
  2. I just have to launch these pages and use the QR Code images to fetch the data out of them
    • No need for internet here
    • And moreover, there is no remote address at all; then why would the internet be needed to upload images!
  3. This tells me it could be JavaScript doing the job here!
    • It is a key learning from my observations so far

Now, I look at the Console to see if I can find more hints for my test investigation.



Diving into Console and JavaScript


In the Console, 
  1. For qrreader dot online, I see:
    • Couldn't find enough finder patterns (found 0)
  2. For helloacm dot com/tools/qrcode-reader/, I see
    • Couldn't find enough finder patterns

Pic: Message on qrreader dot online



Pic: Message on helloacm dot com



This is the source of the problem -- the QR Code could not be decoded.  If the Finder Patterns are not identifiable in the QR Code, then data in the QR Code cannot be decoded.  I see "found 0" in the console log.

But why are the Finder Patterns not identified in this case, though they are seen in the image by human eyes? This is the start of the actual test investigation and debugging for the behavior experienced.

Further, I learn both these websites make use of the same JavaScript -- llqrcode.js. Another key learning! I see this JavaScript is copyrighted to Lazar Laszlo. And, I found another website that scans the QR Code image -- webqr dot com. I experience the same behavior on uploading the TTC QR Code here as well, that is, the message -- error decoding QR Code. The same text and in the same format!



Pic: Message on webqr dot com



Reading through the JavaScript below, I make a few more observations.


Pic: llqrcode.js

I learn:
  1. When the image is about to be decoded
    • It is taken as a 2D image
    • The height and width of the QR Code image are collected and calculated
      • The check is made if they are appropriate to consume and process further
  2. In the process function,
    • The QR Code is converted to a grayscale image
      • The grayscale is the usual one that we see around us in black-and-white
  3. Now trying to look for Finder Patterns,
    • Looks like it is executing the condition if (h < 3)
      • So the message in console -- Couldn't find enough finder patterns
    • As a result, the decoding of QR is returning message -- error decoding QR Code

This information tells me there is something to do with the QR Code shared by TTC. But how come it works on smartphone apps?

I have not attempted to fetch the code from the smartphone app and analyze it at this point in my testing. I make an assumption -- it could be that the program used in the smartphone app can identify the Finder Patterns irrespective of the color and shape in the QR Code.

I had made an observation documented in the beginning -- the Finder Pattern is in oval shape and not in the concentric square shape.
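
To make the idea of "not finding finder patterns" concrete, here is a small Python sketch. It is only an illustration and not the code of llqrcode.js. It assumes a common luminance formula for grayscale, a fixed threshold of 128, and a single scan line; the names to_gray, binarize, run_lengths, and has_finder_ratio are hypothetical. It relies on one known property of QR Codes: a scan line through a Finder Pattern crosses dark, light, dark, light, dark runs in the ratio 1:1:3:1:1.

```python
# Illustrative sketch only; this is not the llqrcode.js implementation.
BLACK, WHITE, YELLOW = (0, 0, 0), (255, 255, 255), (255, 255, 0)

def to_gray(rgb):
    """Luminance with the common 0.299/0.587/0.114 weights (an assumption)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def binarize(pixels, threshold=128):
    """Map each pixel to 1 (dark) or 0 (light)."""
    return [1 if to_gray(p) < threshold else 0 for p in pixels]

def run_lengths(bits):
    """Collapse a row of bits into [value, length] runs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs

def has_finder_ratio(bits):
    """True if the row holds dark-light-dark-light-dark runs in 1:1:3:1:1."""
    runs = run_lengths(bits)
    for i in range(len(runs) - 4):
        window = runs[i:i + 5]
        values = [v for v, _ in window]
        lengths = [n for _, n in window]
        unit = lengths[0]
        if values == [1, 0, 1, 0, 1] and lengths == [unit, unit, 3 * unit, unit, unit]:
            return True
    return False

# One scan line through a finder pattern: 1 marks a "set" module.
modules = [1, 0, 1, 1, 1, 0, 1]
regular_row = [WHITE] * 4 + [BLACK if m else WHITE for m in modules] + [WHITE] * 4
yellow_on_black_row = [BLACK] * 4 + [YELLOW if m else BLACK for m in modules] + [BLACK] * 4

print(has_finder_ratio(binarize(regular_row)))
print(has_finder_ratio(binarize(yellow_on_black_row)))
```

Under these assumptions, the regular row matches the ratio while the yellow-on-black row does not, because the yellow modules turn into light pixels after the grayscale step. This shows only one way color could hide the patterns from such a scanner; the oval shape remains a separate suspect, and the real library may behave differently.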



Testing the Tests


I picked the QR Code image shared by TTC and converted it to a grayscale image. I used this grayscale image on the three websites mentioned above. I see the same message -- error decoding QR Code.

I picked the registration form URL that I had got by scanning QR Code using the mobile app.  I generated the QR Code and uploaded it to these three websites.  I see the QR Code decoded successfully now; I see the Google Form URL to register.  Note that, if I turn off the internet and use a valid QR Code, I see the URL.

This tells me, there is no problem with Google Form URL accessibility or encoding or decoding.  It is to do with the QR Code shared by TTC.


Pic: QR Code shared by TTC



Pic: Grayscale QR Code generated by me


Pic: QR Code generated by me with Google Form URL




Pic: QR Code on qrreader dot online on uploading TTC QR Code




Pic: QR Code on webqr dot com on uploading TTC QR Code


The dimension and size of the QR Code image file shared by TTC are not the same as the one generated by JavaScript.  I see pixelation and a bit of distortion in the QR Code generated by JavaScript using TTC QR Code.  

Look at the Finder Patterns in the QR Code from JavaScript; compare them with the QR Code shared by TTC. They don't look the same.


Understanding the QR Code


QR stands for Quick Response. The QR Code was pioneered by Masahiro Hara at the Japanese company Denso Wave in 1994. A QR Code has different sections, and the Finder Pattern is one of them.

I find information on these web pages useful:
  • https://www.explainthatstuff.com/how-data-matrix-codes-work.html
  • http://qrcode.meetheed.com/question14.php
  • http://www.keepautomation.com/tips/qr_code/functions_of_qr_code_function_patterns.html


So, What's the Problem Here?


From the inferences I'm making from my tests so far, it looks like
  1. The QR Code from TTC has data (shape and color) that cannot be processed by this JavaScript?
    • Not very sure!
    • But the analysis so far says yes, going by the code I read
  2. Need to generate more customized QR Codes
    • If possible, include Finder Patterns in different geometric shapes -- primary suspect
    • Then rule it out or point to it as the problem source
    • If this is not the problem, then
      • Are the dimensions of the QR Code image file the problem?
        • For now, this is the second suspect
          • But, the JavaScript code I read does not say this

I need to understand how the mobile app code can read it successfully. Then figure out the differences between the mobile app code and the JavaScript referred to here.

I'm stopping my testing for now.
I can test! I test!




Friday, May 14, 2021

Bug Report: Applying the Single Responsibility Principle (SRP) and KISS

 

On completing my test sessions, I started to write bug reports into the tracker. I had this thought coming up again in me: "Should I keep all these problems under one bug report or have a separate one for each?".

  • When we write tests with SRP (Single Responsibility Principle) and KISS (Keep it Simple and Straight), why not the bug reports?
  • What's wrong if each symptom (a consequent problem due to the root problem) has an individual bug report that is linked to the root problem report?

At times, I'm told to include all symptoms in one bug report along with the root cause.
  • I have witnessed that the symptoms of a bug do not get fixed if they are mentioned collectively in one bug report.
  • Also, the linked bug reports (i.e. symptoms of the root cause) do not get fixed when the root is fixed. They get marked as Resolved and Fixed because the root is fixed and resolved.

I have hardly seen the symptoms fixed along with the root cause.

That leaves me with these questions:
  1. Just one bug report or separate bug reports?
  2. When a test has to be specific with individual responsibility and objective, why not the bug report?
  3. If the root cause is fixed does it resolve the symptoms?
    • And, if the symptoms are resolved does it also resolve the root cause?

I ask my conscience whether it should be separate bug reports or just one bug report. I see I can apply SRP and KISS to the bug report effectively.

Looking at the number of bugs is not a wise strategy. But looking at the number of problems that one root problem opens and tracing them, is useful in the engineering of a product.



Monday, July 4, 2016

That's the common thing to happen! What is next?


The words "hang" and "crash" look so attention-grabbing! I see these words grab attention today as well, but they do not retain the attention for long unless someone is seriously behind the programmer and tester for it.

Gradually, I'm learning that crash and hang are something very usual in software applications and hardware products. Crash and hang are not cruel; they are an indicator of what is being missed. Like a fever in the human body is not the threat itself; it is an outcome and an indication of something that is not good in the body.

I have to admit this honestly. I was so thrilled some years back when I happened to see a hang or crash. I used to say it hung, and people used to run to the lab. Unfortunately, I was not in a state of mind to say, "That is okay, but what we have to look at is what made it hang or crash." The hang or crash behavior is an outcome of an actual problem, which I never learned then.

As I came to practice test investigation, I learned that a hang is a symptom. Fetching as much information as possible from the symptom to unlock the problem node is the essential part of what I'm practicing now.

Today, when fellow practicing testers come saying "it got crashed", "it is hanged and unable to proceed", and the like, I say, "It is a common thing to happen in software products. Rather than writing this as a bug, go figure out what is making the product fall into this state. If you do not know how to learn that from test investigation, let us pair up and learn it."

Most times, or almost every time, the symptom is reported as the bug while the problem source is left unattended by the engineering teams. Fixing the symptom is easier, so that the symptom is not visible again. But the problem remains, and it shows up as some other symptom, and as one more bug if this symptom is identified.

Symptoms are traces back to the problem source. Not stopping at the symptom is a good small step toward being better in test investigation skills.