Showing posts with label Investigation. Show all posts

Friday, July 26, 2024

My First Hand Analysis of CrowdStrike Falcon Update Incident


I attempted to analyze the CrowdStrike process dump shared by my friend.  He said there could be an attack which is leading to crashes of Windows OS globally.  This made me curious to look into the dump and learn.

I did not have much context around it, but the test engineer in me did not sit quiet.  I started to analyze the dump information.  Here is my first-hand analysis that I made on 19th July 2024, after 10:30 AM IST.


What I Saw?

  • It is a process dump from Windows OS.
  • It looks like a C or C++ application, going by how the memory offsets appear in the dump.
  • It started to read a memory offset.
  • Then the process witnessed an exception.
    • Here the program could not read further
    • Why could it not read further from this offset?
      • My little experience of testing drivers on Windows OS for a card printer machine made me recall what I had witnessed when testing.


Scratching and Striking My Mind


I started to ask myself these questions as I wondered what could have gone wrong.  I could not stop here as I was curious about what led the Windows machines to crash.  I referred to the web and learned there was an update by CrowdStrike, and then this incident.

Bugs exist in every software no matter the level and depth of testing, automation and engineering excellence.  All software crashes and an OS is no exception.  But, what made the update crash the Windows OS?  Pointing at and blaming CrowdStrike or Microsoft is not the way for a practicing test engineer.  If these two organizations are serving their huge customer bases, they have something working and reliable.  Engineering does not eliminate problems.

By now, I had a thought that it is not an attack.  It is a software bug!  Where is the bug?  What is the bug?  Was it not experienced in the pipeline?


The Open Ended Questions


I had these questions as I analyzed and spoke to my friend.
  • What is Falcon?
  • What was this update to Falcon?
  • How frequently are the updates rolled out?
  • How are the updates rolled out globally?
  • What pipeline do they have for testing?
  • Who is impacted the most in business? Is it Microsoft or CrowdStrike?  Impacted in what way?
  • What is CrowdStrike?  What do they do?  Who are the customers?
  • Where does CrowdStrike's Falcon sit in the OS and what does it do?
  • How does CrowdStrike work on the machines and what does it offer?
  • What does the dump say? Relook into it with different perspectives.
  • How could this have been prevented?
  • How will I prevent this if I join this team knowing this incident?
With these questions, I started to analyze the process dump which was shared.

I had more such questions, but these were the first few that crossed my mind as I started.



Analysis of Process Dump


My interpretation tells me the below for today:
  1. Accept that it is an incident like any other incident that I witness in a production environment.
  2. Do not fall for the speculation happening around.  Remain calm and focus on interpreting and understanding your exploration.
  3. I see it starts to read from an offset and then ends up at a non-existent or invalid offset.  Is it a NULL Pointer?
    • What is a NULL Pointer?
      • A NULL Pointer is a pointer that does NOT point to any memory location and hence does not hold the address of any variable.
      • If I do not initialize or assign a local pointer in C or C++, it will not have NULL as its value; its value is indeterminate (garbage).
      • For example, int *test;
        • When I want to access what the pointer test is pointing to (a location in memory), I will not be sure what is in the pointer when I read it.
          • I may or may not set it later.
          • In this case, the code cannot tell if the pointer is valid or pointing to a garbage memory
        • But, if I declare it like int *test = NULL;
          • I can check if it was set and initialized
        • It is a better practice to assign a NULL value to a pointer during initialization so that we can check if it is NULL or has an address assigned to it.
      • This understanding of pointers makes me think: is it due to not initializing a pointer, and so the error code c0000005 on reading a memory that is not valid?
      • When we assign a NULL value to a pointer, it is a null pointer in C++
        • We assign a null value for testing and asserting
          • If the memory is allocated to a pointer or not
          • If it has a return address and is a valid one or not
          • If a pointer is not initialized, assigning NULL to it prevents problems to a certain extent
    • With this understanding, I also read that it started to read from an offset 0x9c, and then failed.
      • What is 0x9c?
        • In Octal it is 234. In Decimal it is 156.
        • Can there be such an address in a computer's memory? I don't know.
        • If it is an access violation, then is it a memory which is reserved or protected by the OS?
          • If so, the OS can terminate the program or process which is trying to access it.
          • Is this killing the process, aborting the operation of Falcon's IPC, and eventually bringing Windows to a BSOD?
      • This tells me it is not a NULL Pointer in the first place, but a pointer that was not initialized to NULL.
        • I infer, if the pointer was assigned NULL, that is, initialized, there could have been some hint in the state and event when accessing the memory.
          • This is my analysis; but, I have neither seen the test code nor am I aware of the product.  All this inference is based on the process dump and my experience of testing drivers.
      • Did it get something in between from the update (a config or pattern?) which it cannot find and read in the memory?  Why?
        • This indicates to me it could be a bug, that is, a logical problem.  This is my hunch for today!
  4. Data in the dump
    • Exception Address
    • Read from Address 0x9c
    • Exception Code: c0000005 (Access violation)
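
To make the numbers above concrete, here is a minimal sketch in Python.  It converts 0x9c to decimal and octal, and shows one possible reading of the dump, which is an assumption and not something the dump confirms: if a base pointer holds NULL (that is, 0) and the code reads a field at offset 0x9c from it, the address actually read is 0x9c.  The names field_address and looks_like_null_based_read are hypothetical.  The 64 KB limit reflects my understanding that Windows keeps the lowest part of the address space unmapped, so a read there ends in an access violation.

```python
# Address reported in the dump as the location that could not be read.
FAULT_ADDRESS = 0x9C

# Assumption: Windows keeps roughly the first 64 KB of the address space unmapped.
LOWEST_USABLE_ADDRESS = 0x10000


def field_address(base, offset):
    """Address of a field that sits `offset` bytes after `base`."""
    return base + offset


def looks_like_null_based_read(address):
    """True when the address is so low that a NULL (0) base plus a small offset explains it."""
    return 0 <= address < LOWEST_USABLE_ADDRESS


print(FAULT_ADDRESS)       # 156, the decimal value
print(oct(FAULT_ADDRESS))  # 0o234, the octal value
print(hex(field_address(0, 0x9C)))                # 0x9c, a NULL base plus the offset
print(looks_like_null_based_read(FAULT_ADDRESS))  # True
```

A garbage (uninitialized) pointer could hold any value, so an address this small fits a zero base with an offset at least as well.  Only the code and the RCA can tell which one it was.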

Testing my Interpretations


When CrowdStrike as an org caters its SaaS to such a customer base, won't it have a testing pipeline?
  • It will have one; I have no doubt about it.  They test and roll out the updates; I believe that.

Did they witness any such incidents earlier?
  • I searched the web for it and I did not find anything similar on Windows earlier.

Is this a NULL Pointer?  Are you sure?
  • No, I'm not sure.  But, there is something that is leading it to an address which does not exist or is invalid.  I will have to wait for their RCA to know technically what caused this.  But this is my understanding from reading the dump.

How do you think it is a memory access problem?
  • The error code 0xc0000005 says that.
  • I referred to the Driver Easy website for the information because my experience of testing the drivers for Windows OS and experiencing such incidents led me there.  This is where I learned it:
    • https://www.drivereasy.com/knowledge/solved-how-to-fix-0xc0000005-error/

Do you think the programmer would not have handled the obvious Pointer and NULL initialization?
  • I believe there will be a check for the pointer and what it is pointing to.  But is it due to no initialization?  Technically this has to be analyzed, which I cannot do.  I will have to wait for the CrowdStrike team to share the tech details.

Is this a driver problem that killed the Windows kernel?
  • I don't know.  But, the .sys file will not have driver as per my learning.  It will have information about the drivers and any configurations.
  • This incident is a problem which impacted both CrowdStrike and Microsoft.  Maybe both will have their areas to look into and fix, if they see so.  But, in this context, CrowdStrike can fix it quicker and that is much better -- is what I understand.
  • I have been a Windows user for a long time.  I see Windows has worked well in all my contexts so far.  The engineers of Windows OS know better than me here.  I'm not as well aware and informed as they are.
  • CrowdStrike's engineering team is skilled and they roll out updates often in a day.  They have a better pipeline when this is being done.
    • But, the question I have is, how did this happen?
    • No one lets such a problem into production when they are aware of it.  Do you?
    • There is something that has not come to their observation and experience.  What is that?
    • Knowing this will help to prevent this and similar incidents happening in future.
      • I'm waiting to know what did not come to their experience and led to this incident.

What could be in the .sys file of CrowdStrike?
  • I don't know!  I want to learn that.
  • But, from my testing of .sys files and drivers on Windows OS, I learned there could be configuration details with a certain pattern or information to capture at run time, which help the installed software to run.  This is my learning and awareness from my testing.
  • That said, testing at the OS level and testing Anti Virus engines is not obvious.  Testing drivers is like walking through risky mines.  What is sufficient and good enough in test coverage?  It needs expertise at the OS internals level.
  • With Windows OS having such fragmentation in its versions, updates and patches, it is a battlefield full of mines for engineers building such solutions for sure!
  • I learn, the Windows OS stopped when an application tried to access the invalid region or non-existent memory.
    • The update which was rolled out, did it have a configuration or a pattern that showed a logical problem when processing it?
    • I have such questions and thoughts that are striking my mind as I think and build a problem model for the same.

Is this a race condition incident?
  • I see it is not a race condition incident as users across the globe experienced it.

Is this specific to a Falcon version, OS version and hardware?
  • Not all host machines would be on the latest version of Falcon, is my presumption.
  • At least, n-1 and n-2 versions should be on the host machines which experienced this behavior.
    • So it is not specific to a Falcon version, I see.
  • It looks to me as if it is not specific to the Windows OS version and hardware configuration.
    • It is an application software problem which occurred at the driver level, is what I see.
      • This involves IPC communication between processes, is my understanding.
        • The driver can receive the IPC communication in continuous mode.
        • At times, this can get queued based on the application and what it does.


Where is the Problem?


Well, I'm looking at and pulling from my visualization by relating it with my experience of testing the driver on Windows OS.  I don't know the exact reason, nor am I close enough to tell what could have gone wrong.

The process dump says it is accessing a memory that does not exist or is corrupted.  One high possibility is that the starting offset is seen but it is not helping when reading.
  • For example, Ravi has the address of India's Prime Minister's house.
    • But, he does not know from where to start despite having the address.
    • He is void and null in knowing where to start and what to do when he is not initialized with the start location to begin the travel to the Prime Minister's house.
    • In short, he does not know where the address is pointing to and what it has, though he is given an address to start.
      • Can he access the Prime Minister's house premises without any access granted and authorized to do so?
      • If not, won't the police or other security forces stop and arrest him?

Do I Know the Precise Problem?


I don't know!  I do not know the CrowdStrike product and platform.  I'm waiting to read the technical details from CrowdStrike.

I see it comes down to the data, state and event.  I would focus on how to prevent it by learning which data, state and event led to this behavior.  I think of figuring out the Test Design and Strategy that can help me to identify such use cases.  I focus here and see if it can be brought into the automation so that it gets exercised and regressed consistently.

If it is due to the memory access that had a problem, I did such tests when testing a driver for a hardware machine on Windows OS.  I will share the tests that I did in an upcoming blog post.

I wrote the technical analysis from the process dump to CrowdStrike and Microsoft.  I did not get a response.  Anyway, I'm sharing the overall information in a non-technical way so that it is consumable to most readers here.



Note: Here are other threads where I shared my thoughts on the same:
1. https://x.com/testingGarage/status/1814215089525821763?t=XSFdx69ElL0ZmBOcEFrTjg&s=19
2. https://www.linkedin.com/posts/ravisuriya_%3F%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F-%3F-activity-7221156949445206017-oeRa




Saturday, February 3, 2024

Database: Finding the Tables Having Specified Column Name

 

In today's pair testing session with a mentee, we were testing for Database I/O.  We were on PostgreSQL.  One of the questions the mentee had was:

How can I figure out the tables having this column name?

Running through every table and checking whether the column being looked for is present is time consuming.  It is not an approach to take either.

I went through this when I started the ETL testing practice in 2011.

Here is the query that works on PostgreSQL to find the names of the tables which have the specified column name.


Query:

select table_name, column_name
from Information_Schema.Columns
where table_catalog='database_name' and column_name like '%column_name%'


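It is a better approach to know the precise column name and use the condition as -- column_name='EmployeeId'.  Note that PostgreSQL stores unquoted identifiers in lowercase, so the value in the condition has to match the case in which the column name is stored.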


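This query should also work on MSSQL Server.  On MySQL, table_catalog always holds 'def' and the database name is in table_schema, so the WHERE clause has to filter on table_schema instead.  If it does not work on a given vendor, look into the FROM and WHERE clauses for anything vendor specific.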
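
When this lookup is run from test code, the database name and column name are better passed as parameters than pasted into the query text.  Below is a minimal sketch in Python that only builds the query and its parameters.  It assumes a DB-API driver that uses %s placeholders (psycopg2 is one such driver); the function name build_column_search and the sample values are hypothetical.

```python
def build_column_search(database_name, column_name, exact=True):
    """Return (sql, params) that find the tables having the given column.

    exact=True  -> column_name must match fully
    exact=False -> column_name may appear anywhere in the name (LIKE '%...%')
    """
    sql = (
        "select table_name, column_name "
        "from information_schema.columns "
        "where table_catalog = %s and column_name "
    )
    if exact:
        return sql + "= %s", (database_name, column_name)
    # The wildcards go into the parameter, not into the query text.
    return sql + "like %s", (database_name, f"%{column_name}%")


# Hypothetical database and column names, for illustration only.
sql, params = build_column_search("shop", "employeeid")
print(sql)
print(params)
```

With a live connection, the pair would be passed as cursor.execute(sql, params).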



Monday, October 16, 2023

Performance & Tests: Getting Started and Data Analysis

 

On running tests,

  • We will have data (information) as one of the byproducts.
  • Analyzing the data of the integrated sub-systems in isolation and correlation,
    • It will lead us to a technical analysis on each integrated system.
In the report, we draft this analysis along with actions to be taken.

Note: When I say sub-systems, do not ignore or skip the client or consumer; the system does not comprise just the server.


No Golden Rule

There is no one way to do testing.  Likewise, there is no one way or golden rule to test for performance.  It is contextual and depends on what I want to learn.

In fact, in a few contexts, we can have a value adding performance test with just one request.  I just have to be well aware of what it is that I want to know and learn from this test.

That said, there are multiple interfaces where we can observe, analyze and learn from the performance data collected.

The fourth question from season two of 100 Days of Skilled Testing, is:

What are your favorite hacks to analyze performance testing results and find anomalies?

Well, this question does not mention explicitly if it is for server or client or database or caching or messaging or for what interface of a system.  It is a question; but, to me it looks too generic and at some point it looks vague.  Having said this, that is how the learning journey and curve starts!


Result vs Report

What is a result?

  • Is it an evaluation after data [information] is put to scrutiny?
  • Or, is the result data that is collected and not yet interpreted?

It depends on the individual or team and how it is being practiced.

The result is different from a report.


Getting Started and Data Analysis


I should know how the system architecture is designed and orchestrated with its boundaries and interfaces.  This helps a lot.  What kind of architecture is this?  Is it a monolith?  If it is a monolith, my approach to test for performance differs.

If I'm asked to start the analysis of data for a system that I'm not aware of,
  • I will start by analyzing the below indicators on knowing the architecture and the orchestration of the sub-systems for critical business workflows
    1. CPU usage
    2. RAM usage
    3. Data I/O
    4. Network usage
    5. The Heat and sound dissipated from the hardware which holds and binds
      • CPU, RAM, Data I/O, Network and tech stacks installed and configured

It hints to me to look further and investigate with tests, when I observe:
  • Having a steady consumption
    • What is steady consumption in this context?
  • Having a low consumption
    • What is low consumption in this context?
  • Having an unusual consumption spike and a fall of it
    • I follow the pattern to study further
    • What is considered as knee, spike and fall, in this context?
  • Having a zero consumption
  • Having a maximum consumption
    • What is maximum consumption in this context?
Having a high consumption doesn't mean a problem.  Likewise, having a low consumption does not mean all is well.  I have to uncover them to learn what it means in the given context.

In each of these, there will be a pattern.  I will learn them.  I will correlate with other sub-systems and learn what they were doing in the said timeline.
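
As a starting point for spotting these patterns in collected samples, here is a minimal sketch in Python.  It labels each sample as zero, maximum, spike or steady.  The function name flag_samples, the maximum of 100 and the spike_factor of 2 are my assumptions for illustration; what counts as steady, spike or maximum has to come from the context of the system.

```python
from statistics import mean, pstdev


def flag_samples(samples, maximum=100.0, spike_factor=2.0):
    """Label each consumption sample as 'zero', 'maximum', 'spike' or 'steady'."""
    average = mean(samples)
    spread = pstdev(samples)  # how far the samples sit from the average overall
    labels = []
    for value in samples:
        if value == 0:
            labels.append("zero")
        elif value >= maximum:
            labels.append("maximum")
        elif spread > 0 and abs(value - average) > spike_factor * spread:
            labels.append("spike")
        else:
            labels.append("steady")
    return labels


# Made-up CPU usage percentages sampled over a timeline.
cpu = [20, 21, 20, 22, 21, 20, 21, 85, 20, 21]
for position, label in enumerate(flag_samples(cpu)):
    if label != "steady":
        print(position, cpu[position], label)
```

The flagged positions are only where to look next.  The timeline around them is what I correlate with the other sub-systems.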

Do you recollect this line -- "the architecture should provide the Testability"?
  • I wrote about it in one of the blog posts of Performance Engineering.

I refer to the below by traversing with the timeline,
  • The logs by asking for it
  • Data recorded
  • Any APMs that are in place
I correlate all these with above said indicators.

This gives me a start. It is one of the easiest starts that I can have to get started with analysis.


Well, this is to analyze at the server end.  What about the client [consumer] end?  It is simpler and I will share it in the coming blog posts.



Do you want to know more on this and other strategies that can be used contextually?  Let us get connected and converse.  I'm happy to share, and to learn by listening to you.  It is fun and awareness!



Tuesday, April 4, 2023

My Interpretation of localhost and 127.0.0.1

 

Incident

When executing a test, I see the below message 

The target is unreachable, Please make sure your target is up and running

I have multiple Docker images running on my machine.  Most of these images are different products that are bound to run locally.


Debugging and Observations

In this context, I'm using http://127.0.0.1:portNumber as the URL to communicate with an application which is a Docker image.  I see the above said message.

But, I have not mentioned any IP address specifically; and there is no port forward.  This raises a question in me -- What's going wrong here?

I use http://localhost:portNumber and try to communicate with the Docker image. Now the test execution does not show the above said message.

I see the next question in me:

  • Don't localhost and 127.0.0.1 both mean the same on a local box?


Debugging and Interpretation

  1. I see the IP address 127.0.0.1 refers to a loopback interface on a local box
  2. When TCP/IP sees the IP address which starts with 127, it understands this is a loopback request
    • This request does not go out of the local box
    • The response to this request is returned to the top TCP/IP layer of the same local box
  3. But, the local box does not need to have the same and only one IP address, that is 127.0.0.1
  4. The IP of the local box can range from 127.0.0.1 to 127.255.255.255 along with the combination of port numbers which can range from 0 to 65535
    • 0 is the reserved port number in TCP/IP; it cannot be used
  5. localhost is the domain name for the loopback IP address which is 127.x.x.x.
    • This also means, the loopback IP of localhost does not necessarily have to be 127.0.0.1 all time
    • It can be different
    • And, if the port number is used along with the loopback IP address, then 127.0.0.1 can still be used with different port numbers for different applications and its communication
      • Are localhost:5555 and localhost:8888 two different applications on the same IP address?
        • It looks so! Technically and logically as well it looks okay
      • But, using the same IP address with a different port number is a monolith concept, right?
        • Apart from the monolith concept, this is not a good approach
        • But, for now when running multiple Docker images locally and testing locally, this is a state and transition which is more likely to occur
  6. Further I see, the application which is bound to the domain name localhost might not necessarily have the IP address 127.0.0.1 always
    • The IP can range up to 127.255.255.255
    • This indicates to me there is a difference between 127.0.0.1 and localhost
      • That is, localhost and 127.0.0.1 do not mean one and the same every time
  7. I see there should be binding here to the local box's domain name -- localhost
    • This is a differentiator

Given this context, that is, running multiple Docker images locally and testing within a TCP/IP of a local box,
  • I see using localhost [with port number] in the URL is a wise strategy in testing if 127.0.0.1 fails.

Knowing the exact local IP address with the port number should work.  If this fails, we have an incorrect port number or a problem connecting to the port.  
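
To check this on a local box, here is a minimal sketch in Python.  It tells if an address is a loopback address, lists what the name localhost resolves to, and tries a TCP connection to a given address and port.  The function names are hypothetical.  On many machines localhost also resolves to the IPv6 loopback ::1, and an application listening on only one of the addresses would explain why the name works when 127.0.0.1 does not; this is one possible explanation and I have not confirmed it for the incident above.

```python
import ipaddress
import socket


def is_loopback(address):
    """True for any loopback address, for example 127.x.x.x or ::1."""
    return ipaddress.ip_address(address).is_loopback


def resolve(host, port):
    """Return the distinct IP addresses that the name resolves to on this box."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(address, port, timeout=1.0):
    """Try a TCP connection and report whether it succeeded."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


print(is_loopback("127.0.0.1"), is_loopback("127.10.20.30"), is_loopback("::1"))
```

Calling resolve("localhost", portNumber) and then can_connect for each returned address shows which loopback address the application is actually listening on.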

All these are being spoken in the context of
  • A local box
  • Communication between the images on a local box.

It is all about which application on the local box has taken [registered] the domain name localhost.  Or, find the exact IP address on the loopback interface along with the port number used by an application.

How I'm bringing up the servers and what the configuration includes for the IP address and port numbers, is not to be taken casually. Especially on the local box!


Your Thoughts?

If you see my interpretations are incorrect, kindly help me to learn by commenting on this blog post.  We will connect to share and discuss.  I'm open to unlearn and learn!



Note: I have read about the term Small Weight Tests.  These tests have requests and communication which does not go out of the box.  The request and response happen within the box with little or no I/O and CPU operations.




Wednesday, July 13, 2022

IntelliJ and Cache: Maven Dependencies Not Resolved


This post is about the Maven dependencies not resolving.  I'm recording the incident, my understanding, and what worked for me. It can help a fellow Software Test Engineer.


Incident and its Details

I'm using a machine which has the below setup:

  • OS: Windows 10 Pro
  • IDE: IntelliJ IDEA 2022.13 (Community Edition)
  • JDK: 1.8
  • Maven: 3.5.4 

I created a new Maven project and in the pom.xml, I added the below dependencies.  The IDE showed that these dependencies are not resolved.

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.3.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.testng/testng -->
<dependency>
    <groupId>org.testng</groupId>
    <artifactId>testng</artifactId>
    <version>7.3.0</version>
    <scope>test</scope>
</dependency>

<!-- https://mvnrepository.com/artifact/io.github.bonigarcia/webdrivermanager -->
<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>4.3.1</version>
</dependency>

<!-- https://mvnrepository.com/artifact/io.rest-assured/rest-assured -->
<dependency>
    <groupId>io.rest-assured</groupId>
    <artifactId>rest-assured</artifactId>
    <version>4.4.0</version>
    <scope>test</scope>
</dependency>


I observed that the added dependencies were not getting resolved, that is, not getting added to the libraries.  In the IDE, I see the error as this:

  • Dependency 'io.github.bonigarcia:webdrivermanager:4.3.1' not found
  • Dependency 'org.seleniumhq.selenium:selenium-java:4.3.0' not found 
  • Dependency 'org.testng:testng:7.3.0' not found
  • Dependency 'io.rest-assured:rest-assured:4.4.0' not found

Note: I have other Maven projects in my workspace to which I have added the same dependencies but of different versions. Each time I add a dependency, it is downloaded to the libraries in my workspace even if it already exists; this is my configuration in the IDE.

At this point, I did not know what was happening and how to approach resolving this problem.  All I knew was that there are other versions of the same dependencies in Maven's library on my machine.


Understanding the Problem


I tried to understand what it is saying to me.  It says it cannot find the dependency.  It could be that it is looking for the dependency in Maven's library on my box, and it is not finding it.
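
One way to check this hunch is to look for the jar in Maven's local repository.  Here is a minimal sketch in Python that builds the path where Maven would keep a downloaded jar, assuming the default local repository at ~/.m2/repository and the usual groupId/artifactId/version folder layout.  The function name local_repo_jar and the repo parameter are hypothetical.

```python
from pathlib import Path


def local_repo_jar(group_id, artifact_id, version, repo=None):
    """Build the expected path of a jar inside the Maven local repository."""
    # Assumption: the default location is used unless another one is passed in.
    root = Path(repo) if repo else Path.home() / ".m2" / "repository"
    # The dots in the groupId become folders: org.testng -> org/testng
    return root.joinpath(
        *group_id.split("."), artifact_id, version, f"{artifact_id}-{version}.jar"
    )


dependencies = [
    ("org.seleniumhq.selenium", "selenium-java", "4.3.0"),
    ("org.testng", "testng", "7.3.0"),
    ("io.github.bonigarcia", "webdrivermanager", "4.3.1"),
    ("io.rest-assured", "rest-assured", "4.4.0"),
]
for group_id, artifact_id, version in dependencies:
    jar = local_repo_jar(group_id, artifact_id, version)
    print(jar, "found" if jar.exists() else "missing")
```

If the jars are found on disk while the IDE still reports them as not found, it points towards the IDE's own state rather than Maven's library.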

Further, I looked into the web to see if someone else had faced the same incident and behavior.  A few posts said using the "compile" scope solved the problem in their case for WebDriverManager.  But, how can a change in scope fix the dependency not found incident?  This was the question in me!  Out of curiosity, I did try that for WebDriverManager; I did not see any change.


Invalidating the Cache - Is that a Fix?


In the IntelliJ IDE, I invalidated the cache, restarted the IDE and opened the project.  I saw the dependencies being resolved and I did not see the said problem.

I don't know how this fixed the said problem here.  Also, I don't know precisely what exactly the problem is, here.  Invalidating the cache and reopening the IDE (also the project) worked.

On invalidating cache, here is what happens per IntelliJ:
  • Removes the cache files for all projects ever run in the current version of IDE
  • The files will be recreated the next time you open these projects

I remember now that I upgraded the IntelliJ to the latest version before creating the said new Maven project, here.  Later, it resulted in the above said dependency problem. 

Note that, I had created a few Maven projects using an older version of IntelliJ.  I have the same dependencies in the old and new projects, just the version of these dependencies is the only difference.

Now, on invalidating the cache on this new version of IntelliJ, it got fixed.  I understand that the IDE uses the cached instances of these dependencies.  Could there be a relationship between this cache, the dependency versions, and the IDE version?  I'm technically not sure of the same, but my instincts say there is.



Saturday, March 5, 2022

The Never Ending Execution and Code Smell

 

I write this post to share my learning with all Test Engineers and not just be confined to Testers in a group.  I read the below question in the Telegram group - Testing Mini Bytes.  

A fellow Test Engineer in the group asked this question by sharing the method where he experienced a problem.

Question:

This is going on infinite loop can someone let me know why the loop the condition is not working



Pic: The method whose execution is not ending


Here is how I approached analyzing the stated problem and assisted the fellow engineer:

  1. Read and understand the problem description
  2. Understand whose problem it is and why from the description if the detail is available
  3. Know the Programming Language used
  4. Read the code and understand its flow
  5. Look for the debug statements
  6. What is declared and how, why, and what does it hold?
  7. Check if any state exists and what it is
  8. Logic check for condition and flow
  9. Check on maxAttempt count
  10. Syntax check for condition and flow

Now, I look at how the condition check is being done:
  1. The operator used in condition with the operands
  2. Declaration
  3. Assignment
  4. Updating
  5. Referring

In this case, the condition has IF and ELSE blocks.  The IF block has a check on the status and maxAttempt count.  The IF condition has logical AND between these two.  And, the ELSE block has a check on maxAttempt count alone.

What I infer is:
  1. The operator != should not be a problem in this case
  2. Though JS also has !== operator, != should work in this context; !== also works if used
  3. Could be the status is not yet COMPETED or COMPLETED
  4. But, the maxAttempt will be less than 5 in the first check
  5. maxAttempt is incremented
  6. And, the same method having this IF and ELSE block is called
    1. Doing so, the maxAttempt will be reset to zero
    2. Note that maxAttempt was incremented before resetting it to zero
    3. In this method, maxAttempt will always be zero that is less than 5
  7. Irrespective of whether the status is COMPLETED or not, this execution will run into a loop


Without debug prints or console logs for status and maxAttempt, it won't be obvious why this execution keeps going and does not stop.  It is a kind of recursion that is not intended to be here.  Until the system runs short of resources, this recursion continues.
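
The picture of the method is not reproduced here, so below is only a minimal sketch in Python (the original is in JS) of the flow as I read it from the description above.  The names poll_buggy, poll_fixed, get_status and call_budget are hypothetical; call_budget exists only so that the demonstration ends.  The ELSE branch is left out because its body is not described.

```python
MAX_ATTEMPTS = 5


def poll_buggy(get_status, call_budget):
    """The counter is declared inside the method, so every call starts from zero."""
    attempts = 0  # reset on each call: the increment below is lost
    if call_budget == 0:
        return "still looping"  # demonstration guard; the real method has none
    if get_status() != "COMPLETED" and attempts < MAX_ATTEMPTS:
        attempts += 1
        return poll_buggy(get_status, call_budget - 1)  # calls itself
    return "stopped"


def poll_fixed(get_status, attempts=0):
    """The counter travels with the call, so the limit can be reached."""
    if get_status() == "COMPLETED":
        return "completed"
    if attempts >= MAX_ATTEMPTS:
        return "gave up"
    return poll_fixed(get_status, attempts + 1)


def never_done():
    return "PENDING"


print(poll_buggy(never_done, call_budget=50))  # still looping
print(poll_fixed(never_done))                  # gave up
```

Passing the attempt count as an argument is one way to fix it; a plain loop with the counter declared outside the loop body would do the same without any recursion.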

Though I see the condition check logic can be written in other approaches, I did not get into how it can be done.


Learnings

  • Use the debug statements or console logs; it helps in debugging and understanding what's happening
  • Code Smells
    • Multiple Points of Failure
      • The place where the maxAttempt is declared and initialized
      • How the method is being called within it
    • Long Method
      • Though the method is simple in what it does, it involves multiple operations
      • Also, the method calls itself within it
        • It complicates by calling itself; breaking this method is good
        • One method one task
The method looks simple, but how it flows within by resetting maxAttempt to zero is not.  And, its recursive nature makes it hard to follow when there are no debug statements.




Monday, January 3, 2022

The Automation Strategy Problem; Not an Appium Challenge

 

In The Test Tribe's forum, I read the post which described the problem as in the below paragraphs and picture.  On looking into it, I learned this can be made into a blog post that tells a strategy for automation.

Maybe, 10 years back I would have asked the same question.  That's a learning curve.  Today as well, I end up thinking for a while, asking myself -- how to test it and how to automate it.

I want to share how this problem can be looked at from the perspective of testing and automation, and then approach it to automate.

 

Folks. I've two issues on Appium automation which needs your help. 

1. I'm working on a ecommerce website where a payment method is integrated (lets take the example as PhonePe). When i try to place the order in the mobile website with payment method as PhonePe, the payment method app will be opened and I've to complete the payment using it and I'm navigated back to the browser. Issue is - How can i switch context between the mobile browser and the app? I tried using driver.startActivity() but on performing any other actions, it errors out. 

2. Since i need to use the browser to place order and the payment using the payment app, I tried to set up the driver instance with browserName and app as the desired capabilities together. But on running the test, it errors out - browserName and app can't be used together. How can i approach this problem? Anyone who has automated such flows?

Apologies, i'm pretty new to Appium and so, please excuse my ignorance.


Picture: Problem Statement - Description of Scenario & Challenge


Understanding the Scenario and Functional Flow

I observe the below in the said scenario:

  1. It is a website; it also has a mobile website
  2. It has got a payment option integrated
  3. The Appium's Desired Capabilities defined has browserName and the app
    • browserName -- name of the mobile web browser used in automation; it is an empty string if automating an app
    • app -- the path of an app to be automated
  4. When using a mobile website on a mobile device -- assuming it is a mobile web app
    • On selecting a payment app -- assuming it is a native app
      • The context changes to payment app UI
      • On completing the payment, the context changes to the mobile website



Challenges Described in the Functional Flow


I see these as challenges:
  • How to handle this said scenario in automation using Appium?
  • How to switch context between mobile browser and the mobile app?
  • Using driver.startActivity(), it yields an error on performing any other actions
    • On making any actions on UI after using the above said method, the error is observed
      • Reading the description, it is said that the error is thrown when running the automation
        • And, when changing the context back to mobile website from payment app
The driver.startActivity() takes two arguments -- the app's package name and the activity to be started.  What is passed for the package name and the activity name is not clear from the problem description.

If the mobile browser is used to launch the mobile website and mimic the action, what is passed as the app's package name and activity in driver.startActivity()?  This is not mentioned and is unclear to me.

Also what is mentioned for the browserName and app in desired capabilities is not clear.
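
The constraint reported in the problem statement, that browserName and app cannot be used together, can be pictured with a small check.  This is only a sketch in Python and not Appium's own validation code; the helper name check_session_target, the file path, and the sample values are hypothetical.

```python
def check_session_target(caps: dict) -> str:
    """Return which target a capability set describes: 'browser' or 'app'.

    Hypothetical helper; it mirrors the reported rule that browserName
    and app cannot both be set for one session.
    """
    has_browser = bool(caps.get("browserName"))  # an empty string counts as unset
    has_app = bool(caps.get("app"))
    if has_browser and has_app:
        raise ValueError("browserName and app can't be used together")
    if has_browser:
        return "browser"
    if has_app:
        return "app"
    raise ValueError("neither browserName nor app is set")


# Sample capability sets; all values are placeholders.
web_caps = {"platformName": "Android", "browserName": "Chrome"}
app_caps = {"platformName": "Android", "browserName": "", "app": "/path/to/shop.apk"}
mixed_caps = {"platformName": "Android", "browserName": "Chrome", "app": "/path/to/shop.apk"}

print(check_session_target(web_caps))   # browser
print(check_session_target(app_caps))   # app
```

Passing mixed_caps to the helper raises the error, which is the situation described in the second issue of the problem statement.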



A Common Use Case


In recent years, this has become a common use case for mobile native apps that have a web view and for websites that have payment transactions.  For example, when making a payment in a native app, a web view of the payment gateway shows the list of payment choices.  On successful payment, the view switches from the web view back to the native view.



Questions on Reading the Problem Statement:


I have the below questions on reading the problem description:
  1. Why did it throw errors on any action after calling driver.startActivity()?
    • driver.startActivity() will start an Android activity using the package name and the activity name
  2. Is the context not picked correctly on switching from web view to native and then back to web view?
    • But it is a mobile website, which means it is opened on a mobile browser, right?
      • Nowhere is it mentioned as a Hybrid app i.e. the mobile website installed as an app
    • Does this mobile website maintain its context when switching to a native app (payment app), and then changing the context back to the (web view) mobile browser?
This takes me to seek clarity for:
  1. Is the mobile website an installed Hybrid app? Or, is it a regular website which also has a mobile website and is accessed on a mobile browser?
  2. Is it possible to switch the context of web page from mobile web browser to native app, and vice versa?
    • I need to explore it; I'm unsure of it
    • Reading the desired capabilities, it looks like this can be done
      • That is context switching of mobile web browser to native app, and back to mobile browser from native app is possible
      • I need to explore on the same to be very sure of it

Code Snippets for Context Switching


Refer to this page for details on using the Web view with Appium.  The below code snippets show how to find the contexts of the web and native views, and how to switch to them.
Snippet illustrating the change of context to Web view

Snippet illustrating the change of context to Native view
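
A minimal sketch in Python of the two switches, assuming a driver object that exposes contexts and switch_to.context(...) the way the Appium Python client does.  The helper names and the FakeDriver stand-in are hypothetical; the stand-in exists only so that the sketch runs without a device.  Note that this switches between the native and web view contexts of one session; by itself it does not bring another app to the foreground.

```python
from types import SimpleNamespace


def switch_to_webview(driver) -> str:
    """Switch to the first web view context and return its name."""
    for name in driver.contexts:          # e.g. ["NATIVE_APP", "WEBVIEW_com.example"]
        if name.startswith("WEBVIEW"):
            driver.switch_to.context(name)
            return name
    raise RuntimeError(f"no web view context in {driver.contexts}")


def switch_to_native(driver) -> str:
    """Switch back to the native context."""
    driver.switch_to.context("NATIVE_APP")
    return "NATIVE_APP"


class FakeDriver:
    """Hypothetical stand-in for a real session; it only records the context."""

    def __init__(self, contexts):
        self.contexts = list(contexts)
        self.current_context = "NATIVE_APP"
        self.switch_to = SimpleNamespace(context=self._set_context)

    def _set_context(self, name):
        if name not in self.contexts:
            raise ValueError(f"unknown context: {name}")
        self.current_context = name


driver = FakeDriver(["NATIVE_APP", "WEBVIEW_com.example.shop"])
switch_to_webview(driver)   # current_context is now the web view
switch_to_native(driver)    # and back to NATIVE_APP
```

With a real session, the list returned by driver.contexts is worth printing first, because the web view name depends on the app or browser under test.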



But, What's Actually the Problem?


If automated as described in problem statement, do we end up in a problem?  I see, yes we will end up in a problem:
  1. We need to maintain our automation to make sure it can drive the payment app UI at any time
    • If the UI of the payment app changes, we need to maintain the code
  2. Do we have a stage environment of the payment app in this case?
    • If we test the mobile website in stage and make the transaction in the production payment app,
      • Can we continue like this in each test iteration?
      • If yes, how long can we continue to use the production payment app and pay?
      • Will there be any transaction fee charged each time by the payment app?
        • Can this become a financial cost to the business and client or to stakeholders?
        • What other cost should I bear for using this approach?

I need to know:

  1. What is it that I want to learn from automating this use case or scenario?
  2. What would be the impact if the test did not help me to learn what I want to learn from automating this use case and scenario?
  3. Should I be testing the payment app along with my app? 
    • As I write UI automation to handle the web view of payment gateway and then the native payment app, it becomes part of this test.  Should I do that?
  4. What information, risk, and problem discovery do I miss if I do not automate the payment app flows?
    • Is it okay for the business and product, if I miss any information here or if I do not test the flow in the payment app?
    • How to arrive at this decision?
The decision here needs to be rational.  But, being rational alone may not always help.  Can I be reasonable here when I'm deciding, or when I'm influencing stakeholders who are deciding?



This is an Automation Strategy Problem!


Seen closely, this is first of all not an Appium problem.  It is a problem of what to automate, how to automate, when to automate, how much to automate, and why automate.  That is, it is a problem of automation strategy: how to approach and execute it.

To me, it is a problem to solve in the approach to and execution of automation for the payment transaction, and not a problem of automation library usage and implementation.



How can I Approach the Automation Here?


I will first learn whether this payment scenario should be automated on the UI layer at all.  If yes, why?  Then I will have the below questions:
  1. Can I use the developer APIs of payment service to test and complete the transaction?
    • If yes, then
      • Can I use the stage APIs of payment to simulate the transaction flow and its completion?
        • If I just use APIs, I will not know what's the functional experience of transactions in native payment app.  Is this okay?
  2. I and the product I test, do not have control over the payment system and its apps
    • When I have no control over it at any point in time, should I test it as part of my system?  If I did so, should my product as well include the probabilities and complexities of payment system?
      • Having this information is good!
      • But what can I do with that information?
        • Do I have an authority to change or fix payment system with that information?
        • If yes, good; if no, then is the time and resource spent on this a valuable return to my stakeholders and their business?
    • It is wise to mention that I'm not including and testing the payment system and its transactions as a part of my system
      • Because my system does not have a control over payment system in any means
  3. If the API that is used for initiating transaction is functional and usable, then I do not have to worry technically from functional perspective of transactions
    • We will have to work on -- if the payment initiating web view is functional on my native app and in my website or a mobile website
      • From here the control of payment and any transaction problem that arises are in the realm of the payment system
  4. In the test report
    • I will include the stage payment API request and its response with data
      • Talking to payment app organization, we may get the developer API access on stage to test our system on their stage
      • Talk to payment app organization!
      • Also we can mock the payment API to an extent and in the test report say this is a mock result
        • If relied on mock, then we can miss the change in payment system
        • I will have the mocking as last approach just to complete a business flow and it will not be my pick unless someone wants to see a business flow completion in a test
    • Have a test that tells about the functional and usable aspects of the payment page in -- a mobile website and the payment web view in native/hybrid app
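
The reporting idea in point 4, where a mocked payment result is called out as a mock in the test report, can be sketched as below.  Everything here is hypothetical: PaymentResult, MockPaymentApi, initiate, and the report format are placeholders and not any payment provider's API.

```python
from dataclasses import dataclass


@dataclass
class PaymentResult:
    order_id: str
    status: str   # e.g. "SUCCESS" or "FAILED"
    source: str   # "stage-api" or "mock"; this goes into the report


class MockPaymentApi:
    """Hypothetical mock; returns a canned response without calling anyone."""

    def __init__(self, canned_status: str = "SUCCESS"):
        self.canned_status = canned_status

    def initiate(self, order_id: str, amount_paise: int) -> PaymentResult:
        # A non-positive amount is rejected so that a failure path exists too.
        if amount_paise <= 0:
            return PaymentResult(order_id, "FAILED", "mock")
        return PaymentResult(order_id, self.canned_status, "mock")


def report_line(result: PaymentResult) -> str:
    """Format one test-report line that states where the result came from."""
    label = "MOCK RESULT" if result.source == "mock" else "STAGE API RESULT"
    return f"[{label}] order={result.order_id} status={result.status}"


result = MockPaymentApi().initiate("ORD-1", 49900)
print(report_line(result))   # [MOCK RESULT] order=ORD-1 status=SUCCESS
```

Keeping the source in the result means a reader of the report cannot mistake a mocked business flow for a transaction that reached the payment system's stage.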


Benefits of this Approach

  1. I and my tests will have clarity on what is in my control and what is not
  2. When I have control, the test and automation can be well maintained
  3. The flaky areas can be identified; I can come to a decision to eliminate them from the automation or not
  4. It helps to identify what is my problem and what is the problem that I don't own in terms of authority
  5. With this approach, the tests and automation provide clarity when we uncover a risk or problem
While I know the benefits, I must also know the cost of having this approach.


Sunday, December 26, 2021

Before Identifying and Listing My Tests

 

I read the below query in TTC's Telegram chat. The discussion had started on this thread and fellow members here were responding.  Further, I read this line and it made me look into it -- "The question was we have to use valid username and password..and perform a negative testcase".



The Default Thinking and Applying Interface

Including me, I see that it is subconsciously common for us to visualize a problem statement in terms of a Graphical User Interface.  When I ask why it is so, maybe it is rooted in our subconscious thinking i.e. in first-order, second-order, or any order of thinking.

I want to try approaching it by reminding myself of, and asking myself, the below questions:

  1. Is it a GUI specific problem?
  2. Is it a problem that is tied to the context of GUI?
  3. What does this question encapsulate within and open as an interface?
  4. What forms do these interfaces take when I stand out of specific interface?
  5. Should I stick to one interface to learn and attempt this problem?


Identify the Tests and Framing of Tests

We test to learn

  • Does the system do what it is supposed to do and how, why, and when?
  • When does the system not do what it is supposed to do, and how and why?
Should I call these Negative Tests?  That is not what I discuss in this post.

To me, these are tests that help me to learn when the system responds and behaves in the other way than I expected.

I can start to identify the straight use cases for inputting an error (a human-introduced error) at a given state/data/event; then look for the behavior of the system.  It is good when we can keep identifying and ideating the use cases.

We get limited by use cases as we continue to think only in use cases.  That said, for sure we will identify and frame the tests within the identified use cases.  But, we need tests that help us learn when the system fails in doing what it is supposed to do.

To supplement it there is another way, which I use.  I do not say this is the only way to supplement.  I use multiple approaches to supplement and identify the tests.  When I do so, I ask the question to the system with the help of these tests and evaluate the response of the system.


Questions to Identify the Priority Tests


I learn and understand the system each time, to identify the better tests.  And, each time I learn something new about the system that I did not know.  

When I'm asked a question in the interview, I ask for details that help me to test better or to demonstrate my deliverable better.  I will watch the questions that I ask!

If I were the candidate who got this question in an interview, I would ask the below questions.  When I learn this is good enough for the initial tests, I will pause with questions.  I move to identify and frame the tests using the responses I got for the questions that I asked.  

These questions will surely help me to be precise and close to the context that better demonstrates my testing skills.  If it is not close, then there is a problem (or a difference) in my presenting and expectations in the interview.  I will have to address it with the help of the interviewer.

Questions:

  1. What is the interface where I'm entering the username and password?
    • Where is this authentication used?
    • On UI (if so which UI), or CLI, or touch interface, or what is its interface type?
    • At which layer of the system this authentication is used?
  2. What is the format of username and password?
  3. What is used as Authorization identity on successful authentication?
    • What happens if my authentication is not successful in the UI you want me to test?
    • How do I understand that UI is communicating to me that my authentication is not successful?
  4. How is this authentication processed?
  5. Where is the authentication mapped to authorization and stored for reference?
  6. What protocol is used to communicate in authentication?
    • What protocol and communication order is used to grant and revoke authorization?
  7. Who uses this authentication and authorization?
    • To know the different means of doing the same
  8. Is there any other form of authentication that grants me the authorization?
    • Do these different entry points of authentication update my authorization?
    • Will I have different authorization data to authenticate? If yes, how the data, states, and events are maintained for my authentication and account?
  9. What's the language and Unicode supported by this system?
    • Will the languages and Unicode used in the system have any impact when I try to authorize by changing the language and Unicode?  How does the system understand these differences and maintain one state of data with authorization?
  10. Are there any computing differences for authentication and authorization on big-endian and little-endian machines?  If yes, how and for what context of the system's behavior, processing, and decision?
  11. Where and how are the authentication and authorization details processed, stored, and presented back?
    • Is there any specific reason for doing it in this particular way?
    • How have you strengthened the authentication process to grant the authorization?
      • For example, 1FA, 2FA, nFA, what else?
  12. Does any other system use your authentication to authenticate and authorize?
  13. Do you use SSO for authentication and authorization?
  14. What testability layer do I have that I can make use of to support and identify the tests?
    • Does this testability layer help me to identify more tests and also classify them?
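
Question 9 can be made concrete with a small sketch.  The same visible username can be typed as different Unicode code point sequences; whether a system treats them as one account is something to ask and test.  The sketch uses Python's standard unicodedata module; the sample username and the helper name same_after_nfc are made up, and NFC is only one of the normalization choices a system might make.

```python
import unicodedata

composed = "jos\u00e9"      # 'é' as one code point (U+00E9)
decomposed = "jose\u0301"   # 'e' followed by a combining acute accent (U+0301)


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                # False: the raw code points differ
print(same_after_nfc(composed, decomposed))  # True: equal once normalized
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 5 6
```

Both strings look identical on screen, so a test that enters a valid username in the other form is one way to learn how the system compares and stores it.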

I can keep generating questions like this.  But I will have to pause and start working on what the questions offer me.

With the help of these questions, I can learn better about the system before attempting to identify the test and frame it.  This also pulls out the risk or problem area if any that looks important and of priority.

I have eased my work to an extent when I know:
  • the target surface area to start my work 
  • what it takes and brings back, and how

In this context, I would have started this way!



Wednesday, August 18, 2021

I Can Test - Debugging an Inconsistent Behavior

 

Before I Say, I Can Test

In this post, I'm demonstrating how I approached my testing and debugged an inconsistent behavior that was reported in the Telegram space of The Test Chat (TTC).  The Test Chat is hosting a contest that is about to start.  The title of the contest is - So You Think You Can Test?  The registration is open.  The details on how to register are shared on the Telegram chat and other social media spaces of TTC.

The QR Code is shared in Telegram; asked to scan and submit the registration.  Here is what I observe reading through messages in the Telegram:

  1. A few could scan the QR Code and could register
  2. A couple of members could not see the registration form as the scanning of the QR Code failed
  3. Requests were made to share the URL of the registration form rather than the QR Code
  4. Requests were made to share both -- the QR Code and the URL of the registration form
  5. The reason -- why the QR Code is shared and not the URL
  6. And the URL of the registration form is also shared now

What made me curious is, a member had replied that on multiple attempts to scan the QR Code using a mobile app, it did not fetch the URL.  This member observed the same behavior on the web, that is, on uploading the QR Code image, it did not fetch the URL.

I now see a behavior to test, investigate, and debug to learn what's happening.  With that, I have an opportunity to say I Can Test right here, on the registration procedure of the So You Think You Can Test? contest.


What I did and What I Observe

It is a QR Code!  

  • This QR Code shared from TTC:
    • Is not like the regular QR Codes I usually see
      • I see a black background and data in a yellow foreground
      • I see the Finder Pattern i.e. the concentric squares, drawn in an oval shape
        • These two observations are prominent in this QR Code
  • I installed a QR Code reader app on my phone and scanned the QR Code
  • It fetches me the registration form URL; I can open the registration form
  • Then I get the question -- why those testers are experiencing a problem?
  • I read through their messages and look for the clues they have left for me
  • I read the words -- drag and drop
    • Ah! The web browser is used as well
    • This is a very useful clue to me
    • I have no idea what desktop browsers, QR Code reader websites, mobile apps, and smartphones are used by these testers
  • I proceed now to use an online QR Code reader
    • I pick these two:
      • qrreader dot online
      • helloacm dot com/tools/qrcode-reader/
  • I uploaded the QR Code shared by TTC on these two websites
  • I see the same message and in the same format on these two websites
    • The message reads -- error decoding QR Code
    • Per me, this is a key observation
      • What makes these two web pages show the same message and in the same format?
  • I analyze the network when uploading the image and for the response I receive
    • In the request
      • I see the Data URL used
      • Protocol mentions data
      • No remote address, that is no server IP to which the request is to be sent
        • Another critical observation
      • I see the request initiator
      • The data (jpeg image) is sent in base64 format - a text encoding of the binary data
      • I can see the preview of the QR Code
      • I see the request method as GET
        • This is interesting!
        • Why GET and not POST?
      • I see the HTTP Status Code 200
      • I see just the User-Agent in the Request Headers
    • In the Response
      • In the Response tab, I see the message -- This request has no response data available
      • I see Content-Type: image/jpeg in the Response Headers
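
The request observations above fit how a data URL works: the image bytes are carried inside the URL itself as base64 text, so there is no server to contact.  A minimal sketch with Python's standard library; the bytes below are a stand-in and not a real image, and the helper names are hypothetical.

```python
import base64


def to_data_url(data: bytes, mime: str = "image/jpeg") -> str:
    """Build a data URL that embeds the bytes as base64 text."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def from_data_url(url: str) -> tuple:
    """Split a base64 data URL into its MIME type and the original bytes."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)


fake_image = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b"not-a-real-jpeg"
url = to_data_url(fake_image)
mime, decoded = from_data_url(url)
print(url[:23])               # data:image/jpeg;base64,
print(mime)                   # image/jpeg
print(decoded == fake_image)  # True
```

Everything needed to rebuild the image is in the URL string, which matches seeing a GET with the data protocol and no remote address.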

As I see no remote address for this request, I turned off the network.  I uploaded the QR Code image; I see the data URL fired.  Further, I observe this request is identical to the one I see with the network on.  So I learn,
  1. Fetching the URL from the QR Code is being done within the browser
  2. I have to just launch these pages and use the QR Code images to fetch the data out of it
    • No need for internet here
    • And moreover, there is no remote address at all; then why is the internet needed to upload images!
  3. This tells me, JavaScript could be doing the job here!
    • It is a key learning from my observations so far
Now, I look at the Console to see if I can find more hints to my test investigation.



Diving into Console and JavaScript


In the Console, 
  1. For qrreader dot online, I see:
    • Couldn't find enough finder patterns (found 0)
  2. For helloacm dot com/tools/qrcode-reader/, I see
    • Couldn't find enough finder patterns

Pic: Message on qrreader dot online



Pic: Message on helloacm dot com



This is the source of the problem -- the QR Code could not be decoded.  If the Finder Patterns are not identifiable in the QR Code, then data in the QR Code cannot be decoded.  I see "found 0" in the console log.

But, why are the Finder Patterns not identified in this case, though they are seen in the image by human eyes?  This is the start of the actual test investigation and debugging for the behavior experienced.

Further, I learn both these websites make use of the same JavaScript -- llqrcode.js.  Another key learning!  I see this JavaScript is copyrighted to Lazar Laszlo.  And, I found another website that scans the QR Code image -- webqr dot com.  I experience the same behavior on uploading the TTC QR Code here as well, that is the message -- error decoding QR Code.  The same text and in the same format!



Pic: Message on webqr dot com



Reading through the below JavaScript, I make a few more observations.


Pic: llqrcode.js

I learn:
  1. When the image is about to be decoded
    • It is taken as a 2D image
    • The height and width of the QR Code image are collected and calculated
      • The check is made if they are appropriate to consume and process further
  2. In the process function,
    • The QR Code is converted to a grayscale image
      • The grayscale is the usual one that we see around us in black-and-white
  3. Now trying to look for Finder Patterns,
    • Looks like it is executing the condition if (h < 3)
      • So the message in console -- Couldn't find enough finder patterns
    • As a result, the decoding of QR is returning message -- error decoding QR Code
This information tells me, it has something to do with the QR Code shared by the TTC.  But, how come it works on smartphone apps?

I have not attempted to fetch the code from the smartphone app and analyze it at this point in time of testing.  I make an assumption -- the program used in the smartphone app may be able to identify the Finder Patterns irrespective of the color and shape in the QR Code.

I had made an observation documented in the beginning -- the Finder Pattern is in oval shape and not in the concentric square shape.
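
The decoding steps read from the JavaScript above can be pictured with a small sketch: convert pixels to grayscale, threshold them to dark or light, and scan a row for the 1:1:3:1:1 dark-light-dark-light-dark run that a square Finder Pattern produces.  This is an illustration in Python and not the llqrcode.js code; the luminance weights, the threshold of 128, the tolerance, and the sample rows are my assumptions.

```python
def to_gray(rgb):
    """Luminance of an (r, g, b) pixel using the common BT.601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def run_lengths(row, threshold=128):
    """Collapse a row of RGB pixels into [is_dark, length] runs."""
    runs = []
    for pixel in row:
        dark = to_gray(pixel) < threshold
        if runs and runs[-1][0] == dark:
            runs[-1][1] += 1
        else:
            runs.append([dark, 1])
    return runs


def count_finder_candidates(row, tolerance=0.5):
    """Count dark-light-dark-light-dark runs close to the 1:1:3:1:1 ratio."""
    runs = run_lengths(row)
    expected = [1, 1, 3, 1, 1]
    found = 0
    for i in range(len(runs) - 4):
        window = runs[i:i + 5]
        if [dark for dark, _ in window] != [True, False, True, False, True]:
            continue
        lengths = [length for _, length in window]
        unit = sum(lengths) / 7.0   # the pattern spans 7 modules
        if all(abs(n - e * unit) <= tolerance * unit
               for n, e in zip(lengths, expected)):
            found += 1
    return found


BLACK, WHITE, YELLOW = (0, 0, 0), (255, 255, 255), (255, 255, 0)


def finder_row(foreground, background, scale=2):
    """One pixel row through the middle of a square finder pattern."""
    f, b = foreground, background
    modules = [b, f, b, f, f, f, b, f, b]
    return [color for color in modules for _ in range(scale)]


print(count_finder_candidates(finder_row(BLACK, WHITE)))   # 1
print(count_finder_candidates(finder_row(YELLOW, BLACK)))  # 0
```

In this sketch, a yellow-on-black row yields no candidate because yellow becomes light and black stays dark after thresholding, so the ratio never lines up.  Whether llqrcode.js reports "found 0" for this reason, for the oval shape, or for something else is not established by the sketch.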



Testing the Tests


I picked the QR Code image shared by the TTC and converted it to a grayscale image.  I used this grayscale image in the three websites mentioned above.  I see the same message -- error decoding QR Code.

I picked the registration form URL that I had got by scanning QR Code using the mobile app.  I generated the QR Code and uploaded it to these three websites.  I see the QR Code decoded successfully now; I see the Google Form URL to register.  Note that, if I turn off the internet and use a valid QR Code, I see the URL.

This tells me, there is no problem with Google Form URL accessibility or encoding or decoding.  It is to do with the QR Code shared by TTC.


Pic: QR Code shared by TTC



Pic: Grayscale QR Code generated by me


Pic: QR Code generated by me with Google Form URL




Pic: QR Code on qrreader dot online on uploading TTC QR Code




Pic: QR Code on webqr dot com on uploading TTC QR Code


The dimension and size of the QR Code image file shared by TTC are not the same as the one generated by JavaScript.  I see pixelation and a bit of distortion in the QR Code generated by JavaScript using TTC QR Code.  

Look at the Finder Patterns in the QR Code from JavaScript; compare them with those in the QR Code shared by the TTC.  They don't look the same.


Understanding the QR Code


QR stands for Quick Response.  The QR Code was invented by Masahiro Hara at the Japanese company Denso Wave in 1994.  A QR Code has different sections and the Finder Pattern is one of them.

I find information on these web pages useful:
  • https://www.explainthatstuff.com/how-data-matrix-codes-work.html
  • http://qrcode.meetheed.com/question14.php
  • http://www.keepautomation.com/tips/qr_code/functions_of_qr_code_function_patterns.html.


So, What's the Problem Here?


From the inferences I'm making from my tests so far, it looks like
  1. The QR Code from TTC has data (shape and color) that cannot be processed by this JavaScript?
    • Not very sure!
    • But the analysis so far, from the code I read, says yes
  2. Need to generate more customized QR Codes
    • If possible, include Finder Patterns in different geometric shapes -- primary suspect
    • Then rule this in or out as the problem source
    • If this is not the problem, then
      • Are the dimensions of the QR Code image file the problem?
        • For now, this is the second suspect
          • But, the JavaScript code I read does not say this
Need to understand how the mobile app code can read it successfully.  Then figure out the differences between the mobile app code and the JavaScript referred to here.

I'm stopping my testing for now.
I can test! I test!