Showing posts with label Investigation. Show all posts

Tuesday, March 17, 2026

45 Seconds of Confusion: When a Familiar GUI Fails the Human Eye

 

In meetings, we often hear the same line:

"That's not a bug.  Report it as an enhancement."

Sometimes the observed behavior never even makes it to the enhancement list.  

But what happens when the problem is not about functionality, but about how users experience the GUI and its usability?

My peer Dhanasekar Subramaniam (DS) recently published a blog post about a UI design that slowed him down while using an app.  This made me curious.  How could a UI that an engineering team had accepted and used slow down a user?

I decided to test and investigate the design.  On testing and analyzing the UI behavior and usability, I discovered something interesting -- the GUI looks as it is coded, but behaves differently to the human eye.

I ran into the same usability and experience problem, but I was already aware of this behavior, so I could identify it quickly.

If you are an SDET or a Test Engineer, this blog will help you develop new perspectives while analyzing GUI problems.  If you are a manager or decision-maker, it highlights why seemingly small GUI problems should not be ignored.



When a Simple Task Creates Anxiety

Late at night, at 11 PM, the user opened the cab booking app Rapido, just as they had done many times before.  The goal was simple -- book a cab and reach the bus stop.

But something unexpected happened.

This time, the user could not figure out how to book the cab.

Seconds started passing.  The GUI looked familiar, yet the action to book the ride was not obvious.  More than 45 seconds went by as the user tried to understand what to do next.

The situation made it worse.

It was 11 PM, the bus departure time was getting closer, and the user did not know how to proceed because of the app's GUI.

That moment -- when the user knows the app, knows the task, yet cannot complete it -- creates anxiety.

So the question is,

Why did two tech-savvy users, using an iPhone and familiar with the app, spend more than 45 seconds trying to figure out how to book a cab?


45 seconds for a task that usually takes less than 5 makes the problem feel bigger instantly.



Understanding the Cause of Anxiety


Here is how I started to learn and understand,
  1. I installed Rapido on an Android phone.
  2. I have no ride history with Rapido.
  3. I signed in for the first time.
  4. I see the Ride screen.
    • I see three addresses listed which I had not chosen.  I could save these as favorites; I did not.
On the Ride screen, I do not see where to enter or select the pickup location and destination.  


TL;DR -- In short, here is what caused the confusion that led to anxiety.
  1. The text in the search text field.
  2. The color of the search text field.
  3. The color of the view showing the three addresses.
  4. The color of the space between 2 and 3.
  5. The user is not able to identify that it is a tappable search text field.

What to fix?

  1. Rephrase the search text.
    • "Enter pickup location" works like a charm; refer to Pic-4 in this blog post.
  2. Use better color contrast between the three GUI elements.
    • The GUI colors should differ by ΔE ≥ 3  -- good and preferred
  3. Highlight the search text field so that I am prompted to tap it and enter or select the pickup and drop locations.
  4. When experimenting with A/B test configs, the GUI design should follow the suggested GUI Design & Color Engineering practices.
In usability and user experience, what is not noticed is as good as not present.

Continue reading the sections below for detailed information on the usability and user experience problems.  

If you want only the technical analysis, move to the sub-section -- Why It Fails - Mathematical Analysis and Human Brain.  It explains why the present Rapido app's GUI design and colors confuse the human brain and eye.  

No wonder the users became anxious when booking the cab at 11 PM!




The First Usability Pitfall in the GUI


Now read on with close attention.
  1. I looked at the top of the screen.  I see text -- "Where are you going?".
  2. Below the text, I see three listed locations that I had not chosen or entered, and that are of no current interest to me.
Ah! That confused me.  Why?

I looked closely at my mobile screen, that is, the Ride screen.


Pic-1:  The confusing text and three locations displayed
  1. I see a search text field.
  2. I see a search icon next to the text field.
  3. The search text field has a text -- "Where are you going?"

This is the first usability pitfall in the confusing GUI.

Why am I asked where I am going while being shown three locations that I did not enter or choose?





The Second Usability Pitfall in the GUI


In the image below, Pic-2, I see,
  1. There is no prominent visual difference in the layouts of
    • Search text field
      • Color hex code #FFF8FAFC
    • The three locations displayed
      • Color hex code #FFF6FAFF
  2. Yes, my brain could not perceive the differences right away between these two layouts.



Pic-2: The color contrast of the GUI elements.


The colors of these two layouts are almost the same.
  • This added to the confusion.
  • My brain was perplexed about what was happening here.
  • I'm wasting time here to learn how to book a cab.
    • Is this what Rapido needs as a business?
    • Or, does it need a user to book a cab right away on opening the app?
    • Won't this experience drive the user away to the competitors' apps -- Ola, Uber, Namma Yatri?
If my brain cannot perceive the differences and is still processing to understand what's happening here, is that a good user experience?

Forget about the user experience.  Is that UI Design and Engineering that serves the user?  I will leave this to your thoughts.

Further, the space in between these two layouts has the color #FFFFFFFF.  This makes the confusion much stronger.  Why?  
  • All these three GUI components are on one main view
  • To the human eye and brain, the colors of these three GUI components blend into one rather than three distinct GUI elements.

This is the second usability pitfall in the confusing GUI.
Not being able to distinguish between these three GUI elements quickly is a problem.  Why does the app have confusing color engineering for these three GUI components?  Why did the GUI design not highlight the search text field as tappable?  Why is the search text confusing when combined with the GUI color?

If only the GUI components had distinguishable, contrasting colors!




Rapido's Competitor GUI and Usability


Rapido's competitors have a similar GUI, but it is more intuitive in its search text and in the color contrast of the GUI components.  Refer to the pic below.

In the Ola and Uber apps,
  • The search text is straightforward and easy to understand.
  • The search text reads naturally in the context of using the app.
  • Importantly, the search text field is easily distinguishable.
The search text and the distinct search text field make me understand that I should tap on it and enter the pickup and destination locations.



Pic-3: The search text field and GUI in Ola, Uber and Rapido apps.




The Two GUIs of Ride Screen


During the test investigation, I came across two GUIs for the Ride screen.  

The other GUI looks better in terms of usability and prompted me to tap on the search text field.  But my question is: when do I get this screen?

It could be that an A/B test parameter sent to the app decides which Ride screen is shown.  I did not pick this up for debugging, as that screen looked better.

In the below picture, screen 2 shows a better search text.  Also, it does not have those three locations that I see in screen 1.



Pic-4: The two different Ride screens of Rapido.



Test Investigation & Analysis - Why My Brain & Eyes Took 45+ Seconds?

This section has the details of my debugging, test investigation and analysis.  I have put my eyes, brain, smart phone, reasoning and the Rapido app under evaluation.

If you are a Test Engineer or an SDET, this should be super helpful when you are testing a GUI.  Do not skip it!


Human Eyes and Cones for Blue Shades


I learned that the human eye has three cone types, L, M and S; their sensitivities are below.
  • L-cones are Long-wavelength cones; they are sensitive to red-ish light.
  • M-cones are Medium-wavelength cones; they are sensitive to green-ish light.
  • S-cones are Short-wavelength cones; they are sensitive to blue-ish light.

You remember, I shared the hex color codes for the two GUI components, that is
  • #FFF8FAFC
  • #FFF6FAFF



Pic-5:  #FFF8FAFC color

Pic-6:  #FFF6FAFF color


In between the above two hex color code images, we have a white (#FFFFFFFF) background, as in the Rapido app.

These two hex color codes explain my observation.
  • I struggled to distinguish the subtle color difference, especially in certain ranges.
  • So did the two users who were booking the cab at 11 PM.  Why?  
    • We humans have fewer S-cones, and they are less sensitive.  Hence, small changes in blue/cyan hues are hard to see.

But, the small changes in red/green are easier to detect.  

Have you seen the night sky when an airplane is flying?  

You see the red light of the airplane even though it is flying around 10 to 13 km above sea level.  

Why does the plane use a red light and not blue or any shade of blue?  I hope this triggers your eyes and mind now.  

With this simple daily life example in mind, what do you make of the two blue shades discussed here, with minimal difference and placed next to each other as GUI components in a mobile application?  

To add to the complexity, the hardware and display capabilities vary even across smart phone models from the same OEM.  You see how critical UI Engineering is in software design!


Display Behavior of Smart Phones


Even before your and my eyes see the color, the smart phone's display (hardware + software) processes it.

That is, the smart phones,

  • Quantize colors (round values)
  • Use OLED sub-pixels
  • Apply gamma corrections

This can lead rgb(248, 250, 252) and rgb(246, 250, 255), the two colors above, to be emitted as the same or nearly the same light.  Why?  The display hardware can round or merge the small difference.  This is another reason why I, on the Android device, and the other two users, on an iPhone, could not differentiate between the two GUI elements of the Rapido app.


Viewing Angle Makes It Even Worse


I was holding my smart phone at 180 degrees to the ground -- that is, I was viewing the device at an angle. 


Pic-7:  Holding the smart phone to view at 180 degrees to the ground.

At an angle,
  • The contrast reduces
  • The colors shift
  • The subpixels blur

So even a small difference that might exist becomes visually flattened.  Small hue differences are flattened by the panel optics.  This effect is common in cyan/blue hues.

Further, the human visual system averages nearby pixels.  Two adjacent colors like #FFF8FAFC and #FFF6FAFF are interpreted by the brain as a single averaged blue.



Why It Fails - Mathematical Analysis and Human Brain


In color science, the term "Empfindung" is used when talking about the experience of a color.  It is a German word meaning sensation.

In UI and Design Engineering it appears as ΔE, where Δ (delta) means change or difference, and E is Empfindung -- perceptual sensation.

A common UI Engineering rule of thumb asks for a minimum color difference of 3 to 5 RGB units or, as a perceptual metric, ΔE > 3, to ensure UI elements remain distinguishable.

For the two colors in discussion here, #FFF8FAFC and #FFF6FAFF, the calculation using the CIEDE2000 color difference formula results in ΔE = 2.0 to 2.3.
  • This range falls into interpretation as slightly noticeable -- borderline perception.
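
As a rough cross-check, the difference between the two reported ARGB codes can be computed with a few lines of Python.  This is only a sketch under stated assumptions: it uses the simpler CIE76 formula (plain Euclidean distance in CIE L*a*b*), not the CIEDE2000 formula quoted above, so the number it prints will not match the 2.0 to 2.3 figure exactly.  It also assumes the codes are sRGB values in #AARRGGBB order with a D65 white point.  The function names are mine and are not from any library.

```python
import math

# D65 reference white (CIE 1931 2-degree observer), normalised so Y = 1.
WHITE_D65 = (0.95047, 1.0, 1.08883)


def argb_to_rgb(hex_code):
    """Split an #AARRGGBB code into 8-bit (r, g, b); the alpha byte is dropped."""
    digits = hex_code.lstrip("#")
    if len(digits) != 8:
        raise ValueError(f"expected #AARRGGBB, got {hex_code!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (2, 4, 6))


def srgb_to_linear(channel):
    """Undo the sRGB transfer curve for one 0-255 channel value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def rgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE L*a*b* under D65."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    # Linear sRGB -> CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Cube root, with the standard linear segment for very dark values.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(v / n) for v, n in zip((x, y, z), WHITE_D65))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e_76(argb_a, argb_b):
    """CIE76 colour difference: Euclidean distance in L*a*b*."""
    return math.dist(rgb_to_lab(argb_to_rgb(argb_a)), rgb_to_lab(argb_to_rgb(argb_b)))


search_field = "#FFF8FAFC"  # search text field, as reported in this post
address_list = "#FFF6FAFF"  # view with the three addresses, as reported in this post
print(f"CIE76 delta E = {delta_e_76(search_field, address_list):.2f}")
```

Whichever formula is used, the value for this pair stays below the ΔE ≥ 3 mark that this post recommends for GUI elements.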

But, both colors in discussion here are with very high lightness (almost white) and low chroma (very low saturation) -- this is critical. 
  • For such colors the human sensitivity to differences drops significantly 
  • Despite the ΔE ≈ 2, in reality the users will not notice the differences, especially on the mobile phones.

The smart phone display may map the two colors discussed here to the same or near-identical output.  Why?
  • Display Quantization
    • The values of R and B are near the maximum in both colors
      • The maximum color value is 255.
      • In our case the RGB values of the two are rgb(248, 250, 252) and rgb(246, 250, 255).
    • Rounding and gamma correction in the hardware and software of a mobile device can compress the differences.
  • OLED Screen
    • On smart phones with OLED screens at high brightness levels,
      • The subpixel differences become less distinguishable
        • +3 in blue channel may not produce a visible shift
        • -2 in red may be completely lost
  • Viewing Conditions
    • On smart phones, the brightness varies, ambient light interferes and viewing angle shifts color.
    • As a result, ΔE ≈ 2 is often perceived as identical by the human brain.
      • That is, the human brain cannot differentiate between the colors

Using these two colors, #FFF8FAFC and #FFF6FAFF, for buttons, states and backgrounds is risky and leaves users unable to distinguish them reliably.

For those with accessibility concerns and conditions, ΔE ≈ 2 is effectively invisible.  It fails practical usability and experience expectations.


The final outcome from the test investigation and debugging is,
  1. Using these two colors together is unhelpful and unreliable.
  2. They are not suitable for distinguishing the GUI elements.
  3. Interactive GUI design needs stronger contrast.
  4. If the ΔE ≈ 2.0 to 2.3, it is borderline and unreliable.
    • The range 2.0 to 2.3 may be ok only for subtle background variations.
    • In this case it failed; all three of us had trouble understanding the GUI.
  5. Use the colors and contrasts with ΔE ≥ 3.

Use the below as a reference (heuristic) for standard perception thresholds.
  • ΔE is < 1,
    • the interpretation is not visible.
  • ΔE is 1 to 2, 
    • the interpretation is barely noticeable.
  • ΔE is 2 to 3, 
    • the interpretation is slightly noticeable
    • But it does not serve mobile app engineering.
  • ΔE > 3, 
    • clearly visible
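
The heuristic above is easy to turn into a small helper that a GUI check could call.  This is a minimal sketch; the function names are hypothetical, and the handling of the exact boundary values is my assumption (the list says "> 3" while the recommendation in this post says "≥ 3", so exactly 3 is treated as passing here).

```python
def interpret_delta_e(delta_e):
    """Map a delta E value to the heuristic labels listed above."""
    if delta_e < 0:
        raise ValueError("delta E cannot be negative")
    if delta_e < 1:
        return "not visible"
    if delta_e < 2:
        return "barely noticeable"
    if delta_e < 3:
        return "slightly noticeable"
    return "clearly visible"


def is_safe_for_mobile_gui(delta_e, minimum=3.0):
    """True when the difference meets the minimum recommended in this post."""
    return delta_e >= minimum


for value in (0.5, 2.2, 3.0, 5.0):
    print(value, interpret_delta_e(value), is_safe_for_mobile_gui(value))
```

A check like this could run in a design review or a UI test, with the ΔE value supplied by whatever color difference formula the team has agreed on.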

On the lighter side, refer to the pic below.  Let me know what the Empfindung of your eyes is for the two colors discussed, together with the white background.


Pic-8:  The screenshot of this blog post on my mobile screen.



The three colors FFFFFFFF, FFF8FAFC and FFF6FAFF appear to merge and look like one color.  Don't they?  

You can try an experiment with the above pic.  Look at it while increasing and decreasing the brightness and contrast of the screen (smart phone or monitor), in both lit and dark surroundings.  What's your experience?

I hope this is sufficient data to understand the seriousness of the problem discussed in this post.




What's the Fix?

  1. For mobile app engineering, the recommendation for GUI color and contrast is
    • ΔE ≥ 3  -- good and preferred
    • ΔE ≥ 5  -- safe
  2. Use the better text in the search text field
    • This looks better and prompts to tap on it -- "Enter pickup location"
  3. Distinguish and highlight the search text field GUI component prominently
  4. When experimenting with A/B test configs, the GUI design should follow the suggested GUI Design & Color Engineering practices.
These fixes also benefit the users with accessibility concerns and conditions.  



If you have any questions or need more information on this, please do connect with me.  I'm just one ping away!



Tuesday, December 16, 2025

Payment Transaction Declined With Code 201, Why?


I read a LinkedIn post yesterday.  The post asks for an appropriate contextual message that tells the user how to deal with what is shown and how to continue.  I find that fair and straightforward.  I expect the same.  

Thanks Liudas Jankauskas for writing this LinkedIn post and sharing your experience.

What surprised me is that the discussion was not taken forward in the post.  As a result, we did not seed the mindset and attitude of a Test Engineering and Prevention discussion.  Testing does not prevent; but the outcome of testing can drive prevention efforts and culture.

Okay! If you find this post too long to read, then here it is in short -- This is not an API problem.  The API has worked seamlessly and it has done what it is supposed to do.  The code 201 returned is right and it is supposed to be there.  Here, the 201 is not an HTTP Status Code.  It is a 3DS error code.

Then, whose problem is it?  To know that, read the entire blog post.  You will thank me and yourself!


I Say This, Before I Start

  1. I'm writing this post as an interpretation and analysis of the JSON shared in the LinkedIn post.
  2. There is no intention of pointing at anyone.
    1. The intention is to share how to analyze, and which perspectives to bring, when I [a test engineer] experience such an incident.
    2. This post can become a reference for someone who is serious about Test Engineering and Prevention, and Testing.
  3. The intention of this post is to show,
    1. How to analyze such incidents in the payment context?
    2. Is this an API problem?  What behaviors should be classified as an API problem?
    3. Which component and layer is supposed to handle it?  How and why?
    4. Which other components and layers are supposed to assist in handling it?
    5. How to see and interpret HTTP and its status codes in narrow cases?
    6. It is not always an API problem!
  4. I'm open to correcting myself, unlearning, learning and updating this blog post when you share why what I have written here does not make sense.
    1. I will be thankful and humble to you for helping me to correct myself. 🙏
    2. Be comfortable and do connect and help me if you see I need it.


The Payment Failed; Could Not Buy Ferry Ticket

The user wanted to buy ferry tickets by making an online payment.  The payment did not go through, and the user could not buy the ferry tickets.

Instead, the user sees the JSON on the UI. 

The JSON has error description which reads as -- "TdsServerTransID is not received in Cres."

To understand what TdsServerTransID and CRes are, it is necessary to understand the payment gateway flows.  Continue reading the next sections to know them.


How Is It Layered, and How Does It Work?


Before I get into the analysis and say where the problem is, this context requires an understanding of the payment gateway flow.  This understanding helps to see how the problem could have been handled and prevented.

Refer to the pic below.  It gives the sequence of interactions in the payment transaction.  Note that the pic is not complete.  It is limited to what is needed for the context of this blog post.


Transaction between 3DS Server and Bank for the Challenge through POST request.  It is shown as CReq and CRes.


The Sequence of Interactions
  1. I initiate the payment on merchant website or app to buy the ferry ticket.
  2. The merchant creates a transaction id for tracking.  Then, it hands over the rest to payment gateway.
  3. The payment gateway uses the 3DS way to carry this transaction.
  4. The 3DS initiates the Authentication Request (AReq) which will pass through the Directory Server.
    • The Directory Server reads the data in the request and forwards it to the right bank, the one that issued the card [or account] used in the payment.
  5. The bank receives the AReq.  
    • The bank decides whether to give the user a challenge before the payment or simply accept the data passed in the AReq.
  6. Now, in this case the bank has decided to give the challenge to the user.  
    • The bank informs the 3DS server through the response ARes.
  7. The challenge usually is to enter the OTP received over an SMS and authenticate.  I presume the same in this case.
  8. The user enters the OTP and POSTs the request.  Let us call this request CReq.  The CReq is sent from the user's browser or app to the bank.  The user interface on the browser or app at this point is handled by the payment gateway and not by the merchant's website or app where the ticket is being bought.
    • This CReq will have,
      1. tdsServerTransactionId
      2. messageVersion
      3. messageType
      4. challengeWindowSize
      5. acsTransactionId
    • Note that the CReq carries the tdsServerTransactionId, which is generated by the payment gateway.  The tdsServerTransactionId is required in the CReq and expected by the bank.
  9. On receiving the CReq, the bank processes the request.  The bank responds to the 3DS server; let us call this response CRes.
  10. The CRes will have,
    1. tdsServerTransactionId
    2. acsTransactionId
    3. challengeCompletionId
    4. transStatus
    5. messageType
    6. messageVersion
  11. If all goes as expected, the authorization will be given for the payment.  
    • If not, the bank responds to the payment gateway (3DS) with the problem and a message.
      • It is the payment gateway which has to read this problem and message.
      • The payment gateway has to give better contextual information through its interface to the merchant who is consuming the service.
      • This is key!
      • Hope you found the spot here!
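
The CReq fields listed in step 8 can be checked before the request leaves the gateway.  Below is a minimal sketch in Python, assuming the payload is a plain dictionary and using the field names exactly as they are written in this post; a real gateway may name them differently.  The function name and the sample values are hypothetical.

```python
# Field names follow the CReq list in step 8 of this post; a real gateway
# may name them differently.
REQUIRED_CREQ_FIELDS = (
    "tdsServerTransactionId",
    "messageVersion",
    "messageType",
    "challengeWindowSize",
    "acsTransactionId",
)


def missing_creq_fields(payload):
    """Return the required CReq fields that are absent, None, or blank."""
    missing = []
    for name in REQUIRED_CREQ_FIELDS:
        value = payload.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Hypothetical payload with made-up values; the 3DS server transaction id is blank.
creq = {
    "tdsServerTransactionId": "",
    "messageVersion": "2.1.0",
    "messageType": "CReq",
    "challengeWindowSize": "05",
    "acsTransactionId": "acs-123",
}
print(missing_creq_fields(creq))  # ['tdsServerTransactionId']
```

A non-empty result is the signal to stop and not fire the CReq at all, rather than letting the bank discover the gap.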


The Code 201 


Maybe we test engineers presume the error code 201 returned is an HTTP Status Code.  Ask your team what this code is and what the purpose of returning it in the JSON is.

Back to 201, here.  It is not an HTTP Status Code.  Nor is it a custom error code developed by the payment gateway or by the merchant who is selling the ferry ticket.

This is a payment domain specific code used in the payment technology.  Yeah!  The code to tell the payment gateway and merchant what happened with the initiated payment transaction.  

To be more specific to the context, the 201 here is a 3DS API error code.


The 3DS Error Code and its Description.
Image credit: developer.ravelin.com

The above image is shared here so that you know the error codes are available in the payment context.  I'm aware of 3DS protocol hence I could relate and understand the error code 201.  Refer to the References section at the end of this blog post.

The payment gateway has to read this error code.  Then, it has to provide the right contextual message and instruction to the merchant, saying what to do in case of payment failure.

Here is the catch.  Not every merchant develops a mechanism to handle this.  Instead, it is left to the payment gateway service provider, and the merchant depends on it.  

Does the merchant bother about handling it at their end? 
  • As a merchant, I can switch to another payment gateway tomorrow.  
  • Why should I invest in building, developing and maintaining this as a part of my business when I'm paying another business to do it for me?  
  • Isn't that a sensible business question or decision?
Now, you tell me who is expected to handle this 201 error? Who should let me know on the merchant web user interface what to do?



The Case and My Initial Observations


On submitting the challenge given by the bank, the details below are shown to the user.  


Pic used from the post of Liudas.
It is the CRes returned from Bank to 3DS Server at Payment Gateway.

My Observations
  1. It looks like the ticket was being bought, and the payment made, in Turkey.
  2. The payment gateway involved in completing the payment transaction is Payten.
  3. The user is trying to buy the ferry tickets and is making an online payment.  This error is seen on submitting the challenge given by the bank.
  4. It looks like a card was used for the payment.
  5. In the LinkedIn post, I read, multiple attempts were made to make payment, and the same message is seen.
  6. The Error Code returned is 201.
    • Which means, a mandatory field is missing and it is a violation.
  7. Error Component is S, which probably means 3DS server.
    • The component that raised the error.
  8. There is an error which is not handled.
    • The error reads as Unexpected error.
    • There is an id to track it.  That is, for auditing and tracking by the payment gateway.
    • This id does not have anything to do with the bank or the merchant who is selling the ferry tickets.
  9. The message has a version that reads as 2.1.0
    • This means, 3DS protocol used is of version 2.1.0
    • This is communicated by 3DS server to the Directory Server and Bank Server, so that they use the same version in the contract.
      • If the same version is not available at the bank server side, the lower version will be used.
  10. Error description reads, the TdsServerTransID is not received in the CRes.
    • The 3DS component is raising this error.
    • TdsServerTransID means ThreeDS Server Transaction Id
    • 3DS means Three-Domain Secure, which is a security protocol used for online card transactions.
  11. This means, the response from the bank is that it cannot fulfil the payment request.
    • Because, the request from the 3DS layer does not have the TdsServerTransID.
    • Wait!
      • Do I actually know if the CReq had the TdsServerTransID in the payload?
      • This is critical to know.
    • This is important to know.
      • The AReq (Authentication Request) will have the TdsServerTransID in its payload.
      • That means, this was present earlier.  So, the CReq is fired from the 3DS server on receiving the ARes.
      • The bank received TdsServerTransID and said it is giving a challenge  before proceeding to authorization for payment.
        • This response (ARes) from the bank will have TdsServerTransID.
          • Along with this, the below are also present as a correlation ID to track this transaction.
            • dsTransID
              • The Directory Server ID to track this transaction
            • acsTransID
              • The bank server ID to track this transaction
      • From here, the CReq payload requires the TdsServerTransID and acsTransID
        • Where did the TdsServerTransID go missing now?  How?  Why?
        • This needs to be test investigated.  I have shared possible causes in the next section.

The TransactionID of the merchant who is selling the ferry ticket is different from the TdsServerTransID.
    • The TdsServerTransID is needed for the bank and Payment Gateway to complete the transaction.
    • Without this TdsServerTransID, the bank cannot proceed to authorize the payment.

All this is happening between the payment gateway [3DS server] and the Bank.  The user and the merchant website are not in the picture here.



Why Did It Happen?


It happened because, the CReq from the 3DS Server of payment gateway did not have that TdsServerTransID.  That is what the CRes from the bank says.  

Note that, I presume, the bank systems and services are functioning and serving its customer at this point in time.  This is an analyzed assumption I'm making here and I'm aware of this assumption I have made.

What made the CReq miss the TdsServerTransID?
  1. I read, the user made multiple attempts to initiate and make the payment.
  2. I have multiple perspectives and interpretations, with hypotheses, for what could have led to the missing TdsServerTransID.
    • Talking about each hypothesis in detail is not in the scope of this blog post.
    • But, here are some of the possibilities that could have led to this situation.
      • A glitch at payment gateway at that point in time.
      • Someone is breaching in the middle and tweaking the request and response over the network.
      • Device and hardware
      • Geo location
      • Network and traffic
      • Timeouts
      • Latency
      • Storage running out
      • Configuration gone wrong
      • Caching and missing -- not persisting
      • Intermediate and dependency services clogged
      • A new release deployed and updated
      • Downtime in the bank's services
      • Intermittent bank's services
      • Mismatching 3DS protocol versions that are not supported and accepted at either end
      • And, more!
Simply put, it was handled well by the bank.  I assume the amount is not debited from the user's account.  This is equally important, and the LinkedIn post does not share anything about it.

I see the bank's service has worked well in this context and done what is expected of it.  The rule of thumb is, when there is a discrepancy between the data expected and the data received, do not authorize the payment request; abort it.

Can the service do much more if the TdsServerTransID is missing from the payload of the AReq?

As a test engineer in the payment gateway engineering team, I would test for the below at a minimum and as a must,
  1. AReq with no TdsServerTransID, and observe what happens!
  2. Does the 3DS server still fire an AReq to the bank with no TdsServerTransID?
    1. If yes, then that is a problem!  It should be handled here.
    2. This problem can be prevented!
    3. I will ask my team why we are encouraging such a request.
      • This increases our customer support cost and operations time in responding to our clients who use our payment gateway.
      • And, the merchant can lose business because of our payment gateway.
  3. What should the 3DS server do when the initiated AReq is missing the TdsServerTransID?
    1. What other data in the transaction and session should be retained and kept intact?
    2. Will this data change when a fresh TdsServerTransID is created?
  4. I will explore and figure out what factors caused the 3DS to lose the user's authentication.


Who Created The Problem?

  1. From the error shared, I see,
    •  It is the 3DS server that created this problem, presuming the bank and its response are right.  


Is This An API Problem?

  1. No, it is not an API problem.
  2. If we call it an API problem, we have not understood what an API is.


Whose Problem Is It?

  1. It is the problem of the service that fired the CReq.
    • Because, it fired the CReq without the TdsServerTransID, which is a mandatory key-value in the payload.  
    • The TdsServerTransID value cannot be null or empty in the JSON.
It appears to be a business logic problem in the service at the 3DS server side, which fired the CReq with no TdsServerTransID.



How Can It Be Prevented?

  1. By making sure the CReq will always have a distinct TdsServerTransID.
  2. If there is no distinct TdsServerTransID in a fresh CReq being fired, abort the request.
    • Create the distinct TdsServerTransID and then construct the request payload before firing the CReq.
Make sure the payment gateway [3DS server] preserves the TdsServerTransID, dsTransID, and acsTransID of the session.  If any one of these does not match at any of the components during the transaction, aborting the transaction is the best and right action to take.  
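
The "preserve and match, else abort" rule can be sketched as a small guard.  This is only an illustration under assumptions: the session is modelled as a frozen dataclass, each message as a dictionary keyed by the ID names used in this post, and `SessionIds`, `TransactionAborted` and `ensure_same_session` are hypothetical names, not part of any 3DS library.

```python
from dataclasses import dataclass


class TransactionAborted(Exception):
    """Raised when a message does not belong to the stored session."""


@dataclass(frozen=True)
class SessionIds:
    """The three correlation IDs the gateway keeps for one transaction."""
    tds_server_trans_id: str
    ds_trans_id: str
    acs_trans_id: str


def ensure_same_session(session, message):
    """Abort if any correlation ID in `message` is missing or differs."""
    expected = {
        "TdsServerTransID": session.tds_server_trans_id,
        "dsTransID": session.ds_trans_id,
        "acsTransID": session.acs_trans_id,
    }
    for key, want in expected.items():
        got = message.get(key)
        if not got:
            raise TransactionAborted(f"{key} is missing")
        if got != want:
            raise TransactionAborted(f"{key} does not match the session")


# Made-up IDs for illustration.
session = SessionIds("tds-1", "ds-1", "acs-1")
try:
    ensure_same_session(session, {"dsTransID": "ds-1", "acsTransID": "acs-1"})
except TransactionAborted as reason:
    print("aborted:", reason)  # aborted: TdsServerTransID is missing
```

Because the dataclass is frozen, the stored IDs cannot be changed by accident in the middle of a transaction; a fresh transaction would need a fresh `SessionIds` object.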

Can you recall from your daily life being told not to refresh the page or click the browser's back button once you have initiated a payment transaction on the web or in a mobile app?  This is the reason!  To preserve the data of the transaction in the session between these systems -- merchant web or app, payment gateway [3DS server], Directory Server and Bank Server.

This is an automation candidate.  It has to be part of the daily test runs in the automation.



Who Should Be Fixing It?

  1. In my opinion today, based on the analysis and the assumptions I have made, it should be fixed by the payment gateway.
  2. The payment gateway has to interpret the 3DS error code returned by the bank.  
    • And, then initiate the appropriate action as the payment transaction has failed. 
  3. A new transaction between the 3DS and bank has to be started for that merchant's order.  
    • If it cannot happen, the payment gateway should abort all the currently open sessions tied to that merchant's order.
    • And, let the user know what is happening, and then direct the user.
  4. The payment gateway can read the ARes and CRes.
    • That means, the payment gateway can read the HTTP Status Code of ARes and CRes.
    • If there is an error code in ARes and CRes,
      • Then, the gateway should assert for the 3DS error code along with the HTTP Status Code.
      • For example, 
        • Say, the HTTP Status Code is 400 and 3DS Error Code 201 in CRes.
          • Assert for these two, and guide the user with an appropriate contextual message and direction.
          • In general, this is how custom-developed error codes are handled by the client on receiving them from the services.
Note that it is the Error Code and not the Response Code.  The two are different.  And the HTTP Status Code is different from both of them.
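
Reading the two codes as separate signals can be sketched as below.  It is an illustration under assumptions: I assume the 3DS error code arrives as a string under an errorCode key in the JSON body, and the function name, action names and message texts are made up for this example.  The meaning given for 201 follows the EMV 3DS error code tables; please verify it against the references before relying on it.

```python
# Assumption: the 3DS error code is carried under "errorCode" in the body.
# The description for 201 follows the EMV 3DS error code tables; the
# wording of the messages and the action names are illustrative only.
KNOWN_3DS_ERRORS = {
    "201": "A required data element is missing from the message.",
}


def decide_action(http_status, body):
    """Treat the HTTP status code and the 3DS error code as two signals."""
    error_code = body.get("errorCode")
    if http_status < 400 and error_code is None:
        return {"action": "continue", "message": ""}
    if error_code is None:
        # Only the transport failed; there is no 3DS-level explanation.
        return {"action": "retry_or_abort",
                "message": f"Transport-level failure (HTTP {http_status})."}
    detail = KNOWN_3DS_ERRORS.get(str(error_code), "Unrecognized 3DS error code.")
    return {"action": "abort_and_restart",
            "message": f"3DS error {error_code} (HTTP {http_status}): {detail}"}
```

With HTTP 400 and 3DS error code 201, the sketch returns abort_and_restart together with a contextual message, and it never reads 201 as an HTTP status code.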



Why Is It Not an API Problem?


An API is an interface that exposes the available services to the consumer.

It is the service that collects data, builds the request payload, and expects a payload to process.

This request passes through an interface that is open to the consumer.  This interface is called an API -- Application Programming Interface.

Analogy,
  • The car has a gear stick to switch gears.
    • This is an interface to the car's gear system.
    • The driver uses the car's gear system through this gear stick.
  • The gear box in the car is a service.
    • The gear box adjusts and responds by switching to the gear per the driver's input.
    • The different operations [business logic] provided by the gear box service are,
      • Reverse
      • Parking
      • Neutral
      • Switching the gear and assisting other components to speed up or slow down the car.

Can the interface have a problem?  Yes, it can.  In that case, the service will not be available to serve or to be discovered.  
  • For example, if the gear stick has a problem, I cannot use the gears of the car, even though the gear box can be fully functional and in working condition.  
  • It is just the interface [gear stick] that has the problem, and as a result the consumer [driver] is unable to use the service [gear box].

In this case, the payment gateway could fire the CReq to the bank.  That is, the API exposing this service is functioning.  The problem is in a service.  It is the service that failed to ensure and mandate the presence of TdsServerTransID -- a business logic problem.
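
The difference between the two layers can be shown with a toy example.  The class names, the /creq path and the result fields are all hypothetical.  The point is only where each kind of failure originates.

```python
class CReqService:
    """The service: owns the business rule about the mandatory key."""

    def handle(self, payload):
        if not payload.get("TdsServerTransID"):
            return {"ok": False, "layer": "service",
                    "reason": "TdsServerTransID missing or empty"}
        return {"ok": True, "layer": "service", "reason": ""}


class Api:
    """The interface: only exposes and routes to the registered services."""

    def __init__(self):
        self._routes = {}

    def register(self, path, service):
        self._routes[path] = service

    def call(self, path, payload):
        service = self._routes.get(path)
        if service is None:
            # An interface problem: the service cannot be reached or discovered.
            return {"ok": False, "layer": "api", "reason": f"no route for {path}"}
        return service.handle(payload)
```

A call to a path that was never registered fails in the api layer, like a broken gear stick.  A call that reaches the service with an empty payload fails in the service layer, like a gear box that accepted the wrong input.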

Simply put, we Software Engineers [including me] use the term API vaguely and with no sense of what it means.  This is my observation.  Further, Software Testers have tossed the term API around in all possible ways and are learning it in incorrect ways.  I have no doubt about this.

Next time you say it is an API problem, rethink what you are saying.  Is it the API, or the service(s) accessed through that API?  The distinction is useful when we are describing the behavior of the system and its layers.  

I cannot call it a gear box problem when it is the gear stick that is not working [usable], and vice versa.  You see that?




To stop here,
  1. Testing and testing skills help when they are combined with awareness of, and skills in, the tech stack used to build the software system.
  2. Testing skills, programming and tech skills are not enough!
    • Domain skills and awareness are essential and critical.
    • If one is aware of the domain where the problem is observed, then the 201 will not be read as an HTTP status code.
  3. Building domain skills and maintaining the knowledge base as a GitHub project helps in the long run.
  4. Interfaces and their gateways should not be overloaded with the additional responsibility of making sure the mandatory key-value is present.
    • If that is done, one should learn why, because it is an anti-pattern and not a recommended software engineering practice.
  5. It is the services that have problems most of the time.  
    • The interfaces and their gateways will have discovery, orchestration and traffic problems, along with security risks.


References:
  1. https://httpstatuses.io/
  2. https://developer.elavon.com/products/3dsecure2/v1/api-reference
  3. https://developer.ravelin.com/psp/api/endpoints/3d-secure/errors/
  4. https://developer.elavon.com/products/3dsecure2/v1/3ds-error-codes
  5. threeDSServerTransID  (in our case the JSON reads as TdsServerTransID)
    • 3DS Server Transaction ID
    • Universally unique transaction identifier assigned by the 3DS server to identify a single transaction.
    • The 3DS server auto-populates and appends this field value in the authentication request (AReq) it sends to the bank, in addition to the data you send.
    • This value can also be found by a lookup in the response received by this service -- /3ds2/lookup


Friday, July 26, 2024

My First Hand Analysis of CrowdStrike Falcon Update Incident


I attempted to analyze the CrowdStrike process dump shared by my friend.  He said there could be an attack leading to crashes of Windows OS globally.  This made me curious to look into the dump and learn.

I did not have much context around it, but the test engineer in me could not sit quiet.  I started to analyze the dump information.  Here is my first-hand analysis, which I made on 19th July 2024 after 10:30 AM IST.


What I Saw?

  • It is a process dump from Windows OS.
  • Going by how the memory offsets read in the dump, it looks like a C or C++ application.
  • It started to read a memory offset.
  • Then the process hit an exception.
    • Here the program could not read further.
    • Why could it not read further from this offset?
      • My little experience of testing drivers on Windows OS for a card printer machine helped me recall what I had witnessed when testing.


Scratching and Striking My Mind


I started asking myself these questions as I wondered what could have gone wrong.  I could not stop here, as I was curious about what led the Windows machines to crash.  I referred to the web and learned that there was an update by CrowdStrike, and then this incident.

Bugs exist in every software, no matter the level and depth of testing, automation and engineering excellence.  All software crashes, and an OS is no exception.  But what made the update crash the Windows OS?  Pointing at and blaming CrowdStrike or Microsoft is not the way for a practicing test engineer.  If these two organizations are serving their huge customer bases, they have something that works and is reliable.  Engineering does not eliminate problems.

By now, I had a thought that it is not an attack.  It is a software bug!  Where is the bug?  What is the bug?  Was it not experienced in the pipeline?


The Open Ended Questions


I had these questions as I analyzed and spoke to my friend.
  • What is Falcon?
  • What was this update to Falcon?
  • How frequently are the updates rolled out?
  • How are the updates rolled out globally?
  • What pipeline do they have in testing?
  • Who is impacted the most in business? Is it Microsoft or CrowdStrike?  Impacted in what way?
  • What is CrowdStrike?  What do they do?  Who are the customers?
  • Where does CrowdStrike's Falcon sit in the OS and what does it do?
  • How does CrowdStrike work on the machines and what does it offer?
  • What does the dump say? Relook into it with different perspectives.
  • How could this have been prevented?
  • How will I prevent this if I join this team knowing this incident?
With these questions, I started to analyze the process dump which was shared.

I had more such questions, but these were the first few that crossed my mind as I started.



Analysis of Process Dump


My interpretation tells me the below, as of today:
  1. Accept that it is an incident like any other incident that I witness in a production environment.
  2. Do not fall for the speculation going around.  Remain calm and focus on interpreting and understanding your exploration.
  3. If it can start to read from an offset and then ends up at a non-existent or invalid offset, is it a NULL Pointer?
    • What is NULL Pointer?
      • A NULL Pointer is a pointer that does NOT point to any valid memory location and hence does not hold the address of any variable.
      • If I do not initialize or assign a local pointer in C or C++, its value is indeterminate; it is not guaranteed to be NULL.
      • For example, int *test;
        • When I want to access what the pointer test points to (a location in memory), I cannot be sure what is in the pointer when I read it.
          • I may or may not set it later.
          • In this case, the code cannot tell whether the pointer is valid or pointing to garbage memory.
        • But, if I declare it like int *test = NULL;
          • I can check whether it was set later
        • It is a better practice to assign a NULL value to a pointer during initialization so that we can check if it is NULL or has an address assigned to it.
      • This understanding of pointers makes me think: is it due to not initializing a pointer, and hence the error code c0000005 on reading memory that is not valid?
      • When we assign a NULL value to a pointer, it is a null pointer in C++
        • We assign the null value for testing and asserting
          • If the memory is allocated to a pointer or not
          • If it has a return address and is a valid one or not
          • When a pointer has no value yet, assigning null to it prevents problems to a certain extent
    • With this understanding, I also read that it started to read from an offset 0x9c, and then failed.
      • What is 0x9c?
        • In Octal it is 234. In Decimal it is 156.
        • Can there be such an address in a computer's memory? I don't know.
        • If it is an access violation, then is it memory that is protected by the OS?
          • If so, the OS can terminate the program or process which is trying to access it.
          • Is this what killed the process, aborted the operation of Falcon's IPC, and eventually brought Windows to a BSOD?
      • This tells me it is not a NULL Pointer in the first place, but a pointer that was not initialized to NULL.
        • I infer that if the pointer had been initialized to NULL, there could have been some hint in the state and event when accessing the memory.
          • This is my analysis; but, I have not seen the test code, nor am I aware of the product.  All this inference is based on the process dump and my experience of testing drivers.
      • Did it get something in between from the update (a config or a pattern?) that it could not find and read in memory?  Why?
        • This indicates to me that it could be a bug, that is, a logical problem.  This is my hunch for today!
  4. Data in the dump
    • Exception Address
    • Read from Address 0x9c
    • Exception Code: c0000005 (Access violation)
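
The numbers in the dump data above can be checked with a few lines of Python.  This is only arithmetic on the two values from the dump.  The 64 KB limit is my assumption about a typical Windows address space layout, and the helper name is hypothetical.  A small address can arise when an offset is added to a NULL base, but this sketch does not prove that is what happened here.

```python
# Values taken from the dump data listed above.
read_address = int("9c", 16)
exception_code = int("c0000005", 16)

# The same address in the other two bases discussed in the analysis.
address_decimal = str(read_address)
address_octal = format(read_address, "o")

# NTSTATUS keeps its severity in the top two bits; the value 3 marks an error.
severity = exception_code >> 30

# Assumption: the lowest 64 KB of a Windows address space is never mapped,
# so any read below this limit ends in an access violation.
LOWEST_MAPPABLE = 0x10000


def looks_like_small_offset_from_null(address):
    """True when an address is small enough to be NULL plus a field offset."""
    return 0 <= address < LOWEST_MAPPABLE
```

Here looks_like_small_offset_from_null(read_address) is true for 0x9c, which fits an access violation on an address that no process can legitimately read.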

Testing my Interpretations


When CrowdStrike as an org caters its SaaS to such a customer base, won't it have a testing pipeline?
  • It will; I have no doubt about it.  I believe they test and then roll out the updates.

Did they witness any such incidents earlier?
  • I searched the web for it and did not find anything similar on Windows earlier.

Is this a NULL Pointer?  Are you sure?
  • No, I'm not sure.  But there is something that is leading it to an address which does not exist or which is invalid.  I will have to wait for their RCA to know technically what caused this.  But this is my understanding from reading the dump.

How do you think it is a memory access problem?
  • The error code 0xc0000005 says that.
  • I referred to the Driver Easy website for the information, because my experience of testing drivers for Windows OS and experiencing such incidents led me there.  This is where I learned it from:
    • https://www.drivereasy.com/knowledge/solved-how-to-fix-0xc0000005-error/

Do you think the programmer would not have handled the obvious Pointer and NULL initialization?
  • I believe there will be a check on the pointer and what it is pointing to.  But is it due to no initialization?  Technically this has to be analyzed, which I cannot do.  I will have to wait for the CrowdStrike team to share the tech details.
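
The kind of check I have in mind can be sketched with Python's ctypes module, used here only as a stand-in for the C snippet int *test = NULL; from the analysis above.  The helper name read_value is hypothetical, and this says nothing about how Falcon's code is actually written.

```python
import ctypes

IntPointer = ctypes.POINTER(ctypes.c_int)


def read_value(pointer):
    """Dereference only after checking that the pointer was set."""
    if not pointer:  # a NULL ctypes pointer evaluates as false
        return None
    return pointer.contents.value


unset = IntPointer()               # starts as NULL, like int *test = NULL;
number = ctypes.c_int(42)
assigned = ctypes.pointer(number)  # now holds the address of number
```

Calling read_value(unset) returns None instead of reading invalid memory, while read_value(assigned) returns the stored integer.  The check is possible only because the pointer started out as NULL and not as garbage.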

Is this a driver problem that killed the Windows kernel?
  • I don't know.  But, as per my learning, this .sys file does not contain a driver.  It holds information about the drivers and any configurations.
  • This incident is a problem which impacted both CrowdStrike and Microsoft.  Maybe both will have their areas to look into and fix if they see fit.  But, in this context, CrowdStrike can fix it quicker and that is much better -- is what I understand.
  • I have been a Windows user for a long time.  Windows has worked well in all my contexts so far.  The engineers of Windows OS know better than me here.  I'm not as well aware and informed as they are.
  • CrowdStrike's engineering team is skilled, and they roll out updates several times a day.  They must have a good pipeline to be doing this.
    • But, the question I have is, how did this happen?
    • No one lets such a problem into production when they are aware of it.  Do you?
    • There is something that has not come to their observation and experience.  What is that?
    • Knowing this will help to prevent this and similar incidents happening in future.
      • I'm waiting to know what did not come to their experience and led to this incident.

What could be in the .sys file of CrowdStrike?
  • I don't know!  I want to learn that.
  • But, from my testing of .sys files and drivers on Windows OS, I learned there could be configuration details with a certain pattern, or information to capture at run time, that help the installed software to run.  This is my learning and awareness from my testing.
  • That said, testing at the OS level and testing antivirus engines is not straightforward.  Testing drivers is like walking through a minefield.  What is sufficient and good enough in test coverage?  It needs expertise at the OS internals level.
  • With Windows OS having such fragmentation in its versions, updates and patches, it is a battlefield full of mines for engineers building such solutions, for sure!
  • I learn the Windows OS stopped when an application tried to access an invalid region or non-existent memory.
    • The update which was rolled out, did it have a configuration or a pattern that showed a logical problem when processing it?
    • I have such questions and thoughts that are striking my mind as I think and build a problem model for the same.

Is this a race condition incident?
  • I see it is not a race condition incident, as users across the globe experienced it.

Is this specific to a Falcon version, OS version and hardware?
  • Not all host machines would be on the latest version of Falcon, is my presumption.
  • At least the n-1 and n-2 versions would have been on host machines which experienced this behavior.
    • So it is not specific to a Falcon version, I see.
  • It looks to me that it is not specific to the Windows OS version and hardware configuration.
    • It is an application software problem which occurred at the driver level, is what I see.
      • This is about IPC and its processing, is my understanding.
        • The driver can receive the IPC communication in continuous mode.
        • At times, this can get queued based on the application and what it does.


Where is the Problem?


Well, I'm pulling from my visualization by relating it to my experience of testing a driver on Windows OS.  I don't know the exact reason, nor am I close enough to tell what could have gone wrong.

Reading the process dump, it says the process accessed memory that does not exist or is corrupted.  One strong possibility is that the starting offset is seen, but it is not helping when reading.
  • For example, Ravi has the address of the Indian Prime Minister's house.
    • But, he does not know from where to start despite having the address.
    • He is void and null in knowing where to start and what to do, as he is not initialized with the start location to begin the travel to the Prime Minister's house.
    • In short, he does not know where the address is pointing to and what it has, though he is given an address to start.
      • Can he access the premises of the Prime Minister's house without any access granted and authorized to do so?
      • If not, won't the police or other security forces stop and arrest him?

Do I Know the Precise Problem?


I don't know!  I do not know the CrowdStrike product and platform.  I'm waiting to read the technical details from CrowdStrike.

I see it comes down to the data, state and event.  I would focus on how to prevent it by learning which data, state and event led to this behavior.  I think of figuring out the Test Design and Strategy that can help me to identify such use cases.  I focus here and see whether it can be brought into the automation so that it gets exercised and regressed consistently.

If it is due to a memory access that had a problem, I did such tests when testing a driver for a hardware machine on Windows OS.  I will share the tests that I did in an upcoming blog.

I wrote the technical analysis from process dump to CrowdStrike and Microsoft.  I did not get a response.  Anyways, I'm sharing the overall information in a non-technical way so that it is consumable to most readers here.



Note: Here are other threads where I shared my thoughts on the same:
1. https://x.com/testingGarage/status/1814215089525821763?t=XSFdx69ElL0ZmBOcEFrTjg&s=19
2. https://www.linkedin.com/posts/ravisuriya_%3F%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F-%3F-activity-7221156949445206017-oeRa




Saturday, February 3, 2024

Database: Finding the Tables Having Specified Column Name

 

In today's pair testing session with a mentee, we were testing for Database I/O.  We were on PostgreSQL.  One of the questions the mentee had was,

How can I figure out the tables having this column name?

Running through every table and exploring whether the column being looked for is present is time consuming.  It is not an approach to take either.

I went through this when I started the ETL testing practice in 2011.

Here is the query that works on PostgreSQL to find the names of the tables which have the specified column name.


Query:

select table_name, column_name
from Information_Schema.Columns
where table_catalog='database_name' and column_name like '%column_name%'


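It is a better approach to know the precise column name and use the condition as -- column_name='EmployeeId'.  On PostgreSQL, note that unquoted identifiers are stored in lower case, so the value to compare against is 'employeeid' unless the column was created with a quoted name.


This query should work on MySQL and MSSQL Server with small changes.  On MySQL, the database name is held in table_schema and not in table_catalog.  If it does not work on MSSQL, look into whether the FROM and WHERE clauses need anything vendor specific.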
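

When this lookup moves into test automation, the query is better built with parameters than by editing the SQL text by hand.  Below is a minimal sketch, assuming a Python DB-API driver that uses %s placeholders (psycopg2 does on PostgreSQL).  The function name tables_with_column is hypothetical.

```python
def tables_with_column(database, column, exact=False):
    """Return (sql, params) for a DB-API driver that uses %s placeholders."""
    sql = (
        "SELECT table_name, column_name "
        "FROM information_schema.columns "
        "WHERE table_catalog = %s AND column_name "
    )
    if exact:
        # The precise column name is known: compare for equality.
        return sql + "= %s", (database, column)
    # Otherwise fall back to the pattern match used in the query above.
    return sql + "LIKE %s", (database, f"%{column}%")
```

With an open cursor, the call would look like cursor.execute(*tables_with_column("database_name", "employeeid", exact=True)) followed by cursor.fetchall().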



Monday, October 16, 2023

Performance & Tests: Getting Started and Data Analysis

 

On running tests,

  • We will have data (information) as one of the byproducts.
  • Analyzing the data of the integrated sub-systems in isolation and in correlation,
    • It will lead us to a technical analysis of each integrated system.
In the report, we draft this analysis along with the actions to be taken.

Note: When I say sub-systems, do not ignore or skip the client or consumer; the system does not comprise just the server.


No Golden Rule

There is no one way to do testing.  Likewise, there is no one way or golden rule to test for performance.  It is contextual and depends on what I want to learn.

In fact, in a few contexts, we can have a value-adding performance test with just one request.  I just need to be well aware of what it is that I want to know and learn from this test.

That said, there are multiple interfaces where we can observe, analyze and learn from the performance data collected.

The fourth question from season two of 100 Days of Skilled Testing, is:

What are your favorite hacks to analyze performance testing results and find anomalies?

Well, this question does not mention explicitly whether it is for the server, client, database, caching, messaging, or some other interface of a system.  It is a question; but, to me it looks too generic and at one point it looks vague.  Having said this, that is how the learning journey and curve starts! 


Result vs Report

What is a result?

  • Is it an evaluation after data [information] is put to scrutiny?
  • Or, is the result data that is collected and not yet interpreted?

It depends on the individual or team and how it is being practiced.

The result is different from a report.


Getting Started and Data Analysis


I should know how the system architecture is designed and orchestrated with its boundaries and interfaces.  This helps a lot.  What kind of architecture is this?  Is it a monolith?  If it is a monolith, my approach to test for performance differs.

If I'm asked to start the analysis of data for a system that I'm not aware of,
  • I will start by analyzing the below indicators, once I know the architecture and the orchestration of the sub-systems for critical business workflows
    1. CPU usage
    2. RAM usage
    3. Data I/O
    4. Network usage
    5. The heat and sound dissipated by the hardware which holds and binds
      • CPU, RAM, Data I/O, Network and tech stacks installed and configured

It hints to me to look further and test investigate, when I observe:
  • Having a steady consumption
    • What is steady consumption in this context?
  • Having a low consumption
    • What is low consumption in this context?
  • Having an unusual consumption spike and fall
    • I follow the pattern to study further
    • What is considered as knee, spike and fall, in this context?
  • Having a zero consumption
  • Having a maximum consumption
    • What is maximum consumption in this context?
Having a high consumption doesn't mean a problem.  Likewise, having a low consumption does not mean all is well.  I have to uncover them to learn what it means in the given context.

In each of these, there will be a pattern.  I will learn them.  I will correlate with other sub-systems and learn what they were doing in the said timeline.
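
A first pass over one indicator, say CPU usage samples in percent, can be sketched as below.  Every threshold here is a placeholder that I made up for illustration.  What counts as low, steady, spike or maximum is exactly the contextual question raised above, so these defaults are meant to be replaced and not trusted.

```python
from statistics import mean, median, pstdev


def classify(samples, low=20.0, high=85.0, steady_band=5.0, spike_delta=30.0):
    """Label a series of utilisation percentages with the patterns to question."""
    if not samples:
        raise ValueError("no samples to classify")
    if all(value == 0 for value in samples):
        return {"zero"}
    labels = set()
    if mean(samples) <= low:
        labels.add("low")
    if max(samples) >= high:
        labels.add("maximum")
    if pstdev(samples) <= steady_band:
        labels.add("steady")
    middle = median(samples)
    # A sample far away from the median is flagged as a spike or a fall.
    if any(abs(value - middle) > spike_delta for value in samples):
        labels.add("spike_or_fall")
    return labels
```

The labels are hints about where to look next and not verdicts.  A series can be both low and steady, and that alone does not mean all is well.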

Do you recollect this line -- "the architecture should provide the Testability"?
  • I wrote about it in one of the blog posts of Performance Engineering.

I refer to the below by traversing with the timeline,
  • The logs by asking for it
  • Data recorded
  • Any APMs that are in place
I correlate all these with above said indicators.
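
Traversing the timeline can be as simple as pulling every recorded event that falls close to the moment an indicator looked unusual.  This is a minimal sketch.  The tuple layout of (timestamp, source, text) and the 30 second window are my assumptions for the example.

```python
from datetime import datetime, timedelta


def events_near(moment, events, window_seconds=30):
    """Return the (timestamp, source, text) events close to a given moment."""
    window = timedelta(seconds=window_seconds)
    return [event for event in events if abs(event[0] - moment) <= window]
```

Feeding it the time of a CPU spike, along with events merged from the logs, the recorded data and the APM, gives a short list to read first.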

This gives me a start. It is one of the easiest starts that I can have to get going with the analysis.


Well, this is to analyze at the server end.  What about the client [consumer] end?  It is simpler, and I will share it in the coming blog posts.



Do you want to know more on this and other strategies that can be used contextually?  Let us get connected and converse.  I'm happy to share, and to learn by listening to you.  It is fun and it builds awareness!



Tuesday, April 4, 2023

My Interpretation of localhost and 127.0.0.1

 

Incident

When executing a test, I see the below message 

The target is unreachable, Please make sure your target is up and running

I have multiple Docker images running on my machine.  Most of these images are different products that are bound to run locally.


Debugging and Observations

In this context, I'm using http://127.0.0.1:portNumber as the address to communicate with an application which is a Docker image.  I see the above said message.

But, I have not mentioned any IP address specifically; and there is no port forward.  This raises a question in me -- What's going wrong here?

I use http://localhost:portNumber and try to communicate with the Docker image. Now the test execution does not show the above said message.

I see the next question in me:

  • Don't localhost and 127.0.0.1 both mean the same on a local box?


Debugging and Interpretation

  1. I see the IP address 127.0.0.1 refers to a loopback interface on a local box
  2. When TCP/IP sees the IP address which starts with 127, it understands this is a loopback request
    • This request does not go out of the local box
    • The response to this request is returned to the top TCP/IP layer of the same local box
  3. But, the local box does not need to have one and only one loopback IP address, that is 127.0.0.1
  4. The loopback IP of the local box can range from 127.0.0.1 to 127.255.255.255 (the 127.0.0.0/8 block), along with the combination of port numbers which can range from 0 to 65535
    • 0 is the reserved port number in TCP/IP; it cannot be used
  5. localhost is the domain name for the loopback IP address which is 127.x.x.x.
    • This also means, the loopback IP of localhost does not necessarily have to be 127.0.0.1 all time
    • It can be different
    • And, if the port number is used along with the loopback IP address, then 127.0.0.1 can still be used with different port numbers for different applications and its communication
      • Are localhost:5555 and localhost:8888 two different applications on the same IP address?
        • It looks so! Technically and logically as well it looks okay
      • But, using the same IP address with a different port number is a monolith concept, right?
        • Apart from the monolith concept, this is not a good approach
        • But, for now when running multiple Docker images locally and testing locally, this is a state and transition which is more likely to occur
  6. Further I see, the application which is bound to the domain name localhost might not necessarily have the IP address 127.0.0.1 always
    • The IP can range up to 127.255.255.255
    • This indicates to me there is a difference between 127.0.0.1 and localhost
      • That is, localhost and 127.0.0.1 do not mean one and the same every time
  7. I see there should be binding here to the local box's domain name -- localhost
    • This is a differentiator
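
One way to check this interpretation on a given box is to ask the resolver what localhost maps to there.  Below is a minimal sketch using Python's standard socket and ipaddress modules.  The function names are mine.  On many machines the name can also resolve to the IPv6 loopback ::1, which is an assumption to verify on your own box and not something I established in the debugging above.

```python
import ipaddress
import socket


def is_loopback(address):
    """True for any address in 127.0.0.0/8 and for the IPv6 loopback ::1."""
    return ipaddress.ip_address(address).is_loopback


def resolve(hostname, port):
    """List the distinct IP addresses the resolver returns for a host name."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

Calling resolve("localhost", 8080) on the test machine shows which addresses stand behind the name; 8080 is just an example port.  If the list holds something other than 127.0.0.1, that is a lead on why one URL works and the other does not.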

Given this context, that is, running multiple Docker images locally and testing within a TCP/IP of a local box,
  • I see using localhost [with port number] in the URL is a wise strategy in testing if 127.0.0.1 fails.  

Knowing the exact local IP address with the port number should work.  If this fails, we have an incorrect port number or a problem connecting to the port.  
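
Before running the tests, a small probe can tell whether a given address and port pair accepts connections at all.  It is a sketch that uses only the standard socket module.  The function name and the one second timeout are my choices.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Try a TCP connection and report whether the target accepted it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unresolvable: the target is not reachable here.
        return False
```

Probing both can_connect("127.0.0.1", port) and can_connect("localhost", port) for the same port makes the difference between the two visible before the test reports that the target is unreachable.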

All these are being spoken in the context of
  • A local box
  • Communication between the images on a local box.

It is all about which application on the local box has taken [registered] the domain name localhost.  Or, find the exact IP address on the loopback interface along with the port number used by an application.

How I'm bringing up the servers and what the configuration includes for the IP address and port numbers, is not to be taken casually. Especially on the local box!


Your Thoughts?

If you see my interpretations are incorrect, kindly help me to learn by commenting on this blog post.  We will connect to share and discuss.  I'm open to unlearn and learn!



Note: I have read about the term Small Weight Tests.  These tests have requests and communication which do not go out of the box.  The request and response happen within the box with little or no I/O and CPU operations.