Showing posts with label Investigation. Show all posts
Showing posts with label Investigation. Show all posts

Tuesday, December 16, 2025

Payment Transaction Declined With Code 201, Why?


I read a LinkedIn post yesterday.  This post asks for the appropriate contextual message to deal and how to continue upon what is shown to the user.  I see it is fair and straight.  I expect the same.  

Thanks Liudas Jankauskas for writing this LinkedIn post and sharing your experience.

What surprised me is why the discussion was not taken forward in the post.  As a  result, we did not seed the mindset and attitude of Test Engineering and Prevention discussion.  Testing does not prevent; but, the outcome of the test will persuade the prevention efforts and culture.

Okay! If you see this post is too long to read, then here is in short to you -- This is not a API problemThe API has worked seamlessly and it has done what it is supposed to do.  The code 201 returned is right and it is supposed to be there.  Here, the 201 is not a HTTP Status Code.  It is 3DS Error Response code.

Then, whose problem it is?  To know that, read the entire blog post.  You will thank me and yourself!


I Say This, Before I Start

  1. I'm writing this post as an interpretation and analysis for the JSON shared in the LinkedIn post.
  2. There is no intention of pointing to anyone.
    1. The intention is to share -- how to analyze and have perspectives to analyze when I [a test engineer] experience it.
    2. This post can become a reference to someone who is serious about Test Engineering and Prevention, and Testing.
  3. The intention of this post is to let know,
    1. How to analyze such incidents in the payment context?
    2. Is this a API problem?  What behaviors should be classified as an API problem?
    3. Which component and layer is supposed to handle it?  How and why?
    4. Which other components and layer is supposed to assist to handle it?
    5. How to see and interpret the HTTP and its status code in a narrowed cases?
    6. It is not always the API problem!
  4. I'm open to correct myself, unlearn, learn and update this blog post, when shared why I'm not making sense with I have written here.
    1. I will be thankful and humble to you for helping me to correct myself. 🙏
    2. Be comfortable and do connect and help me if you see I need it.


The Payment Failed; Could Not Buy Ferry Ticket

The user wanted to buy the ferry tickets by making the online payment.  The payment did not happen and could not buy the ferry ticket.

Instead, the user sees the JSON on the UI. 

The JSON has error description which reads as -- "TdsServerTransID is not received in Cres."

To understand what is TdsServerTransID and Cres, it is required to understand the payment gateway flows.   Continue reading the next sections to know them.


How It Is Layered and Works?


Before I get into analysis and say where the problem is, this context requires an understanding of the payment gateway flow.  Having this understanding, it helps to know how it could have been handled and prevented.

Refer to this below pic.  It gives the sequences of interactions in the payment transaction.  Note that, the below pic is not complete.  It is kept to what is needed to the context of this blog post.


Transaction between 3DS Server and Bank for the Challenge through
POST request.  It is shown as CReq and CRes.


The Sequence of Interactions
  1. I initiate the payment on merchant website or app to buy the ferry ticket.
  2. The merchant creates a transaction id for tracking.  Then, it hands over the rest to payment gateway.
  3. The payment gateway uses the 3DS way to carry this transaction.
  4. The 3DS initiates the Authentication Request (AReq) which will pass through the Directory Server.
    • The Directory Server reads the data in the request and forwards it to the right bank who have issued the card [or account] used in the payment.
  5. The bank receives the AReq.  
    • The bank decides should it give a challenge to the user to make the payment or just agree with the data passed in AReq
  6. Now, in this case the bank has decided to give the challenge to the user.  
    • The bank lets know the 3DS server through the response ARes.
  7. The challenge usually is to enter the OTP received over a SMS and authenticate.  I presume the same in this case.
  8. The user enters the OTP and POSTS the request.  Let us call this request as CReq.  The CReq is sent from from the user's browser or app to the bank.  The user interface on the browser or app at this point is handled by the payment gateway and not the merchant's web or app where the ticket is being brought.
    • This CReq will have,
      1. tdsServerTransactionId
      2. messageVersion
      3. messageType
      4. challengeWindowSize
      5. ascTransactionId
    • Note that, the CReq having the tdsServerTransactionId which is generated by the payment gateway.  tdsServerTransactionId is required in the CReq and expected by the bank.
  9. On receiving, the CReq, the bank processes the request.  The bank responds back to the 3DS server; let us call it CRes.
  10. The CRes will have,
    1. tdsServerTransactionId
    2. acsTransactionId
    3. challengeCompletionId
    4. transStatus
    5. messageType
    6. messageVersion
  11. If all goes as expected, the authorization will be given for the payment.  
    • If not, the bank responds to the payment gateway (3DS) that there is a problem and the message.
      • It is the payment gateway which has to read this problem and message.
      • The payment gateway has to give the better contextual information through its interface to the merchant who is consuming the service.
      • This is key!
      • Hope you found the spot here!


The Code 201 


Maybe, we test engineers presume the error code 201 returned is HTTP Status Code.  Ask your team what is this code and what is the purpose of returning in the JSON.

Back to 201, here.  It is not a HTTP Status Code.  It is neither a custom developed error code by payment gateway nor merchant who is selling the ferry ticket.

This is a payment domain specific code used in the payment technology.  Yeah!  The code to tell the payment gateway and merchant what happened with the initiated payment transaction.  

To be particular and more specific to the context, the 201 here is a 3DS APIs response code.


The 3DS Error Code and its Description.
The credit of this image is to developer.ravelin.com

The above image is shared here so that you know the error codes are available in the payment context.  I'm aware of 3DS protocol hence I could relate and understand the error code 201.  Refer to the References section at the end of this blog post.

The payment gateway has to read this error code. Then provide the right contextual message and instruction to merchant saying what to do incase of payment failure.

Here is the catch.  Not all merchant develops a mechanism to handle this.  Instead, it is left to payment gateway service provider and depend on it.  

Do the merchant bother about handling it at their end? 
  • As a merchant, I can switch to another payment gateway tomorrow.  
  • Why should I invest in building, developing and maintaining this as a part of my business when I'm paying another business to do it for me?  
  • Isn't that senseful business question or decision?
Now, you tell me who is expected to handle this 201 error? Who should let me know on the merchant web user interface what to do?



The Case and My Initial Observations


On submitting the challenge given by the bank, the below details is shown to the user.  


Pic used from the post of Liudas.
It is the CRes returned from Bank to 3DS Server at Payment Gateway.

My Observations
  1. Looks like the ticket being bought and payment made was in Turkey.
  2. The payment gateway involved to complete the payment transaction is Payten.
  3. The user is tryin to buy the ferry tickets and is making online payment. This error is seen on submitting the challenge given by the bank,
  4. Looks like the card is used for the payment.
  5. In the LinkedIn post, I read, multiple attempts were made to make payment, and the same message is seen.
  6. The Error Code returned is 201.
    • Which means, a mandatory field is missing and it is a violation.
  7. Error Component is S, which probably means 3DS server.
    • The component that raised the error.
  8. There is an error which is not handled.
    • The error reads as Unexpected error.
    • There is an id to track it.  That is, for auditing and tracking by the payment gateway.
    • This id do not have anything to do with the bank and merchant who is selling the ferry tickets.
  9. The message has a version that reads as 2.1.0
    • This means, 3DS protocol used is of version 2.1.0
    • This is communicated by 3DS server to the Directory Server and Bank Server, so that they use the same version in the contract.
      • If the same version is not available at the bank server side, the lower version will be used.
  10. Error description reads, the TdsServerTransID is not received in the CRes.
    • The 3DS component is raising this error.
    • TdsServerTransID means ThreeDS Server Transaction Id
    • 3DS means Three-Domain Secure, which is a security protocol used for online card transactions.
  11. This means, the response from bank is, it cannot fulfil the payment request.
    • Because, the request from the 3DS layer do not have TdsServerTransID.
    • Wait!
      • Do I actually know if the CReq had the TdsServerTransID in the payload?
      • This is critical to know.
    • This is important to know.
      • The AReq (Authentication Request) will have the TdsServerTransID in its payload.
      • That means, this was present earlier.  So, the CReq is fired from the 3DS server on receiving the ARes.
      • The bank received TdsServerTransID and said it is giving a challenge  before proceeding to authorization for payment.
        • This response (ARes) from the bank will have TdsServerTransID.
          • Along with this, the below are also present as a correlation ID to track this transaction.
            • dsTransID
              • The Directory Server ID to track this transaction
            • acsTransID
              • The bank server ID to track this transaction
      • From here, the CReq payload requires the TdsServerTransID and acsTransID
        • Where did the TdsServerTransID go missing now?  How?  Why?
        • This needs to be test investigated.  I have shared possible causes in the next section.

The TransactionID of the merchant who is selling the ferry ticket is different from the TdsServerTransID.
    • The TdsServerTransID is needed for the bank and Payment Gateway to complete the transaction.
    • Without this TdsServerTransId, the bank cannot proceed to authorize the payment.

All this is happening between the payment gateway [3DS server] and the Bank.  The user and the merchant website is not in the picture here.



Why Did It Happen?


It happened because, the CReq from the 3DS Server of payment gateway did not have that TdsServerTransID.  That is what the CRes from the bank says.  

Note that, I presume, the bank systems and services are functioning and serving its customer at this point in time.  This is an analyzed assumption I'm making here and I'm aware of this assumption I have made.

What made the CReq to miss the TdsServerTransID?
  1. I read, the user made multiple attempts to initiate and make the payment.
  2. I have multiple perspectives and interpretations with hypotheses to say this could have lead to miss the TdsServerTransID.
    • Talking about each hypotheses in detail is not the scope of this blog post.
    • But, here are some of the possibilities that could have led to this situation.
      • A glitch at payment gateway at that point in time.
      • Someone is breaching in the middle and tweaking the request and response over the network.
      • Device and hardware
      • Geo location
      • Network and traffic
      • Timeouts
      • Latency
      • Storage running out
      • Configuration gone wrong
      • Caching and missing -- not persisting
      • Intermediate and dependency services clogged
      • A new release deployed and updated
      • Downtime in the bank's services
      • Intermittent bank's services
      • Mismatching 3DS protocol versions that is not supported and accepted at either ends
      • And, more!
In simple, it was well handled by the bank.  I assume, the amount is not debited from the user's account.  This is equally important and the LinkedIn post do not share about it.

I see the bank's service have worked well in this context and done what is expected out of it.  The thumb rule is, when there is discrepancy in the data expected and received, do not authorize the payment request; abort it

Can the service do much more if the TdsServerTransID is missed in the payload of AReq?

Me as a test engineer in the payment gateway engineering team, I will test for the below in minimal and as a must,
  1. AReq with no TdsServerTransID and observe how what happens!
  2. Do the 3DS server still fire a AReq to the bank with no TdsServerTransID?
    1. If yes, then that is a problem!  It should be handled here.
    2. This problem can be prevented!
    3. I will ask my team why are we encouraging such request?
      • This increases our customer support cost and operations time in responding to our clients for using our payment gateway.
      • And, merchant can lose the business because of our payment gateway.
  3. What should 3DS server do when initiated AReq is missing the TdsServerTransID?
    1. What all other data in the transaction and session should be retrained and intact?
    2. Will these data change with a fresh TdsServerTransID created?
  4. I will explore and figure out what factors caused the 3DS to lose the user's authentication.


Who Created The Problem?

  1. From the error shared, I see,
    •  It is 3DS server who created this problem presuming the bank and its response is right.  


Is This An API Problem?

  1. No, it is not a API problem.
  2. If we call it as a API problem, we have not understood what is an API.


Whose Problem It Is?

  1. It is the problem of the service that fired the CReq.
    • Because, it fired CReq without the TdsServerTransID which is mandatory key-value in the payload.  
    • TdsServerTransID value cannot be null nor empty in JSON.
It appears to be a business logic problem of the service at 3DS server side for firing CReq with no TdsServerTransID.



How It Can Be Prevented?

  1. By making sure CReq will always have a distinct TdsServerTransID.
  2. If no distinct TdsSevrerTransID in a fresh CReq being fired, abort the request.
    • Create the distinct TdsServerTransID and then construct the request payload before firing the CReq.
To make sure, payment gateway [3DS server] preserves the TdsServerTransID, dsTransID, and acsTransID of the session.  If any one of this not matching at any of the components during the transaction, aborting the transaction is the best and right action to do.  

Can you recall your daily life experience where you are said to not refresh or click on the browser's back button when you have initiated the payment transaction on web or mobile app?  This is the reason!  To preserve the data of the transactions in the session between these systems -- merchant web or app, payment gateway [3DS server], Directory Server and Bank Server.

This is an automation candidate.  It has to be part of the daily test runs in the automation.



Who Should Be Fixing It?

  1. In my opinion for today upon analysis and assumption I have made, it should be fixed by the payment gateway.
  2. The payment gateway has to interpret the 3DS error code returned by the bank.  
    • And, then initiate the appropriate action as the payment transaction has failed. 
  3. A new transaction between the 3DS and bank has to be started for that merchant's order.  
    • If it cannot happen, the payment gateway should abort all the current open session tied to that merchant's order.
    • And, let know the user what is happening, and then direct the user.
  4. The payment gateway can read the ARes and CRes.
    • That means, the payment gateway can read the HTTP Status Code of ARes and CRes.
    • If there is an error code in ARes and CRes,
      • Then, the gateway should assert for the 3DS error code along with the HTTP Status Code.
      • For example, 
        • Say, the HTTP Status Code is 400 and 3DS Error Code 201 in CRes.
          • Assert for these two and direct with appropriate contextual message and direction.
          • In general, this is how custom developed error codes are handled by the client on receiving it from the services.
Note that, it is Error Code and not the Response Code.  The two are different.  And, the HTTP Status code is different from these two.



Why It Is Not An API Problem?


API is an interface which exposes the available services to the consumer.

It is the services which collects data and build the request payload, and expects the payload to process. 

This request will pass through an interface which is opened to the consumer.  This interface is called an API -- Application Programming Interface.

Analogy,
  • The car has gear stick to switch the gear.
    • This is an interface to the car's gear system.
    • The driver will use the car's gear system through this gear stick.
  • The gear box in the car is a service.
    • The gear box adjusts and responds by switching to the gear per the driver's input.
    • The different operations [business logic] provided by the gear box services are,
      • Reverse
      • Parking
      • Neutral
      • Switching the gear and assisting other components to speed up or speed down the car.

Can the interface have problem?  Yes, it can have.  In that case, the service will not be available to serve or to discover.  
  • Like, if the gear stick has problem, I cannot use the gears of the car, but, the gear box can be fully functional and in working condition.  
  • It is just the interface [gear stick] having the problem and as a result the consumer [driver] is unable to use the services [gear box].

In this case, the payment gateway could fire CReq to the bank.  That is, the API exposing this service is functioning.  It is the problem in a service.  It is the service that has missed to ensure and mandate the presence of TdsServerTransID -- a business logic problem.

In simple, we Software Engineers [including me] use the term API vaguely and with no sense of what it means.  This is my observation.  Further, the Software Testers have tossed this term API in all possible ways and learning it in incorrect ways.  I have no doubt in it when I say this.

Next time, when you say it is a API problem, rethink on what you are saying.  Is it an API or the service(s) which is accessed through that API?  It is useful when we describing the behavior of the system and its layer.  

I cannot say it is a gear box problem for the gear stick not [usable] working, and vice versa.  You see that?




To stop here,
  1. Testing skills and testing will help when it is collaborated with awareness and skills of the tech stack used in building the software system.
  2. Testing skills, programming and tech skills are not enough!
    • The domain skills and awareness is essential and critical.
    • If one is aware of the domain where the problem is observed, then the 201 will not be read as HTTP status code.
  3. Build the domain skills and maintaining the knowledge base as a GitHub project helps in a longer run.
  4. Interfaces and its gateways should not be overloaded with additional responsibilities to make sure the mandatory key-value is present.
    • If did so, one should learn why. Because, it is an anti-pattern and not a suggested software engineering practice.
  5. It is the services that has problems most times.  
    • The interfaces and its gateway will have the discovery, orchestration and traffic problems along with the risks of security.


References:
  1. https://httpstatuses.io/
  2. https://developer.elavon.com/products/3dsecure2/v1/api-reference
  3. https://developer.ravelin.com/psp/api/endpoints/3d-secure/errors/
  4. https://developer.elavon.com/products/3dsecure2/v1/3ds-error-codes
  5. threeDSServerTransID  (in our case the JSON reads as TdsServerTransID)
    • 3DS Server Transaction ID
    • Universally unique transaction identifier assigned by the 3DS server to identify a single transaction.
    • The 3DS server auto-populates and appends this filed value in the authentication request (AReq) it sends to bank in addition to the data you send.
    • This value can also be found by a lookup in the response received by this service -- /3ds2/lookup


Friday, July 26, 2024

My First Hand Analysis of CrowdStrike Falcon Update Incident


I attempted to analyze the process dump of CrowdStrike shared by my friend.  He said, there could be an attack which is leading to crash of Windows OS globally.  This made me curious to look into the dump and learn.

I had no much context around it, but, a test engineer in me did not sit quite.  I started to analyze the dump information.  Here is my first hand analysis that I made on 19th July 2024 post 10:30 AM IST.


What I Saw?

  • It is a Windows OS's process dump.
  • Looks like something with C or C++ application reading how the memory offsets were in the dump.
  • It started to read a memory offset.
  • Then the process witnessed an exception.
    • Here the program could not read further
    • Why it could not read further from this offset?
      • My little experience of testing drivers on Windows OS for a card printer machine, refreshed and recalled what I had witnessed when testing.


Scratching and Striking My Mind


I started to ask these questions myself while I asked what could have gone wrong.  I could not stop here as I was curious what led Windows machine crash.  I referred to web and learn there was an update by CrowdStrike, and then this incident.

The bugs do exist in every software no matter the level and depth of testing, automation and engineering's excellence.  All software do crash and OS is not an exception to it.  But, what made the update to crash the Windows OS?  Pointing and blaming CrowdStrike or Microsoft is not a way for the practicing test engineer.  If these two organizations are serving its huge customer base, they have something working and reliable.  Engineering does not eliminate problems.

By now, I had a thought that it is not an attack.  It is a software bug!  Where is the bug?  What is the bug?  Was it not experienced in pipeline?


The Open Ended Questions


I had these questions as I analyzed and spoke to my friend.
  • What is Falcon?
  • What was this update to Falcon?
  • How frequently the updates are rolled out?
  • How the updates are rolled out globally?
  • What pipeline do they have in testing?
  • Who is impacted the most in business? Is it Microsoft or CrowdStrike?  Impacted in what way?
  • What is CrowdStrike?  What they do?  Who are the customers?
  • Where do the CrowdStrike's Falcon sit in the OS and what it does?
  • How CrowdStrike works in the machines and what it offers?
  • What do the dump say? Relook into it with different perspectives.
  • How this could have been prevented?
  • How will I prevent this if I join this team knowing this incident?
With these questions, I started to analyze the process dump which was shared.

I had more such questions, but these were the first few that I crossed as I started.



Analysis of Process Dump


My interpretation, tells me the below for today
  1. Accept that it is an incident as any other incident which I witness in production environment.
  2. Do not fall to the speculation happening around.  Remain calm and focus to interpret and understand your exploration.
  3. I see, if it can start to read from an offset and then ending to experience a non-existent or invalid offset, is it a NULL Pointer?
    • What is NULL Pointer?
      • A NULL Pointer is a pointer that does NOT point to any memory location and hence does not hold the address of any variables.
      • If I do not initialize and assign, the pointer will have NULL as its value.
      • For example, int *test;
        • When I want to access the pointer test (a location in memory) pointing to, I will not be sure what is in the pointer when I read it.
          • I may not set it later or set it.
          • In this case, the code can tell if the pointer is valid or pointing to a garbage memory
        • But, if I declare it like int *test = NULL;
          • I can check if was set and initialized
        • It is a better practice to assign a NULL value to a pointer during initialization so that we can check if it is NULL or as any address assigned to it.
      • This understanding of Pointer makes me think, is it due not initializing a pointer and so the error code c0000005 on reading a memory that is not valid.
      • When we assign a NULL value to pointer, it is a null pointer in C++
        • We assign null value for testing and asserting
          • If the memory is allocated to a pointer or not
          • If it has a return address and is a valid one or not
          • If a pointer is not initialized, assigning null it prevents problems to certain extent
    • With this understanding, I also read, it started to read from an offset 0x9c, and then failing.
      • What is 0x9c?
        • In Octal it is 234. In Decimal it is 156.
        • Can there be such address in a computer's memory? I don't know.
        • If it is a access violation, then is it a memory which is in preemption of the OS?
          • If so the OS can terminate the program or process which is trying to access it.
          • Is this killing the process and aborting the operation of Falcon's IPC and eventually Windows coming to BSOD?
      • This tells me it is not a NULL Pointer in first case but not initializing a pointer to NULL.
        • I infer, if the pointer was assigned to NULL, that is initialized, there could have been some hint in the state and event when accessing the memory.
          • This is my analysis; but, I have not seen the test code nor aware of the product.  All this inference is based on the process dump and my experience of testing drivers.
      • It got something in between from update (a config or pattern?) for which it cannot find and read in the memory?  Why?
        • This indicates me, it could be a bug, that is, a logical problem.  This is my hunch for today!
  4. Data in the dump
    • Exception Address
    • Read from Address 0x9c
    • Exception Code: c0000005 (Access violation)

Testing my Interpretations


CrowdStrike as an org when it caters its SAAS to such a customer base, won't it have a testing pipeline
  • It will have, I have no doubt in it.  They test and roll out the updates, I believe in it.

Did they witness any such incidents earlier?
  • I searched on web for it and I did not find something similar on the Windows, earlier.

Is this a NULL Pointer?  Are you sure?
  • No, I'm not sure.  But, there is something that is leading it to address which does not exist or which is invalid?  I will have to wait for their RCA to know technically what caused this.  But this is my understanding reading the dump.

How do you think it is a memory access problem?
  • The error code 0xc0000005 says that.
  • I referred to driver easy website for the information because my experience of testing the drivers for Windows OS and experiencing such incidents led me there.  This is what I learn:
    • https://www.drivereasy.com/knowledge/solved-how-to-fix-0xc0000005-error/

Do you think the programmer would not have handled the obvious Pointer and NULL initialization?
  • I believe there will be a check for Pointer and what it is pointing to.  But is it due to no initialization?  Technically this has to be analyzed which I cannot do.  I will have to wait for CrowdStrike team to share the tech details.

Is this a driver problem that killed the Windows kernel?
  • I don't know.  But, the .sys file will not have driver as per my learning.  It will have information about the drivers and any configurations.
  • This incident is a problem, which impacted both CrowdStrike and Microsoft.  Maybe, both will have their areas to look and fix it they see so.  But, in this context, CrowdStrike can fix it quicker and that is much better -- is what I understand.
  • I'm a Windows user for long time.  I see, Windows has worked well to all my contexts so far.  The Engineers of Windows OS knows better than me here.  I'm not well aware and informed as they are.
  • CrowdStrike's engineering team are skilled and they are rolling out updates often in a day.  They have a better pipeline when this is being done.
    • But, the question I have is, how did this happen?
    • No one lets such problem into production when they are aware of it.  Do you?
    • There is something that has not come to their observation and experience.  What is that?
    • Knowing this will help to prevent this and similar incidents happening in future.
      • I'm waiting to know what did not come to their experience and led to this incident.

What could be in the .sys file of CrowdStrike?
  • I don't know!  I want to learn that.
  • But, from my testing of .sys file and drivers on Windows OS, I learn there could be a configuration details with certain pattern or information to capture at run time, and help the installed software to run.  This is my learning and awareness from my testing.
  • That said, testing at OS level and Anti Virus engines are not obvious.  Testing of drivers is like the risky mines.  What is sufficient and good enough in test coverage?  It needs an expertise at OS internals level.
  • Windows OS having such a fragmentation in its versions, updates and patches, it is a battle field and mines for engineers building such solutions for sure!
  • I learn, the Windows OS stopped when an application tried to access the invalid region or non-existent memory.
    • The update which was rolled out, did it have a configuration or a pattern that showed a logical problem when processing it?
    • I have such questions and thoughts that are striking my mind as I think and build a problem model for the same.

Is this a race condition incident?
  • I see, it is not a race condition incident as users across globe experienced it.

Is this specific to a Falcon version, OS version and hardware?
  • Not all host machines would be on latest version of Falcon, is my presumption.
  • At least, n-1 and n-2 versions should be on host machine which experienced this behavior.
    • So it is not a Falcon version specific, I see.
  • It looks to me as it is not specific to the Windows OS version and hardware configuration.
    • It is an application software problem which occurred at driver level is what I see.
      • This is an IPC communication and process is my understanding.
        • The driver can receive the IPC communication in continuous mode.
        • At times, this can get queued based on the application and what it does.


Where is the Problem?


Well, I'm looking and pulling from my visualization by relating with my experience of testing the driver on Windows OS.  I don't know the exact reason or close enough to tell what could have gone wrong.

Reading the process dump, it says accessing a memory that does not exist or corrupted.  One of the high possibility is, the starting offset is seen but it is not helping when reading.
  • For example, Ravi has the address of India's Prime Minister house.
    • But, he does not know from where to start despite having the address.
    • He is void and null in knowing where to start and what to do when he is not initialized with the start location to begin the travel to the Prime Minister's house.
    • In short, he do not know where the address is pointing to and what it has, though he is given a address to start.
      • Can he access the Prime Minister's house premise without any access granted and authorized to do so?
      • If not, won't he be arrested by police or other security forces and stop him?

Do I Know the Precise Problem?


I don't know!  I do not know the CrowdStrike product and platform.  I'm waiting to read the technical details from Crowd Strike.

I see, it comes to the data, state and event.  I would focus on how to prevent it learning which data, state and event led to this behavior.  I think of figuring out the Test Design and Strategy that can help me to identify such use cases.  I focus here and see can it brought into the automation so that it gets exercised and regressed consistently.

If it is due to the memory access that had a problem, I did such tests when testing driver for a hardware machine on Windows OS.  I will share the tests that I did in upcoming blog.

I wrote the technical analysis from process dump to CrowdStrike and Microsoft.  I did not get a response.  Anyways, I'm sharing the overall information in a non-technical way so that it is consumable to most readers here.



Note: Here are another threads of me sharing my thoughts on same:
1. https://x.com/testingGarage/status/1814215089525821763?t=XSFdx69ElL0ZmBOcEFrTjg&s=19
2. https://www.linkedin.com/posts/ravisuriya_%3F%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F-%3F-activity-7221156949445206017-oeRa




Saturday, February 3, 2024

Database: Finding the Tables Having Specified Column Name

 

In today's pair testing session with a mentee, we were testing for Database I/O.  We were on PostgreSQL.  One of the questions a mentee had is,

How can I figure out the tables having this column name?

Running through every tables and exploring if the column being looked for is present or not, is time consuming.  It is not a approach to take as well.

I went through this when I started the ETL testing practice in 2011.

Here is the query that works on PostgreSQL to find table names which has specified column name.


Query:

select table_name, column_name
from Information_Schema.Columns
where table_catalog='database_name' and column_name like '%column_name%'


It is a better approach to know the precise column name and using the condition as -- column_name='EmployeeId'.


This query should work on MySQL and MSSQL Server.  If not working on MSSQL, need to look into the FROM and WHERE clauses if it is vendor specific.



Monday, October 16, 2023

Performance & Tests: Getting Started and Data Analysis

 

On running tests,

  • We will have data (information) as one of the byproduct.
  • Analyzing the data of the integrated sub-systems in isolation and correlation,
    • It will lead us to a technical analysis on each integrated system.
In the report, we draft this analysis along with actions to be taken.

Note: When said sub-systems do not ignore or skip the client or consumer; the system does not comprise just server.


No Golden Rule

There is no one way to do a testing.  Likewise, there is no one way or the golden rule to test for performance.  It is contextual and depends on what I want to learn.

In fact, in few contexts, we can have a value adding performance test with just one request.  Just, I should be well aware of -- what is that I want to know and learn from this test.

That said, there are multiple interfaces where we can observe, analyze and learn from the performance data collected.

The fourth question from season two of 100 Days of Skilled Testing, is:

What are your favorite hacks to analyze performance testing results and find anomalies?

Well, this question do not mention explicitly if it is for server or client or database or caching or messaging or for what interface of a system.  It is a question; but, to me it looks too generic and at a point it looks vague.  Having said this, that is how the learning journey and curve starts! 


Result vs Report

What is a result?

  • Is it an evaluation after a data [information] is put to scrutiny?
  • Or, the result is a data that is collected and not yet interpreted?

It depends on individual or team and how it being practiced.

The result is different from a report.


Getting Started and Data Analysis


I should know how the system architecture is designed and orchestrated with its boundaries and interfaces.  This helps a lot.  What kind of architecture is this?  Is it a monolith?  If it is monolith, my approach to test for performance differs.

If I'm asked to start the analysis of data for a system that I'm not aware of,
  • I will start by analyzing the below indicators on knowing the architecture and the orchestration of the sub-systems for critical business workflows
    1. CPU usage
    2. RAM usage
    3. Data I/O
    4. Network usage
    5. The Heat and sound dissipated from the hardware which holds and binds
      • CPU, RAM, Data I/O, Network and tech stacks installed and configured

It hints me to look further and test investigate, when I observe:
  • Having a steady consumption
    • What is steady consumption in this context?
  • Having a low consumption
    • What is low consumption in this context?
  • Having a unusual consumption spike and fall of it
    • I follow the pattern to study further
    • What is considered as knee, spike and fall, in this context?
  • Having a zero consumption
  • Having a maximum consumption
    • What is maximum consumption in this context?
Having a high consumption doesn't mean a problem.  Likewise, having a low consumption does not mean all is well.  I have to uncover them to learn what it means in the given context.

In each of this, there will be a pattern.  I will learn them.  I will correlate with other sub-systems and learn what they were doing in the said timeline.

Do you recollect this line -- "the architecture should provide the Testability"?
  • I wrote about it in one of the blog posts of Performance Engineering.

I refer to the below by traversing with the timeline,
  • The logs by asking for it
  • Data recorded
  • Any APMs that are in place
I correlate all these with above said indicators.

This gives me a start. It is one of easiest start that I can have to get started with analysis.


Well this is to analyze at the server end.  What about the client [consumer] end?  It is simpler and will share in the coming blog posts.



Do you want to know more on this and other strategies that can be used contextually?  Let us get connected and converse.  I'm happy to share and learn on listening to you.  It is fun and awareness!



Tuesday, April 4, 2023

My Interpretation of localhost and 127.0.0.1

 

Incident

When executing a test, I see the below message 

The target is unreachable, Please make sure your target is up and running

I have multiple Docker images running on my machine.  Most of these images are different products that are bound to run locally.


Debugging and Observations

In this context, I'm using http://127.0.0.1:portNumber as an IP to communicate with an application which is a Docker image.  I see the above said message.

But, I have not mentioned any IP address specifically; and there is no port forward.  This puts me into question -- What's going wrong here?

I use http://localhost:portNumber and try to communicate with a Docker image. Now the test execution does not see the message above said.

I see  a next question in me:

  • Isn't localhost and 127.0.0.1 both mean the same on a local box?


Debugging and Interpretation

  1. I see the IP address 127.0.0.1 refers to a loopback interface on a local box
  2. When TCP/IP sees the IP address which starts with 127, it understands this is a loopback request
    • This request does not go out of the local box
    • The response to this request is returned to the top TCP/IP layer of the same local box
  3. But, the local box does need to have the same and only one IP address, that is 127.0.0.1
  4. The IP of the local box can range from 127.0.0.1 to 127.255.255.255 along with the combination of port numbers which can range from 0 to 65535
    • 0 is the reserved port number in TCP/IP; it cannot be used
  5. localhost is the domain name for the loopback IP address which is 127.x.x.x.
    • This also means, the loopback IP of localhost does not necessarily have to be 127.0.0.1 all time
    • It can be different
    • And, if the port number is used along with the loopback IP address, then 127.0.0.1 can still be used with different port numbers for different applications and its communication
      • Are localhost:5555 and localhost:8888 two different applications on the same IP address?
        • It looks so! Technically and logically as well it looks okay
      • But, using the same IP address with a different port number is a monolith concept, right?
        • Apart from the monolith concept, this is not a good approach
        • But, for now when running multiple Docker images locally and testing locally, this is a state and transition which is more likely to occur
  6. Further I see, the application which is binded to the domain name localhost might not necessarily have the IP address 127.0.0.1 always
    • The IP can range up to 127.255.255.255
    • This indicates me there is a difference between 127.0.0.1 and localhost
      • That is, localhost and 127.0.0.1 does not mean one and the same every time
  7. I see there should be binding here to the local box's domain name -- localhost
    • This is a differentiator

Given this context, that is, running multiple Docker images locally and testing within a TCP/IP of a local box,
  • I see using localhost [with port number] in the URL is a wise strategy in testing if 127.0,0,1 fails.  

Knowing the exact local IP address with the port number should work.  If this fails, we have an incorrect port number or a problem connecting to the port.  

All these are being spoken in the context of
  • A local box
  • Communication between the images on a local box.

It is all about which application on the local box has taken [registered] the domain name localhost.  Or, find the exact IP address on the loopback interface along with the port number used by an application

How I'm bringing up the servers and what the configuration includes for the IP address and port numbers, is not to be taken casually. Especially on the local box!


Your Thoughts?

If you see my interpretations are incorrect, kindly help me to learn by commenting on this blog post.  We will connect to share and discuss.  I'm open to unlearn and learn!



Note: I have read about the term Small Weight Tests.  These tests have requests and communication which does not go out of the box.  The request and response happen within the box with little or no I/O and CPU operations.




Wednesday, July 13, 2022

IntelliJ and Cache: Maven Dependencies Not Resolved


This post is about the Maven dependencies not resolving.  I'm recording the incident, my understanding, and what worked for me. It can help a fellow Software Test Engineer.


Incident and its Details

I'm using the machine which has the below setup:

  • OS: Windows 10 Pro
  • IDE: IntelliJ IDEA 2022.13 (Community Edition)
  • JDK: 1.8
  • Maven: 3.5.4 

I created a new Maven project and in the pom.xml, I added the below dependencies.  The IDE showed that these dependencies are not resolved.

<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>4.3.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.testng/testng -->
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>7.3.0</version>
<scope>test</scope>
</dependency>

<!-- https://mvnrepository.com/artifact/io.github.bonigarcia/webdrivermanager -->
<dependency>
<groupId>io.github.bonigarcia</groupId>
<artifactId>webdrivermanager</artifactId>
<version>4.3.1</version>
</dependency>

<!-- https://mvnrepository.com/artifact/io.rest-assured/rest-assured -->
<dependency>
<groupId>io.rest-assured</groupId>
<artifactId>rest-assured</artifactId>
<version>4.4.0</version>
<scope>test</scope>
</dependency>


I observed, that the added dependencies not getting resolved, that is, not getting added to libraries.  In the IDE, I see the error as this:

  • Dependency 'io.github.bonigarcia:webdrivermanager:4.3.1' not found
  • Dependency 'org.seleniumhq.selenium:selenium-java:4.3.0' not found 
  • Dependency 'org.testng:testng:7.3.0' not found
  • Dependency 'io.rest-assured:rest-assured:4.4.0' not found

Note: I have other Maven projects to which I have added the same dependencies but of different versions, in my workspace. Each time I add a dependency it will download to the libraries in my workspace though it exists, and this is my configuration in IDE.

At this point, I did not know what is happening and how to approach resolving this problem.  All I know is, that there are other versions of the same dependencies in the Maven's library on my machine.


Understanding the Problem


I tried to understand what it is saying to me.  It says it cannot find the dependency.  Could be that it is looking for the dependency in the Maven's library on my box, and it is not finding it.  

Further, I looked into the web to see did someone else face the same incident and said behavior.  Few posts said using the "compile" scope solved the problem in their case for WebDriverManager.  But, how can the change in scope fix the dependency not found incident?  This was the question in me!  To my curiosity, I did try that for WebDriverManager; I did not see any change.


Invalidating the Cache - Is that a Fix?


In the IntelliJ IDE, I tried to invalidate the cache and reboot the IDE and open the project.  I see the dependencies being resolved and I do not see the said problem.

I don't know how this fixed the said problem here.  Also, I don't know precisely what exactly the problem is, here.  Invalidating the cache and reopening the IDE (also the project) worked.

On invalidating cache, here is what happens per IntelliJ:
  • Removes the cache files for all projects ever run in the current version of IDE
  • The files will be recreated the next time you open these projects

I remember now that I upgraded the IntelliJ to the latest version before creating the said new Maven project, here.  Later, it resulted in the above said dependency problem. 

Note that, I had created a few Maven projects using an older version of IntelliJ.  I have the same dependencies in the old and new projects, just the version of these dependencies is the only difference.

Now, invalidating the cache on this new version of IntelliJ, it got fixed.  I understand that the IDE uses the cached instances of these dependencies.  Could be there is a relationship between this cache, dependency versions, and the IDE version?  I'm technically not sure of the same, but my instincts say it is.



Saturday, March 5, 2022

The Never Ending Execution and Code Smell

 

I write this post to share my learning with all Test Engineers and not just be confined to Testers in a group.  I read the below question in the Telegram group - Testing Mini Bytes.  

A fellow Test Engineer in the group asked this question by sharing the method where he experienced a problem.

Question:

This is going on infinite loop can someone let me know why the loop the condition is not working



Pic: The method whose execution is not ending


Here is how I approached analyzing the stated problem and asisted the fellow engineer:

  1. Read and understand the problem description
  2. Understand whose problem it is and why from the description if the detail is available
  3. Know the Programming Language used
  4. Read the code and understand its flow
  5. Looked for the debug statements
  6. What is declared and how, why, and what does it hold?
  7. Check if any state exists and what it is
  8. Logic check for condition and flow
  9. Check on maxAttempt count
  10. Syntax check for condition and flow

Now, I look at how the condition check is being done:
  1. The operator used in condition with the operands
  2. Declaration
  3. Assignment
  4. Updating
  5. Referring

In this case, the condition has IF and ELSE blocks.  The IF block has a check on the status and maxAttempt count.  The IF condition has logical AND between these two.  And, the ELSE block has a check on maxAttempt count alone.

What I infer is:
  1. The operator != should not be a problem in this case
  2. Though JS also has !== operator, != should work in this context; !== also works if used
  3. Could be the status is not yet COMPETED or COMPLETED
  4. But, the maxAttempt will not be less than 5 in the first check
  5. maxAttempt is incremented
  6. And, the same having this IF and ELSE block is called
    1. Doing so, the maxAttempt will be reset to zero
    2. Note that maxAttempt was incremented before resetting it to zero
    3. In this method, maxAttempt will always be zero that is less than 5
  7. Irrespective of whether the status is COMPLETED or not, this execution will run into a loop


Not having the debug prints or console log for status and maxAttempt, it won't be obvious why this execution is going on & not stopping.  It is a kind of recursion experienced that is not intended to be here.  Unless short of system resources, this recursion continues.

Though I see the condition check logic can be written in other approaches, I did not get into how it can be done.


Learnings

  • Use the debug statements or console logs; it helps in debugging and understanding what's happening
  • Code Smells
    • Multiple Point of Failure
      • The place where the maxAttempt is declared and initialized
      • How the method is being called within it
    • Long Method
      • Though the method is simple in what it does, it involves multiple operations
      • Also, the method calls itself within it
        • It complicates by calling itself; breaking this method is good
        • One method one task
The method looks simple but how it flows within by resetting maxAttempt to zero.  And, a recursion kind of nature makes it uneasy to follow when there are no debug statements.




Monday, January 3, 2022

The Automation Strategy Problem; Not a Appium Challenge

 

In The Test Tribe's forum, I read the post which described the problem as in the below paragraphs and picture.  On looking into it, I learned this can be made as a blog post that tells a strategy for automation.  

Maybe, 10 years back I would have asked the same question.  That's a learning curve.  Today as well, I end up in thinking for a while asking self -- how to test it and how to automate it.  

I want to share how this problem can be looked at from the perspective of testing and automation, and then approach it to automate.

 

Folks. I've two issues on Appium automation which needs your help. 

1. I'm working on a ecommerce website where a payment method is integrated (lets take the example as PhonePe). When i try to place the order in the mobile website with payment method as PhonePe, the payment method app will be opened and I've to complete the payment using it and I'm navigated back to the browser. Issue is - How can i switch context between the mobile browser and the app? I tried using driver.startActivity() but on performing any other actions, it errors out. 

2. Since i need to use the browser to place order and the payment using the payment app, I tried to set up the driver instance with browserName and app as the desired capabilities together. But on running the test, it errors out - browserName and app can't be used together. How can i approach this problem? Anyone who has automated such flows?

Apologies, i'm pretty new to Appium and so, please excuse my ignorance.


Picture: Problem Statement - Description of Scenario & Challenge


Understanding the Scenario and Functional Flow

I observe the below in the said scenario:

  1. It is a website; it also has a mobile website
  2. It has got a payment option integrated
  3. The Appium's Desired Capabilities defined has browserName and the app
    • borwserName -- name of the mobile web browser used in automation; it is an empty string if automating an app
    • app -- the path of an app to be automated
  4. When using a mobile website on a mobile device -- assuming it a mobile web app
    • On selecting a payment app -- assuming it a native app
      • The context changes to payment app UI
      • On completing the payment, the context changes to the mobile website



Challenges Described in the Funtional Flow


I see these as challenges:
  • How to handle this said scenario in automation using Appium?
  • How to switch context between mobile browser and the mobile app?
  • Using driver.startActivity(), it yields an error on performing any other actions
    • On making any actions on UI after using the above said method, the error is observed
      • Reading the description, it is said that the error is thrown when running the automation
        • And, when changing the context back to mobile website from payment app
The driver.startActivity(), takes two arguments -- app's package name and activity to be started.  What's passed for the package name and activity name is not clear from the problem description.  

If the mobile browser is used to launch the mobile website and mimic the action, what is passed as app's package name and activity in driver.startActivity() ? This is not mentioned and unclear to me.

Also what is mentioned for the browserName and app in desired capabilities is not clear.



A Common Use Case


In recent years this is a common use case in a mobile native app having a web view and the websites that have payment transactions.  For example, in the native app when making payment, the web view of payment gateway that shows list of payment choices.  On successful payment, the view switches to native view from web view.



Questions on Reading the Problem Statement:


I have the below questions on reading the problem description:
  1. Why did it throw the errors on any actions post calling the driver.startActivity()
    • driver.startActivity() will start an Android activity using package name and the activity name
  2. The context picked on switching from web view to native and then back to native, is not well picked?
    • But it is a mobile website which means it is opened on a mobile browser, right?
      • No where it is mentioned as a Hybird app i.e. the mobile website installed as an app
    • Does this mobile website maintains its context when switching to a native app (payment app), and then changing the context to (web view) mobile browser?
This takes me to seek clarity for:
  1. Is mobile website a installed Hybrid app? Or, is it a regular website which also has a mobile website and accessed on a mobile browser?
  2. Is it possible to switch the context of web page from mobile web browser to native app, and vice versa?
    • I need to explore it; I'm unsure of it
    • When read the desired capabilities, it looks like this can be done
      • That is context switching of mobile web browser to native app, and back to mobile browser from native app is possible
      • I need to explore on the same to be very sure of it

Code Snippets for Context Switching


Refer to this page for details on using the Web view with Appium.  The below code snippets tell how to find the context of web and native views, and switching to it.
Snippet illustrating the change of context to Web view

Snippet illustrating the change of context to Native view



But, What's Actually the Problem?


If automated as described in problem statement, do we end up in a problem?  I see, yes we will end up in a problem:
  1. Need to maintain our automation to make sure it executes the payment app UI anytime
    • If the UI of the payment app changes, we need maintain the code
  2. Do we have stage environment payment app in this case?
    • If we test the mobile website in the stage and make transaction in production payment app,
      • Can we continue as this in each test iterations?
      • If yes, how long can we continue to use production payment app and pay?
      • Will there by any transaction fee charged each time from payment app?
        • Can this become a financial cost to the business and client or to stakeholders?
        • What other cost should I bear for using this approach?

I need to know:

  1. What is that I want to learn from the use case or scenario on automating it?
  2. What would be the impact if the test did not help me to learn what I want to learn from automating this use case and scenario?
  3. Should I be testing the payment app along with my app? 
    • As I write UI automation to handle the web view of payment gateway and then the native payment app, it becomes part of this test.  Should I do that?
  4. What information, risk and problem discovery I miss, if I do not automate the payment app flows?
    • Is it okay for the business and product, if I miss any information here or if I do not test the flow in payment app?
    • How to arrive at this decision?
The decision here need to be rational.  But, being rational alone may not help always.  Can I be reasonable here when I'm deciding or influencing stakeholders when deciding?



This is a Automation Strategy Problem!


If seen, for first this is not a Appium problem.  It is a problem with -- what to automate, how to automate, when to automate, how much to automate, and why automate.   That is, it is a problem with automation strategy on how to approach and execute it.

To me it is a problem to solve with approaching and execution of automation for payment transaction, and not a automation library usage and implementation problem.  



How can I Approach the Automation Here?


I will learn, should this payment scenario be automated on the UI layer for first?  If yes, why?  And, then I will have the below questions
  1. Can I use the developer APIs of payment service to test and complete the transaction?
    • If yes, then
      • Can I use the stage APIs of payment to simulate the transaction flow and its completion?
        • If I just use APIs, I will not know what's the functional experience of transactions in native payment app.  Is this okay?
  2. I and the product I test, do not have control over the payment system and its apps
    • When I have no control over it at any point in time, should I test it as part of my system?  If I did so, should my product as well include the probabilities and complexities of payment system?
      • Having this information is good!
      • But what can I do with that information?
        • Do I have an authority to change or fix payment system with that information?
        • If yes, good; if no, then the time and resource spent on this s a value return to my stakeholders and their business?
    • It is wise to mention that I'm not including and testing the payment system and its transactions as a part of my system
      • Because my system does not have a control over payment system in any means
  3. If the API that is used for initiating transaction is functional and usable, then I do not have to worry technically from functional perspective of transactions
    • We will have to work on -- if the payment initiating web view is functional on my native app and in my website or a mobile website
      • From here the control of payment and any transaction problem that arises are in the realm of the payment system
  4. In the test report
    • I will include the stage payment API request and its response with data
      • Talking to payment app organization, we may get the developer API access on stage to test our system on their stage
      • Talk to payment app organization!
      • Also we can mock the payment API to an extent and in the test report say this is a mock result
        • If relied on mock, then we can miss the change in payment system
        • I will have the mocking as last approach just to complete a business flow and it will not be my pick unless someone wants to see a business flow completion in a test
    • Have a test that tell about functional and usable aspect of the payment page in -- a mobile website and the payment web view in native/hybrid app


Benefits of this Approach

  1. I and my tests will have clarity what is in my control and what not
  2. When I have control, the test and automation can be well maintained
  3. The flaky areas can be identified; I can come to a decision to eliminate it from the automation or not
  4. It helps to identify what is my problem and what is the problem that I don't own in terms of authority
  5. With this approach, the tests and automation provides clarity when we uncover a risk or problem
While I know the benefits, I must also know the cost of having this approach.