Showing posts with label Test Engineering. Show all posts
Showing posts with label Test Engineering. Show all posts

Tuesday, December 16, 2025

Payment Transaction Declined With Code 201, Why?


I read a LinkedIn post yesterday.  This post asks for the appropriate contextual message to deal and how to continue upon what is shown to the user.  I see it is fair and straight.  I expect the same.  

Thanks Liudas Jankauskas for writing this LinkedIn post and sharing your experience.

What surprised me is why the discussion was not taken forward in the post.  As a  result, we did not seed the mindset and attitude of Test Engineering and Prevention discussion.  Testing does not prevent; but, the outcome of the test will persuade the prevention efforts and culture.

Okay! If you see this post is too long to read, then here is in short to you -- This is not a API problemThe API has worked seamlessly and it has done what it is supposed to do.  The code 201 returned is right and it is supposed to be there.  Here, the 201 is not a HTTP Status Code.  It is 3DS Error Response code.

Then, whose problem it is?  To know that, read the entire blog post.  You will thank me and yourself!


I Say This, Before I Start

  1. I'm writing this post as an interpretation and analysis for the JSON shared in the LinkedIn post.
  2. There is no intention of pointing to anyone.
    1. The intention is to share -- how to analyze and have perspectives to analyze when I [a test engineer] experience it.
    2. This post can become a reference to someone who is serious about Test Engineering and Prevention, and Testing.
  3. The intention of this post is to let know,
    1. How to analyze such incidents in the payment context?
    2. Is this a API problem?  What behaviors should be classified as an API problem?
    3. Which component and layer is supposed to handle it?  How and why?
    4. Which other components and layer is supposed to assist to handle it?
    5. How to see and interpret the HTTP and its status code in a narrowed cases?
    6. It is not always the API problem!
  4. I'm open to correct myself, unlearn, learn and update this blog post, when shared why I'm not making sense with I have written here.
    1. I will be thankful and humble to you for helping me to correct myself. 🙏
    2. Be comfortable and do connect and help me if you see I need it.


The Payment Failed; Could Not Buy Ferry Ticket

The user wanted to buy the ferry tickets by making the online payment.  The payment did not happen and could not buy the ferry ticket.

Instead, the user sees the JSON on the UI. 

The JSON has error description which reads as -- "TdsServerTransID is not received in Cres."

To understand what is TdsServerTransID and Cres, it is required to understand the payment gateway flows.   Continue reading the next sections to know them.


How It Is Layered and Works?


Before I get into analysis and say where the problem is, this context requires an understanding of the payment gateway flow.  Having this understanding, it helps to know how it could have been handled and prevented.

Refer to this below pic.  It gives the sequences of interactions in the payment transaction.  Note that, the below pic is not complete.  It is kept to what is needed to the context of this blog post.


Transaction between 3DS Server and Bank for the Challenge through
POST request.  It is shown as CReq and CRes.


The Sequence of Interactions
  1. I initiate the payment on merchant website or app to buy the ferry ticket.
  2. The merchant creates a transaction id for tracking.  Then, it hands over the rest to payment gateway.
  3. The payment gateway uses the 3DS way to carry this transaction.
  4. The 3DS initiates the Authentication Request (AReq) which will pass through the Directory Server.
    • The Directory Server reads the data in the request and forwards it to the right bank who have issued the card [or account] used in the payment.
  5. The bank receives the AReq.  
    • The bank decides should it give a challenge to the user to make the payment or just agree with the data passed in AReq
  6. Now, in this case the bank has decided to give the challenge to the user.  
    • The bank lets know the 3DS server through the response ARes.
  7. The challenge usually is to enter the OTP received over a SMS and authenticate.  I presume the same in this case.
  8. The user enters the OTP and POSTS the request.  Let us call this request as CReq.  The CReq is sent from from the user's browser or app to the bank.  The user interface on the browser or app at this point is handled by the payment gateway and not the merchant's web or app where the ticket is being brought.
    • This CReq will have,
      1. tdsServerTransactionId
      2. messageVersion
      3. messageType
      4. challengeWindowSize
      5. ascTransactionId
    • Note that, the CReq having the tdsServerTransactionId which is generated by the payment gateway.  tdsServerTransactionId is required in the CReq and expected by the bank.
  9. On receiving, the CReq, the bank processes the request.  The bank responds back to the 3DS server; let us call it CRes.
  10. The CRes will have,
    1. tdsServerTransactionId
    2. acsTransactionId
    3. challengeCompletionId
    4. transStatus
    5. messageType
    6. messageVersion
  11. If all goes as expected, the authorization will be given for the payment.  
    • If not, the bank responds to the payment gateway (3DS) that there is a problem and the message.
      • It is the payment gateway which has to read this problem and message.
      • The payment gateway has to give the better contextual information through its interface to the merchant who is consuming the service.
      • This is key!
      • Hope you found the spot here!


The Code 201 


Maybe, we test engineers presume the error code 201 returned is HTTP Status Code.  Ask your team what is this code and what is the purpose of returning in the JSON.

Back to 201, here.  It is not a HTTP Status Code.  It is neither a custom developed error code by payment gateway nor merchant who is selling the ferry ticket.

This is a payment domain specific code used in the payment technology.  Yeah!  The code to tell the payment gateway and merchant what happened with the initiated payment transaction.  

To be particular and more specific to the context, the 201 here is a 3DS APIs response code.


The 3DS Error Code and its Description.
The credit of this image is to developer.ravelin.com

The above image is shared here so that you know the error codes are available in the payment context.  I'm aware of 3DS protocol hence I could relate and understand the error code 201.  Refer to the References section at the end of this blog post.

The payment gateway has to read this error code. Then provide the right contextual message and instruction to merchant saying what to do incase of payment failure.

Here is the catch.  Not all merchant develops a mechanism to handle this.  Instead, it is left to payment gateway service provider and depend on it.  

Do the merchant bother about handling it at their end? 
  • As a merchant, I can switch to another payment gateway tomorrow.  
  • Why should I invest in building, developing and maintaining this as a part of my business when I'm paying another business to do it for me?  
  • Isn't that senseful business question or decision?
Now, you tell me who is expected to handle this 201 error? Who should let me know on the merchant web user interface what to do?



The Case and My Initial Observations


On submitting the challenge given by the bank, the below details is shown to the user.  


Pic used from the post of Liudas.
It is the CRes returned from Bank to 3DS Server at Payment Gateway.

My Observations
  1. Looks like the ticket being bought and payment made was in Turkey.
  2. The payment gateway involved to complete the payment transaction is Payten.
  3. The user is tryin to buy the ferry tickets and is making online payment. This error is seen on submitting the challenge given by the bank,
  4. Looks like the card is used for the payment.
  5. In the LinkedIn post, I read, multiple attempts were made to make payment, and the same message is seen.
  6. The Error Code returned is 201.
    • Which means, a mandatory field is missing and it is a violation.
  7. Error Component is S, which probably means 3DS server.
    • The component that raised the error.
  8. There is an error which is not handled.
    • The error reads as Unexpected error.
    • There is an id to track it.  That is, for auditing and tracking by the payment gateway.
    • This id do not have anything to do with the bank and merchant who is selling the ferry tickets.
  9. The message has a version that reads as 2.1.0
    • This means, 3DS protocol used is of version 2.1.0
    • This is communicated by 3DS server to the Directory Server and Bank Server, so that they use the same version in the contract.
      • If the same version is not available at the bank server side, the lower version will be used.
  10. Error description reads, the TdsServerTransID is not received in the CRes.
    • The 3DS component is raising this error.
    • TdsServerTransID means ThreeDS Server Transaction Id
    • 3DS means Three-Domain Secure, which is a security protocol used for online card transactions.
  11. This means, the response from bank is, it cannot fulfil the payment request.
    • Because, the request from the 3DS layer do not have TdsServerTransID.
    • Wait!
      • Do I actually know if the CReq had the TdsServerTransID in the payload?
      • This is critical to know.
    • This is important to know.
      • The AReq (Authentication Request) will have the TdsServerTransID in its payload.
      • That means, this was present earlier.  So, the CReq is fired from the 3DS server on receiving the ARes.
      • The bank received TdsServerTransID and said it is giving a challenge  before proceeding to authorization for payment.
        • This response (ARes) from the bank will have TdsServerTransID.
          • Along with this, the below are also present as a correlation ID to track this transaction.
            • dsTransID
              • The Directory Server ID to track this transaction
            • acsTransID
              • The bank server ID to track this transaction
      • From here, the CReq payload requires the TdsServerTransID and acsTransID
        • Where did the TdsServerTransID go missing now?  How?  Why?
        • This needs to be test investigated.  I have shared possible causes in the next section.

The TransactionID of the merchant who is selling the ferry ticket is different from the TdsServerTransID.
    • The TdsServerTransID is needed for the bank and Payment Gateway to complete the transaction.
    • Without this TdsServerTransId, the bank cannot proceed to authorize the payment.

All this is happening between the payment gateway [3DS server] and the Bank.  The user and the merchant website is not in the picture here.



Why Did It Happen?


It happened because, the CReq from the 3DS Server of payment gateway did not have that TdsServerTransID.  That is what the CRes from the bank says.  

Note that, I presume, the bank systems and services are functioning and serving its customer at this point in time.  This is an analyzed assumption I'm making here and I'm aware of this assumption I have made.

What made the CReq to miss the TdsServerTransID?
  1. I read, the user made multiple attempts to initiate and make the payment.
  2. I have multiple perspectives and interpretations with hypotheses to say this could have lead to miss the TdsServerTransID.
    • Talking about each hypotheses in detail is not the scope of this blog post.
    • But, here are some of the possibilities that could have led to this situation.
      • A glitch at payment gateway at that point in time.
      • Someone is breaching in the middle and tweaking the request and response over the network.
      • Device and hardware
      • Geo location
      • Network and traffic
      • Timeouts
      • Latency
      • Storage running out
      • Configuration gone wrong
      • Caching and missing -- not persisting
      • Intermediate and dependency services clogged
      • A new release deployed and updated
      • Downtime in the bank's services
      • Intermittent bank's services
      • Mismatching 3DS protocol versions that is not supported and accepted at either ends
      • And, more!
In simple, it was well handled by the bank.  I assume, the amount is not debited from the user's account.  This is equally important and the LinkedIn post do not share about it.

I see the bank's service have worked well in this context and done what is expected out of it.  The thumb rule is, when there is discrepancy in the data expected and received, do not authorize the payment request; abort it

Can the service do much more if the TdsServerTransID is missed in the payload of AReq?

Me as a test engineer in the payment gateway engineering team, I will test for the below in minimal and as a must,
  1. AReq with no TdsServerTransID and observe how what happens!
  2. Do the 3DS server still fire a AReq to the bank with no TdsServerTransID?
    1. If yes, then that is a problem!  It should be handled here.
    2. This problem can be prevented!
    3. I will ask my team why are we encouraging such request?
      • This increases our customer support cost and operations time in responding to our clients for using our payment gateway.
      • And, merchant can lose the business because of our payment gateway.
  3. What should 3DS server do when initiated AReq is missing the TdsServerTransID?
    1. What all other data in the transaction and session should be retrained and intact?
    2. Will these data change with a fresh TdsServerTransID created?
  4. I will explore and figure out what factors caused the 3DS to lose the user's authentication.


Who Created The Problem?

  1. From the error shared, I see,
    •  It is 3DS server who created this problem presuming the bank and its response is right.  


Is This An API Problem?

  1. No, it is not a API problem.
  2. If we call it as a API problem, we have not understood what is an API.


Whose Problem It Is?

  1. It is the problem of the service that fired the CReq.
    • Because, it fired CReq without the TdsServerTransID which is mandatory key-value in the payload.  
    • TdsServerTransID value cannot be null nor empty in JSON.
It appears to be a business logic problem of the service at 3DS server side for firing CReq with no TdsServerTransID.



How It Can Be Prevented?

  1. By making sure CReq will always have a distinct TdsServerTransID.
  2. If no distinct TdsSevrerTransID in a fresh CReq being fired, abort the request.
    • Create the distinct TdsServerTransID and then construct the request payload before firing the CReq.
To make sure, payment gateway [3DS server] preserves the TdsServerTransID, dsTransID, and acsTransID of the session.  If any one of this not matching at any of the components during the transaction, aborting the transaction is the best and right action to do.  

Can you recall your daily life experience where you are said to not refresh or click on the browser's back button when you have initiated the payment transaction on web or mobile app?  This is the reason!  To preserve the data of the transactions in the session between these systems -- merchant web or app, payment gateway [3DS server], Directory Server and Bank Server.

This is an automation candidate.  It has to be part of the daily test runs in the automation.



Who Should Be Fixing It?

  1. In my opinion for today upon analysis and assumption I have made, it should be fixed by the payment gateway.
  2. The payment gateway has to interpret the 3DS error code returned by the bank.  
    • And, then initiate the appropriate action as the payment transaction has failed. 
  3. A new transaction between the 3DS and bank has to be started for that merchant's order.  
    • If it cannot happen, the payment gateway should abort all the current open session tied to that merchant's order.
    • And, let know the user what is happening, and then direct the user.
  4. The payment gateway can read the ARes and CRes.
    • That means, the payment gateway can read the HTTP Status Code of ARes and CRes.
    • If there is an error code in ARes and CRes,
      • Then, the gateway should assert for the 3DS error code along with the HTTP Status Code.
      • For example, 
        • Say, the HTTP Status Code is 400 and 3DS Error Code 201 in CRes.
          • Assert for these two and direct with appropriate contextual message and direction.
          • In general, this is how custom developed error codes are handled by the client on receiving it from the services.
Note that, it is Error Code and not the Response Code.  The two are different.  And, the HTTP Status code is different from these two.



Why It Is Not An API Problem?


API is an interface which exposes the available services to the consumer.

It is the services which collects data and build the request payload, and expects the payload to process. 

This request will pass through an interface which is opened to the consumer.  This interface is called an API -- Application Programming Interface.

Analogy,
  • The car has gear stick to switch the gear.
    • This is an interface to the car's gear system.
    • The driver will use the car's gear system through this gear stick.
  • The gear box in the car is a service.
    • The gear box adjusts and responds by switching to the gear per the driver's input.
    • The different operations [business logic] provided by the gear box services are,
      • Reverse
      • Parking
      • Neutral
      • Switching the gear and assisting other components to speed up or speed down the car.

Can the interface have problem?  Yes, it can have.  In that case, the service will not be available to serve or to discover.  
  • Like, if the gear stick has problem, I cannot use the gears of the car, but, the gear box can be fully functional and in working condition.  
  • It is just the interface [gear stick] having the problem and as a result the consumer [driver] is unable to use the services [gear box].

In this case, the payment gateway could fire CReq to the bank.  That is, the API exposing this service is functioning.  It is the problem in a service.  It is the service that has missed to ensure and mandate the presence of TdsServerTransID -- a business logic problem.

In simple, we Software Engineers [including me] use the term API vaguely and with no sense of what it means.  This is my observation.  Further, the Software Testers have tossed this term API in all possible ways and learning it in incorrect ways.  I have no doubt in it when I say this.

Next time, when you say it is a API problem, rethink on what you are saying.  Is it an API or the service(s) which is accessed through that API?  It is useful when we describing the behavior of the system and its layer.  

I cannot say it is a gear box problem for the gear stick not [usable] working, and vice versa.  You see that?




To stop here,
  1. Testing skills and testing will help when it is collaborated with awareness and skills of the tech stack used in building the software system.
  2. Testing skills, programming and tech skills are not enough!
    • The domain skills and awareness is essential and critical.
    • If one is aware of the domain where the problem is observed, then the 201 will not be read as HTTP status code.
  3. Build the domain skills and maintaining the knowledge base as a GitHub project helps in a longer run.
  4. Interfaces and its gateways should not be overloaded with additional responsibilities to make sure the mandatory key-value is present.
    • If did so, one should learn why. Because, it is an anti-pattern and not a suggested software engineering practice.
  5. It is the services that has problems most times.  
    • The interfaces and its gateway will have the discovery, orchestration and traffic problems along with the risks of security.


References:
  1. https://httpstatuses.io/
  2. https://developer.elavon.com/products/3dsecure2/v1/api-reference
  3. https://developer.ravelin.com/psp/api/endpoints/3d-secure/errors/
  4. https://developer.elavon.com/products/3dsecure2/v1/3ds-error-codes
  5. threeDSServerTransID  (in our case the JSON reads as TdsServerTransID)
    • 3DS Server Transaction ID
    • Universally unique transaction identifier assigned by the 3DS server to identify a single transaction.
    • The 3DS server auto-populates and appends this filed value in the authentication request (AReq) it sends to bank in addition to the data you send.
    • This value can also be found by a lookup in the response received by this service -- /3ds2/lookup


Saturday, July 12, 2025

Empathy - Missing in Engineers. Then, Why Think Like a User?


I did not know about the word 'empathy' in English.  I heard it for first time 15 years back.  This does not mean, I did not have or express or share the empathy.  

We engineers talk about empathy to the users when building the software.  That is, think like a user.  If you are a test engineer, testing, then you should have heard this.  And, you should have asked yourself -- "How will the user feel with this?".  Did you?

When you write a bug report or any test report, you think and ask yourself, "Can this be easily understood by the reader?".  Won't you?

You are having the thought and an emotion of empathy in that thought and question!



How's the Empathy of Engineers for Engineers?

Anytime, you asked these questions to you?  What do your voice say?

  1. Do I truly listen to my colleague and peers?
  2. Do I document?
    • How well and lean I document so that fellow engineers can use it?
    • Can they use it in my absence?  Does it serve?
    • For example, 
      • How well I name the variables, so that, someone who is reading my code can make sense out of it?  This is empathy to your colleagues.
      • How well I write the test engineering related testware, so that, it serves my fellow test engineers and others?  This is empathy to your fellow testers.
        • Before thinking like a user, think -- How my fellow testers feel for using your work and communication?
  3. Do I consider how my words and behavior affect the people I work and collaborate with?
  4. Do I listen and respect my colleagues and peers view while I discuss and share my perspective when it differs?
    • Am I ready to accept the same treatment and communication that I share with others?
    • Do I share and communicate my perspective and intent for first?
  5. How do I interpret other person's understanding and thought about a subject, work and practice?


If you are a programmer
Ask yourself when and how you respected your fellow programmer and testers thoughts and work?

If you are a tester
Ask yourself when and how you respected and listened to your fellow testers word and work?

If you are a manager or director or VP or CxO role person for engineers
Ask yourself, when was the last time you listened and tried to understand the words and thoughts of your programmers, testers and other roles?  How did you listen and communicate?
  • How did you help your teams and a team member to understand your point?
  • Did you consider one of your team members as your team?
  • Did you say anytime, she/he knows better than you, to any of your team member?
    • Did this end or impact the communication thereon between you two?
  • Further, ask yourself and list out the incidents that you could have avoided a situation and communicated well with dignity?

It is easy to say the slogan, "Think like a user".  
It is much easier to say this to your testing team.
But, it is not easy to think from the point of your colleague or peer.
Why?
What stops you?



Empathy Starts Within - A Message to Engineering Leaders


As leaders, we often say,
"Think like the user."
It is a powerful principle.

But, here is the question.
"Do you think like your engineers?"
You hired your engineers by conversing with them.  What are your efforts to make your engineers understand your thinking?  What's your principle on this?

If you cannot think and respond to the point views of your colleague or peer or team, how can you think and build the software system for a user?  Yet, say from decades, "Think like a user"!

Let me ask this.  Anytime, you tried to impose or overrule the testing team's thoughts, pain, voice and perspectives?  Or, did you talk to them to understand it and then shared why it should be the way you are deciding?  
Did you think as your test engineers and test team?  And, you want them to think like a user?  How?  

I'm not saying to agree and acknowledge to whatever the test team say and ask.  But, are you aware that, you work with test teams?  They are one among who consumes your decisions and work to deliver as a team.   What's your empathy to them?  Also reflect on what's your empathy to other teams?

If you cannot communicate well and help them understand your points and need of the business, then, how effectively as a team can we deliver value to the users who buys the service of the business?

When no empathy to fellow engineers on the team by an engineer, and, by the one who is heading the engineering teams, how can one have and express the empathy for the users who buys the service of the business?

Do you have the empathy for the engineers in your team?
Do you have the empathy for the teams you lead?
What is empathy?  What is your empathy?

I understand, in the business and with an executive role in business, empathy has scale.  What's your empathy scale for your engineers?



To end here for now, before the users, let your engineers and teams experience your empathy.

 


Tuesday, March 11, 2025

AMYQ: Approaching the Solution to Test Data Challenges of Shrini Kulkarni

 

In this post, I'm trying to reason in brief for the questions shared by Shrini Kulkarni in this AMYQ.  Here are the challenges/questions shared by Srini,

  • #1 challenge - setting up data in upstream systems to suite test cases that need to be run. There is AUT and there are upstream systems. In a corp setup -- individual teams are setup for each application. Hence getting another team to set some data in other system often encounters lots of manual effort and red-tapism
  • #2 Reserving test data created in AUT or upstream systems for specific team's use so that other teams do not change it.


My Understanding of These Challenges

  1. AUT is in place and it has upstream systems.
  2. There are multiple teams for each AUT in an org.
    • Say teams A, B, and C for sake of learning here.
  3. Team A has difficulty when wanting to create the Test Data in the space of Team B, and vice versa.
    • The difficult for Team A can be as,
      • Not permitted to create data
      • No awareness and understanding of Team B's system
      • Cannot make progress in testing unless there is data created for Team A
      • The created Test Data by Team B is not shared to Team A for multiple reasons as data pollution and corruption which disturbs their test cycles
      • And, more ...
  4. How to reserve Test Data created in Upstream system for use of a specific team?
    • How to make sure other team does not use it or edit it or delete it?
This is my understanding of Shrini's challenges.



Conway's Law and My Experiences

The Conway's Law says,

The architecture of a system will be determined by the communication and organization structures of the company.

Inverse of Conway's Law says,
The organizational structure of a company is determined by the architecture of its product.

The above stated both laws hold good for Microservices teams and the teams which consumes their services.  

I have experienced,
  • Each microservices teams creating their own set of test data.
  • These test data is not shared to any other team.
  • The other teams are not allowed to create data in their space.
  • No team is aware of what others are doing in tests and what they are using to test.
  • The teams work in silos.
This leads to friction, aggression and unhealthy communications between teams.  Now, where is the possibility of having the Test Data in upstream system which can be consumed to run tests by every other teams?

Do you see the statements of Conway's Law and Inverse of Conway's Law here in how these teams are setup, orchestrated and communicates in building one product?



What's the Problem?

Here is my understanding of the Problem Statement from the challenges shared by Shrini,

How to setup Test Data in upstream system, to suite the test cases, that need to be run?  And, how to ensure Test Data created by one team is not modified by other teams?

If observed, the outcome described in previous section is not a technical problem.  

It is the collaboration and communication problem, that comes up in presumption -- other teams using one's test data it affects their team's work and delivery.

Yes, it can impact badly if the creation, editing and managing the test data is not done attentively by teams when everyone shares and consumes it.  So the respective team's engineering managers and directors show the resistance to create data in their team's space as it impacts their pipeline and delivery.


How I Solved It?

In the contexts where I work, I have different upstream systems to which I'm supposed to interact and get the data.  Then, use these data to test the service offered by the product.

But, creating test data in other team's space is not allowed!

There are multiple solutions that I approached with and solved this problem in different organizations.  The below said approach worked with most clients and so I share it here.  It is like a Design Pattern; I can use the structure of it in multiple teams to solve the similar problems.

I came up with an approach to create a suite having the endpoints of all these upstream systems.  And, I need not say how difficult and pain it was to get the endpoints and its details from teams.  It was a marathon circus of me to get it!

Here is how I approached the solution,

  1. I understood my teams were comfortable with Postman.  
  2. For this problem solving, I see Postman was simple and quicker than writing a automation project with libraries like RestAssured and request.
    • Team can run the Postman Collection from their system in quick time.
  3. I created an Test Data Inventory
    • The inventory has the Test Data which the test teams engineered for their testing and automation
    • These data is not just for functional testing; it had Test Data for security, performance and accessibility too.
  4. I built the Postman Collection having the endpoints of these upstream systems.
    • The collection is version maintained like any other code in the organization.
    • The meaningful commits are being made and pushed, as and when the need comes up.
  5. I crafted an environment file and wrote the scripts which places a variable having the stamp
    • This stamp tells, it is a test data created by automation run for this version of regression cycle and from this team.
    • The stamp is max of 10 characters and it is dynamic based on that run to avoid duplicate data.
      • I said it is dynamic and not the random!
      • This dynamic characters has meaning and an intent.
      • Note, when I say dynamic it is created for that iteration of Test Data
  6. These created test data are used by team eventually not touching the test data of other teams.
This let us move further with our tests and automation in the pipeline.   Note that, I have multiple upstream systems to which the AUT speaks.  Not disturbing them is a challenge!

I maintain the inventory of test data having versions of test data being crafted sprint-on-sprint.  We pull the test data of different versions from the inventory as and when it is needed for the intent of test and automation.

All that said, the team continues to work in silos with no much collaboration.  Who has to solve this?  Though I can solve this, my role and pay grade does not allow me to do so effectively.  I have solved within my teams and not to the organization level.  It is a engineering and culture problem which has to be addressed at the organization level.



To summarize, 
  1. The test data creation and its inventory management having versions of test data is an Engineering culture of an organization.
    • It has to be taken in consciously and help the teams move swiftly in building the resilient and usable systems.
  2. Test Data and its engineering is a project within an engineering project.
    • No wonder if any LLMs offers a business solution solely on Test Data in coming days!
    • Test the data offered by such LLMs before consuming it.
  3. Practice in the space of test data and testing the test data.
  4. There is Data Coverage in the engineering like it has Test Coverage.
    • What's your Data Coverage?



Tuesday, January 7, 2025

Is Software Testing a Cost to Business?


How quick one gets to have the awareness of the cost and value in the trade, it will be of help.  I consistently exercise the below questions in my practice.

  • What are the Values added from my work to the business?
    • How do I benefit from this?
  • What are the Values removed from my work to the business?
    • What does it bring to me?
  • Who is not getting benefited from the Values I'm adding and why?
    • What are the loses and its impact?
    • What are the benefits and gains we are losing?
  • Who is getting benefitted from the Values I'm adding and why?
    • What are the benefits and gains we are making?
  • What are the Costs added from my work to the business?
    • What are the impacts?
    • How does it impact me?
    • How does it benefit me?
  • What are the Costs removed from my work to the business?
    • How does it benefit me and what do I gain?
    • How does it benefit the business and others?  What do they gain?
  • Who are all experiencing the Costs from my work and why?
    • What are the pains from these costs?
  • Who are not having any Costs from my work and why?
    • Do they need anything from my work and its value add?

This exercise will help me to know the expectations of stakeholders and business. I align myself to deliver the expectations as I exercise these questions.  

The cost and value from my software test engineering work will have impact and influences my growth with the benefits that I will make.



Is Software Testing a Cost to Business?


I witness the discussion which says testing is a cost in software engineering business.  But, I do not hear the same statement on development.  This made me to think, why?

Here are my understanding on thinking over it for now.
  1. Software development is not about just programming.
    1. When said development it includes every teams and their work.
    2. Programming and Testing are parallel activities in the software engineering which helps each other.
    3. If testing is a cost, then for sure, programming is also a cost!  It is an associated activity.
  2. In business, every activity and investment on them is a cost in multiple ways.
    1. These are evaluated costs and being taken for a purpose in expectation of returns.
  3. Have I come across anyone saying Automation in Testing is a cost, like I hear for testing?
    1. It is seen as investment and necessity. Why?
      1. May be, because, in a belief if [once] automated, that is sufficient enough it goes on for itself without human intervention.
      2. If this is the thought, do you think we are doing it wrong?
  4. Have I seen the discussion and statements which concludes testing for performance and security is cost?
    1. If yes, how?
    2. If no, why these two are not seen as cost?
  5. Serving and well maintained engineering system is a trade-off
    1. I see, we engineers and business choose what to trade-off
      1. This decision can be on logical and non-logical basis; but, there will be a trade-off
      2. Each trade-off has costs and values
      3. What should I trade-off in the software test engineering and why?
      4. What should I trade-off in the software engineering that we are doing as a business?  Why?
  6. When one is offered a offer letter from a business, there is a CTC mention.
    1. CTC means Cost to Business
      1. Programmer has a CTC
      2. Tester has a CTC
      3. Every others in different teams have a CTC including the CxOs
      4. We all [and our job] are cost to a business, who can add value and get the high returns


To end for now, I learn, everything and anything business picks up is a cost.  Because, there is an investment on it.  If there is no investment, is there anything happening or progressing and changing?

I do not want to get into discussions which says software testing is a cost.  I see such discussion is lacking in understanding the software engineering and intrinsic for first.  Software Development is not all about programming; the programming is one small part of it.

There are different teams that work together in collaboration to develop and consistently delivering the usable software systems.  In this process, everything in software engineering business is an invested and evaluated cost in the interest of returns and values that are expected.

It is time to know and ask "What way of testing and its outcome is a cost?" than saying and listening to -- 'testing is a cost', which is nonsensical.  This holds good for any activity in software engineering.  

Any serving engineering marvel is a calculated and evaluated trade-off.

Monday, December 30, 2024

Testing Debt -- It Exists and Hits Every Day in All Environments


As an engineer on the team, I see the discussion and short conversations on Technical Debt.  No matter what, there will be a Technical Debt in the software system we build and deliver.

Likewise, when there is a Technical Debt, there will be a Testing Debt for sure.  Identifying and learning the magnitude and impact of the Testing Debt is part of my job.


My Understanding of Technical Debt


What is a Technical Debt?

  • It is the cost of additional rework caused by choosing the quickest solution rather than the effective solution.
  • Making decisions based on speed above all else is one factor that leads to the Technical Debt.
  • Technical Debt has pros and cons.
  • The unintended Technical Debt will not come to notice immediately or sooner.
    • This is one of the challenges we have to deal with.
      • Because intentional technical debt is what we are aware of from the decision made.
      • The impact of unintended technical debt has to be managed by every team involved in the software development.



Starting With The Testing Debt


I don't know if the term Testing Debt exist in the industry!

I have been using the term "Testing Debt" for the last 9 years in my practice.
  • I use this term to share and keep stakeholders informed about the rework which we have to do as a result of Technical Debt.  
  • On reading, what is Technical Debt, can we relate to what is Testing Debt?
  • Most times, the testability, automatability and observability characteristics of a system will be affected as a consequence of Technical Debt.
  • Further, to compound the cascading effect, the Testing Debt will come in.
    • One of the major impact due to the Testing Debt is the change in the Deterministic capability seeded with a test
      • Reworking on this deterministic capability is not simple and straight in all cases for the change introduced in the business operations, tech layer and infrastructure.

When I say, there is a Technical Debt arising from what we are doing and delivering, it also means, there is a Testing Debt that is created as a consequence.
  • To speak in terms of engineering, the testing is part of technical activity.
  • It looks funny to me when the one portrays the testing team as non-technical and label with terms.
    • For example, manual, repeatable, repetitive, etc.  

It is a cycle and repeatability to an extent.  Repeatability is one part in the engineering cycle and process.




Do You Know Your Testing Debt?


To remind you and me, the testing is sampling!  How do you sample when you have a growing Technical Debt and Testing Debt?

I will share one of the common Testing Debt which we everyone of us will have and I assume so.  

The ask for in-sprint automation and automate everything.  Is this a fair ask and practical?  This is not a line to discuss here.  But, I will tell you how we testing team will be hit by the Testing Debt in delivering for this expectation set by business and stakeholders.

The Technical Debt leads to rework sooner or later in the engineering and it will lead to major rework in the Test Engineering.  Isn't it?  

Say you have worked to build the testing infrastructure, regression suite, automation suite and integrated with the pipeline.  Now, there is a rework and change in the system as a fix for few Technical Debts.  
  • Does this effect your testing infrastructure, regression suite and automation suite that you have in place?  
    • Do we have [or given] enough time and resources for Testing Team to work and fix this and keep the pipeline running seamlessly?
  • This is not just about time and resources!
    • For the change in tech stack and engineering of the system, the Test Engineering has to adapt to it and challenge it.
      • If the Test Engineering does not challenge the engineering of the software system, we are not in a position to see the perspectives of risks and problems
      • Eventually, we will not even be in place to learn what are the no-risk perspectives and aspects of the system
  • How do I manage this so that I can turn around quickly to the context and keep the pipeline smooth?  
    • This is a challenge to every Test Engineer who faces the effect of Technical Debt.
We Test Engineers witness and experience the Technical Debt which also includes Testing Debt in different forms and intensity.  

Wait!  As you read this blog post, did you try to think of the intended Testing Debt in your project?

What are the unintended Testing Debt and its impact?
  • This is something not easy to identify and learn, while we can identify and learn the unintended Technical Debt to some extent.
    • This will exist; it will impact and hit all the environments, everyday!



Testing Debt and Test Engineer


Note this, building a Test Engineering solutions to withstand the effects and changes from a Technical Debt is a skill.  It is possible.
  • For this, the Technical Debt should not be damaging the deterministic attribute which is seeded to a test.
  • Picking and integrating such a deterministic layer within the layer of testability, automatability and observability is a mark of a skilled Test Engineer and Test Engineering.
Technical Debt and Testing Debt are not the same.  But, the Testing Debt is the outcome of a Technical Debt and has a relation to it.  

Testing does not drive the change in how the system is implemented; or, I have not experienced it so far.  Wait!  The outcome of the testing can and has changed how the system is implemented.  These two sentences are not the same.



To end here, how do I manage myself with all these debts?  We are also in the job where we have to grow together by talking, negotiating and solving the debts that are the outcome of decisions from business and stakeholders.



Sunday, November 3, 2024

Logarithm and the Expression of an Algorithm


I got to know about Logarithm in my High School.  But, I never knew it would be part of an engineer's life.  As a Software Test Engineer, I encounter the discussion that involves Logarithm often when testing an algorithm that is designed and implemented to solve a problem.


What is Logarithm?

I found this definition in Math is Fun. And, I see this is simple and straight.

How many of one number multiply together to make another number?


For example, how many of 4s make 64? 4 x 4 x 4 = 64.  
That is, when multiplied 3 of the 4s, I get 64.  Therefore, the logarithm is 3.

It is represented as below and said as "the logarithm of 64 with base 4 is 3".
  • log4(64) = 3

The same can be written or expressed as exponent in Math.  That is, 43 = 64.  The exponent and Logarithms are related.  The logarithm tells us what is the exponent.  

The logarithm answers the question -- What exponent do we need for the chosen base to get the given number?


John Napier introduced Logarithms in 1614 as means to simplify the calculations.


What Logarithm?

In the Computer Science, we use the Common Logarithm.  In my practice so far, when talking about the algorithms, I have been using Common Logarithm which is denoted using log.

The other type of Logarithm is, the Natural Logarithm. This is denoted using ln.


What is the Base?

One of the questions which is asked in discussion of an algorithm's analysis is what is the base?  This leads to question, What is the base that we should consider in Computer Science?

In my so far experience, I see, it depends on the context of an algorithm that is under evaluation and if it is of logarithmic nature in the time complexity and growth rate.

In the context of Computer Science, I see, the base do not matter much when equating the right hand side to the left hand side calculation.  But, one can pick the base that suits to context when needs to express the relationship of an algorithm under evaluation. The base need not be always 2 here.


Note
: I understand below from the peer discussions for the logarithmic base in Computer Science:

The asymptotic notation focuses on growth rate and ignores the constant factors. Any two logarithms differ by a constant factor, and the base makes no difference.  Hence, I do not see base specified for a logarithm when using asymptotic notation.  That said, when I have a logarithm with a constant base, it is okay to not specify base.

When the base of logarithm depends on parameter (an input or any external configuration) to the algorithm and which is not constant, it is a better practice to mention the base.


In my testing experience for an algorithm's functionality, performance, complexity and growth rate, I see, the engineers keep the input parameters as configurable.  That is, it is kept as a constant which can be changed based on the need basis for the context.  So the base is not mentioned in the logarithmic expression for an algorithm most times.


If you see, my understanding is not appropriate in context of logarithmic asymptotic notation, do share your thoughts as comments to this blog post.  I will be happy to introspect and learn.



References:

  • https://www.mathsisfun.com/algebra/logarithms.html


Friday, July 26, 2024

My First Hand Analysis of CrowdStrike Falcon Update Incident


I attempted to analyze the process dump of CrowdStrike shared by my friend.  He said, there could be an attack which is leading to crash of Windows OS globally.  This made me curious to look into the dump and learn.

I had no much context around it, but, a test engineer in me did not sit quite.  I started to analyze the dump information.  Here is my first hand analysis that I made on 19th July 2024 post 10:30 AM IST.


What I Saw?

  • It is a Windows OS's process dump.
  • Looks like something with C or C++ application reading how the memory offsets were in the dump.
  • It started to read a memory offset.
  • Then the process witnessed an exception.
    • Here the program could not read further
    • Why it could not read further from this offset?
      • My little experience of testing drivers on Windows OS for a card printer machine, refreshed and recalled what I had witnessed when testing.


Scratching and Striking My Mind


I started to ask these questions myself while I asked what could have gone wrong.  I could not stop here as I was curious what led Windows machine crash.  I referred to web and learn there was an update by CrowdStrike, and then this incident.

The bugs do exist in every software no matter the level and depth of testing, automation and engineering's excellence.  All software do crash and OS is not an exception to it.  But, what made the update to crash the Windows OS?  Pointing and blaming CrowdStrike or Microsoft is not a way for the practicing test engineer.  If these two organizations are serving its huge customer base, they have something working and reliable.  Engineering does not eliminate problems.

By now, I had a thought that it is not an attack.  It is a software bug!  Where is the bug?  What is the bug?  Was it not experienced in pipeline?


The Open Ended Questions


I had these questions as I analyzed and spoke to my friend.
  • What is Falcon?
  • What was this update to Falcon?
  • How frequently the updates are rolled out?
  • How the updates are rolled out globally?
  • What pipeline do they have in testing?
  • Who is impacted the most in business? Is it Microsoft or CrowdStrike?  Impacted in what way?
  • What is CrowdStrike?  What they do?  Who are the customers?
  • Where do the CrowdStrike's Falcon sit in the OS and what it does?
  • How CrowdStrike works in the machines and what it offers?
  • What do the dump say? Relook into it with different perspectives.
  • How this could have been prevented?
  • How will I prevent this if I join this team knowing this incident?
With these questions, I started to analyze the process dump which was shared.

I had more such questions, but these were the first few that I crossed as I started.



Analysis of Process Dump


My interpretation, tells me the below for today
  1. Accept that it is an incident as any other incident which I witness in production environment.
  2. Do not fall to the speculation happening around.  Remain calm and focus to interpret and understand your exploration.
  3. I see, if it can start to read from an offset and then ending to experience a non-existent or invalid offset, is it a NULL Pointer?
    • What is NULL Pointer?
      • A NULL Pointer is a pointer that does NOT point to any memory location and hence does not hold the address of any variables.
      • If I do not initialize and assign, the pointer will have NULL as its value.
      • For example, int *test;
        • When I want to access the pointer test (a location in memory) pointing to, I will not be sure what is in the pointer when I read it.
          • I may not set it later or set it.
          • In this case, the code can tell if the pointer is valid or pointing to a garbage memory
        • But, if I declare it like int *test = NULL;
          • I can check if was set and initialized
        • It is a better practice to assign a NULL value to a pointer during initialization so that we can check if it is NULL or as any address assigned to it.
      • This understanding of Pointer makes me think, is it due not initializing a pointer and so the error code c0000005 on reading a memory that is not valid.
      • When we assign a NULL value to pointer, it is a null pointer in C++
        • We assign null value for testing and asserting
          • If the memory is allocated to a pointer or not
          • If it has a return address and is a valid one or not
          • If a pointer is not initialized, assigning null it prevents problems to certain extent
    • With this understanding, I also read, it started to read from an offset 0x9c, and then failing.
      • What is 0x9c?
        • In Octal it is 234. In Decimal it is 156.
        • Can there be such address in a computer's memory? I don't know.
        • If it is a access violation, then is it a memory which is in preemption of the OS?
          • If so the OS can terminate the program or process which is trying to access it.
          • Is this killing the process and aborting the operation of Falcon's IPC and eventually Windows coming to BSOD?
      • This tells me it is not a NULL Pointer in first case but not initializing a pointer to NULL.
        • I infer, if the pointer was assigned to NULL, that is initialized, there could have been some hint in the state and event when accessing the memory.
          • This is my analysis; but, I have not seen the test code nor aware of the product.  All this inference is based on the process dump and my experience of testing drivers.
      • It got something in between from update (a config or pattern?) for which it cannot find and read in the memory?  Why?
        • This indicates me, it could be a bug, that is, a logical problem.  This is my hunch for today!
  4. Data in the dump
    • Exception Address
    • Read from Address 0x9c
    • Exception Code: c0000005 (Access violation)

Testing my Interpretations


CrowdStrike as an org when it caters its SAAS to such a customer base, won't it have a testing pipeline
  • It will have, I have no doubt in it.  They test and roll out the updates, I believe in it.

Did they witness any such incidents earlier?
  • I searched on web for it and I did not find something similar on the Windows, earlier.

Is this a NULL Pointer?  Are you sure?
  • No, I'm not sure.  But, there is something that is leading it to address which does not exist or which is invalid?  I will have to wait for their RCA to know technically what caused this.  But this is my understanding reading the dump.

How do you think it is a memory access problem?
  • The error code 0xc0000005 says that.
  • I referred to driver easy website for the information because my experience of testing the drivers for Windows OS and experiencing such incidents led me there.  This is what I learn:
    • https://www.drivereasy.com/knowledge/solved-how-to-fix-0xc0000005-error/

Do you think the programmer would not have handled the obvious Pointer and NULL initialization?
  • I believe there will be a check for Pointer and what it is pointing to.  But is it due to no initialization?  Technically this has to be analyzed which I cannot do.  I will have to wait for CrowdStrike team to share the tech details.

Is this a driver problem that killed the Windows kernel?
  • I don't know.  But, the .sys file will not have driver as per my learning.  It will have information about the drivers and any configurations.
  • This incident is a problem, which impacted both CrowdStrike and Microsoft.  Maybe, both will have their areas to look and fix it they see so.  But, in this context, CrowdStrike can fix it quicker and that is much better -- is what I understand.
  • I'm a Windows user for long time.  I see, Windows has worked well to all my contexts so far.  The Engineers of Windows OS knows better than me here.  I'm not well aware and informed as they are.
  • CrowdStrike's engineering team are skilled and they are rolling out updates often in a day.  They have a better pipeline when this is being done.
    • But, the question I have is, how did this happen?
    • No one lets such problem into production when they are aware of it.  Do you?
    • There is something that has not come to their observation and experience.  What is that?
    • Knowing this will help to prevent this and similar incidents happening in future.
      • I'm waiting to know what did not come to their experience and led to this incident.

What could be in the .sys file of CrowdStrike?
  • I don't know!  I want to learn that.
  • But, from my testing of .sys file and drivers on Windows OS, I learn there could be a configuration details with certain pattern or information to capture at run time, and help the installed software to run.  This is my learning and awareness from my testing.
  • That said, testing at OS level and Anti Virus engines are not obvious.  Testing of drivers is like the risky mines.  What is sufficient and good enough in test coverage?  It needs an expertise at OS internals level.
  • Windows OS having such a fragmentation in its versions, updates and patches, it is a battle field and mines for engineers building such solutions for sure!
  • I learn, the Windows OS stopped when an application tried to access the invalid region or non-existent memory.
    • The update which was rolled out, did it have a configuration or a pattern that showed a logical problem when processing it?
    • I have such questions and thoughts that are striking my mind as I think and build a problem model for the same.

Is this a race condition incident?
  • I see, it is not a race condition incident as users across globe experienced it.

Is this specific to a Falcon version, OS version and hardware?
  • Not all host machines would be on latest version of Falcon, is my presumption.
  • At least, n-1 and n-2 versions should be on host machine which experienced this behavior.
    • So it is not a Falcon version specific, I see.
  • It looks to me as it is not specific to the Windows OS version and hardware configuration.
    • It is an application software problem which occurred at driver level is what I see.
      • This is an IPC communication and process is my understanding.
        • The driver can receive the IPC communication in continuous mode.
        • At times, this can get queued based on the application and what it does.


Where is the Problem?


Well, I'm looking and pulling from my visualization by relating with my experience of testing the driver on Windows OS.  I don't know the exact reason or close enough to tell what could have gone wrong.

Reading the process dump, it says accessing a memory that does not exist or corrupted.  One of the high possibility is, the starting offset is seen but it is not helping when reading.
  • For example, Ravi has the address of India's Prime Minister house.
    • But, he does not know from where to start despite having the address.
    • He is void and null in knowing where to start and what to do when he is not initialized with the start location to begin the travel to the Prime Minister's house.
    • In short, he do not know where the address is pointing to and what it has, though he is given a address to start.
      • Can he access the Prime Minister's house premise without any access granted and authorized to do so?
      • If not, won't he be arrested by police or other security forces and stop him?

Do I Know the Precise Problem?


I don't know!  I do not know the CrowdStrike product and platform.  I'm waiting to read the technical details from Crowd Strike.

I see, it comes to the data, state and event.  I would focus on how to prevent it learning which data, state and event led to this behavior.  I think of figuring out the Test Design and Strategy that can help me to identify such use cases.  I focus here and see can it brought into the automation so that it gets exercised and regressed consistently.

If it is due to the memory access that had a problem, I did such tests when testing driver for a hardware machine on Windows OS.  I will share the tests that I did in upcoming blog.

I wrote the technical analysis from process dump to CrowdStrike and Microsoft.  I did not get a response.  Anyways, I'm sharing the overall information in a non-technical way so that it is consumable to most readers here.



Note: Here are another threads of me sharing my thoughts on same:
1. https://x.com/testingGarage/status/1814215089525821763?t=XSFdx69ElL0ZmBOcEFrTjg&s=19
2. https://www.linkedin.com/posts/ravisuriya_%3F%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F-%3F-activity-7221156949445206017-oeRa




Saturday, February 3, 2024

Performance Testing - What to Know Before User Behavior and Traffic Pattern?

 

This blog post is in series of 100 Days of Skilled Testing.  I see, I do not have to pick every questions asked in this series.  I pick and share to which I see, I can add value.

The twelfth question from the season two of 100 Days of Skilled Testing, is:

What strategies do you use to simulate realistic user behavior and traffic patterns when conducting performance tests?

The twelfth question asked is vague and it needs to be refined for preciseness to pick it up and continue.


The Question and the Gap

I see the below are missing in the above asked question:

  1. What aspect of performance is under evaluation?
  2. What is the system that is being evaluated for a performance's aspect?
  3. What part of the system is being evaluated for a performance's aspect?
    • Queuing? Messaging? Database I/O? Memory? Space? CPU? Client Performance? Functional Module?
  4. Who are the users?  What are their personas?
  5. How and where the users are accessing the system?
  6. What is the context of users accessing this system?
  7. What is the geo location of users who are accessing this system?
  8. How long these users are connected by accessing this system?
  9. Are there any differences among these users in their roles and privileges in accessing this system?
  10. Can the user access system through multiple interfaces?
  11. Are you assuming the user is on web browser and mobile apps to access this system?
  12. Is this system you are referring to, is a software system? Or any other system that is controlled environment like - access door, elevators, etc. ?
  13. You are asking to simulate the user behavior and traffic pattern.  Should I assume, I and you know or agree to any volume of user?  And, all these users are here for the same purpose when accessing the system?
  14. Are you considering any time or at a particular time when talking about the traffic pattern?
  15. Are there any unrealistic users who is accessing your system?  You say 'realistic user'.
    • Do you see that bots and non-human are also allowed as a user in your traffic?
  16. Have you evaluated this earlier in your system?
    • If yes, do you have the history and data for user behavior and traffic pattern?
    • If you don't have, do you allow to use or have your competitor's user behavior and traffic pattern data? 
  17. What is the tech stack of your system?
    • What part of your tech stack, you want to evaluate with this user behavior and traffic pattern?
  18. What is the architecture of your system?
  19. What part of your system and its architecture is being evaluated with this user behavior and traffic pattern?
  20. Are you running this exercise for the first time?  If not, where I can refer to previous exercises?
  21. How the interaction and events are handled from its start to completion?
    • What all are needed to complete the transaction in work flow?
    • How this transaction can go invalid for lack or incorrect data, state and action?
  22. What is spike, drop, saturation, expected, unexpected, and average numbers in the traffic coming in?
  23. What do you understand by traffic?  Do you mean number of requests coming in?
    • Do you mean the being committed I/O operations?
    • Do you mean the response received at the other end?
    • What is the definition of 'traffic' in this context?
  24. What is that you want to study and evaluate by the User Behavior and Traffic Pattern information gathered in this context?

Using the above questions, I will get an idea to proceed.

I will build a model from information I collect using above asked questions.  This model we will used to further in testing for a performance's aspect.  The value added to the performance test depends on this model as well.  To get a better model in context, it is useful to address the gaps.  From here, I start to think further.



What do you ask and look for when building a model for User Behavior and Traffic Pattern?



Performance Testing - The Unusual Ignorance in Practice & Culture

 

I'm continuing to share my experiences and learning for100 Days of Skilled Testing series.  I want to keep it short and as a mini blog posts.  If you see, the detailed insights and conversations needed, let us get in touch.


The ninth question from season two of  100 Days of Skilled Testing is

What are some common mistakes you see people making while doing performance testing?  How do they avoid it?


Mistakes or Ignorance?

It is mistake when I do an action though I'm aware that it is not right in the context.

I do not want to label what I share in this blog post as mistake.  But, I call it as ignorance despite having or not having the awareness, and the experience.

The ignorance said here is not just tied to the SDLC.  It is also tied to the organization's practice and culture that can create problems.

To this blog post's context, I categorize the ignorance in these categories -- Practitioner and Organization.

  1. Practitioner's ignorance
    • Not understanding the performance, performance engineering, and performance testing
      • When said performance testing, taking it as - "It is load testing"
      • No awareness on what is performance and performance engineering
        • Going to the tools immediately to solve the problem while not knowing what is the performance problem statement
      • Be it web, API, mobile or anything,
        • Going to one tool or tools and running tests
      • No much thinking on how to design the tests in the performance testing being done
      • Ignoring Math and Statistics, and its importance in Performance analysis
      • No idea on the system's architecture, and how it works
        • Why it is the way it is?
      • The idea of end-to-end is extended and used in testing for performance and having hard time to understand and interpret the performance data
        • How many end-to-end your tests have identified?
        • Can we test for performance to all these identified and unidentified end-to-end?
      • Relying on the resource/content in internet and applying or using it in one's context without understanding it
      • No idea on the tech stack and how to utilize the testability offered by it in evaluating the performance
      • Not using or asking for testability
      • Getting hung to most spoken and discussed 2 or 3 tools on the internet
      • Applying tools and calling out it as performance testing
      • No attempting to understand the infrastructure and resources
        • How it impacts and influences the performance evaluation and its data
      • Idea on Saturation of resources
        • Thinking it as a problem
        • Thinking it as not a problem
      • Not working to identify where will be the next bottleneck when solving a current bottleneck
      • What to measure?
      • How to measure?
      • When to measure?
      • What to look when measuring?
      • Not understanding the OS, Hardware resources, Tech Stacks, Libraries, Frameworks, Programming Language, CPU & Cores, Network, Orchestration, and more
      • Not knowing the tool and what it offers
        • I learn the tool everyday; today, it is not the same to me compared to yesterday
          • I discover something new that I was not aware of what it offered and exist
          • I learn the new ways of using the tool in different approaches
      • No story in the report with information/image that is self-describable to most who reads it
      • And, more; but the above said resonates with most of us
  2. Organization's ignorance
    • At the org level, for first and to start, it is ignorance in Performance Engineering
      • Ignoring the practice of performance engineering in what is built and deployed
      • Thinking and advocating, increasing the hardware resources will increase and better the performance
        • In fact, it will deteriorate over a period of time no matter how much the resources are scaled up and added
      • Ignoring the performance evaluation and its presence in CI-CD pipeline
      • The performance tests on CI-CD pipeline should not take beyond few minutes
        • What is that "few minutes"?
      • Not prioritizing the importance of having the requirements for Performance Engineering

Recently, I was asked a question - How to evaluate the login performance of a mobile app using a tool "x"?

In another case, I see, a controller having all HTTP requests made when using web browser. Running these requests and trying to learn the numbers using a tool.


I do not say this is wrong way of doing.  That is a start.

But, we should NOT stay here thinking this is a performance engineering and that is how to run tests for learning a performance aspect[s].


To end, the performance is not just - how [why, when, what, where] fast or slow?  If that is your definition, you are not wrong!  That is a start and good for start; but, do not stick on to it alone and call performance.   It is capability.  It is about getting what I want in the way I have been promised and I expect; this is contextual, subjective and relative.  The capability leads to an experience.  What is that experience experienced?

Sometimes, serving the requests by what you call as slow, is a performance.  What is slow, here?

The words fast and slow are subjective, contextual and relative.  It is one small part of performance engineering.

That said, let me know, what have you been ignoring and unaware in practice of Performance Engineering & Testing?